
Editorial Board

General Editor

Neil J. Salkind
University of Kansas

Associate Editors

Bruce B. Frey
University of Kansas
Donald M. Dougherty
University of Texas Health Science Center at San Antonio

Managing Editors

Kristin Rasmussen Teasdale


University of Kansas
Nathalie Hill-Kapturczak
University of Texas Health Science Center at San Antonio
Copyright © 2010 by SAGE Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the
publisher.
For information:
SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: order@sagepub.com

SAGE Publications Ltd.


1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.


B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.


33 Pekin Street #02-01
Far East Square
Singapore 048763

Printed in the United States of America.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of research design/edited by Neil J. Salkind.


v. cm.
Includes bibliographical references and index.
ISBN 978-1-4129-6127-1 (cloth)
1. Social sciences—Statistical methods—Encyclopedias. 2. Social sciences—Research—Methodology—Encyclopedias.
I. Salkind, Neil J.

HA29.E525 2010
001.403—dc22 2010001779

This book is printed on acid-free paper.

10   11   12   13   14   10   9   8   7   6   5   4   3   2   1

Publisher: Rolf A. Janke


Acquisitions Editor: Jim Brace-Thompson
Editorial Assistant: Michele Thompson
Developmental Editor: Carole Maurer
Reference Systems Coordinators: Leticia M. Gutierrez, Laura Notton
Production Editor: Kate Schroeder
Copy Editors: Bonnie Freeman, Liann Lech, Sheree Van Vreede
Typesetter: C&M Digitals (P) Ltd.
Proofreaders: Kristin Bergstad, Kevin Gleason, Sally Jaskold, Sandy Zilka
Indexer: Virgil Diodato
Cover Designer: Glenn Vogel
Marketing Manager: Amberlyn McKay
Contents

Volume 1
List of Entries   vii
Reader’s Guide   xiii
About the Editors   xix
Contributors   xxi
Introduction   xxix
Entries
A 1
B 57
C 111
D 321
E 399
F 471
G 519

Volume 2
List of Entries   vii
Entries
H 561
I 589
J 655
K 663
L 681
M 745
N 869
O 949
P 985

Volume 3
List of Entries   vii
Entries
Q 1149
R 1183
S 1295
T 1489
U 1583
V 1589
W 1611
Y 1645
Z 1653

Index   1675
List of Entries

Abstract Between-Subjects Design. See Cohort Design


Accuracy in Parameter Single-Subject Design; Collinearity
Estimation Within-Subjects Design Column Graph
Action Research Bias Completely Randomized
Adaptive Designs in Clinical Biased Estimator Design
Trials Bivariate Regression Computerized Adaptive Testing
Adjusted F Test. See Block Design Concomitant Variable
Greenhouse–Geisser Bonferroni Procedure Concurrent Validity
Correction Bootstrapping Confidence Intervals
Alternative Hypotheses Box-and-Whisker Plot Confirmatory Factor Analysis
American Educational Research b Parameter Confounding
Association Congruence
American Psychological Canonical Correlation Analysis Construct Validity
Association Style Case-Only Design Content Analysis
American Statistical Case Study Content Validity
Association Categorical Data Analysis Contrast Analysis
Analysis of Covariance Categorical Variable Control Group
(ANCOVA) Causal-Comparative Design Control Variables
Analysis of Variance (ANOVA) Cause and Effect Convenience Sampling
Animal Research Ceiling Effect “Convergent and Discriminant
Applied Research Central Limit Theorem Validation by the Multitrait–
A Priori Monte Central Tendency, Measures of Multimethod Matrix”
Carlo Simulation Change Scores Copula Functions
Aptitudes and Instructional Chi-Square Test Correction for Attenuation
Methods Classical Test Theory Correlation
Aptitude-Treatment Interaction Clinical Significance Correspondence Analysis
Assent Clinical Trial Correspondence Principle
Association, Measures of Cluster Sampling Covariate
Autocorrelation Coefficient Alpha C Parameter. See Guessing
“Coefficient Alpha and the Parameter
Bar Chart Internal Structure Criterion Problem
Bartlett’s Test of Tests” Criterion Validity
Barycentric Discriminant Coefficient of Concordance Criterion Variable
Analysis Coefficient of Variation Critical Difference
Bayes’s Theorem Coefficients of Correlation, Critical Theory
Behavior Analysis Design Alienation, and Determination Critical Thinking
Behrens–Fisher t′ Statistic Cohen’s d Statistic Critical Value
Bernoulli Distribution Cohen’s f Statistic Cronbach’s Alpha. See
Beta Cohen’s Kappa Coefficient Alpha


Crossover Design Experience Sampling Method Inclusion Criteria


Cross-Sectional Design Experimental Design Independent Variable
Cross-Validation Experimenter Expectancy Effect Inference: Deductive and
Cumulative Frequency Exploratory Data Analysis Inductive
Distribution Exploratory Factor Analysis Influence Statistics
  Ex Post Facto Study Influential Data Points
Databases External Validity Informed Consent
Data Cleaning   Instrumentation
Data Mining Face Validity Interaction
Data Snooping Factorial Design Internal Consistency Reliability
Debriefing Factor Loadings Internal Validity
Decision Rule False Positive Internet-Based Research Method
Declaration of Helsinki Falsifiability Interrater Reliability
Degrees of Freedom Field Study Interval Scale
Delphi Technique File Drawer Problem Intervention
Demographics Fisher’s Least Significant Interviewing
Dependent Variable Difference Test Intraclass Correlation
Descriptive Discriminant Fixed-Effects Models Item Analysis
Analysis Focus Group Item Response Theory
Descriptive Statistics Follow-Up Item-Test Correlation
Dichotomous Variable Frequency Distribution  
Differential Item Functioning Frequency Table Jackknife
Directional Hypothesis Friedman Test John Henry Effect
Discourse Analysis F Test  
Discriminant Analysis   Kolmogorov−Smirnov Test
Discussion Section Gain Scores, Analysis of KR-20
Dissertation Game Theory Krippendorff’s Alpha
Distribution Gauss–Markov Theorem Kruskal–Wallis Test
Disturbance Terms Generalizability Theory Kurtosis
Doctrine of Chances, The General Linear Model  
Double-Blind Procedure Graphical Display of Data L’Abbé Plot
Dummy Coding Greenhouse–Geisser Correction Laboratory Experiments
Duncan’s Multiple Range Test Grounded Theory Last Observation Carried
Dunnett’s Test Group-Sequential Designs in Forward
  Clinical Trials Latent Growth Modeling
Ecological Validity Growth Curve Latent Variable
Effect Coding Guessing Parameter Latin Square Design
Effect Size, Measures of Guttman Scaling Law of Large Numbers
Endogenous Variables   Least Squares, Methods of
Error Hawthorne Effect Levels of Measurement
Error Rates Heisenberg Effect Likelihood Ratio Statistic
Estimation Hierarchical Linear Modeling Likert Scaling
Eta-Squared Histogram Line Graph
Ethics in the Research Process Holm’s Sequential Bonferroni LISREL
Ethnography Procedure Literature Review
Evidence-Based Decision Homogeneity of Variance Logic of Scientific Discovery,
Making Homoscedasticity The
Exclusion Criteria Honestly Significant Difference Logistic Regression
Exogenous Variables (HSD) Test Loglinear Models
Expected Value Hypothesis Longitudinal Design

Main Effects Nominal Scale Percentile Rank


Mann–Whitney U Test Nomograms Pie Chart
Margin of Error Nonclassical Experimenter Pilot Study
Markov Chains Effects Placebo
Matching Nondirectional Hypotheses Placebo Effect
Matrix Algebra Nonexperimental Design Planning Research
Mauchly Test Nonparametric Statistics Poisson Distribution
MBESS Nonparametric Statistics for Polychoric Correlation
McNemar’s Test the Behavioral Sciences Coefficient
Mean Nonprobability Sampling Polynomials
Mean Comparisons Nonsignificance Pooled Variance
Median Normal Distribution Population
Meta-Analysis Normality Assumption Positivism
“Meta-Analysis of Normalizing Data Post Hoc Analysis
Psychotherapy Nuisance Variable Post Hoc Comparisons
Outcome Studies” Null Hypothesis Power
Methods Section Nuremberg Code Power Analysis
Method Variance NVivo Pragmatic Study
Missing Data, Imputation of   Precision
Mixed- and Random-Effects Observational Research Predictive Validity
Models Observations Predictor Variable
Mixed Methods Design Occam’s Razor Pre-Experimental Design
Mixed Model Design Odds Pretest–Posttest Design
Mode Odds Ratio Pretest Sensitization
Models Ogive Primary Data Source
Monte Carlo Simulation Omega Squared Principal Components Analysis
Mortality Omnibus Tests Probabilistic Models for
Multilevel Modeling One-Tailed Test Some Intelligence and
Multiple Comparison Tests “On the Theory of Scales of Attainment Tests
Multiple Regression Measurement” Probability, Laws of
Multiple Treatment Order Effects Probability Sampling
Interference Ordinal Scale “Probable Error of a Mean, The”
Multitrait–Multimethod Orthogonal Comparisons Propensity Score Analysis
Matrix Outlier Proportional Sampling
Multivalued Treatment Effects Overfitting Proposal
Multivariate Analysis of   Prospective Study
Variance (MANOVA) Pairwise Comparisons Protocol
Multivariate Normal Panel Design “Psychometric Experiments”
Distribution Paradigm Psychometrics
  Parallel Forms Reliability Purpose Statement
Narrative Research Parameters p Value
National Council on Parametric Statistics  
Measurement in Education Partial Correlation Q Methodology
Natural Experiments Partial Eta-Squared Q-Statistic
Naturalistic Inquiry Partially Randomized Qualitative Research
Naturalistic Observation Preference Trial Design Quality Effects Model
Nested Factor Design Participants Quantitative Research
Network Analysis Path Analysis Quasi-Experimental Design
Newman−Keuls Test and Pearson Product-Moment Quetelet’s Index
Tukey Test Correlation Coefficient Quota Sampling

R Scheffé Test Student’s t Test


R2 Scientific Method Sums of Squares
Radial Plot Secondary Data Source Survey
Random Assignment Selection Survival Analysis
Random-Effects Models Semipartial Correlation SYSTAT
Random Error Coefficient Systematic Error
Randomization Tests Sensitivity Systematic Sampling
Randomized Block Design Sensitivity Analysis
Random Sampling Sequence Effects “Technique for the Measurement
Random Selection Sequential Analysis of Attitudes, A”
Random Variable Sequential Design Teoria Statistica Delle Classi e
Range “Sequential Tests of Statistical Calcolo Delle Probabilità
Rating Hypotheses” Test
Ratio Scale Serial Correlation Test−Retest Reliability
Raw Scores Shrinkage Theory
Reactive Arrangements Significance, Statistical Theory of Attitude
Recruitment Significance Level, Concept of Measurement
Regression Artifacts Significance Level, Interpretation Think-Aloud Methods
Regression Coefficient and Construction Thought Experiments
Regression Discontinuity Sign Test Threats to Validity
Regression to the Mean Simple Main Effects Thurstone Scaling
Reliability Simpson’s Paradox Time-Lag Study
Repeated Measures Design Single-Blind Study Time-Series Study
Replication Single-Subject Design Time Studies
Research Social Desirability Treatment(s)
Research Design Principles Software, Free Trend Analysis
Research Hypothesis Spearman–Brown Prophecy Triangulation
Research Question Formula Trimmed Mean
Residual Plot Spearman Rank Order Triple-Blind Study
Residuals Correlation True Experimental Design
Response Bias Specificity True Positive
Response Surface Design Sphericity True Score
Restriction of Range Split-Half Reliability t Test, Independent Samples
Results Section Split-Plot Factorial Design t Test, One Sample
Retrospective Study SPSS t Test, Paired Samples
Robust Standard Deviation Tukey’s Honestly Significant
Root Mean Square Error Standard Error of Estimate Difference (HSD)
Rosenthal Effect Standard Error of Two-Tailed Test
Rubrics Measurement Type I Error
  Standard Error of the Mean Type II Error
Sample Standardization Type III Error
Sample Size Standardized Score
Sample Size Planning Statistic Unbiased Estimator
Sampling Statistica Unit of Analysis
Sampling and Retention of Statistical Control U-Shaped Curve
Underrepresented Groups Statistical Power Analysis for the  
Sampling Distributions Behavioral Sciences “Validity”
Sampling Error Stepwise Regression Validity of Measurement
SAS Stratified Sampling Validity of Research
Scatterplot Structural Equation Modeling Conclusions

Variability, Measure of Weights Yates’s Correction


Variable Welch’s t Test Yates’s Notation
Variance Wennberg Design Yoked Control Procedure
Volunteer Bias White Noise  
  Wilcoxon Rank Sum Test z Distribution
Wave WinPepi Zelen’s Randomized Consent Design
Weber−Fechner Law Winsorize z Score
Weibull Distribution Within-Subjects Design z Test

Reader’s Guide

The Reader’s Guide is provided to assist readers in locating entries on related topics. It classifies entries
into 28 general topical categories:

1. Descriptive Statistics
2. Distributions
3. Graphical Displays of Data
4. Hypothesis Testing
5. Important Publications
6. Inferential Statistics
7. Item Response Theory
8. Mathematical Concepts
9. Measurement Concepts
10. Organizations
11. Publishing
12. Qualitative Research
13. Reliability of Scores
14. Research Design Concepts
15. Research Designs
16. Research Ethics
17. Research Process
18. Research Validity Issues
19. Sampling
20. Scaling
21. Software Applications
22. Statistical Assumptions
23. Statistical Concepts
24. Statistical Procedures
25. Statistical Tests
26. Theories, Laws, and Principles
27. Types of Variables
28. Validity of Scores.

Entries may be listed under more than one topic.

Descriptive Statistics
Central Tendency, Measures of
Cohen's d Statistic
Cohen's f Statistic
Correspondence Analysis
Descriptive Statistics
Effect Size, Measures of
Eta-Squared
Factor Loadings
Krippendorff's Alpha
Mean
Median
Mode
Partial Eta-Squared
Range
Standard Deviation
Statistic
Trimmed Mean
Variability, Measure of
Variance

Distributions
Bernoulli Distribution
Copula Functions
Cumulative Frequency Distribution
Distribution
Frequency Distribution
Kurtosis
Law of Large Numbers
Normal Distribution
Normalizing Data
Poisson Distribution
Quetelet's Index
Sampling Distributions
Weibull Distribution
Winsorize
z Distribution

Graphical Displays of Data
Bar Chart
Box-and-Whisker Plot
Column Graph
Frequency Table
Graphical Display of Data
Growth Curve
Histogram
L'Abbé Plot
Line Graph
Nomograms
Ogive
Pie Chart
Radial Plot
Residual Plot
Scatterplot
U-Shaped Curve

Hypothesis Testing
Alternative Hypotheses
Beta
Critical Value
Decision Rule
Hypothesis

Contributors

Hervé Abdi Deborah L. Bandalos Panagiotis Besbeas


University of Texas at Dallas University of Georgia Athens University of
Mona M. Abo-Zena Economics and Business
Kimberly A. Barchard
Tufts University University of Nevada, Peter Bibby
J. H. Abramson Las Vegas University of Nottingham
Hebrew University Alyse A. Barker Tracie L. Blumentritt
Ashley Acheson Louisiana State University University of Wisconsin–
University of Texas Health La Crosse
Peter Barker
Science Center at University of Oklahoma Frederick J. Boehmke
San Antonio University of Iowa
J. Jackson Barnette
Alan C. Acock Colorado School of Daniel Bolt
Oregon State University Public Health University of Wisconsin
Pamela Adams Thomas Bart Sara E. Bolt
University of Lethbridge Northwestern University Michigan State University
Joy Adamson William M. Bart Susan J. Bondy
University of York University of Minnesota University of Toronto
David L. R. Affleck Randy J. Bartlett Matthew J. Borneman
University of Montana Blue Sigma Analytics Southern Illinois University
Alan Agresti Philip J. Batterham Sarah E. Boslaugh
University of Florida Australian National Washington University
James Algina University
Robert L. Boughner
University of Florida Pat Bazeley Rogers State University
Justin P. Allen Australian Catholic
James A. Bovaird
University of Kansas University
University of Nebraska–
Terry Andres Amy S. Beavers Lincoln
University of Manitoba University of Tennessee
Michelle Boyd
Tatiyana V. Apanasovich Bethany A. Bell Tufts University
Cornell University University of South Carolina
Clare Bradley
Jan Armstrong Brandon Bergman Royal Holloway,
University of New Mexico Nova Southeastern University University of London
Scott Baldwin Arjan Berkeljon Joanna Bradley-Gilbride
Brigham Young University Brigham Young University Health Psychology Research


Wendy D. Donlin-Washington Rex Galbraith Brian D. Haig


University of North Carolina, University College London University of Canterbury
Wilmington
Xin Gao Ryan Hansen
Kristin Duncan York University Kansas University
San Diego State University
Jennie K. Gill J. Michael Hardin
Leslie A. Duram University of Victoria University of Alabama
Southern Illinois University
Steven G. Gilmour Jeffrey R. Harring
Ronald C. Eaves Queen Mary, University of University of Maryland
Auburn University London
Sarah L. Hastings
Michael Eid Jack Glaser Radford University
Free University of Berlin University of California,
Berkeley Curtis P. Haugtvedt
Thomas W. Epps Fisher College of Business
University of Virginia Perman Gochyyev
University of Arizona Kentaro Hayashi
Shelley Esquivel
University of Hawai‘i
University of Tennessee James M. M. Good
University of Durham Nancy Headlee
Shihe Fan
University of Tennessee
Capital Health Janna Goodwin
Regis University Larry V. Hedges
Kristen Fay
Institute for Applied Research in Matthew S. Goodwin Northwestern University
Youth Development The Groden Center, Inc. Jay Hegdé
Christopher Finn Peter Goos Medical College of Georgia
University of California, Universiteit Antwerpen Joel M. Hektner
Berkeley North Dakota State University
William Drew Gouvier
Ronald Fischer Louisiana State University Amanda R. Hemmesch
Victoria University of Wellington Brandeis University
Elizabeth Grandfield
Kristin Floress California State University, Chris Herrera
University of Wisconsin– Fullerton Montclair State University
Stevens Point
Scott Graves
David Hevey
Eric D. Foster Bowling Green State University
Trinity College Dublin
University of Iowa
Leon Greene
Christiana Hilmer
Bruce B. Frey University of Kansas
San Diego State University
University of Kansas
Elena L. Grigorenko
Andrea E. Fritz Yale University Perry R. Hinton
Colorado State University Oxford Brookes University
Matthew J. Grumbein
Steve Fuller Kansas University Bettina B. Hoeppner
University of Warwick Brown University
Fei Gu
R. Michael Furr University of Kansas Scott M. Hofer
Wake Forest University University of Victoria
Matthew M. Gushta
John Gaber American Institutes for Research Robert H. Horner
University of Arkansas University of Oregon
Amanda Haboush
Sharon L. Gaber University of Nevada, David C. Howell
University of Arkansas Las Vegas University of Vermont

Chia-Chien Hsu Michael Karson Takis Konstantopoulos


Ohio State University University of Denver Heriot-Watt University,
Pennsylvania State
Yen-Chih Hsu Maria Kateri
University
University of Pittsburgh University of Piraeus
Kentaro Kato Margaret Bull Kovera
Qiaoyan Hu
University of Minnesota City University of New York
University of Illinois at Chicago
Michael W. Kattan John H. Krantz
Jason L. Huang
Cleveland Clinic Hanover College
Michigan State University
Jerome P. Keating Marie Kraska
Schuyler W. Huck
University of Texas at San Antonio Auburn University
University of Tennessee
Lisa A. Keller David R. Krathwohl
Tania B. Huedo-Medina
University of Massachusetts Syracuse University
University of Connecticut
Amherst Klaus Krippendorff
Craig R. Hullett
Ken Kelley University of Pennsylvania
University of Arizona
University of Notre Dame Jennifer Kuhn
David L. Hussey
John C. Kern II University of Tennessee
Kent State University
Duquesne University Jonna M. Kulikowich
Alan Hutson
Sarah Kershaw Pennsylvania State University
University at Buffalo
Florida State University Kevin A. Kupzyk
Robert P. Igo, Jr.
H. J. Keselman University of Nebraska–
Case Western Reserve University
University of Manitoba Lincoln
Ching-Kang Ing
Kristina Keyton Oi-man Kwok
Academia Sinica
Texas Tech University Texas A&M University
Heide Deditius Island
Eun Sook Kim Chiraz Labidi
Pacific University
Texas A&M University United Arab Emirates
Lisa M. James University
University of Texas Health Jee-Seon Kim
University of Wisconsin–Madison Michelle Lacey
Science Center at
Tulane University
San Antonio Seock-Ho Kim
University of Georgia Tze Leung Lai
Samantha John
Stanford University
University of Texas Health Seong-Hyeon Kim
Science Center at Fuller Theological Seminary David Mark Lane
San Antonio Rice University
Seoung Bum Kim
Ruthellen Josselson University of Texas at Roger Larocca
Fielding Graduate University Arlington Oakland University
Laura M. Justice Bruce M. King Robert E. Larzelere
Ohio State University Clemson University Oklahoma State University
Patricia Thatcher Kantor Neal Kingston Lauren A. Lee
Florida State University, University of Kansas University of Arizona
Florida Center for Reading
Research Roger E. Kirk Marvin Lee
Baylor University Tennessee State University
George Karabatsos
University of Illinois at Alan J. Klockars Pierre Legendre
Chicago University of Washington Université de Montréal

Sylvie Mrug Indeira Persaud Dawn M. Richard


University of Alabama at St. Vincent Community College University of Texas Health
Birmingham Science Center at
Nadini Persaud
San Antonio
Karen D. Multon University of the West Indies at
University of Kansas Cave Hill Michelle M. Riconscente
University of Southern
Daniel J. Mundfrom Maria M. Pertl California
University of Northern Trinity College Dublin
Colorado Edward E. Rigdon
John V. Petrocelli Georgia State University
Daniel L. Murphy Wake Forest University
University of Texas at Austin Steven Roberts
Shayne B. Piasta Australian National University
Mandi Wilkes Musso Ohio State University
Louisiana State University Jon E. Roeckelein
Andrea M. Piccinin Mesa College
Raymond S. Nickerson Oregon State University
Tufts University H. Jane Rogers
Rogério M. Pinto University of Connecticut
Adelheid A. M. Nicol Columbia University William M. Rogers
Royal Military College
Steven C. Pitts Grand Valley State University
Forrest Wesron Nutter, Jr. University of Maryland, Isabella Romeo
Iowa State University Baltimore County University of Milan–Bicocca
Thomas G. O’Connor Jason D. Pole Lisa H. Rosen
University of Rochester Pediatric Oncology Group of University of Texas at Dallas
Medical Center Ontario Deden Rukmana
Stephen Olejnik Wayne E. Pratt Savannah State University
University of Georgia Wake Forest University André A. Rupp
Aline Orr Katherine Presnell University of Maryland
University of Texas at Austin Southern Methodist University Ehri Ryu
Rhea L. Owens Jesse E. Purdy Boston College
University of Kansas Southwestern University Darrell Sabers
Serkan Ozel LeAnn Grogan Putney University of Arizona
Texas A&M University University of Nevada, Thomas W. Sager
Anita Pak Las Vegas University of Texas at Austin
University of Toronto Weiqiang Qian Neil J. Salkind
Qing Pan University of California– University of Kansas
George Washington University Riverside Brian A. Sandford
Sang Hee Park Richard Race Pittsburgh State University
Indiana University– Roehampton University Annesa Flentje Santa
Bloomington Philip H. Ramsey University of Montana
Carol S. Parke Queens College of City Yasuyo Sawaki
Duquesne University University of New York Waseda University
Meagan M. Patterson Alan Reifman David A. Sbarra
University of Kansas Texas Tech University University of Arizona
Jamis J. Perrett Matthew R. Reynolds Janina L. Scarlet
Texas A&M University University of Kansas Brooklyn College

Stefan Schmidt Kellie M. Smith Minghe Sun


University Medical Centre John Jay College of Criminal University of Texas at
Freiburg Justice, City University of San Antonio
New York
Vicki L. Schmitt Florensia F. Surjadi
University of Alabama Dongjiang Song Iowa State University
Association for the Advance of Xinyu Tang
C. Melanie Schuele
Medical Instrumentation University of Pittsburgh
Vanderbilt University
Fujian Song Hisashi Tanizaki
Stanley L. Sclove
University of East Anglia Kobe University
University of Illinois at Chicago
Roy Sorensen Tish Holub Taylor
Chris Segrin
Washington University in Private Practice
University of Arizona
St. Louis (psychology)
Edith Seier
Chris Spatz Kristin Rasmussen Teasdale
East Tennessee State University
Hendrix College Christian Psychological
Jane Sell Services
Scott A. Spaulding
Texas A&M University
University of Oregon Felix Thoemmes
Richard J. Shavelson Arizona State University
Karen M. Staller
Stanford University
University of Michigan Jay C. Thomas
Yu Shen Pacific University
Henderikus J. Stam
University of Texas M. D.
University of Calgary Nathan A. Thompson
Anderson Cancer Center
Assessment Systems
Jeffrey T. Steedle Corporation
Alissa R. Sherry
Council for Aid to Education
University of Texas at Austin Theresa A. Thorkildsen
David W. Stockburger University of Illinois at
David J. Sheskin
United States Air Force Chicago
Western Connecticut State
Academy
University Gail Tiemann
Stephen Stockton University of Kansas
Towfic Shomar
University of Tennessee
London School of Economics Rocio Titiunik
Eric R. Stone University of California,
Matthias Siemer
Wake Forest University Berkeley
University of Miami
David L. Streiner Sigmund Tobias
Carlos Nunes Silva
University of Toronto University at Albany, State
University of Lisbon
University of New York
Dean Keith Simonton Ian Stuart-Hamilton
University of Glamorgan David J. Torgerson
University of California, Davis University of York
Kishore Sinha Jeffrey Stuewig Carol Toris
Birsa Agricultural University George Mason University College of Charleston
Stephen G. Sireci Thuntee Sukchotrat Francis Tuerlinckx
University of Massachusetts University of Texas at University of Leuven
Selcuk R. Sirin Arlington Jean M. Twenge
New York University San Diego State University
Tia Sukin
Timothy Sly University of Massachusetts Marion K. Underwood
Ryerson University Amherst University of Texas at Dallas

Gerard J. P. Van Breukelen Murray Webster, Jr. Hongwei Yang


Maastricht University University of North University of Kentucky
Carolina–Charlotte
Brandon K. Vaughn Jie Yang
University of Texas at Austin Greg William Welch University of Illinois at
University of Kansas Chicago
Eduardo Velasco
Morgan State University Barbara M. Wells Jingyun Yang
University of Kansas Massachusetts General
Wayne F. Velicer
Cancer Prevention Research Brian J. Wells Hospital
Center Cleveland Clinic Feifei Ye
Madhu Viswanathan Craig Stephen Wells University of Pittsburgh
University of Illinois at University of Massachusetts Z. Ebrar Yetkiner
Urbana-Champaign Amherst Texas A&M University
Hoa T. Vo Stephen G. West Yue Yin
University of Texas Health Arizona State University University of Illinois at
Science Center at San Antonio Chicago
David C. Wheeler
Rainer vom Hofe Emory University Ke-Hai Yuan
University of Cincinnati
K. A. S. Wickrama University of Notre Dame
Richard Wagner Iowa State University Kally Yuen
Florida State University
Rand R. Wilcox University of Melbourne
Abdus S. Wahed University of Southern Elaine Zanutto
University of Pittsburgh California National Analysts
Harald Walach Lynne J. Williams Worldwide
University of Northampton University of Toronto April L. Zenisky
Michael J. Walk Scarborough University of Massachusetts
University of Baltimore Thomas O. Williams, Jr. Amherst
David S. Wallace Virginia Polytechnic Institute Hantao Zhang
Fayetteville State University John T. Willse University of Iowa
John Walsh University of North Carolina Zhigang Zhang
University of Victoria at Greensboro Memorial Sloan-Kettering
Hong Wang Victor L. Willson Cancer Center
University of Pittsburgh Texas A&M University Shi Zhao
Jun Wang Joachim K. Winter University of Illinois at
Colorado State University University of Munich Chicago

Xuebin Wang Suzanne Woods-Groves Xiaoling Zhong


Shanghai University University of Iowa University of Notre Dame
Rose Marie Ward Jiun-Yu Wu Linda Reichwein Zientek
Miami University Texas A&M University Sam Houston State University
Edward A. Wasserman Karl L. Wuensch Jiyun Zu
University of Iowa East Carolina University Educational Testing Service
Introduction

The Encyclopedia of Research Design is a collection of entries written by scholars in the field of research design, the discipline of how to plan and conduct empirical research, including the use of both quantitative and qualitative methods. A simple review of the Reader's Guide shows how broad the field is, including such topics as descriptive statistics, a review of important mathematical concepts, a description and discussion of the importance of such professional organizations as the American Educational Research Association and the American Statistical Association, the role of ethics in research, important inferential procedures, and much more. Two topics are especially interesting and set this collection of volumes apart from similar works: (1) a review of important research articles that have been seminal in the field and have helped determine the direction of several ideas and (2) a review of popular tools (such as software) used to analyze results. This collection of more than 500 entries includes coverage of these topics and many more.

Process

The first step in the creation of the Encyclopedia of Research Design was the identification of people with the credentials and talent to perform certain tasks. The associate editors were selected on the basis of their experience and knowledge in the field of research design, and the managing editors were selected for their experience in helping manage large projects.

Once the editor selected the associate editors and managing editors, the next step was for the group to work collectively to identify and select a thorough and complete listing of the important topics in the area of research design. This was not easy because there are hundreds, if not thousands, of topics that could be selected. We tried to select those that are the most commonly used and that readers would find most useful and important to have defined and discussed. At the same time, we had to balance this selection with the knowledge that there is never enough room to include everything. Terms were included because of a general consensus that they were essential for such a work as this.

Once the initial list of possible entries was defined in draft form, it was revised to produce the set of categories and entries that you see in the Reader's Guide at the beginning of Volume 1. We ultimately wanted topics that were sufficiently technical to enlighten the naïve but educated reader, and at the same time we wanted to avoid those topics from which only a small percentage of potential readers would benefit.

As with many other disciplines, there is a great deal of overlap in terminology within research design, as well as across related disciplines. For example, the two relatively simple entries titled Descriptive Statistics and Mean have much in common and necessarily cover some of the same content (using different words because they were written by different authors), but each entry also presents a different approach to understanding the general topic of central tendency. More advanced topics such as Analysis of Variance and Repeated Measures Design also have a significant number of conceptual ideas in common. It is impossible to avoid overlap because all disciplines contain terms and ideas that are similar, which is what gives a discipline its internal order—similar ideas and such belong together. Second, offering different language and explanations (but by no means identical words) provides a more comprehensive and varied view of important ideas. That is the strength in the diversity of the list of contributors in the Encyclopedia of Research Design and why it is the perfect instrument for new learners, as well as experienced researchers, to learn about new topics or just brush up on new developments.

As we worked with the ongoing and revised drafts of entries, we recruited authors to write the various entries. Part of the process of asking scholars to participate included asking for their feedback as to what should be included in the entry and what related topics should be included. The contributors were given the draft entry list and were encouraged to suggest other ideas and directions to pursue. Many of their ideas and suggestions were useful, and often new entries were added to the list. Almost until the end of the entire process of writing entries, the entry list continued to be revised.

Once the list was finalized, we assigned each one a specific length of 1,000, 2,000, or 3,000 words. This decision was based on the importance of the topic and how many words we thought would be necessary to represent it adequately. For example, the entry titled Abstract was deemed to be relatively limited, whereas we encouraged the author of Reliability, an absolutely central topic to research design, to write at least 3,000 words. As with every other step in the development of the Encyclopedia of Research Design, we always allowed and encouraged authors to provide feedback about the entries they were writing and nearly always agreed to their requests.

The final step was to identify authors for each of the 513 entries. We used a variety of mechanisms, including asking advisory board members to identify scholars who were experts in a particular area; consulting professional journals, books, conference presentations, and other sources to identify authors familiar with a particular topic; and drawing on the personal contacts that the editorial board members have cultivated over many years of working in this field. If potential authors felt they could not participate, we asked them to suggest someone who might be interested in writing the entry.

Once authors were confirmed, they were given explicit directions and deadlines for completing and submitting their entry. As the entries were submitted, the editorial board of the encyclopedia read them and, if necessary, requested both format and substantive changes. Once a revised entry was resubmitted, it was once again reviewed and, when acceptable, passed on to production. Notably, most entries were acceptable on initial submission.

How to Use the Encyclopedia of Research Design

The Encyclopedia of Research Design is a collection of entries intended for the naïve, but educated, consumer. It is a reference tool for users who may be interested in learning more about a particular research technique (such as "control group" or "reliability"). Users can search the Encyclopedia for specific information or browse the Reader's Guide to find topics of interest. For readers who want to pursue a topic further, each entry ends with both a list of related entries in the Encyclopedia and a set of further readings in the literature, often including online sources.

Acknowledgments

As editor, I have had the pleasure of working as the lead on several Sage encyclopedias. Because of the complex nature of the topics included in the Encyclopedia of Research Design and the associated difficulty writing about them, this was a particularly challenging project. Many of the topics are very complex and needed extra effort on the part of the editors to identify how they might be improved. Research design is a big and complex world, and it took a special effort to parse entries down to what is contained in these pages, so a great deal of thanks goes to Dr. Bruce Frey from the University of Kansas and Dr. Donald M. Dougherty from the University of Texas Health Science Center at San Antonio for their diligence, flexibility, talent, and passion for seeing this three-volume set attain a very high standard.

Our editors at Sage, Jim Brace-Thompson, senior acquisitions editor, and Rolf Janke, vice president and publisher, SAGE Reference, do what the best editors do: provide guidance and support and leave us alone to do what we do best while they keep an eye on the entire process to be sure we do not go astray.

Kristin Teasdale and Nathalie Hill-Kapturczak acted as managing editors and with great dedication and professional skill managed to find authors, see to it that documents were submitted on time, and track progress through the use of Sage's electronic tools. It is not an understatement that this project would not have gotten done on time or run as smoothly without their assistance.

The real behind-the-scenes heroes and heroines of this entire project are the editorial and production people at Sage who made sure that all the i's were dotted and the (Student) t's crossed. Among them is Carole Maurer, senior developmental editor, who has been the most gentle of supportive and constructive colleagues, always had the answers to countless questions, and guided us in the right directions. With Carole's grace and optimism, we were ready to do what was best for the project, even when the additional work made considerable demands. Other people we would like to sincerely thank are Michele Thompson, Leticia M. Gutierrez, Laura Notton, Kate Schroeder, Bonnie Freeman, Liann Lech, and Sheree Van Vreede, all of whom played a major role in seeing this set of volumes come to fruition. It is no exaggeration that what you see here would not have been possible without their hard work.

Of course this encyclopedia would not exist without the unselfish contributions of the many authors. They understood the task at hand was to introduce educated readers such as you to this very broad field of research design. Without exception, they performed this task admirably. While reviewing submissions, we editors would often find superb explications of difficult topics, and we became ever more pleased to be a part of this important project.

And as always, we want to dedicate this encyclopedia to our loved ones—partners, spouses, and children who are always there for us and help us see the forest through the trees, the bigger picture that makes good things great.

Neil J. Salkind, Editor
University of Kansas

Bruce B. Frey, Associate Editor
University of Kansas

Donald M. Dougherty, Associate Editor
University of Texas Health Science Center at San Antonio
A
ABSTRACT

An abstract is a summary of a research or a review article and includes critical information, including a complete reference to the work, its purpose, methods used, conclusions reached, and implications. For example, here is one such abstract from the Journal of Black Psychology authored by Timo Wandert from the University of Mainz, published in 2009 and titled "Black German Identities: Validating the Multidimensional Inventory of Black Identity."

All the above-mentioned elements are included in this abstract: the purpose, a brief review of important ideas to put the purpose into a context, the methods, the results, and the implications of the results.

    This study examines the reliability and validity of a German version of the Multidimensional Inventory of Black Identity (MIBI) in a sample of 170 Black Germans. The internal consistencies of all subscales are at least moderate. The factorial structure of the MIBI, as assessed by principal component analysis, corresponds to a high degree to the supposed underlying dimensional structure. Construct validity was examined by analyzing (a) the intercorrelations of the MIBI subscales and (b) the correlations of the subscales with external variables. Predictive validity was assessed by analyzing the correlations of three MIBI subscales with the level of intra-racial contact. All but one prediction concerning the correlations of the subscales could be confirmed, suggesting high validity. No statistically significant negative association was observed between the Black nationalist and assimilationist ideology subscales. This result is discussed as a consequence of the specific social context Black Germans live in and is not considered to lower the MIBI's validity. Observed differences in mean scores to earlier studies of African American racial identity are also discussed.

Abstracts serve several purposes. First, they provide a quick summary of the complete publication that is easily accessible in the print form of the article or through electronic means. Second, they become the target for search tools and often provide an initial screening when a researcher is doing a literature review. It is for this reason that article titles and abstracts contain key words that one would look for when searching for such information. Third, they become the content of reviews or collections of abstracts such as PsycINFO, published by the American Psychological Association (APA). Finally, abstracts sometimes are used as stand-ins for the actual papers when there are time or space limitations, such as at professional meetings. In this instance, abstracts are usually presented as posters in presentation sessions.

Most scholarly publications have very clear guidelines as to how abstracts are to be created, prepared, and used. For example, the APA, in the Publication Manual of the American Psychological Association, provides information regarding the elements of a good abstract and suggestions for creating one. While guidelines for abstracts of scholarly publications (such as print and electronic journals) tend to differ in the specifics, the following four guidelines apply generally:

1. The abstract should be short. For example, APA limits abstracts to 250 words, and MEDLINE limits them to no more than 400 words. The abstract should be submitted as a separate page.

2. The abstract should appear as one unindented paragraph.

3. The abstract should begin with an introduction and then move to a very brief summary of the method, results, and discussion.

4. After the abstract, five related keywords should be listed. These keywords help make electronic searches efficient and successful.

With the advent of electronic means of creating and sharing abstracts, visual and graphical abstracts have become popular, especially in disciplines in which they contribute to greater understanding by the reader.

Neil J. Salkind

See also American Psychological Association Style; Ethics in the Research Process; Literature Review

Further Readings
American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Fletcher, R. H. (1988). Writing an abstract. Journal of General Internal Medicine, 3(6), 607–609.
Luhn, H. P. (1999). The automatic creation of literature abstracts. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization (pp. 15–21). Cambridge: MIT Press.

ACCURACY IN PARAMETER ESTIMATION

Accuracy in parameter estimation (AIPE) is an approach to sample size planning concerned with obtaining narrow confidence intervals. The standard AIPE approach yields the necessary sample size so that the expected width of a confidence interval will be sufficiently narrow. Because confidence interval width is a random variable based on data, the actual confidence interval will almost certainly differ from (e.g., be larger or smaller than) the expected confidence interval width. A modified AIPE approach allows sample size to be planned so that there will be some desired degree of assurance that the observed confidence interval will be sufficiently narrow. The standard AIPE approach addresses questions such as what size sample is necessary so that the expected width of the 95% confidence interval will be no larger than ω, where ω is the desired confidence interval width. However, the modified AIPE approach addresses questions such as what size sample is necessary so that there is γ100% assurance that the 95% confidence interval width will be no larger than ω, where γ is the desired value of the assurance parameter.

Confidence interval width is a way to operationalize the accuracy of the parameter estimate, holding everything else constant. Provided appropriate assumptions are met, a confidence interval consists of a set of plausible parameter values obtained from applying the confidence interval procedure to data, where the procedure yields intervals such that (1 − α)100% will correctly bracket the population parameter of interest, where 1 − α is the desired confidence interval coverage. Holding everything else constant, as the width of the confidence interval decreases, the range of plausible parameter values is narrowed, and thus more values can be excluded as implausible values for the parameter. In general, whenever a parameter value is of interest, not only should the point estimate itself be reported, but so too should the corresponding confidence interval for the parameter, as it is known that a point estimate almost certainly differs from the population value and does not give an indication of the degree of uncertainty with which the parameter has been estimated. Wide confidence intervals, which illustrate the uncertainty with which the parameter has been estimated, are generally undesirable. Because the direction, magnitude, and accuracy of an effect can be simultaneously evaluated with confidence intervals, it has been argued that planning a research study in an effort to obtain narrow confidence intervals is an ideal way to improve research findings and increase the cumulative knowledge of a discipline.

Operationalizing accuracy as the observed confidence interval width is not new. In fact, writing in the 1930s, Jerzy Neyman used the confidence interval width as a measure of accuracy in his seminal work on the theory of confidence intervals, writing that the accuracy of estimation corresponding to a fixed value of 1 − α may be measured by the length of the confidence interval. Statistically, accuracy is defined as the square root of the mean square error, which is a function of precision and bias. When the bias is zero, accuracy and precision are equivalent concepts. The AIPE approach is so named because its goal is to improve the overall accuracy of estimates, and not just the precision or bias alone. Precision can often be improved at the expense of bias, which may or may not improve the accuracy. Thus, so as not to obtain estimates that are sufficiently precise but possibly more biased, the AIPE approach sets its goal of obtaining sufficiently accurate parameter estimates as operationalized by the width of the corresponding (1 − α)100% confidence interval.

Basing important decisions on the results of research studies is often the goal of the study. However, when an effect has a corresponding confidence interval that is wide, decisions based on such effect sizes need to be made with caution. It is entirely possible for a point estimate to be impressive according to some standard, but for the confidence limits to illustrate that the estimate is not very accurate. For example, a commonly used set of guidelines for the standardized mean difference in the behavioral, educational, and social sciences is that population standardized effect sizes of 0.2, 0.5, and 0.8 are regarded as small, medium, and large effects, respectively, following conventions established by Jacob Cohen beginning in the 1960s. Suppose that the population standardized mean difference is thought to be medium (i.e., 0.50), based on an existing theory and a review of the relevant literature. Further suppose that a researcher planned the sample size so that there would be a statistical power of .80 when the Type I error rate is set to .05, which yields a necessary sample size of 64 participants per group (128 total). In such a situation, supposing that the observed standardized mean difference was in fact exactly 0.50, the 95% confidence interval has a lower and upper limit of .147 and .851, respectively. Thus, the lower confidence limit is smaller than "small" and the upper confidence limit is larger than "large." Although there was enough statistical power (recall that sample size was planned so that power = .80, and indeed, the null hypothesis of no group mean difference was rejected, p = .005), in this case sample size was not sufficient from an accuracy perspective, as illustrated by the wide confidence interval.

Historically, confidence intervals were not often reported in applied research in the behavioral, educational, and social sciences, as well as in many other domains. Cohen once suggested researchers failed to report confidence intervals because their widths were "embarrassingly large." In an effort to plan sample size so as not to obtain confidence intervals that are embarrassingly large, and in fact to plan sample size so that confidence intervals are sufficiently narrow, the AIPE approach should be considered. The argument for planning sample size from an AIPE perspective is based on the desire to report point estimates and confidence intervals instead of or in addition to the results of null hypothesis significance tests. This paradigmatic shift has led to AIPE approaches to sample size planning becoming more useful than was previously the case, given the emphasis now placed on confidence intervals instead of a narrow focus on the results of null hypothesis significance tests.

Whereas the power analytic approach to sample size planning has as its goal the rejection of a false null hypothesis with some specified probability, the AIPE approach is not concerned with whether some specified null value can be rejected (i.e., is the null value outside the confidence interval limits?), making it fundamentally different from the power analytic approach. Not surprisingly, the AIPE and power analytic approaches can suggest very different values for sample size, depending on the particular goals (e.g., desired width or desired power) specified. The AIPE approach to sample size planning is able to simultaneously consider the direction of an effect (which is what the null hypothesis significance test provides), its magnitude (best and worst case scenarios based on the values of the confidence limits), and the accuracy with which the population parameter was estimated (via the width of the confidence interval).
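The contrast between the two approaches can be made concrete with a short calculation. The following base-R sketch first reproduces the numbers in the example above (about 64 participants per group for .80 power, and a 95% confidence interval of roughly .15 to .85 for an observed standardized mean difference of 0.50) and then finds the per-group sample size the standard AIPE approach would require for the interval width to be no larger than a target of ω = 0.30. The target width, the helper function name, and the use of a large-sample normal approximation (rather than the exact noncentral-t interval that dedicated software such as the MBESS package computes) are illustrative assumptions, not values or methods given in the entry itself.

# Power-analytic planning: n per group for delta = 0.50, alpha = .05, power = .80
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)   # about 64 per group

# Approximate 95% CI for an observed standardized mean difference of 0.50
# with 64 per group (normal approximation; limits close to the .147 and .851 cited above)
d  <- 0.50; n1 <- 64; n2 <- 64
se_d <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
d + c(-1, 1) * qnorm(0.975) * se_d                                  # interval width is about 0.70

# Standard AIPE (approximate): smallest per-group n whose 95% CI width,
# evaluated at the planning value of d, is no larger than omega
aipe_n <- function(d, omega, conf = 0.95) {
  n <- 4
  repeat {
    se <- sqrt(2 / n + d^2 / (4 * n))            # equal group sizes of n
    if (2 * qnorm(1 - (1 - conf) / 2) * se <= omega) return(n)
    n <- n + 1
  }
}
omega <- 0.30                                     # assumed desired full width
aipe_n(0.5, omega)                                # several hundred per group, far more than 64

# Why the modified AIPE approach exists: even at that n, the realized width is random,
# so only some intervals come out narrower than omega; assurance-based planning
# increases n until a chosen proportion (e.g., 99%) of intervals meet the target.
set.seed(1)
n <- aipe_n(0.5, omega)
widths <- replicate(2000, {
  g1 <- rnorm(n); g2 <- rnorm(n, mean = 0.5)
  sp <- sqrt((var(g1) + var(g2)) / 2)
  d_hat <- (mean(g2) - mean(g1)) / sp
  2 * qnorm(0.975) * sqrt(2 / n + d_hat^2 / (4 * n))
})
mean(widths <= omega)                             # well below a high assurance level such as .99

A sketch like this only approximates what purpose-built AIPE software does, but it makes the entry's point visible: a sample size that gives adequate power to detect a medium effect can still leave the effect's confidence interval too wide to be informative.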
The term accuracy in parameter estimation (and the acronym AIPE) was first used by Ken Kelley and Scott E. Maxwell in 2003 with an argument given for its widespread use in lieu of or in addition to the power analytic approach. However, the general idea of AIPE has appeared in the literature sporadically since at least the 1960s. James Algina, as well as Stephen Olejnik and Michael R. Jiroutek, contributed to similar approaches. The goal of the approach suggested by Algina is to have an estimate sufficiently close to its corresponding population value, and the goal suggested by Olejnik and Jiroutek is to simultaneously have a sufficient degree of power and confidence interval narrowness. Currently, the most extensive program for planning sample size from the AIPE perspective is R using the MBESS package.

Ken Kelley

See also Confidence Intervals; Effect Size, Measures of; Power Analysis; Sample Size Planning

Further Readings
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32.

ACTION RESEARCH

Action research differs from conventional research methods in three fundamental ways. First, its primary goal is social change. Second, members of the study sample accept responsibility for helping resolve issues that are the focus of the inquiry. Third, relationships between researcher and study participants are more complex and less hierarchical. Most often, action research is viewed as a process of linking theory and practice in which scholar-practitioners explore a social situation by posing a question, collecting data, and testing a hypothesis through several cycles of action. The most common purpose of action research is to guide practitioners as they seek to uncover answers to complex problems in disciplines such as education, health sciences, sociology, or anthropology. Action research is typically underpinned by ideals of social justice and an ethical commitment to improve the quality of life in particular social settings. Accordingly, the goals of action research are as unique to each study as participants' contexts; both determine the type of data-gathering methods that will be used. Because action research can embrace natural and social science methods of scholarship, its use is not limited to either positivist or heuristic approaches. It is, as John Dewey pointed out, an attitude of inquiry rather than a single research methodology.

This entry presents a brief history of action research, describes several critical elements of action research, and offers cases for and against the use of action research.

Historical Development

Although not officially credited with authoring the term action research, Dewey proposed five phases of inquiry that parallel several of the most commonly used action research processes, including curiosity, intellectualization, hypothesizing, reasoning, and testing hypotheses through action. This recursive process in scientific investigation is essential to most contemporary action research models. The work of Kurt Lewin is often considered seminal in establishing the credibility of action research. In anthropology, William Foote Whyte conducted early inquiry using an action research process similar to Lewin's. In health sciences, Reginald Revans renamed the process action learning while observing a process of social action among nurses and coal miners in the United Kingdom. In the area of emancipatory education, Paulo Freire is acknowledged as one of the first to undertake action research characterized by participant engagement in sociopolitical activities.

The hub of the action research movement shifted from North America to the United Kingdom in the late 1960s. Lawrence Stenhouse was instrumental in revitalizing its use among health care practitioners. John Elliott championed a form of educational action research in which the researcher-as-participant takes increased responsibility for individual and collective changes in teaching practice and school improvement. Subsequently, the 1980s were witness to a surge of action research activity centered in Australia. Wilfred Carr and Stephen Kemmis authored Becoming Critical, and Kemmis and Robin McTaggart's The Action Research Planner informed much educational inquiry. Carl Glickman is often credited with a renewed North American interest in action research in the early 1990s. He advocated action research as a way to examine and implement principles of democratic governance; this interest coincided with an increasing North American appetite for postmodern methodologies such as personal inquiry and biographical narrative.

Characteristics

Reflection

Focused reflection is a key element of most action research models. One activity essential to reflection is referred to as metacognition, or thinking about thinking. Researchers ruminate on the research process even as they are performing the very tasks that have generated the problem and, during their work, derive solutions from an examination of data. Another aspect of reflection is circumspection, or learning-in-practice. Action research practitioners typically proceed through various types of reflection, including those that focus on technical proficiencies, theoretical assumptions, or moral or ethical issues. These stages are also described as learning for practice, learning in practice, and learning from practice. Learning for practice involves the inquiry-based activities of readiness, awareness, and training engaged in collaboratively by the researcher and participants. Learning in practice includes planning and implementing intervention strategies and gathering and making sense of relevant evidence. Learning from practice includes culminating activities and planning future research. Reflection is integral to the habits of thinking inherent in scientific explorations that trigger explicit action for change.

Iterancy

Most action research is cyclical and continuous. The spiraling activities of planning, acting, observing, and reflecting recur during an action research study. Iterancy, as a unique and critical characteristic, can be attributed to Lewin's early conceptualization of action research as involving hypothesizing, planning, fact-finding (reconnaissance), execution, and analysis (see Figure 1). These iterations comprise internal and external repetition referred to as learning loops, during which participants engage in successive cycles of collecting and making sense of data until agreement is reached on appropriate action. The result is some form of human activity or tangible document that is immediately applicable in participants' daily lives and instrumental in informing subsequent cycles of inquiry.

Figure 1   Lewin's Model of Action Research. [The diagram shows a general idea and general plan followed by repeated cycles of action steps 1, 2, and 3, each preceded by a decision and by reconnaissance of goals and means or of results; reconnaissance of results may indicate a change in the general plan.]
Source: Lewin, K. (1946). Action research and minority problems. Journal of Social Issues, 2, 34–46.

Collaboration

Action research methods have evolved to include collaborative and negotiatory activities among various participants in the inquiry. Divisions between the roles of researchers and participants are frequently permeable; researchers are often defined as both full participants and external experts who engage in ongoing consultation with participants. Criteria for collaboration include evident structures for sharing power and voice; opportunities to construct common language and understanding among partners; an explicit code of ethics and principles; agreement regarding shared ownership of data; provisions for sustainable community involvement and action; and consideration of generative methods to assess the process's effectiveness.

The collaborative partnerships characteristic of action research serve several purposes. The first is to integrate into the research several tenets of evidence-based responsibility rather than documentation-based accountability. Research undertaken for purposes of accountability and institutional justification often enforces an external locus of control. Conversely, responsibility-based research is characterized by job-embedded, sustained opportunities for participants' involvement in change; an emphasis on the demonstration of professional learning; and frequent, authentic recognition of practitioner growth.

Role of the Researcher

Action researchers may adopt a variety of roles to guide the extent and nature of their relationships with participants. In a complete participant role, the identity of the researcher is neither concealed nor disguised. The researchers' and participants' goals are synonymous; the importance of participants' voice heightens the necessity that issues of anonymity and confidentiality are the subject of ongoing negotiation. The participant observer role encourages the action researcher to negotiate levels

The learning by the participants and by the researcher is rarely mutually exclusive; moreover, in practice, action researchers are most often full participants.

Intertwined purpose and the permeability of roles between the researcher and the participant are frequently elements of action research studies with agendas of emancipation and social justice.
of accessibility and membership in the participant Although this process is typically one in which the
group, a process that can limit interpretation of external researcher is expected and required to
events and perceptions. However, results derived provide some degree of expertise or advice, partici-
from this type of involvement may be granted pants—sometimes referred to as internal research-
a greater degree of authenticity if participants are ers—are encouraged to make sense of, and apply,
provided the opportunity to review and revise per- a wide variety of professional learning that can be
ceptions through a member check of observations translated into ethical action. Studies such as these
and anecdotal data. A third possible role in action contribute to understanding the human condition,
research is the observer participant, in which the incorporate lived experience, give public voice to
researcher does not attempt to experience the experience, and expand perspectives of participant
activities and events under observation but negoti- and researcher alike.
ates permission to make thorough and detailed
notes in a fairly detached manner. A fourth role,
A Case for and Against Action Research
less common to action research, is that of the com-
plete observer, in which the researcher adopts pas- Ontological and epistemological divisions between
sive involvement in activities or events, and qualitative and quantitative approaches to research
a deliberate—often physical—barrier is placed abound, particularly in debates about the credibility
between the researcher and the participant in order of action research studies. On one hand, quantita-
to minimize contamination. These categories only tive research is criticized for drawing conclusions
hint at the complexity of roles in action research. that are often pragmatically irrelevant; employing
Action Research 7

methods that are overly mechanistic, impersonal, • The complexity of social interactions makes
and socially insensitive; compartmentalizing, and other research approaches problematic.
thereby minimizing, through hypothetico-deductive • Theories derived from positivist educational
schemes, the complex, multidimensional nature of research have been generally inadequate in
human experiences; encouraging research as an iso- explaining social interactions and cultural
phenomena.
lationist and detached activity void of, and impervi- • Increased public examination of public
ous to, interdependence and collaboration; and institutions such as schools, hospitals, and
forwarding claims of objectivity that are simply not corporate organizations requires insights of
fulfilled. a type that other forms of research have not
On the other hand, qualitative aspects of provided.
action research are seen as quintessentially unre- • Action research can provide a bridge across the
liable forms of inquiry because the number of perceived gap in understanding between
uncontrolled contextual variables offers little practitioners and theorists.
certainty of causation. Interpretive methodolo-
gies such as narration and autobiography can Reliability and Validity
yield data that are unverifiable and potentially The term bias is a historically unfriendly pejora-
deceptive. Certain forms of researcher involve- tive frequently directed at action research. As
ment have been noted for their potential to much as possible, the absence of bias constitutes
unduly influence data, while some critiques con- conditions in which reliability and validity can
tend that Hawthorne or halo effects—rather increase. Most vulnerable to charges of bias are
than authentic social reality—are responsible for action research inquiries with a low saturation
the findings of naturalist studies. point (i.e., a small N), limited interrater reliability,
Increased participation in action research in the and unclear data triangulation. Positivist studies
latter part of the 20th century paralleled a growing make attempts to control external variables that
demand for more pragmatic research in all fields may bias data; interpretivist studies contend that it
of social science. For some humanities practi- is erroneous to assume that it is possible to do any
tioners, traditional research was becoming irrele- research—particularly human science research—
vant, and their social concerns and challenges that is uncontaminated by personal and political
were not being adequately addressed in the find- sympathies and that bias can occur in the labora-
ings of positivist studies. They found in action tory as well as in the classroom. While value-free
research a method that allowed them to move inquiry may not exist in any research, the critical
further into other research paradigms or to com- issue may not be one of credibility but, rather, one
mit to research that was clearly bimethodological. of recognizing divergent ways of answering ques-
Increased opportunities in social policy develop- tions associated with purpose and intent. Action
ment meant that practitioners could play a more research can meet determinants of reliability and
important role in conducting the type of research validity if primary contextual variables remain
that would lead to clearer understanding of social consistent and if researchers are as disciplined as
science phenomena. Further sociopolitical impetus possible in gathering, analyzing, and interpreting
for increased use of action research derived from the evidence of their study; in using triangulation
the politicizing effects of the accountability move- strategies; and in the purposeful use of participa-
ment and from an increasing solidarity in humani- tion validation. Ultimately, action researchers must
ties professions in response to growing public reflect rigorously and consistently on the places
scrutiny. and ways that values insert themselves into studies
The emergence of action research illustrates and on how researcher tensions and contradictions
a shift in focus from the dominance of statistical can be consistently and systematically examined.
tests of hypotheses within positivist paradigms
toward empirical observations, case studies, and
Generalizability
critical interpretive accounts. Research protocols
of this type are supported by several contentions, Is any claim of replication possible in studies
including the following: involving human researchers and participants?
8 Action Research

Perhaps even more relevant to the premises and • How are issues of representation, validity, bias,
intentions that underlie action research is the and reliability discussed?
question, Is this desirable in contributing to our • What is the role of the research? In what ways
understanding of the social world? Most action does this align with the purpose of the study?
• In what ways will this study contribute to
researchers are less concerned with the traditional
knowledge and understanding?
goal of generalizability than with capturing the
richness of unique human experience and meaning.
A defensible understanding of what constitutes
Capturing this richness is often accomplished
knowledge and of the accuracy with which it is
by reframing determinants of generalization and
portrayed must be able to withstand reasonable
avoiding randomly selected examples of human
scrutiny from different perspectives. Given the
experience as the basis for conclusions or extrapo-
complexities of human nature, complete under-
lations. Each instance of social interaction, if
standing is unlikely to result from the use of a
thickly described, represents a slice of the social
single research methodology. Ethical action
world in the classroom, the corporate office, the
researchers will make public the stance and lenses
medical clinic, or the community center. A certain
they choose for studying a particular event. With
level of generalizability of action research results
transparent intent, it is possible to honor the
may be possible in the following circumstances:
unique, but not inseparable, domains inhabited by
• Participants in the research recognize and
social and natural, thereby accommodating appre-
confirm the accuracy of their contributions.
ciation for the value of multiple perspectives of the
• Triangulation of data collection has been human experience.
thoroughly attended to.
• Interrater techniques are employed prior to
drawing research conclusions. Making Judgment on Action Research
• Observation is as persistent, consistent, and
longitudinal as possible. Action research is a relatively new addition to the
• Dependability, as measured by an auditor, repertoire of scientific methodologies, but its appli-
substitutes for the notion of reliability. cation and impact are expanding. Increasingly
• Confirmability replaces the criterion of sophisticated models of action research continue
objectivity. to evolve as researchers strive to more effectively
capture and describe the complexity and diversity
Ethical Considerations of social phenomena.
One profound moral issue that action research- Perhaps as important as categorizing action
ers, like other scientists, cannot evade is the use research into methodological compartments is the
they make of knowledge that has been generated necessity for the researcher to bring to the study
during inquiry. For this fundamental ethical rea- full self-awareness and disclosure of the personal
son, the premises of any study—but particularly and political voices that will come to bear on
those of action research—must be transparent. results and action. The action researcher must
Moreover, they must attend to a wider range of reflect on and make transparent, prior to the study,
questions regarding intent and purpose than sim- the paradoxes and problematics that will guide the
ply those of validity and reliability. These ques- inquiry and, ultimately, must do everything that is
tions might include considerations such as the fair and reasonable to ensure that action research
following: meets requirements of rigorous scientific study.
Once research purpose and researcher intent are
• Why was this topic chosen? explicit, several alternative criteria can be used to
• How and by whom was the research funded? ensure that action research is sound research.
• To what extent does the topic dictate or align These criteria include the following types, as noted
with methodology? by David Scott and Robin Usher:
• Are issues of access and ethics clear?
• From what foundations are the definitions of Aparadigmatic criteria, which judge natural and
science and truth derived? social sciences by the same strategies of data
Adaptive Designs in Clinical Trials 9

collection and which apply the same determinants Carr, W., & Kemmis, S. (1986). Becoming critical:
of reliability and validity Education, knowledge and action research.
Philadelphia: Farmer.
Diparadigmatic criteria, which judge social
Dewey, J. (1910). How we think. Boston: D. C. Heath.
phenomena research in a manner that is Freire, P. (1968). Pedagogy of the oppressed. New York:
dichotomous to natural science events and which Herder & Herder.
apply determinants of reliability and validity that Habermas, J. (1971). Knowledge and human interests.
are exclusive to social science Boston: Beacon.
Multiparadigmatic criteria, which judge research of Holly, M., Arhar, J., & Kasten, W. (2005). Action
the social world through a wide variety of research for teachers: Traveling the yellow brick road.
strategies, each of which employs unique Upper Saddle River, NJ: Pearson/Merrill/Prentice Hall,
postmodern determinants of social science Kemmis, S., & McTaggart, R. (1988). The action
research planner. Geelong, Victoria, Australia: Deakin
Uniparadigmatic criteria, which judge the natural University.
and social world in ways that are redefined and Lewin, K. (1946). Action research and minority
reconceptualized to align more appropriately with problems. Journal of Social Issues, 2, 34–46.
a growing quantity and complexity of knowledge Revans, R. (1982). The origins and growth of action
learning. Bromley, UK: Chartwell-Bratt.
In the final analysis, action research is favored Sagor, R. (1992). How to conduct collaborative action
by its proponents because it research. Alexandria, VA: Association for Supervision
and Curriculum Development.
• honors the knowledge and skills of all participants Schön, D. (1983). The reflective practitioner. New York:
• allows participants to be the authors of their Basic Books.
own incremental progress
• encourages participants to learn strategies of
problem solving
• promotes a culture of collaboration
• enables change to occur in context
ADAPTIVE DESIGNS
• enables change to occur in a timely manner IN CLINICAL TRIALS
• is less hierarchical and emphasizes collaboration
• accounts for rather than controls phenomena
Some designs for clinical trial research, such as
drug effectiveness research, allow for modification
Action research is more than reflective practice. and make use of an adaptive design. Designs such
It is a complex process that may include either qual- as adaptive group-sequential design, n-adjustable
itative or quantitative methodologies, one that has design, adaptive seamless phase II–III design, drop-
researcher and participant learning at its center. the-loser design, adaptive randomization design,
Although, in practice, action research may not often adaptive dose-escalation design, adaptive treat-
result in high levels of critical analysis, it succeeds ment-switching design, and adaptive-hypothesis
most frequently in providing participants with intel- design are adaptive designs.
lectual experiences that are illuminative rather than In conducting clinical trials, investigators first
prescriptive and empowering rather than coercive. formulate the research question (objectives) and
Pamela Adams then plan an adequate and well-controlled study
that meets the objectives of interest. Usually, the
See also Evidence-Based Decision Making; External objective is to assess or compare the effect of one
Validity; Generalizability Theory; Mixed Methods or more drugs on some response. Important steps
Design; Naturalistic Inquiry involved in the process are study design, method
of analysis, selection of subjects, assignment of
subjects to drugs, assessment of response, and
Further Readings assessment of effect in terms of hypothesis testing.
Berg, B. (2001). Qualitative research methods for the All the above steps are outlined in the study proto-
social sciences. Toronto, Ontario, Canada: Allyn and col, and the study should follow the protocol to
Bacon. provide a fair and unbiased assessment of the
10 Adaptive Designs in Clinical Trials

treatment effect. However, it is not uncommon to switch a patient’s treatment from an initial
adjust or modify the trial, methods, or both, either assignment to an alternative treatment because of
at the planning stage or during the study, to pro- a lack of efficacy or a safety concern.
vide flexibility in randomization, inclusion, or Adaptive-Hypothesis Design. Adaptive-hypothesis
exclusion; to allow addition or exclusion of doses; design allows change in research hypotheses based
to extend treatment duration; or to increase or on interim analysis results.
decrease the sample size. These adjustments are
mostly done for one or more of the following rea-
sons: to increase the probability of success of the
trial; to comply with budget, resource, or time Sample Size
constraints; or to reduce concern for safety. How- There has been considerable research on adaptive
ever, these modifications must not undermine the designs in which interim data at first stage are used
validity and integrity of the study. This entry to reestimate overall sample size. Determination of
defines various adaptive designs and discusses the sample size for a traditional randomized clinical
use of adaptive designs for modifying sample size. trial design requires specification of a clinically
meaningful treatment difference, to be detected
Adaptive Design Variations with some desired power. Such determinations can
become complicated because of the need for speci-
Adaptive design of a clinical trial is a design that fying nuisance parameters such as the error vari-
allows adaptation of some aspects of the trial after ance, and the choice for a clinically meaningful
its initiation without undermining the trial’s valid- treatment difference may not be straightforward.
ity and integrity. There are variations of adaptive However, adjustment of sample size with proper
designs, as described in the beginning of this entry. modification of Type I error may result in an over-
Here is a short description of each variation: powered study, which wastes resources, or an
underpowered study, with little chance of success.
Adaptive Group-Sequential Design. Adaptive A traditional clinical trial fixes the sample size
group-sequential design allows premature
in advance and performs the analysis after all sub-
termination of a clinical trial on the grounds of
jects have been enrolled and evaluated. The advan-
safety, efficacy, or futility, based on interim results.
tages of an adaptive design over classical designs
n-Adjustable Design. Adaptive n-adjustable design are that adaptive designs allow design assumptions
allows reestimation or adjustment of sample size, (e.g., variance, treatment effect) to be modified on
based on the observed data at interim. the basis of accumulating data and allow sample
Adaptive Seamless Phase II–III Design. Such size to be modified to avoid an under- or overpow-
a design addresses, within a single trial, objectives ered study. However, researchers have shown that
that are normally achieved through separate Phase an adaptive design based on revised estimates of
IIb and Phase III trials. treatment effect is nearly always less efficient than
Adaptive Drop-the-Loser Design. Adaptive drop- a group sequential approach. Dramatic bias can
the-loser design allows dropping of low-performing occur when power computation is being per-
treatment group(s). formed because of significance of interim results.
Yet medical researchers tend to prefer adaptive
Adaptive Randomization Design. Adaptive
randomization design allows modification of designs, mostly because (a) clinically meaningful
randomization schedules. effect size can change when results from other
trials may suggest that smaller effects than origi-
Adaptive Dose-Escalation Design. An adaptive nally postulated are meaningful; (b) it is easier to
dose-escalation design is used to identify the
request a small budget initially, with an option to
maximum tolerated dose (MTD) of a medication.
ask for supplemental funding after seeing the
This design is usually considered optimal in later-
phase clinical trials. interim data; and (c) investigators may need to see
some data before finalizing the design.
Adaptive Treatment-Switching Design. An adaptive
treatment-switching design allows investigators to Abdus S. Wahed
Alternative Hypotheses 11

Further Readings Example


Chow, S., Shao, J., & Wang, H. (2008). Sample size
calculations in clinical trial research. Boca Raton, FL: An example can help elucidate the role of alterna-
Chapman & Hall. tive hypotheses. Consider a researcher who is com-
paring the effects of two drugs for treating
a disease. The researcher hypothesizes that one of
the two drugs will be far superior in treating the
ADJUSTED F TEST disease. If the researcher rejects the null hypothe-
sis, he or she is likely to infer that one treatment
performs better than the other. In this example, the
See Greenhouse–Geisser Correction statistical alternative is a statement about the
population parameters of interest (e.g., population
means). When it is inferred, the conclusion is that
ALTERNATIVE HYPOTHESES the two means are not equal, or equivalently, that
the samples were drawn from distinct populations.
The researcher must then make a substantive
The alternative hypothesis is the hypothesis that is ‘‘leap’’ to infer that one treatment is superior to the
inferred, given a rejected null hypothesis. Also other. There may be many other possible explana-
called the research hypothesis, it is best described tions for the two means’ not being equal; however,
as an explanation for why the null hypothesis was it is likely that the researcher will infer an alterna-
rejected. Unlike the null, the alternative hypothesis tive that is in accordance with the original purpose
is usually of most interest to the researcher. of the scientific study (such as wanting to show
This entry distinguishes between two types of that one drug outperforms the other). It is impor-
alternatives: the substantive and the statistical. In tant to remember, however, that concluding that
addition, this entry provides an example and dis- the means are not equal (i.e., inferring the statisti-
cusses the importance of experimental controls in cal alternative hypothesis) does not provide any
the inference of alternative hypotheses and the scientific evidence at all for the chosen conceptual
rejection of the null hypothesis. alternative. Particularly when it is not possible to
control for all possible extraneous variables, infer-
ence of the conceptual alternative hypothesis may
Substantive or Conceptual Alternative
involve a considerable amount of guesswork, or at
It is important to distinguish between the substan- minimum, be heavily biased toward the interests
tive (or conceptual, scientific) alternative and the of the researcher.
statistical alternative. The conceptual alternative is A classic example in which an incorrect alterna-
that which is inferred by the scientist given tive can be inferred is the case of the disease
a rejected null. It is an explanation or theory that malaria. For many years, it was believed that the
attempts to account for why the null was rejected. disease was caused by breathing swamp air or liv-
The statistical alternative, on the other hand, is sim- ing around swamplands. In this case, scientists
ply a logical complement to the null that provides comparing samples from two populations (those
no substantive or scientific explanation as to why who live in swamplands and those who do not)
the null was rejected. When the null hypothesis is could have easily rejected the null hypothesis,
rejected, the statistical alternative is inferred in line which would be that the rates of malaria in the
with the Neyman–Pearson approach to hypothesis two populations were equal. They then would
testing. At this point, the substantive alternative put have inferred the statistical alternative, that the
forth by the researcher usually serves as the ‘‘rea- rates of malaria in the swampland population were
son’’ that the null was rejected. However, a rejected higher. Researchers could then infer a conceptual
null does not by itself imply that the researcher’s alternative—swamplands cause malaria. However,
substantive alternative hypothesis is correct. Theo- without experimental control built into their study,
retically, there could be an infinite number of expla- the conceptual alternative is at best nothing more
nations for why a null is rejected. than a convenient alternative advanced by the
12 American Educational Research Association

researchers. As further work showed, mosquitoes, conceptual alternative has been inferred. Surely,
which live in swampy areas, were the primary anyone can reject a null, but few can identify and
transmitters of the disease, making the swamp- infer a correct alternative.
lands alternative incorrect.
Daniel J. Denis, Annesa Flentje Santa,
and Chelsea Burfeind
The Importance of Experimental Control
See also Hypothesis; Null Hypothesis
One of the most significant challenges posed by an
inference of the scientific alternative hypothesis is
the infinite number of plausible explanations for Further Readings
the rejection of the null. There is no formal statisti-
cal procedure for arriving at the correct scientific Cohen, J. (1994). The earth is round (p < .05). American
alternative hypothesis. Researchers must rely on Psychologist, 49, 997–1003.
Cowles, M. (2000). Statistics in psychology: An historical
experimental control to help narrow the number
perspective. Philadelphia: Lawrence Erlbaum.
of plausible explanations that could account for Denis, D. J. (2001). Inferring the alternative hypothesis:
the rejection of the null hypothesis. In theory, if Risky business. Theory & Science, 2, 1. Retrieved
every conceivable extraneous variable were con- December 2, 2009, from http://theoryandscience
trolled for, then inferring the scientific alternative .icaap.org/content/vol002.001/03denis.html
hypothesis would not be such a difficult task. Hays, W. L. (1994). Statistics (5th ed.). New York:
However, since there is no way to control for every Harcourt Brace.
possible confounding variable (at least not in Neyman, J., & Pearson, E. S. (1928). On the use and
most social sciences, and even many physical interpretation of certain test criteria for purposes of
sciences), the goal of good researchers must be to statistical inference (Part 1). Biometrika, 20A,
175–240.
control for as many extraneous factors as possible.
The quality and extent of experimental control
is proportional to the likelihood of inferring cor-
rect scientific alternative hypotheses. Alternative
hypotheses that are inferred without the prerequi- AMERICAN EDUCATIONAL
site of such things as control groups built into the RESEARCH ASSOCIATION
design of the study or experiment are at best plau-
sible explanations as to why the null was rejected,
The American Educational Research Association
and at worst, fashionable hypotheses that the
(AERA) is an international professional organiza-
researcher seeks to endorse without the appropri-
tion based in Washington, D.C., and dedicated to
ate scientific license to do so.
promoting research in the field of education.
Through conferences, publications, and awards,
Concluding Comments AERA encourages the scientific pursuit and dis-
semination of knowledge in the educational arena.
Hypothesis testing is an integral part of every
Its membership is diverse, drawn from within the
social science researcher’s job. The statistical and
education professions, as well as from the broader
conceptual alternatives are two distinct forms of
social science field.
the alternative hypothesis. Researchers are most
often interested in the conceptual alternative
hypothesis. The conceptual alternative hypothesis
Mission
plays an important role; without it, no conclusions
could be drawn from research (other than rejecting The mission of AERA is to influence the field of
a null). Despite its importance, hypothesis testing education in three major ways: (1) increasing
in the social sciences (especially the softer social knowledge about education, (2) promoting educa-
sciences) has been dominated by the desire to tional research, and (3) encouraging the use of
reject null hypotheses, whereas less attention has educational research results to make education
been focused on establishing that the correct better and thereby improve the common good.
American Educational Research Association 13

History with the National Education Association in 1930,


gaining Washington, D.C., offices and support for
AERA publicizes its founding as taking place in AERA’s proposed new journal, the Review of
1916. However, its roots have been traced back to Educational Research. AERA underwent several
the beginnings of the educational administration changes during the 1930s. Besides the creation of
research area and the school survey movement, both the new journal, AERA decided to affiliate with
of which took place in the 1910s. This new spirit of other professional groups that shared common
cooperation between university researchers and pub- interests, such as the National Committee on
lic schools led eight individuals to found the Research in Secondary Education and the National
National Association of Directors of Educational Council of Education. The recognition of superior
Research (NADER) as an interest group within the research articles at an annual awards ceremony
National Education Association’s Department of was established in 1938, although the awards cere-
Superintendence in February 1915. With the crea- mony was disbanded for many years starting in
tion of its first organizational constitution in 1916, 1942. By 1940, AERA membership stood at 496.
NADER committed itself to the improvement of Much of AERA’s growth (other than member-
public education through applied research. ship) has come in the form of its many journals. In
NADER’s two goals were to organize educational 1950, the AERA Newsletter published its first
research centers at public educational settings and to issue. Its goal was to inform the membership about
promote the use of appropriate educational mea- current news and events in education. Its name
sures and statistics in educational research. Full was changed to Educational Researcher in 1965.
membership in this new organization was restricted In 1963, the American Educational Research Jour-
to individuals who directed research bureaus, nal was created to give educational researchers an
although others involved in educational research outlet for original research articles, as previous
could join as associate members. NADER produced AERA publications focused primarily on reviews.
its first publication, Educational Research Bulletin, By 1970, the Review of Educational Research had
in 1916. Within 3 years from its founding, its changed its focus, which led to the creation of
membership had almost quadrupled, to 36 full another new journal, the Review of Research in
members. In 1919, two of the founders of NADER Education. Meanwhile, membership has continued
started producing a new journal, the Journal of Edu- to grow, with current membership at approxi-
cational Research, soon to be adopted by NADER mately 26,000 individuals.
as an official publication.
With the growth in educational research pro-
Organization of Association Governance
grams in the late 1910s to early 1920s, NADER
revised its constitution in 1921 to allow full mem- The organization of AERA has changed since its
bership status for anyone involved in conducting founding to accommodate the greater membership
and producing educational research, by invitation and its diverse interests. AERA is currently gov-
only, after approval from the executive committee. erned by a council, an executive board, and stand-
To better display this change in membership ing committees. The council is responsible for
makeup, the group changed its name in 1922 to policy setting and the assignment of standing com-
the Educational Research Association of America. mittees for AERA and is formed of elected
This shift allowed for a large increase in member- members, including the president, president-elect,
ship, which grew to 329 members by 1931. immediate past president, two at-large members,
Members were approximately two thirds from division vice presidents, the chair of the special
a university background and one third from the interest group (SIG) executive committee, and
public schools. In 1928, the organization changed a graduate student representative. The council
its name once more, becoming the American Edu- meets three times per year. In addition to serving
cational Research Association (AERA). on the council, the president appoints members of
After a brief uproar among AERA membership the standing committees.
involving the ownership of the Journal of Educa- The executive board is an advisory board that
tional Research, AERA decided to affiliate itself guides the president and executive director of
14 American Educational Research Association

AERA. The board meets concurrently with council Membership


meetings and more frequently as needed. The
board has managed elections, appointed an execu- At about 26,000 members, AERA is among the
tive director, and selected annual meeting sites, largest professional organizations in the United
in addition to other needed tasks. The council States. Membership in AERA takes several forms
currently appoints 21 standing committees and but is primarily divided among voting full mem-
charges them with conducting specific tasks in bers and nonvoting affiliates. In order to be a vot-
accordance with AERA policy. These committees ing member, either one must hold the equivalent of
range in focus from annual meeting policies and a master’s degree or higher; be a graduate student
procedures to scholars and advocates for gender sponsored by a voting member of one’s university
equity in education and to technology. faculty; or have emeritus status, earned following
retirement after more than 20 years of voting
membership in AERA. Affiliate members are those
Divisions who are interested in educational research but do
not have a master’s degree, undergraduate students
AERA has identified 12 areas of professional or who are sponsored by their faculty, or non–U.S.
academic interest to its members and has labeled citizens who do not meet the master’s degree
these areas divisions. On joining AERA, each requirement. For both voting and nonvoting mem-
member selects which division to join. Members berships, students pay a reduced rate. Members of
can belong to more than one division for an AERA gain several benefits, including a reduced
additional annual fee. The divisions hold busi- cost to attend the annual meeting, free member-
ness meetings and present research in their inter- ship in one division, and free subscriptions to both
est area at the annual meeting. The 12 divisions Educational Researcher and one other AERA jour-
are as follows: Administration, Organization, nal of the member’s choice.
and Leadership; Curriculum Studies; Learning
and Instruction; Measurement and Research
Methodology; Counseling and Human Develop- Publications
ment; History and Historiography; Social Con-
AERA publishes six peer-reviewed journals as well
text of Education; Research, Evaluation, and
as several books and a series of Research Points,
Assessment in Schools; Education in the Profes-
published quarterly, which are designed to help
sions; Postsecondary Education; Teaching and
those working on policy issues connect with cur-
Teacher Education; and Educational Policy and
rent educational research findings. AERA’s peer-
Politics.
reviewed journals include the American Educa-
tional Research Journal, Educational Evaluation
and Policy Analysis, Educational Researcher, the
SIGs
Journal of Educational and Behavioral Statistics,
SIGs are small interest groups within AERA mem- the Review of Educational Research, and the
bership. SIGs differ from divisions in that their Review of Research in Education.
focus tends to be on more specific topics than the The American Educational Research Journal
broad interests represented by the divisions. Like focuses on original scientific research in the field of
divisions, SIGs typically hold business meetings education. It has two subdivisions: one that exam-
and support presentations of research in their ines research on teaching, learning, and human
interest area at the annual meeting. AERA has development and one for social and institutional
approximately 163 SIGs. Membership in the SIGs analysis. Educational Evaluation and Policy Analy-
is based on annual dues, which range from $5 to sis publishes original research focusing on evaluation
$75. SIGs range in focus from Academic Audit and policy analysis issues. Educational Researcher
Research in Teacher Education to Hierarchical publishes information of general interest to a broad
Linear Modeling to Rasch Measurement and to variety of AERA members. Interpretations and sum-
Writing and Literacies. AERA members may join maries of current educational research, as well as
as many SIGs as they wish. book reviews, make up the majority of its pages.
American Educational Research Association 15

The Journal of Educational and Behavioral Statistics provides or sponsors the most offerings for gradu-
focuses on new statistical methods for use in educa- ate students is the graduate student council. This
tional research, as well as critiques of current prac- group, composed of 28 graduate students and divi-
tices. It is published jointly with the American sion and staff representatives, meets at every
Statistical Association. The Review of Educational annual meeting to plan offerings for the graduate
Research publishes reviews of previously published students. Its mission is to support graduate student
articles by interested parties from varied back- members to become professional researchers or
grounds. The Review of Research in Education is practitioners though education and advocacy. The
an annual publication that solicits critical essays on graduate student council sponsors many sessions
a variety of topics facing the field of education. All at the annual meeting, as well as hosting a graduate
AERA’s journals are published by Sage. student resource center at the event. It also pub-
lishes a newsletter three times per year and hosts
Annual Meetings a Listserv where graduate students can exchange
information.
AERA’s annual meetings are an opportunity to
bring AERA’s diverse membership together to dis-
cuss and debate the latest in educational practices
Awards
and research. Approximately 16,000 attendees
gather annually to listen, discuss, and learn. For the AERA offers an extensive awards program, and
2008 meeting, 12,024 presentation proposals were award recipients are announced at the president’s
submitted, and more than 2,000 were presented. In address during the annual meeting. AERA’s divi-
addition to presentations, many business meetings, sions and SIGs also offer awards, which are pre-
invited sessions, awards, and demonstrations are sented during each group’s business meeting.
held. Several graduate student-oriented sessions are AERA’s awards cover educational researchers at
also held. Many sessions focusing on educational all stages of their career, from the Early Career
research related to the geographical location of the Award to the Distinguished Contributions to
annual meeting are also presented. Another valu- Research in Education Award. Special awards are
able educational opportunity is the many profes- also given in other areas, including social justice
sional development and training courses offered issues, public service, and outstanding books.
during the conference. These tend to be refresher
courses in statistics and research design or evalua-
tion or workshops on new assessment tools or class- Fellowships and Grants
room-based activities. In addition to the scheduled Several fellowships are offered through AERA,
sessions, exhibitors of software, books, and testing with special fellowships focusing on minority
materials present their wares at the exhibit hall, and researchers, researchers interested in measurement
members seeking new jobs can meet prospective (through a program with the Educational Testing
employers in the career center. Tours of local attrac- Service), and researchers interested in large-scale
tions are also available. Each year’s meeting is orga- studies through a partnership with the American
nized around a different theme. In 2008, the annual Institutes for Research. AERA also offers several
meeting theme was Research on Schools, Neighbor- small grants for various specialties, awarded up to
hoods, and Communities: Toward Civic Responsi- three times per year.
bility. The meeting takes place at the same time and
place as the annual meeting of the National Council Carol A. Carman
on Measurement in Education.
See also American Statistical Association; National
Council on Measurement in Education
Other Services and Offerings
Graduate Student Council Further Readings
Graduate students are supported through sev- Hultquist, N. J. (1976). A brief history of AERA’s
eral programs within AERA, but the program that publishing. Educational Researcher, 5(11), 9–13.
16 American Psychological Association Style

Mershon, S., & Schlossman, S. (2008). Education, Past tense should also be used to describe results
science, and the politics of knowledge: The American of an empirical study conducted by the author
Educational Research Association, 1915–1940. (e.g., ‘‘self-esteem increased over time’’). Present
American Journal of Education, 114(3), 307–340. tense (e.g., ‘‘these results indicate’’) should be used
in discussing and interpreting results and drawing
Websites conclusions.
American Educational Research Association:
http://www.aera.net
Nonbiased Language
APA style guidelines recommend that authors avoid
AMERICAN PSYCHOLOGICAL language that is biased against particular groups.
ASSOCIATION STYLE APA provides specific guidelines for describing age,
gender, race or ethnicity, sexual orientation, and
disability status. Preferred terms change over time
American Psychological Association (APA) style
and may also be debated within groups; authors
is a system of guidelines for writing and format-
should consult a current style manual if they are
ting manuscripts. APA style may be used for
unsure of the terms that are currently preferred or
a number of types of manuscripts, such as theses,
considered offensive. Authors may also ask study
dissertations, reports of empirical studies, litera-
participants which term they prefer for themselves.
ture reviews, meta-analyses, theoretical articles,
General guidelines for avoiding biased language
methodological articles, and case studies. APA
include being specific, using labels as adjectives
style is described extensively in the Publication
instead of nouns (e.g., ‘‘older people’’ rather than
Manual of the American Psychological Associa-
‘‘the elderly’’), and avoiding labels that imply
tion (APA Publication Manual). The APA Publi-
a standard of judgment (e.g., ‘‘non-White,’’ ‘‘stroke
cation Manual includes recommendations on
victim’’).
writing style, grammar, and nonbiased language,
as well as guidelines for manuscript formatting,
such as arrangement of tables and section head-
ings. The first APA Publication Manual was pub- Formatting
lished in 1952; the most recent edition was
published in 2009. APA style is the most The APA Publication Manual also provides a num-
accepted writing and formatting style for jour- ber of guidelines for formatting manuscripts. These
nals and scholarly books in psychology. The use include guidelines for use of numbers, abbrevia-
of a single style that has been approved by the tions, quotations, and headings.
leading organization in the field aids readers,
researchers, and students in organizing and
understanding the information presented.
Tables and Figures
Tables and figures may allow numerical infor-
Writing Style
mation to be presented more clearly and concisely
The APA style of writing emphasizes clear and than would be possible in text. Tables and figures
direct prose. Ideas should be presented in an may also allow for greater ease in comparing
orderly and logical manner, and writing should be numerical data (for example, the mean depression
as concise as possible. Usual guidelines for clear scores of experimental and control groups). Fig-
writing, such as the presence of a topic sentence in ures and tables should present information clearly
each paragraph, should be followed. Previous and supplement, rather than restate, information
research should be described in either the past provided in the text of the manuscript. Numerical
tense (e.g., ‘‘Surles and Arthur found’’) or the pres- data reported in a table should not be repeated in
ent perfect tense (e.g., ‘‘researchers have argued’’). the text.
American Psychological Association Style 17

Headings with a brief statement that states the author’s


hypotheses and the ways in which these hypothe-
Headings provide the reader with an outline
ses are supported by the previous research
of the organization of the manuscript. APA style
discussed in the introduction. The introduction
includes five levels of heading (examples below).
section should demonstrate how the question at
Most manuscripts do not require all five levels.
hand is grounded in theory.
Topics of equal importance should have the same
level of heading throughout the manuscript
(e.g., the Method sections of multiple experiments Method
should have the same heading level). Having only
The method section describes how the study
one headed subsection within a section should be
was conducted. The method section is frequently
avoided.
broken up into subsections with headings such
Level 1: Centered Boldface Uppercase and
as Participants, Materials and Measures, and
Lowercase Heading Procedure.
Descriptions of participants should include
Level 2: Flush Left Boldface Uppercase and summaries of demographic characteristics such as
Lowercase Heading participants’ ages, genders, and races or ethnicities.
Level 3: Indented boldface lowercase heading Other demographic characteristics, such as socio-
ending with a period. economic status and education level, should be
Level 4: Indented boldface italicized lowercase reported when relevant. The method by which par-
heading ending with a period. ticipants were recruited (e.g., by newspaper adver-
tisements or through a departmental subject pool)
Level 5: Indented italicized lowercase heading should also be included.
ending with a period.
Materials and measures should be described
such that the reader can know what would be
Manuscript Sections needed to replicate the study. Measures that are
commercially available or published elsewhere
A typical APA style manuscript that reports on an
should be referred to by name and attributed to
empirical study has five sections: Abstract, Intro-
their authors (e.g., ‘‘Self-esteem was measured
duction, Method, Results, and Discussion.
using the Perceived Competence Scale for Children
(Harter, 1982).’’). Measures created for the study
Abstract should be described and may be reproduced in the
The abstract is a concise (150–250 word) sum- manuscript in a table or appendix.
mary of the contents of a manuscript. An abstract The procedure should describe any experimen-
should include a description of the topic or prob- tal manipulation, instructions to participants
lem under investigation and the most important (summarized unless instructions are part of the
findings or conclusions. If the manuscript describes experimental manipulation, in which case they
an empirical study, the abstract should also include should be presented verbatim), order in which
information about participants and experimental measures and manipulations were presented, and
methods. The abstract of a published article is control features (such as randomization and
often included in databases to allow researchers to counterbalancing).
search for relevant studies on a particular topic.
Results
Introduction
The results section presents and summarizes the
The introduction section introduces the reader data collected and discusses the statistical analyses
to the question under investigation. The author conducted and their results. Analyses should be
should describe the topic or problem, discuss other reported in sufficient detail to justify conclusions.
research related to the topic, and state the purpose All relevant analyses should be reported, even those
of this study. The introduction should conclude whose results were statistically nonsignificant or
18 American Psychological Association Style

that did not support the stated hypotheses. The (e.g., American Psychologist, Child Development,
results section will typically include inferential sta- Journal of Personality and Social Psychology).
tistics, such as chi-squares, F tests, or t tests. For Peer review is the process of evaluation of scientific
these statistics, the value of the test statistic, degrees work by other researchers with relevant areas of
of freedom, p value, and size and direction of effect expertise. The methodology and conclusions of an
should be reported, for instance, F(1, 39) ¼ 4.86, article published in a peer-reviewed journal have
p ¼ .04, η2 ¼ .12. been examined and evaluated by several experts in
The results section may include figures (such as the field.
graphs or models) and tables. Figures and tables
will typically appear at the end of a manuscript. If
In-Text Citations
the manuscript is being submitted for publication,
notes may be included in the text to indicate where Throughout the manuscript text, credit should
figures or tables should be placed (e.g., ‘‘Insert be given to authors whose work is referenced. In-
Table 1 here.’’). text citations allow the reader to be aware of the
source of an idea or finding and locate the work in
the reference list at the end of the manuscript.
Discussion
APA style uses an author-date citation method
The discussion section is where the findings and (e.g., Bandura, 1997). For works with one or two
analyses presented in the results section are sum- authors, all authors are included in each citation.
marized and interpreted. The author should dis- For works with three to five authors, the first in-
cuss the extent to which the results support the text citation lists all authors; subsequent citations
stated hypotheses. Conclusions should be drawn list only the first author by name (e.g., Hughes,
but should remain within the boundaries of the Bigler, & Levy, 2007, in first citation, Hughes et
data obtained. Ways in which the findings of the al., 2007, in subsequent citations). Works with six
current study relate to the theoretical perspectives or more authors are always cited in the truncated
presented in the introduction should also be et al. format.
addressed. This section should briefly acknowledge
the limitations of the current study and address Reference Lists
possible alternative explanations for the research
findings. The discussion section may also address References should be listed alphabetically by
potential applications of the work or suggest the last name of the first author. Citations in the
future research. reference list should include names of authors,
article or chapter title, and journal or book title.
References to articles in journals or other period-
Referring to Others’ Work
icals should include the article’s digital object
It is an author’s job to avoid plagiarism by noting identifier if one is assigned. If a document was
when reference is made to another’s work or ideas. accessed online, the reference should include
This obligation applies even when the author is a URL (Web address) where the material can be
making general statements about existing knowl- accessed. The URL listed should be as specific as
edge (e.g., ‘‘Self-efficacy impacts many aspects of possible; for example, it should link to the article
students’ lives, including achievement motivation rather than to the publication’s homepage. The
and task persistence (Bandura, 1997).’’). Citations APA Publication Manual includes guidelines for
allow a reader to be aware of the original source citing many different types of sources. Examples
of ideas or data and direct the reader toward of some of the most common types of references
sources of additional information on a topic. appear below.
When preparing a manuscript, an author may
be called on to evaluate sources and make deci- Book: Bandura, A. (1997). Self-efficacy: The
sions about the quality of research or veracity of exercise of control. New York: W. H. Freeman.
claims. In general, the most authoritative sources Chapter in edited book: Powlishta, K. K. (2004).
are articles published in peer-reviewed journals Gender as a social category: Intergroup processes
American Statistical Association 19

and gender-role development. In M. Bennet & F. Sani (Eds.), The development of the social self (pp. 103–133). New York: Psychology Press.

Journal article: Hughes, J. M., Bigler, R. S., & Levy, S. R. (2007). Consequences of learning about historical racism among European American and African American children. Child Development, 78, 1689–1705. doi: 10.1111/j.1467-8624.2007.01096.x

Article in periodical (magazine or newspaper): Gladwell, M. (2006, February 6). Troublemakers: What pit bulls can teach us about profiling. The New Yorker, 81(46), 33–41.

Research report: Census Bureau. (2006). Voting and registration in the election of November 2004. Retrieved from U.S. Census Bureau website: http://www.census.gov/population/www/socdemo/voting.html

Meagan M. Patterson

See also Abstract; Bias; Demographics; Discussion Section; Dissertation; Methods Section; Results Section

Further Readings

American Psychological Association. (2009). Publication Manual of the American Psychological Association (6th ed.). Washington, DC: Author.
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851.
Carver, R. P. (1984). Writing a publishable research report in education, psychology, and related disciplines. Springfield, IL: Charles C Thomas.
Cuba, L. J. (2002). A short guide to writing about social science (4th ed.). New York: Longman.
Dunn, D. S. (2007). A short guide to writing about psychology (2nd ed.). New York: Longman.
Sabin, W. A. (2004). The Gregg reference manual (10th ed.). New York: McGraw-Hill.

AMERICAN STATISTICAL ASSOCIATION

The American Statistical Association (ASA) is a society for scientists, statisticians, and statistics consumers representing a wide range of science and education fields. Since its inception in November 1839, the ASA has aimed to provide both statistical science professionals and the public with a standard of excellence for statistics-related projects. According to ASA publications, the society's mission is "to promote excellence in the application of statistical science across the wealth of human endeavor." Specifically, the ASA mission includes a dedication to excellence with regard to statistics in practice, research, and education; a desire to work toward bettering statistical education and the profession of statistics as a whole; a concern for recognizing and addressing the needs of ASA members; education about the proper uses of statistics; and the promotion of human welfare through the use of statistics.

Regarded as the second-oldest continuously operating professional association in the United States, the ASA has a rich history. In fact, within 2 years of its founding, the society already had a U.S. president—Martin Van Buren—among its members. Also on the list of the ASA's historical members are Florence Nightingale, Alexander Graham Bell, and Andrew Carnegie. The original founders, who united at the American Education Society in Boston to form the society, include U.S. Congressman Richard Fletcher; teacher and fundraiser William Cogswell; physician and medicine reformist John Dix Fisher; statistician, publisher, and distinguished public health author Lemuel Shattuck; and lawyer, clergyman, and poet Oliver Peabody. The founders named the new organization the American Statistical Society, a name that lasted only until the first official meeting in February 1840.

In its beginning years, the ASA developed a working relationship with the U.S. Census Bureau, offering recommendations and often lending its members as heads of the census. S. N. D. North, the 1910 president of the ASA, was also the first director of the permanent census office. The society, its membership, and its diversity in statistical activities grew rapidly after World War I as the employment of statistics in business and government gained popularity. At that time, large cities and universities began forming local chapters. By its 100th year in existence, the ASA had more members than it had ever had, and those involved with the society commemorated the
centennial with celebrations in Boston and Philadelphia. However, by the time World War II was well under way, many of the benefits the ASA experienced from the post–World War I surge were reversed. For 2 years—1942 and 1943—the society was unable to hold annual meetings. Then, after World War II, as after World War I, the ASA saw a great expansion in both its membership and its applications to burgeoning science endeavors.

Today, ASA has expanded beyond the United States and can count 18,000 individuals as members. Its members, who represent 78 geographic locations, also have diverse interests in statistics. These interests range from finding better ways to teach statistics to problem solving for homelessness and from AIDS research to space exploration, among a wide array of applications. The society comprises 24 sections, including the following: Bayesian Statistical Science, Biometrics, Biopharmaceutical Statistics, Business and Economic Statistics, Government Statistics, Health Policy Statistics, Nonparametric Statistics, Physical and Engineering Sciences, Quality and Productivity, Risk Analysis, a section for Statistical Programmers and Analysts, Statistical Learning and Data Mining, Social Statistics, Statistical Computing, Statistical Consulting, Statistical Education, Statistical Graphics, Statistics and the Environment, Statistics in Defense and National Security, Statistics in Epidemiology, Statistics in Marketing, Statistics in Sports, Survey Research Methods, and Teaching of Statistics in the Health Sciences. Detailed descriptions of each section, lists of current officers within each section, and links to each section are available on the ASA Web site.

In addition to holding meetings coordinated by more than 60 committees of the society, the ASA sponsors scholarships, fellowships, workshops, and educational programs. Its leaders and members also advocate for statistics research funding and offer a host of career services and outreach projects.

Publications from the ASA include scholarly journals, statistical magazines, books, research guides, brochures, and conference proceeding publications. Among the journals available are American Statistician; Journal of Agricultural, Biological, and Environmental Statistics; Journal of the American Statistical Association; Journal of Business and Economic Statistics; Journal of Computational and Graphical Statistics; Journal of Educational and Behavioral Statistics; Journal of Statistical Software; Journal of Statistics Education; Statistical Analysis and Data Mining; Statistics in Biopharmaceutical Research; Statistics Surveys; and Technometrics.

The official Web site of the ASA offers a more comprehensive look at the mission, history, publications, activities, and future directions of the society. Additionally, browsers can find information about upcoming meetings and events, descriptions of outreach and initiatives, the ASA bylaws and constitution, a copy of the Ethical Guidelines for Statistical Practice prepared by the Committee on Professional Ethics, and an organizational list of board members and leaders.

Kristin Rasmussen Teasdale

Further Readings

Koren, J. (1970). The history of statistics: Their development and progress in many countries. New York: B. Franklin. (Original work published 1918)
Mason, R. L. (1999). ASA: The first 160 years. Retrieved October 10, 2009, from http://www.amstat.org/about/first160years.cfm
Wilcox, W. F. (1940). Lemuel Shattuck, statistician, founder of the American Statistical Association. Journal of the American Statistical Association, 35, 224–235.

Websites

American Statistical Association: http://www.amstat.org

ANALYSIS OF COVARIANCE (ANCOVA)

Behavioral sciences rely heavily on experiments and quasi experiments for evaluating the effects of, for example, new therapies, instructional methods, or stimulus properties. An experiment includes at least two different treatments (conditions), and human participants are randomly assigned one treatment. If assignment is not based on randomization, the design is called a quasi experiment. The dependent variable or outcome of an experiment
or a quasi experiment, denoted by Y here, is usually quantitative, such as the total score on a clinical questionnaire or the mean response time on a perceptual task. Treatments are evaluated by comparing them with respect to the mean of the outcome Y using either analysis of variance (ANOVA) or analysis of covariance (ANCOVA). Multiple linear regression may also be used, and categorical outcomes require other methods, such as logistic regression. This entry explains the purposes of, and assumptions behind, ANCOVA for the classical two-group between-subjects design. ANCOVA for within-subject and split-plot designs is discussed briefly at the end.

Researchers often want to control or adjust statistically for some independent variable that is not experimentally controlled, such as gender, age, or a pretest value of Y. A categorical variable such as gender can be included in ANOVA as an additional factor, turning a one-way ANOVA into a two-way ANOVA. A quantitative variable such as age or a pretest recording can be included as a covariate, turning ANOVA into ANCOVA. ANCOVA is the bridge from ANOVA to multiple regression. There are two reasons for including a covariate in the analysis if it is predictive of the outcome Y. In randomized experiments, it reduces unexplained (within-group) outcome variance, thereby increasing the power of the treatment effect test and reducing the width of its confidence interval. In quasi experiments, it adjusts for a group difference with respect to that covariate, thereby adjusting the between-group difference on Y for confounding.

Model

The ANCOVA model for comparing two groups at posttest Y, using a covariate X, is as follows:

Yij = μ + αj + β(Xij − X̄) + eij,   (1)

where Yij is the outcome for person i in group j (e.g., j = 1 for control, j = 2 for treated), Xij is the covariate value for person i in group j, μ is the grand mean of Y, αj is the effect of treatment j, β is the slope of the regression line for predicting Y from X within groups, X̄ is the overall sample mean of covariate X, and eij is a normally distributed residual or error term with a mean of zero and a variance σe², which is the same in both groups. By definition, α1 + α2 = 0, and so α2 − α1 = 2α2 is the expected posttest group difference adjusted for the covariate X. This is even better seen by rewriting Equation 1 as

Yij − β(Xij − X̄) = μ + αj + eij   (2)

showing that ANCOVA is ANOVA of Y adjusted for X. Due to the centering of X, that is, the subtraction of X̄, the adjustment is on the average zero in the total sample. So the centering affects individual outcome values and group means, but not the total or grand mean μ of Y.

ANCOVA can also be written as a multiple regression model:

Yij = β0 + β1Gij + β2Xij + eij   (3)

where Gij is a binary indicator of treatment group (Gi1 = 0 for controls, Gi2 = 1 for treated), and β2 is the slope β in Equation 1. Comparing Equation 1 with Equation 3 shows that β1 = 2α2 and that β0 = (μ − α2 − βX̄). Centering in Equation 3 both G and X (i.e., coding G as −1 and +1, and subtracting X̄ from X) will give β0 = μ and β1 = α2. Application of ANCOVA requires estimation of β in Equation 1. Its least squares solution is σXY/σX², the within-group covariance between pre- and posttest, divided by the within-group pretest variance, which in turn are both estimated from the sample.
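In practice, the model in Equation 3 can be fit with any ordinary least squares routine. The following is only an illustrative sketch, not part of the original entry; the data file and the column names (post for the outcome Y, pre for the covariate X, and group coded 0 for control and 1 for treated) are assumptions made for the example.

```python
# Minimal ANCOVA-as-regression sketch (hypothetical data and column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial.csv")                # hypothetical data file
df["pre_c"] = df["pre"] - df["pre"].mean()   # center the covariate, as in Equation 1

model = smf.ols("post ~ group + pre_c", data=df).fit()
print(model.params["group"])   # covariate-adjusted group difference
print(model.params["pre_c"])   # estimated pooled within-group slope (beta)
```

With this 0/1 coding, the coefficient of group estimates the covariate-adjusted group difference, and the coefficient of the centered covariate estimates the within-group slope β discussed above.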
Assumptions

As Equations 1 and 3 show, ANCOVA assumes that the covariate has a linear effect on the outcome and that this effect is homogeneous, the same in both groups. So there is no treatment by covariate interaction. Both the linearity and the homogeneity assumption can be tested and relaxed by adding to Equation 3 as predictors X × X and G × X, respectively, but this entry concentrates on the classical model, Equation 1 or Equation 3. The assumption of homogeneity of residual variance σe² between groups can also be relaxed.

Another assumption is that X is not affected by the treatment. Otherwise, X must be treated as a mediator instead of as a covariate, with consequences for the interpretation of analysis with versus without adjustment for X. If X is measured before treatment assignment, this assumption is warranted.

A more complicated ANCOVA assumption is that X is measured without error, where error refers to intra-individual variation across replications. This assumption will be valid for a covariate such as age but not for a questionnaire or test score, in particular not for a pretest of the outcome at hand. Measurement error in X leads to attenuation, a decrease of its correlation with Y and of its slope β in Equation 1. This leads to a loss of power in randomized studies and to bias in nonrandomized studies.

A last ANCOVA assumption that is often mentioned, but not visible in Equation 1, is that there is no group difference on X. This seems to contradict one of the two purposes of ANCOVA, that is, adjustment for a group difference on the covariate. The answer is simple, however. The assumption is not required for covariates that are measured without measurement error, such as age. But if there is measurement error in X, then the resulting underestimation of its slope β in Equation 1 leads to biased treatment effect estimation in case of a group difference on X. An exception is the case of treatment assignment based on the observed covariate value. In that case, ANCOVA is unbiased in spite of measurement error in X, whether groups differ on X or not, and any attempt at correction for attenuation will then introduce bias. The assumption of no group difference on X is addressed in more detail in a special section on the use of a pretest of the outcome Y as covariate.

Purposes

The purpose of a covariate in ANOVA depends on the design. To understand this, note that ANCOVA gives the following adjusted estimator of the group difference:

Δ̂ = (Ȳ2 − Ȳ1) − β(X̄2 − X̄1)   (4)

In a randomized experiment, the group difference on the covariate, X̄1 − X̄2, is zero, and so the adjusted difference Δ̂ is equal to the unadjusted difference (Ȳ2 − Ȳ1), apart from sampling error. In terms of ANOVA, the mean square (MS; treatment) is the same with or without adjustment, again apart from sampling error. Things are different for the MS (error), which is the denominator of the F test in ANOVA. ANCOVA estimates β such that the MS (error) is minimized, thereby maximizing the power of the F test. Since the standard error (SE) of Δ̂ is proportional to the square root of the MS (error), this SE is minimized, leading to more precise effect estimation by covariate adjustment.

In a nonrandomized study with groups differing on the covariate X, the covariate-adjusted group effect Δ̂ systematically differs from the unadjusted effect (Ȳ2 − Ȳ1). It is unbiased if the ANCOVA assumptions are satisfied and treatment assignment is random conditional on the covariate, that is, random within each subgroup of persons who are homogeneous on the covariate. Although the MS (error) is again minimized by covariate adjustment, this does not imply that the SE of Δ̂ is reduced. This SE is a function not only of MS (error), but also of treatment–covariate correlation. In a randomized experiment, this correlation is zero apart from sampling error, and so the SE depends only on the MS (error) and sample size. In nonrandomized studies, the SE increases with treatment–covariate correlation and can be larger with than without adjustment. But in nonrandomized studies, the primary aim of covariate adjustment is correction for bias, not a gain of power.
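As a purely numerical illustration of Equation 4 (the values below are invented and are not data from the entry), consider a nonrandomized study in which the treated group starts higher on the covariate:

```python
# Adjusted group difference of Equation 4, with made-up summary statistics.
beta = 0.8                     # assumed within-group slope
y_treat, y_ctrl = 24.0, 20.0   # unadjusted posttest means
x_treat, x_ctrl = 55.0, 50.0   # covariate (e.g., pretest) means

unadjusted = y_treat - y_ctrl                      # 4.0
adjusted = unadjusted - beta * (x_treat - x_ctrl)  # 4.0 - 0.8 * 5.0 = 0.0
print(unadjusted, adjusted)
```

In this invented example the entire unadjusted difference of 4.0 is accounted for by the covariate difference, so the adjusted effect is zero; in a randomized study the covariate means would be equal apart from sampling error and the two estimates would coincide.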
The two purposes of ANCOVA are illustrated in Figures 1 and 2, showing the within-group regressions of outcome Y on covariate X, with the ellipses summarizing the scatter of individual persons around their group line. Each group has its own regression line with the same slope β (reflecting absence of interaction) but different intercepts. In Figure 1, of a nonrandomized study, the groups differ on the covariate. Moving the markers for both group means along their regression line to a common covariate value X̄ gives the adjusted group difference Δ̂ on outcome Y, reflected by the vertical distance between the two lines, which is also the difference between both intercepts. In Figure 2, of a randomized study, the two groups have the same mean covariate value, and so unadjusted and adjusted group difference on Y are the same. However, in both figures the adjustment has yet another effect, illustrated in Figure 2. The MS (error) of ANOVA without adjustment is the entire within-group variance in vertical direction, ignoring regression lines. The MS (error) of ANCOVA is the variance of the vertical distances of individual dots from their group regression line. All variation in the Y-direction that can be predicted from the covariate, that is, all increase of Y along the line, is included in the unadjusted MS (error) but excluded from the adjusted MS (error), which is thus smaller. In fact, it is only (1 − ρXY²) as large as the unadjusted MS (error), where ρXY is the within-group correlation between outcome and covariate.

Figure 1   Adjustment of the Outcome Difference Between Groups for a Covariate Difference in a Nonrandomized Study (plot of outcome Y against covariate X; figure not reproduced)
Notes: Regression lines for treated (upper) and untreated (lower) group. Ellipses indicate scatter of individuals around their group lines. Markers on the lines indicate unadjusted (solid) and adjusted (open) group means.

Figure 2   Reduction of Unexplained Outcome Variance by Covariate Adjustment in a Randomized Study (plot of outcome Y against covariate X; figure not reproduced)
Notes: Vertical distance between upper and lower end of an ellipse indicates unexplained (within-group) outcome variance in ANOVA. Vertical distance within an ellipse indicates unexplained outcome variance in ANCOVA.

Using a Pretest of the Outcome as Covariate

An important special case of ANCOVA is that in which a pretest measurement of Y is used as covariate. The user can then choose between two methods of analysis:

1. ANCOVA with the pretest as covariate and the posttest as outcome.
2. ANOVA with the change score (posttest minus pretest) as outcome.

Two other popular methods come down to either of these two: ANCOVA of the change score is equivalent to Method 1. The Group × Time interaction test in a repeated measures ANOVA with pretest and posttest as repeated measures is equivalent to Method 2. So the choice is between Methods 1 and 2 only. Note that Method 2 is a special case of Method 1 in the sense that choosing β = 1 in ANCOVA gives ANOVA of change, as Equation 2 shows. In a randomized experiment, there is no pretest group difference, and both methods give the same unbiased treatment effect apart from sampling error, as Equation 4 shows. However, ANCOVA gives a smaller MS (error), leading to more test power and a smaller confidence interval than ANOVA of change, except if β ≈ 1 in ANCOVA and the sample size N is small. In nonrandomized studies, the value for β in Equation 4 does matter, and ANCOVA gives a different treatment effect than does ANOVA of change. The two methods may even lead to contradictory conclusions, which is known as Lord's ANCOVA paradox. The choice between the two methods then depends on the assignment procedure. This is best seen by writing both as a repeated measures model.

ANOVA of change is equivalent to testing the Group × Time interaction in the following model (where regression weights are denoted
by γ to distinguish them from the βs in earlier equations):

Yijt = γ0 + γ1Gij + γ2Tit + γ3GijTit + eijt   (5)

Here, Yijt is the outcome value of person i in group j at time t, G is the treatment group (0 = control, 1 = treated), T is the time (0 = pretest, 1 = posttest), and eijt is a random person effect with an unknown 2 × 2 within-group covariance matrix of pre- and posttest measures. By filling in the 0 or 1 values for G and T, one can see that γ0 is the pretest (population) mean of the control group, γ1 is the pretest mean difference between the groups, γ2 is the mean change in the control group, and γ3 is the difference in mean change between groups. Testing the interaction effect γ3 in Equation 5 is therefore equivalent to testing the group effect on change (Y − X). The only difference between repeated measures ANOVA and Equation 5 is that ANOVA uses (−1, +1) instead of (0, 1) coding for G and T.

ANCOVA can be shown to be equivalent to testing γ3 in Equation 5 after deleting the term γ1Gij by assuming γ1 = 0, which can be done with mixed (multilevel) regression. So ANCOVA assumes that there is no group difference at pretest. This assumption is satisfied by either of two treatment assignment procedures: (1) randomization and (2) assignment based on the pretest X. Both designs start with one group of persons so that there can be no group effect at pretest. Groups are created after the pretest. This is why ANCOVA is the best method of analysis for both designs. In randomized experiments it has more power than ANOVA of change. With treatment assignment based on the pretest such that X̄1 ≠ X̄2, ANCOVA is unbiased whereas ANOVA of change is then biased by ignoring regression to the mean. In contrast, if naturally occurring or preexisting groups are assigned, such as Community A getting some intervention and Community B serving as control, then ANCOVA will usually be biased whereas ANOVA of change may be unbiased. A sufficient set of conditions for ANOVA of change to be unbiased, then, is (a) that the groups are random samples from their respective populations and (b) that without treatment these populations change equally fast (or not at all). The bias in ANCOVA for this design is related to the issue of underestimation of β in Equation 1 due to measurement error in the covariate. Correction for this underestimation gives, under certain conditions, ANOVA of change. In the end, however, the correct method of analysis for nonrandomized studies of preexisting groups is a complicated problem because of the risk of hidden confounders. Having two pretests with a suitable time interval and two control groups is then recommended to test the validity of both methods of analysis. More specifically, treating the second pretest as posttest or treating the second control group as experimental group should not yield a significant group effect because there is no treatment.
Covariates in Other Popular Designs

This section discusses covariates in within-subject designs (e.g., crossovers) and between-subject designs with repeated measures (i.e., a split-plot design).

A within-subject design with a quantitative outcome can be analyzed with repeated measures ANOVA, which reduces to Student's paired t test if there are only two treatment conditions. If a covariate such as age or a factor such as gender is added, then repeated measures ANOVA with two treatments comes down to applying ANCOVA twice: (1) to the within-subject difference D of both measurements (within-subject part of the ANOVA) and (2) to the within-subject average A of both measurements (between-subject part of the ANOVA). ANCOVA of A tests the main effects of age and gender. ANCOVA of D tests the Treatment × Gender and Treatment × Age interactions and the main effect of treatment. If gender and age are centered as in Equation 1, this main effect is μ in Equation 1, the grand mean of D. If gender and age are not centered, as in Equation 3, the grand mean of D equals β0 + β1G + β2X, where G is now gender and X is age. The most popular software, SPSS (an IBM company, formerly called PASW® Statistics), centers factors (here, gender) but not covariates (here, age) and tests the significance of β0 instead of the grand mean of D when reporting the F test of the within-subject main effect. The optional pairwise comparison test in SPSS tests the grand mean of D, however.

Between-subject designs with repeated measures, for example, at posttest and follow-ups or during and after treatment, also allow covariates. The analysis is the same as for the within-subject design extended with gender and age. But interest now is in the Treatment (between-subject) × Time (within-subject) interaction and, if there is no such interaction, in the main effect of treatment averaged across the repeated measures, rather than in the main effect of the within-subject factor time. A pretest recording can again be included as covariate or as repeated measure, depending on the treatment assignment procedure. Note, however, that as the number of repeated measures increases, the F test of the Treatment × Time interaction may have low power. More powerful are the Treatment × Linear (or Quadratic) Time effect test and discriminant analysis.

Within-subject and repeated measures designs can have not only between-subject covariates such as age but also within-subject or time-dependent covariates. Examples are a baseline recording within each treatment of a crossover trial, and repeated measures of a mediator. The statistical analysis of such covariates is beyond the scope of this entry, requiring advanced methods such as mixed (multilevel) regression or structural equations modeling, although the case of only two repeated measures allows a simpler analysis by using as covariates the within-subject average and difference of the original covariate.

Practical Recommendations for the Analysis of Studies With Covariates

Based on the preceding text, the following recommendations can be given: In randomized studies, covariates should be included to gain power, notably a pretest of the outcome. Researchers are advised to center covariates and check linearity and absence of treatment–covariate interaction as well as normality and homogeneity of variance of the residuals. In nonrandomized studies of preexisting groups, researchers should adjust for covariates that are related to the outcome to reduce bias. With two pretests or two control groups, researchers should check the validity of ANCOVA and ANOVA of change by treating the second pretest as posttest or the second control group as experimental group. No group effect should then be found. In the real posttest analysis, researchers are advised to use the average of both pretests as covariate since this average suffers less from attenuation by measurement error. In nonrandomized studies with only one pretest and one control group, researchers should apply ANCOVA and ANOVA of change and pray that they lead to the same conclusion, differing in details only.

Additionally, if there is substantial dropout related to treatment or covariates, then all data should be included in the analysis to prevent bias, using mixed (multilevel) regression instead of traditional ANOVA to prevent listwise deletion of dropouts. Further, if pretest data are used as an inclusion criterion in a nonrandomized study, then the pretest data of all excluded persons should be included in the effect analysis by mixed regression to reduce bias.

Gerard J. P. Van Breukelen

See also Analysis of Variance (ANOVA); Covariate; Experimental Design; Gain Scores, Analysis of; Pretest–Posttest Design; Quasi-Experimental Design; Regression Artifacts; Split-Plot Factorial Design

Further Readings

Campbell, D. T., & Kenny, D. A. (1999). A primer on regression artifacts. New York: Guilford Press.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Frison, L., & Pocock, S. (1997). Linearly divergent treatment effects in clinical trials with repeated measures: Efficient analysis using summary statistics. Statistics in Medicine, 16, 2855–2872.
Judd, C. M., Kenny, D. A., & McClelland, G. H. (2001). Estimating and testing mediation and moderation in within-subject designs. Psychological Methods, 6, 115–134.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for questions pertaining to a randomized pretest, posttest, follow-up design. Journal of Clinical Child & Adolescent Psychology, 32(3), 467–486.
Reichardt, C. S. (1979). The statistical analysis of data from nonequivalent group designs. In T. D. Cook & D. T. Campbell (Eds.), Quasi-experimentation: Design
and analysis issues for field settings (pp. 147–205). Boston: Houghton-Mifflin.
Rosenbaum, P. R. (1995). Observational studies. New York: Springer.
Senn, S. J. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25, 4334–4344.
Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: Simple strategies for analysis using summary measures. Statistics in Medicine, 19, 861–877.
Van Breukelen, G. J. P. (2006). ANCOVA versus change from baseline: More power in randomized studies, more bias in nonrandomized studies. Journal of Clinical Epidemiology, 59, 920–925.
Winkens, B., Van Breukelen, G. J. P., Schouten, H. J. A., & Berger, M. P. F. (2007). Randomized clinical trials with a pre- and a post-treatment measurement: Repeated measures versus ANCOVA models. Contemporary Clinical Trials, 28, 713–719.

ANALYSIS OF VARIANCE (ANOVA)

Usually a two-sample t test is applied to test for a significant difference between two population means based on the two samples. For example, consider the data in Table 1. Twenty patients with high blood pressure are randomly assigned to two groups of 10 patients. Patients in Group 1 are assigned to receive placebo, while patients in Group 2 are assigned to receive Drug A. Patients' systolic blood pressures (SBPs) are measured before and after treatment, and the differences in SBPs are recorded in Table 1. A two-sample t test would be an efficient method for testing the hypothesis that drug A is more effective than placebo when the differences in before and after measurements are normally distributed. However, there are usually more than two groups involved for comparison in many fields of scientific investigation. For example, extend the data in Table 1 to the data in Table 2. Here the study used 30 patients who are randomly assigned to placebo, Drug A, and Drug B. The goal here is to compare the effects of placebo and experimental drugs in reducing SBP. But a two-sample t test is not applicable here as we have more than two groups. Analysis of variance (ANOVA) generalizes the idea of the two-sample t test so that normally distributed responses can be compared across categories of one or more factors.

Table 1   Comparison of Two Treatments Based on Systolic Blood Pressure Change

              Treatment
   Placebo    Drug A
      1.3       4.0
      1.5       5.7
      0.5       3.5
      0.8       0.4
      1.1       1.3
      3.4       0.8
      0.8      10.7
      3.6       0.3
      0.3       0.5
     −2.2      −3.3

Table 2   Comparison of Three Treatments Based on Systolic Blood Pressure Change

              Treatment
   Placebo    Drug A    Drug B
      1.3       4.0       7.6
      0.5       5.7       9.2
      0.5       3.5       4.0
      0.4       0.4       1.8
      1.1       1.3       5.3
      0.6       0.8       2.6
      0.8      10.7       3.8
      3.6       0.3       1.2
      0.3       0.5       0.4
      2.2       3.3       2.6

Since its development, ANOVA has played an indispensable role in the application of statistics in many fields, such as biology, social sciences, finance, pharmaceutics, and scientific and industrial research. Although ANOVA can be applied to various statistical models, and the simpler ones are usually named after the number of categorical variables, the concept of ANOVA is based solely on identifying the contribution of individual factors in the total variability of the data. In the above example, if the variability in SBP changes due to the drug is large compared with the chance variability, then one would think that the effect of the drug on SBP is substantial. The factors could be different individual characteristics, such as age, sex, race, occupation, social class, and treatment group, and the significant differences between the levels of these factors can be assessed by forming the ratio of the variability due to the factor itself and that due to chance only.

History

As early as 1925, R. A. Fisher first defined the methodology of ANOVA as "separation of the variance ascribable to one group of causes from the
variance ascribable to other groups" (p. 216). Henry Scheffé defined ANOVA as "a statistical technique for analyzing measurements depending on several kinds of effects operating simultaneously, to decide which kinds of effects are important and to estimate the effects. The measurements or observations may be in an experimental science like genetics or a nonexperimental one like astronomy" (p. 3). At first, this methodology focused more on comparing the means while treating variability as a nuisance. Nonetheless, since its introduction, ANOVA has become the most widely used statistical methodology for testing the significance of treatment effects.

Based on the number of categorical variables, ANOVA can be distinguished into one-way ANOVA and two-way ANOVA. In addition, ANOVA models can also be separated into a fixed-effects model, a random-effects model, and a mixed model based on how the factors are chosen during data collection. Each of them is described separately.

One-Way ANOVA

One-way ANOVA is used to assess the effect of a single factor on a single response variable. When the factor is a fixed factor whose levels are the only ones of interest, one-way ANOVA is also referred to as fixed-effects one-way ANOVA. When the factor is a random factor whose levels can be considered as a sample from the population of levels, one-way ANOVA is referred to as random-effects one-way ANOVA. Fixed-effects one-way ANOVA is applied to answer the question of whether the population means are equal or not. Given k population means, the null hypothesis can be written as

H0: μ1 = μ2 = ⋯ = μk   (1)

The alternative hypothesis, Ha, can be written as

Ha: k population means are not all equal.

In the random-effects one-way ANOVA model, the null hypothesis tested is that the random effect has zero variability.

Four assumptions must be met for applying ANOVA:

A1: All samples are simple random samples drawn from each of k populations representing k categories of a factor.
A2: Observations are independent of one another.
A3: The dependent variable is normally distributed in each population.
A4: The variance of the dependent variable is the same in each population.

Suppose, for the jth group, the data consist of the nj measurements Yj1, Yj2, . . . , Yjnj, j = 1, 2, . . . , k. Then the total variation in the data can be expressed as the corrected sum of squares (SS) as follows: TSS = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Yji − ȳ)², where ȳ is the mean of the overall sample. On the other hand, variation due to the factor is given by

SST = Σ_{j=1}^{k} nj (ȳj − ȳ)²,   (2)

where ȳj is the mean from the jth group. The variation due to chance (error) is then calculated as SSE (error sum of squares) = TSS − SST. The component variations are usually presented in a table with corresponding degrees of freedom (df), mean square error, and F statistic. A table for one-way ANOVA is shown in Table 3.
Table 3   General ANOVA Table for One-Way ANOVA (k populations)

  Source    df      SS     MS                  F
  Between   k − 1   SST    MST = SST/(k − 1)   MST/MSE
  Within    n − k   SSE    MSE = SSE/(n − k)
  Total     n − 1   TSS

Note: n = sample size; k = number of groups; SST = sum of squares treatment (factor); MST = mean square treatment (factor); SSE = sum of squares error; TSS = total sum of squares.

For a given level of significance α, the null hypothesis H0 would be rejected and one could conclude that k population means are not all equal if

F ≥ Fk−1, n−k, 1−α   (3)

where Fk−1, n−k, 1−α is the 100(1 − α)% point of F distribution with k − 1 and n − k df.

Two-Way ANOVA

Two-way ANOVA is used to assess the effects of two factors and their interaction on a single response variable. There are three cases to be considered: the fixed-effects case, in which both factors are fixed; the random-effects case, in which both factors are random; and the mixed-effects case, in which one factor is fixed and the other factor is random. Two-way ANOVA is applied to answer the question of whether Factor A has a significant effect on the response adjusted for Factor B, whether Factor B has a significant effect on the response adjusted for Factor A, or whether there is an interaction effect between Factor A and Factor B.

All null hypotheses can be written as

1. H01: There is no Factor A effect.
2. H02: There is no Factor B effect.
3. H03: There is no interaction effect between Factor A and Factor B.

The ANOVA table for two-way ANOVA is shown in Table 4.

Table 4   General Two-Way ANOVA Table

  Source                              df               SS     MS                             F (Fixed)    F (Mixed or Random)
  Factor A (main effect)              r − 1            SSR    MSR = SSR/(r − 1)              MSR/MSE      MSR/MSRC
  Factor B (main effect)              c − 1            SSC    MSC = SSC/(c − 1)              MSC/MSE      MSC/MSRC
  Factor A × Factor B (interaction)   (r − 1)(c − 1)   SSRC   MSRC = SSRC/[(r − 1)(c − 1)]   MSRC/MSE     MSRC/MSE
  Error                               rc(n − 1)        SSE    MSE = SSE/[rc(n − 1)]
  Total                               rcn − 1          TSS

Note: r = number of groups for A; c = number of groups for B; SSR = sum of squares for Factor A; MSR = mean sum of squares for Factor A; MSRC = mean sum of squares for the Interaction A × B; SSC = sum of squares for Factor B; MSC = mean square for Factor B; SSRC = sum of squares for the Interaction A × B; SSE = sum of squares error; TSS = total sum of squares.

In the fixed case, for a given α, the null hypothesis H01 would be rejected, and one could conclude that there is a significant effect of Factor A if

F(Factor A) ≥ Fr−1, rc(n−1), 1−α,   (4)

where Fr−1, rc(n−1), 1−α is the 100(1 − α)% point of F distribution with r − 1 and rc(n − 1) df.

The null hypothesis H02 would be rejected, and one could conclude that there is a significant effect of Factor B if

F(Factor B) ≥ Fc−1, rc(n−1), 1−α   (5)
where Fc−1, rc(n−1), 1−α is the 100(1 − α)% point of F distribution with c − 1 and rc(n − 1) df.

The null hypothesis H03 would be rejected, and one could conclude that there is a significant effect of interaction between Factor A and Factor B if

F(Factor A × Factor B) ≥ F(r−1)(c−1), rc(n−1), 1−α,   (6)

where F(r−1)(c−1), rc(n−1), 1−α is the 100(1 − α)% point of F distribution with (r − 1)(c − 1) and rc(n − 1) df.

It is similar in the random case, except for different F statistics and different df for the denominator for testing H01 and H02.
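Before turning to specific packages, note that the fixed-effects tests of Equations 4 through 6 can be obtained from any general linear model routine. The following minimal sketch is illustrative only; the data file and the factor names a and b are assumptions made for the example.

```python
# Fixed-effects two-way ANOVA with interaction (hypothetical data and names).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("twoway.csv")            # hypothetical balanced data set
model = smf.ols("y ~ C(a) * C(b)", data=df).fit()
print(anova_lm(model))                    # rows for C(a), C(b), C(a):C(b), Residual
```

For balanced data the resulting table gives the same F ratios as the fixed column of Table 4; unbalanced data require more care about the type of sums of squares used.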
Statistical Packages

SAS procedure "PROC ANOVA" performs ANOVA for balanced data from a wide variety of experimental designs. The "anova" command in STATA fits ANOVA and analysis of covariance (ANCOVA) models for balanced and unbalanced designs, including designs with missing cells; models for repeated measures ANOVA; and models for factorial, nested, or mixed designs. The "anova" function in S-PLUS produces a table with rows corresponding to each of the terms in the object, plus an additional row for the residuals. When two or more objects are used in the call, a similar table is produced showing the effects of the pairwise differences between the models, considered sequentially from the first to the last. SPSS (an IBM company, formerly called PASW® Statistics) provides a range of ANOVA options, including automated follow-up comparisons and calculations of effect size estimates.

Abdus S. Wahed and Xinyu Tang

See also Analysis of Covariance (ANCOVA); Repeated Measures Design; t Test, Independent Samples

Further Readings

Bogartz, R. S. (1994). An introduction to the analysis of variance. Westport, CT: Praeger.
Fisher, R. A. (1938). Statistical methods for research workers (7th ed.). Edinburgh, Scotland: Oliver and Boyd.
Kleinbaum, D. G., Kupper, L. L., & Muller, K. E. (1998). Applied regression analysis and other multivariable methods (3rd ed.). Pacific Grove, CA: Duxbury Press.
Lindman, H. R. (1992). Analysis of variance in experimental design. New York: Springer-Verlag.
Scheffé, H. (1999). Analysis of variance. New York: Wiley-Interscience.

ANIMAL RESEARCH

This entry reviews the five basic research designs available to investigators who study the behavior of nonhuman animals. Use of these experimental methods is considered historically, followed by a short review of the experimental method proper. Then, for each design, the discussion focuses on manipulation of the independent variable or variables, examples of testable hypotheses, sources of error and confounding, sources of variation within the design, and statistical analyses. The entry concludes with a section on choosing the appropriate research design. In addition, this entry addresses why it is important to choose a research design prior to collecting data, why certain designs are good for testing some hypotheses but not others, and how to choose a research design. This entry focuses on nonhuman animals, but the content generalizes directly to the study of behavior, either human or nonhuman.

The Experimental Method: Pathway to the Scientific Study of Nonhuman Animal Behavior

Through his dissertation written in 1898, Edward L. Thorndike initiated the controlled experimental analysis of nonhuman animal behavior. His use of the experimental method provided researchers interested in the evolution of intelligence, learning and memory, and mental continuity the opportunity to determine the causes of behavior. With the publication of his work in 1911 and the plea for objective methods of science by C. Lloyd Morgan, John B. Watson, and others, the use of anecdotal methods, so prevalent in that time, virtually came to an end. Thorndike's work helped establish psychology as first and foremost a science, and later a profession.

The experimental method requires at minimum two groups: the experimental group and the control group. Subjects (nonhuman animals) or participants (human animals) in the experimental group receive the treatment, and subjects or participants in the control group do not. All other variables are held constant or eliminated. When conducted correctly and carefully, the experimental method can determine cause-and-effect relationships. It is the only method that can.

Research Designs With One Factor

Completely Randomized Design

The completely randomized design is characterized by one independent variable in which subjects receive only one level of treatment. Subjects or participants are randomly drawn from a larger population, and then they are randomly assigned to one level of treatment. All other variables are held constant, counterbalanced, or eliminated. Typically, the restriction of equal numbers of subjects in each group is required. Independent variables in which subjects experience only one level are called between-subjects variables, and their use is widespread in the animal literature. Testable hypotheses include the following: What dosage of drug has the greatest effect on reducing seizures in rats? Which of five commercial diets for shrimp leads to the fastest growth? Does experience influence egg-laying sites in apple snails? Which of four methods of behavioral enrichment decreases abnormal behavior in captive chimpanzees the most?

The completely randomized design is chosen when carryover effects are of concern. Carryover effects are one form of sequence effects and result when the effect of one treatment level carries over into the next condition. For example, behavioral neuroscientists often lesion or ablate brain tissue to assess its role in behavioral systems including reproduction, sleep, emotion, learning, and memory. In these studies, carryover effects are almost guaranteed. Requiring all subjects to proceed through the control group first and then the experimental group is not an option. In cases in which subjects experience treatment levels in the same order, performance changes could result through practice or boredom or fatigue on the second or third or fourth time the animals experience the task. These so-called order effects comprise the second form of sequence effects and provide a confound wherein the experimenter does not know whether treatment effects or order effects caused the change in the dependent or response variable. Counterbalancing the order in which subjects receive the treatments can eliminate order effects, but in lesion studies, this is not possible. It is interesting that counterbalancing will not eliminate carryover effects. However, such effects are often, but not always, eliminated when the experimenter increases the time between conditions.

Carryover effects are not limited to a single experiment. Animal cognition experts studying sea lions, dolphins, chimpanzees, pigeons, or even gray parrots often use their subjects in multiple experiments, stretching over years. While the practice is not ideal, the cost of acquiring and maintaining the animal over its life span dictates it. Such effects can be reduced if subjects are used in experiments that differ greatly or if long periods of time have elapsed between studies. In some instances, researchers take advantage of carryover effects. Animals that are trained over long periods to perform complex tasks will often be used in an extended series of related experiments that build on this training.

Data from the completely randomized design can be statistically analyzed with parametric or nonparametric statistical tests. If the assumptions of a parametric test are met, and there are only two levels of treatment, data are analyzed with an independent t test. For three or more groups, data are analyzed using the analysis of variance (ANOVA) with one between-subjects factor. Sources of variance include treatment variance, which includes both treatment and error variance, and error variance by itself. The F test is Treatment + Error Variance divided by Error Variance. F scores greater than 1 indicate the presence of treatment variability. Because a significant F score tells the experimenter only that there is at least one significant difference, post hoc tests are required to determine where the differences lie. Depending on the experimenter's need, several post hoc tests, including a priori and a posteriori, are available. If the assumptions of a parametric test are of concern, the appropriate nonparametric test is the Mann–Whitney U test for two-group designs and the Kruskal–Wallis test if three or more groups are used. Mann–Whitney U tests provide post hoc analyses.
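As a rough illustration of these analyses (the growth scores below are invented, not data from an actual study), both the parametric and the nonparametric test for a three-group completely randomized design can be computed as follows:

```python
# One-way ANOVA and its nonparametric alternative for three hypothetical diet groups.
from scipy import stats

diet_a = [41, 38, 44, 40, 39]   # invented growth scores, one list per group
diet_b = [45, 47, 43, 46, 44]
diet_c = [39, 36, 40, 38, 37]

f_stat, p_anova = stats.f_oneway(diet_a, diet_b, diet_c)   # parametric test
h_stat, p_kw = stats.kruskal(diet_a, diet_b, diet_c)       # Kruskal-Wallis test
print(p_anova, p_kw)
```

A significant result from either test would still have to be followed by the post hoc comparisons described above to locate the specific group differences.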
Repeated Measures Design

In the repeated measures or within-subjects design, subjects experience all levels of the treatment variable, which in this case is known as the within-subjects variable. Subjects are randomly drawn from a larger population and exposed to all treatment conditions in orders that have been counterbalanced. All other variables have been held constant, counterbalanced, or eliminated. Examples of testable hypotheses could include the following: Is learning the reversal of a two-choice discrimination problem more difficult for a goldfish than learning the original problem? Does time of day affect the efficacy of a drug? Do bobcats choose prey on the basis of the population density of their preferred prey or the population density of their nonpreferred prey?

The repeated measures design has the major advantage of reducing the number of subjects required. For two-group designs, the number of subjects required is halved. With three or more groups, the savings are even greater. When working with animal species that are difficult to obtain, difficult to maintain, endangered, threatened, or just very expensive, this advantage is critical. A second advantage comes from having subjects experience all levels of treatment. Statistically, variability attributable to subjects can be removed from the error term. The result is that the F score is determined when the experimenter divides the treatment variance by residual error variance. In situations in which some subjects perform consistently at high rates of response whereas others perform consistently at low rates of response, subject variability will be high. Under these conditions, the repeated measures design is more powerful than the between-subjects design.

For example, this appears to be the case with goldfish. When goldfish are trained to strike a target for a food reward, some subjects respond at very high rates, whereas others hit the target at reduced rates. It is interesting that even in light of treatments that increase or decrease rates of response, high responders stay relatively high and low responders stay relatively low. The result is a great deal of variability that remains as error variance in a between-subjects design but that is removed from the error term in a repeated measures design. In essence, where there is considerable response bias, experimenters can use a repeated measures design to reduce the probability of making a Type II error, that is, accepting the null hypothesis as true when it is false. In the absence of response bias, the advantage of the within-subjects design is lost because the degrees of freedom are smaller relative to the between-subjects design. The important point here is that choice of research design should take into consideration the subject's traits.

The repeated measures design is subject to both carryover effects and order effects. As mentioned, counterbalancing the order that subjects receive the different levels of treatment controls for order effects. Complete counterbalancing requires that each condition both precedes and follows all other conditions an equal number of times. The equation n(n − 1), where n equals the number of conditions, determines the number of possible orders. For example, with three treatment levels, six different orders are possible (ABC, BCA, CAB, ACB, BAC, and CBA). When the number of treatment levels increases to the point where complete counterbalancing is not possible, experimenters use incomplete counterbalancing. Here, only the requirement that each condition precedes all other conditions an equal number of times is followed.

Order effects can also be eliminated with a procedure commonly used in the operant conditioning literature. In the ABA design, subjects experience the control condition (A), then the experimental condition (B), then the control condition (A) again. For example, imagine a pigeon has been trained to peck a key for its favorite food. In the experiment, the pigeon first experiences a 15-minute period in which each 20th response is reinforced with 4-second access to grain (control condition). Then the pigeon receives a 15-minute period during which access time to grain differs but averages 4 seconds and the number of responses required for food also differs but averages to 20. Finally, the pigeon receives a third 15-minute period identical to the first. In this example, response rates increased dramatically during the treatment condition and then returned to the level of the first condition during the third period. When response rates return to the level observed in Condition 1, operant researchers call that recapturing baseline. It is often the most difficult part of operant research. Note that if the subject does not return to baseline in the third condition, the effect could
be attributable to carryover effects, order effects, or treatment effects. If the subject does return to baseline, the effect is due to the treatment and not due to sequence effects. Note also that this experiment is carried out in Las Vegas every day, except that Conditions 1 and 3 are never employed.

Data from the repeated measures design are analyzed with a dependent t test if two groups are used and with the one-way repeated measures ANOVA or randomized-block design if more than two treatment levels are used. Sources of variance include treatment variance, subject variance, and residual variance, and as mentioned, treatment variance is divided by residual variance to obtain the F score. Post hoc tests determine where the differences lie. The two-group nonparametric alternative is Wilcoxon's signed ranks test. With more than two groups, the Friedman test is chosen and Wilcoxon's signed ranks tests serve as post hoc tests.

Research Designs With Two or More Factors

Completely Randomized Factorial Design

Determining the causes of nonhuman animal behavior often requires manipulation of two or more independent variables simultaneously. Such studies are possible using factorial research designs. In the completely randomized factorial design, subjects experience only one level of each independent variable. For example, parks and wildlife managers might ask whether male or female grizzly bears are more aggressive in spring, when leaving their dens after a long period of inactivity, or in fall, when they are preparing to enter their dens. In this design, bears would be randomly chosen from a larger population and randomly assigned to one of four combinations (female fall, female spring, male fall, or male spring). The completely randomized design is used when sequence effects are of concern or if it is unlikely that the same subjects will be available for all conditions. Examples of testable hypotheses include the following: Do housing conditions of rhesus monkeys bred for medical research affect the efficacy of antianxiety drugs? Do different dosages of a newly developed pain medication affect females and males equally? Does spatial memory ability in three different species of mountain jays (corvids) change during the year? Does learning ability differ between predators and prey in marine and freshwater environments?

Besides the obvious advantage of reducing expenses by testing two or more variables at the same time, the completely randomized design can determine whether the independent variables interact to produce a third variable. Such interactions can lead to discoveries that would have been missed with a single-factor design. Consider the following example: An animal behaviorist, working for a pet food company, is asked to determine whether a new diet is good for all ages and all breeds of canines. With limited resources of time, housing, and finances, the behaviorist decides to test puppies and adults from a small breed of dogs and puppies and adults from a large breed. Twelve healthy animals for each condition (small-breed puppy, small-breed adult, large-breed puppy, and large-breed adult) are obtained from pet suppliers. All dogs are acclimated to their new surrounds for a month and then maintained on the diet for 6 months. At the end of 6 months, an index of body mass (BMI) is calculated for each dog and is subtracted from the ideal BMI determined for these breeds and ages by a group of veterinarians from the American Veterinary Medical Association. Scores of zero indicate that the diet is ideal. Negative scores reflect dogs that are gaining too much weight and becoming obese. Positive scores reflect dogs that are becoming underweight.

The results of the analysis reveal no significant effect of Age, no significant effect of Breed, but a very strong interaction. The CEO is ecstatic, believing that the absence of main effects means that the BMI scores do not differ from zero and that the new dog food is great for all dogs. However, a graph of the interaction reveals otherwise. Puppies from the small breed have very low BMI scores, indicating severely low weight. Ironically, puppies from the large breed are extremely obese, as are small-breed adult dogs. However, large-breed adults are dangerously underweight. In essence, the interaction reveals that the new diet affects small and large breeds differentially depending on age, with the outcome that the new diet would be lethal for both breeds at both age levels, but for different reasons. The lesson here is that when main effects are computed, the mean for the treatment level of one variable contains the
scores from all levels of the other treatment. The experimenter must examine the interaction effect carefully to determine whether there were main effects that were disguised by the interaction or whether there were simply no main effects.

In the end, the pet food company folded, but valuable design and statistical lessons were learned. First, in factorial designs there is always the potential for the independent variables to interact, producing a third treatment effect. Second, a significant interaction means that the independent variables affect each other differentially and that the main effects observed are confounded by the interaction. Consequently, the focus of the analysis must be on the interaction and not on the main effects. Finally, the interaction can be an expected outcome, a neutral event, or a complete surprise. For example, as discussed in the split-plot design, to demonstrate that learning has occurred, a significant interaction is required. On the surprising side, on occasion, when neuroscientists have administered two drugs at the same time, or lesioned two brain sites at the same time, the variables interacted to produce effects that relieved symptoms better than either drug alone, or the combination of lesions produced an effect never before observed.

Statistically, data from the completely randomized factorial design are analyzed with a two-way ANOVA with both factors between-subject variables. Sources of variance include treatment variability for each independent variable and each unique combination of independent variables and error variance. Error variance is estimated by adding the within-group variability for each AB cell. A single error term is used to test all treatment effects and the interaction. Post hoc tests determine where the differences lie in the main effects, and simple main effects tests are used to clarify the source of a significant interaction. Nonparametric statistical analyses for factorial designs are not routinely available on statistical packages. One is advised to check the primary literature if a nonparametric between-subjects factorial test is required.
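For illustration, the two-way between-subjects factorial ANOVA described above can be computed with standard software. The following is a minimal Python sketch that assumes the pandas and statsmodels libraries and hypothetical file and column names patterned on the pet food example (a BMI score and the two between-subjects factors Age and Breed).

```python
# Two-way between-subjects (completely randomized factorial) ANOVA sketch.
# File name and column names are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("dog_food_study.csv")  # one row per subject: bmi, age, breed

# Treatment variability for each factor, their unique combination (the
# interaction), and a single error term are estimated from this model.
model = ols("bmi ~ C(age) * C(breed)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests for age, breed, age x breed
```

A significant age × breed term would then be followed up with simple main effects tests rather than with an interpretation of the main effects alone.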
The Split-Plot Design

The split-plot or mixed design is used extensively to study animal learning and behavior. In its simplest form, the design has one between-subjects factor and one within-subjects factor. Subjects are randomly drawn from a larger population and assigned randomly to one level of the between-subjects variable. Subjects experience all levels of the within-subjects variable. The order in which subjects receive the treatment levels of the within-subjects variable is counterbalanced unless the experimenter wants to examine change over time or repeated exposures to the treatment. The split-plot design is particularly useful for studying the effects of treatment over time or exposure to treatment. Often, order of treatment cannot be counterbalanced in the within-subject factor. In these cases, order effects are expected. Examples of testable hypotheses include the following: Can crabs learn to associate the presence of polluted sand with illness? Can cephalopods solve a multiple T-maze faster than salmonids? Does tolerance to pain medication differ between males and females? Does maturation of the hippocampus in the brains of rats affect onset of the paradoxical effects of reward?

In experiments in which learning is the focus, order effects are confounded with Trials, the within-subjects treatment, and an interaction is expected. For example, in a typical classical conditioning experiment, subjects are randomly assigned to the paired group or the unpaired group. Subjects in the paired group receive paired presentations of a stimulus (light, tone, etc.) and a second stimulus that causes a response (meat powder, light foot shock). Subjects in the unpaired group receive presentations of both stimuli, but they are never paired. Group (paired/unpaired) serves as the between-subjects factor, and Trials (1–60) serves as the within-subjects factor. Evidence of learning is obtained when the number of correct responses increases as a function of trials for the paired group, but not for the unpaired group. Thus, the interaction, and not the main effects, is critical in experiments on learning.

Statistical analysis of the split-plot or mixed design is accomplished with the mixed design ANOVA, with one or more between factors and one or more within factors. With one between and one within factor, there are five sources of variance. These include treatment variance for both main effects and the interaction and two sources of error variance. The between factor is tested with error variability attributable to that factor. The within factor and the interaction are tested with residual error variance. Significant main effects are
further analyzed by post hoc tests, and the sources of the interaction are determined with simple main effects tests. Nonparametric tests are available in the primary literature.
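For illustration, a mixed-design (split-plot) ANOVA with one between-subjects factor and one within-subjects factor can be computed as in the following Python sketch, which assumes the pingouin library and hypothetical column names patterned on the classical conditioning example (Group as the between factor and trial block as the within factor).

```python
# Split-plot (mixed-design) ANOVA sketch; data are in long format with one
# row per subject per trial block. Names below are illustrative assumptions.
import pandas as pd
import pingouin as pg

df = pd.read_csv("conditioning_study.csv")  # subject, group, block, correct

aov = pg.mixed_anova(data=df, dv="correct", within="block",
                     subject="subject", between="group")
print(aov)  # rows for group, block, and the group x block interaction
```

In a learning experiment such as this, the group × block interaction, rather than either main effect, carries the evidence of learning.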
Randomized Block Factorial Design

In this design, subjects experience all levels of two or more independent variables. Such experiments can be difficult and time-consuming to conduct, to analyze, and to interpret. Sequence effects are much more likely and difficult to control. Carryover effects can result from the interaction as well as the main effects of treatment and consequently can be more difficult to detect or eliminate. Order effects can be controlled through counterbalancing, but the number of possible orders quickly escalates, as exemplified by the equation m(n)(n − 1), where m equals the number of treatment levels in the independent variable m, and n equals the number of treatment levels in n.

Because subjects experience all levels of all treatments, subject variability can be subtracted from the error term. This advantage, coupled with the need for fewer subjects and the ability to test for interactions, makes this design of value in learning experiments with exotic animals. The design also finds application in the behavioral neurosciences when the possibility of interactions between drugs presented simultaneously or sequentially needs to be assessed. The design can also assess the effects of maturation, imprinting, learning, or practice on important behavioral systems, including foraging, migration, navigation, habitat selection, choosing a mate, parental care, and so on. The following represent hypotheses that are testable with this design: Do nest site selection and nest building change depending on the success of last year's nest? Do successive operations to relieve herniated discs lead to more damage when coupled with physical therapy? Are food preferences in rhesus monkeys related to nutritional value, taste, or social learning from peers? Does predation pressure influence a prey's choice of diet more in times when food is scarce, or when food is abundant?

Data analysis is accomplished with a repeated measures factorial ANOVA. Seven sources of variance are computed: subject variance, treatment variance for both main effects and the interaction, and three error terms. Separate error terms are used to test each treatment effect and the interaction. Post hoc tests and simple main effects tests determine where the differences lie in the main effects and the interaction, respectively. Nonparametric analyses for this design can be found in the primary literature.
ment levels in n. tionary histories, their adaptability to laboratory
Because subjects experience all levels of all life, their tolerance of treatment, and their ability
treatments, subject variability can be subtracted to be trained. Knowledge of potential carryover
from the error term. This advantage, coupled with effects associated with the treatment and whether
the need for fewer subjects and the ability to test these effects are short lived or long lasting is also
for interactions, makes this design of value in important. Finally, the researcher needs to know
learning experiments with exotic animals. The how treatment variance and error variance are par-
design also finds application in the behavioral neu- titioned to take advantage of the traits of some ani-
rosciences when the possibility of interactions mals or to increase the feasibility of conducting the
between drugs presented simultaneously or experiment. In addition, the researcher needs to
sequentially needs to be assessed. The design can know how interaction effects can provide evidence
also assess the effects of maturation, imprinting, of change over time or lead to new discoveries.
learning, or practice on important behavioral sys- But there is more. Beginning-level researchers
tems, including foraging, migration, navigation, often make two critical mistakes when establishing
habitat selection, choosing a mate, parental care, a program of research. First, in their haste and
and so on. The following represent hypotheses that enthusiasm, they rush out and collect data and
are testable with this design: Do nest site selection then come back to the office not knowing how to
and nest building change depending on the success statistically analyze their data. When these same
of last year’s nest? Do successive operations to people take their data to a statistician, they soon
relieve herniated discs lead to more damage when learn a critical lesson: Never conduct an experi-
coupled with physical therapy? Are food prefer- ment and then attempt to fit the data to a particu-
ences in rhesus monkeys related to nutritional lar design. Choose the research design first, and
value, taste, or social learning from peers? Does then collect the data according to the rules of the
predation pressure influence a prey’s choice of diet design. Second, beginning-level researchers tend to
more in times when food is scarce, or when food is think that the more complex the design, the more
abundant? compelling the research. Complexity does not cor-
Data analysis is accomplished with a repeated relate positively with impact. Investigators should
measures factorial ANOVA. Seven sources of vari- opt for the simplest design that can answer the
ance are computed: subject variance, treatment question. The easier it is to interpret a results
section, the more likely it is that reviewers will understand and accept the findings and the conclusions. Simple designs are easier to conduct, analyze, interpret, and communicate to peers.

Jesse E. Purdy

See also Analysis of Variance (ANOVA); Confounding; Factorial Design; Nonparametric Statistics for the Behavioral Sciences; Parametric Statistics; Post Hoc Comparisons; Single-Subject Design

Further Readings

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Glover, T. J., & Mitchell, K. J. (2002). An introduction to biostatistics. New York: McGraw-Hill.
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Monterey, CA: Thomson/Wadsworth.
Kirk, R. E. (1995). Experimental design: Procedures for behavioral sciences (3rd ed.). Monterey, CA: Wadsworth.
Lehman, A., O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). JMP for basic univariate and multivariate statistics: A step-by-step guide. Cary, NC: SAS Institute.
Wasserman, L. (2005). All of nonparametric statistics. New York: Springer.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

APPLIED RESEARCH

Applied research is inquiry using the application of scientific methodology with the purpose of generating empirical observations to solve critical problems in society. It is widely used in varying contexts, ranging from applied behavior analysis to city planning and public policy and to program evaluation. Applied research can be executed through a diverse range of research strategies that can be solely quantitative, solely qualitative, or a mixed method research design that combines quantitative and qualitative data slices in the same project. What all the multiple facets in applied research projects share is one basic commonality—the practice of conducting research in "nonpure" research conditions because data are needed to help solve a real-life problem.

The most common way applied research is understood is by comparing it to basic research. Basic research—"pure" science—is grounded in the scientific method and focuses on the production of new knowledge and is not expected to have an immediate practical application. Although the distinctions between the two contexts are arguably somewhat artificial, researchers commonly identify four differences between applied research and basic research. Applied research differs from basic research in terms of purpose, context, validity, and methods (design).

Research Purpose

The purpose of applied research is to increase what is known about a problem with the goal of creating a better solution. This is in contrast to basic research, in which the primary purpose is to expand on what is known—knowledge—with little significant connections to contemporary problems. A simple contrast that shows how research purpose differentiates these two lines of investigation can be seen in applied behavior analysis and psychological research. Applied behavior analysis is a branch of psychology that generates empirical observations that focus at the level of the individual with the goal of developing effective interventions to solve specific problems. Psychology, on the other hand, conducts research to test theories or explain changing trends in certain populations.

The irrelevance of basic research to immediate problems may at times be overstated. In one form or another, observations generated in basic research eventually influence what we know about contemporary problems. Going back to the previous comparison, applied behavior investigators commonly integrate findings generated by cognitive psychologists—how people organize and analyze information—in explaining specific types of behaviors and identifying relevant courses of interventions to modify them. The question is, how much time needs to pass (5 months, 5 years, 50 years) in the practical application of research results in order for the research to be deemed basic research? In general, applied research observations are intended to be implemented in the first few years whereas basic researchers make no attempt to identify when their observations will be realized in everyday life.
Research Context

The point of origin at which a research project begins is commonly seen as the most significant difference between applied research and basic research. In applied research, the context of pressing issues marks the beginning in a line of investigation. Applied research usually begins when a client has a need for research to help solve a problem. The context the client operates in provides the direction the applied investigator takes in terms of developing the research questions. The client usually takes a commanding role in framing applied research questions. Applied research questions tend to be open ended because the client sees the investigation as being part of a larger context made up of multiple stakeholders who understand the problem from various perspectives.

Basic research begins with a research question that is grounded in theory or previous empirical investigations. The context driving basic research takes one of two paths: testing the accuracy of hypothesized relationships among identified variables or confirming existing knowledge from earlier studies. In both scenarios, the basic research investigator usually initiates the research project based on his or her ability to isolate observable variables and to control and monitor the environment in which they operate. Basic research questions are narrowly defined and are investigated with only one level of analysis: prove or disprove theory or confirm or not confirm earlier research conclusions.

The contrast in the different contexts between applied research and basic research is simply put by Jon S. Bailey and Mary R. Burch in their explanation of applied behavior research in relation to psychology. The contrast can be pictured like this: In applied behavior research, subjects walk in the door with unique family histories that are embedded in distinct communities. In basic research, subjects "come in packing crates from a breeding farm, the measurement equipment is readily available, the experimental protocols are already established, and the research questions are derivative" (p. 3).

Emphasis on Validity

The value of all research—applied and basic—is determined by its ability to address questions of internal and external validity. Questions of internal validity ask whether the investigator makes the correct observation on the causal relationship among identified variables. Questions of external validity ask whether the investigators appropriately generalize observations from their research project to relevant situations. A recognized distinction is that applied research values external validity more than basic research projects do. Assuming an applied research project adequately addresses questions of internal validity, its research conclusions are more closely assessed in how well they apply directly to solving problems.

Questions of internal validity play a more significant role in basic research. Basic research focuses on capturing, recording, and measuring causal relationships among identified variables. The application of basic research conclusions focuses more on their relevance to theory and the advancement of knowledge than on their generalizability to similar situations.

The difference between transportation planning and transportation engineering is one example of the different validity emphasis in applied research and basic research. Transportation planning is an applied research approach that is concerned with the siting of streets, highways, sidewalks, and public transportation to facilitate the efficient movement of goods and people. Transportation planning research is valued for its ability to answer questions of external validity and address transportation needs and solve traffic problems, such as congestion at a specific intersection. Traffic engineering is the basic research approach to studying function, design, and operation of transportation facilities and looks at the interrelationship of variables that create conditions for the inefficient movement of goods and people. Traffic engineering is valued more for its ability to answer questions of internal validity in correctly identifying the relationship among variables that can cause traffic and makes little attempt to solve specific traffic problems.

Research Design

Applied research projects are more likely to follow a triangulation research design than are basic research investigations. Triangulation is the research strategy that uses a combination of multiple data sets, multiple investigators, multiple
theories, and multiple methodologies to answer research questions. This is largely because of the context that facilitates the need for applied research. Client-driven applied research projects tend to need research that analyzes a problem from multiple perspectives in order to address the many constituents that may be impacted by the study. In addition, if applied research takes place in a less than ideal research environment, multiple data sets may be necessary in order for the applied investigator to generate a critical mass of observations to be able to make defensible conclusions about the problem at hand.

Basic research commonly adheres to a single-method, single-data-research strategy. The narrow focus in basic research requires the investigator to eliminate possible research variability (bias) to better isolate and observe changes in the studied variables. Increasing the number of types of data sets accessed and methods used to obtain them increases the possible risk of contaminating the basic research laboratory of observations.

Research design in transportation planning is much more multifaceted than research design in traffic engineering. This can be seen in how each approach would go about researching transportation for older people. Transportation planners would design a research strategy that would look at the needs of a specific community and assess several different data sets (including talking to the community) obtained through several different research methods to identify the best combination of interventions to achieve a desired outcome. Traffic engineers will develop a singular research protocol that focuses on total population demand in comparison with supply to determine unmet transportation demand of older people.

John Gaber

See also Planning Research; Scientific Method

Further Readings

Bailey, J. S., & Burch, M. R. (2002). Research methods in applied behavior analysis. Thousand Oaks, CA: Sage.
Bickman, L., & Rog, D. J. (1998). Handbook of applied social research methods. Thousand Oaks, CA: Sage.
Kimmel, J. A. (1988). Ethics and values in applied social research. Newbury Park, CA: Sage.

A PRIORI MONTE CARLO SIMULATION

An a priori Monte Carlo simulation is a special case of a Monte Carlo simulation that is used in the design of a research study, generally when analytic methods do not exist for the goal of interest for the specified model or are not convenient. A Monte Carlo simulation is generally used to evaluate empirical properties of some quantitative method by generating random data from a population with known properties, fitting a particular model to the generated data, collecting relevant information of interest, and replicating the entire procedure a large number of times (e.g., 10,000). In an a priori Monte Carlo simulation study, interest is generally in the effect of design factors on the inferences that can be made rather than a general attempt at describing the empirical properties of some quantitative method. Three common categories of design factors used in a priori Monte Carlo simulations are sample size, model misspecification, and unsatisfactory data conditions. As with Monte Carlo methods in general, the computational tediousness of a priori Monte Carlo simulation methods essentially requires one or more computers because of the large number of replications and thus the heavy computational load. Computational loads can be very great when the a priori Monte Carlo simulation is implemented for methods that are themselves computationally tedious (e.g., bootstrap, multilevel models, and Markov chain Monte Carlo methods).

For an example of when an a priori Monte Carlo simulation study would be useful, Ken Kelley and Scott Maxwell have discussed sample size planning for multiple regression when interest is in sufficiently narrow confidence intervals for standardized regression coefficients (i.e., the accuracy-in-parameter-estimation approach to sample size planning). Confidence intervals based on noncentral t distributions should be used for standardized regression coefficients. Currently, there is no analytic way to plan for the sample size so that the computed interval will be no larger than desired some specified percent of the time. However, Kelley and Maxwell suggested an a priori Monte Carlo simulation
procedure when random data from the situation of interest are generated and a systematic search (e.g., a sequence) of different sample sizes is used until the minimum sample size is found at which the specified goal is satisfied.
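The general logic of such a search can be illustrated with a short simulation. The following Python sketch is a simplified example (planning the sample size so that the 95% confidence interval for a single mean is sufficiently narrow in a specified percentage of replications); it is not the Kelley and Maxwell procedure for standardized regression coefficients, and all numerical settings are assumptions.

```python
# A priori Monte Carlo sketch of a sample size search: for each candidate n,
# generate many data sets from an assumed population, compute the quantity
# of interest (here, the width of a 95% CI for the mean), and keep the
# smallest n for which the goal is met with the desired assurance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
target_width, assurance, reps = 0.5, 0.90, 10_000
mu, sigma = 0.0, 1.0  # assumed population with known properties

def ci_widths(n):
    x = rng.normal(mu, sigma, size=(reps, n))
    se = x.std(axis=1, ddof=1) / np.sqrt(n)
    return 2 * stats.t.ppf(0.975, n - 1) * se  # full width of the 95% CI

n = 5
while np.mean(ci_widths(n) <= target_width) < assurance:
    n += 1  # systematic search over candidate sample sizes
print("Smallest n meeting the goal:", n)
```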
As another example of when an application of a Monte Carlo simulation study would be useful, Linda Muthén and Bengt Muthén have discussed a general approach to planning appropriate sample size in a confirmatory factor analysis and structural equation modeling context by using an a priori Monte Carlo simulation study. In addition to models in which all the assumptions are satisfied, Muthén and Muthén suggested sample size planning using a priori Monte Carlo simulation methods when data are missing and when data are not normal—two conditions most sample size planning methods do not address.

Even when analytic methods do exist for designing studies, sensitivity analyses can be implemented within an a priori Monte Carlo simulation framework. Sensitivity analyses in an a priori Monte Carlo simulation study allow the effect of misspecified parameters, misspecified models, and/or the validity of the assumptions on which the method is based to be evaluated. The generality of the a priori Monte Carlo simulation studies is its biggest advantage. As Maxwell, Kelley, and Joseph Rausch have stated, "Sample size can be planned for any research goal, on any statistical technique, in any situation with an a priori Monte Carlo simulation study" (2008, p. 553).

Ken Kelley

See also Accuracy in Parameter Estimation; Monte Carlo Simulation; Power Analysis; Sample Size Planning

Further Readings

Kelley, K. (2007). Sample size planning for the coefficient of variation: Accuracy in parameter estimation via narrow confidence intervals. Behavior Research Methods, 39(4), 755–766.
Kelley, K., & Maxwell, S. E. (2008). Power and accuracy for omnibus and targeted effects: Issues of sample size planning with applications to multiple regression. In P. Alasuutari, L. Bickman, & J. Brannen (Eds.), The SAGE handbook of social research methods (pp. 166–192). Thousand Oaks, CA: Sage.
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563.
Muthén, L., & Muthén, B. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling, 9(4), 599–620.

APTITUDES AND INSTRUCTIONAL METHODS

Research on the interaction between student characteristics and instructional methods is important because it is commonly assumed that different students learn in different ways. That assumption is best studied by investigating the interaction between student characteristics and different instructional methods. The study of that interaction received its greatest impetus with the publication of Lee Cronbach and Richard Snow's Aptitudes and Instructional Methods in 1977, which summarized research on the interaction between aptitudes and instructional treatments, subsequently abbreviated as ATI research. Cronbach and Snow indicated that the term aptitude, rather than referring exclusively to cognitive constructs, as had previously been the case, was intended to refer to any student characteristic. Cronbach stimulated research in this area in earlier publications suggesting that ATI research was an ideal meeting point between the usually distinct research traditions of correlational and experimental psychology. Before the 1977 publication of Aptitudes and Instructional Methods, ATI research was spurred by Cronbach and Snow's technical report summarizing the results of such studies, which was expanded in 1977 with the publication of the volume.

Background

When asked about the effectiveness of different treatments, educational researchers often respond that "it depends" on the type of student exposed to the treatment, implying that the treatment interacted with some student characteristic. Two types of interactions are important in ATI research: ordinal and disordinal, as shown in Figure 1. In ordinal interactions (top two lines in Figure 1), one treatment yields superior outcomes at all levels of the student characteristic, though the difference between the
outcomes is greater at one part of the distribution than elsewhere. In disordinal interactions (the bottom two lines in Figure 1), one treatment is superior at one point of the student distribution while the other treatment is superior for students falling at another point. The slope difference in ordinal interactions indicates that ultimately they are also likely to be disordinal, that is, the lines will cross at a further point of the student characteristic distribution than observed in the present sample.

[Figure 1: Ordinal and Disordinal Interactions. Outcome (vertical axis) plotted against Student Characteristic (horizontal axis).]

Research Design

ATI studies typically provide a segment of instruction by two or more instructional methods that are expected to be optimal for students with different characteristics. Ideally, research findings or some strong theoretical basis should exist that leads to expectations of differential effectiveness of the instruction for students with different characteristics. Assignment to instructional method may be entirely random or random within categories of the student characteristic. For example, students may be randomly assigned to a set of instructional methods and their anxiety then determined by some measure or experimental procedure. Or, in quasi-experimental designs, high- and low-anxiety students may be determined first and then—within the high- and low-anxiety groups—assignment to instructional methods should be random.

ATI research was traditionally analyzed with analysis of variance (ANOVA). The simplest ATI design conforms to a 2 × 2 ANOVA, with two treatment groups and two groups (high and low) on the student characteristic. In such studies, main effects were not necessarily expected for either the treatment or the student characteristic, but the interaction between them is the result of greatest interest.

Cronbach and Snow pointed out that in ANOVA designs, the student characteristic examined was usually available as a continuous score that had at least ordinal characteristics, and the research groups were developed by splitting the student characteristic distribution at some point to create groups (high and low; high, medium, and low; etc.). Such division into groups ignored student differences within each group and reduced the available variance by an estimated 34%. Cronbach and Snow recommended that research employ multiple linear regression analysis in which the treatments would be represented by so-called dummy variables and the student characteristic could be analyzed as a continuous score. It should also be noted, however, that when the research sample is at extreme ends of the distribution (e.g., one standard deviation above or below the mean), the use of ANOVA maximizes the possibility of finding differences between the groups.
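The regression approach recommended by Cronbach and Snow can be sketched as follows; the Python code assumes the statsmodels library and hypothetical variable names (a continuous outcome, a treatment dummy coded 0/1, and a continuous aptitude score). The product term carries the test of the aptitude-treatment interaction.

```python
# Regression test of an aptitude-treatment interaction with a dummy-coded
# treatment and a continuous student characteristic. Names are illustrative.
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("ati_study.csv")  # outcome, treatment (0/1), aptitude

model = ols("outcome ~ C(treatment) * aptitude", data=df).fit()
print(model.summary())
# A significant treatment x aptitude coefficient means the regression slopes
# differ across treatments; slopes that cross within the observed range of
# aptitude indicate a disordinal interaction.
```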
ATI Research Review

Reviews of ATI research reported few replicated interactions. Among the many reasons for these inconsistent findings were vague descriptions of the instructional treatments and sketchy relationships between the student characteristic and the instruction. Perhaps the most fundamental reason for the inconsistent findings was the inability to identify the cognitive processes required by the instructional treatments and engaged by the student characteristic. Slava Kalyuga, Paul Ayres, Paul Chandler, and John Sweller demonstrated that when the cognitive processes involved in instruction have been clarified, more consistent ATI findings have been reported and replicated.

Later reviews of ATI research, such as those by J. E. Gustaffson and J. O. Undheim or by Sigmund Tobias, reported consistent findings for Tobias's general hypothesis that students with limited knowledge of a domain needed instructional
support, that is, assistance to the learner, whereas more knowledgeable students could succeed without it. The greater consistency of interactions involving prior knowledge as the student characteristic may be attributable to some attributes of such knowledge. Unlike other student characteristics, prior domain knowledge contains the cognitive processes to be used in the learning of that material. In addition, the prior knowledge measure is likely to have been obtained in a situation fairly similar to the one present during instruction, thus also contributing any variance attributable to situativity to the results.

Sigmund Tobias

See also Analysis of Covariance (ANCOVA); Interaction; Reactive Arrangements

Further Readings

Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cronbach, L. J., & Snow, R. E. (1969). Individual differences and learning ability as a function of instructional variables (Contract No. OEC 4-6-061269-1217). Stanford, CA: Stanford University, School of Education.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods. New York: Irvington Press.
Gustaffson, J., & Undheim, J. O. (1996). Individual differences in cognitive functions. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 186–242). New York: Macmillan Reference.
Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational Psychologist, 38, 23–31.
Tobias, S. (1976). Achievement treatment interactions. Review of Educational Research, 46, 61–74.
Tobias, S. (1982). When do instructional methods make a difference? Educational Researcher, 11(4), 4–9.
Tobias, S. (1989). Another look at research on the adaptation of instruction to student characteristics. Educational Psychologist, 24, 213–227.
Tobias, S. (2009). An eclectic appraisal of the success or failure of constructivist instruction. In S. Tobias & T. M. Duffy (Eds.), Constructivist theory applied to education: Success or failure? (pp. 335–350). New York: Routledge.

APTITUDE-TREATMENT INTERACTION

There are countless illustrations in the social sciences of a description of a phenomenon existing for many years before it is labeled and systematized as a scientific concept. One such example is in Book II of Homer's Iliad, which presents an interesting account of the influence exerted by Agamemnon, king of Argos and commander of the Greeks in the Trojan War, on his army. In particular, Homer describes the behavior of Odysseus, a legendary king of Ithaca, and the behavior of Thersites, a commoner and rank-and-file soldier, as contrasting responses to Agamemnon's leadership and role as "the shepherd of the people." Odysseus, Homer says, is "brilliant," having "done excellent things by thousands," while he describes Thersites as one "who knew within his head many words, but disorderly," and "this thrower of words, this braggart." Where the former admires the leadership of Agamemnon, accepts his code of honor, and responds to his request to keep the siege of Troy, the latter accuses Agamemnon of greed and promiscuity and demands a return to Sparta.

The observation that an intervention—educational, training, therapeutic, or organizational—when delivered the same way to different people might result in differentiated outcomes was made long ago, as early as the eighth century BCE, as exemplified by Homer. In attempts to comprehend and explain this observation, researchers and practitioners have focused primarily on the concept of individual differences, looking for main effects that are attributable to concepts such as ability, personality, motivation, or attitude. When these inquiries started, early in the 20th century, not many parallel interventions were available. In short, the assumption at the time was that a student (a trainee in a workplace, a client in a clinical setting, or a soldier on a battlefield) possessed specific characteristics, such as Charles Spearman's g factor of intelligence, that could predict his or her success or failure in a training situation. However, this attempt to explain the success of an intervention by the characteristics of the intervenee was challenged by the appearance of multiple parallel interventions aimed at arriving at the same desired goal by
employing various strategies and tactics. It turned out that there were no ubiquitous collections of individual characteristics that would always result in success in a situation. Moreover, as systems of intervention in education, work training in industry, and clinical fields developed, it became apparent that different interventions, although they might be focused on the same target (e.g., teaching children to read, training bank tellers to operate their stations, helping a client overcome depression, or preparing soldiers for combat), clearly worked differently for different people. It was then suggested that the presence of differential outcomes of the same intervention could be explained by aptitude-treatment interaction (ATI, sometimes also abbreviated as AxT), a concept that was introduced by Lee Cronbach in the second part of the 20th century.

ATI methodology was developed to coaccount both for the individual characteristics of the intervenee and the variations in the interventions while assessing the extent to which alternative forms of interventions might have differential outcomes as a function of the individual characteristics of the person to whom the intervention is being delivered. In other words, investigations of ATI have been designed to determine whether particular treatments can be selected or modified to optimally serve individuals possessing particular characteristics (i.e., ability, personality, motivation). Today, ATI is discussed in three different ways: as a concept, as a method for assessing interactions among person and situation variables, and as a framework for theories of aptitude and treatment.

ATI as a Concept

ATI as a concept refers to both an outcome and a predictor of that outcome. Understanding these facets of ATI requires decomposing the holistic concept into its three components—treatment, aptitude, and the interaction between them. The term treatment is used to capture any type of manipulation aimed at changing something. Thus, with regard to ATI, treatment can refer to a specific educational intervention (e.g., the teaching of equivalent fractions) or conceptual pedagogical framework (e.g., Waldorf pedagogy), a particular training (e.g., job-related activity, such as mastering a new piece of equipment at a work place) or self-teaching (e.g., mastering a new skill such as typing), a clinical manipulation (e.g., a session of massage) or long-term therapy (e.g., psychoanalysis), or inspiring a soldier to fight a particular battle (e.g., issuing an order) or preparing troops to use new strategies of war (e.g., fighting insurgency). Aptitude is used to signify any systematic measurable dimension of individual differences (or a combination of such) that is related to a particular treatment outcome. In other words, aptitude does not necessarily mean a level of general cognitive ability or intelligence; it can capture specific personality traits or transient psychological states. The most frequently studied aptitudes of ATI are in the categories of cognition, conation, and affection, but aptitudes are not limited to these three categories. Finally, interaction demarcates the degree to which the results of two or more interventions will differ for people who differ in one or more aptitudes. Of note is that interaction here is defined statistically and that both intervention and aptitude can be captured by qualitative or quantitative variables (observed, measured, self-reported, or derived). Also of note is that, being a statistical concept, ATI behaves just as any statistical interaction does. Most important, it can be detected only when studies are adequately powered. Moreover, it acknowledges and requires the presence of main effects of the aptitude (it has to be a characteristic that matters for a particular outcome, e.g., general cognitive ability rather than shoe size for predicting a response to educational intervention) and the intervention (it has to be an effective treatment that is directly related to an outcome, e.g., teaching a concept rather than just giving students candy). This statistical aspect of ATI is important for differentiating it from what is referred to by the ATI developers and proponents as transaction. Transaction signifies the way in which ATI is constructed, the environment and the process in which ATI emerges; in other words, ATI is always a statistical result of a transaction through which a person possessing certain aptitudes experiences a certain treatment. ATI as an outcome identifies combinations of treatments and aptitudes that generate a significant change or a larger change compared with other combinations. ATI as a predictor points to which treatment or treatments are more likely to generate
significant or larger change for a particular individual or individuals.

ATI as a Method

ATI as a method permits the use of multiple experimental designs. The very premise of ATI is its capacity to combine correlational approaches (i.e., studies of individual differences) and experimental approaches (i.e., studies of interventional manipulations). Multiple paradigms have been developed to study ATI; many of them have been and continue to be applied in other, non-ATI, areas of interventional research. In classical accounts of ATI, the following designs are typically mentioned. In a simple standard randomized between-persons design, the outcome is investigated for persons who score at different levels of a particular aptitude when multiple, distinct interventions are compared. Having registered these differential outcomes, intervention selection is then carried out based on a particular level of aptitude to optimize the outcome. Within this design, often, when ATI is registered, it is helpful to carry out additional studies (e.g., case studies) to investigate the reason for the manifestation of ATI. The treatment revision design assumes the continuous adjustment of an intervention (or the creation of multiple parallel versions of it) in response to how persons with different levels of aptitude react to each improvement in the intervention (or alternative versions of the intervention). The point here is to optimize the intervention by creating its multiple versions or its multiple stages so that the outcome is optimized at all levels of aptitude. This design has between- and within-person versions, depending on the purposes of the intervention that is being revised (e.g., ensuring that all children can learn equivalent fractions regardless of their level of aptitude or ensuring the success of the therapy regardless of the variability in depressive states of a client across multiple therapy sessions). In the aptitude growth design, the target of intervention is the level of aptitude. The idea here is that as the level of aptitude changes, different types of interventions might be used to optimize the outcome. This type of design is often used in combination with growth-curve analyses. It can be applied as either between-persons or within-person designs. Finally, a type of design that has been gaining much popularity lately is the regression discontinuity design. In this design, the presence of ATI is registered when the same intervention is administered before and after a particular event (e.g., a change in aptitude in response to linguistic immersion while living in a country while continuing to study the language of that country).
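Because ATI is a statistical interaction that can be detected only in adequately powered studies, a simulation run before data collection is one way to check whether a planned simple randomized between-persons design is sensitive enough. The following Python sketch is illustrative only; the assumed effect sizes, sample size, and variable names are arbitrary choices, not values drawn from the ATI literature.

```python
# Rough power check for detecting an aptitude x treatment interaction of an
# assumed size in a two-group randomized design. All settings are assumptions.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(7)
n_per_group, interaction_beta, reps, alpha = 120, 0.30, 1_000, 0.05

hits = 0
for _ in range(reps):
    aptitude = rng.normal(size=2 * n_per_group)      # standardized aptitude
    treatment = np.repeat([0, 1], n_per_group)       # two parallel interventions
    outcome = (0.2 * aptitude + 0.3 * treatment
               + interaction_beta * aptitude * treatment
               + rng.normal(size=2 * n_per_group))
    df = pd.DataFrame({"outcome": outcome, "aptitude": aptitude,
                       "treatment": treatment})
    fit = ols("outcome ~ aptitude * treatment", data=df).fit()
    hits += fit.pvalues["aptitude:treatment"] < alpha
print("Estimated power to detect the interaction:", hits / reps)
```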
ATI as a Theoretical Framework

ATI as a theoretical framework underscores the flexible and dynamic, rather than fixed and deterministic, nature of the coexistence (or coaction) of individual characteristics (i.e., aptitudes) and situations (i.e., interventions). As a theory, ATI captures the very nature of variation in learning—not everyone learns equally well from the same method of instruction, and not every method of teaching works for everyone; in training—people acquire skills in a variety of ways; in therapy—not everyone responds well to a particular therapeutic approach; and in organizational activities—not everyone prefers the same style of leadership. In this sense, as a theoretical framework, ATI appeals to professionals in multiple domains as it justifies the presence of variation in outcomes in classrooms, work environments, therapeutic settings, and battlefields. While applicable to all types and levels of aptitudes and all kinds of interventions, ATI is particularly aligned with more extreme levels of aptitudes, both low and high, and more specialized interventions. The theory of ATI acknowledges the presence of heterogeneity in both aptitudes and interventions, and its premise is to find the best possible combinations of the two to maximize the homogeneity of the outcome. A particular appeal of the theory is its transactional nature and its potential to explain and justify both success and failure in obtaining the desired outcome. As a theoretical framework, ATI does not require the interaction to either be registered empirically or be statistically significant. It calls for a theoretical examination of the aptitude and interventional parameters whose interaction would best explain the dynamics of learning, skill acquisition and demonstration, therapy, and leadership. The beneficiaries of this kind of examination are of two kinds. First, it is the researchers themselves. Initially thinking through experiments and field
studies before trying to confirm the existence of ATI empirically was, apparently, not a common feature of ATI studies during the height of their popularity. Perhaps a more careful consideration of the "what, how, and why" of measurement in ATI research would have prevented the observation that many ATI findings resulted from somewhat haphazard fishing expeditions, and the resulting views on ATI research would have been different. A second group of beneficiaries of ATI studies are practitioners and policy makers. That there is no intervention that works for all, and that one has to anticipate both successes and failures and consider who will and who will not benefit from a particular intervention, are important realizations to make while adopting a particular educational program, training package, therapeutic approach, or organizational strategy, rather than in the aftermath. However, the warning against embracing panaceas, made by Richard Snow, in interventional research and practice is still just a warning, not a common presupposition.

Criticism

Having emerged in the 1950s, interest in ATI peaked in the 1970s and 1980s, but then dissipated. This expansion and contraction were driven by an initial surge in enthusiasm, followed by a wave of skepticism about the validity of ATI. Specifically, a large-scope search for ATI, whose presence was interpreted as being marked by differentiated regression slopes predicting outcomes from aptitudes for different interventions, or by the significance of the interaction terms in analysis of variance models, was enthusiastically carried out by a number of researchers. The accumulated data, however, were mixed and often contradictory—there were traces of ATI, but its presence and magnitude were not consistently identifiable or replicable. Many reasons have been mentioned in discussions of why ATI is so elusive: underpowered studies, weak theoretical conceptualizations of ATI, simplistic research designs, imperfections in statistical analyses, and the magnitude and even the nonexistence of ATI, among others. As a result of this discussion, the initial prediction of the originator of ATI's concept, Lee Cronbach, that interventions designed for the average individual would be ultimately replaced by multiple parallel interventions to fit groups of individuals, was revised. The "new view" of ATI, put forward by Cronbach in 1975, acknowledged that, although in existence, ATI is much more complex and fluid than initially predicted and ATI's dynamism and fluidity prevent professionals from cataloging specific types of ATI and generalizing guidelines for prescribing different interventions to people, given their aptitudes. Although the usefulness of ATI as a theory has been recognized, its features as a concept and as a method have been criticized along the lines of (a) our necessarily incomplete knowledge of all possible aptitudes and their levels, (b) the shortage of good psychometric instruments that can validly and reliably quantify aptitudes, (c) the biases inherent in many procedures related to aptitude assessment and intervention delivery, and (d) the lack of understanding and possible registering of important "other" nonstatistical interactions (e.g., between student and teacher, client and therapist, environment and intervention). And yet ATI has never been completely driven from the field, and there have been steady references to the importance of ATI's framework and the need for better-designed empirical studies of ATI.

Gene × Environment Interaction

ATI has a number of neighboring concepts that also work within the general realm of qualifying and quantifying individual differences in situations of acquiring new knowledge or new skills. Among these concepts are learning styles, learning strategies, learning attitudes, and many interactive effects (e.g., aptitude-outcome interaction). Quite often, the concept of ATI is discussed side by side with these neighboring concepts. Of particular interest is the link between the concept of ATI and the concept of Gene × Environment interaction (G × E). The concept of G × E first appeared in nonhuman research but gained tremendous popularity in the psychological literature within the same decade. Of note is that the tradition of its use in this literature is very similar to that of the usage of ATI; specifically, G × E also can be viewed as a concept, a method, and a theoretical framework. But the congruence between the two concepts is incomplete, of course; the concept of G × E adopts a very narrow definition of aptitude, in which individual differences are reduced to
genetic variation, and a very broad definition of treatment, in which interventions can be equated with live events. Yet an appraisal of the parallels between the concepts of ATI and G × E is useful because it captures the field's desire to engage interaction effects for explanatory purposes whenever the explicatory power of main effects is disappointing. And it is interesting that the accumulation of the literature on G × E results in a set of concerns similar to those that interrupted the golden rush of ATI studies in the 1970s.

Yet methodological concerns aside, the concept of ATI rings a bell for all of us who have ever tried to learn anything in a group of people: what works for some of us will not work for the others as long as we differ on even one characteristic that is relevant to the outcome of interest. Whether it was wit or something else by which Homer attempted to differentiate Odysseus and Thersites, the poet did at least successfully make an observation that has been central to many fields of social studies and that has inspired the appearance of the concept, methodology, and theoretical framework of ATI, as well as the many other concepts that capture the essence of what it means to be an individual in any given situation: that individual differences in response to a common intervention exist. Millennia later, it is an observation that still claims our attention.

Elena L. Grigorenko

See also Effect Size, Measures of; Field Study; Growth Curve; Interaction; Intervention; Power; Within-Subjects Design

Further Readings

Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Dance, K. A., & Neufeld, R. W. J. (1988). Aptitude-treatment interaction research in the clinical setting: A review of attempts to dispel the "patient uniformity" myth. Psychological Bulletin, 104, 192–213.
Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199–208.
Grigorenko, E. L. (2005). The inherent complexities of gene-environment interactions. Journal of Gerontology, 60B, 53–64.
Snow, R. E. (1984). Placing children in special education: Some comments. Educational Researcher, 13, 12–14.
Snow, R. E. (1991). Aptitude-treatment interaction as a framework for research on individual differences in psychotherapy. Journal of Consulting & Clinical Psychology, 59, 205–216.
Spearman, C. (1904). "General intelligence" objectively determined and measured. American Journal of Psychology, 15, 201–293.
Violato, C. (1988). Interactionism in psychology and education: A new paradigm or source of confusion? Journal of Educational Thought, 22, 4–20.

ASSENT

The term assent refers to the verbal or written agreement to engage in a research study. Assent is generally applicable to children between the ages of 8 and 18 years, although assent may apply to other vulnerable populations also.

Vulnerable populations are those composed of individuals who are unable to give consent due to diminished autonomy. Diminished autonomy occurs when an individual is incapacitated, has restricted freedom, or is a minor. Understanding the relevance of assent is important because without obtaining the assent of a participant, the researcher has restricted the freedom and autonomy of the participant and in turn has violated the basic ethical principle of respect for persons. Assent with regard to vulnerable populations is discussed here, along with the process of obtaining assent and the role of institutional review boards in the assent process.

Vulnerable Populations

Respect for persons requires that participants agree to engage in research voluntarily and have adequate information to make an informed decision. Most laws recognize that a person 18 years of age or older is able to give his or her informed consent to participate in the research study. However, in
some cases individuals lack the capacity to provide informed consent. An individual may lack the capacity to give his or her consent for a variety of reasons; examples include a prisoner who is ordered to undergo an experimental treatment designed to decrease recidivism, a participant with mental retardation, or an older adult with dementia whose caretakers believe an experimental psychotherapy group may decrease his or her symptoms. Each of the participants in these examples is not capable of giving permission to participate in the research because he or she either is coerced into engaging in the research or lacks the ability to understand the basic information necessary to fully consent to the study.

State laws prohibit minors and incapacitated individuals from giving consent. In these cases, permission must be obtained from parents and court-appointed guardians, respectively. However, beyond consent, many ethicists, professional organizations, and ethical codes require that assent be obtained. With children, state laws define when a young person is legally competent to make informed decisions. Some argue that the ability to give assent emerges between 8 and 14 years of age, when the person becomes able to comprehend the requirements of the research. In general, however, it is thought that by the age of 10, children should be able to provide assent to participate. It is argued that obtaining assent increases the autonomy of the individual. By obtaining assent, individuals are afforded as much control as possible over their decision to engage in the research given the circumstances, regardless of their mental capacity.

Obtaining Assent

Assent is not a singular event. It is thought that assent is a continual process. Thus, researchers are encouraged to obtain permission to continue with the research during each new phase of research (e.g., moving from one type of task to the next). If an individual assents to participate in the study but during the study requests to discontinue, it is recommended that the research be discontinued. Although obtaining assent is strongly recommended, failure to obtain assent does not necessarily preclude the participant from engaging in the research. For example, if the parent of a 4-year-old child gives permission for the child to attend a social skills group for socially anxious children, but the child does not assent to treatment, the child may be enrolled in the group without his or her assent. However, it is recommended that assent be obtained whenever possible. Further, if a child does not give assent initially, attempts to obtain assent should continue throughout the research. Guidelines also suggest that assent may be overlooked in cases in which the possible benefits of the research outweigh the costs. For example, if one wanted to study the effects of a life-saving drug for children and the child refused the medication, the benefit of saving the child's life outweighs the cost of not obtaining assent. Assent may be overlooked in cases in which assent of the participants is not feasible, as would be the case of a researcher interested in studying children who died as a result of not wearing a seatbelt.

Obtaining assent is an active process whereby the participant and the researcher discuss the requirements of the research. In this case, the participant is active in the decision making. Passive consent, a concept closely associated with assent and consent, is the lack of protest, objection, or opting out of the research study and is considered permission to continue with the research.

Institutional Review Boards

Institutional review boards frequently make requirements as to the way assent is to be obtained and documented. Assent may be obtained either orally or in writing and should always be documented. In obtaining assent, the researcher provides the same information as is provided to an individual from whom consent is requested. The language level and details may be altered in order to meet the understanding of the assenting participant. Specifically, the participant should be informed of the purpose of the study; the time necessary to complete the study; as well as the risks, benefits, and alternatives to the study or treatment. Participants should also have access to the researcher's contact information. Finally, limits of confidentiality should be addressed. This is particularly important for individuals in the prison system and for children.

Tracy J. Cohn
See also Debriefing; Ethics in the Research Process; determine the appropriate statistical technique or
Informed Consent; Interviewing test that is needed to establish the existence of an
association. If the statistical test shows a conclusive
Further Readings association that is unlikely to occur by random
chance, different types of regression models can be
Belmont report: Ethical principles and guidelines for the used to quantify how change in exposure to a vari-
protection of human subjects of research. (1979). able relates to the change in the outcome variable
Washington, DC: U.S. Government Printing Office.
of interest.
Grisso, T. (1992). Minors’ assent to behavioral research
without parental consent. In B. Stanley & J. E. Sieber
(Eds.), Social research on children and adolescents:
Ethical issues (pp. 109–127). Newbury Park, CA: Examining Association Between Continuous
Sage. Variables With Correlation Analyses
Miller, V. A., & Nelson, R. M. (2006). A developmental
approach to child assent for nontherapeutic research. Correlation is a measure of association between
Journal of Pediatrics, 150(4), 25–30. two variables that expresses the degree to which
Ross, L. F. (2003). Do healthy children deserve greater the two variables are rectilinearly related. If the
protection in medical research? Journal of Pediatrics, data do not follow a straight line (e.g., they follow
142(2), 108–112.
a curve), common correlation analyses are not
appropriate. In correlation, unlike regression anal-
ysis, there are no dependent and independent
ASSOCIATION, MEASURES OF variables.
When both variables are measured as discrete
Measuring association between variables is very or continuous variables, it is common for research-
relevant for investigating causality, which is, in ers to examine the data for a correlation between
turn, the sine qua non of scientific research. these variables by using the Pearson product-
However, an association between two variables moment correlation coefficient (r). This coefficient
does not necessarily imply a causal relationship, has a value between −1 and +1 and indicates
and the research design of a study aimed at the strength of the association between the two
investigating an association needs to be carefully variables. A perfect correlation of ± 1 occurs only
considered in order for the study to obtain valid when all pairs of values (or points) fall exactly on
information. Knowledge of measures of associa- a straight line.
tion and the related ideas of correlation, regres- A positive correlation indicates in a broad way
sion, and causality are cornerstone concepts in that increasing values of one variable correspond
research design. This entry is directed at to increasing values in the other variable. A nega-
researchers disposed to approach these concepts tive correlation indicates that increasing values in
in a conceptual way. one variable corresponds to decreasing values in
the other variable. A correlation value close to
0 means no association between the variables. The
Measuring Association
r provides information about the strength of the
In scientific research, association is generally correlation (i.e., the nearness of the points to
defined as the statistical dependence between two a straight line). Figure 1 gives some examples of
or more variables. Two variables are associated if correlations, correlation coefficients, and related
some of the variability of one variable can be regression lines.
accounted for by the other, that is, if a change in A condition for estimating correlations is that
the quantity of one variable conditions a change in both variables must be obtained by random sam-
the other variable. pling from the same population. For example, one
Before investigating and measuring association, can study the correlation between height and
it is first appropriate to identify the types of vari- weight in a sample of children but not the correla-
ables that are being compared (e.g., nominal, ordi- tion between height and three different types of
nal, discrete, continuous). The type of variable will diet that have been decided by the investigator. In

[Figure 1 shows seven simulated scatterplots, (a) through (g), each with a fitted regression line; the panels range from essentially perfect correlations to no correlation (r = 0.00), with fitted equations such as y = 0.95x + 0.98 and y = 0.10x + 1.24.]

Figure 1   Scatterplots for Correlations of Various Magnitudes

Notes: Simulation examples show perfect positive (a, c) and negative (b) correlations, as well as regression lines with similar correlation coefficients but different slopes (a, c). The figure also shows regression lines with similar slopes but different correlation coefficients (d, e and f, g).
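Because correlation coefficients such as those in Figure 1 are routinely computed in software, a brief sketch may be helpful. It assumes Python with NumPy and SciPy and uses invented data; it also previews a point made later in this entry, namely that the Pearson coefficient can be distorted by a single extreme value, whereas a rank-based coefficient such as Spearman's is usually affected much less.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative data: y is roughly a linear function of x plus noise.
x = np.linspace(0, 10, 30)
y = 0.9 * x + rng.normal(scale=1.0, size=x.size)

r, _ = stats.pearsonr(x, y)        # product-moment correlation
rho, _ = stats.spearmanr(x, y)     # rank-based correlation
print(f"no outlier:   Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# Add a single extreme value (an outlier) that goes against the trend.
y_out = y.copy()
y_out[-1] = -40.0

r_out, _ = stats.pearsonr(x, y_out)      # typically drops sharply
rho_out, _ = stats.spearmanr(x, y_out)   # typically changes much less
print(f"with outlier: Pearson r = {r_out:.2f}, Spearman rho = {rho_out:.2f}")
```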

the latter case, it would be more appropriate to Quantifying Association in General


apply a regression analysis. and by Regression Analyses in Particular
The Pearson correlation coefficient may not be
appropriate if there are outliers (i.e., extreme In descriptive research, the occurrence of an
values). Therefore, the first step when one is study- outcome variable is typically expressed by group
ing correlations is to draw a scatterplot of the two measurements such as averages, proportions, inci-
variables to examine whether there are any out- dence, or prevalence rates. In analytical research,
liers. These variables should be standardized to the an association can be quantified by comparing, for
same scale before they are plotted. example, the absolute risk of the outcome in the
If outliers are present, nonparametric types exposed group and in the nonexposed group. Mea-
of correlation coefficients can be calculated to surements of association can then be expressed
examine the linear association. The Spearman rank either as differences (difference in risk) or as ratios,
correlation coefficient, for example, calculates cor- such as relative risks (a ratio of risks) or odds
relation coefficients based on the ranks of both ratios (a ratio of odds), and so forth. A ratio with
variables. Kendall’s coefficient of concordance cal- a numerical value greater than 1 (greater than
culates the concordance and discordance of the 0 for differences) indicates a positive association
observed (or ranked) exposure and outcome vari- between the exposure variable and the outcome
ables between pairs of individuals. variables, whereas a value less than 1 (less than
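The difference and ratio measures described in this entry (the risk difference, the relative risk, and the odds ratio) can be computed directly from a 2 × 2 cross-tabulation of exposure by outcome. A minimal sketch in plain Python follows; the counts are invented purely for illustration.

```python
# Hypothetical 2 x 2 table of exposure by outcome (counts are illustrative only).
exposed   = {"cases": 30, "noncases": 70}
unexposed = {"cases": 10, "noncases": 90}

risk_exposed   = exposed["cases"] / (exposed["cases"] + exposed["noncases"])
risk_unexposed = unexposed["cases"] / (unexposed["cases"] + unexposed["noncases"])

risk_difference = risk_exposed - risk_unexposed            # difference in risk
relative_risk   = risk_exposed / risk_unexposed            # ratio of risks
odds_ratio = (exposed["cases"] / exposed["noncases"]) / (
    unexposed["cases"] / unexposed["noncases"])            # ratio of odds

print(f"risk difference = {risk_difference:.2f}")
print(f"relative risk   = {relative_risk:.2f}")
print(f"odds ratio      = {odds_ratio:.2f}")
```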
When the variables one is investigating are 0 for differences) indicates a negative association.
nominal or have few categories or when the scat- These measures of association can be calculated
terplot of the variables suggests an association that from cross-tabulation of the outcome variable and
is not rectilinear but, for example, quadratic exposure categories, or they can be estimated in
or cubic, then the correlation coefficients des regression models.
cribed above are not suitable. In these cases other General measures of association such as correla-
approaches are needed to investigate the associa- tion coefficients and chi-square tests are rather
tion between variables. unspecific and provide information only on the
existence and strength of an association. Regres-
sion analysis, however, attempts to model the rela-
tionship between two variables by fitting a linear
Chi-Square Tests of Association
equation to observed data in order to quantify,
for Categorical Variables
and thereby predict, the change in the outcome of
A common method for investigating ‘‘general’’ interest with a unit increase in the exposure vari-
association between two categorical variables is able. In regression analysis, one variable is consid-
to perform a chi-square test. This method com- ered to be an explanatory variable (the exposure),
pares the observed number of individuals within and the other is considered to be a dependent vari-
cells of a cross-tabulation of the categorical vari- able (the outcome).
ables with the number of individuals one would The method of least squares is the method
expect in the cells if there was no association applied most frequently for fitting a regression
and the individuals were randomly distributed. line. This method calculates the best-fitting line for
If the observed and expected frequencies differ the observed data by minimizing the sum of the
statistically (beyond random chance according squares of the vertical deviations from each data
to the chi-square distribution), the variables are point to the line. When a point is placed exactly
said to be associated. on the fitted line, its vertical deviation is 0. Devia-
A chi-square test for trend can also examine tions are also known as residuals or errors. The
for linear association when the exposure category better an explanatory variable predicts the out-
is ordinal. Other statistical tests of association come, the lower is the sum of the squared residuals
include measurements of agreement in the associa- (i.e., residual variance).
tion, such as the kappa statistic or McNemar’s A simple linear regression model, for example,
test, which are suitable when the study design is can examine the increase in blood pressure with
a matched case-control. a unit increase in age with the regression model

Y = a + bX + e,

where X is the explanatory variable (i.e., age in years) and Y is the dependent variable (i.e., blood pressure in mm Hg). The slope of the line is b and represents the change in blood pressure for every year of age. Observe that b does not provide information about the strength of the association but only on the average change in Y when X increases by one unit. The strength of the association is indicated by the correlation coefficient r, which informs on the closeness of the points to the regression line (see Figure 1). The parameter a is the intercept (the value of y when x = 0), which corresponds to the mean blood pressure in the sample. Finally, e is the residual or error.
A useful measure is the square of the correlation coefficient, r², also called the coefficient of determination, and it indicates how much of the variance in the outcome (e.g., blood pressure) is explained by the exposure (e.g., age). As shown in Figure 1, different bs can be found with similar rs, and similar bs can be observed with different rs. For example, many biological variables have been proposed as risk factors for cardiovascular diseases because they showed a high b value, but they have been rejected as common risk factors because their r² was very low.
Different types of regression techniques are suitable for different outcome variables. For example, a logistic regression is suitable when the outcome is binary (i.e., 0 or 1), and logistic regression can examine, for example, the increased probability (more properly, the increase in log odds) of myocardial infarction with unit increase in age. The multinomial regression can be used for analyzing outcome with several categories. A Poisson regression can examine how rate of disease changes with exposure, and a Cox regression is suitable for survival analysis.

Association Versus Causality

Exposure variables that show a statistical relationship with an outcome variable are said to be associated with the outcome. It is only when there is strong evidence that this association is causal that the exposure variable is said to determine the outcome. In everyday scientific work, researchers apply a pragmatic rather than a philosophical framework to identify causality. For example, researchers want to discover modifiable causes of a disease.
Statistical associations say nothing by themselves on causality. Their causal value depends on the knowledge background of the investigator and the research design in which these statistical associations are observed. In fact, the only way to be completely sure that an association is causal would be to observe the very same individual living two parallel and exactly similar lives except that in one life, the individual was exposed to the variable of interest, and in the other life, the same individual was not exposed to the variable of interest (a situation called counterfactual). In this ideal design, the two (hypothetical) parallel lives of the individual are exchangeable in every way except for the exposure itself.
While the ideal research design is just a chimera, there are alternative approaches that try to approximate the ideal design by comparing similar groups of people rather than the same individual. One can, for example, perform an experiment by taking random samples from the same population and randomly allocating the exposure of interest to the samples (i.e., randomized trials). In a randomized trial, an association between the average level of exposure and the outcome is possibly causal. In fact, random samples from the same population are—with some random uncertainty—identical concerning both measured and unmeasured variables, and the random allocation of the exposure creates a counterfactual situation very appropriate for investigating causal associations. The randomized trial design is theoretically closest to the ideal design, but sometimes it is unattainable or unethical to apply this design in the real world.
If conducting a randomized trial is not possible, one can use observational designs in order to simulate the ideal design, at least with regard to measured variables. Among observational approaches it is common to use stratification, restriction, and multiple regression techniques. One can also take into account the propensity for exposure when comparing individuals or groups (e.g., propensity scores techniques), or one may investigate the same individual at two different times (case crossover design). On some occasions one may have access to natural experiments or instrumental variables.
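As a concrete illustration of the simple linear model Y = a + bX + e described above, the following sketch fits a least-squares line to a small set of invented age and blood pressure values and reports the slope b, the intercept a, the correlation r, and the coefficient of determination r². It assumes Python with NumPy; the numbers are illustrative only, not clinical data.

```python
import numpy as np

# Illustrative data: age in years (X) and systolic blood pressure in mm Hg (Y).
age = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70], dtype=float)
bp  = np.array([118, 121, 120, 127, 129, 135, 134, 140, 146, 148], dtype=float)

# Least-squares estimates: b is the average change in Y per one-unit increase in X.
b, a = np.polyfit(age, bp, deg=1)

# Correlation coefficient r and coefficient of determination r^2.
r = np.corrcoef(age, bp)[0, 1]
pred = a + b * age
ss_res = np.sum((bp - pred) ** 2)           # residual (error) sum of squares
ss_tot = np.sum((bp - bp.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot             # proportion of variance explained

print(f"slope b     = {b:.3f} mm Hg per year")
print(f"intercept a = {a:.2f} mm Hg")
print(f"r = {r:.3f},  r^2 = {r_squared:.3f}")
```

The output makes the distinction drawn above tangible: b describes the average change in blood pressure per additional year of age, while r and r² describe how tightly the observations cluster around the fitted line.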

When planning a research design, it is always association is modified by a third variable. The
preferable to perform a prospective study because study sample may be different from the rest of the
it identifies the exposure before any individual has population (e.g., only men or only healthy people),
developed the outcome. If one observes an associa- but this situation does not necessarily convey that
tion in a cross-sectional design, one can never be the results obtained are biased and cannot be
sure of the direction of the association. For exam- applied to the general population. Many random-
ple, low income is associated with impaired health ized clinical trials are performed on a restricted
in cross-sectional studies, but it is not known sample of individuals, but the results are actually
whether bad health leads to low income or the generalizable to the whole population. However, if
opposite. As noted by Austin Bradford Hill, the there is an interaction between variables, the effect
existence of a temporal relationship is the main cri- modification that this interaction produces must be
terion for distinguishing causality from association. considered. For example, the association between
Other relevant criteria pointed out by this author exposure to asbestos and lung cancer is much more
are consistency, strength, specificity, dose-response intense among smokers than among nonsmokers.
relationship, biological plausibility, and coherence. Therefore, a study on a population of nonsmokers
would not be generalizable to the general popula-
Bias and Random Error in Association Studies tion. Failure to consider interactions may even ren-
der associations spurious in a sample that includes
When planning a study design for investigating
the whole population. For example, a drug may
causal associations, one needs to consider the pos-
increase the risk of death in a group of patients but
sible existence of random error, selection bias,
decrease this risk in other different groups of
information bias, and confounding, as well as the
patients. However, an overall measure would show
presence of interactions or effect modification and
no association since the antagonistic directions of
of mediator variables.
the underlying associations compensate each other.
Bias is often defined as the lack of internal
validity of the association between exposure and
outcome variable of interest. This is in contrast to Information Bias
external validity, which concerns generalizability
Information bias simply arises because informa-
of the association to other populations. Bias can
tion collected on the variables is erroneous. All
also be defined as nonrandom or systematic differ-
variables must be measured correctly; otherwise,
ence between an estimate and the true value of the
one can arrive at imprecise or even spurious
population.
associations.

Random Error
Confounding
When designing a study, one always needs to
include a sufficient number of individuals in the An association between two variables can be
analyses to achieve appropriate statistical power confounded by a third variable. Imagine, for exam-
and ensure that conclusive estimates of association ple, that one observes an association between the
can be obtained. Suitable statistical power is espe- existence of yellow nails and mortality. The causal-
cially relevant when it comes to establishing the ity of this association could be plausible. Since nail
absence of association between two variables. tissue stores body substances, the yellow coloration
Moreover, when a study involves a large number might indicate poisoning or metabolic disease that
of individuals, more information is available. causes an increased mortality. However, further
More information lowers the random error, which investigation would indicate that individuals with
in turn increases the precision of the estimates. yellow nails were actually heavy smokers. The
habit of holding the cigarette between the fingers
discolored their nails, but the cause of death was
Selection Bias
smoking. That is, smoking was associated with
Selection bias can occur if the sample differs both yellow nails and mortality and originated
from the rest of the population and if the observed a confounded association (Figure 2).

[Figure: Smoking has arrows to both Yellow nails and Death; the link between Yellow nails and Death is the confounded association.]

Figure 2   Deceptive Correlation Between Yellow Nails and Mortality

Note: Because smoking is associated with both yellow nails and mortality, it originated a confounded association between yellow nails and mortality.

[Figure: Low income leads to Smoking (the mediator), which leads to Death.]

Figure 3   Smoking Acts as a Mediator Between Income and Early Death

Note: Heavy smoking mediated the effect of low income on mortality.

Mediation

In some cases an observed association is mediated by an intermediate variable. For example, individuals with low income present a higher risk of early death than do individuals with high income. Simultaneously, there are many more heavy smokers among people with low income. In this case, heavy smoking mediates the effect of low income on mortality.
Distinguishing which variables are confounders and which are mediators cannot be done by statistical techniques only. It requires previous knowledge, and in some cases variables can be both confounders and mediators.

Directed Acyclic Graphs

Determining which variables are confounders, intermediates, or independently associated variables can be difficult when many variables are involved. Directed acyclic graphs use a set of simple rules to create a visual representation of direct and indirect associations of covariates and exposure variables with the outcome. These graphs can help researchers understand possible causal relationships.

Juan Merlo and Kristian Lynch

See also Bias; Cause and Effect; Chi-Square Test; Confounding; Correlation; Interaction; Multiple Regression; Power Analysis

Further Readings

Altman, D. G. (1991). Practical statistics for medical research. New York: Chapman & Hall/CRC.
Hernan, M. A., Hernandez-Diaz, S., Werler, M. M., & Mitchell, A. A. (2002). Causal knowledge as a prerequisite for confounding evaluation: An application to birth defects epidemiology. American Journal of Epidemiology, 155(2), 176–184.
Hernan, M. A., & Robins, J. M. (2006). Instruments for causal inference: An epidemiologist's dream? Epidemiology, 17(4), 360–372.
Hill, A. B. (1965). Environment and disease: Association or causation? Proceedings of the Royal Society of Medicine, 58, 295–300.
Jaakkola, J. J. (2003). Case-crossover design in air pollution epidemiology. European Respiratory Journal, 40, 81s–85s.
Last, J. M. (Ed.). (2000). A dictionary of epidemiology (4th ed.). New York: Oxford University Press.
Liebetrau, A. M. (1983). Measures of association. Newbury Park, CA: Sage.
Lloyd, F. D., & Van Belle, G. (1993). Biostatistics: A methodology for the health sciences. New York: Wiley.
Oakes, J. M., & Kaufman, J. S. (Eds.). (2006). Methods in social epidemiology. New York: Wiley.
Rothman, K. J. (Ed.). (1988). Causal inference. Chestnut Hill, MA: Epidemiology Resources.
Susser, M. (1991). What is a cause and how do we know one? A grammar for pragmatic epidemiology. American Journal of Epidemiology, 33, 635–648.

AUTOCORRELATION

Autocorrelation describes sample or population observations or elements that are related to each other across time, space, or other dimensions. Correlated observations are common but problematic, largely because they violate a basic statistical assumption about many samples: independence across elements. Conventional tests of statistical significance assume simple random sampling, in which not only each element has an equal chance of selection but also each combination of elements has an equal chance of selection; autocorrelation

violates this assumption. This entry describes com- the same unit at an earlier time, frequently one
mon sources of autocorrelation, the problems it period removed (often called t − 1).
can cause, and selected diagnostics and solutions.
• Spatial correlation occurs in cluster samples
(e.g., classrooms or neighborhoods): Physically
Sources
adjacent elements have a higher chance of entering
What is the best predictor of a student’s 11th- the sample than do other elements. These adjacent
grade academic performance? His or her 10th- elements are typically more similar to already sam-
grade grade point average. What is the best predic- pled cases than are elements from a simple random
tor of this year’s crude divorce rate? Usually last sample of the same size.
year’s divorce rate. The old slogan ‘‘Birds of
a feather flock together’’ describes a college class- • A variation of spatial correlation occurs with
room in which students are about the same age, contagion effects, such as crime incidence (burglars
at the same academic stage, and often in the ignore city limits in plundering wealthy neighbor-
same disciplinary major. That slogan also describes hoods) or an outbreak of disease.
many residential city blocks, where adult inhabi-
tants have comparable incomes and perhaps even • Multiple (repeated) measures administered to
similar marital and parental status. When examin- the same individual at approximately the same
ing the spread of a disease, such as the H1N1 time (e.g., a lengthy survey questionnaire with
influenza, researchers often use epidemiological many Likert-type items in agree-disagree format).
maps showing concentric circles around the initial
outbreak locations.
Autocorrelation Terms
All these are examples of correlated observa-
tions, that is, autocorrelation, in which two indivi- The terms positive or negative autocorrelation
duals from a classroom or neighborhood cluster, often apply to time-series data. Societal inertia can
cases from a time series of measures, or proximity inflate the correlation of observed measures across
to a contagious event resemble each other more time. The social forces creating trends such as fall-
than two cases drawn from the total population of ing marriage rates or rising gross domestic product
elements by means of a simple random sample. often carry over from one period into the next.
Correlated observations occur for several reasons: When trends continue over time (e.g., a student’s
grades), positive predictions can be made from one
• Repeated, comparable measures are taken on period to the next, hence the term positive
the same individuals over time, such as many pre- autocorrelation.
test and posttest experimental measures or panel However, forces at one time can also create
surveys, which reinterview the same individual. compensatory or corrective mechanisms at the
Because people remember their prior responses or next, such as consumers’ alternating patterns of
behaviors, because many behaviors are habitual, ‘‘save, then spend’’ or regulation of production
and because many traits or talents stay relatively based on estimates of prior inventory. The data
constant over time, these repeated measures points seem to ricochet from one time to the next,
become correlated for the same person. so adjacent observations are said to be negatively
correlated, creating a cobweb-pattern effect.
• Time-series measures also apply to larger The order of the autocorrelation process refer-
units, such as birth, divorce, or labor force partici- ences the degree of periodicity in correlated obser-
pation rates in countries or achievement grades in vations. When adjacent observations are
a county school system. Observations on the same correlated, the process is first-order autoregression,
variable are repeated on the same unit at some or AR (1). If every other observation, or alternate
periodic interval (e.g., annual rate of felony observations, is correlated, this is an AR (2) pro-
crimes). The units transcend the individual, and cess. If every third observation is correlated, this is
the periodicity of measurement is usually regular. an AR (3) process, and so on. The order of the
A lag describes a measure of the same variable on process is important, first because the most

available diagnostic tests and corrections are for the simplest situation, an AR (1); higher order processes require more complex corrections. Second, the closer two observations are in time or space, the larger the correlation between them, creating more problems for the data analyst. An AR (1) process describes many types of autocorrelation, such as trend data or contagion effects.

Problems

Because the similarity among study elements is more pronounced than that produced by other probability samples, each autocorrelated case "counts less" than a case drawn using simple random sampling. Thus, the "real" or corrected sample size when autocorrelation is present is smaller than a simple random sample containing the same number of elements. This statistical attenuation of the casebase is sometimes called the design effect, and it is well known to survey statisticians who design cluster samples.
The sample size is critical in inferential statistics. The N comprises part of the formula for estimates of sample variances and the standard error. The standard error forms the denominator for statistics such as t tests. The N is also used to calculate degrees of freedom for many statistics, such as F tests in analysis of variance or multiple regression, and it influences the size of chi-square.
When autocorrelation is present, use of the observed N means overestimating the effective n. The calculated variances and standard errors that use simple random sampling formulae (as most statistics computer programs do) are, in fact, too low. In turn, this means t tests and other inferential statistics are too large, leading the analyst to reject the null hypothesis inappropriately. In short, autocorrelation often leads researchers to think that many study results are statistically significant when they are not.
For example, in ordinary least squares (OLS) regression or analysis of variance, autocorrelation renders the simple random sampling formulae invalid for the error terms and measures derived from them. The true sum of squared errors (σ) is now inflated (often considerably) because it is divided by a fraction:

σ = v² / (1 − ρ²),

where v is the random component in a residual or error term.

Rho

When elements are correlated, a systematic bias thus enters into estimates of the residuals or error terms. This bias is usually estimated numerically by rho (ρ), the intraclass correlation coefficient, or the correlation of autocorrelation. Rho estimates the average correlation among (usually) adjacent pairs of elements. Rho is found, sometimes unobtrusively, in many statistics that attempt to correct and compensate for autocorrelation, such as hierarchical linear models.

An Example: Autocorrelation Effects on Basic Regression Models

With more complex statistical techniques, such as regression, the effects of ρ multiply beyond providing a less stable estimate of the population mean. If autocorrelation occurs for scores on the dependent variable in OLS regression, then the regression residuals will also be autocorrelated, creating a systematic bias in estimates of the residuals and statistics derived from them. For example, standard computer OLS regression output will be invalid for the following: the residual sum of squares, the standard error of the (regression) estimate, the F test, R² and the adjusted R², the standard errors of the Bs, the t tests, and significance levels for the Bs.
As long as residuals are correlated only among themselves and not back with any of the predictor variables, the OLS regression coefficient estimates themselves should be unbiased. However, the Bs are no longer best linear unbiased estimators, and the estimates of the statistical significance of the Bs and the constant term are inaccurate. If there is positive autocorrelation (the more usual case in trend data), the t tests will be inappropriately large. If there is negative autocorrelation (less common), the computer program's calculated t tests will be too small. However, there also may be autocorrelation in an independent variable, which generally aggravates the underestimation of residuals in OLS regression.
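To make the inflation just described concrete, here is a small Monte Carlo sketch. It assumes Python with NumPy; the data are simulated, and the threshold 2.01 is an approximate two-sided 5% t critical value. With a positively autocorrelated predictor and outcome and no true relationship between them, the naive OLS t test typically rejects far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, rho, rng):
    """Generate an AR(1) series: e[t] = rho * e[t-1] + v[t], with standard normal shocks."""
    e = np.empty(n)
    e[0] = rng.normal()
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    return e

n, rho, n_sims = 50, 0.8, 2000
t_crit = 2.01  # approximate two-sided .05 critical value for about 48 df
rejections = 0

for _ in range(n_sims):
    x = ar1(n, rho, rng)   # positively autocorrelated predictor
    y = ar1(n, rho, rng)   # generated independently of x, so the true slope is 0
    xc, yc = x - x.mean(), y - y.mean()
    b = xc @ yc / (xc @ xc)                              # OLS slope
    resid = yc - b * xc                                  # residuals from the fitted line
    se_b = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))  # naive standard error
    if abs(b / se_b) > t_crit:
        rejections += 1

print(f"nominal alpha = 0.05, empirical rejection rate = {rejections / n_sims:.2f}")
```

The empirical rejection rate under these settings usually comes out several times the nominal level, which is exactly the pattern this section warns about.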

Diagnosing First-Order Autocorrelation

There are several ways to detect first-order autocorrelation in least squares analyses. Pairs of adjacent residuals can be plotted against time (or space) and the resulting scatterplot examined. However, the scatterplot "cloud of points" mentioned in most introductory statistics texts often resembles just that, especially with large samples. The decision is literally based on an "eyeball" analysis.
Second, and more formally, the statistical significance of the number of positive and negative runs or sign changes in the residuals can be tested. Tables of significance tests for the runs test are available in many statistics textbooks. The situation of too many runs means the adjacent residuals have switched signs too often and oscillate, resulting in a diagnosis of negative autocorrelation. The situation of too few runs means long streams of positive or negative trends, thus suggesting positive autocorrelation. The number of runs expected in a random progression of elements depends on the number of observations. Most tables apply to relatively small sample sizes, such as N < 40. Since many time series for social trends are relatively short in duration, depending on the availability of data, this test can be more practical than it initially appears.
One widely used formal diagnostic for first-order autocorrelation is the Durbin-Watson d statistic, which is available in many statistical computer programs. The d statistic is approximately calculated as 2(1 − ρ), where ρ, the correlation between adjacent residuals e_t and e_{t−1}, is the intraclass correlation coefficient. The e_t can be defined as adjacent residuals (in the following formula, v represents the true random error terms that one really wants to estimate):

e_t = ρe_{t−1} + v_t.

Thus d is a ratio of the sum of squared differences between adjacent residuals to the sum of squared residuals. The d has an interesting statistical distribution: Values near 2 imply ρ = 0 (no autocorrelation); d is 0 when ρ = 1 (extreme positive autocorrelation) and 4 when ρ = −1 (extreme negative autocorrelation). In addition, d has two zones of indecision (one near 0 and one near 4), in which the null hypothesis ρ = 0 is neither accepted nor rejected. The zones of indecision depend on the number of cases and the number of predictor variables. The d calculation cannot be used with regressions through the origin, with standardized regression equations, or with equations that include lags of the dependent variable as predictors.
Many other computer programs provide iterative estimates of ρ and its standard error, and sometimes the Durbin–Watson d as well. Hierarchical linear models and time-series analysis programs are two examples. The null hypothesis ρ = 0 can be tested through a t distribution with the ratio

ρ / se_ρ.

The t value can be evaluated using the t tables if needed. If ρ is not statistically significant, there is no first-order autocorrelation. If the analyst is willing to specify the positive or negative direction of the autocorrelation in advance, one-tailed tests of statistical significance are available.

Possible Solutions

When interest centers on a time series and the lag of the dependent variable, it is tempting to attempt solving the autocorrelation problem by simply including a lagged dependent variable (e.g., y_{t−1}) as a predictor in OLS regression or as a covariate in analysis of covariance. Unfortunately, this alternative creates a worse problem. Because the observations are correlated, the residual term e is now correlated back with y_{t−1}, which is a predictor for the regression or analysis of covariance. Not only does this alternative introduce bias into the previously unbiased B coefficient estimates, but using lags also invalidates the use of diagnostic tests such as the Durbin-Watson d.
The first-differences (Cochrane-Orcutt) solution is one way to correct autocorrelation. This generalized least squares (GLS) solution creates a set of new variables by subtracting from each variable (not just the dependent variable) its own t − 1 lag or adjacent case. Then each newly created variable in the equation is multiplied by the weight (1 − ρ) to make the error terms behave randomly.
An analyst may also wish to check for higher order autoregressive processes. If a GLS solution was created for the AR (1) autocorrelation, some

statistical programs will test for the statistical sig- computer programs exist, either freestanding or
nificance of ρ using the Durbin-Watson d for the within larger packages, such as the Statistical
reestimated GLS equation. If ρ does not equal 0, Package for the Social Sciences (SPSS; an IBM
higher order autocorrelation may exist. Possible company, formerly called PASWâ Statistics).
solutions here include logarithmic or polynomial Autocorrelation is an unexpectedly common
transformations of the variables, which may atten- phenomenon that occurs in many social and
uate ρ. The analyst may also wish to examine behavioral science phenomena (e.g., psychologi-
econometrics programs that estimate higher order cal experiments or the tracking of student devel-
autoregressive equations. opment over time, social trends on employment,
In the Cochrane-Orcutt solution, the first obser- or cluster samples). Its major possible conse-
vation is lost; this may be problematic in small quence—leading one to believe that accidental
samples. The Prais-Winsten approximation has sample fluctuations are statistically significant—
been used to estimate the first observation in case is serious. Checking and correcting for autocor-
of bivariate correlation or regression (with a loss relation should become a more automatic
of one additional degree of freedom). process in the data analyst’s tool chest than it
In most social and behavioral science data, once currently appears to be.
autocorrelation is corrected, conclusions about the
statistical significance of the results become much Susan Carol Losh
more conservative. Even when corrections for ρ
See also Cluster Sampling; Hierarchical Linear Modeling;
have been made, some statisticians believe that R2s
Intraclass Correlation; Multivariate Analysis of
or η2s to estimate the total explained variance in
Variance (MANOVA); Time-Series Study
regression or analysis of variance models are
invalid if autocorrelation existed in the original
analyses. The explained variance tends to be quite Further Readings
large under these circumstances, reflecting the
covariation of trends or behaviors. Bowerman, B. L. (2004). Forecasting, time series, and
Several disciplines have other ways of handling regression (4th ed.). Pacific Grove, CA: Duxbury Press.
autocorrelation. Some alternate solutions are Gujarati, D. (2009). Basic econometrics (5th ed.). New
York: McGraw-Hill.
paired t tests and multivariate analysis of variance
Luke, D. A. (2004). Multilevel modeling. Thousand
for either repeated measures or multiple dependent Oaks, CA: Sage.
variables. Econometric analysts diagnose treat- Menard, S. W. (2002). Longitudinal research (2nd ed.).
ments of higher order periodicity, lags for either Thousand Oaks, CA: Sage.
predictors or dependent variables, and moving Ostrom, C. W., Jr. (1990), Time series analysis: Regression
averages (often called ARIMA). Specialized techniques (2nd ed.). Thousand Oaks, CA: Sage.
B
year as a series of 34 bars, one for each of
BAR CHART the imports and exports of Scotland’s 17 trading
partners. However, his innovation was largely
ignored in Britain for a number of years. Playfair
The term bar chart refers to a category of dia- himself attributed little value to his invention,
grams in which values are represented by the apologizing for what he saw as the limitations of
height or length of bars, lines, or other symbolic the bar chart. It was not until 1801 and the publi-
representations. Bar charts are typically used to cation of his Statistical Breviary that Playfair rec-
display variables on a nominal or ordinal scale. ognized the value of his invention. Playfair’s
Bar charts are a very popular form of information invention fared better in Germany and France. In
graphics often used in research articles, scientific 1811 the German Alexander von Humboldt pub-
reports, textbooks, and popular media to visually lished adaptations of Playfair’s bar graph and pie
display relationships and trends in data. However, charts in Essai Politique sur le Royaume de la
for this display to be effective, the data must be Nouvelle Espagne. In 1821, Jean Baptiste Joseph
presented accurately, and the reader must be able Fourier adapted the bar chart to create the first
to analyze the presentation effectively. This entry graph of cumulative frequency distribution,
provides information on the history of the bar referred to as an ogive. In 1833, A. M. Guerry
charts, the types of bar charts, and the construc- used the bar chart to plot crime data, creating the
tion of a bar chart. first histogram. Finally, in 1859 Playfair’s work
began to be accepted in Britain when Stanley
History Jevons published bar charts in his version of an
economic atlas modeled on Playfair’s earlier work.
The creation of the first bar chart is attributed to Jevons in turn influenced Karl Pearson, commonly
William Playfair and appeared in The Commercial considered the ‘‘father of modern statistics,’’ who
and Political Atlas in 1786. Playfair’s bar graph promoted the widespread acceptance of the bar
was an adaptation of Joseph Priestley’s time-line chart and other forms of information graphics.
charts, which were popular at the time. Ironically,
Playfair attributed his creation of the bar graph to
a lack of data. In his Atlas, Playfair presented 34
Types
plates containing line graphs or surface charts
graphically representing the imports and exports Although the terms bar chart and bar graph are
from different countries over the years. Since he now used interchangeably, the term bar chart was
lacked the necessary time-series data for Scotland, reserved traditionally for corresponding displays
he was forced to graph its trade data for a single that did not have scales, grid lines, or tick marks.


[Figure: a vertical bar chart titled "Total Earnings for Various Companies for the Year 2007," with the vertical axis labeled Millions of USD (0 to 9) and one bar for each of Companies A through E. The plotted values are: Company A, 1.3; Company B, 2.1; Company C, 8.2; Company D, 1.5; Company E, 3.1.]

Figure 1   Simple Bar Chart and Associated Data

Note: USD = U.S. dollars.

U.S. dollars, and the widths of the bars are used to represent the percentage of the earnings coming from exports. The information expressed by the bar width can be displayed by means of a scale on the horizontal axis or by a legend, or, as in this case, the values might be noted directly on the graph. If both positive and negative values are plotted on the quantitative axis, the graph is called a deviation graph. On occasion the bars are replaced with pictures or symbols to make the graph more attractive or to visually represent the data series; these graphs are referred to as pictographs or pictorial bar graphs.
It may on occasion be desirable to display a confidence interval for the values plotted on the graph. In these cases the confidence intervals can be displayed by appending an error bar, or a shaded, striped, or tapered area to the end of the bar,
representing the possible values cov-
ered in the confidence interval. If
The value each bar represented was instead shown the bars are used to represent the range between
on or adjacent to the data graphic. the upper and lower values of a data series rather
An example bar chart is presented in Figure 1. than one specific value, the graph is called a range
Bar charts can display data by the use of either bar graph. Typically the lower values are plotted
horizontal or vertical bars; vertical bar charts on the left in a horizontal bar chart and on the
are also referred to as column graphs. The bars are bottom for a vertical bar chart. A line drawn across
typically of a uniform width with a uniform space the bar can designate additional or inner values,
between bars. The end of the bar represents the such as a mean or median value. When the five-
value of the category being plotted. When there is number summary (the minimum and maximum
no space between the bars, the graph is referred to values, the upper and lower quartiles, and the
as a joined bar graph and is used to emphasize the median) is displayed, the graph is commonly
differences between conditions or discrete cate- referred to as a box plot or a box-and-whisker
gories. When continuous quantitative scales are diagram.
used on both axes of a joined bar chart, the chart A simple bar graph allows the display of a single
is referred to as a histogram and is often used to data series, whereas a grouped or clustered bar
display the distribution of variables that are of graph displays two or more data series on one
interval or ratio scale. If the widths of the bars are graph (see Figure 3). In clustered bar graphs, ele-
not uniform but are instead used to display some ments of the same category are plotted side by side;
measure or characteristic of the data element different colors, shades, or patterns, explained in
represented by the bar, the graph is referred to as a legend, may be used to differentiate the various
an area bar graph (see Figure 2). In this graph, the data series, and the spaces between clusters distin-
heights of the bars represent the total earnings in guish the various categories.

[Figure: an area bar chart titled "Total Earnings and Percentage Earnings From Exports," with the vertical axis labeled Millions of USD (0 to 7) and one bar per company; bar height gives total earnings and bar width gives the percentage of earnings from exports. The plotted values are: Company A, 4.2 million USD, 40%; Company B, 2.1, 30%; Company C, 1.5, 75%; Company D, 5.7, 20%; Company E, 2.9, 50%.]

Figure 2   Area Bar Graph and Associated Data

Note: USD = U.S. dollars.

contribution of the components of the category. If the graph represents the separate components' percentage of the whole value rather than the actual values, this graph is commonly referred to as a 100% stacked bar graph. Lines can be drawn to connect the components of a stacked bar graph to more clearly delineate the relationship between the same components of different categories. A stacked bar graph can also use only one bar to demonstrate the contribution of the components of only one category, condition, or occasion, in which case it functions more like a pie chart. Two data series can also be plotted together in a paired bar graph, also referred to as a sliding bar or bilateral bar graph. This graph differs from a clustered bar graph because rather than being plotted side by side, the values for one data series are plotted with horizontal bars to the left and the values for the other data series are plotted with horizontal bars to the right. The units of measurement and scale intervals for the two data series need not be the same, allowing for a visual display of correlations and other meaningful relationships between
meaningful relationships between
the two data series. A paired bar
While there is no limit to the number of series graph can be a variation of either a simple, clus-
that can be plotted on the same graph, it is wise to tered, or stacked bar graph. A paired bar graph
limit the number of series plotted to no more than without spaces between the bars is often called
four in order to keep the graph from becoming con- a pyramid graph or a two-way histogram. Another
fusing. To reduce the size of the graph and to method for comparing two data series is the differ-
improve readability, the bars for separate categories ence bar graph. In this type of bar graph, the bars
can be overlapped, but the overlap should be less represent the difference in the values of two data
than 75% to prevent the graph from being mis- series. For instance, one could compare the perfor-
taken for a stacked bar graph. A stacked bar graph, mance of two different classes on a series of tests or
also called a divided or composite bar graph, has compare the different performance of males and
multiple series stacked end to end instead of side by females on a series of assessments. The direction of
side. This graph displays the relative contribution of the difference can be noted at the ends of bars or by
the components of a category; a different color, labeling the bars. When comparing multiple factors
shade, or pattern differentiates each component, as at two points in time or under two different condi-
described in a legend. The end of the bar represents tions, one can use a change bar graph. The bars in
the value of the whole category, and the heights of this graph are used to represent the change between
the various data series represent the relative the two conditions or times. Since the direction of

[Figure: a clustered (grouped) vertical bar chart titled "Total Earnings by Company for 2006–2008," with the vertical axis labeled Millions of USD (0 to 9) and, for each of Companies A through E, side-by-side bars for 2006, 2007, and 2008. The plotted values (2006, 2007, 2008) are: Company A, 0.5, 1, 1.3; Company B, 3, 2.1, 2.1; Company C, 5, 6.9, 8.2; Company D, 3, 3.5, 1.5; Company E, 2, 4.5, 3.1.]

Figure 3   Clustered Bar Chart and Associated Data

Note: USD = U.S. dollars.

direction and the measurement scale for the primary axes. The decision to present the data in a horizontal or vertical format is largely a matter of personal preference; a vertical presentation, however, is more intuitive for displaying amount or quantity, and a horizontal presentation makes more sense for displaying distance or time. A horizontal presentation also allows for more space for detailed labeling of the categorical axis. The choice of an appropriate scale is critical for accurate presentation of data in a bar graph. Simple changes in the starting point or the interval of a scale can make the graph look dramatically different and may possibly misrepresent the relationships within the data. The best method for avoiding this problem is to always begin the quantitative scale at 0 and to use a linear rather than a logarithmic scale. However, in cases in which the values to be represented are extremely large,
a start value of 0 effectively hides
change is usually important with these types of any differences in the data because by necessity the
graphs, a coding system is used to indicate the intervals must be extremely wide. In these cases it
direction of the change. is possible to maintain smaller intervals while still
starting the scale at 0 by the use of a clearly
marked scale break. Alternatively, one can high-
light the true relationship between the data by
Creating an Effective Bar Chart
starting the scale at 0 and adding an inset of
A well-designed bar chart can effectively communi- a small section of the larger graph to demonstrate
cate a substantial amount of information relatively the true relationship. Finally, it is important to
easily, but a poorly designed graph can create con- make sure the graph and its axes are clearly
fusion and lead to inaccurate conclusions among labeled so that the reader can understand what
readers. Choosing the correct graphing format or data are being presented. Modern technology
technique is the first step in creating an effective allows the addition of many superfluous graphical
graphical presentation of data. Bar charts are best elements to enhance the basic graph design.
used for making discrete comparisons between sev- Although the addition of these elements is a matter
eral categorical variables because the eye can spot of personal choice, it is important to remember
very small differences in relative height. However, that the primary aim of data graphics is to display
a bar chart works best with four to six categories; data accurately and clearly. If the additional ele-
attempting to display more than six categories on ments detract from this clarity of presentation,
a bar graph can lead to a crowded and confusing they should be avoided.
graph. Once an appropriate graphing technique
has been chosen, it is important to choose the Teresa P. Clark and Sara E. Bolt
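The design advice in this entry translates directly into plotting code. The following sketch assumes Python with Matplotlib and reuses the company earnings figures from Figures 1 and 3; it draws a simple vertical bar chart and a clustered bar chart, starts the quantitative scale at 0, and labels both axes, as recommended above.

```python
import matplotlib.pyplot as plt
import numpy as np

companies = ["A", "B", "C", "D", "E"]
earnings_2007 = [1.3, 2.1, 8.2, 1.5, 3.1]        # Figure 1 data (millions of USD)
by_year = {"2006": [0.5, 3, 5, 3, 2],            # Figure 3 data
           "2007": [1, 2.1, 6.9, 3.5, 4.5],
           "2008": [1.3, 2.1, 8.2, 1.5, 3.1]}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Simple bar chart: one bar per category, quantitative axis starting at 0.
ax1.bar(companies, earnings_2007)
ax1.set_ylim(bottom=0)
ax1.set_ylabel("Millions of USD")
ax1.set_title("Total earnings, 2007")

# Clustered (grouped) bar chart: bars for each year plotted side by side.
x = np.arange(len(companies))
width = 0.25
for k, (year, values) in enumerate(by_year.items()):
    ax2.bar(x + (k - 1) * width, values, width, label=year)
ax2.set_xticks(x)
ax2.set_xticklabels(companies)
ax2.set_ylabel("Millions of USD")
ax2.set_title("Total earnings by company, 2006-2008")
ax2.legend()

plt.tight_layout()
plt.show()
```

Keeping the cluster to three series and starting the y axis at 0 follows the guidance given in the preceding section.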

See also Box-and-Whisker Plot; Distribution; Graphical Display of Data; Histogram; Pie Chart

Further Readings

Cleveland, W. S., & McGill, R. (1985). Graphical perception and graphical methods for analyzing scientific data. Science, 229, 828–833.
Harris, R. L. (1999). Information graphics: A comprehensive illustrated reference. New York: Oxford University Press.
Playfair, W. (1786). The commercial and political atlas. London: Corry.
Shah, P., & Hoeffner, J. (2002). Review of graph comprehension research: Implications for instruction. Educational Psychology Review, 14, 47–69.
Spence, I. (2000). The invention and use of statistical charts. Journal de la Société Francaise de Statistique, 141, 77–81.
Tufte, E. R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics.
Wainer, H. (1996). Depicting error. American Statistician, 50, 101–111.

BARTLETT'S TEST

The assumption of equal variances across treatment groups may cause serious problems if violated in one-way analysis of variance models. A common test for homogeneity of variances is Bartlett's test. This statistical test checks whether the variances from different groups (or samples) are equal.
Suppose that there are r treatment groups and we want to test

H0: σ1² = σ2² = · · · = σr²

versus

H1: σm² ≠ σk² for some m ≠ k.

In this context, we assume that we have independently chosen random samples of size ni, i = 1, . . . , r from each of the r independent populations. Let Xij ~ N(μi, σi²) be independently distributed with a normal distribution having mean μi and variance σi² for each j = 1, . . . , ni and each i = 1, . . . , r. Let X̄i be the sample mean and Si² the sample variance of the sample taken from the ith group or population. The uniformly most powerful unbiased parametric test of size α for testing for equality of variances among r populations is known as Bartlett's test, and Bartlett's test statistic is given by

ℓ1 = [∏ i = 1..r (Si²)^wi] / [∑ i = 1..r wi Si²],

where wi = (ni − 1)/(N − r) is known as the weight for the ith group and N = ∑ i = 1..r ni is the sum of the individual sample sizes. In the equireplicate case (i.e., n1 = · · · = nr = n), the weights are equal, and wi = 1/r for each i = 1, . . . , r. The test statistic is the ratio of the weighted geometric mean of the group sample variances to their weighted arithmetic mean. The values of the test statistic are bounded as 0 ≤ ℓ1 ≤ 1 by Jensen's inequality. Large values of ℓ1 (i.e., values near 1) indicate agreement with the null hypothesis, whereas small values indicate disagreement with the null. The terminology ℓ1 is used to indicate that Bartlett's test is based on M. S. Bartlett's modification of the likelihood ratio test, wherein he replaced the sample sizes ni with their corresponding degrees of freedom, ni − 1. Bartlett did so to make the test unbiased. In the equireplicate case, Bartlett's test and the likelihood ratio test result in the same test statistic and same critical region.
The distribution of ℓ1 is complex even when the null hypothesis is true. R. E. Glaser showed that the distribution of ℓ1 could be expressed as a product of independently distributed beta random variables. In doing so he renewed much interest in the exact distribution of Bartlett's test. We reject H0 provided ℓ1 ≤ bα(n1, . . . , nr), where Pr[ℓ1 < bα(n1, . . . , nr)] = α when H0 is true. The Bartlett critical value bα(n1, . . . , nr) is indexed by level of significance and the individual sample sizes. The critical values were first tabled in the equireplicate case, and the critical value was simplified to bα(n, . . . , n) = bα(n). Tabulating critical values with unequal sample sizes becomes counterproductive because of possible combinations of groups, sample sizes, and levels of significance.

Example

Consider an experiment in which lead levels are measured at five different sites. The data in Table 1 come from Paul Berthouex and Linfield Brown:

Table 1   Ten Measurements of Lead Concentration (μg/L) Measured on Waste Water Specimens
Measurement No.

Lab 1 2 3 4 5 6 7 8 9 10
1 3.4 3.0 3.4 5.0 5.1 5.5 5.4 4.2 3.8 4.2
2 4.5 3.7 3.8 3.9 4.3 3.9 4.1 4.0 3.0 4.5
3 5.3 4.7 3.6 5.0 3.6 4.5 4.6 5.3 3.9 4.1
4 3.2 3.4 3.1 3.0 3.9 2.0 1.9 2.7 3.8 4.2
5 3.3 2.4 2.7 3.2 3.3 2.9 4.4 3.4 4.8 3.0
Source: Berthouex, P. M., & Brown, L. C. (2002). Statistics for environmental engineers (2nd ed., p. 170). Boca Raton,
FL: Lewis.

From these data one can compute the sample variances and weights, which are given below:

Labs    Weight    Variance
1       0.2       0.81778
2       0.2       0.19344
3       0.2       0.41156
4       0.2       0.58400
5       0.2       0.54267

By substituting these values into the formula for ℓ1, we obtain

ℓ1 = 0.46016 / 0.509889 = 0.90248.

Critical values, bα(r, n), of Bartlett's test are tabled for cases in which the sample sizes are equal and α = .05. These values are given in D. Dyer and Jerome Keating for various values of r, the number of groups, and n, the common sample size (see Table 2). Works by Glaser, by M. T. Chao and Glaser, and by S. B. Nandi provide tables of exact critical values of Bartlett's test. The most extensive set is contained in Dyer and Keating. Extensions (for larger numbers of groups) to the table of critical values can be found in Keating, Glaser, and N. S. Ketchum.

Approximation

In the event that the sample sizes are not equal, one can use the Dyer–Keating approximation to the critical values:

bα(a; n1, . . . , na) ≈ ∑ i = 1..a (ni/N) × bα(a, ni).

So for the lead levels, we have the following values: b0.05(5, 10) = 0.8025. At the 5% level of significance, there is not enough evidence to reject the null hypothesis of equal variances.
Because of the complexity of the distribution of ℓ1, Bartlett's test originally employed an approximation. Bartlett proved that

−ln(ℓ1)/c ~ χ²(r − 1), approximately,

where

c = [1 + (1/(3(r − 1)))(∑ i = 1..r 1/(ni − 1) − 1/(N − r))] / (N − r).

The approximation works poorly for small sample sizes. This approximation is more accurate as sample sizes increase, and it is recommended that min(ni) ≥ 3 and that most ni > 5.

Assumptions

Bartlett's test statistic is quite sensitive to nonnormality. In fact, W. J. Conover, M. E. Johnson, and M. M. Johnson echo the results of G. E. P. Box that Bartlett's test is very sensitive to samples that exhibit nonnormal kurtosis. They recommend that Bartlett's test be used only when the data conform to normality. Prior to using Bartlett's test, it is recommended that one test for normality using an appropriate test such as the Shapiro–Wilk W test. In the event that the normality assumption is violated, it is recommended that one test equality of variances using Howard Levene's test.

Mark T. Leung and Jerome P. Keating

See also Critical Value; Likelihood Ratio Statistic; Normality Assumption; Parametric Statistics; Variance

Table 2 Table of Bartlett’s Critical Values


Number of Populations, r
n 2 3 4 5 6 7 8 9 10
3 .3123 .3058 .3173 .3299 . . . . .
4 .4780 .4699 .4803 .4921 .5028 .5122 .5204 .5277 .5341
5 .5845 .5762 .5850 .5952 .6045 .6126 .6197 .6260 .6315

6 .6563 .6483 .6559 .6646 .6727 .6798 .6860 .6914 .6961


7 .7075 .7000 .7065 .7142 .7213 .7275 .7329 .7376 .7418
8 .7456 .7387 .7444 .7512 .7574 .7629 .7677 .7719 .7757
9 .7751 .7686 .7737 .7798 .7854 .7903 .7946 .7984 .8017
10 .7984 .7924 .7970 .8025 .8076 .8121 .8160 .8194 .8224

11 .8175 .8118 .8160 .8210 .8257 .8298 .8333 .8365 .8392


12 .8332 .8280 .8317 .8364 .8407 .8444 .8477 .8506 .8531
13 .8465 .8415 .8450 .8493 .8533 .8568 .8598 .8625 .8648
14 .8578 .8532 .8564 .8604 .8641 .8673 .8701 .8726 .8748
15 .8676 .8632 .8662 .8699 .8734 .8764 .8790 .8814 .8834

16 .8761 .8719 .8747 .8782 .8815 .8843 .8868 .8890 .8909


17 .8836 .8796 .8823 .8856 .8886 .8913 .8936 .8957 .8975
18 .8902 .8865 .8890 .8921 .8949 .8975 .8997 .9016 .9033
19 .8961 .8926 .8949 .8979 .9006 .9030 .9051 .9069 .9086
20 .9015 .8980 .9003 .9031 .9057 .9080 .9100 .9117 .9132

21 .9063 .9030 .9051 .9078 .9103 .9124 .9143 .9160 .9175


22 .9106 .9075 .9095 .9120 .9144 .9165 .9183 .9199 .9213
23 .9146 .9116 .9135 .9159 .9182 .9202 .9219 .9235 .9248
24 .9182 .9153 .9172 .9195 .9217 .9236 .9253 .9267 .9280
25 .9216 .9187 .9205 .9228 .9249 .9267 .9283 .9297 .9309

26 .9246 .9219 .9236 .9258 .9278 .9296 .9311 .9325 .9336


27 .9275 .9249 .9265 .9286 .9305 .9322 .9337 .9350 .9361
28 .9301 .9276 .9292 .9312 .9330 .9347 .9361 .9374 .9385
29 .9326 .9301 .9316 .9336 .9354 .9370 .9383 .9396 .9406
30 .9348 .9325 .9340 .9358 .9376 .9391 .9404 .9416 .9426

40 .9513 .9495 .9506 .9520 .9533 .9545 .9555 .9564 .9572


50 .9612 .9597 .9606 .9617 .9628 .9637 .9645 .9652 .9658
60 .9677 .9665 .9672 .9681 .9690 .9698 .9705 .9710 .9716
80 .9758 .9749 .9754 .9761 .9768 .9774 .9779 .9783 .9787
100 .9807 .9799 .9804 .9809 .9815 .9819 .9823 .9827 .9830
Source: Dyer, D., & Keating, J. P. (1980). On the determination of critical values for Bartlett’s test. Journal of the American
Statistical Association, 75, 313–319. Reprinted with permission from the Journal of the American Statistical Association.
Copyright 1980 by the American Statistical Association. All rights reserved.
Note: The table shows the critical values used in Bartlett’s test of equal variance at the 5% level of significance.
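The computations in the example above are easy to reproduce. The following Python sketch is illustrative only (the variable names and the use of NumPy/SciPy are assumptions, not part of the original example); it computes the weights, the statistic ℓ1, and Bartlett's chi-square approximation for the lead data, and the result can be compared with the tabled critical value b0.05(5, 10) = 0.8025.

import numpy as np
from scipy.stats import chi2

# Lead concentrations (μg/L) for the five labs in Table 1.
labs = [
    [3.4, 3.0, 3.4, 5.0, 5.1, 5.5, 5.4, 4.2, 3.8, 4.2],
    [4.5, 3.7, 3.8, 3.9, 4.3, 3.9, 4.1, 4.0, 3.0, 4.5],
    [5.3, 4.7, 3.6, 5.0, 3.6, 4.5, 4.6, 5.3, 3.9, 4.1],
    [3.2, 3.4, 3.1, 3.0, 3.9, 2.0, 1.9, 2.7, 3.8, 4.2],
    [3.3, 2.4, 2.7, 3.2, 3.3, 2.9, 4.4, 3.4, 4.8, 3.0],
]
n = np.array([len(x) for x in labs])                 # sample sizes n_i
s2 = np.array([np.var(x, ddof=1) for x in labs])     # sample variances S_i^2
r, N = len(labs), n.sum()
w = (n - 1) / (N - r)                                # weights (0.2 for each lab here)

# l1 = weighted geometric mean of the variances / weighted arithmetic mean
l1 = np.prod(s2 ** w) / np.sum(w * s2)               # about 0.90248

# Bartlett's chi-square approximation: -ln(l1)/c ~ chi-square with r - 1 df
c = (1 + (np.sum(1 / (n - 1)) - 1 / (N - r)) / (3 * (r - 1))) / (N - r)
stat = -np.log(l1) / c
p_value = chi2.sf(stat, r - 1)
print(l1, stat, p_value)
# l1 (about 0.90) exceeds the tabled critical value b_0.05(5, 10) = 0.8025, so the
# null hypothesis of equal variances is not rejected at the 5% level; the same
# chi-square statistic is returned by scipy.stats.bartlett(*labs).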

Further Readings

Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Statistical Society, Series A, 160, 268–282.
Berthouex, P. M., & Brown, L. C. (2002). Statistics for environmental engineers (2nd ed.). Boca Raton, FL: Lewis.
Box, G. E. P. (1953). Nonnormality and tests on variances. Biometrika, 40, 318–335.
Chao, M. T., & Glaser, R. E. (1978). The exact distribution of Bartlett's test statistic for homogeneity of variances with unequal sample sizes. Journal of the American Statistical Association, 73, 422–426.
Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests of homogeneity of variances with applications to the Outer Continental Shelf bidding data. Technometrics, 23, 351–361.
Dyer, D., & Keating, J. P. (1980). On the determination of critical values for Bartlett's test. Journal of the American Statistical Association, 75, 313–319.
Glaser, R. E. (1976). Exact critical values for Bartlett's test for homogeneity of variances. Journal of the American Statistical Association, 71, 488–490.
Keating, J. P., Glaser, R. E., & Ketchum, N. S. (1990). Testing hypotheses about the shape parameter of a gamma distribution. Technometrics, 32, 67–82.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics (pp. 278–292). Palo Alto, CA: Stanford University Press.
Madansky, A. (1989). Prescriptions for working statisticians. New York: Springer-Verlag.
Nandi, S. B. (1980). On the exact distribution of a normalized ratio of weighted geometric mean to unweighted arithmetic mean in samples from gamma distributions. Journal of the American Statistical Association, 75, 217–220.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52, 591–611.

BARYCENTRIC DISCRIMINANT ANALYSIS

Barycentric discriminant analysis (BADIA) generalizes discriminant analysis, and like discriminant analysis, it is performed when measurements made on some observations are combined to assign these observations, or new observations, to a priori defined categories. For example, BADIA can be used (a) to assign people to a given diagnostic group (e.g., patients with Alzheimer's disease, patients with other dementia, or people aging without dementia) on the basis of brain imaging data or psychological tests (here the a priori categories are the clinical groups), (b) to assign wines to a region of production on the basis of several physical and chemical measurements (here the a priori categories are the regions of production), (c) to use brain scans taken on a given participant to determine what type of object (e.g., a face, a cat, a chair) was watched by the participant when the scans were taken (here the a priori categories are the types of object), or (d) to use DNA measurements to predict whether a person is at risk for a given health problem (here the a priori categories are the types of health problem).

BADIA is more general than standard discriminant analysis because it can be used in cases for which discriminant analysis cannot be used. This is the case, for example, when there are more variables than observations or when the measurements are categorical.

BADIA is a class of methods that all rely on the same principle: Each category of interest is represented by the barycenter of its observations (i.e., the weighted average; the barycenter is also called the center of gravity of the observations of a given category), and a generalized principal components analysis (GPCA) is performed on the category by variable matrix. This analysis gives a set of discriminant factor scores for the categories and another set of factor scores for the variables. The original observations are then projected onto the category factor space, providing a set of factor scores for the observations. The distance of each observation to the set of categories is computed from the factor scores, and each observation is assigned to the closest category. The comparison between the a priori and a posteriori category assignments is used to assess the quality of the discriminant procedure. The prediction for the observations that were used to compute the barycenters is called the fixed-effect prediction. Fixed-effect performance is evaluated by counting the number of correct and incorrect assignments and storing these numbers in a confusion matrix.
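A minimal sketch of this fixed-effect classification step is given below. It is not the full BADIA procedure (the generalized PCA of the category-by-variable matrix and any observation weighting are omitted here, which is a simplifying assumption); the Python code only illustrates the principle of representing each category by the barycenter of its observations, assigning each observation to the closest barycenter, and summarizing the result in a confusion matrix.

import numpy as np

def nearest_barycenter_classify(X, labels):
    """X: (n_observations, n_variables) array; labels: length-n list of category labels."""
    labels = np.array(labels)
    categories = sorted(set(labels))
    # Barycenter (here the plain mean) of the observations in each category.
    barycenters = {c: X[labels == c].mean(axis=0) for c in categories}
    assigned = []
    for row in X:
        dists = {c: np.linalg.norm(row - b) for c, b in barycenters.items()}
        assigned.append(min(dists, key=dists.get))
    # Confusion matrix: rows = a priori category, columns = a posteriori category.
    confusion = np.zeros((len(categories), len(categories)), dtype=int)
    for true, pred in zip(labels, assigned):
        confusion[categories.index(true), categories.index(pred)] += 1
    return assigned, confusion

# Hypothetical example: 6 observations, 2 variables, 2 categories.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])
assigned, confusion = nearest_barycenter_classify(X, ["A", "A", "A", "B", "B", "B"])
print(assigned)
print(confusion)   # diagonal counts are the correct fixed-effect assignments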

Another index of the performance of the fixed-effect model—equivalent to a squared coefficient of correlation—is the ratio of category variance to the sum of category variance plus variance of the observations within each category. This coefficient is denoted R² and is interpreted as the proportion of variance of the observations explained by the categories or as the proportion of the variance explained by the discriminant model. The performance of the fixed-effect model can also be represented graphically as a tolerance ellipsoid that encompasses a given proportion (say 95%) of the observations. The overlap between the tolerance ellipsoids of two categories is proportional to the number of misclassifications between these two categories.

New observations can also be projected onto the discriminant factor space, and they can be assigned to the closest category. When the actual assignment of these observations is not known, the model can be used to predict category membership. The model is then called a random model (as opposed to the fixed model). An obvious problem, then, is to evaluate the quality of the prediction for new observations. Ideally, the performance of the random-effect model is evaluated by counting the number of correct and incorrect classifications for new observations and computing a confusion matrix on these new observations. However, it is not always practical or even feasible to obtain new observations, and therefore the random-effect performance is, in general, evaluated using computational cross-validation techniques such as the jackknife or the bootstrap. For example, a jackknife approach (also called leave one out) can be used by which each observation is taken out of the set, in turn, and predicted from the model built on the other observations. The predicted observations are then projected in the space of the fixed-effect discriminant scores. This can also be represented graphically as a prediction ellipsoid. A prediction ellipsoid encompasses a given proportion (say 95%) of the new observations. The overlap between the prediction ellipsoids of two categories is proportional to the number of misclassifications of new observations between these two categories.

The stability of the discriminant model can be assessed by a cross-validation model such as the bootstrap. In this procedure, multiple sets of observations are generated by sampling with replacement from the original set of observations, and the category barycenters are computed from each of these sets. These barycenters are then projected onto the discriminant factor scores. The variability of the barycenters can be represented graphically as a confidence ellipsoid that encompasses a given proportion (say 95%) of the barycenters. When the confidence intervals of two categories do not overlap, these two categories are significantly different.

In summary, BADIA is a GPCA performed on the category barycenters. GPCA encompasses various techniques, such as correspondence analysis, biplot, Hellinger distance analysis, discriminant analysis, and canonical variate analysis. For each specific type of GPCA, there is a corresponding version of BADIA. For example, when the GPCA is correspondence analysis, this is best handled with the most well-known version of BADIA: discriminant correspondence analysis. Because BADIA is based on GPCA, it can also analyze data tables obtained by the concatenation of blocks (i.e., subtables). In this case, the importance (often called the contribution) of each block to the overall discrimination can also be evaluated and represented as a graph.

Hervé Abdi and Lynne J. Williams

See also Bootstrapping; Canonical Correlation Analysis; Correspondence Analysis; Discriminant Analysis; Jackknife; Matrix Algebra; Principal Components Analysis

Further Readings

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. London: Academic Press.
Saporta, G., & Niang, N. (2006). Correspondence analysis and classification. In M. Greenacre & J. Blasius (Eds.), Multiple correspondence analysis and related methods (pp. 371–392). Boca Raton, FL: Chapman & Hall/CRC.

BAYES'S THEOREM

Bayes's theorem is a simple mathematical formula used for calculating conditional probabilities. It figures prominently in subjectivist or Bayesian approaches to statistics, epistemology, and inductive logic. Subjectivists, who maintain that rational belief is governed by the laws of probability, lean heavily on conditional probabilities in their theories of evidence and their models of empirical learning. Bayes's theorem is central to these paradigms because it simplifies the calculation of conditional probabilities and clarifies significant features of the subjectivist position.

This entry begins with a brief history of Thomas Bayes and the publication of his theorem. Next, the entry focuses on probability and its role in Bayes's theorem. Last, the entry explores modern applications of Bayes's theorem.

History

Thomas Bayes was born in 1702, probably in London, England. Others have suggested the place of his birth to be Hertfordshire. He was the eldest of six children of Joshua and Ann Carpenter Bayes. His father was a nonconformist minister, one of the first seven in England. Information on Bayes's childhood is scarce. Some sources state that he was privately educated, and others state he received a liberal education to prepare for the ministry. After assisting his father for many years, he spent his adult life as a Presbyterian minister at the chapel in Tunbridge Wells. In 1742, Bayes was elected as a fellow by the Royal Society of London. He retired in 1752 and remained in Tunbridge Wells until his death in April of 1761.

Throughout his life he wrote very little, and only two of his works are known to have been published. These two essays are Divine Benevolence, published in 1731, and Introduction to the Doctrine of Fluxions, published in 1736. He was known as a mathematician not for these essays but for two other papers he had written but never published. His studies focused in the areas of probability and statistics. His posthumously published article now known by the title "An Essay Towards Solving a Problem in the Doctrine of Chances" developed the idea of inverse probability, which later became associated with his name as Bayes's theorem. Inverse probability was so called because it involves inferring backwards from the data to the parameter (i.e., from the effect to the cause). Initially, Bayes's ideas attracted little attention. It was not until after the French mathematician Pierre-Simon Laplace published his paper "Mémoire sur la Probabilité des Causes par les Évènements" in 1774 that Bayes's ideas gained wider attention. Laplace extended the use of inverse probability to a variety of distributions and introduced the notion of "indifference" as a means of specifying prior distributions in the absence of prior knowledge. Inverse probability became during the 19th century the most commonly used method for making statistical inferences. Some of the more famous examples of the use of inverse probability to draw inferences during this period include estimation of the mass of Saturn, the probability of the birth of a boy at different locations, the utility of antiseptics, and the accuracy of judicial decisions.

In the latter half of the 19th century, authorities such as Siméon-Denis Poisson, Bernard Bolzano, Robert Leslie Ellis, Jakob Friedrich Fries, John Stuart Mill, and A. A. Cournot began to make distinctions between probabilities about things and probabilities involving our beliefs about things. Some of these authors attached the terms objective and subjective to the two types of probability. Toward the end of the century, Karl Pearson, in his Grammar of Science, argued for using experience to determine prior distributions, an approach that eventually evolved into what is now known as empirical Bayes. The Bayesian idea of inverse probability was also being challenged toward the end of the 19th century, with the criticism focusing on the use of uniform or "indifference" prior distributions to express a lack of prior knowledge.

The criticism of Bayesian ideas spurred research into statistical methods that did not rely on prior knowledge and the choice of prior distributions. In 1922, Ronald Aylmer Fisher's paper "On the Mathematical Foundations of Theoretical Statistics," which introduced the idea of likelihood and maximum likelihood estimates, revolutionized modern statistical thinking. Jerzy Neyman and Egon Pearson extended Fisher's work by adding the ideas of hypothesis testing and confidence intervals. Eventually the collective work of Fisher, Neyman, and Pearson became known as frequentist methods. From the 1920s to the 1950s, frequentist methods displaced inverse probability as the primary methods used by researchers to make statistical inferences.

Interest in using Bayesian methods for statistical inference revived in the 1950s, inspired by Leonard Jimmie Savage's 1954 book The Foundations of Statistics. Savage's work built on previous work of several earlier authors exploring the idea of subjective probability, in particular the work of Bruno de Finetti. It was during this time that the terms Bayesian and frequentist began to be used to refer to the two statistical inference camps. The number of papers and authors using Bayesian statistics continued to grow in the 1960s. Examples of Bayesian research from this period include an investigation by Frederick Mosteller and David Wallace into the authorship of several of the Federalist papers and the use of Bayesian methods to estimate the parameters of time-series models. The introduction of Monte Carlo Markov chain (MCMC) methods to the Bayesian world in the late 1980s made computations that were impractical or impossible earlier realistic and relatively easy. The result has been a resurgence of interest in the use of Bayesian methods to draw statistical inferences.

Publishing of Bayes's Theorem

Bayes never published his mathematical papers, and therein lies a mystery. Some suggest his theological concerns with modesty might have played a role in his decision. However, after Bayes's death, his family asked Richard Price to examine Bayes's work. Price was responsible for the communication of Bayes's essay on probability and chance to the Royal Society. Although Price was making Bayes's work known, he was occasionally mistaken for the author of the essays and for a time received credit for them. In fact, Price only added introductions and appendixes to works he had published for Bayes, although he would eventually write a follow-up paper to Bayes's work.

The present form of Bayes's theorem was actually derived not by Bayes but by Laplace. Laplace used the information provided by Bayes to construct the theorem in 1774. Only in later papers did Laplace acknowledge Bayes's work.

Inspiration of Bayes's Theorem

In "An Essay Towards Solving a Problem in the Doctrine of Chances," Bayes posed a problem to be solved: "Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named." Bayes's reasoning began with the idea of conditional probability:

If P(B) > 0, the conditional probability of A given B, denoted by P(A|B), is

P(A|B) = P(A ∩ B)/P(B) or P(AB)/P(B).

Bayes's main focus then became defining P(B|A) in terms of P(A|B).

A key component that Bayes needed was the law of total probability. Sometimes it is not possible to calculate the probability of the occurrence of an event A. However, it is possible to find P(A|B) and P(A|Bᶜ) for some event B, where Bᶜ is the complement of B. The weighted average, P(A), of the probability of A given that B has occurred and the probability of A given that B has not occurred can be defined as follows:

Let B be an event with P(B) > 0 and P(Bᶜ) > 0. Then for any event A,

P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).

If there are k events, B1, . . . , Bk, that form a partition of the sample space, and A is another event in the sample space, then the events B1A, B2A, . . . , BkA will form a partition of A. Thus, the law of total probability can be extended as follows:

Let Bj be an event with P(Bj) > 0 for j = 1, . . . , k. Then for any event A,

P(A) = ∑_{j=1}^{k} P(Bj)P(A|Bj).

These basic rules of probability served as the inspiration for Bayes's theorem.

Bayes's Theorem

Bayes's theorem allows for a reduction in uncertainty by considering events that have occurred. The theorem is applicable as long as the probability of the more recent event (given an earlier event) is known. With this theorem, one can find the probability of the earlier event, given the more recent event that has occurred. The earlier event is known as the prior probability. The primary focus is on the probability of the earlier event given the more recent event that has occurred (called the posterior probability). The theorem can be described in the following manner:

Let Bj be an event with P(Bj) > 0 for j = 1, . . . , k and forming a partition of the sample space. Furthermore, let A be an event such that P(A) > 0. Then for i = 1, . . . , k,

P(Bi|A) = P(Bi)P(A|Bi) / ∑_{j=1}^{k} P(Bj)P(A|Bj).

P(Bi) is the prior probability and the probability of the earlier event. P(A|Bi) is the probability of the more recent event given the prior has occurred and is referred to as the likelihood. P(Bi|A) is what one is solving for and is the probability of the earlier event given that the recent event has occurred (the posterior probability). There is also a version of Bayes's theorem based on a secondary event C:

P(Bi|AC) = P(Bi|C)P(A|BiC) / ∑_{j=1}^{k} P(Bj|C)P(A|BjC).

Example

A box contains 7 red and 13 blue balls. Two balls are selected at random and are discarded without their colors being seen. If a third ball is drawn randomly and observed to be red, what is the probability that both of the discarded balls were blue?

Let BB, BR, and RR represent the events that the discarded balls are blue and blue, blue and red, and red and red, respectively. Let R represent the event that the third ball chosen is red. Solve for the posterior probability P(BB|R):

P(BB|R) = P(R|BB)P(BB) / [P(R|BB)P(BB) + P(R|BR)P(BR) + P(R|RR)P(RR)].

The probability that the first two balls drawn were blue, red, or blue and red are 39/95, 21/190, and 91/190, in that order. Now,

P(BB|R) = (7/18)(39/95) / [(7/18)(39/95) + (6/18)(91/190) + (5/18)(21/190)] ≈ 0.46.

The probability that the first two balls chosen were blue given the third ball selected was red is approximately 46%.

Bayes's theorem can be applied to the real world to make appropriate estimations of probability in a given situation. Diagnostic testing is one example in which the theorem is a useful tool. Diagnostic testing identifies whether a person has a particular disease. However, these tests contain error. Thus, a person can test positive for the disease and in actuality not be carrying the disease. Bayes's theorem can be used to estimate the probability that a person truly has the disease given that the person tests positive. As an illustration of this, suppose that a particular cancer is found for every 1 person in 2,000. Furthermore, if a person has the disease, there is a 90% chance the diagnostic procedure will result in a positive identification. If a person does not have the disease, the test will give a false positive 1% of the time. Using Bayes's theorem, the probability that a person with a positive test result actually has the cancer, C, is

P(C|P) = (1/2000)(0.90) / [(1/2000)(0.90) + (1999/2000)(0.01)] ≈ 0.043.

If a person tests positive for the cancer test, there is only a 4% chance that the person has the cancer. Consequently, follow-up tests are almost always necessary to verify a positive finding with medical screening tests.

Bayes's theorem has also been used in psychometrics to make a classification scale rather than an ability scale in the classroom. A simple example of classification is dividing a population into two categories of mastery and nonmastery of a subject. A test would be devised to determine whether a person falls in the mastery or the nonmastery category. The posterior probabilities for different skills can be collected, and the results would show mastered skills and nonmastered skills that need attention. The test may even allow for new posterior probabilities to be computed after each question.
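Both worked examples can be checked numerically. The short Python sketch below is illustrative (the function and variable names are assumptions, not part of the original examples); it applies the law of total probability to form the denominator and then Bayes's theorem to obtain the posterior probabilities.

def posterior(priors, likelihoods):
    """priors[i] = P(B_i); likelihoods[i] = P(A | B_i); returns the list P(B_i | A)."""
    total = sum(p * l for p, l in zip(priors, likelihoods))   # law of total probability
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Ball example: priors for (BB, BR, RR) and P(third ball red | each event).
print(posterior([39 / 95, 91 / 190, 21 / 190], [7 / 18, 6 / 18, 5 / 18])[0])
# P(BB | R) is about 0.456, i.e., approximately 46%.

# Cancer-screening example: priors for (cancer, no cancer), P(positive | each).
print(posterior([1 / 2000, 1999 / 2000], [0.90, 0.01])[0])    # about 0.043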

The two examples presented above are just a small sample of the applications in which Bayes's theorem has been useful. While certain academic fields concentrate on its use more than others do, the theorem has far-reaching influence in business, medicine, education, psychology, and so on.

Bayesian Statistical Inference

Bayes's theorem provides a foundation for Bayesian statistical inference. However, the approach to inference is different from that of a traditional (frequentist) point of view. With Bayes's theorem, inference is dynamic. That is, a Bayesian approach uses evidence about a phenomenon to update knowledge of prior beliefs.

There are two popular ways to approach inference. The traditional way is the frequentist approach, in which the probability P of an uncertain event A, written P(A), is defined by the frequency of that event, based on previous observations. In general, population parameters are considered as fixed effects and do not have distributional form. The frequentist approach to defining the probability of an uncertain event is sufficient, provided that one has been able to record accurate information about many past instances of the event. However, if no such historical database exists, then a different approach must be considered.

Bayesian inference is an approach that allows one to reason about beliefs under conditions of uncertainty. Different people may have different beliefs about the probability of a prior event, depending on their specific knowledge of factors that might affect its likelihood. Thus, Bayesian inference has no one correct probability or approach. Bayesian inference is dependent on both prior and observed data.

In a traditional hypothesis test, there are two complementary hypotheses: H0, the status quo hypothesis, and H1, the hypothesis of change. Letting D stand for the observed data, Bayes's theorem applied to the hypotheses becomes

P(H0|D) = P(H0)P(D|H0) / [P(H0)P(D|H0) + P(H1)P(D|H1)]

and

P(H1|D) = P(H1)P(D|H1) / [P(H0)P(D|H0) + P(H1)P(D|H1)].

The P(H0|D) and P(H1|D) are posterior probabilities (i.e., the probability that the null is true given the data and the probability that the alternative is true given the data, respectively). The P(H0) and P(H1) are prior probabilities (i.e., the probability that the null or the alternative is true prior to considering the new data, respectively).

In frequentist hypothesis testing, one considers only P(D|H0), which is called the p value. If the p value is smaller than a predetermined significance level, then one rejects the null hypothesis and asserts the alternative hypothesis. One common mistake is to interpret the p value as the probability that the null hypothesis is true, given the observed data. This interpretation is a Bayesian one. From a Bayesian perspective, one may obtain P(H0|D), and if that probability is sufficiently small, then one rejects the null hypothesis in favor of the alternative hypothesis. In addition, with a Bayesian approach, several alternative hypotheses can be considered at one time.

As with traditional frequentist confidence intervals, a credible interval can be computed in Bayesian statistics. This credible interval is defined as the posterior probability interval and is used in ways similar to the uses of confidence intervals in frequentist statistics. For example, a 95% credible interval means that the posterior probability of the parameter lying in the given range is 0.95. A frequentist 95% confidence interval means that with a large number of repeated samples, 95% of the calculated confidence intervals would include the true value of the parameter; yet the probability that the parameter is inside the actual calculated confidence interval is either 0 or 1. In general, Bayesian credible intervals do not match a frequentist confidence interval, since the credible interval incorporates information from the prior distribution whereas confidence intervals are based only on the data.
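As a numerical illustration of these ideas, the following Python sketch uses hypothetical data (7 heads in 10 coin flips) that are not part of the entry. It computes the posterior probabilities of two simple hypotheses about the probability of heads and a 95% credible interval from a conjugate Beta posterior.

from math import comb
from scipy.stats import beta

heads, n = 7, 10

# Two simple hypotheses with equal prior probabilities: H0: p = 0.5 versus H1: p = 0.8.
def binom_lik(p):                      # P(D | H) for the observed data
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

num0 = 0.5 * binom_lik(0.5)
num1 = 0.5 * binom_lik(0.8)
print(num0 / (num0 + num1))            # P(H0 | D), directly interpretable as a probability

# 95% credible interval for p under a uniform Beta(1, 1) prior:
# the posterior is Beta(1 + heads, 1 + n - heads).
print(beta.ppf([0.025, 0.975], 1 + heads, 1 + n - heads))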

Modern Applications

In Bayesian statistics, information about the data and a priori information are combined to estimate the posterior distribution of the parameters. This posterior distribution is used to infer the values of the parameters, along with the associated uncertainty. Multiple tests and predictions can be performed simultaneously and flexibly. Quantities of interest that are functions of the parameters are straightforward to estimate, again including the uncertainty. Posterior inferences can be updated as more data are obtained, so study design is more flexible than for frequentist methods.

Bayesian inference is possible in a number of contexts in which frequentist methods are deficient. For instance, Bayesian inference can be performed with small data sets. More broadly, Bayesian statistics is useful when the data set may be large but when few data points are associated with a particular treatment. In such situations standard frequentist estimators can be inappropriate because the likelihood may not be well approximated by a normal distribution. The use of Bayesian statistics also allows for the incorporation of prior information and for simultaneous inference using data from multiple studies. Inference is also possible for complex hierarchical models.

Lately, computation for Bayesian models is most often done via MCMC techniques, which obtain dependent samples from the posterior distribution of the parameters. In MCMC, a set of initial parameter values is chosen. These parameter values are then iteratively updated via a specially constructed Markovian transition. In the limit of the number of iterations, the parameter values are distributed according to the posterior distribution. In practice, after approximate convergence of the Markov chain, the time series of sets of parameter values can be stored and then used for inference via empirical averaging (i.e., Monte Carlo). The accuracy of this empirical averaging depends on the effective sample size of the stored parameter values, that is, the number of iterations of the chain after convergence, adjusted for the autocorrelation of the chain. One method of specifying the Markovian transition is via Metropolis–Hastings, which proposes a change in the parameters, often according to a random walk (the assumption that many unpredictable small fluctuations will occur in a chain of events), and then accepts or rejects that move with a probability that is dependent on the current and proposed state.

In order to perform valid inference, the Markov chain must have approximately converged to the posterior distribution before the samples are stored and used for inference. In addition, enough samples must be stored after convergence to have a large effective sample size; if the autocorrelation of the chain is high, then the number of samples needs to be large. Lack of convergence or high autocorrelation of the chain is detected via convergence diagnostics, which include autocorrelation and trace plots, as well as Geweke, Gelman–Rubin, and Heidelberger–Welch diagnostics. Software for MCMC can also be validated by a distinct set of techniques. These techniques compare the posterior samples drawn by the software with samples from the prior and the data model, thereby validating the joint distribution of the data and parameters as estimated by the software.
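The random-walk Metropolis–Hastings cycle just described can be sketched in a few lines of Python. The target distribution (a standard normal posterior) and all tuning values below are hypothetical; the sketch only shows the propose/accept-or-reject step and why early, pre-convergence draws are discarded.

import math
import random

def log_posterior(theta):
    return -0.5 * theta**2            # stand-in for log P(theta | data), up to a constant

def metropolis_hastings(n_iter=5000, step=1.0, theta0=0.0, burn_in=1000):
    theta, draws = theta0, []
    for i in range(n_iter):
        proposal = theta + random.gauss(0.0, step)        # random-walk proposal
        log_accept = log_posterior(proposal) - log_posterior(theta)
        if math.log(random.random()) < log_accept:        # accept with probability min(1, ratio)
            theta = proposal
        if i >= burn_in:                                   # keep only post-convergence draws
            draws.append(theta)
    return draws

draws = metropolis_hastings()
print(sum(draws) / len(draws))        # Monte Carlo estimate of the posterior mean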

Brandon K. Vaughn and Daniel L. Murphy

See also Estimation; Hypothesis; Inference: Deductive and Inductive; Parametric Statistics; Probability, Laws of

Further Readings

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Box, G. E., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Dale, A. I. (1991). A history of inverse probability from Thomas Bayes to Karl Pearson. New York: Springer-Verlag.
Daston, L. (1994). How probabilities came to be objective and subjective. Historia Mathematica, 21, 330–344.
Fienberg, S. E. (1992). A brief history of statistics in three and one-half chapters: A review essay. Statistical Science, 7, 208–225.
Fienberg, S. E. (2006). When did Bayesian analysis become "Bayesian"? Bayesian Analysis, 1(1), 1–40.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309–368.
Laplace, P. S. (1774). Mémoire sur la probabilité des causes par les évènements [Memoir on the probability of causes of events]. Mémoires de la Mathématique et de Physique Presentés à l'Académie Royale des Sciences, Par Divers Savans, & Lûs dans ses Assemblées, 6, 621–656.
Mosteller, F., & Wallace, D. L. (1963). Inference in an authorship problem. Journal of the American Statistical Association, 59, 275–309.
Pearson, K. (1892). The grammar of science. London: Walter Scott.
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Harvard University Press.

BEHAVIOR ANALYSIS DESIGN

Behavior analysis is a specific scientific approach to studying behavior that evolved from John Watson's behaviorism and the operant research model popularized by B. F. Skinner during the middle of the 20th century. This approach stresses direct experimentation and measurement of observable behavior. A basic assumption of behavior analysis is that behavior is malleable and controlled primarily by consequences. B. F. Skinner described the basic unit of behavior as an operant, a behavior emitted to operate on the environment. Additionally, he proposed that the response rate of the operant serve as the basic datum of the scientific study of behavior. An operant is characterized by a response that occurs within a specific environment and produces a specific consequence. According to the principles of operant conditioning, behavior is a function of three interactive components, illustrated by the three-term contingency: context, response, and consequences of behavior. The relationship between these three variables forms the basis of all behavioral research. Within this framework, individual components of the three-term contingency can be studied by manipulating experimental context, response requirements, or consequences of behavior. A change in any one of these components often changes the overall function of behavior, resulting in a change in future behavior. If the consequence strengthens future behavior, the process is called reinforcement. If future behavior is weakened or eliminated as a result of changing the consequence of behavior, the process is called punishment.

Behavior analysis encompasses two types of research: the experimental analysis of behavior, consisting of research to discover basic underlying behavioral principles, and applied behavior analysis, involving research implementing basic principles in real-world situations. Researchers in this field are often referred to as behavior analysts, and their research can take place in both laboratory and naturalistic settings and with animals and humans. Basic behavioral processes can be studied in any species, and the findings may be applied to other species. Therefore, researchers can use animals for experimentation, which can increase experimental control by eliminating or reducing confounding variables. Since it is important to verify that findings generalize across species, experiments are often replicated with other animals and with humans. Applied behavior analysis strives to develop empirically based interventions rooted in principles discovered through basic research. Many empirically based treatments have been developed with participants ranging from children with autism to corporate executives and to students and substance abusers. Contributions have been made in developmental disabilities, mental retardation, rehabilitation, delinquency, mental health, counseling, education and teaching, business and industry, and substance abuse and addiction, with potential in many other areas of social significance. Similar designs are employed in both basic and applied research, but they differ with regard to subjects studied, experimental settings, and degree of environmental control.

Regardless of the subject matter, a primary feature of behavior analytic research is that the behavior of individual organisms is examined under conditions that are rigorously controlled. One subject can provide a representative sample, and studying an individual subject thoroughly can sometimes provide more information than can studying many subjects because each subject's data are considered an independent replication. Behavior analysts demonstrate the reliable manipulation of behavior by changing the environment. Manipulating the environment allows researchers to discover the relationships between behavior and environment.

This method is referred to as single-subject or within-subject research and requires unique designs, which have been outlined by James Johnston and Hank Pennypacker. Consequently, this method takes an approach to the collection, validity, analysis, and generality of data that is different from approaches that primarily use group designs and inferential statistics to study behavior.

Measurement Considerations

Defining Response Classes

Measurement in single-subject design is objective and restricted to observable phenomena. Measurement considerations can contribute to behavioral variability that can obscure experimental effects, so care must be taken to avoid potential confounding variables. Measurement focuses on targeting a response class, which is any set of responses that result in the same environmental change. Response classes are typically defined by function rather than topography. This means that the form of the responses may vary considerably but produce the same result. For example, a button can be pressed several ways, with one finger, with the palm, with the toe, or with several fingers. The exact method of action is unimportant, but any behavior resulting in button depression is part of a response class. Topographical definitions are likely to result in classes that include some or all of several functional response classes, which can produce unwanted variability. Researchers try to arrange the environment to minimize variability within a clearly defined response class.

There are many ways to quantify the occurrence of a response class member. The characteristics of the behavior captured in its definition must suit the needs of the experiment, be able to address the experimental question, and meet practical limits for observation. In animal studies, a response is typically defined as the closing of a circuit in an experimental chamber by depressing a lever or pushing a key or button. With this type of response, the frequency and duration that a circuit is closed can be recorded. Conditions can be arranged to measure the force used to push the button or lever, the amount of time that occurs between responses, and the latency and accuracy of responding in relation to some experimentally arranged stimulus. These measurements serve as dependent variables. In human studies, the response is typically more broadly defined and may be highly individualized. For example, self-injurious behavior in a child with autism may include many forms that meet a common definition of minimum force that leaves a mark. Just as in basic research, a variety of behavioral measurements can be used as dependent variables. The response class must be sensitive to the influence of the independent variable (IV) without being affected by extraneous variables so that effects can be detected. The response class must be defined in such a way that researchers can clearly observe and record behavior.

Observation and Recording

Once researchers define a response class, the methods of observation and recording are important in order to obtain a complete and accurate record of the subject's behavior. Measurement is direct when the focus of the experiment is the same as the phenomenon being measured. Indirect measurement is typically avoided in behavioral research because it undermines experimental control. Mechanical, electrical, or electronic devices can be used to record responses, or human observers can be selected and trained for data collection. Machine and human observations may be used together throughout an experiment. Behavior is continuous, so observational procedures must be designed to detect and record each response within the targeted response class.

Experimental Design and Demonstration of Experimental Effects

Experimental Arrangements

The most basic single-subject experimental design is the baseline–treatment sequence, the AB design. This procedure cannot account for certain confounds, such as maturation, environmental history, or unknown extraneous variables. Replicating components of the AB design provides additional evidence that the IV is the source of any change in the dependent measure. Replication designs consist of a baseline or control condition (A), followed by one or more experimental or treatment conditions (B), with additional conditions indicated by successive letters.

Subjects experience both the control and the experimental conditions, often in sequence and perhaps more than once. An ABA design replicates the original baseline, while an ABAB design replicates the baseline and the experimental conditions, allowing researchers to infer causal relationships between variables. These designs can be compared with a light switch. The first time one moves the switch from the on position to the off position, one cannot be completely certain that one's behavior was responsible for the change in lighting conditions. One cannot be sure the light bulb did not burn out at that exact moment or the electricity did not shut off coincidentally. Confidence is bolstered when one pushes the switch back to the on position and the lights turn back on. With a replication of moving the switch to off again, one has total confidence that the switch is controlling the light.

Single-subject research determines the effectiveness of the IV by eliminating or holding constant any potential confounding sources of variability. One or more behavioral measures are used as dependent variables so that data comparisons are made from one condition to another. Any change in behavior between the control and the experimental conditions is attributed to the effects of the IV. The outcome provides a detailed interpretation of the effects of an IV on the behavior of the subject.

Replication designs work only in cases in which effects are reversible. Sequence effects can occur when experience in one experimental condition affects a subject's behavior in subsequent conditions. The researcher must be careful to ensure consistent experimental conditions over replications. Multiple-baseline designs with multiple individuals, multiple behaviors, or multiple settings can be used in circumstances in which sequence effects occur, or as a variation on the AB design. Results are compared across control and experimental conditions, and factors such as irreversibility of effects, maturation of the subject, and sequence effect can be examined.

Behavioral Variability

Variability in single-subject design refers both to variations in features of responding within a single response class and to variations in summary measures of that class, which researchers may be examining across sessions or entire phases of the experiment. The causes of variability can often be identified and systematically evaluated. Behavior analysts have demonstrated that frequently changing the environment results in greater degrees of variability. Inversely, holding the environment constant for a time allows behavior to stabilize and minimizes variability. Murray Sidman has offered several suggestions for decreasing variability, including strengthening the variables that directly maintain the behavior of interest, such as increasing deprivation, increasing the intensity of the consequences, making stimuli more detectable, or providing feedback to the subject. If these changes do not immediately affect variability, it could be that behavior requires exposure to the condition for a longer duration. Employing these strategies to control variability increases the likelihood that results can be interpreted and replicated.

Reduction of Confounding Variables

Extraneous, or confounding, variables affect the detection of behavioral change due to the IV. Only by eliminating or minimizing external sources of variability can data be judged as accurately reflecting performance. Subjects should be selected that are similar along extra-experimental dimensions in order to reduce extraneous sources of variability. For example, it is common practice to use animals from the same litter or to select human participants on the basis of age, level of education, or socioeconomic status. Environmental history of an organism can also influence the target behavior; therefore, subject selection methods should attempt to minimize differences between subjects. Some types of confounding variables cannot be removed, and the researcher must design an experiment to minimize their effects.

Steady State Behavior

Single-subject designs rely on the collection of steady state baseline data prior to the administration of the IV. Steady states are obtained by exposing the subject to only one condition consistently until behavior stabilizes over time. Stabilization is determined by graphically examining the variability in behavior.

Stability can be defined as a pattern of responding that exhibits relatively little variation in its measured dimensional quantities over time.

Stability criteria specify the standards for evaluating steady states. Dimensions of behavior such as duration, latency, rate, and intensity can be judged as stable or variable during the course of experimental study, with rate most commonly used to determine behavioral stability. Stability criteria must set limits on two types of variability over time. The first is systematic increases and decreases of behavior, or trend, and the second is unsystematic changes in behavior, or bounce. Only when behavior is stable, without trend or bounce, should the next condition be introduced. Specific stability criteria include time, visual inspection of graphical data, and simple statistics. Time criteria can designate the number of experimental sessions or discrete period in which behavior stabilizes. The time criterion chosen must encompass even the slowest subject. A time criterion allowing for longer exposure to the condition may needlessly lengthen the experiment if stability occurs rapidly; on the other hand, behavior might still be unstable, necessitating experience and good judgment when a time criterion is used. A comparison of steady state behavior under baseline and different experimental conditions allows researchers to examine the effects of the IV.

Scientific Discovery Through Data Analysis

Single-subject designs use visual comparison of steady state responding between conditions as the primary method of data analysis. Visual analysis usually involves the assessment of several variables evident in graphed data. These variables include upward or downward trend, the amount of variability within and across conditions, and differences in means and stability both within and across conditions. Continuous data are displayed against the smallest unit of time that is likely to show systematic variability. Cumulative graphs provide the greatest level of detail by showing the distribution of individual responses over time and across various stimulus conditions. Data can be summarized with less precision by the use of descriptive statistics such as measures of central tendency (mean and median), variation (interquartile range and standard deviation), and association (correlation and linear regression). These methods obscure individual response variability but can highlight the effects of the experimental conditions on responding, thus promoting steady states. Responding summarized across individual sessions represents some combination of individual responses across a group of sessions, such as mean response rate during baseline conditions. This method should not be the only means of analysis but is useful when one is looking for differences among sets of sessions sharing common characteristics.

Single-subject design uses ongoing behavioral data to establish steady states and make decisions about the experimental conditions. Graphical analysis is completed throughout the experiment, so any problems with the design or measurement can be uncovered immediately and corrected. However, graphical analysis is not without criticism. Some have found that visual inspection can be insensitive to small but potentially important differences of graphic data. When evaluating the significance of data from this perspective, one must take into account the magnitude of the effect, variability in data, adequacy of experimental design, value of misses and false alarms, social significance, durability of behavior change, and number and kinds of subjects. The best approach to analysis of behavioral data probably uses some combination of both graphical and statistical methods because each approach has relative advantages and disadvantages.

Judging Significance

Changes in level, trend, variability, and serial dependency must be detected in order for one to evaluate behavioral data. Level refers to the general magnitude of behavior for some specific dimension. For example, 40 responses per minute is a lower level than 100 responses per minute. Trend refers to the increasing or decreasing nature of behavior change. Variability refers to changes in behavior from measurement to measurement. Serial dependency occurs when a measurement obtained during one time period is related to a value obtained earlier.

Several features of graphs are important, such as trend lines, axis units, number of data points, and condition demarcation. Trend lines are lines that fit the data best within a condition.

These lines allow for discrimination of level and may assist in discrimination of behavioral trends. The axis serves as an anchor for data, and data points near the bottom of a graph are easier to interpret than data in the middle of a graph. The number of data points also seems to affect decisions, with fewer points per phase improving accuracy.

Generality

Generality, or how the results of an individual experiment apply in a broader context outside the laboratory, is essential to advancing science. The dimensions of generality include subjects, response classes, settings, species, variables, methods, and processes. Single-subject designs typically involve a small number of subjects that are evaluated numerous times, permitting in-depth analysis of these individuals and the phenomenon in question, while providing systematic replication. Systematic replication enhances generality of findings to other populations or conditions and increases internal validity. The internal validity of an experiment is demonstrated when additional subjects demonstrate similar behavior under similar conditions; although the absolute level of behavior may vary among subjects, the relationship between the IV and the relative effect on behavior has been reliably demonstrated, illustrating generalization.

Jennifer L. Bredthauer and Wendy D. Donlin-Washington

See also Animal Research; Applied Research; Experimental Design; Graphical Display of Data; Independent Variable; Research Design Principles; Single-Subject Design; Trend Analysis; Within-Subjects Design

Further Readings

Bailey, J. S., & Burch, M. R. (2002). Research methods in applied behavior analysis. Thousand Oaks, CA: Sage.
Baron, A., & Perone, M. (1998). Experimental design and analysis in the laboratory of human operant behavior. In K. A. Lattal & M. Perone (Eds.), Handbook of methods in human operant behavior (pp. 3–14). New York: Plenum Press.
Fisch, G. S. (1998). Visual inspection of data revisited: Do the eyes still have it? Behavior Analyst, 21(1), 111–123.
Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Mazur, J. E. (2007). Learning and behavior (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Poling, A., & Grossett, D. (1986). Basic research designs in applied behavior analysis. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 7–27). New York: Plenum Press.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Boston: Authors Cooperative.
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. New York: D. Appleton-Century.
Skinner, B. F. (1965). Science and human behavior. New York: Free Press.
Watson, J. B. (1913). Psychology as a behaviorist views it. Psychological Review, 20, 158–177.

BEHRENS–FISHER t′ STATISTIC

The Behrens–Fisher t′ statistic can be employed when one seeks to make inferences about the means of two normal populations without assuming the variances are equal. The statistic was offered first by W. U. Behrens in 1929 and reformulated by Ronald A. Fisher in 1939:

t′ = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2) = t1 sin θ − t2 cos θ,

where sample mean x̄1 and sample variance s1² are obtained from the random sample of size n1 from the normal distribution with mean μ1 and variance σ1², t1 = (x̄1 − μ1)/√(s1²/n1) has a t distribution with ν1 = n1 − 1 degrees of freedom, the respective quantities with subscript 2 are defined similarly, and tan θ = (s1/√n1)/(s2/√n2) or θ = tan⁻¹[(s1/√n1)/(s2/√n2)]. The distribution of t′ is the Behrens–Fisher distribution. It is, hence, a mixture of the two t distributions. The problem arising when one tries to test the normal population means without making any assumptions about their variances is referred to as the Behrens–Fisher problem or as the two means problem.

Under the usual null hypothesis of H0: μ1 = μ2, the test statistic t′ can be obtained and compared with the percentage points of the Behrens–Fisher distribution. Tables for the Behrens–Fisher distribution are available, and the table entries are prepared on the basis of the four numbers ν1 = n1 − 1, ν2 = n2 − 1, θ, and the Type I error rate α. For example, Ronald A. Fisher and Frank Yates in 1957 presented significance points of the Behrens–Fisher distribution in two tables, one for ν1 and ν2 = 6, 8, 12, 24, ∞; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .05, .01, and the other for ν1 that is greater than ν2 = 1, 2, 3, 4, 5, 6, 7; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .10, .05, .02, .01. Seock-Ho Kim and Allan S. Cohen in 1998 presented significance points of the Behrens–Fisher distribution for ν1 that is greater than ν2 = 2, 4, 6, 8, 10, 12; θ = 0°, 15°, 30°, 45°, 60°, 75°, 90°; and α = .10, .05, .02, .01, and also offered computer programs for obtaining tail areas and percentage values of the Behrens–Fisher distribution.

Using the Behrens–Fisher distribution, one can construct the 100(1 − α)% interval that contains μ1 − μ2 with

x̄1 − x̄2 ± t′α/2(ν1, ν2, θ) √(s1²/n1 + s2²/n2),

where the probability that t′ > t′α/2(ν1, ν2, θ) is α/2 or, equivalently, Pr[t′ > t′α/2(ν1, ν2, θ)] = α/2.

This entry first illustrates the statistic with an example. Then related methods are presented, and the methods are compared.

Example

Driving times from a person's house to work were measured for two different routes with n1 = 5 and n2 = 11. The ordered data from the first route are 6.5, 6.8, 7.1, 7.3, 10.2, yielding x̄1 = 7.580 and s1² = 2.237, and the data from the second route are 5.8, 5.8, 5.9, 6.0, 6.0, 6.0, 6.3, 6.3, 6.4, 6.5, 6.5, yielding x̄2 = 6.136 and s2² = 0.073. It is assumed that the two independent samples were drawn from two normal distributions having means μ1 and μ2 and variances σ1² and σ2², respectively. A researcher wants to know whether the average driving times differed for the two routes.

The test statistic under the null hypothesis of equal population means is t′ = 2.143 with ν1 = 4, ν2 = 10, and θ = 83.078°. From the computer program, Pr(t′ > 2.143) = .049, indicating the null hypothesis cannot be rejected at α = .05 when the alternative hypothesis is nondirectional, Ha: μ1 ≠ μ2, because p = .098. The corresponding 95% interval for the population mean difference is [−0.421, 3.308].
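The quantities reported in this example are easy to reproduce with standard-library Python, as in the illustrative sketch below (variable names are assumptions); the tail probability of the Behrens–Fisher distribution itself, however, still requires special tables or software, as noted above.

import math
import statistics

route1 = [6.5, 6.8, 7.1, 7.3, 10.2]
route2 = [5.8, 5.8, 5.9, 6.0, 6.0, 6.0, 6.3, 6.3, 6.4, 6.5, 6.5]

m1, m2 = statistics.mean(route1), statistics.mean(route2)            # 7.580, 6.136
v1, v2 = statistics.variance(route1), statistics.variance(route2)    # 2.237, 0.073
n1, n2 = len(route1), len(route2)

se = math.sqrt(v1 / n1 + v2 / n2)
t_prime = (m1 - m2) / se                                             # about 2.143
theta = math.degrees(math.atan(math.sqrt(v1 / n1) / math.sqrt(v2 / n2)))  # about 83.08 degrees
print(t_prime, theta)
# Pr(t' > 2.143) for nu1 = 4, nu2 = 10, and theta = 83.08 must be read from
# Behrens-Fisher tables or specialized software (about .049 for this example).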

Related Methods

The Student's t test for independent means can be used when the two population variances are assumed to be equal and σ1² = σ2² = σ²:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(sp²/n1 + sp²/n2),

where the pooled variance that provides the estimate of the common population variance σ² is defined as sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2). It has a t distribution with ν = n1 + n2 − 2 degrees of freedom. The example data yield the Student's t = 3.220, ν = 14, the two-tailed p = .006, and the 95% confidence interval of [0.482, 2.405]. The null hypothesis of equal population means is rejected at the nominal α = .05, and the confidence interval does not contain 0.

When the two variances cannot be assumed to be the same, one of the solutions is to use the Behrens–Fisher t′ statistic. There are several alternative solutions. One simple way to solve the two means problem, called the smaller degrees of freedom t test, is to use the same t′ statistic that has a t distribution with different degrees of freedom:

t′ ~ t[min(ν1, ν2)],

where the degrees of freedom is the smaller value of ν1 or ν2. Note that this method should be used only if no statistical software is available because it yields a conservative test result and a wider confidence interval. The example data yield t′ = 2.143, ν = 4, the two-tailed p = .099, and the 95% confidence interval of [−0.427, 3.314]. The null hypothesis of equal population means is not rejected at α = .05, and the confidence interval contains 0.
Behrens–Fisher t0 Statistic 77

a t distribution with the approximate degrees of Jerzy Neyman and Egon S. Pearson’s sampling the-
freedom ν0 : ory. Among the methods, Welch’s approximate
t test and the Welch–Aspin t test are the most
t0 e tðν0 Þ; important ones from the frequentist perspective.
The critical values and the confidence intervals
.h  . i
from various methods under the frequentist
where ν0 ¼ 1 c2 ν1 þ ð1  cÞ2 ν2 with
 2   2    2   approach are in general different from those of
c ¼ s1 n1 s1 n1 þ s2 n2 . The approxima- either the fiducial or the Bayesian approach. For
tion is accurate when both sample sizes are 5 or the one-sided alternative hypothesis, however, it is
larger. Although there are other solutions, Welch’s interesting to note that the generalized extreme
approximate t test might be the best practical solu- region to obtain the generalized p developed by
tion to the Behrens–Fisher problem because of its Kam-Wah Tsui and Samaradasa Weerahandi in
availability from the popular statistical software, 1989 is identical to the extreme area from the
including SPSS (an IBM company, formerly called Behrens–Fisher t0 statistic.
PASWâ Statistics) and SAS. The example data The critical values for the two-sided alternative
yield t0 ¼ 2:143, ν0 ¼ 4:118, the two-tailed hypothesis at α ¼ :05 for the example data are
p ¼ :097, and the 95% confidence interval of 2.776 for the smaller degrees of freedom t test,
½0:406; 3:293 The null hypothesis of equal 2.767 for the Behrens–Fisher t0 test, 2.745 for
population means is not rejected at α ¼ :05, and Welch’s approximate t test, 2.715 for the Welch–
the confidence interval contains 0. Aspin t test, and 2.145 for the Student’s t test. The
In addition to the previous method, the Welch– respective 95% fiducial and confidence intervals
Aspin t test employs an approximation of the distri- are ½0:427, 3:314 for the smaller degrees of free-
bution of t0 by the method of moments. The exam- dom test, ½0:421, 3:308 for the Behrens–Fisher
ple data yield t0 ¼ 2:143, and the critical value t0 test, ½0:406, 3:293 for Welch’s approximate
under the Welch–Aspin t test for the two-tailed test t test, ½0:386, 3:273 for the Welch–Aspin t test,
is 2.715 at α ¼ :05. The corresponding 95% confi- and [0.482, 2.405] for the Student’s t test. The
dence interval is ½0:386, 3:273: Again, the null smaller degrees of freedom t test yielded the most
hypothesis of equal population means is not rejected conservative result with the largest critical value
at α ¼ :05, and the confidence interval contains 0. and the widest confidence interval. The Student’s t
test yielded the smallest critical value and the
shortest confidence interval. All other intervals lie
Comparison of Methods
between these two intervals. The differences
The Behrens–Fisher t0 statistic and the Behrens– between many solutions to the Behrens–Fisher
Fisher distribution are based on Fisher’s fiducial problem might be less than their differences from
approach. The approach is to find a fiducial proba- the Student’s t test when sample sizes are greater
bility distribution that is a probability distribution than 10.
of a parameter from observed data. Consequently, The popular statistical software programs SPSS
0 and SAS produce results from Welch’s approxi-
the interval that involves tα=2 ðν1 , ν2 , θÞ is referred
to as the 100ð1  αÞ% fiducial interval. mate t test and the Student’s t test, as well as the
The Bayesian solution to the Behrens–Fisher respective confidence intervals. It is essential to
problem was offered by Harold Jeffreys in 1940. have a table that contains the percentage points of
When uninformative uniform priors are used for the Behrens–Fisher distribution or computer pro-
the population parameters, the Bayesian solution grams that can calculate the tail areas and percent-
to the Behrens–Fisher problem is identical to that age values in order to use the Behrens–Fisher t0 test
of Fisher’s in 1939. The Bayesian highest posterior or to obtain the fiducial interval. Note that Welch’s
density interval that contains the population mean approximate t test may not be as effective as the
difference with the probability of 1  α is identical Welch–Aspin t test. Note also that the sequential
to the 100ð1  αÞ% fiducial interval. testing of the population means on the basis of the
There are many solutions to the Behrens–Fisher result from either Levene’s test of the equal popu-
problem based on the frequentist approach of lation variances from SPSS or the folded F test
78 Bernoulli Distribution

from SAS is not recommended in general because it is the simplest probability distribution, it pro-
of the complicated nature of control of the Type I vides a basis for other important probability distri-
error (rejecting a true null hypothesis) in the butions, such as the binomial distribution and the
sequential testing. negative binomial distribution.

Seock-Ho Kim
Definition and Properties
See also Mean Comparisons; Student’s t Test; t Test,
An experiment of chance whose result has only
Independent Samples
two possibilities is called a Bernoulli trial (or Ber-
noulli experiment). Let p denote the probability of
Further Readings success in a Bernoulli trial ð0 < p < 1Þ. Then,
a random variable X that assigns value 1 for a suc-
Behrens, W. U. (1929). Ein Beitrag zur Fehlerberechnung
bei wenigen Beobachtungen [A contribution to error
cess with probability p and value 0 for a failure
estimation with few observations]. with probability 1  p is called a Bernoulli ran-
Landwirtschaftliche Jahrbücher, 68, 807–837. dom variable, and it follows the Bernoulli distribu-
Fisher, R. A. (1939). The comparison of samples with tion with probability p, which is denoted by
possibly unequal variances. Annals of Eugenics, 9, X e BerðpÞ. The probability mass function of
174–180. Ber(p) is given by
Fisher, R. A., & Yates, F. (1957). Statistical tables for
biological, agricultural and medical research (4th ed.).
Edinburgh, UK: Oliver and Boyd.
PðX ¼ xÞ ¼ px ð1  pÞ1x ; x ¼ 0; 1:
Jeffreys, H. (1940). Note on the Behrens-Fisher formula.
Annals of Eugenics, 10, 48–51. The mean of X is p, and the variance is
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). pð1  pÞ. Figure 1 shows the probability mass
Continuous univariate distributions (Vol. 2, 2nd ed.). function of Ber(.7). The horizontal axis represents
New York: Wiley. values of X, and the vertical axis represents the
Kendall, M., & Stuart, A. (1979). The advanced theory corresponding probabilities. Thus, the height is .7
of statistics (Vol. 2, 4th ed.). New York: Oxford
at X ¼ 1, and .3 for X ¼ 0. The mean of Ber(0.7)
University Press.
Kim, S.-H., & Cohen, A. S. (1998). On the Behrens-
is 0.7, and the variance is .21.
Fisher problem: A review. Journal of Educational and Suppose that a Bernoulli trial with probability p
Behavioral Statistics, 23, 356–377. is independently repeated for n times, and we
Tsui, K.-H., & Weerahandi, S. (1989). Generalized obtain a random sample X1 , X2 ; . . . ; Xn : Then, the
p-values in significance testing of hypotheses in the number of successes Y ¼ X1 þ X2 þ    þ Xn fol-
presence of nuisance parameters. Journal of the lows the binomial distribution with probability
American Statistical Association, 84, 602–607;
Correction, 86, 256.
Welch, B. L. (1938). The significance of the difference 1.0
between two means when the population variances are
unequal. Biometrika, 29, 350–362. .8
Probability

.6

BERNOULLI DISTRIBUTION .4

.2
The Bernoulli distribution is a discrete probability
distribution for a random variable that takes only .0
two possible values, 0 and 1. Examples of events 0 1
X
that lead to such a random variable include coin
tossing (head or tail), answers to a test item (cor-
rect or incorrect), outcomes of a medical treatment Figure 1 Probability Mass Function of the Bernoulli
(recovered or not recovered), and so on. Although Distribution With p ¼ :7
Bernoulli Distribution 79

p and the number of trials n, which is denoted by case of the negative binomial distribution in which
Y e Binðn, pÞ. Stated in the opposite way, the the number of failures is counted before observing
Bernoulli distribution is a special case of the bino- the first success (i.e., t ¼ 1).
mial distribution in which the number of trials n is Assume a finite Bernoulli population in which
1. The probability mass function of Bin(n,p) is individual members are denoted by either 0 or 1.
given by If sampling is done by randomly selecting one
member at each time with replacement (i.e., each
n! selected member is returned to the population
PðY ¼ yÞ ¼ py ð1  pÞny ,
y!ðn  yÞ! before the next selection is made), then the result-
y ¼ 0; 1; . . . ; n, ing sequence constitutes independent Bernoulli
trials, and the number of successes follows the
where n! is the factorial of n; which equals the binomial distribution. If sampling is done at ran-
product nðn  1Þ    2 · 1. The mean of Y is np, dom but without replacement, then each of the
and the variance is npð1  pÞ: Figure 2 shows the individual selections is still a Bernoulli trial, but
probability mass function of Bin(10,.7), which is they are no longer independent of each other. In
obtained as the distribution of the sum of 10 inde- this case, the number of successes follows the
pendent random variables, each of which follows hypergeometric distribution, which is specified by
Ber(.7). The height of each bar represents the the population probability p, the number of trials
probability that Y takes the corresponding value; n, and the population size m.
for example, the probability of Y ¼ 7 is about .27. Various approximations are available for the
The mean is 7 and the variance is 2.1. In general, binomial distribution. These approximations
the distribution is skewed to the right when are extremely useful when n is large because
p < :5, skewed to the left when p > :5; and sym- in that case the factorials in the binomial proba-
metric when p ¼ :5: bility mass function become prohibitively large
and make probability calculations tedious.
For example,pby the central
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi limit theorem,
Relationship to Other Probability Distributions
Z ¼ ðY  npÞ npð1  pÞ approximately fol-
The Bernoulli distribution is a basis for many lows the standard normal distribution Nð0, 1Þ
probability distributions, as well as for the bino- when Y e Binðn, pÞ. The constant 0.5 is often
mial distribution. The number of failures before added to the denominator to improve the
observing a success t times in independent Ber- approximation (called continuity correction). As
noulli trials follows the negative binomial distribu- a rule of thumb, the normal approximation
tion with probability p and the number of works well when either (a) npð1  pÞ > 9 or
successes t. The geometric distribution is a special (b) np > 9 for 0 < p ≤ :5. The Poisson distribu-
tion with parameter np also well approximates
.30 Bin(n,p) when n is large and p is small. The
Poisson approximation works well if
0:31
n p > :47; for example, p > :19, :14, and :11
when n ¼ 20, 50, and 100, respectively. If
Probability

.20
n0:31 p ≥ :47, then the normal distribution gives
better approximations.
.10

.00 Estimation
0 1 2 3 4 5 6 7 8 9 10
Inferences regarding the population proport-
Y
ion p can be made from a random sample
X1 , X2 , . . . , Xn from Ber(p), whose sum follows
Figure 2 Probability Mass Function of the Binomial Bin(n, p). The population proportion p can be
Distribution With p ¼ 7 and n ¼ 10 estimated by the sample mean (or the sample
80 Bernoulli Distribution

P
proportion) p^ ¼ X ¼ ni¼1 Xi =n; which is an unbi- variable (i.e., Y ¼ 0; 1), the logistic regression
ased estimator of p. model is expressed by the equation
Interval estimation is usually made by the nor-
mal approximation. If n is large enough (e.g., pðxÞ
ln ¼ b0 þ b1 x1 þ    þ bK xK ,
n > 100), a 100ð1  αÞ% confidence interval is 1  pðxÞ
given by
where ln is the natural logarithm, pðxÞ is the prob-
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
^ð1  p
p ^Þ ability of Y ¼ 1 (or the expected value of Y) given
^ ± zα=2
p ; x1 ; x2 ; . . . ; xK ; and b0 ; b1 ; . . . ; bK are the regres-
n
sion coefficients. The left-hand side of the above
where p^ is the sample proportion and zα=2 is the equation is called the logit, or the log-odds ratio,
value of the standard normal variable that gives of proportion p. The logit is symmetric about zero;
the probability α=2 in the right tail. For smaller ns, it is positive (negative) if p > :5 (p < :5), and zero
the quadratic approximation gives better results: if p ¼ :5. It approaches positive (negative) infinity
0 sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1 as p approaches 1 (0). Another representation
1 z 2
^ð1  p
p ^Þ z2α=2 equivalent to the above is
 @p ^ þ α=2 ± zα=2 þ 2 A:
1 þ zα=2 n 2n n 4n expðb0 þ b1 x1 þ    þ bK xK Þ
pðxÞ ¼ :
1 þ expðb0 þ b1 x1 þ    þ bK xK Þ
The quadratic approximation works well if
:1 < p < :9 and n is as large as 25. The right-hand side is called the logistic
There are often cases in which one is interested regression function. In either case, the model
in comparing two population proportions. Sup- states that the distribution of Y given predictors
pose that we obtained sample proportions p ^1 and x1 , x2 , . . . , xK is Ber[p(x)], where the logit of p(x)
^2 with sample sizes n1 and n2 , respectively. Then,
p is determined by a linear combination of predic-
the difference between the population proportions tors x1 , x2 , . . . , xK . The regression coefficients
is estimated by the difference between the sample are estimated from N sets of observed data
proportions p^1  p^2 . Its standard error is given by (Yi ; x i1 ; xi2 ; . . . ; xiK Þ; i ¼ 1; 2; . . . ; N:
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
^1 ð1  p
p ^1 Þ p ^ ð1  p ^2 Þ The Binomial Error Model
SEðp^1  p
^2 Þ ¼ þ 2 ,
n1 n2
The binomial error model is one of the mea-
from which one can construct a 100ð1  αÞ confi- surement models in the classical test theory. Sup-
dence interval as pose that there are n test items, each of which is
scored either 1 (correct) or 0 (incorrect). The bino-
ðp ^2 Þ ± zα=2 SEðp
^1  p ^1  p
^2 Þ: mial error model assumes that the distribution
of person i’s total score Xi given his or her
‘‘proportion-corrected’’ true score ζi ð0 < ζi < 1Þ is
Applications Binðn; ζi Þ:
Logistic Regression n! nx
PðXi ¼ xjζi Þ ¼ ζx ð1  ζi Þ ,
Logistic regression is a regression model about x!ðn  xÞ! i
the Bernoulli probability and used when the x ¼ 0, 1, . . . , n:
dependent variable takes only two possible values.
Logistic regression models are formulated as gener- This model builds on a simple assumption that
alized linear models in which the canonical link for all items, the probability of a correct response
function is the logit link and the Bernoulli distribu- for a person with true score ζi is equal to ζi , but
tion is assumed for the dependent variable. the error variance, nζi ð1  ζi Þ, varies as a function
In the standard case in which there are K linear of ζi unlike the standard classical test model.
predictors x1 , x2 , . . . , xK and the dependent vari- The observed total score Xi ¼ xi serves as an
able Y, which represents a Bernoulli random estimate of nζi , and the associated error variance
Beta 81

can also be estimated as σ^i2 ¼ xi ðn  xi Þ=ðn  1Þ. power of a test, equal to 1  β rather than β itself,
Averaging this error variance over is referred to as a measure of quality for a hypothe-
N persons gives the overall error variance sis test. This entry discusses the role of β in
σ^2 ¼ ½xðn  xÞ  s2  ðn  1Þ, where x is the sam- hypothesis testing and its relationship with
ple mean of observed total scores over the N per- significance ðαÞ.
sons and s2 is the sample variance. It turns out
that by substituting σ^2 and s2 in the definition of
reliability, the reliability of the n-item test equals Hypothesis Testing and Beta
the Kuder–Richardson formula 21 under the bino-
mial error model. Hypothesis testing is a very important part of sta-
tistical inference: the formal process of deciding
whether a particular contention (called the null
History hypothesis) is supported by the data, or whether
The name Bernoulli was taken from Jakob Ber- a second contention (called the alternative hypoth-
noulli, a Swiss mathematician in the 17th century. esis) is preferred. In this context, one can represent
He made many contributions to mathematics, espe- the situation in a simple 2 × 2 decision table in
cially in calculus and probability theory. He is the which the columns reflect the true (unobservable)
first person who expressed the idea of the law of situation and the rows reflect the inference made
large numbers, along with its mathematical proof based on a set of data:
(thus, the law is also called Bernoulli’s theorem).
Bernoulli derived the binomial distribution in the Null Alternative
case in which the probability p is a rational number, Hypothesis Is Hypothesis Is
and his result was published in 1713. Later in the Decision True/Preferred True/Preferred
18th century, Thomas Bayes generalized Bernoulli’s Fail to Correct Type II error
binomial distribution by removing its rational reject decision
restriction on p in his formulation of a statistical null hypothesis
theory that is now known as Bayesian statistics. Reject null Type I error Correct decision
hypothesis in favor
Kentaro Kato and William M. Bart of alternative
hypothesis
See also Logistic Regression; Normal Distribution; Odds
Ratio; Poisson Distribution; Probability, Laws of
The language used in the decision table is subtle
Further Readings but deliberate. Although people commonly speak
of accepting hypotheses, under the maxim that sci-
Agresti, A. (2002). Categorical data analysis (2nd ed.). entific theories are not so much proven as sup-
New York: Wiley.
ported by evidence, we might more properly speak
Johnson, N. L., Kemp, A. W., & Kotz, S. (2005).
Univariate discrete distributions (3rd ed.). Hoboken,
of failing to reject a hypothesis rather than of
NJ: Wiley. accepting it. Note also that it may be the case that
Lindgren, B. W. (1993). Statistical theory (4th ed.). Boca neither the null nor the alternative hypothesis is, in
Raton, FL: Chapman & Hall/CRC. fact, true, but generally we might think of one as
Lord, F. M., & Novick, M. R. (1968). Statistical theories preferable over the other on the basis of evi-
of mental test scores. Reading, MA: Addison-Wesley. dence. Semantics notwithstanding, the decision
table makes clear that there exist two distinct
possible types of error: that in which the null
hypothesis is rejected when it is, in fact, true;
BETA and that in which the null hypothesis is not
rejected when it is, in fact, false. A simple exam-
Beta ðβÞ refers to the probability of Type II error ple that helps one in thinking about the differ-
in a statistical hypothesis test. Frequently, the ence between these two types of error is
82 Beta

a criminal trial in the U.S. judicial system. In say that ‘‘a scientific fact should be regarded as
that system, there is an initial presumption of experimentally established only if a properly
innocence (null hypothesis), and evidence is pre- designed experiment rarely fails to give this level of
sented in order to reach a decision to convict significance’’ (Fisher, 1926, p. 504).
(reject the null hypothesis) or acquit (fail to Although it is not generally possible to control
reject the null). In this context, a Type I error is both α and β for a test with a fixed sample size, it
committed if an innocent person is convicted, is typically possible to decrease β while holding α
while a Type II error is committed if a guilty per- constant if the sample size is increased. As a result,
son is acquitted. Clearly, both types of error can- a simple way to conduct tests with high power
not occur in a single trial; after all, a person (low β) is to select a sample size sufficiently large
cannot be both innocent and guilty of a particu- to guarantee a specified power for the test. Of
lar crime. However, a priori we can conceive of course, such a sample size may be prohibitively
the probability of each type of error, with the large or even impossible, depending on the nature
probability of a Type I error called the signifi- and cost of the experiment. From a research design
cance level of a test and denoted by α, and the perspective, sample size is the most critical aspect
probability of a Type II error denoted by β, with of ensuring that a test has sufficient power, and
1  β, the probability of not committing a Type a priori sample size calculations designed to pro-
II error, called the power of the test. duce a specified power level are common when
designing an experiment or survey. For example, if
one wished to test the null hypothesis that a mean
Relationship With Significance
μ was equal to μ0 versus the alternative that μ
Just as it is impossible to realize both types of error was equal to μ1 > μ0 , the sample size required to
in a single test, it is also not possible to minimize ensure a Type II error of β if α ¼ :05 is
both α and β in a particular experiment with fixed n ¼ fσð1:645  Φ1 ðβÞÞ=ðμ1  μ0 Þg2 , where Φ is
sample size. In this sense, in a given experiment, the standard normal cumulative distribution func-
there is a trade-off between α and β, meaning that tion and σ is the underlying standard deviation, an
both cannot be specified or guaranteed to be low. estimate of which (usually the sample standard
For example, a simple way to guarantee no chance deviation) is used to compute the required sample
of a Type I error would be to never reject the null size.
hypothesis regardless of the data, but such a strat- The value of β for a test is also dependent on
egy would typically result in a very large β. Hence, the effect size—that is, the measure of how differ-
it is common practice in statistical inference to fix ent the null and alternative hypotheses are, or the
the significance level at some nominal, low value size of the effect that the test is designed to detect.
(usually .05) and to compute and report β in com- The larger the effect size, the lower β will typically
municating the result of the test. Note the implied be at fixed sample size, or, in other words, the
asymmetry between the two types of error possible more easily the effect will be detected.
from a hypothesis test: α is held at some prespeci-
fied value, while β is not constrained. The prefer- Michael A. Martin and Steven Roberts
ence for controlling α rather than β also has an
See also Hypothesis; Power; p Value; Type I Error;
analogue in the judicial example above, in which
Type II Error
the concept of ‘‘beyond reasonable doubt’’ captures
the idea of setting α at some low level, and where
there is an oft-stated preference for setting a guilty Further Readings
person free over convicting an innocent person,
Fisher, R. A. (1926). The arrangement of field
thereby preferring to commit a Type II error over
experiments. Journal of the Ministry of Agriculture of
a Type I error. The common choice of .05 for α Great Britain, 23, 503–513.
most likely stems from Sir Ronald Fisher’s 1926 Lehmann, E. L. (1986). Testing statistical hypotheses.
statement that he ‘‘prefers to set a low standard of New York: Wiley.
significance at the 5% point, and ignore entirely all Moore, D. (1979). Statistics: Concepts and controversies.
results that fail to reach that level.’’ He went on to San Francisco: W. H. Freeman.
Bias 83

answer the phone when the incoming call is from


BETWEEN-SUBJECTS DESIGN someone unknown.)
Researchers often rely on volunteers to partici-
See Single-Subject Design; Within-Subjects pate in their studies, but there may be something
different about those who volunteer to participate
Design
in studies and those who do not volunteer that sys-
tematically biases the sample. For example, people
who volunteer to participate in studies of new
treatments for psychological disorders may be
BIAS more motivated to get better than those who do
not volunteer, leading researchers to overestimate
Bias is systematic error in data collected to address the effectiveness of a new treatment. Similarly,
a research question. In contrast to random errors, people who are selected to participate in surveys
which are randomly distributed and therefore even may choose not to respond to the survey. If there
out across people or groups studied, biases are are systematic differences between responders and
errors that are systematically related to people, nonresponders, then the generalizability of the
groups, treatments, or experimental conditions survey findings is limited.
and therefore cause the researcher to overestimate
or underestimate the measurement of a behavior
Selection Bias
or trait. Bias is problematic because it can endan-
ger the ability of researchers to draw valid conclu- Selection bias is present when participants in dif-
sions about whether one variable causes a second ferent study conditions possess different character-
variable (threats to internal validity) or whether istics at the start of a study that could influence the
the results generalize to other people (threats to outcome measures in the study. Selection bias is
external validity). Bias comes in many forms, presumed in quasi-experimental designs, in which
including sampling bias, selection bias, experi- participants are not randomly assigned to experi-
menter expectancy effects, and response bias. mental conditions, as is often the case in educa-
tional research in which classrooms or classes
receive different interventions. For example, if
Sampling Bias
a researcher wanted to examine whether students
Human participants in studies generally represent learn more in introductory psychology classes that
a subset of the entire population of people whom have a small number of students than they do in
the researcher wishes to understand; this subset of classes with a large number, a selection bias may
the entire population is known as the study sam- exist if better students are more likely to choose
ple. Unless a study sample is chosen using some classes that have fewer students. If students in
form of random sampling in which every member smaller classes outperform students in larger clas-
of the population has a known, nonzero chance of ses, it will be unclear whether the performance dif-
being chosen to participate in the study, it is likely ference is the result of smaller classes or because
that some form of sampling bias exists. Even for better students self-select into smaller classes.
surveys that attempt to use random sampling of In this case, selection bias makes it appear that
a population via random-digit dialing, the sample there are differences between the two conditions
necessarily excludes people who do not have (large vs. small classes) when in truth there might
phone service. Thus people of lower socioeco- be no difference (only differences between the
nomic status—and who are therefore less likely to types of people who sign up for the classes). Selec-
be able to afford phone service—may be underrep- tion bias could also lead researchers to conclude
resented in samples generated using random-digit that there are no differences between groups when
dialing. (A modern artifact of random-digit dialing in fact there are. Imagine a researcher wanted to
that leads to bias in the other direction is that test whether a new intervention reduced recidivism
those with cell phones and many people with land- among juvenile offenders and that researcher relied
lines routinely screen calls and are less likely to on collaborators at social service agencies to
84 Biased Estimator

randomly assign participants to the intervention or indicates liberal political attitudes rather than
the control group. If the collaborators sometimes sometimes indicating conservative attitudes and
broke with random assignment and assigned the sometimes indicating liberal attitudes), it is possi-
juveniles who were most in need (e.g., had the ble for researchers to over- or underestimate the
worst criminal records) to the intervention group, favorability of participants’ attitudes, whether
then when both groups were subsequently fol- participants possess a particular trait, or the
lowed to determine whether they continued to likelihood that they will engage in a particular
break the law (or were caught doing so), the selec- behavior.
tion bias would make it difficult to find a difference
between the two groups. The preintervention dif-
ferences in criminal behavior between the interven- Avoiding Bias
tion and control groups might mask any effect of Careful research design can minimize systematic
the intervention or even make it appear as if the errors in collected data. Random sampling reduces
intervention increased criminal behavior. sample bias. Random assignment to condition
minimizes or eliminates selection bias. Ensuring
Experimenter Expectancy Effects that experimenters are blind to experimental con-
ditions eliminates the possibility that experimenter
Researchers usually have hypotheses about how expectancies will influence participant behavior or
subjects will perform under different experimental bias the data collected. Bias reduction improves
conditions. When a researcher knows which researchers’ ability to generalize findings and to
experimental group a subject is assigned to, the draw causal conclusions from the data.
researcher may unintentionally behave differently
toward the participant. The different treatment, Margaret Bull Kovera
which systematically varies with the experimental
condition, may cause the participant to behave in See also Experimenter Expectancy Effect; Response Bias;
a way that confirms the researcher’s hypothesis or Sampling; Selection; Systematic Error
expectancy, making it impossible to determine
whether it is the difference in the experimenter’s
Further Readings
behavior or in the experimental conditions that
causes the change in the subject’s behavior. Robert Larzelere, R. E., Kuhn, B. R., & Johnson, B. (2004). The
Rosenthal and his colleagues were among the first intervention selection bias: An unrecognized confound
to establish experimenter expectancy effects when in intervention research. Psychological Bulletin, 130,
they told teachers that some of their students had 289–303.
Rosenthal, R. (2002). Covert communication in
been identified as ‘‘late bloomers’’ whose academic
classrooms, clinics, courtrooms, and cubicles.
performance was expected to improve over the American Psychologist, 57, 839–849.
course of the school year. Although the students Rosenthal, R., & Rosnow, R. L. (1991). Essentials of
chosen to be designated as late bloomers had in behavioral research methods and data analysis
fact been selected randomly, the teachers’ expecta- (Chapter 10, pp. 205–230). New York: McGraw-Hill.
tions about their performance appeared to cause Welkenhuysen-Gybels, J., Billiet, J., & Cambré, B.
these students to improve. (2003). Adjustment for acquiescence in the assessment
of the construct equivalence of Likert-type score items.
Journal of Cross-Cultural Psychology, 34, 702–722.
Response Bias
Another source of systematic error comes from
participant response sets, such as the tendency for
participants to answer questions in an agreeable BIASED ESTIMATOR
manner (e.g., ‘‘yes’’ and ‘‘agree’’), known as an
acquiescent response set. If all the dependent mea- In many scientific research fields, statistical models
sures are constructed such that agreement with an are used to describe a system or a population, to
item means the same thing (e.g., agreement always interpret a phenomenon, or to investigate the
Biased Estimator 85

relationship among various measurements. These σ2


distribution with mean μ and variance n. There-
statistical models often contain one or multiple  2 2
components, called parameters, that are unknown fore, E X ¼ μ2 þ σn 6¼ μ2 .
and thus need to be estimated from the data Example 2 indicates that one should be careful
(sometimes also called the sample). An estimator, about determining whether an estimator is biased.
which is essentially a function of the observable Specifically, although θ^ is an unbiased estimator
data, is biased if its expectation does not equal the  
for θ, g θ^ may be a biased estimator for gðθÞ
parameter to be estimated. if g is a nonlinear function. In Example 2,
To formalize this concept, suppose θ is the
gðθÞ ¼ θ2 is such a function. However, when g is
parameter of interest in a statistical model. Let θ^ a linear function, that is, gðθÞ ¼ aθ þ b where a
be its estimator based on an observed sample.  
  and b are two constants, then g θ^ is always an
Then θ^ is a biased estimator if E θ^ 6¼ θ, where unbiased estimator for gðθÞ.
E denotes the expectation operator. Similarly, one
may say that θ^ is an unbiased estimator if
  Example 3
E θ^ ¼ θ. Some examples follow.
Let X1 , . . . , Xn be an observed sample from some
distribution (not necessarily normal) with mean μ
Example 1 and variance σ 2 . The sample variance S2 , which is
Pn  2
1
Suppose an investigator wants to know the aver- defined as n1 Xi  X , is an unbiased estima-
i¼1
age amount of credit card debt of undergraduate n 
P 2
1
students from a certain university. Then the popu- tor for σ 2, while the intuitive guess n Xi  X
lation would be all undergraduate students cur- i¼1

rently enrolled in this university, and the would yield a biased estimator. A heuristic argu-
population mean of the amount of credit card debt ment is given here. If μ were known,
P
n
of these undergraduate students, denoted by θ, is 1
n ðXi  μÞ2 could be calculated, which would
the parameter of interest. To estimate θ, a random i¼1
sample is collected from the university, and the be an unbiased estimator for σ 2. But since μ is not
sample mean of the amount of credit card debt is known, it has to be replaced by X. This replace-
calculated. Denote this sample mean by θ^1. Then ment actually makes the numerator smaller. That
  n 
P 2 P n
E θ^1 ¼ θ; that is, θ^1 is an unbiased estimator. If is, Xi  X ≤ ðXi  μÞ2 regardless of the
the largest amount of credit card debt from the i¼1 i¼1

sample, call it θ^2 , is used to estimate θ, then obvi- value of μ. Therefore, the denominator has to be
  reduced a little bit (from n to n  1) accordingly.
ously θ^2 is biased. In other words, E θ^2 6¼ θ.
A closely related concept is the bias of an esti-
 
mator, which is defined as E θ^  θ. Therefore, an
unbiased estimator can also be defined as an esti-
Example 2 mator whose bias is zero, while a biased estimator
In this example a more abstract scenario is exam- is one whose bias is nonzero. A biased estimator is
ined. Consider a statistical model in which a ran- said to underestimate the parameter if the bias is
dom variable X follows a normal distribution with negative or overestimate the parameter if the bias
mean μ and variance σ 2 , and suppose a random is positive.
sample X1 , . . . , Xn is observed. Let the parameter θ Biased estimators are usually not preferred in
P
n estimation problems, because in the long run,
be μ. It is seen in Example 1 that X ¼ 1n Xi , the they do not provide an accurate ‘‘guess’’ of the
i¼1
parameter. Sometimes, however, cleverly con-
sample mean of X1 , . . . , Xn , is an unbiased estima-
2 structed biased estimators are useful because
tor for θ. But X is a biased estimator for μ2 (or although their expectation does not equal the
θ2 ). This is because X follows a normal parameter under estimation, they may have a small
86 Bivariate Regression

variance. To this end, a criterion that is quite com- See also Distribution; Estimation; Expected Value
monly used in statistical science for judging the
quality of an estimator needs to be introduced.
Further Readings
The mean square error (MSE) of an estimatorh θ^
2 i Rice, J. A. (1994). Mathematical statistics and data
for the parameter θ is defined as E θ^  θ . analysis (2nd ed.). Belmont, CA: Duxbury Press.
Apparently, one should seek estimators that make
the MSE small, which means that θ^ is ‘‘close’’ to θ.
Notice that
BIVARIATE REGRESSION
h 2 i h
2 i 
2
E θ^  θ ¼ E θ^  E θ^ þ E θ^  θ
  Regression is a statistical technique used to help
¼ Var θ^ þ Bias2 , investigate how variation in one or more variables
predicts or explains variation in another variable.
meaning that the magnitude of the MSE, which is This popular statistical technique is flexible in that
always nonnegative, is determined by two compo- it can be used to analyze experimental or nonex-
nents: the variance and the bias of the estimator. perimental data with multiple categorical and con-
Therefore, an unbiased estimator (for which the tinuous independent variables. If only one variable
bias would be zero), if possessing a large variance, is used to predict or explain the variation in
may be inferior to a biased estimator whose vari- another variable, the technique is referred to as
ance and bias are both small. One of the most pro- bivariate regression. When more than one variable
minent examples is the shrinkage estimator, in is used to predict or explain variation in another
which a small amount of bias for the estimator variable, the technique is referred to as multiple
gains a great reduction of variance. Example 4 is regression. Bivariate regression is the focus of this
a more straightforward example of the usage of entry.
a biased estimator. Various terms are used to describe the indepen-
dent variable in regression, namely, predictor vari-
able, explanatory variable, or presumed cause.
The dependent variable is often referred to as an
Example 4 outcome variable, criterion variable, or presumed
Let X be a Poisson random variable, that is, effect. The choice of independent variable term
λ x
PðX ¼ xÞ ¼ e x!λ , for x ¼ 0, 1, 2, . . . . Suppose the will likely depend on the preference of the
researcher or the purpose of the research. Bivariate
parameter θ ¼ e2λ, which is essentially
regression may be used solely for predictive pur-
½PðX ¼ 0Þ2 , is of interest and needs to be esti- poses. For example, do scores on a college
mated. If an unbiased estimator, say θ^1 ðXÞ, for θ is entrance exam predict college grade point average?
desired, then by the definition of unbiasedness, it Or it may be used for explanation. Do differences
P∞ eλ λx
must satisfy ^ 2λ in IQ scores explain differences in achievement
x¼0 θ1 ðxÞ x! ¼ e or, equiva-
P∞ θ^1 ðxÞλx scores? It is often the case that although the term
lently, x¼0 x! ¼ eλ for all positive values of
predictor is used by researchers, the purpose of the
λ. Clearly, the only solution is that θ^1 ðxÞ ¼ ð1Þx . research is, in fact, explanatory.
But this unbiased estimator is rather absurd. For Suppose a researcher is interested in how well
example, if X ¼ 10, then the estimator θ^1 takes reading in first grade predicts or explains fifth-
the value of 1, whereas if X ¼ 11, then θ^1 is  1. grade science achievement scores. The researcher
As a matter of fact, a much more reasonable esti- hypothesizes that those who read well in first
mator would be θ^2 ðXÞ ¼ e2X , based on the maxi- grade will also have high science achievement in
mum likelihood approach. This estimator is biased fifth grade. An example bivariate regression will
but always has a smaller MSE than θ^1 ðXÞ: be performed to test this hypothesis. The data used
in this example are a random sample of students
Zhigang Zhang and Qianxing Mo (10%) with first-grade reading and fifth-grade
Bivariate Regression 87

science scores and are taken from the Early Because science scores are the outcome, the sci-
Childhood Longitudinal Study public database. ence scores are regressed on first-grade reading
Variation in reading scores will be used to explain scores. The easiest way to conduct such analysis is
variation in science achievement scores, so first- to use a statistical program. The estimates from the
grade reading achievement is the explanatory vari- output may then be plugged into the equation.
able and fifth-grade science achievement is the out- For these data, the prediction equation is
come variable. Before the analysis is conducted, Y 0 ¼ 21:99 þ ð:58ÞX: Therefore, if a student’s first-
however, it should be noted that bivariate regres- grade reading score was 60, the predicted fifth-
sion is rarely used in published research. For grade science achievement score for that student
example, intelligence is likely an important com- would be 21.99 þ (.58)60, which equals 56.79.
mon cause of both reading and science achieve- One might ask, why even conduct a regression
ment. If a researcher was interested in explaining analysis to obtain a predicted science score when
fifth-grade science achievement, then potential Johnny’s science score was already available? There
important common causes, such as intelligence, are a few possible reasons. First, perhaps
would need to be included in the research. a researcher wants to use the information to pre-
dict later science performance, either for a new
group of students or for an individual student,
based on current first-grade reading scores. Second,
Regression Equation
a researcher may want to know the relation
The simple equation for bivariate linear regression between the two variables, and a regression pro-
is Y ¼ a þ bX þ e: The science achievement score, vides a nice summary of the relation between the
Y, for a student equals the intercept or constant scores for all the students. For example, do those
(a), plus the slope (b) times the reading score (X) students who tend to do well in reading in first
for that student, plus error (e). Error, or the residual grade also do well in science in fifth grade? Last,
component (e), represents the error in prediction, a researcher might be interested in different out-
or what is not explained in the outcome variable. comes related to early reading ability when consid-
The error term is not necessary and may be ering the possibility of implementing an early
dropped so that the following equation is used: reading intervention program. Of course a bivariate
Y 0 ¼ a þ bX: Y 0 is the expected (or predicted) relation is not very informative. A much more
score. The intercept is the predicted fifth-grade sci- thoughtfully developed causal model would need
ence score for someone whose first-grade reading to be developed if a researcher was serious about
score is zero. The slope (b, also referred to as the this type of research.
unstandardized regression coefficient) represents
the predicted unit increase in science scores associ-
Scatterplot and Regression Line
ated with a one-unit increase in reading scores. X is
the observed score for that person. The two para- The regression equation describes the linear rela-
meters (a and b) that describe the linear relation tion between variables; more specifically, it
between the predictor and outcome are thus the describes science scores as a function of reading
intercept and the regression coefficient. These para- scores. A scatterplot could be used to represent the
meters are often referred to as least squares estima- relation between these two variables, and the use
tors and will be estimated using the two sets of of a scatterplot may assist one in understanding
scores. That is, they represent the optimal estimates regression. In a scatterplot, the science scores (out-
that will provide the least error in prediction. come variable) are on the y-axis, and the reading
Returning to the example, the data used in the scores (explanatory variable) are on the x-axis.
analysis were first-grade reading scores and fifth- A scatterplot is shown in Figure 1. Each per-
grade science scores obtained from a sample of son’s reading and science scores in the sample are
1,027 school-age children. T-scores, which have plotted. The scores are clustered fairly closely
a mean of 50 and standard deviation of 10, were together, and the general direction looks to be posi-
used. The means for the scores in the sample were tive. Higher scores in reading are generally associ-
51.31 for reading and 51.83 for science. ated with higher scores in science. The next step is
88 Bivariate Regression

is a constant value added to everyone’s


80.000
score), and as demonstrated above, it is
useful in plotting a regression line. The
slope (b ¼ :58) is the unstandardized
Fifth-Grade Science T-Scores

70.000
coefficient. It was statistically signifi-
60.000 cant, indicating that reading has a statis-
tically significant influence on fifth-
50.000 grade science. A 1-point T-score
increase in reading is associated with
40.000 a .58 T-score point increase in science
scores. The bs are interpreted in the
30.000 R 2 linear = 0.311
metric of the original variable. In the
example, all the scores were T-scores.
20.000 Unstandardized coefficients are espe-
20.000 40.000 60.000 80.000 cially useful for interpretation when the
First-Grade Reading T-Scores metric of the variables is meaningful.
Sometimes, however, the metric of the
Figure 1 Scatterplot and Regression Line
variables is not meaningful.
Two equations were generated in the
regression analysis. The first, as dis-
to fit a regression line. The regression line is plotted cussed in the example above, is referred to as the
so that it minimizes errors in prediction, or simply, unstandardized solution. In addition to the unstan-
the regression line is the line that is closest to all the dardized solution, there is a standardized solution.
data points. The line is fitted automatically in many In this equation, the constant is dropped, and z
computer programs, but information obtained in scores (mean = 0, standard deviation = 1), rather
the regression analysis output can also be used to than the T-scores (or raw scores), are used. The
plot two data points that the line should be drawn standardized regression coefficient is referred to as
through. For example, the intercept (where the line a beta weight ðβÞ. In the example, the beta
crosses the y-axis) represents the predicted science weight was .56. Therefore, a one-standard-devia-
score when reading equals zero. Because the value tion increase in reading was associated with a .56-
of the intercept was 21.99, the first data point standard-deviation increase in science achievement.
would be found at 0 on the x-axis and at 21.99 on The unstandardized and standardized coefficients
the y-axis. The second point on the line may be were similar in this example because T-scores are
located at the mean reading score (51.31) and mean standardized scores, and the sample statistics for
science score (51.83). A line can then be drawn the T-scores were fairly close to the population
through those two points. The line is shown in Fig- mean of 50 and standard deviation of 10.
ure 1. Points that are found along this regression It is easy to convert back and forth from stan-
line represent the predicted science achievement dardized to unstandardized regression coefficients:
score for Person A with a reading score of X.
standard deviation of reading scores
β¼b
standard deviation of science scores
Unstandardized and Standardized Coefficients
For a more thorough understanding of bivariate or
regression, it is useful to examine in more detail
the output obtained after running the regression. standard deviation of science scores
b¼β :
First, the intercept has no important substantive standard deviation of reading scores
meaning. It is unlikely that anyone would score
a zero on the reading test, so it does not make From an interpretative standpoint, should some-
much sense. It is useful in the unstandardized solu- one interpret the unstandardized or the standard-
tion in that it is used to obtain predicted scores (it ized coefficient? There is some debate over which
Bivariate Regression 89

one to use for interpretative statements, but in use an F test associated with the value obtained
a bivariate regression, the easiest answer is that if with the formula
both variables are in metrics that are easily inter-
pretable, then it would make sense to use the R2 =k
:
unstandardized coefficients. If the metrics are not ð1  R2 Þ=ðN  k  1Þ
meaningful, then it may make more sense to use the
standardized coefficient. Take, for example, number In this formula, R2 equals the variance explained,
of books read per week. If number of books read 1  R2 is the variance unexplained, and k equals
per week was represented by the actual number of the degrees of freedom (df) for the regression
books read per week, the variable is in a meaningful (which is 1 because one explanatory variable was
metric. If the number of books read per week vari- used). With the numbers plugged in, the formula
able were coded so that 0 = no books read per would look like
week, 1 = one to three books read per week, and
:31=1
2 = four or more books read per week, then the var-
iable is not coded in a meaningful metric, and the :69=ð1027  1  1Þ
standardized coefficient would be the better one to
and results in F ¼ 462:17: An F table indicates
use for interpretation.
that reading did have a statistically significant
effect on science achievement, R2 ¼ :31,
R and R2 Fð1,1025Þ ¼ 462:17, p < :01:
In standard multiple regression, a researcher
In bivariate regression, typically the regression
typically interprets the statistical significance of R2
coefficient is of greatest interest. Additional infor-
(the statistical significance of the overall equation)
mation is provided in the output, however. R is
and the statistical significance of the unique effects
used in multiple regression output and represents
of each individual explanatory variable. Because
a multiple correlation. Because there is only one
this is bivariate regression, however, the statistical
explanatory variable, R (.56) is equal to the corre-
significance test of the overall regression and the
lation coefficient (r = .56) between reading and sci-
regression coefficient (b) will yield the same
ence scores. Note that this value is also identical to
results, and typically the statistical significance
the β. Although the values of β and r are the
tests for each are not reported.
same, the interpretation differs. The researcher is
The statistical significance of the regression
not proposing an agnostic relation between read-
coefficient (b) is evaluated with a t test. The null
ing scores and science scores. Rather the researcher
hypothesis is that the slope equals zero, that is, the
is positing that early reading explains later science
regression line is parallel with the x-axis. The
achievement. Hence, there is a clear direction in
t-value is obtained by
the relation, and this direction is not specified in
a correlation. b
R2 is the variance in science scores explained :
standard error of b
by reading scores. In the current example,
R2 ¼ 31: First-grade reading scores explained In this example, b ¼ :58; and its associated stan-
31% of the variance in fifth-grade science dard error was .027. The t-value was 21.50. A
achievement scores. t-table could be consulted to determine whether
21.50 is statistically significant. Or a rule of thumb
may be used that given the large sample size and
Statistical Significance
with a two-tailed significance test, a t-value greater
R and R2 are typically used to evaluate the statisti- than 2 will be statistically significant at the p < :05
cal significance of the overall regression equation level. Clearly, the regression coefficient was statisti-
(the tests for the two will result in the same cally significant. Earlier it was mentioned that
answer). The null hypothesis is that R2 equals zero because this is a bivariate regression, the signifi-
in the population. One way of calculating the sta- cance of the overall regression and b provide redun-
tistical significance of the overall regression is to dant information. The use of F and t tests may thus
90 Bivariate Regression

be confusing, but note that F (462.17) equals Each person’s residual is thus represented by the
t2 ð21:502 ) in this bivariate case. A word of caution: distance between the observed score and the
This finding does not generalize to multiple regres- regression line. Because the regression line repre-
sion. In fact, in a multiple regression, the overall sents the predicted scores, the residuals are the dif-
regression might be significant, and some of the bs ference between predicted and observed scores.
may or may not be significant. In a multiple regres- Again, the regression line minimizes the distance
sion, both the overall regression equation and the of these residuals from the regression line. Much
individual coefficients are examined for statistical as residuals are thought of as science scores with
significance. the effects of reading scores removed, the residual
variance is the proportion of variance in science
scores left unexplained by reading scores. In the
Residuals
example, the residual variance was .69, or 1  R2.
Before completing this explanation of bivariate
regression, it will be instructive to discuss a topic
that has been for the most part avoided until now: Regression Interpretation
the residuals. Earlier it was mentioned that e
(residual) was also included in the regression equa- An example interpretation for the reading and sci-
tion. Remember that regression parameter esti- ence example concludes this entry on bivariate
mates minimize the prediction errors, but the regression. The purpose of this study was to deter-
prediction is unlikely to be perfect. The residuals mine how well first-grade science scores explained
represent the error in prediction. Or the residual fifth-grade science achievement scores. The regres-
variance represents the variance that is left unex- sion of fifth-grade science scores on first-grade
plained by the explanatory variable. Returning to reading scores was statistically significant,
the example, if reading scores were used to predict R2 ¼ :31, Fð1; 1025Þ ¼ 462:17, p < :01: Reading
science scores for those 1,026 students, each stu- accounted for 31% of the variance in science
dent would have a prediction equation in which achievement. The unstandardized regression coeffi-
his or her reading score would be used to calculate cient was .58, meaning that for each T-score point
a predicted science score. Because the actual score increase in reading, there was a .58 T-score
for each person is also known, the residual for increase in science achievement. Children who are
each person would represent the observed fifth- better readers in first grade also tend to be higher
grade science score minus the predicted score achievers in fifth-grade science.
obtained from the regression equation. Residuals
Matthew R. Reynolds
are thus observed scores minus predicted scores, or
conceptually they may be thought of as the fifth- See also Correlation; Multiple Regression; Path Analysis;
grade science scores with effects of first-grade Scatterplot; Variance
reading removed.
Another way to think of the residuals is to
revert back to the scatterplot in Figure 1. The x-
Further Readings
axis represents the reading scores, and the y-axis
represents the science scores. Both predicted and Bobko, P. (2001). Correlation and regression: Principles
actual scores are already plotted on this scatter- and applications for industrial/organizational
plot. That is, the predicted scores are found on the psychology and management (2nd ed.). Thousand
regression line. If a person’s reading score was 40, Oaks, CA: Sage.
the predicted science score may be obtained by Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
Applied multiple regression/correlation analysis for the
first finding 40 on the x-axis, and then moving up
behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence
in a straight line until reaching the regression line. Erlbaum.
The observed science scores for this sample are Keith, T. Z. (2006). Multiple regression and beyond.
also shown on the plot, represented by the dots Boston: Pearson.
scattered about the regression line. Some are very McDonald, R. P. (1999). Test theory: A unified
close to the line whereas others are farther away. treatment. Mahwah, NJ: Lawrence Erlbaum.
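To make these computations concrete, the following sketch fits a bivariate regression in Python with simulated reading and science scores (the sample size, coefficient values, and error variance here are invented for illustration and are not the study data discussed in this entry); it recovers the slope, R², residuals, and the equivalent t and F tests, with F = t².

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reading = rng.normal(50, 10, 200)                        # hypothetical first-grade reading T scores
science = 20 + 0.58 * reading + rng.normal(0, 8, 200)    # hypothetical fifth-grade science T scores

# Least-squares slope (b) and intercept (a) for science = a + b * reading + e
b, a = np.polyfit(reading, science, 1)
predicted = a + b * reading
residuals = science - predicted                          # observed minus predicted scores

# R-squared: proportion of science variance explained by reading
r_squared = 1 - residuals.var() / science.var()

# t test for the slope and the equivalent overall F test (F = t**2 in bivariate regression)
n = len(reading)
se_b = np.sqrt(residuals.var(ddof=2) / ((n - 1) * reading.var(ddof=1)))
t_stat = b / se_b
f_stat = t_stat ** 2
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(b, r_squared, t_stat, f_stat, p_value)
```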
Miles, J., & Shevlin, M. (2001). Applying regression and facilities. The salient features of the five most often
correlation: A guide for students and researchers. used block designs are described next.
Thousand Oaks, CA: Sage.
Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (1986).
Understanding regression analysis: An introductory Block Designs With One Treatment
guide. Beverly Hills, CA: Sage.
Weisberg, S. (2005). Applied linear regression (3rd ed.).
Dependent Samples t-Statistic Design
Hoboken, NJ: Wiley. The simplest block design is the randomization
and analysis plan that is used with a t statistic for
dependent samples. Consider an experiment to
compare two ways of memorizing Spanish vocabu-
lary. The dependent variable is the number of trials
BLOCK DESIGN required to learn the vocabulary list to the crite-
rion of three correct recitations. The null and alter-
Sir Ronald Fisher, the father of modern experimen- native hypotheses for the experiment are,
tal design, extolled the advantages of block designs respectively,
in his classic book, The Design of Experiments.
He observed that block designs enable researchers H0: μ1 − μ2 = 0
to reduce error variation and thereby obtain more
powerful tests of false null hypotheses. In the and
behavioral sciences, a significant source of error
variation is the nuisance variable of individual dif-
H1: μ1 − μ2 ≠ 0,
ferences. This nuisance variable can be isolated by
assigning participants or experimental units to
blocks so that at the beginning of an experiment, where μ1 and μ2 denote the population means for
the participants within a block are more homoge- the two memorization approaches. It is reasonable
neous with respect to the dependent variable than to believe that IQ is negatively correlated with the
are participants in different blocks. Three proce- number of trials required to memorize Spanish
dures are used to form homogeneous blocks. vocabulary. To isolate this nuisance variable, n
blocks of participants can be formed so that the
1. Match participants on a variable that is two participants in each block have similar IQs. A
correlated with the dependent variable. Each simple way to form blocks of matched participants
block consists of a set of matched participants. is to rank the participants in terms of IQ. The par-
ticipants ranked 1 and 2 are assigned to Block 1,
2. Observe each participant under all or a portion
of the treatment levels or treatment
those ranked 3 and 4 are assigned to Block 2, and
combinations. Each block consists of a single so on. Suppose that 20 participants have volun-
participant who is observed two or more times. teered for the memorization experiment. In this
Depending on the nature of the treatment, case, n = 10 blocks of dependent samples can
a period of time between treatment level be formed. The two participants in each block
administrations may be necessary in order for are randomly assigned to the memorization
the effects of one treatment level to dissipate approaches. The layout for the experiment is
before the participant is observed under other shown in Figure 1.
levels. The null hypothesis is tested using a t statistic
3. Use identical twins or litter mates. Each block for dependent samples. If the researcher’s hunch is
consists of participants who have identical or correct—that IQ is correlated with the number of
similar genetic characteristics. trials to learn—the design should result in a more
powerful test of a false null hypothesis than would
Block designs also can be used to isolate other a t-statistic design for independent samples. The
nuisance variables, such as the effects of adminis- increased power results from isolating the nuisance
tering treatments at different times of day, on dif- variable of IQ so that it does not appear in the
ferent days of the week, or in different testing estimates of the error effects.
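A minimal sketch of the analysis for this design is given below, using hypothetical IQ and trials-to-criterion values (the numbers and effect sizes are invented); it forms 10 blocks of two IQ-matched participants and applies a dependent samples t test with scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
iq = np.sort(rng.normal(100, 15, 20))[::-1]      # 20 volunteers ranked from highest to lowest IQ
blocks = iq.reshape(10, 2)                       # adjacent ranks form 10 matched blocks

# Hypothetical trials-to-criterion: fewer trials for higher IQ, plus a method difference;
# in a real experiment the two members of each block would be randomly assigned to approaches.
trials_a1 = 30 - 0.15 * blocks[:, 0] + rng.normal(0, 2, 10)   # memorization approach 1
trials_a2 = 33 - 0.15 * blocks[:, 1] + rng.normal(0, 2, 10)   # memorization approach 2

# Dependent samples t statistic: block (IQ) differences drop out of the error term
t_stat, p_value = stats.ttest_rel(trials_a1, trials_a2)
print(t_stat, p_value)
```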
Figure 1 Layout for a Dependent Samples t-Statistic Design
Notes: aj denotes a treatment level (Treat. Level); Yij denotes a measure of the dependent variable (Dep. Var.). The layout crosses Blocks 1 through 10 with the two treatment levels a1 and a2 and lists the observations Y11 through Y10,2; the treatment-level means are denoted by Y.1 and Y.2. Each block in the memorization experiment contains two matched participants. The participants in each block are randomly assigned to the treatment levels.

Randomized Block Design

The randomized block analysis of variance design can be thought of as an extension of a dependent samples t-statistic design for the case in which the treatment has two or more levels. The layout for a randomized block design with p = 3 levels of Treatment A and n = 10 blocks is shown in Figure 2. A comparison of this layout with that in Figure 1 for the dependent samples t-statistic design reveals that the layouts are the same except that the randomized block design has three treatment levels.

Figure 2 Layout for a Randomized Block Design With p = 3 Treatment Levels and n = 10 Blocks
Notes: aj denotes a treatment level (Treat. Level); Yij denotes a measure of the dependent variable (Dep. Var.). Each block contains three matched participants. The participants in each block are randomly assigned to the treatment levels. The means of Treatment A are denoted by Y.1, Y.2, and Y.3, and the means of the blocks are denoted by Y1., . . . , Y10. .

In a randomized block design, a block might contain a single participant who is observed under all p treatment levels or p participants who are similar with respect to a variable that is correlated with the dependent variable. If each block contains one participant, the order in which the treatment levels are administered is randomized independently for each block, assuming that the nature of the research hypothesis permits this. If a block contains p matched participants, the participants in each block are randomly assigned to the treatment levels.
The statistical analysis of the data is the same whether repeated measures or matched participants are used. However, the procedure used to form homogeneous blocks does affect the interpretation of the results. The results of an experiment with repeated measures generalize to a population of participants who have been exposed to all the treatment levels. The results of an experiment with matched participants generalize to a population of participants who have been exposed to only one treatment level.
The total sum of squares (SS) and total degrees of freedom for a randomized block design are partitioned as follows:

SSTOTAL = SSA + SSBLOCKS + SSRESIDUAL
np − 1 = (p − 1) + (n − 1) + (n − 1)(p − 1),

where SSA denotes the Treatment A SS and SSBLOCKS denotes the blocks SS. The SSRESIDUAL is the interaction between Treatment A and blocks; it is used to estimate error effects. Many test statistics can be thought of as a ratio of error effects and treatment effects as follows:

Test statistic = [f(error effects) + f(treatment effects)] / f(error effects),

where f( ) denotes a function of the effects in parentheses. The use of a block design enables a researcher to isolate variation attributable to the blocks variable so that it does not appear in estimates of error effects. By removing this nuisance variable from the numerator and denominator of the test statistic, a researcher is rewarded with a more powerful test of a false null hypothesis.
Two null hypotheses can be tested in a randomized block design. One hypothesis concerns the equality of the Treatment A population means; the other hypothesis concerns the equality of the blocks population means. For this design and those described later, assume that the treatment represents a fixed effect and the nuisance variable, blocks, represents a random effect. For this mixed model, the null hypotheses are

H0: μ.1 = μ.2 = ⋯ = μ.p (treatment A population means are equal)

and

H0: σ²BL = 0 (variance of the blocks, BL, population means is equal to zero),

where μij denotes the population mean for the ith block and the jth level of treatment A. The F statistics for testing these hypotheses are

F = [SSA/(p − 1)] / [SSRESIDUAL/((n − 1)(p − 1))] = MSA/MSRESIDUAL

and

F = [SSBL/(n − 1)] / [SSRESIDUAL/((n − 1)(p − 1))] = MSBL/MSRESIDUAL.

The test of the blocks null hypothesis is generally of little interest because the population means of the nuisance variable are expected to differ.
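The following sketch carries out this partition and the Treatment A F test by hand with NumPy for a hypothetical 10-block, three-level data set (the block and treatment effects are invented for illustration).

```python
import numpy as np
from scipy import stats

# y[i, j]: score for block i under treatment level j (n = 10 blocks, p = 3 levels)
rng = np.random.default_rng(2)
block_effect = rng.normal(0, 3, 10)[:, None]
treatment_effect = np.array([0.0, 1.5, 3.0])
y = 50 + block_effect + treatment_effect + rng.normal(0, 2, (10, 3))

n, p = y.shape
grand = y.mean()
ss_a = n * ((y.mean(axis=0) - grand) ** 2).sum()       # Treatment A sum of squares
ss_blocks = p * ((y.mean(axis=1) - grand) ** 2).sum()  # blocks sum of squares
ss_total = ((y - grand) ** 2).sum()
ss_residual = ss_total - ss_a - ss_blocks              # A x blocks interaction (error term)

ms_a = ss_a / (p - 1)
ms_residual = ss_residual / ((n - 1) * (p - 1))
f_a = ms_a / ms_residual
p_value = stats.f.sf(f_a, p - 1, (n - 1) * (p - 1))
print(f_a, p_value)
```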
The advantages of the design are simplicity in the statistical analysis and the ability to isolate a nuisance variable so as to obtain greater power to reject a false null hypothesis. The disadvantages of the design include the difficulty of forming homogeneous blocks and of observing participants p times when p is large and the restrictive sphericity assumption of the design. This assumption states that in order for F statistics to be distributed as central F when the null hypothesis is true, the variances of the differences for all pairs of treatment levels must be homogeneous; that is,

σ²(Yj − Yj′) = σ²j + σ²j′ − 2σjj′ for all j and j′.

Generalized Randomized Block Design

A generalized randomized block design is a variation of a randomized block design. Instead of having n blocks of p homogeneous participants, the generalized randomized block design has w groups of np homogeneous participants. The w groups, like the n blocks in a randomized block design, represent a nuisance variable that a researcher wants to remove from the error effects. The generalized randomized block design can be used when a researcher is interested in one treatment with p ≥ 2 treatment levels and the researcher has sufficient participants to form w groups, each containing np homogeneous participants. The total number of participants in the design is N = npw. The layout for the design is shown in Figure 3.
Figure 3 Generalized Randomized Block Design With N = 30 Participants, p = 3 Treatment Levels, and w = 5 Groups of np = (2)(3) = 6 Homogeneous Participants
Notes: aj denotes a treatment level (Treat. Level); Yijz denotes a measure of the dependent variable (Dep. Var.). Within each of the five groups, the six participants are randomly assigned to the three treatment levels, two per level; the cell means are denoted by Y.11 through Y.35.

In the memorization experiment described earlier, suppose that 30 volunteers are available. The 30 participants are ranked with respect to IQ. The np = (2)(3) = 6 participants with the highest IQs are assigned to Group 1, the next 6 participants are assigned to Group 2, and so on. The np = 6 participants in each group are then randomly assigned to the p = 3 treatment levels with the restriction that n = 2 participants are assigned to each level.
The total SS and total degrees of freedom are partitioned as follows:

SSTOTAL = SSA + SSG + SSA×G + SSWCELL
npw − 1 = (p − 1) + (w − 1) + (p − 1)(w − 1) + pw(n − 1),

where SSG denotes the groups SS and SSA×G denotes the interaction of Treatment A and groups. The within-cells SS, SSWCELL, is used to estimate error effects. Three null hypotheses can be tested:

1. H0: μ1. = μ2. = ⋯ = μp. (Treatment A population means are equal);
2. H0: σ²G = 0 (Variance of the groups, G, population means is equal to zero);
3. H0: σ²A×G = 0 (Variance of the A × G interaction is equal to zero);

where μijz denotes a population mean for the ith participant in the jth treatment level and zth group.
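A sketch of this analysis is shown below using the statsmodels formula interface; the group and treatment effect sizes are invented, and the default ANOVA table tests each term against the within-cell (residual) mean square, which corresponds to the F statistics given next.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
# w = 5 groups x p = 3 treatment levels x n = 2 participants per cell (N = 30)
d = pd.DataFrame(
    [(g, a, 50 - 2 * g + 1.5 * a + rng.normal(0, 2))
     for g in range(5) for a in range(3) for _ in range(2)],
    columns=["group", "a", "y"],
)

# Treatment A, groups, and the A x G interaction are each tested against
# the within-cell mean square (the model residual, MSWCELL).
model = smf.ols("y ~ C(a) * C(group)", data=d).fit()
print(anova_lm(model, typ=2))
```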
The three null hypotheses are tested using the following F statistics:

1. F = [SSA/(p − 1)] / [SSWCELL/(pw(n − 1))] = MSA/MSWCELL,
2. F = [SSG/(w − 1)] / [SSWCELL/(pw(n − 1))] = MSG/MSWCELL,
3. F = [SSA×G/((p − 1)(w − 1))] / [SSWCELL/(pw(n − 1))] = MSA×G/MSWCELL.

The generalized randomized block design enables a researcher to isolate one nuisance variable—an advantage that it shares with the randomized block design. Furthermore, the design uses the within-cell variation in the pw = (3)(5) = 15 cells to estimate error effects rather than an interaction, as in the randomized block design. Hence, the restrictive sphericity assumption of the randomized block design is replaced with the assumption of homogeneity of within-cell population variances.

Block Designs With Two or More Treatments

The blocking procedure that is used with a randomized block design can be extended to experiments that have two or more treatments, denoted by the letters A, B, C, and so on.

Randomized Block Factorial Design

A randomized block factorial design with two treatments, denoted by A and B, is constructed by crossing the p levels of Treatment A with the q levels of Treatment B. The design's n blocks each contain p × q treatment combinations: a1b1, a1b2, . . . , apbq. The design enables a researcher to isolate variation attributable to one nuisance variable while simultaneously evaluating two treatments and associated interaction.
The layout for the design with p = 2 levels of Treatment A and q = 2 levels of Treatment B is shown in Figure 4. It is apparent from Figure 4 that all the participants are used in simultaneously evaluating the effects of each treatment. Hence, the design permits efficient use of resources because each treatment is evaluated with the same precision as if the entire experiment had been devoted to that treatment alone.

Figure 4 Layout for a Two-Treatment, Randomized Block Factorial Design in Which Four Homogeneous Participants Are Randomly Assigned to the pq = 2 × 2 = 4 Treatment Combinations in Each Block
Note: ajbk denotes a treatment combination (Treat. Comb.); Yijk denotes a measure of the dependent variable (Dep. Var.). The layout crosses Blocks 1 through 10 with the combinations a1b1, a1b2, a2b1, and a2b2; the block means are denoted by Y1.. through Y10.. and the combination means by Y.11, Y.12, Y.21, and Y.22.

The total SS and total degrees of freedom for a two-treatment randomized block factorial design are partitioned as follows:

SSTOTAL = SSBL + SSA + SSB + SSA×B + SSRESIDUAL
npq − 1 = (n − 1) + (p − 1) + (q − 1) + (p − 1)(q − 1) + (n − 1)(pq − 1).
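The sketch below fits this partition with statsmodels for hypothetical data (the block, treatment, and interaction effects are invented); the model residual corresponds to SSRESIDUAL with (n − 1)(pq − 1) degrees of freedom.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(4)
rows = []
for block in range(10):                  # n = 10 blocks
    u = rng.normal(0, 3)                 # block (nuisance) effect
    for a in range(2):                   # p = 2 levels of Treatment A
        for b in range(2):               # q = 2 levels of Treatment B
            y = 50 + u + 2 * a + 1 * b + 0.5 * a * b + rng.normal(0, 2)
            rows.append((block, a, b, y))
d = pd.DataFrame(rows, columns=["block", "a", "b", "y"])

# SSBL + SSA + SSB + SSAxB, with the block-by-treatment residual as the error term
model = smf.ols("y ~ C(block) + C(a) * C(b)", data=d).fit()
print(anova_lm(model, typ=2))
```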
Four null hypotheses can be tested:

1. H0: σ²BL = 0 (Variance of the blocks, BL, population means is equal to zero),
2. H0: μ.1. = μ.2. = ⋯ = μ.p. (Treatment A population means are equal),
3. H0: μ..1 = μ..2 = ⋯ = μ..q (Treatment B population means are equal),
4. H0: A × B interaction = 0 (Treatments A and B do not interact),

where μijk denotes a population mean for the ith block, jth level of treatment A, and kth level of treatment B. The F statistics for testing the null hypotheses are as follows:

F = [SSBL/(n − 1)] / [SSRESIDUAL/((n − 1)(pq − 1))] = MSBL/MSRESIDUAL,
F = [SSA/(p − 1)] / [SSRESIDUAL/((n − 1)(pq − 1))] = MSA/MSRESIDUAL,
F = [SSB/(q − 1)] / [SSRESIDUAL/((n − 1)(pq − 1))] = MSB/MSRESIDUAL,
F = [SSA×B/((p − 1)(q − 1))] / [SSRESIDUAL/((n − 1)(pq − 1))] = MSA×B/MSRESIDUAL.

The design shares the advantages and disadvantages of the randomized block design. Furthermore, the design enables a researcher to efficiently evaluate two or more treatments and associated interactions in the same experiment. Unfortunately, the design lacks simplicity in the interpretation of the results if interaction effects are present. The design has another disadvantage: If Treatment A or B has numerous levels, say four or five, the block size becomes prohibitively large. For example, if p = 4 and q = 3, the design has blocks of size 4 × 3 = 12. Obtaining n blocks with 12 matched participants or observing n participants on 12 occasions is often not feasible. A design that reduces the size of the blocks is described next.

Split-Plot Factorial Design

In the late 1920s, Fisher and Frank Yates addressed the problem of prohibitively large block sizes by developing confounding schemes in which only a portion of the treatment combinations in an experiment are assigned to each block. The split-plot factorial design achieves a reduction in the block size by confounding one or more treatments with groups of blocks. Group–treatment confounding occurs when the effects of, say, Treatment A with p levels are indistinguishable from the effects of p groups of blocks.
The layout for a two-treatment split-plot factorial design is shown in Figure 5. The block size in the split-plot factorial design is half as large as the block size of the randomized block factorial design in Figure 4 although the designs contain the same treatment combinations. Consider the sample means Y.1. and Y.2. in Figure 5. Because of confounding, the difference between Y.1. and Y.2. reflects both group effects and Treatment A effects.
The total SS and total degrees of freedom for a split-plot factorial design are partitioned as follows:

SSTOTAL = SSA + SSBL(A) + SSB + SSA×B + SSRESIDUAL
npq − 1 = (p − 1) + p(n − 1) + (q − 1) + (p − 1)(q − 1) + p(n − 1)(q − 1),

where SSBL(A) denotes the SS for blocks within Treatment A. Three null hypotheses can be tested:

1. H0: μ.1. = μ.2. = ⋯ = μ.p. (Treatment A population means are equal),
2. H0: μ..1 = μ..2 = ⋯ = μ..q (Treatment B population means are equal),
3. H0: A × B interaction = 0 (Treatments A and B do not interact),
where μijk denotes the ith block, jth level of treatment A, and kth level of treatment B. The F statistics are

F = [SSA/(p − 1)] / [SSBL(A)/(p(n − 1))] = MSA/MSBL(A),
F = [SSB/(q − 1)] / [SSRESIDUAL/(p(n − 1)(q − 1))] = MSB/MSRESIDUAL,
F = [SSA×B/((p − 1)(q − 1))] / [SSRESIDUAL/(p(n − 1)(q − 1))] = MSA×B/MSRESIDUAL.

Figure 5 Layout for a Two-Treatment, Split-Plot Factorial Design in Which 10 + 10 = 20 Homogeneous Blocks Are Randomly Assigned to the Two Groups
Notes: ajbk denotes a treatment combination (Treat. Comb.); Yijk denotes a measure of the dependent variable (Dep. Var.). Treatment A is confounded with groups. Treatment B and the A × B interaction are not confounded.

The split-plot factorial design uses two error terms: MSBL(A) is used to test Treatment A; a different and usually much smaller error term, MSRESIDUAL, is used to test Treatment B and the A × B interaction. Because MSRESIDUAL is generally smaller than MSBL(A), the power of the tests of Treatment B and the A × B interaction is greater than that for Treatment A.

Roger E. Kirk

See also Analysis of Variance (ANOVA); Confounding; F Test; Nuisance Variable; Null Hypothesis; Sphericity; Sums of Squares

Further Readings

Dean, A., & Voss, D. (1999). Design and analysis of experiments. New York: Springer-Verlag.
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Kirk, R. E. (2002). Experimental design. In I. B. Weiner (Series Ed.) & J. Schinka & W. F. Velicer (Vol. Eds.), Handbook of psychology: Vol. 2. Research methods in psychology (pp. 3–32). New York: Wiley.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Myers, J. L., & Well, A. D. (2003). Research design and statistical analysis (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Although his work was in mathematical probabil-


BONFERRONI PROCEDURE ity, researchers have since applied his work to
statistical inference. Bonferroni’s principal contri-
The Bonferroni procedure is a statistical adjustment bution to statistical inference was the identification
to the significance level of hypothesis tests when of the probability inequality that bears his name.
multiple tests are being performed. The purpose of
an adjustment such as the Bonferroni procedure is
Explanation
to reduce the probability of identifying significant
results that do not exist, that is, to guard against The Bonferroni procedure is an application of
making Type I errors (rejecting null hypotheses the Bonferroni inequality to the probabilities
when they are true) in the testing process. This associated with multiple testing. It prescribes
potential for error increases with an increase in the using an adjustment to the significance level for
number of tests being performed in a given study individual tests when simultaneous statistical
and is due to the multiplication of probabilities inference for several tests is being performed.
across the multiple tests. The Bonferroni procedure The adjustment can be used for bounding simul-
is often used as an adjustment in multiple compari- taneous confidence intervals, as well as for
sons after a significant finding in an analysis of vari- simultaneous testing of hypotheses.
ance (ANOVA) or when constructing simultaneous The Bonferroni inequality states the following:
confidence intervals for several population para-
meters, but more broadly, it can be used in any situ- 1. Let Ai , i ¼ 1 to k, represent k events. Then,
ation that involves multiple tests. The Bonferroni k
T Pk  
procedure is one of the more commonly used proce- P Ai ≥ 1  P Ai , where Ai is the
i¼1 i¼1
dures in multiple testing situations, primarily complement of the event Ai:
because it is an easy adjustment to make. A strength
of the Bonferroni procedure is its ability to maintain 2. Consider the mechanics of the Bonferroni
Type I error rates at or below a nominal value. A k
T Pk  
weakness of the Bonferroni procedure is that it inequality, P Ai ≥ 1  P Ai ,
i¼1 i¼1
often overcorrects, making testing results too con-
servative because of a decrease in statistical power. 3. and rewrite the inequality as follows:
A variety of other procedures have been devel-
oped to control the overall Type I error level when !
\
k X
k
 
multiple tests are performed. Some of these other 1P Ai ≤ P Ai :
multiple comparison and multiple testing proce- i¼1 i¼1
dures, including the Student–Newman–Keuls pro-
cedure, are derivatives of the Bonferroni procedure,
modified to make the procedure less conservative Now, consider Ai as a Type I error in the ith test in
k
without sacrificing Type I error control. Other mul- T
tiple comparison and multiple testing procedures a collection of k hypothesis tests. Then P Ai
i¼1
are simulation based and are not directly related to represents the probability that no Type I errors
the Bonferroni procedure. k
T
This entry describes the procedure’s back- occur in the k hypothesis tests, and 1  P Ai
ground, explains the procedure, and provides an i¼1

example. This entry also presents applications for represents the probability that at least one Type I
 
the procedure and examines recent research. error occurs in the k hypothesis tests. P Ai repre-
sents the probability of a Type I error in the ith
test, and we can label this probability as
 
Background αi ¼ P Ai . So Bonferroni’s inequality implies that
the probability of at least one Type I error occur-
The Bonferroni procedure is named after the P
k
Italian mathematician Carlo Emilio Bonferroni. ring in k hypothesis tests is ≤ αi :
i¼1
Figure 1 Illustration of Bonferroni's Inequality
Notes: A1 = event 1; A2 = event 2; P(A1) = probability of A1; P(A2) = probability of A2. Interaction between events results in redundance in probability.

error rate is at most 10 × .05, or .50. The error rate in this scenario is in fact .40, which is considerably less than .50.
As another example, consider the extreme case in which all k = 10 hypothesis tests are exactly dependent on each other—the same test is conducted 10 times on the same data. In this scenario, the experiment-wise error rate does not increase because of the multiple tests. In fact, if the Type I error rate for one of the tests is α = .05, the experiment-wise error rate is the same, .05, for all 10 tests simultaneously. We can see this result from the Bonferroni inequality: P(A1 ∩ A2 ∩ ⋯ ∩ Ak) = P(Ai) when the events, Ai, are all the same because A1 ∩ A2 ∩ ⋯ ∩ Ak = Ai. The Bonferroni procedure would sug-

If, as is often assumed, all k tests have the same gest an upper bound on this experiment-wise proba-
probability of a Type I error, α, then we can con- bility as .50—overly conservative by 10-fold! It
clude that the probability of at least one Type I would be unusual for a researcher to conduct k
error occurring in k hypothesis tests is ≤ kα. equivalent tests on the same data. However, it
Consider an illustration of Bonferroni’s inequal- would not be unusual for a researcher to conduct k
ity in the simple case in which k ¼ 2: Let the two tests and for many of those tests, if not all, to be
events A1 and A2 have probabilities P(A1) and partially interdependent. The more interdependent
P(A2), respectively. The sum of the probabilities of the tests are, the smaller the experiment-wise error
the two events is clearly greater than the probability rate and the more overly conservative the Bonfer-
of the union of the two events because the former roni procedure is.
counts the probability of the intersection of the two Other procedures have sought to correct for
events twice, as shown in Figure 1. inflation in experiment-wise error rates without
The Bonferroni procedure is simple in the sense being as conservative as the Bonferroni procedure.
that a researcher need only know the number of However, none are as simple to use. These other
tests to be performed and the probability of a Type procedures include the Student–Newman–Keuls,
I error for those tests in order to construct this Tukey, and Scheffé procedures, to name a few.
upper bound on the experiment-wise error rate. Descriptions of these other procedures and their
However, as mentioned earlier, the Bonferroni pro- uses can be found in many basic statistical meth-
cedure is often criticized for being too conservative. ods textbooks, as well as this encyclopedia.
Consider that the researcher does not typically
know what the actual Type I error rate is for a given
test. Rather, the researcher constructs the test so Example
that the maximum allowable Type I error rate is α. Consider the case of a researcher studying
Then the actual Type I error rate may be consider- the effect of three different teaching
ably less than α for any given test. methods on the average words per minute
For example, suppose a test is constructed with (μ1, μ2, μ3) at which a student can read.
a nominal α = .05. Suppose the researcher con- The researcher tests three hypotheses:
ducts k = 10 such tests on a given set of data, and μ1 = μ2 (vs. μ1 ≠ μ2); μ1 = μ3 (vs. μ1 ≠ μ3); and
the actual Type I error rate for each of the tests is μ2 = μ3 (vs. μ2 ≠ μ3). Each test is conducted at
.04. Using the Bonferroni procedure, the a nominal level, α0 = .05, resulting in a comparison-
researcher concludes that the experiment-wise wise error rate of αc = .05 for each test. Denote A1,
A2, and A3 as the event of falsely rejecting the null when testing beta coefficients in a multiple
hypotheses 1, 2, and 3, respectively, and denote regression analysis, it has been shown that the
p1, p2, and p3 the probability of events A1, A2, overall Type I error rate in such an analysis
and A3, respectively. These would be the individual involving as few as eight regression coefficients
p values for these tests. It may be assumed that can exceed .30, resulting in almost a 1 in 3
some dependence exists among the three events, A1, chance of falsely rejecting a null hypothesis.
A2, and A3, principally because the events are all Using a Bonferroni adjustment when one is con-
based on data collected from a single study. Conse- ducting these tests would control that overall
quently, the experiment-wise error rate, the proba- Type I error rate. Similar adjustments can be
bility of falsely rejecting any of the three null used to test for main effects and interactions
hypotheses, is at least equal to αe ¼ :05 but poten- in ANOVA and multivariate ANOVA designs
tially as large as :053 ¼ :15: For this reason, we because all that is required to make the adjustment
may apply the Bonferroni procedure by dividing our is that the researcher knows the number of tests
nominal level of α0 ¼ :05 by k ¼ 3 to obtain being performed. The Bonferroni adjustment has
α0 ¼ :0167: Then, rather than comparing the been used to adjust the experiment-wise Type I
p values p1, p2, and p3 to α0 ¼ 05, we compare error rate for multiple tests in a variety of disci-
them to α0 ¼ :0167: The experiment-wise error plines, such as medical, educational, and psycho-
rate is therefore adjusted down so that it is less than logical research, to name a few.
or equal to the original intended nominal level of
α0 = .05.
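A small sketch of this adjustment is shown below; the three p values are hypothetical and simply stand in for the three pairwise comparisons in the example.

```python
import numpy as np

alpha = 0.05
p_values = np.array([0.021, 0.004, 0.350])   # hypothetical p values for the three comparisons
k = len(p_values)

adjusted_alpha = alpha / k                   # Bonferroni: compare each p value to alpha / k
reject = p_values < adjusted_alpha
print(adjusted_alpha, reject)

# Equivalently, inflate the p values by k and compare them to the original alpha
adjusted_p = np.minimum(p_values * k, 1.0)
print(adjusted_p, adjusted_p < alpha)
```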
Recent Research
It should be noted that although the Bonferroni
procedure is often used in the comparison of mul- One of the main criticisms of the Bonferroni proce-
tiple means, because the adjustment is made to the dure is the fact that it overcorrects the overall Type
nominal level, α0 , or to the test’s resulting p value, I error rate, which results in lower statistical power.
the multiple tests could be hypothesis tests of any Many modifications to this procedure have been
population parameters based on any probability proposed over the years to try to alleviate this prob-
distributions. So, for example, one experiment lem. Most of these proposed alternatives can be
could involve a hypothesis test regarding a mean classified either as step-down procedures (e.g., the
and another hypothesis test regarding a variance, Holm method), which test the most significant
and an adjustment based on k ¼ 2 could be made (and, therefore, smallest) p value first, or step-up
to the two tests to maintain the experiment-wise procedures (e.g., the Hochberg method), which
error rate at the nominal level. begin testing with the least significant (and largest)
p value. With each of these procedures, although
the tests are all being conducted concurrently, each
Applications
hypothesis is not tested at the same time or at the
As noted above, the Bonferroni procedure is used same level of significance.
primarily to control the overall α level (i.e., the More recent research has attempted to find
experiment-wise level) when multiple tests are being a divisor between 1 and k that would protect the
performed. Many statistical procedures have been overall Type I error rate at or below the nominal
developed at least partially for this purpose; how- .05 level but closer to that nominal level so as to
ever, most of those procedures have applications have a lesser effect on the power to detect actual
exclusively in the context of making multiple com- differences. This attempt was based on the premise
parisons of group means after finding a significant that making no adjustment to the α level is too lib-
ANOVA result. While the Bonferroni procedure can eral an approach (inflating the experiment-wise
also be used in this context, one of its advantages error rate), and dividing by the number of tests, k,
over other such procedures is that it can also be is too conservative (overadjusting that error rate).
used in other multiple testing situations that do not It was shown that the optimal divisor is directly
initially entail an omnibus test such as ANOVA. determined by the proportion of nonsignificant dif-
For example, although most statistical tests ferences or relationships in the multiple tests being
do not advocate using a Bonferroni adjustment performed. Based on this result, a divisor of
kð1  qÞ, where q ¼ the proportion of nonsignifi- Simes, R. J. (1986). An improved Bonferroni procedure
cant tests, did the best job of protecting against for multiple tests of significance. Biometrika, 73(3),
Type I errors without sacrificing as much power. 751–754.
Unfortunately, researchers often do not know, Westfall, P. H., Tobias, R. D., Rom, D., Wolfinger, R. D.,
& Hochberg, Y. (1999). Multiple comparisons and
a priori, the number of nonsignificant tests that
multiple tests using SAS. Cary, NC: SAS Institute.
will occur in the collection of tests being per-
formed. Consequently, research has also shown
that a practical choice of the divisor is k/1.5
(rounded to the nearest integer) when the number
of tests is greater than three. This modified Bonfer-
BOOTSTRAPPING
roni adjustment will outperform alternatives in
keeping the experiment-wise error rate at or below The bootstrap is a computer-based statistical tech-
the nominal .05 level and will have higher power nique that is used to obtain measures of precision
than other commonly used adjustments. of parameter estimates. Although the technique is
sufficiently general to be used in time-series analy-
Jamis J. Perrett and Daniel J. Mundfrom sis, permutation tests, cross-validation, nonlinear
regression, and cluster analysis, its most common
See also Analysis of Variance (ANOVA); Hypothesis; use is to compute standard errors and confidence
Multiple Comparison Tests; Newman–Keuls Test and intervals. Introduced by Bradley Efron in 1979,
Tukey Test; Scheffé Test the procedure itself belongs in a broader class of
estimators that use sampling techniques to create
empirical distributions by resampling from the
Further Reqdings
original data set. The goal of the procedure is to
produce analytic expressions for estimators that
Bain, L. J., & Engelhandt, M. (1992). Introduction to are difficult to calculate mathematically. The name
probability and mathematical statistics (2nd ed.). itself derives from the popular story in which
Boston: PWS-Kent. . Baron von Munchausen (after whom Munchausen
Hochberg, Y. (1988). A sharper Bonferroni procedure for
syndrome is also named) was stuck at the bottom
multiple tests of significance. Biometrika, 75(4),
800–802.
of a lake with no alternative but to grab his own
Holland, B. S., & Copenhaver, M. D. (1987). An bootstraps and pull himself to the surface. In a sim-
improved sequentially rejective Bonferroni test ilar sense, when a closed-form mathematical solu-
procedure. Biometrics, 43, 417–423. tion is not easy to calculate, the researcher has no
Holm, S. (1979). A simple sequentially rejective multiple alternative but to ‘‘pull himself or herself up by the
test procedure. Scandinavian Journal of Statistics, 6, bootstraps’’ by employing such resampling techni-
65–70. ques. This entry explores the basic principles and
Hommel, G. (1988). A stagewise rejective multiple test procedures of bootstrapping and examines its
procedure based on a modified Bonferroni test. other applications and limitations.
Biometrika, 75(2), 383–386.
Mundfrom, D. J., Perrett, J., Schaffer, J., Piccone, A., &
Roozeboom, M. A. (2006). Bonferroni adjustments in Basic Principles and Estimation Procedures
tests for regression coefficients. Multiple Linear
Regression Viewpoints, 32(1), 1–6. The fundamental principle on which the procedure
Rom, D. M. (1990). A sequentially rejective test is based is the belief that under certain general con-
procedure based on a modified Bonferroni inequality. ditions, the relationship between a bootstrapped
Biometrika, 77(3), 663–665. estimator and a parameter estimate should be simi-
Roozeboom, M. A., Mundfrom, D. J., & Perrett, J.
lar to the relationship between the parameter esti-
(2008, August). A single-step modified Bonferroni
procedure for multiple tests. Paper presented at the
mate and the unknown population parameter of
Joint Statistical Meetings, Denver, CO. interest. As a means of better understanding the
Shaffer, J. P. (1986). Modified sequentially rejective origins of this belief, Peter Hall suggested a valuable
multiple test procedures. Journal of the American visual: a nested Russian doll. According to Hall’s
Statistical Association, 81(395), 826–831. thought experiment, a researcher is interested in
determining the number of freckles present on the practice, bootstrap samples are obtained by
outermost doll. However, the researcher is not able a Monte Carlo procedure to draw (with replace-
to directly observe the outermost doll and instead ment) multiple random samples of size n from the
can only directly observe the inner dolls, all of initial sample data set, calculating the parameter of
which resemble the outer doll, but because of their interest for the sample drawn, say θb, and repeating
successively smaller size, each possesses succes- the process k times. Hence, the bootstrap technique
sively fewer freckles. The question facing the allows researchers to generate an estimated sam-
researcher then is how to best use information pling distribution in cases in which they have
from the observable inner dolls to draw conclu- access to only a single sample rather than the entire
sions about the likely number of freckles present population. A minimum value for k is typically
on the outermost doll. To see how this works, assumed to be 100 and can be as many as 10,000,
assume for simplicity that the Russian doll set con- depending on the application.
sists of three parts, the outermost doll and two Peter Bickel and David Freedman defined the
inner dolls. In this case, the outermost doll can be following three necessary conditions if the boot-
thought of as the population, which is assumed to strap is to provide consistent estimates of the
possess n0 freckles; the second doll can be thought asymptotic distribution of a parameter: (1) The
of as the original sample, which is assumed to pos- statistic being bootstrapped must converge weakly
sess n1 freckles; and the third doll can be thought to an asymptotic distribution whenever the data-
of as the bootstrap sample, which is assumed to generating distribution is in a neighborhood of the
possess n2 freckles. A first guess in this situation truth, or in other words, the convergence still
might be to use the observed number of freckles on occurs if the truth is allowed to change within the
the second doll as the best estimate of the likely neighborhood as the sample size grows. (2) The
number of freckles on the outermost doll. Such an convergence to the asymptotic distribution must
estimator will necessarily be biased, however, be uniform in that neighborhood. (3) The asymp-
because the second doll is smaller than the outer- totic distribution must depend on the data-generat-
most doll and necessarily possesses a smaller num- ing process in a continuous way. If all three
ber of freckles. In other words, employing n1 as an conditions hold, then the bootstrap should provide
estimate of n0 necessarily underestimates the true reliable estimates in many different applications.
number of freckles on the outermost doll. This is As a concrete example, assume that we wish to
where the bootstrapped estimator, n2 , reveals its obtain the standard error of the median value for
true value. Because the third doll is smaller than a sample of 30 incomes. The researcher needs to
the second doll by an amount similar to that by create 100 bootstrap samples because this is the
which the second doll is smaller than the outermost generally agreed on number of replications needed
doll, the ratio of the number of freckles on the two to compute a standard error. The easiest way to
inner dolls, n1 : n2 , should be a close approxima- sample with replacement is to take the one data
tion of the ratio of the number of freckles on the set and copy it 500 times for 100 bootstrap sam-
second doll to number on the outer doll, n0 : n1 . ples in order to guarantee that each observation
This in a nutshell is the principle underlying the has an equal likelihood of being chosen in each
bootstrap procedure. bootstrap sample. The researcher then assigns ran-
More formally, the nonparametric bootstrap dom numbers to each of the 15,000 observations
derives from an empirical distribution function,F, ^ ð500  30Þ and sorts each observation by its ran-
which is a random sample of size n from a probabil- dom number assignment from lowest to highest.
ity distribution F: The estimator, θ,^ of the popula- The next step is to make 100 bootstrap samples of
tion parameter θ is defined as some function of the 30 observations each and disregard the other
random sample ðX1 , X2 , . . . Xn Þ. The objective of 12,000 observations. After the 100 bootstrap sam-
the bootstrap is to assess the accuracy of the esti- ples have been made, the median is calculated
mator, θ.^ The bootstrap principle described above from each of the samples, and the bootstrap esti-
states that the relationship between θ^ and θ should mate of the standard error is just the standard
be mimicked by that between θb and θ, ^ where θb is deviation of the 100 bootstrapped medians.
the bootstrap estimator from bootstrap samples. In Although this procedure may seem complicated, it
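The resampling plan for the median example can be written in a few lines; the sketch below uses invented income data and NumPy rather than the copy-and-sort bookkeeping described above, but it follows the same logic: draw bootstrap samples of the same size with replacement, compute the median of each, and take the standard deviation of the bootstrapped medians (a percentile confidence interval is included as well).

```python
import numpy as np

rng = np.random.default_rng(5)
incomes = rng.lognormal(mean=10.5, sigma=0.6, size=30)   # hypothetical sample of 30 incomes

def bootstrap_medians(data, k, rng):
    """Draw k bootstrap samples (with replacement, same n) and return their medians."""
    n = len(data)
    return np.array([np.median(rng.choice(data, size=n, replace=True)) for _ in range(k)])

boot = bootstrap_medians(incomes, k=1000, rng=rng)

se_median = boot.std(ddof=1)              # bootstrap standard error of the sample median
ci_90 = np.percentile(boot, [5, 95])      # percentile-method 90% confidence interval
print(np.median(incomes), se_median, ci_90)
```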
is actually relatively easy to write a bootstrapping the 95th percentile is the higher critical value.
program with the use of almost any modern statis- Thus the bootstrapped t interval is
tical program, and in fact, many statistical pro-
grams include a bootstrap command. ½θ^  T:95
boot ^ θ^  T boot s:e:ðθÞ:
s:e:ðθÞ, ^
:05
Besides generating standard error estimates, the
bootstrap is commonly used to directly estimate
confidence intervals in cases in which they would Michael Chernick has pointed out that the biggest
otherwise be difficult to produce. Although a num- drawback of this method is that it is not always
ber of different bootstrapping approaches exist for obvious how to compute the standard errors,
computing confidence intervals, the following dis- ^
Sboot and s:e:ðθÞ.
cussion focuses on two of the most popular. The
first, called the percentile method, is straightfor-
ward and easy to implement. For illustration pur-
Other Applications
poses, assume that the researcher wishes to obtain
a 90% confidence interval. To do so, the In addition to calculating such measures of preci-
researcher would (a) start by obtaining 1,000 sion, the bootstrap procedure has gained favor for
bootstrap samples and the resulting 1,000 boot- a number of other applications. For one, the boot-
strap estimates, θ^b , and (b) order the 1,000 strap is now popular as a method for performing
observed estimates from the smallest to the largest. bias reduction. Bias reduction can be explained as
The 90% confidence interval would then consist follows. The bias of an estimator is the difference
of the specific value bootstrap estimates falling at between the expected value of an estimator, E(θ), ^
the 5th and the 95th percentiles of the sorted dis- ^
and the true value of the parameter, θ, or E(θ  θÞ.
tribution. This method typically works well for If an estimator is biased, then this value is non-
large sample sizes because the bootstrap mimics zero, and the estimator is wrong on average. In the
the sampling distribution, but it does not work case of such a biased estimator, the bootstrap prin-
well for small sample size. If the number of obser- ciple is employed such that the bias is estimated by
vations in the sample is small, Bradley Efron and taking the average of the difference between the
Robert Tibshirani have suggested using a bias cor- bootstrap estimate, θ^b , and the estimate from the
rection factor. initial sample, θ^ over the k different bootstrap esti-
The second approach, called the bootstrap t con- mates. Efron defined the bias of the bootstrap as
fidence interval, is more complicated than the per- E(θ^  θ^b ) and suggested reducing the bias of the
centile method, but it is also more accurate. To original estimator, θ,^ by adding estimated bias.
understand this method, it is useful to review a stan- This technique produces an estimator that is close
dard confidence interval, which is defined as to unbiased.
^ θ^ þ tα=2,df s:e:ðθÞ,
½θ^  tα=2,df s:e:ðθÞ, ^ where θ^ is the Recently, the bootstrap has also become
estimate, tα=2,df is the critical value from the t-table popular in different types of regression analysis,
with df degrees of freedom for a ð1  αÞ confidence including linear regression, nonlinear regression,
^ is the standard error of the esti- time-series analysis, and forecasting. With linear
interval, and s:e:ðθÞ
regression, the researcher can either fit the resi-
mate. The idea behind the bootstrap t interval is
duals from the fitted model, or the vector of the
that the critical value is found through bootstrap-
dependent and independent variables can be
ping instead of simply reading the value contained
bootstrapped. If the error terms are not normal
in a published table. Specifically, the bootstrap t is
and the sample size is small, then the researcher
defined as T boot ¼ ðθ^boot  θÞ=S ^ boot , where θ^boot is
is able to obtain bootstrapped confidence inter-
the estimate of θ from a bootstrap sample and Sboot vals, like the one described above, instead of
is an estimate of the standard deviation of θ from relying on asymptotic theory that likely does not
the bootstrap sample. The k values of T boot are apply. In nonlinear regression analysis, the boot-
then ordered from lowest to highest, and then, for strap is a very useful tool because there is no
a 90% confidence interval, the value at the 5th per- need to differentiate and an analytic expression
centile is the lower critical value and the value at is not necessary.
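Returning to the bootstrap t interval described earlier, the following sketch computes T_boot for the sample mean (a case in which S_boot is easy to obtain); the data are simulated, and the 90% level matches the example in the text.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(50, 10, 40)          # hypothetical sample; parameter of interest: the mean

theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(len(x))

# Studentized bootstrap: T_boot = (theta_boot - theta_hat) / S_boot for each resample
t_boot = []
for _ in range(2000):
    xb = rng.choice(x, size=len(x), replace=True)
    s_boot = xb.std(ddof=1) / np.sqrt(len(xb))
    t_boot.append((xb.mean() - theta_hat) / s_boot)
t_lo, t_hi = np.percentile(t_boot, [5, 95])

# 90% bootstrap t interval: [theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat]
print(theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat)
```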
Limitations Andrews, D. W. K. (1999). Estimation when a parameter


is on a boundary. Econometrica, 67(6), 1341–1383.
Although the above discussion has highlighted that Andrews, D. W. K. (2003). A three-step method for
the bootstrap technique is potentially valuable in choosing the number of bootstrap repetitions.
a number of situations, it should be noted that it is Econometrica, 68(1), 23–51.
not the ideal solution to every statistical problem. Arthreya, K. B. (1985). Bootstrap of the mean in the
One problem would occur in cases in which para- infinite variance case. Annals of Statistics, 15,
meters are constrained to be on a boundary of the 724–731.
Bickel, P. J., & Freeman, D. (1981). Some asymptotic
parameter space (such as when a priori theoretical
theory for the bootstrap. Annals of Statistics, 9,
restrictions require a certain estimated parameter 1196–1217.
to be of a specific sign). Common examples of Chernick, M. R. (1999). Bootstrap methods: A
such restrictions include traditional demand analy- practitioner’s guide. New York: Wiley.
sis in which the income effect for a normal good is Efron, B. (1979). Bootstrap methods: Another look at the
constrained to be positive and the own-price effect jackknife. Annals of Statistics, 7, 1–26.
is constrained to be negative, cost function analysis Efron, B. (1983). Estimating the error rate of a
in which curvature constraints imply that second- perdiction rule: Improvements on cross-validation.
order price terms satisfy concavity conditions, and Journal of the American Statistical Association, 78,
time-series models for conditional heteroskedasti- 316–331.
Efron, B. (1987). Better bootstrap confidence intervals,
city in which the same parameters are constrained
Journal of the American Statistical Association, 82,
to be nonnegative. Such cases are potentially 171–200.
problematic for the researcher because standard Efron, B., & Tibshirani, R. (1986). An introduction to
error estimates and confidence bounds are difficult the bootstrap. New York: Chapman & Hall.
to compute using classical statistical inference, and Hall, P. (1992). The bootstrap and Edgeworth expansion.
therefore the bootstrap would be a natural choice. New York: Springer-Verlag.
Don Andrews has demonstrated, however, that
this procedure is not asymptotically correct to the
first order when parameters are on a boundary.
This is because the bootstrap puts too much mass BOX-AND-WHISKER PLOT
below the cutoff point for the parameter and
therefore does a poor job of mimicking the true A box-and-whisker plot, or box plot, is a tool used
population distribution. Other circumstances in to visually display the range, distribution symme-
which the bootstrap fails include an extremely try, and central tendency of a distribution in order
small sample size, its use with matching estimators to illustrate the variability and the concentration
to evaluate programs, and distributions with long of values within a distribution. The box plot is
tails. a graphical representation of the five-number sum-
Christiana Hilmer mary, or a quick way of summarizing the center
and dispersion of data for a variable. The five-
See also Bias; Confidence Intervals; Distribution; number summary includes the minimum value, 1st
Jackknife; Central Tendency, Measures of; Median; (lower) quartile (Q1), median, 3rd (upper) quartile
Random Sampling; Sampling; Standard Deviation; (Q3), and the maximum value. Outliers are also
Standard Error of Estimate; Statistic; Student’s t indicated on a box plot. Box plots are especially
Test; Unbiased Estimator; Variability, useful in research methodology and data analysis
Measure of as one of the many ways to visually represent data.
From this visual representation, researchers glean
several pieces of information that may aid in draw-
Further Readings ing conclusions, exploring unexpected patterns in
Abadie, A., & Imbens, G. W. (2006). On the failure of the data, or prompting the researcher to develop
the bootstrap for matching estimators (NBER Working future research questions and hypotheses. This
Paper 0325). Cambridge, MA: National Bureau of entry provides an overview of the history of the
Economic Research. box plot, key components and construction of the
box plot, and a discussion of the appropriate uses contains an even number of values, the median
of a box plot. represents an average of the two middle values.
To create the rectangle (or box) associated
with a box plot, one must determine the 1st and
History
3rd quartiles, which represent values (along with
A box plot is one example of a graphical technique the median) that divide all the values into four
used within exploratory data analysis (EDA). EDA sections, each including approximately 25% of
is a statistical method used to explore and under- the values. The 1st (lower) quartile (Q1) repre-
stand data from several angles in social science sents a value that divides the lower 50% of the
research. EDA grew out of work by John Tukey values (those below the median) into two equal
and his associates in the 1960s and was developed sections, and the 3rd (upper) quartile (Q3) repre-
to broadly understand the data, graphically repre- sents a value that divides the upper 50% of the
sent data, generate hypotheses and build models to values (those above the median) into two equal
guide research, add robust measures to an analysis, sections. As with calculating the median, quar-
and aid the researcher in finding the most appro- tiles may represent the average of two values
priate method for analysis. EDA is especially when the number of values below and above the
helpful when the researcher is interested in identi- median is even. The rectangle of a box plot is
fying any unexpected or misleading patterns in the drawn such that it extends from the 1st quartile
data. Although there are many forms of EDA, through the 3rd quartile and thereby represents
researchers must employ the most appropriate the interquartile range (IQR; the distance
form given the specific procedure’s purpose between the 1st and 3rd quartiles). The rectangle
and use. includes the median.
In order to draw the ‘‘whiskers’’ (i.e., lines
extending from the box), one must identify
Definition and Construction
fences, or values that represent minimum and
One of the first steps in any statistical analysis is maximum values that would not be considered
to describe the central tendency and the variabil- outliers. Typically, fences are calculated to be
ity of the values for each variable included in the Q  1:5 IQR (lower fence) and Q3 þ 1:5 IQR
analysis. The researcher seeks to understand the (upper fence). Whiskers are lines drawn by con-
center of the distribution of values for a given necting the most extreme values that fall within
variable (central tendency) and how the rest of the fence to the lines representing Q1 and Q3.
the values fall in relation to the center (variabil- Any value that is greater than the upper fence or
ity). Box plots are used to visually display vari- lower than the lower fence is considered an out-
able distributions through the display of robust lier and is displayed as a special symbol beyond
statistics, or statistics that are more resistant to the whiskers. Outliers that extend beyond the
the presence of outliers in the data set. Although fences are typically considered mild outliers on
there are somewhat different ways to construct the box plot. An extreme outlier (i.e., one that is
box plots depending on the way in which the located beyond 3 times the length of the IQR
researcher wants to display outliers, a box plot from the 1st quartile (if a low outlier) or 3rd
always provides a visual display of the five-num- quartile (if a high outlier) may be indicated by
ber summary. The median is defined as the value a different symbol. Figure 1 provides an illustra-
that falls in the middle after the values for the tion of a box plot.
selected variable are ordered from lowest to Box plots can be created in either a vertical or
highest value, and it is represented as a line in a horizontal direction. (In this entry, a vertical box
the middle of the rectangle within a box plot. As plot is generally assumed for consistency.) They
it is the central value, 50% of the data lie above can often be very helpful when one is attempting
the median and 50% lie below the median. to compare the distributions of two or more data
When the distribution contains an odd number sets or variables on the same scale, in which case
of values, the median represents an actual they can be constructed side by side to facilitate
value in the distribution. When the distribution comparison.
25.00
15

20.00

15.00

10.00

5.00 IQR

0.00

Data set

Data set values Defining features of this box plot


2.0 Median = 6.0
2.0 First (Lower) Quartile = 3.0
2.0 Third (Upper) Quartile = 8.0
3.0 Interquartile Range (IQR) = 5.0
3.0 Lower Inner Fence = -4.5
5.0 Upper Inner Fence = 15.5
6.0 Range = 20.0
6.0 Mild Outlier = 22.0
7.0
7.0
8.0
8.0
9.0
10.0
22.0

Figure 1   Box Plot Created With a Data Set and SPSS (an IBM company, formerly called PASW® Statistics)
Notes: Data set values: 2.0, 2.0, 2.0, 3.0, 3.0, 5.0, 6.0, 6.0, 7.0, 7.0, 8.0, 8.0, 9.0, 10.0, 22.0. Defining features of this box plot: Median = 6.0; First (lower) quartile = 3.0; Third (upper) quartile = 8.0; Interquartile range (IQR) = 5.0; Lower inner fence = −4.5; Upper inner fence = 15.5; Range = 20.0; Mild outlier = 22.0.
Steps to Creating a Box Plot

The following six steps are used to create a vertical box plot:

1. Order the values within the data set from smallest to largest and calculate the median, lower quartile (Q1), upper quartile (Q3), and minimum and maximum values.

2. Calculate the IQR.

3. Determine the lower and upper fences.

4. Using a number line or graph, draw a box to mark the location of the 1st and 3rd quartiles. Draw a line across the box to mark the median.

5. Make a short horizontal line below and above the box to locate the minimum and maximum values that fall within the lower and upper fences. Draw a line connecting each short horizontal line to the box. These are the box plot whiskers.

6. Mark each outlier with an asterisk or an "o."
Making Inferences

R. Lyman Ott and Michael Longnecker described five inferences that one can make from a box plot. First, the researcher can easily identify the median of the data by locating the line drawn in the middle of the box. Second, the researcher can easily identify the variability of the data by looking at the length of the box. Longer boxes illustrate greater variability, whereas shorter box lengths illustrate a tighter distribution of the data around the median. Third, the researcher can easily examine the symmetry of the middle 50% of the data distribution by looking at where the median line falls in the box. If the median is in the middle of the box, then the data are evenly distributed on either side of the median, and the distribution can be considered symmetrical. Fourth, the researcher can easily identify outliers in the data by the asterisks outside the whiskers. Fifth, the researcher can easily identify the skewness of the distribution. On a distribution curve, data skewed to the right show more of the data to the left with a long "tail" trailing to the right. The opposite is shown when the data are skewed to the left. To identify skewness on a box plot, the researcher looks at the length of each half of the box plot. If the lower or left half of the box plot appears longer than the upper or right half, then the data are skewed in the lower direction or skewed to the left. If the upper half of the box plot appears longer than the lower half, then the data are skewed in the upper direction or skewed to the right. If a researcher suspects the data are skewed, it is recommended that the researcher investigate further by means of a histogram.

Variations

Over the past few decades, the availability of several statistical software packages has made EDA easier for social science researchers. However, these statistical packages may not calculate parts of a box plot in the same way, and hence some caution is warranted in their use. One study conducted by Michael Frigge, David C. Hoaglin, and Boris Iglewicz found that statistical packages calculate aspects of the box plot in different ways. In one example, the authors used three different statistical packages to create a box plot with the same distribution. Though the median looked approximately the same across the three box plots, the differences appeared in the length of the whiskers. The reason for the differences was the way the statistical packages used the interquartile range to calculate the whiskers. In general, to calculate the whiskers, one multiplies the interquartile range by a constant and then adds the result to Q3 and subtracts it from Q1. Each package used a different constant, ranging from 1.0 to 3.0. Though packages typically allow the user to adjust the constant, a package typically sets a default, which may not be the same as another package's default. This issue, identified by Frigge and colleagues, is important to consider because it guides the identification of outliers in the data. In addition, such variations in calculation lead to the lack of a standardized process and possibly to consumer confusion. Therefore, the authors provided three suggestions to guide the researcher in using statistical packages to create box plots. First, they suggested using a constant of 1.5 when the number of observations is between 5 and 20. Second, they suggested using a constant of 2.0 for outlier detection and rejection. Finally, they suggested using a constant of 3.0 for extreme cases. In the absence of standardization across statistical packages, researchers should understand how a package calculates whiskers and follow the suggested constant values.
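The effect of the whisker constant is easy to demonstrate directly. The short sketch below (Python with Matplotlib, assumed here purely for illustration; Matplotlib exposes the constant as the whis argument of boxplot) draws the same data with constants of 1.5 and 3.0, which changes which observations are flagged as outliers.

```python
import matplotlib.pyplot as plt

data = [2, 2, 2, 3, 3, 5, 6, 6, 7, 7, 8, 8, 9, 10, 22.0]

fig, axes = plt.subplots(1, 2, sharey=True)
for ax, constant in zip(axes, (1.5, 3.0)):
    # whis multiplies the IQR to set the fences; points beyond them are
    # plotted individually as outliers.
    ax.boxplot(data, whis=constant)
    ax.set_title(f"constant = {constant}")
plt.show()
```

With a constant of 1.5, the value 22.0 lies beyond the upper fence (8 + 1.5 × 5 = 15.5) and is flagged; with a constant of 3.0, the fence moves to 8 + 3 × 5 = 23, so the same value is absorbed into the whisker.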
Applications

As with all forms of data analysis, there are many advantages and disadvantages, appropriate uses, and certain precautions researchers should consider
when using a box plot to display distributions. Box plots provide a good visualization of the range and potential skewness of the data. A box plot may provide the first step in exploring unexpected patterns in the distribution because box plots provide a good indication of how the data are distributed around the median. Box plots also clearly mark the location of mild and extreme outliers in the distribution. Other forms of graphical representation that graph individual values, such as dot plots, may not make this clear distinction. When used appropriately, box plots are useful in comparing more than one sample distribution side by side. In other forms of data analysis, a researcher may choose to compare data sets using a t test to compare means or an F test to compare variances. However, these methods are more vulnerable to skewness in the presence of extreme values. These methods must also meet normality and equal variance assumptions. Alternatively, box plots can compare the differences between variable distributions without the need to meet certain statistical assumptions.

However, unlike other forms of EDA, box plots show less detail than a researcher may need. For one, box plots may display only the five-number summary. They do not provide frequency measures or the quantitative measure of variance and standard deviation. Second, box plots are not used in a way that allows the researcher to compare the data with a normal distribution, which stem plots and histograms do allow. Finally, box plots would not be appropriate to use with a small sample size because of the difficulty in detecting outliers and finding patterns in the distribution.

Besides taking into account the advantages and disadvantages of using a box plot, one should consider a few precautions. In a 1990 study conducted by John T. Behrens and colleagues, participants frequently made judgment errors in determining the length of the box or whiskers of a box plot. In part of the study, participants were asked to judge the length of the box by using the whisker as a judgment standard. When the whisker length was longer than the box length, the participants tended to overestimate the length of the box. When the whisker length was shorter than the box length, the participants tended to underestimate the length of the box. The same result was found when the participants judged the length of the whisker by using the box length as a judgment standard. The study also found that compared with vertical box plots, box plots positioned horizontally were associated with fewer judgment errors.

Sara C. Lewandowski and Sara E. Bolt

See also Exploratory Data Analysis; Histogram; Outlier

Further Readings

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131–160.
Behrens, J. T., Stock, W. A., & Sedgwick, C. E. (1990). Judgment errors in elementary box-plot displays. Communications in Statistics B: Simulation and Computation, 19, 245–262.
Frigge, M., Hoaglin, D. C., & Iglewicz, B. (1989). Some implementations of the boxplot. American Statistician, 43, 50–54.
Massart, D. L., Smeyers-Verbeke, J., Capron, X., & Schlesier, K. (2005). Visual presentation of data by means of box plots. LCGC Europe, 18, 215–218.
Moore, D. S. (2001). Statistics: Concepts and controversies (5th ed.). New York: W. H. Freeman.
Moore, D. S., & McCabe, P. G. (1998). Introduction to the practice of statistics (3rd ed.). New York: W. H. Freeman.
Ott, R. L., & Longnecker, M. (2001). An introduction to statistical methods and data analysis (5th ed.). Pacific Grove, CA: Wadsworth.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

b PARAMETER

The b parameter is an item response theory (IRT)–based index of item difficulty. As IRT models have become an increasingly common way of modeling item response data, the b parameter has become a popular way of characterizing the difficulty of an individual item, as well as comparing the relative difficulty levels of different items. This entry addresses the b parameter with regard to different IRT models. Further, it discusses interpreting, estimating, and studying the b parameter.
[Figure 1 shows three panels of item characteristic curves, one each for the Rasch (1PL) model, the 2PL model, and the 3PL model. Each panel plots the probability of a correct response (0.0 to 1.0) against theta (−3 to 3) for three example items: 1PL, b = −1.5, 0.0, 1.0; 2PL, (a, b) = (0.5, −1.5), (1.8, 0.0), (1.0, 1.0); 3PL, (a, b, c) = (0.5, −1.5, .1), (1.8, 0.0, .3), (1.0, 1.0, .2).]

Figure 1   Item Characteristic Curves for Example Items, One-Parameter Logistic (1PL or Rasch), Two-Parameter Logistic (2PL), and Three-Parameter Logistic (3PL) Models
b Parameter Within Different Item Response Theory Models

The precise interpretation of the b parameter is dependent on the specific IRT model within which it is considered, the most common being the one-parameter logistic (1PL) or Rasch model, the two-parameter logistic (2PL) model, and the three-parameter logistic (3PL) model. Under the 1PL model, the b parameter is the single item feature by which items are distinguished in characterizing the likelihood of a correct response. Specifically, the probability of correct response (Xij = 1) by examinee i to item j is given by

P(Xij = 1) = exp(θi − bj) / [1 + exp(θi − bj)],

where θi represents an ability-level (or trait-level) parameter of the examinee. An interpretation of the b parameter follows from its being attached to the same metric as that assigned to θ. Usually this metric is continuous and unbounded; the indeterminacy of the metric is often handled by assigning either the mean of θ (across examinees) or the mean of b (across items) to 0. Commonly b parameters will assume values between −3 and 3, with more extreme positive values representing more difficult (or infrequently endorsed) items and more extreme negative values representing easy (or frequently endorsed) items.

The 2PL and 3PL models include additional item parameters that interact with the b parameter in determining the probability of correct response. The 2PL model adds an item discrimination parameter (aj), so the probability of correct response is

P(Xij = 1) = exp[aj(θi − bj)] / {1 + exp[aj(θi − bj)]},

and the 3PL model adds a lower asymptote ("pseudoguessing") parameter (cj), resulting in

P(Xij = 1) = cj + (1 − cj) exp[aj(θi − bj)] / {1 + exp[aj(θi − bj)]}.

While the same general interpretation of the b parameter as a difficulty parameter still applies under the 2PL and 3PL models, the discrimination and lower asymptote parameters also contribute to the likelihood of a correct response at a given ability level.
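Because the three models differ only in which item parameters are free, their item response functions can be written as one expression with defaults of a = 1 and c = 0. The following sketch (Python with NumPy and Matplotlib, assumed here only for illustration, with an ad hoc helper named p_correct) evaluates that expression for the example items plotted in Figure 1; it is not drawn from any particular IRT package.

```python
import numpy as np
import matplotlib.pyplot as plt

def p_correct(theta, b, a=1.0, c=0.0):
    """3PL item response function; a = 1 and c = 0 give the 2PL/1PL special cases."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 200)

# Example items from Figure 1: (a, b, c) for each curve in each panel.
items = {
    "1PL": [(1.0, -1.5, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    "2PL": [(0.5, -1.5, 0.0), (1.8, 0.0, 0.0), (1.0, 1.0, 0.0)],
    "3PL": [(0.5, -1.5, 0.1), (1.8, 0.0, 0.3), (1.0, 1.0, 0.2)],
}

fig, axes = plt.subplots(1, 3, sharey=True)
for ax, (model, params) in zip(axes, items.items()):
    for a, b, c in params:
        ax.plot(theta, p_correct(theta, b, a=a, c=c), label=f"b = {b}")
    ax.set_title(model)
    ax.set_xlabel("Theta")
axes[0].set_ylabel("Probability")
plt.show()
```

Note that at θ = b the expression reduces to c + (1 − c)/2, which is the .50 probability point discussed below whenever c = 0.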
Interpretation of the b Parameter

Figure 1 provides an illustration of the b parameter with respect to the 1PL, 2PL, and 3PL models. In
this figure, item characteristic curves (ICCs) for three example items are shown with respect to each model. Each curve represents the probability of a correct response as a function of the latent ability level of the examinee. Across all three models, it can be generally seen that as the b parameter increases, the ICC tends to decrease, implying a lower probability of correct response.

In the 1PL and 2PL models, the b parameter has the interpretation of representing the level of the ability or trait at which the respondent has a .50 probability of answering correctly (endorsing the item). For each of the models, the b parameter also identifies the ability level that corresponds to the inflection point of the ICC, and thus the b parameter can be viewed as determining the ability level at which the item is maximally informative. Consequently, the b parameter is a critical element in determining where along the ability continuum an item provides its most effective estimation of ability, and thus the parameter has a strong influence on how items are selected when administered adaptively, such as in a computerized adaptive testing environment.

Under the 1PL model, the b parameter effectively orders all items from easiest to hardest, and this ordering is the same regardless of the examinee ability or trait level. This property is no longer present in the 2PL and 3PL models, as the ICCs of items may cross, implying a different ordering of item difficulties at different ability levels. This property can also be seen in the example items in Figure 1 in which the ICCs cross for the 2PL and 3PL models, but not for the 1PL model. Consequently, while the b parameter remains the key factor in influencing the difficulty of the item, it is not the sole determinant.

An appealing aspect of the b parameter for all IRT models is that its interpretation is invariant with respect to examinee ability or trait level. That is, its value provides a consistent indicator of item difficulty whether considered for a population of high, medium, or low ability. This property is not present in more classical measures of item difficulty (e.g., "proportion correct"), which are influenced not only by the difficulty of the item, but also by the distribution of ability in the population in which they are administered. This invariance property allows the b parameter to play a fundamental role in how important measurement applications, such as item bias (differential item functioning), test equating, and appropriateness measurement, are conducted and evaluated in an IRT framework.

Estimating the b Parameter

The b parameter is often characterized as a structural parameter within an IRT model and as such will generally be estimated in the process of fitting an IRT model to item response data. Various estimation strategies have been proposed and investigated, some being more appropriate for certain model types. Under the 1PL model, conditional maximum likelihood procedures are common. For all three model types, marginal maximum likelihood, joint maximum likelihood, and Bayesian estimation procedures have been developed and are also commonly used.

Studying the b Parameter

The b parameter can also be the focus of further analysis. Models such as the linear logistic test model and its variants attempt to relate the b parameter to task components within an item that account for its difficulty. Such models also provide a way in which the b parameter's estimates of items can ultimately be used to validate a test instrument. When the b parameter assumes the value expected given an item's known task components, the parameter provides evidence that the item is functioning as intended by the item writer.

Daniel Bolt

See also Differential Item Functioning; Item Analysis; Item Response Theory; Parameters; Validity of Measurement

Further Readings

De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
C
CANONICAL CORRELATION ANALYSIS

Canonical correlation analysis (CCA) is a multivariate statistical method that analyzes the relationship between two sets of variables, in which each set contains at least two variables. It is the most general type of the general linear model, with multiple regression, multiple analysis of variance, analysis of variance, and discriminant function analysis all being special cases of CCA.

Although the method has been available for more than 70 years, its use has been somewhat limited until fairly recently due to its lack of inclusion in common statistical programs and its rather labor-intensive calculations. Currently, however, many computer programs do include CCA, and thus the method has become somewhat more widely used.

This entry begins by explaining the basic logic of and defining important terms associated with CCA. Next, this entry discusses the interpretation of CCA results, statistical assumptions, and limitations of CCA. Last, it provides an example from the literature.

Basic Logic

The logic of CCA is fairly straightforward and can be explained best by likening it to a "multiple-multiple regression." That is, in multiple regression a researcher is interested in discovering the variables (among a set of variables) that best predict a single variable. The set of variables may be termed the independent, or predictor, variables; the single variable may be considered the dependent, or criterion, variable. CCA is similar, except that there are multiple dependent variables, as well as multiple independent variables. The goal is to discover the pattern of variables (on both sides of the equation) that combine to produce the highest predictive values for both sets. The resulting combination of variables for each side, then, may be thought of as a kind of latent or underlying variable that describes the relation between the two sets of variables.

A simple example from the literature illustrates its use: A researcher is interested in investigating the relationships among gender, social dominance orientation, right wing authoritarianism, and three forms of prejudice (stereotyping, opposition to equality, and negative affect). Gender, social dominance orientation, and right wing authoritarianism constitute the predictor set; the three forms of prejudice are the criterion set. Rather than computing three separate multiple regression analyses (viz., the three predictor variables regressing onto one criterion variable, one at a time), the researcher instead computes a CCA on the two sets of variables to discern the most important predictor(s) of the three forms of prejudice. In this example, the CCA revealed that social dominance orientation emerged as the overall most important dimension that underlies all three forms of prejudice.
Important Terms

To appreciate the various terms associated with CCA, it is necessary to have a basic understanding of the analytic procedure itself. The first step in CCA involves collapsing each person's score for each variable, in the two variable sets, into a single composite, or "synthetic," variable. These synthetic variables are created such that the correlation between the two sets is maximal. This occurs by weighting each person's score and then summing the weighted scores for the respective variable sets. Pairs of linear synthetic variables created by this maximization process are called canonical variates. The bivariate correlation between the pairs of variates is the canonical correlation (sometimes called the canonical function). There will be two canonical variates produced for each canonical correlation, with one variate representing the predictor variables and the other representing the criterion variables. The total number of canonical variate pairs produced is equal to the number of variables in either the criterion or predictor set, whichever is smaller. Finally, squaring the canonical correlation coefficient yields the proportion of variance the pairs of canonical variates (not the original variables) linearly share.
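Because each canonical variate is just a weighted sum of the observed variables, the quantities defined above can be computed with general-purpose software. The sketch below (Python, using NumPy and scikit-learn's CCA class, neither of which is implied by this entry) extracts the variate pairs for two small variable sets and then derives the canonical correlations and structure coefficients directly from their definitions; the data and variable sets are invented for the illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                    # predictor set (3 variables)
Y = 0.5 * X[:, :2] + rng.normal(size=(n, 2))   # criterion set (2 variables)

# Number of variate pairs = size of the smaller variable set.
n_pairs = min(X.shape[1], Y.shape[1])
cca = CCA(n_components=n_pairs).fit(X, Y)
Xc, Yc = cca.transform(X, Y)                   # the canonical variates (synthetic variables)

for k in range(n_pairs):
    # Canonical correlation: bivariate correlation between the paired variates.
    rc = np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]
    # Structure coefficients: correlation of each original variable with its own variate.
    structure_x = [np.corrcoef(X[:, j], Xc[:, k])[0, 1] for j in range(X.shape[1])]
    structure_y = [np.corrcoef(Y[:, j], Yc[:, k])[0, 1] for j in range(Y.shape[1])]
    print(k + 1, round(rc, 3), np.round(structure_x, 2), np.round(structure_y, 2))
```

Squaring rc for each pair gives the proportion of variance the paired variates linearly share, as described above.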
Interpretation of Results

Similar to other statistical methods, the first step in CCA is determining the statistical significance of the canonical correlation coefficients associated with the variates. Different tests of significance (e.g., Wilks's lambda, Bartlett's test) are generally reported in statistical computer programs, and only those canonical correlations that differ significantly from zero are subsequently interpreted. With respect to the previously mentioned example (social dominance and right wing authoritarianism), three canonical variates were produced (because the smallest variable set contained three variables), but in that case, only the first two were significant. Canonical correlations are reported in descending order of importance, and normally only the first one or two variates are significant. Further, even if a canonical correlation is significant, most researchers do not interpret those below .30 because the amount of variance shared (that is, the correlation squared) is so small that the canonical variate is of little practical significance.

A second method of interpretation involves evaluating the degree to which the individual variables load onto their respective canonical variates. Variables that have high loadings on a particular variate essentially have more in common with it and thus should be given more weight in interpreting its meaning. This is directly analogous to factor loading in factor analysis and indicates the degree of importance of the individual variables in the overall relation between the criterion and predictor variable sets. The commonly accepted convention is to interpret only those variables that achieve loadings of .40 or higher. The term structure coefficient (or canonical factor loadings) refers to the correlation between each variable and its respective canonical variate, and its interpretation is similar to the bivariate correlation coefficient. In addition, as with correlation coefficient interpretation, both the direction and the magnitude of the structure coefficient are taken into consideration. The structure coefficient squared represents the amount of variance a given variable accounts for in its own canonical variate, based on the set of variables to which it belongs. Structure coefficients, then, represent the extent to which the individual variables and their respective canonical variates are related, and in effect, their interpretation forms the crux of CCA.
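One common large-sample approach to the significance testing mentioned above is Bartlett's chi-square approximation to Wilks's lambda. The sketch below (Python with NumPy and SciPy; the function name bartlett_test and the numerical inputs are invented for the example, and exact formulas and small-sample corrections vary across sources) should be read as an illustration rather than a definitive implementation.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_test(canonical_rs, n, p, q):
    """Sequential test that canonical correlations k+1, ..., s are all zero.

    canonical_rs: canonical correlations in descending order
    n: sample size; p, q: number of variables in the two sets
    Returns one (chi_square, df, p_value) triple per starting index k.
    """
    results = []
    s = len(canonical_rs)
    for k in range(s):
        # Wilks's lambda for the correlations not yet "removed."
        wilks = np.prod([1 - r ** 2 for r in canonical_rs[k:]])
        chi_sq = -(n - 1 - (p + q + 1) / 2) * np.log(wilks)
        df = (p - k) * (q - k)
        results.append((chi_sq, df, chi2.sf(chi_sq, df)))
    return results

# For example, three canonical correlations from a hypothetical study with
# 200 cases, three predictors, and three criterion variables.
print(bartlett_test([0.69, 0.45, 0.12], n=200, p=3, q=3))
```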
Statistical Assumptions

Reliable use of CCA requires multivariate normality, and the analysis itself proceeds on the assumption that all variables and linear combinations of variables approximate the normal distribution. However, with large sample sizes, CCA can be robust to this violation.

The technique also requires minimal measurement error (e.g., alpha coefficients of .80 or above) because low scale reliabilities attenuate correlation coefficients, which, ultimately, reduces the probability of detecting a significant canonical correlation. In addition, it is very useful to have an adequate range of scores for each variable because correlation coefficients can also be attenuated by truncated or restricted variance; thus, as with other correlation methods, sampling methods that increase score variability are highly recommended.
Outliers (that is, data points that are well "outside" a particular distribution of scores) can also significantly attenuate correlation, and their occurrence should be minimized or, if possible, eliminated.

For conventional CCA, linear relationships among variables are required, although CCA algorithms for nonlinear relationships are currently available. And, as with other correlation-based analyses, low multicollinearity among variables is assumed. Multicollinearity occurs when variables in a correlation matrix are highly correlated with each other, a condition that reflects too much redundancy among the variables. As with measurement error, high multicollinearity also reduces the magnitudes of correlation coefficients.

Stable (i.e., reliable) canonical correlations are more likely to be obtained if the sample size is large. It is generally recommended that adequate samples should range from 10 to 20 cases per variable. So, for example, if there are 10 variables, one should strive for a minimum of 100 cases (or participants). In general, the smaller the sample size, the more unstable the CCA.

Limitation

The main limitation associated with CCA is that it is often difficult to interpret the meaning of the resulting canonical variates. As others have noted, a mathematical procedure that maximizes correlations may not necessarily yield a solution that is maximally interpretable. This is a serious limitation and may be the most important reason CCA is not used more often. Moreover, given that it is a descriptive technique, the problems with interpreting the meaning of the canonical correlation and associated variates are particularly troublesome. For example, suppose a medical sociologist found that low unemployment level, high educational level, and high crime rate (the predictor variables) are associated with good medical outcomes, low medical compliance, and high medical expenses (the criterion variables). What might this pattern mean? Perhaps people who are employed, educated, and live in high crime areas have expensive, but good, health care, although they do not comply with doctors' orders. However, compared with other multivariate techniques, with CCA there appears to be greater difficulty in extracting the meaning (i.e., the latent variable) from the obtained results.

Example in the Literature

In the following example, drawn from the adolescent psychopathology literature, a CCA was performed on two sets of variables, one (the predictor set) consisting of personality pattern scales of the Millon Adolescent Clinical Inventory, the other (the criterion set) consisting of various mental disorder scales from the Adolescent Psychopathology Scale. The goal of the study was to discover whether the two sets of variables were significantly related (i.e., would there exist a multivariate relationship?), and if so, how might these two sets be related (i.e., how might certain personality styles be related to certain types of mental disorders?).

The first step in interpretation of a CCA is to present the significant variates. In the above example, four significant canonical variates emerged; however, only the first two substantially contributed to the total amount of variance (together accounting for 86% of the total variance), and thus only those two were interpreted. Second, the structure coefficients (canonical loadings) for the two significant variates were described. Structure coefficients are presented in order of absolute magnitude and are always interpreted as a pair. By convention, only coefficients greater than .40 are interpreted. In this case, for the first canonical variate, the predictor set accounted for 48% (Rc²) of the variance in the criterion set. Further examination of the canonical loadings for this variate showed that the predictor set was represented mostly by the Conformity subscale (factor loading of .92) and was related to lower levels of mental disorder symptoms. The structure coefficients for the second canonical variate were then interpreted in a similar manner.

The last, and often most difficult, step in CCA is to interpret the overall meaning of the analysis. Somewhat similar to factor analysis interpretation, latent variables are inferred from the pattern of the structure coefficients for the variates. A possible interpretation of the above example might be that an outgoing conformist personality style is predictive of overall better mental health for adolescents at risk for psychopathology.
Summary

The purpose of canonical correlation analysis, a multivariate statistical technique, is to analyze the relationships between two sets of variables, in which each set contains at least two variables. It is appropriate for use when researchers want to know whether the two sets are related and, if so, how they are related. CCA produces canonical variates based on linear combinations of measured variables. The variates are a kind of latent variable that relates one set of variables to the other set. CCA shares many of the statistical assumptions of other correlational techniques. The major limitation of CCA lies in its interpretability; that is, although significant variates can be derived mathematically, they may not necessarily be meaningful.

Tracie L. Blumentritt

See also Bivariate Regression; Multiple Regression

Further Readings

Hotelling, H. (1935). The most predictable criterion. Journal of Educational Psychology, 26, 139–142.
Levine, M. S. (1977). Canonical analysis and factor comparison. Beverly Hills, CA: Sage.
Stevens, J. (1986). Applied multivariate statistics for the social sciences. Hillsdale, NJ: Lawrence Erlbaum.
Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.
Thompson, B. (1984). Canonical correlation analysis: Uses and interpretation. Beverly Hills, CA: Sage.

CASE-ONLY DESIGN

Analytical studies are designed to test the hypotheses created by descriptive studies and to assess the cause–effect association. These studies are able to measure the effect of a specific exposure on the occurrence of an outcome over time. Depending on the nature of the exposure, such studies may be divided into two major groups, observational and experimental, and each of these designs has its own advantages and disadvantages. In observational studies, by definition, the researcher is looking for causes, predictors, and risk factors by observing the phenomenon without doing any intervention on the subjects, whereas in experimental designs, there is an intervention on the study subjects. In practice, however, there is no clear-cut distinction between different types of study design in biomedical research. Case-only studies are genetic studies in which individuals with and without a genotype of interest are compared, with an emphasis on environmental exposure.

For evaluation of the etiologic and influencing role of some factors (such as risk factors) in the occurrence of diseases, we necessarily need to have a proper control group. Otherwise, no inferential decision can properly be made on the influencing role of risk factors. One of the study designs with control subjects is the case–control study. These studies are designed to assess the cause–effect association by comparing a group of patients with a (matched) group of control subjects in terms of influencing or etiologic factors.

Some concerns in case–control studies, including the need for a control group and appropriate selection of control subjects, the expensive cost of examining risk markers in both cases and controls (particularly in genetic studies), and the time-consuming process of such studies, have led to the development of the case-only method in studying the gene–environment interaction in human diseases. Investigators studying human malignancies have broadly used this method in recent years.

The case-only method was originally designed as a valid approach to the analysis and screening of genetic factors in the etiology of multifactorial diseases and also to assessing the gene–environment interactions in the etiology. In a case-only study, cases with and without the susceptible genotype are compared with each other in terms of the existence of the environmental exposure.

To conduct a case-only design, one applies the same epidemiological approaches to case selection rules as for any case–control study. The case-only study does not, however, have the complexity of rules for the selection of control subjects that appears in traditional case–control studies. Furthermore, the case-only design has been reported to be

more efficient, precise, and powerful compared Greenland, S. (1993). Basic problems in interaction
with a traditional case–control method. assessment. Environmental Health Perspectives,
Although the case-only design was originally 101(Suppl. 4), 59–66.
created to improve the efficiency, power, and preci- Khoury, M. J., & Flanders, W. D. (1996). Nontraditional
epidemiologic approaches in the analysis of gene-
sion of the study of the gene–environment interac-
environment interaction: Case-control studies with no
tions by examining the prevalence of a specific controls!; American Journal of Epidemiology, 144,
genotype among case subjects only, it is now used 207–213.
to investigate how some other basic characteristics Smeeth, L., Donnan, P. T., & Cook, D. G. (2006). The
that vary slightly (or never vary) over time (e.g., use of primary care databases: Case-control and case-
gender, ethnicity, marital status, social and eco- only designs. Family Practice, 23, 597–604.
nomic status) modify the effect of a time-depen-
dent exposure (e.g., air pollution, extreme
temperatures) on the outcome (e.g., myocardial
infarction, death) in a group of cases only (e.g., CASE STUDY
decedents).
To avoid misinterpretation and bias, some tech- Case study research is a versatile approach to
nical assumptions should be taken into account in research in social and behavioral sciences. Case
case-only studies. The assumption of independence studies consist of detailed inquiry into a bounded
between the susceptibility genotypes and the envi- entity or unit (or entities) in which the researcher
ronmental exposures of interest in the population either examines a relevant issue or reveals phe-
is the most important one that must be considered nomena through the process of examining the
in conducting these studies. In practice, this entity within its social and cultural context. Case
assumption may be violated by some confounding study has gained in popularity in recent years.
factors (e.g., age, ethnic groups) if both exposure However, it is difficult to define because resear-
and genotype are affected. This assumption can be chers view it alternatively as a research design, an
tested by some statistical methods. Some other approach, a method, or even an outcome. This
technical considerations also must be assumed in entry examines case study through different lenses
the application of the case-only model in various to uncover the versatility in this type of research
studies of genetic factors. More details of these approach.
assumptions and the assessment of the gene–envi-
ronment interaction in case-only studies can be
Overview
found elsewhere.
Case study researchers have conducted studies in
Saeed Dastgiri traditional disciplines such as anthropology, eco-
nomics, history, political science, psychology, and
See also Ethics in the Research Process
sociology. Case studies have also emerged in areas
such as medicine, law, nursing, business, adminis-
tration, public policy, social work, and education.
Further Readings
Case studies may be used as part of a larger study
Begg, C. B., & Zhang, Z. F. (1994). Statistical analysis of or as a stand-alone design. Case study research
molecular epidemiology studies employing case series. may be considered a method for inquiry or an
Cancer Epidemiology, Biomarkers & Prevention, 3, evaluation of a bounded entity, program, or sys-
173–175. tem. Case studies may consist of more than one
Chatterjee, N., Kalaylioglu, Z., Shih, J. H., & Gail, entity (unit, thing) or of several cases within one
M. H. (2006). Case-control and case-only designs with
entity, but care must be taken to limit the number
genotype and family history data: Estimating relative
risk, residual familial aggregation, and cumulative
of cases in order to allow for in-depth analysis and
risk. Biometrics, 62, 36–48. description of each case.
Cheng, K. F. (2007). Analysis of case-only studies Researchers who have written about case study
accounting for genotyping error. Annals of Human research have addressed it in different ways,
Genetics, 71, 238–248. depending on their perspectives and points of view.

Some researchers regard case study as a research pertains only to a particular group of students,
process used to investigate a phenomenon in its such as the study habits of sixth-grade students
real-world setting. Some have considered case who are reading above grade level. In this situa-
study to be a design—a particular logic for setting tion, the case is artificially bounded because the
up the study. Others think of case study as a quali- researcher establishes the criteria for selection of
tative approach to research that includes particular the participants of the study as it relates to the
qualitative methods. Others have depicted it in issue being studied and the research questions
terms of the final product, a written holistic exami- being asked.
nation, interpretation, and analysis of one or more Selecting case study as a research design is
entities or social units. Case study also has been appropriate for particular kinds of questions being
defined in terms of the unit of study itself, or the asked. For example, if a researcher wants to know
entity being studied. Still other researchers con- how a program works or why a program has been
sider that case study research encompasses all carried out in a particular way, case study can be
these notions taken together in relation to the of benefit. In other words, when a study is of an
research questions. exploratory or explanatory nature, case study is
well suited. In addition, case study works well for
understanding processes because the researcher is
Case Study as a Bounded System
able to get close to the participants within their
Although scholars in various fields and disci- local contexts. Case study design helps the
plines have given case study research many forms, researcher understand the complexity of a program
what these various definitions and perspectives or a policy, as well as its implementation and
have in common is the notion from Louis Smith effects on the participants.
that case study is inquiry about a bounded system.
Case studies may be conducted about such entities
Design Decisions
as a single person or several persons, a single class-
room or classrooms, a school, a program within When a researcher begins case study research,
a school, a business, an administrator, or a specific a series of decisions must be made concerning the
policy, and so on. The case study also may be rationale, the design, the purpose, and the type of
about a complex, integrated system, as long as case study. In terms of the rationale for conducting
researchers are able to put boundaries or limits the study, Robert Stake indicated three specific
around the system being researched. forms. Case study researchers may choose to learn
The notion of boundedness may be understood about a case because of their inherent interest in
in more than one way. An entity is naturally the case itself. A teacher who is interested in fol-
bounded if the participants have come together by lowing the work of a student who presents partic-
their own means for their own purposes having ular learning behaviors would conduct an intrinsic
nothing to do with the research. An example case study. In other words, the case is self-selected
would be a group of students in a particular class- due to the inquirer’s interest in the particular
room, a social club that meets on a regular basis, entity. Another example could be an evaluation
or the staff members in a department of a local researcher who may conduct an intrinsic case to
business. The entity is naturally bounded because evaluate a particular program. The program itself
it consists of participants who are together for is of interest, and the evaluator is not attempting
their own common purposes. A researcher is able to compare that program to others.
to study the entity in its entirety for a time frame On the other hand, an instrumental case is one
consistent with the research questions. that lends itself to the understanding of an issue or
In other instances, an entity may be artificially phenomenon beyond the case itself. For example,
bounded through the criteria set by a researcher. In examining the change in teacher practices due to
this instance, the boundary is suggested by select- an educational reform may lead a researcher to
ing from among participants to study an issue par- select one teacher’s classroom as a case, but with
ticular to some, but not all, the participants. For the intent of gaining a general understanding of the
example, the researcher might study an issue that reform’s effects on classrooms. In this instance, the

case selection is made for further understanding of the way the district structures its financial expendi-
a larger issue that may be instrumental in inform- tures, the design would be holistic because the
ing policy. For this purpose, one case may not be researcher would be exploring the entire school
enough to fully understand the reform issue. The district to see effects. However, if the researcher
researcher may decide to have more than one case examined the constraints placed on each regional
in order to see how the reform plays out in differ- subdivision within the district, each regional subdi-
ent classroom settings or in more than one school. vision would become a subunit of the study as an
If the intent is to study the reform in more than embedded case design. Each regional office would
one setting, then the researcher would conduct be an embedded case and the district itself the
a collective case study. Once the researcher has major case. The researcher would procedurally
made a decision whether to understand the inher- examine each subunit and then reexamine the
ent nature of one case (intrinsic) or to understand major case to see how the embedded subunits
a broader issue that may be represented by one of inform the whole case.
more cases (instrumental or collective), the
researcher needs to decide the type of case that will
illuminate the entity or issue in question.
Multiple Case Design
Multiple cases have been considered by some to
Single-Case Design
be a separate type of study, but Yin considers them
In terms of design, a researcher needs to decide a variant of single-case design and thus similar in
whether to examine a single case, multiple cases, methodological procedures. This design also may
or several cases embedded within a larger system. be called a collective case, cross-case, or compara-
In his work with case study research, Robert Yin tive case study. From the previous example, a team
suggests that one way to think about conducting of researchers might consider studying the finan-
a study of a single case is to consider it as a holistic cial decisions made by the five largest school dis-
examination of an entity that may demonstrate the tricts in the United States since the No Child Left
tenets of a theory (i.e., critical case). If the case is Behind policy went into effect. Under this design,
highly unusual and warrants in-depth explanation, the research team would study each of the largest
it would be considered an extreme or unique case. districts as an individual case, either alone or with
Another use for a single-case design would be embedded cases within each district.
a typical or representative case, in which the After conducting the analysis for each district,
researcher is highlighting an everyday situation. the team would further conduct analyses across
The caution here is for the researcher to be well the five cases to see what elements they may have
informed enough to know what is typical or com- in common. In this way, researchers would poten-
monplace. A revelatory case is one in which tially add to the understanding of the effects of
researchers are able to observe a phenomenon that policy implementation among these large school
had been previously inaccessible. One other type districts. By selecting only the five largest school
of single case is the longitudinal case, meaning one districts, the researchers have limited the findings
in which researchers can examine the same entity to cases that are relatively similar. If researchers
over time to see changes that occur. wanted to compare the implementation under
Whether to conduct a holistic design or an highly varied circumstances, they could use maxi-
embedded design depends on whether researchers mum variation selection, which entails finding
are looking at issues related globally to one entity cases with the greatest variation. For example,
or whether subcases within that entity must be they may establish criteria for selecting a large
considered. For example, in studying the effects of urban school district, a small rural district,
an educational policy on a large, urban public a medium-sized district in a small city, and perhaps
school district, researchers could use either type of a school district that serves a Native American
design, depending on what they deem important population on a reservation. By deliberately select-
to examine. If researchers focused on the global ing from among cases that had potential for differ-
aspect of the policy and how it may have changed ing implementation due to their distinct contexts,

the researchers expect to find as much variation as Case Study Approaches to Data
possible.
No matter what studies that involve more than Data Collection
a single case are called, what case studies have in One of the reasons that case study is such a ver-
common is that findings are presented as individ- satile approach to research is that both quantita-
ual portraits that contribute to our understanding tive and qualitative data may be used in the study,
of the issues, first individually and then collec- depending on the research questions asked. While
tively. One question that often arises is whether many case studies have qualitative tendencies, due
the use of multiple cases can represent a form of to the nature of exploring phenomena in context,
generalizability, in that researchers may be able to some case study researchers use surveys to find out
show similarities of issues across the cases. The demographic and self-report information as part
notion of generalizability in an approach that of the study. Researchers conducting case studies
tends to be qualitative in nature may be of concern tend also to use interviews, direct observations,
because of the way in which one selects the cases participant observation, and source documents as
and collects and analyzes the data. For example, in part of the data analysis and interpretation. The
statistical studies, the notion of generalizability documents may consist of archival records, arti-
comes from the form of sampling (e.g., random- facts, and websites that provide information about
ized sampling to represent a population and a con- the phenomenon in the context of what people
trol group) and the types of measurement tools make and use as resources in the setting.
used (e.g., surveys that use Likert-type scales and
therefore result in numerical data). In case study
Data Analysis
research, however, Yin has offered the notion that
the different cases are similar to multiple experi- The analytic role of the researcher is to system-
ments in which the researcher selects among simi- atically review these data, first making a detailed
lar and sometimes different situations to verify description of the case and the setting. It may be
results. In this way, the cases become a form of helpful for the researcher to outline a chronology
generalizing to a theory, either taken from the liter- of actions and events, although in further analysis,
ature or uncovered and grounded in the data. the chronology may not be as critical to the study
as the thematic interpretations. However, it may
be useful in terms of organizing what may other-
wise be an unwieldy amount of data.
Nature of the Case Study In further analysis, the researcher examines the
data, one case at a time if multiple cases are
Another decision to be made in the design of involved, for patterns of actions and instances of
a study is whether the purpose is primarily descrip- issues. The researcher first notes what patterns are
tive, exploratory, or explanatory. The nature of constructed from one set of data within the case
a descriptive case study is one in which the and then examines subsequent data collected
researcher uses thick description about the entity within the first case to see whether the patterns are
being studied so that the reader has a sense of hav- consistent. At times, the patterns may be evident
ing ‘‘been there, done that,’’ in terms of the phe- in the data alone, and at other times, the patterns
nomenon studied within the context of the may be related to relevant studies from the litera-
research setting. Exploratory case studies are those ture. If multiple cases are involved, a cross-case
in which the research questions tend to be of the analysis is then conducted to find what patterns
what can be learned about this issue type. The goal are consistent and under what conditions other
of this kind of study is to develop working hypoth- patterns are apparent.
eses about the issue and perhaps to propose further
research. An explanatory study is more suitable
Reporting the Case
for delving into how and why things are happen-
ing as they are, especially if the events and people While case study research has no predetermined
involved are to be observed over time. reporting format, Stake has proposed an approach

for constructing an outline of the report. He sug- Gomm, R., Hammersley, M., & Foster, P. (2000). Case
gests that the researcher start the report with study method: Key issues, Key texts. Thousand Oaks,
a vignette from the case to draw the reader into CA: Sage.
time and place. In the next section, the researcher Merriam, S. B. (1998). Qualitative research and case
study applications in education. San Francisco: Jossey-
indentifies the issue studied and the methods for
Bass.
conducting the research. The next part is a full Stake, R. E. (1995). Art of case study research. Thousand
description of the case and context. Next, in Oaks, CA: Sage.
describing the issues of the case, the researcher can Stake, R. E. (2005). Multiple case study analysis. New
build the complexity of the study for the reader. York: Guilford.
The researcher uses evidence from the case and Yin, R. K. (2003). Case study research. Thousand Oaks,
may relate that evidence to other relevant research. CA: Sage.
In the next section, the researcher presents the so
what of the case—a summary of claims made from
the interpretation of data. At this point, the
researcher may end with another vignette that CATEGORICAL DATA ANALYSIS
reminds the reader of the complexity of the case in
terms of realistic scenarios that readers may then A categorical variable consists of a set of non-
use as a form of transference to their own settings overlapping categories. Categorical data are
and experiences. counts for those categories. The measurement
scale of a categorical variable is ordinal if the
categories exhibit a natural ordering, such as
opinion variables with categories from ‘‘strongly
Versatility disagree’’ to ‘‘strongly agree.’’ The measurement
Case study research demonstrates its utility in that scale is nominal if there is no inherent ordering.
the researcher can explore single or multiple phe- The types of possible analysis for categorical
nomena or multiple examples of one phenomenon. data depend on the measurement scale.
The researcher can study one bounded entity in
a holistic fashion or multiple subunits embedded Types of Analysis
within that entity. Through a case study approach,
researchers may explore particular ways that parti- When the subjects measured are cross-classified on
cipants conduct themselves in their localized con- two or more categorical variables, the table of
texts, or the researchers may choose to study counts for the various combinations of categories
processes involving program participants. Case is a contingency table. The information in a contin-
study research can shed light on the particularity gency table can be summarized and further ana-
of a phenomenon or process while opening up ave- lyzed through appropriate measures of association
nues of understanding about the entity or entities and models as discussed below. These measures
involved in those processes. and models differentiate according to the nature of
the classification variables (nominal or ordinal).
LeAnn Grogan Putney Most studies distinguish between one or more
response variables and a set of explanatory vari-
ables. When the main focus is on the association
and interaction structure among a set of response
Further Readings variables, such as whether two variables are condi-
tionally independent given values for the other
Creswell, J. W. (2007). Qualitative inquiry and research
variables, loglinear models are useful, as described
design: Choosing among five approaches. Thousand
Oaks, CA: Sage.
in a later section. More commonly, research ques-
Creswell, J. W., & Maietta, R. C. (2002). Qualitative tions focus on effects of explanatory variables on
research. In D. C. Miller & N. J. Salkind (Eds.), a categorical response variable. Those explanatory
Handbook of research design and social measurement variables might be categorical, quantitative, or of
(pp. 162–163). Thousand Oaks, CA: Sage. both types. Logistic regression models are then of

particular interest. Initially such models were way this is done is to introduce a random effect in
developed for binary (success–failure) response the model to represent each cluster, thus extending
variables. They describe the logit, which is the GLM to a generalized linear mixed model, the
log½PðY ¼ 1Þ=PðY ¼ 2Þ, using the equation mixed referring to the model’s containing both
random effects and the usual sorts of fixed effects.
log½PðY ¼ 1Þ=PðY ¼ 2Þ ¼ a þ β1 x1 þ β2 x2
þ    þ βp xp , Two-Way Contingency Tables
where Y is the binary response variable and Two categorical variables are independent if the
x1 , . . . , xp the set of the explanatory variables. The probability of response in any particular category
logistic regression model was later extended to of one variable is the same for each category of the
nominal and ordinal response variables. For a nom- other variable. The most well-known result on
inal response Y with J categories, the model simul- two-way contingency tables is the test of the null
taneously describes hypothesis of independence, introduced by Karl
Pearson in 1900. If X and Y are two categorical
log½PðY ¼ 1Þ=PðY ¼ JÞ, variables with I and J categories, respectively, then
log½PðY ¼ 2Þ=PðY ¼ JÞ, . . . , their cross-classification leads to a I × J table of
observed frequencies n ¼ ðnij Þ. Under this hypoth-
log½PðY ¼ J  1Þ=PðY ¼ JÞ:
esis, the expected cell frequencies are values that
For ordinal responses, a popular model uses have the same marginal totals as the observed
explanatory variables to predict a logit defined in counts but perfectly satisfy the hypothesis. They
terms of a cumulative probability, equal mij ¼ nπi þ π þ j , i ¼ 1; . . . ; I; jP ¼ 1; . . . ; J,
where n is the total sample size (n ¼ i;j nij ) and
log½PðY ≤ jÞ=PðY > jÞ; j ¼ 1; 2; . . . ; J  1: πi þ (π þ j ) is the ith row (jth column) marginal of
the underlying probabilities matrix π ¼ ðπij Þ. Then
For categorical data, the binomial and multinomial distributions play the central role that the normal does for quantitative data. Models for categorical data assuming the binomial or multinomial were unified with standard regression and analysis of variance (ANOVA) models for quantitative data assuming normality through the introduction of the generalized linear model (GLM). This very wide class of models can incorporate data assumed to come from any of a variety of standard distributions (such as the normal, binomial, and Poisson). The GLM relates a function of the mean (such as the log or logit of the mean) to explanatory variables with a linear predictor. Certain GLMs for counts, such as Poisson regression models, relate naturally to loglinear and logistic models for binomial and multinomial responses.

More recently, methods for categorical data have been extended to include clustered data, for which observations within each cluster are allowed to be correlated. A very important special case is that of repeated measurements, such as in a longitudinal study in which each subject provides a cluster of observations taken at different times. One way this is done is to introduce a random effect in the model to represent each cluster, thus extending the GLM to a generalized linear mixed model, the mixed referring to the model's containing both random effects and the usual sorts of fixed effects.

Two-Way Contingency Tables

Two categorical variables are independent if the probability of response in any particular category of one variable is the same for each category of the other variable. The most well-known result on two-way contingency tables is the test of the null hypothesis of independence, introduced by Karl Pearson in 1900. If X and Y are two categorical variables with I and J categories, respectively, then their cross-classification leads to an I × J table of observed frequencies n = (nij). Under this hypothesis, the expected cell frequencies are values that have the same marginal totals as the observed counts but perfectly satisfy the hypothesis. They equal mij = nπi+π+j, i = 1, ..., I; j = 1, ..., J, where n is the total sample size (n = Σi,j nij) and πi+ (π+j) is the ith row (jth column) marginal of the underlying probabilities matrix π = (πij). Then the corresponding maximum likelihood (ML) estimates equal m̂ij = npi+p+j = ni+n+j/n, where pij denotes the sample proportion in cell (i, j). The hypothesis of independence is tested through Pearson's chi-square statistic,

X² = Σi,j (nij − m̂ij)² / m̂ij.   (1)

The p value is the right-tail probability above the observed X² value. The distribution of X² under the null hypothesis is approximated by a χ² distribution with (I − 1)(J − 1) degrees of freedom, provided that the individual expected cell frequencies are not too small. In fact, Pearson claimed that the associated degrees of freedom (df) were IJ − 1, and R. A. Fisher corrected this in 1922. Fisher later proposed a small-sample test of independence for 2 × 2 tables, now referred to as Fisher's exact test. This test was later extended to I × J tables as well as to more complex hypotheses in both two-way and multiway tables. When a contingency table has ordered row or column categories (ordinal variables), specialized methods can take advantage of that ordering.
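For readers who want to carry out the test, the sketch below (Python with SciPy; the counts are made up) computes the Pearson statistic of Equation 1, its (I − 1)(J − 1) degrees of freedom, the right-tail p value, and the expected frequencies m̂ij, and then applies Fisher's exact test to a small hypothetical 2 × 2 table.

    import numpy as np
    from scipy.stats import chi2_contingency, fisher_exact

    # Hypothetical I x J table of observed counts nij (I = 2 rows, J = 3 columns).
    table = np.array([[20, 30, 50],
                      [30, 35, 35]])

    X2, p_value, df, expected = chi2_contingency(table, correction=False)
    print(X2, df, p_value)   # Pearson statistic, (I-1)(J-1) df, right-tail p value
    print(expected)          # estimated expected frequencies ni+ n+j / n

    # For a sparse 2 x 2 table, Fisher's exact test avoids the chi-square approximation.
    odds_ratio, p_exact = fisher_exact([[3, 7], [8, 2]])
    print(odds_ratio, p_exact)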
Ultimately more important than mere testing of significance is the estimation of the strength of the association. For ordinal data, measures can incorporate information about the direction (positive or negative) of the association as well.

More generally, models can be formulated that are more complex than independence, and expected frequencies mij can be estimated under the constraint that the model holds. If m̂ij are the corresponding maximum likelihood estimates, then, to test the hypothesis that the model holds, one can use the Pearson statistic (Equation 1) or the statistic that results from the standard statistical approach of conducting a likelihood-ratio test, which is

G² = 2 Σi,j nij ln(nij / m̂ij).   (2)

Under the null hypothesis, both statistics have the same large-sample chi-square distribution.

The special case of the 2 × 2 table occurs commonly in practice, for instance for comparing two groups on a success/fail–type outcome. In a 2 × 2 table, the basic measure of association is the odds ratio. For the probability table

π11  π12
π21  π22

the odds ratio is defined as θ = π11π22/(π12π21). Independence corresponds to θ = 1. Inference about the odds ratio can be based on the fact that for large samples,

log(θ̂) ∼ N(log(θ), 1/n11 + 1/n12 + 1/n21 + 1/n22).

The odds ratio relates to the relative risk r. In particular, if we assume that the rows of the above 2 × 2 table represent two independent groups of subjects (A and B) and the columns correspond to presence/absence of a disease, then the relative risk for this disease is defined as r = πA/πB, where πA = π11/π1+ is the probability of disease for the first group and πB is defined analogously. Since θ = r(1 − πB)/(1 − πA), it follows that θ ≈ r whenever πA and πB are close to 0.
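A small numerical sketch (Python with NumPy and SciPy; the 2 × 2 counts are hypothetical) computes the sample odds ratio, a 95% confidence interval based on the large-sample normal approximation for log(θ̂) given above, and the corresponding relative risk.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 2 x 2 counts: rows are groups A and B, columns are disease present/absent.
    n11, n12 = 30, 70   # group A
    n21, n22 = 10, 90   # group B

    theta_hat = (n11 * n22) / (n12 * n21)                   # sample odds ratio
    se_log = np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)         # large-sample SE of log(theta_hat)
    ci = np.exp(np.log(theta_hat) + norm.ppf(0.975) * np.array([-1, 1]) * se_log)

    risk_A = n11 / (n11 + n12)    # pi_A, probability of disease in group A
    risk_B = n21 / (n21 + n22)    # pi_B
    print(theta_hat, ci, risk_A / risk_B)   # odds ratio, 95% CI, relative risk r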
Models for Two-Way Contingency Tables

Independence between the classification variables X and Y (i.e., mij = nπi+π+j, for all i and j) can equivalently be expressed in terms of a loglinear model as

log(mij) = λ + λi^X + λj^Y,   i = 1, ..., I; j = 1, ..., J.

The more general model that allows association between the variables is

log(mij) = λ + λi^X + λj^Y + λij^XY,   i = 1, ..., I; j = 1, ..., J.   (3)

Loglinear models describe the way the categorical variables and their association influence the count in each cell of the contingency table. They can be considered as a discrete analogue of ANOVA. The two-factor interaction terms relate to odds ratios describing the association. As in ANOVA models, some parameters are redundant in these specifications, and software reports estimates by assuming certain constraints.

The general model (Equation 3) does not impose any structure on the underlying association, and so it fits the data perfectly. Associations can be modeled through association models. The simplest such model, the linear-by-linear association model, is relevant when both classification variables are ordinal. It replaces the interaction term λij^XY by the product φμiνj, where μi and νj are known scores assigned to the row and column categories, respectively. This model,

log(mij) = λ + λi^X + λj^Y + φμiνj,   i = 1, ..., I; j = 1, ..., J,   (4)

has only one parameter more than the independence model, namely φ. Consequently, the associated df are (I − 1)(J − 1) − 1, and once it holds, independence can be tested conditionally on it by testing φ = 0 via a more powerful test with df = 1. The linear-by-linear association model (Equation 4) can equivalently be expressed in terms of the (I − 1)(J − 1) local odds ratios θij = πijπi+1,j+1/(πi,j+1πi+1,j) (i = 1, ..., I − 1; j = 1, ..., J − 1), defined by adjacent rows and columns of the table:
θij = exp[φ(μi+1 − μi)(νj+1 − νj)],   i = 1, ..., I − 1; j = 1, ..., J − 1.   (5)

With equally spaced scores, all the local odds ratios are identical, and the model is referred to as uniform association. More general models treat one or both sets of scores as parameters. Association models have been mainly developed by L. Goodman.

Another popular method for studying the pattern of association between the row and column categories of a two-way contingency table is correspondence analysis (CA). It is mainly a descriptive method. CA assigns optimal scores to the row and column categories and plots these scores in two or three dimensions, thus providing a reduced rank display of the underlying association.

The special case of square I × I contingency tables with the same categories for the rows and the columns occurs with matched-pairs data. For example, such tables occur in the study of rater agreement and in the analysis of social mobility. A condition of particular interest for such data is marginal homogeneity, that πi+ = π+i, i = 1, ..., I. For the 2 × 2 case of binary matched pairs, the test comparing the margins using the chi-square statistic (n12 − n21)²/(n12 + n21) is called McNemar's test.
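The statistic is simple enough to compute directly; the sketch below (Python with SciPy, hypothetical counts of discordant pairs) evaluates it and refers it to a chi-square distribution with 1 degree of freedom.

    from scipy.stats import chi2

    # Hypothetical binary matched pairs: n12 pairs moved from category 1 to 2, n21 the reverse.
    n12, n21 = 15, 6

    mcnemar = (n12 - n21) ** 2 / (n12 + n21)   # McNemar's chi-square statistic
    p_value = chi2.sf(mcnemar, df=1)           # right-tail probability on 1 df
    print(mcnemar, p_value)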
Multiway Contingency Tables

The models described earlier for two-way tables extend to higher dimensions. For multidimensional tables, a number of models are available, varying in terms of the complexity of the association structure. For three variables, for instance, models include ones for which (a) the variables are mutually independent; (b) two of the variables are associated but are jointly independent of the third; (c) two of the variables are conditionally independent, given the third variable, but may both be associated with the third; and (d) each pair of variables is associated, but the association between each pair has the same strength at each level of the third variable. Because the number of possible models increases dramatically with the dimension, model selection methods become more important as the dimension of the table increases. When the underlying theory for the research study does not suggest particular methods, one can use the same methods that are available for ordinary regression models, such as stepwise selection methods and fit indices such as the Akaike Information Criterion. Loglinear models for multiway tables can include higher order interactions up to the order equal to the dimension of the table. Two-factor terms describe conditional association between two variables, three-factor terms describe how the conditional association varies among categories of a third variable, and so forth. CA has also been extended to higher dimensional tables, leading to multiple CA.

Historically, a common way to analyze higher-way contingency tables was to analyze all the two-way tables obtained by collapsing the table over the other variables. However, the two-way associations can be quite different from conditional associations in which other variables are controlled. The association can even change direction, a phenomenon known as Simpson's paradox. Conditions under which tables can be collapsed are most easily expressed and visualized using graphical models that portray each variable as a node and a conditional association as a connection between two nodes. The patterns of associations and their strengths in two-way or multiway tables can also be illustrated through special plots called mosaic plots.

Inference and Software

Least squares is not an optimal estimation method for categorical data, because the variance of sample proportions is not constant but rather depends on the corresponding population proportions. Because of this, parameters for categorical data were estimated historically by the use of weighted least squares, giving more weight to observations having smaller variances. Currently, the most popular estimation method is maximum likelihood, which is an optimal method for large samples for any type of data. The Bayesian approach to inference, in which researchers combine the information from the data with their prior beliefs to obtain posterior distributions for the parameters of interest, is becoming more popular. For large samples, all these methods yield similar results.

Standard statistical packages, such as SAS, SPSS (an IBM company, formerly called PASW® Statistics), Stata, and S-Plus and R, are well suited for analyzing categorical data. Such packages now
have facility for fitting GLMs, and most of the standard methods for categorical data can be viewed as special cases of such modeling. Bayesian analysis of categorical data can be carried out through WINBUGS. Specialized software such as the programs StatXact and LogXact, developed by Cytel Software, are available for small-sample exact methods of inference for contingency tables and for logistic regression parameters.

Maria Kateri and Alan Agresti

See also Categorical Variable; Correspondence Analysis; General Linear Model; R; SAS; Simpson's Paradox; SPSS

Further Readings

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). New York: Wiley.
Agresti, A. (2010). Analysis of ordinal categorical data. New York: Wiley.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge: MIT Press.
Congdon, P. (2005). Bayesian models for categorical data. New York: Wiley.
Goodman, L. A. (1986). Some useful extensions of the usual correspondence analysis and the usual log-linear models approach in the analysis of contingency tables with or without missing entries. International Statistical Review, 54, 243–309.
Kateri, M. (2008). Categorical data. In S. Kotz (Ed.), Encyclopedia of statistical sciences (2nd ed.). Hoboken, NJ: Wiley-Interscience.

Websites

Cytel Software: http://www.cytel.com
WINBUGS: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

CATEGORICAL VARIABLE

Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership. Instead, categorical variables often provide valuable social-oriented information that is not quantitative by nature (e.g., hair color, religion, ethnic group).

In the hierarchy of measurement levels, categorical variables are associated with the two lowest variable classification orders, nominal or ordinal scales, depending on whether the variable groups exhibit an intrinsic ranking. A nominal measurement level consists purely of categorical variables that have no ordered structure for intergroup comparison. If the categories can be ranked according to a collectively accepted protocol (e.g., from lowest to highest), then these variables are ordered categorical, a subset of the ordinal level of measurement.

Categorical variables at the nominal level of measurement have two properties. First, the categories are mutually exclusive. That is, an object can belong to only one category. Second, the data categories have no logical order. For example, researchers can measure research participants' religious backgrounds, such as Jewish, Protestant, Muslim, and so on, but they cannot order these variables from lowest to highest. It should be noted that when categories get numeric labels such as male = 0 and female = 1 or control group = 0 and treatment group = 1, the numbers are merely labels and do not indicate one category is "better" on some aspect than another. The numbers are used as symbols (codes) and do not reflect either quantities or a rank ordering. Dummy coding is the quantification of a variable with two categories (e.g., boys, girls). Dummy coding will allow the researcher to conduct specific analyses such as the point-biserial correlation coefficient, in which a dichotomous categorical variable is related to a variable that is continuous. One example of the use of point-biserial correlation is to compare males with females on a measure of mathematical ability.

Categorical variables at the ordinal level of measurement have the following properties: (a) the data categories are mutually exclusive, (b) the data categories have some logical order, and (c) the data categories are scaled according to the amount of a particular characteristic. Grades in courses (i.e., A, B, C, D, and F) are an example. The person who earns an A in a course has a higher level of achievement than one who gets a B, according to
the criteria used for measurement by the course instructor. However, one cannot assume that the difference between an A and a B is the same as the difference between a B and a C. Similarly, researchers might set up a Likert-type scale to measure level of satisfaction with one's job and assign a 5 to indicate extremely satisfied, 4 to indicate very satisfied, 3 to indicate moderately satisfied, and so on. A person who gives a rating of 5 feels more job satisfaction than a person who gives a rating of 3, but it has no meaning to say that one person has 2 units more satisfaction with a job than another has or exactly how much more satisfied one is with a job than another person is.

In addition to verbal descriptions, categorical variables are often presented visually using tables and charts that indicate the group frequency (i.e., the number of values in a given category). Contingency tables show the number of counts in each category and increase in complexity as more attributes are examined for the same object. For example, a car can be classified according to color, manufacturer, and model. This information can be displayed in a contingency table showing the number of cars that meet each of these characteristics (e.g., the number of cars that are white and manufactured by General Motors). This same information can be expressed graphically using a bar chart or pie chart. Bar charts display the data as elongated bars with lengths proportional to category frequency, with the category labels typically being the x-axis and the number of values the y-axis. On the other hand, pie charts show categorical data as proportions of the total value or as a percentage or fraction. Each category constitutes a section of a circular graph or "pie" and represents a subset of the 100% or fractional total. In the car example, if 25 cars out of a sample of 100 cars were white, then 25%, or one quarter, of the circular pie chart would be shaded, and the remaining portion of the chart would be shaded alternative colors based on the remaining categorical data (i.e., cars in colors other than white).

Specific statistical tests that differ from other quantitative approaches are designed to account for data at the categorical level. The only measure of central tendency appropriate for categorical variables at the nominal level is mode (the most frequent category or categories if there is more than one mode), but at the ordinal level, the median or point below which 50% of the scores fall is also used. The chi-square distribution is used for categorical data at the nominal level. Observed frequencies in each category are compared with the theoretical or expected frequencies. Types of correlation coefficients that use categorical data include point biserial; Spearman rho, in which both variables are at the ordinal level; and phi, in which both variables are dichotomous (e.g., boys vs. girls on a yes–no question). Categorical variables can also be used in various statistical analyses such as t tests, analysis of variance, multivariate analysis of variance, simple and multiple regression analysis, and discriminant analysis.

Karen D. Multon and Jill S. M. Coleman

See also Bar Chart; Categorical Data Analysis; Chi-Square Test; Levels of Measurement; Likert Scaling; Nominal Scale; Ordinal Scale; Pie Chart; Variable

Further Readings

Siegel, A. F., & Morgan, C. J. (1996). Statistics and data analysis: An introduction (2nd ed.). New York: Wiley.
Simonoff, J. S. (2003). Analyzing categorical data. New York: Springer.

CAUSAL-COMPARATIVE DESIGN

A causal-comparative design is a research design that seeks to find relationships between independent and dependent variables after an action or event has already occurred. The researcher's goal is to determine whether the independent variable affected the outcome, or dependent variable, by comparing two or more groups of individuals. There are similarities and differences between causal-comparative research, also referred to as ex post facto research, and both correlational and experimental research. This entry discusses these differences, as well as the benefits, process, limitations, and criticism of this type of research design. To demonstrate how to use causal-comparative research, examples in education are presented.
Comparisons With Correlational Research

Many similarities exist between causal-comparative research and correlational research. Both methods are useful when experimental research has been deemed impossible or unethical as the research design for a particular question. Both causal-comparative and correlational research designs attempt to determine relationships among variables, but neither allows for the actual manipulation of these variables. Thus, neither can definitively state that a true cause-and-effect relationship occurred between these variables. Finally, neither type of design randomly places subjects into control and experimental groups, which limits the generalizability of the results.

Despite similarities, there are distinct differences between causal-comparative and correlational research designs. In causal-comparative research, the researcher investigates the effect of an independent variable on a dependent variable by comparing two or more groups of individuals. For example, an educational researcher may want to determine whether a computer-based ACT program has a positive effect on ACT test scores. In this example, the researcher would compare the ACT scores from a group of students that completed the program with scores from a group that did not complete the program. In correlational research, the researcher works with only one group of individuals. Instead of comparing two groups, the correlational researcher examines the effect of one or more independent variables on the dependent variable within the same group of subjects. Using the same example as above, the correlational researcher would select one group of subjects who have completed the computer-based ACT program. The researcher would use statistical measures to determine whether there was a positive relationship between completion of the ACT program and the students' ACT scores.

Comparisons With Experimental Research

A few aspects of causal-comparative research parallel experimental research designs. Unlike correlational research, both experimental research and causal-comparative research typically compare two or more groups of subjects. Research subjects are generally split into groups on the basis of the independent variable that is the focus of the study. Another similarity is that the goal of both types of research is to determine what effect the independent variable may or may not have on the dependent variable or variables.

While the premises of the two research designs are comparable, there are vast differences between causal-comparative research and experimental research. First and foremost, causal-comparative research occurs after the event or action has been completed. It is a retrospective way of determining what may have caused something to occur. In true experimental research designs, the researcher manipulates the independent variable in the experimental group. Because the researcher has more control over the variables in an experimental research study, the argument that the independent variable caused the change in the dependent variable is much stronger. Another major distinction between the two types of research is random sampling. In causal-comparative research, the research subjects are already in groups because the action or event has already occurred, whereas subjects in experimental research designs are randomly selected prior to the manipulation of the variables. This allows for wider generalizations to be made from the results of the study.

Table 1 breaks down the causal-comparative, correlational, and experimental methods in reference to whether each investigates cause-effect and whether the variables can be manipulated. In addition, it notes whether groups are randomly assigned and whether the methods study groups or individuals.

When to Use Causal-Comparative Research Designs

Although experimental research results in more compelling arguments for causation, there are many times when such research cannot, or should not, be conducted. Causal-comparative research provides a viable form of research that can be conducted when other methods will not work. There are particular independent variables that are not capable of being manipulated, including gender, ethnicity, socioeconomic level, education level, and religious preferences. For instance, if researchers intend to examine whether ethnicity affects
Table 1   Comparison of Causal-Comparative, Correlational, and Experimental Research

Causal-comparative research: investigates cause-effect (Yes); manipulates variable (No, it already occurred); randomly assigns participants to groups (No, groups formed prior to study); involves group comparisons (Yes); studies two or more groups of individuals and one independent variable; focus on differences of variables between groups; identifies variables for experimental exploration (Yes).

Correlational research: investigates cause-effect (No); manipulates variable (No); randomly assigns participants to groups (No, only one group); involves group comparisons (No); studies two or more variables and one group of individuals; focus on relationship among variables; identifies variables for experimental exploration (Yes).

Experimental research: investigates cause-effect (Yes); manipulates variable (Yes); randomly assigns participants to groups (Yes); involves group comparisons (Yes); studies groups or individuals depending on design; focus depends on design and is on the cause/effect of variables; identifies variables for experimental exploration (Yes).

self-esteem in a rural high school, they cannot manipulate a subject's ethnicity. This independent variable has already been decided, so the researchers must look to another method of determining cause. In this case, the researchers would group students according to their ethnicity and then administer self-esteem assessments. Although the researchers may find that one ethnic group has higher scores than another, they must proceed with caution when interpreting the results. In this example, it might be possible that one ethnic group is also from a higher socioeconomic demographic, which may mean that the socioeconomic variable affected the assessment scores.

Some independent variables should not be manipulated. In educational research, for example, ethical considerations require that the research method not deny potentially useful services to students. For instance, if a guidance counselor wanted to determine whether advanced placement course selection affected college choice, the counselor could not ethically force some students to take certain classes and prevent others from taking the same classes. In this case, the counselor could still compare students who had completed advanced placement courses with those who had not, but causal conclusions are more difficult than with an experimental design.

Furthermore, causal-comparative research may prove to be the design of choice even when experimental research is possible. Experimental research is both time-consuming and costly. Many school districts do not have the resources to conduct a full-scale experimental research study, so educational leaders may choose to do a causal-comparative study. For example, the leadership might want to determine whether a particular math curriculum would improve math ACT scores more effectively than the curriculum already in place in the school district. Before implementing the new curriculum throughout the district, the school leaders might conduct a causal-comparative study, comparing their district's math ACT scores with those from a school district that has already used the curriculum. In addition, causal-comparative research is often selected as a precursor to experimental research. In the math curriculum example, if the causal-comparative study demonstrates that the curriculum has a positive effect on student math ACT scores, the school leaders may then choose to conduct a full experimental research study by piloting the curriculum in one of the schools in the district.

Conducting Causal Comparative Research

The basic outline for conducting causal comparative research is similar to that of other research
Figure 1   Flowchart for Conducting a Study

The flowchart shows the sequence: Develop Research Questions → Determine Independent and Dependent Variables → Select Participants for Control and Experimental Groups → either Apply Control Methods to Participant Samples (Matching, Homogeneous Subgroups, or Analysis of Covariance) or Do Not Apply Control Methods to Participant Samples → Collect Data (Use Preexisting Data, or Develop Instrument and Collect Data) → Analyze and Interpret Data (Inferential Statistics or Descriptive Statistics) → Report Findings.
designs. Once the researcher determines the focus of the research and develops hypotheses, he or she selects a sample of participants for both an experimental and a control group. Depending on the type of sample and the research question, the researcher may measure potentially confounding variables to include them in eventual analyses. The next step is to collect data. The researcher then analyzes the data, interprets the results, and reports the findings. Figure 1 illustrates this process.

Determine the Focus of Research

As in other research designs, the first step in conducting a causal-comparative research study is to identify a specific research question and generate a hypothesis. In doing so, the researcher identifies a dependent variable, such as high dropout rates in high schools. The next step is to explore reasons the dependent variable has occurred or is occurring. In this example, several issues may affect dropout rates, including such elements as parental support, socioeconomic level, gender, ethnicity, and teacher support. The researcher will need to select which issue is of importance to his or her research goals. One hypothesis might be, "Students from lower socioeconomic levels drop out of high school at higher rates than students from higher socioeconomic levels." Thus, the independent variable in this scenario would be socioeconomic levels of high school students.

It is important to remember that many factors affect dropout rates. Controlling for such factors in causal-comparative research is discussed later in this entry. Once the researcher has identified the main research problem, he or she operationally defines the variables. In the above hypothesis, the dependent variable of high school dropout rates is fairly self-explanatory. However, the researcher would need to establish what constitutes lower socioeconomic levels and higher socioeconomic levels. The researcher may also wish to clarify the target population, such as what specific type of high school will be the focus of the study. Using the above example, the final research question might be, "Does socioeconomic status affect dropout rates in the Appalachian rural high schools in East Tennessee?" In this case, causal comparative would be the most appropriate method of research because the independent variable of socioeconomic status cannot be manipulated.

Because many factors may influence the dependent variable, the researcher should be aware of, and possibly test for, a variety of independent variables. For instance, if the researcher wishes to determine whether socioeconomic level affects a student's decision to drop out of high school, the researcher may also want to test for other potential causes, such as parental support, academic ability, disciplinary issues, and other viable options. If other variables can be ruled out, the case for socioeconomic level's influencing the dropout rate will be much stronger.

Participant Sampling and Threats to Internal Validity

In causal-comparative research, two or more groups of participants are compared. These groups are defined by the different levels of the independent variable(s). In the previous example, the researcher compares a group of high school dropouts with a group of high school students who have not dropped out of school. Although this is not an experimental design, causal-comparative researchers may still randomly select participants within each group. For example, a researcher may select every fifth dropout and every fifth high school student. However, because the participants are not randomly selected and placed into groups, internal validity is threatened. To strengthen the research design and counter threats to internal validity, the researcher might choose to impose the selection techniques of matching, using homogeneous subgroups, or analysis of covariance (ANCOVA), or both.

Matching

One method of strengthening the research sample is to select participants by matching. Using this technique, the researcher identifies one or more characteristics and selects participants who have these characteristics for both the control and the experimental groups. For example, if the researcher wishes to control for gender and grade level, he or she would ensure that both groups matched on these characteristics. If a male 12th-
grade student is selected for the experimental group, then a male 12th-grade student must be selected for the control group. In this way the researcher is able to control these two extraneous variables.

Comparing Homogeneous Subgroups

Another control technique used in causal-comparative research is to compare subgroups that are clustered according to a particular variable. For example, the researcher may choose to group and compare students by grade level. He or she would then categorize the sample into subgroups, comparing 9th-grade students with other 9th-grade students, 10th-grade students with other 10th-grade students, and so forth. Thus, the researcher has controlled the sample for grade level.

Analysis of Covariance

Using the ANCOVA statistical method, the researcher is able to adjust previously disproportionate scores on a pretest in order to equalize the groups on some covariate (control variable). The researcher may want to control for ACT scores and their impact on high school dropout rates. In comparing the groups, if one group's ACT scores are much higher or lower than the other's, the researcher may use the technique of ANCOVA to balance the two groups. This technique is particularly useful when the research design includes a pretest, which assesses the dependent variable before any manipulation or treatment has occurred. For example, to determine the effect of an ACT curriculum on students, a researcher would determine the students' baseline ACT scores. If the control group had scores that were much higher to begin with than the experimental group's scores, the researcher might use the ANCOVA technique to balance the two groups.

Instrumentation and Data Collection

The methods of collecting data for a causal-comparative research study do not differ from any other method of research. Questionnaires, pretests and posttests, various assessments, and behavior observation are common methods for collecting data in any research study. It is important, however, to also gather as much demographic information as possible, especially if the researcher is planning to use the control method of matching.

Data Analysis and Interpretation

Once the data have been collected, the researcher analyzes and interprets the results. Although causal-comparative research is not true experimental research, there are many methods of analyzing the resulting data, depending on the research design. It is important to remember that no matter what methods are used, causal-comparative research does not definitively prove cause-and-effect results. Nevertheless, the results will provide insights into causal relationships between the variables.

Inferential Statistics

When using inferential statistics in causal-comparative research, the researcher hopes to demonstrate that a relationship exists between the independent and dependent variables. Again, the appropriate method of analyzing data using this type of statistics is determined by the design of the research study. The three most commonly used methods for causal-comparative research are the chi-square test, paired-samples and independent t tests, and analysis of variance (ANOVA) or ANCOVA.

Pearson's chi-square, the most commonly used chi-square test, allows the researcher to determine whether there is a statistically significant relationship between the experimental and control groups based on frequency counts. This test is useful when the researcher is working with nominal data, that is, different categories of treatment or participant characteristics, such as gender. For example, if a researcher wants to determine whether males and females learn more efficiently from different teaching styles, the researcher may compare a group of male students with a group of female students. Both groups may be asked whether they learn better from audiovisual aids, group discussion, or lecture. The researcher could use chi-square testing to analyze the data for evidence of a relationship.

Another method of testing relationships in causal-comparative research is to use independent or dependent t tests. When the researcher is
comparing the mean scores of two groups, these tests can determine whether there is a significant difference between the control and experimental groups. The independent t test is used in research designs when no controls have been applied to the samples, while the dependent t test is appropriate for designs in which matching has been applied to the samples. One example of the use of t testing in causal-comparative research is to determine the significant difference in math course grades between two groups of elementary school students when one group has completed a math tutoring course. If the two samples were matched on certain variables such as gender and parental support, the dependent t test would be used. If no matching was involved, the independent t test would be the test of choice. The results of the t test allow the researcher to determine whether there is a statistically significant relationship between the independent variable of the math tutoring course and the dependent variable of math course grade.

To test for relationships between three or more groups and a continuous dependent variable, a researcher might select the statistical technique of one-way ANOVA. Like the independent t test, this test determines whether there is a significant difference between groups based on their mean scores. In the example of the math tutoring course, the researcher may want to determine the effects of the course for students who attended daily sessions and students who attended weekly sessions, while also assessing students who never attended sessions. The researcher could compare the average math grades of the three groups to determine whether the tutoring course had a significant impact on the students' overall math grades.

Limitations

Although causal-comparative research is effective in establishing relationships between variables, there are many limitations to this type of research. Because causal-comparative research occurs ex post facto, the researcher has no control over the variables and thus cannot manipulate them. In addition, there are often variables other than the independent variable(s) that may impact the dependent variable(s). Thus, the researcher cannot be certain that the independent variable caused the changes in the dependent variable. In order to counter this issue, the researcher must test several different theories to establish whether other variables affect the dependent variable. The researcher can reinforce the research hypothesis if he or she can demonstrate that other variables do not have a significant impact on the dependent variable.

Reversal causation is another issue that may arise in causal-comparative research. This problem occurs when it is not clear that the independent variable caused the changes in the dependent variable, or that a dependent variable caused the independent variable to occur. For example, if a researcher hoped to determine the success rate of an advanced English program on students' grades, he or she would have to determine whether the English program had a positive effect on the students, or in the case of reversal causation, whether students who make higher grades do better in the English program. In this scenario, the researcher could establish which event occurred first. If the students had lower grades before taking the course, then the argument that the course impacted the grades would be stronger.

The inability to construct random samples is another limitation in causal-comparative research. There is no opportunity to randomly choose participants for the experimental and control groups because the events or actions have already occurred. Without random assignment, the results cannot be generalized to the public, and thus the researcher's results are limited to the population that has been included in the research study. Despite this problem, researchers may strengthen their argument by randomly selecting participants from the previously established groups. For example, if there were 100 students who had completed a computer-based learning course, the researcher would randomly choose 20 students to compare with 20 randomly chosen students who had not completed the course. Another method of reinforcing the study would be to test the hypothesis with several different population samples. If the results are the same in all or most of the samples, the argument will be more convincing.

Criticisms

There have been many criticisms of causal-comparative research. For the most part, critics reject the idea that causal-comparative research
results should be interpreted as evidence of causal relationships. These critics believe that there are too many limitations in this type of research to allow for a suggestion of cause and effect. Some critics are frustrated with researchers who hold that causal-comparative research provides stronger causal evidence than correlational research does. Instead, they maintain that neither type of research can produce evidence of a causal relationship, so neither is better than the other. Most of these critics argue that experimental research designs are the only method of research that can illustrate any type of causal relationships between variables. Almost all agree, however, that experimental designs potentially provide the strongest evidence for causation.

Ernest W. Brewer and Jennifer Kuhn

See also Cause and Effect; Correlation; Experimental Design; Ex Post Facto Study; Quasi-Experimental Designs

Further Readings

Fraenkel, J. R., & Wallen, N. E. (2009). How to design and evaluate research in education. New York: McGraw-Hill.
Gay, L. R., Mills, G. E., & Airasian, P. (2009). Educational research: Competencies for analysis and applications. Upper Saddle River, NJ: Pearson Education.
Lodico, M. G., Spaulding, D. T., & Voegtle, K. H. (2006). Methods in educational research: From theory to practice. San Francisco: Jossey-Bass.
Mertler, C. A., & Charles, C. M. (2005). Introduction to educational research. Boston: Pearson.
Suter, W. N. (1998). Primer of educational research. Boston: Allyn and Bacon.

CAUSE AND EFFECT

Cause and effect refers to a relationship between two phenomena in which one phenomenon is the reason behind the other. For example, eating too much fast food without any physical activity leads to weight gain. Here eating without any physical activity is the "cause" and weight gain is the "effect." Another popular example in the discussion of cause and effect is that of smoking and lung cancer. A question that has surfaced in cancer research in the past several decades is, What is the effect of smoking on an individual's health? Also asked is the question, Does smoking cause lung cancer? Using data from observational studies, researchers have long established the relationship between smoking and the incidence of lung cancer; however, it took compelling evidence from several studies over several decades to establish smoking as a "cause" of lung cancer.

The term effect has been used frequently in scientific research. Most of the time, it can be seen that a statistically significant result from a linear regression or correlation analysis between two variables X and Y is explained as effect. Does X really cause Y or just relate to Y? The association (correlation) of two variables with each other in the statistical sense does not imply that one is the cause and the other is the effect. There needs to be a mechanism that explains the relationship in order for the association to be a causal one. For example, without the discovery of the substance nicotine in tobacco, it would have been difficult to establish the causal relationship between smoking and lung cancer. Tobacco companies have claimed that since there is not a single randomized controlled trial that establishes the differences in death from lung cancer between smokers and nonsmokers, there was no causal relationship. However, a cause-and-effect relationship is established by observing the same phenomenon in a wide variety of settings while controlling for other suspected mechanisms.

Statistical correlation (e.g., association) describes how the values of variable Y of a specific population are associated with the values of another variable X from the same population. For example, the death rate from lung cancer increases with increased age in the general population. The association or correlation describes the situation that there is a relationship between age and the death rate from lung cancer. Randomized prospective studies are often used as a tool to establish a causal effect. Time is a key element in causality because the cause must happen prior to the effect. Causes are often referred to as treatments or exposures in a study. Suppose a causal relationship between an investigational drug A and response Y needs to be established. Suppose YA represents the
response when the participant is treated using A and Y0 is the response when the subject is treated with placebo under the same conditions. The causal effect of the investigational drug is defined as the population average δ = E(YA − Y0). However, a person cannot be treated with both placebo and Treatment A under the same conditions. Each participant in a randomized study will have, usually, equal potential of receiving Treatment A or the placebo. The responses from the treatment group and the placebo group are collected at a specific time after exposure to the treatment or placebo. Since participants are randomized to the two groups, it is expected that the conditions (represented by covariates) are balanced between the two groups. Therefore, randomization controls for other possible causes that can affect the response Y, and hence the difference between the average responses from the two groups can be thought of as an estimated causal effect of Treatment A on Y.

Even though a randomized experiment is a powerful tool for establishing a causal relationship, a randomization study usually needs a lot of resources and time, and sometimes it cannot be implemented for ethical or practical reasons. Alternatively, an observational study may be a good tool for causal inference. In an observational study, the probability of receiving (or not receiving) treatment is assessed and accounted for. In the example of the effect of smoking on lung cancer, smoking and not smoking are the treatments. However, for ethical reasons, it is not practical to randomize subjects to treatments. Therefore, researchers had to rely on observational studies to establish the causal effect of smoking on lung cancer.

Causal inference plays a significant role in medicine, epidemiology, and social science. An issue about the average treatment effect is also worth mentioning. The average treatment effect, δ = E(Y1) − E(Y2), between two treatments is defined as the difference between two outcomes, but, as mentioned previously, a subject can receive only one of "rival" treatments. In other words, it is impossible for a subject to have two outcomes at the same time. Y1 and Y2 are called counterfactual outcomes. Therefore, the average treatment effect can never be observed. In the causal inference literature, several estimating methods of average treatment effect are proposed to deal with this obstacle. Also, for observational studies, estimators for average treatment effect with confounders controlled have been proposed.

Abdus S. Wahed and Yen-Chih Hsu

See also Clinical Trial; Observational Research; Randomization Tests

Further Readings

Freedman, D. (2005). Statistical models: Theory and practice. Cambridge, UK: Cambridge University Press.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.

CEILING EFFECT

The term ceiling effect is a measurement limitation that occurs when the highest possible score or close to the highest score on a test or measurement instrument is reached, thereby decreasing the likelihood that the testing instrument has accurately measured the intended domain. A ceiling effect can occur with questionnaires, standardized tests, or other measurements used in research studies. A person's reaching the ceiling or scoring positively on all or nearly all the items on a measurement instrument leaves few items to indicate whether the person's true level of functioning has been accurately measured. Therefore, whether a large percentage of individuals reach the ceiling on an instrument or whether an individual scores very high on an instrument, the researcher or interpreter has to consider that what has been measured may be more of a reflection of the parameters of what the instrument is able to measure than of how the individuals may be ultimately functioning. In addition, when the upper limits of a measure are reached, discriminating between the functioning of individuals within the upper range is difficult. This entry focuses on the impact of ceiling effects on the interpretation of research results, especially the results of standardized tests.

Interpretation of Research Results

When a ceiling effect occurs, the interpretation of the results attained is impacted. For example,
a health survey may include a range of questions that focus on the low to moderate end of physical functioning (e.g., individual is able to walk up a flight of stairs without difficulty) versus a range of questions that focus on higher levels of physical functioning (e.g., individual is able to walk at a brisk pace for 1 mile without difficulty). Questions within the range of low to moderate physical functioning provide valid items for individuals on that end of the physical functioning spectrum rather than for those on the higher end of the physical functioning spectrum. Therefore, if an instrument geared toward low to moderate physical functioning is administered to individuals with physical health on the upper end of the spectrum, a ceiling effect will likely be reached in a large portion of the cases, and interpretation of their ultimate physical functioning would be limited.

A ceiling effect can be present within results of a research study. For example, a researcher may administer the health survey described in the previous paragraph to a treatment group in order to measure the impact of a treatment on overall physical health. If the treatment group represents the general population, the results may show a large portion of the treatment group to have benefited from the treatment because they have scored high on the measure. However, this high score may signify the presence of a ceiling effect, which calls for caution when one is interpreting the significance of the positive results. If a ceiling effect is suspected, an alternative would be to use another measure that provides items that target better physical functioning. This would allow participants to demonstrate a larger degree of differentiation in physical functioning and provide a measure that is more sensitive to change or growth from the treatment.

Standardized Tests

The impact of the ceiling effect is important when interpreting standardized test results. Standardized tests have higher rates of score reliability because the tests have been administered to a large sampling of the population. The large sampling provides a scale of standard scores that reliably and validly indicates how close a person who takes the same test performs compared with the mean performance on the original sampling. Standardized tests include aptitude tests such as the Wechsler Intelligence Scales, the Stanford-Binet Scales, and the Scholastic Aptitude Test, and achievement tests such as the Iowa Test of Basic Skills and the Woodcock-Johnson: Tests of Achievement.

When individuals score at the upper end of a standardized test, especially 3 standard deviations from the mean, then the ceiling effect is a factor. It is reasonable to conclude that such a person has exceptional abilities compared with the average person within the sampled population, but the high score is not necessarily a highly reliable measure of the person's true ability. What may have been measured is more a reflection of the test than of the person's true ability. In order to attain an indication of the person's true ability when the ceiling has been reached, an additional measure with an increased range of difficult items would be appropriate to administer. If such a test is not available, the test performance would be interpreted stating these limitations.

The ceiling effect should also be considered when one is administering a standardized test to an individual who is at the top end of the age range for a test and who has elevated skills. In this situation, the likelihood of a ceiling effect is high. Therefore, if a test administrator is able to use a test that places the individual on the lower age end of a similar or companion test, the chance of a ceiling effect would most likely be eliminated. For example, the Wechsler Intelligence Scales have separate measures to allow for measurement of young children, children and adolescents, and older adolescents and adults. The upper end and lower end of the measures overlap, meaning that a 6- to 7-year-old could be administered the Wechsler Preschool and Primary Scale of Intelligence or the Wechsler Intelligence Scale for Children. In the event a 6-year-old is cognitively advanced, the Wechsler Intelligence Scale for Children would be a better choice in order to avoid a ceiling effect.

It is important for test developers to monitor ceiling effects on standardized instruments as they are utilized in the public. If a ceiling effect is noticed within areas of a test over time, those elements of the measure should be improved to provide better discrimination for high performers. The rate of individuals scoring at the upper end
of a measure should coincide with the standard distribution, then the probability that it is larger
scores and percentiles on the normal curve. than a and smaller than b is equal to the integral
Tish Holub Taylor

See also Instrumentation; Standardized Score; Survey; Validity of Measurement

Further Readings

Austin, P. C., & Bruner, L. J. (2003). Type I error inflation in the presence of a ceiling effect. American Statistician, 57(2), 97–105.
Gunst, R. F., & Barry, T. E. (2003). One way to moderate ceiling effects. Quality Progress, 36(10), 84–86.
Kaplan, C. (1992). Ceiling effects in assessing high-IQ children with the WPPSI-R. Journal of Clinical Child Psychology, 21(4), 403–406.
Rifkin, B. (2005). A ceiling effect in traditional classroom foreign language instruction: Data from Russian. Modern Language Journal, 89(1), 3–18.
Sattler, J. M. (2001). Assessment of children: Cognitive applications (4th ed.). San Diego, CA: Jerome M. Sattler.
Taylor, R. L. (1997). Assessment of exceptional students: Educational and psychological procedures (4th ed.). Boston: Allyn and Bacon.
Uttl, B. (2005). Measurement of individual differences: Lessons from memory assessment in research and clinical practice. Psychological Science, 16(6), 460–467.

CENTRAL LIMIT THEOREM

The central limit theorem (CLT) is, along with the theorems known as laws of large numbers, the cornerstone of probability theory. In simple terms, the theorem describes the distribution of the sum of a large number of random numbers, all drawn independently from the same probability distribution. It predicts that, regardless of this distribution, as long as it has finite variance, the sum follows a precise law, or distribution, known as the normal distribution.

Let us describe the normal distribution with mean μ and variance σ². It is defined through its density function,

f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),

where the variable x ranges from −∞ to +∞. This means that if a random variable follows this distribution, then the probability that it is larger than a and smaller than b is equal to the integral of the function f(x) (the area under the graph of the function) from x = a to x = b. The normal density is also known as the Gaussian density, named for Carl Friedrich Gauss, who used this function to describe astronomical data. If we put μ = 0 and σ² = 1 in the above formula, we obtain the so-called standard normal density.

In precise mathematical language, the CLT states the following: Suppose that X1, X2, . . . are independent random variables with the same distribution, having mean μ and variance σ² but being otherwise arbitrary. Let Sn = X1 + · · · + Xn be their sum. Then

P(a < (Sn − nμ)/√(nσ²) < b) → (1/√(2π)) ∫ from a to b of e^(−t²/2) dt,  as n → ∞.

It is more appropriate to define the standard normal density as the density of a random variable ζ with zero mean and variance 1 having the property that, for every a and b, there is a c such that if ζ1 and ζ2 are independent copies of ζ, then aζ1 + bζ2 is a copy of cζ. It follows that a² + b² = c² holds and that there is only one choice for the density of ζ, namely, the standard normal density.
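A small simulation can make the convergence stated above concrete. The sketch below is not part of the entry; it assumes Python with NumPy, and the parent distribution (exponential with mean 1), the sample size, the number of replications, and the interval (a, b) are all illustrative choices. It draws many independent copies of the sum Sn, standardizes them, and compares the resulting proportion with the standard normal probability.

# Empirical check of the CLT: standardized sums of i.i.d. variables
# approach the standard normal distribution. All constants here are
# arbitrary choices made for illustration only.
import math
import numpy as np

rng = np.random.default_rng(0)
n = 400          # number of summands in each sum S_n
reps = 5_000     # number of independent copies of S_n
mu, sigma2 = 1.0, 1.0   # mean and variance of the exponential(1) parent

samples = rng.exponential(scale=1.0, size=(reps, n))
s_n = samples.sum(axis=1)
z = (s_n - n * mu) / math.sqrt(n * sigma2)   # (S_n - n*mu) / sqrt(n*sigma^2)

# Compare the simulated P(a < Z < b) with the standard normal probability.
a, b = -1.0, 1.0
empirical = np.mean((z > a) & (z < b))
theoretical = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))
print(f"simulated P({a} < Z < {b}) = {empirical:.4f}")
print(f"normal    P({a} < Z < {b}) = {theoretical:.4f}")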
As an example, consider tossing a fair coin n = 1,000 times and determining the probability that fewer than 450 heads are obtained. The CLT can be used to give a good approximation of this probability. Indeed, if we let Xi be a random variable that takes value 1 if heads show up at the ith toss or value 0 if tails show up, then we see that the assumptions of the CLT are satisfied because the random variables have the same mean μ = 1/2 and variance σ² = 1/4. On the other hand, Sn = X1 + · · · + Xn is the number of heads. Since Sn ≤ 450 if and only if (Sn − nμ)/√(nσ²) ≤ (450 − 500)/√250 = −3.162, we find, by the CLT, that the probability that we get at most 450 heads equals the integral of the standard normal density from −∞ to −3.162. This integral can be computed with the help of a computer (or tables in olden times) and found to be about 0.00078, which is a reasonable approximation. Incidentally, this kind of reasoning leads to so-called statistical hypothesis testing: If we toss a coin and see 430 heads and 570 tails, then we should be suspicious that the coin is not fair.
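The coin-tossing calculation is easy to reproduce. The short sketch below is only an illustration using the Python standard library; the helper function phi is an assumed name for the standard normal distribution function computed from the error function, and the exact binomial probability is printed alongside for comparison.

# Normal approximation, via the CLT, of P(at most 450 heads in 1,000 tosses),
# compared with the exact binomial probability.
import math

n, p = 1000, 0.5
mu, sigma2 = n * p, n * p * (1 - p)          # mean 500, variance 250

def phi(x):
    # Standard normal distribution function, written with the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = (450 - mu) / math.sqrt(sigma2)           # (450 - 500)/sqrt(250) = -3.162
clt_approx = phi(z)

exact = sum(math.comb(n, k) for k in range(451)) / 2 ** n

print(f"CLT approximation: {clt_approx:.5f}")   # about 0.00078, as in the text
print(f"exact binomial:    {exact:.5f}")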
Origins

The origins of the CLT can be traced to a paper by Abraham de Moivre (1733), who described the CLT for symmetric Bernoulli trials; that is, in tossing a fair coin n times, the number Sn of heads has a distribution that is approximately that of n/2 + (√n/2)ζ. The result fell into obscurity but was revived in 1812 by Pierre-Simon Laplace, who proved and generalized de Moivre's result to asymmetric Bernoulli trials (weighted coins). Nowadays, this particular case is known as the de Moivre–Laplace CLT and is usually proved using Stirling's approximation for the product of the first n positive integers: n! ≈ nⁿ e^(−n) √(2πn). The factor √(2π) here is the same as the one appearing in the normal density. Two 19th-century Russian mathematicians, Pafnuty Chebyshev and A. A. Markov, generalized the CLT of their French predecessors and proved it using the method of moments.

The Modern CLT

The modern formulation of the CLT gets rid of the assumption that the summands be identically distributed random variables. Its most general and useful form is that of the Finnish mathematician J. W. Lindeberg (1922). It is stated for triangular arrays, that is, random variables Xn,1, Xn,2, . . . , Xn,kn depending on two indices that, for fixed n, are independent with respect to the second index. Letting Sn = Xn,1 + · · · + Xn,kn be the sum with respect to the second index, we have that Sn minus its mean, ESn, divided by its standard deviation, has a distribution that, as n tends to ∞, is standard normal, provided an asymptotic negligibility condition on the variances holds. A version of Lindeberg's theorem was formulated by A. Liapounoff in 1901. It is worth pointing out that Liapounoff introduced a new proof technique based on characteristic functions (also known as Fourier transforms), whereas Lindeberg's technique was an ingenious step-by-step replacement of the general summands by normal ones. Liapounoff's theorem has a condition that is more restrictive than that of Lindeberg but is often easier to verify in practice. In 1935, Paul Lévy and William Feller established necessary conditions for the validity of the CLT. In 1959, H. F. Trotter gave an elementary analytical proof of the CLT.

The Functional CLT

The functional CLT is stated for summands that take values in a multidimensional (one talks of random vectors in a Euclidean space) or even infinite-dimensional space. A very important instance of the functional CLT concerns convergence to Brownian motion, which also provides a means of defining this most fundamental object of modern probability theory. Suppose that Sn is as stated in the beginning of this entry. Define a function sn(t) of "time" as follows: At time t = k/n, where k is a positive integer, let sn(t) have value (Sk − μk)/(σ√n); now join the points [k/n, sn(k/n)] and [(k + 1)/n, sn((k + 1)/n)] by a straight line segment, for each value of k, to obtain the graph of a continuous random function sn(t). Donsker's theorem states that the probability distribution of the random function sn(t) converges (in a certain sense) to the distribution of a random function, or stochastic process, which can be defined to be the standard Brownian motion.

Other Versions and Consequences of the CLT

Versions of the CLT for dependent random variables also exist and are very useful in practice when independence is either violated or not possible to establish. Such versions exist for Markov processes, for regenerative processes, and for martingales, among others.

The work on the CLT gave rise to the general area known as weak convergence (of stochastic processes), the origin of which is in Yuri Prokhorov (1956) and Lucien Le Cam (1957). Nowadays, every result of CLT type can (and should) be seen in the light of this general theory.

Finally, it should be mentioned that all flavors of the CLT rely heavily on the assumption that the probability that the summands be very large is small, which is often cast in terms of finiteness of a moment such as the variance. In a variety of applications, heavy-tailed random variables do not satisfy this assumption. The study of limit theorems of sums of independent heavy-tailed

random variables leads to the so-called stable distributions, which are generalizations of the normal law, and like the normal law, they are "preserved by linear transformations," but they all have infinite variance. Boris Gnedenko and A. N. Kolmogorov call such limit results central limit theorems as well, but the terminology is not well established, even though the research, applications, and use of such results are quite extensive nowadays. A functional CLT (an analog of Donsker's theorem) for sums of heavy-tailed random variables gives rise to the so-called stable processes, which are special cases of processes with stationary and independent increments known as Lévy processes.

Uses of the CLT

The uses of the CLT are numerous. First, it is a cornerstone of statistics. Indeed, one could argue that without the CLT, there would be no such subject; for example, confidence intervals and parameter and density estimation make explicit use of the CLT or a version of it known as the local limit theorem, which, roughly speaking, states that the probability density function of Sn converges to the normal density. Of particular importance are dependent versions of the CLT, which find applications in the modern areas of data mining and artificial intelligence.

In connection with statistics, the speed of convergence in the CLT is also important. Answers to this question are given by the classical Berry-Esséen theorem or by more modern methods consisting of embedding the random sums into a Brownian motion (via the so-called Hungarian construction of Komlós, Major, and Tusnády).

In applied probability, various stochastic models depending on normal random variables or Brownian motions can be justified via (functional) central limit theorems. Areas within applied probability benefiting from the use of the CLT are, among others, stochastic networks, particle systems, stochastic geometry, the theory of risk, mathematical finance, and stochastic simulation. There are also various applications of the CLT in engineering, physics (especially in statistical mechanics), and mathematical biology.
Example 1: Confidence Intervals

Suppose we set up an experiment in which we try to find out how the glucose blood level in a certain population affects a disease. We introduce a mathematical model as follows: Assume that the glucose level data we collect are observations of independent random variables, each following a probability distribution with an unknown mean θ (the "true" glucose level), and we wish to estimate this θ. Letting X1, . . . , Xn be the data we have, a reasonable estimate for θ is the sample mean X̄ = Sn/n = (X1 + · · · + Xn)/n. A p-confidence interval (where p is a large probability, say p = .95) is an interval such that the probability that it contains θ is at least p. To find this confidence interval, we use the central limit theorem to justify that √n(X̄ − θ)/σ = (Sn − nθ)/(σ√n) is approximately standard normal. We then find the number α such that P(−α < ζ < α) = p, where ζ is a standard normal random variable, and, on replacing ζ by √n(X̄ − θ)/σ, we have that P(X̄ − ασ/√n < θ < X̄ + ασ/√n) = p, meaning that the interval centered at X̄ and having width 2ασ/√n is a p-confidence interval. One way to deal with the fact that σ may be unknown is to replace it by the sample standard deviation.

Example 2: An Invariance Principle

A consequence of the CLT (or, more generally, of the principle of weak convergence mentioned above) that is very important for applications is that it yields information about functions of the approximating sequence. For example, consider a fair game of chance, say tossing a fair coin whereby $1 is won on appearance of heads or lost on appearance of tails. Let Sn be the net winnings in n coin tosses (negative if there is a net loss). Define the quantity Tn roughly as the number of times before n that the gambler had a net loss. By observing that this quantity is a function of the function sn(t) discussed earlier, and using Donsker's CLT, it can be seen that the distribution function of Tn/n has, approximately, the arcsine law: P(Tn/n ≤ x) ≈ (2/π) arcsin √x, for large n.
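Example 1 above can be sketched numerically. The code below is only an illustration and assumes Python with NumPy; the simulated "glucose" values, the constant 1.959964 (the standard normal quantile with P(−α < ζ < α) = .95), and the use of the sample standard deviation in place of σ are assumptions made for the sketch, not part of the entry.

# A p-confidence interval for an unknown mean, justified by the CLT (Example 1).
# The data are simulated here purely for illustration.
import math
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=100.0, scale=15.0, size=200)   # invented glucose readings

p = 0.95
alpha = 1.959964                 # P(-alpha < zeta < alpha) = .95 for standard normal
x_bar = data.mean()
s = data.std(ddof=1)             # sample standard deviation used in place of sigma
half_width = alpha * s / math.sqrt(len(data))

print(f"sample mean            : {x_bar:.2f}")
print(f"95% confidence interval: ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")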
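Example 2 can likewise be checked by simulation. In the sketch below (again an illustrative assumption, not part of the entry; Python with NumPy, arbitrary numbers of tosses and games), Tn counts how many of the first n tosses leave the gambler at a net loss, and the simulated proportion of games with Tn/n ≤ x is compared with the arcsine law (2/π) arcsin √x.

# Simulation of T_n/n, the fraction of time the gambler is at a net loss,
# compared with the arcsine law from Example 2. Constants are illustrative.
import math
import numpy as np

rng = np.random.default_rng(2)
n, games = 1_000, 5_000

steps = rng.choice([-1, 1], size=(games, n))   # -1 = lose $1, +1 = win $1 per toss
winnings = steps.cumsum(axis=1)                # S_1, ..., S_n for each game
t_n = (winnings < 0).sum(axis=1)               # times the gambler is at a net loss

x = 0.25
simulated = np.mean(t_n / n <= x)
arcsine = (2.0 / math.pi) * math.asin(math.sqrt(x))
print(f"simulated P(T_n/n <= {x}) = {simulated:.3f}")
print(f"arcsine law               = {arcsine:.3f}")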
Takis Konstantopoulos

See also Distribution; Law of Large Numbers

Further Readings

Billingsley, P. (1968). Convergence of probability measures. New York: Wiley.
de Moivre, A. (1738). The doctrine of chances (2nd ed.). London: Woodfall.
de Moivre, A. (1926). Approximatio ad summam terminorum binomii (a + b)^n in seriem expansi [Approximating the sum of the terms of the binomial (a + b)^n expanded into a series]. Reprinted by R. C. Archibald, A rare pamphlet of de Moivre and some of his discoveries. Isis, 8, 671–684. (Original work published 1733)
Donsker, M. (1951). An invariance principle for certain probability limit theorems. Memoirs of the American Mathematical Society, 6, 1–10.
Feller, W. (1935). Über den Zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung [Central limit theorem of probability theory]. Math. Zeitung, 40, 521–559.
Feller, W. (1937). Über den Zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung [Central limit theorem of probability theory]: Part II. Math. Zeitung, 42, 301–312.
Gnedenko, B. V., & Kolmogorov, A. N. (1968). Limit theorems for sums of independent random variables (K. L. Chung, Trans.). Reading, MA: Addison-Wesley. (Original work published 1949)
Komlós, J., Major, P., & Tusnády, G. (1975). An approximation of partial sums of independent RV's, and the sample DF. Zeit. Wahrscheinlichkeitstheorie Verw. Gebiete, 32, 111–131.
Laplace, P.-S. (1992). Théorie analytique des probabilités [Analytical theory of probability]. Paris: Jacques Gabay. (Original work published 1828)
Le Cam, L. (1957). Convergence in distribution of stochastic processes. University of California Publications in Statistics, 2, 207–236.
Lévy, P. (1935). Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou enchaînées [Asymptotic properties of sums of independent or dependent random variables]. Journal des Mathématiques Pures et Appliquées, 14, 347–402.
Liapounoff, A. (1901). Nouvelle forme du théorème sur la limite de probabilités [New form of a theorem on probability limits]. Memoirs de l'Académie Imperial des Sciences de Saint-Pétersbourg, 13, 359–386.
Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung [A new investigation into the theory of calculating probabilities]. Mathematische Zeitschrift, 15, 211–225.
Prokhorov, Y. (1956). Convergence of random processes and limit theorems in probability theory. Theory of Probability and Its Applications, 1, 157–214.
Trotter, H. F. (1959). Elementary proof of the central limit theorem. Archiv der Mathematik, 10, 226–234.
Whitt, W. (2002). Stochastic-process limits. New York: Springer-Verlag.

CENTRAL TENDENCY, MEASURES OF

One of the most common statistical analyses used in descriptive statistics is a process to determine where the average of a set of values falls. There are multiple ways to determine the middle of a group of numbers, and the method used to find the average determines what information is known and how that average should be interpreted. Depending on the data one has, some methods for finding the average may be more appropriate than others.

The average describes the typical or most common number in a group of numbers. It is the one value that best represents the entire group of values. Averages are used in most statistical analyses, and even in everyday life. If one wanted to find the typical house price, family size, or score on a test, some form of average would be computed each time. In fact, one would compute a different type of average for each of those three examples.

Researchers use different ways to calculate the average, based on the types of numbers they are examining. Some numbers, measured at the nominal level of measurement, are not appropriate for some types of averaging. For example, if one were examining the variable type of vegetable, with levels labeled cucumbers, zucchini, carrots, and turnips, some types of average might indicate that the average vegetable was 3/4 carrot and 1/4 turnip, which makes no sense at all. Similarly, using some types of average on interval-level continuous variables may result in an average that is very imprecise and not very representative of the sample one is using. When the three main methods for examining the average are discussed collectively, they are referred to as measures of central tendency. The three primary measures of central tendency commonly used by researchers are the mean, the median, and the mode.
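Before each measure is described in detail below, the short sketch that follows shows all three being computed at once. It is only an illustration and assumes Python's standard statistics module; the small data set is invented for the example.

# The three measures of central tendency for one small, made-up sample.
from statistics import mean, median, multimode

scores = [10, 8, 5, 7, 9, 8]

print("mean  :", mean(scores))        # arithmetic average
print("median:", median(scores))      # middle value of the ordered scores
print("mode  :", multimode(scores))   # most frequent value(s); may be more than one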

Mean

The mean is the most commonly used (and misused) measure of central tendency. The mean is defined as the sum of all the scores in the sample, divided by the number of scores in the sample. This type of mean is also referred to as the arithmetic mean, to distinguish it from other types of means, such as the geometric mean or the harmonic mean. Several common symbols or statistical notations are used to represent the mean, including x̄, which is read as x-bar (the mean of the sample), and μ, which is read as mu (the mean of the population). Some research articles also use an italicized uppercase letter M to indicate the sample mean.

Much as different symbols are used to represent the mean, different formulas are used to calculate the mean. The difference between the two most common formulas is found only in the symbols used, as the formula for calculating the mean of a sample uses the symbols appropriate for a sample, and the other formula is used to calculate the mean of a population, and as such uses the symbols that refer to a population.

For calculating the mean of a sample, use

x̄ = ΣX/n.

For calculating the mean of a population, use

μ = ΣX/N.

Calculating the mean is a very simple process. For example, if a student had turned in five homework assignments that were worth 10 points each, the student's scores on those assignments might have been 10, 8, 5, 7, and 9. To calculate the student's mean score on the homework assignments, first one would add the values of all the homework assignments:

10 + 8 + 5 + 7 + 9 = 39.

Then one would divide the sum by the total number of assignments (which was 5):

39/5 = 7.8.

For this example, 7.8 would be the student's mean score for the five homework assignments. That indicates that, on average, the student scored a 7.8 on each homework assignment.

Other Definitions

Other definitions and explanations can also be used when interpreting the mean. One common description is that the mean is like a balance point. The mean is located at the center of the values of the sample or population. If one were to line up the values of each score on a number line, the mean would fall at the exact point where the deviations on each side balance; the mean is also the point that minimizes the sum of the squared distances to all the scores in the distribution.

Another way of thinking about the mean is as the amount per individual, or how much each individual would receive if one were to divide the total amount equally. If Steve has a total of $50, and if he were to divide it equally among his four friends, each friend would receive $50/4 = $12.50. Therefore, $12.50 would be the mean.

The mean has several important properties that are found only with this measure of central tendency. If one were to change a score in the sample from one value to another value, the calculated value of the mean would change. The value of the mean would change because the value of the sum of all the scores would change, thus changing the numerator. For example, given the scores in the earlier homework assignment example (10, 8, 5, 7, and 9), the student previously scored a mean of 7.8. However, if one were to change the value of the score of 9 to a 4 instead, the value of the mean would change:

10 + 8 + 5 + 7 + 4 = 34
34/5 = 6.8.

By changing the value of one number in the sample, the value of the mean was lowered by one point. Any change in the value of a score will result in a change in the value of the mean. If one were to remove or add a number to the sample, the value of the mean would also change, as then there would be fewer (or more) numbers in the sample, thus changing the numerator and the denominator. For example, if one were to calculate the mean of only the first four homework assignments (thereby removing a number from the sample),

10 þ 8 þ 5 þ 7 ¼ 30 10ð4Þ þ 8ð4Þ þ 5ð4Þ þ 7ð4Þ þ 9ð4Þ


30=4 ¼ 7:5: 40 þ 32 þ 20 þ 28 þ 36 ¼ 156
156=5 ¼ 31:2;
If one were to include a sixth homework assign-
ment (adding a number to the sample) on which results in the mean being multiplied by four as
the student scored an 8, well. Dividing each score by 3,

10 þ 8 þ 5 þ 7 þ 9 þ 8 ¼ 47 10=3 þ 8=3 þ 5=3 þ 7=3 þ 9=3


47=6 ¼ 7:83: 3:33 þ 2:67 þ 1:67 þ 2:33 þ 3 ¼ 13
13=5 ¼ 2:6;
Either way, whether one adds or removes
a number from the sample, the mean will almost
always change in value. The only instance in will result in the mean being divided by three as
which the mean will not change is if the number well.
that is added or removed is exactly equal to the
mean. For example, if the score on the sixth home-
work assignment had been 7.8, Weighted Mean
Occasionally, one will need to calculate the
10 þ 8 þ 5 þ 7 þ 9 þ 7:8 ¼ 46:8 mean of two or more groups, each of which has its
46:8=6 ¼ 7:8: own mean. In order to get the overall mean (some-
times called the grand mean or weighted mean),
If one were to add (or subtract) a constant to one will need to use a slightly different formula
each score in the sample, the mean will increase from the one used to calculate the mean for only
(or decrease) by the same constant value. So if the one group:
professor added three points to the score on every
homework assignment, Weighted mean ¼ ð x1 þ x2 þ . . . xn Þ=
ðn1 þ n2 þ . . . nn Þ:
ð10 þ 3Þ þ ð8 þ 3Þ þ ð5 þ 3Þ þ ð7 þ 3Þ þ ð9 þ 3Þ
13 þ 11 þ 8 þ 10 þ 12 ¼ 54 To calculate the weighted mean, one will divide
54=5 ¼ 10:8: the sum of all the scores in every group by the
number of scores in every group. For example, if
Carla taught three classes, and she gave each class
The mean homework assignment score
a test, the first class of 25 students might have
increased by 3 points. Similarly, if the professor
a mean of 75, the second class of 20 students
took away two points from each original home-
might have a mean of 85, and the third class of 30
work assignment score,
students might have a mean of 70. To calculate
the weighted mean, Carla would first calculate the
ð10  2Þ þ ð8  2Þ þ ð5  2Þ þ ð7  2Þ þ ð9  2Þ
summed scores for each class, then add the
8 þ 6 þ 3 þ 5 þ 7 ¼ 29; summed scores together, then divide by the total
29=5 ¼ 5:8: number of students in all three classes. To find the
summed scores, Carla will need to rework the for-
The mean homework assignment score mula for the sample mean to find the summed
decreased by two points. The same type of situa- scores instead:
tion will occur with multiplication and division. If
one multiplies (or divides) every score by the same X ¼ X=n becomes x ¼ xðnÞ:
number, the mean will also be multiplied (or
divided) by that number. Multiplying the five origi- With this reworked formula, find each summed
nal homework scores by four, score first:

x1 ¼ 75ð25Þ ¼ 1; 875 different ways in which those two measures of cen-


X2 ¼ 85ð20Þ ¼ 1; 700 tral tendency are calculated.
It is simple to find the middle when there are an
X3 ¼ 70ð30Þ ¼ 2; 100: odd number of scores, but it is a bit more complex
when the sample has an even number of scores.
Next, add the summed scores together: For example, when there were six homework
1; 875 þ 1; 700 þ 2; 100 ¼ 5; 675: scores (10, 8, 5, 7, 9, and 8), one would still line
up the homework scores from lowest to highest,
Then add the number of students per class then find the middle:
together:

25 þ 20 þ 30 ¼ 75:
578 | 8 9 10:
In this example, the median falls between two
Finally, divide the total summed scores by the total
identical scores, so one can still say that the median
number of students:
is 8. If the two middle numbers were different, one
5; 675=75 ¼ 75:67 would find the middle number between the two
numbers. For example, if one increased one of the
This is the weighted mean for the test scores of the student’s homework scores from an 8 to a 9,
three classes taught by Carla.
578 | 9 9 10:
Median
In this case, the middle falls halfway between 8
The median is the second measure of central ten- and 9, at a score of 8.5.
dency. It is defined as the score that cuts the distri- Statisticians disagree over the correct method for
bution exactly in half. Much as the mean can be calculating the median when the distribution has
described as the balance point, where the values multiple repeated scores in the center of the distribu-
on each side are identical, the median is the point tion. Some statisticians use the methods described
where the number of scores on each side is equal. above to find the median, whereas others believe the
As such, the median is influenced more by the scores in the middle need to be reduced to fractions
number of scores in the distribution than by the to find the exact midpoint of the distribution. So in
values of the scores in the distribution. The median a distribution with the following scores,
is also the same as the 50th percentile of any distri-
bution. Generally the median is not abbreviated or 2 3 3 4 5 5 5 5 5 6,
symbolized, but occasionally Mdn is used.
The median is simple to identify. The method some statisticians would say the median is 5,
used to calculate the median is the same for both whereas others (using the fraction method) would
samples and populations. It requires only two steps report the median as 4.7.
to calculate the median. In the first step, order the
numbers in the sample from lowest to highest. So Mode
if one were to use the homework scores from the
mean example, 10, 8, 5, 7, and 9, one would first The mode is the last measure of central tendency.
order them 5, 7, 8, 9, 10. In the second step, find It is the value that occurs most frequently. It is the
the middle score. In this case, there is an odd num- simplest and least precise measure of central
ber of scores, and the score in the middle is 8. tendency. Generally, in writing about the mode,
scholars label it simply mode, although some
57 8 9 10 books or papers use Mo as an abbreviation. The
method for finding the mode is the same for both
Notice that the median that was calculated is not samples and populations. Although there are
the same as the mean for the same sample of several ways one could find the mode, a simple
homework assignments. This is because of the method is to list each score that appears in the

sample. The score that appears the most often is distributed, and there is no specific reason that one
the mode. For example, given the following sam- would want to use a different measure of central
ple of numbers, tendency, then the mean should be the best measure
to use.
3; 4; 6; 2; 7; 4; 5; 3; 4; 7; 4; 2; 6; 4; 3; 5;
Median
one could arrange them in numerical order:
There are several reasons to use the median
2; 2; 3; 3; 3; 4; 4; 4; 4; 4; 5; 5; 6; 6; 7; 7: instead of the mean. Many statisticians believe that
it is inappropriate to use the mean to measure cen-
Once the numbers are arranged, it becomes
tral tendency if the distribution was measured at
apparent that the most frequently appearing num-
the ordinal level. Because variables measured at
ber is 4:
the ordinal level contain information about direc-
tion but not distance, and because the mean is
2; 2; 3; 3; 3; 4; 4; 4; 4; 4; 5; 5; 6; 6; 7; 7: measured in terms of distance, using the mean to
calculate central tendency would provide informa-
Thus 4 is the mode of that sample of numbers.
tion that is difficult to interpret.
Unlike the mean and the median, it is possible to
Another occasion to use the median is when the
have more than one mode. If one were to add two
distribution contains an outlier. An outlier is
threes to the sample,
a value that is very different from the other values.
2; 2; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 5; 5; Outliers tend to be located at the far extreme of
the distribution, either high or low. As the mean is
6; 6; 7; 7;
so sensitive to the value of the scores, using the
then both 3 and 4 would be the most commonly mean as a measure of central tendency in a distri-
occurring number, and the mode of the sample bution with an outlier would result in a nonrepre-
would be 3 and 4. The term used to describe a sam- sentative score. For example, looking again at the
ple with two modes is bimodal. If there are more five homework assignment scores, if one were to
than two modes in a sample, one says the sample replace the nine with a score of 30,
is multimodal. 5 7 8 9 10 becomes 5 7 8 30 10
5 þ 7 þ 8 þ 30 þ 10 ¼ 60
When to Use Each Measure
60=5 ¼ 12:
Because each measure of central tendency is calcu-
lated with a different method, each measure is dif- By replacing just one value with an outlier, the
ferent in its precision of measuring the middle, as newly calculated mean is not a good representa-
well as which numbers it is best suited for. tion of the average values of our distribution. The
same would occur if one replaced the score of 9
Mean with a very low number:
The mean is often used as the default measure of 5 7 8 9 10 becomes 5 7 8 4 10
central tendency. As most people understand the
5 þ 7 þ 8 þ 4 þ 10 ¼ 26
concept of average, they tend to use the mean
whenever a measure of central tendency is needed, 26=5 ¼ 5:2:
including times when it is not appropriate to use
the mean. Many statisticians would argue that the Since the mean is so sensitive to outliers, it is
mean should not be calculated for numbers that are best to use the median for calculating central ten-
measured at the nominal or ordinal levels of mea- dency. Examining the previous example, but using
surement, due to difficulty in the interpretation of the median,
the results. The mean is also used in other statistical
5 7 8 4 10
analyses, such as calculations of standard deviation.
If the numbers in the sample are fairly normally 4 5 7 8 10:

The middle number is 7, therefore the median is fractions of numbers, making it difficult to inter-
7, a much more representative number than the pret the results. A common example of this is the
mean of that sample, 5.2. saying, ‘‘The average family has a mom, a dad,
If the numbers in the distribution were mea- and 2.5 kids.’’ People are discrete variables, and as
sured on an item that had an open-ended option, such, they should never be measured in such a way
one should use the median as the measure of cen- as to obtain decimal results. The mode can also be
tral tendency. For example, a question that asks used to provide additional information, along with
for demographic information such as age or salary other calculations of central tendency. Information
may include either a lower or upper category that about the location of the mode compared with the
is open ended: mean can help determine whether the distribution
they are both calculated from is skewed.
Number of Cars Owned Frequency
0 4 Carol A. Carman
1 10
See also Descriptive Statistics; Levels of Measurement;
2 16
Mean; Median; Mode; ‘‘On the Theory of Scales of
3 or more 3
Measurement’’; Results Section; Sensitivity; Standard
Deviation; Variability, Measure of
The last answer option is open ended because
an individual with three cars would be in the same
category as someone who owned 50 cars. As such, Further Readings
it is impossible to accurately calculate the mean
Coladarci, T., Cobb, C. D., Minium, E. W., & Clarke, R.
number of cars owned. However, it is possible to
C. (2004). Fundamentals of statistical reasoning in
calculate the median response. For the above
education. Hoboken, NJ: Wiley.
example, the median would be 2 cars owned. Gravetter, F. J., & Wallnau, L. B. (2004). Statistics for the
A final condition in which one should use the behavioral sciences (6th ed.). Belmont, CA: Thomson
median instead of the mean is when one has incom- Wadsworth.
plete information. If one were collecting survey Salkind, N. J. (2008). Statistics for people who (think
information, and some of the participants refused they) hate statistics. Thousand Oaks, CA: Sage.
or forgot to answer a question, one would have Thorndike, R. M. (2005). Measurement and evaluation
responses from some participants but not others. It in psychology and education. (7th ed.). Upper Saddle
would not be possible to calculate the mean in this River, NJ: Pearson Education.
instance, as one is missing important information
that could change the mean that would be calcu-
lated if the missing information were known.
CHANGE SCORES
Mode
The measurement of change is fundamental in the
The mode is well suited to be used to measure social and behavioral sciences. Many researchers
the central tendency of variables measured at the have used change scores to measure gain in ability
nominal level. Because variables measured at the or shift in attitude over time, or difference scores
nominal level are given labels, then any number between two variables to measure a construct
assigned to these variables does not measure quan- (e.g., self-concept vs. ideal self). This entry intro-
tity. As such, it would be inappropriate to use the duces estimation of change scores, its assumptions
mean or the median with variables measured at and applications, and at the end offers a recom-
this level, as both of those measures of central ten- mendation on the use of change scores.
dency require calculations involving quantity. The Let Y and X stand for the measures obtained by
mode should also be used when finding the middle applying the same test to the subjects on two occa-
of distribution of a discrete variable. As these vari- sions. Observed change or difference score is
ables exist only in whole numbers, using other D ¼ Y  X. The true change is DT ¼ YT  XT ,
methods of central tendency may result in where YT and XT represent the subject’s true status

at these times. The development of measuring the be borrowed from the other measure (e.g., Y). The
true change DT follows two paths, one using estimator obtained with the Lord procedure is bet-
change score and the other using residual change ter than those from the previous two ways and the
score. raw change score in that it yields a smaller mean
The only assumption to calculate change score square of deviation between the estimate and the
is that Y (e.g., posttest scores) and X (e.g., pretest true change (D ^ T  DT ).
scores) should be on the same numerical scale; that Lee Cronbach and Lita Furby, skeptical of
is, the scores on posttest are comparable to scores change scores, proposed a better estimate of true
on pretest. This only requirement does not suggest change by incorporating information in regression
that pretest and posttest measure the same con- of two additional categories of variables besides
struct. Thus, such change scores can be extended the pretest and posttest measures used in Lord’s
to any kind of difference score between two mea- procedure. The two categories of variables include
sures (measuring the same construct or not) that Time 1 measures W (e.g., experience prior to the
are on the same numerical scale. The two mea- treatment, demographic variables, different treat-
sures are linked, as if the two scores are obtained ment group membership) and Time 2 measures Z
from a single test or two observations are made by (e.g., a follow-up test a month after the treatment).
the same observer. The correlation between linked W and X need not be simultaneous, nor do Z and
observations (e.g., two observations made by the Y. Note that W and Z can be multivariate. Addi-
same observer) will be higher than that between tional information introduced by W and Z
independent observations (e.g., two observations improves the estimation of true change with smal-
made by different observers). Such linkage must be ler mean squares of (D ^ T  DT ) if the sample size is
considered in defining the reliability coefficient for large enough for the weight of the information
difference scores. The reliability of change or dif- related to W and Z to be accurately estimated.
ference scores is defined as the correlation of the The other approach to estimating the true
scores with independently observed difference change DT is to use the residual change score. The
scores. The reliability for change scores produced development of the residual change score estimate
by comparing two independent measures will most is similar to that of the change score estimate. The
likely be smaller than that for the linked case. raw residual-gain score is obtained by regressing
Raw change or difference scores are computed the observed posttest measure Y on the observed
with two observed measures (D ¼ Y  X). pretest measure X. In this way, the portion in the
Observed scores are systematically related to ran- posttest score that can be predicted linearly from
dom error of measurement and thus unreliable. pretest scores is removed from the posttest score.
Conclusions based on these scores tend to be Compared with change score, residualized change
fallacious. is not a more correct measure in that it might
True change score is measured as the difference remove some important and genuine change in the
between the person’s true status at posttest and subject. The residualized score helps identify indi-
pretest times, DT ¼ YT  XT . The key is to viduals who changed more (or less) than expected.
remove the measurement error from the two The true residual-gain score is defined as the
observed measures. There are different ways to expected value of raw residual-gain score over
correct the errors in the two raw measures used to many observations on the same person. It is the
obtain raw gain scores. The first way is to correct residual obtained for a subject in the population
the error in pretest scores using the reliability coef- linear regression of true final status on true initial
ficient of X and simple regression. The second way status. The true residual gain is obtained by extend-
is to correct errors in both pretest and posttest ing Lord’s multiple-regression approach and can be
scores using the reliability coefficient of both X further improved by including W or Z information.
and Y and simple regression. The third way is the In spite of all the developments in the estimation
Lord procedure. With this procedure, the estimates of change or residual change scores, researchers are
of YT and XT are obtained by the use of a multiple often warned not to use these change scores. Cron-
regression procedure that incorporates the reliabil- bach and Furby summarized four diverse research
ity of a measure (e.g., X) and information that can issues in the measurement-of-change literature that

may appear to require change scores, including analyzes grosser data than do parametric tests such
(1) providing a measure of individual change, as t tests and analyses of variance (ANOVAs), the
(2) investigating correlates or predictors of change chi-square test can report only whether groups in
(or change rate), (3) identifying individuals with a sample are significantly different in some mea-
slowest change rate for further special treatment, sured attribute or behavior; it does not allow one
and (4) providing an indicator of a construct that to generalize from the sample to the population
can serve as independent, dependent, or covariate from which it was drawn. Nonetheless, because
variables. However, for most of these questions, the chi-square is less ‘‘demanding’’ about the data it
estimation of change scores is unnecessary and is will accept, it can be used in a wide variety of
at best inferior to other methods of analysis. One research contexts. This entry focuses on the appli-
example is one-group studies of the pretest– cation, requirements, computation, and interpreta-
treatment–posttest form. The least squares estimate tion of the chi-square test, along with its role in
of the mean true gain is simply the difference determining associations among variables.
between mean observed pretest and posttest scores.
Hypothesis testing and estimation related to treat-
Bivariate Tabular Analysis
ment effect should be addressed directly to
observed sample means. This suggests that the anal- Though one can apply the chi-square test to a sin-
ysis of change does not need estimates of true gle variable and judge whether the frequencies for
change scores. It is more appropriate to directly use each category are equal (or as expected), a chi-
raw score mean vectors, covariance matrices, and square is applied most commonly to frequency
estimated reliabilities. Cronbach and Furby recom- results reported in bivariate tables, and interpret-
mended that, when investigators ask questions in ing bivariate tables is crucial to interpreting the
which change scores appear to be the natural mea- results of a chi-square test. Bivariate tabular analy-
sure to be obtained, the researchers should frame sis (sometimes called crossbreak analysis) is used
their questions in other ways. to understand the relationship (if any) between
two variables. For example, if a researcher wanted
Feifei Ye to know whether there is a relationship between
the gender of U.S. undergraduates at a particular
See also Gain Scores, Analysis of; Longitudinal Design;
university and their footwear preferences, he or
Pretest–Posttest Design
she might ask male and female students (selected
as randomly as possible), ‘‘On average, do you
Further Readings prefer to wear sandals, sneakers, leather shoes,
boots, or something else?’’ In this example, the
Chan, D. (2003). Data analysis and modeling
independent variable is gender and the dependent
longitudinal processes. Group & Organization
Management, 28, 341–365.
variable is footwear preference. The independent
Cronbach, L. J., & Furby, L. (1970). How should we variable is the quality or characteristic that the
measure ‘‘change’’—or should we? Psychological researcher hypothesizes helps to predict or explain
Bulletin, 74(1), 68–80. some other characteristic or behavior (the depen-
Linn, R. L., & Slinde, J. A. (1977). The determination of dent variable). Researchers control the indepen-
the significance of change between pre- and posttesting dent variable (in this example, by sampling males
periods. Review of Educational Research, 47(1), and females) and elicit and measure the dependent
121–150. variable to test their hypothesis that there is some
relationship between the two variables.
To see whether there is a systematic relationship
between gender of undergraduates at University X
CHI-SQUARE TEST and reported footwear preferences, the results could
be summarized in a table as shown in Table 1.
The chi-square test is a nonparametric test of the Each cell in a bivariate table represents the
statistical significance of a relation between two intersection of a value on the independent variable
nominal or ordinal variables. Because a chi-square and a value on the dependent variable by showing

Table 1 Male and Female Undergraduate Footwear data more easily, but how confident can one be
Preferences at University X (Raw Frequencies) that those apparent patterns actually reflect a sys-
Leather tematic relationship in the sample between the
Group Sandals Sneakers Shoes Boots Other variables (between gender and footwear prefer-
Male 6 17 13 9 5 ence, in this example) and not just a random
Female 13 5 7 16 9 distribution?

Chi-Square Requirements
Table 2 Male and Female Undergraduate Footwear The chi-square test of statistical significance is
Preferences at University X (Percentages)
a series of mathematical formulas that compare
Leather the actual observed frequencies of the two vari-
Group Sandals Sneakers Shoes Boots Other N ables measured in a sample with the frequencies
Male 12 34 26 18 10 50 one would expect if there were no relationship at
Female 26 10 14 32 18 50 all between those variables. That is, chi-square
assesses whether the actual results are different
enough from the null hypothesis to overcome a cer-
how many times that combination of values was tain probability that they are due to sampling
observed in the sample being analyzed. Typically, error, randomness, or a combination.
in constructing bivariate tables, values on the inde- Because chi-square is a nonparametric test, it
pendent variable are arrayed on the vertical axis, does not require the sample data to be at an inter-
while values on the dependent variable are arrayed val level of measurement and more or less nor-
on the horizontal axis. This allows one to read mally distributed (as parametric tests such as t
‘‘across,’’ from values on the independent variable tests do), although it does rely on a weak assump-
to values on the dependent variable. (Remember, tion that each variable’s values are normally dis-
an observed relationship between two variables is tributed in the population from which the sample
not necessarily causal.) is drawn. But chi-square, while forgiving, does
Reporting and interpreting bivariate tables is have some requirements:
most easily done by converting raw frequencies (in
each cell) into percentages of each cell within the 1. Chi-square is most appropriate for analyzing
categories of the independent variable. Percentages relationships among nominal and ordinal vari-
basically standardize cell frequencies as if there ables. A nominal variable (sometimes called a cate-
were 100 subjects or observations in each category gorical variable) describes an attribute in terms of
of the independent variable. This is useful for com- mutually exclusive, nonhierarchically related cate-
paring across values on the independent variable if gories, such as gender and footwear preference.
the raw row totals are close to or more than 100, Ordinal variables measure an attribute (such as
but increasingly dangerous as raw row totals military rank) that subjects may have more or less
become smaller. (When reporting percentages, one of but that cannot be measured in equal incre-
should indicate total N at the end of each row or ments on a scale. (Results from interval variables,
independent variable category.) such as scores on a test, would have to first be
Table 2 shows that in this sample roughly twice grouped before they could ‘‘fit’’ into a bivariate
as many women as men preferred sandals and table and be analyzed with chi-square; this group-
boots, about 3 times more men than women pre- ing loses much of the incremental information of
ferred sneakers, and twice as many men as women the original scores, so interval data are usually
preferred leather shoes. One might also infer from analyzed using parametric tests such as ANOVAs
the ‘‘Other’’ category that female students in this and t tests. The relationship between two ordinal
sample had a broader range of footwear prefer- variables is usually best analyzed with a Spearman
ences than did male students. rank order correlation.)
Converting raw observed values or frequencies 2. The sample must be randomly drawn from
into percentages allows one to see patterns in the the population.

3. Data must be reported in raw frequencies— solution is to combine, or collapse, two cells
not, for example, in percentages. together. (Categories on a variable cannot be
excluded from a chi-square analysis; a researcher
4. Measured variables must be independent of
cannot arbitrarily exclude some subset of the data
each other. Any observation must fall into only
from analysis.) But a decision to collapse categories
one category or value on each variable, and no cat-
should be carefully motivated, preserving the integ-
egory can be inherently dependent on or influenced
rity of the data as it was originally collected.
by another.
5. Values and categories on independent and
dependent variables must be mutually exclusive Computing the Chi-Square Value
and exhaustive. In the footwear data, each subject
The process by which a chi-square value is com-
is counted only once, as either male or female and
puted has four steps:
as preferring sandals, sneakers, leather shoes,
boots, or other kinds of footwear. For some vari- 1. Setting the p value. This sets the threshold of
ables, no ‘‘other’’ category may be needed, but tolerance for error. That is, what odds is the
often ‘‘other’’ ensures that the variable has been researcher willing to accept that apparent patterns
exhaustively categorized. (Some kinds of analysis in the data may be due to randomness or sampling
may require an ‘‘uncodable’’ category.) In any case, error rather than some systematic relationship
the results for the whole sample must be included. between the measured variables? The answer
depends largely on the research question and the
6. Expected (and observed) frequencies cannot
consequences of being wrong. If people’s lives
be too small. Chi-square is based on the expecta-
depend on the interpretation of the results, the
tion that within any category, sample frequencies
researcher might want to take only one chance in
are normally distributed about the expected popu-
100,000 (or 1,000,000) of making an erroneous
lation value. Since frequencies of occurrence can-
claim. But if the stakes are smaller, he or she might
not be negative, the distribution cannot be normal
accept a greater risk—1 in 100 or 1 in 20. To mini-
when expected population values are close to zero
mize any temptation for post hoc compromise of
(because the sample frequencies cannot be much
scientific standards, researchers should explicitly
below the expected frequency while they can be
motivate their threshold before they perform any
much above it). When expected frequencies are
test of statistical significance. For the footwear
large, there is no problem with the assumption of
study, we will set a probability of error threshold
normal distribution, but the smaller the expected fre-
of 1 in 20, or p < :05:
quencies, the less valid the results of the chi-square
test. In addition, because some of the mathematical 2. Totaling all rows and columns. See Table 3.
formulas in chi-square use division, no cell in a table
3. Deriving the expected frequency of each cell.
can have an observed raw frequency of zero.
Chi-square operates by comparing the observed
The following minimums should be obeyed:
frequencies in each cell in the table to the frequen-
For a 1 × 2 or 2 × 2 table, expected frequencies in
cies one would expect if there were no relationship
each cell should be at least 5. at all between the two variables in the popula-
tions from which the sample is drawn (the null
For larger tables ð2 × 4 or 3 × 3 or larger), if all
expected frequencies but one are at least 5 and if
Table 3 Male and Female Undergraduate Footwear
the one small cell is at least 1, chi-square is
Preferences at University X: Observed
appropriate. In general, the greater the degrees of
Frequencies With Row and Column Totals
freedom (i.e., the more values or categories on the
independent and dependent variables), the more Leather
lenient the minimum expected frequencies Group Sandals Sneakers Shoes Boots Other Total
threshold. Male 6 17 13 9 5 50
Female 13 5 7 16 9 50
Sometimes, when a researcher finds low expec-
Total 19 22 20 25 14 100
ted frequencies in one or more cells, a possible
hypothesis). The null hypothesis, the "all other things being equal" scenario, is derived from the observed frequencies as follows: The expected frequency in each cell is the product of that cell's row total multiplied by that cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the "Males who prefer Sandals" cell, multiply the top row total (50) by the first column total (19) and divide that product by the sum total (100): (50 × 19)/100 = 9.5. The logic of this is that we are deriving the expected frequency of each cell from the union of the total frequencies of the relevant values on each variable (in this case, Male and Sandals), as a proportion of all observed frequencies in the sample. This calculation produces the expected frequency of each cell, as shown in Table 4.

Table 4   Male and Female Undergraduate Footwear Preferences at University X: Observed and Expected Frequencies

Group               Sandals   Sneakers   Leather Shoes   Boots   Other   Total
Male: observed         6         17           13           9       5      50
      expected         9.5       11           10          12.5     7
Female: observed      13          5            7          16       9      50
      expected         9.5       11           10          12.5     7
Total                 19         22           20          25      14     100

Now a comparison of the observed results with the results one would expect if the null hypothesis were true is possible. (Because the sample includes the same number of male and female subjects, the male and female expected scores are the same. This will not be the case with unbalanced samples.) This table can be informally analyzed by comparing observed and expected frequencies in each cell (e.g., males prefer sandals less than expected), across values on the independent variable (e.g., males prefer sneakers more than expected, females less than expected), or across values on the dependent variable (e.g., females prefer sandals and boots more than expected, but sneakers and shoes less than expected). But some way to measure how different the observed results are from the null hypothesis is needed.

4. Measuring the size of the difference between the observed and expected frequencies in each cell. To do this, calculate the difference between the observed and expected frequency in each cell, square that difference, and then divide the squared difference by the expected frequency. The formula can be expressed as (O − E)²/E.

Squaring the difference ensures a positive number, so that one ends up with an absolute value of differences. (If one did not work with absolute values, the positive and negative differences across the entire table would always add up to 0.) Dividing the squared difference by the expected frequency scales the difference relative to the size of that expected frequency, so that the measures of observed versus expected difference are comparable across all cells.

So, for example, the difference between observed and expected frequencies for the Male–Sandals preference is calculated as follows:

1. Observed (6) − Expected (9.5) = Difference (−3.5)
2. Difference (−3.5) squared = 12.25
3. Difference squared (12.25)/Expected (9.5) = 1.289

The sum of the products of this calculation for every cell is the total chi-square value for the table: 14.026.

Interpreting the Chi-Square Value

Now the researcher needs some criterion against which to measure the table's chi-square value in order to tell whether it is significant (relative to the p value that has been motivated). The researcher needs to know the probability of getting a chi-square value of a given minimum size even if the variables are not related at all in the sample. That is, the researcher needs to know how much larger than zero (the chi-square value of the null hypothesis) the table's chi-square value must be before the null hypothesis can be rejected with confidence. The probability depends in part on the complexity and sensitivity of the variables, which are reflected in the degrees of freedom of the table from which the chi-square value is derived.

A table's degrees of freedom (df) can be expressed by this formula: df = (r − 1)(c − 1).
148 Chi-Square Test

That is, a table’s degrees of freedom equals the between the variables in the data can be derived
number of rows in the table minus 1, multiplied from a table’s chi-square value.
by the number of columns in the table minus 1. For tables larger than 2 × 2 (such as Table 1),
(For 1 × 2 tables, df ¼ k  1; where k ¼ number a measure called Cramer’s V is derived by the fol-
of values or categories on the variable.) Different lowing formula (where N ¼ the total number of
chi-square thresholds are set for relatively gross observations, and k ¼ the smaller of the number of
comparisons ð1 × 2 or 2 × 2Þ versus finer compari- rows or columns):
sons. (For Table 1, df ¼ ð2  1Þð5  1Þ ¼ 4:)
In a statistics book, the sampling distribution of Cramer’s V ¼ the square root of
chi-square (also known as critical values of chi- ðchi-square divided by ðN times ðk minus 1ÞÞÞ
square) is typically listed in an appendix. The
researcher can read down the column representing So, for (2 × 5) Table 1, Cramer’s V can be com-
his or her previously chosen probability of error puted as follows:
threshold (e.g., p < .05) and across the row repre-
senting the degrees of freedom in his or her table. 1. N(k  1) ¼ 100 (2  1) ¼ 100
If the researcher’s chi-square value is larger than 2. Chi-square/100 ¼ 14.026/100 ¼ 0.14
the critical value in that cell, his or her data repre-
3. Square root of 0.14 ¼ 0.37
sent a statistically significant relationship between
the variables in the table. (In statistics software
The product is interpreted as a Pearson correla-
programs, all the computations are done for the
tion coefficient (r). (For 2 × 2 tables, a measure
researcher, and he or she is given the exact p value
called phi is derived by dividing the table’s
of the results.)
chi-square value by N (the total number of obser-
Table 1’s chi-square value of 14.026, with
vations) and then taking the square root of the
df ¼ 4, is greater than the related critical value of
product. Phi is also interpreted as a Pearson r.)
9.49 (at p ¼ .05), so the null hypothesis can be
Also, r2 is a measure called shared variance.
rejected, and the claim that the male and female
Shared variance is the portion of the total repre-
undergraduates at University X in this sample dif-
sentation of the variables measured in the sample
fer in their (self-reported) footwear preferences
data that is accounted for by the relationship mea-
and that there is some relationship between gender
sured by the chi-square. For Table 1, r2 ¼ .137, so
and footwear preferences (in this sample) can be
approximately 14% of the total footwear prefer-
affirmed.
ence story is accounted for by gender.
A statistically significant chi-square value indi-
A measure of association like Cramer’s V is
cates the degree of confidence a researcher may hold
an important benchmark of just ‘‘how much’’ of
that the relationship between variables in the sample
the phenomenon under investigation has been
is systematic and not attributable to random error.
accounted for. For example, Table 1’s Cramer’s V of
It does not help the researcher to interpret the
0.37 (r2 ¼ .137) means that there are one or more
nature or explanation of that relationship; that must
variables remaining that, cumulatively, account for
be done by other means (including bivariate tabular
at least 86% of footwear preferences. This measure,
analysis and qualitative analysis of the data).
of course, does not begin to address the nature of
the relation(s) among these variables, which is a cru-
cial part of any adequate explanation or theory.
Measures of Association
Jeff Connor-Linton

See also Correlation; Critical Value; Degrees of Freedom; Dependent Variable; Expected Value; Frequency Table; Hypothesis; Independent Variable; Nominal Scale; Nonparametric Statistics; Ordinal Scale; p Value; R2; Random Error; Random Sampling; Sampling Error; Variable; Yates's Correction.
Further Readings

Berman, E. M. (2006). Essential statistics for public managers and policy analysts (2nd ed.). Washington, DC: CQ Press.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguistics. Rowley, MA: Newbury House.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Sokal, R. R., & Rohlf, F. J. (1995). Biometry: The principles and practice of statistics in biological research (3rd ed.). New York: W. H. Freeman.
Welkowitz, J., Ewen, R. B., & Cohen, J. (1982). Introductory statistics for the behavioral sciences (3rd ed.). New York: Academic Press.

CLASSICAL TEST THEORY

Measurement is the area of quantitative social science that is concerned with ascribing numbers to individuals in a meaningful way. Measurement is distinct from statistics, though measurement theories are grounded in applications of statistics. Within measurement there are several theories that allow us to talk about the quality of measurements taken. Classical test theory (CTT) can arguably be described as the first formalized theory of measurement and is still the most commonly used method of describing the characteristics of assessments.

With test theories, the term test or assessment is applied widely. Surveys, achievement tests, intelligence tests, psychological assessments, writing samples graded with rubrics, and innumerable other situations in which numbers are assigned to individuals can all be considered tests. The terms test, assessment, instrument, and measure are used interchangeably in this discussion of CTT. After a brief discussion of the early history of CTT, this entry provides a formal definition and discusses CTT's role in reliability and validity.

Early History

Most of the central concepts and techniques associated with CTT (though it was not called that at the time) were presented in papers by Charles Spearman in the early 1900s. One of the first texts to codify the emerging discipline of measurement and CTT was Harold Gulliksen's Theory of Mental Tests. Much of what Gulliksen presented in that text is used unchanged today. Other theories of measurement (generalizability theory, item response theory) have emerged that address some known weaknesses of CTT (e.g., the assumption of homoscedasticity of error along the test score distribution). However, the comparative simplicity of CTT and its continued utility in the development and description of assessments have resulted in CTT's continued use. Even when other test theories are used, CTT often remains an essential part of the development process.

Formal Definition

CTT relies on a small set of assumptions. The implications of these assumptions build into the useful CTT paradigm. The fundamental assumption of CTT is found in the equation

X = T + E,   (1)

where X represents an observed score, T represents true score, and E represents error of measurement.

The concept of the true score, T, is often misunderstood. A true score, as defined in CTT, does not have any direct connection to the construct that the test is intended to measure. Instead, the true score represents the number that is the expected value for an individual based on this specific test. Imagine that a test taker took a 100-item test on world history. This test taker would get a score, perhaps an 85. If the test was a multiple-choice test, probably some of those 85 points were obtained through guessing. If given the test again, the test taker might guess better (perhaps obtaining an 87) or worse (perhaps obtaining an 83). The causes of differences in observed scores are not limited to guessing. They can include anything that might affect performance: a test taker's state of being (e.g., being sick), a distraction in the testing environment (e.g., a humming air conditioner), or careless mistakes (e.g., misreading an essay prompt).

Note that the true score is theoretical. It can never be observed directly. Formally, the true score is assumed to be the expected value (i.e., average)
of X, the observed score, over an infinite number of independent administrations. Even if an examinee could be given the same test numerous times, the administrations would not be independent. There would be practice effects, or the test taker might learn more between administrations.

Once the true score is understood to be an average of observed scores, the rest of Equation 1 is straightforward. A scored performance, X, can be thought of as deviating from the true score, T, by some amount of error, E. There are additional assumptions related to CTT: T and E are uncorrelated (ρ_TE = 0), error on one test is uncorrelated with the error on another test (ρ_E1E2 = 0), and error on one test is uncorrelated with the true score on another test (ρ_E1T2 = 0). These assumptions will not be elaborated on here, but they are used in the derivation of concepts to be discussed.

Reliability

If E tends to be large in relation to T, then a test result is inconsistent. If E tends to be small in relation to T, then a test result is consistent. The major contribution of CTT is the formalization of this concept of consistency of test scores. In CTT, test score consistency is called reliability. Reliability provides a framework for thinking about and quantifying the consistency of a test. Even though reliability does not directly address constructs, it is still fundamental to measurement. The often used bathroom scale example makes the point. Does a bathroom scale accurately reflect weight? Before establishing the scale's accuracy, its consistency can be checked. If one were to step on the scale and it said 190 pounds, then step on it again and it said 160 pounds, the scale's lack of utility could be determined based strictly on its inconsistency.

Now think about a formal assessment. Imagine an adolescent was given a graduation exit exam and she failed, but she was given the same exam again a day later and she passed. She did not learn in one night everything she needed in order to graduate. The consequences would have been severe if she had been given only the first opportunity (when she failed). If this inconsistency occurred for most test takers, the assessment results would have been shown to have insufficient consistency (i.e., reliability) to be useful. Reliability coefficients allow for the quantification of consistency so that decisions about utility can be made.

T and E cannot be observed directly for individuals, but the relative contribution of these components to X is what defines reliability. Instead of direct observations of these quantities, group-level estimates of their variability are used to arrive at an estimate of reliability. Reliability is usually defined as

ρ_XX' = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E),   (2)

where X, T, and E are defined as before. From this equation one can see that reliability is the ratio of true score variance to observed score variance within a sample. Observed score variance is the only part of this equation that is observable. There are essentially three ways to estimate the unobserved portion of this equation. These three approaches correspond to three distinct types of reliability: stability reliability, alternate-form reliability, and internal consistency.

All three types of reliability are based on correlations that exploit the notion of parallel tests. Parallel tests are defined as tests having equal true scores (T = T') and equal error variance (σ²_E = σ²_E') and meeting all other assumptions of CTT. Based on these assumptions, it can be derived (though it will not be shown here) that the correlation between parallel tests provides an estimate of reliability (i.e., the ratio of true score variance to observed score variance). The correlation between these two scores is the observed reliability coefficient (ρ_XX'). The correlation can be interpreted as the proportion of variance in observed scores that is attributable to true scores; thus ρ_XX' = ρ²_XT. The type of reliability (stability reliability, alternate-form reliability, internal consistency) being estimated is determined by how the notion of a parallel test is established.
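The variance-ratio definition in Equation 2 and the parallel-test result just described can be illustrated with a small simulation; the sketch below is only a toy example, and every number in it (true score mean, true score and error standard deviations, sample size) is invented.

```python
# Toy simulation of the CTT model X = T + E (Equation 1) and of reliability
# as the ratio of true score variance to observed score variance (Equation 2).
import numpy as np

rng = np.random.default_rng(0)
n_examinees = 10_000

T = rng.normal(70, 8, n_examinees)    # true scores
E1 = rng.normal(0, 4, n_examinees)    # error on form 1
E2 = rng.normal(0, 4, n_examinees)    # error on a parallel form 2

X1, X2 = T + E1, T + E2               # observed scores on the two parallel forms

reliability_defined = T.var() / X1.var()            # roughly 8**2 / (8**2 + 4**2) = .80
reliability_estimated = np.corrcoef(X1, X2)[0, 1]   # correlation between parallel forms

print(round(reliability_defined, 3), round(reliability_estimated, 3))
```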
The most straightforward type of reliability is stability reliability. Stability reliability (often called test–retest reliability) is estimated by having a representative sample of the intended testing population take the same instrument twice. Because the same test is being used at both measurement opportunities, the notion of parallel tests is easy to support. The same test is used on both occasions, so true scores and error variance should be the same. Ideally, the two measurement opportunities
will be close enough that examinees have not changed (learned or developed on the relevant construct). You would not, for instance, want to base your estimates of stability reliability on pretest–intervention–posttest data. The tests should also not be given too close together in time; otherwise, practice or fatigue effects might influence results. The appropriate period to wait between testing will depend on the construct and purpose of the test and may be anywhere from minutes to weeks.

Alternate-form reliability requires that each member of a representative sample respond on two alternate assessments. These alternate forms should have been built to be purposefully parallel in content and scores produced. The tests should be administered as close together as is practical, while avoiding fatigue effects. The correlation among these forms represents the alternate-form reliability. Higher correlations provide more confidence that the tests can be used interchangeably (comparisons of means and standard deviations will also influence this decision).

Having two administrations of assessments is often impractical. Internal consistency reliability methods require only a single administration. There are two major approaches to estimating internal consistency: split half and coefficient alpha. Split-half estimates are easily understood but have mostly been supplanted by the use of coefficient alpha. For a split-half approach, a test is split into two halves. This split is often created by convenience (e.g., all odd items in one half, all even items in the other half). The split can also be made more methodically (e.g., balancing test content and item types). Once the splits are obtained, the scores from the two halves are correlated. Conceptually, a single test is being used to create an estimate of alternate-form reliability. For reasons not discussed in this entry, the correlation from a split-half method will underestimate reliability unless it is corrected.

The Spearman–Brown prophecy formula for predicting the correlation that would have been obtained if each half had been as long as the full-length test is given by

ρ_XX' = 2ρ_AB / (1 + ρ_AB),   (3)

where ρ_AB is the original correlation between the test halves and ρ_XX' is the corrected reliability. If, for example, two test halves were found to have a correlation of .60, the actual reliability would be

ρ_XX' = 2(.60) / (1 + .60) = 1.20 / 1.60 = .75.

As with all the estimates of reliability, the extent to which the two measures violate the assumption of parallel tests determines the accuracy of the result.
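Because Equation 3 is a one-line computation, it is easy to wrap in a small helper; the function name below is arbitrary, and the .60 value simply repeats the worked example above.

```python
# Spearman-Brown correction of a split-half correlation (Equation 3).
def spearman_brown(r_half):
    """Project the correlation between two test halves to full-length reliability."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.60), 2))   # 0.75, matching the example above
```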
One of the drawbacks of using the split-half approach is that it does not produce a unique result. Other splits of the test will produce (often strikingly) different estimates of the internal consistency of the test. This problem is ameliorated by the use of single internal consistency coefficients that provide information similar to split-half reliability estimates. Coefficient alpha (often called Cronbach's alpha) is the most general (and most commonly used) estimate of internal consistency. The Kuder–Richardson 20 (KR-20) and 21 (KR-21) are reported sometimes, but these coefficients are special cases of coefficient alpha and do not require separate discussion. The formula for coefficient alpha is

ρ_XX' ≥ α = (k / (k − 1)) × (1 − Σσ²_i / σ²_X),   (4)

where k is the number of items, σ²_i is the variance of item i, and σ²_X is the variance of test X. Coefficient alpha is equal to the average of all possible split-half estimates computed using Phillip Rulon's method (which uses information on the variances of total scores and the differences between the split-half scores). Coefficient alpha is considered a conservative estimate of reliability that can be interpreted as a lower bound to reliability. Alpha is one of the most commonly reported measures of reliability because it requires only a single test administration.

Although all three types of reliability are based on the notions of parallel tests, they are influenced by distinct types of error and, therefore, are not interchangeable. Stability reliability measures the extent to which occasions influence results. Errors associated with this type of reliability address how small changes in examinees or the testing environment impact results. Alternate-form reliability addresses how small differences in different versions of tests may impact results. Internal consistency addresses the way
in which heterogeneity of items limits the information provided by an assessment. In some sense, any one coefficient may be an overestimate of reliability (even alpha, which is considered to be conservative). Data collected on each type of reliability would yield three different coefficients that may be quite different from one another. Researchers should be aware that a reliability coefficient that included all three sources of error at once would likely be lower than a coefficient based on any one source of error. Generalizability theory is an extension of CTT that provides a framework for considering multiple sources of error at once. Generalizability coefficients will tend to be lower than CTT-based reliability coefficients but may more accurately reflect the amount of error in measurements. Barring the use of generalizability theory, a practitioner must decide what types of reliability are relevant to his or her research and make sure that there is evidence of that type of consistency (i.e., through consultation of appropriate reliability coefficients) in the test results.

Validity Investigations and Research With Classical Test Theory

CTT is mostly a framework for investigating reliability. Most treatments of CTT also include extensive descriptions of validity; however, similar techniques are used in the investigation of validity whether or not CTT is the test theory being employed. In shorthand, validity addresses the question of whether test results provide the intended information. As such, validity evidence is primary to any claim of utility. A measure can be perfectly reliable, but it is useless if the intended construct is not being measured. Returning to the bathroom scale example, if a scale always describes one adult of average build as 25 pounds and a second adult of average build as 32 pounds, the scale is reliable. However, the scale is not accurately reflecting the construct weight.

Many sources provide guidance about the importance of validity and frameworks for the types of data that constitute validity evidence. More complete treatments can be found in Samuel Messick's many writings on the topic. A full treatment is beyond the scope of this discussion. Readers unfamiliar with the current unified understanding of validity should consult a more complete reference; the topics to be addressed here might convey an overly simplified (and positivist) notion of validity.

Construct validity is the overarching principle in validity. It asks, Is the correct construct being measured? One of the principal ways that construct validity is established is by a demonstration that tests are associated with criteria or other tests that purport to measure the same (or related) constructs. The presence of strong associations (e.g., correlations) provides evidence of construct validity. Additionally, research agendas may be established that investigate whether test performance is affected by things (e.g., interventions) that should influence the underlying construct. Clearly, the collection of construct validity evidence is related to general research agendas. When one conducts basic or applied research using a quantitative instrument, two things are generally confounded in the results: (1) the validity of the instrument for the purpose being employed in the research and (2) the correctness of the research hypothesis. If a researcher fails to support his or her research hypothesis, there is often difficulty determining whether the result is due to insufficient validity of the assessment results or flaws in the research hypothesis.

One of CTT's largest contributions to our understanding of both validity investigations and research in general is the notion of reliability as an upper bound to a test's association with another measure. From Equation 1, observed score variance is understood to comprise a true score, T, and an error term, E. The error term is understood to be random error. Being that it is random, it cannot correlate with anything else. Therefore, the true score component is the only component that may have a nonzero correlation with another variable. The notion of reliability as the ratio of true score variance to observed score variance (Equation 2) makes the idea of reliability as an upper bound explicit. Reliability is the proportion of systematic variance in observed scores. So even if there were a perfect correlation between a test's true score and a perfectly measured criterion, the observed correlation could be no larger than the square root of the reliability of the test. If both quantities involved in the correlation have less than perfect
reliability, the observed correlation would be even smaller. This logic is often exploited in the reverse direction to provide a correction for attenuation when there is a lack of reliability for measures used. The correction for attenuation provides an upper bound estimate for the correlation among true scores. The equation is usually presented as

ρ_TXTY = ρ_XY / (√ρ_XX' √ρ_YY'),   (5)

where ρ_XY is the observed correlation between two measures and ρ_XX' and ρ_YY' are the respective reliabilities of the two measures. This correction for attenuation provides an estimate for the correlation among the underlying constructs and is therefore useful whenever establishing the magnitude of relationships among constructs is important. As such, it applies equally well to many research scenarios as it does to attempts to establish construct validity. For instance, it might impact interpretation if a researcher realized that an observed correlation of .25 between two instruments which have internal consistencies of .60 and .65 represented a potential true score correlation of

ρ_TXTY = .25 / (√.60 √.65) = .40.

The corrected value should be considered an upper bound estimate for the correlation if all measurement error were purged from the instruments. Of course, such perfectly reliable measurement is usually impossible in the social sciences, so the corrected value will generally be considered theoretical.
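Applying Equation 5 is again a short computation; the sketch below simply re-creates the .25/.60/.65 example from the preceding paragraph.

```python
# Correction for attenuation (Equation 5), using the example values above.
import math

r_xy = 0.25               # observed correlation between the two instruments
r_xx, r_yy = 0.60, 0.65   # their internal consistency reliabilities

r_true = r_xy / (math.sqrt(r_xx) * math.sqrt(r_yy))
print(round(r_true, 2))   # 0.4
```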

Group-Dependent Estimates

Estimates in CTT are highly group dependent. Any evidence of reliability and validity is useful only if that evidence was collected on a group similar to the current target group. A test that is quite reliable and valid with one population may not be with another. To make claims about the utility of instruments, testing materials (e.g., test manuals or technical reports) must demonstrate the appropriateness and representativeness of the samples used.

Final Thoughts

CTT provides a useful framework for evaluating the utility of instruments. Researchers using quantitative instruments need to be able to apply fundamental concepts of CTT to evaluate the tools they use in their research. Basic reliability analyses can be conducted with the use of most commercial statistical software (e.g., SPSS—an IBM company, formerly called PASW Statistics—and SAS).

John T. Willse

See also Reliability; Validity

Further Readings

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Harcourt Brace Jovanovich.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Traub, R. E. (1987). Classical test theory in historical perspective. Educational Measurement: Issues & Practice, 16(4), 8–14.

CLINICAL SIGNIFICANCE

In treatment outcome research, statistically significant changes in symptom severity or end-state functioning have traditionally been used to demonstrate treatment efficacy. In more recent studies, the effect size, or magnitude of change associated with the experimental intervention, has also been an important consideration in data analysis and interpretation. To truly understand the impact of a research intervention, it is essential for the investigator to adjust the lens, or
"zoom out," to also examine other signifiers of change. Clinical significance is one such marker and refers to the meaningfulness, or impact of an intervention, on clients and others in their social environment. An intervention that is clinically significant must demonstrate substantial or, at least, reasonable benefit to the client or others, such as family members, friends, or coworkers. Benefits gained, actual or perceived, must be weighed against the costs of the intervention. These costs may include financial, time, or family burden. Some researchers use the term practical significance as a synonym for clinical significance because both terms consider the import of a research finding in everyday life. However, there are differences in the usage of the two terms. Clinical significance is typically constrained to treatment outcome or prevention studies whereas practical significance is used broadly across many types of psychological research, including cognitive neuroscience, developmental psychology, environmental psychology, and social psychology. This entry discusses the difference between statistical and clinical significance and describes methods for measuring clinical significance.

Statistical Significance Versus Clinical Significance

Statistically significant findings do not always correspond to the client's phenomenological experience or overall evaluation of beneficial impact. First, a research study may have a small effect size yet reveal statistically significant findings due to high power. This typically occurs when a large sample size has been used. Nevertheless, the clinical significance of the research findings may be trivial from the research participants' perspective. Second, in other situations, a moderate effect size may yield statistically significant results, yet the pragmatic benefit to the client in his or her everyday life is questionable. For example, children diagnosed with attention-deficit/hyperactivity disorder may participate in an intervention study designed to increase concentration and reduce disruptive behavior. The investigators conclude that the active intervention was beneficial on the basis of significant improvement on Conners Rating Scale scores as well as significantly improved performance on a computerized test measuring sustained attention. The data interpretation is correct from a statistical point of view. However, the majority of parents view the intervention as inconsequential because their children continue to evidence a behavioral problem that disrupts home life. Moreover, most of the treated children see the treatment as "a waste of time" because they are still being teased or ostracized by peers. Despite significant sustained attention performance improvements, classroom teachers also rate the experimental intervention as ineffective because they did not observe meaningful changes in academic performance.

A third scenario is the case of null research findings in which there is also an inconsistency between statistical interpretation and the client or family perspective. For instance, an experimental treatment outcome study is conducted with adult trauma survivors compared with treatment as usual. Overall, the treatment-as-usual group performed superiorly to the new intervention. In fact, on the majority of posttraumatic stress disorder (PTSD) measures, the experimental group evidenced no statistical change from pre- to posttest. Given the additional costs of the experimental intervention, the investigators may decide that it is not worth further investigation. However, qualitative interviews are conducted with the participants. The investigators are surprised to learn that most participants receiving the intervention are highly satisfied although they continue to meet PTSD diagnostic criteria. The interviews demonstrate clinical significance among the participants, who perceive a noticeable reduction in the intensity of daily dissociative symptoms. These participants see the experimental intervention as quite beneficial in terms of facilitating tasks of daily living and improving their quality of life.

When planning a research study, the investigator should consider who will evaluate clinical significance (client, family member, investigator, original treating therapist) and what factor(s) are important to the evaluator (changes in symptom severity, functioning, personal distress, emotional regulation, coping strategies, social support resources, quality of life, etc.). An investigator should also consider the cultural context and the
rater's cultural expertise in the area under examination. Otherwise, it may be challenging to account for unexpected results. For instance, a 12-week mindfulness-based cognitive intervention for depressed adults is initially interpreted as successful in improving mindfulness skills, on the basis of statistically significant improvements on two dependent measures: scores on a well-validated self-report mindfulness measure and attention focus ratings by the Western-trained research therapists. The investigators are puzzled that most participants do not perceive mindfulness skill training as beneficial; that is, the training has not translated into noticeable improvements in depression or daily life functioning. The investigators request a second evaluation by two mindfulness experts—a non-Western meditation practitioner versed in traditional mindfulness practices and a highly experienced yoga practitioner. The mindfulness experts independently evaluate the research participants and conclude that culturally relevant mindfulness performance markers (postural alignment, breath control, and attentional focus) are very weak among participants who received the mindfulness intervention.

Measures

Clinical significance may be measured several ways, including subjective evaluation, absolute change (did the client evidence a complete return to premorbid functioning or how much has an individual client changed across the course of treatment without comparison with a reference group), comparison method, or societal impact indices. In most studies, the comparison method is the most typically employed strategy for measuring clinical significance. It may be used for examining whether the client group returns to a normative level of symptoms or functioning at the conclusion of treatment. Alternatively, the comparison method may be used to determine whether the experimental group statistically differs from an impaired group at the conclusion of treatment even if the experimental group has not returned to premorbid functioning. For instance, a treatment outcome study is conducted with adolescents diagnosed with bulimia nervosa. After completing the intervention, the investigators may consider whether the level of body image dissatisfaction is comparable to that of a normal comparison group with no history of eating disorders, or the study may examine whether binge–purge end-state functioning statistically differs from a group of individuals currently meeting diagnostic criteria for bulimia nervosa.

The Reliable Change Index, developed by Neil Jacobson and colleagues, and equivalence testing are the most commonly used comparison method strategies. These comparative approaches have several limitations, however, and the reader is directed to the Further Readings for more information on the conceptual and methodological issues.
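As one hedged illustration of the comparison-method logic, the sketch below computes a Reliable Change Index in the spirit of the Jacobson and Truax (1991) article listed in the Further Readings; the reliability value, standard deviation, client scores, and the 1.96 cutoff are illustrative assumptions rather than values from any real study.

```python
# Sketch of a Jacobson-Truax-style Reliable Change Index (RCI); every number
# here (pretest SD, reliability, client scores, cutoff) is invented.
import math

sd_pre = 10.0    # pretest standard deviation of the outcome measure
r_xx = 0.85      # reliability of the measure

se_measurement = sd_pre * math.sqrt(1 - r_xx)       # standard error of measurement
se_difference = math.sqrt(2 * se_measurement ** 2)  # SE of a pre-post difference

pre, post = 42.0, 28.0                               # one client's scores
rci = (post - pre) / se_difference

# An |RCI| beyond roughly 1.96 suggests change larger than measurement error alone.
print(round(rci, 2), abs(rci) > 1.96)
```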
The tension between researchers and clinical practitioners will undoubtedly lessen as clinical significance is foregrounded in future treatment outcome studies. Quantitative and qualitative measurement of clinical significance will be invaluable in deepening our understanding of factors and processes that contribute to client transformation.

Carolyn Brodbeck

See also Effect Size, Measures of; Power; Significance, Statistical

Further Readings

Beutler, L. E., & Moleiro, C. (2001). Clinical versus reliable and significant change. Clinical Psychology: Science and Practice, 8, 441–445.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 335–352.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
Jensen, P. S. (2001). Clinical equivalence: A step, misstep, or just a misnomer? Clinical Psychology: Science & Practice, 8, 436–440.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting & Clinical Psychology, 67, 285–299.

CLINICAL TRIAL

A clinical trial is a prospective study that involves human subjects in which an intervention is to be
evaluated. In a clinical trial, subjects are followed from a well-defined starting point or baseline. The goal of a clinical trial is to determine whether a cause-and-effect relationship exists between the intervention and response. Examples of interventions used in clinical trials include drugs, surgery, medical devices, and education and subject management strategies. In each of these cases, clinical trials are conducted to evaluate both the beneficial and harmful effects of the new intervention on human subjects before it is made available to the population of interest. Special considerations for conducting clinical trials include subject safety and informed consent, subject compliance, and intervention strategies to avoid bias. This entry describes the different types of clinical trials and discusses ethics in relation to clinical trials.

Drug Development Trials

Clinical trials in drug development follow from laboratory experiments, usually involving in vitro experiments or animal studies. The traditional goal of a preclinical study is to obtain preliminary information on pharmacology and toxicology. Before a new drug may be used in human subjects, several regulatory bodies, such as the institutional review board (IRB), Food and Drug Administration (FDA), and data safety monitoring board, must formally approve the study. Clinical trials in drug development are conducted in a sequential fashion and categorized as Phase I, Phase II, Phase III, and Phase IV trial designs. The details of each phase of a clinical trial investigation are well defined within a document termed a clinical trial protocol. The FDA provides recommendations for the structure of Phase I through III trials in several disease areas.

Phase I trials consist primarily of healthy volunteer and participant studies. The primary objective of a Phase I trial is to determine the maximum tolerated dose. Other objectives include determining drug metabolism and bioavailability (how much drug reaches the circulatory system). Phase I studies generally are short-term studies that involve monitoring toxicities in small cohorts of participants treated at consistently higher dose levels of the new drug in order to estimate the maximum tolerated dose.

Phase II trials build on the Phase I results in terms of which dose level or levels warrant further investigation. Phase II trials are usually fairly small-scale trials. In cancer studies, Phase II trials traditionally involve a single dose with a surrogate end point for mortality, such as change in tumor volume. The primary comparison of interest is the effect of the new regimen versus established response rates. In other disease areas, such as cardiology, Phase II trials may involve multiple dose levels and randomization. The primary goals of a Phase II trial are to determine the optimal method of administration and examine the potential efficacy of a new regimen. Phase II trials generally have longer follow-up times than do Phase I trials. Within Phase II trials, participants are closely monitored for safety. In addition, pharmacokinetic, pharmacodynamic, or pharmacogenomic studies or a combination are often incorporated as part of the Phase II trial design. In many settings, two or more Phase II trials are undertaken prior to a Phase III trial.

Phase III trials are undertaken if the Phase II trial or trials demonstrate that the drug may be reasonably safe and potentially effective. The primary goal of a Phase III trial is to compare the effectiveness of the new treatment with that of either a placebo condition or standard of care. Phase III trials may involve long-term follow-up and many participants. The sample size is determined using precise statistical methods based on the end point of interest, the clinically relevant difference between treatment arms, and control of Type I and Type II error rates. The FDA may require two Phase III trials for approval of new drugs. The process from drug synthesis to the completion of a Phase III trial may take several years. Many investigations may never reach the Phase II or Phase III trial stage.
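As an illustration of the kind of sample-size arithmetic referred to above, the sketch below uses the common normal-approximation formula for comparing two means; the clinically relevant difference, standard deviation, and error rates are invented for the example and are not tied to any particular trial or regulatory guidance.

```python
# Minimal sketch: per-arm sample size for comparing two means, using the
# normal approximation n = 2 * ((z_alpha/2 + z_beta) * sigma / delta)**2.
from scipy.stats import norm

delta = 5.0    # assumed clinically relevant difference between treatment arms
sigma = 12.0   # assumed common standard deviation of the end point
alpha = 0.05   # two-sided Type I error rate
power = 0.90   # 1 - Type II error rate

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_arm = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"about {int(n_per_arm) + 1} participants per arm")   # roughly 122 here
```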
The gold-standard design in Phase III trials employs randomization and is double blind. That is, participants who are enrolled in the study agree to have their treatment selected by a random or pseudorandom process, and neither the evaluator nor the participants have knowledge of the true treatment assignment. Appropriate randomization procedures ensure that the assigned treatment is independent of any known or unknown prognostic
characteristic. Randomization provides two important advantages. First, the treatment groups tend to be comparable with regard to prognostic factors. Second, it provides the theoretical underpinning for a valid statistical test of the null hypothesis. Randomization and blinding are mechanisms used to minimize potential bias that may arise from multiple sources. Oftentimes blinding is impossible or infeasible due to either safety issues or more obvious reasons such as the clear side effects of the active regimen compared with a placebo control group. Randomization cannot eliminate statistical bias due to issues such as differential dropout rates, differential noncompliance rates, or any other systematic differences imposed during the conduct of a trial.
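One common pseudorandom assignment scheme is randomly permuted blocks, which keeps the arms balanced as enrollment proceeds; the block size and arm labels in the sketch below are arbitrary illustrative choices, not a prescription.

```python
# Minimal sketch of permuted-block randomization for a two-arm trial.
import random

def block_randomization(n_participants, block_size=4, arms=("A", "B"), seed=2010):
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = list(arms) * (block_size // len(arms))  # each arm equally often
        rng.shuffle(block)                              # random order within the block
        assignments.extend(block)
    return assignments[:n_participants]

print(block_randomization(8))   # e.g., ['B', 'A', 'A', 'B', ...]
```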
Phase IV clinical trials are oftentimes referred to as postmarketing studies and are generally surveillance studies conducted to evaluate long-term safety and efficacy within the new patient population.

Nonrandomized Clinical Trials

In nonrandomized observational studies, there is no attempt to intervene in the participants' selection between the new and the standard treatments. When the results of participants who received the new treatment are compared with the results of a group of previously treated individuals, the study is considered historically controlled. Historically controlled trials are susceptible to several sources of bias, including those due to shifts in medical practice over time, changes in response definitions, and differences in follow-up procedures or recording methods.

Some of these biases can be controlled by implementing a parallel treatment design. In this case, the new and the standard treatment are applied to each group of participants concurrently. This approach provides an opportunity to standardize the participant care and evaluation procedures used in each treatment group. The principal drawback of observational trials (historically controlled or parallel treatment) is that the apparent differences in outcome between the treatment groups may actually be due to imbalances in known and unknown prognostic subject characteristics. These differences in outcome can be erroneously attributed to the treatment, and thus these types of trials cannot establish a cause-and-effect relationship.

Screening Trials

Screening trials are conducted to determine whether preclinical signs of a disease should be monitored to increase the chance of detecting individuals with curable forms of the disease. In these trials there are actually two interventions. The first intervention is the program of monitoring normal individuals for curable forms of disease. The second intervention is the potentially curative procedure applied to the diseased, detected individuals. Both these components need to be effective for a successful screening trial.

Prevention Trials

Prevention trials evaluate interventions intended to reduce the risk of developing a disease. The intervention may consist of a vaccine, high-risk behavior modification, or dietary supplement.

Ethics in Clinical Trials

The U.S. National Research Act of 1974 established the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The purpose of this commission was to establish ethical guidelines for clinical research. The Act also mandated that all clinical trials funded entirely or in part by the federal government be reviewed by IRBs. The Commission outlined in detail how IRBs should function and provided ethical guidelines for conducting research involving human participants. The Commission's Belmont Report articulated three important ethical principles for clinical research: respect for persons, beneficence, and justice.

The principle of respect for persons acknowledges that participants involved in clinical trials must be properly informed and always permitted to exercise their right to self-determination. The informed consent process arises from this principle. The principle of beneficence acknowledges that there is often a potential for benefit as well as harm for individuals involved in clinical trials. This principle requires that there be a favorable balance toward benefit, and any potential for harm must be minimized and justified. The principle of justice requires that the burdens and
benefits of clinical research be distributed fairly. An injustice occurs when one group in society bears a disproportionate burden of research while another reaps a disproportionate share of its benefit.

Alan Hutson and Mark Brady

See also Adaptive Designs in Clinical Trials; Ethics in the Research Process; Group-Sequential Designs in Clinical Trials

Further Readings

Friedman, L. M., Furberg, C. D., & Demets, D. L. (1998). Fundamentals of clinical trials (3rd ed.). New York: Springer Science.
Machin, D., Day, S., & Green, S. (2006). Textbook of clinical trials (2nd ed.). Hoboken, NJ: Wiley.
Piantadosi, S. (2005). Clinical trials: A methodologic perspective (2nd ed.). Hoboken, NJ: Wiley.
Senn, S. (2008). Statistical issues in drug development (2nd ed.). Hoboken, NJ: Wiley.

CLUSTER SAMPLING

A variety of sampling strategies are available in cases when setting or context create restrictions. For example, stratified sampling is used when the population's characteristics such as ethnicity or gender are related to the outcome or dependent variables being studied. Simple random sampling, in contrast, is used when there is no regard for strata or defining characteristics of the population from which the sample is drawn. The assumption is that the differences in these characteristics are normally distributed across all potential participants.

Cluster sampling is the selection of units of natural groupings rather than individuals. For example, in marketing research, the question at hand might be how adolescents react to a particular brand of chewing gum. The researcher may access such a population through traditional channels such as the high school but may also visit places where these potential participants tend to spend time together, such as shopping malls and movie theaters. Rather than counting each one of the adolescents' responses to a survey as one data point, the researcher would count the entire group's average as the data point. The assumption is that the group data point is representative of the population of all adolescents. The fact that the collection of individuals in the unit serves as the data point, rather than each individual serving as a data point, differentiates this sampling technique from most others.
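The two-stage idea described in this entry can be sketched in a few lines; the clusters, cluster sizes, and scores below are all invented, and the point is only that cluster means, not individual responses, become the data points.

```python
# Toy sketch of two-stage cluster sampling with invented data.
import random

random.seed(42)

# Population: 50 clusters (e.g., schools or malls), each holding individual scores.
population = {c: [random.gauss(50 + c % 7, 10) for _ in range(200)] for c in range(50)}

selected_clusters = random.sample(list(population), k=8)   # stage 1: sample clusters

cluster_means = []
for c in selected_clusters:                                 # stage 2: sample within
    respondents = random.sample(population[c], k=30)
    cluster_means.append(sum(respondents) / len(respondents))

# Each cluster mean, rather than each individual, serves as one data point.
print(round(sum(cluster_means) / len(cluster_means), 1))
```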
An advantage of cluster sampling is that it is a great time saver and relatively efficient in that travel time and other expenses are saved. The primary disadvantage is that one can lose the heterogeneity that exists within groups by taking all in the group as a single unit; in other words, the strategy may introduce sampling error. Cluster sampling is also known as geographical sampling because areas such as neighborhoods become the unit of analysis.

An example of cluster sampling can be seen in a study by Michael Burton from the University of California and his colleagues, who used both stratified and cluster sampling to draw a sample from the United States Census Archives for California in 1880. These researchers emphasized that with little effort and expended resources, they obtained very useful knowledge about California in 1880 pertaining to marriage patterns, migration patterns, occupational status, and categories of ethnicity. Another example is a study by Lawrence T. Lam and L. Yang of duration of sleep and attention deficit/hyperactivity disorder among adolescents in China. The researchers used a variant of simple cluster sampling, a two-stage random cluster sampling design, to assess duration of sleep.

While cluster sampling may not be the first choice given all options, it can be a highly targeted sampling method when resources are limited and sampling error is not a significant concern.

Neil J. Salkind

See also Convenience Sampling; Experience Sampling Method; Nonprobability Sampling; Probability Sampling; Proportional Sampling; Quota Sampling; Random Sampling; Sampling Error; Stratified Sampling; Systematic Sampling

Further Readings

Burton, M., Della Croce, M., Masri, S. A., Bartholomew, … (2005). Sampling from the
United States Census Archives. Field Methods, 17(1), 102–118.
Ferguson, D. A. (2009). Name-based cluster sampling. Sociological Methods & Research, 37(4), 590–598.
Lam, L. T., & Yang, L. (2008). Duration of sleep and ADHD tendency among adolescents in China. Journal of Attention Disorders, 11, 437–444.
Pedulla, J. J., & Airasian, P. W. (1980). Sampling from samples: A comparison of strategies in longitudinal research. Educational & Psychological Measurement, 40, 807–813.

COEFFICIENT ALPHA

Coefficient alpha, or Cronbach's alpha, is one way to quantify reliability and represents the proportion of observed score variance that is true score variance. Reliability is a property of a test that is derived from true scores, observed scores, and measurement error. Scores or values that are obtained from the measurement of some attribute or characteristic of a person (e.g., level of intelligence, preference for types of foods, spelling achievement, body length) are referred to as observed scores. In contrast, true scores are the scores one would obtain if these characteristics were measured without any random error. For example, every time you go to the doctor, the nurse measures your height. That is the observed height or observed "score" by that particular nurse. You return for another visit 6 months later, and another nurse measures your height. Again, that is an observed score. If you are an adult, it is expected that your true height has not changed in the 6 months since you last went to the doctor, but the two values might be different by .5 inch. When measuring the quantity of anything, whether it is a physical characteristic such as height or a psychological characteristic such as food preferences, spelling achievement, or level of intelligence, it is expected that the measurement will always be unreliable to some extent. That is, there is no perfectly reliable measure. Therefore, the observed score is the true score plus some amount of error, or an error score.

Measurement error can come from many different sources. For example, one nurse may have taken a more careful measurement of your height than the other nurse, or you may have stood up straighter the first time you were measured. Measurement error for psychological attributes such as preferences, values, attitudes, achievement, and intelligence can also influence observed scores. For example, on the day of a spelling test, a child could have a cold that may negatively influence how well she would perform on the test. She may get a 70% on a test when she actually knew 80% of the material. That is, her observed score may be lower than her true score in spelling achievement. Thus, temporary factors such as physical health, emotional state of mind, guessing, outside distractions, misreading answers, or misrecording answers would artificially inflate or deflate the observed scores for a characteristic. Characteristics of the test or the test administration can also create measurement error.

Ideally, test users would like to interpret an individual's observed scores on a measure to reflect the person's true characteristic, whether it is physical (e.g., blood pressure, weight) or psychological (e.g., knowledge of world history, level of self-esteem). In order to evaluate the reliability of scores for any measure, one must estimate the extent to which individual differences are a function of the real or true score differences among respondents versus the extent to which they are a function of measurement error. A test that is considered reliable minimizes the measurement error so that error is not highly correlated with true score. That is, the relationship between the true score and observed score should be strong.

Given the assumptions of classical test theory, observed or empirical test scores can be used to estimate measurement error. There are several different ways to calculate empirical estimates of reliability. These include test–retest reliability, alternate- or parallel-forms reliability, and internal consistency reliability. To calculate test–retest reliability, respondents must take the same test or measurement twice (e.g., the same nurse measuring your height at two different points in time). Alternate-forms reliability requires the construction of two tests that measure the same set of true scores and have the same amount of error variance (this is theoretically possible, but difficult in a practical sense). Thus, the respondent completes both forms of the test in order to determine reliability of the measures. Internal consistency reliability is a practical alternative to
the test–retest and parallel-forms procedures because the respondents have to complete only one test at any one time. One form of internal consistency reliability is split-half reliability, in which the items for a measure are divided into two parts or subtests (e.g., odd- and even-numbered items), composite scores are computed for each subtest, and the two composite scores are correlated to provide an estimate of total test reliability. (This value is then adjusted by means of the Spearman–Brown prophecy formula.) Split-half reliability is not used very often, because it is difficult for the two halves of the test to meet the criteria of being "parallel." That is, how the test is divided is likely to lead to substantially different estimates of reliability.

The most widely used method of estimating reliability is coefficient alpha (α), which is an estimate of internal consistency reliability. Lee Cronbach's often-cited article entitled "Coefficient Alpha and the Internal Structure of Tests" was published in 1951. This coefficient proved very useful for several reasons. First, only one test administration was required rather than more than one, as in test–retest or parallel-forms estimates of reliability. Second, this formula could be applied to dichotomously scored items or polytomous items. Finally, it was easy to calculate, at a time before most people had access to computers, from the statistics learned in a basic statistics course.

Coefficient alpha is also known as the "raw" coefficient alpha. This method and other methods of determining internal consistency reliability (e.g., the generalized Spearman–Brown formula or the standardized alpha estimate) have at least two advantages over the lesser-used split-half method. First, they use more information about the test than the split-half method does. Imagine if a split-half reliability was computed, and then we randomly divided the items from the same sample into another set of split halves and recomputed, and kept doing this with all possible combinations of split-half estimates of reliability. Cronbach's alpha is mathematically equivalent to the average of all possible split-half estimates, although it is not computed that way. Second, methods of calculating internal consistency estimates require fewer assumptions about the statistical properties of the individual items than does the split-half method of determining internal consistency reliability. Thus, internal consistency reliability is an estimate of how well the sum score on the items captures the true score on the entire domain from which the items are derived.

Internal consistency, such as that measured by Cronbach's alpha, is a measure of the homogeneity of the items. When the various items of an instrument are measuring the same construct (e.g., depression, knowledge of subtraction), then scores on the items will tend to covary. That is, people will tend to score the same way across many items. The items on a test that has adequate or better internal consistency will be highly intercorrelated. Internal consistency reliability is the most appropriate type of reliability for assessing dynamic traits or traits that change over time, such as test anxiety or elated mood.

Calculating Coefficient Alpha

The split-half method of determining internal consistency is based on the assumption that the two halves represent parallel subtests and that the correlation between the two halves produces the reliability estimate. For the "item-level" approach, such as the coefficient alpha, the logic of the split-half approach is taken further in that each item is viewed as a subtest. Thus, the association between items can be used to represent the reliability of the entire test. The item-level approach is a two-step process. In the first step, item-level statistics are calculated (item variances, interitem covariances, or interitem correlations). In the second step, the item-level information is entered into specialized equations to estimate the reliability of the complete test.

Below is the specialized formula for the calculation of coefficient alpha. Note that the first step is to determine the variance of scores on the complete test.

α = (n / (n − 1)) × ((SD²_X − Σ SD²_i) / SD²_X),

where n equals the number of components (items or subtests), SD²_X is the variance of the observed total test scores, and SD²_i is the variance of component i.
An alternate way to calculate coefficient alpha is

α = (N × c̄) / (v̄ + (N − 1) × c̄),

where N equals the number of components, v̄ is the average variance of the components, and c̄ is the average of all covariances between pairs of components. (When the components have been standardized, c̄ is simply the average of all Pearson correlation coefficients between components.)

The methods of calculating coefficient alpha as just described use raw scores (i.e., no transformation has been made to the item scores). There is also the standardized coefficient alpha, described in some statistical packages as "Cronbach's Alpha Based on Standardized Items" (SPSS [an IBM company, formerly called PASW Statistics] Reliability Analysis procedure) or "Cronbach Coefficient Alpha for Standardized Variables" (SAS). This standardized alpha estimate of reliability provides an estimate of reliability for an instrument in which scores on all items have been standardized to have equal means and standard deviations. The use of the raw score formula or the standardized score formula often produces roughly similar results.
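Both the raw-score formula and the average-covariance form can be checked directly on a small data set; the 6 × 4 score matrix below is invented purely for illustration, and the two computations give the same value.

```python
# Minimal sketch: raw coefficient alpha computed two equivalent ways from an
# invented 6-person x 4-item score matrix.
import numpy as np

scores = np.array([[4, 5, 4, 3],
                   [2, 2, 3, 2],
                   [5, 4, 5, 5],
                   [3, 3, 2, 3],
                   [4, 4, 4, 5],
                   [1, 2, 1, 2]], dtype=float)

n = scores.shape[1]
total = scores.sum(axis=1)

# Raw alpha from item variances and total-score variance:
alpha_raw = (n / (n - 1)) * (1 - scores.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# Alternate form from the average item variance and average covariance:
cov = np.cov(scores, rowvar=False)
v_bar = np.diag(cov).mean()
c_bar = (cov.sum() - np.trace(cov)) / (n * (n - 1))
alpha_alt = (n * c_bar) / (v_bar + (n - 1) * c_bar)

print(round(alpha_raw, 3), round(alpha_alt, 3))   # the two values agree
```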
Optimizing Coefficient Alpha

There are four (at least) basic ways to influence the magnitude of Cronbach's coefficient alpha. The first has to do with the characteristics of the sample (e.g., homogeneity vs. heterogeneity). Second and third, respectively, are the characteristics of the sample (e.g., size) and the number of items in the instrument. The final basic way occurs during the construction of the instrument.

Reliability is sample specific. A homogeneous sample with reduced true score variability may reduce the alpha coefficient. Therefore, if you administered an instrument measuring depression to only those patients recently hospitalized for depression, the reliability may be somewhat low because all participants in the sample have already been diagnosed with severe depression, and so there is not likely to be much variability in the scores on the measure. However, if one were to administer the same instrument to a general population, the larger variance in depression scores would be likely to produce higher reliability estimates. Therefore, to optimize Cronbach's coefficient alpha, it would be important to use a heterogeneous sample, which would have maximum true score variability. The sample size is also likely to influence the magnitude of coefficient alpha. This is because a larger size is more likely to produce a larger variance in the true scores. Large numbers of subjects (typically in excess of 200) are needed to obtain generalizable reliability estimates.

The easiest way to increase coefficient alpha is by increasing the number of items. Therefore, if, for example, a researcher is developing a new measure of depression, he or she may want to begin with a large number of items to assess various aspects of depression (e.g., depressed mood, feeling fatigued, loss of interest in favorite activities). That is, the total variance becomes larger, relative to the sum of the variances of the items, as the number of items is increased. It can also be shown that when the interitem correlations are about the same, alpha approaches one as the number of items approaches infinity. However, it is also true that reliability is expected to be high even when the number of items is relatively small if the correlations among them are high. For example, a measure with 3 items whose average intercorrelation is .50 is expected to have a Cronbach's alpha coefficient of .75. This same alpha coefficient of .75 can be calculated from a measure composed of 9 items with an average intercorrelation among the 9 items of .25 and of 27 items when the average intercorrelation among them is .10.
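Those three cases can be verified with the standardized (average-intercorrelation) form of alpha; this is plain arithmetic on the values just stated.

```python
# Check of the three cases above: alpha = N * r_bar / (1 + (N - 1) * r_bar).
for n_items, r_bar in [(3, 0.50), (9, 0.25), (27, 0.10)]:
    alpha = n_items * r_bar / (1 + (n_items - 1) * r_bar)
    print(n_items, round(alpha, 2))   # each case gives 0.75
```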
Selecting "good" items during the construction of an instrument is another way to optimize the alpha coefficient. That is, scale developers typically want items that correlate highly with each other. Most statistical packages provide an item-total correlation as well as a calculation of the internal consistency reliability if any single item were removed. So one can choose to remove any items that reduce the internal consistency reliability coefficient.

Interpreting Cronbach's Alpha Coefficient

A reliable test minimizes random measurement error so that error is not highly correlated with

the true scores. The relationship between the two or more ‘‘group’’ or common factors under-
true score and the observed scores should be lie the relations among the items. That is, an
strong. A reliability coefficient is the proportion instrument with distinct sets of items or factors
of the observed score variance that is true score can still have an average intercorrelation among
variance. Thus, a coefficient alpha of .70 for the items that is relatively large and would then
a test means that 30% of the variance in scores result in a high alpha coefficient. Also, as dis-
is random and not meaningful. Rules of thumb cussed previously, even if the average intercorre-
exist for interpreting the size of coefficient lation among items is relatively small, the alpha
alphas. Typically, a ‘‘high’’ reliability coefficient can be high if the number of items is relatively
is considered to be .90 or above, ‘‘very good’’ is large. Therefore, a high internal consistency esti-
.80 to .89, and ‘‘good’’ or ‘‘adequate’’ is .70 to mate cannot serve as evidence of the homogene-
.79. Cronbach’s alpha is a lower-bound estimate. ity of a measure. However, a low internal
That is, the actual reliability may be slightly consistency reliability coefficient does mean that
higher. It is also considered to be the most accu- the measure is not homogeneous because the
rate type of reliability estimate within the classi- items do not correlate well together.
cal theory approach, along with the Kuder– It is important to keep several points in mind
Richardson 20, which is used only for dichoto- about reliability when designing or choosing an
mous variables. instrument. First, although reliability reflects
The interpretation of coefficient alpha and variance due to true scores, it does not indicate
other types of reliability depends to some extent what the true scores are measuring. If one uses
on just what is being measured. When tests are a measure with a high internal consistency (.90),
used to make important decisions about people, it may be that the instrument is measuring some-
it would be essential to have high reliability thing different from what is postulated. For
(e.g., .90 or above). For example, individualized example, a personality instrument may be asses-
intelligence tests have high internal consistency. sing social desirability and not the stated con-
Often an intelligence test is used to make impor- struct. Often the name of the instrument
tant final decisions. In contrast, lower reliability indicates the construct being tapped (e.g., the
(e.g., .60 to .80) may be acceptable for looking XYZ Scale of Altruism), but adequate reliability
at group differences in such personality charac- does not mean that the measure is assessing what
teristics as the level of extroversion. it purports to measure (in this example, altru-
It should be noted that the internal consis- ism). That is a validity argument.
tency approach applied through coefficient alpha
assumes that the items, or subparts, of an instru- Karen D. Multon and Jill S. M. Coleman
ment measure the same construct. Broadly
See also Classical Test Theory; ‘‘Coefficient Alpha and
speaking, that means the items are homoge-
the Internal Structure of Tests’’; Correlation;
neous. However, there is no general agreement
Instrumentation; Internal Consistency Reliability;
about what the term homogeneity means and
Pearson Product-Moment Correlation Coefficient;
how it might be measured. Some authors inter-
Reliability; Spearman–Brown Prophecy Formula;
pret it to mean unidimensionality, or having only
‘‘Validity’’
one factor. However, Cronbach did not limit
alpha to an instrument with only one factor. In
fact, he said in his 1951 article, ‘‘Alpha estimates
the proportion of the test variance due to all Further Readings
common factors among items. That is, it reports
Cortina, J. M. (1993). What is coefficient alpha? An
how much the test score depends upon general
examination of theory and application. Journal of
and group rather than item specific factors’’ Applied Psychology, 78, 98–104.
(p. 320). Cronbach’s ‘‘general’’ factor is the first Cronbach, L. J. (1951). Coefficient alpha and the internal
or most important factor, and alpha can be high structure of tests. Psychometrika, 16(3), 297–334.
even if there is no general factor underlying the Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric
relations among the items. This will happen if theory (3rd ed.). New York: McGraw-Hill.
‘‘Coefficient Alpha and the Internal Structure of Tests’’ 163

Osburn, H. G. (2000). Coefficient alpha and related test scores. As demonstrated by Cronbach, alpha is
internal consistency reliability coefficients. the mean of all possible split-half coefficients for
Psychological Methods, 5, 343–355. a test. Alpha is generally applicable for studying
measurement consistency whenever data include
multiple observations of individuals (e.g., item
‘‘COEFFICIENT ALPHA AND THE scores, ratings from multiple judges, stability of
performance over multiple trials). Cronbach
INTERNAL STRUCTURE OF TESTS’’ showed that the well-known Kuder–Richardson
formula 20 (KR-20), which preceded alpha, was
Lee Cronbach’s 1951 Psychometrika article ‘‘Coef- a special case of alpha when items are scored
ficient Alpha and the Internal Structure of Tests’’ dichotomously.
established coefficient alpha as the preeminent esti- One sort of internal consistency reliability coef-
mate of internal consistency reliability. Cronbach ficient, the coefficient of precision, estimates the
demonstrated that coefficient alpha is the mean of correlation between a test and a hypothetical repli-
all split-half reliability coefficients and discussed cated administration of the same test when no
the manner in which coefficient alpha should be changes in the examinees have occurred. In con-
interpreted. Specifically, alpha estimates the corre- trast, Cronbach explained that alpha, which esti-
lation between two randomly parallel tests admin- mates the coefficient of equivalence, reflects the
istered at the same time and drawn from correlation between two different k-item tests ran-
a universe of items like those in the original test. domly drawn (without replacement) from a uni-
Further, Cronbach showed that alpha does not verse of items like those in the test and
require the assumption that items be unidimen- administered simultaneously. Since the correlation
sional. In his reflections 50 years later, Cronbach of a test with itself would be higher than the corre-
described how coefficient alpha fits within general- lation between different tests drawn randomly
izability theory, which may be employed to obtain from a pool, alpha provides a lower bound for the
more informative explanations of test score coefficient of precision. Note that alpha (and other
variance. internal consistency reliability coefficients) pro-
Concerns about the accuracy of test scores are vides no information about variation in test scores
commonly addressed by computing reliability that could occur if repeated testings were sepa-
coefficients. An internal consistency reliability rated in time. Thus, some have argued that such
coefficient, which may be obtained from a single coefficients overstate reliability.
test administration, estimates the consistency of Cronbach dismissed the notion that alpha
scores on repeated test administrations taking requires the assumption of item unidimensionality
place at the same time (i.e., no changes in exami- (i.e., all items measure the same aspect of individ-
nees from one test to the next). Split-half ual differences). Instead, alpha provides an esti-
reliability coefficients, which estimate internal mate (lower bound) of the proportion of variance
consistency reliability, were established as a in test scores attributable to all common factors
standard of practice for much of the early 20th accounting for item responses. Thus, alpha can
century, but such coefficients are not unique reasonably be applied to tests typically adminis-
because they depend on particular splits of items tered in educational settings and that comprise
into half tests. Cronbach presented coefficient items that call on several skills or aspects of under-
alpha as an alternative method for estimating standing in different combinations across items.
internal consistency reliability. Alpha is com- Coefficient alpha, then, climaxed 50 years of work
puted as follows: on correlational conceptions of reliability begun
 P 2 by Charles Spearman.
k s In a 2004 article published posthumously, ‘‘My
1  i2 i ;
k1 st Current Thoughts on Coefficient Alpha and Suc-
cessor Procedures,’’ Cronbach expressed doubt
where k is the number of items, s2i is the variance that coefficient alpha was the best way of judging
of scores on item i, and s2t is the variance of total reliability. It covered only a small part of the range
164 Coefficient of Concordance

of measurement uses, and consequently it should coefficient alpha article had been cited in more
be viewed within a much larger system of reliabil- than 5,000 publications.
ity analysis, generalizability theory. Moreover,
alpha focused attention on reliability coefficients Jeffrey T. Steedle and Richard J. Shavelson
when that attention should instead be cast on
See also Classical Test Theory; Generalizability Theory;
measurement error and the standard error of
Internal Consistency Reliability; KR-20; Reliability;
measurement.
Split-Half Reliability
For Cronbach, the extension of alpha (and clas-
sical test theory) came when Fisherian notions of
experimental design and analysis of variance were Further Readings
put together with the idea that some ‘‘treatment’’ Brennan, R. L. (2001). Generalizability theory. New
conditions could be considered random samples York: Springer-Verlag.
from a large universe, as alpha assumes about item Cronbach, L. J., & Shavelson, R. J. (2004). My current
sampling. Measurement data, then, could be col- thoughts on coefficient alpha and successor
lected in complex designs with multiple variables procedures. Educational & Psychological
(e.g., items, occasions, and rater effects) and ana- Measurement, 64(3), 391–418.
lyzed with random-effects analysis of variance Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.),
models. The goal was not so much to estimate Educational measurement (pp. 65–110). Westport,
CT: Praeger.
a reliability coefficient as to estimate the compo-
Shavelson, R. J. (2004). Editor’s preface to Lee
nents of variance that arose from multiple vari- J. Cronbach’s ‘‘My Current Thoughts on Coefficient
ables and their interactions in order to account for Alpha and Successor Procedures.’’ Educational &
observed score variance. This approach of parti- Psychological Measurement, 64(3), 389–390.
tioning effects into their variance components pro- Shavelson, R. J., & Webb, N. M. (1991). Generalizability
vides information as to the magnitude of each of theory: A primer. Newbury Park, CA: Sage.
the multiple sources of error and a standard error
of measurement, as well as an ‘‘alpha-like’’ reliabil-
ity coefficient for complex measurement designs.
Moreover, the variance-component approach COEFFICIENT OF CONCORDANCE
can provide the value of ‘‘alpha’’ expected by
increasing or decreasing the number of items (or Proposed by Maurice G. Kendall and Bernard
raters or occasions) like those in the test. In addi- Babington Smith, Kendall’s coefficient of concor-
tion, the proportion of observed score variance dance (W) is a measure of the agreement among
attributable to variance in item difficulty (or, for several (m) quantitative or semiquantitative vari-
example, rater stringency) may also be com- ables that are assessing a set of n objects of inter-
puted, which is especially important to contem- est. In the social sciences, the variables are often
porary testing programs that seek to determine people, called judges, assessing different subjects
whether examinees have achieved an absolute, or situations. In community ecology, they may be
rather than relative, level of proficiency. Once species whose abundances are used to assess habi-
these possibilities were envisioned, coefficient tat quality at study sites. In taxonomy, they may
alpha morphed into generalizability theory, with be characteristics measured over different species,
sophisticated analyses involving crossed and biological populations, or individuals.
nested designs with random and fixed variables There is a close relationship between Milton
(facets) producing variance components for Friedman’s two-way analysis of variance without
multiple measurement facets such as raters and replication by ranks and Kendall’s coefficient of
testing occasions so as to provide a complex concordance. They address hypotheses concerning
standard error of measurement. the same data table, and they use the same χ2 sta-
By all accounts, coefficient alpha—Cronbach’s tistic for testing. They differ only in the formula-
alpha—has been and will continue to be the most tion of their respective null hypothesis. Consider
popular method for estimating behavioral Table 1, which contains illustrative data. In Fried-
measurement reliability. As of 2004, the 1951 man’s test, the null hypothesis is that there is no
Coefficient of Concordance 165

Table 1 Illustrative Example: Ranked Relative Abundances of Four Soil Mite Species (Variables) at 10 Sites (Objects)
Ranks (column-wise) Sum of Ranks

Species 13 Species 14 Species 15 Species 23 Ri


Site 4 5 6 3 5 19.0
Site 9 10 4 8 2 24.0
Site 14 7 8 5 4 24.0
Site 22 8 10 9 2 29.0
Site 31 6 5 7 6 24.0
Site 34 9 7 10 7 33.0
Site 45 3 3 2 8 16.0
Site 53 1.5 2 4 9 16.5
Site 61 1.5 1 1 2 5.5
Site 69 4 9 6 10 29.0
Source: Legendre, P. (2005) Species associations: The Kendall coefficient of concordance revisited. Journal of Agricultural,
Biological, & Environmental Statistics, 10, 230. Reprinted with permission from the Journal of Agricultural, Biological, &
Environmental Statistics. Copyright 2005 by the American Statistical Association. All rights reserved.
Notes: The ranks are computed columnwise with ties. Right-hand column: sum of the ranks for each site.

real difference among the n objects (sites, rows of or


Table 1) because they pertain to the same statisti-
cal population. Under the null hypothesis, they 12S0  3m2 nðn þ 1Þ2
should have received random ranks along the vari- W ¼ , ð2Þ
m2 ðn3  nÞ  mT
ous variables, so that their sums of ranks should
be approximately equal. Kendall’s test focuses on where n is the number of objects and m is the
the m variables. If the null hypothesis of Fried- number of variables. T is a correction factor for
man’s test is true, this means that the variables tied ranks:
have produced rankings of the objects that are
independent of one another. This is the null X
g

hypothesis of Kendall’s test. T ¼ ðtk3  tk Þ, ð3Þ


k¼1

Computing Kendall’s W in which tk is the number of tied ranks in each (k)


There are two ways of computing Kendall’s W sta- of g groups of ties. The sum is computed over all
tistic (first and second forms of Equations 1 and groups of ties found in all m variables of the data
2); they lead to the same result. S or S0 is computed table. T ¼ 0 when there are no tied values.
first from the row-marginal sums of ranks Ri Kendall’s W is an estimate of the variance of the
received by the objects: row sums of ranks Ri divided by the maximum
possible value the variance can take; this occurs
X
n
2 X
n when all variables are in total agreement. Hence
S ¼ ðRi  RÞ or S0 ¼ R2i ¼ SSR, ð1Þ 0 ≤ W ≤ 1, 1 representing perfect concordance. To
i¼1 i¼1 derive the formulas for W (Equation 2), one has to
know that when all variables are in perfect agree-
where S is a sum-of-squares statistic over the row ment, the sum of all sums of ranks in the data table
sums of ranks Ri, and R is the mean of the Ri (right-hand column of Table 1) is mnðn þ 1Þ=2 and
values. Following that, Kendall’s W statistic can be that the sum of squares of the sums of all ranks is
obtained from either of the following formulas: m2n(n þ 1)(2n þ 1)/6 (without ties).
There is a close relationship between Charles
12S Spearman’s correlation coefficient rS and Kendall’s
W ¼
m2 ðn3  nÞ  mT W statistic: W can be directly calculated from the
166 Coefficient of Concordance

mean (rS ) of the pairwise Spearman correlations rS be included in the index, because different
using the following relationship: groups of species may be associated to different
environmental conditions.
ðm  1ÞrS þ 1
W ¼ , ð4Þ
m Testing the Significance of W
where m is the number of variables (judges) among Friedman’s chi-square statistic is obtained from W
which Spearman correlations are computed. Equa- by the formula
tion 4 is strictly true for untied observations only;
for tied observations, ties are handled in a bivariate χ2 ¼ mðn  1ÞW: ð5Þ
way in each Spearman rS coefficient whereas in
This quantity is asymptotically distributed like
Kendall’s W the correction for ties is computed in
chi-square with ν ¼ ðn  1Þ degrees of freedom; it
a single equation (Equation 3) for all variables.
can be used to test W for significance. According to
For two variables (judges) only, W is simply a lin-
Kendall and Babington Smith, this approach is satis-
ear transformation of rS: W ¼ (rS þ 1)/2. In that
factory only for moderately large values of m and n.
case, a permutation test of W for two variables is
Sidney Siegel and N. John Castellan Jr. recom-
the exact equivalent of a permutation test of rS for
mend the use of a table of critical values for W
the same variables.
when n ≤ 7 and m ≤ 20; otherwise, they recommend
The relationship described by Equation 4 clearly
testing the chi-square statistic (Equation 5) using the
limits the domain of application of the coefficient of
chi-square distribution. Their table of critical values
concordance to variables that are all meant to esti-
of W for small n and m is derived from a table of
mate the same general property of the objects: vari-
critical values of S assembled by Friedman using the
ables are considered concordant only if their
z test of Kendall and Babington Smith and repro-
Spearman correlations are positive. Two variables
duced in Kendall’s classic monograph, Rank Corre-
that give perfectly opposite ranks to a set of objects
lation Methods. Using numerical simulations, Pierre
have a Spearman correlation of  1, hence W ¼ 0
Legendre compared results of the classical chi-
for these two variables (Equation 4); this is the
square test of the chi-square statistic (Equation 5) to
lower bound of the coefficient of concordance. For
the permutation test that Siegel and Castellan also
two variables only, rS ¼ 0 gives W ¼ 0.5. So coeffi-
recommend for small samples (small n). The simula-
cient W applies well to rankings given by a panel of
tion results showed that the classical chi-square test
judges called in to assess overall performance in
was too conservative for any sample size (n) when
sports or quality of wines or food in restaurants, to
the number of variables m was smaller than 20; the
rankings obtained from criteria used in quality tests
test had rejection rates well below the significance
of appliances or services by consumer organizations,
level, so it remained valid. The classical chi-square
and so forth. It does not apply, however, to variables
test had a correct level of Type I error (rejecting
used in multivariate analysis in which negative as
a null hypothesis that is true) for 20 variables and
well as positive relationships are informative. Jerrold
more. The permutation test had a correct rate of
H. Zar, for example, uses wing length, tail length,
Type I error for all values of m and n. The power of
and bill length of birds to illustrate the use of the
the permutation test was higher than that of the
coefficient of concordance. These data are appropri-
classical chi-square test because of the differences in
ate for W because they are all indirect measures of
rates of Type I error between the two tests. The dif-
a common property, the size of the birds.
ferences in power disappeared asymptotically as the
In ecological applications, one can use the
number of variables increased.
abundances of various species as indicators of
An alternative approach is to compute the fol-
the good or bad environmental quality of the
lowing F statistic:
study sites. If a group of species is used to pro-
duce a global index of the overall quality (good F ¼ ðm  1ÞW=ð1  WÞ, ð6Þ
or bad) of the environment at the study sites,
only the species that are significantly associated which is asymptotically distributed like F with
and positively correlated to one another should ν1 ¼ n  1  ð2=mÞ and ν2 ¼ ν1 ðm  1Þ degrees
Coefficient of Concordance 167

of freedom. Kendall and Babington Smith law, consumer protection, etc.). In other types of
described this approach using a Fisher z transfor- studies, scientists are interested in identifying
mation of the F statistic, z ¼ 0.5 loge(F). They variables that agree in their estimation of a com-
recommended it for testing W for moderate values mon property of the objects. This is the case in
of m and n. Numerical simulations show, however, environmental studies in which scientists are
that this F statistic has correct levels of Type I interested in identifying groups of concordant
error for any value of n and m. species that are indicators of some property of
In permutation tests of Kendall’s W, the objects the environment and can be combined into indi-
are the permutable units under the null hypothesis ces of its quality, in particular in situations of
(the objects are sites in Table 1). For the global test pollution or contamination.
of significance, the rank values in all variables are The contribution of individual variables to
permuted at random, independently from variable the W statistic can be assessed by a permutation
to variable because the null hypothesis is the inde- test proposed by Legendre. The null hypothesis
pendence of the rankings produced by all vari- is the monotonic independence of the variable
ables. The alternative hypothesis is that at least subjected to the test, with respect to all the other
one of the variables is concordant with one, or variables in the group under study. The alterna-
with some, of the other variables. Actually, for tive hypothesis is that this variable is concordant
permutation testing, the four statistics SSR with other variables in the set under study, hav-
(Equation 1), W (Equation 2), χ2 (Equation 5), ing similar rankings of values (one-tailed test).
and F (Equation 6) are monotonic to one another The statistic W can be used directly in a poste-
since n and m, as well as T, are constant within riori tests. Contrary to the global test, only the
a given permutation test; thus they are equivalent variable under test is permuted here. If that vari-
statistics for testing, producing the same permuta- able has values that are monotonically indepen-
tional probabilities. The test is one-tailed because dent of the other variables, permuting its values
it recognizes only positive associations between at random should have little influence on the W
vectors of ranks. This may be seen if one considers statistic. If, on the contrary, it is concordant with
two vectors with exactly opposite rankings: They one or several other variables, permuting its
produce a Spearman statistic of  1, hence a value values at random should break the concordance
of zero for W (Equation 4). and induce a noticeable decrease on W.
Many of the problems subjected to Kendall’s Two specific partial concordance statistics can
concordance analysis involve fewer than 20 vari- also be used in a posteriori tests. The first one is the
ables. The chi-square test should be avoided in mean, rj , of the pairwise Spearman correlations
these cases. The F test (Equation 6), as well as the between variable j under test and all the other vari-
permutation test, can safely be used with all values ables. The second statistic, Wj , is obtained by apply-
of m and n. ing Equation 4 to rj instead of r, with m the number
of variables in the group. These two statistics are
shown in Table 2 for the example data; rj and Wj
Contributions of Individual Variables are monotonic to each other because m is constant
in a given permutation test. Within a given a poster-
to Kendall’s Concordance
iori test, W is also monotonic to Wj because only
The overall permutation test of W suggests the values related to variable j are permuted when
a way of testing a posteriori the significance of testing variable j. These three statistics are thus
the contributions of individual variables to the equivalent for a posteriori permutation tests, produc-
overall concordance to determine which of the ing the same permutational probabilities. Like rj , Wj
individual variables are concordant with one or can take negative values; this is not the case of W.
several other variables in the group. There is There are advantages to performing a single
interest in several fields in identifying discordant a posteriori test for variable j instead of (m  1)
variables or judges. This includes all fields that tests of the Spearman correlation coefficients
use panels of judges to assess the overall quality between variable j and all the other variables: The
of the objects or subjects under study (sports, tests of the (m  1) correlation coefficients would
168 Coefficient of Concordance

Table 2 Results of (a) the Overall and (b) the A Posteriori Tests of Concordance Among the Four Species of Table
1; (c) Overall and (d) A Posteriori Tests of Concordance Among Three Species

(a) Overall test of W statistic, four species. H0 : The four species are not concordant with one another.
Kendall’s W ¼ 0.44160 Permutational p value ¼ .0448*
F statistic ¼ 2.37252 F distribution p value ¼ .0440*
Friedman’s chi-square ¼ 15.89771 Chi-square distribution p value ¼ .0690

(b) A posteriori tests, four species. H0 : This species is not concordant with the other three.
rj Wj p Value Corrected p Decision at α ¼ 5%
Species 13 0.32657 0.49493 .0766 .1532 Do not reject H0
Species 14 0.39655 0.54741 .0240 .0720 Do not reject H0
Species 15 0.45704 0.59278 .0051 .0204* Reject H0
Species 23 0.16813 0.12391 .7070 .7070 Do not reject H0

(c) Overall test of W statistic, three species. H0 : The three species are not concordant with one another.
Kendall’s W ¼ 0.78273 Permutational p value ¼ :0005*
F statistic ¼ 7.20497 F distribution p value ¼ :0003*
Friedman’s chi-square ¼ 21.13360 Chi-square distribution p value ¼ .0121*

(d) A posteriori tests, three species. H0 : This species is not concordant with the other two.
rj Wj p Value Corrected p Decision at α ¼ 5%
Species 13 0.69909 0.79939 .0040 .0120* Reject H0
Species 14 0.59176 0.72784 .0290 .0290* Reject H0
Species 15 0.73158 0.82105 .0050 .0120* Reject H0
Source: (a) and (b): Adapted from Legendre, P. (2005). Species associations: The Kendall coefficient of concordance revisited. Journal of
Agricultural, Biological, and Environmental Statistics, 10, 233. Reprinted with permission from the Journal of Agricultural, Biological
and Environmental Statistics. Copyright 2005 by the American Statistical Association. All rights reserved.
Notes: rj ¼ mean of the Spearman correlations with the other species; Wj ¼ partial concordance per species; p value ¼ permutational
probability (9,999 random permutations); corrected p ¼ Holm-corrected p value. * ¼ Reject H0 at α ¼ :05:

have to be corrected for multiple testing, and they The example data are analyzed in Table 2. The
could provide discordant information; a single test overall permutational test of the W statistic is sig-
of the contribution of variable j to the W statistic nificant at α ¼ 5%, but marginally (Table 2a). The
has greater power and provides a single, clearer cause appears when examining the a posteriori
answer. In order to preserve a correct or approxi- tests in Table 2b: Species 23 has a negative mean
mately correct experimentwise error rate, the proba- correlation with the three other species in the
bilities of the a posteriori tests computed for all group (rj ¼  .168). This indicates that Species 23
species in a group should be adjusted for multiple does not belong in that group. Were we analyzing
testing. a large group of variables, we could look at the
A posteriori tests are useful for identifying the next partition in an agglomerative clustering den-
variables that are not concordant with the others, drogram, or the next K-means partition, and pro-
as in the examples, but they do not tell us whether ceed to the overall and a posteriori tests for the
there are one or several groups of congruent vari- members of these new groups. In the present illus-
ables among those for which the null hypothesis of trative example, Species 23 clearly differs from the
independence is rejected. This information can be other three species. We can now test Species 13,
obtained by computing Spearman correlations 14, and 15 as a group. Table 2c shows that this
among the variables and clustering them into group has a highly significant concordance, and all
groups of variables that are significantly and posi- individual species contribute significantly to the
tively correlated. overall concordance of their group (Table 2d).
Coefficient of Variation 169

In Table 2a and 2c, the F test results are concor- Friedman, M. (1940). A comparison of alternative tests
dant with the permutation test results, but due to of significance for the problem of m rankings. Annals
small m and n, the chi-square test lacks power. of Mathematical Statistics, 11, 86–92.
Kendall, M. G. (1948). Rank correlation methods (1st
ed.). London: Charles Griffith.
Kendall, M. G., & Babington Smith, B. (1939). The
Discussion problem of m rankings. Annals of Mathematical
Statistics, 10, 275–287.
The Kendall coefficient of concordance can be Legendre, P. (2005). Species associations: The Kendall
used to assess the degree to which a group of vari- coefficient of concordance revisited. Journal of
ables provides a common ranking for a set of Agricultural, Biological, & Environmental Statistics,
objects. It should be used only to obtain a state- 10, 226–245.
ment about variables that are all meant to measure Zar, J. H. (1999). Biostatistical analysis (4th ed.). Upper
the same general property of the objects. It should Saddle River, NJ: Prentice Hall.
not be used to analyze sets of variables in which
the negative and positive correlations have equal
importance for interpretation. When the null
hypothesis is rejected, one cannot conclude that all COEFFICIENT OF VARIATION
variables are concordant with one another, as
shown in Table 2 (a) and (b); only that at least one The coefficient of variation measures the vari-
variable is concordant with one or some of the ability of a series of numbers independent of the
others. unit of measurement used for these numbers. In
The partial concordance coefficients and a pos- order to do so, the coefficient of variation elimi-
teriori tests of significance are essential comple- nates the unit of measurement of the standard
ments of the overall test of concordance. In several deviation of a series of numbers by dividing the
fields, there is interest in identifying discordant standard deviation by the mean of these num-
variables; this is the case in all fields that use bers. The coefficient of variation can be used to
panels of judges to assess the overall quality of the compare distributions obtained with different
objects under study (e.g., sports, law, consumer units, such as the variability of the weights of
protection). In other applications, one is interested newborns (measured in grams) with the size of
in using the sum of ranks, or the sum of values, adults (measured in centimeters). The coefficient
provided by several variables or judges, to create of variation is meaningful only for measurements
an overall indicator of the response of the objects with a real zero (i.e., ‘‘ratio scales’’) because the
under study. It is advisable to look for one or sev- mean is meaningful (i.e., unique) only for these
eral groups of variables that rank the objects scales. So, for example, it would be meaningless
broadly in the same way, using clustering, and to compute the coefficient of variation of the
then carry out a posteriori tests on the putative temperature measured in degrees Fahrenheit,
members of each group. Only then can their values because changing the measurement to degrees
or ranks be pooled into an overall index. Celsius will not change the temperature but will
change the value of the coefficient of variation
Pierre Legendre (because the value of zero for Celsius is 32 for
Fahrenheit, and therefore the mean of the tem-
See also Friedman Test; Holm’s Sequential Bonferroni
perature will change from one scale to the
Procedure; Spearman Rank Order Correlation
other). In addition, the values of the measure-
ment used to compute the coefficient of variation
are assumed to be always positive or null. The
Further Readings coefficient of variation is primarily a descriptive
Friedman, M. (1937). The use of ranks to avoid the statistic, but it is amenable to statistical infer-
assumption of normality implicit in the analysis of ences such as null hypothesis testing or confi-
variance. Journal of the American Statistical dence intervals. Standard procedures are often
Association, 32, 675–701. very dependent on the normality assumption,
170 Coefficient of Variation

and current work is exploring alternative proce- of variation denoted γ ν An unbiased estimate of the
dures that are less dependent on this normality population coefficient of variation, denoted C^ ν ; is
assumption. computed as
 
^ν ¼ 1
Definition and Notation C 1þ Cν ð3Þ
4N
The coefficient of variation, denoted Cv (or occa-
sionally V), eliminates the unit of measurement (where N is the sample size).
from the standard deviation of a series of numbers
by dividing it by the mean of this series of num-
Testing the Coefficient of Variation
bers. Formally, if, for a series of N numbers, the
standard deviation and the mean are denoted When the coefficient of variation is computed on
respectively by S and M, the coefficient of varia- a sample drawn from a normal population, its
tion is computed as standard error, denoted σ Cν , is known and is equal
to
S
Cv ¼ : ð1Þ
M γν
σ Cν ¼ pffiffiffiffiffiffiffi ð4Þ
Often the coefficient of variation is expressed as 2N
a percentage, which corresponds to the following
When γ ν is not known (which is, in general,
formula for the coefficient of variation:
the case), σ Cν can be estimated by replacing γ ν
S × 100 by its estimation from the sample. Either Cν or
Cv ¼ : ð2Þ ^ν can be used for this purpose ðC
C ^ ν being prefer-
M
able because it is a better estimate). So σ Cν can
This last formula can be potentially misleading be estimated as
because, as shown later, the value of the coefficient
of variation can exceed 1 and therefore would cre- Cν C^ν
ate percentages larger than 100. In that case, For- SCν ¼ pffiffiffiffiffiffiffi or ^SCν ¼ pffiffiffiffiffiffiffi : ð5Þ
2N 2N
mula 1, which expresses Cv as a ratio rather than
a percentage, should be used. Therefore, under the assumption of normality, the
statistic
Range
Cν  γ ν
In a finite sample of N nonnegative numbers with tCν ¼ ð6Þ
SCν
a real zero, the coefficient of variation can take
pffiffiffiffiffiffiffiffiffiffiffiffiffi
a value between 0 and N  1 (the maximum follows a Student distribution with ν ¼ N  1
value of Cv is reached when all values but one are degrees of freedom. It should be stressed that this
equal to zero). test is very sensitive to the normality assumption.
Work is still being done to minimize the effect of
this assumption.
Estimation of a Population
If Equation (6) is rewritten, confidence intervals
Coefficient of Variation
can be computed as
The coefficient of variation computed on a sam-
ple is a biased estimate of the population coefficient Cν ± tα;ν SCν ð7Þ

Table 1 Example for the Coefficient of Variation


Saleperson 1 2 3 4 5 6 7 8 9 10
Commission 152 155 164 164 182 221 233 236 245 248
Note: Daily commission (in dollars) of 10 salespersons.
Coefficients of Correlation, Alienation, and Determination 171

(with tα;ν being the critical value of Student’s t for Cν ± tα;ν SCν ¼ 0:2000 ± 2:26 × 0:0447
the chosen α level and for ν ¼ N  1 degrees of ð12Þ
freedom). Again, because C ^ ν is a better estimation ¼ 0:200 ± 0:1011
of γ v than Cν is, it makes sense to use C ^ ν rather
than Cν . and therefore we conclude that there is a probabil-
ity of .95 that the value of γ ν lies in the interval
[0.0989 to 0.3011].
Example
Hervé Abdi
Table 1 lists the daily commission in dollars of 10
car salespersons. The mean commission is equal to See also Mean; Standard Deviation; Variability, Measure
$200, with a standard deviation of $40. of; Variance
This gives a value of the coefficient of variation
of Further Readings
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J.
S 40
Cν ¼ ¼ ¼ 0:200; ð8Þ (2009). Experimental design and analysis
M 200 for psychology. Oxford, UK: Oxford University
Press.
which corresponds to a population estimate of Curto, J. D., & Pinto, J. C. (2009). The coefficient of
variation asymptotic distribution in the case of non-iid
    random variables. Journal of Applied Statistics, 36,
^ν ¼ 1 1 21–32.
C 1þ Cν ¼ 1 þ × 0:200
4N 4 × 10 Nairy, K. S., & Rao, K. N. (2003). Tests of coefficients
of variation of normal populations. Commnunications
¼ 0:205:
in Statistics Simulation and Computation, 32,
ð9Þ 641–661.
Martin, J. D., & Gray, L. N. (1971). Measurement of
relative variation: Sociological examples. American
The standard error of the coefficient of variation is Sociological Review, 36, 496–502.
estimated as Sokal, R. R., & Rohlf, F. J. (1995). Biometry (3rd ed.).
New York: Freeman.
Cν 0:200
SCν ¼ pffiffiffiffiffiffiffi ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:0447 ð10Þ
2N 2 × 10
COEFFICIENTS OF CORRELATION,
(the value of ^
SCν is equal to 0.0458).
ALIENATION, AND DETERMINATION
A t criterion testing the hypothesis that the
population value of the coefficient of variation is The coefficient of correlation evaluates the similar-
equal to zero is equal to ity of two sets of measurements (i.e., two depen-
dent variables) obtained on the same observations.
Cν  γ ν 0:2000 The coefficient of correlation indicates the amount
tCν ¼ ¼ ¼ 4:47: ð11Þ of information common to the two variables. This
SCν 0:0447
coefficient takes values between  1 and þ 1
(inclusive). A value of þ 1 shows that the two
This value of tCν ¼ 4:47 is larger than the criti- series of measurements are measuring the same
cal value of tα;ν ¼ 2:26 (which is the critical value thing. A value of  1 indicates that the two mea-
of a Student’s t distribution for α ¼ :05 and surements are measuring the same thing, but one
ν ¼ :9 degrees of freedom). Therefore, we can measurement varies inversely to the other. A value
reject the null hypothesis and conclude that γ ν is of 0 indicates that the two series of measurements
larger than zero. A 95% corresponding confidence have nothing in common. It is important to note
interval gives the values of that the coefficient of correlation measures only
172 Coefficients of Correlation, Alienation, and Determination

the linear relationship between two variables and relationship, and when they have different signs,
that its value is very sensitive to outliers. they indicate a negative relationship.
The squared correlation gives the proportion The average value of the SCPWY is called the
of common variance between two variables and covariance (just like the variance, the covariance
is also called the coefficient of determination. can be computed by dividing by S or by S  1):
Subtracting the coefficient of determination from
unity gives the proportion of variance not shared SCP SCP
covWY ¼ ¼ : ð2Þ
between two variables. This quantity is called Number of Observations S
the coefficient of alienation.
The significance of the coefficient of correla- The covariance reflects the association between
tion can be tested with an F or a t test. This entry the variables, but it is expressed in the original
presents three different approaches that can be units of measurement. In order to eliminate the
used to obtain p values: (1) the classical units, the covariance is normalized by division by
approach, which relies on Fisher’s F distribu- the standard deviation of each variable. This
tions; (2) the Monte Carlo approach, which defines the coefficient of correlation, denoted rW.Y,
relies on computer simulations to derive empiri- which is equal to
cal approximations of sampling distributions;
and (3) the nonparametric permutation (also covWY
rW:Y ¼ : ð3Þ
known as randomization) test, which evaluates σW σY
the likelihood of the actual data against the set
of all possible configurations of these data. In Rewriting the previous formula gives a more prac-
addition to p values, confidence intervals can be tical formula:
computed using Fisher’s Z transform or the more
modern, computationally based, and nonpara- SCPWY
rW:Y ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð4Þ
metric Efron’s bootstrap. SSW SSY
Note that the coefficient of correlation always
overestimates the intensity of the correlation in the where SCP is the sum of the cross-product and
population and needs to be ‘‘corrected’’ in order to SSW and SSY are the sum of squares of W and Y,
provide a better estimation. The corrected value is respectively.
called shrunken or adjusted.
Correlation Computation: An Example
The computation for the coefficient of correlation
is illustrated with the following data, describing
Notations and Definition
the values of W and Y for S ¼ 6 subjects:
Suppose we have S observations, and for each
observation s, we have two measurements, W1 ¼ 1; W2 ¼ 3; W3 ¼ 4; W4 ¼ 4; W5 ¼ 5; W6 ¼ 7
denoted Ws and Ys, with respective means denoted Y1 ¼ 16; Y2 ¼ 10; Y3 ¼ 12; Y4 ¼ 4; Y5 ¼ 8; Y6 ¼ 10:
Mw and My. For each observation, we define the
cross-product as the product of the deviations of
Step 1. Compute the sum of the cross-products.
each variable from its mean. The sum of these
First compute the means of W and Y:
cross-products, denoted SCPwy, is computed as

X
S 1X S
24
MW ¼ Ws ¼ ¼ 4 and
SCPWY ¼ ðWs  MW ÞðYs  MY Þ: ð1Þ S s¼1 6
s
1X S
60
MY ¼ Ys ¼ ¼ 10:
The sum of the cross-products reflects the asso- S s¼1 6
ciation between the variables. When the deviations
have the same sign, they indicate a positive The sum of the cross-products is then equal to
Coefficients of Correlation, Alienation, and Determination 173

P
X
S ðYs  MY ÞðWs  MW Þ
SCPWY ¼ ðYs  MY ÞðWs  MW Þ s SCPWY
rW:Y ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
s SSY × SSW SSW SSY
¼ ð16  10Þð1  4Þ 20 20 20
¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ pffiffiffiffiffiffiffiffiffiffiffi ¼ 
þ ð10  10Þð3  4Þ 80 × 20 1600 40
þ ð12  10Þð4  4Þ ¼ :5:
þ ð4  10Þð4  4Þ ð8Þ
þ ð8  10Þð5  4Þ ð5Þ
þ ð10  10Þð7  4Þ This value of r ¼ .5 can be interpreted as an
indication of a negative linear relationship between
¼ ð6 ×  3Þð0 ×  1Þ W and Y.
þ ð2 × 0Þð6 × 0Þ
þ ð2 × 1Þ þ ð0 × 3Þ
Properties of the Coefficient of Correlation
¼ 18 þ 0 þ 0 þ 0  2 þ 0
¼ 20: The coefficient of correlation is a number without
unit. This occurs because dividing the units of the
numerator by the same units in the denominator
Step 2. Compute the sums of squares. The sum
eliminates the units. Hence, the coefficient of cor-
of squares of Ws is obtained as
relation can be used to compare different studies
X
S performed using different variables.
SSW ¼ ðWs  MW Þ2 The magnitude of the coefficient of correlation
s¼1 is always smaller than or equal to 1. This happens
2
¼ ð1  4Þ þ ð3  4Þ þ ð4  4Þ
2 2 because the numerator of the coefficient of correla-
tion (see Equation 4) is always smaller than or
þ ð4  4Þ2 þ ð5  4Þ2 þ ð7  4Þ2 equal to its denominator (this property follows
¼ ð3Þ2 þ ð1Þ2 þ 02 þ 02 ð6Þ from the Cauchy–Schwartz inequality). A coeffi-
cient of correlation that is equal to þ 1 or  1
þ 12 þ 32
indicates that the plot of the observations will have
¼ 9þ1þ0þ0þ1þ9 all observations positioned on a line.
¼ 18 þ 0 þ 0 þ 0  2 þ 0 The squared coefficient of correlation gives the
¼ 20: proportion of common variance between two
variables. It is also called the coefficient of deter-
The sum of squares of Ys is mination. In our example, the coefficient of deter-
mination is equal to r2WY ¼ :25: The proportion
X
S
of variance not shared between the variables is
SSY ¼ ðYs  MY Þ2
s¼1
called the coefficient of alienation, and for our
example, it is equal to 1  r2WY ¼ :75:
¼ ð16  10Þ2 þ ð10  10Þ2
þ ð12  10Þ2 þ ð4  10Þ2 þ ð8  10Þ2
Interpreting Correlation
þ ð10  10Þ2
Linear and Nonlinear Relationship
¼ 62 þ 02 þ 22 þ ð6Þ2 þ ð2Þ2 þ 02
¼ 36 þ 0 þ 4 þ 36 þ 4 þ 0 The coefficient of correlation measures only lin-
ear relationships between two variables and will
¼ 80: miss nonlinear relationships. For example, Figure 1
ð7Þ displays a perfect nonlinear relationship between
two variables (i.e., the data show a U-shaped rela-
Step 3. Compute rW:Y . The coefficient of corre- tionship with Y being proportional to the square of
lation between W and Y is equal to W), but the coefficient of correlation is equal to 0.
174 Coefficients of Correlation, Alienation, and Determination
Y

Y
W W

Figure 1 A Perfect Nonlinear Relationship With Figure 2 The Dangerous Effect of Outliers on the
a 0 Correlation (rW:Y ¼ 0) Value of the Coefficient of Correlation
Notes: The correlation of the set of points represented by the
circles is equal to  .87. When the point represented by the
Effect of Outliers
diamond is added to the set, the correlation is now equal to
Observations far from the center of the distribu- þ .61, which shows that an outlier can determine the value
of the coefficient of correlation.
tion contribute a lot to the sum of the cross-pro-
ducts. In fact, as illustrated in Figure 2, a single
extremely deviant observation (often called an out- France, the number of Catholic churches in a city,
lier) can dramatically influence the value of r. as well as the number of schools, is highly corre-
lated with the number of cases of cirrhosis of the
liver, the number of teenage pregnancies, and the
Geometric Interpretation
number of violent deaths. Does this mean that
Each set of observations can also be seen as churches and schools are sources of vice and that
a vector in an S dimensional space (one dimension newborns are murderers? Here, in fact, the
per observation). Within this framework, the cor- observed correlation is due to a third variable,
relation is equal to the cosine of the angle between namely the size of the cities: the larger a city, the
the two vectors after they have been centered by larger the number of churches, schools, alcoholics,
subtracting their respective mean. For example, and so forth. In this example, the correlation
a coefficient of correlation of r ¼ .50 corre- between number of churches or schools and alco-
sponds to a 150-degree angle. A coefficient of cor- holics is called a spurious correlation because it
relation of 0 corresponds to a right angle, and reflects only their mutual correlation with a third
therefore two uncorrelated variables are called variable (i.e., size of the city).
orthogonal (which is derived from the Greek word
for right angle). Testing the Significance of r
A null hypothesis test for r can be performed using
Correlation and Causation an F statistic obtained as
The fact that two variables are correlated does
not mean that one variable causes the other one: r2
F ¼ × ðS  2Þ: ð9Þ
Correlation is not causation. For example, in 1  r2
Coefficients of Correlation, Alienation, and Determination 175

Fisher F: ν1 = 1, ν2 = 4 r 2 Distribution: Monte Carlo Approach


400 200
F = 1.33
# of Samples

# of Samples
300 p = .313 150
200 100
α = .05
100 Fcritical = 7.7086 50
0
0 5 10 15 20 25 30 35 40 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Values of F
Values of r 2
F Distribution: Monte Carlo Approach
Figure 3 The Fisher Distribution for ν1 ¼ 1 and 600
F = 1.33
ν2 ¼ 4, Along With α ¼ .05

# of Samples
p = .310
400
Note: Critical value of F ¼ 7:7086.
α = .05
200 Fcritical = 7.5500l
For our example, we find that
0
:25 0 10 20 30 40 50 60 70 80
F ¼ × ð6  2Þ ¼ Values of F
1  :25
:25 1 4
× 4 ¼ × 4 ¼ ¼ 1:33:
:75 3 3 Figure 4 Histogram of Values of r2 and F Computed
From 1,000 Random Samples When the
In order to perform a statistical test, the next Null Hypothesis Is True
step is to evaluate the sampling distribution of the
Notes: The histograms show the empirical distribution of F
F. This sampling distribution provides the proba-
and r2 under the null hypothesis.
bility of finding any given value of the F criterion
(i.e., the p value) under the null hypothesis (i.e.,
pffiffiffi
when there is no correlation between the vari- performed using t ¼ F; which is distributed
ables). If this p value is smaller than the chosen under H0 as a Student’s distribution with
level (e.g., .05 or .01), then the null hypothesis can ν ¼ S  2 degrees of freedom).
be rejected, and r is considered significant. The For our example, the Fisher distribution shown
problem of finding the p value can be addressed in in Figure 3 has ν1 ¼ 1 and ν2 ¼ S  2 ¼ 6  2 ¼ 4
three ways: (1) the classical approach, which uses and gives the sampling distribution of F. The use
Fisher’s F distributions; (2) the Monte Carlo of this distribution will show that the probability
approach, which generates empirical probability of finding a value of F ¼ 1:33 under H0 is equal
distributions; and (3) the (nonparametric) permu- to p ≈ :313 (most statistical packages will rou-
tation test, which evaluates the likelihood of the tinely provide this value). Such a p value does not
actual configuration of results among all other lead to rejecting H0 at the usual level of α ¼ :05
possible configurations of results. or α ¼ :01: An equivalent way of performing a test
uses critical values that correspond to values of F
Classical Approach
whose p value is equal to a given α level. For our
In order to analytically derive the sampling dis- example, the critical value (found in tables avail-
tribution of F, several assumptions need to be able in most standard textbooks) for α ¼ :05 is
made: (a) the error of measurement is added to the equal to Fð1; 4Þ ¼ 7:7086: Any F with a value
true measure; (b) the error is independent of the larger than the critical value leads to rejection of
measure; and (c) the mean error is normally dis- the null hypothesis at the chosen α level, whereas
tributed, has a mean of zero, and has a variance of an F value smaller than the critical value leads
σ 2e : When theses assumptions hold and when the one to fail to reject the null hypothesis. For our
null hypothesis is true, the F statistic is distributed example, because F ¼ 1.33 is smaller than the criti-
as a Fisher’s F with ν1 ¼ 1 and ν2 ¼ S  2 degrees cal value of 7.7086, we cannot reject the null
of freedom. (Incidentally, an equivalent test can be hypothesis.
176 Coefficients of Correlation, Alienation, and Determination

Monte Carlo Approach For our example, we find that 310 random
samples (out of 1,000) had a value of F lar-
A modern alternative to the analytical deriva- ger than F ¼ 1.33, and this corresponds to a prob-
tion of the sampling distribution is to empirically ability of p ¼ .310 (compare with a value of
obtain the sampling distribution of F when the null p ¼ .313 for the classical approach). Because
hypothesis is true. This approach is often called this p value is not smaller than α ¼ :05; we can-
a Monte Carlo approach. not reject the null hypothesis. Using the critical-
With the Monte Carlo approach, we generate value approach leads to the same decision.
a large number of random samples of observations The empirical critical value for α ¼ :05 is equal
(e.g., 1,000 or 10,000) and compute r and F for to 7.5500 (see Figure 4). Because the computed
each sample. In order to generate these samples, value of F ¼ 1.33 is not larger than the 7.5500,
we need to specify the shape of the population we do not reject the null hypothesis.
from which these samples are obtained. Let us use
a normal distribution (this makes the assumptions
for the Monte Carlo approach equivalent to the
Permutation Tests
assumptions of the classical approach). The fre-
quency distribution of these randomly generated For both the Monte Carlo and the traditional
samples provides an estimation of the sampling (i.e., Fisher) approaches, we need to specify
distribution of the statistic of interest (i.e., r or F). the shape of the distribution under the null
For our example, Figure 4 shows the histogram of hypothesis. The Monte Carlo approach can be
the values of r2 and F obtained for 1,000 random used with any distribution (but we need to spec-
samples of 6 observations each. The horizontal ify which one we want), and the classical
axes represent the different values of r2 (top panel) approach assumes a normal distribution. An
and F (bottom panel) obtained for the 1,000 trials, alternative way to look at a null hypothesis test is
and the vertical axis the number of occurrences of to evaluate whether the pattern of results for
each value of r2 and F. For example, the top panel the experiment is a rare event by comparing it to
shows that 160 samples (of the 1,000 trials) have all the other patterns of results that could
a value of r2 ¼ .01, which was between 0 and .01 have arisen from these data. This is called a
(this corresponds to the first bar of the histogram permutation test or sometimes a randomization
in Figure 4). test.
Figure 4 shows that the number of occurrences This nonparametric approach originated with
of a given value of r2 and F decreases as an inverse Student and Ronald Fisher, who developed the
function of their magnitude: The greater the value, (now standard) F approach because it was possible
the less likely it is to obtain it when there is no cor- then to compute one F but very impractical to
relation in the population (i.e., when the null compute all the Fs for all possible permutations. If
hypothesis is true). However, Figure 4 shows also Fisher could have had access to modern compu-
that the probability of obtaining a large value of r2 ters, it is likely that permutation tests would be the
or F is not null. In other words, even when the null standard procedure.
hypothesis is true, very large values of r2 and F can So, in order to perform a permutation test, we
be obtained. need to evaluate the probability of finding the
From now on, this entry focuses on the F distri- value of the statistic of interest (e.g., r or F) that
bution, but everything also applies to the r2 distri- we have obtained, compared with all the values
bution. After the sampling distribution has been we could have obtained by permuting the values
obtained, the Monte Carlo procedure follows the of the sample. For our example, we have six obser-
same steps as the classical approach. Specifically, if vations, and therefore there are
the p value for the criterion is smaller than the
chosen α level, the null hypothesis can be rejected. 6! ¼ 6 × 5 × 4 × 3 × 2 ¼ 720
Equivalently, a value of F larger than the α-level
critical value leads one to reject the null hypothesis different possible patterns of results. Each of these
for this α level. patterns corresponds to a given permutation of the
Coefficients of Correlation, Alienation, and Determination 177

r 2: Permutation test for ν1 = 1, ν2 = 4


comparing Figure 5, where we have plotted the
200
permutation histogram for F, with Figure 3, where
we have plotted the Fisher distribution.
# of Samples

150
When the number of observations is small (as is
100
the case for this example with six observations), it
50
is possible to compute all the possible permutations.
0 In this case we have an exact permutation test. But
0 0.2 0.4 0.6 0.8 1
Values of r 2
the number of permutations grows very fast when
the number of observations increases. For example,
F: Permutation test for ν1 = 1, ν2 = 4
600
with 20 observations the total number of permuta-
F = 1.33 tions is close to 2.4 × 1018 (this is a very big num-
# of Samples

p = .306
400 ber!). Such large numbers obviously prohibit
200
α = .05 computing all the permutations. Therefore, for sam-
Fcritical = 7.7086
ples of large size, we approximate the permutation
0 test by using a large number (say 10,000 or
0 5 10 15 20 25 30 35 40
100,000) of random permutations (this approach is
Values of F
sometimes called a Monte Carlo permutation test).

Figure 5 Histogram of F Values Computed From the Confidence Intervals


6! ¼ 720 Possible Permutations of the Six
Scores of the Example
Classical Approach
The value of r computed from a sample is an esti-
mation of the correlation of the population from
data. For instance, here is a possible permutation which the sample was obtained. Suppose that we
of the results for our example: obtain a new sample from the same population and
that we compute the value of the coefficient of corre-
W1 ¼ 1; W2 ¼ 3; W3 ¼ 4; W4 ¼ 4; W5 ¼ 5; W6 ¼ 7 lation for this new sample. In what range is this
Y1 ¼ 8; Y2 ¼ 10; Y3 ¼ 16; Y4 ¼ 12; Y5 ¼ 10; Y6 ¼ 4: value likely to fall? This question is answered by
computing the confidence interval of the coefficient
(Note that we need to permute just one of the of correlation. This gives an upper bound and
two series of numbers; here we permuted Y). This a lower bound between which the population coeffi-
permutation gives a value of rW.Y ¼ .30 and of cient of correlation is likely to stand. For example,
r2WY ¼ :09: We computed the value of rW.Y for the we want to specify the range of values of rW.Y in
remaining 718 permutations. The histogram is plot- which the correlation in the population has a 95%
ted in Figure 5, where, for convenience, we have also chance of falling.
plotted the histogram of the corresponding F values. Using confidence intervals is more general than
For our example, we want to use the permuta- a null hypothesis test because if the confidence inter-
tion test to compute the probability associated val excludes the value 0 then we can reject the null
with r2W:Y ¼ :25: This is obtained by computing hypothesis. But a confidence interval also gives
the proportion of r2W:Y larger than .25. We counted a range of probable values for the correlation. Using
220 r2W:Y out of 720 larger or equal to .25; this confidence intervals has another big advantage: We
gives a probability of can act as if we could accept the null hypothesis. In
order to do so, we first compute the confidence
220 interval of the coefficient of correlation and look at
p ¼ ¼ :306: the largest magnitude it can have. If we consider
720
that this value is small, then we can say that even if
It is interesting to note that this value is very close the magnitude of the population correlation is not
to the values found with the two other approaches zero, it is too small to be of interest.
(cf. Fisher distribution p ¼ .313 and Monte Carlo Conversely, we can give more weight to a con-
p. ¼ 310). This similarity is confirmed by clusion if we show that the smallest possible value
178 Coefficients of Correlation, Alienation, and Determination

for the coefficient of correlation will still be large Step 1. Before doing any computation, we need
enough to be impressive. to choose an α level that will correspond to the
The problem of computing the confidence inter- probability of finding the population value of r in
val for r has been explored (once again) by Student the confidence interval. Suppose we chose the
and Fisher. Fisher found that the problem was not value α ¼ :05: This means that we want to obtain
simple but that it could be simplified by transform- a confidence interval such that there is a 95%
ing r into another variable called Z. This transfor- chance, or ð1  αÞ ¼ ð1  :05Þ ¼ :95; of having
mation, which is called Fisher’s Z transform, the population value being in the confidence inter-
creates a new Z variable whose sampling distribu- val that we will compute.
tion is close to the normal distribution. Therefore,
Step 2. Find in the table of the normal distribu-
we can use the normal distribution to compute the
tion the critical values corresponding to the chosen
confidence interval of Z, and this will give a lower
α level. Call this value Zα . The most frequently
and a higher bound for the population values of Z.
used values are
Then we can transform these bounds back into r
values (using the inverse Z transformation), and
Zα¼:10 ¼ 1:645 ðα ¼ :10Þ
this gives a lower and upper bound for the possible
values of r in the population. Zα¼:05 ¼ 1:960 ðα ¼ :05Þ
Zα¼:01 ¼ 2:575 ðα ¼ :01Þ
Fisher’s Z Transform Zα¼001 ¼ 3:325 ðα ¼ 001Þ:
Fisher’s Z transform is applied to a coefficient of
correlation r according to the following formula: Step 3. Transform r into Z using Equation 10.
For the present example, with r ¼ 5, we fnd that
1 Z ¼  0.5493.
Z ¼ ½lnð1 þ rÞ  lnð1  rÞ; ð10Þ
2
Step 4. Compute a quantity called Q as
where ln is the natural logarithm. rffiffiffiffiffiffiffiffiffiffiffi
The inverse transformation, which gives r from 1
Z, is obtained with the following formula: Q ¼ Zα × :
S3
expf2 × Zg  1 For our example we obtain
r ¼ ; ð11Þ
expf2 × Zg þ 2 rffiffiffiffiffiffiffiffiffiffiffiffi rffiffiffi
1 1
where expfxg means to raise the number e to the Q ¼ Z:05 × ¼ 1:960 × ¼ 1:1316:
63 3
power fxg (i.e., expfxg ¼ ex and e is Euler’s con-
stant, which is approximately 2.71828). Most Step 5. Compute the lower and upper limits for
hand calculators can be used to compute both Z as
transformations.
Fisher showed that the new Z variable has a sam- Lower Limit ¼ Zlower ¼ Z  Q
pling distribution that is normal, with a mean of ¼ 0:5493  1:1316 ¼ 1:6809
0 and a variance of S  3. From this distribution we
can compute directly the upper and lower bounds
of Z and then transform them back into values of r. Upper Limit ¼ Zupper ¼ Z þ Q
¼ 0:5493 þ 1:1316 ¼ 0:5823:
Example
The computation of the confidence interval for Step 6. Transform Zlower and Zupper into rlower
the coefficient of correlation is illustrated using and rupper. This is done with the use of Equation
the previous example, in which we computed a coef- 11. For the present example, we find that
ficient of correlation of r ¼ .5 on a sample made
Lower Limit ¼ rlower ¼ :9330
of S ¼ 6 observations. The procedure can be decom-
posed into six steps, which are detailed next. Upper Limit ¼ rupper ¼ :5243:
Coefficients of Correlation, Alienation, and Determination 179

r: Bootstrap sampling distribution population of interest in order to estimate the sam-


35 pling distribution of a statistic computed on the
30 sample. Practically this means that in order to esti-
mate the sampling distribution of a statistic, we
25 just need to create bootstrap samples obtained by
drawing observations with replacement (whereby
# of Samples

20 each observation is put back into the sample after


it has been drawn) from the original sample. The
15
distribution of the bootstrap samples is taken as
the population distribution. Confidence intervals
10
are then computed from the percentile of this
5 distribution.
For our example, the first bootstrap sample that
0 we obtained comprised the following observations
−1 −0.5 0 0.5 1 (note that some observations are missing and some
Values of r are repeated as a consequence of drawing with
replacement):
Figure 6 Histogram of rW:Y Values Computed From
s1 ¼ observation 5,
1,000 Bootstrapped Samples Drawn With
Replacement From the Data From Our s2 ¼ observation 1,
Example s3 ¼ observation 3,
s4 ¼ observation 2,
The range of possible values of r is very large: the s4 ¼ observation 3,
value of the coefficient of correlation that we have
computed could come from a population whose s6 ¼ observation 6:
correlation could have been as low as rlower ¼
This gives the following values for the first
:9330 or as high as rupper ¼.5243. Also, because
bootstrapped sample obtained by drawing with
zero is in the range of possible values, we cannot
replacement from our example:
reject the null hypothesis (which is the conclusion
reached with the null hypothesis tests). W1 ¼ 5; W2 ¼ 1; W3 ¼ 4; W4 ¼ 3; W5 ¼ 4; W6 ¼ 7
It is worth noting that because the Z transfor-
mation is nonlinear, the confidence interval is not Y1 ¼ 8; Y2 ¼ 16; Y3 ¼ 12; Y4 ¼ 10; Y5 ¼ 12; Y6 ¼ 10:
symmetric around r.
This bootstrapped sample gives a correlation of
Finally, current statistical practice recommends
rW.Y ¼ .73.
the routine use of confidence intervals because this
If we repeat the bootstrap procedure for 1,000
approach is more informative than null hypothesis
samples, we obtain the sampling distribution of
testing.
rW.Y as shown in Figure 6. From this figure, it is
obvious that the value of rW.Y varies a lot with
such a small sample (in fact, it covers the whole
Efron’s Bootstrap
range of possible values, from  1 to þ 1). In
A modern Monte Carlo approach for deriving con- order to find the upper and the lower limits of
fidence intervals was proposed by Bradley Efron. a confidence interval, we look for the correspond-
This approach, called the bootstrap, was probably ing percentiles. For example, if we select a value of
the most important advance for inferential statis- α ¼ :05; we look at the values of the boot-
tics in the second part of the 20th century. strapped distribution corresponding to the 2.5th
The idea is simple but could be implemented and the 97.5th percentiles. In our example, we find
only with modern computers, which explains why that 2.5% of the values are smaller than  .9487
it is a recent development. With the bootstrap and that 2.5% of the values are larger than .4093.
approach, we treat the sample as if it were the Therefore, these two values constitute the lower
180 Cohen’s d Statistic

and the upper limits of the 95% confidence inter- simplified computational formulas). Specifically,
val of the population estimation of rW.Y (cf. the when both variables are ranks (or transformed
values obtained with Fisher’s Z transform of into ranks), we obtain the Spearman rank correla-
 .9330 and .5243). Contrary to Fisher’s Z trans- tion coefficient (a related transformation will pro-
form approach, the bootstrap limits are not depen- vide the Kendall rank correlation coefficient);
dent on assumptions about the population or its when both variables are dichotomous (i.e., they
parameters (but it is comforting to see that these take only the values 0 and 1), we obtain the phi
two approaches concur for our example). Because coefficient of correlation; and when only one of
the value of 0 is in the confidence interval of rW.Y, the two variables is dichotomous, we obtain the
we cannot reject the null hypothesis. This shows point-biserial coefficient.
once again that the confidence interval approach
provides more information than the null hypothe- Hervé Abdi and Lynne J. Williams
sis approach.
See also Coefficient of Concordance; Confidence
Intervals
Shrunken and Adjusted r
The coefficient of correlation is a descriptive statis- Further Readings
tic that always overestimates the population corre-
lation. This problem is similar to the problem of Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J.
(2009). Experimental design and analysis for
the estimation of the variance of a population
psychology. Oxford, UK: Oxford University Press.
from a sample. In order to obtain a better estimate Cohen, J., & Cohen, P. (1983) Applied multiple
of the population, the value r needs to be cor- regression/correlation analysis for the social sciences.
rected. The corrected value of r goes under differ- Hillsdale, NJ: Lawrence Erlbaum.
ent names: corrected r, shrunken r, or adjusted r Darlington, R. B. (1990). Regression and linear models.
(there are some subtle differences between these New York: McGraw-Hill.
different appellations, but we will ignore them Edwards, A. L. (1985). An introduction to linear
here) and denote it by ~r2 : Several correction for- regression and correlation. New York: Freeman.
mulas are available. The one most often used esti- Pedhazur, E. J. (1997). Multiple regression in behavioral
mates the value of the population correlation as research. New York: Harcourt Brace.

  
2 2 S1
~r ¼ 1  ð1  r Þ : ð12Þ
S2
COHEN’S d STATISTIC
For our example, this gives
     Cohen’s d statistic is a type of effect size. An effect
S1 5
2 2
~r ¼ 1 ð1  r Þ ¼ 1  ð1  :25Þ × size is a specific numerical nonzero value used to
S2 4 represent the extent to which a null hypothesis is
 
5 false. As an effect size, Cohen’s d is typically used
¼ 1  :75 × ¼ 0:06:
4 to represent the magnitude of differences between
two (or more) groups on a given variable, with
With this formula, we find that the estimation of larger values representing a greater differentiation
the population
pffiffiffiffi correlation
pffiffiffiffiffiffiffi drops from r ¼ .50 to between the two groups on that variable. When
~r2 ¼  ~r2 ¼  :06 ¼ :24: comparing means in a scientific study, the report-
ing of an effect size such as Cohen’s d is considered
Particular Cases of the complementary to the reporting of results from
a test of statistical significance. Whereas the test of
Coefficient of Correlation
statistical significance is used to suggest whether
Mostly for historical reasons, some specific cases a null hypothesis is true (no difference exists
of the coefficient of correlation have their own between Populations A and B for a specific phe-
names (in part because these special cases lead to nomenon) or false (a difference exists between
Cohen’s d Statistic 181

Populations A and B for a specific phenomenon), The population means are replaced with sample
the calculation of an effect size estimate is used to means (Y j ), and the population standard deviation
represent the degree of difference between the two is replaced with Sp, the pooled standard deviation
populations in those instances for which the null from the sample. The pooled standard deviation is
hypothesis was deemed false. In cases for which derived by weighing the variance around each
the null hypothesis is false (i.e., rejected), the sample mean by the respective sample size.
results of a test of statistical significance imply that
reliable differences exist between two populations
on the phenomenon of interest, but test outcomes Calculation of the Pooled Standard Deviation
do not provide any value regarding the extent of
that difference. The calculation of Cohen’s d and Although computation of the difference in sam-
its interpretation provide a way to estimate the ple means is straightforward in Equation 2, the
actual size of observed differences between two pooled standard deviation may be calculated in
groups, namely, whether the differences are small, a number of ways. Consistent with the traditional
medium, or large. definition of a standard deviation, this statistic
may be computed as

Calculation of Cohen’s d Statistic sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi


P
ðnj  1Þs2j
Sp ¼ P , ð3Þ
Cohen’s d statistic is typically used to estimate ðnj Þ
between-subjects effects for grouped data, consis-
tent with an analysis of variance framework.
where nj represents the sample sizes for j groups
Often, it is employed within experimental contexts
and s2j represents the variance (i.e., squared stan-
to estimate the differential impact of the experi-
dard deviation) of the j samples. Often, however,
mental manipulation across conditions on the
the pooled sample standard deviation is corrected
dependent variable of interest. The dependent vari-
for bias in its estimation of the corresponding
able must represent continuous data; other effect
population parameter, σ ε . Equation 4 denotes this
size measures (e.g., Pearson family of correlation
correction of bias in the sample statistic (with the
coefficients, odds ratios) are appropriate for non-
resulting effect size often referred to as Hedge’s g):
continuous data.
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P
ðnj  1Þs2j
Sp ¼ P : ð4Þ
General Formulas ðnj  1Þ
Cohen’s d statistic represents the standardized
mean differences between groups. Similar to other When simply computing the pooled standard
means of standardization such as z scoring, the deviation across two groups, this formula may be
effect size is expressed in standard score units. In reexpressed in a more common format. This for-
general, Cohen’s d is defined as mula is suitable for data analyzed with a two-way
analysis of variance, such as a treatment–control
μ1  μ2 contrast:
d ¼ , ð1Þ
σε
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
where d represents the effect size, μ1 and μ2 rep- ðn1  1Þðs21 Þ þ ðn2  1Þðs22 Þ
Sp ¼
resent the two population means, and σ ε repre- ðn1  1Þ þ ðn2  1Þ
sents the pooled within-group population sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð5Þ
standard deviation. In practice, these population ðn1  1Þðs21 Þ þ ðn2  1Þðs22 Þ
¼ :
parameters are typically unknown and estimated ðn1 þ n2  2Þ
by means of sample statistics:
The formula may be further reduced to the
^ ¼ Y1  Y2 :
d ð2Þ average of the sample variances when sample sizes
Sp are equal:
182 Cohen’s d Statistic

sffiffiffiffi
by the pooled standard deviation across the j
s2j
Sp ¼ : ð6Þ repeated measures. The same formula may also be
j applied to simple contrasts within repeated mea-
sures designs, as well as interaction contrasts in
or
mixed (between- and within-subjects factors) or
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi split-plot designs. Note, however, that the simple
s21 þ s22
Sp ¼ : ð7Þ application of the pooled standard deviation for-
2 mula does not take into account the correlation
in the case of two groups. between repeated measures. Researchers disagree
Other means of specifying the denominator for as to whether these correlations ought to contrib-
Equation 2 are varied. Some formulas use the aver- ute to effect size computation; one method of
age standard deviation across groups. This proce- determining Cohen’s d while accounting for the
dure disregards differences in sample size in cases correlated nature of repeated measures involves
of unequal n when one is weighing sample var- computing d from a paired t test.
iances and may or may not correct for sample bias
in estimation of the population standard deviation.
Further formulas employ the standard deviation of Additional Means of Calculation
the control or comparison condition (an effect size Beyond the formulas presented above, Cohen’s
referred to as Glass’s ). This method is particu- d may be derived from other statistics, including
larly suited when the introduction of treatment or the Pearson family of correlation coefficients (r), t
other experimental manipulation leads to large tests, and F tests. Derivations from r are particu-
changes in group variance. Finally, more complex larly useful, allowing for translation among vari-
formulas are appropriate when calculating Cohen’s ous effect size indices. Derivations from other
d from data involving cluster randomized or statistics are often necessary when raw data to
nested research designs. The complication partially compute Cohen’s d are unavailable, such as when
arises because of the three available variance statis- conducting a meta-analysis of published data.
tics from which the pooled standard deviation When d is derived as in Equation 3, the following
may be computed: the within-cluster variance, the formulas apply:
between-cluster variance, or the total variance
(combined between- and within-cluster variance). 2r
Researchers must select the variance statistic d ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi , ð8Þ
1  r2
appropriate for the inferences they wish to draw.

tðn1 þ n2 Þ
Expansion Beyond Two-Group Comparisons: d ¼ pffiffiffiffiffipffiffiffiffiffiffiffiffi , ð9Þ
Contrasts and Repeated Measures df n1 n2

Cohen’s d always reflects the standardized dif- and


ference between two means. The means, however, pffiffiffi
are not restricted to comparisons of two indepen- Fðn1 þ n2 Þ
d ¼ pffiffiffiffiffipffiffiffiffiffiffiffiffi : ð10Þ
dent groups. Cohen’s d may also be calculated in df n1 n2
multigroup designs when a specific contrast is of
interest. For example, the average effect across Note that Equation 10 applies only for F tests
two alternative treatments may be compared with with 1 degree of freedom (df) in the numerator;
a control. The value of the contrast becomes the further formulas apply when df > 1.
numerator as specified in Equation 2, and the When d is derived as in Equation 4, the follow-
pooled standard deviation is expanded to include ing formulas ought to be used:
all j groups specified in the contrast (Equation 4).
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
A similar extension of Equations 2 and 4 may
2r df ðn1 þ n2 Þ
be applied to repeated measures analyses. The dif- d ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi ð11Þ
ference between two repeated measures is divided 1r 2 n1 n2
Cohen’s d Statistic 183

sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
  (CI) for the statistic to determine statistical
1 1
d ¼ t þ ð12Þ significance:
n1 n2
CI ¼ d ± zðsd Þ: ð19Þ
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 ffi
1 1 The z in the formula corresponds to the z-score
d ¼ F þ : ð13Þ
n1 n2 value on the normal distribution corresponding to
the desired probability level (e.g., 1.96 for a 95%
Again, Equation 13 applies only to instances in CI). Variances and CIs may also be obtained
which the numerator df ¼ 1. through bootstrapping methods.
These formulas must be corrected for the corre-
lation (r) between dependent variables in repeated
measures designs. For example, Equation 12 is cor- Interpretation
rected as follows: Cohen’s d, as a measure of effect size, describes the
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
  overlap in the distributions of the compared sam-
ð1  rÞ ð1  rÞ ples on the dependent variable of interest. If the
d ¼ t þ : ð14Þ
n1 n2 two distributions overlap completely, one would
expect no mean difference between them (i.e.,
Finally, conversions between effect sizes com- Y 1  Y 2 ¼ 0). To the extent that the distributions
puted with Equations 3 and 4 may be easily do not overlap, the difference ought to be greater
accomplished: than zero (assuming Y 1 > Y 2 ).
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Cohen’s d may be interpreted in terms of both
n1 þ n2 statistical significance and magnitude, with the lat-
deq3 ¼ deq4 ð15Þ
ðn1 þ n2  2Þ ter the more common interpretation. Effect sizes
are statistically significant when the computed CI
and
does not contain zero. This implies less than perfect
deq3 overlap between the distributions of the two groups
deq4 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð16Þ compared. Moreover, the significance testing
n1 þ n2
ðn1 þ n2 2Þ implies that this difference from zero is reliable, or
not due to chance (excepting Type I errors). While
Variance and Confidence Intervals significance testing of effect sizes is often under-
The estimated variance of Cohen’s d depends on taken, however, interpretation based solely on sta-
how the statistic was originally computed. When tistical significance is not recommended. Statistical
sample bias in the estimation of the population significance is reliant not only on the size of the
pooled standard deviation remains uncorrected effect but also on the size of the sample. Thus, even
(Equation 3), the variance is computed in the fol- large effects may be deemed unreliable when insuf-
lowing manner: ficient sample sizes are utilized.
Interpretation of Cohen’s d based on the magni-
  
n1 þ n2 d2 n1 þ n2 tude is more common than interpretation based on
sd ¼ þ : statistical significance of the result. The magnitude
n1 n2 2ðn1 þ n2  2Þ n1 þ n2  2
of Cohen’s d indicates the extent of nonoverlap
ð17Þ
between two distributions, or the disparity of the
A simplified formula is employed when sample mean difference from zero. Larger numeric values
bias is corrected as in Equation 4: of Cohen’s d indicate larger effects or greater
differences between the two means. Values may be
2 positive or negative, although the sign merely
n1 þ n2 d
sd ¼ þ : ð18Þ indicates whether the first or second mean in the
n1 n2 2ðn1 þ n2  2Þ
numerator was of greater magnitude (see Equation
Once calculated, the effect size variance 2). Typically, researchers choose to subtract the
may be used to compute a confidence interval smaller mean from the larger, resulting in a positive
184 Cohen’s d Statistic

effect size. As a standardized measure of effect, the to Cohen’s rules of thumb have been proposed.
numeric value of Cohen’s d is interpreted in stan- These include comparisons with effects sizes based
dard deviation units. Thus, an effect size of d ¼ 0.5 on (a) normative data concerning the typical
indicates that two group means are separated by growth, change, or differences between groups
one-half standard deviation or that one group prior to experimental manipulation; (b) those
shows a one-half standard deviation advantage obtained in similar studies and available in the pre-
over the other. vious literature; (c) the gain necessary to attain an
The magnitude of effect sizes is often described a priori criterion; and (d) cost–benefit analyses.
nominally as well as numerically. Jacob Cohen
defined effects as small (d ¼ 0.2), medium
Cohen’s d in Meta-Analyses
(d ¼ 0.5), or large (d ¼ 0.8). These rules of thumb
were derived after surveying the behavioral Cohen’s d, as a measure of effect size, is often used
sciences literature, which included studies in vari- in individual studies to report and interpret the
ous disciplines involving diverse populations, inter- magnitude of between-group differences. It is also
ventions or content under study, and research a common tool used in meta-analyses to aggregate
designs. Cohen, in proposing these benchmarks in effects across different studies, particularly in
a 1988 text, explicitly noted that they are arbitrary meta-analyses involving study of between-group
and thus ought not be viewed as absolute. How- differences, such as treatment studies. A meta-
ever, as occurred with use of .05 as an absolute cri- analysis is a statistical synthesis of results from
terion for establishing statistical significance, independent research studies (selected for inclusion
Cohen’s benchmarks are oftentimes interpreted as based on a set of predefined commonalities), and
absolutes, and as a result, they have been criticized the unit of analysis in the meta-analysis is the data
in recent years as outdated, atheoretical, and inher- used for the independent hypothesis test, including
ently nonmeaningful. These criticisms are espe- sample means and standard deviations, extracted
cially prevalent in applied fields in which medium- from each of the independent studies. The statisti-
to-large effects prove difficult to obtain and smal- cal analyses used in the meta-analysis typically
ler effects are often of great importance. The small involve (a) calculating the Cohen’s d effect size
effect of d ¼ 0.07, for instance, was sufficient for (standardized mean difference) on data available
physicians to begin recommending aspirin as an within each independent study on the target vari-
effective method of preventing heart attacks. Simi- able(s) of interest and (b) combining these individ-
lar small effects are often celebrated in interven- ual summary values to create pooled estimates by
tion and educational research, in which effect sizes means of any one of a variety of approaches (e.g.,
of d ¼ 0.3 to d ¼ 0.4 are the norm. In these fields, Rebecca DerSimonian and Nan Laird’s random
the practical importance of reliable effects is often effects model, which takes into account variations
weighed more heavily than simple magnitude, as among studies on certain parameters). Therefore,
may be the case when adoption of a relatively sim- the methods of the meta-analysis may rely on use
ple educational approach (e.g., discussing vs. not of Cohen’s d as a way to extract and combine data
discussing novel vocabulary words when reading from individual studies. In such meta-analyses, the
storybooks to children) results in effect sizes of reporting of results involves providing average
d ¼ 0.25 (consistent with increases of one-fourth d values (and CIs) as aggregated across studies.
of a standard deviation unit on a standardized In meta-analyses of treatment outcomes in the
measure of vocabulary knowledge). social and behavioral sciences, for instance, effect
Critics of Cohen’s benchmarks assert that such estimates may compare outcomes attributable to
practical or substantive significance is an impor- a given treatment (Treatment X) as extracted from
tant consideration beyond the magnitude and sta- and pooled across multiple studies in relation to
tistical significance of effects. Interpretation of an alternative treatment (Treatment Y) for Out-
effect sizes requires an understanding of the con- come Z using Cohen’s d (e.g., d ¼ 0.21, CI ¼ 0.06,
text in which the effects are derived, including the 1.03). It is important to note that the meaningful-
particular manipulation, population, and depen- ness of this result, in that Treatment X is, on aver-
dent measure(s) under study. Various alternatives age, associated with an improvement of about
Cohen’s f Statistic 185

one-fifth of a standard deviation unit for Outcome means increases relative to the average standard
Z relative to Treatment Y, must be interpreted in deviation within each group. Jacob Cohen has sug-
reference to many factors to determine the actual gested that the values of 0.10, 0.25, and 0.40
significance of this outcome. Researchers must, at represent small, medium, and large effect sizes,
the least, consider whether the one-fifth of a stan- respectively.
dard deviation unit improvement in the outcome
attributable to Treatment X has any practical
Calculation
significance.
Cohen’s f is calculated as
Shayne B. Piasta and Laura M. Justice
f ¼ σ m =σ; ð1Þ
See also Analysis of Variance (ANOVA); Effect Size,
Measures of; Mean Comparisons; Meta-Analysis;
where σ m is the standard deviation (SD) of popula-
Statistical Power Analysis for the Behavioral Sciences
tion means (mi) represented by the samples and σ is
the common within-population SD; σ ¼ MSE1=2 .
Further Readings MSE is the mean square of error (within groups)
from the overall ANOVA F test. It is based on the
Cohen, J. (1988). Statistical power analysis for the deviation of the population means from the mean of
behavioral sciences (2nd ed.). Mahwah, NJ: Lawrence
the combined populations or the mean of the means
Erlbaum.
Cooper, H., & Hedges, L. V. (1994). The handbook of
(M).
research synthesis. New York: Russell Sage hX i1=2
Foundation. σm ¼ ðmi  MÞ2=k ð2Þ
Hedges, L. V. (2007). Effect sizes in cluster-randomized
designs. Journal of Educational & Behavioral
Statistics, 32, 341–370. for equal sample sizes and
Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. hX i1=2
(2007, July). Empirical benchmarks for interpreting σm ¼ ni ðmi  MÞ2=N ð3Þ
effect sizes in research. New York: MDRC.
Ray, J. W., & Shadish, W. R. (1996). How
interchangeable are different estimators of effect size? for unequal sample sizes.
Journal of Consulting & Clinical Psychology, 64,
1316–1325.
Wilkinson, L., & APA Task Force on Statistical Inference. Examples
(1999). Statistical methods in psychology journals: Example 1
Guidelines and explanations. American Psychologist,
54, 594–604. Table 1 provides descriptive statistics for a study
with four groups and equal sample sizes. ANOVA
results are shown. The calculations below result in
an estimated f effect size of .53, which is consid-
COHEN’S f STATISTIC ered large by Cohen standards. An appropriate
interpretation is that about 50% of the variance in
Effect size is a measure of the strength of the rela- the dependent variable (physical health) is
tionship between variables. Cohen’s f statistic is explained by the independent variable (presence or
one appropriate effect size index to use for a one- absence of mental or physical illnesses at age 16).
way analysis of variance (ANOVA). Cohen’s f is hX i1=2
a measure of a kind of standardized average effect σm ¼ ðmi  MÞ2=k ¼ ½ðð71:88  62:74Þ2
in the population across all the levels of the inde-
pendent variable. þ ð66:08  62:74Þ2 þ ð58:44  62:74Þ2
Cohen’s f can take on values between zero, þ 54:58  62:74Þ Þ=4
2 1=2
¼ 6:70
when the population means are all equal, and an
indefinitely large number as standard deviation of f ¼ σ m =σ ¼ 6:70=161:291=2 ¼ 6:70=12:7 ¼ 0:53
186 Cohen’s f Statistic

Table 1 Association of Mental Disorders and Physical Illnesses at a Mean Age of 16 Years With Physical Health at
a Mean Age of 33 Years: Equal Sample Sizes
Group n M SD
Reference group (no disorders) 80 71.88 5.78
Mental disorder only 80 66.08 19.00
Physical illness only 80 58.44 8.21
Physical illness and mental disorder 80 54.58 13.54
Total 320 62.74 14.31

Analysis of Variance
Source Sum of Squares df Mean Square F p
Between groups 14,379.93 3 4,793.31 29.72 0.00
Within groups 50,967.54 316 161.29
Total 65,347.47 319
Source: Adapted from Chen, H., Cohen, P., Kasen, S., Johnson, J. G., Berenson, K., & Gordon, K. (2006). Impact of
adolescent mental disorders and physical illnesses on quality of life 17 years later. Archives of Pediatrics & Adolescent
Medicine, 160, 93–99.

Table 2 Association of Mental Disorders and Physical Illnesses at a Mean Age of 16 Years With Physical Health at
a Mean Age of 33 Years: Unequal Sample Sizes
Group n M SD
Reference group (no disorders) 256 72.25 17.13
Mental disorder only 89 68.16 21.19
Physical illness only 167 66.68 18.58
Physical illness and mental disorder 96 57.67 18.86
Total 608 67.82 19.06

Analysis of Variance
Source Sum of Squares df Mean Square F p
Between groups 15,140.20 3 5,046.73 14.84 0.00
Within groups 205,445.17 604 340.14
Total 220,585.37 607
Source: Adapted from Chen, H., Cohen, P., Kasen, S., Johnson, J. G., Berenson, K., & Gordon, K. (2006). Impact of
adolescent mental disorders and physical illnesses on quality of life 17 years later. Archives of Pediatrics & Adolescent
Medicine, 160, 93–99.
Note: Adapted from total sample size of 608 by choosing 80 subjects for each group.

Example 2 (presence or absence of mental or physical ill-


nesses at age 16).
Table 2 provides descriptive statistics for
a study very similar to the one described in
σ m ¼ ½ ni ðmi  MÞ2=N1=2 ¼ ½ð256 * ð72:25
Example 1. There are four groups, but the sam-
ple sizes are unequal. ANOVA results are shown.  67:82Þ2 þ 89 * ð68:16  67:82Þ2
The calculations for these samples result in an þ 167 * ð66:68  67:82Þ2 þ 96 * ð57:67
estimated f effect size of .27, which is considered
medium by Cohen standards. For this study, an  67:82Þ2 Þ=6081=2 ¼ 4:99
appropriate interpretation is that about 25% of
the variance in the dependent variable (physical
health) is explained by the independent variable f ¼ σ m =σ ¼ 4:99=340:141=2 ¼ 4:99=18:44 ¼ 0:27
Cohen’s Kappa 187

Cohen’s f and d
Cohen’s f is an extension of Cohen’s d, which is
COHEN’S KAPPA
the appropriate measure of effect size to use for a t
test. Cohen’s d is the difference between two group Cohen’s Kappa coefficient () is a statistical mea-
means divided by the pooled SD for the two sure of the degree of agreement or concordance
groups. The relationship between f and d when between two independent raters that takes into
one is comparing two means (equal sample sizes) account the possibility that agreement could occur
is d ¼ 2f. If Cohen’s f ¼ 0.1, the SD of kðk ≥ 2Þ by chance alone.
population means is one tenth as large as the SD Like other measures of interrater agreement, 
of the observations within the populations. For is used to assess the reliability of different raters or
k ¼ two populations, this effect size indicates measurement methods by quantifying their consis-
a small difference between the two populations: tency in placing individuals or items in two or
d ¼ 2f ¼ 2 * 0.10 ¼ 0.2. more mutually exclusive categories. For instance,
Cohen’s f in Equation 1 is positively biased in a study of developmental delay, two pediatri-
because the sample means in Equation 2 or 3 are cians may independently assess a group of toddlers
likely to vary more than do the population means. and classify them with respect to their language
One can use the following equation from Scott development into either ‘‘delayed for age’’ or ‘‘not
Maxwell and Harold Delaney to calculate an delayed.’’ One important aspect of the utility of
adjusted Cohen’s f: this classification is the presence of good agree-
ment between the two raters. Agreement between
1=2 two raters could be simply estimated as the per-
fadj ¼ ½ðk  1ÞðF  1Þ=N ð4Þ centage of cases in which both raters agreed. How-
ever, a certain degree of agreement is expected by
Applying Equation 4 to the data in Table 1 chance alone. In other words, two raters could still
yields agree on some occasions even if they were ran-
domly assigning individuals into either category.
fadj ¼ ½ðk  1ÞðF  1Þ=N
1=2 In situations in which there are two raters and
the categories used in the classification system have
¼ ½ð4  1Þð29:72  1Þ=3201=2 ¼ 0:52: no natural order (e.g., delayed vs. not delayed;
present vs. absent), Cohen’s  can be used to quan-
tify the degree of agreement in the assignment of
For Table 2,
these categories beyond what would be expected
by random guessing or chance alone.
1=2
fadj ¼ ½ðk  1ÞðF  1Þ=N

¼ ½ð4  1Þð14:84  1Þ=6081=2 ¼ 0:26: Calculation


Specifically,  can be calculated using the follow-
Sophie Chen and Henian Chen ing equation:
po  pe
See also Cohen’s d Statistic; Effect Size, Measures of  ¼ ;
1  pe

Further Readings where Po is the proportion of the observed agree-


ment between the two raters, and Pe is the propor-
Cohen, J. (1988). Statistical power analysis for the
tion of rater agreement expected by chance alone.
behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence
Erlbaum.
A  of þ 1 indicates complete agreement, whereas
Maxwell, S. E., & Delaney, H. D. (2004). Designing a  of 0 indicates that there is no agreement
experiments and analyzing data: A model comparison between the raters beyond that expected by ran-
perspective (2nd ed.). Mahwah, NJ: Lawrence dom guessing or chance alone. A negative  indi-
Erlbaum. cates that the agreement was less than expected by
188 Cohen’s Kappa

Table 1 Data From the Hypothetical Study Described in the Text: Results of Assessments of Developmental Delay
Made by Two Pediatricians
Rater 2

Rater 1 Delayed Not Delayed Total


Delayed 20 3 23
Not delayed 7 70 77
Total 27 73 100

Table 2 Landis and Koch Interpretation of Cohen’s  questions that may arise in a reliability study. For
Cohen’s Kappa Degree of Agreement instance, it might be of interest to determine
< 0.20 Poor whether disagreement between the two pediatri-
0.21–0.40 Fair cians in the above example was more likely to
0.41–0.60 Moderate occur when diagnosing developmental delay than
0.61–0.80 Good when diagnosing normal development or vice
0.81–1.00 Very good versa. However,  cannot be used to address this
Source: Landis & Koch, 1977.
question, and alternative measures of agreement
are needed for that purpose.

chance, with  of  1.0 indicating perfect dis-


agreement beyond what would be expected by Limitations
chance. It has been shown that the value of  is sensitive to
To illustrate the use of the equation, let us the prevalence of the trait or condition under
assume that the results of the assessments made by investigation (e.g., developmental delay in the
the two pediatricians in the above-mentioned above example) in the study population. Although
example are as shown in Table 1. The two raters two raters may have the same degree of agree-
agreed on the classification of 90 toddlers (i.e., Po ment, estimates of  might be different for a popu-
is 0.90). To calculate the probability of the lation in which the trait under study is very
expected agreement (Pe), we first calculate the common compared with another population in
probability that both raters would have classified which the trait is less prevalent.
a toddler as delayed if they were merely randomly Cohen’s  does not take into account the seri-
classifying toddlers to this category. This could be ousness of the disagreement. For instance, if a trait
obtained by multiplying the marginal probabilities is rated on a scale that ranges from 1 to 5, all dis-
of the delayed category, that is, ð23 ‚ 100Þ × agreements are treated equally, whether the raters
ð27 ‚ 100Þ ¼ 0:062: Similarly, the probability that disagreed by 1 point (e.g., 4 vs. 5) or by 4 points
both raters would have randomly classified a tod- on the scale (e.g., 1 vs. 5). Cohen introduced
dler as not delayed is ð77 ‚ 100Þ × ð73 ‚ 100Þ ¼ weighted  for use with such ordered scales. By
0:562: Therefore, the total agreement expected by assigning weights to each disagreement pair, it is
chance alone (Pe) is 0.562 þ 0.062 ¼ 0.624. Using possible to incorporate a measure of the serious-
the equation,  is equal to 0.73. ness of the disagreement, for instance by assigning
Richard Landis and Gary Koch have proposed larger weights to more severe disagreement.
the following interpretation for estimates of  Both  and weighted  are limited to cases in
(Table 2). Although arbitrary, this classification is which there are only two raters. Alternative mea-
widely used in the medical literature. According to sures of agreements (e.g., Fleiss ) can be used in
this classification, the agreement between the two situations in which there are more than two raters
pediatricians in the above example is ‘‘good.’’ and in which study subjects are not necessarily
It should be noted that  is a summary measure always rated by the same pair of raters.
of the agreement between two raters and cannot
therefore be used to answer all the possible Salaheddin M. Mahmud
Cohort Design 189

See also Interrater Reliability; Reliability much of the early progress in understanding occu-
pational diseases. Cohort studies based on data
Further Readings derived from company records and vital records
led to the identification of many environmental
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha,
and occupational risk factors. Several major
D. (1999). Beyond kappa: A review of interrater
agreement measures. Canadian Journal of Statistics,
cohort studies with follow-up that spanned dec-
27, 3–23. ades have made significant contributions to our
Cohen, J. (1960). A coefficient of agreement for nominal understanding of the causes of several common
scales. Educational & Psychological Measurement, 1, chronic diseases. Examples include the Framing-
37–46. ham Heart Study, the Tecumseh Community
Fleiss, J. L. (1981). Statistical methods for rates and Health Study, the British Doctors Study, and the
proportions (2nd ed.). New York: Wiley. Nurses’ Health Study.
Gwet, K. (2002). Inter-rater reliability: Dependency on In a classic cohort study, individuals who are
trait prevalence and marginal homogeneity. Statistical initially free of the disease being researched are
Methods for Inter-Rater Reliability Assessment Series,
enrolled into the study, and individuals are each
2, 1–9.
Landis, J. R., & Koch, G. G. (1977). The measurement
categorized into one of two groups according to
of observer agreement for categorical data. Biometrics, whether they have been exposed to the suspected
33, 159–174. risk factor. One group, called the exposed group,
includes individuals known to have the charac-
teristic or risk factor under study. For instance,
in a cohort study of the effect of smoking on
COHORT DESIGN lung cancer, the exposed group may consist of
known smokers. The second group, the unex-
In epidemiology, a cohort design or cohort study is posed group, will comprise a comparable group
a nonexperimental study design that involves com- of individuals who are also free of the disease
paring the occurrence of a disease or condition in initially but are nonsmokers. Both groups are
two or more groups (or cohorts) of people that dif- then followed up for a predetermined period of
fer on a certain characteristic, risk factor, or expo- time or until the occurrence of disease or death.
sure. The disease, state, or condition under study Cases of the disease (lung cancer in this instance)
is often referred to as the outcome, whereas the occurring among both groups are identified in
characteristic, risk factor, or exposure is often the same way for both groups. The number of
referred to as the exposure. A cohort study is one people diagnosed with the disease in the exposed
of two principal types of nonexperimental study group is compared with that among the unex-
designs used to study the causes of disease. The posed group to estimate the relative risk of dis-
other is the case–control design, in which cases of ease due to the exposure or risk factor. This type
the disease under study are compared with respect of design is sometimes called a prospective
to their past exposure with a similar group of indi- cohort study.
viduals who do not have the disease. In a retrospective (historical) cohort study, the
Cohort (from the Latin cohors, originally a unit researchers use existing records or electronic data-
of a Roman legion) is the term used in epidemiol- bases to identify individuals who were exposed at
ogy to refer to a group of individuals who share a certain point in the past and then ‘‘follow’’ them
a common characteristic; for example, they may all up to the present. For instance, to study the effect
belong to the same ethnic or age group or be of exposure to radiation on cancer occurrence
exposed to the same risk factor (e.g., radiation or among workers in a uranium mine, the researcher
soil pollution). may use employee radiation exposure records to
The cohort study is a relatively recent innova- categorize workers into those who were exposed
tion. The first cohort studies were used to confirm to radiation and those who were not at a certain
the link between smoking and lung cancer that date in the past (e.g., 10 years ago). The medical
had been observed initially in earlier case–control records of each employee are then searched to
studies. Cohort studies also formed the basis for identify those employees who were diagnosed with
190 Cohort Design

cancer from that date onward. Like prospective the cohort study design does not usually involve
cohort designs, the frequency of occurrence of the manipulating the exposure under study in any
disease in the exposed group is compared with that way that changes the exposure status of the
within the unexposed group in order to estimate participants.
the relative risk of disease due to radiation expo-
sure. When accurate and comprehensive records
Advantages and Disadvantages
are available, this approach could save both time
and money. But unlike the classic cohort design, in Cohort studies are used instead of experimental
which information is collected prospectively, the study designs, such as clinical trials, when experi-
researcher employing a retrospective cohort design ments are not feasible for practical or ethical rea-
has little control over the quality and availability sons, such as when investigating the effects of
of information. a potential cause of disease.
Cohort studies could also be classified as In contrast to case–control studies, the design of
closed or open cohort studies. In a closed cohort cohort studies is intuitive, and their results are eas-
study, cohort membership is decided at the onset ier to understand by nonspecialists. Furthermore,
of the study, and no additional participants are the temporal sequence of events in a cohort study
allowed to join the cohort once the study starts. is clear because it is always known that the expo-
For example, in the landmark British Doctors sure has occurred before the disease. In case–con-
Study, participants were male doctors who were trol and cross-sectional studies, it is often unclear
registered for medical practice in the United whether the suspected exposure has led to the dis-
Kingdom in 1951. This cohort was followed up ease or the other way around.
with periodic surveys until 2001. This study pro- In prospective cohort studies, the investigator
vided strong evidence for the link between smok- has more control over what information is to be
ing and several chronic diseases, including lung collected and at what intervals. As a result, the
cancer. prospective cohort design is well suited for study-
In an open (dynamic) cohort study, the cohort ing chronic diseases because it permits fuller
membership may change over time as additional understanding of the disease’s natural history.
participants are permitted to join the cohort, and In addition, cohort studies are typically better
they are followed up in a fashion similar to that of than case–control studies in studying rare expo-
the original participants. For instance, in a prospec- sures. For instance, case–control designs are not
tive study of the effects of radiation on cancer practical in studying occupational exposures that
occurrence among uranium miners, newly are rare in the general population, as when expo-
recruited miners are enrolled in the study cohort sure is limited to a small cohort of workers in
and are followed up in the same way as those a particular industry. Another advantage of cohort
miners who were enrolled at the inception of the studies is that multiple diseases and conditions
cohort study. related to the same exposure could be easily exam-
Regardless of their type, cohort studies are dis- ined in one study.
tinguished from other epidemiological study On the other hand, cohort studies tend to be
designs by having all the following features: more expensive and take longer to complete than
other nonexperimental designs. Generally, case–
control and retrospective cohort studies are more
• The study group or groups are observed over
efficient and less expensive than the prospective
time for the occurrence of the study outcome.
• The study group or groups are defined on the
cohort studies. The prospective cohort design is
basis of whether they have the exposure at the
not suited for the study of rare diseases, because
start or during the observation period before the prospective cohort studies require following up
occurrence of the outcome. Therefore, in a large number of individuals for a long time.
a cohort study, it is always clear that the Maintaining participation of study subjects over
exposure has occurred before the outcome. time is a challenge, and selective dropout from the
• Cohort studies are observational or study (or loss to follow-up) may result in biased
nonexperimental studies. Unlike clinical trials, results. Because of lack of randomization, cohort
Cohort Design 191

studies are more potentially subject to bias and information on the exposure(s) under investiga-
confounding than experimental studies are. tion. In addition, information on demographic and
socioeconomic factors (e.g., age, gender, and occu-
pation) is often collected. As in all observational
Design and Implementation
studies, information on potential confounders,
The specifics of cohort study design and implemen- factors associated with both the exposure and out-
tation depend on the aim of the study and the come under study that could confuse the interpre-
nature of the risk factors and diseases under study. tation of the results, is also collected. Depending
However, most prospective cohort studies begin by on the type of the exposure under study, the study
assembling one or more groups of individuals. design may also include medical examinations of
Often members of each group share a well-defined study participants, which may include clinical
characteristic or exposure. For instance, a cohort assessment (e.g., measuring blood pressure), labo-
study of the health effects of uranium exposure ratory testing (e.g., measuring blood sugar levels
may begin by recruiting all uranium miners or testing for evidence for an infection with a cer-
employed by the same company, whereas a cohort tain infectious agent), or radiological examinations
study of the health effects of exposure to soil pol- (e.g., chest x-rays). In some studies, biological
lutants from a landfill may include all people living specimens (e.g., blood or serum specimens) are col-
within a certain distance from the landfill. Several lected and stored for future testing.
major cohort studies have recruited all people born Follow-up procedures and intervals are also
in the same year in a city or province (birth important design considerations. The primary aim
cohorts). Others have included all members of of follow-up is to determine whether participants
a professional group (e.g., physicians or nurses), developed the outcome under study, although
regardless of where they lived or worked. Yet most cohort studies also collect additional infor-
others were based on a random sample of the mation on exposure and confounders to determine
population. Cohort studies of the natural history changes in exposure status (e.g., a smoker who
of disease may include all people diagnosed with quits smoking) and other relevant outcomes (e.g.,
a precursor or an early form of the disease and development of other diseases or death). As with
then followed up as their disease progressed. exposures, the method of collecting information
The next step is to gather information on the on outcomes depends on the type of outcome and
exposure under investigation. Cohort studies can the degree of desired diagnostic accuracy. Often,
be used to examine exposure to external agents, mailed questionnaires and phone interviews are
such as radiation, second-hand smoke, an infec- used to track participants and determine whether
tious agent, or a toxin. But they can also be used they developed the disease under study. For certain
to study the health effects of internal states (e.g., types of outcome (e.g., death or development of
possession of a certain gene), habits (e.g., smoking cancer), existing vital records (e.g., the National
or physical inactivity), or other characteristics Death Index in the United States) or cancer regis-
(e.g., level of income or educational status). The tration databases could be used to identify study
choice of the appropriate exposure measurement participants who died or developed cancer. Some-
method is an important design decision and times, in-person interviews and clinic visits are
depends on many factors, including the accuracy required to accurately determine whether a partici-
and reliability of the available measurement meth- pant has developed the outcome, as in the case of
ods, the feasibility of using these methods to mea- studies of the incidence of often asymptomatic dis-
sure the exposure for all study participants, and eases such as hypertension or HIV infection.
the cost. In certain designs, called repeated measure-
In most prospective cohort studies, baseline ments designs, the above measurements are per-
information is collected on all participants as they formed more than once for each participant.
join the cohort, typically using self-administered Examples include pre-post exposure studies, in
questionnaires or phone or in-person interviews. which an assessment such as blood pressure mea-
The nature of the collected information depends surement is made before and after an intervention
on the aim of the study but often includes detailed such as the administration of an antihypertensive
192 Cohort Design

medication. Pre-post designs are more commonly used in experimental studies or clinical trials, but there are occasions where this design can be used in observational cohort studies. For instance, results of hearing tests performed during routine preemployment medical examinations can be compared with results from hearing tests performed after a certain period of employment to assess the effect of working in a noisy workplace on hearing acuity.

In longitudinal repeated measurements designs, typically two or more exposure (and outcome) measurements are performed over time. These studies tend to be observational and could therefore be carried out prospectively or, less commonly, retrospectively using precollected data. These studies are ideal for the study of complex phenomena such as the natural history of chronic diseases, including cancer. The repeated measurements allow the investigator to relate changes in time-dependent exposures to the dynamic status of the disease or condition under study. This is especially valuable if exposures are transient and may not be measurable by the time the disease is detected. For instance, repeated measurements are often used in longitudinal studies to examine the natural history of cervical cancer as it relates to infections with the human papilloma virus. In such studies, participants are typically followed up for years with prescheduled clinic visits at certain intervals (e.g., every 6 months). At each visit, participants are examined for evidence of infection with human papilloma virus or development of cervical cancer. Because of the frequent testing for these conditions, it is possible to acquire a deeper understanding of the complex sequence of events that terminates with the development of cancer.

One important goal in all cohort studies is to minimize voluntary loss to follow-up due to participants' dropping out of the study or due to researchers' failure to locate and contact all participants. The longer the study takes to complete, the more likely that a significant proportion of the study participants will be lost to follow-up because of voluntary or involuntary reasons (e.g., death, migration, or development of other diseases). Regardless of the reasons, loss to follow-up is costly because it reduces the study's sample size and therefore its statistical power. More important, loss to follow-up could bias the study results if the remaining cohort members differ from those who were lost to follow-up with respect to the exposure under study. For instance, in a study of the effects of smoking on dementia, smoking may misleadingly appear to reduce the risk of dementia because smokers are more likely than nonsmokers to die at a younger age, before dementia could be diagnosed.

Analysis of Data

Compared with other observational epidemiologic designs, cohort studies provide data permitting the calculation of several types of disease occurrence measures, including disease prevalence, incidence, and cumulative incidence. Typically, disease incidence rates are calculated separately for each of the exposed and the unexposed study groups. The ratio between these rates, the rate ratio, is then used to estimate the degree of increased risk of the disease due to the exposure.

In practice, more sophisticated statistical methods are needed to account for the lack of randomization. These methods include direct or indirect standardization, commonly used to account for differences in the age or gender composition of the exposed and unexposed groups. Poisson regression could be used to account for differences in one or more confounders. Alternatively, life table and other survival analysis methods, including Cox proportional hazard models, could be used to analyze data from cohort studies.
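The following is a minimal sketch, not taken from the entry, of how the incidence rates and the rate ratio described above might be computed; the event counts and person-years are hypothetical, and the log-scale confidence interval is a standard large-sample approximation.

```python
# Minimal sketch (not from the entry): person-time incidence rates and the
# rate ratio for an exposed and an unexposed cohort group (hypothetical data).
import math

cases_exposed, pyears_exposed = 30, 4_000.0      # events and person-years, exposed group
cases_unexposed, pyears_unexposed = 15, 6_000.0  # events and person-years, unexposed group

rate_exposed = cases_exposed / pyears_exposed
rate_unexposed = cases_unexposed / pyears_unexposed
rate_ratio = rate_exposed / rate_unexposed

# Approximate 95% confidence interval for the rate ratio on the log scale.
se_log_rr = math.sqrt(1 / cases_exposed + 1 / cases_unexposed)
lower = math.exp(math.log(rate_ratio) - 1.96 * se_log_rr)
upper = math.exp(math.log(rate_ratio) + 1.96 * se_log_rr)

print(f"incidence rate (exposed):   {rate_exposed:.4f} per person-year")
print(f"incidence rate (unexposed): {rate_unexposed:.4f} per person-year")
print(f"rate ratio: {rate_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```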
Special Types

Nested Case–Control Studies

In a nested case–control study, cases and controls are sampled from a preexisting and usually well-defined cohort. Typically, all subjects who develop the outcome under study during follow-up are included as cases. The investigator then randomly samples a subset of noncases (subjects who did not develop the outcome at the time of diagnosis of cases) as controls. Nested case–control studies are more efficient than cohort studies because exposures are measured only for cases and a subset of noncases rather than for all members of the cohort.
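A minimal sketch of the control-sampling step, not drawn from the entry: for each case, controls are sampled from the subjects still under follow-up and event free at the case's diagnosis time (the risk set). The cohort records below are simulated and purely illustrative.

```python
# Minimal sketch (not from the entry): risk-set sampling of controls for a
# nested case-control study. 'cohort' holds hypothetical
# (subject_id, follow_up_time, developed_outcome) records.
import random

random.seed(1)
cohort = [(i, random.uniform(1, 10), random.random() < 0.1) for i in range(500)]
cases = [(sid, t) for sid, t, event in cohort if event]

matched_sets = []
for case_id, case_time in cases:
    # Risk set: subjects still being followed and free of the outcome at the
    # time this case was diagnosed.
    risk_set = [sid for sid, t, event in cohort
                if sid != case_id and (t > case_time or (t >= case_time and not event))]
    controls = random.sample(risk_set, k=min(2, len(risk_set)))  # e.g., 2 controls per case
    matched_sets.append({"case": case_id, "controls": controls, "time": case_time})

print(f"{len(matched_sets)} matched sets; exposure need only be measured "
      "for the cases and their sampled controls.")
```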
For example, to determine whether a certain hormone plays a role in the development of prostate cancer, one could assemble a cohort of men, measure the level of that hormone in the blood of all members of the cohort, and then follow up with them until development of prostate cancer. Alternatively, a nested case–control design could be employed if one has access to a suitable cohort. In this scenario, the investigator would identify all men in the original cohort who developed prostate cancer during follow-up (the cases). For each case, one or more controls would then be selected from among the men who were still alive and free of prostate cancer at the time of the case's diagnosis. Hormone levels would then be measured for only the identified cases and controls rather than for all members of the cohort.

Nested case–control designs have most of the strengths of prospective cohort studies, including the ability to demonstrate that the exposure occurred before the disease. However, the main advantage of nested case–control designs is economic efficiency. Once the original cohort is established, nested case–control designs require far less financial resources and are quicker in producing results than establishing a new cohort. This is especially the case when outcomes are rare or when the measurement of the exposure is costly in time and resources.

Case–Cohort Studies

Similar to the nested case–control study, in a case–cohort study, cases and controls are sampled from a preexisting and usually well-defined cohort. However, the selection of controls is not dependent on the time spent by the noncases in the cohort. Rather, the adjustment for confounding effects of time is performed during analysis. Case–cohort studies are more efficient and economical than full cohort studies, for the same reasons that nested case–control studies are economical. In addition, the same set of controls can be used to study multiple outcomes or diseases.

Salaheddin M. Mahmud

See also Case-Only Design; Ethics in the Research Process

Further Readings

Donaldson, L., & Donaldson, R. (2003). Essential public health (2nd ed.). Newbury, UK: LibraPharm.
Rothman, K. J., & Greenland, S. (1998). Modern epidemiology (2nd ed.). Philadelphia: Lippincott-Raven.
Singh, H., & Mahmud, S. M. (2009). Different study designs in the epidemiology of cancer: Case-control vs. cohort studies. Methods in Molecular Biology, 471, 217–225.

COLLINEARITY

Collinearity is a situation in which the predictor, or exogenous, variables in a linear regression model are linearly related among themselves or with the intercept term, and this relation may lead to adverse effects on the estimated model parameters, particularly the regression coefficients and their associated standard errors. In practice, researchers often treat correlation between predictor variables as collinearity, but strictly speaking they are not the same; strong correlation implies collinearity, but the opposite is not necessarily true. When there is strong collinearity in a linear regression model, the model estimation procedure is not able to uniquely identify the regression coefficients for highly correlated variables or terms and therefore cannot separate the covariate effects. This lack of identifiability affects the interpretability of the regression model coefficients, which can cause misleading conclusions about the relationships between variables under study in the model.

Consequences of Collinearity

There are several negative effects of strong collinearity on estimated regression model parameters that can interfere with inference on the relationships of predictor variables with the response variable in a linear regression model. First, the interpretation of regression coefficients as marginal effects is invalid with strong collinearity, as it is not possible to hold highly correlated variables constant while increasing another correlated variable one unit. In addition, collinearity can make regression coefficients unstable and very sensitive to change. This is typically manifested in a large
change in magnitude or even a reversal in sign in one regression coefficient after another predictor variable is added to the model or specific observations are excluded from the model. Especially important for inference, a possible consequence of collinearity is a regression coefficient whose sign is counterintuitive or counter to previous research. The instability of estimates is also realized in very large or inflated standard errors of the regression coefficients. The fact that these inflated standard errors are used in significance tests of the regression coefficients leads to conclusions of insignificance of regression coefficients, even, at times, in the case of important predictor variables. In contrast to inference on the regression coefficients, collinearity does not impact the overall fit of the model to the observed response variable data.

Diagnosing Collinearity

There are several commonly used exploratory tools to diagnose potential collinearity in a regression model. The numerical instabilities in analysis caused by collinearity among regression model variables lead to correlation between the estimated regression coefficients, so some techniques assess the level of correlation in both the predictor variables and the coefficients. Coefficients of correlation between pairs of predictor variables are statistical measures of the strength of association between variables. Scatterplots of the values of pairs of predictor variables provide a visual description of the correlation among variables, and these tools are used frequently. There are, however, more direct ways to assess collinearity in a regression model by inspecting the model output itself. One way to do so is through coefficients of correlation of pairs of estimated regression coefficients. These statistical summary measures allow one to assess the level of correlation among different pairs of covariate effects as well as the correlation between covariate effects and the intercept. Another way to diagnose collinearity is through variance inflation factors, which measure the amount of increase in the estimated variances of regression coefficients compared with when predictor variables are uncorrelated. Drawbacks of the variance inflation factor as a collinearity diagnostic tool are that it does not illuminate the nature of the collinearity, which is problematic if the collinearity is between more than two variables, and it does not consider collinearity with the intercept. A diagnostic tool that accounts for these issues consists of variance-decomposition proportions of the regression coefficient variance–covariance matrix and the condition index of the matrix of the predictor variables and constant term. Some less formal diagnostics of collinearity that are commonly used are a counterintuitive sign in a regression coefficient, a relatively large change in value for a regression coefficient after another predictor variable is added to the model, and a relatively large standard error for a regression coefficient. Given that statistical inference on regression coefficients is typically a primary concern in regression analysis, it is important to apply diagnostic tools in a regression analysis before interpreting the regression coefficients, as the effects of collinearity could go unnoticed without a proper diagnostic analysis.

Remedial Methods for Collinearity

There are several methods in statistics that attempt to overcome collinearity in standard linear regression models. These methods include principal components regression, ridge regression, and a technique called the lasso. Principal components regression is a variable subset selection method that uses combinations of the exogenous variables in the model, and ridge regression and the lasso are penalization methods that add a constraint on the magnitude of the regression coefficients. Ridge regression was designed precisely to reduce collinearity effects by penalizing the size of regression coefficients. The lasso also shrinks regression coefficients, but it shrinks the least significant variable coefficients toward zero to remove some terms from the model. Ridge regression and the lasso are considered superior to principal components regression for dealing with collinearity in regression models because they more purposely reduce inflated variance in regression coefficients due to collinearity while retaining interpretability of individual covariate effects.
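The following is a minimal sketch, not taken from the entry, of two of the ideas above: variance inflation factors as a diagnostic and ridge regression as a remedy. The data are simulated with two nearly collinear predictors, and the ridge fit uses the closed-form penalized least squares solution.

```python
# Minimal sketch (not from the entry): variance inflation factors and a ridge
# fit for a linear model with two highly correlated predictors (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # nearly collinear with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return out

print("VIFs:", [round(v, 1) for v in vif(X)])      # large values signal collinearity

# Ridge regression: penalizing coefficient size stabilizes the estimates.
Xc = np.column_stack([np.ones(n), X])
for lam in (0.0, 10.0):                            # lam = 0 is ordinary least squares
    penalty = lam * np.eye(Xc.shape[1])
    penalty[0, 0] = 0.0                            # do not penalize the intercept
    coef = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ y)
    print(f"lambda={lam:>4}: coefficients = {np.round(coef, 2)}")
```

With a positive penalty the two coefficient estimates shrink toward each other and their sampling variability drops, at the cost of some bias, which is the trade-off the entry describes.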
Future Research

While linear regression models offer the potential for explaining the nature of relationships between variables in a study, collinearity in the exogenous variables and intercept can generate curious results in the estimated regression coefficients that may lead to incorrect substantive conclusions about the relationships of interest. More research is needed regarding the level of collinearity that is acceptable in a regression model before statistical inference is unduly influenced.

David C. Wheeler

See also Coefficients of Correlation, Alienation, and Determination; Correlation; Covariate; Exogenous Variable; Inference: Deductive and Inductive; Parameters; Predictor Variable; Regression Coefficient; Significance, Statistical; Standard Error of Estimate; Variance

Further Readings

Belsley, D. A. (1991). Conditioning diagnostics: Collinearity and weak data in regression. New York: Wiley.
Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer-Verlag.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear regression models. Chicago: Irwin.

COLUMN GRAPH

A column graph summarizes categorical data by presenting parallel vertical bars with a height (and hence area) proportionate to specific quantities of data for each category. This type of graph can be useful in comparing two or more distributions of nominal- or ordinal-level data.

Developing Column Graphs

A column graph can and should provide an easy-to-interpret visual representation of a frequency or percentage distribution of a single (or multiple) variable(s). Column graphs present a series of vertical equal-width rectangles, each with a height proportional to the frequency (or percentage) of a specific category of observations. Categories are labeled on the x-axis (the horizontal axis), and frequencies are labeled on the y-axis (the vertical axis). For example, a column graph that displays the partisan distribution of a single session of a state legislature would consist of at least two rectangles, one representing the number of Democratic seats, one representing the number of Republican seats, and perhaps one representing the number of independent seats. The x-axis would consist of the "Democratic" and "Republican" labels while the y-axis would consist of labels representing intervals for the number of seats in the state legislature.

When developing a column graph, it is vital that the researcher present a set of categories that is both exhaustive and mutually exclusive. In other words, each potential value must belong to one and only one category. Developing such a category schema is relatively easy when the researcher is faced with discrete data, in which the number of observations for a particular value can be counted (i.e., nominal- and ordinal-level data). For nominal data, the categories are unordered, with interchangeable values (e.g., gender, ethnicity, religion). For ordinal data, the categories exhibit some type of relation to each other, although the relation does not exhibit specificity beyond ranking the values (e.g., greater than vs. less than, agree strongly vs. agree vs. disagree vs. disagree strongly). Because nominal and ordinal data are readily countable, once the data are obtained, the counts can then be readily transformed into a column graph.

With continuous data (i.e., interval- and ratio-level data), an additional step is required. For interval data, the distance between observations is fixed (e.g., Carolyn makes $5,000 more per year than John does). For ratio-level data, the data can be measured in relation to a fixed point (e.g., a family's income falls a certain percentage above or below the poverty line). For continuous data, the number of potential values can be infinite or, at the very least, extraordinarily high, thus producing a cluttered chart. Therefore, it is best to reduce interval- or ratio-level data to ordinal-level data by collapsing the range of potential values into a few select categories. For example, a survey that asks
respondents for their age could produce dozens of potential answers, and therefore it is best to condense the variable into a select few categories (e.g., 18–25, 26–35, 36–45, 46–64, 65 and older) before making a column graph that summarizes the distribution.

[Figure 1: column graph of the 1990–2008 elections; x-axis: Election; y-axis: Number of Seats; paired columns for Democrat and Republican seats.]

Figure 1   1990–2009 Illinois House Partisan Distribution (Column Graph)
Source: Almanac of Illinois Politics. (2009). Springfield: Illinois Issues.

Multiple Distribution Column Graphs

Column graphs can also be used to compare multiple distributions of data. Rather than presenting a single set of vertical rectangles that represents a single distribution of data, column graphs present multiple sets of rectangles, one for each distribution. For ease of interpretation, each set of rectangles should be grouped together and separate from the other distributions. This type of graph can be particularly useful in comparing counts of observations across different categories of interest. For example, a researcher conducting a survey might present the aggregate distribution of self-reported partisanship, but the researcher can also demonstrate the gender gap by displaying separate partisan distributions for male and female respondents. By putting each distribution into a single graph, the researcher can visually present the gender gap in a readily understandable format.
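A minimal sketch, not from the entry, of how such a grouped column graph might be drawn with matplotlib; the categories and respondent counts are hypothetical, in the spirit of the gender-gap example above.

```python
# Minimal sketch (not from the entry): a grouped column graph comparing two
# distributions of self-reported partisanship (hypothetical counts).
import matplotlib.pyplot as plt
import numpy as np

categories = ["Democrat", "Independent", "Republican"]   # x-axis categories
male = [180, 90, 230]                                     # respondent counts
female = [260, 80, 160]

x = np.arange(len(categories))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, male, width, label="Male", color="gray")
ax.bar(x + width / 2, female, width, label="Female", color="black")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Number of Respondents")
ax.set_title("Self-Reported Partisanship by Gender")
ax.legend()
plt.show()
```

Replacing the second bar call with ax.bar(x, female, width, bottom=male, ...) would stack the two distributions in a single column per category, the stacked-column variant discussed later in this entry.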
Column graphs might also be used to explore chronological trends in distributions. One such example is in Figure 1, which displays the partisan makeup of the Illinois state House from the 1990 through the 2008 elections. The black bars represent the number of Republican seats won in the previous election (presented on the x-axis) and the gray bars represent the number of Democratic seats. By showing groupings of the two partisan bars side by side, the graph provides for an easy interpretation of which party had control of the legislative chamber after a given election. Furthermore, readers can look across each set of bars to determine the extent to which the partisan makeup of the Illinois House has changed over time.

[Figure 2: 100% stacked column graph of the 1990–2008 elections; x-axis: Election; y-axis: percentage of seats (0%–100%); stacked segments for Democrat and Republican seats.]

Figure 2   1990–2009 Illinois House Partisan Distribution (100% Stacked Column Graph)
Source: Almanac of Illinois Politics. (2009). Springfield: Illinois Issues.

Stacked Column

One special form of the column graph is the stacked column, which presents data from a particular distribution in a single column. A regular stacked column places the observation counts directly on top of one another, while a 100% stacked column does the same while modifying each observation count into a percentage of the overall distribution of observations. The former type of graph might be useful when the researcher wishes to present a specific category of interest (which could be at the bottom of the stack) and the total number of observations. The latter might be of interest when the researcher wants a visual representation of how much of the total distribution is represented by each value and the extent to which that distribution is affected by other variables of interest (such as time). Either version of the stacked column approach can be useful in comparing multiple distributions, such as with Figure 2, which presents a 100% stacked column representation of chronological trends in Illinois House partisan makeup.

Column Graphs and Other Figures

When one is determining whether to use a column graph or some other type of visual representation of relevant data, it is important to consider the main features of other types of figures, particularly the level of measurement and the number of distributions of interest.

Bar Graph

A column graph is a specific type of bar graph (or chart). Whereas the bar graph can present summaries of categorical data in either vertical or
horizontal rectangles, a column graph presents only vertical rectangles. The decision to present horizontal or vertical representations of the data is purely aesthetic, and so the researcher might want to consider which approach would be the most visually appealing, given the type of analysis presented by the data.

[Figure 3: line graph of the 1990–2008 elections; x-axis: Election; y-axis: Number of Seats; one line each for Democrat and Republican seats.]

Figure 3   1990–2009 Illinois House Partisan Distribution (Line Graph)
Source: Almanac of Illinois Politics. (2009). Springfield: Illinois Issues.

[Figure 4: histogram; x-axis: Number of Correct Answers (Out of 19); y-axis: Number of Survey Respondents.]

Figure 4   Political Knowledge Variable Distribution (Histogram)
Source: Adapted from the University of Illinois Subject Pool, Bureau of Educational Research.

Pie Chart

A pie chart presents the percentage of each category of a distribution as a segment of a circle. This type of graph allows for only a single distribution at a time; multiple distributions require multiple pie charts. As with column and bar graphs, pie charts represent observation counts (or percentages) and thus are used for discrete data, or at the
very least continuous data collapsed into a select few discrete categories.

Line Graph

A line graph displays relationships between two changing variables by drawing a line that connects actual or projected values of a dependent variable (y-axis) based on the value of an independent variable (x-axis). Because line graphs display trends between plots of observations across an independent variable, neither the dependent nor the independent variable can contain nominal data. Furthermore, ordinal data do not lend themselves to line graphs either, because the data are ordered only within the framework of the variables. As with column graphs and bar graphs, line graphs can track multiple distributions of data based on categories of a nominal or ordinal variable. Figure 3 provides such an example, with yet another method of graphically displaying the Illinois House partisan makeup. The election year serves as the independent variable, and the number of Illinois House seats for a particular party as the dependent variable. The solid gray and dotted black lines present the respective Democratic and Republican legislative seat counts, allowing for easy interpretation of the partisan distribution trends.

Histogram

A histogram is a special type of column graph that allows for a visual representation of a single frequency distribution of interval or ratio data without collapsing the data to a few select categories. Visually, a histogram looks similar to a column graph, but without any spaces between the rectangles. On the x-axis, a histogram displays intervals rather than discrete categories. Unlike a bar graph or a column graph, a histogram only displays distributions and cannot be used to compare multiple distributions. The histogram in Figure 4 displays the distribution of a political knowledge variable obtained via a 2004 survey of University of Illinois undergraduate students. The survey included a series of 19 questions about U.S. government and politics.

Michael A. Lewkowicz

See also Bar Chart; Histogram; Interval Scale; Line Graph; Nominal Scale; Ordinal Scale; Pie Chart; Ratio Scale

Further Readings

Frankfort-Nachmias, C., & Nachmias, D. (2007). Research methods in the social sciences (7th ed.). New York: Worth.
Harris, R. L. (2000). Information graphics: A comprehensive illustrated reference. New York: Oxford University Press.
Stovall, J. G. (1997). Infographics: A journalist's guide. Boston: Allyn and Bacon.

COMPLETELY RANDOMIZED DESIGN

A completely randomized design (CRD) is the simplest design for comparative experiments, as it uses only two basic principles of experimental designs: randomization and replication. Its power is best understood in the context of agricultural experiments (for which it was initially developed), and it will be discussed from that perspective, but true experimental designs, where feasible, are useful in the social sciences and in medical experiments.

In CRDs, the treatments are allocated to the experimental units or plots in a completely random manner. CRD may be used for single- or multifactor experiments. This entry discusses the application, advantages, and disadvantages of CRD studies and the processes of conducting and analyzing them.

Application

CRD is mostly useful in laboratory and greenhouse experiments in agricultural, biological, animal, environmental, and food sciences, where experimental material is reasonably homogeneous. It is more difficult when the experimental units are people.

Advantages and Disadvantages

This design has several advantages. It is very flexible as any number of treatments may be used, with equal or unequal replications. The design has a comparatively simple statistical analysis and
retains this simplicity even if some observations are missing or lost accidentally. The design provides maximum degrees of freedom for the estimation of error variance, which increases the precision of an experiment.

However, the design is not suitable if a large number of treatments are used and the experimental material is not reasonably homogeneous. Therefore, it is seldom used in agricultural field experiments in which soil heterogeneity may be present because of a soil fertility gradient, or in animal sciences when the animals (experimental units) vary in such things as age, breed, or initial body weight, or with people.

Layout of the Design

The plan of allocation of the treatments to the experimental material is called the layout of the design.

Let the ith (i = 1, 2, . . . , v) treatment be replicated r_i times. Therefore, N = Σ r_i is the total number of required experimental units.

The treatments are allocated to the experimental units or plots in a completely random manner. Each treatment has equal probability of allocation to an experimental unit.

Given below is a layout plan of a CRD with four treatments, denoted by integers, each replicated 3 times and allocated to 12 experimental units.

3   2   4   4
3   1   4   1
3   1   2   2

Randomization

Some common methods of random allocation of treatments to the experimental units are illustrated in the following.

Consider an experiment with up to 10 treatments. In this case, a 1-digit random number table can be used. Each treatment is allotted a number. The researcher picks random numbers with replacement (i.e., a random number may get repeated) from the random number table until the number of replications of that treatment is exhausted.

For experiments with more than 10 treatments, a 2-digit random number table or a combination of two rows or columns of 1-digit random numbers can be used. Here each 2-digit random number is divided by the number of treatments, and the residual (remainder) identifies the selected treatment; when the residual is 00, the divisor number itself is selected. The pair 00 occurring in the table is discarded.

Alternatively, the numbers 1, 2, . . . , N are written on small pieces of paper identical in shape and size. These are thoroughly mixed in a box, the papers are drawn one by one, and the numbers on the selected papers are the random numbers. After each draw, the piece of paper is put back, and thorough mixing is performed again.

The random numbers may also be generated by computers.

Statistical Analysis

The data from a CRD are examined by the method of analysis of variance (ANOVA) for one-way classification.

The analysis of data from a CRD is performed using the linear fixed-effects model given as

y_ij = μ + t_i + ε_ij,   i = 1, 2, . . . , v;  j = 1, 2, . . . , r_i,

where y_ij is the yield or response from the jth unit receiving the ith treatment, t_i is the fixed effect of the ith treatment, and ε_ij is the random error effect.

Suppose we are interested in knowing whether wheat can be grown in a particular agroclimatic situation. Then we will randomly select some varieties of wheat, from all possible varieties, for testing in an experiment. Here the effect of treatment is random. When the varieties are the only ones to be tested in an experiment, the effect of varieties (treatments) is fixed.

The experimental error is a random variable with mean zero and constant variance (σ²). The experimental error is normally and independently distributed with mean zero and constant variance for every treatment.

Let N = Σ r_i be the total number of experimental units, ΣΣ y_ij = y.. = G = the grand total of all the observations, and Σ_j y_ij = y_i. = T_i = the total response from the experimental units receiving the ith treatment. Then,
ΣΣ (y_ij − ȳ..)² = Σ r_i (ȳ_i. − ȳ..)² + ΣΣ (y_ij − ȳ_i.)²,   i = 1, . . . , v;  j = 1, 2, . . . , r_i,

that is, the total sum of squares is equal to the sum of the treatments sum of squares and the error sum of squares. This implies that the total variation can be partitioned into two components: between treatments and within treatments (experimental error). Table 1 is an ANOVA table for CRD.

Table 1   Analysis of Variance for Completely Randomized Design

Sv                    df       SS      MSS                         Variance Ratio
Between treatments    v − 1    TrSS    s_t² = TrSS/(v − 1)         F_t = s_t²/s_e²
Within treatments     N − v    ErSS    s_e² = ErSS/(N − v)

Note: Sv = sources of variation; df = degrees of freedom; SS = sum of squares; MSS = mean sum of squares (variance); TrSS = treatments sum of squares; ErSS = error sum of squares.

Under the null hypothesis, H0: t_1 = t_2 = . . . = t_v, that is, all treatment means are equal, the statistic F_t = s_t²/s_e² follows the F distribution with (v − 1) and (N − v) degrees of freedom.

If F (observed) ≥ F (expected) for the same df and at a specified level of significance, say α%, then the null hypothesis of equality of means is rejected at the α% level of significance.

If H0 is rejected, then the pairwise comparison between the treatment means is made by using the critical difference (CD) test,

CD_α% = SE_d · t   for error df and α% level of significance,

where SE_d = √[ErSS (1/r_i + 1/r_j)/(N − v)]; SE_d stands for the standard error of the difference between the means of two treatments, and when r_i = r_j = r, SE_d = √[2 ErSS/{r(N − v)}].

If the difference between any two treatment means is greater than or equal to the critical difference, the treatment means are said to differ significantly.

The critical difference is also called the least significant difference.
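The following is a minimal sketch, not taken from the entry, of the two steps just described: a completely random allocation of treatments to units and the one-way ANOVA F test with the critical difference. The treatment effects and error variance are simulated and purely illustrative.

```python
# Minimal sketch (not from the entry): random CRD layout plus the one-way
# ANOVA computations described above (simulated data, 4 treatments x 3 reps).
import numpy as np

rng = np.random.default_rng(42)
treatments, reps = 4, 3
layout = np.repeat(np.arange(1, treatments + 1), reps)
rng.shuffle(layout)                      # completely random allocation to the 12 units
print("layout:", layout)

# Simulated responses: a treatment mean plus random error for each unit.
true_means = {1: 10.0, 2: 12.0, 3: 9.0, 4: 14.0}
y = np.array([true_means[int(t)] + rng.normal(scale=1.0) for t in layout])

groups = [y[layout == t] for t in range(1, treatments + 1)]
N, v = len(y), treatments
grand_mean = y.mean()

tr_ss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between treatments
er_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within treatments
s2_t = tr_ss / (v - 1)
s2_e = er_ss / (N - v)
F = s2_t / s2_e
print(f"TrSS={tr_ss:.2f}, ErSS={er_ss:.2f}, F={F:.2f} on ({v - 1}, {N - v}) df")

# Critical difference (least significant difference) for two means with equal
# replication r; 2.306 is the two-sided 5% t value for 8 error df (from a table).
t_crit = 2.306
se_d = np.sqrt(2 * s2_e / reps)
print(f"critical difference at 5%: {t_crit * se_d:.2f}")
```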
Number of Replications

The number of replications required for an experiment is affected by the inherent variability and size of the experimental material, the number of treatments, and the degree of precision required. The minimum number of replications required to detect a specified difference between two treatment means at a specified level of significance is given by the t statistic at the error df and α% level of significance, t = d/{s_e √(2/r)}, where d = X̄_1 − X̄_2; X̄_1 and X̄_2 are treatment means with each treatment replicated r times, and s_e² is the error variance. Therefore,

r = 2t²s_e²/d².

It is observed that from 12 df onward, the values of t and F (for smaller error variance) decrease considerably slowly, and so from empirical considerations the number of replications is chosen so as to provide about 12 df for error variance for the experiment.

Nonconformity to the Assumptions

When the assumptions are not realized, the researcher may apply one of the various transformations in order to bring improvements in assumptions. The researcher then may proceed with the usual ANOVA after the transformation. Some common types of transformations are described below.

Arc Sine Transformation

This transformation, also called the angular transformation, is used for count data obtained from a binomial distribution, such as success–failure, diseased–nondiseased, infested–noninfested, barren–nonbarren tillers, inoculated–noninoculated, male–female, dead–alive, and so forth.

The transformation is not applicable to percentages of carbohydrates, protein, profit, disease index, and so forth, which are not derived from count data. The transformation is not needed if nearly all the data lie between 30% and 70%. The transformation may be used when data range from 0% to 30% and beyond or from 100% to 70% and below. For transformation, 0/n and n/n should be taken as 1/(4n) and (1 − 1/(4n)), respectively.

Square Root Transformation

The square root transformation is used for data from a Poisson distribution, that is, when data are counts of rare events such as number of defects or accidents, number of infested plants in a plot, insects caught on traps, or weeds per plot. The transformation consists of taking the square root of each observation before proceeding with the ANOVA in the usual manner.

Logarithmic Transformation

The logarithmic transformation is used when the standard deviation is proportional to the treatment means, that is, if the coefficient of variation is constant. The transformation achieves additivity. It is used for count data (large whole numbers) covering a wide range, such as number of insects per plant or number of egg masses. When zero occurs, 1 is added to it before transformation.

Kishore Sinha

See also Analysis of Variance (ANOVA); Experimental Design; Fixed-Effects Models; Randomized Block Design; Research Design Principles

Further Readings

Bailey, R. A. (2008). Design of comparative experiments. Cambridge, UK: Cambridge University Press.
Box, G. E. P., Hunter, W. G., & Hunter, J. S. (1978). Statistics for experimenters. New York: Wiley.
Dean, A., & Voss, D. (1999). Design and analysis of experiments. New York: Springer.
Montgomery, D. C. (2001). Design and analysis of experiments. New York: Wiley.

COMPUTERIZED ADAPTIVE TESTING

Adaptive testing, in general terms, is an assessment process in which the test items administered to examinees differ based on the examinees' responses to previous questions. Computerized adaptive testing uses computers to facilitate the "adaptive" aspects of the process and to automate scoring. This entry discusses historical perspectives, goals, psychometric and item selection approaches, and issues associated with adaptive testing in general and computerized adaptive testing in particular.

Historical Perspectives

Adaptive testing is not new. Through the ages examiners have asked questions and, depending on the response given, have chosen different directions for further questioning for different examinees. Clinicians have long taken adaptive approaches, and so, since the advent of standardized intelligence testing, many such tests have used adaptive techniques. For both the 1916 edition of the Stanford-Binet and the 1939 edition of the Wechsler-Bellevue intelligence tests, the examiner chose a starting point and, if the examinee answered correctly, asked harder questions until a string of incorrect answers was provided. If the first answer was incorrect, an easier starting point was chosen.

Early group-administered adaptive tests faced administration problems in the precomputerized administration era. Scoring each item and making continual routing decisions was too slow a process. One alternative explored by William Angoff and Edith Huddleston in the 1950s was two-stage testing. An examinee would take a half-length test of medium difficulty, that test was scored, and then the student would be routed to either an easier or a harder last half of the test. Another alternative involved use of a special marker that revealed invisible ink when used. Examinees would use their markers to indicate their responses. The marker would reveal the number of the next item they should answer. These and other approaches were too complex logistically and never became popular.

Group-administered tests could not be administered adaptively with the necessary efficiency until the increasing availability of computer-based testing systems, circa 1970. At that time research on computer-based testing proliferated. In 1974, Frederic Lord (inventor of the theoretical underpinnings of
much of modern psychometric theory) suggested the field would benefit from researchers' getting together to share their ideas. David Weiss, a professor at the University of Minnesota, thought this a good idea and coordinated a series of three conferences on computerized adaptive testing in 1975, 1977, and 1979, bringing together the greatest thinkers in the field. These conferences energized the research community, which focused on the theoretical underpinnings and research necessary to develop and establish the psychometric quality of computerized adaptive tests.

The first large-scale computerized adaptive tests appeared circa 1985, including the U.S. Army's Computerized Adaptive Screening Test, the College Board's Computerized Placement Tests (the forerunner of today's Accuplacer), and the Computerized Adaptive Differential Ability Tests of the Psychological Corporation (now part of Pearson Education). Since that time computerized adaptive tests have proliferated in all spheres of assessment.

Goals of Adaptive Testing

The choice of test content and administration mode should be based on the needs of a testing program. What is best for one program is not necessarily of importance for other programs. There are primarily three different needs that can be addressed well by adaptive testing: maximization of test reliability for a given testing time, minimization of individual testing time to achieve a particular reliability or decision accuracy, and the improvement of diagnostic information.

Maximization of Test Reliability

Maximizing reliability for a given testing time is possible because in conventional testing, many examinees spend time responding to items that are either trivially easy or extremely hard, and thus many items do not contribute to our understanding of what the examinee knows and can do. If a student can answer complex multiplication and division questions, there is little to be gained asking questions about single-digit addition. When multiple-choice items are used, students are sometimes correct when they randomly guess the answers to very difficult items. Thus the answers to items that are very difficult for a student are worse than useless—they can be misleading.

An adaptive test maximizes reliability by replacing items that are too difficult or too easy with items of an appropriate difficulty. Typically this is done by ordering items by difficulty (usually using sophisticated statistical models such as Rasch scaling or the more general item response theory) and administering more difficult items subsequent to correct responses and easier items after incorrect responses.

By tailoring the difficulty of the items administered dependent on prior examinee responses, a well-developed adaptive test can achieve the reliability of a conventional test with approximately one half to one third the items or, alternatively, can achieve a significantly higher reliability in a given amount of time. (However, testing time is unlikely to be reduced in proportion to the item reduction, because very easy and very difficult items often require little time to answer correctly or to abandon, respectively.)

Minimization of Testing Time

An alternative goal that can be addressed with adaptive testing is minimizing the testing time needed to achieve a fixed level of reliability. Items can be administered until the reliability of the test for a particular student, or the decision accuracy for a test that classifies examinees into groups (such as pass–fail), reaches an acceptable level. If the match between the proficiency of an examinee and the quality of the item pool is good (e.g., if there are many items that measure proficiency particularly well within a certain range of scores), few items will be required to determine an examinee's proficiency level with acceptable precision. Also, some examinees may be very consistent in their responses to items above or below their proficiency level—it might require fewer items to pinpoint their proficiency compared with people who are inconsistent.

Some approaches to adaptive testing depend on the assumption that items can be ordered along a single continuum of difficulty (also referred to as unidimensionality). One way to look at this is that the ordering of item difficulties is the same for all
identifiable groups of examinees. In some areas of testing, this is not a reasonable assumption. For example, in a national test of science for eighth-grade students, some students may have just studied life sciences and others may have completed a course in physical sciences. On average, life science questions will be easier for the first group than for the second, while physical science questions will be relatively easier for the second group. Sometimes the causes of differences in dimensionality may be more subtle, and so in many branches of testing, it is desirable not only to generate an overall total score but to produce a diagnostic profile of areas of strength and weakness, perhaps with the ultimate goal of pinpointing remediation efforts.

An examinee who had recently studied physical science might get wrong a life science item that students in general found easy. Subsequent items would be easier, and that student might never receive the relatively difficult items that she or he could have answered correctly. That examinee might receive an underestimate of her or his true level of science proficiency.

One way to address this issue for multidimensional constructs would be to test each such construct separately. Another alternative would be to create groups of items balanced on content—testlets—and give examinees easier or harder testlets as the examinees progress through the test.

Improvement of Diagnostic Information

Diagnostic approaches to computerized adaptive testing are primarily in the talking stage. Whereas one approach to computerized adaptive diagnostic testing would be to use multidimensional item response models and choose items based on the correctness of prior item responses connected to the difficulty of the items on the multiple dimensions, an alternative approach is to lay out a tested developmental progression of skills and select items based on precursor and "postcursor" skills rather than on simple item difficulty. For example, if one posits that successful addition of two-digit numbers with carrying requires the ability to add one-digit numbers with carrying, and an examinee answers a two-digit addition with carrying problem incorrectly, then the next problem asked should be a one-digit addition with carrying problem. Thus, in such an approach, expert knowledge of relationships within the content domain drives the selection of the next item to be administered. Although this is an area of great research interest, at this time there are no large-scale adaptive testing systems that use this approach.

Psychometric Approaches to Scoring

There are four primary psychometric approaches that have been used to support item selection and scoring in adaptive testing: maximum likelihood item response theory, Bayesian item response theory, classical test theory, and decision theory.

Maximum likelihood item response theory methods have long been used to estimate examinee proficiency. Whether they use the Rasch model, three-parameter logistic model, partial credit, or other item response theory models, fixed (previously estimated) item parameters make it fairly easy to estimate the proficiency level most likely to have led to the observed pattern of item scores.

Bayesian methods use information in addition to the examinee's pattern of responses and previously estimated item parameters to estimate examinee proficiency. Bayesian methods assume a population distribution of proficiency scores, often referred to as a prior, and use the unlikeness of an extreme score to moderate the estimate of the proficiency of examinees who achieved such extreme scores. Since examinees who have large positive or negative errors of measurement tend to get more extreme scores, Bayesian approaches tend to be more accurate (have smaller errors of measurement) than maximum likelihood approaches.

The essence of Bayesian methods for scoring is that the probabilities associated with the proficiency likelihood function are multiplied by the probabilities in the prior distribution, leading to a new, posterior, distribution. Once that posterior distribution is produced, there are two primary methods for choosing an examinee score. The first is called modal estimation or maximum a posteriori—pick the proficiency level that has the highest probability (probability density). The second method is called expected a posteriori and is essentially the mean of the posterior distribution.
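A minimal sketch, not drawn from the entry, of maximum likelihood and expected a posteriori scoring under a Rasch model, evaluated on a grid of proficiency values. The item difficulties and response pattern are hypothetical, and the prior is taken to be standard normal only for illustration.

```python
# Minimal sketch (not from the entry): ML and EAP proficiency estimates under
# a Rasch model, using a grid of proficiency values (hypothetical items).
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct response at proficiency theta for difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

difficulties = np.array([-1.5, -0.5, 0.0, 0.7, 1.2])   # previously estimated item parameters
responses = np.array([1, 1, 1, 0, 1])                   # 1 = correct, 0 = incorrect

theta_grid = np.linspace(-4, 4, 401)
probs = rasch_p(theta_grid[:, None], difficulties[None, :])
likelihood = np.prod(np.where(responses == 1, probs, 1 - probs), axis=1)

# Maximum likelihood estimate: the grid point where the likelihood peaks.
theta_ml = theta_grid[np.argmax(likelihood)]

# Bayesian estimate: multiply by a standard normal prior, then take the
# posterior mean (EAP); the posterior mode would instead give the MAP estimate.
prior = np.exp(-0.5 * theta_grid ** 2)
posterior = likelihood * prior
posterior /= posterior.sum()
theta_eap = np.sum(theta_grid * posterior)

print(f"ML estimate:  {theta_ml:.2f}")
print(f"EAP estimate: {theta_eap:.2f}")   # pulled toward the prior mean of 0
```

As the entry notes, the Bayesian estimate is moderated by the prior, so examinees with extreme observed patterns receive less extreme proficiency estimates than under maximum likelihood.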
Item response theory proficiency estimates have different psychometric characteristics from classical test theory approaches to scoring tests (based on the number of items answered correctly). The specifics of these differences are beyond the scope of this entry, but they have led some testing programs (especially ones in which some examinees take the test traditionally and some take it adaptively) to transform the item response theory proficiency estimate to an estimate of the number right that would have been obtained on a hypothetical base form of the test. From the examinee's item response theory–estimated proficiency and the set of item parameters for the base form items, the probability of a correct response can be estimated for each item, and the sum of those probabilities is used to estimate the number of items the examinee would have answered correctly.

Decision theory is a very different approach that can be used for tests that classify or categorize examinees, for example into two groups, passing and failing. One approach that has been implemented for at least one national certification examination was developed by Charles Lewis and Kathleen Sheehan. Parallel testlets—sets of perhaps 10 items that cover the same content and are of about the same difficulty—were developed. After each testlet is administered, a decision is made to either stop testing and make a pass–fail decision or administer another testlet. The decision is based on whether the confidence interval around an examinee's estimated score is entirely above or entirely below the proportion-correct cut score for a pass decision.

For example, consider a situation in which to pass, an examinee must demonstrate mastery of 60% of the items in a 100-item pool. We could ask all 100 items and see what percentage the examinee answers correctly. Alternatively, we could ask questions and stop as soon as the examinee answers 60 items correctly. This could save as much as 40% of computer time at a computerized test administration center. In fact, one could be statistically confident that an examinee will get at least 60% of the items correct while administering even fewer items.

The Lewis–Sheehan decision theoretic approach is based on a branch of statistical theory called Waldian sequential ratio testing, in which after each sample of data, the confidence interval is calculated, and a judgment is made either that enough information is possessed to make a decision or that more data need to be gathered. Table 1 presents two simplified examples of examinees tested with multiple parallel 10-item testlets. The percent correct is marked with an asterisk when it falls outside of the 99% confidence interval and a pass–fail decision is made.

Table 1   Record for Two Hypothetical Examinees Taking a Mastery Test Requiring 60% Correct on the Full-Length (100-Item) Test

                                   Examinee 1            Examinee 2
After Testlet   Confidence        Number    Percent     Number    Percent
Number          Interval          Correct   Correct     Correct   Correct
1               .20–1.00          6         .60         9         .90
2               .34–.86           11        .55         19        .95*
3               .40–.80           18        .60
4               .34–.76           25        .63
5               .37–.73           29        .58
6               .50–.60           34        .57
7               .52–.68           39        .56
8               .54–.66           43        .54
9               .56–.64           48        .53*
10              .60–.60

The confidence interval in this example narrows as the sample of items (number of testlets administered) increases and gets closer to the full 100-item domain.
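A minimal sketch, not from the entry, of a sequential mastery decision in the spirit of this example. The exact Lewis–Sheehan procedure is not reproduced here; the sketch substitutes a simple normal-approximation interval with a finite-population correction toward the 100-item domain, and the cut score, testlet length, and response counts are illustrative.

```python
# Minimal sketch (not from the entry): stop testing once a 99% interval for
# the domain percent correct lies entirely above or below the .60 cut score.
import math

CUT, DOMAIN, TESTLET, Z99 = 0.60, 100, 10, 2.576

def decide(correct_per_testlet):
    items = correct = 0
    for k, c in enumerate(correct_per_testlet, start=1):
        items += TESTLET
        correct += c
        p = correct / items
        fpc = math.sqrt((DOMAIN - items) / (DOMAIN - 1))      # shrinks to 0 at 100 items
        half = Z99 * math.sqrt(max(p * (1 - p), 1e-9) / items) * fpc
        lo, hi = max(p - half, 0.0), min(p + half, 1.0)
        print(f"testlet {k}: {correct}/{items} correct, 99% interval ({lo:.2f}, {hi:.2f})")
        if lo > CUT:
            return "pass"
        if hi < CUT:
            return "fail"
    return "pass" if correct / items >= CUT else "fail"

# A hypothetical examinee well above the cut score is classified quickly.
print(decide([9, 10]))
```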
Examinee 1 in Table 1 has a true domain percent correct close to the cut score of .60. Thus it takes nine testlets to demonstrate the examinee is outside the required confidence interval. Examinee 2 has a true domain percent correct that is much higher than the cut score and thus demonstrates mastery with only two testlets.

Item Selection Approaches

Item selection can be subdivided into two pieces: selecting the first item and selecting all subsequent items. Selection of the first item is unique because for all other items, some current information about student proficiency (scores from previous items) is available. Several approaches can be used for selecting the first item. Everyone can get the same first item—one that is highly discriminating near the center of the score distribution—but such an approach would quickly make that item nonsecure (i.e., it could be shared easily with future examinees). An alternative would be to randomly select from a group of items that discriminate well near the center of the score distribution. Yet another approach might be to input some prior information (for example, class grades or previous year's test scores) to help determine a starting point.

Once the first item is administered, several approaches exist for determining which items should be administered next to an examinee. There are two primary approaches to item selection for adaptive tests based on item response theory: maximum information and minimum posterior variance. With maximum information approaches, one selects the item that possesses the maximum amount of statistical information at the current estimated proficiency. Maximum information approaches are consistent with maximum likelihood item response theory scoring. Minimum posterior variance methods are appropriate for Bayesian scoring approaches. With this approach, the item is selected that, after scoring, will lead to the smallest variance of the posterior distribution, which is used to estimate the examinee's proficiency.

Most adaptive tests do not apply either of these item selection approaches in their purely statistical form. One reason to vary from these approaches is to ensure breadth of content coverage. Consider a physics test covering mechanics, electricity, magnetism, optics, waves, heat, and thermodynamics. If items are chosen solely by difficulty, some examinees might not receive any items about electricity and others might receive no items about optics. Despite items' being scaled on a national sample, any given examinee might have an idiosyncratic pattern of knowledge. Also, the sample of items in the pool might be missing topic coverage in certain difficulty ranges. A decision must be made to focus on item difficulty, content coverage, or some combination of the two.

Another consideration is item exposure. Since most adaptive test item pools are used for extended periods of time (months if not years), it is possible for items to become public knowledge. Sometimes this occurs as a result of concerted cheating efforts, sometimes just because examinees and potential examinees talk to each other.

More complex item selection approaches, such as those developed by Wim van der Linden, can be used to construct "on the fly" tests that meet a variety of predefined constraints on item selection.

Issues

Item Pool Requirements

Item pool requirements for adaptive tests are typically more challenging than for traditional tests. With this proliferation, the psychometric advantages of adaptive testing have often been delivered, but logistical problems have been discovered. When stakes are high (for example, admissions or licensure testing), test security is very important—ease of cheating would invalidate test scores. It is still prohibitively expensive to arrange for a sufficient number of computers in a secure environment to test hundreds of thousands of examinees at one time, as some of the largest testing programs have traditionally done with paper tests. As an alternative, many computerized adaptive testing programs (such as the Graduate Record Examinations) have moved from three to five administrations a year to administrations almost every day. This exposes items to large numbers of examinees who sometimes remember and share the items. Minimizing widespread cheating when exams are offered so frequently requires the creation of multiple item pools, an expensive undertaking, the cost of which is often passed on to examinees.
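A minimal sketch, not from the entry, of maximum-information item selection combined with a simple exposure-control step: rather than always administering the single most informative item, the next item is drawn at random from among the few most informative unused items, which spreads exposure across the pool. The item difficulties are hypothetical.

```python
# Minimal sketch (not from the entry): maximum-information selection under a
# Rasch model with "randomesque" exposure control (hypothetical item pool).
import math
import random

def rasch_info(theta, b):
    """Fisher information of a Rasch item with difficulty b at proficiency theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta, pool, administered, top_k=5):
    """Pick at random among the top_k most informative items not yet given."""
    available = [i for i in range(len(pool)) if i not in administered]
    ranked = sorted(available, key=lambda i: rasch_info(theta, pool[i]), reverse=True)
    return random.choice(ranked[:top_k])

random.seed(0)
pool = [round(random.uniform(-3, 3), 2) for _ in range(200)]   # item difficulties
choice = next_item(theta=0.5, pool=pool, administered=set())
print(f"next item index {choice}, difficulty {pool[choice]}")
```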
Another issue relates to the characteristics of the item pool required for adaptive testing. In a traditional norm-referenced test, most items are chosen so that about 60% of the examinees answer the item correctly. This difficulty level maximally differentiates people on a test in which everyone gets the same items. Thus few very difficult items are needed, which is good since it is often difficult to write very difficult items that are neither ambiguous nor trivial. On an adaptive test, the most proficient examinees might all see the same very difficult items, and thus to maintain security, a higher percentage of such items is needed.

Comparability

Comparability of scores on computerized and traditional tests is a very important issue as long as a testing program uses both modes of administration. Many studies have been conducted to determine whether scores on computer-administered tests are comparable to those on traditional paper tests. Results of individual studies have varied, with some implying there is a general advantage to examinees who take tests on computer, others implying the advantage goes to examinees taking tests on paper, and some showing no statistically significant differences. Because different computer-administration systems use different hardware, software, and user interfaces and since these studies have looked at many different content domains, perhaps these different results are real. However, in a 2008 meta-analysis of many such studies of K–12 tests in five subject areas, Neal Kingston showed a mean weighted effect size of −.01 standard deviations, which was not statistically significant. However, results differed by data source and thus the particular test administration system used.

Differential Access to Computers

Many people are concerned that differential access to computers might disadvantage some examinees. While the number of studies looking at potential bias related to socioeconomic status, gender, ethnicity, or amount of computer experience is small, most recent studies have not found significant differences in student-age populations.

Neal Kingston

See also Bayes's Theorem; b Parameter; Decision Rule; Guessing Parameter; Item Response Theory; Reliability; "Sequential Tests of Statistical Hypotheses"

Further Readings

Mills, C., Potenza, M., Fremer, J., & Ward, W. (Eds.). (2002). Computer-based testing. Hillsdale, NJ: Lawrence Erlbaum.
Parshall, C., Spray, J., Kalohn, J., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer.
Wainer, H. (Ed.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.

CONCOMITANT VARIABLE

It is not uncommon in designing research for an investigator to collect an array of variables representing characteristics on each observational unit. Some of these variables are central to the investigation, whereas others reflect preexisting differences in observational units and are not of interest per se. The latter are called concomitant variables, also referred to as covariates. Frequently in practice, these incidental variables represent undesired sources of variation influencing the dependent variable and are extraneous to the effects of manipulated (independent) variables, which are of primary interest. For designed experiments in which observational units are randomized to treatment conditions, failure to account for concomitant variables can exert a systematic influence (or bias) on the different treatment conditions. Alternatively, concomitant variables can increase the error variance, thereby reducing the likelihood of detecting real differences among the groups. Given these potential disadvantages, an ideal design strategy is one that would minimize the effect of the unwanted sources of variation corresponding to these concomitant variables. In practice, two general approaches are used to control the effects of concomitant variables: (1) experimental control and (2) statistical control. An expected benefit of controlling the effects of concomitant variables by either or both of these approaches is a substantial reduction in error variance, resulting in greater
208 Concomitant Variable

precision for estimating the magnitude of treat- models used in analyzing the data, for instance by
ment effects and increased statistical power. the use of socioeconomic status as a covariate in
an analysis of covariance (ANCOVA). Statisti-
Experimental Control cal control for regression procedures such as
ANCOVA means removing from the experimental
Controlling the effects of concomitant variables is error and from the treatment effect all extraneous
generally desirable. In addition to random assign- variance associated with the concomitant variable.
ment of subjects to experimental conditions, meth- This reduction in error variance is proportional to
ods can be applied to control these variables in the the strength of the linear relationship between the
design phase of a study. One approach is to use dependent variable and the covariate and is often
a small number of concomitant variables as the quite substantial. Consequently, statistical control
inclusion criteria for selecting subjects to participate is most advantageous in situations in which the
in the study (e.g., only eighth graders whose parents concomitant variable and outcome have a strong
have at least a high school education). A second linear dependency (e.g., a covariate that represents
approach is to match subjects on a small number of an earlier administration of the same instrument
concomitant variables and then randomly assign used to measure the dependent variable).
each matched subject to one of the treatment condi- In quasi-experimental designs in which random
tions. This requires that the concomitant variables assignment of observational units to treatments is
are available prior to the formation of the treat- not possible or potentially unethical, statistical
ment groups. Blocking, or stratification, as it is control is achieved through adjusting the estimated
sometimes referred to, is another method of con- treatment effect by controlling for preexisting
trolling concomitant variables in the design stage of group differences on the covariate. This adjust-
a study. The basic premise is that subjects are sorted ment can be striking, especially when the differ-
into relatively homogeneous blocks on the basis of ence on the concomitant variable across intact
levels of one or two concomitant variables. The treatment groups is dramatic. Like blocking or
experimental conditions are subsequently random- matching, using ANCOVA to equate groups on
ized within each stratum. Exerting experimental important covariates should not be viewed as
control through case selection, matching, and a substitute for randomization. Control of all
blocking can reduce experimental error, often resul- potential concomitant variables is not possible in
ting in improved statistical power to detect differ- quasi-experimental designs, which therefore are
ences among the treatment groups. As an exclusive always subject to threats to internal validity from
design strategy, the usefulness of any one of these unidentified covariates. This occurs because un-
three methods to control the effects of concomitant controlled covariates may be confounded with the
variables is limited, however. It is necessary to rec- effects of the treatment in a manner such that
ognize that countless covariates, in addition to group comparisons are biased.
those used to block or match subjects, may be
affecting the dependent variable and thus posing
potential threats to drawing appropriate inferences Benefits of Control
regarding treatment veracity. In contrast, randomi- To the extent possible, research studies should be
zation to experimental conditions ensures that any designed to control concomitant variables that are
idiosyncratic differences among the groups are not likely to systematically influence or mask the
systematic at the outset of the experiment. Random important relationships motivating the study. This
assignment does not guarantee that the groups are can be accomplished by exerting experimental
equivalent but rather that any observed differences control through restricted selection procedures,
are due only to chance. randomization of subjects to experimental condi-
tions, or stratification. In instances in which com-
plete experimental control is not feasible or in
Statistical Control
conjunction with limited experimental control, sta-
The effects of concomitant variables can be tistical adjustments can be made through regres-
controlled statistically if they are included in the sion procedures such as ANCOVA. In either case,
Concurrent Validity 209

advantages of instituting some level of control a hands-on measure in which examinees


have positive benefits that include (a) a reduction demonstrate first aid procedures.
in error variance and, hence, increased power and • Scores on a depression inventory are used to
(b) a decrease in potential bias attributable to classify individuals who are simultaneously
unaccounted concomitant variables. diagnosed by a licensed psychologist.

Jeffrey R. Harring The primary motives for developing a new


measure designed to tap the same construct of
See also Analysis of Covariance (ANCOVA); Block
interest as an established measure are cost and
Designs; Covariate; Experimental Design; Power;
convenience. A new measure that is shorter or
Statistical Control
less expensive but leads users to draw the same
conclusions as a longer, more costly measure is
a highly desirable alternative. For example, to
Further Readings obtain the necessary information about exam-
Dayton, C. M. (1970). The design of educational inee ability, decision makers might be able to
experiments. New York: McGraw-Hill. administer a short written test to a large number
Maxwell, S. E., & Delaney, H. D. (2004). Designing of individuals at the same time instead of pricey,
experiments and analyzing data: A model comparison one-on-one performance evaluations that involve
perspective (2nd ed.). Mahwah, NJ: Lawrence multiple ratings.
Erlbaum. Before users make decisions based on scores
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement,
from a new measure, they must have evidence
design and analysis: An integrated approach.
Hillsdale, NJ: Lawrence Erlbaum.
that there is a close relationship between the
scores of that measure and the performance on
the criterion measure. This evidence can be
obtained through a concurrent validation study.
In a concurrent validation study, the new mea-
CONCURRENT VALIDITY sure is administered to a sample of individuals
that is representative of the group for which the
Validity refers to the degree to which a measure measure will be used. An established criterion
accurately taps the specific construct that it claims measure is administered to the sample at, or
to be tapping. Criterion-related validity is con- shortly after, the same time. The strength of the
cerned with the relationship between individuals’ relationship between scores on the new measure
performance on two measures tapping the same and scores on the criterion measure indicates
construct. It typically is estimated by correlating the degree of concurrent validity of the new
scores on a new measure with scores from an measure.
accepted criterion measure. There are two forms The results of a concurrent validation study are
of criterion-related validity: predictive validity and typically evaluated in one of two ways, determined
concurrent validity. by the level of measurement of the scores from the
Concurrent validity focuses on the extent to two measures. When the scores on both the new
which scores on a new measure are related to measure and the criterion measure are continuous,
scores from a criterion measure administered at the degree of concurrent validity is established via
the same time, whereas predictive validity uses the a correlation coefficient, usually the Pearson prod-
scores from the new measure to predict perfor- uct–moment correlation coefficient. The correla-
mance on a criterion measure administered at tion coefficient between the two sets of scores is
a later time. also known as the validity coefficient. The validity
Examples of contexts in which concurrent coefficient can range from  1 to þ 1; coefficients
validity is relevant include the following: close to 1 in absolute value indicate high concur-
rent validity of the new measure.
• Scores on a written first aid exam are highly In the example shown in Figure 1, the concurrent
correlated with scores assigned by raters during validity of the written exam is quite satisfactory
210 Concurrent Validity

100 Table 1 Cross-Classification of Classification From


Inventory and Diagnosis by Psychologist
Score Assigned by Raters for

Diagnosis by Psychologist
80 Classification from
Hands-On Measure

Depression Inventory Depressed Not Depressed


60
Depressed 14 3
Not Depressed 7 86

40

• relevant to the desired decisions: Scores or


20
classifications on the criterion measure
should closely relate to, or represent,
0 variation on the construct of interest. Previous
0 20 40 60 80 100 validation studies and expert opinions
Score on Written First Aid Exam should demonstrate the usefulness and
appropriateness of the criterion for making
inferences and decisions about the construct of
Figure 1 Scatterplot of Scores on a Written First interest.
Aid Exam and Scores Assigned by • free from bias: Scores or classifications on
Raters During a Hands-On Demonstration the criterion measure should be free from
Measure bias, meaning that they should not be
influenced by anything other than the
construct of interest. Specifically, scores
should not be impacted by personal
because the scores on the written exam correlate characteristics of the individual, subjective
highly with the scores assigned by the rater; indivi- opinions of a rater, or other measurement
duals scoring well on the written measure were also conditions.
rated as highly proficient and vice versa. • reliable: Scores or classifications on the
When the new measure and the criterion criterion measure should be stable and
measure are both used to classify individuals, coef- replicable. That is, conclusions drawn
ficients of classification agreement, which are var- about the construct of interest should not be
iations of correlation coefficients for categorical clouded by inconsistent results across repeated
data, are typically used. Evidence of high concur- administrations, alternative forms, or a
rent validity is obtained when the classifications lack of internal consistency of the criterion
based on the new measure tend to agree with clas- measure.
sifications based on the criterion measure.
Table 1 displays hypothetical results of a concur- If the criterion measure against which the new
rent validation study for categorical data. The con- measure is compared is invalid because it fails to
current validity of the depression inventory is high meet these quality standards, the results of a con-
because the classification based on the new mea- current validation study will be compromised. Put
sure and the psychologist’s professional judgment differently, the results from a concurrent validation
match in 91% of the cases. study are only as useful as the quality of the crite-
When determining the concurrent validity of rion measure that is used in it. Thus, it is key to
a new measure, the selection of a valid criterion select the criterion measure properly to ensure that
measure is critical. Robert M. Thorndike has a lack of a relationship between the scores from
noted that criterion measures should ideally be rel- the new measure and the scores from the criterion
evant to the desired decisions, free from bias, and measure is truly due to problems with the new
reliable. In other words, they should already pos- measure and not to problems with the criterion
sess all the ideal measurement conditions that the measure.
new measure should possess also. Specifically, cri- Of course, the selection of an appropriate crite-
terion measures should be rion measure will also be influenced by the
Confidence Intervals 211

availability or the cost of the measure. Thus, the selected from the same population, one confidence
practical limitations associated with criterion mea- interval is constructed based on one sample with
surements that are inconvenient, expensive, or a certain confidence level. Together, all the confi-
highly impractical to obtain may outweigh other dence intervals should include the population
desirable qualities of these measures. parameter with the confidence level.
Suppose one is interested in estimating the pro-
Jessica Lynn Mislevy and André A. Rupp portion of bass among all types of fish in a lake. A
95% confidence interval for this proportion,
See also Criterion Validity; Predictive Validity
[25%, 36%], is constructed on the basis of a ran-
dom sample of fish in the lake. After more inde-
Further Readings pendent random samples of fish are selected from
Crocker, L., & Algina, J. (1986). Introduction to classical the lake, through the same procedure more confi-
and modern test theory. Belmont, CA: Wadsworth dence intervals are constructed. Together, all these
Group. confidence intervals will contain the true propor-
Thorndike, R. M. (2005). Measurement and evaluation tion of bass in the lake approximately 95% of
in psychology and education (7th ed.). Upper Saddle the time.
River, NJ: Pearson Education. The lower and upper boundaries of a confidence
interval are called lower confidence limit and
upper confidence limit, respectively. In the earlier
example, 25% is the lower 95% confidence limit,
CONFIDENCE INTERVALS and 36% is the upper 95% confidence limit.

A confidence interval is an interval estimate of


Confidence Interval Versus Significance Test
an unknown population parameter. It is con-
structed according to a random sample from the A significance test can be achieved by constructing
population and is always associated with a cer- confidence intervals. One can conclude whether
tain confidence level that is a probability, usually a test is significant based on the confidence inter-
presented as a percentage. Commonly used con- vals. Suppose the null hypothesis is that the popu-
fidence levels include 90%, 95%, and 99%. For lation mean μ equals 0 and the predetermined
instance, a confidence level of 95% indicates significance level is α. Let I be the constructed
that 95% of the time the confidence intervals 100ð1  αÞ% confidence interval for μ. If 0 is
will contain the population parameter. A higher included in the interval I, then the null hypothesis
confidence level usually forces a confidence inter- is accepted; otherwise, it is rejected. For example,
val to be wider. a researcher wants to test whether the result of an
Confidence intervals have a long history. Using experiment has a mean 0 at 5% significance level.
confidence intervals in statistical inference can be After the 95% confidence interval [  0.31, 0.21]
tracked back to the 1930s, and they are being used is obtained, it can be concluded that the null
increasingly in research, especially in recent medical hypothesis is accepted, because 0 is included in the
research articles. Researchersand research organiza- confidence interval. If the 95% confidence interval
tions such as the American Psychological Associa- is [0.11, 0.61], it indicates that the mean is signifi-
tion suggest that confidence intervals should always cantly different from 0.
be reported because confidence intervals provide For a predetermined significance level or confi-
information on both significance of test and vari- dence level, the ways of constructing confidence
ability of estimation. intervals are usually not unique. Shorter confidence
intervals are usually better because they indicate
greater power in the sense of significance test.
Interpretation
One-sided significance tests can be achieved by
A confidence interval is a range in which an constructing one-sided confidence intervals. Sup-
unknown population parameter is likely to be pose a researcher is interested in an alternative
included. After independent samples are randomly hypothesis that the population mean is larger than
212 Confidence Intervals

0 at the significance level .025. The researcher will The Population Follows a Normal Distribution
construct a one-sided confidence interval taking With Unknown Variance.
the form of ð∞; b with some constant b. Note Suppose a sample of size N is randomly selected
that the width of a one-sided confidence interval is from the population with observations x1 ;
infinity. Following the above example, the null P
N
hypothesis would be that the mean is less than or x2 ; . . . ; xN . Let x ¼ xi =N. This is the sample
equal to 0 at the .025 level. If the 97.5% one-sided i¼1

confidence interval is ð∞; 3:8, then the null mean. The sample variance is defined as
hypothesis is accepted because 0 is included in the P
N
ðxi  xÞ2 =ðN  1Þ, denoted by s2. Therefore
interval. If the 97.5% confidence interval is i¼1
pffiffiffiffiffi
ð∞; 2:1 instead, then the null hypothesis is the confidence interval is ½x  tN1;α=2 × s= N ;
rejected because there is no overlap between pffiffiffiffiffi
x þ tN1;α=2 × s= N . Here tN1;α=2 is the upper
ð∞; 2:1 and ½0; ∞Þ. α=2 quantile, meaning PðtN1;α=2 ≤ TN1 Þ ¼ α=2
and TN1 follows the t distribution with degree of
freedom N  1. Refer to the t-distribution table or
Examples use software to get tN1;α=2 . If N is greater than
Confidence Intervals for a Population 30, one can use zα=2 instead of tN1;α=2 in the con-
Mean With Confidence Level 100ð1  αÞ% fidence interval formula because there is little dif-
ference between them for large enough N.
Confidence intervals for a population mean are
As an example, the students’ test scores in a class
constructed on the basis of the sample mean
follow a normal distribution. One wants to con-
distribution.
struct a 95% confidence interval for the class aver-
age score based on an available random sample of
The Population Follows a Normal Distribution size N ¼ 10. The 10 scores are 69, 71, 77, 79, 82,
With a Known Variance σ 2 84, 80, 94, 78, and 67. The sample mean and the
After a random sample of size N is selected sample variance are 78.1 and 62.77, respectively.
from the population, one is able to calculate the According the t-distribution table, t9;0:25 ¼ 2:26.
sample mean x, which is the average of all the The 95% confidence interval for the class average
observations. pThus the confidence score is ½72.13, 84.07.
ffiffiffiffiffi pffiffiffiffiffi interval is
½x  zα=2 × σ= N ; x þ zα=2 × σ= N  that is p cen-
ffiffiffiffiffi The Population Follows a General Distribution
tered at x with half of the length zα=2 × σ= N . Other Than Normal Distribution
zα=2 is the upper α=2 quantile, meaning
Pðzα=2 ≤ ZÞ ¼ α=2. Here Z is a standard normal This is a common situation one might see in
random variable. To find zα=2 , one can either refer practice. After obtaining a random sample of size
to the standard normal distribution table or use N from the population, where it is required that
statistical software. Nowadays most of the statisti- N ≥ 30, the sample mean and the sample variance
cal software, such as Excel, R, SAS, SPSS (an IBM can be computed as in the previous subsections,
company, formerly called PASWâ Statistics), and denoted by x and s2 , respectively. According to the
Splus, have a simple command that will do the central limit theorem, an approximate confidence pffiffiffiffiffi
job. For commonly used confidence intervals, 90% interval can pbeffiffiffiffiffi expressed as ½x  zα=2 × s= N ;
confidence level corresponds to z0:05 ¼ 1.645, 95% x þ zα=2 × s= N .
corresponds to z0:025 ¼ 1.96, and 99% corresponds
to z0:005 ¼ 2.56.
Confidence Interval for
The above confidence interval will shrink to
Difference of Two Population Means
a point if N goes to infinity. So the interval esti-
With Confidence Level 100ð1  αÞ%
mate turns into a point estimate. It can be inter-
preted as if the whole population is taken as One will select a random sample from each
the sample; the sample mean is actually the population. Suppose the sample sizes are N1 and
population mean. N2 . Denote the sample means by x1 and x2
Confidence Intervals 213

and the sample variances by s21 and s22 , For example, a doctor wants to construct
respectively, for the two samples. The confidence a 99% confidence interval for the chance of having
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
a certain disease by studying patients’ x-ray slides.
interval is ½x1  x2  zα=2 × s21 =N1 þ s22 =N2 ,
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi N ¼ 30 x-ray slides are randomly selected, and
x1  x2 þ zα=2 × s21 =N1 þ s22 =N2 . the number of positive slides follows a distribution
known as the binomial distribution. Suppose 12 of
If one believes that the two population ^ ¼ 12/30,
them are positive for the disease. Hence p
variances are about the same, the confidence ^ ¼ 12
which is the sample proportion. Since N p
interval will be ½x1  x2  tN1 þN2 2;α=2 ×sp ; x1  ^
and Nð1  pÞ ¼ 18 are both larger than 5, the
x2 þ tN1 þN2 2;α=2 × sp , where confidence interval for the unknown proportion
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
sp ¼ ðN1 þN2 Þ½ðN1  1Þs21 þ ðN2  1Þs22 =½ðN1 þ can be constructed using the normal approxima-
tion. The estimated standard error for p ^ is 0.09.
N2  2ÞN1 N2 :
Continuing with the above example about the Thus the lower 99% confidence limit is 0.17 and
students’ scores, call that class Class A: If one is the upper 99% confidence limit is 0.63. So the
interested in comparing the average scores 99% confidence interval is ½0:17; 0:63.
between Class A and another class, Class B, the The range of a proportion is between 0 and 1.
confidence interval for the difference between But sometimes the constructed confidence interval
the average class scores will be constructed. for the proportion may exceed it. When this hap-
First, randomly select a group of students in pens, one should truncate the confidence interval
Class B. Suppose the group size is 8. These eight to make the lower confidence limit 0 or the upper
students’ scores are 68, 79, 59, 76, 80, 89, 67, confidence limit 1.
and 74. The sample mean is 74, and the sample Since the binomial distribution is discrete, a cor-
variance is 86.56 for Class B. If Class A and rection for continuity of 0:5=N may be used to
Class B are believed to have different population improve the performance of confidence intervals.
variances, then the 95% confidence interval The corrected upper limit is added by 0:5=N, and
for the difference of the average scores is the corrected lower limit is subtracted by 0:5=N.
½4.03, 12.17] by the first formula provided in One can also construct confidence intervals
this subsection. If one believes these two classes for the proportion difference between two popu-
have about the same population variance, then lations based on the normal approximation. Sup-
the 95% confidence interval will be changed to pose two random samples are independently
½4.47, 12.57] by the second formula. selected from the two populations, with sample
sizes N1 and N2 and sample proportions p ^1 and
Confidence Intervals for a Single Proportion ^
p2 respectively. The estimated population
and Difference of Two Proportions proportion difference is the sample proportion
difference, p ^1  p
^2 , and the estimated standard
Sometimes one may need to construct confi-
error for the proportion difference is
dence intervals for a single unknown population pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
^ the sample proportion, seðp^1  p^2 Þ ¼ p ^1 ð1  p ^1 Þ = N1 þ p ^2 ð1  p ^2 Þ=N2 .
proportion. Denote p
which can be obtained from a random sample The confidence interval for two-sample propor-
tion difference is ½ðp ^1  p ^2 Þ  zα=2 × seðp ^1  p ^2 Þ;
from the population. The estimated standard ffi error
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
for the proportion is seðp ^Þ ¼ p ^ð1  p ^Þ=N. Thus ðp1  p2 Þ þ zα=2 × seðp1  p2 Þ:
^ ^ ^ ^
the confidence interval for the unknown propor- Similar to the normal approximation for a single
^  zα=2 × seðp
tion is ½p ^ þ zα=2 × seðp
^Þ; p ^Þ. proportion, the approximation for the proportion
This confidence interval is constructed on the difference depends on sample sizes and sample pro-
basis of the normal approximation (refer to the cen- portions. The rule of thumb is that N p ^1 , Nð1 p ^1 Þ,
tral limit theorem for the normal approximation). Np ^2 , and Nð1  p ^2 Þ should each be larger than 10.
The normal approximation is not appropriate when
Confidence Intervals for Odds Ratio
the proportion is very close to 0 or 1. A rule of
thumb is that when N p ^ > 5 and Nð1  p ^Þ > 5, An odds ratio (OR) is a commonly used effect
usually the normal approximation works well. size for categorical outcomes, especially in health
214 Confidence Intervals

science, and is the ratio of the odds in Category 1 Table 1 Lung Cancer Among Smokers and Nonsmokers
to the odds in Category 2. For example, one wants Smoking Lung No Lung
to find out the relationship between smoking and Status Cancer Cancer Total
lung cancer. Two groups of subjects, smokers and Smokers N11 N12 N1.
nonsmokers, are recruited. After a few years’ fol- Nonsmokers N21 N22 N2.
low-ups, N11 subjects among the smokers are diag- Total N.1 N.2 N
nosed with lung cancer and N21 subjects among
the nonsmokers. There are N12 and N22 subjects
N11 =ðN11 þ N12 Þ, and the risk among nonsmo-
who do not have lung cancer among the smokers
kers is estimated as N21 =ðN21 þ N22 Þ. The RR is
and the nonsmokers, respectively. The odds of hav-
the ratio of the above two risks, which is
ing lung cancer among the smokers and the non-
½N11 =ðN11 þ N12 Þ=½N21 =ðN21 þ N22 Þ.
smokers are estimated as N11 =N12 and N21 =N22 ,
Like the OR, the sampling distribution of
respectively. The OR of having lung cancer
lnðRRÞ is approximated to a normal distribution.
among the smokers compared with the nonsmo-
The standard error for lnðRRÞ
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi is
kers is the ratio of the above two odds, which is
seðlnðRRÞÞ ¼ 1=N11  1=N1 : þ 1=N21  1=N2 .
ðN11 =N12 Þ=ðN21 =N22 Þ ¼ ðN11 N22 Þ=ðN21 N12 Þ: A 95% confidence interval for lnðRRÞ is

For a relatively large total sample size, lnðORÞ ½lnðRRÞ  1:96 × seðlnðRRÞÞ, lnðRRÞ þ 1:96
is approximated to a normal distribution, so the × seðlnðRRÞÞ:
construction for the confidence interval for
lnðORÞ is similar to that for the normal distribu- Thus the 95% confidence interval for the RR is
tion. The standard error for lnðORÞ is defined as
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ½exp ðlnðRRÞ  1:96 × se ðlnðRRÞÞÞ; exp ðlnðRRÞ þ
seðlnðORÞÞ ¼ 1=N11 þ 1=N12 þ1=N21 þ 1=N22 . 1:96 × seðlnðRRÞÞÞ.
A 95% confidence interval for lnðORÞ is The confidence intervals for the RRs are not
½lnðORÞ  1:96 × seðlnðORÞÞ, lnðORÞ þ 1:96 × se symmetric about the estimated RR either. One can
ðlnðORÞÞ. As the exponential function is mono- tell the significance of a test from the corresponding
tonic, there is one-to-one mapping between the confidence interval for the RR. Usually the null
OR and ln(OR). Thus a 95% confidence interval hypothesis is that RR ¼ 1, which means that the
for the OR is ½expðlnðORÞ  1:96 × seðlnðORÞÞÞ, two groups have the same risk. For the above
expðlnðORÞ þ 1:96 × seðlnðORÞÞÞ: example, the null hypothesis would be that the risks
The confidence intervals for the OR are not of developing lung cancer among smokers and non-
symmetric about the estimated OR. But one can smokers are equal. If 1 is included in the confidence
still tell the significance of the test on the basis of interval, one may accept the null hypothesis. If not,
the corresponding confidence interval for the OR. the null hypothesis should be rejected.
For the above example, the null hypothesis is that
there is no difference between smokers and non- Confidence Intervals for Variance
smokers in the development of lung cancer; that is,
A confidence interval for unknown population
OR ¼ 1. If 1 is included in the confidence inter-
variance can be constructed with the use of a central
val, one should accept the null hypothesis; other-
chi-square distribution. For a random sample with
wise, one should reject it.
size N and sample variance s2 , an approximate
two-sided 100ð1  αÞ% confidence interval for
population variance σ 2 is ½ðN  1Þs2 =χ2N1 ðα=2Þ,
Confidence Intervals for Relative Risk
ðN  1Þs2 =χ2N1 ð1  α=2Þ. Here χ2N1 ðα=2Þ is at
Another widely used concept in health care is the upper α=2 quantile satisfying the requirement
relative risk (RR), which is the risk difference bet- that the probability that a central chi-square ran-
ween two groups. Risk is defined as the chance of dom variable with degree of freedom N  1 is
having a specific outcome among subjects in that greater than χ2N1 ðα=2Þ is α=2. Note that this confi-
group. Taking the above example, the risk of hav- dence interval may not work well if the sample size
ing lung cancer among smokers is estimated as is small or the distribution is far from normal.
Confidence Intervals 215

Bootstrap Confidence Intervals vector θ satisfies Pðθ ∈ DÞ ¼ 1  α, where D does


not need to be a product of simple intervals.
The bootstrap method provides an alternative way
for constructing an interval to measure the accu-
Links With Bayesian Intervals
racy of an estimate. It is especially useful when the
usual confidence interval is hard or impossible to A Bayesian interval, or credible interval, is derived
calculate. Suppose sðxÞ is used to estimate an from the posterior distribution of a population
unknown population parameter θ based on a sam- parameter. From a Bayesian point of view, the
ple x of size N. A bootstrap confidence interval for parameter θ can be regarded as a random quantity,
the estimate can be constructed as follows. One which follows a distribution PðθÞ, known as the
randomly draws another sample x of the same prior distribution. For each fixed θ, the data x are
size N with replacement from the original sample assumed to follow the conditional distribution
x. The estimate sðx Þ based on x is called a boot- PðxjθÞ, known as the model. Then Bayes’s theorem
strap replication of the original estimate. Repeat can be applied on x to get the adjusted distribution
the procedure for a large number of times, say PðθjxÞ of θ, known as the posterior distribution,
1,000 times. Then the αth quantile and the which is proportional to PðxjθÞ · PðθÞ. A 1  α
ð1  αÞth quantile of the 1,000 bootstrap replica- Bayesian interval I is an interval that satisfies
tions serve as the lower and upper limits of Pðθ ∈ IjxÞ ¼ 1  α according to the posterior dis-
a 100(1  2α)% bootstrap confidence interval. tribution. In order to guarantee the optimality or
Note that the interval obtained in this way may uniqueness of I, one may require that the values of
vary a little from time to time due to the random- the posterior density function inside I always be
ness of bootstrap replications. Unlike the usual greater than any one outside I. In those cases, the
confidence interval, the bootstrap confidence inter- Bayesian interval I is called the highest posterior
val does not require assumptions on the popula- density region. Unlike the usual confidence inter-
tion distribution. Instead, it highly depends on the vals, the level 1  α of a Bayesian interval I indi-
data x itself. cates the probability that the random θ falls into I.

Simultaneous Confidence Intervals Qiaoyan Hu, Shi Zhao, and Jie Yang

Simultaneous confidence intervals are intervals for See also Bayes’s Theorem; Boostrapping; Central Limit
estimating two or more parameters at a time. For Theorem; Normal Distribution; Odds Ratio; Sample;
example, suppose μ1 and μ2 are the means of two Significance, Statistical
different populations. One wants to find confi-
dence intervals I1 and I2 simultaneously such that Further Readings
Blyth, C. R., & Still, H. A. (1983). Binomial confidence
Pðμ1 ∈ I1 and μ2 ∈ I2 Þ ¼ 1  α:
intervals. Journal of the American Statistical
Association, 78, 108–116.
If the sample x1 used to estimate μ1 is indepen- Efron, B., & Tibshirani, R. J. (1993). An introduction to
dent of the sample x2 for μ2,pthen
ffiffiffiffiffiffiffiffiffiffiffiIffi1 and I2 can be the bootstrap. New York: Chapman & Hall/CRC.
simply calculated as 100 1  α% confidence Fleiss, J. R., Levin, B., & Paik, M. C. (2003). Statistical
intervals for μ1 and μ2 , respectively. methods for rates and proportions. Hoboken, NJ: Wiley.
The simultaneous confidence intervals I1 and I2 Newcombe, R. G. (1998). Two-sided confidence intervals
can be used to test whether μ1 and μ2 are equal. If for the single proportion: Comparison of seven
I1 and I2 are nonoverlapped, then μ1 and μ2 are methods. Statistics in Medicine, 17, 857–872.
significantly different from each other at a level Pagano, M., & Gauvreau, K. (2000). Principles of
biostatistics. Pacific Grove, CA: Duxbury.
less than α.
Smithson, M. (2003).Confidence intervals. Thousand
Simultaneous confidence intervals can be general- Oaks, CA: Sage.
ized into a confidence region in the multidimen- Stuart, A., Ord, K., & Arnold, S. (1998). Kendall’s
sional parameter space, especially when the esti- advanced theory of statistics: Vol. 2A. Classical
mates for parameters are not independent. A inference and the linear model (6th ed.). London:
100(1  α)% confidence region D for the parameter Arnold.
216 Confirmatory Factor Analysis

CFA model, and the manner and settings in which


CONFIRMATORY FACTOR ANALYSIS the CFA model is most commonly implemented.

The Confirmatory Factor Analysis Model


Research in the social and behavioral sciences
often focuses on the measurement of unobservable, The CFA model belongs to the larger family of
theoretical constructs such as ability, anxiety, depr- modeling techniques referred to as structural
ession, intelligence, and motivation. Constructs are equation models (SEM). Structural equation
identified by directly observable, manifest variables models offer many advantages over traditional
generally referred to as indicator variables (note modeling techniques (although many traditional
that indicator, observed, and manifest variables are techniques, such as multiple regression, are con-
often used interchangeably). Indicator variables sidered special types of SEMs), among these the
can take many forms, including individual items use of latent variables and the ability to model
or one or more composite scores constructed complex measurement error structures. The lat-
across multiple items. Many indicators are avail- ter, measurement error, is not accounted for in
able for measuring a construct, and each may dif- the traditional techniques, and, as was previ-
fer in how reliably it measures the construct. The ously mentioned, ignoring measurement error
choice of which indicator to use is based largely oftentimes leads to inaccurate results. Many dif-
on availability. Traditional statistical techniques ferent types of models can be put forth in the
using single indicator measurement, such as regres- SEM framework, with the more elaborate mod-
sion analysis and path analysis, assume the indica- els containing both of the SEM submodels: (a)
tor variable to be an error-free measure of the the measurement model and (b) the structural
particular construct of interest. Such an assump- model. The measurement model uses latent vari-
tion can lead to erroneous results. ables to explain variability that is shared by the
Measurement error can be accounted for by the indicator variables and variability that is unique
use of multiple indicators of each construct, thus to each indicator variable. The structural model,
creating a latent variable. This process is generally on the other hand, builds on the measurement
conducted in the framework of factor analysis, model by analyzing the associations between the
a multivariate statistical technique developed in latent variables as direct causal effects.
the early to mid-1900s primarily for identifying The CFA model is considered a measurement
and/or validating theoretical constructs. The gen- model in which an unanalyzed association (covari-
eral factor analysis model can be implemented in ance) between the latent variables is assumed to
an exploratory or confirmatory framework. The exist. Figure 1 displays a basic two-factor CFA
exploratory framework, referred to as exploratory model in LISREL notation. This diagram displays
factor analysis (EFA), seeks to explain the relation- two latent variables, labeled with xi (pronounced
ships among the indicator variables through a given ksi), ξ1 and ξ2 , each identified by three indicator
number of previously undefined latent variables. variables (i.e., X1, X2, and X3 for ξ1 and X4, X5,
In contrast, the confirmatory framework, referred and X6 for ξ2 ). The single-headed arrows pointing
to as confirmatory factor analysis (CFA), uses from the latent variables to each of the indicator
latent variables to reproduce and test previously variables represent factor loadings or pattern coef-
defined relationships between the indicator vari- ficients, generally interpreted as regression coeffi-
ables. The methods differ in their underlying pur- cients, labeled with lambda as λ11 , λ21 , λ31 ,
pose. Whereas EFA is a data-driven approach, λ42 , λ52 , and λ62 and signifying the direction and
CFA is a hypothesis driven approach requiring the- magnitude of the relationship between each indica-
oretically and/or empirically based insight into the tor and latent variable. For instance, λ21 represents
relationships among the indicator variables. This the direction and magnitude of the relationship
insight is essential for establishing a starting point- between latent variable one (ξ1 ) and indicator vari-
for the specification of a model to be tested. What able two (X2). These coefficients can be reported
follows is a more detailed, theoretical discussion of in standardized or unstandardized form. The
the CFA model, the process of implementing the double-headed arrow between the latent variables,
Confirmatory Factor Analysis 217

δ is a q × 1 vector of measurement errors that are


δ1 x1 λ11 ϕ21 λ 42 x4 δ4 assumed to be independent of each other. The
equation above written in matrix notation is rec-
ognized as the measurement model in a general
λ 21 λ 52 structural equation model.
δ2 x2 ξ1 ξ2 x5 δ5
The path diagram in Figure 1 depicts an ideal
CFA model. The terms simple structure and unidi-
mensional measurement are used in the general
δ3 x3 λ 31 λ 62 x6 δ6 factor analysis framework to signify a model that
meets two conditions: (1) Each latent variable is
defined by a subset of indicator variables which
Figure 1 Two-Factor Confirmatory Factor Analysis in are strong indicators of that latent variable, and
LISREL Notation (2) each indicator variable is strongly related to
a single latent variable and weakly related to the
other latent variable(s). In general, the second con-
labeled as ’21 (phi), represents an unanalyzed dition is met in the CFA framework with the initial
covariance or correlation between the latent vari- specification of the model relating each indicator
ables. This value signifies the direction and magni- variable to a single latent variable. However, sub-
tude of the relationship between the latent sequent testing of the initial CFA model may reveal
variables. Variability in the indicator variables not the presence of an indicator variable that is
attributable to the latent variable, termed measure- strongly related to multiple latent variables. This is
ment error, is represented by the single-headed a form of multidimensional measurement repre-
arrows pointing toward the indicator variables sented in the model by a path between an indicator
from each of the measurement error terms, labeled variable and the particular latent variable it is
with delta as δ1 ; δ2 ; δ3 ; δ4 ; δ5 ; and δ6 . These are related to. This is referred to as a cross-loading
essentially latent variables assumed to be indepen- (not included in Figure 1).
dent of each other and of the latent variables ξ1 Researchers using EFA often encounter cross-
and ξ2 . loadings because models are not specified in
The CFA model depicted in Figure 1 can be advance. Simple structure can be obtained in EFA
represented in equation form as follows: through the implementation of a rotation method.
However, a rotated factor solution does not
x1 ¼ λ11ξ1 þ δ1 remove the presence of cross-loadings, particularly
x2 ¼ λ21ξ1 þ δ2 when the rotation method allows for correlated
x3 ¼ λ31ξ1 þ δ3 factors (i.e., oblique rotation). As previously men-
tioned, specifying the model in advance provides
x4 ¼ λ42ξ2 þ δ4
the researcher flexibility in testing if the inclusion
x5 ¼ λ52ξ2 þ δ5 of one or more cross-loadings results in a better
x6 ¼ λ62ξ2 þ δ6 explanation of the relationships in the observed
data. Literature on this topic reveals the presence
These equations can be written in more compact of conflicting points of view. In certain instances, it
form by using matrix notation: may be plausible for an indicator to load on multi-
ple factors. However, specifying unidimensional
x ¼ ξ þ δ; models has many advantages, including but not
limited to providing a clearer interpretation of the
where x is a q × 1 vector of observed indicator model and a more precise assessment of conver-
variables (where q is the number of observed indi- gent and discriminant validity than multidimen-
cator variables),  is a q × n matrix of regression sional models do.
coefficients relating the indicators to the latent Multidimensional measurement is not limited to
variables (where n is the number of latent vari- situations involving cross-loadings. Models are also
ables), ξ is an n × 1 vector of latent variables, and considered multidimensional when measurement
218 Confirmatory Factor Analysis

errors are not independent of one another. In such must determine whether, based on the sample
models, measurement errors can be specified to cor- covariance matrix, S, and the model-implied
relate. Correlating measurement errors allows for covariance matrix, ðYÞ, a unique estimate of
hypotheses to be tested regarding shared variance each unknown parameter in the model can be
that is not due to the underlying factors. Specifying identified. The model is identified if the number of
the presence of correlated measurement errors in parameters to be estimated is less than or equal to
a CFA model should be based primarily on model the number of unique elements in the variance–
parsimony and, perhaps most important, on sub- covariance matrix used in the analysis. This is the
stantive considerations. case when the degrees of freedom for the model
are greater than or equal to 0. In addition, a metric
must be defined for every latent variable, including
Conducting a Confirmatory Factor Analysis
the measurement errors. This is typically done by
Structural equation modeling is also referred to as setting the metric of each latent variable equal to
covariance structure analysis because the covari- the metric of one of its indicators (i.e., fixing the
ance matrix is the focus of the analysis. The gen- loading between indicator and its respective latent
eral null hypothesis to be tested in CFA is variable to 1) but can also be done by setting the
variance of each latent variable equal to 1.
¼ ðÞ; The process of model estimation in CFA (and
SEM, in general) involves the use of a fitting func-
where is the population covariance matrix, esti- tion such as generalized least squares or maximum
mated by the sample covariance matrix S, and likelihood (ML) to obtain estimates of model para-
ðθÞ is the model-implied covariance matrix, esti- meters that minimize the discrepancy between the
mated by ðYÞ. Researchers using CFA seek to sample covariance matrix, S, and the model
specify a model that most precisely explains the implied covariance matrix, ðYÞ. ML estimation
relationships among the variables in the original is the most commonly used method of model esti-
data set. In other words, the model put forth by mation. All the major software packages (e.g.,
the researcher reproduces or fits the observed sam- AMOS, EQS, LISREL, MPlus) available for posing
ple data to some degree. The more precise the fit and testing CFA models provide a form of ML
of the model to the data, the smaller the difference estimation. This method has also been extended to
between the sample covariance matrix, S, and the address issues that are common in applied settings
model-implied covariance matrix, ðYÞ. To evalu- (e.g., nonnormal data, missing data ), making CFA
ate the null hypothesis in this manner, a sequence applicable to a wide variety of data types.
of steps common to implementing structural equa- Various goodness-of-fit indices are available for
tion models must be followed. determining whether the sample covariance
The first step, often considered the most chal- matrix, S, and the model implied covariance
lenging, requires the researcher to specify the matrix, ðYÞ, are sufficiently equal to deem the
model to be evaluated. To do so, researchers must model meaningful. Many of these indices are
use all available information (e.g., theory, previous derived directly from the ML fitting function. The
research) to postulate the relationships they expect primary index is the chi-square. Because of the
to find in the observed data prior to the data col- problems inherent with this index (e.g., inflated by
lection process. Postulating a model may also sample size), assessing the fit of a model requires
involve the use of EFA as an initial step in develop- the use of multiple indices from the three broadly
ing the model. Different data sets must be used defined categories of indices: (1) absolute fit,
when this approach is taken because EFA results (2) parsimony correction, and (3) comparative fit.
are subject to capitalizing on chance. Following an The extensive research on fit indices has fueled the
EFA with a CFA on the same data set may com- debate and answered many questions as to which
pound the capitalization-on-chance problem and are useful and what cutoff values should be
lead to inaccurate conclusions based on the results. adopted for determining adequate model fit in
Next, the researcher must determine whether a variety of situations. Research has indicated that
the model is identified. At this stage, the research fit indices from each of these categories provide
Confirmatory Factor Analysis 219

pertinent information and should be used in uni- Scale Development


son when conducting a CFA.
CFA can be particularly useful in the process of
Even for a well-fitting model, further respecifica-
developing a scale for tests or survey instruments.
tion may be required because goodness-of-fit indices
Developing such an instrument requires a strong
evaluate the model globally. Obviously, a model that
understanding of the theoretical constructs that
fits the data poorly requires reconsideration of the
are to be evaluated by the instrument. CFA proce-
nature of the relationships specified in the model.
dures can be used to verify the underlying struc-
Models that fit the data at an acceptable level, how-
ture of the instrument and to quantify the
ever, may fail to explain all the relationships in the
relationship between each item and the construct
variance–covariance matrix adequately. The process
it was designed to measure, as well as quantify the
of model respecification involves a careful examina-
relationships between the constructs themselves. In
tion of areas in which the model may not explain
addition, CFA can be used to aid in the process of
relationships adequately. A careful inspection of
devising a scoring system as well as evaluating the
the residual variance–covariance matrix, which
reliability of an instrument. Recent advances in the
represents the difference between the sample
use of CFA with categorical data also make it an
variance–covariance matrix, S, and the model-
attractive alternative to the item response theory
implied variance–covariance matrix, ðYÞ, can
model for investigating issues such as measurement
reveal poorly explained relationships. Modification
invariance.
indices can also be used for determining what part
of a model should be respecified. The Lagrange
multiplier test can be used to test whether certain Validity of Theoretical Constructs
relationships should be included in the model, and CFA is a particularly effective tool for examin-
the Wald test can be used to determine whether cer- ing the validity of theoretical constructs. One of
tain relationships should be excluded from the the most common approaches to testing validity of
model. Last, parameter estimates themselves should theoretical constructs is through the multitrait–
be carefully examined in terms of practical (i.e., multimethod (MTMM) design, in which multiple
interpretability) as well as statistical significance. traits (i.e., constructs) are assessed through multi-
Following the aforementioned steps is common ple methods. The MTMM design is particularly
practice in the application of CFA models in effective in evaluating convergent validity (i.e.,
research. The final step, respecification of the strong relations among different indicators, col-
model, can uncover statistically equivalent versions lected via different methods, of the same construct)
of the original CFA model found to provide the and discriminant validity (i.e., weak relations
best fit. These models are equivalent in the sense among different indicators, collected via different
that they each explain the original data equally methods, of distinct constructs). The MTMM
well. While many of the equivalent models that design can also be used to determine whether
can be generated will not offer theoretically plausi- a method effect (i.e., strong relations among differ-
ble alternatives to the original model, there will ent indicators of distinct constructs, collected via
likely be a competing model that must be consid- the same method) is present.
ered before a final model is chosen. The complex
decision of which model offers the best fit statisti-
Measurement Invariance
cally and theoretically can be made easier by
a researcher with a strong grasp of the theory CFA models are commonly used in the assess-
underlying the original model put forth. ment of measurement invariance and population
heterogeneity. By definition, measurement invari-
Applications of Confirmatory ance refers to the degree to which measurement
models generalize across different groups (e.g.,
Factor Analysis Models
males vs. females, native vs. nonnative English
CFA has many applications in the practical realm. speakers) as well as how a measurement instru-
The following section discusses some of the more common uses of the CFA model.

… measurement generalizes across time. To assess measurement invariance, CFA models can be specified as either multiple group models or multiple indicator–multiple cause (MIMIC) models. While these models are considered interchangeable, there are advantages to employing the multiple group approach. In the multiple group approach, tests of invariance across a greater number of parameters can be conducted. The MIMIC model tests for differences in intercept and factor means, essentially providing information about which covariates have direct effects in order to determine what grouping variables might be important in a multiple group analysis. A multiple group model, on the other hand, offers tests for differences in intercept and factor means as well as tests of other parameters such as the factor loadings, error variances and covariances, factor means, and factor covariances. The caveat of employing a multiple group model, however, lies in the necessity of having a sufficiently large sample size for each group, as well as addressing the difficulties that arise in analyses involving many groups. The MIMIC model is a more practical approach with smaller samples.

Higher Order Models

The CFA models presented to this point have been first-order models. These first-order models include the specification of all necessary parameters excluding the assumed relationship between the factors themselves. This suggests that even though a relationship between the factors is assumed to exist, the nature of that relationship is "unanalyzed" or not specified in the initial model. Higher order models are used in cases in which the relationship between the factors is of interest. A higher order model focuses on examining the relationship between the first-order factors, resulting in a distinction between variability shared by the first-order factors and variability left unexplained. The process of conducting a CFA with second-order factors is essentially the same as the process of testing a CFA with first-order factors.

Summary

CFA models are used in a variety of contexts. Their popularity results from the need in applied research for formal tests of theories involving unobservable latent constructs. In general, the popularity of SEM and its use in testing causal relationships among constructs will require sound measurement of the latent constructs through the CFA approach.

Greg William Welch

See also Exploratory Factor Analysis; Latent Variable; Structural Equation Modeling

Further Readings

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: Guilford Press.
Spearman, C. E. (1904). General intelligence objectively determined and measured. American Journal of Psychology, 15, 201–293.
Thurstone, L. L. (1935). The vectors of the mind. Chicago: University of Chicago Press.

CONFOUNDING

Confounding occurs when two variables systematically covary. Researchers are often interested in examining whether there is a relationship between two or more variables. Understanding the relationship between or among variables, including whether those relationships are causal, can be complicated when an independent or predictor variable covaries with a variable other than the dependent variable. When a variable systematically varies with the independent variable, the confounding variable provides an explanation other than the independent variable for changes in the dependent variable.

Confounds in Correlational Designs

Confounding variables are at the heart of the third-variable problem in correlational studies. In a correlational study, researchers examine the relationship between two variables. Even if two variables are correlated, it is possible that a third, confounding variable is responsible for the apparent relationship between the two variables. For example, if there were a correlation between ice cream consumption and homicide rates, it would be a mistake to assume that eating ice cream causes homicidal rages or that murderers seek frozen treats after killing. Instead, a third
variable—heat—is likely responsible for both increases in ice cream consumption and homicides (given that heat has been shown to increase aggression). Although one can attempt to identify and statistically control for confounding variables in correlational studies, it is always possible that an unidentified confound is producing the correlation.

Confounds in Quasi-Experimental and Experimental Designs

The goal of quasi-experimental and experimental studies is to examine the effect of some treatment on an outcome variable. When the treatment systematically varies with some other variable, the variables are confounded, meaning that the treatment effect is comingled with the effects of other variables. Common sources of confounding include history, maturation, instrumentation, and participant selection. History confounds may arise in quasi-experimental designs when an event that affects the outcome variable happens between pretreatment measurement of the outcome variable and its posttreatment measurement. The events that occur between pre- and posttest measurement, rather than the treatment, may be responsible for changes in the dependent variable. Maturation confounds are a concern if participants could have developed—cognitively, physically, emotionally—between pre- and posttest measurement of the outcome variable. Instrumentation confounds occur when different instruments are used to measure the dependent variable at pre- and posttest or when the instrument used to collect the observation deteriorates (e.g., a spring loosens or wears out on a key used for responding in a timed task). Selection confounds may be present if the participants are not randomly assigned to treatments (e.g., use of intact groups, participants self-select into treatment groups). In each case, the confound provides an alternative explanation—an event, participant development, instrumentation changes, preexisting differences between groups—for any treatment effects on the outcome variable.

Even though the point of conducting an experiment is to control the effects of potentially confounding variables through the manipulation of an independent variable and random assignment of participants to experimental conditions, it is possible for experiments to contain confounds. An experiment may contain a confound because the experimenter intentionally or unintentionally manipulated two constructs in a way that caused their systematic variation. The Illinois Pilot Program on Sequential, Double-Blind Procedures provides an example of an experiment that suffers from a confound. In this study commissioned by the Illinois legislature, eyewitness identification procedures conducted in several Illinois police departments were randomly assigned to one of two conditions. For the sequential, double-blind condition, administrators who were blind to the suspect's identity showed members of a lineup to an eyewitness sequentially (i.e., one lineup member at a time). For the single-blind, simultaneous condition, administrators knew which lineup member was the suspect and presented the witness with all the lineup members at the same time. Researchers then examined whether witnesses identified the suspect or a known-innocent lineup member at different rates depending on the procedure used. Because the mode of lineup presentation (simultaneous vs. sequential) and the administrator's knowledge of the suspect's identity were confounded, it is impossible to determine whether the increase in suspect identifications found for the single-blind, simultaneous presentations is due to administrator knowledge, the mode of presentation, or some interaction of the two variables. Thus, manipulation of an independent variable protects against confounding only when the manipulation cleanly varies a single construct.

Confounding can also occur in experiments if there is a breakdown in the random assignment of participants to conditions. In applied research, it is not uncommon for partners in the research process to want an intervention delivered to people who deserve or are in need of the intervention, resulting in the funneling of different types of participants into the treatment and control conditions. Random assignment can also fail if the study's sample size is relatively small because in those situations even random assignment may result in people with particular characteristics appearing in treatment conditions rather than in control conditions merely by chance.
Statistical Methods for Dealing With Confounds

When random assignment to experimental conditions is not possible or is attempted but fails, it is likely that people in the different conditions also differ on other dimensions, such as attitudes, personality traits, and past experience. If it is possible to collect data to measure these confounding variables, then statistical techniques can be used to adjust for their effects on the causal relationship between the independent and dependent variables. One method for estimating the effects of the confounding variables is the calculation of propensity scores. A propensity score is the probability of receiving a particular experimental treatment condition given the participants' observed score on a set of confounding variables. Controlling for this propensity score provides an estimate of the true treatment effect adjusted for the confounding variables. The propensity score technique cannot control for the effects of unmeasured confounding variables. Given that it is usually easy to argue for additional confounds in the absence of clean manipulations of the independent variable and random assignment, careful experimental design that rules out alternative explanations for the effects of the independent variable is the best method for eliminating problems associated with confounding.
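The logic of propensity score adjustment can be sketched in a few lines of code. The fragment below is an added illustration, not part of the original entry; it shows one common variant (inverse-probability weighting) under the assumption of a binary treatment indicator and a matrix of measured confounders, and all function and variable names are hypothetical.

# Illustrative sketch (not from the original entry): propensity scores estimated
# by logistic regression, then used as inverse-probability weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_treatment_effect(confounders, treatment, outcome):
    """Treatment effect adjusted for measured confounders via IPW."""
    confounders = np.asarray(confounders, float)
    treatment = np.asarray(treatment, int)
    outcome = np.asarray(outcome, float)

    # Step 1: model the probability of receiving treatment from the confounders.
    model = LogisticRegression(max_iter=1000).fit(confounders, treatment)
    propensity = model.predict_proba(confounders)[:, 1]

    # Step 2: weight treated cases by 1/p and control cases by 1/(1 - p).
    weights = np.where(treatment == 1, 1.0 / propensity, 1.0 / (1.0 - propensity))

    # Step 3: compare weighted outcome means (a simple IPW estimator).
    treated = treatment == 1
    mean_treated = np.average(outcome[treated], weights=weights[treated])
    mean_control = np.average(outcome[~treated], weights=weights[~treated])
    return mean_treated - mean_control

As the entry notes, such an adjustment can only remove the influence of confounders that were actually measured.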
Margaret Bull Kovera

See also Experimental Design; Instrumentation; Propensity Score Analysis; Quasi-Experimental Designs; Selection

Further Readings

McDermott, K. B., & Miller, G. E. (2007). Designing studies to avoid confounds. In R. J. Sternberg, H. L. Roediger, III, & D. F. Halpern (Eds.), Critical thinking in psychology (pp. 131–142). New York: Cambridge University Press.
Schacter, D. L., Dawes, R., Jacoby, L. L., Kahneman, D., Lempert, R., Roediger, H. L., et al. (2008). Studying eyewitness investigations in the field. Law & Human Behavior, 32, 3–5.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Vanderweele, T. (2006). The use of propensity score methods in psychiatric research. International Journal of Methods in Psychiatric Research, 15, 95–103.

CONGRUENCE

The congruence between two configurations of points quantifies their similarity. The configurations to be compared are, in general, produced by factor analytic methods that decompose an "observations by variables" data matrix and produce one set of factor scores for the observations and one set of factor scores (i.e., the loadings) for the variables. The congruence between two sets of factor scores collected on the same units (which can be observations or variables) measures the similarity between these two sets of scores. If, for example, two different types of factor analysis are performed on the same data set, the congruence between the two solutions is evaluated by the similarity of the configurations of the factor scores produced by these two techniques.

This entry presents three coefficients used to evaluate congruence. The first coefficient is called the coefficient of congruence. It measures the similarity of two configurations by computing a cosine between matrices of factor scores. The second and third coefficients are the RV coefficient and the Mantel coefficient. These two coefficients evaluate the similarity of the whole configuration of units. In order to do so, the factor scores of the units are first transformed into a units-by-units square matrix, which reflects the configuration of similarity between the units; and then the similarity between the configurations is measured by a coefficient. For the RV coefficient, the configuration between the units is obtained by computing a matrix of scalar products between the units, and a cosine between two scalar product matrices evaluates the similarity between two configurations. For the Mantel coefficient, the configuration between the units is obtained by computing a matrix of distance between the units, and a coefficient of correlation between two distance matrices evaluates the similarity between two configurations.

The congruence coefficient was first defined by C. Burt under the name unadjusted correlation as
a measure of the similarity of two factorial configurations. The name congruence coefficient was later tailored by Ledyard R. Tucker. The congruence coefficient is also sometimes called a monotonicity coefficient.

The RV coefficient was introduced by Yves Escoufier as a measure of similarity between squared symmetric matrices (specifically: positive semidefinite matrices) and as a theoretical tool to analyze multivariate techniques. The RV coefficient is used in several statistical techniques, such as statis and distatis. In order to compare rectangular matrices with the RV or the Mantel coefficients, the first step is to transform these rectangular matrices into square matrices.

The Mantel coefficient was originally introduced by Nathan Mantel in epidemiology, but it is now widely used in ecology.

The congruence and the Mantel coefficients are cosines (recall that the coefficient of correlation is a centered cosine), and as such, they take values between −1 and +1. The RV coefficient is also a cosine, but because it is a cosine between two matrices of scalar products (which, technically speaking, are positive semidefinite matrices), it corresponds actually to a squared cosine, and therefore the RV coefficient takes values between 0 and 1.

The computational formulas of these three coefficients are almost identical, but their usage and theoretical foundations differ because these coefficients are applied to different types of matrices. Also, their sampling distributions differ because of the types of matrices on which they are applied.

Notations and Computational Formulas

Let X be an I by J matrix and Y be an I by K matrix. The vec operation transforms a matrix into a vector whose entries are the elements of the matrix. The trace operation applies to square matrices and gives the sum of the diagonal elements.

Congruence Coefficient

The congruence coefficient is defined when both matrices have the same number of rows and columns (i.e., J = K). These matrices can store factor scores (for observations) or factor loadings (for variables). The congruence coefficient is denoted φ or sometimes r_c, and it can be computed with three different equivalent formulas (where T denotes the transpose of a matrix):

\varphi = r_c = \frac{\sum_{i,j} x_{i,j}\, y_{i,j}}{\sqrt{\left(\sum_{i,j} x_{i,j}^2\right)\left(\sum_{i,j} y_{i,j}^2\right)}}   (1)

= \frac{vec\{X\}^T vec\{Y\}}{\sqrt{\left(vec\{X\}^T vec\{X\}\right)\left(vec\{Y\}^T vec\{Y\}\right)}}   (2)

= \frac{trace\{XY^T\}}{\sqrt{trace\{XX^T\}\; trace\{YY^T\}}}.   (3)
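Equation 1 (equivalently Equation 3) is simple enough to compute directly. The short NumPy sketch below is an added illustration, not part of the original entry.

# Illustrative sketch (not from the original entry): the congruence coefficient
# of Equations 1 and 3, for two I-by-J matrices of factor scores or loadings.
import numpy as np

def congruence_coefficient(X, Y):
    """Cosine between two matrices viewed as long vectors (Equation 1)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    num = np.sum(X * Y)                              # sum_ij x_ij * y_ij
    den = np.sqrt(np.sum(X ** 2) * np.sum(Y ** 2))
    return num / den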
RV Coefficient

The RV coefficient was defined by Escoufier as a similarity coefficient between positive semidefinite matrices. Escoufier and Pierre Robert pointed out that the RV coefficient had important mathematical properties because it can be shown that most multivariate analysis techniques amount to maximizing this coefficient with suitable constraints. Recall, at this point, that a matrix S is called positive semidefinite when it can be obtained as the product of a matrix by its transpose. Formally, we say that S is positive semidefinite when there exists a matrix X such that

S = XX^T.   (4)

Note that as a consequence of the definition, positive semidefinite matrices are square and symmetric, and that their diagonal elements are always larger than or equal to zero.

If S and T denote two positive semidefinite matrices of same dimensions, the RV coefficient between them is defined as

R_V = \frac{trace\{S^T T\}}{\sqrt{trace\{S^T S\} \times trace\{T^T T\}}}.   (5)
This formula is computationally equivalent to

R_V = \frac{vec\{S\}^T vec\{T\}}{\sqrt{\left(vec\{S\}^T vec\{S\}\right)\left(vec\{T\}^T vec\{T\}\right)}}   (6)

= \frac{\sum_i^I \sum_j^I s_{i,j}\, t_{i,j}}{\sqrt{\left(\sum_i^I \sum_j^I s_{i,j}^2\right)\left(\sum_i^I \sum_j^I t_{i,j}^2\right)}}.   (7)

For rectangular matrices, the first step is to transform the matrices into positive semidefinite matrices by multiplying each matrix by its transpose. So, in order to compute the value of the RV coefficient between the I by J matrix X and the I by K matrix Y, the first step is to compute

S = XX^T and T = YY^T.   (8)

If we combine Equations 5 and 8, we find that the RV coefficient between these two rectangular matrices is equal to

R_V = \frac{trace\{XX^T YY^T\}}{\sqrt{trace\{XX^T XX^T\} \times trace\{YY^T YY^T\}}}.   (9)

The comparison of Equations 3 and 9 shows that the congruence and the RV coefficients are equivalent only in the case of positive semidefinite matrices.

From a linear algebra point of view, the numerator of the RV coefficient corresponds to a scalar product between positive semidefinite matrices and therefore gives to this set of matrices the structure of a vector space. Within this framework, the denominator of the RV coefficient is called the Frobenius or Schur or Hilbert-Schmidt matrix scalar product, and the RV coefficient is a cosine between matrices. This vector space structure is responsible for the mathematical properties of the RV coefficient.

Mantel Coefficient

For the Mantel coefficient, if the data are not already in the form of distances, then the first step is to transform these data into distances. These distances can be Euclidean distances, but any other type of distance will work. If D and B denote the two I by I distance matrices of interest, then the Mantel coefficient between these two matrices is denoted r_M, and it is computed as the coefficient of correlation between their off-diagonal elements as

r_M = \frac{\sum_{i=1}^{I-1} \sum_{j=i+1}^{I} (d_{i,j} - \bar{d})(b_{i,j} - \bar{b})}{\sqrt{\left[\sum_{i=1}^{I-1} \sum_{j=i+1}^{I} (d_{i,j} - \bar{d})^2\right]\left[\sum_{i=1}^{I-1} \sum_{j=i+1}^{I} (b_{i,j} - \bar{b})^2\right]}}   (10)

(where \bar{d} and \bar{b} are the mean of the off-diagonal elements of, respectively, matrices D and B).

Tests and Sampling Distributions

The congruence, the RV, and the Mantel coefficients quantify the similarity between two matrices. An obvious practical problem is to be able to perform statistical testing on the value of a given coefficient. In particular it is often important to be able to decide whether a value of a coefficient could have been obtained by chance alone. To perform such statistical tests, one needs to derive the sampling distribution of these coefficients under the null hypothesis (i.e., in order to test whether the population coefficient is null). More sophisticated testing requires one to derive the sampling distribution for different values of the population parameters. So far, analytical methods have failed to completely characterize such distributions, but computational approaches have been used with some success. Because the congruence, the RV, and the Mantel coefficients are used with different types of matrices, their sampling distributions differ, and so work done with each type of coefficient has been carried out independently of the others.
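Before turning to the individual sampling distributions, it may help to see that the two matrix coefficients themselves require only a few lines of code. The sketch below is an added illustration, not part of the original entry: it computes the RV coefficient of Equations 8 and 9 from two rectangular matrices and the Mantel coefficient of Equation 10 from two distance matrices.

# Illustrative sketch (not from the original entry): the RV and Mantel coefficients.
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between the cross-product matrices S = XX' and T = YY' (Eq. 9)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    S, T = X @ X.T, Y @ Y.T
    num = np.trace(S @ T)                            # trace{XX'YY'}
    den = np.sqrt(np.trace(S @ S) * np.trace(T @ T))
    return num / den

def mantel_coefficient(D, B):
    """Pearson correlation between the off-diagonal parts of D and B (Eq. 10)."""
    D, B = np.asarray(D, float), np.asarray(B, float)
    iu = np.triu_indices_from(D, k=1)                # indices above the diagonal
    return np.corrcoef(D[iu], B[iu])[0, 1]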
Some approximations for the sampling distributions have been derived recently for the congruence coefficient and the RV coefficient, with particular attention given to the RV coefficient. The sampling distribution for the Mantel coefficient has not been satisfactorily approximated, and the statistical tests provided for this coefficient rely mostly on permutation tests.

Congruence Coefficient

Recognizing that analytical methods were unsuccessful, Bruce Korth and Tucker decided to use Monte Carlo simulations to gain some insights into the sampling distribution of the congruence coefficient. Their work was completed by Wendy J. Broadbooks and Patricia B. Elmore. From this work, it seems that the sampling distribution of the congruence coefficient depends on several parameters, including the original factorial structure and the intensity of the population coefficient, and therefore no simple picture emerges, but some approximations can be used. In particular, for testing that a congruence coefficient is null in the population, an approximate conservative test is to use Fisher's Z transform and to treat the congruence coefficient like a coefficient of correlation. Broadbooks and Elmore have provided tables for population values different from zero. With the availability of fast computers, these tables can easily be extended to accommodate specific cases.

Example

Here we use an example from Hervé Abdi and Dominique Valentin (2007). Two wine experts are rating six wines on three different scales. The results of their ratings are provided in the two matrices below, denoted X and Y:

X = \begin{bmatrix} 1 & 6 & 7 \\ 5 & 3 & 2 \\ 6 & 1 & 1 \\ 7 & 1 & 2 \\ 2 & 5 & 4 \\ 3 & 4 & 4 \end{bmatrix} \quad and \quad Y = \begin{bmatrix} 3 & 6 & 7 \\ 4 & 4 & 3 \\ 7 & 1 & 1 \\ 2 & 2 & 2 \\ 2 & 6 & 6 \\ 1 & 7 & 5 \end{bmatrix}.   (11)

For computing the congruence coefficient, these two matrices are transformed into two vectors of 6 × 3 = 18 elements each, and a cosine (cf. Equation 1) is computed between these two vectors. This gives a value of the coefficient of congruence of φ = .7381. In order to evaluate whether this value is significantly different from zero, a permutation test with 10,000 permutations was performed. In this test, the rows of one of the matrices were randomly permuted, and the coefficient of congruence was computed for each of these 10,000 permutations. The probability of obtaining a value of φ = .7381 under the null hypothesis was evaluated as the proportion of the congruence coefficients larger than φ = .7381. This gives a value of p = .0259, which is small enough to reject the null hypothesis at the .05 alpha level, and thus one can conclude that the agreement between the ratings of these two experts cannot be attributed to chance.
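The permutation test just described can be reproduced with a short loop. The sketch below is an added illustration, not part of the original entry; it permutes the rows of one matrix, as in the example, and estimates the p value as the proportion of permuted coefficients at least as large as the observed one.

# Illustrative sketch (not from the original entry): a row-permutation test for
# the congruence coefficient, mirroring the procedure used in the example above.
import numpy as np

def permutation_test_congruence(X, Y, n_perm=10_000, seed=0):
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    rng = np.random.default_rng(seed)
    observed = congruence_coefficient(X, Y)          # defined in the earlier sketch
    count = 0
    for _ in range(n_perm):
        Y_perm = Y[rng.permutation(Y.shape[0]), :]   # shuffle the rows of Y
        if congruence_coefficient(X, Y_perm) >= observed:
            count += 1
    return observed, count / n_perm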
RV Coefficient

Statistical approaches for the RV coefficient have focused on permutation tests. In this framework, the permutations are performed on the entries of each column of the rectangular matrices X and Y used to create the matrices S and T, or directly on the rows and columns of S and T. It is interesting to note that work by Frédérique Kazi-Aoual and colleagues has shown that the mean and the variance of the permutation test distribution can be approximated directly from S and T.

The first step is to derive an index of the dimensionality or rank of the matrices. This index, denoted β_S (for matrix S = XX^T), is also known as ν in the brain imaging literature, where it is called a sphericity index and is used as an estimation of the number of degrees of freedom for multivariate tests of the general linear model. This index depends on the eigenvalues of the S matrix, denoted λ_ℓ, and is defined as

\beta_S = \frac{\left(\sum_\ell^L \lambda_\ell\right)^2}{\sum_\ell^L \lambda_\ell^2} = \frac{trace\{S\}^2}{trace\{SS\}}.   (12)
The mean of the set of permutated coefficients between matrices S and T is then equal to

E(R_V) = \frac{\sqrt{\beta_S \beta_T}}{I - 1}.   (13)

The case of the variance is more complex and involves computing three preliminary quantities for each matrix. The first quantity, denoted δ_S (for matrix S), is equal to

\delta_S = \frac{\sum_i^I s_{i,i}^2}{\sum_\ell^L \lambda_\ell^2}.   (14)

The second one is denoted α_S for matrix S and is defined as

\alpha_S = I - 1 - \beta_S.   (15)

The third one is denoted C_S (for matrix S) and is defined as

C_S = \frac{(I-1)\left[I(I+1)\delta_S - (I-1)(\beta_S + 2)\right]}{\alpha_S (I-3)}.   (16)

With these notations, the variance of the permuted coefficients is obtained as

V(R_V) = \alpha_S \alpha_T \times \frac{2I(I-1) + (I-3)\, C_S C_T}{I(I+1)(I-2)(I-1)^3}.   (17)

For very large matrices, the sampling distribution of the permutated coefficients is relatively similar to a normal distribution (even though it is, in general, not normal), and therefore one can use a Z criterion to perform null hypothesis testing or to compute confidence intervals. For example, the criterion

Z_{R_V} = \frac{R_V - E(R_V)}{\sqrt{V(R_V)}}   (18)

can be used to test the null hypothesis that the observed value of R_V was due to chance.

The problem of the lack of normality of the permutation-based sampling distribution of the RV coefficient has been addressed by Moonseong Heo and K. Ruben Gabriel, who have suggested "normalizing" the sampling distribution by using a log transformation. Recently Julie Josse, Jerome Pagès, and François Husson have refined this approach and indicated that a gamma distribution would give an even better approximation.

Example

As an example, we use the two scalar product matrices obtained from the matrices used to illustrate the congruence coefficient (cf. Equation 11). For the present example, these original matrices are centered (i.e., the mean of each column has been subtracted from each element of the column) prior to computing the scalar product matrices. Specifically, if \bar{X} and \bar{Y} denote the centered matrices derived from X and Y, we obtain the following scalar product matrices:

S = \bar{X}\bar{X}^T = \begin{bmatrix} 29.56 & -8.78 & -20.78 & -20.11 & 12.89 & 7.22 \\ -8.78 & 2.89 & 5.89 & 5.56 & -3.44 & -2.11 \\ -20.78 & 5.89 & 14.89 & 14.56 & -9.44 & -5.11 \\ -20.11 & 5.56 & 14.56 & 16.22 & -10.78 & -5.44 \\ 12.89 & -3.44 & -9.44 & -10.78 & 7.22 & 3.56 \\ 7.22 & -2.11 & -5.11 & -5.44 & 3.56 & 1.89 \end{bmatrix}   (19)

and

T = \bar{Y}\bar{Y}^T = \begin{bmatrix} 11.81 & -3.69 & -15.19 & -9.69 & 8.97 & 7.81 \\ -3.69 & 1.81 & 7.31 & 1.81 & -3.53 & -3.69 \\ -15.19 & 7.31 & 34.81 & 9.31 & -16.03 & -20.19 \\ -9.69 & 1.81 & 9.31 & 10.81 & -6.53 & -5.69 \\ 8.97 & -3.53 & -16.03 & -6.53 & 8.14 & 8.97 \\ 7.81 & -3.69 & -20.19 & -5.69 & 8.97 & 12.81 \end{bmatrix}.   (20)

We find the following value for the RV coefficient:
R_V = \frac{\sum_i^I \sum_j^I s_{i,j}\, t_{i,j}}{\sqrt{\left(\sum_i^I \sum_j^I s_{i,j}^2\right)\left(\sum_i^I \sum_j^I t_{i,j}^2\right)}}
= \frac{(29.56 \times 11.81) + (-8.78 \times -3.69) + \cdots + (1.89 \times 12.81)}{\sqrt{\left[29.56^2 + (-8.78)^2 + \cdots + 1.89^2\right]\left[11.81^2 + (-3.69)^2 + \cdots + 12.81^2\right]}} = .7936.   (21)

To test the significance of a value of R_V = .7936, we first compute the following quantities:

β_S = 1.0954, α_S = 3.9046, δ_S = 0.2951, C_S = −1.3162; β_T = 1.3851, α_T = 3.6149, δ_T = 0.3666, C_T = −0.7045.   (22)

Plugging these values into Equations 13, 17, and 18, we find

E(R_V) = 0.2464, V(R_V) = 0.0422, and Z_{R_V} = 2.66.   (23)

Assuming a normal distribution for the Z_{R_V} gives a p value of .0077, which would allow for the rejection of the null hypothesis for the observed value of the RV coefficient.
observed value of the RV coefficient. test) and suggested that a normal approximation
could be used, but the problem is still open. In
Permutation Test practice, though, the probability associated to
As an alternative approach to evaluate whether a specific value of rM is derived from permutation
the value of RV ¼ :7936 is significantly different tests.
from zero, a permutation test with 10,000 permu-
Example
tations was performed. In this test, the whole set
of rows and columns (i.e., the same permutation As an example, two distance matrices derived
of I elements is used to permute rows and col- from the congruence coefficient example (cf. Equa-
umns) of one of the scalar product matrices was tion 11) are used. These distance matrices can be
randomly permuted, and the RV coefficient was computed directly from the scalar product matrices
computed for each of these 10,000 permutations. used to illustrate the computation of the RV coeffi-
The probability of obtaining a value of cient (cf. Equations 19 and 20). Specifically, if S is
RV ¼ :7936 under the null hypothesis was evalu- a scalar product matrix and if s denotes the vector
ated as the proportion of the RV coefficients larger containing the diagonal elements of S, and if 1
than RV ¼ :7936: This gave a value of p ¼ .0281, denotes an I by 1 vector of ones, then the matrix D
of the squared Euclidean distances between the elements of S is obtained as (cf. Equation 4):

D = 1s^T + s1^T - 2S.   (24)

Using Equation 24, we transform the scalar-product matrices from Equations 19 and 20 into the following distance matrices:

D = \begin{bmatrix} 0 & 50 & 86 & 86 & 11 & 17 \\ 50 & 0 & 6 & 8 & 17 & 9 \\ 86 & 6 & 0 & 2 & 41 & 27 \\ 86 & 8 & 2 & 0 & 45 & 29 \\ 11 & 17 & 41 & 45 & 0 & 2 \\ 17 & 9 & 27 & 29 & 2 & 0 \end{bmatrix}   (25)

and

B = \begin{bmatrix} 0 & 21 & 77 & 42 & 2 & 9 \\ 21 & 0 & 22 & 9 & 17 & 22 \\ 77 & 22 & 0 & 27 & 75 & 88 \\ 42 & 9 & 27 & 0 & 32 & 35 \\ 2 & 17 & 75 & 32 & 0 & 3 \\ 9 & 22 & 88 & 35 & 3 & 0 \end{bmatrix}.   (26)

For computing the Mantel coefficient, the upper diagonal elements of each of these two matrices are stored into a vector of \frac{1}{2} I \times (I - 1) = 15 elements, and the standard coefficient of correlation is computed between these two vectors. This gives a value of the Mantel coefficient of r_M = .5769. In order to evaluate whether this value is significantly different from zero, a permutation test with 10,000 permutations was performed. In this test, the whole set of rows and columns (i.e., the same permutation of I elements is used to permute rows and columns) of one of the matrices was randomly permuted, and the Mantel coefficient was computed for each of these 10,000 permutations. The probability of obtaining a value of r_M = .5769 under the null hypothesis was evaluated as the proportion of the Mantel coefficients larger than r_M = .5769. This gave a value of p = .0265, which is small enough to reject the null hypothesis at the .05 alpha level.
Conclusion

The congruence, RV, and Mantel coefficients all measure slightly different aspects of the notion of congruence. The congruence coefficient is sensitive to the pattern of similarity of the columns of the matrices and therefore will not detect similar configurations when one of the configurations is rotated or dilated. By contrast, both the RV coefficient and the Mantel coefficients are sensitive to the whole configuration and are insensitive to changes in configuration that involve rotation or dilatation. The RV coefficient has the additional merit of being theoretically linked to most multivariate methods and of being the base of Procrustes methods such as statis or distatis.

Hervé Abdi

See also Coefficients of Correlation, Alienation, and Determination; Principal Components Analysis; R2; Sampling Distributions

Further Readings

Abdi, H. (2003). Multivariate analysis. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia for research methods for the social sciences. Thousand Oaks, CA: Sage.
Abdi, H. (2007). RV coefficient and congruence coefficient. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics. Thousand Oaks, CA: Sage.
Bedeian, A. G., Armenakis, A. A., & Randolph, W. A. (1988). The significance of congruence coefficients: A comment and statistical test. Journal of Management, 14, 559–566.
Borg, I., & Groenen, P. (1997). Modern multidimensional scaling. New York: Springer Verlag.
Broadbooks, W. J., & Elmore, P. B. (1987). A Monte Carlo study of the sampling distribution of the congruence coefficient. Educational & Psychological Measurement, 47, 1–11.
Burt, C. (1948). Factor analysis and canonical correlations. British Journal of Psychology, Statistical Section, 1, 95–106.
Escoufier, Y. (1973). Le traitement des variables vectorielles [Treatment of variable vectors]. Biometrics, 29, 751–760.
Harman, H. H. (1976). Modern factor analysis (3rd ed. rev.). Chicago: University of Chicago Press.
Heo, M., & Gabriel, K. R. (1998). A permutation test of association between configurations by means of the RV coefficient. Communications in Statistics Simulation & Computation, 27, 843–856.
Holmes, S. (2007). Multivariate analysis: The French way. In Probability and statistics: Essays in honor of David A. Freedman (Vol. 2, pp. 219–233). Beachwood, OH: Institute of Mathematical Statistics.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Korth, B. A., & Tucker, L. R. (1975). The distribution of chance congruence coefficients from simulated data. Psychometrika, 40, 361–372.
Manly, B. J. F. (1997). Randomization, bootstrap and Monte Carlo methods in biology (2nd ed.). London: Chapman and Hall.
Schlich, P. (1996). Defining and validating assessor compromises about product distances and attribute correlations. In T. Näs & E. Risvik (Eds.), Multivariate analysis of data in sensory sciences (pp. 259–306). New York: Elsevier.
Smouse, P. E., Long, J. C., & Sokal, R. R. (1986). Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Systematic Zoology, 35, 627–632.
Worsley, K. J., & Friston, K. J. (1995). Analysis of fMRI time-series revisited–again. NeuroImage, 2, 173–181.

CONSTRUCT VALIDITY

Construct validity refers to whether the scores of a test or instrument measure the distinct dimension (construct) they are intended to measure. The present entry discusses origins and definitions of construct validation, methods of construct validation, the role of construct validity evidence in the validity argument, and unresolved issues in construct validity.

Origins and Definitions

Construct validation generally refers to the collection and application of validity evidence intended to support the interpretation and use of test scores as measures of a particular construct. The term construct denotes a distinct dimension of individual variation, but use of this term typically carries the connotation that the construct does not allow for direct observation but rather depends on indirect means of measurement. As such, the term construct differs from the term variable with respect to this connotation. Moreover, the term construct is sometimes distinguished from the term latent variable because construct connotes a substantive interpretation typically embedded in a body of substantive theory. In contrast, the term latent variable refers to a dimension of variability included in a statistical model with or without a clear substantive or theoretical understanding of that dimension and thus can be used in a purely statistical sense. For example, the latent traits in item response theory analysis are often introduced as latent variables but not associated with a particular construct until validity evidence supports such an association.

The object of validation has evolved with validity theory. Initially, validation was construed in terms of the validity of a test. Lee Cronbach and others pointed out that validity depends on how a test is scored. For example, detailed content coding of essays might yield highly valid scores whereas general subjective judgments might not. As a result, validity theory shifted its focus from validating tests to validating test scores. In addition, it became clear that the same test scores could be used in more than one way and that the level of validity could vary across uses. For example, the same test scores might offer a highly valid measure of intelligence but only a moderately valid indicator of attention deficit/hyperactivity disorder. As a result, the emphasis of validity theory again shifted from test scores to test score interpretations. Yet a valid interpretation often falls short of justifying a particular use. For example, an employment test might validly measure propensity for job success, but another available test might do as good a job at the same cost but with less adverse impact. In such an instance, the validity of the test score interpretation for the first test would not justify its use for employment testing. Thus, Samuel Messick has urged that test scores are rarely interpreted in a vacuum as a purely academic exercise but are rather collected for some purpose and put to some use. However, in common parlance, one frequently expands the notion of test to refer to the entire procedure of collecting test data (testing), assigning numeric values based on the test data (scoring), making inferences about the level of a construct on the basis of those scores (interpreting), and applying those inferences to practical decisions (use). Thus the term test validity lives on as shorthand for the validity of test score interpretations and uses.

Early on, tests were thought to divide into two types: signs and samples. If a test was interpreted as a sign of something else, the something else was understood as a construct, and construct validation was deemed appropriate. For example, responses to items on a personality inventory might be viewed as
signs of personality characteristics, in which case the personality characteristic constitutes the construct of interest. In contrast, some tests were viewed as only samples and construct validation was not deemed necessary. For example, a typing test might sample someone's typing and assess its speed and accuracy. The scores on this one test (produced from a sampling of items that could appear on a test) were assumed to generalize merely on the basis of statistical generalization from a sample to a population. Jane Loevinger and others questioned this distinction by pointing out that the test sample could never be a random sample of all possible exemplars of the behavior in question. For example, a person with high test anxiety might type differently on a typing test from the way the person types at work, and someone else might type more consistently on a brief test than over a full workday. As a result, interpreting the sampled behavior in terms of the full range of generalization always extends beyond mere statistical sampling to broader validity issues. For this reason, all tests are signs as well as samples, and construct validation applies to all tests.

At one time, test validity was neatly divided into three types: content, criterion, and construct, with the idea that one of these three types of validity applied to any one type of test. However, criterion-related validity depends on the construct interpretation of the criterion, and test fairness often turns on construct-irrelevant variance in the predictor scores. Likewise, content validation may offer valuable evidence in support of the interpretation of correct answers but typically will not provide as strong a line of evidence for the interpretation of incorrect answers. For example, someone might know the mathematical concepts but answer a math word problem incorrectly because of insufficient vocabulary or culturally inappropriate examples. Because all tests involve interpretation of the test scores in terms of what they are intended to measure, construct validation applies to all tests. In contemporary thinking, there is a suggestion that all validity should be of one type, construct validity.

This line of development has led to unified (but not unitary) conceptions of validity that elevate construct validity from one kind of validity among others to the whole of validity. Criterion-related evidence provides evidence of construct validity by showing that test scores relate to other variables (i.e., criterion variables) in the predicted ways. Content validity evidence provides evidence of construct validity because it shows that the test properly covers the intended domain of content related to the construct definition. As such, construct validity has grown from humble origins as one relatively esoteric form of validity to the whole of validity, and it has come to encompass other forms of validity evidence.

Messick distinguished two threats to construct validity. Construct deficiency applies when a test fails to measure some aspects of the construct that it should measure. For example, a mathematics test that failed to cover some portion of the curriculum for which it was intended would demonstrate this aspect of poor construct validity. In contrast, construct-irrelevant variance involves things that the test measures that are not related to the construct of interest and thus should not affect the test scores. The example of a math test that is sensitive to vocabulary level illustrates this aspect. A test with optimal construct validity therefore measures everything that it should measure but nothing that it should not.

Traditionally, validation has been directed toward a specific test, its scores, and their intended interpretation and use. However, construct validation increasingly conceptualizes validation as continuous with extended research programs into the construct measured by the test or tests in question. This shift reflects a broader shift in the behavioral sciences away from operationalism, in which a variable is theoretically defined in terms of a single operational definition, in favor of multioperationalism, in which a variety of different measures triangulate on the same construct. As a field learns to measure a construct in various ways and learns more about how the construct relates to other variables through evidence collected using these measures, the overall understanding of the construct increases. The stronger this overall knowledge base about the construct, the more confidence one can have in interpreting the scores derived from a particular test as measuring this construct. Moreover, the more one knows about the construct, the more specific and varied are the consequences entailed by interpreting test scores as measures of that construct. As a result, one can conceptualize construct validity as broader than test validity because it involves the collection of evidence to validate theories about the underlying construct as measured by
a variety of tests, rather than merely the interpretation of scores from one particular test.

Construct Validation Methodology

At its inception, when construct validity was considered one kind of validity appropriate to certain kinds of tests, inspection of patterns of correlations offered the primary evidence of construct validity. Lee Cronbach and Paul Meehl described a nomological net as a pattern of relationships between variables that partly fixed the meaning of a construct. Later, factor analysis established itself as a primary methodology for providing evidence of construct validity. Loevinger described a structural aspect of construct validity as the pattern of relationships between items that compose a test. Factor analysis allows the researcher to investigate the internal structure of item responses, and some combination of replication and confirmatory factor analysis allows the researcher to test theoretical hypotheses about that structure. Such hypotheses typically involve multiple dimensions of variation tapped by items on different subscales and therefore measuring different constructs. A higher order factor may reflect a more general construct that comprises these subscale constructs.

Item response theory typically models dichotomous or polytomous item responses in relation to an underlying latent trait. Although item response theory favors the term trait, the models apply to all kinds of constructs. Historically, the emphasis with item response theory has been much more heavily on unidimensional measures and providing evidence that items in a set all measure the same dimension of variation. However, recent developments in factor analysis for dichotomous and polytomous items, coupled with expanded interest in multidimensional item response theory, have brought factor analysis and item response theory together under one umbrella. Item response theory models are generally equivalent to a factor analysis model with a threshold at which item responses change from one discrete response to another based on an underlying continuous dimension.

Both factor analysis and item response theory depend on a shared assumption of local independence, which means that if one held constant the underlying latent variable, the items would no longer have any statistical association between them. Latent class analysis offers a similar measurement model based on the same basic assumption but applicable to situations in which the latent variable is itself categorical. All three methods typically offer tests of goodness of fit based on the assumption of local independence and the ability of the modeled latent variables to account for the relationships among the item responses.

An important aspect of the above types of evidence involves the separate analysis of various scales or subscales. Analyzing each scale separately does not provide evidence as strong as does analyzing them together. This is because separate analyses work only with local independence of items on the same scale. Analyzing multiple scales combines this evidence with evidence based on relationships between items on different scales. So, for example, three subscales might each fit a one-factor model very well, but a three-factor model might fail miserably when applied to all three sets of items together. Under a hypothetico-deductive framework, testing the stronger hypothesis of multiconstruct local independence offers more support to interpretations of sets of items that pass it than does testing a weaker piecemeal set of hypotheses.

The issue just noted provides some interest in returning to the earlier notion of a nomological net as a pattern of relationships among variables in which the construct of interest is embedded. The idea of a nomological net arose during a period when causation was suspect and laws (i.e., nomic relationships) were conceptualized in terms of patterns of association. In recent years, causation has made a comeback in the behavioral sciences, and methods of modeling networks of causal relations have become more popular. Path analysis can be used to test hypotheses about how a variable fits into such a network of observed variables, and thus path analysis provides construct validity evidence for test scores that fit into such a network as predicted by the construct theory. Structural equation models allow the researcher to combine both ideas by including both measurement models relating items to latent variables (as in factor analysis) and structural models that embed the latent variables in a causal network (as in path analysis). These models allow researchers to test complex hypotheses and thus provide even stronger forms of construct validity evidence. When applied to passively observed data, however, such causal models
contain no magic formula for spinning causation out of correlation. Different models will fit the same data, and the same model will fit data generated by different causal mechanisms. Nonetheless, such models allow researchers to construct highly falsifiable hypotheses from theories about the construct that they seek to measure.

Complementary to the above, experimental and quasi-experimental evidence also plays an important role in assessing construct validity. If a test measures a given construct, then efforts to manipulate the value of the construct should result in changes in test scores. For example, consider a standard program evaluation study that demonstrates a causal effect of a particular training program on performance of the targeted skill set. If the measure of performance is well validated and the quality of the training is under question, then this study primarily provides evidence in support of the training program. In contrast, however, if the training program is well validated but the performance measure is under question, then the same study primarily provides evidence in support of the construct validity of the measure. Such evidence can generally be strengthened by showing that the intervention affects the variables that it should but also does not affect the variables that it should not. Showing that a test is responsive to manipulation of a variable that should not affect it offers one way of demonstrating construct-irrelevant variance. For example, admissions tests sometimes provide information about test-taking skills in an effort to minimize the responsiveness of scores to further training in test taking.

Susan Embretson distinguished construct representation from nomothetic span. The latter refers to the external patterns of relationships with other variables and essentially means the same thing as nomological net. The former refers to the cognitive processes involved in answering test items. To the extent that answering test items involves the intended cognitive processes, the construct is properly represented, and the measurements have higher construct validity. As a result, explicitly modeling the cognitive operation involved in answering specific item types has blossomed as a means of evaluating construct validity, at least in areas in which the underlying cognitive mechanisms are well understood. As an example, if one has a strong construct theory regarding the cognitive processing involved, one can manipulate various cognitive subtasks required to answer items and predict the difficulty of the resulting items from these manipulations.

Role in Validity Arguments

Modern validity theory generally structures the evaluation of validity on the basis of various strands of evidence in terms of the construction of a validity argument. The basic idea is to combine all available evidence into a single argument supporting the intended interpretation and use of the test scores. Recently, Michael Kane has distinguished an interpretive argument from the validity argument. The interpretive argument spells out the assumptions and rationale for the intended interpretation of the scores, and the validity argument supports the validity of the interpretive argument, particularly by providing evidence in support of key assumptions. For example, an interpretive argument might indicate that an educational performance mastery test assumes prior exposure and practice with the material. The validity argument might then provide evidence that given these assumptions, test scores correspond to the degree of mastery.

The key to developing an appropriate validity argument rests with identifying the most important and controversial premises that require evidential support. Rival hypotheses often guide this process. The two main threats to construct validity described above yield two main types of rival hypotheses addressed by construct validity evidence. For example, sensitivity to transient emotional states might offer a rival hypothesis to the validity of a personality scale related to construct-irrelevant variance. Differential item functioning, in which test items relate to the construct differently for different groups of test takers, also relates to construct-irrelevant variance, yielding rival hypotheses about test scores related to group characteristics. A rival hypothesis that a clinical depression inventory captures only one aspect of depressive symptoms involves a rival hypothesis about construct deficiency.

Unresolved Issues

A central controversy in contemporary validity theory involves the disagreements over the breadth
of validity evidence. Construct validation provides an integrative framework that ties together all forms of validity evidence in a way continuous with empirical research into the construct, but some have suggested a less expansive view of validity as more practical. Construct validity evidence based on test consequences remains a continuing point of controversy, particularly with respect to the notion of consequential validity as a distinct form of validity. Finally, there remains a fundamental tension in modern validity theory between the traditional fact–value dichotomy and the fundamental role of values and evaluation in assessing the evidence in favor of specific tests, scores, interpretations, and uses.

Keith A. Markus and Chia-ying Lin

See also Content Validity; Criterion Validity; Structural Equation Modeling

Further Readings

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education and Praeger.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
Wainer, H., & Braun, H. I. (1988). Test validity. Mahwah, NJ: Lawrence Erlbaum.

CONTENT ANALYSIS

Content analysis is a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the context of their use.

This entry further explores the definition and conceptions of content analysis. It then provides information on its conceptual framework and the steps involved in performing content analysis.

Definition and Conception

The phrase content analysis, first mentioned in a 1941 paper by Douglas Waples and Bernard Berelson, became defined in 1948 by Paul F. Lazarsfeld and Berelson. Webster's Dictionary has listed content analysis since its 1961 edition. However, the practice of analyzing media matter is almost as old as writing. It became of interest to the church, worried about the effects of the written word other than God's; to governments, trying to settle political, legal, and religious disputes; to journalists, hoping to document the changes in newspaper publishing due to its commercialization and popularization; to corporations interested in surveying their symbolic environments for opportunities and threats; and to social scientists, originally drawn into the competition between the press and newly emerging media, then radio and television, but soon discovering the importance of all kinds of mediated communication to understand social, political, economic, and psychological phenomena. Communication research advanced content analysis, but owing to the proliferation of media and the recognition that humans define themselves and each other in communication, coordinate their beliefs and actions in communication, and construct the realities they live with in communication, content analysis is now used by literally all social sciences.

As a technique, content analysis embraces specialized procedures. It is teachable, and its use can be divorced from the authority of the researcher. As a research technique, content analysis can provide new kinds of understanding of social phenomena or inform decisions on pertinent actions. Content analysis is a scientific tool.

All techniques are expected to be reliable. Scientific research techniques should result in replicable findings. Replicability requires research procedures to be explicit and communicable so that researchers, working at different times and perhaps under different circumstances, can apply them and come to the same conclusions about the same phenomena.

Scientific research must also yield valid results. To establish validity, research results must survive in the face of independently available evidence of what they claim. The methodological requirements of reliability and validity are not unique to content
analysis but make particular demands on the technique that are not found as problematic in other methods of inquiry.

The reference to text is not intended to restrict content analysis to written material. The parenthetical phrase "or other meaningful matter" is to imply content analysis's applicability to anything humanly significant: images, works of art, maps, signs, symbols, postage stamps, songs, and music, whether mass produced, created in conversations, or private. Texts, whether composed by individual authors or produced by social institutions, are always intended to point their users to something beyond their physicality. However, content analysis does not presume that readers read a text as intended by its source; in fact, authors may be quite irrelevant, often unknown. In content analysis, available texts are analyzed to answer research questions not necessarily shared by everyone.

What distinguishes content analysis from most observational methods in the social sciences is that the answers to its research questions are inferred from available text. Content analysts are not interested in the physicality of texts that can be observed, measured, and objectively described. The alphabetical characters of written matter, the pixels of digital images, and the sounds one can manipulate at a control panel are mere vehicles of communication. What text means to somebody, what it represents, highlights and excludes, encourages or deters—all these phenomena do not reside inside a text but come to light in processes of someone's reading, interpreting, analyzing, concluding, and in the case of content analysis, answering pertinent research questions concerning the text's context of use.

Typical research questions that content analysts might answer are, What are the consequences for heavy and light viewers of exposure to violent television shows? What are the attitudes of a writer on issues not mentioned? Who is the author of an anonymously written work? Is a suicide note real, requiring intervention, or an empty threat? Which of two textbooks is more readable by sixth graders? What is the likely diagnosis for a psychiatric patient, known through an interview or the responses to a Rorschach test? What is the ethnic, gender, or ideological bias of a newspaper? Which economic theory underlies the reporting of business news in the national press? What is the likelihood of cross-border hostilities as a function of how one country's national press portrays its neighbor? What are a city's problems as inferred from citizens' letters to its mayor? What do school children learn about their nation's history through textbooks? What criteria do Internet users employ to authenticate electronic documents?

Other Conceptions

Unlike content analysis, observation and measurement go directly to the phenomenon of analytic interest. Temperature and population statistics describe tangible phenomena. Experiments with human participants tend to define the range of responses in directly analyzable form, just as structured interviews delineate the interviewees' multiple choices among answers to prepared interview questions. Structured interviews and experiments with participants acknowledge subjects' responses to meanings but bypass them by standardization. Content analysts struggle with unstructured meanings.

Social scientific literature does contain conceptions of content analysis that mimic observational methods, such as those of George A. Miller, who characterizes content analysis as a method for putting large numbers of units of verbal matter into analyzable categories. A definition of this kind provides no place for methodological standards. Berelson's widely cited definition fares not much better. For him, "content analysis is a research technique for the objective, systematic and quantitative description of the manifest content of communication" (p. 18). The restriction to manifest content would rule out content analyses of psychotherapeutic matter or of diplomatic exchanges, both of which tend to rely on subtle clues to needed inferences. The requirement of quantification, associated with objectivity, has been challenged, especially because the reading of text is qualitative to start and interpretive research favors qualitative procedures without being unscientific. Taking the questionable attributes out of Berelson's definition reduces content analysis to the systematic analysis of content, which relies on a metaphor of content that locates the object of analysis inside the text—a conception that some researchers believe is not only misleading but also prevents the formulation of sound methodology.
Content Analysis 235

phenomena were observable directly, content anal-


Content analysis
ysis of texts would be redundant. Content analysts
Interpretation Inference Description
pursue questions that could conceivably be
answered by examining texts but that could also
Analytical construct
be validated by other means, at least in principle.
Research question The latter rules out questions that have to do
Context
Conceived by
Stable correlations
Texts
with a researcher’s skill in processing text, albeit
known or assumed
content analyst Answer systematically. For example, the question of how
much violence is featured on television is answer-
able by counting incidences of it. For content ana-
The many Validating evidence lysts, an index of violence on television needs to
worlds
of others
Meanings, references, and uses of Texts have empirical validity in the sense that it needs to
say something about how audiences of violent
shows react, how their conception of the world is
shaped by being exposed to television violence, or
Figure 1 A Framework for Content Analysis whether it encourages or discourages engaging in
violent acts. Inferences about antecedents and con-
sequences can be validated, at least in principle.
Counts can be validated only by recounting. With-
There are definitions, such as Charles Osgood’s out designating where validating evidence could be
or Ole Holsti’s, that admit inferences but restrict found, statements about the physicality of text
them to the source or destination of the analyzed would not answer the research questions that
messages. These definitions provide for the use of define a content analysis.
validity criteria by allowing independent evidence Research questions must admit alternative
to be brought to bear on content analysis results, answers. They are similar to a set of hypotheses to
but they limit the analysis to causal inferences. be tested, except that inferences from text deter-
mine choices among them.
Conceptual Framework
Figure 1 depicts the methodologically relevant ele- Context of the Analysis
ments of content analysis. A content analysis usu- All texts can be read in multiple ways and pro-
ally starts with either or both (a) available text vide diverging information to readers with diverg-
that, on careful reading, poses scientific research ing competencies and interests. Content analysts
questions or (b) research questions that lead the are not different in this regard, except for their
researcher to search for texts that could answer mastery of analytical techniques. To keep the
them. range of possible inferences manageable, content
analysts need to construct a context in which their
research questions can be related to available texts
Research Questions
in ways that are transparent and available for
In content analysis, research questions need to examination by fellow scientists. This restriction is
go outside the physicality of text into the world of quite natural. Psychologists construct their world
others. The main motivation for using content unlike sociologists do, and what is relevant when
analysis is that the answers sought cannot be policy recommendation or evidence in court needs
found by direct observation, be it because the phe- to be provided may have little to do with an analy-
nomena of interest are historical, hence past; sis that aims at deciding when different parts of
enshrined in the mind of important people, not the Bible were written. The context of a content
available for interviews; concern policies that are analysis always is the analyst’s choice. There are
deliberately hidden, as by wartime enemies; or no restrictions except for having to be explicit and
concern the anticipated effects of available com- arguably related to the world of others for whom
munications, hence not yet present. If these the analyzed text means something, refers to
236 Content Analysis

something, and is useful or effective, though not meanings reside in words, not in syntax and orga-
necessarily as content analysts conceptualize these nization; (b) meanings are shared by everyone—
things. ‘‘manifest,’’ in Berelson’s definition—as implied in
the use of published dictionaries and thesauri; and
(c) certain differentiations among word meanings
Description of Text
can be omitted in favor of the gist of semantic
Usually, the first step in a content analysis is word classes. Tagging texts is standard in several
a description of the text. Mary Bock called content computer aids for content analysis. The General
analyses that stop there ‘‘impressionistic’’ because Inquirer software, for example, assigns the words
they leave open what a description could mean. I, me, mine, and myself to the tag ‘‘self’’ and the
Three types of description may be distinguished: tags ‘‘self,’’ ‘‘selves,’’ and ‘‘others’’ to the second-
(1) selected word counts, (2) categorizations by order tag ‘‘person.’’ Where words are ambiguous,
common dictionaries or thesauri, and (3) recording such as play, the General Inquirer looks for disam-
or scaling by human coders. biguating words in the ambiguous word’s environ-
ment—looking, in the case of play, for example,
for words relating to children and toys, musical
Selected Word Counts
instruments, theatrical performances, or work—
Selected word counts can easily be obtained and thereby achieves a less ambiguous tagging.
mechanically and afford numerous comparisons by Tagging is also used to scale favorable or unfa-
sources or situations or over time. For example, the vorable attributes or assign positive and negative
12 most frequent words uttered by Paris Hilton in signs to references.
an interview with Larry King were 285 I, 66 you,
61 my, 48 like, 45 yes, 44 really, 40 me, 33 I’m, 32
Recording or Scaling by Human Coders
people, 28 they, 17 life and time, and 16 jail. That
I is by far the most frequent word may suggest that Recording or scaling by human coders is the
the interviewee talked largely about herself and her traditional and by far the most common path
own life, which incidentally included a brief visit in taken to obtain analyzable descriptions of text.
jail. Such a distribution of words is interesting not The demand for content analysis to be reliable is
only because normally one does not think about met by standard coding instructions, which all
words when listening to conversations but also coders are asked to apply uniformly to all units of
because its skewedness is quite unusual and invites analysis. Units may be words, propositions, para-
explanations. But whether Hilton is self-centered, graphs, news items, or whole publications of
whether her response was due to Larry King’s ques- printed matter; scenes, actors, episodes, or whole
tioning, how this interview differed from others he movies in the visual domain; or utterances, turns
conducted, and what the interview actually taken, themes discussed, or decisions made in
revealed to the television audience remain specula- conversations.
tion. Nevertheless, frequencies offer an alternative The use of standard coding instructions offers
to merely listening or observing. content analysts not only the possibility of analyz-
Many computer aids to content analysis start ing larger volumes of text and employing many
with words, usually omitting function words, such coders but also a choice between emic and etic
as articles, stemming them by removing grammati- descriptions—emic by relying on the very cate-
cal endings, or focusing on words of particular gories that a designated group of readers would
interest. In that process, the textual environments use to describe the textual matter, etic by deriving
of words are abandoned or, in the case of key- coding categories from the theories of the context
words in context lists, significantly reduced. that the content analysts have adopted. The latter
choice enables content analysts to describe latent
contents and approach phenomena that ordinary
Categorizing by Common Dictionaries or Thesauri
writers and readers may not be aware of. ‘‘Good’’
Categorization by common dictionaries or the- and ‘‘bad’’ are categories nearly everyone under-
sauri is based on the assumptions that (a) textual stands alike, but ‘‘prosocial’’ and ‘‘antisocial’’
Content Analysis 237

attitudes, the concept of framing, or the idea of analysts cannot bypass justifying this step. It
a numerical strength of word associations needs to would be methodologically inadmissible to claim
be carefully defined, exemplified, and tested for to have analyzed ‘‘the’’ content of a certain news
reliability. channel, as if no inference were made or as if con-
tent were contained in its transmissions, alike for
everyone, including content analysts. It is equally
Inference
inadmissible to conclude from applying a standard
Abduction coding instrument and a sound statistics on reli-
ably coded data, that the results of a content anal-
Although sampling considerations are impor-
ysis say anything about the many worlds of others.
tant in selecting texts for analysis, the type of infer-
They may represent nothing other than the content
ence that distinguishes content analysis from
analyst’s systematized conceptions.
observational methods is abduction—not induc-
Regarding the analytical construct, content
tion or deduction. Abduction proceeds from parti-
analysts face two tasks, preparatory and applied.
culars—texts—to essentially different particulars—
Before designing a content analysis, researchers
the answers to research questions. For example,
may need to test or explore available evidence,
inferring the identity of the author from textual
including theories of the stable relations on
qualities of an unsigned work; inferring levels of
grounds of which the use of analytical constructs
anxiety from speech disturbances; inferring
can be justified. After processing the textual
a source’s conceptualization from the proximity of
data, the inferences tendered will require similar
words it uses; inferring Stalin’s successor from
justifications.
public speeches by Politburo members at the occa-
sion of Stalin’s birthday; or inferring possible solu-
tions to a conflict entailed by the metaphors used Interpretation
in characterizing that conflict.
The result of an inference needs to be interpreted
so as to select among the possible answers to the
given research question. In identifying the author
Analytical Constructs
of an unsigned document, one may have to trans-
Inferences of this kind require some evidential late similarities between signed and unsigned
support that should stem from the known, documents into probabilities associated with con-
assumed, theorized, or experimentally confirmed ceivable authors. In predicting the use of a weapon
stable correlations between the textuality as system from enemy domestic propaganda, one
described and the set of answers to the research may have to extrapolate the fluctuations of men-
question under investigation. Usually, this eviden- tioning it into a set of dates. In ascertaining gender
tial support needs to be operationalized into a form biases in educational material, one may have to
applicable to the descriptions of available texts transform the frequencies of gender references and
and interpretable as answers to the research ques- their evaluation into weights of one gender over
tions. Such operationalizations can take numerous another.
forms. By intuition, one may equate a measure of Interpreting inferences in order to select among
the space devoted to a topic with the importance alternative answers to a research question can be
a source attributes to it. The relation between dif- quite rigorous. Merely testing hypotheses on the
ferent speech disturbances and the diagnosis of cer- descriptive accounts of available texts stays within
tain psychopathologies may be established by the impressionistic nature of these descriptions and
correlation. The relation between the proximity of has little to do with content analysis.
words and associations, having been experimen-
tally confirmed, may be operation-alized in cluster-
Criteria for Judging Results
ing algorithms that compute word clusters from
strings of words. There are essentially three conditions for judging
While the evidential support for the intended the acceptability of content analysis results. In the
inferences can come from anywhere, content absence of direct validating evidence for the
238 Content Validity

inferences that content analysts make, there content analysts may need to rely on indirect evi-
remain reliability and plausibility. dence. For example, when inferring the psychopa-
thology of a historical figure, accounts by the
person’s contemporaries, actions on record, or
Reliability
comparisons with today’s norms may be used to
Reliability is the ability of the research process triangulate the inferences. Similarly, when military
to be replicated elsewhere. It assures content ana- intentions are inferred from the domestic broad-
lysts that their data are rooted in shared ground casts of wartime enemies, such intentions may be
and other researchers that they can figure out what correlated with observable consequences or remain
the reported findings mean or add their own data on record, allowing validation at a later time. Cor-
to them. Traditionally, the most unreliable part of relative validity is demonstrated when the results
a content analysis is the recording, categorization, of a content analysis correlate with other variables.
or scaling of text by human coders, and content Structural validity refers to the degree to which the
analysts employing coders for this purpose are analytical construct employed does adequately
required to assess the reliability of that process model the stable relations underlying the infer-
quantitatively. Measures of reliability are provided ences, and functional validity refers to the history
by agreement coefficients with suitable reliability of the analytical construct’s successes. Semantic
interpretations, such as Scott’s π (pi) and Krippen- validity concerns the validity of the description of
dorff’s α (alpha). The literature contains recom- textual matter relative to a designated group of
mendations regarding the minimum agreement readers, and sampling validity concerns the repre-
required for an analytical process to be sufficiently sentativeness of the sampled text. Unlike in obser-
reliable. However, that minimum should be vational research, texts need to be sampled in view
derived from the consequences of answering the of their ability to provide the answers to research
research question incorrectly. Some disagreements questions, not necessarily to represent the typical
among coders may not make a difference, but content produced by their authors.
others could direct the process to a different result.
Klaus Krippendorff

Plausibility See also Hypothesis; Interrater Reliability; Krippendorff’s


Alpha; Reliability; Validity of Research Conclusions
Computer content analysts pride themselves in
having bypassed reliability problems. However, all
content analysts need to establish the plausibility of
Further Readings
the path taken from texts to their results. This pre-
supposes explicitness as to the analytical steps Berelson, B. (1952). Content analysis in communications
taken. The inability to examine critically the steps research. New York: Free Press.
by which a content analysis proceeded to its con- Krippendorff, K. (2004). Content analysis: An
clusion introduces doubts in whether the analysis introduction to its methodology (2nd ed.). Thousand
Oaks, CA: Sage.
can be trusted, and implausibility can fail the effort.
Krippendorff, K., & Bock, M. A. (2008). The content
Content analysts cannot hide behind obscure algo- analysis reader. Thousand Oaks, CA: Sage.
rithms whose inferences are unclear. Plausibility
may not be quantifiable, as reliability is, but it is
one criterion all content analyses must satisfy.

CONTENT VALIDITY
Validity
In content analysis, validity may be demon- Content validity refers to the extent to which the
strated variously. The preferred validity is predic- items on a test are fairly representative of the
tive, matching the answers to the research entire domain the test seeks to measure. This entry
question with subsequently obtained facts. When discusses origins and definitions of content valida-
direct and post facto validation is not possible, tion, methods of content validation, the role of
Content Validity 239

content validity evidence in validity arguments, the ability to add in contexts outside addition
and unresolved issues in content validation. tests.
At the heart of the above issue lies the paradig-
matic shift from discrete forms of validity, each
appropriate to one kind of test, to a more unified
Origins and Definitions
approach to test validation. The term content
One of the strengths of content validation is the validity initially differentiated one form of validity
simple and intuitive nature of its basic idea, which from criterion validity (divisible into concurrent
holds that what a test seeks to measure constitutes validity and predictive validity, depending on the
a content domain and the items on the test should timing of the collection of the criterion data) and
sample from that domain in a way that makes the construct validity (which initially referred primar-
test items representative of the entire domain. ily to the pattern of correlations with other vari-
Content validation methods seek to assess this ables, the nomological net, and to the pattern of
quality of the items on a test. Nonetheless, the association between the scores on individual items
underlying theory of content validation is fraught within the test). Each type of validity arose from
with controversies and conceptual challenges. a set of practices that the field developed to
At one time, different forms of validation, and address a particular type of practical application
indeed validity, were thought to apply to different of test use. Content validity was the means of vali-
types of tests. Florence Goodenough made an dating tests used to sample a content domain and
influential distinction between tests that serve as evaluate mastery within that domain. The unified
samples and tests that serve as signs. From this view of validity initiated by Jane Loevinger and
view, personality tests offer the canonical example Lee Cronbach, and elaborated by Samuel Messick,
of tests as signs because personality tests do not sought to forge a single theory of test validation
sample from a domain of behavior that constitutes that subsumed these disparate practices.
the personality variable but rather serve to indicate The basic practical concern involved the fact
an underlying personality trait. In contrast, educa- that assessment of the representativeness of the
tional achievement tests offer the canonical exam- content domain achieved by a set of items does
ple of tests as samples because the items sample not provide a sufficient basis to evaluate the
from a knowledge or skill domain, operationally soundness of inferences from scores on the test.
defined in terms of behaviors that demonstrate For example, a student correctly answering arith-
that corresponding knowledge or skill that the test metic items at a level above chance offers stronger
measures achievement in. For example, if an addi- support for the conclusion that he or she can do
tion test contains items representative of all combi- the arithmetic involved than the same student fail-
nations of single digits, then it may adequately ing to correctly answer the items offers for the
represent addition of single-digit numbers, but it conclusion that he or she cannot do the arithmetic.
would not adequately represent addition of num- It may be that the student can correctly calculate 6
bers with more than one digit. divided by 2 but has not been exposed to the 6/2
Jane Loevinger and others have argued that the notation used in the test items. In another context,
above distinction does not hold up because all tests a conscientious employee might be rated low on
actually function as signs. The inferences drawn a performance scale because the items involve
from test scores always extend beyond the test-tak- tasks that are important to and representative of
ing behaviors themselves, but it is impossible for the domain of conscientious work behaviors, but
the test to include anything beyond test-taking opportunities for which come up extremely rarely
behaviors. Even work samples can extend only to in the course of routine work (e.g., reports defec-
samples of work gathered within the testing proce- tive equipment when encountered). Similarly, a test
dure (as opposed to portfolios, which lack the with highly representative items might have inade-
standardization of testing procedures). To return quate reliability or other deficiencies that reduce
to the above example, one does not use an addi- the validity of inferences from its scores. The tradi-
tion test to draw conclusions only about answering tional approach to dividing up types of validity
addition items on a test but seeks to generalize to and categorizing tests with respect to the
240 Content Validity

appropriate type of validation tends in practice to validation efforts that exclude content validation
encourage reliance on just one kind of validity evi- where it could provide an important and perhaps
dence for a given test. Because just one type alone, necessary line of support. These considerations
including content-related validation evidence, does have led to proposals to modify the argument
not suffice to underwrite the use of a test, the uni- approach to validation in ways that make content-
fied view sought to discourage such categorical related evidence necessary or at least strongly
typologies of either tests or validity types and recommended for tests based on sampling from
replace these with validation methods that com- a content domain.
bined different forms of evidence for the validity Contemporary approaches to content validation
of the same test. typically distinguish various aspects of content
As Stephen Sireci and others have argued, the validity. A clear domain definition is foundational
problem with the unified approach with respect to for all the other aspects of content validity because
content validation stems directly from this effort without a clear definition of the domain, test
to improve on inadequate test validation practices. developers, test users, or anyone attempting to do
A central ethos of unified approaches involves the validation research has no basis for a clear assess-
rejection of a simple checklist approach to valida- ment of the remaining aspects. This aspect of con-
tion in which completion of a fixed set of steps tent validation closely relates to the emphasis in
results in a permanently validated test that requires the Standards for Educational and Psychological
no further research or evaluation. As an antidote Testing on clearly defining the purpose of a test as
to this checklist conception, Michael Kane and the first step in test validation.
others elaborated the concept of a validity argu- A second aspect of content validity, domain rel-
ment. The basic idea was that test validation evance, draws a further connection between con-
involves building an argument that combines mul- tent validation and the intended purpose of the
tiple lines of evidence of the overall evaluation of test. Once the domain has been defined, domain
a use or interpretation of scores derived from a test. relevance describes the degree to which the defined
To avoid a checklist, the argument approach leaves domain bears importance to the purpose of the
it open to the test validator to exercise judgment test. For example, one could imagine a test that
and select the lines of evidence that are most does a very good job of sampling the skills
appropriate in a given instance. This generally required to greet visitors, identify whom they wish
involves selecting the premises of the validation to see, schedule appointments, and otherwise exer-
argument that bear the most controversy and for cise the judgment and complete the tasks required
which empirical support can be gathered within of an effective receptionist. However, if the test use
practical constraints on what amounts to a reason- involves selecting applicants for a back office sec-
able effort. One would not waste resources gather- retarial position that does not involve serving as
ing empirical evidence for claims that no one a receptionist, then the test would not have good
would question. Similarly, one would not violate domain relevance for the intended purpose. This
ethical standards in order to validate a test of neu- aspect of content validation relates to a quality of
ral functioning by damaging various portions of the defined domain independent of how well the
the cortex in order to experimentally manipulate test taps that domain.
the variable with random assignment. Nor would In contrast, domain representation does not
one waste resources on an enormous and costly evaluate the defined domain but rather evaluates
effort to test one assumption if those resources the effectiveness with which the test samples that
could be better used to test several others in a less domain. Clearly, this aspect of content validation
costly fashion. In short, the validity argument depends on the previous two. Strong content rep-
approach to test validation does not specify that resentation does not advance the quality of a test if
any particular line of evidence is required of the items represent a domain with low relevance.
a validity argument. As a result, an effort to dis- Furthermore, even if the items do represent
courage reliance on content validation evidence a domain well, the test developer has no effective
alone may have swung the pendulum too far in the means of ascertaining that fact without a clear
opposite direction by opening the door to domain definition. Domain representation can
Content Validity 241

suffer in two ways: Items on the test may fail to as noted above, even this test-centered approach to
sample some portion of the test domain, in which content validity remains relative to the purpose for
case the validity of the test suffers as a result of which one uses the test. Domain relevance depends
construct underrepresentation. Alternatively, the on this purpose, and the purpose of the test should
test might contain items from outside the test ideally shape the conceptualization of the test
domain, in which case these items introduce con- domain. However, focus on just the content of the
struct-irrelevant variance into the test total score. items allows for a broadening of content valida-
It is also possible that the test samples all and only tion beyond the conception of a test as measuring
the test domain but does so in a way that overem- a construct conceptualized as a latent variable
phasizes some areas of the domain while underem- representing a single dimension of variation. It
phasizing other areas. In such a case, the items allows, for instance, for a test domain that spans
sample the entire domain but in a nonrepresenta- a set of tasks linked another way but heteroge-
tive manner. An example would be an addition test neous in the cognitive processes involved in com-
where 75% of the items involved adding only even pleting them. An example might be the domain of
numbers and no odd numbers. tasks associated with troubleshooting a complex
An additional aspect of content validation piece of technology such as a computer network.
involves clear, detailed, and thorough documenta- No one algorithm or process might serve to trou-
tion of the test construction procedures. This bleshoot every problem in the domain, but content
aspect of content validation reflects the epistemic validation held separate from response processes
aspect of modern test validity theory: Even if a test can nonetheless apply to such a test.
provides an excellent measure of its intended con- In contrast, the idea that content validity applies
struct, test users cannot justify the use of the test to response processes existed as a minority posi-
unless they know that the test provides an excel- tion for most of the history of content validation,
lent measure. Test validation involves justifying an but has close affinities to both the unified notion
interpretation or use of a test, and content valida- of validation as an overall evaluation based on the
tion involves justifying the test domain and the sum of the available evidence and also with cogni-
effectiveness with which the test samples that tive approaches to test development and valida-
domain. Documentation of the process leading to tion. Whereas representativeness of the item
the domain definition and generation of the item content bears more on a quality of the stimulus
pool provides a valuable source of content-related materials, representativeness of the response pro-
validity evidence. One primary element of such cesses bears more on an underlying individual dif-
documentation, the test blueprint, specifies the var- ferences variable as a property of the person
ious areas of the test domain and the number of tested. Susan Embretson has distinguished con-
items from each of those areas. Documentation of struct representation, involving the extent to which
the process used to construct the test in keeping items require the cognitive processes that the test is
with the specified test blueprint thereby plays a cen- supposed to measure, from nomothetic span,
tral role in evaluating the congruency between the which is the extent to which the test bears the
test domain and the items on the test. expected patterns of association with other vari-
The earlier passages of this entry have left open ables (what Cronbach and Paul Meehl called
the question of whether content validation refers nomological network). The former involves con-
only to the items on the test or also to the pro- tent validation applied to processes whereas the
cesses involved in answering those items. Con- latter involves methods more closely associated
struct validation has its origins in a time when with criterion-related validation and construct val-
tests as the object of validation were not yet clearly idation methods.
distinguished from test scores or test score inter-
pretations. As such, most early accounts focused
Content Validation Methodology
on the items rather than the processes involved in
answering them. Understood this way, content val- Content-related validity evidence draws heavily
idation focuses on qualities of the test rather than from the test development process. The content
qualities of test scores or interpretations. However, domain should be clearly defined at the start of
242 Content Validity

this process, item specifications should be justified described above. This methodology relies on
in terms of this domain definition, item construc- a strong cognitive theory of how test takers pro-
tion should be guided and justified by the item spe- cess test items and thus applies best when item
cifications, and the overall test blueprint that response strategies are relatively well understood
assembles the test from the item pool should also and homogeneous across items. The approach
be grounded in and justified by the domain defini- sometimes bears a strong relation to the facet anal-
tion. Careful documentation of each of these pro- ysis methods of Louis Guttman in that item specifi-
cesses provides a key source of validity evidence. cations describe and quantify a variety of item
A standard method for assessing content valid- attributes, and these can be used to predict fea-
ity involves judgments by subject matter experts tures of item response patterns such as item diffi-
(SMEs) with expertise in the content of the test. culty. This approach bears directly on content
Two or more SMEs rate each item, although large validity because it requires a detailed theory relat-
or diverse tests may require different SMEs for dif- ing how items are answered to what the items
ferent items. Ratings typically involve domain rele- measure. Response process information can also
vance or importance of the content in individual be useful in extrapolating from the measured con-
test items. Good items have high means and low tent domain to broader inferences in applied test-
standard deviations, indicating high agreement ing, as described in the next section.
among raters. John Flanagan introduced a critical
incident technique for generating and evaluating
Role in Validity Arguments
performance-based items. C. H. Lawshe, Lewis
Aiken, and Ronald Hambleton each introduced At one time, the dominant approach was to iden-
quantitative measures of agreement for use with tify certain tests as the type of test to which con-
criterion-related validation research. Victor Mar- tent validation applies and rely on content validity
tuza introduced a content validity index, which evidence for the evaluation of such tests. Currently,
has generated a body of research in the nursing lit- few if any scholars would advocate sole reliance
erature. A number of authors have also explored on content validity evidence for any test. Instead,
multivariate methods for investigating and summa- content-related evidence joins with other evidence
rizing SME ratings, including factor analysis and to support key inferences and assumptions in
multidimensional scaling methods. Perhaps not a validity argument that combines various sources
surprisingly, the results can be sensitive to the of evidence to support an overall assessment of the
approach taken to structuring the judgment task. test score interpretation and use.
Statistical analysis of item scores can also be Kane has suggested a two-step approach in
used to evaluate content validity by showing that which one first constructs an argument for test
the content domain theory is consistent with the score interpretations and then evaluates that argu-
clustering of items into related sets of items by ment with a test validity argument. Kane has sug-
some statistical criteria. These methods include gested a general structure involving four key
factor analysis, multidimensional scaling methods, inferences to which content validity evidence can
and cluster analysis. Applied to content validation, contribute support. First, the prescribed scoring
these methods overlap to some degree with con- method involves an inference from observed test-
struct validation methods directed toward the taking behaviors to a specific quantification
internal structure of a test. Test developers most intended to contribute to measurement through an
often combine such methods with methods based overall quantitative summary of the test takers’
on SME ratings to lessen interpretational ambigu- responses. Second, test score interpretation
ity of the statistical results. involves generalization from the observed test
A growing area of test validation related to con- score to the defined content domain sampled by
tent involves cognitive approaches to modeling the the test items. Third, applied testing often involves
processes involved in answering specific item types. a further inference that extrapolates from the mea-
Work by Embretson and Robert Mislevy exempli- sured content domain to a broader domain of
fies this approach, and such approaches focus on inference that the test does not fully sample.
the construct representation aspect of test validity Finally, most applied testing involves a final set of
Contrast Analysis 243

inferences from the extrapolated level of perfor- Kane, M. (2006). Content-related validity evidence in test
mance to implications for actions and decisions development. In S. M. Downing & T. M. Haladyna
applied to a particular test taker who earns a par- (Eds.), Handbook of test development (pp. 131–153).
ticular test score. Mahwah, NJ: Lawrence Erlbaum.
McKenzie, J. F., Wood, M. L., Kotecki, J. E., Clark, J. K.,
Interpretation of statistical models used to pro-
& Brey, R. A. (1999). Research notes establishing
vide criterion- and construct-related validity evi- content validity: Using qualitative and quantitative
dence would generally remain indeterminate were steps. American Journal of Health Behavior, 23, 311–
it not for the grounding of test score interpreta- 318.
tions provided by content-related evidence. While Popham, W. J. (1992). Appropriate expectations for
not a fixed foundation for inference, content- content judgments regarding teacher licensure tests.
related evidence provides a strong basis for taking Applied Measurement in Education, 5, 285–301.
one interpretation of a nomothetic structure as Sireci, S. (1998). The construct of content validity. Social
more plausible than various rival hypotheses. As Indicators Research, 45, 83–117.
such, content-related validity evidence continues to
play an important role in test development and
complements other forms of validity evidence in
validity arguments. CONTRAST ANALYSIS
A standard analysis of variance (ANOVA) pro-
Unresolved Issues vides an F test, which is called an omnibus test
As validity theory continues to evolve, a number because it reflects all possible differences between
of issues in content validation remain unresolved. the means of the groups analyzed by the ANOVA.
For instance, the relative merits of restricting con- However, most experimenters want to draw con-
tent validation to test content or expanding it to clusions more precise than ‘‘the experimental
involve item response processes warrant further manipulation has an effect on participants’ behav-
attention. A variety of aspects of content validity ior.’’ Precise conclusions can be obtained from
have been identified, suggesting a multidimensional contrast analysis because a contrast expresses a spe-
attribute of tests, but quantitative assessments of cific question about the pattern of results of an
content validity generally emphasize single-number ANOVA. Specifically, a contrast corresponds to
summaries. Finally, the ability to evaluate content a prediction precise enough to be translated into
validity in real time with computer-adaptive testing a set of numbers called contrast coefficients, which
remains an active area of research. reflect the prediction. The correlation between the
contrast coefficients and the observed group means
Keith A. Markus and Kellie M. Smith directly evaluates the similarity between the pre-
diction and the results.
See also Construct Validity; Criterion Validity When performing a contrast analysis, one needs
to distinguish whether the contrasts are planned or
post hoc. Planned, or a priori, contrasts are
Further Readings
selected before running the experiment. In general,
American Educational Research Association, American they reflect the hypotheses the experimenter wants
Psychological Association, & National Council on to test, and there are usually few of them. Post
Measurement in Education. (1999). Standards for hoc, or a posteriori (after the fact), contrasts are
educational and psychological testing. Washington, decided after the experiment has been run. The
DC: American Educational Research Association. goal of a posteriori contrasts is to ensure that
Crocker, L. M., Miller, D., & Franks, E. A. (1989).
unexpected results are reliable.
Quantitative methods for assessing the fit between test
and curriculum. Applied Measurement in Education,
When performing a planned analysis involving
2, 179–194. several contrasts, one needs to evaluate whether
Embretson, S. E. (1983). Construct validity: Construct these contrasts are mutually orthogonal or not.
representation versus nomothetic span. Psychological Two contrasts are orthogonal when their contrast
Bulletin, 93, 179–197. coefficients are uncorrelated (i.e., their coefficient
244 Contrast Analysis

of correlation is zero). The number of possible A Priori Orthogonal Contrasts


orthogonal contrasts is one less than the number
of levels of the independent variable. For Multiple Tests
All contrasts are evaluated by the same general When several contrasts are evaluated, several
procedure. First, the contrast is formalized as a set statistical tests are performed on the same data set,
of contrast coefficients (also called contrast and this increases the probability of a Type I error
weights). Second, a specific F ratio (denoted Fψ ) is (i.e., rejection of the null hypothesis when it is
computed. Finally, the probability associated with true). In order to control the Type I error at the
Fψ is evaluated. This last step changes with the level of the set (also known as the family) of con-
type of analysis performed. trasts, one needs to correct the α level used to eval-
uate each contrast. This correction for multiple
contrasts can be done with the use of the  Sidák
equation, the Bonferroni (also known as Boole, or
Research Hypothesis as a Contrast Expression Dunn) inequality, or the Monte Carlo technique.

When a research hypothesis is precise, it is possible Sidák and Bonferroni


to express it as a contrast. A research hypothesis,
The probability of making at least one Type I
in general, can be expressed as a shape, a configu-
error for a family of orthogonal (i.e., statistically
ration, or a rank ordering of the experimental
independent) contrasts (C) is
means. In all these cases, one can assign numbers
that will reflect the predicted values of the experi- C
α½PF ¼ 1  ð1  α½PCÞ : ð1Þ
mental means. These numbers are called contrast
coefficients when their mean is zero. To convert with α[PF] being the Type I error for the family of
a set of numbers into a contrast, it suffices to sub- contrasts and α[PC] being the Type I error per
tract their mean from each of them. Often, for contrast. This equation can be rewritten as
convenience, contrast coefficients are expressed
with integers.
α½PC ¼ 1  ð1  α½PFÞ1=C : ð2Þ
For example, assume that for a four-group
design, a theory predicts that the first and second This formula, called the Sidák equation, shows
groups should be equivalent, the third group how to correct the [PC] values used for each
should perform better than these two groups, and contrast.
the fourth group should do better than the third Because the Sidák equation involves a fractional
with an advantage of twice the gain of the third power, one can use an approximation known as
over the first and the second. When translated into the Bonferroni inequality, which relates α[PC] to
a set of ranks, this prediction gives α[PF]:

α½PF
C1 C2 C3 C4 Mean α½PC ≈ : ð3Þ
C
1 1 2 4 2
Sidák and Bonferroni are related by the inequality

After subtracting the mean, we get the following 1=C α½PF


α½PC ¼ 1  ð1  α½PFÞ ≥ : ð4Þ
contrast: C
C1 C2 C3 C4 Mean They are, in general, very close to each other.
1 1 0 2 0 As can be seen, the Bonferroni inequality is a pes-
simistic estimation. Consequently Sidák should
In case of doubt, a good heuristic is to draw the be preferred. However, the Bonferroni inequality
predicted configuration of results, and then to rep- is more well known and hence is used and cited
resent the position of the means by ranks. more often.
Contrast Analysis 245

Monte Carlo α½PC ¼


The Monte Carlo technique can also be used number of contrasts having reached significance
to correct for multiple contrasts. The Monte total number of contrasts
Carlo technique consists of running a simulated 2; 403
experiment many times using random data, with ¼ ¼ :0479: ð5Þ
50; 000
the aim of obtaining a pattern of results showing
what would happen just on the basis of chance. This value falls close to the theoretical value of
This approach can be used to quantify α[PF], the α ¼ .05.
inflation of Type I error due to multiple testing. It can be seen also that for 7,868 experiments,
Equation 1 can then be used to set α[PC] in no contrast reached significance. Correspondingly,
order to control the overall value of the Type I for 2,132 experiments (10,000  7,868), at least
error. one Type I error was made. From these data,
As an illustration, suppose that six groups α[PF] can be estimated as
with 100 observations per group are created
with data randomly sampled from a normal α½PF ¼
population. By construction, the H0 is true (i.e., number of families with at least 1 Type I error
all population means are equal). Now, construct
total number of families
five independent contrasts from these six groups.
For each contrast, compute an F test. If the prob- 2; 132
1:5 ¼ ¼ :2132: ð6Þ
ability associated with the statistical index is 10; 000
smaller than α ¼ .05, the contrast is said to reach
significance (i.e., α[PC] is used). Then have This value falls close to the theoretical value given
a computer redo the experiment 10,000 times. In by Equation 1:
sum, there are 10,000 experiments, 10,000 fami- C
lies of contrasts, and 5 × 10,000 ¼ 50,000 con- α½PF ¼ 1  ð1  α½PCÞ ¼ 1  ð1  :05Þ5 ¼ :226 :
trasts. The results of this simulation are given in
Checking the Orthogonality of Two Contrasts
Table 1.
Table 1 shows that the H0 is rejected for Two contrasts are orthogonal (or independent)
2,403 contrasts out of the 50,000 contrasts actu- if their contrast coefficients are uncorrelated. Con-
ally performed (5 contrasts × 10,000 experi- trast coefficients have zero sum (and therefore
ments). From these data, an estimation of α[PC] a zero mean). Therefore, two contrasts, whose A
is computed as contrast coefficients are denoted Ca1 and Ca2, will
be orthogonal if and only if
Table 1 Results of a Monte Carlo Simulation X
A

Number of X: Number of Number of Ca;i Ca;j ¼ 0: ð7Þ


a¼1
Families With Type 1 Errors Type I
X Type I Errors per Family Errors
Computing Sum of Squares, Mean Square, and F
7,868 0 0
1,907 1 1,907 The sum of squares for a contrast can be com-
192 2 384 puted with the Ca coefficients. Specifically, the sum
20 3 60 of squares for a contrast is denoted SSψ and is
13 4 52 computed as P
0 5 0 Sð Ca Ma: Þ2
SSψ ¼ P 2 ð8Þ
10,000 2,403 Ca
Notes: Numbers of Type I errors when performing C ¼ 5
where S is the number of subjects in a group.
contrasts for 10,000 analyses of variance performed on
a six-group design when the H0 is true. For example, 192 Also, because the sum of squares for a contrast
families out of the 10,000 have two Type I errors. This gives has one degree of freedom, it is equal to the mean
2 × 192 ¼ 384 Type I errors. square of effect for this contrast:
246 Contrast Analysis

Table 2 Data From a Replication of an Experiment by Smith (1979)


Experimental Context

Group 1 Group 2 Group 3 Group 4 Group 5


Same Different Imagery Photo Placebo
25 11 14 25 8
26 21 15 15 20
17 9 29 23 10
15 6 10 21 7
14 7 12 18 15
17 14 22 24 7
14 12 14 14 1
20 4 20 27 17
11 7 22 12 11
21 19 12 11 4
Ya 180 110 170 190 100
Ma 18 11 17 19 10
Ma – M 3 –4 2 4 –5
P
(Yas – Ma)2 218 284 324 300 314
Source: Adapted from Smith (1979).
Note: The dependent variable is the number of words recalled.

SSψ SSψ Example


MSψ ¼ ¼ ¼ SSψ : ð9Þ
dfψ 1 This example is inspired by an experiment by
Steven M. Smith in 1979. The main purpose of
The Fψ ratio for a contrast is now computed as this experiment was to show that one’s being in
the same mental context for learning and for test-
ing leads to better performance than being in dif-
MSψ ferent contexts. During the learning phase,
Fψ ¼ : ð10Þ
MSerror participants learned a list of 80 words in a room
painted with an orange color, decorated with pos-
ters, paintings, and a decent amount of parapher-
Evaluating F for Orthogonal Contrasts nalia. A memory test was performed to give
Planned orthogonal contrasts are equivalent to subjects the impression that the experiment was
independent questions asked of the data. Because over. One day later, the participants were unex-
of that independence, the current procedure is to pectedly retested on their memory. An experi-
act as if each contrast were the only contrast menter asked them to write down all the words
tested. This amounts to not using a correction for from the list that they could remember. The test
multiple tests. This procedure gives maximum took place in five different experimental condi-
power to the test. Practically, the null hypothesis tions. Fifty subjects (10 per group) were randomly
for a contrast is tested by computing an F ratio as assigned to one of the five experimental groups.
indicated in Equation 10 and evaluating its p value The five experimental conditions were
using a Fisher sampling distribution with v1 ¼ 1
and v2 being the number of degrees of freedom of 1. Same context. Participants were tested in the
same room in which they learned the list.
MSerror [e.g., in independent measurement designs
with A groups and S observations per group, 2. Different context. Participants were tested in
v2 ¼ A(S  1)]. a room very different from the one in which
Contrast Analysis 247

Table 3 ANOVA Table for a Replication of Smith’s Table 4 Orthogonal Contrasts for the Replication of
Experiment (1979) Smith (1979)
Source df SS MS F Pr(F) Group Group Group Group Group
P
Experimental 4 700.00 175.00 5.469 ** .00119 Contrast 1 2 3 4 5 Ca
Error 45 1,440.00 32.00 ψ1 +2 –3 +2 +2 –3 0
Total 49 2,1400.00 ψ2 +2 0 –1 –1 0 0
Source: Adapted from Smith (1979). ψ3 0 0 +1 –1 0 0
ψ4 0 +1 0 0 –1 0
Note: ** p ≤ .01.
Source: Adapted from Smith (1979).

they learned the list. The new room was located • Research Hypothesis 4. The different context
in a different part of the campus, painted grey, group differs from the placebo group.
and looked very austere.
Contrasts
3. Imaginary context. Participants were tested in
the same room as participants from Group 2. In The four research hypotheses are easily trans-
addition, they were told to try to remember the formed into statistical hypotheses. For example,
room in which they learned the list. In order to the first research hypothesis is equivalent to stating
help them, the experimenter asked them several the following null hypothesis:
questions about the room and the objects in it.
The means of the population for Groups 1, 3,
4. Photographed context. Participants were placed and 4 have the same value as the means of the
in the same condition as Group 3, and in population for Groups 2 and 5.
addition, they were shown photos of the orange This is equivalent to contrasting Groups 1, 3,
room in which they learned the list. and 4, on one hand, and Groups 2 and 5, on the
5. Placebo context. Participants were in the same other. This first contrast is denoted ψ1 :
condition as participants in Group 2. In addition,
before starting to try to recall the words, they are ψ1 ¼ 2μ1  3μ2 þ 2μ3 þ 2μ4 þ 3μ5 :
asked to perform a warm-up task, namely, to try
to remember their living room. The null hypothesis to be tested is

The data and ANOVA results of the replication H0, 1 : ψ1 ¼ 0:


of Smith’s experiment are given in the Tables 2
and 3. The first contrast is equivalent to defining the fol-
lowing set of coefficients Ca :
Research Hypotheses for Contrast Analysis X
Several research hypotheses can be tested with Gr.1 Gr.2 Gr.3 Gr.4 Gr.5 Ca
Smith’s experiment. Suppose that the experiment a :
was designed to test these hypotheses: þ2 3 þ2 þ2 3 0

• Research Hypothesis 1. Groups for which the Note that the sum of the coefficients Ca is zero, as
context at test matches the context during it should be for a contrast. Table 4 shows all four
learning (i.e., is the same or is simulated by contrasts.
imaging or photography) will perform better
than groups with different or placebo contexts. Are the Contrasts Orthogonal?
• Research Hypothesis 2. The group with the same
context will differ from the group with Now the problem is to decide whether the con-
imaginary or photographed contexts. trasts constitute an orthogonal family. We check
• Research Hypothesis 3. The imaginary context that every pair of contrasts is orthogonal by using
group differs from the photographed context Equation 7. For example, Contrasts 1 and 2 are
group. orthogonal because
248 Contrast Analysis

Table 5 Steps for the Computation of SSc1 of Smith to the number of degrees of freedom of the experi-
(1979) mental sum of squares.
Group Ma Ca CaMa C2a
1 18.00 +2 +36.00 4 A Priori Nonorthogonal Contrasts
2 11.00 –3 –33.00 9 So orthogonal contrasts are relatively straight-
3 17.00 +2 +34.00 4 forward because each contrast can be evaluated on
4 19.00 +2 +38.00 4 its own. Nonorthogonal contrasts, however, are
5 10.00 –3 –30.00 9 more complex. The main problem is to assess the
0 45.00 30 importance of a given contrast conjointly with the
Source: Adapted from Smith (1979). other contrasts. There are currently two (main)
approaches to this problem. The classical
X
A ¼5
approach corrects for multiple statistical tests (e.g.,
Ca;1 Ca;2 ¼ ð2 × 2Þ þ ð3 × 0Þ þ ð2 ×  1Þ using a Sidák or Bonferroni correction), but essen-
a¼1
tially evaluates each contrast as if it were coming
þ ð2 × 1Þ þ ð3 × 0Þ þ ð0 × 0Þ from a set of orthogonal contrasts. The multiple
¼ 0: regression (or modern) approach evaluates each
contrast as a predictor from a set of nonorthogo-
F test nal predictors and estimates its specific contribu-
The sum of squares and Fψ for a contrast are tion to the explanation of the dependent variable.
computed from Equations 8 and 10. For example, The classical approach evaluates each contrast for
the steps for the computations of SSψ1 are given in itself, whereas the multiple regression approach
Table 5. evaluates each contrast as a member of a set of
contrasts and estimates the specific contribution
P of each contrast in this set. For an orthogonal set
Sð Ca Ma: Þ2 10 × ð45:00Þ2
SSψ1 ¼ P 2 ¼ ¼ 675:00 of contrasts, the two approaches are equivalent.
Ca 30
MSψ1 ¼ 675:00
The Classical Approach
MSψ1 675:00
Fψ1 ¼ ¼ ¼ 21:094 : Some problems are created by the use of multi-
MSerror 32:00
ple nonorthogonal contrasts. The most important
ð11Þ
one is that the greater the number of contrasts, the
greater the risk of a Type I error. The general strat-
The significance of a contrast is evaluated with
egy adopted by the classical approach to this prob-
a Fisher distribution with 1 and A(S  1) ¼ 45
lem is to correct for multiple testing.
degrees of freedom, which gives a critical value of
4.06 for α ¼ .05 (7.23 for α ¼ .01). The sums of Sidák and Bonferroni Corrections
squares for the remaining contrasts are SSψ · 2 ¼ 0,
SSψ · 3 ¼ 20, and SSψ · 4 ¼ 5 with 1 and AðS  1Þ When a family’s contrasts are nonorthogonal,
¼ 45 degrees of freedom. Therefore, ψ2 , ψ3 , and Equation 10 gives a lower bound for [PC]. So,
ψ4 are nonsignificant. Note that the sums of squares instead of having the equality, the following
of the contrasts add up to SSexperimental . That is, inequality, called the Sidák inequality, holds:
C
SSexperimental ¼ SSψ:1 þ SSψ:2 þ SSψ:3 þ SSψ:4 α½PF ≤ 1  ð1  α½PCÞ : ð12Þ
¼ 675:00 þ 0:00 þ 20:00 þ 5:00 This inequality gives an upper bound for α[PF],
¼ 700:00 : and therefore the real value of α[PF] is smaller
than its estimated value.
When the sums of squares are orthogonal, the As earlier, we can approximate the  Sidák
degrees of freedom are added the same way as inequality by Bonferroni as
the sums of squares are. This explains why the
maximum number of orthogonal contrasts is equal α½PF < Cα½PC: ð13Þ
Contrast Analysis 249

Table 6 Nonorthogonal Contrasts for the Replication Table 7 Fc Values for the Nonorthogonal Contrasts
of Smith (1979) From the Replication of Smith (1979)
Group Group Group Group Group Contrast rY:ψ r2Y:ψ Fψ pFψ
P
Contrast 1 2 3 4 5 Ca ψ1 .9820 .9643 21.0937 < .0001
ψ1 2 –3 2 2 –3 0 ψ2 –.1091 .0119 0.2604 .6123
ψ2 3 3 –2 –2 –2 0 ψ3 .5345 .2857 6.2500 .0161
ψ3 1 –4 1 1 1 0 Source: Adapted from Smith (1979).
Source: Adapted from Smith (1979).

 When used with a set of orthogonal contrasts, the


And, as earlier, Sidák and Bonferroni are linked to multiple regression approach gives the same results
each other by the inequality as the ANOVA-based approach previously
described. When used with a set of nonorthogonal
α½PF ≤ 1  ð1  α½PCÞC < Cα½PC: ð14Þ
contrasts, multiple regression quantifies the specific
Example contribution of each contrast as the semipartial
coefficient of correlation between the contrast
Let us go back to Smith’s (1979) study (see
coefficients and the dependent variable. The multi-
Table 2). Suppose that Smith wanted to test these
ple regression approach can be used for nonortho-
three hypotheses:
gonal contrasts as long as the following
constraints are satisfied:
• Research Hypothesis 1. Groups for which the
context at test matches the context during 1. There are no more contrasts than the number of
learning will perform better than groups with degrees of freedom of the independent variable.
different contexts;
• Research Hypothesis 2. Groups with real 2. The set of contrasts is linearly independent (i.e.,
contexts will perform better than those with not multicollinear). That is, no contrast can be
imagined contexts; obtained by combining the other contrasts.
• Research Hypothesis 3. Groups with any context
will perform better than those with no context.
Example
These hypotheses can easily be transformed Let us go back once again to Smith’s (1979)
into the set of contrasts given in Table 6. The study of learning and recall contexts. Suppose we
values of Fψ were computed with Equation 10 take our three contrasts (see Table 6) and use them
(see also Table 3) and are in shown in Table 7, as predictors with a standard multiple regression
along with their p values. If we adopt a value program. We will find the following values for the
of α[PF] ¼ .05, a  Sidák correction will entail semipartial correlation between the contrasts and
evaluating each contrast at the α level of the dependent variable:
α[PF] ¼ .0170 (Bonferroni will give the approx-
imate value of α[PF] ¼ .0167). So, with a correc-
ψ1 : r2 ¼ :1994
tion for multiple comparisons one can conclude Y:Ca;1 jCa;2 Ca;3
that Contrasts 1 and 3 are significant. ψ2 : r2Y:Ca;2 jCa;1 Ca;3 ¼ :0000
ψ3 : r2Y:Ca;3 jCa;1 Ca;2 ¼ :0013;
Multiple Regression Approach
ANOVA and multiple regression are equivalent
with r2Y:C being the squared correlation of
if one uses as many predictors for the multiple a;1 jCa;2 Ca;3

regression analysis as the number of degrees of ψ1 and the dependent variable with the effects of
freedom of the independent variable. An obvious ψ2 and ψ3 partialled out. To evaluate the signifi-
choice for the predictors is to use a set of contrast cance of each contrast, we compute an F ratio for
coefficients. Doing so makes contrast analysis the corresponding semipartial coefficients of corre-
a particular case of multiple regression analysis. lation. This is done with the following formula:
250 Contrast Analysis

r2 hypothesis, but one a posteriori contrast could be


Y:Ca;i jCa;k Ca;‘
FY:Ca;i jCa;k Ca;‘ ¼ × dfresidual : ð15Þ declared significant.
1  r2Y:A In order to avoid such a discrepant decision,
the Scheffé approach first tests any contrast as if
This results in the following F ratios for the Smith it were the largest possible contrast whose sum
example: of squares is equal to the experimental sum of
ψ1 : FY:Ca;1 jCa;2 Ca;3 ¼ 13:3333; p ¼ :0007; squares (this contrast is obtained when the con-
trast coefficients are equal to the deviations of
ψ2 : FY:Ca;2 jCa;1 Ca;3 ¼ 0:0000; p ¼ 1:0000; the group means to their grand mean) and, sec-
ψ3 : FY:Ca;3 jCa;1 Ca;2 ¼ 0:0893; p ¼ :7665: ond, makes the test of the largest contrast equiv-
alent to the ANOVA omnibus test. So if we
These F ratios follow a Fisher distribution denote by Fcritical, omnibus the critical value for the
with ν1 ¼ 1 and ν2 ¼ 45 degrees of freedom. ANOVA omnibus test (performed on A groups),
Fcritical ¼ 4.06 when α ¼ .05. In this case, ψ1 is the largest contrast is equivalent to the omnibus
the only contrast reaching significance (i.e., with test if its Fψ is tested against a critical value
Fψ > Fcritical ). The comparison with the classic equal to
approach shows the drastic differences between
the two approaches. Fcritical; Scheffé ¼ ðA  1Þ × Fcritical, omnibus : ð17Þ

A Posteriori Contrasts Equivalently, Fψ can be divided by (A  1),


For a posteriori contrasts, the family of contrasts and its probability can be evaluated with a Fisher
is composed of all the possible contrasts even if distribution with ν1 ¼ (A  1) and ν2 being
they are not explicitly made. Indeed, because equal to the number of degrees of freedom of the
one chooses one of the contrasts to be made mean square error. Doing so makes it impossible
a posteriori, this implies that one has implicitly to reach a discrepant decision.
made and judged uninteresting all the possible
contrasts that have not been made. Hence, what-
ever the number of contrasts actually performed, 4.1.1 An example: Scheffé
the family is composed of all the possible con-
trasts. This number grows very fast: A con- Suppose that the Fψ ratios for the contrasts
servative estimate indicates that the number computed in Table 7 were obtained a posteriori.
of contrasts that can be made on A groups is The critical value for the ANOVA is obtained from
equal to a Fisher distribution with ν1 ¼ A  1 ¼ 4 and
ν2 ¼ A(S  1) ¼ 45. For α ¼ .05, this value is
1 þ f½ð3A  1Þ=2  2A g: ð16Þ equal to Fcritical, omnibus ¼ 2.58. In order to evalu-
ate whether any of these contrasts reaches signifi-
So using a 
Sidàk or Bonferroni approach will not cance, one needs to compare them to the critical
have enough power to be useful. value of

Scheffé’s Test Fcritical; Scheffé ¼ ðA  1Þ × Fcriticalm; omnibus :


ð18Þ
Scheffé’s test was devised to test all possible ¼ 4 × 2:58 ¼ 10:32:
contrasts a posteriori while maintaining the overall
Type I error level for the family at a reasonable
With this approach, only the first contrast is
level, as well as trying to have a conservative but
considered significant.
relatively powerful test. The general principle is to
ensure that no discrepant statistical decision can Hervé Abdi and Lynne J. Williams
occur. A discrepant decision would occur if the
omnibus test would fail to reject the null See also Analysis of Variance (ANOVA); Type I Error
Control Group 251

Further Readings with the exception of the independent variable, or


treatment of interest.
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J.
(2009). Experimental design and analysis for
psychology. Oxford, UK: Oxford University Press. Placebo Control Groups
Rosenthal, R., & Rosnow, R. L. (2003). Contrasts and
effect sizes in behavioral research: A correlational A placebo is a substance that appears to have an
approach. Boston: Cambridge University Press. effect but is actually inert. When individuals are
Smith, S. M. (1979). Remembering in and out of context. part of a placebo control group, they believe that
Journal of Experimental Psychology: Human Learning they are receiving an effective treatment, when it is
and Memory, 5, 460–471. in fact a placebo. An example of a placebo study
might be found in medical research in which
researchers are interested in the effects of a new
medication for cancer patients. The experimental
CONTROL GROUP group would receive the treatment under investiga-
tion, and the control group might receive an inert
In experimental research, it is important to con- substance. In a double-blind placebo study, neither
firm that results of a study are actually due to an the participants nor the experimenters know
independent or manipulated variable rather than who received the placebo until the observations
to other, extraneous variables. In the simplest case, are complete. Double-blind procedures are used
a research study contrasts two groups, and the to prevent experimenter expectations or experi-
independent variable is present in one group but menter bias from influencing observations and
not the other. For example, in a health research measurements.
study, one group may receive a medical treatment, Another type of control group is called a waiting
and the other does not. The first group, in which list control. This type of control group is often
treatment occurs, is called the experimental group, used to assess the effectiveness of a treatment. In
and the second group, in which treatment is with- this design, all participants may experience an
held, is called the control group. Therefore, when independent variable, but not at the same time.
experimental studies use control and experimental The experimental group receives a treatment and
groups, ideally the groups are equal on all factors is then contrasted, in terms of the effects of the
except the independent variable. The purpose, treatment, with a group awaiting treatment. For
then, of a control group is to provide a compara- example, the effects of a new treatment for depres-
tive standard in order to determine whether an sion may be assessed by comparing a treated
effect has taken place in the experimental group. group, that is, the experimental group, with indivi-
As such, the term control group is sometimes used duals who are on a wait list for treatment. Indivi-
interchangeably with the term baseline group or duals on the wait list control may be treated
contrast group. The following discussion outlines subsequently.
key issues and several varieties of control groups When participants in an experimental group
employed in experimental research. experience various types of events or participate
for varying times in a study, the control group is
called a yoked control group. Each participant of
Random Assignment
the control group is ‘‘yoked’’ to a member of the
In true experiments, subjects are assigned ran- experimental group. As an illustration, suppose
domly to either a control or an experimental a study is interested in assessing the effects of stu-
group. If the groups are alike in all ways except dents’ setting their own learning goals. Participants
for the treatment administered, then the effects of in this study might be yoked on the instructional
that treatment can be tested without ambiguity. time that they receive on a computer. Each partici-
Although no two groups are exactly alike, random pant in the yoked control (no goal setting) would
assignment, especially with large numbers, evens be yoked to a corresponding student in the experi-
differences out. By and large, randomized group mental group on the amount of time he or she
assignment results in groups that are equivalent, spent on learning from the computer. In this way,
252 Control Variables

the amount of instruction experienced is held con- Cook, T. D., & Campbell, D. T. (1979). Quasi-
stant between the groups. experimentation: Design and analysis for field settings.
Matching procedures are similar to yoking, but Boston: Houghton Mifflin.
the matching occurs on characteristics of the par- Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).
Experimental and quasi-experimental designs for
ticipant, not the experience of the participant dur-
generalized causal inference. Boston: Houghton
ing the study. In other words, in matching, Mifflin.
participants in the control group are matched with World Medical Association. (2002). Declaration of
participants in the experimental group so that the Helsinki: Ethical principles for medical research
two groups have similar backgrounds. Participants involving human subjects. Journal of Postgraduate
are often matched on variables such as age, gender, Medicine, 48, 206–208.
and socioeconomic status. Both yoked and
matched control groups are used so that partici-
pants are as similar as possible.
CONTROL VARIABLES

Ethics in Control Groups In experimental and observational design and data


analysis, the term control variable refers to variables
The decision regarding who is assigned to a control that are not of primary interest (i.e., neither the
group has been the topic of ethical concerns. The exposure nor the outcome of interest) and thus con-
use of placebo controlled trials is particularly con- stitute an extraneous or third factor whose influence
troversial because many ethical principles must be is to be controlled or eliminated. The term refers to
considered. The use of a wait-list control group is the investigator’s desire to estimate an effect (such as
generally viewed as a more ethical approach when a measure of association) of interest that is indepen-
examining treatment effects with clinical popula- dent of the influence of the extraneous variable and
tions. If randomized control procedures were used, free from bias arising from differences between
a potentially effective treatment would have to be exposure groups in that third variable.
withheld. The harm incurred from not treating Extraneous variables of this class are usually
patients in a control group is usually determined, those variables described as potential confounders
on balance, to be greater than the added scientific in some disciplines. Controlling for a potential con-
rigor that random assignment might offer. founder, which is not an effect modifier or media-
Developed by the World Medical Association, tor, is intended to isolate the effect of the exposure
the Declaration of Helsinki is an international doc- of interest on the outcome of interest while reduc-
ument that provides the groundwork for general ing or eliminating potential bias presented by differ-
human research ethics. Generally, the guiding prin- ences in the outcomes observed between exposed
ciple when one is using control groups is to apply and unexposed individuals that are attributable to
strict ethical standards so that participants are not the potential confounder. Control is achieved when
at risk of harm. Overall, the ethics involved in the the potential confounder cannot vary between the
use of control groups depends on the context of exposure groups, and thus the observed relation-
the scientific question. ship between the exposure and outcome of interest
is independent of the potential confounder.
Jennie K. Gill and John Walsh As an example, if an investigator is interested in
See also Double-Blind Procedure; Experimental Design;
studying the rate of a chemical reaction (the out-
Internal Validity; Sampling
come) and how it differs with different reagents
(the exposures), the investigator may choose to
keep the temperature of each reaction constant
among the different reagents being studied so that
Further Readings temperature differences could not affect the
Campbell, D. T., & Stanley, J. C. (1966). Experimental outcomes.
and quasi-experimental designs for research. Chicago: Potential control variables that are mediators in
Rand McNally. another association of interest, as well as potential
Control Variables 253

control variables that are involved in a statistical can be undertaken for each level of a potential
reaction with other variables in the study, are spe- confounder. Within each unique value (or homoge-
cial cases which must be considered separately. neous stratum) of the potential confounder, the
This entry discusses the use of control variables relationship of interest may be observed that is not
during the design and analysis stages of a study. influenced by differences between exposed and
unexposed individuals attributable to the potential
Design Stage confounder. This technique is another example of
restriction.
There are several options for the use of control Estimates of the relationship of interest inde-
variables at the design stage. In the example about pendent of the potential confounder can also be
rates of reaction mentioned earlier, the intention achieved by the use of a matched or stratified
was to draw conclusions, at the end of the series approach in the analysis. The estimate of interest
of experiments, regarding the relationship between is calculated at all levels (or several theoretically
the reaction rates and the various reagents. If the homogeneous or equivalent strata) of the potential
investigator did not keep the temperature constant confounder, and a weighted, average effect across
among the series of experiments, difference in the strata is estimated. Techniques of this kind include
rate of reaction found at the conclusion of the the Mantel–Haenszel stratified analysis, as well as
study may have had nothing to do with different stratified (also called matched or conditional)
reagents, but be solely due to differences in tem- regression analyses. These approaches typically
perature or some combination of reagent and tem- assume that the stratum-specific effects are not dif-
perature. Restricting or specifying a narrow range ferent (i.e., no effect modification or statistical
of values for one or more potential confounders is interaction is present). Limitations of this method
frequently done in the design stage of the study, are related to the various ways strata can be
taking into consideration several factors, including formed for the various potential confounders, and
ease of implementation, convenience, simplified one may end up with small sample sizes in many
analysis, and expense. A limitation on restriction strata, and therefore the analysis may not produce
may be an inability to infer the relationship a reliable result.
between the restricted potential confounder and The most common analytic methods for using
the outcome and exposure. In addition, residual control variables is analysis of covariance and mul-
bias may occur, owing to incomplete control tiple generalized linear regression modeling.
(referred to as residual confounding). Regression techniques estimate the relationship of
Matching is a concept related to restriction. interest conditional on a fixed value of the poten-
Matching is the process of making the study group tial confounder, which is analogous to holding the
and control group similar with regard to potential value of the potential confounder constant at the
confounds. Several different methods can be level of third variable. By default, model para-
employed, including frequency matching, category meters (intercept and beta coefficients) are inter-
matching, individual matching, and caliper match- preted as though potential confounders were held
ing. As with restriction, the limitations of match- constant at their zero values. Multivariable regres-
ing include the inability to draw inferences about sion is relatively efficient at handling small num-
the control variable(s). Feasibility can be an issue, bers and easily combines variables measured on
given that a large pool of subjects may be required different scales.
to find matches. In addition, the potential for Where the potential control variable in question
residual confounding exists. is involved as part of a statistical interaction with
Both matching and restriction can be applied in an exposure variable of interest, holding the con-
the same study design for different control variables. trol variable constant at a single level through
restriction (in either the design or analysis) will
allow estimation of the effect of the exposure of
The Analysis Stage
interest and the outcome that is independent of the
There are several options for the use of control third variable, but the effect measured applies only
variables at the analysis stage. Separate analysis to (or is conditional on) the selected level of the
254 ‘‘Convergent and Discriminant Validation by the Multitrait–Multimethod Matrix’’

potential confounder. This would also be the that class, but also to those students enrolled in
stratum-specific or conditional effect. For example, biology and all their characteristics. However, in
restriction of an experiment to one gender would spite of any shortcomings, convenience sampling is
give the investigator a gender-specific estimate of still an effective tool to use in pilot settings, when
effect. instruments may still be under development and
If the third variable in question is part of a true interventions are yet to be fully designed and
interaction, the other forms of control, which per- approved.
mit multiple levels of the third variable to remain
in the study (e.g., through matching, statistical Neil J. Salkind
stratification, or multiple regression analysis),
See also Cluster Sampling; Experience Sampling Method;
should be considered critically before being
Nonprobability Sampling; Probability Sampling;
applied. Each of these approaches ignores the
Proportional Sampling; Quota Sampling; Random
interaction and may serve to mask its presence.
Sampling; Sampling; Sampling Error; Stratified
Jason D. Pole and Susan J. Bondy Sampling; Systematic Sampling

See also Bias; Confounding; Interaction; Matching


Further Readings
Hultsch, D. F., MacDonald, S. W. S., Hunter, M. A.,
Further Readings Maitland, S. B., & Dixon, R. A. (2002). Sampling and
Cook, T. D., & Campbell, D. T. (1979). Quasi- generalisability in developmental research:
experimentation: Design and analysis issues for field Comparison of random and convenience samples of
settings. Boston: Houghton Mifflin. older adults. International Journal of Behavioral
Development, 26(4), 345–359.
Burke, T. W., Jordan, M. L., & Owen, S. S. (2002). A
cross-national comparison of gay and lesbian domestic
violence. Journal of Contemporary Criminal Justice,
CONVENIENCE SAMPLING 18(3), 231–257.

Few terms or concepts in the study of research


design are as self-explanatory as convenience sam-
pling. Convenience sampling (sometimes called ‘‘CONVERGENT AND DISCRIMINANT
accidental sampling) is the selection of a sample of VALIDATION BY THE MULTITRAIT–
participants from a population based on how con-
venient and readily available that group of partici- MULTIMETHOD MATRIX’’
pants is. It is a type of nonprobability sampling
that focuses on a sample that is easy to access and Psychology as an empirical science depends on the
readily available. For example, if one were inter- availability of valid measures of a construct. Valid-
ested in knowing the attitudes of a group of 1st- ity means that a measure (e.g., a test or question-
year college students toward binge drinking, a con- naire) adequately assesses the construct (trait) it
venience sample would be those students enrolled intends to measure. In their 1959 article ‘‘Conver-
in an introductory biology class. gent and Discriminant Validation by the Multitrait–
The advantages of convenience sampling are Multimethod Matrix,’’ Donald T. Campbell and
clear. Such samples are easy to obtain, and the cost Donald W. Fiske proposed a way of test validation
of obtaining them is relatively low. The disadvan- based on the idea that it is not sufficient to consider
tages of convenience sampling should be equally a single operationalization of a construct but that
clear. Results from studies using convenience sam- multiple measures are necessary. In order to validate
pling are not very generalizable to other settings, a measure, scientists first have to define the con-
given the narrow focus of the technique. For struct to be measured in a literary form and then
example, using those biology 101 students would must generate at least two measures that are as dif-
not only limit the sample to 1st-year students in ferent as possible but that are each adequate for
‘‘Convergent and Discriminant Validation by the Multitrait–Multimethod Matrix’’ 255

measuring the construct (multiple operationalism in different components of a trait that are function-
contrast to single operationalism). These two mea- ally different. For example, a low correlation
sures should strongly correlate but differ from mea- between an observational measure of anger and
sures that were created to assess different traits. the self-reported feeling component of anger could
Campbell and Fiske distinguished between four indicate individuals who regulated their visible
aspects of the validation process that can be ana- anger expression. In this case, a low correlation
lyzed by means of the multitrait–multimethod would not indicate that the self-report is an invalid
(MTMM) matrix. First, convergent validity is measure of the feeling component and that the
proven by the correlation of independent measure- observational measure is an invalid indicator of
ment procedures for measuring the same trait. Sec- overt anger expression. Instead, the two measures
ond, new measures of a trait should show low could be valid measures of the two different com-
correlations with measures of other traits from ponents of the anger episode that they are intended
which they should differ (discriminant validity). to measure, and different methods may be neces-
Third, each test is a trait–method unit. Conse- sary to appropriately assess these different compo-
quently, interindividual differences in test scores nents. It is also recommended that one consider
can be due to measurement features, as well as to traits that are as independent as possible. If two
the content of the trait. Fourth, in order to sepa- traits are considered independent, the heterotrait–
rate method- from trait-specific influences, and to monomethod correlations should be 0. Differences
analyze discriminant validity, more than one trait from 0 indicate the degree of a common method
and more than one method have to be considered effect.
in the validation process.
Discriminant Validity
Convergent Validity
Discriminant validity evidence is obtained if the
Convergent validity evidence is obtained if the cor- correlations of variables measuring different traits
relations of independent measures of the same trait are low. If two traits are considered independent,
(monotrait–heteromethod correlations) are signifi- the correlations of the measures of these traits
cantly different from 0 and sufficiently large. Con- should be 0. Discriminant validity requires that
vergent validity differs from reliability in the type both the heterotrait–heteromethod correlations
of methods considered. Whereas reliability is (e.g., correlation of a self-report measure of extro-
proven by correlations of maximally similar meth- version and a peer report measure of neuroticism)
ods of a trait (monotrait–monomethod correla- and the heterotrait–monomethod correlations
tions), the proof of convergent validity is the (e.g., correlations between self-report measures of
stronger, the more independent the methods are. extroversion and neuroticism) be small. These het-
For example, reliability of a self-report extrover- erotrait correlations should also be smaller than
sion questionnaire can be analyzed by the correla- the monotrait–heteromethod correlations (e.g.,
tions of two test halves of this questionnaire (split- self- and peer report correlations of extroversion)
half reliability) whereas the convergent validity of that indicate convergent validity. Moreover, the
the questionnaire can be scrutinized by its correla- patterns of correlations should be similar for the
tion with a peer report of extroversion. According monomethod and the heteromethod correlations
to Campbell and Fiske, independence of methods of different traits.
is a matter of degree, and they consider reliability
and validity as points on a continuum from reli-
Impact
ability to validity. Heterotrait–monomethod corre-
lations that do not significantly differ from 0 or According to Robert J. Sternberg, Campbell and
are relatively low could indicate that one of the Fiske’s article is the most often cited paper that has
two measures or even both measures do not ever been published in Psychological Bulletin and
appropriately measure the trait (low convergent is one of the most influential publications in psy-
validity). However, a low correlation could also chology. In an overview of then-available MTMM
show that the two different measures assess matrices, Campbell and Fiske concluded that
256 Copula Functions

almost none of these matrices fulfilled the criteria statistics, a copula is a function that links an n-
they described. Campbell and Fiske considered the dimensional cumulative distribution function to its
validation process as an iterative process that leads one-dimensional margins and is itself a continuous
to better methods for measuring psychological distribution function characterizing the depen-
constructs, and they hoped that their criteria dence structure of the model.
would contribute to the development of better Recently, in multivariate modeling, much atten-
methods. However, 33 years later, in 1992, they tion has been paid to copulas or copula functions.
concluded that the published MTMM matrices It can be shown that outside the elliptical world,
were still unsatisfactory and that many theoretical correlation cannot be used to characterize the
and methodological questions remained unsolved. dependence between two series. To say it differ-
Nevertheless, their article has had an enormous ently, the knowledge of two marginal distributions
influence on the development of more advanced and the correlation does not determine the bivari-
statistical methods for analyzing the MTMM ate distribution of the underlying series. In this
matrix, as well as for the refinement of the valida- context, the only dependence function able to sum-
tion process in many areas of psychology. marize all the information about the comovements
of the two series is a copula function. Indeed,
Michael Eid a multivariate distribution is fully and uniquely
characterized by its marginal distributions and its
See also Construct Validity; MBESS; Multitrait–
dependence structure as represented by the copula.
Multimethod Matrix; Triangulation; Validity of
Measurement
Definition and Properties

Further Readings In what follows, the definition of a copula is pro-


vided in the bivariate case.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and A copula is a function C:½0; 1 × ½0; 1 → ½0; 1
discriminant validation by the multitrait–multimethod
with the following properties:
matrix. Psychological Bulletin, 56, 81–105.
Eid, M., & Diener, E. (2006a). Handbook of
1. For every u; v in ½0; 1, Cðu; 0Þ ¼ 0 ¼ Cð0; vÞ,
multimethod measurement in psychology. Washington,
and Cðu; 1Þ ¼ u andCð1; vÞ ¼ v:
DC: American Psychological Association.
Fiske, D. W. & Campbell, D. T. (1992). Citations do not 2. For every u1 ; u2 ; v1 ; v2 in ½0; 1, such that u1 ≤ u2
solve problems. Psychological Bulletin, 112, 393–395. and v1 ≤ v2 ,Cðu2 ; v2 Þ  Cðu2 ; v1 Þ  Cðu1 ; v2 Þ
Shrout, P. E., & Fiske, S. T. (Eds.). (1995). Personality þ Cðu1 ; v1 Þ ≥ 0:
research, methods, and theory: A festschrift honoring
Donald W. Fiske. Hillsdale, NJ: Lawrence Erlbaum. An example is the product copula C? ðu; vÞ ¼ uv,
Sternberg, R. J. (1992). Psychological Bulletin’s top 10 which is a very important copula because it charac-
‘‘hit parade.’’ Psychological Bulletin, 112, 387–388.
terizes independent random variables when the dis-
tribution functions are continuous.
One important property of copulas is the
Fréchet–Hoeffding bounds inequality, given by
COPULA FUNCTIONS
W ðu; vÞ ≤ Cðu; vÞ ≤ Mðu; vÞ;
The word copula is a Latin noun that means a link
and is used in grammar to describe the part of where W and M are also copulas referred to as
a proposition that connects the subject and predi- Fréchet–Hoeffding lower and upper bounds,
cate. Abe Sklar in 1959 was the first to introduce respectively, and defined by Wðu; vÞ ¼ max
the word copula in a mathematical or statistical ðu þ v  1; 0Þ and Mðu; vÞ ¼ minðu; vÞ.
sense in a theorem describing the functions that Much of the usefulness of copulas in nonpara-
join together one-dimensional distribution func- metric statistics is due to the fact that for strictly
tions to form multivariate distribution functions. monotone transformations of the random vari-
He called this class of functions copulas. In ables under interest, copulas are either invariant or
Copula Functions 257

change in predictable ways. Specifically, let X and Conversely, if Cð u1 ; u2 ; . . . ; un Þ is a copula


Y be continuous random variables with copula and Fi ðxi Þ, i ¼ 1; 2; . . . ; n, are the distribution
CX;Y , and let f and g be strictly monotone func- functions of n random variables Xi , then Fð x1 ;
tions on RanX and RanY (Ran: Range), x2 ; . . . ; xn Þ defined above is an n-variate distribu-
respectively. tion function with margins Fi ; i ¼ 1, 2, . . . , n.
The expression of the copula function in
1. If f and g are strictly increasing, then Equation 2 is given in terms of cumulative distri-
Cf ðXÞ;gðYÞ ðu; vÞ ¼ CX;Y ðu; vÞ. bution functions. If we further assume that each
2. If f is strictly increasing and g is strictly Fi and C is differentiable, the joint density
decreasing, then f ð x1 , x2 , . . . , xn Þ corresponding to the cumula-
Cf ðXÞ;gðYÞ ðu; vÞ ¼ u  CX;Y ðu; 1  vÞ. tive distribution Fð x1 , x2 , . . . , xn Þ can easily be
3. If f is strictly decreasing and g is strictly
obtained using
increasing, then
Cf ðXÞ;gðYÞ ðu; vÞ ¼ v  CX;Y ð1  u; vÞ. ∂ n Fð x1 , x2 , . . . , xn Þ
f ð x1 , x2 , . . . , xn Þ ¼ :
∂x1 ∂x2 . . . ∂xn
4. If f and g are strictly decreasing, then
Cf ðXÞ;gðYÞ ðu; vÞ ¼ u þ v  1 þ CX;Y ð1  u; 1  vÞ.
This gives Equation 3:

Sklar’s Theorem for Continuous Distributions f ðx1 ; x2 ; . . . ; xn Þ ¼ f1 ðx1 Þ × f2 ðx2 Þ × . . . × fn ðxn Þ


Sklar’s theorem defines the role that copulas play × cðu1 ; u2 ; . . . ; un Þ
in the relationship between multivariate distribu- ð3Þ
tion functions and their univariate margins.
where the fi s are the marginal densities corre-
Sklar’s Theorem in the Bivariate Case sponding to the Fi s, the ui s are defined as
ui ¼ Fi ð xi Þ, and cð u1 ; u2 ; :::; un Þ is the density
Let H be a joint distribution function with con-
of the copula C obtained as follows:
tinuous margins F and G. Then there exists
a unique copula C such that for all x; y ∈ R;
∂n Cð u1 ; u2 ; . . . ; un Þ
cð u1 ; u2 ; . . . ; un Þ ¼ : ð4Þ
H ðx; yÞ ¼ CðFðxÞ; GðyÞÞ: ð1Þ ∂u1 ∂u2 . . . ∂un

Conversely, if C is a copula and F and G are In contrast to the traditional modeling approach
distribution functions, then the function H defined that decomposes the joint density as a product of
by Equation 1 is a joint distribution function with marginal and conditional densities, Equation 3
margins F and G. states that, under appropriate conditions, the joint
A multivariate version of this theorem exists density can be written as a product of the marginal
and is presented hereafter. densities and the copula density. From Equation 3,
it is clear that the density cð u1 ; u2 ; . . . ; un Þ
Sklar’s Theorem in n Dimensions encodes information about the dependence struc-
ture among the Xi s, and the fi s describe the mar-
For any multivariate distribution function ginal behaviors. It thus shows that copulas
Fðx1 , x2 , . . . , xn Þ ¼ Pð X1 ≤ x1 , X2 ≤ ; x2 ;. . . ; Xn ≤ represent a way to extract the dependence struc-
xn Þ with continuous marginal functions ture from the joint distribution and to extricate the
Fi ð xi Þ ¼ Pð Xi ≤ xi Þ for 1 ≤ i ≤ n, there exists dependence and marginal behaviors. Hence, cop-
a unique function Cð u1 ; u2 ; . . . ; un Þ, called the ula functions offer more flexibility in modeling
copula and defined on ½ 0; 1  n → ½ 0; 1 , such that multivariate random variables. This flexibility con-
for all ðx1 ; x2 ; :::; xn Þ ∈ R n , trasts with the traditional use of the multivariate
normal distribution, in which the margins are
Fð x1 ; x2 ; . . . ; xn Þ ¼
ð2Þ assumed to be Gaussian and linked through a lin-
Cð F1 ð x1 Þ; F2 ð x2 Þ; . . . ; Fn ð xn Þ Þ: ear correlation matrix.
258 Copula Functions

Survival Copulas transform those uniform variates via the inverse


distribution function method. One procedure for
A survival function F is defined as FðxÞ ¼ generating such a pair ðu; vÞ of uniform variates is
P½X > x ¼ 1  FðxÞ, where F denotes the distri- the conditional distribution method. For this
bution function of X. For a pair ðX; Y Þ of random method, we need the conditional distribution func-
variables with joint distribution function H, the tion for V given U ¼ u, denoted Cu ðvÞ and
joint survival function is given by H ðx; yÞ ¼ defined as follows:
P½X > x ; Y > y with margins F and G, which
are the univariate survival functions. Let C be the Cu ðvÞ ¼ P½V ≤ vjU ¼ u
copula of X and Y; then
Cðu þ u ; vÞ  Cðu; vÞ ∂
¼ lim ¼ Cðu; vÞ:
u → 0 u ∂u
H ðx; yÞ ¼ 1  FðxÞ  GðyÞ þ Hðx; yÞ
¼ FðxÞ þ GðyÞ  1 þ CðFðxÞ; GðyÞÞ : The procedure described above becomes

¼ FðxÞ þ GðyÞ  1 þ C 1  FðxÞ; 1  GðyÞ


• Generate two independent uniform variates u
and t
Let C^ be defined from ½0; 1 × ½0; 1 into ½0; 1 • Set v ¼ Cu 1 ðtÞ, where Cu 1 is any quasi-inverse
by C ^ðu; vÞ ¼ u þ v  1 þ Cð1  u; 1  vÞ, then of Cu

• The desired pair is ðu; vÞ.
Hðx; yÞ ¼ C ^ FðxÞ; GðyÞ .
Note that C ^ links the joint survival function to
its univariate margins in the same way a copula Examples of Copulas
joins a bivariate cumulative function to its mar-
gins. Hence, C^ is a copula and is referred to as the Several copula families are available that can
survival copula of X and Y. Note also that the sur- incorporate the relationships between random
vival copula C ^ is different from the joint survival variables. In what follows, the Gaussian copula
function C for two uniform random variables and the Archimedean family of copulas are briefly
whose joint distribution function is the copula C. discussed in a bivariate framework.

Simulation The Gaussian Copula


Copulas can be used to generate a sample from The Gaussian or normal copula can be obtained
a specified joint distribution. Such samples can by inversion from the well-known Gaussian bivari-
then be used to study mathematical models of ate distribution. It is defined as follows:
real-world systems or for statistical studies.
Various well-known procedures are used to gen- CGa ðu1 ; u2 ; ρÞ ¼ Fðφ1 ðu1 Þ; f1 ðu2 Þ; ρÞ;
erate independent uniform variates and to obtain
samples from a given univariate distribution. More where φ is the standard univariate Gaussian cumu-
specifically, to obtain an observation x of a random lative distribution and Fðx; y; ρÞ is the bivariate
variable X with distribution function F, the fol- normal cumulative function with correlation ρ.
lowing method, called the inverse distribution Hence, the Gaussian or normal copula can be
function method, can be used: rewritten as follows:

• Generate a uniform variate u Z f1 ðu1 Þ Z f1 ðu2 Þ


Ga 1
• Set x ¼ F1 ðuÞ, where F1 is any quasi-inverse C ðu1 ; u2 ; ρÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
of F. ∞ ∞ 2πð1  ρ2 Þ
2 2

ðs  2ρst þ t
exp dsdt:
By virtue of Sklar’s theorem, we need only to 2ð1  ρ2 Þ
generate a pair ðu; vÞ of observations of uniform
random variables ðU; VÞ whose distribution func- Variables with standard normal marginal distri-
tion is C, the copula of X and Y, and then butions and this dependence structure are standard
Copula Functions 259

bivariate normal variables with correlation not jointly elliptically distributed, and using linear
coefficient ρ. correlation as a measure of dependence in such
With ρ ¼ 0 we obtain a very important special situations might prove misleading. Two important
case of the Gaussian copula, which takes the form measures of dependence, known as Kendall’s tau
C? ðu; vÞ ¼ uv and is called product copula. The and Spearman’s rho, provide perhaps the best
importance of this copula is related to the fact that alternatives to the linear correlation coefficient as
two variables are independent if and only if their measures of dependence for nonelliptical distribu-
copula is the product copula. tions and can be expressed in terms of the underly-
ing copula. Before presenting these measures and
how they are related to copulas, we need to define
The Archimedean Copulas the concordance concept.
The Archimedean copulas are characterized by
their generator ’ through the following equation: Concordance
Cðu1 ; u2 Þ ¼ ’1 ð’ðu1 Þ þ ’ðu2 ÞÞ Let ðx1 ; y1 Þ and ðx2 ; y2 Þ be two observations
from a vector ðX; Y Þ of continuous random vari-
for u1 ; u2 ∈ ½0; 1: ables. Then ðx1 ; y1 Þ and ðx2 ; y2 Þ are said to be con-
The following table presents three parametric cordant if ðx1  x2 Þðy1  y2 Þ > 0 and discordant if
Archimedean families that have gained interest in ðx1  x2 Þðy1  y2 Þ < 0.
biostatistics, actuarial science, and management
science, namely, the Clayton copula, the Gumbel
Kendall’s tau
copula, and the Frank copula.
Generator
Kendall’s tau for a pair ðX; Y Þ, distributed
Family Cðu1 ; u2 Þ ’(t) Comment according to H, can be defined as the difference
h i
Clayton max ðuθ θ
1 þ u2  1Þ
1=θ
;0 1 θ
θ ðt  1Þ y>0 between the probabilities of concordance and dis-
  

θ θ
 1=θ cordance for two independent pairs ðX1 ; Y1 Þ and
Gumbel exp  lnðu1 Þ þðlnðu2 ÞÞ ðlnðtÞÞθ y≥1
h θu1 θu2 1Þ
i h θt i ðX2 ; Y2 Þ, each with distribution H. This gives
Frank  1θ ln 1 þ ðe ðe1Þðe
θ 1Þ ln ðeðeθ 1Þ
1Þ
y>0
τ X;Y ¼ PrfðX1  X2 ÞðY1  Y2 Þ > 0g
Each of the copulas presented in the preceding PrfðX1  X2 ÞðY1  Y2 Þ < 0g:
table is completely monotonic, and this allows for
multivariate extension. If we assume that all pairs The probabilities of concordance and discor-
of random variables have the same φ and the same dance can be evaluated by integrating over the dis-
θ , the three copula functions can be extended by tribution of ðX2 ; Y2 Þ.
using the following relation: In terms of copulas, Kendall’s tau becomes
ZZ
Cðu# 1; u# 2; . . . ; u# nÞ ¼ ’" ð1Þð’ðu# 1Þ τ XY ¼ 4 Cðu; vÞ dCðu; vÞ  1;
þ ’ðu# 2Þ þ    þ ’ðu# nÞÞ ½0;12

for u1 ; u2 ∈ ½0; 1: where C is the copula associated to ðX; Y Þ.

Copulas and Dependence Spearman’s rho


Copulas are widely used in the study of depen- Let ðX1 ; Y1 Þ, ðX2 ; Y2 Þ, and ðX3 ; Y3 Þ be three
dence or association between random variables. independent random vectors with a common joint
Linear correlation (or Pearson’s correlation) is the distribution function H whose margins are F and
most frequently used measure of dependence in G. Consider the vectors ðX1 ; Y1 Þ and ðX2 ; Y3 Þ—
practice. Indeed, correlation is a natural measure that is, a pair of vectors with the same margins,
of dependence in elliptical distributions (such as but one vector has distribution function H, while
the multivariate normal and the multivariate t dis- the components of the other are independent. The
tribution). However, most random variables are Spearman’s rho coefficient associated to a pair
260 Correction for Attenuation

ðX; Y Þ, distributed according to H, is defined to be Nelsen, R. B. (1999). An introduction to copulas. New


proportional to the probability of concordance York: Springer-Verlag.
minus the probability of discordance for the two Sklar, A. (1959). Fonctions de répartition à n dimensions
vectors ðX1 ; Y1 Þ and ðX2 ; Y3 Þ: et leurs marges [N-dimensions’ distribution functions
and their margins]. Publications de l’Institut
Statistique de Paris, 8, 229–231.
ρXY ¼ 3ðPrfðX1  X2 ÞðY1  Y3 Þ > 0g  Pr
fðX1  X2 ÞðY1  Y3 Þ < 0gÞ:
CORRECTION FOR ATTENUATION
In terms of the copula C associated to the pair
ðX; Y Þ, Spearman’s rho becomes
Correction for attenuation (CA) is a method that
ZZ allows researchers to estimate the relationship
ρXY ¼ 12 uv dCðu,vÞ  3 between two constructs as if they were measured
½0,12 perfectly reliably and free from random errors that
ZZ
¼ 12 Cðu,vÞ dudv  3: occur in all observed measures. All research seeks
½0,12 to estimate the true relationship among constructs;
because all measures of a construct contain ran-
Note that if we let U ¼ FðXÞ and V ¼ GðYÞ, dom measurement error, the CA is especially
then important in order to estimate the relationships
among constructs free from the effects of this
ZZ
error. It is called the CA because random measure-
ρXY ¼ 12 uv dCðu,vÞ  3 ¼ 12E½UV   3 ment error attenuates, or makes smaller, the
½0,12
observed relationships between constructs. For
E½UV   1=4 CovðU,VÞ :
¼ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi correlations, the correction is as follows:
1=12 VarðUÞ VarðVÞ
rxy
¼ CorrðFðXÞ,FðYÞÞ ρxy ¼ pffiffiffiffiffiffipffiffiffiffiffiffi , ð1Þ
rxx ryy
Kendall’s tau and Spearman’s rho are measures
of dependence between two random variables. where δxy is the corrected correlation between vari-
However, to extend to higher dimensions, we sim- ables x and y, rxy is the observed correlation
ply write pairwise measures in an n × n matrix in between variables x and y, rxx is the reliability esti-
the same way as is done for linear correlation. mate for the x variable, and ryy is the reliability
estimate for the y variable. For standardized mean
Chiraz Labidi differences, the CA is as follows:

See also Coefficients of Correlation, Alienation, and dxy


Determination; Correlation; Distribution δxy ¼ pffiffiffiffiffiffi , ð2Þ
ryy
Further Readings
where δxy is the corrected standardized mean dif-
Dall’Aglio, G., Kotz, S. & Salinetti, G. (1991). Advances ference, dxy is the observed standardized mean dif-
in probability distributions with given margins. ference, and ryy is the reliability for the continuous
Dordrecht, the Netherlands: Kluwer. variable. In both equations, the observed effect size
Embrechts, P., Lindskog F., & McNeil, A. (2001). Modelling (correlation or standardized mean difference) is
dependence with copulas and application to risk placed in the numerator of the right half of the
management. Retrieved January 25, 2010, from http://
equation, and the square root of the reliability esti-
www.risklab.ch/ftp/papers/DependenceWithCopulas.pdf
Frees, E. W., & Waldez, E. A. (1998). Understanding
mate(s) is (are) placed in the denominator of the
relationships using copulas. North American Actuarial right half of the equation. The outcome from the
Journal, 2(1). left half of the equation is the estimate of the rela-
Joe, H. (1997). Multivariate models and dependence tionship between perfectly reliable constructs. For
concepts. London: Chapman & Hall. example, using Equation 1, suppose the observed
Correction for Attenuation 261

correlation between two variables is .25, the reli- Typical Uses


ability for variable X is .70, and the reliability for
variable Y is .80. The estimated true correlation The CA has many applications in both basic and
between the two constructs is .25/(.70*.80) ¼ .33. applied research. Because nearly every governing
This entry describes the properties and typical uses body in the social sciences advocates (if not
of CA, shows how the CA equation is derived, and requires) reporting effect sizes to indicate the
discusses advanced applications for CA. strength of relationship, and because all effect sizes
are attenuated because of measurement error,
many experts recommend that researchers employ
Properties
the CA when estimating the relationship between
Careful examination of both Equations 1 and 2 any pair of variables.
reveals some properties of the CA. One of these is Basic research, at its essence, has a goal of
that the lower the reliability estimate, the higher the estimating the true relationship between two
corrected effect size. Suppose that in two research constructs. The CA serves this end by estimating
contexts, rxy ¼ .20 both times. In the first context, the relationship between measures that are free
rxx ¼ ryy ¼ .50, and in the second context, from measurement error, and all constructs are
rxx ¼ ryy ¼ .75. The first context, with the lower free from measurement error. In correlational
reliability estimates, yields a higher corrected corre- research, the CA should be applied whenever
lation (ρxy ¼ .40) than the second research context two continuous variables are being related and
(ρxy ¼ .27) with the higher reliability estimates. An when reliability estimates are available. How-
extension of this property shows that there are ever, when the reported effect size is either the
diminishing returns for increases in reliability; d value or the point-biserial correlation, the cate-
increasing the reliability of the two measures raises gorical variable (e.g., gender or racial group
the corrected effect size by smaller and smaller membership) is typically assumed to be mea-
amounts, as highly reliable measures begin to sured without error; as such, no correction is
approximate the construct level or ‘‘true’’ relation- made for this variable, and the CA is applied
ship, where constructs have perfect reliability. Sup- only to the continuous variable. In experimental
pose now that ρxy ¼ .30. If rxx ¼ ryy, the corrected research, the experimental manipulation is also
correlations when the reliabilities equal .50, .60, categorical in nature and is typically assumed to
.70, .80, and .90, then the corrected correlations are be recorded without random measurement error.
.60, .50, .43, .38, and .33, respectively. Notice that Again, the CA is only applied to the continuous
while the reliabilities increase by uniform amounts, dependent variable.
the corrected correlation is altered less and less. Although applied research also has a goal of
Another property of Equation 1 is that it is not seeking an estimate of the relationship between
necessary to have reliability estimates of both vari- two variables at its core, often there are additional
ables X and Y in order to employ the CA. When considerations. For example, in applied academic
correcting for only one variable, such as in applied or employment selection research, the goal is not
research (or when reliability estimates for one vari- to estimate the relationship between the predictor
able are not available), a value of 1.0 is substituted and criterion constructs but to estimate the rela-
for the reliability estimate of the uncorrected tionship between the predictor measure and the
variable. criterion construct. As such, the CA is not applied
Another important consideration is the accu- to the predictor. To do so would estimate the rela-
racy of the reliability estimate. If the reliability tionship between a predictor measure and a per-
estimate is too high, the CA will underestimate the fectly reliable criterion. Similarly, instances in
‘‘true’’ relationship between the variables. Simi- which the relationship of interest is between a pre-
larly, if the reliability estimate is too low, the CA dictor measure and a criterion measure (e.g.,
will overestimate this relationship. Also, because between Medical College Admission Test scores
correlations are bounded (  1.0 ≤ rxy ≤ 1.0), it and medical board examinations), the CA would
is possible to overcorrect outside these bounds if not be appropriate, as the relationship of interest
one (or both) of the reliability estimates is too low. is between two observed measures.
262 Correction for Attenuation

Derivation where terms are defined as before, and subscripts


denote which variable the terms came from. To
In order to understand how and why the CA calculate the correlation between two composite
works, it is important to understand how the equa- variables, it is easiest to use covariances. Assump-
tion is derived. The derivation rests on two funda- tion 2 defines how error terms in Equations 4 and
mental assumptions from classical test theory. 5 correlate with the other components
Assumption 1: An individual’s observed score on CovðTa , ea Þ ¼ CovðTb , eb Þ ¼ Covðeb , eb Þ ¼ 0, ð6Þ
a particular measure of a trait is made up of two
components: the individual’s true standing on where the Cov() terms are the covariances among
that trait and random measurement error. components in Equations 4 and 5. Suppose that
one additional constraint is imposed: The observed
This first assumption is particularly important score variance is 1.00. Applying Equations 3 and
to the CA. It establishes the equation that has 6, this also means that the sum of true score vari-
become nearly synonymous with classical test ance and error score variance also equals 1.00.
theory: The covariance matrix between the components of
variables A and B (from Equations 4, 5, and 6) is
X ¼ T þ e, ð3Þ
shown below:
where X is the observed score on a particular mea- 2 3
VarðTa Þ 0 CovðTa ; Tb Þ 0
sure, T is the true standing of the individual on the 6 0 Varðea Þ 0 0 7
construct underlying the measure, and e is random 6 7,
4 CovðTa ; Tb Þ 0 VarðTb Þ 0 5
measurement error. Implicit in this equation is the 0 0 0 Varðeb Þ
fact that the construct in question is stable enough
that there is a ‘‘true’’ score associated with it; if the ð7Þ
construct being measured changes constantly, it is
difficult to imagine a true score associated with where Cov(Ta,Tb) is the covariance between true
that construct. As such, the CA is not appropriate scores on variables A and B, and the Var() terms
when any of the constructs being measured are too are the variances for the respective components
unstable to be measured with a ‘‘true’’ score. Also defined earlier. The zero covariances between error
revealed in this equation is that any departure scores and other variables are defined from Equa-
from the true score in the observed score is entirely tion 6. It is known from statistical and psychomet-
due to random measurement error. ric theory that the correlation between two
variables is equal to the covariance between the
Assumption 2: Error scores are completely ran- two variables divided by the square root of the
dom and do not correlate with true scores or variances of the two variables:
other error scores.
CovðA, BÞ
rab ¼ pffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffi , ð8Þ
This assumption provides the basis for how Vara Varb
scores combine in composites. Because observed
test scores are composites of true and error scores where Vara and Varb are variances for their respec-
(see Assumption 1), Assumption 2 provides guid- tive variables. Borrowing notation defined in
ance on how these components combine when one Equation 7 and substituting into equation 8, the
is calculating the correlation between two compos- correlation between observed scores on variables
ite variables. Suppose there are observed scores for A and B is
two variables, A and B. Applying Equation 3,
rab ¼
Xa ¼ Ta þ ea ð4Þ CovðTa , Tb Þ
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi :
½VarðTa Þ þ Varðea Þ ½VarðTb Þ þ Varðeb Þ
Xb ¼ T b þ e b ð5Þ ð9Þ
Correction for Attenuation 263

Since the observed scores have a variance of be rewritten to say that rxx ¼ Var(T). Making this
1.00 (because of the condition imposed earlier) final substitution into Equation 12,
and because of Equation 6, the correlation
between true scores is equal to the covariance rab
ρab ¼ pffiffiffiffiffiffipffiffiffiffiffiffi ð14Þ
between true scores [i.e., Cov(Ta,Tb) ¼ rab]. Mak- raa rbb ,
ing this substitution into Equation 9:
rab which is exactly equal to Equation 1 (save for the
rab ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , notation on variable names), or the CA. Though
½VarðTa Þ þ Varðeb Þ ½VarðTb Þ þ Varðeb Þ
not provided here, a similar derivation can be used
ð10Þ to obtain Equation 2, or the CA for d values.
where all terms are as defined earlier. Note that
the same term, rab, appears on both sides of
Advanced Applications
Equation 10. By definition, rab ¼ rab; mathemati-
cally, Equation 10 can be true only if the denomi- There are some additional applications of the CA
nator is equal to 1.0. Because it was defined under relaxed assumptions. One of these primary
earlier that the variance of observed scores equals applications is when the correlation between error
1.0, Equation 9 can hold true. This requirement terms is not assumed to be zero. The CA under
is relaxed for Equation 11 with an additional this condition is as follows:
substitution.
Suppose now that a measure was free from pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi
rxy  rex ey 1  rxx 1  ryy
measurement error, as it would be at the construct ρxy ¼ pffiffiffiffiffiffiffiffipffiffiffiffiffiffiffi , ð15Þ
rxx ryy
level. At the construct level, true relationships
among variables are being estimated. As such,
Greek letters are used to denote these relation- where rex ey is the correlation between error scores
ships. If the true relationship between variables A for variables X and Y, and the other terms are as
and B is to be estimated, Equation 10 becomes defined earlier.
It is also possible to correct part (or semipartial
rab correlations) and partial correlations for the influ-
ρab ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ,
½VarðTa Þ þ Varðea Þ ½VarðTb Þ þ Varðeb Þ ences of measurement error. Using the standard for-
ð11Þ mula for the partial correlation (in true score metric)
between variables X and Y, controlling for Z, we get
where ρab is the true correlation between variables
A and B, as defined in Equation 1. Again, because ρxy  ρxz ρyz
ρxy · z ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiqffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð16Þ
variables are free from measurement error at the
1  ðρxz Þ2 1  ðρyz Þ2 Þ
construct level (i.e., Var(ea) ¼ Var(eb) ¼ 0), Equa-
tion 11 becomes
rxy Substituting terms from Equation 1 into Equation
ρxy ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : ð12Þ 15 yields a formula to estimate the true partial cor-
VarðTx Þ VarðTy Þ relation between variables X and Y, controlling for
Z:
Based on classical test theory, the reliability of
a variable is defined to be the ratio of true score   
rxy rxz ryz
variance to observed score variance. In other  pffiffiffiffiffiffiffiffiffi
pffiffiffiffiffiffiffiffiffi
rxx ryy rxx rzz
pffiffiffiffiffiffiffiffiffi
ryy rzz
words, ρxy · z ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 2ffirffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 2 ð17Þ
rxz ryz
1  pffiffiffiffiffiffiffiffiffirxx rzz
1  pffiffiffiffiffiffiffiffiffi
ryy rzz
VarðTÞ
rxx ¼ : ð13Þ
VarðXÞ
Finally, it is also possible to compute the partial
Because it was defined earlier that the observed correlation between variables X and Y, controlling
score variance was equal to 1.0, Equation 13 can for Z while allowing the error terms to correlate:
264 Correlation

pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi

rzz rxy  rex ey 1  rxx 1  ryy  to a cause–effect relationship between the variables
pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi
in question. To infer cause and effect, it is necessary
r  rex ez 1  rxx 1  rzz
xz pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi
to conduct a controlled experiment involving an
ryz  rey ez 1  ryy 1  rzz
ρxyz ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi experimenter-manipulated independent variable in
pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi
2
rxx rzz  rxz  rex ez 1  rxx 1  rzz which subjects are randomly assigned to experi-
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
pffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffi
2 mental conditions. Typically, data for which a corre-
ryy rzz  ryz  rey ez 1  ryy 1  rzz lation coefficient is computed are also evaluated
ð18Þ with regression analysis. The latter is a methodology
for deriving an equation that can be employed to
Corrections for attenuation for part (or semi- estimate or predict a subject’s score on one variable
partial) correlations are also available under simi- from the subject’s score on another variable. This
lar conditions as Equations 16 and 17. entry discusses the history of correlation and mea-
sures for assessing correlation.
Matthew J. Borneman

See also Cohen’s d Statistic; Correlation; Random Error; History


Reliability
Although in actuality a number of other individuals
had previously described the concept of correlation,
Further Readings Francis Galton, an English anthropologist, is gener-
ally credited with introducing the concept of correla-
Fan, X. (2003). Two approaches for correcting
correlation attenuation caused by measurement error: tion in a lecture on heredity he delivered in Great
Implications for research practice. Educational & Britain in 1877. In 1896 Karl Pearson, an English
Psychological Measurement, 63, 915–930. statistician, further systematized Galton’s ideas and
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). introduced the now commonly employed product-
Measurement theory for the behavioral sciences. San moment correlation coefficient, which is represented
Francisco: W. H. Freeman. by the symbol r, for interval-level or continuous
Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and data. In 1904 Charles Spearman, an English psychol-
measurement error. Intelligence, 27, 183–198. ogist, published a method for computing a correla-
Spearman, C. (1904). The proof and measurement of
tion for ranked data. During the 1900s numerous
association between two things. American Journal of
other individuals contributed to the theory and
Psychology, 15, 72–101.
Wetcher-Hendricks, D. (2006). Adjustments to the methodologies involved in correlation. Among the
correction for attenuation. Psychological Methods, 11, more notable contributors were William Gossett and
207–215. Ronald Fisher, both of whom described the distribu-
Zimmerman, D. W., & Williams, R. H. (1977). The tion of the r statistic; Udney Yule, who developed
theory of test validity and correlated errors of a correlational measure for categorical data, as well
measurement. Journal of Mathematical Psychology, as working with Pearson to develop multiple correla-
16, 135–152. tion; Maurice Kendall, who developed alternative
measures of correlation for ranked data; Harold
Hotelling, who developed canonical correlation; and
Sewall Wright, who developed path analysis.
CORRELATION
The Pearson Product-Moment Correlation
Correlation is a synonym for association. Within
the framework of statistics, the term correlation The Pearson product-moment correlation is the
refers to a group of indices that are employed to most commonly encountered bivariate measure of
describe the magnitude and nature of a relationship correlation. A bivariate correlation assesses the
between two or more variables. As a measure of degree of relationship between two variables. The
correlation, which is commonly referred to as a cor- product-moment correlation describes the degree
relation coefficient, is descriptive in nature, it can- to which a linear relationship (the linearity is
not be employed to draw conclusions with regard assumed) exists between one variable designated
Correlation 265

as the predictor variable (represented symbolically in two-dimensional space. By examining the config-
by the letter X) and a second variable designated uration of the scatterplot, a researcher can ascertain
as the criterion variable (represented symbolically whether linear correlational analysis is best suited
by the letter Y). The product-moment correlation for evaluating the data.
is a measure of the degree to which the variables Regression analysis is employed with the data
covary (i.e., vary in relation to one another). From to derive the equation of a regression line (also
a theoretical perspective, the product-moment cor- known as the line of best fit), which is the straight
relation is the average of the products of the paired line that best describes the relationship between
standard deviation scores of subjects on the two the two variables. To be more specific, a regression
variables. The equation for computing the unbi- line is the straight line for which the sum of the
ased P estimate of the population correlation is squared vertical distances of all the points from
r ¼ ( zx zy)/(n  1). the line is minimal. When r ¼ ± 1, all the points
The value r computed for a sample correlation will fall on the regression line, and as the value of
coefficient is employed as an estimate of ρ (the r moves toward zero, the vertical distances of the
lowercase Greek letter rho), which represents the points from the line increase.
correlation between the two variables in the under- The general equation for a regression line is
0
lying population. The value of r will always Y ¼ a þ bX, where a ¼ Y intercept, b ¼ the
fall within the range of  1 to þ 1 (i.e., slope of the line (with a positive correlation yield-
 1 ≤ r ≤ þ 1). The absolute value of r (i.e., jrj) ing a positively sloped line, and a negative correla-
indicates the strength of the linear relationship tion yielding a negatively sloped line), X represents
between the two variables, with the strength of the a given subject’s score on the predictor variable,
relationship increasing as the absolute value of r and Y 0 is the score on the criterion variable pre-
approaches 1. When r ¼ ± 1, within the sample dicted for the subject.
for which the correlation was computed, a subject’s An important part of regression analysis
score on the criterion variable can be predicted involves the analysis of residuals. A residual is the
perfectly from his or her score on the predictor difference between the Y 0 value predicted for a sub-
variable. As the absolute value of r deviates from 1 ject and the subject’s actual score on the criterion
and moves toward 0, the strength of the relation- variable. Use of the regression equation for predic-
ship between the variables decreases, such that tive purposes assumes that subjects for whom
when r ¼ 0, prediction of a subject’s score on the scores are being predicted are derived from the
criterion variable from his or her score on the pre- same population as the sample for which the
dictor variable will not be any more accurate than regression equation was computed. Although
a prediction that is based purely on chance. numerous hypotheses can be evaluated within the
The sign of r indicates whether the linear rela- framework of the product-moment correlation and
tionship between the two variables is direct (i.e., regression analysis, the most common null hypoth-
an increase in one variable is associated with an esis evaluated is that the underlying population
increase in the other variable) or indirect (i.e., an correlation between the variables equals zero. It is
increase in one variable is associated with a decrease important to note that in the case of a large sample
on the other variable). The closer a positive value size, computation of a correlation close to zero
of r is to þ 1, the stronger (i.e., more consistent) may result in rejection of the latter null hypothesis.
the direct relationship between the variables, and In such a case, it is critical that a researcher distin-
the closer a negative value of r is to 1, the stron- guish between statistical significance and practical
ger the indirect relationship between the two vari- significance, in that it is possible that a statistically
ables. If the relationship between the variables is significant result derived for a small correlation
best described by a curvilinear function, it is quite will be of no practical value; in other words, it will
possible that the value computed for r will be close have minimal predictive utility.
to zero. Because of the latter possibility, it is always A value computed for a product-moment corre-
recommended that a researcher construct a scatter- lation will be reliable only if certain assumptions
plot of the data. A scatterplot is a graph that sum- regarding the underlying population distribution
marizes the two scores of each subject with a point have not been violated. Among the assumptions
266 Correlation

for the product-moment correlation are the follow- involves assessing the relationship between a set of
ing: (a) the distribution of the two variables is predictor variables (i.e., two or more) and a set of
bivariate normal (i.e., each of the variables, as well criterion variables.
as the linear combination of the variables, is dis- A number of measures of association have
tributed normally), (b) there is homoscedasticity been developed for evaluating data in which the
(i.e., the strength of the relationship between the scores of subjects have been rank ordered or the
two variables is equal across the whole range of relationship between two or more variables is
both variables), and (c) the residuals are summarized in the format of a contingency table.
independent. Such measures may be employed when the data
are presented in the latter formats or have been
transformed from an interval or ratio format to
Alternative Correlation Coefficients
one of the latter formats because one or more of
A common criterion for determining which corre- the assumptions underlying the product-moment
lation should be employed for measuring the correlation are believed to have been saliently
degree of association between two or more vari- violated. Although, like the product-moment
ables is the levels of measurement represented by correlation, the range of values for some of the
the predictor and criterion variables. The product- measures that will be noted is between 1 and
moment correlation is appropriate to employ when þ1, others may assume only a value between
both variables represent either interval- or ratio- 0 and þ1 or may be even more limited in range.
level data. A special case of the product-moment Some alternative measures do not describe a lin-
correlation is the point-biserial correlation, which ear relationship, and in some instances a statistic
is employed when one of the variables represents other than a correlation coefficient may be
interval or ratio data and the other variable is employed to express the degree of association
represented on a dichotomous nominal scale (e.g., between the variables.
two categories, such as male and female). When Two methods of correlation that can be
the original scale of measurement for both vari- employed as measures of association when both
ables is interval or ratio but scores on one of the variables are in the form of ordinal (i.e., rank
variables have been transformed into a dichoto- order) data are Spearman’s rank order correlation
mous nominal scale, the biserial correlation is the and Kendall’s tau. Kendall’s coefficient of concor-
appropriate measure to compute. When the origi- dance is a correlation that can be employed as
nal scale of measurement for both variables is a measure of association for evaluating three or
interval or ratio but scores on both of the variables more sets of ranks.
have been transformed into a dichotomous nomi- A number of measures of correlation or associa-
nal scale, the tetrachoric correlation is employed. tion are available for evaluating categorical data
Multiple correlation involves a generalization of that are summarized in the format of a two-dimen-
the product-moment correlation to evaluate the sional contingency table. The following measures
relationship between two or more predictor vari- can be computed when both the variables are
ables with a single criterion variable, with all the dichotomous in nature: phi coefficient and Yule’s Q.
variables representing either interval or ratio data. When both variables are dichotomous or one or
Within the context of multiple correlation, partial both of the variables have more than two categories,
and semipartial correlations can be computed. A the following measures can be employed: contin-
partial correlation measures the relationship gency coefficient, Cramer’s phi, and odds ratio.
between two of the variables after any linear asso- The intraclass correlation and Cohen’s kappa
ciation one or more additional variables have with are measures of association that can be employed
the two variables has been removed. A semipartial for assessing interjudge reliability (i.e., degree of
correlation measures the relationship between two agreement among judges), the former being
of the variables after any linear association one or employed when judgments are expressed in the
more additional variables have with one of the form of interval or ratio data, and the latter when
two variables has been removed. An extension of the data are summarized in the format of a contin-
multiple correlation is canonical correlation, which gency table.
Correspondence Analysis 267

Measures of effect size, which are employed


within the context of an experiment to measure CORRESPONDENCE ANALYSIS
the degree of variability on a dependent variable
that can be accounted for by the independent vari- Correspondence analysis (CA) is a generalized
able, represent another type of correlational mea- principal component analysis tailored for the anal-
sure. Representative of such measures are omega ysis of qualitative data. Originally, CA was created
squared and eta-squared, both of which can be to analyze contingency tables, but CA is so versa-
employed as measures of association for data eval- tile that it is used with a number of other data
uated with an analysis of variance. table types.
In addition to all the aforementioned mea- The goal of CA is to transform a data table into
sures, more complex correlational methodolo- two sets of factor scores: one for the rows and one
gies are available for describing specific types of for the columns. The factor scores give the best
curvilinear relationships between two or more representation of the similarity structure of the
variables. Other correlational procedures, such rows and the columns of the table. In addition, the
as path analysis and structural equation model- factors scores can be plotted as maps, which dis-
ing, are employed within the context of causal play the essential information of the original table.
modeling (which employs correlation data In these maps, rows and columns are displayed as
to evaluate hypothesized causal relationships points whose coordinates are the factor scores and
among variables). whose dimensions are called factors. It is interest-
David J. Sheskin ing that the factor scores of the rows and the col-
umns have the same variance, and therefore, both
See also Bivariate Regression; Coefficients of Correlation, rows and columns can be conveniently represented
Alienation, and Determination; Least Squares, in one single map.
Methods of; Multiple Regression; Pearson Product- The modern version of CA and its geometric
Moment Correlation Coefficient; Regression to the interpretation comes from 1960s France and is
Mean; Residuals; Scatterplot associated with the French school of data analysis
(analyse des données).
As a technique, it was often discovered (and
Further Readings rediscovered), and so variations of CA can be
found under several different names, such as dual-
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). scaling, optimal scaling, or reciprocal averaging.
Applied multiple regression/correlation analysis for the
The multiple identities of CA are a consequence of
behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence
Erlbaum.
its large number of properties: It can be defined as
Edwards, A. L. (1984). An introduction to linear an optimal solution for many apparently different
regression and correlation (2nd ed.). New York: W. H. problems.
Freeman.
Howell, D. C. (2007). Statistical methods for psychology
(6th ed.). Belmont, CA: Thomson/Wadsworth. Notations
Montgomery, D. C., Peck, E. A., & Vining, G. (2006).
Introduction to linear regression analysis (4th ed.). Matrices are denoted with uppercase letters typeset
Hoboken, NJ: Wiley-Interscience. in a boldface font; for example, X is a matrix. The
Moore, D. S., & McCabe, G. P. (2009). Introduction to elements of a matrix are denoted with a lowercase
the practice of statistics (6th ed.). New York: W. H. italic letter matching the matrix name, with indices
Freeman. indicating the row and column positions of the ele-
Rosner, B. (2006). Fundamentals of biostatistics (6th ed.).
ment; for example, xi;j is the element located at the
Belmont, CA: Thomson-Brooks/Cole
Sheskin, D. J. (2007). Handbook of parametric and
ith row and jth column of matrix X. Vectors are
nonparametric statistical procedures (4th ed.). Boca denoted with lowercase, boldface letters; for exam-
Raton, FL: Chapman & Hall/CRC. ple, c is a vector. The elements of a vector are
Zar, J. H. (1999). Biostatistical analysis (4th ed.). Upper denoted with a lowercase italic letter matching the
Saddle River, NJ: Prentice Hall. vector name and an index indicating the position
268 Correspondence Analysis

Table 1 The Punctuation Marks of Six French


Aloz 2
Writers Zola
Rousseau Proust
Writer Period Comma All Other Marks
Giraudoux Chateaubriand 1
Rousseau 7,836 13,112 6,026 Hugo
Chateaubriand 53,655 102,383 42,413
Hugo 115,615 184,541 59,226
Zola 161,926 340,479 62,754 Figure 1 Principal Components Analysis of
Proust 38,177 105,101 12,670 Punctuation
Giraudoux 46,371 58,367 14,299 Notes: Data are centered. Aloz, the fictitious alias for Zola,
Source: Adapted from Brunet, 1989. is a supplementary element. Even though Aloz punctuates
the same way as Zola, Aloz is farther away from Zola than
from any other author. The first dimension explains 98% of
of the element in the vector; for example ci is the the variance. It reflects mainly the number of punctuation
ith element of c. The italicized superscript T indi- marks produced by the author.
cates that the matrix or vector is transposed.
authors. In this map, the authors are points and
An Example: How Writers Punctuate the distances between authors reflect the proximity
of style of the authors. So two authors close to
This example comes from E. Brunet, who analyzed each other punctuate in a similar way and two
the way punctuation marks were used by six authors who are far from each other punctuate
French writers: Rousseau, Chateaubriand, Hugo, differently.
Zola, Proust, and Giraudoux. In the paper, Brunet
gave a table indicating the number of times each
of these writers used the period, the comma, and A First Idea: Doing Principal Components Analysis
all the other marks (i.e., question mark, exclama-
tion point, colon, and semicolon) grouped A first idea is to perform a principal compo-
together. These data are reproduced in Table 1. nents analysis (PCA) on X. The result is shown in
From these data we can build the original data Figure 1. The plot suggests that the data are quite
matrix, which is denoted X. It has I ¼ 6 rows and unidimensional. And, in fact, the first component
J ¼ 3 columns and is equal to of this analysis explains 98% of the inertia (a
value akin to variance). How to interpret this com-
2 3
7836 13112 6026 ponent? It seems related to the number of punctua-
6 53655 102383 42413 7 tion marks produced by each author. This
6 7
6 115615 184541 59226 7 interpretation is supported by creating a fictitious
X ¼ 6
6 161926
7: ð1Þ
6 340479 62754 7
7
alias for Zola.
4 38177 105101 12670 5 Suppose that, unbeknown to most historians of
46371 58367 14299 French literature, Zola wrote a small novel under
the (rather transparent) pseudonym of Aloz. In this
In the matrix X, the rows represent the authors novel, he kept his usual way of punctuating, but
and the columns represent types of punctuation because it was a short novel, he obviously pro-
marks. At the intersection of a row and a column, duced a smaller number of punctuation marks
we find the number of a given punctuation mark than he did in his complete uvre. Here is the
(represented by the column) used by a given (row) vector recording the number of occurrences
author (represented by the row). of the punctuation marks for Aloz:
½ 2699 5675 1046 : ð2Þ
Analyzing the Rows
For ease of comparison, Zola’s row vector is repro-
Suppose that the focus is on the authors and that duced here:
we want to derive a map that reveals the similari-
ties and differences in punctuation style among ½ 161926 340479 62754 : ð3Þ
Correspondence Analysis 269

So Aloz and Zola have the same punctuation If all authors punctuate the same way, they all
style and differ only in their prolixity. A good anal- punctuate like the average writer. Therefore, in
ysis should reveal such a similarity of style, but as order to study the differences among authors, we
Figure 1 shows, PCA fails to reveal this similarity. need to analyze the matrix of deviations from the
In this figure, we have projected Aloz (as a supple- average writer. This matrix of deviations is
mentary element) in the analysis of the authors, denoted Y, and it is computed as
and Aloz is, in fact, farther away from Zola than  
any other author. This example shows that using Y ¼ R  1 × cT ¼
PCA to analyze the style of the authors is not I×1
2 3
a good idea because a PCA is sensitive mainly to :0068 :0781 :0849
the number of punctuation marks rather than to 6 :0269 :0483 :0752 7
6 7
how punctuation is used. The ‘‘style’’ of the 6 7 ð6Þ
6 :0244 :0507 :0263 7
authors is, in fact, expressed by the relative fre- 6 7:
6 :0107 :0382 :0275 7
quencies of their use of the punctuation marks. 6 7
6 7
This suggests that the data matrix should be trans- 4 :0525 :1097 :0573 5
formed such that each author is described by the
:0923 :0739 :0184
proportion of his usage of the punctuation marks
rather than by the number of marks used. The Masses (Rows) and Weights (Columns)
transformed data matrix is called a row profile
matrix. In order to obtain the row profiles, we In CA, a mass is assigned to each row and
divide each row by its sum. This matrix of row a weight to each column. The mass of each row
profiles is denoted R. It is computed as reflects its importance in the sample. In other
words, the mass of each row is the proportion of
1
this row in the total of the table. The masses of the
R ¼ diag X 1 X rows are stored in a vector denoted m, which is
J×1

2 3 computed as
:2905 :4861 :2234  1
6 :2704 :5159 :2137 7
6 7 m ¼ 1 ×X× 1 ×X 1
6 :3217 :5135 :1648 7 1×I J×1 J×1
¼6
6 :2865
7 ð4Þ |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} |fflffl{zfflffl}
6 :6024 :1110 7
7 Inverse of the total of X
4 :2448 :6739 :0812 5
Total of the rows of X
:3896 :4903 :1201
T
¼ ½:0189 :1393 :2522 :3966 :1094 :0835 :
(where diag transforms a vector into a diagonal
ð7Þ
matrix with the elements of the vector on the diag-
onal, and J ×1 1 is a J × 1 vector of ones). From the vector m, we define the matrix of masses
The ‘‘average writer’’ would be someone who as M ¼ diag (m).
uses each punctuation mark according to its pro- The weight of each column reflects its impor-
portion in the sample. The profile of this average tance for discriminating among the authors. So
writer would be the barycenter (also called cen- the weight of a column reflects the information
troid, center of mass, or center of gravity) of the this column provides to the identification of
matrix. Here, the barycenter of R is a vector with a given row. Here, the idea is that columns that
J ¼ 3 elements. It is denoted c and computed as are used often do not provide much information,
 1 and column that are used rarely provide much
T
c ¼ 1 ×X× 1 × 1 X information. A measure of how often a column
1×I J×1 1×I
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} |fflffl{zfflffl} is used is given by the proportion of times it is
Inverse of the total of X ð5Þ used, which is equal to the value of this column’s
Total of the columns of X
component of the barycenter. Therefore the
weight of a column is computed as the inverse of
¼ ½ :2973 :5642 :1385 :
this column’s component of the barycenter.
270 Correspondence Analysis

Specifically, if we denote by w the J by 1 weight


vector for the columns, we have
 h i
w ¼ wj ¼ c1 : 2
j λ2 = 0.01
τ2 = 24%
Chateaubriand
For our example, we obtain
2 1 3 2 Proust
3
:2973 3:3641
6 7
 h 1 i 6 1 7 6 6
7
7
Rousseau
6 7
w ¼ wj ¼ cj ¼ 6 :5642 7 ¼ 6 1:7724 7: ð9Þ
4 5 4 5 Zola
1
Hugo
λ1 = 0.02
1 7:2190 τ 1 = 76%
:1385

From the vector w, we define the matrix of weights


as W ¼ diag {w}. Giraudoux

Generalized Singular Value Decomposition of Y


Now that we have defined all these notations,
CA boils down to a generalized singular value
decomposition (GSVD) problem. Specifically,
matrix Y is decomposed using the GSVD under Figure 2 Plot of the Correspondence Analysis of the
the constraints imposed by the matrices M (masses Rows of Matrix X
for the rows) and W (weights for the columns): Notes: The first two factors of the analysis for the rows are
T plotted (i.e., this is the matrix F). Each point represents an
Y ¼ PΔQ with: author. The variance of each factor score is equal to its
ð10Þ eigenvalue.
PT MP ¼ QT WQ ¼ I ;

where P is the right singular vector, Q is the left the observations onto the singular vectors). The
singular vector, and Δ is the diagonal matrix of row factor scores are stored in an I ¼ 3 L ¼ 2
the eigenvalues. From this we get matrix (where L stands for the number of nonzero
2 3 singular values) denoted F. This matrix is obtained
1:7962 0:9919 as
6 1:4198 1:4340 7
6 7 2 3
6 7
6 0:7739 0:3978 7 0:2398 0:0741
Y ¼ 6 6 7× 6 0:1895
0:6878 0:0223 7 6 0:1071 7
7
6 7 6 0:1033 0:0297 7
6 7 6 7:
4 1:6801 0:8450 5 F ¼ PΔ ¼ 6 7 ð12Þ
6 0:0918 0:0017 7
0:3561 2:6275 4 0:2243 0:0631 5
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
P ð11Þ 0:0475 0:1963
 
:1335 0
×
0 :0747 The variance of the factor scores for a given
|fflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflffl} dimension is equal to the squared singular value
Δ
  of this dimension. (The variance of the observa-
0:1090 0:4114 0:3024
: tions is computed taking into account their
0:4439 0:2769 0:1670 masses.) Or equivalently, we say that the vari-
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
QT ance of the factor scores is equal to the eigen-
value of this dimension (i.e., the eigenvalue is
The rows of the matrix X are now represented the square of the singular value). This can be
by their factor scores (which are the projections of checked as follows:
Correspondence Analysis 271

Period (a)
z Period
[0,0,1] 1.0 1 0
.9
.8
.7

Period
.6
.5

Co
od
.4

mm
ri
.4861

Pe
.2905 .3 Rousseau

a
.2
.1
.2234 0 Co .2905
m Rousseau
r
the .5 .5 ma
O
Comma
.4861 0 1
1.0 1.0
y 1 .2234 0
[0,1,0] [1,0,0] x
Other Other
Other Comma

(b)
Period
Figure 3 In Three Dimensions, the Simplex Is a Two-
1 0
Dimensional Triangle Whose Vertices Are
the Vectors [100], [010], and [001]

Co
rio

" #

mm
Pe

0:13352 0
F MF ¼ Δ ¼ L ¼
T 2

a
G
0 0:07472
  ð13Þ H Z
0:0178 0 R *
¼ : C P
0 0:0056
Comma
We can display the results by plotting the factor 0 1
scores as a map on which each point represents 1 0
a row of the matrix X (i.e., each point represents Other Other
an author). This is done in Figure 2. On this map,
the first dimension seems to be related to time (the
rightmost authors are earlier authors, and the left- Figure 4 The Simplex as a Triangle
most authors are more recent), with the exception
of Giraudoux, who is a very recent author. The
second dimension singularizes Giraudoux. These a vector, it can be represented as a point in a multi-
factors will be easier to understand after we have dimensional space. Because the sum of a profile is
analyzed the columns. This can be done by analyz- equal to one, row profiles are, in fact points in a J
ing the matrix XT. Equivalently, it can be done by by 1 dimensional space.
what is called dual analysis. Also, because the components of a row profile
take value in the interval [0, 1], the points repre-
senting these row profiles can lie only in the sub-
Geometry of the Generalized
space whose ‘‘extreme points’’ have one component
Singular Value Decomposition
equal to one and all other components equal to
CA has a simple geometric interpretation. For zero. This subspace is called a simplex. For exam-
example, when a row profile is interpreted as ple, Figure 3 shows the two-dimensional simplex
272 Correspondence Analysis

Period
1 0 Period
1 0

1.8341
d Co
io
Per mm
od

Co
a
ri
Pe

G mm 1 G
H Z a 68 .3 31 R H
Z
R 28 3 *
2. C Comma
C P 0 P 1
0 1 Comma
1 0
1 0 Other Other
Other Other

Figure 5 Geometric Interpretation of the Column Weights


Note: Each side of the simplex is stretched by a factor equal to the square root of the weights.

corresponding to the subspace of all possible row The stretched simplex shows the whole space of
profiles with three components. As an illustration, the possible profiles. Figure 6 shows that the
the point describing Rousseau (with coordinates authors occupy a small portion of the whole space:
equal to [.2905 .4861 .2234]) is also plotted. For They do not vary much in the way they punctuate.
this particular example, the simplex is an equilat- Also, the stretched simplex represents the columns
eral triangle and, so the three-dimensional row pro- as the vertices of the simplex: The columns are
files can conveniently be represented as points on represented as row profiles with the column com-
this triangle, as illustrated in Figure 4a, which ponent being one and all the other components
shows the simplex of Figure 3 in two dimensions. being zeros. This representation is called an asym-
Figure 4b shows all six authors and the barycenter. metric representation because the rows always
The weights of the columns, which are used as have a dispersion smaller than (or equal to) the
constraints in the GSVD, also have a straightfor- columns.
ward geometric interpretation. As illustrated in
Figure 5, each side of the simplex is stretched by Distance, Inertia, Chi-Square,
a quantity equal to the square root of the dimen-
and Correspondence Analysis
sion it represents (we use the square root because
we are interested in squared distances but not in Chi-Square Distances
squared weights, so using the square root of the
weights ensures that the squared distances between In CA, the Euclidean distance in the stretched
authors will take into account the weight rather simplex is equivalent to a weighted distance in the
than the squared weights).
The masses of the rows are taken into account Period
to find the dimensions. Specifically, the first factor 1
0
is computed in order to obtain the maximum pos- riod Co
Pe mm
sible value of the sum of the masses times the a
G
squared projections of the authors’ points (i.e., the H Z
R Comma
projections have the largest possible variance). The 0 C P
second factor is constrained to be orthogonal (tak- 1
ing into account the masses) to the first one and to 1 Other 0
Other
have the largest variance for the projections.
The remaining factors are computed with simi-
lar constraints. Figure 6 shows the stretched sim- Figure 6 Correspondence Analysis: The ‘‘Stretched
plex, the author points, and the two factors (note Simplex’’ Along With the Factorial Axes
that the origin of the factors is the barycenter of Note: The projections of the authors’ points onto the
the authors). factorial axes give the factor scores.
Correspondence Analysis 273

original space. For reasons that will be made more Inertia and the Chi-Square Test
clear later, this distance is called the χ2 distance.
It is interesting that the inertia in CA is closely
The χ2 distance between two row profiles i and i0
related to the chi-square test. This test is tradition-
can be computed from the factor scores as
ally performed on a contingency table in order to
test the independence of the rows and the columns
X
L

2
2
di;i0 ¼ fi;‘  fi0 ;‘ ð14Þ of the table. Under independence, the frequency of
‘ each cell of the table should be proportional to the
product of its row and column marginal probabili-
or from the row profiles as ties. So if we denote by x þ , þ the grand total of
matrix X, the expected frequency of the cell at
X
J the ith row and jth column is denoted Ei,j and

2
2
di;i0 ¼ wj ri;j  ri0 ;j : ð15Þ computed as
j
Ei;j ¼ mi cj x þ ; þ : ð19Þ
Inertia
The chi-square test statistic, denoted χ2, is com-
The variability of the row profiles relative to puted as the sum of the squared difference between
their barycenter is measured by a quantity—akin the actual values and the expected values,
to variance—called inertia and denoted I . The weighted by the expected values:
inertia of the rows to their barycenter is computed
as the weighed sum of the squared distances of the

2 X xi;j  Ei;j 2
rows to their barycenter. We denote by dc,i the 2
χ ¼ : ð20Þ
(squared) distance of the ith row to the barycenter, i;j
Ei;j
computed as
When rows and columns are independent, χ2
X
J X
L
follows a chi-square distribution with
2
dc,i ¼ wj ðri,j  cj Þ2 ¼ fi,l2 ð16Þ
(I  1)(J  1) degrees of freedom. Therefore, χ2
j l
can be used to evaluate the likelihood of the row
and columns independence hypothesis. The statis-
where L is the number of factors extracted by the tic χ2 can be rewritten to show its close relation-
CA of the table, [this number is smaller than or ship with the inertia of CA, namely:
equal to min(I, J) 1]. The inertia of the rows to
their barycenter is then computed as
χ2 ¼ I x þ , þ : ð21Þ
X
I
I ¼ 2
mi dc;i : ð17Þ This shows that CA analyzes—in orthogonal com-
i ponents—the pattern of deviations for
independence.
The inertia can also be expressed as the sum of the
eigenvalues (see Equation 13): Dual Analysis

X
L In a contingency table, the rows and the columns
I ¼ λ‘ ð18Þ of the table play a similar role, and therefore the
‘ analysis that was performed on the rows can also
be performed on the columns by exchanging the
This shows that in CA, each factor extracts role of the rows and the columns. This is illus-
a portion of the inertia, with the first factor trated by the analysis of the columns of matrix X,
extracting the largest portion, the second factor or equivalently by the rows of the transposed
extracting the largest portion left of the inertia, matrix XT. The matrix of column profiles for XT is
and so forth. called O (like cOlumn) and is computed as
274 Correspondence Analysis

1
O ¼ diag XT 1 XT : ð22Þ
I×1

The matrix of the deviations to the barycenter is


called Z and is computed as
2 3
  :0004 :0126 :0207 :0143 :0193 :0259
Z ¼ O  1 × mT ¼ 4 :0026 :0119 :0227 :0269 :0213 :0109 5:
I×1
:0116 :0756 :0478 :0787 :0453 :0111

Weights and masses of the columns analysis GSVD with the constraints imposed by the
are the inverse of their equivalent for the row two matrices W  1 (masses for the rows) and
analysis. This implies that the punctuation M  1 (weights for the columns; compare with
marks factor scores are obtained from the Equation 10):

Z ¼ UDVT with: UT W1 U ¼ VT M1 V ¼ I : ð23Þ

This gives
2 3
0:3666 1:4932  
6 7 :1335 0
Z ¼ 4 0:7291 0:4907 5 ×
0 :0747
2:1830 1:2056 |fflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} Δ
U ð24Þ
 
0:0340 0:1977 0:1952 0:2728 0:1839 0:0298
× :
0:0188 0:1997 0:1003 0:0089 0:0925 0:2195
|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}
VT

The factor scores for the punctuation marks are I ¼ :13352 þ :7472 ¼ :0178 þ :0056
stored in a J ¼ 3 × L ¼ 2 matrix called G,
which is computed in the same way F was com- ¼ 0:0234 : ð26Þ
puted (see Equation 12). So G is computed as
Also, the generalized singular decomposition of
2 3 one set (say, the columns) can be obtained from
0:0489 0:1115
G ¼ UΔ ¼ 4 0:0973 0:0367 5 : ð25Þ the other one (say, the rows). For example, the
0:2914 0:0901 generalized singular vectors of the analysis of the
columns can be computed directly from the analy-
sis from the rows as

Transition Formula U ¼ WQ: ð27Þ


A comparison of Equation 24 with Equation 11 Combining Equations 27 and 25 shows that the
shows that the singular values are the same for factors for the rows of Z (i.e., the punctuation
both analyses. This means that the inertia (i.e., the marks) can be obtained directly from the singular
square of the singular value) extracted by each fac- value decomposition of the authors matrix (i.e.,
tor is the same for both analyses. Because the vari- the matrix Y) as
ance extracted by the factors can be added to
obtain the total inertia of the data table, this also G ¼ WQΔ: ð28Þ
means that each analysis is decomposing the same
inertia, which here is equal to As a consequence, we can, in fact, find directly the
Correspondence Analysis 275

One Single Generalized Singular Value


2
λ2 = 0.01 Decomposition for Correspondence Analysis
τ 2 = 24% COLON
Because the factor scores obtained for the rows
and the columns have the same variance (i.e., they
have the same ‘‘scale’’), it is possible to plot them
in the same space. This is illustrated in Figure 7.
SEMICOLON
The symmetry of the rows and the columns in CA
is revealed by the possibility of directly obtaining
Chateaubriand the factors scores from one single GSVD. Specifi-
Proust OTHER cally, let Dm and Dc denote the diagonal matrices
COMMA MARKS with the elements of m and c, respectively, on the
Rousseau diagonal, and let N denote the matrix X divided
Zola
1 by the sum of all its elements. This matrix is called
Hugo λ1 = 0.02 a stochastic matrix: All its elements are larger than
τ1 = 76%
PERIOD zero, and their sum is equal to one. The factor
scores for the rows and the columns are obtained
Giraudoux
from the following GSVD:
INTERROGATION

EXCLAMATION N  mcT ¼ SΔTT with


ð31Þ
T 1
ST D1
m S ¼ T D c T ¼ I :

Abdi
The factor scores for the rows (F) and the columns
(G) are obtained as

F ¼ D1
m S and G ¼ D1
c T: ð32Þ
Figure 7 Correspondence Analysis of the
Punctuation of Six Authors
Supplementary Elements
Notes: Comma, period, and other marks are active columns;
Rousseau, Chateaubriand, Hugo, Zola, Proust, and Often in CA we want to know the position in the
Giraudoux are active rows. Colon, semicolon, interrogation, analysis of rows or columns that were not ana-
and exclamation are supplementary columns; Abdi is
lyzed. These rows or columns are called illustrative
a supplementary row.
or supplementary rows or columns (or supplemen-
tary observations or variables). By contrast with
factor scores of the columns from their profile the appellation of supplementary (i.e., not used to
matrix (i.e., the matrix O), and from the factor compute the factors), the active elements are those
scores of the rows. Specifically, the equation that used to compute the factors. Table 2 shows the
gives the values of O from F is punctuation data with four additional columns
giving the detail of the ‘‘other punctuation marks’’
G ¼ OFΔ1 ; ð29Þ (i.e., the exclamation point, the question mark, the
semicolon, and the colon). These punctuation
marks were not analyzed for two reasons: First,
and conversely, F could be obtained from G as these marks are used rarely, and therefore they
would distort the factor space, and second, the
F ¼ RGΔ1 : ð30Þ ‘‘other’’ marks comprises all the other marks, and
therefore to analyze them with ‘‘other’’ would be
These equations are called transition formulas redundant. There is also a new author in Table 2:
from the rows to the columns (and vice versa) or We counted the marks used by a different author,
simply the transition formulas. namely, Hervé Abdi in the first chapter of his 1994
276 Correspondence Analysis

Table 2 Number of Punctuation Marks Used by Six Major French Authors


Active Elements Supplementary Elements
x

Period Comma Other Marks xi;þ m ¼ xþþ Exclamation Question Semicolon Colon
Rousseau 7,836 13,112 6,026 26,974 .0189 413 1,240 3,401 972
Chateaubriand 53,655 102,383 42,413 198,451 .1393 4,669 4,595 19,354 13,795
Hugo 115,615 184,541 59,226 359,382 .2522 19,513 9,876 22,585 7,252
Zola 161,926 340,479 62,754 565,159 .3966 24,025 10,665 18,391 9,673
Proust 38,117 105,101 12,670 155,948 .1094 2,756 2,448 3,850 3,616
Giraudoux 46,371 58,367 14,229 119,037 .0835 5,893 5,042 1,946 1,418
xþj 423,580 803,983 197,388 1,424,951
xiþ
WT ¼ xþþ 3.3641 1.7724 7.2190 x++
x
cT ¼ xþþ

.2973 5642 .1385

Abdi (Chapter 1) 216 139 26


Source: Adapted from Brunet, 1989.
Notes: The exclamation point, question mark, semicolon, and colon are supplementary columns. Abdi (1994) Chapter 1 is
a supplementary row. Notations: xi þ ¼ sum of the ith row; x þ j ¼ sum of the jth column; x þ þ ¼ grand total.

book called Les réseaux de reurores. This author supplementary column profile matrix, then Gsup,
was not analyzed because the data are available the matrix of the supplementary column factor
for only one chapter (not his complete work) and scores, is computed as
also because this author is not a literary author.
The values of the projections on the factors for Gsup ¼ Osup FΔ1 : ð35Þ
the supplementary elements are computed from
the transition formula. Specifically, a supplemen-
Table 4 gives the factor scores for the supple-
tary row is projected into the space defined using
mentary elements.
the transition formula for the active rows (cf.
Equation 30) and replacing the active row profiles
by the supplementary row profiles. So if we denote
Little Helpers: Contributions and Cosines
by Rsup the matrix of the supplementary row pro-
files, then Fsup—the matrix of the supplementary Contributions and cosines are coefficients whose
row factor scores—is computed as goal is to facilitate the interpretation. The contri-
butions identify the important elements for a given
Fsup ¼ Rsup × G × Δ1 : ð33Þ factor, whereas the (squared) cosines identify the
factors important for a given element. These coeffi-
Table 3 provides factor scores and descriptives for cients express importance as the proportion of
the rows. something in a total. The contribution is the ratio
For example, the factor scores of the author of the weighted squared projection of an element
Abdi are computed as on a factor to the sum of the weighted projections
of all the elements for this factor (which happens
Fsup ¼ Rsup GΔ1 ¼ ½ 0:0908  0:5852 : ð34Þ to be the eigenvalue of this factor). The squared
cosine is the ratio of the squared projection of an
Supplementary columns are projected into the element on a factor to the sum of the projections
factor space using the transition formula from of this element on all the factors (which happens
the active rows (cf. Equation 29) and replacing to be the squared distance from this point to the
the active column profiles by the supplementary barycenter). Contributions and squared cosines are
column profiles. If we denote by Osup the proportions that vary between 0 and 1.
Correspondence Analysis 277

Table 3 Factor Scores, Contributions, and Cosines for the Rows


Axis λ % Rousseau Chateaubriand Hugo Zola Proust Giraudoux Abdi (Chapter 1)
Factor Scores

1 .0178 76 0.2398 0.1895 0.1033 –0.0918 –0.2243 0.0475 –0.0908


2 .0056 24 0.0741 0.1071 –0.0297 0.0017 0.0631 –0.1963 0.5852
Contributions

1 0.0611 0.2807 0.1511 0.1876 0.3089 0.0106 –


2 0.0186 0.2864 0.0399 0.0002 0.0781 0.5767 –
Cosines

1 0.9128 0.7579 0.9236 0.9997 0.9266 0.0554 0.0235


2 0.0872 0.2421 0.0764 0.0003 0.0734 0.9446 0.9765
Squared distances to grand barycenter

– – – 0.0630 0.0474 0.0116 0.0084 0.0543 0.0408 0.3508


Notes: Negative contributions are shown in italic. Abdi (1994) Chapter 1 is a supplementary row.

Table 4 Factor Scores, Contributions, and Cosines for the Columns


Axis λ % Period Comma Other Marks Exclamation Question Semicolon Colon
Factor scores

1 .0178 76 –0.0489 0.0973 –0.2914 –0.0596 –0.1991 –0.4695 –0.4008


2 .0056 24 0.1115 –0.0367 -0.0901 0.2318 0.2082 –0.2976 –0.4740
Contributions

1 0.0399 0.2999 0.6601 – – – –


2 0.6628 0.1359 0.2014 – – – –
Cosines

1 0.1614 0.8758 0.9128 0.0621 0.4776 0.7133 0.4170


2 0.8386 0.1242 0.0872 0.9379 0.5224 0.2867 0.5830
Squared distances to grand barycenter

– – 0.0148 0.0108 0.0930 0.0573 0.0830 0.3090 0.3853


Notes: Negative contributions are shown in italic. Exclamation point, question mark, semicolon, and colon are supplementary
columns.

The squared cosines, denoted h, between row i m f2


i mi f 2 j c g2 ci g2
and factor l (and between column j and factor l) bi;‘ ¼ P mi;‘f 2 ¼ λ‘
i;‘
and bj;‘ ¼ P cj;‘f 2 ¼ λ‘
j;‘
:
i i;‘ j j;‘
are obtained as i j

ð37Þ
f2 f2 g2 g2
hi;‘ ¼ Pi;‘f 2 ¼ di;‘
2 and hj;‘ ¼ Pj;‘f 2 ¼ dj;‘
2 : ð36Þ Contributions help locating the observations
i;‘ c;i j;‘ r;j
‘ ‘ important for a given factor. An often used rule
of thumb is to consider that the important con-
Squared cosines help in locating the factors impor- tributions are larger than the average contribu-
tant for a given observation. The contributions, tion, which is equal to the number of elements
denoted b, of row i to factor l and of column j to (i.e., 1I for the rows and 1J for the columns). A
factor l are obtained as dimension is then interpreted by opposing the
278 Correspondence Principle

positive elements with large contributions to the Further Readings


negative elements with large contributions.
Benzécri, J. P. (1973). L’analyse des données [Data
Cosines and contributions for the punctuation analysis] (2 vol.). Paris: Dunod.
example are given in Tables 3 and 4. Brunet, E. (1989). Faut-il ponderer les données
linguistiques [Weighting linguistics data]. CUMFID,
16, 39–50.
Multiple Correspondence Analysis Clausen, S. E. (1998). Applied correspondence analysis.
Thousand Oaks, CA: Sage.
CA works with a contingency table that is Escofier, B., & Pages, J. (1998). Analyses factorielles
equivalent to the analysis of two nominal vari- simples et multiples [Simple and multiple factor
ables (i.e., one for the rows and one for the col- analysis]. Paris: Dunod.
umns). Multiple CA (MCA) is an extension of Greenacre, M. J. (1984). Theory and applications of
CA that analyzes the pattern of relationships correspondence analysis. London: Academic Press.
among several nominal variables. MCA is used Greenacre, M. J. (2007). Correspondence analysis in
to analyze a set of observations described by practice (2nd ed.). Boca Raton, FL: Chapman & Hall/
CRC.
a set of nominal variables. Each nominal vari-
Greenacre, M. J., & Blasius, J. (Eds.). (2007). Multiple
able comprises several levels, and each of these correspondence analysis and related methods. Boca
levels is coded as a binary variable. For example, Raton, FL: Chapman & Hall/CRC.
gender (F vs. M) is a nominal variable with two Hwang, H., Tomiuk, M. A., & Takane, Y. (2009).
levels. The pattern for a male respondent will be Correspondence analysis, multiple correspondence
[0 1], and for a female respondent, [1 0]. The analysis and recent developments. In R. Millsap & A.
complete data table is composed of binary col- Maydeu-Olivares (Eds.), Handbook of quantitative
umns with one and only one column, per nomi- methods in psychology (pp. 243–263). London: Sage.
nal variable, taking the value of 1. Lebart, L., & Fenelon, J. P. (1971). Statistiques et
MCA can also accommodate quantitative vari- informatique appliquees [Applied statistics and
computer science]. Paris: Dunod.
ables by recoding them as ‘‘bins.’’ For example,
Weller, S. C., & Romney, A. K. (1990). Metric scaling:
a score with a range of  5 to þ 5 could be Correspondence analysis. Thousand Oaks, CA: Sage.
recoded as a nominal variable with three levels:
less than 0, equal to 0, or more than 0. With this
schema, a value of 3 will be expressed by the pat-
tern 0 0 1. The coding schema of MCA implies CORRESPONDENCE PRINCIPLE
that each row has the same total, which for CA
implies that each row has the same mass.
The correspondence principle is generally known
Essentially, MCA is computed by using a CA
as the Bohr correspondence principle (CP), for
program on the data table. It can be shown that
Niels Bohr. It is considered one of Bohr’s greatest
the binary coding scheme used in MCA creates
contributions to physics, along with his derivation
artificial factors and therefore artificially reduces
of the Balmer formula. Bohr’s leading idea is that
the inertia explained. A solution for this problem
classical physics, though limited in scope, is indis-
is to correct the eigenvalues obtained from the CA
pensable for the understanding of quantum phys-
program.
ics. The idea that old science is ‘‘indispensable’’ to
Hervé Abdi and Lynne J. Williams the understanding of new science is in fact the
main theme in using the concept of correspon-
See also Barycentric Discriminant Analysis; dence; therefore, the CP can be defined as the prin-
Canonical Correlation Analysis; Categorical ciple by which new theories of science (physics in
Variable; Chi-Square Test; Coefficient Alpha; Data particular) can relate to previously accepted theo-
Mining; Descriptive Discriminant Analysis; ries in the field by means of approximation at a cer-
Discriminant Analysis; Exploratory Data tain limit. Historically, Max Planck had
Analysis; Exploratory Factor Analysis; Guttman introduced the concept in 1906. Bohr’s first han-
Scaling; Matrix Algebra; Principal Components dling of the concept was in his first paper after
Analysis; R World War I, in which he showed that quantum
Correspondence Principle 279

formalism would lead to classical physics when n → ∞, where n is the quantum number. Although there were many previous uses of the concept, the important issue here is not to whom the concept can be attributed, but an understanding of the various ways that it can be used in scientific and philosophic research.

The principle is important for the continuity in science. There are two ways of thinking about such continuity. A theory T covers a set of observations S. A new observation s1 is detected. T cannot explain s1. Scientists first try to adapt T to be able to account for s1. But if T is not in principle able to explain s1, then scientists will start to look for another theory, T*, that can explain S and s1. The scientist will try to derive T* by using CP as a determining factor. In such a case, T* should lead to T at a certain limit.

Nonetheless, sometimes there may be a set of new observations, S1, for which it turns out that a direct derivation of T* from T that might in principle account for S1 is not possible or at least does not seem to be possible. Then the scientist will try to suggest T* separately from the accepted set of boundary conditions and the observed set of S and S1. But because T was able to explain the set S, it is highly probable that T has a certain limit of correct assumptions that led to its ability to explain S. Therefore, any new theory T* that would account for S and S1 should resemble T at a certain limit. This can be obtained by specifying a certain correspondence limit at which the new formalism of T* will lead to the old formalism of T.

These two ways of obtaining T* are the general forms of applying the correspondence principle. Nevertheless, the practice of science presents us with many ways of connecting T* to T or parts of it. Hence it is important to discuss the physicists' different treatments of the CP. Moreover, the interpretation of CP and the implications of using CP will determine our picture of science and the future development of science; hence, it is important to discuss the philosophical implications of CP and the different philosophical understandings of the concept.

Formal Correspondence

In the current state of the relation between modern physics and classical physics, there are four kinds of formal correspondence between modern and classical physics.

Old Correspondence Principle (Numerical Correspondence)

Planck stressed the relation between his "radical" assumption of discrete energy levels that are proportional to frequency, and the classical theory. He insisted that the terms in the new equation refer to the very same classical properties. He formulated the CP so that the numerical value of

\lim_{h \to 0} [\text{Quantum physics}] = [\text{Classical physics}]

He demonstrated that the radiation law for the energy density at frequency ν,

u(\nu) = \frac{8\pi h \nu^{3}}{c^{3}\left(e^{h\nu/kT} - 1\right)},   (1)

corresponds numerically in the limit h → 0 to the classical Rayleigh–Jeans law:

u(\nu) = \frac{8\pi k T \nu^{2}}{c^{3}},   (2)

where k is Boltzmann's constant, T is the temperature, and c is the speed of light. This kind of correspondence entails that the new theory should resemble the old one not just at the mathematical level but also at the conceptual level.
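To make the numerical correspondence explicit (this expansion is added here for illustration; it is not part of the original entry), note that as h → 0 the exponential in Equation 1 can be expanded:

e^{h\nu/kT} - 1 \approx \frac{h\nu}{kT}, \qquad \text{so} \qquad u(\nu) = \frac{8\pi h\nu^{3}}{c^{3}\left(e^{h\nu/kT}-1\right)} \longrightarrow \frac{8\pi h\nu^{3}}{c^{3}} \cdot \frac{kT}{h\nu} = \frac{8\pi k T \nu^{2}}{c^{3}},

which is exactly the Rayleigh–Jeans expression in Equation 2.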
Configuration Correspondence Principle (Law Correspondence)

The configuration correspondence principle claims that the laws of new theories should correspond to the laws of the old theory. In the case of quantum and classical physics, quantum laws correspond to the classical laws when the probability density of the quantum state coincides with the classical probability density. Take, for example, a harmonic oscillator that has a classical probability density

P_C(x) = \frac{1}{\pi\sqrt{x_0^{2} - x^{2}}},   (3)

where x is the displacement.
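The form of Equation 3 can be recovered from elementary mechanics (a short derivation added here for clarity; it does not appear in the original entry). For a classical oscillator x(t) = x_0 \sin(\omega t), the probability of finding the particle in an interval dx is proportional to the time spent there during half a period:

P_C(x)\,dx = \frac{dt}{T/2} = \frac{2}{T}\,\frac{dx}{|\dot{x}|} = \frac{dx}{\pi\sqrt{x_0^{2} - x^{2}}},

since |\dot{x}| = \omega\sqrt{x_0^{2} - x^{2}} and T = 2\pi/\omega.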
Now if we superimpose the plot of this probability onto that of the quantum probability density |\psi_n|^{2} of the eigenstates of the system and take (the quantum number) n → ∞, we will obtain Figure 1 below.

[Figure 1. Classical Versus Quantum Probability Density. The figure superimposes the classical probability density (C) and the quantum probability density (Q), plotted as Prob. Density against displacement from −x to +x.]

As Richard Liboff, a leading expert in the field, has noted, the classical probability density P_C does not follow the quantum probability density |\psi_n|^{2}. Instead, it follows the local average in the limit of large quantum numbers n:

P_C(x) = P_Q(x) = \left\langle |\psi_n|^{2} \right\rangle = \frac{1}{2\varepsilon} \int_{x-\varepsilon}^{x+\varepsilon} |\psi_n(y)|^{2}\, dy.   (4)
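The local-average statement in Equation 4 can be checked numerically. The following sketch is added for illustration only (it is not part of the original entry); it uses dimensionless oscillator units, a moderately large quantum number, and an arbitrary window half-width for the local average.

import math
import numpy as np
from numpy.polynomial.hermite import hermval

def psi_sq(n, x):
    """|psi_n(x)|^2 for the harmonic oscillator in dimensionless units."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                      # select the Hermite polynomial H_n
    norm = 1.0 / math.sqrt(2.0**n * math.factorial(n) * math.sqrt(math.pi))
    psi = norm * hermval(x, coeffs) * np.exp(-x**2 / 2.0)
    return psi**2

n = 40                                   # "large" quantum number for the comparison
x0 = math.sqrt(2 * n + 1)                # classical turning point at the same energy
x = np.linspace(-x0, x0, 4001)
dx = x[1] - x[0]

quantum = psi_sq(n, x)

# Local average of |psi_n|^2 over a window of half-width eps (Equation 4).
eps = 0.5
window = np.ones(int(2 * eps / dx) + 1)
window /= window.sum()
quantum_avg = np.convolve(quantum, window, mode="same")

# Classical density of Equation 3, restricted to the allowed region.
classical = 1.0 / (math.pi * np.sqrt(np.clip(x0**2 - x**2, 1e-12, None)))

# Compare away from the turning points, where the classical density diverges.
inner = np.abs(x) < 0.8 * x0
print("max |P_C - <|psi_n|^2>| in the interior:",
      float(np.max(np.abs(classical[inner] - quantum_avg[inner]))))
print("typical P_C value in the interior:",
      float(np.mean(classical[inner])))

The printed difference shrinks as n grows (and as the averaging window is matched to the local oscillation length), which is the sense in which the classical density follows the local average of the quantum density.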
Frequency Correspondence Principle

The third type of correspondence is the officially accepted form of correspondence that is known in quantum mechanics books as the Bohr Correspondence Principle. This claims that the classical results should emerge as a limiting case of the quantum results in the limits n → ∞ (the quantum number) and h → 0 (Planck's constant). Then in the case of frequency, the quantum value should be equal to the classical value, i.e., \nu_Q = \nu_C. In most cases in quantum mechanics, the quantum frequency coalesces with the classical frequency in the limit n → ∞ and h → 0.

Nevertheless, n → ∞ and h → 0 are not universally equivalent, because in some cases of the quantum systems, the limit n → ∞ does not yield the classical one, while the limit h → 0 does; the two results are not universally equivalent. The case of a particle trapped in a cubical box would be a good example: the frequency in the high quantum number domain turns out to be displaced as

\nu_{n_q + 1} = \nu_{n_q} + \frac{h}{2md},

where m is the particle's mass and d is the length of the box. Such a spectrum does not collapse toward the classical frequency in the limit of large quantum numbers, while the spectrum of the particle does degenerate to the classical continuum in the limit h → 0. It can be argued that such correspondence would face another obvious problem relating to the assumption that Planck's constant goes to zero. What is the meaning of saying that "a constant goes to zero"? A constant is a number that has the same value at all times, and having it as zero is contradictory, unless it is zero. A reply to this problem might be that in correspondence, we ought to take the real limiting value and not the abstract one. In the case of relativity, the limit "c goes to infinity" is an abstract one, and the real limit should be "v/c goes to zero." Now, when dealing with corresponding quantum mechanics to classical mechanics, one might say that we ought to take the limit n → ∞ as a better one than h → 0. The point here is that values like c and h are constants and would not tend to go to zero or to infinity, but n and ν/c are variables—n = (0, 1, 2, 3, ...) and ν/c varies between 0 (when ν = 0) and 1 (when ν = c). (Of course, this point can also count against the old CP of Planck, the first correspondence principle in our list, because it is built on the assumption that the limit is of Planck's constant going to zero.)

Form Correspondence Principle

The last type of correspondence is form CP, which claims that we can obtain correspondence if the functional (mathematical) form of the new theory is the same as that of the old theory. This kind of correspondence is especially fruitful in particular cases in which other kinds of correspondence do not apply. Let us take the example used in frequency correspondence (quantum frequency). As seen in the case of the particle in a cubical box, the outcome of n → ∞ does not coincide with the outcome of h → 0. Hence the two limits fail to achieve the same result. In cases such as this, form correspondence might overcome the difficulties
facing frequency correspondence. The aim of form Brian David Josephson proved that the relation
correspondence is to prove that classical frequency between the phase difference and the voltage is
given by

\frac{\partial \delta}{\partial t} = \frac{2e}{\hbar} V, \quad \text{that is, the voltage} \quad V = \frac{\hbar}{2e}\frac{\partial \delta}{\partial t}.

and quantum frequency have the same form. So, if \nu_Q denotes quantum frequency, \nu_C classical frequency, and E energy, then form correspondence is satisfied if \nu_C(E) has the same functional form as \nu_Q(E). Then, by using a dipole approximation, Liboff showed that the quantum transition between state s + n and state s where s >> n gives the relation

\nu_Q^{n}(E) \approx n\left(\frac{E_s}{2ma^{2}}\right)^{1/2}.   (5)

Now, by the assertion that the Josephson junction would behave as a classical circuit, the total current would be

I = I_c \sin\delta + \frac{\hbar}{2Re}\frac{d\delta}{dt} + \frac{\hbar C}{2e}\frac{d^{2}\delta}{dt^{2}}.   (7)

This equation relates the current with the phase difference but without any direct reference to the
voltage. Furthermore, if we apply form correspon-
He also noticed that if we treat the same system dence, Equation 7 is analogous to the equation
classically (particles of energy E in a cubical box), of a pendulum in classical mechanics. The total
the calculation of the radiated power in the nth torque τ on the pendulum would be
vibrational mode is given by the expression

\nu_C^{n}(E) \approx n\left(\frac{E}{2ma^{2}}\right)^{1/2}.   (6)

\tau = M \frac{d^{2}\theta}{dt^{2}} + D \frac{d\theta}{dt} + \tau_0 \sin\theta,   (8)

where M is the moment of inertia, D is the viscous
Both frequencies have the same form, even if damping, and τ is the applied torque.
one is characterizing quantum frequency and the Both these equations have the general mathe-
other classical, and even if their experimental matical form
treatment differs. Hence, form CP is satisfied.
Y = Y_0 \sin x + B \frac{dx}{dt} + A \frac{d^{2}x}{dt^{2}}.   (9)

But such correspondence is not problem free; in the classical case, E denotes the average energy value of an ensemble of nth harmonic frequency,
value of an ensemble of nth harmonic frequency,
but in the quantum case, it denotes the eigenenergy This kind of correspondence can be widely used
of that level. Also, in the quantum case, the energy to help in the solution of many problems in phys-
is discrete, and the only way to assert that the ics. Therefore, to find new horizons in physics,
quantum frequency yields the classical one is by some might even think of relating some of the new
saying that when the quantum number is very big, theories that have not yet applied CP. Such is the
the number of points that coincide with the classi- case with form corresponding quantum chaos to
cal frequency will increase, using the dipole classical chaos. The argument runs as follows:
approximation, which asserts that the distance Classical chaos exists. If quantum mechanics is to
between the points in the quantum case is assumed be counted as a complete theory in describing
small. Hence the quantum case does not resemble nature, then it ought to have a notion that corre-
the classical case as such, but it coincides with the sponds to classical chaos. That notion can be
average of an ensemble of classical cases. called quantum chaos. But what are the possible
The main thrust of form correspondence is that things that resemble chaotic behavior in quantum
it can relate a branch of physics to a different systems? The reply gave rise to quantum chaos.
branch on the basis of form resemblance, such as However, it turns out that a direct correspondence
in the case of superconductivity. Here, a quantum between the notion of chaos in quantum mechan-
formula corresponds to classical equations if we ics and that in classical mechanics does not exist.
can change the quantum formula in the limit into Therefore, form correspondence would be fruit-
a form where it looks similar to a classical form. ful here. Instead of corresponding quantum chaos
The case of Josephson junctions in superconductiv- to classical chaos, we can correspond both of them
ity, which are an important factor in building to a third entity. Classical chaos goes in a certain
superconducting quantum interference devices, limit to the form ’, and quantum chaos goes to
presents a perfect demonstration of such concept. the same form at the same limit:
\lim_{n \to \infty} \text{classical chaos} = \varphi

\lim_{n \to \infty} \text{quantum chaos} = \varphi,

subject. But in general, realists are the defenders of the concept, whereas positivists, instrumentalists, empiricists, and antirealists are, if not opposed to
the principle, then indifferent about it. Some might
but because we have only classical and quantum accept it as a useful heuristic device, but that does
theories, then the correspondence is from one to not give it any authoritarian power in science.
the other, as suggested by Gordon Belot and John Even among realists there is more than one
Earman. position. Some such as Elie Zahar claim that the
In addition to these four formal forms of corre- CP influence ought to stem from old theory and
spondence, many other notions of correspondence arrive at the new through derivative power.
might apply, such as conceptual correspondence, Heinz Post is more flexible; he accepts both ways
whereby new concepts ought to resemble old con- as legitimate and suggests a generalized corre-
cepts at the limited range of applicability of such spondence principle that ought to be applied to
concepts. In addition, there is observational corre- all the developments in science. In doing so, he is
spondence, which is a weak case of correspon- replying to Thomas Kuhn’s Structure of Scientific
dence whereby the quantum will correspond to Revolutions, rejecting Kuhn’s claim of paradigm
what is expected to be observed classically at a cer- shift and insisting on scientific continuity. In
tain limit. Structural correspondence combines ele- doing so, Post is also rejecting Paul Feyerabend’s
ments from both form correspondence and law concept of incommensurability.
correspondence. Hence, scientific practice might So is CP really needed? Does correspondence
need different kinds of correspondence to achieve relate new theories to old ones, or are the new the-
new relations and to relate certain domains of ories deduced from old theories using CP? Can old
applicability to other domains. theories really be preserved? Or what, if anything,
can be preserved from the old theories? What about
incommensurability between the new and the old?
Philosophical Implications
How can we look at correspondence in light of
Usually, principles in science, such as the Archi- Kuhn’s concept of scientific revolution? What hap-
medes principle, are universally accepted. This is pens when there is a paradigm shift? All these ques-
not the case with CP. Although CP is considered tions are in fact related to our interpretation of CP.
by most physicists to be a good heuristic device, it CP, realists say, would help us in understanding
is not accepted across the board. There are two developments in science as miracle free (the no-
positions: The first thinks of development in sci- miracle argument). Nevertheless, by accepting CP
ence as a mere trial-and-error process; the second as a principle that new theories should uphold, we
thinks that science is progressive and mirrors real- in effect are trapped within the scope of old theo-
ity, and therefore new theories cannot cast away ries. This means that if our original line of reason-
old, successful theories but merely limit the old ing was wrong and still explains the set of
ones to certain boundaries. observations that we obtained, then the latter the-
Max Born, for example, believed that scientists ory that obeys CP will resolve the problems of the
depend mainly on trial and error in a shattered old theory within a certain limit, will no doubt
jungle, where they do not have any signposts in continue to hold the posits of the wrong theory,
science, and it is all up to them to discover new and will continue to abide by its accepted bound-
roads in science. He advised scientists to rely not ary conditions. This means that we will not be able
on ‘‘abstract reason’’ but on experience. However, to see where old science went wrong. In reply, con-
to accept CP means that we accept abstract reason; ventional realists and structural realists would
it also means that we do not depend on trial and argue that the well-confirmed old theories are
error but reason from whatever accepted knowl- good representations of nature, and hence any new
edge we have to arrive at new knowledge. theories should resemble them at certain limits.
The philosophical front is much more complex. Well, this is the heart of the matter. Even if old the-
There are as many positions regarding correspon- ories are confirmed by experimental evidence, this
dence as there are philosophers writing on the is not enough to claim that the abstract theory is
correct. Why? Mathematically speaking, if we the old theory as a whole; we can save only the
have any finite set of observations, then there are representative part. Structural realists, such as
many possible mathematical models that can John Worrell and Elie Zahar, claim that only the
describe this set. Hence, how can we determine mathematical structure need be saved and that CP
that the model that was picked by the old science is capable of assisting us in saving it. Philip Kitcher
was the right one? asserts that only presupposition posits can survive.
But even if we accept CP as a heuristic device, Towfic Shomar claims that the dichotomy should
there are many ways that the concept can be be horizontal rather than vertical and that the only
applied. Each of these ways has a different set of parts that would survive are the phenomenological
problems for realists, and it is not possible to models (phenomenological realism). Stathis Psillos
accept any generalized form of correspondence. claims that scientific theories can be divided into
The realist position was challenged by many phi- two parts, one consisting of the claims that con-
losophers. Kuhn proved that during scientific revo- tributed to successes in science (working postu-
lutions the new science adopts a new paradigm in lates) and the other consisting of idle components.
which the wordings of the old science might con- Hans Radder, following Roy Bhaskar, thinks
tinue, but with different meanings. He demon- that progress in science is like a production line:
strated such a change with mass: The concept of There are inputs and outputs; hence our old
mass in relativity is not the same as Newtonian knowledge of theories and observations is the
mass. Feyerabend asserted that the changes between input that dictates the output (our new theories).
new science and old science make them incommen- CP is important in the process; it is a good heuris-
surable with each other. Hence, the realist notion of tic device, but it is not essential, and in many cases
approximating new theories to old ones is going it does not work.
beyond the accepted limits of approximation. But is CP a necessary claim for all kinds of real-
The other major recent attacks on realism come ism to account for developments in science? Some,
from pessimistic metainduction (Larry Laudan) on including Shomar, do not think so. Nancy Cart-
one hand and new versions of empiricist arguments wright accepts that theories are mere tools; she
(Bas van Fraassen) on the other. Van Fraassen thinks that scientific theories are patchwork that
defines his position as constructive empiricism. Lau- helps in constructing models that represent different
dan relies on the history of science to claim that the parts of nature. Some of these models depend on
realists’ explanation of the successes of science does tools borrowed from quantum mechanics and
not hold. He argues that the success of theories can- account for phenomena related to the microscopic
not offer grounds for accepting that these theories world; others use tools from classical mechanics
are true (or even approximately true). He presents and account for phenomena in the macroscopic
a list of theories that have been successful and yet world. There is no need to account for any connec-
are now acknowledged to be false. Hence, he con- tion between these models. Phenomenological real-
cludes, depending on our previous experience with ism, too, takes theories as merely tools to construct
scientific revolutions, the only reasonable induction phenomenological models that are capable of repre-
would be that it is highly probable that our current senting nature. In that case, whether the fundamen-
successful theories will turn out to be false. Van tal theories correspond to each other to some extent
Fraassen claims that despite the success of theories or not is irrelevant. The correspondence of theories
in accounting for phenomena (their empirical ade- concerns realists who think that fundamental theo-
quacy), there can never be any grounds for believ- ries represent nature and approximate its blueprint.
ing any claims beyond those about what is Currently, theoretical physics is facing a dead-
observable. That is, we cannot say that such theo- lock; as Lee Smolin and Peter Woit have argued,
ries are real or that they represent nature; we can the majority of theoretical physicists are running
only claim that they can account for the observed after the unification of all forces and laws of phys-
phenomena. ics. They are after the theory of everything. They
Recent trends in realism tried to salvage realism are convinced that science is converging toward
from these attacks, but most of these trends a final theory that represents the truth about
depend on claiming that we do not need to save nature. They are in a way in agreement with the
realists, who hold that successive theories of Krajewski, W. (1977). Correspondence principle and
‘‘mature science’’ approximate the truth more and growth in science. Dordrecht, the Netherlands: Reidel.
more, so science should be in quest of the final the- Liboff, R. (1975). Bohr’s correspondence principle for
ory of the final truth. large quantum numbers. Foundations of Physics, 5(2),
271–293.
Theoretical representation might represent the
Liboff, R. (1984). The correspondence principle revisited.
truth about nature, but we can easily imagine that Physics Today, February, 50–55.
we have more than one theory to depend on. Makowski, A. (2006). A brief survey of various
Nature is complex, and in light of the richness of formulations of the correspondence principle.
nature, which is reflected in scientific practice, one European Journal of Physics, 27(5), 1133–1139.
may be unable to accept that Albert Einstein’s Radder, H. (1991). Heuristics and the generalized
request for simplicity and beauty can give the cor- correspondence principle. British Journal for the
rect picture of current science when complexity and Philosophy of Science, 42, 195–226.
diversity appear to overshadow it. The complexity Shomar, T. (2001). Structural realism and the
correspondence principle. Proceedings of the
of physics forces some toward a total disagreement
conference on Mulla Sadra and the world’s
with Einstein’s dream of finding a unified theory for
contemporary philosophy, Kish, Iran: Mulla Sudra
everything. To some, such a dream directly contra- Institute.
dicts the accepted theoretical representations of Zahar, E. (1988). Einstein’s revolution: A study in
physics. Diversity and complexity are the main heuristics. LaSalle, IL: Open Court.
characteristics of such representations.
Nonetheless, CP is an important heuristic device
that can help scientists arrive at new knowledge,
but scientists and philosophers should be careful
as to how much of CP they want to accept. As COVARIATE
long as they understand and accept that there is
more than one version of CP and as long as they Similar to an independent variable, a covariate is
accept that not all new theories can, even in princi- complementary to the dependent, or response, var-
ple, revert to old theories at a certain point, then iable. A variable is a covariate if it is related to the
they might benefit from applying CP. One other dependent variable. According to this definition,
remark of caution: Scientists and philosophers also any variable that is measurable and considered to
need to accept that old theories might be wrong; have a statistical relationship with the dependent
the wrong mathematical form may have been variable would qualify as a potential covariate. A
picked, and if they continue to accept such a form, covariate is thus a possible predictive or explana-
they will continue to uphold a false science. tory variable of the dependent variable. This may
be the reason that in regression analyses, indepen-
Towfic Shomar dent variables (i.e., the regressors) are sometimes
called covariates. Used in this context, covariates
See also Frequency Distribution; Models; Paradigm;
are of primary interest. In most other circum-
Positivism; Theory
stances, however, covariates are of no primary
interest compared with the independent variables.
Further Readings They arise because the experimental or observa-
tional units are heterogeneous. When this occurs,
Fadner, W. L. (1985). Theoretical support for the their existence is mostly a nuisance because they
generalized correspondence principle, American may interact with the independent variables to
Journal of Physics, 53, 829–838. obscure the true relationship between the depen-
French, S., & Kamminga, H. (Eds.). (1993).
dent and the independent variables. It is in this cir-
Correspondence, invariance and heuristics: Essays in
honour of Heinz Post. Dordrecht, the Netherlands:
cumstance that one needs to be aware of and
Kluwer Academic. make efforts to control the effect of covariates.
Hartmann, S. (2002). On correspondence. Studies in Viewed in this context, covariates may be called
History & Philosophy of Modern Physics, 33B, by other names, such as concomitant variables,
79–94. auxiliary variables, or secondary variables. This
entry discusses methods for controlling the effects treatments. Under such circumstances, their value
of covariates and provides examples. is often observed, together with the value of the
dependent variables. The observation can be made
either before, after, or during the experiment,
Controlling Effects of Covariates depending on the nature of the covariates and their
influence on the dependent variables. The value of
Research Design
a covariate may be measured prior to the adminis-
Although covariates are neither the design vari- tration of experimental treatments if the status of
able (i.e., the independent variable) nor the pri- the covariate before entering into the experiment
mary outcome (e.g., the dependent variable) in is important or if its value changes during the
research, they are still explanatory variables that experiment. If the covariate is not affected by the
may be manipulated through experiment design so experimental treatments, it may be measured after
that their effect can be eliminated or minimized. the experiment. The researcher, however, should
Manipulation of covariates is particularly popular be mindful that measuring a covariate after an
in controlled experiments. Many techniques can experiment is done carries substantial risks unless
be used for this purpose. An example is to fix the there is strong evidence to support such an
covariates as constants across all experimental assumption. In the hypothetical nutrition study
treatments so that their effects are exerted uni- example given below, the initial height and weight
formly and can be canceled out. Another technique of pupils are not covariates that can be measured
is through randomization of experimental units after the experiment is carried out. The reason is
when assigning them to the different experimental that both height and weight are the response vari-
treatments. Key advantages of randomization are ables of the experiment, and they are influenced by
(a) to control for important known and unknown the experimental treatments. In other circum-
factors (the control for unknown factors is espe- stances, the value of the covariate is continuously
cially significant) so that all covariate effects are monitored, along with the dependent variable, dur-
minimized and all experimental units are statisti- ing an experiment. An example may be the yearly
cally comparable on the mean across treatments, mean of ocean temperatures in a long-term study
(b) to reduce or eliminate both intentional and by R. J. Beamish and D. R. Bouillon of the rela-
unintentional human biases during the experiment, tionship between the quotas of salmon fish har-
and (c) to properly evaluate error effects on the vested in the prior year and the number of salmon
experiment because of the sound probabilistic the- fish returned to the spawning ground of the rivers
ory that underlies the randomization. Randomiza- the following year, as prior research has shown
tion can be done to all experimental units at once that ocean temperature changes bear considerable
or done to experimental units within a block. influence on the life of salmon fish.
Blocking is a technique used in experimental
design to further reduce the variability in experi-
mental conditions or experimental units. Experi-
Statistical Analysis
mental units are divided into groups called blocks,
and within a group, experimental units (or condi- After the covariates are measured, a popular
tions) are assumed to be homogeneous, although statistical procedure, the analysis of covariance
they differ between groups. (ANCOVA), is then used to analyze the effect of
However ideal, there is no guarantee that ran- the design variables on the dependent variable by
domization eliminates all covariate effects. Even if explicitly incorporating covariates into the analyti-
it could remove all covariate effects, randomiza- cal model. Assume that an experiment has n design
tion may not always be feasible due to various variables and m covariates; a proper statistical
constraints in an experiment. In most circum- model for the experiment would be
stances, covariates, by their nature, are not con-
trollable through experiment designs. They are X
m

therefore not manipulated and allowed to vary yij ¼ μ þ ti þ βk ðxkij  xk:: Þ þ εij ð1Þ
naturally among experimental units across k¼1
where yij is the jth measurement on the dependent model is (a) to control for potential confounding,
variable (i.e., the primary outcome) in the ith treat- (b) to improve comparison across treatments, (c)
ment; μ is the overall mean; ti is the ith design var- to assess model adequacy, and (d) to expand the
iable (often called treatment in experiment design); scope of inference. The last point is supported by
xkij is the measurement on the kth covariate corre- the fact that the experiment is conducted in a more
sponding to yij; xk :: is the mean of the xkij values; realistic environment that allows a covariate to
βk is a linear (partial) regression coefficient for the change naturally, instead of being fixed at a few
kth covariate, which emphasizes the relationship artificial levels, which may or may not be represen-
between xij and yij ; and εij is a random variable that tative of the true effect of the covariate on the
follows a specific probability distribution with zero dependent variable.
mean. An inspection of this ANCOVA model From what has been discussed so far, it is clear
reveals that it is in fact an integration of an analysis that there is no fixed rule on how covariates
of variance (ANOVA) model with an ANOVA of should be dealt with in a study. Experiment con-
regression model. The regression part is on the trol can eliminate some obvious covariates, such
covariates (recall that regressors are sometimes as ethnicity, gender, and age in a research on
called covariates). A test on the hypothesis human subjects, but may not be feasible in all cir-
H0 : βk ¼ 0 confirms or rejects the null hypothesis cumstances. Statistical control is convenient, but
of no effect of covariates on the response variable. covariates need to be measured and built into
If no covariate effects exist, Equation 1 is then a model. Some covariates may not be observable,
reduced to an ordinary ANOVA model. and only so many covariates can be accommo-
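As a concrete illustration of Equation 1, the following minimal sketch fits an ANCOVA with one treatment factor and one centered covariate using the Python statsmodels package. The data, effect sizes, and variable names are invented for the example and are not part of the original entry.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated data: one design variable (treatment) and one covariate x.
rng = np.random.default_rng(0)
n_per_group = 30
treatment = np.repeat(["control", "diet_a", "diet_b"], n_per_group)
x = rng.normal(50, 10, size=treatment.size)          # covariate, e.g., initial weight
effect = {"control": 0.0, "diet_a": 2.0, "diet_b": 3.5}
y = (10 + 0.4 * x
     + np.array([effect[t] for t in treatment])
     + rng.normal(0, 2, size=treatment.size))        # dependent variable

df = pd.DataFrame({"y": y, "treatment": treatment, "x_centered": x - x.mean()})

# ANCOVA: treatment effect adjusted for the (centered) covariate, as in Equation 1.
ancova = smf.ols("y ~ C(treatment) + x_centered", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))              # tests treatment and covariate effects

# Homogeneity of regression slopes: the treatment-by-covariate interaction
# should be nonsignificant for the adjusted treatment means to be interpretable.
slopes = smf.ols("y ~ C(treatment) * x_centered", data=df).fit()
print(sm.stats.anova_lm(slopes, typ=2))

When the interaction test in the second table is clearly nonsignificant, the simpler additive model and its adjusted treatment means can be interpreted in the way described in this entry.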
Before Equation 1 can be used, though, one needs dated in a model. Omission of a key covariate
to ensure that the assumption of homogeneity of could lead to severely biased results. Circum-
regression slopes is met. Tests must be carried out on stances, therefore, dictate whether to control the
the hypothesis that β1 ¼ β2 ¼    ¼ βk ¼ 0. In effect of covariates through experiment design
practice, this is equivalent to finding no interaction measures or to account for their effect in the data
between the covariates and the independent vari- analysis step. A combination of experimental and
ables on the dependent variable. Without meeting statistical control is often required in certain cir-
this condition, tests on the adjusted means of the cumstances. Regardless of what approach is taken,
treatments are invalid. The reason is that when the the ultimate goal of controlling the covariate effect
slopes are different, the response of the treatments is to reduce the experimental error so that the
varies at the different levels of the covariates. Conse- treatment effect of primary interest can be eluci-
quently, the adjusted means do not adequately dated without the interference of covariates.
describe the treatment effects, potentially resulting in
misleading conclusions. If tests indeed confirm het-
Examples
erogeneity of slopes, alternative methods can be
sought in lieu of ANCOVA, as described by Bradley To illustrate the difference between covariate and
Eugene Huitema. Research has repeatedly demon- independent variables, consider an agricultural
strated that even randomized controlled trials can experiment on the productivity of two wheat vari-
benefit from ANCOVA in uncovering the true rela- eties under two fertilization regimes in field condi-
tionship between the independent and the dependent tions, where productivity is measured by tons of
variables. wheat grains produced per season per hectare
It must be pointed out that regardless of how (1 hectare ¼ 2.47 acres). Although researchers
the covariates are handled in a study, their value, can precisely control both the wheat varieties and
like that of the independent variables, is seldom the fertilization regimes as the primary indepen-
analyzed separately because the covariate effect is dent variables of interest in this specific study, they
of no primary interest. Instead, a detailed descrip- are left to contend with the heterogeneity of soil
tion of the covariates is often given to assist the fertility, that is, the natural micro variation in soil
reader in evaluating the results of a study. Accord- texture, soil structure, soil nutrition, soil water
ing to Fred Ramsey and Daniel Schafer, the incor- supply, soil aeration, and so on. These variables
poration of covariates in an ANCOVA statistical are natural phenomena that are beyond the control
of the researchers. They are the covariates. By of no primary interest in an investigation but are
themselves alone, these covariates can influence nuisances that must be dealt with. Various control
the productivity of the wheat varieties. Left unac- measures are placed on them at either the experi-
counted for, these factors will severely distort the ment design or the data analysis step to minimize
experimental results in terms of wheat produced. the experimental error so that the treatment effects
The good news is that all these factors can be mea- on the major outcome can be better understood.
sured accurately with modern scientific instru- Without these measures, misleading conclusions
ments or methodologies. With their values known, may result, particularly when major covariates are
these variables can be incorporated into an not properly dealt with.
ANCOVA model to account for their effects on Regardless of how one decides to deal with the
the experimental results. covariate effects, either by experimental control or
In health research, suppose that investigators by data analysis techniques, one must be careful
are interested in the effect of nutrition on the not to allow the covariate to be affected by the
physical development of elementary school chil- treatments in a study. Otherwise, the covariate
dren between 6 and 12 years of age. More spe- may interact with the treatments, making a full
cifically, the researchers are interested in the accounting of the covariate effect difficult or
effect of a particular commercial dietary regime impossible.
that is highly promoted in television commer-
cials. Here, physical development is measured by Shihe Fan
both height and weight increments without the
See also Analysis of Covariance (ANCOVA);
implication of being obese. In pursuing their
Independent Variable; Randomization Tests
interest, the researchers choose a local boarding
school with a reputation for excellent healthy
nutritional programs. In this study, they want to
Further Readings
compare this commercial dietary regime with the
more traditional diets that have been offered by Beach, M. L., & Meier, P. (1989). Choosing covariates in
the school system for many years. They have the analysis of clinical trials. Controlled Clinical
done all they can to control all foreseeable Trials, 10, S161–S175.
potential confounding variables that might inter- Beamish, R. J., & Bouillon, D. R. (1993). Pacific salmon
fere with the results on the two dietary regimes. production trends in relation to climate. Canadian
Journal of Fisheries and Aquatic Sciences, 50, 1002–
However, they are incapable of controlling the
1016.
initial height and weight of the children in the Cox, D. R., & McCullagh. P. (1982). Some aspects of
study population. Randomized assignments of analysis of covariance. Biometrics, 38, 1–17.
participating pupils to the treatment groups may Huitema, B. E. (1980). The analysis of covariance and
help minimize the potential damage that the nat- alternatives. Toronto, Canada: Wiley.
ural variation in initial height and weight may Montgomery, D. C. (2001). Design and analysis of
cause to the validity of the study, but they are experiments (5th ed.). Toronto, Canada: Wiley.
not entirely sure that randomization is the right Ramsey, F. L., & Schafer, D. W. (2002). The statistical
answer to the problem. These initial heights and sleuth: A course in methods of data analysis (2nd ed.).
Pacific Grove, CA: Duxbury.
weights are the covariates, which must be mea-
Zhang, M., Tsiatis, A. A., & Davidian, M. (2008).
sured before (but not after) the study and
Improving efficiency of inferences in randomized
accounted for in an ANCOVA model in order clinical trials using auxiliary covariates. Biometrics,
for the results on dietary effects to be properly 64, 707–715.
interpreted.

Final Thoughts
Covariates are explanatory variables that exist naturally within research units. What differentiates them from independent variables is that they are

C PARAMETER

See Guessing Parameter
identifying which types of behaviors are important.

Thomas Taber and Judith Hackman provided

CRITERION PROBLEM
problem in regression analysis, especially when mance behaviors relating to effective perfor-
used for selection, when the criterion measure mance in undergraduate college students. After
that is easily obtainable is not a good approxi- surveying many university students, staff, and
mation of the actual criterion of interest. In faculty members, they identified 17 areas of
other words, the criterion problem refers to the student performance and two broad categories:
problem that measures of the criterion perfor- academic performance and nonacademic perfor-
mance behaviors in which the researcher or prac- mance. The academic performance factors, such
titioner is interested in predicting are not readily as cognitive proficiency, academic effort, and
available. For example, although specific sets of intellectual growth, are reasonably measured in
performance behaviors are desirable in academic the easily obtained overall GPA; these factors
and employment settings, often the easily obtain- are also predicted fairly well by the traditional
able measures are 1st-year grade point averages variables that predict GPA (e.g., standardized
(GPAs) and supervisory ratings of job perfor- tests, prior GPA). The nonacademic perfor-
mance. The criterion problem is that those easily mance factors, on the other hand, are not cap-
obtainable measures (GPA and supervisory per- tured well in the measurement of GPA. These
formance ratings) are not good measures of factors include ethical behavior, discrimination
important performance behaviors. The criterion issues, and personal growth, and none of these
is said to be deficient if important performance are well predicted by traditional predictor
behaviors are not captured in a particular crite- variables.
rion measure. It can also be considered contami- In more recent research, Frederick Oswald
nated if the criterion measure also assesses and colleagues modeled undergraduate student
things that are unrelated to the performance performance for the purposes of scale construc-
behaviors of interest. This entry explores the cri- tion. By examining university mission state-
terion problem in both academic and employ- ments across a range of colleges to determine
ment settings, describes how the criterion the student behaviors in which stakeholders are
problem can be addressed, and examines the ultimately interested, they identified 12 perfor-
implications for selection research. mance factors that can be grouped into three
broad categories: intellectual, interpersonal, and
intrapersonal. Intellectual behaviors are best
captured with GPA and include knowledge,
interest in learning, and artistic appreciation.
Criterion Problem in Academic Settings
Interpersonal behaviors include leadership,
In academic settings, there are two typical and interpersonal skills, social responsibility, and
readily available criterion variables: GPA and multicultural tolerance. Intrapersonal behaviors
student retention (whether a student has include health, ethics, perseverance, adaptabil-
remained with the university). Although each of ity, and career orientation. However, GPA does
these criteria is certainly important, most not measure the interpersonal or intrapersonal
researchers agree that there is more to being behaviors particularly well. It is interesting to
a good student than simply having good grades note that this research also showed that while
and graduating from the university. This is the traditional predictors (e.g., standardized tests,
essence of the criterion problem in academic set- prior GPA) predict college GPA well, they do
tings: the specific behaviors that persons on the not predict the nonintellectual factors; noncog-
admissions staff would like to predict are not nitive variables, such as personality and scales
captured well in the measurement of the readily developed to assess these performance dimen-
available GPA and retention variables. The first sions, are much better predictors of the nonin-
step in solving the criterion problem, then, is tellectual factors.
Criterion Problem in Employment Settings performance, supervision or leadership, and man-


agement or administrative duties. Two key ele-
In employment settings, the criterion problem ments here are of particular importance to the
takes a different form. One readily available crite- criterion problem. First, these are eight factors of
rion for many jobs is the so-called objective perfor- performance behaviors, specifically limiting the cri-
mance indicator. In manufacturing positions, for terion to aspects that are under the direct control
example, the objective performance indicator of the employee and eliminating any contaminat-
could be number of widgets produced; in sales, it ing factors. Second, these eight factors are meant
could be the dollar amount of goods sold. With to be exhaustive of all types of job performance
these measures of performance, the criterion prob- behaviors; in other words, any type of behavior
lem is one of contamination; external factors not relating to an employee’s performance can be
under the control of the employee also heavily grouped under one of these eight factors. Later
influence the scores on these metrics. In models of performance group these eight factors
manufacturing positions, the availability or main- under three broad factors of task performance (the
tenance of the equipment can have a substantial core tasks relating to the job), contextual perfor-
influence on the number of widgets produced; mance (interpersonal aspects of the job), and coun-
these outside influences are entirely independent of terproductive behaviors (behaviors going against
the behavior of the employee. In sales positions, organizational policies).
season or differences in sales territory can lead to
vast differences in the dollar amount of goods
Measurement of Performance
sold; again, these outside influences are outside the
behavioral control of the employee. Such outside Solutions to the criterion problem are difficult to
influences are considered criterion contamination come by. The data collection efforts that are
because although they are measured with the crite- required to overcome the criterion problem are
rion, they have nothing to do with the behaviors typically time, effort, and cost intensive. It is for
of interest to the organization. Unfortunately, most those reasons that the criterion problem still exists,
of these objective criterion measures suffer from despite the fact that theoretical models of perfor-
contamination without possibility of better mea- mance are available to provide a way to assess the
surement for objective metrics. actual criterion behaviors of interest. For those
Another form of the criterion problem in interested in addressing the criterion problem, sev-
employment settings concerns supervisory ratings eral broad steps must be taken. First, there must
of job performance. These supervisory ratings are be some identification of the performance beha-
by far the most common form of performance viors that are of interest in the particular situation.
evaluation in organizations. In these situations, the These can be identified from prior research on per-
criterion problem concerns what, exactly, is being formance modeling in the area of interest, or they
measured when a supervisor is asked to evaluate can be developed uniquely in a particular research
the performance of an employee. Although this context (e.g., through the use of job analysis). Sec-
problem has been the subject of much debate and ond, a way to measure the identified performance
research for the better part of a century, some behaviors must be developed. This can take the
recent research has shed substantial light on the form of a standardized questionnaire, a set of
issue. interview questions, and so forth. Also, if perfor-
John Campbell and colleagues, after doing mance ratings are used as a way to measure the
extensive theoretical, exploratory, statistical, and performance behaviors, identification of the appro-
empirical work in the military and the civilian priate set of raters (e.g., self, peer, supervisor, fac-
workforce, developed an eight-factor taxonomy of ulty) for the research context would fall under this
job performance behaviors. These eight factors step. The third and final broad step is the actual
included job-specific task proficiency, non–job- data collection on the criterion. Often, this is the
specific task proficiency, written and oral commu- most costly step (in terms of necessary resources)
nication, demonstration of effort, maintenance of because prior theoretical and empirical work can
personal discipline, facilitation of peer and team be relied on for the first two steps. To complete the
collection of criterion data, the researcher must What is clear is that unless attention is paid to
then gather the appropriate performance data the criterion problem, the resulting selection sys-
from the most appropriate source (e.g., archival tem will likely be problematic as well. When the
data, observation, interviews, survey or question- measured criterion is deficient, the selection system
naire responses). Following these broad steps will miss important predictors unless the unmea-
enables the researcher to address the criterion sured performance factors have the exact same
problem and develop a criterion variable much determinants as the performance factors captured
more closely related to the actual behaviors of in the measured criterion variable. When the mea-
interest. sured criterion is contaminated, relationships
between the predictors and the criterion variable
will be attenuated, as the predictors are unrelated
to the contaminating factor; when a researcher
Implications for Selection Research must choose a small number of predictors to be
included in the selection system, criterion contami-
The criterion problem has serious implications for nation can lead to useful predictors erroneously
research into the selection of students or employ- being discarded from the final set of predictors.
ees. If predictor variables are chosen on the basis Finally, when a criterion variable is simply avail-
of their relations with the easily obtainable crite- able without a clear understanding of what was
rion measure, then relying solely on that easily measured, it is extremely difficult to choose a set
available criterion variable means that the selec- of predictors.
tion system will select students or employees on
the basis of how well they will perform on the Matthew J. Borneman
behaviors captured in the easily measured criterion
See also Construct Validity; Criterion Validity; Criterion
variable, but not on the other important behaviors.
Variable; Dependent Variable; Selection; Threats to
In other words, if the criterion is deficient, then
Validity; Validity of Research Conclusions
there is a very good chance that the set of predic-
tors in the selection system will also be deficient.
Leaetta Hough and Frederick Oswald have out-
lined this problem in the employment domain with
Further Readings
respect to personality variables. They showed that
if careful consideration is not paid to assessing the Austin, J. T., & Villanova, P. (1992). The criterion
performance domains of interest (i.e., the criterion problem: 1917–1992. Journal of Applied Psychology,
ends up being deficient), then important predictors 77, 836–874.
can be omitted from the selection system. Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996).
The substantive nature of job performance variability.
The criterion problem has even more severe
In K. R. Murphy (Ed.), Individual differences and
implications for selection research in academic set- behavior in organizations (pp. 258–299). San
tings. The vast majority of the validation work has Francisco: Jossey-Bass.
been done using GPA as the ultimate criterion of Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager,
interest. Because GPA does not capture many of C. E. (1993). A theory of performance. In N. Schmitt
the inter- and intrapersonal nonintellectual perfor- & W. C. Borman (Eds.), Personnel selection (pp. 35–
mance factors that are considered important for 70). San Francisco: Jossey-Bass.
college students, the selection systems for college Hough, L. M., & Oswald, F. L. (2008). Personality
admissions are also likely deficient. Although it is testing and industrial-organizational psychology:
certainly true that admissions committees use the Reflections, progress, and prospects. Industrial &
Organizational Psychology: Perspectives on Science
available information (e.g., letters of recommenda-
and Practice, 1, 272–290.
tion, personal statements) to try to predict these Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., &
other performance factors that GPA does not Gillespie, M. A. (2004). Developing a biodata measure
assess, the extent to which the predictors relate to and situational judgment inventory as predictors of
these other performance factors is relatively college student performance. Journal of Applied
unknown. Psychology, 89, 187–207.
Viswesvaran, C., & Ones, D. S. (2000). Perspectives on research designs that assess criterion validity, effect
models of job performance. International Journal of sizes, and concerns that may arise in applied selec-
Selection & Assessment, 8, 216–226. tion are discussed.
Taber, T. D., & Hackman, J. D. (1976). Dimensions of
undergraduate college performance. Journal of
Applied Psychology, 61, 546–558. Nature of the Criterion
Again, the term criterion validity typically refers to
a specific predictor measure, often with the crite-
rion measure assumed. Unfortunately, this intro-
CRITERION VALIDITY duces substantial confusion into the procedure of
criterion validation. Certainly, a single predictor
Also known as criterion-related validity, or some- measure can predict an extremely wide range of
times predictive or concurrent validity, criterion criteria, as Christopher Brand has shown with gen-
validity is the general term to describe how well eral intelligence, for example. Using the same
scores on one measure (i.e., a predictor) predict example, the criterion validity estimates for gen-
scores on another measure of interest (i.e., the cri- eral intelligence vary quite a bit; general intelli-
terion). In other words, a particular criterion or gence predicts some criteria better than others.
outcome measure is of interest to the researcher; This fact further illustrates that there is no single
examples could include (but are not limited to) rat- criterion validity estimate for a single predictor.
ings of job performance, grade point average Additionally, the relationship between one predic-
(GPA) in school, a voting outcome, or a medical tor measure and one criterion variable can vary
diagnosis. Criterion validity, then, refers to the depending on other variables (i.e., moderator vari-
strength of the relationship between measures ables), such as situational characteristics, attributes
intended to predict the ultimate criterion of inter- of the sample, and particularities of the research
est and the criterion measure itself. In academic design. Issues here are highly related to the crite-
settings, for example, the criterion of interest may rion problem in predictive validation studies.
be GPA, and the predictor being studied is the
score on a standardized math test. Criterion valid-
Research Design
ity, in this context, would be the strength of the
relationship (e.g., the correlation coefficient) There are four broad research designs to assess the
between the scores on the standardized math test criterion validity for a specific predictor: predictive
and GPA. validation, quasi-predictive validation, concurrent
Some care regarding the use of the term crite- validation, and postdictive validation. Each of
rion validity needs to be employed. Typically, the these is discussed in turn.
term is applied to predictors, rather than criteria;
researchers often refer to the ‘‘criterion validity’’ of
Predictive Validation
a specific predictor. However, this is not meant to
imply that there is only one ‘‘criterion validity’’ When examining the criterion validity of a spe-
estimate for each predictor. Rather, each predictor cific predictor, the researcher is often interested in
can have different ‘‘criterion validity’’ estimates for selecting persons based on their scores on a predic-
many different criteria. Extending the above exam- tor (or set of predictor measures) that will predict
ple, the standardized math test may have one crite- how well the people will perform on the criterion
rion validity estimate for overall GPA, a higher measure. In a true predictive validation design, pre-
criterion validity estimate for science ability, and dictor measure or measures are administered to
a lower criterion validity estimate for artistic a set of applicants, and the researchers select appli-
appreciation; all three are valid criteria of interest. cants completely randomly (i.e., without regard to
Additionally, each of these estimates may be mod- their scores on the predictor measure or measures.)
erated by (i.e., have different criterion validity esti- The correlation between the predictor measure(s)
mates for) situational, sample, or research design and the criterion of interest is the index of criterion
characteristics. In this entry the criterion, the validity. This design has the advantage of being free
from the effects of range restriction; however, it is situation when the manner in which the incum-
an expensive design, and unfeasible in many situa- bents were selected is completely unrelated to
tions, as stakeholders are often unwilling to forgo scores on the predictor or predictors).
selecting on potentially useful predictor variables. Another potential concern regarding concurrent
validation designs is the motivation of test takers.
This is a major concern for noncognitive assess-
Quasi-Predictive Validation
ments, such as personality tests, survey data, and
Like a true predictive validation design, in background information. Collecting data on these
a quasi-predictive design, the researcher is inter- types of assessments in a concurrent validation
ested in administering a predictor (or set of predic- design provides an estimate of the maximum crite-
tors) to the applicants in order to predict their rion validity for a given assessment. This is because
scores on a criterion variable of interest. Unlike incumbents, who are not motivated to alter their
a true predictive design, in a quasi-predictive vali- scores in order to be selected, are assumed to be
dation design, the researcher will select applicants answering honestly. However, there is some con-
based on their scores on the predictor(s). As cern for intentional distortion in motivated testing
before, the correlation between the predictor(s) sessions (i.e., when applying for a job or admittance
and the criterion of interest is the index of criterion to school), which can affect criterion validity esti-
validity. However, in a quasi-predictive design, the mates. As such, one must take care when interpret-
correlation between the predictor and criterion ing criterion validity estimates in this type of
will likely be smaller because of range restriction design. If estimates under operational selection set-
due to selection on the predictor variables. Cer- tings are of interest (i.e., when there is some moti-
tainly, if the researcher has a choice between a pre- vation for distortion), then criterion validity
dictive and quasi-predictive design, the predictive estimates from a predictive or quasi-predictive
design would be preferred because it provides design are of interest; however, if estimates of maxi-
a more accurate estimate of the criterion validity mal criterion validity for the predictor(s) are of
of the predictor(s); however, quasi-predictive interest, then a concurrent design is appropriate.
designs are far more common. Although quasi-pre-
dictive designs typically suffer from range restric-
Postdictive Validation
tion problems, they have the advantage of
allowing the predictors to be used for selection Postdictive validation is an infrequently used
purposes while researchers obtain criterion validity design to assess criterion validity. At its basics,
estimates. postdictive validation assesses the criterion vari-
able first and then subsequently assesses the predic-
tor variable(s). Typically, this validation design is
Concurrent Validation
not employed because the predictor variable(s), by
In a concurrent validation design, the predic- definition, come temporally before the criterion
tor(s) of interest to the researcher are not adminis- variable is assessed. However, a postdictive valida-
tered to a set of applicants; rather, they are tion design can be especially useful, if not the only
administered only to the incumbents, or people alternative, when the criterion variable is rare or
who have already been selected. The correlation unethical to obtain. Such examples might include
between the scores on the predictors and the crite- criminal activity, abuse, or medical outcomes. In
rion measures for the incumbents serves as the cri- rare criterion instances, it is nearly impossible to
terion validity estimate for that predictor or set of know when the outcome will occur; as such, the
predictors. This design has several advantages, predictors are collected after the fact to help pre-
including cost savings due to administering the dict who is at risk for the particular criterion vari-
predictors to fewer people and reduced time to col- able. In other instances when it is extremely
lection of the criterion data. However, there are unethical to collect data on the criterion of interest
also some disadvantages, including the fact that (e.g., abuse), predictor variables are collected after
criterion validity estimates are likely to be smaller the fact in order to determine who might be at risk
as a result of range restriction (except in the rare for those criterion variables. Regardless of the
Criterion Validity 293

reason for the postdictive design, people who met Statistical Artifacts
or were assessed on the criterion variable are
Unfortunately, several statistical artifacts can
matched with other people who were not, typically
have dramatic effects on criterion validity esti-
on demographic and/or other variables. The rela-
mates, with two of the most common being mea-
tionship between the predictor measures and the
surement error and range restriction. Both of these
criterion variable assessed for the two groups
(in most applications) serve to lower the observed
serves as the estimate of criterion validity.
relationships from their true values. These effects
are increasingly important when one is comparing
the criterion validity of multiple predictors.
Effect Sizes
Range Restriction
Any discussion of criterion validity necessarily
involves a discussion of effect sizes; the results of Range restriction occurs when there is some
a statistical significance test are inappropriate to mechanism that makes it more likely for people
establish criterion validity. The question of interest with higher scores on a variable to be selected than
in criterion validity is, To what degree are the pre- people with lower scores. This is common in aca-
dictor and criterion related? or How well does the demic or employee selection as the scores on the
measure predict scores on the criterion variable? administered predictors (or variables related to
instead of, Are the predictor and criterion related? those predictors) form the basis of who is admitted
Effect sizes address the former questions, while sig- or hired. Range restriction is common in quasi-
nificance testing addresses the latter. As such, effect predictive designs (because predictor scores are
sizes are necessary to quantify how well the predic- used to select or admit people) and concurrent
tor and criterion are related and to provide a way designs (because people are selected in a way that
to compare the criterion validity of several differ- is related to the predictor variables of interest in
ent predictors. the study). For example, suppose people are hired
The specific effect size to be used is dependent into an organization on the basis of their interview
on the research context and types of data being scores. The researcher administers another poten-
collected. These can include (but are not limited tial predictor of the focal criterion in a concurrent
to) odds ratios, correlations, and standardized validation design. If the scores on this new predic-
mean differences. For the purposes of explanation, tor are correlated with scores on the interview,
it is assumed that there is a continuous predictor then range restriction will occur. True predictive
and a continuous criterion variable, making the validation designs are free from range restriction
correlation coefficient the appropriate measure of because either no selection occurs or selection
effect size. In this case, the correlation between occurs in a way uncorrelated with the predictors.
a given predictor and a specific criterion serves as In postdictive validation designs, any potential
the estimate of criterion validity. Working in the range restriction is typically controlled for in the
effect size metric has the added benefit of permit- matching scenario.
ting comparisons of criterion validity estimates for Range restriction becomes particularly problem-
several predictors. Assuming that two predictors atic when the researcher is interested in comparing
were collected under similar research designs and criterion validity estimates. This is because
conditions and are correlated with the same crite- observed criterion validity estimates for different
rion variable, then the predictor with the higher predictors can be differentially decreased because
correlation with the criterion can be said to have of range restriction. Suppose that two predictors
greater criterion validity than the other predictor that truly have equal criterion validity were admin-
(for that particular criterion and research context). istered to a set of applicants for a position.
If a criterion variable measures different behaviors Because of the nature of the way they were
or was collected under different research contexts selected, suppose that for Predictor A, 90% of the
(e.g., a testing situation prone to motivated distor- variability in predictor scores remained after peo-
tion vs. one without such motivation), then crite- ple were selected, but only 50% of the variability
rion validity estimates are not directly comparable. remained for Predictor B after selection. Because
294 Criterion Validity

of the effects of range restriction, Predictor A a theoretical relationship between a predictor


would have a higher criterion validity estimate construct and a criterion construct, then correct-
than Predictor B would, even though each had the ing for attenuation in the predictor is warranted.
same true criterion validity. In these cases, one However, if there is an applied interest in esti-
should apply range restriction corrections before mating the relationship between a particular pre-
comparing validity coefficients. Fortunately, there dictor measure and a criterion of interest, then
are multiple formulas available to correct criterion correcting for attenuation in the predictor is
validity estimates for range restriction, depending inappropriate. Although it is true that differ-
on the precise mechanism of range restriction. ences in predictor reliabilities can produce arti-
factual differences in criterion validity estimates,
these differences have substantive implications in
Measurement Error
applied settings. In these instances, every effort
Unlike range restriction, the attenuation of should be made to ensure that the predictors are
criterion validity estimates due to unreliability as reliable as possible for selection purposes.
occurs in all settings. Because no measure is
perfectly reliable, random measurement error
Concerns in Applied Selection
will serve to attenuate statistical relationships
among variables. The well-known correction for In applied purposes, researchers are often inter-
attenuation serves as a way to correct observed ested not only in the criterion validity of a specific
criterion validity estimates for attenuation due predictor (which is indexed by the appropriate
to unreliability. However, some care must be effect size) but also in predicting scores on the cri-
taken in applications of the correction for terion variable of interest. For a single predictor,
attenuation. this is done with the equation
In most applications, researchers are not inter-
ested in predicting scores on the criterion mea- yi ¼ b0 þ b1 xi , ð1Þ
sure; rather, they are interested in predicting
standing on the criterion construct. For example, where yi is the score on the criterion variable for
the researcher would not be interested in predict- person i, xi is the score on the predictor variable
ing the supervisory ratings of a particular for person i, b0 is the intercept for the regression
employee’s teamwork skills, but the researcher model, and b1 is the slope for predictor x. Equa-
would be interested in predicting the true nature tion 1 allows the researcher to predict scores on
of the teamwork skills. As such, correcting for the criterion variable from scores on the predictor
attenuation due to measurement error in the cri- variable. This can be especially useful when
terion provides a way to estimate the relation- a researcher wants the performance of selected
ship between predictor scores and the true employees to meet a minimum threshold.
criterion construct of interest. These corrections When multiple predictors are employed, the
are extremely important when criterion reliabil- effect size of interest is not any single bivariate cor-
ity estimates are different in the validation for relation but the multiple correlation between a set
multiple predictors. If multiple predictors are of predictors and a single criterion of interest
correlated with the same criterion variable in the (which might be indexed with the multiple R or
same sample, then each of these criterion validity R2 from a regression model). In these instances,
estimates is attenuated to the same degree. How- the prediction equation analogous to Equation 1 is
ever, if different samples are used, and the crite-
rion reliability estimates are unequal in the yi ¼ b0 þ b1x1i þ b2x2i þ    þ bp xpi , ð2Þ
samples, then the correction for attenuation in
the criterion should be employed before making where x1i, x2i, . . . xpi are the scores on the predic-
comparisons among predictors. tor variables 1, 2, . . . p for person i, b1, b2, . . . bp
Corrections for attenuation in the predictor are the slopes for predictors x1, x2, . . . xp, and
variable are appropriate only under some condi- other terms are as defined earlier. Equation 2
tions. If the researcher is interested in allows the researcher to predict scores on
Criterion Validity 295

a criterion variable given scores on a set of p pre- how the intercept and slope estimates, respectively,
dictor variables. change for the focal group.
The b2 and b3 coefficients have strong implica-
tions for bias in criterion validity estimates. If the
Predictive Bias b3 coefficient is large and positive (negative), then
the slope differences (and criterion validity esti-
A unique situation arises in applied selection
mates) are substantially larger (smaller) for the
situations because of federal guidelines requiring
focal group. However, if the b3 coefficient is near
criterion validity evidence for predictors that show
zero, then the criterion validity estimates are
adverse impact between protected groups. Pro-
approximately equal for the focal and reference
tected groups include (but are not limited to) eth-
groups. The magnitude of the b2 coefficient deter-
nicity, gender, and age. Adverse impact arises
mines (along with the magnitude of the b3 coeffi-
when applicants from one protected group (e.g.,
cient) whether the criterion scores are over- or
males) are selected at a higher rate than members
underestimated for the focal or references groups
from another protected group (e.g., females).
depending on their scores on the predictor vari-
Oftentimes, adverse impact arises because of sub-
able. It is generally accepted that for predictor
stantial group differences on the predictor on
variables with similar levels of criterion validity,
which applicants are being selected. In these
those exhibiting less predictive bias should be pre-
instances, the focal predictor must be shown to
ferred over those exhibiting more predictive bias.
exhibit criterion validity across all people being
However, there is some room for tradeoffs
selected. However, it is also useful to examine pre-
between criterion validity and predictive bias.
dictive bias.
For the sake of simplicity, predictive bias will be Matthew J. Borneman
explicated here only in the case of a single predic-
tor, though the concepts can certainly be extended See also Concurrent Validity; Correction for Attenuation;
to the case of multiple predictors. In order to Criterion Problem; Predictive Validity; Restriction of
examine the predictive bias of a criterion validity Range; Selection; Validity of Measurement
estimate for a specific predictor, it is assumed that
the variable on which bias is assessed is categori-
cal; examples would include gender or ethnicity. Further Readings
The appropriate equation would be Binning, J. F., & Barrett, G. V. (1989). Validity of
personnel decisions: A conceptual analysis of the
yi ¼ b0 þ b1x1i þ b2x2i þ b3 ðx1i  x2i Þ, ð3Þ
inferential and evidential bases. Journal of Applied
where x1i and x2i are the scores on the continuous Psychology, 74, 478–494.
Brand, C. (1987). The importance of general intelligence.
predictor variable and the categorical demographic
In S. Modgil & C. Modgil (Eds.), Arthur Jensen:
variable, respectively, for person i, b1 is the regres- Consensus and controversy (pp. 251–265).
sion coefficient for the continuous predictor, b2 is Philadelphia: Falmer Press.
the regression coefficient for the categorical predic- Cleary, T. A. (1968). Test bias: Prediction of grades of
tor, b3 is the regression coefficient for the interac- Negro and White students in integrated colleges.
tion term, and other terms are defined as earlier. Journal of Educational Measurement, 5, 115–124.
Equation 3 has substantial implications for bias in Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981).
criterion validity estimates. Assuming the reference Measurement theory for the behavioral sciences. San
group for the categorical variable (e.g., males) is Francisco: W. H. Freeman.
coded as 0 and the focal group (e.g., females) is Kuncel, N. R., & Hezlett, S. A. (2007). Standardized
tests predict graduate students’ success. Science, 315,
coded as 1, the b0 coefficient gives the intercept for
1080–1081.
the reference group, and the b1 coefficient gives Nunnally, J. C. (1978). Psychometric theory (2nd ed.).
the regression slope for the reference group. These New York: McGraw-Hill.
two coefficients form the baseline of criterion Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin,
validity evidence for a given predictor. The b2 M. B. (2001). High-stakes testing in employment,
coefficient and the b3 coefficient give estimates of credentialing, and higher education: Prospects in
296 Criterion Variable

a post-affirmative action world. American effectively as the research design more readily per-
Psychologist, 56, 302–318. mits the adjustment of certain explanatory vari-
Schmidt, F. L., & Hunter, J. E. (1998). The validity and ables in isolation from the others, allowing
utility of selection methods in personnel psychology: a clearer judgment to be made about the nature of
Practical and theoretical implications of 85 years of
the relationship between the response and predic-
research findings. Psychological Bulletin, 124,
262–274.
tors. This entry’s focus is on types of criterion vari-
ables and analysis involving criterion variables.

Types of Criterion Variables


CRITERION VARIABLE Criterion variables can be of several types, depend-
ing on the nature of the analysis being attempted.
Criterion variable is a name used to describe the In many cases, the criterion variable is a measure-
dependent variable in a variety of statistical model- ment on a continuous or interval scale. This case is
ing contexts, including multiple regression, dis- typical in observational studies, in which the crite-
criminant analysis, and canonical correlation. The rion variable is often the variable of most interest
goal of much statistical modeling is to investigate among a large group of measured variables that
the relationship between a (set of) criterion vari- might be used as predictors. In other cases, the cri-
able(s) and a set of predictor variables. The out- terion variable may be discrete, either ordinal
comes of such analyses are myriad and include as (ordered categories) or nominal (unordered cate-
possibilities the development of model formulas, gories). A particularly important case is that of
prediction rules, and classification rules. Criterion a binary (0/1) criterion variable, for which a com-
variables are also known under a number of other mon modeling choice is the use of logistic regres-
names, such as dependent variable, response vari- sion to predict the probability of each of the two
able, predictand, and Y. Similarly, predictor vari- possible outcomes on the basis of the values of the
ables are often referred to using names such as explanatory variables. Similarly, a categorical cri-
independent variable, explanatory variable, and X. terion variable requires particular modeling
While such names are suggestive of a cause-and- choices, such as multinomial logistic regression, to
effect relationship between the predictors and the accommodate the form of the response variable.
criterion variable(s), great care should be taken in Increasingly, flexible nonparametric methods such
assessing causality. In general, statistical modeling as classification trees are being used to deal with
alone does not establish a causal relationship categorical or binary responses, the strength of
between the variables but rather reveals the exis- such methods being their ability to effectively
tence or otherwise of an observed association, model the data without the need for restrictive
where changes in the predictors are concomitant, classical assumptions such as normality.
whether causally or not, with changes in the crite-
rion variable(s). The determination of causation
Types of Analysis Involving Criterion Variables
typically requires further investigation, ruling out
the involvement of confounding variables (other The types of analysis that can be used to describe
variables that affect both explanatory and criterion the behavior of criterion variables in relation to
variables, leading to a significant association a set of explanatory variables are likewise broad.
between them), and, often, scientifically explaining In a standard regression context, the criterion vari-
the process that gives rise to the causation. Causa- able is modeled as a (usually) linear function of
tion can be particularly difficult to assess in the the set of explanatory variables. In its simplest
context of observational studies (experiments in form, linear regression also assumes zero-mean,
which both explanatory and criterion variables are additive, homoscedastic errors. Generalized linear
observed). In designed experiments, where, for models extend simple linear regression by model-
example, the values of the explanatory variables ing a function (called the link function) of the cri-
might be fixed at particular, prechosen levels, it terion variable in terms of a linear function (called
may be possible to assess causation more the linear predictor) of the explanatory variables.
Critical Difference 297

The criterion variable is assumed to arise from an than the breakdown of variation used in princi-
exponential family distribution, the type of which pal components analysis.
leads to a canonical link function, the function for Finally, in all the modeling contexts in which
which XTY is a sufficient statistic for β, the vector criterion variables are used, there exists an asym-
of regression coefficients. Common examples of metry in the way in which criterion variables are
exponential family distributions with their corre- considered compared with the independent vari-
sponding canonical link function include the nor- ables, even in observational studies, in which both
mal (identity link), poisson (log link), binomial sets of variables are observed or measured as
and multinomial (logit link), and exponential and opposed to fixed, as in a designed experiment. Fit-
gamma (inverse link) distributions. Generalized ting methods and measures of fit used in these con-
linear models allow for very flexible modeling of texts are therefore designed with this asymmetry
the criterion variable while retaining most of the in mind. For example, least squares and misclassi-
advantages of simpler parametric models (compact fication rates are based on deviations of realiza-
models, easy prediction). Nonparametric models tions of the criterion variable from the predicted
for the criterion variable include methods such as values from the fitted model.
regression trees, projection pursuit, and neural
nets. These methods allow for flexible models that Michael A. Martin and Steven Roberts
do not rely on strict parametric assumptions,
although using such models for prediction can See also Canonical Correlation Analysis; Covariate;
prove challenging. Dependent Variable; Discriminant Analysis
Outside the regression context, in which the
goal is to model the value of a criterion variable Further Readings
given the values of the explanatory variables, other
types of analysis in which criterion variables play Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J.
a key role include discriminant analysis, wherein (1984). Classification and regression trees. Belmont,
the values of the predictor (input) variables are CA: Wadsworth.
Friedman, J. H., & Stuetzle, W. (1981). Projection
used to assign realizations of the criterion variable
pursuit regression. Journal of the American Statistical
into a set of predefined classes based on the values Association, 76, 817–823.
predicted for a set of linear functions of the predic- Hotelling, H. (1936). Relations between two sets of
tors called discriminant functions, through a model variates. Biometrika, 28, 321–377.
fit via data (called the training set) for which the McCullagh, P., & Nelder, J. A. (1989). Generalized linear
correct classes are known. models. New York: Chapman & Hall.
In canonical correlation analysis, there may McCulloch, W., & Pitts, W. (1943). A logical calculus of
be several criterion variables and several inde- the ideas immanent in nervous activity. Bulletin of
pendent variables, and the goal of the analysis is Mathematical. Biophysics, 5, 115–133.
to reduce the effective dimension of the data
while retaining as much of the dependence struc-
ture in the data as possible. To this end, linear
combinations of the criterion variables and of CRITICAL DIFFERENCE
the independent variables are chosen to maxi-
mize the correlation between the two linear com- Critical differences can be thought of as critical
binations. This process is then repeated with regions for a priori and post hoc comparisons of
new linear combinations as long as there remains pairs of means and of linear combinations of
significant correlation between the respective lin- means. Critical differences can be transformed into
ear combinations of criterion and independent confidence intervals. First, this entry discusses criti-
variables. This process resembles principal com- cal differences in the context of multiple compari-
ponents analysis, the difference being that corre- son tests for means. Second, this entry addresses
lation between sets of independent variables and confusion surrounding applying critical differences
sets of criterion variables is used as the means of for statistical significance and for the special case
choosing relevant linear combinations rather of consequential or practical significance.
298 Critical Difference

Means Model experiment is different from the average of the first


and second treatment means, μ3 6¼ ðμ1 þ μ2 Þ=2.
Multiple comparison tests arise from parametric This can be expressed as
and nonparametric tests of means, medians, and
ranks corresponding to different groups. The para- 1 1
metric case for modeling means can be described H0 : μ3  μ1  μ2 ¼ 0
2 2
as
IID
yij ¼ μi þ εij , which often assumes εij e Nðo; σ 2 Þ, 1 1
H1 : μ3  μ1  μ2 6¼ 0:
2 2
where i ¼ 1 . . . p (number of treatments) and
j ¼ 1 . . . ni (sample size of the ith treatment). Table 1 contains equations describing critical
The null hypothesis is that all the means are equal: differences corresponding to typical multiple com-
parison tests. In addition to means, there are non-
H0 : μ1 ¼ μ2 ¼ . . . ¼ μp parametric critical differences for medians and for
ranks. The number of treatments is denoted by p;
the ithPtreatment group contains ni observations;
H1 : μi 6¼ μi ; i 6¼ j,
N ¼ ni denotes the total number of observa-
tions in the experiment; α denotes the error rate
and it is tested with an F test. Regardless of the for each comparison; and k is the number of
result of this test, a priori tests are always consid- experimentwise comparisons—where applicable.
ered. However, post hoc tests are considered only The experimentwise error rate refers to the overall
if the F test is significant. Both a priori and post error rate for the entire experiment. Many of the
hoc tests compare means and/or linear combina- critical differences in the table can be adjusted to
tions of means. Comparing means is a natural fol- control the experimentwise error rate.
low-up to rejecting H0 : μ1 ¼ μ2 ¼ . . . ¼ μp . This table is not exhaustive and is intended to
be illustrative only. Many of the critical differences
Multiple Comparison Tests for Means have variations.
These same critical differences can be used for
Multiple comparison tests serve to uncover which constructing confidence intervals. However, cau-
pairs of means or linear contrasts of means are sig- tion is warranted as they might not be efficient.
nificant. They are often applied to analyze the
results of an experiment. When the null hypothesis
is that all means are equal, it is natural to compare Consequential or Practical Significance
pairs of means: Multiple comparison tests are designed to find sta-
tistical significance. Sometimes the researcher’s
H0 : μi ¼ μj
objective is to find consequential significance,
which is a special case of statistical significance.
H1 : μi 6¼ μj : Finding statistical significance between two means
indicates that they are discernible at a given level
The straightforward comparison of the two of significance, α. Consequential significance indi-
means can be generalized to a linear contrast: cates that they are discernible and the magnitude
X of the difference is large enough to generate conse-
H0 : ci μi ¼ 0 quences. The corresponding confusion is pan-
demic. In order to adjust the critical differences to
X meet the needs of consequential significance, an
H1 : ci μi 6¼ 0: extra step is wanted.
There are two classic statistical schools:
Linear contrasts describe testing other combina-
tions of means. For example, the researcher might 1. Fisher (Fisherian): This school does not need an
test whether the third treatment mean in an alternative hypothesis.
Critical Difference 299

Table 1 Critical Difference Examples


Paired Comparisons Critical Differences For
P
Test H 0 : μi ¼ μj Contrasts H0 : ci μi ¼ 0
sffiffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffiffiffi
1 1 X c2
i
Bonferroni’s tα=2k;Np σ^ þ tα=2k;Np σ^
ni nj ni
Duncan’s method σ^
qðαk ; p; N  pÞ pffiffiffi ’
n
where αk ¼ 1  ð1  αÞk1
sffiffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffiffiffi
1 1 X c2
i
Fisher’s least significant difference tα=2;Np σ^ þ tα=2;Np σ^
ni nj ni
sffiffiffiffiffiffiffiffiffiffiffiffi
X c2
i
Multivariate t method tα=2;k;Np σ^
ni
Newman–Keuls’s ni ¼ nj Comparing the order statistics of the means.
σ^
qðα; p; N  pÞ pffiffiffi
n
Newman–Keuls’s ni ffi nj Comparing the order statistics of the means.
σ^
qðα; p; N  pÞ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 1
p n1 þ n1 þ . . . þ n1t
1 2
sffiffiffiffiffiffiffiffiffiffiffiffi
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X c2
i
Scheffé’s ðp  1ÞFα;p1;N2 σ^
ni
rffiffiffi  
2 σ^ 1 X
Tukey’s ni ¼ nj qðα; p; N  pÞ^
σ qðα; p; N  pÞ pffiffiffi jci j
n n 2
sffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1 1 σ^
Tukey’s ni 6¼ nj qðα; p; N  pÞ^
σ þ ; i 6¼ j qðα; p; N  pÞ pffiffiffiffi pffiffiffiffi

ni nj MIN ni ; nj

2. Neyman–Pearson: This school insists on an between the means is discernible and whether the
alternative hypothesis: ordering is statistically reliable.
J. Neyman would say that the alternative
H0 : μ1  μ2 ¼ 0 hypothesis, H1 , should represent the consequential
H1 : μ1  μ2 6¼ 0 scenario. For example, suppose that if the mean of
Treatment 1 exceeds the mean of Treatment 2 by
at least some amount c; then changing from Treat-
Ronald Fisher was the first to address the mat- ment 2 to Treatment 1 is consequential. The
ter of hypothesis testing. He defined statistical sig- hypotheses might take the following form:
nificance to address the need for discerning
between two treatments. His school emphasizes H0 : μ1  μ2  c ¼ 0
determining whether one treatment mean is
H1 : μ1  μ2  c > 0,
greater than the other, without stressing the magni-
tude of the difference. This approach is in keeping
with the usual critical differences as illustrated in where c is the consequential difference. In the
Table 1. They pronounce whether any difference application and instruction of statistics, the two
300 Critical Theory

classical schools are blended, which sows even postmodernism, and poststructuralism and also to
more confusion. show, through the evidence bases of literature and
To detect consequential significance, the final step reflective experiences, how critical theories can be
consists of adjusting the multiple comparison test by used by the researcher within different parts of
adding or multiplying by constants, which might be research design.
derived from economic or scientific calculations.

Randy J. Bartlett Marxist Critique


Marx’s critique of the free market economy and
See also Critical Value; Margin of Error; Multiple
capitalism is as applicable today as it was in the
Comparison Tests; Significance, Statistical
19th century. Capitalism as an ideology was
a power structure based on exploitation of the
Further Readings
working classes. Economic production revolves
Fisher, R. (1925). Statistical methods for research around the exploited relationship between bour-
workers. Edinburgh, UK: Oliver & Boyd. geoisie and proletariat. The bourgeoisie have
Hsu, J. (1996). Multiple comparisons, theory and monopolized production since the beginning of the
methods. Boca Raton, FL: Chapman & Hall/CRC. industrial revolution, forcing the proletariat from
Milliken, G., & Johnson, D. (1984). Analysis of messy the land and into structured environments (e.g.,
data: Vol. 1. Designed experiments. Belmont, CA:
urban factories). Peasants became workers, and
Wadsworth.
Neyman, J., & Pearson, E. (1928). On the use and
the working wage was a controlled, structured
interpretation of certain test criteria for purposes of device that gives the proletariat only a fraction of
statistical inference: Part I. Biometrika, 20A, the generated revenue. The surplus value from pro-
175–240. duction is a profit that is taken by the bourgeoisie.
Neyman, J., & Pearson, E. (1928). On the use and Critical theorists would argue that this practice is
interpretation of certain test criteria for purposes of both theft and wage slavery. This leads to wider
statistical inference: Part II. Biometrika, 20A, 263–294. consequences (i.e., overproduction), which in the
20th and 21st centuries has had wide implications
for the environment. Marxism favors social orga-
nization with all individuals having the right to
CRITICAL THEORY participate in consumption and production. The
result of class exploitation, Marxists believe,
A theory consists of a belief or beliefs that allow would be the revolutionary overthrow of capital-
researchers to examine and analyze the world ism with communism.
around them. This entry examines critical theory Max Horkheimer, a founding member of the
and how critical ideas can be applied to research. Frankfurt School in the 1920s and 1930s, wanted
As Peter Barry suggests, we can use theory rather to develop critical theory as a form of Marxism,
than theory’s using us. The complexity of any the- with the objective of changing society. His ideas
ory, let alone the critical theories examined in this attempted to address the threat not only of fascism
entry, cannot be underestimated, but theory can be in Europe but of the rise of consumer culture,
used to produce better research methodology and which can create forms of cultural hegemony and
also to test research data. Theory is important new forms of social control. Theory adapts to
because it allows us to question our beliefs, mean- changing times, but within new social situations,
ings, and understandings of the world around us. theories can be developed and applied to contem-
In all subject areas, critical theory provides a multi- porary contexts. Horkheimer’s focus on consumer
dimensional and wide-ranging critique of research culture can be used today within modern commu-
problems that researchers attempt to address. The nication when we reflect on how people today
importance of how we read, apply, and interpret socialize. One example is Facebook and the ways
critical theories is crucial within any research behavioral patterns continue to change as people
design. Within this entry, the aim is to increase use the World Wide Web more and more to
understandings of Marxism, critical race theory, exchange personal information. What are the
Critical Theory 301

consequences of these changes for social organiza- analysis, arguing that multiculturalism did not focus
tion and consumer culture? on antiracist practice enough. CRT has that central
A very contemporary application of Marxism focus and goes further in relation to how minority
has been provided by Mike Cole, who applies groups are racialized and colored voices silenced.
Marxist ideas to education by using the example CRT offers everyone a voice to explore, examine,
of Venezuela and Hugo Chavez, who opposes cap- debate, and increase understandings of racism.
italism and imperialism. In education, Cole high- David Gillborn is an advocate of applying CRT
lights, Chavez has hoped to open 38 new state within research, explaining that the focus of CRT is
universities with 190 satellite classrooms through- an understanding that the status of Black and other
out Venezuela by 2009. Social projects such as minority groups is always conditional on Whites.
housing are linked to this policy. Communal coun- As Gillborn highlights, to many Whites, such an
cils have been created whereby the local popula- analysis might seem outrageous, but its perceptive-
tion meets to decide on local policies and how to ness was revealed in dramatic fashion in July 2005,
implement them, rather than relying on bourgeois with the terrorist attacks on London and Madrid.
administrative machinery. Chavez is not only talk- Gillborn underlines that this was the clearest dem-
ing about democratic socialism but applying it to onstration of the conditional status of people of
government policy. Therefore, one can apply color in contemporary England. So power condi-
Marxist critique politically in different parts of the tions are highlighted by CRT. The application of
world. Chavez’s policies are a reaction to capital- CRT within research design can be applied in many
ism and the colonial legacy and have the objective subject areas and disciplines. That notion of power
of moving Venezuela in a more socialist direction. and how it is created, reinforced, and controlled is
Application and interpretation are the keys when an important theme within critical theory.
applying critical theory within research design.
Cole applies Marxist ideas to the example of Vene-
Postmodernism
zuela and provides evidence to interpret the events
that are taking place. Critical theory can be effec- Postmodernism also examines power relations and
tive when it is applied in contemporary contexts. how power is made and reinforced. To understand
what postmodernism is, we have to understand
what modernism was. Peter Barry explains that
Critical Race Theory
modernism is the name given to the movement that
The critical race theory (CRT) movement is a collec- dominated arts and culture in the first half of the
tion of activists and scholars interested in studying 20th century. Practice in music, literature, and
and transforming the relationship among race, rac- architecture was challenged. This movement of
ism, and power. Although CRT began as a move- redefining modernism as postmodernist can be seen
ment in the United States within the subject of law, in the changing characteristics of literary modern-
it has rapidly spread beyond that discipline. Today, ism. Barry provides a position that concerns liter-
academics in the social and behavioral sciences, ary modernism and the move to postmodernist
including the field of education, consider themselves forms of literature. It can be applied more broadly
critical race theorists who use CRT ideas to, accord- to research design and the application of critical
ing to Richard Delgardo and Jean Stefancic, under- theory. The move away from grand-narrative social
stand issues of school discipline, controversies, and cultural theories can be seen in the above as
tracking, and IQ and achievement testing. CRT philosophers and cultural commentators moved
tries not only to understand our social situation but away from ‘‘objective’’ positions and began to
to change it. The focus of CRT is racism and how it examine multiple points of views and diverse moral
is socially and culturally constructed. CRT goes positions. Postmodernists are skeptical about over-
beyond a conceptual focus of multiculturalism, all answers to questions that allow no space for
which examines equal opportunities, equity, and debate. Reflexivity allows the researcher to reflect
cultural diversity. Barry Troyna carried out educa- on his or her own identity or identities within
tion research in the United Kingdom during the a given profession. The idea of moving beyond
1970s and 1980s with an antiracist conceptual a simplistic mirror image or ‘‘Dear Diary’’
302 Critical Theory

approach to a reflective method with the applica- this form of critical theory aims to deconstruct the
tion of different evidence bases of literature reviews grand narratives and structural theoretical frame-
to personal experiences shows how status, roles, works. Poststructuralism also attempts to increase
and power can be critically analyzed. The plurality understandings of language and the ways knowl-
of roles that occurs is also a postmodern develop- edge and power are used and evolve to shape how
ment with the critical questioning of the world and we view structures (e.g., the institution in which
the multiple identities that globalization gives the we work and how it works) and why we and
individual. In relation to research design, a post- others accept how these structures work. Poststruc-
modern approach gives the researcher more possi- turalism is critical of these processes and attempts
bilities in attempting to increase understandings of to analyze new and alternative meanings. In rela-
a research area, question, or hypothesis. Fragmen- tion to research design, it is not only how we apply
ted forms, discontinuous narrative, and the random critical theory to poststructuralist contexts, it is
nature of material can give the researcher more how we attempt to read the theory and theorists.
areas or issues to examine. This can be problematic Michel Foucault is a poststructuralist, and his
as research focus is an important issue in the works are useful to read in association with issues
research process and it is vital for the researcher to of research design. Poststructural ideas can be used
stay focused. That last line would immediately be to examine the meanings of different words and
questioned by a postmodernist because the position how different people hold different views or mean-
is one of constant questioning and potential change. ings of those words. The plurality of poststructural-
The very nature of research and research design ist debate has been criticized because considering
could be questioned by the postmodernist. The different or all arguments is only relative when an
issue here is the creation and development of ideas, absolute decision has to be taken. However, it is
as it is for all researchers. Jean-François Lyotard the question of language and meaning in relation
believed that the researcher and intellectual should to power and knowledge that offers the researcher
resist the grand ideas and narratives that had a different angle within research design.
become, in his opinion, outdated. Applying that
directly to research design, a modernist argument
would be that the research process should consist Application Within the
of an introduction with a research question or
Social and Behavioral Sciences
hypothesis; literature review; method and method-
ology; data collection, presentation, and analysis; This final section highlights how critical theory, be
and a conclusion. If one were being critical of that it Marxism, CRT, postmodernism, or poststructur-
provisional research design structure, one could alism, can be applied within the social and behav-
suggest that research questions (plural) should be ioral sciences. It is not only the word application
asked, literature reviews (subject specific, general, that needs to be focused on but interpretation and
theoretical, conceptual, method, data) should be one’s interpretations in relation to one’s research
carried out, and all literature should be criticized; question or hypothesis. Researchers reading the
positivist research paradigms should be dropped in primary sources of Marx, Foucault, or Lyotard and
favor of more reflective, action research projects; applying them to a research design is all very well,
and other questions should be answered rather than but interpreting one’s own contextual meanings to
the focal question or hypothesis posed at the begin- methodology from the literature reviews and then
ning of the research project. Research design itself applying and interpreting again to data analysis
would be questioned because that is the very nature seems to be more difficult. Two different meanings
of postmodernist thought: the continuing critique could materialize here, which is due to the fact that
of the subject under examination. space and time needs to be given within research
design for reading and rereading critical theories to
increase understandings of what is being
Poststructuralism
researched. Critical theories can be described as
The issue of power and knowledge creation is also windows of opportunity in exploring research pro-
examined within poststructuralism in the sense that cesses in the social and behavioral sciences. They
Critical Thinking 303

can be used as a tool within research design to A proposition is a statement that claims to be
inform, examine, and ultimately test a research true, a statement that claims to be a good guide to
question and hypothesis. Critical theoretical frame- reality. Not all statements that sound as if they
works can be used within research introductions, may be true or false function as propositions, so
literature reviews, and method and methodology the first step in critical thinking is often to consider
processes. They can have a role to play in data whether a proposition is really being advanced.
analysis and research conclusions or recommenda- For example, ‘‘I knew this was going to happen’’ is
tions at the end of a research project. It is how the often an effort to save face or to feel some control
researcher uses, applies, and interprets critical the- over an unfortunate event rather than an assertion
ory within research design that is the key issue. of foreknowledge, even though it sounds like one.
Conversely, a statement may not sound as if it has
Richard Race a truth element, but on inspection, one may be dis-
covered. ‘‘Read Shakespeare’’ may sometimes be
See also Literature Review; Methods Section; Research
translated as the proposition, ‘‘Private events are
Design Principles; Research Question; Theory
hard to observe directly, so one way to learn more
about humans is to observe public representations
of private thoughts as described in context by cele-
Further Readings brated writers.’’ Critical thinking must evaluate
Delgardo, R., & Stefancic, J. (2001). Critical race theory: statements properly stated as propositions; many
An introduction. New York: New York University disagreements are settled simply by ascertaining
Press. what, if anything, is being proposed. In research, it
Foucault, M. (1991). Discipline and punish: The birth of is useful to state hypotheses explicitly and to define
the prison. London: Penguin Books. the terms of the hypotheses in a way that allows
Foucault, M. (2002). Archaeology of knowledge. all parties to the conversation to understand
London: Routledge. exactly what is being claimed.
Lyotard, J.-F. (1984). The postmodern condition: A
Critical thinking contextualizes propositions; it
report on knowledge. Minneapolis: University of
helps the thinker consider when a proposition is
Minnesota Press.
Malpas, S., & Wake, P. (Eds.). (2006). The Routledge true or false, not just whether it is true or false. If
companion to critical theory. London: Routledge. a proposition is always true, then it is either a tau-
Troyna, B. (1993). Racism and education. Buckingham, tology or a natural law. A tautology is a statement
UK: Open University Press. that is true by definition: ‘‘All ermines are white’’
is a tautology in places where nonwhite ermines
are called weasels. A natural law is a proposition
that is true in all situations, such as the impossibil-
ity of traveling faster than light in a vacuum. The
CRITICAL THINKING validity of all other propositions depends on the
situation. Critical thinking qualifies this validity by
Critical thinking evaluates the validity of proposi- specifying the conditions under which they are
tions. It is the hallmark and the cornerstone of sci- good guides to reality.
ence because science is a community that aims to Logic is a method of deriving true statements
generate true statements about reality. The goals from other true statements. A fallacy occurs when
of science can be achieved only by engaging in an a false statement is derived from a true statement.
evaluation of statements purporting to be true, This entry discusses methods for examining pro-
weeding out the false ones, and limiting the true positions and describes the obstacles to critical
ones to their proper contexts. Its centrality to the thinking.
scientific enterprise can be observed in the privi-
leges accorded to critical thinking in scientific dis-
Seven Questions
course. It usually trumps all other considerations,
including tact, when it appears in a venue that Critical thinking takes forms that have proven
considers itself to be scientific. effective in evaluating the validity of propositions.
304 Critical Thinking

Generally, critical thinkers ask, in one form or dualities. Thus, a critical thinker would want to
another, the following seven questions: examine whether it makes sense to consider one
citizen better than another or whether the proposi-
1. What does the statement assert? What is tion is implying that schools are responsible for
asserted by implication? social conduct rather than for academics.
2. What constitutes evidence for or against the Critical thinkers are also alert to artificial cate-
proposition? gories. When categories are implied by a proposi-
tion, they need to be examined as to whether they
3. What is the evidence for the proposition? What
is the evidence against it? really exist. Most people would accept the reality
of the category school in the contemporary United
4. What other explanations might there be for the States, but not all societies have clearly demarcated
evidence? mandatory institutions where children are sent dur-
5. To which circumstances does the proposition ing the day. It is far from clear that the categories
apply? of smaller schools and larger schools stand up to
6. Are the circumstances currently of interest like scrutiny, because school populations, though not
the circumstances to which the proposition falling on a smooth curve, are more linear than cat-
applies? egorical. The proponent might switch the proposi-
tion to School size predicts later criminal activity.
7. What motives might the proponent of the
proposition have besides validity?
What Constitutes Evidence
for or Against the Proposition?
What Does the Statement Assert?
Before evidence is evaluated for its effect on
What Is Asserted by Implication?
validity, it must be challenged by questions that
The proposition Small schools produce better ask whether it is good evidence of anything. This
citizens than large schools do can be examined as is generally what is meant by reliability. If a study
an illustrative example. The first step requires the examines a random sample of graduates’ criminal
critical thinker to define the terms of the proposi- records, critical thinkers will ask whether the sam-
tion. In this example, the word better needs elabo- ple is truly random, whether the available criminal
ration, but it is also unclear what is meant by records are accurate and comprehensive, whether
citizen. Thus, the proponent may mean that better the same results would be obtained if the same
citizens are those who commit fewer crimes or per- records were examined on different days by differ-
haps those who are on friendly terms with a larger ent researchers, and whether the results were cor-
proportion of their communities than most citizens. rectly transcribed to the research protocols.
Critical thinkers are alert to hidden tautologies, It is often said that science relies on evidence
or to avoiding the fallacy of begging the question, rather than on ipse dixits, which are propositions
in which begging is a synonym for pleading (as in accepted solely on the authority of the speaker.
pleading the facts in a legal argument) and ques- This is a mistaken view, because all propositions
tion means the proposition at stake. It is fallacious ultimately rest on ipse dixit. In tracking down
to prove something by assuming it. In this exam- criminal records, for example, researchers will
ple, students at smaller schools are bound to be on eventually take someone’s—or a computer’s—
speaking terms with a higher proportion of mem- word for something.
bers of the school community than are students at
larger schools, so if that is the definition of better
What Is the Evidence for the Proposition?
citizenship, the proposition can be discarded as
What Is the Evidence Against It?
trivial. Some questions at stake are so thoroughly
embedded in their premises that only very deep These questions are useful only if they are
critical thinking, called deconstruction, can reveal asked, but frequently people ask only about the
them. Deconstruction asks about implied assump- evidence on the side they are predisposed to
tions of the proposition, especially about unspoken believe. When they do remember to ask, people
Critical Thinking 305

have a natural tendency, called confirmation bias, Are the Circumstances Currently
to value the confirming evidence and to dismiss of Interest Like the Circumstances
the contradictory evidence. to Which the Proposition Applies?
Once evidence is adduced for a proposition, one
must consider whether the very same evidence may stand against it. For example, once someone has argued that distress in a child at seeing a parent during a stay in foster care is a sign of a bad relationship, it is difficult to use the same distress as a sign that it is a good relationship. But critical thinking requires questioning what the evidence is evidence of.

What Other Explanations Might There Be for the Evidence?

To ensure that an assertion of causality is correct, either one must be able to change all but the causal variable and produce the same result, or one must be able to change only the proposed cause and produce a different result. In practice, especially in the social areas of science, this never happens, because it is extremely difficult to change only one variable and impossible to change all variables except one. Critical thinkers identify variables that changed along with the one under consideration. For example, it is hard to find schools of different sizes that also do not involve communities with different amounts of wealth, social upheaval, or employment opportunities. Smaller schools may more likely be private schools, which implies greater responsiveness to the families paying the salaries, and it may be that responsiveness and accountability are more important than size per se.

To Which Circumstances Does the Proposition Apply?

If a proposition is accepted as valid, then it is either a law of nature—always true—or else it is true only under certain circumstances. Critical thinkers are careful to specify these circumstances so that the proposition does not become overly generalized. Thus, even if a causal relationship were accepted between school size and future criminality, the applicability of the proposition might have to be constricted to inner cities or to suburbs, or to poor or rich schools, or to schools where entering test scores were above or below a certain range.

Once a proposition has been validated for a particular set of circumstances, the critical thinker examines the current situation to determine whether it is sufficiently like the validating circumstances to apply the proposition to it. In the social sciences, there are always aspects of the present case that make it different from the validating circumstances. Whether these aspects are different enough to invalidate the proposition is a matter of judgment. For example, the proposition relating school size to future criminality could have been validated in California, but it is unclear whether it can be applied to Texas. Critical thinkers form opinions about the similarity or differences between situations after considering reasons to think the current case is different from or similar to the typical case.

What Motives Might the Proponent of the Proposition Have Besides Validity?

The scientific community often prides itself on considering the content of an argument rather than its source, purporting to disdain ad hominem arguments (those being arguments against the proponent rather than against the proposition). However, once it is understood that all evidence ultimately rests on ipse dixits, it becomes relevant to understand the motivations of the proponent. Also, critical thinkers budget their time to examine relevant propositions, so a shocking idea from a novice or an amateur or, especially, an interested party is not always worth examining. Thus, if a superintendent asserts that large schools lead to greater criminality, one would want to know what this official's budgetary stake was in the argument. Also, this step leads the critical thinker full circle, back to the question of what is actually being asserted. If a high school student says that large schools increase criminality, he may really be asking to transfer to a small school, a request that may not depend on the validity of the proposition. Thus, critical thinkers ask who is making the assertion, what is at stake for the speaker, from what position the speaker is speaking, and under what conditions or constraints, to what audience, and with what kind of language. There may be no
clear or definitive answers to these questions; however, the process of asking is one that actively engages the thinking person in the evaluation of any proposition as a communication.

The Role of Theory

When critical thinkers question what constitutes good evidence, or whether the current situation is like or unlike the validating circumstances, or what motives the proponent may have, how do they know which factors to consider? When they ask about other explanations, where do other explanations come from? Theory, in the sense of a narrative that describes reality, provides these factors and explanations. To use any theorist's theory to address any of these issues is to ask what the theorist would say about them.

Multiculturalism

Multiculturalism provides another set of questions to examine the validity of and especially to constrict the application of a proposition. Someone thinking of large suburban high schools and small suburban parochial schools may think that the proposition relating school size to criminality stands apart from race, sex, and ethnicity. Multicultural awareness reminds us to ask whether these factors matter.

Obstacles to Critical Thinking

If critical thinking is such a useful process for getting at the truth, for producing knowledge that may more efficiently and productively guide our behavior—if it is superior to unquestioning acceptance of popular precepts, common sense, gut instinct, religious faith, or folk wisdom—then why is it not more widespread? Why do some people resist or reject critical thinking? There are several reasons that it can be upsetting. Critical thinkers confront at least six obstacles: losing face, losing faith, losing friends, thinking the unthinkable, challenging beliefs, and challenging believing.

Losing Face

Critical thinking can cause people to lose face. In nonscientific communities, that is, in communities not devoted to generating true statements about reality, propositions are typically met with a certain amount of tact. It can be awkward to question someone's assertions about reality, and downright rude to challenge their assertions about themselves. When people say something is true and it turns out not to be true, or not to be always true, they lose face. Scientists try to overcome this loss of face by providing a method of saving face, namely, by making a virtue of self-correction and putting truth ahead of pride. But without a commitment to science's values, critical thinking can lead to hurt feelings.

Losing Faith

Religious faith is often expressed in spiritual and moral terms, but sometimes it is also expressed in factual terms—faith that certain events happened at a certain time or that the laws of nature are sometimes transcended. When religion takes a factual turn, critical thinking can oppose it, and people can feel torn between religion and science. Galileo said that faith should concern itself with how to go to heaven, and not with how the heavens go. When people have faith in how reality works, critical thinking can become the adversary of faith.

Losing Friends

Human beings are social; we live together according to unspoken and spoken agreements, and our social networks frequently become communities of practice wherein we express our experiences and observations in terms that are agreeable to and accepted by our friends and immediate communities. Among our intimates and within our social hierarchies, we validate each other's views of reality and find such validation comforting. We embed ourselves in like-minded communities, where views that challenge our own are rarely advanced, and when they are, they and their proponents are marginalized—labeled silly, dangerous, crazy, or unreasonable—but not seriously considered. This response to unfamiliar or disquieting propositions strengthens our status as members of the ingroup and differentiates us from the outgroup. Like-minded communities provide reassurance, and critical thinking—which looks with a curious, analytical, and fearless eye on propositions—threatens not only one's worldview but one's social ties.
Thinking the Unthinkable

Some of the questions that critical thinkers ask in order to evaluate propositions require an imaginative wondering about alternatives: Would the meaning of evidence change if something about the context changed, could there be some other, unthought-of factor accounting for evidence, or could the evidence itself be suspect? Critical thinking involves considering alternatives that may be banned from discourse because they challenge the status quo or because they challenge the complacency of the thinker. It may not be tolerable to ask whether race matters, for example, in trying to understand any links between school size and criminality, and it may not be tolerable to ask how a group of professionals would behave if arrested, if one is evaluating the proposition that a suspect's behavior on arrest signifies guilt or innocence. Critical thinking can be frightening to people who just want to be left alone and do not want to think about alternatives.

Challenging Beliefs

It is useful for most people to have several central tenets about reality and the human condition that guide their behavior. If anxiety is the feeling of not knowing what to do, it can be comforting to have some beliefs that dictate how to behave. Very few beliefs emerge from the process of critical thinking unchanged and unqualified. The basic propositions on which an individual relies come to seem temporary and situational rather than fundamental and reliable. The result can be an anxious sense that the map being used to navigate reality is out of date or drawn incorrectly. To preserve the security of knowing what to do, many people avoid evaluating their maps and beliefs.

Challenging Believing

The very process of examining beliefs can make them seem arbitrary, like tentative propositions reflecting communicative circumstances and personal motives rather than truisms about reality etched in stone. This happens because critical thinking casts beliefs as statements, as sentences uttered or written by a speaker, and the process of questioning them makes it clear that there are almost always situational and motivational issues to consider. Critical thinking shows us the extent to which our understandings are socially constructed and communicative. With very few exceptions, science's propositions are continually revised and circumscribed, facts blur with opinions, and our categories turn out to be arbitrary or even in the service of some social agenda. A postmodern world, energized by critical thinking about the keystones of our beliefs, can be a difficult world to live in for people seeking certainty.

It is the nature of powerful subsystems to preserve their power by setting their definitions of situations in stone, and it is the task of critical thinkers to cast those definitions as revisable, circumscribable, and oftentimes arbitrary. If truth liberates people from other people's self-serving views of reality, critical thinking is an engine for freedom as well as for truth.

Michael Karson and Janna Goodwin

See also Reliability; Theory; Threats to Validity; "Validity"

Further Readings

Derrida, J. (1982). Margins of philosophy (A. Bass, Trans.). Chicago: University of Chicago Press.
Karson, M., & Goodwin, J. (2008). Critical thinking about critical thinking. In M. Karson, Deadly therapy: Lessons in liveliness from theater and performance theory (pp. 159–175). Lanham, MD: Jason Aronson.
Longino, H. (1990). Science as social knowledge: Values and objectivity in scientific inquiry. Princeton, NJ: Princeton University Press.
Meehl, P. (1973). Why I do not attend case conferences. In P. E. Meehl, Psychodiagnosis: Selected papers (pp. 225–302). Minneapolis: University of Minnesota Press.

CRITICAL VALUE

The critical value (CV) is used in significance testing to establish the critical and noncritical regions of a distribution. If the test value or statistic falls within the range of the critical region (the region of rejection), then there is a significant difference or association; thus, the null hypothesis should be rejected. Conversely, if the test value or statistic falls within the range of the noncritical region (also known as the nonrejection region), then the difference or association is possibly due to chance; thus, the null hypothesis should be accepted.
When using a one-tailed test (either left-tailed or right-tailed), the CV will be on either the left or the right side of the mean. Whether the CV is on the left or right side of the mean is dependent on the conditions of an alternative hypothesis. For example, a scientist might be interested in increasing the average life span of a fruit fly; therefore, the alternative hypothesis might be H1: μ > 40 days. Subsequently, the CV is on the right side of the mean. Likewise, the null hypothesis would be rejected only if the sample mean is greater than 40 days. This example would be referred to as a one-tailed right test.

To use the CV to determine the significance of a statistic, the researcher must state the null and alternative hypotheses; set the level of significance, or alpha level, at which the null hypothesis will be rejected; and compute the test value (and the corresponding degrees of freedom, or df, if necessary). The investigator can then use that information to select the CV from a table (or calculation) for the appropriate test and compare it to the statistic. The statistical test the researcher chooses to use (e.g., z-score test, z test, single sample t test, independent samples t test, dependent samples t test, one-way analysis of variance, Pearson product-moment correlation coefficient, chi-square) determines which table he or she will reference to obtain the appropriate CV (e.g., z-distribution table, t-distribution table, F-distribution table, Pearson's table, chi-square distribution table). These tables are often included in the appendixes of introductory statistics textbooks.

For the following examples, the Pearson's table, which gives the CVs for determining whether a Pearson product-moment correlation (r) is statistically significant, is used. Using an alpha level of .05 for a two-tailed test, with a sample size of 12 (df = 10), the CV is .576. In other words, for a correlation to be statistically significant at the .05 significance level using a two-tailed test for a sample size of 12, the absolute value of Pearson's r must be greater than or equal to .576. Using a significance level of .05 for a one-tailed test, with a sample size of 12 (df = 10), the CV is .497. Thus, for a correlation to be statistically significant at the .05 level using a one-tailed test for a sample size of 12, the absolute value of Pearson's r must be greater than or equal to .497.

When using a statistical table to reference a CV, it is sometimes necessary to interpolate, or estimate values, between CVs in a table because such tables are not exhaustive lists of CVs. For the following example, the t-distribution table is used. Assume that we want to find the critical t value that corresponds to 42 df using a significance level or alpha of .05 for a two-tailed test. The table has CVs only for 40 df (CV = 2.021) and 50 df (CV = 2.009). In order to calculate the desired CV, we must first find the distance between the two known dfs (50 − 40 = 10). Then we find the distance between the desired df and the lower known df (42 − 40 = 2). Next, we calculate the proportion of the distance that the desired df falls from the lower known df (2/10 = .20). Then we find the distance between the CVs for 40 df and 50 df (2.021 − 2.009 = .012). The desired CV is .20 of the distance between 2.021 and 2.009 (.20 × .012 = .0024). Since the CVs decrease as the dfs increase, we subtract .0024 from the CV for 40 df (2.021 − .0024 = 2.0186); therefore, the CV for 42 df with an alpha of .05 for a two-tailed test is t = 2.0186.
moment correlation coefficient, chi-square) deter- Typically individuals do not need to reference
mines which table he or she will reference to statistical tables because statistical software
obtain the appropriate CV (e.g., z-distribution packages, such as SPSS, an IBM company, formerly
table, t-distribution table, F-distribution table, called PASWâ Statistics, indicate in the output
Pearson’s table, chi-square distribution table). whether a test value is significant and the level of
These tables are often included in the appendixes significance. Furthermore, the computer calcula-
of introductory statistics textbooks. tions are more accurate and precise than the infor-
For the following examples, the Pearson’s mation presented in statistical tables.
table, which gives the CVs for determining
whether a Pearson product-moment correlation Michelle J. Boyd
(r) is statistically significant, is used. Using an See also Alternative Hypotheses; Degrees of Freedom; Null
alpha level of .05 for a two-tailed test, with Hypothesis; One-Tailed Test; Significance, Statistical;
a sample size of 12 (df ¼ 10), the CV is .576. In Significance Level, Concept of; Two-Tailed Test
other words, for a correlation to be statistically
significant at the .05 significance level using
Further Readings
a two-tailed test for a sample size of 12, then the
absolute value of Pearson’s r must be greater Heiman, G. W. (2003). Basic statistics for the behavioral
than or equal to .576. Using a significance level sciences. Boston: Houghton Mifflin.
of .05 for a one-tailed test, with a sample size of
12 (df ¼ 10), the CV is .497. Thus, for a correla-
tion to be statistically significant at the .05 level
using a one-tailed test for a sample size of 12, CRONBACH’S ALPHA
then the absolute value of Pearson’s r must be
greater than or equal to .497. See Coefficient Alpha
CROSSOVER DESIGN

There are many different types of experimental designs for different study scenarios. Crossover design is a special design in which each experimental unit receives a sequence of experimental treatments. In practice, it is not necessary that all permutations of all treatments be used. Researchers also call it switchover design, compared with a parallel group design, in which some experimental units only get a specific treatment and other experimental units get another treatment. In fact, the crossover design is a specific type of repeated measures experimental design. In the traditional repeated measures experiment, the experimental units, which are applied to one treatment (or one treatment combination) throughout the whole experiment, are measured more than one time, resulting in correlations between the measurements. The difference between crossover design and traditional repeated measures design is that in crossover design, the treatment applied to an experimental unit for a specific time continues until the experimental unit receives all treatments. Some experimental units may be given the same treatment in two or more successive periods, according to the needs of the research.

The following example illustrates the display of the crossover design. Researchers altered the diet ingredients of 18 steers in order to study the digestibility of feedstuffs in beef cattle. There were three treatments, or feed mixes, each with a different mix of alfalfa, straw, and so on. A three-period treatment was used, with 3 beef steers assigned to each of the six treatment sequences. Each diet in each sequence was fed for 30 days. There was a 21-day washout between each treatment period of the study. Assume that the dependent variable is the neutral detergent fiber digestion coefficient calculated for each steer. For this case, there are three treatment periods and 3! = 6 different sequences. The basic layout may be displayed as in Table 1.

Table 1   Six-Order Sequence Crossover Table

Sequence   Units        Period 1   Period 2   Period 3
1          1, 2, 3      A          B          C
2          4, 5, 6      B          C          A
3          7, 8, 9      C          A          B
4          10, 11, 12   A          C          B
5          13, 14, 15   B          A          C
6          16, 17, 18   C          B          A
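Because the six sequences are simply the permutations of the three treatments, a layout of this kind can be enumerated mechanically. The sketch below is one illustration; the assignment of unit numbers to sequences is made up for the example, and the order in which the permutations are generated may differ from the order shown in Table 1.

```python
# A sketch of enumerating the 3! = 6 treatment sequences shown in Table 1
# and assigning three experimental units to each; the unit numbering is
# illustrative, and the permutation order may differ from the table.
from itertools import permutations

treatments = ["A", "B", "C"]
for seq_number, seq in enumerate(permutations(treatments), start=1):
    units = list(range(3 * seq_number - 2, 3 * seq_number + 1))  # e.g., 1, 2, 3
    print(seq_number, units, list(seq))
```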
Basic Conceptions, Advantages, and Disadvantages

Crossover designs were first used in agricultural research in the 1950s and are widely used in other scientific fields with human beings and animals. The feature that measurements are obtained on different treatments from each experimental unit distinguishes the crossover design from other experimental designs.

This feature entails several advantages and disadvantages. One advantage is reduced costs and resources when money and the number of experimental units available for the study are limited. The main advantage of crossover design is that the treatments are compared within subjects. The crossover design is able to remove the subject effect from the comparison. That is, the crossover design removes any component from the treatment comparisons that is related to the differences between the subjects. In clinical trials, it is common that the variability of measurements obtained from different subjects is much greater than the variability of repeated measurements obtained from the same subjects.

A disadvantage of crossover design is that it may bring a carryover effect; that is, the effect of a treatment given in one period may influence the effect of the treatment in the following period(s). Typically the subjects (experimental units) are given sufficient time to "wash out" the effect of the treatments between two periods in crossover designs and return to their original state. It is important in a crossover study that the underlying condition does not change over time and that the effects of one treatment disappear before the next is applied. In practice, even if sufficient washout time is administered after two successive treatment periods, the subjects' physiological states may have been changed and may be unable to return to the original state, which may affect the effects of the treatment in the succeeding period. Thus, the carryover effect cannot be ignored in crossover designs. In spite of its advantages, the crossover
design should not be used in clinical trials in which a treatment cures a disease and no underlying condition remains for the next treatment period. Crossover designs are typically used for persistent conditions that are unlikely to change over the course of the study. The carryover effect can cause problems with data analysis and interpretation of results in a crossover design. The carryover prevents the investigators from determining whether the significant effect is truly due to a direct treatment effect or whether it is a residual effect of other treatments. In multiple regressions, the carryover effect often leads to multicollinearity, which leads to erroneous interpretation. If the crossover design is used and a carryover effect exists, a design should be used in which the carryover effect will not be confounded with the period and treatment effects. The carryover effect makes the design less efficient and more time-consuming.

The period effect occurs in crossover design because of the conditions that are present at the time the observed values are taken. These conditions systematically affect all responses that are taken during that time, regardless of the treatment or the subject. For example, the subjects need to spend several hours to complete all the treatments in sequence. The subjects may become fatigued over time. The fatigue factor may tend to systematically impact all the treatment effects of all the subjects, and the researcher cannot control this situation. If the subject is diseased during the first period, regardless of treatment, and the subject is disease free by the time the second period starts, this situation is also a period effect. In many crossover designs, the timing and spacing of periods is relatively loose. In clinical trials, the gap between periods may vary and may depend on when the patient can come in. Large numbers of periods are not suitable for animal feeding and clinical trials with humans. They may be used in psychological experiments with up to 128 periods.

Sequence effect is another issue in crossover design. Sequence refers to the order in which the treatments are applied. The possible sets of sequences that might be used in a design depend on the number of the treatments, the length of the sequences, and the aims of the experiment or trial. For instance, with t treatments there are t! possible sequences. The measurement of sequence effect is the average response over the sequences. The simplest design is the two-treatment–two-period, or 2 × 2, design. Different treatment sequences are used to eliminate sequence effects. The crossover design cannot accommodate a separate comparison group. Because each experimental unit receives all treatments, the covariates are balanced. The goal of the crossover design is to compare the effects of individual treatments, not the sequences themselves.

Variance Balance and Unbalance

In the feedstuffs example above, each treatment occurs one time in each experimental unit (or subject) and each of the six three-treatment sequences occurs two times, which confers the property of balance known as variance balance. In general, a crossover design is balanced for carryover effect if all possible sequences are used an equal number of times in the experiment, and each treatment occurs equal times in each period and occurs once with each experimental unit. However, this is not always possible; deaths and dropouts may occur in a trial, which can lead to unequal numbers in sequences.

In the example above and Table 1, A → B occurs twice, once each in Sequences 1 and 3; and A → C occurs once each in Sequences 4 and 5. Similarly, B → A, B → C, C → A, and C → B each occur twice. For Experimental Units 1 and 2, A → B brings the first-order carryover effect in this experiment and changes the response in the first period following the application of the treatment; similarly, the second-order carryover effect changes the response in the second period following the application of the treatment. A 21-day rest period is used to wash out the effect of treatment before the next treatment is applied. With the absence of carryover effects, the response measurements reflect only the current treatment effect.

In variance balance crossover design, all treatment contrasts are equally precise; for instance, in the example above, we have

var(τA − τB) = var(τB − τC) = var(τA − τC),

where var = variance and τ = a treatment group mean.

The contrasts of the carryover effects are also equally precise. The treatment and carryover
effects are negatively correlated in crossover designs with variance balance.

Balanced crossover can avoid confounding the period effect with the treatment effects. For example, one group of experimental units received the sequence A → C, and another group of units received the sequence C → A. These two treatments are applied in each period, and comparisons of Treatments A and C are independent of comparisons of periods (e.g., Period 1 and Period 2).

The variance-balanced designs developed independently by H. D. Patterson and H. L. Lucas, M. H. Quenouille, and I. I. Berenblut contain a large number of periods. These researchers showed that treatment and carryover effects are orthogonal. Balaam designs are also balanced, using t treatments, t² sequences, and only two periods with all treatment combinations, and they are more efficient than the two-period designs of Patterson and Lucas. Adding an extra period or having a baseline observation of the same response variable on each subject can improve the Balaam design. Generally, adding an extra period is better than having baselines. However, the cost is an important factor to be considered by the researcher.

Variance-unbalanced designs lack the property that all contrasts among the treatment effects and all contrasts among the carryover effects are of equal precision. For some purposes, the designer would like to have approximate variance balance, which can be fulfilled by the cyclic designs of A. W. Davis and W. B. Hall and the partially balanced incomplete block designs, of which more than 60 are included in the 1962 paper by Patterson and Lucas. One advantage of the Davis–Hall design is that it can be used for any number of treatments and periods. The efficiencies of Patterson and Lucas and Davis–Hall are comparable, and the designs of Davis and Hall tend to require fewer subjects. Tied-double-changeover designs are another type of variance-unbalanced crossover design proposed by W. T. Federer and G. F. Atkinson. The goal of this type is to create a situation in which the variances of differences among treatments are nearly equal to the variances of differences among carryover effects. If Federer and Atkinson's design has t treatments, p periods, and c subjects, and we define two positive integers q and s, then q = (p − 1)/t and s = c/t. Some combinations of s and q make the design variance balanced.

Analysis of Different Types of Data

Approaches for Normally Distributed Data

The dependent variables follow a normal distribution. In general, let the crossover design have t treatments and n treatment sequence groups, let ri = subjects in the ith treatment sequence group, and let each group receive treatments in a different order for p treatment periods. Then yijk is the response value of the jth subject of the ith treatment sequence in the kth period and can be expressed as

yijk = μ + αi + βij + γk + τd(i,k) + λc(i,k−1) + εijk,
i = 1, 2, ..., n; j = 1, 2, ..., ri; k = 1, 2, ..., p; d, c = 1, 2, ..., t;

where μ is the grand mean, αi is the effect of the ith treatment sequence group, βij is the random effect of the jth subject in the ith treatment sequence group with variance σ²b, γk is the kth period effect, τd(i,k) is the direct effect of treatment in period k of sequence group i, and λc(i,k−1) is the carryover effect of the treatment in period k − 1 of sequence group i. Note that λc(i,0) = 0 because there is no carryover effect in the first period. And εijk is the random error for the jth subject in the ith treatment sequence of period k with variance σ². This model is called a mixed model because it contains a random component. In order to simplify, let us denote Treatments A, B, and C as Treatments 1, 2, and 3, respectively. The first observed values in the first and second sequences of the example can be simply written

Sequence 1 (A → B):
y111 = μ + α1 + β11 + γ1 + τ1 + ε111
y112 = μ + α1 + β11 + γ2 + τ2 + λ1 + ε112
y113 = μ + α1 + β11 + γ3 + τ3 + λ2 + ε113

Sequence 2 (B → C):
y211 = μ + α2 + β21 + γ1 + τ2 + ε211
y212 = μ + α2 + β21 + γ2 + τ3 + λ2 + ε212
y213 = μ + α2 + β21 + γ3 + τ1 + λ3 + ε213.

From above, it may be seen that there is no carryover effect in the first period, and because only first-order carryovers are considered here, the first-order carryover effects of Treatments A and B are
λ1 and λ2, respectively, in Sequence 1. Likewise, the first-order carryover effects of Treatments B and C are λ2 and λ3, respectively, in Sequence 2.
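As one illustration of how a model of this form might be fit in practice, the sketch below specifies a linear mixed model with a random subject effect and fixed sequence, period, treatment, and first-order carryover effects. The data file and the column names are assumptions made for the example, not part of the original text, and other parameterizations of the carryover term are possible.

```python
# A hedged sketch, not the entry's own analysis: fit a mixed model for
# normally distributed crossover data with a random subject effect and
# fixed sequence, period, treatment, and first-order carryover effects.
# The file name and column names (subject, sequence, period, treatment, y)
# are assumptions made for this illustration.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("crossover_long.csv")  # one row per subject per period
data = data.sort_values(["subject", "period"])

# Code the carryover term as the treatment received in the previous
# period, with a "none" level in period 1 (so lambda_c(i,0) = 0).
data["carryover"] = (
    data.groupby("subject")["treatment"].shift(1).fillna("none")
)

model = smf.mixedlm(
    "y ~ C(sequence) + C(period) + C(treatment) + C(carryover)",
    data=data,
    groups=data["subject"],
)
print(model.fit().summary())
```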
Actually, crossover designs are specific repeated measures designs with observed values on each experimental unit that are repeated under different treatment conditions at different time points. The design provides a multivariate observation for each experimental unit.

The univariate analysis of variance can be used for the crossover designs if any of the assumptions of independence, compound symmetry, or the Huynh–Feldt condition are appropriate for the experimental errors. Where independence and compound symmetry are sufficient conditions to justify ordinary least squares, the Huynh–Feldt condition (Type H structure) is both a sufficient and necessary condition for use of ordinary least squares. There are two cases of ANOVA for crossover design: The first is the analysis of variance without carryover effect if the crossover design is a balanced row–column design. The experimental units and periods are the rows and columns of the design, and the direct treatment effects are orthogonal to the columns. The analysis approach is the same as that of a Latin square experiment. The second is the analysis of variance with carryover effect, which is treated as a repeated measures split-plot design with the subjects as whole plots and the repeated measures over the p periods as the subplots. The total sum of squares is calculated from between and within subjects' parts. The significance of the carryover effects must be determined before the inference is made on the comparison of the direct effects of treatments.

For normal data without missing values, a least squares analysis to obtain treatment, period, and subject effects is efficient. Whenever there are missing data, within-subject treatment comparisons are not available for every subject. Therefore, additional between-subject information must be used.

In crossover design, the dependent variable may not be normally distributed, such as an ordinal or a categorical variable. The crossover analysis becomes much more difficult if a period effect exists. Two basic approaches can be considered in this case.

Approaches of Nonnormal Data

If the dependent variables are continuous but not normal, the Wilcoxon's rank sum test can be applied to compare the treatment effects and period effects. The crossover should not be used if there is a carryover effect, which cannot be separated from treatment effects or period effects. For this case, bootstrapping, permutation, or randomization tests provide alternatives to normal theory analyses. Gail Tudor and Gary G. Koch described the nonparametric method and its limitations for statistical models with baseline measurements and carryover effects. Actually, this method is an extension of the Mann–Whitney, Wilcoxon's, or Quade's statistics. Other analytical methods are available if there is no carryover effect. For most designs, nonparametric analysis is much more limited than a parametric analysis.

Approaches for Ordinal and Binary Data

New statistical techniques have been developed to deal with longitudinal data of this type over recent decades. These new techniques can also be applied in the analysis of crossover design. A marginal approach using a weighted least square method is proposed by J. R. Landis and others. A generalized estimating equation approach was developed by Kung-Yee Liang and colleagues in 1986. For the binary data, subject effect models and marginal effect models can be applied. Bootstrap and permutation tests are also useful for this type of data. Variance and covariance structures need not necessarily be considered. The different estimates of the treatment effect may be obtained on the basis of different assumptions of the models. Although researchers have developed different approaches for different scenarios, each approach has its own problems or limitations. For example, marginal models can be used to deal with missing data, such as dropout values, but the designs lose efficiency, and their estimates are calculated with less precision. The conditional approach loses information about patient behavior and it is restricted to the logit link function. M. G. Kenward and B. Jones in 1994 gave a more comprehensive discussion of different approaches. Usually, large sample sizes are needed when the data are ordinal or dichotomous, which is a limitation of crossover designs.

Many researchers have tried to find an "optimal" method. J. Kiefer in the 1970s proposed the concept of universal optimality and named D-, A-,
and E-optimality. The design is universally optimal and satisfies other optimal conditions, but it is extremely complicated to find a universally optimal crossover design for any given scenario.

Ying Liu

See also Block Design; Repeated Measures Design

Further Readings

Davis, A. W., & Hall, W. B. (1969). Cyclic change-over designs. Biometrika, 56, 283–293.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lasserre, V. (1991). Determination of optimal design using linear models in crossover trials. Statistics in Medicine, 10, 909–924.
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

CROSS-SECTIONAL DESIGN

The methods used to study development are as varied as the theoretical viewpoints on the process itself. In fact, often (but surely not always) the researcher's theoretical viewpoint determines the method used, and the method used usually reflects the question of interest. Age correlates with all developmental changes but poorly explains them. Nonetheless, it is often a primary variable of concern in developmental studies. Hence, there are two traditional research designs: longitudinal methods, which examine one group of people (such as people born in a given year), following and reexamining them at several points in time (such as in 2000, 2005, and 2010), and cross-sectional designs, which examine more than one group of people (of different ages) at one point in time. For example, a study of depression might examine adults of varying ages (say 40, 50, and 60 years old) in 2009.

Cross-sectional studies are relatively inexpensive and quick to conduct (researchers can test many people of different ages at the same time), and they are the best way to study age differences (not age changes). On the other hand, a cross-sectional study cannot provide a very rich picture of development; by definition, such a study examines one small group of individuals at only one point in time. Finally, it is difficult to compare groups with one another, because unlike a longitudinal design, participants do not act as their own controls. Cross-sectional studies are quick and relatively simple, but they do not provide much information about the ways individuals change over time.

As with longitudinal designs, cross-sectional designs result in another problem: the confounding of age with another variable—the cohort (usually thought of as year of birth). Confounding is the term used to describe a lack of clarity about whether one or another variable is responsible for observed results. In this case, we cannot tell whether the obtained results are due to age (reflecting changes in development) or some other variable.

Confounding refers to a situation in which the effects of two or more variables on some outcome cannot be separated. Cross-sectional studies confound the time of measurement (year of testing) and age. For example, suppose you are studying the effects of an early intervention program on later social skills. If you use a new testing tool that is very sensitive to the effects of early experience, you might find considerable differences among differently aged groups, but you will not know whether the differences are attributable to the year of birth (when some cultural influence might have been active) or to age. These two variables are confounded.

What can be done about the problem of confounding age with other variables? K. Warner Schaie first identified cohort and time of testing as factors that can help explain developmental outcomes, and he also devised methodological tools to account for and help separate the effects of age, time of testing, and cohort. According to Schaie, age differences among groups represent maturational factors, differences caused by when a group was tested (time of testing) represent environmental effects, and cohort differences represent environmental or hereditary effects or an interaction between the two. For example, Paul B. Baltes and John R. Nesselroade found that differences in the performance of adolescents of the same age on a set of personality tests were related to the year in
which the adolescents were born (cohort) as well as when these characteristics were measured (time of testing).

Sequential development designs help to overcome the shortcomings of both cross-sectional and longitudinal developmental designs, and Schaie proposed two alternative models for developmental research—the longitudinal sequential design and the cross-sectional sequential design—that avoid the confounding that results when age and other variables compete for attention. Cross-sectional sequential designs are similar to longitudinal sequential designs except that they do not repeat observations on the same people from the cohort; rather, different groups are examined from one testing time to the next. For example, participants tested in 2000, 2005, and 2010 would all come from different sets of participants born in 1965. Both of these designs allow researchers to keep certain variables (such as time of testing or cohort) constant while they test the effects of others.

Neil J. Salkind

See also Control Variables; Crossover Design; Independent Variable; Longitudinal Design; Research Hypothesis; Research Question; Sequential Design

Further Readings

Schaie, K. W. (1992). The impact of methodological changes in gerontology. International Journal of Human Development & Aging, 35, 19–29.
Birren, J. E., & Schaie, K. W. (Eds.). (2006). Handbook of the psychology of aging (6th ed.). San Diego, CA: Elsevier.
Sneve, M., & Jorde, R. (2008). Cross-sectional study on the relationship between body mass index and smoking, and longitudinal changes in body mass index in relation to change in smoking status: The Tromsø study. Scandinavian Journal of Public Health, 36(4), 397–407.

CROSS-VALIDATION

Cross-validation is a data-dependent method for estimating the prediction error of a fitted model or a trained algorithm. The basic idea is to divide the available data into two parts, called training data and testing data, respectively. The training data are used for fitting the model or training the algorithm, while the testing data are used for validating the performance of the fitted model or the trained algorithm for prediction purposes.

A typical proportion of the training data might be roughly 1/2 or 1/3 when the data size is large enough. The division of the data into training part and testing part can be done naturally or randomly. In some applications, a large enough subgroup of the available data is collected independently of the other parts of the data by different people or institutes, or through different procedures but for similar purposes. Naturally, that part of the data can be extracted and used for testing purposes only. If such a subgroup does not exist, one can randomly draw a predetermined proportion of data for training purposes and leave the rest for testing.
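A minimal sketch of such a random division is given below; the 0.5 training proportion and the variable names are illustrative only and are not prescribed by the text.

```python
# A minimal sketch of randomly drawing a predetermined proportion of the
# data for training and leaving the rest for testing; the 0.5 training
# proportion and the variable names are illustrative only.
import random

def split_train_test(records, train_proportion=0.5, seed=0):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_proportion * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]  # (training, testing)

training, testing = split_train_test(list(range(100)))
print(len(training), len(testing))  # 50 50
```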
K-Fold Cross-Validation

In many applications, the amount of available data is not large enough for a simple cross-validation. Instead, K-fold cross-validation is commonly used to extract more information from the data. Unlike the simple training-and-testing division, the available data are randomly divided into K roughly equal parts. Each part is chosen in turn for testing purposes, and each time the remaining (K − 1) parts are used for training purposes. The prediction errors from all the K validations are collected, and the sum is used for cross-validation purposes.

In order to formalize the K-fold cross-validation using common statistical notations, suppose the available data set consists of N observations or data points. The ith observation includes predictor(s) xi in scalar (or vector) form and response yi, also known as input and output, respectively. Suppose a random partition of the data divides the original index set {1, 2, . . . , N} into K subsets I(1), I(2), . . . , I(K) with roughly equal sizes. For the kth subset, let f̂^(−k)( · ) be the fitted prediction function based on the rest of the data after removing the kth subset. Then the K-fold cross-validation targets the average prediction error defined as
CV = (1/N) Σ_{k=1}^{K} Σ_{i ∈ I(k)} L(yi, f̂^(−k)(xi)).

In the above expression, CV is the average prediction error, and L( · , · ) is a predetermined function, known as the loss function, which measures the difference between the observed response yi and the predicted value f̂^(−k)(xi). Commonly used loss functions L(y, f̂) include the squared loss function (y − f̂)²; the absolute loss function |y − f̂|; the 0–1 loss function, which is 0 if y = f̂ and 1 otherwise; and the cross-entropy loss function −2 log P̂(Y = yi | xi).

The result CV of K-fold cross-validation depends on the value of K used. Theoretically, K can be any integer between 2 and the data size N. Typical values of K include 5 and 10. Generally speaking, when K is small, CV tends to overestimate the prediction error because each training part is only a fraction of the full data set. As K gets close to N, the expected bias tends to be smaller, while the variance of CV tends to be larger because the training sets become more and more similar to each other. In the meantime, the cross-validation procedure involves more computation because the target model needs to be fitted K times. As a compromise, 5-fold or 10-fold cross-validation would be suggested.

When K = N is used, the corresponding cross-validation is known as leave-one-out cross-validation. It minimizes the expected bias but can be highly variable. For linear models using the squared loss function, one may use the generalized cross-validation to overcome the intensive computation problem. Basically, it provides an approximated CV defined as

GCV = (1/N) Σ_{i=1}^{N} [ (yi − f̂(xi)) / (1 − trace(S)/N) ]².

Here S is the N × N matrix in the fitting equation ŷ = Sy, and trace(S) is the sum of the diagonal elements of S.

Cross-Validation Applications

The idea of cross-validation can be traced back to the 1930s. It was further developed and refined in the 1960s. Nowadays, it is widely used, especially when the data are unstructured or fewer model assumptions could be made. The cross-validation procedure does not require distribution assumptions, which makes it flexible and robust.

Choosing a Parameter

One of the most successful applications of cross-validation is to choose a smoothing parameter or penalty coefficient. For example, a researcher wants to find the best function f for prediction purposes but is reluctant to add many restrictions. To avoid the overfitting problem, the researcher tries to minimize a penalized residual sum of squares defined as follows:

RSS(f, λ) = Σ_{i=1}^{N} [yi − f(xi)]² + λ ∫ [f″(x)]² dx.

Then the researcher needs to determine which λ should be used, because the solution changes along with it. For each candidate λ, the corresponding fitted f is defined as the solution minimizing RSS(f, λ). Then the cross-validation can be applied to estimate the prediction error CV(λ). The optimal λ according to cross-validation is the one that minimizes CV(λ).

Model Selection

Another popular application of cross-validation is to select the best model from a candidate set. When multiple models, or methods, or algorithms are applied to the same data set, a frequently asked question is which one is the best. Cross-validation provides a convenient criterion to evaluate their performance. One can always divide the original data set into training and testing parts, fit each model based on the training data, and estimate the prediction error of the fitted model based on the testing data. Then the model that attains the minimal prediction error is the winner.
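The sketch below illustrates both ideas at once: it computes the K-fold estimate CV defined earlier with a squared loss function and then uses it to compare two candidate prediction rules. The two candidates and the simulated data are toy stand-ins chosen only to make the example self-contained; they are not taken from the text.

```python
# A sketch of the K-fold estimate CV defined above, using squared loss and
# a user-supplied fit(training_pairs) -> predict(x) function; the two toy
# candidates and the simulated data are illustrative stand-ins only.
import random

def k_fold_cv(xs, ys, fit, k=5, seed=0):
    n = len(xs)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[j::k] for j in range(k)]  # K roughly equal parts I(k)
    total_loss = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [(xs[i], ys[i]) for i in indices if i not in held_out]
        predict = fit(train)  # prediction function fitted without the kth part
        total_loss += sum((ys[i] - predict(xs[i])) ** 2 for i in fold)
    return total_loss / n  # average squared prediction error

def fit_mean(train):  # candidate 1: predict the training mean
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):  # candidate 2: simple least squares line
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

xs = list(range(30))
ys = [2.0 * x + random.Random(x).gauss(0, 1) for x in xs]
print(k_fold_cv(xs, ys, fit_mean), k_fold_cv(xs, ys, fit_line))
```

The candidate with the smaller CV value would be preferred under the model-selection use of cross-validation described above.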
Two major concerns need to be addressed for the procedure. One of them is that the estimated prediction error may strongly depend on how the data are divided. The value of CV itself is random if it is calculated on the basis of a random partition. In order to compare two random CVs, one may need to repeat the whole procedure many times and compare the CV values on average. Another concern is that the selected model may succeed for one data set but fail for another because cross-validation is a data-driven method. In practice, people tend to test the selected model again when additional data are available via other sources before they accept it as the best one.

Cross-validation may also be needed during fitting a model as part of the model selection procedure. In the previous example, cross-validation is used for choosing the most appropriate smoothing parameter λ. In this case, the data may be divided into three parts: training data, validation data, and testing data. One may use the validation part to choose the best λ for fitting the model and use the testing part for model selection.

Jie Yang

See also Bootstrapping; Jackknife

Further Readings

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall/CRC.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.

CUMULATIVE FREQUENCY DISTRIBUTION

Cumulative frequency distributions report the frequency, proportion, or percentage of cases at a particular score or less. Thus, the cumulative frequency of a score is calculated as the frequency of occurrence of that score plus the sum of the frequencies of all scores with a lower value. Cumulative frequency distributions are usually displayed with the aid of tables and graphs and may be put together for both ungrouped and grouped scores.

Cumulative Frequency Tables for Distributions With Ungrouped Scores

A cumulative frequency table for distributions with ungrouped scores typically includes the scores a variable takes in a particular sample, their frequencies, and the cumulative frequency. In addition, the table may include the cumulative relative frequency or proportion, and the cumulative percentage frequency. Table 1 illustrates the frequency, cumulative frequency, cumulative relative frequency, and cumulative percentage frequency for a set of data showing the number of credits a sample of students at a college have registered for in the autumn quarter.

The cumulative frequency is obtained by adding the frequency of each observation to the sum of the frequencies of all previous observations (which is, actually, the cumulative frequency on the previous row). For example, the cumulative frequency for the first row in Table 1 is 1 because there are no previous observations. The cumulative frequency for the second row is 1 + 0 = 1. The cumulative frequency for the third row is 1 + 2 = 3. The cumulative frequency for the fourth row is 3 + 1 = 4, and so on. This means that four students have registered for 13 credits or fewer in the autumn quarter. The cumulative frequency for the last observation must equal the number of observations included in the sample.

Cumulative relative frequencies or cumulative proportions are obtained by dividing each cumulative frequency by the number of observations. Cumulative proportions show the proportion of observations that fulfill a particular criterion or less. For example, the proportion of students who have registered for 14 credits or fewer in the autumn quarter is 0.60. The cumulative proportion for the last observation (last row) is always 1.

Cumulative percentages are obtained by multiplying the cumulative proportions by 100. Cumulative percentages show the percentage of observations that fulfill a certain criterion or less. For example, 40% of students have registered for 13 credits or fewer in the autumn quarter. The cumulative percentage of the last observation (the last row) is always 100.
Table 1   Cumulative Frequency Distribution of the Number of Credits Students Have Registered for in the Autumn Quarter

Number of   Frequency   Cumulative   Cumulative relative       Cumulative percentage
credits     f           frequency    frequency                 frequency
y                       cum(f)       cumr(f) = cum(f)/n        cump(f) = 100 × cumr(f)
10          1           1            0.10                      10.00
11          0           1            0.10                      10.00
12          2           3            0.30                      30.00
13          1           4            0.40                      40.00
14          2           6            0.60                      60.00
15          4           10           1.00                      100.00
n = 10
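The cumulative columns of Table 1 can be generated directly from the raw frequencies, as in the short sketch below (the frequency counts used are those shown in the table).

```python
# A short sketch that reproduces the cumulative columns of Table 1 from
# the raw frequencies (the counts below are taken from the table).
frequencies = {10: 1, 11: 0, 12: 2, 13: 1, 14: 2, 15: 4}
n = sum(frequencies.values())  # 10 students

running_total = 0
for score in sorted(frequencies):
    running_total += frequencies[score]      # cumulative frequency
    cum_rel = running_total / n              # cumulative relative frequency
    cum_pct = 100 * cum_rel                  # cumulative percentage
    print(score, frequencies[score], running_total,
          round(cum_rel, 2), round(cum_pct, 2))
```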

Table 2   Grouped Cumulative Frequency Distribution of Workers' Salaries

Workers'           Frequency   Cumulative   Cumulative relative       Cumulative percentage
salaries           f           frequency    frequency                 frequency
y                              cum(f)       cumr(f) = cum(f)/n        cump(f) = 100 × cumr(f)
$15,000–$19,999    4           4            0.27                      26.67
$20,000–$24,999    6           10           0.67                      66.67
$25,000–$29,999    3           13           0.87                      86.67
$30,000–$34,999    0           13           0.87                      86.67
$35,000–$39,999    2           15           1.00                      100.00
n = 15
Grouped Cumulative Frequency Distributions

For distributions with grouped scores, the cumulative frequency corresponding to each class equals the frequency of occurrence of scores in that particular class plus the sum of the frequencies of scores in all lower classes (which is, again, the cumulative frequency on the previous row). Grouped cumulative frequency distributions are calculated for continuous variables or for discrete variables that take too many values for the list of all possible values to be useful. The cumulative distribution table for this case is very similar to Table 1, and the entries in the table are computed in a similar way. The only difference is that instead of individual scores, the first column contains classes. Table 2 is an example of a cumulative frequency table for a sample of workers' salaries at a factory.

The cumulative frequency of a particular class is calculated as the frequency of the scores in that class plus the sum of the frequencies of scores in all lower classes. Cumulative frequencies in Table 2 show the number of workers who earn a certain amount of money or less. For example, the reader learns that 10 workers earn $24,999 or less.

Cumulative relative frequencies or proportions show the proportion of workers earning a certain amount of money or less and are calculated by dividing the cumulative frequency by the number of observations. For example, the proportion of workers earning $29,999 or less is 0.87.

Finally, cumulative percentages show the percentage of workers earning a certain amount of money or less, and these percentages are obtained by multiplying the cumulative relative frequency by 100. For example, 86.67% of workers earn $34,999 or less. The last row shows that 15 workers, representing all the workers included in the analyzed sample (100%), earn $39,999 or less. The cumulative proportion of the workers in this last class is, logically, 1.

Graphing Cumulative Distributions of Discrete Variables

Cumulative frequencies, cumulative proportions, and cumulative percentages of a discrete variable can all be represented graphically. An upper right quadrant of a two-dimensional space is
typically used to display a cumulative distribution. The quadrant is bounded by an x-axis and a y-axis. The x-axis is positioned horizontally and usually corresponds to the scores that the variable can take. The y-axis indicates the cumulative frequency, proportion, or percentage. Cumulative distributions may be graphed using histograms or, more commonly, polygons.

To exemplify, the number of credits from Table 1 is plotted in Figure 1. The number of credits the students have registered for is represented on the x-axis. The cumulative frequency is represented on the y-axis. Using each score and its corresponding cumulative frequency, a number of points are marked inside the quadrant. They are subsequently joined to form a polygon that represents the cumulative frequency distribution graph.

Figure 1   Cumulative Frequency Distribution of the Number of Credits Students Have Registered for in the Autumn Quarter (cumulative frequency polygon; x-axis: Number of Credits, y-axis: Cumulative Frequency)

Graphing Grouped Cumulative Distributions

Grouped cumulative frequency distributions may also be graphed via both histograms and polygons. Again, it is more common to use cumulative frequency (or proportion or percentage) polygons. To draw the polygon for a grouped distribution, the points represented by the midpoint of each class interval and their corresponding cumulative frequency are connected.

To exemplify, the cumulative frequency distribution described in Table 2 is graphically illustrated in Figure 2.

Figure 2   Grouped Cumulative Frequency Distribution of Workers' Salaries (cumulative frequency polygon; x-axis: Workers' Salaries, Thousands $, y-axis: Cumulative Frequency)

Shape of Cumulative Distributions

Cumulative polygons are often referred to as ogives. They climb steeply in regions where classes include many observations, such as the middle of a bell-shaped distribution, and climb slowly in regions where classes contain fewer observations. Consequently, frequency distributions with bell-shaped histograms generate S-shaped cumulative frequency curves. If the distribution is bell shaped, the cumulative proportion equals 0.50 at the average of the x value, which is situated close to the midpoint of the horizontal range. For positive skews, the cumulative proportion reaches 0.50 before the x average. On the contrary, for negative skews, the cumulative proportion reaches 0.50 further away to the right, after the x average.

Many statistical techniques work best when variables have bell-shaped distributions. It is, however, essential to examine the actual form of the data distributions before one uses these techniques. Graphs provide the simplest way to do so, and cumulative polygons often offer sufficient information.

Using Cumulative Distributions in Practice

Cumulative frequency (or proportion or percentage) distributions have proved useful in a multitude of research spheres. Typically, cumulative percentage curves allow us to answer questions such as, What is the middle score in the distribution? What is the dividing point for the top 30% of the group? Consequently, cumulative percentage curves are
used by instructors in schools, universities, and other educational centers to determine the proportion of students who scored less than a specified limit. Doctors' offices may use cumulative weight and height distributions to investigate the percentage of children of a certain age whose weight and height are lower than a certain standard. All this is usually done by using cumulative frequency distributions to identify percentiles and percentile ranks. Consequently, the cumulative frequency curve is also known as the percentile curve.

Finding Percentiles and Percentile Ranks From Cumulative Distributions

A percentile is the score at or below which a specified percentage of scores in a distribution falls. For example, if the 40th percentile of an examination is 120, it means that 40% of the scores on the examination are equal to or less than 120. The percentile rank of a score indicates the percentage of scores in the distribution that are equal to or less than that score. Referring to the above example, if the percentile rank of a score of 120 is 40, it means that 40% of the scores are equal to or less than 120.

Percentiles and percentile ranks may be determined with certain formulas. Nevertheless, as already mentioned, the cumulative distribution polygon is often used to determine percentiles and percentile ranks graphically. To exemplify, Figure 3 shows the cumulative percentage frequency distribution for the data in Table 1, with the cumulative percentage/percentile scale represented on the y-axis and the number of credits on the x-axis.

Figure 3   Cumulative Percentage Frequency Distribution of the Number of Credits Students Have Registered for in the Autumn Quarter (cumulative percentage polygon; x-axis: Number of credits, y-axis: Percentile scale, 0–100)

To determine the percentile rank corresponding to 12 credits, a perpendicular is erected from the x-axis at point 12 until it meets the cumulative polygon. A horizontal line is then drawn from this point until it meets the y-axis, which is at point 30 on the percentile scale. This means that 30% of the students have registered for 12 credits or fewer for the autumn quarter. The cumulative percentage frequency data shown in Table 1 confirm this result. The method may be employed the other way around as well. The percentile rank may be used to determine the corresponding percentile. Drawing a horizontal line from the point 60 on the percentile scale to the cumulative polygon and then a perpendicular on the x-axis from the point of intersection, the point (14, 0) is met. In other words, the 60th percentile corresponds to 14 credits, showing that 60% of the students have registered for 14 credits or fewer. The midpoint in a distribution is often known as the 50th percentile, the median, or the second quartile. It represents the point below which we find 50% of the scores and above which the other 50% of the scores are located. The 10th percentile is called the first decile, and each multiple of 10 is referred to as a decile. The 25th percentile is the first quartile, and the 75th percentile is the third quartile. This is a very simple method to determine the percentile rank corresponding to a particular score. However, the method may not always be accurate, especially when a very fine grid is not available.
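The graphical procedure just described amounts to linear interpolation along the cumulative percentage polygon, and it can be mirrored numerically, as in the sketch below for the Table 1 points; the function is illustrative and is not one of the tabled formulas mentioned in the text.

```python
# A sketch of reading a percentile rank off the cumulative percentage
# polygon by linear interpolation, using the Table 1 points; it mirrors
# the graphical procedure rather than any tabled formula.
points = [(10, 10.0), (11, 10.0), (12, 30.0), (13, 40.0), (14, 60.0), (15, 100.0)]

def percentile_rank(score):
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= score <= x1:
            return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
    raise ValueError("score outside the plotted range")

print(percentile_rank(12))  # 30.0, as read from Figure 3
print(percentile_rank(14))  # 60.0: the 60th percentile falls at 14 credits
```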
Oana Pusa Mihaescu

See also Descriptive Statistics; Distribution; Frequency Distribution; Frequency Table; Histogram; Percentile Rank

Further Readings

Downie, N. M., & Heath, R. W. (1974). Basic statistical methods. New York: Harper & Row.
Fielding, J. L., & Gilbert, G. N. (2000). Understanding social statistics. Thousand Oaks, CA: Sage.
Fried, R. (1969). Introduction to statistics. New York: Oxford University Press.
320 Cumulative Frequency Distribution

Hamilton, L. (1996). Data analysis for social sciences: A Kolstoe, R. H. (1973). Introduction to statistics for the
first course in applied statistics. Belmont, CA: behavioral sciences. Homewood, IL: Dorsey Press.
Wadsworth. Lindquist, E. F. (1942). A first course in statistics: Their
Kiess, H. O. (2002). Statistical concepts for the use and interpretation in education and psychology.
behavioral sciences. Boston: Allyn and Bacon. Cambridge, MA: Riverside Press.
D
and indirect costs are associated with obtaining
DATABASES access to specific populations for collection of spe-
cific data. This limitation is eliminated by using
large-scale databases. Depending on the topic of
One of the most efficient and increasingly common interest, the use of databases provides researchers
methods of investigating phenomena in the educa- access to randomly sampled and nationally repre-
tion and social sciences is the use of databases. sentative populations.
Large-scale databases generally comprise informa- Databases also provide researchers with access
tion collected as part of a research project. Infor- to populations they may not have had access to
mation included in databases ranges from survey individually. Specifically, the recruitment of indivi-
data from clinical trials to psychoeducational data duals from diverse backgrounds (e.g., Black,
from early childhood projects. Research projects Latino) has generally been a problem in the social
from which databases are derived can be longitudi- and medical sciences due to historical issues center-
nal or cross-sectional in nature, use multiple or ing on mistrust of researchers (e.g., the Tuskegee
individual informants, be nationally representative Experiment). While this is the case, databases such
or specific to a state or community, and be primary as the National Institute of Mental Health–funded
data for the original researcher or secondary data Collaborative Psychiatric Epidemiology Surveys
for individuals conducting analysis at a later time. (CPES) provide access to diverse subjects. Specifi-
This entry explores the benefits and limitations of cally, CPES joins together three nationally repre-
using databases in research, describes how to locate sentative surveys: the National Comorbidity
databases, and discusses the types of databases and Survey Replication (NCS-R), the National Survey
the future of the use of databases in research. of American Life (NSAL), and the National Latino
and Asian American Study (NLAAS). These stud-
Benefits ies collectively provide the first national data with
sufficient power to investigate cultural and ethnic
The primary advantage of using databases for influences on mental disorders. Although existing
research purposes is related to economics. Specifi- databases offer numerous benefits, they have lim-
cally, since databases consist of information that itations as well.
has already been collected, they save researchers
time and money because the data are readily avail-
able. As with many investigators, the primary hin-
Limitations
drance to conducting original field research is
limited monetary resources. Collecting data from The key limitation of using databases is that ques-
large samples is time-consuming, and many direct tions and the theoretical orientation of the original


researchers may not be congruent with those of Types of Databases


the secondary investigator. So if a researcher was
not part of the original research team, the concep- Early Childhood Databases
tualization of the constructs of interest in the data- Early Childhood Longitudinal Study
base may not be to his or her liking. Although
numerous available databases encompass a variety The Early Childhood Longitudinal Study (ECLS)
of topics, this limitation can be virtually impossi- consists of two overlapping cohorts: a birth cohort
ble to ignore. To combat it, researchers generally and a kindergarten cohort, better known as the
undertake the task of recoding questions and vari- ECLS-B and the ECLS-K, respectively. These data-
ables to fit their research questions of interest. bases include children who were followed from birth
In addition to question conceptualization pro- through kindergarten entry and from kindergarten
blems, another limitation of databases is the date through the eighth grade, respectively. The nation-
the data were collected. Specifically, if an individ- ally representative ECLS-B consists of 14,000 chil-
ual uses a database that is dated, this may impact dren born in the year 2001. The children
his or her ability to generalize his or her findings participating in the study come from diverse socio-
to the present day. This threat to internal validity economic and ethnic backgrounds, with oversamples
can be lessened if researchers use the most up-to- of Asian and Pacific Islander children, American
date database on their topic of interest. An exam- Indian and Alaska Native children, Chinese children,
ple of this is the U.S. Department of Education– twins, and low and very low birth weight children.
funded Education Longitudinal Study of 2002. Information about these children was collected when
This study is a direct follow-up to the National the children were approximately 9 months old,
Education Longitudinal Study of 1988. Although 2 years old (2003), and in preschool (1 year away
the 1988 study resulted in a high-quality, longitu- from kindergarten, fall 2005). In fall 2006, data
dinal database with significant policy implications, were collected from all participating sample chil-
stakeholders realized that the database was dated, dren, 75% of whom were expected to be age eligible
and the result was the initiation of the 2002 study. for kindergarten. In fall 2007, data were collected
from the remaining 25% of participating sample
children, who were newly eligible for kindergarten.
How to Locate Databases The ECLS-K is a nationally representative sample
of kindergartners, their families, their teachers, and
Due to the quantity of information and the man-
their schools all across the United States. Informa-
power necessary to conduct projects of this scope,
tion was collected in the fall and spring of kindergar-
the funding of large-scale research generally comes
ten (1998–1999); the fall and spring of first grade
from governmental entities such as the National
(1999–2000); and the spring of third grade (2002),
Institutes of Health (NIH) and the U.S. Depart-
fifth grade (2004), and eighth (2007) grade. It was
ment of Education. Since these institutions are tax-
designed to provide comprehensive and reliable data
payer funded, these large-scale databases are
that may be used to describe and better understand
generally available for free to researchers. How-
children’s development and experiences in the ele-
ever, because of the sensitive and personally identi-
mentary and middle school grades and how their
fiable information that can be ascertained from
early experiences relate to their later development,
these databases, researchers must obtain
learning, and experiences in school. The multifaceted
a restricted-use data license before many databases
data collected across the years allow researchers and
can be accessed. A restricted-use license consists of
policy makers to study how various student, home,
a justification of the need for the restricted-use
classroom, school, and community factors at various
data, an agreement to keep the data safe from
points in a child’s life relate to the child’s cognitive
unauthorized disclosures at all times, and an agree-
and social development.
ment to participate fully in unannounced, unsched-
uled inspections by the U.S. Department of
Head Start Family and Child Experiences Survey
Education or NIH security officials to ensure com-
pliance with the terms of the license and the secu- The Head Start Family and Child Experiences
rity procedures and plan. Survey (FACES) provides longitudinal data on

the characteristics, experiences, and outcomes of Adolescent Databases


Head Start children and families, as well as Education Longitudinal Study
the characteristics of the Head Start programs
that serve them. In 1997 the Department The Education Longitudinal Study (ELS) of
of Health and Human Services, Administration 2002 is part of the National Center for Education
for Children and Families, commissioned Statistics’ National Education Longitudinal Studies
FACES. The success of the original FACES program, which also includes three completed
database prompted follow-ups, and currently studies: the National Longitudinal Study of the
there are four FACES databases: FACES 1997, High School Class of 1972, the High School and
2000, 2003, and 2006. Each cohort has Beyond longitudinal study of 1980, and the
included a nationally representative sample of National Education Longitudinal Study of 1988.
Head Start children and their families. FACES The ELS database consists of a nationally repre-
has several major objectives which include sentative sample of students tracked from the time
studying the relationship among family, pre- they were high school sophomores until they enter
school, and school experiences; children’s aca- postsecondary education and the labor market. As
demic development in elementary school; and such, this database allows researchers to access
the developmental progression of children as information about individuals from the time they
they progress from Head Start to elementary are adolescents until their mid- to late 20s. Data
school. from this study are derived from information
obtained from students, school records, parents,
teachers, and high school administrators.
NICHD Study of Early Child Care
and Youth Development National Longitudinal Study of Adolescent Health
The National Institute of Child Health and The National Longitudinal Study of Adolescent
Human Development (NICHD) Study of Early Health (Add Health) is a nationally representative,
Child Care and Youth Development (SECCYD) school-based, longitudinal study that explores vari-
is a comprehensive longitudinal study initiated ables related to health behaviors for adolescents.
by the NICHD to answer questions about the The Add Health database provides researchers
relationships between child care experiences, with information on how social contexts (e.g.,
child care characteristics, and children’s devel- families, schools, and neighborhoods) influence
opmental outcomes. The SECCYD data are adolescents’ health and risk behaviors. Add Health
from 1,364 families, followed since their was started in 1994 by way of a grant from
infant’s birth in 1991. The study covers demo- NICHD. Data at the individual, family, school,
graphic, family, maternal, paternal, and care- and community levels were collected in two waves
giver characteristics; child social and emotional between 1994 and 1996. In 2001 and 2002, Add
outcomes; language development; cognitive Health respondents, 18 to 26 years old, were rein-
skills; school readiness; and growth and health terviewed in a third wave to investigate the influ-
measures. The study was conducted in four ence that adolescence has on young adulthood.
phases, based on the ages of the children. Phase
I of the study was conducted from 1991 to
1994, following the children from birth to age Special Population Databases
3 years. Phase II of the study was conducted
Pre-Elementary Education Longitudinal Study
between 1995 and 2000 to follow the 1,226
children continuing to participate from age 3 The Pre-Elementary Education Longitudinal
through their 2nd year in school. Phase III of Study (PEELS) is part of a group of studies on the
the study was conducted between 2000 and experiences, special services, and outcomes of chil-
2005 to follow more than 1,100 of the children dren, youth, and young adults with disabilities.
through their 7th year in school. Phase IV will The children were 3 to 5 years old at the start of
follow more than 1,000 of the original families the study. The purpose of PEELS is a better under-
through age 15. standing of the answers to the following questions:

What are the characteristics of children receiving Future of Database Research


preschool special education? What preschool pro-
grams and services do they receive? How are tran- As mentioned previously there are numerous
sitions between early intervention (programs for databases for the study of virtually any phenom-
children from birth to 3 years old) and preschool, enon. While this may be the case, the National
and between preschool and elementary school? Children’s Study may be the most ambitious
How do these children function and perform in database project. The development of this study
preschool, kindergarten, and early elementary was spurred by the 1998 President’s Task Force
school? Which child, school program, and/or spe- on Environmental Health and Safety Risks to
cial service characteristics are associated with bet- Children recommendation that a large prospec-
ter results in school? Data collection began in fall tive epidemiologic study of U.S. children be
2003 and was repeated in winter 2005, 2006, done. As such, the U.S. Congress, through the
2007, and 2009. Children’s Health Act of 2000, gave the NICHD
the task of conducting a national longitudinal
study of environmental influences (including
National Longitudinal Transition Study-2
physical, chemical, biological, and psychosocial
The National Longitudinal Transition Study-2 influences) on children’s health and develop-
(NLTS-2) is a 10-year study documenting the char- ment. Funded by the National Institute of Envi-
acteristics, experiences, and outcomes of a nation- ronmental Health Sciences, the Centers for
ally representative sample of more than 11,000 Disease Control and Prevention (CDC), and the
youth who were ages 13 through 16 and were U.S. Environmental Protection Agency, the study
receiving special education services in Grade 7 or will cost an estimated $2.3 billion over 30 years
above when the study began in 2001. The NLTS- and was initiated in 2007. Its purpose is to
2, funded by the U.S. Department of Education, is understand factors related to environmental
a follow-up of the original National Longitudinal risks and individual susceptibility factors for
Transition Study, conducted from 1985 through asthma, birth defects, dyslexia, attention deficit/
1993. Information in this database is derived from hyperactivity disorder, autism, schizophrenia,
children with disabilities, their parents, and their and obesity, as well as for adverse birth out-
schools. The study follows these children through comes. Participants for this study are a nationally
young adulthood. representative sample of 100,000 children. These
children will be followed from conception to
21 years of age and examines environmental
Special Education Elementary Longitudinal Study exposures. It includes genetic information and
The Special Education Elementary Longitudinal chemical analysis from the families’ communi-
Study (SEELS) is a study of school-age children ties. The database is intended to produce an
who were in special education. SEELS was funded extremely rich set of information for the study of
by the U.S. Department of Education and is part human development.
of the national assessment of the 1997 Individuals Historically, investigators have been encour-
With Disabilities Education Act. SEELS involves aged to collect their own data for research pur-
a large, nationally representative sample of stu- poses. While this is still true, the cost associated
dents in special education who were age 6 through with conducting independent field research and
12 in 1999. Students were selected randomly from difficulty gaining access to diverse samples
rosters of students in special education provided requires individuals to use other methods for
by local education agencies and state-operated spe- access to data. One of these methods is to use
cial schools for the deaf and blind that agreed to large-scale databases. As demonstrated by pub-
participate in the study. Beginning in the year lished research, the potential use of existing
2000 and concluding in 2006, SEELS documented databases to answer research questions is virtu-
the school experiences of a national sample of stu- ally unlimited. Given the widespread availability
dents as they move from elementary to middle of databases from multiple fields and the ongo-
school and from middle to high school. ing efforts of governmental agencies to fund

these endeavors, the use of databases will remain individual and household characteristics of youth with
a staple in research activities. disabilities: A report from the National Longitudinal
Transition Study-2 (NLTS2). Menlo Park, CA: SRI
Scott Graves International.

See also Primary Data Source; Secondary Data Source

Further Readings
Anderson, C., Fletcher, P., & Park, J. (2007). Early
DATA CLEANING
Childhood Longitudinal Study, Birth Cohort (ECLS–
B) Psychometric Report for the 2-year Data Collection Data cleaning, or data cleansing, is an important
(NCES 2007–084). Washington, DC: National Center part of the process involved in preparing data for
for Education Statistics. analysis. Data cleaning is a subset of data prepara-
Landrigan, P., Trasande, L., Thorpe, L., Gwynn, C., Lioy, tion, which also includes scoring tests, matching
P., D’Alton, M., et al. (2006). The national children’s data files, selecting cases, and other tasks that are
study: A 21-year prospective study of 10000 American required to prepare data for analysis.
children. Pediatrics, 118, 2173–2186.
Missing and erroneous data can pose a signifi-
Markowitz, J., Carlson, E., Frey, W., Riley, J., Shimshak,
cant problem to the reliability and validity of
A., Heinzen, H., et al. (2006). Preschoolers’
characteristics, services, and results: Wave 1 overview study outcomes. Many problems can be avoided
report from the Pre-Elementary Education through careful survey and study design. During
Longitudinal Study (PEELS). Rockville, MD: Westat. the study, watchful monitoring and data cleaning
NICHD Early Child Care Research Network. can catch problems while they can still be fixed.
(1993).Child care debate: Transformed or distorted?. At the end of the study, multiple imputation pro-
American Psychologist, 48, 692–693. cedures may be used for data that are truly
Pennell, B., Bowers, A., Carr, D., Chardoul, S., Cheung, irretrievable.
G., Dinkelmann, K., et al. (2004). The development The opportunities for data cleaning are depen-
and implementation of the National Comorbidity
dent on the study design and data collection meth-
Survey Replication, the National Survey of American
Life, and the National Latino and Asian American
ods. At one extreme is the anonymous Web survey,
Survey. International Journal of Methods in with limited recourse in the case of errors and
Psychiatric Research, 13, 241–269. missing data. At the other extreme are longitudinal
Rock, D., & Pollack, J. (2002). Early Childhood studies with multiple treatment visits and outcome
Longitudinal Study—Kindergarten Class of 1998–99 evaluations. Conducting data cleaning during the
(ECLS-K) psychometric report for kindergarten course of a study allows the research team to
through first grade (NCES 2002–05). Washington, obtain otherwise missing data and can prevent
DC: National Center for Education Statistics. costly data cleaning at the end of the study. This
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W., & entry discusses problems associated with data
Shavelson, R. (2007). Estimating causal effects: Using
cleaning and their solutions.
experimental and observational designs. Washington,
DC: American Educational Research Association.
U.S. Department of Health and Human Services. (2002).
A descriptive study of Head Start families: FACES
Types of ‘‘Dirty Data’’
technical report I. Washington, DC: U.S. Department
of Health and Human Services, Administration for Two types of problems are encountered in data
Children and Families. cleaning: missing data and errors. The latter may
Wagner, M., Kutash, K., Duchnowski, A., & Epstein, M. be the result of respondent mistakes or data
(2005). The special education elementary longitudinal
entry errors. The presence of ‘‘dirty data’’
study and the national longitudinal transition study:
Study designs and implications for children and youth
reduces the reliability and validity of the mea-
with emotional disturbance. Journal of Emotional and sures. If responses are missing or erroneous, they
Behavioral Disorders, 13, 25–41. will not be reliable over time. Because reliability
Wagner, M., Marder, C., Levine, P., Cameto, P., sets the upper bound for validity, unreliable
Cadwallader, T., & Blackorby, J. (2003). The items reduce validity.
Table 1   Data Errors and Missing Data

Variable          "True" Data                Incomplete, Incorrect, or Missing Data
Name              Maria Margaret Smith       Maria Smith
Date of birth     2/19/1981                  1981
Sex               F                          M
Ethnicity         Hispanic and Caucasian     (missing)
Education         B.A., Economics            College
Place of birth    Nogales, Sonora, Mexico    Nogales
Annual income     $50,000                    (missing)

Missing Data

Missing data reduce the sample size available for the analyses. An investigator's research design may require 100 respondents in order to have sufficient power to test the study hypotheses. Substantial effort may be required to recruit and treat 100 respondents. At the end of the study, if there are 10 important variables, with each variable missing only 5% of the time, the investigator may be reduced to 75 respondents with complete data for the analyses. Missing data effectively reduce the power of the study. Missing data can also introduce bias because questions that may be embarrassing or reveal anything illegal may be left blank. For example, if some respondents do not answer items about income, place of birth (for immigrants without documents), or drug use, the remaining cases with complete data are a biased sample that is no longer representative of the population.

Data Errors

Data errors are also costly to the study because lowered reliability attenuates the results. Respondents may make mistakes, and errors can be introduced during data entry. Data errors are more difficult to detect than missing data. Table 1 shows examples of missing data (ethnicity, income), incomplete data (date and place of birth), and erroneous data (sex).

Causes

All measuring instruments are flawed, regardless of whether they are in the physical or social sciences. Even with the best intentions, everyone makes errors. In the social sciences, most measures are self-report. Potentially embarrassing items can result in biased responses. Lack of motivation is also an important source of error. For example, respondents will be highly motivated in high-stakes testing such as the College Board exams but probably do not bring the same keen interest to one's research study.

Solutions and Approaches

Data problems can be prevented by careful study design and by pretesting of the entire research protocol. After the forms have been collected, the task of data cleaning begins. The following discussion of data cleaning is for a single paper-and-pencil survey collected in person. Data cleaning for longitudinal studies, institutional data sets, and anonymous surveys is addressed in a later section.

Missing Data

The best approach is to fill in missing data as soon as possible. If a data collection form is skimmed when it is collected, the investigator may be able to ask questions at that time about any missing items. After the data are entered, the files can be examined for remaining missing values. In many studies, the team may be able to contact the respondent or fill in basic data from memory. Even if the study team consists of only the principal investigator, it is much easier to fill in missing data after an interview than to do so a year later. At that point, the missing data may no longer be retrievable.

Data Entry Errors

A number of helpful computer procedures can be used to reduce or detect data entry errors, such as double entry or proactive database design. Double entry refers to entering the data twice, in order to ensure accuracy. Careful database design includes structured data entry screens that are limited to specified formats (dates, numbers, or text) or ranges (e.g., sex can only be M or F, and age must be a number less than 100).
Perils of Text Data

Text fields are the default for many database management systems, and a new user may not be aware of the problems that text data can cause at the end of a project. For example, "How many days did you wait before you were seen by a physician?" should require a number. If the database allows text data, the program will accept "10," "ten days," "10 days," "5 or 6," "half a day," "2-3," "about a month," "not sure," "NA," and so on. Before analysis can proceed, every one of these answers must be converted to a number. In large studies, the process of converting text to numbers can be daunting. This problem can be prevented by limiting fields to numeric inputs only.

Logically Impossible Responses

Computer routines can be used to examine the data logically and identify a broad range of errors. These validation routines can detect out-of-range values and logical inconsistencies; there are some combinations that are not possible in a data set. For example, all the dates for data collection should fall between the project's starting date and ending date. In a study of entire families, the children should be younger than their parents, and their ages should be somewhat consistent with their reported grade levels. More sophisticated procedures, such as Rasch modeling techniques, can be used to identify invalid surveys submitted by respondents who intentionally meddled with the system (for example, by always endorsing the first response).
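A minimal sketch of this kind of validation routine is shown below. The variable names, allowed ranges, and study dates are hypothetical; real checks would be written against a study's own codebook and data dictionary.

from datetime import date

STUDY_START, STUDY_END = date(2008, 1, 1), date(2009, 12, 31)  # hypothetical project period

def logic_checks(family):
    """Flag out-of-range values and logically impossible combinations."""
    flags = []
    # All data collection dates must fall between the project's start and end dates.
    for d in family["collection_dates"]:
        if not (STUDY_START <= d <= STUDY_END):
            flags.append(f"collection date {d} outside study period")
    # Children should be younger than their parents.
    if min(family["parent_ages"]) <= max(family["child_ages"]):
        flags.append("a child is as old as or older than a parent")
    # Ages should be roughly consistent with reported grade levels (rule of thumb).
    for age, grade in zip(family["child_ages"], family["child_grades"]):
        if not (grade + 4 <= age <= grade + 8):
            flags.append(f"age {age} inconsistent with grade {grade}")
    return flags

print(logic_checks({
    "collection_dates": [date(2008, 6, 1), date(2010, 2, 1)],
    "parent_ages": [34, 36],
    "child_ages": [12, 35],
    "child_grades": [6, 2],
}))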
Logistics

The amount of time required for data cleaning is dependent on who cleans the data and when the data cleaning takes place. The best people to clean the data are the study team, and the best time is while the study is under way. The research staff is familiar with the protocol and probably knows most of the study participants. A data set that is rife with problems can require a surprising amount of time for data cleaning at the end of a study.

Longitudinal Studies and Other Designs

Anonymous surveys, secondary data sets, longitudinal studies with multiple forms collected on multiple occasions, and institutional databases present different data-cleaning problems and solutions.

Anonymous surveys provide few opportunities for data cleaning. It is not possible to ask the respondent to complete the missing items. Data cleaning will be limited to logic checks to identify inconsistent responses and, in some cases, using the respondent's answers to other questions.

Secondary data sets, such as those developed and maintained by the Centers for Disease Control and Prevention, will be fully documented, with online files for the data collection forms, research protocols, and essential information about data integrity. The data collection instruments have been designed to minimize errors, and the majority of possible data cleaning will be complete.

Institutional databases can be an excellent source of information for addressing research questions. Hospital billing and claims files and college enrollment records have often been in place for decades. However, these systems were designed to meet an institution's ongoing requirements and were rarely planned with one's research study in mind. Legacy databases present some unique challenges for data cleaning.

In contrast to secondary data analyses, institutional databases may have limited documentation available to the outside researcher. Legacy databases usually change in order to meet the needs of the institution. The records of these changes may be unavailable or difficult for nonprogrammers to follow. User-initiated changes may be part of the institutional lore. Data accuracy may be high for critical data fields but lower for other variables, and a few variables may be entirely blank. Variable formats may change across data files or tables. For example, the same item in three data sets may have different field characteristics, with a leading space in the first, a leading zero in the second, and neither in the third. Data-cleaning approaches may involve significant time examining the data. The accuracy of the final product will need to be verified with knowledgeable personnel.

Studies with multiple instruments allow the investigators to reconstruct additional missing and erroneous data by triangulating data from the
other instruments. For example, one form from a testing battery may be missing a date that can be inferred from the rest of the packet if the error is caught during data entry.

Longitudinal studies provide a wealth of opportunity to correct errors, provided early attention to data cleaning and data entry has been built into the study. Identification of missing data and errors while the respondent is still enrolled in the study allows investigators to "fill in the blanks" at the next study visit. Longitudinal studies also provide the opportunity to check for consistency across time. For example, if a study collects physical measurements, children should not become shorter over time, and measurements should not move back and forth between feet and meters.

Along with opportunities to catch errors, studies with multiple forms and/or multiple assessment waves also pose problems. The first of these is the file-matching problem. Multiple forms must be matched and merged via computer routines. Different data files or forms may have identification numbers and sorting variables that do not exactly match, and these must be identified and changed before matching is possible.

Documentation

Researchers are well advised to keep a log of data corrections in order to track changes. For example, if a paper data collection form was used, the changes can be recorded on the form, along with the date the item was corrected. Keeping a log of data corrections can save the research team from trying to clean the same error more than once.

Data Integrity

Researcher inexperience and the ubiquitous lack of resources are the main reasons for poor data hygiene. First, experience is the best teacher, and few researchers have been directly responsible for data cleaning or data stewardship. Second, every study has limited resources. At the beginning of the study, the focus will invariably be on developing data collection forms and study recruitment. If data are not needed for annual reports, they may not be entered until the end of the study. The data analyst may be the first person to actually see the data. At this point, the costs required to clean the data are often greater than those required for the actual analyses.

Data Imputation

At the end of the data-cleaning process, there may still be missing values that cannot be recovered. These data can be replaced using data imputation techniques. Imputation refers to the process of replacing a missing value with a reasonable value. Imputation methods range from mean imputation (replacing a missing data point with the average for the entire study) to hot deck imputation (making estimates based on a similar but complete data set), single imputation if the proportion of missing values is small, and multiple imputation. Multiple imputation remains the best technique and will be necessary if the missing data are extensive or the values are not missing at random. All methods of imputation are considered preferable to case deletion, which can result in a biased sample.
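As a simple illustration of the difference between case deletion and imputation, the sketch below fills missing values with the variable mean. Mean imputation is shown only because it is the easiest method to write down; as noted above, multiple imputation (e.g., Rubin's approach) is preferred when missing data are extensive or not missing at random. The data values are hypothetical.

scores = [12.0, 15.0, None, 11.0, None, 14.0]   # hypothetical item scores

# Case deletion: drop incomplete cases (can bias the remaining sample).
complete = [x for x in scores if x is not None]

# Mean imputation: replace each missing value with the observed mean.
mean = sum(complete) / len(complete)
imputed = [x if x is not None else mean for x in scores]

print(len(complete), "complete cases of", len(scores))
print("imputed series:", imputed)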
Melinda Fritchoff Davis

See also Bias; Error; Random Error; Reliability; Systematic Error; True Score; Validity of Measurement

Further Readings

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Davis, M. F. (2010). Avoiding data disasters and other pitfalls. In S. Sidani & D. L. Streiner (Eds.), When research goes off the rails (pp. 320-326). New York: Guilford.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

DATA MINING

Modern researchers in various fields are confronted by an unprecedented wealth and complexity of data. However, the results available to these researchers through traditional data analysis techniques provide only limited solutions to complex situations. The approach to the huge demand for the analysis and interpretation of these complex data is managed under the name of data mining, or
knowledge discovery. Data mining is defined as the process of extracting useful information from large data sets through the use of any relevant data analysis techniques developed to help people make better decisions. These data mining techniques themselves are defined and categorized according to their underlying statistical theories and computing algorithms. This entry discusses these various data mining methods and their applications.

Types of Data Mining

In general, data mining methods can be separated into three categories: unsupervised learning, supervised learning, and semisupervised learning methods. Unsupervised methods rely solely on the input variables (predictors) and do not take into account output (response) information. In unsupervised learning, the goal is to facilitate the extraction of implicit patterns and elicit the natural groupings within the data set without using any information from the output variable. On the other hand, supervised learning methods use information from both the input and output variables to generate models that classify or predict the output values of future observations. The semisupervised method mixes the unsupervised and supervised methods to generate an appropriate classification or prediction model.

Unsupervised Learning Methods

Unsupervised learning methods attempt to extract important patterns from a data set without using any information from the output variable. Clustering analysis, which is one of the unsupervised learning methods, systematically partitions the data set by minimizing within-group variation and maximizing between-group variation. These variations can be measured on the basis of a variety of distance metrics between observations in the data set. Clustering analysis includes hierarchical and nonhierarchical methods.

Hierarchical clustering algorithms provide a dendrogram that represents the hierarchical structure of clusters. At the highest level of this hierarchy is a single cluster that contains all the observations, while at the lowest level are clusters containing a single observation. Examples of hierarchical clustering algorithms are single linkage, complete linkage, average linkage, and Ward's method.

Nonhierarchical clustering algorithms achieve the purpose of clustering analysis without building a hierarchical structure. The k-means clustering algorithm is one of the most popular nonhierarchical clustering methods. A brief summary of the k-means clustering algorithm is as follows: Given k seed (or starting) points, each observation is assigned to the seed point closest to it, which creates k clusters. Then the seed points are replaced with the means of the currently assigned clusters. This procedure is repeated with updated seed points until the assignments do not change. The results of the k-means clustering algorithm depend on the distance metrics, the number of clusters (k), and the location of seed points. Other nonhierarchical clustering algorithms include k-medoids and self-organizing maps.
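The brief summary above translates almost line for line into code. The following is a bare-bones sketch of the k-means procedure on a small artificial data set; it uses Euclidean distance and random seed points, two of the choices on which the entry notes the results depend. A production analysis would use a tested library implementation instead.

import numpy as np

rng = np.random.default_rng(0)

def k_means(X, k, n_iter=100):
    # Choose k seed (starting) points at random from the observations.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its closest seed point (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each seed point with the mean of its currently assigned cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # assignments no longer change
            break
        centers = new_centers
    return labels, centers

# Two artificial, well-separated groups of observations.
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5])
labels, centers = k_means(X, k=2)
print(centers)   # cluster means near (0, 0) and (5, 5)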
Principal components analysis (PCA) is another unsupervised technique and is widely used, primarily for dimensional reduction and visualization. PCA is concerned with the covariance matrix of the original variables, and the eigenvalues and eigenvectors are obtained from the covariance matrix. The product of the original data matrix and the eigenvector corresponding to the largest eigenvalue leads to the first principal component (PC), which expresses the maximum variance of the data set. The second PC is then obtained via the eigenvector corresponding to the second largest eigenvalue, and this process is repeated N times to obtain N PCs, where N is the number of variables in the data set. The PCs are uncorrelated with each other, and generally the first few PCs are sufficient to account for most of the variation. Thus, a PCA plot of the observations on the first few PC axes facilitates visualization of high-dimensional data sets.
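The covariance and eigenvector description above corresponds directly to the following sketch, which computes the principal components of a small artificial data matrix. In practice a library routine (or the singular value decomposition) would typically be used.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))           # 100 observations, N = 4 variables
Xc = X - X.mean(axis=0)                 # center each variable

cov = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues/eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]       # reorder from largest to smallest eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # principal component scores
explained = eigvals / eigvals.sum()     # share of variance expressed by each PC
print(explained)                        # the first few PCs account for most of it
print(scores[:, :2].shape)              # first two PCs, e.g., for a 2-D PCA plot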
Supervised Learning Methods

Supervised learning methods use both the input and output variables to provide the model or rule that characterizes the relationships between the input and output variables. Based on the characteristics of the output variable, supervised learning methods can be categorized as either regression or classification. In regression problems, the output variable is continuous, so the main goal is to predict the outcome values of an unknown future
observation. In classification problems, the output variable is categorical, and the goal is to assign existing labels to an unknown future observation.

Linear regression models have been widely used in regression problems because of their simplicity. Linear regression is a parametric approach that provides a linear equation to examine relationships of the mean response to one or to multiple input variables. Linear regression models are simple to derive, and the final model is easy to interpret. However, the parametric assumption of an error term in linear regression analysis often restricts its applicability to complicated multivariate data. Further, linear regression methods cannot be employed when the number of variables exceeds the number of observations. Multivariate adaptive regression splines (MARS) is a nonparametric regression method that compensates for the limitations of ordinary regression models. MARS is one of the few tractable methods for high-dimensional problems with interactions, and it estimates a completely unknown relationship between a continuous output variable and a number of input variables. MARS is a data-driven statistical linear model in which a forward stepwise algorithm is first used to select the model terms and is then followed by a backward procedure to prune the model. The approximation bends at "knot" locations to model curvature, and one of the objectives of the forward stepwise algorithm is to select the appropriate knots. Smoothing at the knots is an option that may be used if derivatives are desired.

Classification methods provide models to classify unknown observations according to the existing labels of the output variable. Traditional classification methods include linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), based on Bayesian theory. Both LDA and QDA assume that the data set follows a normal distribution. LDA generates a linear decision boundary by assuming that populations of different classes have the same covariance. QDA, on the other hand, does not have any restrictions on the equality of covariance between two populations and provides a quadratic equation that may be efficient for linearly nonseparable data sets.

Many supervised learning methods can handle both regression and classification problems, including decision trees, support vector machines, k-nearest neighbors, and artificial neural networks. Decision tree models have gained huge popularity in various areas because of their flexibility and interpretability. Decision tree models are flexible in that the models can efficiently handle both continuous and categorical variables in the model construction. The output of decision tree models is a hierarchical structure that consists of a series of if–then rules to predict the outcome of the response variable, thus facilitating the interpretation of the final model. From an algorithmic point of view, the decision tree model has a forward stepwise procedure that adds model terms and a backward procedure for pruning, and it conducts variable selection by including only useful variables in the model. Support vector machine (SVM) is another supervised learning model popularly used for both regression and classification problems. SVMs use geometric properties to obtain a separating hyperplane by solving a convex optimization problem that simultaneously minimizes the generalization error and maximizes the geometric margin between the classes. Nonlinear SVM models can be constructed from kernel functions that include linear, polynomial, and radial basis functions. Another useful supervised learning method is k-nearest neighbors (kNN). A type of lazy-learning (instance-based learning) technique, kNN does not require a trained model. Given a query point, the k closest points are determined. A variety of distance measures can be applied to calculate how close each point is to the query point. Then the k nearest points are examined to find which categories they belong to, and the most common of these categories is assigned to the query point being examined. This procedure is repeated for all the points that require classification. Finally, artificial neural networks (ANNs), inspired by the way biological nervous systems learn, are widely used for prediction modeling in many applications. ANN models are typically represented by a network diagram containing several layers (e.g., input, hidden, and output layers) that consist of nodes. These nodes are interconnected with weighted connection lines whose weights are adjusted when training data are presented to the ANN during the training process. The neural network training process is an iterative adjustment of the internal weights to bring the network's output
closer to the desired values through minimizing the mean squared error.
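Of the methods just described, k-nearest neighbors is simple enough to sketch in a few lines. The toy data and the choice of Euclidean distance below are illustrative assumptions; the classifier simply finds the k closest labeled points to a query point and assigns the most common of their categories.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Distance from the query point to every labeled observation.
    dists = np.linalg.norm(X_train - query, axis=1)
    # Examine the k nearest points and assign their most common category.
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Hypothetical training data: two classes in two dimensions.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([0.3, 0.2])))  # -> "a"
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # -> "b"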
Semisupervised Learning Methods

Semisupervised learning approaches have received increasing attention in recent years. Olivier Chapelle and his coauthors described semisupervised learning as "halfway between supervised and unsupervised learning" (p. 4). Semisupervised learning methods create a classification model by using partial information from the labeled data. One-class classification is an example of a semisupervised learning method that can distinguish between the class of interest (target) and all other classes (outliers). In the construction of the classifiers, one-class classification techniques require only the information from the target class. The applications of one-class classification include novelty detection, outlier detection, and imbalanced classification.

Support vector data description (SVDD) is a one-class classification method that combines a traditional SVM algorithm with a density approach. SVDD produces a classifier to separate the target from the outliers. The decision boundary of SVDD is constructed from an optimization problem that minimizes the volume of the hypersphere forming the boundary and maximizes the amount of target data captured by the boundary. The main difference between the supervised and semisupervised classification methods is that the former generates a classifier to classify an unknown observation into the predefined classes, whereas the latter gives a closed boundary around the target data in order to separate them from all other types of data.

Applications

Interest in data mining has increased greatly because of the availability of new analytical techniques with the potential to retrieve useful information or knowledge from vast amounts of complex data that were heretofore unmanageable. Data mining has a range of applications, including manufacturing, marketing, telecommunication, health care, biomedicine, e-commerce, and sports. In manufacturing, data mining methods have been applied to predict the number of product defects in a process and identify their causes. In marketing, market basket analysis provides a way to understand the behavior of profitable customers by analyzing their purchasing patterns. Further, unsupervised clustering analyses can be used to segment customers by market potential. In the telecommunication industries, data mining methods help sales and marketing people establish loyalty programs, develop fraud detection modules, and segment markets to reduce revenue loss. Data mining has received tremendous attention in the field of bioinformatics, which deals with large amounts of high-dimensional biological data. Data mining methods combined with microarray technology allow monitoring of thousands of genes simultaneously, leading to a greater understanding of molecular patterns. Clustering algorithms use microarray gene expression data to group the genes based on their level of expression, and classification algorithms use the labels of experimental conditions (e.g., disease status) to build models to classify different experimental conditions.

Data Mining Software

A variety of data mining software is available. SAS Enterprise Miner (www.sas.com), SPSS (an IBM company, formerly called PASW Statistics) Clementine (www.spss.com), and S-PLUS Insightful Miner (www.insightful.com) are examples of widely used commercial data mining software. In addition, commercial software developed by Salford Systems (www.salford-systems.com) provides CART, MARS, TreeNet, and Random Forests for specialized uses of tree-based models. Free data mining software packages also are available. These include RapidMiner (rapid-i.com), Weka (www.cs.waikato.ac.nz/ml/weka), and R (www.r-project.org).

Seoung Bum Kim and Thuntee Sukchotrat

See also Exploratory Data Analysis; Exploratory Factor Analysis; Ex Post Facto Study

Further Readings

Chapelle, O., Zien, A., & Schölkopf, B. (Eds.). (2006). Semi-supervised learning. Cambridge: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.
Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54, 45-66.

DATA SNOOPING

The term data snooping, sometimes also referred to as data dredging or data fishing, is used to describe the situation in which a particular data set is analyzed repeatedly without an a priori hypothesis of interest. The practice of data snooping, although common, is problematic because it can result in a significant finding (e.g., rejection of a null hypothesis) that is nothing more than a chance artifact of the repeated analyses of the data. The biases introduced by data snooping increase the more a data set is analyzed in the hope of a significant finding. Empirical research that is based on experimentation and observation has the potential to be impacted by data snooping.

Data Snooping and Multiple Hypothesis Testing

A hypothesis test is conducted at a significance level, denoted α, corresponding to the probability of incorrectly rejecting a true null hypothesis (the so-called Type I error). Data snooping essentially involves performing a large number of hypothesis tests on a particular data set with the hope that one of the tests will be significant. This data-snooping process of performing a large number of hypothesis tests results in the actual significance level being increased, or the burden of proof for finding a significant result being substantially reduced, resulting in potentially misleading results. For example, if 100 independent hypothesis tests are conducted on a data set at a significance level of 5%, it would be expected that about 5 out of the 100 tests would yield significant results simply by chance alone, even if the null hypothesis were, in fact, true. Any conclusions of statistical significance at the 5% level based on an analysis such as this are misleading because the data-snooping process has essentially ensured that something significant will be found. This means that if new data are obtained, it is unlikely that the "significant" results found via the data-snooping process would be replicated.

Data-Snooping Examples

Example 1

An investigator obtains data to investigate the impact of a treatment on the mean of a response variable of interest without a predefined view (alternative hypothesis) of the direction (positive or negative) of the possible effect of the treatment. Data snooping would occur in this situation if, after analyzing the data, the investigator observes that the treatment appears to have a negative effect on the response variable and then uses a one-sided alternative hypothesis corresponding to the treatment having a negative effect. In this situation, a two-sided alternative hypothesis, corresponding to the investigator's a priori ignorance of the effect of the treatment, would be appropriate. Data snooping in this example results in the p value for the hypothesis test being halved, resulting in a greater chance of finding a significant effect of the treatment. To avoid problems of this nature, many journals require that two-sided alternatives be used for hypothesis tests.

Example 2

A data set containing information on a response variable and six explanatory variables is analyzed, without any a priori hypotheses of interest, by fitting each of the 64 multiple linear regression models obtained by means of different combinations of the six explanatory variables, and then only statistically significant associations are reported. The effect of data snooping in this example would be more severe than in Example 1 because the data are being analyzed many more times (more hypothesis tests are performed), meaning that one would expect to see a number of significant associations simply due to chance.
Correcting for Data Snooping

The ideal way to avoid data snooping is for an investigator to verify any significant results found via a data-snooping process by using an independent data set. Significant results not replicated on the independent data set would then be viewed as spurious results that were likely an artifact of the data-snooping process. If an independent data set is obtainable, then the initial data-snooping process may be viewed as an initial exploratory analysis used to inform the investigator of hypotheses of interest. In cases in which an independent data set is not possible or very expensive, the role of an independent data set can be mimicked by randomly dividing the original data into two smaller data sets: one half for an initial exploratory analysis (the training set) and the other half for validation (the validation set). Due to prohibitive cost and/or time, obtaining an independent data set or a large enough data set for dividing into training and validation sets may not be feasible. In such situations, the investigator should describe exactly how the data were analyzed, including the number of hypothesis tests that were performed in finding statistically significant results, and then report results that are adjusted for multiple hypothesis-testing effects. Methods for adjusting for multiple hypothesis testing include the Bonferroni correction, Scheffé's method, Tukey's test, and more recently the false discovery rate. The relatively simple Bonferroni correction works by conducting each individual hypothesis test at level of significance α/g, where g is the number of hypothesis tests carried out. Performing the individual hypothesis tests at level of significance α/g provides a crude means of keeping the overall level of significance at no more than α. Model averaging methods that combine information from every analysis of a data set are another alternative for alleviating the problems of data snooping.
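A minimal sketch of the Bonferroni adjustment described above: each of the g tests is carried out at level α/g, which is equivalent to comparing each p value with α/g. The p values below are made up for illustration.

alpha = 0.05
p_values = [0.003, 0.02, 0.04, 0.30, 0.75]   # hypothetical results of g = 5 tests
g = len(p_values)

# Bonferroni correction: test each hypothesis at significance level alpha / g,
# which keeps the overall (familywise) Type I error rate at or below alpha.
threshold = alpha / g
significant = [p <= threshold for p in p_values]
print(threshold)     # 0.01
print(significant)   # only the first test remains significant after correction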
Data mining, a term used to describe the process of exploratory analysis and extraction of useful information from data, is sometimes confused with data snooping. Data snooping is sometimes the result of the misuse of data-mining methods, such as the framing of specific alternative hypotheses in response to an observation arising out of data mining.

Michael A. Martin and Steven Roberts

See also Bonferroni Procedure; Data Mining; Hypothesis; Multiple Comparison Tests; p Value; Significance Level, Interpretation and Construction; Type I Error

Further Readings

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: W. W. Norton.
Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73, 1237-1282.
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24-27.
White, H. (2000). A reality check for data snooping. Econometrica, 68, 1097-1126.

DEBRIEFING

Debriefing is the process of giving participants further information about a study in which they participated at the conclusion of their participation. Debriefing continues the informational process that began at the participant recruitment or informed consent stage. If the true purpose of the study was revealed to participants at the informed consent stage, debriefing is fairly straightforward. Participants are reminded of the purposes of the study, given further information about expected results, and thanked for their participation. The debriefing session also provides an opportunity for participants to ask any questions they may have about the study. In some research situations, participants might be called on to discuss negative emotions or reveal sensitive information (e.g., studies on relationship violence or eating disorders). In such studies, the researcher may include in the debriefing information about ways in which participants might obtain help in dealing with these issues, such as a referral to a campus mental health center. A debriefing script should be included in research proposals submitted to an institutional review board.

If a study includes deception, debriefing is more complex. In such instances, a researcher has concluded that informing participants of the nature of the study at the stage of obtaining consent would interfere with the collection of valid and

generalizable data. In such instances, the authority may create situations in which partici-
researcher may give participants incomplete or pants are expected to engage in behavior with
misleading information about the nature of the which they are uncomfortable (such as administer-
study at the recruitment and consent stages. Other ing supposed electric shocks to a confederate). In
examples of deception in social science research such a situation, a researcher might address possi-
include deceptive instructions, false feedback, or ble negative feelings by stating that the participant’s
the use of confederates (members of the research behavior was not unusual or extreme (by, for
team who misrepresent their identities as part of example, stating that most other participants have
the study procedure). acted the same way). Another approach is to
In a deception study, the debriefing session is emphasize that the behavior was due to situational
the time when a complete explanation of the study factors rather than personal characteristics. Desen-
is given and the deception is revealed. Participants sitizing may encourage participants to make an
should be informed of the deception that took external (situational) rather than an internal (per-
place and of the true purpose of the research. The sonal) attribution for their behavior. Participants
reasons the researcher believed that deception was may feel angry, foolish, or embarrassed about hav-
necessary for the research should also be explained ing been deceived by the researcher. One desensitiz-
to participants. As in a nondeception study, parti- ing technique applicable to such situations is to
cipants should be thanked for their participation point out that negative feelings are a natural and
and provided with an opportunity to ask questions expected outcome of the study situation.
of the researcher. Participants should also be Joan Seiber states that participation in research
reminded of their right to withdraw from the study and postresearch debriefing should provide partici-
at any time. This reminder may take a number of pants with new insight into the topic of research
forms, ranging from a statement in the debriefing and a feeling of satisfaction in having made a con-
script indicating participants’ ability to withdraw, tribution to society and to scientific understanding.
to a second informed consent form for participants In a deceptive study, Seiber states, participants
to sign after being debriefed. should receive a number of additional benefits
from the debriefing: dehoaxing, desensitizing, an
opportunity to ask questions of the researcher, an
The Debriefing Process
opportunity to end participation in the study, res-
David Holmes has argued that debriefing should toration of confidence in scientific research, and
include processes of dehoaxing (if necessary) and information on the ways in which possible harm
desensitizing. Dehoaxing involves informing parti- has been anticipated and avoided. Seiber also
cipants about any deception that was used in the states that the dehoaxing process should include
study and explaining the researcher’s rationale for a convincing demonstration of the deception (for
the use of deception. Desensitizing involves dis- example, showing participants two identical com-
cussing and attempting to diminish any negative pleted tasks, one with positive feedback and one
feelings (such as stress or anxiety) that may have with negative feedback).
arisen as a result of the research process.
Negative feelings may result from the research
Types of Debriefing
process for a number of reasons. The purpose of
the research may have been to study these feelings, Several types of debriefing are associated with
and thus researchers may have deliberately deception studies. In each type, the researcher
instigated them in participants. For example, describes the deceptive research processes, explains
researchers interested in the effects of mood on test the reasons research is conducted on this topic and
performance might ask participants to read why deception was felt necessary to conduct the
an upsetting passage before completing a test. Neg- research, and thanks the participant for his or her
ative feelings may also arise as a consequence assistance in conducting the research.
of engaging in the behavior that researchers An explicit or outcome debriefing focuses on
were interested in studying. For example, research- revealing the deception included in the study.
ers interested in conformity and compliance to Explicit debriefing would include a statement

about the deceptive processes. Explicit debriefing after deception include the potential for debriefing
might also include a concrete demonstration of the to exacerbate harm by emphasizing the deceptive-
deception, such as demonstrating how feedback ness of researchers or for participants not to
was manipulated or introducing the participant to believe the debriefing, inferring that it is still part
the confederate. of the experimental manipulation.
A process debriefing is typically more involved Other experts have argued that it is possible to
than an explicit debriefing and allows for more conduct deception research ethically but have
opportunities for participants to discuss their feel- expressed concerns regarding possible negative
ings about participation and reach their own con- outcomes of such research. One such concern
clusions regarding the study. A process debriefing regarding deception and debriefing is the persever-
might include a discussion of whether the partici- ance phenomenon, in which participants continue
pant found anything unusual about the research even after debriefing to believe or be affected by
situation. The researcher might then introduce false information presented in a study. The most
information about deceptive elements of the prominent study of the perseverance phenomenon
research study, such as false feedback or the use of was conducted by Lee Ross, Mark Lepper, and
confederates. Some process debriefings attempt to Michael Hubbard, who were interested in adoles-
lead the participant to a realization of the decep- cents’ responses to randomly assigned feedback
tion on his or her own, before it is explicitly regarding their performance on a decision-making
explained by the researcher. task. At the end of the study session, participants
A somewhat less common type of debriefing is participated in a debriefing session in which they
an action debriefing, which includes an explicit learned that the feedback they had received was
debriefing along with a reenactment of the study unrelated to their actual performance. Ross and
procedure or task. colleagues found that participants’ self-views were
affected by the feedback even after the debriefing.
When, as part of the debriefing process, partici-
Ethical Considerations
pants were explicitly told about the perseverance
Ethical considerations with any research project phenomenon, their self-views did not continue to
typically include an examination of the predicted be affected after the debriefing.
costs (e.g., potential harms) and benefits of the
study, with the condition that research should not
Debriefing in Particular Research Contexts
be conducted unless predicted benefits significantly
outweigh potential harms. One concern expressed Most of the preceding discussion of debriefing has
by ethicists is that the individuals who bear the assumed a study of adult participants in a labora-
risks of research participation (study participants) tory setting. Debriefing may also be used in other
are often not the recipients of the study’s benefits. research contexts, such as Internet research, or
Debriefing has the potential to ameliorate costs with special research populations, such as children
(by decreasing discomfort and negative emotional or members of stigmatized groups. In Internet
reactions) and increase benefits (by giving partici- research, debriefing is typically presented in the
pants a fuller understanding of the importance of form of a debriefing statement as the final page of
the research question being examined and thus the study or as an e-mail sent to participants.
increasing the educational value of participation). In research with children, informed consent
Some experts believe that debriefing cannot be prior to participation is obtained from children’s
conducted in such a way as to make deception parents or guardians; child participants give their
research ethical, because deceptive research prac- assent as well. In studies involving deception of
tices eliminate the possibility for truly informed child participants, parents are typically informed
consent. Diana Baumrind has argued that debrief- of the true nature of the research at the informed
ing is insufficient to remediate the potential harm consent stage but are asked not to reveal the
caused by deception and that research involving nature of the research project to their children
intentional deception is unethical and should not prior to participation in the study. After study par-
be conducted. Other arguments against debriefing ticipation, children participate in a debriefing

session with the researcher (and sometimes with T. L. Beauchamp, R. R. Faden, R. J. Wallace Jr., 245).
a parent or guardian as well). In this session, the Baltimore: Johns Hopkins University Press.
researcher explains the nature of and reasons for Fisher, C. B. (2005). Deception research involving
the deception in age-appropriate language. children: Ethical practices and paradoxes.
Ethics 287.
Marion Underwood has advocated for the use
Holmes, D. S. (1976). Debriefing after psychological
of a process debriefing with children. Underwood experiments: I. Effectiveness of postdeception
has also argued that it is important for the decep- dehoaxing. American Psychologist, 31, 868–875.
tion and debriefing to take place within a larger Hurley, J. C., & Underwood, M. K. Children’s
context of positive interactions. For example, chil- understanding of their research rights before and after
dren might engage in an enjoyable play session debriefing: Informed assent, confidentiality, and
with a child confederate after being debriefed stopping participation. Child Development, 73,
about the confederate’s role in an earlier 132–143.
interaction. Mills, J. (1976). A procedure for explaining experiments
involving deception. Personality 13.
Sieber, J. E. (1983). Deception in social research III: The
nature and limits of debriefing. IRB: A Review of
Statements by Professional Organizations Human Subjects Research, 5, 1–4.
The American Psychological Association’s Ethical
Principles of Psychologists and Code of Conduct Websites
states that debriefing should be an opportunity for
participants to receive appropriate information American Psychological Association: http://www.apa.org
Society for Research in Child Development’s Ethical
about a study’s aims and conclusions and should
Standards for Research with Children:
include correction of any participant mispercep-
http://www.srcd.org
tions of which the researchers are aware. The
APA’s ethics code also states that if information
must be withheld for scientific or humanitarian
reasons, researchers should take adequate mea- DECISION RULE
sures to reduce the risk of harm. If researchers
become aware of harm to a participant, they
In the context of statistical hypothesis testing, deci-
should take necessary steps to minimize the harm.
sion rule refers to the rule that specifies how to
The ethics code is available on the APA’s Web site.
choose between two (or more) competing hypothe-
The Society for Research in Child Develop-
ses about the observed data. A decision rule speci-
ment’s Ethical Standards for Research with Chil-
fies the statistical parameter of interest, the test
dren state that the researcher should clarify all
statistic to calculate, and how to use the test statis-
misconceptions that may have arisen over the
tic to choose among the various hypotheses about
course of the study immediately after the data are
the data. More broadly, in the context of statistical
collected. This ethics code is available on the
decision theory, a decision rule can be thought of
Society’s Web site.
as a procedure for making rational choices given
Meagan M. Patterson uncertain information.
The choice of a decision rule depends, among
See also Ethics in the Research Process; Informed other things, on the nature of the data, what one
Consent needs to decide about the data, and at what level
of significance. For instance, decision rules used
for normally distributed (or Gaussian) data are
Further Readings generally not appropriate for non-Gaussian data.
Baumrind, D. (1985). Research using intentional
Similarly, decision rules used for determining the
deception: Ethical issues revisited. American 95% confidence interval of the sample mean will
Psychologist, 40, 165–174. be different from the rules appropriate for binary
Elms, A. (1982). Keeping deception honest: Justifying decisions, such as determining whether the sample
conditions for social scientific research strategems. In mean is greater than a prespecified mean value at

a given significance level. As a practical matter, Conceptual quibbles about this view of proba-
even for a given decision about a given data set, bility aside, this approach is entirely adequate for
there is no unique, universally acceptable decision a vast majority of practical purposes in research.
rule but rather many possible principled rules. But for more complex decisions in which a variety
There are two main statistical approaches to of factors and their attendant uncertainties have to
picking the most appropriate decision rule for be considered, frequentist decision rules are often
a given decision. The classical, or frequentist, too limiting.
approach is the one encountered in most textbooks
on statistics and the one used by most researchers
in their data analyses. This approach is generally Bayesian Decision Rules
quite adequate for most types of data analysis. The
Bayesian approach is still widely considered eso- Suppose, in the aforementioned example, that the
teric, but one that an advanced researcher should effectiveness of the hormone for various breeds of
become familiar with, as this approach is becom- cattle in the sample, and the relative frequencies of
ing increasingly common in advanced data analysis the breeds, is known. How should one use this
and complex decision making. prior distribution of hormone effectiveness to
choose between the two hypotheses? Frequentist
decision rules are not well suited to handle such
Decision Rules in Classical Hypothesis Testing
decisions; Bayesian decision rules are.
Suppose one needs to decide whether a new brand Essentially, Bayesian decision rules use Bayes’s
of bovine growth hormone increases the body law of conditional probability to compute a poste-
weight of cattle beyond the known average value rior distribution based on the observed data and the
of μ kilograms. The observed data consist of body appropriate prior distribution. In the case of the
weight measurements from a sample of cattle trea- above example, this amounts to revising one’s belief
ted with the hormone. The default explanation for about the body weight of the treated cattle based
the data, or the null hypothesis, is that there is no on the observed data and the prior distribution.
effect: the mean weight of the treated sample is no The null hypothesis is rejected if the posterior prob-
greater than the nominal mean μ. The alternative ability is less than the user-defined significance level.
hypothesis is that the mean weight of the treated One of the more obvious advantages of Bayes-
sample is greater than μ. ian decision making, in addition to the many sub-
The decision rule specifies how to decide which tler ones, is that Bayesian decision rules can be
of the two hypotheses to accept, given the data. In readily elaborated to allow any number of addi-
the present case, one may calculate the t statistic, tional considerations underlying a complex deci-
determine the critical value of t at the desired level sion. For instance, if the larger decision at hand in
of significance (such as .05), and accept the alter- the above example is whether to market the hor-
native hypothesis if the t value based on the data mone, one must consider additional factors, such
exceeds the critical value and reject it otherwise. If as the projected profits, possible lawsuits, and
the sample is sufficiently large and Gaussian, one costs of manufacturing and distribution. Complex
might use a similar decision rule with a different decisions of this sort are becoming increasingly
test statistic, the z score. Alternatively, one may common in behavioral, economic, and social
choose between the hypotheses based on the p research. Bayesian decision rules offer a statistically
value rather than the critical value. optimal method for making such decisions.
Such case-specific variations notwithstanding, It should be noted that when only the sample
what all frequentist decision rules have in common data are considered and all other factors, including
is that they arrive at a decision ultimately by com- prior distributions, are left out, Bayesian decision
paring some statistic of the observed data against rules can lead to decisions equivalent to and even
a theoretical standard, such as the sampling distri- identical to the corresponding frequentist rules. This
bution of the statistic, and determine how likely superficial similarity between the two approaches
the observed data are under the various competing notwithstanding, Bayesian decision rules are not
hypotheses. simply a more elaborate version of frequentist rules.
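The frequentist decision rule illustrated with the cattle example above can be written out as a short sketch. The code below assumes SciPy (version 1.6 or later, for the one-sided alternative) is available; the sample weights and nominal mean μ are invented for the illustration.

from scipy import stats

weights = [512, 498, 525, 540, 507, 519, 531, 522, 515, 528]  # hypothetical kg
mu = 510      # known average body weight without the hormone (hypothetical)
alpha = 0.05  # desired level of significance

# One-sample t test of the null hypothesis (mean no greater than mu)
# against the alternative hypothesis (mean greater than mu).
t_stat, p_value = stats.ttest_1samp(weights, popmean=mu, alternative="greater")

# Decision rule 1: compare the t statistic with the critical value of t.
df = len(weights) - 1
t_critical = stats.t.ppf(1 - alpha, df)
print(f"t = {t_stat:.3f}, critical value = {t_critical:.3f}, df = {df}")

# Decision rule 2 (equivalent): compare the p value with alpha.
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: accept the alternative hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: do not reject the null hypothesis")

Both forms of the rule lead to the same decision for a given data set and level of significance.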

The differences between the two approaches are 1947 Nuremberg Code was established. This was
profound and reflect longstanding debates about followed in 1948 by the WMA’s Declaration of
the nature of probability. For the researcher, on the Geneva, a statement of ethical duties for physicians.
other hand, the choice between the two approaches Both documents influenced the development of the
should be less a matter of adherence to any given Declaration of Helsinki, adopted in 1964 by the
orthodoxy and more about the nature of the deci- WMA. The initial Declaration, 11 paragraphs in
sion at hand. length, focused on clinical research trials. Notably,
it relaxed conditions for consent for participation,
Jay Hegdé changing the Nuremberg requirement that consent
is ‘‘absolutely essential’’ to instead urge consent ‘‘if
See also Criterion Problem; Critical Difference; Error at all possible’’ but to allow for proxy consent, such
Rates; Expected Value; Inference: Inductive and as from a legal guardian, in some instances.
Deductive; Mean Comparisons; Parametric Statistics The Declaration has been revised six times.
The first revision, conducted in 1975, expanded
Further Readings the Declaration considerably, nearly doubling its
length, increasing its depth, updating its termi-
Bolstad, W. M. (2007). Introduction to Bayesian nology, and adding concepts such as oversight by
statistics. Hoboken, NJ: Wiley.
an independent committee. The second (1983)
Press, S. J. (2005). Applied multivariate analysis: Using
Bayesian and frequentist methods of inference. New
and third (1989) revisions were comparatively
York: Dover. minor, primarily involving clarifications and
Resnik, M. D. (1987). Choices: An introduction to updates in terminology. The fourth (1996) revi-
decision theory. Minneapolis: University of Minnesota sion also was minor in scope but notably added
Press. a phrase that effectively precluded the use of
inert placebos when a particular standard of care
exists.
The fifth (2000) revision was extensive and
DECLARATION OF HELSINKI controversial. In the years leading up to the revi-
sion, concerns were raised about the apparent
The Declaration of Helsinki is a formal state- use of relaxed ethical standards for clinical trials
ment of ethical principles published by the in developing countries, including the use of pla-
World Medical Association (WMA) to guide the cebos in HIV trials conducted in sub-Saharan
protection of human participants in medical Africa. Debate ensued about revisions to the
research. The Declaration is not a legally binding Declaration, with some arguing for stronger lan-
document but has served as a foundation for guage and commentary addressing clinical trials
national and regional laws governing medical and others proposing to limit the document to
research across the world. Although not without basic guiding principles. Although consensus
its controversies, the Declaration has served as was not reached, the WMA approved a revision
the standard in medical research ethics since its that restructured the document and expanded its
establishment in 1964. scope. Among the more controversial aspects of
the revision was the implication that standards
of medical care in developed countries should
History and Current Status
apply to any research with humans, including
Before World War II, no formal international state- that conducted in developing countries. The
ment of ethical principles to guide research with opposing view held that when risk of harm is
human participants existed, leaving researchers to low and there are no local standards of care (as
rely on organizational, regional, or national policies is often the case in developing countries), pla-
or their own personal ethical guidelines. After atro- cebo-controlled trials are ethically acceptable,
cities were found to have been committed by Nazi especially given their potential benefits for future
medical researchers using involuntary, unprotected patients. Debate has continued on these issues,
participants drawn from concentration camps, the and cross-national divisions have emerged. The
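The Bayesian decision rule described in the Decision Rule entry above can be sketched in a similarly compact form. The example below is a deliberate simplification with only two point hypotheses about the treated mean, an assumed known standard deviation, and invented data; it illustrates the use of Bayes's law rather than any particular published procedure.

import math

weights = [512, 498, 525, 540, 507, 519, 531, 522, 515, 528]  # hypothetical kg
sigma = 15.0                 # population standard deviation, assumed known
n = len(weights)
xbar = sum(weights) / n

# Two point hypotheses about the treated mean, with prior probabilities.
mu_h0, mu_h1 = 510.0, 525.0  # H0: no gain beyond 510 kg; H1: gain to 525 kg
prior_h0, prior_h1 = 0.5, 0.5

def likelihood(mu):
    # Density of the observed sample mean under a normal model with known sigma.
    se = sigma / math.sqrt(n)
    z = (xbar - mu) / se
    return math.exp(-0.5 * z * z) / (se * math.sqrt(2.0 * math.pi))

# Bayes's law: the posterior probability is proportional to prior times likelihood.
post_h0 = prior_h0 * likelihood(mu_h0)
post_h1 = prior_h1 * likelihood(mu_h1)
total = post_h0 + post_h1
post_h0, post_h1 = post_h0 / total, post_h1 / total

threshold = 0.05             # user-defined level for rejecting the null hypothesis
print(f"P(H0 | data) = {post_h0:.4f}, P(H1 | data) = {post_h1:.4f}")
print("Reject H0" if post_h0 < threshold else "Retain H0")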

U.S. Food and Drug Administration rejected the adds that research also must encourage the protec-
fifth revision because of its restrictions on the tion of the health and rights of people. The intro-
use of placebo conditions and has eliminated all duction then specifically mentions vulnerable
references to the Declaration, replacing it with populations and calls for extra consideration when
the Good Clinical Practice guidelines, an alterna- these populations are participating in research.
tive internationally sanctioned ethics guide. The The final statement in the Declaration’s Introduc-
National Institutes of Health training in research tion asserts that medical researchers are bound by
with human participants no longer refers to the the legal and ethical guidelines of their own
Declaration, and the European Commission nations but that adherence to these laws does not
refers only to the fourth revision. liberate researchers from the edicts of the Declara-
The sixth revision of the Declaration, approved tion of Helsinki.
by the WMA in 2008, introduced relatively minor
clarifications. The revision reinforces the Declara-
tion’s long-held emphasis on prioritizing the rights
Principles for All Medical Research
of individual research participants above all other
interests. Public debate following the revision was The Principles for All Medical Research include
not nearly as contentious as had been the case with considerations that must be made by researchers
previous revisions. who work with human participants. The first
assertion in the principles states that a physician’s
duty is to ‘‘protect the life, health, dignity, integ-
Synopsis of the Sixth Revision
rity, right to self-determination, privacy, and confi-
The Declaration of Helsinki’s sixth revision com- dentiality’’ of research participants. Consideration
prises several sections: the Introduction, Principles of the environment and of the welfare of research
for All Medical Research, and Additional Princi- animals is also mentioned. Also, the basic princi-
ples for Medical Research Combined With Medi- ples declare that any research conducted on human
cal Care. It is 35 paragraphs long. participants must be in accordance with generally
held scientific principles and be based on as thor-
ough a knowledge of the participant as is possible.
Introduction
Paragraph 14 of the Declaration states that any
The introduction states that the Declaration is study using human participants must be thor-
intended for physicians and others who conduct oughly outlined in a detailed protocol, and it pro-
medical research on humans (including human vides specific guidelines about what should be
materials or identifiable information). It asserts included in the protocol. The protocol should
that the Declaration should be considered as include numerous types of information, including
a whole and that its paragraphs should not be con- funding sources, potential conflicts of interest,
sidered in isolation but with reference to all perti- plans for providing study participants access to
nent paragraphs. It then outlines general ethical interventions that the study identifies as beneficial,
principles that guide research on human partici- and more.
pants. These include a reminder of the words from Paragraph 15 states that the above-mentioned
the WMA’s Declaration of Geneva that the physi- protocol must be reviewed by an independent
cian is bound to: ‘‘The health of my patient will be research ethics committee before the study begins.
my first consideration.’’ This idea is expanded with This committee has the right and responsibility to
a statement asserting that when research is being request changes, provide comments and guidance,
conducted, the welfare of the participants takes and monitor ongoing trials. The committee mem-
precedence over the more general welfare of sci- bers also have the right and responsibility to con-
ence, research, and the general population. sider all information provided in the protocol and
The introduction also describes the goals of to request additional information as deemed
medical research as improving the prevention, appropriate. This principle of the Declaration is
diagnosis, and treatment of disease and increasing what has led to the development of institutional
the understanding of the etiology of disease. It review boards in the United States.

The principles also state that research must be the ethics of accurate publication of research
conducted by qualified professionals and that the results. Researchers are responsible for accurate
responsibility for protecting research subjects and complete reporting of results and for making
always falls on the professionals conducting the their results publicly available, even if the results
study and not on the study participants, even are negative or inconclusive. The publication
though they have consented to participate. should also include funding sources, institutional
The principles also require an assessment of pre- affiliation, and any conflicts of interest. A final
dictable risks and benefits for both research parti- assertion states that research reports that do not
cipants and the scientific community. Risk meet these standards should not be accepted for
management must be carefully considered, and the publication.
objective of the study must be of enough impor-
tance that the potential risks are outweighed.
Another statement in the basic principles, para- Additional Principles for Medical
graph 17, states that research with disadvantaged Research Combined With Medical Care
or vulnerable populations is justified only if it
relates to the needs and priorities of the vulnerable This section of the Declaration, which was new
community and can be reasonably expected to to the fifth revision in 2000, has created the most
benefit the population in which the research is con- controversy. It begins with a statement that extra
ducted. This statement was included, in part, as care must be taken to safeguard the health and
a response to testing of new prescription drugs in rights of patients who are both receiving medical
Africa, where the availability of cutting-edge pre- care and participating in research. Paragraph 32
scription drugs is highly unlikely. then states that when a new treatment method is
The remainder of the principles section dis- being tested, it should be compared with the gener-
cusses issues of privacy, confidentiality, and ally accepted best standard of care, with two
informed consent. These discussions stipulate exceptions. First, placebo treatment can be used in
that research should be conducted only with par- studies where no scientifically proven intervention
ticipants who are capable of providing informed exists. This statement was adopted as a response
consent, unless it is absolutely necessary to do to drug testing that was being conducted in which
research with participants who cannot give con- the control group was given placebos when a scien-
sent. If this is the case, the specific reasons for tifically proven drug was available.
this necessity must be outlined in the protocol, The second exception states that placebos or no
informed consent must be provided by a legal treatment can be used when ‘‘compelling and sci-
guardian, and the research participant’s assent entifically sound methodological reasons’’ exist for
must be obtained if possible. Participants must using a placebo to determine the efficacy and/or
be informed of their right to refuse to participate safety of a treatment, and if the recipients of the
in the study, and special care must be taken placebo or no treatment will not suffer irreversible
when potential participants are under the care of harm. The Declaration then states that ‘‘Extreme
a physician involved in the study in order to care must be taken to avoid abuse of this option.’’
avoid dynamics of dependence on the physician This exception was most likely added as a response
or duress to affect decision-making processes. to the intense criticism of the fifth revision.
Paragraphs 27 through 29 outline guidelines for The adoption of the principle described in para-
research with participants who are deemed graph 32 aimed to prevent research participants’
incompetent to give consent and state that these illnesses from progressing or being transmitted to
subjects can be included in research only if the others because of a lack of drug treatment when
subject can be expected to benefit or if the fol- a scientifically proven treatment existed. Critics of
lowing conditions apply: A population that the this assertion stated that placebo treatment was
participant represents is likely to benefit, the consistent with the standard of care in the regions
research cannot be performed on competent per- where the drug testing was taking place and that
sons, and potential risk and burden are minimal. administration of placebos to control groups is
The final paragraph of the principles addresses often necessary to determine the efficacy of

a treatment. Supporters of the Declaration main- Further Readings


tain that the duty of the medical professional is to
Carlson, R. V., van Ginneken, N. H., Pettigrew, L. M.,
provide the best care possible to patients and that Davies, A., Boyd, K. M., & Webb, D. J. (2007). The
knowingly administering placebos in place of three official language versions of the Declaration of
proven treatments is ethically dubious. Helsinki: What’s lost in translation? Journal of
Paragraph 33 of the Declaration establishes Medical Ethics, 33, 545–548.
that study participants ‘‘are entitled to be Human, D., & Fluss, S. S. (2001). The World Medical
informed about the outcome of the study and to Association’s Declaration of Helsinki: Historical and
share any benefits that result from it’’ and gives contemporary perspectives. Retrieved February 1,
the example of participants’ being provided 2010, from http://www.wma.net/en/20activities/
10ethics/10helsinki/
access to interventions that have been identified
draft historical contemporary perspectives.pdf
as beneficial, or to other appropriate care. This Schmidt, U., & Frewer, A. (Eds.). (2007). History and
assertion is somewhat less strong than the fifth theory of human experimentation: The Declaration of
edition’s language, which stated that participants Helsinki and modern medical ethics. Stuttgart,
should be ‘‘assured’’ of the best known care iden- Germany: Franz Steiner.
tified during the course of the study when a study Williams, J. R. (2008). The Declaration of Helsinki and
is concluded. public health. Bulletin of the World Health
The final two paragraphs of the Declaration, Organization, 86, 650–651.
which are part of the ‘‘Additional Principles,’’ pro- Williams, J. R. (2008). Revising the Declaration of
vide that a patient’s refusal to participate in a study Helsinki. World Medical Journal, 54, 120–125.
Wolinsky, H. (2006). The battle of Helsinki. EMBO
should never affect the therapeutic relationship
Reports, 7, 670–672.
and that new, unproven treatments can be used World Medical Association. (2008). The Declaration of
when there is reason to believe they will be benefi- Helsinki. Retrieved February 1, 2010, from http://
cial and when no proven treatment exists. www.wma.net/en/30publications/10policies/b3/
index.html

Future
The Declaration of Helsinki remains the world’s
best-known statement of ethical principles to guide DEGREES OF FREEDOM
medical research with human participants. Its
influence is far-reaching in that it has been codified
In statistics, the degrees of freedom is a measure of
into the laws that govern medical research in coun-
the level of precision required to estimate a param-
tries across the world and has served as a basis for
eter (i.e., a quantity representing some aspect of
the development of other international guidelines
the population). It expresses the number of inde-
governing medical research with human partici-
pendent factors on which the parameter estimation
pants. As the Declaration has expanded and
is based and is often a function of sample size. In
become more prescriptive, it has become more
general, the number of degrees of freedom
controversial, and concerns have been raised
increases with increasing sample size and with
regarding the future of the Declaration and its
decreasing number of estimated parameters. The
authority. Future revisions to the Declaration may
quantity is commonly abbreviated df or denoted
reconsider the utility of prescriptive guidelines
by the lowercase Greek letter nu, ν.
rather than limiting its focus to basic principles.
For a set of observations, the degrees of free-
Another challenge will be to harmonize the Decla-
dom is the minimum number of independent
ration with other ethical research guidelines,
values required to resolve the entire data set. It
because there often is apparent conflict between
is equal to the number of independent observa-
aspects of current codes and directives documents.
tions being used to determine the estimate (n)
Bryan J. Dik and Timothy J. Doenges minus the number of parameters being estimated
in the approximation of the parameter itself,
See also Ethics in the Research Process as determined by the statistical procedure under

consideration. In other words, a mathematical the sum of the deviations about the mean is equal
restraint is used to compensate for estimating one to zero, at least four deviations are needed to
parameter from other estimated parameters. For determine the fifth; hence, one deviation is fixed
a single sample, one parameter is estimated. Often and cannot vary. The number of values that are
the population mean (μ), a frequently unknown free to vary is the degrees of freedom. In this
value, is based on the sample mean (x̄), thereby (n) minus one estimated parameter (i.e., using the
resulting in n − 1 degrees of freedom for estimat- sample mean to estimate the population mean).
ing population variability. For two samples, two Generally stated, the degrees of freedom for a sin-
parameters are estimated from two independent gle sample are equal to n − 1 given that if n − 1
samples (n1 and n2), thus producing n1 + n2 − 2 observations and the sample mean are known, the
degrees of freedom. In simple linear regression, the gle sample are equal to n  1 given that if n  1
relationship between two variables, x and y, is observations and the sample mean are known, the
described by the equation y = bx + a, where b is describe assorted data distributions in comparison
the slope of the line and a is the y-intercept (i.e., Degrees of freedom are also often used to
where the line crosses the y-axis). In estimating describe assorted data distributions in comparison
a and b to determine the relationship between the with a normal distribution. Used as the basis for
independent variable x and dependent variable y, 2 statistical inference and sampling theory, the nor-
degrees of freedom are then lost. For multiple mal distribution describes a data set characterized
sample groups (n1 + ... + nk), the number of tribution, applied usually to test differences among
parameters estimated increases by k, and is symmetric about the mean. The chi-square dis-
subsequently, the degrees of freedom is equal to tribution, applied usually to test differences among
n1 + ... + nk − k. The denominator in the analysis dom. The larger the degrees of freedom, the more
of variance (ANOVA) F test statistic, for example, defined by a single parameter, the degrees of free-
accounts for estimating multiple population means dom. The larger the degrees of freedom, the more
for each group under comparison. the chi-square distribution approximates a normal
The concept of degrees of freedom is fundamen- distribution. Also based on the degrees of freedom
tal to understanding the estimation of population parameter, the Student’s t distribution is similar to
parameters (e.g., mean) based on information the normal distribution, but with more probability
obtained from a sample. The amount of informa- allocated to the tails of the curve and less to the
tion used to make a population estimate can vary peak. The largest difference between the t distribu-
considerably as a function of sample size. For tion and the normal occurs for degrees of freedom
instance, the standard deviation (a measure of var- less than about 30. For tests that compare the vari-
iability) of a population estimated on a sample size ance of two or more populations (e.g., ANOVA),
of 100 is based on 10 times more information than the positively skewed F distribution is defined by
is a sample size of 10. The use of large amounts of the number of degrees of freedom for the various
independent information (i.e., a large sample size) samples under comparison.
to make an estimate of the population usually Additionally, George Ferguson and Yushio
means that the likelihood that the sample estimates Takane have offered a geometric interpretation of
are truly representative of the entire population is degrees of freedom whereby restrictions placed on
greater. This is the meaning behind the number of the statistical calculations are related to a point–
degrees of freedom. The larger the degrees of free- space configuration. Each point within a space of
dom, the greater the confidence the researcher can d dimensions has a freedom of movement or vari-
have that the statistics gained from the sample ability within those d dimensions that is equal to
accurately describe the population. d; hence, d is the number of degrees of freedom.
To demonstrate this concept, consider a sample For instance, a data point on a single dimensional
data set of the following observations (n = 5): 1, of freedom) whereas a data point in three-dimen-
2, 3, 4, and 5. The sample mean (the sum of the of freedom) whereas a data point in three-dimen-
observations divided by the number of observa- sional space has three.
tions) equals 3, and the deviations about the mean
are −2, −1, 0, +1, and +2, respectively. Since Jill S. M. Coleman
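The five-observation example above, and the way degrees of freedom index common test distributions, can be checked with a brief sketch. The first part recomputes the entry's own numbers; the final lines are an illustrative addition that assumes SciPy is available.

# The entry's example: five observations with mean 3.
data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]
print("Deviations about the mean:", deviations)   # [-2.0, -1.0, 0.0, 1.0, 2.0]
print("Sum of deviations:", sum(deviations))      # always 0

# Because the deviations must sum to zero, the fifth deviation is fixed
# once the other four are known, so only n - 1 of them are free to vary.
fifth = -sum(deviations[:-1])
print("Fifth deviation recovered from the other four:", fifth)
print("Degrees of freedom for a single sample:", n - 1)

# Degrees of freedom for the other designs mentioned in the entry.
n1, n2 = 12, 15
print("Two independent samples:", n1 + n2 - 2)
group_sizes, k = [10, 10, 10], 3
print("ANOVA error term with k groups:", sum(group_sizes) - k)

# The t distribution approaches the normal as degrees of freedom grow (requires SciPy).
from scipy import stats
for df in (5, 30, 1000):
    print(f"df = {df}: two-tailed .05 critical value of t = {stats.t.ppf(0.975, df):.3f}")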

See also Analysis of Variance (ANOVA); Chi-Square Test; exclusive tool of investigation in a research or an
Distribution; F Test; Normal Distribution; Parameters; evaluation project is not uncommon.
Population; Sample Size; Student’s t Test; Variance This entry examines the Delphi process, includ-
ing subject selection and analysis of data. It also
discusses the advantages and disadvantages of the
Further Readings Delphi technique, along with the use of electronic
technologies in facilitating implementation.
Ferguson, G. A., & Takane, Y. (1989). Statistical analysis
in psychology and education (6th ed.). New York:
McGraw-Hill. The Delphi Process
Good, I. J. (1973). What are degrees of freedom?
American Statistician, 27, 227–228. The Delphi technique is characterized by multiple
Lomax, R. G. (2001). An introduction to statistical iterations, or ‘‘rounds,’’ of inquiry. The iterations
concepts for education and the behavioral sciences. mean a series of feedback processes. Due to the
Mahwah, NJ: Lawrence Erlbaum. iterative characteristic of the Delphi technique,
instrument development, data collection, and
questionnaire administration are interconnected
between rounds. As such, following the more or
less linear steps of the Delphi process is important
DELPHI TECHNIQUE to success with this technique.

The Delphi technique is a group communication


Round 1
process as well as a method of achieving a consen-
sus of opinion associated with a specific topic. Pred- In Delphi, one of two approaches can be taken
icated on the rationale that more heads are better in the initial round. Traditionally, the Delphi pro-
than one and that inputs generated by experts cess begins with an open-ended questionnaire. The
based on their logical reasoning are superior to sim- open-ended questionnaire serves as the cornerstone
ply guessing, the technique engages a group of for soliciting information from invited partici-
identified experts in detailed examinations and dis- pants. After receiving responses from participants,
cussions on a particular issue for the purpose of investigators convert the collected qualitative data
policy investigation, goal setting, and forecasting into a structured instrument, which becomes the
future situations and outcomes. Common surveys second-round questionnaire. A newer approach is
try to identify what is. The Delphi technique based on an extensive review of the literature. To
attempts to assess what could or should be. initiate the Delphi process, investigators directly
The Delphi technique was named after the ora- administer a structured questionnaire based on the
cle at Delphi, who, according to Greek myth, deliv- literature and use it as a platform for questionnaire
ered prophecies. As the name implies, the Delphi development in subsequent iterations.
technique was originally developed to forecast
future events and possible outcomes based on
Round 2
inputs and circumstances. The technique was prin-
cipally developed by Norman Dalkey and Olaf Hel- Next, Delphi participants receive a second ques-
mer at the RAND Corporation in the early 1950s. tionnaire and are asked to review the data devel-
The earliest use of the Delphi process was primarily oped from the responses of all invited participants
military. Delphi started to gain popularity as a futur- in the first round and subsequently summarized by
ing tool in the mid-1960s and came to be widely investigators. Investigators also provide partici-
applied and examined by researchers and practi- pants with their earlier responses to compare with
tioners in fields such as curriculum development, the new data that has been summarized and edi-
resource utilization, and policy determination. In ted. Participants are then asked to rate or rank
the mid-1970s, however, the popularity of the Del- order the new statements and are encouraged to
phi technique began to decline. Currently, using the express any skepticism, questions, and justifica-
Delphi technique as an integral part or as the tions regarding the statements. This allows a full

and fair disclosure of what each participant thinks is considered the most important step in Delphi.
or believes is important concerning the issue being The quality of results directly links to the quality
investigated, as well as providing participants an of the participants involved.
opportunity to share their expertise, which is Delphi participants should be highly trained
a principal reason for their selection to participate and possess expertise associated with the target
in the study. issues. Investigators must rigorously consider and
examine the qualifications of Delphi subjects. In
general, possible Delphi subjects are likely to
Round 3
be positional leaders, authors discovered from
In Round 3, Delphi participants receive a third a review of professional publications concerning
questionnaire that consists of the statements and the topic, and people who have firsthand relation-
ratings summarized by the investigators after the ships with the target issue. The latter group often
preceding round. Participants are again asked to consists of individuals whose opinions are sought
revise their judgments and to express their ratio- because their direct experience makes them a reli-
nale for their priorities. This round provides able source of information.
participants an opportunity to make further clarifi- In Delphi, the number of participants is gener-
cations and review previous judgments and inputs ally between 15 and 20. However, what constitu-
from the prior round. Researchers have indicated tes an ideal number of participants in a Delphi
that three rounds are often sufficient to gather study has never achieved a consensus in the litera-
needed information and that further iterations ture. Andre Delbecq, Andrew Van de Ven, and
would merely generate slight differences. David Gustafson suggest that 10 to 15 participants
should be adequate if their backgrounds are simi-
lar. In contrast, if a wide variety of people or
Round 4
groups or a wide divergence of opinions on the
However, when necessary, in the fourth and topic are deemed necessary, more participants need
often final round, participants are again asked to to be involved. The number of participants in Del-
review the summary statements from the preceding phi is variable, but if the number of participants is
round and to provide inputs and justifications. It is too small, they may be unable to reliably provide
imperative to note that the number of Delphi itera- a representative pooling of judgments concerning
tions relies largely on the degree of consensus the target issue. Conversely, if the number of parti-
sought by the investigators and thereby can vary cipants is too large, the shortcomings inherent in
from three to five. In other words, a general con- the Delphi technique (difficulty dedicating large
sensus about a noncritical topic may only require blocks of time, low response rates) may take
three iterations, whereas a serious issue of critical effect.
importance with a need for a high level of agree-
ment among the participants may require addi-
Analysis of Data
tional iterations. Regardless of the number of
iterations, it must be remembered that the purpose In Delphi, decision rules must be established to
of the Delphi is to sort through the ideas, impres- assemble, analyze, and summarize the judgments
sions, opinions, and expertise of the participants and insights offered by the participants. Consensus
to arrive at the core or salient information that on a topic can be determined if the returned
best describes, informs, or predicts the topic of responses on that specific topic reach a prescribed
concern. or a priori range. In situations in which rating or
rank ordering is used to codify and classify data,
the definition of consensus has been at the discre-
Subject Selection
tion of the investigator(s). One example of consen-
The proper use of the Delphi technique and the sus from the literature is having 80% of subjects’
subsequent dependability of the generated data votes fall within two categories on a 7-point scale.
rely in large part on eliciting expert opinions. The Delphi technique can employ and collect
Therefore, the selection of appropriate participants both qualitative and quantitative information.
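One way to apply the consensus rules discussed under Analysis of Data above is sketched below. The ratings are invented; the 80%-within-two-adjacent-categories rule and the use of the median and interquartile range follow the examples given in the entry.

from statistics import median

# Hypothetical panel ratings of a single statement on a 7-point scale.
ratings = [5, 6, 6, 5, 7, 6, 5, 6, 6, 5, 4, 6, 5, 6, 6]
ratings_sorted = sorted(ratings)
n = len(ratings_sorted)

# Measures of central tendency and dispersion commonly reported in Delphi studies.
med = median(ratings_sorted)
q1 = median(ratings_sorted[: n // 2])
q3 = median(ratings_sorted[(n + 1) // 2 :])
print(f"Median = {med}, interquartile range = {q3 - q1}")

# One published consensus rule: at least 80% of votes within two adjacent categories.
best_share = max(
    sum(1 for r in ratings if r in (c, c + 1)) / n
    for c in range(1, 7)
)
print(f"Largest share falling in two adjacent categories: {best_share:.0%}")
print("Consensus reached" if best_share >= 0.80 else "No consensus yet")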

Investigators must analyze qualitative data if, as Second, low response rates can jeopardize
with many conventional Delphi studies, open- robust feedback. Delphi investigators need both
ended questions are used to solicit participants’ a high response rate in the first iteration and
opinions in the first round. It is recommended a desirable response rate in the following rounds.
that a team of researchers and/or experts with Investigators need to play an active role in helping
knowledge of both the target issues and instrument to motivate participants, thus ensuring as high
development analyze the written comments. a response rate as possible.
Statistical analysis is performed in the further itera- Third, the process of editing and summarizing
tions to identify statements that achieve the desired participants’ feedback allows investigators to
level of consensus. Measures of central tendency impose their own views, which may impact parti-
(means, mode, and median) and level of dispersion cipants’ responses in later rounds. Therefore, Del-
(standard deviation and interquartile range) are phi investigators must exercise caution and
the major statistics used to report findings in the implement appropriate safeguards to prevent the
Delphi technique. The specific statistics used introduction of bias.
depend on the definition of consensus set by the Fourth, an assumption regarding Delphi partici-
investigators. pants is that their knowledge, expertise, and expe-
rience are equivalent. This assumption can hardly
be justified. It is likely that the knowledge bases of
Advantages of Using the Delphi Technique Delphi participants are unevenly distributed.
Although some panelists may have much more in-
Several components of the Delphi technique make
depth knowledge of a specific, narrowly defined
it suitable for evaluation and research problems.
topic, other panelists may be more knowledgeable
First, the technique allows investigators to gather
about a wide range of topics. A consequence of
subjective judgments from experts on problems or
this disparity may be that participants who do not
issues for which no previously researched or docu-
possess in-depth information may be unable to
mented information is available. Second, the multi-
interpret or evaluate the most important state-
ple iterations allow participants time to reflect and
ments identified by Delphi participants who have
an opportunity to modify their responses in subse-
in-depth knowledge. The outcome of such a Delphi
quent iterations. Third, Delphi encourages innova-
study could be a series of general statements rather
tive thinking, particularly when a study attempts
than an in-depth exposition of the topic.
to forecast future possibilities. Last, participant
anonymity minimizes the disadvantages often asso-
ciated with group processes (e.g., bandwagon Computer-Assisted Delphi Process
effect) and frees subjects from pressure to con-
The prevalence and application of electronic tech-
form. As a group communication process, the
nologies can facilitate the implementation of the
technique can serve as a means of gaining insight-
Delphi process. The advantages of computer-
ful inputs from experts without the requirement of
assisted Delphi include participant anonymity,
face-to-face interactions. Additionally, confidenti-
reduced time required for questionnaire and feed-
ality is enhanced by the geographic dispersion of
back delivery, readability of participant responses,
the participants, as well as the use of electronic
and the easy accessibility provided by Internet
devices such as e-mail to solicit and exchange
connections.
information.
If an e-mail version of the questionnaires is to
be used, investigators must ensure e-mail addresses
are correct, contact invited participants before-
Limitations of the Delphi Technique
hand, ask their permission to send materials via e-
Several limitations are associated with Delphi. mail, and inform the recipients of the nature of
First, a Delphi study can be time-consuming. the research so that they will not delete future e-
Investigators need to ensure that participants mail contacts. With regard to the purchase of a sur-
respond in a timely fashion because each round vey service, the degree of flexibility in question-
rests on the results of the preceding round. naire templates and software and service costs

may be the primary considerations. Also, Delphi picture (graphy). Examples of demographic char-
participants need timely instructions for accessing acteristics include age, race, gender, ethnicity, reli-
the designated link and any other pertinent gion, income, education, home ownership, sexual
information. orientation, marital status, family size, health and
disability status, and psychiatric diagnosis.
Chia-Chien Hsu and Brian A. Sandford

See also Qualitative Research; Quantitative Research; Demographics as Variables in Research


Survey
Demographic information provides data regarding
research participants and is necessary for the deter-
Further Readings mination of whether the individuals in a particular
study are a representative sample of the target popu-
Adler, M., & Ziglio, E. (1996). Gazing into the oracle: lation for generalization purposes. Usually demo-
The Delphi method and its application to social policy
graphics or research participant characteristics are
and public health. London: Jessica Kingsley.
Altschuld, J. W., & Thomas, P. M. (1991). reported in the methods section of the research
Considerations in the application of a modified scree report and serve as independent variables in the
test for Delphi survey data. Evaluation Review, 15(2), research design. Demographic variables are indepen-
179–188. dent variables by definition because they cannot be
Dalkey, N. C., Rourke, D. L., Lewis, R., & Snyder, D. manipulated. In research, demographic variables
(1972). Studies in the quality of life: Delphi and may be either categorical (e.g., gender, race, marital
decision-making. Lexington, MA: Lexington Books. status, psychiatric diagnosis) or continuous (e.g.,
Delbecq, A. L., Van de Ven, A. H., & Gustafson, D. H. age, years of education, income, family size). Demo-
(1975). Group technique for program planning: A
graphic information describes the study sample, and
guide to nominal group and Delphi processes.
demographic variables also can be explored for their
Glenview, IL: Scott, Foresman.
Hasson, F., Keeney, S., & McKenna, H. (2000). Research moderating effect on dependent variables.
guidelines for the Delphi survey technique. Journal of
Advanced Nursing, 32(4), 1008–1015. The Nature of Demographic Variables
Hill, K. Q., & Fowles, J. (1975). The methodological
worth of the Delphi forecasting technique. Technological Some demographic variables are necessarily cate-
Forecasting & Social Change, 7, 179–19 2. gorical, such as gender, whereas other demo-
Hsu, C. C., & Sandford, B. A. (2007, August). The graphic variables (e.g., education, income) can be
Delphi technique: Making sense of consensus. collected to yield categorical or continuous vari-
Practical Assessment, Research, & Evaluation, 12(10).
ables. For example, to have education as a continu-
Retrieved September 15, 2007, from http://
pareonline.net/getvn.asp?v ¼ 12&n ¼ 10
ous variable, one would ask participants to report
Linstone, H. A., & Turoff, M. (1975). The Delphi number of years of education. But to have educa-
method: Techniques and applications. Reading, MA: tion as a categorical variable, one would ask parti-
Addison-Wesley. cipants to select a category of education (e.g., less
Yousuf, M. I. (2007, May). Using experts’ opinions than high school, high school, some college,
through Delphi technique. Practical Assessment, college degree, graduate degree). Note that
Research, & Evaluation, 12(4). Retrieved June 28, a researcher could post hoc create a categorical
2007, from http://pareonline.net/ variable for education if the data were initially
getvn.asp?v ¼ 12&n ¼ 4 gathered to yield a continuous variable.

Defining Demographic Variables


DEMOGRAPHICS Researchers should clearly and concisely define the
demographic variables employed in their study.
The term demographics refers to particular charac- When possible, variables should be defined consis-
teristics of a population. The word is derived tent with commonly used definitions or taxo-
from the Greek words for people (demos) and nomies (e.g., U.S. Census Bureau categories of
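The education example given under The Nature of Demographic Variables above can be made concrete with a brief sketch. The category cutoffs below (12 years for a high school diploma, 16 for a college degree) are illustrative assumptions rather than part of the entry.

def education_category(years: int) -> str:
    # Collapse years of schooling into the categories used in the entry's example.
    if years < 12:
        return "less than high school"
    if years == 12:
        return "high school"
    if years < 16:
        return "some college"
    if years == 16:
        return "college degree"
    return "graduate degree"

# Education collected as a continuous variable (years of education reported)...
years_reported = [10, 12, 14, 16, 19, 12, 13]

# ...can be recoded post hoc into a categorical variable.
for y in years_reported:
    print(y, "->", education_category(y))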

ethnicity). It is generally agreed and advisable that See also Dependent Variable; Independent Variable
demographic information should be collected on
the basis of participant report and not as an obser- Further Readings
vation of the researcher. In the case of race, for
example, it is not uncommon for someone whom Goldberg, W. A., Prause, J., Lucas-Thompson, R., &
a researcher may classify as Black to self-identify Himsel, A. (2008). Maternal employment and
children’s achievement in context: A meta-analysis of
as White or biracial.
four decades of research. Psychological Bulletin, 134,
77–108.
Selection of Demographic Hart, D., Atkins, R., & Matsuba, M. K. (2008). The
association of neighborhood poverty with personality
Information to Be Collected change in childhood. Journal of Personality & Social
Researchers should collect only the demographic Psychology, 94, 1048–1061.
information that is necessary for the specific pur-
poses of the research. To do so, in the planning
stage researchers will need to identify demographic
information that is vital in the description of parti-
DEPENDENT VARIABLE
cipants as well as in data analysis, and also infor-
mation that will enhance interpretation of the A dependent variable, also called an outcome
results. For example, in a study of maternal variable, is the result of the action of one or
employment and children’s achievement, Wendy more independent variables. It can also be
Goldberg and colleagues found that the demo- defined as any outcome variable associated with
graphic variables of children’s age and family some measure, such as a survey. Before provid-
structure were significant moderators of the ing an example, the relationship between the
results. Thus, the inclusion of particular demo- two (in an experimental setting) might be
graphic information can be critical for an accurate expressed as follows:
understanding of the data.
DV = f(IV1 + IV2 + IV3 + ... + IVk),

Confidentiality where DV = the value of the dependent variable,

Respondents should be informed that demographic f = function of, and IVk = the value of one or
information will be held in strictest confidence and more independent variables.
reported only as aggregated characteristics, not as In other words, the value of a dependent vari-
individual data, and that the information will be able is a function of changes in one or more
used for no other purpose. If necessary, researchers independent variables. The following abstract
may need to debrief participants to explain the provides an example of a dependent variable and
purpose of requesting particular demographic its interaction with an independent variable (eth-
information. nic origin):

Ethnic origin is one factor that may influence the


Location of Demographic Items rate or sequence of infant motor development,
Often demographic information is gathered at the interpretation of screening test results, and deci-
end of data collection, particularly when demo- sions regarding early intervention. The primary
graphic questions could bias an individual’s purpose of this study is to compare motor devel-
response. For example, asking participants to opment screening test scores from infants of
answer questions about income and ethnicity Asian and European ethnic origins. Using
before they complete a survey may suggest to the a cross-sectional design, the authors analyzed
participant that the researchers will explore Harris Infant Neuromotor Test (HINT) scores of
responses on the basis of income and ethnicity. 335 infants of Asian and European origins. Fac-
torial ANOVA results indicated no significant
Marvin Lee and C. Melanie Schuele differences in test scores between infants from

these two groups. Although several limitations Ostrosky, M. M. (2008). Preparing early child-
should be considered, results of this study indi- hood educators to address young children’s
cate that practitioners can be relatively confident social-emotional development and challenging
in using the HINT to screen infants of both ori- behavior. Journal of Early Intervention, 30(4),
gins for developmental delays. [Mayson, T. A., 321–340.]
Backman, C. L., Harris, S. & Hayes, V. E.
(2009). Motor development in Canadian infants Neil J. Salkind
of Asian and European ethnic origins. Journal of
See also Control Variables; Dichotomous Variable;
Early Intervention, 31(3), 199–214.]
Independent Variable; Meta-Analysis; Nuisance
Variable; Random Variable; Research Hypothesis
In this study, the dependent variable is motor
development as measured by the Harris Infant Further Readings
Neuromotor Test (HINT), and the independent
variables are ethnic origin (with the two categori- Luft, H. S. (2004). Focusing on the dependent variable:
cal levels of Asian origin and European origin). In Comments on ‘‘Opportunities and Challenges for
this quasi-experimental study (since participants Measuring Cost, Quality, and Clinical Effectiveness in
Health Care,’’ by Paul A. Fishman, Mark C. Hornbrook,
are preassigned), scores on the HINT are a function
Richard T. Meenan, and Michael J. Goodman. Medical
of ethnic origin. Care Research and Review, 61, 144S–150S.
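To make the functional relationship concrete, the following is a minimal Python sketch of a dependent variable generated as a function of one categorical independent variable plus random error. The variable names and numbers are hypothetical illustrations, not data from the study cited above, and the two-group comparison is a simplified one-factor version of the factorial design described there.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Independent variable: two categorical levels (two groups of participants).
group = np.repeat(["A", "B"], 50)

# Dependent variable: DV = f(IV) + error; group B is given a small built-in advantage.
effect = np.where(group == "B", 2.0, 0.0)
score = 100 + effect + rng.normal(0, 10, size=group.size)

# Describe the dependent variable at each level of the independent variable,
# then test whether the group means differ.
for g in ("A", "B"):
    print(g, round(score[group == g].mean(), 2))
t, p = stats.ttest_ind(score[group == "A"], score[group == "B"])
print("t =", round(t, 2), "p =", round(p, 3))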
In the following example, the dependent vari- Sechrest, L. (1982). Program evaluation: The independent
able is a score on a survey reflecting how well and dependent variables. Counseling Psychologist, 10,
survey participants believe that their students are 73–74.
prepared for professional work. Additional anal-
yses looked at group differences in program
length, but the outcome survey values illustrate
what is meant in this context as a dependent DESCRIPTIVE DISCRIMINANT
variable. ANALYSIS
This article presents results from a survey of fac- Discriminant analysis comprises two approaches
ulty members from 2- and 4-year higher education to analyzing group data: descriptive discriminant
programs in nine states that prepare teachers to analysis (DDA) and predictive discriminant analy-
work with preschool children. The purpose of the sis (PDA). Both use continuous (or intervally
study was to determine how professors address scaled) data to analyze the characteristics of group
content related to social-emotional development membership. However, PDA uses this continuous
and challenging behaviors, how well prepared they data to predict group membership (i.e., How accu-
believe graduates are to address these issues, and rately can a classification rule classify the current
resources that might be useful to better prepare sample into groups?), while DDA attempts to dis-
graduates to work with children with challenging cover what continuous variables contribute to the
behavior. Of the 225 surveys that were mailed, separation of groups (i.e., Which of these variables
70% were returned. Faculty members reported contribute to group differences and by how
their graduates were prepared on topics such as much?). In addition to the primary goal of discrim-
working with families, preventive practices, and inating among groups, DDA can examine the most
supporting social emotional development but less parsimonious way to discriminate between groups,
prepared to work with children with challenging investigate the amount of variance accounted for
behaviors. Survey findings are discussed related to by the discriminant variables, and evaluate the rel-
differences between 2- and 4-year programs and ative contribution of each discriminant (continu-
between programs with and without a special edu- ous) variable in classifying the groups.
cation component. Implications for personnel For example, a psychologist may be interested
preparation and future research are discussed. in which psychological variables are most respon-
[Hemmeter, M. L., Milagros Santos, R. M., & sible for men’s and women’s progress in therapy.

For this purpose, the psychologist could collect data the researcher how or where the differences
on therapeutic alliance, resistance, transference, come from.
and cognitive distortion in a group of 50 men and This entry first describes discriminant functions
50 women who report progressing well in therapy. and their statistical significance. Next, it explains
DDA can be useful in understanding which vari- the assumptions that need to be met for DDA.
ables of the four (therapeutic alliance, resistance, Finally, it discusses the computation and interpre-
transference, and cognitive distortion) contribute to tation of DDA.
the differentiation of the two groups (men and
women). For instance, men may be low on thera-
Discriminant Functions
peutic alliance and high on resistance. On the other
hand, women may be high on therapeutic alliance A discriminant function (also called a canonical
and low on transference. In this example, the other discriminant function) is a weighted linear combi-
variable of cognitive distortion may not be shown nation of discriminant variables, which can be
to be relevant to group differentiation at all because written as
it does not capture much difference among the
groups. In other words, cognitive distortion is unre- D = a + b1x1 + b2x2 + ... + bnxn + c,   (1)
lated to how men and women progress in therapy.
This is just a brief example of the utility of DDA in where D is the discriminant score, a is the inter-
differentiating among groups. cept, the bs are the discriminant coefficients, the xs
DDA is a multivariate technique with goals sim- are discriminant variables, and c is a constant. The
ilar to those of multivariate analysis of variance discriminant coefficients are similar to beta
(MANOVA) and computationally identical to weights in multiple regression and maximize the
MANOVA. As such, all assumptions of MANOVA distance across the means of the grouping variable.
apply to the procedure of DDA. However, MAN- The number of discriminant functions in DDA is
OVA can determine only whether groups are dif- k  1, where k is the number of groups or cate-
ferent, not how they are different. In order to gories in the grouping variable, or the number of
determine how groups differ using MANOVA, discriminant variables, whichever is less. For
researchers typically follow the MANOVA proce- example, in the example of men’s and women’s
dure with a series of analyses of variance (ANO- treatment progress, the number of discriminant
VAs). This is problematic because ANOVAs are functions will be one because there are two groups
univariate tests. As such, several ANOVAs may and four discriminant variables, that is, min (1, 4),
need to be conducted, increasing the researcher's and four discriminant variables, that is, min(1, 4) = 1,
likelihood of committing Type I error (likelihood since 1 is less than 4. In DDA, discriminant variables
of finding a statistically significant result that is nant function provides the best discrimination
not really there). What’s more, what makes multi- across groups, the second function second best,
variate statistics more desirable in social science and so on until all possible dimensions are
research is the inherent assumption that human assessed. These functions are orthogonal or inde-
behavior has multiple causes and effects that pendent from one another so that there will be no
exist simultaneously. Conducting a series of uni- shared variance among them (i.e., no overlap of
variate ANOVAs strips away the richness that contribution to differentiation of groups). The first
multivariate analysis reveals because ANOVA discriminant function will represent the most pre-
analyzes data as if differences among groups vailing discriminating dimension, and later func-
occur in a vacuum, with no interaction among tions may also denote other important dimensions
variables. Consider the earlier example. A series of discrimination.
of ANOVAs would assume that as men and The statistical significance of each discriminant
women progress through therapy, there is no function should be tested prior to a further evalua-
potential shared variance between the variables tion of the function. Wilks’s lambda is used to
therapeutic alliance, resistance, transference, and examine the statistical significance of functions.
cognitive distortion. And while MANOVA does Wilks’s lambda varies from 0 through 1, with 1
account for this shared variance, it cannot tell denoting the groups that have the same mean

discriminant function scores and 0 denoting those multicollinearity assumption in multiple regres-
that have different mean scores. In other words, sion. If a discriminant variable is very highly corre-
the smaller the value of Wilks’s lambda, the more lated with another discriminant variable (e.g.,
likely it is statistically significant and the better it r > .90), the variance–covariance matrix of the
differentiates between the groups. Wilks’s lambda discriminant variables cannot be inverted. Then,
is the ratio of within-group variance to the total the matrix is called ill-conditioned. Sixth, discrimi-
variance on the discriminant variables and indicates nant variables must follow the multivariate normal
the proportion of variance in the total variance distribution, meaning that a discriminant variable
that is not accounted for by differences of groups. should be normally distributed about fixed values
A small lambda indicates the groups are well dis- of all the other discriminant variables. K. V. Mar-
criminated. In addition, 1 − Wilks's lambda is dia has provided measures of multivariate skew-
used as a measure of effect size to assess the practi- ness and kurtosis, which can be computed to
cal significance of discriminant functions as well as assess whether the combined distribution of dis-
the statistical significance. criminant variables is multivariate. Also, multivari-
ate normality can be graphically evaluated.
Seventh, DDA assumes that the variance–covari-
Assumptions
ance matrices of discriminant variables are homo-
DDA requires seven assumptions to be met. First, geneous across groups. This assumption intends to
DDA requires two or more mutually exclusive make sure that the compared groups are from the
groups, which are formed by the grouping variable same population. If this assumption is met, any
with each case belonging to only one group. It is differences in a DDA analysis can be attributed to
best practice for groups to be truly categorical in discriminant variables, but not to the compared
nature. For example, sex, ethnic group, and state groups. This assumption is analogous to the
where someone resides are all categorical. Some- homogeneity of variance assumption in ANOVA.
times researchers force groups out of otherwise The multivariate Box’s M test can be used to deter-
continuous data. For example, people aged 15 to mine whether the data satisfies this assumption.
20, or income between $15,000 and $20,000. Box’s M test examines the null hypothesis that the
However, whenever possible, preserving continu- variance–covariance matrices are not different
ous data where it exists and using categorical data across the groups compared. If the test is significant
as grouping variables in DDA is best. The second (e.g., the p value is lower than .05), the null hypoth-
assumption states there must be at least two cases esis can be rejected, indicating that the matrices are
for each group. different across the groups. However, it is known
The other five assumptions are related to dis- that Box’s M test is very sensitive to even small dif-
criminant variables in discriminant functions, as ferences in variance–covariance matrices when the
explained in the previous section. The third sample size is large. Also, because it is known that
assumption states that any number of discriminant DDA is robust against violation of this assumption,
variables can be included in DDA as long as the the p value typically is set at a much lower level,
number of discriminant variables is less than the such as .001. Furthermore, it is recognized that
sample size of the smallest group. However, it is DDA is robust with regard to violation of the
generally recommended that the sample size be assumption of multivariate normality.
between 10 and 20 times the number of discrimi- When data do not satisfy some of the assump-
nant variables. If the sample is too small, the reli- tions of DDA, logistic regression can be used as an
ability of a DDA will be lower than desired. On alternative. Logistic regression can answer the
the other hand, if the sample size is too large, sta- same kind of questions DDA answers. Also, it is
tistical tests will turn out significant even for small a very flexible method in that it can handle both
differences. Fourth, the discriminant variables categorical and interval variables as discriminant
should be interval, or at least ordinal. Fifth, the variables and data under analysis do not need to
discriminant variables are not completely redun- meet assumptions of multivariate normality and
dant or highly correlated with each other. This equal variance–covariance matrices. It is also
assumption is identical to the absence of perfect robust to unequal group size.

Computation highly correlated and thus have a large amount of


shared variance between them.
When a DDA is conducted, a canonical correlation Because of these issues, it is recommended that
analysis is performed computationally that will researchers consider the structure coefficients as well
determine discriminant functions and will calculate as the standardized discriminant coefficients to
their associated eigenvalues and canonical correla- determine which variables define the nature of a spe-
tions. Each discriminant function has its own cific discriminant function. The structure coeffi-
eigenvalue. Eigenvalues, also called canonical roots cients, or the factor structure coefficients, are not
or characteristic roots, denote the proportion of semipartial coefficients like the standardized dis-
between-group variance explained by the respec- criminant coefficients but are whole coefficients, like
tive discriminant functions, and their proportions correlation coefficients. The structure coefficients
add up to 100% for all discriminant functions. represent uncontrolled association between the dis-
Thus, the ratio of two eigenvalues shows the rela- criminant functions and the discriminant variables.
tive differentiating power of their associated dis- Because the factor coefficients are correlations
criminant functions. For example, if the ratio is between the discriminant variables and the discrimi-
1.5, then the discriminant function with the larger nant functions, they can be conceptualized as factor
eigenvalue explains 50% more of the between- loadings on latent dimensions, as in factor analysis.
group variance in the grouping variable than the However, in some cases these two coefficients
function with the smaller eigenvalue does. The do not agree. For example, the standardized dis-
canonical correlation is a correlation between the criminant coefficient might tell us that a specific
grouping variable and the discriminant scores that discriminant variable differentiates groups most,
are measured by the composite of discriminant but the structure coefficient might indicate the
variables in the discriminant function. A high opposite. A body of previous research says the
canonical correlation is associated with a function standardized discriminant coefficient and the struc-
that differentiates groups well. ture coefficient can be unreliable with a small sam-
ple size, such as when the ratio of the number of
subjects to the number of discriminant variables
Interpretation
drops below 20:1. Besides increasing the sample
Descriptive discriminant functions are interpreted size, Maurice Tatsuoka has suggested that the stan-
by evaluating the standardized discriminant coeffi- dardized discriminant coefficient be used to inves-
cients, the structure coefficients, and the centroids. tigate the nature of each discriminant variable’s
Standardized discriminant coefficients represent contribution to group discrimination and that the
weights given to each discriminant variable in pro- structure coefficients be used to assign substantive
portion to how well it differentiates groups and to labels to the discriminant functions.
how many groups it differentiates. Thus, more Although these two types of coefficients inform
weight will be given to a discriminant variable that us of the relationship between discriminant func-
differentiates groups better. Because standardized tions and their discriminant variables, they do not
discriminant coefficients are semipartial coeffi- tell us which of the groups the discriminant func-
cients as standardized beta coefficients in multiple tions differentiate most or least. In other words, the
regression and expressed as z scores with a mean coefficients do not provide any information on the
of zero and a standard deviation of 1, they repre- grouping variable. To return to the earlier example
sent the relative significance of each discriminant of treatment progress, the DDA results demon-
variable to its discriminant function. The greater strated that therapeutic alliance, resistance, and
the coefficient, the larger is the contribution of its transference are mainly responsible for the differ-
associated discriminant variable to the group dif- ences between men and women in the discriminant
ferentiation. However, these standardized beta scores. However, we still need to explore which
coefficients do not tell us the absolute contribution group has more or less of these three psychological
of each variable to the discriminant function. This traits that are found to be differentiating. Thus, we
becomes a serious problem when any two discrim- need to investigate the mean discriminant function
inant variables in the discriminant function are scores for each group, which are called group

centroids. The nature of the discrimination for each descriptive correlation coefficient before learning
discriminant function can be examined by looking how to use regression or multiple regression infer-
at different locations of centroids. For example, entially. Descriptive statistics are also complemen-
a certain group that has the highest and lowest tary to inferential ones in analytical practice. Even
values of centroids on a discriminant function will when the analysis draws its main conclusions from
be best discriminated on that function. an inferential analysis, descriptive statistics are
usually presented as supporting information to
Seong-Hyeon Kim and Alissa Sherry give the reader an overall sense of the direction
and meaning of significant results.
See also Multivariate Analysis of Variance (MANOVA)
Although most of the descriptive building
blocks of statistics are relatively simple, some
Further Readings descriptive methods are high level and complex.
Consider multivariate descriptive methods, that is,
Brown, M. T., & Wicker, L. R. (2000). Discriminant
statistical methods involving multiple dependent
analysis. In H. E. A. Tinsley & S. D. Brown (Eds.),
Handbook of multivariate statistics and mathematical
variables, such as factor analysis, principal compo-
modeling (pp. 209–235). San Diego, CA: Academic nents analysis, cluster analysis, canonical correla-
Press. tion, or discriminant analysis. Although each
Huberty, C. J. (1994). Applied discriminant analysis. represents a fairly high level of quantitative sophis-
New York: Wiley. tication, each is primarily descriptive. In the hands
Sherry, A. (2006). Discriminant analysis in counseling of a skilled analyst, each can provide invaluable
psychology research. The Counseling Psychologist, 5, information about the holistic patterns in data.
661–683. For the most part, each of these high-level multi-
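As a rough computational illustration of the ideas in this entry, the sketch below uses Python and scikit-learn's LinearDiscriminantAnalysis on simulated data for two groups and four continuous discriminant variables (so only one discriminant function is extracted). The group means, variable roles, and all numeric values are invented for illustration; Wilks's lambda, the structure coefficients, and the centroids are computed directly from the definitions given above.

import numpy as np
from numpy.linalg import det
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Simulated scores for 50 men and 50 women on four continuous discriminant
# variables (alliance, resistance, transference, distortion -- illustrative only).
n = 50
men = rng.normal([40, 60, 50, 50], 10, size=(n, 4))
women = rng.normal([55, 45, 42, 50], 10, size=(n, 4))
X = np.vstack([men, women])
y = np.array([0] * n + [1] * n)  # grouping variable with two groups

# With k = 2 groups and 4 variables, min(k - 1, 4) = 1 discriminant function.
lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)        # discriminant score for each case
print("discriminant weights:", lda.scalings_.ravel().round(3))

# Wilks's lambda = |W| / |T|: within-group to total sums of squares and
# cross-products; values near 0 indicate well-separated groups.
T = np.cov(X, rowvar=False, bias=True) * X.shape[0]
W = sum(np.cov(X[y == g], rowvar=False, bias=True) * (y == g).sum() for g in (0, 1))
print("Wilks's lambda:", round(det(W) / det(T), 3))

# Structure coefficients: correlations of each variable with the discriminant scores.
structure = [np.corrcoef(X[:, j], scores[:, 0])[0, 1] for j in range(4)]
print("structure coefficients:", np.round(structure, 3))

# Group centroids: mean discriminant score within each group.
print("centroids:", scores[y == 0].mean().round(3), scores[y == 1].mean().round(3))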
variate descriptive statistical methods can be
matched to a corresponding inferential multivari-
ate statistical method to provide both a description
DESCRIPTIVE STATISTICS of the data from a sample and inferences to the
population; however, only the descriptive methods
Descriptive statistics are commonly encountered, are discussed here.
relatively simple, and for the most part easily The topic of descriptive statistics is therefore
understood. Most of the statistics encountered in a very broad one, ranging from the simple first
daily life, in newspapers and magazines, in televi- concepts in statistics to the higher reaches of data
sion, radio, and Internet news reports, and so structure explored through complex multivariate
forth, are descriptive in nature rather than inferen- methods. The topic also includes graphical data
tial. Compared with the logic of inferential statis- presentation, exploratory data analysis (EDA)
tics, most descriptive statistics are somewhat methods, effect size computations and meta-analy-
intuitive. Typically the first five or six chapters of sis methods, esoteric models in mathematical psy-
an introductory statistics text consist of descriptive chology that are highly useful in basic science
statistics (means, medians, variances, standard experimental psychology areas (such as psycho-
deviations, correlation coefficients, etc.), followed physics), and high-level multivariate graphical data
in the later chapters by the more complex rationale exploration methods.
and methods for statistical inference (probability
theory, sampling theory, t and z tests, analysis of
Graphics and EDA
variance, etc.)
Descriptive statistical methods are also founda- Graphics are among the most powerful types of
tional in the sense that inferential methods are descriptive statistical devices and often appear as
conceptually dependent on them and use them as complementary presentations even in primarily
their building blocks. One must, for example, inferential data analyses. Graphics are also highly
understand the concept of variance before learning useful in the exploratory phase of research, form-
how analysis of variance or t tests are used for ing an essential part of the approach known
statistical inference. One must understand the as EDA.

Figure 1 Scatterplots of Six Different Bivariate Data Configurations That All Have the Same Pearson Product-
Moment Correlation Coefficient

Graphics Graphics have marvelous power to clarify, but


they can also be used to obfuscate. They can be
Many of the statistics encountered in everyday highly misleading and even deceitful, either
life are visual in form—charts and graphs. Descrip- intentionally or unintentionally. Indeed, Darrell
tive data come to life and become much clearer Huff’s classic book How to Lie With Statistics
and substantially more informative through a well- makes much of its case by demonstrating deceit-
chosen graphic. One only has to look through ful graphing practices. One of them is shown in
a column of values of the Dow Jones Industrial Figure 2. Suppose that a candidate for election
Average closing price for the past 30 days and then to the office of sheriff were to show the failings
compare it with a simple line graph of the same of the incumbent with a graph like the one on
data to be convinced of the clarifying power of the left side of Figure 2. At first look, it appears
graphics. Consider also how much more informa- that crime has increased at an alarming rate dur-
tive a scatterplot is than the correlation coefficient ing the incumbent’s tenure, from 2007 to 2009.
as a description of the bivariate relationship The 2009 bar is nearly 3 times as high as the bar
between two variables. Figure 1 uses bivariate for 2007. However, with a more careful look, it
scatterplots to display six different data sets that is apparent that we have in fact magnified a small
all have the same correlation coefficient. Obviously segment of the y-axis by restricting the range
there is much more to be known about the struc- (from 264 crimes per 100,000 people to 276).
tural properties of a bivariate relationship than When the full range of y-axis values is included,
merely its strength (correlation), and much of this together with the context of the surrounding
is revealed in a scatterplot. years, it becomes apparent that what appeared
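A quick way to see this point numerically is with the first two data sets of Anscombe's well-known quartet, reproduced in the Python sketch below (these values come from Anscombe's published example, not from this entry). Both sets yield essentially the same Pearson correlation, about .82, yet one is roughly linear and the other clearly curved, a difference only a scatterplot reveals.

import numpy as np

# Anscombe's quartet, sets I and II: identical x values, different y patterns.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Both correlations are about .82 despite the very different shapes.
print(round(np.corrcoef(x, y1)[0, 1], 3))
print(round(np.corrcoef(x, y2)[0, 1], 3))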


Figure 2 Two Bar Graphs Demonstrating the Importance of Proper Scaling


Note: On right, crimes per 100,000 persons from a 7-year-series bar graph; on left, same data clipped to 3 years, with the
y-axis trimmed to a misleadingly small magnitude range to create an incorrect impression.
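The distortion just described is easy to reproduce. The following Python/matplotlib sketch draws one made-up series of crime rates (values chosen only to sit within the 264–276 range mentioned above) twice: once with the y-axis clipped to a narrow range and once with the axis starting at zero.

import matplotlib.pyplot as plt

# Illustrative values only (crimes per 100,000 persons).
years = [2007, 2008, 2009]
crimes = [266, 270, 274]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Left: y-axis clipped to a narrow range -- small fluctuations look dramatic.
ax1.bar(years, crimes)
ax1.set_ylim(264, 276)

# Right: y-axis starting at zero -- the same data look essentially flat.
ax2.bar(years, crimes)
ax2.set_ylim(0, 300)

plt.show()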

to be a strongly negative trend is more reason- much of this development, with his highly crea-
ably attributed to random fluctuation. tive graphical methods, such as the stem-and-leaf
Although the distortion just described is inten- plot and the box-and-whisker plot, as shown on
tional, similar distortions are common through the left and right, respectively, in Figure 3. The
oversight. In fact, if one enters the numerical stem-and-leaf (with the 10s-digit stems on
values from Figure 2 into a spreadsheet program the left of the line and the units-digit leaves on
and creates a bar graph, the default graph employs the right) has the advantage of being both a table
the restricted range shown in the left-hand figure. and a graph. The overall shape of the stem-and-
It requires a special effort to present the data accu- leaf plot in Figure 3 shows the positive skew in
rately. Graphics can be highly illuminating, but the the distribution, while the precise value of each
caveat is that one must use care to ensure that they data point is preserved by numerical entries. The
are not misleading. The popularity of Huff’s book box-and-whisker plots similarly show the overall
indicates that he has hit a nerve in questioning the shape of a distribution (the two in Figure 3 hav-
veracity in much of statistical presentation. ing opposite skew) while identifying the sum-
The work of Edward Tufte is also well known mary descriptive statistics with great precision.
in the statistical community, primarily for his com- The box-and-whisker plot can also be effectively
pelling and impressive examples of best practices combined with other graphs (such as attached to
in the visual display of quantitative information. the x- and the y-axes of a bivariate scatterplot)
Although he is best known for his examples of to provide a high level of convergent informa-
good graphics, he is also adept in identifying tion. These methods and a whole host of other
a number of the worst practices, such as what he illuminating graphical displays (such as run
calls ‘‘chartjunk,’’ or the misleading use of rectan- charts, Pareto charts, histograms, MultiVari
gular areas in picture charts, and the often mind- charts, and many varieties of scatterplots) have
less use of PowerPoint in academic presentations. become the major tools of data exploration.
Tukey suggested that data be considered
decomposable into rough and smooth elements
EDA
(data = rough + smooth). In a bivariate
Graphics have formed the basis of one of the relationship, for example, the regression line
major statistical developments of the past could be considered the smooth component, and
50 years: EDA. John Tukey is responsible for the deviations from regression the rough

component. Obviously, a description of the methods are often used in an attempt to deal
smooth component is of value, but one can also with inadequate data.
learn much from a graphical presentation of the The available selection of graphical descriptive
rough. statistical tools is obviously broad and varied. It
Tukey contrasted EDA with confirmatory data includes simple graphical inscription devices—such
analysis (the testing of hypotheses) and saw each things as bar graphs, line graphs, histograms, scatter-
as having its place, much like descriptive and infer- plots, box-and-whisker plots, and stem-and-leaf
ential statistics. He referred to EDA as a reliance plots, as just discussed—and also high-level ones.
on display and an attitude. Over the past century a number of highly sophisti-
cated multidimensional graphical methods have been
devised. These include principal components plots,
The Power of Graphicity multidimensional scaling plots, cluster analysis den-
Although graphical presentations can easily drograms, Chernoff faces, Andrews plots, time series
go astray, they have much potential explanatory profile plots, and generalized draftsman’s displays
power and exploratory power, and some of the (also called multiple scatterplots), to name a few.
best of the available descriptive quantitative Hans Rosling, a physician with broad interests,
tools are in fact graphical in nature. Indeed, it has created a convincing demonstration of the
has been persuasively argued, and some evidence immense explanatory power of so simple a graph
has been given, that the use of graphs in publica- as a scatterplot, using it to tell the story of eco-
tions both within psychology and across other nomic prosperity and health in the development of
disciplines correlates highly with the ‘‘hardness’’ the nations of the world over the past two centu-
of those scientific fields. Conversely, an inverse ries. His lively narration of the presentations
relation is found between hardness of subareas accounts for some of their impact, but such data
of psychology and the use of inferential statistics stories can be clearly told with, for example,
and data tables, indicating that the positive cor- a time-series scatterplot of balloons (the diameter
relation of graphicity with hardness is not due to of each representing the population size of a partic-
quantification and that perhaps inferential ular nation) floating in a bivariate space of fertility
rate (x-axis) and life expectancy (y-axis). The
time-series transformations of this picture play like
a movie, with labels of successive years (‘‘1962,’’
3 5789 ‘‘1963,’’ etc.) flashing in the background.
3 0022234
2 566667899
2 122344 Effect Size, Meta-Analysis, and
1 578 Accumulative Data Description
1 0234 Effect size statistics are essentially descriptive in
0 569 nature. They have evolved in response to a logical
0 334 gap in established inferential statistical methods.
Many have observed that the alternative hypothe-
sis is virtually always supported if the sample size
is large enough and that many published and
statistically significant results do not necessarily
represent strong relationships. To correct this
Figure 3 Stem-and-Leaf Plot (Left); Two Box-and- somewhat misleading practice, William L. Hays,
Whisker Plots (Right) in his 1963 textbook, introduced methods for cal-
Note: The box-and-whisker plots consist of a rectangle
culating effect size.
extending from the 1st quartile to the 3rd quartile, a line
across the rectangle (at the 2nd quartile, or median), and In the 30 years that followed, Jacob Cohen took
‘‘whisker’’ lines at each end marking the minimum and the lead in developing procedures for effect size
maximum values. estimation and power analysis. His work in turn
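Because the box-and-whisker plot is assembled from a handful of descriptive summaries, its ingredients are easy to compute directly. The Python sketch below calculates the five values marked by the basic plot described in the note above (minimum, first quartile, median, third quartile, maximum) for a small made-up data set.

import numpy as np

data = np.array([3, 4, 5, 7, 8, 9, 12, 15, 21, 24, 27, 33, 35, 37, 39])

# The five summary values a basic box-and-whisker plot displays.
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("minimum:", data.min())
print("1st quartile:", q1)
print("median:", median)
print("3rd quartile:", q3)
print("maximum:", data.max())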

led to the development of meta-analysis as an some types of factor analysis, cluster analysis, multi-
important area of research—comparisons of the dimensional scaling, discriminant analysis, and
effect sizes from many studies, both to properly canonical correlation, are essentially descriptive in
estimate a summary effect size value and to assess nature. Each is conceptually complex, useful in
and correct bias in accumulated work. That is, a practical sense, and mathematically interesting.
even though the effect size statistic itself is descrip- They provide clear examples of the farther reaches
tive, inferential data-combining methods have been of sophistication within the realm of descriptive
developed to estimate effect sizes on a population statistics.
level.
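One of the most widely reported effect sizes for a two-group comparison is Cohen's d, the standardized mean difference. The Python sketch below computes d from two simulated samples using the pooled standard deviation; the group labels and population values are invented for illustration.

import numpy as np

rng = np.random.default_rng(42)
treatment = rng.normal(105, 15, 100)
control = rng.normal(100, 15, 100)

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd
print(round(d, 2))  # about .33 for these population values, a smallish effect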
Another aspect of this development is that the
recommendation that effect sizes be reported has Principal Components
begun to take on a kind of ethical force in con- Analysis and Factor Analysis
temporary psychology. In 1996, the American Factor analysis has developed within the disci-
Psychological Association Board of Scientific pline of psychology over the past century in close
Affairs appointed a task force on statistical infer- concert with psychometrics and the mental testing
ence. Its report recommended including effect movement, and it continues to be central to psycho-
size when reporting a p value, noting that report- metric methodology. Factor analysis is in fact not
ing and analyzing effect size is imperative to one method but a family of methods (including
good research. principal components) that share a common core.
The various methods range from entirely descrip-
tive (principal components, and also factor analysis
Multivariate Statistics and Graphics
by the principal components method) to inferential
Many of the commonly used multivariate statisti- (common factors method, and also maximum like-
cal methods, such as principal components analysis, lihood method). Factors are extracted by the

Figure 4 Cluster Analysis Dendrogram of the Log-Frequencies of the 100 Most Frequent Male Names in the United States in the 19th Century

Note: Dendrogram of dist(names100) with complete-linkage hclust; the height axis and the 100 individual name labels are omitted here.
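The dendrogram in Figure 4 was produced with a distance matrix and complete-linkage hierarchical clustering. The Python sketch below reproduces that combination with SciPy on a stand-in matrix of random profiles (the actual Census-based log frequencies are not reproduced here) and then cuts the tree into six clusters, as in the analysis described in the text.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# Stand-in for the 100 x 10 matrix of log name frequencies
# (rows = names, columns = decades).
profiles = rng.normal(0, 1, size=(100, 10))

# Complete-linkage hierarchical clustering on Euclidean distances,
# matching the dist + complete-linkage hclust noted on the dendrogram.
Z = linkage(profiles, method="complete", metric="euclidean")

# Cut the tree into six clusters, mirroring the six clusters discussed in the text.
labels = fcluster(Z, t=6, criterion="maxclust")
print(np.bincount(labels)[1:])  # number of names falling in each cluster

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.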

maximum likelihood method to account for as principal components method of factor analysis are
much variance as possible in the population corre- most often employed, for descriptive ends, in creat-
lation matrix. Principal components and the ing multivariate graphics.

Figure 5 Semantic Space of Male Names Defined by the Vectors for the 10 Decades of the 19th Century

Note: Six panels of profile plots (log of name frequency by decade, 1801–1900), one per cluster: Cluster 1: High Frequency, log 3 Across the Century; Cluster 2: Zero to log 2, Emerges in 5th to 8th Decade; Cluster 3: Zero to log 2, Emerges in 2nd to 5th Decade; Cluster 4: Medium Frequency, log 2 Across the Century; Cluster 5: Drop From log 2 to log 1 Across the Century; Cluster 6: Low Frequency, log 1.3 Across the Century. The per-cluster name lists are omitted here.

This matrix, taken from U.S. Census data from the 19th century, contains frequencies per 10,000 names for each decade (columns) for each of the top 100 male names (rows). Figure 4 is a cluster analysis dendrogram of these names, revealing six clusters. Each of the six clusters is shown as a profile plot in Figure 5, with a collage of line plots tracing the trajectory of each name within the cluster. Clearly, the cluster analysis separates the groups well. Figure 6 is a vector plot from a factor analysis, in which two factors account for 93.3% of the variance in the name frequency pattern for ten decades. The vectors for the first three or four decades of the century are essentially vertical, with the remaining decades fanning out sequentially to the right and with the final decade flat horizontally to the right.

Figure 6 Location of the 100 Most Frequent 19th Century Male Names Within the Semantic Space of 10 Decades

Note: Vector plot of the decade axes (1800 through 1890) in the two-factor space; the vectors for the first five decades are tightly clustered, those for the last five more spread out.
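The two-factor solution and vector plot just described can be approximated with a principal components analysis. The Python sketch below runs scikit-learn's PCA on a stand-in random matrix in place of the real 100 × 10 name-frequency matrix (which is not reproduced here); the explained variance ratios and component loadings play the roles of the 93.3% figure and the decade vectors discussed above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Stand-in for the 100 names x 10 decades matrix (rows = names, columns = decades).
X = rng.normal(0, 1, size=(100, 10))

# Two components, analogous to the two-factor solution described in the text.
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)  # location of each name in the two-dimensional space
print(pca.explained_variance_ratio_.round(3))  # share of variance per component

# The decade "vectors" of a vector plot correspond to the component loadings.
loadings = pca.components_.T  # one row per decade, one column per component
print(loadings.round(2))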
Other Multivariate Methods A scatterplot (not shown here) of the 100 names
within this same two-factor space reveals that the
A number of other multivariate methods are names group well within the six clusters, with vir-
also primarily descriptive in their focus and can be tually no overlap among clusters.
effectively used to create multivariate graphics, as
several examples will illustrate. Cluster analysis is Bruce L. Brown
a method for finding natural groupings of objects
See also Bar Chart; Box-and-Whisker Plot; Effect Size,
within a multivariate space. It creates a graphical
Measures of; Exploratory Data Analysis; Exploratory
representation of its own, the dendrogram, but it
Factor Analysis; Mean; Median; Meta-Analysis;
can also be used to group points within a scatter-
Mode; Pearson Product-Moment Correlation
plot. Discriminant analysis can be used graphically
Coefficient; Residual Plot; Scatterplot; Standard
in essentially the same way as factor analysis and
Deviation; Variance
principal components, except that the factors are
derived to maximally separate known groups
Further Readings
rather than to maximize variance. Canonical cor-
relation can be thought of as a double factor anal- Brown, B. L. (in press). Multivariate analysis for the bio-
ysis in which the factors from an X set of variables behavioral and social sciences. New York: Wiley.
are calculated to maximize their correlation with Cudeck, R., & MacCallum, R. C. (2007). Preface. In R.
corresponding factors from a Y set of variables. As Cudeck & R. C. MacCallum (Eds.), Factor analysis at
such, it can form the basis for multivariate graphi- 100: Historical developments and future directions.
cal devices for comparing entire sets of variables. Mahwah, NJ: Lawrence Erlbaum.
Huff, D. (1954). How to lie with statistics. New York:
Norton.
Multivariate Graphics Kline, P. (1993). The handbook of psychological testing.
London: Routledge.
A simple example, a 100 × 10 matrix of name Smith, L. D., Best, I. A., Stubbs, A., Archibald, A. B., &
frequencies, illustrates several multivariate graphs. Roberson-Nay, R. (2002). Constructing knowledge:

The role of graphs and tables in hard and soft Constructed Dichotomous Variables
psychology. American Psychologist, 57(10), 749–761.
Tufte, E. R. (2001). The visual display of quantitative Dichotomous variables may be constructed on
information (2nd ed.). Cheshire, CT: Graphics Press. the basis of conceptual rationalizations regarding
Tukey, J. W. (1980). We need both exploratory and the variables themselves or on the basis of the dis-
confirmatory. American Statistician, 34(1), 23–25. tribution of the variables in a particular study.

Construction Based on Conceptualization


DICHOTOMOUS VARIABLE While studies may probe the frequency of inci-
dence of particular life experiences, researchers
A dichotomous variable, a special case of categori- may provide a conceptually or statistically embed-
cal variable, consists of two categories. The partic- ded rationale to support reducing the variation in
ular values of a dichotomous variable have no the range of the distribution into two groups. The
numerical meaning. A dichotomous variable can conceptual rationale may be rooted in the argu-
either be naturally existing or constructed by ment that there is a qualitative difference between
a researcher through recoding a variable with participants who did or did not receive a diagnosis
more variation into two categories. A dichotomous of depression, report discrimination, or win the
variable may be either an independent or a depen- lottery. The nature of the variable under study
dent variable, depending on its role in the research may suggest the need for further exploration
design. The role of the dichotomous variable in related to the frequency with which participants
the research design has implications for the selec- experienced the event being studied, such as
tion of appropriate statistical analyses. This entry a recurrence of depression, the frequency of
focuses on how a dichotomous variable may be reported discrimination, or multiple lottery win-
defined or coded and then outlines the implications nings. Nonetheless, the dichotomous variable
of its construction for data analysis. allows one to distinguish qualitatively between
groups, with the issue of the multiple incidence or
Identification, Labeling, and Conceptualization frequency of the reported event to be explored sep-
of a Dichotomous Variable arately or subsequently.

Identification and Labeling


Construction Based on Distribution
A dichotomous variable may also be referred to The original range of the variable may extend
as a categorical variable, a nonmetric variable, beyond a binomial distribution (e.g., frequency
a grouped dichotomous variable, a classification var- being recorded as an interval such as never, some-
iable, a dummy variable, a binary variable, or an times, often, or with an even broader range when
indicator variable. Within a data set, any coding sys- a continuous variable with possible values of 1–7
tem can be used that assigns two different values. may be reduced to two groups of 1–4 and 5–7).
An analysis of the standard deviation and shape of
the frequency distribution (i.e., as represented
Natural Dichotomous Variables
through a histogram, box-plot, or stem-and-leaf
Natural dichotomous variables are based on the diagram) may suggest that it would be useful to
nature of the variable and can be independently recode the variable into two values. This recoding
determined. These variables tend to be nominal, may take several forms, such as a simple median
discrete categories. Examples include whether split (with 50% of scores receiving one value and
a coin toss is heads or tails, whether a participant the other 50% receiving the other value), or other
is male or female, or whether a participant did or divisions based on the distribution of the data
did not receive treatment. Naturally dichotomous (e.g., 75% vs. 25% or 90% vs. 10%) or other
variables tend to align neatly with an inclusive cri- conceptual reasons. For example, single or low-
terion or condition and require limited checking of frequency events (e.g., adverse effects of a treat-
the data for reliability. ment) may be contrasted with high-frequency

events. The recoding of a variable with a range of (e.g., true/false or male/female) or may be assigned
values into a dichotomous variable may be done randomly by the researcher to address a range of
intentionally for a particular analysis, with the research issues (e.g., sought treatment vs. did not
original values and range of the variable main- seek treatment or sought treatment between zero
tained in the data set for further analysis. and two times vs. sought treatment three or
more times). How a dichotomous variable is
conceptualized and constructed within a research
Implications for Statistical Analysis
design (i.e., as an independent or a dependent
The role of the dichotomous variable within the variable) will affect the type of analyses appro-
research design (i.e., as an independent or depen- priate to interpret the variable and describe,
dent variable), as well as the nature of the sample explain, or predict based in part on the role of
distribution (i.e., normally or nonlinearly distrib- the dichotomous variable. Because of the arbi-
uted), influences the type of statistical analyses that trary nature of the value assigned to the dichoto-
should be used. mous variable, it is imperative to consult the
descriptive statistics of a study in order to fully
interpret findings.
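As a minimal illustration of the recoding options described above, the Python sketch below constructs two dichotomous versions of a made-up 1-to-7 score: one by a simple median split and one by a conceptually chosen cut point (1-4 versus 5-7). All variable names and values are hypothetical.

import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Illustrative continuous variable (e.g., a 1-7 attitude score).
df = pd.DataFrame({"score": rng.integers(1, 8, size=200)})

# Median split: the constructed dichotomous variable keeps no information
# about where a case fell within its half of the distribution.
median = df["score"].median()
df["score_high"] = (df["score"] > median).astype(int)

# A conceptually driven cut (1-4 vs. 5-7), as described above.
df["score_5plus"] = (df["score"] >= 5).astype(int)

print(df[["score_high", "score_5plus"]].mean())  # proportion coded 1 under each rule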
Dichotomous Variables as Independent Variables
In the prototypical experimental or quasi-exper- Mona M. Abo-Zena
imental design, the dependent variable represents
See also Analysis of Variance (ANOVA); Correlation;
behavior that researchers measure. Depending on
Covariate; Logistic Regression; Multiple Regression
the research design, a variety of statistical proce-
dures (e.g., correlation, linear regression, and anal-
yses of variance) can explore the relationship Further Readings
between a particular dependent variable (e.g., Budesco, D. V. (1985). Analysis of dichotomous variables
school achievement) and a dichotomous variable in the presence of serial dependence. Psychological
(e.g., the participant’s sex or participation in a par- Bulletin, 73(3), 547–561.
ticular enrichment program). How the dichoto- Meyers, L. S., Gamst, G., & Guarnino, A. J. (2006).
mous variable is accounted for (e.g., controlled for, Applied multivariate research: Design and
blocked) will be dictated by the particular type of interpretation. Thousand Oaks, CA: Sage.
analysis implemented.

Dichotomous Variables as Dependent Variables DIFFERENTIAL ITEM FUNCTIONING


When a dichotomous variable serves as
a dependent variable, there is relatively less vari- Item bias represents a threat to the validity of test
ation in the predicted variable, and consequently scores in many different disciplines. An item is
the data do not meet the requirements of a nor- considered to be biased if the item unfairly favors
mal distribution and linear relationship between one group over another. More specifically, an item
the variables. When this occurs, there may be is considered to be biased if two conditions are
implications for the data analyses selected. For met. First, performance on the item is influenced
instance, if one were to assess how several pre- by sources other than differences on the construct
dictor variables relate to a dichotomous depen- of interest that are deemed to be detrimental to
dent variable (e.g., whether a behavior is one group. Second, this extraneous influence
observed), then procedures such as logistic results in differential performance across identifi-
regression should be used. able subgroups of examinees.
The use of the term bias refers to various con-
texts, both statistical and social. From a statistical
Applications
point of view, an item is said to be biased if the
Dichotomous variables are prominent features of expected test or item scores are not the same for
a research design. They may either occur naturally subjects from different subpopulations, given the


Figure 1 An Illustration of Gender Effect

same level of trait on the instrument of interest.


Thus, bias is not simply a difference between the
means of item scores for subjects from different
subpopulations. Group mean differences on an
item could simply indicate differences in their abil-
ity on the construct the item is measuring. In order
to show the presence of bias, one must show that Figure 2 An Illustration of No Gender Effect
groups continue to differ in their performance on Controlling for Latent Trait
an item or test even after their ability levels are
controlled for. From a social point of view, an item
is said to be biased if this difference is evaluated as An illustration of DIF is given in Figures 1
being harmful to one group more than other through 3. In this example, suppose there are two
groups. groups of subjects (e.g., men and women) that
In most psychometric research, there is an inter- have different probability of a dichotomous
est in detecting bias at the item level. One applica- response on an item i, illustrated in Figure 1. A
tion of this would be in test development. Items heavier weight signifies a higher probability of get-
that show bias can be reformulated or removed ting the item correct. In Figure 1, men have
from the instrument. By considering bias at only a higher probability of getting this particular item
the test level, one faces the real possibility of miss- correct.
ing bias for a particular item. Furthermore, by Because this item is an indicator of some
considering bias on the item level, it is possible to latent, then the difference between the two
see whether certain items are biased against certain groups is possibly attributable to the latent
subpopulations. trait. Therefore, controlling for this latent trait
One characteristic of bias is differential item (matching criterion) should remove the relation-
functioning (DIF), in which examinees from differ- ship between the gender and the item score. If
ent groups have differing probabilities of success this is the case, the item is measurement invari-
on an item after being matched on the ability of ant across the groups. This is illustrated in
interest. DIF is a necessary but insufficient condi- Figure 2.
tion for item bias. If an item is biased, then DIF is However, if the relationship between gender
present. However, the presence of DIF does not and the item remains after controlling for the
imply item bias in and of itself. latent trait, then DIF is present. That is, the item

getting the item correct. However, items that


have different item response functions between
groups indicate that DIF is present and that the
Item i items may be biased.
It is important to note that an item might
show DIF yet not be biased. DIF is a necessary
condition for bias, yet biased items reflect a more
stringent interpretation of the severity of impact
of the subpopulation difference. The presence of
DIF is detected by statistical means, yet an
item is considered biased only on interpretation
Latent Trait DIF of the meaning of the detected difference. An
item with DIF might not always indicate bias.
The item could simply indicate a multidimen-
sional facet of the item or test. Thus, DIF
analysis detects only possibly biased items. Aki-
hito Kamata and Brandon K. Vaughn have sug-
gested that an item with DIF should be referred
to as a ‘‘possibly biased item,’’ or simply as
a ‘‘DIF item.’’

Terminology
DIF analysis typically compares two groups:
a focal group and a reference group. The focal
Figure 3 An illustration of Differential Item group is defined as the main group of interest,
Functioning whereas the reference group is a group used for
comparison purposes. The statistical methodol-
ogy of DIF assumes that one controls for the
measures something in addition to the latent trait
trait or ability levels between these two groups.
that is differentially related to the group variable.
Most research uses the term ability level for
This is shown in Figure 3.
either ability levels or trait levels, even though in
The remainder of this entry defines DIF and its
specific situations one term might be more
related terminology, describes the use of DIF for
precise. The ability level is used to match sub-
polytomous outcomes, and discusses the assess-
jects from the two groups so that the effect of
ment and measurement of DIF.
ability is controlled. Thus, by controlling for
ability level, one may detect group differences
that are not confounded by the ability. This abil-
Definition of DIF
ity level is aptly referred to as the matching
DIF is one way to consider the different impact criterion.
an item may have on various subpopulations. The matching criterion might be one of many
One could consider DIF as the statistical mani- different indices of interest, yet typically the
festation of bias, but not the social aspect. An total test performance or some estimate of trait
item is said to show DIF when subjects from two levels (as in the case of attitudinal measures) is
subpopulations have different expected scores on used. In some instances, an external measure
the same item after controlling for ability. Using might be used as the matching criterion if it can
item response theory (IRT) terminology, if be shown that the measure is appropriate to
a non-DIF item has the same item response func- account for the ability levels of the groups of
tion between groups, then subjects having the interest. In addressing the issue of using test
same ability would have equal probability of scores as the matching criterion, the matching

Figure 4 Example of Uniform DIF
Figure 5 Example of Nonuniform DIF

Note: Both figures plot the probability of a correct response against theta for Groups A and B; in Figure 5 the two item characteristic curves cross.

criterion should be free of DIF items. This can data, the consideration of DIF is more simplistic
be problematic in the typical case in which the as there are only two outcomes. But for polyto-
items undergoing DIF analysis are the very items mous outcomes, there is a possibility of an inner-
that form the matching criterion. In such a response DIF (IDIF). That is, there is the possi-
situation, the matching criterion should undergo bility that DIF does not exist uniformly across
a ‘‘purification’’ process, in which a preliminary all response categories but may exist for certain
DIF analysis is performed to rid the matching responses within that item. Figure 6 illustrates
criterion of any DIF items. an example in which a particular 4-point Likert-
The phrase uniform DIF refers to a type of type item displays DIF on lower ordinal
DIF in which the magnitude of group difference responses but not on higher ordinal responses.
is the same across ability levels. Using IRT ideas, This type of DIF can be referred to as a lower
uniform DIF occurs when there is no interaction IDIF. This can exist, as an illustration, when the
between group and item characteristic curves, as focal group tends to differentially vary in suc-
represented in Figure 4. In contrast, the phrase cessfully scoring lower ordinal scores on an atti-
nonuniform DIF refers to a type of DIF in which tudinal measurement as compared to a reference
the magnitude of the group difference is not con- group, while both groups have similar success in
sistent across ability levels. From an IRT per- upper ordinal scoring categories.
spective, nonuniform DIF would result in Figure 7 illustrates a balanced IDIF, in which
crossing item characteristic curves. This is illus- the nature of DIF changes for both extreme ordi-
trated in Figure 5. Nonuniform DIF can be nal responses.
thought of as an interaction effect between the In this example, there is potential bias against
group and the ability level. women on the lower ordinal responses, and poten-
tial bias against men on the upper responses.
Other types of IDIF patterns are possible. For
DIF for Polytomous Outcomes
example, upper IDIF would indicate potential bias
Although traditional DIF procedures involve on the upper ordinal responses, while consistent
dichotomously scored items, DIF can also be IDIF would indicate that the DIF effect is approxi-
considered for polytomously scored data (e.g., mately the same for all ordinal responses. Patterns
Likert-type scales). Polytomously scored data in IDIF are not always present, however. In some
have the additional consideration that subjects situations, IDIF may be present only between cer-
can respond to or be labeled with more than two tain ordinal responses and not others, with no dis-
categories on a given item. For dichotomous cernible pattern.
364 Differential Item Functioning

“Strongly Disagree” “Strongly Disagree”

“Disagree” “Disagree”

“Agree” “Agree”

“Strongly Agree” “Strongly Agree”

Figure 6 An Illustration of Lower IDIF for Figure 7 An Illustration of Balanced IDIF for
Polytomous Outcomes Polytomous Outcomes

Assessment emerged, in particular the multidimensional IRT


method of Robin Shealy and William Stout in
The actual assessment and measurement of DIF the early 1990s. The Shealy and Stout method
is not always as straightforward as the concept. provided an interesting vantage point for DIF
Various methods have been proposed to measure analysis—that DIF could be a result of multidi-
DIF. Perhaps the oldest method was an analysis mensionality of the test in question. One criti-
of variance approach, which tested for an inter- cism of the traditional methods is that they
action effect between groups and items. Yet this explain very little of the source of DIF. While the
approach did not gain in popularity because of IRT perspective allows for a greater discernment
the problematic nature of items being measured of DIF than do traditional methods, there is still
qualitatively or yielding binary outcomes. In no attempt to explain the basis for DIF.
1972, William H. Angoff introduced one of the One way of approaching this issue in recent
first widely used measures of DIF in the delta- research is by using multilevel analysis techniques,
plot method, also known as the transformed such as the approach proposed by David B. Swan-
item-difficulty method. However, this method son, Brian E. Clauser, Susan M. Case, Ronald J.
was often criticized as giving misleading results Nungster, and Carol Featherman. Kamata and
for items with differing discriminating power. Salih Binici, in 2003, considered a multilevel
Various other methods were introduced, such as approach to DIF detection. In this model, Level 1
the Mantel–Haenszel procedure. The Mantel– represented the item level, Level 2 represented the
Haenszel procedure dominated the psychometric individual level, and Level 3 represented a group
approach to the study of DIF for many years unit. The rationale for the inclusion of a third level
because of its ability to give an effect size for was that the magnitude of DIF could vary across
DIF, known as α, in addition to a significance group units, such as schools in an educational DIF
test. study. This approach models a random-effect
Another approach to DIF analysis is based on DIF for the group units and uses individual charac-
IRT principles. While traditional methods allow teristics to explicate the potential sources of DIF.
for items to differ in difficulty, there is no allow- Saengla Chaimongkol and Vaughn have, sepa-
ance for differing item discrimination. As Angoff rately, extended the work of Kamata and Binici by
has stressed, it is possible for an item with the using a Bayesian approach to obtain parameter
same difficulty parameter in the two groups but estimates for dichotomously and polytomously
with different slope parameters to yield a DIF scored responses, respectively.
index of zero when analyzed by all but the IRT
method. Thus, many IRT approaches to DIF Brandon K. Vaughn
Directional Hypothesis 365

See also Item Analysis; Item Response Theory • There is a positive relationship between the
number of books read by children and the
children’s scores on a reading test.
Further Readings • Teenagers who attend tutoring sessions will
make higher achievement test scores than
Angoff, W. H. (1972, September). A technique for the
comparable teenagers who do not attend
investigation of cultural differences. Paper presented at
the annual meeting of the American Psychological
tutoring sessions.
Association, Honolulu, HI.
Holland, P. W., & Thayer, D. T. (1988). Differential item Nondirectional and Null Hypotheses
performance and the Mantel-Haenszel procedure. In
H. Wainer & H. I. Braun (Eds.), Test validity (pp. In order to fully understand a directional hypothe-
129–145). Hillsdale, NJ: Lawrence Erlbaum. sis, there must also be a clear understanding of
Kamata, A., & Binici, S. (2003). Random-effect DIF a nondirectional hypothesis and null hypothesis.
analysis via hierarchical generalized linear models.
Paper presented at the annual meeting of the
Psychometric Society, Sardinia, Italy.
Nondirectional Hypothesis
Lord, F. M. (1980). Applications of item response theory
to practical testing problems. Hillsdale, NJ: Lawrence A nondirectional hypothesis differs from a direc-
Erlbaum. tional hypothesis in that it predicts a change, rela-
Williams, V. S. L. (1997). The ‘‘unbiased’’ anchor: tionship, or difference between two variables but
Bridging the gap between DIF and item bias. Applied
does not specifically designate the change, relation-
Measurement in Education, 10, 253–267.
ship, or difference as being positive or negative.
Another difference is the type of statistical test that
is used. An example of a nondirectional hypothesis
would be the following: For (Population A), there
DIRECTIONAL HYPOTHESIS will be a difference between (Independent Variable
1) and (Independent Variable 2) in terms of
A directional hypothesis is a prediction made by (Dependent Variable 1). The following are other
a researcher regarding a positive or negative examples of nondirectional hypotheses:
change, relationship, or difference between two
variables of a population. This prediction is typi- • There is a relationship between the number of
cally based on past research, accepted theory, books read by children and the children’s scores
extensive experience, or literature on the topic. on a reading test.
Key words that distinguish a directional hypothesis • Teenagers who attend tutoring sessions will have
are: higher, lower, more, less, increase, decrease, achievement test scores that are significantly
positive, and negative. A researcher typically devel- different from the scores of comparable
ops a directional hypothesis from research ques- teenagers who do not attend tutoring sessions.
tions and uses statistical methods to check the
validity of the hypothesis.
Null Hypothesis
Statistical tests are not designed to test a direc-
Examples of Directional Hypotheses
tional hypothesis or nondirectional hypothesis, but
A general format of a directional hypothesis would rather a null hypothesis. A null hypothesis is a pre-
be the following: For (Population A), (Independent diction that there will be no change, relationship,
Variable 1) will be higher than (Independent Vari- or difference between two variables. A null
able 2) in terms of (Dependent Variable). For hypothesis is designated by Ho. An example of
example, ‘‘For ninth graders in Central High a null hypothesis would be the following: for
School, test scores of Group 1 will be higher than (Population A), (Independent Variable 1) will not
test scores of Group 2 in terms of Group 1 receiv- be different from (Independent Variable 2) in terms
ing a specified treatment.’’ The following are other of (Dependent Variable). The following are other
examples of directional hypotheses: examples of null hypotheses:
366 Directional Hypothesis

• There is no relationship between the number of When one is performing a statistical test for
books read by children and the children’s scores significance, the null hypothesis is tested to
on a reading test. determine whether there is any significant
• Teenagers who attend tutoring sessions will amount of change, difference, or relationship
make achievement test scores that are equivalent between the two variables. Before the test is
to those of comparable teenagers who do not
administered, the researcher chooses a signifi-
attend tutoring sessions.
cance level, known as an alpha level, designated
by α. In studies of education, the alpha level is
Statistical Testing of Directional Hypothesis often set at .05 or α ¼ .05. A statistical test of
the appropriate variable will then produce a p
A researcher starting with a directional hypothesis value, which can be understood as the probabil-
will have to develop a null hypothesis for the pur- ity a value as large as or larger than the statisti-
pose for running statistical tests. The null hypothe- cal value produced by the statistical test would
sis predicts that there will not be a change or have been found by chance if the null hypothesis
relationship between variables of the two groups were true. The p value must be smaller than the
or populations. The null hypothesis is designated predetermined alpha level to be considered sta-
by H0, and a null hypothesis statement could be tistically significant. If no significance is found,
written as H0 : μ1 ¼ μ2 (Population or Group 1 then the null hypothesis is accepted. If there is
equals Population or Group 2 in terms of the a significant amount of change according to the
dependent variable). A directional hypothesis or p value between two variables which cannot be
nondirectional hypothesis would then be consid- explained by chance, then the null hypotheses is
ered to be an alternative hypothesis to the null rejected, and the alternative hypothesis is
hypothesis and would be designated as H1. Since accepted, whether it is a directional or a nondi-
the directional hypothesis is predicting a direction rectional hypothesis.
of change or difference, it is designated as The type of alternative hypothesis, directional
H1 : μ1 > μ2 or H1 : μ1 < μ2 (Population or or nondirectional, makes a considerable difference
Group 1 is greater than or less than Population or in the type of significance test that is run. A nondi-
Group 2 in terms of the dependent variable). In rectional hypothesis is used when a two-tailed test
the case of a nondirectional hypothesis, there of significance is run, and a directional hypothesis
would be no specified direction, and it could be when a one-tailed test of significance is run. The
designated as H1 : μ1 6¼ μ2 (Population or Group reason for the different types of testing becomes
1 does not equal Population or Group 2 in terms apparent when examining a graph of a normalized
of the dependent variable). curve, as shown in Figure 1.

Directional Hypothesis Nondirectional Hypothesis

H0 H0

H1 H1 H1

One-Tailed Test of Significance Two-Tailed Test of Significance

H1 : µ1 > µ2 H1 : µ1 > µ2

Figure 1 Comparison of Directional and Nondirectional Hypothesis Test


Discourse Analysis 367

The nondirectional hypothesis, since it pre- applications (9th ed.). Upper Saddle River, NJ:
dicts that the change can be greater or lesser Pearson Education.
than the null value, requires a two-tailed test of Moore, D. S., & McMabe, G. P. (1993). Introduction to
significance. On the other hand, the directional the practice of statistics (2nd ed.). New York: W. H.
Freeman.
hypothesis in Figure 1 predicts that there will be
Patten, M. L. (1997). Understanding research
a significant change greater than the null value; methods: An overview of the essentials. Los
therefore, the negative area of significance of the Angeles: Pyrczak.
curve is not considered. A one-tailed test of sig-
nificance is then used to test a directional
hypothesis.
DISCOURSE ANALYSIS
Summary Examples of Hypothesis Type
Discourse is a broadly used and abstract term
The following is a back-to-back example of the that is used to refer to a range of topics in vari-
directional, nondirectional, and null hypothesis. ous disciplines. For the sake of this discussion,
In reading professional articles and test hypothe- discourse analysis is used to describe a number
ses, one can determine the type of hypothesis as of approaches to analyzing written and spoken
an exercise to reinforce basic knowledge of language use beyond the technical pieces of lan-
research. guage, such as words and sentences. Therefore,
discourse analysis focuses on the use of language
Directional Hypothesis: Women will have higher
within a social context. Embedded in the con-
scores than men will on Hudson’s self-esteem scale.
structivism–structuralism traditions, discourse
Nondirectional Hypothesis: There will be analysis’s key emphasis is on the use of language
a difference by gender in Hudson’s self-esteem scale in social context. Language in this case refers to
scores. either text or talk, and context refers to the
Null Hypothesis: There will be no difference social situation or forum in which the text or
between men’s scores and women’s scores on talk occurs. Language and context are the two
Hudson’s self-esteem scale. essential elements that help distinguish the two
major approaches employed by discourse
Ernest W. Brewer and Stephen Stockton analysts. This entry discusses the background
See also Alternative Hypotheses; Nondirectional
and major approaches of discourse analysis and
Hypotheses; Null Hypothesis; One-Tailed Test;
frameworks associated with sociopolitical
p Value; Research Question; Two-Tailed Test
discourse analysis.

Further Readings Background


Ary, D., Jacobs, L. C., Razavieh, A., & Sorensen, C. In the past several years social and applied
(2006). Introduction to research in education (7th or professional sciences in academia have seen
ed.). Belmont, CA: Thomson Wadsworth. a tremendous increase in the number of dis-
Borg, W. R. (1987). Applying educational research (2nd course analysis studies. The history of discourse
ed.). White Plains, NY: Longman. analysis is long and embedded in the origins of
Creswell, J. W. (2005). Educational research: Planning, a philosophical tradition of hermeneutics and
conducting, and evaluating quantitative and phenomenology. These traditions emphasize the
qualitative research. Upper Saddle River, NJ: Pearson
issue of Verstehen, or lifeworld, and the social
Education.
Fraenkel, J. R., & Wallen, N. E. (2009). How to design
interaction within the lifeworld. A few major
and evaluate research in education (9th ed.). New theorists in this tradition are Martin Heidegger,
York: McGraw-Hill. Maurice Merleau-Ponty, Edmund Husserl, Wil-
Gay, L. R., Mills, G. E., & Airasian, P. (2009). helm Dilthey, and Alfred Schutz. Early applica-
Educational research: Competencies for analysis and tions of discourse analysis in social and applied
368 Discourse Analysis

and professional sciences can be found in psy- social construction of discursive practices that
chology, sociology, cultural studies, and linguis- maintain the social context. This approach empha-
tics. The tradition of discourse analysis is often sizes social context as influenced by language.
listed under interpretive qualitative methods and Sociopolitical methodologists focus on social con-
is categorized by Thomas A. Schwandt with her- text and the interplay between social context and
meneutics and social construction under the con- language. This approach is most often found in the
structivist paradigm. Jaber F. Gubrium and social and professional and applied sciences, where
James A. Holstein place phenomenology in the researchers using sociopolitical discourse analysis
same vein as naturalistic inquiry and ethnometh- often employ one of two specific frameworks: Fou-
odology. The strong influence of the German and cualdian discourse analysis and critical discourse
French philosophical traditions in psychology, analysis (CDA).
sociology, and linguistics has made this a com-
mon method in the social and applied and pro-
fessional sciences. Paradigmatically, discourse
analysis assumes that there are multiple con- Sociopolitical Discourse Analysis Frameworks
structed realities and that the goal of researchers
Foucauldian Discourse Analysis
working within this perspective is to understand
the interplay between language and social con- Michel Foucault is often identified as the key
text. Discourse analysis is hermeneutic and phe- figure in moving discourse analysis beyond lin-
nomenological in nature, emphasizing the guistics and into the social sciences. The works
lifeworld and meaning making through the use of Foucault emphasize the sociopolitical
of language. This method typically involves an approach to discourse analysis. Foucault empha-
analytical process of deconstructing and critiqu- sizes the role of discourse as power, which
ing language use and the social context of lan- shifted the way discourse is critically analyzed.
guage usage. Foucault initially identified the concept of arche-
ology as his methodology for analyzing dis-
course. Archeology is the investigation of
Two Major Approaches
unconsciously organized artifacts of ideas. It is
Discourse analysis can be divided into two major a challenge to the present-day conception of his-
approaches: language-in-use (or socially situated tory, which is a history of ideas. Archeology is
text and talk) and sociopolitical. The language- not interested in establishing a timeline or Hege-
in-use approach is concerned with the micro lian principles of history as progressive. One
dimensions of language, grammatical structures, who applies archeology is interested in dis-
and how these features interplay within a social courses, not as signs of a truth, but as the discur-
context. Language-in-use discourse analysis sive practices that construct objects of
focuses on the rules and conventions of talk and knowledge. Archeology identifies how discourses
text within a certain a context. This approach of knowledge objects, separated from a histori-
emphasizes various aspects of language within cal-linear progressive structure, are formed.
social context. Language-in-use methodologists Therefore, archeology becomes the method of
focus on language and the interplay between lan- investigation, contradictory to the history of
guage and social context. Language-in-use is ideas, used when looking at an object of knowl-
often found in the disciplines of linguistics and edge; archeology locates the artifacts that are
literature studies and is rarely used in social and associated with the discourses that form objects
human sciences. of knowledge. Archeology is the how of Fou-
The second major approach, sociopolitical, is cauldian discourse analysis of the formation of
the focus of the rest of this entry because it is most an object of knowledge. Archeology consists of
commonly used within the social and human three key elements: delimitation of authority
sciences. This approach is concerned with how (who gets to speak about the object of knowl-
language forms and influences the social context. edge?), surface of emergence (when does dis-
Sociopolitical discourse analysis focuses on the course about an object of knowledge begin?),
Discourse Analysis 369

and grids of specification (how the object of data collection, and genealogy is the critical
knowledge is described, defined, and labeled). analysis of the data. These two concepts are not
However, Foucault’s archeology then suggests fully distinguishable, and a genealogy as Fou-
a power struggle within the emergence of one or cault defines it cannot exist without the method
more discourses, via the identification of author- of archeology. Foucualt’s work is the foundation
ities of delimitation. Archeology’s target is to of much of the sociopolitical discourse analysis
deconstruct the history of ideas. The only way to used in contemporary social and applied and
fully deconstruct the history of an idea is to cri- professional sciences. Many discourse studies
tique these issues of power. Hence, the creation cite Foucault as a methodological influence or
of genealogy, which allows for this critique of use specific techniques or strategies employed by
power, with use of archeology, becomes the Foucault.
method of analysis for Foucault. Foucault had to
create a concept like genealogy, since archeol-
CDA
ogy’s implied power dynamic and hints of a cri-
tique of power are in a form of hidden power. CDA builds on the critique of power high-
The term genealogy refers to the power relations lighted by Foucault and takes it a step further.
rooted in the construction of a discourse. Gene- Teun A.van Dijk has suggested that the central
alogy focuses on the emergence of a discourse focus of CDA is the role of discourse in the
and identifies where power and politics surface (re)production and challenge of dominance.
in the discourse. Genealogy refers to the union CDA’s emphasis on the role of discourse in domi-
of erudite knowledge and local memories, which nance specifically refers to social power enacted
allows us to establish a historical knowledge of by elites and institutions’ social and political
struggles and to make use of this knowledge tac- inequality through discursive forms. The produc-
tically today. Genealogy focuses on local, discon- tion and (re)production of discursive formation
tinuous, disqualified, illegitimate knowledge of power may come in various forms of dis-
opposed to the assertions of the tyranny of total- course and power relations, both subtle and
izing discourses. Genealogy becomes the way we obvious. Therefore, critical discourse analysts
analyze the power that exists in the subjugated focus on social structures and discursive strate-
discourses that we find through the use of arche- gies that play a role in the (re)production of
ology. So genealogy is the exploration of the power. CDA’s critical perspective is influenced
power that develops the discourse, which con- not only by the work of Foucault but also by the
structs an object of knowledge. The three key philosophical traditions of critical theorists, spe-
elements of genealogy include subjugated dis- cifically Jurgen Habermas.
courses (whose voices were minimized or hidden Norman Fairclough has stated that discourse
in the formation of the object of knowledge?), is shaped and constrained by social structure and
local beliefs and understandings (how is the culture. Therefore he proposes three central
object of knowledge perceived in the social tenets of CDA: social structure (class, social sta-
context?), and conflict and power relations tus, age, ethnic identity, and gender); culture
(where are the discursive disruptions and the (accepted norms and behaviors of a society); and
enactments of power in the discourse?). Archeol- discourse (the words and language we use). Dis-
ogy suggests that there is a type of objectivity course (the words and language we use) shapes
that indicates a positivistic concept of neutrality our role and engagement with power within
to be maintained when analyzing data. While a social structure. CDA emphasizes when look-
genealogy has suggestions of subjectivity, local- ing at discourse three levels of analysis: the text,
isms, and critique, much like postmodernist or the discursive practice, and the sociocultural
critical theory, archeology focuses on how dis- practice.The text is a record of a communicated
courses form an object of knowledge. Genealogy event that reproduces social power. Discursive
becomes focused on why certain discourses are practices are ways of being in the world that sig-
dominant in constructing an object of knowl- nify accepted social roles and identities. Finally,
edge. Therefore, archeology is the method of the sociocultural comprises the distinct context
370 Discriminant Analysis

where discourse occurs. The CDA approach of discriminant analysis is to find optimal combi-
attempts to link text and talk with the underly- nations of predictor variables, called discriminant
ing power structures in society at a sociopolitical functions, to maximally separate previously
level through discursive practices. Text and talk defined groups and make the best possible predic-
are the description of communication that occurs tions about group membership. Discriminant anal-
within a social context that is loaded with power ysis has become a valuable tool in social sciences
dynamics and structured rules and practices of as discriminant functions provide a means to clas-
power enactment. When text is not critically ana- sify a case into the group that it mostly resembles
lyzed, oppressive discursive practices, such as mar- and help investigators understand the nature of
ginalization and oppression, are taken as accepted differences between groups. For example, a college
norms. Therefore, CDA is intended to shine a light admissions officer might be interested in predicting
on such oppressive discursive practices. Discourse whether an applicant, if admitted, is more likely to
always involves power, and the role of power in succeed (graduate from the college) or fail (drop
a social context is connected to the past and the out or fail) based on a set of predictor variables
current context, and can be interpreted differently such as high school grade point average, scores on
by different people due to various personal back- the Scholastic Aptitude Test, age, and so forth. A
grounds, knowledge, and power positions. There- sample of students whose college outcomes are
fore there is not one correct interpretation, but known can be used to create a discriminant func-
a range of appropriate and possible interpreta- tion by finding a linear combination of predictor
tions. The correct critique of power is not the vital variables that best separates Groups 1 (students
point of CDA, but the process of critique and its who succeed) and 2 (students who fail). This dis-
ability to raise consciousness about power in social criminant function can be used to predict the col-
context is the foundation of CDA. lege outcome of a new applicant whose actual
group membership is unknown. In addition, dis-
Bart Miles criminant functions can be used to study the
nature of group differences by examining which
Further Readings predictor variables best predict group membership.
Fairclough, N. (2000). Language and power (2nd ed.). For example, which variables are the most power-
New York: Longman. ful predictors of group membership? Or what pat-
Foucault, M. (1972). The archaeology of knowledge (A. tern of scores on the predictor variables best
M. Sheridan Smith, Trans.). London: Tavistock. describes the differences between groups? This
Schwandt, T. (2007). Judging interpretations. New entry discusses the data considerations involved in
Directions for Evaluation, 114, 11–15. discriminant analysis, the derivation and interpre-
Stevenson, C. (2004). Theoretical and methodological
tation of discriminant functions, and the process
approaches in discourse analysis. Nurse Researcher,
of classifying a case into a group.
12(2), 17–29.
Taylor, S. (2001). Locating and conducting discourse
analytic research. In M. Wetherell, S. Taylor, and S. J. Data Considerations of Discriminant Analysis
Yates (Eds.), Discourse as data: A guide for analysis
(pp. 5–48). London: Sage. First of all, the predictor variables used to create
van Dijk, T. A. (1999). Critical discourse analysis and discriminant functions must be measured at the
conversation analysis. Discourse & Society, 10(4), interval or ratio level of measurement. The shape
459–460. of the distribution of each predictor variable
should correspond to a univariate normal distribu-
tion. That is, the frequency distribution of each
predictor variable should be approximately bell
DISCRIMINANT ANALYSIS shaped. In addition, multivariate normality of pre-
dictor variables is assumed in testing the signifi-
Discriminant analysis is a multivariate statistical cance of discriminant functions and calculating
technique that can be used to predict group mem- probabilities of group membership. The assump-
bership from a set of predictor variables. The goal tion of multivariate normality is met when each
Discriminant Analysis 371

variable has a univariate normal distribution at Di ¼ di0  di1 X1  di2 X2      diK XK ,


any fixed values of all other variables. Although
the assumption of multivariate normality is com- where X1 ; X2 ; . . . ; XK are predictor variables 1
plicated, discriminant analysis is found to be rela- to K; Di is the ith discriminant function, di0 is
tively robust with respect to the failure to meet the a constant, and diK is the coefficient of predictor
assumption if the violation is not caused by out- variable k for discriminant function i. Discrimi-
liers. Discriminant analysis is very sensitive to the nant functions are like regression equations in
inclusion of outliers. Therefore, outliers must be the sense that a discriminant score for each case
removed or transformed before data are analyzed. is predicted by multiplying the score on each pre-
Another assumption of discriminant analysis is dictor variable (Xk ) by its associated coefficient
that no predictor variable may be expressed as a lin- (diK ), summing over all predictors, and adding
ear combination of other predictor variables. This a constant di0 . When there are only two groups
requirement intuitively makes sense because when in discriminant analysis, only one discriminant
a predictor variable can be represented by other function is needed to best separate groups. How-
variables, the variable does not add any new infor- ever, when there are more than two groups, the
mation and can be considered redundant. Mathe- maximum number of discriminant functions that
matically, such redundancy can lead to unreliable can be derived is equal to the number of groups
matrix inversions, which result in large standard minus one or the number of predictor variables,
errors of estimates. Therefore, redundant predictor whichever is fewer. For example, for a discrimi-
variables must be excluded from the analysis. nant analysis with three groups and four predic-
A further assumption made in discriminant tor variables, two discriminant functions can be
analysis is that the population variance–covariance derived. The weights or coefficients of predictor
matrices are equal across groups. This assumption variables for the first function can be derived so
is called homogeneity of variance–covariance that the group means on the function are as dif-
matrices. When sample sizes are large or equal ferent as possible. The weights used to combine
across groups, the significance test of discriminant predictor variables in the second function can be
function is usually robust with respect to the viola- determined also based on the criterion of pro-
tion of the homogeneity assumption. However, the ducing maximum possible difference among
classification is not so robust in that cases tend to group means but with the additional condition
be overclassified into groups with greater variabil- that the values of the second function are not
ity. When sample sizes are small and unequal, the correlated with values of the first function. Simi-
failure to meet the homogeneity assumption often larly, the third function also maximally differ-
causes misleading results of both significance tests entiates among groups but with the restriction of
and classifications. Therefore, prior to performing being uncorrelated with the first two functions,
discriminant analysis, the tenability of the assump- and so forth.
tion of homogeneity of variance–covariance matri- Although multiple discriminant functions are
ces must be tested. often identified, not all functions significantly
discriminate among groups. Therefore, statistical
tests are needed to determine the significance of
Deriving and Testing Discriminant Functions each discriminant function. There are several
statistics for testing the significance of each dis-
The major task in discriminant analysis is to find criminant function, such as Wilks’s lambda,
discriminant functions that maximally separate Roy’s largest root, Hotelling’s trace, and Pillai’s
groups and make the best possible predictions criterion. Wilks’s lambda is the ratio of within-
about group membership. The easiest and most groups variance to total variance and therefore
commonly used type of discriminant function is represents the percentage of variance in discrimi-
the linear function, in which predictor variables nant scores not explained by group membership.
are weighted and then summed to produce the Wilks’s lambda ranges from 0 to 1, with values
discriminant score. The equation of this type of closer to 1 indicating the function is less
discriminant function has the following form: discriminating. Because Wilks’s lambda can be
372 Discriminant Analysis

transformed to a statistic that has a chi-square might be misleading when one attempts to evalu-
distribution, its statistical significance can be ate the relative importance of predictor vari-
tested. A significant Wilks’s lambda indicates ables. This is because when standard deviations
that the group means calculated from the dis- are not the same across predictor variables, one
criminant analysis are significantly different and unit change in the value of a variable varies from
therefore the discriminant function works well one variable to another. Therefore, standardized
in discriminating among groups. discriminant coefficients are needed. Standard-
ized discriminant coefficients indicate the
relative importance of measured variables in cal-
Interpreting Discriminant Functions culating discriminant scores. Standardized dis-
criminant coefficients involve adjusting the
If a discriminant function is found to be signifi- unstandardized discriminant coefficients by the
cant, one might be interested in discovering how variance of the raw scores on each predictor var-
groups are separated along the discriminant func- iable. Standardized discriminant coefficients
tion and which predictor variables are most useful would be obtained if the original data were con-
in separating groups. To visually inspect how well verted to standard form, in which each variable
groups are spaced out along discriminant function, has a mean of zero and standard deviation of
individual discriminant scores and group centroids one, and then used to optimize the discriminant
can be plotted along the axes formed by discrimi- coefficients. However, the standardized discrimi-
nant functions. The mean of discriminant scores nant coefficients can be derived from unstan-
within a group is known as the group centroid. If dardized coefficients directly by the following
the centroids of two groups are well separated and formula:
there is no obvious overlap of the individual cases
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
along a discriminant function, then the discrimi-
SSk ;
nant function separates the two groups well. If sik ¼ dik
group centroids are close to each other and indi- Ng
vidual cases overlap a great deal, the discriminant
function fails to provide a clear separation of the where sik and dik are the standardized and
two groups. When there are only one or two sig- unstandardized coefficients for predictor variable
nificant discriminant functions, the location of k on discriminant function i, respectively, SSk is
group centroids and data cases can be easily plot- the sum of squares associated with variable k, N
ted. However, when there are more than two dis- is the total number of cases, and g is the number
criminant functions, it will be visually difficult to of groups. The predictor variable associated with
locate the group centroids and data cases. There- the largest standardized coefficient (in absolute
fore, only pairwise plots of discriminant functions value) contributes most to determining scores on
are used. However, the plot based on the first two the discriminant function and therefore plays the
discriminant functions is expected to be most most important role in separating groups. It
informative because these two functions are the should be noted that when predictor variables
most powerful discriminators of groups. are correlated, the associated discriminating
The relative importance of the contribution of coefficients might provide misleading results. For
each variable to the separation of the groups can example, consider two correlated variables that
be evaluated by examining the discriminant coef- have rather small contributions to the discrimi-
ficients. At this stage, however, the discriminant nant function. The two estimated standardized
function can be considered an unstandardized coefficients might be large but with opposite
equation in the sense that raw scores on signs, so that the effect of one variable is, to
predictor variables are used to produce discrimi- some degree, canceled by the opposite effect of
nant scores. Although the magnitudes of unstan- the other variable. However, this could be misin-
dardized coefficients indicate the absolute terpreted as both variables having relatively
contribution of a predictor variable in determin- large contributions to the discriminant function
ing the discriminant score, this information but in different directions.
Discriminant Analysis 373

A better guide to the meaning of the discrimi- used to assign group membership. That is, a case
nant function is to use structure coefficients. Struc- will be classified into the group to which it has
ture coefficients look at the correlations between the highest probability of belonging. In addition,
the discriminant function and each predictor vari- the probabilities of group membership serve as
able. The variable that correlates most highly with an indicator of the discriminating power of dis-
the discriminant function shares the greatest criminant functions. For example, discriminant
amount of variance with the discriminant function functions are considered to function well when
and, therefore, explains the discriminant function a case has a high probability of belonging to one
more. Structure coefficients can be directly derived group but low probabilities of belonging to other
by calculating the correlation coefficients between groups. In this way, it is clear that the case
each of the predictor variables and the discrimi- should be classified into the group of the highest
nant scores. It addresses the question, To which of probability. However, if probabilities for all
the K variables is the discriminant function most groups are very close, it might be meaningless to
closely related? When the absolute value of the classify the case into a specific group given that
coefficient is very large (close to 1), the discrimi- groups are actually not very distinct based on
nant function is carrying nearly the same informa- the discriminant functions.
tion as the predictor variable. In comparison, When predicted group membership is com-
when the coefficient is near zero, the discriminant pared with actual group membership in the sam-
function and the predictor variable share little var- ple from which the function was calculated, the
iance. The discriminant function can be named percentage of correct predictions, often called
after the predictor variables that have the highest the hit ratio, can be calculated. To evaluate the
correlations. performance of classification, the hit ratio
should not be compared with zero but rather
with the percentage that would have been cor-
Classifications
rectly classified by chance. If the groups have
In discriminant analysis, discriminant functions equal sample sizes, the expected percentage of
can be used to make predictions of the group to correct predictions by chance is equal to 1/K,
which a case most likely belongs. Classification where K is the total number of groups. For
of an individual case involves calculation of the instance, for a two-group analysis with equal
individual’s discriminant score and comparison sample sizes, one can expect a 50% of chance of
of it with each of the group centroids. To make making correct predictions of group membership
predictions of group membership, the distance by pure random guesses, and therefore the
from the individual’s discriminant scores to each expected hit ratio based on chance is .5. If the
of the group centroids is measured, and the cen- hit ratio yielded by discriminant functions is .6,
troid to which the individual’s scores are closest the improvement is actually rather small. When
is the group to which the individual is predicted groups are unequal in size, the percentage that
to belong. A distance measure commonly used in could be correctly classified by chance can be
discriminant analysis is Mahalanobis D2 , which estimated by multiplying the expected probabili-
calculates the squared distance from a specific ties of each group by the corresponding group
case to each of the group centroids. D2 can be size, summing for all groups, and dividing the
considered a measure that represents the degree sum by the total sample size. A z test for the dif-
to which a case’s profile on the predictor vari- ference between proportions can be performed
ables resembles the typical profile of a group. to statistically test the significance of the
Based on this interpretation, a case should be improvement in the classification accuracy from
classified into the group with the smallest D2 . the discriminant analysis.
Because D2 is a statistic with a chi-square distri- It should be noted that the hit ratio tends to
bution of p degrees of freedom, where p is the overestimate the classification accuracy of dis-
number of predictor variables, the probabilities criminant functions when the same sample is
that a case belongs to a group can be calculated. used to both derive the discriminant function
Similar to D2 , these probabilities can also be and test its predictive ability. To overcome this,
374 Discussion Section

the effectiveness of the classification procedure


can be tested on another sample that is indepen- DISCUSSION SECTION
dent of the one used to derive discriminant func-
tion. Testing the classification procedure on
a new sample is called cross-validation. There The purpose of a discussion section of a research
are several methods of cross-validation, includ- paper is to relate the results (results section) back
ing the use of a holdout sample, double cross- to the initial hypotheses of the study (introduction
validation, and the so-called jackknife proce- section). The discussion section provides an inter-
dure. In the holdout method of cross-validation, pretation of the results, presents conclusions, and
the sample is randomly split into two subsets. supports all the conclusions with evidence from
One is used to develop discriminant functions the study and generally accepted knowledge. The
and the other is used to test the accuracy of the discussion section should describe (a) what new
classification procedure. This is an unbiased knowledge has been gained in the study and
method of estimating the true misclassification (b) where research should go next.
rate. However, large sample sizes are required. Discussion of research findings must be based
The idea of double cross-validation is similar to on basic research concepts. When the researcher
the use of a holdout sample for cross-validation. explains a phenomenon, he or she must explain
In double cross-validation, the total sample is mechanisms. If the researcher’s results agree with
divided in half. Separate discriminant analyses the expectations, the researcher should describe
are performed on each sample, and the results the theory that the evidence supported. If the
are cross-validated on the other sample. In the researcher’s results differ from the expectations,
jackknife procedure, one observation at a time is then he or she should explain why that might have
eliminated from the sample, the discriminant happened. If the researcher cannot make a decision
analysis is performed with the remaining obser- with confidence, he or she should explain why that
vations, and then the obtained discriminant was the case, and how the study might be modified
functions are used to classify the eliminated in the future. Because one study will not answer
observation. This process is repeated until all an overall question, the researcher, keeping the big
observations have been eliminated once. Conse- picture in mind, addresses where research goes
quently, the classification rates can be deter- next.
mined using the cumulative results. Ideally, the scientific method discovers cause-
and-effect relationships between variables (concep-
Ying Cui tual objects whose value may vary). An indepen-
dent (exposure) variable is one that, when changed,
See also Canonical Correlation Analysis; Logistic
causes a change in another variable, the dependent
Regression; Multiple Regression; Multivariate
(outcome) variable. However, a change in a depen-
Analysis of Variance (MANOVA)
dent variable may be due wholly or in part to
a change in a third, confounding (extraneous) vari-
able. A confounding variable is anything other than
Further Readings
the independent variable of interest that may affect
Huberty, C. J. (1984). Issues in the use and interpretation the dependent variable. It must be predictive of the
of discriminant analysis. Psychological Bulletin, 95, outcome variable independent of its association
156–171. with the exposure variable of interest, but it cannot
Klecka, W. R. (1980). Discriminant analysis. Beverly be an intermediate in the causal chain of associa-
Hills, CA: Sage. tion between exposure and outcome. Confounding
Spicer, J. (2005). Making sense of multivariate data
variables can be dealt with through the choice of
analysis. Thousand Oaks, CA: Sage.
Tabachnick, B. G., & Fidell, L. S. (2007). Using
study design and/or data analysis.
multivariate statistics (5th ed.). Boston: Pearson Bias is the systematic deviation of results or
Education. inferences from the truth, or processes leading to
Timm, N. H. (2002). Applied multivariate analysis. New such deviation. Validity refers to the lack of bias, or
York: Springer. the credibility of study results and the degree to
Discussion Section 375

which the results can be applied to the general unbiasedness, unnecessary adjustment for non-
population of interest. Internal validity refers to the confounding variables always reduces the statis-
degree to which conclusions drawn from a study tical power of a study. Therefore, if both results
correctly describe what actually transpired during in a dual analysis are similar, then the unad-
the study. External validity refers to whether and to justed result is unbiased and should be reported
what extent the results of a study can be general- based on power considerations. If both results
ized to a larger population (the target population of are different, then the adjusted one should be
the study from which the sample was drawn, and reported based on validity considerations.
other populations across time and space). Below is a checklist of the items to be included
Threats to validity include selection bias in a discussion section:
(which occurs in the design stage of a study),
information bias (which occurs in the data col- 1. Overview: Provide a brief summary of the most
important parts of the introduction section and
lection stage of a study), and confounding bias
then the results section.
(which occurs in the data analysis stage of
a study). Selection bias occurs when during the 2. Interpretation: Relate the results back to the
selection step of the study, the participants in the initial study hypotheses. Do they support or fail
groups to be compared are not comparable to support the study hypotheses? It is also
because they differ in extraneous variables other important to discuss how the results relate to
the literature cited in the introduction.
than the independent variable under study. In
Comment on the importance and relevance of
this case, it would be difficult for the researcher
the findings and how the findings are related to
to determine whether the discrepancy in the the big picture.
groups is due to the independent variable or to
the other variables. Selection bias affects internal 3. Strengths and limitations: Discuss the strengths
validity. Selection bias also occurs when the and limitations of the study.
characteristics of subjects selected for a study are 4. Recommendations: Provide recommendations
systematically different from those of the target on the practical use of current study findings
population. This bias affects external validity. and suggestions for future research.
Selection bias may be reduced when group
assignment is randomized (in experiments) or The following are some tips for researchers to
selection processes are controlled for (in obser- follow in writing the discussion section: (a) Results
vational studies). Information bias occurs when do not prove hypotheses right or wrong. They sup-
the estimated effect is distorted either by an port them or fail to provide support for them.
error in measurement or by misclassifying the (b) In the case of a correlation study, causal lan-
participant for independent (exposure) and/or guage should not be used to discuss the results. (c)
dependent (outcome) variables. In experiments, Space is valuable in scientific journals, so being
information bias may be reduced by improving concise is imperative. Some journals ask authors to
the accuracy of measuring instruments and by restrict discussion to four pages or less, double
training technicians. In observational studies, spaced, typed. That works out to approximately
information bias may be reduced by pretesting one printed page. (d) When referring to informa-
questionnaires and training interviewers. Con- tion, data generated by the researcher’s own study
founding bias occurs when statistical controlling should be distinguished from published informa-
techniques (stratification or mathematical mod- tion. Verb tense is an important tool for doing
eling) are not used to adjust for the effects of that—past tense can be used to refer to work done;
confounding variables. Therefore, a distorted present tense can be used to refer to generally
estimate of the exposure effect results because accepted facts and principles.
the exposure effect is mixed with the effects of The discussion section is important because it
extraneous variables. Confounding bias may be interprets the key results of a researcher’s study in
reduced by performing a ‘‘dual’’ analysis (with light of the research hypotheses under study and
and without adjusting for extraneous variables). the published literature. It should provide a good
Although adjusting for confounders ensures indication of what the new findings from the
376 Dissertation

researcher’s study are and where research should As a means of maintaining high standards,
go next. many universities administer their doctoral pro-
gram through a graduate school with its own
Bernard Choi and Anita Pak dean. Additionally, some universities designate cer-
tain faculty, those who have proven they are
See also Bias; Methods Section; Results Section; Validity
researchers, as graduate faculty who participate in
of Research Conclusions
setting advanced degree policies and serve as major
professors and chairs. This dual faculty status has
Further Readings disappeared at most universities, however.
The dissertation process typically moves
Branson, R. D. (2004). Anatomy of a research paper.
through three stages: the proposal stage; the acti-
Respiratory Care, 49, 1222–1228.
Choi, P. T. (2005). Statistics for the reader: What to ask
vation stage, in which the research, thinking, or
before believing the results. Canadian Journal of producing work is accomplished; and the final
Anesthesia, 52, R1-R5. stage of presentation and approval. Though distin-
Hulley, S. B., Newman, T. B., & Cummings, S. R. guishable for explanatory purposes, these stages
(1988). The anatomy and physiology of research. are often blurred in practice. This is particularly
In S. B. Hulley & S. R. Cummings (Eds.), Designing evident when the area in which one intends to
clinical research (pp. 1–11). Baltimore: Williams & work is known, but not the specific aspect. For
Wilkins. example, the proposal and activation stage often
merge until the project outlines become clear.
In most cases one faculty member from the
department serves as the major professor or com-
DISSERTATION mittee chair (henceforth referred to as the chair).
This is usually at the invitation of the student,
As a requirement for an advanced university although some departments assign chairs in order
degree, the dissertation is usually the last to equitably balance faculty load. Additional fac-
requirement a candidate fulfills for a doctorate. ulty are recruited by the student to serve as readers
Probably its most salient characteristic is that it or committee members, often at the suggestion of
is a unique product, one that embodies in some the chair. Dissertation chairpersons and committee
way the creativity of the author—the result of members are chosen for their experience in the
research and of original thinking and the crea- candidate’s topic of interest and/or for some spe-
tion of a physical product. Depending on depart- cial qualifications, such as experience with the
mental tradition, some dissertations are expected research method or knowledge of statistics or
to be solely originated by the candidate; in experimental design.
others, the topic (and sometimes the approach as
well) is given by the major professor. But even in
The Proposal Stage
the latter case, the candidates are expected to
add something of their own originality to the Depending on the department’s tradition, the disser-
end result. tation may or may not be a collaborative affair with
This description of some relatively common fea- the faculty. Regardless, the dissertation, beginning
tures of the dissertation requirement applies pri- with the formulation of the problem in the pro-
marily to higher education in the United States. posal, is often a one-on-one, give-and-take relation
That there are common features owes much to between the candidate and the committee chair. In
communication among universities, no doubt How to Prepare a Dissertation Proposal, David R.
through such agencies as the Council of Graduate Krathwohl and Nick L. Smith described a disserta-
Schools and the American Association of Universi- tion proposal as a logical plan of work to learn
ties. But the requirement’s evolution at the local something of real or potential significance about an
level has resulted in considerable variation across area of interest. Its opening problem statement
the differing cultures of universities and even the draws the reader into the plan: showing its signifi-
departments within them. cance, describing how it builds on previous work
Dissertation 377

(both substantively and/or methodologically), and Faculty members who assume the role of disser-
outlining the investigation. The whole plan of tation chair take on a substantial commitment of
action flows from the problem statement: the activi- time, energy, and in some instances resources.
ties described in the design section, their sequence Agreeing to be the student’s chair usually involves
often illuminated graphically in the work plan (and, a commitment to help where able, such as in
if one is included, by the time schedule), and their procuring laboratories, equipment, participants,
feasibility shown by the availability of resources. access to research sites, and funding. Thus, nearly
Krathwohl and Smith point out that a well-written all faculty set limits on how many doctoral candi-
proposal’s enthusiasm should carry the reader along dates they will carry at any one time.
and reassure the reader with its technical and schol- In cases in which students take on problems
arly competence. A solid proposal provides the that are outside the interests of any departmental
reader with such a model of the clarity of thought faculty, students may experience difficulty in find-
and writing to be expected in the final write-up that ing a faculty member to work with them because
the reader feels this is an opportunity to support of the substantial additional time commitment and
research that should not be missed. the burden of gaining competence in another area.
While at first it may appear that this definition If no one accepts them, the students may choose to
suggests that the proposal should be written like change topic or, in some instances, transfer to
an advertisement, that is not what it is intended to another university.
convey. It simply recognizes the fact that if stu- Few view the proposal as a binding contract
dents cannot be enthusiastic about their idea, it is that if fulfilled, will automatically lead to
a lot to expect others to be. Material can be writ- a degree. Nevertheless, there is a sense that
ten in an interesting way and still present the idea the proposal serves as a good faith agreement
with integrity. It doesn’t have to be boring to be whereby if the full committee approves the pro-
good. posal and the student does what is proposed
Second, the definition points out that the pro- with sufficient quality (whatever that standard
posal is an integrated chain of reasoning that means in the local context), then the student has
makes strong logical connections between the fulfilled his or her part of the bargain, and the
problem statement and the coherent plan of action faculty members will fulfill theirs. Clearly, as an
the student has proposed undertaking. adjunct to its serving as a contract, the proposal
Third, this process means that students use this also serves as an evaluative criterion for ‘‘fulfill-
opportunity to present their ideas and proposed ing his or her part.’’
actions for consideration in a shared decision mak- In those institutions in which there is a formal
ing situation. With all the integrity at their com- admission to candidacy status, the proposal tends
mand, they help their chair or doctoral committee to carry with it more of a faculty commitment:
see how they view the situation, how the idea fills The faculty have deemed the student of sufficient
a need, how it builds on what has been done merit to make the student a candidate; therefore
before, how it will proceed, how pitfalls will be the faculty must do what they can to help the stu-
avoided, why pitfalls not avoided are not a serious dent successfully complete the degree.
threat, what the consequences are likely to be, and Finally, the proposal often becomes part of the
what significance they are likely to have. dissertation itself. The format for many disserta-
Fourth, while the students’ ideas and action tions is typically five chapters:
plans are subject to consideration, so also is their
capability to successfully carry them through. 1. Statement of the problem, why it is of some
Such a proposal definition gives the student importance, and what one hopes to be able to
a goal, but proposals serve many purposes besides show
providing an argument for conducting the study 2. A review of the past research and thinking on
and evidence of the student’s ability. Proposals also the problem, how it relates to what the student
serve as a request for faculty commitment, as intends to do, and how this project builds on it
a contract, as an evaluative criterion, and as a par- and possibly goes beyond it—substantively and
tial dissertation draft. methodologically
378 Dissertation

3. The plan of action (what, why, when, how, matter, but sciences have the highest completion
where, and who) rates after 10 years of candidacy—70% to 80%—
4. What was found, the data and its processing and English and the humanities the lowest—30%.
The likelihood of ABD increases when a student
5. Interpretation of the data in relation to the leaves campus before finishing. Reasons vary, but
problem proposed.
finances are a major factor. Michael T. Nettles and
Catherine M. Millett’s Rate of Progress scale
Many departments require students to prepare the allows students to compare their progress with
proposal as the first three chapters to be used in that of peers in the same field of study. The PhD
the dissertation with appropriate modification. Completion Project of the Council of Graduate
It is often easier to prepare a proposal when the Schools aims to find ways of decreasing ABD
work to be done can be preplanned. Many propo- levels.
sals, however, especially in the humanities and
more qualitatively oriented parts of the social
sciences, are for emergent studies. The focus of The Final Stage
work emerges as the student works with a given Many dissertations have a natural stopping
phenomenon. Without a specific plan of work, the point: the proposed experiment concludes, one is
student describes the study’s purpose, the approach, no longer learning anything new, reasonably
the boundaries of the persons and situations as well available sources of new data are exhausted. The
as rules for inclusion or exclusion, and expected study is closed with data analysis and interpreta-
findings. Since reality may be different, rules for tion in relation to what one proposed. But if one
how much deviation requires further approval are is building a theoretical model, developing or
appropriate. critiquing a point of view, describing a situation,
Practice varies, but all institutions require pro- or developing a physical product, when has one
posal approval by the chair, if not the whole com- done enough? Presumably when the model is
mittee. Some institutions require very formal adequately described, the point of view appro-
approval, even an oral examination on the pro- priately presented, or the product works on
posal; others are much looser. some level. But ‘‘adequate,’’ ‘‘appropriate,’’ and
‘‘some level’’ describe judgments that must be
made by one’s chair and committee, and their
The Activation Phase
judgment may differ from that of the student.
This phase also varies widely in how actively the The time to face the decision of how much is
chair and committee monitor or work with the enough is as soon as the research problem is suf-
student. Particularly where faculty members are ficiently well described that criteria that the
responsible for a funded project supporting a dis- chair and committee deem reasonable can be
sertation, monitoring is an expected function. But ascribed to it—a specified period of observations
for many, just how far faculty members are or number of persons to be queried, certain
expected or desired to be involved in what is sup- books to be digested and brought to bear, and so
posed to be the student’s own work is a fine line. forth. While not a guaranteed fix because minds
Too much and it becomes the professor’s study change as the problem emerges, the salience of
rather than the student’s. Most faculties let the stu- closure conditions is always greater once the
dents set the pace and are available when called on issue is raised.
for help. Dissertations are expected to conform to the
Completion appears to be a problem in all standard of writing and the appropriate style guide
fields; it is commonly called the all-but-dissertation for their discipline. In the social sciences this guide
(ABD) problem. Successful completion of 13 to 14 is usually either American Psychological Associa-
years of education and selection into a doctoral tion or Modern Language Association style.
program designates these students as exceptional, Once the chair is satisfied (often also the com-
so for candidates to fail the final hurdle is a waste mittee), the final step is, in most cases, an oral
of talent. Estimates vary and differ by subject examination. Such examinations vary in their
Distribution 379

formality, who chairs (often chosen by the grad- Websites


uate school), and how large the examining body.
Electronic Thesis/Dissertation Open Access Initiative
The examiners usually include the committee, (OAI) Union Catalog of the Networked Digital
other faculty from the department, and at least Library of Theses and Dissertation (NDLTD):
one faculty member from other departments http://www.ndltd.org/find
(called outsiders), schools, or colleges and some- ProQuest UMI Dissertation Publishing:
times other universities (also often chosen by the http://www.proquest.com/products rumi/dissertations
graduate school). Other students may silently
observe.
The chair (usually an outsider but in some uni-
versities the committee chair) usually begins with DISTRIBUTION
an executive session to set the procedure (often
traditional but sometimes varied). Then the candi- A distribution refers to the way in which research-
date and any visitors are brought back in and the ers organize sets of scores for interpretation. The
candidate presents, explains, and, if necessary, term also refers to the underlying probabilities
defends his or her choices. The chair rotates the associated with each possible score in a real or the-
questioning, usually beginning with the outsiders oretical population. Generally, researchers plot sets
and ending with committee members. When ques- of scores using a curve that allows for a visual rep-
tioning is finished (often after about 2 hours), the resentation. Displaying data in this way allows
examiners resume executive session to decide dis- researchers to study trends among scores. There
sertation acceptance: as is, with minor modifica- are three common approaches to graphing distri-
tions, with major modifications, or rejection. butions: histograms, frequency polygons, and
Rejection is rare; minor modifications are the ogives. Researchers begin with a set of raw data
norm. and ask questions that allow them to characterize
the data by a specific distribution. There are also
a variety of statistical distributions, each with its
Accessibility of Dissertations
unique set of properties.
Most universities require that students provide Distributions are most often characterized by
a copy of their dissertation in suitable format for whether the data are discrete or continuous in
public availability. A copy (print or digital, nature. Dichotomous or discrete distributions
depending on the library’s preference) is given to are most commonly used with nominal and ordi-
the university’s library. Copies are usually also nal data. A commonly discussed discrete distri-
required to be filed with ProQuest UMI Disserta- bution is known as a binomial distribution.
tion Publishing and/or the Electronic Thesis/Disser- Most textbooks use the example of a coin toss
tation Open Access Initiative (OAI) Union Catalog when discussing the probability of an event
of the Networked Digital Library of Theses and occurring with two discrete outcomes. With con-
Dissertation (NDLTD). tinuous data, researchers most often examine the
degree to which the scores approximate a normal
David R. Krathwohl distribution. Because distributions of continuous
variables can be normal or nonnormal in nature,
See also American Psychological Association Style;
researchers commonly explain continuous distri-
Proposal
butions in one of four basic ways: average value,
variability, skewness, and kurtosis. The normal
Further Readings distribution is often the reference point for
examining a distribution of continuously scored
Krathwohl, D. R., & Smith, N. L. (2005). How to
prepare a dissertation proposal. Syracuse, NY:
variables. Many continuous variables have dis-
Syracuse University Press. tributions that are bell shaped and are said to
Nettles, M. T., & Millett, C. M. (2006). Three magic approximate a normal distribution. The theoreti-
letters: Getting to Ph.D. Baltimore: Johns Hopkins cal curve, called the bell curve, can be used to
University Press. study many variables that are not normally
380 Distribution

distributed but are approximately normal. all values within the distribution. The area
According to the central limit theorem, as the between the curve and the horizontal axis is some-
sample size increases, the shape of the distribu- times referred to as the area under the curve. Gen-
tion of the sample means taken will approach erally speaking, distributions that are based on
a normal distribution. continuous data tend to cluster around an average
score, or measure of central tendency. The measure
of central tendency that is most often reported in
Discrete Distributions
research is known as the mean. Other measures of
The most commonly discussed discrete probability central tendency include the median and the mode.
distribution is the binominal distribution. The Bimodal distributions occur when there are two
binomial distribution is concerned with scores that modes or two values that occur most often in the
are dichotomous in nature, that is, there can be distribution of scores. The median is often used
only one of two possible outcomes. The Bernoulli when there are extreme values at either end of the
trial (named after mathematician Jakob Bernoulli) distribution. When the mean, median, and mode
is a good example that is often used when teaching are the same value, the curve tends to be bell
students about a binomial distribution of scores. shaped, or normal, in nature. This is one of the
The most often discussed Bernoulli trial is that of unique features of what is known as the normal
flipping a coin, in which the outcome will be either distribution. In such cases, the curve is said to be
heads or tails. The process allows for estimating symmetrical about the mean, which means that
the probability that an event will occur. Binomial the shape is the same on both sides. In other
distributions can also be used when one wants to words, if one drew a perpendicular line through
determine the probability associated with correct the mean score, each side of the curve would be
or incorrect responses. In this case, an example a perfect reflection the other. Skewness is the term
might be a 10-item test that is scored dichoto- used to measure a lack of symmetry in a distribu-
mously (correct/incorrect). A binomial distribution tion. Skewness occurs when one tail of the distri-
allows us to calculate the probability of scoring 5 bution is longer than the other. Distributions can
out of 10, 6 out of 10, 7 out of 10, and so on, cor- be positively or negatively skewed depending on
rect. Because the calculation of binomial probabil- which tail is longer. In addition, distributions can
ity distributions can become somewhat tedious, differ in the amount of variability. Variability
binomial distribution tables often accompany explains the dispersion of scores around the mean.
many statistics textbooks so that researchers can Distributions with considerable dispersion around
quickly access information regarding such esti- the mean tend to be flat when compared to the
mates. It should be noted that binomial distribu- normal curve. Distributions that are tightly dis-
tions are most often used in nonparametric persed around the mean tend to be peaked in
procedures. Chi-square distributions are another nature when compared to the normal curve with
form of a discrete distribution that is often used the majority of scores falling very close to the
when one wants to report whether an expected mean. In cases in which the distribution of scores
outcome occurred due to chance alone. appears to be flat, the curve is said to be platykur-
tic, and distributions that are peaked compared
with the normal curve are said to be leptokurtic in
Continuous Distributions
nature. The flatness or peakedness of a distribu-
Continuous variables can be any value or interval tion is a measure of kurtosis, which, along with
associated with a number line. In theory, a continu- variability and skewness, helps explain the shape
ous variable can assume an infinite number of pos- of a distribution of scores. Each normally dis-
sible values with no gaps among the intervals. This tributed variable will have its own measure of
is sometimes referred to as a ‘‘smooth’’ process. To central tendency, variability, degree of skewness,
graph a continuous probability distribution, one and kurtosis. Given this fact, the shape and loca-
draws a horizontal axis that represents the values tion of the curves will vary for many normally
associated with the continuous variable. Above the distributed variables. To avoid needing to have
horizontal axis is drawn a curve that encompasses a table of areas under the curve for each
Disturbance Terms 381

normally distributed variable, statisticians have another observed variable (say, x). To answer the
simplified things through the use of the standard question, researchers may construct the model in
normal distribution based on a z-score metric which y depends on x. Although y is not neces-
with a mean of zero and a standard deviation of sarily explained only by x, a discrepancy always
1. By standardizing scores, we can estimate the exists between the observed value of y and the
probability that a score will fall within a certain predicted value of y obtained from the model.
region under the normal curve. Parametric test The discrepancy is taken as a disturbance term
statistics are typically applied to data that or an error term.
approximate a normal distribution, and t distri- Suppose that n sets of data, ðx1 , y1 Þ, ðx2 , y2 Þ;
butions and F distributions are often used. As . . . , ðxn , yn Þ, are observed, where yi is a scalar and
with the binomial distribution and chi-square xi is a vector (say, 1 × k vector). We assume that
distribution, tables for the t distribution and F there is a relationship between x and y, which is
distribution are typically found in most intro- represented as the model y ¼ f ðxÞ, where f ðxÞ is
ductory statistics textbooks. These distributions a function of x. We say that y is explained by x, or
are used to examine the variance associated with y is regressed on x. Thus y is called the dependent
two or more sets of sample means. Because the or explained variable, and x is a vector of the inde-
sampling distribution of scores may vary based pendent or explanatory variables. Suppose that
on the sample size, the calculation of both the t a vector of the unknown parameter (say β, which
and F distributions includes something called is a k × 1 vector) is included in f ðxÞ. Using the n
degrees of freedom, which is an estimate of the sets of data, we consider estimating β in f ðxÞ. If
sample size for the groups under examination. we add a disturbance term (say u, which is also
When a distribution is said to be nonnormal, the called an error term), we can express the relation-
use of nonparametric or distribution-free statis- ship between y and x as y ¼ f ðxÞ þ u. The distur-
tics is recommended. bance term u indicates the term that cannot be
explained by x. Usually, x is assumed to be nonsto-
Vicki Schmitt chastic. Note that x is said to be nonstochastic
when it takes a fixed value. Thus f ðxÞ is determin-
See also Bernoulli Distribution; Central Limit Theorem;
istic, while u is stochastic. The researcher must
Frequency Distribution; Kurtosis; Nonparametric
specify f ðxÞ. Representatively, it is often specified
Statistics; Normal Distribution; Parametric Statistics
as the linear function f ðxÞ ¼ xβ.
The reasons a disturbance term u is necessary
Further Readings are as follows: (a) There are some unpredictable
elements of randomness in human responses,
Bluman, A. G. (2009). Elementary statistics: A step by
(b) an effect of a large number of omitted variables
step approach (7th ed.). Boston: McGraw-Hill.
Howell, D. C. (2010). Statistical methods for psychology
is contained in x, (c) there is a measurement error
(7th ed.). Belmont, CA: Wadsworth Cengage in y, or (d) a functional form of f ðxÞ is not known
Learning. in general. Corresponding examples are as follows:
Larose, D. T. (2010). Discovering statistics. New York: (a) Gross domestic product data are observed as
Freeman. a result of human behavior, which is usually
Salkind, N. J. (2008). Statistics for people who (think unpredictable and is thought of as a source of ran-
they) hate statistics (3rd ed.). Thousand Oaks, CA: domness. (b) We cannot know all the explanatory
Sage. variables that depend on y. Most of the variables
are omitted, and only the important variables
needed for analysis are included in x. The influence
of the omitted variables is thought of as a source
DISTURBANCE TERMS of u. (c) Some kinds of errors are included in
almost all the data, either because of data collec-
In the field of research design, researchers often tion difficulties or because the explained variable is
want to know whether there is a relationship inherently unmeasurable, and a proxy variable has
between an observed variable (say, y) and to be used in their stead. (d) Conventionally we
382 Disturbance Terms

specify f ðxÞ as f ðxÞ ¼ xβ. However, there is no Violation of the Assumption


reason to specify the linear function. Exceptionally, Vðui Þ ¼ σ 2 For All i
we have the case in which the functional form of
f ðxÞ comes from the underlying theoretical aspect. When the assumption on variance of ui is changed
Even in this case, however, f ðxÞ is derived from to Vðui Þ ¼ σ 2i , that is, a heteroscedastic distur-
a very limited theoretical aspect, not every theoret- bance term, the OLS estimator β^ is no longer
ical aspect. BLUE. The variance of β^ is given by
For simplicity hereafter, consider the linear X X
VðβÞ^ ¼ ð n xi 0 xi Þ1 ð n σ 2 xi 0 xi Þ
regression model yi ¼ xi β þ ui , i ¼ 1,2; . . . ; n. i¼1 i¼1 i
X n
When u1 , u2 ; . . . ; un are assumed to be mutually 0 1
ð i¼1 xi xi Þ :
independent and identically distributed with
mean zero and variance σ 2 , the sum of squared
Let
Pn b be a2 solution of minimization of
residuals, 2
i¼1 ðyi  xi βÞ =σ i with respect to β. Then
Xn
ðyi  xi βÞ2 , Xn Xn
i¼1 b ¼ ð i¼1 xi 0 xi =σ 2i Þ1 i¼1
xi 0 yi =σ 2i
is minimized with respect to β. Then, the estimator
^ is and
of β (say, β)
Xn
Xn 1 Xn beNðβ; ð i¼1
xi 0 xi =σ 2i Þ1 Þ
β^ ¼ i¼1
0
xi xi i¼1
0
xi yi ,
are derived under the normality assumption for ui.
which is called the ordinary least squares (OLS) We have the result that β^ is not BLUE because of
estimator. β^ is known as the best linear unbiased VðbÞ ≤ VðβÞ.^ The equality holds only when
estimator (BLUE). It is distributed as σ i ¼ σ for all i. For estimation, σ 2i has to be spec-
2 2

 Xn  ified, such as σ i ¼ jzi γj, where zi represents a vector


1
N β; σ 2 i¼1
x 0
i xi Þ of the other exogenous variables.

under the normality assumption on ui , because β^


is rewritten as Violation of the Assumption
Xn 1 Xn Covðui ; uj Þ ¼ 0 For All i 6¼ j
β^ ¼ β þ xi 0 xi xi 0 ui :
i¼1 i¼1 The correlation between ui and uj is called the spa-
tial correlation in the case of cross-sectional data
Note
pffiffiffi ^ from the central limit theorem that and the autocorrelation or serial correlation in
nðβ  βÞ is asymptotically normally distributed time-series data. Let ρij be the correlation coeffi-
with mean zero and variance σ 2 M1 xx even when cient between ui and uj , where ρij ¼ 1 for all i ¼ j
the disturbance term ui is notPnormal, in which and ρij ¼ ρji for all i 6¼ j. That is, we have
case we have to assume ð1=nÞ ni¼1 xi 0 xi → Mxx as Covðui ; uj Þ ¼ σ 2 ρij . The matrix that the ði; jÞth ele-
n goes to infinity; that is, n → ∞ (a → b indicates ment is ρij should be positive definite. In this situa-
that a approaches b). tion, the variance of β^ is:
For the disturbance term ui , we have made the
X X Xn
following three assumptions: ^ ¼ σ 2 ð n xi 0 xi Þ1 ð n
VðβÞ ρ xi 0 xj Þ
i¼1 i¼1 j¼1 ij

1. Vðui Þ ¼ σ 2 for all i, Xn


×ð i¼1
xi 0 xi Þ1 :
2. Covðui ; uj Þ ¼ 0 for all i 6¼ j,
3. Covðui ; xj Þ ¼ 0 for all i and j. b bePa solution of the minimization problem
Let P
n n ij 0
of i¼1 j¼1 ρ ðyi  xi βÞ ðyj  xj βÞ with respect
Now we examine β^ in the case in which each to β, where ρij denotes the ði; jÞth element of the
assumption is violated. inverse matrix of the matrix that the ði; jÞth
Doctrine of Chances, The 383

element is ρij . Then, under the normality assump- and


tion on ui , we obtain
Xn
Xn Xn
ij 0 1
Xn Xn ð1=nÞ xi 0 zi → Mxz ¼ M0zx
b ¼ ð i¼1 j¼1
ρ x i xj Þ ð
i¼1 j¼1
ρij xi 0 yj Þ i¼1

and as n → ∞. As an example of zi , we may choose


zi ¼ x
^i , where x
^i indicates the predicted value of xi
Xn Xn when xi is regressed on the other exogenous vari-
b eNðβ; σ 2 ð i¼1 ρij xi 0 xj Þ1 Þ:
j¼1 ables associated with xi , using OLS.
It can be verified that we obtain the following: Hisashi Tanizaki
Vðb Þ ≤ VðβÞ. ^ The equality holds only when
ρij ¼ 1 for all i ¼ j and ρij ¼ 0 for all i 6¼ j. For See also Autocorrelation; Central Limit Theorem; Serial
estimation, we need to specify ρij. For an example, Correlation; Unbiased Estimator
we may take the following specification:
ρij ¼ ρjijj , which corresponds to the first-order Further Readings
autocorrelation case (i.e., ui ¼ ρui1 þ εi , where εi
is the independently distributed error term) in Kennedy, P. (2008). A guide to econometrics (6th ed.).
time-series data. For another example, in the spa- Malden, MA: Wiley-Blackwell.
Maddala, G. S. (2001). Introduction to econometrics
tial correlation model we may take the form
(3rd ed.). New York: Wiley.
ρij ¼ 1 when i is in the neighborhood of j and
ρij ¼ 0 otherwise.

Violation of the Assumption DOCTRINE OF CHANCES, THE


Covðui ; xj Þ ¼ 0 For All i and j
The Doctrine of Chances, by Abraham de Moivre,
If ui is correlated with xj for some i and j, it is
is frequently considered the first textbook on prob-
known that β^ is not an unbiased estimator P of β,
^ 6¼ β, because of Eðð n xi 0 xi Þ1 x ability theory. Its subject matter is suggested by the
that
Pn is, Eð βÞ i¼1
0 book’s subtitle, namely, A Method of Calculating
i¼1 xi ui Þ 6¼ 0. In order to obtain a consistent the Probabilities of Events in Play. Here ‘‘play’’ sig-
estimator Pn of0 β, we need the condition nifies games involving dice, playing cards, lottery
ð1=nÞ i¼1 xi ui →P 0 as n → ∞. However, we have
draws, and so forth, and the ‘‘events’’ are specific
the fact that ð1=nÞ ni¼1 xi 0 ui → = 0 as n → ∞ in the
outcomes, such as throwing exactly one ace in four
case of Covðui ; xj Þ 6¼ 0. Therefore, β^ is not a consis-
throws.
tent estimator of β, that is, β^ → = β as n → ∞. To
De Moivre was a French Protestant who escaped
improve this inconsistency problem, we use the
religious persecution by emigrating to London.
instrumental variable P (say, zi ), which satisfiesPthe There he associated with some of the leading
properties ð1=nÞ ni¼1 zi 0 ui → 0 and ð1=nÞ ni¼1
English scientists of the day, including Edmund
zi 0 xi →
= 0 as n → ∞. Then, it is known that
P P Halley and Isaac Newton (to whom The Doctrine
bIV ¼ ð ni¼1 zi 0 xi Þ1 ni¼1 zi 0 yi is a consistent esti-
of Chances was dedicated). At age 30 de Moivre
mator of β, that is, bIV → β as n → ∞. Therefore,
was elected to the Royal Society and much later
bIV is called the instrumental pffiffiffi variable estimator. It was similarly honored by scientific academies in
can be also shown that nðbIV  βÞ is asymptoti-
both France and Prussia. Yet because he never suc-
cally normally distributed with mean zero and var-
ceeded in procuring an academic position, he was
iance σ 2 M1 1
zx Mzz Mxz , where obliged to earn a precarious livelihood as a private
Xn tutor, teacher, and consultant and died in severe
ð1=nÞ zi 0 xi → Mzx ,
i¼1 poverty. These adverse circumstances seriously con-
strained the amount of time he could devote to
Xn original research. Even so, de Moivre not only
ð1=nÞ i¼1
zi 0 zi → Mzz , made substantial contributions to probability
384 Doctrine of Chances, The

theory but also helped found analytical trigonome- Contributions


try, discovering a famous theorem that bears his
name. To be more precise, The Doctrine of Chances
To put The Doctrine of Chances in context, and was actually published in four versions dispersed
before discussing its contributions and aftermath, over most of de Moivre’s adult life. The first ver-
it is first necessary to provide some historical sion was a 52-page memoir that he had pub-
background. lished in Latin in the Philosophical Transactions
of the Royal Society. In 1711 this contribution
was entitled ‘‘De Mensura Sortis’’ (On the Mea-
surement of Lots). The primary influence was
Huygens, but not long afterwards the author
Antecedents
encountered the work of Montmort and Ber-
De Moivre was not unusual in concentrating the noulli. Montmort’s work left an especially strong
bulk of his book on games of chance. This imprint when de Moivre expanded the article
emphasis was apparent from the very first work into the first edition of The Doctrine of Chances,
on probability theory. The mathematician Gero- which was published in 1718. In fact, the influ-
lamo Cardano, who was also a professional ence was so great that Montmort and de Moivre
gambler, wrote Liber de ludo aleae (Book on entered into a priority dispute that, fortunately,
Games of Chance), in which he discussed the was amicably resolved (unlike the 1710 dispute
computation of probabilities. However, because between Newton and Leibniz over the invention
Cardano’s work was not published until 1663, of the calculus that de Moivre helped arbitrate
the beginning of probability theory is tradition- in Newton’s favor).
ally assigned to 1654. In that year Blaise Pascal Sometime after the first edition, de Moivre
and Pierre de Fermat began a correspondence on began to pursue Bernoulli’s work on approximat-
gaming problems. This letter exchange led Pascal ing the terms of the binomial expansion. By 1733
to write Traité du triangle arithmétique (Treatise de Moivre had derived what in modern terms is
on the Arithmetical Triangle), in which he called the normal approximation to the binomial
arranged the binomial coefficients into a triangle distribution. Although originally published in
and then used them to solve certain problems in a brief Latin note, the derivation was translated
games of chance. De Moivre was evidently into English for inclusion in the second edition of
among the first to refer to this geometric configu- The Doctrine of Chances in 1738. This section
ration as Pascal’s triangle (even though Pascal was then expanded for the book’s third edition,
did not really introduce the schema). which was published posthumously in 1756. This
In 1657 Christian Huygens published Libellus last edition also includes de Moivre’s separate con-
de ratiociniis in ludo aleae (The Value of All tributions to the theory of annuities, plus a lengthy
Chances in Games of Fortune), an extended dis- appendix. Hence, because the last version is by far
cussion of certain issues raised by Pascal and Fer- the most inclusive in coverage, it is this publication
mat. Within a half century, Huygens’s work was on which his reputation mostly rests. In a little
largely superseded by two works that appeared more than 50 years, the 52-page journal article
shortly before de Moivre’s book. The first was the had grown into a 348-page book—or 259 pages if
1708 Essai d’analyse sur les jeux de hasard (Essay the annuity and appendix sections are excluded.
of Analysis on Games of Chance) by Pierre de The main part of the text is devoted to treat-
Montmort and Ars Conjectandi (The Art of Con- ing 74 problems in probability theory. Besides
jecturing) by Jakob (or James) Bernoulli, published providing the solutions, de Moivre offers various
posthumously in 1713. By the time de Moivre cases and examples and specifies diverse corol-
wrote The Doctrine of Chances, he was familiar laries and lemmas. Along the way the author
with all these efforts as well as derived works. This presents the general procedures for the addition
body of knowledge put him in a unique position and multiplication of probabilities, discusses
to create a truly comprehensive treatment of prob- probability-generating functions, the binomial
ability theory. distribution law, and the use of recurring series
Doctrine of Chances, The 385

to solve difference equations involving probabili- There are two additional aspects of this work
ties, and offers original and less restricted treat- that are worth mentioning, even if not as impor-
ments of the duration of play in games of chance tant as the normal curve itself.
(i.e., the ‘‘gambler’s ruin’’ problem). Although First, de Moivre established a special case of the
some of the mathematical terminology and nota- central limit theorem that is sometimes referred to
tion is archaic, with minor adjustments and dele- as the theorem of de Moivre–Laplace. In effect,
tions The Doctrine of Chances could still be the theorem states that as the number of indepen-
used today as a textbook in probability theory. dent (Bernoulli) trials increases indefinitely, the
Because the author intended to add to his mea- binomial distribution approaches the normal dis-
ger income by the book’s sale, it was written in tribution. De Moivre illustrated this point by
a somewhat more accessible style than a pure showing that a close approximation to the normal
mathematical monograph. curve could be obtained simply by flipping a coin
Yet from the standpoint of later developments, a sufficient number of times. This demonstration is
the most critical contribution can be found on basically equivalent to that of the bean machine or
pages 243–254 of the 3rd edition (or pages 235– quincunx that Francis Galton invented to make
243 of the 2nd edition), which are tucked between the same point.
the penultimate and final problems. It is here that Second, de Moivre offered the initial compo-
de Moivre presented ‘‘a method of approximating nents of what later became known as the Poisson
the sum of the terms of the binomial (a þ b)n approximation to the binomial distribution, albeit
expanded into a series, from whence are deduced it was left to Siméon Poisson to provide this deri-
some practical rules to estimate the degree of vation the treatment it deserved. Given this frag-
assent which is to be given to experiments’’ (put in mentary achievement and others, one can only
modern mathematical notation and expressed in imagine what de Moivre would have achieved had
contemporary English orthography). Going he obtained a chair of mathematics at a major
beyond Bernoulli’s work (and that of Nicholas Ber- European university.
noulli, Jakob’s nephew), the approximation is
nothing other than the normal (or Gaussian)
Aftermath
curve.
Although de Moivre did not think of the Like his predecessors Huygens, Montmort, and
resulting exponential function in terms of a prob- Jakob Bernoulli, de Moivre was primarily interested
ability density function, as it is now conceived, in what was once termed direct probability. That is,
he clearly viewed it as describing a symmetrical given a particular probability distribution, the goal
bell-shaped curve with inflection points on both was to infer the probability of a specified event. To
sides. Furthermore, even if he did not possess the offer a specific example, the aim was to answer
explicit concept of the standard deviation, which questions such as What is the probability of throw-
constitutes one of two parameters in the modern ing a score of 12 given three throws of a regular
formula (the other being the mean), de Moivre six-faced die? In contrast, these early mathemati-
did have an implicit idea of a distinct and fixed cians were not yet intrigued by problems in inverse
unit that meaningfully divided the curve on probability. In this case the goal is to infer the
either side of the maximum point. By hand cal- underlying probability distribution that would most
culation he showed that the probabilities of out- likely produce a set of observed events. An instance
comes coming within ± 1, 2, and 3 of these would be questions like, Given that 10 coin tosses
units would be .6827, .9543, and .9987 (round- yielded 6 heads and 4 tails, what is the probability
ing his figures to four decimal places). The corre- that it is still an unbiased coin? and how many coin
sponding modern values for ± 1, 2, and 3 tosses would we need before we knew that the coin
standard deviations from the mean are .6826, was unbiased with a given degree of confidence?
.9544, and .9974. Taken together, de Moivre’s Inverse probability is what we now call statistical
understanding was sufficient to convince Karl inference—the inference of population properties
Pearson and others to credit him with the origi- from small random samples taken from that
nal discovery of the normal curve. population. How can we infer the population
386 Double-Blind Procedure

distribution from the sample distribution? How the method of least squares to minimize those
much confidence can we place in using the sample errors. Then Pierre-Simon Laplace, while working
mean as the estimate of the population mean? on the central limit theorem, discovered that the
This orientation toward direct rather than distribution of sample means tends to be described
inverse probability makes good sense historically. by a normal distribution, a result that is indepen-
As already noted, probability theory was first dent of the population distribution. Adolphe Quéte-
inspired by games of chance. And such games let later showed that human individual differences
begin with established probability distributions. in physical characteristics could be described by the
That is how each game is defined. So a coin toss same curve. The average person (l’homme moyen)
should have two equally likely outcomes, a die was someone who resided right in the middle of the
throw six equally likely outcomes, and a single distribution. Later still Galton extended this appli-
draw from a full deck of cards, 52 equally likely cation to individual differences in psychological
outcomes. The probabilities of various compound attributes and defined the level of ability according
outcomes—like getting one and only one ace in to placement on this curve. In due course the con-
three throws—can therefore be derived in a direct cept of univariate normality was generalized to
and methodical manner. In these derivations one those of bivariate and multivariate normality. The
certainty (the outcome probability) is derived from normal distribution thus became the single most
another certainty (the prior probability distribu- important probability distribution in the behavioral
tion) by completely certain means (the laws of and social sciences—with implications that went
probability). By comparison, because inverse prob- well beyond what de Moivre had more modestly
ability deals with uncertainties, conjectures, and envisioned in The Doctrine of Chances.
estimates, it seems far more resistant to scientific
analysis. It eventually required the introduction of Dean Keith Simonton
such concepts as confidence intervals and probabil-
ity levels. See also Game Theory; Probability, Laws of; Significance
It is telling that when Jakob Bernoulli attempted Level, Concept of; Significance Level, Interpretation
to solve a problem of the latter kind, he dramati- and Construction
cally failed. He specifically dealt with an urn
model with a given number of black and white Further Readings
pebbles. He then asked how many draws (with
de Moivre, A. (1967). The doctrine of chances (3rd ed.).
replacement) a person would have to make before
New York: Chelsea. (Original work published 1756)
the relative frequencies could be stated with an
Hald, A. (2003). A history of probability and statistics
a priori level of confidence. After much mathemat- and their applications before 1750. Hoboken, NJ:
ical maneuverings—essentially constituting the first Wiley.
power analysis—Bernoulli came up with a ludi- Stigler, S. M. (1986). The history of statistics: The
crous answer: 25,550 observations or tests. Not measurement of uncertainty before 1900. Cambridge,
surprisingly, he just ended Ars Conjectandi right MA: Harvard University Press.
there, apparently without a general conclusion,
and left the manuscript unpublished at his death.
Although de Moivre made some attempt to con-
tinue from where his predecessor left off, he was DOUBLE-BLIND PROCEDURE
hardly more successful, except for the derivation
of the normal curve. A double-blind procedure refers to a procedure in
It is accordingly ironic that the normal curve which experimenters and participants are ‘‘blind
eventually provided a crucial contribution to statis- to’’ (without knowledge of) crucial aspects of
tical inference and analysis. First, Carl Friedrich a study, including the hypotheses, expectations, or,
Gauss interpreted the curve as a density function most important, the assignment of participants to
that could be applied to measurement problems in experimental groups. This entry discusses the
astronomy. By assuming that errors of measure- implementation and application of double-blind
ment were normally distributed, Gauss could derive procedures, along with their historical background
Double-Blind Procedure 387

and some of the common criticisms directed Double-blind studies are normally also evalu-
at them. ated ‘‘blind.’’ Here, the data are input by auto-
matic means (Internet, scanning), or by assistants
blind to group allocation of participants. Whatever
Experimental Control
procedures are done to prepare the database for
‘‘Double-blinding’’ is intimately coupled to ran- analysis, such as transformations, imputations of
domization, where participants in an experimental missing values, and deletion of outliers, is done
study are allocated to groups according to a ran- without knowledge of group assignment. Nor-
dom algorithm. Participants and experimenters are mally a study protocol stipulates the final statisti-
then blinded to group allocation. Hence double- cal analysis in advance. This analysis is then run
blinding is an additional control element in experi- with a database that is still blinded in the sense
mental studies. If only some aspect of a study is that the groups are named ‘‘A’’ and ‘‘B.’’ Only after
blinded, it is a single-blind study. This is the case this first and definitive analysis has been conducted
when the measurement of an outcome parameter and documented is the blind broken.
is done by someone who does not know which Good clinical trials also test whether the blind-
group a participant belongs to and what hypothe- ing was compromised during the trial. If, for
ses and expectations are being tested. This could, instance, a substance or intervention has many and
in principle, also be done in nonexperimental stud- characteristic side effects, then patients or
ies if, for instance, two naturally occurring clinicians can often guess whether someone was
cohorts, smokers and nonsmokers, say, are tested allocated to treatment (often also called verum,
for some objective marker, such as intelligence or from the Latin word for true) or placebo. To test
plasma level of hormones. Double-blinding pre- for the integrity of the blinding procedure, either
supposes that participants are allocated to the all participants or a random sample of them are
experimental procedure and control procedure at asked, before the blind is broken, what group
random. Hence, by definition, natural groups or they think they had been allocated to. In a good,
cohorts cannot be subject to double-blinding. Dou- uncompromised blinded trial, there will be a
ble-blind testing is a standard for all pharmaceuti- near-random answer pattern because some patients
cal substances, such as drugs, but should be will have improved under treatment and some
implemented whenever possible in all designs. In under control.
order for a study to succeed with double-blinding,
a control intervention uses a placebo that can be
manufactured in a way that makes the placebo
Placebos
indistinguishable from the treatment.
To make blinding of patients and clinicians possi-
ble, the control procedure has to be a good mock
Allocation Concealment and Blind Analysis
or placebo procedure (also sometimes called
There are two corollaries to double-blinding: allo- sham). In pharmaceutical trials this is normally
cation concealment and blind statistical analysis. If done by administering the placebo in a capsule or
an allocation algorithm, that is, the process of allo- pill of the same color but containing pharmacolog-
cating participants to experimental groups, is com- ically inert material, such as corn flour. If it is nec-
pletely random, then, by definition, the allocation essary to simulate a taste, then often other
of participants to groups is concealed. If someone substances or coloring that are inactive or only
were to allocate participants to groups in an alter- slightly active, such as vitamin C, are added. For
nating fashion, then the allocation would not be instance, if someone wants to create a placebo for
concealed. The reason is that if someone were to caffeine, quinine can be used. Sometimes, if a phar-
be unblinded, because of an adverse event, say, macological substance has strong side effects, an
then whoever knew about the allocation system active placebo might be used. This is a substance
could trace back and forth from this participant that produces some of the side effects, but hardly
and find out about the group allocation of the any of the desired pharmacological effects, as the
other participants. experimental treatment.
388 Double-Blind Procedure

If a new pharmacological substance is to be animal magnetism. After emigrating to France,


tested against an already existing one, then a dou- Mesmer set up an enormously successful practice,
ble-dummy technique is used. Placebos for both treating rich and poor alike with his magnetic
substances are manufactured, and a patient takes cures. Several medical professors converted to ani-
always two substances, one of which is a placebo mal magnetism, among them Charles D’Eslon,
for the other substance. a professor in the medical school. The Académie
Française decided to scrutinize the phenomenon in
1784. D’Eslon and a volunteer patient were tested.
Blinding in Behavioral Research
When the magnetiseur stroked the volunteer with
While pharmaceutical procedures are compara- his magnet, she had all sorts of fits and exhibited
tively easy to blind, behavioral, surgical, or other strong physical reactions. It was decided to place
procedures are difficult to blind. For surgical and a curtain between her and the magnetiseur. This
similar trials, often real blinds are used, that is, step showed that only if the participant could see
screens that shield part of the surgical team from her magnetiseur did the phenomenon occur with
seeing what is actually happening such that only some reliability. By the same token, this proved
a few operating surgeons know whether they are that the participant’s conscious or unconscious
actually performing the real operation or a sham expectation, and not the magnet, conveyed the
procedure. Thus, the rest of the team—anesthesiol- influence.
ogist, nurses, assistant surgeons who might take This was the birth of hypnosis as a scientific
performance measures after the operation—can discipline as we know it. Incidentally, it was also
remain blinded. However, it should be noted that the birth of blinded control in experiments. The
most surgical procedures have not been evaluated next step in the history of blinded experimenta-
by blinded trials. tion was taken by followers of Samuel Hahne-
Most behavioral interventions cannot be mann, the German doctor who had invented
tested in blinded trials. A psychotherapist, for homeopathy. Homeopaths use sugar globules
instance, has to know what he or she is doing in impregnated with alcohol but containing hardly
order to be effective. However, patients can be any or none of the pharmacological substances
allocated in a blind fashion to different types of that are nominally their source, yet claim effects.
interventions, one of which is a sham interven- This pharmacological intervention lends itself
tion. Therefore, in behavioral research, active perfectly to blinded investigations because the
controls are very important. These are controls placebo is completely undistinguishable from the
that are tailor made and contain some, but not experimental medicine. Homeopaths were the
all, the purportedly active elements of the treat- first to test their intervention in blinded studies,
ment under scrutiny. The less possible it is to in 1834 in Paris, where inert sugar globules were
blind patients and clinicians to the interventions dispensed with the normal homeopathic ritual,
used, the more important it is to use either with roughly equal effectiveness. This was done
blinded assessors to measure the outcome again in 1842 in Nuremberg, where homeopaths
parameter of interest or objective measures that challenged critics and dispensed homeopathic
are comparatively robust against experimenter and inert globules under blind conditions to
or patient expectation. volunteers who were to record their symptoms.
Symptoms of a quite similar nature were
reported by both groups, leaving the battle unde-
Historical Background
cided. The first blinded conventional medical
Historically, blinding was introduced in the testing trials were conducted in 1865 by Austin Flint in
of so-called animal magnetism introduced by the United States and William Withey Gull in
Franz-Anton Mesmer, a German healer, doctor, London, both testing medications for rheumatic
and hypnotist. Mesmer thought that a magnet fever. The year 1883 saw the first blinded psy-
would influence a magnetic field in the human chological experiment. Charles Sanders Peirce
body that regulates human physiology. This pur- and Joseph Jastrow wanted to know the smallest
ported specific human magnetic system he called weight difference that participants could sense
Double-Blind Procedure 389

and used a screen to blind the participants from speaking, blinded clinical trials are cost inten-
seeing the actual weights. Blinded tests became sive, and researchers will likely not be able to
standard in hypnosis and parapsychological muster the resources to run sufficient numbers
research. Gradually medicine also came to of isolated, blinded trials on all components to
understand the importance and power of sugges- gain enough certainty. Theoretically, therapeutic
tion and expectation. The next three important packages come in a bundle that falls apart if one
dates are the publication of Methodenlehre were to disentangle them into separate elements.
der therapeutischen Untersuchung (Clinical So care has to be taken not to overgeneralize the
Research Methodology) by German pharmacolo- pharmacological model to all situations.
gist Paul Martini in 1932, the introduction of Blinding is always a good idea, where it can be
randomization by Ronald Fisher’s Design of implemented, because it increases the internal
Experiments in 1935, and the 1945 Cornell con- validity of a study. Double-blinding is necessary if
ferences on therapy, which codified the blinded one wants to know the specific effect of a mecha-
clinical trial. nistic intervention.

Harald Walach
Caveats and Criticisms
See also Experimenter Expectancy Effect; Hawthorne
Currently there is a strong debate over how to
Effect; Internal Validity; Placebo; Randomization
balance the merits of strict experimental control
Tests
with other important ingredients of therapeutic
procedures. The double-blind procedure has
grown out of a strictly mechanistic, pharmaco- Further Readings
logical model of efficacy, in which only a single
specific physiological mechanism is important, Committee for Proprietary Medicinal Products
Working Party on Efficacy of Medicinal
such as the blocking of a target receptor, or one
Products. (1995). Biostatistical methodology in
single psychological process that can be clinical trials in applications for marketing
decoupled from contexts. Such careful focus can authorizations for medicinal products. CPMP working
be achieved only in strictly experimental party on efficacy of medicinal products note for
research with animals and partially also with guidance III/3630/92-EN. Statistics in Medicine, 14,
humans. But as soon as we reach a higher level 1659–1682.
of complexity and come closer to real-world Crabtree, A. (1993). From Mesmer to Freud: Magnetic
experiences, such blinding procedures are not sleep and the roots of psychological healing. New
necessarily useful or possible. The real-world Haven, CT: Yale University Press.
effectiveness of a particular therapeutic interven- Greenberg, R. P., Bornstein, R. F., Greenberg, M. D., &
Fisher, S. (1992). A meta-analysis of antidepressant
tion is likely to consist of a specific, mechanisti-
outcome under ‘‘blinder’’ conditions. Journal of
cally active ingredient that sits on top of Consulting & Clinical Psychology, 60, 664–669.
a variety of other effects, such as strong, nonspe- Kaptchuk, T. J. (1998). Intentional ignorance: A history
cific effects of relief from being in a stable thera- of blind assessment and placebo controls in medicine.
peutic relationship; hope that a competent Bulletin of the History of Medicine, 72, 389–433.
practitioner is structuring the treatment; and Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G.
reduction of anxiety through the security given (1995). Empirical evidence of bias: Dimensions of
by the professionalism of the context. This methodological quality associated with estimates of
approach has been discussed under the catch- treatment effects in controlled trials. Journal of the
word whole systems research, which acknowl- American Medical Association, 273, 408–412.
Shelley, J. H., & Baur, M. P. (1999). Paul Martini:
edges (a) that a system or package of care is
The first clinical pharmacologist? Lancet, 353,
more than just the sum of all its elements and (b) 1870–1873.
that it is unrealistic to assume that all complex White, K., Kando, J., Park, T., Waternaux, C., & Brown,
systems of therapy can be disentangled into their W. A. (1992). Side effects and the ‘blindability’ of
individual elements. Both pragmatic and theoret- clinical drug trials. American Journal of Psychiatry,
ical reasons stand against it. Pragmatically 149, 1730–1731.
390 Dummy Coding

Here is how the data look after dummy coding:


DUMMY CODING
Values Group d1 d2 d3
Dummy coding is used when categorical variables
1 1 1 0 0
(e.g., sex, geographic location, ethnicity) are of
3 1 1 0 0
interest in prediction. It provides one way of using
2 1 1 0 0
categorical predictor variables in various kinds of
2 1 1 0 0
estimation models, such as linear regression.
2 2 0 1 0
Dummy coding uses only 1s and 0s to convey all
3 2 0 1 0
the necessary information on group membership.
4 2 0 1 0
With this kind of coding, the researcher enters a 1
3 2 0 1 0
to indicate that a person is a member of a category,
5 3 0 0 1
and a 0 otherwise.
6 3 0 0 1
Dummy codes are a series of numbers assigned
4 3 0 0 1
to indicate group membership in any mutually
5 3 0 0 1
exclusive and exhaustive category. Category mem-
10 4 0 0 0
bership is indicated in one or more columns of 0s
10 4 0 0 0
and 1s. For example, a researcher could code
9 4 0 0 0
sex as 1 ¼ female, 0 ¼ male or 1 ¼ male,
11 4 0 0 0
0 ¼ female. In this case the researcher would have
a column variable indicating status as male or
female. In general, with k groups there will be Note that every observation in Group 1 has the
k  1 coded variables. Each of the dummy-coded dummy-coded value of 1 for d1 and 0 for the
variables uses 1 degree of freedom, so k groups others. Those in Group 2 have 1 for d2 and 0 oth-
have k  1 degrees of freedom, just as in analysis erwise, and for Group 3, d3 equals 1 with 0 for
of variance (ANOVA). Consider the following the others. Observations in Group 4 have all 0s on
example, in which there are four observations d1, d2, and d3. These three dummy variables con-
within each of the four groups: tain all the information needed to determine which
observations are included in which group. If you
are in Group 2, then d2 is equal to 1 while d1 and
Group G1 G2 G3 G4 d3 are 0. The group with all 0s is known as the
reference group, which in this example is Group 4.
1 2 5 10
3 3 6 10
2 4 4 9
Dummy Coding in ANOVA
2 3 5 11 The use of nominal data in prediction requires the
Mean 2 3 5 10 use of dummy codes; this is because data need to
be represented quantitatively for predictive pur-
poses, and nominal data lack this quality. Once
For this example we need to create three the data are coded properly, the analysis can be
dummy-coded variables. We will call them d1, d2, interpreted in a manner similar to traditional
and d3. For d1, every observation in Group 1 will ANOVA.
be coded as 1 and observations in all other groups Suppose we have three groups of people, single,
will be coded as 0. We will code d2 with 1 if the married, and divorced, and we want to estimate
observation is in Group 2 and zero otherwise. For their life satisfaction. In the following table, the
d3, observations in Group 3 will be coded 1 and first column identifies the single group (observa-
zero for the other groups. There is no d4; it is not tions of single status are dummy coded as 1 and
needed because d1 through d3 have all the infor- 0 otherwise ), and the second column identifies the
mation needed to determine which observation is married group (observations of married status are
in which group. dummy coded as 1 and 0 otherwise). The divorced
Dummy Coding 391

group is left over, meaning this group is the refer- the means of the divorced and single groups. The
ence group. However, the overall results will be second b weight represents the difference in means
the same no matter which groups we select. between the divorced and married groups.

Satis
Column Column Group Dummy Coding in Multiple
Group Satisfaction 1 2 Mean
Regression With Categorical Variables
Single 25 1 0 24.80
Multiple regression is a linear transformation of
S 28 1 0
the X variables such that the sum of squared
S 20 1 0
deviations of the observed and predicted Y is mini-
S 26 1 0
mized. The prediction of Y is accomplished by the
S 25 1 0
following equation:
Married 30 0 1 30.20
M 28 0 1 Y0i ¼ b0 þ b1 X1i þ b2 X2i þ    þ bk Xki
M 32 0 1
M 33 0 1
Categorical variables with two levels may be
M 28 0 1
entered directly as predictor variables in a multiple
Divorced 20 0 0 23.8
regression model. Their use in multiple regression
D 22 0 0
is a straightforward extension of their use in sim-
D 28 0 0
ple linear regression. When they are entered as pre-
D 25 0 0
dictor variables, interpretation of regression
D 24 0 0
weights depends on how the variable is coded.
Grand Mean 26.27 0.33 0.33
When a researcher wishes to include a categorical
variable with more than two levels in a multiple
Note there are three groups and thus 2 degrees regression prediction model, additional steps are
of freedom between groups. Accordingly, there are needed to ensure that the results are interpretable.
two dummy-coded variables. If X1 denotes single, These steps include recoding the categorical vari-
X2 denotes married, and X3 denotes divorced, able into a number of separate, dichotomous vari-
then the single group is identified when X1 is 1 ables: dummy coding.
and X2 is 0; the married group is identified when
X2 is 1 and X1 is 0; and the divorced group is
identified when both X1 and X2 are 0. Example Data: Faculty Salary Data
If Y^ denotes the predicted level of life satisfac-
tion, then we get the following regression equation: Faculty Salary Gender Rank Dept Years Merit
^ ¼ a þ b1ðX1Þ þ b2ðX2Þ,
Y 1 Y1 0 3 1 0 1.47
2 Y2 1 2 2 8 4.38
where a is the interception, and b1 and b2 are 3 Y3 1 3 2 9 3.65
slopes or weights. The divorced group is identified 4 Y4 1 1 1 0 1.64
when both X1 and X2 are 0, so it drops out of the 5 Y5 1 1 3 0 2.54
regression equation, leaving the predicted value 6 Y6 1 1 3 1 2.06
equal to the mean of the divorced group. 7 Y7 0 3 1 4 4.76
The group that gets all 0s is the reference group. 8 Y8 1 1 2 0 3.05
For this example, the reference group is the 9 Y9 0 3 3 3 2.73
divorced group. The regression coefficients present 10 Y10 1 2 1 0 3.14
a contrast or difference between the group identi-
fied by the column and the reference group. To be The simplest case of dummy coding is one in
specific, the first b weight corresponds to the single which the categorical variable has three levels
group and the b1 represents the difference between and is converted to two dichotomous variables.
392 Dummy Coding

For example, Dept in the example data has three


Unstandardized Standardized
levels, 1 ¼ Psychology, 2 ¼ Curriculum, and
Coefficients Coefficients
3 ¼ Special Education. This variable could be
dummy coded into two variables, one called Model B Std. Error Beta t Sig.
Psyc and one called Curri. If Dept ¼ 1, then 1 (Constant) 54.600 2.394 22.807 .000
Psyc would be coded with a 1 and Curri with Psyc –8.886 3.731 –.423 –2.382 .025
a 0. If Dept ¼ 2, then Psyc would be coded with Curri –12.350 3.241 –.676 –3.810 .001
a 0 and Curri would be coded with a 1. If
Dept ¼ 3, then both Psyc and Curri would be Notes: Coefficients. Dependent Variable: SALARY.
Psyc ¼ Psychology; Curri ¼ Curriculum.
coded with a 0. The dummy coding is repre-
sented below.
The coefficients table can be interpreted as fol-
lows: The Psychology faculty makes $8,886 less in
salary per year relative to the Special Education fac-
Dummy Coded Variables ulty, while the Curriculum faculty makes $12,350
Dept Psyc Curri
less than the Special Education department.

Psychology 1 1 0
Combinations and Interaction of
Curriculum 2 0 1
Special Education 3 0 0 Categorical Predictor Variables
The previous examples dealt with individual cate-
gorical predictor variables with two or more levels.
A listing of the recoded data is presented below. The following example illustrates how to create
a new dummy coded variable that represents the
Faculty Dept Psyc Curri Salary interaction of certain variables.
Suppose we are looking at how gender, parental
1 1 1 0 Y1 responsiveness, and the combination of gender and
2 2 0 1 Y2 parental responsiveness influence children’s social
3 2 0 1 Y3 confidence. Confidence scores serve as the depen-
4 1 1 0 Y4 dent variable, with gender and parental responsive-
5 3 0 0 Y5 ness scores (response) serving as the categorical
6 3 0 0 Y6 independent variables. Response has three levels:
7 1 1 0 Y7 high level, medium level, and low level. The analy-
8 2 0 1 Y8 sis may be thought of as a two-factor ANOVA
9 3 0 0 Y9 design, as below:
10 1 1 0 Y10
Response Scale Values

Suppose we get the following model summary 0 1 2 Marginal


table and coefficients table. Gender (Low) (Mid) (High) Mean

0 (male) X11 X12 X13 Mmale


R Adjusted Std. Error of 1 (female) X21 X22 X23 Mfemale
Model R Square R Square the Estimate Marginal Mean Mlow Mmid Mhigh Grand Mean
1 .604 .365 .316 7.57
Change Statistics There are three ‘‘sources of variability’’ that
R Square F Sig. F could contribute to the differential explanation of
Change Change df1 df2 Change confidence scores in this design. One source is the
main effect for gender, another is the main effect
.365 7.472 2 26 .003 for response, and the third is the interaction effect
Dummy Coding 393

between gender and response. Gender is already As in an ANOVA design analysis, the first
dummy coded in the data file with males ¼ 0 and hypothesis of interest to be tested is the interaction
females ¼ 1. Next, we will dummy code the effect. In a multiple regression model, the first
response variable into two dummy coded variables, analysis tests this effect in terms of its ‘‘unique
one new variable for each degree among groups for contribution’’ to the explanation of confidence
the response main effect (see below). Note that the scores. This can be realized by entering the
low level of response is the reference group (shaded response dummy-coded variables as a block into
in the table below). Therefore, we have 2 (i.e., the model after gender has been entered. The Gen-
3  1) dummy-coded variables for response. der Response dummy-coded interaction variables
are therefore entered as the last block of variables
Dummy Coding in creating the ‘‘full’’ regression model, that is, the
Original Coding Low Medium High model with all three effects in the equation. Part of
the coefficients table is presented below.
0 (Low) 0 0 0
1 (Medium) 0 1 0
2 (High) 0 0 1 Model B t p
Last comes the third dummy-coded variables, 1 (Constant) 18.407 29.66 .000**
which represent the interaction source of variability. Gender 2.073 2.453 .017*
The new variables are labeled as G*Rmid (meaning 2 (Constant) 19.477 22.963 .000**
gender interacting with the medium level of Gender 2.045 2.427 .018*
response), G*Rhigh (meaning gender interacting Response (mid) –1.401 –1.518 .133
with the high level of response), and G*Rlow Response (high) –2.270 –1.666 .100
(meaning gender interacting with the low level of 3 (Constant) 20.229 19.938 .000**
response), which is the reference group. To dummy Gender .598 .425 .672
code variables that represent interaction effects of Response (mid) –1.920 –1.449 .152
categorical variables, we simply use the products of Response (high) –5.187 –2.952 .004*
the dummy codes that were constructed separately G*Rmid 1.048 .584 .561
for each of the variables. In this case, we simply G*Rhigh 6.861 2.570 .012
multiply gender by dummy coded response. Note
Notes: *p < .025. **p < .001.
that there are as many new interaction dummy
coded variables created as there are degrees of free-
The regression equation is
dom for the interaction term in the ANOVAs design.
The newly dummy coded variables are as follows: ^ ¼ 20:229 þ :598X1  1:920X2  5:187X3
Y
Gender Response: Response: þ 1:048X4  6:861X5,
(main Medium High G*Rmid G*Rhigh
effect) (main effect) (main effect) (interaction) (interaction) in which X1 ¼ gender (female), X2 ¼ response
(medium level), X3 ¼ response (high level), X4 ¼
0 0 0 0 0 Gender by Response (Female with Mid Level), and
0 0 0 0 0 X5 ¼ Gender by Response (Female with High
0 1 0 0 0 Level). Now, b1 ¼ :598 tells us girls receiving
0 1 0 0 0 a low level of parental response have higher confi-
0 0 1 0 0 dence scores than boys receiving the same level of
0 0 1 0 0 parental response, but this difference is not signifi-
1 0 0 0 0 cant (p ¼ .672 > .05), and b2 ¼  1.920 tells us
1 0 0 0 0 children receiving a medium level of parental
1 1 0 1 0 response have lower confidence scores than chil-
1 1 0 1 0 dren receiving a low level of parental response, but
1 0 1 0 1 this difference is not significant (p ¼ .152 > .05).
1 0 1 0 1 However, b5 ¼ 6:861 tells us girls receiving a high
394 Duncan’s Multiple Range Test

level of parental response tend to score higher than significance levels for the difference between any
do boys receiving this or other levels of response, pair of means, regardless of whether a significant
and this difference is significant (p ¼ .012 < .025). F resulted from an initial analysis of variance.
The entire model tells us the pattern of mean confi- Duncan’s test differs from the Newman–Keuls
dence scores across the three parental response test (which slightly preceded it) in that it does
groups for boys is sufficiently different from the not require an initial significant analysis of vari-
pattern of mean confidence scores for girls across ance. It is a more powerful (in the statistical
the three parental response groups ðp < :001Þ. sense) alternative to almost all other post
hoc methods.
Jie Chen When introducing the test in a 1955 article in
the journal Biometrics, David B. Duncan
See also Categorical Variable; Estimation
described the procedures for identifying which
pairs of means resulting from a group compari-
Further Readings son study with more than two groups are signifi-
cantly different from each other. Some sample
Aguinis, H. (2004). Regression analysis for categorical
mean values taken from the example presented
moderators. New York: Guilford.
Aiken, L. S., & West, S. G. (1991). Multiple regression,
by Duncan are given. Duncan worked in agro-
testing and interpreting interaction. Thousand Oaks, nomics, so imagine that the means represent
CA: Sage. agricultural yields on some metric. The first step
Allison, P. D. (1999). Multiple regression: A primer. in the analysis is to sort the means in order from
Thousand Oaks, CA: Pine Forge Press. lowest to highest, as shown.
Brannick, M. T. Categorical IVs: Dummy, effect, and
orthogonal coding. Retrieved September 15, 2009,
from http://luna.cas.usf.edu/~mbrannic/files/regression/
anova1.html Groups A F G D C B E
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
Means 49.6 58.1 61.0 61.5 67.6 71.2 71.3
Applied multiple regression/correlation analysis for the
behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence
Erlbaum.
Gupta, R. (2008). Coding categorical variables in
regression models: Dummy and effect coding. From tables of values that Duncan developed
Retrieved September 15, 2009, from http:// from the t-test formula, standard critical
www.cscu.cornell.edu/news/statnews/stnews72.pdf differentials at the .05 level are identified. These
Keith, T. Z. (2006). Multiple regression and beyond. are significant studentized differences, which
Boston: Allyn & Bacon. must be met or surpassed. To maintain the nomi-
Pedhazur, E. J. (1997). Multiple regression in behavioral
nal significance level one has chosen, these differ-
research: Explanation and prediction. Eagan, MN:
Thomson Learning.
entials get slightly higher as the two means that
Stockburger, D. W. (1997). Multiple regression are compared become further apart in terms of
with categorial variables. Retrieved September 15, their rank ordering. In the example shown, the
2009, from http://www.psychstat.missouristate.edu/ means for groups A and F have an interval of 2
multibook/mlt08m.html. because they are adjacent to each other. Means
Warner, R. M. (2008). Applied statistics: From bivariate A and E have an interval of 7 as there are seven
through multivariate techniques. Thousand Oaks, CA: means in the span between them. By multiplying
Sage. the critical differentials by the standard error of
the mean, one can compute the shortest signifi-
cant ranges for each interval width (in the
example, the possible intervals are 2, 3, 4, 5, 6,
DUNCAN’S MULTIPLE RANGE TEST and 7). With the standard error of the mean of
3.643 (which is supplied by Duncan for this
Duncan’s multiple range test, or Duncan’s test, or example), the shortest significant ranges are
Duncan’s new multiple range test, provides calculated.
Dunnett’s Test 395

pairwise comparisons without concern for infla-


Range 2 3 4 5 6 7
tion of the Type I error rate. A researcher may per-
Studentized 2.89 3.04 3.12 3.20 3.25 3.29 form dozens of post hoc analyses in the absence of
differences specific hypotheses and treat all tests as if they are
Shortest 10.53 11.07 11.37 11.66 11.84 11.99 conducted at the .05 (or whatever the nominal
significant value chosen) level of significance. The compari-
ranges sons may be analyzed even in the absence of an
overall F test indicating that any differences exist.
Not surprisingly, Duncan’s multiple range test is
For any two means to be significantly differ-
not recommended by many statisticians who prefer
ent, their distance must be equal to or greater
more conservative approaches that minimize the
than the associated shortest significant range.
Type I error rate. Duncan’s response to those con-
For example, the distance between mean F
cerns was to argue that because the null hypothesis
(58.1) and mean B (71.2) is 13.1. Within the
is almost always known to be false to begin with,
rank ordering of the means, the two means form
it is more reasonable to be concerned about mak-
an interval of width 5, with an associated short-
ing Type II errors, missing true population differ-
est significant range of 11.66. Because
ences, and his method certainly minimizes the true
13.1 > 11.66, the two means are significantly dif-
Type II error rate.
ferent at the .05 level.
A small table that includes the significant stu-
Duncan suggested a graphical method of dis-
dentized differences calculated by Duncan and
playing all possible mean comparisons and
reported in his 1955 paper is provided below. It
whether they are significant compared with one
shows values based on the number of means to be
another. This method involved underlining those
considered and the degrees of freedom in the
clusters of means that are not statistically different.
experiment. The table assumes a .05 alpha level.
Following his suggestion, the results for this sam-
ple are shown below. Bruce Frey

See also Newman–Keuls Test and Tukey Test; Post Hoc


Groups A F G D C B E
Comparisons; Type II Error
Means 49.6 58.1 61.0 61.5 67.6 71.2 71.3
Further Readings
Notes: Means that are underlined by the same line are not
significantly different from each other. Means that are Chew, V. (1976). Uses and abuses of Duncan’s Multiple
underlined by different lines are significantly different from Range Test. Proceedings of the Florida State
each other. Horticultural Society, 89, 251–253.
Duncan, D. B. (1955). Multiple range and multiple F
Number of Means tests. Biometrics, March, 1–42.
Degrees of Freedom 2 3 4 5 6 7 8 9 10

5 3.643.743.793.833.833.833.833.833.83
10 3.153.303.373.433.463.473.473.473.47 DUNNETT’S TEST
15 3.013.163.253.313.363.383.403.423.43
20 2.953.103.183.253.303.343.363.383.40 Dunnett’s test is one of a number of a posteriori or
30 2.893.043.123.203.253.293.323.353.37 post hoc tests, run after a significant one-way anal-
60 2.832.983.083.143.203.243.283.313.33 ysis of variance (ANOVA), to determine which dif-
100 2.802.953.053.123.183.223.263.293.32 ferences are significant. The procedure was
Note: Significant Studentized Ranges at the .05 Level for introduced by Charles W. Dunnett in 1955. It
Duncan’s Multiple Range Test. differs from other post hoc tests, such as the
Newman–Keuls test, Duncan’s Multiple Range
The philosophical approach taken by Duncan is test, Scheffé’s test, or Tukey’s Honestly Significant
an unusually liberal one. It allows for multiple Difference test, in that its use is restricted to
396 Dunnett’s Test

comparing a number of experimental groups H 0B : μB ¼ μR ; H 1B : μB 6¼ μR


against a single control group; it does not test the
experimental groups against one another. Back-
ground information, the process of running Dun- H 0C : μC ¼ μR ; H 1C : μC 6¼ μR
nett’s test, and an example are provided in this
entry. Each hypothesis is tested against a critical value,
called q0 , using the formula:

Background Xi  XR
q0 ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 ,
A one-way ANOVA tests the null hypothesis ðH0 Þ 2 × MSerror n þ n 1 1
that all the k treatment means are equal; that is, i R

where the subscript i refers to each of the experi-


H 0 : μA ¼ μB ¼ μC ¼    ¼ μk , mental groups, and MSerror is the mean square for
the error term, taken from the ANOVA table. The
against the alternative hypothesis ðH1 Þ that at least value of q0 is then compared against a critical value
one of the means is different from the others. The in a special table, the researcher knowing the num-
difficulty is that if H0 is rejected, it is not known ber of groups (including the reference group) and
which mean differs from the others. It is possible the degrees of freedom of MSerror. The form of the
to run t tests on all possible pairs of means (e.g., A equation is very similar to that of the Newman–
vs. B, A vs. C, B vs. C). However, if there are five Keuls test. However, it accounts for the fact that
groups, this would result in 10 t tests (in general, there are a smaller number of comparisons because
there are k × (k  1) / 2 pairs). Moreover, the tests the various experimental groups are not compared
are not independent, because any one mean enters with each other. Consequently, it is more powerful
into a number of comparisons, and there is a com- than other post hoc tests.
mon estimate of the experimental error. As a result, In recent years, the American Psychological
the probability of a Type I error (that is, conclud- Association strongly recommended reporting con-
ing that there is a significant difference when in fidence intervals (CIs) in addition to point esti-
fact there is not one) increases beyond 5% to an mates of a parameter and p levels. It is possible to
unknown extent. The various post hoc tests are both derive CIs and run Dunnett’s test directly,
attempts to control this family-wise error rate and rather than first calculating individual values of q0
constrain it to 5%. for each group. The first step is to determine an
The majority of the post hoc tests compare each allowance (A), defined as
group mean against every other group mean. One, sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
called Scheffé’s test, goes further, and allows the 2 × MSerror
user to compare combinations of groups (e.g., A ¼ t1ðα=2Þ ,
nh
A þ B vs. C, A þ C vs. B, A þ B vs. C þ D). Dun-
nett’s test is limited to the situation in which one where t1  (α/2) is taken from the Dunnett tables,
group is a control or reference condition ðRÞ, and and nh is the harmonic mean of the sample sizes of
each of the other (experimental group) means is all the groups (including the reference), which is
compared to it.
k
nh ¼ 1
:
Dunnett’s Test n1
þ n1 þ n1 þ    þ n1
2 3 k

Rather than one null hypothesis, Dunnett’s test has


Then the CI for each term is
as many null hypotheses as there are experimental
groups. If there are three such groups (A, B, and ðXi  XR Þ ± A
C), then the null and alternative hypotheses are:
The difference is statistically significant if the CI
H 0A : μA ¼ μR ; H 1A : μA 6¼ μR does not include zero.
Dunnett’s Test 397

Table 1 Results of a Fictitious Experiment Comparing Table 2 The ANOVA Table for the Results in Table 1
Three Treatments Against a Reference Source of Sum of Mean
Condition Variance Squares df Square F p
Group Mean Standard Deviation Between groups 536.000 3 178.667 14.105 < .001
Reference 50.00 3.55 Within groups 152.000 12 12.667
A 61.00 4.24 Total 688.000 15
B 52.00 2.45
C 45.00 3.74
meaning that the 95% CI for Group A is

An Example ð61  50Þ ± 6:75 ¼ 4:25  17:75:


Three different drugs (A, B, and C) are compared
against placebo (the reference condition, R), and David L. Streiner
there are 4 participants per group. The results are
shown in Table 1, and the ANOVA summary in See also Analysis of Variance (ANOVA); Duncan’s
Table 2. For Group A, q0 is: Multiple Range Test; Honestly Significant Difference
(HSD) Test; Multiple Comparison Tests; Newman–
61  50 11 Keuls Test and Tukey Test; Pairwise Comparisons;
q0 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1 1ffi ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 3:09
12:667 Scheffé Test; Tukey’s Honestly Significant Difference
2 × 12:667 þ 4 4 (HSD)
0
For Group B, q ¼ 0.56, and it is 1.40 for Group
C. With four groups and dferror ¼ 12, the critical Further Readings
value in Dunnett’s table is 2.68 for α ¼.05 and
3.58 for α ¼.01. We therefore conclude that Dunnett, C. W. (1955). A multiple comparison procedure
Group A is different from the reference group at for comparing several treatments with a control.
p < .05, and that Groups B and C do not differ Journal of the American Statistical Association, 50,
1096–1121.
significantly from it.
Dunnett, C. W. (1964). New tables for multiple
The value of A is : comparisons with a control. Biometrics, 20, 482–491.
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Sahai, H., & Ageel, M. I. (2000). Analysis of variance:
2 × 12:667 Fixed, random, and mixed models. New York:
A ¼ 2:68 ¼ 2:68 × 2:52 ¼ 6:75,
4 Birkhäuser.
E
a continuum ranging from low to high levels of eco-
ECOLOGICAL VALIDITY logical validity. Despite no universally agreed-upon
definition of ecological validity, a deeper under-
standing of the concept can be achieved by analyz-
Ecological validity is the degree to which test per- ing its three dimensions: test environment, stimuli
formance predicts behaviors in real-world settings. under examination, and behavioral response.
Today, psychologists are called upon by attorneys,
insurance agencies, vocational rehabilitation coun-
selors, and employers to draw inferences about cli- Test Environment
ents’ cognitive capacities and their implications in In the field of psychological assessment, con-
real-world settings from psychological tests. These trolled test environments are recommended to
demands have accentuated the importance of eco- allow the client’s ‘‘best performance,’’ and psychol-
logical validity. Originally, neuropsychological ogists have attempted to reduce distractions, con-
tests were created as tools for detecting and local- fusion, and fatigue in the testing situation.
izing neuropathology. The diagnostic utility of Historically, to avoid misdiagnosing brain pathol-
such assessment instruments decreased with the ogy, evaluating a client’s best performance was cru-
development of brain-imaging techniques, and cial. However, because neuropsychologists today
neuropsychology shifted its focus toward identify- are asked to predict clients’ functioning in real-
ing the practical implications of brain pathology. world settings, the ecological validity of the tradi-
Society’s increasing interest in clients’ everyday tional test environment has been called into
abilities has necessitated further research into the question. Unlike testing situations, the natural
ecological validity of psychological and neuropsy- world does not typically provide a quiet, support-
chological tests. The dimensions, applications, lim- ive, distraction-reduced environment. The disparity
itations, and implications of ecological validity are between test environments and clients’ everyday
discussed in this entry. environments may reduce the predictive accuracy
of psychological assessments. Today, many referral
Dimensions questions posed to neuropsychologists call for the
development of testing environments that more
Robert Sbordone refers to ecological validity as ‘‘the closely approximate real-world settings.
functional and predictive relationship between the
patient’s performance on a set of neuropsychological
Stimuli Under Examination
tests and the patient’s behavior in a variety of real-
world settings’’ (Sbordone & Long, p. 16). This The extent to which stimuli used during testing
relationship is not absolute; tests tend to fall on resemble stimuli encountered in daily life should

399
400 Ecological Validity

be taken into account when evaluating ecological Commonly used outcome measures to which tradi-
validity. For example, the Grocery List Selective tional neuropsychological tests are correlated in
Reminding Test is a test that uses real-world stim- the veridicality approach are the Dysexecutive
uli. Unlike traditional paired associate or list- Questionnaire (DEX) and the Behavior Rating
learning tests, which often use arbitrary stimuli, Inventory of Executive Functioning.
the Grocery List Selective Reminding Test employs One limitation of the veridicality approach is
a grocery list to evaluate verbal learning. Naturally that the outcome measures selected for comparison
occurring stimuli increase the ecological validity of with the traditional neuropsychological test may
neuropsychological tests. not accurately represent the client’s everyday func-
tioning. Also, many of the traditional neuropsy-
Behavioral Response chological tests evaluated using the veridicality
approach were developed to diagnose brain
Another important dimension of ecological pathology, not make predictions about daily
validity is assuring that behavioral responses eli- functioning.
cited are representative of the person’s natural
behaviors and appropriately related to the con- Verisimilitude
struct being measured. Increased levels of ecologi-
cal validity would be represented in simulator Verisimilitude is the degree to which tasks per-
assessment of driving by moving the cursor with formed during testing resemble tasks performed in
the arrow keys, with the mouse, or with a steering daily life. With the verisimilitude approach, tests
wheel. The more the response approximates the are created to simulate real-world tasks. Some lim-
criterion, the greater the ecological validity. itations of the verisimilitude approach include the
The two main methods of establishing ecologi- cost of creating new tests and the reluctance of
cal validity are veridicality and verisimilitude. clinicians to put these new tests into practice. Mere
These methods are related to, but not isomorphic face validity cannot be substituted for empirical
with, the traditional constructs of concurrent research when assessing the ecological validity of
validity/predictive validity and construct validity/ neuropsychological tests formed from this
face validity, respectively. approach.

Veridicality Ecological Validity of


Veridicality is the degree to which test scores Neuropsychological Tests
correlate with measures of real-world functioning.
Executive Functioning
The veridicality approach examines the statistical
relationship between performance on traditional Although there are data to support the ecologi-
neuropsychological tests and one or more selected cal validity of traditional neuropsychological tests
outcome measures, including self-reports, infor- of executive functioning, a growing body of litera-
mant questionnaires, clinician ratings, perfor- ture suggests that the traditional tests (i.e., Wiscon-
mance-based measures, employment status, and sin Card Sorting Test, Stroop Color-Word Test,
activities of daily living. Self-reports have repeat- Trail Making Test, and Controlled Oral Word
edly been shown as weaker predictors of everyday Association Test), at best, only moderately predict
performance than clinician and informant ratings. everyday executive functioning and that tests
However, with recent advances in technology, developed with ecological validity in mind are
researchers have attempted to increase the ecologi- more effective. Studies examining the relationship
cal validity of self-reports by using ambulatory between the Hayling and Brixton tests and the
monitoring devices to conduct ecological momen- DEX have reported favorable results in patients
tary assessments to measure patients’ behaviors, with nondegenerative brain disorders, frontal lobe
moods, perception of others, physiological vari- lesions, and structural brain damage. The Califor-
ables, and physical activities in natural settings. nia Verbal Learning Test has been used effectively
This technology is in its infancy, and more research to predict job performance and occupational sta-
is needed to determine its effectiveness and utility. tus, whereas the preservative responses of the
Ecological Validity 401

Wisconsin Card Sorting Test are capable of effec- suitable ecological validity to aid the diagnosis and
tively predicting occupational status only. Newer outcome prediction of patients with epilepsy.
tests are being developed to encompass verisimili-
tude in the study of executive functioning. These Perception
tests include the Virtual Planning Test and the
Behavioral Assessment of Dysexecutive Syndrome. Research on ecological validity of perceptual
tests is limited. The Behavioral Inattention Test
was developed to assist the prediction of everyday
Attention problems arising from unilateral visual neglect.
Although research on the veridicality of tests of Ecological validity of the Wechsler Adult Intelli-
attention is limited, there is reasonable evidence gence ScaleRevised has been shown for assessing
that traditional tests of attention are ecologically visuoconstructive skills. Investigators used subtests
valid. More research should be conducted to verify such as Block Design, Object Assembly, and Pic-
current results, but the ecological validity of these ture Completion and found that poor performance
traditional tests is promising. Although some predicts problems in daily living.
investigators are not satisfied with traditional tests
for attention deficit/hyperactivity disorder Virtual Tests
(ADHD), researchers have found evidence of pre-
With increasing advances in cyber technology,
dictive validity in the Hayling test in children with
neuropsychological assessments are turning to
ADHD. The Test of Everyday Attention (TEA),
computers as an alternative to real-world behav-
which was developed using the verisimilitude
ioral observations. One innovative approach has
approach, is an assessment tool designed to evalu-
been the use of virtual reality scenarios where sub-
ate attentional switching, selective attention, and
jects are exposed to machines that encompass 3-D,
sustained attention. Investigators have found cor-
real-world-like scenes and are asked to perform
relations between the TEA and other standardized
common functions in these environments allowing
measures of attention, including the Stroop Color-
naturalistic stimulus challenges while maintaining
Word Test, the Symbol Digit Modalities Test, and
experimental control. These tests include the Vir-
the Paced Auditory Serial Addition Tests.
tual Reality Cognitive Performance Assessment
Test, a virtual city, the Virtual Office, and a simu-
Memory Tests lated street to assess memory and executive func-
tioning. These methods suggest that the virtual
The Rivermead Behavioral Memory Test
tests may provide a new, ecological measure for
(RBMT), designed using the verisimilitude
examining memory deficits in patients.
approach, is a standardized test used to assess
everyday memory functioning. The memory tasks
in the RBMT resemble everyday memory demands, Other Applications
such as remembering a name or an appointment.
Academic Tests
Significant correlations have been demonstrated
between the RBMT and other traditional memory There is a high degree of variance in our educa-
tests as well as between the RBMT and ratings of tional systems, from grading scales and curricula
daily functioning by subjects, significant others, to expectations and teacher qualifications. Because
and clinicians. Some studies have revealed the supe- of such high variability, colleges and graduate pro-
riority of the RBMT and the TEA in predicting grams use standardized tests as part of their admis-
everyday memory functioning or more general sions procedure. Investigators have found that the
functioning when compared to more traditional American College Test has low predictive validity
neuropsychological tests. In addition to the RBMT, of first-year grades as well as graduation grades
other tests that take verisimilitude into account for students attending undergraduate programs.
include the 3-Objects-3-Places, the Process Dissoci- Also, correlations have been found between Scho-
ation Procedure, and the Memory in Reality. lastic Assessment Test Math (SATM) and Scholas-
Research also suggests that list learning tasks have tic Assessment Test Verbal (SATV) test scores and
402 Ecological Validity

overall undergraduate GPA, but the SATM may management movement, which uses work environ-
underpredict women’s grades. ments as rehabilitation sites instead of vocational
Studies suggest that the Graduate Record Exam rehabilitation centers, emerged.
(GRE) is capable of at least modestly predicting The Behavioral Assessment of Vocational
first-year grades in graduate school and veterinary Skills is a performance-based measure able to sig-
programs as well as graduate grade point average, nificantly predict workplace performance. Also,
faculty ratings, comprehensive examination scores, studies have shown that psychosocial variables sig-
citation counts and degree attainment across nificantly predict a patient’s ability to function
departments, acceptance into PhD programs, exter- effectively at work. When predicting employment
nal awards, graduation on time, and thesis publica- status, the Minnesota Multiphasic Personality
tion. Although some studies suggest that the GRE Inventory is one measure that has been shown to
is an ecologically valid tool, there is debate about add ecological validity to neuropsychological test
how much emphasis to place on GRE scores in the performance.
postgraduate college admissions process. The Med-
ical College Admission Test (MCAT) has been
Employment
shown to be predictive of success on written tests
assessing skills in clinical medicine. In addition, the Prediction of a person’s ability to resume
MCAT was able to positively predict performance employment after disease or injury has become
on physician certification exams. increasingly important as potential employers turn
to neuropsychologists with questions about job
Activities of Daily Living capabilities, skills, and performance. Ecological
validity is imperative in assessment of employabil-
To evaluate patients’ ability to function inde-
ity because of the severe consequences of inaccu-
pendently, researchers have investigated the accu-
rate diagnosis. One promising area of test
racy of neuropsychological tests in predicting
development is simulated vocational evaluations
patients’ capacities to perform activities of daily
(SEvals), which ask participants to perform a vari-
living (ADL), such as walking, bathing, dressing,
ety of simulated vocational tasks in environments
and eating. Research studies found that neuropsy-
that approximate actual work settings. Research
chological tests correlated significantly with cogni-
suggests that the SEval may aid evaluators in
tive ADL skills involving attention and executive
making vocational decisions. Other attempts at
functioning. Overall, ADL research demonstrates
improving employment predictions include the
low to moderate levels of ecological validity. Eco-
Occupational Abilities and Performance Scale and
logical validity is improved, however, when the
two self-report questionnaires, the Work Adjust-
ADLs evaluated have stronger cognitive compo-
ment Inventory and the Working Inventory.
nents. Driving is an activity of daily living that has
been specifically addressed in ecological validity
literature, and numerous psychometric predictors Forensic Psychology
have been identified. But none does better at pre-
Forensic psychology encompasses a vast spec-
diction than the actual driving of a small-scale
trum of legal issues including prediction of recidi-
vehicle on a closed course. In like manner, a wheel-
vism, identification of malingering, and assessment
chair obstacle course exemplifies an ecologically
of damages in personal injury and medico-legal
valid outcome measure for examining the outcome
cases. There is little room for error in these predic-
of visual scanning training in persons with right
tions as much may be at stake.
brain damage.
Multiple measures have been studied for the
prediction of violent behavior, including (a) the
Vocational Rehabilitation
Psychopathic checklist, which has shown predictive
Referral questions posed to neuropsychologists ability in rating antisocial behaviors such as crimi-
have shifted from diagnostic issues to rehabilitative nal violence, recidivism, and response to correc-
concerns. In an effort to increase the ecological tional treatment; (b) the MMPI-2, which has
validity of rehabilitation programs, the disability a psychopathy scale (scale 4) and is sensitive to
Ecological Validity 403

antisocial behavior; (c) the California Psychological delineate the relationship between particular cogni-
Inventory, which is a self-report questionnaire that tive constructs and more specific everyday abilities
provides an estimate of compliance with society’s involving those constructs may increase the ecologi-
norms; and (d) the Mental Status Examination, cal validity of neuropsychological tests. However,
where the evaluator obtains personal history infor- there is some disagreement as to which tests appro-
mation, reactions, behaviors, and thought pro- priately measure various cognitive constructs. Pres-
cesses. In juveniles, the Youth Level of Service Case ently, comparing across ecological validity research
Management Inventory (YLS/CMI) has provided studies is challenging because of the wide variety of
significant information for predicting recidivism in outcome measures, neuropsychological tests, and
young offenders; however, the percentage variance populations assessed.
predicted by the YLS/CMI was low. Great strides have been made in understanding
Neuropsychologists are often asked to deter- the utility of traditional tests and developing new
mine a client’s degree of cognitive impairment after and improved tests that increase psychologists’
a head injury so that the estimated lifetime impact abilities to predict people’s functioning in everyday
can be calculated. In this respect, clinicians must life. As our understanding of ecological validity
be able to make accurate predictions about the increases, future research should involve more
severity of the cognitive deficits caused by the encompassing models, which take other variables
injury. into account aside from test results. Interviews
with the client’s friends and family, medical and
employment records, academic reports, client com-
Limitations and Implications for the Future
plaints, and direct observations of the client can be
In addition to cognitive capacity, other variables helpful to clinicians faced with ecological ques-
that influence individuals’ everyday functioning tions. In addition, ecological validity research
include environmental cognitive demands, compen- should address test environments, environmental
satory strategies, and noncognitive factors. These demands, compensatory strategies, noncognitive
variables hinder researchers’ attempts at demon- factors, test and outcome measure selection, and
strating ecological validity. With regard to environ- population effects in order to provide a foundation
mental cognitive demands, for example, an from which general conclusions can be drawn.
individual in a more demanding environment will
demonstrate more functional deficits in reality than William Drew Gouvier, Alyse A. Barker,
an individual with the same cognitive capacity in and Mandi Wilkes Musso
a less demanding environment. To improve ecologi-
See also Concurrent Validity; Construct Validity; Face
cal validity, the demand characteristics of an indivi-
Validity; Predictive Validity
dual’s environment should be assessed. Clients’
consistency in their use of compensatory strategies
across situations will also affect ecological validity. Further Readings
Clinicians may underestimate a client’s everyday Chaytor, N., & Schmitter-Edgecombe, M. (2003). The
functional abilities if compensatory strategies are ecological validity of neuropsychological tests: A
not permitted during testing or if the client simply review of the literature on everyday cognitive skills.
chooses not to use his or her typical repertoire of Neuropsychology Review, 13(4), 181197.
compensatory skills during testing. Also, noncogni- Farias, S. T., Harrell, E., Neumann, C., & Houtz, A.
tive factors, including psychopathology, malinger- (2003). The relationship between neuropsychological
ing, and premorbid functioning, impede the performance and daily functioning in individuals with
predictive ability of assessment instruments. Alzheimer’s disease: Ecological validity of
neuropsychological tests. Archives of Clinical
A dearth of standardized outcome measures, var-
Neuropsychology, 18, 655672.
iable test selection, and population effects are other Farmer, J., & Eakman, A. (1995). The relationship
limitations of ecological validity research. Mixed between neuropsychological functioning and
results in current ecological validity literature may instrumental activities of daily living following
be a result of using inappropriate outcome mea- acquired brain injury. Applied Neuropsychology, 2,
sures. More directed hypotheses attempting to 107115.
404 Effect Coding

Heaton, R. K., & Pendleton, M. G. (1981). Use of We denote by X the N × ðJ þ 1Þ augmented


neuropsychological tests to predict adult patients’ matrix collecting the data for the independent vari-
everyday functioning. Journal of Consulting and ables (this matrix is called augmented because the first
Clinical Psychology, 49(6), 807821. column is composed only of 1s), and by y the N × 1
Kibby, M. Y., Schmitter-Edgecombe, M., & Long, C. J.
vector of observations for the dependent variable.
(1998). Ecological validity of neuropsychological tests:
Focus on the California Verbal Learning Test and the
These two matrices have the following structure.
Wisconsin Card Sorting Test. Archives of Clinical 2 3
1 x1;1    x1;k    x1;K
Neuropsychology, 13(6), 523534.
6. .. .. .. .. .. 7
McCue, M., Rogers, J., & Goldstein, G. (1990). 6. 7
6 . . . . . . 7
Relationships between neuropsychological and 6 7
functional assessment in elderly neuropsychiatric X ¼6 6 1 x n;1    x n;k    xn;K
7
7
patients. Rehabilitation Psychology, 35, 9199. 6. .. .. .. .. .. 7
6. 7
Norris, G., & Tate, R. L. (2000). The Behavioural 4. . . . . . 5
Assessment of the Dysexecutive Syndrome (BADS): 1 xN;1  xN;k  xN;k
Ecological concurrent and construct validity.
Neuropsychological Rehabilitation, 10(1), 3345. ð2Þ
2 3
Schmuckler, M. A. (2001). What is ecological validity? A y1
dimensional analysis. Infancy, 2(4), 419436. 6. 7
Silver, C. H. (2000). Ecological validity of 6. 7
6. 7
neuropsychological assessment in childhood traumatic 6 7
and y ¼ 6
6 yn
7:
7
brain injury. Journal of Head Trauma Rehabilitation, 6. 7
15(4), 973988. 6. 7
4. 5
Sbordone, R. J., & Long, C. (1996). Ecological validity
of neuropsychological testing. Delray Beach, FL: GR yN
Press/St. Lucie Press.
Wood, R. L., & Liossi, C. (2006). The ecological validity The predicted values of the dependent variable
of executive tests in a severely brain injured sample. ^ are collected in a vector denoted ^y and are
Y
Archives of Clinical Neuropsychology, 21, 429437. obtained as
 1
^y ¼ Xb with b ¼ XT X XT y: ð3Þ
EFFECT CODING
where T denotes the transpose of a matrix and the
vector b has J components. Its first component is
Effect coding is a coding scheme used when an traditionally denoted b0, it is called the intercept
analysis of variance (ANOVA) is performed with of the regression, and it represents the regression
multiple linear regression (MLR). With effect cod- component associated with the first column of the
ing, the experimental effect is analyzed as a set of matrix X. The additional J components are called
(nonorthogonal) contrasts that opposes all but one slopes, and each of them provides the amount of
experimental condition to one given experimental change in Y consecutive to an increase in one unit
condition (usually the last one). With effect cod- of its corresponding column.
ing, the intercept is equal to the grand mean, and The regression sum of squares is obtained as
the slope for a contrast expresses the difference
between a group and the grand mean. T T 1 T 2
SSregression ¼ b X y  ð1 yÞ ð4Þ
N
Multiple Regression Framework (with 1T being a row vector of 1s conformable
In linear multiple regression analysis, the goal is to with y). The total sum of squares is obtained as
predict, knowing the measurements collected on N
1 T 2
subjects, a dependent variable Y from a set of J SStotal ¼ yT y  ð1 yÞ : ð5Þ
independent variables denoted N
The residual (or error) sum of squares is
fX1 ; . . . ; Xj ; . . . ; XJ g: ð1Þ obtained as
Effect Coding 405

T T T the dispersion between means) with the dispersion


SSerror ¼ y y  b X y: ð6Þ of the experimental scores to the means (i.e., the
The quality of the prediction is evaluated by dispersion within the groups). Specifically, the
computing the multiple coefficient of correla- dispersion between the means is evaluated by com-
tion denoted R2Y:1;...; J . This coefficient is equal to puting the sum of squares between means, denoted
the squared coefficient of correlation between the SSbetween, and computed as
dependent variable (Y) and the predicted depen-
^
dent variable (Y). X
K
SSbetween ¼ I × ðMþ;k  Mþ;þ Þ2 : ð9Þ
An alternative way of computing the multiple k
coefficient of correlation is to divide the regression
sum of squares by the total sum of squares. This The dispersion within the groups is evaluated by
shows that R2Y:1;...; J can also be interpreted as the computing the sum of squares within groups,
proportion of variance of the dependent variable denoted SSwithin and computed as
explained by the independent variables. With this
interpretation, the multiple coefficient of correla- K X
X I
SSwithin ¼ ðYi;k  Mþ;k Þ2 : ð10Þ
tion is computed as
k i

SSregression SSregression
R2Y:1;...; J ¼ ¼ : ð7Þ If the dispersion of the means around the grand
SSregression þ SSerror SStotal
mean is due only to random fluctuations, then the
SSbetween and the SSwithin should be commensura-
Significance Test ble. Specifically, the null hypothesis of no effect
can be evaluated with an F ratio computed as
In order to assess the significance of a given
R2Y:1;...; J , we can compute an F ratio as SSbetween N  K
F ¼ × : ð11Þ
SSwithin K1
R2Y:1;...; J NJ1
F ¼ × : ð8Þ Under the usual assumptions of normality of the
1  R2Y:1;...; J J
error and of independence of the error and the
Under the usual assumptions of normality of the scores, this F ratio is distributed under the null
error and of independence of the error and the hypothesis as a Fisher distribution with ν1 ¼ K  1
scores, this F ratio is distributed under the null and ν2 ¼ N  K degrees of freedom. If we denote
hypothesis as a Fisher distribution with ν1 ¼ J and by R2experimental the following ratio
ν2 ¼ N  J  1 degrees of freedom.
SSbetween
R2experimental ¼ ‚ ð12Þ
SSbetween þ SSwithin
Analysis of Variance Framework
we can re-express Equation 11 in order to show its
For an ANOVA, the goal is to compare the means similarity with Equation 8 as
of several groups and to assess whether these
means are statistically different. For the sake of R2experimental NK
simplicity, we assume that each experimental F ¼ × : ð13Þ
1 R2experimental K1
group comprises the same number of observations
denoted I (i.e., we are analyzing a ‘‘balanced
design’’). So, if we have K experimental groups Analysis of Variance With Effect
with a total of I observations per group, we have
Coding Multiple Linear Regression
a total of K × I ¼ N observations denoted Yi;k .
The first step is to compute the K experimental The similarity between Equations 8 for MLR
means denoted Mþ;k and the grand mean denoted and 13 for ANOVA suggests that these two
Mþ;þ . The ANOVA evaluates the difference methods are related, and this is indeed the case.
between the mean by comparing the dispersion of In fact, the computations for an ANOVA can be
the experimental means to the grand mean (i.e., performed with MLR via a judicious choice of
406 Effect Coding

the matrix X (the dependent variable is repre- Table 2 ANOVA Table for the Data From Table 1
sented by the vector y). In all cases, the first col- Source df SS MS F Pr(F)
umn of X will be filled with 1s and is coding for Experimental 3 150.00 50.00 10.00 .0044
the value of the intercept. One possible choice Error 8 40.00 5.00
for X, called mean coding, is to have one addi-
tional column in which the value for the nth Total 11 190.00
observation will be the mean of its group. This
approach provides a correct value for the sums
of squares but not for the F (which needs to be In order to perform an MLR analysis, the data
divided by K  1). Most coding schemes will use from Table 1 need to be ‘‘vectorized’’ in order to
J ¼ K  1 linearly independent columns (as provide the following y vector:
many columns as there are degrees of freedom 2 3
20
for the experimental sum of squares). They all 6 17 7
give the same correct values for the sums of 6 7
6 17 7
squares and the F test but differ for the values of 6 7
6 21 7
the intercept and the slopes. To implement effect 6 7
6 16 7
coding, the first step is to select a group called 6 7
6 14 7
the contrasting group; often, this group is the y ¼ 6 7
6 17 7: ð14Þ
last one. Then, each of the remaining J groups is 6 7
6 16 7
contrasted with the contrasting group. This is 6 7
6 15 7
implmented by creating a vector for which all 6 7
6 87
elements of the contrasting group have the value 6 7
4 11 5
1, all elements of the group under consider-
ation have the value of þ 1, and all other ele- 8
ments have a value of 0.
With the effect coding scheme, the intercept is In order to create the N ¼ 12 by J þ 1 ¼ 3 þ
equal to the grand mean, and each slope coefficient 1 ¼ 4X matrix, we have selected the fourth experi-
is equal to the difference between the grand mean mental group to be the contrasting group. The first
and the mean of the group whose elements were column of X codes for the intercept and is composed
coded with values of 1. This difference estimates only of 1s. For the other columns of X, the values
the experimental effect of this group, hence the for the observations of the contrasting group will all
name of effect coding for this coding scheme. The be equal to 1. The second column of X will use
mean of the contrasting group is equal to the inter- values of 1 for the observations of the first group,
cept minus the sum of all the slopes. the third column of X will use values of 1 for the
observations of the second group, and the fourth col-
umn of X will use values of 1 for the observations of
Example the third group:
2 3
The data used to illustrate effect coding are shown 1 1 0 0
in Table 1. A standard ANOVA would give the 61 1 0 07
6 7
results displayed in Table 2. 61 1 0 07
6 7
61 0 1 0 7
6 7
Table 1 A Data Set for an ANOVA 61 0 1 07
6 7
a1 a2 a3 a4 61 0 1 07
6
X ¼ 6 7: ð15Þ
S1 20 21 17 8 61 0 0 177
S2 17 16 16 11 61 0 0 1 7
6 7
S3 17 14 15 8 61 0 0 17
6 7
Ma. 18 17 16 9 M:: ¼ 15 6 1 1 1 1 7
6 7
4 1 1 1 1 5
Note: A total of N ¼ 12 observations coming from K ¼ 4
groups with I ¼ 3 observations per group. 1 1 1 1
Effect Size, Measures of 407

With this effect coding scheme, we obtain the fol-


lowing b vector of regression coefficients: EFFECT SIZE, MEASURES OF
2 3
15 Effect size is a statistical term for the measure of
6 3 7 associations between two variables. It is widely
b ¼ 6 7
4 2 5: ð16Þ
used in many study designs, such as meta-analysis,
1 regression, and analysis of variance (ANOVA).
The presentations of effect size in these study
We can check that the intercept is indeed equal designs are usually different. For example, in
to the grand mean (i.e., 15) and that the slope cor- meta-analysis—an analysis method for combining
responds to the difference between the correspond- and summarizing research results from different
ing groups and the grand mean. When using the studies—effect size is often represented as the stan-
MLR approach to the ANOVA, the predicted dardized difference between two continuous vari-
values correspond to the group means, and this is ables’ means. In analysis of variance, effect size
indeed the case here. can be interpreted as the proportion of variance
explained by a certain effect versus total variance.
In each study design, due to the characteristic of
Alternatives variables, say, continuous versus categorical, there
The two main alternatives to effect coding are are several ways to measure the effect size. This
dummy coding and contrast coding. Dummy cod- entry discusses the measure of effect size by differ-
ing is quite similar to effect coding, the only differ- ent study designs.
ence being that the contrasting group is always
coded with values of 0 instead of 1. With Measures and Study Designs
dummy coding, the intercept is equal to the mean
of the contrasting group, and each slope is equal Meta-Analysis
to the mean of the contrasting group minus the
Meta-analysis is a study of methodology to
mean of the group under consideration. For con-
summarize results across studies. Effect size was
trast coding, a set of (generally orthogonal, but lin-
introduced as standardized mean differences for
ear independent is sufficient) J contrasts is chosen
continuous outcome. This is especially important
for the last J columns of X. The values of the inter-
for studies that use different scales. For example,
cept and slopes will depend upon the specific set of
in a meta-analysis for the study of different effects
contrasts used.
for schizophrenia from a drug and a placebo,
Hervé Abdi researchers usually use some standardized scales to
measure patients’ situation. These scales can be the
See also Analysis of Covariance (ANCOVA); Analysis of Positive and Negative Syndrome Scale (PANSS) or
Variance (ANOVA); Contrast Analysis; Dummy the Brief Psychiatric Rating Scale (BPRS). The
Coding; Mean Comparisons; Multiple Regression PANSS is a 30-item scale, and scores range from
30 to 210. The BPRS scale is a 16-item scale, and
Further Readings one can score from 16 to 112. Difference studies
may report results measured on either scale. When
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. a researcher needs to use meta-analysis to combine
(2009). Experimental design and analysis for studies with both scales reported, it would be bet-
psychology. Oxford, UK: Oxford University Press. ter for him or her to convert those study results
Edwards, A. L. (1985). Data analysis for research
into a common standardized score so that those
designs. New York: Freeman.
Edwards, A. L. (1985). Multiple regression analysis and
study results become comparable. Cohen’s d and
the analysis of variance and covariance. New York: Hedge’s g are common effect sizes used in meta-
Freeman. analysis with continuous outcomes.
Fox, J. (2008). Applied regression analysis and For a dichotomized outcome, the odds ratio is
generalized linear models. Thousand Oaks, CA: Sage. often used as an indicator of effect size. For
408 Effect Size, Measures of

example, a researcher may want to find out corresponding numbers are 0.2, 0.5, and 0.8,
whether smokers have greater chances of having respectively. In the example above, the effect size
lung cancer compared to nonsmokers. He or she of 0.27 is small, which indicates that the difference
may do a meta-analysis with studies reporting between reading and writing exam scores for boys
how many patients, among smokers and nonsmo- and girls is small.
kers, were diagnosed with lung cancer. The odds
ratio is appropriate to use when the report is for Glass’s g
a single study. One can compare study results by Similarly, Glass proposed an effect size estima-
investigating odds ratios for all of these studies. tor using a control group’s standard deviation to
The other commonly used effect size in meta- E E C
analysis is the correlation coefficient. It is a more standardize mean difference: g0 ¼ x sxC . Here, x

direct approach to tell the association between and xC are the sample means of an experimental
two variables. group and a control group, respectively, and sC is
the standard deviation of the control group.
Cohen’s d This effect size assumes multiple treatment com-
Cohen’s d is defined as the population means dif- parisons to the control group, and that treatment
ference divided by the common standard deviation. standard deviations differ from each other.
This definition is based on the t-test on means and
Hedge’s g
can be interpreted as the standardized difference
between two means. Cohen’s d assumes equal vari- However, neither Cohen’s d nor Glass’s g takes
ance of the two populations. For two independent the sample size into account, and the equal popu-
samples, it can be expressed as d ¼ mA m σ
B
for lation variance assumption may not hold. Hedge
a one-tailed effect size index and d ¼ jmA m Bj
for proposed a modification to estimate effect size as
σ
a two-tailed effect size index. Here, mA and mB are
two population means in their raw scales, and σ is xE  xC
g ¼ ; where
the standard deviation of either population (both s
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

population means have equal variance). Because the 2 2
ðnE  1ÞðsE Þ  ðnC  1ÞðsC Þ
population means and standard deviations are usu- s ¼ :
ally unknown, sample means and standard devia- nE þ nC  2
tions are used to estimate Cohen’s d. One-tailed and
two-tailed effect size index for t-test of means in Here, nE and nC are sample sizes of treatment and
control, and sE and sC are sample standard devia-
standard units are d ¼ xA x s
B
and d ¼ jxA x s
Bj
,
tions of treatment and control. Comparing to above
where xA and xB are sample means, and s is the
effect size estimators, Hedges’s g uses pooled sample
common standard deviation of both samples.
standard deviations to standardize mean difference.
For example, a teacher wanted to know
However, the above estimator has a small sam-
whether the ninth-grade boys or the ninth-grade
ple bias. An approximate unbiased estimator of
girls in her school were better at reading and writ-
effect size defined by Hedges and Olkin is
ing. She randomly selected 10 boys and 10 girls
from all ninth-grade students and obtained the xE  xC 3
reading and writing exam score means of all boys g ¼ ð1  Þ; where
s 4N  9
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
and girls, say, 67 and 71, respectively. She found
2 2
that the standard deviations of both groups were ðnE  1ÞðsE Þ  ðnC  1ÞðsC Þ
s ¼ :
15. Then for a two-tailed test, the effect size is nE þ nC  2

jxA  xB j j67  71j Here, N is the total sample size of both groups.
d ¼ ¼ ¼ 0:27: Especially when sample sizes in treatment and
s 15
control are equal, this estimator is the unique min-
Cohen used the terms small, medium, and large imum variance unbiased estimator. Many meta-
to represent relative size of effect sizes. The analysis software packages like Metawin use
Effect Size, Measures of 409

Hedge’s g as the default effect size for continuous Analysis of Variance


outcomes.
There are some other forms of effect size used
to measure the magnitude of effects. These effect
Odds Ratio
sizes are often used in analysis of variance
Odds ratio is a commonly used effect size for (ANOVA). They measure how much variance is
categorical outcomes. Odds ratio is the ratio of introduced by a new explanatory variable, and
odds in Category 1 versus odds in Category 2. For they are ratios of extra variance caused by an
example, a researcher wanted to find out the rela- explanatory variable versus total variance. These
tionship between smoking and getting lung cancer. measures include squared correlation coefficient
He recruited two groups of subjects: smokers and and correlation ratio, eta-squared and partial eta-
nonsmokers. After a few years of following up, he squared, omega squared, and intraclass correla-
found that there were N11 subjects diagnosed with tion. The following paragraphs give a brief
lung cancer among smokers and N21 subjects discussion of these effect sizes. For easier under-
among nonsmokers. There were N12 and N22 sub- standing, one-way ANOVA is used as an example.
jects who didn’t have lung cancer. The odds of
having lung cancer among smokers and nonsmo- Squared Correlation Coefficient (r2)
kers are estimated as N11 N21
N12 and N22 , respectively. The
In regression analysis, the squared correlation
odds ratio of having lung cancer in smokers com- coefficient is a commonly used effect size based on
pared to nonsmokers
. is the ratio of the above two σ 2T σ 2RL
N11 N21 variance. The expression is r2 ¼ , where σ 2T
odds, which is N12 N22 ¼N11 N22
N12 N21 .
σ 2T

The scale is different from the effect sizes of con- is the total variance of the dependent variable and
tinuous variables, such as Cohen’s d and Hedge’s g, σ 2RL is the variance explained by other variables.
so it is not appropriate to compare the size of the The range of the squared correlation coefficient is
odds ratio with the effect sizes described above. from 0 to 1. It can be interpreted as the proportion
of variance shared by two variables. For example,
Pearson Correlation Coefficient (r) an r2 of 0.35 means that 35% of the total variance
is shared by two variables.
The Pearson correlation coefficient (r) is also
a popular effect size. It was first introduced by
Eta-Squared (η2 ) and Partial Eta-Squared (η2p )
Karl Pearson to measure the strength of the rela-
tionship between two variables. The range of the Eta-squared and partial eta-squared are effect
Pearson correlation coefficient is from 1 to 1. sizes used in ANOVAs to measure degree of associ-
Cohen gave general guidelines for the relative sizes ation in a sample. The effect can be the main effect
of the Pearson correlation coefficient as small, r ¼ or interaction in an analysis of variance model. It
0.1; medium, r ¼ 0.3; and large, r ¼ 0.5. Many is defined as the sum of squares of the effect versus
statistical packages, such as SAS and IBMâ SPSSâ the sum of squares of the total. Eta-squared can be
(PASW) 18.0 (an IBM company, formerly called interpreted as the proportion of variability caused
PASWâ Statistics), and Microsoft’s Excel can by that effect for the dependent variable. The
compute the Pearson correlation coefficient. range of an eta-squared is from 0 to 1. Suppose
Many meta-analysis software packages, such as there is a study of the effects of education and
Metawin and Comprehensive Meta-Analysis, experience on salary, and that the eta-squared of
allow users to use correlations as data and calcu- the education effect is 0.35. This means that 35%
late effect size with Fisher’s z transformation. The of the variability in salary was caused by
transformation formula is z ¼ 12 lnð11rþr
Þ, where r education.
is the correlation coefficient. Eta-squared is additive, so the eta-squared of all
The square of the correlation coefficient is also effects in an ANOVA model sums to 1. All effects
an effect size used to measure how much variance include all main effects and interaction effects, as
is explained by one variable versus the total vari- well as the intercept and error effect in an ANOVA
ance. It is discussed in detail below. table.
410 Effect Size, Measures of

Partial eta-squared is defined as the sum of explanatory variables, to 1, meaning all variances
squares of the effect versus the sum of squares can be explained by explanatory variables.
of the effect plus error. For the same effect in
the same study, partial eta-squared is always ω and Cramer’s V
larger than eta-squared. This is because the
The effect sizes ω and Cramer’s V are often used
denominator in partial eta-squared is smaller
for categorical data based on chi-square. Chi-
than that in eta-squared. For the previous
square is a nonparametric statistic used to test
example, the partial eta-squared may be 0.49
potential difference among two or more categori-
or 0.65; it cannot be less than 0.35. Unlike eta-
cal variables. Many statistical packages, such as
squared, the sum of all effects’ partial eta-
SAS and SPSS, show this statistic in output.
squared may not be 1, and in fact can be larger
The effect size ω can be calculated from chi-
than 1. qffiffiffiffi
2
The statistical package SPSS will compute and square and total sample size N as ω ¼ χN .
print out partial eta-squared as the effect size for However, this effect size is used only in the cir-
analysis of variance. cumstances of 2 × 2 contingency tables. Cohen
gave general guidelines for the relative size of
Omega Squared (ω2 ) ω: 0.1, 0.3, and 0.5 represent small, medium,
Omega squared is an effect size used to mea- and large effect sizes, respectively.
sure the degree of association in fixed and ran- For a table size greater than 2 × 2, one can use
dom effects analysis of variance study. It is the Cramer’s V (sometimes called Cramer’s ’) as the
relative reduction variance caused by an effect. effect size to measure the strength of association.
Unlike eta-squared and partial eta-squared, Popular statistical software such as SAS and SPSS
omega squared is an estimate of the degree of can compute this statistic. One can also calculate
association in a population, instead of in it from the chi-square statistic using the formula
qffiffiffiffiffiffiffi
a sample. V ¼ NL χ2
, where N is the total sample size and
Intraclass Correlation (ρ2I ) L equals the number of rows minus 1 or the num-
ber of columns minus 1, whichever is less. The
Intraclass correlation is also an estimate of the effect size of Cramer’s V can be interpreted as the
degree of association in a population in random average multiple correlation between rows and
effects models, especially in psychological studies. columns. In a 2 × 2 table, Cramer’s V is equal to
The one-way intraclass correlation coefficient is the correlation coefficient. Cohen’s guideline for ω
defined as the proportion of variance of random is also appropriate for Cramer’s V in 2 × 2 contin-
effect versus the variance of this effect and error var- gency tables.
MS MSerror
iance. One estimator is ρ^2I ¼ MS effect ,
effect þ dfeffect MSerror
where MSeffect and MSerror are mean squares of the Reporting Effect Size
effect and error, that is, the mean squares of
between-group and within-group effects. Reporting effect size in publications, along with
the traditional null hypothesis test, is important.
The null hypothesis test tells readers whether an
Other Effect Sizes effect exists, but it won’t tell readers whether the
results are replicable without reporting an effect
size. Research organizations such as the American
R2 in Multiple Regression
Psychological Association suggest reporting effect
As with the other effect sizes discussed in the size in publications along with significance tests. A
regression or analysis of variance sections, R2 is general rule for researchers is that they should at
a statistic used to represent the portion of variance least report descriptive statistics such as mean and
explained by explanatory variables versus the total standard deviation. Thus, effect size can be calcu-
variance. The range of R2 is from 0, meaning no lated and used for meta-analysis to compare with
relation between the dependent variable and other studies.
Endogenous Variables 411

There are many forms of effect sizes. One


must choose to calculate appropriate effect size ENDOGENOUS VARIABLES
based on the purpose of the study. For example,
Cohen’s d and Hedge’s g are often used in meta- Endogenous variables in causal statistical modeling
analysis to compare independent variables’ are variables that are hypothesized to have one or
means. In meta-analysis with binary outcomes, more variables at least partially explaining them.
the odds ratio is often used to combine study Commonly referred to in econometrics and the
results. The correlation coefficient is good for structural equation modeling family of statistical
both continuous and categorical outcomes. To techniques, endogenous variables may be effect
interpret variances explained by effect(s), one variables that precede other endogenous variables;
should choose effect sizes from r2, eta-squared, thus, although some consider endogenous vari-
omega squared, and so on. ables to be dependent, such a definition is techni-
Many popular statistical software packages cally incorrect.
can compute effect sizes. For example, meta- Theoretical considerations must be taken into
analysis software such as Metawin and Compre- account when determining whether a variable is
hensive Meta-Analysis can compute Hedge’s g, endogenous. Endogeneity is a property of the
odds ratio, and the correlation coefficient based model, not the variable, and will differ among
on data type. Statistical software such as SPSS models. For example, if one were to model the
gives eta-squared as the effect size in analysis of effect of income on adoption of environmental
variance procedures. Of course, one can use sta- behaviors, the behaviors would be endogenous,
tistical packages such as SAS or Excel to calcu- and income would be exogenous. Another model
late effect size manually based on available may consider the effect of education on income; in
statistics. this case, education would be exogenous and
Qiaoyan Hu income endogenous.

See also Analysis of Variance (ANOVA); Chi-Square Test; The Problem of Endogeneity
Correlation; Hypothesis; Meta-Analysis
One of the most commonly used statistical models
is ordinary least squares regression (OLS). A vari-
ety of assumptions must hold for OLS to be the
Further Readings best unbiased estimator, including the indepen-
dence of errors. In regression models, problems
Cohen, J. (1988). Statistical power analysis for the with endogeneity may arise when an independent
behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence variable is correlated with the error term of an
Erlbaum.
endogenous variable. When observational data are
Hedges, L. V., & Olkin, I. (1985). Statistical methods for
meta-analysis. Orlando, FL: Academic Press.
used, as is the case with many studies in the social
Iversen, G. R., & Norpoth, H. (1976). Analysis of sciences, problems with endogeneity are more
variance (Sage University Paper Series on Quantitative prevalent. In cases where randomized, controlled
Applications in the Social Sciences, 07001). Beverly experiments are possible, such problems are often
Hills, CA: Sage. avoided.
Kirk, R. E. (1982). Experimental design: Procedures for Several sources influence problems with endo-
the behavioral sciences (2nd ed.). Belmont, CA: geneity: when the true value or score of a variable
Brooks/Cole. is not actually observed (measurement error),
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta- when a variable that affects the dependent variable
analysis. Thousand Oaks, CA: Sage.
is not included in the regression, and when recur-
Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A.,
& Song, F. (2000). Methods for meta-analysis in
sivity exists between the dependent and indepen-
medical research (2nd ed.). London: Wiley. dent variables (i.e., there is a feedback loop
Volker, M. A. (2006). Reporting effect size estimates in between the dependent and independent variables).
school psychology research. Psychology in the Each of these sources may occur alone or in con-
Schools, 43(6), 653672. junction with other sources.
412 Error

Further Readings
A B C
Bound, J., Jaeger, D. A., & Baker, R. M. (1995).
Problems with instrumental variables estimation when
the correlation between the instruments and the
Figure 1 Variables B and C Are Endogenous explanatory variable is weak. Journal of the American
Statistical Association, 90, 443450.
Kennedy, P. (2008). A guide to econometrics. Malden,
The solution to problems with endogeneity is MA: Blackwell.
often to use instrumental variables. Instrumental Kline, R. B. (2005). Principles and practice of structural
equation modeling (2nd ed.). New York: Guilford.
variables methods include two-stage least squares,
limited information maximum likelihood, and
jackknife instrumental variable estimators. Advan-
tages of instrumental variables estimation include
the transparency of procedures and the ability to ERROR
test the appropriateness of instruments and the
degree of endogeneity. Instrumental variables are Error resides on the statistical side of the fault line
beneficial only when they are strongly correlated separating the deductive tools of mathematics
with the endogenous variable and when they are from the inductive tools of statistics. On the math-
exogenous to the model. ematics side of the chasm lays perfect information,
and on the statistics side exists estimation in the
face of uncertainty. For the purposes of estimation,
error describes the unknown, provides a basis for
comparison, and serves as a hypothesized place-
Endogenous Variables in
holder enabling estimation. This entry discusses
Structural Equation Modeling the role of error from a modeling perspective and
In structural equation modeling, including path in the context of regression, ordinary least squares
analysis, factor analysis, and structural regression estimation, systematic error, random error, error
models, endogenous variables are said to be distributions, experimentation, measurement error,
‘‘downstream’’ of either exogenous variables or rounding error, sampling error, and nonsampling
other endogenous variables. Thus, endogenous error.
variables can be both cause and effect variables.
Consider the simple path model in Figure 1.
Variable A is exogenous; it does not have any vari- Modeling
ables causally prior to it in the model. B is endoge-
For practical purposes, the universe is stochastic.
nous; it is affected by the exogenous variable A
For example, any ‘‘true’’ model involving gravity
while affecting C. C is also an endogenous vari-
would require, at least, a parameter for every par-
able, directly affected by B and indirectly affected
ticle in the universe. One application of statistics is
by A.
to quantify uncertainty. Stochastic or probabilistic
As error associated with the measurement of
models approximate relationships within some
endogenous variables can bias standardized
locality that contains uncertainty. That is, by hold-
direct effects on endogenous variables, structural
ing some variables constant and constraining
equation modeling uses multiple measures of
others, a model can express the major relation-
latent constructs in order to address measure-
ships of interest within that locality and amid an
ment error.
acceptable amount of uncertainty. For example,
Kristin Floress a model describing the orbit of a comet around the
sun might contain parameters corresponding to
See also Exogenous Variables; Latent Variable; Least the large bodies in the solar system and account
Squares, Methods of; Regression to the Mean; for all remaining gravitational pulls with an error
Structural Equation Modeling term.
Error 413

Model equations employ error terms to repre- remaining steps in building least squares regres-
sent uncertainty or the negligible contributions. sion, which combines two concepts involving
Error terms are often additive or multiplicative errors: Estimate the coefficients by minimizing
placeholders, and models can have multiple error
ε2i ¼ 0, and assume that εij IID Nðo; σ 2 Þ. This
e
terms. progression of statistical innovations has culminated
in a family of regression techniques incorporating
Additive: E ¼ MC2 þ ε, where ε is an error term a variety of estimators and error assumptions.
perfecting the equation Statistical errors are the placeholders represent-
Multiplicative: y ¼ α þ xβε ing that which remains unquantified or inconsis-
tent in a hypothesized relationship. In assuming
Other: y ¼ eβðx þ εME Þ þ ε, where εME is
that these inconsistencies behave reasonably,
measurement error corresponding to x, and ε is an
additive error term.
researchers are able to find reasonable solutions.

Development of Regression Ordinary Least Squares Estimation

The traditional modeling problem is to solve a set Ordinary least squares (OLS) estimates are derived
of inconsistent equations—characterized by the from fitting one equation to explain a set of incon-
presence of more equations than unknowns. Early sistent equations. There are basically six assump-
researchers cut their teeth on estimating physical tions implicit in OLS estimation, all of which
relationships in astronomy and geodesy—the study regard errors as follows:
of the size and shape of the earth—expressed
1. Misspecification error is negligible—the
by a set of k inconsistent linear equations of the functional form is reasonable and no significant
following form: xs are absent from the model.

y1 ¼ β0 þ β1 x11 þ β2 x12 þ    þ βp x1p 2. Least squares,


ε2i ¼ 0, estimation is
reasonable for this application.
y2 ¼ β0 þ β1 x21 þ β2 x22 þ    þ βp x2p
3. Measurement error is negligible—y and the xs
.. are accurately measured.
.
yk ¼ β0 þ β1 xk1 þ β2 xk2 þ    þ βp xkp ‚ 4. Error terms are independent.
5. Error terms are identically distributed.
where the xs and ys are measured values and the
6. Error terms are approximately normal with
p þ 1βs are the unknowns. a mean of zero and the same variance.
Beginning in antiquity, these problems were
solved by techniques that reduced the number of The strength of the solution is sensitive to the
equations to match the number of unknowns. In underlying assumptions. In practice, assumptions
1750, Johann Tobias Mayer assembled his obser- are never proven, only disproven or failed to be
vations of the moon’s librations into a set of incon- disproven. This asymmetrical information regard-
sistent equations. He was able to solve for the ing the validity of the assumptions proves to be
unknowns by grouping equations and setting their hazardous in practice, leading to biased estimates.
sums equal to zero—an early step toward
εi ¼ 0. The statistical errors, εi , in most models are esti-
In 1760, Roger Boscovich began solving inconsis- mated by residuals,
tent equations by minimizing the sum of the abso-
lute errors (
jεi j ¼ 0) subject to an adding-up
^0 þ β
^εi ¼ yi  ^yi ¼ yi  β ^1 xi1 þ β
^ 2 xi2 þ    þ β^p xip :
constraint; by 1786, Pierre-Simon Laplace mini-
mized the largest absolute error; and later, Adrien-
Marie Legendre and Carl Friedrich Gauss began Statistical errors are unobservable, independent,
minimizing the sum of the squared error terms and unconstrained, whereas residuals are observ-
(
ε2i ¼ 0) or the least squares. Sir Francis Galton, able estimates of the statistical errors, not indepen-
Karl Pearson, and George Udny Yule took the dent, and usually constrained to sum to zero.
414 Error

Systematic Errors Versus Random Errors 2. Consider other attention-worthy distributions


for the errors, such as the extreme value
Error can be categorized in two types: systematic distribution, the Poisson, the gamma, the beta,
error—also known as statistical bias, lack-of- and so on. Other distributions can provide
fit error, or fixed error—and random error. System- reliable results at the expense of convenience
atic error describes a separate error pattern that is and familiarity.
recognizable as distinct from all other remaining 3. Contemplate a nonparametric solution.
random error. It can be attributed to some effect Although these techniques might not be
and might be controlled or modeled. Two typical ‘‘distribution free,’’ they are less sensitive, or at
sources of systematic error arise from model-fitting least sensitive in different ways to the
(misspecification) problems and measurement underlying distribution.
error problems. Both can create significant identifi- 4. Try a resampling approach, such as
able error patterns. bootstrapping, to create a surrogate underlying
Random error consists of the remaining unex- distribution.
plained uncertainty that cannot be attributed to
any particular important factor. Random error
comprises the vestiges after removing large dis- Experimental Error
cernable patterns. Hence, random error may The objective of design of experiments is to com-
appear to be homogeneous, when it is actually pare the effects of treatments on similar experi-
a mix of faint, indistinct heterogeneity. mental units. The basis of comparison is relative to
the unexplained behavior in the response, which
Error Distributions might be referred to as unexplained error. If the
Stochastic models include underlying assumptions treatment means differ greatly relative to the unex-
about their error terms. As violations of the plained error, then the difference is assumed to be
assumptions become more extreme, the results statistically significant. The purpose of the design
become less reliable. Robustness describes a mod- is to estimate more accurately both the treatment
el’s reliability in the face of departures from these means and the unexplained variation in the
underlying assumptions. response, y. This can be achieved through refining
In practice, more thought should be given to the design structure and/or increasing the number
robustness and the validity of the assumptions. In of experimental units. Conceptually,
particular, the assumption that the errors are nor- y ¼ Treatment Structure þ Design Structure
mally distributed is dramatically overused. This
distribution provides a convenient, long-tailed, þ Error Structure:
symmetrical shape, and it is ubiquitous as sug-
To illustrate the role of error in a designed
gested by the central limit theorem—introduced by
experiment, consider the one-way analysis of vari-
Laplace, circa 1810. The central limit theorem
ance (ANOVA) corresponding to
holds that distributions of means and sums of ran-
dom variables converge toward approximate nor-
yij ¼ μi þ εij ; where εij IID 2
e Nðo; σ ε Þ‚
mality, regardless of the underlying distribution of
the random variable. The distribution of the errors
where i ¼ 1 . . . p (number of treatments) and
is often approximately normal because it is a func-
j ¼ 1 . . . ni (sample size of the ith treatment). The
tion of the other distributions in the equation,
most notable hypotheses are
which are often approximately normal.
Regardless of its convenience and ubiquity, the H0 : μ1 ¼ μ2 ¼    ¼ μp
normality assumption merits testing. If this assump-
tion is untenable, then there are other practical
options: Ha : μi 6¼ μj i 6¼ j:

1. Transform the response or regressors so that the The usual test statistic for this hypothesis is
distribution of errors is approximately normal. F ¼ MSMS
treatment
error
, which is the ratio of two estimators
Error 415

P 2
n ðy y Þ
i i: The best solution for both objectives is to
of σ 2ε .The numerator is MStreatment ¼ p1
::
, reduce the measurement error. This can be accom-
which is unbiased only if H0 is true. The denomi- plished in three ways:
P 2
ðyij  yi: Þ
nator is MSerror ¼ Np , which is unbiased 1. Improve the measurement device, possibly
regardless of whether H0 is true. George W. Sne- through calibration.
decor recognized the value of this ratio and named
2. Improve the precision of the data storage
it the F statistic in honor of Ronald A. Fisher, who device.
was chiefly responsible for its derivation.
Difference is relative. The F test illustrates how 3. Replace x with a more accurate measure of the
error serves as a basis for comparison. If the treat- same characteristic, xM .
P 2
ni ðyi:  y:: Þ
ment means, p1 , vary relatively more than The next most promising solution is to estimate
P 2 the measurement error and use it to ‘‘adjust’’ the
ðyij  yi: Þ
the observations within each treatment, Np , parameter estimates and the confidence intervals.
then the statistician should infer that H0 is false. There are three approaches:
That is, if the discernible differences between the
treatment means are unusually large relative to the 1. Collect repeated measures of x on the same
unknown, σ 2ε , then the differences are more likely to observations, thereby estimating the variance of
be genuine. ANOVA is an analysis of means based the measurement error, σ 2ME , and using it to
adjust the regression coefficients and the
on analyzing variances of errors.
confidence intervals.
2. Calibrate x against a more accurate measure,
xM, which is unavailable for the broader
Measurement Error application, thereby estimating the variance of
the measurement error, σ 2ME .
Measurement error is the difference between the
3. Build a measurement error model based on
‘‘true’’ value and the measured value. This is some- a validation data set containing y and x
times called observational error. For many models, alongside the more accurate and broadly
one implied assumption is that the inputs and the unavailable xM. As long as the validation data
outputs are measured accurately enough for the set is representative of the target population, the
application. This is often false, especially with con- relationships can be extrapolated.
tinuous variables, which can only be as accurate as
the measurement and data storage devices allow. For the prediction problem, there is a third solu-
In practice, measurement error is often unstable tion for avoiding bias in the predictions, yet it does
and difficult to estimate, requiring multiple mea- not repair the biased coefficients or the ample con-
surements or independent knowledge. fidence intervals. The solution is to ensure that the
There are two negative consequences due to measurement error present when the model was
measurement error in the regressor, x. First, if the built is consistent as the model is applied.
measurement error variance is large relative to the
variability in x, then the coefficients will be biased.
Rounding Error
In a simple regression model for example, mea-
surement error in x will cause β ^0 to converge to Rounding error is often voluntary measurement
a slightly larger value than β0 and β ^ 1 to be ‘‘atten- error. The person or system causing the rounding
uated’’ that is, the measurement error shrinks β ^1 is now a second stage in the measurement device.
so that it will underestimate β1 . Second, if the Occasionally, data storage devices lack the same
measurement error in x is large relative to the vari- precision as the measurement device, and this cre-
ability of y, then this will increase the widths of ates rounding error. More commonly, people or
confidence intervals. Both of these problems inter- software collecting the information fail to retain
fere with the two primary objectives of modeling: the full precision of the data. After the data are
coefficient estimation and prediction. collected, it is common to find unanticipated
416 Error Rates

applications—the serendipity of statistics, wanting Kutner, M., Nachtsheim, C., Neter, J., & Li, W. (2005).
more precision. Applied linear statistical models (5th ed.). Boston:
Large rounding error, εR , can add unwelcome McGraw-Hill Irwin.
complexity to the problem. Suppose that x2 is Milliken, G., & Johnson, D. (1984). Analysis of messy
data: Vol. 1.: Designed experiments. Belmont, CA:
measured with rounding error, εR , then a model
Wadsworth.
involving x2 might look like this: Snedecor, G. (1956). Statistical methods applied to
experiments in agriculture and biology (5th ed.).
y ¼ β0 þ β1 x1 þ β2 ðx2 þ εR Þ þ ε: Ames: Iowa State College Press.
Stigler, S. (1986). The history of statistics: The
measurement of uncertainty before 1900.
Sampling and Nonsampling Error Cambridge, MA: Belknap Press of Harvard
The purpose of sampling is to estimate character- University Press.
istics (mean, variance, etc.) of a population Weisberg, S. (1985). Applied linear regression (2nd ed.).
New York: John Wiley.
based upon a randomly selected representative
subset. The difference between a sample’s esti-
mate and the population’s value is due to two
sources of error: sampling error and nonsam-
pling error. Even with perfect execution, there is
ERROR RATES
a limitation on the ability of the partial informa-
tion contained in the sample to fully estimate In research, error rate takes on different meanings
population characteristics. This part of the esti- in different contexts, including measurement and
mation difference is due to sampling error—the inferential statistical analysis. When measuring
minimum discrepancy due to observing a sample research participants’ performance using a task
instead of the whole population. Nonsampling with multiple trials, error rate is the proportion of
error explains all remaining sources of error, responses that are incorrect. In this manner, error
including nonresponse, selection bias, measure- rate can serve as an important dependent variable.
ment error (inaccurate response), and so on, that In inferential statistics, errors have to do with the
are related to execution. probability of making a false inference about the
Sampling error is reduced by improving the sam- population based on the sample data. Therefore,
ple design or increasing the sample size. Nonsam- estimating and managing error rates are crucial to
pling error is decreased through better execution. effective quantitative research.
This entry mainly discusses issues involving
Randy J. Bartlett error rates in measurement. Error rates in statisti-
cal analysis are mentioned only briefly because
See also Error Rates; Margin of Error; Missing Data, they are covered in more detail under other entries.
Imputation of; Models; ‘‘Probable Error of a Mean,
The’’; Random Error; Residual Plot; Residuals; Root
Mean Square Error; Sampling Error; Standard Error Rates in Measurement
Deviation; Standard Error of Estimate; Standard Error
In a task with objectively correct responses (e.g.,
of Measurement; Standard Error of the Mean; Sums of
a memory task involving recalling whether
Squares; Systematic Error; Type I Error; Type II Error;
a stimulus had been presented previously), a par-
Type III Error; Variability, Measure of; Variance;
ticipant’s response can be one of three possibili-
White Noise
ties: no response, a correct response, or an
incorrect response (error). Instances of errors
Further Readings across a series of trials are aggregated to yield
Harrell, F. E., Jr. (2001). Regression modeling strategies,
error rate, ideally in proportional terms. Specifi-
with applications to linear models, logistic regression, cally, the number of errors divided by the num-
and survival analysis. New York: Springer. ber of trials in which one has an opportunity to
Heyde, C. C., & Seneta, E. (2001). Statisticians of the make a correct response yields the error rate.
centuries. New York: Springer-Verlag. Depending on the goals of the study, researchers
Error Rates 417

may wish to use for the denominator the total Distance (divided by standard deviation yields d´)
number of responses or the total number of trials
(including nonresponses, if they are considered
relevant). The resulting error rate can then be
Noise distribution Signal + Noise distribution
used to test hypotheses about knowledge or cog-
nitive processes associated with the construct
represented by the targets of response.
Probability of a hit

Signal Detection Theory C


Probability of
a false alarm
One particularly powerful data-analytic 0
approach employing error rates is Signal Detection
Theory (SDT). SDT is applied in situations where Threshold Threshold
of an ideal observer
the task involves judging whether a signal exists
(e.g., ‘‘Was a word presented previously, or is it
a new word?’’). Using error rates in a series of trials Figure 1 Distributions of Signal and Noise
in a task, SDT mathematically derives characteris-
tics of participants’ response patterns such as sensi-
tivity (the perceived distinction between a signal (signal is not present and the perceiver responds
and noise) and judgment criterion (the tendency to negatively); and false alarms (signal is not present
respond in one way rather than the other). and the perceiver responds affirmatively). As indi-
Typically, SDT is based on the following cated above, misses and false alarms are errors.
assumptions. First, in each trial, either a signal Miss rate is calculated as the ratio of missed trials
exists or it does not (e.g., a given word was pre- to the total number of trials with signal. False
sented previously or not). Even when there is no alarm rate is the ratio of trials with false alarms to
signal (i.e., the correct response would be nega- the total number of trials without signal. Hit rate
tive), the perceived intensity of the stimuli varies and miss rate sum to 1, as do correct rejection rate
randomly (caused by factors originating from the and false alarm rate.
task or from the perceiver), which is called The objective of SDT is to estimate two indexes
‘‘noise.’’ Noise follows a normal distribution with of participants’ response tendencies from the error
a mean of zero. Noise always accompanies signal, rates. The sensitivity (or discriminability) index
and because noise is added to signal, the distribu- (d0 ) pertains to the strength of the signal (or a per-
tion of perceived intensity of signal has the same ceiver’s ability to discern signal from noise), and
(normal) shape. Each perceiver is assumed to have response bias (or strategy) (C) is the tendency to
an internal set criterion (called threshold) used to respond one way or the other (e.g., affirmatively
make decisions in the task. If the perceived inten- rather than negatively). The value of d0 reflects the
sity (e.g., subjective familiarity) of the stimulus is distance between the two distributions relative to
stronger than the threshold, the perceiver will their spread, so that a larger value means the sig-
decide that there is a signal (respond affirma- nal is more easily discerned (e.g., in the case of
tively—e.g., indicate that the word was presented word learning, this may imply that the learning
previously); otherwise, the perceiver will respond task was effective). C reflects the threshold of the
negatively. When the response is not consistent perceiver minus that of an ideal observer. When
with the objective properties of the stimulus (e.g., the value of C is positive, the perceiver is said to
a negative response to a word that was presented be conservative (i.e., requiring stronger intensity of
previously or an affirmative response to a word the stimulus to respond affirmatively), and a per-
that was not), it is an error. ceiver with a negative C is liberal. As the C value
Responses are categorized into four groups: hits increases, both miss rate and correct rejection rate
(signal is present and the perceiver responds affir- increase (i.e., more likely to respond negatively
matively); misses (signal is present and the per- both when there is a signal and when there is not);
ceiver responds negatively); correct rejections conversely, as it decreases, both hit rate and false
418 Error Rates

alarm rate increase. Bias is sometimes expressed as influences that controlled and automatic processes
β, which is defined as the likelihood ratio of the have on the response are opposite to each other;
signal distribution to noise distribution at the crite- these trials are called incongruent trials. The goal
rion (i.e., the ratio of the height of the signal curve of PDP is to estimate the probabilities that con-
to the height of the noise curve at the value of the trolled and automatic processes affect responses.
0
threshold) and is equal to ed *; C . The value would
be greater than 1 when the perceiver is conserva-
tive and less than 1 when liberal. Sensitivity and Other Issues With Error Rates in Measurement
bias can be estimated using a normal distribution
function from hit rate and false alarm rate; Speed-Accuracy Trade-Off
because the two rates are independent of each In tasks measuring facility of judgments (e.g.,
other, it is necessary to obtain both of them from categorizing words or images), either error rate or
data. response latency can be used as the basis of analy-
sis. If the researcher wants to use error rates, it is
Process Dissociation Procedure desirable to have time pressure in the task in order
to increase error rates, so that larger variability in
Process Dissociation Procedure (PDP) is error rate can be obtained. Without time pressure,
a method that uses error rates to estimate the sepa- in many tasks, participants will make mostly accu-
rate contributions of controlled (intentional) and rate responses, and it will be hard to discern mean-
automatic (unintentional) processes in responses. ingful variability in response facility.
In tasks involving cognitive processes, participants
will consciously (intentionally) strive to make cor- Problems Involving High Error Rates
rect responses. But at the same time, there may
also be influences of automatic processes that are When something other than error rate is mea-
beyond conscious awareness or control. Using sured (e.g., response latency), error rate may be high
PDP, the researcher can estimate the independent for some or all participants. As a consequence, there
influences of controlled and automatic processes may be too few valid responses to use in analysis.
from error rates. To address this issue, if only a few participants have
The influences of controlled and automatic pro- error rates higher than a set criterion (ideally dis-
cesses may work hand in hand or in opposite cerned by a discontinuity in the frequency distribu-
directions. For example, in a typical Stroop color tion of error rates), the researcher may remove his
naming task, participants are presented with color or her data. However, if it is a more prevalent trend,
words (e.g., ‘‘red’’) and instructed to name the col- the task may be too difficult (in which case, the task
ors of the words’ lettering, which are either consis- may have to be made easier) or inappropriate (in
tent (e.g., red) or inconsistent (e.g., green) with the which case, the researcher should think of a better
words. In certain trials, the response elicited by the way to measure the construct).
automatic process is the correct response as
Error Rates as a Source of Error Variance
defined in the task. For example, when the stimu-
lus is the word ‘‘red’’ in red lettering, the auto- In measures with no objectively correct or
matic process (to read the word) will elicit the incorrect responses (e.g., Likert-scale ratings of
response ‘‘red,’’ which is the same as the one dic- attitudes or opinions), errors can be thought of as
tated by the controlled process (to name the color the magnitude of inaccuracy of the measurement.
of the lettering). Such trials are called congruent For example, when the wording of questionnaire
trials, because controlled and automatic processes items or the scale of a rating trial is confusing, or
elicit the same response. In other trials, the when some of the participants have response sets
response elicited by the automatic process is not (e.g., a tendency to arbitrarily favor a particular
the response required in the task. In our example, response option), the responses may not accurately
when the word ‘‘green’’ is presented in red letter- reflect what is meant to be measured. If error vari-
ing, the automatic process will elicit the response ance caused by peculiarities of the measurement or
‘‘green.’’ In this case, the directions of the of some participants is considerable, the reliability
Estimation 419

of the measurement is dubious. Therefore, the Luce, R. D. (1986). Response times: Their role in
researcher should strive to minimize these kinds of inferring elementary mental organization. New York:
errors, and check the response patterns within and Oxford University Press.
across participants to see if there are any nontriv- Sanders, A. F. (1998). Elements of human performance:
Reaction processes and attention in human skill.
ial, systematic trends not intended.
Mahwah, NJ: Lawrence Erlbaum.
Wickens, T. D. (2001). Elementary signal detection
theory. New York: Oxford University Press.
Errors in Statistical Inference
In statistical inference, the concept of error rates is
used in null hypothesis significance testing (NHST) ESTIMATION
to make judgments of how probable a result is in
a given population. Proper NHST is designed to
minimize the rates of two types of errors: Type II Estimation is the process of providing a numerical
and particularly Type I errors. value for an unknown quantity based on informa-
A Type I error (false positive) occurs when tion collected from a sample. If a single value is
a rejected null hypothesis is correct (i.e., an effect calculated for the unknown quantity, the process is
is inferred when, in fact, there is none). The proba- called point estimation. If an interval is calculated
bility of a Type I error is represented by the p that is likely, in some sense, to contain the quan-
value, which is assessed relative to an a priori cri- tity, then the procedure is called interval estima-
terion, α. The conventional criterion is α ¼ :05; tion, and the interval is referred to as a confidence
that is, when the probability of a Type I error (p) interval. Estimation is thus the statistical term for
is less than .05, the result is considered ‘‘statisti- an everyday activity: making an educated guess
cally significant.’’ Recently, there has been a grow- about a quantity that is unknown based on known
ing tendency to report the exact value of p rather information. The unknown quantities, which are
than merely stating whether it is less than α. Fur- called parameters, may be familiar population
thermore, researchers are increasingly reporting quantities such as the population mean μ, popula-
effect size estimates and confidence intervals so tion variance σ 2, and population proportion π. For
that there is less reliance on a somewhat arbitrary, instance, a researcher may be interested in the pro-
dichotomous decision based on the .05 criterion. portion of voters favoring a political party. That
A Type II error (false negative) occurs when proportion is the unknown parameter, and its esti-
a retained null hypothesis is incorrect (i.e., no mation may be based on a small random sample
effect is inferred when, in fact, there is one). An of individuals. In other situations, the parameters
attempt to decrease the Type II error rate (by being are part of more elaborate statistical models, such
more liberal in saying there is an effect) also as the regression coefficients β0 ; β1 ; . . . ; βp in a lin-
increases the Type I error rate, so one has to ear regression model
compromise.
X
p

Sang Hee Park and Jack Glaser Y ¼ β0 þ xj βj þ ε‚


j¼1

See also Error; False Positive; Nonsignificance;


Significance, Statistical which relates a response variable Y to explanatory
variables x1 ; x2 ; . . . ; xp .
Point estimation is one of the most common
Further Readings forms of statistical inference. One measures
a physical quantity in order to estimate its value,
Green, D. M., & Swets, J. A. (1966). Signal detection
theory and psychophysics. New York: Wiley.
surveys are conducted to estimate unemploy-
Jacoby, L. L. (1991). A process dissociation framework: ment rates, and clinical trials are carried out to
Separating automatic from intentional uses of estimate the cure rate (risk) of a new treatment.
memory. Journal of Memory and Language, 30, The unknown parameter in an investigation is
513541. denoted by θ, assumed for simplicity to be
420 Estimation

a scalar, but the results below extend to the case moments, and solving for the unknown para-
that θ ¼ ðθ1 ; θ2 ; . . . ; θk Þ with k > 1. meters. Least-squares estimators are obtained,
To estimate θ, or, more generally, a real-valued particularly in regression analysis, by minimizing
function of θ, τ(θ), one calculates a corresponding a (possibly weighted) difference between the
function of the observations, a statistic, δ ¼ δ(X1, observed response and the value predicted by the
X2; . . . ; Xn). An estimator is any statistic δ defined model.
over the sample space. Of course, it is hoped that δ The method of maximum likelihood is the most
will tend to be close, in some sense, to the popular technique for deriving estimators. Consid-
unknown τ(θ), but such a requirement is not part ered for fixed x ¼ (x1, x2 ; . . . ; xn) as a function of
of the formal definition of an estimator. The value θ, the joint probability density (or probability)
δ(x1, x2; . . . ; xn) taken on by δ in a particular case pθ ðxÞ ¼ pθ ðx1 ; . . . ; xn Þ is called the likelihood of θ,
is the estimate of τ(θ), which will be our educated and the value θ^ ¼ θ(X) ^ of θ that maximizes pθ(x)
guess for the unknown value. In practice, the com- constitutes the maximum likelihood estimator
pact notation δ^ is often used for both estimator (MLE) of θ. The MLE of a function τ(θ) is defined
and estimate. ^
to be τðθÞ.
The theory of point estimation can be divided In Bayesian analysis, a distribution πðθÞ, called
into two parts. The first part is concerned with a prior distribution, is introduced for the parame-
methods for finding estimators, and the second part ter θ, which is now considered a random quantity.
is concerned with evaluating these estimators. The prior is a subjective distribution, based on the
Often, the methods of evaluating estimators will experimenter’s belief about θ, prior to seeing the
suggest new estimators. In many cases, there will be data. The joint probability density (or probability
an obvious choice for an estimator of a particular function) of X now represents the conditional dis-
parameter. For example, the sample mean is a natu- tribution of X given θ, and is written pðx j θÞ. The
ral candidate for estimating the population mean; conditional distribution of θ given the data x is
the median is sometimes proposed as an alternative. called the posterior distribution of θ, and by
In more complicated settings, however, a more sys- Bayes’s theorem, it is given by
tematic way of finding estimators is needed.
πðθ j xÞ ¼ πðθÞpðx j θÞ=mðxÞ‚ ð1Þ

Methods of Finding Estimators where m(x) Ris the marginal distribution of X, that
is, mðxÞ ¼ πðθÞpðx j θÞ dθ. The posterior distri-
The formulation of the estimation problem in
bution which combines prior information and
a concrete situation requires specification of the
information in the data, is now used to make state-
probability model, P, that generates the data. The
ments about θ: For instance, the mean or median
model P is assumed to be known up to an
of the posterior distribution can be used as a point
unknown parameter θ, and P ¼ Pθ is written to
estimate of θ: The resulting estimators are called
express this dependence. The observations x ¼
Bayes estimators.
(x1, x2; . . . ; xn) are postulated to be the values
taken on by the random observable X ¼ (X1,
X2; . . . ; Xn) with distribution Pθ . Frequently, it will
Example
be reasonable to assume that each of the Xis has
the same distribution, and that the variables X1, Suppose X1, X2; . . . ; Xn are i.i.d. Bernoulli ran-
X_2, . . . , X_n are independent. This situation is called the independent, identically distributed (i.i.d.) case in the literature and allows for a considerable simplification in our model.

There are several general-purpose techniques for deriving estimators, including methods based on moments, least-squares, maximum-likelihood, and Bayesian approaches. The method of moments is based on matching population and sample moments.

Consider independent Bernoulli random variables, which take the value 1 with probability θ and 0 with probability 1 - θ. A Bernoulli process results, for example, from conducting a survey to estimate the unemployment rate, θ. In this context, the value 1 denotes that the responder was unemployed. The first moment (mean) of the distribution is θ, and the likelihood function is given by

  p_θ(x_1, . . . , x_n) = θ^y (1 - θ)^(n - y),   0 ≤ θ ≤ 1,   (1)

where y = Σ x_i. The method of moments and maximum-likelihood estimates of θ are both θ̂ = y/n, that is, the intuitive frequency-based estimate for the probability of success given y successes in n trials. For a Bayesian analysis, if the prior distribution for the parameter θ is a Beta distribution, Beta(α, β),

  π(θ) ∝ θ^(α - 1) (1 - θ)^(β - 1),   α, β > 0,

the posterior distribution for θ, from (1), is

  π(θ | x) ∝ θ^(α + y - 1) (1 - θ)^(n + β - y - 1).

The posterior distribution is also a Beta distribution, θ | x ~ Beta(α + y, n + β - y), and a Bayes estimate, based on, for example, the posterior mean, is θ̂ = (α + y)/(α + β + n).

Methods of Evaluating Estimators

For any given unknown parameter, there are, in general, many possible estimators, and methods to distinguish between good and poor estimators are needed. The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory. The error in using the observable θ̂ = θ̂(X) to estimate the unknown θ is ε̂ = θ̂ - θ. This error forms the basis for assessing the performance of an estimator. A commonly used finite-sample measure of performance is the mean squared error (MSE). The MSE of an estimator θ̂ of a parameter θ is the function of θ defined by E[(θ̂(X) - θ)²], where E(·) denotes the expected value of the expression in brackets. The advantage of the MSE is that it can be decomposed into a systematic error represented by the square of the bias, B(θ̂, θ) = E[θ̂(X)] - θ, and the intrinsic variability represented by the variance, V(θ̂, θ) = var[θ̂(X)]. Thus,

  E[(θ̂ - θ)²] = B²(θ̂, θ) + V(θ̂, θ).

An estimator whose bias B(θ̂, θ) = 0 is called unbiased and satisfies E[θ̂(X)] = θ for all θ, so that, on average, it will estimate the right value. For unbiased estimators, the MSE reduces to the variance of θ̂.
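A short simulation can check this decomposition empirically for the Bernoulli example above; the true θ, the sample size, the number of replications, and the Beta prior values used below are arbitrary illustrative choices rather than values from the entry.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 25, 100_000      # assumed true value, sample size, replications
alpha, beta = 2.0, 2.0                 # assumed Beta prior hyperparameters

y = rng.binomial(n, theta, size=reps)      # successes in each simulated sample
mle = y / n                                # method-of-moments / maximum-likelihood estimate
bayes = (alpha + y) / (alpha + beta + n)   # posterior mean under the Beta(alpha, beta) prior

for name, est in [("MLE y/n", mle), ("Bayes posterior mean", bayes)]:
    bias = est.mean() - theta
    variance = est.var()
    mse = np.mean((est - theta) ** 2)
    # Up to simulation error, mse equals bias**2 + variance.
    print(f"{name:<22} bias={bias:+.4f}  var={variance:.5f}  "
          f"mse={mse:.5f}  bias^2+var={bias**2 + variance:.5f}")
```

Under these assumed settings the unbiased estimate y/n has MSE roughly equal to its variance, θ(1 - θ)/n, whereas the Bayes estimate trades a small bias for a reduced variance, an instance of the bias-variance trade-off discussed below.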
The property of unbiasedness is an attractive one, and much research has been devoted to the study of unbiased estimators. For a large class of problems, it turns out that among all unbiased estimators, there exists one that uniformly minimizes the variance for all values of the unknown parameter, and which is therefore uniformly minimum variance unbiased (UMVU). Furthermore, one can specify a lower bound on the variance of any unbiased estimator of θ, which can sometimes be attained. The result is the following version of the information inequality

  var[θ̂(X)] ≥ 1/I(θ),   (2)

where

  I(θ) = E{[∂/∂θ log p_θ(X)]²}   (3)

is the information (or Fisher information) that X contains about θ. The bound can be used to obtain the (absolute) efficiency of an unbiased estimator θ̂ of θ. This is defined as

  e(θ̂) = [1/I(θ)] / V(θ̂, θ).

By Equation 2, the efficiency is bounded above by unity; when e(θ̂) = 1 for all θ, θ̂ is said to be efficient. Thus, an efficient estimator, if it exists, is the UMVU, but the UMVU is not necessarily efficient. In practice, there is no universal method for deriving UMVU estimators, but there are, instead, a variety of techniques that can sometimes be applied.

Interestingly, unbiasedness is not essential, and a restriction to the class of unbiased estimators may rule out some very good estimators, including maximum likelihood. It is sometimes the case, for example, that a trade-off occurs between variance and bias in such a way that a small increase in bias can be traded for a larger decrease in variance, resulting in an improvement in MSE. In addition, finding a best unbiased estimator is not straightforward. For instance, UMVU estimators, or even any unbiased estimator, may not exist for a given τ(θ); or the bound in Equation 2 may not be attainable, and one then has to decide if one's candidate for best unbiased estimator is, in fact, optimal. Therefore, there is scope to consider other criteria also, and possibilities include equivariance, minimaxity, and robustness.

In many cases in practice, estimation is performed using a set of independent, identically distributed observations. In such cases, it is of interest to determine the behavior of a given estimator as the number of observations increases to infinity (i.e., asymptotically). The advantage of asymptotic evaluations is that calculations simplify and it is also clearer how to measure estimator performance. Asymptotic properties concern a sequence of estimators indexed by n, θ̂_n, obtained by performing the same estimation procedure for each sample size. For example, X̄_1 = X_1, X̄_2 = (X_1 + X_2)/2, X̄_3 = (X_1 + X_2 + X_3)/3, and so forth. A sequence of estimators θ̂_n is said to be asymptotically optimal for θ if it exhibits the following characteristics:

• Consistency: It converges in probability to the parameter it is estimating, i.e., P(|θ̂_n - θ| < ε) → 1 for every ε > 0.

• Asymptotic normality: The distribution of n^(1/2)(θ̂_n - θ) tends to a normal distribution with mean zero and variance 1/I_1(θ), where I_1(θ) is the Fisher information in a single observation, that is, Equation 3 with X replaced by X_1.

• Asymptotic efficiency: No other asymptotically normal estimator has smaller variance than θ̂_n.

The small-sample and asymptotic results above generalize to vector-valued parameters θ = (θ_1, θ_2, . . . , θ_k) and estimation of real- or vector-valued functions of θ, τ(θ). One also can use the asymptotic variance as a means of comparing two asymptotically normal estimators through the idea of asymptotic relative efficiency (ARE). The ARE of one estimator compared with another is the reciprocal of the ratio of their asymptotic (generalized) variances. For example, the ARE of the median to the mean when the Xs are normal is 2/π ≈ 0.64, suggesting a considerable efficiency loss in using the median in this case.

It can be shown under general regularity conditions that if θ̂ is the MLE, then τ(θ̂) is an asymptotically optimal (most efficient) estimator of τ(θ). There are other asymptotically optimal estimators, such as Bayes estimators. The method of moments estimator is not, in general, asymptotically optimal but has the virtue of being quite simple to use.
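A brief simulation can make these asymptotic comparisons concrete; the normal model, sample sizes, and replication count below are arbitrary illustrative choices. For normally distributed data, the ratio of the variance of the sample mean to that of the sample median should approach the ARE of 2/π ≈ 0.64 quoted above as n grows, while both variances shrink toward zero, reflecting consistency.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 20_000

# For normal data, var(mean) is about sigma^2/n and var(median) about pi*sigma^2/(2n),
# so var(mean)/var(median) should approach 2/pi (about 0.64) as n grows.
for n in (20, 100, 500):
    x = rng.normal(size=(reps, n))
    var_mean = x.mean(axis=1).var()
    var_median = np.median(x, axis=1).var()
    print(f"n={n:4d}  var(mean)={var_mean:.5f}  var(median)={var_median:.5f}  "
          f"ratio={var_mean / var_median:.3f}  (2/pi={2 / np.pi:.3f})")
```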
In most practical situations, it is possible to consider the use of several different estimators for the unknown parameters. It is generally good advice to use several alternative estimation methods in such situations, in the hope that they result in similar parameter estimates. If a single estimate is needed, it is best to rely on a method that possesses good statistical properties, such as maximum likelihood.

Panagiotis Besbeas

See also Accuracy in Parameter Estimation; Confidence Intervals; Inference: Deductive and Inductive; Least Squares, Methods of; Root Mean Square Error; Unbiased Estimator

Further Readings

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.
Bickel, P. J., & Doksum, K. A. (2001). Mathematical statistics: Basic ideas and selected topics, Vol. 1 (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury.
Johnson, N. L., & Kotz, S. (1994–2005). Distributions in statistics (5 vols.). New York: Wiley.
Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. New York: Springer-Verlag.

ETA-SQUARED

Eta-squared is commonly used in ANOVA and t test designs as an index of the proportion of variance attributed to one or more effects. The statistic is useful in describing how variables are behaving within the researcher's sample. In addition, because eta-squared is a measure of effect size, researchers are able to compare effects of grouping variables or treatment conditions across related studies. Despite these advantages, researchers need to be aware of eta-squared's limitations, which include an overestimation of population effects and its sensitivity to design features that influence its relevance and interpretability. Nonetheless, many social scientists advocate for the reporting of the eta-squared statistic, in addition to reporting statistical significance.

This entry focuses on defining, calculating, and interpreting eta-squared values, and will discuss the advantages and disadvantages of its use. The entry concludes with a discussion of the literature regarding the inclusion of eta-squared values as a measure of effect size in the reporting of statistical results.

Defining Eta-Squared

Eta-squared (η²) is a common measure of effect size used in t tests as well as univariate and multivariate analysis of variance (ANOVA and MANOVA, respectively). An eta-squared value reflects the strength or magnitude related to a main or interaction effect. Eta-squared quantifies the percentage of variance in the dependent variable (Y) that is explained by one or more independent variables (X). This effect tells the researcher what percentage of the variability in participants' individual differences on the dependent variable can be explained by the group or cell membership of the participants. This statistic is analogous to r-squared values in bivariate correlation (r²) and regression analysis (R²). Eta-squared is considered an additive measure of the unique variation in a dependent variable, such that nonerror variation is not accounted for by other factors in the analysis.

Interpreting the Size of Effects

The value of η² is interpretable only if the F ratio for a particular effect is statistically significant. Without a significant F ratio, the eta-squared value is essentially zero and the effect does not account for any significant proportion of the total variance. Furthermore, some researchers have suggested cutoff values for interpreting eta-squared values in terms of the magnitude of the association between the independent and dependent measures. Generally, assuming a moderate sample size, eta-squared values of .09, .14, and .22 or greater could be described in the behavioral sciences as small, medium, and large. This index of the strength of association between variables has been referred to as practical significance. Determination of the size of effect based on an eta-squared value is largely a function of the variables under investigation. In the behavioral sciences, what counts as a large effect is relative.

Partial eta-squared (η²p), a second estimate of effect size, is the ratio of variance due to an effect to the sum of the error variance and the effect variance. In a one-way ANOVA design that has just one factor, the eta-squared and partial eta-squared values are the same. Typically, partial eta-squared values are greater than eta-squared estimates, and this difference becomes more pronounced with the addition of independent factors to the design. Some critics have argued that researchers incorrectly use these statistics interchangeably. Generally, η² is preferred to η²p for ease of interpretation.

Calculating Eta-Squared

Statistical software programs, such as IBM® SPSS® (PASW) 18.0 (an IBM company, formerly called PASW® Statistics) and SAS, provide only the partial eta-squared values in the output, and not the eta-squared values. However, these programs provide the necessary values for the calculation of the eta-squared statistic. Using information provided in the ANOVA summary table in the output, eta-squared can be calculated as follows:

  η² = SS_effect / SS_total.

When using between-subjects and within-subjects designs, the total sum of squares (SS_total) in the ratio represents the total variance. Likewise, the sum of squares of the effect (which can be for a main effect or an interaction effect) represents the variance attributable to the effect. Eta-squared is the decimal value of the ratio and is interpreted as a percentage. For example, if SS_total = 800 and SS_A = 160 (the sum of squares of the main effect of A), the ratio would be .20. Therefore, the interpretation of the eta-squared value would be that the main effect of A explains 20% of the total variance of the dependent variable. Likewise, if SS_total = 800 and SS_A×B = 40 (the sum of squares of the interaction between A and B), the ratio would be .05. Given these calculations, the explanation would be that the interaction between A and B accounts for 5% of the total variance of the dependent variable.
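These calculations are easy to reproduce directly from an ANOVA summary table. In the sketch below, the SS_total, SS_A, and SS_A×B values are those used in the example above, whereas the SS_B value (and therefore the implied error sum of squares) is an assumed figure added only so that partial eta-squared can also be illustrated.

```python
# Worked continuation of the example above; SS_B is an assumed value.
ss_total = 800.0
ss_effects = {"A": 160.0, "B": 100.0, "A x B": 40.0}
ss_error = ss_total - sum(ss_effects.values())   # 500 under these assumptions

for name, ss in ss_effects.items():
    eta_sq = ss / ss_total                  # proportion of total variance
    partial_eta_sq = ss / (ss + ss_error)   # proportion of effect-plus-error variance
    print(f"{name:6s} eta^2 = {eta_sq:.2f}   partial eta^2 = {partial_eta_sq:.2f}")
```

With these (assumed) numbers, η² for the main effect of A is .20 while partial η² is about .24, illustrating the earlier point that partial eta-squared values are typically larger than eta-squared values in multifactor designs.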
Mixed Factorial Designs

When using a mixed-design ANOVA, or a design that combines both between- and within-subject effects (e.g., pre- and posttest designs), researchers have differing opinions regarding whether the denominator should be the SS_total when calculating the eta-squared statistic. An alternative option is to use the between-subjects variance (SS_between-subjects) and within-subjects variance (SS_within-subjects), separately, as the denominator to assess the strength of the between-subjects and within-subjects effects, respectively. Accordingly, when considering such effects separately, eta-squared values are calculated using the following formulas:

  η² = SS_A / SS_within-subjects,

  η² = SS_B / SS_between-subjects, and

  η² = SS_A×B / SS_within-subjects.

When using SS_between-subjects and SS_within-subjects as separate denominators, calculated percentages are generally larger than when using SS_total as the denominator in the ratio. Regardless of the approach used to calculate eta-squared, it is important to clearly interpret the eta-squared statistics for statistically significant between-subjects and within-subjects effects, respectively.

Strengths and Weaknesses

Descriptive Measure of Association

Eta-squared is a descriptive measure of the strength of association between independent and dependent variables in the sample. A benefit of the eta-squared statistic is that it permits researchers to descriptively understand how the variables in their sample are behaving. Specifically, the eta-squared statistic describes the amount of variation in the dependent variable that is shared with the grouping variable for a particular sample. Thus, because eta-squared is sample-specific, one disadvantage of eta-squared is that it may overestimate the strength of the effect in the population, especially when the sample size is small. To overcome this upwardly biased estimation, researchers often calculate an omega-squared (ω²) statistic, which produces a more conservative estimate. Omega-squared is an estimate of the dependent variable population variability accounted for by the independent variable.
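One common formula for omega-squared in a fixed-effects, between-subjects design is ω² = (SS_effect - df_effect × MS_error) / (SS_total + MS_error); other formulas apply to other designs. The sketch below uses invented ANOVA-table values to show that this estimate is smaller, and thus more conservative, than eta-squared computed from the same table.

```python
# Hypothetical one-way ANOVA table: 3 groups of 20 participants each (invented values).
ss_effect, df_effect = 160.0, 2
ss_total, n_total = 800.0, 60
ss_error = ss_total - ss_effect
df_error = n_total - 3                 # N minus number of groups
ms_error = ss_error / df_error

eta_sq = ss_effect / ss_total
omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

print(f"eta^2   = {eta_sq:.3f}")       # 0.200
print(f"omega^2 = {omega_sq:.3f}")     # smaller, less upwardly biased estimate
```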
Design Considerations

In addition to the issue of positive bias in population effects, research design considerations may also pose a challenge to the use of the eta-squared statistic. In particular, studies that employ a multifactor completely randomized design should employ alternative statistics, such as partial eta- and omega-squared. In multifactor designs, partial eta-squared may be a preferable statistic when researchers are interested in comparing the strength of association between an independent and a dependent variable that excludes variance from other factors or when researchers want to compare the strength of association between the same independent and dependent measures across studies with distinct factorial designs. The strength of effects also can be influenced by the levels chosen for independent variables. For example, if researchers are interested in describing individual differences among participants but include only extreme groups, the strength of association is likely to be positively biased. Conversely, using a clinical research trial as an example, failure to include an untreated control group in the design might underestimate the eta-squared value. Finally, attention to distinctions between random and fixed effects, and the recognition of nested factors in multifactor ANOVA designs, is critical to the accurate use, interpretation, and reporting of statistics that measure the strength of association between independent and dependent variables.

Reporting Effect Size and Statistical Significance

Social science research has been dominated by a reliance on significance testing, which is not particularly robust to small (N < 50) or large sample sizes (N > 400). More recently, some journal publishers have adopted policies that require the reporting of effect sizes in addition to reporting statistical significance (p values). In 2001, the American Psychological Association strongly encouraged researchers to include an index of effect size or strength of association between variables when reporting study results. Social scientists who advocate for the reporting of effect sizes argue that these statistics facilitate the evaluation of how a study's results fit into existing literature, in terms of how similar or dissimilar results are across related studies and whether certain design features or variables contribute to similarities or differences in effects. Effect size comparisons using eta-squared cannot be made across studies that differ in the populations they sampled (e.g., college students vs. elderly individuals) or in terms of controlling relevant characteristics of the experimental setting (e.g., time of day, temperature). Despite the encouragement to include strength of effects and significance testing, progress has been slow, largely because effect size computations, until recently, were not readily available in statistical software packages.

Kristen Fay and Michelle J. Boyd

See also Analysis of Variance (ANOVA); Effect Size, Measures of; Omega Squared; Partial Eta-Squared; R²; Single-Subject Design; Within-Subjects Design

Further Readings

Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107–112.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 363–368.
Maxwell, S. E., & Delaney, H. D. (2000). Designing experiments and analyzing data. Mahwah, NJ: Lawrence Erlbaum.
Meyers, L. G., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. London: Sage.
Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241–286.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434–447.
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64, 916–924.
Thompson, B. (2002). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80(1), 64–71.

ETHICS IN THE RESEARCH PROCESS

In the human sciences, ethical concerns are felt at the level of the practicing scientist and are the focus of scholarly attention in the field of research ethics. Most of the ethical issues have to do with the scientist's obligations and the limits on permissible scientific activity. Perspectives on these issues are informed by ideas drawn from a variety of intellectual traditions, including philosophical, legal, and religious. Political views and cultural values also influence the interpretation of researcher conduct. Ethical questions about scientific activity were once considered external to the research endeavor, but today, it is taken for granted that researchers will reflect on the decisions that they make when designing a study and the ethical ramifications that their work might have. Scientists are also expected to engage in dialogue on topics that range from the controversial, such as the choice to study intelligence or conduct HIV trials, to the procedural, such as whether research volunteers are entitled to payment for their services.

Key Themes in Research Ethics

Nearly any decision that scientists make can have ethical implications, but the questions most often addressed under the broad heading of Research Ethics can be grouped as follows: (a) guidelines and oversight, (b) autonomy and informed
consent, (c) standards and relativism, (d) conflicts principles in law and medicine. In particular, from
of interest, and (e) the art of ethical judgment. medical practice came injunctions like the doctor’s
There is no exhaustive list of ethical problems ‘‘Do no harm.’’ And from jurisprudence came the
because what constitutes an ethical problem for notion that nonconsensual touching can amount
researchers is determined by a number of factors, to assault, and that those so treated might have
including current fashions in research (not always valid claims for restitution.
in the human sciences) and such things as the pre- A common theme in these early codes was that
vailing political climate. Hence, a decision that research prerogatives must be secondary to the
a researcher makes might be regarded as contro- dignity and overall welfare of the humans under
versial for several reasons, including a general study. In the late 1970s, the Belmont Report
sense that the action is out of step with the greater expanded on this with its recommendation that
good. It is also common for decisions to be contro- a system of institutional review boards (IRBs)
versial because they are deemed to be contrary to should ensure compliance with standards. The
the values that a particular scientific association IRBs were also to be a liaison between researchers
promotes. and anyone recruited by them. In addition, Bel-
Legislation often piggybacks on such senti- mont popularized a framework that researchers
ments, with a close connection between views on and scholars could use when discussing ethical
what is ethical and what should (or should not) be issues. At first, the framework comprised moral
enforced by law. In many countries, governmental principles like beneficence, justice, nonmaleficence,
panels weigh in on issues in research ethics. There, and respect for autonomy. In recent years, com-
too, however, the categories of inquiry are fluid, mentators have supplemented it with ideas from
with the panelists drawing on social, economic, care ethics, casuistry, political theory, and other
and other considerations. The amorphous nature schools of thought.
of the ethical deliberation that researchers might Now commonplace in the lives of scientists,
be party to thus results from there being so few codes of ethics were once dismissed as mere
absolutes in science or ethics. Most of the ques- attempts to ‘‘legislate morality.’’ Skeptics also
tions that researchers confront about the design of warned that scholarly debate about ethics would
a study or proper conduct can readily be set have little to offer researchers. In retrospect, it is
against reasonable counter-questions. This does plain that any lines between law and morality
not rule out ethical distinctions, however. Just as were blurred long before there was serious interest
the evaluation of scientific findings requires a com- in codes of ethics. Not only that, rules and abstrac-
bination of interpretive finesse and seasoned reflec- tions pertaining to ethics have always had to meet
tion, moral judgment requires the ability to basic tests of relevance and practicality. In this pro-
critically evaluate supporting arguments. cess, scientists are not left out of the deliberations;
they play an active role in helping to scrutinize the
codes. Today, those codes are malleable artifacts,
Guidelines and Oversight
registers of current opinions about ethical values
Current codes of ethics have their origin in the in science. They are also widely available online,
aftermath of World War II, when interest in formal usually with accompanying discussions of related
guidelines and oversight bodies first arose. News ethical issues.
of the atrocities committed by Nazi researchers
highlighted the fact that, with no consistent stan-
Autonomy and Informed Consent
dards for scientific conduct, judgments about
methods or approach were left to each researcher’s In human research, one group, made up of scien-
discretion. The unscrupulous scientist was free to tists, if not entire disciplines, singles out another
conduct a study simply to see what might happen, group for study. Because this selection is almost
for example, or to conscript research ‘‘volunteers.’’ never random, those in the latter group will usu-
The authors of the Nuremberg Code and, later, the ally know much less about the research and why
Helsinki Declaration changed this by constructing they were selected. This can create a significant
a zone of protection out of several long-standing imbalance of power that can place the subjects
(also called recruits, patients, participants, or infor- collaborators and less like ‘‘guinea pigs,’’ even
mants) in a subordinate position. The prospect without the ritualistic process of seeking consent.
that this will also leave the subjects vulnerable But commentators tend to agree that if consent
raises a number of ethical and legal issues to which rules are softened, there should be a presumption
researchers must respond when designing their of unusually important benefits and minimal risks.
studies. As that demand is hard to meet, the trend is
Informed consent rules are the most common toward involving only bona fide volunteers. This
response to this problem of vulnerability. The spe- preference for rather restrictive guidelines reflects
cific details vary, but these rules usually call on apprehension about the legal consequences of
researchers to provide a clear account of what involving people in research against their will, as
might be in store for anyone who might serve as well as a desire to avoid situations where a lack of
a subject. In most cases, the rules have a strong consent might be a prelude to serious harm.
legalistic strain, in that the subjects contract to
serve without pressure from the researchers, and
with the option of changing their minds later. Such
Standards and Relativism
stipulations have been central to ethics codes from
the outset, and in the review of research protocols. One of the oldest philosophical questions asks
Still, as informed consent rules are applied, their whether there is one moral truth or many truths.
ethical significance rests with how researchers This question is of particular concern for research-
choose to define this operative concept. There are ers when there is what seems to be a ‘‘gray area’’
questions about how much or how little someone in ethical guidelines, that is, when a code of ethics
is consenting to when agreeing to participate, for does not appear to offer clear, explicit recommen-
example. There are questions about whether con- dations. In those settings, researchers ordinarily
sent also gives the researcher license to disseminate must decide which course represents the best com-
study results in a particular manner. As might be promise between their objectives and the interests
expected, there are also differences of opinions on of their subjects. Some commentators see the prob-
what it means to ‘‘inform’’ the subjects, and lem of ethical relativism in dilemmas like these.
whether someone who is deliberately mis- or unin- And although not everyone accepts the label of
formed is thereby denied autonomy. What moti- ‘‘relativism,’’ with some preferring to speak of
vates these disagreements are the legitimate moral pluralism or even ‘‘flexibility’’ instead, most
concerns about the degree of protection that is agree that the underlying problem bears on deci-
needed in some research, and the practicality of sions about the design and conduct of research.
consent guidelines. Research has never been governed by one
Whereas some scholars would have researchers overarching ethical standard that is complete in
err on the side of caution with very strict consent itself. Ethical standards have usually accommo-
guidelines, others maintain that consent require- dated differences in methodological orientation,
ments can make certain types of research impossi- perceived levels of risk, and other criteria. Unfor-
ble or much less naturalistic. They argue that tunately, such accommodation has never been
conditions worthy of study can be fleeting or sensi- straightforward, and it is even less so now that
tive enough that researchers have little time to disciplinary boundaries are shifting and tradi-
negotiate the subjects’ participation. Observational tional methods are being applied in novel ways.
research into crowd behavior or the elite in gov- Psychologists now delve into what used to be
ernment would be examples of studies that might considered medical research, historians work
be compromised if researchers were required to alongside public health specialists, investigative
distribute informed consent paperwork first. In journalism bears similarities to undercover field-
clinical trials, scientists might wonder how they work, and some clinical studies rely on the anal-
should understand consent when their patients ysis of patient narratives. Because of this, to
may be unable to truly understand the research. recommend that anthropologists merely refer to
There is much interest in developing research the code of ethics for their discipline is to offer
designs that can help the subjects feel more like advice of very limited value.
Even if the disciplines were to hold static, the Conflicts of Interests


problem of relativism could surface wherever
research takes place among populations whose In research projects that stretch across several
beliefs about ethics or science differ from the institutions, there might be disagreements over the
researchers’. The researchers would still face the allocation of credit for discoveries or published
same questions about how they should adapt their work. A researcher might also learn that the phar-
own moral standards to those of their subjects, or maceutical corporation that is offering to fund his
vice versa. This problem is also complicated by the study would like some control over the reporting
need for researchers to secure institutional of the results. Another researcher might discover
approval for their studies. Because many of the that her data stand to aid some people, even as
points of reference in such oversight are subject to they might threaten social programs on which
interpretation, a protocol might be rejected at one others rely. Scientific organizations might express
institution only to be accepted at another. qualms about a study that could, despite the
Researchers are usually discouraged against ‘‘shop- researcher’s wishes, be used by governments to
ping around’’ for a more lenient audience. But it is improve the interrogation of dissidents. Such con-
an open question in many cases whether one flicts of interest can reduce the incentive that
review board has a better grasp of moral values researchers have to conduct their work, place sub-
than another. jects at risk, and undermine scientific credibility.
Because it is, in some respects, the simplest There is broad consensus that, in order to avoid
alternative, there is some concern that laws and this, the law should intervene to prevent conflicts
the fear of litigation will ‘‘solve’’ the relativism that could lead to harm to subjects or society.
problem by ruling out even minor deviations from Granting agencies are also expected to limit sup-
formal codes. On one hand, this could impose port for research that might create a conflict of
more standardization on the review of protocols. interest or even give the appearance of one. Jour-
On the other hand, it could sharply limit profes- nal editors serve as gatekeepers, and they normally
sional discretion. For now, it is unclear how recep- require that authors state compliance with ethics
tive the legal community is toward such efforts. It codes and declare any conflicts of interest. The
is also mistaken to presume that this would repre- remaining responsibility is left to researchers, edu-
sent an unwarranted interference in scientific cators, and others in supervisory roles. Through
inquiry. Historically, researchers have not been left their contributions to the scholarly literature,
out of attempts to refine legal regulations, and the researchers can help each other explore the issues
collaboration has helped to remove the ambiguity surrounding conflicts of interest, and it is custom-
from codes that gave rise to some of the need to ary for training programs to incorporate these
‘‘relativize’’ ethical standards in the first place. discussions.
Concerns about relativism need not stall scien- Where these efforts lead to changes in codes of
tific activity or encourage researchers to simply ethics, the goal is usually not to prevent them, but
devise rules of their own making. There are moral to provide researchers with a first line of defense.
principles that transcend variations in the research This still leaves researchers to rely very much on
setting. Experience suggests that prohibitions the individual ability to decide with which alli-
against causing physical harm to research subjects, ances or influences they are comfortable, and
for instance, or the falsifying of observational data, which ones will violate ethical principles. That
derive from such principles. Although these and may seem to offer little in the way of progress, but
other ‘‘bedrock’’ ethical principles are not enough the fact remains that conflicts of interest are
to resolve all of the issues surrounding relativism, among the most complex ethical issues that
they can provide researchers with a baseline when researchers face, and there is no simple way to
adaptation of ethical guidelines seems necessary. respond to them that will not create additional
Discussions about this adaptation in the scholarly problems for the design of research or scientific
literature, especially when informed by actual innovation. In that respect, the researcher must
experience, also provide a resource, as do texts on perform a type of moral triage, where certain con-
research design. flicts of interest are quickly dealt with or avoided,
and others are treated on an ongoing basis as part It is also very common to criticize a clinical trial
of the research process itself. by alleging that the patients might not be fully
It is helpful to remember that in the days of the apprised of the risks. Researchers who would
‘‘Gentleman Scientist,’’ there was little need to design studies of genetic manipulation are asked to
worry about conflicts of interest. Research was consider the damage that might result. And a pro-
ethical if it conformed to a tacitly understood minent objection against including too many iden-
model of the humanistic intellectual. A similar tifying details in a published ethnography is that
model of virtue is still needed, yet today, research- the researcher will be unable to control the harm
ers face conflicts of interest not covered by any that this could cause. Where there are doubts
code, written or otherwise, and they engage in sci- about this language and the assessment that would
ence for any number of reasons. There are also declare one study ethical and another forbidden,
good reasons to think that conflicts of interest can there are questions about how well subjects are
serve as test cases for the values that researchers being protected or how research is being designed.
are expected to support. Under that interpretation, Most scholars would grant that researchers
an important benefit of the researcher’s having to must do more than offer a cost-benefit analysis for
grapple with conflicts of interest would be the con- their protocols. But this concession can still leave
tinual reexamination of such things as the role of researchers without a clear sense of what that
business or government in science, or the value of something else should be. For instance, textbooks
knowledge for its own sake, aside from its practi- on research design commonly recommend that
cal applications. researchers be honest and forthcoming with their
subjects, aside from how this might affect the
breakdown of anticipated risks. In practice, how-
The Art of Ethical Judgment
ever, researchers ordinarily do not deal with only
Often, the researcher’s most immediate ethical one or two moral principles. And even if this were
concern is the need to comply with institutional not so, it can be unreasonable to ask that research-
standards. This compliance is usually obtained by ers give an account of how the risks from a loss of
submitting a protocol in accordance with the vari- privacy in their subject population can be weighed
ous guidelines applicable to the type of research against the benefits that a marginalized population
involved. This process can lend an administrative, might gain from a particular study.
official stamp to ethical assessment, but it can also In short, what is needed is an ability to identify
obscure the wide range of values that are in play. variables that are either ignored or overempha-
In particular, critics charge that researchers too sized in current assessment strategies. It is natural
often feel pressured to look upon this assessment to turn to researchers to help refine that search.
as something that can be reduced to a ledger of This is not asking that researchers develop moral
anticipated risks and benefits. wisdom. Researchers are, rather, the most qualified
Critics also object that the review process rarely to devise improved methods of ethical assessment.
includes a provision for checking whether the risk- They are also positioned best to bring any current
benefit forecast proves too accurate. Even if deficiencies in those methods to the attention of
researchers did express interest in such verification, fellow scholars. Needless to say, researchers have
few mechanisms would enable it. The use of ani- very practical reasons to improve their ability to
mals in research, as in a toxicity study, is said to justify their work: Society is unlikely to stop ask-
illustrate some of these problems. Although risks ing for an accounting of benefits and risks. And
and benefits are clearly at issue, training programs the emphasis on outcomes, whether risks, benefits,
usually provide little advice on how researchers or some other set of parameters, is consistent with
are to compare the risks to the animals against the the priority usually given to empiricism in science.
benefits that patients are thought to gain from it. In other words, where there is even the possibility
Concerns like these are of first importance, as that researchers are unable to adequately gauge
scientists are taught early in their careers that pro- the effects of their work, there will be an impres-
tocols must be presented with assurances that sion that scientists are accepting significant short-
a study is safe and in the public’s interest. comings in the way that a study is deemed
a success or failure. That perception scientists can- Plomer, A. (2005). The law and ethics of medical
not afford, so it will not do to fall back on the research: International bioethics and human rights.
position that ambiguity about ethical values is sim- London: Routledge.
ply the price that researchers must pay. More sen- van den Hoonaard, W. C. (Ed.). (2002). Walking the
tightrope: Ethical issues for qualitative researchers.
sible is to enlist researchers in the search for ways
Toronto, Ontario, Canada: University of Toronto
to understand how the design of a study affects Press.
humans, animals, and the environment.

Study Design ETHNOGRAPHY


Initially considered something external to science,
attention to ethical issues is now a major part of Ethnography, in the simplest sense, refers to the
the design of studies, and scientific conduct in gen- writing or making of an abstract picture of a group
eral. Although there is a popular misconception of people. ‘‘Ethno’’ refers to people, and ‘‘graph’’
that this attention to ethical matters is unduly to a picture. The term was traditionally used to
restrictive, that it places too many limits on the denote the composite findings of social science
researcher’s behavior, there is perhaps just as much field-based research. That is, an ethnography
effort toward elaborating on the obligations that represented a monograph (i.e., a written account)
the researcher has to study certain topics and pro- of fieldwork (i.e., the first-hand exploration of
vide much needed information about health and a cultural or social setting). In contemporary
human behavior. The constraints that are placed on research, the term is used to connote the process of
research design involve safeguards against unneces- conducting fieldwork, as in ‘‘doing ethnography.’’
sarily manipulating, misleading, or otherwise harm- For this entry, ethnography is addressed in the dual
ing the human participants. Researchers are also sense of monograph and research process.
expected to envision the results that their studies
might have, and to consider how what might seem
minor decisions about methods or approach can Traditions
have lasting effects on the general state of knowl- Ethnography has been an integral part of the social
edge, as well as the perception that science has sciences from the turn of the 20th century. The
among the public. As a result, the decisions made challenge in imparting an understanding of ethnog-
when designing research will inevitably take place raphy lies in not giving the impression there was or
at the intersection of science and the changing atti- is a monolithic ethnographic way. A chronological,
tudes about what is appropriate or necessary. linear overview of ethnographic research would
Chris Herrera underrepresent the complexities and tensions of the
historical development of ethnography. The devel-
See also Clinical Trial; Ethnography; Experimental opment of ethnographic research cannot be neatly
Design; Informed Consent; Naturalistic Inquiry; presented in periods or typologies, nor can ethnog-
Research Design Principles raphy be equated with only one academic disci-
pline. Ethnography largely originated in the
disciplines of anthropology and sociology; both
Further Readings anthropologists and sociologists have consistently
based their research on intensive and extensive
Emanuel, E. J., Crouch, R. A., Arras, J. D., Moreno, fieldwork. Ethnography, however, has evolved into
J. D., & Grady, C. (Eds.). (2003). Ethical and
different intellectual traditions for anthropologists
regulatory aspects of clinical research. Baltimore:
Johns Hopkins University Press.
and sociologists. By separately examining the disci-
Homan, R. (1991). The ethics of social research. New plines of anthropology and sociology, one can gain
York: Longman. an understanding of the origin of ethnography and
Murphy, T. F. (2004). Case studies in biomedical research how these disciplines have uniquely contributed to
ethics. Cambridge: MIT Press. the foundation of ethnographic research.
Anthropology both have been attuned to the constitution of


everyday life.
A primary concern of the discipline of anthro-
pology is the study of culture, where culture is
Sociology
defined as the acquired meanings persons use to
interpret experience and guide social behavior. Because of its characteristic firsthand explora-
In anthropology, an ethnography is a complex tion of social settings, ethnography is also deeply
descriptive interpretation of a culture. Anthro- rooted in some intellectual traditions of the disci-
pologists historically ventured to remote, exotic pline of sociology. In sociology, ethnography
settings to live among a people for a year or so entails studying social contexts and contexts of
to gain a firsthand understanding of their cul- social action. The ethnographic study of small-
ture. Now, a dramatic difference between the scale urban and rural social settings dates back to
ethnographer and ‘‘the other’’ is no longer a crite- the beginning of the twentieth century. Early eth-
rion of anthropological ethnography. The study nographies originating in sociology were unique
of tribal or primitive cultures has evolved into for adhering to the scientific model of observation
the study of a wide range of cultural concerns, and data collection, for including quantitative
such as cultural events or scenes. Employing techniques such as mapping, and for the use of lit-
a cross-cultural perspective persists because it erary devices of modern fiction, particularly in the
affords the ethnographer an ability to recognize United States. Moreover, sociological ethnogra-
aspects of human behavior capable of being phies helped define community as a product of
observed, which is more likely to occur in the human meaning and interaction.
presence of differences than similarities. In Many early sociological ethnographies origi-
anthropology, ethnographic fieldwork aims to nated, in some manner, at the University of Chi-
discern cultural patterns of socially shared cago. During this generative era, the University of
behavior. Anthropological ethnographers do not Chicago was considered to be on the forefront of
set out simply to observe culture; rather, they sociology. More than half of the sociologists in the
make sense of what they observe by making cul- world were trained there, and a subgroup of these
ture explicit. Conducting prolonged fieldwork, scholars created the Chicago School of ethnogra-
be it in a distant setting or a diverse one, con- phy. This group of sociologists proved to be pro-
tinues to be a distinguishing characteristic of eth- lific ethnographers and fundamentally shaped the
nography originating in anthropology. discipline’s embracing of ethnography. Chicago
There have been key differences, however, School ethnographies are marked by descriptive
between American and British anthropologists’ narratives portraying the face-to-face, everyday life
approach to ethnography. Generally, in the United in a modern, typically urban, setting. Central in
States, ethnography and social anthropology were these was the dynamic process of social change,
not separate anthropological pursuits. American such as rapid changes in values and attitudes.
anthropologists’ pursuit of ethnographies of exotic
groups and universal meanings governing human
Contemporary
behavior were accommodated under the rubric
‘‘cultural anthropology.’’ Conversely, British A current guiding assumption about ethnogra-
anthropologists historically have drawn a distinc- phy is that ethnographic research can be con-
tion between social anthropology and cultural ducted across place, people, and process as long as
anthropology. They distinguished between the eth- patterns of human social behavior are central.
nographer finely examining a specific group of Both the methods and the resulting monograph are
people and the social anthropologist examining proving amenable across disciplines (e.g., econom-
the same group to discern broad cultural patterns. ics, public policy), which is illustrated in the multi-
Where these two disciplines intersect is that both ple theoretical perspectives and genres that are
traditions of anthropology have adhered to the now represented in contemporary ethnography.
standard of participant observation, both have Firsthand exploration offers a method of scientific
been attentive to learning the native language, and inquiry attentive to diverse social contexts, and the
spectrum of contemporary theoretical frameworks only answers to the questions he or she brings into
affords a broad range of perspectives such as semi- the field, but also questions to explain what is
otics, poststructuralism, deconstructionist herme- being observed.
neutics, postmodern, and feminist. Likewise,
a range of genres of contemporary ethnographies
Participant Observation
has evolved in contrast to traditional ethnogra-
phies: autoethnography, critical ethnography, eth- Participant observation is the bedrock of doing
nodrama, ethnopoetics, and ethnofiction. Just as and writing ethnography. It exists on a continuum
ethnography has evolved, the question of whether from solely observing to fully participating. The
ethnography is doable across disciplines has ethnographer’s point of entry and how he or she
evolved into how ethnographic methods might moves along the continuum is determined by the
enhance understanding of the discipline-specific problem being explored, the situation, and his or
research problem being examined. her personality and research style. The ethnogra-
pher becomes ‘‘self-as-instrument,’’ being con-
scious of how his or her level of participant
Fieldwork
observation to collect varying levels of data affects
An expectation of ethnography is that the ethnog- objectivity. Data are collected in fieldnotes written
rapher goes into the field to collect his or her own in the moment as well as in reflection later.
data rather than rely on data collected by others. Therein, participant observation can be viewed as
To conduct ethnography is to do fieldwork. a lens, a way of seeing. Observations are made
Throughout the evolution of ethnography, field- and theoretically intellectualized at the limitation
work persists as the sine qua non. Fieldwork pro- of not making other observations or intellectualiz-
vides the ethnographer with a firsthand cultural/ ing observations with another theoretical perspec-
social experience that cannot be gained otherwise. tive. The strength of extensive participant
Cultural/social immersion is irreplaceable for pro- observation is that everyone, members and ethnog-
viding a way of seeing. In the repetitive act of rapher, likely assumes natural behaviors over pro-
immersing and removing oneself from a setting, longed fieldwork. Repeated observations with
the ethnographer can move between making up- varying levels of participation provide the means
close observations and then taking a distant con- for how ‘‘ethnography makes the exotic familiar
templative perspective in a deliberate effort to and the familiar exotic.’’
understand the culture or social setting intellectu-
ally. Fieldwork provides a mechanism for learning
Interviewing
the meanings that members are using to organize
their behavior and interpret their experience. A key assumption of traditional ethnographic
Across discipline approaches, three fieldwork research (i.e., anthropological, sociological) is that
methods define and distinguish ethnographic the cultural or social setting represents the sample.
research. Participant observation, representative The members of a particular setting are sampled
interviewing, and archival strategies, or what as part of the setting. Members’ narratives are
Harry Wolcott calls experiencing, enquiring, and informally and formally elicited in relation to par-
examining, are hallmark ethnographic methods. ticipant observation. Insightful fieldwork depends
Therein, ethnographic research is renowned for on both thoughtful, in-the-moment conversation
the triangulation of methods, for engaging multi- and structured interviewing with predetermined
ple ways of knowing. Ethnography is not a reduc- questions or probes, because interviews can flesh
tionist method, focused on reducing data into out socially acquired messages. Yet interviewing is
a few significant findings. Rather, ethnography contingent on the cultural and social ethos, so it is
employs multiple methods to flesh out the com- not a given of fieldwork. Hence, a critical skill the
plexities of a setting. The specific methods used ethnographer must learn is to discern when inter-
are determined by the research questions and the viewing adds depth and understanding to partici-
setting being explored. In this vein, Charles Frake pant observation and when interviewing interrupts
advanced that the ethnographer seeks to find not focused fieldwork.
Archives ethnographer’s self-awareness of his or her per-


spective, political alliance, and cultural influence.
Across research designs, using data triangula-
The reflexivity process of the ethnographer owning
tion (i.e., multiple data collection methods) can
his or her authorial voice and recognizing the voice
increase the scientific rigor of a study. Triangula-
of others leads to an ongoing deconstructive exer-
tion teases out where data converge and diverge.
cise. This critical exercise maintains an awareness
In ethnographic research where the ethnographer
of the multiple realities constructing the reality
operates as the instrument in fieldwork, the strate-
being represented. Thus, ethnography represents
gic examination of archive material can add an
one refined picture of a group of people, not the
objective perspective. Objectivity is introduced by
picture.
using the research questions to structure how data
are collected from archives. In addition, exploring Karen E. Caines
archive material can augment fieldwork by addres-
sing a gap or exhausting an aspect of the research. See also Action Research; Discourse Analysis;
Archives are not limited to documents or records Interviewing; Naturalistic Observation; NVivo;
but include letters, diaries, photographs, videos, Qualitative Research
audiotapes, artwork, and the like.
Further Readings
Monograph Agar, M. H. (1996). The professional stranger: An
informal introduction to ethnography (2nd ed.). New
Ethnography, in the traditional sense, is a descrip-
York: Academic Press.
tive, interpretive monograph of cultural patterning Atkinson, P., Coffey, A., Delamont, S., Lofland, J., &
or social meaning of human social behavior; an Lofland, L. (Eds.). (2001). Handbook of ethnography.
ethnography is not an experiential account of field- London: Sage.
work. The meanings derived from fieldwork con- Bernard, H. R. (2005). Research methods in
stitute the essence of an ethnography. Beyond anthropology: Qualitative and quantitative
capturing and conveying a people’s worldview, an approaches (4th ed.). Lanham, MD: AltaMira.
ethnography should be grounded in social context. Coffey, A. (1999). The ethnographic self. London: Sage.
The emphasis is at once on specificity and circum- Reed-Danahay, D. E. (Ed.). (1997). Auto/ethnography:
Rewriting the self and the social. Oxford, UK: Berg.
stantiality. In analyzing the data of fieldwork and
Van Maanen, J. (Ed.). (1995). Representation in
compiling the findings, the ethnographer is chal- ethnography. Thousand Oaks, CA: Sage.
lenged to navigate the tension between a positivist Wolcott, H. F. (2008). Ethnography: A way of seeing.
perspective (i.e., objectivity) and an interpretivist Lanham, MD: AltaMira.
one (i.e., intellectual abstracting) to achieve ‘‘thick
description.’’ For an ethnography, raw data are
not simply used to describe and itemize findings.
Instead, to achieve thick or refined description, EVIDENCE-BASED
a theoretical framework (e.g., discipline, contem-
porary) is used to describe and conceptualize data. DECISION MAKING
Another tension challenging the ethnographer
in writing a monograph is that of representation. Quantitative research is a means by which one can
Can the ethnographer understand ‘‘others’’ to the gain factual knowledge. Today, a number of such
point of speaking for or standing in for them? study design options are available to the
The irony of ethnography is that the more evolved researcher. However, when studying these quantifi-
the interpretation of the data becomes, the more cations, the researcher must still make a judgment
abstract it becomes and the more it becomes one on the interpretation of those statistical findings.
way of representing the data. Clifford Geertz rea- To others, what may seem like conflicting, confus-
soned that ethnography is a ‘‘refinement of ing, or ambiguous data requires thoughtful inter-
debate.’’ Juxtaposed to representation as refine- pretation. Information for decision making is said
ment is reflexivity as awareness. Reflexivity is the to be almost never complete, and researchers
always work with a certain level of uncertainty. approach ensures that the methodology used, as
The need for making and providing proper inter- well as the logic of the researcher, to arrive at
pretation of the data findings requires that error- conclusions are sound.
prone humans acknowledge how a decision is This entry explores the history of the evidence-
made. based movement and the role of decision analysis
Much of the knowledge gained in what is in evidence-based decision making. In addition,
known as the evidence-based movement comes algorithms and decision trees, and their differ-
from those in the medical field. Statisticians are an ences, are examined.
integral part of this paradigm because they aid in
producing the evidence through various and
appropriate methodologies, but they have yet to History and Explanation of the
define and use terms encompassing the evidence-
Evidence-Based Movement
based mantra. Research methodology continues to
advance, and this advancement contributes to Evidence-based decision making stems from the
a wider base of information. Because of this, the evidence-based medicine movement that began in
need continues to develop better approaches to the Canada in the late 1980s. David Sackett defined
evaluation and utilization of research information. the paradigm of evidence-based medicine/practice
However, such advancements will continue to as the conscientious, explicit, and judicious use of
require that a decision be rendered. current best evidence about the care of individual
In the clinical sciences, evidence-based deci- patients. The evidence-based movement has grown
sion making is defined as a type of informal rapidly. In 1992, there was one publication on evi-
decision-making process that combines a clini- dence-based practices; by 1998, there were in
cian’s professional expertise coupled with the excess of 1,000. The evidence-based movement
patient’s concerns and evidence gathered from continues to enjoy rapid growth in all areas of
scientific literature to arrive at a diagnosis and health care and is seeing headways made in
treatment recommendation. Milos Jenicek fur- education.
ther clarified that evidence-based decision mak- From Sackett’s paradigm definition comes
ing is the systematic application of the best a model that encompasses three core pillars, all of
available evidence to the evaluation of options which are equally weighted. These three areas are
and to decision making in clinical, management, practitioner experience and expertise, evidence
and policy settings. from quantitative research, and individual
Because there is no mutually agreed-upon defi- (patient) preferences.
nition of evidence-based decision making among The first pillar in an evidence-based model is
statisticians, a novel definition is offered here. In the practitioner’s individual expertise in his or her
statistical research, evidence-based decision mak- respective field. To make such a model work, the
ing is defined as using the findings from the statisti- individual practitioner has to take into consider-
cal measures employed and correctly interpreting ation biases, past experience, and training. Typi-
the results, thereby making a rational conclusion. cally, the practitioner, be it a field practitioner or
The evidence that the researcher compiles is a doctoral-level statistician, has undergone some
viewed scientifically through the use of a defined form of mentoring in his or her graduate years.
methodology that values systematic as well as rep- Such a mentorship encourages what is known as
licable methods for production. the apprentice model, which, in and of itself, is
The value of an evidence-based decision- authoritarian and can be argued to be completely
making process provides a more rational, credi- subjective. The evidence-based decision-making
ble basis for the decisions a researcher and/or cli- model attempts to move the researcher to use the
nician makes. In the clinical sciences, the value latest research findings on statistical methodology
of an evidence-based decision makes patient care instead of relying on the more subjective authori-
more efficient by valuing the role the patient tarian model (mentor-student).
plays in the decision-making process. In statisti- The second pillar in an evidence-based model
cal research, the value of an evidence-based for the researcher is the use of the latest research

findings that are applicable for the person’s field of a method that allows one to make such determina-
study. It relies mainly on systematic reviews and tions. Jenicek suggested that decision analysis is
meta-analysis studies followed by randomized con- not a direction-giving method but rather a direc-
trolled trials. These types of studies have the high- tion-finding method. Direction-giving methods will
est value in the hierarchy of evidence. Prior to the be described later on in the form of the decision
evidence-based movement in the field of medicine, tree and the algorithm.
expert opinion and case studies, coupled with Jenicek suggested that decision analysis has
practitioner experience and inspired by their field- seven distinct stages. The first stage of decision
based mentor, formed much of the practice of clin- analysis requires one to adequately define the
ical medicine. problem. The second stage in the decision analysis
The third pillar of an evidence-based model is process is to provide an answer to the question,
patient preferences. In the clinical sciences, the ‘‘What is the question to be answered by decision
need for an active patient as opposed to a passive analysis?’’ In this stage, true positive, true negative,
patient has become paramount. Including the false negative, and false positive results, as well as
patient in the decision making about his or her other things, need to be taken into consideration.
own care instills in the patient a more active role. The third stage in the process is the structuring of
Such an active role by the patient is seen to the problem over time and space. This stage
strengthen the doctor-patient encounter. For the The third stage in the process is the structuring of
statistician and researcher, it would appear that must recognize the starting decision point, make
such a model would not affect their efforts. How- an overview of possible decision options and their
ever, appreciation of this pillar in the scientific outcomes, and establish temporo-spatial sequence.
research enterprise can be seen in human subject Deletion of unrealistic and/or impossible or irrele-
protection. vant options is also performed at this stage. The
From this base paradigm arose the term evi- fourth stage in the process involves giving dimen-
dence-based decision making. Previously, research- sion to all the relevant components of the problem.
ers made decisions based on personal observation, This is accomplished by obtaining available data
intuition, and authority, as well as belief and tradi- to figure out probabilities. Obtaining the best and
tion. Although the researcher examined the evi- most objective data for each relevant outcome is
dence that was produced from the statistical also performed here. The fifth stage is the analysis
formulas used, he or she still relied on personal of the problem. The researcher will need to choose
observation, intuition, authority, belief, and tradi- the best way through the available decision paths.
tion. Interpretation of statistical methods is only as As well, the researcher will evaluate the sensitivity
good as the person making the interpretation of of the preferred decision. This stage is marked by
the findings. the all-important question, ‘‘What would happen
if conditions of the decision were to change?’’ The
final two stages are solve the problem and act
Decision Analysis
according to the result of the analysis. In evidence-
Decision analysis is the discipline for addressing based decision making, the stages that involve the
important decisions in a formal manner. It is com- use of the evidence is what highlights this entire
posed of the philosophy, theory, methodology, and process. The decision in evidence-based decision
professional practice to meet this end. John Last making is hampered if the statistical data are
suggested that decision analysis is derived from flawed. For the statistician, research methodology
game theory, which tends to identify all available using this approach will need to take into consider-
choices and the potential outcomes of each. ation efficacy (can it work?), effectiveness (does it
The novice researcher and/or statistician may work?), and efficiency (what does it cost in terms
not always know how to interpret results of a sta- of time and/or money for what it gives?).
tistical test. Moreover, statistical analysis can In evidence-based decision-making practice,
become more complicated because the inexperi- much criticism has been leveled at what may
enced researcher does not know which test is more appear to some as a reliance on statistical mea-
suitable for a given situation. Decision analysis is sures. It should be duly noted that those who

follow the evidence-based paradigm realize that health, disease evolution, and policy manage-
evidence does not make the decision. However, ment. The analysis, which involves a decision to
those in the evidence-based movement acknowl- be made, leads to the best option. Choices and/
edge that valid and reliable evidence is needed to or options that are available at each stage in the
make a good decision. thinking process have been likened to branches
Jenicek has determined that decision analysis on a tree—a decision tree. The best option could
has its own inherent advantages and disadvan- be the most beneficial, most efficacious, and/or
tages. Advantages to decision analysis are that it is most cost-effective choice among the multiple
much less costly than the search for the best deci- choices to be made. The graphical representation
sion through experimental research. Such experi- gives the person who will make the decision
mental research is often sophisticated in design a method by which to find the best solution
and complex in execution and analysis. Another among multiple options. Such multiple options
advantage is that decision analysis can be easily can include choices, actions, and possible out-
translated into clinical decisions and public health comes, and their corresponding values.
policies. There is also an advantage in the educa- A decision tree is a classifier in the form of a tree
tional realm. Decision analysis is an important tool structure where a node is encountered. A decision
that allows students to better structure their think- tree can have either a leaf node or a decision node.
ing and to navigate the maze of the decision-mak- A leaf node is a point that indicates the value of
ing process. A disadvantage to decision analysis is a target attribute or class of examples. A decision
that it can be less valuable if the data and informa- node is a point that specifies some test to be car-
tion are of poor quality. ried out on a single attribute or value. From this,
a branch of the tree and/or subtree can represent
a possible outcome of a test or scenario. For exam-
Algorithm
ple, a decision tree can be used to classify a sce-
John Last defined an algorithm as any systematic nario by starting at the root of the tree and
process that consists of an ordered sequence of moving through it. The movement is temporarily
steps with each step depending on the outcome of halted when a leaf node, which provides a possible
the previous one. It is a term that is commonly outcome or classification of the instance, is
used to describe a structured process. It is a graphi- encountered.
cal representation commonly seen as a flow chart. Decision trees have several advantages. First
An algorithm can be described as a specific set of of all, decision trees are simple to understand
instructions for carrying out a procedure or solving and interpret. After a brief explanation, most
a problem. It usually requires that a particular pro- people are able to understand the model. Second,
cedure terminate at some point when questions are decision trees have a value attached to them even
readily answered in the affirmative or negative. An if very little hard data support them. Jenicek sug-
algorithm is, by its nature, a set of rules for solving gested that important insights can be generated
a problem in a finite number of steps. Other names based on experts describing a situation along
used to describe an algorithm have been method, with its alternative, probabilities, and cost, as
procedure, and/or technique. In decision analysis, well as the experts’ preference for a suitable out-
algorithms have also been defined as decision anal- come. Third, a decision tree can easily replicate
ysis algorithms. They have been argued to be best a result with simple math. The final advantage to
suited for clinical practice guidelines and for teach- a decision tree is that it can easily incorporate
ing. One of the criticisms of an algorithm is that it with other decision techniques. Overall, decision
can restrict critical thought. trees represent rules and provide a classification
as well as prediction. More important, the deci-
sion tree as a decision-making entity allows the
Decision Tree
researcher the ability to explain and argue why
A decision tree is a type of decision analysis. the reason for a decision is crucial. It should be
Jenicek defined a decision tree as a graphical rep- noted that not everything that has branches can
resentation of various options in such things as be considered a decision tree.

Decision Tree and the Last, J. M. (2001). A dictionary of epidemiology


(4th ed.). Oxford, UK: Oxford University Press.
Algorithm: The Differences
Sackett, D. L., Staus, S. E., Richardson, W. S., Rosenberg,
A decision tree and an algorithm appear to be W., & Haynes, R. B. (2000). Evidence-based medicine:
quite synonymous and are often confused. Both How to practice and teach EBM. Edinburgh, UK:
are graphical representations. However, the differ- Churchill Livingstone.
ence between the two is that a decision tree is not
direction-giving, whereas an algorithm is. In other
words, a decision tree provides options to a prob-
lem in which each possible decision can be consid- EXCLUSION CRITERIA
ered and/or argued as pertinent given the situation.
An algorithm provides a step-by-step guide in Exclusion criteria are a set of predefined defini-
which a decision depends unequivocally on the tions that is used to identify subjects who will not
preceding decision. be included or who will have to withdraw from
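The classifier structure described above, with decision nodes that test a single attribute and leaf nodes that carry an outcome, can be sketched in a few lines of code. The Python fragment below is only an illustration; the tree, its attribute names, and the record being classified are hypothetical rather than drawn from any source cited in this entry.

```python
# A minimal sketch of the decision-tree idea: decision nodes test a single
# attribute; leaf nodes hold the resulting outcome. All labels are hypothetical.
tree = {
    "attribute": "symptom_severity",             # decision node (root)
    "branches": {
        "mild": "treat in primary care",         # leaf node
        "severe": {                              # nested decision node
            "attribute": "comorbidity",
            "branches": {
                "yes": "refer to specialist",    # leaf node
                "no": "begin standard therapy",  # leaf node
            },
        },
    },
}

def classify(node, record):
    """Walk from the root toward a leaf; stop when a leaf node is reached."""
    while isinstance(node, dict):                # dicts here are decision nodes
        value = record[node["attribute"]]        # test the single attribute
        node = node["branches"][value]           # follow the matching branch
    return node                                  # strings here are leaf nodes

print(classify(tree, {"symptom_severity": "severe", "comorbidity": "no"}))
# -> begin standard therapy
```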
a research study after being included. Together
with inclusion criteria, exclusion criteria make up
Final Thoughts the eligibility criteria that rule in or out the partici-
pants in a research study. Similar to inclusion crite-
Evidence-based decision making, an offshoot of
ria, exclusion criteria are guided by the scientific
the evidence-based medicine movement, has strong
objective of the study and have important implica-
implications for the researcher. Decision analysis
tions for the scientific rigor of a study as well as
should not be confused with direction-giving.
for assurance of ethical principles. Commonly used
Decision analysis is about determination of the
exclusion criteria seek to leave out subjects not
finding of direction. Evidence-based decision mak-
complying with follow-up visits, those who are
ing takes into consideration that it is a formal
not able to provide biological specimens and data,
method using the best available evidence either
and those whose safety and ethical protection can-
from the statistical methodology used in a single
not be assured.
study or evidence from multiple studies.
Some definitions are needed to discuss exclusion
Proponents of evidence-based decision making
criteria. Generalizability refers to the applicability
have always recognized that evidence alone is
of study findings in the sample population to the
never the sole determinant of a decision. However,
target population (representativeness) from
from quantitative research, the evidence from the
which the sample was drawn; it requires an
statistical formulas used, such proponents know
unbiased selection of the sample population,
that good evidence is needed in order to make
which is then said to be generalizable to, or rep-
a good decision. This is the crux of evidence-based
resentative of, the target population. Ascertain-
decision making.
ing exclusion criteria requires screening subjects
Timothy Mirtz and Leon Greene using valid and reliable measurements to ensure
that subjects who are said to meet those criteria
See also Decision Rule; Error; Error Rates; Reliability; really have them (sensitivity) and those who are
Validity of Measurement said not to have them really do not have them
(specificity). Such measurements should also be
valid (i.e., should truly measure the exclusion
Further Readings criteria) and reliable (consistent and repeatable
every time they are measured).
Friis, R. H., & Sellers, T. A. (1999). Epidemiology for
The precision of exclusion criteria will depend
public health practice (2nd ed.). Gaithersburg, MD:
Aspen.
on how they are ascertained. For example, ascer-
Jenicek, M. (2003). Foundations of evidence-based taining an exclusion criterion as ‘‘self-reported
medicine. Boca Raton, FL: Parthenon. smoking’’ will likely be less sensitive, specific,
Jenicek, M., & Hitchcock, D. L. (2005). Logic and valid, and reliable than ascertaining it by means of
critical thinking in medicine. Chicago: AMA Press. testing for levels of cotinine in blood. On the other

hand, cotinine in blood may measure exposure to exclusion criteria after IRB approval require new
secondhand smoking, thus excluding subjects who approval of any amendments.
should not be excluded; therefore, a combination In epidemiologic and clinical research, asses-
of self-reported smoking and cotinine in blood sing an exposure or intervention under strict
may increase the sensitivity, specificity, validity, study conditions is called efficacy, whereas doing
and reliability of such measurement, but it will be so in real-world settings is called effectiveness.
more costly and time consuming. Concerns have been raised about the ability to
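As a minimal illustration of the screening logic described above, the following Python fragment computes sensitivity and specificity from a small set of counts; the numbers are invented for the example and do not come from any study cited in this entry.

```python
# Hypothetical screening results for one exclusion criterion.
true_positive, false_negative = 45, 5     # subjects who really meet the criterion
true_negative, false_positive = 140, 10   # subjects who really do not

sensitivity = true_positive / (true_positive + false_negative)  # flags those who have it
specificity = true_negative / (true_negative + false_positive)  # clears those who do not

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
# -> sensitivity = 0.90, specificity = 0.93
```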
A definition of exclusion criteria that requires generalize the results from randomized clinical
several measurements may be just as good as one trials to a broader population, because partici-
using fewer measurements. Good validity and reli- pants are often not representative of those seen
ability of exclusion criteria will help minimize ran- in clinical practice. Each additional exclusion
dom error, selection bias, and confounding, thus criterion implies a different sample population
improving the likelihood of finding an association, and approaches the assessment of efficacy, rather
if there is one, between the exposures or interven- than effectiveness, of the exposure or interven-
tions and the outcomes; it will also decrease the tion under study, thus influencing the utility and
required sample size and allow representativeness applicability of study findings. For example,
of the sample population. Using standardized studies of treatment for alcohol abuse have
exclusion criteria is necessary to accomplish con- shown that applying stringent exclusion criteria
sistency, replicability, and comparability of find- used in research settings to the population at
ings across similar studies on a research topic. large results in a disproportionate exclusion of
Standardized disease-scoring definitions are avail- African Americans, subjects with low socioeco-
able for mental and general diseases (Diagnostic nomic status, and subjects with multiple sub-
and Statistical Manual of Mental Disorders and stance abuse and psychiatric problems.
International Classification of Diseases, respec- Therefore, the use of more permissive exclusion
tively). Study results on a given research topic criteria has been recommended for research stud-
should carefully compare the exclusion criteria to ies on this topic so that results are applicable to
analyze consistency of findings and applicability to broader, real-life populations. The selection and
sample and target populations. Exclusion criteria application of these exclusion criteria will also
must be as parsimonious in number as possible; have important consequences on the assurance
each additional exclusion criterion may decrease of ethical principles, because excluding subjects
sample size and result in selection bias, thus affect- based on race, gender, socioeconomic status,
ing the internal validity of a study and the external age, or clinical characteristics may imply an
validity (generalizability) of results, in addition to uneven distribution of benefits and harms, disre-
increasing the cost, time, and complexity of gard for the autonomy of subjects, and lack of
recruiting study participants. Exclusion criteria respect. Researchers must strike a balance
must be selected carefully based upon a review of between stringent and more permissive exclusion
the literature on the research topic, in-depth criteria. On one hand, stringent exclusion crite-
knowledge of the theoretical framework, and their ria may reduce the generalizability of sample
feasibility and logistic applicability. study findings to the target population, as well
Research proposals submitted for institutional as hinder recruitment and sampling of study sub-
review board (IRB) approval should clearly jects. On the other hand, they will allow rigor-
describe exclusion criteria to potential study parti- ous study conditions that will increase the
cipants, as well as consequences, at the time of homogeneity of the sample population, thus
obtaining informed consent. Often, research proto- minimizing confounding and increasing the like-
col amendments that change the exclusion criteria lihood of finding a true association between
will result in two different sample populations that exposure/intervention and outcomes. Confound-
may require separate data analyses with a justifica- ing may result from the effects of concomitant
tion for drawing composite inferences. Exceptions medical conditions, use of medications other
to exclusion criteria need to be approved by the than the one under study, surgical or rehabilita-
IRB of the research institution; changes to tion interventions, or changes in the severity of

disease in the intervention group or occurrence Szklo, M., & Nieto, F. J. (2007). Epidemiology: Beyond
of disease in the nonintervention group. the basics. Boston: Jones & Bartlett.
Changes in the baseline characteristics of study
subjects that will likely affect the outcomes of the
study may also be stated as exclusion criteria. For
example, women who need to undergo a specimen EXOGENOUS VARIABLES
collection procedure involving repeated vaginal
exams may be excluded if they get pregnant during Exogenous originated from the Greek words exo
the course of the study. In clinical trials, exclusion (meaning ‘‘outside’’) and gen (meaning ‘‘born’’),
criteria identify subjects with an unacceptable risk and describes something generated from outside
of taking a given therapy or even a placebo (for a system. It is the opposite of endogenous, which
example, subjects allergic to the placebo sub- describes something generated from within the sys-
stance). Also, exclusion criteria will serve as the tem. Exogenous variables, therefore, are variables
basis for contraindications to receive treatment that are not caused by any other variables in
(subjects with comorbidity or allergic reactions, a model of interest; in other words, their value is
pregnant women, children, etc.). Unnecessary not determined in the system being studied.
exclusion criteria will result in withholding treat- The concept of exogeneity is used in many
ment from patients who may likely benefit from fields, such as biology (an exogenous factor is a fac-
a given therapy and preclude the translation of tor derived or developed from outside the body);
research results into practice. Unexpected reasons geography (an exogenous process takes place out-
for subjects’ withdrawal or attrition after inception side the surface of the earth, such as weathering,
of the study are not exclusion criteria. An addi- erosion, and sedimentation); and economics (exog-
tional objective of exclusion criteria in clinical enous change is a change coming from outside the
trials is enhancing the differences in effect between economics model, such as changes in customers’
a drug and a placebo; to this end, subjects with tastes or income for a supply-and-demand model).
short duration of the disease episode, those with Exogeneity has both statistical and causal interpre-
mild severity of illness, and those who have a posi- tations in social sciences. The following discussion
tive response to a placebo may be excluded from focuses on the causal interpretation of exogeneity.
the study.
Eduardo Velasco Exogenous Variables in a System
Although exogenous variables are not caused by
See also Bias; Confounding; Inclusion Criteria;
any other variables in a model of interest, they
Reliability; Sampling; Selection; Validity of
may cause the change of other variables in the
Measurement; Validity of Research Conclusions
model. In the specification of a model, exogenous
variables are usually labeled with Xs and endoge-
nous variables are usually labeled with Ys. Exoge-
Further Readings nous variables are the ‘‘input’’ of the model,
Gordis, L. (2008). Epidemiology (4th ed.). Philadelphia: predetermined or ‘‘given’’ to the model. They are
W. B. Saunders. also called predictors or independent variables.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, The following is an example from educational
D., & Newman, T. B. (Eds.). (2007). Designing research. Family income is an exogenous variable
clinical research: An epidemiologic approach (3rd ed.). to the causal system consisting of preschool atten-
Philadelphia: Lippincott Williams & Wilkins. dance and student performance in elementary
LoBiondo-Wood, G., & Haber, J. (2006). Nursing
school. Because family income is determined by
research: Methods and critical appraisal for evidence-
based practice (6th ed.). St. Louis, MO: Mosby.
neither a student’s preschool attendance nor ele-
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). mentary school performance, family income is an
Experimental and quasi-experimental designs for exogenous variable to the system being studied.
generalized causal inference. Boston: Houghton On the other hand, students’ family income may
Mifflin. determine both preschool attendance and

elementary school performance. High-income fam-


ilies are more likely than low-income families to
enroll their children in preschools. High-income
families also tend to provide more resources and
support for their children to perform well in
elementary school; for example, high-income parents
may purchase more learning materials and spend
more spare time helping their children with
homework assignments than low-income parents.

Figure 1   A Hypothetical Path Diagram (variables shown: Family Income; Child's Preschool Attendance; Child's Reading Score; Child's Math Score)

Whether a variable is exogenous is relative. An
exogenous variable in System A may not be an
exogenous variable in System B. For example,
family income is an exogenous variable in the sys-
tem consisting of preschool attendance and ele-
mentary school performance. However, family
income is not an exogenous variable in the system reading score, and math score, respectively.
consisting of parental education level and parental Moreover, a child’s preschool attendance is
occupation because parental education level and hypothesized to influence the child’s reading and
occupation probably influence family income. math scores; thus, single-headed arrows start
Therefore, once parental education level and from the child’s preschool attendance and end
parental occupation are added to the system con- with the child’s reading and math scores. Finally,
sisting of family income, family income will a child’s reading score is hypothesized to influ-
become an endogenous variable. ence the child’s math score; therefore, a single-
headed arrow points to the math score from the
Exogenous Variables in Path Analysis reading score. In this model, even though
a child’s preschool attendance and reading score
Exogenous and endogenous variables are fre- both cause changes in other variables, they are
quently used in structural equation modeling, espe- not exogenous variables because they both have
cially in path analysis in which a path diagram can input from other variables as well. As preschool
be used to portray the hypothesized causal and attendance and reading score serve as indepen-
correlational relationships among all the variables. dent variables as well as dependent variables,
By convention, a hypothesized causal path is indi- they are also called mediators.
cated by a single-headed arrow, starting with More than one exogenous variable may exist in
a cause and pointing to an effect. Because exoge- a path analysis model, and the exogenous variables
nous variables do not receive causal inputs from in a model may be correlated with each other. A
other variables in the system, no single-headed correlation relationship is conventionally indicated
arrow points to exogenous variables. Figure 1 by a double-headed arrow in path analysis. There-
illustrates a hypothetical path diagram. fore, in a path diagram, two exogenous variables
In this model, family income is an exogenous may be connected with a double-headed arrow.
variable. It is not caused by any other variables Finally, it is worth mentioning that the causal
in the model, so no single-headed arrow points relationships among exogenous and endogenous
to it. A child’s preschool attendance and reading/ variables in a path analysis may be hypothetical
math scores are endogenous variables. Each of and built on theories and/or common sense. There-
the endogenous variables is influenced by at least fore, researchers should be cautious and consider
one other variable in the model and accordingly research design in their decision about whether the
has at least one single-headed arrow pointing hypothesized causal relationships truly exist, even
to it. Family income is hypothesized to produce if the hypothesized causal relationships are sup-
changes in all the other variables; therefore, ported by statistical results.
single-headed arrows start from family income
and end with the child’s preschool attendance, Yue Yin
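One way to make the exogenous/endogenous distinction concrete is to write out the structural equations implied by Figure 1. The sketch below is illustrative only; the coefficients (γ, β) and error terms (e) are generic labels rather than estimates from this entry. Family income appears only on the right-hand sides (exogenous), whereas preschool attendance, reading score, and math score each appear on a left-hand side (endogenous).

```latex
% Minimal sketch of the structural equations for the hypothetical path model
% in Figure 1 (requires amsmath); coefficient and error labels are illustrative.
\begin{align*}
\text{Preschool} &= \gamma_1\,\text{Income} + e_1\\
\text{Reading}   &= \gamma_2\,\text{Income} + \beta_1\,\text{Preschool} + e_2\\
\text{Math}      &= \gamma_3\,\text{Income} + \beta_2\,\text{Preschool} + \beta_3\,\text{Reading} + e_3
\end{align*}
```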

X
See also Cause and Effect; Endogenous Variables; Path E½x ¼ x · pðxÞ‚
Analysis x∈

Further Readings
and the above expected value exists if the above
sum of absolute
P value of X is absolutely conver-
Kline, R. B. (2005). Principles and practice of structural gent, that is, x ∈  jxj is finite.
equation modeling (2nd ed.). New York: Guilford. As a simple example, suppose x can take on
Pearl, J. (2000). Causality: Models, reasoning, and two values, 0 and 1, which occur with probabili-
inference. Cambridge, UK: Cambridge University
ties .4 and .6. Then
Press.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).
E½X ¼ ðx ¼ 0Þ × pðx ¼ 0Þ þ ðx ¼ 1Þ × pðx ¼ 1Þ
Experimental and quasi-experimental designs for :
generalized causal inference. New York: Houghton ¼ 0 × 0:4 þ 1 × 0:6 ¼ 0:6
Mifflin.
If G(X) is a function of RV X, its expected
value, E[G(X)], is a weighted average of the possi-
ble values of G(X) and is defined as
EXPECTED VALUE 8R
< x∈ GðxÞ·f ðxÞdx; for continuous case
The expected value is the mean of all values of E½GðXÞ ¼ P :
: GðxÞ·pðxÞ; for discrete case
a random variable weighted by the probability of x∈
the occurrence of the values. The expected value
(or expectation, or mean) of random variable (RV) The above expected value exists if the above sum
X is denoted as E[X] (or sometimes μ). of absolute value of G(X) is absolutely convergent.

Mathematical Definition Linear Relationship


The RV, X, of a random experiment, which is If RVs X1, X2 ; . . . ; Xn have the expectations as μ1,
defined on a probability space ð‚
; PÞ on an μ2 ; . . . ; μn, and c1, c2 ; . . . ; cn are all constants,
underlying sample space , takes value in event set then

⊆ R with certain probability measure P.


If X is a continuous random variable (i.e.,
is ) E½c0  ¼ c0 and E½ci Xi  ¼ ci E½Xi  = ci μi ð1Þ
an interval), the expected value of X is defined as
Z * E½ci Xi þ c0  ¼ ci E½Xi  þ c0 ¼ ci μi þ c0 ð2Þ
E½X ¼ xdP:
 " #
Xn X
n X
n

If X also has a probability density function E ci Xi þc0 ¼ ci E½Xi þc0 ¼ ci μi þc0 ð3Þ
i¼1 i¼1 i¼1
(pdf) f(x) of certain probability distribution, the
above expected value of X can be formulated as
" #
Z X
n X
n

E½X ¼ x · f ðxÞdx: E ci Gi ðXi Þ þ c0 ¼ ci E½Gi ðXi Þ þ c0


x∈ i¼1 i¼1
Xn X
The expected value exists if the above integral of ¼ ci Gi ðxÞ · pðxÞ þ c0
i¼1 x∈
absolute
R value of X is absolutely convergent, that
is, x ∈  jxjdx is finite. ð4Þ
If X is a discrete random variable (i.e.,  is
countable) with probability mass or probability
E½X1 X2 . . . Xn  ¼ μ1 μ2 . . . μn for independent Xi :
density function (pmf or pdf) p(x), the expected
value of X is defined as ð5Þ

Z Z
Interpretation       
E G Xi ;Xj ¼ G xi ;xj f xi ;xj dxi dxj ‚
xi ∈xi xj ∈xj
From a statistical point of view, the following
terms are important: arithmetic mean (or simply
8i‚j∈N‚ and i6¼ j:
mean), central tendency, and location statistic. The
arithmetic mean of X is the summation of the set
of observations (sample) of X ¼ {x1, x2; . . . ; xN} If Xi,Xj are discrete random variables with the
divided by the sample size N: joint pmf p(xi,xj), the expected value of G(Xi,Xj) is

   X X    
1X N E G Xi ;Xj ¼ G xi ;xj p xi ;xj 8i‚j∈N‚
X ¼ xi : xi ∈xi xj ∈xj
N i¼1
and i6¼ j:
X is called arithmetic mean/sample mean/
average when used to estimate the location of
a sample. When it is used to estimate the location Conditional Expected Value
of an underlying distribution, X is called popula-
tion mean/average, or expectation/expected value, Given that Xi,Xj are continuous random variables
which can be denoted as E[X] or μ: This is consis- with the joint pdf f(xi,xj) and f(xj) > 0, then the
tent with the original definition because the proba- conditional pdf of Xi given Xj ¼ xj is
bility of each value’s occurrence is equal to 1/N.  
One could construct a different estimate of the   f xi ; xj
fXi jXj xi jxj ¼   ‚ 8 all xi ‚
mean, for example, if some values were expected fXj xj
to occur more frequently than others. Because
a researcher rarely has such information, the and the corresponding conditional expected value
simple mean is most commonly used. is
Z
Moment    
E Xi jXj ¼ xj ¼ xi · fXi jXj xi jxj dxi :
xi ∈ xi
The moment (a characteristic of a distribution) of
X about the real number c is defined as
Given that Xi,Xj are discrete random variables
n
E½ðx  cÞ 8c ∈ R; and integer n ≥ 1: with the joint pmf p(xi,xj) and p(xj) > 0, then the
conditional pdf of Xi given Xj ¼ xj is
Hence, E[Xn] are also called central moments.  
  p xi ; xj
E[X] is called the first moment (* n ¼ 1) of X pXi jXj xi jxj ¼   ‚ 8 all xi ‚
about c ¼ 0, which is commonly called the mean pXj xj
of X. The second moment about the mean of X is
called the Variance of X. Theoretically, the entire and the corresponding conditional expected value
distribution of X can be described if all moments is
of X are known by using the moment-generating
functions, although only the first five moments are   X  
E Xi jXj ¼ xj ¼ xi · pXi jXj xi jxj :
generally necessary to specify a distribution com- xi ∈ xi
pletely. The third moment is termed skewness, and
the fourth is called kurtosis.
Furthermore, the expectation of the conditional
expectation of Xi given Xj ¼ xj is simply the
Joint Expected Value expectation of Xi:
If Xi,Xj are continuous random variables with the   
joint pdf f ðxi ; xj ), the expected value of G(Xi,Xj) is E E Xi jXj ¼ E½Xi :

Variance and Covariance Cov½X;Y  ¼ E½XY   E½XE½Y  ¼ Cov½X;Y  ð10Þ


Given a continuous RV X with pdf f(x), the vari-  
ance of X can be formulated as Cov ci X þ cj Y;Z ¼ ci Cov½XZþcj Cov½YZ: ð11Þ
h i
Var½X ¼ σ 2X ¼ E ðX  μÞ2
Uncorrelatedness Versus Independence
Z
¼ ðx  μÞ2 ·f ðxÞ dx: Continuous random variables Xi and Xj are said
x∈ to be independent if and only if the joint pdf (or
joint pmf of discrete RVs) of Xi and Xj equals
When X is a discrete RV with pmf p(x), the var- the product of the marginal pdfs for Xi and Xj ,
iance of X can be formulated as respectively:
h i X    
Var½X ¼ σ 2X ¼ E ðX  μÞ2 ¼ ðx  μÞ2 ·pðxÞ: f Xi ; Xj ¼ f ðXi Þf Xj :
x∈
qffiffiffiffiffiffi When Xi , Xj are said to be independent, the

The positive square root of variance, σ 2X , is conditional expected value of Xi given Xj ¼ xj is
called standard deviation and denoted as σ X or sX. the expectation of Xi:
Given two continuous RVs X and Y with joint  
pdf f(x,y), the covariance of X and Y can be for- E Xi jXj ¼ E½Xi : ð12Þ
mulated as
Also, the expectation of the product of Xi and
Cov½X; Y  ¼ σ XY ¼ E½ðX  μX ÞðY  μY Þ Xj equals the product of the expectations for Xi
Z Z
and Xj, respectively:
¼ ðx  μX Þðy  μY Þ · f ðx; yÞ dx:
x ∈ X y ∈ Y
   
E Xi Xj ¼ E½Xi E Xj : ð13Þ
When X and Y are discrete RVs with joint pmf
Xi and Xj are said to be uncorrelated,
p(x,y), the covariance of X and Y can be formu-
orthogonal, and linearly independent if their
lated as
covariance is zero, that is, Cov[Xi,Xj] ¼ 0.
Cov½X; Y  ¼ σ XY ¼ E½ðX  μX ÞðY  μY Þ Then the variance of the summation of Xi and
X X Xj equals the summation of the variances of Xi
¼ ðx  μX Þðy  μY Þ · pðx; yÞ: and Xj, respectively. That is, Equation 9 can be
x ∈ X y ∈ Y
rewritten as
   
Var Xi ; Xj ¼ Var½Xi  þ Var Xj : ð14Þ
Linear Relationship of Variance and Covariance
Independence is a stronger condition that
If X, Y, and Z are random variables and c0 ; always implies uncorrelatedness and orthogonal-
c1 ; . . . ; cn are all constants, then ity, but not vice versa. For example, a perfect
  2
spherical relationship between X and Y will be
Var½X ¼ E X2  E½jXj ¼ Cov½X; X ð6Þ uncorrelated but certainly not independent.
Because simple correlation is linear, many rela-
tionships would have a zero correlation but be
* Var½c0  ¼ 0 and Var½ci X ¼ c2i Var½X ð7Þ
interpretably nonindependent.

) Var½ci X þ c0  ¼ c2i Var½X ¼ c2i σ 2X ð8Þ Inequalities


  Basic Inequality
Var ci X þ cj Y ¼ c2i Var½X þ c2j Var½Y 
ð9Þ Given Xi,Xj are random variables, if the realiza-
þ 2ci cj Cov½X; Y  tion xi is always less than or equal to xj, the

expected value of Xi is less than or equal to that Expectations and Variances for
of Xj: Well-Known Distributions
* Xi ≤ Xj ‚ ) E½Xj  ≤ E½Yi : ð15Þ The following table lists distribution characteris-
tics, including the expected value of X and the
The expected value of the absolute value of expected variance of both discrete and continuous
a random variable X is less than or equal to the variables.
absolute value of its expectation:

E½jXj ≥ jE½Xj: ð15Þ


Estimation
Given the real-valued random variables Y and X,
Jensen Inequality
the optimal minimum mean-square error (MMSE)
estimator of Y based on observing X is the condi-
For a convex function h( · ) and RV X with pdf/ tional mean, E[Y|X], which in the conditional
pmf p(x), then expectation across the ensemble of all random pro-
cess with the same and finite second moment.
E½hðXÞ ≥ hðE½XÞ: ð16Þ Also, the MMSE ε of Y given X, which is defined
as ε ¼ Y  E½YjX, is orthogonal to any function
Or, put in another way, of data X, G(X). That is the so-called orthogonal-
! ity principle
X X
hðxÞ · pðxÞ ≥ h x · pðxÞ : ð17Þ
x ∈ X x ∈ X E½ε · GðXÞ ¼ 0:

When the observations Y are normally distrib-


Markov Inequality uted with zero mean, the MMSE estimator of Y
given X, the conditional expectation E[Y|X], is lin-
If X ≥ 0, then for all x ≥ 0, the Markov inequal- ear. For those non-normally distributed Y, the con-
ity is defined as ditional expectation estimator can be nonlinear.
Usually, it is difficult to have the close form of the
E½X optimal nonlinear estimates, which will depend on
PðX ≥ xÞ ≤ : ð18Þ
x higher order moment functions.
Given a given length of data set (Y; XÞ ¼
fðyi ; xi Þ; i ¼ 1; 2; . . . ; Ng with sample size N, the
Chebyshev Inequality linear least square error (LSE) estimator of Y
given X is
If the variance of X is known, the tighter Che-
byshev inequality bound is
^ ¼ b
Y ^0 ;
^1 X1 þ b
Var½X
PðjX  EðXÞj ≥ xÞ ≤ : ð19Þ
x2 where

Discrete Random Variables                                      Continuous Random Variables

Distribution               E[X]          Var[X]                Distribution       E[X]         Var[X]
Bernoulli (p)              p             p(1 − p)              Uniform (a, b)     (a + b)/2    (b − a)²/12
Binomial (n, p)            np            np(1 − p)             Exponential (λ)    1/λ          1/λ²
Poisson (λ)                λ             λ                     Normal (μ, σ²)     μ            σ²
Negative Binomial (r, p)   r/p           r(1 − p)/p²           Gamma (α, λ)       α/λ          α/λ²
Geometric (p)              (1 − p)/p     (1 − p)/p²            Cauchy (α)         Undefined    Undefined
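The discrete-case definitions can be checked numerically. The short Python sketch below computes E[X] and Var[X] directly from a probability mass function; the Bernoulli(.6) pmf is chosen to match the 0/1 example given earlier in this entry.

```python
# E[X] = sum of x * p(x); Var[X] = E[(X - mu)^2], computed from a small pmf.
pmf = {0: 0.4, 1: 0.6}                                  # value -> probability

mu = sum(x * p for x, p in pmf.items())                 # expected value
var = sum((x - mu) ** 2 * p for x, p in pmf.items())    # second central moment

print(mu, var)   # -> 0.6 0.24, agreeing with the Bernoulli row: p and p(1 - p)
```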



Glass, G. (1964). [Review of the book Expected values of
1
P
N
1
P
N
1
P
N
N
yi  N
yi xi  N
xi discrete random variables and elementary statistics by
^1 ¼
b i¼1 i¼1 i¼1
A. L. Edwards]. Educational and Psychological

2
1
P
N P
N Measurement, 24, 969971.
N
xi  N1 xi
i¼1 i¼1

and
EXPERIENCE SAMPLING METHOD
XN XN
^0 ¼ 1
b ^1 · 1
yi  b xi :
N i¼1 N i¼1 The experience sampling method (ESM) is a strat-
egy for gathering information from individuals
If the random process is stationary (i.e., con- about their experience of daily life as it occurs.
stant mean and time shift-invariant covariance) The method can be used to gather both qualitative
P
N and quantitative data, with questions for partici-
and ergodic (i.e., time average N1 xi converges pants that are tailored to the purpose of the
i¼1 research. It is a phenomenological approach,
in mean-square sense to the ensemble average meaning that the individual’s own thoughts, per-
E[X]), the LSE estimator will be asymptotically ceptions of events, and allocation of attention are
equal to the MMSE estimator. the primary objects of study. In the prototypical
MMSE estimation is not the only possibility for application, participants in an ESM study are
the expected value. Even in simple data sets, other asked to carry with them for 1 week a signaling
statistics, such as the median or mode, can be used device such as an alarm wristwatch or palmtop
to estimate the expected value. They have proper- computer and a recording device such as a booklet
ties, however, that make them suboptimal and inef- of questionnaires. Participants are then signaled
ficient under many situations. Other estimation randomly 5 to 10 times daily, and at each signal,
approaches include the mean of a Bayesian poste- they complete a questionnaire. Items elicit infor-
rior distribution, the maximum likelihood estima- mation regarding the participants’ location at the
tor, or the biased estimator in ridge regression. In moment of the signal, as well as their activities,
modern statistics, these alternative estimation thoughts, social context, mood, cognitive effi-
methods are being increasingly used. For a normally ciency, and motivation. Researchers have used
distributed random variable, all will typically yield ESM to study the effects of television viewing on
the same value except for computational variation mood and motivation, the dynamics of family rela-
in computer routines, usually very small in magni- tions, the development of adolescents, the experi-
tude. For non-normal distributions, various other ence of engaged enjoyment (or flow), and many
considerations come into play, such as the type of mental and physical health issues. Other terms for
distribution encountered, amount of data available, ESM include time sampling, ambulatory assess-
and existence of the various moments. ment, and ecological momentary assessment; these
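As a rough illustration of the linear least square error (LSE) estimator given earlier, the following Python sketch computes the slope and intercept from a small paired data set; the data values are invented for the example.

```python
# Least-squares slope and intercept from paired observations (made-up data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
num = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y   # covariance-like term
den = sum(xi ** 2 for xi in x) / n - mean_x ** 2                   # variance-like term
b1 = num / den                      # slope
b0 = mean_y - b1 * mean_x           # intercept

print(b1, b0)   # slope near 1.0 and intercept near 1.0 for these numbers
```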
Victor L. Willson and Jiun-Yu Wu terms may or may not signify the addition of other
types of measures, such as physiological markers,
See also Chi-Square Test; Sampling Distributions; to the protocol.
Sampling Error

Designing a Study Using ESM


Further Readings
In ESM studies, researchers need to select a sample
Brennan, A., Kharroubi, S., O’Hagan, A., & Chilcott, J.
of people from a population, but they must also
(2007). Calculating partial expected value of perfect
information via Monte Carlo sampling algorithms.
choose a method to select a sample of moments
Medical Decision Making, 27, 448470. from the population of all moments of experience.
Felli, J. C., & Hazen, G. B. (1998). Sensitivity analysis Many studies make use of signal-contingent sam-
and the expected value of perfect information. Medical pling, a stratified random approach in which the
Decision Making, 18, 95109. day is divided into equal segments and the

participant is signaled at a random moment during nonindependence in the data, person-level vari-
each segment. Other possibilities are to signal the ables are preferred when using inferential statisti-
participant at the same times every day (interval- cal techniques such as analysis of variance or
contingent sampling) or to ask the participant to multiple regression. More complex procedures,
respond after every occurrence of a particular such as hierarchical linear modeling, multilevel
event of interest (event-contingent sampling). The modeling, or mixed-effects random regression
number of times per day and the number of days analysis, allow the researcher to consider the
that participants are signaled are parameters that response-level and person-level effects
can be tailored based on the research purpose and simultaneously.
practical matters.
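A signal-contingent schedule of the kind described above can be generated with a few lines of code. The Python sketch below is illustrative only; the waking-day boundaries and the number of daily signals are assumptions chosen for the example.

```python
# One random signal time per equal segment of the waking day (stratified random).
import random

def daily_schedule(start_hour=8, end_hour=22, signals=7):
    """Return one random signal time (as HH:MM) within each equal segment."""
    total = (end_hour - start_hour) * 60          # waking minutes
    seg = total // signals                        # segment length in minutes
    times = [start_hour * 60 + i * seg + random.randrange(seg) for i in range(signals)]
    return [f"{m // 60:02d}:{m % 60:02d}" for m in times]

print(daily_schedule())   # e.g. ['08:41', '10:05', '12:37', ...]
```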
Increasingly, researchers are using palmtop
computers as both the signaling device and the Studies Involving ESM
recording device. The advantages here are the Mihaly Csikszentmihalyi was a pioneer of the
direct electronic entry of the data, the ability to method when he used pagers in the 1970s to study
time-stamp each response, and the ease of pro- a state of optimal experience he called flow. Csiks-
gramming a signaling schedule. Disadvantages zentmihalyi and his students found that when peo-
include the difficulty in obtaining open-ended ple experienced a high level of both challenges and
responses and the high cost of the devices. When skills simultaneously, they also frequently had high
a wristwatch or pager is used as the signaling levels of enjoyment, concentration, engagement,
device and a pen with a booklet of blank question- and intrinsic motivation. To study adolescents’
naires serves as the recording device, participants family relationships, Reed Larson and Maryse
can be asked open-ended questions such as ‘‘What Richards signaled adolescents and their parents
are you doing?’’ rather than be forced to choose simultaneously. The title of their book, Divergent
among a list of activity categories. This method is Realities, telegraphs one of their primary conclu-
less costly, but does require more coding and data sions. Several researchers have used ESM to study
entry labor. Technology appears to be advancing patients with mental illness, with many finding
to the point where an inexpensive electronic device that symptoms worsened when people were alone
will emerge that will allow the entry of open- with nothing to do. Two paradoxes exposed by
ended responses with ease, perhaps like text-mes- ESM research are that people tend to retrospec-
saging on a mobile phone. tively view their work as more negative and TV-
watching as more positive experiences than what
Analysis of ESM Data they actually report when signaled in the moment
while doing these activities.
Data resulting from an ESM study are complex,
including many repeated responses to each ques- Joel M. Hektner
tion. Responses from single items are also often
combined to form multi-item scales to measure See also Ecological Validity; Hierarchical Linear
constructs such as mood or intrinsic motivation. Modeling; Levels of Measurement; Multilevel
Descriptive information, such as means and fre- Modeling; Standardized Score; z Score
quencies, can be computed at the response level,
meaning that each response is treated as one case Further Readings
in the data. However, it is also useful to aggregate
the data by computing means within each person Bolger, N., Davis, A., & Rafaeli, E. (2003). Diary
and percentages of responses falling in categories methods: Capturing life as it is lived. Annual Review
of Psychology, 54, 579616.
of interest (e.g., when with friends). Often, z-
Hektner, J. M., Schmidt, J. A., & Csikszentmihalyi, M.
scored variables standardized to each person’s (2007). Experience sampling method: Measuring the
own mean and standard deviation are computed quality of everyday life. Thousand Oaks, CA: Sage.
to get a sense of how individuals’ experiences in Reis, H. T., & Gable, S. L. (2000). Event-sampling and
one context differ from their average levels of other methods for studying everyday experience. In
experiential quality. To avoid the problem of H. T. Reis & C. M. Judd (Eds.), Handbook of

research methods in social and personality psychology age excludes itself from being an explanation of
(pp. 190222). New York: Cambridge the data.
University Press. There are numerous extraneous variables, any
Shiffman, S. (2000). Real-time self-report of momentary one of which may potentially be an explanation of
states in the natural environment: Computerized
the data. Ambiguity of this sort is minimized with
ecological momentary assessment. In A. A. Stone, J. S.
Turkkan, C. A. Bachrach, J. B. Jobe, H. S. Kurtzman,
appropriate control procedures, an example of
& V. S. Cain (Eds.), The science of self-report: which is random assignment of subjects to the two
Implications for research and practice (pp. 276293). conditions. The assumption is that, in the long
Mahwah, NJ: Lawrence Erlbaum. run, effects of unsuspected confounding variables
Walls, T. A., & Schafer, J. L. (Eds.). (2006). Models for may be balanced between the two conditions.
intensive longitudinal data. New York: Oxford
University Press.
Genres of Experimental Designs
for Data Analysis Purposes

EXPERIMENTAL DESIGN Found in Column I of Table 2 are three groups of


designs defined in terms of the number of factors
used in the experiment, namely, one-factor, two-
Empirical research involves an experiment in which factor, and multifactor designs.
data are collected in two or more conditions that
are identical in all aspects but one. A blueprint for
such an exercise is an experimental design. Shown One-Factor Designs
in Table 1 is the design of the basic experiment. It It is necessary to distinguish between the two-
has (a) one independent variable (color) with two level and multilevel versions of the one-factor
levels (pink and white); (b) four control variables design because different statistical procedures are
(age, health, sex, and IQ); (c) a control procedure used to analyze their data. Specifically, data from
(i.e., random assignment of subjects); and (d) a a one-factor, two-level design are analyzed with
dependent variable (affective score). the t test. The statistical question is whether or not
the difference between the means of the two condi-
Method of Difference and tions can be explained by chance influences (see
Row a of Table 2).
Experimental Control
Some version of one-way analysis of variance
Table 1 also illustrates the inductive rule, method would have to be used when there are three or
of difference, which underlies the basic one-factor, more levels to the independent variable (see Row
two-level experiment. As age is being held con- b of Table 2). The statistical question is whether or
stant, any slight difference in age between subjects not the variance based on three or more test condi-
in the two conditions cannot explain the difference tions is larger than that based on chance.
(or its absence) between the mean performances of With quantitative factors (e.g., dosage) as
the two conditions. That is, as a control variable, opposed to qualitative factors (e.g., type of drug),

Table 1 Basic Structure of an Experiment

Control Variables Control Procedure


Independent
Variable Random Dependent
Test Manipulated, Assignment Variable,
Condition Wall Color Age Health Sex IQ of Subjects (Si) Affective Score
Experimental Pink Middle-aged Good Male Normal S1, S21, S7,. . . S15 To be collected
and analyzed
Control White Middle-aged Good Male Normal S9, S10, S24,. . . S2
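The control procedure listed in Table 1, random assignment of subjects to the two conditions, can be carried out with a simple randomization routine. The Python sketch below is illustrative; the subject labels and group size are hypothetical.

```python
# Randomly split a pool of screened subjects into equal experimental and
# control groups, so chance rather than the researcher decides the assignment.
import random

subjects = [f"S{i}" for i in range(1, 31)]      # 30 eligible subjects (hypothetical)
random.shuffle(subjects)                        # randomize the order
experimental = subjects[:15]                    # pink-wall condition
control = subjects[15:]                         # white-wall condition

print(experimental)
print(control)
```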

Table 2 Genres of Experimental Designs in Terms of Treatment Combinations


Panel A: One-Factor Designs in Terms of Number of Levels
I II III
Number of Number of Statistical Test
Factors Levels in Factor (parametric) Statistical Question
a 1 2 t test Is the difference between the two means accountable by
chance influences?
b 3 or more One-way Can the variance based on the means of the 3 (or more)
ANOVA conditions be explained in terms of chance influences?
Are there trends in the data?
c 2ðA; BÞ m×n Two-way Main effect of A: Is the difference between the m means of
ANOVA A accountable by chance influences?
Main effect of B: Is the difference between the n means of
B accountable by chance influences?
AB interaction: Can the difference among the means of
the m × n treatment combinations accountable
by chance influences?
Simple effect of A: Is the difference among the m means
of A at Level j of B accountable by chance influences?
Simple effect of B: Is the difference among the n means
of B at Level i of A accountable by chance influences?
d 3 or more m × n × p Multi-factor Extension of the questions found in two-way ANOVA
or ANOVA
m×n×p×q
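The analyses named in Rows a and b of Table 2 are routinely run with standard statistical software. The following Python sketch, which assumes the SciPy library is available, shows both tests on invented affective scores; it illustrates the statistical questions in Table 2 rather than reproducing any analysis from this entry.

```python
# Row a: one factor, two levels -> independent-samples t test.
# Row b: one factor, three or more levels -> one-way ANOVA.
from scipy import stats

pink  = [6.1, 5.8, 6.4, 5.9, 6.2]    # invented affective scores
white = [5.2, 5.5, 5.0, 5.4, 5.1]
green = [5.6, 5.9, 5.5, 5.8, 5.7]

t, p_t = stats.ttest_ind(pink, white)          # two-level case
f, p_f = stats.f_oneway(pink, white, green)    # three-level case

print(f"t = {t:.2f} (p = {p_t:.3f}); F = {f:.2f} (p = {p_f:.3f})")
```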

one may ascertain trends in the data when a factor stand for the respective number of levels. For
has three or more levels (see Row b). Specifically, example, the name of a three-factor design is m by
a minimum of three levels is required for ascertain- n by p; the first independent variable has m levels,
ing a linear trend, and a minimum of four levels the second has n levels, and the third has p levels
for a quadratic trend. (see Row d of Table 2).
The lone statistical question of a one-factor,
two-level design (see Row a of Table 2) is asked
Two-Factor Designs
separately for Factors A and B in the case of the
Suppose that Factors A (e.g., room color) and B two-factor design (see [a] and [b] in Row c of
(e.g., room size) are used together in an experi- Table 2). Either of them is a main effect (see [a]
ment. Factor A has m levels; its two levels are a1 and [b] in Row c) so as to distinguish it from a sim-
and a2 when m ¼ 2. If Factor B has n levels (and ple effect (see Row c). This distinction may be
if n ¼ 2), the two levels of B are b1 and b2. The illustrated with Table 3.
experiment has a factorial design when every level
Main Effect
of A is combined with every level of B to define
a test condition or treatment combination. The Assume an equal number of subjects in all treat-
size of the factorial design is m by n; it has m-by-n ment combinations. The means of a1 and a2 are
treatment combinations. This notation may be 4.5 and 2.5, respectively (see the ‘‘Mean of ai’’ col-
generalized to reflect factorial design of any size. umn in either panel of Table 3). The main effect of
Specifically, the number of integers in the name A is 2 (i.e., 4.5  2.5). In the same vein, the means
of the design indicates the number of independent of b1 and b2 are 4 and 3, respectively (see the
variables, whereas the identities of the integers ‘‘Mean of bj’’ row in either panel of Table 3). The

Table 3 What May Be Learned From a 2-by-2 Factorial Design


(a)
Room Size (B)
Main Effect Simple Effect
Small (b1) Large (b2) Mean of ai of A of B at ai
Room Pink (a1) (i) Small, (ii) Large, (5 + 4) ‚ 2 ¼ 4.5 4.5  2.5 ¼ 2 At a1:
Color (A) Pink (ab11) 5 Pink (ab12) 4 d3 ¼ (5  4) ¼ 1
White (a2) (iii) Small, (iv) Large, (3 + 2) ‚ 2 ¼ 2.5 At a2:
White (ab21) 3 White (ab22) 2 d4 ¼ (3  2) ¼ 1
Mean of bj (5 + 3) ‚ 2 ¼ 4 (4 + 2) ‚ 2 ¼ 3 (DofD)1: d1  d2 ¼ 2  2 ¼ 0
(DofD)2: d3  d4 ¼ 1  1 ¼ 0
Main effect of B 43¼1 [Q1]: Is (DofD)12 zero?
Simple effect of A at bj At b1: At b2: [Q2]: Is (DofD)34 zero?
d1 ¼ (5  3) ¼ 2 d2 ¼ (4  2) ¼ 2
(b)
Room Size (B)
Main Effect Simple Effect
Small (b1) Large (b2) Mean of ai of A of B at ai
Room Pink (a1) (i) Small, (ii) Large, (15 + 7) ‚ 2 ¼ 11 11  6 ¼ 5 At a1:
Color (A) Pink (ab11) 15 Pink (ab12) 7 d3 ¼ (15  7) ¼ 8
White (a2) (iii) Small, (iv) Large, (2 + 10) ‚ 2 ¼ 6 At a2:
White (ab21) 2 White (ab22) 10 d4 ¼ (2  10) ¼ 8
Mean of Bj (15 + 2) ‚ 2 ¼ 8.5 (7 + 10) ‚ 2 ¼ 8.5 (DofD)1: d1  d2 ¼ 13  (3) ¼ 16
(DofD)2: d3  d4 ¼ 8  (8) ¼ 16
Main effect of B 8.5  8.5 ¼ 0 [Q1]: Is (DofD)12 zero?
Simple effect of A at Bj At b1: At b2: [Q2]: Is (DofD)34 zero?
d1 ¼ (15  2) ¼ 13 d2 ¼ (7  10) ¼ 3
Notes: (a) An example of additive effects of A and B. (b) An example of AB interaction (nonadditive) effects.
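The quantities defined in Table 3 can be recomputed directly from the four cell means. The Python sketch below uses the Panel (a) values; substituting the Panel (b) means (15, 7, 2, 10) yields a nonzero difference of differences, the signature of an AB interaction.

```python
# Main effects, simple effects, and the difference of differences for a
# 2-by-2 factorial design, computed from the Panel (a) cell means of Table 3.
ab11, ab12 = 5, 4     # pink: small room, large room
ab21, ab22 = 3, 2     # white: small room, large room

main_A = (ab11 + ab12) / 2 - (ab21 + ab22) / 2   # color, averaged over room size
main_B = (ab11 + ab21) / 2 - (ab12 + ab22) / 2   # room size, averaged over color

d1 = ab11 - ab21      # simple effect of A at b1
d2 = ab12 - ab22      # simple effect of A at b2
d3 = ab11 - ab12      # simple effect of B at a1
d4 = ab21 - ab22      # simple effect of B at a2

interaction = d1 - d2  # difference of differences; 0 here, so A and B are additive
print(main_A, main_B, d1, d2, d3, d4, interaction)   # -> 2.0 1.0 2 2 1 1 0
```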

main effect of B is 1. That is, the two levels of B d2 ¼ Simple effect of A at b2 is (ab12  ab22) ¼
(or A) are averaged when the main effect of A (or (4  2) ¼ 2;
B) is being considered.
d3 ¼ Simple effect of B at a1 is (ab12  ab11) ¼
(4  5) ¼ 1;
Simple Effect
d4 ¼ Simple effect of B at a2 is (ab22  ab21) ¼
Given that there are two levels of A (or B), (2  3) ¼ 1.
it is possible to ask whether or not the two levels
of B (or A) differ at either level of A (or B).
Hence, there are the entries, d3 and d4, in the AB Interaction
‘‘Simple effect of B at ai’’ column, and the In view of the fact that there are two simple
entries, d1 and d2, ‘‘Simple effect of A at bj’’ row effects of A (or B), it is important to know
in either panel of Table 3. Those entries are the whether or not they differ. Consequently, the
four simple effects of the 2-by-2 factorial experi- effects noted above give rise to the following
ment. They may be summarized as follows: questions:

[Q1] (DofD)12: Is d1  d2 ¼ 0?
d1 ¼ Simple effect of A at b1 is (ab11  ab21) ¼
(5  3) ¼ 2; [Q2] (DofD)34: Is d3  d4 ¼ 0?

Given that d1  d2 ¼ 0, one is informed that them are assigned randomly to each of the six
the effect of Variable A is independent of that of treatment combinations of a 2-by-3 factorial
Variable B. By the same token, that d3  d4 ¼ experiment. It is called the completely randomized
0 means that the effect of Variable B is indepen- design, but more commonly known as an unre-
dent of that of Variable A. That is to say, when lated sample (or an independent sample) design
the answers to both [Q1] and [Q2] are ‘‘Yes,’’ when there are only two levels to a lone indepen-
the joint effects of Variables A and B on the dent variable.
dependent variable are the sum of the individual
effects of Variables A and B. Variables A and B
Repeated Measures Design
are said to be additive in such an event.
Panel (b) of Table 3 illustrates a different sce- All subjects are tested in all treatment combina-
nario. The answers to both [Q1] and [Q2] are tions in a repeated measures design. It is known by
‘‘No.’’ It informs one that the effects of Variable the more familiar name related samples or depen-
A (or B) on the dependent variable differ at dif- dent samples design when there are only two levels
ferent levels of Variable B (or A). In short, it is to a lone independent variable. The related sam-
learned from a ‘‘No’’ answer to either [Q1] or ples case may be used to illustrate one complica-
[Q2] (or both) that the joint effects of Variables tion, namely, the potential artifact of the order of
A and B on the dependent variables are nonaddi- testing effect.
tive in the sense that their joint effects are not Suppose that all subjects are tested at Level I (or
the simple sum of the two separate effects. Vari- II) before being tested at Level II (or I). Whatever
ables A and B are said to interact (or there is the outcome might be, it is not clear whether the
a two-way AB interaction) in such an event. result is due to an inherent difference between
Levels I and II or to the proactive effects of the level
Multifactor Designs used first on the performance at the subsequent
level of the independent variable. For this reason,
What has been said about two-factor designs a procedure is used to balance the order of testing.
also applies to designs with three or more indepen- Specifically, subjects are randomly assigned to
dent variables (i.e., multifactor designs). For exam- two subgroups. Group 1 is tested with one order
ple, in the case of a three-factor design, it is (e.g., Level I before Level II), whereas Group 2 is
possible to ask questions about three main effects tested with the other order (Level II before Level I).
(A, B, and C); three 2-way interaction effects (AB, The more sophisticated Latin square arrangement is
AC, and BC interactions); a set of simple effects used to balance the order of test when there are
(e.g., the effect of Variable C at different treatment three or more levels to the independent variable.
combinations of AB, etc.); and a three-way inter-
action (viz., ABC interaction).
Randomized Block Design

Genres of Experimental The nature of the levels used to represent an


independent variable may preclude the use of the
Designs for Data Interpretation Purposes
repeated measures design. Suppose that the two
Experimental designs may also be classified in levels of therapeutic method are surgery and radia-
terms of how subjects are assigned to the treat- tion. As either of these levels has irrevocable con-
ment combinations, namely, completely random- sequences, subjects cannot be used in both
ized, repeated measures, randomized block, and conditions. Pairs of subjects have to be selected,
split-plot. assigned, and tested in the following manner.
Prospective subjects are first screened in terms
of a set of relevant variables (body weight, severity
Completely Randomized Design
of symptoms, etc.). Pairs of subjects who are iden-
Suppose that there are 36 prospective subjects. tical (or similar within acceptable limits) are
As it is always advisable to assign an equal number formed. One member of each pair is assigned ran-
of subjects to each treatment combination, six of domly to surgery, and the other member to
Experimental Design 451

Table 4 Inductive Principles Beyond the Method of Difference


(a)
Control Variables Control Procedure
Independent
Variable Random Dependent
Manipulated, Assignment Variable,
Test Condition Medication Age Health Sex IQ of Subjects Affective Score
Experimental 10 units Middle-aged Good Male Normal S1, S21, S7, . . . S36 To be collected
(High dose) and analyzed
Experimental 5 units Middle-aged Good Male Normal S9, S10, S24, . . . S27
(Low dose)
Control Placebo Middle-aged Good Male Normal S9, S10, S24, . . . S12
(b)
Control Variables Control Procedure
Independent
Variable Random Dependent
Manipulated, Assignment Variable,
Test Condition Wall Color Age Health Sex IQ of Subjects Affective Score
Experimental Pink Middle-aged Good Male Normal S1, S21, S7, . . . S15 To be collected
Control (hue) White Middle-aged Good Male Normal S9, S10, S24, . . . S2 and analyzed
Control (brightness) Green Middle-aged Good Male Normal S9, S10, S24, . . . S12
Notes: (a) Method of concomitant variation. (b) Joint method of agreement and difference.

radiation. This matched-pair procedure is Method of Concomitant Variation


extended to matched triplets (or groups of four
Consider a study of the effects of a drug’s dos-
subjects matched in terms of a set of criteria) if
age. The independent variable is dosage, whose
there are three (or four) levels to the independent
three levels are 10, 5, and 0 units of the medica-
variable. Each member of the triplets (or four-
tion in question. As dosage is a quantitative vari-
member groups) is assigned randomly to one of
able, it is possible to ask whether or not the effect
the treatment combinations.
of treatment varies systematically with dosage.
The experimental conditions are arranged in the
Split-Plot Design way shown in Panel (a) of Table 4 that depicts the
A split-plot design is a combination of the method of concomitant variation.
repeated measures design and the completely ran- The control variables and procedures in Tables 1
domized design. It is used when the levels of one and 4 are the same. The only difference is that each
of the independent variables has irrevocable effects row in Table 4 represents a level (of a single inde-
(e.g., surgery or radiation of therapeutic method), pendent variable) or a treatment combination (when
whereas the other independent variable does not there are two or more independent variables). That
(e.g., Drugs A and B of type of drug). is to say, the method of concomitant variation is the
logic underlying factorial designs of any size when
quantitative independent variables are used.
Underlying Inductive Logic
Joint Method of Agreement and Difference
Designs other than the one-factor, two-level design
implicate two other rules of induction, namely, the Shown in Panel (b) of Table 4 is the joint method
method of concomitant variation and the joint of agreement and disagreement. Whatever is true
method of agreement and difference. of Panel (a) of Table 4 also applies to Panel (b) of
452 Experimenter Expectancy Effect

Table 4. It is the underlying inductive rule when may do whatever is required of them. This demand
a qualitative independent variable is used (e.g., characteristics artifact creates credibility issues in the
room color). research data. The subject effect artifact questions
In short, an experimental design is a stipulation the generalizability of research data. This issue arises
of the formal arrangement of the independent, because participants in the majority of psychological
control, and independent variables, as well as the research are volunteering tertiary-level students who
control procedure, of an experiment. Underlying may differ from the population at large.
every experimental design is an inductive rule that As an individual, a researcher has profound
reduces ambiguity by rendering it possible to effects on the data. Any personal characteristics of
exclude alternative interpretations of the result. the researcher may affect research participants
Each control variable or control procedure (e.g., ethnicity, appearance, demeanor). Having
excludes one alternative explanation of the data. vested interests in certain outcomes, researchers
approach their work from particular theoretical
Siu L. Chow perspectives. These biases determine in some subtle
and insidious ways how researchers might behave
See also Replication; Research Hypothesis; Rosenthal
in the course of conducting research. This is the
Effect
experimenter expectancy effect artifact.
Further Readings At the same time, the demand characteristics
artifact predisposes research participants to pick up
Boring, E. G. (1954). The nature and history of
experimental control. American Journal of
cues about the researcher’s expectations. Being
Psychology, 67, 573589. obligingly ingratiatory, research participants ‘‘coop-
Chow, S. L. (1992). Research methods in psychology: A erate’’ with the researcher to obtain the desired
primer. Calgary, Alberta, Canada: Detselig. results. The experimenter expectancy effect artifact
Mill, J. S. (1973). A system of logic: Ratiocinative and detracts research conclusions from their objectivity.
inductive. Toronto, Ontario, Canada: University of
Toronto Press.
SPOPE Revisited—SPONE
Limits of Goodwill
EXPERIMENTER EXPECTANCY Although research participants bear goodwill
EFFECT toward researchers, they may not (and often can-
not) fake responses to please the researcher as
The experimenter’s expectancy effect is an impor- implied in the SPOPE thesis.
tant component of the social psychology of the psy- To begin with, research participants might give
chological experiment (SPOPE), whose thesis is that untruthful responses only when illegitimate fea-
conducting or participating in research is a social tures in the research procedure render it necessary
activity that might be affected subtly by three social and possible. Second, it is not easy to fake
or interpersonal factors, namely, demand character- responses without being detected by the researcher,
istics, subject effects, and the experimenter’s expec- especially when measured with a well-defined task
tancy effects. These artifacts call into question the (e.g., the attention span task). Third, it is not pos-
credibility, generality, and objectivity, respectively, sible to fake performance that exceeds the partici-
of research data. However, these artifacts may be pants’ capability.
better known as social psychology of nonexperi-
mental research (SPONE) because they apply only
Nonexperiment Versus Experiment
to nonexperimental research.
Faking on the part of research participants is
not an issue when experimental conclusions are
The SPOPE Argument
based on subjects’ differential performance on the
Willing to participate and being impressed by the attention span task in two or more conditions with
aura of scientific investigation, research participants proper controls. Suppose that a properly selected
Experimenter Expectancy Effect 453

Table 1 The Basic Structure of an Experiment

Control Variable
Test Condition Independent Variable IQ Sex Age Control Procedure Dependent Variable
Experimental Drug Normal M 1215 Random assignment Longer attention span
Control Placebo Normal M 1215 Repeated measures Shorter attention span

Table 2 A Schematic Representation of the Design of Goldstein, Rosnow, Goodstadt, and Sul’s (1972) Study of
Verbal Conditioning
Experimental Control Condition
Condition (Knowledgeable (Not knowledgeable of Mean of Difference Between
Subject Group of verbal conditioning) verbal conditioning) Two Means Two Means
Volunteers X1 (6) X2 (3) X (4.5) d1 ¼ X1  X2 ð6  3Þ ¼ 3
Nonvolunteers Y 1 (4) Y 2 (1) Y (2.5) d2 ¼ Y 1  Y 2 ð4  1Þ ¼ 3
Source: Goldstein, J. J., Rosnow, R. L., Goodstadt, B., & Suls, J. M. (1972). The ‘‘good subject’’ in verbal operant
conditioning research. Journal of Experimental Research in Personality, 6, 2933.
Notes: Hypothetical mean increase in the number of first-person pronouns used. The numbers in parentheses are added for
illustrative purposes only. They were not Goldstein et al.’s (1972) data.

sample of boys is assigned randomly to the two Individual Differences Versus Their Effects on Data
conditions in Table 1. Further suppose that one
Data shown in Table 2 have been used to
group fakes to do well, and the other fakes to do
support the subject effect artifact. The experi-
poorly. Nonetheless, it is unlikely that the said
ment was carried out to test the effect of volun-
unprincipled behavior would produce the differ-
teering on how fast one could learn. Subjects
ence between the two conditions desired by the
were verbally reinforced for uttering first-person
experimenter.
pronouns. Two subject variables are used (volun-
teering status and knowledgeability of condition-
Difference Is Not Absolute ing principle).
Data obtained from college or university stu- Of interest is the statistically significant main
dents do not necessarily lack generality. For exam- effect of volunteering status. Those who volun-
ple, students also have two eyes, two ears, one teered were conditioned faster than those who did
mouth, and four limbs like typical humans have. not. Note that the two levels of any subject vari-
That is, it is not meaningful to say simply that A able are, by definition, different. Hence, the signifi-
differs from B. It is necessary to make explicit (a) cant main effect of volunteering status is not
the dimension on which A and B differ, and (b) the surprising (see the ‘‘Mean of Two Means’’ column
relevancy of the said difference to the research in in Table 2). It merely confirms a pre-existing indi-
question. vidual difference, but not the required effect of
It is also incorrect to say that researchers individual differences on experimental data. The
employ tertiary students as research participants data do not support the subject effect artifact
simply because it is convenient to do so. On the because the required two-way interaction between
contrary, researchers select participants from spe- volunteering status and knowledgeability of condi-
cial populations in a theoretically guided way tioning is not significant.
when required. For example, they select boys with
normal IQ within a certain age range when they
Experiment Versus Meta-Experiment
study hyperactivity. More important, experimen-
ters assign subjects to test conditions in a theoreti- R. Rosenthal and K. L. Fode were the investi-
cally guided way (e.g., completely random). gators in Table 3 who instructed A, B, C, and D
454 Experimenter Expectancy Effect

Table 3 The Design of Rosenthal and Fode’s (1963) Experimental Study of Expectancy

Investigators (Rosenthal & Fode)


Expectation Group þ5 Data Collectors 5 Data Collectors
Data collector A ... B C ... D
Subject i1 ... j1 k1 ... q1
i2 ... j2 k2 ... q2
... ... ... ... ... ...
... ... ... ... ... ...
in ... jn kn ... qn
Mean rating collected by individual data collector [x] ... [y] [z] ... [w]
Mean rating of data collectors as a group 4.05 0.95
Source: Rosenthal, R., & Fode, K. L. (1963). Three experiments in experimenter bias. Psychological Reports, 12, 491511.

Table 4 The Design of Chow’s (1994) Meta-Experiment


(a)
Investigator (Chow, 1994)

Expectancy Group þ5 Expectancy No Expectancy 5 Expectancy


Experimenter A B F G P Q
Test condition H S H S H S H S H S H S
(H = Happy face)
(S = Sad face)
Subjects i1 i01 j1 j01 k1 k01 m1 m01 n1 n01 q1 q01
i2 i02 j2 j02 k2 k02 m2 m02 n2 n02 q2 q02
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
in i0n jn j0n kn k0n mn m0n nn n0n qn q0n
Mean rating collected [x] [x0 ] [y] [y0 ] [w] [w0 ] [z] [w0 ] [a] [a0 ] [b] [b0 ]
by individual data collector
(b)
Expectancy Group þ5 Expectancy No Expectancy 5 Expectancy
Stimulus face Happy Sad Happy Sad Happy Sad
Mean 3.16 0.89 3.88 0.33 2.95 0.38
Difference between
  4.05 4.21 3.33
two means: E  C
Source: Chow, S. L. (1994). The experimenter’s expectancy effect: A meta experiment. Zeitschrift für Pädagogische
Psychologie (German Journal of Educational Psychology), 8, 8997.
Notes: (a) The design of the meta-experiment. (b) Mean ratings in the experimental (‘‘Happy’’) and control (‘‘Sad’’) condition
as a function of expectancy.

to administer a photo-rating task to their own to expect a mean rating of þ5 (‘‘successful’’)


groups of participants. Participants had to indi- from their participants, and C and D were
cate whether or not a photograph (with a neutral induced to expect a mean rating of 5 (‘‘unsuc-
expression) was one of a successful person or an cessful’’; see the ‘‘5 Data Collectors’’ column).
unsuccessful person. The investigator induced A The observed difference between the two groups
and B (see the ‘‘ þ 5 Data Collectors’’ columns) of data collectors is deemed consistent with the
Exploratory Data Analysis 455

experimenter expectancy effect artifact (i.e., control is used, in which case it is more appro-
4.05 versus 0.95). priate to characterize the SPOPE phenomenon as
Although Rosenthal and Fode are experimen- SPONE. Those putative artifacts are not applica-
ters, A, B, C, and D are not. All of them col- ble to experimental studies.
lected absolute measurement data in one
condition only, not collecting experimental data Siu L. Chow
of differential performance. To test the experi-
See also Experimental Design
menter expectancy effect artifact in a valid man-
ner, the investigators in Table 3 must give each Further Readings
of A, B, C, and D an experiment to conduct.
Chow, S. L. (1992). Research methods in psychology: A
That is, the experimenter expectancy effect arti- primer. Calgary, Alberta, Canada: Detselig.
fact must be tested with a meta-experiment (i.e., Chow, S. L. (1994). The experimenter’s expectancy effect:
an experiment about experiment), an example of A meta experiment. Zeitschrift für Pädagogische
which is shown in panel (a) of Table 4. Psychologie (German Journal of Educational
Regardless of the expectancy manipulation Psychology), 8, 8997.
(positive [ þ 5], neutral [0], or negative [5]), Goldstein, J. J., Rosnow, R. L., Goodstadt, B., & Suls,
Chow gave each of A, B, F, G, P, and Q an experi- J. M. (1972). The ‘‘good subject’’ in verbal operant
ment to conduct. That is, every one of them conditioning research. Journal of Experimental
obtained from his or her own group of subjects the Research in Personality, 6, 2933.
Orne, M. (1962). On the social psychology of the
differential performance on the photo-rating task
psychological experiment: With particular
between two conditions (Happy Face vs. Sad reference to demand characteristics and their
Face). implications. American Psychologist, 17,
It is said in the experimenter expectancy effect 776783.
argument that subjects behave in the way the Rosenthal, R., & Fode, K. L. (1963). Three experiments
experimenter expects. As such, that statement is in experimenter bias. Psychological Reports, 12,
too vague to be testable. Suppose that a sad face 491511.
was presented. Would both the experimenter (e.g.,
A or Q in Table 4) and subjects (individuals tested
by A or Q) ignore that it was a sad (or happy) face
and identify it as ‘‘successful’’ (or ‘‘unsuccessful’’) EXPLORATORY DATA ANALYSIS
under the ‘‘ þ 5’’ (or ‘‘5’’) condition? Much
depends on the consistency between A’s or Q’s Exploratory data analysis (EDA) is a data-driven
expectation and the nature of the stimulus (e.g., conceptual framework for analysis that is based
happy or sad faces), as both A (or Q) and his or primarily on the philosophical and methodological
her subjects might moderate or exaggerate their work of John Tukey and colleagues, which dates
responses. back to the early 1960s. Tukey developed EDA in
response to psychology’s overemphasis on hypode-
ductive approaches to gaining insight into phe-
Final Thoughts
nomena, whereby researchers focused almost
SPOPE is so called because the distinction exclusively on the hypothesis-driven techniques of
between experimental and nonexperimental confirmatory data analysis (CDA). EDA was not
empirical research has not been made as a result developed as a substitute for CDA; rather, its
of not appreciating the role of control in empiri- application is intended to satisfy a different stage
cal research. Empirical research is an experiment of the research process. EDA is a bottom-up
only when three control features are properly approach that focuses on the initial exploration of
instituted (a valid comparison baseline, con- data; a broad range of methods are used to
stancy of conditions, and procedures for elimi- develop a deeper understanding of the data, gener-
nating artifacts). As demand characteristics, ate new hypotheses, and identify patterns in the
participant effect and expectancy effect may be data. In contrast, CDA techniques are of greater
true of nonexperimental research in which no value at a later stage when the emphasis is on
456 Exploratory Data Analysis

testing previously generated hypotheses and con- purpose for which it is used—namely, to assist the
firming predicted patterns. Thus, EDA offers a dif- development of rich mental models of the data.
ferent approach to analysis that can generate
valuable information and provide ideas for further
investigation. Revelation
EDA encourages the examination of different
ways of describing the data to understand inherent
Ethos patterns and to avoid being fooled by unwarranted
A core goal of EDA is to develop a detailed under- assumptions.
standing of the data and to consider the processes
that might produce such data. Tukey used the Data Description
analogy of EDA as detective work because the The use of summary descriptive statistics offers
process involves the examination of facts (data) a concise representation of data. EDA relies on
for clues, the identification of patterns, the genera- resistant statistics, which are less affected by devi-
tion of hypotheses, and the assessment of how well ant cases. However, such statistics involve a trade-
tentative theories and hypotheses fit the data. off between being concise versus precise; therefore,
EDA is characterized by flexibility, skepticism, an analyst should never rely exclusively on statisti-
and openness. Flexibility is encouraged as it is sel- cal summaries. EDA encourages analysts to exam-
dom clear which methods will best achieve the ine data for skewness, outliers, gaps, and multiple
goals of the analyst. EDA encourages the use of peaks, as these can present problems for numerical
statistical and graphical techniques to understand measures of spread and location. Visual representa-
data, and researchers should remain open to unan- tions of data are required to identify such instances
ticipated patterns. However, as summary measures to inform subsequent analyses. For example, based
can conceal or misrepresent patterns in data, EDA on their relationship to the rest of the data, outliers
is also characterized by skepticism. Analysts must may be omitted or may become the focus of the
be aware that different methods emphasize some analysis, a distribution with multiple peaks may be
aspects of the data at the expense of others; thus, split into different distributions, and skewed data
the analyst must also remain open to alternative may be reexpressed. Inadequate exploration of the
models of relationships. data distribution through visual representations
If an unexpected data pattern is uncovered, the can result in the use of descriptive statistics that are
analyst can suggest plausible explanations that are not characteristic of the entire set of values.
further investigated using confirmatory techniques.
EDA and CDA can supplement each other: Where Data Visualization
the abductive approach of EDA is flexible and
Visual representations are encouraged because
open, allowing the data to drive subsequent
graphs provide parsimonious representations of
hypotheses, the more ambitious and focused
data that facilitate the development of suitable
approach of CDA is hypothesis-driven and facili-
mental models. Graphs display information in
tates probabilistic assessments of predicted pat-
a way that makes it easier to detect unexpected
terns. Thus, a balance is required between an
patterns. EDA emphasizes the importance of using
exploratory and confirmatory lens being applied to
numerous graphical methods to see what each
data; EDA comes first, and ideally, any given study
reveals about the data structure.
should combine both.
Tukey developed a number of EDA graphical
tools, including the box-and-whisker plot, other-
wise known as the box plot. Box plots are useful
Methods
for examining data and identifying potential out-
EDA techniques are often classified in terms of the liers; however, like all data summarization methods,
four Rs: revelation, residuals, reexpression, and they focus on particular aspects of the data. There-
resistance. However, it is not the use of a technique fore, other graphical methods should also be used.
per se that determines whether it is EDA, but the Stem-and-leaf displays provide valuable additional
Exploratory Data Analysis 457

information because all data are retained in a fre- about a model’s misspecifications. EDA thus
quency table, providing a sense of the distribution emphasizes careful examination of residual plots
shape. In addition, dot plots highlight gaps or dense for any additional patterns, such as curves or multi-
parts of a distribution and can identify outliers. ple modes, as this suggests that the selected model
Tukey’s emphasis on graphical data analysis has failed to describe an important aspect of the data.
influenced statistical software programs, which In such instances, further smoothing is required to
now include a vast array of graphical techniques. get at the underlying pattern.
These techniques can highlight individual values EDA focuses on building models and generating
and their relative position to each other, check hypotheses in an iterative process of model specifi-
data distributions, and examine relationships cation. The analyst must be open to alternative
between variables and relationships between models, and thus the residuals of different models
regression lines and actual data. In addition, inter- are examined to see if there is a better fit to the
active graphics, such as linked plots, allow the data. Thus, models are generated, tested, modified,
researcher to select a specific case or cases in one and retested in a cyclical process that should lead
graphical display (e.g., scatterplot) and see the the researcher, by successive approximation,
same case(s) in another display (e.g., histogram). toward a good description of the data. Model
Such an approach could identify cases that are building and testing require heeding data at all
bivariate outliers but not outliers on either of the stages of research, especially the early stages of
two variables being correlated. analysis. After understanding the structure of each
variable separately, pairs of variables are examined
in terms of their patterns, and finally, multivariate
Residuals
models of data can be built iteratively. This itera-
According to Tukey, the idea of data analysis tive process is integral to EDA’s ethos of using the
is explained using the following formula: data to develop and refine models.
DATA ¼ SMOOTH þ ROUGH, or, more for- Suitable models that describe the data ade-
mally, DATA ¼ FIT þ RESIDUALS, based on quately can then be compared to models specified
the idea that the way in which we describe/ by theory. Alternatively, EDA can be conducted on
model data is never completely accurate because one subset of data to generate models, and then
there is always some discrepancy between the confirmatory techniques can be applied subse-
model and the actual data. The smooth is the quently to test these models in another subset. Such
underlying, simplified pattern in the data; for cross-validation means that when patterns are dis-
example, a straight line representing the relation- covered, they are considered provisional until their
ship between two variables. However, as data presence is confirmed in a different data set.
never conform perfectly to the smooth, devia-
tions from the smooth (the model) are termed
Reexpression
the rough (the residuals).
Routine examination of residuals is one of the Real-life data are often messy, and EDA recog-
most influential legacies of EDA. Different models nizes the importance of scaling data in an appro-
produce different patterns of residuals; conse- priate way so that the phenomena are represented
quently, examining residuals facilitates judgment in a meaningful manner. Such data transformation
of a model’s adequacy and provides the means to is referred to as data reexpression and can reveal
develop better models. From an EDA perspective, additional patterns in the data. Reexpression can
the rough is just as important as the smooth and affect the actual numbers, the relative distances
should never be ignored. between the values, and the rank ordering of the
Although residual sums-of-squares are widely numbers. Thus, EDA treats measurement scales as
used as a measure of the discrepancy between the arbitrary, advocating a flexible approach to exami-
model and the data, relying exclusively on this mea- nation of data patterns.
sure of model fit is dangerous as important patterns Reexpression may make data suitable for para-
in the residuals may be overlooked. Detailed exami- metric analysis. For example, nonlinear transfor-
nation of the residuals reveals valuable information mations (e.g., log transformation) can make data
458 Exploratory Factor Analysis

follow a normal distribution or can stabilize the of outliers is examined by comparing the residuals
variances. Reexpression can result in linear rela- from a model based on the entire data set with one
tionships between variables that previously had that excludes outliers. If results are consistent
a nonlinear relationship. Making the distribution across the two models, then either course of action
symmetrical about a single peak makes modeling may be followed. Conversely, if substantial differ-
of the data pattern easier. ences exist, then both models should be reported
and the impact of the outliers needs to be consid-
ered. From an EDA perspective, the question is
Resistance not how to deal with outliers but what can be
learned from them. Outliers can draw attention to
Resistance involves the use of methods that mini- important aspects of the data that were not origi-
mize the influence of extreme or unusual data. Dif- nally considered, such as unanticipated psychologi-
ferent procedures may be used to increase cal processes, and provide feedback regarding
resistance. For example, absolute numbers or rank- model misspecification. This data-driven approach
based summary statistics can be used to summarize to improving models is inherent in the iterative
information about the shape, location, and spread EDA process.
of a distribution instead of measures based on sums.
A common example of this is the use of the median
Conclusion
instead of the mean; however, other resistant central
tendency measures can also be used, such as the tri- EDA is a data-driven approach to gaining familiar-
mean (the mean of the 25th percentile, the 75th per- ity with data. It is a distinct way of thinking about
centile, and the median counted twice). Resistant data analysis, characterized by an attitude of flexi-
measures of spread include the interquartile range bility, openness, skepticism, and creativity to dis-
and the median absolute deviation. covering patterns, avoiding errors, and developing
Resistance is increased by giving greater weight useful models that are closely aligned to data.
to values that are closer to the center of the distri- Combining insights from EDA with the powerful
bution. For example, a trimmed mean may be used analytical tools of CDA provides a robust
whereby data points above a specified value are approach to data analysis.
excluded from the estimation of the mean. Alter-
natively, a Winsorized mean may be used where Maria M. Pertl and David Hevey
the tail values of a distribution are pulled in to
See also Box-and-Whisker Plot; Histogram; Outlier;
match those of a specified extreme score.
Residual Plot; Residuals; Scatterplot

Outliers
Further Readings
Outliers present the researcher with a choice
between including these extreme scores (which may Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.).
(1983). Understanding robust and exploratory data
result in a poor model of all the data) and excluding
analysis. New York: John Wiley.
them (which may result in a good model that
Keel, T. G., Jarvis, J. P., & Muirhead, Y. E. (2009). An
applies only to a specific subset of the original exploratory analysis of factors affecting homicide
data). EDA considers why such extreme values investigations: Examining the dynamics of murder
arise. If there is evidence that outliers were pro- clearance rates. Homicide Studies, 13, 5068.
duced by a different process from the one underly- Tukey, J. W. (1977). Exploratory data analysis. Reading,
ing the other data points, it is reasonable to exclude MA: Addison-Wesley.
the outliers as they do not reflect the phenomena
under investigation. In such instances, the
researcher may need to develop different models to
account for the outliers. For example, outliers may EXPLORATORY FACTOR ANALYSIS
reflect different subpopulations within the data set.
However, often there is no clear reason for the Exploratory factor analysis (EFA) is a multivariate
presence of outliers. In such instances, the impact statistical technique to model the covariance
Exploratory Factor Analysis 459

structure of the observed variables by three sets of patterns, and ei contains measurement errors and
parameters: (a) factor loadings associated with uniqueness. It is almost like a multiple regression
latent (i.e., unobserved) variables called factors, model; however, the major difference from multi-
(b) residual variances called unique variances, and ple regression is that in EFA, the factors are latent
(c) factor correlations. EFA aims at explaining the variables and not observed. The model for EFA is
relationship of many observed variables by a rela- often given in a matrix form:
tively small number of factors. Thus, EFA is con-
sidered one of the data reduction techniques. x ¼ μ þ Λf þ e‚ ð1Þ
Historically, EFA dates back to Charles Spearman’s
where x, μ, and e are p-dimensional vectors, f is
work in 1904, and the theory behind EFA has been
an m-dimensional vector of factors, and Λ is
developed along with the psychological theories of
a p × m matrix of factor loadings. It is usually
intelligence, such as L. L. Thurstone’s multiple fac-
assumed that factors (f) and errors (e) are uncorre-
tor model. Today, EFA is among the most fre-
lated, and different error terms (ei and ej for i 6¼ j)
quently used statistical techniques by researchers
are uncorrelated. From the matrix form of the
in the social sciences and education.
model in Equation 1, we can express the popula-
It is well-known that EFA often gives the solu-
tion variance-covariance matrix (covariance struc-
tion similar to principal component analysis
ture) Σ as
(PCA). However, there is a fundamental difference
between EFA and PCA in that factors are predic- Σ ¼ ΛΦΛ0 þ Ψ ð2Þ
tors in EFA, whereas in PCA, principal compo-
nents are outcome variables created as a linear if factors are correlated, where Φ is an m × m cor-
combination of observed variables. Here, an relation matrix among factors (factor correlation
important note is that PCA is a different method matrix), Λ0 is the transpose of matrix Λ in which
from principal factor analysis (also called the prin- rows and columns of Λ are interchanged (so that
cipal axis method). Statistical software such as Λ0 is a m × p matrix), and Ψ is a p × p diagonal
IBMâ SPSSâ (PASW) 18.0 (an IBM company, for- matrix (all off-diagonal elements are zero due to
merly named PASWâ Statistics) supports both uncorrelated ei) of error or unique variances. When
PCA and principal factor analysis. Another simi- the factors (f ) are not correlated, the factor corre-
larity exists between EFA and confirmatory factor lation matrix is equal to the identity matrix (i.e.,
analysis (CFA). In fact, CFA was developed as Φ ¼ I m ) and the covariance structure is reduced to
a variant of EFA. The major difference between
EFA and CFA is that EFA is typically employed Σ ¼ ΛΛ0 þ Ψ ð3Þ
without prior hypotheses regarding the covariance
structure, whereas CFA is employed to test the For each observed variable, when factors are
prior hypotheses on the covariance structure. not correlated, we can compute the sum of
Often, researchers do EFA and then do CFA using squared factor loadings
a different sample. Note that CFA is a submodel X
m
of structural equation models. It is known that hi ¼ λ2i1 þ λ2i2 þ    þ λ2im ¼ λ2ij ‚ ð4Þ
two-parameter item response theory (IRT) is math- j¼1
ematically equivalent to the one-factor EFA with
ordered categorical variables. EFA with binary and which is called the communality of the (ith) vari-
ordered categorical variables can also be treated as able. When factors are correlated, the communal-
a generalized latent variable model (i.e., a general- ity is calculated as
ized linear model with latent predictors).
X
m X
Mathematically, EFA expresses each observed hi ¼ λ2ij þ λij λik φjk : ð5Þ
variable (xi) as a linear combination of factors (f1, j¼1 j6¼k
f2 ; . . . ; fm) plus an error term, that is,
xi ¼ μi þ λi1f1 þ λi2f2 þ    þ λimfm þ ei, where m When the observed variables are standardized,
is the number of factors, μi is the population mean the ith communality gives the proportion of vari-
of xi,λijs are called the factor loadings or factor ability of the ith variable explained by the m
460 Exploratory Factor Analysis

factors. It is well-known that the squared multiple solution converges. It obtains factor loading esti-
correlation of the ith variable on the remaining mates using the eigenvalues and eigenvectors of
p  1 variables gives a lower bound for the the matrix R  Ψ, where R is the sample correla-
ith communality. tion matrix.
When the factors are uncorrelated, with
a m × m orthogonal matrix T (i.e., TT 0 ¼ I m Þ, the
variance-covariance matrix Σ of the observed vari-
Estimation (Extraction)
ables x under EFA given as Σ ¼ ΛΛ0 þ Ψ can be
There are three major estimation methods routinely rewritten as
used in EFA. Each estimation method tries to mini-
mize a distance between the sample covariance Σ ¼ ΛTT 0 Λ0 þ Ψ ¼ ðΛTÞðΛTÞ0 þ Ψ;
matrix S and model-based covariance matrix (esti- ð6Þ
Ψ ¼ Λ Λ0 þ Ψ‚
mate of Σ based on the EFA model: Σ ¼ ΛΛ0 þ Ψ
because for ease of estimation, initially, the factors where Λ ¼ ΛT. This indicates that the EFA model
are typically assumed to be uncorrelated). The has an identification problem called the indetermi-
first method tries to minimize the trace (i.e., sum nacy. That means that we need to impose at least
of diagonal elements) of (1/2)ðS  ΣÞ2 and is m(m  1)/2 constraints on the factor loading matrix
called either least-squares (LS) or unweighted in order to estimate the parameters λij uniquely. For
least-squares (ULS) method. Although LS is fre- example, in the ML estimation, commonly used
quently used in multiple regression, it is not so constraints are to let Λ0 Ψ1 Λ be a diagonal matrix.
common as an estimation method for EFA Rotations (to be discussed) are other ways to
because it is not scale invariant. That is, the impose constraints on the factor loading matrix.
solution is different if we use the sample correla- One can also fix m(m  1)/2 loadings in the upper
tion matrix or the sample variance-covariance triangle of Λ at zero for identification.
matrix. Consequently, the following two meth- In estimation, we sometimes encounter a prob-
ods (both of which are scale invariant) are fre- lem called the improper solution. The most fre-
quently used for parameter estimation in EFA. quently encountered improper solution associated
One of them tries to minimize the trace of with EFA is that certain estimates of unique var-
ð1=2ÞfðS  ΣÞS1 g2 and is called the generalized iances in Ψ are negative. Such a phenomenon is
least-squares (GLS) method. Note that S1 is the called the Heywood case. If the improper solution
inverse (i.e., matrix version of reciprocal) of S occurs as a result of sampling fluctuations, it is not
and serves as a weight matrix here. Another scale- of much concern. However, it may be a manifesta-
invariant estimation method tries to minimize tion of model misspecification.
trace (SΣ1 Þ  logðdetðSΣ1 ÞÞ  p; where det is When data are not normally distributed or con-
the determinant operator and log is the natural tain outliers, better parameter estimates can be
logarithm. This method is called the maximum- obtained when the sample covariance matrix S in
likelihood (ML) method. It is known that when any of the above estimation methods is replaced
the model holds, GLS and ML give asymptotically by a robust covariance matrix. When a sample
(i.e., when sample size is very large) equivalent contains missing values, the S should be replaced
solutions. In fact, the criterion for ML can be by the maximum-likelihood estimate of the popu-
approximated by the trace of (1/2)[(S  Σ)Σ1 2 , lation covariance matrix.
with almost the same function to be minimized as
GLS, the only difference being the weight matrix
Number of Factors
S1 replaced by Σ1. When the sample is normally
distributed, the ML estimates are asymptotically We need to determine the number of factors m
most efficient (i.e., when the sample size is large, such that the variance-covariance matrix of
the ML procedure leads to estimates with the observed variables is well approximated by the
smallest variances). Note that the principal factor factor model, and also m should be as small as
method frequently employed as an estimation possible. Several methods are commonly employed
method for EFA is equivalent to ULS when the to determine the number of factors.
Exploratory Factor Analysis 461

KaiserGuttman Criterion m factor model is statistically adequate in explain-


ing the relationship of the measured variables.
One widely used method is to let the number of
Under the null hypothesis, the LRT statistic asymp-
factors equal the number of eigenvalues of the
totically follows a chi-square distribution with
sample correlation matrix that are greater than 1.
degrees of freedom df ¼ ½ðp  mÞ2  ðp þ mÞ=2.
It is called the ‘‘eigenvalue-greater-than-1 rule’’ or
Statistical software reports the LRT statistic with
the KaiserGuttman criterion. The rationale
Bartlett correction in which the sample size n is
behind it is as follows: For a standardized variable,
replaced by n  ð2p þ 5Þ=6  2m=3, which is sup-
the variance is 1. Thus, by choosing the number of
posed to improve the closeness of the LRT to the
factors equal to eigenvalues that are greater than
asymptotic chi-square distribution. A rescaled ver-
1, we can choose the factors whose variance is at
sion of the LRT also exists when data do not fol-
least greater than the variance of each (standard-
low normal distribution or samples contain
ized) observed variable. It makes sense from the
missing values.
point of view of data reduction. However, this
criterion tends to find too many factors. A variant
of the eigenvalue-greater-than-1 rule is the Rotation
‘‘eigenvalue-greater-than-zero’’ rule that applies
In his 1947 book, Thurstone argued that the initial
when the diagonal elements of the sample correla-
solution of factor loadings should be rotated to
tion matrix are replaced by the squared multiple
find a simple structure, that is, the pattern of factor
correlations when regressing each variable on the
loadings having an easy interpretation. A simple
rest of the p  1 variables (called the reduced cor-
structure can be achieved when, for each row of
relation matrix).
the factor loading matrix, there is only one ele-
ment whose absolute value (ignoring the sign) is
Scree Plot high and the rest of the elements are close to zero.
Another frequently used rule is a visual plot, Several methods for rotation to a simple structure
with the ordered eigenvalues (from large to small) have been proposed. Mathematically, these meth-
of the sample correlation matrix in the vertical ods can be regarded as different ways of imposing
axis and the ordinal number in the horizontal axis. constraints to resolve the identification problem
This plot is commonly called the scree plot. It fre- discussed above.
quently happens with practical data that, after the The rotational methods are classified as
first few, the eigenvalues taper off as almost orthogonal and oblique rotations. The orthogo-
a straight line. The number of factors suggested nal rotations are the rotations in which the fac-
by the scree plot is the number of eigenvalues just tors are uncorrelated, that is, TT 0 ¼ I m with the
before they taper off in a linear fashion. There are rotation matrix T that connects the initial factor
variant methods to the scree plot, such as Horn’s loading matrix A and the rotated factor loading
parallel analysis. matrix Λ such that Λ ¼ AT. The oblique rota-
tions are the rotations in which the factors are
allowed to be correlated with each other. Note
Communality here that once we employ an oblique rotation,
Because communalities are analogous to the we need to distinguish between the factor pat-
squared multiple correlation in regression, they tern matrix and the factor structure matrix. The
can be used to aid our decision for the number of factor pattern matrix is a matrix whose elements
factors. Namely, we should choose m such that are standardized regression coefficients of each
every communality is sufficiently large. observed variable on factors, whereas the factor
structure matrix represents correlations between
the factors and the observed variables. As long
Likelihood Ratio Test
as the rotation is orthogonal, the factor pattern
When the observed sample is normally distrib- and the factor structure matrices are identical,
uted, the likelihood ratio test (LRT) statistic from and we often call the identical matrix the factor
the ML procedure can be used to test whether the loading matrix.
462 Exploratory Factor Analysis

Table 1 Sample Correlation Matrix for the Nine Psychological Tests (n ¼ 145)
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 Visual 1.000
x2 Cubes 0.318 1.000
x3 Flags 0.468 0.230 1.000
x4 Paragraph 0.335 0.234 0.327 1.000
x5 Sentence 0.304 0.157 0.335 0.722 1.000
x6 Word 0.326 0.195 0.325 0.714 0.685 1.000
x7 Addition 0.116 0.057 0.099 0.203 0.246 0.170 1.000
x8 Counting 0.314 0.145 0.160 0.095 0.181 0.113 0.585 1.000
x9 Straight 0.489 0.239 0.327 0.309 0.345 0.280 0.408 0.512 1.000

Table 2 SAS Code for EFA for the Data in Table 1


data one(type = corr);
input _type_ $ _name_ $ x1-x9;
datalines;
n . 145 145 145 145 145 145 145 145 145
corr x1 1.000 . . . . . . . .
corr x2 0.318 1.000 . . . . . . .
corr x3 0.468 0.230 1.000 . . . . . .
corr x4 0.335 0.234 0.327 1.000 . . . . .
corr x5 0.304 0.157 0.335 0.722 1.000 . . . .
corr x6 0.326 0.195 0.325 0.714 0.685 1.000 . . .
corr x7 0.116 0.057 0.099 0.203 0.246 0.170 1.000 . .
corr x8 0.314 0.145 0.160 0.095 0.181 0.113 0.585 1.000 .
corr x9 0.489 0.239 0.327 0.309 0.345 0.280 0.408 0.512 1.000
;
title ‘promax solution for nine psychological tests (n = 145)’;
proc factor data = one scree method = ml nfactor = 3 rotate = promax se;
run;

Table 3 Eigenvalues of the Sample Correlation Matrix


Among the orthogonal rotations, Kaiser’s vari-
Cumulative max rotation is by far the most frequently used.
Eigenvalue Difference Proportion Proportion The varimax rotation tries to achieve a simple
1 3.557 1.977 0.395 0.395 structure by rotating to maximize the variance of
2 1.579 0.423 0.176 0.571 squared factor loadings. More specifically, letting
3 1.156 0.367 0.128 0.699 Λ ¼ (λij ) denote the rotated factor loading matrix,
4 0.789 0.218 0.088 0.787 the varimax rotation maximizes
5 0.571 0.142 0.063 0.850
6 0.429 0.069 0.048 0.898 X 1X 4 w X 2 2
m p p
7 0.360 0.054 0.040 0.938 f λij  2 ð λ Þ g‚ ð7Þ
8 0.306 0.052 0.034 0.972 j¼1
p i¼1 p i ¼ 1 ij
9 0.254 0.028 1.000
Note: The squared multiple correlations are x1: 0.408, x2: where w ¼ 1. The varimax rotation is within the
0.135, x3: 0.274, x4: 0.631, x5: 0.600, x6: 0.576, x7: 0.402, family of orthogonal rotations called the ortho-
x8: 0.466, x9: 0.437. max rotations, which includes quartimax
Exploratory Factor Analysis 463

Table 4 Varimax-Rotated Factor Pattern Matrix (and standard error in parentheses)


Factor 1 Factor 2 Factor 3 Communality
x1 0.143 (0.062) 0.831 (0.095) 0.143 (0.075) 0.732
x2 0.135 (0.077) 0.359 (0.088) 0.066 (0.074) 0.151
x3 0.256 (0.078) 0.505 (0.083) 0.070 (0.076) 0.325
x4 0.833 (0.037) 0.244 (0.055) 0.073 (0.055) 0.759
x5 0.794 (0.039) 0.205 (0.056) 0.173 (0.057) 0.702
x6 0.780 (0.040) 0.243 (0.059) 0.069 (0.058) 0.672
x7 0.166 (0.060) 0.013 (0.074) 0.743 (0.077) 0.580
x8 0.013 (0.058) 0.240 (0.082) 0.794 (0.074) 0.688
x9 0.188 (0.066) 0.469 (0.088) 0.511 (0.078) 0.517

Table 5 Promax Rotation: Factor Pattern Matrix (and standard error in parentheses)
Factor 1 Factor 2 Factor 3 Communality
x1 0.027 (0.041) 0.875 (0.102) 0.022 (0.064) 0.732
x2 0.068 (0.094) 0.359 (0.109) 0.011 (0.088) 0.151
x3 0.169 (0.093) 0.494 (0.107) 0.044 (0.078) 0.325
x4 0.852 (0.044) 0.061 (0.054) 0.032 (0.044) 0.759
x5 0.809 (0.046) 0.007 (0.056) 0.086 (0.050) 0.702
x6 0.794 (0.047) 0.074 (0.060) 0.032 (0.049) 0.672
x7 0.121 (0.054) 0.193 (0.060) 0.785 (0.083) 0.580
x8 0.131 (0.043) 0.126 (0.086) 0.803 (0.087) 0.688
x9 0.065 (0.070) 0.388 (0.107) 0.440 (0.092) 0.517

Table 6 Promax Rotation: Factor Structure Matrix


Factor 1 Factor 2 Factor 3 4.0

x1 0.341 (0.074) 0.855 (0.084) 0.308 (0.095) 1


3.5
x2 0.219 (0.086) 0.384 (0.083) 0.146 (0.086)
x3 0.369 (0.083) 0.549 (0.072) 0.189 (0.076) 3.0
x4 0.870 (0.033) 0.412 (0.078) 0.208 (0.086)
x5 0.834 (0.037) 0.385 (0.078) 0.295 (0.085) 2.5
Eigen values

x6 0.817 (0.037) 0.400 (0.075) 0.198 (0.082)


x7 0.239 (0.082) 0.162 (0.077) 0.741 (0.074) 2.0
x8 0.127 (0.082) 0.380 (0.088) 0.818 (0.068) 2
x9 0.343 (0.072) 0.586 (0.079) 0.607 (0.076) 1.5
3
1.0
4
5
0.5
6 7
Table 7 Interfactor Correlation Matrix With Promax 8 9
Rotation 0.0
Factor 1 Factor 2 Factor 3 0 1 2 3 4 5 6 7 8 9
Factor 1 1.000
Factor 2 0.427 (0.076) 1.000
Figure 1 Scree Plot of Eigenvalues of Sample
Factor 3 0.254 (0.089) 0.386 (0.084) 1.000
Correlation Matrix
464 Exploratory Factor Analysis

rotation (with w ¼ 0) and equamax criterion Illustration


(with w ¼ m/2).
Among the oblique rotations, the promax rota- To illustrate EFA, nine variables were adopted from
tion is most well-known. The promax rotation is the original 26-variable test as reported by Karl J.
a variant of Procrustes rotation in which a simple Holzinger and Frances Swineford in 1939. The nine
structure is achieved by minimizing the distance variables are: (x1) Visual perception, (x2) Cubes,
between the factor pattern matrix and the target (x3) Flags, (x4) Paragraph comprehension, (x5) Sen-
matrix using the least-squares method. The pro- tence completion, (x6) Word meaning, (x7) Addi-
max rotation creates the target matrix using the tion, (x8) Counting dots, and (x9) Straight-curved
varimax-rotated factor loading matrix. More spe- capitals. The sample size is 145, and the sample cor-
cifically, the varimax solution is first normalized, relation matrix is given in Table 1. The sample cor-
and then each element is accentuated by raising to relation matrix was analyzed with the ML
the power of either three or four while retaining estimation method using the factor procedure in the
the sign. Other oblique rotations include oblimin statistical software SAS version 9.1.3. The SAS code
rotations that minimize is listed in Table 2. Table 3 contains the eigenvalues
of the sample correlation matrix, and the scree plot
( ! !) is given in Figure 1. The first three eigenvalues
X
m Xp
w X 2
p X
p
λ2ij λ2ik  λ λ2ik ð8Þ (3.557, 1.579, and 1.156) were greater than 1. The
j6¼k i¼1
p i ¼ 1 ij i¼1 LRT statistic with Bartlett correction for a three-fac-
tor model is 3.443, corresponding to a p value of
with the weight w ranging between 0 and 1. .992 when compared against the chi-square distribu-
After rotation, the meaning of each factor is tion with 12 degrees of freedom; the LRT statistic
identified using the variables on which it has sub- for a two-factor model is 49.532, corresponding to
stantial or significant loadings. The significance of a p value of .0002 when compared against the chi-
loadings can be judged using the ratio of loading square distribution with 19 degrees of freedom.
over its standard error (SE). The SE option in SAS Thus, both the KaiserGuttman criterion and the
factor procedure gives outputs of standard errors LRT statistic suggest that a three-factor solution is
for factor loadings with various rotations under adequate, and the subsequent analysis was done
the normality assumption. Formulas for SE with assuming the three-factor model. After the initial
non-normally distributed or missing data also solution, the factor loadings were rotated with the
exist, but not in standard software. varimax rotation. The varimax-rotated factor load-
ings are shown in Table 4, where Factor 1 has high
loadings on Variables 4 through 6, Factor 2 has high
Factor Scores
loadings on Variables 1 through 3, and Factor 3 has
It is of interest to know the predicted score for m high loadings on Variables 7 through 9. Communal-
factors for each observation using the EFA model. ities are higher than 0.5 except for Variables 2
Two factor score predictors/estimators are well- (0.151) and 3 (0.325). The factor pattern matrix for
known. One is the Bartlett estimator the promax rotation (Table 5) is almost identical to
the factor loading matrix for the varimax rotation.
f^B ¼ ðΛ0 Ψ1 ΛÞ1 Λ0 Ψ1 ðx  μÞ‚ ð9Þ So is the factor structure matrix (Table 6). Thus, it
seems obvious that Factor 1 measures language abil-
which is conditionally unbiased (that is, the ity, Factor 2 measures visual ability, and Factor 3
expected value of ^f given f is equal to f). The other measures speed. The correlation between Factors 1
estimator, commonly called the regression estimator, and 2 is 0.427, the correlation between Factors 1
and 3 is 0.254, and the correlation between Factors
^f ¼ ΦΛ0 Σ1 ðx  μÞ‚ ð10Þ 2 and 3 is 0.386 (see Table 7).
R
Kentaro Hayashi and Ke-Hai Yuan
is conditionally biased, but the trace of its mean
squared error matrix is the smallest. There are also See also Confirmatory Factor Analysis; Item
other different factor score estimators.

See also … Analysis; Latent Variable; Principal Components Analysis; Structural Equation Modeling

Further Readings

Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Harmann, H. H. (1976). Modern factor analysis (3rd ed.). Chicago: University of Chicago Press.
Hatcher, L. (1994). A step-by-step approach to using the SAS system for factor analysis and structural equation modeling. Cary, NC: SAS Institute.
Holzinger, K. J., & Swineford, F. A. (1939). A study in factor analysis: The stability of a bifactor solution (Supplementary Educational Monograph 48). Chicago: University of Chicago Press.
Hoyle, R. H., & Duvall, J. L. (2004). Determining the number of factors in exploratory and confirmatory factor analysis. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 301–315). Thousand Oaks, CA: Sage.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis. Thousand Oaks, CA: Sage.
Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201–293.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Yanai, H., & Ichikawa, M. (2007). Factor analysis. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 257–296). Amsterdam: Elsevier.
Yuan, K.-H., Marshall, L. L., & Bentler, P. M. (2002). A unified approach to exploratory factor analysis with missing data, nonnormal data, and in the presence of outliers. Psychometrika, 67, 95–122.

EX POST FACTO STUDY

Ex post facto study or after-the-fact research is a category of research design in which the investigation starts after the fact has occurred without interference from the researcher. The majority of social research, in contexts in which it is not possible or acceptable to manipulate the characteristics of human participants, is based on ex post facto research designs. It is also often applied as a substitute for true experimental research to test hypotheses about cause-and-effect relationships or in situations in which it is not practical or ethically acceptable to apply the full protocol of a true experimental design. Despite studying facts that have already occurred, ex post facto research shares with experimental research design some of its basic logic of inquiry.

Ex post facto research design does not include any form of manipulation or measurement before the fact occurs, as is the case in true experimental designs. It starts with the observation and examination of facts that took place naturally, in the sense that the researcher did not interfere, followed afterward by the exploration of the causes behind the evidence selected for analysis. The researcher takes the dependent variable (the fact or effect) and examines it retrospectively in order to identify possible causes and relationships between the dependent variable and one or more independent variables. After the deconstruction of the causal process responsible for the facts observed and selected for analysis, the researcher can eventually adopt a prospective approach, monitoring what happens after that.

Contrary to true experimental research, ex post facto research design looks first to the effects (dependent variable) and tries afterward to determine the causes (independent variable). In other words, unlike experimental research designs, the independent variable has already been applied when the study is carried out, and for that reason, it is not manipulated by the researcher. In ex post facto research, the control of the independent variables is made through statistical analysis, rather than by control and experimental groups, as is the case in experimental designs. This lack of direct control of the independent variable and the nonrandom selection of participants are the most important differences between ex post facto research and the true experimental research design.

Ex post facto research design has strengths that make it the most appropriate research plan in numerous circumstances; for instance, when it is not possible to apply a more robust and rigorous research design because the phenomenon occurred naturally; or it is not practical to
manipulate the independent variables; or the control of independent variables is unrealistic; or when such manipulation of human participants is ethically unacceptable (e.g., delinquency, illnesses, road accidents, suicide). Instead of exposing human subjects to certain experiments or treatments, it is more reasonable to explore the possible causes after the fact or event has occurred, as is the case in most issues researched in anthropology, geography, sociology, and in other social sciences. It is also a suitable research design for an exploratory investigation of cause-effect relationships or for the identification of hypotheses that can later be tested through true experimental research designs.

It has a number of weaknesses or shortcomings as well. From the point of view of its internal validity, the two main weak points are the lack of control of the independent variables and the nonrandom selection of participants or subjects. For example, its capacity to assess confounding errors (e.g., errors due to history, social interaction, maturation, instrumentation, selection bias, mortality) is unsatisfactory in numerous cases. As a consequence, the researcher may not be sure that all independent variables that caused the facts observed were included in the analysis, or if the facts observed would not have resulted from other causes in different circumstances, or if that particular situation is or is not a case of reverse causation. It is also open to discussion whether the researcher will be able to find out if the independent variable made a significant difference or not in the facts observed, contrary to the true experimental research design, in which it is possible to establish if the independent variable is the cause of a given fact or event. Therefore, from the point of view of its internal validity, ex post facto research design is less persuasive to determine causality compared to true experimental research designs. Nevertheless, if there is empirical evidence flowing from numerous case studies pointing to the existence of a causal relationship, statistically tested, between the independent and dependent variables selected by the researcher, it can be considered sound evidence in support of the existence of a causal relationship between these variables. It also has a number of weaknesses from the point of view of its external validity, when samples are not randomly selected (e.g., nonprobabilistic samples: convenient samples, snowball samples), which limits the possibility of statistical inference. For that reason, findings in ex post facto research design cannot, in numerous cases, be generalized or looked upon as being statistically representative of the population.

In sum, ex post facto research design is widely used in social as well as behavioral and biomedical sciences. It has strong points that make it the most appropriate research design in a number of circumstances as well as limitations that make it weak from the point of view of its internal and external validity. It is often the best research design that can be used in a specific context, but it should be applied only when a more powerful research design cannot be employed.

Carlos Nunes Silva

See also Cause and Effect; Control Group; Experimental Design; External Validity; Internal Validity; Nonexperimental Designs; Pre-Experimental Design; Quasi-Experimental Design; Research Design Principles

Further Readings

Bernard, H. R. (1994). Research methods in anthropology: Qualitative and quantitative approaches (2nd ed.). Thousand Oaks, CA: Sage.
Black, T. R. (1999). Doing quantitative research in the social sciences: An integrated approach to research design, measurement and statistics. Thousand Oaks, CA: Sage.
Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education. London: Routledge.
Engel, R. J., & Schutt, R. K. (2005). The practice of research in social work. Thousand Oaks, CA: Sage.
Ethridge, M. E. (2002). The political research experience: Readings and analysis (3rd ed.). Armonk, NY: M. E. Sharpe.

EXTERNAL VALIDITY

When an investigator wants to generalize results from a research study to a wide group of people (or a population), he or she is concerned with external validity. A set of results or conclusions
from a research study that possesses external validity can be generalized to a broader group of individuals than those originally included in the study. External validity is relevant to the topic of research methods because scientific and scholarly investigations are normally conducted with an interest in generalizing findings to a larger population of individuals so that the findings can be of benefit to many and not just a few. In the next three sections, the kinds of generalizations associated with external validity are introduced, the threats to external validity are outlined, and the methods to increase the external validity of a research investigation are discussed.

Two Kinds of Generalizations

Two kinds of generalizations are often of interest to researchers of scientific and scholarly investigations: (a) generalizing research findings to a specific or target population, setting, and time frame; and (b) generalizing findings across populations, settings, and time frames. An example is provided to illustrate the difference between the two kinds.

Imagine a new herbal supplement is introduced that is aimed at reducing anxiety in 25-year-old women in the United States. Suppose that a random sample of all 25-year-old women has been drawn that provides a nationally representative sample within known limits of sampling error. Imagine now that the women are randomly assigned to two conditions—one where the women consume the herbal supplement as prescribed, and the other a control group where the women unknowingly consume a sugar pill. The two conditions or groups are equivalent in terms of their representativeness of 25-year-old women. Suppose that after data analysis, the group that consumed the herbal supplement demonstrated lower anxiety than the control group as measured by a paper-and-pencil questionnaire. The investigator can generalize this finding to the average 25-year-old woman in the United States, that is, the target population of the study. Note that this finding can be generalized to the average 25-year-old woman despite possible variations in how differently women in the experimental group reacted to the supplement. For example, a closer analysis of the data might reveal that women in the experimental group who exercised regularly reduced their anxiety more in relation to women who did not; in fact, a closer analysis might reveal that only those women who exercised regularly in addition to taking the supplement reduced their anxiety. In other words, closer data analysis could reveal that the findings do not generalize across all subpopulations of 25-year-old women (e.g., those who do not exercise) even though they do generalize to the overall target population of 25-year-old women.

The distinction between these two kinds of generalizations is useful because generalizing to specific populations is surprisingly more difficult than generalizing across populations because the former typically requires large-scale studies where participants have been selected using formal random sampling procedures. This is rarely achieved in field research, where large-scale studies pose challenges for administering treatment interventions and for high-quality measurement, and participant attrition is liable to occur systematically. Instead, the more common practice is to generalize findings from smaller studies, each with its own sample of convenience or accidental sampling (i.e., a sample that is accrued expediently for the purpose of the research but provides no guarantee that it formally represents a specific target population), across the populations, settings, and time frames associated with the smaller studies. It needs to be noted that individuals in samples of convenience may belong to the target population to which one wishes to generalize findings; however, without formal random sampling, the representativeness of the sample is questionable. According to Thomas Cook and Donald Campbell, an argument can be made for strengthening external validity by means of a greater number of smaller studies with samples of convenience than by a single large study with an initially representative sample. Given the frequency of generalizations across populations, settings, and time frames in relation to target populations, the next section reviews the threats to external validity claims associated with this type of generalization.

Threats to External Validity

To be able to generalize research findings across populations, settings, and time frames, the investigator needs to have evidence that the research findings are not unique to a single population,
but rather apply to more than one population. One source for this type of evidence comes from examining statistical interactions between variables of interest. For example, in the course of data analysis, an investigator might find that consuming an herbal supplement (experimental treatment) statistically interacts with the activity level of the women participating in the study, such that women who exercise regularly benefit more from the anxiety-reducing effects of the supplement relative to women who do not exercise regularly. What this interaction indicates is that the positive effects of the herbal supplement cannot be generalized equally to all subpopulations of 25-year-old women. The presence of a statistical interaction means that the effect of the variable of interest (i.e., consuming the herbal supplement) changes across levels of another variable (i.e., activity levels of 25-year-old women). In order to generalize the effects of the herbal supplement across subpopulations of 25-year-old women, a statistical interaction cannot be observed between the two variables of interest. Many interactions can threaten the external validity of a study. These are outlined as follows.

Participant Selection and Treatment Interaction

To generalize research findings across populations of interest, it is necessary to recruit participants in an unbiased manner. For example, when recruiting female participants to take part in an herbal supplement study, if the investigator advertises the study predominantly in health food stores and obtains the bulk of participants from this location, then the research findings may not generalize to women who do not visit health food stores. In other words, there may be something unique to those women who visit health food stores and decide to volunteer in the study that may make them more disposed to the effects of a health supplement. To counteract this potential bias, the investigator could systematically advertise the study in other kinds of food stores to test whether the selection of participants from different locations interacts with the treatment. If the statistical interaction is absent, then the investigator can be confident that the research findings are not exclusive to those women who visit health food stores and, possibly, are more susceptible to the effects of an herbal supplement than other women. Thus, recruiting participants from a variety of locations and making participation as convenient as possible should be undertaken.

Setting and Treatment Interaction

Just as the selection of participants can interact with the treatment, so can the setting in which the study takes place. This type of interaction is more applicable to research studies where participants experience an intervention that could plausibly change in effect depending on the context, such as in educational research or organizational psychological investigations. However, to continue with the health supplement example, suppose the investigator requires the participants to consume the health supplement in a laboratory and not in their homes. Imagine that the health supplement produces better results when the participant ingests it at home and produces worse results when the participant ingests it in a laboratory setting. If the investigator varies the settings in the study, it is possible to test the statistical interaction between the setting in which the supplement is ingested and the herbal supplement treatment. Again, the absence of a statistical interaction between the setting and the treatment variable would indicate that the research findings can be generalized across the two settings; the presence of an interaction would indicate that the findings cannot be generalized across the settings.

History and Treatment Interaction

In some cases, the historical time in which the treatment occurs is unique and could contribute to either the presence or absence of a treatment effect. This is a potential problem because it means that whatever effect was observed cannot be generalized to other time frames. For example, suppose that the herbal supplement is taken by women during a week in which the media covers several high-profile optimistic stories about women. It is reasonable for an investigator to inquire whether the positive results of taking an herbal supplement would have been obtained during a less eventful week. One way to test for the interaction between
historical occurrences and treatment is to administer the study at different time frames and to replicate the results of the study.

Methods to Increase External Validity

If one wishes to generalize research findings to target populations, it is appropriate to outline a sampling frame and select instances so that the sample is representative of the population to which one wishes to generalize within known limits of sampling error. Procedures for how to do this can be found in textbooks on sampling theory. Often, the most representative samples will be those that have been selected randomly from the population of interest. This method of random sampling for representativeness requires considerable resources and is often associated with large-scale studies. After participants have been randomly selected from the population, participants can then be randomly assigned to experimental groups.

Another method for increasing external validity involves sampling for heterogeneity. This method requires explicitly defining target categories of persons, settings, and time frames to ensure that a broad range of instances from within each category is represented in the design of the study. For example, an educational researcher interested in testing the effects of a mathematics intervention might design the study to include boys and girls from both public and private schools located in small rural towns and large metropolitan cities. The objective would then be to test whether the intervention has the same effect in all categories (e.g., whether the mathematics intervention leads to the same effect in boys and girls, public and private schools, and rural and metropolitan areas). Testing for the effect in each of the categories requires a sufficiently large sample size in each of the categories. Deliberate sampling for heterogeneity does not require random sampling at any stage in the design, so it is usually viable to implement in cases where investigators are limited by resources and in their access to participants. However, deliberate sampling does not allow one to generalize from the sample to any formally specified population. What deliberate sampling does allow one to conclude is that an effect has or has not been obtained within a specific range of categories of persons, settings, and times. In other words, one can claim that "in at least one sample of boys and girls, the mathematics intervention had the effect of increasing test scores."

There are other methods to increase external validity, such as the impressionistic modal instance model, where the investigator samples purposively for specific types of instances. Using this method, the investigator specifies the category of person, setting, or time to which he or she wants to generalize and then selects an instance of each category that is impressionistically similar to the category mode. This method of selecting instances is most often used in consulting or project evaluation work where broad generalizations are not required. The most powerful method for generalizing research findings, especially if the generalization is to a target population, is random sampling for representativeness. The next most powerful method is random sampling for heterogeneity, with the method of impressionistic modal instance being the least powerful. The power of the model decreases as the natural assortment of individuals in the sample dwindles. However, practical concerns may prevent an investigator from using the most powerful method.

Jacqueline P. Leighton

See also Inference: Deductive and Inductive; Interaction; Research Design Principles; Sampling; Theory

Further Readings

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Tebes, J. K. (2000). External validity and scientific psychology. American Psychologist, 55(12), 1508–1509.
F

FACE VALIDITY

Face validity is a test of internal validity. As the name implies, it asks a very simple question: "On the face of things, do the investigators reach the correct conclusions?" It requires investigators to step outside of their current research context and assess their observations from a commonsense perspective. A typical application of face validity occurs when researchers obtain assessments from current or future individuals who will be directly affected by programs premised on their research findings. An example of testing for face validity is the assessment of a proposed new patient tracking system by obtaining observations from local community health care providers who will be responsible for implementing the program and getting feedback on how they think the new program may work in their centers.

What follows is a brief discussion on how face validity fits within the overall context of validity tests. Afterward, documentation of face validity's history is reviewed. Here, early criticisms of face validity are addressed that set the stage for how and why the test returned as a valued assessment. This discussion of face validity concludes with some recent applications of the test.

The Validity of Face Validity

To better understand the value and application of face validity, it is necessary to first set the stage for what validity is. Validity is commonly defined as a question: "To what extent do the research conclusions provide the correct answer?" In testing the validity of research conclusions, one looks at the relationship of the purpose and context of the research project to the research conclusions. Validity is determined by testing (questions of validity) research observations against what is already known in the world, giving the phenomenon that researchers are analyzing the chance to prove them wrong. All tests of validity are context-specific and are not an absolute assessment. Tests of validity are divided into two broad realms: external validity and internal validity. Questions of external validity look at the generalizability of research conclusions. In this case, observations generated in a research project are assessed on their relevance to other, similar situations. Face validity falls within the realm of internal validity assessments. A test of internal validity asks if the researcher draws the correct conclusion based on the available data. These types of assessments look into the nuts-and-bolts of an investigation (for example, looking for sampling error or researcher bias) to see if the research project was legitimate.

History of Face Validity

For all of its simplicity, the test for face validity has had an amazing and dramatic past and has only recently re-emerged as a valued and respected test of validity. In its early applications, face validity was used by researchers as
a first-step assessment, in concert with other tests, to assess the validity of an analysis. During the 1940s and 1950s, face validity was used by psychologists when they were in the early stages of developing tests for use in selecting industrial and military personnel. It was soon widely used by many different types of researchers in different types of investigations, resulting in confusion on what actually constituted face validity. Quickly, the confusion over the relevance of face validity gave way to its being rejected by researchers in the 1960s, who took to new and more complex tests of validity.

Early Debate Surrounding Face Validity

Discussions surrounding face validity were revived in 1985 by Baruch Nevo's seminal article "Face Validity Revisited," which focused on clearing up some of the confusion surrounding the test and challenging researchers to take another, more serious look at face validity's applications. Building on Nevo's research, three questions can be distinguished in the research validity literature that have temporarily prevented face validity from getting established as a legitimate test of validity (see Table 1).

Table 1  Debate Over Face Validity

Point of contention: Is face validity a valid test?
One side: Face validity is not a valid test. Several other, more technically advanced tests of validity provide a more detailed assessment of research than what is learned from a face validity assessment.
Other side: Face validity is a valid test. Face validity asks a simple question: Do the research findings make sense? No other validity test that assesses research findings is based on common sense. Face validity fills a gap in internal validity tests.

Point of contention: Is face validity a stand-alone test, or is it another shade of content validity?
One side: Content validity and face validity are related. Face validity is a watered-down version of content validity because it asks roughly the same question: Do the identified research variables closely fit what is known about the research topic? When you have content validity, you automatically have face validity.
Other side: Face validity is a stand-alone test. Face validity focuses on the commonsense appearance of validity in the research findings. Content validity focuses on the fit of the defined content of variables to what is known about the research topic. When you have content validity (fit), you may not have face validity (appearance).

Point of contention: Is face validity only for experts?
One side: Only experts can conduct face validity tests. Face validity is for experts only because laypeople do not have a true understanding of research methods and tests of validity.
Other side: Laypersons are important for tests of face validity. Laypersons provide valuable insights on the applicability of the research findings based on their experiential perspectives.

The first question regarding face validity is over the legitimacy of the test itself. Detractors argue that face validity is insignificant because its observations are not based on any verifiable testing procedure yielding only rudimentary observations about a study. Face validity does not require a systematic method in the obtaining of face validity observations. They conclude that the only use for face validity observations is for public relations statements.

Advocates for face validity see that face validity provides researchers with the opportunity for commonsense testing of research results: "After the investigation is completed and all the tests of validity and reliability are done, does this study make sense?" Here, tests of face validity allow investigators a new way to look at their conclusions to make sure they see the forest for the trees, with the forest being common sense and the trees being all of the different tests of validity used in documenting the veracity of their study.
The second question confuses the value of face validity by blurring the applications of face validity with content validity. The logic here is that both tests of validity are concerned with content and the representativeness of the study. Content validity is the extent to which the items identified in the study reflect the domain of the concept being measured. Because content validity and face validity both look at the degree to which the intended range of meanings in the concepts of the study appear to be covered, once a study has content validity, it will automatically have face validity. After testing for content validity, there is no real need to test for face validity.

The other side to this observation is that content validity should not be confused with face validity because they are completely different tests. The two tests of validity are looking at different parts of the research project. Content validity is concerned with the relevance of the identified research variables within a proposed research project, whereas face validity is concerned with the relevance of the overall completed study. Face validity looks at the overall commonsense assessment of a study. In addition to the differences between the two tests of validity in terms of what they assess, other researchers have identified a sequential distinction between content validity and face validity. Content validity is a test that should be conducted before the data-gathering stage of the research project is started, whereas face validity should be applied after the investigation is carried out. The sequential application of the two tests is intuitively logical because content validity focuses on the appropriateness of the identified research items before the investigation has started, whereas face validity is concerned with the overall relevance of the research findings after the study has been completed.

The third question surrounding face validity asks a procedural question: Who is qualified to provide face validity observations—experts or laypersons? Proponents for the "experts-only" approach to face validity believe that experts who have a substantive knowledge about a research topic and a good technical understanding of tests of validity provide constructive insights from outside of the research project. In this application of face validity, experts provide observations that can help in the development and/or fine-tuning of research projects. Laypersons lack technical research skills and can provide only impressionistic face validity observations, which are of little use to investigators.

Most researchers now see that the use of experts in face validity assessments is more accurately understood as being a test of content validity because they provide their observations at the start or middle of a research project, and face validity focuses on assessing the relevance of research conclusions. Again, content validity should be understood sequentially in relation to face validity, with the former being used to garner expert observations on the relevance of research variables in the earlier parts of the investigation from other experts in the field, and face validity should come from laypersons for their commonsense assessment at the completion of the research project.

The large-scale vista that defines face validity, and that defines the contribution this assessment provides to the research community, also provides its Achilles heel. Face validity lacks the depth, precision, and rigor of inquiry that comes with both internal and external validity tests. For example, in assessing the external validity of a survey research project, one can precisely look at the study's sample size to determine if it has a representative sample of the population. The only question face validity has for a survey research project is a simple one: "Does the study make sense?" For this reason, face validity can never be a stand-alone test of validity.

The Re-Emergence of Face Validity

The renewed interest in face validity is part of the growing research practice of integrating laypersons' nontechnical, one-of-a-kind insights into the evaluation of applied research projects. Commonly known as obtaining an emic viewpoint, testing for face validity provides the investigator the opportunity to learn what many different people affected by a proposed program already know about a particular topic. The goal in this application of face validity is to include the experiential perspectives of people affected by research projects in their assessment of what causes events to happen, what the effects of the study in the community may be, and what specific words or events mean in the community.
The following examples show how researchers use face validity assessments in very different contexts, but share the same goal: obtaining a commonsense assessment from persons affected by research conclusions. Michael Quinn Patton is widely recognized for his use of "internal evaluators" to generate face validity observations in the evaluation of programs. In the Hazelden Foundation of Minnesota case study, he describes his work in providing annual evaluations based on the foundation's data of tracking clients who go through its program. At the completion of the annual evaluation, a team of foundation insider evaluators then participates in the evaluation by assessing the data and conclusions made in the reports.

Face validity assessments are commonly used in applied research projects that include the fields of community development, planning, public policy, and macro social work. In planning, face validity observations are obtained during scheduled public hearings throughout the planning process. The majority of planning research is based on artificial constructs of reality that allow planners to understand complex, multivariable problems (e.g., rush-hour traffic). One of the reasons that planners incorporate citizen input into the planning process is that it allows them to discover the "inside perspective" from the community on how their research and proposed plans may affect their day-to-day lives. A street-widening project in Lincoln, Nebraska, is one example of how a city used face validity in its planning process. A central traffic corridor was starting to experience higher levels of rush-hour congestion as the result of recent growth on the city's edge. Knowing that simply widening the street to accommodate more vehicles could affect area businesses adversely, city planners met with local store owners to get their face validity observations of how the street affected their daily operations. Armed with traffic data and face validity observations of local store owners, the city was able to plan a wider street that took into account both traffic commuters' and area businesses' experiences with the street.

John Gaber

See also Applied Research; Planning Research; "Validity"

Further Readings

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22(4), 287–293.
Patton, M. Q. (1997). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage.
Pike, K. (1967). Language in relation to a unified theory of the structure of human behavior. Paris: Mouton & Co.

FACTORIAL DESIGN

A factorial design contains two or more independent variables and one dependent variable. The independent variables, often called factors, must be categorical. Groups for these variables are often called levels. The dependent variable must be continuous, measured on either an interval or a ratio scale.

Suppose a researcher is interested in determining if two categorical variables (treatment condition and gender) affect a continuous variable (achievement). The researcher decides to use a factorial design because he or she wants to examine population group means. A factorial analysis of variance will allow him or her to answer three questions. One question concerns the main effect of treatment: Do average achievement scores differ significantly across treatment conditions? Another question concerns the main effect of gender: Does the average achievement score for females differ significantly from the average achievement score for males? The final question refers to the interaction effect of treatment condition and gender: Is the effect of treatment condition on achievement the same for both genders?

This entry first describes how to identify factorial designs and their advantages. Next, analysis and interpretation of factorial designs, including follow-up analyses for significant results, are discussed. A short discussion on the importance of effect size concludes the entry.

Identification

One way to identify factorial designs is by the number of factors involved. Although there is no limit to the number of factors, two-factor and three-factor designs are most common. Occasionally,
a researcher will use a four-factor design, but these situations are extremely rare. When a study incorporates a large number of factors, other designs are considered, such as regression.

Another way to identify factorial designs is by the number of levels for each factor. The simplest design is a 2 × 2, which represents two factors, both of them having two levels. A 3 × 4 design also has two factors, but one factor has three levels (e.g., type of reward: none, food, money) and the other factor has 4 levels (e.g., age: 6–8 years, 9–11 years, 12–14 years, 15–16 years). A 2 × 2 × 3 design has three factors; for example, gender (2 levels: male, female), instructional method (2 levels: traditional, computer-based), and ability (3 levels: low, average, high).

In a factorial design, each level of a factor is paired with each level of another factor. As such, the design includes all combinations of the factors' levels, and a unique subset of participants is in each combination. Using the 3 × 4 example in the previous paragraph, there are 12 cells or subsets of participants. If a total of 360 participants were included in the study and group sample sizes were equal, then 30 young children (ages 6 to 8) would receive no reward for completing a task, a different set of 30 young children (ages 6 to 8) would receive food for completing a task, and yet a different set of 30 young children (ages 6 to 8) would receive money for completing a task. Similarly, unique sets of 30 children would be found in the 9–11, 12–14, and 15–16 age ranges.

This characteristic separates factorial designs from other designs that also involve categorical independent variables and continuous dependent variables. For instance, a repeated measures design requires the same participant to be included in more than one level of an independent variable. If the 3 × 4 example was changed to a repeated measures design, then each participant would be exposed to tasks involving the three different types of rewards: none, food, and money.

Advantages

Factorial designs have several advantages. First, they allow for a broader interpretation of results. If a single-factor design was used to examine treatments, the researcher could generalize results only to the particular group of participants chosen, whereas if one or two additional factors, such as gender or age, are included in the design, then the researcher can examine differences between these specific subsets of participants. Another advantage is that the simultaneous effect of the factors operating together can be tested. By examining the interaction between treatment and age, the researcher can determine whether the effect of treatment is dependent on age. The youngest participant group may show higher scores when receiving Treatment A, whereas the oldest participant group may show higher scores when receiving Treatment B.

A third advantage of factorial designs is that they are more parsimonious, efficient, and powerful than an examination of each factor in a separate analysis. The principle of parsimony refers to conducting one analysis to answer all questions rather than multiple analyses. Efficiency is a related principle. Using the most efficient design is desirable, meaning the one that produces the most precise estimate of the parameters with the least amount of sampling error. When additional factors are added to a design, the error term can be greatly reduced. A reduction of error also leads to more powerful statistical tests. A factorial design requires fewer participants in order to achieve the same degree of power as in a single-factor design.

Analysis and Interpretation

The statistical technique used for answering questions from a factorial design is the analysis of variance (ANOVA). A factorial ANOVA is an extension of a one-factor ANOVA. A one-factor ANOVA involves one independent variable and one dependent variable. The F-test statistic is used to test the null hypothesis of equality of group means. If the dependent variable is reaction time and the independent variable has three groups, then the null hypothesis states that the mean reaction times for Groups 1, 2, and 3 are equal. If the F test leads to rejection of the null hypothesis, then the alternative hypothesis is that at least one pair of the group mean reaction times is not equal. Follow-up analyses are necessary to determine which pair or pairs of means are unequal.

Additional null hypotheses are tested in a factorial ANOVA. For a two-factor ANOVA, there are
three null hypotheses. Two of them assess main effects, that is, the independent effect of each independent variable on the dependent variable. A third hypothesis assesses the interaction effect of the two independent variables on the dependent variable. For a three-factor ANOVA, there are seven null hypotheses: (a) three main-effect hypotheses, one for each independent variable; (b) three two-factor interaction hypotheses, one for each unique pair of independent variables; and (c) one three-factor interaction hypothesis that examines whether a two-factor interaction is generalizable across levels of the third factor. It is important to note that each factorial ANOVA examines only one dependent variable. Therefore, it is called a univariate ANOVA. When more than one dependent variable is included in a single procedure, a multivariate ANOVA is used.

Model Assumptions

Similar to one-factor designs, there are three model assumptions for a factorial analysis: normality, homogeneity of variance, and independence. First, values on the dependent variable within each population group must be normally distributed around the mean. Second, the population variances associated with each group in the study are assumed to be equal. Third, one participant's value on the dependent variable should not be influenced by any other participant in the study. Although not an assumption per se, another requirement of factorial designs is that each subsample should be a random subset from the population. Prior to conducting statistical analysis, researchers should evaluate each assumption. If assumptions are violated, the researcher can either (a) give evidence that the inferential tests are robust and the probability statements remain valid or (b) account for the violation by transforming variables, using statistics that adjust for the violation, or using nonparametric alternatives.

Example of Math Instruction Study

The following several paragraphs illustrate the application of factorial ANOVA for an experiment in which a researcher wants to compare effects of two methods of math instruction on math comprehension. Method A involves a problem-solving and reasoning approach, and Method B involves a more traditional approach that focuses on computation and procedures. The researcher also wants to determine whether the methods lead to different levels of math comprehension for male versus female students. This is an example of a 2 × 2 factorial design. There are two independent variables: method and gender. Each independent variable has two groups or levels. The dependent variable is represented by scores on a mathematics comprehension assessment. Total sample size for the study is 120, and there are 30 students in each combination of gender and method.

Matrix of Sample Means

Before conducting the ANOVA, the researcher examined a matrix of sample means. There are three types of means in a factorial design: cell means, marginal means, and an overall (or grand) mean. In a 2 × 2 design, there are four cell means, one for each unique subset of participants. Table 1 shows that the 30 male students who received Method A had an average math comprehension score of 55. Males in Method B had an average score of 40. Females' scores were lower but had the same pattern across methods as males' scores. The 30 female students in Method A had an average score of 50, whereas the females in Method B had a score of 35.

Table 1  Sample Means for the Mathematics Instruction Study

                               Method A            Method B
                               (Problem Solving)   (Traditional)   Marginal Means for Gender
Males                          55.0                40.0            47.5
Females                        50.0                35.0            42.5
Marginal Means for Method      52.5                37.5            45.0 (overall mean)

The second set of means is called marginal means. These means represent the means for all students in one group of one independent variable. Gender marginal means represent the mean of all 60 males (47.5) regardless of which method they
received, and likewise the mean of all 60 females (42.5). Method marginal means represent the mean of all 60 students who received Method A (52.5) regardless of gender, and the mean of 60 students who received Method B (37.5). Finally, the overall mean is the average score for all 120 students (45.0) regardless of gender or method.

The F Statistic

An F-test statistic determines whether each of the three null hypotheses in the two-factor ANOVA should be rejected or not rejected. The concept of the F statistic is similar to that of the t statistic for testing the significance of two group means. It is a ratio of two values. The numerator of the F ratio is the variance that can be attributed to the observed differences between the group means. The denominator is the amount of variance that is "left over," that is, the amount of variance due to differences among participants within groups (or error). Therefore, the F statistic is a ratio between two variances—variance attributable "between" groups and variance attributable "within" groups. Is the between-groups variance larger than the within-groups variance? The larger it is, the larger the F statistic. The larger the F statistic, the more likely it is that the null hypothesis will be rejected. The observed F (calculated from the data) is compared to a critical F at a certain set of degrees of freedom and significance level. If the observed F is larger than the critical F, then the null hypothesis is rejected.

Partitioning of Variance

Table 2 shows results from the two-factor ANOVA conducted on the math instruction study. An ANOVA summary table is produced by statistical software programs and is often presented in research reports. Each row identifies a portion of variation. Within rows, there are several elements: sum of squares (SS), degrees of freedom (df), mean square (MS), F statistic, and significance level (p).

Table 2  ANOVA Summary Table for the Mathematics Instruction Study

Source              SS        df    MS       F       p
Method              6157.9      1   6157.9   235.8   < .001
Gender               879.0      1    879.0    33.7   < .001
Method × Gender        1.9      1      1.9     0.1     .790
Within (error)      3029.3    116     26.1
Total              10068.2    119

The last row in the table represents the total variation in the data set. SS(total) is obtained by determining the deviation between each individual raw score and the overall mean, squaring the deviations, and obtaining the sum. The other rows partition this total variation into four components. Three rows represent between variation, and one represents error variation. The first row in Table 2 shows the between source of variation due to method. To obtain SS(method), each method mean is subtracted from the overall mean, the deviations are squared, multiplied by the group sample size, and then summed. The degrees of freedom for method is the number of groups minus 1. For the between source of variation due to gender, the sum of squares is found in a similar way by subtracting the gender group means from the overall mean. The third row is the between source of variation accounted for by the interaction between method and gender. The sum of squares is the overall mean minus the effects of method and gender plus the individual cell effect. The degrees of freedom are the product of the method and gender degrees of freedom. The fourth row represents the remaining unexplained variation not accounted for by the two main effects and the interaction effect. The sum of squares for this error variation is obtained by finding the deviation between each individual raw score and the mean of the subgroup to which it belongs, squaring that deviation, and then summing all deviations. Degrees of freedom for the between and within sources of variation add up to the df(total), which is the total number of individuals minus 1.

As mentioned earlier, mean square represents variance. The mean square is calculated in the same way as the variance for any set of data. Therefore, the mean square in each row of Table 2 is the ratio between SS and df. Next, in order to make the decision about rejecting or not rejecting each null hypothesis, the F ratio is calculated. Because the F statistic is the ratio of between to within variance, it is simply obtained by dividing the mean square for each between source by the mean square for error. Finally, the p values in Table 2 represent the significance level for each null hypothesis tested.
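To make the partitioning concrete, the following short Python sketch (an illustration added here, not part of the original entry) simulates 30 scores per cell with cell means matching Table 1 and an assumed within-cell standard deviation of 5, and then computes the sums of squares, mean squares, F ratios, and p values for a balanced two-factor design. Because the scores are simulated, the printed values only approximate Table 2; NumPy and SciPy are assumed to be available.

# Minimal sketch of the variance partitioning for a balanced 2 x 2 factorial ANOVA.
# Cell means are taken from Table 1; the within-cell SD of 5 is an assumption.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
n = 30  # students per cell
cell_true_means = {("A", "male"): 55.0, ("B", "male"): 40.0,
                   ("A", "female"): 50.0, ("B", "female"): 35.0}
data = {cell: rng.normal(mean, 5.0, n) for cell, mean in cell_true_means.items()}

methods, genders = ["A", "B"], ["male", "female"]
y_all = np.concatenate(list(data.values()))
grand = y_all.mean()
cell_mean = {cell: scores.mean() for cell, scores in data.items()}
method_mean = {m: np.concatenate([data[(m, g)] for g in genders]).mean() for m in methods}
gender_mean = {g: np.concatenate([data[(m, g)] for m in methods]).mean() for g in genders}

# Sums of squares for a balanced design
ss_method = n * len(genders) * sum((method_mean[m] - grand) ** 2 for m in methods)
ss_gender = n * len(methods) * sum((gender_mean[g] - grand) ** 2 for g in genders)
ss_inter = n * sum((cell_mean[(m, g)] - method_mean[m] - gender_mean[g] + grand) ** 2
                   for m in methods for g in genders)
ss_within = sum(((data[cell] - cell_mean[cell]) ** 2).sum() for cell in data)

df_method, df_gender = len(methods) - 1, len(genders) - 1
df_inter = df_method * df_gender
df_within = y_all.size - len(data)          # 120 - 4 = 116
ms_within = ss_within / df_within

for name, ss, df in [("Method", ss_method, df_method),
                     ("Gender", ss_gender, df_gender),
                     ("Method x Gender", ss_inter, df_inter)]:
    ms = ss / df
    F = ms / ms_within                      # F = MS(effect) / MS(error)
    p = f_dist.sf(F, df, df_within)
    print(f"{name:16s} SS = {ss:8.1f}  df = {df}  MS = {ms:8.1f}  F = {F:6.1f}  p = {p:.3f}")
print(f"{'Within (error)':16s} SS = {ss_within:8.1f}  df = {df_within}")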
Interpreting the Results

Answering the researcher's questions requires examining the F statistic to determine if it is large enough to reject the null hypothesis. A critical value for each test can be found in a table of critical F values by using the df(between) and df(within) and the alpha level set by the researcher. If the observed F is larger than the critical F, then the null hypothesis is rejected. Alternatively, software programs provide the observed p value for each test. The null hypothesis is rejected when the p value is less than the alpha level.

In this study, the null hypothesis for the main effect of method states that the mean math comprehension score for all students in Method A equals the mean score for all students in Method B, regardless of gender. Results indicate that the null hypothesis is rejected, F(1, 116) = 235.8, p < .001. There is a significant difference in math comprehension for students who received the two different types of math instruction. Students who experienced the problem-solving and reasoning approach had a higher mean score (52.5) than students with the traditional approach (37.5). The null hypothesis for the main effect of gender states that the mean score for all males equals the mean score for all females, regardless of the method they received. Results show that the null hypothesis is rejected, F(1, 116) = 33.7, p < .001. The mean score for all males (47.5) is higher than the mean score for all females (42.5), regardless of method. Finally, results show no significant interaction between method and gender, F(1, 116) = 0.1, p = .790, meaning that the difference in mean math scores for Method A versus Method B is the same for males and females. For both genders, the problem-solving approach produced a higher mean score than the traditional approach. Figure 1 shows the four cell means plotted on a graph. The lines are parallel, indicating no interaction between method and gender.

Figure 1  Plot of Nonsignificant Interaction of Method by Gender (mean math comprehension scores by gender: Method A, 55 for males and 50 for females; Method B, 40 for males and 35 for females)

A Further Look at Two-Factor Interactions

When the effect of one independent variable is constant across the levels of the other independent variable, there is no significant interaction. Pictorially, the lines on the graph are parallel. When the lines are significantly nonparallel, then there is an interaction between the two independent variables. Generally, there are two types of significant interactions: ordinal and disordinal. Figure 2 illustrates an ordinal interaction. Suppose that the math instruction study showed a significant interaction effect, indicating that the effect of method is not the same for the two genders. The cell means on the graph show nonparallel lines that do not intersect. Although the means for Method A are higher than those for Method B for both genders, there is a 15-point difference in the method means for males, but only a 5-point difference in the method means for females.

Figure 3 illustrates a disordinal interaction in which the lines intersect. In this scenario, Method A produced a higher mean score for males, and Method B produced a higher mean score for females. The magnitude of the difference in gender means for each method is the same, but the direction is different. Method A mean minus Method B mean for males was +15, whereas the difference between Method A and Method B means for females was -15.
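One informal way to see the difference among these patterns is to compare the "difference of differences" between cell means. The short Python sketch below is an illustration added here (it is not part of the original entry, and the cell means are approximate values read from Figures 1 through 3): it computes the Method A minus Method B difference for each gender and the contrast between the two differences. A contrast near zero corresponds to parallel lines (no interaction), same-signed but unequal differences to an ordinal interaction, and opposite-signed differences to a disordinal interaction.

# Difference-of-differences summary of a 2 x 2 interaction (illustrative cell means).
def interaction_contrast(cell_means):
    """cell_means[gender][method] -> per-gender Method A minus Method B, and their contrast."""
    d_males = cell_means["male"]["A"] - cell_means["male"]["B"]
    d_females = cell_means["female"]["A"] - cell_means["female"]["B"]
    return d_males, d_females, d_males - d_females

# No interaction (Figure 1): both differences are 15, so the contrast is 0
print(interaction_contrast({"male": {"A": 55, "B": 40}, "female": {"A": 50, "B": 35}}))
# Ordinal interaction (Figure 2): differences of 15 and 5, contrast of 10
print(interaction_contrast({"male": {"A": 55, "B": 40}, "female": {"A": 50, "B": 45}}))
# Disordinal interaction (Figure 3): differences of +15 and -15, contrast of 30
print(interaction_contrast({"male": {"A": 55, "B": 40}, "female": {"A": 35, "B": 50}}))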
Figure 2  Plot of Significant, Ordinal Interaction of Method by Gender (mean math comprehension scores by gender: Method A, 55 for males and 50 for females; Method B, 40 for males and 45 for females)

Figure 3  Plot of Significant, Disordinal Interaction of Method by Gender (mean math comprehension scores by gender: Method A, 55 for males and 35 for females; Method B, 40 for males and 50 for females)

Follow-Up Analyses for Significant Results

Significant main effects are often followed by post hoc tests to determine which pair or pairs of group means are significantly different. A wide variety of tests are available. They differ in their degree of adjustment for the compounding of Type I error (rejection of a true null hypothesis). Examples of a few post hoc tests are Fisher's LSD, Duncan, Newman–Keuls, Tukey, and Scheffé, in order from liberal (adjusts less, rejects more) to conservative (adjusts more, rejects less). Many other tests are available for circumstances when the group sample sizes are unequal or when group variances are unequal. For example, the Games–Howell post hoc test is appropriate when the homogeneity of variance assumption is violated. Non-pairwise tests of means, sometimes called contrast analysis, can also be conducted. For example, a researcher might be interested in comparing a control group mean to a mean that represents the average of all other experimental groups combined.

When a significant interaction occurs in a study, main effects are not usually interpreted. When the effect of one factor is not constant across the levels of another factor, it is difficult to make generalized statements about a main effect. There are two categories of follow-up analysis for significant interactions. One is called simple main effects. It involves examining the effect of one factor at only one level of another factor. Suppose there are five treatment conditions in an experimental factor and two categories of age. Comparing the means for younger participants versus older participants within each condition level is one example of a series of tests for a simple main effect.

Another category of follow-up analysis is called interaction comparisons. This series of tests examines whether the difference in means between two levels of Factor A is equal across two levels of Factor B. Using the example in the above paragraph, one test in the series would involve comparing the difference in Condition A means for younger and older participants to the difference in Condition B
means for younger and older participants. Additional tests would compare the mean age differences for Conditions A and B, Conditions A and C, and so on.

Effect Sizes

A final note about factorial designs concerns the practical significance of the results. Every research study involving factorial analysis of variance should include a measure of effect size. Tables are available for Cohen's f effect size. Descriptive labels have been attached to the f values (.1 is small, .25 is medium, and greater than .40 is large), although the magnitude of the effect size obtained in a study should be interpreted relative to other research in the particular field of study. Some widely available software programs report partial eta-squared values, but they overestimate actual effect sizes. Because of this positive bias, some researchers prefer to calculate omega-squared effect sizes.

Carol S. Parke

See also Dependent Variable; Effect Size, Measures of; Independent Variable; Main Effects; Post Hoc Analysis; Repeated Measures Design

Further Readings

Glass, G. V, & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Boston: Allyn & Bacon.
Green, S. B., & Salkind, N. J. (2005). Using SPSS for Windows: Analyzing and understanding data (4th ed.). Upper Saddle River, NJ: Prentice Hall. [see examples of conducting simple main effects and interaction comparisons for significant two-factor interactions]
Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Pacific Grove, CA: Wadsworth.
Stevens, J. (2007). Intermediate statistics: A modern approach (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Warner, R. M. (2008). Applied statistics: From bivariate through multivariate techniques. Thousand Oaks, CA: Sage.

FACTOR LOADINGS

Factor loadings are part of the outcome from factor analysis, which serves as a data reduction method designed to explain the correlations between observed variables using a smaller number of factors. Because factor analysis is a widely used method in social and behavioral research, an in-depth examination of factor loadings and the related factor-loading matrix will facilitate a better understanding and use of the technique.

Factor Analysis and Factor Loadings

Factor loadings are coefficients found in either a factor pattern matrix or a factor structure matrix. The former matrix consists of regression coefficients that multiply common factors to predict observed variables, also known as manifest variables, whereas the latter matrix is made up of product-moment correlation coefficients between common factors and observed variables.

The pattern matrix and the structure matrix are identical in orthogonal factor analysis, where common factors are uncorrelated. This entry primarily examines factor loadings in this modeling situation, which is most commonly seen in applied research. Therefore, the majority of the entry content is devoted to factor loadings, which are both regression coefficients in the pattern matrix and correlation coefficients in the structure matrix. Factor loadings in oblique factor analysis are briefly discussed at the end of the entry, where common factors are correlated and the two matrices differ.

In addition, factor analysis can be exploratory (EFA) or confirmatory (CFA). EFA does not assume any model a priori, whereas CFA is designed to confirm a theoretically established factor model. Factor loadings play similar roles in these two modeling situations. Therefore, in this entry on factor loadings, the term factor analysis refers to both EFA and CFA, unless stated otherwise.

Overview

Factor analysis, primarily EFA, assumes that common factors do exist that are indirectly measured by observed variables, and that each observed variable is a weighted sum of common factors plus a unique component. Common factors are latent and they influence one or more observed variables. The unique component represents all those independent things, both systematic and random, that are specific to a particular observed variable. In other words, a common factor is

variable. In other words, a common factor is loaded by at least one observed variable, whereas each unique component corresponds to one and only one observed variable. Factor loadings are correlation coefficients between observed variables and latent common factors.

Factor loadings can also be viewed as standardized regression coefficients, or regression weights. Because an observed variable is a linear combination of latent common factors plus a unique component, such a structure is analogous to a multiple linear regression model where each observed variable is a response and common factors are predictors. From this perspective, factor loadings are viewed as standardized regression coefficients when all observed variables and common factors are standardized to have unit variance. Stated differently, factor loadings can be thought of as an optimal set of regression weights that predicts an observed variable using latent common factors.

Factor loadings usually take the form of a matrix, and this matrix is a standard output of almost all statistical software packages when factor analysis is performed. The factor loading matrix is usually denoted by the capital Greek letter Λ, or lambda, whereas its matrix entries, or factor loadings, are denoted by λij, with i being the row number and j the column number. The number of rows of the matrix equals the number of observed variables, and the number of columns equals the number of common factors.

Factor Loadings in a Hypothetical Example

A typical example involving factor analysis is to use personality questionnaires to measure underlying psychological constructs. Item scores are observed data, and common factors correspond to latent personality attributes.

Suppose a psychologist is developing a theory that hypothesizes there are two personality attributes of interest, introversion and extroversion. To measure these two latent constructs, the psychologist develops a 5-item personality instrument and administers it to a randomly selected sample of 1,000 participants.

Thus, each participant is measured on five variables, and each variable can be modeled as a linear combination of the two latent factors plus a unique component. Stated differently, the score on an item for one participant, say, Participant A, consists of two parts. Part 1 is the average score on this item for all participants having identical levels of introversion and extroversion as Participant A, and this average item score is denoted by a constant times this participant's level of introversion plus a second constant times his or her level of extroversion. Part 2 is the unique component that indicates the amount of difference between the item score from Participant A and the said average item score. Obviously, such a two-part scenario is highly similar to a description of regression analysis.

In the above example, factor loadings for this item, or this observed variable, are nothing but the two constants that are used to multiply introversion and extroversion. There is a set of factor loadings for each item or each observed variable.

Factor Loadings in a Mathematical Form

The factor analysis model takes the following mathematical form:

x = Λf + ε,

where x is a p-variate vector of standardized, observed data; Λ is a p × m matrix of factor loadings; f is an m-variate vector of standardized, common factors; and ε is a p-variate vector of standardized, unique components.

Returning to the previous example, p = 5 and m = 2, so the factor loading matrix Λ is a 5 × 2 matrix consisting of correlation coefficients. The above factor analysis model can be written in another form for each item:

x1 = λ11 f1 + λ12 f2 + ε1,
x2 = λ21 f1 + λ22 f2 + ε2,
x3 = λ31 f1 + λ32 f2 + ε3,
x4 = λ41 f1 + λ42 f2 + ε4,
x5 = λ51 f1 + λ52 f2 + ε5.

Each of the five equations corresponds to one item. In other words, each observed variable is represented by a weighted linear combination of common factors plus a unique component. And for each item, two factor loading constants bridge observed data and common factors. These constants are standardized regression weights because observed data, common factors, and unique components are all standardized to have zero mean

and unit variance. For example, in determining standardized x1, f1 is given the weight λ11 and f2 is given the weight λ12, whereas in determining standardized x2, f1 is given the weight λ21 and f2 is given the weight λ22.

The factor loading matrix can be used to define an alternative form of factor analysis model. Suppose the observed correlation matrix and the factor model correlation matrix are RX and RF, respectively. The following alternative factor model can be defined using the factor loading matrix:

RF = ΛΛᵀ + ψ,

where ψ is a diagonal matrix of unique variances. Some factor analysis algorithms iteratively solve for Λ and ψ so that the difference between RX and RF is minimized.

Communality and Unique Variance

Based on factor loadings, communality and unique variance can be defined. These two concepts relate to each observed variable.

The communality for an observed variable refers to the amount of variance in that variable that is explained by common factors. If the communality value is high, at least one of the common factors has a substantial impact on the observed variable. The sum of squared factor loadings is the communality value for that observed variable, and most statisticians use hᵢ² to denote the communality value for the ith observed variable.

The unique variance for an observed variable is computed as 1 minus that variable's communality value. The unique variance represents the amount of variance in that variable that is not explained by common factors.

Issues Regarding Factor Loadings

Significance of Factor Loadings

There are usually three approaches to the determination of whether or not a factor loading is significant: cutoff value, t test, and confidence interval. However, it should be noted that the latter two are not commonly seen in applied research.

A factor loading that falls outside of the interval bounded by (± cutoff value) is considered to be large and is thus retained. On the other hand, a factor loading that does not meet the criterion indicates that the corresponding observed variable should not load on the corresponding common factor. The cutoff value is arbitrarily selected depending on the field of study, but (±0.4) seems to be preferred by many researchers.

A factor loading can also be t tested, and the null hypothesis for this test is that the loading is not significantly different from zero. The computed t statistic is compared with the threshold chosen for statistical significance. If the computed value is larger than the threshold, the null is rejected in favor of the alternative hypothesis, which states that the factor loading differs significantly from zero.

A confidence interval (CI) can be constructed for a factor loading, too. If the CI does not cover zero, the corresponding factor loading is significantly different from zero. If the CI does cover zero, no conclusion can be made regarding the significance status of the factor loading.

Rotated Factor Loadings

The need to rotate a factor solution relates to the factorial complexity of an observed variable, which refers to the number of common factors that have a significant loading for this variable. In applied research, it is desirable for an observed variable to load significantly on one and only one common factor, which is known as a simple structure. For example, a psychologist prefers to be able to place a questionnaire item into one and only one subscale.

When an observed variable loads on two or more factors, a factor rotation is usually performed to achieve a simple structure, which is a common practice in EFA. Of all rotation techniques, varimax is most commonly used. Applied researchers usually count on rotated factor loadings to interpret the meaning of each common factor.

Labeling Common Factors

In EFA, applied researchers usually use factor loadings to label common factors. EFA assumes that common factors exist, and efforts are made to

determine the number of common factors and the set of observed variables that load significantly on each factor. The underlying nature of each such set of observed variables is used to give the corresponding common factor a name. For example, if a set of questionnaire items loads highly on a factor and the items refer to different aspects of extroversion, the common factor should be named accordingly to reflect the common thread that binds all of those items. However, the process of labeling common factors is very subjective.

Factor Loadings in Oblique Factor Analysis

Oblique factor analysis is needed when common factors are correlated. Unlike the orthogonal case, the factor pattern matrix and the factor structure matrix differ in this modeling situation. An examination of factor loadings involves interpreting both matrices in a combined manner. The pattern matrix provides information regarding the group of observed variables used to measure each common factor, thus contributing to an interpretation of common factors, whereas the structure matrix presents product-moment correlation coefficients between observed variables and common factors.

Factor Loadings in Second-Order Factor Analysis

Factor-analyzing observed variables leads to a reduced number of first-order common factors, and the correlations between them can sometimes be explained further by an even smaller number of second-order common factors; such a second-order factor model is usually analyzed under the CFA context. For this type of factor model, factor loadings refer to all of those regression/correlation coefficients that not only connect observed variables with first-order factors but also bridge two different levels of common factors.
Hongwei Yang

See also Confirmatory Factor Analysis; Correlation; Exploratory Factor Analysis; Regression Coefficient; Structural Equation Modeling

Further Readings

Child, D. (1990). The essentials of factor analysis (2nd ed.). London: Cassell Educational Ltd.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman & Hall.
Garson, G. D. (2008, March). Factor analysis. Retrieved July 26, 2008, from http://www2.chass.ncsu.edu/garson/pa765/factor.htm
Harman, H. H. (1976). Modern factor analysis (3rd ed.). Chicago: University of Chicago Press.
Lawley, D. N. (1940). The estimation of factor loadings by the method of maximum likelihood. Proceedings of the Royal Society of Edinburgh, A-60, 64-82.
Suhr, D. D. (n.d.). Exploratory or confirmatory factor analysis. Retrieved July 26, 2008, from http://www2.sas.com/proceedings/sugi31/20031.pdf

FALSE POSITIVE

The term false positive is most commonly employed in diagnostic classification within the context of assessing test validity. The term represents a diagnostic decision in which an individual has been identified as having a specific condition (such as an illness) when, in fact, he or she does not have the condition. The term false positive is less commonly used within the context of hypothesis testing to represent a Type I error, which is defined as rejection of a true null hypothesis, thereby incorrectly concluding that the alternative hypothesis is supported. This entry focuses on the more common use of the term false positive within the context of diagnostic decision making. The disciplines that are most likely to be concerned with the occurrence of false positives are medicine, clinical psychology, educational and school psychology, forensic psychology (and the legal system), and industrial psychology. In each of the aforementioned disciplines, critical decisions are made about human beings based on the results of diagnostic tests or other means of assessment. The consequences associated with false positives can range from a person being found guilty of a murder he or she did not commit to a much less serious consequence, such as a qualified person being erroneously identified as unsuitable for a job, and thus not being offered the job.

Basic Definitions

Within the framework of developing measuring instruments that are capable of categorizing people and/or predicting behavior, researchers attempt to optimize correct categorizations (or predictions) and minimize incorrect categorizations (or predictions). Within the latter context, the following four diagnostic decisions are possible (the first two of which are correct and the latter two incorrect): true positive, true negative, false positive, false negative. The terms true and false in each of the aforementioned categories designate whether or not a diagnostic decision made with respect to an individual is, in fact, correct (as in the case of a true positive and a true negative) or incorrect (as in the case of a false positive and a false negative). The terms positive and negative in each of the aforementioned categories refer to whether or not the test result obtained for an individual indicates he or she has the condition in question. Thus, both a true positive and a false positive represent individuals who obtain a positive test result—the latter indicating that such individuals have the condition in question. On the other hand, both a true negative and a false negative represent individuals who obtain a negative test result—the latter indicating that such individuals do not have the condition in question.

In the discipline of medicine, the true positive rate for a diagnostic test is referred to as the sensitivity of the test—that is, the probability that a person will test positive for a disease, given the person actually has the disease. The true negative rate for a diagnostic test is referred to as the specificity of the test—that is, the probability that a person will test negative for a disease, given the person actually does not have the disease. As a general rule, in order for a diagnostic test to be a good instrument for detecting the presence of a disease, it should be high in both sensitivity and specificity. The proportion of true positives and false positives in a population is referred to as the selection ratio because it represents the proportion of the population that is identified as possessing the condition in question.

It was noted earlier that a false positive can also be employed within the context of hypothesis testing to represent a Type I error. Analogously, a false negative can be employed within the context of hypothesis testing to represent a Type II error—the latter being failure to reject a false null hypothesis and thereby concluding incorrectly that the alternative hypothesis is not supported. A true negative can be employed within the context of hypothesis testing to represent retention of a correct null hypothesis, whereas a true positive can be employed to represent rejection of a false null hypothesis. Recollect that within the context of hypothesis testing, a null hypothesis states that no experimental effect is present, whereas the alternative hypothesis states that an experimental effect is present. Thus, retention of a null hypothesis is analogous to a clinical situation in which it is concluded a person is normal, whereas rejection of the null hypothesis is analogous to reaching the conclusion that a person is not normal.

Illustrative Examples

Three commonly encountered examples involving testing will be used to illustrate the four diagnostic decisions. The first example comes from the field of medicine, where diagnostic tests are commonly employed in making decisions regarding patients. Thus, it will be assumed that the condition a diagnostic test is employed to identify is a physical or psychological illness. In such a case, a true positive is a person whom the test indicates has the illness and does, in fact, have the illness. A true negative is a person whom the test indicates does not have the illness and, in fact, does not have the illness. A false positive is a person whom the test indicates has the illness but, in fact, does not have the illness. A false negative is a person whom the test indicates does not have the illness but, in fact, has the illness.

The second example involves the use of the polygraph for the purpose of ascertaining whether a person is responding honestly. Although in most states polygraph evidence is not generally admissible in court, a person's performance on a polygraph can influence police and prosecutors with regard to their belief concerning the guilt or innocence of an individual. The condition that the polygraph is employed to identify is whether or not a person is responding honestly to what are considered to be relevant questions. In the case of a polygraph examination, a true positive is a person whom the polygraph identifies as dishonest and is, in fact, dishonest. A true negative is a person whom the

polygraph identifies as honest and is, in fact, honest. A false positive is a person whom the polygraph identifies as dishonest but is, in fact, honest. A false negative is a person whom the polygraph identifies as honest but is, in fact, dishonest.

The final example involves the use of an integrity test, which is commonly used in business and industry in assessing the suitability of a candidate for a job. Although some people believe that integrity tests are able to identify individuals who will steal from an employer, the more general consensus is that such tests are more likely to identify individuals who will not be conscientious employees. In the case of an integrity test, the condition that the test is employed to identify is the unsuitability of a job candidate. With regard to a person's performance on an integrity test, a true positive is a person whom the test identifies as an unsuitable employee and who, in fact, will be an unsuitable employee. A true negative is a person whom the test identifies as a suitable employee and who, in fact, will be a suitable employee. A false positive is a person whom the test identifies as an unsuitable employee but who, in fact, will be a suitable employee. A false negative is a person whom the test identifies as a suitable employee but who, in fact, will be an unsuitable employee.

Relative Seriousness of False Positive Versus False Negative

It is often the case that the determination of a cutoff score on a test (or criterion of performance on a polygraph) for deciding to which category a person will be assigned will be a function of the perceived seriousness of incorrectly categorizing a person as a false positive versus a false negative. Although it is not possible to state that one type of error will always be more serious than the other, a number of observations can be made regarding the seriousness of the two types of errors. The criterion for determining the seriousness of an error will always be a function of the consequences associated with the error. In medicine, physicians tend to view a false negative as a more serious error than a false positive, the latter being consistent with the philosophy that it is better to treat a nonexistent illness than to neglect to treat a potentially serious illness. Yet things are not always that clear-cut. As an example, although the consequence of failure to diagnose breast cancer (a false negative) could cost a woman her life, the consequences associated with a woman being a false positive could range from minimal (e.g., the woman is administered a relatively benign form of chemotherapy) to severe (e.g., the woman has an unnecessary mastectomy).

In contrast to medicine, the American legal system tends to view a false positive as a more serious error than a false negative. The latter is reflected in the use of the "beyond a reasonable doubt" standard in criminal courts, which reflects the belief that it is far more serious to find an innocent person guilty than to find a guilty person innocent. Once again, however, the consequences associated with the relative seriousness of both types of errors may vary considerably depending upon the nature of the crime involved. For example, one could argue that finding a serial killer innocent (a false negative) constitutes a far more serious error than wrongly convicting an innocent person of a minor felony (a false positive) that results in a suspended sentence.

The Low Base Rate Problem

The base rate of a behavior or medical condition is the frequency with which it occurs in a population. The low base rate problem occurs when a diagnostic test that is employed to identify a low base rate behavior or condition tends to yield a disproportionately large number of false positives. Thus, when a diagnostic test is employed in medicine to detect a rare disease, it may, in fact, identify virtually all of the people who are afflicted with the disease, but in the process erroneously identify a disproportionately large number of healthy people as having the disease; because of the latter, the majority of people labeled positive will, in fact, not have the disease.

The relevance of the low base rate problem to polygraph and integrity testing is that such instruments may correctly identify most guilty individuals and potentially unsuitable employees, yet, in the process, erroneously identify a large number of innocent people as guilty or, in the case of an integrity test, a large number of potentially suitable employees as unsuitable. Estimates of false positive rates associated with the polygraph and integrity tests vary substantially, but critics of the

latter instruments argue that error rates are unacceptably high. In the case of integrity tests, people who utilize them may concede that although such tests may yield a large number of false positives, at the same time they have a relatively low rate of false negatives. Because of the latter, companies that administer integrity tests cite empirical evidence that such tests are associated with a decrease in employee theft and an increase in productivity. In view of the latter, they consider the consequences associated with a false positive (not hiring a suitable person) to be far less damaging to the company than the consequences associated with a false negative (hiring an unsuitable person).

Use of Bayes's Theorem for Computing a False Positive Rate

In instances where the false positive rate for a test cannot be determined from empirical data, Bayes's theorem can be employed to estimate the latter. Bayes's theorem is a rule for computing conditional probabilities that was stated by an 18th-century English clergyman, the Reverend Thomas Bayes. A conditional probability is the probability of Event A given the fact that Event B has already occurred. Bayes's theorem assumes there are two sets of events. In the first set, there are n events to be identified as A1, A2, . . . , An, and in the second set, there are two events to be identified as B+ and B−. Bayes's theorem allows for the computation of the probability that Aj (where 1 ≤ j ≤ n) will occur, given it is known that B+ has occurred. As an example, the conditional probability P(A2/B+) represents the probability that Event A2 will occur, given the fact that Event B+ has already occurred.

An equation illustrating the application of Bayes's theorem is presented below. In the latter equation, it is assumed that Set 1 is comprised of the two events A1 and A2 and Set 2 is comprised of the two events B+ and B−. If it is assumed A1 represents a person who is, in fact, sick; A2 represents a person who is, in fact, healthy; B+ indicates a person who received a positive diagnostic test result for the illness in question; and B− indicates a person who received a negative diagnostic test result for the illness in question, then the conditional probability P(A2/B+) computed with Bayes's theorem represents the probability that a person will be healthy given that his or her diagnostic test result was positive.

P(A2/B+) = P(B+/A2)P(A2) / [P(B+/A1)P(A1) + P(B+/A2)P(A2)]

When the above noted conditional probability P(A2/B+) is multiplied by P(B+) (the proportion of individuals in the population who obtain a positive result on the diagnostic test), the resulting value represents the proportion of false positives in the population. In order to compute P(A2/B+), it is necessary to know the population base rates A1 and A2 as well as the conditional probabilities P(B+/A1) and P(B+/A2). Obviously, if one or more of the aforementioned probabilities is not known or cannot be estimated accurately, computing a false positive rate will be problematical.
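As a worked illustration, the following Python sketch is an editorial addition with hypothetical numbers (it is not part of the original entry). It applies the equation above to a rare condition and so also illustrates the low base rate problem discussed above: even with a fairly accurate test, most positive results are false positives.

# Hedged sketch; base rate, sensitivity, and false alarm rate are invented values.
p_sick = 0.01               # base rate P(A1): 1% of the population has the illness
p_healthy = 1 - p_sick      # P(A2)
sensitivity = 0.95          # P(B+/A1): probability of a positive result when sick
false_alarm = 0.05          # P(B+/A2): probability of a positive result when healthy

p_positive = sensitivity * p_sick + false_alarm * p_healthy   # P(B+)
p_healthy_given_pos = (false_alarm * p_healthy) / p_positive  # P(A2/B+), Bayes's theorem
prop_false_positives = p_healthy_given_pos * p_positive       # P(A2 and B+)

print(round(p_healthy_given_pos, 3))    # about 0.839: most positives are false positives
print(round(prop_false_positives, 4))   # about 0.0495 of the whole population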
David J. Sheskin

See also Bayes's Theorem; Sensitivity; Specificity; True Positive

Further Readings

Feinstein, A. R. (2002). Principles of medical statistics. Boca Raton, FL: Chapman & Hall/CRC.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Hoboken, NJ: Wiley-Interscience.
Meehl, P., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194-216.
Pagano, M., & Gauvreau, K. (2000). Principles of biostatistics (2nd ed.). Pacific Grove, CA: Duxbury.
Rosner, B. (2006). Fundamentals of biostatistics (6th ed.). Belmont, CA: Thomson-Brooks/Cole.
Sheskin, D. J. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). Boca Raton, FL: Chapman & Hall/CRC.
Wiggins, J. (1973). Personality and prediction: Principles of personality assessment. Menlo Park, CA: Addison-Wesley.

FALSIFIABILITY

The concept of falsifiability is central to distinguishing between systems of knowledge and

understanding, specifically between scientific theories of understanding the world and those considered nonscientific. The importance of the concept of falsifiability was developed most thoroughly by the philosopher Karl Popper in the treatise Conjectures and Refutations: The Growth of Scientific Knowledge. Specifically, falsifiability refers to the notion that a theory or statement can be found to be false; for instance, as the result of an empirical test.

Popper sought to distinguish between various means of understanding the world in an effort to determine what constitutes a scientific approach. Prior to his seminal work, merely the empirical nature of scientific investigation was accepted as the criterion that differentiated it from pseudo- or nonscientific research. Popper's observation that many types of research considered nonscientific were also based upon empirical techniques led to dissatisfaction with this conventional explanation. Consequently, several empirically based methods colloquially considered scientific were contrasted in an effort to determine what distinguished science from pseudoscience. Examples chosen by Popper to illustrate the diversity of empirical approaches included physics, astrology, Marxian theories of history, and metaphysical analyses. Each of these epistemic approaches represents a meaningful system of interpreting and understanding the world around us, and has been used earnestly throughout history with varying degrees of perceived validity and success.

Popper used the term line of demarcation to distinguish the characteristics of scientific from nonscientific (pseudoscientific) systems of understanding. What Popper reasoned differentiated the two categories of understanding is that the former could be falsified (or found to be not universally true), whereas the latter was either incapable of being falsified or had been used in such a way that renders falsification unlikely. According to Popper, this usage takes the form of seeking corroboratory evidence to verify the verisimilitude of a particular pseudoscientific theory. For example, with respect to astrology, proponents subjectively interpret events (data) in ways that corroborate their preconceived astrological theories and predictions, rather than attempting to find data that undermine the legitimacy of astrology as an epistemic enterprise.

Popper found similarity between astrologists and those who interpret and make predictions about historical events via Marxian analyses in that both have historically sought to verify rather than falsify their perspectives as a matter of practice. Where a lack of corroboration between reality and theory exists, proponents of both systems reinterpret their theoretical position so as to correspond with empirical observations, essentially undermining the extent to which the theoretical perspective can be falsified. The proponents of both pseudoscientific approaches tacitly accept the manifest truth of their epistemic orientations irrespective of the fact that apparent verisimilitude is contingent upon subjective interpretations of historical events.

Popper rejected the notion that scientific theories were those thought most universally true, given the notion that verifying theories in terms of their correspondence to the truth is a quixotic task requiring omniscience. According to Popper, one cannot predict the extent to which future findings could falsify a theory, and searching for verification of the truth of a given theory ignores this potentiality. Instead of locating the essence of science within a correspondence with truth, Popper found that the theories most scientific were those capable of being falsified. This renders all scientific theories tenable at best, in the sense that the most plausible scientific theories are merely those that have yet to be falsified.

Every empirical test of a theory is an attempt to falsify it, and there are degrees of testability with respect to theories as a whole. Focusing on falsification relocates power from the extent to which a theory corresponds with a given reality or set of circumstances to the extent to which it logically can be proven false given an infinite range of empirical possibilities. Contrarily, a hypothetical theory that is capable of perfectly and completely explaining a given phenomenon is inherently unscientific because it cannot be falsified logically. Where theories are reinterpreted to make them more compatible with potentially falsifying empirical information, it is done to the benefit of their correspondence with the data, but to the detriment of the original theory's claim to scientific status.

As an addendum, Popper rejected the notion that only tenable theories are most useful, because those that have been falsified may illuminate

constructive directions for subsequent research. Thus, the principle of falsificationism does not undermine the inherent meaning behind statements that fall short of achieving its standard of scientific status.

Some competing lines of demarcation in distinguishing scientific from pseudoscientific research include the verificationist and anarchistic epistemological perspectives. As previously noted, the simple standard imposed by verificationism states that a theory is considered scientific merely if it can be verified through the use of empirical evidence. A competing line involves Paul Feyerabend's anarchistic epistemological perspective, which holds that any and all statements and theories can be considered scientific because history shows that "whatever works" has been labeled scientific regardless of any additional distinguishing criteria.

Douglas J. Dallier

See also External Validity; Hypothesis; Internal Validity; Logic of Scientific Discovery, The; Research Design Principles; Test; Theory

Further Readings

Feyerabend, P. (1975). Against method: Outline of an anarchistic theory of knowledge. London: New Left Books.
Kuhn, T. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Lakatos, I., Feyerabend, P., & Motterlini, M. (1999). For and against method: Including Lakatos's lectures on scientific method and the Lakatos-Feyerabend correspondence. Chicago: University of Chicago Press.
Mace, C. A. (Ed.). (1957). Philosophy of science: A personal report: British philosophy in mid-century. London: Allen and Unwin.
Popper, K. (1962). Conjectures and refutations: The growth of scientific knowledge. New York: Basic Books.

FIELD STUDY

A field study refers to research that is undertaken in the real world, where the confines of a laboratory setting are abandoned in favor of a natural setting. This form of research generally prohibits the direct manipulation of the environment by the researcher. However, sometimes independent and dependent variables already exist within the social structure under study, and inferences can then be drawn about behaviors, social attitudes, values, and beliefs. It must be noted that a field study is separate from the concept of a field experiment. Overall, field studies belong to the category of nonexperimental designs, where the researcher uses what already exists in the environment. Alternatively, field experiments refer to the category of experimental designs, where the researcher follows the scientific process of formulating and testing hypotheses by invariably manipulating some aspect of the environment. It is important that prospective researchers understand the types, aims, and issues; the factors that need to be considered; and the advantages and concerns raised when conducting the field study type of research.

Field studies belong to the category of nonexperimental design. These studies include the case study—an in-depth observation of one organization, individual, or animal; naturalistic observation—observation of an environment without any attempt to interfere with variables; participant observer study—observation through the researcher's submergence into the group under study; and phenomenology—observation derived from the researcher's personal experiences. The two specific aims of field studies are exploratory research and hypothesis testing. Exploratory research seeks to examine what exists in order to have a better idea about the dynamics that operate within the natural setting. Here, the acquisition of knowledge is the main objective. With hypothesis testing, the field study seeks to determine whether the null hypothesis or the alternative hypothesis best predicts the relationship of variables in the specific context; assumptions can then be used to inform future research.

Real-Life Research and Applications

Field studies have often provided information and reference points that otherwise may not have been available to researchers. For example, the famous obedience laboratory experiment by Stanley Milgram was criticized on the grounds that persons in real-life situations would not unquestioningly carry out unusual requests by persons perceived to
be authority figures as they did in the laboratory experiment. Leonard Bickman then decided to test the obedience hypothesis using a real-life application. He found that his participants were indeed more willing to obey the stooge who was dressed as a guard than the one dressed as a sportsman or a milkman. Another example of field research usage is Robert Cialdini's investigation of how some professionals, such as con men, sales representatives, politicians, and the like, are able to gain compliance from others. In reality, he worked in such professions and observed the methods that these persons used to gain compliance from others. From his actual experiences, he was able to offer six principles that cover the compliance techniques used by others. Some field studies take place in the workplace to test attitudes and efficiency. Therefore, field studies can be conducted to examine a multitude of issues that include playground attitudes of children, gang behaviors, how people respond to disasters, efficiency of organization protocol, and even behavior of animals in their natural environment. Information derived from field studies results in correlational interpretations.

Strengths and Weaknesses

Field studies are employed in order to increase ecological and external validity. Because variables are not directly manipulated, the conclusions drawn are deemed to be true to life and generalizable. Also, such studies are conducted when there is absolutely no way of even creating mundane realism in the laboratory. For example, if there is a need to investigate looting behavior and the impact of persons on each other to propel this behavior, then a laboratory study cannot suffice for the investigation because of the complexity of the variables that may be involved. Field research is therefore necessary.

Although field studies are nonexperimental, this does not imply that such studies are not empirical. Scientific rigor is promoted by various means, including the methods of data collection used in the study. Data can be reliably obtained through direct observation, coding, note-taking, the use of interview questions—preferably structured—and audiovisual equipment to garner information. Even variables such as the independent variable, dependent variable, and other specific variables of interest that already operate in the natural setting may be identified and, to a lesser extent, controlled by the researcher because those variables would become the focus of the study. Overall, field studies tend to capture the essence of human behavior, particularly when the persons under observation are unaware that they are being observed, so that authentic behaviors are reflected without the influence of demand characteristics (reactivity) or social desirability answers. Furthermore, when observation is unobtrusive, the study's integrity is increased.

However, because field studies, by their very nature, do not control extraneous variables, it is exceedingly difficult to ascertain which factor or factors are more influential in any particular context. Bias can also be an issue if the researcher is testing a hypothesis. There is also the problem of replication. Any original field study sample will not be accurately reflective of any other replication of that sample. Furthermore, there is the issue of ethics. Many times, to avoid reactivity, researchers do not ask permission from their sample to observe them, and this may cause invasion-of-privacy issues even though such participants are in the public eye. For example, if research is being carried out about the types of kissing that take place in a park, even though the persons engaged in kissing are doing so in public, had they known that their actions were being videotaped, they may have strongly objected. Other problems associated with field studies include the fact that they can be quite time-consuming and expensive, especially if a number of researchers are required as well as audiovisual technology.

Indeira Persaud

See also Ecological Validity; Nonexperimental Design; Reactive Arrangements

Further Readings

Allen, M. J. (1995). Introduction to psychological research. Itasca, IL: F. E. Peacock.
Babbie, E. (2004). The practice of social research (10th ed.). Belmont, CA: Thomson Wadsworth.
Bickman, L. (1974). Clothes make the person. Psychology Today, 8(4), 48-51.
Cialdini, R. B. (2006). Influence: The psychology of persuasion. New York: HarperCollins.
Robson, C. (2003). Real world research. Oxford, UK: Blackwell.
Solso, R. L., Johnson, H. H., & Beal, M. K. (1998). Experimental psychology: A case approach. New York: Longman.

FILE DRAWER PROBLEM

The file drawer problem is the threat that the empirical literature is biased because nonsignificant research results are not disseminated. The consequence of this problem is that the results available provide a biased portrayal of what is actually found, so literature reviews (including meta-analyses) will conclude stronger effects than actually exist. The term arose from the image that these nonsignificant results are placed in researchers' file drawers, never to be seen by others. The file drawer problem also has several similar names, including publication or dissemination bias. Although all literature reviews are vulnerable to this problem, meta-analysis provides methods of detecting and correcting for this bias. This entry first discusses the sources of publication bias and then the detection and correction of such bias.

Sources

The first source of publication bias is that researchers may be less likely to submit null than significant results. This tendency may arise in several ways. Researchers engaging in "data snooping" (cursory data analyses to determine whether more complete pursuit is warranted) simply may not pursue investigation of null results. Even when complete analyses are conducted, researchers may be less motivated—due to expectations that the results will not be published, professional pride, or financial interest in finding supportive results—to submit results for publication.

The other source is that null results are less likely to be accepted for publication than are significant results. This tendency is partly due to reliance on decision making from a null hypothesis significance testing (versus effect size) framework; statistically significant results lead to conclusions, whereas null results are inconclusive. Reviewers who have a professional or financial interest in certain results may also be less accepting of and more critical toward null results than toward those that confirm their expectations.

Detection

Three methods are commonly used to evaluate whether publication bias exists within a literature review. Although one of these methods can be performed using vote-counting approaches to research synthesis, these approaches are typically conducted within a meta-analysis focusing on effect sizes.

The first method is to compare results of published versus unpublished studies, if the reviewer has obtained at least some of the unpublished studies. In a vote-counting approach, the reviewer can evaluate whether a higher proportion of published studies finds a significant effect than does the proportion of unpublished studies. In a meta-analysis, one performs moderator analyses that statistically compare whether effect sizes are greater in published versus unpublished studies. An absence of differences is evidence against a file drawer problem.

A second approach is through the visual examination of funnel plots, which are scatterplots of each study's effect size against its sample size. Greater variability of effect sizes is expected in smaller versus larger studies, given their greater sampling variability. Thus, funnel plots are expected to look like an isosceles triangle, with a symmetric distribution of effect sizes around the mean across all levels of sample size. However, small studies that happen to find small effects will not be able to conclude statistical significance and therefore may be less likely to be published. The resultant funnel plot will be asymmetric, with an absence of studies in the small sample size/small effect size corner of the triangle.

A third, related approach is to compute the correlation between effect sizes and sample sizes across studies. In the absence of publication bias, one expects no correlation; small and large studies should find similar effect sizes. However, if nonsignificant results are more likely relegated to the file drawer, then one would find that only the small studies finding large effects are published. This would result in a correlation between sample size and effect size (a negative correlation if the average
effect size is positive and a positive correlation if the average effect size is negative).

Correction

There are four common ways of correcting for the file drawer problem. The first is not actually a correction, but an attempt to demonstrate that the results of a meta-analysis are robust to this problem. This approach involves computing a failsafe number, which represents the number of studies with an average effect size of zero that could be added to a meta-analysis before the average effect becomes nonsignificant. If the number is large, one concludes that it is not realistic that so many excluded studies could exist so as to invalidate the conclusions, so the review is robust to the file drawer problem.

A second approach is to exclude underpowered studies from a literature review. The rationale for this suggestion is that if the review includes only studies of a sample size large enough to detect a predefined effect size, then nonsignificant results should not result in publication bias among this defined set of studies. This suggestion assumes that statistical nonsignificance is the primary source of unpublished research. This approach has the disadvantage of excluding a potentially large number of studies with smaller sample sizes, and therefore it might often be an inefficient solution.

A third way to correct for this problem is through trim-and-fill methods. Although several variants exist, the premise of these methods is a two-step process based on restoring symmetry to a funnel plot. First, one "trims" studies that are in the overrepresented corner of the triangle until a symmetric distribution is obtained; the mean effect size is then computed from this subset of studies. Second, one restores the trimmed studies and "fills" the missing portion of the funnel plot by imputing studies to create symmetry; the heterogeneity of effect sizes is then estimated from this filled set.

A final method of management is through a family of selection (weighted distribution) models. These approaches use a distribution of publication likelihood at various levels of statistical significance to weight the observed distribution of effect sizes for publication bias. These models are statistically complex, and the field has not reached agreement on best practices in their use. One challenge is that the user typically must specify a selection model, often with little information.

Noel A. Card

See also Effect Size, Measures of; Literature Review; Meta-Analysis

Further Readings

Begg, C. B. (1994). Publication bias. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 399-409). New York: Russell Sage Foundation.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Hoboken, NJ: Wiley.

FISHER'S LEAST SIGNIFICANT DIFFERENCE TEST

When an analysis of variance (ANOVA) gives a significant result, this indicates that at least one group differs from the other groups. Yet the omnibus test does not indicate which group differs. In order to analyze the pattern of difference between means, the ANOVA is often followed by specific comparisons, and the most commonly used involve comparing two means (the so-called pairwise comparisons).

The first pairwise comparison technique was developed by Ronald Fisher in 1935 and is called the least significant difference (LSD) test. This technique can be used only if the omnibus ANOVA F is significant. The main idea of the LSD is to compute the smallest significant difference (i.e., the LSD) between two means as if these means had been the only means to be compared (i.e., with a t test) and to declare significant any difference larger than the LSD.

Notations

The data to be analyzed comprise A groups, and a given group is denoted a. The number of observations of the ath group is denoted Sa. If all groups have the same size, the notation S is used. The
total number of observations is denoted N. The mean of Group a is denoted Ma+. From the ANOVA, the mean square of error (i.e., within group) is denoted MSS(A), and the mean square of effect (i.e., between group) is denoted MSA.

Least Significant Difference

The rationale behind the LSD technique comes from the observation that, when the null hypothesis is true, the value of the t statistic evaluating the difference between Groups a and a′ is equal to

t = (Ma+ − Ma′+) / √[MSS(A) (1/Sa + 1/Sa′)]   (1)

and follows a Student's t distribution with N − A degrees of freedom. The ratio t therefore would be declared significant at a given α level if the value of t is larger than the critical value for the α level obtained from the t distribution and denoted tν,α (where ν = N − A is the number of degrees of freedom of the error; this value can be obtained from a standard t table). Rewriting this ratio shows that a difference between the means of Groups a and a′ will be significant if

|Ma+ − Ma′+| > LSD = tν,α √[MSS(A) (1/Sa + 1/Sa′)]   (2)

When there is an equal number of observations per group, Equation 2 can be simplified as

LSD = tν,α √(2 MSS(A) / S)   (3)

In order to evaluate the difference between the means of Groups a and a′ (where a and a′ are the indices of the two groups under consideration), we take the absolute value of the difference between the means and compare it to the value of LSD. If

|Ma+ − Ma′+| ≥ LSD   (4)

then the comparison is declared significant at the chosen α level (usually .05 or .01). This procedure is then repeated for all A(A − 1)/2 comparisons.

Note that the LSD has more power compared with other post hoc comparison methods (e.g., the honestly significant difference test, or Tukey test) because the α level for each comparison is not corrected for multiple comparisons. And, because the LSD does not correct for multiple comparisons, it severely inflates Type I error (i.e., finding a difference when it does not actually exist). As a consequence, a revised version of the LSD test has been proposed by Anthony J. Hayter (and is known as the Fisher-Hayter procedure) in which the modified LSD (MLSD) is used instead of the LSD. The MLSD is computed using the Studentized range distribution q as

MLSD = qα,A−1 √(MSS(A) / S)   (5)

where qα,A−1 is the α-level critical value of the Studentized range distribution for a range of A − 1 and for ν = N − A degrees of freedom. The MLSD procedure is more conservative than the LSD, but more powerful than the Tukey approach because the critical value for the Tukey approach is obtained from a Studentized range distribution with a range equal to A. This difference in range makes Tukey's critical value always larger than the one used for the MLSD and therefore makes Tukey's approach more conservative.

Example

In a series of experiments on eyewitness testimony, Elizabeth Loftus wanted to show that the wording of a question influenced witnesses' reports. She showed participants a film of a car accident, then asked them a series of questions. Among the questions was one of five versions of a critical question asking about the speed the vehicles were traveling:

1. How fast were the cars going when they hit each other?
2. How fast were the cars going when they smashed into each other?
3. How fast were the cars going when they collided with each other?
4. How fast were the cars going when they bumped each other?
5. How fast were the cars going when they contacted each other?
Table 1   Results for a Fictitious Replication of Loftus and Palmer (1974) in Miles per Hour

Contact   Hit   Bump   Collide   Smash
21        23    35     44        39
20        30    35     40        44
26        34    52     33        51
46        51    29     45        47
35        20    54     45        50
13        38    32     30        45
41        34    30     46        39
30        44    42     34        51
42        41    50     49        39
26        35    21     44        55
M.+  30   35    38     41        46

Table 2   ANOVA Results for the Replication of Loftus and Palmer (1974)

Source        df    SS      MS    F      Pr(F)
Between: A     4    1,460   365   4.56   .0036
Error: S(A)   45    3,600    80
Total         49    5,060

The data from a fictitious replication of Loftus' experiment are shown in Table 1. We have A = 5 groups and S = 10 participants per group.

The ANOVA found an effect of the verb used on participants' responses. The ANOVA table is shown in Table 2.

Least Significant Difference

For an α level of .05, the LSD for these data is computed as

LSD = tν,.05 √(2 MSS(A)/S) = tν,.05 √(2 × 80.00/10) = 2.01 × √(160/10) = 2.01 × 4 = 8.04   (6)

Table 3   LSD: Differences Between Means and Significance of Pairwise Comparisons From the (Fictitious) Replication of Loftus and Palmer (1974)

                      Experimental Group
                      M1.+ = Contact   M2.+ = Hit   M3.+ = Bump   M4.+ = Collide   M5.+ = Smash
M1.+ = 30 Contact     0.00             5.00 ns      8.00 ns       11.00**          16.00**
M2.+ = 35 Hit                          0.00         3.00 ns        6.00 ns         11.00**
M3.+ = 38 Bump                                      0.00           3.00 ns          8.00 ns
M4.+ = 41 Collide                                                  0.00              5.00 ns
M5.+ = 46 Smash                                                                      0.00

Notes: Differences larger than 8.04 are significant at the α = .05 level and are indicated with *, and differences larger than 10.76 are significant at the α = .01 level and are indicated with **.

Table 4   MLSD: Differences Between Means and Significance of Pairwise Comparisons From the (Fictitious) Replication of Loftus and Palmer (1974)

                      Experimental Group
                      M1.+ = Contact   M2.+ = Hit   M3.+ = Bump   M4.+ = Collide   M5.+ = Smash
M1.+ = 30 Contact     0.00             5.00 ns      8.00 ns       11.00*           16.00**
M2.+ = 35 Hit                          0.00         3.00 ns        6.00 ns         11.00*
M3.+ = 38 Bump                                      0.00           3.00 ns          8.00 ns
M4.+ = 41 Collide                                                  0.00              5.00 ns
M5.+ = 46 Smash                                                                      0.00

Notes: Differences larger than 10.66 are significant at the α = .05 level and are indicated with *, and differences larger than 13.21 are significant at the α = .01 level and are indicated with **.
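The critical values used in Tables 3 and 4 can be checked numerically. The following Python sketch is an editorial illustration (not part of the original entry); it assumes SciPy 1.7 or later for the Studentized range distribution and reproduces the LSD of Equation 3 and the MLSD of Equation 5 for these data.

# Hedged sketch reproducing the thresholds for the Loftus and Palmer replication.
from scipy import stats

ms_error = 80.0   # MS_S(A) from Table 2
S = 10            # observations per group
A = 5             # number of groups
nu = 45           # error degrees of freedom, N - A

for alpha in (0.05, 0.01):
    t_crit = stats.t.ppf(1 - alpha / 2, nu)                     # two-sided t critical value
    lsd = t_crit * (2 * ms_error / S) ** 0.5                    # Equation 3
    q_crit = stats.studentized_range.ppf(1 - alpha, A - 1, nu)  # range of A - 1 (Fisher-Hayter)
    mlsd = q_crit * (ms_error / S) ** 0.5                       # Equation 5
    print(alpha, round(lsd, 2), round(mlsd, 2))
# Approximately: alpha = .05 gives LSD 8.04-8.06 and MLSD 10.66-10.67;
# alpha = .01 gives LSD 10.76 and MLSD 13.20-13.21.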
A similar computation will show that, for these data, the LSD for an α level of .01 is equal to LSD = 2.69 × 4 = 10.76.

For example, the difference between Mcontact+ and Mhit+ is declared nonsignificant because

|Mcontact+ − Mhit+| = |30 − 35| = 5 < 8.04.   (7)

The differences and significance of all pairwise comparisons are shown in Table 3.

Modified Least Significant Difference

For an α level of .05, the value of q.05,A−1 is equal to 3.77, and the MLSD for these data is computed as

MLSD = qα,A−1 √(MSS(A)/S) = 3.77 × √8 = 10.66   (8)

The value of q.01,A−1 = 4.67, and a similar computation will show that, for these data, the MLSD for an α level of .01 is equal to MLSD = 4.67 × √8 = 13.21.

For example, the difference between Mcontact+ and Mhit+ is declared nonsignificant because

|Mcontact+ − Mhit+| = |30 − 35| = 5 < 10.66.   (9)

The differences and significance of all pairwise comparisons are shown in Table 4.

Lynne J. Williams and Hervé Abdi

See also Analysis of Variance (ANOVA); Bonferroni Procedure; Honestly Significant Difference (HSD) Test; Multiple Comparison Tests; Newman-Keuls Test and Tukey Test; Pairwise Comparisons; Post Hoc Comparisons; Scheffé Test; Tukey's Honestly Significant Difference (HSD)

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Hayter, A. J. (1986). The maximum familywise error rate of Fisher's least significant difference test. Journal of the American Statistical Association, 81, 1001-1004.
Seaman, M. A., Levin, J. R., & Serlin, R. C. (1991). New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110, 577-586.

FIXED-EFFECTS MODELS

Fixed-effects models are a class of statistical models in which the levels (i.e., values) of independent variables are assumed to be fixed (i.e., constant), and only the dependent variable changes in response to the levels of independent variables. This class of models is fundamental to the general linear models that underpin fixed-effects regression analysis and fixed-effects analysis of variance, or ANOVA (fixed-effects ANOVA can be unified with fixed-effects regression analysis by using dummy variables to represent the levels of independent variables in a regression model; see the article by Andrew Gelman for more information); the generalized linear models, such as logistic regression for binary response variables and binomial counts; Poisson regression for Poisson (count) response variables; as well as the analysis of categorical data using such techniques as the Mantel-Haenszel or Peto odds ratio. A common thesis in assuming a fixed-effects model among these analyses is that under conditions of similar investigation methods, similar measurements, and similar experimental or observational units, the mean response among the levels of independent variables should be comparable. If there is any discrepancy, the difference is caused by the within-study variation among the effects at the fixed levels of independent variables. This entry discusses the application of fixed effects in designed experiments and observational studies, along with alternate applications.

Designed Experiments

Fixed-effects models are very popular in designed experiments. The principal idea behind using these models is that the levels of independent variables (treatments) are specifically chosen by the researcher, whose sole interest is the response of the dependent
Fixed-Effects Models 495

variable to the specific levels of independent vari- Fixed-effects model experiment designs contrast
ables that are employed in a study. If the study is to sharply with random effects model designs, in
be repeated, the same levels of independent variables which the levels of independent variables are ran-
would be used again. As such, the inference space of domly selected from a large population of all possi-
the study, or studies, is the specific set of levels of ble levels. Either this population of levels is infinite
independent variables. Results are valid only at the in size, or the size is sufficiently large and can prac-
levels that are explicitly studied, and no extrapola- tically be considered infinite. The levels of indepen-
tion is to be made to levels of independent variables dent variables are therefore believed to be random
that are not explicitly investigated in the study. variables. Those that are chosen for a specific exper-
In practice, researchers often arbitrarily and sys- iment are a random draw. If the experiment is
tematically choose some specific levels of indepen- repeated, these same levels are unlikely to be
dent variables to investigate their effects according reused. Hence, it is meaningless to compare the
to a hypothesis and/or some prior knowledge about means of the dependent variable at those specific
the relationship between the dependent and inde- levels in one particular experiment. Instead, the
pendent variables. These levels of independent vari- experiment seeks inference about the effect of
ables are either of interest to the researcher or the entire population of all possible levels, which
thought to be representative of the independent are much broader than the specific ones used in the
variables. During the experiment, these levels are experiment, whether or not they are explicitly stud-
maintained constant. Measurements taken at each ied. In doing so, a random-effects model analysis
fixed level of an independent variable or a combina- draws conclusions from both within- and between-
tion of independent variables therefore constitute variable variation. Compared to fixed-effects model
a known population of responses to that level (com- designs, the advantages of random-effects model
bination of levels). Analyses then draw information designs are a more efficient use of statistical infor-
from the mean variation of the study to make infer- mation, and results from one experiment can be
ence about the effect of those specific independent extrapolated to levels that are not explicitly used in
variables at the specified levels on the mean that experiment. A key disadvantage is that some
response of the dependent variable. A key advan- important levels of independent variables may be
tage of a fixed-effects model design is that impor- left out of an experiment, which could potentially
tant levels of an independent variable can be have an adverse effect on the generality of conclu-
purposefully investigated. As such, both human and sions if those omitted levels turn out to be critical.
financial resource utilization efficiency may be max- To illustrate the differences between a fixed-
imized. Examples of such purposeful investigations and a random-effects model analysis, consider
may be some specific dosages of a new medicine in a controlled, two-factor factorial design experi-
a laboratory test for efficacy, or some specific che- ment. Assume that Factor A has a levels; Factor B
mical compositions in metallurgical research on the has b levels; and n measurements are taken from
strength of alloy steel, or some particular wheat each combination of the levels of the two factors.
varieties in an agriculture study on yields. Table 1 illustrates the ANOVA table comparing
The simplest example of a fixed-effects model a fixed- and a random-effects model. Notice in
design for comparing the difference in population Table 1 that the initial steps for calculating the
means is the paired t test model in a paired com- mean squares are similar in both analyses. The dif-
parison design. This design is a variation of the ferences are the expected mean squares and the
more general randomized block design in that each construction of hypothesis tests. Suppose that the
experimental unit serves as a block. Two treat- conditions for normality, linearity, and equal vari-
ments are applied to each experimental unit, with ance are all met, and the hypothesis tests on both
the order varying randomly from one experiment the main and the interactive effects of the fixed-
unit to the next. The null hypothesis of the paired effects model ANOVA are simply concerned with
t test is μ1  μ2 ¼ 0. Because of no sampling var- the error variance, which is the expected mean
iability between treatments, the precision of esti- square of the experimental error. In comparison,
mates in this design is considerably improved as the hypothesis tests on the main effect of the
compared to a two-sample t test model. random-effects model ANOVA draw information
Table 1 Analysis of Variance Table for the Two-Factor Factorial Design Comparing a Fixed-With a Random-Effects Model , Where Factor A Has a Levels,
Factor B Has b Levels, and n Replicates Are Measured at Each A × B Level

Source of Sum of Degrees of Mean Fixed-Effects Model Random-Effects Model


Variance Squares Freedom Square Expected MS F0 Expected MS F0
P
a
bn A2i
SSA 2 i¼1 MSA MSA
Factor A SSA a1 MSA ¼ E(MSB) = σ þ F0 ¼ EðMSA Þ ¼ σ 2 þ nσ 2AB þ bnσ 2A F0 ¼
a1 a1 MSE MSAB
X
b
an B2j
SSB j¼1 MSB MSB
Factor B SSB b1 MSB ¼ E(MSB) ¼ σ 2 þ F0 ¼ EðMSB Þ ¼ σ 2 þ nσ 2AB þ anσ 2B F0 ¼
496

b1 b1 MSE MSAB


P
a P
b
n ðABÞ2ij
SAB i¼1 j¼1 MSAB MSAB
A×B SSAB (a  1)(b  1) MSAB ¼ S E(MSAB) ¼ σ 2 þ F0 ¼ EðMSA BÞ ¼ σ 2 þ nσ 2AB F0 ¼
ða  1Þðb  1Þ ða  1Þðb  1Þ MSE MSE

SSE
Error SSE ab(n  1) MSE ¼ E(MSE) ¼ σ 2 E(MSE) ¼ σ 2
abðn  1Þ

Total SST abn  1


Fixed-Effects Models 497

from both the experimental error variance and the Take as an example a hypothetical ecological
variance due to the interactive effect of the main study in 10 cities of a country on the association
experimental factors. In other words, hypothesis between lung cancer prevalence rates and average
tests in random-effects model ANOVA must be cigarette consumption per capita in populations 45
determined according to the expected mean years of age and older. Here, cigarette consump-
squares. Finding an appropriate error term for tion is the primary independent variable and lung
a test is not as straightforward in a random-effects cancer prevalence rate is the dependent variable.
model analysis as in a fixed-effects model analysis, Suppose that two surveys are done at two different
particularly when sophisticated designs such as the times and noticeable differences are observed in
split-plot design are used. In these designs, one both cigarette consumption and lung cancer preva-
often needs to consult an authoritative statistical lence rates at each survey time both across the 10
textbook, but not overly rely on commercial statis- cities (i.e., intercity variation) and within each of
tical software if he or she is not particularly famil- the 10 cities (i.e., intracity variation over time).
iar with the relevant analytical procedure. Both fixed and random regression models can be
used to analyze the data, depending on the
assumption that one makes.
Observational Studies
A fixed-effects regression model can be used if
In observational studies, researchers are often one makes assumptions such as no significant
unable to manipulate the levels of an indepen- changes in the demographic characteristics, in the
dent variable as they can frequently do in con- cigarette supply-and-demand relationship, in the
trolled experiments. Being the nature of an air pollution level and pollutant chemical composi-
observational study, there may be many influen- tion, or other covariates that might be inductive to
tial independent variables. Of them, some may lung cancer over time in each city, and if one fur-
be correlated with each other, whereas others are ther assumes that any unobservable variable that
independent. Some may be observable, but might simultaneously affect the lung cancer preva-
others are not. Of the unobservable variables, lence rate and the average per capita cigarette con-
researchers may have knowledge of some, but sumption does not change over time.
may be unaware of others. Unobservable vari-
ables are generally problematic and could com- yit ¼ β0 þ β1 xit þ αi þ εit ð1Þ
plicate data analyses. Those that are hidden
from the knowledge of the researchers are prob- where yit and xit are, respectively, the lung cancer
ably the worst offenders. They could potentially prevalence rate and the average per capita cigarette
lead to erroneous conclusions by obscuring the consumption in the ith city at time t; αi is a fixed
main results of a study. If a study takes repeated parameter for the ith city; and εit is the error term
measures (panel data), some of those variables for the ith city at time t. In this model, αi captures
may change values over the course of the study, the effects of all observed and unobserved time-
whereas others may not. All of these add com- invariant variables, such as demographic charac-
plexity to data analyses. teristics including age, gender, and ethnicity; socio-
If an observational study does have panel data, economic characteristics; air pollution; and other
the choice of statistical models depends on variables, which could vary from city to city but
whether or not the variables in question are corre- are constant within the ith city (this is why the
lated with the main independent variables. The above model is called a fixed-effects model). By
fixed-effects model is an effective tool if variables treating αi as fixed, the model focuses only on the
are correlated, whether they are measured or within-city variation while ignoring the between-
unmeasured. Otherwise, a random-effects model city variation.
should be employed. Size of observation units (i.e., The estimation of Equation 1 becomes ineffi-
number of students in an education study) or cient if many dummy variables are included in the
groupings of such units (i.e., number of schools) is model to accommodate a large number of observa-
generally not a good criterion for choosing one tional units (αi) in panel data, because this sacri-
particular statistical model over the other. fices many degrees of freedom. Furthermore,
498 Fixed-Effects Models

a large number of observational units coupled with information to the overall analysis with respect to
only a few time points may result in the intercepts those variables of concern. The second point is easy
of the model containing substantial random error, to understand if one treats a variable that does not
making them inconsistent. Not much, if any, infor- change values as a constant. A constant subtracting
mation could be gained from those noisy para- a constant is zero—that is, a zero effect of such vari-
meters. To circumvent these problems, one may ables on the dependent variable. In this regard,
convert the values of both the dependent and the fixed-effects models are mostly useful for studying
independent variables of each observational unit the effects of independent variables that show
into the difference from their respective mean for within-observational-unit variation.
that unit. The differences in the dependent variable If, on the other hand, there is reasonable doubt
are then regressed on the differences of the inde- regarding the assumptions made about a fixed-
pendent variables without an intercept term. The effects model, particularly if some independent
estimator then looks only at how changes in the variables are not correlated with the major inde-
independent variables cause the dependent variable pendent variable(s), a fixed-effects model will not
to vary around a mean within an observational be able to remove the bias caused by those vari-
unit. As such, the unit effects are removed from ables. For instance, in the above hypothetical study,
the model by differencing. a shortage in the cigarette supply caused a decrease
It is clear from the above discussions that the in its consumption in some cities, or a successful
key technique in using a fixed-effects model in promotion by cigarette makers or retailers per-
panel data is to allow each observational unit suaded more people to smoke in other cities. These
(‘‘city’’ in the earlier example) to serve as its own random changes from city to city make a fixed-
control so that the data are grouped. Conse- effects model unable to control effectively the
quently, a great strength of the fixed-effects model between-city variation in some of the independent
is that it simultaneously controls for both observ- variables. If this happens, a random-effects model
able and unobservable variables that are associated analysis would be more appropriate because it is
with each specific observational unit. The fixed- able to accommodate the variation by incorporat-
effect coefficients (αiI) absorb all of the across-unit ing in the model two sources of error. One source
influences, leaving only the within-unit effect for is specific to each individual observational unit,
the analysis. The result then simply shows how and the other source captures variation both within
much the dependent variable changes, on average, and between individual observational units.
in response to the variation in the independent
variables within the observational units; that is, in
Alternate Applications
the earlier example, how much, on average, the
lung cancer prevalence rate will go up or down in After discussing the application of fixed-effects
response to each unit change in the average ciga- model analyses in designed and observational
rette consumption per capita. research, it may also be helpful to mention the util-
Because fixed-effects regression model analyses ity of fixed-effects models in meta-analysis (a study
depend on each observational unit serving as its on studies). This is a popular technique widely
own control, key requirements in applying them in used for summarizing knowledge from individual
research are as follows: (a) There must be two or studies in social sciences, health research, and
more measurements on the same dependent variable other scientific areas that rely mostly on observa-
in an observational unit; otherwise, the unit effect tional studies to gather evidence. Meta-analysis is
cannot be properly controlled; and (b) independent needed because both the magnitude and the direc-
variables of interest must change values on at least tion of the effect size could vary considerably
two of the measurement occasions in some of the among observational studies that address the same
observational units. In other words, the effect of question. Public policies, health practices, or pro-
any independent variable that does not have much ducts developed based on the result of each indi-
within-unit variation cannot be estimated. Observa- vidual study therefore may not be able to achieve
tional units with values of little within-unit variation their desired effects as designed or as believed.
in some independent variables contribute less Through meta-analysis, individual studies are
Fixed-Effects Models 499

brought together and appraised systematically. in the task of choosing the right model for specific
Common knowledge is then explicitly generated to research. In ANOVA, after a model is chosen,
guide public policies, health practices, or product there is no easy way to identify the correct vari-
developments. ance components for computation of standard
In meta-analysis, each individual study is treated errors and for hypothesis tests (see Table 1, for
as a single analysis unit and plugged into a suitable example). This leads Gelman to advocate abolish-
statistical model according to some assumptions. ing the terminology of fixed- and random-effects
Fixed-effects models have long been used in meta- models. Instead, a unified approach is taken within
analysis, with the following assumptions: (a) Indi- a hierarchical (multilevel) model framework,
vidual studies are merely a sample of the same regardless of whether one is interested in the
population, and the true effect for each of them is effects of specific treatments used in a particular
therefore the same; and (b) there is no heterogene- experiment (fixed-effects model analyses in a tradi-
ity among study results. Under these assumptions, tional sense) or in the effects of the underlying
only the sampling error (i.e., the within-study vari- population of treatments (random-effects model
ation) is responsible for the differences (as reflected analyses otherwise). In meta-analysis, Bayesian
in the confidence interval) in the observed effect model averaging is another alternative to fixed-
among studies. The between-study variation in the and random-effects model analyses.
estimated effects has no consequence on the confi-
dence interval in a fixed-effects model analysis.
Final Thoughts
These assumptions may not be realistic in many
instances and are frequently hotly debated. An Fixed-effects models concern mostly the response
important difficulty in applying a fixed-effects of dependent variables at the fixed levels of
model in meta-analysis is that each individual independent variables in a designed experiment.
study is conducted on different study units (individ- Results thus obtained generally are not extrapo-
ual persons, for instance) under a different set of lated to other levels that are not explicitly investi-
conditions by different researchers. Any (or all) of gated in the experiment. In observational studies
these differences could potentially introduce its (or with repeated measures, fixed-effects models are
their) effects into the studies to cause variation in used principally for controlling the effects of
their results. Therefore, one needs to consider not unmeasured variables if these variables are corre-
only within- but also between-study variation in lated with the independent variables of primary
a model in order to generalize knowledge properly interest. If this assumption does not hold, a fixed-
across studies. Because the objective of meta-analy- effects model cannot adequately control for inter-
ses is to seek validity generalization, and because unit variation in some of the independent vari-
heterogeneity tests are not always sufficiently sensi- ables. A random-effects model would be more
tive, a random-effects model is thus believed to be appropriate.
more appropriate than a fixed-effects model.
Unless there is truly no heterogeneity confirmed Shihe Fan
through proper investigations, fixed-effects model See also Analysis of Variance (ANOVA); Bivariate
analyses tend to overestimate the true effect by Regression; Random-Effects Models
producing a smaller confidence interval. On the
other hand, critics argue that random-effects mod- Further Readings
els make assumptions about distributions, which
may or may not be realistic or justified. They give Gelman, A. (2005). Analysis of variance: Why it is more
more weight to small studies and are more sensitive important than ever. Annals of Statistics, 33, 153.
Hocking, R. R. (2003). Methods and application of linear
to publication bias. Readers interested in meta-
models: Regression and the analysis of variance
analysis should consult relevant literature before (2nd ed.). Hoboken, NJ: Wiley.
embarking on a meta-analysis mission. Kuehl, R. O. (1994). Statistical principles of research
The arguments for and against fixed- and design and analysis. Belmont, CA: Duxbury.
random-effects models seem so strong, at least on Montgomery, D. C. (2001). Design and analysis of
the surface, that a practitioner may be bewildered experiments (5th ed.). Toronto: Wiley.
500 Focus Group

Nelder, J. A., & Weddenbum, R. W. M. (1972). History


Generalized linear models. Journal of the Royal
Statistical Society: Series A, 135, 370384. The history of focus groups dates back to the
Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A., 1930s, when Emory S. Bogardus, a scholar,
& Song, F. (2000). Methods for meta-analysis in wrote about group interviews and their
medical research. Toronto, Ontario, Canada: Wiley. usefulness to researchers. During World War II,
focus groups were conducted to determine the
usefulness of the military’s training materials
and the success of war propaganda in the war
effort. Following World War II, focus groups
FOCUS GROUP were used primarily to obtain responses and
gather opinions about films, written materials,
A focus group is a form of qualitative research con- and radio broadcasts. Beginning in the 1980s,
ducted in a group interview format. The focus focus groups were used in a wide variety of
group typically consists of a group of participants research settings, thus expanding their initial
and a researcher who serves as the moderator for role as mere gauges for government and market-
discussions among the group members. In focus ing research.
groups, there is not always the usual exchange of
questions and answers between the researcher and
the group that one would commonly envision in an Format
interview setting. Rather, the researcher often Although focus groups have enormous versatility
ensures that specific topics of research interest are and diversity in how they operate, a step-by-step
discussed by the entire group in hopes of extracting format for conducting focus groups has emerged
data and self-disclosure that might otherwise be in recent years. The first step is to determine the
withheld in the traditional researcher-interviewee goals of the study. Although not highly specific
environment. In this entry, the purpose, history, at this point, it is common for the researcher to
format, advantages and disadvantages, and future write a general purpose statement that lays the
direction of focus groups are discussed. foundation for the research project. The second
step is to determine who will serve as the moder-
ator of the focus group. Selection of a moderator
is of utmost importance to the success of the
Purpose
focus group, as the moderator promotes interac-
The purpose of focus groups varies depending on tions among group members and prevents the
the topic under investigation. In some studies, group from digressing from the topic of interest.
a focus group serves as the primary means of col- The next step involves refinement of research
lecting data via a strictly qualitative approach. In goals. Lists of information to obtain during the
other studies, a group discussion or focus group is focus group interviews are created, and these
used as a preliminary step before proceeding to lists serve as the basis for formulating questions
quantitative data collection, usually known as and probes to be used later in the focus group
a mixed-method approach. In still other studies, interviews. Following this step, participants are
a focus group is employed in conjunction with recruited for the focus group, preferably through
individual interviews, participant observation, and intentional sampling, with the goal of obtaining
other qualitative forms of data collection. Thus, a group of individuals that is most apt to provide
focus groups are used to complement a mixed- the researcher with the needed information.
method study or they have a self-contained func- After the participants have been selected, the
tion, given their ability to function independently number of focus group sessions is determined.
or be combined with other qualitative or quantita- The number of focus group sessions will vary
tive approaches. As a result of their versatility, and will depend on the number of participants
focus groups serve the needs of many researchers needed to make the focus group interviews suc-
in the social sciences. cessful. The next step is to locate a focus group
Follow-Up 501

site where the interviews will be conducted. Future Directions


However, there are no specific parameters for
how this step is to be accomplished. The seventh Focus groups are held in a variety of settings, such
step involves the development of an interview as a formal setting where psychologists do research
guide. The interview guide includes the research or a less structured environment where a book
objectives and ensuing questions that have been club meets to discuss and react to a novel. For dec-
developed and refined from earlier steps in the ades, focus groups have served governments,
process. The questions are constructed with the researchers, businesses, religious groups, and many
intent to facilitate smooth transitions from one other areas of society. Thus, focus groups are likely
topic to another. The culminating step is to con- to continue to be used in the future when indivi-
duct the focus group interview. The moderator duals and researchers are in need of qualitative
should be well prepared for the focus group ses- data.
sion. Preparation includes having the necessary Matthew J. Grumbein and
documents, arranging the room and chairs, and Patricia A. Lowe
arriving early to the site to test any media to be
used during the focus group interview. After the See also Qualitative Research
group interview is conducted and recorded, it is
most frequently transcribed, coded, and ana-
lyzed. The interview is transcribed so the Further Readings
researcher is able to adequately interpret the Kormanski, C. (1999). The team: Explorations in group
information obtained from the interview. process. Denver, CO: Love.
Moore, C. M. (1987). Group techniques for idea
building. Newbury Park, CA: Sage.
Morgan, D. L. (1997). Qualitative research methods
Advantages and Disadvantages (2nd ed.). Thousand Oaks, CA: Sage.
Focus groups are created to address the research- Vaughn, S., Schumm, J. S., & Sinagub, J. (1996). Focus
er’s needs, and there are certain advantages and group interviews in education psychology. Thousand
Oaks, CA: Sage.
disadvantages inherent in their use. One advantage
associated with the use of focus groups is that they
are time-efficient in comparison to the traditional
one-on-one interviews. The ability to collect infor-
mation from multiple individuals at one time FOLLOW-UP
instead of interviewing one person at a time is an
attractive option to many researchers. However, Follow-up procedures are an important compo-
there are some disadvantages associated with the nent of all research. They are most often con-
group format. Participants might find the time to ducted during the actual research but can also be
travel to the focus group facility site to be burden- conducted afterward. Follow-up is generally
some, and the moderator might find scheduling done to increase the overall effectiveness of the
a time for the focus group to meet to be extremely research effort. It can be conducted for a number
challenging. of reasons, namely, to further an end in a particu-
Another advantage of the use of focus groups lar study, review new developments, fulfill
lies in their dependence on group interactions. a research promise, comply with institutional
With multiple individuals in one setting, discus- review board protocol for research exceeding
sions and the infusion of diverse perspectives dur- a year, ensure that targeted project milestones
ing those discussions are possible. However, are being met, thank participants or informants
discussions of any focus group are dependent upon for their time, debrief stakeholders, and so on.
group dynamics, and the data gleaned from those Follow-up may also be conducted as a normal
discussions might not be as useful if group mem- component of the research design. Or, it could
bers are not forthcoming or an uncomfortable even be conducted subsequent to the original
mood becomes evident and dominates a session. research to ascertain if an intervention has
502 Follow-Up

changed the lives of the study participants. administered to ascertain clarity of the reworded
Regardless of its purpose, follow-up always has questions. Likewise, a supervisor may discover
cost implications. that one or more telephone interviewers are not
administering their telephone surveys according to
protocol. This would require that some follow-up
Typical Follow-Up Activities
training be conducted for those interviewers.
Participants
In the conduct of survey research, interviewers Project Milestones
often have to make multiple attempts to schedule
face-to-face and telephone interviews. When Research activities require careful monitoring
face-to-face interviews are being administered, and follow-up to ensure that things are progressing
appointments generally need to be scheduled in smoothly. Major deviations from project milestones
advance. However, participants’ schedules may generally require quick follow-up action to get the
make this simple task difficult. In some cases, mul- activity back on schedule to avoid schedule slippage
tiple telephone calls and/or letters may be required and cost overruns.
in order to set up a single interview. In other cases
(e.g., national census), follow-up may be required
because participants were either not at home or Incentives
were busy at the time of the interviewer’s visits. In research, incentives are often offered to
Likewise, in the case of telephone interviews, inter- encourage participation. Researchers and research
viewers may need to call potential participants sev- organizations therefore need to follow up on their
eral times before they are actually successful in promises and mail the promised incentive to all
getting participants on the phone. persons who participated in the research.
With mail surveys, properly timed (i.e., prede-
fined follow-up dates—usually every 2 weeks)
follow-up reminders are an effective strategy to Thank-You Letters
improve overall response rates. Without such
reminders, mail response rates are likely to be less Information for research is collected using
than 50%. Follow-up reminders generally take a number of techniques (e.g., focus groups, infor-
one of two forms: a letter or postcard reminding mants, face-to-face interviews). Follow-up thank-
potential participants about the survey and encour- you letters should be a normal part of good
aging them to participate, or a new survey package research protocol to thank individuals for their
(i.e., a copy of the survey, return envelope, and time and contributions.
a reminder letter). The latter technique generally
proves to be more effective because many potential
Stakeholder Debriefing
participants either discard mail surveys as soon as
they are received or are likely to misplace the Following the completion of the research, one
survey if it is not completed soon after receipt. or more follow-up meetings may be held with sta-
keholders to discuss the research findings, as well
as any follow-up studies that may be required.
Review New Developments
During a particular research study, any number
Compliance With Institutional Review Boards
of new developments can occur that would require
follow-up action to correct. For example, a pilot The U.S. Department of Health and Human
study may reveal that certain questions were Services (Office of Human Research Protections)
worded in such an ambiguous manner that most Regulation 45 CFR 46.109(e) requires that institu-
participants skipped the questions. To correct this tional review boards conduct follow-up reviews at
problem, the questions would need to be reworded least annually on a number of specific issues when
and a follow-up pilot study would need to be research studies exceed one year.
Frequency Distribution 503

Follow-Up Studies See also Debriefing; Interviewing; Recruitment; Survey

Follow-up studies may be a component of a par-


ticular research design. For example, time series Further Readings
designs include a number of pretests and posttests
Babbie, E. (2004). The practice of social research
using the same group of participants at different (10th ed.). Belmont, CA: Wadsworth/Thompson
intervals. If the purpose of the posttest is to ascer- Learning.
tain the strength of a particular treatment over an Office of Human Research Protection. (2007, January).
extended period, the posttest is referred to as fol- Guidance on continuing review [Online]. Retrieved
low-up. Follow-up studies may also be conducted April 19, 2009, from http://www.hhs.gov/ohrp/
when cost and time are constraining factors that humansubjects/guidance/contrev0107.htm
make longitudinal studies unfeasible. For example,
a follow-up study on the same participants can be
held at the end of a 20-year period, rather than at
the end of every 5-year period. FREQUENCY DISTRIBUTION
Success of Intervention A frequency distribution shows all the possible
scores a variable has taken in a particular set of
In some types of research, follow-up may be data, together with the frequency of occurrence of
conducted subsequent to the original research to each score in the respective set. This means that
ascertain if an intervention has changed the lives a frequency distribution describes how many times
of the study participants and to ascertain the a score occurs in the data set.
impact of the change. Frequency distributions are one of the most
common methods of displaying the pattern of
Cost Implications observations for a given variable. They offer the
possibility of viewing each score and its corre-
All follow-up activities have associated costs that sponding frequency in an organized manner
should be estimated and included in project budgets. within the full range of observed scores. Along
The extent and type of follow-up will, to a large with providing a sense of the most likely
extent, determine the exact cost implications. For observed score, they also show, for each score,
example, from a cost-benefit standpoint, mail survey how common or uncommon it is within the ana-
follow-up should be limited to three or four repeat lyzed data set.
mailings. In addition, different costs will be incurred Both discrete and continuous variables can be
depending on the follow-up procedure used. For described using frequency distributions. Frequency
example, a follow-up letter or postcard will cost distributions of a particular variable may be dis-
a lot less (less paper, less weight, less postage) than played using stem-and-leaf plots, frequency tables,
if the entire survey package is reposted. When mail and frequency graphs (typically bar charts or histo-
surveys are totally anonymous, this will also grams, and polygons). This entry discusses each of
increase costs because the repeat mailings will have these types of displays, along with its shape and
to be sent to all participants. Using certified mail modality, and the advantages and drawbacks of
generally improves response rates, but again at addi- using frequency distributions.
tional cost. When thank-you letters are being sent,
they should be printed on official letterhead and
carry an official signature if possible. However, the Stem-and-Leaf Plots
use of official letterheads and signatures is more
costly compared to a computer printout with an Stem-and-leaf plots were developed by John Tukey
automated signature. In the final analysis, research- in the 1970s. To create a stem-and-leaf plot for
ers need to carefully balance costs versus benefits a set of data, the raw data first must be arranged
when making follow-up decisions. in an array (in ascending or descending order).
Then, each number must be separated into a stem
Nadini Persaud and a leaf. The stem consists of the first digit or
504 Frequency Distribution

digits, and the leaf consists of the last digit. observations corresponding to each score or fall-
Whereas the stem can have any number of digits, ing within each class is counted. Table 2 presents
the leaf will always have only one. Table 1 shows a frequency distribution table for the age of the
a stem-and-leaf plot of the ages of the participants participants at the city hall meeting from the ear-
at a city hall meeting. lier example.
The plot shows that 20 people have participated Apart from a list of the scores or classes and
at the city hall meeting, five in their 30s, none in their corresponding frequencies, frequency tables
his or her 40s, eight in their 50s, five in their 60s, may also contain relative frequencies or propor-
and two in their 70s. tions (obtained by dividing the simple frequencies
Stem-and-leaf plots have the advantage of by the number of cases) and percentage frequen-
being easily constructed from the raw data. cies (obtained by multiplying the relative frequen-
Whereas the construction of cumulative fre- cies by 100).
quency distributions and histograms often Frequency tables may also include cumulative
requires the use of computers, stem-and-leaf frequencies, proportions, or percentages. Cumula-
plots are a simple paper-and-pencil method for tive frequencies are obtained by adding the fre-
analyzing data sets. Moreover, no information is quency of each observation to the sum of the
lost in the process of building up stem-and-leaf frequencies of all previous observations. Cumula-
plots, as is the case in, for example, grouped fre- tive proportions and cumulative percentages are
quency distributions. calculated similarly; the only difference is that,
instead of simple frequencies, cumulative frequen-
cies are divided by the total number of cases for
Frequency Tables obtaining cumulative proportions.
Frequency tables look similar for nominal or
A table that shows the distribution of the fre- categorical variables, except the first column con-
quency of occurrence of the scores a variable may tains categories instead of scores or classes. In
take in a data set is called a frequency table. Fre- some frequency tables, the missing scores for nom-
quency tables are generally univariate, because it is inal variables are not counted, and thus, propor-
more difficult to build up multivariate tables. They tions and percentages are computed based on the
can be drawn for both ungrouped and grouped number of nonmissing scores. In other frequency
scores. Frequency tables with ungrouped scores tables, the missing scores may be included as a cat-
are typically used for discrete variables and when egory so that proportions and percentages can be
the number of different scores the variable may computed based on the full sample size of non-
take is relatively low. When the variable to be ana- missing and missing scores. Either approach has
lyzed is continuous and/or the number of scores it analytical value, but authors must be clear about
may take is high, the scores are usually grouped which base number is used in calculating any pro-
into classes. portions or percentages.
Two steps must be followed to build a fre-
quency table out of a set of data. First, the scores
or classes are arranged in an array (in an ascend- Table 2 Frequency Table of the Age of the
ing or descending order). Then, the number of Participants at a City Hall Meeting
Relative Percentage
Table 1 Stem-and-Leaf Plot of the Ages of the Frequency Frequency
Participants at a City Hall Meeting Age y Frequency f rf ¼ f =n p ¼ 100  rf
Stem Leaf 3039 5 0.25 25.00
3 34457 4049 0 0.00 00.00
4 5059 8 0.40 40.00
5 23568889 6069 5 0.25 25.00
6 23778 7079 2 0.10 10.00
7 14 n ¼ 20 1.00 100.00%
Frequency Distribution 505

Frequency Graphs
Frequency graphs can take the form of bar charts
and histograms, or polygons.

Bar Charts and Histograms


8
A frequency distribution is often displayed

Frequency
graphically through the use of a bar chart or a his- 5
togram. Bar charts are used for categorical vari-
2
ables, whereas histograms are used for scalable
variables. Bar charts resemble histograms in that 0
bar heights correspond to frequencies, proportions, 35 45 55 65 75
or percentages. Unlike the bars in a histogram, Age Groups
bars in bar charts are separated by spaces, thus
indicating that the categories are in arbitrary order
and that the variable is categorical. In contrast, Figure 1 Histogram of the Age of the Participants
at a City Hall Meeting
spaces in a histogram signify zero scores.
Both bar charts and histograms are represented
in an upper-right quadrant delimited by a horizon- midpoint of the upper base of each of the histo-
tal x-axis and a vertical y-axis. The vertical axis gram’s end columns to the midpoints of the adja-
typically begins with zero at the intersection of the cent intervals (on the x-axis). Figure 2 presents
two axes; the horizontal scale need not begin with a frequency polygon based on the histogram in
zero, if this leads to a better graphic representa- Figure 1.
tion. Scores are represented on the horizontal axis, Histograms and frequency polygons may also
and frequencies, proportions, or percentages are be constructed for relative frequencies and percen-
represented on the vertical axis. When working tages in a similar way. The advantage of using
with classes, either limits or midpoints of class graphs of relative frequencies is that they can be
interval are measured on the x-axis. Each bar is used to directly compare samples of different sizes.
centered on the midpoint of its corresponding class Frequency polygons are especially useful for
interval; its vertical sides are drawn at the real lim- graphically depicting cumulative distributions.
its of the respective interval. The base of each bar
represents the width of the class interval.
Distribution Shape and Modality
A frequency distribution is graphically dis-
played on the basis of the frequency table that Frequency distributions can be described by their
summarizes the sample data. The frequency distri- skewness, kurtosis, and modality.
bution in Table 2 is graphically displayed in the
histogram depicted in Figure 1.
Skewness
Frequency Polygons
Frequency distributions may be symmetrical or
Frequency polygons are drawn by joining the skewed. Symmetrical distributions imply equal
points formed by the midpoint of each class inter- proportions of cases at any given distance above
val and the frequency corresponding to that class and below the midpoint on the score range scale.
interval. However, it may be easier to derive the Consequently, each half of a symmetrical distribu-
frequency polygon from the histogram. In this tion looks like a mirror image of the other half.
case, the frequency polygon is drawn by joining Symmetrical distributions may be uniform (rectan-
the midpoints of the upper bases of adjacent bars gular) or bell-shaped, and they may have one, two,
of the histogram by straight lines. Frequency poly- or more peaks.
gons are typically closed at each end. To close Perfectly symmetrical distributions are seldom
them, lines are drawn from the point given by the encountered in practice. Skewed or asymmetrical
506 Frequency Distribution

Normal Positively Negatively


Distribution Skewed Skewed
8 (symmetrical) Distribution Distribution
Frequency

5
Figure 3 Example of Symmetrical, Positively Skewed,
2
and Negatively Skewed Distributions
0
35 45 55 65 75 85
Age Groups

Figure 2 Frequency Polygon of the Age of the


Participants at a City Hall Meeting Leptokurtic Mesokurtic Platykurtic
Distribution Distribution Distribution

distributions, which are not symmetrical and typi-


cally have protracted tails, are more common. This Figure 4 Example of Leptokurtic, Mesokurtic, and
means that scores are clustered at one tail of the Platykurtic Distributions
distribution, while occurring less frequently at the
other tail. Distributions are said to be positively
kurtosis. According to their degree of kurtosis, dis-
skewed if scores form a tail to the right of the
tributions may be leptokurtic, mesokurtic, or pla-
mean, or negatively skewed if scores form a tail to
tykurtic. Leptokurtic distributions are
the left of the mean. Positive skews occur fre-
characterized by a high degree of peakedness.
quently in variables that have a lower limit of zero,
These distributions are usually based on data sets
but no specific upper limit, such as income, popu-
in which most of the scores are grouped around
lation density, and so on.
the mode. Normal (Gaussian) distributions are
The degree of skewness may be determined by
described as mesokurtic. Platykurtic distributions
using a series of formulas that produces the index
are characterized by a higher degree of flatness,
of skewness; a small index number indicates
which means that more scores in the data set are
a small degree of skewness, and a large index num-
distributed further away from the mode, as com-
ber indicates significant nonsymmetry. Truly sym-
pared to leptokurtic and mesokurtic distributions.
metrical distributions have zero skewness. Positive
Figure 4 presents leptokurtic, mesokurtic, and pla-
index numbers indicate tails toward higher scores;
tykurtic distributions.
negative index numbers indicate the opposite, tails
toward lower scores. Generally, severely skewed
distributions show the presence of outliers in the Modality
data set. Statistics computed using severely skewed
data may be unreliable. Figure 3 presents symmet- The mode is the most frequently occurring score
rical, positively skewed, and negatively skewed in a distribution. Distributions may be unimodal
distributions. (they have only one peak), bimodal (they have two
peaks), or multimodal (they have more than two
peaks). A distribution may be considered bimodal
or multimodal even if the peaks do not represent
Kurtosis
scores with equal frequencies. Rectangular distri-
Kurtosis refers to a frequency distribution’s butions have no mode. Multiple peaks may also
degree of flatness. A distribution may have the occur in skewed distributions; in such cases, they
same degree of skewness, but differ in terms of the may indicate that two or more dissimilar kinds of
Frequency Table 507

Fielding, J. L., & Gilbert, G. N. (2000). Understanding


social statistics. Thousand Oaks, CA: Sage.
Fried, R. (1969). Introduction to statistics. New York:
Oxford University Press.
Hamilton, L. (1996). Data analysis for social sciences: A
Unimodal Bimodal Multimodal first course in applied statistics. Belmont, CA:
Distribution Distribution Distribution
Wadsworth.
Kiess, H. O. (2002). Statistical concepts for the
behavioral sciences. Boston: Allyn & Bacon.
Figure 5 Example of Unimodal, Bimodal, and
Kolstoe, R. H. (1973). Introduction to statistics for the
Multimodal Distributions
behavioral sciences. Homewood, IL: Dorsey.
Lindquist, E. F. (1942). A first course in statistics: Their
cases have been combined in the analyzed data set. use and interpretation in education and psychology.
Multipeaked distributions are more difficult to Cambridge, MA: Riverside Press.
interpret, resulting in misleading statistics. Distri-
butions with multiple peaks suggest that further
research is required in order to identify subpopula-
tions and determine their individual characteris- FREQUENCY TABLE
tics. Figure 5 presents unimodal, bimodal, and
multimodal distributions. Frequency is a measure of the number of occur-
rences of a particular score in a given set of data. A
frequency table is a method of organizing raw data
Advantages and Drawbacks of in a compact form by displaying a series of scores
Using Frequency Distributions in ascending or descending order, together with
their frequencies—the number of times each score
The advantage of using frequency distributions is occurs in the respective data set. Included in a fre-
that they present raw data in an organized, easy- quency table are typically a column for the scores
to-read format. The most frequently occurring and a column showing the frequency of each score
scores are easily identified, as are score ranges, in the data set. However, more detailed tables may
lower and upper limits, cases that are not com- also contain relative frequencies (proportions) and
mon, outliers, and total number of observations percentages. Frequency tables may be computed for
between any given scores. both discrete and continuous variables and may
The primary drawback of frequency distribu- take either an ungrouped or a grouped format. In
tions is the loss of detail, especially when continu- this entry, frequency tables for ungrouped and
ous data are grouped into classes and the grouped formats are discussed first, followed by
information for individual cases is no longer avail- a discussion of limits and midpoints. This entry
able. The reader of Table 2 learns that there are concludes with a brief discussion of the advantages
eight participants 50 to 59 years old, but the and drawbacks of using frequency tables.
reader does not receive any more details about
individual ages and how they are distributed
within this interval. Frequency Tables for Distributions
Oana Pusa Mihaescu With Ungrouped Scores

See also Cumulative Frequency Distribution; Descriptive


Frequency distributions with ungrouped scores are
Statistics; Distribution; Frequency Table; Histogram
presented in tables showing the scores in the first
column and how often each score has occurred (the
frequency) in the second. They are typically used for
discrete variables, which have a countable or finite
Further Readings
number of distinct values. Tables of ungrouped
Downie, N. M., & Heath, R. W. (1974). Basic statistical scores are also used when the number of different
methods. New York: Harper & Row. scores a variable can take in a data set is low.
508 Frequency Table

Table 1 Raw Data of the Number of Children Relative frequencies, also called proportions,
Families Have in a Small Community are computed as frequencies divided by the sample
5 2 3 4 5 5 4 5 2 0 size: rf ¼ f/n. In this equation, rf represents the
relative frequency corresponding to a particular
score, f represents the frequency corresponding to
the same score, and n represents the total number
Table 2 An Ascending Array of the Number of of cases in the analyzed sample. They indicate the
Children Families Have in a Small Community proportion of observations corresponding to each
0 2 2 3 4 4 5 5 5 5 score. For example, the proportion of families with
two children in the analyzed community is 0.20.
Percentages are computed as proportions multi-
plied by 100: p ¼ rf(100), where p represents the
Table 3 Frequency Table of the Number of Children percentage and rf represents the relative frequency
Families Have in a Small Community corresponding to a particular score. They indicate
Relative Percentage what percentage of observations corresponds to
Number of Frequency Frequency each score. For example, 20% of the families in
Children y Frequency f rf ¼ f =n p ¼ 100  rf the observed sample have four children each. Pro-
portions in a frequency table must sum to 1.00,
0 1 0.10 10.00
whereas percentages must sum to 100.00. Due to
1 0 0.00 00.00
rounding off, some imprecision may sometimes
2 2 0.20 20.00
occur, and the total proportion and percentage
3 1 0.10 10.00
may be just short of or a little more than 1.00 or
4 2 0.20 20.00
100.00%, respectively. However, this issue is now
5 4 0.40 40.00
completely solved through the use of computer
n ¼ 10 1.00 100.00%
programs for such calculations.

Two steps must be followed to build up a fre- Frequency Tables for


quency table out of a set of data: (a) Construct
Grouped Frequency Distributions
a sensible array using the given set of data, and
(b) count the number of times each score occurs in In grouped frequency distributions, the scores are
the given data set. The raw data in Table 1 show organized into classes, typically arranged in ascend-
the number of children families have in a small ing or descending order. The frequency of the obser-
community. vations falling into each class is recorded. Tables
Building up an array implies arranging the with grouped frequency distributions are typically
scores in an ascending or descending order. An used for continuous variables, which can take on an
ascending array is built for the data set in Table 2. infinite number of scores. Grouped frequencies may
The number of times each score occurs in the be employed for discrete variables as well, especially
data set is then counted, and the total is displayed when they take on too many scores to be repre-
for each score, as in Table 3. sented in an ungrouped form. In this case, represent-
Frequencies measure the number of times each ing the frequency of each score does not offer many
score occurs. This means that one family has no useful insights. Because of the wide range of scores
children, and four families have five children in these data, it may be difficult to perceive any
each. Although some scores may not occur in the interesting characteristics of the analyzed data set.
sample data, these scores must nevertheless be To build up a grouped frequency table, a number
listed in the table. For example, even if there are of classes or groups are formed based on the sample
no families with only one child, the score of 1 is data. A class is a range of scores into which raw
still displayed together with its corresponding scores are grouped. In this case, the frequency is not
frequency (zero) in the ascending array built out the number of times each score occurs, but the num-
of the sample data. ber of times these scores fall into one of these
Frequency Table 509

Table 4 Raw Data on Students’ Test Scores Table 5 Grouped Frequency Table of Students’ Test
86.5 87.7 78.8 Scores
88.1 86.2 87.3 Students’ Relative Percentage
99.6 96.3 92.1 Test Frequency Frequency
92.5 79.1 89.8 Scores y Frequency f rf ¼ f =n p ¼ 100  rf
76.1 98.8 98.1 75.180.0 3 0.20 20.00
80.185.0 0 0.00 00.00
85.190.0 6 0.40 40.00
classes. Four steps must be followed to build 90.195.0 2 0.13 13.33
a grouped frequency table: (a) Arrange the scores 95.1100.0 4 0.27 26.67
into an array, (b) determine the number of classes, n ¼ 15 1.00 100.00%
(c) determine the size or width of the classes, and
(d) determine the number of observations that fall
into each class. students have scored between 75.1 and 80.0, and
Table 4 displays a sample of the scores obtained four students have scored between 95.1 and the
by a sample of students at an exam. maximum score of 100.0. Again, there may be
A series of rules applies to class selection: some classes for which the frequency is zero,
(a) All observations must be included; (b) each meaning that no case falls within that class. How-
observation must be assigned to only one class; ever, these classes must also be listed (for example,
(c) no scores can fall between two intervals; and no students have obtained between 80.1 and 85.0
(d) whenever possible, class intervals (the width of points; nevertheless, this case is listed in Table 5).
each class) must be equal. Typically, researchers In general, grouped frequency tables include
choose a manageable number of classes. Too few a column displaying the classes and a column
classes lead to a loss of information from the data, showing their corresponding frequencies, but
whereas too many classes lead to difficulty in ana- they may also include relative frequencies (pro-
lyzing and understanding the data. It is sometimes portions) and percentages. Proportions and per-
recommended that the width of the class intervals centages are computed in the same way as for
be determined by dividing the difference between ungrouped frequency tables. Their meaning
the largest and the smallest score (the range of changes, though. For example, Table 5 shows
scores) by the number of class intervals to be used. that the proportion of students who have scored
Referring to the above example, students’ lowest between 85.1 and 90.0 is 0.40, and that 13.33%
and highest test scores are 76.1 and 99.6, of the students have scored between 90.1 and
respectively. The width of the class interval, i, 95.0.
would then be found by computing:
99:6  76:1
i¼ ¼ 4:7, if the desired number of
5 Stated Limits and
classes is five. The five classes would then be:
Real Limits of a Class Interval
76.180.8; 80.985.5; 85.690.2; 90.394.9;
95.099.6. However, this is not a very convenient It is relevant when working with continuous vari-
grouping. It would be easier to use intervals of 5 ables to define both stated and real class limits.
or 10, and limits that are multiples of 5 or 10. The lower and upper stated limits, also known as
There are many situations in which midpoints are apparent limits of a class, are the lowest and high-
used for analysis, and midpoints of 5 and 10 inter- est scores that could fall into that class. For exam-
vals are easier to calculate than midpoints of 4.7 ple, for the class 75.180.0, the lower stated limit
intervals. Based on this reasoning, the observations is 75.1 and the upper stated limit is 80.0.
in the example data set are grouped into five clas- The lower real limit is defined as the point that
ses, as in Table 5. Finally, the number of observa- is midway between the stated lower limit of a class
tions that fall within each interval is counted. and the stated upper limit of the next lower class.
Frequencies measure the number of cases that The upper real limit is defined as the point that is
fall within each class. This means that three midway between the stated upper limit of a class
510 Friedman Test

and the stated lower limit of the next higher class. and four children, in Table 3). Frequency tables also
For example, the lower real limit of the class represent the first step in drawing histograms and
80.185.0 is 80.05, and the upper real limit is calculating means from grouped data.
85.05. Using relative frequency distributions or per-
Real limits may be determined not only for clas- centage frequency tables is important when com-
ses, but also for numbers. In the case of numbers, paring the frequency distributions of samples with
real limits are the points midway between a partic- different sample sizes. Whereas simple frequencies
ular number and the next lower and higher num- depend on the total number of observations, rela-
bers on the scale used in the respective research. tive frequencies and percentage frequencies do not
For example, the lower real limit of number 4 on and thus may be used for comparisons.
a 1-unit scale is 3.5, and its upper real limit is 4.5. The main drawback of using frequency tables is
However, real limits are not always calculated as the loss of detailed information. Especially when
midpoints. For example, most individuals identify data are grouped into classes, the information for
their age using their most recent birthday. Thus, it individual cases is no longer available. This means
is considered that a person 39 years old is at least that all scores in a class are dealt with as if they
39 years old and has not reached his 40th birth- were identical. For example, the reader of Table 5
day, and not that he is older than 38 years and 6 learns that six students have scored between 85.1
months and younger than 39 years and 6 months. and 90.0, but the reader does not learn any more
For discrete numbers, there are no such things details about the individual test results.
as stated and real limits. When counting the num-
ber of people present at a meeting, limits do not Oana Pusa Mihaescu
extend below and above the respective number
See also Cumulative Frequency Distribution; Descriptive
reported. If there are 120 people, all limits are
Statistics; Distribution; Frequency Distribution;
equal to 120.
Histogram

Midpoints of Class Intervals Further Readings

Each class has a midpoint defined as the point Downie, N. M., & Heath, R. W. (1974). Basic statistical
midway between the real limits of the class. Mid- methods. New York: Harper & Row.
points are calculated by adding the values of the Fielding, J. L., & Gilbert, G. N. (2000). Understanding
social statistics. Thousand Oaks, CA: Sage.
stated or real limits of a class and dividing the sum
Fried, R. (1969). Introduction to statistics. New York:
by two. For example, the midpoint for the class Oxford University Press.
80:1 þ 85:0 80:05 þ 85:05 Hamilton, L. (1996). Data analysis for social sciences: A
80.185.0 is m ¼ ¼ ¼
2 2 first course in applied statistics. Belmont, CA:
82:55. Wadsworth.
Kiess, H. O. (2002). Statistical concepts for the
behavioral sciences. Boston: Allyn & Bacon.
Advantages and Drawbacks Kolstoe, R. H. (1973). Introduction to statistics for the
of Using Frequency Tables behavioral sciences. Homewood, IL: Dorsey.
Lindquist, E. F. (1942). A first course in statistics: Their
The main advantage of using frequency tables is use and interpretation in education and psychology.
that data are grouped and thus easier to read. Fre- Cambridge, MA: Riverside Press.
quency tables allow the reader to immediately
notice a series of characteristics of the analyzed data
set that could probably not have been easily seen
when looking at the raw data: the lowest score (i.e., FRIEDMAN TEST
0, in Table 3); the highest score (i.e., 5, in Table 3);
the most frequently occurring score (i.e., 5, in Table In an attempt to control for unwanted variability,
3); and how many observations fall between two researchers often implement designs that pair or
given scores (i.e., five families have between two group participants into subsets based on common
Friedman Test 511

characteristics (e.g., randomized block design) or Table 1 Ranks for Randomized Block Design
implement designs that observe the same partici- Treatment Conditions
pant across a series of conditions (e.g., repeated-
measures design). The analysis of variance 1 2 . . . K Row Means
(ANOVA) is a common statistical method used to Blocks 1 R11 R12 . . . R1K Kþ1
R1 ¼
analyze data from a randomized block or repeated- 2
measures design. However, the assumption of nor- 2 R21 R22 . . . R2K Kþ1
R2 ¼
mality that underlies ANOVA is often violated, or .. 2
.. . . . . . . . . . . . . .
the scale of measurement for the dependent vari- .
able is ordinal-level, hindering the use of ANOVA. N RN1 RN2 . . . RNK Kþ1
To address this situation, economist Milton Fried- RN ¼
2
man developed a statistical test based on ranks that Column Means R1 R2 . . . R3 Kþ1

may be applied to data from randomized block or 2
repeated measures designs where the purpose is to
detect differences across two or more conditions.
This entry describes this statistical test, named the It is apparent that row means in Table 1 (i.e.,
Friedman Test, which may be used in lieu of mean of ranks for each block) are the same across
ANOVA. The Friedman test is classified as a non- blocks; however, the column means (i.e., mean of
parametric test because it does not require a specific ranks within a treatment condition) will be
distributional assumption. A primary advantage of affected by differences across treatment conditions.
the Friedman test is that it can be applied more Under the null hypothesis that there is no differ-
widely as compared to ANOVA. ence due to treatment, the ranks are assigned at
random, and thus, an equal frequency of ranks
would be expected for each treatment condition.
Procedure Therefore, if there is no treatment effect, then the
The Friedman test is used to analyze several column means are expected to be the same for
related (i.e., dependent) samples. Friedman each treatment condition. The null hypothesis may
referred to his procedure as the method of ranks in be specified as follows:
that it is based on replacing the original scores
Kþ1
with rank-ordered values. Consider a study in H0 : μR ¼ μR ¼    ¼ μR ¼ : ð1Þ
which data are collected within a randomized
·1 ·2 ·K 2
block design where N blocks are observed over K To test the null hypothesis that there is no treat-
treatment conditions on a dependent measure that ment effect, the following test statistic may be
is at least ordinal-level. The first step in the Fried- computed:
man test is to replace the original scores with
ranks, denoted Rjk, within each block; that is, the PK  2
scores for block j are compared with each other, N
R·k  R· ·
k¼1
and a rank of 1 is assigned to the smallest ðTSÞ ¼ , ð2Þ
observed score, a rank of 2 is assigned to the sec- PP
N K 2
Rjk  R · · ðN ðK  1ÞÞ
ond smallest, and so on until the largest value is j¼1 k¼1
replaced by a rank of K. In the situation where
there are ties within a block (i.e., two or more of where R · k represents the mean value for treatment
the values are identical), the midrank is used. The k; R · · represents the grand mean (i.e., mean of all
midrank is the average of the ranks that would rank values); and Rjk represents the rank for block
have been assigned if there were no ties. Note that j and treatment k. Interestingly, the numerator and
this procedure generalizes to a repeated measures denominator of (TS) can be obtained using
design in that the ranks are based on within- repeated measures ANOVA on the ranks. The
participant observations (or, one can think of the numerator is the sum of squares for the treatment
participants as defining the blocks). Table 1 pre- effect (SSeffect). The denominator is the sum of
sents the ranked data in tabular form. squares total (which equals the sum of squares
512 Friedman Test

within-subjects because there is no between- Table 2 Resting Heart Rate as Measured by Beats
subjects variability) divided by the degrees of free- per Minute
dom for the treatment effect plus the degrees of Weight-Lifting Bicycling Running
freedom for the error term. Furthermore, the test 1 72 65 66
statistic provided in Equation 2 does not need to 2 65 67 67
be adjusted when ties exist. 3 69 65 68
An exact distribution for the test statistic may 4 65 61 60
be obtained using permutation in which all possi- Block 5 71 62 63
ble values of (TS) are computed by distributing the 6 65 60 61
rank values within and across blocks in all possible 7 82 72 73
combinations. For an exact distribution, the 8 83 71 70
p value is determined by the proportion of values 9 77 73 72
of (TS) in the exact distribution that are greater 10 78 74 73
than the observed (TS) value. In the recent past,
the use of the exact distribution in obtaining the p
value was not feasible due to the immense comput- measured by beats per minute. The researcher
ing power required to implement the permutation. implemented a randomized block design in which
However, modern-day computers can easily con- initial resting heart rate and body weight (variables
struct the exact distribution for even a moderately that are considered important for response to exer-
large number of blocks. Nonetheless, for a suffi- cise) were used to assign participants into relevant
cient number of blocks, the test statistic is distrib- blocks. Participants within each block were ran-
uted as a chi-square with degrees of freedom equal domly assigned to one of the three exercise modes
to the number of treatment conditions minus 1 (i.e., treatment condition). After one month of
(i.e., K  1). Therefore, the chi-square distribution exercising, the resting heart of each participant
may be used to obtain the p value for (TS) when was recorded and is shown in Table 2.
the number of blocks is sufficient. The first step in the Friedman test is to replace
The Friedman test may be viewed as an exten- the original scores with ranks within each block.
sion of the sign test. In fact, in the context of two For example, for the first block, the smallest origi-
treatment conditions, the Friedman test provides nal score of 65, which was associated with the par-
the same result as the sign test. As a result, multi- ticipant in the bicycling group, was replaced by
ple comparisons may be conducted either by using a rank of 1; the original score of 66 associated
the sign test or by implementing the procedure for with the running group was replaced by a rank of
the Friedman test on the two treatment conditions 2; and the original score of 72, associated with
of interest. The familywise error rate can be con- weightlifting, was replaced by a rank of 3. Further-
trolled using typical methods such as Dunn- more, note that for Block 2, the original values
Bonferroni or Holm’s Sequential Rejective Proce- of resting heart rate were the same for the bicy-
dure. For example, when the degrees of freedom cling and running conditions (i.e., beats per minute
equals 2 (i.e., K ¼ 3), then the Fisher least signifi- equaled 67 for both conditions as shown in
cant difference (LSD) procedure may be implemen- Table 2). Therefore, the midrank value of 2.5 was
ted in which the omnibus hypothesis is tested first; used, which was based on the average of the ranks
if the omnibus hypothesis is rejected, then each they would have received if they were not tied
multiple comparison may be conducted using (i.e., [2 þ 3]/2 ¼ 2.5). Table 3 reports the rank
either the sign or the Friedman test on the specific values for each block.
treatment conditions using a full α level. The mean of the ranked values for each block
(Rj ·) is identical because the ranks were based on
within blocks. Therefore, there is no variability
Example
across blocks once the original scores have been
Suppose a researcher was interested in examining replaced by the ranks. However, the mean of the
the effect of three types of exercises (weightlifting, ranks varies across treatment conditions (R · k ). If
bicycling, and running) on resting heart rate as the treatment conditions are identical in the
F Test 513

Table 3 Rank Values of Resting Heart Rate Within Table 4 p Values for the Pairwise Comparisons
Each Block Comparison p value (Exact, Two-tailed)
Weight-Lifting Bicycling Running Rj Weight-lifting vs. Bicycling 0.021
1 3 1 2 2 Weight-lifting vs. Running 0.021
2 1 2.5 2.5 2 Bicycling vs. Running 1.000
3 3 1 2 2
4 3 2 1 2
Block 5 3 1 2 2 procedure can be used to control the familywise
6 3 1 2 2 error rate. The omnibus test was significant at α ¼
7 3 1 2 2 0.05, therefore, each of the pairwise comparisons
8 3 2 1 2 can be tested using α ¼ 0.05. Table 4 reports the
9 3 2 1 2 exact p values (two-tailed) for the three pairwise
10 3 2 1 2 comparisons. From the analyses, the researcher
Rk 2.8 1.55 1.65 2 can conclude that the weightlifting condition dif-
fered in its effect on resting heart rate compared to
running and bicycling; however, it cannot be con-
population, R · k s are expected to be similar across cluded that the running and bicycling conditions
the three conditions (i.e., R · k ¼ 2). differed.
The omnibus test statistic, (TS), is computed for
the data shown in Table 3 as follows: Craig Stephen Wells

ðTSÞ ¼ See also Levels of Measurement; Normal Distribution;


2 2 2
Repeated Measures Design; Sign Test; Wilcoxon Rank
10½ð2:80  2Þ þ ð1:55  2Þ þ ð1:65  2Þ  Sum Test
2 2 2
½ð3  2Þ þ ð1  2Þ þ    þ ð1  2Þ =ð10ð3  1ÞÞ
9:65
¼ ¼ 9:897 Further Readings
19:5=
20
ð3Þ Friedman, M. (1937a). A comparison of alternative tests
of significance for the problem of m rankings. Annals
of Mathematical Statistics, 11, 8692.
The numerator of (TS) is the sum of squares for
Friedman, M. (1937b). The use of ranks to avoid the
the effect due to exercise routine on the ranks assumption of normality implicit in the analysis of
(SSeffect ¼ 9.65). The denominator is the sum variance. Journal of the American Statistical
of squares total (SStotal ¼ 19.5) divided by Association, 32, 675701.
the degrees of freedom for the treatment effect Friedman, M. (1939). A correction: The use of ranks to
(dfeffect ¼ 2) plus the degrees of freedom for the avoid the assumption of normality implicit in the
error term (dferror ¼ 18). analysis of variance. Journal of the American
The p value for (TS) based on the exact distri- Statistical Association, 34, 109.
bution is 0.005, which allows the researcher to
reject H0 for α ¼ 0.05 and conclude that the exer-
cise routines had a differential effect on resting
heart rate. The chi-square distribution may also be
used to obtain an approximated p value. The F TEST
omnibus test statistic follows the chi-square distri-
bution with 2 degrees of freedom (i.e., df ¼ 3  The F test was named by George W. Snedecor in
1) and has a p value of 0.007. honor of its developer, Sir Ronald A. Fisher. F tests
To determine which methods differ, pairwise are used for a number of purposes, including com-
comparisons could be conducted using either the parison of two variances, analysis of variance
sign test or the Friedman procedure on the treat- (ANOVA), and multiple regression. An F statistic
ment conditions. Because df ¼ 2, the Fisher LSD is a statistic having an F distribution.
514 F Test

The F Distribution Comparing Two Variances


One sort of F test is that for comparing two inde-
The F distribution is related to the chi-square (χ2)
pendent variances. The sample variances are com-
distribution.
pared by taking their ratio. This problem is
A random variable has a chi-square distribu-
mentioned first, as other applications of F tests
tion with m degrees of freedom (d.f.) if it is
involve ratios of variances (or mean squares) as
distributed as the sum of squares of m indepen-
well. The hypothesis H0 : σ 21 ¼ σ 22 is rejected if
dent standard normal random variables. The
the ratio
chi-square distribution arises in analysis of
variance and regression analysis because the larger sample variance
sum of squared deviations (numerator of the F¼
smaller sample variance
variance) of the dependent variable is
decomposed into parts having chi-square distri- is large. The statistic has an F distribution if the
butions under the standard assumptions on the samples are from normal distributions.
errors in the model.
The F distribution arises because one takes
ratios of the various terms in the decomposition of Normal Distribution Theory
the overall sum of squared deviations. When the The distribution of the sample variance s2 com-
errors in the model are normally distributed, these puted from a sample of N from a normal distribu-
terms have distributions related to chi-square dis- tion with variance σ 2 is given by the fact that
tributions, and the relevant ratios have F distribu- (N  1)s2/σ 2 is distributed according to a chi-
tions. Mathematically, the F distribution with m square distribution with m ¼ N  1 d.f. So, for
and n d.f. is the distribution of the variances s21 and s22 of two independent sam-
ples of sizes N1 and N2 from normal distributions,
U=m the variable U ¼ ðN1  1Þs21 =σ 21 is distributed
, according to chi-square with m ¼ N1  1 d.f.,
V=n
and the variable V ¼ ðN2  1Þs22 =σ 22 is distributed
where U and V are statistically independent and according to chi-square with n ¼ N2  1 d.f. If
distributed according to chi-square distributions σ 21 ¼ σ 22 , the ratio F ¼ s21 =s22 has an F distribution
with m and n d.f. with m ¼ N1  1 and n ¼ N2  1 d.f.

Analysis of Variance
F Tests
F tests are used in ANOVA. The total sum of
Given a null hypothesis H0 and a significance squared deviations is decomposed into parts cor-
level α, the corresponding F test rejects H0 if the responding to different factors. In the normal
value of the F statistic is large; more precisely, if case, these parts have distributions related to
F > Fm;n; α , the upper αth quantile of the chi-square. The F statistics are ratios of these
Fm,n distribution. The values of m and n depend parts and hence have F distributions in the nor-
upon the particular problem (comparing var- mal case.
iances, ANOVA, multiple regression). The
achieved (descriptive) level of significance (p
One-Way ANOVA
value) of the test is the probability that a variable
with the Fm,n distribution exceeds the observed One-way ANOVA is for comparison of the
value of the statistic F. The null hypothesis is means of several groups. The data are Ygi,g ¼ 1;
rejected if p < α. 2; . . . ; k groups, i ¼ 1; 2; . . . ; Ng cases in the gth
Many tables are available for the quantiles, but group.
they can be obtained in Excel and in statistical The model is
computer packages, and p values are given in the
output for various procedures. Ygi ¼ μg þ εgi ;
F Test 515

where the errors εgi have mean zero and constant that is,
variance σ 2, and are uncorrelated. The null
hypothesis (hypothesis of no differences between ðN  1Þ ¼ ðk  1Þ þ ðN  kÞ:
group means) is H0 : μ1 ¼ μ2 ¼    ¼ μk . It is
Each mean square is the corresponding sum of
convenient to reparametrize as μg ¼ μ þ αg,
squares, divided by its d.f.: MSTot ¼ SSTot/
where αg is the deviation of the true mean μg for
DFTot ¼ SSTot/(N  l) is just the sample variance
group g from the true overall mean μ The devia-
P of Y; MSB ¼ SSB/DFB ¼ SSB/(k  l), and
tions satisfy a constraint such as kg¼1 Ng αg ¼ 0. MSW ¼ SSW/DFW ¼ SSW/(N  k). The rele-
In terms of the αg , the model is vant F statistic is F ¼ MSB/MSW. For F to have an
F distribution, the errors must be normally distrib-
Ygi ¼ μ þ αg þ εgi ; uted. The residuals can be examined to see if their
histogram looks bell-shaped and not too heavy-
and H0 is α1 ¼ α2 ¼    ¼ αk :
tailed, and a normal quantile plot can be used.

Decomposition of Sum of Squares; Mean Squares Power and Noncentral F


There is a corresponding decomposition of the The power of the test depends upon the extent
observations and of the sums of squared deviations of departure from the null hypothesis, as given by
from the mean. The group sums are the noncentrality parameter δ2. For one-way
PNg
Ygþ ¼ i¼1 Ygi . The means are Yg. ¼ Yg þ /Ng. ANOVA,
P PNg
The overall sum is Yþþ ¼ kg¼1 i¼1 Ygi , and the X
k X
k
overall mean is Y. . ¼ Y þþ /N, where N ¼ N1 þ σ 2 δ2 ¼ Ng ðμg  μÞ2 ¼ Ng α2g ,
N2 þ    þ NK. The decomposition of the observa- g¼1 g¼1

tions as
a measure of dispersion of the true means μg. The
Ygi ¼ μ
^ þ α^g þ ε^i noncentral F distribution is related to the noncen-
¼ Y:: þ ðYg :  Yþþ Þ þ ðYgi  Yg :Þ: tral chi-square distribution. The noncentral chi-
square distribution with m degrees of freedom and
This is noncentrality parameter δ2 is the distribution of the
ðYgi  Y::Þ ¼ ðYg:  Y:: þ ðYgi  Yg: Þ: sum of squares of m independent normal variables
with variances equal to 1 and means whose sum of
Squaring both sides and summing gives the squares is δ2. If, in the ratio (U/m)/(V/n), the vari-
analogous decomposition of the sum of squares able U has a noncentral chi-square distribution,
then the ratio has a noncentral F distribution.
Ng
X
k X
2
When the null hypothesis of equality of means is
ðYgi  Y::Þ ¼ false, the test statistic has a noncentral F distribu-
g¼1 i¼1
tion. The noncentrality parameter depends upon
Ng Ng
X
k X X
k X the group means and the sample sizes. Power com-
ðYg :  Y::Þ2 þ ðYgi  Yg :Þ2 ¼ putations involve the noncentral F distribution. It is
g¼1 i¼1 g¼1 i¼1
via the noncentrality parameter that one specifies
Ng
X
k
2
X
k X what constitutes a reasonably large departure from
Ng ðYg :  Y::Þ þ ðYgi  Yg :Þ2 the null hypothesis. Ideally, the level α and the sam-
g¼1 g¼1 i¼1
ple sizes are chosen so that the power is sufficiently
or SSTot ¼ SSB þ SSW: large (say, .8 or .9) for large departures from the
null hypothesis.
Here, SSTot denotes the total sum of squares; SSB,
between-group sum of squares; and SSW, within- Randomized Blocks Design
group sum of squares. The decomposition of d.f. is
This is two-way ANOVA with no replication.
DFTot ¼ DFB þ DFW; There are two factors A and B with a and b levels,
516 F Test

thus, ab observations. The decomposition of the understand the F tests for more complicated
sum of squares is designs.
SSTot ¼ SSA þ SSB þ SSRes: Multiple Regression
The decomposition of d.f. is Given a data set of observations on explanatory
variables X1; X2; . . . ; Xp and a dependent variable
DFTot ¼ DFA þ DFB þ DFRes Y for each of N cases, the multiple linear regres-
that is, sion model takes the expected value of Y to be
a linear function of X1; X2; . . . ; Xp. That is, the
ðab  1Þ ¼ ða  1Þ þ ðb  1Þ þ ða  1Þðb  1Þ: mean Ex(Y) of Y for given values x1; x2; . . . ; xp is
of the form
Each mean square is the corresponding sum of
squares, divided by its d.f.: MSA ¼ SSA/DFA ¼ Ex ðYÞ ¼ β0 þ β1 x1 þ β2 x2 þ    þ βp xp ,
SSA/(a  1), MSB ¼ SSB/DFB ¼ SSB/(b  1),
MSRes ¼ SSRes/DFRes ¼ SSRes/(a  1)(b  1). where the βj are parameters to be estimated, and
The test statistics are FA ¼ MSA/MSRes with the error is additive,
a  1 and (a  1)(b  1) d.f. and FB ¼ MSB/ Y ¼ Ex ðYÞ þ ε:
MSRes with b  1 and (a  1)(b  1) d.f.
Writing this in terms of the N cases gives the
observational model for i ¼ 1; 2; . . . ; N,
Two-Way ANOVA
Yi ¼ β0 þ β1 x1i þ β2 x2i þ    þ βp xpi þ εi :
When there is replication, with r replicates for
each combination of levels of A and B, the decom- The assumptions on the errors are that they
position of SSTot is have mean zero and common variance σ 2, and are
uncorrelated.
SSTot ¼ SSA þ SSB þ SSA × B þ SSRes,

where A × B represents interaction. The decom- Decomposition of Sum of Squares


position of d.f. is Analogously to the model Yi ¼ Ex(Yi) þ εi , the
DFTot ¼ DFA þ DFB þ DFA × B þ DFRes observations can be written as
Observed value ¼ predicted value þ residual,
that is,
or
ðabr  1Þ ¼
ða  1Þ þ ðb  1Þ þ ða  1Þðb  1Þ þ abðr  1Þ: Yi ¼ Y^ i þ ei :
^ i is the predicted value of Yi, given by
Here, Y
Each mean square is the corresponding sum of
squares, divided by its d.f.: MSA ¼ SSA/DFA ¼ ^ i ¼ b0 þ b1x1i þ b2x2i þ    þ bpxpi ;
Y
SSA/(a  1), MSB ¼ SSB/DFB ¼ SSB/(b  1),
MSA × B ¼ SSA × B/DFA × B ¼ SSA × B/ where bj is the least squares estimate of βj. This
(a  1)(b  1), MSRes ¼ SSRes/DFRes ¼ SSRes/ can be written
ab(r  1). The denominators of the F tests depend
on whether the levels of the factors are fixed, ran- ^i ¼
Y
dom, or mixed. Y: þ b1 ðx1i  x1 :Þ þ b2 ðx2i  x2 :Þ þ . . . bp ðxpi  xp :Þ:

Here, Y. is the mean of the values of Y and xj. is


Other Designs
the mean of the values of Xj. This gives
The concepts developed above for one-way and
^ i  Y:Þ þ ðYi  Y
ðYi  Y:Þ ¼ ðY ^ i Þ;
two-way ANOVA should enable the reader to
F Test 517

^ i estimates the error


where the residual ei ¼ Yi  Y When this null hypothesis is true, the model is
εi. Squaring and summing gives the decomposition the reduced model
of sum of squares
ExðYÞ ¼ β0 þ β1 x1 þ β2 x2 þ    þ βq xq :
SSTot ¼ SSReg þ SSRes:
An F test can be used to test the adequacy of this
Here SSReg is the regression sum of squares and reduced model. Let SSResfull denote the residual
SSRes is the residual sum of squares. The propor- sum of squares of the full model with p variables
tion of SSTot accounted for by the regression is and SSResred that of the reduced model with only
SSReg/SSTot ¼ SSReg/(SSReg þ SSRes), which can q variables. Then the test statistic is
be shown to be equal to R2, where R is the multiple
correlation coefficient between Y and the p Xs. ðSSResred  SSResfull Þ=ðp  qÞ
F ¼ :
SSResfull =ðN  p  1Þ
F in Terms of R-square The denominator SSResfull/(N  p  1) is sim-
The decomposition ply MSResfull, the residual mean square in the full
model. The quantity SSResfull is less than or equal
SSTot ¼ SSReg þ SSRes to SSResred because the latter is the result of mini-
mization over a subset of the regression coeffi-
is SSTot ¼ R2SSTot þ (1  R2)SSTot. The cients, the last p  q of them being restricted to
decomposition of d.f. is zero. The difference SSResred  SSResfull is thus
DFTot ¼ DFReg þ DFRes, nonnegative and is the loss of fit due to omitting
the last p  q variables, that is, the loss of fit due
or to the null hypothesis that the last p  q βj are
zero. It is the hypothesis sum of squares SSHyp.
ðN  1Þ ¼ p þ ðN  p  1Þ: Thus, the numerator is
It follows that MSReg ¼ SSReg/DFReg ¼ SSReg/ ðSSResred  SSResfull Þ=ðp  qÞ ¼ SSHyp=ðp  qÞ
p ¼ R2/p and MSRes ¼ SSRes/DFRes ¼ SSRes/
¼ MSHyp;
(n  p  1) ¼ (1  R2)/(N  p  1). The statistic

R2 =p the hypothesis mean square. So F is again the


F ¼ MSReg=MSRes ¼ ratio of two mean squares, in this case, F ¼
ð1  R2 Þ=ðN  p  1Þ
MSHyp/MSResfull. The numbers of d.f. are p  q
has an F distribution with p and N  p  1 d.f. and N  p  1. Under the null hypothesis, this F
when the errors are normally distributed. statistic has the Fpq, Np1 distribution if the
errors are normally distributed.

Comparing a Reduced Model With the Full Model Stanley L. Sclove

The F statistic for testing the null hypothesis See also Analysis of Variance (ANOVA); Coefficients of
concerning the coefficient of a single variable is Correlation, Alienation, and Determination;
the square of the t statistic for this test. But F Experimental Design; Factorial Design; Hypothesis;
can be used for testing several variables at a time. Least Squares, Methods of; Significance Level,
It is often of interest to test a portion of a model, Concept of; Significance Level, Interpretation and
that is, to test whether a subset of the Construction; Stepwise Regression
variables—say, the first q variables—is adequate.
Let p ¼ q þ r; it is being considered whether Further Readings
the last r ¼ p  q variables are needed. The null
hypothesis is Bennett, J. H. (Ed.). (1971). The collected papers of
R. A. Fisher. Adelaide, Australia: University of
H0 : βj ¼ 0, j ¼ q þ 1, . . . , p: Adelaide Press.
518 F Test

Fisher, R. A. (1925). Statistical methods for research Kempthorne, O. (1952). The design and analysis of
workers. Edinburgh, UK: Oliver and Boyd. experiments. New York: Wiley.
Graybill, F. A. (1976). Theory and application of the Scheffé, H. (1959). The analysis of variance. New York:
linear model. N. Scituate, MA: Duxbury Press. Wiley.
Hogg, R. V., McKean, J. W., & Craig, A. T. (2005). Snedecor, G. W., & Cochran, W. G. (1989). Statistical
Introduction to mathematical statistics (6th ed.). methods (8th ed.). Ames: Iowa State University Press.
Upper Saddle River, NJ: Prentice Hall.
G
pretest scores (i.e., gain ¼ posttest  pretest).
GAIN SCORES, ANALYSIS OF However, both the pretest and the posttest scores
for any individual contain some amount of mea-
surement error such that it is impossible to know
Gain (i.e., change, difference) is defined here as a person’s true score on any given assessment.
the difference between test scores obtained for Thus, in classical test theory (CTT), a person’s
an individual or group of individuals from a mea- observed score (X) is composed of two parts, some
surement instrument, intended to measure the true score (T) and some amount of measurement
same attribute, trait, concept, construct, or skill, error (E) as defined in Equation 1:
between two or more testing occasions. This dif-
ference does not necessarily mean that there is X ¼ T þ E: ð1Þ
an increase in the test score(s). Thus, a negative
difference is also described as a ‘‘gain score.’’ In a gain score analysis, it is the change in the true
There are a multitude of reasons for measuring scores (T Þ that is of real interest. However, the
gain: (a) to evaluate the effects of instruction or researcher’s best estimate of the true score is the
other treatments over time, (b) to find variables person’s observed score, thus making the gain
that correlate with change for developing a crite- score (i.e., the difference between observed scores)
rion variable in an attempt to answer questions an unbiased estimator of T for any given individ-
such as ‘‘What kinds of students grow fastest on ual or subject. What follows is a description of
the trait of interest?,’’ and (c) to compare indi- methods for analyzing gain scores, a discussion of
vidual differences in gain scores for the purpose the reliability of gain scores, alternatives to the
of allocating service resources and selecting indi- analysis of gain scores, and a brief overview of
viduals for further or special study. designs that measure change using more than two
The typical and most intuitive approach to the waves of data collection.
calculation of change is to compute the difference
between two measurement occasions. This differ-
Methods for the Analysis of Gain Scores
ence is called a gain score and can be considered
a composite in that it is made up of a pretest (e.g., The gain score can be used as a dependent variable
an initial score on some trait) and a posttest (e.g., in a t test (i.e., used to determine whether the
a score on the same trait after a treatment has been mean difference is statistically significant for
implemented) score where a weight of 1 is assigned a group or whether the mean differences between
to the posttest and a weight of 1 is assigned to two groups are statistically significantly different)
the pretest. Therefore, the computation of the gain or an analysis of variance (ANOVA) (i.e., used
score is simply the difference between posttest and when the means of more than two groups or more

519
520 Gain Scores, Analysis of

than two measurement occasions are compared) Reliability of the Gain Score
with the treatment, intervention, instructional
mode (i.e., as with educational research) or natu- Frederic M. Lord and Melvin R. Novick intro-
rally occurring group (e.g., sex) serving as the duced the reliability of the gain score as the ratio
between-subjects factor. (For simplicity, through- of the variance of the difference score (σ 2D Þ to the
out this entry, levels of the between-groups factors sum of the variance of the difference score and the
are referred to as treatment groups. However, the variance of the error associated with that differ-
information provided also applies to other types of ence (σ 2D þ σ 2errD Þ:
groups as well, such as intervention, instructional
σ 2D
modes, and naturally occurring.) If the t test or the ρD ¼ : ð2Þ
treatment main effect in an ANOVA is significant, σ 2D þ σ 2errD
the null hypothesis of no significant gain or differ-
ence in improvement between groups (e.g., treat- Hence, the variance of the difference score is
ment and control groups) can be rejected. the systematic difference between subjects in their
gain score. In other words, the reliability of the
gain score is really a way to determine whether the
t Test assessment or treatment discriminates between
those who change a great deal and those who
Depending on the research question and design, change little, and to what degree. The reliability of
either a one-sample t test or an independent sam- the gain score can be further described in terms of
ples t test can be conducted using the gain score as the pretest and posttest variances along with their
the dependent variable. A one-sample t test can be respective reliabilities and the correlation of the
used when the goal is to determine whether the pretest with the posttest. Equation 3 describes this
mean gain score is significantly different from zero relationship from the CTT perspective, where
or some other specified value. When two groups observations are considered independent.
(e.g., control and treatment) are included in the
research design and the aim is to determine σ 2pre ρpre þ σ 2post ρpost  2σ pre σ post ρpre;post
whether more gain is observed in the treatment ρD ¼ ; ð3Þ
σ 2pre þ σ 2post  2σ pre σ post ρpre;post
group, for example, an independent t test can be
implemented to determine whether the mean gain
where ρD represents the reliability of the gain score
scores between groups are significantly different
(D) and σ 2pre and σ 2post designate the variance of the
from each other. In this context, the gain score is
pretest and posttest scores, respectively. Likewise,
entered as the dependent variable and more than
σ pre and σ post designate the standard deviations of
two groups would be examined (e.g., a control
the pretest and posttest scores, respectively, and
and two different treatment groups).
ρpre and ρpost represent the reliabilities of the pre-
test and posttest scores, respectively. Lastly,
ρpre;post designates the correlation between the
Analysis of Variance
pretest and posttest scores. Equation 3 further
Like the goal of an independent t test, the aim reduces to Equation 4:
of an ANOVA is to determine whether the mean
gain scores between groups are significantly differ- ρpre þ ρpost  2ρpre;post
ρD ¼ ð4Þ
ent from each other. Instead of conducting multi- 2  2ρpre;post
ple t tests, an ANOVA is performed when more
than two groups are present in order to control the when the variances of the pretests and posttests
type I error rate (i.e., rate of rejecting a true null are equal (i.e., σ 2pre ¼ σ 2post Þ. However, it is rare
hypothesis). However, differences in pretest scores that equal variances are observed when a treatment
between groups are not controlled for when con- is studied that is intended to show growth between
ducting an ANOVA using the gain scores, which pretesting and posttesting occasions. When growth
can result in misleading conclusions as discussed is the main criterion, this equality should not be
later. considered an indicator of construct validity, as it
Gain Scores, Analysis of 521

has been in the past. In this case, it is merely an have shown that gain scores can be reliable under
indication of whether rank order is maintained certain circumstances that depend upon the experi-
over time. If differing growth rates are observed, mental procedure and the use of appropriate
this equality will not hold. For example, effective instruments. Williams, Zimmerman, and Roy D.
instruction tends to increase the variability within Mazzagatti further discovered that for simple gains
a treatment group, especially when the measure to be reliable, it is necessary that the intervention
used to assess performance has an ample number or treatment be strong and the measuring device
of score points to detect growth adequately (i.e., or assessment be sensitive enough to detect
the scoring range is high enough to prevent ceiling changes due to the intervention or treatment. The
effects). If ceiling effects are present or many stu- question remains, ‘‘How often does this occur in
dents achieve mastery such that scores are concen- practice?’’ Zimmerman and Williams show, by
trated near the top of the scoring scale, the example, that with a pretest assessment that has
variability of the scores declines. a 0.9 reliability, if the intervention increases the
The correlation between pretest and posttest variability of true scores, the reliability of the gain
scores for the treatment group provides an esti- scores will be at least as high as that of the pretest
mate of the reliability (i.e., consistency) of the scores. Conversely, if the intervention reduces the
treatment effect across individuals. When the cor- variability of true scores, the reliability of the gain
relation between the pretest and posttest is one, scores decreases, thus placing its value between the
the reliability of the difference score is zero. This is reliabilities of the pretest and posttest scores.
because uniform responses are observed, and Given these findings, it seems that the use of gain
therefore, there is no ability to discriminate scores in research is not as meek as it was once
between those who change a great deal and those thought. In fact, only when there is no change or
who change little. However, some researchers, a reduction in the variance of the true scores as
Gideon J. Mellenbergh and Wulfert P. van den a result of the intervention(s) is the reliability of
Brink, for example, suggest that this does not the gain score significantly lowered. Thus, when
mean that the difference score should not be pretest scores are reliable, gain scores are reliable
trusted. In this specific instance, a different mea- for research purposes.
sure (e.g., measure of sensitivity) is needed to Although the efficacy of using gain scores has
assess the utility of the assessment or the produc- been historically wrought with much controversy,
tivity of the treatment in question. Such measures as the main arguments against their use are that
may include, but are not limited to, Cohen’s effect they are unreliable and negatively correlated with
size or an investigation of information (i.e., preci- pretest scores, gain scores are currently gaining in
sion) at the subject level. application and appeal because of the resolution of
Additionally, experimental independence (i.e., misconceptions found in the literature on the reli-
the pretest and posttest error scores are uncorre- ability of gain scores. Moreover, depending on the
lated) is assumed by using the CTT formulation of research question, precision may be a better way
reliability of the difference score. This is hardly the to judge the utility of the gain score than reliability
case with educational research, and it is likely that alone.
the errors are positively correlated; thus, the reli-
ability of gain scores is often underestimated. As
a result, in cases such as these, the additivity of
Alternative Analyses
error variances does not hold and leads to an
inflated estimate of error variance for the gain Alternative statistical tests of significance can also
score. Additionally, David R. Rogosa and John B. be performed that do not include a direct analysis
Willett contend that it is not the positive correla- of the gain scores. An analysis of covariance
tion of errors of measurement that inflate the reli- (ANCOVA), residualized gain scores, and the
ability of the gain score, but rather individual Johnson-Neyman technique are examples. Many
differences in true change. other examples also exist but are not presented
Contrary to historical findings, Donald W. Zim- here (see Further Readings for references to these
merman and Richard H. Williams, among others, alternatives).
522 Gain Scores, Analysis of

Analysis of Covariance intervention when compared to those with greater


and Residualized Gain Scores impairment. Analytic post hoc methods for con-
trolling pretreatment differences would wrongly
Instead of using the actual difference between
conclude an effect due to the intervention in this
pretest and posttest scores, Lee J. Cronbach and
case. Likewise, with high pretest scores and an
Lita Furby suggest the use of ANCOVA or residua-
assessment that restricts the range of possible high
lized gain scores. In such an analysis, a regression
scores, a ceiling effect can occur where an underes-
line is fit that relates the pretest to the posttest
timate of the treatment’s efficacy is observed. The
scores. Then, the difference between the posttest
strength of the Johnson-Neyman technique is that
score and predicted posttest scores is used as an
it can define regions along the continuum of the
indication of individuals who have changed more
covariate (i.e., the pretest score) for which the con-
or less than expected, assuming a linear relation-
clusion of a significant difference of means on the
ship. More specifically, the gain score is trans-
posttest can be determined.
formed into a residual score (zÞ, where
z ¼ x2  ½bx ðx1 Þ þ ax ; ð5Þ Multiple Observation Designs
x2 represents the posttest score, and bx ðx1 Þ þ ax Up to this point, the pre/posttest (i.e., two-wave)
represents the linear regression estimate of x2 design has been discussed, including some of its
based on x1 , the pretest score. limitations. However, the real issue may be that
Such an analysis answers the research question, the assumption that change can be measured
‘‘What is the effect of the treatment on the posttest adequately from using only two scores is unreal-
that is not predictable from the pretest?’’ In other istic and unnatural. In the case where more than
words, group means on the posttest are compared two observations are present, it is possible to fit
conditional on the pretest scores. Usually, the a statistical curve (i.e., growth curve) to the
power to detect differences between groups is observed data to model change as a function of
greater using this analysis than it is using an time. Thus, linearity is no longer assumed
ANOVA. Additional advantages include the ability because change is realistically a continuous pro-
to test assumptions of linearity and homogeneity cess and not a quantum leap (i.e., incremental)
of regression. from one point to the next. With more data
points, the within-subject error decreases and
the statistical power increases. Analytic models
Johnson-Neyman Technique
describing the relationship between individual
In the case of quasi-experimental designs, the growth and time have been proposed. These
gain score may serve to eliminate initial differences models are used to estimate their respective
between groups (i.e., due to the non-random parameters using various regression analysis
assignment of individuals to treatment groups). techniques. Thus, the reliability of estimates of
However, post hoc adjustments for differences individual change can be improved by having
between groups on a pretest, such as calculation of more than two waves of data, and formulas for
gain scores, repeated measures ANOVA, and determining reliability for these models can be
ANCOVA, all assume that in the absence of found.
a treatment effect, the treatment and control
groups grow at the same rate. There are two well-
Final Thoughts
known scenarios (e.g., fan-spread and ceiling
effects) where this is not the case, and the John- The purpose for analyzing gain scores is to either
son-Neyman technique may be an appropriate examine overall effects of treatments (i.e., popula-
alternative for inferring treatment effects. tion change) or distinguish individual differences in
In the case of fan-spread, the less capable exam- response to the treatment (i.e., individual change).
inees may have a greater chance for improvement Reliability of the gain score is certainly more of
(e.g., in the case of special education interventions) a concern for the former purpose, whereas preci-
and have greater pretest scores without the sion of information is of primary concern for the
Game Theory 523

estimation of individual gain scores. Methods for Zimmerman, D. W., & Williams, R. H. (1998).
analyzing gain scores include, but are not limited Reliability of gain scores under realistic assumptions
to, t tests and ANOVA models. These models about properties of pre-test and post-test scores.
answer the question ‘‘What is the effect of the treat- British Journal of Mathematical and Statistical
Psychology, 51(2), 343351.
ment on change from pretest to posttest?’’ Gain
scores focus on the difference between measure-
ments taken at two points in time and thus repre-
sent an incremental model of change. Ultimately,
multiple waves of data should be considered for the GAME THEORY
analysis of individual change over time because it is
unrealistic to view the process of change as follow- Game theory is a model of decision making
ing a linear and incremental pattern. and strategy under differing conditions of uncer-
tainty. Games are defined as strategic interactions
Tia Sukin between players, where strategy refers to a com-
plete plan of action including all prospective play
See also Analysis of Covariance (ANCOVA); Analysis of
options as well as the player’s associated outcome
Variance (ANOVA); Growth Curves
preferences. The formal predicted strategy for solv-
ing a game is referred to as a solution. The pur-
Further Readings
pose of game theory is to explore differing
Cronbach, L. J., & Furby, L. (1970). How we should solutions (i.e., tactics) among players within games
measure ‘‘change’’—or should we? Psychological of strategy that obtain a maximum of utility. In
Bulletin, 74(1), 6880. game theory parlance, ‘‘utility’’ refers to preferred
Haertel, E. H. (2006). Reliability. In R. Brennan (Ed.), outcomes that may vary among individual players.
Educational measurement (4th ed., pp. 65110). John von Neumann and Oskar Morgenstern
Westport, CT: Praeger.
seeded game theory as an economic explanatory
Johnson, P. O., & Fay, L. C. (1950). The Johnson-
Neyman technique, its theory and application.
construct for all endeavors of the individual to
Psychometrika, 15(4), 349367. achieve maximum utility or, in economic terms,
Knapp, T. R., & Schafer, W. D. (2009). From gain score t profit; this is referred to as a maximum. Since its
to ANCOVA F (and vice versa). Practical Assessment, inception in 1944, game theory has become an
Research & Evaluation, 14(6), 17. accepted multidisciplinary model for social
Lord, F. M., & Novick, M. R. (1968). Statistical exchange in decision making within the spheres of
theories of mental test development. Reading, MA: biology, sociology, political science, business, and
Addison-Wesley. psychology. The discipline of psychology has
Mellenbergh, G. J., & van den Brink, W. P. (1998). The embraced applied game theory as a model for con-
measurement of individual change. Psychological
flict resolution between couples, within families,
Methods, 3(4), 470485.
Rogosa, D., Brandt, D., & Zimowski, M. (1982). A
and between hostile countries; as such, it is also
growth curve approach to the measurement of change. referred to as the theory of social situations.
Psychological Bulletin, 92(3), 726748. Classic game theory as proposed by von Neu-
Rogosa, D. R., & Willett, J. B. (1985). Understanding mann and Morgenstern is a mathematical model
correlates of change by modeling individual founded in utility theory, wherein the game
differences in growth. Psychometrika, 50, 203228. player’s imagined outcome preferences can be
Williams, R. H., & Zimmerman, D. W. (1996). Are combined and weighted by their probabilities.
simple gain scores obsolete? Applied Psychological These outcome preferences can be quantified and
Measurement, 20(1), 5969. are therefore labeled utilities. A fundamental
Williams, R. H., Zimmerman, D. W., & Mazzagatti, R. D.
assumption of von Neumann and Morgenstern’s
(1987). Large sample estimates of the reliability of
simple, residualized, and base-free gain scores. Journal
game theory is that the game player or decision
of Experimental Education, 55(2), 116118. maker has clear preferences and expectations.
Zimmerman, D. W., & Williams, R. H. (1982). Gain Each player is presumed rational in his or her
scores in research can be highly reliable. Journal of choice behavior, applying logical heuristics in
Educational Measurement, 19(2), 149154. weighing all choice options, thereby formulating
524 Game Theory

his or her game strategy in an attempt to optimize aware of all actions during the game; in other
the outcome by solving for the maximum. These words, if Player A moves a pawn along a chess
game strategies may or may not be effective in board, Player B can track that pawn throughout
solving for the maximum; however, the reasoning the game. There are no unknown moves. This is
in finding a solution must be sound. referred to as perfect information because there
is no uncertainty, and thus, games of perfect
information have few conceptual problems; by
Game Context
and large, they are considered technical pro-
Three important contextual qualities of a game blems. In contrast, games of imperfect informa-
involve whether the game is competitive or non- tion involve previously unknown game plans;
competitive, the number of players involved in the consequently, players are not privy to all previ-
game, and the degree to which all prior actions are ously employed competitive strategies. Games of
known. imperfect information require players to use
Games may be either among individuals, Bayesian interpretations of others’ actions.
wherein they are referred to as competitive
games, or between groups of individuals, typi-
The Role of Equilibrium
cally characterized as noncompetitive games.
The bulk of game theory focuses on competitive Equilibrium refers to a stable outcome of a game
games of conflict. The second contextual quality associated with two or more strategies and by
of a game involves player or group number. extension, two or more players. In an equilib-
Although there are situations in which a single rium state, player solutions are balanced, the
decision maker must choose an optimal solution resources demanded and the resources available
without reference to other game players (i.e., are equal, this means that one of the two parties
human against nature), generally, games are will not optimize. John Forbes Nash provided
between two or more players. Games of two a significant contribution to game theory by pro-
players or groups of players are referred to as posing a conceptual solution to analyze strategic
two-person or two-player (where ‘‘player’’ may interactions and consequently the strategic
reflect a single individual or a single group of options for each game player, what has come to
individuals) games; these kinds of games are be called Nash equilibrium. This equilibrium is
models of social exchange. In a single-person a static state, such that all players are solving for
model, the decision maker controls all variables optimization and none of the players benefit
in a given problem; the challenge in finding an from a unilateral strategy change. In other
optimal outcome (i.e., maximum) is in the num- words, in a two-person game, if player A
ber of variables and the nature of the function to changes strategy and player B does not, player A
be maximized. In contrast, in two-person, two- has departed from optimization; the same would
player or n-person, n-player games (where ‘‘n’’ is be true if player B changed strategy in the
the actual number of persons or groups greater absence of a strategy change by player A. Nash
than two), the challenge of optimization hinges equilibrium of a strategic game is considered sta-
on the fact that each participant is part of a social ble because all players are deadlocked, their
exchange, where each player’s outcome is inter- interests are evenly balanced and in the absence
dependent on the actions of all other players. of some external force, like a compromise, they
The variables in a social exchange economy are are unlikely to change their tactical plan. Nash
the weighted actions of all other game players. equilibrium among differing strategic games has
The third important quality of a game is the become a heavily published area of inquiry
degree to which players are aware of other within game theory.
players’ previous actions or moves within
a game. This awareness is referred to as informa-
Types of Games
tion, and there are two kinds of game informa-
tion: perfect and imperfect. Games of perfect Games of strategy are typically categorized as
information are those in which all players are zero-sum games (also known as constant-sum
Game Theory 525

games), nonzero-sum competitive games, and Table 1 Outcome Matrix for a Standard Game of
nonzero-sum cooperative games; within this lat- Chicken
ter category are also bargaining games and coali-
tional games. Driver B

Yield Maintain
Zero-Sum Games Driver A Yield 0, 0 1, þ1
A defining feature of zero-sum games is that Driver B wins
they are inherently win-lose games. Games of Maintain þ1, 1 0, 0
strategy are characterized as zero-sum or con- Driver A wins
stant-sum games if the additive gain of all
players is equal to zero. Two examples of zero-
sum games are a coin toss or a game of chicken.
NonZero-Sum Competitive Games
Coin tosses are strictly competitive zero-sum
games, where a player calls the coin while it is Nonzero-sum games of strategy are character-
still aloft. The probability of a head or a tail is ized as situations in which the additive gain of all
exactly 50:50. In a coin toss, there is an absence players is either more than or less than zero. Non-
of Nash equilibrium; given that there is no way zero-sum games may yield situations in which
to anticipate accurately what the opposing players are compelled by probability of failure to
player will choose, nor is it possible to predict depart from their preferred strategy in favor of
the outcome of the toss, there exists only one another strategy that does the least violence to
strategic option—choose. Consequently, the pay- their outcome preference. This kind of decisional
off matrix in a coin toss contains only two vari- strategy is referred to as a minimax approach, as
ables: win or lose. If Player A called the toss the player’s goal is to minimize his or her maxi-
inaccurately, then his or her net gain is 1, and mum loss. However, the player may also select
player B’s gain was +1. There is no draw. Player a decisional strategy that yields a small gain. In
A is the clear loser, and Player B is the clear win- this instance, he or she selects a strategic solution
ner. However, not all zero-sum outcomes are that maximizes the minimum gain; this is referred
mutually exclusive; draws can occur, for exam- to as a maximin. In two-person, nonzero-sum
ple, in a two-player vehicular game of chicken, games, if the maximin for one player and the mini-
where there are two car drivers racing toward max for another player are equal, then the two
each other. The goal for both drivers is to avoid players have reached Nash equilibrium. In truly
yielding to the other driver; the first to swerve competitive zero-sum games, there can be no Nash
away from the impending collision has lost the equilibrium. However, in nonzero-sum games,
game. In this game, there are four possible out- Nash equilibria are frequently achieved; in fact, an
comes: Driver A yields, Driver B yields, neither analogous nonzero-sum game of zero-sum
Driver A nor B yields, or both Driver A and chicken results in two Nash equilibria. Biologists
Driver B simultaneously swerve. However, and animal behaviorists generally refer to this
for Driver A, there are only two strategic game as Hawk-Dove in reference to the aggressive
options: Optimize his or her outcome ðþ1Þ or strategies employed by the two differing species of
optimize the outcome for Driver B (1); Player birds.
B has these same diametrically opposed options. Hawk-Dove games are played by a host of ani-
Note that optimizing for the maximum is defined mal taxa, including humans. Generally, the context
as winning. Yet surviving in the absence of a win for Hawk-Dove involves conspecific species com-
is losing, and dying results in a forfeited win, so peting for indivisible resources such as access to
there is no Nash equilibrium in this zero-sum mating partners, such that an animal must employ
game, either. Table 1 reflects the outcome an aggressive Hawk display and attack strategy or
matrix for each driver in a game of chicken, a noncommittal aggressive display reflective of
where the numeric values represent wins (þ) and a Dove strategy. The Hawk tactic is a show of
losses (). aggressive force in conjunction with a commitment
526 Game Theory

Table 2 Outcome Matrix for the Hawk-Dove Game Table 3 Outcome Matrix for the Prisoner’s Dilemma

Conspecific B Prisoner B
Hawk Dove Defect Cooperate
Conspecific A Hawk 1, 1 4, 2 Prisoner A Defect 5, 5 0, 10
Dove 2, 4 3, 3 Cooperate 10, 0 10, 10

to follow through on the aggressive display with each prisoner the same deal. The Prisoner’s
an attack. The Dove strategy employs a display of Dilemma prisoner has two options: cooperate
aggressive force without a commitment to follow (i.e., remain silent) or defect (i.e., confess). Each
up the show, thereby fleeing in response to a com- prisoner’s outcome is dependent not only on his
petitive challenge. The goal for any one animal is or her behavior but the actions of his or her
to employ a Hawk strategy while the competitor accomplice. If Prisoner A defects (confesses)
uses a Dove strategy. Two Hawk strategies result while Prisoner B cooperates (remains silent),
in combat, although in theory, the escalated Prisoner A is freed and turns state’s evidence,
aggression will result in disproportionate injury and Prisoner B receives a full 10-year prison sen-
because the animals will have unequivalent com- tence. In this scenario, the police have sufficient
bative skills; hence, this is not truly a zero-sum evidence to convict both prisoners on a lighter
game. Despite this, it is assumed that the value of sentence without their shared confessions, so if
the disputed resource is less than the cost of com- Prisoners A and B both fail to confess, they both
bat. Therefore, two Hawk strategies result in receive a 5-year sentence. However if both sus-
a minimax and two Dove strategies result in a maxi- pects confess, they each receive the full prison
min. In this situation, the pure strategy of Hawk- sentence of 10 years. Table 3 reflects the Prison-
Dove will be preferred for each player, thereby er’s Dilemma outcome matrix, where the
resulting in two Nash equilibria for each conspecific numeric values represent years in prison.
(Hawk, Dove and Dove, Hawk). Table 2 reflects This example of Prisoner’s Dilemma does con-
the outcome matrix for the Hawk-Dove strategies, tain a single Nash equilibrium (defect, defect),
where a 4 represents the greatest risk-reward payoff where both suspects optimize by betraying their
and a 1 reflects the lowest payoff. accomplice, providing the police with a clear
Although game theorists are not necessarily advantage in extracting confessions.
interested in the outcome of the game as much as
the strategy employed to solve for the maximum,
NonZero-Sum Cooperative Games
it should be noted that pure strategies (i.e., always
Dove or always Hawk) are not necessarily the Although game theory typically addresses situa-
most favorable approach to achieving optimiza- tions in which players have conflicting interests,
tion. In the case of Hawk-Dove, a mixed approach one way to maximize may be to modify one’s
(i.e., randomization of the different strategies) is strategy to compromise or cooperate to resolve the
the most evolutionarily stable strategy in the long conflict. Nonzero-sum games within this cate-
run. gory broadly include Tit-for-Tat, bargaining
Nonzero-sum games are frequently social games, and coalition games. In nonzero-sum
dilemmas wherein private or individual interests cooperative games, the emphasis is no longer on
are at odds with those of the collective. A classic individual optimization; the maximum includes
two-person, nonzero-sum social dilemma is Pris- the optimization interests of other players or
oner’s Dilemma. Prisoner’s Dilemma is a strategic groups of players. This equalizes the distribution
game in which the police concurrently interrogate of resources among two or more players. In coop-
two criminal suspects in separate rooms. In an erative games, the focus shifts from more individu-
attempt to collect more evidence supporting their alistic concept solutions to group solutions. In
case, the police strategically set each suspect in game theory parlance, players attempt to maxi-
opposition. The tactic is to independently offer mize their minimum loss, thereby selecting the
Game Theory 527

maximin and distributing the resources to one or If, after the second Tat, the first player has not
more other players. Generally, these kinds of coop- corrected his or her strategy back to one of
erative games occur between two or more groups cooperation, then the second player responds
that have a high probability of repeated interaction with a retaliatory counter.
and social exchange. Broadly, nonzero-sum Another type of nonzero-sum cooperative
cooperative games use the principle of reciprocity game falls within the class of games of negotiation.
to optimize the maximum for all players. If one Although these are still games of conflict and strat-
player uses the maximin, the counter response egy, as a point of disagreement exists, the game is
should follow the principle of reciprocity and the negotiation. Once the bargain is proffered and
respond in kind. An example of this kind of non- accepted, the game is over. The most simplistic of
zero-sum cooperative game is referred to as Tit- these games is the Ultimatum Game, wherein two
for-Tat. players discuss the division of a resource. The first
Anatol Rapoport submitted Tit-for-Tat as player proposes the apportionment, and the sec-
a solution to a computer challenge posed by Uni- ond player can either accept or reject the offer.
versity of Michigan political science professor The players have one turn at negotiation; conse-
Robert Axelrod in 1980. Axelrod solicited the quently, if the first player values the resource, he
most renowned game theorists of academia to or she must make an offer that is perceived, in the-
submit solutions for an iterated Prisoner’s ory, to be reasonable by the second player. If the
Dilemma, wherein the players (i.e., prisoners) offer is refused, there is no second trial of nego-
were able to retaliate in response to the previous tiations and neither player receives any of the
tactic of their opposing player (i.e., accomplice). resource. The proposal is actually an ultimatum
Rapoport’s Tit-for-Tat strategy succeeded in of ‘‘take it or leave it.’’ The Ultimatum Game is
demonstrating optimization. Tit-for-Tat is a pay- also a political game of power. The player who
back strategy typically between two players and proposes the resource division may offer an
founded on the principle of reciprocity. It begins unreasonable request, and if the second player
with an initial cooperative action by the first maintains less authority or control, the lack of
player; henceforth, all subsequent actions reflect any resource may be so unacceptable that both
the last move of the second player. Thus, if the players yield to the ultimatum. In contrast to the
second player responds cooperatively (Tat), then Ultimatum Game, bargaining games of alternat-
the first player responds in kind (Tit) ad infini- ing offers is a game of repeated negotiation and
tum. Tit-for-Tat was not necessarily a new con- perfect information where all previous negotia-
flict resolution approach when submitted by tions may be referenced and revised, and where
Rapoport, but one favored historically as a mili- the players enter a state of equilibrium through
taristic strategy using a different name, equiva- a series of trials consisting of offers and counter-
lent retaliation, which reflected a Tit-for-Tat offers until eventually an accord is reached and
approach. Each approach is highly vulnerable; it the game concludes.
is effective only as long as all players are infalli- A final example of a nonzero-sum cooperative
ble in their decisions. Tit-for-Tat fails as an opti- game is a game of coalitions. This is a cooperative
mum strategy in the event of an error. If a player game between individuals within groups. Coali-
makes a mistake and accidentally defects from tional games are games of consensus, potentially
the cooperative concept solution to a competitive through bargaining, wherein the player strategies
action, then conflict ensues and the nonzero- are conceived and enacted by the coalitions. Both
sum cooperative game becomes one of competi- the individual players and their group coalitions
tion. A variant of Tit-for-Tat, Tit-for-Two-Tats, have an interest in optimization. However, the
is more effective in optimization as it reflects existence of coalitions protects players from
a magnanimous approach in the event of an acci- defecting individuals; the coalition maintains the
dental escalation. In this strategy, if one player power to initiate a concept solution and thus all
errs through a competitive action, the second plays of strategy.
player responds with a cooperative counter,
thereby inviting remediation to the first player. Heide Deditius Island
528 Gauss–Markov Theorem

See also Decision Rule; Probability, Laws of Origins


In 1821, Carl Friedrich Gauss proved that the least
squares method produced unbiased estimates that
Further Readings
have the smallest variance, a result that has
Aumann, R. J., & Brandenburger, A. (1995). Epistemic become the cornerstone of regression analysis. In
conditions for Nash equilibrium. Econometrica, 63(5), his 1900 textbook on probability, Andrei Markov
11611180. essentially rediscovered Gauss’s theorem. By the
Axelrod, R. (1985). The evolution of cooperation. New 1930s, however, the result was commonly referred
York: Basic Books.
to as the Markov theorem rather than the Gauss
Barash, D. P. (2003). The survival game: How game
theory explains the biology of competition and theorem. Perhaps because of awareness of this mis-
cooperation. New York: Owl Books. attribution to Markov, many statisticians today
Davis, M. D. (1983). Game theory: A nontechnical avoid using the Gauss or Markov label altogether,
introduction (Rev. ed.). Mineola, NY: Dover. referring instead to the equivalence of ordinary
Fisher, L. (2008). Rock, paper, scissors: Game theory in least squares (OLS) estimator and best linear unbi-
everyday life. New York: Basic Books. ased (BLU) estimator. Most econometricians, how-
Osborne, M. J., & Rubinstein, A. (1994). A course in ever, refer to the result instead by the compromise
game theory. Cambridge: MIT Press. label used here, the Gauss–Markov theorem.
Savage, L. J. (1972). The foundations of statistics (2nd
ed.). Mineola, NY: Dover.
Von Neumann, J., & Morgenstern, O. (1944). Theory of Statistical Estimation and Regression
games and economic behavior. Princeton, NJ:
Princeton University Press. The goal of statistical estimation is to provide
accurate guesses about parameters (statistical sum-
maries) of a population from a subset or sample of
the population. In regression estimation, the main
GAUSS–MARKOV THEOREM population parameters of interest measure how
changes in one (independent) variable influence
The Gauss–Markov theorem specifies the condi- the value of another (dependent) variable. Because
tions under which the ordinary least squares statistical estimation involves estimating unknown
(OLS) estimator is also the best linear unbiased population parameters from known sample data,
(BLU) estimator. Because these BLU estimator there are actually two equations involved in
properties are guaranteed by the Gauss–Markov any simple regression estimation: a population
theorem under general conditions that are often regression function, which is unknown and being
encountered in practice, ordinary least squares estimated,
has become what George Stigler describes as the
‘‘automobile of modern statistical analysis.’’ Fur- y ¼ α þ βx þ μ
thermore, many of the most important advances
in regression analysis have been direct generali- and a sample regression function, which serves as
zations of ordinary least squares under the the estimator and is calculated from the available
Gauss–Markov theorem to even more general data in the sample,
conditions. For example, weighted least squares,
generalized least squares, finite distributed lag y¼α ^ þμ
^ þ βx ^
models, first-differenced estimators, and fixed-
effect panel models all extend the finite-sample where
results of the Gauss–Markov theorem to condi-
tions beyond the classical linear regression y ¼ dependent variable
model. After a brief discussion of the origins of x ¼ independent variable
the theorem, this entry further examines the
α ¼ y-intercept of population regression function
Gauss–Markov theorem in the context of statis-
tical estimation and regression analysis. ^ ¼ y-intercept of sample regression function
α
Gauss–Markov Theorem 529

β ¼ slope of population regression function ^ ¼β


Unbiasedness EðβÞ
^ ¼ slope of sample regression function
β ^ are calculated
Because the estimates α ^ and β
μ ¼ error, disturbance of population regression from a finite sample, they are likely to be different
function from the actual parameters α and β calculated
from the entire population. If the sample estimates
^ ¼ residual of sample regression function
μ
are the same as the population parameters on aver-
age when a large number of random samples of
Because statistical estimation involves the calcu-
the same size are taken, then the estimates are said
lation of population parameters from a finite sam-
to be unbiased.
ple of data, there is always some uncertainty about
how close statistical estimates are to actual popu-
lation parameters. To sort out the many possible
Best (Minimum Variance)
ways of estimating a population parameter from
a sample of data, various properties have been The estimates α ^ will also vary from sam-
^ and β
proposed for estimators. The ‘‘ideal estimator’’ ple to sample because each randomly drawn sam-
always takes the exact value of the population ple will contain different values of x and y: In
parameter it is estimating. This ideal is unachieva- practice, typically, there is only one sample, and
ble from a finite sample of the population, and repeated random samples are drawn hypothetically
estimation instead involves a trade-off between dif- to establish the sampling distribution of the sample
ferent forms of accuracy, such as unbiasedness and estimates. A BLU estimator β ^ is ‘‘best’’ in the sense
minimum variance. The best linear unbiased esti- that the sampling distribution of β ^ has a smaller
mator, which is discussed next, represents such variance than any other linear unbiased estimator
a trade-off. of β.
BLU estimation places a priority on unbiased-
ness, which must be satisfied before the minimum
Best Linear Unbiased Estimation
variance condition. However, if, instead of unbi-
Because the Gauss–Markov theorem uses the BLU asedness, we begin by requiring that the estimator
estimator as its standard of optimality, the quali- meet some nonzero maximum threshold of vari-
ties of this estimator are described first. ance, the estimator chosen may hinge in a critical
way on this arbitrary choice of threshold. The
BLU estimator is free of this kind of arbitrariness
Linear Parameters
because it makes unbiasedness the starting point,
A parameter β is defined as linear if it is a lin- and all estimators can be classified as either biased
ear function of y; that is, β ¼ ay þ b; where a or unbiased. But there is a trade-off in choosing
and b can be constants or functions of x; but not a BLU estimator, because variance may, in fact, be
of y: Requiring that the parameters α and β are much more important than unbiasedness in some
linear does not preclude nonlinear relationships cases, and yet unbiasedness is always given priority
between x and y; such as polynomial terms over lower variance.
(e.g., x2 Þ and interaction terms (e.g., x  zÞ on Maximum likelihood is an alternative estima-
the right-hand side of the regression equation tion criterion that does not suffer from the
and logarithmic terms on the left-hand side of limitation of subordinating lower variance to unbi-
the regression equation [e.g., ln(yÞ]. This flexibil- asedness. Instead, the maximum likelihood estima-
ity gives the Gauss–Markov theorem wide appli- tor (MLE) is chosen by the simple and intuitive
cability, but there are also many important principle that the set of parameters should be that
regression models that cannot be written in most likely to have generated the set of data in the
a form linear in the parameters. For example, sample. Some have suggested that this principle be
the main estimators for dichotomous dependent used as the main starting point for most statistical
variables, logit and probit, involve nonlinear estimation. However, the MLE has two distinct
parameters for which the Gauss–Markov theo- disadvantages to the BLU estimator. First, the cal-
rem cannot be applied. culation of the MLE requires that the distribution
530 Gauss–Markov Theorem

of the errors be completely specified, whereas the of a non-singular matrix Q satisfying the equa-
Gauss–Markov theorem does not require full tion Ωx ¼ xQ: Because such complex necessary
specification of the error distribution. Second, and sufficient conditions offer little intuition for
the maximum likelihood estimator offers only assessing when a given model satisfies them, in
asymptotic (large sample) properties, whereas applied work, looser sufficient conditions, which
the properties of BLU estimators hold in finite are easier to assess, are usually employed.
(small) samples. Because these practical conditions are typically
not also necessary conditions, it means they are
generally stricter than is theoretically required
Ordinary Least Squares Estimation
for OLS to be BLU. In other words, there may
The ordinary least squares estimator calculates β ^ be models that do not meet the sufficient condi-
by minimizing the sum of the squared residuals μ ^. tions for OLS to be BLU, but where OLS is
However, without further assumptions, one cannot nonetheless BLU.
know how accurately OLS estimates β. These Now, this entry turns to the sets of sufficient
further assumptions are provided by the Gauss– conditions that are most commonly employed
Markov theorem. for two different types of regression models:
The OLS estimator has several attractive qual- (a) models where x is fixed in repeated sampling,
ities. First, the Gauss–Markov theorem ensures which is appropriate for experimental research,
that it is the BLU estimator given that certain and (b) models where x is allowed to vary from
conditions hold, and these properties hold even sample to sample, which is more appropriate for
in small sample sizes. The OLS estimator is easy observational (nonexperimental) data.
to calculate and is guaranteed to exist if the
Gauss–Markov asssumptions hold. The OLS
regression line can also be intuitively understood
as the expected value of y for a given value of x: Gauss–Markov Conditions for
However, because OLS is calculated using Experimental Research (Fixed x)
squared residuals, it is also especially sensitive to
outliers, which exert a disproportionate influ- In experimental studies, the researcher has con-
ence on the estimates. trol over the treatment administered to subjects.
This means that in repeated experiments with the
same size sample, the researcher would be able to
The Gauss–Markov Theorem ensure that the subjects in the treatment group get
The Gauss–Markov theorem specifies conditions the same level of treatment. Because this level of
under which ordinary least squares estimators treatment is essentially the value of the indepen-
are also best linear unbiased estimators. Because dent variable x in a regression model, this is equiv-
these conditions can be specified in many ways, alent to saying that the researcher is able to hold x
there are actually many different Gauss–Markov fixed in repeated samples. This provides a much
theorems. First, there is the theoretical ideal of simpler data structure in experiments than is possi-
necessary and sufficient conditions. These neces- ble in observational data, where the researcher
sary and sufficient conditions are usually devel- does not have complete control over the value of
oped by mathematical statisticians and often the independent variable x: The following condi-
specify conditions that are not intuitive or prac- tions are sufficient to ensure that the OLS estima-
tical to apply in practice. For example, the most tor is BLU when x is fixed in repeated samples:
widely cited necessary and sufficient conditions
1. Model correctly specified
for the Gauss–Markov theorem, which Simo
Puntanen and George Styan refer to as ‘‘Zys- 2. Regressors not perfectly collinear
kind’s condition,’’ states in matrix notation that 3. E(μ) ¼ 0
a necessary and sufficient condition for OLS to
4. Homoscedasticity
be BLU with fixed x and nonsingular variance-
covariance (dispersion) matrix Ω is the existence 5. No serial correlation
Gauss–Markov Theorem 531

Model Correctly Specified function, but it will not typically be problematic


whenever the model is correctly specified and
In practice, there are two main questions here:
a constant (y-intercept) term is included in the
What variables should be in the model? What is
model specification.
the correct functional form of those variables?
Omitting an important variable that is correlated
Homoscedastic Errors, Eðμ2 Þ ¼ σ 2
with both the dependent variable and one or more
independent variables necessarily produces biased The variance of the errors must be constant
estimates. On the other hand, including irrelevant across observations. Consider a regression of the
variables, which may be correlated with the depen- cost of automobiles purchased by consumers on
dent variables or independent variables but not the consumers’ income. In general, one would
with both, does not bias OLS estimates. However, expect a positive relationship so that people with
adding irrelevant variables is not without cost. It higher incomes purchase more expensive automo-
reduces the number of observations available for biles on average. But one might also find that there
calculating the impact of each independent vari- is much more variability in the purchase price of
able, and there have to be fewer variables than cars for higher income consumers than for lower
observations for least-squares estimates to exist. income consumers, simply because lower income
Specifying the incorrect functional form of an consumers can afford a smaller range of vehicles.
independent variable will also lead to biased Thus, we expect heteroscedastic (nonconstant)
regression estimates. For example, if a researcher errors in this case, and we cannot apply the
leaves out a significant squared term of x; then the Gauss–Markov theorem to OLS.
sample regression function will impose a linear If the other Gauss–Markov assumptions apply,
relationship on what is actually a nonlinear rela- OLS still generates unbiased estimates of regres-
tionship between x and y: sion coefficients in the presence of heteroscedastic
errors, but they are no longer the linear unbiased
No Perfect Multicollinearity estimators with the minimum variance. Smaller
variance estimates can be calculated by weighting
Perfect multicollinearity occurs when two or observations according to the heteroscedastic
more variables are simple linear functions of each errors, using an estimator called weighted least
other. Multiple regression calculates the effect of squares (WLS). When the researcher knows the
one variable while holding the other variables con- exact form of the heteroscedasticity and the other
stant. But if perfect multicollinearity exists, then it Gauss–Markov conditions hold, the WLS estima-
is impossible to hold one variable constant with tor is BLU. The actual nature of the heteroscedasti-
respect to the other variables with which it is per- city in the errors of the population regression
fectly correlated. Whenever one variable changes, function are usually unknown, but such weights
the other variable changes by an exact linear func- can be estimated using the residuals from the sam-
tion of the change in the first variable. Perfect mul- ple regression function in a procedure known as
ticollinearity does not typically cause problems in feasible generalized least squares (FGLS). How-
practice, because variables are rarely perfectly cor- ever, because FGLS requires an extra estimation
related unless they are simply different measures of step, estimating μ2 by μ ^ 2, it no longer obtains
the same construct. The problem of high but not BLU estimates, but estimates only with asymptotic
perfect collinearity is more likely to be problematic (large sample) properties.
in practice, although it does not affect the applica-
bility of the Gauss–Markov theorem. No Serial Correlation, Eðμi μj Þ ¼ 0
The errors of different observations cannot be
EðμÞ ¼ 0
correlated with each other. Like the homoscedas-
The expected value of the errors must be zero ticity assumption, the presence of serial correlation
^ to be an unbiased estimate of β.
in order for β does not bias least-squares estimates but it does
There is no way to directly test whether this affect their efficiency. Serial correlation can be
assumption holds in the population regression a more complex problem to treat than
532 Gauss–Markov Theorem

heteroscedasticity because it can take many differ- Gauss–Markov theorem to observational data,
ent forms. For example, in its simplest form, the perhaps to reiterate the potential specification
error of one observation is correlated only with problem of omitted confounding variables when
the error in the next observation. For such pro- there is not random assignment to treatment and
cesses, Aitken’s generalized least squares can be control groups.
used to achieve BLU estimates if the other Gauss–
Markov assumptions hold. If, however, errors are Homoscedastic Errors, Eðμ2 jxÞ ¼ σ 2
associated with the errors of more than one other
The restriction on heteroscedasticity of the
observation at a time, then more sophisticated
errors is also strengthened in the case where x is
time-series models are more appropriate, and in
not fixed. In the fixed-x case, the error of each
these cases, the Gauss–Markov theorem cannot be
observation was required to have the same vari-
applied. Fortunately, if the sample is drawn ran-
ance. In the arbitrary-x case, the errors are also
domly, then the errors automatically will be uncor-
required to have the same variance across all possi-
related with each other, so that there is no need to
ble values of x: This is tantamount to requiring
worry about serial correlation.
that the variance of the errors not be a (linear or
nonlinear) function of x: Again, as in the fixed-x
Gauss–Markov Assumptions for case, violations of this assumption will still yield
Observational Research (Arbitrary x) unbiased estimates of regression coefficients as
long as the first three assumptions hold. But such
A parallel but stricter set of Gauss–Markov heteroscedasticity will yield inefficient estimates
assumptions is typically applied in practice in the unless the heteroscedasticity is addressed in the
case of observational data, where the researcher way discussed in the fixed-x section.
cannot assume that x is fixed in repeated samples.
No Serial Correlation, Eðμi μj |xÞ ¼ 0
1. Model correctly specified
Finally, the restriction on serial correlation in
2. Regressors not perfectly collinear
the errors is strengthened to prohibit serial correla-
3. E(μ|xÞ ¼ 0 tion that may be a (linear or nonlinear) function of
4. Homoscedastic errors, E(μ2 |xÞ ¼ σ 2 x: Violations of this assumption will still yield
unbiased least-squares estimates, but these esti-
5. No serial correlation, E(μi μj |xÞ ¼ 0 mates will not have the minimum variance among
all unbiased linear estimators. In particular, it is
The first two assumptions are exactly the same as possible to reduce the variance by taking into
in the fixed-x case. The other three sufficient con- account the serial correlation in the weighting of
ditions are augmented so that they hold condi- observations in least squares. Again, if the sample
tional on the value of x: is randomly drawn, then the errors will automati-
cally be uncorrelated with each other. In time-
E(μjxÞ ¼ 0
series data in particular, it is usually inappropriate
In contrast to the fixed-x case, E(μ|xÞ ¼ 0 is to assume that the data are drawn as a random
a very strong assumption that means that in addi- sample, so special care must be taken to ensure
tion to having zero expectation, the errors are not that E(μi μj |xÞ ¼ 0 before employing least squares.
associated with any linear or nonlinear function of In most cases, it will be inapproriate to use least
^ is not only unbiased, but also
x: In this case, β squares for time-series data. However, F. W. McEl-
unbiased conditional on the value of x; a stronger roy has provided a useful set of necessary and suf-
form of unbiasedness than is strictly needed for ficient conditions for the Gauss–Markov theorem.
BLU estimation. Indeed, there is some controversy For models with a y-intercept and a very simple
about whether this assumption must be so much form of serial correlation known as exchangeabil-
stronger than in the fixed-x case, given that ity, the OLS estimator is still the best linear unbi-
the model is correctly specified. Nevertheless, ased estimator. A useful necessary and sufficient
E(μ|x) ¼ 0 is typically used in applying the condition for the Gauss–Markov theorem in the
Generalizability Theory 533

case of arbitrary x is provided by McElroy. If the Advantages


model includes an intercept term, McElroy shows
that OLS will still be BLU in the presence of There are a few approaches to the investigation of
a weak form of serial correlation where all of the test reliability, that is, the consistency of measure-
errors are equally correlated with each other. ment obtained in testing. For example, for norm-
referenced testing (NRT), CTT reliability indexes
Roger Larocca show the extent to which candidates are rank-
ordered consistently across test tasks, test forms,
See also Biased Estimator; Estimation; Experimental occasions, and so on (e.g., Cronbach’s alpha and
Design; Homoscedasticity; Least Squares, Methods of; parallel-form and test-retest reliability estimates).
Observational Research; Regression Coefficient; In contrast, in criterion-referenced testing (CRT),
Regression to the Mean; Residuals; Serial Correlation; various statistics are used to examine the extent
Standard Error of Estimate; Unbiased Estimator to which candidates are consistently classified
into different categories (score or ability levels)
across test forms, occasions, test tasks, and so on.
Further Readings Threshold-loss agreement indexes such as the
agreement coefficient and the kappa coefficient are
Berry, W. D. (1993). Understanding regression some examples.
assumptions. Newbury Park, CA: Sage.
Why might one turn to G theory despite the
McElroy, F. W. (1967). A necessary and sufficient
condition that ordinary least-squares estimators be
availability of these different approaches to reli-
best linear unbiased estimators. Journal of the ability investigation? G theory is a broadly defined
American Statistical Association, 62, 13021304. analytic framework that addresses some limita-
Plackett, R. L. (1949). A historical note on the method of tions of the traditional approaches. First, the
least squares. Biometrika, 36; 458460. approaches above address only NRT or CRT,
Puntanen, S., & Styan, G. P. H. (1989). The equality of whereas G theory accommodates both (called rela-
the ordinary least squares estimator and the best tive decisions and absolute decisions, respectively),
linear unbiased estimator. American Statistician, 43; yielding measurement error and reliability esti-
153161.
mates tailored to the specific type of decision mak-
Stigler, S. M. (1980). Gauss and the invention of least
ing under consideration. Second, CTT reliability
squares. Annals of Statistics, 9; 465474.
estimates take account of only one source of mea-
surement error at a time. Thus, for example, when
one is concerned about the consistency of exam-
inee rank-ordering across two testing occasions
GENERALIZABILITY THEORY and across different raters, he or she needs to cal-
culate two separate CTT reliability indexes (i.e.,
Generalizability theory (G theory), originally devel- test-retest and interrater reliability estimates). In
oped by Lee J. Cronbach and his associates, is contrast, G theory provides reliability estimates
a measurement theory that provides both a concep- accounting for both sources of error simulta-
tual framework and a set of statistical procedures neously. The G theory capability to analyze multi-
for a comprehensive analysis of test reliability. ple sources of error within a single analysis is
Building on and extending classical test theory particularly useful for optimizing the measurement
(CTT) and analysis of variance (ANOVA), G design to achieve an acceptable level of measure-
theory provides a flexible approach to modeling ment reliability.
measurement error for different measurement con-
ditions and types of decisions made based on test
Key Concepts and Terms
results. This entry introduces the reader to the
basics of G theory, starting with the advantages of A fundamental concept in G theory is dependabil-
G theory, followed by key concepts and terms ity. Dependability is defined as the extent to which
and some illustrative examples representing differ- the generalization one makes about a given candi-
ent G-theory analysis designs. date’s universe score based on an observed test
534 Generalizability Theory

score is accurate. The universe score is a G-theory a generalizability study (G study), where the
analogue of the true score in CTT and is defined as observed score variance is decomposed into pieces
the average score a candidate would have obtained attributable to different sources of score variability
across an infinite number of testing under called variance components associated with
measurement conditions that the investigator is a facet(s) identified by the investigator.
willing to accept as exchangeable with one another As shown in detail in the numerical example
(called randomly parallel measures). Suppose, for below, G-study variance component estimates are
example, that an investigator has a large number typically obtained by fitting a random-effects
of vocabulary test items. The investigator might ANOVA model to data. The primary purpose of
feel comfortable treating these items as randomly this analysis is to obtain mean squares for different
parallel measures because trained item writers effects that are needed for the calculation of vari-
have carefully developed these items to target a spe- ance component estimates. Variance component
cific content domain, following test specifications. estimates are key building blocks of G theory. A
The employment of randomly parallel measures is G-study variance component estimate indicates the
a key assumption of G theory. Note the difference magnitude of the effect of a given source of vari-
of this assumption from the CTT assumption, ability on the observed score variance for a hypo-
where sets of scores that are involved in a reliability thetical measurement design where only a single
calculation must be statistically parallel measures observation is used for testing (e.g., a test consist-
(i.e., two sets of scores must share the same mean, ing of one item).
the same standard deviation, and the same correla- The G-study variance component estimates are
tion to a third measure). then used as the baseline data in the second step of
Observed test scores can vary for a number of the analysis called a decision study (D study). In
reasons. One reason may be the true differences a D study, variance components and measurement
across candidates in terms of the ability of interest reliability can be estimated for a variety of hypo-
(called the object of measurement). Other reasons thetical measurement designs (for instance, a test
may be the effects of different sources of measure- consisting of multiple items) and types of score
ment error: Some are systematic (e.g., item diffi- interpretations of interest.
culty), whereas others are unsystematic (e.g.,
fatigue). In estimating a candidate’s universe score,
A Numerical Example: One-Facet
one cannot test the person an infinite number of
times in reality. Therefore, one always has to esti- Crossed Study Design
mate a candidate’s universe score based on a lim- Suppose that an investigator wants to analyze
ited number of measurements available. In G results of a grammar test consisting of 40 items
theory, a systematic source of variability that may administered to 60 students in a French language
affect the accuracy of the generalization one makes course. Because these items have been randomly
is called a facet. There are two types of facets. A selected from a large pool of items, the investigator
facet is random if the intention is to generalize defines items as a random facet. In this test, all
beyond the conditions actually used in an assess- candidates (persons) complete all items. In G-the-
ment. In this case, measurement conditions are ory terms, this study design is called a one-facet
conceptualized as a representative sample of study design because it involves only one facet
a much larger population of admissible observa- (items). Moreover, persons and items are called
tions (called the universe in G theory). Alterna- crossed because for each person, scores for all
tively, a facet is fixed when there is no intention to items are available (denoted p × i; where the ‘‘ × ’’
generalize beyond the conditions actually used in is read ‘‘crossed with’’).
the assessment, because either the set of measure- For this one-facet crossed study design, the
ment conditions exhausts all admissible observa- observed score variance is decomposed into three
tions in the universe or the investigator has chosen variance components:
specific conditions on purpose.
Different sources of measurement error are ana- 1. Person variance component [σ 2 ðpÞ]: The
lyzed in a two-step procedure. The first step is observed score variance due to the true
Generalizability Theory 535

Person variance Item variance Table 1 ANOVA Table for a One-Facet Crossed
component component Study Design
Degrees of Sum of Mean
Source Freedom (df) Squares (SS) Squares (MS)
Persons (p) 59 96.943 1.643
Items (i) 39 173.429 4.457
pi,e 2,301 416.251 0.181
σ 2(p) σ 2(pi,e) σ 2(i)
Total 2,399 686.623

EMSðpÞ ¼ σ 2 ðpi;eÞ þ ni σ 2 ðpÞ ð1Þ

Residual variance EMSðiÞ ¼ σ 2 ðpi;eÞ þ np σ 2 ðiÞ ð2Þ


component

EMSðpi;eÞ ¼ σ 2 ðpi;eÞ ð3Þ


Figure 1 Decomposition of the Observed Score
Variance Into Variance Components in Here, n denotes the G-study sample size. Theoreti-
a One-Facet Crossed Study Design cally, an EMS is defined as the average mean
square value across repeated analyses of samples
differences among candidates on the target from the same population of examinees and the
ability
same universes of admissible observations for the
2. Item variance component [σ 2 ðiÞ]: The observed same study design. Because an EMS is usually
score variance due to the differences across unknown, the EMSs in these equations are
items in terms of difficulty replaced with the observed mean square values for
3. Residual variance component [σ 2 (pi,e)]: The persons (pÞ, items (iÞ, and residuals (pi,e) obtained
observed score variance due to a combination of from an ANOVA of sample data. Thus, it should
two confounded sources of error: (a) the be remembered that the obtained variance compo-
observed score variance due to an interaction nents are estimates obtained from a sample. As an
between persons and items, that is, the extent to example, Table 1 shows what would be obtained
which the rank-ordering of persons differs from from a random-effects ANOVA of the data for this
one item to another, and (b) the score variance example based on the 60 students’ responses to the
due to undifferentiated error, consisting of other 40 grammar items.
systematic sources of variance (facets) not taken
The variance component estimates are obtained
account of in this study design and random
sources of error (e.g., fatigue).
by solving the equations for σ 2 ðpÞ, σ 2 ðiÞ, and
σ 2 (pi,e), respectively:
Figure 1 schematically represents the decompo- 1:643 ¼ 0:181 þ 40 × σ 2 ðpÞ σ 2 ðpÞ ¼ 0:037
sition of the observed score variance into the per-
son (pÞ, item (iÞ, and residual (pi,e) variance 4:447 ¼ 0:181 þ 60 × σ 2 ðiÞ σ 2 ðiÞ ¼ 0:071
components. Note that the person variance com- 0:181 ¼ σ 2 ðpi;eÞ σ 2 ðpi;eÞ ¼ 0:181
ponent is a source of variability due to the object
of measurement. The item and residual variance Then, the obtained G-study variance component
components (dotted areas in the figure) reflect estimates can be analyzed further by preparing
sources of variability associated with the facet of a table like Table 2.
measurement. The left panel of Table 2 provides the magni-
Calculating variance components requires a set tudes of the G-study variance component estimates
of formulas called expected mean square (EMS) for persons, items, and residuals, along with the
equations for a specific study design. Below are the percentage of the observed score variance
equations for the one-facet crossed study design: explained by each source of score variability for
536 Generalizability Theory

Table 2 Variance Component Estimates for a One-Facet Crossed Study Design

G-Study Variance Components D-Study Variance Components


(for a single observation) (for 50 items)
Estimated Variance Percentage of Estimated Variance
Source Components Total Variance Components
Persons (p) 0.037 12.8 0.037
0
Items (i) 0.071 24.6 σ 2(i)/ni ¼ 0.071/50 ¼ 0.001
0
pi,e 0.181 62.6 σ 2(pi,e)/ni ¼ 0.181/50 ¼ 0.004
Total 0.289 100.0

a single observation. As can be seen in the table, except that for the object of measurement [σ 2 ðpÞ]
the person, item, and residual variance compo- contribute to the absolute error variance [σ 2 (Abs)].
nents account for 12.8%, 24.6%, and 62.6% of In this example, both σ 2 ðiÞ and σ 2 (pi,e) will
the total score variance, respectively. contribute to the absolute error variance. Thus,
Based on the G-study results above, a D study σ 2 ðAbsÞ ¼ σ 2 ðiÞ þ σ 2 ðpi; eÞ ¼ 0:001 þ 0:004 ¼
can be conducted to estimate score reliability for 0:005: Second, a G-coefficient or a phi-coefficient
an alternative measurement design. As in CTT, is obtained by dividing the variance component
where the Spearman-Brown prophecy formula is due to the object of measurement [σ 2 ðpÞ], which is
used to estimate test reliability for different test also called the universe-score variance, by the sum
lengths, one can estimate the measurement reliabil- of itself and the appropriate type of error variance.
ity for a test involving different numbers of items. Thus, for this one-facet crossed study example, the
As an example, the right panel of Table 2 shows G- and phi-coefficients are calculated as follows:
the D-study results for 50 items. First, D-study var-
iance component estimates for this measurement Eρ2 ¼ σ 2 ðpÞ=½σ 2 ðpÞ þ σ 2 ðRelÞ
design are obtained by dividing the G-study vari-
¼ 0:037=ð0:037 þ 0:004Þ ¼ 0:902
ance component estimates associated with the
facet of measurement [i.e., σ 2 ðiÞ and σ 2 (pi,e) in φ ¼ σ 2 ðpÞ=½σ 2 ðpÞ þ σ 2 ðAbsÞ
this case] by the D-study sample size for the item ¼ 0:037=ð0:037 þ 0:005Þ ¼ 0:881
facet (ni0 = 50).
Second, a summary index of reliability, similar
G theory is conceptually related to CTT.
to what one might obtain in a CTT analysis, can
Under certain conditions, CTT and G theory
be calculated for the 50-item scenario. G theory
analysis results yield identical results. This is the
provides two types of reliability-like indexes for
case when a one-facet crossed study design is
different score interpretations: a generalizability
employed for relative decisions. Thus, for exam-
coefficient (denoted Eρ2 Þ for relative decisions,
ple, the G-coefficient obtained from the one-
and an index of dependability (denoted φ; often
facet D study with 50 items above is identical to
called phi coefficient) for absolute decisions. These
Cronbach’s alpha for the same number of items.
coefficients are obtained in two steps. First, the
error variance appropriate for the type of decision
is calculated. For relative decisions, all variance
Other Study Designs
components involving persons, except the object
of measurement [σ 2 ðpÞ], contribute to the relative The numerical example above is one of the
error variance [σ 2 (Rel)]. Thus, for this one-facet simplest designs that can be implemented in a
crossed study example, only the residual variance G-theory data analysis. Below are some examples
component [σ 2 (pi,e)] contributes to the relative of crossed study designs involving multiple ran-
error variance; hence, σ 2 (Rel) ¼ σ 2 (pi,e) ¼ 0.004. dom facets as well as other study designs involving
For absolute decisions, all variance components a nested facet or a fixed facet.
Generalizability Theory 537

A Crossed Study Design With Two Random Facets study example, that each candidate response is
evaluated on three dimensions: pronunciation,
As mentioned above, one can take advantage of
grammar, and fluency. These dimensions can be
the strength of G theory when multiple sources of
best conceptualized as the levels in a fixed facet
error are modeled simultaneously. Suppose, for
because they have been selected as the scoring cri-
example, in a speaking test each student completes
teria on purpose.
three items, and two raters score each student’s
There are some alternatives to model such fixed
responses to all three items. In this case, the inves-
facets in G theory. Whichever approach is
tigator may identify two facets: items and raters.
employed, the decision for selecting an approach
Persons, items, and raters are crossed with one
must be made based on careful considerations of
another because (a) all students complete all items,
various substantive issues. One approach is to con-
(b) all students are rated by both raters, and
duct a two-facet crossed study (p × r × iÞ for each
(c) both raters score student responses to all items.
dimension separately. This approach is preferred if
This study design is called a two-facet crossed
the investigator believes that the three dimensions
study design (p × r × iÞ.
are conceptually so different that study results can-
not be interpreted meaningfully at the aggregated
A Study Design Involving a Nested Facet level, or if the variance component estimates vary
Some G theory analyses may be conducted for widely across the dimensions. Alternatively, one
study designs involving a nested facet. Facet A is can analyze all dimensions simultaneously by con-
nested within Facet B if different, multiple levels of ducting a three-facet crossed study (p × r × i × d;
Facet A are associated with each level of Facet B where dimensions, or d; are treated as a fixed
(Shavelson & Webb). Typically, a nested facet is facet). This approach is reasonable if variance
found in two types of situations. The first is when component estimates averaged across the dimen-
one facet is nested within another facet by defini- sions can offer meaningful information for a partic-
tion. A common example is a reading test consist- ular assessment context, or if the variance
ing of groups comprehension items based on component estimates obtained from separate anal-
different passages. In this case, the item facet is yses of the dimensions in the p × r × i study runs
nested within the reading passage facet because are similar across the dimensions.
a specific group of items is associated only with Another possible approach is to use multivariate
a particular passage. The second is a situation G theory. Although multivariate G theory is
where one chooses to use a nested study design, beyond the scope of this introduction to G theory,
although employing a crossed study design is pos- Robert Brennan’s 2001 volume in the Further
sible. For instance, collecting data for the two- Readings list provide an extensive discussion of
facet crossed study design example above can be this topic.
resource intensive because all raters have to score
all student responses. In this case, a decision might
be made to have different rater pairs score differ-
Computer Programs
ent items to shorten the scoring time. This results
in a two-facet study design where raters are nested Computer programs specifically designed for G
within items [denoted p × ðr : iÞ, where the ‘‘:’’ is theory analyses offer comprehensive output for
read ‘‘nested within’’]. both G and D studies for a variety of study
designs. Brennan’s GENOVA Suite offers three
Study Designs Involving a Fixed Facet programs: GENOVA, urGENOVA, and mGE-
NOVA. GENOVA and urGENOVA handle differ-
Because G theory is essentially a measurement ent study designs for univariate G-theory analyses,
theory for modeling random effects, at least one on which this entry has focused, whereas mGE-
facet identified in a study design must be a random NOVA is designed for multivariate G-theory
effect. Multifacet study designs may involve one or analyses.
more fixed facets, however. Suppose, in the speak-
ing test described earlier in the two-facet crossed Yasuyo Sawaki
538 General Linear Model

See also Analysis of Variance (ANOVA); Classical Test observations are stored in an I by 1 vector denoted
Theory; Coefficient Alpha; Interrater Reliability; y. The values of the independent variables describ-
Random Effects Models; Reliability ing the I observations are stored in an I by K
matrix denoted X: K is smaller than I; and X is
assumed to have rank K (i.e., X is full rank on its
Further Readings
columns). A quantitative independent variable can
Brennan, R. L. (1992). Elements of generalizability be directly stored in X; but a qualitative indepen-
theory. Iowa City, IA: ACT. dent variable needs to be recoded with as many
Brennan, R. L. (2001). Generalizability theory. New columns as there are degrees of freedom for this
York: Springer-Verlag. variable. Common coding schemes include dummy
Cronbach, L. J., Gleser, G. C., Nanda, H., & coding, effect coding, and contrast coding.
Rajaratnam, N. (1972). The dependability of
behavioral measurements: Theory of generalizability
for scores and profiles. New York: Wiley. Core Equation
Shavelson, R. J., & Webb, N. M. (1991). Generalizability
theory: A primer. Newbury Park, CA: Sage. For the GLM, the values of the dependent vari-
Webb, N. M., & Shavelson, R. J. (1981). Multivariate able are obtained as a linear combination of the
generalizability of general educational development values of the independent variables. The vectors
ratings. Journal of Educational Measurement, 18(1), for the coefficients of the linear combination are
1322. stored in a K by 1 vector denoted b. In general, the
Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983).
values of y cannot be perfectly obtained by a linear
Multivariate generalizability theory. In L. J. Fyans, Jr.
combination of the columns of X, and the differ-
(Ed.), Generalizability theory: Inferences and practical
applications (pp. 6781): San Francisco: Jossey-Bass. ence between the actual and the predicted values is
called the prediction error. The values of the error
are stored in an I by 1 vector denoted e. Formally,
the GLM is stated as
GENERAL LINEAR MODEL y ¼ Xb þ e: ð1Þ

The general linear model (GLM) provides a general The predicted values are stored in an I by 1 vector
framework for a large set of models whose com- denoted ^y, and therefore, Equation 1 can be
mon goal is to explain or predict a quantitative rewritten as
dependent variable by a set of independent vari-
y ¼ ^y þ e with ^y ¼ Xb: ð2Þ
ables that can be categorical or quantitative. The
GLM encompasses techniques such as Student’s t Putting together Equations 1 and 2 shows that
test, simple and multiple linear regression, analysis
of variance, and covariance analysis. The GLM is e ¼ y  ^y: ð3Þ
adequate only for fixed-effect models. In order to
take into account random-effect models, the GLM Additional Assumptions
needs to be extended and becomes the mixed- The independent variables are assumed to be
effect model. fixed variables (i.e., their values will not change
for a replication of the experiment analyzed by the
GLM, and they are measured without error). The
Notations
error is interpreted as a random variable; in addi-
Vectors are denoted with boldface lower-case let- tion, the I components of the error are assumed to
ters (e.g., y), and matrices are denoted with bold- be independently and identically distributed
face upper-case letters (e.g., XÞ. The transpose of (i.i.d.), and their distribution is assumed to be
a matrix is denoted by the superscript T , and the a normal distribution with a zero mean and a vari-
inverse of a matrix is denoted by the super- ance denoted σ 2e . The values of the dependent vari-
script 1 . There are I observations. The values of able are assumed to be a random sample of
a quantitative dependent variable describing the I a population of interest. Within this framework,
General Linear Model 539

the vector b is seen as an estimation of the population parameter vector β.

Least Square Estimate

Under the assumptions of the GLM, the population parameter vector β is estimated by b, which is computed as

b = (X^T X)^{-1} X^T y.   (4)

This value of b minimizes the residual sum of squares (i.e., b is such that e^T e is minimum).

Sums of Squares

The total sum of squares of y is denoted SS_total, and it is computed as

SS_total = y^T y.   (5)

Using Equation 2, the total sum of squares can be rewritten as

SS_total = y^T y = (ŷ + e)^T (ŷ + e) = ŷ^T ŷ + e^T e + 2 ŷ^T e,   (6)

but it can be shown that 2 ŷ^T e = 0, and therefore, Equation 6 becomes

SS_total = y^T y = ŷ^T ŷ + e^T e.   (7)

The first term of Equation 7 is called the model sum of squares and is denoted SS_model. It is equal to

SS_model = ŷ^T ŷ = b^T X^T X b.   (8)

The second term of Equation 7 is called the residual or the error sum of squares and is denoted SS_residual. It is equal to

SS_residual = e^T e = (y − Xb)^T (y − Xb).   (9)

Sampling Distributions of the Sums of Squares

Under the assumptions of normality and i.i.d. for the error, the ratio of the residual sum of squares to the error variance, SS_residual/σ_e², is distributed as a χ² with a number of degrees of freedom of ν = I − K − 1. This is abbreviated as

SS_residual/σ_e² ∼ χ²(ν).   (10)

By contrast, the ratio of the model sum of squares to the error variance, SS_model/σ_e², is distributed as a noncentral χ² with ν = K degrees of freedom and noncentrality parameter

λ = (2/σ_e²) β^T X^T X β.

This is abbreviated as

SS_model/σ_e² ∼ χ²(ν, λ).   (11)

From Equations 10 and 11, it follows that the ratio

F = (SS_model/σ_e²)/(SS_residual/σ_e²) × (I − K − 1)/K = (SS_model/SS_residual) × (I − K − 1)/K   (12)

is distributed as a noncentral Fisher's F with ν1 = K and ν2 = I − K − 1 degrees of freedom and noncentrality parameter equal to

λ = (2/σ_e²) β^T X^T X β.

In the specific case when the null hypothesis of interest states that H0: β = 0, the noncentrality parameter vanishes and then the F ratio from Equation 12 follows a standard (i.e., central) Fisher's distribution with ν1 = K and ν2 = I − K − 1 degrees of freedom.
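A minimal numerical sketch (not part of the original entry) of Equations 4 through 12, assuming NumPy and SciPy are available. The data are simulated, and y and the K predictors are mean-centered so that the intercept is handled implicitly and the residual degrees of freedom are I − K − 1, as in the text.

```python
# Sketch: least squares estimate, sums of squares, and omnibus F test for the GLM.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
I, K = 30, 2                                   # I observations, K predictors
X_raw = rng.normal(size=(I, K))
y_raw = 4.0 + 0.8 * X_raw[:, 0] - 0.5 * X_raw[:, 1] + rng.normal(scale=1.0, size=I)

X = X_raw - X_raw.mean(axis=0)                 # centered design matrix
y = y_raw - y_raw.mean()                       # centered dependent variable

b = np.linalg.solve(X.T @ X, X.T @ y)          # Equation 4: (X'X)^{-1} X'y
y_hat = X @ b                                  # Equation 2: predicted values
e = y - y_hat                                  # Equation 3: prediction error

ss_total = y @ y                               # Equation 5
ss_model = y_hat @ y_hat                       # Equation 8
ss_residual = e @ e                            # Equation 9
assert np.isclose(ss_total, ss_model + ss_residual)   # Equation 7

nu1, nu2 = K, I - K - 1
F = (ss_model / ss_residual) * (nu2 / nu1)     # Equation 12
p = stats.f.sf(F, nu1, nu2)                    # central F under H0: beta = 0
print(f"b = {b}, F({nu1},{nu2}) = {F:.2f}, p = {p:.4g}")
```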
Test on Subsets of the Parameters

Often, one is interested in testing only a subset of the parameters. When this is the case, the I by K matrix X can be interpreted as composed of two blocks: an I by K1 matrix X1 and an I by K2 matrix X2 with K = K1 + K2. This is expressed as

X = [X1  X2].   (13)

Vector b is partitioned in a similar manner as

b = [b1; b2].   (14)

In this case, the model corresponding to Equation 1 is expressed as

y = Xb + e = [X1  X2][b1; b2] + e = X1 b1 + X2 b2 + e.   (15)

For convenience, we will assume that the test of interest concerns the parameters β2 estimated by vector b2 and that the null hypothesis to be tested corresponds to a semipartial hypothesis, namely, that adding X2 after X1 does not improve the prediction of y. The first step is to evaluate the quality of the prediction obtained when using X1 alone. The estimated value of the parameters is denoted b̃1—a new notation is needed because, in general, b̃1 is different from b1 (b1 and b̃1 are equal only if X1 and X2 are two orthogonal blocks of columns). The model relating y to X1 is called a reduced model. Formally, this reduced model is obtained as

y = X1 b̃1 + ẽ1   (16)

(where ẽ1 is the error of prediction for the reduced model). The model sum of squares for the reduced model is denoted SS_b̃1 [see Equation 9 for its computation]. The semipartial sum of squares for X2 is the sum of squares over and above the sum of squares already explained by X1. It is denoted SS_b2|b1 and it is computed as

SS_b2|b1 = SS_model − SS_b̃1.   (17)

The null hypothesis test indicating that X2 does not improve the prediction of y over and above X1 is equivalent to testing the null hypothesis that β2 is equal to 0. It can be tested by computing the following F ratio:

F_b2|b1 = (SS_b2|b1 / SS_residual) × (I − K − 1)/K2.   (18)

When the null hypothesis is true, F_b2|b1 follows a Fisher's F distribution with ν1 = K2 and ν2 = I − K − 1 degrees of freedom, and therefore, F_b2|b1 can be used to test the null hypothesis that β2 = 0.
Specific Cases

The GLM comprises several standard statistical techniques. Specifically, linear regression is obtained by augmenting the matrix of independent variables by a column of ones (this additional column codes for the intercept). Analysis of variance is obtained by coding the experimental effect in an appropriate way. Various schemes can be used, such as effect coding, dummy coding, or contrast coding (with as many columns as there are degrees of freedom for the source of variation considered). Analysis of covariance is obtained by combining the quantitative independent variables expressed as such and the categorical variables expressed in the same way as for an analysis of variance.

Limitations and Extensions

The general model, despite its name, is not completely general and has several limits that have spurred the development of "generalizations" of the general linear model. Some of the most notable limits and some palliatives are listed below.

The general linear model requires X to be full rank, but this condition can be relaxed by using (cf. Equation 4) the Moore-Penrose generalized inverse (often denoted X+ and sometimes called a "pseudo-inverse") in lieu of (X^T X)^{-1} X^T. Doing so, however, makes the problem of estimating the model parameters more delicate and requires the use of the notion of estimable functions.

The general linear model is a fixed-effects model, and therefore, it does not naturally work with random-effects models (including multifactorial repeated or partially repeated measurement designs). In this case (at least for balanced designs), the sums of squares are computed correctly, but the F tests are likely to be incorrect. A palliative to this problem is to compute expected values for the different sums of squares and to compute F tests accordingly. Another, more general, approach is to model separately the fixed effects and the random effects. This is done with mixed-effects models.

Another obvious limit of the general linear model is to model only linear relationships. In order to include some nonlinear models (such as logistic regression), the GLM needs to be extended to the class of the generalized linear models.
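To connect with the "Specific Cases" discussion above, here is a hedged sketch (not from the original entry) showing that a GLM fitted with an effect-coded design matrix reproduces the one-way ANOVA F statistic; the data values are invented, and SciPy's f_oneway is used only as a cross-check.

```python
# Sketch: one-way ANOVA as a special case of the GLM via effect coding.
import numpy as np
from scipy import stats

groups = [np.array([3., 5., 7., 9., 11.]),
          np.array([2., 4., 6., 8., 10.]),
          np.array([10., 12., 14., 16., 18.])]
y = np.concatenate(groups)
y_c = y - y.mean()                             # centered dependent variable

# Effect coding for A = 3 groups of n = 5: two columns, last group coded -1.
A, n = 3, 5
X = np.zeros((A * n, A - 1))
X[0:n, 0] = 1
X[n:2 * n, 1] = 1
X[2 * n:, :] = -1

b = np.linalg.solve(X.T @ X, X.T @ y_c)        # Equation 4
y_hat = X @ b
e = y_c - y_hat
nu1, nu2 = A - 1, A * n - (A - 1) - 1          # model and residual df (= N - A)
F_glm = (y_hat @ y_hat) / (e @ e) * nu2 / nu1  # Equation 12

F_anova, p_anova = stats.f_oneway(*groups)     # classical one-way ANOVA
print(np.isclose(F_glm, F_anova))              # True: the two F values agree
```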
municate information. However, many research
papers and reports can be intimidating to the aver-
age person. Therefore, researchers need to find cre-
GRAPHICAL DISPLAY OF DATA ative ways to present data so that they are
sufficiently appealing to the average reader. Data
Graphs and charts are now a fundamental compo- that are presented in the form of charts and graphs
nent of modern research and reporting. Today, are one way that researchers can make data more
researchers use many graphical means such as his- appealing; many people find graphs much easier to
tograms, box plots, and scatterplots to better understand compared to narratives and tables.
understand their data. Graphs are effective for dis- Effective graphical displays of data can undoubt-
playing and summarizing large amounts of numer- edly simplify complex data, making it more com-
ical data, and are useful for showing trends, prehensible to the average reader. As the old clichè
patterns, and relationships between variables. goes—a picture paints a thousand words.
Common Graphical Displays for Reporting a grid, with a line moving from left to right, on the
diagram. When several time series lines are being
Bar Chart plotted, and color is not being used, pronounced
Bar charts are one of the most commonly used symbols along the lines can help to draw attention
techniques for presenting data and are considered to the different variables. For example, a diamond
to be one of the easiest diagrams to read and inter- (t) can be used to represent all the data points for
pret. They are used to display frequency distribu- unemployment, a square (n) for job approval, and
tions for categorical variables. In bar chart so on. Another option is to use solid/dotted/dashed
displays, the value of the observation is propor- lines to distinguish different variables.
tional to the length of the bar; each category of the
variable is represented by a separate bar; and the Effective Graphical Displays
categories of the variable are generally shown
along the horizontal axis, whereas the number of The advent of commercial, feature-rich statistical
each category is shown on the vertical axis. Bar and graphical software such as Excel and IBMâ
charts are quite versatile; they can be adapted to SPSSâ (PASW) 18.0 has made the incorporation of
incorporate displays of both negative and positive professional graphical displays into reports easy and
data on the same chart (e.g., profits and losses inexpensive. (Note: IBMâ SPSSâ Statistics was for-
across years). They are particularly useful for com- merly called PASWâ Statistics.) Both Excel and SPSS
paring groups and for showing changes over time. have built-in features that can generate a wide array
Bar charts should generally not contain more than of graphical displays in mere seconds, using a few
810 categories or they will become cluttered and point-and-click operations. However, commercial
difficult to read. When more than 10 categories software has also created new problems. For exam-
are involved in data analysis, rotated bar charts or ple, some researchers may go overboard and incor-
line graphs should be considered instead. porate so many charts and graphs into their writing
that the sheer volume of diagrams can make
Pie Chart comprehension of the data torturous—rather than
enlightening—for the reader. Others may use so
A pie chart is a circle divided into sectors or many fancy features (e.g., glow, shadows) and design
slices, where the sectors of the pie are proportional shapes (e.g., cones, doughnuts, radars, cylinders) that
to the whole. The entire pie represents 100%. Pie diagrams lose their effectiveness in conveying certain
charts are used to display categorical data for a sin- information and instead become quite tedious to
gle variable. They are quite popular in journalistic read. Many readers may become so frustrated that
and business reporting. However, these charts can they may never complete reading the document.
be difficult to interpret unless percentages and/or An equally problematic issue pertains to dis-
other numerical information for each slice are torted and misleading charts and graphs. Some of
shown on the diagram. A good pie chart should these distortions may be quite deliberate. For
have no more than eight sectors or it will become example, sometimes, scales are completely omitted
too crowded. One solution is to group several from a graph. In other cases, scales may be started
smaller slices into a category called ‘‘Other.’’ When at a number other than zero. Omitting a zero tends
color is being used, red and green should not be to magnify changes. Likewise, ‘‘starting time’’ can
located on adjacent slices, because some people also affect the appearance of magnitude. An even
are color-blind and cannot distinguish red from worse scenario, however, is when either a ‘‘scale’’
green. When patterns are used, it is important to or the ‘‘starting time’’ is adjusted and then com-
ensure that optical illusions are not created on bined with a three-dimensional or other fancy
adjacent slices or the data may be misinterpreted. graph—this may lead to even greater distortion.
Other distortions may simply result from inexperi-
enced persons preparing the diagrams. The resul-
Line Graph
tant effect is that many readers who are not
A line graph shows the relationship between knowledgeable in statistics can be easily misled by
two variables by connecting the data points on such graphs. Thus, when using graphical displays,
meticulous attention should be given to ensuring assist in creating informative and effective graphi-
that the graphs do not emphasize unimportant dif- cal displays that communicate meaningful informa-
ferences and/or distort or mislead readers. tion with clarity and precision:
In order to present effective data, the researcher
must be able to identify the salient information from 1. Focus on substance—emphasize the important.
the data. In addition, the researcher must be clear on
2. Ensure that data are coherent, clear, and
what needs to be emphasized, as well as the targeted accurate.
audience for the information. The data must then be
presented in a manner that is vivid, clear, and con- 3. Use an appropriate scale that will not distort
or mislead.
cise. The ultimate goal of effective graphical displays
should be to ensure that any data communicated are 4. Label the x-axis and y-axis with appropriate
intelligible and enlightening to the targeted audience. labels to aid interpretation [e.g., Temperature
When readers have to spend a great deal of time try- (8 C); Time (minutes)].
ing to decipher a diagram, this is a clear indication 5. Number the graphs/charts, and give them an
that the diagram is ineffective. informative title (e.g., Figure 1: ABC College
All graphical displays should include source Course Enrollment 2009).
information. When graphical displays are sourced 6. Include the source at the bottom of the
entirely from other works, written permission is diagram.
required that will specify exactly how the source
7. Simplicity is often best. Use three-dimensional
should be acknowledged. If graphs are prepared
and other fancy graphs cautiously—they often
using data that are not considered proprietary, distort and/or mislead.
copyright permission need not be sought, but the
data source must still be acknowledged. When 8. Avoid stacked bar charts unless the primary
graphical displays are prepared entirely from the comparison is being made on the data series
researcher’s own data, the source information gen- located on the bottom of the bar.
erally makes reference to the technique/population 9. When names are displayed on a label (e.g.,
used to obtain the data (e.g., 2009 Survey of ABC countries, universities, etc.), alphabetize data
College Students). Source information should be before charting to aid reading.
placed at the bottom of the diagram and should be 10. Use statistical and textual descriptions
sufficiently detailed to enable the reader to go appropriately to aid data interpretation.
directly to the source (e.g., Source: General Motors
11. Use a legend when charts include more than
Annual Report, 2009, Page 10, Figure 6—
one data series. Locate legend carefully to
reprinted with permission). avoid reducing plot area.
When using graphics, many researchers often
concentrate their efforts on ensuring that the 12. Appearance is important. Consider using
salient facts are presented, while downplaying borders with curved edges and three-
appearance of displays. Others emphasize appear- dimensional effects to enhance graphical
displays. Use colors effectively and
ance over content. Both are important. Eye-catch-
consistently. Do not color every graph with
ing graphs are useless if they contain little or no a different color. Bear in mind that when
useful information. On the other hand, a graph colored documents are photocopied in black
that contains really useful content may get limited and white, images will be difficult to interpret
reading because of its appearance. Therefore, unless the original document had sharp color
researchers need to package their reports in a man- contrast. When original documents are being
ner that would be appealing to a wider mass. Effec- printed in black and white, it may be best to
tive data graphics require a combination of good use shades of black and gray or textual
statistical and graphical design skills, which some patterns.
researchers may not possess. However, numerous 13. Avoid chart clutter. This confuses and
guidelines are available on the Internet and in texts distracts the reader and can often obscure the
that can assist even a novice to create effective distribution’s shape. For example, if you are
graphs. In addition, the following guidelines can charting data for 20 years, show every other
year on the x-axis. Angling labels may create an optical illusion of less clutter.

14. Use readable, clear fonts (e.g., Times New Roman 10 or 12) for labels, titles, scales, symbols, and legends.

15. Ensure that diagrams and legends are not so small that they require a magnifying glass in order to be read.

16. Edit and scale graphs to desired size in the program in which they were created before transferring into the word-processed document to avoid image distortions with resizing.

17. Edit and format graphical displays generated directly from statistical programs before using.

18. Use gridlines cautiously. They may overwhelm and distract if the lines are too thick. However, faded gridlines can be very effective on some types of graphs (e.g., line charts).

19. Use a specific reference format such as APA style to prepare the document to ensure correct placement of graph titles, and so on.

20. Ensure that graphical displays are self-explanatory—readers should be able to understand them with minimal or no reference to the text and tables.

Nadini Persaud

See also Bar Chart; Column Graph; Cumulative Frequency Distribution; Histogram; Line Graph; Pie Chart

Further Readings

Fink, A. (2003). How to report on surveys (2nd ed.). Thousand Oaks, CA: Sage.
Owen, F., & Jones, R. (1994). Statistics (4th ed.). London: Pitman.
Pallant, J. (2001). SPSS survival manual: A step by step guide to using SPSS. Maidenhead, Berkshire, UK: Open University Press.


GREENHOUSE–GEISSER CORRECTION

When performing an analysis of variance with a one-factor, repeated-measurement design, the effect of the independent variable is tested by computing an F statistic, which is computed as the ratio of the mean square of effect by the mean square of the interaction between the subject factor and the independent variable. For a design with S subjects and A experimental treatments, when some assumptions are met, the sampling distribution of this F ratio is a Fisher distribution with ν1 = A − 1 and ν2 = (A − 1)(S − 1) degrees of freedom.

In addition to the usual assumptions of normality of the error and homogeneity of variance, the F test for repeated-measurement designs assumes a condition called sphericity. Intuitively, this condition indicates that the ranking of the subjects does not change across experimental treatments. This is equivalent to stating that the population correlation (computed from the subjects' scores) between two treatments is the same for all pairs of treatments. This condition implies that there is no interaction between the subject factor and the treatment.

If the sphericity assumption is not valid, then the F test becomes too liberal (i.e., the proportion of rejections of the null hypothesis is larger than the α level when the null hypothesis is true). In order to minimize this problem, Seymour Greenhouse and Samuel Geisser, elaborating on early work by G. E. P. Box, suggested using an index of deviation from sphericity to correct the number of degrees of freedom of the F distribution. This entry first presents this index of nonsphericity (called the Box index, denoted ε), and then it presents its estimation and its application, known as the Greenhouse–Geisser correction. This entry also presents the Huynh–Feldt correction, which is a more efficient procedure. Finally, this entry explores tests for sphericity.

Index of Sphericity

Box has suggested a measure for sphericity, denoted ε, which varies between 0 and 1 and reaches the value of 1 when the data are perfectly spherical. The computation of this index is illustrated with the fictitious example given in Table 1 with data collected from S = 5 subjects whose responses were measured for A = 4 different treatments.
Table 1   A Data Set for a Repeated-Measurement Design

        a1    a2    a3    a4    M.s
S1      76    64    34    26    50
S2      60    48    46    30    46
S3      58    34    32    28    38
S4      46    46    32    28    38
S5      30    18    36    28    28
Ma.     54    42    36    28    M.. = 40

The standard analysis of variance of these data gives a value of F_A = 600/112 = 5.36, which, with ν1 = 3 and ν2 = 12, has a p value of .014.

In order to evaluate the degree of sphericity, the first step is to create a table called a covariance matrix. This matrix is composed of the variances of all treatments and all the covariances between treatments. As an illustration, the covariance matrix for our example is given in Table 2.

Table 2   The Covariance Matrix for the Data Set of Table 1

            a1     a2     a3     a4
a1          294    258    8      −8
a2          258    294    8      −8
a3          8      8      34     6
a4          −8     −8     6      2
ta.         138    138    14     −2      t.. = 72
ta. − t..   66     66     −58    −74

Box defined an index of sphericity, denoted ε, which applies to a population covariance matrix. If we call ζ_{a,a′} the entries of this A × A table, the Box index of nonsphericity is obtained as

ε = (Σ_a ζ_{a,a})² / [(A − 1) Σ_{a,a′} ζ²_{a,a′}].   (1)

Box also showed that when sphericity fails, the number of degrees of freedom of the F_A ratio depends directly upon the degree of nonsphericity and is equal to ν1 = ε(A − 1) and ν2 = ε(A − 1)(S − 1).

Greenhouse–Geisser Correction

Box's approach works for the population covariance matrix, but in general, this matrix is not known. In order to estimate ε, we need to transform the sample covariance matrix into an estimate of the population covariance matrix. In order to compute this estimate, we denote by t_{a,a′} the sample estimate of the covariance between groups a and a′ (these values are given in Table 2), by t_{a.} the mean of the covariances for group a, and by t_{..} the grand mean of the covariance table. The estimation of the population covariance matrix will have for a general term s_{a,a′}, which is computed as

s_{a,a′} = (t_{a,a′} − t_{..}) − (t_{a.} − t_{..}) − (t_{a′.} − t_{..}) = t_{a,a′} − t_{a.} − t_{a′.} + t_{..}   (2)

(this procedure is called "double-centering"). Table 3 gives the double-centered covariance matrix. From this matrix, we can compute the estimate of ε, which is denoted ε̂ (compare with Equation 1):

ε̂ = (Σ_a s_{a,a})² / [(A − 1) Σ_{a,a′} s²_{a,a′}].   (3)

In our example, this formula gives

ε̂ = (90 + 90 + 78 + 78)² / [(4 − 1)(90² + 54² + ··· + 66² + 78²)] = 336²/(3 × 84,384) = 112,896/253,152 ≈ .4460.

Table 3   The Double-Centered Covariance Matrix Used to Estimate the Population Covariance Matrix

        a1     a2     a3     a4
a1      90     54     −72    −72
a2      54     90     −72    −72
a3      −72    −72    78     66
a4      −72    −72    66     78
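A minimal sketch (not part of the original entry), assuming NumPy is available, that reproduces the estimate for the Table 1 data: build the covariance matrix (Table 2), double-center it (Equation 2 / Table 3), and apply Equation 3.

```python
# Sketch: Box/Greenhouse-Geisser epsilon estimate via double-centering.
import numpy as np

data = np.array([[76, 64, 34, 26],
                 [60, 48, 46, 30],
                 [58, 34, 32, 28],
                 [46, 46, 32, 28],
                 [30, 18, 36, 28]], dtype=float)   # S = 5 subjects, A = 4 treatments
S, A = data.shape

T = np.cov(data, rowvar=False)                     # Table 2: sample covariance matrix
row_means = T.mean(axis=1, keepdims=True)
grand_mean = T.mean()
S_dc = T - row_means - row_means.T + grand_mean    # Equation 2: double-centering (Table 3)

eps_hat = np.trace(S_dc) ** 2 / ((A - 1) * np.sum(S_dc ** 2))   # Equation 3
print(round(eps_hat, 4))                           # 0.446, matching the text
```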
We use the value of ε̂ = .4460 to correct the number of degrees of freedom of F_A as ν1 = ε̂(A − 1) = 1.34 and ν2 = ε̂(A − 1)(S − 1) = 5.35. These corrected values of ν1 and ν2 give for F_A = 5.36 a probability of p = .059. If we want to use the critical value approach, we need to round the values of these corrected degrees of freedom to the nearest integer (which will give here the values of ν1 = 1 and ν2 = 5).

Eigenvalues

The Box index of sphericity is best understood in relation to the eigenvalues of a covariance matrix. Covariance matrices belong to the class of positive semidefinite matrices and therefore always have positive or null eigenvalues. Specifically, if we denote by Σ a population covariance, and by λ_ℓ the ℓth eigenvalue of Σ, the sphericity condition is equivalent to having all eigenvalues equal to a constant. Formally, the sphericity condition states that

λ_ℓ = constant  ∀ℓ.   (4)

In addition, if we denote by V (also called β or ν) the following index,

V = (Σ λ_ℓ)² / Σ λ_ℓ²,   (5)

then the Box coefficient can be expressed as

ε = V/(A − 1).   (6)

Under sphericity, all of the eigenvalues are equal, and V is equal to (A − 1). The estimate of ε is obtained by using the eigenvalues of the estimated covariance matrix. For example, the matrix from Table 3 has the following eigenvalues:

λ1 = 288, λ2 = 36, λ3 = 12.

This gives

V = (Σ λ_ℓ)² / Σ λ_ℓ² = (288 + 36 + 12)² / (288² + 36² + 12²) ≈ 1.3379,

which, in turn, gives

ε̂ = V/(A − 1) = 1.3379/3 ≈ .4460

(this matches the result obtained earlier with Equation 3).

Extreme Greenhouse–Geisser Correction

A conservative (i.e., increasing the risk of Type II error: the probability of not rejecting the null hypothesis when it is false) correction for sphericity has been suggested by Greenhouse and Geisser. Their idea is to use the largest possible correction, which corresponds to ε̂ equal to 1/(A − 1). This leads us to consider that F_A follows a Fisher distribution with ν1 = 1 and ν2 = S − 1 degrees of freedom. In this case, these corrected values of ν1 = 1 and ν2 = 4 give for F_A = 5.36 a probability of p = .081.

Huynh–Feldt Correction

Huynh Huynh and Leonard S. Feldt suggested a more powerful approximation for ε denoted ε̃ and computed as

ε̃ = [S(A − 1)ε̂ − 2] / {(A − 1)[S − 1 − (A − 1)ε̂]}.   (7)

In our example, this formula gives

ε̃ = [5(4 − 1)(.4460) − 2] / {(4 − 1)[5 − 1 − (4 − 1)(.4460)]} = .5872.

We use the value of ε̃ = .5872 to correct the number of degrees of freedom of F_A as ν1 = ε̃(A − 1) = 1.76 and ν2 = ε̃(A − 1)(S − 1) = 7.04. These corrected values give for F_A = 5.36 a probability of p = .041. If we want to use the critical value approach, we need to round these corrected values for the number of degrees of freedom to the nearest integer (which will give here the values of ν1 = 2 and ν2 = 7). In general, the correction of Huynh and Feldt is to be preferred because it is more powerful (and Greenhouse–Geisser is too conservative).
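Continuing the earlier sketch (same assumptions, not part of the original entry): corrected degrees of freedom and p values for F_A = 5.36 under the uncorrected, Greenhouse–Geisser, Huynh–Feldt (Equation 7), and extreme corrections.

```python
# Sketch: sphericity-corrected degrees of freedom and p values for F_A.
from scipy import stats

A, S, F_A = 4, 5, 5.36
eps_hat = 0.4460                                  # Greenhouse-Geisser estimate
# Huynh-Feldt approximation (Equation 7)
eps_tilde = (S * (A - 1) * eps_hat - 2) / ((A - 1) * (S - 1 - (A - 1) * eps_hat))

for name, eps in [("uncorrected", 1.0),
                  ("Greenhouse-Geisser", eps_hat),
                  ("Huynh-Feldt", eps_tilde),
                  ("extreme (lower bound)", 1.0 / (A - 1))]:
    nu1 = eps * (A - 1)
    nu2 = eps * (A - 1) * (S - 1)
    p = stats.f.sf(F_A, nu1, nu2)
    print(f"{name:>22}: eps = {eps:.4f}, nu1 = {nu1:.2f}, nu2 = {nu2:.2f}, p = {p:.3f}")
```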
Stepwise Strategy for Sphericity

Greenhouse and Geisser suggest using a stepwise strategy for the implementation of the correction for lack of sphericity. If F_A is not significant with the standard degrees of freedom, there is no need to implement a correction (because it will make it even less significant). If F_A is significant with the extreme correction [i.e., with ν1 = 1 and ν2 = (S − 1)], then there is no need to correct either (because the correction will make it more significant). If F_A is not significant with the extreme correction but is significant with the standard number of degrees of freedom, then use the ε correction (they recommend using ε̂, but the subsequent ε̃ is currently preferred by many statisticians).

Testing for Sphericity

One incidental question about using a correction for lack of sphericity is to decide when a sample covariance matrix is not spherical. Several tests can be used to answer this question. The most well known is Mauchly's test, and the most powerful is the John, Sugiura, and Nagao test.

Mauchly's Test

J. W. Mauchly constructed a test for sphericity based on the following statistic, which uses the eigenvalues of the estimated covariance matrix:

W = Π λ_ℓ / [ (1/(A − 1)) Σ λ_ℓ ]^{A − 1}.   (8)

This statistic varies between 0 and 1 and reaches 1 when the matrix is spherical. For our example, we find that

W = (288 × 36 × 12) / [ (1/3)(288 + 36 + 12) ]³ = 124,416/1,404,928 ≈ .0886.

Tables for the critical values of W are available in Nagarsenker and Pillai (1973), but a good approximation is obtained by transforming W into

X²_W = −(1 − f) × (S − 1) × ln{W},   (9)

where

f = [2(A − 1)² + A + 2] / [6(A − 1)(S − 1)].   (10)

Under the null hypothesis of sphericity, X²_W is approximately distributed as a χ² with degrees of freedom equal to

ν = ½ A(A − 1).   (11)

For our example, we find that

f = [2 × 3² + 4 + 2] / [6 × 3 × 4] = 24/72 = .33

and

X²_W = −(1 − f) × (S − 1) × ln{W} = −4(1 − .33) × ln{.0886} ≈ 6.46;

with ν = ½ × 4 × 3 = 6, we find that p = .38 and we cannot reject the null hypothesis. Despite its relative popularity, the Mauchly test is not recommended by statisticians because it lacks power. A more powerful alternative is the John, Sugiura, and Nagao test for sphericity described below.

John, Sugiura, and Nagao Test

According to John E. Cornell, Dean M. Young, Samuel L. Seaman, and Roger E. Kirk, the best test for sphericity uses V. Tables for the critical values of V are available in A. P. Grieve, but a good approximation is obtained by transforming V into

X²_V = [ S(A − 1)² / 2 ] [ V − 1/(A − 1) ].   (12)

Under the null hypothesis, X²_V is approximately distributed as a χ² distribution with ν = ½ A(A − 1) − 1. For our example, we find that

X²_V = [ S(A − 1)² / 2 ] [ V − 1/(A − 1) ] = (5 × 3²/2)(1.3379 − 1/3) = 22.60.

With ν = ½ × 4 × 3 − 1 = 5, we find that p = .004 and we can reject the null hypothesis with the usual test. The discrepancy between the conclusions reached from the two tests for sphericity illustrates the lack of power of Mauchly's test.
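A brief sketch (same assumptions as the previous blocks, not part of the original entry) computing Mauchly's W and the John, Sugiura, and Nagao statistic from the nonzero eigenvalues reported in the text.

```python
# Sketch: Mauchly's test and the John, Sugiura, and Nagao test for sphericity.
import numpy as np
from scipy import stats

A, S = 4, 5
eigvals = np.array([288.0, 36.0, 12.0])          # nonzero eigenvalues of Table 3

# Mauchly's test (Equations 8-11)
W = eigvals.prod() / (eigvals.sum() / (A - 1)) ** (A - 1)
f = (2 * (A - 1) ** 2 + A + 2) / (6 * (A - 1) * (S - 1))
X2_W = -(1 - f) * (S - 1) * np.log(W)
p_W = stats.chi2.sf(X2_W, A * (A - 1) / 2)

# John, Sugiura, and Nagao test (Equation 12)
V = eigvals.sum() ** 2 / (eigvals ** 2).sum()
X2_V = S * (A - 1) ** 2 / 2 * (V - 1 / (A - 1))
p_V = stats.chi2.sf(X2_V, A * (A - 1) / 2 - 1)

print(f"W = {W:.4f}, X2_W = {X2_W:.2f}, p = {p_W:.2f}")   # ~.0886, 6.46, .37
print(f"V = {V:.4f}, X2_V = {X2_V:.2f}")                  # ~1.3379, 22.60
```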
Hervé Abdi

See also Analysis of Covariance (ANCOVA); Analysis of Variance (ANOVA); Pooled Variance; Post Hoc Analysis; Post Hoc Comparisons; Sphericity

Further Readings

Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I: Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics, 25, 290–302.
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, II: Effects of inequality of variance and of correlation between errors in the two-way classification. Annals of Mathematical Statistics, 25, 484–498.
Cornell, J. E., Young, D. M., Seaman, S. L., & Kirk, R. E. (1992). Power comparisons of eight tests for sphericity in repeated measures designs. Journal of Educational Statistics, 17, 233–249.
Geisser, S., & Greenhouse, S. W. (1958). An extension of Box's result on the use of F distribution in multivariate analysis. Annals of Mathematical Statistics, 29, 885–891.
Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95–112.
Grieve, A. P. (1984). Tests of sphericity of normal distribution and the analysis of repeated measure designs. Psychometrika, 49, 257–267.
Huynh, H., & Feldt, L. S. (1970). Conditions under which mean square ratios in repeated measurement designs have exact F-distributions. Journal of the American Statistical Association, 65, 1582–1589.
Huynh, H., & Feldt, L. S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics, 1, 69–82.
John, S. (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika, 59, 169–173.
Mauchly, J. W. (1940). Significance test for sphericity of n-variate normal population. Annals of Mathematical Statistics, 11, 204–209.
Nagao, H. (1973). On some test criteria for covariance matrix. Annals of Statistics, 1, 700–709.
Sugiura, N. (1972). Locally best invariant test for sphericity and the limiting distribution. Annals of Mathematical Statistics, 43, 1312–1326.


GROUNDED THEORY

Grounded theory, a qualitative research method, relies on insight generated from the data. Unlike traditional research that begins from a preconceived framework of logically deduced hypotheses, grounded theory begins inductively by gathering data and posing hypotheses during analysis that can be confirmed or disconfirmed during subsequent data collection. Grounded theory is used to generate a theory about a research topic through the systematic and simultaneous collection and analysis of data. Developed in the 1960s by Barney Glaser and Anselm Strauss within the symbolic interactionist tradition of field studies in sociology and drawing also on principles of factor analysis and qualitative mathematics, it is now used widely in the social sciences; business and organizational studies; and, particularly, nursing.

As an exploratory method, grounded theory is particularly well suited for investigating social processes that have attracted little prior research attention, where the previous research is lacking in breadth and/or depth, or where a new point of view on familiar topics appears promising. The purpose is to understand the relationships among concepts that have been derived from qualitative (and, less often, quantitative) data, in order to explore (and explain) the behavior of persons engaged in any specific kind of activity. By using this method, researchers aim to discover the basic issue or problem for people in particular circumstances, and then explain the basic social process (BSP) through which they deal with that issue. The goal is to develop an explanatory theory from the "ground up" (i.e., the theory is derived inductively from the data).

This entry focuses on the grounded theory research process, including data collection, data analysis, and assessments of the results. In addition, modifications to the theory are also discussed.

Grounded Theory Research Design

One important characteristic that distinguishes grounded theory (and other qualitative research) is the evolutionary character of the research design. Because researchers want to fully understand the meaning and course of action of an experience from the perspective of the participants, variables cannot be identified in advance. Instead, the important concepts emerge during data collection and analysis, and the researcher must remain open-minded to recognize these concepts. Therefore, the research process must be flexible to allow
these new insights to guide further data collection Data Collection


and exploration. At the same time, grounded
Grounded theory is a form of naturalistic
theory is both a rigorous and systematic approach
inquiry. Because the problems that generate
to empirical research.
research are located in the natural world,
grounded theorists investigate their questions in
Writing Memos (and draw their interpretations from) the natural
world of their participants. Consequently, once
To ensure that a study is both systematic and a phenomenon has been identified as the topic to
flexible, the researcher is responsible for keeping be studied, data collection begins by seeking out
detailed notes in the form of memos in which the the places where the issue occurs and reviewing
researcher documents observations in the field, documents, observing and talking to the people
methodological ideas and arrangements, analytical involved, and sometimes reviewing visual media.
thinking and decisions, and personal reflections. Consequently, the study begins with purposive
Memo writing begins at the time of conceptualiza- sampling. Later, data collection is guided by a par-
tion of the study with the identification of the phe- ticular type of purposive sampling called theoreti-
nomenon of interest and continues throughout the cal sampling.
study. These memos become part of the study
data. When a researcher persists in meticulously Theoretical Sampling
recording memos, writing the first draft of the
study report becomes a simple matter of sorting After the initial data are analyzed, data collec-
the memos into a logical sequence. tion is directed by the emerging theory. Theoretical
sampling is achieved by collecting, coding, and
analyzing the data simultaneously, rather than
Reviewing the Literature sequentially. Sampling is now guided by deductive
reasoning, as the researcher seeks additional data
Whether to review the literature before data to enlarge upon the insights that have been learned
collection may depend on the circumstances of the from the participants who have been interviewed
individual researcher. Methodological purists fol- thus far and to fill out those portions of the theory
low the originators’ advice to delay reading related that need further development. Because of theoret-
literature to avoid developing preconceived ideas ical sampling, the sample size cannot be deter-
that could be imposed during data analysis, thus mined before the study commences. Only when
ensuring that the conceptualization emerges from the researcher is satisfied that no new concepts are
the data. Instead, they recommend reading broadly emerging and no new information on the impor-
in other disciplines early in the study to develop tant concepts is forthcoming can the decision be
‘‘sensitizing concepts’’ that may trigger useful ideas made that the point of theoretical saturation has
and analogies during the latter stages of theoretical been achieved and data collection ends.
construction and elaboration. For them, the actual
literature review is more appropriately begun once
Interviewing
the theory has started to take shape, at which time
previous writing about those concepts that has For grounded theory, data consist of any form
already emerged from the data can be helpful for of information about the research topic that can
developing theoretical relationships and relating be gathered, including the researcher’s own field
the emerging theory to previous knowledge about notes. For many studies, however, interviews form
the topic. Others, however, recognize pragmati- the majority of the data, but these are usually sup-
cally that for a research proposal to be approved plemented with other kinds of information. As
by current funding agencies and thesis committees, a general rule, more than one interview with parti-
knowledge of past research must be demonstrated cipants can help to create a broader and more in-
and then followed up with further reading as the depth analysis. Interviewed participants are
theory develops in order to show where it is con- initially asked broad, open-ended questions to try
gruent (or not) with previous academic work. to elicit their own interpretations and
understandings of what is important in their expe- Coding and Categorizing Data


rience. This relative lack of explicit direction in the
Codes and categories are the building blocks of
questions is a conscious effort not to bias their
theory. Open coding begins by examining the data
responses toward what informants might think the
minutely, phrase by phrase and line by line, to
researcher wants to hear. As the study evolves, the
identify concepts and processes, by asking oneself,
interview process changes. Later interviews are
‘‘What does this indicate?’’ and ‘‘What is going on
structured to answer more specific questions aimed
here?’’ These short portions of data are assigned
at better understanding those concepts that have
‘‘in vivo’’ or substantive codes derived as much as
not yet been fully fleshed out in the data or to seek
possible from the interviewee’s own vocabulary.
agreement from participants that the theory
An in-depth interview obviously yields a multitude
accounts for their experience.
of codes, although many will be repeated. Codes
Most researchers audiotape the interviews and
are repeated when the analyst finds other phrases
transcribe them verbatim. This preserves the rich-
that indicate the same thing. As the codes are iden-
ness of the data and prepares the data for analysis.
tified, memos are written to define them.
Immediately following each interview, written or
Coding allows the data segments belonging to
tape-recorded field notes are made of the interview
each code to be sorted together. Comparing these
er’s observations and impressions. Some grounded
coded segments for similarities and differences
theorists insist that tape-recording is unnecessary
allows them to be grouped into categories. As this
and that a researcher’s detailed field notes of con-
is done, memos are written for each analytical
versations are sufficient. They maintain that by
decision. When each new interview is coded, the
relying too heavily on transcriptions, the
substantive codes and the data they contain are
researcher may be constrained from raising the
compared with other codes and categories, and
analysis from a descriptive to a more conceptual
with the coding in previous interviews. Thus, by
and theoretical level of abstraction. After the first
comparing incident to incident, data segment with
or second interview has been completed, the
data segment, code with code, and codes with
researcher begins constant comparative analysis.
categories and individual cases, connections
among the data are identified. Some codes and
categories may be combined and the number of
The Constant Comparative Method conceptual codes reduced. As codes fit together,
the relationships among them are recorded in
In the analysis of data, grounded theorists employ memos.
both inductive and deductive reasoning. Constant
comparison is used throughout this simultaneous Theoretical Coding
and iterative collection, analysis, and interpreta-
tion of the data. Emerging codes and concepts are Categories are collapsed into a higher level of
continually compared with one another, with new theoretical category or construct, as patterns,
data, with previously analyzed data, and with the dimensions, and relationships among them are
researcher’s observations and analytical memos. noted. One way to look for these relationships is
to draw diagrams showing connections among the
categories and to memos describing those connec-
tions. Another is to examine the codes and cate-
Data Analysis
gories in terms of their causes, contexts,
Analyzing voluminous amounts of textual data contingencies, consequences, covariances, and con-
is a matter of data reduction, segmenting the data ditions (the six Cs). Glaser has described multiple
into sections that can be compared, contrasted, families of theoretical codes, but the six Cs consti-
and sorted by categories. Thus, grounded theory tute the basic coding family that is commonly
analysis consists of coding, categorizing, memo employed to tease out the meaning of a code or
writing, and memo sorting, to arrive at a core vari- category. Some other examples are dimensions,
able or BSP that is the central theme of the degrees, types, and temporal ordering. These cod-
analysis. ing families are meant to sensitize the analyst to
relationships that may be discovered among codes is conducted to connect the theory with previous
and categories; they are not intended to serve as work in the field.
a checklist for matching with theoretical
constructs.
Substantive and Formal Grounded Theories
Theoretical codes or constructs, derived by
questioning the data, are used to conceptualize Two levels of grounded theory (both of which are
relationships among the codes and categories. considered to be middle-range) can be found in the
Each new level of coding requires the researcher to literature. Most are substantive theories, developed
reexamine the raw data to ensure that they are from an empirical study of social interaction in
congruent with the emerging theory. Unanswered a defined setting (such as health care, education, or
questions may identify gaps in the data and are an organization) or pertaining to a discrete experi-
used to guide subsequent interviews until the ence (such as having a particular illness, learning
researcher is no longer able to find new informa- difficult subjects, or supervising co-workers). In
tion pertaining to that construct or code. Thus, the contrast, formal theories are more abstract and
code is ‘‘saturated,’’ and further data collection focused on more conceptual aspects of social inter-
omits this category, concentrating on other issues. action, such as stigma, status passage, or negotia-
Throughout, as linkages are discovered and tion. A common way to build formal theory is by
recorded in memos, the analyst posits hypotheses the constant comparative analysis of any group of
about how the concepts fit together into an inte- substantive grounded theories that is focused on
grated theory. Hypotheses are tested against fur- a particular social variable but enacted under dif-
ther observations and data collection. The ferent circumstances, for different reasons, and in
hypotheses are not tested statistically but, instead, varied settings.
through this persistent and methodical process of
constant comparison.
Using Software for Data Analysis
Using a software program to manage these com-
Hypothesizing a Core Category
plex data can expedite the analysis. Qualitative
Eventually, a core variable that appears to data analysis programs allow the analyst to go
explain the patterns of behavior surrounding the beyond the usual coding and categorizing of data
phenomenon of interest becomes evident. This that is possible when analyzing the data by hand.
core category links most or all of the other cate- Analysts who use manual methods engage in
gories and their dimensions and properties a cumbersome process that may include highlight-
together. In most, but not all, grounded theory ing data segments with multicolored markers, cut-
studies, the core category is a BSP, an ‘‘umbrella ting up transcripts, gluing data segments onto
concept’’ that appears to explain the essence of the index cards, filing the data segments that pertain
problem for participants and how they attempt to to a particular code or category together, and
solve it. BSPs may be further subdivided into two finally sorting and re-sorting these bits of paper by
types: basic social structural processes and basic taping them on the walls. Instead, with the aid of
social psychological processes. computer programs, coding and categorizing the
At this point, however, the core category is only data are accomplished easily, and categorized data
tentative. Further interviews focus on developing segments can be retrieved readily. In addition, any
and testing this core category by trying to discount changes to these procedures can be tracked as
it. The researcher presents the theory to new parti- ideas about the data evolve. With purpose-build
cipants and/or previously interviewed participants programs (such as NVivo), the researcher is also
and elicits their agreement with the theory, further able to build and test theories and construct matri-
clarification, or refutation. With these new data, ces in order to discover patterns in the data.
the analyst can dispense with open coding and Controversy has arisen over the use of computer
code selectively for the major categories of the programs for analyzing qualitative data. Some
BSP. Once satisfied that the theory is saturated and grounded theorists contend that using a computer
explains the phenomenon, a final literature review program forces the researcher in particular
directions, confining the analysis and stifling crea- similar circumstances, as reflective of their own
tivity. Those who support the use of computers experience.
recognize their proficiency for managing large
amounts of complex data. Many grounded theor-
ists believe that qualitative software is particularly Modifiability
well-suited to the constant comparative method.
Finally, modifiability becomes important after
Nevertheless, prudent qualitative researchers who
the study is completed and when the theory is
use computers as tools to facilitate the examina-
applied. No grounded theory can be expected to
tion of their data continually examine their use of
account for changing circumstances. Over time,
technology to enhance, rather than replace, recog-
new variations and conditions that relate to the
nized analytical methods.
theory may be discovered, but a good BSP remains
applicable because it can be extended and qualified
Ensuring Rigor appropriately to accommodate new data and
variations.
Judging qualitative work by the positivist stan-
dards of validity and reliability is inappropriate as
these tests are not applicable to the naturalistic Developments in Grounded Theory
paradigm. Instead, a grounded theory is assessed
In the 50 years that have elapsed since grounded
according to four standards, commonly referred to
theory was first described in 1967, various
as fit, work, grab, and modifiability.
grounded theorists have developed modifications
to the method. Although Barney Glaser continues
Fit to espouse classic grounded theory method,
Anselm Strauss and Juliet Corbin introduced the
To ensure fit, the categories must be generated
conditional matrix as a tool for helping the analyst
from the data, rather than the data being forced to
to explicate contextual conditions that exert influ-
comply with preconceived categories. In reality,
ences upon the action under investigation. Using
many of the categories found in the data will be
the conditional matrix model, the analyst is cued
factors that occur commonly in everyday life.
to examine the data for the effects of increasingly
However, when such common social variables are
broad social structures, ranging from groups
found in the data, the researcher must write about
through organizations, communities, the country,
these pre-existing categories in a way that reveals
and the international relations within which the
their origin in the data. Inserting quotations from
action occurs. As new theoretical perspectives
the data into the written report is one way of doc-
came to the fore, grounded theorists adapted the
umenting fit.
methodology accordingly. For example, Kathy
Charmaz contributed a constructivist approach to
Work grounded theory, and Adele Clarke expanded into
postmodern thought with situational analysis.
To work, a theory should explain what hap-
Others have used grounded theory within feminist
pened and variation in how it happened, predict
and critical social theory perspectives. Whichever
what will happen, and/or interpret what is happen-
version a researcher chooses for conducting his or
ing for the people in the setting. Follow-up inter-
her grounded theory study, the basic tenets of
views with selected participants can be used as
grounded theory methodology continue to endure;
a check on how well the theory works for them.
conceptual theory is generated from the data by
way of systematic and simultaneous collection and
Grab analysis.
Grab refers to the degree of relevance that the P. Jane Milliken
theory and its core concept have to the topic of
the study. That is, the theory should be immedi- See also Inference: Deductive and Inductive; Naturalistic
ately recognizable to participants and others in Inquiry; NVivo; Qualitative Research
Further Readings

Bryant, A., & Charmaz, K. (2007). The Sage handbook of grounded theory. Thousand Oaks, CA: Sage.
Charmaz, K. (2006). Constructing grounded theory: A practical guide through qualitative analysis. Thousand Oaks, CA: Sage.
Clarke, A. E. (2005). Situational analysis: Grounded theory after the postmodern turn. Thousand Oaks, CA: Sage.
Glaser, B. G. (1978). Theoretical sensitivity. Mill Valley, CA: Sociology Press.
Glaser, B. G. (2008). Doing quantitative grounded theory. Mill Valley, CA: Sociology Press.
Glaser, B. G., & Strauss, A. (1967). The discovery of grounded theory. Chicago: Aldine.
Hutchinson, S. A., & Wilson, H. S. (2001). Grounded theory: The method. In P. L. Munhall (Ed.), Nursing research: A qualitative perspective (pp. 209–243). Boston: Jones and Bartlett.
Schreiber, R. S., & Stern, P. N. (2001). Using grounded theory in nursing. New York: Springer.
Strauss, A. L. (1987). Qualitative analysis for social scientists. Cambridge, UK: Cambridge University Press.
Strauss, A. L., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory. Newbury Park, CA: Sage.


GROUP-SEQUENTIAL DESIGNS IN CLINICAL TRIALS

If the main interest of a clinical trial is to determine whether a new treatment results in a better outcome than the existing treatment, investigators often would like to obtain the result as soon as possible. One of the reasons for that is that if one treatment is clearly superior to the other, it is unethical to continue the inferior treatment. Standard methods, which fix the length of a study and conduct only one significance test at the end of the study to compare treatments, are inefficient in terms of use of time and cost. Therefore, one question that needs to be answered is whether one can predict with certainty the outcome of the trial before the end of the study based on interim data. Statisticians adapted sequential methods to test a significant difference in treatment groups every time new and follow-up subjects are assessed. Even though this method saves time and requires a smaller number of subjects, it is not adapted to many trials due to its impracticality. As a natural extension of it, the use of group-sequential theory in clinical trials was introduced to accommodate the sequential method's limitations. This entry discusses the group-sequential design methods and describes three procedures for testing significance.

Methods

Group-sequential methods are clinical trial stopping rules that consist of a series of interim analyses conducted at each visit so that any significant difference among treatment groups can be detected before the trial ends. Initially, before the trial begins, the number of visits and the sample size required at each interim visit are determined. Then, at each interim visit, a significance test is conducted. Once there is evidence for a significant difference between the treatment groups at any interim visit, an early termination of the clinical trial is possible and there is no need to recruit any more subjects. For the significance testing, there are several available methods, among which Pocock's, O'Brien and Fleming's, and Wang and Tsiatis's tests are widely used in clinical trials.

Test Procedures

In this section, three tests are described for comparing two treatment groups based on a normal response with known variance. Significance tests for the other types of response variables (e.g., binomial or exponential) are not explained here, but are also available. In addition, if the main interest is to compare more than two treatment groups, it is possible to modify the tests using an F ratio test for one-way analysis of variance. All three tests are similar in the sense that they adjust critical values for multiple comparisons in order to prevent the increasing probability of Type I errors (rejection of a true null hypothesis).

For all tests, let K be the fixed number of visits, which is predetermined before the trial begins. Let x_ij be the jth subject from the ith treatment group, where i = 1, 2 and j = 1, ..., n. Assume that each x_ij is independently drawn from a normal distribution with a mean of μ_i and a variance of σ_i². Finally, n_k is the number of accumulated subjects
Pocock's Test

The method consists of two steps:

1. Calculate Z_k at each visit k, where k = 1, . . . , K:

   Z_k = [ Σ_{j=1}^{n_k} x_1j − Σ_{j=1}^{n_k} x_2j ] / √( n_k (σ_1² + σ_2²) )

2. Compare Z_k to a critical value C_p(K, α).

At any interim visit k prior to the final visit K, if |Z_k| > C_p(K, α), then stop the trial to conclude that there is evidence that one treatment is superior to the other. Otherwise, continue to collect the assessments. The critical values C_p(K, α) are available in standard textbooks or statistical software packages.

The required sample size per treatment group at each interim visit is calculated as follows:

   n = R_p(K, α, β) × [ (Z_{α/2} + Z_β)² (σ_1² + σ_2²) / (μ_1 − μ_2)² ] / K,

where σ_1² and σ_2² are the variances of the continuous responses from Treatment Groups 1 and 2, respectively. Similarly, μ_1 and μ_2 are the means of the responses from the two treatment groups.
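As a rough sketch of how Pocock's procedure could be carried out, the following illustration (not part of the entry) monitors simulated data at K = 5 looks; the function name, data, and the critical value supplied are assumptions, and in practice C_p(K, α) would be taken from a published table or statistical software (2.413 is quoted here only as the commonly tabled two-sided Pocock constant for K = 5 and α = .05).

import numpy as np

def pocock_monitor(x1, x2, var1, var2, look_sizes, c_p):
    # x1, x2: accumulating responses for Treatment Groups 1 and 2
    # var1, var2: known variances sigma_1^2 and sigma_2^2
    # look_sizes: accumulated per-group sample sizes n_k at each of the K visits
    # c_p: Pocock critical value C_p(K, alpha) from a published table
    z_values = []
    for k, n_k in enumerate(look_sizes, start=1):
        z_k = (x1[:n_k].sum() - x2[:n_k].sum()) / np.sqrt(n_k * (var1 + var2))
        z_values.append(z_k)
        if abs(z_k) > c_p:            # boundary crossed: stop the trial early
            return k, z_values
    return None, z_values              # no early stopping; trial runs to visit K

# Illustrative use with simulated data and five equally spaced looks.
rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 1.0, 100)
x2 = rng.normal(0.3, 1.0, 100)
stop_visit, zs = pocock_monitor(x1, x2, 1.0, 1.0, [20, 40, 60, 80, 100], c_p=2.413)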
O'Brien and Fleming's Test

This method uses the same Z_k as in Step 1 of Pocock's test. In Step 2, instead of comparing it with C_p(K, α), it is compared with a different critical value C_B(K, α)√(K/k). Compared to Pocock's test, this test has the advantage of not rejecting the null hypothesis too easily at the beginning of the trial.

The computation of the required sample sizes per treatment group at each interim visit is similar to Pocock's calculation:

   n = R_B(K, α, β) × [ (Z_{α/2} + Z_β)² (σ_1² + σ_2²) / (μ_1 − μ_2)² ] / K.

Wang and Tsiatis's Test

This method also uses the Z_k from Step 1 in Pocock's test. In Step 2, Z_k is compared with a critical value C_WT(K, α, Δ)(k/K)^(Δ − 1/2), where Δ is a prespecified shape parameter. Refer to Sample Size Calculations in Clinical Trial Research, by Shein-Chung Chow, Jun Shao, and Hansheng Wang, for the table of critical values. Pocock's test and O'Brien and Fleming's test are considered to be special cases of Wang and Tsiatis's test.

The calculation of the required sample size per treatment group at each interim visit is formulated as follows:

   n = R_WT(K, α, β) × [ (Z_{α/2} + Z_β)² (σ_1² + σ_2²) / (μ_1 − μ_2)² ] / K.

For calculation of critical values, see the Further Readings section.

Abdus S. Wahed and Sachiko Miyahara

See also Cross-Sectional Design; Internal Validity; Longitudinal Design; Sequential Design

Further Readings

Chow, S., Shao, J., & Wang, H. (2008). Sample size calculations in clinical trial research. Boca Raton, FL: Chapman and Hall.
Jennison, C., & Turnbull, B. (2000). Group sequential methods with applications to clinical trials. New York: Chapman and Hall.

GROWTH CURVE

Growth curve analysis refers to the procedures for describing change of an attribute over time and testing related hypotheses. Population growth curve traditionally consists of a graphical display of physical growth (e.g., height and weight) and is typically used by pediatricians to determine whether a specific child seems to be developing as expected. As a research method, the growth curve is particularly useful to analyze and understand longitudinal data. It allows researchers to describe processes that unfold gradually over time for each individual, as well as the differences across individuals, and to systematically relate these differences
against theoretically important time-invariant and time-varying covariates. This entry discusses the use of growth curves in research and two approaches for studying growth curves.

Growth Curves in Longitudinal Research

One of the primary interests in longitudinal research is to describe patterns of change over time. For example, researchers might be interested in investigating depressive symptoms. Possible questions include the following: Do all people display similar initial levels of depressive symptoms (similar intercepts)? Do some people tend to have a greater increase or decrease in depressive symptoms than others (different slopes)? Separate growth curves can be estimated for each individual, using the following equation:

   Y_it = β_0i + β_1i(Time) + ε_it

That is, the outcome variable Y for individual i is predicted by an intercept of β_0i and a slope of β_1i. The error term at each point in time, ε_it, represents the within-subject error. Each individual will have different growth parameters (i.e., different intercept and slope), and these individual growth curves are used to estimate an aggregate mean and variance for the group intercept and the group slope (see Figure 1). The intercept, also called the initial level or constant, represents the value of the outcome variable when the growth curve or change is first measured (when time = 0). The aggregate intercept determines the average outcome variable for all samples, whereas the aggregate slope indicates the average rate of change for the outcome variable for each incremental time point (e.g., year, month, or day).

[Figure 1: Individual and Aggregate Growth Curves. Individual trajectories for Persons 1 through 5 across Time 1, Time 2, and Time 3, plotted with the group mean line and its intercept and slope.]
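To make these quantities concrete, the short sketch below (an illustration, not part of the entry; the simulated data and parameter values are assumptions) fits a separate straight line to each person's three repeated measures and then summarizes the individual intercepts and slopes by their mean and variance:

import numpy as np

rng = np.random.default_rng(1)
n_people = 50
times = np.array([0.0, 1.0, 2.0])                 # three equally spaced waves

# Simulate Y_it = beta_0i + beta_1i * Time + error for each person.
b0 = rng.normal(10.0, 2.0, n_people)              # individual intercepts
b1 = rng.normal(1.5, 0.5, n_people)               # individual slopes
y = b0[:, None] + b1[:, None] * times + rng.normal(0.0, 1.0, (n_people, 3))

# Separate growth curve (intercept and slope) for each individual by OLS.
coefs = np.polyfit(times, y.T, 1)                 # shape (2, n_people): slopes, then intercepts
slopes, intercepts = coefs[0], coefs[1]

# Aggregate growth parameters: group mean and variance of intercepts and slopes.
print(intercepts.mean(), intercepts.var(ddof=1))
print(slopes.mean(), slopes.var(ddof=1))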
The growth curve can be positive (an incline) or negative (a decline), linear (representing a straight line), or nonlinear. Three or more repeated observations are generally recommended for growth curve analysis. Two waves of data offer very limited information about change and the shape of the growth curves. With three or more waves of data, a linear growth curve can be tested. With four or more waves of data, higher order polynomial alternatives (e.g., quadratic, cubic, logarithmic, or exponential) can be tested. A higher order polynomial growth curve is useful to describe patterns of change that are not the same over time. For example, a rapid increase in weight, height, and muscle mass tends to occur during the first 3 years of childhood; becomes less rapid as children reach their third birthday; and increases rapidly again as they reach puberty. This pattern illustrates the nonlinear trajectories of physical growth that can be captured with additional data points.

In order to measure quantitative changes over time, the study outcome variable must also change continuously and systematically over time. In addition, for each data point, the same instrument must be used to measure the outcome. Consistent measurements help ensure that the changes over time reflect growth and are not due to changes in measurement.

There are two common statistical approaches for studying growth curves. The first approach uses a structural equation modeling framework for estimating growth curves (i.e., latent growth curve analysis). The second approach uses a hierarchical linear modeling (i.e., multilevel modeling) framework. These approaches yield equivalent results.

Growth Curve Within the Structural Equation Modeling Framework

Within the structural equation modeling (SEM) framework, both the initial level and slope are treated as two latent constructs. Repeated
measures at different time points are considered multiple indicators for the latent variable constructs. In a latent growth curve, both the intercept (η_1) and slope (η_2) are captured by setting the factor loadings from the latent variable (η) to the observed variables (Y). The intercept is estimated by fixing the factor loadings from the latent variable (η_1) to the observed variables (repeated measures of Y_1 to Y_3), each with a value of 1. Latent linear slope loadings are constrained to reflect the appropriate time interval (e.g., 0, 1, and 2 for equally spaced time points). The means for intercept (η_1) and slope (η_2) are the estimates of the aggregate intercept and slope for all samples. Individual differences from the aggregate intercept and slope are captured by the variance of the intercept (ζ_1) and slope (ζ_2). The measurement error for each of the three time points is reflected by ε_1, ε_2, and ε_3 (see Figure 2). These measurement errors can be correlated.

[Figure 2: Univariate Latent Growth Curve in a SEM Framework. The latent intercept (η_1) and slope (η_2), with disturbances ζ_1 and ζ_2, load on the observed measures Y_1, Y_2, and Y_3 (intercept loadings fixed to 1, 1, 1; slope loadings fixed to 0, 1, 2), and each observed measure carries a measurement error ε_1, ε_2, ε_3.]

Growth Curve Within the Hierarchical Linear Modeling Framework

In the hierarchical linear modeling (HLM) framework, a basic growth curve model is conceptualized as two levels of analysis. The repeated measures of the outcome of interest are considered "nested" within the individual. Thus, the first level of analysis captures intra-individual changes over time. This is often called the within-person level. In the within-person level, just as in the SEM framework, individual growth trajectories are expected to be different from person to person. This approach is more flexible than the traditional ordinary least squares regression technique, which requires the same parameter values for all individuals. The second level of analysis reflects interindividual change and describes between-person variability in the phenomenon of interest.

   1st level: Y_it = π_0i + π_1i(Time)_it + e_it
   2nd level: π_0i = γ_00 + U_0i
              π_1i = γ_10 + U_1i

   Combined equation: Y_it = γ_00 + γ_10(Time)_it + U_0i + U_1i(Time)_it + e_it

where π_0i and π_1i represent the intercept and slope. They are assumed to vary across individuals (as captured by variances U_0i and U_1i). The residuals for each point in time are represented by e_it and are assumed to be normally distributed with zero means.
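One way such a two-level model might be estimated in practice is with a mixed-effects routine; the sketch below uses the statsmodels library on simulated data (the data frame, column names, and parameter values are assumptions, not part of the entry):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_people, waves = 100, 4
person = np.repeat(np.arange(n_people), waves)
time = np.tile(np.arange(waves), n_people).astype(float)
u0 = rng.normal(0.0, 1.0, n_people)                # U_0i: person-specific intercept deviations
u1 = rng.normal(0.0, 0.3, n_people)                # U_1i: person-specific slope deviations
y = 10 + 1.5 * time + u0[person] + u1[person] * time + rng.normal(0.0, 1.0, n_people * waves)
df = pd.DataFrame({"y": y, "time": time, "person": person})

# Random intercept and random slope for time, grouped by person,
# mirroring the combined equation above.
model = smf.mixedlm("y ~ time", df, groups=df["person"], re_formula="~time")
result = model.fit()
print(result.summary())                             # gamma_00, gamma_10, and variance components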
Florensia F. Surjadi and K. A. S. Wickrama

See also Confirmatory Factor Analysis; Hierarchical Linear Modeling; Hypothesis; Latent Growth Modeling; Latent Variable; Multilevel Modeling; Structural Equation Modeling

Further Readings

Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An introduction to latent variable growth curve modeling: Concepts, issues, and applications (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Rogosa, D., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92, 726–748.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.
Wickrama, K. A. S., Beiser, M., & Kaspar, V. (2002). Assessing the longitudinal course of depression and economic integration of South-East Asian refugees: An application of latent growth curve analysis. International Journal of Methods in Psychiatric Research, 11, 154–168.

GUESSING PARAMETER

In item response theory (IRT), the guessing parameter is a term informally used for the lower asymptote parameter in a three-parameter-logistic (3PL) model. Among examinees who demonstrate very low levels of the trait or ability measured by the test, the value of the guessing parameter is the expected proportion that will answer the item correctly or endorse the item in the scored direction. This can be understood more easily by examining the 3PL model:

   P(θ) = c_i + (1 − c_i) e^{1.7 a_i (θ − b_i)} / [1 + e^{1.7 a_i (θ − b_i)}],    (1)

where θ is the value of the trait or ability; P(θ) is the probability of correct response or item endorsement, conditional on θ; a_i is the slope or discrimination for item i; b_i is the difficulty or threshold for item i; and c_i is the lower asymptote or guessing parameter for item i. Sometimes, the symbol g is used instead of c. In Equation 1, as θ decreases relative to b, the second term approaches zero and thus the probability approaches c. If it is reasonable to assume that the proportion of examinees with very low θ who know the correct answer is virtually zero, it is reasonable to assume that those who respond correctly do so by guessing. Hence, the lower asymptote is often labeled the guessing parameter. Figure 1 shows the probabilities from Equation 1 plotted across the range of θ, for a = 1.5, b = 0.5, and c = 0.2. The range of θ is infinite, but the range chosen for the plot was −3 to +3 because most examinees or respondents would fall within this range if the metric were set such that θ had a mean of 0 and standard deviation of 1 (a common, though arbitrary, way of defining the measurement metric in IRT). This function is called an item characteristic curve (ICC) or item response function (IRF). The value of the lower asymptote or guessing parameter in Figure 1 is 0.2, so as θ becomes infinitely low, the probability of a correct response approaches 0.2.

[Figure 1: Item Characteristic Curve. Probability of a correct response plotted against θ from −3 to +3, with a lower asymptote of 0.2.]
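A brief sketch of Equation 1 follows (an illustration only; the function name is an assumption, and the parameter values echo those used for Figure 1):

import numpy as np

def p_correct(theta, a, b, c):
    # Three-parameter-logistic (3PL) probability of a correct response.
    logistic = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))
    return c + (1.0 - c) * logistic

theta = np.linspace(-3, 3, 7)
print(p_correct(theta, a=1.5, b=0.5, c=0.2))    # approaches 0.2 as theta decreases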
Guessing does not necessarily mean random guessing. If the distractors function effectively, the correct answer should be less appealing than the distractors and thus would be selected less than would be expected by random chance. The lower asymptote parameter would then be less than 1/number of options. Frederic Lord suggested that this is often the case empirically for large-scale tests. Such tests tend to be well-developed, with items that perform poorly in pilot testing discarded before the final forms are assembled. In more typical classroom test forms, one or more of the distractors may be implausible or otherwise not function well. Low-ability examinees may guess randomly from a subset of plausible distractors, yielding a lower asymptote greater than 1/number of options. This could also happen when there is a clue to the right answer, such as the option length. The same effect would occur if examinees can reach the correct answer even by using faulty reasoning or knowledge. Another factor is that examinees who guess tend to choose middle response options; if the correct answer is B or C, the probability of a correct response by guessing would be higher than if the correct answer were A or D.

Because the lower asymptote may be less than or greater than random chance, it may be specified as a parameter to be freely estimated, perhaps with constraints to keep it within
a reasonable range. Estimating the guessing parameter accurately can be difficult. This is particularly true for easy items, because there are few data available at the location where θ is very low relative to item difficulty. For more difficult items, the value of the guessing parameter makes a bigger difference in the range where the examinee scores are, and an ICC with the wrong guessing parameter does not fit the data well. If the guessing parameter is too low, the best fitting ICC will be too flat and will be too high at one end and too low at the other, especially for more discriminating items. Thus, the guessing parameter can be estimated more accurately for items with high difficulty and discrimination.

Although the term guessing parameter is an IRT term, the concept of guessing is also relevant to classical test theory (CTT). For example, when guessing is present in the data, the tetrachoric correlation matrix is often nonpositive definite. Also, CTT scoring procedures can include a guessing penalty to discourage examinees from guessing. For example, one scoring formula is as follows: score = R − W/(k − 1), where R is the number of right answers, W is the number of wrong answers, and k is the number of options for each item. If an examinee cannot eliminate any of the options as incorrect and guesses randomly among all options, on average, the examinee would have a probability of 1/k of a correct response. For every k items guessed, random guessing would add 1 to R and k − 1 to W. Thus, the average examinee's score would be the same whether the items were left blank or a random guessing strategy was employed. Examinees who could eliminate at least one distractor would obtain a higher score, on average, by guessing among the remaining options than by leaving the item blank. From this, it seems that imposing this penalty for guessing would confound scores with test-taking strategy, an unintended construct. These examples show that guessing is not just a complication introduced by IRT models but a challenge for psychometric modeling in general.
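The formula score just described can be written as a one-line function (an illustration only; the function name and the numbers in the usage lines are assumptions):

def formula_score(right, wrong, options):
    # Corrected-for-guessing score: R - W / (k - 1).
    return right - wrong / (options - 1)

print(formula_score(30, 10, 5))    # 27.5 for 30 right and 10 wrong on 5-option items
print(formula_score(1, 4, 5))      # 0.0: random guessing on five 5-option items nets nothing on average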
Christine E. DeMars

See also Classical Test Theory; Item Analysis; Item Response Theory

Further Readings

Baker, F. B. (2001). The basics of item response theory. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. Available at http://edres.org/irt
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39, 247–264.

GUTTMAN SCALING

Guttman scaling was developed by Louis Guttman and was first used as part of the classic work on the American Soldier. Guttman scaling is applied to a set of binary questions answered by a set of subjects. The goal of the analysis is to derive a single dimension that can be used to position both the questions and the subjects. The position of the questions and subjects on the dimension can then be used to give them a numerical value. Guttman scaling is used in social psychology and in education.

An Example of a Perfect Guttman Scale

Suppose that we test a set of children and that we assess their mastery of the following types of mathematical concepts: (a) counting from 1 to 50, (b) solving addition problems, (c) solving subtraction problems, (d) solving multiplication problems, and (e) solving division problems.

Some children will be unable to master any of these problems, and these children do not provide information about the problems, so we will not consider them. Some children will master counting but nothing more; some will master addition and we expect them to have mastered counting but no other concepts; some children will master subtraction and we expect them to have mastered counting and addition; some children will master multiplication and we expect them to have mastered subtraction, addition, and counting. Finally, some children will master division and we expect them to have mastered counting, addition, subtraction, and multiplication. What we do not expect
to find, however, are children, for example, who have mastered division but who have not mastered addition or subtraction or multiplication. So, the set of patterns of responses that we expect to find is well structured and is shown in Table 1. The pattern of data displayed in this table is consistent with the existence of a single dimension of mathematical ability. In this framework, a child has reached a certain level of this mathematical ability and can solve all the problems below this level and none of the problems above this level.

Table 1   The Pattern of Responses of a Perfect Guttman Scale

                          Problems
Children   Counting   Addition   Subtraction   Multiplication   Division
S1            1           0           0               0             0
S2            1           1           0               0             0
S3            1           1           1               0             0
S4            1           1           1               1             0
S5            1           1           1               1             1

Note: A value of 1 means that the child (row) has mastered the type of problem (column); a value of 0 means that the child has not mastered the type of problem.

When the data follow the pattern illustrated in Table 1, the rows and the columns of the table both can be represented on a single dimension. The operations will be ordered from the easiest to the hardest, and a child will be positioned on the right of the most difficult type of operation solved. So the data from Table 1 can be represented by the following order:

   Counting—S1—Addition—S2—Subtraction—S3—Multiplication—S4—Division—S5    (1)

This order can be transformed into a set of numerical values by assigning numbers with equal steps between two contiguous points. For example, this set of numbers can represent the numerical values corresponding to Table 1:

   Counting   S1   Addition   S2   Subtraction   S3   Multiplication   S4   Division   S5
       1       2       3       4        5         6          7          8       9      10

This scoring scheme implies that the score of an observation (i.e., a row in Table 1) is proportional to the number of nonzero variables (i.e., columns in Table 1) for this row.

The previous quantifying scheme assumes that the differences in difficulty are the same between all pairs of contiguous operations. In real applications, it is likely that these differences are not the same. In this case, a way of estimating the size of the difference between two contiguous operations is to consider that this difference is inversely proportional to the number of children who solved a given operation (i.e., an easy operation is solved by a large number of children, a hard one is solved by a small number of children).

How to Order the Rows of a Matrix to Find the Scale

When the Guttman model is valid, there are multiple ways of finding the correct order of the rows and the columns that will give the format of the data as presented in Table 1. The simplest approach is to reorder rows and columns according to their marginal sum. Another theoretically interesting procedure is to use correspondence analysis (which is a type of factor analysis tailored for qualitative data) on the data table; then, the coordinates on the first factor of the analysis will provide the correct ordering of the rows and the columns.
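As a rough illustration of the marginal-sum approach (an illustration only; the data are simply the Table 1 responses with rows and columns deliberately shuffled):

import numpy as np

items = np.array(["Counting", "Addition", "Subtraction", "Multiplication", "Division"])
children = np.array(["S1", "S2", "S3", "S4", "S5"])
data = np.array([[1, 0, 0, 0, 0],
                 [1, 1, 0, 0, 0],
                 [1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 0],
                 [1, 1, 1, 1, 1]])

# Shuffle rows and columns to hide the structure, then restore it
# by sorting on the marginal sums.
rng = np.random.default_rng(3)
r, c = rng.permutation(5), rng.permutation(5)
shuffled = data[r][:, c]

row_order = np.argsort(shuffled.sum(axis=1))      # children, fewest masteries first
col_order = np.argsort(-shuffled.sum(axis=0))     # items, easiest (largest sum) first
restored = shuffled[row_order][:, col_order]
print(children[r][row_order], items[c][col_order])
print(restored)                                    # the triangular pattern of Table 1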
Imperfect Scale

In practice, it is rare to obtain data that fit a Guttman scaling model perfectly. When the data do not conform to the model, one approach is to relax the unidimensionality assumption and assume that the underlying model involves several dimensions. Then, these dimensions can be obtained and analyzed with multidimensional techniques such as correspondence analysis (which can be seen as a multidimensional generalization of Guttman scaling) or multidimensional scaling. Another approach is to consider that the deviations from the ideal scale are random errors. In this case, the problem is to recover the Guttman scale from noisy data. There are several possible ways to fit a Guttman scale to a set of data. The simplest method (called the Goodenough–Edwards method) is to order the rows and the columns according to their marginal sum. An example of
a set of data corresponding to such an imperfect scale is given in Table 2. In this table, the "errors" are indicated with an asterisk (*), and there are three of them. This number of errors can be used to compute a coefficient of reproducibility, denoted CR and defined as

   CR = 1 − (Number of errors / Number of possible errors).    (2)

Table 2   An Imperfect Guttman Scale

                          Problems
Children   Counting   Addition   Subtraction   Multiplication   Division   Sum
C1            1           0            0              0             0       1
C2            1           0*           1*             0             0       2
C3            1           1            1              0             0       3
C4            1           1            0*             1             0       3
C5            1           1            1              1             1       5
Sum           5           3            3              2             1       –

Notes: Values with an asterisk (*) are considered errors. Compare with Table 1 showing a perfect scale.

The number of possible errors is equal to the number of entries in the data table, which is equal to the product of the numbers of rows and columns of this table. For the data in Table 2, there are three errors out of 5 × 6 = 30 possible errors; this gives a value of the coefficient of reproducibility equal to

   CR = 1 − 3/30 = .90.    (3)

According to Guttman, a scale is acceptable if less than 10% of its entries are erroneous, which is equivalent to a scale being acceptable if the value of its CR is equal to or larger than .90. In practice, it is often possible to improve the CR of a scale by eliminating rows or columns that contain a large proportion of errors. However, this practice may also lead to capitalizing on random errors and may give an unduly optimistic view of the actual reproducibility of a scale.
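For completeness, the CR computation for Table 2 can be spelled out as follows (an illustration only; the error count is taken directly from the asterisks in Table 2 rather than recomputed, since error-counting conventions vary):

import numpy as np

table2 = np.array([[1, 0, 0, 0, 0],    # C1
                   [1, 0, 1, 0, 0],    # C2 (two starred entries)
                   [1, 1, 1, 0, 0],    # C3
                   [1, 1, 0, 1, 0],    # C4 (one starred entry)
                   [1, 1, 1, 1, 1]])   # C5
print(table2.sum(axis=1))               # children marginal sums: 1 2 3 3 5
print(table2.sum(axis=0))               # problem marginal sums:  5 3 3 2 1

n_errors = 3                             # errors marked in Table 2
n_possible = 5 * 6                       # rows x columns, as counted in the entry
print(1 - n_errors / n_possible)         # 0.9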
Hervé Abdi

See also Canonical Correlation Analysis; Categorical Variable; Correspondence Analysis; Likert Scaling; Principal Components Analysis; Thurstone Scaling

Further Readings

Dunn-Rankin, P. (1983). Scaling methods. Hillsdale, NJ: Lawrence Erlbaum.
Edwards, A. (1957). Techniques of attitude scale construction. Englewood Cliffs, NJ: Prentice-Hall.
Greenacre, M. J. (2007). Correspondence analysis in practice (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Guest, G. (2000). Using Guttman scaling to rank wealth: Integrating quantitative and qualitative data. Field Methods, 12, 346–357.
Guttman, L. A. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139–150.
Guttman, L. A. (1950). The basis for scalogram analysis. In S. A. Stouffer, L. A. Guttman, & E. A. Schuman, Measurement and prediction: Volume 4 of Studies in Social Psychology in World War II. Princeton, NJ: Princeton University Press.
McIver, J. P., & Carmines, E. G. (1981). Unidimensional scaling. Beverly Hills, CA: Sage.
van der Ven, A. H. G. S. (1980). Introduction to scaling. Chichester, UK: Wiley.
Weller, S. C., & Romney, A. K. (1990). Metric scaling: Correspondence analysis. Thousand Oaks, CA: Sage.
H
HAWTHORNE EFFECT

The term Hawthorne effect refers to the tendency for study participants to change their behavior simply as a result of being observed. Consequently, it is also referred to as the observer effect. This tendency undermines the integrity of the conclusions researchers draw regarding relationships between variables. Although the original studies from which this term was coined have drawn criticism, the Hawthorne effect remains an important concept that researchers must consider in designing studies and interpreting their results. Furthermore, these studies were influential in the development of a field of psychology known as industrial/organizational psychology.

History

The term Hawthorne effect was coined as a result of events at Hawthorne Works, a manufacturing company outside of Chicago. Throughout the 1920s and early 1930s, officials at the telephone parts manufacturing plant commissioned a Harvard researcher, Elton Mayo, and his colleagues to complete a series of studies on worker productivity, motivation, and satisfaction. Of particular interest to the company was the effect of lighting on productivity; they conducted several experiments to examine that relationship. For example, in one study, the researchers manipulated the level of lighting in order to see whether there were any changes in productivity among a small group of workers at the plant. Consistent with predictions, the researchers found that brighter lighting resulted in increased productivity. Unexpectedly, hourly output also increased when lighting was subsequently dimmed, even below the baseline (or usual) level. In fact, any manipulation or changes to the work environment resulted in increased output for the workers in the study. Decades later, a researcher named Henry Landsberger reevaluated the data and concluded that worker productivity increased simply as a result of the interest being shown in them rather than as a result of changes in lighting or any of the other aspects of the environment the researchers manipulated. Although the term Hawthorne effect was derived from this particular series of studies, the term more generally refers to any behavior change that stems from participants' awareness that someone is interested in them.

A Modern-Day Application

Research on the Hawthorne effect extends well beyond the manufacturing industry. In fact, the effect applies to any type of research. For instance, a recent study examined the Hawthorne effect among patients undergoing arthroscopic knee surgery. In that study, all participants were provided standard preoperative information and informed consent about the operation. However, half of the participants received additional information regarding the
purpose of the study. Specifically, their informed consent form also indicated that they would be taking part in a research study investigating patient acceptability of the side effects of anesthesia. The researchers then examined postoperative changes in psychological well-being and physical complaints (e.g., nausea, vomiting, and pain) in the two groups. Consistent with the Hawthorne effect, participants who received the additional information indicating that they were part of a research study reported significantly better postoperative psychological and physical well-being than participants who were not informed of the study. Similar to the conclusions drawn at the Hawthorne Works manufacturing plant, researchers in the knee surgery study noted that a positive response accompanied simply knowing that one was being observed as part of research participation.

Threat to Internal Validity

The Hawthorne effect represents one specific type of reactivity. Reactivity refers to the influence that an observer has on the behavior under observation and, in addition to the Hawthorne effect, includes experimenter effects (the tendency for participants to change their behavior to meet the expectation of researchers), the Pygmalion effect (the tendency of students to change their behavior to meet teacher expectations), and the Rosenthal effect (the tendency of individuals to internalize the expectations, whether good or bad, of an authority figure). Any type of reactivity poses a threat to interpretation about the relationships under investigation in a research study, otherwise known as internal validity. Broadly speaking, the internal validity of a study is the degree to which changes in outcome can be attributed to something the experimenter intended rather than attributed to uncontrolled factors. For example, consider a study in which a researcher is interested in the effect of having a pet on loneliness among the elderly. Specifically, the researcher hypothesizes that elderly individuals who have a pet will report less loneliness than those who do not. To test that relationship, the researcher randomly assigns the elderly participants to one of two groups: one group receives a pet and the other group does not. The researcher schedules monthly follow-up visits with the research participants to interview them about their current level of loneliness. In interpreting the findings, there are several potential reactivity effects to consider in this design. First, failure to find any differences between the two groups could be attributed to the attention that both groups received from the researcher at the monthly visits—that is, reacting to the attention and knowledge that someone is interested in improving their situation (the Hawthorne effect). Conversely, reactivity also could be applied to finding significant differences between the two groups. For example, the participants who received the pet might report less loneliness in an effort to meet the experimenter's expectations (experimenter effects). These are but two of the many possible reactivity effects to be considered in this study.

In sum, many potential factors can impede accurate interpretation of study findings. Reactivity effects represent one important area of consideration when designing a study. It is in the best interest of the researcher to safeguard against reactivity effects to the best of their ability in order to have a greater degree of confidence in the internal validity of their study.

How to Reduce Threat to Internal Validity

The Hawthorne effect is perhaps the most challenging threat to internal validity for researchers to control. Although double-blind studies (i.e., studies in which neither the research participant nor the experimenter are aware to which intervention they are assigned) control for many threats to internal validity, double-blind research designs do not eliminate the Hawthorne effect. Rather, it just makes the effect equal across groups given that everyone knows they are in a research study and that they are being observed. To help mitigate the Hawthorne effect, some have suggested a special study design employing what has been referred to as a Hawthorne control. This type of design includes three groups of participants: the control group who receives no treatment, the experimental group who receives the treatment of interest to the researchers, and the Hawthorne control who receives a treatment that is irrelevant to the outcome of interest to the experimenters. For instance, consider the previous example regarding the effect of having a pet on loneliness in the
elderly. In that example, the control group would not receive a pet, the experimental group would receive a pet, and the Hawthorne control group would receive something not expected to impact loneliness, such as a book about pets. Thus, if the outcome for the experimental group is significantly different from the outcome of the Hawthorne control group, one can reasonably argue that the specific experimental manipulation, and not simply the knowledge that one is observed, resulted in the group differences.

Criticisms

In 2009, two economists from the University of Chicago, Steven Levitt and John List, decided to reexamine the data from the Hawthorne Works plant. In doing so, they concluded that peculiarities in the way the studies were conducted resulted in erroneous interpretations that undermined the magnitude of the Hawthorne effect that was previously reported. For instance, in the original experiments, the lighting was changed on Sundays when the plant was closed. It was noted that worker productivity was highest on Monday and remained high during the first part of the workweek but declined as the workweek ended on Saturday. The increase in productivity on Monday was attributed to the change in lighting the previous day. However, further examination of the data indicated that output always was higher on Mondays relative to the end of the previous workweek, even in the absence of experiments. The economists also were able to explain other observations made during the original investigation by typical variance in work performance unrelated to experimental manipulation. Although there has been some controversy regarding the rigor of the original studies at Hawthorne Works and the subsequent inconclusive findings, most people agree that the Hawthorne effect is a powerful but undesirable effect that must be considered in the design of research studies.

Birth of Industrial Psychology

Regardless of the criticism mounted against the interpretation of the Hawthorne effect, the Hawthorne studies have had lasting impact on the field of psychology. In particular, the Hawthorne studies formed the foundation for the development of a branch of psychology known as industrial/organizational psychology. This particular branch focuses on maximizing the success of organizations and of groups and individuals within organizations. The outcomes of the Hawthorne Works research led to an emphasis on the impact of leadership styles, employee attitudes, and interpersonal relationships on maximizing productivity, an area known as the human relations movement.

Lisa M. James and Hoa T. Vo

See also Experimenter Expectancy Effect; Internal Validity; Rosenthal Effect

Further Readings

De Amici, D., Klersy, C., Ramajoli, F., Brustia, L., & Politi, P. (2000). Impact of the Hawthorne effect in a longitudinal clinical study: The case of anesthesia. Controlled Clinical Trials, 21, 103–114.
Gillespie, R. (1991). Manufacturing knowledge: A history of the Hawthorne experiments. Cambridge, UK: Cambridge University Press.
Mayo, E. (1933). The human problems of an industrial civilization. New York: MacMillan.
Roethlisberger, F. J., & Dickson, W. J. (1939). Management and the worker. Cambridge, MA: Harvard University Press.
Rosenthal, R., & Jacobson, L. (1968, 1992). Pygmalion in the classroom: Teacher expectation and pupils' intellectual development. New York: Irvington.

HEISENBERG EFFECT

Expressed in the most general terms, the Heisenberg effect refers to those research occasions in which the very act of measurement or observation directly alters the phenomenon under investigation. Although most sciences assume that the properties of an entity can be assessed without changing the nature of that entity with respect to those assessed properties, the idea of the Heisenberg effect suggests that this assumption is often violated. In a sense, to measure or observe instantaneously renders the corresponding measurement or observation obsolete. Because reality is not separable from the observer, the process of doing science contaminates reality. Although this term appears frequently in the social
and behavioral sciences, it is actually misleading. For reasons discussed in this entry, some argue it should more properly be called the observer effect. In addition, this effect is examined in relation to other concepts and effects.

Observer Effect

The observer effect can be found in almost any scientific discipline. A commonplace example is taking the temperature of a liquid. This measurement might occur by inserting a mercury-bulb thermometer into the container and then reading the outcome on the instrument. Yet unless the thermometer has exactly the same temperature as the liquid, this act will alter the liquid's post-measurement temperature. If the thermometer's temperature is warmer, then the liquid will be warmed, but if the thermometer's temperature is cooler, then the liquid will be cooled. Of course, the magnitude of the measurement contamination will depend on the temperature discrepancy between the instrument and the liquid. The contamination also depends on the relative amount of material involved (as well as on the specific heat capacities of the substances). The observer effect of measuring the temperature of saline solution in a small vial is far greater than using the same thermometer to assess the temperature of the Pacific Ocean.

As the last example implies, the observer effect can be negligible and, thus, unimportant. In some cases, it can even be said to be nonexistent. If a straight-edge ruler is used to measure the length of an iron bar, under most conditions, it is unlikely that the bar's length will have been changed. Yet even this statement is contingent on the specific conditions of measurement. For instance, suppose that the goal was to measure in situ the length of a bar found deep within a subterranean cave. Because that measurement would require the observer to import artificial light and perhaps even inadvertent heat from the observer's body, the bar's dimension could slightly increase. Perhaps the only natural science in which observer effects are completely absent is astronomy. The astronomer can measure the attributes of a remote stellar object, nebula, or galaxy without any fear of changing the phenomenon. Indeed, as in the case of supernovas, the entity under investigation might no longer exist by the time the photons have traveled the immense number of light years to reach the observer's telescopes, photometers, and spectroscopes.

Observer effects permeate many different kinds of research in the behavioral and social sciences. A famous example in industrial psychology is the Hawthorne effect whereby the mere change in environmental conditions can induce a temporary—and often positive—alteration in performance or behavior. A comparable illustration in educational psychology is the Rosenthal or "teacher-expectancy" effect in which student performance is enhanced in response to a teacher's expectation of improved performance. In fact, it is difficult to conceive of a research topic or method that is immune from observer effects. They might intrude on laboratory experiments, field experiments, interviews, and even "naturalistic" observations—the quotes added because the observations cease to be natural to the extent that they are contaminated by observer effects. In "participant observation" studies, the observer most likely alters the observed phenomena to the very degree that he or she actively participates.

Needless to say, observer effects can seriously undermine the validity of the measurement in the behavioral and social sciences. If the phenomenon reacts to assessment, then the resulting score might not closely reflect the true state of the case at time of measurement. Even so, observer effects are not all equivalent in the magnitude of their interference. On the one hand, participants in laboratory experiments might experience evaluation apprehension that interferes with their performance on some task, but this interference might be both small and constant across experimental conditions. The repercussions are thus minimal. On the other hand, participants might respond to certain cues in the laboratory setting—so-called demand characteristics—by deliberately behaving in a manner consistent with their perception of the experimenter's hypothesis. Such artificial (even if accommodating) behavior can render the findings scientifically useless.

Sometimes researchers can implement procedures in the research design that minimize observer effects. A clear-cut instance is the double-blind trial commonly used in biomedical research. Unlike single-blind trials where only the participant is ignorant of the experimental treatment,
double-blind trials ensure that the experimenter is equally unaware. Neither the experimenter nor the participant knows the treatment condition. Such double-blind trials are especially crucial in avoiding the placebo effect, a contaminant that might include an observer effect as one component. If the experimenter is confident that a particular medicine will cure or ameliorate a patient's ailment or symptoms, that expectation alone can improve the clinical outcomes.

Another instance where investigators endeavor to reduce observer effects is the use of deception in laboratory experiments, particularly in fields like social psychology. If research participants know the study's purpose right from the outset, their behavior will probably not be representative of how they would act otherwise. So the participants are kept ignorant, usually by being deliberately misled. The well-known Milgram experiment offers a case in point. To obtain valid results, the participants had to be told that (a) the investigator was studying the role of punishment on paired-associate learning, (b) the punishment was being administered using a device that delivered real electric shocks, and (c) the learner who was receiving those shocks was experiencing real pain and was suffering from a heart condition. All three of these assertions were false but largely necessary (with the exception of the very last deception).

A final approach to avoiding observer effects is to use some variety of unobtrusive or nonreactive measures. One example is archival data analysis, such as content analysis and historiometry. When the private letters of suicides are content analyzed, the act of measurement cannot alter the phenomenon under investigation. Likewise, when historiometric techniques are applied to biographical information about eminent scientists, that application leaves no imprint on the individuals being studied.

Quantum Physics

The inspiration for the term Heisenberg effect originated in quantum physics. Early in the 20th century, quantum physicists found that the behavior of subatomic particles departed in significant, even peculiar, ways from the "billiard ball" models that prevailed in classical (Newtonian) physics. Instead of a mechanistic determinism in which future events were totally fixed by the prior distributions and properties of matter and energy, reality became much more unpredictable. Two quantum ideas were especially crucial to the concept of the observer effect.

The first is the idea of superimposition. According to quantum theory, it is possible for an entity to exist in all available quantum states simultaneously. Thus, an electron is not in one particular state but in multiple states described by a probability distribution. Yet when the entity is actually observed, it can only be in one specific state. A classic thought experiment illustrating this phenomenon is known as Schrödinger's cat, a creature placed in a box with poison that would be administered contingent on the state of a subatomic particle. Prior to observation, a cat might be either alive or dead, but once it undergoes direct observation, it must occupy just one of these two states. A minority of quantum theorists have argued that it is the observation itself that causes the superimposed states to collapse suddenly into just a single state. Given this interpretation, the result can be considered an observer effect. In a bizarre way, if the cat ends up dead, then the observer killed it by destroying the superimposition! Nonetheless, the majority of theorists do not accept this view. The very nature of observation or measurement in the micro world of quantum physics cannot have the same meaning as in the macro world of everyday Newtonian physics.

The second concept is closest to the source of the term, namely, the 1927 Heisenberg uncertainty principle. Named after the German physicist Werner Heisenberg, this rule asserts that there is a definite limit to how precisely both the momentum and the position of a given subatomic particle, such as an electron, can simultaneously be measured. The more the precision is increased in the measurement of momentum, the less precise will be the concurrent measurement of that particle's position, and conversely. Stated differently, these two particle attributes have linked probability distributions so that if one distribution is narrowed, the other is widened. In early discussions of the uncertainty principle, it was sometimes argued that this trade-off was the upshot of observation. For instance, to determine the location of an electron requires that it be struck with a photon, but that very collision changes the electron's momentum.
Nevertheless, as in the previous case, most quantum theorists perceive the uncertainty as being inherent in the particle and its entanglement with the environment. Position and momentum in a strict sense are concepts in classical physics that again do not mean the same thing in quantum physics. Indeed, it is not even a measurement issue: The uncertainty principle applies independent of the means by which physicists attempt to assess a particle's properties. There is no way to improve measurement so as to lower the degree of uncertainty below a set limit.

In short, the term Heisenberg effect has very little, if any, relation with the Heisenberg uncertainty principle—or for that matter any other idea that its originator contributed to quantum physics. Its usage outside of quantum physics is comparable with that of using Einstein's theory of relativity to justify cultural relativism in the behavioral and social sciences. Behavioral and social scientists are merely borrowing the prestige of physics by adopting an eponym, yet in doing so, they end up forfeiting the very conceptual precision that grants physics more status. For this reason, some argue that it would probably be best if the term Heisenberg effect was replaced with the term observer effect.

Related Concepts

The observer effect can be confused with other ideas besides the Heisenberg uncertainty principle. Some of these concepts are closely related, and others are not.

An instance of the former is the phenomenon that can be referred to as the enlightenment effect. This occurs when the result of scientific research becomes sufficiently well known that the finding renders itself obsolete. In theory the probability of replicating the Milgram obedience experiment might decline as increasingly more potential

former the observer directly affects the participants in the original study; in the latter, it is the original study's results that affect the participants in a later partial or complete replication. Furthermore, for good or ill, there is little evidence that enlightenment effects actually occur. Even the widely publicized Milgram experiment was successfully replicated many decades later.

An example of a divergent concept is also the most superficially similar: observer bias. This occurs when the characteristics of the observer influence how data are recorded or analyzed. Unlike the observer effect, the observer bias occurs in the researcher rather than in the participant. The first prominent example in the history of science appeared in astronomy. Astronomers observing the exact same event—such as the precise time a star crossed a line in a telescope—would often give consistently divergent readings. Each astronomer had a "personal equation" that added or subtracted some fraction of a second to the correct time (defined as the average of all competent observations). Naturally, if observer bias can occur in such a basic measurement, it can certainly infringe on the more complex assessments that appear in the behavioral and social sciences. Hence, in an observational study of aggressive behavior on the playground, two independent researchers might reliably disagree in what mutually observed acts can be counted as instances of aggression. Prior training of the observers might still not completely remove these personal biases. Even so, to the degree that observer bias does not affect the overt behavior of the children being observed, it cannot be labeled as an observer effect.

Dean Keith Simonton

See also Experimenter Expectancy Effect; Hawthorne Effect; Interviewing; Laboratory Experiments; Natural Experiments; Naturalistic Observation; Observational
research participants become aware of the results
Research; Rosenthal Effect; Validity of Measurement
of the original study. Although enlightenment
effects could be a positive benefit with respect to
social problems, they would be a negative cost Further Readings
from a scientific perspective. Science presumes the Chiesa, M., & Hobbs, S. (2008). Making sense of social
accumulation of knowledge, and knowledge can- research: How useful is the Hawthorne effect?
not accumulate if findings cannot be replicated. European Journal of Social Psychology, 38, 6774.
Still, it must be recognized that the observer and Orne, M. T. (1962). On the social psychology of the
enlightenment effects are distinct. Where in the psychological experiment: With particular reference to
Hierarchical Linear Modeling 567

demand characteristics and their implications. Table 1 Parameter Estimates for Different Models
American Psychologist, 17, 776783. Based on the Full Information Maximum
Rosenthal, R. (2002). The Pygmalion effect and its Likelihood (FIML) Estimation
mediating mechanisms. San Diego, CA: Academic
Model A Model B Model C
Press.
Sechrest, L. (2000). Unobtrusive measures. Washington, Fixed-effect estimate
DC: American Psychological Association. Intercept (γ 00) 12.64 12.67 13.15
(SEy ) (.24) (.19) (.21)
SESCij (γ 10) 2.40 2.55
(SEy ) (.12) (.14)
HIERARCHICAL LINEAR MODELING Himintyj (γ 01) — 1.86
(SEy ) (.40)
Hierarchical linear modeling (HLM, also known SESCij*Himintyj (γ 11) — .57
as multilevel modeling) is a statistical approach (SEy ) (.25)
for analyzing hierarchically clustered observa-
tions. Observations might be clustered within Variance estimate of
experimental treatment (e.g., patients within the random effect
group treatment conditions) or natural groups τ 00 8.55 4.79 4.17
(e.g., students within classrooms) or within indi- τ 11 .40 .34
viduals (repeated measures). HLM provides τ 10 (or τ01) .15ns .35ns
proper parameter estimates and standard errors σ2 39.15 36.83 36.82
for clustered data. It also capitalizes on the hier-
archical structure of the data, permitting Deviance statistic (D) 47,113.97 46,634.63 46,609.06
researchers to answer new questions involving Number of parameters 3 6 8
the effects of predictors at both group (e.g., class Notes:y Standard error of fixed effects in parentheses. All
size) and individual (e.g., student ability) levels. parameter estimates are significant at p < .05 unless marked
Although the focus here is on two-level models ns
. Model A contains no predictor. Model B has SESCij as
with continuous outcome variables, HLM can be predictor, and Model C has both SESCij and Himintyj as
extended to other forms of data (e.g., binary predictors.
variables, counts) with more than two levels of
clustering (e.g., student, classroom, and school). schools) affects the level of achievement. In HLM,
The key concepts in HLM are illustrated in this separate sets of regression equations are written at
entry using a subsample of a publically accessi- the individual (level 1) and group (level 2) levels of
ble data set based on the 1982 High School and analysis.
Beyond (HS&B) Survey. The partial HS&B data
set contains a total of 7,185 students nested
within 160 high schools, which is included in the Level 1 (Student-Level) Model
free student version of HLM available from Sci- MathAchij ¼ β0j þ eij ; ð1Þ
entific Software International, Inc. Mathematics
achievement (MathAch) will be used as the out-
come variable in a succession of increasingly where i represents each student and j represents
complex models. The results of Models A and B each school. Note that no predictors are included
discussed here were reported by Stephen Rau- in Equation 1. β0j is the mean MathAch score for
denbush and Anthony Bryk in their HLM text. school j: eij is the within-school residual that cap-
tures the difference between individual MathAch
score and the school mean MathAch. eij is
Some Important Submodels assumed to be normally distributed, and the vari-
ance of eij is assumed to be homogeneous across
Model A: Random-Intercepts Model
schools [i.e., eij ∼ N(0, σ 2 Þ for all 160 schools]. As
The random-intercepts model is the simplest presented in Table 1 (Model A), the variance of eij
model in which only group membership (here, is equal to σ 2 ¼ 39.15.
568 Hierarchical Linear Modeling

Level 2 (School-Level) Model Model B: Random-Coefficients


Regression Model With Level 1 Predictor
The level 2 model partitions each school’s mean
MathAch score into two parts The student’s socioeconomic status (SES) is
added into the model as a level 1 predictor. SES
β0j ¼ γ 00 þ U0j : ð2Þ has been centered, SESC ¼ SES  Mean(SES), so
that SESC has a mean of 0 in the full sample of
7,185 students.
Here, γ 00 ¼ 12.64 is the overall mean MathAch
score, averaging over the 160 school means. U0j Level 1 (Student-Level) Model
captures the residual difference between individual MathAchij ¼ β0j þ β1j SESCij þ eij : ð4Þ
school mean MathAch and the overall mean
MathAch. τ00 ¼ Var(U0j Þ ¼ 8.55 is the variance of β0j in Equation 4 is the estimated math achieve-
the residuals at level 2. ment score in school j for the students who have
The random-intercept model not only provides a mean SES score (i.e., SESC ¼ .00). β1j is the
an important baseline for model comparison, but amount of change in MathAch in school j for a 1-
it also allows the computation of the intraclass unit change in SESC . eij is the within-school ran-
correlation (ICC), which is the proportion of the dom error with variance equal to σ 2 [i.e., eij ∼
between variance to the sum of the between and N(0, σ 2 Þ]. Conceptually, this same regression
within variance. model is applied to each of the 160 schools, yield-
ing 160 different sets of regression coefficients
τ00 (β0j; β1j Þ.
ICC ¼ : ð3Þ
τ 00 þ σ 2
Level 2 (School-Level) Model

In this example, the ICC is β0j ¼ γ 00 þ U0j : ð5Þ

τ 00 8:55 β1j ¼ γ 10 þ U1j : ð6Þ


ICC ¼ 2
¼ ¼ :18:
τ00 þ σ 8:55 þ 39:15
As shown in Table 1, Model B, γ 00 ¼ 12.67 is
This ICC generally ranges from 0 to 1 based on the grand intercept, which is the overall average
Equation 3, with higher values indicating greater MathAch score for students who have the mean
clustering. As the product of the ICC and average SES score (i.e., SESC ¼ .00). γ 10 ¼ 2.40 is the over-
cluster size increases, the Type I error rate for the all grand slope (or regression coefficient) between
study quickly increases to unacceptable levels if SES and MathAch, indicating that MathAch
ignoring the clustering and treating all observa- increases 2.40 points for each 1-unit change in SES
tions as independent from each other. For exam- over the 160 schools, a positive relationship. U0j is
ple, with an average of approximately 45 students the residual from the grand intercept in Equation
per school (i.e., 7,185 students/160 schools) and 5, and U1j is the residual from the grand slope in
an ICC of approximately .20, a t test that com- Equation 6, respectively. The residuals are assumed
pared the mean of two different school types (i.e., to have
 a multivariate normal  distribution
 (i.e.,
high minority enrollment vs. low minority enroll- U0j τ00 τ01
∼ Nð0; TÞ, where T ¼ Þ, so they
ment), but ignored the clustering, would have U1j τ10 τ11
a Type I error rate of more than .50 compared can be summarized by three parameters: τ00 ¼
with the nominal level of α ¼ .05. Similarly, the 4.79 is the variance of the intercepts, τ 11 ¼ .40
estimated 95% confidence interval is approxi- is the variance of the slopes, and τ10 ¼ τ01 ¼ .15
mately .33 of the width of the correct confidence is the covariance of the slope and intercept across
interval. In contrast, use of HLM procedures can the 160 schools. The negative sign of τ 10 (¼ τ 01 Þ
maintain the Type I error rate at the nominal α ¼ indicates that schools with higher intercepts tend
.05 level. to be associated with relatively low slopes (i.e.,
Hierarchical Linear Modeling 569

weaker relation) between SESC and MathAch. Level 2 (School-Level) Model


Each parameter in Table 1, Model B, can be tested
using a Wald z test, z ¼ (estimate)/(standard error). β0j ¼ γ 00 þ γ 01 Himintyj þ U0j ; ð8Þ
The variances of the intercepts and slopes, τ00 and
τ 11 , are both statistically significant (p < :05),
whereas the covariance, τ01 , between the intercepts β1j ¼ γ 10 þ γ 11 Himintyj þ U1j : ð9Þ
and slopes is not (p > :10). This result suggests
that school-related variables might potentially be To interpret the meaning of the fixed-effect
able to account for the variation in the regression coefficients in Equations 8 and 9, it is useful to
models across schools. substitute in the corresponding values of the
Raudenbush and Bryk suggest using the base- dummy codes for different type of school—0 or 1.
line random-intercepts model to calculate a mea-
Low minority school : β0j ¼ γ 00 þ γ 01 ð0Þ
sure of the explained variance (or pseudo-R2 Þ,
which is analogous to the R-square in the ordinary þ U0j β0j ¼ γ 00 þ γ 01 ð1Þ þ U0j
least-squares (OLS) regression. For the level 1
model, the explained variance is the proportional
reduction in the random error variance (σ 2 Þ of the High minority school : β1j ¼ γ 10 þ γ 11 ð0Þ
full model, including all level 1 predictors (here þ U1j β1j ¼ γ 10 þ γ 11 ð1Þ þ U1j
Model B) relative to the random-intercepts model
with no predictors (here Model A). Suppose that As shown in Table 1, Model C, γ 00 ¼ 13.15 is
the estimated random error variance [i.e., Var(eij Þ] the mean intercept, which is the mean MathAch
of Model A is σ^ A and that the estimated error vari- score for students with mean SES (i.e., SESC ¼
ance of the Model B is σ^ B . Then, the explained .00) in schools with relatively low minority
variance [resulting from the additional level 1 pre- enrollment. γ 10 ¼ 2.55 is the mean slope of the
dictor(s)] can be calculated by the following equa- relation between MathAch and SES for the
tion: schools with the low percentage of minority stu-
dents. γ 01 ¼ 1.86 is the difference in the mean
σ^ 2A  σ^ 2B intercepts between the two types of schools,
σ^ 2explained ¼ : ð7Þ
σ 2A which is the difference in mean MathAch scores
for students with mean SES (SESC ¼ 0). γ 11 ¼
In this example, adding SESC to the level 1 .57 is the difference in mean slopes between
model can explain the two types of schools. All fixed-effect esti-
σ^ 2A  σ^ 2B 39:15  36:83 mates/regression coefficients are significant, and
σ^ 2explained ¼ ¼ ¼ :06; the corresponding models for the two school
σ^ 2A 39:15
types are presented in Figure 1. On average, high
.06 or 6% of the variance in MathAch. minority-school students with mean SES score
(SESC ¼ 0) are lower on MathAch (by γ 01 ¼
1.86 points) than their low minority-school
counterparts. The average slope between
Model C: Random-Coefficients Regression
MathAch and SES is weaker for the high minor-
Model With Level 1 and Level 2 Predictors
ity-school students (γ 10 þ γ 11 ¼ 2.55  .57 ¼
A second predictor, high minority percentage 1.98 MathAch points per 1-unit increase in
(i.e., Himinty), is added to the level 2 model in SESC Þ than the low minority-school students
addition to SESC at level 1. Himinty is a dummy (γ 10 ¼ 2.55 MathAch points per 1-unit increase
variable in which 44 schools with greater than in SESC Þ.
40% minority enrollment are coded as 1 and 116 The explained level 2 variance for each of the
schools with less than 40% minority enrollment two random effects can also be calculated using
are coded as 0. The level 1 equation (Equation 4) the explained variance as described previously.
does not change, and the corresponding level 2 Here, Model B is the unconditional model (with-
models are presented as follows. out any level 2 predictors) and Model C is the
570 Hierarchical Linear Modeling

Math Ach fixed-effect parameters in HLM, especially when


the number of level 2 units/clusters is small.
20 Low Minority Deviance statistics can be used to compare
18 School
nested models. If Model 1 is a special case of Model
16
2 (i.e., Model 2 can be changed into Model 1 by
14 imposing constraints), then Model 1 is nested
High Minority
12 School within Model 2. In this example, Model B is nested
10 within Model C because Model B is a special case
8
of Model C in which both γ 01 and γ 11 are con-
strained to zero. A higher deviance indicates poorer
6
fit. The difference in the deviance statistics between
4 nested models follows a central chi-square distribu-
2 tion with degrees of freedom equal to the difference
0 SESc in the number of parameters between the models.
−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5
The REML deviance statistic can be used only for
models with nested random effects. In contrast,
Figure 1 The Estimated Overall Average Models for models with both nested random and fixed effects
the Two Different Types of Schools can be compared using the FIML deviance statistics.
For example, one can examine whether the addition
of Himinty to both the intercept and slope equa-
conditional model (with Himinty in the model), tions contributes significantly to the overall model
and the corresponding explained variances for the by comparing the deviance statistics between Mod-
two random effect variances are els B and C, which are estimated with FIML. The
significant difference in the deviance statistics (i.e.,
τ 00 B  τ00 C 4:79  4:17 χ2 (2) ¼ DB  DC ¼ 46,634.63  46,609.06 ¼
τ00 explained ¼ ¼ ¼ :13
τ 00 B 4:79 25.57, p < .001) indicates that the more complex
model (i.e., Model C with Himinty) fits the data
and better than the parsimonious model (i.e., Model B
with fewer parameters). However, a nonsignificant
τ11 B  τ 11 C :40  :34
τ 11 explained ¼ ¼ ¼ :15: difference in the deviance statistics indicates that
τ11 B :40 the fit of the parsimonious model to the data does
not differ from that of the complex model. Addi-
That is, .13 (or 13%) of the intercept variance
tionally, criteria such as Akaike information crite-
(τ00 Þ and .15 (or 15%) of the slope variance (τ 11 Þ
rion (AIC) and Bayes information criterion (BIC)
can be explained by adding the school type vari-
combine information from the deviance statistic,
able (i.e., Himinty) in the level 2 model. Note that
model complexity, and sample size to help select
although this way of calculating explained vari-
the model with the optimal combination of fit and
ance is straightforward, it is not fully analogous to
parsimony.
R2 in multiple regression—it can sometimes result
in a negative explained variance.
Examining Assumptions
Other Important Features Violations of the assumptions including normal-
ity and homoscedasticity of the level 1 and level 2
Estimation Methods and Model Comparison
residuals in HLM can be treated as a signal
Restricted maximum likelihood (REML), full of misspecification in the hypothesized model.
information maximum likelihood (FIML), and The homogeneity assumption can be examined
empirical Bayesian (EB) estimation are commonly through the probability plot of the standardized
used in HLM to provide accurate parameter esti- residual dispersions, and the normality assumption
mation. Among these three common estimation can be examined through a qq plot in which
methods, REML is preferable for estimating the the ordered expected Mahalanobis distances are
Histogram 571

plotted against the observed Mahalanobis dis- consists of interval- or ratio-level data and is usu-
tances across all clusters. ally displayed on the abscissa (x-axis), and the fre-
quency data on the ordinate (y-axis), with the
height of the bar proportional to the count. If the
Diagnostics data for the independent variable are put into
‘‘bins’’ (e.g., ages 04, 59, 1014, etc.), then the
Toby Lewis and colleague proposed a top-down width of the bar is proportional to the width of
procedure (i.e., from highest level to lowest level) the bin. Most often, the bins are of equal size, but
using diagnostic measures such as leverage, inter- this is not a requirement. A histogram differs from
nally and externally Studentized residuals, and a bar chart in two ways. First, the independent
DFFITS for different level observations, which are variable in a bar chart consists of either nominal
analogous to diagnostic measures commonly used (i.e., named, unordered categories, such as reli-
in OLS regression. gious affiliation) or ordinal (ranks or ordered cate-
Oi-Man Kwok, Stephen G. West, and Ehri Ryu gories, such as stage of cancer) data. Second, to
emphasize the fact that the independent variable is
See also Fixed-Effects Models; Intraclass Correlation; not continuous, the bars in a bar chart are sepa-
Multilevel Modeling; Multiple Regression; Random- rated from one another, whereas they abut each
Effects Models other in a histogram. After a bit of history, this
entry describes how to create a histogram and then
discusses alternatives to histograms.
Further Readings
de Leeuw, J., & Meijer, E. (2008). Handbook of
multilevel analysis. New York: Springer.
A Bit of History
Hox, J. J. (2002). Multilevel analysis: Techniques and The term histogram was first used by Karl Pearson
applications. Mahwah, NJ: Lawrence Erlbaum. in 1895, but even then, he referred to it as a ‘‘com-
Kreft, I. G. G., & de Leeuw, J. (1998). Introducing
mon form of graphical representation,’’ implying
multilevel modeling. Thousand Oaks, CA: Sage.
Lewis, T., & Langford, I. H. (2001). Outliers, robustness that the technique itself was considerably older.
and the detection of discrepant data. In A. H. Leyland Bar charts (along with pie charts and line graphs)
& H. Goldstein (Eds.), Multilevel modelling of health were introduced over a century earlier by William
statistics (pp. 7591). New York: Wiley. Playfair, but he did not seem to have used histo-
O’Connell, A. A., & McCoach, D. B. (2008). Multilevel grams in his books.
modeling of educational data. Charlotte, NC:
Information Age Publishing.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Creating a Histogram
linear models: Applications and data analysis methods
Consider the hypothetical data in Table 1, which
(2nd ed.). Thousand Oaks, CA: Sage.
tabulates the number of hours of television
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel
analysis: An introduction to basic and advanced watched each week by 100 respondents. What is
multilevel modeling. Thousand Oaks, CA: Sage. immediately obvious is that it is impossible to
comprehend what is going on. The first step in try-
ing to make sense of these data is to put them in
Websites rank order, from lowest to highest. This says that
Scientific Software International, Inc. HLM 6, Student the lowest value is 0 and the highest is 64, but it
Version: http://www.ssicentral.com/hlm/student.html does not yield much more in terms of understand-
ing. Plotting the raw data would result in several
problems. First, many of the bars will have heights
of zero (e.g., nobody reported watching for one,
HISTOGRAM two, or three hours a week), and most of the other
bars will be only one or two units high (i.e., the
A histogram is a method that uses bars to display number of people reporting that specific value).
count or frequency data. The independent variable This leads to the second problem, in that it makes
572 Histogram

Table 1 Fictitious Data on How Many Hours of often, the answer is somewhere between 6 and 15,
Television Are Watched Each Week by 100 with the actual number depending on two consid-
People erations. The first is that the bin size should be an
Subjects Data easily comprehended size. Thus, bin sizes of 2, 5,
15 41 43 14 35 31 10, or 20 units are recommended, whereas those
610 39 22 9 32 49 of 3, 7, or 9 are not. The second consideration is
1115 12 27 53 7 23 esthetics; the graph should get the point across and
1620 29 22 22 26 14 not look too cluttered.
2125 33 34 12 13 16 Although several formulas have been proposed
2630 34 25 40 5 41 to determine the width of the bins, the simplest is
3135 43 30 40 44 12 arguably the most useful. It is the range of the
3640 55 14 25 32 10 values (largest minus smallest) divided by the
4145 30 28 25 23 0 desired number of bins. For these data, the range
4650 56 24 17 15 33 is 64, and if 10 bins are desired, it would lead to
5155 30 15 29 20 14 a bin width of 6 or 7. Because these are not widths
5660 40 26 24 34 49 that are easy to comprehend, the closest compro-
6165 50 26 13 36 47 mise would be 5. Table 2 shows the results of put-
6670 19 9 64 35 33 ting the data in bins of five units each. The first
7175 35 39 9 25 41 column lists the values included in each bin; the
7680 5 18 54 11 59 second column provides the midpoint of the bin;
8185 36 36 37 52 29 and the third column summarizes the number of
8690 24 22 41 36 31 people in each bin. The last column, which gives
9195 32 10 50 45 23 the cumulative total for each interval, is a useful
96100 24 15 5 20 52 check that the counting was accurate.
However, there is a price to pay for putting the
data into bins, and it is that some information is
Table 2 The Data in Table 1 Grouped into Bins lost. For example, Table 2 shows that seven people
watched between 15 and 19 hours of television
Interval Midpoint Count Cumulative total
per week, but the exact amounts are now no lon-
04 2 1 1 ger known. In theory, all seven watched 17 hours
59 7 7 8 each. In reality, only one person watched 17 hours,
1014 12 12 20 although the mean of 17 for these people is rela-
1519 17 7 27 tively accurate. The larger the bin width, the more
2024 22 13 40 information that is lost.
2529 27 12 52 The scale on the y-axis should allow the largest
3034 32 14 66 number in any of the bins to be shown, but again
3539 37 10 76 it should result in divisions that are easy to grasp.
4044 42 10 86 For example, the highest value is 14, but if this
4549 47 4 90 were chosen as the top, then the major tick marks
5054 52 6 96 would be at 0, 7, and 14, which is problematic for
5559 57 3 99 the viewer. It would be better to extend the y-axis
6064 62 1 100 to 15, which will result in tick marks every five
units, which is ideal. Because the data being plot-
ted are counts or frequencies, the y-axis most often
it difficult to discern any pattern. Finally, the starts at zero. Putting all this together results in the
x-axis will have many values, again interfering histogram in Figure 1.
with comprehension. The exception to the rule of the y-axis starting
The solution is to group the data into mutually at zero is when all of the bars are near the top of
exclusive and collectively exhaustive classes, or the graph. In this situation, small but important
bins. The issue is how many bins to use. Most differences might be hidden. When this occurs, the
Holm’s Sequential Bonferroni Procedure 573

15 left is zero for days in hospital (a logical limit) and


approximately 50 milliseconds for reaction time (a
physiological limit), with no upper limit. Such an
Number of People

10
asymmetry could be a warning not to use certain
statistical tests with these data, if the tests assume
that the data are normally distributed. Finally, the
graph can easily show whether the data are unimo-
5
dal, bimodal, or have more than two peaks that
stand out against all of the other data.

0
2 12 22 32 42 52 62 Alternatives to Histograms
Number of Hours
One alternative to a histogram is a frequency poly-
gon. Instead of drawing bars, a single point is
Figure 1 Histogram Based on Table 2 placed at the top of the bin, corresponding to its
midpoint, and the points are connected with lines.
The only difference between a histogram and a fre-
100
quency polygon is that, by convention, an extra
bin is placed at the upper and lower ends with
Number of People

90 a frequency of zero, tying the line to the x-axis.


Needless to say, this is omitted if that category is
nonsensical (e.g., an age less than 0).
80 Another variation is the stem-and-leaf display.
This is a histogram where the ‘‘bars’’ consist of the
70
actual values, combining a graph with a data table.
Its advantage is that no information is lost because
0 of binning.
2 12 22 32 42 52 62
Number of Hours David L. Streiner

See also Graphical Display of Data; Line Graph; Pie


Figure 2 Histogram Showing a Discontinuity in the y-Axis Chart

bottom value should still be zero, and there would Further Readings
be a discontinuity before the next value. It is
important, though, to flag this for the viewer, Cleveland, W. S. (1985). The elements of graphing data.
by having a break within the graph itself, as in Pacific Grove, CA: Wadsworth.
Figure 2. Robbins, N. B. (2005). Creating more effective graphs.
Hoboken, NJ: Wiley.
The histogram is an excellent way of displaying
Streiner, D. L., & Norman, G. R. (2007). Biostatistics:
several attributes about a distribution. The first is The bare essentials (3rd ed.). Shelton, CT: People’s
its shape—is it more or less normal, or rectangular, Medical Publishing House.
or does it seem to follow a power function, with
counts changing markedly at the extremes? The
second attribute is its symmetry—is the distribu-
tion symmetrical, or are there very long or heavy HOLM’S SEQUENTIAL
tails at one end? This is often seen if there is a natu-
ral barrier at one end, beyond which the data can- BONFERRONI PROCEDURE
not go, but no barrier at the other end. For
example, in plotting length of hospitalization or The more statistical tests one performs the more
time to react to some stimulus, the barrier at the likely one is to reject the null hypothesis when it is
574 Holm’s Sequential Bonferroni Procedure

true (i.e., a false alarm, also called a Type 1 error). cannot occur simultaneously). Therefore, the prob-
This is a consequence of the logic of hypothesis ability of not making a Type I error on one trial is
testing: The null hypothesis for rare events is equal to
rejected in this entry, and the larger the number of
tests, the easier it is to find rare events that are false 1  α ¼ 1  :05 ¼ :95:
alarms. This problem is called the inflation of the
Recall that when two events are independent,
alpha level. To be protected from it, one strategy is
the probability of observing these two events
to correct the alpha level when performing multiple
together is the product of their probabilities. Thus,
tests. Making the alpha level more stringent (i.e.,
if the tests are independent, the probability of not
smaller) will create less errors, but it might also
making a Type I error on the first and the second
make it harder to detect real effects. The most well-
tests is
known correction is called the Bonferroni correc- :95 × :95 ¼ ð1  :05Þ2 ¼ ð1  αÞ2 :
tion; it consists in multiplying each probability by
the total number of tests performed. A more power- With three tests, the probability of not making
ful (i.e., more likely to detect an effect exists) a Type I error on all tests is
sequential version was proposed by Sture Holm in
1979. In Holm’s sequential version, the tests need :95 × :95 × :95 ¼ ð1  :05Þ3 ¼ ð1  αÞ3 :
first to be performed in order to obtain their p
values. The tests are then ordered from the one For a family of C tests, the probability of not mak-
with the smallest p value to the one with the largest ing a Type I error for the whole family is
p value. The test with the lowest probability is C
tested first with a Bonferroni correction involving ð1  αÞ :
all tests. The second test is tested with a Bonferroni
For this example, the probability of not making
correction involving one less test and so on for the
a Type I error on the family is
remaining tests. Holm’s approach is more powerful
than the Bonferroni approach, but it still keeps ð1  αÞC ¼ ð1  :05Þ10 ¼ :599:
under control the inflation of the Type 1 error.
Now, the probability of making one or more
The Different Meanings of Alpha Type I errors on the family of tests can be deter-
mined. This event is the complement of the event
When a researcher performs more than one statis- not making a Type I error on the family, and there-
tical test, he or she needs to distinguish between fore, it is equal to
two interpretations of the α level, which represents
the probability of a Type 1 error. The first interpre- 1  ð1  αÞC :
tation evaluates the probability of a Type 1 error
for the whole set of tests, whereas the second eval- For this example,
uates the probability for only one test at a time.
1  ð1  :05Þ10 ¼ :401:

Probability in the Family So, with an α level of .05 for each of the 10 tests,
the probability of incorrectly rejecting the null
A family of tests is the technical term for a series
hypothesis is .401.
of tests performed on a set of data. This section
This example makes clear the need to distin-
shows how to compute the probability of rejecting
guish between two meanings of α when perform-
the null hypothesis at least once in a family of tests
ing multiple tests:
when the null hypothesis is true.
For convenience, suppose that the significance 1. The probability of making a Type I error when
level is set at α ¼ .05. For each test the probability dealing only with a specific test. This
of making a Type I error is equal to α ¼ .05. The probability is denoted α[PT] (pronounced
events ‘‘making a Type I error’’ and ‘‘not making ‘‘alpha per test’’). It is also called the testwise
a Type I error’’ are complementary events (they alpha.
Holm’s Sequential Bonferroni Procedure 575

2. The probability of making at least one Type I independent tests, and you want to limit the risk
error for the whole family of tests. This of making at least one Type I error to an overall
probability is denoted α[PT] (pronounced value of α[PF] = .05, you will consider a test signif-
‘‘alpha per family of tests’’). It is also called the icant if its associated probability is smaller than
familywise or the experimentwise alpha.
α½PT ¼ 1  ð1  α½PFÞ1=C
How to Correct for Multiple Tests ¼ 1  ð1  :05Þ
1=4
¼ :0127:
Recall that the probability of making at least
one Type I error for a family of C tests is With the Bonferroni approximation, a test reaches
significance if its associated probability is smaller
α½PF ¼ 1  ð1  α½PTÞC : ð1Þ than

This equation can be rewritten as α½PF :05


α½PT ¼ ¼ ¼ :0125;
C 4
α½PT ¼ 1  ð1  α½PFÞ1=C : ð2Þ
which is very close to the exact value of .0127.
This formula—derived assuming independence of
the tests—is sometimes called the  Sidák equation.  ak Correction for a p Value
Bonferroni and Sid
It shows that in order to maintain a given α[PF]
level, the α[PT] values used for each test need to When a test has been performed as part of
be adapted. a family comprising C tests, the p value of this test
Because the  Sidák equation involves a fractional can be corrected with the Sidák or Bonferroni
power, it is difficult to compute by hand, and there- approaches by replacing α[PF] by p in Equations 1
fore, several authors derived a simpler approxima- or 3. Specifically, the Sidák corrected p value for C
tion, which is known as the Bonferroni (the most comparisons, denoted pSidak;C becomes
popular name), Boole, or even Dunn approxima-
C
tion. Technically, it is the first (linear) term of a pSidak; C ¼ 1  ð1  pÞ ; ð6Þ
Taylor expansion of the  Sidák equation. This
approximation gives and the Bonferroni corrected p value for C com-
parisons, denoted pBonferroni; C becomes
α½PF ≈ C × α½PT ð3Þ
pBonferroni; C ¼ C × p : ð7Þ
and
Note that the Bonferroni correction can give
α½PF a value of pSidak; C larger than 1. In such cases,
α½PT ≈ : ð4Þ
C pBonferroni; C is set to 1.

Sidák and Bonferroni are linked to each other
by the inequality Sequential HolmSidák and HolmBonferroni

α½PF Holm’s procedure is a sequential approach whose


α½PT ¼ 1  ð1  α½PFÞ1=C ≥ : ð5Þ goal is to increase the power of the statistical tests
C
while keeping under control the familywise Type I
They are, in general, very close to each other, but error. As stated, suppose that a family comprising
the Bonferroni approximation is pessimistic (it C tests is evaluated. The first step in Holm’s proce-
always does worse than the  Sidák equation). Prob- dure is to perform the tests to obtain their p
ably because it is easier to compute, the Bonferroni values, Then the tests are ordered from the one
approximation is more well known (and cited with the smallest p value to the one with the larg-
more often) than the exact  Sidák equation. est p value. The test with the smallest probability
The 
SidákBonferroni equations can be used to will be tested with a Bonferroni or a Sidák correc-
find the value of α[PT] when α[PF] is fixed. For tion for a family of C tests (Holm used a Bonfer-
example, suppose that you want to perform four roni correction, but Sidák gives an accurate value
576 Holm’s Sequential Bonferroni Procedure

and should be preferred to Bonferroni, which is an pSidak; i|C ¼ 1  ð1  pÞCiþ1 ¼ pSidak; 3|3
approximation). If the test is not significant, then
the procedure stops. If the first test is significant, ¼ 1  ð1  0:000040Þ31þ1
the test with the second smallest p value is then ¼ pSidak; 1|3 ¼ 1  ð1  0:000040Þ3 ð10Þ
corrected with a Bonferroni or a  Sidák approach
¼ 1  0:999603
for a family of (C  1) tests. The procedure stops
when the first nonsignificant test is obtained or ¼ 0:000119 :
when all the tests have been performed. Formally,
assume that the tests are ordered (according to Using the Bonferroni approximation (cf. Equation
their p values) from 1 to C; and that the procedure 9) will give a corrected p value of pBonferroni; 1=3 ¼
stops at the first nonsignificant test. When using .000120. Because the corrected p value for the first
the  Sidák correction with Holm’s approach, test is significant, the second test can then be per-
the corrected p value for the ith test, denoted formed for which i ¼ 2 and p ¼ .016100. Using
pSidak; i=C ; is computed as Equations 8 and 9, the corrected p values of
pSidak; 2=3 ¼ .031941 and pBonferroni; 2=3 ¼ .032200
pSidak; i|C ¼ 1  ð1  pÞCiþ1 : ð8Þ are found. The corrected p values are significant,
and, so, the last test can be performed for which i ¼
When using the Bonferroni correction with 3. Because this is the last of the series, the corrected
Holm’s approach, the corrected p value for the ith p values are now equal to the uncorrected p value of
test, denoted pBonferroni; i=C ; is computed as p ¼ pSidak; 3=3 ¼ pBonferroni; 3=3 ¼ :612300, which is
clearly not significant. Table 1 gives the results of the
pBonferroni; i|C ¼ ðC  i þ 1Þ × p: ð9Þ Holm’s sequential procedure along with the values
of the standard Sidák and Bonferroni corrections.
Just like the standard Bonferroni procedure, cor-
rected p values larger than 1 are set equal to 1.
Correction for Nonindependent Tests

The Sidák equation is derived assuming indepen-
Example
dence of the tests. When they are not independent,
Suppose that a study involving analysis of var- it gives a conservative estimate. The Bonferonni
iance has been designed and that there are three being a conservative estimation of Sidák will also
tests to be performed. The p values for these give a conservative estimate. Similarly, the sequen-
three tests are equal to 0.000040, 0.016100, and tial Holm’s approach is conservative when the tests
0.612300 (they have been ordered from the are not independent. Holm’s approach is obviously
smallest to the largest). Thus, C ¼ 3. The first more powerful than Sidák’s (because the pSidak; i|C
test has an original p value of p ¼ 0.000040. values are always smaller than or equal to the
Because it is the first of the series, i ¼ 1, and its pSidak; i|C values), but it still controls the overall
corrected p value using the HolmSidák familywise error rate. The larger the number of
approach (cf. Equation 8) is equal to tests, the larger the increase in power with Holm’s

Table 1  ak, Bonferroni, Holm–


Sid Sidak, and Holm–Bonferroni Corrections for Multiple Comparisons for a Set of
C = 3 Tests with p Values of 0.000040, 0.016100, and 0.612300, Respectively

Sid ak Bonferroni HolmSid  ak HolmBonferroni
α[PT] pSidak; C pBonferroni, C pSidak; i|C pBonferroni, C
i p 1  (1 p)Ci þ 1 C×p 1  (1  p)Ci þ 1 (C  i þ 1) × p
1 0.000040 0.000119 0.000120 0.000119 0.000120
2 0.016100 0.047526 0.048300 0.031941 0.032200
3 0.612300 0.941724 1.000000 0.612300 0.612300
Homogeneity of Variance 577

procedure compared with the standard Sidák (or


Bonferroni) correction. HOMOGENEITY OF VARIANCE
Homogeneity of variance is an assumption under-
lying both t tests and F tests (analyses of variance,
Alternatives ANOVAs) in which the population variances (i.e.,
The  SidákBonferroni as well as Holm’s the distribution, or ‘‘spread,’’ of scores around the
approaches become very conservative when the mean) of two or more samples are considered
number of comparisons becomes large and when equal. In correlations and regressions, the term
the tests are not independent (e.g., as in brain ‘‘homogeneity of variance in arrays,’’ also called
imaging). Recently, some alternative approaches ‘‘homoskedasticity,’’ refers to the assumption that,
have been proposed to make the correction less within the population, the variance of Y for each
stringent. A more recent approach redefines the value of X is constant. This entry focuses on
problem by replacing the notion of α[PF] by homogeneity of variance as it relates to t tests and
the false discovery rate (FDR), which is defined as ANOVAs.
the ratio of the number of Type I errors by the
number of significant tests.
Herve Abdi Homogeneity Within Populations
See also Bonferroni Procedure; Post Hoc Analysis; Post Within research, it is assumed that populations
Hoc Comparisons; Teoria Statistica Delle Classi e under observation (e.g., the population of female
Calcolo Delle Probabilit
a college students, the population of stay-at-home
fathers, or the population of older adults living
with type 2 diabetes) will be relatively similar and,
therefore, will provide relatively similar responses
Further Readings or exhibit relatively similar behaviors. If two iden-
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. tifiable samples (or subpopulations) are each
(2009). Experimental design and analysis for extracted from a larger population, the assumption
psychology. Oxford, UK: Oxford University Press. is that the responses, measurable behaviors, and so
Aickin, M., & Gensler, H. (1996). Adjusting for multiple on, of participants within both groups will be simi-
testing when reporting research results: The lar and that the distribution of responses measured
Bonferroni vs. Holm methods. American Journal of within each of the groups (i.e., variance) will also
Public Health, 86; 726728. be similar. It is important to note, however, that it
Benjamini, Y., & Hochberg, T. (1995). Controlling the
would be unreasonable to expect that the var-
false discovery rate: A practical and powerful
approach to multiple testing. Journal of the Royal
iances be exactly equal, given fluctuations based
Statistical Society, Series B, 57; 289300. on random sampling. When testing for homogene-
Games, P. A. (1977). An improved t table for ity of variance, the goal is to determine whether
simultaneous control on g contrasts. Journal of the the variances of these groups are relatively similar
American Statistical Association, 72; 531534. or different. For example, is the variation in
Hochberg, Y. (1988). A sharper Bonferroni procedure responses of female college students who attend
for multiple tests of significance. Biometrika, 75; large public universities different from the varia-
800803. tion in responses of female college students who
Holm, S. (1979). A simple sequentially rejective multiple attend small private universities? Is the variation in
test procedure. Scandinavian Journal of Statistics, 6;
observable behaviors of older adults with type 2
6570.
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual
diabetes who exercise different from that of older
Review of Psychology, 46; 561584. adults with type 2 diabetes who do not exercise? Is
Sidák, Z. (1967). Rectangular confidence region for the the variation in responses of 40-year-old stay-at-
means of multivariate normal distributions. Journal of home fathers different from 25-year-old stay-at-
the American Statistical Association, 62; 626633. home fathers?
578 Homogeneity of Variance

Assumptions of t tests and ANOVAs researcher believes that 30 minutes of exercise, 4


times a week, will lower the blood sugar levels of
When the null hypothesis is H0 : μ1 ¼ μ2 , the older adults by 5 points within the first two
assumption of homogeneity of variance must first months. The population variance of the two sam-
be considered. Note that testing for homogeneity ples would be assumed to be similar at the onset
of variance is different from hypothesis testing. In of the study. It is also the case that introducing
the case of t tests and ANOVAs, the existence of a constant (in this case, the treatment effect, or
statistically significant differences between the exercise) either by addition or by subtraction has
means of two or more groups is tested. In tests of no effect on the variance of a sample. If all values
homogeneity of variance, differences in the varia- of each of the participants in the treatment sample
tion of the distributions among subgroups are decrease by five points, while the values of each of
examined. the participants in the control sample remain the
The assumption of homogeneity of variance is same, the similarities of the variance will not
one of three underlying assumptions of t tests and change simply because the values of the treatment
ANOVAs. The first two assumptions concern inde- participants all decreased by the same value (the
pendence of observations, that scores within treatment effect).
a given sample are completely independent of each The variances of groups are ‘‘heterogeneous’’ if
other (e.g., that an individual participant does not the homogeneity of variance assumption has been
provide more than one score or that participants violated. ANOVAs are robust (not overly influ-
providing scores are not related in some way), and enced by small violations of assumptions) even
normality (i.e., that the scores of the population when the homogeneity of variance assumption is
from which a sample is drawn are normally dis- violated, if there are relatively equal numbers of
tributed). As stated, homogeneity of variance, the subjects within each of the individual groups.
third assumption, is that the population variances
(σ 2 Þ of two or more samples are equal (σ 21 ¼ σ 22 Þ.
Pooled Variance
It is important to remember that the underlying
assumptions of t tests and ANOVAs concern In conducting t tests and ANOVAs, the population
populations not samples. In running t tests and variance (σ 2 Þ is estimated using sample data from
ANOVAs, the variances of each of the groups that both (all) groups. The homogeneity of variance
have been sampled are used in order to test this assumption is capitalized on in t tests and ANO-
assumption. VAs when the estimates of each of the samples are
This assumption, that the variances are equal averaged. Based on the multiple groups, a pooled
(or similar), is quite tenable. In experimental variance estimate of the population is obtained.
research methods, studies often begin with a ‘‘treat- The homogeneity of variance assumption (σ 21 ¼
ment’’ group and a ‘‘control’’ group that are σ 22 Þ is important so that the pooled estimate can be
assumed to be equal at the experiment’s onset. The used. The pooling of variances is done because the
treatment group is often manipulated in a way that variances are assumed to be equal and estimating
the researcher hopes will change the measurable the same quantity (the population variance) in the
behaviors of its members. It is hoped that the par- first place. If sample sizes are equal, the pooling of
ticipants’ scores will be raised or lowered by an variances will yield the same result. However,
amount that is equivalent to the ‘‘effect’’ of the when sample sizes are unequal, the pooling of var-
experimental treatment. If it is the treatment effect iances can cause quite different results.
alone that raises or lowers the scores of the treat-
ment group, then the variability of the scores
Testing for Homogeneity of Variance
should remain unchanged (note: the values will
change, but the spread or distribution of the scores When testing for homogeneity of variance, the null
should not). For example, a researcher might want hypothesis is H0 : σ 21 ¼ σ 22 . The ratio of the two
to conduct a study on the blood sugar levels of variances might also be considered. If the two var-
older adults with type 2 diabetes and obtain two iances are equal, then the ratio of the variances
samples of older adults with type 2 diabetes. This equals 1.00. Therefore, the null hypothesis is
Homogeneity of Variance 579

H0 : σ 21 /σ 22 ¼ 1. When this null hypothesis is not more conservative methods, which alleviate the
rejected, then homogeneity of variance is con- problem heterogeneity of variance, leave the
firmed, and the assumption is not violated. researcher with less statistical power for hypothe-
The standard test for determining homogeneity sis testing to determine whether differences
of variance is the Levene’s test and is most fre- between group means exist. That is, the researcher
quently used in newer versions of statistical soft- is less likely to obtain a statistically significant
ware. Alternative approaches to Levene’s test have result using the more conservative method.
been proposed by O’Brien and by Brown and For-
sythe. For a more detailed presentation on calcu-
lating the Levene’s test by hand, refer to Howell, Robustness of t tests and ANOVAs
2007. Generally, tests of homogeneity of variance
are tests on the deviations (squared or absolute) of Problems develop when the variances of the
scores from the sample mean or median. If, for groups are extremely different from one another
example, Group A’s deviations from the mean or (if the value of the largest variance estimate is
median are larger than Group B’s deviations, then more than four or five times that of the smallest
it can be said that Group A’s variance is larger than variance estimate), or when there are large num-
Group B’s. These deviations will be larger (or bers of groups being compared in an ANOVA.
smaller) if the variance of one of the groups is Serious violations can lead to inaccurate p values
larger (or smaller). Based on the Levene’s test, it and estimates of effect size. However, t tests and
can be determined whether a statistically signifi- ANOVAs are generally considered robust when it
cant difference exists between the variances of two comes to moderate departures from the underlying
(or more) groups. homogeneity of variance assumption. Particularly
If the result of a Levene’s test is not statistically when group sizes are equal (n1 ¼ n2 Þ and large. If
significant, then there are no statistical differences the group with the larger sample also has the
between the variances of the groups in question larger variance estimate, then the results of
and the homogeneity of variance assumption is the hypothesis tests will be too conservative. If the
met. In this case, one fails to reject the null larger group has the smaller variance, then the
hypothesis H0 : σ 21 ¼ σ 22 that the variances of the results of the hypothesis test will be too liberal.
populations from which the samples were drawn Methodologically speaking, if a researcher has vio-
are the same. That is, the variances of the groups lated the homogeneity of variance assumption, he
are not statistically different from one another, and or she might consider equating the sample sizes.
t tests and ANOVAs can be performed and inter-
preted as normal. If the result of a Levene’s test is
statistically significant, then the null hypothesis, Follow-Up Tests
that the groups have equal variances, is rejected. It If the results of an ANOVA are statistically signifi-
is concluded that there are statistically significant cant, then post hoc analyses are run to determine
differences between the variances of the groups where specific group differences lie, and the results
and the homogeneity of variance assumption has of the Levene’s test will determine which post hoc
been violated. Note: The significance level will be tests are run and should be examined. Newer ver-
determined by the researcher (i.e., whether the sig- sions of statistical software provide the option of
nificance value exceeds .05, .01, etc.). running post hoc analyses that take into consider-
When the null hypothesis H0 : σ 21 ¼ σ 22 is ation whether the homogeneity of variance
rejected, and the homogeneity of variance assump- assumption has been violated.
tion is violated, it is necessary to adjust the statisti- Although this entry has focused on homogene-
cal procedure used and employ more conservative ity of variance testing for t tests and ANOVAs,
methods for testing the null hypothesis H0 : μ1 ¼ tests of homogeneity of variance for more complex
μ2 . In these more conservative procedures, the statistical models are the subject of current
standard error of difference is estimated differently research.
and the degrees of freedom that are used to test
the null hypothesis are adjusted. However, these Cynthia R. Davis
580 Homoscedasticity

See also Analysis of Variance (ANOVA); Student’s t Test; Heteroscedasticity


t Test, Independent Samples; t Test, Paired Samples;
Variance Violation of the homoscedasticity assumption
results in heteroscedasticity when values of the
dependent variable seem to increase or decrease as
a function of the independent variables. Typically,
Further Readings homoscedasticity violations occur when one or
Boneau, C. A. (1960). The effects of violations of more of the variables under investigation are not
assumptions underlying the t test. Psychological normally distributed. Sometimes heteroscedasticity
Bulletin, 57, 4964. might occur from a few discrepant values (atypical
Games, P. A., & Howell, J. F. (1976). Pairwise multiple data points) that might reflect actual extreme
comparison procedures with unequal n’s and/or observations or recording or measurement error.
variances: A Monte Carlo study. Journal of Scholars and statisticians have different views
Educational Statistics, 1, 113125. on the implications of heteroscedasticity in para-
Howell, D. C. (2007). Fundamental statistics for the
metric analyses. Some have argued that heterosce-
behavioral sciences (6th ed.). Belmont, CA: Duxbury.
dasticity in ANOVA might not be problematic if
there are equal numbers of observations across all
cells. More recent research contradicts this view
and argues that, even in designs with relatively
HOMOSCEDASTICITY equal cell sizes, heteroscedasticity increases the
Type I error rate (i.e., error of rejecting a correct
null hypothesis). Still others have persuasively
Homoscedasticity suggests equal levels of variabil-
argued that heteroscedasticity might be substan-
ity between quantitative dependent variables
tively interesting to some researchers.
across a range of independent variables that are
Regardless, homoscedasticity violations result in
either continuous or categorical. This entry focuses
biased statistical results and inaccurate inferences
on defining and evaluating homoscedasticity in
about the population. Therefore, before conduct-
both univariate and multivariate analyses. The
ing parametric analyses, it is critical to evaluate
entry concludes with a discussion of approaches
and address normality violations and examine data
used to remediate violations of homoscedasticity.
for outlying observations. Detection of homosce-
dasticity violations in multivariate analyses is often
made post hoc, that is, by examining the variation
Homoscedasticity as a Statistical Assumption of residuals values.
Homoscedasticity is one of three major assump-
tions underlying parametric statistical analyses. In
univariate analyses, such as the analysis of vari- Exploratory Data Analysis
ance (ANOVA), with one quantitative dependent
Evaluating Normality
variable (YÞ and one or more categorical indepen-
dent variables (XÞ, the homoscedasticity assump- Generally, normality violations for one or more
tion is known as homogeneity of variance. In this of the variables under consideration can be evalu-
context, it is assumed that equal variances of the ated and addressed in the early stages of analysis.
dependent variable exist across levels of the inde- Researchers suggest examining a few characteris-
pendent variables. tics of single-variable distributions to assess nor-
In multivariate analyses, homoscedasticity mality. For example, the location (i.e., anchoring
means all pairwise combinations of variables (X point of a distribution that is ordered from
and YÞ are normally distributed. In regression con- the lowest to highest values, often measured by
texts, homoscedasticity refers to constant variance mean, median, or mode) and spread of data (i.e.,
of the residuals (i.e., the difference between the variability or dispersion of cases, often described
actual and the predicted value of a data point), or by the standard deviation) are helpful in assessing
conditional variance, regardless of changes in X: normality. A third characteristic, the shape of the
A third characteristic, the shape of the distribution (e.g., normal or bell-shaped, single- or multipeaked, or skewed to the left or right), is best characterized visually using histograms, box plots, and stem-and-leaf plots. Although it is important to examine, individually, the distribution of each relevant variable, it is often necessary in multivariate analyses to evaluate the pattern that exists between two or more variables. Scatterplots are a useful technique to display the shape, direction, and strength of relationships between variables.

Examining Atypical Data Points

In addition to normality, data should always be preemptively examined for influential data points. Labeling observations as outside the normal range of data can be complicated because decisions exist in the context of relationships among variables and the intended purpose of the data. For example, outlying X values are never problematic in ANOVA designs with equal cell sizes, but they introduce significant problems in regression analyses and unbalanced ANOVA designs. However, discrepant Y values are nearly always problematic. Visual detection of unusual observations is facilitated by box plots, partial regression leverage plots, partial residual plots, and influence-enhanced scatterplots. Examination of scatterplots and histograms of residual values often indicates the influence of discrepant values on the overall model fit, and whether the data point is extreme on Y (an outlier) or X (a high-leverage data point). In normally distributed data, statistical tests (e.g., the z-score method, leverage statistics, or Cook's D) can also be used to detect discrepant observations. Some ways of handling atypical data points in normally distributed data include the use of trimmed means, scale estimators, or confidence intervals. Removal of influential observations should be guided by the research question and impact on analysis and conclusions. Sensitivity analyses can guide decisions about whether these values influence results.

Evaluating Homoscedasticity

In addition to examining data for normality and the presence of influential data points, graphical and statistical methods are also used to evaluate homoscedasticity. These methods are often conducted as part of the analysis.

Regression Analyses

In regression analysis, examination of the residual values is particularly helpful in evaluating homoscedasticity violations. The goal of regression analysis is that the model being tested will (ideally) account for all of the variation in Y. Variation in the residual values suggests that the regression model has somehow been misspecified, and graphical displays of residuals are informative in detecting these problems. In fact, the techniques for examining residuals are similar to those used with the original data to assess normality and the presence of atypical data points.

Scatterplots are a useful and basic graphical method to determine homoscedasticity violations. A specific type of scatterplot, known as a residual plot, plots residual Y values along the vertical axis and observed or predicted Y values along the horizontal (X) axis. If a constant spread in the residuals is observed across all values of X, homoscedasticity exists. Plots depicting heteroscedasticity commonly show the following two patterns: (1) residual values increase as values of X increase (i.e., a right-opening megaphone pattern) or (2) residuals are highest for middle values of X and decrease as X becomes smaller or larger (i.e., a curvilinear relationship). Researchers often superimpose Lowess lines (i.e., lines that trace the overall trend of the data) at the mean, as well as 1 standard deviation above and below the mean of residuals, so that patterns of homoscedasticity can be more easily recognized.
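The residual-plot check just described can be scripted directly. The following minimal sketch, added here as an illustration rather than part of the entry, uses simulated data and assumes the third-party statsmodels and matplotlib packages; it overlays a Lowess trend on the absolute residuals so a widening spread is easy to see.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
# Simulated heteroscedastic data: the error spread grows with x
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=x.size)

model = sm.OLS(y, sm.add_constant(x)).fit()
fitted, resid = model.fittedvalues, model.resid

# Residual plot: residuals (vertical axis) against fitted values (horizontal axis)
plt.scatter(fitted, resid, s=12, alpha=0.6)
plt.axhline(0, color="gray")

# Lowess line tracing the trend in the residual spread
smooth = sm.nonparametric.lowess(np.abs(resid), fitted)
plt.plot(smooth[:, 0], smooth[:, 1], color="red", label="Lowess of |residuals|")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.legend()
plt.show()  # a right-opening "megaphone" pattern signals heteroscedasticity
```

A flat Lowess trace is consistent with homoscedasticity; an increasing trace mirrors the megaphone pattern described above.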
Univariate and Multivariate Analyses of Variance

In contexts where one or more of the independent variables is categorical (e.g., ANOVA, t tests, and MANOVA), several statistical tests are often used to evaluate homoscedasticity.

In the ANOVA context, homogeneity of variance violations can be evaluated using the FMax test and Levene's test. The FMax test is computed by dividing the largest within-group variance by the smallest within-group variance. If the FMax exceeds the critical value found in the F-value table, heteroscedasticity might exist. Some researchers suggest an FMax of 3.0 or more indicates an assumption violation. However, conservative estimates (p < .025) are suggested when evaluating F ratios because the FMax test is highly sensitive to issues of non-normality. Therefore, it is often difficult to determine whether significant values are caused by heterogeneity of variance or normality violations of the underlying population. Levene's test is another statistical test that assumes equal variance across levels of the independent variable. If the p value obtained from Levene's test is less than .05, it can be assumed that differences between variances in the population exist. Compared with the FMax test, Levene's test has no required normality assumption.
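As an illustration of the two checks described above, the short sketch below computes the FMax ratio by hand and runs Levene's test with SciPy. The three groups are hypothetical values invented for the example; they are not from the entry.

```python
import numpy as np
from scipy import stats

# Three illustrative (hypothetical) groups of scores
groups = [np.array([12, 15, 14, 11, 13, 16]),
          np.array([22, 30, 19, 27, 35, 24]),
          np.array([14, 13, 15, 16, 12, 14])]

# FMax: largest within-group variance divided by the smallest
variances = [g.var(ddof=1) for g in groups]
fmax = max(variances) / min(variances)
print(f"FMax = {fmax:.2f}")  # values of roughly 3.0 or more are often taken as a warning sign

# Levene's test (center='mean' is the classic form; SciPy's default,
# center='median', is the more robust Brown-Forsythe variant)
stat, p = stats.levene(*groups, center="mean")
print(f"Levene W = {stat:.2f}, p = {p:.4f}")  # p < .05 suggests unequal variances
```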
In the case of MANOVA, when more than one continuous dependent variable is being assessed, the same homogeneity of variance assumption applies. Because there are multiple dependent variables, a second assumption exists that the intercorrelations among these dependent measures (i.e., covariances) are the same across different cells or groups of the design. Box's M test for equality of variance–covariance matrices is used to test this assumption. A statistically significant (p < .05) Box's M test indicates heteroscedasticity. However, results should be interpreted cautiously because the Box's M test is highly sensitive to departures from normality.

Remediation for Violations of Homoscedasticity

Data Transformations

Because homoscedasticity violations typically result from normality violations or the presence of influential data points, it is most beneficial to address these violations first. Data transformations are mathematical procedures that are used to modify variables that violate the statistical assumption of homoscedasticity. Two types of data transformations exist: linear and nonlinear transformations. Linear transformations, produced by adding, subtracting, multiplying, or dividing the variable under consideration by a constant value, or by a combination of these operations, preserve the relative distances of data points and the shape of the distribution. Conversely, nonlinear transformations use logs, roots, powers, and exponentials that change relative distances between data points and, therefore, influence the shape of distributions. Nonlinear transformations might be useful in multivariate analyses to normalize distributions and address homoscedasticity violations. Typically, transformations performed on X values more accurately address normality violations than transformations on Y.

Before transforming the data, it is important to determine both the extent to which the variable(s) under consideration violate the assumptions of homoscedasticity and normality and whether atypical data points influence distributions and the analysis. Examination of residual diagnostics and plots, stem-and-leaf plots, and boxplots is helpful to discern patterns of skewness, non-normality, and heteroscedasticity. In cases where a small number of influential observations is producing heteroscedasticity, removal of these few cases might be more appropriate than a variable transformation.

Tukey's Ladder of Power Transformations

Tukey's ladder of power transformations (the "Bulging Rule") is one of the most commonly used and simple data transformation tools. Moving up on the ladder (i.e., applying exponential functions) reduces negative skew and pulls in low outliers. Roots and logs characterize descending functions on Tukey's ladder and address problems with positive skew and high atypical data points. Choice of transformation strategy should be based on the severity of assumption violations. For instance, square root functions are often suggested to correct a moderate violation, and inverse square root functions are examples of transformations that address more severe violations.
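A minimal sketch of how the descending rungs of the ladder might be compared on a positively skewed variable follows; the data are simulated, the particular rungs chosen are only examples, and SciPy's skewness function is used merely as a convenient summary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=2.0, sigma=0.8, size=500)  # positively skewed variable

# Descending rungs of the ladder for positive skew / high outliers
candidates = {
    "raw":        y,
    "sqrt":       np.sqrt(y),
    "log":        np.log(y),
    "1/sqrt":     1 / np.sqrt(y),
    "reciprocal": 1 / y,
}

for name, values in candidates.items():
    print(f"{name:10s} skewness = {stats.skew(values):6.2f}")
# Pick the mildest rung that brings skewness close to zero;
# stronger transformations are reserved for more severe violations.
```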

Advantages and Disadvantages

There are advantages and disadvantages to conducting data transformations. First, they can remediate homoscedasticity problems and improve accuracy in a multivariate analysis. However, interpretability of results is often challenged because transformed variables are quite different from the original data values. In addition, transformations to variables where the scale range is less than 10 are often minimally effective. After performing data transformations, it is advisable to check the resulting model because remedies in one aspect of the regression model (e.g., homoscedasticity) might lead to other model fit problems (e.g., nonlinearity).

Method of Weighted Least Squares

Another remedial procedure commonly used to address heteroscedasticity in regression analysis is the method of weighted least squares (WLS). According to this method, each case is assigned a weight based on the variance of residuals around the regression line. For example, high weights are applied to data points that show a low variance of residuals around the regression line. Generally, ordinary least-squares (OLS) regression is often preferable, however, to WLS regression except in cases of large sample size or serious violations of constant variance.
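A minimal WLS sketch with statsmodels is shown below, using simulated heteroscedastic data. The weighting recipe (an auxiliary regression of the log squared OLS residuals) is one common choice and is assumed here for illustration only; it is not the only way to build the weights.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=x.size)  # error spread grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()

# Estimate each case's error variance from the squared OLS residuals,
# then weight each case by the inverse of that estimate.
aux = sm.OLS(np.log(ols.resid ** 2), X).fit()
est_var = np.exp(aux.fittedvalues)
wls = sm.WLS(y, X, weights=1.0 / est_var).fit()

print(ols.bse)  # OLS standard errors (misleading under heteroscedasticity)
print(wls.bse)  # WLS standard errors; low weights go to high-variance cases
```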

Kristen Fay

See also Homogeneity of Variance; Multivariate Normal Distribution; Normal Distribution; Normality Assumption; Normalizing Data; Parametric Statistics; Residual Plot; Variance

Further Readings

Berry, W. D. (1993). Understanding regression assumptions. Newbury Park, CA: Sage.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Games, P. A., Winkler, H. B., & Probert, D. A. (1972). Robust tests for homogeneity of variance. Educational and Psychological Measurement, 32, 887–909.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Hartwig, F., & Dearing, B. E. (1979). Exploratory data analysis. Beverly Hills, CA: Sage.
Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: Continuing issues in the everyday analysis of psychological data. Annual Review of Psychology, 46, 433–465.
Meyers, L. G., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. London: Sage.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

HONESTLY SIGNIFICANT DIFFERENCE (HSD) TEST

When an analysis of variance (ANOVA) gives a significant result, this indicates that at least one group differs from the other groups. Yet, the omnibus test does not inform on the pattern of differences between the means. To analyze the pattern of difference between means, the ANOVA is often followed by specific comparisons, and the most commonly used involves comparing two means (the so-called pairwise comparisons).

An easy and frequently used pairwise comparison technique was developed by John Tukey under the name of the honestly significant difference (HSD) test. The main idea of the HSD is to compute the honestly significant difference (i.e., the HSD) between two means using a statistical distribution defined by Student and called the q distribution. This distribution gives the exact sampling distribution of the largest difference between a set of means originating from the same population. All pairwise differences are evaluated using the same sampling distribution used for the largest difference. This makes the HSD approach quite conservative.

Notations

The data to be analyzed comprise A groups; a given group is denoted a. The number of observations of the ath group is denoted S_a. If all groups have the same size, it is denoted S. The total number of observations is denoted N. The mean of Group a is denoted M_{a+}. Obtained from a preliminary ANOVA, the error source (i.e., within group) is denoted S(A), and the effect (i.e., between group) is denoted A. The mean square of error is denoted MS_{S(A)}, and the mean square of effect is denoted MS_A.

Least Significant Difference

The rationale behind the HSD technique comes from the observation that, when the null hypothesis is true, the value of the q statistic evaluating the difference between Groups a and a' is equal to

q = \frac{M_{a+} - M_{a'+}}{\sqrt{\tfrac{1}{2}\, MS_{S(A)} \left( \tfrac{1}{S_a} + \tfrac{1}{S_{a'}} \right)}}, \qquad (1)

and follows a Studentized range q distribution with a range of A and N − A degrees of freedom. The ratio would therefore be declared significant at a given α level if the value of q is larger than the critical value for the α level obtained from the q distribution and denoted q_{A,α}, where ν = N − A is the number of degrees of freedom of the error and A is the range (i.e., the number of groups). This value can be obtained from a table of the Studentized range distribution. Rewriting Equation 1 shows that a difference between the means of Groups a and a' will be significant if

|M_{a+} - M_{a'+}| > HSD = q_{A,\alpha} \sqrt{\tfrac{1}{2}\, MS_{S(A)} \left( \tfrac{1}{S_a} + \tfrac{1}{S_{a'}} \right)}. \qquad (2)

When there is an equal number of observations per group, Equation 2 can be simplified as

HSD = q_{A,\alpha} \sqrt{\frac{MS_{S(A)}}{S}}. \qquad (3)

To evaluate the difference between the means of Groups a and a', the absolute value of the difference between the means is taken and compared with the value of HSD. If

|M_{a+} - M_{a'+}| \geq HSD, \qquad (4)

then the comparison is declared significant at the chosen α level (usually .05 or .01). This procedure is then repeated for all A(A − 1)/2 comparisons.
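Critical values of q need not be read from a table. The sketch below, added as an illustration, assumes SciPy 1.7 or later (whose stats.studentized_range provides the Studentized range distribution); the inputs mirror the fictitious example discussed below (A = 5 groups, S = 10, MS_S(A) = 80, 45 error degrees of freedom).

```python
from scipy import stats

A, S, ms_error, df_error = 5, 10, 80.0, 45

q_05 = stats.studentized_range.ppf(0.95, A, df_error)  # critical q at alpha = .05
q_01 = stats.studentized_range.ppf(0.99, A, df_error)  # critical q at alpha = .01

hsd_05 = q_05 * (ms_error / S) ** 0.5
hsd_01 = q_01 * (ms_error / S) ** 0.5
print(f"q(.05) = {q_05:.2f}, HSD(.05) = {hsd_05:.2f}")  # about 4.02 and 11.37
print(f"q(.01) = {q_01:.2f}, HSD(.01) = {hsd_01:.2f}")  # about 4.90 and 13.86

# Any pair of group means whose absolute difference meets or exceeds the HSD
# is declared significant at that alpha level.
```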
Note that HSD has less power than almost all other post hoc comparison methods (e.g., Fisher's LSD or Newman–Keuls) except the Scheffé approach and the Bonferroni method, because the α level for each difference between means is set at the same level as the largest difference.

Example

In a series of experiments on eyewitness testimony, Elizabeth Loftus wanted to show that the wording of a question influenced witnesses' reports. She showed participants a film of a car accident and then asked them a series of questions. Among the questions was one of five versions of a critical question asking about the speed the vehicles were traveling:

1. How fast were the cars going when they hit each other?
2. How fast were the cars going when they smashed into each other?
3. How fast were the cars going when they collided with each other?
4. How fast were the cars going when they bumped each other?
5. How fast were the cars going when they contacted each other?

The data from a fictitious replication of Loftus's experiment are shown in Table 1. We have A = 5 groups and S = 10 participants per group.

Table 1   Results for a Fictitious Replication of Loftus & Palmer (1974) in Miles per Hour

Contact  Hit  Bump  Collide  Smash
  21     23    35     44      39
  20     30    35     40      44
  26     34    52     33      51
  46     51    29     45      47
  35     20    54     45      50
  13     38    32     30      45
  41     34    30     46      39
  30     44    42     34      51
  42     41    50     49      39
  26     35    21     44      55
M_a+: 30  35    38     41      46

Source: Adapted from Loftus & Palmer (1974).

The ANOVA found an effect of the verb used on participants' responses. The ANOVA table is shown in Table 2.

Table 2   ANOVA Results for the Replication of Loftus & Palmer (1974)

Source        df    SS        MS      F     Pr(F)
Between: A     4    1,460.00  365.00  4.56  .0036
Error: S(A)   45    3,600.00   80.00
Total         49    5,060.00

Source: Adapted from Loftus & Palmer (1974).

For an α level of .05, the value of q_{.05,A} is equal to 4.02, and the HSD for these data is computed as

HSD = q_{\alpha,A} \sqrt{\frac{MS_{S(A)}}{S}} = 4.02 \times \sqrt{8} = 11.37. \qquad (5)

The value of q_{.01,A} = 4.90, and a similar computation will show that, for these data, the HSD for an α level of .01 is equal to HSD = 4.90 × \sqrt{8} = 13.86.

For example, the difference between M_{contact+} and M_{hit+} is declared nonsignificant because

|M_{contact+} - M_{hit+}| = |30 - 35| = 5 < 11.37. \qquad (6)

The differences and significance of all pairwise comparisons are shown in Table 3.

Table 3   HSD

                      Contact    Hit       Bump      Collide    Smash
                      M_1+ = 30  M_2+ = 35  M_3+ = 38  M_4+ = 41  M_5+ = 46
M_1+ = 30  Contact    0.00       5.00 ns    8.00 ns    11.00 ns   16.00**
M_2+ = 35  Hit                   0.00       3.00 ns    6.00 ns    11.00 ns
M_3+ = 38  Bump                             0.00       3.00 ns    8.00 ns
M_4+ = 41  Collide                                     0.00       5.00 ns
M_5+ = 46  Smash                                                  0.00

Source: Differences between means and significance of pairwise comparisons from the (fictitious) replication of Loftus & Palmer (1974).
Notes: Differences larger than 11.37 are significant at the α = .05 level and are indicated with *, and differences larger than 13.86 are significant at the α = .01 level and are indicated with **.

Hervé Abdi and Lynne J. Williams

See also Analysis of Variance (ANOVA); Bonferroni Procedure; Fisher's Least Significant Difference Test; Multiple Comparison Tests; Newman–Keuls Test and Tukey Test; Pairwise Comparisons; Post Hoc Comparisons; Scheffé Test

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Hayter, A. J. (1986). The maximum familywise error rate of Fisher's least significant difference test. Journal of the American Statistical Association, 81, 1001–1004.
Seaman, M. A., Levin, J. R., & Serlin, R. C. (1991). New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin, 110, 577–586.

HYPOTHESIS

A hypothesis is a provisional idea whose merit requires further evaluation. In research, a hypothesis must be stated in operational terms to allow its soundness to be tested.

The term hypothesis derives from the Greek (ὑπόθεσις), which means "to put under" or "to suppose." A scientific hypothesis is not the same as a scientific theory, even though the words hypothesis and theory are often used synonymously in common and informal usage. A theory might start as a hypothesis, but as it is subjected to scrutiny, it develops from a single testable idea to a complex framework that, although perhaps imperfect, has withstood the scrutiny of many research studies. This entry discusses the role of hypotheses in research design, the types of hypotheses, and the writing of hypotheses.

Hypothesis in Research Design

Two major elements in the design of research are the researcher's hypotheses and the variables to test them. The hypotheses are usually extensions
of existing theory and past research, and they motivate the design of the study. The variables represent the embodiment of the hypotheses in terms of what the researcher can manipulate and observe.

A hypothesis is sometimes described as an educated guess. However, this statement is also questioned as a good description of a hypothesis. For example, many people might agree with the hypothesis that an ice cube will melt in less than 30 minutes if put on a plate and placed on a table. However, after doing quite a bit of research, one might learn about how temperature and air pressure can change the state of water and restate the hypothesis as an ice cube will melt in less than 30 minutes in a room at sea level with a temperature of 20° C or 68° F. If one does further research and gains more information, the hypothesis might become an ice cube made with tap water will melt in less than 30 minutes in a room at sea level with a temperature of 20° C or 68° F. This example shows that a hypothesis is not really just an educated guess. It is a tentative explanation for an observation, phenomenon, or scientific problem that can be tested by further investigation. In other words, a hypothesis is a tentative statement about the expected relationship between two or more variables. The hypothesis is tentative because its accuracy will be tested empirically.

Types of Hypotheses

Null Hypothesis

In statistics, there are two types of hypotheses: the null hypothesis (H0) and the alternative/research/maintained hypothesis (Ha). A null hypothesis (H0) is a falsifiable proposition, which is assumed to be true until it is shown to be false. In other words, the null hypothesis is presumed true until statistical evidence, in the form of a hypothesis test, indicates it is highly unlikely. When the researcher has a certain degree of confidence, usually 95% to 99%, that the data do not support the null hypothesis, the null hypothesis will be rejected. Otherwise, the researcher will fail to reject the null hypothesis.

In scientific and medical applications, the null hypothesis plays a major role in testing the significance of differences in treatment and control groups. Setting up the null hypothesis is an essential step in testing statistical significance. After formulating a null hypothesis, one can establish the probability of observing the obtained data.

Alternative Hypothesis

The alternative hypothesis and the null hypothesis are the two rival hypotheses whose likelihoods are compared by a statistical hypothesis test. For example, an alternative hypothesis can be a statement that the means, variance, and so on, of the samples being tested are not equal. It describes the possibility that the observed difference or effect is true. The classic approach to decide whether the alternative hypothesis will be favored is to calculate the probability that the observed effect will occur if the null hypothesis is true. If the value of this probability (p value) is sufficiently small, then the null hypothesis will be rejected in favor of the alternative hypothesis. If not, then the null hypothesis will not be rejected.

Examples of Null Hypothesis and Alternative Hypothesis

If a two-tailed alternative hypothesis is that application of Educational Program A will influence students' mathematics achievements (Ha: μ_Program A ≠ μ_control), the null hypothesis is that application of Program A will have no effect on students' mathematics achievements (H0: μ_Program A = μ_control). If a one-tailed alternative hypothesis is that application of Program A will increase students' mathematics achievements (Ha: μ_Program A > μ_control), the null hypothesis remains that use of Program A will have no effect on students' mathematics achievements (H0: μ_Program A = μ_control). It is not merely the opposite of the alternative hypothesis—that is, it is not that the application of Program A will not lead to increased mathematics achievements in the students. However, this does remain the true null hypothesis.
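As a sketch of how these rival hypotheses are weighed in practice, the example below uses two sets of hypothetical achievement scores and assumes SciPy 1.6 or later (for the alternative argument to ttest_ind); it is an illustration added here, not part of the entry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
program_a = rng.normal(loc=75, scale=10, size=40)  # hypothetical achievement scores
control   = rng.normal(loc=70, scale=10, size=40)

# Two-tailed test of H0: mu_ProgramA = mu_control vs Ha: mu_ProgramA != mu_control
t_two, p_two = stats.ttest_ind(program_a, control)

# One-tailed test of H0: mu_ProgramA = mu_control vs Ha: mu_ProgramA > mu_control
t_one, p_one = stats.ttest_ind(program_a, control, alternative="greater")

alpha = 0.05
print(f"two-tailed p = {p_two:.4f} -> "
      f"{'reject' if p_two < alpha else 'fail to reject'} H0")
print(f"one-tailed p = {p_one:.4f} -> "
      f"{'reject' if p_one < alpha else 'fail to reject'} H0")
```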

Hypothesis Writing

What makes a good hypothesis? Answers to the following three questions can help guide hypothesis writing: (1) Is the hypothesis based on the review of the existing literature? (2) Does the hypothesis include the independent and dependent variables? (3) Can this hypothesis be tested in the experiment? For a good hypothesis, the answer to every question should be "Yes."

Some statisticians argue that the null hypothesis cannot be as general as indicated earlier. They believe the null hypothesis must be exact and free of vagueness and ambiguity. According to this view, the null hypothesis must be numerically exact—it must state that a particular quantity or difference is equal to a particular number.

Some other statisticians believe that it is desirable to state direction as a part of the null hypothesis or as part of a null hypothesis/alternative hypothesis pair. If the direction is omitted, then it will be quite confusing to interpret the conclusion if the null hypothesis is not rejected. Therefore, they think it is better to include the direction of the effect if the test is one-sided, for the sake of overcoming this ambiguity.

Jie Chen, Neal Kingston, Gail Tiemann, and Fei Gu

See also Directional Hypothesis; Nondirectional Hypotheses; Null Hypothesis; Research Hypothesis; "Sequential Tests of Statistical Hypotheses"

Further Readings

Agresti, A., & Finlay, B. (2008). Statistical methods for the social sciences (4th ed.). San Francisco, CA: Dellen.
Nolan, S. A., & Heinzen, T. E. (2008). Statistics for the behavioral sciences (3rd ed.). New York: Worth.
Shavelson, R. J. (1998). Statistical reasoning for the behavioral sciences (3rd ed.). Needham Heights, MA: Allyn & Bacon.
Slavin, R. (2007). Educational research in an age of accountability. Upper Saddle River, NJ: Pearson.
I
to real life, thus influencing the utility and applica-
INCLUSION CRITERIA bility of study findings. Inclusion criteria must be
selected carefully based on a review of the litera-
ture, in-depth knowledge of the theoretical frame-
Inclusion criteria are a set of predefined charac- work, and the feasibility and logistic applicability
teristics used to identify subjects who will be of the criteria. Often, research protocol amend-
included in a research study. Inclusion criteria, ments that change the inclusion criteria will result
along with exclusion criteria, make up the selec- in two different sample populations that might
tion or eligibility criteria used to rule in or out require separate data analyses with a justification
the target population for a research study. Inclu- for drawing composite inferences.
sion criteria should respond to the scientific The selection and application of inclusion crite-
objective of the study and are critical to accom- ria also will have important consequences on the
plish it. Proper selection of inclusion criteria will assurance of ethical principles; for example,
optimize the external and internal validity of the including subjects based on race, gender, age, or
study, improve its feasibility, lower its costs, and clinical characteristics also might imply an uneven
minimize ethical concerns; specifically, good distribution of benefits and harms, threats to the
selection criteria will ensure the homogeneity of autonomy of subjects, and lack of respect. Not
the sample population, reduce confounding, and including women, children, or the elderly in the
increase the likelihood of finding a true associa- study might have important ethical implications
tion between exposure/intervention and out- and diminish the compliance of the study with
comes. In prospective studies (cohort and research guidelines such as those of the National
clinical trials), they also will determine the feasi- Institutes of Health in the United States for inclu-
bility of follow-up and attrition of participants. sion of women, children, and ethnic minorities in
Stringent inclusion criteria might reduce the gen- research studies.
eralizability of the study findings to the target Use of standardized inclusion criteria is neces-
population, hinder recruitment and sampling of sary to accomplish consistency of findings across
study subjects, and eliminate a characteristic that similar studies on a research topic. Common inclu-
might be of critical theoretical and methodologi- sion criteria refer to demographic, socioeconomic,
cal importance. health and clinical characteristics, and outcomes
Each additional inclusion criterion implies a dif- of study subjects. Meeting these criteria requires
ferent sample population and will add restrictions screening eligible subjects using valid and reliable
to the design, creating increasingly controlled con- measurements in the form of standardized expo-
ditions, as opposed to everyday conditions closer sure and outcome measurements to ensure that


subjects who are said to meet the inclusion criteria The selection of inclusion criteria should be
really have them (sensitivity) and those who are guided by ethical and methodological issues; for
said not to have them really do not have them example, in a clinical trial to treat iron deficiency
(specificity). Such measurements also should be anemia among reproductive-age women, including
consistent and repeatable every time they are women to assess an iron supplement therapy
obtained (reliability). Good validity and reliability would not be ethical if women with life-threatening
of inclusion criteria will help minimize random or very low levels of anemia are included in a non-
error, selection bias, misclassification of exposures treatment arm of a clinical trial for follow-up with
and outcomes, and confounding. Inclusion criteria an intervention that is less than the standard of
might be difficult to ascertain; for example, an care. Medication washout might be established as
inclusion criterion stating that ‘‘subjects with type an inclusion criterion to prevent interference of
II diabetes mellitus and no other conditions will be a therapeutic drug on the treatment under study. In
included’’ will require, in addition to clinical ascer- observational prospective studies, including sub-
tainment of type II diabetes mellitus, evidence that jects with a disease to assess more terminal clinical
subjects do not have cardiovascular disease, hyper- endpoints without providing therapy also would be
tension, cancer, and so on, which will be costly, unethical, even if the population had no access to
unfeasible, and unlikely to rule out completely. A medical care before the study.
similar problem develops when using as inclusion In observational studies, inclusion criteria are
criterion ‘‘subjects who are in good health’’ used to control for confounding, in the form of
because a completely clean bill of health is difficult specification or restriction, and matching. Specifi-
to ascertain. Choosing inclusion criteria with high cation or restriction is a way of controlling con-
validity and reliability will likely improve the like- founding; potential confounder variables are
lihood of finding an association, if there is one, eliminated from the study sample, thus removing
between the exposures or interventions and the any imbalances between the comparison groups.
outcomes; it also will decrease the required sample Matching is another strategy to control confound-
size. For example, inclusion criteria such as tumor ing; matching variables are defined by inclusion
markers that are known to be prognostic factors criteria that will homogenize imbalances between
of a given type of cancer will be correlated more comparison groups, thus removing confounding.
strongly with cancer than unspecific biomarkers or The disadvantage is that variables eliminated by
clinical criteria. Inclusion criteria that identify restriction or balanced by matching will not be
demographic, temporal, or geographic characteris- amenable to assessment as potential risk factors
tics will have scientific and practical advantages for the outcome at hand. This also will limit gener-
and disadvantages; restricting subjects to male gen- alizability, hinder recruitment, and require more
der or adults might increase the homogeneity of time and resources for sampling. In studies of
the sample, thus helping to control confounding. screening tests, inclusion criteria should ensure the
Inclusion criteria that include selection of subjects selection of the whole spectrum of disease severity
during a certain period of time might overlook and clinical forms. Including limited degrees of dis-
important secular trends in the phenomenon under ease severity or clinical forms will likely result in
study, but not establishing a feasible period of time biased favorable or unfavorable assessment of
might make conducting of the study unfeasible. screening tests.
Geographic inclusion criteria that establish select- Sets of recommended inclusion criteria have
ing a population from a hospital also might select been established to enhance methodological rigor
a biased sample that will preclude the generaliz- and comparability between studies; for example,
ability of the findings, although it might be the the American College of Chest Physicians and
only alternative to conducting the study. In studies Society of Critical Care Medicine developed inclu-
of rheumatoid arthritis, including patients with at sion criteria for clinical trials of sepsis; new criteria
least 12 tender or swollen joints will make difficult rely on markers of organ dysfunction rather than
the recruitment of a sufficient number of patients on blood culture positivity or clinical signs and
and will likely decrease the generalizability of symptoms. Recently, the Scoliosis Research Society
study results to the target population. in the United States has proposed new

standardized inclusion criteria for brace studies in Inclusion criteria for experimental studies
the treatment of adolescent idiopathic scoliosis. involve different considerations than those for
Also, the International Campaign for Cures of Spi- observational studies. In clinical trials, inclusion
nal Cord Injury Paralysis has introduced inclusion criteria should maximize the generalizability of
and exclusion criteria for the conduct of clinical findings to the target population by allowing the
trials for spinal cord injury. Standardized inclusion recruitment of a sufficient number of individuals
criteria must be assessed continuously because it with expected outcomes, minimizing attrition rates,
might be possible that characteristics used as inclu- and providing a reasonable follow-up time for
sion criteria change over time; for example, it has effects to occur.
been shown that the median numbers of swollen Automatized selection and standardization of
joints in patients with rheumatoid arthritis has inclusion criteria for clinical trials using electronic
decreased over time. Drawing inferences using old health records has been proposed to enhance the
criteria that are no longer valid would no longer consistency of inclusion criteria across studies.
be relevant for the current population.
In case control studies, inclusion criteria will Eduardo Velasco
define the subjects with the disease and those with-
See also Bias; Confounding; Exclusion Criteria;
out it; cases and controls should be representative
Reliability; Sampling; Selection; Sensitivity; Specificity;
of the diseased and nondiseased subjects in the tar-
Validity of Research Conclusions
get population. In occupational health research, it
is known that selecting subjects in the work setting
might result in a biased sample of subjects who are Further Readings
healthier and at lower risk than the population at Gordis, L. (2008). Epidemiology (4th ed.). Philadelphia:
large. Selection of controls must be independent of W. Saunders.
exposure status, and nonparticipation rates might Hulley, S. B., Cummings, S. R., Browner, W. S., Grady,
introduce bias. Matching by selected variables will D., Hearst, N., & Newman, T. B. (2007). Designing
remove the confounding effects of those variables clinical research: An epidemiologic approach (3rd ed.).
on the association under study; obviously, those Philadelphia: Wolters, Kluwer, Lippincott, Williams &
variables will not be assessed as predictors of the Wilkins.
outcome. Matching also might introduce selection LoBiondo-Wood G., & Haber J. (2006). Nursing
research: Methods and critical appraisal for evidence-
bias, complicate recruitment of subjects, and limit
based practice (6th ed.). St. Louis, MO: Mosby.
inference to the target population. Additional spe-
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).
cific inclusion criteria will be needed for nested Experimental and quasi-experimental designs for
case control and case cohort designs and two-stage generalized causal inference. Boston: Houghton-
or multistage sampling. Common controls are Mifflin.
population controls, neighborhood controls, hospi- Szklo, M., & Nieto F. J. (2007). Epidemiology: Beyond
tal or registry controls, friends, relatives, deceased the basics. Sudbury, MA: Jones & Bartlett Publishers.
controls, and proxy respondents. Control selection
usually requires that controls remain disease free
for a given time interval, the exclusion of controls
who become incident cases, and the exclusion of INDEPENDENT VARIABLE
controls who develop diseases other than the one
studied, but that might be related to the exposure Independent variable is complementary to depen-
of interest. dent variable. These two concepts are used primar-
In cohort studies, the most important inclusion ily in their mathematical sense, meaning that the
criterion is that subjects do not have the disease value of a dependent variable changes in response
outcome under study. This will require ascertain- to that of an independent variable. In research
ment of disease-free subjects. Inclusion criteria design, independent variables are those that
should allow efficient accrual of study subjects, a researcher can manipulate, whereas dependent
good follow-up participation rates, and minimal variables are the responses to the effects of inde-
attrition. pendent variables. By purposefully manipulating

the value of an independent variable, one hopes to EðyÞ ¼ f ðxÞ‚ ð2Þ


cause a response in the dependent variable.
As such, independent variables might carry where EðyÞ is the expectation of y; or equivalently,
different names in various research fields,
depending on how the relationship between the y ¼ f ðxÞ þ ε‚ ð3Þ
independent and the dependent variable is
defined. They might be called explanatory vari- where ε is a random variable, which follows a spe-
ables, controlled variables, input variables, pre- cific probability distribution with a zero mean.
dictor variables, factors, treatments, conditions, This is a probabilistic model. It is composed of
or other names. For instance, in regression a deterministic part [f ðxÞ] and a random part (εÞ.
experiments, they often are called regressors in The random part is the one that accounts for the
relation to the regressand, the dependent, or the variation in y:
response variable. In experiments, independent variables are the
The concept of independent variable in statistics design variables that are predetermined by
should not be confused with the concept of inde- researchers before an experiment is started. They
pendent random variable in probability theories. are carefully controlled in controlled experi-
In the latter case, two random variables are said to ments or selected in observational studies (i.e.,
be independent if and only if their joint probability they are manipulated by the researcher accord-
is the product of their marginal probabilities for ing to the purpose of a study). The dependent
every pair of real numbers taken by the two ran- variable is the effect to be observed and is the
dom variables. In other words, if two random vari- primary interest of the study. The value of the
ables are truly independent, the events of one dependent variable varies subjecting to the varia-
random variable have no relationship with the tion in the independent variables and cannot be
events of the other random variable. For instance, manipulated to establish an artificial relationship
if a fair coin is flipped twice, a head occurring in between the independent and dependent vari-
the first flip has no association with whether the ables. Manipulation of the dependent variable
second flip is a head or a tail because the two invalidates the entire study.
events are independent. Because they are controlled or preselected and
Mathematically, the relationship between inde- are usually not the primary interest of a study, the
pendent and dependent variables might be under- value of independent variables is almost univer-
stood in this way: sally not analyzed. Instead, they are simply taken
as prescribed. (This, however, does not preclude
y ¼ f ðxÞ‚ ð1Þ the numeric description of independent variables
as can be routinely seen in scientific literature. In
where x is the independent variable (i.e., any argu- fact, they are often described in detail so that
ment to a function) and y is the dependent variable a published study can be evaluated properly or
(i.e., the value that the function is evaluated to). repeated by others.) In contrast, the value of the
Given an input of x; there is a corresponding dependent variable is unknown before a study.
output of y; x changes independently, whereas y The observed value of the dependent variable usu-
responds to any change in x: ally requires careful analyses and proper explana-
Equation 1 is a deterministic model. For each tion after a study is done.
input in x; there is one and only one response in y: A caution note must be sounded that even
A familiar graphic example is a straight line if though the value of independent variables can be
there is only one independent variable of order 1 manipulated, one should not change it in the mid-
in the previous model. In statistics, however, this dle of a study. Doing so drastically modifies the
model is grossly inadequate. For each value of x; independent variables before and after the change,
there is often a population of y, which follows causing a loss of the internal validity of the study.
a probability distribution. To reflect more accu- Even if the value of the dependent variable does
rately this reality, the preceding equation is revised not change drastically in response to a manipula-
accordingly: tion such, the result remains invalid. Careful

selection and control of independent variables See also Bivariate Regression; Covariate; Dependent
before and during a study is fundamental to both Variable
the internal and the external validity of that study.
To illustrate what constitutes a dependent vari-
able and what is an independent variable, let us Further Readings
assume an agricultural experiment on the produc-
Hockling, R. R. (2003). Methods and applications of
tivity of two wheat varieties that are grown under linear models: Regression and the analysis of variance.
identical or similar field conditions. Productivity Hoboken, NJ: Wiley.
is measured by tons of wheat grains produced Kuehl, R. O. (1994). Statistical principles of research
per season per hectare. In this experiment, variety design and analysis. Belmont, CA: Wadsworth.
would be the independent variable and productiv- Montgomery, D. C. (2001). Design and analysis of
ity the dependent variable. The qualifier, ‘‘identical experiments (5th ed.). Toronto, Ontario, Canada: Wiley.
or similar field conditions,’’ implies other extrane- Ramsey, F. L., & Schafer, D. W. (2002). The statistical
ous (or nuisance) factors (i.e., covariates) that must sleuth: A course in methods of data analysis (2nd ed.).
be controlled, or taken account of, in order for the Pacific Grove, CA: Duxbury.
Wacherly, D. D., Mendenhall , W., III, & Scheaffer, R. L.
results to be valid. These other factors might be the
(2002). Mathematical statistics with applications
soil fertility, the fertilizer type and amount, irriga- (6th ed.). Pacific Grove, CA: Duxbury.
tion regime, and so on. Failure to control Zolman, J. F. (1993). Biostatistics: Experimental design
or account for these factors could invalidate the and statistical inference. New York: Oxford University
experiment. This is an example of controlled Press.
experiments. Similar examples of controlled experi-
ments might be the temperature effect on the hard-
ness of a type of steel and the speed effect on the
crash result of automobiles in safety tests.
Consider also an epidemiological study on the
INFERENCE: DEDUCTIVE
relationship between physical inactivity and obesity AND INDUCTIVE
in young children: The parameter(s) that measures
physical inactivity, such as the hours spent on Reasoning is the process of making inferences—of
watching television and playing video games, and drawing conclusions. Students of reasoning make
the means of transportation to and from daycares/ a variety of distinctions regarding how inferences
schools is the independent variable. These are cho- are made and conclusions are drawn. Among the
sen by the researcher based on his or her prelimi- oldest and most durable of them is the distinction
nary research or on other reports in literature on between deductive and inductive reasoning, which
the same subject prior to the study. The param- contrasts conclusions that are logically implicit in
eter(s) that measure obesity, such as the body mass the claims from which they are drawn with those
index, is (are) the dependent variable. To control that go beyond what is given.
for confounding, the researcher needs to consider, Deduction involves reasoning from the general
other than the main independent variables, any to the particular:
covariate that might influence the dependent vari-
able. An example might be the social economical All mammals nurse their young.
status of the parents and the diet of the families.
Whales are mammals.
Independent variables are predetermined factors
that one controls and/or manipulates in a designed Therefore whales nurse their young.
experiment or an observational study. They are
design variables that are chosen to incite a response Induction involves reasoning from the particular
of a dependent variable. Independent variables are to the general:
not the primary interest of the experiment; the
dependent variable is. All the crows I have seen are black.
Being black must be a distinguishing feature of
Shihe Fan crows.

Implication Versus Inference deduction? One answer is that what is implicit in


premises is not always apparent until it has been
Fundamental to an understanding of deductive made explicit. It is not the case that deductive rea-
reasoning is a distinction between implication and soning never produces surprises for those who use
inference. Implication is a logical relationship; it. A mathematical theorem is a conclusion of
inference is a cognitive act. Statements imply; peo- a deductive argument. No theorem contains infor-
ple infer. A implies B if it is impossible for A to be mation that is not implicit in the axioms of the
true if B is false. People are said to make an infer- system from which it was derived. For many theo-
ence when they justify one claim (conclusion) by rems, the original derivation (proof) is a cognitively
appeal to others (premises). Either implications demanding and time-consuming process, but once
exist or they do not, independently of whether the theorem has been derived, it is available for
inferences are made that relate to them. Inferences use without further ado. If it were necessary to
either are made or are not made; they can be valid derive each theorem from basic principles every
or invalid, but they are inferences in either case. time it was used, mathematics would be a much
Failure to keep the distinction in mind can cause less productive enterprise.
confusion. People are sometimes said to imply Another answer is that a conclusion is easier to
when they make statements with the intention that retain in memory than the premises from which it
their hearers will see the implications of those was deduced and, therefore, more readily accessi-
statements and make the corresponding inferences, ble for future reference. Retrieval of the conclusion
but to be precise in the use of language one would from memory generally requires less cognitive
have to say not that people imply but that they effort than would the retrieval of the supporting
make statements that imply. premises and derivation of their implications.

Aristotle and the Syllogism Validity, Plausibility, and Truth


The preeminent name in the history of deductive The validity of a logical argument does not guar-
reasoning is that of Aristotle, whose codification antee the truth of the argument’s conclusion. If at
of implicative relationships provided the founda- least one premise of the argument is false, the con-
tion for the work of many generations of logicians clusion also might be false, although it is not nec-
and epistemologists. Aristotle analyzed the various essarily so. However, if all the argument’s premises
ways in which valid inferences can be drawn with are true and its form is valid, the conclusion must
the structure referred to as a categorical syllogism, be true. A false conclusion cannot follow from true
which is a form of argument involving three asser- premises. This is a powerful fact. Truth is consis-
tions, the third of which (the conclusion) follows tent: Whatever a collection of true premises
from the first two (the major and minor premises). implies must be true. It follows that if one knows
A syllogism is said to be valid if, and only if, the the conclusion to a valid argument to be false, one
conclusion follows from (is implied by) the prem- can be sure that at least one of the argument’s
ises. Aristotle identified many valid forms and premises is false.
related them to each other in terms of certain Induction leads to conclusions that state more
properties such as figure and mood. Figure relates than is contained implicitly in the claims on which
to the positions of the middle term—the term that those conclusions are based. When, on the basis of
is common to both premises—and mood to the noticing that all the members that one has seen of
types of premises involved. An explanation of the a certain class of things has a particular attribute
system Aristotle used to classify syllogistic forms (‘‘all the crows I have seen are black’’), one con-
can be found in any introductory text on first- cludes that all, or nearly all, the members of that
order predicate logic. class (including those not seen) have that attribute,
That deductive reasoning makes explicit only one is generalizing—going beyond the data in
knowledge already contained implicitly in the hand—which is one form of induction.
premises from which the deductions are made There are many formal constructs and tools to
prompts the question: Of what practical use is facilitate construction and assessment of deductive

arguments. These tools include syllogistic forms, about the role of guessing and conjecturing in
calculi of classes and propositions, Boolean alge- mathematics.
bra, and a variety of diagrammatic aids to analysis
such as truth tables, Euler diagrams, and Venn dia-
The Interplay of Deduction and Induction
grams. Induction does not lend itself so readily to
formalization; indeed (except in the case of mathe- Any nontrivial cognitive problem is almost certain
matical induction, which is really a misnamed to require the use of both deductive and inductive
form of deduction) inductive reasoning is almost inferencing, and one might find it difficult to
synonymous with informal reasoning. It has to do decide, in many instances, where the dividing line
with weighing evidence, judging plausibility, and is between the two. In science, for example, the
arriving at uncertain conclusions or beliefs that interplay between deductive and inductive reason-
one can hold with varying degrees of confidence. ing is continual. Observations of natural phenom-
Deductive arguments can be determined to be ena prompt generalizations that constitute the
valid or invalid. The most one can say about an stuff of hypotheses, models, and theories. Theories
inductive argument is that it is more or less provide the basis for the deduction of predictions
convincing. regarding what should be observed under specified
Logic is often used to connote deductive reason- conditions. Observations are made under the con-
ing only; however, it can be sufficiently broadly ditions specified, and the predictions are either cor-
defined to encompass both deductive and inductive roborated or falsified. If falsification is the result,
reasoning. Sometimes a distinction is made between the theories from which the predictions were
formal and informal logic, to connote deductive deduced must be modified and this requires induc-
and inductive reasoning, respectively. tive reasoning—guesswork and more hypothesiz-
Philosophers and logicians have found it much ing. The modified theories provide the basis for
easier to deal with deductive than with inductive deducing new predictions. And the cycle goes on.
reasoning, and as a consequence, much more has In mathematics, a similar process occurs. A sug-
been written about the former than about the lat- gestive pattern is observed and the mathematician
ter, but the importance of induction is clearly rec- induces a conjecture, which, in some cases,
ognized. Induction has been called the despair of becomes a theorem—which is to say it is proved
the philosopher, but no one questions the necessity by rigorous deduction from a specified set of
of using it. axioms. Mathematics textbooks spend a lot of
Many distinctions similar to that between time on the proofs of theorems, emphasizing the
deductive and inductive reasoning have been made. deductive side of mathematics. What might be less
Mention of two of them will suffice to illustrate apparent, but no less crucial to the doing of math-
the point. American philosopher/mathematician/ ematics, is the considerable guesswork and induc-
logician Charles Sanders Peirce drew a contrast tion that goes into the identification of conjectures
between a demonstrative argument, in which the that are worth exploring and the construction of
conclusion is true whenever the premises are true, proofs that will be accepted as such by other
and a probabilistic argument, in which the conclu- mathematicians.
sion is usually true whenever the premises are true. Deduction and induction are essential also to
Hungarian/American mathematician George P olya meet the challenges of everyday life, and we all
distinguished between demonstrative reasoning make extensive use of both, which is not to claim
and plausible reasoning, demonstrative reasoning that we always use them wisely and well. The psy-
being the kind of reasoning by which mathematical chological research literature documents numerous
knowledge is secured, and plausible reasoning that ways in which human reasoning often leads to
which we use to support conjectures. Ironically, conclusions that cannot be justified either logically
although P olya equated demonstrative reasoning or empirically. Nevertheless, that the type of rea-
with mathematics and described all reasoning out- soning that is required to solve structured prob-
side of mathematics as plausible reasoning, he lems for the purposes of experimentation in the
wrote extensively, especially in his 1954 two- psychological laboratory does not always ade-
volume Mathematics and Plausible Reasoning, quately represent the reasoning that is required to

deal with the problems that present themselves in real life has been noted by many investigators, and it is reflected in contrasts that are drawn between pure (or theoretical) and practical thinking, between academic and practical intelligence, between formal and everyday reasoning, and between other distinctions of a similar nature.

The Study of Inferencing

The study of deductive reasoning is easier than the study of inductive reasoning because there are widely recognized rules for determining whether a deductive argument is valid, whereas there are not correspondingly widely recognized rules for determining whether an inductive argument is sound. Perhaps, as a consequence, deductive reasoning has received more attention from researchers than has inductive reasoning.

Several paradigms for investigating deduction have been used extensively by students of cognition. None is more prominent than the "selection task" invented by British psychologist Peter Wason in the 1960s. In its simplest form, a person is shown four cards, laid out so that only one side of each card is visible, and is told that each card has either a vowel or a consonant on one side and either an even number or an odd number on the other side. The visible sides of the cards show a vowel, a consonant, an even number, and an odd number. The task is to specify which card or cards must be turned over to determine the truth or falsity of the claim If there is a vowel on one side, there is an even number on the other. The correct answer, according to conditional logic, is the card showing a vowel and the one showing an odd number. The original finding was that only a small minority of people given this task perform it correctly; the most common selections are either the card showing a vowel and the one showing an even number, or only the one showing a vowel. The finding has been replicated many times and with many variations of the original task. Several interpretations of the result have been proposed. That the task remains a focus of research more than 40 years after its invention is a testament to the ingenuity of its inventor and to the difficulty of determining the nature of human reasoning.

Raymond S. Nickerson

See also Experimental Design; Falsifiability; Hypothesis; Margin of Error; Nonexperimental Designs; Pre-Experimental Designs; Quasi-Experimental Designs

Further Readings

Cohen, L. J. (1970). The implications of induction. London: Methuen.
Galotti, K. M. (1989). Approaches to studying formal and everyday reasoning. Psychological Bulletin, 105, 331–351.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge: MIT Press.
Johnson-Laird, P. N., & Byrne, R. M. J. (1991). Deduction. Hillsdale, NJ: Lawrence Erlbaum.
Pólya, G. (1954). Mathematics and plausible reasoning, Vol. 1: Induction and analogy in mathematics, Vol. 2: Patterns of plausible inference. Princeton, NJ: Princeton University Press.
Rips, L. J. (1994). The psychology of proof: Deductive reasoning in human thinking. Cambridge: MIT Press.

INFLUENCE STATISTICS

Influence statistics measure the effects of individual data points or groups of data points on a statistical analysis. The effect of individual data points on an analysis can be profound, and so the detection of unusual or aberrant data points is an important part of nearly every analysis. Influence statistics typically focus on a particular aspect of a model fit or data analysis and attempt to quantify how the model changes with respect to that aspect when a particular data point or group of data points is included in the analysis. In the context of linear regression, where the ideas were first popularized in the 1970s, a variety of influence measures have been proposed to assess the impact of particular data points.

The popularity of influence statistics soared in the 1970s because of the proliferation of fast and relatively cheap computing, a phenomenon that allowed the easy examination of the effects of individual data points on an analysis for even relatively large data sets. Seminal works by R. Dennis Cook; David A. Belsley, Edwin Kuh, and Roy E. Welsch; and R. Dennis Cook and Sanford Weisberg led the way for an avalanche of new
techniques for assessing influence. Along with these new techniques came an array of names for them: DFFITS, DFBETAS, COVRATIO, Cook's D, and leverage, to name but a few of the more prominent examples. Each measure was designed to assess the influence of a data point on a particular aspect of the model fit: DFFITS on the fitted values from the model, DFBETAS on each individual regression coefficient, COVRATIO on the estimated residual standard error, and so on. Each measure can be readily computed using widely available statistical packages, and their use as part of an exploratory analysis of data is very common.

This entry first discusses types of influence statistics. Then we describe the calculation and limitations of influence statistics. Finally, we conclude with an example.

Types

Influence measures are typically categorized by the aspect of the model to which they are targeted. Some commonly used influence statistics in the context of linear regression models are discussed and summarized next. Analogs are also available for generalized linear models and for other more complex models, although these are not described in this entry.

Influence with respect to fitted values of a model can be assessed using a measure called DFFITS, a scaled difference between the fitted values for the models fit with and without each individual respective data point:

\[
\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i,i}}{\sqrt{\mathrm{MSE}_{(i)}\, h_{ii}}},
\]

where the notation in the numerator denotes fitted values for the response for models fit with and without the ith data point, respectively, $\mathrm{MSE}_{(i)}$ is the mean square for error in the model fit without data point i, and $h_{ii}$ is the ith leverage, that is, the ith diagonal element of the hat matrix, $H = X(X^{T}X)^{-1}X^{T}$. Although $\mathrm{DFFITS}_i$ resembles a t statistic, it does not have a t distribution, and the size of $\mathrm{DFFITS}_i$ is judged relative to a cutoff proposed by Belsley, Kuh, and Welsch. A point is regarded as potentially influential with respect to fitted values if $|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$, where n is the sample size and p is the number of estimated regression coefficients.

Influence with respect to estimated model coefficients can be measured either for individual coefficients, using a measure called DFBETAS, or through an overall measure of how individual data points affect estimated coefficients as a whole. DFBETAS is a scaled difference between estimated coefficients for models fit with and without each individual datum, respectively:

\[
\mathrm{DFBETAS}_{k,i} = \frac{\hat{\beta}_k - \hat{\beta}_{k,i}}{\sqrt{\mathrm{MSE}_{(i)}\, c_{kk}}}, \qquad k = 1, \ldots, p,
\]

where p is the number of coefficients and $c_{kk}$ is the kth diagonal element of the matrix $(X^{T}X)^{-1}$. Again, although $\mathrm{DFBETAS}_{k,i}$ resembles a t statistic, it fails to have a t distribution, and its size is judged relative to a cutoff proposed by Belsley, Kuh, and Welsch whereby the ith point is regarded as influential with respect to the kth estimated coefficient if $|\mathrm{DFBETAS}_{k,i}| > 2/\sqrt{n}$.

Cook's distance calculates an overall measure of distance between coefficients estimated using models with and without each respective data point:

\[
D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^{T} X^{T} X\, (\hat{\beta} - \hat{\beta}_{(i)})}{p\,\mathrm{MSE}}.
\]

There are several rules of thumb commonly used to judge the size of Cook's distance in assessing influence, with some practitioners using relative standing among the values of the $D_i$, whereas others prefer to use the 50% critical point of the $F_{p,n-p}$ distribution.

Influence with respect to the estimate of residual standard error in a model fit can be assessed using a quantity COVRATIO that measures the change in the estimate of error spread between models fit with and without the ith data point:

\[
\mathrm{COVRATIO}_i = \left(\frac{s_{(i)}}{s}\right)^{2p} \frac{1}{1 - h_{ii}},
\]

where $s_{(i)}$ is the estimate of residual standard error from a model fit without the ith data point. Influence with respect to residual scale is assessed if a point has a value of $\mathrm{COVRATIO}_i$ for which $|\mathrm{COVRATIO}_i - 1| \geq 3p/n$.
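The following sketch is not part of the original entry; it is a minimal illustration, assuming only NumPy, of how the measures just defined can be computed from a single least-squares fit. The function name, the simulated data, and the planted aberrant point are all made up for the example.

import numpy as np

def influence_statistics(X, y):
    # X: n x p design matrix that already contains a column of ones.
    # y: length-n response vector.
    # Returns leverages, DFFITS, DFBETAS, Cook's D, and COVRATIO.
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    h = np.diag(X @ xtx_inv @ X.T)              # leverages h_ii
    beta = xtx_inv @ X.T @ y                    # least-squares coefficients
    e = y - X @ beta                            # residuals from the full fit
    mse = e @ e / (n - p)                       # full-model mean square error, s^2

    # Deleted-point error variances from the updating formula
    # (n - p - 1) s_(i)^2 = (n - p) s^2 - e_i^2 / (1 - h_ii),
    # so no model has to be refit n separate times.
    s2_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)

    dffits = e * np.sqrt(h) / (np.sqrt(s2_del) * (1 - h))
    coef_change = (xtx_inv @ X.T) * (e / (1 - h))   # column i holds beta_hat - beta_hat_(i)
    dfbetas = coef_change.T / np.sqrt(np.outer(s2_del, np.diag(xtx_inv)))
    cooks_d = e**2 * h / (p * mse * (1 - h) ** 2)
    covratio = (s2_del / mse) ** p / (1 - h)
    return h, dffits, dfbetas, cooks_d, covratio

# Illustrative use with simulated data and one planted aberrant response;
# the cutoffs follow the rules of thumb quoted above.
rng = np.random.default_rng(1)
x = rng.normal(size=25)
y = 2 + 3 * x + rng.normal(size=25)
y[0] += 10
X = np.column_stack([np.ones(x.size), x])
h, dffits, dfbetas, cooks_d, covratio = influence_statistics(X, y)
n, p = X.shape
print(np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0])
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n)))

Because the deleted-point variances come from an updating formula of the kind described under Calculation below, the whole computation requires only one model fit.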
Many influence measures depend on the values of the leverages, $h_{ii}$, which are the diagonal elements of the hat matrix. The leverages are a function of the explanatory variables alone and, therefore, do not depend on the response variable at all. As such, they are not a direct measure of influence, but it is observed in a large number of situations that cases having high leverage tend to be influential. The leverages are closely related to the Mahalanobis distances of each data point's covariate values from the centroid of the covariate space, and so points with high leverage are in that sense "far" from the center of the covariate space. Because the average of the leverages is equal to p/n, where p is the number of covariates plus 1, it is common to consider points with twice the average leverage as having the potential to be influential; that is, points with $h_{ii} > 2p/n$ would be investigated further for influence. Commonly, the use of leverage in assessing influence would occur in concert with investigation of other influence measures.

Calculation

Although the formulas given in the preceding discussion for the various influence statistics are framed in the context of models fit with and without each individual data point in turn, the calculation of these statistics can be carried out without the requirement for multiple model fits. This computational saving is particularly important in the context of large data sets with many covariates, as each influence statistic would otherwise require n + 1 separate model fits in its calculation. Efficient calculation is possible through the use of updating formulas. For example, the values of $s_{(i)}$, the residual standard error from a model fit without the ith data point, can be computed via the formula

\[
(n - p - 1)\, s_{(i)}^2 = (n - p)\, s^2 - \frac{e_i^2}{1 - h_{ii}},
\]

where s is the residual standard error fit using the entire data set and $e_i$ is the ith residual from the model fit to the full data set. Similarly,

\[
\mathrm{DFFITS}_i = \frac{e_i \sqrt{h_{ii}}}{s_{(i)}\,(1 - h_{ii})},
\]

and similar expressions not requiring multiple model fits can be developed for the other influence measures considered earlier.

Limitations

Each influence statistic discussed so far is an example of a single-case deletion statistic, based on comparing models fit on data sets differing by only one data point. In many cases, however, more than one data point in a data set exerts influence, either individually or jointly. Two problems that can develop in the assessment of multiple influence are masking and swamping. Masking occurs when an influential point is not detected because of the presence of another, usually adjacent, influential point. In such a case, single-case deletion influence statistics fail because only one of the two potentially influential points is deleted at a time when computing the influence statistic, still leaving the other data point to influence the model fit. Swamping occurs when "good" data points are identified as influential because of the presence of other, usually remote, influential data points that influence the model away from the "good" data point. It is difficult to overcome the potential problems of masking and swamping for several reasons: First, in high-dimensional data, visualization is often difficult, making it very hard to "see" which observations are "good" and which are not; second, it is almost never the case that the exact number of influential points is known a priori, and points might exert influence either individually or jointly in groups of unknown size; and third, multiple-case deletion methods, although simple in conception, remain difficult to implement in practice because of the computational burden associated with assessing model fits for very large numbers of subsets of the original data.

Examples

A simple example concludes this entry. A "good" data set with 20 data points was constructed, to which was added, first, a single obvious influential point, and then a second, adjacent influential point. The first panel of Figure 1 depicts the original data, and the second and third panels show the augmented data.
Table 1   Values of Influence Statistics for Example Data

Two influential points added (right panel of figure)

Point    DFFITS    DFBETAS (0)   DFBETAS (1)   Cook's D   COVRATIO       Leverage
4        –1        –0.993        0.455         0.283      0.335          0.0572    Swamped
21       –0.406    0.13          –0.389        0.086      2.462          0.5562    Masked
22       –0.148    0.042         –0.14         0.012      1.973          0.4399    Masked
Cutoff   0.603     0.426         0.426         0.718      (0.73, 1.27)   0.18

One influential point added (middle panel of figure)

Point    DFFITS    DFBETAS (0)   DFBETAS (1)   Cook's D   COVRATIO       Leverage
4        –0.984    –0.966        0.419         0.273      0.338          0.0582    Swamped
21       –126      49.84         –126.02       1100       2.549          0.9925    Highly influential
Cutoff   0.617     0.436         0.436         0.719      (0.71, 1.29)   0.18

[Figure 1 shows three scatterplots of Y against X, with panels titled Original Data, One Influential Point Added, and Two Influential Points Added; the added points are labeled 21 and 22, and the "good" point discussed in the text is labeled 4.]

Figure 1   Influence Statistics Example Data

An initial analysis of the "good" data reveals no points suspected of being influential. When the first influential point is inserted in the data, its impact on the model is extreme (see the middle plot), and the influence statistics clearly point to it as being influential. When the second extreme point is added (see the rightmost plot), its presence obscures the influence of the initially added point (point 22 masks point 21, and vice versa), and the pair of added points causes a known "good" point, labeled 4, to be considered influential (the pair (21, 22) swamps point 4). In the plots, the fitted model using all data points is marked using a solid line, whereas the fitted model using only the "good" data points is marked using a dashed line. In the rightmost plot, the dotted line reflects the fitted model using all data points except point 22, whereas the dot-dash line reflects the fitted model using all data points except point 21. Of course, in this simple example, the effects of the data points marked in the plot are clearly visible—the simple two-dimensional case usually affords such an easy visualization. In higher dimensions, such visualization is typically not possible, and so the values of influence statistics become more useful as tools for identifying unusual or influential data points.

Table 1 shows the values of the various influence statistics for the example depicted in the figure. Values of DFFITS, DFBETAS, Cook's D, COVRATIO, and leverage are given for the situations depicted in the middle and right panels of the figure. The values of the influence statistics for the case of the single added influential point show
how effectively the influence statistics betray the added influential point—their values are extremely high across all statistics. The situation is very different, however, when a second, adjacent influential point is added. In that case, the two added points mask each other, and at the same time, they swamp a known "good" point. The dotted line and the dot-dash line in the rightmost panel of Figure 1 clearly show how the masking occurs—the fitted line is barely changed when either of the points 21 or 22 is individually removed from the data set. These points exert little individual influence, but their joint influence is extreme.

Michael A. Martin and Steven Roberts

See also Data Cleaning; Data Mining; Outlier; SPSS

Further Readings

Atkinson, A. C., & Riani, M. (2000). Robust diagnostic regression analysis. New York: Springer-Verlag.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Chatterjee, S., & Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1, 379–393.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74, 169–174.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman & Hall.

INFLUENTIAL DATA POINTS

Influential data points are observations that exert an unusually large effect on the results of regression analysis. Influential data might be classified as outliers, as leverage points, or as both. An outlier is an anomalous response value, whereas a leverage point has atypical values of one or more of the predictors. It is important to note that not all outliers are influential.

Identification and appropriate treatment of influential observations are crucial in obtaining a valid descriptive or predictive linear model. A single, highly influential data point might dominate the outcome of an analysis with hundreds of observations: It might spell the difference between rejection and failure to reject a null hypothesis or might drastically change estimates of regression coefficients. Assessing influence can reveal data that are improperly measured or recorded, and it might be the first clue that certain observations were taken under unusual circumstances. This entry discusses the identification and treatment of influential data points.

Identifying Influential Data Points

A variety of straightforward approaches is available to identify influential data points on the basis of their leverage, outlying response values, or individual effect on regression coefficients.

Graphical Methods

In the case of simple linear regression (p = 2), a contingency plot of the response versus predictor values might disclose influential observations, which will fall well outside the general two-dimensional trend of the data. Observations with high leverage as a result of the joint effects of multiple explanatory variables, however, are difficult to reveal by graphical means. Although simple graphing is effective in identifying extreme outliers and nonsensical values, and is valuable as an initial screen, the eyeball might not correctly discern less obvious influential points, especially when the data are sparse (i.e., small n).

Leverage

Observations whose influence is derived from explanatory values are known as leverage points. The leverage of the ith observation is defined as $h_i = x_i (X'X)^{-1} x_i'$, where $x_i$ is the ith row of the n × p design matrix X for p predictors and sample size n. Larger values of $h_i$, where $0 \le h_i \le 1$, are indicative of greater leverage. For reasonably large data sets (n − p > 50), a value of $h_i$ greater than 2p/n is a standard criterion for classification as a leverage point, where $\sum_{i=1}^{n} h_i = p$ and thus the mean of $h_i$ is p/n.
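As a brief illustration that is not part of the original entry, the leverages and the 2p/n screening rule can be computed directly from the design matrix; the simulated data below are made up for the example.

import numpy as np

# Simulated predictor with one point placed far from the others.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
x[0] = 8.0

X = np.column_stack([np.ones(n), x])            # n x p design matrix (intercept plus slope)
p = X.shape[1]

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_i

print(h.sum())                                  # equals p, so the mean leverage is p/n
print(np.where(h > 2 * p / n)[0])               # indices flagged by the 2p/n criterion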
Standardized Residuals

An objective test for outliers is available in the form of standardized residuals. The Studentized deleted residuals,

\[
e_i^* = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}},
\]

where $s_{(i)}^2$ is the mean square estimate of the residual variance $\sigma^2$ with the ith observation removed, have a Student's t distribution with n – p – 1 degrees of freedom (df) under the assumption of normally distributed errors. An equivalent expression might be constructed in terms of $\hat{y}_i(i)$, which is the fitted value for observation i when the latter is not included in estimating regression parameters:

\[
e_i^* = \frac{\sqrt{1 - h_i}}{s_{(i)}} \left[\, y_i - \hat{y}_i(i) \,\right].
\]

As a rule of thumb, an observation might be declared an outlier if $|e_i^*| > 3$. As mentioned, however, classification as an outlier does not necessarily imply large influence.

Estimates of Influence

Several additional measures assess influence on the basis of effect on the model fit and estimated regression parameters. The standardized change in fit,

\[
\mathrm{DFFITS}_i = \frac{\sqrt{h_i}\,(y_i - \hat{y}_i)}{(1 - h_i)\, s_{(i)}},
\]

provides a standardized measure of the effect of the ith observation on its fitted (predicted) value. It represents the change, in units of standard errors (SE), in the fitted value brought about by omission of the ith point in fitting the linear model. DFFITS and the Studentized residual are closely related: $\mathrm{DFFITS}_i = e_i^* \sqrt{h_i/(1 - h_i)}$. The criteria for large effect are typically $|\mathrm{DFFITS}| > 2\sqrt{p/n}$ for large data sets or $|\mathrm{DFFITS}| > 1$ for small data sets. This measure might be useful where prediction is the most important goal of an analysis.

More generally, it is of interest to examine effect on estimated regression coefficients. Influence on individual parameters might be assessed through a standardized measure of change:

\[
\mathrm{DFBETAS}_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_j(i)}{s_{(i)} \sqrt{\left[(X'X)^{-1}\right]_{jj}}},
\]

where $\hat{\beta}_j$ and $\hat{\beta}_j(i)$ are the least-squares estimates of the jth coefficient with and without the ith data point, respectively. DFBETAS, like DFFITS, measures effect in terms of the estimated SE. A reasonable criterion for a high level of influence is $|\mathrm{DFBETAS}_{ij}| > 2/\sqrt{n}$.

A composite score for influence on all coefficient estimates is available in Cook's distance,

\[
D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\, X'X\, (\hat{\beta}_{(i)} - \hat{\beta})}{p s^2} = \frac{h_i (y_i - \hat{y}_i)^2}{(1 - h_i)^2\, p s^2},
\]

where $\hat{\beta}$ and $\hat{\beta}_{(i)}$ are the p × 1 vectors of parameter estimates with and without observation i, respectively. Cook's distance scales the distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$ such that under the standard assumptions of linear regression, a value greater than the median of an F distribution with p and n – p df is generally considered to be highly influential.

Multiple Influential Points

The measures described previously are geared toward finding single influential data points. In a few cases, they will fail to detect influential observations because two or more similarly anomalous points might conceal one another's effect. Such situations are quite unusual, however. Similar tests have been developed that are generalized to detect multiple influential points simultaneously.

Treatment of Influential Data Points

Although influential data points should be carefully examined, they should not be removed from the analysis unless they are unequivocally proven to be erroneous. Leverage points that nevertheless have small effect might be beneficial, as they tend to enhance the precision of coefficient estimates. Observations with large effect on estimation can be acknowledged, and results both in the presence and absence of the influential observations can be reported. It is possible to downweight influential data points without omitting them completely, for example, through weighted linear regression or by Winsorization. Winsorization reduces the influence of outliers without completely removing observations by adjusting response values more centrally. This approach is appropriate, for example, in genetic segregation and linkage analysis, in which partially centralizing extreme values scales down their influence without changing inference on the underlying genotype.

Alternatively, and especially when several valid but influential observations are found, one might consider robust regression; this approach is relatively insensitive to even a substantial percentage of outliers. A large number of atypical or outlying
values might indicate an overall inappropriateness of the linear model for the data. For example, it might be necessary to transform (normalize) one or more variables.
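The following sketch is not from the entry; it shows one simple way to Winsorize a response variable with NumPy by clipping at percentile limits, with the 5th and 95th percentiles chosen arbitrarily for the example.

import numpy as np

def winsorize(y, lower_pct=5, upper_pct=95):
    # Pull extreme values in to the chosen percentile limits instead of deleting them.
    lo, hi = np.percentile(y, [lower_pct, upper_pct])
    return np.clip(y, lo, hi)

y = np.array([3.1, 2.8, 3.4, 2.9, 3.0, 11.7])   # one extreme response value
print(winsorize(y))                             # the extreme value is pulled back toward the rest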
Robert P. Igo, Jr.

See also Bivariate Regression; Data Cleaning; Exploratory Data Analysis; Graphical Display of Data; Influence Statistics; Residual Plot; Robust; Winsorize

Further Readings

Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression diagnostics: Identifying influential data and sources of collinearity. Hoboken, NJ: Wiley.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15–18.
Cook, R. D., & Weisberg, S. (1994). An introduction to regression graphics. Hoboken, NJ: Wiley.
Fox, J. (1997). Applied regression analysis, linear models and related methods. Thousand Oaks, CA: Sage.

INFORMED CONSENT

Protecting human participants in research is extremely important, and part of that process is informed consent. Informed consent is an ongoing communication process between research participants and the investigator to ensure participants' comfort. Informed consent allows potential research participants to volunteer their participation freely, without threat or undue coaching. The potential participant is also provided with information an individual would want to know before participating, so an educated decision can be made whether or not to participate. Generally, the purpose of informed consent is to protect each participant's welfare, ensure that participation is voluntary and informed, and promote positive feelings before and after completing a study.

This entry begins with a brief history and then describes the necessary components of informed consent and some additional considerations. Next, the entry discusses the methods involved in obtaining informed consent, including special cases. The entry concludes with a discussion of situations in which exceptions to the informed consent process might be made.

History

Today, protecting human research participants through informed consent is common practice. That was not always the case. Unethical and harmful studies conducted in the past led to the creation of a regulatory board and current ethical principles that are in place to protect the rights of human participants.

The Nazis conducted a great deal of inhumane research during World War II. As a result, in 1947 the Nuremberg Military Tribunal created the Nuremberg Code, which protected human participants in medical experiments. The code required researchers to obtain voluntary consent and minimize harm in experiments that would provide more benefits to the participants than foreseen risks. In 1954, the National Institutes of Health (NIH) established an ethics committee that adopted a policy that required all human participants to provide voluntary informed consent. Furthermore, the Department of Health, Education, and Welfare issued regulations in 1974 that called for protection of human research participants. The department would not support any research that was not first reviewed and approved by a committee, which is now known as the Institutional Review Board (IRB). The IRB would be responsible for determining the degree of risk and whether the benefits outweighed any risk to the participants. It was noted in the regulations that informed consent must be obtained.

The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research was created in 1974, and this organization was charged with identifying basic ethical principles within research with human subjects and making recommendations to improve the policies in place. As a result, the Belmont Report was created, which identifies and defines the three basic principles for research, and it is still in use today. The basic principles of the Belmont Report include respect for persons, beneficence, and justice. Beneficence is the process of maximizing good outcomes from science, humanity, and the research participants, while avoiding or minimizing unnecessary risk, harm, or wrong. Respect encompasses the overall protection of an individual's autonomy, through courtesy and respect for everyone, including individuals who are not autonomous (children,
mentally handicapped, etc.). Justice ensures reasonable, nonexploitative, and carefully considered procedures through fair administration.

As a result of the Belmont Report, six norms were determined for conducting research: valid research designs, competence of researcher, identification of consequences, selection of subjects, voluntary informed consent, and compensation for injury. Each norm coexists with the others to ensure participants' safety and should be followed by researchers when formulating and implementing a research project.

Several revisions were made to the ethics code as time progressed. More recently, the Common Rule was created and applied. The Common Rule established the following three main protective factors: review of research by an IRB, institutional assurances of compliance, and informed consent of participants. The Common Rule is also still used today.

Required Components

Generally, three conditions must be met for informed consent to be considered valid—the participants must understand the information presented, the consent must be given voluntarily, and the participant must be competent to give consent. More specifically, federal law requires eight components be included in the consent statement.

First, an explanation of the purpose of the research, the expected duration of the subject's participation, and a description of the procedure must be included. Details of the methods are not required and are actually discouraged to allow clearer comprehension on the part of the participant. Jargon, legal terminology, and irrelevant information should not be included. If deception is necessary, then participants must be informed that the details of the study cannot be explained prior to the study, and that they will be given a full explanation of the study upon completion. Additional information on the use of deception is discussed later.

Second, any description of foreseeable risk or discomfort should be explained. Risk implies that harm, loss, or damages might occur. This can include mere inconvenience or physical, psychological, social, economic, and legal risks.

Third, a description of any benefits to the subjects or others that are expected should be explained. Benefits include scientific knowledge; personally relevant benefits for the participants (i.e., food, money, and medical/mental health services); insight, training, learning, role modeling, empowerment, and future opportunities; psychosocial benefits (i.e., altruism, favorable attention, and increased self-esteem); kinship benefits (i.e., closeness to people or reduction of alienation); and community benefits (i.e., policies and public documentation).

Fourth, descriptions of alternatives to participation must be provided to potential participants. This provides additional resources to people who are being recruited.

The fifth requirement is a description of how confidentiality or anonymity will be ensured and its limits. Anonymity can be ensured in several ways. Examples include using numbers or code names instead of the names of the participants. The specifics of the study will likely determine how confidentiality will be ensured.

Sixth, if the research will have more than minimal risk, law requires a statement of whether compensation for injury will be provided. If compensation will be provided, a description of how should be included.

The seventh requirement is to provide the contact information of the individual(s) the participants can contact if they have any questions or in the event of harm.

Eighth, a statement will be made that participation is voluntary and that if one chooses not to participate, there will be no penalty or loss. In addition, it must be explained that if one chooses to participate, then leaving the study at any time is acceptable and there would be no penalty.

Last, the participants should receive a copy of the informed consent to keep.

Other Considerations

There are several elements that can be added to an informed consent form to make it more effective, although these items are not required. Examples include the following: any circumstances that might warrant termination of a participant regardless of consent, additional costs the participants might experience, the procedure if a participant
decided to leave the study and its consequences, and developments of the study.

Overall, an effective consent statement should be jargon free, easy to understand, and written in a friendly, simple manner. A lengthy description of the methods is not necessary, and any irrelevant information should not be included. As discussed previously, each legal requirement should be included as well.

In addition to the content and the style in which the informed consent is written, the manner in which the material is presented can increase or decrease participation. Establishing good rapport is very important and might require specific attention, as presenting the informed consent might become mundane if repeated a great deal. Using a friendly greeting and tone throughout the process of reading the informed consent is important. Using body language that displays openness will be helpful as well. A lack of congruence between what is verbalized and what is displayed through body language might lead potential participants to feel uncomfortable. Furthermore, using an appropriate amount of eye contact will help create a friendly atmosphere as well. Too little or too much eye contact could potentially be offensive to certain individuals. Presenting a willingness to answer all concerns and questions is important as well. Overall, potential participants will better trust researchers who present themselves in a friendly, caring manner and who create a warm atmosphere.

Methods of Obtaining Informed Consent

There are several methods by which consent can be obtained. Largely, consent is acquired through written (signed) consent. Oral and behavioral consent are other options that are used less commonly.

In most cases, the IRB will require a signed consent form. A signed consent form provides proof that consent was indeed obtained. In riskier studies, having a witness sign as well can provide extra assurance.

The actual written consent form can take two forms—one that contains each required element outlined previously or a short written consent document. If the full version is presented, the form can be read to or by the potential participants. The short form entails documenting that the required criteria of an informed consent were read orally to the participant or the participant's legally authorized representative. The IRB must also approve a written summary of what will be said orally to the potential participants. Only the short form will be signed by the participant. In addition, a witness must sign the short form and the summary of what was presented. A copy of the short form and the written summary should be provided to the participant. Whichever way the material is presented, the individual should be provided adequate time to consider the material before signing.

Behavioral consent occurs when the consent form is waived or exempt. These situations are discussed later.

Special Cases

Several cases require special considerations in addition to the required components of informed consent. These special cases include minors, individuals with disabilities, language barriers, third parties, studies using the Internet for collection, and the use of deception in research.

To protect children and adolescent research participants, safeguards are put into place. Children might be socially, cognitively, or psychologically immature, and therefore, cannot provide informed consent. In 1983, the Department of Health and Human Services adopted a federal regulation governing behavioral research on persons under the age of 18. The regulations that were put into place include several components. First, an IRB approval must be obtained. Next, the documented permission of one parent or guardian and the assent of the child must be obtained. Assent is the child's affirmative agreement to participate in the study. A lack of objection is not enough to assent. The standard for assent is the child's ability to understand the purpose and what will occur if one chooses to participate. In the case of riskier research, both parents' permission must be obtained. Furthermore, the research must involve no greater risk than the child normally encounters, unless the risk is justified by anticipated benefits to the child.

Adapting the assent process with young children can lead to better comprehension. Minimizing the level of difficulty by using simple language is effective to describe the study. After the
presentation of the information, the comprehension of the child should be assessed. Repeating the material or presenting it in a story or video format can be effective as well.

Another group that requires special consideration is individuals with disabilities. Assessing mental stability and illness as well as cognitive ability is important to determine the participants' ability to make an informed decision. Moreover, considerations of the degree of impairment and level of risk are critical to ensure the requirements of informed consent are met. Often, a guardian or surrogate will be asked to provide consent for participation in a study with disabled individuals.

Cultural issues also need consideration when obtaining consent from individuals of different nationalities and ethnicities. Individuals who speak a language other than English might have difficulty understanding the material presented in the informed consent. Special considerations should be made to address this issue. For instance, an interpreter can be used or the form could be translated into the native language of the potential participant. Translators can reduce language barriers significantly and provide an objective presentation of the information about the study.

Protecting third parties in research, or information obtained about other people from a participant, is another special case. Although there are no guidelines currently in place, some recommendations exist. Contextual information that is obtained from participants is generally not considered private. However, when information about a third party becomes identifiable and is private, an informed consent must be obtained.

With advances in technology, many studies are being carried out through the Internet because of the efficiency and low cost to researchers. Unfortunately, ethical issues, including informed consent, are hard to manage online. Researchers agree ethical guidelines are necessary, and some recommendations have been made, but currently there is no standardized method of collecting and validating informed consent online. Concerns about obtaining consent online include being certain the participant is of legal age to consent and that the material presented was understood. Maintaining confidentiality with the use of e-mail and managing deception are also difficult. Suggestions for these issues include recruiting participants via the Internet and then sending informed consent forms through the mail to obtain signatures. When the signed copy is obtained, a code can be sent back to participate in the online study. Another suggestion is using a button, where the participant has to click "I agree" after reading the informed consent. After giving consent, access to the next page would be granted.

The use of deception in research is another special concern that is extremely controversial. By definition, using deception does not meet the criteria of informed consent to provide full disclosure of information about the study to be conducted. Deception in research includes providing inaccurate information about the study, concealing information, using confederates, making false guarantees in regard to confidentiality, misrepresenting the identity of the investigator, providing false feedback to the participants, using placebos, using concealed recording devices, and failing to inform people they are part of a study. Proponents of deceptive research practice argue that deception provides useful information that could not otherwise be obtained if participants were fully informed. The American Psychological Association (APA) guidelines allow the use of deception with specific regulations. The use of deception must be justified clearly by the prospective scientific value, and other alternatives must be considered before using deception. Furthermore, debriefing must occur no later than the end of data collection to explain the use of deception in the study and to disclose fully all the information that was originally withheld.

Exceptions

There are cases in which an IRB will approve consent procedures with elements missing or with revisions from the standard list of requirements, or in which it will waive the written consent entirely. In general, the consent form can be altered or waived if it is documented that the research involves no more harm than minimal risk to the participants, the waiver or alteration will not adversely affect the rights and welfare of the subjects, the research could not be practically carried out without the waiver or alteration, or the subjects will be provided with additional pertinent information after participating.
Alterations and waivers can also be made if it is demonstrated that the research will be conducted or is subject to the approval of state or local government officials, and it is designed to study, evaluate, or examine (a) public benefit service programs, (b) procedures for obtaining benefits or services under these programs, (c) possible changes in or alternatives to those programs or procedures, or (d) possible changes in methods or level of payment for benefits or services under those programs. Furthermore, under the same conditions listed previously, required elements can be left out if the research could not be practically carried out without the waiver or alteration.

The IRB can also waive the requirement for the researcher to obtain a signed consent form in two cases. First, if the only record linking the participant and the research would be the consent form and the primary risk would be potential harm from a breach of confidentiality, then a signed consent form can be waived. Each participant in this case should be provided the choice of whether he or she would like documentation linking him or her to the research. The wishes of the participant should then be followed. Second, the research to be conducted presents no more than minimal risk to the participants and involves no procedures that normally require a written consent outside of a research context. In each case, after approving a waiver, the IRB could require the researcher to provide participants with a written statement explaining the research that will be conducted.

Observational studies, ethnographic studies, survey research, and secondary analysis can all waive informed consent. In observational studies, a researcher observes the interaction of a group of people as a bystander. If the participants remain anonymous, then informed consent can be waived. An ethnographic study involves the direct observation of a group through an immersed researcher. Waiving consent in ethnographic studies depends on the case and vulnerability of the participants. In conducting survey research, if the participant can hang up the phone or throw away mail, then consent is likely not needed. If a survey is conducted in person and the risk is minimal to the participant, then consent can be waived as well. Furthermore, informed consent does not have to be obtained for secondary analysis of data.

Parental permission can also be waived under two circumstances. Consent can be waived for research involving only minimal risk, given that the research will not affect the welfare of the participants adversely, and the research can be carried out practically without a waiver. For instance, children who live on the streets might not have parents who could provide consent. In addition, parental permission can also be waived if the parents do not properly protect the child.

The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research identified four other cases in which a waiver of parental consent could potentially occur. Research designed to study factors related to the incidence or treatment of conditions in adolescents who could legally receive treatment without parental permission is one case. The second is participants who are mature minors and procedures that involve no more risk than usual. Third, research designed to study neglected or abused children, and fourth, research involving children whose parents are not legally or functionally competent, do not require parental consent.

Child assent can also be waived if the child is deemed incapable of assenting, too young, immature, or psychologically unstable, or if obtaining assent would hinder the research possibilities. The IRB must approve the terms before dismissing assent.

There are several cases in which research is exempted from the IRB and, therefore, does not require the consent of the parent(s). One of the most common cases is if research is conducted in commonly accepted educational settings that involve normal educational practices. Examples of normal practices include research on instructional strategies and research on the effectiveness of, or the comparison among, instructional techniques, curricula, or classroom management. Another common example is if research involves educational tests (i.e., cognitive, diagnostic, aptitude, or achievement), or if the information from these tests cannot be identified to the participants.

Rhea L. Owens

See also Assent; Interviewing; Observations; Participants; Recruitment
Further Readings

Adair, J. G., Dushenko, T. W., & Lindsay, R. C. L. (1985). Ethical regulations and their impact on research practice. American Psychologist, 40, 59–72.
American Psychological Association. (1992). Ethical principles of psychologists and code of conduct. American Psychologist, 47, 1597–1611.
Citro, C. F., Ilgen, D. R., & Marrett, C. B. (2003). Protecting participants and facilitating social and behavioral sciences research. Washington, DC: National Academies Press.
Collogan, L. K., & Fleischman, A. R. (2005). Adolescent research and parental permission. In E. Kodish (Ed.), Ethics and research with children: A case-based approach. New York: Oxford University Press.
Gross, B. (2001). Informed consent. Annals of the American Psychotherapy Association, 4, 24.
Jensen, P. S., Josephson, A. M., & Frey, J. (1989). Informed consent as a framework for treatment: Ethical and therapeutic considerations. American Journal of Psychotherapy, 3, 378–386.
Michalak, E. E., & Szabo, A. (1998). Guidelines for Internet research: An update. European Psychologist, 3, 70–75.
Miller, C. (2003). Ethical guidelines in research. In J. C. Thomas & M. Hersen (Eds.), Understanding research in clinical and counseling psychology. Mahwah, NJ: Lawrence Erlbaum.
Sieber, J. E. (1992). Planning ethically responsible research: A guide for students and internal review boards. Newbury Park, CA: Sage.
Tymchuk, A. J. (1991). Assent processes. In B. Stanley & J. E. Sieber (Eds.), Social research on children and adolescents: Ethical issues. Newbury Park, CA: Sage.

INSTRUMENTATION

Instrumentation refers to the tools or means by which investigators attempt to measure variables or items of interest in the data-collection process. It is related not only to instrument design, selection, construction, and assessment, but also to the conditions under which the designated instruments are administered—the instrument is the device used by investigators for collecting data. In addition, during the process of data collection, investigators might fail to recognize that changes in the calibration of the measuring instrument(s) can lead to biased results. Therefore, instrumentation is also a specific term with respect to a threat to internal validity in research. This entry discusses instrumentation in relation to the data-collection process, internal validity, and research designs.

Instrumentation Pertaining to the Whole Process of Data Collection

Instrumentation is the use of, or work completed by, planned instruments. In a research effort, it is the responsibility of an investigator to describe thoroughly the instrument used to measure the dependent variable(s), outcome(s), or the effects of interventions or treatments. In addition, because research largely relies on data collection through measurement, and instruments are assigned operational numbers to measure purported constructs, instrumentation inevitably involves the procedure of establishing instrument validity and reliability as well as minimizing measurement errors.

Validity and Reliability

Validity refers to the extent to which an instrument measures what it purports to measure with investigated subjects. Based on the research necessities, investigators need to determine ways to assess instrument validity that best fit the needs and objectives of the research. In general, instrument validity consists of face validity, content validity, criterion-related validity, and construct-related validity. It is necessary to note that an instrument is simply valid for measuring a particular purpose and for a designated group. That is, an instrument can be valid for measuring a group of specific subjects but can become invalid for another. For example, a valid 4th-grade math achievement test is unlikely to be a valid math achievement test for 2nd graders. In another instance, a valid 4th-grade math achievement test is unlikely to be a valid aptitude test for 4th graders.

Reliability refers to the degree to which an instrument consistently measures whatever the instrument was designed to measure. A reliable instrument can generate consistent results. More specifically, when an instrument is applied to target subjects more than once, an investigator can expect to obtain results that are quite similar or even identical each time. Such measurement
consistency enables investigators to gain confidence in the measuring ability or dependability of the particular instrument. Approaches to reliability consist of repeated measurements on an individual (i.e., test–retest and equivalent forms), internal consistency measures (i.e., split-half, Kuder–Richardson 20, Kuder–Richardson 21, and Cronbach's alpha), and interrater and intrarater reliability. Usually, reliability is shown in numerical form, as a coefficient. The reliability coefficient ranges from 0 (errors existed in the entire measurement) to 1 (no error in the measurement was discovered); the higher the coefficient, the better the reliability.

Measurement Errors

Investigators need to attempt to minimize measurement errors whenever practical and possible for the purpose of accurately indicating the reported values collected by the instrument. Measurement errors can occur for various reasons and might result from the conditions of testing (e.g., test procedure not properly followed, testing site too warm or too cold for subjects to calmly respond to the instrument, noise distractions, or poor seating arrangements), from characteristics of the instrument itself (e.g., statements/questions not clearly stated, invalid instruments for measuring the concept in question, unreliable instruments, or statements/questions too long), from test subjects themselves (e.g., socially desirable responses provided by subjects, bogus answers provided by subjects, or updated or correct information not possessed by subjects), or combinations of these listed errors. Pamela L. Alreck and Robert B. Settle refer to the measurement errors described previously as instrumentation bias and error.

Concerning the validity and reliability of instrumentation, measurement errors can be both systematic and random. Systematic errors have an impact on instrument validity, whereas random errors affect instrument reliability. For example, if a group of students were given a math achievement test and the test was difficult for all examinees, then all test scores would be systematically lowered. These lowered scores indicate that the validity of the math achievement test is low for that particular student group or, in other words, the instrument does not measure what it purports to measure because the performance is low for all subjects. Measurement errors can also take place in a random fashion. In this case, for example, if a math achievement test is reliable and if a student has been projected to score 70 based on her previous performance, then investigators would expect test scores of this student to be close to the projected score of 70. After the same examination was administered on several different occasions, the scores obtained (e.g., 68, 71, and 72) might not be the exact projected score—but they are pretty close. In this case, the differences in test scores would be caused by random variation. Conversely, of course, if the test is not reliable, then considerable fluctuations in terms of test scores would not be unusual. In fact, any values or scores obtained from such an instrument would be, more or less, affected by random errors, and researchers can assume that no instrument is totally free from random errors. It is imperative to note that a valid instrument must have reliability. An instrument can, however, be reliable but invalid—consistently measuring the wrong thing.

Collectively, instrumentation involves the whole process of instrument development and data collection. A good and responsible research effort requires investigators to specify where, when, and under what conditions the data are obtained to provide scientific results and to facilitate similar research replications. In addition to simply indicating where, when, and under what conditions the data are obtained, the following elements are part of the instrumentation concept and should be clearly described and disclosed by investigators: how often the data are to be collected, who will collect the data, and what kinds of data-collection methods are employed. In summary, instrumentation is a term referring to the process of identifying and handling the variables that are intended to be measured in addition to describing how investigators establish the quality of the instrumentation concerning validity and reliability of the proposed measures, how to minimize measurement errors, and how to proceed in the process of data collection.

Instrumentation as a Threat to Internal Validity

As discussed by Donald T. Campbell and Julian Stanley in Experimental and Quasi-Experimental
Designs for Research, instrumentation, which is also named instrument decay, is one of the threats to internal validity. It refers to changes in the calibration of a measuring instrument or changes in the persons collecting the data that can adversely generate differences in the data gathered, thereby affecting the internal validity of a study. This threat can result from data-collector characteristics, data-collector bias, and the decaying effect. Accordingly, certain research designs are susceptible to this threat.

Data Collector Characteristics

The results of a study can be affected by the characteristics of data collectors. When more than two data collectors are employed as observers, scorers, raters, or recorders in a research project, a variety of individual characteristics (e.g., gender, age, working experience, language usage, and ethnicity) can interject themselves into the process and thereby lead to biased results. For example, this situation might occur when the performance of a given group is rated by one data collector while the performance of another group is collected by a different person. Suppose that both groups perform the task equally well per the performance criteria. However, the score of one group is significantly higher than that of the other group. The difference in raters would be highly suspect in causing the variations in measured performance. The principal controls to this threat are to use identical data collector(s) throughout the data-collection process, to analyze data separately for every data collector, to precalibrate or make certain that every data collector is equally skilled in the data collection task, or to ensure that each rater has the opportunity to collect data from each group.

Data Collector Bias

It is possible that data collectors might unconsciously treat certain subjects or groups differently than the others. The data or outcome generated under such conditions would inevitably produce biased results. Data collector bias can occur regardless of how many investigators are involved in the collection effort; a single data-collection agent is subject to bias. Examples of the bias include presenting "leading" questions to the persons being interviewed, allowing some subjects to use more time than others to complete a test, or screening or editing sensitive issues or comments by those collecting the data. The primary controls to this threat are to standardize the measuring procedure and to keep data collectors "blind." Principal investigators need to provide training and standardized guidelines to make sure that data collectors are aware of the importance of measurement consistency within the process of data collection. With regard to keeping data collectors "blind," principal investigators need to keep data collectors ignorant of which method individual subjects or groups (e.g., control group vs. experimental group) are being tested or observed in a research effort.

Decaying Effect

When the data generated from an instrument allow various interpretations and the process of handling those interpretations is tedious and/or difficult, requiring rigorous discernment, an investigator who scores or needs to provide comments on these instruments one after another can eventually become fatigued, thereby leading to scoring differences. A change in the outcome or conclusion supported by the data has now been introduced by the investigator, who is an extraneous source not related to the actual collected data. A common example would be that of an instructor who attempts to grade a large number of term papers. Initially, the instructor is thorough and painstaking in his or her assessment of performance. However, after grading many papers, tiredness, fatigue, and clarity of focus gradually factor in and influence his or her judgments. The instructor then becomes more generous in scoring the second half of the term papers. The principal control to this threat is to arrange several data-collection or grading sessions to keep the scorer calm, fresh, and mentally acute while administering examinations or grading papers. By doing so, the decaying effect that leads to scoring differences can be minimized.

Instrumentation as a Threat to Research Designs

Two quasi-experimental designs, the time-series and the separate-sample pretest–posttest designs,

and one preexperimental design, the one-group use instrumentation to collect hard data or mea-
pretest–posttest group design, are vulnerable to the surements of the real world, whereas research in
threat of instrumentation. The time-series design is the social sciences produces ‘‘soft’’ data that only
an elaboration of a series of pretests and posttests. measures perceptions of the real world. One rea-
For reasons too numerous to include here, data col- son that instrumentation is complicated is because
lectors sometimes change their measuring instru- many variables act independently as well as inter-
ments during the process of data collection. If this is act with each other.
the case, then instrumentation is introduced and any
main effect of the dependent variable can be misread Chia-Chien Hsu and Brian A. Sandford
by investigators as the treatment effect. Instrumenta-
See also Internal Validity; Reliability; Validity of
tion can also be a potential threat to the separate-
Measurement
sample pretest–posttest design. Donald T. Campbell
and Julian Stanley note that differences in attitudes
and experiences of a single data collector could be Further Readings
confounded with the variable being measured. That
is, when a data collector has administered a pretest, Alreck, P. L., & Settle, R. B. (1995). The survey research
handbook: Guidelines and strategies for conducting
he or she would be more experienced in the posttest
a survey. New York: McGraw-Hill.
and this difference might lead to variations in mea- Campbell, D. T., & Stanley, J. C. (1963). Experimental
surement. Finally, the instrumentation threat can be and quasi-experimental designs for research. Chicago:
one of the obvious threats often realized in the one- Rand McNally.
group pretest–posttest group design (one of the pre- Fraenkel, J. R., & Wallen, N. E. (2000). How to design
experimental designs). This is a result of the six and evaluate research in education. Boston: McGraw-
uncontrolled threats to internal validity inherent Hill.
with this design (i.e., history, maturation, testing, Gay, L. R. (1992). Educational research: Competencies
instrumentation, regression, interaction of selection, for analysis and application. Upper Saddle River, NJ:
and maturation). Therefore, with only one interven- Prentice Hall.
tion and the pre- and posttest design, there is
a greater chance of being negatively affected by data
collector characteristics, data collector bias, and
decaying effect, which can produce confounded INTERACTION
results. The effects of the biases are difficult to pre-
dict, control, or identify in consideration of any In most research contexts in the biopsychosocial
effort to separate actual treatment effects from the sciences, researchers are interested in examining
influence of these extraneous factors. the influence of two or more predictor variables
on an outcome. For example, researchers might be
interested in examining the influence of stress
Final Note
levels and social support on anxiety among first-
In the field of engineering and medical research, semester graduate students. In the current exam-
the term instrumentation is frequently used and ple, there are two predictor variables—stress levels
refers to the development and employment of and social support—and one outcome variable—
accurate measurement, analysis, and control. Of anxiety. In its simplest form, a statistical interac-
course, in the fields mentioned previously, instru- tion is present when the association between a
mentation is also associated with the design, con- predictor and an outcome varies significantly as
struction, and maintenance of actual instruments a function of a second predictor. Given the current
or measuring devices that are not proxy measures example, one might hypothesize that the associa-
but the actual device or tool that can be manipu- tion between stress and anxiety varies significantly
lated per its designed function and purpose. Com- as a function of social support. More specifically,
paring the devices for measurement in engineering one might hypothesize that there is no association
with the social sciences, the latter is much less pre- between stress and anxiety among individuals
cise. In other words, the fields of engineering might reporting higher levels of social support while

simultaneously hypothesizing that the association between stress and anxiety is strong among individuals reporting lower levels of social support. Data consistent with these joint hypotheses would be suggestive of a significant interaction between stress and social support in predicting anxiety.

[Figure 1. The Interaction of Stress and Support in Predicting Anxiety. Anxiety (vertical axis, 0–10) is plotted against stress (horizontal axis, 0–10); the line labeled High Social Support is flat, whereas the line labeled Low Social Support rises with stress.]

Hypothetical data consistent with this interaction are presented in Figure 1. The horizontal axis is labeled Stress, with higher values representing higher levels of stress. The vertical axis is labeled Anxiety, with higher values representing higher levels of anxiety. In figures such as these, one predictor (in this case, stress) is plotted along the horizontal axis, while the outcome (in this case, anxiety) is plotted along the vertical axis. The second predictor (in this case, social support) forms
the lines in the plot. In Figure 1, the flat line is
labeled High Social Support and represents the
association between stress and anxiety for indivi- social support or that support moderates the
duals reporting higher levels of social support. The stress–anxiety association. The term moderator is
other line is labeled Low Social Support and repre- commonly used in various fields in the social
sents the association between stress and anxiety sciences. Researchers interested in testing hypothe-
for individuals reporting lower levels of social sup- ses involving moderation are interested in testing
port. In plots like Figure 1, as the lines depart from statistical interactions that involve the putative
parallelism, a statistical interaction is suggested. moderator and at least one other predictor. In
plots like Figure 1, the moderator variable will
often be used to form the lines in the plot while
Terminological Clarity the remaining predictor is typically plotted along
Researchers use many different terms to discuss the horizontal axis.
statistical interactions. The crux issue in describing
statistical interactions has to do with dependence. Statistical Models
The original definition presented previously stated
that a statistical interaction is present when the Statistical interactions can be tested using many
association between a predictor and an outcome different analytical frameworks. For example,
varies significantly as a function of a second pre- interactions can be tested using analysis of vari-
dictor. Another way of stating this is that the ance (ANOVA) models, multiple regression mod-
effects of one predictor on an outcome depend on els, and/or logistic regression models—just to
the value of a second predictor. Figure 1 is a great name a few. Next, we highlight two such modeling
illustration of such dependence. For individuals frameworks—ANOVA and multiple regression—
who report higher levels of social support, there is although for simplicity, the focus is primarily on
no association between stress and anxiety. For ANOVA.
individuals who report lower levels of social sup-
port, there is a strong positive association between
Analysis of Variance
stress and anxiety. Consequently, the association
between stress and anxiety depends on level of The ANOVA model is often used when predic-
social support. Some other terms that researchers tor variables can be coded as finite categorical
use to describe statistical interactions are (a) condi- variables (e.g., with two or three categories) and
tional on, (b) contingent on, (c) modified by, and/ the outcome is continuous. Social science research-
or (d) moderated by. Researchers might state that ers who conduct laboratory-based experiments
the effects of stress on anxiety are contingent on often use the ANOVA framework to test research

Table 1   Data From a Hypothetical Laboratory-Based Study Examining the Effects of Stress and Social Support on Anxiety

                          Support
                      Low       High
    Stress   Low        4          2         3
             High      10          6         8
                        7          4

hypotheses. In the ANOVA framework, predictor variables are referred to as independent variables or factors, and the outcome is referred to as the dependent variable. The simplest ANOVA model that can be used to test a statistical interaction includes two factors—each of which has two categories (referred to as levels in the ANOVA vernacular). Next, an example is presented with hypothetical data that conform to this simple
structure. This example assumes that 80 partici-
pants were randomly assigned to one of four con-
ditions: (1) low stress and low support, (2) low the mean anxiety score among individuals in the
stress and high support, (3) high stress and low high-stress/low-support condition. Means are pre-
support, and (4) high stress and high support. In sented also in the margins of the tables (i.e., the
this hypothetical study, one can assume that stress underlined numbers). The marginal means are the
was manipulated by exposing participants to either means of the two relevant row or column entries.
a simple (low stress) or complex (high stress) cog- The main effect of a factor is examined by compar-
nitive task. One can assume also that a research ing the marginal means across the various levels of
confederate was used to provide either low or high the factor. In the current example, we assume that
levels of social support to the relevant study partic- any (nonzero) difference between the marginal
ipant while she or he completed the cognitive task. means is equivalent to a main effect for the factor in
question. Based on this assumption, the main effect
of stress in Table 1 is significant—because there is
Main Effects and Simple Effects
a difference between the two stress marginal
A discussion of interactions in statistical texts means of 3 (for the low-stress condition) and 8
usually involves the juxtaposition of two kinds of (for the high-stress condition). As might be
effects: main effects and simple effects (also expected, on average individuals exposed to the
referred to as simple main effects). Researchers high-stress condition reported higher levels of
examining main effects (in the absence of interac- anxiety than did individuals exposed to the low-
tions) are interested in the unique independent stress condition. Similarly, the main effect of
effect of each of the predictors on the outcome. In support is also significant because there is a dif-
these kinds of models—which are often referred to ference between the two support marginal means
as additive effects models—the effect of each pre- of 7 (for the low-support condition) and 4 (for
dictor on the outcome is constant across all levels the high-support condition).
of the remaining predictors. In sharp contrast, In the presence of a statistical interaction,
however, in examining models that include interac- however, the researcher’s attention turns away
tions, researchers are interested in exploring the from the main effects of the factors and instead
possibility that the effects of one predictor on the focuses on the simple effects of the factors. As
outcome depend on another predictor. Next, this described previously, in the data presented in
entry examines both main and interactive effects Table 1 the main effect of support are quantified
in the context of the hypothetical laboratory-based by comparing the means of individuals who
experiment. received either low or high levels of social sup-
In Table 1, hypothetical data from the laboratory- port. A close examination of Table 1 makes clear
based study are presented. The numbers inside the that scores contributing to the low-support mean
body of the table are means (i.e., arithmetic derive from the following two different sources:
averages) on the dependent variable (i.e., anxiety). (1) individuals who were exposed to a low-stress
Each mean is based on a distinct group of 20 partici- cognitive task and (2) individuals who were
pants who were exposed to a combination of the exposed to a high-stress cognitive task. The same
stress (e.g., low) and support (e.g., low) factors. For is true of scores contributing to the high-support
example, the number 10 in the body of the table is mean. In the current example, however, combining

Table 2   Heuristic Table of Various Effects From a Hypothetical 2 × 2 Design

                          Support
                      Low       High
    Stress   Low        A          B         E
             High       C          D         F
                        G          H

Note. Cell values above (e.g., A) represent group means—each of which is presumed to be based on 20 scores. Underlined values (e.g., E) represent marginal means—each of which is the average of the relevant row or column entries (i.e., E is the average of A and B).

    Effect                               Contrast
    Main effect of support               H − G
    Main effect of stress                F − E
    Simple support at low stress         B − A
    Simple support at high stress        D − C
    Simple stress at low support         C − A
    Simple stress at high support        D − B
    Stress by support interaction        (D − C) − (B − A) or (D − B) − (C − A)

Note. The minus signs in the contrast column are meant to denote subtraction.
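The arithmetic behind these contrasts is simple enough to verify directly. The short Python sketch below is an illustrative addition (it is not part of the original entry, and the variable names are arbitrary); it plugs the Table 1 cell means into the Table 2 contrasts.

```python
# Cell means from Table 1 (hypothetical stress-by-support study).
A = 4   # low stress, low support
B = 2   # low stress, high support
C = 10  # high stress, low support
D = 6   # high stress, high support

# Marginal means (the underlined values in Tables 1 and 2).
E = (A + B) / 2   # low-stress marginal mean = 3
F = (C + D) / 2   # high-stress marginal mean = 8
G = (A + C) / 2   # low-support marginal mean = 7
H = (B + D) / 2   # high-support marginal mean = 4

# Main effects are differences between marginal means.
main_effect_stress = F - E     # 8 - 3 = 5
main_effect_support = H - G    # 4 - 7 = -3

# Simple effects of stress at each level of support.
simple_stress_low_support = C - A    # 10 - 4 = 6
simple_stress_high_support = D - B   # 6 - 2 = 4

# The interaction contrast asks whether the simple effects differ.
interaction = simple_stress_high_support - simple_stress_low_support  # 4 - 6 = -2

print(main_effect_stress, main_effect_support,
      simple_stress_low_support, simple_stress_high_support, interaction)
```

The nonzero interaction contrast reproduces the conclusion reached in the text: the effect of stress on anxiety depends on the level of support. In an actual analysis, this contrast would be tested for statistical significance rather than simply compared with zero.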

data from individuals exposed to either lower or comparing the mean of the low-support/low-
higher levels of stress does not seem prudent. In stress group (4) to the mean of the low-support/
part, this is true because the mean anxiety score high-stress group (10) is testing the simple effect
for individuals exposed to high support varies as of stress at low levels of support. If there is
a function of stress. In other words, when holding a (nonzero) difference between these two means,
support constant at high levels, on average indivi- we will assume that the simple effect of stress at
duals exposed to the low-stress task report lower lower levels of support is statistically significant.
levels of anxiety (2) than do individuals exposed to Consequently, in the current case, this simple
the high-stress task (6). Even more important, the effect is significant (because 10  4 ¼ 6). In
stress effect is more pronounced at lower levels of examining the simple effects of stress at high
support because the average anxiety difference levels of support, we would compare the high-
between the low-stress (4) and high-stress (10) support/low-stress mean (2) with the high-
conditions is larger. In other words, the data pre- support/high-stress mean (6). We would conclude
sented in Table 1 suggest that the effects of stress that the simple effect of stress at high levels of sup-
on anxiety depend on the level of support. Another port is also significant (because 6  2 ¼ 4).
way of saying this is that there is a statistical inter- The test of the interaction effect is quantified
action between stress and support in predicting by examining whether the relevant simple effects
anxiety. are different from one another. If the difference
As noted previously, in exploring interactions, between the two simple effects of stress—one at
researchers focus on simple rather than main lower levels of support and the second at higher
effects. Although the term was not used, two of levels of support are compared—it will be found
the four simple effects yielded by the hypothetical that the interaction is significant (because 4 
study have already been discussed. When examin- 6 ¼  2). The fact that these two simple effects
ing simple effects, the researcher contrasts the differ quantifies numerically the original concep-
table means in one row or column of the table. In tual definition of a statistical interaction, which
doing so, the researcher is examining the simple stated that in its simplest form a statistical inter-
effects of one of the factors at a specific level (i.e., action is present when the association between
value) of the other factor. a predictor and an outcome varies significantly
Previously, when we observed that there was as a function of a second predictor. In the cur-
a difference in the mean anxiety scores for indi- rent case, the association between stress and
viduals who received a low level of social sup- anxiety varies significantly as a function of sup-
port under either the low-stress or high-stress port. At higher levels of support, the (simple)
conditions, we were discussing the simple effect effect of stress is more muted, resulting in a mean
of stress at low levels of support. In other words, anxiety difference of 4 between the low-stress

and high-stress conditions. At lower levels of impetus for their work derived from researchers’
support, however, the (simple) effect of stress is lack of understanding of how to specify and inter-
more magnified, resulting in a mean anxiety dif- pret multiple regression models properly, including
ference of 6 between the low-stress and high- tests of interactions. In some of the more com-
stress conditions. Consequently, the answer to monly used statistical software programs, ANOVA
the question ‘‘What is the effect of stress on anx- models are typically easier to estimate because the
iety?’’ is ‘‘It depends on the level of support.’’ actual coding of the effects included in the analysis
(Table 2 provides a generic 2 × 2 table in which occurs ‘‘behind the scenes.’’ In other words, if
all of the various main and simple effects are a software user requested a full factorial ANOVA
explicitly quantified. The description of the model (i.e., one including all main effects and
Table 1 entries as well as the presentation of interactions) to analyze the data from the hypo-
Table 2 should help the reader gain a better thetical laboratory-based study described previ-
understanding of these various effects.) ously, the software would create effect codes to
specify the stress and support predictors and
would also form the product of these codes to
Multiple Regression
specify the interaction predictor. The typical user,
In 1968, Jacob Cohen shared some of his however, is probably unaware of the coding that is
insights with the scientific community in psychol- used to create the displayed output. In specifying
ogy regarding the generality and flexibility of the a multiple regression model to analyze these
multiple regression approach to data analysis. data, the user would not be spared the work of
Within this general analytical framework, the coding the various effects in the analysis. More
ANOVA model exists as a special case. In the more important, the user would need to understand the
general multiple regression model, predictor vari- implications of the various coding methods for the
ables can take on any form. Predictors might be proper interpretation of the model estimates. This
unordered categorical variables (e.g., gender: male is one reason why the text by Aiken and West has
or female), ordered categorical variables (e.g., received so much positive attention. The text out-
symptom severity: low, moderate, or high), and/or lines—through the use of illustrative examples—
truly continuous variables (e.g., chronological age). various methods used to test and interpret multiple
Similarly, interactions between and/or among pre- regression models including tests of interaction
dictor variables can include these various mixtures effects. It also includes a detailed discussion of the
(e.g., a categorical predictor by continuous predic- proper interpretation of the various conditional
tor interaction or a continuous predictor by contin- (i.e., simple) effects that are components of larger
uous predictor interaction). interactions.
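To make the regression approach concrete, the following sketch (an illustrative example rather than code from this entry; it assumes the numpy, pandas, and statsmodels libraries and uses invented variable names and generating coefficients) codes a two-level cohort factor, centers a continuous stress predictor, and fits a model whose product term carries the cohort-by-stress interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: cohort effect-coded -1 (first year) vs. +1 (second year),
# plus a continuous measure of graduate school-related stress.
cohort = rng.choice([-1, 1], size=n)
stress = rng.normal(5, 2, size=n)
# Simulate an interaction: the stress-anxiety slope differs across cohorts
# (the generating coefficients are arbitrary).
anxiety = 2 + 0.5 * stress + 0.3 * cohort + 0.4 * cohort * stress + rng.normal(0, 1, n)

df = pd.DataFrame({
    "anxiety": anxiety,
    "cohort_code": cohort,
    "stress_c": stress - stress.mean(),  # centering eases interpretation of main effects
})

# The * operator expands to both main effects plus their product (the interaction).
model = smf.ols("anxiety ~ cohort_code * stress_c", data=df).fit()
print(model.summary())

# Simple slopes of stress within each cohort, built from the fitted coefficients.
b = model.params
slope_first_year = b["stress_c"] + b["cohort_code:stress_c"] * (-1)
slope_second_year = b["stress_c"] + b["cohort_code:stress_c"] * (+1)
print(slope_first_year, slope_second_year)
```

Testing whether those two simple slopes differ is the same as testing the product term, which is the sense in which the interaction coefficient quantifies how much one predictor's effect depends on the other predictor.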
Many of the same concepts described in the
context of ANOVA have parallels in the regression
framework. For example, in examining an interac-
tion between a categorical variable (e.g., graduate
Additional Considerations
student cohort: first year or second year) and a
continuous variable (e.g., graduate school–related This discussion endeavored to provide a brief and
stress) in predicting anxiety, a researcher might nontechnical introduction to the concept of statis-
examine the simple slopes that quantify the associ- tical interactions. To keep the discussion more
ation between stress and anxiety for each of the accessible, equations for the various models
two graduate school cohorts. In such a model, the described were not provided. Moreover, this dis-
test of the (cohort by stress) interaction is equiva- cussion focused mostly on interactions that were
lent to testing whether these simple slopes are sig- relatively simple in structure (e.g., the laboratory-
nificantly different from one another. based example, which involved an interaction
In 1991, Leona S. Aiken and Stephen G. West between two 2-level categorical predictors). Before
published their seminal work on testing, interpret- concluding, however, this entry broaches some
ing, and graphically displaying interaction effects other important issues relevant to the discussion of
in the context of multiple regression. In part, the statistical interactions.

Interactions Can Include Many Variables See also Analysis of Variance (ANOVA); Effect Coding;
Factorial Design; Main Effects; Multiple Regression;
All the interactions described previously involve Simple Main Effects
the interaction of two predictor variables. It is pos-
sible to test interactions involving three or more
predictors as well (as long as the model can be Further Readings
properly identified and estimated). In the social Aiken, L. S., & West, S. G. (1991). Multiple regression:
sciences, however, researchers rarely test interac- Testing and interpreting interactions. Newbury Park,
tions involving more than three predictors. In test- CA: Sage.
ing more complex interactions the same core Baron, R. M., & Kenny, D. A. (1986). The moderator-
concepts apply—although they are generalized to mediator variable distinction in social psychological
include additional layers of complexity. For exam- research: Conceptual, strategic, and statistical
ple, in a model involving a three-way interaction, considerations. Journal of Personality and Social
seven effects comprise the full factorial model (i.e., Psychology, 51, 1173–1182.
three main effects; three 2-way interactions; and Cohen, J. (1968). Multiple regression as a general data-
analytic system. Psychological Bulletin, 70, 426–443.
one 3-way interaction). If the three-way interac-
Cohen, J. (1978). Partialed products are interactions;
tion is significant, it suggests that the simple two- Partialed powers are curve components. Psychological
way interactions vary significantly as a function of Bulletin, 85, 858–866.
the third predictor. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003).
Applied multiple regression/correlation analysis for the
When Is Testing an Interaction Appropriate? behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence
Erlbaum.
This discussion thus far has focused nearly Keppel, G. (1991). Design and analysis: A researcher’s
exclusively on understanding simple interactions handbook (3rd ed.). Englewood Cliffs, NJ: Prentice
from both conceptual and statistical perspectives. Hall.
When it is appropriate to test statistical interac-
tions has not been discussed. As one might imag-
ine, the answer to this question depends on many
factors. A couple of points are worthy of mention, INTERNAL CONSISTENCY
however. First, some models require the testing of RELIABILITY
statistical interactions in that the models assume
that the predictors in question do not interact.
Internal consistency reliability estimates how much
For example, in the classic analysis of covariance
total test scores would vary if slightly different
(ANCOVA) model in which one predictor is trea-
items were used. Researchers usually want to mea-
ted as the predictor variable of primary theoretical
sure constructs rather than particular items. There-
interest and the other predictor is treated as a
fore, they need to know whether the items have
covariate (or statistical control variable), the
a large influence on test scores and research
model assumes that the primary predictor and cov-
conclusions.
ariate do not interact. As such, researchers
This entry begins with a discussion of classical
employing such models should test the relevant
reliability theory. Next, formulas for estimating
(predictor by covariate) interaction as a means of
internal consistency are presented, along with a dis-
assessing empirically this model assumption. Sec-
cussion of the importance of internal consistency.
ond, as is true in most empirical work in the
Last, common misinterpretations and the interac-
sciences, theory should drive both the design of
tion of all types of reliability are examined.
empirical investigations and the statistical analyses
of the primary research hypotheses. Consequently,
researchers can rely on the theory in a given area Classical Reliability Theory
to help them make decisions about whether to
hypothesize and test statistical interactions. To examine reliability, classical test score theory
divides observed scores on a test into two compo-
Christian DeLucia and Brandon Bergman nents, true score and error:

X = T + E;        σ²X = σ²T + σ²E;

where X = observed score, T = true score, and E = error score; and where σ²E = the variance of error scores across participants.
If Steve’s true score on a math test is 73 but he The reliability coefficient can now be rewritten
gets 71 on Tuesday because he is tired, then his as follows:
observed score is 71, his true score is 73, and his
error score is –2. On another day, his error score might be positive, so that he scores better than he usually would.

ρXX′ = σ²T / σ²X = σ²T / (σ²T + σ²E).
usually would.
Each type of reliability defines true score and Reliability coefficients vary from 0 to 1, with
error differently. In test–retest reliability, true score higher coefficients indicating higher reliability.
is defined as whatever is consistent from one test- This formula can be applied to each type of reli-
ing time to the next, and error is whatever varies ability. Thus, internal consistency reliability is the
from one testing time to the next. In interrater reli- proportion of observed score variance that is
ability, true score is defined as whatever is con- caused by true differences between participants,
sistent from one rater to the next, and error is where true differences are defined as differences
defined as whatever varies from one rater to the that are consistent across the set of items. If the
next. Similarly, in internal consistency reliability, reliability coefficient is close to 1, then researchers
true score is defined as whatever is consistent from would have obtained similar total scores if they
one item to the next (or one set of items to the had used different items to measure the same
next set of items), and error is defined as whatever construct.
varies from one item to the next (or from one set
of items to the next set of items that were designed
Estimates of Internal Consistency
to measure the same construct). To state this
another way, true score is defined as the expected Several different formulas have been proposed to
value (or long-term average) of the observed estimate internal consistency reliability. Lee
scores—the expected value over many times (for Cronbach, Cyril Hoyt, and Louis Guttman inde-
test–retest reliability), many raters (for interrater pendently developed the most commonly used for-
reliability), or many items (for internal consis- mula, which is labeled coefficient alpha after the
tency). The true score is the average, not the truth. terminology used by Cronbach. The split-half
The error score is defined as the amount by which approach is also common. In this approach, the
a particular observed score differs from the aver- test is divided into two halves, which are then cor-
age score for that person. related. G. F. Kuder and M. W. Richardson devel-
Researchers assess all types of reliability using oped KR-20 for use with dichotomous items (i.e.,
the reliability coefficient. The reliability coefficient true/false items or items that are marked as correct
is defined as the ratio of true score variance to or incorrect). KR-20 is easy to calculate by hand
observed score variance: and has traditionally been used in classroom set-
tings. Finally, Tenko Raykov and Patrick Shrout
have recently proposed measuring internal consistency reliability using structural equation modeling approaches.

ρXX′ = σ²T / σ²X,
approaches.
where ρXX′ = the reliability coefficient, σ²T = the variance of true scores across participants, and σ²X = the variance of observed scores across
σ 2X ¼ the variance of observed scores across
participants. Internal consistency reliability is the easiest type of
Classical test score theory assumes that true reliability to calculate. With test–retest reliability,
scores and errors are uncorrelated. Therefore, the test must be administered twice. With interra-
observed variance on the test can be decomposed ter reliability, the test must be scored twice. But
into true score variance and error variance: with internal consistency reliability, the test only

needs to be administered once. Because of this, if different items were used. This question is
internal consistency is the most commonly used theoretically important because it tells researchers
type of reliability. whether they have covered the full breadth of the
Internal consistency reliability is important construct. But this question is usually not of practi-
when researchers want to ensure that they have cal interest, because researchers usually administer
included a sufficient number of items to capture the same items to all participants.
the concept adequately. If the concept is narrow,
then just a few items might be sufficient. For
Common Misinterpretations
example, the International Personality Item Pool
(IPIP) includes a 10-item measure of self-discipline Four misinterpretations of internal consistency are
that has a coefficient alpha of .85. If the concept is common. First, researchers often assume that if
broader, then more items are needed. For example, internal consistency is high, then other types of
the IPIP measure of conscientiousness includes 20 reliability are high. In fact, there is no necessary
items and has a coefficient alpha of .88. Because mathematical relationship between the variance
conscientiousness is a broader concept than self- caused by items, the variance caused by time, and
discipline, if the IPIP team measured conscientious- the variance caused by raters. It might be that
ness with just 10 items, then the particular items there is little variance caused by items but consid-
that were included would have a substantial effect erable variance caused by time and/or raters.
on the scores obtained, and this would be reflected Because each type of reliability defines true score
in a lower internal consistency. and error score differently, there is no way to pre-
Second, internal consistency is important if dict one type of reliability based on another.
a researcher administers different items to each Second, researchers sometimes assume that high
participant. For example, an instructor might use internal consistency implies unidimensionality.
a computer-administered test to assign different This misinterpretation is reinforced by numerous
items randomly to each student who takes an textbooks that state that the internal consistency
examination. Under these circumstances, the coefficient indicates whether all items measure the
instructor must ensure that students’ course grades same construct. However, Neal Schmitt showed
are mostly a result of real differences between the that a test can have high internal consistency even
students, rather than which items they were if it measures two or more unrelated constructs.
assigned. This is possible because internal consistency reli-
However, it is unusual to administer different ability is influenced by both the relationships
items to each participant. Typically, researchers between the items and the number of items. If all
compare scores from participants who completed items are related strongly to each other, then just
identical items. This is in sharp contrast with other a few items are sufficient to obtain high internal
forms of reliability. For example, participants are consistency. If items have weaker relationships or
often tested at different times, both within a single if some items have strong relationships and other
study and across different studies. Similarly, parti- items are unrelated, then high internal consistency
cipants across different studies are usually scored can be obtained by having more items.
by different raters, and sometimes participants Researchers often want to know whether a set
within a single study are scored by different raters. of items is unidimensional, because it is easier to
When differences between testing times or raters interpret test scores if all items measure the same
are confounded with differences between partici- construct. Imagine that a test contains 10 vocabu-
pants, researchers must consider the effect of this lary items and 10 math items. Jane scores 10 by
design limitation on their research conclusions. answering the 10 vocabulary items correctly; John
Because researchers typically only compare partici- scores 10 by answering the 10 math items cor-
pants who have completed the same items, this rectly; and Chris scores 10 by answering half of
limitation is usually not relevant to internal consis- the vocabulary and half of the math items cor-
tency reliability. rectly. All three individuals obtain the same score,
Thus, the internal consistency coefficient tells but these identical scores do not reflect similar
researchers how much total test scores would vary abilities. To avoid this problem, researchers often

want to know whether test items are unidimen- score high on the test possess all the necessary
sional. However, as stated previously, internal con- skills and might do well in the job. If few appli-
sistency does not imply unidimensionality. cants score high on all items, the company might
To determine whether items measure a unitary need a more detailed picture of the strengths and
construct, researchers can take one of two weaknesses of each applicant. In that case, the
approaches. First, they can calculate the average researcher could develop internally consistent sub-
interitem correlation. This correlation measures scales to measure each skill area, as described pre-
how closely the items are related to each other and viously. In that case, internal consistency would be
is the most common measure of item homogeneity. relevant to the subscales but would remain irrele-
However, the average interitem correlation might vant to the total test scores.
disguise differences between items. Perhaps some Fourth, researchers often assume mistakenly
items have strong relationships with each other that the formulas that are used to assess internal
and other items are unrelated. Second, researchers consistency—such as coefficient alpha—are only
can determine how many constructs underlie a set relevant to internal consistency. Usually these for-
of items by conducting an exploratory factor anal- mulas are used to estimate the reliability of total
ysis. If one construct underlies the items, the (or average) scores on a set of k items, but these
researcher can determine whether some items mea- formulas can also be used to estimate the reliabil-
sure that construct better than others. If two or ity of total scores from a set of k times or k
more constructs underlie the items, then the raters—or any composite score. For example, if
researcher can determine which items measure a researcher is interested in examining stable dif-
each construct and create homogeneous subscales ferences in emotion, participants could record their
to measure each. In summary, high internal consis- mood each day for a month. The researcher could
tency does not indicate that a test is unidimen- average the mood scores across the 30 days for
sional; instead, researchers should use exploratory each participant. Coefficient alpha can be used to
factor analysis to determine dimensionality. estimate how much of the observed differences
The third misinterpretation of internal consis- between participants are caused by differences
tency is that internal consistency is important for between days and how much is caused by stable
all tests. There are two exceptions. Internal consis- differences between the participants. Alternatively,
tency is irrelevant if test items are identical and researchers could use coefficient alpha to examine
trivially easy. For example, consider a speed test of raters. Job applicants could be rated by each man-
manual dexterity. For each item, participants draw ager in a company, and the average ratings could
three dots within a circle. Participants who com- be calculated for each applicant. The researcher
plete more items within the time limit receive could use coefficient alpha to estimate the propor-
higher scores. When items are identical and easy, tion of variance caused by true differences between
as they are in this example, J. C. Nunnally and the applicants—as opposed to the particular set of
I. H. Berstein showed that internal consistency will managers who provided ratings. Thus, coefficient
be very high and hence is not particularly informa- alpha (and the other formulas discussed previ-
tive. This conclusion makes sense conceptually: ously) can be used to estimate the reliability of any
When items are identical, very little variance in score that is calculated as the total or average
total test scores is caused by differences in items. of parallel measurements—whether those parallel
Researchers should instead focus on other types of measurements are obtained from different items,
reliability, such as test–retest. times, or raters.
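For readers who want to see the computation, the sketch below is an illustrative addition (the data are simulated and the function name is arbitrary): it computes coefficient alpha from a participants-by-measurements matrix. As this entry notes, the columns can be items, days, or raters, and for dichotomous 0/1 items the same calculation corresponds to KR-20. The average interitem correlation discussed in this entry is computed as well.

```python
import numpy as np

def coefficient_alpha(scores):
    """Coefficient alpha for rows = participants, columns = parallel
    measurements (items, days, or raters)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                                # number of measurements
    item_variances = scores.var(axis=0, ddof=1)        # variance of each column
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated example: 100 participants, 10 items sharing one common influence.
rng = np.random.default_rng(1)
true_scores = rng.normal(0, 1, size=(100, 1))
items = true_scores + rng.normal(0, 1, size=(100, 10))

print(round(coefficient_alpha(items), 2))

# Average interitem correlation (mean of the off-diagonal correlations).
r = np.corrcoef(items, rowvar=False)
print(round(r[np.triu_indices_from(r, k=1)].mean(), 2))
```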
Internal consistency is also irrelevant when the
test is designed deliberately to contain heteroge-
Going Beyond Classical Test Score Theory
neous content. For example, if a researcher wants
to predict success in an area that relies on several In classical test score theory, each source of vari-
different skills, then a test that assesses each of ance is considered separately. Internal consistency
these skills might be useful. If these skills are inde- reliability estimates the effect of test items. Test–
pendent of each other, the test might have low retest reliability estimates the effect of time. Inter-
internal consistency. However, applicants who rater reliability estimates the effect of rater. To

provide a complete picture of the reliability of test Cronbach, L. J. (1951). Coefficient alpha and the internal
scores, the researcher must examine all types of structure of tests. Psychometrika, 16, 297–334.
reliability. Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963).
Even if a researcher examines every type of reli- Theory of generalizability: A liberalization of
reliability theory. British Journal of Statistical
ability, the results are incomplete and hard to inter-
Psychology, 16, 137–163.
pret. First, the reliability results are incomplete Kuder, G. F., & Richardson, M. W. (1937). The theory
because they do not consider the interaction of of the estimation of test reliability. Psychometrika, 2,
these factors. To what extent do ratings change 151–160.
over time? Do some raters score some items more Lord, F. M., & Novick, M. R. (1968). Statistical theories
harshly? Is the change in ratings over time consis- of mental test scores. Reading, MA: Addison-Wesley.
tent across items? Thus, classical test score theory Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric
does not take into account two-way and three-way theory (3rd ed.). New York: McGraw-Hill.
interactions between items, time, and raters. Sec- Raykov, R., & Shrout, P. E. (2002). Reliability of scales
with general structure: Point and interval estimation
ond, the reliability results are hard to interpret
with structural equation modeling approach.
because each coefficient is given separately. If inter-
Structural Equation Modeling: A Multidisciplinary
nal consistency is .91, test–retest reliability is .85, Journal, 9, 195–212.
and interrater reliability is .82, then what propor- Schmitt, N. (1996). Uses and abuses of coefficient alpha.
tion of observed score variance is a result of true Psychological Assessment, 8, 350–353.
differences between participants and what propor-
tion is a result of these three sources of random
Websites
error?
To address these issues, researchers can use International Personality Item Pool: http://ipip.ori.org
more sophisticated mathematical models, which
are based on a multifactor repeated measures anal-
ysis of variance (ANOVA). First, researchers can
conduct a study to examine the influence of all INTERNAL VALIDITY
these factors on test scores. Second, researchers
can calculate generalizability coefficients to take Internal validity refers to the accuracy of state-
into account the number of items, times, and raters ments made about the causal relationship between
that will be used when collecting data to make two variables, namely, the manipulated (treatment
decisions in an applied context. or independent) variable and the measured vari-
able (dependent). Internal validity claims are not
Kimberly A. Barchard based on the labels a researcher attaches to vari-
ables or how they are described but, rather, to the
See also Classical Test Theory; Coefficient Alpha;
procedures and operations used to conduct
Exploratory Factor Analysis; Generalizability Theory;
a research study, including the choice of design
Interrater Reliability; Intraclass Correlation; KR-20;
and measurement of variables. Consequently,
Reliability; Structural Equation Modeling; Test–Retest
internal validity is relevant to the topic of research
Reliability
methods. In the next three sections, the procedures
that support causal inferences are introduced, the
threats to internal validity are outlined, and meth-
Further Readings ods to follow to increase the internal validity of
a research investigation are described.
Anastasi, A., & Urbina, S. (1997). Psychological testing
(7th ed.): Upper Saddle River, NJ: Prentice Hall.
Cortina, J. M. (1993). What is coefficient alpha? An Causal Relationships Between Variables
examination of theory and applications. Journal of
Applied Psychology, 78, 98–104. When two variables are correlated or found to
Crocker, L., & Algina, J. (1986). Introduction to classical covary, it is reasonable to ask the question of
and modern test theory. Orlando, FL: Harcourt Brace whether there is a direction in the relationship.
Jovanovich. Determining whether there is a causal relationship

between the variables is often done by knowing one could argue that this relationship is not direct.
the time sequence of the variables; that is, whether The investigator’s discovery is a false positive
one variable occurred first followed by the other finding. The relationship between class size and
variable. In randomized experiments, where parti- academic achievement is not direct because the
cipants are randomly assigned to treatment condi- students associated with classes of different sizes
tions or groups, knowledge of the time sequence is are not equivalent on a key variable—attentive
often straightforward because the treatment vari- behavior. Thus, it might not be that larger class
able (independent) is manipulated before the mea- sizes have a positive influence on academic
surement of the outcome variable (dependent). achievement but, rather, that larger classes have
Even in quasi-experiments, where participants are a selection of students that, without behavioral
not randomly assigned to treatment groups, the problems, can attend to classroom instruction.
investigator can usually relate some of the change The third variable can threaten the internal
in pre-post test measures to group membership. validity of studies by leading to false positive find-
However, in observational studies where variables ings or false negative findings (i.e., not finding
are not being manipulated, the time sequence is a relationship between variables A and B because
difficult, if not impossible, to disentangle. of the presence of a third variable, C, that is
One might think that knowing the time diminishing the relationship between variables A
sequence of variables is often sufficient for ascer- and B). There are many situations that can give
taining internal validity. Unfortunately, time rise to the presence of uncontrolled third variables
sequence is not the only important aspect to con- in research studies. In the next section, threats to
sider. Internal validity is also largely about ensur- internal validity are outlined. Although each threat
ing that the causal relationship between two is discussed in isolation, it is important to note that
variables is direct and not mitigated by a third var- many of these threats can simultaneously under-
iable. A third, uncontrolled, variable can function mine the internal validity of a research study and
to make the relationship between the two other the accuracy of inferences about the causality of
variables appear stronger or weaker than it is in the variables involved.
real life. For example, imagine that an investigator
decides to investigate the relationship between
Threats to Internal Validity
class size (treatment variable) and academic
achievement (outcome variable). The investigator 1. History
recruits school classes that are considered large
An event (e.g., a new video game), which is not
(with more than 20 students) and classes that are
the treatment variable of interest, becomes accessi-
considered small (with fewer than 20 students).
ble to the treatment group but not the comparison
The investigator then collects information on stu-
group during the pre- and posttest time interval.
dents’ academic achievement at the end of the year
This event influences the observed effect (i.e., the
to determine whether a student’s achievement
outcome, dependent variable). Consequently, the
depends on whether he or she is in a large or small
observed effect cannot be attributed exclusively to
class. Unbeknownst to the investigator, however,
the treatment variable (thus threatening internal
students who are selected to small classes are those
validity claims).
who have had behavioral problems in the previous
year. In contrast, students assigned to large classes
are those who have not had behavioral problems
2. Maturation
in the previous year. In other words, class size is
related negatively to behavioral problems. Conse- Participants develop or grow in meaningful
quently, students assigned to smaller classes will be ways during the course of the treatment
more disruptive during classroom instruction and (between the pretest and posttest). The develop-
will plausibly learn less than those assigned to mental change in participants influences the
larger classes. In the course of data analysis, if the observed effect, and so now the observed effect
investigator were to discover a significant relation- cannot be solely attributed to the treatment
ship between class size and academic achievement, variable.

3. Testing example described in the previous section, the two


groups of class sizes differed systematically in the
In the course of a research study, participants
behavioral disposition of students. As such, any
might be required to respond to a particular instru-
observed effect could not be solely attributed
ment or test multiple times. The participants
to the treatment variable (class size). Selection is
become familiar with the instrument, which
a concern when participants are not randomly
enhances their performance and the observed
assigned to groups. This category of threat can
effect. Consequently, the observed effect cannot be
also interact with other categories to produce, for
solely attributed to the treatment variable.
example, a selection-history threat, in which
treatment groups have distinct local events occur-
4. Instrumentation ring to them as they participate in the study, or
a selection-maturation threat, in which treatment
The instrument used as a pretest to measure
groups have distinct maturation rates that are
participants is not the same as the instrument used
unrelated to the treatment variable of interest.
for the posttest. The differences in test type could
influence the observed effect; for example, the met-
ric used for the posttest could be more sensitive to 8. Ambiguity About Direction of Causal Influence
changes in participant performance than the metric
In correlation studies that are cross-sectional,
used for the pretest. The change in metric and not
meaning that variables of interest have not been
the treatment variable of interest could influence
manipulated and information about the variables
the observed effect.
are gathered at one point in time, establishing the
causal direction of effects is unworkable. This is
5. Statistical Regression because the temporal precedence among variables
When a pretest measure lacks reliability and par- is unclear. In experimental studies, in which a vari-
ticipants are assigned to treatment groups based on able has been manipulated, or in correlation stud-
pretest scores, any gains or losses indicated by the ies, where information is collected at multiple time
posttest might be misleading. For example, partici- points so that the temporal sequence can be estab-
pants who obtained low scores on a badly designed lished, this is less of a threat to internal validity.
pretest are likely to perform better on a second test
such as the posttest. Higher scores on the posttest 9. Diffusion of Treatment Information
might give the appearance of gains resulting from
the manipulated treatment variable but, in fact, the When a treatment group is informed about the
gains are largely caused by the inaccurate measure manipulation and then happens to share this infor-
originally provided by the pretest. mation with the control group, this sharing of
information could nullify the observed effect. The
sharing of details about the treatment experience
6. Mortality with control participants effectively makes the
When participants are likely to drop out more control group similar to the treatment group.
often from one treatment group in relation to
another (the control), the observed effect cannot be 10. Compensatory Equalization of Treatments
attributed solely to the treatment variable. When
groups are not equivalent, any observed effect could This is similar to the threat described in number
be caused by differences in the composition of the 9. In this case, however, what nullifies the effect of
groups and not the treatment variable of interest. the treatment variable is not the communication
between participants of different groups but,
rather, administrative concerns about the inequal-
7. Selection
ity of the treatment groups. For example, if an
Internal validity is compromised when one experimental school receives extra funds to imple-
treatment group differs systematically from ment an innovative curriculum, the control school
another group on an important variable. In the might be given similar funds and encouraged to

develop a new curriculum. In other words, when ensure that the groups are equivalent on key vari-
the treatment is considered desirable, there might ables. For example, if the groups are equivalent,
be administrative pressure to compensate the one would expect both groups to score similarly on
control group, thereby undermining the observed the pretest measure. Furthermore, one would
effect of the treatment. inquire about the background characteristics of the
students—Are there equal distributions of boys and
girls in the groups? Do they come from comparable
11. Rivalry Between Treatment Conditions
socioeconomic backgrounds? Even if the treatment
Similar to the threat described in number 9, groups are comparable, efforts should be taken to
threat number 10 functions to nullify differences not publicize the nature of the treatment one group
between treatment groups and, thus, an observed is receiving relative to the control group so as to
effect. In this case, when participation in a treat- avoid threats to internal validity involving diffusion
ment versus control group is made public, control of treatment information, compensatory equaliza-
participants might work extra hard to outperform tion of treatments, rivalry between groups, and
the treatment group. Had participants not been demoralization of participants that perceive to be
made aware of their group membership, an receiving the less desirable treatment. Internal valid-
observed effect might have been found. ity checks are ultimately designed to bolster confi-
dence in the claims made about the causal
relationship between variables; as such, internal
12. Demoralization of Participants Receiving Less
validity is concerned with the integrity of the design
Desirable Treatments
of a study for supporting such claims.
This last threat is similar to the one described in
number 11. In this case, however, when treatment Jacqueline P. Leighton
participation is made public and the treatment is
See also Cause-and-Effect; Control Variables; Quasi-
highly desirable, control participants might feel
Experimental Design; Random Assignment; True
resentful and disengage with the study’s objective.
Experimental Design
This could lead to large differences in the outcome
variable between the treatment and control groups.
However, the observed outcome might have little to Further Readings
do with the treatment and more to do with partici-
Cook, T. D., & Campbell, D. T. (1979). Quasi-
pant demoralization in the control group. experimentation: Design and analysis issues for field
settings. Boston: Houghton-Mifflin.
Establishing Internal Validity Keppel, G., & Wickens, T. D. (2002). Design and
analysis: A researcher’s handbook (4th ed.). Upper
Determining whether there is a causal relationship Saddle River, NJ: Prentice Hall.
between variables, A and B, requires that the vari- Rosenthal, R., & Rosnow, R. L. (1991). Essentials of
ables covary, the presence of one variable preceding behavioral research: Methods and data analysis (2nd
the other (e.g., A → B), and ruling out the pres- ed.). Boston: McGraw-Hill.
ence of a third variable, C, which might mitigate Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002).
Experimental and quasi-experimental designs for
the influence of A on B. One powerful way to
generalized causal inference. Boston: Houghton-
enhance internal validity is to randomly assign sam- Mifflin.
ple participants to treatment groups or conditions.
By randomly assigning, the investigator can guaran-
tee the probabilistic equivalence of the treatment
groups before the treatment variable is adminis- INTERNET-BASED
tered. That is, any participant biases are equally
distributed in the two groups. If the sample partici- RESEARCH METHOD
pants cannot be randomly assigned, and the investi-
gator must work with intact groups, which is often Internet-based research method refers to any
the case in field research, steps must be taken to research method that uses the Internet to collect

data. Most commonly, the Web has been used as Development


the means for conducting the study, but e-mail
has been used as well. The use of e-mail to col- The use of the Web for data collection primarily
lect data dates back to the 1980s while the first required the development of forms in Web pages.
uses of the Web to collect data started in the Forms are so ubiquitous now that it is hard to real-
mid-1990s. Whereas e-mail is principally limited ize that forms are not an original part of the Web,
to survey and questionnaire methodology, the which was primarily a novel way of moving
Web, with its ability to use media, has the ability between and within documents via links (hypertext).
to execute full experiments and implement a wide The next development was the ability to embed
variety of research methods. The use of the Inter- images in the browser known as Mosaic. However,
net offers new opportunities for access to partici- until forms were developed, there was no means to
pants allowing for larger and more diverse gain information from the user. With the develop-
samples. However, this new access to partici- ment of forms, and the simultaneous means to store
pants comes at the cost of a great deal of loss of the information from the forms, it was possible to
control of the research environment. Although ask readers of Web pages to submit information,
this loss of control can be problematic in correla- and then these data could be sent to the server and
tional designs, it can be devastating in experi- stored. This development occurred in the mid-1990s
mental designs where environmental control and it was not long before psychological researchers
can be everything. What makes this method so took advantage of this possibility to present stimuli
intriguing is the fact that valid results have to participants and collect their responses, thereby
been obtained in many studies using this conducting experiments over the Internet.
methodology. From that point in time, the use of the Inter-
Even though the use of e-mail is an interesting net for research has grown dramatically. In the
research method, it is little used at this time. It is mid-1990s, only a handful of studies were
probably because e-mail offers few if any advan- posted each year; now, hundreds of new studies
tages over the Web as a research environment, and are posted each year as shown on websites like
it cannot perform any data collection methods that Psychological Research on the Net. The largest
are not possible with the Web. As a result, the rest proportion of studies is in the area of social and
of this entry focuses on the use of the Web to col- cognitive psychology, although studies have been
lect psychological data. conducted in most areas of psychology including
emotions, mental health, health psychology, per-
ception, and even biopsychology.
Some General Terms
Examples
To ease discussion, some general terms need to be
introduced as they are unique to research on the The first publication of actual formal psychologi-
Internet. These terms are shown in Table 1. cal experiments can be found in the 1997 papers

Table 1 General Terms to Understand Internet Research


Term Definition
Browser The software used to read Web pages. Popular examples include Internet Explorer and Firefox.
Server The machine/software that hosts the Web page. When the browser asks for a Web page, it asks
the server to deliver the page.
Client The machine the person reading the Web page is on. The browser resides on the client machine.
Server side Operations that take place on the server. For example, when the data are stored, the operations
that store the data are server side.
Client side Operations that take place on the client machine. For example, the browser that interprets a
Web page file and makes it so you can read it is client side.
Forms Elements of Web pages that allow for the user to input information that can be sent to the server.
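As a concrete illustration of the terms in Table 1, the short Python sketch below plays the role of the server side: it receives a form submitted by the browser on the client side and stores the submitted answers. The sketch uses only the Python standard library, and the field names, file name, and port are hypothetical rather than part of any particular study.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

class FormHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The browser (client side) sends the completed form in the request body.
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_qs(self.rfile.read(length).decode("utf-8"))
        # Server side: append the submitted answers to a simple data file.
        participant = fields.get("participant", [""])[0]
        response = fields.get("response", [""])[0]
        with open("responses.csv", "a", encoding="utf-8") as data_file:
            data_file.write(f"{participant},{response}\n")
        # Send a short confirmation back to the participant's browser.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Thank you for participating.")

if __name__ == "__main__":
    # Serve on a local port; a real study would run this on a public server.
    HTTPServer(("", 8000), FormHandler).serve_forever()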

by John Krantz, Jody Ballard, and Jody Scher, both cases, the easiest path to be helpful was pre-
as well as by Ulf Reips. One main experiment dominantly taken. So both studies complement
performed by Krantz, Ballard, and Scher was each other.
a within-subject experiment examining preferences
for different weighted female drawings. The exper-
Practical Issues
iment was a 3 × 9 design with independent vari-
ables on weight and shoulder-to-hip proportions. Next, several practical issues that can influence the
The experiment was conducted both under tradi- data quality obtained from Web-based studies are
tional laboratory conditions and on the Web. The discussed. When a Web-based study is read, it is
participants gave ratings of preference for each fig- important to understand how the experimenter
ure using a magnitude estimation procedure. The handles the following factors.
use of the within-subject methodology and a mag-
nitude estimation procedure allowed for a detailed
Recruitment
comparison of the results found in the laboratory
with the results found on the Web. First, the results Just placing a study on the Web is not usually
were highly correlated between laboratory and sufficient to get an adequately large sample to ana-
Web. In addition, a regression analysis was per- lyze. It is typical to advertise the study. There are
formed on the two data sets to determine whether several methods of study advertising that are used.
the data do more than move in the same direction. One way is to advertise on sites that list psycho-
This regression found that the Web values are logical studies. The two largest are listed at the
nearly identical to the values for the same condi- end of this entry. These sites are also well known
tion in the laboratory; that is, the Web data essen- to people interested in participating in psychologi-
tially can be replaced by the laboratory data and cal research, which makes them a useful means for
vice versa. This similarity was found despite the participants to find research studies. These sites
vast difference in the ways the experiment was also come up at the top of searches for psychologi-
delivered (e.g., in the laboratory, the participants cal experiments and related terms in search
were tested in groups, and on the Web, presum- engines. Another common method is to solicit
ably most participants ran the study singly) and participants from discussion groups or e-mail
differences in age range (in the laboratory, all par- listservs. Because these groups tend to be formed
ticipants are traditional college students, whereas to discuss common issues, this method allows
a much greater age range was observed in the Web access to subpopulations that might be of interest.
version). With the advent of social networking on the Web,
Krantz and Reshad Dalal performed a literature social networking sites such as Facebook have also
review of the Web studies conducted up to the been used to recruit participants. Finally, tradi-
time this entry was written. The point of the tional media such as radio and television can be
review was to determine whether the Web results used to recruit participants to Internet studies. It
could be considered, at least in a general sense, has occurred that some network news programs
valid. They examined both e-mail and Web-based have found an experiment related to a show they
research methods and compared their results with were running and posted a link to that study on
laboratory method results. In general, they found the website associated with the show. It should be
that most Web studies tended to find weaker noted that the Web is not a monolithic entity. Dif-
results than in the control of the laboratory. How- ferent methods of recruitment will lead to different
ever, the results seemed valid and even in cases samples. Depending on the sample needs of the
where the data differed from the laboratory or study, it is often advisable to use multiple types of
field, the differences were intriguing. One study recruitment methods.
performed an e-mail version of Stanley Milgram’s
lost letter technique. In Milgram’s study, the letters
Sample Characteristics
that were mailed were sent to the originally
intended destination. However, the e-mails that One enticing feature of the Web as a research
were sent were returned to the original sender. In environment is the ability to obtain more diverse

samples. Samples are more diverse on the Web in the background. All of these factors, and others,
than the comparable sample in the laboratory. add potential sources of error to the data collected
However, that is not to say that the samples are over the Web. The term technical variance has
truly representative. Web use is not even distrib- been applied to this source of variation in experi-
uted across all population segments. It is probably mental conditions. Many of these variations, such
wise to consider the Web population in a mode as browser type and version, can be collected dur-
similar to early days of the telephone which, when ing the experiment, allowing some assessment of
used for sampling without attention to the popula- the influence of these technical variations on the
tion that had telephones, led to some classic mis- data. However, although rare, it is possible for
taken conclusions in political polls. users to hide or alter these values, such as altering
what browser is being used.
Dropout
One of the big concerns in Web-based research
is the ease with which participants can leave the Ethical Issues
study. In the laboratory, it is rare for a participant There have been some major discussions of the
to up and leave the experiment. Participants in ethics of Web-based research. In particular, the
Web-based research regularly do not complete lack of contact between the experimenter and par-
a study, leaving the researcher with several incom- ticipant means that it is more difficult to ensure
plete data sets. Incomplete data can make up to that the nature of the study is understood. In addi-
40% of a data set in some studies. There are two tion, there is no way to be sure that any partici-
main concerns regarding dropout. First, if the pant is debriefed rendering the use of deception
dropout is not random but is selective in some particularly problematic on the Web. However, on
sense, it can limit the generalizability of the results. the positive side, participants do feel very free to
Second, if the conditions in experiments differ in leave the study at any time, meaning that it can be
a way that causes differential dropout across con- more clearly assumed that the sample is truly vol-
ditions, this fact can introduce a confound in the untary, free of the social constraints that keep
experiment. This factor must be examined in eval- many participants in laboratory experiments when
uating the conditions. Information about the they wish to leave.
length of the study and the careful use of incen-
tives can reduce dropout. In addition, it is possible
in experiments that use many pages to measure
dropout and use it as a variable in the data Future Directions
analysis. As Web research becomes more widely accepted,
the main future direction will be Web-based
Technical Variance research examining new topic areas that are not
possible to be done in the laboratory. To date most
One of the big differences between the labora- Web-based studies have been replications and
tory and the Web is the loss of control over the extensions of existing research studies. Another
equipment used by the participant. In the labora- development will be the greater use of media in
tory, it is typical to have the participants all use the experiments. Most studies to date have been
the same computer or same type of computer, con- principally surveys with maybe a few images used
trol environmental conditions, and any other fac- as stimuli. The variations of monitors and lighting
tor that might influence the outcome of the study. have rendered the use of any images, beyond the
On the Web, such control is not possible. Varia- simplest, problematic. The development of better
tions include the type of computer being used, the and more controlled methods of delivering images
way the person is connected to the network, and video will allow a wider range of studies to be
the type of browser, the size of browser window explored over the Web.
the participant prefers, the version of the browser,
and even what other programs might be running John H. Krantz
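Because differential dropout across conditions can introduce a confound, completion rates are often compared between conditions before the outcome data are interpreted. The following is a minimal illustration of such a check; the counts are hypothetical and the SciPy library is assumed to be available.

from scipy.stats import chi2_contingency

# Hypothetical counts of participants who completed versus dropped out,
# tabulated separately for the two experimental conditions.
counts = [[120, 80],   # condition A: completed, dropped out
          [150, 50]]   # condition B: completed, dropped out

# A chi-square test of independence asks whether completion status is
# related to condition (i.e., whether dropout is differential).
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")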

See also Bias; Confounding; Ethics in the Research Generality is important in showing that the
Process; Experimental Design; Sampling obtained ratings are not the idiosyncratic results of
one person’s subjective judgment. Procedure ques-
Further Readings tions include the following: How many raters are
needed to be confident in the results? What is the
Binbaum, M. H. (2000). Psychological experiments on minimum level of agreement that the raters need
the internet. San Diego, CA: Academic Press. to achieve? Is it necessary for the raters to agree
Krantz, J. H., Ballard, J., & Scher, J. (1997). Comparing exactly or is it acceptable for them to differ from
the results of laboratory and world-wide Web samples
one another as long as the differences are system-
on the determinants of female attractiveness.
Behavioral Research Methods, Instruments, & atic? Are the data nominal, ordinal, or interval?
Computers, 29, 264–269. What resources are available to conduct the inter-
Krantz, J. H., & Dalal, R. (2000). Validity of Web-based rater reliability study (e.g., time, money, and tech-
psychological research. In M. H. Birnbaum (Ed.), nical expertise)?
Psychological experiments on the Internet Interrater or interobserver (these terms can be
(pp. 35–60). San Diego, CA: Academic Press. used interchangeably) reliability is used to assess
Musch, J., & Reips, U.-D. (2000). A brief history of Web the degree to which different raters or observers
experimenting. In M. H. Birnbaum (Ed.), make consistent estimates of the same phenome-
Psychological experiments on the Internet
non. Another term for interrater or interobserver
(pp. 61–88). San Diego, CA: Academic Press.
reliability estimate is consistency estimates. That
Reips, U.-D. (2002). Standards for Internet-based
experimenting. Experimental Psychology, is, it is not necessary for raters to share a common
49, 243–256. interpretation of the rating scale, as long as each
judge is consistent in classifying the phenomenon
according to his or her own viewpoint of the
Websites scale. Interrater reliability estimates are typically
Psychological Research on the Net: reported as correlational or analysis of variance
http://psych.hanover.edu/research/exponnet.html indices. Thus, the interrater reliability index repre-
Web Experiment List: http://genpsylab-wexlist.unizh.ch sents the degree to which ratings of different
judges are proportional when expressed as devia-
tions from their means. This is not the same as
interrater agreement (also known as a consensus
INTERRATER RELIABILITY estimate of reliability), which represents the extent
to which judges make exactly the same decisions
The use of raters or observers as a method of mea- about the rated subject. When judgments are made
surement is prevalent in various disciplines and on a numerical scale, interrater agreement gener-
professions (e.g., psychology, education, anthro- ally means that the raters assigned exactly the
pology, and marketing). For example, in psycho- same score when rating the same person, behavior,
therapy research raters might categorize verbal or object. However, the researcher might decide to
(e.g., paraphrase) and/or nonverbal (e.g., a head define agreement as either identical ratings or rat-
nod) behavior in a counseling session. In educa- ings that differ no more than one point or as rat-
tion, three different raters might need to score an ings that differ no more than two points (if the
essay response for advanced placement tests. This interest is in judgment similarity). Thus, agreement
type of reliability is also present in other facets of does not have to be defined as an all-or-none phe-
modern society. For example, medical diagnoses nomenon. If the researcher does decide to include
often require a second or even third opinion from a discrepancy of one or two points in the definition
physicians. Competitions, such as Olympic figure of agreement, the chi-square value for identical
skating, award medals based on quantitative rat- agreement should also be reported. It is possible to
ings provided by a panel of judges. have high interrater reliability but low interrater
Those data recorded on a rating scale are based agreement and vice versa. The researcher must
on the subjective judgment of the rater. Thus, the determine which form of determining rater reli-
generality of a set of ratings is always of concern. ability is most important for the particular study.

Whenever rating scales are being employed, it is rating scale, as long as each rater is consistent in
important to pay special attention to the interrater assigning a score to the phenomenon. Consistency
or interobserver reliability and interrater agree- is most used with continuous data. Values of .70 or
ment of the rating. It is essential that both the reli- better are generally considered to be adequate. The
ability and agreement of the ratings are provided three most common types of consistency estimates
before the ratings are accepted. In reporting the are (1) correlation coefficients (e.g., Pearson and
interrater reliability and agreement of the ratings, Spearman), (2) Cronbach’s alpha, and (3) intraclass
the researcher must describe the way in which the correlation.
index was calculated. The Pearson product-moment correlation coeffi-
The remainder of this entry focuses on calculat- cient is the most widely used statistic for calculat-
ing interrater reliability and choosing an appropri- ing the degree of consistency between independent
ate approach for determining interrater reliability. raters. Values approaching +1 or −1 indicate that
the raters are following a consistent pattern,
whereas values close to zero indicate that it would
Calculations of Interrater Reliability
be almost impossible to predict the rating of one
For nominal data (i.e., simple classification), at judge given the rating of the other judge. An
least two raters are used to generate the categorical acceptable level of reliability using a Pearson cor-
score for many participants. For example, a contin- relation is .70. Pearson correlations can only be
gency table is drawn up to tabulate the degree of calculated for one pair of judges at a time and for
agreement between the raters. Suppose 100 obser- one item at a time. The Pearson correlation
vations are rated by two raters and each rater assumes the underlying data are normally distrib-
checks one of three categories. If the two raters uted. If the data are not normally distributed, the
checked the same category in 87 of the 100 obser- Spearman rank coefficient should be used. For
vations, the percentage of agreement would be example, if two judges rate a response to an essay
87%. The percentage of agreement gives a rough item from best to worst, then a ranking and the
estimate of reliability and it is the most popular Spearman rank coefficient should be used.
method of computing a consensus estimate of If more than two raters are used, Cronbach’s
interrater reliability. The calculation is also easily alpha correlation coefficient could be used to com-
done by hand. Although it is a crude measure, it pute interrater reliability. An acceptable level for
does work no matter how many categories are Cronbach’s alpha is .70. If the coefficient is lower
used in each observation. An adequate level of than .70, this means that most of the variance in
agreement is generally considered to be 70%. the total composite score is a result of error vari-
However, a better estimate of reliability can be ance and not true score variance.
obtained by using Cohen’s kappa, which ranges The best measure of interrater reliability avail-
from 0 to 1 and represents the proportion of agree- able for ordinal and interval data is the intraclass
ment corrected for chance. correlation (R). It is the most conservative measure
of interrater reliability. R can be interpreted as the
K = (ρa − ρc) / (1 − ρc), proportion of the total variance in the ratings
caused by variance in the persons or phenomena
where ρa is the proportion of times the raters agree being rated. Values approaching the upper limit of
and ρc is the proportion of agreement we would R(1.00) indicate a high degree of reliability,
expect by chance. This formula is recommended whereas an R of 0 indicates a complete lack of reli-
when the same two judges perform the ratings. ability. Although negative values of R are possible,
For Cohen’s kappa, .50 is considered acceptable. If they are rarely observed; when they are observed,
subjects are rated by different judges but the num- they imply judge × item interactions. The more R
ber of judges rating each observation is held con- departs from 1.00, the less reliable are the judge’s
stant, then Fleiss’ kappa is preferred. ratings. The minimal acceptable level of R is con-
Consistency estimates of interrater reliability are sidered to be .60. There is more than one formula
based on the assumption that it is not necessary for available for intraclass correlation. To select the
the judges to share a common interpretation of the appropriate formula, the investigator must decide
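The percentage of agreement and Cohen's kappa described above can be illustrated with a short, self-contained computation. The following is a minimal sketch in Python; the two raters' nominal classifications are invented solely for illustration.

from collections import Counter

# Invented nominal classifications of the same ten observations by two raters.
rater_a = ["A", "A", "B", "C", "B", "A", "C", "C", "B", "A"]
rater_b = ["A", "B", "B", "C", "B", "A", "C", "A", "B", "A"]
n = len(rater_a)

# Observed proportion of agreement (the percentage-of-agreement estimate).
p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance: the product of the raters' marginal
# proportions for each category, summed over categories.
count_a, count_b = Counter(rater_a), Counter(rater_b)
p_c = sum((count_a[c] / n) * (count_b[c] / n) for c in set(rater_a) | set(rater_b))

# Cohen's kappa: agreement corrected for chance.
kappa = (p_a - p_c) / (1 - p_c)
print(f"Percentage of agreement = {p_a:.0%}, Cohen's kappa = {kappa:.2f}")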

(a) whether the mean differences in the ratings of intraclass correlation coefficients) are also fairly
the judges should be considered rater error and simple to compute. The greatest disadvantage to
(b) whether he or she is more concerned with the using these statistical techniques is that they are
reliability of the average rating of all the judges or sensitive to the distribution of the data. The more
the average reliability of the individual judge. the data depart from a normal distribution, the
The two most popular types of measurement more attenuated the results.
estimates of interrater reliability are (a) factor The measurement estimates of interrater reli-
analysis and (b) the many-facets Rasch model. The ability (e.g., factor analysis and many-facets Rasch
primary assumption for the measurement estimates measurement) can work with multiple judges, can
is that all the information available from all the adjust summary scores for rater severity, and can
judges (including discrepant ratings) should be allow for efficient designs (e.g., not all raters have
used when calculating a summary score for each to judge each item or object). However, the mea-
respondent. Factor analysis is used to determine surement estimates of interrater reliability require
the amount of shared variance in the ratings. The expertise and considerable calculation time.
minimal acceptable level is generally 70% of the Therefore, as noted previously, the best tech-
explained variance. Once the interrater reliability nique will depend on the goals of the study, the
is established, each subject will receive a summary nature of the data (e.g., degree of normality), and
score based on his or her loading on the first prin- the resources available. The investigator might also
cipal component underlying the ratings. Using the improve reliability estimates with additional train-
many-facets Rasch model, the ratings between ing of raters.
judges can be empirically determined. Also the dif-
ficulty of each item, as well as the severity of all Karen D. Multon
judges who rated each item, can be directly com-
See also Cohen’s Kappa; Correlation; Instrumentation;
pared. In addition, the facets approach can deter-
Intraclass Correlation; Pearson Product-Moment
mine to what degree each judge is internally
Correlation Coefficient; Reliability; Spearman Rank
consistent in his or her ratings (i.e., an estimate of
Order Correlation
intrarater reliability). For the many-facets Rasch
model, the acceptable rater values are greater than
.70 and less than 1.3.
Further Readings
Bock, R., Brennan, R. L., & Muraki, E. (2002). The
Choosing an Approach information in multiple ratings. Applied Psychological
There is no ‘‘best’’ approach for calculating interra- Measurement, 26; 364–375.
ter or interobserver reliability. Each approach has Cohen, J. (1960). A coefficient of agreement for nominal
scales. Educational and Psychological Measurement,
its own assumptions and implications as well as its
20, 37–46.
own strengths and weaknesses. The percentage of Fleiss, J. L. (1971). Measuring nominal scale agreement
agreement approach is affected by chance. Low among many raters. Psychological Bulletin, 76,
prevalence of the condition of interest will affect 378–382.
kappa and correlations will be affected by low var- Linacre, J. M. (1994). Many-facet Rasch measurement.
iability (i.e., attenuation) and distribution shape Chicago: MESA Press.
(normality or skewed). Agreement estimates of Snow, A. L., Cook, K. F., Lin, P. S., Morgan, R. O., &
interrater reliability (percent agreement, Cohen’s Magaziner, J. (2005). Proxies and other external
kappa, Fleiss’ kappa) are generally easy to compute raters: Methodological considerations. Health Services
and will indicate rater disparities. However, train- Research, 40, 1676–1693.
Stemler, S. E., & Tsai, J. (2008). Best practices in
ing raters to come to an exact consensus will
interrater reliability: Three common approaches.
require considerable time and might or might not In J. W. Osborne (Ed.), Best practices in quantitative
be necessary for the particular study. methods (pp. 29–49). Thousand Oaks, CA: Sage.
Consistency estimates of interrater reliability Tinsley, H. E. A., & Weiss, D. J. (1975). Interrater
(e.g., Pearson product-moment and Spearman rank reliability and agreement of subjective judgements.
correlations, Cronbach’s alpha coefficient, and Journal of Counseling Psychology, 22, 358–376.

Standardized tests, including Intelligence


INTERVAL SCALE Quotient (IQ), Scholastic Assessment Test (SAT),
Graduate Record Examination (GRE), Graduate
Interval scale refers to the level of measurement in Management Admission Test (GMAT), and Miller
which the attributes composing variables are mea- Analogies Test (MAT) are also examples of an
sured on specific numerical scores or values and interval scale. For example, in the IQ scale, the dif-
there are equal distances between attributes. The ference between 150 and 160 is the same as that
distance between any two adjacent attributes is between 80 and 90. Similarly, the distance in the
called an interval, and intervals are always equal. GRE scores between 350 and 400 is the same as
There are four scales of measurement, which the distance between 500 and 550.
include nominal, ordinal, interval, and ratio scales. Standardized tests are not based on a ‘‘true
The ordinal scale has logically rank-ordered attri- zero’’ point that represents the lack of intelligence.
butes, but the distances between ranked attributes These standardized tests do not even have a zero
are not equal or are even unknown. The equal dis- point. The lowest possible score for these stan-
tances between attributes on an interval scale dif- dardized tests is not zero. Because of the lack of
fer from an ordinal scale. However, interval scales a ‘‘true zero’’ point, standardized tests cannot
do not have a ‘‘true zero’’ point, so statements make statements about the ratio of their scores.
about the ratio of attributes in an interval scale Those who have an IQ score of 150 are not twice
cannot be made. Examples of interval scales as intelligent as those who have an IQ score of 75.
include temperature scales, standardized tests, the Similarly, such a ratio cannot apply to other stan-
Likert scale, and the semantic differential scale. dardized tests including SAT, GRE, GMAT, or
MAT.

Temperature Scales and Standardized Tests Likert Scales

Temperature scales including the Fahrenheit and One example of interval scale measurement that
Celsius temperature scales are examples of an is widely used in social science is the Likert scale.
interval scale. For example, the Fahrenheit temper- In experimental research, particularly in social
ature scale in which the difference between 25° sciences, there are measurements to capture atti-
and 30° is the same as the difference between 80° tudes, perceptions, positions, feelings, thoughts, or
and 85°. In the Celsius temperature scales, the dis- points of view of research participants. Research
tance between 16° and 18° is the same as that participants are given questions and they are
between 78° and 80°. expected to express their responses by choosing
However, 60°F is not twice as hot as 30°F. Simi- one of five or seven rank-ordered response choices
larly, –40°C is not twice as cold as –20°C. This is that is closest to their attitudes, perceptions, posi-
because both Fahrenheit and Celsius temperature tions, feelings, thoughts, or points of view.
scales do not have a ‘‘true zero’’ point. The zero An example of the Likert scales that uses a
points in the Fahrenheit and Celsius temperature 5-point scale is as follows:
scales are arbitrary—in both scales, 0 does not
mean the lack of heat nor cold. How satisfied are you with the neighborhood
In contrast, the Kelvin temperature scale is where you live?
based on a ‘‘true zero’’ point. The zero point of the • Very satisfied
Kelvin temperature scale, which is equivalent to • Somewhat satisfied
–459.67°F or –273.15°C is considered the lowest • Neither satisfied nor dissatisfied
possible temperature of anything in the universe. • Somewhat dissatisfied
In the Kelvin temperature scale, 400 K is twice as • Very dissatisfied
hot as 200 K, and 100 K is twice as cold as 200 K.
The Kelvin temperature scale is not an example of Some researchers argue that such responses are
interval scale but that of ratio scale. not interval scales because the distance between
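The claim that ratio statements are meaningless on an interval scale can be verified with simple arithmetic. The short Python calculation below is a minimal illustration: it expresses the same two temperatures in Fahrenheit, Celsius, and Kelvin and shows that their ratio is preserved only on the Kelvin (ratio) scale.

def f_to_c(f):
    return (f - 32) * 5 / 9

def c_to_k(c):
    return c + 273.15

f1, f2 = 60.0, 30.0
c1, c2 = f_to_c(f1), f_to_c(f2)
k1, k2 = c_to_k(c1), c_to_k(c2)

# The Fahrenheit ratio is 2, but the same temperatures give a different
# ratio in Celsius; only the Kelvin ratio reflects "twice as hot."
print(f"Fahrenheit: {f1:.2f} / {f2:.2f} = {f1 / f2:.2f}")
print(f"Celsius:    {c1:.2f} / {c2:.2f} = {c1 / c2:.2f}")
print(f"Kelvin:     {k1:.2f} / {k2:.2f} = {k1 / k2:.2f}")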

attributes are not equal. For example, the differ- Further Readings
ence between very satisfied and somewhat satisfied
Babbie, E. (2007). The practice of social research (11th
might not be the same as that between neither sat- ed.). Belmont, CA: Thomson and Wadsworth.
isfied nor dissatisfied and somewhat dissatisfied. Dillman, D. A. (2007). Mail and internet surveys: The
Each attribute in the Likert scales is given tailored design method (2nd ed.). New York: Wiley.
a number. For the previous example, very satisfied Keyton, J. (2006). Communication research: Asking
is 5, somewhat satisfied is 4, neither satisfied nor questions, finding answers (2nd ed.). Boston:
dissatisfied is 3, somewhat dissatisfied is 2, and McGraw-Hill.
very dissatisfied is 1. The greater number repre- University Corporation for Atmospheric Research.
sents the higher degree of satisfaction of respon- (2001). Windows to the universe: Kelvin scale.
Retrieved December 14, 2008, from http://
dents of their neighborhood. Because of such
www.windows.ucar.edu/cgi-bin/tour_def/earth/
numbering, there is now equal distance between Atmosphere/temperature/kelvin.html
attributes. For example, the difference between
very satisfied (5) and somewhat satisfied (4) is the
same as the difference between neither satisfied
nor dissatisfied (3) and somewhat dissatisfied (2).
However, the Likert scale does not have a ‘‘true INTERVENTION
zero’’ point, as shown in the previous example, so
that statements about the ratio of attributes in the Intervention research examines the effects of an
Likert scale cannot be made. intervention on an outcome of interest. The pri-
mary purpose of intervention research is to engen-
der a desirable outcome for individuals in need
Semantic Differential Scale
(e.g., reduce depressive symptoms or strengthen
Another interval scale measurement is the semantic reading skills). As such, intervention research
differential scale. Research respondents are given might be thought of as differing from prevention
questions and also semantic differential scales, usu- research, where the goal is to prevent a negative
ally 7-point or 5-point response scales, as their outcome from occurring, or even from classic lab-
response choices. Research respondents are expected oratory experimentation, where the goal is often
to choose 1 scale out of 7 or 5 semantic differential to support specific tenets of theoretical paradigms.
scales that is closest to their condition or perception. Assessment of an intervention’s effects, the sine
An example of the semantic differential scales qua non of intervention research, varies according
that uses a 7-point scale is as follows: to study design, but typically involves both statisti-
cal and logical inferences.
How would you rate the quality of the neighbor- The hypothetical intervention study presented
hood where you live? next is used to illustrate important features of
intervention research. Assume a researcher wants
Excellent Poor
to examine the effects of parent training (i.e., inter-
7 6 5 4 3 2 1 vention) on disruptive behaviors (i.e., outcome)
among preschool-aged children. Of 40 families
seeking treatment at a university-based clinic, 20
Research respondents who rate the quality of
families were randomly assigned to an intervention
their neighborhood as excellent should choose ‘‘7’’
condition (i.e., parent training) and the remaining
and those who rate the quality of their neighbor-
families were assigned to a (wait-list) control con-
hood as poor should choose ‘‘1.’’ Similar to the
dition. Assume the intervention was composed of
Likert scale, there is equal distance between attri-
six, 2-hour weekly therapy sessions with the par-
butes in the semantic differential scales, but there
ent(s) to strengthen theoretically identified parent-
is no ‘‘true zero’’ point.
ing practices (e.g., effective discipline strategies)
Deden Rukmana believed to reduce child disruptive behaviors.
Whereas parents assigned to the intervention con-
See also Ordinal Scale; Ratio Scale dition attended sessions, parents assigned to the

control condition received no formal intervention. groups are probabilistically equated on all mea-
In the most basic form of this intervention design, sured and unmeasured characteristics), it is
data from individuals in both groups are collected unlikely that some other factor resulted in postin-
at a single baseline (i.e., preintervention) assess- tervention group differences. It is worth noting
ment and at one follow-up (i.e., postintervention) that this protection conveyed by random assign-
assessment. ment can be undone once the study commences
(e.g., by differential attrition or participant loss). It
is also worth noting that quasi-experiments or
Assessing the Intervention’s Effect
intervention studies that lack random assignment
In the parenting practices example, the first step in to condition are more vulnerable to internal valid-
assessing the intervention’s effect involves testing ity threats. Thoughtful design and analysis of
for a statistical association between intervention quasi-experiments typically involve identifying sev-
group membership (intervention vs. control) and eral plausible internal validity threats a priori and
the identified outcome (e.g., reduction in temper incorporating a mixture of design and statistical
tantrum frequency). This is accomplished by using controls that attempt to rule out (or render
an appropriate inferential statistical procedure implausible) the influence of these threats.
(e.g., an independent-samples t test) coupled with
an effect size estimate (e.g., Cohen’s dÞ, to provide
Other Things to Consider
pertinent information regarding both the statistical
significance and strength (i.e., the amount of bene- Thus far, this discussion has focused nearly exclu-
fit) of the intervention–outcome association. sively on determining whether the intervention
Having established an intervention–outcome worked. In addition, intervention researchers often
association, researchers typically wish to ascertain examine whether certain subgroups of participants
whether this association is causal in nature (i.e., benefited more from exposure to the intervention
that the intervention, not some other factor, caused than did other subgroups. In the parenting exam-
the observed group difference). This more formi- ple, one might find that parents with a single child
dable endeavor of establishing an ‘‘intervention to respond more favorably to the intervention than
outcome’’ causal connection is known to social sci- do parents with multiple children. Identifying this
ence researchers as establishing a study’s internal subgroup difference might aid researchers in modi-
validity—the most venerable domain of the fying the intervention to make it more effective for
renowned Campbellian validity typology. Interven- parents with multiple children. This additional var-
tion studies considered to have high internal valid- iable (in this case, the subgroup variable) is referred
ity have no (identified) plausible alternative to as an intervention moderator. The effects of
explanations (i.e., internal validity threats) for the intervention moderators can be examined by test-
intervention–outcome association. As such, the ing statistical interactions between intervention
most parsimonious explanation for the results is group membership and the identified moderator.
that the intervention caused the outcome. Intervention researchers should also examine
the processes through which the intervention pro-
duced changes in the outcome. Examining these
Random Assignment in Intervention Research
process issues typically requires the researcher to
The reason random assignment is a much-heralded construct a conceptual roadmap of the interven-
design feature is its role in reducing the number of tion’s effects. In other words, the researcher must
alternative explanations for the intervention– specify the paths followed by the intervention in
outcome association. In randomized experiments affecting the outcomes. These putative paths are
involving a no-treatment control, the control con- referred to as intervention mediators. In the par-
dition provides incredibly important information enting example, these paths might be (a) better
regarding what would have happened to the inter- understanding of child behavior, (b) using more
vention participants had they not been exposed to effective discipline practices, or (c) increased levels
the intervention. Because random assignment pre- of parenting self-efficacy. Through statistical medi-
cludes systematic pretest group differences (as the ation analysis, researchers can test empirically
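The first analytic step described in this entry—testing the intervention–outcome association with an inferential test and an effect size estimate—might be sketched as follows. The outcome scores are invented and SciPy is assumed to be available; the sketch is illustrative rather than a prescription for any particular study.

import numpy as np
from scipy.stats import ttest_ind

# Invented postintervention outcome scores (e.g., weekly tantrum counts).
treatment = np.array([3, 2, 4, 1, 2, 3, 2, 1, 3, 2], dtype=float)
control = np.array([5, 4, 6, 5, 3, 4, 5, 6, 4, 5], dtype=float)

# Independent-samples t test for the intervention-outcome association.
t_stat, p_value = ttest_ind(treatment, control)

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")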

whether the intervention affected the outcome in Important Issues to Consider


part by its impact on the identified mediators. When Conducting an Interview
Christian DeLucia and Steven C. Pitts Interviewer Characteristics and Demeanor

See also External Validity; Internal Validity; Quasi- Physical attributes such as age, race, gender,
Experimental Designs; Threats to Validity; Treatment(s) and voice, as well as attitudinal attributes such as
friendliness, professionalism, optimism, persua-
siveness, and confidence, are important attributes
Further Readings that should be borne in mind when selecting inter-
Aiken, L. S., & West, S. G. (1991). Multiple regression: viewers. Even when questions are well written, the
Testing and interpreting interactions. Newbury Park, success of face-to-face and telephone surveys are
CA: Sage. still very much dependent on the interviewer. Inter-
Campbell, D. T. (1957). Factors relevant to the validity views are conducted to obtain information.
of experiments in social settings. Psychological However, information can only be obtained if
Bulletin, 54, 297–312. respondents feel sufficiently comfortable in an
Cook, T. D., & Campbell, D. T. (1979). Quasi- interviewer’s presence. Good interviewers have
experimentation: Design and analysis issues for field
excellent social skills, show a genuine interest in
settings. Chicago: Rand McNally.
getting to know their respondents, and recognize
Cronbach, L. J. (1982). Designing evaluations of
educational and social programs. San Francisco: that they need to be flexible in accommodating
Jossey-Bass. respondents’ schedules.
MacKinnon, D. P. (2008). Introduction to statistical Research shows that interviewer characteristics
mediation analysis. New York: Lawrence Erlbaum. can definitely affect both item response and
Reynolds, K., & West, S. G. (1988). A multiplist strategy response quality and might even affect a respon-
for strengthening nonequivalent control group designs. dent’s decision to participate in an interview. It
Evaluation Review, 11, 691–714. might, therefore, be desirable in many cases to
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). match interviewers and interviewees in an effort to
Experimental and quasi-experimental designs for
solicit respondents’ cooperation, especially for
generalized causal inference. Boston: Houghton
interviews that deal with sensitive topics (e.g.,
Mifflin.
racial discrimination, inequality, health behavior,
or domestic abuse) or threatening topics (e.g., ille-
gal activities). For example, women interviewers
INTERVIEWING should be used to interview domestically abused
women. Matching might also be desirable in some
cultures (e.g., older males to interview older
Interviewing is an important aspect of many types males), or for certain types of groups (e.g., minor-
of research. It involves conducting an interview— ity interviewers for minority groups). Additionally,
a purposeful conversation—between two people matching might help to combat normative
(the interviewer and the interviewee) to collect responses (i.e., responding in a socially desirable
data on some particular issue. The person asking way) and might encourage respondents to speak in
the questions is the interviewer, whereas the person a more candid manner. Matching can be done on
providing the answers is the interviewee (i.e., several characteristics, namely, race, age, ethnicity,
respondent). Interviewing is used in both quantita- and sex.
tive and qualitative research and spans a wide con-
tinuum of forms, moving from totally structured
Interviewer Training
to totally unstructured. It can use a range of tech-
niques including face-to-face (in-person), tele- When a research study is large and involves the
phone, videophone, and e-mail. Interviewing use of many interviewers, it will require proper
involves several steps, namely, determining the training, administration, coordination, and con-
interviewees, preparing for the interview, and con- trol. The purpose of interviewer training is to
ducting the interview. ensure that interviewers have the requisite skills

that are essential for the collection of high-quality, deviate substantially from the estimated time
reliable, and valid data. The length of training will required for the interview.
be highly dependent on the mode of survey execu-
tion, as well as the interviewers’ experience. The
The Interview
International Standards Association, for example,
recommends a minimum of 6 hours of training for At the beginning of the interview, the inter-
new telephone interviewers involved in market, viewer should greet the respondent in a friendly
opinion, and social research. manner, identify and introduce himself or herself,
Prior to entering the field, an interviewer train- and thank the respondent for taking time to facili-
ing session should be conducted with all involved tate the interview. If a face-to-face interview is
interviewers. At this session, interviewers should being conducted, the interviewer should also pre-
be given a crash course on basic research issues sent the interviewee with an official letter from the
(e.g., importance of random sampling, reliability institution sponsoring the research, which outlines
and validity, and interviewer-related effects). They the legitimacy of the research and other salient
should also be briefed on the study objectives and issues such as the interviewer’s credentials. In tele-
the general guidelines/procedures/protocol that phone and face-to-face interviews where contact
should be followed for data collection. If a struc- was not established in advance (e.g., a national
tured questionnaire is being used to collect data, it survey), the interviewer has to try to elicit the
is important that the group go through the entire cooperation of the potential respondent and
questionnaire, question by question, to ensure that request permission to conduct the interview. In
every interviewer clearly understands the question- unscheduled face-to-face interviews, many poten-
naire. This should be followed by one or more tial respondents might refuse to permit an inter-
demonstrations to illustrate the complete interview view for one or more of the following reasons:
process. Complications and difficulties encoun- busy, simply not interested, language barrier, and
tered during the demonstrations, along with safety concerns. In telephone interviews, the
recommendations for coping with the problems, respondent might simply hang up the telephone
should be discussed subsequent to the demonstra- with or without giving an excuse.
tion. Detailed discussion should take place on After introductions, the interviewer should then
how to use probes effectively and how to quickly brief the respondent on the purpose of the study,
change ‘‘tone’’ if required. A pilot study should be explain how the study sample was selected,
conducted after training to identify any additional explain what will be done with the data, explain
problems or issues. how the data will be reported (i.e., aggregated
statistics—no personal information), and, finally,
assure the respondent of anonymity and confiden-
tiality. The interviewer should also give the respon-
Preparing for the Interview
dent an idea of the estimated time required for the
Prior to a face-to-face interview, the interviewer interview and should apprise the interviewee of his
should either telephone or send an official letter or her rights during the interview process (e.g.,
to the interviewee to confirm the scheduled time, right to refuse to answer a question if respondent
date, and place for the interview. One of the most is uncomfortable with the question). If payment of
popular venues for face-to-face interviews is any kind is to be offered, this should also be
respondents’ homes; however, other venues can explained to the respondent.
also be used (e.g., coffee shops, parking lots, or An interviewer should try to establish good rap-
grocery stores). When sensitive topics are being port with the interviewee to gain the interviewee’s
discussed, a more private venue is desirable so that confidence and trust. This is particularly important
the respondent can talk candidly. Interviewers in a qualitative interview. Establishing good rap-
should also ensure that they are thoroughly port is, however, highly dependent on the inter-
acquainted with the questionnaire and guidelines viewer’s demeanor and social skills. Throughout
for the interview. This will help to ensure that the interview, the interviewer should try to make
the interview progresses smoothly and does not the conversational exchange a comfortable and

pleasant experience for the interviewee. Pleasant- questions be recorded verbatim to minimize errors
ries and icebreakers can set the tone for the inter- that could result from inaccurate summation. Ver-
view. At the same time, interviewers should be batim responses will also permit more accurate
detached and neutral, and should refrain from coding. Throughout the interview, the interviewer
offering any personal opinions. During the inter- should try to ensure that note taking is as unobtru-
view, interviewers should use a level of vocabulary sive as possible. Audio recordings should be used
that is easily understood by the respondent and to back up handwritten notes if the respondent has
should be careful about using certain gestures (this no objection. However, it might be necessary at
concern is applicable only to face-to-face inter- times to switch off the machine if the respondent
views) and words because they might be consid- seems reluctant to discuss a sensitive topic. Audio
ered offensive in some cultures and ethnic groups. recordings offer several advantages, namely, they
In addition, interviewers should maintain a relaxed can verify the accuracy of handwritten notes and
stance (body language communicates information) can be used to help interviewers to improve their
and a pleasant and friendly disposition; however, interviewing techniques.
these are applicable only to face-to-face interviews. If respondents give incomplete or unambiguous
At all times, interviewers should listen attentively answers, the interviewer should use tactful probes
to the respondent and should communicate this to to elicit a more complete answer (e.g., ‘‘Anything
the respondent via paraphrases, probes, nods, and else?’’ ‘‘In what ways?’’ ‘‘How?’’ ‘‘Can you elabo-
well-placed ‘‘uh-huhs’’ or ‘‘umms.’’ An interviewer rate a little more?’’). Probes must never be used to
should not interrupt a respondent’s silence that coerce or lead a respondent; rather, they should be
might occur because of thoughtful reflection or neutral, unbiased, and nondirective. Probes are
during an embarrassing conversation. Rather, he more common with open-ended questions. How-
or she should give the respondent sufficient time to ever, they can also be used with closed-ended ques-
resume the conversation on his or her own, or tions. For example, in a closed-ended question
important data might be lost. Additionally, when with a Likert scale, a respondent might give
interviewers are dealing with sensitive issues, they a response that cannot be classified on the scale.
should show some empathy with the respondent. The interviewer could then ask: ‘‘Do you strongly
When face-to-face interviews are being conducted, agree or strongly disagree?’’ There are several
they should be done without an audience, if possi- types of probes that can be used, namely, the silent
ble, to avoid distractions. probe (remaining silent until the respondent con-
To conduct the interview, the interviewer will tinues), the echo probe (repeating the last sentence
have to adopt a certain interview style (e.g., and asking the respondent to continue), the ‘‘uh-
unstructured, semistructured, or structured). This huh’’ probe (encouraging the respondent to con-
is determined by the research goal and is explained tinue), the tell-me-more probe (asking a question
during the training session. The style adopted has to get better insight), and the long question probe
implications for the amount of control that the (making your question longer to get more detailed
interviewer can exercise over people’s responses. In information).
qualitative research, interviews rely on what is At the conclusion of the interview, the inter-
referred to as an interview guide. An interview viewer should summarize the important points to
guide is a relatively unstructured list of general the respondent, allow the respondent sufficient
topics to be covered. Such guides permit great flex- time to refine or clarify any points, reassure the
ibility. In contrast, in quantitative research, an respondent that the information will remain confi-
interview schedule is used. An interview schedule dential, and thank the respondent for his or her
is a structured list of questions with explicit instru- time. Closure should be conducted in a courteous
ctions. Interview schedules are standardized. manner that does not convey abruptness to the
It is critically important that the interviewer interviewee. The respondent should be given the
follow the question wording for each question interviewer’s contact information. An official
exactly to ensure consistency across interviews and follow-up thank-you letter should also be sent
to minimize the possibility of interviewer bias. within 2 weeks. Immediately after the interview or
Additionally, it is important that open-ended as soon as possible thereafter, the interviewer

should update his or her recorded notes. This is interviewing technique. However, by the 1960s,
particularly important when some form of short- telephone interviewing started to gain popularity.
hand notation is used to record notes. This was followed by computer-assisted telephone
interviewing (CATI) in the 1970s, and computer-
assisted personal interviewing (CAPI) and
Interview Debriefing
computer-assisted self-interviewing (CASI) in the
Interview debriefing is important for obtaining 1980s. In CATI, an automated computer randomly
feedback on the interview process. Debriefing dials a telephone number. All prompts for intro-
can be held either in person or via telephone. The duction and the interview questions are displayed
debriefing process generally involves asking all on a computer screen. Once the respondent agrees
interviewers to fill out a questionnaire composed to participate, the interviewer records the answers
of both open-ended and closed-ended questions. A directly onto the computer. CAPI and CASI are
group meeting is subsequently held to discuss the quite similar to CATI but are used in face-to-face
group experiences. The debriefing session provides interviews. However, although CAPI is performed
valuable insight on problematic issues that require by the interviewer, with CASI, respondents either
correction before the next survey administration. can be allowed to type all the survey responses
onto the computer or can type the responses to
sensitive questions and allow the interviewer to
Interviewer Monitoring and Supervision
complete all other questions. Computer-assisted
To ensure quality control, interviewers should interviewing offers several advantages, including
be supervised and monitored throughout the study. faster recording and elimination of bulky storage;
Effective monitoring helps to ensure that unfore- however, these systems can be quite expensive to
seen problems are handled promptly, acts as set up and data can be lost if the system crashes
a deterrent to interview falsification, and assists and the data were not backed up. Other modern-
with reducing interviewer-related measurement day interviewing techniques include videophone
error. Good monitoring focuses on four main interviews, which closely resemble a face-to-face
areas: operational execution, interview quality, interview, except that the interviewer is remotely
interviewer falsification, and survey design. In gen- located, and e-mail interviews, which allow
eral, different types of monitoring are required for respondents to complete the interview at their
different interview techniques. For example, with convenience.
face-to-face interviews, interviewers might be
required to report to the principal investigator
after the execution of every 25 interviews, to turn Advantages and Disadvantages of Face-to-Face
in their data and discuss any special problems
and Telephone Interviews
encountered. In the case of telephone interviews,
the monitoring process is generally simplified The administration of a questionnaire by an inter-
because interviews are recorded electronically, and viewer has several advantages compared with
supervisors also have an opportunity to listen to administration by a respondent. First of all,
the actual interviews as they are being conducted. interviewer-administered surveys have a much
This permits quick feedback to the entire group on higher response rate than self-administered sur-
specific problems associated with issues such as (a) veys. The response rate for face-to-face interviews
voice quality (e.g., enunciation, pace, and volume) is approximately 80% to 85%, whereas for tele-
and (b) adherence to interview protocol (e.g., read- phone interviews, it is approximately 60%. This
ing verbatim scripts, using probes effectively, and might be largely attributable to the normal dynam-
maintaining neutrality). ics of human behavior. Many people generally feel
embarrassed in being discourteous to an inter-
viewer who is standing on their doorstep or is on
Types of Interviewing Techniques
the phone; however, they generally do not feel
Prior to the 1960s, paper-and-pencil (i.e., face-to- guilty about throwing out a mail survey as soon as
face) interviewing was the predominant type of it is received. Second, interviewing might help to

reduce ‘‘do not know’’ responses because the inter- Interviewer-Related Errors
viewer can probe to get a more specific answer.
Third, an interviewer can clarify confusing ques- The manner in which interviews are administered,
tions. Finally, when face-to-face interviews are as well as an interviewer’s characteristics, can
conducted, the interviewer can obtain other useful often affect respondents’ answers, which can lead
information, such as the quality of the dwelling (if to measurement error. Such errors are problematic,
conducted in the respondent’s home), respondent’s particularly if they are systematic, that is, when an
race, and respondent reactions. interviewer makes similar mistakes across many
Notwithstanding, interviewer-administrated interviews. Interviewer-related errors can be
surveys also have several disadvantages, namely, decreased through carefully worded questions,
(a) respondents have to give real-time answers, interviewer–respondent matching, proper training,
which means that their responses might not be continuous supervision or monitoring, and prompt
as accurate; (b) interviewers must have good ongoing feedback.
social skills to gain respondents’ cooperation Nadini Persaud
and trust; (c) improper administration and inter-
viewer characteristics can lead to interviewer- See also Debriefing; Ethnography; Planning Research;
related effects, which can result in measurement Protocol; Qualitative Research; Survey; Systematic
error, and (d) the cost of administration is con- Error
siderably higher (particularly for face-to-face
interviews) compared with self-administered
surveys. Further Readings
Babbie, E. (2005). The basics of social research (3rd ed.).
Belmont, CA: Thomson/Wadsworth.
Cost Considerations Erlandson, D. A., Harris, E. L., Skipper, B. L., & Allen,
S. D. (1993). Doing naturalistic inquiry: A guide to
The different interviewing techniques that can be methods. Newbury Park, CA: Sage.
used in research all have different cost implica- Ruane, J. M. (2005). Essentials of research methods: A
tions. Face-to-face interviews are undoubtedly the guide to social sciences research. Oxford, UK:
most expensive of all techniques because this pro- Blackwell Publishing.
cedure requires more interviewers (ratio of face-to- Schutt, R. K. (2001). Investigating the social world: The
face to telephone is approximately 4:1), more process and practice of research (3rd ed.). Thousand
interview time per interview (approximately 1 Oaks, CA: Pine Forge.
Weisberg, H. F., Krosnick, J. A., & Bowen, B. D. (1996).
hour), more detailed training of interviewers, and
An introduction to survey, research, polling and data
greater supervisor and coordination. Transporta- analysis (3rd ed.). Thousand Oaks, CA: Sage.
tion costs are also incurred with this technique.
Telephone interviews are considerably cheaper—
generally about half the cost. With this procedure,
coordination and supervision are much easier—
interviewers are generally all located in one room, INTRACLASS CORRELATION
printing costs are reduced, and sampling selection
cost is less because samples can be selected using The words intraclass correlation (ICC) refer to
random-digit dialing. These cost reductions greatly a set of coefficients representing the relationship
outweigh the cost associated with telephone calls. between variables of the same class. Variables of
Despite the significantly higher costs of face-to- the same class share a common metric and vari-
face interviews, this method might still be pre- ance, which generally means that they measure the
ferred for some types of research because response same thing. Examples include twin studies and
rates are generally higher and the quality of the two or more raters evaluating the same targets.
information obtained might be of a substantially ICCs are used frequently to assess the reliability of
higher quality compared with a telephone inter- raters. The Pearson correlation coefficient usually
view, which is quite impersonal. relates measures of different classes, such as height
Intraclass Correlation 637

Table 1 Data on Sociability of Dyads of Gay Couples


Couple Partner 1 Partner 2 Couple Partner 1 Partner 2 Couple Partner 1 Partner 2
1 18 15 6 30 21 11 40 43
2 21 22 7 32 28 12 41 28
3 24 30 7 35 37 13 42 37
4 25 24 9 35 31 14 44 48
5 26 29 10 38 25 15 48 50

and weight or stress and depression, and is an an ICC can be obtained for agreement. Within
interclass correlation. a couple, partner is a fixed variable—someone’s
Most articles on ICC focus on the computation partner can not be randomly select. Finally, there
of different ICCs and their tests and confidence is no question about averaging across partners,
limits. This entry focuses more on the uses of sev- so the reliability of an average is not relevant.
eral different ICCs. (In fact, ‘‘reliability’’ is not really the intent.)
The different ICCs can be distinguished along Table 2 gives the expected mean squares for
several dimensions: a one-way analysis of variance. The partner effect
can not be estimated separately from random
• One-way or two-way designs error.
• Consistency of order of rankings by different If each member of a couple had nearly the same
judges, or agreement on the levels of the score, there would be little within-couple variance,
behavior being rated and most of the variance in the experiment would
• Judges as a fixed variable or as a random be a result of differences between couples. If mem-
variable bers of a dyad differed considerably, the within-
• The reliability of individual ratings versus the
couple variance would be large and predominate.
reliability of mean ratings over several judges
A measure of the degree of relationship represents
the proportion (ρÞ of the variance that is between
One-Way Model couple variance. Therefore,
Although most ICCs involve two or more judges
σ 2C
rating n objects, the one-way models are differ- ρICC ¼ :
ent. A theorist hypothesizing that twins or gay σ 2C þ σ 2e
partners share roughly the same level of sociabil-
ity would obtain sociability data on both mem- The appropriate estimate for ρICC, using the
bers of 15 gay couples from a basic sociability obtained mean squares (MS), would be
index. A Pearson correlation coefficient is not
appropriate for these data because the data are MScouple  MSw=in
exchangeable within couples—there is no logical rICC ¼ :
MScouple þ ðk  1ÞMSw=in
reason to identify one person as the first member
of the couple and the other as the second. The
For this sample data, the analysis of variance
design is best viewed as a one-way analysis of
summary table is shown in Table 3.
variance with ‘‘couple’’ as the independent vari-
able and the two measurements within each cou-
ple as the observations. Possible data are Table 2 Expected Mean Squares for One-Way Design
presented in Table 1. With respect to the dimen- Source df E(MS)
sions outlined previously, this is a one-way
Between couple n  1 kσ 2C þ σ 2e
design. Partners within a couple are exchange-
Within couple n(k  1) σ 2e
able, and thus a partner effect would have no
Partner error k —
meaning. Because there is no partners effect, an
(n  1)(k  1) —
ICC for consistency cannot be obtained, but only
638 Intraclass Correlation

Table 3 Summary Table for One-Way Design Table 4 Ratings of Compatibility for 15 Couples by
Four Raters
Source df Sum Sq Mean Sq F value
Between couple 14 2304.87 164.63 8.74 Raters
Within couple 15 282.50 18.83
Couples A B C D
Total 29 2587.37
1 15 18 15 18
2 22 25 20 26
MScouple  MSw=in 3 18 15 10 23
ICC1 ¼ 4 10 7 12 18
MScouple þ ðk  1ÞMSw=in
5 25 22 20 30
164:63  18:83 145:80 6 23 28 21 30
¼ ¼ ¼ :795: 7 30 25 20 27
164:63 þ 18:83 183:46
8 19 21 14 26
A test of the null hypothesis that ρICC ¼ 0 can 9 10 12 14 16
be taken directly from the F for couples, which is 10 16 19 15 12
8.74 on (n  1) ¼ 14 and nðk  1) ¼ 15 degrees of 11 14 18 11 19
freedom (df). 12 23 28 25 22
This F can then be used to create confidence 13 29 21 23 32
limits on ρICC by defining 14 18 18 12 17
15 17 23 14 15
FL ¼ Fobs =F:975 ¼ 8:74=2:891 ¼ 3:023

FU ¼ Fobs × F:975 ¼ 8:74 × 2:949 ¼ 25:774: raters rate the compatibility of 15 married couples
based on observations of a session in which cou-
For FL, critical value is taken at α = .975 for (n – ples are asked to come to a decision over a question
1) and nðk – 1) degrees of freedom, but for FU, the of importance to both of them. Sample data are
degrees of freedom are reversed to obtain the criti- shown in Table 4.
cal value at α = .975 for nðk – 1) and (n – 1).
The confidence interval is now given by
Factors to Consider Before Computation
FL  1 FU  1
≤ρ Mixed Versus Random Models
FL þ ðk  1Þ FU þ ðk  1Þ
3:023  1 25:774  1 : As indicated earlier, there are several decisions
≤ρ≤
3:023 þ 1 25:774 þ 1 to make before computing an ICC. The first is
:503 ≤ ρ ≤ :925 whether raters are a fixed or a random variable.
Raters would be a fixed variable if they are the
Not only are members of the same couple simi- graduate assistants who are being trained to rate
lar in sociability, but the ICC is large given the couple compatibility in a subsequent experiment.
nature of the dependent variable. These are the only raters of concern. (The model
would be a mixed model because we always
assume that the targets of the ratings are sampled
Two-Way Models
randomly.) Raters would be a random variable if
The previous example pertains primarily to the sit- they have been drawn at random to assess whether
uation with two (or more) exchangeable measure- a rating scale we have developed can be used reli-
ments of each class. Two-way models usually ably by subsequent users. Although the interpreta-
involve different raters rating the same targets, and tion of the resulting ICC will differ for mixed and
it might make sense to take rater variance into random models (one can only generalize to subse-
account. quent raters if the raters one uses are sampled at
A generic set of data can be used to illustrate random), the calculated value will not be affected
the different forms of ICC. Suppose that four by this distinction.
Intraclass Correlation 639

Table 5 Analysis of Variance Summary Table for The analysis of variance for the data in Table 4
Two-Way Design is shown in Table 5.
Source df Sum Sq Mean Sq F value With either a random or mixed-effects model,
Row (Couple) 14 1373.233 98.088 10.225 the reliability of ratings in a two-way model for
Rater 3 274.850 91.617 9.551 consistency is defined as
Error 42 402.900 9.593
MSrow  MSerror
Total 59 2050.983 ICCC;1 ¼ :
MSrow þ ðk  1ÞMSerror

The notation ICCC;1 refers to the ICC for con-


Agreement Versus Consistency
sistency based on the individual rating. For the
From a computational perspective, the researcher example, in Table 2 this becomes
needs to distinguish between a measure of consis-
tency and a measure of agreement. Consistency MSrow  MSerror
ICCC;1 ¼
measures consider only whether raters are aligned MSrow þ ðk  1ÞMSerror
in their ordering of targets, whereas agreement
98:088  9:593 88:495
measures take into account between-rater variance ¼ ¼
in the mean level of their ratings. If one trains grad- 98:088 þ ð4  1Þ × 9:593 126:867
uate assistants as raters so that they can divide up ¼ :698:
the data and have each rater rate 10 different cou-
ples, it is necessary to ensure that they would give A significance test for this coefficient against the
the same behavioral sample similar ratings. A mea- null hypothesis ρ = 0 is given by the test on rows
sure of agreement is needed. If, instead, rater means (couples) in the summary table. This coefficient is
will be equated before using the data, a measure of clearly significantly different from 0.
consistency might be enough. Agreement and con- FL and FU are defined as before:
sistency lead to different ICC coefficients.
FL ¼ Fobs =Fð:975;14;42Þ ¼ 10:225=2:196 ¼ 4:654:
Unit of Reliability
The final distinction is between whether the FU ¼ Fobs ð:975; 42; 14Þ
concern is the reliability of individual raters or of
a mean rating for each target. It should be evident ¼ 10:225 × 2:668 ¼ 27:280:
that mean ratings will be more reliable than indi-
The 95% confidence intervals (CIs) are given by
vidual ratings, and the ICC coefficient should
reflect that difference. FL  1 4:656  1
CIL ¼ ¼ ¼ :478;
FL þ ðk  1Þ 4:656 þ 3
Two-Way Model for Consistency
and
Although the distinction between mixed and
random models has interpretive importance, there FU  1 27:280  1
CIU ¼ ¼ ¼ :869:
will be no difference in the computation of the FU þ ðk  1Þ 27:280 þ 3
ICCs. For that reason, the two models will not be
distinguished in what follows. But distinctions Notice that the degrees of freedom are again
must be made on the basis of consistency versus reversed in computing FU .
agreement and on the unit of reliability. But suppose that the intent is to average the
Assume that the data in Table 4 represent the k ¼ 4 ratings across raters. This requires the reli-
ratings of 15 couples (rows) by four randomly ability of that average rating. Then, define
selected raters. Of concern is whether raters can
use a new measurement scale to rank couples in MSrow  MSerror
ICCc;4 ¼ ;
the same order. First, assume that the ultimate unit MSrow
of measurement will be the individual rating. which is .902.
640 Intraclass Correlation

The F for ρ = 0 is still the F for rows, with FL The notation ICCA;1 represents a measure of
and FU defined as for the one-way. The confidence agreement for the reliability of individual ratings.
limits on ρ become With consistency, individual raters could use differ-
ent anchor points, and rater differences were not
1 1 involved in the computation. With agreement,
CIL ¼ 1  ¼1 ¼ :785;
FL 4:656 rater differences matter, which is why MSrater
appears in the denominator for ICCA;1 .
and
Using the results presented in Table 5 gives
1 1
CIU ¼ 1  ¼1 ¼ :963:
FU 27:280 MSrow  MSerror
ICCA;1 ¼ ;
It is important to think here about the implica- MSrow þ ðk  1ÞMSerror þ kðMSrater n MSerror Þ
tions of a measure of consistency. In this situation, 98:088  9:593
ICC(C,1) is reasonably high. It would be unchanged ¼ ;
98:088 þ ð4  1Þ9:593 þ 4ð91:61715 9:593Þ
(at .698) if a rater E was created by subtracting 15
points from rater D and then substituting rater E ¼ :595:
for rater D. If one were to take any of these judges
(or perhaps other judges chosen at random) to
a high school diving competition, their rankings The test on H0 : ρ = 0 is given by the F for rows
should more or less agree. Then, the winners would in the summary table and is again 10.255, which
be the same regardless of the judge. But suppose is significant on (n  1) and kðn  1) df. However
that one of these judges, either rater D or rater E, the calculation of confidence limits in this situation
was sent to that same high school but asked to rate is complex, and the reader is referred to McGraw
the reading ability of each student and make a judg- and Wong (1996) for the formulas involved.
ment of whether that school met state standards in If interest is, instead, the reliability of mean rat-
reading. Even though each judge would have ings, then
roughly the same ranking of children, using rater E
instead of rater D could make a major difference in
whether the school met state standards for reading. MSrow  MSerror
ICCA;4 ¼ ;
Consistency is not enough in this case, whereas it MSrow þ ðMSrater MS
n
error Þ

would have been enough in diving. 98:088  9:593


¼ ¼ :855:
98:088 þ ð91:6179:593
15
Þ

Two-Way Models for Agreement


The previous section was concerned not with An F test on the statistical significance of ICCA;k
the absolute level of ratings but only with their is given in Table 8 of McGraw and Wong, and cor-
relative ratings. Now, consider four graduate responding confidence limits on the estimate are
assistants who have been thoroughly trained on given in Table 7 of McGraw and Wong.
using a particular rating system. They will later In the discussion of consistency creating a rater
divide up the participants to be rated. It is E by subtracting 15 points from rater D and then
important that they have a high level of agree- replacing rater D with rater E led to no change
ment so that the rating of a specific couple in ICCC;1 . But with that same replacement, the
would be the same regardless of which assistant agreement ICCs would be ICCA;1 ¼ :181, and
conducted the rating. ICCA;4 ¼ :469. Clearly, the agreement measure is
When the concern is with agreement using sensitive to the mean rating of each rater.
a two-way model,
David C. Howell
MSrow  MSerror
ICCA;1 ¼ : See also Classical Test Theory; Coefficient Alpha;
MSrow þ ðk  1ÞMSerror þ kðMSraternMSerror Þ Correlation; Mixed Model Design; Reliability
Item Analysis 641

Further Readings alpha) rather than the regular coefficient alpha as


used for a one-dimensional test.
McGraw, K. O., & Wong, S. P. (1996). Forming
inferences about some intraclass correlation A test that is composed of items selected based
coefficients. Psychological Methods, 1, 30–46. on item analysis statistics tends to be more reliable
Nichols, D. (1998). Choosing an intraclass correlation than one composed of an equal number of unana-
coefficient. Retrieved June 16, 2009, from http:// lyzed items. Even though the process of item anal-
www.ats.ucla.edu/stat/spss/library/whichicc.htm ysis looks sophisticated, it is not as challenging as
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass it seems. Item analysis software programs are user
correlations: Uses in assessing reliability. friendly and make the task much simpler. The
Psychological Bulletin, 86, 420–428. most time-consuming part of item analysis might
be the process of tabulating the data to be ana-
lyzed by the software program. Statistical pro-
grams such as SPSS, an IBM product, and SAS can
ITEM ANALYSIS be used for item analysis as well. The information
needed to start an item analysis procedure after
Item analysis is the set of qualitative and quantita- the administration is the work of the examinees.
tive techniques and procedures used to evaluate In the subsequent sections, several aspects of the
the characteristics of items of the test before and item analysis process are discussed.
after the test development and construction. An
item is a basic building block of a test, and its
Norm-Referenced Versus
analysis provides information about its perfor-
mance. Item analysis allows selecting or omitting Criterion-Referenced Interpretations
items from the test, but more important, item anal- Norm-referenced and criterion-referenced interpre-
ysis is a tool to help the item writer improve an tations are two different types of score interpreta-
item. Anthony Nitko suggests that some classroom tions. Norm-referenced interpretations help to
uses of an item analysis might be estimating locate an examinee’s position within a well-defined
whether an item functions as intended, providing group, and criterion-referenced interpretations are
feedback to students about their performance, pro- used in describing a degree of proficiency in a speci-
viding feedback to the teacher about student diffi- fied content domain. Use of college admission tests
culties and ideas of curriculum improvement, is an example of norm-referenced interpretations,
revising assessment tasks, and improving item- and most tests and quizzes written by teachers are
writing skills. examples of criterion-referenced interpretations.
Item analysis can be used both for dichoto- A criterion might be the lesson curriculum or the
mously scored (correct or incorrect) items and state standards. Some tests are intended for both
polytomously scored (with more than two score types of interpretations; for example, some states
categories) items. The main purpose of item analy- use standards-based tests for both purposes.
sis is to improve internal consistency or internal With standards-based tests, the criterion-referenced
structure validity, focused on confirming a single- interpretation is intended to give information about
factor or one-trait test. If the trait is not one factor, how proficient the students are in the curriculum
then the use of item analysis might tend to lower defined by the state standards. The norm-
validity. If a test has two factors (or content divi- referenced interpretation provides a measure of
sions) or is multifactored (more than two content how each student compares with peers.
divisions), the calculation of item statistics for each
item (or option) should be focused on the subtotal
Norm-Referenced Interpretations
for the relevant set of items rather than on the
total test score. Item analysis in this case is used to Norm-referenced interpretations are involved
improve the internal consistency of each subset of with determining how an examinee’s score com-
items with no intention to change the dimensional- pares with others who have taken the same test.
ity of the entire set. In these cases, an overall reli- Item analysis techniques and procedures for norm-
ability index would be stratified alpha (or battery referenced interpretations are employed to refine
642 Item Analysis

and improve the test by identifying items that will examinees in the upper group get an item right
increase the ability of the test to discriminate and all examinees in the lower group fail. The
among the scores of those who take the test. In value of D equals 0 when the item is correctly
norm-referenced interpretations, the better an item answered by all, none, or any other same percent-
discriminates among examinees within the range age of examinees in both upper and lower groups.
of interest, the more information that item pro- D has a negative value when the percentage of stu-
vides. Discrimination among examinees is the dents answering the item correctly in the lower
most crucial characteristic desired in an item used group is greater than the percentage of correct
for a norm-referenced purpose. The discriminating responses in the upper group.
power is determined by the magnitude of the item A test with items having high D values pro-
discrimination index that will be discussed next. duces more spread in scores, therefore contribut-
ing to the discrimination in ability among
examinees. The item discrimination index is the
Discrimination Index
main factor that directly affects item selection in
The item discrimination index, which is usually norm-referenced tests. Ideally, items selected for
designated by the uppercase letter D (also net D, norm-referenced interpretations are considered
U-L, ULI, and ULD), shows the difference between good with D values above 30 and very good with
upper and lower scorers answering the item cor- values above 40. The reliability of the test will
rectly. It is the degree to which high scorers were be higher by selecting items that have higher
inclined to get each item right and low scorers item discrimination indices.
were inclined to get that item wrong. D is a mea- Warren Findley demonstrated that the index of
sure of the relationship between the item and the item discrimination is absolutely proportional to
total score, where the total score is used as a substi- the difference between the numbers of correct and
tute for a criterion of success on the measure being incorrect discriminations (bits of information) of
assessed by the test. For norm-referenced interpre- the item. Assuming 50 individuals in the upper
tations, there is usually no available criterion of and 50 individuals in the lower groups, the ideal
success because that criterion is what the test is item would distinguish each of 50 individuals in
supposed to represent. the U group from each in the L group. Thus,
Upper (U) and lower (L) groups can be decided 50 × 50 ¼ 2; 500 possible correct discriminations
by dividing the arranged descending scores into or bits of information. Considering an item on
three groups: upper, middle, and lower. When there which 45 individuals of the upper group but only
is a sufficient number of normally distributed scores, 20 individuals of the lower group answer the item
Truman Kelley demonstrated that using the top and correctly, the item would distinguish 45 individuals
bottom 27% (1.225 standard deviation units from who answered correctly from the upper group
the mean) of the scores as upper and lower groups from 30 individuals who answered incorrectly in
would be the best to provide a wide difference the lower group, generating a total of 45 × 30 ¼
between the groups and to have an adequate num- ; 350 correct discriminations. Consequently, 5 indi-
ber of scores in each group. When the total number viduals answering the item incorrectly in the upper
of scores is between 20 and 40, it is advised to select group are distinguished incorrectly from the 20
the top 10 and the bottom 10 scores. When the individuals who answered correctly in the lower
number of scores is less than or equal to 20, two group, generating 5 × 20 ¼ 100 incorrect discrimi-
groups are used without any middle group. nations. The net amount of effective discrimina-
One way of computing the item discrimination tions of the item is 1; 350  100 ¼ 1; 250
index is finding the difference of percentages of discriminations, which is 50% of the 2,500 total
correct responses of U and L groups by computing maximum possible correct discriminations. Note
Up  Lp (U percentage minus L percentage). Typi- that the difference between the number of exami-
cally, this percentage difference is multiplied by nees who answer the item correctly in the upper
100 to remove the decimals; the result is D, which group and in the lower group is 45  20 ¼ 25,
yields values with the range of þ 100 and  100. which is half of the maximum possible difference
The maximum value of D ( þ 100) occurs when all with 50 in each group.
Item Analysis 643

Another statistic used to estimate the item dis- difference in abilities if it has a difficulty index of
crimination is the point-biserial correlation coeffi- 0, which means none of the examinees answer cor-
cient. The point-biserial correlation is obtained rectly, or 100, which means all of the examinees
when the Pearson product-moment correlation is answer correctly. Thus, items selected for norm-
computed from a set of paired values where one of referenced interpretations usually do not have dif-
the variables is dichotomous and the other is con- ficulty indices near the extreme values of 0 or 100.
tinuous. Correlation of passing or failing an indi-
vidual item with the overall test scores is the Criterion-Referenced Interpretations
common example for the point-biserial correlation
in item analysis. The Pearson correlation can be Criterion-referenced interpretations help to
calculated using SPSS or SAS. One advantage cited interpret the scores in terms of specified perfor-
for using the point-biserial as a measure of the mance standards. In psychometric terms, these
relationship between the item and the total score is interpretations are concerned with absolute rather
that it uses all the scores in the data rather than than relative measurement. The term absolute is
just the scores of selected groups of students. used to indicate an interest in assessing whether
a student has a certain performance level, whereas
relative indicates how a student compares with
Difficulty Index other students. For criterion-referenced purposes,
The item difficulty index, which is denoted as p; there is little interest in a student’s relative standing
is the proportion of the number of examinees within a group.
answering the item correctly to the total number of Item statistics such as item discrimination and
examinees. The item difficulty index ranges from item difficulty as defined previously are not used in
0 to 1. Item difficulty also can be presented as the same way for criterion-referenced interpreta-
a whole number by multiplying the resulting deci- tions. Although validity is an important consider-
mal by 100. The difficulty index shows the percent- ation in all test construction, the content validity
age of examinees who answer the item correctly, of the items in tests used for criterion-referenced
although the easier item will have a greater value. interpretations is essential. Because in many
Because the difficulty index increases as the diffi- instances of criterion-referenced testing the exami-
culty of an item decreases, some have suggested that nees are expected to succeed, the bell curve is gen-
it be called the easiness index. Robert Ebel suggested erally negatively skewed. However, the test results
computing item difficulty by finding the proportion vary greatly depending on the amount of instruc-
of examinees that answered the item incorrectly to tion the examinees have had on the content being
the total number of examinees. In that case, the tested. Item analysis must focus on group differ-
lower percentage would mean an easier item. ences and might be more helpful in identifying
It is sometimes desirable to select items with problems with instruction and learning rather than
a moderate spread of difficulty and with an aver- guiding the item selection process.
age difficulty index near 50. The difficulty index of
Discrimination Index
50 means that half the examinees answer an item
correctly and other half answer incorrectly. How- A discrimination index for criterion-referenced
ever, one needs to consider the purpose of the test interpretations is usually based on a different crite-
when selecting the appropriate difficulty level of rion than the total test score. Rather than using an
items. The desired average item difficulty index item–total score relationship as in norm-referenced
might be increased with multiple-choice items, analysis, the item–criterion relationship is more rel-
where an examinee has a chance to guess. The evant for a criterion-referenced analysis. Thus, the
item difficulty index is useful when arranging the upper and lower groups for a discrimination index
items in the test. Usually, items are arranged from should be selected based on their performance on
the easiest to the most difficult for the benefit of the criterion for the standards or curriculum of
examinees. interest. A group of students who have mastered
In norm-referenced interpretations, an item the skills of interest could comprise the upper
does not contribute any information regarding the group, whereas those who have not yet learned
644 Item Analysis

those skills could be in the lower group. Or a group norm-referenced interpretations, because it must
of instructed students could comprise the upper be known for what group the index was obtained.
group, with those who have had no instruction on The difficulty index of zero would mean that none
the content of interest comprising the lower group. of the examinees answered the item correctly,
In either of these examples, the D index would rep- which is informative regarding the ‘‘nonmastered’’
resent a useful measure to help discriminate the content and could be expected for the group that
masters versus the nonmasters of the topic. was not instructed on this content. The difficulty
After adequate instruction, a test on specified index of 100 would mean that all the examinees
content might result in very little score variation answered an item correctly, which would confirm
with many high scores. All of the examinees might the mastery of content by examinees who had
even get perfect scores, which will result in a dis- been well taught. Items with difficulties of 0 or
crimination index of zero. But all these students 100 would be rejected for norm-referenced pur-
would be in the upper group and not negatively poses because they do not contribute any informa-
impact the discrimination index if calculated tion about the examinee’s relative standing. Thus,
properly. the purpose of the test is important in using item
Similarly, before instruction, a group might all indices.
score very low, again with very little variation in
the scores. These students would be the lower
Effectiveness of Distractors
group in the discrimination index calculation.
Within each group, there might be no variation at One common procedure during item analysis is
all, but between the groups, there will be evidence determining the performance of the distractors
of the relationship between instruction and suc- (incorrect options) in multiple-choice items. Dis-
cess. Thus, item discrimination can be useful in tractors are expected to enhance the measurement
showing which items measure the relevance to properties of the item by being acceptable options
instruction, but only if the correct index is used. for the examinees with incomplete knowledge of
As opposed to norm-referenced interpretations, the content assessed by the item. The discrimina-
a large variance in performance of an instructed tion index is desired to be negative for distractors
group in criterion-referenced interpretation would and intended to be positive for the correct options.
probably indicate an instructional flaw or a learn- Distractors that are positively correlated with the
ing problem on the content being tested by the test total are jeopardizing the reliability of the test;
item. But variance between the instructed group therefore, they should be replaced by more appro-
and the not-instructed group is desired to demon- priate ones. Also, there is percent marked, (percent-
strate the relevance of the instruction and the sen- age upper þ percentage lower)/2, for each option
sitivity to the test for detecting instructional as a measure of ‘‘attractiveness.’’ If too few exami-
success. nees select an option, its inclusion in the test might
not be contributing to good measurement unless
the item is the one indicating mastery of the sub-
Difficulty Indexes
ject. Also, if a distractor is relatively more preferred
The difficulty index of the item in criterion- among the upper group examinees, there might be
referenced interpretations might be a relatively two possible correct answers.
more useful statistic than discrimination index in A successful distractor is the one that is attrac-
identifying what concepts were difficult to master tive to the members of the low-scoring group and
by students. In criterion-referenced testing, items not attractive to the members of the high-scoring
should probably have an average difficulty index group. When constructing a distractor, one should
around 80 or 90 within instructed groups. It is try to find the misconceptions related to the con-
important to consider the level of instruction cept being tested. Some methods of obtaining
a group has had before interpreting the difficulty acceptable options might be use of context termi-
index. nology, use of true statements for different
In criterion-referenced interpretations, a diffi- arguments, and inclusion of options of similar dif-
culty index of 0 or 100 is not as meaningless as in ficulty and complexity.
Item Response Theory 645

Differential Item Functioning See also Differential Item Functioning; Internal


Consistency Reliability; Item Response Theory;
The difficulty index should not be confused with Item-Test Correlation; Pearson Product-Moment
differential item functioning (DIF). DIF analysis Correlation Coefficient; Reliability; SAS; SPSS
investigates every item in a test for the signs of inter-
actions with sample characteristics. Differential item Further Readings
functioning occurs when people from the different
groups (race or gender) with the same ability have Anastasi, A., & Urbina, S. (1997). Psychological testing
different probabilities of giving a correct response (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Lord, F. M. (1980). Applications of item response theory
on an item. An item displays DIF when the item
to practical testing problems. Hillsdale, NJ: Lawrence
parameters such as estimated item difficulty or item Erlbaum.
discrimination index differ across the groups. Nitko, A. J. (2004). Educational assessment of students
Because all distractors are incorrect options, the (4th ed.). Upper Saddle River, NJ: Pearson.
difference among the groups in distractor choice
has no effect on the test score. However, group dif-
ference in the distractor choice might indicate that
the item functions differently for the different sub- ITEM RESPONSE THEORY
groups. Analysis of differential distractor function-
ing (DDF) examines only incorrect responses. Item response theory (IRT) is a mental measure-
Although DIF and DDF are not usually considered ment theory based on the postulate that an indivi-
under item analysis, they are essential investiga- dual’s response to a test item is a probabilistic
tions for examining validity of tests. function of characteristics of the person and char-
acteristics of the item. The person characteristics
are the individual’s levels of the traits being mea-
Rasch Model sured, and the item characteristics are features
The item analysis considered to this point has been such as difficulty and discriminating power. Item
focused on what is called classical test theory. response theory has several advantages over classic
Another approach to test development is called test theory and has the potential to solve several
item response theory (IRT). There are several mod- difficult measurement problems. The foundations
els of IRT that are beyond the scope of this writing. of item response theory were developed in the
The one-parameter model, called the Rasch model, early 20th century; however, it was Frederic Lord,
deserves mention as an introduction to IRT. beginning in the 1950s, who organized and devel-
Item analysis in the Rasch model is based on oped the theory into a framework that could be
the item characteristic curve (ICC) of the one- applied to practical testing problems. Advances in
parameter logistic model. The ICC is a nonlinear computing were necessary to make the theory
regression of the probability of correct response accessible to researchers and practitioners. Item
for the dichotomous item (with the range of 0 to response theory is now widely used in educational
1) on the ability (trait or skill) to be measured contexts by testing companies, public school sys-
(with the range of ∞ to þ∞Þ. The placement of tems, the military, and certification and licensure
the curve is related to the difficulty index, and the boards, and is becoming more widely used in other
slope of the curve is related to the D index dis- contexts such as psychological measurement and
cussed previously. According to the Rasch model, medicine. This entry discusses item response mod-
items having same discrimination indices but dif- els and their characteristics, estimation of param-
ferent difficulty indices should be selected for the eters and goodness of fit of the models, and testing
test. The Rasch model is usually used for develop- applications.
ing tests intended for norm-referenced interpreta-
tions or for tests like standards-based tests that are Item Response Models
intended for both types of interpretations.
Item response theory encompasses a wide range of
Perman Gochyyev and Darrell Sabers models depending on the nature of the item score,
646 Item Response Theory

the number of dimensions assumed to underlie item parameters and classic item indices, an inte-
performance, the number of item characteristics gral must be calculated to obtain the probability
assumed to influence responses, and the mathemat- of a correct response. Allen Birnbaum proposed
ical form of the model relating the person and item a more mathematically tractable cumulative logis-
characteristics to the observed response. The item tic function. With an appropriate scaling factor,
score might be dichotomous (correct/incorrect), the normal ogive and logistic functions differ by
polytomous as in multiple-choice response or less than .05 over the entire trait continuum. The
graded performance scoring, or continuous as in logistic model has become widely accepted as the
a measured response. Dichotomous models have basic item response model for dichotomous and
been the most widely used models in educational polytomous responses.
contexts because of their suitability for multiple The unidimensional three-parameter logistic
choice tests. Polytomous models are becoming model for dichotomous responses is given by
more established as performance assessment
becomes more common in education. Polyto-
e1:7aj ðθbj Þ
mous and continuous response models are Pðuj ¼ 1|θÞ ¼ cj þ ð1  cj Þ ;
appropriate for personality or affective measure- 1 þ e1:7aj ðθbj Þ
ment. Continuous response models are not well
known and are not discussed here. where uj is the individual’s response to item j;
The models that are currently used most widely scored 1 for correct and 0 for incorrect, θ is the
assume that there is a single trait or dimension individual’s value on the trait being measured,
underlying performance; these are referred to as Pðuj ¼ 1|θÞ is the probability of a correct response
unidimensional models. Multidimensional models, to item j given θ, cj is the lower asymptote parame-
although well-developed theoretically, have not ter, aj is the item discrimination parameter, bj is
been widely applied. Whereas the underlying the item difficulty parameter, and 1.7 is the scaling
dimension is often referred to as ‘‘ability,’’ there is factor required to scale the logistic function to the
no assumption that the characteristic is inherent or normal ogive. The curve produced by the model is
unchangeable. called the item characteristic curve (ICC). ICCs for
Models for dichotomous responses incorporate several items with differing values of the item para-
one, two, or three parameters related to item char- meters are shown in Figure 1. The two-parameter
acteristics. The simplest model, which is the one- model is obtained by omitting the lower asymp-
parameter model, is based on the assumption that tote parameter, and the one-parameter model is
the only item characteristic influencing an indivi- obtained by subsequently omitting the discrimi-
dual’s response is the difficulty of the item. A nation parameter. The two-parameter model
model known as the Rasch model has the same assumes that low-performing individuals have
form as the one-parameter model but is based on no chance of answering the item correctly
different measurement principles. The Rasch the- through guessing, whereas the one-parameter
ory of measurement was popularized in the United model assumes that all items are equally
States by Benjamin Wright. The two-parameter discriminating.
model adds a parameter for item discrimination, The lower asymptote parameter is bounded by
reflecting the extent to which the item discrimi- 0 and 1 and is usually less than .3 in practice. The
nates among individuals with differing levels of discrimination parameter is proportional to
the trait. The three-parameter model adds a lower the slope of the curve at its point of inflection; the
asymptote or pseudo-guessing parameter, which steeper the slope, the greater the difference in
gives the probability of a correct response for an probability of correct response for individuals of
individual with an infinitely low level of the trait. different trait levels, hence, the more discriminat-
The earliest IRT models used a normal ogive ing the item. Discrimination parameters must be
function to relate the probability of a correct positive for valid measurement. Under the one-
response to the person and item characteristics. and two-parameter models, the difficulty param-
Although the normal ogive model is intuitively eter represents the point on the trait continuum
appealing and provides a connection between IRT where the probability of a correct response is 0.5;
Item Response Theory 647

1.0

0.9
Probability of Correct Response

0.8
a =1.0, b = −1.0, c = 0.2
0.7

0.6

0.5 a = 0.5, b = 0.0, c = 0.1


0.4

0.3

0.2 a = 1.5, b = 1.0, c = 0.3


0.1

0.0
−3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0

Trait Value

Figure 1 Item Characteristic Curves for Three Items Under the Three-Parameter Logistic Item Response Theory
Model

under the three-parameter model, the probability well known are the graded response model
of a correct response at θ ¼ bj is (1 + cj Þ/2. (GRM), the partial credit model (PCM), the gener-
Note the indeterminacy in the model previously: alized partial credit model (GPCM), and the rating
The model does not specify a scale for a; b; and θ. scale model (RSM). Only the GRM and the
A linear transformation of parameters will pro- GPCM are described here.
duce the same probability of a correct response, The GRM is obtained by formulating two-
that is, if θ * ¼ Aθ þ B, b * ¼ Ab þ B, and parameter dichotomous models for the probability
a * ¼ a=A, then a * ðθ *  b * Þ ¼ aðθ  bÞ and that an examinee will score in each response cate-
Pðθ * Þ ¼ PðθÞ. The scale for parameter estimates is gory or higher (as opposed to a lower category),
typically fixed by standardizing on either the θ then subtracting probabilities to obtain the proba-
values or the b values. With this scaling, the θ and bility of scoring within each response category,
b parameter estimates generally fall in the range that is,
( 3, 3) and the a parameter estimates are gener-
ally between 0 and 2. eaj ðθbjk Þ
Pðuj ≥ k|θÞ ¼ Pk* ðθÞ ¼ ;
There are several item response models for 1 þ eaj ðθbjk Þ
polytomous responses. When there is no assump- k ¼ 1; . . . ; m  1; P0* ðθÞ ¼ 1;
tion that the response categories are on an ordered
scale, the nominal response model might be used Pm* ðθÞ ¼ 0;
to model the probability that an individual will Pðuj ¼ kÞ ¼ Pk* ðθÞ  Pkþ1
*
ðθÞ:
score in a particular response category. The nomi-
nal response model is not widely used because Here, responses are scored 0 through m – 1, where
polytomously scored responses are generally m is the number of response categories, k is the
ordered in practice, as, for example, in essay or response category of interest, aj is the discrimina-
partial credit scoring. There are several models tion parameter, interpreted as in dichotomous
for ordered polytomous responses: The most models, and bjk is the category parameter. The
648 Item Response Theory

1.0

a = 1.0, b1 = −1, b2 = 0, b3 = 1 0.9

0.8
Probability of Response

0.7
u=0 u =1 u=2 u =3
0.6

0.5

0.4

0.3

0.2

0.1

0.0
−3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

Trait Value

Figure 2 Item Response Category Characteristic Curves for an Item Under the GPCM

category parameters represent the level of the trait not be ordered under this model. The GPCM is
needed to have a 50% chance of scoring in that a generalization of the PCM, which assumes equal
category or higher. Category parameters are neces- discriminations across items and omits the discrim-
sarily ordered on the trait scale. It is assumed that ination parameter in the model. An example of
the item is equally discriminating across category IRCCs for a polytomous item under the GPCM is
boundaries. The model provides a separate item shown in Figure 2.
response function for each response category; the Item response theory has several advantages
resultant curves are called item response category over classic test theory in measurement applica-
characteristic curves (IRCCs). tions. First, IRT item parameters are invariant
The GPCM differs from the GRM in that it is across subpopulations, whereas classic item indices
based on a comparison of adjacent categories. The change with the performance level and heterogene-
model is given by ity of the group taking the test. Person parameters
are invariant across subsets of test items measuring
P
k
the same dimension; whether the test is easy or
aj ðθbjv Þ
e v¼1 hard, an individual’s trait value remains the same.
Pðuj ¼ k|θÞ ¼ Pjk ðθÞ; ¼ P
c ; This is not the case with total test score, which
P aj
m1 ðθbjv Þ
depends on the difficulty of the test. The invari-
1þ e v¼1
c¼1 ance property is the most powerful feature of item
1 response models and provides a solid theoretical
k ¼ 1; . . . ; m  1; Pj0 ðθÞ ¼ P
c : base for applications such as test construction,
P aj
m1 ðθbjv Þ
equating, and adaptive testing. Note that invari-
1þ e v¼1
c¼1 ance is a property of the parameters and holds
only in the population; estimates will vary across
In this model, the category parameter bjk repre- samples of persons or items.
sents the trait value at which an individual has an A second advantage of IRT is individualized
equal probability of scoring in category k versus standard errors of measurement, rather than
category (k – 1). The category parameters need a group-based measure such as is calculated in
Item Response Theory 649

classic test theory. Another advantage is that item Goodness of Fit


response theory is formulated at the item level and
gives a basis for prediction of an individual’s or Item response theory is based on strong assump-
group’s performance when presented with new test tions: that the assumed dimensionality is correct
items; this is useful in constructing tests for specific and that the mathematical model correctly speci-
populations or purposes. Classic test theory is fies the relationship between the item and person
based on total test score and offers no basis for parameters and response to the item. The advan-
prediction of performance. tages of IRT are obtained only when these assump-
tions are met, that is, when the model fits the data.
Applications of item response theory must begin
Estimation of Parameters
with an assessment of the fit of the model. Assess-
Estimation of the parameters of an item response ment of goodness of fit requires checking the
model is a computer-intensive task generally assumptions of the model, the expected features of
requiring large samples. When item parameters are the model, and predictions based on the model.
known and only person parameters must be esti- The primary assumption of the models in com-
mated, or vice versa, the procedure is relatively mon use is that of unidimensionality. A variety of
simple via maximum likelihood estimation. Maxi- methods has been proposed for assessing the
mum likelihood estimation involves computing the dimensionality of a mental measurement scale.
likelihood of the observed data as a function of the Linear factor analysis methods are most commonly
unknown parameters, based on the model to be fit- used, although these fail to take into account the
ted, and then determining the parameter values nonlinearity of the relationship between the trait
that maximize the likelihood. and the observed response. Nonlinear factor analy-
When both item parameters and person param- sis procedures are more appropriate but less widely
eters are unknown, estimation is considerably used. The computer programs NOHARM and
more difficult. Large samples (on the order of TESTFACT fit multidimensional nonlinear factor
a thousand or more) are required to obtain ade- models to item response data. TESTFACT pro-
quate estimates under the three-parameter model. vides a chi-square test of fit of the model. William
The most widely used commercially available com- Stout developed a test of ‘‘essential unidimension-
puter programs for IRT parameter estimation ality’’ based on the principle of local independence,
(BILOG-MG, PARSCALE, and MULTILOG) which states that after conditioning on all traits
employ marginal maximum likelihood procedures underlying performance, item responses are statis-
with the Expectation-Maximization (EM) algo- tically independent of each other.
rithm. Under marginal maximum likelihood esti- The primary expected feature of an IRT model
mation, item parameters are estimated assuming is parameter invariance. Checking for item param-
the population distribution of person parameters is eter invariance is done by comparing item param-
known. The EM algorithm is used to solve the eter estimates obtained in different subgroups
marginal likelihood equations obtained for the of examinees; estimates should differ by no more
item parameters. These equations involve quanti- than sampling error. Likewise, ability estimates
ties that are unknown from the data; the expected based on different subsets of items should also be
values of these quantities are substituted (E-step) similar.
and maximum likelihood estimation (M-step) is Checking predictions based on the model can
then performed. Prior distributions are often be done at the test level or item level. At the test
specified for the item parameters to facilitate esti- level, observed score distributions can be com-
mation. Person parameters are subsequently esti- pared with predicted score distributions based on
mated holding the item parameters fixed at their the fitted model. At the item level, differences
estimated values. A normal prior distribution is between observed and expected proportions of
specified for θ and the mean of the posterior distri- examinees in each response category within sub-
bution for each individual is taken as the estimate groups based on trait estimates or expected score
of θ. This estimate is referred to as the expectation can be examined graphically or by means of chi-
a posteriori (EAP) estimate. square fit statistics.
650 Item Response Theory

Applications membership is an additional dimension influencing


performance, a violation of model assumptions.
Item response theory provides a natural frame- In this case, item parameter invariance is not
work for many testing applications, including test obtained across subgroups and the ICCs for the
construction. Test and item information are impor- subgroups differ. Differences between ICCs can be
tant concepts for this purpose. The information quantified by calculating the area between the
provided by the test about an individual with curves or by computing a chi-square statistic for
a given value of θ is inversely related to the stan- testing the equality of parameters.
dard error of estimate of θ, that is, The invariance property of IRT also provides
a basis for adaptive testing. In adaptive testing,
1 individuals are administered items one by one
IðθÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffi :
SEðθÞ from a large precalibrated, equated pool and their
trait estimates are updated after each item based
Test information is a sum over items; hence, on their response. Each subsequent item is selected
each item contributes independently to the infor- to provide maximum information at the indivi-
mation provided by the test. This additive property dual’s current estimated trait value. Testing can be
allows the selection of items to create a test that terminated when the standard error of the θ esti-
measures with a desired degree of precision in any mate falls below a preset criterion. Although each
part of the trait continuum. Items provide their person has taken a different test, trait estimates are
maximum information at a trait value equal to the comparable. Adaptive testing has several advan-
difficulty of the item. More discriminating items tages, among which are equal measurement preci-
provide more information, and for dichotomous sion for all examinees, shorter tests, immediate
items, lower c parameters increase information. score reporting, and potentially greater test
Equating of test forms is another area in which security.
IRT provides an elegant solution. Because person Item response theory provides a powerful
parameters are invariant across subsets of items, framework for mental measurement. When the
the scores of individuals who have taken different assumptions of the model are met, IRT can pro-
forms of a test can be validly compared. However, vide elegant solutions to many measurement
because of the indeterminacy problem, it is neces- problems.
sary to ensure that the estimates are on a common
scale. This step is most often accomplished by H. Jane Rogers
including a set of common items in the two forms
See also Classical Test Theory; Computerized Adaptive
to be equated, and determining the linear transfor-
Testing; Differential Item Functioning; Psychometrics
mation that would equalize the means and stan-
dard deviations of the common b values or
equalize the test characteristic curves. This trans-
formation is applied to the b values and a corre- Further Readings
sponding transformation to the a values of all Andrich D. (1988). Rasch models for measurement.
items in the form to be equated to the base form. Newbury Park, CA: Sage.
The same transformation applied to the b values is Baker, F. B., & Kim, S. (2004). Item response theory:
applied to the trait estimates of the group taking Parameter estimation techniques (2nd ed.). New York:
the form to be equated to place them on the same Marcel-Dekker.
scale as that in the reference group. Du Toit, M. (2003). IRT from SSI: BILOG MG,
Another area in which IRT is useful is in the MULTILOG,PARSCALE, TESTFACT. Lincolnwood,
IL: Scientific Software International.
detection of differential item functioning (DIF) for
Embretson, S. E., & Reise, S. P. (2000). Item response
ethnic, gender, and other demographic subgroups. theory for psychologists. Mahwah, NJ: Lawrence
An item shows differential functioning if indivi- Erlbaum.
duals at the same trait value do not have the Hambleton, R. K., & Swaminathan H. (1985). Item
same probability of scoring in a given response response theory: Principles and applications. Boston:
category. When this occurs, it indicates that group Kluwer-Nijohoff.
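The item selection and stopping rule just described can be illustrated with a short sketch. The code below is not part of the original entry: it assumes a three-parameter logistic (3PL) item pool, and the function names, parameter values, and pool size are hypothetical.

import numpy as np

def p_3pl(theta, a, b, c):
    # Probability of a correct response under the 3PL model.
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    # Fisher information of a 3PL item at trait value theta.
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# A small, hypothetical precalibrated item pool (a, b, c for each item).
rng = np.random.default_rng(1)
a = rng.uniform(0.8, 2.0, size=30)    # discrimination
b = rng.uniform(-2.0, 2.0, size=30)   # difficulty
c = rng.uniform(0.0, 0.25, size=30)   # pseudo-guessing; lower c gives more information

theta_hat = 0.0      # current provisional trait estimate
administered = []    # indices of items already given

# Select the unused item providing maximum information at the current estimate.
info = item_information(theta_hat, a, b, c)
info[np.asarray(administered, dtype=int)] = -np.inf
next_item = int(np.argmax(info))
administered.append(next_item)

# Test information is additive over items; SE(theta) = 1 / sqrt(I(theta)).
idx = np.asarray(administered, dtype=int)
se_theta = 1.0 / np.sqrt(item_information(theta_hat, a[idx], b[idx], c[idx]).sum())
print(next_item, se_theta)   # testing stops once se_theta falls below a preset criterion

In a full adaptive test, the trait estimate would be updated after each response and the selection step repeated until the standard error criterion is met.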

Hambleton R. K., Swaminathan, H., & Rogers H. J. type of reliability. The item-test correlation is one
(1991). Fundamentals of item response theory. of many item discrimination indices used in item
Newbury Park, CA: Sage. analysis.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item Because item responses are typically scored as
response theory and health outcomes measurement in
zero when incorrect and unity (one) if correct,
the 21st century. Medical Care, 38, II28–II42.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item
the item variable is binary or dichotomous
response theory: Application to psychological (having two values). The resulting correlation is
measurement. Homewood, IL: Dow Jones-Irwin. properly called a point-biserial coefficient when
Kolen, M. J., & Brennan, R. L. (2004). Test equating, a binary item is correlated with a total score that
scaling, and linking: Methods and practices (2nd ed.). has more than two values (called polytomous or
New York: Springer-Verlag. continuous). However, some items, especially
Lord, F. M. (1980). Applications of item response theory essay items, performance assessments, or those
to practical testing problems. Hillsdale, NJ: Lawrence for inclusion in affective scales, are not usually
Erlbaum.
dichotomous, and thus some item-test correla-
Lord, F. M., & Novick, M. R. (1968). Statistical theories
tions are regular Pearson coefficients between
of mental test scores. Reading, MA: Addison-Wesley.
Ostini, R., & Nering, M. L. (2006). Polytomous item polytomous items and total scores. The magni-
response theory models. Thousand Oaks, CA: Sage. tude of correlations found when using polyto-
Rasch, G. (1960). Probabilistic models for some mous items is usually greater than that observed
intelligence and attainment tests. Chicago: University for dichotomous items. Reliability is related to
of Chicago Press. the magnitude of the correlations and to the
Van der Linden, W. J., & Glas, C. A.W. (2000). number of items in a test, and thus with polyto-
Computerized adaptive testing: Theory and practice. mous items, a lesser number of items is usually
Boston: Kluwer. sufficient to produce a given level of reliability.
Wainer, H. (Ed.). (2000). Computerized adaptive testing:
Similarly, to the extent that the average of
A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
the item-test correlations for a set of items is
Wright B. D., & Stone M. H. (1979). Best test design:
Rasch measurement. Chicago: MESA Press. increased, the number of items needed for a reli-
able test is reduced.
All correlations tend to be higher in groups that
have a wide range of talent than in groups where
there is a more restricted range. In that respect, the
ITEM-TEST CORRELATION item-test correlation presents information about
the group as well as about the item and the test.
The item-test correlation is the Pearson correlation The range of talent in the group might be limited
coefficient calculated for pairs of scores where one in some samples, for example, in a group of stu-
item of each pair is an item score and the other dents who have all passed prerequisites for an
item is the total test score. The greater the value advanced class. In groups where a restriction of
of the coefficient, the stronger is the correlation range exists, the item-test correlations will provide
between the item and the total test. Test developers a lower estimate of the relationship between the
strive to select items for a test that have a high cor- item and the test.
relation with the total score to ensure that the test When the range of talent in the group being
is internally consistent. Because the item-test corre- tested is not restricted, the item-test correlation is
lation is often used to support the contention that a spurious measure of item quality. The spurious-
the item is a ‘‘good’’ contributor to what the test ness arises from the inclusion of the particular item
measures, it has sometimes been called an index of in the total test score, resulting in the correlation
item validity. That term applies only to a type of between an item and itself being added to the cor-
evidence called internal structure validity, which is relation between the item and the rest of the total
synonymous with internal consistency reliability. test score. A preferred concept might be the item–
Because the item-test correlation is clearly an index rest correlation, which is the correlation between
of internal consistency, it should be considered as the item and the sum of the rest of the item scores.
a measure of item functioning associated with that Another term for this item-rest correlation is the

corrected item-test correlation, the name given to item–subtotal correlations, the researcher might
this type of index in the SPSS Scale Reliability develop (or examine) a measure that was inter-
analysis (SPSS, an IBM company). nally consistent for each subset of items. To get
The intended use of the test is an important an overall measure of reliability for such a multi-
factor in interpreting the magnitude of an item- trait test, a stratified alpha coefficient would be
test correlation. If the intention is to develop the desired reliability estimate rather than the
a test with high criterion-related test validity, regular coefficient alpha as reported by SPSS
one might seek items that have high correlations Scale Reliability. In the case of multitrait tests,
with the external criterion but relatively lower using a regular item–total correlation for item
correlations with the total score. Such items pre- analysis would likely present a problem because
sumably measure aspects of the criterion that are the subset with the most items could contribute
not adequately covered by the rest of the test too much to the total test score. This heavier
and could be preferred to items correlating concentration of similar items in the total score
highly with both the criterion and the test score. would result in the items in the other subsets
However, unless item-test correlations are sub- having lower item-test correlations and might
stantial, the tests composed of items with high lead to their rejection.
item-criterion correlations might be too hetero- In analyzing multiple-choice items to deter-
geneous in content to provide meaningful inter- mine how each option might contribute to the
pretations of the test scores. Thus, the use of reliability of the total test, one can adapt the
item-test correlations to select items for a test is item-test correlation to become an option-test
based on the goal to establish internal consis- correlation. In this approach, each option is
tency reliability rather than direct improvement examined to determine how it correlates with
in criterion validity. the total test (or subtest). If an option is expected
The item-test correlation resembles the load- to be a distractor (wrong response), the option–
ing of the item on the first principal component test correlation should be negative (this assumes
or the unrotated first component of an analysis the option is scored 1 if selected, 0 if not
of all the items in a test. The concepts are related selected). Distractors that are positively corre-
but the results are not identical in these different lated with the test or subtest total are detracting
approaches to representing the ‘‘loading’’ or from the reliability of the measure; these options
‘‘impact’’ of an item. However, principal compo- should probably be revised or eliminated.
nents analysis might help the reader to under- The item-test correlation was first associated
stand the rationale of measuring the relationship with the test analysis based on classic test theory.
between an item and some related measure. Typ- Another approach to test analysis is called item
ically, using any of these methods, the researcher response theory (IRT). With IRT, the relation-
wants the item to represent the same trait as the ship between an item and the trait measured by
total test, component, or factor of interest. This the total set of items is usually represented by an
description is limited to a single-factor test or item characteristic curve, which is a nonlinear
a single-component measure just as the typical regression of the item on the measure of ability
internal consistency reliability is reported for representing the total test. An index called the
a homogeneous test measuring a single trait. If point-biserial correlation is sometimes computed
the trait being measured is not a unitary trait, in an IRT item analysis statistical program, but
then other approaches are suggested because the that correlation might not be exactly the same as
regular item–total correlation will underestimate the item-test correlation. The IRT index might
the value of an item. be called an item-trait or an item-theta correla-
If a researcher has a variable that is expected tion, because the item is correlated with a mea-
to measure more than one trait, then the item– sure of the ability estimated differently than
total correlation can be obtained separately for based on the total score. Although this technical
each subset of items. In this approach, each difference might exist, there is little substantial
total represents the subtotal score for items of difference between the item-trait and the item-
a subset rather than a total test score. Using test correlations. The explanations in this entry
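The option-test correlation described in this entry can be sketched as follows. The example is illustrative only: the keyed option, the simulated response process, and the stand-in total score are hypothetical rather than taken from any real item analysis.

import numpy as np

rng = np.random.default_rng(2)
n_persons = 300
ability = rng.normal(size=n_persons)

# Hypothetical multiple-choice item: option B is keyed correct; A, C, and D are distractors.
p_correct = 1.0 / (1.0 + np.exp(-ability))
chosen = np.where(rng.random(n_persons) < p_correct, "B",
                  rng.choice(["A", "C", "D"], size=n_persons))

# Stand-in for the total test (or subtest) score.
total = 50.0 + 5.0 * ability + rng.normal(scale=3.0, size=n_persons)

for option in ["A", "B", "C", "D"]:
    selected = (chosen == option).astype(int)   # option scored 1 if selected, 0 if not
    r = np.corrcoef(selected, total)[0, 1]
    print(option, round(r, 3))   # distractors should show negative option-test correlations

Distractors whose option-test correlations come out positive would be candidates for revision or elimination, as noted above.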

are intended to present information to help the Further Readings


user of either index.
Anastasi, A., & Urbina, S. (1997). Psychological testing
(7th ed.). Upper Saddle River, NJ: Prentice Hall.
Darrell Sabers and Perman Gochyyev Lord, F. M. (1980). Applications of item response theory
to practical testing problems. Hillsdale, NJ: Lawrence
See also Classical Test Theory; Coefficient Alpha; Internal Erlbaum.
Consistency Reliability; Item Analysis; Item Response
Theory; Pearson Product-Moment Correlation
Coefficient; Principal Components Analysis
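As a brief illustration of the indices discussed in this entry, the following sketch computes the item-test correlation and the corrected item-rest correlation for dichotomously scored items. The 0/1 response matrix is simulated and all names are illustrative; this is not code from the entry.

import numpy as np

rng = np.random.default_rng(0)
# Simulated 0/1 responses: 200 examinees by 20 items.
ability = rng.normal(size=(200, 1))
responses = (ability + rng.normal(size=(200, 20)) > 0).astype(int)

total = responses.sum(axis=1)
for i in range(responses.shape[1]):
    item = responses[:, i]
    # Item-test correlation: Pearson r between the item and the total score
    # (a point-biserial coefficient, since the item is dichotomous).
    r_item_test = np.corrcoef(item, total)[0, 1]
    # Corrected item-rest correlation: the item is removed from the total,
    # so the item is not correlated with itself.
    r_item_rest = np.corrcoef(item, total - item)[0, 1]
    print(i + 1, round(r_item_test, 3), round(r_item_rest, 3))

The corrected values are systematically lower because the spurious item-with-itself component has been removed.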
J
estimate (because the bias is eliminated by the sub-
JACKKNIFE traction between the two estimates). The pseudo-
values are then used in lieu of the original values
to estimate the parameter of interest, and their
The jackknife or ‘‘leave one out’’ procedure is standard deviation is used to estimate the parame-
a cross-validation technique first developed by ter standard error, which can then be used for null
M. H. Quenouille to estimate the bias of an esti- hypothesis testing and for computing confidence
mator. John Tukey then expanded the use of the intervals. The jackknife is strongly related to the
jackknife to include variance estimation and tai- bootstrap (i.e., the jackknife is often a linear
lored the name of jackknife because like a jack- approximation of the bootstrap), which is cur-
knife—a pocket knife akin to a Swiss army knife rently the main technique for computational esti-
and typically used by Boy Scouts—this technique mation of population parameters.
can be used as a ‘‘quick and dirty’’ replacement As a potential source of confusion, a somewhat
tool for a lot of more sophisticated and specific different (but related) method, also called jack-
tools. Curiously, despite its remarkable influence knife, is used to evaluate the quality of the predic-
on the statistical community, the seminal work of tion of computational models built to predict the
Tukey is available only from an abstract (which value of dependent variable(s) from a set of inde-
does not even mention the name of jackknife) and pendent variable(s). Such models can originate, for
from an almost impossible to find unpublished example, from neural networks, machine learning,
note (although some of this note found its way genetic algorithms, statistical learning models, or
into Tukey’s complete work). any other multivariate analysis technique. These
The jackknife estimation of a parameter is an models typically use a very large number of para-
iterative process. First the parameter is estimated meters (frequently more parameters than observa-
from the whole sample. Then each element is, in tions) and are therefore highly prone to overfitting
turn, dropped from the sample and the parameter (i.e., to be able to predict the data perfectly within
of interest is estimated from this smaller sample. the sample because of the large number of para-
This estimation is called a partial estimate (or also meters but to be able to predict new observations
a jackknife replication). A pseudovalue is then poorly). In general, these models are too complex
computed as the difference between the whole to be analyzed by current analytical techniques,
sample estimate and the partial estimate. These and therefore, the effect of overfitting is difficult to
pseudovalues reduce the (linear) bias of the partial evaluate directly. The jackknife can be used to
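The iterative procedure just described can be written in a few lines. The sketch below is illustrative only (the sample, the chosen statistic, and all names are hypothetical); it returns the jackknife estimate, its standard error, and the jackknife estimate of bias that are defined formally later in this entry.

import numpy as np
from scipy import stats

def jackknife(x, statistic):
    # Jackknife estimate, standard error, and bias estimate for an arbitrary statistic.
    x = np.asarray(x)
    n = len(x)
    t_whole = statistic(x)                              # estimate from the whole sample
    t_partial = np.array([statistic(np.delete(x, i))    # leave-one-out (partial) estimates
                          for i in range(n)])
    pseudo = n * t_whole - (n - 1) * t_partial          # pseudovalues
    t_star = pseudo.mean()                              # jackknife estimate
    se = pseudo.std(ddof=1) / np.sqrt(n)                # standard error from the pseudovalues
    bias = t_whole - t_star                             # jackknife estimate of bias
    return t_star, se, bias

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=40)

t_star, se, bias = jackknife(sample, np.var)            # e.g., jackknifing the biased variance
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)         # Student's t with N - 1 degrees of freedom
print(t_star, t_star - t_crit * se, t_star + t_crit * se, bias)

For a statistic such as the correlation coefficient, the advice given later in this entry applies: transform to Fisher Z before jackknifing and transform the result back.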


estimate the actual predictive power of such mod- P *


2
Tn*  T •
els by predicting the dependent variable values of
σ^T2 * ¼ : ð5Þ
each observation as if this observation were a new n N1
observation. To do so, the predicted value(s) of *
each observation is (are) obtained from the model Tukey conjectured that the T n s could be con-
built on the sample of observations minus the sidered as independent random variables. There-
observation to be predicted. The jackknife, in this fore, the standard error of the parameter estimates,
context, is a procedure that is used to obtain an denoted σ^ 2T * , could be obtained from the variance
unbiased prediction (i.e., a random effect) and to of the pseudovalues from the usual formula for the
minimize the risk of overfitting. standard error of the mean as follows:
sffiffiffiffiffiffiffiffi v ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
uP  * 2
ffi
*
σ^T * u *
Tn  T •
Definitions and Notations t
σ^T* ¼ n
¼ : ð6Þ
N NðN  1Þ
The goal of the jackknife is to estimate a parameter
of a population of interest from a random sample of This standard error can then be used to compute
data from this population. The parameter is denoted confidence intervals for the estimation of the
θ, its estimate from a sample is denoted T; and its parameter. Under the independence assumption,
jackknife estimate is denoted T*. The sample of n this estimation is distributed as a Student’s t distri-
observations (which can be univariate or multivari- bution with (N – 1) degrees of freedom. Specifically
ate) is a set denoted {X1 ; . . . ; Xn ; . . . ; Xn }. The sam- a (1 – αÞ confidence interval can be computed as
ple estimate of the parameter is a function of the
observations in the sample. Formally: T * ± tα;ν σ^ T* ð7Þ

T¼f ðX1 ; . . . ;Xn ; . . . ;XN Þ: ð1Þ with tα;v being the α-level critical value of a Stu-
dent’s t distribution with v ¼ N1 degrees of
An estimation of the population parameter freedom.
obtained without the nth observation is called
the nth partial prediction and is denoted Tn :
Jackknife Without Pseudovalues
Formally:
Pseudovalues are important for understanding the
Tn ¼f ðX1 ; . . . ;Xn1 ;Xnþ1 . . . ;Xn Þ: ð2Þ inner working of the jackknife, but they are not
computationally efficient. Alternative formulas using
A pseudovalue estimation of the nth observation is only the partial estimates can be used in lieu of the
denoted Tn* ; it is computed as the difference pseudovalues. Specifically, if T · denotes the mean of
between the parameter estimation obtained from the partial estimates and σ^ Tn their standard devia-
the whole sample and the parameter estimation tion, then T * (cf. Equation 4) can be computed as
obtained without the nth observation. Formally:
T * ¼ NT  ðN  1ÞT • ð8Þ
Tn* ¼ NT  ðN  1ÞTn : ð3Þ
and σ^ T* (cf. Equation 6) can be computed as
The jackknife estimate of θ, denoted T*, is rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
obtained as the mean of the pseudovalues. N  1X 2
σ^T * ¼ Tn  T •
Formally: N ð9Þ
σ^Tn
¼ ðN  1Þ pffiffiffiffi ffi:
* 1XN
N
T * ¼ T• ¼ T*; ð4Þ
N n n
* Assumptions of the Jackknife
where T • is the mean of the pseudovalues. The
variance of the pseudovalues is denoted σ^ 2T * and is Although the jackknife makes no assumptions
n
obtained with the usual formula: about the shape of the underlying probability

distribution, it requires that the observations are sometimes a source of confusion). The first tech-
independent of each other. Technically, the obser- nique, presented in the preceding discussion, esti-
vations are assumed to be independent and identi- mates population parameters and their standard
cally distributed (i.e., in statistical jargon: i.i.d.). error. The second technique evaluates the general-
This means that the jackknife is not, in general, an ization performance of predictive models. In these
appropriate tool for time-series data. When the models, predictor variables are used to predict the
independence assumption is violated, the jackknife values of dependent variable(s). In this context, the
underestimates the variance in the data set, which problem is to estimate the quality of the prediction
makes the data look more reliable than they actu- for new observations. Technically speaking, the
ally are. goal is to estimate the performance of the predic-
Because the jackknife eliminates the bias by tive model as a random effect model. The problem
subtraction (which is a linear operation), it works of estimating the random effect performance for
correctly only for statistics that are linear functions predictive models is becoming a crucial problem in
of the parameters or the data, and whose distribu- domains such as, for example, bio-informatics and
tion is continuous or at least ‘‘smooth enough’’ to neuroimaging because the data sets used in these
be considered as such. In some cases, linearity can domains typically comprise a very large number of
be achieved by transforming the statistics (e.g., variables (often a much larger number of variables
using a Fisher Z transform for correlations or a log- than observations—a configuration called the
arithm transform for standard deviations), but ‘‘small N; large P’’ problem). This large number of
some nonlinear or noncontinuous statistics, such variables makes statistical models notoriously
as the median, will give very poor results with the prone to overfitting.
jackknife no matter what transformation is used. In this context, the goal of the jackknife is to
estimate how a model would perform when
Bias Estimation applied to new observations. This is done by drop-
ping in turn each observation and fitting the model
The jackknife was originally developed by Que- for the remaining set of observations. The model is
nouille as a nonparametric way to estimate and then used to predict the left-out observation. With
reduce the bias of an estimator of a population this procedure, each observation has been pre-
parameter. The bias of an estimator is defined as dicted as a new observation.
the difference between the expected value of this In some cases a jackknife can perform both
estimator and the true value of the population functions, thereby generalizing the predictive
parameter. So formally, the bias, denoted , of an model as well as finding the unbiased estimate of
estimation T of the parameter θ is defined as the parameters of the model.
 ¼ EfT g  θ; ð10Þ
Example: Linear Regression
with EfTg being the expected value of T:
The jackknife estimate of the bias is computed Suppose that we had performed a study examining
by replacing the expected value of the estimator the speech rate of children as a function of their
[i.e., EfTg] by the biased estimator (i.e., TÞ and by age. The children’s age (denoted XÞ would be used
replacing the parameter (i.e., θ) by the ‘‘unbiased’’ as a predictor of their speech rate (denoted YÞ.
jackknife estimator (i.e., T*). Specifically, the Dividing the number of words said by the time
jackknife estimator of the bias, denoted jack , is needed to say them would produce the speech rate
computed as (expressed in words per minute) of each child. The
results of this (fictitious) experiment are shown in
jack ¼TT * : ð11Þ Table 1.
We will use these data to illustrate how the
jackknife can be used to (a) estimate the regression
Generalizing the Performance of Predictive Models
parameters and their bias and (b) evaluate the gen-
Recall that the name jackknife refers to two eralization performance of the regression model.
related, but different, techniques (and this is As a preliminary step, the data are analyzed by

Table 1 Data From a Study Examining the Speech an* ¼ na  ðn  1Þan and
Rate of Children as a Function of Age ð14Þ
bn* ¼ nb  ðn  1Þbn ;
Xn Yn ^n
Y ^
Y ^
Y
n jack,n

1 4 91 95.0000 9 4.9986 97.3158 and for the first observation, this equation
2 5 96 96.2500 96.1223 96.3468 becomes
3 6 103 97.5000 97.2460 95.9787
a1* ¼ 6 × 1:25  5 × 0:9342 ¼ 2:8289 and
4 9 99 101.2500 100.6172 101.7411
5 9 103 101.2500 100.6172 100.8680 b1* ¼ 6 × 90  5 × 93:5789 ¼ 72:1053:
6 15 108 108.7500 107.3596 111.3962 ð15Þ
Notes: The independent variable is the age of the child (XÞ.
The dependent variable is the speech rate of the child in Table 2 gives the partial estimates and pseudo-
words per minutes (YÞ. The values of Y ^ are obtained as Y
^¼ values for the intercept and slope of the regression.
90 þ 1.25X: Xn is the value of the independent variable, From this table, we can find that the jackknife esti-
^ n is the value of the dependent variable, Yn is the predicted
Y
mates of the regression will give the following
value of the dependent variable predicted from the
regression, Y^ * is the predicted value of the dependent equation for the prediction of the dependent vari-
n
variable predicted from the jackknife derived unbiased able (the prediction using the jackknife estimates is
estimates, and Yjack is the predicted values of the dependent denoted Y^ * Þ:
n
variable when each value is predicted from the
corresponding jackknife partial estimates. ^ * ¼ a * þ b * X ¼ 90:5037 þ 1:1237X:
Y ð16Þ
n

The predicted values using the jackknife esti-


a standard regression analysis, and we found that
mates are given in Table 1. It is worth noting that,
the regression equation is equal to
for regression, the jackknife parameters are linear
^ ¼ a þ bX ¼ 90 þ 1:25X:
Y ð12Þ functions of the standard estimates. This implies
that the values of Y ^ * can be perfectly predicted
n
The predicted values are given in Table 1. This ^
from the values of Yn . Specifically,
regression model corresponds to a coefficient of  *

correlation of r = .8333 (i.e., the correlation ^ ¼ a a
* * b b* ^
Y n þ Yn : ð17Þ
^ is equal to .8333).
between the Ys and the Ys b b

Therefore, the correlation between the Y ^ * and


n
Estimation of Regression Parameters and Bias the Yn is equal to one; this, in turn, implies that
In this section, the jackknife is used to estimate the correlation between the original data and the
predicted values is the same for both Y and Y ^ *.
the intercept, the slope, and the value of the coeffi- n
cient of correlation for the regression. The estimation for the coefficient of correlation
Each observation is dropped in turn and com- is slightly more complex because, as mentioned,
puted for the slope and the intercept, the partial the jackknife does not perform well with nonlinear
estimates (denoted bn and an Þ, and the pseudo- statistics such as correlation. So, the values of r are
values (denoted bn* and an* Þ. So, for example, transformed using the Fisher Z transform prior to
when we drop the first observation, we use the jackknifing. The jackknife estimate is computed
observations 2 through 6 to compute the regres- on these Z-transformed values, and the final value
sion equation with the partial estimates of the of the estimate of r is obtained by using the inverse
slope and intercept as (cf. Equation 2): of the Fisher Z transform (using r rather than
the transformed Z values would lead to a gross
^ 1 ¼ a1 þ b1 X ¼ 93:5789 þ 0:9342X: ð13Þ
Y overestimation of the correlation). Table 2 gives
the partial estimates for the correlation, the
From these partial estimates, we compute a pseudo- Z-transformed values, and the Z-transformed
value by adapting Equation 3 to the regression pseudovalues. From Table 2, we find that the jack-
context. This gives the following jackknife pseudo- knife estimate of the Z-transformed coefficient of
values for the nth observation: correlation is equal to Z * ¼ 1.019, which when
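The partial estimates and pseudovalues for the intercept and slope can be computed directly from the Table 1 data. The sketch below is illustrative rather than the authors' code; it should reproduce the whole-sample estimates of Equation 12 and, to rounding, the jackknife estimates reported in Table 2.

import numpy as np

# Age (X) and speech rate (Y) from Table 1.
x = np.array([4.0, 5.0, 6.0, 9.0, 9.0, 15.0])
y = np.array([91.0, 96.0, 103.0, 99.0, 103.0, 108.0])
n = len(x)

def fit(xx, yy):
    # Ordinary least-squares intercept and slope for a simple regression.
    slope, intercept = np.polyfit(xx, yy, 1)
    return intercept, slope

a_full, b_full = fit(x, y)    # 90.00 and 1.25, as in Equation 12

# Partial (leave-one-out) estimates and pseudovalues (Equations 13 to 15).
a_part = np.empty(n)
b_part = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    a_part[i], b_part[i] = fit(x[keep], y[keep])
# Dropping the first observation gives approximately 93.58 and 0.93, as in Equation 13.

a_pseudo = n * a_full - (n - 1) * a_part
b_pseudo = n * b_full - (n - 1) * b_part

# Jackknife estimates are the means of the pseudovalues: about 90.50 and 1.12 (Table 2).
print(a_pseudo.mean(), b_pseudo.mean())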

Table 2 Partial Estimates and Pseudovalues for the Regression Example of Table 1

Partial Estimates Pseudovalues


Observations A-n b-n r-n Z-n an bn Z*
1 93.5789 0.9342 8005 1.1001 72.1053 2.8289 1.6932
2 90.1618 1.2370 8115 1.1313 89.1908 1.3150 1.5370
3 87.4255 1.4255 9504 1.8354 102.8723 0.3723  1.9835
4 90.1827 1.2843 .8526 1.2655 89.0863 1.0787 0.8661
5 89.8579 1.2234 .8349 1.2040 90.7107 1.3832 1.1739
6 88.1887 1.5472 .7012 0.8697 99.0566  0.2358 2.8450
a• b• — Z• a b 
Z
89.8993 1.2753 — .2343 90.5037 1.1237 1.0219
Jackknife Estimates
SD σ^bn σ^bn — σ^zn σ^an σ^bn σ^zn
2.1324 0.2084 — 0.3240 10.6622 1.0418 1:6198
Jackknife Standard Deviations
SE — — — — σa ^  σ^b  σz ^ 
σ^a ¼ pffiffinn σ^b ¼ pffiffinn σ^z ¼ pffiffinn
— — — — 4.3528 0.4253 .6613
Jackknife Standard Error

transformed back to a correlation, produces a value The bias of the estimate is computed from Equa-
of the jackknife estimate for the correlation of r* = tion 11. For example, the bias of the estimation of
.7707. Incidently, this value is very close to the the coefficient of correlation is equal to
value obtained with another classic alternative
population unbiased estimate called the shrunken jack ðrÞ ¼r  r* ¼ :8333  :7707 ¼ :0627: ð20Þ
r; which is denoted ~r, and computed as
The bias is positive, and this shows (as expected)
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

that the coefficient of correlation overestimates the
ðN  1Þ
~r ¼ 1  ð1  r Þ 2 magnitude of the population correlation.
ðN  2Þ
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

5
Estimate of the Generalization
¼ 1  ð1  :8333 Þ 2 ¼ :7862: ð18Þ
4
Performance of the Regression
To estimate the generalization performance of the
Confidence intervals are computed using regression, we need to evaluate the performance of
Equation 7. For example, taking into account the model on new data. These data are supposed to
that the α = .05 critical value for a Student’s t be randomly selected from the same population as
distribution for v = 5 degrees of freedom is equal the data used to build the model. The jackknife
to tα;v = 2.57, the confidence interval for the strategy here is to predict each observation as a new
intercept is equal to observation; this implies that each observation is
predicted from its partial estimates of the prediction
10:6622 parameter. Specifically, if we denote by Yjack;n the
a* ± tα;ν σ^ a* ¼ 90:5037 ± 2:57 × pffiffiffi jackknife predicted value of the nth observation, the
6 jackknife regression equation becomes
ð19Þ
¼ 90:5037 ± 2:57 × 4:3528
¼ 90:5037 ± 11:1868: ^ jack; n ¼ an þ bn Xn :
Y ð21Þ

So, for example, the first observation is predicted statistical sciences (Vol. 4, pp. 280–287). New York:
from the regression model built with observations Wiley.
2 to 6; this gives the following predicting equation Manly, B. F. J. (1997). Randomization, bootstrap, and
for Yjack;1 (cf. Tables 1 and 2): Monte Carlo methods in biology (2nd ed.). New York:
Chapman & Hall.
Miller, R. G. (1974). The jackknife: A review.
^ jack; 1 ¼ a1 þ b1 X1 ¼ 93:5789
Y Biometrika, 61; 1–17.
ð22Þ Quenouille, M. H. (1956). Notes on bias in estimation.
þ 0:9342 × 4 ¼ 97:3158:
Biometrika, 43; 353–360.
Shao, J., & Tu, D. (1995). The jackknife and the
The jackknife predicted values are listed in Table bootstrap. New York: Springer-Verlag.
1. The quality of the prediction of these jackknife Tukey, J. W. (1958). Bias and confidence in not quite
values can be evaluated, once again, by computing large samples (abstract). Annals of Mathematical
a coefficient of correlation between the predicted Statistics, 29; 614.
values (i.e., theYjack;n Þ and the actual values (i.e., Tukey, J. W. (1986). The future of processes of data
the Yn Þ. This correlation, denoted rjack , for this analysis. In The collected works of John W. Tukey
(Vol. IV, pp. 517–549). New York: Wadsworth.
example is equal to rjack = .6825. It is worth noting
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009).
that, in general, the coefficient rjack is not equal to Puzzlingly high correlations in f MRI studies of
the jackknife estimate of the correlation r * (which, emotion, personality, and social cognition.
recall, is in our example equal to r * = .7707). Perspectives in Psychological Sciences, 4; 274–290.
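A minimal sketch of this jackknife prediction for the Table 1 data follows. It is illustrative only, but it should reproduce the value r_jack = .6825 reported above.

import numpy as np

# Age (X) and speech rate (Y) from Table 1.
x = np.array([4.0, 5.0, 6.0, 9.0, 9.0, 15.0])
y = np.array([91.0, 96.0, 103.0, 99.0, 103.0, 108.0])
n = len(x)

y_jack = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    # Refit the regression without observation i ...
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    # ... and predict observation i as if it were a new observation (Equation 21).
    y_jack[i] = intercept + slope * x[i]

# Correlation between the jackknifed predictions and the observed values.
r_jack = np.corrcoef(y_jack, y)[0, 1]
print(round(r_jack, 4))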

Herve Abdi and Lynne J. Williams

See also Bias; Bivariate Regression; Bootstrapping;


Coefficients of Correlation, Alienation, and
Determination; Pearson Product-Moment Correlation
JOHN HENRY EFFECT
Coefficient; R2 ; Reliability; Standard Error of Estimate
The term John Henry effect was coined to explain
the unexpected outcome of an experiment caused
Further Readings by the control group’s knowledge of its role within
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. the experiment. The control group’s perceived role
(2009). Experimental design and analysis for as a baseline or a comparison group to the experi-
psychology. Oxford, UK: Oxford University Press. mental condition, specifically one testing an inno-
Bissell, A. F., & Ferguson, R. A. (1975). The jackknife— vative technology, can cause the control group to
toy, tool or two-edged weapon? The Statistician, 24; behave in an unnatural way to outperform the
79–100. new technology. The group’s knowledge of its
Diaconis, P., & Efron, B. (1983). Computer-intensive
position within the experiment as a baseline com-
methods in statistics. Scientific American, 116–130.
Efron, B., & Tibshirani, R. J. (1993). An introduction to
parison causes the group to perform differently
the bootstrap. New York: Chapman & Hall. and, often more specifically, better than usual,
Efron, B., & Gong, G. (1983). A leisurely look at the eliminating the effect of the experimental manipu-
bootstrap, the jackknife, and cross-validation. The lation. Deriving its name from the folktale the
American Statistician 37; 36–48. ‘‘Ballad of John Henry,’’ the John Henry effect is
Gentle, J. E. (1998). Elements of computational statistics. similar to the Hawthorne effect in which partici-
New York, NY: Springer. pant behavior changes as a result of the partici-
Kriegeskorte, K., Simmons, W. K., Bellgowan, P. S. F., & pants’ knowledge that they are being observed or
Baker, C. I. (2009). Circular analysis in systems studied. This change in participant behavior can
neuroscience: The dangers of double dipping. Nature
confound the experiment rendering the results
Neuroscience, 12; 535–540.
Krzanowski, W. J., & Radley, D. (1989). Nonparametric
inaccurate or misleading.
confidence and tolerance regions in canonical variate The effect was studied and explained by educa-
analysis. Biometrics, 45; 1163–1173. tion researcher Robert Heinich after his review of
Hinkley, D. V. (1983). Jackknife methods. In Johnshon, several studies that compared the effects of televi-
N. L., Kotz, S., & Read, C. B. (Eds.). Encyclopedia of sion instruction with those of standard classroom

teaching. Heinich noted that many of these studies of Henry in which an unexpected result develops
demonstrated insignificant differences between from the group’s overexertion or out-of-the-
control and experimental groups and often ordinary performance. With both Henry and the
included results in which the control group outper- control group, the fear of being replaced incites
formed the experimental condition. He was one of the spirit of competition and leads to an inaccurate
the first researchers to acknowledge that the valid- depiction of the differences in performance by the
ity of these experiments was compromised by the control and experimental groups.
control groups’ knowledge of their role as a control The effect was later studied extensively by Gary
or baseline comparison group. Comparing the con- Saretsky, who expanded on the term’s definition
trol groups with the title character from the ‘‘Bal- by pointing out the roles that both competition
lad of John Henry,’’ Heinich described how and fear play in producing the effect. In most
a control group might exert extra effort to com- cases, the John Henry effect is perceived as a con-
pete with or even outperform its comparison trol group’s resultant behavior to the fear of being
group. outperformed or replaced by the new strategy or
In the ‘‘Ballad,’’ title character John Henry novel technology.
works as a rail driver whose occupation involves
hammering spikes and drill bits into railroad ties Samantha John
to lay new tracks. John Henry’s occupation is
See also Control Group; Hawthorne Effect
threatened by the invention of the steam drill,
a machine designed to do the same job in less
time. The ‘‘Ballad of John Henry’’ describes an Further Readings
evening competition in which Henry competes
with the steam drill one on one and defeats it by Heinich, R. (1970). Technology & the management of
laying more track. Henry’s effort to outperform instruction. Washington, DC: Association for
Educational Communications and Technology.
the steam drill causes a misleading result, how-
Saretsky, G. (1975). The John Henry Effect: Potential
ever, because although he did in fact win the confounder of experimental vs. control group
competition, his overexertion causes his death approaches to the evaluation of educational
the next day. innovations. Presented at the annual meeting of the
Heinich’s use of the folktale compares the per- American Educational Research Association,
formance by a control group with the performance Washington, DC.
K
goodness-of-fit tests. An example illustrating the
KOLMOGOROV–SMIRNOV TEST application and evaluation of a KS test is also
provided.

The Kolmogorov–Smirnov (KS) test is one of many


goodness-of-fit tests that assess whether univariate Estimating Parameters
data have a hypothesized continuous probability
distribution. The most common use is to test Most properties of the KS test have been devel-
whether data are normally distributed. Many sta- oped for testing completely specified distributions.
tistical procedures assume that data are normally For example, one tests not just that the data are
distributed. Therefore, the KS test can help vali- normal, but more specifically that the data are nor-
date use of those procedures. For example, in a lin- mal with a certain mean and a certain variance. If
ear regression analysis, the KS test can be used to the parameters of the distribution are not known,
test the assumption that the errors are normally it is common to estimate parameters in order to
distributed. However, the KS test is not as power- obtain a completely specified distribution. For
ful for assessing normality as other tests such as example, to test whether the errors in a regression
the Shapiro–Wilk, Anderson–Darling, and Bera– have a normal distribution, one could estimate the
Jarque tests that are specifically designed to test error variance by the mean-squared error and test
for normal distributions. That is, if the data are whether the errors are normal with a mean of zero
not normal, the KS test will erroneously conclude and a variance equal to the calculated mean-
that they are normal more frequently than will the squared error. However, if parameters are esti-
other three mentioned tests. Yet the KS test is mated in the KS test, the critical values in standard
better in this regard than the widely used chi- KS tables are incorrect and substantial power can
square goodness-of-fit test. Nevertheless, the KS be lost. To permit parameter estimation in the KS
test is valid for testing data against any specified test, statisticians have developed corrected tables
continuous distribution, not just the normal distri- of critical values for testing special distributions.
bution. The other three mentioned tests are not For example, the adaptation of the KS test for test-
applicable for testing non-normal distributions. ing the normal distribution with estimated mean
Moreover, the KS test is distribution free, which and variance is called the Lilliefors test.
means that the same table of critical values might
be used—whatever the hypothesized continuous
Multiple-Sample Extensions
distribution, normal or otherwise.
This entry discusses the KS test in relation to The KS test has been extended in other ways. For
estimating parameters, multiple samples, and example, there is a two-sample version of the KS

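The most common use described above, checking whether regression-style errors are normally distributed, can be sketched with scipy.stats.kstest. The residuals below are simulated stand-ins, and the closing caution restates the parameter-estimation point made in this entry.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
errors = rng.normal(loc=0.0, scale=1.0, size=100)   # stand-in for regression residuals

# One-sample KS test against a completely specified distribution:
# normal with mean 0 and standard deviation 1.
d, p_value = stats.kstest(errors, "norm", args=(0.0, 1.0))
print(d, p_value)

# Caution: if the mean and standard deviation had been estimated from the same
# sample, the standard KS critical values would not apply; a corrected test such
# as the Lilliefors adaptation would be needed.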

test that is used to test whether two separate sets test uses a different measure of disparity. The KS
of data have the same distribution. As an example, test uses the maximum distance between the
one could have a set of scores for males and a set empirical distribution function of the data and the
of scores for females. The two-sample KS test hypothesized distribution. The following example
could be used to determine whether the distribu- clarifies ideas.
tion of male scores is the same as the distribution
of female scores. The two-sample KS test does not
Example
require that the form of the hypothesized common
distribution be specified. One does not need to Suppose that one wishes to test whether the 50
specify whether the distribution is normal, expo- randomly sampled data in Table 1 have a standard
nential, and so on, and no parameters are esti- normal distribution (i.e., normal with mean ¼ 0
mated. The two-sample KS test is distribution free, and variance ¼ 1). The smooth curve in Figure 1
so just one table of critical values suffices. The KS shows the cumulative distribution function (CDF)
test has been extended further to test the equality of the standard normal distribution. That is, for
of distributions when the number of samples each x on the horizontal axis, the smooth curve
exceeds two. For example, one could have scores shows the standard normal probability less than or
from several different cities. equal to x: These are the values displayed in every
table of standard normal probabilities found in
statistics textbooks. For example, if x ¼ 2, the
Goodness-of-Fit Tests
smooth curve has a value of 0.9772, indicating
In all goodness-of-fit tests, there is a null hypothe- that 97.72% of the values in the distribution are
sis that states that the data have some distribution found below 2 standard deviations above the
(e.g., normal). The alternative hypothesis states mean, and only 2.28% of the values are more than
that the data do not have that distribution (e.g., 2 standard deviations above the mean. The jagged
not normal). In most empirical research, one hopes line in the figure shows the empirical distribution
to conclude that the data have the hypothesized function (EDF; i.e., the proportion of the 50 data
distribution. But in empirical research, one cus- less than or equal to each x on the horizontal
tomarily sets up the research hypothesis as the axis). The EDF is 0 below the minimum data value
alternative hypothesis. In goodness-of-fit tests, this and is 1 above the largest data value and steps up
custom is reversed. The result of the reversal is by 1=50 ¼ 0:02 at each data value from left to
that a goodness-of-fit test can provide only a weak right.
endorsement of the hypothesized distribution. The If the 50 data come from a standard normal dis-
best one can hope for is a conclusion that the tribution, then the smooth curve and the jagged
hypothesized distribution cannot be rejected, or line should be close together because the empirical
that the data are consistent with the hypothesized proportion of data less than or equal to each x
distribution. Why not follow custom and set up should be close to the proportion of the true distri-
the hypothesized distribution as the alternative bution less than or equal to each x: If the true dis-
hypothesis? Then rejection of the null hypothesis tribution is standard normal, then any disparity or
would be a strong endorsement of the hypothe- gap between the two lines should be attributable
sized distribution at a specified low Type I error to the random variation of sampling, or to the dis-
probability. The answer is that it is too hard to dis- creteness of the 0.02 jumps at each data value.
prove a negative. For example, if the hypothesized However, if the true data distribution is not stan-
distribution were standard normal, then the test dard normal, then the true CDF and the standard
would have to disprove all other distributions. The normal smooth curve of Figure 1 will differ. The
Type I error probability would be 100%. EDF will be closer to the true CDF than to the
Like all goodness-of-fit tests, the KS test is based smooth curve. And a persistent gap will open
on a measure of disparity between the empirical between the two curves of Figure 1, provided the
data and the hypothesized distribution. If the dis- sample size is sufficiently large. Thus, if there is
parity exceeds a critical cutoff value, the hypothe- only a small gap between the two curves of the fig-
sized distribution is rejected. Each goodness-of-fit ure, then it is plausible that the data come from
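The two-sample extension described above is implemented in SciPy as scipy.stats.ks_2samp. The sketch below is illustrative; the two samples of scores are simulated.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
male_scores = rng.normal(loc=50.0, scale=10.0, size=80)
female_scores = rng.normal(loc=52.0, scale=10.0, size=90)

# Two-sample KS test of the hypothesis that both samples come from the
# same (unspecified) continuous distribution.
d, p_value = stats.ks_2samp(male_scores, female_scores)
print(d, p_value)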

Table 1 Example of KS Test (50 Randomly Sampled to measure the gap. The area between the lines
Data, Arranged in Increasing Order) and square of the area betweenthe lines are among
1:7182 0:9339 0:2804 0.2694 1.0030 the measures that are used in other goodness-of-fit
1:7144 0:9326 0:2543 0.3230 1.0478 tests. In most cases, statistical software running on
1:6501 0:8866 0:2093 0.4695 1.0930 a computer will probably calculate both the value
1:5493 0:6599 0:1931 0.5368 1.2615 of the KS test and the critical value or p value. If
1:4843 0:5801 0:1757 0.5686 1.2953 the computation of the KS test is done by hand, it
1:3246 0:5559 0:1464 0.7948 1.4225 is necessary to exercise some care because of the
1:3173 0:4403 0:0647 0.7959 1.6038 discontinuities or jumps in the EDF at each data
1:2435 0:4367 0:0594 0.8801 1.6379 value.
1:1507 0:3205 0:0786 0.8829 1.6757
0:9391 0:3179 0:1874 0.9588 1.6792 Calculation of the KS Statistic
The KS statistic that measures the gap can be
calculated in general by the formula
[Figure 1 appears here: the smooth standard normal CDF and the jagged EDF of the sample; vertical axis CDF/EDF from 0.0 to 1.0, horizontal axis Sample Data Values from −3.0 to 3.0.]

Dn = max(Dn+, Dn−),

in which

Dn+ = max over 1 ≤ i ≤ n of [ i/n − F0(xi) ]

and

Dn− = max over 1 ≤ i ≤ n of [ F0(xi) − (i − 1)/n ].


In these formulas, Dn is the value of the KS sta-
tistic based on n sample data that have been
Figure 1 Example of KS Test ordered (x1 ≤ x2 ≤    ≤ xn Þ; F0 ðxi Þ is the value of
the hypothesized CDF at xi (the smooth curve in
Notes: The smooth curve is the CDF of hypothesized
the figure); and i=n is the EDF (the proportion of
standard normal distribution. The jagged line is the EDF of
50 randomly sampled data in Table 1. The KS statistic is the data less than or equal to xi Þ. The statistics Dþ n
maximum vertical distance between the curve and the line. and D n are the basis of one-sided KS tests, in
which the alternative hypothesis is that every per-
centile of F0 exceeds (or lies below) every corre-
a standard normal distribution. However, if the sponding percentile of the true distribution F: Such
gap is sufficiently large, then it is implausible that one-sided KS tests have fairly good power.
the data distribution is standard normal.
Critical Values for the KS Test
Intuition for the KS test
Formally, the null and alternative hypotheses
This intuition is the basis for the KS test: Reject for the KS test can be stated as
the hypothesis that the data are standard normal if
the gap in Figure 1 is greater than a critical value; H0 : FðxÞ ¼ F0 ðxÞ for all x
declare the data consistent with the hypothesis that
versus
the data are standard normal if the gap is less than
the critical value. To complete the test, it is neces- H1 : FðxÞ 6¼ F0 ðxÞ for at least one x:
sary to determine how to compute the gap and the
critical value. The KS test measures the gap by the
maximum vertical difference between the two The hypotheses say that either the true CDF
lines. The maximum distance is not the only way [FðxÞ] is identical to the hypothesized CDF [F0 ðxÞ]
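The statistic Dn can be computed directly from its definition. The sketch below is illustrative (it simulates a sample rather than using the 50 values of Table 1); it evaluates Dn+, Dn−, and Dn against the standard normal CDF, compares Dn with the approximate 0.10-level critical value 1.224/sqrt(n) given in this entry, and cross-checks the statistic against scipy.stats.kstest.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = np.sort(rng.normal(size=50))     # ordered sample x(1) <= ... <= x(n)
n = len(x)
f0 = stats.norm.cdf(x)               # hypothesized CDF evaluated at the ordered data
i = np.arange(1, n + 1)

d_plus = np.max(i / n - f0)          # Dn+
d_minus = np.max(f0 - (i - 1) / n)   # Dn-
d_n = max(d_plus, d_minus)           # Dn

crit_10 = 1.224 / np.sqrt(n)         # approximate 0.10-level critical value
print(d_n, crit_10)

# Cross-check against SciPy's implementation of the same statistic.
print(stats.kstest(x, "norm").statistic)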

at all x; or it is not. Statisticians have produced distribution and the true distribution unless a sub-
tables of critical values for small sample sizes. If stantially larger sample size is used than is common
the sample size is at least moderately large (say, in much empirical research. Second, failure to
n ≥ 35Þ, then asymptotic approximations might be reject the hypothesized distribution might be only
pffiffiffi
used. For example, PðDn > 1:358= nÞ ¼ 0:05 a weak endorsement of the hypothesized distribu-
pffiffiffi
and PðDn > 1:224= nÞ ¼ 0:05 for large n: Thus, tion. For example, if one tests the hypothesis that
pffiffiffi
1:358= n is an approximate critical value for the a set of regression residuals has a normal distribu-
KS test at the 0.05 level of significance, and tion, one should perhaps not take too much com-
pffiffiffi fort in the failure of the KS test to reject the
1:224= n is an approximate critical value for the
KS test at the 0.10 level of significance. normal hypothesis. Third, if the research objective
In the example, the maximum gap between the can be satisfied by testing a small set of specific
EDF and the hypothesized standard normal distri- parameters, it might be overkill to test the entire
bution of Figure 1 is 0.0866. That is, the value of distribution. For example, one might want to know
the KS statistic Dn is 0.0866. Because n ¼ 50, the whether the mean of male scores differs from the
critical value for a test at the 0.10 mean of female scores. Then it would probably be
ffi significance level
pffiffiffiffiffi
is approximately 1:224= 50 ¼ 0:1731. Because better to test equality of means (e.g., by a t test)
Dn ¼ 0:0866 < 0:1731, then it could be concluded than to test the equality of distributions (by the KS
that the data are consistent with the hypothesis of test). The hypothesis of identical distributions is
a standard normal distribution at the 0.10 signifi- much stronger than the hypothesis of equal means
cance level. In fact, the p value is about 0.8472. and/or variances. As in the example, two distribu-
However, the data in Table 1 are in fact drawn tions can have equal means and variances but not
from a true distribution that is not standard nor- be identical. To have identical distributions means
mal. The KS test incorrectly accepts (fails to reject) that all possible corresponding pairs of parameters
the null hypothesis that the true distribution is are equal. The KS test has some sensitivity to all
standard normal. The KS test commits a Type II differences between distributions, but it achieves
error. The true distribution is uniform on the range that breadth of sensitivity by sacrificing sensitivity
ð1:7321; þ1:7321Þ. This uniform distribution to differences in specific parameters. Fourth, if it is
has a mean of 0 and a variance of 1, just like the not really important to distinguish distributions
standard normal. The sample size of 50 is insuffi- that differ by small gaps (such as 0.0572)—if only
cient to distinguish between the hypothesized stan- large gaps really matter—then the KS test might be
dard normal distribution and the true uniform quite satisfactory. In the example, this line of
distribution with the same mean and variance. The thought would imply researcher indifference to the
maximum KS gap between the true uniform distri- shapes of the distributions (uniform vs. normal)—
bution and the hypothesized standard normal dis- the uniform distribution on the range ð1:7321,
tribution is only 0.0572. Because the critical value þ1:7321Þ would be considered ‘‘close enough’’ to
for the KS test using 50 data and a 0.10 signifi- normal for the intended purpose.
cance level is about three times the true gap, it is
Thomas W. Sager
very unlikely that an empirical gap of the magni-
tude required to reject the hypothesized standard See also Distribution; Nonparametric Statistics
normal distribution can be obtained. A much
larger data
pffiffiffi set is required. The critical value
(1:224= nÞ for a test at the 0.10 significance level Further Readings
must be substantially smaller than the true gap
Conover, W. J. (1999). Practical nonparametric statistics
(0.0572) for the KS test to have much power. (3rd ed.). New York: Wiley.
Hollander, M., & Wolfe, D. A. (1999). Nonparametric
Evaluation statistical methods (2nd ed.). New York:
Wiley-Interscience.
Several general lessons can be drawn from this Khamis, H. J. (2000). The two-stage δ-corrected
example. First, it is difficult for the KS test to dis- Kolmogorov-Smirnov test. Journal of Applied
tinguish small differences between the hypothesized Statistics, 27; 439–450.

Lilliefors, H. (1967). On the Kolmogorov-Smirnov test where K is the number of i items or observations,
for normality with mean and variance unknown. pi is the proportion of responses in the keyed
Journal of the American Statistical Association, 62, direction for item i; qi ¼ 1  pi ; and σ 2X is the vari-
399–402. ance of the raw summed scores. Therefore, KR-20
Stephens, M. A. (1974). EDF statistics for goodness of fit
is a function of the number of items, item diffi-
and some comparisons. Journal of the American
Statistical Association, 69, 730–737.
culty, and the variance of examinee raw scores. It
Stephens, M. A. (1986). Tests based on EDF statistics. In is also a function of the item-total correlations
R. B. D’Agostino, & M. A. Stevens (Eds.), (classical discrimination statistics) and increases as
Goodness-of-fit techniques. New York: Marcel the average item-total correlation increases.
Dekker. KR-20 produces results equivalent to coefficient
α, which is another index of internal consistency,
and can be considered a special case of α. KR-20
can be calculated only on dichotomous data,
where each item in the measurement instrument is
KR-20 cored into only two categories. Examples of this
include true/false, correct/incorrect, yes/no, and
KR-20 (Kuder–Richardson Formula 20) is an present/absent. Coefficient α also can be calculated
index of the internal consistency reliability of on polytomous data, that is, data with more than
a measurement instrument, such as a test, ques- two levels. A common example of polytomous
tionnaire, or inventory. Although it can be applied data is a Likert-type rating scale.
to any test item responses that are dichotomously Like α, KR-20 can be described as the mean of
scored, it is most often used in classical psycho- all possible split-half reliability coefficients based
metric analysis of psychoeducational tests and, as on the Flanagan–Rulon approach of split-half reli-
such, is discussed with this perspective. ability. An additional interpretation is derived
Values of KR-20 generally range from 0.0 to from Formula 1: The term pi qi represents the vari-
1.0, with higher values representing a more inter- ance of each item. If this is considered error vari-
nally consistent instrument. In very rare cases, typ- ance, then the sum of the item variances divided
ically with very small samples, values less than 0.0 by the total variance in scores presents the propor-
can occur, which indicates an extremely unreliable tion of variance resulting from error. Subtracting
measurement. A rule-of-thumb commonly applied this quantity from 1 translates it into the propor-
in practice is that 0.7 is an acceptable value or 0.8 tion of variance not resulting from error, assuming
for longer tests of 50 items or more. Squaring KR- there is no source of error other than the random
20 provides an estimate of the proportion of score error present in the process of an examinee
variance not resulting from error. Measurements responding to each item.
with KR-20 < 0.7 have the majority of score vari- G. Frederic Kuder and Marion Richardson
ance resulting from error, which is unacceptable in also developed a simplification of KR-20 called
most situations. KR-21, which assumes that the item difficulties
Internal consistency reliability is defined as the are equivalent. KR-21 allows us to substitute the
consistency, repeatability, or homogeneity of mea- mean of the pi and qi into Formula 1 for pi
surement given a set of item responses. Several and qi ; which simplifies the calculation of the
approaches to reliability exist, and the approach reliability.
relevant to a specific application depends on the An important application of KR-20 is the calcu-
sources of error that are of interest, with internal lation of the classical standard error of measure-
consistency being appropriate for error resulting ment (SEM) for a measurement. The SEM is
from differing items. pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
KR-20 is calculated as SEM ¼ sX 1  KR20; ð2Þ
" PK #
K pi q i
where sX is the standard deviation of the raw
KR20 ¼ 1  i¼12 , ð1Þ scores in the sample. Note that this relationship is
K1 σX
inverse; with sX held constant, as KR-20 increases,
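Formula 1 and the SEM of Formula 2 can be computed directly from a persons-by-items matrix of 0/1 scores. The sketch below is illustrative only: the response matrix is simulated rather than taken from the five-examinee example in this entry, and the raw-score variance uses the n − 1 convention of that example.

import numpy as np

def kr20(responses):
    # KR-20 (Formula 1) for a persons-by-items matrix of 0/1 scores.
    responses = np.asarray(responses)
    k = responses.shape[1]                            # number of items
    p = responses.mean(axis=0)                        # proportion keyed correct per item
    q = 1.0 - p
    var_total = responses.sum(axis=1).var(ddof=1)     # variance of raw summed scores
    return (k / (k - 1.0)) * (1.0 - (p * q).sum() / var_total)

rng = np.random.default_rng(5)
ability = rng.normal(size=(200, 1))
data = (ability + rng.normal(size=(200, 25)) > 0).astype(int)   # simulated 0/1 responses

rel = kr20(data)
sem = data.sum(axis=1).std(ddof=1) * np.sqrt(1.0 - rel)         # Formula 2
print(round(rel, 3), round(sem, 2))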

the SEM decreases. Within classical test theory, devices, or coders of data, designed to indicate
having a more reliable measurement implies a their reliability. As a general measure, it is appli-
smaller SEM for all examinees. cable to data on various levels of measurement
The following data set is an example of the cal- (metrics) and includes some known coefficients
culation of KR-20 with 5 examinees and 10 items. as special cases. As a statistical measure, it maps
samples from a population of data into a single
Item chance corrected coefficient, a scale, indicating
the extent to which the population of data can
Person 1 2 3 4 5 6 7 8 9 10 X be relied on or trusted in subsequent analyses.
1 1 1 1 1 0 1 1 1 1 1 9 Alpha equates reliability with the reproducibility
2 1 0 1 1 1 0 1 1 0 1 7 of the data-generating process, measured by the
3 1 0 1 0 0 1 1 0 1 1 6 agreement on what the data in question refer to
4 0 0 1 1 0 1 0 0 1 1 5 or mean. Typical applications of α are content
5 1 1 1 1 1 1 1 0 1 0 8 analyses where volumes of text need to be read
pi: 0.8 0.2 1.0 0.8 0.4 0.8 0.8 0.4 0.8 0.8 and categorized, interview responses that require
qi: 0.2 0.8 0.0 0.2 0.6 0.2 0.2 0.6 0.2 0.2 scaling or ranking before they can be treated sta-
pi × qi 0.16 0.16 0.0 0.16 0.24 0.16 0.16 0.24 0.16 0.16 tistically, or estimates of political or economic
variables.
The variance of the raw scores (XÞ in the final
column is 2.5, resulting in
  Reliability Data
10 1:6
KR-20 ¼ 1 ¼ 0:4 Data are considered reliable when researchers have
10  1 2:5
reasons to be confident that their data represent
real phenomena in the world outside their project,
as the KR-20 estimate of reliability.
or are not polluted by circumstances that are
Nathan A. Thompson extraneous to the process designed to generate
them. This confidence erodes with the emergence
See also Classical Test Theory; Coefficient Alpha; of disagreements, for example, among human
Internal Consistency Reliability; Reliability coders regarding how they judge, categorize, or
score given units of analysis, in the extreme, when
Further Readings their accounts of what they see or read is random.
To establish reliability requires duplications of the
Cortina, J. M. (1993). What is coefficient alpha? An data-making efforts by an ideally large number of
examination of theory and applications. Journal of coders. Figure 1 represents reliability data in their
Applied Psychology, 78, 98–104. most basic or canonical form, as a matrix of m
Cronbach, L. J. (1951). Coefficient alpha and the internal
coders by r units, containing the values ciu assigned
structure of tests. Psychometrika, 16, 297–334.
de Gruijter, D. N. M., & van der Kamp, L. J. T. (2007).
Statistical test theory for the behavioral sciences. Boca Units: 1 2 3 . . u . . . . . . . . r
Raton, FL: CRC Press.
Coders: 1 c11 . . . . c1u . . . . . . . c1r
Kuder, G. F., & Richardson, M. W. (1937). The theory of
the estimation of test reliability. Psychometrika, 2, : : : :
151–160. i ci1 . . . . ciu . . . . . . . cir
: : : :
: : : :
m cm1. . . . cmu . . . . . . . cmr
KRIPPENDORFF’S ALPHA m1 . . . . mu . . . . . . . mr

Krippendorff’s α (alpha) is a general statistical


measure of agreement among observers, measuring Figure 1 Canonical Form of Reliability Data
Krippendorff’s Alpha 669

by coder i to unit u: The total number of pairable Do


values c is αmetric ¼ 1 
De
X Xu ¼ r mu Xi ¼ m Xj ¼ m metric δ2ciu kju
n¼ mu |mu ≥ 2;
u u¼1 n i1 j ¼ 1 m ðm  1Þ
u u
¼1 :
Xc ¼ n Xk ¼ n metric δ2
ck
where mu is the number of coders evaluating
unit u:
c¼1 k¼1 nðn  1Þ
For data to serve reliability assessments, it is
necessary, moreover, that (a) units are freely per- In this expression, the denominator De is the
mutable and (b) representative of the data whose average difference metric δ2ck between the nðn  1)
reliability is in question; and that (c) coders work pairs of values c and k in the n entries in Table 1
independent of each other and (d) must be suffi- regardless of their occurrence in units or which
ciently common to be found where the data- coder contributed them, excluding values in units
making process might be replicated or data are to with mu ≤ 1. The number nðn  1Þ excludes the
be added to an existing project. pairing of values with themselves. The numerator
Do first averages the differences metric δ2ciu kiu between
Alpha all mu ðmu  1Þ pairable values ciu and kju that can
be formed within the mu ≥ 2 values in units u and
The general form of Krippendorff’s α is then averages this difference over all units u:
All differences have the properties metric δ2ck ≥ 0,
Do 2 2
metric δck ¼ metric δkc ,
2 2
metric δcc ¼ metric δkk ¼ 0, and
α¼1 ,
De respond to the metric or level of measurement of
the data involved. The difference functions pro-
where Do is the observed disagreement and De is vided by α are
the disagreement expected when the correlation (
between the units coded and the values used by 2
0 iff c ¼ k
nominal δck ¼ , where c and k are names;
coders to describe them is demonstrably absent. 1 iff c 6¼ k
This conception of chance is uniquely tied to data- !2
gX
¼k
making processes. 2 nc þ nk
ordinal δck ¼ ng  , where ng is the
When agreement is without exception, Do ¼ 0, g¼c
2
α ¼ 1, and data reliability is considered perfect. number of ranks g, used by all coders;
When Do ¼ De , α ¼ 0, and reliability is consid-
ered absent. In statistical reality, α might be nega-
2
interval δck¼ ðc  kÞ2 , where c and k are the values of
tive, leading to these limits: an interval scale;
 
( 2 ck 2
ratio δck ¼ , where c and k are absolute
± Sampling error cþk
1≥α≥0 : values;
 Systematic disagreement
2 ðc  kÞ2
polar δck ¼ , where cmin
Small sample sizes might cause the zero value of ðc þ k  2cmin Þð2cmax  c  kÞ
α to be a mere approximation. The occurrence of and cmax are the extreme bipolar values of a scale;
  
systematic disagreement can drive α below zero. 2 ck 2
The latter should not occur when coders follow δ
circular ck ¼ sin 180 , where U is the
U
the same coding instruction and work indepen- range of values in a circle and the arguments of sin
dently of each other as is required for generating are degrees.
proper reliability data.
In terms of the reliability data in Figure 1, α is Thus, αnominal measures agreements in nominal
defined—for conceptual clarity expressed here data or categories, αordinal in ordinal data or rank
without algebraic simplifications—by orderings, αinterval in interval data or scale points,
670 Krippendorff’s Alpha

αratio in ratio data such as proportions or absolute pairable within units. The distribution of the mar-
numbers, αpolar in data recorded in bipolar oppo- ginal sums, nc: and n:c ; is the best estimate of the
site scales, and αcircular in data whose values consti- otherwise unknown distribution of values in the
tute a closed circle or recursions. population of data whose reliability is in question.
The earlier expression for α in terms of Figure 1 In coincidence matrix terms, α becomes
is computationally inefficient and can be simplified P P 2
in terms of a conceptually more convenient coinci- c k nckmetric δck
αmetric ¼ 1  P P nc: n:k
dence matrix representation, shown in Figure 2, 2
metric δck :
c k
summing the reliability data in Figure 1 as indi- n::  1
cated. Coincidence matrices take advantage of the
This expression might be simplified with reference
fact that reliability does not depend on the identity
to particular metrics, for example, for nominal
of coders, only on whether pairs of values match
and binary data:
and what it means if they do not match, and on
estimates of how these values are distributed in the P P
c k6¼c nck
population of data whose reliability is in question. αnominal ¼ 1  P P nc: n:k
Therefore, they tabulate coincidences without ref- c k6¼c
n::  1
erence to coders. Coincidence matrices should not
P P nc: ðn:c  1Þ
be confused with the more familiar contingency c ncc  c
matrices that cross-tabulate units of analysis as n::  1
¼ ,
judged or responded to by two coders (not the P nc: ðn:c  1Þ
n::  c
values they jointly generate). n::  1
nc6¼k
αbinary ¼ 1  ðn::  1Þ :
nc: n:k
Categories: 1 . k . .
1 n11 . n1k . . n1.
It should be mentioned that the family of α coef-
. . . . . . .
c nc1 . nck . . n c.
ficients also includes versions for coding units with
. . . . . . . multiple values as well as for unitizing continua, for
. . . . . . . example, of texts taken as character strings or tape
n.1 . n.k . . n.. = Number of values recordings. These are not discussed here.
used by all coders

number of c − k pairs of values in unit u


Where: nck = ∑ u mu − 1 An Example
Consider the example in Figure 3 of three coders,
Figure 2 Coincidence Matrix Representation of each assigning one of four values to most of the
Reliability Data 11 units.
On the left of Figure 3 the values of the reliabil-
ity data are tabulated in their canonical form.
By contrast, the cell contents nck of coincidence Coders h and j code only 9 out of the 11 units that
matrices are the frequencies of cu  ku pairs of are attended to by coder i: Excluding unit 11,
values found in units u; weighted by (mu 1) to which does not contain pairable values, m11 ≤ 1, all
ensure that each pairable value contributes exactly n ¼ n:: ¼ 28 pairable values are found tabulated in
one to the matrix. Coincidence matrices contain the coincidence matrix on the right of Figure 3. For
perfectly matching values, ncc ; in their diagonal nominal data, αnominal ¼ 0:624. If α is interpreted
and are symmetrical around that diagonal, as the proportion of values that perfectly distin-
nck ¼ nkc : Their marginal sums nc: ¼ n:c enumer- guish among the given units, the remainder result-
ate the values c used by all coders. Their totals, ing from chance, this might be seen in the
n:: ¼ n ≤ mr; are equal to mr when the table in canonical form of the reliability data. In the first six
Figure 1 is fully occupied, and less then mr when units, one finds 14 out of 28 values in perfect agree-
values are missing, including values that are not ment. Very few of them could be the results of
Krippendorff’s Alpha 671

Categories c: 1 2 3 4
Units u: 1 2 3 4 5 6 7 8 9 10 11
k: 1 3 1 4
Coder h: 2 3 4 4 2 2 4 4 2 2 1 6 1 8
i: 2 3 1 4 4 2 1 3 3 3 3 3 1 4 2 7
j: 2 3 1 4 4 2 1 4 3 4 2 7 9
nc:
mu : 3 3 2 3 3 3 3 3 3 2 1 4 8 7 9 28

Figure 3 Example of Reliability Data in Canonical and Coincidence Matrix Terms

chance. The remaining values exhibit some agree- to be taken as sufficiently reliable. The distribu-
ment but also much uncertainty as to what the tion of α becomes narrower not only with
coded units are. The researcher would not know. increasing numbers of units sampled but also
In the coincidence matrix, one might notice dis- with increasing numbers of coders employed in
agreements to follow a pattern. They occur exclu- the coding process.
sively near the diagonal of perfect agreements. The choice of min α depends on the validity
There are no disagreements between extreme requirements of research undertaken with
values, 1-4, or of the 1-3 and 2-4 kind. This pat- imperfect data. In academic research, it is com-
tern of disagreement would be expected in interval mon to aim for α ≥ 0.9 but require α ≥ 0.8 and
data. When the reliability data in Figure 3 are trea- accept data with α between 0.666 and 0.800
ted as interval data, αinterval ¼ 0:877. The interval only to draw tentative conclusions. When
α takes into account the proximity of the mis- human lives or valuable resources are at stake,
matching scale values—irrelevant in nominal data min α must be set higher.
and appropriately ignored by αnominal. Hence, for To obtain the confidence limits for α at p and
data in Figure 3: αnominal < αinterval . Had the dis- probabilities q for a chosen min α, the distributions
agreements been scattered randomly throughout of αmetric are obtained by bootstrapping in prefer-
the off-diagonal cells of the coincidence matrix, ence to mere mathematical approximations.
αnominal and αinterval would not differ. Had disagree-
ment been predominantly between the extreme
Agreement Coefficients Embraced by Alpha
values of the scale, e.g., 1-4, αnominal would have
exceeded αinterval . This property of α provides the Alpha generalizes several known coefficients. It is
researcher with a diagnostic device to establish defined to bring coefficients for different metrics
how coders use the given values. but of the same makeup under the same roof.
Alpha is applicable to any number of coders,
which includes coefficients defined only for two.
Statistical Properties of Alpha
Alpha has no problem with missing data, as pro-
A common mistake is to accept data as reliable vided in Figure 1, which includes complete m × r
when the null hypothesis that agreement results data as a special case. Alpha corrects for small
from chance fails. This test is seriously flawed as sample sizes, which includes the extreme of very
far as reliability is concerned. Reliable data need large samples of data.
to contain no or only statistically insignificant When data are nominal, generated by two
disagreements. Acknowledging this requirement, coders, and consist of large sample sizes, αnominal
a distribution of α offers two statistical indices: equals Scott’s π. Scott’s π, a popular and widely
(1) α’s confidence limits, low α and high α, at a cho- used coefficient in content analysis and survey
sen level of statistical significance p; and more research, conforms to α’s conception of chance.
importantly, (2) the probability q that the mea- When data are ordinal, generated by two coders,
sured a fails to exceed the min a required for data and very large, αordinal equals Spearmann’s rank
672 Krippendorff’s Alpha

Categories: ci ki ci ki
c j 100 200 300 c j 100 100 200

kj 100 100 k j 100 100 200

100 300 400 200 200 400

% Agreement = 50% % Agreement = 50%


κ = 0.200 κ = 0.000
π = 0.000 π = 0.000
α = 0.001 α = 0.001

Figure 4 Example of Kappa Adding Systematic Disagreements to the Reliability It Claims to Measure

correlation coefficient ρ without ties in ranks. about when reliability is absent. Percent agreement
When data are interval, generated by two coders, is not interpretable as a reliability scale—unless
and numerically large, αinterval equals Pearson’s corrected for chance, which is what Scott’s π does.
intraclass correlation coefficient rii : The intraclass Cohen’s 1960 k (kappa), which also is limited
correlation is the product moment correlation to nominal data, two coders, and large sample
coefficient applied to symmetrical coincidence sizes, has the undesirable property of counting sys-
matrices rather than to asymmetrical contingency tematic disagreement among coders as agreement.
matrices. There is a generalization of Scott’s π to This is evident in unequal marginal distribution of
larger numbers of coders by Joseph Fleiss, who categories in contingency matrices, which rewards
thought he was generalizing Cohen’s , renamed K coders who disagree on their use of categories with
by Sidney Siegel and John Castellan. K equals higher  values. Figure 4 shows two numerical
αnominal for a fixed number of coders with complete examples of reliability data, tabulated in contin-
nominal data and very large, theoretically infinite gency matrices between two coders i and j in
sample sizes. Recently, there have been two close which terms  is originally defined.
reinventions of α, one by Kenneth Berry and Paul Both examples show 50% agreement. They dif-
Mielke and one by Michael Fay. fer in their marginal distributions of categories. In
the left example, data show coder i to prefer cate-
gory k to category c at a ratio of 3:1, whereas
Agreement Coefficients Unsuitable coder j exhibits the opposite preference for c over
k at the rate of 3:1—a systematic disagreement,
as Indices of Data Reliability
absent in the data in the right example. The two
Correlation coefficients for interval data and asso- examples have the same number of disagreements.
ciation coefficients for nominal data measure Yet, the example with systematic disagreements
dependencies, statistical associations, between vari- measures  ¼ 0.200, whereas the one without that
ables or coders, not agreements, and therefore can- systematic disagreement measures  ¼ 0.000. In
not serve as measures of data reliability. In systems both examples, α ¼ 0.001. When sample sizes
of correlations among many variables, the correla- become large, α for two coders converges to Scott’s
tion among the same variables are often called reli- π at which point α ¼ π ¼ 0:000. Evidently,
abilities. They do not measure agreement, however, Cohen’s  gives ‘‘agreement credit’’ for this system-
and cannot assess data reliability. atic disagreement, whereas π and α do not. The
Percent agreement, limited to nominal data gen- reason for ’s mistaken account of these systematic
erated by two coders, varies from 0% to 100%, is disagreements lies in Cohen’s adoption of statistical
the more difficult to achieve the more values are independence between two coders as its conception
available for coding, and provides no indication of chance. This is customary when measuring
Kruskal–Wallis Test 673

correlations or associations but has nothing to do Scott, W. A. (1955). Reliability of content analysis: The
with assigning units to categories. By contrast, in π case of nominal scale coding. Public Opinion
and α, chance is defined as the statistical indepen- Quarterly, 19, 321–325.
dence between the set of units coded and the values Siegel, S., & Castellan, N. J. (1988). Nonparametric
statistics for the behavioral sciences (2nd ed.). Boston:
used to describe them. The margins of coincidence
McGraw-Hill.
matrices estimate the distribution of values occur-
ring in the population whose reliability is in ques-
tion. The two marginal distributions of values in
contingency matrices, by contrast, refer to coder
preferences, not to population estimates. Notwith- KRUSKAL–WALLIS TEST
standing its popularity, Cohen’s  is inappropriate
when the reliability of data is to be assessed. The Kruskal–Wallis test is a nonparametric test to
Finally, Cronbach’s alpha for interval data and decide whether k independent samples are from
Kuder and Richardson’s Formula-20 (KR-20) for different populations. Different samples almost
binary data, which are widely used in psychomet- always show variation regarding their sample
ric and educational research, aim to measure the values. This might be a result of chance (i.e., sam-
reliability of psychological tests by correlating the pling error) if the samples are drawn from the
test results among multiple subjects. As Jum same population, or it might be a result of a genu-
Nunnally and Ira Bernstein have observed, system- ine population difference (e.g., as a result of a dif-
atic errors are unimportant when studying individ- ferent treatment of the samples). Usually the
ual differences. However, systematically biased decision between these alternatives is calculated by
coders reduce the reliability of the data they gener- a one-way analysis of variance (ANOVA). But in
ate and such disagreement must not be ignored. cases where the conditions of an ANOVA are not
Perhaps for this reason, Cronbach’s alpha is fulfilled the Kruskal–Wallis test is an alternative
increasingly interpreted as a measure on the inter- approach because it is a nonparametric method;
nal consistency of tests. It is not interpretable as an that is, it does not rely on the assumption that the
index of the reliability of coded data. data are drawn from a probability distribution
(e.g., normal distribution).
Klaus Krippendorff
Related nonparametric tests are the Mann–
See also Coefficient Alpha; Cohen’s Kappa; Content Whitney U test for only k ¼ 2 independent sam-
Analysis; Interrater Reliability; KR-20; Replication; ples, the Wilcoxon signed rank test for k ¼ 2
‘‘Validity’’ paired samples, and the Friedman test for k > 2
paired samples (repeated measurement) and are
shown in Table 1.
The test is named after William H. Kruskal and
W. Allen Wallis and was first published in the Jour-
Further Readings
nal of the American Statistical Association in
Berry, K. J., & Mielke, P. W., Jr. (1988). A generalization 1952. Kruskal and Wallis termed the test as the H
of Cohen’s kappa agreement measure to interval test; sometimes the test is also named one-way
measurement and multiple raters. Educational and analysis of variance by ranks.
Psychological Measurement, 48, 921–933.
Fay, M. P. (2005). Random marginal agreement
coefficients: Rethinking the adjustments of chance Table 1 Nonparametric Tests to Decide Whether k
when measuring agreement. Biostatistics, 6, 171–180. Samples Are From Different Populations
Hayes, A. F., & Krippendorff, K. (2007). Answering the
call for a standard reliability measure for coding data. k¼2 k>2
Communication Methods and Measures, 1, 77–89. Independent Mann–Whitney Kruskal–Wallis
Krippendorff, K. (2004). Content analysis: An introduction samples U test test
to its methodology (2nd ed.). Thousand Oaks, CA: Sage. Paired Wilcoxon Friedman
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric
samples signed rank test test
theory (3rd ed.). New York: McGraw-Hill.
674 Kruskal–Wallis Test

Table 2 Example Data, N ¼15 Observations From both receive rank 4.5. Furthermore, three sam-
k ¼ 3 Different Samples (Groups) ples had a value of 1310 and would have
received ranks from 6 to 8. Now they get rank 7
Group 1 Group 2 Group 3
as the mean of 6, 7, and 8.
Observation Rank Observation Rank Observation Rank In the next step, the sum of ranks (R1 ; R2 ; R3 Þ
2400 14 1280 4.5 1280 4.5 for each group is calculated. The overall sum of
1860 10 1690 9 1090 1 ranks is NðN þ 1Þ=2. In the example case, this is
2240 13 1890 11 2220 12 15 × 16 / 2 ¼ 120. As a first control, the sum of
1310 7 1100 2 1310 7 ranks for all groups should add up to the same
2700 15 1210 3 1310 7 value: R1 þ R2 þ R3 ¼ 59 þ 29:5 þ 31:5 ¼ 120.
R1 59 R2 29.5 R3 31.5 Distributing the ranks among the three groups
randomly, each rank sum would be about
Notes: Next to the observation, the individual rank for this
observation can be seen. Ranks are added up per sample to
120=3 ¼ 40. The idea is to measure and add the
a rank sum termed Ri : squared deviations from this expectancy value:

ð59  40Þ2 þ ð29:5  40Þ2 þ ð31:5  40Þ2 ¼ 543:5:


This entry begins with a discussion of the
concept of the Kruskal–Wallis test and provides The test statistic provided by the formula of
an example. Next, this entry discusses the formal Kruskal and Wallis is a transformation of this sum
procedure and corrections for ties. Last, this of squares, and the resulting H is under certain
entry describes the underlying assumptions of the conditions (see below) chi-square distributed for
Kruskal–Wallis test. k  1 degrees of freedom. In the example case, the
conditions for an asymptotic chi-square test are
not fulfilled because there are not enough observa-
Concept and Example
tions. Thus, the probability for H has to be looked
The idea of the test is to bring all observations of up in a table, which can be found in many statis-
all k samples into a rank order and to assign them tics books and on the Internet. In this case, we cal-
an according rank. After this initial procedure, all culate H ¼ 5:435 according to the formula
further calculations are based only on these ranks provided in the following section. The critical
but not on the original observations anymore. The value for this case ðk ¼ 3; 5 observations in each
underlying concept of the test is that these ranks group) for p ¼ :05 can be found in the table; it is
should be equally distributed throughout the H ¼ 5:780. Thus, for the example shown earlier,
k samples, if all observations are from the same the null hypothesis that all three samples are from
population. A simple example is used to demon- the same population will not be rejected.
strate this.
A researcher made measurements on k ¼ 3 dif-
Formal Procedure
ferent groups. Overall there are N ¼ 15 observa-
tions. Data are arranged according to their group, The Kruskal–Wallis test assesses the null hypothe-
and an individual rank is assigned to each observa- sis that k independent random samples are from
tion starting with 1 for the smallest observation the same population. Or to state it more precisely
(see Table 2). that the samples come from populations with the
Two things might be noted here. First, in this same locations. Because the logic of the test is
example, for the sake of simplicity, all groups based on comparing ranks rather than means or
have the same number of observations; however, medians, it is not correct to say that it tests the
this is not a necessary condition. Second, there equality of means or medians in populations.
are several observations with the same value Overall N observations are made in k samples.
called tie. In this case, all observations sharing All observations have to be brought into a rank
the same values are assigned their mean rank. In order and ranks have to be assigned from the smal-
the current example, two observations resulted lest to the largest observation. No attention is given
in a value of 1280 sharing ranks 4 and 5. Thus, to the sample to which the observation belongs. In
Kruskal–Wallis Test 675

the case of shared ranks (ties), the according mean In this formula, m stands for the number of ties
rank value has to be assigned. Next a sum of ranks occurred, and ti stands for the number of tied
Ri for all k samples (from i ¼ 1 to i ¼ kÞ has to be ranks occurred in a specific tie i:
computed. The according number of observations In the preceding example, there were m ¼ 2 tied
of each sample is denoted by Ni ; the number of all observations, one for the ranks 4 to 5, and another
observations by N: The Kruskal–Wallis test statis- one for the ranks ranging from 6 to 8. For the first
tic H is computed according to tie t1 ¼ 2, since two observations are tied, for the
second tie t2 ¼ 3, because three observations are
12 X k
R2i identical. Thus, we compute the following correc-
H¼  3ðN þ 1Þ: tion coefficient:
NðN þ 1Þ i¼1 Ni
ð23  2Þ þ ð33  3Þ
For larger samples, H is approximately chi- C¼1 ¼ 0:991;
153  15
square distributed with k  1 degrees of freedom.
For smaller samples, an exact test has to be per- with this coefficient, Hcorr calculates to
formed and the test statistic H has to be compared
with critical values in tables, which can be found 5:435
Hcorr ¼ ¼ 5:484:
in statistics books and on the Internet. (The tables 0:991
provided by Kruskal and Wallis in the Journal of
the American Statistical Association, 1952, 47, Several issues might be noted as seen in this
614617, contain some errors; an errata can be example. The correction coefficient will always be
found in the Journal of the American Statistical smaller than one, and thus, H will always increase
Association, 1953, 48; 910.) These tables are by this correction formula. If the null hypothesis is
based on a full permutation of all possible rank already rejected by an uncorrected H; any further
distributions for a certain case. By this technique, correction might strengthen the significance of the
an empirical distribution of the test criterion H is result, but it will never result in not rejecting the
obtained by a full permutation. Next the position null hypothesis. Furthermore, the correction of H
of the obtained H within this distribution can be resulting from this computation is very small, even
determined. The according p value reflects the though 5 out of 15 (33%) of the observations in
cumulative probability of H to obtain this or even the example were tied. Even in this case where the
a larger value by chance alone. There is no consis- uncorrected H was very close to significance, the
tent opinion what exactly forms a small sample. correction was negligible. From this perspective, it
Most authors recommend for k ¼ 3 and Ni ≤ 8 is only necessary to apply this correction, if N is
observations per sample, for k ¼ 4 and Ni ≤ 4, very small or if the number of tied observations is
and for k ¼ 5 and Ni ≤ 3 to perform the exact relatively large compared with N; some authors
test. recommend here a ratio of 25%.

Assumptions
Correction for Ties
The Kruskal–Wallis test does not assume a normal
When ties (i.e., shared ranks) are involved in the distribution of the data. Thus, whenever the
data, there is a possibility of correcting for this fact requirement for a one-way ANOVA to have nor-
when computing H: mally distributed data is not met, the Kruskal–
H Wallis test can be applied instead. Compared with
Hcorr ¼ ;
C the F test, the Kruskal–Wallis test is reported to
have an asymptotic efficiency of 95.5%.
thereby C is computed by
However, several other assumptions have to be
P
m met for the Kruskal–Wallis test:
ðti3  ti Þ
C¼1 i¼1
: 1. Variables must have at least an ordinal level
N3  N (i.e., rank-ordered data).
676 Kurtosis

2. All samples have to be random samples, and the There are several alternative ways of measuring
samples have to be mutually independent. kurtosis; they differ in their sensitivity to the tails
3. Variables need to be continuous, although of the distribution and to the presence of outliers.
a moderate number of ties is tolerable as shown Some tests of normality are based on the com-
previously. parison of the skewness and kurtosis of the data
with the values corresponding to a normal distri-
4. The populations from which the different
samples drawn should only differ
bution. Tools to do inference about means and var-
regarding their central tendencies but not iances, many of them developed under the
regarding their overall shape. This means that assumption of normality, see their performance
the populations might differ, for example, in affected when applied to data from a distribution
their medians or means, but not in their with high kurtosis.
dispersions or distributional shape The next two sections focus on kurtosis of theo-
(such as, e.g., skewness). If this assumption is retical distributions, and the last two deal with
violated by populations of dramatically kurtosis in the data analysis context.
different shapes, the test loses its
consistency (i.e., a rejection of the null
hypothesis by increasing N is not anymore Comparing Distributions in Terms of Kurtosis
guaranteed in a case where the null hypothesis
A distribution such as the Laplace is said to have
is not valid).
higher kurtosis than the normal distribution
Stefan Schmidt because it has more mass toward the center and
heavier tails [see Figure 1(a)]. To visually compare
See also Analysis of Variance (ANOVA); Friedman Test; the density functions of two symmetric distribu-
Mann–Whitney U Test; Nonparametric Statistics; tions in terms of kurtosis, these should have the
Wilcoxon Rank Sum Test same center and variance. Figure 1(b) displays the
corresponding cumulative version or distribution
Further Readings functions (CDFs). Willem R. van Zwet defined
in 1964 a criterion to compare and order sym-
Conover, W. J. (1971). Practical nonparametric statistics.
New York: Wiley.
metric distributions based on their CDFs.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in According to this criterion, the normal has
one-criterion variance analysis. Journal of the indeed no larger kurtosis than the Laplace distri-
American Statistical Association, 47, 583–621. bution. However, not all the symmetric distribu-
Kruskal, W. H., & Wallis, W. A. (1953). Errata: use of tions are ordered.
ranks in one-criterion variance analysis. Journal of the
American Statistical Association, 48, 907–911.
Measuring Kurtosis in Distributions
In Greek, kurtos means convex; the mathematician
Heron in the first century used the word kurtosis
KURTOSIS to mean curvature. Kurtosis was defined, as a sta-
tistical term, by Karl Pearson around 1905 as the
Density functions are used to describe the distri- measure
bution of quantitative variables. Kurtosis is
4
a characteristic of the shape of the density func- β2 ¼ Eðx  μÞ =σ 4
tion related to both the center and the tails. Dis-
tributions with density functions that have to compare other distributions with the normal
significantly more mass toward the center and in distribution in terms of the frequency toward the
the tails than the normal distribution are said to mean μ (σ is the standard deviation and β2 ¼ 3
have high kurtosis. Kurtosis is invariant under for the normal distribution). It was later that
changes in location and scale; thus, kurtosis ordering criteria based on the distribution func-
remains the same after a change in units or the tions were defined; in addition, more flexible
standardization of data. definitions acknowledging that kurtosis is related
Kurtosis 677

(a) (b)

0.8 1.0
Normal

0.7 Laplace

0.8
0.6

0.5
0.6
F(x)

F(x)
0.4

0.4
0.3

0.2
0.2
0.1

0.0 0.0
−5.0 −2.5 0.0 2.5 5.0
−5.0 −2.5 0.0 2.5 5.0
x
x

Figure 1 Density Functions and CDFs for Normal and Laplace Distributions
Notes: (a) Density functions. (b) Cumulative distribution functions.

(a) (b)

60
99.9

50 99

95
40 90
80
Frequency

Percent

70
30 60
50
40
30
20 20
10
5 Mean −0.1063
10 StDev 0.9480
1 n 200
p value < 0.005
0 0.1
−4 −3 −2 −1 0 1 2 3 −4 −2 0 2 4

x x

Figure 2 Histogram and Normal Probability Plot for a Sample From a Laplace Distribution
Notes: (a) Histogram of data. (b) Probability plot of data. Normal ¼ 95% CI.
678 Kurtosis

Table 1 Values of Some Kurtosis Measures for Eight Symmetric Distributions


Distribution β2 γ 2 (0.05) τ4 Distribution β2 γ 2 (0.05) τ4
Uniform 1.8 0 0 tð4Þ 0.503 0.217
Normal 3 0.355 0.123 tð2Þ 0.648 0.375
Laplace 6 0.564 0.236 tð1Þ 0.854 1
SU(0,1) 36.2 0.611 0.293 SU(0,0.9) 82.1 0.649 0.329

to both peakedness and tail weight were pro- the excess β ^2 3 (for the data in Figure 2,
posed, and it was accepted that kurtosis could be ^
β2  3 ¼ 3:09Þ.
measured in several ways. New measures of kur- If the sample size n is small,
tosis, to be considered valid, have to agree with hP i
the orderings defined over distributions by the ðx  xÞ4 =n
criteria based on distribution functions. It is said b2 ¼
s4
that some kurtosis measures, such as β2 , natu-
rally have an averaging effect that prevents them might take a small value even if the kurtosis of the
from being as informative as the CDFs. Two dis- population is very high [the upper bound for b2 is
tributions can have the same value of β2 and still n  2 þ 1=ðn  1Þ. Adjusted estimators exist to
look different. Because kurtosis is related to the reduce the bias, at least in the case of nearly nor-
peak and tails of a distribution, in the case of mal distributions. A commonly used adjusted esti-
nonsymmetric distributions, kurtosis and skew- mator of excess is
ness tend to be associated, particularly if they hP i
are represented by measures that are highly sen- 4
nðn þ 1Þ ðx  x
 Þ
sitive to the tails. 
Two of the several kurtosis measures that have ðn  1Þðn  2Þðn  3Þ s4
been defined as alternatives to β2 are as follows: 3ðn  1Þðn  1Þ
:
ðn  2Þðn  3Þ
1. L-kurtosis defined by J. R. M. Hosking in 1990
and widely used in the field of hydrology; A single distant outlier can dramatically change
τ4 ¼ L4 =L2 is a ratio of L  moments that are ^2 .
the value of β
linear combinations of expected values of order
statistics.
Effects of High Kurtosis
2. Quantile kurtosis, γ 2 ðpÞ, defined by Richard
Groeneveld in 1998 for symmetric distributions The variance of the sample variance is related to
only, is based on distances between certain β2 . The power of some tests for the equality of var-
quantiles. Other kurtosis measures defined in iances gets affected by high kurtosis. For example,
terms of quantiles, quartiles, and octiles also when testing the hypothesis of equal variances for
exist. two populations based on two independent sam-
ples, the power of the Levene test is lower if the
As an illustration, the values of these kurtosis distribution of the populations is symmetric but
measures are displayed in Table 1 for some with higher kurtosis than if the samples come from
distributions. normal distributions. The performance of the t test
for the population mean is also affected under
situations of high kurtosis. Van Zwet proved that,
Studying Kurtosis From Data when working with symmetric distributions, the
median is more efficient than the mean as estima-
Histograms and probability plots help to explore tor of the center of the distribution if the latter has
sample data. Figure 2 indicates that the data might very high kurtosis.
come from a distribution with higher kurtosis than
the normal. Statistical software generally calculates Edith Seier
Kurtosis 679

See also Distribution; Median; Normal Distribution; Byers, R. H. (2000). On the maximum of the
Student’s t Test; Variance standardized fourth moment. InterStat, January:
Hosking, J. R. M. (1992). Moments or L moments? An
example comparing two measures of distributional
Further Readings shape. American Statistician, 46, 186–199.
Balanda, K. P., & MacGillivray, H. L. (1998). Kurtosis: A Ruppert, D. (1997). What is kurtosis? An influence
critical review. American Statistician, 42, 111–119. function approach. American Statistician, 41, 1–5.
Bonett, D. G., & Seier, E. (2002). A test of normality Seier, E., & Bonett, D. G. (2003). Two families of
with high uniform power. Computational Statistics kurtosis measures. Metrika, 58, 59–70.
and Data Analysis, 40, 435–445.
L
the summary estimate will be much improved by
L’ABBÉ PLOT combining the results of many small studies. In
this hypothesized meta-analysis, the pooled esti-
mate of relative risk is 0.72 (95% confidence
The L’Abbé plot is one of several graphs com- interval: 0.530.97), which suggests that the risk
monly used to display data visually in a meta- of lack of clinical improvement in the treatment
analysis of clinical trials that compare a treatment group is statistically significantly lower than that
and a control intervention. It is basically a scatter- in the control group. However, the results from
plot of results of individual studies with the risk the 10 trials vary considerably (Figure 1), and it is
in the treatment group on the vertical axis and important to investigate why similar trials of the
the risk in the control group on the horizontal same intervention might yield different results.
axis. This plot was advocated in 1987 by Kristan Figure 2 is the L’Abbé plot for the hypothesized
L’Abbé and colleagues for visually showing varia- meta-analysis. The vertical axis shows the event
tions in observed results across individual trials in rate (or risk) of a lack of clinical improvement in
meta-analysis. This entry briefly discusses meta- the treatment group, and the horizontal axis shows
analysis before addressing the usefulness, limita- the event rate of a lack of clinical improvement
tions, and inappropriate uses of the L’Abbé plot. in the control group. Each point represents the
result of a trial, according to the corresponding
event rates in the treatment and the control group.
Meta-Analysis
The size of the points is proportionate to the trial
To understand what the L’Abbé plot is, it is nec- size or the precision of the result. The larger the
essary to have a discussion about meta-analysis. sample size, the larger the point in Figure 2. How-
Briefly, meta-analysis is a statistical method to ever, it should be mentioned that smaller points
provide a summary estimate by combining the might represent larger trials in a L’Abbé plot pro-
results of many similar studies. A hypothesized duced by some meta-analysis software.
meta-analysis of 10 clinical trials is used here to The diagonal line (line A) in Figure 2 is called
illustrate the use of the L’Abbé plot. The most the equal line, indicating the same event rate
commonly used graph in meta-analysis is the for- between the two arms within a trial. That is, a trial
est plot (as shown in Figure 1) to display data point will lie on the equal line when the event rate
from individual trials and the summary estimate in the treatment group equals that in the control
(including point estimates and 95% confidence group. Points below the equal line indicate that
intervals). The precision or statistical power of the risk in the treatment group is lower than that

681
682 L’Abbé Plot

Treatment Control Risk Ratio Risk Ratio


Study or Subgroup Events Total Events Total Weight IV, Random, 95% CI IV, Random, 95% CI
Trial-01 20 400 16 400 10.6% 1.25 [0.66, 2.38]
Trial-02 58 800 60 800 15.9% 0.97 [0.68, 1.37]
Trial-03 11 100 12 100 8.8% 0.92 [0.42, 1.98]
Trial-04 9 50 7 50 7.2% 1.29 [0.52, 3.18]
Trial-05 26 200 36 200 13.7% 0.72 [0.45, 1.15]
Trial-06 14 100 20 100 10.9% 0.70 [0.37, 1.31]
Trial-07 6 50 14 50 7.6% 0.43 [0.18, 1.03]
Trial-08 7 25 8 25 7.8% 0.88 [0.37, 2.05]
Trial-09 14 100 41 100 12.3% 0.34 [0.20, 0.59]
Trial-10 3 25 12 25 5.3% 0.25 [0.08, 0.78]

Total (95% CI) 1850 1850 100.0% 0.72 [0.53, 0.97]


Total events 168 226
Heterogeneity: Tau² = 0.12; Chi² = 19.63, df = 9 (p = .02); I² = 54%
0.05 0.2 1 5 20
Test for overall effect: Z = 2.16 (p = .03)
Favors Treatment Favors Control

Figure 1 A Hypothesized Meta-Analysis of 10 Clinical Trials Comparing a Treatment and a Control Intervention
Note: Outcome: lack of clinical improvement.

60%
in meta-analysis. This overall RR line corresponds
to a pooled relative risk of 0.72 in Figure 2. It
50% A: equal line Event Rate
would be expected that the points of most trials
Treatment Group Risk

B: overall RR line Trial Control Treatment


40% Trial-01
Trial-02
4.0%
7.5%
5.0%
7.3%
will lie around this overall RR line. The distance
30% Trial-03 12.0% 11.0% between a trial point and the overall RR line indi-
Trial-04 14.0% 18.0%
20% Trial-05 18.0% 13.0%
cates the difference between the trial result and the
Trial-06 20.0% 14.0% average estimate.
10% Trial-07 28.0% 12.0%
Trial-08 32.0% 28.0%
0% Trial-09 41.0% 14.0%
0% 10% 20% 30% 40% 50% 60% Trial-10 48.0% 12.0%
Control Group Risk Usefulness
Clinical trials that evaluate the same underlying
Figure 2 The L’Abbé Plot for the Hypothesized average treatment effect might generate different
Meta-Analysis of 10 Clinical Trials results because of random error. The smaller the
sample size, the greater will be the random varia-
in the control group, and vice versa for points tion. Meta-analysis can yield a weighted average
above the equal line. In the hypothesized meta- by pooling the results of all available similar trials.
analysis, the central points of two trials (T01 and This weighted average is the best summary esti-
T04) are above the equal line, indicating that the mate of the true treatment effect if the variation in
event rate in the treatment group is higher than results across trials is mainly caused by random
that in the control group. The points of the error. However, the variation in results across trials
remaining eight trials locate below the equal line, often cannot be explained satisfactorily by chance
showing that the risk in the treatment group is alone. For example, the result of statistical testing
reduced in these trials. of heterogeneity in the hypothesized meta-analysis
The dotted line (line B) in Figure 2 is the overall (Figure 1) suggests that the heterogeneity across
RR line, which represents the pooled relative risk trials is statistically significant (p ¼ .02). Different
L’Abbé Plot 683

patient and/or intervention characteristics might different people. In addition, the same pattern of
also be the causes of variation in results across variations across studies revealed in a L’Abbé plot
trials. The effect of a treatment might be associ- might have very different causes. The usefulness of
ated with the severity of illness, age, gender, or the L’Abbé plot is also restricted by the available
other patient characteristics. Trial results might data reported in the primary studies in meta-analysis.
vary because of different doses of medications, dif- When the number of the available studies in
ferent intensity of interventions, different level of a meta-analysis is small and when data on impor-
training and experience of doctors, and other dif- tant variables are not reported, the investigation
ferences in settings or interventions. of heterogeneity will be unlikely fruitful.
The variation in results across studies is termed The visual perception of variations across studies
heterogeneity in the meta-analysis. Several graphi- in a L’Abbé plot might be misleading because ran-
cal methods can be used for the investigation dom variation in the distance between a study point
of heterogeneity in meta-analysis. The commonly and the overall RR line is associated with both the
used graphical methods are forest plot, funnel plot, sample size of a trial and the event rate in the control
Galbraith plot, and the L’Abbé plot. Only esti- group. Points of small trials are more likely farther
mates of relative effects (including relative risk, away from the overall RR line purely by chance. In
odds ratio, or risk difference) between the treat- addition, trials with a control event rate closing to
ment and control group are displayed in forest 50% will have great random variation in the dis-
plot, funnel plot, and Galbraith plot. As compared tance from the overall RR line. It is possible that the
with other graphical methods, an advantage with distances between trial points and the overall RR
the L’Abbé plot in meta-analysis is that it can line in a L’Abbé plot are adjusted by the correspond-
reveal not only the variations in estimated relative ing sample sizes and event rates in the control group,
effects across individual studies but also the trial using a stochastic simulation approach. However,
arms that are responsible for such differences. This the stochastic simulation method is complex and
advantage of the L’Abbé plot might help research- cannot be used routinely in meta-analysis.
ers and clinicians to identify the focus of the inves-
tigation of heterogeneity in meta-analysis.
Inappropriate Uses
In the hypothesized meta-analysis (Figure 2), the
event rate varies greatly in both the control group L’Abbé plots have been used in some meta-analyses
(from 4.0% to 48.0%) and in the treatment group to identify visually the trial results that are outliers
(from 5.0% to 28.0%). The points of trials with according to the distance between a trial point and
relatively low event rates in the control group tend the overall RR line. Then, the identified outliers are
to locate above the overall RR line, and the points excluded from meta-analysis one by one until het-
of trials with relatively high event rates in the con- erogeneity across studies is no longer statistically
trol group tend to locate below the overall RR line. significant. However, this use of the L’Abbé plot is
This suggests that variations in relative risk across inappropriate because of the following reasons.
trials might be mainly a result of different event First, the exclusion of studies according to their
rates in the control group. Therefore, the event rate results, not their design and other study characteris-
in the control group might be associated with treat- tics, might introduce bias into meta-analysis and
ment effect in meta-analysis. In a real meta-analy- reduce the power of statistical tests of heterogeneity
sis, this pattern of graphical distribution should be in meta-analysis. Second, the chance of revealing
interpreted by considering other patient and/or clinically important causes of heterogeneity might
intervention characteristics to investigate the possi- be missed simply by excluding studies from meta-
ble causes of variations in results across trials. analysis without efforts to investigate reasons for
the observed heterogeneity. In addition, different
methods might identify different trials as outliers,
Limitations
and the exclusion of different studies might lead to
A shortcoming of many graphical methods is that different results of the same meta-analysis.
the visual interpretation of data is subjective, and Another inappropriate use of the L’Abbé plot is
the same plot might be interpreted differently by to conduct a regression analysis of the event rate
684 Laboratory Experiments

in the treatment group against the event rate in the Song, F. (1999). Exploring heterogeneity in meta-analysis:
control group. If the result of such a regression Is the L’Abbé plot useful? Journal of Clinical
analysis is used to examine the relation between Epidemiology, 52, 725730.
treatment effect and the event rate in the con- Song, F., Sheldon, T. A., Sutton, A. J., Abrams, K. R., &
Jones, D. R. (2001). Methods for exploring
trol group, then misleading conclusions could be
heterogeneity in meta-analysis. Evaluation and the
obtained. This is because of the problem of regres- Health Professions, 24, 126151.
sion to the mean and random error in the esti-
mated event rates.

Final Thoughts LABORATORY EXPERIMENTS


To give an appropriate graphical presentation of
variations in results across trials in meta-analysis, Laboratory experiments are a particular method
the scales used on the vertical axis and on the hori- that enables the highest level of control for hypo-
zontal axis should be identical in a L’Abbé plot. thesis testing. Like other types of experiments, they
The points of trials should correspond to the dif- use random assignment and intentional manipula-
ferent sample sizes or other measures of precision. tions, but these experiments are conducted in
It should be explicit about whether the larger a room or a suite of rooms dedicated to that pur-
points correspond to larger or smaller trials. As pose. Although experimental research can be con-
compared with other graphs in meta-analysis, one ducted in places besides laboratories, such as in
important advantage of the L’Abbé plot is that it classrooms or business organizations, a laboratory
displays not only the relative treatment effects of setting is usually preferable, because an investiga-
individual trials but also data on each of the two tor can create optimal conditions for testing the
trial arms. Consequently, the L’Abbé plot is helpful ideas guiding the research.
to identify not only the studies with outlying Psychology was the first social science to
results but also the study arm being responsible use experimental laboratories, with Ivan Pavlov’s
for such differences. Therefore, the L’Abbé plot is famous experiments conditioning dogs around
a useful graphical method for the investigation of the turn of the 20th century. However, it was the
heterogeneity in meta-analysis. However, the inter- second half of that century that saw the spread
pretation of L’Abbé plots is subjective. Misleading of experiments throughout the other social
conclusions could be obtained if the interpretation sciences. In the 1940s and 1950s, R. Freed Bales
of a L’Abbé plot is inappropriate. conducted discussion groups at Harvard, devel-
oping a research design still used for many pur-
Fujian Song poses, including focus groups in communications
studies and marketing.
See also Bias; Clinical Trial; Control Group; Bales’s groups, for reasons including practical
Meta-Analysis; Outlier; Random Error limitations, included no more than about 20 indi-
viduals, and experimental studies were once called
‘‘small group’’ research. A more accurate term
Further Readings
would be ‘‘group processes’’ research, because the
Bax, L., Ikeda, N., Fukui, N., Yaju, Y., Tsuruta, H., & focus of study is not actually the group but rather
Moons, K. G. M. (2008). More than numbers: The what happens within the group. Researchers do
power of graphs in meta-analysis. American Journal not so much study the group itself as they
of Epidemiology, 169, 249255. study abstract processes that occur in interaction.
L’Abbé, K. A., Detsky, A. S., & O’Rourke, K. (1987).
Experimenters cannot study an army or a business
Meta-analysis in clinical research. Annals of Internal
Medicine, 107, 224233.
corporation in the laboratory, but they can and do
Sharp, S. J., Thompson, S. G., & Altman, D. G. (1996). study authority structures, negotiation processes,
The relation between treatment benefit and underlying responses to legitimate or illegitimate orders, sta-
risk in meta-analysis. British Medical Journal, 313, tus generalization, and communication within
735738. networks. In other words, abstract features of
Laboratory Experiments 685

concrete structures such as the U.S. Army can be nowhere in nature, social science laboratories con-
studied experimentally. The results of laboratory tain situations isolating one or a few social pro-
research, with proper interpretation, can then be cesses for detailed study. The gain from this focus
applied in businesses, armies, or other situations is that experimental results are among the stron-
meeting the conditions of the theory being tested gest for testing hypotheses.
by the experiment. Most hypotheses, whether derived from general
theoretical principles or simply formulated ad hoc,
have the form ‘‘If A then B.’’ To test such a sen-
Laboratories as Created Situations
tence, treat ‘‘A’’ as an independent variable and
The essential character of a laboratory experiment ‘‘B’’ as a dependent variable. Finding that B is pres-
is that it creates an invented social situation that ent when A is also present gives some confidence
isolates theoretically important processes. Such the hypothesis is correct, but of course the concern
a situation is usually unlike any naturally occurring is that something else besides A better accounts for
situation so that complicated relationships can the presence of B. But when the copresence of
be disentangled. In an experiment, team partners A and B occur in a laboratory, an investigator has
might have to resolve many disagreements over had the opportunity to remove other possible can-
a collective task, or they might be asked to decide didates besides A from the situation. In a labora-
whether to offer a small gift to people who always tory test, the experimental situation creates A and
(or never) reciprocate. In such cases, an investiga- then measures to determine the existence or the
tor is efficiently studying things that occur natu- extent of B.
rally only occasionally, or that are hard to observe Another concern in natural settings is direction
in the complexity of normal social interaction. of causality. Finding A and B together is consistent
Bringing research into a laboratory allows an with (a) A causes B; (b) B causes A; or (c) some
investigator to simplify the complexity of social other factor C causes both A and B.
interaction to focus on the effects of one or a few Although we cannot ever observe causation,
social processes at a time. It also offers an oppor- even with laboratory data, such data can lend
tunity to improve data collection greatly using greater confidence in hypothesis (a). That is because
video and sound recordings, introduce question- an experimenter can introduce A before B occurs,
naires at various points, and interview participants thus making interpretation (b) unlikely. An experi-
about their interpretations of the situation and of menter can also simplify the laboratory situation to
people’s behavior in it. eliminate plausible Cs, even if some unknown fac-
Every element of the social structure, the inter- tor might still be present that is affecting B. In gen-
action conditions, and the independent variables is eral, the results from laboratory experiments help
included in the laboratory conditions because an in assessing the directionality of causation and elim-
investigator put it there. The same is true for the inating potential alternative explanations of
measurement operations used for the dependent observed outcomes.
variables. Well-designed experiments result from
a thorough understanding of the theoretical princi-
ples to be tested, long-term planning, and careful
Laboratories Abstract From Reality
attention to detail. Casually designed experiments
often produce results that are difficult to interpret, Experimental research requires conceptualizing
either because it is not clear exactly what hap- problems abstractly and generally. Many impor-
pened in them or because measurements seem to tant questions in social science are concrete or
be affected by unanticipated, perhaps inconsistent, unique, and therefore these questions do not
factors. lend themselves readily to laboratory experimen-
tation. For instance, the number of homeless
in the United States in a particular year is not
Strong Tests and Inferences
a question for laboratory methods. However,
All laboratories simplify nature. Just as chemis- effects of network structures, altruism, and moti-
try laboratories contain pure chemicals existing vational processes that might affect homelessness
686 Last Observation Carried Forward

are suitable for experimental study. Abstract and


general conceptualization loses many concrete LAST OBSERVATION
features of unique situations while gaining appli-
cability to other situations besides the one that
CARRIED FORWARD
initially sparked interest. From laboratory stud-
ies, we will never know how many Americans Last observation carried forward (LOCF) is
are homeless, but we might learn that creating a method of imputing missing data in longitu-
social networks of particular kinds is a good dinal studies. If a person drops out of a study
way to reduce that number. before it ends, then his or her last observed score
on the dependent variable is used for all subse-
quent (i.e., missing) observation points. LOCF is
used to maintain the sample size and to reduce
Ethics
the bias caused by the attrition of participants in
Ethical considerations are crucial in all social a study. This entry examines the rationale for,
science research. Unfortunately, some infamous problems associated with, and alternative to
cases in biomedical and even in social research LOCF.
have sometimes caused observers to associate
ethical malfeasance and laboratory experiments.
Such linkage is unwarranted. Protecting partici- Rationale
pants’ rights and privacy are essential parts of When participants drop out of longitudinal studies
any sort of research. Institutional Review Boards (i.e., ones that collect data at two or more time
(IRBs) oversee the protection of human research points), two different problems are introduced.
participants and the general rules of justice, First, the sample size of the study is reduced,
beneficence, and respect guide the treatment of which might decrease the power of the study, that
experimental participants. Experimental partici- is, its ability to detect a difference between groups
pants are volunteers who receive incentives (e.g., when one actually exists. This problem is relatively
money or course credit) for their work, and good easy to overcome by initially enrolling more parti-
research design includes full explanations and cipants than are actually needed to achieve a
beneficial learning experiences for participants. desired level of power, although this might result
There is no place for neglect or mistreatment of in extra cost and time. The second problem is
participants. a more serious one, and it is predicated on the
belief that people do not drop out of studies for
Murray Webster, Jr., and Jane Sell
trivial reasons. Patients in trials of a therapy might
See also Ethics in the Research Process; Experimental stop coming for return visits if they feel they have
Design; Experimenter Expectancy Effect; Quasi- improved and do not recognize any further bene-
Experimental Design fit to themselves from continuing their participa-
tion. More often, however, participants drop out
because they do not experience any improvement
in their condition, or they find the side effects of
Further Readings
the treatment to be more troubling than they are
Rashotte, L. S., Webster, M., & Whitmeyer, J. (2005). willing to tolerate. At the extreme, the patients
Pretesting experimental instructions. Sociological might not be available because they have died,
Methodology, 35, 151175. either because their condition worsened or, in rare
Webster, M., & Sell, J. (Eds.). (2007). Laboratory cases, because the ‘‘treatment’’ actually proved to
experiments in the social sciences. New York:
be fatal. Thus, those who remain in the trial and
Elsevier.
Webster, M., & Sell, J. (2006). Theory and
whose data are analyzed at the end reflect a biased
experimentation in the social sciences. In subset of all those who were enrolled. Compound-
W. Outhwaite & S. P. Turner (Eds.), The Sage ing the difficulty, the participants might drop out
handbook of social science methodology of the experimental and comparison groups at dif-
(pp. 190207). Thousand Oaks, CA: Sage. ferent rates, which biases the results even more.
Last Observation Carried Forward 687

Needless to say, the longer the trial and the more Problems
follow-up visits or interviews that are required, the
worse the problem of attrition becomes. In some Counterbalancing these advantages of LOCF are
clinical trials, drop-out rates approach 50% of several disadvantages. First, because all the missing
those who began the study. values for an individual are replaced with the same
LOCF is a method of data imputation, or number, the within-subject variability is artificially
‘‘filling in the blanks,’’ for data that are missing reduced. In turn, this reduces the estimate of the
because of attrition. This allows the data for all error and, because the within-person error contri-
participants to be used, ostensibly solving the butes to the denominator of any statistical test,
two problems of reduced sample size and biased it increases the likelihood of finding significance.
results. The method is quite simple, and consists of Thus, rather than being conservative, LOCF actu-
replacing all missing values of the dependent vari- ally might have a liberal bias and might lead to
able with the last value that was recorded for that erroneously significant results.
particular participant. The justification for using Second, just as LOCF assumes no additional
this technique is shown in Figure 1, where the left improvement for patients in the treatment condi-
axis represents symptoms, and lower scores are tion, it also assumes that those in the comparison
better. If the effect of the treatment is to reduce group will not change after they drop out of the
symptoms, then LOCF assumes that the person trial. However, for many conditions, there is a very
will not improve any more after dropping out of powerful placebo effect. In trials involving patients
the trial. Indeed, if the person discontinues very suffering from depression, up to 40% of those in
early, then there might not be any improvement the placebo arm of the study show significant
noted at all. This most probably underestimates improvement; and in studies of pain, this effect
the actual degree of improvement experienced by can be even stronger. Consequently, in underesti-
the patient and, thus, is a conservative bias; that is, mating the amount of change in the control group,
it works against the hypothesis that the interven- LOCF again might have a positive bias, favoring
tion works. If the findings of the study are that the rejection of the null hypothesis.
treatment does work, then the researcher can be Finally, LOCF should never be used when the
even more confident of the results. The same logic purpose of the intervention is to slow the rate
applies if the goal of treatment is to increase the of decline. For example, the so-called memory-
score on some scale; LOCF carries forward a smal- enhancing drugs slow the rate of memory loss for
ler improvement. patients suffering from mild or moderate demen-
tia. In Figure 1, the left axis would now represent
memory functioning, and thus lower scores are
worse. If a person drops out of the study, then
LOCF assumes no additional loss of functioning,
which biases the results in favor of the treatment.
20
In fact, the more people who drop out of the study,
Recorded Values and the earlier the drop-outs occur, the better the
15
drug looks. Consequently, LOCF introduces a very
strong liberal bias, which significantly overesti-
Score

10 LOCF mates the effectiveness of the drug.


Dropped Expected
5 Out Values
Alternatives
0
0 2 4 6 8 10
Fortunately, there are alternatives to LOCF. The
Follow-Up Visit most powerful is called growth curve analysis (also
known as latent growth modeling, latent curve
analysis, mixed-effects regression, hierarchical lin-
Figure 1 The Rationale for Last Observation Carried ear regression, and about half a dozen other
Forward names), which can be used for all people who have
688 Latent Growth Modeling

at least three data points. In essence, a regression This entry begins with discussions of fixed
line is fitted to each person’s data, and the slope and random effects and of time-varying and time-
and intercept of the line become the predictor vari- invariant predictors. Next, approaches are des-
ables in another regression. This allows one to cribed and an example of the modeling process is
determine whether the average slope of the line provided. Last, additional extensions of latent
differs between groups. This does not preserve as growth modeling and its use in future research are
many cases as LOCF, because those who drop out examined.
with fewer than three data points cannot be ana-
lyzed, but latent growth modeling does not intro-
duce the same biases as does LOCF.
Fixed Versus Random Effects
David L. Streiner To understand growth modeling, one needs to
understand the concepts of fixed effects and ran-
See also Bias; Latent Growth Modeling; Missing Data,
dom effects. In ordinary least-squares regression,
Imputation of
a fixed intercept and a slope for each predictor are
estimated. In growth modeling, it is often the case
that each person has a different intercept and
Further Readings
slope, which are called random effects. Consider
Molnar, F. J., Hutton, B., & Fergusson, D. (2008). Does a growth model of marital conflict reported by the
analysis using ‘‘last observation carried forward’’ mother across the first 12 months after the birth of
introduce bias in dementia research? Canadian a couple’s first child. Conflict might be measured
Medical Association Journal, 179, 751753. on a 0 to 10 scale right after the birth and then
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal
every 2 months for the first year. There are 7 time
data analysis: Modeling change and event occurrence.
New York: Oxford University Press.
points (0, 2, . . . , 12) and a regression of the con-
flict scores on time might be done. Hypothetical
results appear in Figure 1 in the graph labeled
Both Intercept and Slope Are Fixed. The results
reflect an intercept β0 ¼ 2.5 and a slope β1 ¼ 0.2.
LATENT GROWTH MODELING Thus, the conflict starts with an initial level of 2.5
and increases by 0.2 every 2 months. By the 12th
Latent growth modeling refers to a set of proce- month, the conflict would be moderate, 2.5 þ
dures for conducting longitudinal analysis. Statisti- 0.2 × 12 ¼ 4.9. These results are fixed effects.
cians refer to these procedures as mixed models. However, women might vary in both their inter-
Many social scientists label these methods as mul- cept and their slope.
tilevel analyses, and the label of hierarchical linear In contrast, the graph in Figure 1 labeled Ran-
models is used in education and related disciplines. dom Intercept and Fixed Slope allows for differ-
These procedures can be useful with static data ences in the initial level and results in parallel
where an individual response might be nested in lines. Mother A has the same intercept and slope
a family. Thus, a response might be explained as the fixed-effects model, 2.5 and 0.2, respec-
by individual characteristics, such as personality tively. All three have a slope of 0.2, but they vary
traits, or by a family-level characteristic, such as in their intercept (starting point). This random
family income. intercept model, by providing for individual differ-
Longitudinal applications differ from static appli- ences in the intercept, should fit the data for all the
cations in that there are repeated measurements of mothers better than the fixed model, but the
a variable for each individual. The repeated mea- requirement that all lines are parallel might be
surements are nested in the individual. Just as indivi- unreasonable. An alternative approach is illus-
duals in a family tend to be similar, repeated trated in the graph labeled Fixed Intercept and
measurements for the same individual tend to be Random Slope. Here, all the mothers have a fixed
similar. This lack of independence is handled by initial level of conflict, but they are allowed to
mixed models. have different slopes (growth rates).
Latent Growth Modeling 689

Random Intercept and Random Slope


Both Intercept and Slope Are Fixed
10
10
8
Conflict

6 8

Conflict
6
2
4
0
2
0 2 4 6 8 10 12
Months Since Birth of First Child 0
Mother A Mother C
0 2 4 6 8 10 12
Mother B Mother C’
Months Since Birth of First Child
Mother B’

Fixed Intercept and Random Slope Random Intercept and Fixed Slope
10 10
8 8
Conflict

6 Conflict 6
4 4
2 2 
0 0
0 2 4 6 8 10 12 0 2 4 6 8 10 12
Months Since Birth of First Child Months Since Birth of First Child
Mother A Mother C Mother A Mother C
Mother B Mother B

Figure 1 Hypothetical Growth Curves for Marital Conflict in First 12 Months After Birth of First Child

Finally, the graph labeled Random Intercept intercept and slope; here, these estimates, when
and Random Slope allows each mother to have they are random effects, are treated as outcomes.
her own intercept and her own slope. This graph Therefore, change in a variable is explained rather
is more complicated than the fully fixed graph, but than just a static level of a variable.
it might be a better fit to the data and seems realis- Latent growth modeling allows researchers to
tic. This model with a random intercept and a ran- explain such differences using two types of predic-
dom slope allows each mother to have a different tors (independent variables). These are called time-
initial level and a different growth trajectory. invariant covariates and time-varying covariates.
First, the intercept or slope might depend on time-
invariant covariates. These covariates are a con-
Time-Varying and Time-Invariant Predictors
stant value for each individual. For example, the
When either or both the intercept and slope are mother’s age at the birth and whether she was
treated as random effects, they are outcome vari- married at the time of birth are time invariant.
ables that call for an explanation. Why do some Time-varying covariates, by contrast, can have
mothers start with a high or low level of conflict? different values from one time to the next. Sup-
Why do mothers vary in their trajectory? The pose 6 months after the birth of a child, the father
intercept and slope become dependent variables. attends a childcare workshop and becomes a
Traditional regression analysis just estimates an more engaged parent. This would not predict her
690 Latent Growth Modeling

intercept—that was at month 0, but it might pre- Steps


dict a change in conflict at the 6th month and pos-
The first step involves examining the data in
sibly at subsequent months.
terms of distributions and level of measurement.
In evaluation research, an important time-
Mplus and some other SEM packages can estimate
varying covariate is the fidelity of implementa-
models with or without assuming normality and
tion. A researcher might have a 4-year program
regardless of the level of measurement. Mplus can
designed to improve math achievement. First, the
also adjust standard errors for clustering, which
researcher would estimate the fixed-effects growth
would happen if a few youth from each of many
curve. Second, he or she would test whether the
schools were sampled. Alternatively, the researcher
random effects were significant. Third, the rese-
might want to incorporate school as an explana-
archer would test whether time-invariant covari-
tory variable, perhaps including school-level
ates (intervention vs. control group) were
variables (e.g., time spent in physical education
significant. Fourth, he or she would test whether
curriculum).
year-to-year variability in the program fidelity had
It is useful to observe actual scores for a sample
a significant effect. The fidelity of implementation
of observations. Figure 2 presents a random sample
each year is a time-varying covariate. It might
of 10 youth. A BMI of around 20 would be a nor-
explain why some years, some students scored
mal weight; a BMI of 25 would be overweight. A
much higher or lower on math achievement than
BMI of 30 or higher would be obese. What is
their overall growth trajectory.
revealed in Figure 2? First, a random intercept is
probably needed given that the sample of 10 youth
Approaches and Example varies from around a BMI of 12 to an initial BMI
of around 30. What about the slope? These 10
Once one understands the concept of growth curve
observations do not tell much, although one might
modeling, a statistical software package is needed
observe a general positive slope and that there is
to estimate the models. Most statisticians use
substantial variance in the growth rate. One youth
commands available in Stata, SAS, or R to esti-
who started at a BMI of 30 increased to one of 50
mate mixed models. Researchers in education and
(morbidly obese) by the age of 18; other youth
related disciplines often use a program called
have a slope of about zero (flat). Based on these 10
HLM. Here, an alternative approach that focuses
observations, one would expect that both the inter-
on the latent variable modeling approach to these
cept and the slope have a significant random effect.
questions is presented. This approach is an exten-
sion of structural equation modeling (SEM) in
which the intercept and slope are conceptualized
as latent growth factors. SEM software (Mplus, 70
LISREL, EQS, and to a limited extent AMOS) 60
have varying strengths for this approach. The
models estimated here use the Mplus program 50
because it is arguably the most comprehensive and
40
the most rapidly developing. The Mplus program
BMI

is available at the Mplus website, where there is 30


extensive documentation and online training. To
20
introduce latent growth modeling, data from the
National Longitudinal Survey of Youth, 1997 is 10
used. The focus is on the body mass index (BMI)
of a panel of youth who were 12 years old in 0
0 1 2 3 4 5 6 7
1997. They were followed until they were 18 in Age 12–18
2003. There are limitations to the BMI measure,
especially during adolescence, but the purpose
is only to use it as an illustration of growth Figure 2 Random Sample of the BMI Actual Scores
modeling. for 10 Youth
Latent Growth Modeling 691

The next step is developing a simple latent changes for each unit change in the independent
growth model. The simplest is a linear growth variable. How is this translated to identify the
curve (called a curve, but a linear growth curve is latent slope growth factor? There are BMI mea-
actually a straight line requiring just an intercept surements for 7 consecutive years, 19972003.
factor and a linear slope factor). Figure 3 presents Because each of these is a 1-year change, load-
how a linear growth curve model can be drawn. ings of 0, 1, 2, 3, 4, 5, and 6 can be used, as
This figure is simpler than it might appear. The illustrated by the solid lines with an arrow going
oval labeled ‘‘intercept’’ is the latent intercept from the latent slope growth factor to each
growth factor. It represents the initial level of the year’s measurement of BMI. Other fixed loadings
growth curve. Based on the sample of 10 youth, might be appropriate. If no data were collected
one might guess this will have a value of just more in 2000 and 2002, there would be five waves
than 20. This value is the estimated initial and loadings of 0, 1, 2, 4, and 6 could be used,
Mintercept. The other oval, labeled ‘‘slope,’’ is the simply dropping the missing years. One might
latent slope growth factor. It represents how much want the intercept to represent the final level
the BMI increases (or decreases) each year. Using of the variable and use loadings of 6, 5, 4,
the sample of 10 youth, one might guess that this 3, 2, 1, and 0, or put the intercept in the
will be a small positive number, perhaps around middle using loadings of 3, 2, 1, 0, 1, 2,
0.5. This value is the Mslope. and 3. In the Mplus program, it is also possible
The sample of 10 youth indicates that there is for each participant to have a different time span
variation around both the mean latent intercept between measurements. John’s BMI98 (BMI
growth factor and the mean latent slope growth measurement in 1998) might be 14 months after
factor. This variance is the random-effect com- his first measurement if there was some delay in
ponent and is represented by the circles above data collection. His BMI99 might be only 10
the ovals, labeled Ri (residual variance of latent months after his second wave.
intercept growth factor) and Rs (residual vari- The observed measurements appear in the rect-
ance of latent slope growth factor). If one of angular boxes and are labeled BMI97 to BMI03.
these variances is not significantly greater than Figure 3 has circles at the bottom labeled E97 to
zero, then that factor could be treated as a fixed E03. These represent measurement error. SEM
effect. The curved line with an arrow at each software varies widely in how it programs a figure
end connecting Ri and Rs is the correlation of like this. The key part of the program in Mplus is
the latent intercept growth factor and latent
slope growth factor. A positive correlation would intercept slope | BMI97@0 BMI98@1
indicate that people who start with a high BMI BMI99@2 BMI@3 BMI@4 BMI@;5 BMI@6.
(intercept) have a more rapidly increasing BMI
(slope) than people who start with a low BMI. Such The first name, intercept, which could be
a positive correlation is unfortunate both for anything such as i or alpha, will always be the
youth who have a very high initial BMI (for whom intercept. The second name, slope, which could
a bigger slope is extremely problematic) and for be anything such as s or beta, will always be the
youth with a very low BMI (who have a low or linear slope. The logical ‘‘or bar |’’ tells the pro-
even negative slope). gram this is a growth curve. Each path from the
How is the intercept identified? The intercept intercept to the observed scores must be 1.0. One
is often referred to as the constant. It represents only needs to specify the loadings for the slope
the base value to which some amount is added (BMI97 is set at 0, BMI98 is set at 1, etc.). The
or subtracted for each unit increase in the predic- Mplus program reads this single line and knows
tor. This constant base value is identified by hav- that the model in Figure 3 is being estimated. It is
ing a fixed loading of 1.0 from it to each year’s possible to override the assumptions of the
measurement of BMI. These lines are the dashed program, such as specifying that the residual is not
lines with an arrow going from the latent inter- correlated or that some the measurement errors
cept to the individual BMI scores. The tradi- are correlated. These assumptions depend on res-
tional meaning of a slope is how much a variable earch goals and hypotheses.
692 Latent Growth Modeling

interprets them. The estimated


BMI ¼ 21.035 þ .701 × Year.
R1 R2 Thus, one can estimate a youth
has a BMI of 21.035 (z ¼
210.352, p < .001) at the age of
12 (well within the normal range),
Intercept Slope but this grows by .701 (z ¼
40.663, p < .001) each year. By
2003, the estimated BMI is 21.035
1 1 1 þ .701 × 6 ¼ 25.241. Thus, by
1 1 1 1 the time they are 18, the expected
BMI is into the overweight range.
An increase in BMI of .701 each
1 2 3 4 5 6
0 year is substantial.
BMI97 BMI98 BMI99 BMI100 BMI01 BMI02 BMI03 The variance of the intercept and
slope represents the size of the ran-
dom effects. The variance of the
intercept is 15.051, z ¼ 25.209,
p < .001, and the variance of the
E97 E98 E99 E00 E01 E02 E03 slope is .255, z ¼ 14.228, p <
.001. Both of these values are statis-
tically significant. The z tests are
Figure 3 A Linear Growth Curve Model of BMI Between Age 12 and 18 problematic tests of significance bec-
ause the variance can never be nega-
tive. A better test, not reported here,
After running the program, information that involves estimating two models: one model that
helps in evaluating this simple growth curve is fixes the variances at zero and a second model that
obtained. First, a chi-square of 268.04, p < .000, allows the variances to be freely estimated. These
is obtained. This tests whether the simple growth models are then compared based on the difference
curve fits the data perfectly. The significant chi- in chi-square and degrees of freedom. If the model
square says it does not. Because this is only with the variances estimated has a significantly
a model, it is not surprising that it does not fit smaller chi-square (calculated by subtraction), then
the data perfectly. A model can be very good a random-effects model is appropriate. A more
and useful without perfectly reproducing all the elaborate test would compare the model with fixed
observed data. intercept variance first with a model with a random
How close of a fit does the model in this exam- intercept, then a model with both a random inter-
ple have? Among the many possible measures of cept and a random slope. These model compari-
closeness of fit, there is a comparative fit index sons are recommended whenever the z tests are
(CFI) of .98, a root mean square error of approxi- only marginally significant.
mation (RMSEA) of .078, and a standardized Can the variances be interpreted? The variances
mean square residual (SRMR) of .051. The CFI can be converted to standard deviations by taking
should be more than .95, the RMSEA should be their square roots. The square root of 15.051 ¼
less than .06, and the SRMR should be less than 3.880. This means two thirds of the youth have an
.08. The results in this example are not too bad, intercept within one standard deviation of 21.035,
but the RMSEA is probably the most respected and about 95% have an intercept within two stan-
measure of a close fit and it is too large. dard deviations of 21.035. The same thing can be
What do the results reveal about the growth done for the variance of the slope. Its square root
trajectory? The mean intercept is 21.035. The is 0.505. This shows that people can have a slope
mean slope is .701. These are unstandardized of .701 ± 1.10, and this covers everything from
values, and one needs to be careful how one a dramatic increase in BMI to an estimated
Latent Growth Modeling 693

28 Consider how Figure 3 would look with three


27 waves of data. The loadings for the intercept and
26
25
the slope are fixed. Three error terms, E97, E98,
24 E99, the Mintercept, the Mslope, variances, Ri and
23 Rs , as well as their covariance would need to be
22
21
estimated. This gives a total of eight parameters to
20 be estimated. Degrees of freedom are calculated by
19 subtracting the number of parameters being esti-
18
17
mated, eight, from the number of pieces of infor-
16 mation obtained, nine, meaning there is just one
BMI97

BMI98

BMI99

BMI00

BMI01

BMI02

BMI03
degree of freedom. This does not provide a very
rigorous test of a model, but it demonstrates that
Sample Means, General Estimated Means, General it is possible to estimate a linear growth curve with
just three waves of data.
What if there were four waves? If one counted
Figure 4 Plot of BMI Growth Trajectory for Actual the means, variances, and covariances, there would
and Estimated Means be 14 pieces of information instead of 9. However,
only one more parameter, E00, would be esti-
mated, so there would be 14  9 ¼ 5 degrees of
decrease in BMI. Clearly, there is an important
freedom. This gives a better test of a linear model.
random effect for both the intercept and the slope.
It also allows a quadratic term to be estimated to
The covariance between the intercept and slope
fit a curve. Adding a quadratic adds four para-
is .408, z ¼ 5.559, p < .001 and the correlation
meters: Mquadratic, RQ, and the covariances of the
is r ¼ .208, z ¼ 5.195, p < .001. [Mplus has
quadratic with both the intercept and the linear
a different test of significance for unstandardized
growth factors. It is good to have four waves of
and standardized coefficients, see Muthen (2007).]
data for a linear growth curve, although three is
This correlation between the intercept and slope
the minimum and it is good to have at least five
indicates exactly the area of concern, namely, that
waves of data for a nonlinear growth curve,
the youth who had the highest initial BMI scores
although four is the minimum.
also had the highest rate of growth in their BMI.
A plot of the linear latent growth curve can also
be examined. In Figure 4, it can be observed that Time-Invariant Covariates
a straight line slightly overestimates the initial
mean and slightly underestimates the final mean. Whenever there is a significant variance in the
This suggests that a quadratic can be added to the intercept or slope, these random effects should be
growth curve to capture the curvature in the explained. For example, whites and nonwhites
observed data. When this is done, an excellent fit might be compared on their BMI. In this example,
to the data is obtained. after dropping Asians and Pacific Islanders, the
nonwhites are primarily African Americans and
Latinos. Whites might have a lower intercept and
a flatter slope then nonwhites in their BMI. If
How Many Waves of Data Are Needed?
this were true, then race/ethnicity would explain
A latent growth model tries to reproduce the a portion of the random effects. Race/ethnicity is
summary statistics describing the data. If there are a time-invariant covariate.
just three waves of data, then the researcher would Consider emotional problems as a covariate
have three means, M97, M98, M99; three variances, that might explain some of the variance in the ran-
Var(BMI97), Var(BMI98), and Var(BMI99); and dom effects. If this is measured at age 12 and not
three covariances Cov(BMI97, BMI98), Cov(BMI97, again, it would be treated as a time-invariant cov-
BMI98), and Cov(BMI98, BMI99) for a total of nine ariate. Children who have a high score on emo-
pieces of information he or she is trying to repro- tional problems at age 12 might have a different
duce. How many parameters are being estimated? growth trajectory than children who have a low
694 Latent Growth Modeling

like the measures of BMI because it


Youth E1 is a single score. The emotional pro-
Emotional
White
Problems
blems variable is presented in an
Parent E2 oval because it is a latent variable
(factor) that has two indicators,
the youths’ report and the parents’
report. Both emotional problems
and white race have lines with
arrows going from them to the
R1 R2 R3 intercept and slope growth factors.
This figure also has a quadratic
slope factor with the loadings on
BMI97 to BMI03 coded with the
Intercept Linear Slope Quadratic
square of the corresponding load-
ings for the linear slope factor.
0 The results of this model can be
4 9 16 25 36
1 1 1 1 used to test whether the trajectory
1 1 1 1 is conditional on these two time
invariant covariates. One also could
6
use the R2 for the intercept and
0 1 2 3 4 5
slope factors to see how much vari-
BMI97 BMI98 BMI99 BMI100 BMI01 BMI02 BMI03 ance in the random effects these
covariate can explain.

Time-Varying Covariates
E97 E98 E99 E00 E01 E02 E03
In Figure 6, time-invariant cov-
ariates are represented by the rect-
angle labeled W. This represents
Figure 5 Latent Growth Curve With Two Time-Invariant Covariates a vector of possible time-invariant
covariates that will influence the
growth trajectory. It is possible to
score on emotional problems. Alternatively, if extend this to include time-varying covariates.
emotional problems are measured at each wave, it Time-varying covariates either are measured after
would be a time-varying covariate. In this section, the process has started or have a value that
race/ethnicity and age 12 emotional problems as changes (hours of nutrition education or level of
time-invariant covariates are considered. There are program fidelity) from wave to wave. Although
other time-invariant covariates that should be con- output is not shown, Figure 5 illustrates the use
sidered, all of which would be measured just one of time-varying covariates. In Figure 6, the time-
time when the youth was 12 years old. For exam- varying covariates A1 to A6 might be the number
ple, mothers’ and fathers’ BMI, knowledge of food of hours of curriculum devoted to nutrition educa-
choices, proximity of home to a fast-food restau- tion each year.
rant, and many other time-invariant covariates Time-varying covariates do not have a direct
could be measured. influence on the intercept or slope. For example,
Figure 5 shows the example model with these the amount of nutrition education youth received
two covariates added. This has been called the con- in 2003 could influence neither their initial BMI
ditional latent trajectory modeling because the ini- in 1997 nor their growth rate in earlier years.
tial level and trajectory (slope) are conditional on Instead, the hours of curriculum devoted to nutri-
other variables. White is a binary variable coded tion education each year would provide a direct
0 for nonwhite and 1 for white. It is in a rectangle effect on their BMI that year. A year with a strong
Latent Growth Modeling 695

long-term problems in careers, fam-


E1 E2 E3 E4 E5 E6 ily relations, their role as parents,
and health risks. One could add
these distal variables with direct
Y1 Y2 Y3 Y4 Y5 Y6 effects coming from the intercept
and slope growth factors to these
distal variables.
Parallel processes are possible.
For example, one might have a
growth curve for growth in conflict
Intercept
between parents and a parallel
growth curve for growth in conflict
the youth has with peers or tea-
Slope chers. High initial parental conflict
could lead to a steeper slope for
conflict the youth has with peers or
teachers. The level of initial conflict
A1 A2 A3 A4 A5 A6
and the growth in conflict for both
variables could have long-term
W effects on a distal outcome, such as
the youths’ marital or parenting
relations after they get married and
have children.
Also, growth curves need not be
Figure 6 Latent Growth Model With Time-Invariant and Time- limited to continuous variables.
Variant Covariates Growth could be a binary variable
using a logit model where the
growth in the probability of some
curriculum might result in a lower BMI, and a year outcome is estimated. For example, a school might
with little or no curriculum might result in a higher introduce a program designed to stop smoking.
BMI. Thus, yearly curriculum would be explaining Initially, there might be a high probability of
departures from the overall growth trajectory smoking, but this probability might decrease
rather than the trajectory itself. To explain the ran- across the course of the program. Time-variant
dom effects even more, additional paths could be (e.g., fidelity) and time-invariant covariates (e.g.,
added such as from the curriculum scores to the gender) could be included in the model of this pro-
BMI in subsequent years. cess to identify what predicts the program’s
effectiveness.
Other times, a count variable might be pre-
Extensions
dicted. Whether a smoking cessation intervention
Latent growth modeling is a rapidly developing eliminates smoking, it might reduce the number
area, and researchers have only scratched the sur- of cigarettes students smoke. Here, a Poisson or
face of the potential it has for social science negative binomial model might be used to fit the
research. Along with the programs and data for data. Other times, researchers might be interested
the models discussed so far, several extensions are in both the binary growth curve and the count
available, a few of which are mentioned here. growth curve. These could be treated as parallel
The latent growth curve model could be a part growth curves. It might be that some time-invariant
of a larger model. For example, the initial level and some time-variant covariates predict the binary
and rate of change in BMI could lead to distal out- or the count components differentially. Peers who
comes. Youth who have high initial BMIs and for smoke might make it difficult to stop smoking
whom this value is rapidly increasing might have (binary component), but peer influence might not
696 Latent Variable

have as much effect on the number of cigarettes See also Growth Curve; Structural Equation Modeling
smoked (count component). Finding such distinc-
tions can provide a much better understanding of
Further Readings
an intervention’s effectiveness and can give ideas
for how to improve the intervention. Barrett, P. (2006). Structural equation modeling:
The last of many possibilities to be mentioned Adjudging model fit. Personality and Individual
here involves applications of growth mixture mod- Differences, 42, 815824.
els. A sample can be treated as representing a single Bollen, K. A., & Curran, P. J. (2006). Latent curve
population, when there might be multiple popula- models: A structural equation perspective. Hoboken,
NJ: John Wiley.
tions represented and these multiple populations
Center for Human Resources. (2007). NLSY97 user’s
might have sharp differences. If one were to create guide: A guide to rounds 19 data. Washington, DC:
a growth curve of abusive drinking behavior from U.S. Department of Labor.
age 18 to 37, one will find a growth curve that Curran, F. J., & Hussong, A. M. (2003). The use of latent
generally increases from age 18 to 23 and then trajectory models in psychopathology research.
decreases after that. However, this overall growth Journal of Abnormal Psychology, 112, 526544.
model might not fit the population very well. Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006).
Why? What about people who never drink? These An introduction to latent variable growth curve
people have an intercept at or close to zero and modeling: Concepts, issues, and applications (2nd ed.).
Mahwah, NJ: Lawrence Erlbaum.
a flat slope that is near zero. This is an identifiable
Latent growth curves. Retrieved July 2, 2009, from
population for which the overall growth curve is
http://oregonstate.edu/~acock/growth
inapplicable. What about alcoholics? They might Mplus homepage. Retrieved January 27, 2010, from
be similar to the overall pattern up to age 23, but http://www.statmodel.com
then they do not decrease. Mixture models seek to Muthen, B. (2007). Mplus technical appendices:
find clusters of people who have homogeneous Standardized coefficients and their standard errors.
growth trajectories. Bengt Muthen and Linda Los Angeles. Retrieved January 27, 2010, from http://
Muthen applied a growth mixture model and were statmodel.com/download/techappen.pdf
able to identify three clusters of people that we Muthen, B., & Muthen, L. (2000). The development of
can label the normative group, the nondrinkers, heavy drinking and alcohol-related problems from
ages 18 to 37 in a U.S. national sample. Journal of
and the likely alcoholics. Once identifying group
Studies of Alcohol, 70, 290300.
membership, a profile analysis can be performed
Schreiber, J., Stage, R., King, J., Nora, A., & Barlow, E.
to evaluate how these groups differ on other vari- (2006). Reporting structural equation modeling and
ables. The same intervention would not work for confirmatory factor analysis results: A review. The
the alcoholic that works for the normative group, Journal of Educational Research, 99, 323337.
and it is not cost effective to have an intervention Yu, C. Y. (2002). Evaluating cutoff criteria of model fit
on the group of nondrinkers. indices for latent variable models with binary and
continuous outcomes (Unpublished dissertation).
University of California, Los Angeles.
Future Directions
Latent growth curve modeling is one of the most
important advances in the treasure chest of research LATENT VARIABLE
methods that has developed in the last 20 years. It
allows researchers to focus on change, what
explains the rate of change, and the consequences A latent variable is a variable that cannot be
of change. It is applicable across a wide range of observed. The presence of latent variables, how-
subject areas and can be applied to data of all levels ever, can be detected by their effects on variables
of measurement. It is also an area of rapid develop- that are observable. Most constructs in research
ment and will likely continue to change the way are latent variables. Consider the psychological
researchers work with longitudinal data. construct of anxiety, for example. Any single
observable measure of anxiety, whether it is a self-
Alan C. Acock report measure or an observational scale, cannot
Latent Variable 697

provide a pure measure of anxiety. Observable indicators can be accounted for by a relatively
variables are affected by measurement error. Mea- small number of latent variables.
surement error refers to the fact that scores often The measure of the degree to which an indicator
will not be identical if the same measure is given is associated with a latent variable is the indicator’s
on two occasions or if equivalent forms of the loading on the latent variable. An inspection of the
measure are given on a single occasion. In addi- pattern of loadings and other statistics is used to
tion, most observable variables are affected by identify latent variables and the observed variables
method variance, with the results obtained using that are associated with them. Principal compo-
a method such as self-report often differing from nents are latent variables that are obtained from an
the results obtained using a different method such analysis of a typical correlation matrix with 1s on
as an observational rating scale. Latent variable the diagonal. Because the variance on the diagonal
methodologies provide a means of extracting a rel- of a correlation matrix is a composite of common
atively pure measure of a construct from observed variance and unique variance including measure-
variables, one that is uncontaminated by measure- ment error, principal components differ from fac-
ment error and method variance. The basic idea is tors in that they capture unique as well as shared
to capture the common or shared variance among variance among the indicators. Because all variance
multiple observable variables or indicators of is included in the analysis and exact scores are
a construct. Because measurement error is by defi- available, principal components analysis primarily
nition unique variance, it is not captured in the is useful for ‘‘boiling down’’ a large number of
latent variable. Technically, this is true only when observed variables into a manageable number of
the observed indicators are (a) obtained in differ- principal components.
ent measurement occasions, (b) have different con- In contrast, the factors that result from factor
tent, and (c) have different raters if subjective analysis are latent variables obtained from an analy-
scoring is involved. Otherwise, they will share a sis of a correlation matrix after replacing the 1s on
source of measurement error that can be captured the diagonal with estimates of each observed vari-
by a latent variable. When the observed indicators able’s shared variance with the other variables in
represent multiple methods, the latent variables the analysis. Consequently, factors capture only the
also can be measured relatively free of method var- common variance among observed variables and
iance. This entry discusses two types of methods exclude measurement error. Because of this, princi-
for obtaining latent variables: exploratory and pal factor analysis is better for exploring the under-
confirmatory. In addition, this entry explores the lying factor structure of a set of observed variables.
use of latent variables in future research.
Confirmatory Methods
Exploratory Methods
Latent variables can also be identified using confir-
Latent variables are linear composites of observed matory methods such as confirmatory factor anal-
variables. They can be obtained by exploratory or ysis and structure equation models with latent
confirmatory methods. Two common exploratory variables, and this is where the real power of latent
methods for obtaining latent variables are factor variables is unleashed. Similar to exploratory fac-
analysis and principal components analysis. Both tor analysis, confirmatory factor analysis captures
approaches are exploratory in that no hypotheses the common variance among observed variables.
typically are proposed in advance about the num- However, predictions about the number of latent
ber of latent variables or which indicators will be variables and about which observed indicators are
associated with which latent variables. In fact, the associated with them are made a priori (i.e., prior
full solutions of factor analyses and principal com- to looking at the results) based on theory and prior
ponents analyses have as many latent variables as research. Typically, observed indicators are only
there are observed indicators and allow all indica- associated with a single latent variable when con-
tors to be associated with all latent variables. firmatory methods are used. The a priori predic-
What makes exploratory methods useful is when tions about the number of latent variables and
most of the shared variance among observed which indicators are associated with them can be
698 Latin Square Design

tested by examining how well the specified model


fits the data. The use of a priori predictions and LATIN SQUARE DESIGN
the ability to test how well the model fits the data
are important advantages over exploratory meth- In general, a Latin square of order n is an n × n
ods for obtaining latent variables. square such that each row (and each column) is
One issue that must be considered is that it is a permutation (or an arrangement) of the same n
possible to propose a confirmatory latent variable distinct elements. Suppose you lead a team of four
model that does not have a unique solution. This chess players to play four rounds of chess against
is referred to as an underidentified model. Underi- another team of four players. If each player must
dentified confirmatory factor analysis models can play against a different player in each round, a pos-
usually be avoided by having at least three obser- sible schedule could be as follows:
ved indicators for a model with a single latent vari-
able and at least two observed indicators for each
latent variable in a model with two or more latent (a)
variables, provided that they are allowed to be cor-
Team A 1 2 3 4
related with one another.
Round 1 1 3 4 2

Future Research Round 2 4 2 1 3

Until recently, latent variables have been continu- Round 3 2 4 3 1


ous rather than categorical variables. With the
development of categorical latent variables, latent Round 4 3 1 2 4
variables are proving to be a powerful new tool
for identifying mixtures of different kinds of peo-
ple or subtypes of various syndromes. Models with Here, we assumed that the players are numbered 1
categorical latent variables (e.g., mixture models) to 4 in each team. For instance, in round 1, player
are replacing cluster analysis as the method of 1 of team A will play 1 of team B, player 2 of team
choice for categorizing people. Other recent advan- A will play 3 of team B, and so on.
ces in latent variable analysis include latent Suppose you like to test four types of fertilizers
growth models, transition mixture models, and on tomatoes planned in your garden. To reduce the
multi-level forms of all of the models described effect of soils and watering in your experiment, you
previously. might choose 16 tomatoes planned in a 4 × 4 square
(i.e., four rows and four columns), such that each
Richard Wagner, Patricia Thatcher Kantor, row and each column has exactly one plan that uses
and Shayne Piasta one type of the fertilizers. Let the fertilizers denoted
by A, B, C, and D, then one possible experiment is
See also Confirmatory Factor Analysis; Exploratory denoted by the following square on the left.
Factor Analysis; Principal Components Analysis;
Structural Equation Modeling

1 2 3 4 1 2 3 4

Further Readings 1 A C D B 1 1 3 4 2

Hancock, G. R., & Mueller, R. O. (2006). Structural 2 D B A C 2 4 2 3 1


equation modeling: A second course. Greenwich, CT:
Information Age Publishing. 3 B D C A 3 2 4 3 1
Kline, R. B. (2005). Principles and practice of structural
equation modeling (2nd ed.). New York: Guilford. 4 C A B D 4 3 1 2 4
Schumacker, R. E., & Lomax, R. G. (1996). A beginner’s
guide to structural equation modeling. Mahwah, NJ: (a) (b)
Lawrence Erlbaum.
Latin Square Design 699

This table tells us that the plant at row 1 and The left-cancellation law states that no symbol
column 1 will use fertilizer A, the plant at row 1 appears in any column more than once, and the
and column 2 will use fertilizer C, and so on. If we right-cancellation law states that no symbol
rename A to 1, B to 2, C to 3, and D to 4, then we appears in any row more than once. The unique-
obtain a square in the (b) portion of the table, image property states that each cell of the square
which is identical to the square in the chess can hold at most one symbol.
schedule. A Latin square can be also defined by a set of
In mathematics, all three squares in the previous triples. Let us look at the first Latin square [i.e.
section (without the row and column names) are square (a)].
called a Latin square of order four. The name
Latin square originates from mathematicians of
the 19th century like Leonhard Euler, who used 1 2 3 1 2 3
Latin characters as symbols.
2 3 1 3 1 2

3 1 2 2 3 1
Various Definitions (a) (b)

One convenience of using the same set of elements


1 3 2 1 3 2
for both the row and column indices and the ele-
ments inside the square is that we might regard 2 1 3 3 2 1
? as a function ? defined on the set {1,2,3,4}: ?
(1,1) ¼ 1, ? (1,2) ¼ 3, and so on. Of course, we 3 2 1 2 1 3
might write it as 1?1 ¼ 1 and 1?2 ¼ 3, and we (c) (d)
can also show that (1?3)?4 ¼ 4. This square pro-
vides a definition of ?, and the square (b) in the
previous section is called ‘‘the multiplication table’’ The triple representation of square (a) is
of ?.
In mathematics, if the multiplication table of fh1, 1, 1i, h1, 2, 2i, h1, 3, 3i, h2, 1, 2i, h2, 2, 3i,
a binary function, say *, is a Latin square, then h2, 3, 1i, h3, 1, 3i, h3, 2, 1i, h3, 3, 2ig:
that function, together with the set of the elements,
is called quasigroup. In contrast, if *is a binary
The meaning of a triple hx, y, zi is that the entry
function over the set
value at row x and column y is z (i.e., x *y ¼ z, if
the Latin square is the multiplication table of *).
S ¼ f1; 2; . . . ; ng
The definition of a Latin square of order n can be
a set of n2 integer triples hx, y, zi, where 1 ≤ x, y,
and satisfies the following properties, then the
z ≤ n, such that
multiplication table * is a Latin square of order n.
For all elements w, x, y, z ∈ S,
• All the pairs hx, zi are different (the left-
cancellation law).
ðx * w ¼ z, x * y ¼ zÞ ) w ¼ y : • All the pairs hy, zi are different (the right-
ð1Þ
the left-cancellation law; cancellation law).
• All the pairs hx, yi are different (the unique-
image property).
ðx * y ¼ z, x * y ¼ zÞ ) w ¼ x :
ð2Þ The quasigroup definition and the triple repre-
the right-cancellation law;
sentation show that rows, columns, and entries
play similar roles in a Latin square. That is, a Latin
square can be viewed as a cube, where the dimen-
ðx * y ¼ w, x * y ¼ zÞ ) w ¼ z :
ð3Þ sions are row, column, and entry value. If two
the unique-image property: dimensions in the cube are fixed, then there will be
700 Latin Square Design

exactly one number in the remaining dimension. Conjugates of a Latin Square


Moreover, if a question or a solution exists for
a certain kind of Latin squares for one dimension, Given a Latin square, if we switch row 1 with
then more questions or solutions might be gener- column 1, row 2 with column 2; . . . ; and row n
ated by switching dimensions. with column n, then we obtain the transpose of
the original Latin square, which is also a Latin
square. If the triple representation of this Latin
square is S, then the triple representation of the
Counting Latin Squares transpose is T ¼ {hx2,x1,x3i|hx1,x2,x3i ∈ S}. For
the three elements hx1,x2,x3i, we have six per-
Let us first count Latin squares for some very small
mutations, each of which will produce a Latin
orders. For order one, we have only one Latin
square, which is called a conjugate. Formally,
square. For order two, we have two (one is identi-
the transpose T is the (2,1,3) conjugate;
cal to the other by switching rows or columns).
For order three, we have 12 (four of them are U ¼ fhx1 , x3 , x2 ijhx1 , x2 , x3 i ∈ Sg is the ð1, 3, 2Þ
listed in the previous section). For order four, we conjugate of S;
have 576 Latin squares. A Latin square is said be
V ¼ fhx2 , x1 , x3 ijhx1 , x2 , x3 i ∈ Sg is the ð2, 1, 3Þ
reduced if in the first row and the first column the
conjugate of S;
elements occur in natural order. If R(n) denotes the
number of reduced Latin square of order n, then W ¼ fhx2 , x3 , x1 ijhx1 , x2 , x3 i ∈ Sg is the ð3, 1, 2Þ
the total number of Latin squares, N(n), is given conjugate of S;
by the following formula: X ¼ fhx3 , x1 , x3 ijhx1 , x2 , x3 i ∈ Sg is the ð3, 2, 1Þ
conjugate of S;
NðnÞ ¼ n!ðn  1Þ!RðnÞ, Y ¼ fhx3 , x2 , x1 ijhx1 , x2 , x3 i ∈ Sg is the ð1, 2, 3Þ
conjugate of S:
where n! ¼ n *(n  1) *(n  2) . . . *2 *1 is the
factorial of n. From this formula, we can deduce Needless to say, the (1,2,3) conjugate of S is S
that the numbers of reduced Latin squares are itself. The six conjugates of a small Latin square
1,1,3,4, for order 1,2,3,4, respectively. are as follows:
Given a Latin square, we might switch rows
to obtain another Latin square. For instance, 1423 1243 1324 1243 1342 1324
Latin square (b) is obtained from (a) by switch- 2314 4312 2413 3421 3124 3142
ing rows 2 and 3. Similarly, (c) is obtained from 4132 2134 4231 2134 2431 4231
(a) by switching columns 2 and 3, and (d) is 3241 3421 3142 4312 4213 2413
obtained from (a) by switching symbols 2 and 3. (a) (b) (c) (d) (e) (f)
Of course, more than two rows might be
involved in a switch. For example, row 1 can be (a) a Latin square, (b) its (2,1,3) conjugate, (c) its
changed to row 2, row 2 to row 3, and row 3 to (3,2,1) conjugate, (d) its (2,3,1) conjugate, (e) its
row 1. Two Lain squares are said be to isotopic (1,3,2) conjugate, and (f) its (3,1,2) conjugate.
if one becomes the other by switching rows, col- Some conjugates might be identical to the
umns, or symbols. Isotopic Latin squares can be original Latin square. For instance, a symmetric
merged into one class, which is called the iso- Latin square [like square (a) of order three] is
topy class. identical to its (2,1,3) conjugate. Two Latin
The following table gives the numbers of noniso- squares are said be parastrophic if one of them is
topic and reduced Latin squares of order up to ten. a conjugate of the other. Two Latin squares are

n= 1 2 3 4 5 6 7 8 9 10
Nonisotopic 1 1 1 2 2 22 563 1,676,267 115,618,721,533 208,904,371,354,363,006
Reduced 1 1 1 4 56 9,408 16,942,080 535,281,401,856 3.77 × 1018 7.58 × 1025
Latin Square Design 701

said be paratopic if one of them is isotopic to Partial Latin Squares


a conjugate of the other. Like the isotopic rela-
tion, both the parastrophic and paratopic rela- A partial Latin square is a square such that each
tions are equivalence relations. For orders of no entry in the square contains either a symbol or is
less than six, the number of nonparatopic Latin empty, and no symbol occurs more than once in
squares is less than that of nonisotopic Latin any row or column. Given a partial Latin square,
squares. we often ask whether the empty entries can
be filled to form a complete Latin square. For
instance, it is known that a partial Latin square of
order n with at most n  1 filled entries can
Transversals of a Latin Square always be completed to a Latin square of order n.
If in a partial Latin square, 1*1 ¼ 1 and 2*2 ¼
Given a Latin square of order n, a transversal of
2, then we cannot complete this Latin square if the
the square is a set S of n entries, one selected from
order is just two.
each row and each column such that no two
The most interesting example of the Latin
entries contain the same symbol. For instance, four
square completion problem is perhaps the Sudoku
transversals of the Latin square (a) are shown in
puzzle, which appears in numerous newspapers
the following as (b) through (e): and magazines. The most popular form of Sudoku
has a 9 × 9 grid made up of nine 3 × 3 subgrids
1 3 4 2 1 3 called ‘‘regions.’’ In addition to the constraints that
4 2 1 3 3 1 every row and every column is a permutation of 1
2 4 3 1 4 2 through 9, each region is also a permutation of 1
3 1 2 4 2 4 through 9. The less the number of filled entries
(a) (b) (c) (also called hints) in a Sudoku puzzle, the more
difficult the puzzle. Gordon Royle has a collection
4 2 of 47,386 distinct Sudoku configurations with
2 4 exact 17 filled cells. It is an open question whether
1 3 17 is the minimum number of entries for a Sudoku
3 1 puzzle to have a unique solution. It is also an open
(d) (e) question whether Royle’s collection of 47,386
puzzles is complete.
If we overlap the previous four transversals, we
obtain the original Latin square. This set of trans-
versals is called a transversal design (of index one). Holey Latin Squares
The total number of transversals for the Latin
square (a) is eight. The four other transversals Among partial Latin squares, people are often
include the two diagonals, the entry set {(1,2), interested in Latin squares with holes, that is,
(2,1), (3,4), (4,3)} and the entry set {(1,3), (2,4), some subsquares of the square are missing. The
(3,1), (4,2)}. This set of four transversals also con- existence of these holey Latin squares is very
sists of a transversal design. Because the set of all useful in the construction of Latin squares of
eight transversals can be partitioned into two tra- large orders.
versal designs, this Latin square has a resolvable Suppose H is a subset of {1; 2; . . . ; n}; a holey
set of transversals. Latin square of order n with hole H is a set of
The research problem concerning transversals n2  |H|2 integer triples hx, y, zi, 1 ≤ x, y, z ≤ n,
includes finding the maximal numbers of transver- such that
sals, resolvable transversals, or transversal designs
for each order. The maximal number of transver- 1. All hx,yi are distinct and at most one of them is
sals for orders under nine are known; for orders in H.
greater than nine, only lower and upper bounds 2. All hx,zi are distinct and at most one of them is
are known. in H.
702 Latin Square Design

3. All hy,zi are distinct and at most one of them is


1, 1 4, 5 2, 4 5, 3 3, 2
in H.
5, 4 2, 2 1, 5 3, 1 4, 3
Let H ¼ {1,2}, and holey Latin squares of orders
4 and 5 with hole H are given in the following as 4, 2 5, 1 3, 3 2, 5 1, 4
squares (a) and (b).
3, 5 1, 3 5, 2 4, 4 2, 1

5 6 3 4 2, 3 3, 4 4, 1 1, 2 5, 5
4 3 5 6 5 4 3
4 3 3 5 4 5 6 2 1 (c)
3 4 3 4 5 2 1 6 5 1 2
4 3 1 2 5 3 1 4 2 3 4 2 1 The orthogonality of Latin squares is perhaps
3 4 2 1 4 5 2 1 3 4 3 1 2 the most important property in the study of Latin
squares. One problem of great interests is to prove
(a) (b) (c)
the existence of a set of mutually orthogonal Latin
squares (MOLS) of certain order.
Similarly, we might have more than one hole.
This can be demonstrated by Euler’s 36 Officers
For instance, (c) is a holey Latin square of order 6,
Problem, in which one attempts to arrange 36 offi-
with holes {1,2}, {3,4}, and {5,6}. Holey Latin
cers of 6 different ranks and 6 different regiments
squares with multiple holes (not necessarily mutu-
into a square so each line contains 6 officers of dif-
ally disjoint or same size) can be defined similarly
ferent ranks and regiments.
using the triple representation. Obviously, holey
If the ranks and regiments of these 36 officers
Latin squares are a special case of partial Latin
arranged in a square are represented, respectively,
squares. A necessary condition for the existence of
by two Latin squares of order six, then Euler’s 36
a holey Latin square is that the hole size cannot
officers problem asks whether two orthogonal Latin
exceed the half of the order. There are few results
squares of order 6 exist. Euler went on to conjecture
concerning the maximal number of holey Latin
that such an n × n array does not exist for n ¼ 6,
squares for various orders.
and one does not exist whenever n ≡ 2 (mod). This
was known as the Euler conjecture until its disproof
Orthogonal Latin Squares in 1959. At first, Raj C. Bose and Sharadchandra S.
Shrikhande found some counterexamples; the next
Given a Latin square of order n, there are n2 entry
year, Ernest Tilden Parker, Bose, and Shrikhande
positions. Given a set of n symbols, there are n2 dis-
were able to construct a pair of orthogonal order 10
tinct pairs of symbols. If we overlap two Latin
Latin squares, and they provided a construction for
squares of order n, we obtain a pair of symbols
the remaining even values of n that are not divisible
at each entry position. If the pair at each entry posi-
by 4 (of course, excepting n ¼ 2 and n ¼ 6).
tion is distinct compared to the other entry posi-
Today’s computer software can find a pair of such
tions, we say the two Latin squares are orthogonal.
Latin squares in no time. However, it remains a great
The following are some pairs of orthogonal Latin
challenge to find a set of three mutually orthogonal
squares of small orders. There is a pair of numbers
Latin squares of order 10.
in each entry; the first of these comes from the first
Let M(n) be the maximum number of Latin
square and the second from the other square.
squares in a set of MOLS of order n. The follow-
ing results are known: If n is a prime power, that
1, 1 2, 3 3, 4 4, 2 is, n ¼ pe, where p is a prime, then M(n) ¼ n 
1. For small n > 6 and n is not a prime power, we
1, 1 2, 3 3, 2 2, 2 1, 4 4, 3 3, 1
do not know the exact value of M(n) except the
2, 2 3, 1 1, 3 3, 3 4, 1 1, 2 2, 4 lower bounds as given in the following table:

3, 3 1, 2 2, 1 4, 4 3, 2 2, 1 1, 3 n 6 10 12 14 15 18 20 21 22 24
(a) (b) MðnÞ 1 ≥ 2 ≥ 5 ≥ 3 ≥ 4 ≥ 3 ≥ 4 ≥ 5 ≥ 3 ≥ 5
Law of Large Numbers 703

MOLS can be used to design experiments. Sup- Note: Partially supported by the National Science Founda-
pose a drug company has four types of headache tion under Grants CCR-0604205.
drugs, four types of fever drugs, and four types of
cough drugs. To design a new cold medicine, the Further Readings
company wants to test the combinations of these
three kinds of drugs. In a test, three drugs (not the Bennett, F., & Zhu, L. (1992). Conjugate-orthogonal
same type) will be used simultaneously. Can we Latin squares and related structures. In J. Dinitz &
D. Stinson (Eds.), Contemporary design theory: A
design only 16 tests so that every pair of drugs
collection of surveys. New York: John Wiley.
(not the same type) will be tested? The answer is Bose, R. C., & Shrikhande, S. S. (1960). On the
yes, as we have a set of three MOLS of order four. construction of sets of mutually orthogonal Latin
A pair of MOLS is equivalent to a transversal squares and falsity of a conjecture of Euler.
design of index one. Transactions of the American Mathematical Society,
People are also interested in whether a Latin 95, 191209.
square is orthogonal to its conjugate and the Carley, A. (1890). On Latin squares. Oxford Cambridge
existence of mutually orthogonal holey Latin Dublin Messenger Mathematics, 19, 135137.
squares. For instance, for the two orthogonal Colbourn, C. J. (1984). The complexity of completing
squares of order 5 in the beginning of this sec- partial Latin squares. Discrete Applied Mathematics,
8, 2530.
tion [i.e., (c)], one is the other’s (2,1,3) conju-
Colbourn, C. J., & Dinitz, J. H. (Eds.). (1996). The CRC
gate. It has been known that a Latin square handbook of combinatorial designs. Boca Raton,
exists that is orthogonal to its (2,1,3), (1,3,2), FL: CRC Press.
and (3,2,1) conjugates for all orders except 2, 3, Euler, L. (1849). Recherches sur une espèe de carr magiques.
and 6; a Latin square exists that is orthogonal to Commentationes Arithmeticae Collectae, 2, 302361.
its (2,3,1) and (3,1,2) conjugates for all orders Evans, T. (1975). Algebraic structures associated with
except 2, 3, 4, 6, and 10. For holey Latin Latin squares and orthogonal arrays. Proceedings of
squares, the result is less conclusive. Conf. on Algebraic Aspects of Combinatorics,
Congressus Numerantium, 13, 3152.
Hedayat, A. S., Sloane, N. J. A., & Stufken, J. (1999).
Applications Orthogonal arrays: Theory and applications. New
The two examples in the beginning of this entry York: Springer-Verlag.
Mandl, R. (1985). Orthogonal Latin squares: An
show that Latin squares can be used for tourna-
application of experiment design to compiler testing.
ment scheduling and experiment design. This strat- Communications of the ACM, 28, 10541058.
egy has also been used for designing puzzles and McKey, B. D., & McLeod, J. C. (2006). The number of
tests. As a matching procedure, Latin squares transversals in a Latin square. Des Codes Cypt, 40,
relate to problems in graph theory, job assignment 269284.
(or Marriage Problem), and, more recently, proces- Royle, G. (n.d.). Minimum Sudoku. Retrieved January
sor scheduling for massively parallel computer sys- 27, 2010, from http://people.csse.uwa.edu.au/gordon/
tems. Algorithms for solving the Marriage sudokumin.php
Problem are also used in linear algebra to reduce Tarry, G. (1900). Le probleme de 36 officiers. Compte
Rendu de l’Association Francaise Avancement Sciences
matrices to block diagonal form.
Naturelle, 1, 122123.
Latin squares have rich connections with many
Zhang, H. (1997). Specifying Latin squares in
fields of design theory. A Latin square is also propositional logic. In R. Veroff (Ed.), Automated
equivalent to a (3,n) net, an orthogonal array of reasoning and its applications: Essays in honor of
strength two and index one, a 1-factorization of Larry Wos. Cambridge: MIT Press.
the complete bipartite graph Kn,n, an edge-parti-
tion of the complete tripartite graph Kn,n,n into tri-
angles, a set of n2 mutually nonattacking rooks on
a n × n × n board, and a single error-detecting LAW OF LARGE NUMBERS
code of word length 3, with n2 words from an
n-symbol alphabet. The Law of Large Numbers states that larger sam-
Hantao Zhang ples provide better estimates of a population’s
704 Law of Large Numbers

parameters than do smaller samples. As the size of samples of only 10 randomly selected men, it is
a sample increases, the sample statistics approach easily possible to get an unusually tall group of
the value of the population parameters. In its sim- 10 men or an unusually short group of men.
plest form, the Law of Large Numbers is some- Additionally, in such a small group, one outlier,
times stated as the idea that bigger samples are for example who is 85 inches, can have a large
better. After a brief discussion of the history of the effect on the sample mean. However, if samples
Law of Large Numbers, the entry discusses related of 100 men were drawn from the population,
concepts and provides a demonstration and the the means of those samples would vary less than
mathematical formula. the means from the samples of 10 men. It is
much more difficult to select 100 tall men ran-
domly from the population than it is to select 10
tall men randomly. Furthermore, if samples of
History
1,000 men are drawn, it is extremely unlikely
Jakob Bernoulli first proposed the Law of Large that 1,000 tall men will be randomly selected.
Numbers in 1713 as his ‘‘Golden Theorem.’’ The mean heights for those samples would vary
Since that time, numerous other mathematicians even less than the means from the samples of
(including Siméon-Denis Poisson who first 100 men. Thus, as sample sizes increase, the var-
coined the term Law of Large Numbers in 1837) iability between sample statistics decreases. The
have proven the theorem and considered its sample statistics from larger samples are, there-
application in games of chance, sampling, and fore, better estimates of the true population
statistical tests. Understanding the Law of Large parameters.
Numbers is fundamental to understanding the
essence of inferential statistics, that is, why one
can use samples to estimate population para- Demonstration
meters. Despite its primary importance, it is
often not fully understood. Consequently, the If a fair coin is flipped a million times, we expect
understanding of the concept has been the topic that 50% of the flips will result in heads and 50%
of numerous studies in mathematics education in tails. Imagine having five people flip a coin 10
and cognitive psychology. times so that we have five samples of 10 flips.
Suppose that the five flippers yield the following
results:

Sampling Distributions Flipper 1: H T H H T T T T T T (3 H, 7 T)

Understanding the Law of Large Numbers Flipper 2: T T T H T H T T T T (2 H, 8 T)


requires understanding how sampling distribu- Flipper 3: H T H H T H T H T T (5 H, 5 T)
tions differ for samples of various sizes. For
Flipper 4: T T T T T T T T T H (1 H, 9 T)
example, if random samples of 10 men are
drawn and their mean heights are calculated so Flipper 5: T H H H T H T H T H (6 H, 4 T)
that a frequency distribution of the mean heights
can be created, a large amount of variability If a sampling distribution of the percentage of
might be expected between those means. With heads is then created, the values 30, 20, 50, 10,
the mean height of adult men in the United and 60 would be included for these five samples. If
States at about 70 inches (5’10’’ or about 177 we continue collecting samples of 10 flips, with
cm), some samples of 10 men could have means enough samples we will end up with the mean of
as high as 80 inches (6’6’’), whereas others might the sampling distribution equal to the true popula-
be as low as 60 inches (5’0’’). Although as the tion mean of 50. However, there will be a large
central limit theorem suggests the mean of the amount of variability between our samples, as
sampling distribution of the means will be equal with the five samples presented previously. If we
to the population mean of 70 inches, the individ- create a sampling distribution for samples of 100
ual sample means will vary substantially. In flips of a fair coin, it is extremely unlikely that the
Least Squares, Methods of 705

samples will have 10% or 20% heads. In fact, we Ferguson, T. S. (1996). A course in large sample theory.
would quickly observe that although the mean of New York: Chapman & Hall.
the sample statistics will be equal to the popula- Hald, A. (2007). A history of parametric statistical
tion mean of 50% heads, the sample statistics will inference from Bernoulli to Fisher. New York:
Springer-Verlag.
vary much less than did the statistics for samples
of 10 flips.

Mathematical Formula LEAST SQUARES, METHODS OF


The mathematical proof that Bernoulli originally
solved yields the simple formula, The least-squares method (LSM) is widely used to
find or estimate the numerical values of the para-
meters to fit a function to a set of data and to char-
Xn → μ as n → ∞:
acterize the statistical properties of estimates. It is
probably the most popular technique in statistics
Mathematicians sometimes denote two ver- for several reasons. First, most common estimators
sions of the Law of Large Numbers, which are can be cast within this framework. For example,
referred to as the weak version and the strong the mean of a distribution is the value that mini-
version. Simply put, the weak version suggests mizes the sum of squared deviations of the scores.
that Xn converges in probability to μ, whereas Second, using squares makes LSM mathematically
the strong version suggests that Xn converges very tractable because the Pythagorean theorem
almost surely to μ. indicates that, when the error is independent of an
estimated quantity, one can add the squared error
When Bigger Samples Are Not Better and the squared estimated quantity. Third, the
mathematical tools and algorithms involved in
Although larger samples better represent the popu- LSM (derivatives, eigendecomposition, and singu-
lations from which they are drawn, there are lar value decomposition) have been well studied
instances when a large sample might not provide for a long time.
the best parameter estimates because it is not LSM is one of the oldest techniques of modern
a ‘‘good’’ sample. Biased samples that are not ran- statistics, and even though ancestors of LSM can
domly drawn from the population might provide be traced back to Greek mathematics, the first
worse estimates than smaller, randomly drawn modern precursor is probably Galileo. The modern
samples. The Law of Large Numbers applies only approach was first exposed in 1805 by the French
to randomly drawn samples, that is, samples in mathematician Adrien-Marie Legendre in a now
which all members of the population have an classic memoir, but this method is somewhat older
equal chance of being selected. because it turned out that, after the publication of
Legendre’s memoir, Carl Friedrich Gauss (the
Jill H. Lohmeier famous German mathematician) contested
See also Central Limit Theorem; Expected Value;
Legendre’s priority. Gauss often did not publish
Random Sampling; Sample Size; Sampling
ideas when he thought that they could be contro-
Distributions; Standard Error of the Mean
versial or not yet ripe, but he would mention his
discoveries when others would publish them (the
way he did, for example for the discovery of non-
Further Readings Euclidean geometry). And in 1809, Gauss pub-
lished another memoir in which he mentioned that
Dinov, I. D., Christou, N., & Gould, R. (2009). Law of
large numbers: The theory, applications, and
he had previously discovered LSM and used it as
technology-based education. Journal of Statistics early as 1795 in estimating the orbit of an aster-
Education, 17, 115. oid. A somewhat bitter anteriority dispute fol-
Durrett, R. (1995). Probability: Theory and examples. lowed (a bit reminiscent of the Leibniz-Newton
Pacific Grove, CA: Duxbury Press. controversy about the invention of calculus),
706 Least Squares, Methods of

which however, did not diminish the popularity of where E stands for ‘‘error,’’ which is the quantity
this technique. to be minimized. The estimation of the parameters
The use of LSM in a modern statistical frame- is obtained using basic results from calculus and,
work can be traced to Sir Francis Galton who specifically, uses the property that a quadratic
used it in his work on the heritability of size, expression reaches its minimum value when its
which laid down the foundations of correlation derivatives vanish. Taking the derivative of E with
and (also gave the name to) regression analysis. respect to a and b and setting them to zero gives
The two antagonistic giants of statistics Karl the following set of equations (called the normal
Pearson and Ronald Fisher, who did so much in equations):
the early development of statistics, used and
developed it in different contexts (factor analysis ∂E X X
for Pearson and experimental design for Fisher). ¼ 2Na þ 2b Xi  2 Yi ¼ 0 ð3Þ
∂a
Nowadays, the LSM exists with several varia-
tions: Its simpler version is called ordinary least
and
squares (OLS), and a more sophisticated version is
called weighted least squares (WLS), which often X X X
performs better than OLS because it can modulate ∂E
¼ 2b X2i þ 2a Xi  2 Yi Xi ¼ 0: ð4Þ
the importance of each observation in the final ∂b
solution. Recent variations of the least square
method are alternating least squares (ALS) and Solving the normal equations gives the following
partial least squares (PLS). least-squares estimates of a and b as:

a ¼ MY  bMX ð5Þ
Functional Fit Example: Regression
with My and MX denoting the means of X and Y,
The oldest (and still the most frequent) use of and
OLS was linear regression, which corresponds to
the problem of finding a line (or curve) that best P
ðYi  MY ÞðXi  MX Þ
fits a set of data points. In the standard formula- b¼ P 2
: ð6Þ
tion, a set of N pairs of observations fYi ; Xi g is ðXi  MX Þ
used to find a function relating the value of the
dependent variable (Y) to the values of an inde- OLS can be extended to more than one indepen-
pendent variable (X). With one variable and dent variable (using matrix algebra) and to nonlin-
a linear function, the prediction is given by the ear functions.
following equation:

^ ¼ a þ bX:
Y ð1Þ The Geometry of Least Squares
OLS can be interpreted in a geometrical frame-
This equation involves two free parameters that work as an orthogonal projection of the data
specify the intercept (a) and the slope (b) of the vector onto the space defined by the independent
regression line. The least-squares method defines variable. The projection is orthogonal because
the estimate of these parameters as the values that the predicted values and the actual values are
minimize the sum of the squares (hence the name uncorrelated. This is illustrated in Figure 1,
least squares) between the measurements and the which depicts the case of two independent vari-
model (i.e., the predicted values). This amounts to ables (vectors X1 and X2) and the data vector
minimizing the expression: (y), and it shows that the error vector (y  ^
y) is
X X orthogonal to the least-squares (^y) estimate,
2
^iÞ ¼ 2
E¼ ðYi  Y ½Yi  ða þ bXi Þ , ð2Þ which lies in the subspace defined by the two
i i independent variables.
Least Squares, Methods of 707

y a weight that reflects the uncertainty of the mea-


surement. In general, the weight wi, which is
assigned to the ith observation, will be a function
of the variance of this observation, which is
y_^
y
x1 denoted σ 2i . A straightforward weighting schema
is to define wi ¼ σ 1i (but other more sophisti-
cated weighted schemes can also be proposed).
^
For the linear regression example, WLS will find
y the values of a and b minimizing:
X
Ew ¼ ^ i Þ2
wi ðYi  Y
x2
i
X 2
¼ wi ½Yi  ða þ bXi Þ : ð7Þ
i

Iterative Methods: Gradient Descent


Figure 1 The Least-Squares Estimate of the Data Is When estimating the parameters of a nonlinear
the Orthogonal Projection of the Data function with OLS or WLS, the standard approach
Vector Onto the Independent Variable using derivatives is not always possible. In this
Subspace case, iterative methods are often used. These meth-
ods search in a stepwise fashion for the best values
Optimality of Least-Squares Estimates of the estimate. Often they proceed by using at
each step a linear approximation of the function
OLS estimates have some strong statistical pro- and refine this approximation by successive cor-
perties. Specifically when (a) the data obtained rections. The techniques involved are known as
constitute a random sample from a well-defined gradient descent and Gauss-Newton approxima-
population, (b) the population model is linear, tions. They correspond to nonlinear least squares
(c) the error has a zero expected value, (d) the approximation in numerical analysis and nonlinear
independent variables are linearly independent, regression in statistics. Neural networks constitutes
and (e) the error is normally distributed and uncor- a popular recent application of these techniques
related with the independent variables (the so-
called homoscedasticity assumption), the OLS esti-
mate is the best linear unbiased estimate, often Problems with Least Squares and Alternatives
denoted with the acronym ‘‘BLUE’’ (the five condi- Despite its popularity and versatility, LSM has its
tions and the proof are called the Gauss-Markov problems. Probably the most important drawback
conditions and theorem). In addition, when the of LSM is its high sensitivity to outliers (i.e.,
Gauss-Markov conditions hold, OLS estimates are extreme observations). This is a consequence of
also maximum-likelihood estimates. using squares because squaring exaggerates the
magnitude of differences (e.g., the difference bet-
Weighted Least Squares ween 20 and 10 is equal to 10, but the difference
between 202 and 102 is equal to 300) and there-
The optimality of OLS relies heavily on the fore gives a much stronger importance to extreme
homoscedasticity assumption. When the data observations. This problem is addressed by using
come from different subpopulations for which robust techniques that are less sensitive to the
an independent estimate of the error variance effect of outliers. This field is currently under
is available, a better estimate than OLS can be development and is likely to become more impor-
obtained using weighted least squares (WLS), tant in the future.
which is also called generalized least squares
(GLS). The idea is to assign to each observation Hervé Abdi
708 Levels of Measurement

See also Bivariate Regression; Correlation; Multiple could just have easily been assigned #20 instead.
Regression; Pearson Product-Moment Correlation The important point is that each player was
Coefficient; assigned a number. Second, it is also important to
notice that the number or label is assigned to
Further Reading
a quality of the variable or outcome. Each thing
that is measured generally measures only one
Abdi, H., Valentin, D., & Edelman, B. E. (1999). Neural aspect of that variable. So one could measure an
networks. Thousand Oaks, CA: Sage. individual’s weight, height, intelligence, or shoe
Bates, D. M., & Watts, D. G. (1988). Nonlinear size, and one would discover potentially important
regression analysis and its applications. New York:
information about an aspect of that individual.
John Wiley.
Greene, W. H. (2002). Econometric analysis. New York:
However, just knowing a person’s shoe size does
Prentice Hall. not tell everything there is to know about that
Harper, H. L. (1974). The method of least squares and individual. Only one piece of the puzzle is known.
some alternatives. Part I, II, III, IV, V, VI. International Finally, it is important to note that the numbers or
Statistical Review, 42, 147174; 235264. labels are not assigned willy-nilly but rather
Harper, H. L. (1975). The method of least squares and according to a set of rules. Following these rules
some alternatives. Part I, II, III, IV, V, VI. International keeps the assignments constant, and it allows other
Statistical Review, 43, 144; 125190; 269272. researchers to feel confident that their variables are
Harper, H. L. (1976). The method of least squares and measured using a similar scale to other researchers,
some alternatives. Part I, II, III, IV, V, VI. International
which makes the measurements of the same quali-
Statistical Review, 44, 113159.
Nocedal, J., & Wright, S. (1999). Numerical
ties of variables comparable.
optimization. New York: Springer-Verlag. These scales (or levels) of measurement were
Plackett, R. L. (1972). The discovery of the method of first introduced by Stanley Stevens in 1946. As
least squares. Biometrika, 59, 239251. a psychologist who had been debating with other
Seal, H. L. (1967). The historical development of the scientists and mathematicians on the subject of
Gauss linear model. Biometrika, 54, 123. measurement, he proposed what is referred to
today as the levels of measurement to bring all
interested parties to an agreement. Stevens wanted
researchers to recognize that different varieties of
LEVELS OF MEASUREMENT measurement exist and that types of measurement
fall into four proposed classes. He selected the four
How things are measured is of great importance, levels through determining what was required to
because the method used for measuring the quali- measure each level as well as what statistical pro-
ties of a variable gives researchers information cesses could reasonably be performed with vari-
about how one should be interpreting those mea- ables measured at those levels. Although much
surements. Similarly, the precision or accuracy of debate has ensued on the acceptable statistical pro-
the measurement used can lead to differing out- cesses (which are explored later), the four levels of
comes of research findings, and it could potentially measurement have essentially remained the same
limit the statistical analyses that could be per- since their proposal so many years ago.
formed on the data collected.
Measurement is generally described as the assig-
nment of numbers or labels to qualities of a vari- The Four Levels of Measurement
able or outcome by following a set of rules. There
Nominal
are a few important items to note in this definition.
First, measurement is described as an assignment The first level of measurement is called nominal.
because the researcher decides what values to Nominal-level measurements are names or cate-
assign to each quality. For instance, on a football gory labels. The name of the level, nominal, is said
team, the coach might assign each team member to derive from the word nomin-, which is a Latin
a number. The actual number assigned does not prefix meaning name. This fits the level very well,
necessarily have any significance, as player #12 as the goal of the first level of measurement is
Levels of Measurement 709

to assign classifications or names to qualities of participants are different and which participant is
variables. If ‘‘type of fruit’’ was the variable of better (or worse) than another participant. Ordinal
interest, the labels assigned might be bananas, measurements convey information about order but
apples, pears, and so on. If numbers are used as still do not speak to amount. It is not possible to
labels, they are significant only in that their num- determine how much better or worse participants
bers are different but not in amount. For example, are at this level. So, Big Brown might have come in
for the variable of gender, one might code males ¼ first, the second place finisher came in 4 3/4 lengths
1 and females ¼ 2. This does not signify that there behind, and the third place finisher came in 3 1/2
are more females than males, or that females have lengths behind the second place finisher. However,
more of any given quality than males. The numbers because the time between finishers is different, one
assigned as labels have no inherent meaning at the cannot determine from the rankings alone (1st,
nominal level. Every individual or item that has 2nd, and 3rd) how much faster one horse is than
been assigned the same label is treated as if they another horse. Because the differences between
are equivalent, even if they might differ on other rankings do not have a constant meaning at this
variables. Note also from the previous examples level of measurement, researchers might determine
that the categories at the nominal level of that one participant is greater than another but not
measurement are discrete, which means mutually how much greater he or she is. Ordinal measure-
exclusive. A variable cannot be both male and ments are often used in educational research when
female in this example, only one or the other, much examining percentile ranks or when using Likert
as one cannot be both an apple and a banana. The scales, which are commonly used for measuring
categories must not only be discrete, but also they opinions and beliefs on what is usually a 5-point
must be exhaustive. That is, all participants must scale.
fit into one (and only one) category. If participants
do not fit into one of the existing categories, then
Interval
a new category must be created for them. Nomi-
nal-level measurements are the least precise level of The third level of measurement is called interval.
measurement and as such, tell us the least about Interval-level measurements are created with each
the variable being measured. If two items are mea- interval exactly the same distance apart. The name,
sured on a nominal scale, then it would be possible interval, is said to derive from the words inter-,
to determine whether they are the same (do they which is a Latin prefix meaning between, and val-
have the same label?) or different (do they have dif- lum, which is a Latin word meaning ramparts. The
ferent labels?), but it would not possible to identify purpose of this third level of measurement is to
whether one is different from the other in any allow researchers to compare how much greater
quantitative way. Nominal-level measurements are participants are than each other. For example,
used primarily for the purposes of classification. a hot day that is measured at 96 8F is 208 hotter
than a cooler day measured at 76 8F, and that is
the same distance as an increase in temperature
Ordinal
from 43 8F to 63 8F. Now, it is possible to say that
The second level of measurement is called ordi- the first day is hotter than the second day and also
nal. Ordinal-level measurements are in some form how much hotter it is. Interval is the lowest level
of order. The name, ordinal, is said to derive from of measurement that allows one to talk about
the word ordin-, which is a Latin prefix meaning amount. A piece of information that the interval
order. The purpose of this second level of measure- scale does not provide to researchers is a true zero
ment is to rank the size or magnitude of the quali- point. On an interval-level scale, whereas there
ties of the variables. For example, the order of might be a marking for zero, it is just a place
finish of the Kentucky Derby might be Big Brown holder. On the Fahrenheit scale, zero degrees does
#1, Eight Belles #2, and Denis of Cork #3. In the not mean that there is no heat, because it is possi-
ordinal level of measurement, there is not only cat- ble to measure negative degrees. Because of the
egory information, as in the nominal level, but also lack of an absolute zero at this level of measure-
rank information. At this level, it is known that the ment, although it is determinable how much
710 Levels of Measurement

greater one score is from another, one cannot deter- possible have a true zero of each (no socks, no
mine whether a score is twice as big as another. So cars, no items correct, and no errors).
if Jane scores a 6 on a vocabulary test, and John
scores a 2, it does not mean that Jane knows 3
times as much as John, because the zero point on Commonalities in the Levels of Measurement
the test is not a true zero, but rather an arbitrary
Even as each level of measurement is defined by fol-
one. Similarly, a zero on the test does not mean
lowing certain rules, the levels of measurement as
that the person being tested has zero vocabulary
a whole also have certain rules that must be fol-
ability. There is some controversy as to whether
lowed. All possible variables or outcomes can be
the variables that we measure for aptitude, intelli-
measured at least one level of measurement. Levels
gence, achievement, and other popular educational
of measurement are presented in an order from
tests are measured at the interval or ordinal levels.
least precise (and therefore least descriptive) to most
precise (and most descriptive). Within this order,
Ratio each level of measurement follows all of the rules of
the levels that preceded it. So, whereas the nominal
The fourth level of measurement is called ratio. level of measurement only labels categories, all the
Ratio-level measurements are unique among all levels that follow also have the ability to label cate-
the other levels of measurements because they have gories. Each subsequent level retains all the abilities
an absolute zero. The name ratio is said to derive of the level that came before. Also, each level of
from the Latin word ratio, meaning calculation. measurement is more precise than the ones before,
The purpose of this last level of measurement is to so the interval level of measurement is more exact
allow researchers to discuss not only differences in in what it can measure than the ordinal or nominal
magnitude but also ratios of magnitude. For exam- levels of measurement. Researchers generally
ple, for the ratio level variable of weight, one can believe that any outcome or variable should be
say that an object that weights 40 pounds is twice measured at the most precise level possible, so in
as heavy as an object that weights 20 pounds. This the case of a variable that could be measured at
level of measurement is so precise, it can be diffi- more than one level of measurement, it would be
cult to find variables that can be measured using more desirable to measure it at the highest level
this level. To use the ratio level, the variable of possible. For example, one could measure the
interest must have a true zero. Many social science weight of items in a nominal scale, assigning the
and education variables cannot be measured at this first item as ‘‘1,’’ the second item as ‘‘2,’’ and so on.
level because they simply do not have an absolute Or, one could measure the same items on an ordinal
zero. It is fairly impossible to have zero self- scale, assigning the labels of ‘‘light’’ and ‘‘heavy’’ to
esteem, zero intelligence, or zero spelling ability, the different items. One could also measure weight
and as such, none of those variables can be mea- on an interval scale, where one might set zero at
sured on a ratio level. In the hard sciences, more the average weight, and items would be labeled
variables are measurable at this level. For example, based on how their weights differed from the aver-
it is possible to have no weight, no length, or no age. Finally, one could measure each item’s weight,
time left. Similar to interval level scales, very few where zero means no weight at all, as weight is nor-
education and social science variables are mea- mally measured on a ratio scale. Using the highest
sured at the ratio level. One of the few common level of measurement provides researchers with the
variables at this measure is reaction time, which is most precise information about the actual quality
the amount of time that passes between when of interest, weight.
a stimulus happens and when a reaction to it is
noted. Other common occurrences of variables
that are measured at this level happen when the What Can Researchers Do
variable is measured by counting. So the number
With Different Levels of Measurement?
of errors, number of items correct, number of cars
in a parking lot, and number of socks in a drawer Levels of measurement tend to be treated flexibly
are all measured at the ratio level, because it is by researchers. Some researchers believe that there
Levels of Measurement 711

are specific statistical analyses that can only be might be performed. In addition, percentiles might
done at higher levels of measurement, whereas be calculated, although with caution, as some
others feel that the level of measurement of a vari- methods used for the calculation of percentiles
able has no effect on the allowable statistics that assume the variables are measured at the interval
can be performed. Researchers who believe in the level. The median might be calculated as a measure
statistical limitations of certain levels of measure- of central tendency. Quartiles might be calculated,
ment might also be fuzzy on the lines between the and some additional nonparametric statistics
levels. For example, many students learn about might be used. For example, Spearman’s rank-
a level of measurement called quasi-interval, on order correlation might be used to calculate the
which variables are measured on an ordinal scale correlation between two variables measured at the
but treated as if they were measured at the interval ordinal level.
scale for the purposes of statistical analysis. As
many education and social science tests collect
Interval
data on an ordinal scale, and because using an
ordinal scale might limit the statistical analyses At the interval level of measurement, almost
one could perform, many researchers prefer to every statistical tool becomes available. All the
treat the data from those tests as if they were mea- previously allowed tools might be used, as well as
sured on an interval scale, so that more advanced many that can only be properly used starting at
statistical analyses can be done. this level of measurement. The mean and standard
deviation, both frequently used calculations of cen-
tral tendency and variability, respectively, become
What Statistics Are Appropriate?
available for use at this level. The only statistical
Along with setting up the levels of measurement as tools that should not be used at this level are those
they are currently known, Stevens also suggested that require the use of ratios, such as the coeffi-
appropriate statistics that should be permitted to cient of variation.
be performed at each level of measurement. Since
that time, as new statistical procedures have been
Ratio
developed, this list has changed and expanded.
The appropriateness of some of these procedures is All statistical tools are available for data at this
still under debate, so one should examine the level.
assumptions of each statistical analysis carefully
before conducting it on data of any level of
Controversy
measurement.
As happens in many cases, once someone codifies
a set or rules or procedures, others proceed to put
Nominal
forward statements about why those rules are
When variables are measured at the nominal incorrect. In this instance, Stevens approached
level, one might count the number of individuals his proposed scales of measurement with this idea
or items that are classified under each label. One that certain statistics should be allowed to be
might also calculate central tendency in the form performed only on variables that had been mea-
of the mode. Another common calculation that sured at certain levels of measurement, much as
can be performed is the chi-square correlation, has been discussed previously. This point of view
which is otherwise known as the contingency has been called measurement directed, which
correlation. Some more qualitative analyses might means that the level of measurement used should
also be performed with this level of data. guide the researcher as to which statistical analysis
is appropriate. On the other side of the debate are
researchers who identify as measurement indepen-
Ordinal
dent. These researchers believe that it is possible
For variables measured at the ordinal level, all to conduct any type of statistical analysis, regard-
the calculation of the previous level, nominal, less of the variable’s level of measurement. An
712 Likelihood Ratio Statistic

oft-repeated statement used by measurement-inde- Lord, F. M. (1953). On the statistical treatment of


pendent researchers was coined by F. M. Lord, football numbers. American Psychologist, 8,
who stated, ‘‘the numbers do not know where they 750751.
come from’’ (p. 751). This means regardless of Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement,
design, and analysis: An integrated approach.
what level of measurement was used to assign
Hillsdale, NJ: Lawrence Erlbaum.
those numbers, given any set of numbers it is pos- Stevens, S. S. (1946). On the theory of scales of
sible to perform any statistical calculation of inter- measurement. Science, 103, 677680.
est with them. Stine, W. W. (1989). Meaningful inference: The role of
Researchers who agree with the measurement- measurement in statistics. Psychological Bulletin, 105,
directed position tend to take the stance that 147155.
although it is possible to conduct any statistical Thorndike, R. M. (2005). Measurement and evaluation
analysis the researcher wishes with any numbers, in psychology and education (7th ed.). Upper Saddle
regardless of the level on which they were measured, River, NJ: Pearson Education.
it is most difficult to interpret the results of those
analyses in an understandable manner. For example,
if one measures gender on a nominal scale, assigning LIKELIHOOD RATIO STATISTIC
the labels ‘‘1’’ to males and ‘‘2’’ to females, one
could use those numbers to calculate the mean gen-
der in the U.S. population as 1.51. But how does The likelihood ratio statistic evaluates the relative
one interpret that? Does it mean that the average plausibility of two competing hypotheses on the
person in the United States is male plus half a male? basis of a collection of sample data. The favored
The average person is three quarters female? hypothesis is determined by whether the ratio is
Researchers who believe in measurement-directed greater than or less than one.
statistics would argue that the mode should be used To introduce the likelihood ratio, suppose that
to calculate central tendency for anything measured yOBS denotes a vector of observed data. Assume
at the nominal level, because using the mode would that a parametric joint density is postulated for the
bring more interpretable results than using the random vector Y corresponding to the realization
mean, which is more appropriate for interval-level yOBS . Let f ðy; θÞ represent this density, with para-
data. Researchers who take the measurement-inde- meter vector θ. The likelihood of θ based on the
pendent view believe that by restricting the statisti- data yOBS is defined as the joint density:
cal analyses one can conduct, one loses the use of
important statistical tools that in some cases (partic- Lðθ; yOBS Þ ¼ f ðyOBS ; θÞ:
ularly in the use of quasi-interval data) could have
lead to valuable research breakthroughs. Although the likelihood and the density are
the same function, they are viewed differently: The
Carol A. Carman density f ðy; θÞ assigns probabilities to various out-
comes for the random vector Y based on a fixed
See also Central Tendency, Measures of; Descriptive
value of θ, whereas the likelihood Lðθ; yOBS Þ
Statistics; Mean; Median; Mode; ‘‘On the Theory of
reflects the plausibility of various values for θ
Scales of Measurement’’; Planning Research;
based on the observed data yOBS .
Sensitivity; Standard Deviation; Variability,
In formulating the likelihood, multiplicative fac-
Measure of
tors that do not depend on θ are routinely omitted,
and the function is redefined based on the remain-
Further Readings ing terms, which comprise the kernel. For instance,
when considering a binomial experiment based on
Coladarci, T., Cobb, C. D., Minium, E. W., & Clarke,
n trails with success probability θ, the density for
R. C. (2004). Fundamentals of statistical reasoning in
education. Hoboken, NJ: John Wiley.
the success count Y is
Gaito, J. (1980). Measurement scales and statistics:  
n y
Resurgence of an old misconception. Psychological f ðy; θÞ ¼ θ ð1  θÞny ,
Bulletin, 87, 564567. y
Likelihood Ratio Statistic 713

whereas the corresponding likelihood for θ based on parameter vector θ might lie. The conventional test
the observed success count yOBS is often defined as statistic, which is often called the generalized like-
lihood ratio statistic, is given by
Lðθ; yOBS Þ ¼ θyOBS ð1  θÞnyOBS :
^ yOBS Þ,
Lðθ^0 ; yOBS Þ=Lðθ;
Now, consider two possible values for the
parameter vector θ, say θ0 and θ1 . The likelihood where Lðθ^0 ; yOBS Þ denotes the maximum value
ratio statistic Lðθ0 ; yOBS Þ=Lðθ1 ; yOBS Þ might be attained by the likelihood Lðθ; yOBS Þ as the para-
used to determine which of these two candidate meter vector θ varies over the space 0 , and
values is more ‘‘likely’’ (i.e., which is better sup- ^ yOBS Þ represents the maximum value attained
Lðθ;
ported by the data yOBS ). If the ratio is less than by Lðθ; yOBS Þ as θ varies over the combined space
one, θ1 is favored; if the ratio is greater than one, 0 ∪ 1 .
θ0 is preferred. Tests based on the generalized likelihood ratio
As an example, in a classroom experiment to are often optimal in terms of power. The size of
illustrate simple Mendelian genetic traits, suppose a test refers to the level of significance at which
that a student is provided with 20 seedlings that the test is conducted. A test is called uniformly
might flower either white or red. She is told to most powerful (UMP) when it achieves a power
plant these seedlings and to record the colors after that is greater than or equal to the power of any
germination. Let θ denote the probability of a seed- alternative test of comparable size. When no
ling flowering white. If Y denotes the number of UMP test exists, it might be helpful to restrict
seedlings among the 20 planted that flower white, attention to only those tests that can be classified
then Y might be viewed as arising from a binomial as unbiased. A test is unbiased when the power
distribution with density of the test never falls below its size [i.e., when
  Prðreject H0 jθ ∈ 1 Þ ≥ Prðreject H0 jθ ∈ 0 Þ]. A
20 y
f ðy; θÞ ¼ θ ð1  θÞny : test is called uniformly most powerful unbiased
y
(UMPU) when it achieves a power that is greater
The student is told that θ is either θ0 ¼ 0:75 or than or equal to the power of any alternative
θ1 ¼ 0:50; she must use the outcome of her ex- unbiased test. The generalized likelihood ratio
periment to determine the correct probability. statistic can often be used to formulate UMP and
After planting the 20 seedlings, she observes that UMPU tests.
yOBS ¼ 13 flower white. In this setting, the likeli- The reliance of the likelihood ratio statistic in
hood ratio statistic statistical inference is largely based on the likeli-
hood principle. Informally, this principle states
Lðθ0 ; yOBS Þ=Lðθ1 ; yOBS Þ that all the information in the sample yOBS that is
relevant for inferences on the parameter vector θ
equals 1.52. Thus, the likelihood ratio implies that is contained within the likelihood function
the value θ0 ¼ 0:75 is the more plausible value for Lðθ; yOBS Þ. The likelihood principle is somewhat
the probability θ. Based on the ratio, the student controversial and is not universally held. For
should choose the value θ0 ¼ 0:75. instance, neglecting constants that do not involve θ
The likelihood ratio might also be used to the same likelihood might result from two differ-
test formally two competing point hypotheses ent experimental designs. In such instances, likeli-
H0 : θ ¼ θ0 versus H1 : θ ¼ θ1 . In fact, the Ney- hood-based inferences would be the same under
manPearson Lemma establishes that the power either design, although tests that incorporate the
of such a test will be at least as high as the power nature of the design might lead to different conclu-
of any alternative test, assuming that the tests are sions. For instance, consider the preceding genetics
conducted using the same levels of significance. example based on simple Mendelian traits. If a stu-
A generalization of the preceding test allows dent were to plant all n seedlings at one time and
one to evaluate two competing composite hypoth- to count the number Y that eventually flower
eses H0 : θ ∈ 0 versus H1 : θ ∈ 1 . Here, 0 and white, then the count Y would follow a binomial
1 refer to disjoint parameter spaces where the distribution. However, if the student were to plant
714 Likert Scaling

seedlings consecutively one at a time and continue had undertaken in 1929. The use of Likert items
until a prespecified number of seedlings flower red, and scaling is probably the most used survey meth-
then the number Y that flower white would follow odology in educational and social science research
a negative binomial distribution. Based on the ker- and evaluation.
nel, each experimental design leads to the same The Likert scale provides a score based on
likelihood. Thus, if the overall number of seedlings a series of items that have two parts. One part is
planted n and the observed number of white flow- the stem that is a statement of fact or opinion to
ering seedlings yOBS are the same in each design, which the respondent is asked to react. The other
then likelihood-based inferences such as the pre- part is the response scale. Likert was the
ceding likelihood ratio test would yield identical first recognized for the use of a 5-point, ordinal
results. However, tests based on the probability scale of strongly approve—approve—undecided—
distribution models f ðy; θÞ could yield different disapprove—strongly disapprove. The scale is
conclusions. often changed to other response patterns such as
The likelihood ratio Lðθ0 ; yOBS Þ=Lðθ1 ; yOBS Þ has strongly agree—agree—neutral—disagree—strongly
a simple Bayesian interpretation. Prior to the disagree. This entry discusses Likert’s approach
collection of data, suppose that the candidate and scoring methodology and examines the
values θ0 and θ1 are deemed equally likely, so that research conducted on Likert scaling and its
the prior probabilities Prðθ0 Þ ¼ Prðθ1 Þ ¼ 0:5 are modifications.
employed. By Bayes’s rule, the ratio of the poste-
rior probabilities for the two parameter values,
Likert’s Approach
Prðθ0 jyOBS Þ= Prðθ1 jyOBS Þ, In Likert’s original research, which led to Likert
scaling, Likert compared four ways of structuring
corresponds to the likelihood ratio. As this inter- attitude survey items believing that there was an
pretation would suggest, the concept of the likeli- alternative to the approach attributed to Louis
hood function and the likelihood principle both Leon Thurstone. Although both approaches were
play prominent roles in Bayesian inference. based on equal-interval, ordinal stepped scale
points, Likert considered Thurstone’s methods to
Joseph E. Cavanaugh and Eric D. Foster be a great deal of work that was not necessary.
See also Bayes’s Theorem; Directional Hypothesis;
Setting up a Thurstone scale involved the use of
Hypothesis; Power; Significance Level, Concept of
judges to evaluate statements to be included in the
survey. This included rank ordering the statements
in terms of the expected degree of the attribute
Further Readings being assessed and then comparing and ordering
Pawitan, Y. (2001). In all likelihood: Statistical modeling each pair of item possibilities, which is an onerous
and inference using likelihood. Oxford, UK: Oxford task if there were many item possibilities. Origi-
University Press. nally, each item was scored as a dichotomy (agree/
disagree or þ=Þ.
A Thurstone scale was scored in a similar man-
ner as Likert’s original method using sigma values,
LIKERT SCALING which were z scores weighted by the responses cor-
responding to the assumed equal interval
Likert (pronounced lick-ert) scaling is a method categories. However, part of the problem with
of attitude, opinion, or perception assessment of Thurstone’s scoring method related to having
a unidimensional variable or a construct made a spread of judge-determined 1 to 11 scoring cate-
up of multidimensions or subscales. It recognizes gories when scoring the extreme values of 0 or
the contribution to attitude assessment of Rensus 1 proportions. These could not be adequately
Likert who published a classic paper on this topic accounted for because they were considered as ±
in 1932, based on his doctoral dissertation directed infinity z values in a sigma scoring approach and
by Gardner Murphy and based on work Murphy thus were dropped from the scoring. Likert felt
Likert Scaling 715

there was another approach that did not rely so scaling involving the use of the mean or sum of
much on the use of judges and could include the scores from a set of items to represent a position
scoring of items where everyone either did not or on the attitude variable continuum.
did select the extreme score category by using the Although Likert focused on developing a unidi-
± 3.00 z values instead of ± ∞. Thus, Likert set mensional scale, applications of his methodology
out to use some of the features of a Thurstone have been used to develop multidimensional scales
scale but simplify the process and hope to achieve that include subscales intended to assess attitudes
a similar level of reliability found with the Thur- and opinions on different aspects of the construct
stone scale. His research met the goals he set out of interest.
to meet.
A stem or statement was presented related to
racial attitudes and then respondents were asked Modifications
to respond to one of several response sets. One Many modifications of Likert scaling use a wide
set used ‘‘yes / no’’ options, and another used nar- variety of response sets similar or not so similar to
rative statements, and two of them used what the ones Likert used that are called Likert-type
we now know as a Likert item, using strongly items. It seems that almost any item response set
approve—approve—undecided—disapprove— that includes ordered responses in a negative and
strongly disapprove as the response categories. positive direction gets labeled as a Likert item.
The distinction between the last two types of More than 35 variations of response sets have
items relates to the source of the questions as been identified, even though many of them vary
being developed specifically by Likert for asses- considerably from Likert’s original response set.
sing attitudes and the other as abbreviations Some even use figures such as smiley or frowny
of newspaper articles reflecting societal conflicts faces instead of narrative descriptions or abbrevia-
among race-based groups. tions. These scales have been used often with
children.
Likert’s Scoring Methodology
Likert found that many items he used had distribu- Research
tions resembling a normal distribution. He con-
Controversial Issues Related to
cluded that if these distributions resembled
Likert Items and Scales
a normal distribution, it was legitimate to deter-
mine a single unidimensional scale value by finding Research on Likert item stem and response con-
the mean or sum of the items and using that for struction, overall survey design, methods of scor-
a value that represented the attitude, opinion, or ing, and various biases has been extensive; it is
perception of the variable on a continuum. probably one of the most researched topics in
Sigma values were z scores weighted by the use social science. There are many controversial issues
of responses to the five categories. These were then and debates about using Likert items and scales,
used by item to estimate score reliabilities (using including the reading level of respondents, item
split-half and testretest approaches), which were reactivity, the length or number of items, the mode
found to be high. Likert also demonstrated a high of delivery, the number of responses, using an odd
level of concurrent validity between his approach or even number of responses, labeling of a middle
and Thurstone’s approach, even though he had response, the direction of the response categories,
used only about half the number of items that dealing with missing data, the lack of attending
Thurstone had used. behaviors, acquiescence bias, central tendency
Likert sings the praises of the sigma scoring bias, social desirability bias, the use of parametric
technique. However, he also discovered that sim- methods or nonparametric methods when compar-
ply summing the scores resulted in about the same ing scale indicators of central tendency (median or
degree of score reliability, for both split-half mean), and, probably most controversial, the use
and testretest score reliabilities, as the sigma of negatively worded items. All of these have the
approach. Thus was born the concept of Likert potential for influencing score reliability and
716 Likert Scaling

validity, some more than others, and a few actually These options can be used, but it is advisable not
increase the estimate of reliability. to put these as midpoints on an ordinal contin-
Because Likert surveys are usually in the cate- uum or to give them a score for scaling
gory of ‘‘self-administered’’ surveys, the reading purposes.
level of respondents must be considered. Typi- Another issue is the direction of the response
cally, a reading level of at least 5th grade is often categories. Options are to have the negative
considered a minimal reading level for surveys response set on the left side of the scale moving to
given to most adults in the general population. the right becoming more positive or having the pos-
Often, Edward Fry’s formula is used to assess itive response set on the left becoming more nega-
reading level of a survey. A companion issue is tive as the scale moves from left to right. There
when surveys are translated from one language does not seem to be much consensus on which is
to another. This can be a challenging activity better, so often the negative left to positive right is
that, if not done well, can reduce score reliabil- preferred.
ity. Related to this is the potential for reducing Missing data are as much an issue in Likert
reliability and validity when items are reactive scaling as in all other types of research. Often,
or stir up emotions in an undesirable manner a decision needs to be made relative to how
that can confound the measure of the attitudes many items need to be completed for the survey
of interest. Sometimes, Likert survey items are to be considered viable for inclusion in the data
read to respondents in cases where reading level set. Whereas there are no hard-and-fast rules for
might be an issue or clearly when a Likert scale making this decision, most survey administrators
is used in a telephone survey. Often, when read- would consider a survey with fewer than 80% of
ing Likert response options over the phone, it is the items completed not to be a viable entry.
difficult for some respondents to keep the cate- There are a few ways of dealing with missing
gories in mind, especially if they change in the data when there are not a lot of missed
middle of the survey. Other common modes of responses. The most common is to use the mean
delivery now include online Likert surveys used of the respondent’s responses on the completed
for myriad purposes. items as a stand-in value. This is done automati-
Survey length can also affect reliability. Even cally if the scored value is the mean of the
though one way of increasing score reliability is to answered responses. If the sum of items is the
lengthen a survey, making the survey too long and scale value, any missing items will need to have
causing fatigue or frustration will have the oppo- the mean of the answered items imputed into the
site effect. missing data points before summing the items to
One issue that often comes up is deciding on get the scaled score.
the number of response categories. Most survey
researchers feel three categories might be too
Response Bias
few and more than seven might be too many.
Related to this issue is whether to include an odd Several recognized biased responses can occur
or even number of response categories. Some feel with Likert surveys. Acquiescence bias is the
that using an even number of categories forces tendency of the respondent to provide positive
the respondent to chose one directional opinion responses to all or almost all of the items. Of
or the other, even if mildly so. Others feel there course, it is hard to separate acquiescence bias
should be an odd number of responses and the from reasoned opinions for these respondents.
respondent should have a neutral or nonagree or Often, negatively worded Likert stems are used to
nondisagree opinion. If there are an odd number determine whether this is happening based on the
of response categories, then care must be used in notion that if a respondent responded positively
defining the middle category. It should represent both to items worded in a positive direction as
a point of the continuum such as neither approve well as a negative direction, then they were more
nor disapprove, neither agree nor disagree, or likely to be exhibiting this biased behavior rather
neutral. Responses such as does not apply or than attending to the items. Central tendency bias
cannot respond do not fit the ordinal continuum. is the tendency to respond to all or most of the
Likert Scaling 717

items with the middle response category. Using an much promulgated by Likert, that item distribu-
even number of response categories is often a strat- tions are close to being normal and are thus addi-
egy employed to guard against this behavior. Social tive, giving an approximate interval scale. This
desirability bias is the tendency for respondents to would justify the use of z tests, t tests, and analysis
reply to items to reflect what they believe they of variance for group inferential comparisons and
would be expected to respond based on societal the use of Pearson’s r to examine variable relation-
norms or values rather than their own feelings. ships. Rasch modeling is often used as an approach
Likert surveys on rather personal attitudes or opi- for obtaining interval scale estimates for use in
nions related to behaviors considered by society inferential group comparisons if certain item char-
to be illegal, immoral, unacceptable, or personally acteristics are assumed. To the extent the item score
embarrassing are more prone to this problem. This distributions depart from normality, this assump-
problem is exacerbated if respondents have any tion would have less viability and would tend to
feeling that their responses can be directly or even call for the use of nonparametric methods.
indirectly attributed to them personally. The effect
of two of these behaviors on reliability is some-
Use of Negatively Worded
what predictable. It has been demonstrated that
or Reverse-Worded Likert Stems
different patterns of responses have differential
effects on Cronbach’s alpha coefficients. Acquies- Although there are many controversies about
cent (or the opposite) responses inflate Cronbach’s the use of Likert items and scales, the one that
alpha. Central tendency bias has little effect on seems to be most controversial is the use of reverse
Cronbach’s alpha. It is pretty much impossible to or negatively worded Likert item stems. This
determine the effect on alpha from social desirabil- has been a long recommended practice to guard
ity responses, but it would seem that there would against acquiescence. Many Likert item scholars
not be a substantial effect on it. still recommend this practice. It is interesting to
note that Likert used some items with positive atti-
tude stems and some with negative attitude stems
Inferential Data Analysis of
in all four of his types of items. However, Likert
Likert Items and Scale Scores
provides no rationale for doing this in his classic
One of the most controversial issues relates to work. Many researchers have challenged this prac-
how Likert scale data can be used for inferential tice as not being necessary in most attitude assess-
group comparisons. On an item level, it is pretty ment settings and as a practice that actually
much understood that the level of measurement reduces internal consistency score reliability. Sev-
is ordinal and comparisons on an item level eral researchers have demonstrated that this prac-
should be analyzed using nonparametric meth- tice can easily reduce Cronbach’s alpha by at least
ods, primarily using tests based on the chi-square 0.10. It has been suggested that the reversal of
probability distribution. Tests for comparing fre- Likert response sets for half of the items while
quency distributions of independent groups often keeping the stems all going in a positive direction
use a chi-square test of independence. When accomplishes the same purpose of using negatively
comparing Likert item results for the dependent worded Likert items.
group situation such as a pretestposttest arrange-
ment, McNemar’s test is recommended. The con-
Reliability
troversy arises when Likert scale data (either in the
form of item means or sums) are being compared Even though Likert used split-half methods for
between or among groups. Some believe this scale estimating score reliability, most of the time in cur-
value is still at best an ordinal scale and recom- rent practice, Cronbach’s alpha coefficient of in-
mend the use of a nonparametric test, such as a ternal consistency, which is also known as the
MannWhitney, Wilcoxon, or KruskalWallis Kuder-Richardson 20 approach, is used. Cron-
test, or the Spearman rank-order correlation if bach’s alpha is sometimes defined as the mean
looking for variable relationships. Others are will- split-half reliability coefficient if all the possible
ing to believe the assumption, which was pretty split-half coefficients are defined.
718 Line Graph

Likert’s Contribution to Research Methodology Development, Department of Parks, Recreation and


Tourism Management.
Likert’s contribution to the method of scaling,
which was named for him, has had a profound
effect on the assessment of opinions and attitudes
of groups of individuals. There are still many LINE GRAPH
issues about the design, application, scoring,
and analysis of Likert scale data. However, the A line graph is a way of showing the relationship
approach is used throughout the world, providing between two interval- or ratio-level variables. By
useful information for research and evaluation convention, the independent variable is drawn
purposes. along the abscissa (x-axis), and the dependent vari-
able on the ordinate (y-axis). The x-axis can be
J. Jackson Barnette
either a continuous variable (e.g., age) or time. It
See also Likert Scaling; ‘‘Technique for the Measurement is probably the most widely used type of chart
of Attitudes, A’’ because it is easy to make and the message is read-
ily apparent to the viewer. Line graphs are not as
good as tables for displaying actual values of a var-
Further Readings iable, but they are far superior in showing relation-
Andrich, D. (1978). A rating formulation for ships between variables and changes over time.
ordered response categories. Psychometrika, 43,
561573.
History
Babbie, E. R. (2005). The basics of social research.
Stamford, CT: Thomson Wadsworth. The idea of specifying the position of a point using
Barnette, J. J. (1999). Nonattending respondent effects two axes, each reflecting a different attribute, was
on the internal consistency of self-administered introduced by René Descartes in 1637 (what are
surveys: A Monte Carlo simulation study.
now called Cartesian coordinates). During the fol-
Educational and Psychological Measurement, 59,
3846.
lowing century, graphs were used to display the
Barnette, J. J. (2000). Effects of stem and Likert relationship between two variables, but they were
response option reversals on survey internal all hypothetical pictures and were not based on
consistency: If you feel the need, there’s a better empirical data. William Playfair, who has been
alternative to using those negatively-worded stems. described as an ‘‘engineer, political economist, and
Educational and Psychological Measurement, 60, scoundrel,’’ is credited with inventing the line
361370. graph, pie chart, and bar chart. He first used line
Cronbach, L. J. (1951). Coefficient alpha and the graphs in a book titled The Commercial and Polit-
internal structure of tests. Psychometrika, 16, ical Atlas, which was published in 1786. In it, he
297334.
drew 44 line and bar charts to describe financial
Fry, E. (1968). A readability formula that saves time.
Journal of Reading, 11, 265271. statistics, such as England’s balance of trade with
Likert, R. (1932). A technique for the measurement of other countries, its debt, and expenditures on the
attitudes. Archives of Psychology, 140, 155. military.
Murphy, G. (1929). An historical introduction to modern
psychology, New York: Harcourt.
Thurstone, L. L. (1928). Attitudes can be measured.
Types
American Journal of Sociology, 33, 529554. Perhaps the most widely used version of the line
Uebersax, J. (2006). Likert scales: Dispelling the graph has time along the horizontal (x) axis and
confusion. Statistical Methods for Rater Agreement.
the value of some variable on the vertical (y) axis.
Retrieved February 16, 2009, from http://
ourworld.compuserve.com/homepages/jsuebersax/
For example, it is used on weather channels to dis-
likert.htm play changes in temperature over a 12- or 24-hour
Vagias, W. M. (2006). Likert-type scale response anchors. span, and by climatologists to show changes in
Clemson, SC: Clemson University, Clemson average temperature over a span of centuries. The
International Institute for Tourism & Research time variable can be calendar or clock time, as in
Line Graph 719

20 15

Females
15
Prevalence (%)

10

Number
10
Males

5
5

0
0
0 55–59 60–64 65–69 70–74 75+
2 12 22 32 42 52 62
Age
Number of Hours

Figure 1 Prevalence of Lifetime Mood Disorder by


Age and Sex Figure 2 Number of Hours of TV Watched Each
Week by 100 People
Source: From ‘‘The Epidemiology of Psychological
Problems in the Elderly,’’ by D. L. Streiner,
J. Cairney, & S. Veldhuizen, 2006, Canadian Journal of
Psychiatry, 51, pp. 185191. Copyright 1991 by the assumptions of statistical tests, which might
Canadian Journal of Psychiatry. Adapted and used with require, for example, a linear association
permission. between the variables.
Frequency polygons are often used as a substi-
tute for histograms. If the values for the x-axis
these examples, or relative time, based on a per- are in ‘‘bins’’ (e.g., ages 04, 59, 1014, etc.),
son’s age, as in Figure 1. Graphs of this latter type then the point is placed at the midpoint of the
are used to show the expected weight of infants bin (e.g., ages 2, 7, 12, etc.), with a distance
and children at various ages to help a pediatrician along the y-axis corresponding to the number or
determine whether a child is growing at a normal percentage of people in that category, as shown
rate. The power of this type of graph was exempli- in Figure 2 (the data are fictitious). Most often,
fied in one displaying the prevalence of Hodgkin’s choosing between a histogram and a frequency
lymphoma as a function of age, which showed an polygon is a matter of taste; they convey identi-
unusual pattern, in that there are two peaks: one cal information. The only difference is that, by
between the ages of 15 to 45 and another in the convention, there are extra bins at the ends, so
mid-50s. Subsequent research, based on this obser- that the first and last values drawn are zero.
vation, revealed that there are actually two sub- Needless to say, these are omitted if the x values
types of this disorder, each with a different age are nonsensical (e.g., an age less than zero, or
of onset. fewer than 0 hours watching TV).
Also widely used are line graphs with a contin- A useful variation of the frequency polygon is
uous variable, such as weight, displayed on the the cumulative frequency polygon. Rather than
abscissa, and another continuous variable (e.g., plotting the number or percentage at each value of
serum cholesterol) on the ordinate. As with time x, what is shown is the cumulative total up to and
graphs, these allow an immediate grasp of the including that value, as in Figure 3. If the y-axis
relationship between the variables, for example, shows the percentage of observations, then it is
whether they are positively or negatively corre- extremely easy to determine various centiles. For
lated, whether the relationship is linear or fol- example, drawing a horizontal line from the y
lows some other pattern, whether it is the same values of 25, 50, and 75, and dropping vertical
or different for various groups, and so on. This lines to the x-axis from where they intersect the
type of display is extremely useful for determin- line yields the median and the inter-quartile range,
ing whether the variables meet some of the also shown in Figure 3.
720 Line Graph

100 numbering, the reader must be alerted to this by


using a scale break, as in the x-axis of Figure 1.
80 The two small lines (sometimes a z-shaped line is
used instead) is a signal that there is a break in the
60 numbering.
Percent

40 2. Nominal- or ordinal-level data should never be


used for either axis. The power of the line graph is
20
to show relationships between variables. If either
0
variable is nominal (i.e., unordered categories),
2 12 22 32 42 52 62 then the order of the categories is completely arbi-
Number of Hours trary, but different orderings result in different pic-
tures, and consequently all are misleading. This is
not an issue with ranks or ordered categories (i.e.,
Figure 3 Cumulative Frequency Polygon of the Data
in Figure 2
ordinal data), because the ordering is fixed. How-
ever, because the spacing between values is not
constant (or even known), the equal spacing along
Guidelines for Drawing Line Graphs the axes gives an erroneous picture of the degree
of change from one category to the next. Only
General guidelines for creating a line graph follow. interval or ratio data should be plotted with line
1. Whether the y-axis should start at zero is a con- graphs.
tentious issue. On the one hand, starting it at some
other value has the potential to distort the picture 3. If two or more groups are plotted on the same
by exaggerating small changes. For example, one graph, the lines should be easily distinguishable.
advertisement for a breakfast cereal showed One should be continuous, another dashed, and
a decrease in eaters’ cholesterol level, with the first a third dotted, for example. If symbols are also
point nearly 93% the way up the y-axis, and the used to indicate the specific data points, they
last point, 3 weeks later, only 21% the vertical dis- should be large enough to be easily seen, of differ-
tance, which is a decrease of 77%. However, the ent shapes, and some filled and others not. Unfor-
y-axis began at 196 and ended at 210, so that the tunately, many graphing packages that come with
change of 10 points was less than 5% (both computers do a poor job of drawing symbols,
values, incidentally, are well within the normal using an X or þ that is difficult to discern, espe-
range). On the other hand, if the plotted values are cially when the line is not horizontal. The user
all considerably more than zero, then including might have to insert better ones one manually,
zero on the axis means that 80% to 90% of the using special symbols.
graph is blank, and any real changes might be hid-
den. The amount of perceived change in the graph 4. Where there are two or more lines in the
should correspond to amount of change in the graph, the labels should be placed as close to the
data. One way to check this is by using the for- respective lines as possible, as in Figure 1. Using
mula: a legend outside the body of the graph increases
the cognitive demand on the viewer, who must
Graph Distortion Index ðGDIÞ first differentiate among the different line and
Size of effect in graph symbol types, and then shift attention to another
¼  1: part of the picture to find them in the legend
Size of effect in data
box, and then read the label associated with it.
The GDI is 0 if the graph accurately reflects the This becomes increasingly more difficult as the
degree of change. If starting the y-axis at zero number of lines increases. Boxes containing all
results in a value much more or less than 0, then the legends should be used only when putting
the graph should start at some higher value. How- the names near the appropriate lines introduces
ever, if either axis has a discontinuity in the too much clutter.
LISREL 721

1500
LISREL
A

1000 B LISREL, which is an acronym for linear structural


relations, is a statistical program package parti-
cularly designed to estimate structural equation
500
models (SEMs). It can also been used for several
other types of analysis, such as data manipulation,
exploratory data analyses, and regression, as well
as factor analytic procedures. In the past few dec-
0
0 2 4 6 8 10
ades, SEM has become an increasingly popular
Time technique for the analysis of nonexperimental data
in the social sciences. Among programs from
which researchers wishing to apply SEM might
Figure 4 Change in Two Groups Over Time choose, such as AMOS, EQS, Mplus, SAS CALIS,
and RAMONA among many others, LISREL is
arguably the most longstanding and widely used
5. It is easy to determine from a line graph
tool. Notably, LISREL has been the prototype for
whether one or two groups are changing over
many later developed SEM programs. After a brief
time. It might seem, then, that it would be simple
history, this entry discusses the LISREL model and
to compare the rate of change of two groups.
its execution.
But, this is not the case. If the lines are sloping,
then it is very difficult to determine whether the
difference between them is constant or is chang-
ing. For example, it seems in Figure 4 that the Background and Brief History
two groups are getting closer together over time. The LISREL model and computer program was
In fact, the difference is a constant 200 points developed in the 1970s by Karl G. Jöreskog and
across the entire range. The eye is fooled because Dag Sörbom, who were both professors at Uppsala
the difference looks larger when the lines are University, Sweden. In 1973, Jöreskog discovered
horizontal than when they are nearly vertical. If a ‘‘maximum likelihood estimation’’ computa-
the purpose is to show differences over time, it is tional procedure and created a computer program
better to plot the actual difference, rather than for fitting factor models to data based on this esti-
the individual values mation. A few years later, he, together with Sör-
6. Determining the number of labels to place on bom, developed a program called LISREL, which
each axis is a balance between having so few that incorporates maximum likelihood estimation pro-
it is hard to determine where each point lies and cedures for both confirmatory factor analysis and
having so many that they are crowded together. In the linear structural model among factors. To date,
the end, it is a matter of esthetics and judgment LISREL has undergone a few revisions. LISREL is
available on a variety of operation systems such as
David L. Streiner Microsoft Windows, Macintosh, Mainframe, and
UNIX. The most current version as of December
See also Graphical Display of Data; Histogram; Pie Chart 2008 is LISREL 8.8.

Further Readings
Cleveland, W. S. (1985). The elements of graphing data. The General LISREL Model
Pacific Grove, CA: Wadsworth.
Robbins, N. B. (2005). Creating more effective graphs. In general, LISREL estimates the unknown coeffi-
Hoboken, NJ: John Wiley. cients of a set of linear structural equations. A full
Streiner, D. L., & Norman, G. R. (2007). Biostatistics: LISREL model consists of two submodels: the
The bare essentials (3rd ed.). Shelton, CT: PMPH. measurement model and the structural equation
722 LISREL

model. These models can be described by the fol- with LISREL matrices and their Greek representa-
lowing three equations: tion is helpful to master this program fully.

1. The structural equation model: η ¼ Bη þ ξ þ ζ Greek Notation


2. The measurement model for Y: y ¼ y η þ ε Matrices are represented by uppercase Greek
3. The measurement model for X: x ¼ x ξ þ δ letters, and their elements are represented by low-
ercase Greek letters. The elements represent the
parameters in the models. For instance, the exoge-
Types of Variables
nous variables are termed X variables, and the
In specifying structural equation models, one endogenous ones are Y variables.
needs to be familiar with several types of variables.
LISREL distinguishes variables between latent var- Basic Matrices Within the LISREL Framework
iables and observed variables. Latent variables are Setting up LISREL involves specifying several
variables that are not observed or measured matrices and specifying whether the elements within
directly. They are theoretical concepts that can these matrices are fixed at particular values or are
only be indexed by observed behaviors. Of the free parameters to be estimated by the program.
two types of latent variables, exogenous variables The eight matrices in the LISREL model are
are variables that are not influenced by other vari- as follows: lambda-X, lambda-Y, theta delta,
ables in the model, whereas endogenous variables theta epsilon, psi, phi, gamma, and beta. For
are the ones influenced by other variables. In other example, x (lambda-X) is a regression matrix
words, exogenous latent variables are indepen- that relates exogenous latent variables to the
dent variables and endogenous variables are observed variables that are designed to measure
dependent variables, which are influenced by the them;  (gamma) is a matrix of coefficients that
exogenous variables in the model. relates exogenous latent variables to endogenous
factors. Reviews of LISREL matrices and their
The Measurement Model Greek and program notations can be found in
the LISREL manual and other SEM textbooks.
The measurement model (also known as the
A measurement model can be defined by
CFA model) specifies how latent variables or hypo-
a regression matrix that relates the latent variable
thetical constructs are indicated by the observed
(Y or X) to its observed measures, one vector of
variables. It is designed particularly to describe the
latent variable, and one vector of measurement
measurement properties of the observed variables.
errors. Similarly, a structural model is defined by
It can be specified as X variables or Y variables.
two matrices and three vectors.

The Structural Equation Model


Statistical Identification of the Model
The structural model describes the causal rela-
LISREL requires an establishment of the identi-
tions among latent variables, or how the latent
fication of the models. The issue of identification
variables are linked to each other. In addition, it
pertains to whether or not there is a unique set of
assigns the explained and unexplained variance.
parameters consistent with the data. A structural
model can be just identified, overidentified, or
Greek Notation and Matrices in LISREL underidentified. An overidentified model is desired
The command language of LISREL is based on in running SEMs.
a matrix representation of the confirmatory factor
Recommendations for Overidentification
analysis or full structural equation model. Prior to
version 8, LISREL required the use of Greek letters It is recommended that latent constructs are
to specify models. Later on, LISREL used Greek as measured by at least three measures to ensure
well as English, which was made possible by the overidentification. Recursive models with identi-
SIMPLIS command language. However, familiarity fied constructs are always identified.
LISREL 723

Working With LISREL be generated. The first part of an output is a repro-


duction of the command file, which reminds the
Input user of the model specification. Model specifica-
Execution of LISREL programs can be described tion is followed by standard errors and t values
by a syntax file consisting of at least three steps of together with parameter estimates. In LISREL 8
operation: data input, model specification, and out- and later versions, standard errors (SEs) and
put of results. Each step of the operation is initiated t values (TVs) are always printed, by default.
with a command (or keyword). There are three These parameter estimates and other estimates
required commands in a given LISREL input file: derived from them determine the goodness of fit
DA, MO, and OU. In addition, there are other between the model under study and the observed
commands that are useful to include, such as NI, data. In addition, modification indices (MIs) are
NO, and MA. also included by default in LISREL 8.
There are several output options from which
Basic Rules the user can choose. All options and keywords on
the OU command can be omitted; however, a line
LISREL is controlled by two-letter keywords
with the two letters OU must be included as the
that represent the control lines and the names of
last line of the command file.
parameters. Although a keyword might contain
several letters, only the first two can be recognized
Path Diagrams
by the program. Keywords are not case sensitive.
In other words, they can be written in either As part of its output, LISREL generates path
uppercase or lowercase. However, they must be diagrams that might help researchers to make sure
separated by blanks. Each section of the LISREL that the model is specified correctly. Path diagrams
input file starts with a control line. provide a visual portrayal of relations assumed to
After optional title lines, a DA (data) command hold among the variables in the LISREL model.
always comes first. LK, LE, FR, FI, EQ, CO, IR, Observed variables are represented by rectangu-
PA, VA, ST, MA, PL, and NF commands must lar boxes, and latent variables are represented by
always come after the MO (model) command. The ellipses. Curved, double-headed arrows represent
MO command must appear unless no LISREL correlations between pairs of variables. A straight
model is analyzed. one-headed arrow is drawn from an exogenous
variable to an endogenous variable and thus indi-
Data Specification cates the influence of one variable on another. A
The data specification section always starts with straight single-headed arrow is drawn to the latent
a DA command. To fully describe the data set, variable from each of its manifest variables. The
other keywords are needed to specify the number error variables appear in the diagram but are not
of input variables, the number of observations, and enclosed. Variation and covariation in the depen-
the matrix to be analyzed. Other optional input dent variables is to be accounted for or explained
such as variable selection can be included as well. by the independent variables.

Model Specification Evaluation of the LISREL Model


To specify the model under study, the user The adequacy of the hypothesized model can be
needs to provide information about the number of evaluated by the following indicators provided by
observed variables, the number of latent variables, LISREL.
the form of each matrix to be analyzed, and the esti-
mation mode (being fixed or free) of each matrix.
Offending Estimates
Output
The appropriateness of standard errors can
The OU command is often used to specify the reflect the goodness of model fit. Excessively large
methods of estimation and to specify the output to or small standard errors indicate poor model fit.
724 LISREL

Other estimates such as negative error variances SIMPLIS


and standardized coefficients exceeding 1 signal
SIMPLIS is a new command language that was
problems with model fit.
created to simplify the use of LISREL. SIMPLIS
means a simplified LISREL. With a few excep-
Overall Goodness-of-Fit Measures tions, any model that can be specified for use in
LISREL can be specified using SIMPLIS com-
Several indexes that indicate the overall fit of a mands. SIMPLIS commands are written in English.
model are provided in the LISREL program, which A SIMPLIS input file consists of the following six
include: chi-square, goodness-of-fit index (GFI), sections: title, observed variables, form of input
adjusted goodness-of-fit index (AGFI), and root data, number of cases, unobserved variables, and
mean square error of approximation (RMSEA). model structure. A variety of optional commands
is also available to users.
Model Modification Indices
Model modification indices are measures of the
Statistical Applications of LISREL for Windows
predicted decrease in chi-square if fixed parameters
are relaxed and the model is reestimated. The fixed The latest LISREL for Windows includes several
parameters corresponding to large modification statistical applications beyond SEM, such as fol-
indices are the ones that will improve model fit lows: MULTILEV for hierarchical linear modeling,
substantially, if freed. SURVEYGLIM for generalized linear modeling,
CATFIRM for formative inference-based recursive
modeling for categorical response variables, CON-
LISREL Package: LISREL,
FIRM for formative inference-based recursive
PRELIS, and SIMPLIS modeling for continuous response variables, and
PRELIS MAPGLIM for generalized linear modeling for
multilevel data.
The PRELIS program is a companion program
that serves as a preprocessor for LISREL, and thus
the acronym PRElis. It is used for calculating cor-
Availability: Downloads and Manuals
relation and covariance matrices and for estimat-
ing asymptotic covariance from raw data. It can LISREL and its companion program PRELIS are
also be used to manipulate data and to provide ini- a software product marketed by Scientific Software,
tial descriptive statistics and graphical displays of International. The three manuals authored by Karl
the data. It can prepare the correct matrix to be G. Jöreskog and Dag Sörbom for use with LISREL
read by LISREL for many types of data such as and its companion package PRELIS are as follows:
continuous, ordinal, censored, or any combination
thereof, even when such data are severely skewed LISREL 8: Structural Equation Modeling With the
or have missing values. In addition, PRELIS has SIMPLIS Command Language
many data management functions such as variable LISREL 8: User’s Reference Guide
transformation and recoding, case selection, new
PRELIS 2: User’s Reference Guide
variable computation, and data file merging, as
well as bootstrap sampling.
A student edition of LISREL can be down-
A PRELIS input file is comprised of two-letter
loaded from the website of Scientific Software,
keywords that initiate and/or comprise control
International.
lines. The information in these control lines
informs the program of the location of the data to Yibing Li
be imported, what analysis to do, the destination
of the newly created matrices, and what to print See also Confirmatory Factor Analysis; Endogenous
on the output file. DA (data), RA (raw), and OU Variables; Exogenous Variables; Path Analysis;
(output) are three required control lines. Structural Equation Modeling
Literature Review 725

Further Readings Focus


Byrne, B. M. (1989). A primer of LISREL: Basic The focus is the basic unit of information that
applications and programming for confirmatory factor the reviewer extracts from the literature.
analytic models. New York: Springer-Verlag. Reviews most commonly focus on research out-
Byrne, B. M. (1998). Structural equation modeling with
comes, drawing conclusions of the form of ‘‘The
LISREL, PRELIS, and SIMPLIS: Basic concepts,
applications, and programming. Mahwah, NJ:
research shows X’’ or ‘‘These studies find X
Lawrence Erlbaum. whereas other studies find Y.’’ Although research
Diamantopoulos, A., & Siguaw, J. A. (2000). Introducing outcomes are most common, other foci are pos-
LISREL: A guide for the uninitiated. Thousand Oaks, sible. Some reviews focus on research methods,
CA: Sage. for example, considering how many studies in
Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: a field use longitudinal designs. Literature
Structural equation modeling with the SIMPLIS reviews can also focus on theories, such as what
command language. Chicago: SSI Scientific. theoretical explanations are commonly used
Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: User’s within a field or attempts to integrate multiple
reference guide. Chicago: SSI Scientific.
theoretical perspectives. Finally, literature
Kelloway, K. E. (1998). Using LISREL for structural
equation modeling: A researcher’s guide. Thousand
reviews can focus on typical practices within
Oaks, CA: Sage. a field, for instance, on what sort of interven-
tions are used in clinical literature or on the type
of data analyses conducted within an area of
Websites empirical research.
Scientific Software, International:
http://www.ssicentral.com
Goals
Common goals include integrating literature
LITERATURE REVIEW by drawing generalizations (e.g., concluding the
strength of an effect from several studies), resolv-
ing conflicts (e.g., why an effect is found in some
Literature reviews are systematic syntheses of pre-
studies but not others), or drawing links across
vious work around a particular topic. Nearly all
separate fields (e.g., demonstrating that two lines
scholars have written literature reviews at some
of research are investigating a common phenome-
point; such reviews are common requirements for
non). Another goal of a literature review might be
class projects or as part of theses, are often the first
to identify central issues, such as unresolved ques-
section of empirical papers, and are sometimes
tions or next steps for future research. Finally,
written to summarize a field of study. Given the
some reviews have the goal of criticism; although
increasing amount of literature in many fields,
this goal might sound unsavory, it is important for
reviews are critical in synthesizing scientific knowl-
scientific fields to be evaluated critically and have
edge. Although common and important to science,
shortcomings noted.
literature reviews are rarely considered to be held
to the same scientific rigor as other aspects of the
research process. This entry describes the types of
literature reviews and scientific standards for con- Perspective
ducting literature reviews. Literature reviews also vary in terms of perspec-
tive, with some attempting to represent the litera-
ture neutrally and others arguing for a position.
Types of Literature Reviews
Although few reviews fall entirely on one end of
Although beginning scholars often believe that this dimension or the other, it is useful for readers
there is one predefined approach, various types of to consider this perspective when evaluating
literature reviews exist. Literature reviews can vary a review and for writers to consider their own
along at least seven dimensions. perspective.
726 Literature Review

Coverage Audience
Coverage refers to the amount of literature on Literature reviews written to support an empiri-
which the review is based. At one extreme of this cal study are often read by specialized scholars in
dimension is exhaustive coverage, which uses all one’s own field. In contrast, many stand-alone
available literature. A similar approach is the reviews are read by those outside one’s own field,
exhaustive review with selective citation, in which so it is important that these are accessible to scho-
the reviewer uses all available literature to draw lars from other fields. Reviews can also serve as
conclusions but cites only a sample of this litera- a valuable resource for practitioners in one’s field
ture when writing the review. Moving along this (e.g., psychotherapists and teachers) as well as pol-
dimension, a review can be representative, such icy makers and the general public, so it is useful
that the reviewer bases conclusions on and cites if reviews are written in a manner accessible to
a subset of the existing literature believed to be educated laypersons. In short, the reviewer must
similar to the larger body of work. Finally, at the consider the likely audiences of the review and
far end of this continuum is the literature review adjust the level of specificity and technical detail
of most central works. accordingly.
All of these seven dimensions are important
considerations when preparing a literature review.
As might be expected, many reviews will have
Organization multiple levels of these dimensions (e.g., multiple
The most common organization is conceptual, goals directed toward multiple audiences). Tenden-
in which the reviewer organizes literature around cies exist for co-occurrence among dimensions; for
specific sets of findings or questions. However, his- example, quantitative reviews typically focus on
toric organizations are also useful, in that they pro- research outcomes, cover the literature exhaus-
vide a perspective on how knowledge or practices tively, and are directed toward specialized scho-
have changes across time. Methodological organi- lars. At the same time, consideration of these
zations, in which findings are arranged according dimensions suggests the wide range of possibilities
to methodological aspects of the reviewed studies, available in preparing literature reviews.
are also a possible method of organizing literature
reviews. Scientific Standards for Literature Reviews
Given the importance of literature reviews, it is
important to follow scientific standards in prepar-
Method of Synthesis ing these reviews. Just as empirical research fol-
lows certain practices to ensure validity, we can
Literature reviews also vary in terms of how
consider how various decisions impact the quality
conclusions are drawn, with the endpoints of this
of conclusions drawn in a literature review. This
continuum being qualitative versus quantitative.
section follows Harris Cooper’s organization by
Qualitative reviews, which are also called narra-
describing considerations at five stages of the liter-
tive reviews, are those in which reviewers draw
ature review process.
conclusions based on their subjective evaluation of
the literature. Vote counting methods, which might
be considered intermediate on the qualitative ver-
Problem Formulation
sus quantitative dimension, involve tallying the
number of studies that find a particular effect As in any scientific endeavor, the first stage of
and basing conclusions on this tally. Quantitative a literature review is to formulate a problem. Here,
reviews, which are sometimes also called meta- the central considerations involve the questions
analyses, involve assigning numbers to the results that the reviewer wishes to answer, the constructs
of studies (representing an effect size) and then of interest, and the population about which con-
performing statistical analyses of these results to clusions are drawn. A literature review can only
draw conclusions. answer questions about which prior work exists.
Literature Review 727

For instance, to make conclusions of causality, the and therefore might exclude most studies con-
reviewer will need to rely on experimental (or per- ducted in other countries. Although it would be
haps longitudinal) studies; concurrent naturalistic impractical for the reviewer to learn every lan-
studies would not provide answers to this ques- guage in which relevant literature might be writ-
tion. Defining the constructs of interest poses two ten, the reviewer should be aware of this
potential complications: The existing literature limitation and how it impacts the literature on
might use different terms for the same construct, which the review is based. To ensure transpar-
or the existing literature might use similar terms to ency of a literature review, the reviewer should
describe different constructs. The reviewer, report means by which potentially relevant liter-
therefore, needs to define clearly the constructs of ature was searched and obtained.
interest when planning the review. Similarly, the
reviewer must consider which samples will be
Inclusion Criteria
included in the literature review, for instance,
deciding whether studies of unique populations Deciding which works should inform the review
(e.g., prison, psychiatric settings) should be involves reading the literature obtained and draw-
included within the review. The advantages of ing conclusions regarding relevance. Obvious rea-
a broad approach (in terms of constructs and sons to exclude works include the investigation of
samples) are that the conclusions of the review constructs or samples that are irrelevant to the
will be more generalizable and might allow for review (e.g., studies involving animals when one is
the identification of important differences among interested in human behavior) or that do not pro-
studies, but the advantages of a narrow vide information relevant to the review (e.g., treat-
approach are that the literature will likely be ing the construct of interest only as a covariate).
more consistent and the quantity of literature Less obvious decisions need to be made with
that must be reviewed is smaller. works that involve questionable quality or meth-
odological features different from other studies.
Including such works might improve the generaliz-
Literature Retrieval
ability of the review on the one hand, but it might
When obtaining literature relevant for the contaminate the literature basis or distract focus
review, it is useful to conceptualize the literature on the other hand. Decisions at this stage will typi-
included as a sample drawn from a population cally involve refining the problem formulation
of all possible works. This conceptualization stage of the review.
highlights the importance of obtaining an unbi-
ased sample of literature for the review. If the lit-
Interpretation
erature reviewed is not exhaustive, or at least
representative, of the extant research, then the The most time-consuming and difficult stage is
conclusions drawn might be biased. One com- analyzing and interpreting the literature. As men-
mon threat to all literature reviews is publication tioned, several approaches to drawing conclusions
bias, or the file drawer problem. This threat is exist. Qualitative approaches involve the reviewer
that studies that fail to find significant effects (or performing some form of internal synthesis; as
that find counterintuitive effects) are less likely such, they are prone to reviewer subjectivity. At
to be published and, therefore, are less likely to the same time, qualitative approaches are the only
be included in the review. Reviewers should option when reviewing nonempirical literature
attempt to obtain unpublished studies, which (e.g., theoretical propositions), and the simplicity
will either counter this threat or at least allow of qualitative decision making is adequate for
the reviewer to evaluate the magnitude of this many purposes. A more rigorous approach is the
bias (e.g., comparing effects from published vs. vote-counting methods, in which the reviewer tal-
unpublished studies). Another threat is that lies studies into different categories (e.g., signifi-
reviewers typically must rely on literature writ- cant versus nonsignificant results) and bases
ten in a language they know (e.g., English); this decisions on either the preponderance of evidence
excludes literature written in other languages (informal vote counting) or statistical procedures
728 Logic of Scientific Discovery, The

(comparing the number of studies finding signifi- Card, N. A. (in press). Meta-analysis: Quantitative
cant results with that expected by chance). synthesis of social science research. New York:
Although vote-counting methods reduce subjec- Guilford.
tivity relative to qualitative approaches, they are Cooper, H. (1998). Synthesizing research: A guide for
literature reviews (3rd ed.). Thousand Oaks, CA: Sage.
limited in that the conclusions reached involve
Cooper, H., & Hedges, L. V. (Eds.). (1994). The
only whether there is an effect (rather than the handbook of research synthesis. New York: Russell
magnitude of the effect). The best way to draw Sage Foundation.
conclusions from empirical literature is through Hedges, L. V., & Olkin, I. (1985). Statistical methods for
quantitative, or meta-analytic, approaches. Here, meta-analysis. San Diego, CA: Academic Press.
the reviewer codes effect sizes for the studies then Pan, M. L. (2008). Preparing literature reviews:
applies statistical procedures to evaluate the pres- Qualitative and quantitative approaches (3rd ed.).
ence, magnitude, and sources of differences of Glendale, CA: Pyrczak Publishing.
these effects across studies. Rosenthal, R. (1995). Writing meta-analytic reviews.
Psychological Bulletin, 118, 183192.

Presentation
Although presentation formats are highly dis-
ciplinary specific (and therefore, the best way to LOGIC OF SCIENTIFIC
learn how to present reviews is to read reviews
in one’s area), a few guidelines are universal. DISCOVERY, THE
First, the reviewer should be transparent about
the review process. Just as empirical works are The Logic of Scientific Discovery first presented
expected to present sufficient details for replica- Karl Popper’s main ideas on methodology, includ-
tion, a literature review should provide sufficient ing falsifiability as a criterion for science and the
detail for another scholar to find the same litera- representation of scientific theories as logical sys-
ture, include the same works, and draw the same tems from which other results followed by pure
conclusions. Second, it is critical that the written deduction. Both ideas are qualified and extended
report answers the original questions that moti- in later works by Popper and his follower Imré
vated the review or at least describes why such Lakatos.
answers cannot be reached and what future Popper was born in Vienna, Austria, in 1902.
work is needed to provide these answers. A third During the 1920s, he was an early and enthusiastic
guideline is to avoid study-by-study listing. A participant in the philosophical movement called
good review synthesizes—not merely lists—the the Vienna Circle. After the rise of Nazism, he fled
literature (it is useful to consider that a phone- Austria for New Zealand, where he spent World
book contains a lot of information, but is not War II. In 1949, he was appointed Professor of
very informative, or interesting, to read). Logic and Scientific Method at the London School
Reviewers should avoid ‘‘Author A found . . . of Economics (LSE), where he remained for the
Author B found . . .’’ writing. Effective presenta- rest of his teaching career. He was knighted by
tion is critical in ensuring that the review has an Queen Elizabeth II in 1965. Although he retired in
impact on one’s field. 1969, he continued a prodigious output of philo-
sophical work until his death in 1994. He was
Noel A. Card succeeded at LSE by his protégé Lakatos, who
See also Effect Size, Measures of; File Drawer Problem; extended his methodological work in important
Meta-Analysis ways.
The Logic of Scientific Discovery’s central
methodological idea is falsifiability. The Vienna
Further Readings Circle philosophers, or logical positivists, had pro-
Bem, D. J. (1995). Writing a review article for posed, first, that all meaningful discourse was
Psychological Bulletin. Psychological Bulletin, 118, completely verifiable, and second, that science was
172177. coextensive with meaningful discourse. Originally,
Logic of Scientific Discovery, The 729

they meant by this that a statement should be con- paper (‘‘The Aim of Science,’’ reprinted in Objec-
sidered meaningful, and hence scientific, if and tive Knowledge, chapter 5) Popper pointed out
only if it was possible to show that it was true, that there was no deductive link between, for
either by logical means or on the basis of the evi- example, Newton’s laws and the original state-
dence of the senses. Popper became the most ments of Kepler’s laws of planetary motion or
important critic of their early work. He pointed Galileo’s law of fall. The simple logical model of
out that scientific laws, which are represented as science offered by Popper and later logical positi-
unrestricted or universal generalizations such as vists (e.g., Carl Hempel and Paul Oppenheim)
‘‘all planets have elliptical orbits’’ (Kepler’s Second therefore failed for some of the most important
Law), are not verifiable by any finite set of sense intertheoretical relations in the history of science.
observations and thus cannot be counted as mean- An additional limitation of falsifiability as pre-
ingful or scientific. To escape this paradox, Popper sented in The Logic of Scientific Discovery was the
substituted falsifiability for verifiability as the key issue of ad hoc hypotheses. Suppose, as actually
logical relation of scientific statements. He thereby happened between 1821 and 1846, a planet is
separated the question of meaning from the ques- observed that seems to have an orbit that is not an
tion of whether a claim was scientific. A statement ellipse. The response of scientists at the time was
could be considered scientific if it could, in princi- not to falsify Kepler’s law, or the Newtonian Laws
ple, be shown to be false on the basis of sensory of Motion and Universal Gravitation from which it
evidence, which in practice meant experiment or was derived. Instead, they deployed a variety of aux-
observation. ‘‘All planets have elliptical orbits’’ iliary hypotheses ad hoc, which had the effect of
could be shown to be false by finding a planet with explaining away the discrepancy between Newton’s
an orbit that was not an ellipse. This has never laws and the observations of Uranus, which led to
happened, but if it did the law would be counted the discovery of the planet Neptune. Cases like this
as false, and such a discovery might be made suggested that any claim could be permanently insu-
tomorrow. The law is scientific because it is falsifi- lated from falsifying evidence by introducing an ad
able, although it has not actually been falsified. hoc hypothesis every time negative evidence
Falsifiability requires only that the conditions appeared. Indeed, this could even be done if nega-
under which a statement would be deemed false tive evidence appeared against the ad hoc hypothesis
are specifiable; it does not require that they have itself; another ad hoc hypothesis could be intro-
actually come about. However, when this happens, duced to explain the failure and so on ad infinitum.
Popper assumed scientists would respond with Arguments of this type raised the possibility that fal-
a new and better conjecture. Scientific methodol- sifiability might be an unattainable goal, just as veri-
ogy should not attempt to avoid mistakes, but fiability had been for the logical positivists.
rather, as Popper famously put it, it should try to Two general responses to these difficulties app-
make its mistakes as quickly as possible. Scientific eared. In 1962, Thomas Kuhn argued in The
progress results from this sequence of conjectures Structure of Scientific Revolutions that falsification
and refutations, with each new conjecture requir- occurred only during periods of cumulative normal
ing the precise grounds of specification for its science, whereas the more important noncumula-
failure to satisfy the principle of falsifiability. Pop- tive changes, or revolutions, depended on factors
per’s image of science achieved great popularity that went beyond failures of observation or experi-
among working scientists, and he was acknowl- ment. Kuhn made extensive use of historical evi-
edged by several Nobel prize winners (including dence in his arguments. In reply, Lakatos shifted
Peter Medewar, John Eccles, and Jacques Monod). the unit of appraisal in scientific methodology,
In The Logic of Scientific Discovery, Popper, from individual statements of law or theory to
like the logical positivists, presented the view that a historical sequence of successive theories called
scientific theories ideally took the form of logically a research program. Such programs were to be
independent and consistent systems of axioms appraised according to whether new additions, ad
from which (with the addition of initial condi- hoc or otherwise, increased the overall explanatory
tions) all other scientific statements followed by scope of the program (and especially covered pre-
logical deduction. However, in an important later viously unexplained facts) while retaining the
730 Logistic Regression

successful content of earlier theories. In addition to The Logic of Scientific Discovery, Popper's main ideas are presented in the essays collected in Conjectures and Refutations and Objective Knowledge. Two books on political philosophy, The Open Society and Its Enemies and The Poverty of Historicism, were also important in establishing his reputation. A three-volume Postscript to the Logic of Scientific Discovery, covering, respectively, realism, indeterminism, and quantum theory, appeared in 1982.

Peter Barker

See also Hypothesis; Scientific Method; Significance Level, Concept of

Further Readings

Hempel, C. G., & Oppenheim, P. (1948). Studies in the logic of explanation. Philosophy of Science, 15, 135–175.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: Chicago University Press.
Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, UK: Cambridge University Press.
Popper, K. R. (1945). The open society and its enemies. London: George Routledge and Sons.
Popper, K. R. (1957). The poverty of historicism. London: Routledge and Kegan Paul.
Popper, K. R. (1959). The logic of scientific discovery (Rev. ed.). New York: Basic Books.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. Oxford, UK: Clarendon Press.
Popper, K. R. (1982). Postscript to the logic of scientific discovery (W. W. Bartley III, Ed.). Totowa, NJ: Rowman and Littlefield.

LOGISTIC REGRESSION

Logistic regression is a statistical technique used in research designs that call for analyzing the relationship of an outcome or dependent variable to one or more predictors or independent variables when the dependent variable is either (a) dichotomous, having only two categories, for example, whether one uses illicit drugs (no or yes); (b) unordered polytomous, which is a nominal scale variable with three or more categories, for example, political party identification (Democrat, Republican, other, or none); or (c) ordered polytomous, which is an ordinal scale variable with three or more categories, for example, level of education completed (e.g., less than elementary school, elementary school, high school, an undergraduate degree, or a graduate degree). Here, the basic logistic regression model for dichotomous outcomes is examined, noting its extension to polytomous outcomes and its conceptual roots in both loglinear analysis and the general linear model. Next, consideration is given to methods for assessing the goodness of fit and predictive utility of the overall model, and calculation and interpretation of logistic regression coefficients and associated inferential statistics to evaluate the importance of individual predictors in the model. The discussion throughout the entry assumes an interest in prediction, regardless of whether causality is implied; hence, the language of "outcomes" and "predictors" is preferred to the language of "dependent" and "independent" variables.

The equation for the logistic regression model with a dichotomous outcome is

logit(Y) = α + β1X1 + β2X2 + ... + βKXK,

where Y is the dichotomous outcome; logit(Y) is the natural logarithm of the odds of Y, a transformation of Y to be discussed in more detail momentarily; and there are k = 1, 2, ..., K predictors Xk with associated coefficients βk, plus a constant or intercept α, which represents the value of logit(Y) when all of the Xk are equal to zero. If the two categories of the outcome are coded 1 and 0, respectively, and P1 is the probability of being in the category coded as 1, and P0 is the probability of being in the category coded as 0, then the odds of being in category 1 are

P1/P0 = P1/(1 − P1)

(because the probability of being in one category is one minus the probability of being in the other category). Logit(Y) is the natural logarithm of the odds,

ln[P1/(1 − P1)],

where ln represents the natural logarithm transformation.
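As a minimal illustration of the odds and logit transformation and of fitting the dichotomous model by maximum likelihood, the following sketch uses Python with the open-source statsmodels library; the simulated data and the variable names (age, income, and a 0/1 outcome) are hypothetical and are not part of the original entry.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 65, n)
income = rng.normal(50, 15, n)

# Hypothetical data-generating model on the logit scale
true_logit = -2.0 + 0.04 * age - 0.02 * income
p1 = 1 / (1 + np.exp(-true_logit))      # P1 = probability of category 1
y = rng.binomial(1, p1)                 # dichotomous outcome coded 1 and 0

# Odds and logit for a single probability, e.g., P1 = .75
P1 = 0.75
odds = P1 / (1 - P1)                    # P1/P0 = P1/(1 - P1) = 3.0
logit = np.log(odds)                    # ln[P1/(1 - P1)], about 1.10

# Fit logit(Y) = a + b1*age + b2*income by maximum likelihood
X = sm.add_constant(np.column_stack([age, income]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)                       # intercept a and coefficients b1, b2

The fitted coefficients are on the logit scale, so each bk is the change in ln[P1/(1 − P1)] associated with a one-unit change in the corresponding predictor.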
Polytomous Logistic Regression Models

When the outcome is polytomous, logistic regression can be implemented by splitting the outcome into a set of dichotomous variables. This is done by means of contrasts, which identify a reference category (or set of categories) with which to compare each of the other categories (or sets of categories). For a nominal outcome, the most commonly used model is called the baseline category logit model. In this model, the outcome is divided into a set of dummy variables, each representing one of the categories of the outcome, with one of the categories designated as the reference category, in the same way that dummy coding is used for nominal predictors in linear regression. If there are M categories in the outcome, then

logit(Ym) = ln(Pm/P0) = αm + β1,mX1 + β2,mX2 + ... + βK,mXK,

where P0 is the probability of being in the reference category and Pm is the probability of being in category m = 1, 2, ..., M − 1, given that the case is either in category m or in the reference category. A total of M − 1 equations or logit functions are thus estimated, each with its own intercept αm and logistic regression coefficients βk,m, representing the relationship of the predictors to logit(Ym).

For ordinal outcomes, the situation is more complex, and several different contrasts might be used. In the adjacent category logit model, for example, each category is contrasted only with the single category preceding it. In the cumulative logit model, (a) for the first logit function, the first category is contrasted with all of the categories following it, then (b) for the second logit function, the first two categories are contrasted with all of the categories following them, and so forth, until for the last (M − 1) logit function, all the categories preceding the last are contrasted with the last category. Other contrasts are also possible. The cumulative logit model is the model most commonly used in logistic regression analysis for an ordinal outcome, and it has the advantage over other contrasts that splitting or combining categories (representing more precise or cruder ordinal measurement) should not affect estimates for categories other than the categories that are actually split or combined. This property is not characteristic of other ordinal contrasts. It is commonly assumed in ordinal logistic regression that only the intercepts (or thresholds, which are similar to intercepts) differ across the logit functions. The ordinal logistic regression equation can be written (here in the format using intercepts instead of thresholds) as

logit(Ym) = αm + β1X1 + β2X2 + ... + βKXK,

where

αm = α1, α2, ..., αM−1

are the intercepts associated with the M − 1 logit functions, but β1, β2, ..., βK are assumed to be identical for the M − 1 logit functions. This assumption can be tested and, if necessary, modified.
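Both polytomous models can be estimated in most general-purpose statistical packages. The sketch below is only a hedged illustration (it is not part of the original entry) using Python's statsmodels: MNLogit fits the baseline category logit model for a nominal outcome, and OrderedModel (available in statsmodels 0.13 and later) fits the cumulative logit model with a single set of slopes and category-specific thresholds; the outcome names party and educ are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Nominal outcome with M = 3 categories: M - 1 logit functions are estimated,
# each contrasting one category with the reference (baseline) category.
party = rng.integers(0, 3, n)
mn_fit = sm.MNLogit(party, sm.add_constant(X)).fit(disp=0)
print(mn_fit.params)        # one column of coefficients per non-reference category

# Ordered outcome with four categories: cumulative logit model with common
# slopes and M - 1 thresholds (the proportional odds assumption).
educ = pd.Categorical(rng.integers(0, 4, n), ordered=True)
ord_fit = OrderedModel(educ, X, distr="logit").fit(method="bfgs", disp=0)
print(ord_fit.params)       # slopes followed by the M - 1 thresholds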
Logistic Regression, Loglinear Analysis, and the General Linear Model

Logistic regression can be derived from two different sources, the general linear model for linear regression and the logit model in loglinear analysis. Linear regression is used to analyze the relationship of an outcome to one or more predictors when the outcome is a continuous interval or ratio scale variable. Linear regression is used extensively in the analysis of outcomes with a natural metric, such as kilograms, dollars, or numbers of people, where the unit of measurement is such that it makes sense to talk about larger or smaller differences between cases (the difference between the populations of France and Germany is smaller than the difference between the populations of France and China). Usually, it also makes sense to talk about one value being some number of times larger than another ($10,000 is twice as much as $5,000); these comparisons are not applicable to the categorical outcome variables for which logistic regression is used. The equation for linear regression is

Y = α + β1X1 + β2X2 + ... + βKXK,

and the only difference from the logistic regression equation is that the outcome in linear regression is Y instead of logit(Y). The coefficients βk and intercept α in linear regression are most commonly estimated using ordinary least-squares (OLS) estimation, although other methods of estimation are possible.

For OLS estimation and for statistical inferences about the coefficients, certain assumptions are required, and if the outcome is a dichotomy (or a polytomous variable represented as a set of dichotomies) instead of a continuous interval/ratio variable, several of these assumptions are violated. For a dichotomous outcome, the predicted values might lie outside the range of possible values (suggesting probabilities greater than one or less than zero), especially when there are continuous interval or ratio scale predictors in the model. Inferential statistics are typically incorrect because of heteroscedasticity (unequal residual variances for different values of the predictors) and non-normal distribution of the residuals. It is also assumed that the relationship between the outcome and the predictors is linear; however, in the general linear model, it is often possible to linearize a nonlinear relationship by using an appropriate nonlinear transformation. For example, in research on income (measured in dollars), it is commonplace to use the natural logarithm of income as an outcome, because the relationship of income to its predictors tends to be nonlinear (specifically, logarithmic). In this context, the logit transformation is just one of many possible linearizing transformations.

An alternative to the use of linear regression to analyze dichotomous and polytomous categorical outcomes is logit analysis, which is a special case of loglinear analysis. In loglinear analysis, it is assumed that the variables are categorical and can be represented by a contingency table with as many dimensions as there are variables, with each case located in one cell of the table, corresponding to the combination of values it has on all of the variables. In loglinear analysis, no distinction is made between outcomes and predictors, but in logit analysis, one variable is designated as the outcome, and the other variables are treated as predictors. Each unique combination of values of the predictors represents a covariate pattern. Logit model equations are typically presented in a format different from that used in linear regression and logistic regression, and loglinear and logit models are commonly estimated using iterative maximum likelihood (ML) estimation, in which one begins with a set of initial values for the coefficients in the model, examines the differences between observed and predicted values produced by the model (or some similar criterion), and uses an algorithm to adjust the estimates to improve the model. This process of estimation and adjustment of estimates is repeated in a series of steps (iterations) that end when, to some predetermined degree of precision, there is no change in the fit of the model, the coefficients in the model, or some similar criterion.

Logistic regression can be viewed either as a special case of the general linear model involving the logit transformation of the outcome or as an extension of the logit model to incorporate continuous as well as categorical predictors. The basic form of the logistic regression equation is the same as for the linear regression equation, but the outcome logit(Y) has the same form as the outcome in the logit analysis. The use of the logit transformation ensures that predicted values cannot exceed observed values (for an individual case, the logit of Y is either positive or negative infinity, +∞ or −∞), but it also makes it impossible to estimate the coefficients in the logistic regression equation using OLS. Estimation for logistic regression, as for logit analysis, requires an iterative technique, most often ML, but other possibilities include iteratively reweighted least squares, with roots in the general linear model, or some form of quasi-likelihood or partial likelihood estimation, which might be employed when data are clustered or nonindependent. Common instances of nonindependent data include multilevel analysis, complex sampling designs (e.g., multistage cluster sampling), and designs involving repeated measurement of the same subjects or cases, as in longitudinal research. Conditional logistic regression is a technique for analyzing related samples, for example, in matched case-control studies, in which, with some minor adjustments, the model can be estimated using ML.
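To make the idea of iterative ML estimation concrete, the short sketch below (an illustrative simplification, not the algorithm of any particular package) fits the dichotomous logistic regression model by Newton-Raphson updating, which for this model is equivalent to iteratively reweighted least squares.

import numpy as np

def logistic_ml(X, y, tol=1e-8, max_iter=25):
    """Fit logit(Y) = Xb by maximum likelihood; X includes a column of ones."""
    beta = np.zeros(X.shape[1])              # initial values for the coefficients
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # predicted P(Y = 1) for each case
        w = p * (1.0 - p)                    # binomial weights
        gradient = X.T @ (y - p)             # first derivative of the log likelihood
        information = X.T @ (X * w[:, None]) # negative second derivative
        step = np.linalg.solve(information, gradient)
        beta = beta + step                   # adjust the estimates
        if np.max(np.abs(step)) < tol:       # stop when estimates no longer change
            break
    return beta

Each pass through the loop is one iteration in the sense described above: predicted values are compared with observed values, and the coefficients are adjusted until the change falls below a preset precision.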
Assumptions of Logistic Regression

Logistic regression assumes that the functional form of the equation is correct, and hence, the predictors Xk are linearly and additively related to logit(Y), but variables can be transformed to adjust for nonadditivity and nonlinearity (e.g., nonlinearly transformed predictors or interaction terms). It also assumes that each case is independent of all the other cases in the sample, or when cases are not independent, adjustments can be made in either the estimation procedure or the calculation of standard errors (or both) to adjust for the nonindependence. Like linear regression, logistic regression assumes that the variables are measured without error, that all relevant predictors are included in the analysis (otherwise the logistic regression coefficients might be biased), and that no irrelevant predictors are included in the analysis (otherwise standard errors of the logistic regression coefficients might be inflated). Also as in linear regression, no predictor may be perfectly collinear with one or more of the other predictors in the model. Perfect collinearity means that a predictor is completely determined by or predictable from one or more other predictors, and when perfect collinearity exists, an infinite number of solutions is available that maximize the likelihood in ML estimation or minimize errors of prediction more generally. Logistic regression also assumes that the errors in prediction have a binomial distribution, but when the number of cases is large, the binomial distribution approximates the normal distribution. Various diagnostic statistics have been developed and are readily available in existing software to detect violations of assumptions and other problems (e.g., outliers and influential cases) in logistic regression.

Goodness of Fit and Accuracy of Prediction

In logistic regression using ML (currently the most commonly used method of estimation), in place of the sum of squares statistics used in linear regression, there are log likelihood statistics, which are calculated based on observed and predicted probabilities of being in the respective categories of the outcome variable. When multiplied by −2, the difference between two log likelihood statistics has an approximate chi-square distribution for sufficiently large samples involving independent observations. One can construct −2 log likelihood statistics (here and elsewhere designated as D) for (a) a model with no predictors, D0, and (b) the tested model, the model for which the coefficients are actually estimated, DM. DM, which is sometimes called the deviance statistic, has been used as a goodness-of-fit statistic, but it has somewhat fallen out of favor because of concerns with alternative possible definitions for the saturated model (depending on whether individual cases or covariate patterns are treated as the units of analysis), and the concern that, for data in which there are few cases per covariate pattern, DM does not really have a chi-square distribution. The Hosmer–Lemeshow goodness-of-fit index is constructed by grouping the data, typically into deciles, based on predicted values of the outcome. This technique is applicable even with few cases per covariate pattern. There seems to be a trend away from concern with goodness of fit, however, to focus instead on the model chi-square statistic,

GM = D0 − DM,

which compares the tested model to the model with no predictors. GM generally does follow a chi-square distribution in large samples and it is analogous to the multivariate F statistic in linear regression and analysis of variance. GM provides a test of the statistical significance of the overall model in predicting the outcome. An alternative to GM for models not estimated using ML is the multivariate Wald statistic.

There is a substantial literature on coefficients of determination for logistic regression, in which the goal is to find a measure analogous to R2 in linear regression. When the concern is with how close the predicted probabilities of category membership are to observed category membership (quantitative prediction), two promising options are the likelihood ratio R2 statistic,

R2L = GM/D0,

which is applicable specifically when ML estimation is used, and the OLS R2 statistic itself, which is calculated by squaring the correlation between observed values (coded zero and one) and the predicted probabilities of being in category 1. Advantages of R2L include the following: (a) it is based on the quantity actually being maximized in ML estimation, (b) it seems to be uncorrelated with the base rate (the percentage of cases in category 1), and (c) it can be calculated for polytomous as well as dichotomous outcomes.
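These overall fit statistics can be recovered directly from a fitted model's log likelihoods. The sketch below, again only a hedged illustration with simulated data, computes D0, DM, GM, and R2L from a statsmodels Logit fit and compares them with the corresponding values the package reports.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 0.8, -0.4])))))

fit = sm.Logit(y, X).fit(disp=0)
D_M = -2 * fit.llf                    # -2 log likelihood of the tested model
D_0 = -2 * fit.llnull                 # -2 log likelihood of the model with no predictors
G_M = D_0 - D_M                       # model chi-square
p_value = stats.chi2.sf(G_M, df=fit.df_model)
R2_L = G_M / D_0                      # likelihood ratio R-square

print(G_M, p_value, R2_L)
print(fit.llr, fit.llr_pvalue, fit.prsquared)   # same three quantities from statsmodels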
Other R2 analogs have been proposed but have various problems that include correlation with the base rate (to the extent that the base rate itself seems to determine the calculated accuracy of prediction), having no reasonable value for perfect prediction or for perfectly incorrect prediction, or being limited to dichotomous outcomes.

Alternatively, instead of being concerned with predicted probabilities, one might be concerned with how accurately cases are qualitatively classified into the categories of the outcome by the predictors (qualitative prediction). For this purpose, there is a family of indices of predictive efficiency, designated lambda-p, tau-p, and phi-p, that are specifically applicable to qualitative prediction, classification, and selection tables (regardless of whether they were generated by logistic regression or some other technique), as opposed to contingency tables more generally. Finally, none of the aforementioned indices of predictive efficiency (or R2 analogs) takes into account the ordering in an ordered polytomous outcome, for which one would naturally consider ordinal measures of association. Kendall's tau-b is an ordinal measure of association that, when squared (τb2), has a proportional reduction in error (PRE) interpretation, and it seems most promising for use with ordinal outcomes in logistic regression. Tests of statistical significance can be computed for all these coefficients of determination.

Unstandardized and Standardized Logistic Regression Coefficients

Interpretation of unstandardized logistic regression coefficients (bk, the estimated value of βk) is straightforward and parallel to the interpretation of unstandardized coefficients in linear regression: A one-unit increase in Xk is associated with a bk increase in logit(Y) (not in Y itself). If we raise the base of the natural logarithm, e = 2.718 . . . , to the power bk, we obtain the odds ratio, here designated ωk, which is sometimes presented in place of or in addition to bk and can be interpreted as indicating that a one-unit increase in Xk multiplies the odds of being in category 1 by ωk. Both bk and ωk convey exactly the same information, just in a different form. There are several possible tests of statistical significance for unstandardized logistic regression coefficients. The univariate Wald statistic can be calculated either as the ratio of the logistic regression coefficient to its standard error (SE),

bk/SE(bk),

which has an approximate normal distribution, or [bk/SE(bk)]2, which has an approximate chi-square distribution. The Wald statistic, however, tends to be problematic for large bk, tending to fail to reject the null hypothesis when the null hypothesis is false (Type II error), but it might still be the best available option when ML is not used to estimate the model. Alternatives include the score statistic and the likelihood ratio statistic (the latter being the difference in DM with and without Xk in the equation). When ML estimation is used, the likelihood ratio statistic, which has a chi-square distribution and applies to both bk and ωk, is generally the preferred test of statistical significance for bk and ωk.

Unless all predictors are measured in exactly the same units, neither bk nor ωk clearly indicates whether one variable has a stronger impact on the outcome than another. Likewise, the statistical significance of bk or ωk tells us only how sure we are that a relationship exists, not how strong the relationship is. In linear regression, to compare the substantive significance (strength of relationship, which does not necessarily correspond to statistical significance) of predictors measured in different units, we often rely on standardized regression coefficients. In logistic regression, there are several alternatives for obtaining something like a standardized coefficient. A relatively quick and easy option is simply to standardize the predictors (standardizing the outcome does not matter, because it is the probability of being in a particular category of Y, not the actual value of Y, that is predicted in logistic regression). A slightly more complicated approach is to calculate

bk* = (bk)(sx)(R)/slogit(Y),

where bk* is the fully standardized logistic regression coefficient, bk is the unstandardized logistic regression coefficient, sx is the standard deviation of the predictor Xk, R is the correlation between the observed value of Y and the predicted probability of being in category 1 of Y, slogit(Y) is the standard deviation of the predicted values of logit(Y), and the quantity slogit(Y)/R represents the estimated standard deviation in the observed values of logit(Y) (which must be estimated, because the observed values are positive or negative infinity for any single case).
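The quantities described in this section are straightforward to compute from a fitted model. The following sketch is illustrative only: the data are simulated, and the calculation of R and slogit(Y) simply follows the formula above. It obtains the odds ratios ωk = e raised to the power bk, the Wald ratios bk/SE(bk), and the fully standardized coefficients bk*.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=(400, 2))
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.2, 0.9, -0.6])))))
fit = sm.Logit(y, X).fit(disp=0)

b = fit.params[1:]                        # unstandardized coefficients b_k
omega = np.exp(b)                         # odds ratios, e to the power b_k
wald_z = b / fit.bse[1:]                  # b_k / SE(b_k)

p_hat = fit.predict(X)                    # predicted probability of category 1
logit_hat = np.log(p_hat / (1 - p_hat))   # predicted values of logit(Y)
R = np.corrcoef(y, p_hat)[0, 1]           # correlation of observed Y with predicted probability
b_star = b * x.std(axis=0, ddof=1) * R / logit_hat.std(ddof=1)

print(omega, wald_z, b_star)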
The advantage to this fully standardized logistic regression coefficient is that it behaves more like the standardized coefficient in linear regression, including showing promise for use in path analysis with logistic regression. This technique is currently under development. Also, parallel to the use of OLS regression or more sophisticated structural equation modeling techniques in linear panel analysis, it is possible to use logistic regression in panel analysis; once one decides on an appropriate way to measure change in the linear panel analysis, the application of logistic regression is straightforward.

Logistic Regression and Its Alternatives

Alternatives to logistic regression include probit analysis, discriminant analysis, and models practically identical to the logistic regression model but with different distributional assumptions (e.g., complementary log-log or extreme value instead of logit). Logistic regression, however, has increasingly become the method most often used in empirical research. Its broad applicability to different types of categorical outcomes and the ease with which it can be implemented in statistical software algorithms, plus its apparent consistency with realistic assumptions about real-world empirical data, have led to the widespread use of logistic regression in the biomedical, behavioral, and social sciences.

Scott Menard

See also Chi-Square Test; Coefficients of Correlation, Alienation, and Determination; Collinearity; Dependent Variable; Dummy Coding; F Test; General Linear Model; Independent Variable; Interaction; Least Squares, Methods of; Likelihood Ratio Statistic; Multiple Regression; Odds Ratio; Significance, Statistical

Further Readings

Fienberg, S. E. (1980). The analysis of cross-classified categorical data (2nd ed.). Cambridge: MIT Press.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: John Wiley.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.
Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician, 54, 17–24.
Menard, S. (2002). Applied logistic regression analysis (2nd ed.). Thousand Oaks, CA: Sage.
Menard, S. (2004). Six approaches to calculating standardized logistic regression coefficients. The American Statistician, 58, 218–223.
Menard, S. (2008). Panel analysis with logistic regression. In S. Menard (Ed.), Handbook of longitudinal research: Design, measurement, and analysis. San Francisco: Academic Press.
O'Connell, A. A. (2006). Logistic regression models for ordinal response variables. Thousand Oaks, CA: Sage.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705–724.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Simonoff, J. S. (1998). Logistic regression, categorical predictors, and goodness of fit: It depends on who you ask. The American Statistician, 52, 10–14.

LOGLINEAR MODELS

This entry provides a nontechnical description of loglinear models, which were developed to analyze multivariate cross-tabulation tables. Although a detailed exposition is beyond its scope, the entry describes when loglinear models are necessary, what these models do, how they are tested, and the more familiar extensions of binomial and multinomial logistic regression.

Why Loglinear Models?

Many social science phenomena, such as designated college major or type of exercise, are non-numeric, and categories of the variable cannot even be ordered from highest to lowest. Thus, the phenomenon is a nominal dependent variable; its categories form a set of mutually exclusive qualities or traits. Any two cases might fall into the same or different categories, but we cannot assert that the value of one case is more or less than that of a second.
Many popular statistics assume the dependent or criterion variable is numeric (e.g., years of formal education). What can the analyst investigating a nominal dependent variable do? There are several techniques for investigating a nominal dependent variable, many of which are discussed in the next section. (Those described in this entry can also be used with ordinal dependent variables. The categories of an ordinal variable can be rank ordered from highest to lowest, or most to least.)

One alternative is logistic regression. However, many analysts have learned binomial logistic regression using only dichotomous or "dummy" dependent variables scored 1 or 0. Furthermore, the uninitiated interpret logistic regression coefficients as if they were ordinary least squares (OLS) regression coefficients. A second analytic possibility uses three-way cross-tabulation tables and control variables with nonparametric statistical measures. This venerable tradition of "physical" (rather than "statistical") control presents its own problems, as follows:

• Limited inference tests for potential three-variable statistical interactions.
• Limiting the analysis to an independent, dependent, and control variable.
• There is no "system" to test whether one variable affects a second indirectly through a third variable; for example, education usually influences income indirectly through its effects on occupational level.
• The three-variable model has limited utility for researchers who want to compare several causes of a phenomenon.

A third option is the linear probability model (LPM) for a dependent dummy variable scored 1 or 0. In this straightforward, typical OLS regression model, B coefficients are interpreted as raising or lowering the probability of a score of 1 on the dependent variable.

However, the LPM, too, has several problems. The regression often suffers from heteroscedasticity in which the dependent variable variance depends on scores of the independent variable(s). The dependent variable variance is truncated (at a maximum 0.25). The LPM can predict impossible values for the dependent variable that are larger than 1 or less than 0.

Thus the following dilemma: Many variables researchers would like to explain are non-numeric. Using OLS statistics to analyze them can produce nonsensical or misleading results. Some common methods taught in early statistics classes (e.g., three-way cross tabulations) are overly restrictive or lack tests of statistical significance. Other techniques (e.g., LPM) have many unsatisfactory outcomes.

Loglinear models were developed to address these issues. Although these models have a relatively long history in statistical theory, their practical application awaited the use of high-speed computers.

What Is a Loglinear Model?

Technically, a loglinear model is a set of specified parameters that generates a multivariate cross-tabulation table of expected frequencies or table cell counts. In the general cell frequency (GCF) loglinear model, interest centers on the joint and simultaneous distribution of several variables in the table cells. The focus includes relationships among independent variables as well as those between an independent and a dependent variable.

Table 1 is a simple four-cell (2 × 2) table using 2008 General Social Survey data (NORC at the University of Chicago), which is an in-person representative sample of the United States. Table 1 compares 1,409 male and female adults on the percentage who did or did not complete a high-school chemistry course. Although men reported completing high-school chemistry more than women by 8%, these results could reflect sampling error (i.e., they are a "sample accident" not a "real" population sex difference).

Table 1   Respondent Completed High-School Chemistry Course by Sex

Completed High-School                       Sex
Chemistry Course              Male                Female
Yes                           55.5%               47.5%
No                            44.5                52.5
Total                         100.0% (869)        100.0% (540)
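The observed counts behind Table 1 (and Table 2, below) can be used to reproduce the expected frequencies of the "no sex differences" model and the odds discussed in the following sections. The sketch below, using Python's NumPy and SciPy, is an illustration only and is not part of the original entry.

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts underlying Table 1 (chemistry completion by sex)
observed = np.array([[483, 257],    # Yes: males, females
                     [386, 283]])   # No:  males, females

# Expected counts under the "no sex differences" (independence) model
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(np.round(expected))           # approximately [[456, 284], [413, 256]]

# Odds and odds ratios
odds_yes_no = observed[0].sum() / observed[1].sum()  # 740/669, about 1.11
odds_male = 483 / 386                                # first-order conditional odds, males
odds_female = 257 / 283                              # first-order conditional odds, females
print(odds_yes_no, odds_male / odds_female)          # second-order odds, about 1.37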
The loglinear analyst compares the set of generated or expected table cell frequencies with the set of observed table cell counts. If the two sets of cell counts coincide overall within sampling error, the analyst says, "the model fits." If the deviations between the two sets exceed sampling error, the model is a "poor fit." Under the latter circumstances, the analyst must respecify the parameters to generate new expected frequencies that more closely resemble the observed table cell counts.

Even in a two-variable table, more than one outcome model is possible. One model, for example, could specify that in the American population, males and females completed a chemistry course at equal rates; thus, in this sample, we would predict that 52.5% of each sex completed high-school chemistry. This outcome is less complicated than one specifying sex differences: If females and males similarly completed high-school chemistry, then explaining a sex difference in chemistry exposure is unnecessary.

Table 2 shows expected frequency counts for this "no sex differences" model in each cell, with the actual observed frequencies shown after the slash. Thus, when calculating expected frequencies, sample males and females were assumed to have 52.5% high-school chemistry completion rates. The table has been constrained to match the overall observed frequencies for gender and chemistry course exposure.

Table 2   Expected and Observed Frequencies for High-School Chemistry Course by Sex

Completed High-School                  Sex
Chemistry Course           Male         Female       Total
Yes                        456/483      284/257        740
No                         413/386      256/283        669
Total                          869          540       1409

Comparing expected and observed cell counts, males have fewer expected than observed cases completing chemistry, whereas females have greater expected than observed cases completing high-school chemistry.

Statistically significant GCF coefficients increase or decrease the predicted (modeled) cell counts in a multivariate cross-tabulation table. Negative parameters mean fewer cell frequencies than would occur with a predicted no-effects model. Positive parameters mean higher cell counts than a no-effects model would predict.

Parameters in loglinear models (and by extension their cousins, logistic regression and logit models) are maximum likelihood estimators (MLEs). Unlike direct estimates such as OLS coefficients in linear regression, MLEs are solved through iterative, indirect methods. Reestimating MLEs, which can take several successively closer reestimate cycles, is why high-speed computers are needed.

A Basic Building Block of Loglinear Models: The Odds Ratio

The odds ratio is formed by the ratio of one cell count in a variable category to a second cell count for the same variable, for example, the U.S. ratio of males to females. Compared with the focus on the entire table in GCF models, this odds ratios subset of loglinear models focuses on categories of the dependent variable (categories in the entire table, which the loglinear model examines, are used to calculate the odds ratios, but the emphasis is on the dependent variable and less on the table as a whole). In Table 2, 740 adults completed a chemistry course and 669 did not, making the odds ratio or odds yes:no 740/669 or 1.11. An odds ratio of 1 would signify a 50–50 split on completing high-school chemistry for the entire sample.

In a binary odds, one category is designated as a "success" ("1"), which forms the odds numerator, and the second as a "failure" ("0"), which forms the ratio denominator. These designations do not signify any emotive meaning of "success." For example, in disease death rates, the researcher might designate death as a success and recovery as a failure. The odds can vary from zero (no successes) to infinity; they are undefined when the denominator is zero. The odds are fractional when there are more failures than successes; for example, if most people with a disease survive, then the odds would be fractional.

A first-order conditional odds considers one independent variable as well as scores on the
dependent variable. The observed first-order chemistry conditional for males in Table 2 (yes:no) is 483/386 = 1.25, and for females it is 257/283 = 0.91. Here, the first-order conditional indicates that males more often completed chemistry than not; however, women did not complete chemistry more often than they successfully completed it.

A second-order odds of 1 designates statistical independence; that is, changes in the distribution of the second variable are not influenced by any systematic change in the distribution of the first variable. Second-order odds ratios departing from 1 indicate two variables are associated. Here, the second-order odds (males:females) of the two first-order conditionals on the chemistry course is 1.25/0.91 or 1.37. Second-order odds greater than 1 indicate males completed a chemistry course more often than females, whereas fractional odds would signify women more often completed chemistry. By extension with more variables, third, fourth, or higher order odds ratios can be calculated.

The natural logarithm (base e, or Euler's number, abbreviated as ln) of the odds is a logit. The male first-order logit on completing chemistry is ln 1.25 or 0.223; for females, it is ln 0.91 = −.094. Positive logits signify more "successes" than "failures," whereas negative logits indicate mostly failures. Unlike the odds ratio, logits are symmetric around zero. An overwhelming number of "failures" would produce a large negative logit. Logits of 0 indicate statistical independence.

Logits can be calculated on observed or modeled cell counts. Analysts more often work with logits when they have designated a dependent variable. Original model effects, including logits, are multiplicative and nonlinear. Because these measures were transformed through logarithms, they become additive and linear, hence the term loglinear.

Loglinear parameters for the cross-tabulation table can specify univariate distributions and two variable or higher associations. In the Table 2 independence model, parameters match the observed total case base (n) and both univariate distributions exactly. The first-order odds for females and males are set to be identical, forcing identical percentages on the chemistry question (here 52.5%) for both sexes. The second-order odds (i.e., the odds of the first-order odds) are set to 1 and its ln to zero, signifying no sex effect on high-school chemistry completion.

Testing Loglinear Models

Although several models might be possible in the same table of observed data, not all models will replicate accurately the observed table cell counts within sampling error. A simpler model that fits the data well (e.g., equal proportions of females and males completed high-school chemistry) is usually preferred to one more complex (e.g., males more often elect chemistry than females). Loglinear and logit models can have any number of independent variables; the interrelationships among those and with a dependent variable can quickly become elaborate. Statistical tests estimate how closely the modeled and observed data coincide.

Loglinear and logit models are tested for statistical significance with a likelihood ratio chi-square statistic, sometimes designated G2 or L2, distinguishing it from the familiar Pearson chi-square (χ2). This multivariate test of statistical significance is one feature that turns loglinear analysis into a system, which is comparable with an N-way analysis of variance or multiple regression as opposed to physical control and inspecting separate partial cross tabulations. One advantage of the logarithmic L2 statistic is that it is additive: The L2 can be partitioned with portions of it allocated to different pieces of a particular model to compare simpler with more complicated models on the same cross-tabulation table.

Large L2s imply sizable deviations between the modeled and observed data, which means the loglinear model does not fit the observed cell counts. The analyst then adds parameters (e.g., a sex difference on high school chemistry) to the loglinear equation to make the modeled and observed cell frequencies more closely resemble each other. The most complex model, the fully saturated model, generates expected frequencies that exactly match the observed cell frequencies (irrespective of the number of variables analyzed). The saturated model always fits perfectly with an L2 = 0.
odds of the first-order odds) are set to 1 and its ln chemistry) must be retained so the model fits or
The analyst can test whether a specific parameter or effect (e.g., a sex difference on high-school chemistry) must be retained so the model fits or whether it can be dropped. The parameter of interest is dropped from the equation; model cell counts are reestimated and the model is retested. If the resulting L2 is large, the respecified model is a poor fit and the effect is returned to the loglinear equation. If the model with fewer effects fits, the analyst next examines which additional parameters can be dropped. In addition to the L2, programs such as SPSS, an IBM product (formerly called PASW Statistics), report z scores for each specified parameter to indicate which parameters are probably necessary for the model.

Most models based on observed data are hierarchical; that is, more complex terms contain all lower order terms. For example, in the sex-chemistry four-cell table, a model containing a sex by chemistry association would also match the sex distribution, the chemistry distribution (matching the modeled univariate distribution on the chemistry course to the observed split in the variable), and the case base n to the observed sample size. Nonhierarchical models can result from some experimental designs (equal cases for each treatment group) or disproportionate sampling designs. Final models are described through two alternative terminologies. The saturated hierarchical model for Tables 1 and 2 could be designated as (A*B) or as {AB}. Either way, this hierarchical model would include parameters for n, the A variable, the B variable, and the AB association. For hierarchical models, lower order terms are assumed included in the more complex terms. For nonhierarchical models, the analyst must separately specify all required lower order terms.

Degrees of Freedom

L2 statistics are evaluated for statistical significance with respect to their associated degrees of freedom (df). The df in loglinear models depend on the number of variables, the number of categories in each variable, and the effects the model specifies. The total df depends on the total number of cells in the table. In the saturated 2 × 2 (four-cell) table depicted in Tables 1 and 2, each variable has two categories. The case base (n) counts as 1 df, the variables sex and the chemistry course each have 2 − 1 df, and the association between sex and the chemistry course has (2 − 1)*(2 − 1) or 1 df.

Any time we "fix" a parameter, that is, specify that the expected and observed cell counts or variable totals must match for that variable or association, we lose df. A fully saturated model specifying a perfect match for all cells has zero df. The model fits but might be more complex than we would like.

Extensions and Uses of Loglinear Models

Logit and logistic regression models are derived from combinations of cells from an underlying GCF model. When the equations for cell counts are converted to odds ratios, terms describing the distributions of and associations among the independent variables cancel and drop from the equation, leaving only the split on the dependent variable and the effects of independent variables on the dependent variable. Because any variable, including a dependent variable, in a GCF model can have several categories, the dependent variable in logistic regression can also have several categories. This is multinomial logistic regression and it extends the more familiar binomial logistic regression.

Of the possible loglinear, logit, and logistic regression models, the GCF model allows the most flexibility, despite its more cumbersome equations. Associations among all variables, including independent variables, can easily be assessed. The analyst can test path-like causal models and check for indirect causal effects (mediators) and statistical interactions (moderators) more readily than in extensions of the GCF model, such as logit models.

Although the terminology and underlying premises of the loglinear model might be unfamiliar to many analysts, it provides useful ways of analyzing nominal dependent variables that could not be done otherwise. Understanding loglinear models also helps to describe correctly the logarithmic (logit) or multiplicative and exponentiated (odds ratios) extensions in logistic regression, giving analysts a systemic set of tools to understand the relationships among non-numeric variables.

Susan Carol Losh

See also Likelihood Ratio Statistic; Logistic Regression; Nonparametric Statistics; Odds Ratio
Further Readings

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley.
Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley.
Aldrich, J. H., & Nelson, F. D. (1995). Linear probability, logit and probit models. Thousand Oaks, CA: Sage.
DeMaris, A. (1992). Logit modeling: Practical applications. Newbury Park, CA: Sage.
Gilbert, N. (1993). Analyzing tabular data: Loglinear and logistic models. London: UCL Press.
Schneider, B., Carnoy, M., Kilpatrick, J., Schmidt, W. H., & Shavelson, R. J. (2007). Estimating causal effects using experimental and observational designs. Washington, DC: American Educational Research Association.

LONGITUDINAL DESIGN

A longitudinal design is one that measures the characteristics of the same individuals on at least two, but ideally more, occasions over time. Its purpose is to address directly the study of individual change and variation. Longitudinal studies are expensive in terms of both time and money, but they provide many significant advantages relative to cross-sectional studies. Indeed, longitudinal studies are essential for understanding developmental and aging-related changes because they permit the direct assessment of within-person change over time and provide a basis for evaluating individual differences in level as separate from the rate and pattern of change as well as the treatment of selection effects related to attrition and population mortality.

Traditional Longitudinal Designs

Longitudinal designs can be categorized in several ways but are defined primarily on differences in initial sample (e.g., age homogeneous or age heterogeneous), number of occasions (e.g., semiannual or intensive), spacing between assessments (e.g., widely spaced panel designs or intensive measurement designs), and whether new samples are obtained at subsequent measurement occasions (e.g., sequential designs). These design features can be brought together in novel ways to create study designs that are more appropriate to the measurement and modeling of different outcomes, life periods, and in capturing intraindividual variation, change, and events producing such changes.

In contrast to a cross-sectional design, which allows comparisons across individuals differing in age (i.e., birth cohort), a longitudinal design aims to collect information that allows comparisons across time in the same individual or group of individuals. One dimension on which traditional longitudinal designs can be distinguished is their sampling method. In following a group of individuals over time, one might choose to study a particular birth cohort, so that all the research subjects share a single age and historical context. As an extension of this, a variety of sequential designs exists, in which multiple cohorts are systematically sampled and followed over time. K. Warner Schaie's Seattle Longitudinal Study used such a design, in which new samples of the same cohorts are added at each subsequent observational "wave" of the study. More typical, however, is an age-heterogeneous sample design, which essentially amounts to following an initial cross-sectional sample of individuals varying in age (and therefore birth cohort) over time. Cohort-sequential, multiple cohort, and accelerated longitudinal studies are all examples of mixed longitudinal designs. These designs contain information on both initial between-person age differences and subsequent within-person age changes. An analysis of such designs requires additional care to estimate separately the between-person (i.e., cross-sectional) and within-person (i.e., longitudinal) age-related information. Numerous discussions are available regarding the choice of design and the associated strengths and threats to the validity associated with each.

Another dimension along which traditional longitudinal designs can differ is the interval between waves. Typical longitudinal studies reassess participants at regular intervals, with relatively equal one or several-year gaps, but these might vary from smaller (e.g., half-year) to longer (7-year or decade). Variations in this pattern have been used because of funding cycles; for example, intervals within a single study might range from 2 to 5 or more years, and to creative use of opportunities, such as recontacting in later life a sample of
former participants in a study of child development or of army inductees, who were assessed in childhood or young adulthood. Intensive measurement designs, such as daily diary studies and burst measurement designs, are based on multiple assessments within and across days, permitting the analysis of short-term variation and change.

A relevant design dimension associated with longitudinal studies is breadth of measurement. Given the high financial, time, and effort costs, it is not unusual for a longitudinal study to be multidisciplinary, or at least to attempt to address a significant range of topics within a discipline. While some studies have dealt with the cost issue by maintaining a very narrow focus, these easily represent the minority.

Levels of Analysis in Longitudinal Research

For understanding change processes, longitudinal studies provide many advantages relative to cross-sectional studies. Longitudinal data permit the direct estimation of parameters at multiple levels of analysis, each of which is complementary to understanding population and individual change with age. Whereas cross-sectional analyses permit between-person analysis of individuals varying in age, longitudinal follow-up permits direct evaluation of both between-person differences and within-person change.

Information available in cross-sectional and longitudinal designs can be summarized in terms of seven main levels of analysis and inferential scope (shown in italics in the next section). These levels can be ordered, broadly, in terms of their focus, ranging from the population to the individual. The time sampling generally decreases across levels of analysis, from decades for analysis of historical birth cohort effects to days, minutes, or seconds for assessment of highly variable within-person processes.

These levels of analysis are based on a combination of multiple-cohort, between-person, and within-person designs and analysis approaches and all are represented by recent examples in developmental research. Between-cohort differences, which is the broadest level, can be examined to evaluate whether different historical contexts (e.g., indicated by birth cohort) have lasting effects on level and on rate of change in functioning in later life. Population average trends describe aggregate population change. Trends can be based on between-person differences in age-heterogeneous studies (although confounded with differences related to birth cohort and sample selection associated with attrition and mortality) or on direct estimates of within-person change in studies with longitudinal follow-up, in which case they can be made conditional on survival. Between-person age differences can be analyzed in terms of shared age-related variance in variance decomposition and factor models. This approach to understanding aging, however, confounds individual differences in age-related change with average age differences (i.e., between-person age trends), cohort influences, and mortality selection.

Longitudinal models permit the identification of individual differences in rates of change over time, which avoids making assumptions of ergodicity (that age differences between individuals and age changes within individuals are equivalent). In these models, time can be structured in many alternative ways. It can be defined as time since an individual entered the study, time since birth (i.e., chronological age), or time until or since occurrence of a shared event such as retirement or diagnosis of disease. Elaboration of the longitudinal model permits estimation of association among within-person rates of change in different outcomes, in other words, using multivariate associations among intercepts and change functions to describe the interdependence of change functions. In shorter term longitudinal designs, researchers have emphasized within-person variation as an outcome and have examined whether individuals who display greater variability relative to others exhibit this variation generally across different tasks. Within-person correlations (i.e., coupling or dynamic factor analysis) are based on the analysis of residuals (after separating intraindividual means and trends) and provide information regarding the correlation of within-time variation in functioning across variables. Each level of analysis provides complementary information regarding population and individual change, and the inferences and interpretations possible from any single level of analysis have distinct and delimited scope.
Considerations for the Design of Longitudinal Studies

The levels of analysis described previously correspond roughly to different temporal and historical (i.e., birth cohort) sampling frames and range from very long to potentially very short intervals of assessment. The interpretation, comparison, and generalizability of parameters derived from different temporal samplings must be carefully considered and require different types of designs and measurements. The temporal characteristics of change and variation must be taken into account, as different sampling intervals will generally lead to different results requiring different interpretations for both within- and between-person processes. For example, correlations between change and variability over time across outcomes will likely be different for short temporal intervals (minutes, hours, days, or weeks) in contrast to correlations among rates of change across years, the typical intervals of many longitudinal studies on aging.

Measurement interval is also critical for the prediction of outcome variables and for establishing evidence on leading versus lagging indicators. Causal mechanisms need time for their influences to be exerted, and the size of the effect will vary with the time interval between the causal influence and the outcome. Thus, if one statistically controls for a covariate measured at a time before it exerts its causal influence, the resultant model parameters might still be biased by the covariate. Time-varying covariates must be measured within the time frame in which they are exerting their influence to provide adequate representations of the causal, time-dependent processes. However, deciding on what an appropriate time frame might be is not an easy task, and might not be informed by previous longitudinal studies, given that the data collection intervals from many studies are determined by logistical and financial factors, rather than theoretical expectations about the timing of developmental processes.

Population Sampling, Attrition, and Mortality

In observational studies, representative sampling is important, as random assignment to conditions is not possible. However, attrition and mortality selection processes complicate both the definition of an aging population and the sampling procedures relevant to obtaining a representative sample in studies of later life. Attrition in longitudinal studies of aging is often nonrandom, or selective, in that it is likely to result from mortality or declining physical and mental functioning of the participants during the period of observation. This presents an important inferential problem, as the remaining sample becomes less and less representative of the population from which it originated. Generalization from the sample of continuing participants to the initial population might become difficult to justify. However, a major advantage of longitudinal studies is that they contain information necessary to examine the impact of attrition and mortality selection on the observed data. This information, which is inaccessible in cross-sectional data, is essential for valid inferences and improved understanding of developmental and aging processes.

Heterogeneity in terms of chronological age and population mortality poses analytical challenges for both cross-sectional and longitudinal data and is a particular challenge to studies that begin with age-heterogeneous samples. Age-homogeneous studies, where single or narrow age birth cohorts are initially sampled, provide an initially well-defined population that can be followed over time, permitting conditional estimates based on subsequent survival. However, initial sampling of individuals at different ages (i.e., age-heterogeneous samples), particularly in studies of adults and aging, confounds population selection processes related to mortality. The results from longitudinal studies, beginning as age-heterogeneous samples, can be properly evaluated and interpreted when the population parameters are estimated conditional on initial between-person age differences, as well as on mortality and attrition processes that permit inference to defined populations.
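As a hedged illustration of how individual differences in level and rate of change are estimated from repeated observations, the sketch below fits a random-intercept, random-slope growth model with Python's statsmodels; the long-format variables (id, time, score) and all numbers are hypothetical and are not drawn from any study discussed in this entry. Mixed models estimated by likelihood-based methods of this kind use all available observations for each person rather than dropping cases with incomplete follow-up, a point taken up in the next section.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_person, n_wave = 200, 5
ids = np.repeat(np.arange(n_person), n_wave)
time = np.tile(np.arange(n_wave), n_person)            # occasions since study entry
u0 = rng.normal(0, 2.0, n_person)[ids]                 # person-specific level
u1 = rng.normal(0, 0.3, n_person)[ids]                 # person-specific rate of change
score = 50 + u0 + (-0.5 + u1) * time + rng.normal(0, 1, n_person * n_wave)
data = pd.DataFrame({"id": ids, "time": time, "score": score})

# Fixed average intercept and slope, plus random intercepts and slopes by person
model = smf.mixedlm("score ~ time", data, groups=data["id"], re_formula="~time")
result = model.fit()
print(result.summary())   # average change over time and its person-to-person variance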
Incomplete data can take many forms, such as item or scale nonresponse, participant attrition, and mortality within the population of interest (i.e., lack of initial inclusion or follow-up because of death). Statistical analysis of longitudinal studies is aimed at providing inferences regarding the level and rate of change in functioning, group differences, variability, and construct relations within a population, and incomplete data complicate this process. To make appropriate population inferences about development and change, it is important not only to consider thoroughly the processes leading to incomplete data (e.g., health, fatigue, cognitive functioning), but also to obtain measurements of these selection and attrition processes to the greatest extent possible and include them in the statistical analysis based on either maximum likelihood estimation or multiple imputation procedures. Longitudinal studies might be hindered by missing data or not being proximal to critical events that represent or influence the process of interest. As a consequence, some researchers have included additional assessments triggered by a particular event or response.

Effects of Repeated Testing

Retest (i.e., practice, exposure, learning, or reactivity) effects have been reported in several longitudinal studies, particularly in studies on aging and cognition where the expected effects are in the opposite direction. Estimates of longitudinal change might be exaggerated or attenuated depending on whether the developmental function is increasing or decreasing with age. Complicating matters is the potential for improvement to occur differentially, related to ability level, age, or task difficulty, as well as to related influences such as warm-up effects, anxiety, and test-specific learning. Intensive measurement designs, such as those involving measurement bursts with widely spaced sets of intensive measurements, are required to distinguish short-term learning gains from long-term aging-related changes. The typical longitudinal design used to estimate developmental or aging functions usually involves widely spaced intervals between testing occasions. Design characteristics that are particularly sensitive to the assessment of time-related processes, such as retest or learning effects, have been termed temporal layering and involve the use of different assessment schedules within longitudinal design (i.e., daily, weekly, monthly, semiannually, or annually). For example, one such alternative, the measurement burst design, where assessment bursts are repeated over longer intervals, is a compromise between single-case time series and conventional longitudinal designs, and they permit the examination of within-person variation, covariation, and change (e.g., because of learning) within measurement bursts and evaluation of change in maximal performance over time across measurement bursts.

Selecting Measurement Instruments

The design of future longitudinal studies on aging can be usefully informed by the analysis and measurement protocol of existing studies. Such studies, completed or ongoing, provide evidence for informing decisions regarding optimal or essential test batteries of health, cognition, personality, and other measures. Incorporating features of measurement used in previous studies, when possible, would permit quantitative anchoring and essential opportunities for cross-cohort and cross-country comparison.

Comparable measures are essential for cross-study comparison, replication, and evaluation of generalizability of research findings. The similarity of a measure can vary at many levels, and within a single nation large operational differences can be found. When considering cross-cultural or cross-national data sets, these differences can be magnified: Regardless of whether the same measure has been used, differences are inevitably introduced because of language, administration, and item relevance. A balance must be found between optimal similarity of administration, similarity of meaning, and significance of meaning, avoiding unreasonable loss of information or lack of depth. These challenges must clearly be addressed in a collaborative endeavor, but in fact they are also critical to general development of the field, for without some means for comparing research products, our findings lack evidence for reproducibility and generalizability.

Challenges and Strengths

Longitudinal studies are necessary for explanatory theories of development and aging. The evidence obtained thus far from long-term longitudinal and intensive short-term longitudinal studies indicates remarkable within-person variation in many types of processes, even those once considered highly stable (e.g., personality). From both theoretical and empirical perspectives, between-person differences are a complex function of initial individual
differences and intraindividual change. The iden- L. K. George (Eds.), Handbook of aging and the
tification and understanding of the sources of social sciences (6th ed., pp. 2038). San Diego, CA:
between-person differences and of developmental Academic Press.
and aging-related changes requires the direct Baltes, P. B., & Nesselroade, J. R. (1979). History and
rationale of longitudinal research. In J. R. Nesselroade
observation of within-person change available in
& P. B. Baltes (Eds.), Longitudinal research in the
longitudinal studies. study of behavior and development. New York:
There are many challenges for the design and Academic Press.
analysis of strict within-person studies and large- Hofer, S. M., Flaherty, B. P., & Hoffman, L. (2006).
sample longitudinal studies, and these will differ Cross-sectional analysis of time-dependent data:
according to purpose. The challenges of strict Problems of mean-induced association in age-
within-person studies include limits on inferences heterogeneous samples and an alternative method
given the smaller range of contexts and character- based on sequential narrow age-cohorts. Multivariate
istics available within any single individual. Of Behavioral Research, 41, 165187.
Hofer, S. M., & Hoffman, L. (2007). Statistical analysis
course, the study of relatively stable individual
with incomplete data: A developmental perspective.
characteristics and genetic differences requires
In T. D. Little, J. A. Bovaird, & N. A. Card (Eds.),
between-person comparison approaches. In gen- Modeling ecological and contextual effects in
eral, combinations of within-person and between- longitudinal studies of human development
person population and temporal sampling designs (pp. 1332). Mahwah, NJ: Lawrence
are necessary for comprehensive understanding of Erlbaum.
within-person processes of aging because people Hofer, S. M., & Piccinin, A. M. (2009). Integrative data
differ in their responsiveness to influences of all analysis through coordination of measurement and
types, and the breadth of contextual influences analysis protocol across independent longitudinal
associated with developmental and aging out- studies. Psychological Methods, 14, 150164.
Hofer, S. M., & Sliwinski, M. J. (2006). Design and
comes is unavailable in any single individual. The
analysis of longitudinal studies of aging. In J. E. Birren
strength of longitudinal designs is that they permit
& K. W. Schaie (Eds.), Handbook of the psychology
the simultaneous examination of within-person of aging (6th ed., pp. 1537). San Diego, CA:
processes in the context of between-person vari- Academic Press.
ability, between-person differences in change, Nesselroade, J. R. (2001). Intraindividual variability in
and between-person moderation of within-person development within and between individuals.
processes. European Psychologist, 6, 187193.
Piccinin, A. M., & Hofer, S. M. (2008). Integrative
Scott M. Hofer and Andrea M. Piccinin analysis of longitudinal studies on aging:
Collaborative research networks, meta-analysis, and
See also Cross-Sectional Design; Population; Sequential optimizing future studies. In S. M. Hofer &
Design; Within-Subjects Design D. F. Alwin (Eds.), Handbook on cognitive aging:
Interdisciplinary perspectives (pp. 446–476).
Thousand Oaks, CA: Sage.
Further Readings
Schaie, K. W., & Hofer, S. M. (2001). Longitudinal
Alwin, D. F., Hofer, S. M., & McCammon, R. (2006). studies of aging. In J. E. Birren & K. W. Schaie (Eds.),
Modeling the effects of time: Integrating demographic Handbook of the psychology of aging (pp. 5377).
and developmental perspectives. In R. H. Binstock & San Diego, CA: Academic Press.
M
is no significant interaction between the factors
MAIN EFFECTS and many factors are involved in the study, testing
the main effects with a factorial design likely con-
fers efficiency.
Main effects can be defined as the average differ- Plausibly, in factorial design, each factor may
ences between the levels of one independent vari- have more than one level. Hence, the significance
able (or factor) across the levels of the other independent of the main effect, which is the difference in the
variables. In other words, investigators identify marginal means of one factor over the levels of
main effects, or how one independent variable other factors, can be examined. For instance, sup-
influences the dependent variable, by ignoring or pose an education researcher is interested in know-
constraining the other independent variables in ing how gender affects the ability of first-year
a model. For instance, let us say there is a differ- college students to solve algebra problems. The
ence between two levels of independent variable A first variable is gender, and the second variable is
and differences between three levels of indepen- the level of difficulty of the algebra problems. The
dent variable B. Consequently, researchers can second variable has two levels of difficulty: difficult
study the presence of both factors separately, as in (proof of algebra theorems) and easy (solution of
single-factor experiments. Thus, main effects can simple multiple-choice questions). In this example,
be determined in either single-factor experiments the researcher uses and examines a 2 × 2 factorial
or factorial design experiments. In addition, main design. The number ‘‘2’’ represents the number of
effects can be interpreted meaningfully only if the levels that each factor has. If there are more than
interaction effect is absent. This entry focuses on two factors, then the factorial design would be
main effects in factorial design, including analysis adjusted; for instance, the factorial design may
of the marginal means. look like 3 × 2 × 2 for three factors with 3 levels
versus 2 levels and another 2-level factor. There-
fore, a total of three main effects would have to be
Main Effects in Factorial Design
considered in the study.
Factorial design is applicable whenever researchers In the previous example of 2 × 2 factorial
wish to examine the influence of a particular factor design, however, both variables are thought to
among two or more factors in their study. This influence the ability of first-year college students to
design is a method for controlling various factors solve algebra problems. Hence, two main effects
of interest in just one experiment rather than can be examined: (1) gender effects, while the level
repeating the same experiment for each of the fac- of difficulty effects is controlled and (2) level-of-
tors or independent variables in the study. If there difficulty effects, while gender effects are
Table 1 Main Effects of Gender and College Year on IQ Test Scores


College Year
Gender Freshman Sophomore Junior Senior Marginal Means
Male 100 100 110 120 107.5
Female 100 110 115 125 112.5
Marginal means 100 105 112.5 122.5 110
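The marginal means in Table 1 can be reproduced by averaging the cell means across the levels of the other factor. The following minimal sketch in Python (assuming equal cell sizes; the array holds the cell means from Table 1, and the variable names are invented for illustration) computes both sets of marginal means and the grand mean.

import numpy as np

# Cell means from Table 1: rows are gender (male, female),
# columns are college year (freshman, sophomore, junior, senior).
cell_means = np.array([[100.0, 100.0, 110.0, 120.0],
                       [100.0, 110.0, 115.0, 125.0]])

gender_marginals = cell_means.mean(axis=1)  # 107.5 and 112.5
year_marginals = cell_means.mean(axis=0)    # 100, 105, 112.5, 122.5
grand_mean = cell_means.mean()              # 110

print(gender_marginals, year_marginals, grand_mean)

With unequal cell sizes, simple row and column averages would no longer equal the marginal means used in the analysis, and weighted or least-squares means would be needed instead.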

controlled. The hypothesis also can be stated in particular factor to the random error variance.
terms of whether first-year male and female col- The larger the F ratio (i.e., the larger the relative
lege students differ in their ability to solve the variance), the more likely that the factor signifi-
more difficult algebra problems. The hypothesis cantly affects the dependent variable. To determine
can be answered by examining the simple main whether the F ratio is large enough to show that
effects of gender or the simple main effects of the main effects are significant, the researcher can
the second variable (level of difficulty). compare the F ratio with critical F by using the
critical values table provided in many statistics
textbooks. The researcher can also compare the p
Marginal Means
value in the ANOVA table with the chosen signifi-
An easy technique for checking the main effect of cance level, say .05. If p < .05, then the effect for
a factor is to examine the marginal means, or the that factor on the dependent variable is significant.
average difference at each level that makes up the The marginal means can then be interpreted from
factorial design. The differences between levels in these results, that is, which group (e.g., male vs.
a factor could preliminarily affect the dependent female, freshman vs. senior, or sophomore vs.
variable. The differences in the marginal means senior) is significantly higher or lower than the
also tell researchers how much, on average, one other groups on that factor. It is important to
level of the factor differs from the others in affect- report the F and p values, followed by the interpre-
ing the dependent variable. For instance, Table 1 tation of the differences in the marginal means of
shows the two-level main effect of gender and the a factor, especially for the significant main effects
four-level main effect of college year on IQ test on the dependent variable.
points. The marginal means from this 2 × 4 fac- The analysis of the main effects of a factor on
torial design show that there might be a main the dependent variable while other factors are
effect of gender, with an average difference of 5 controlled is used when a researcher is interested
points, on IQ test scores. Also, there might be in looking at the pattern of differences between
a main effect of college year, with differences of 5 the levels of individual independent variables.
to 22.5 points in IQ test scores across college The significant main effects give the researcher
years. To determine whether these point differ- information about how much one level of a fac-
ences are greater than what would be expected tor could be more or less over the other levels.
from chance, the significance of these mains effects The significant main effect, however, is less
needs to be tested. meaningful when the interaction effect is signifi-
The test of main effects significance for each cant, that is, when there is a significant interac-
factor is the test of between-subject effects pro- tion effect between factors A and B. In that case,
vided by the analysis of variance (ANOVA) table the researcher should test the simple main effects
found in many statistical software packages, such instead of the main effects on the dependent
as the SPSS (an IBM company, formerly called variable.
PASWâ Statistics), SAS, and MINITAB. The F
ratio for the two factors—which is empirically Zairul Nor Deana Binti Md Desa
computed from the amount of variance in the
dependent variable contributed by these two See also Analysis of Variance (ANOVA); Factorial
factors—is the ratio of the relative variance of that Design; Interaction; Simple Main Effects
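As a rough illustration of how the F ratio and p value for each main effect are obtained with statistical software, the following sketch fits a two-way factorial ANOVA in Python with statsmodels; the scores, factor labels, and sample size below are invented for illustration and are not taken from the examples in this entry.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Invented data: algebra scores by gender and problem difficulty.
data = pd.DataFrame({
    "score": [72, 68, 75, 70, 74, 71, 69, 73,
              55, 60, 58, 52, 57, 61, 54, 59],
    "gender": ["M", "M", "F", "F"] * 4,
    "difficulty": ["easy"] * 8 + ["hard"] * 8,
})

# C() marks categorical factors; the * term also includes the interaction.
model = ols("score ~ C(gender) * C(difficulty)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))

The resulting table reports an F ratio and p value for each main effect and for the interaction; as noted above, the main effects are interpreted only when the interaction is not significant.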
Further Readings of two tennis coaches, either Coach Alba or Coach
Bolt. At the end of the course, after 4 weeks of
Aitken, M. R. F. (2006). ANOVA for the behavioural
sciences researcher. Mahwah, NJ: Lawrence Erlbaum. intensive coaching, the tennis expert again exam-
Green, S. B., & Salkind, N. J. (2007). Using SPSS for ines the children on the test of their tennis ability
Windows and Macintosh: Analyzing and and records their performance. The amount of
understanding data (5th ed.). Upper Saddle River, NJ: improvement in the child’s tennis performance is
Prentice Hall. calculated by subtracting their score at the begin-
Keppel, G., & Wickens, T. D. (2004). Design and ning of the course from the one at the end. An
analysis: A researcher’s handbook (4th ed.). Upper interesting question arises: Is it better to be coa-
Saddle River, NJ: Prentice Hall. ched by Coach Alba or by Coach Bolt? Given that
Myers, R. H., Montgomery, D. C., & Vining, G. G.
the children play on the same tennis courts and
(2002). Generalized linear models: With applications
in engineering and the sciences. New York: Wiley.
follow the same course of study, the only differ-
ence between the two groups is the coaching. So
does one group improve more than the other?
As we are unsure whether the tennis expert’s test
MANN–WHITNEY U TEST scores satisfy parametric assumptions, a Mann–
Whitney test is undertaken on the improvement
scores to test the hypothesis. In this example, Alba
The Mann–Whitney U Test is a popular test for coaches six students and Bolt coaches five. In
comparing two independent samples. It is a non- Alba’s group, Juan receives an improvement score
parametric test, as the analysis is undertaken on the of 23, Todd gets 15, Maria 42, Charlene 20, Brad
rank order of the scores and so does not require 32, and Shannon 28. In Bolt’s group, Grace
the assumptions of a parametric test. It was origi- receives an improvement score of 24, Carl gets 38,
nally proposed by Frank Wilcoxon in 1945 for Kelly 48, Ron 45, and Danny 35. How do we
equal sample sizes, but in 1947 H. B. Mann and decide whether one of the coaches achieves the bet-
D. R. Whitney extended it to unequal sample sizes ter results? First, the results can be seen more
(and also provided probability values for the distri- clearly if they are put in a table in rank order, that
bution of U, the test statistic). is, listing the scores in order from least improved
When the null hypothesis is true, and the ranks (at the bottom) to most improved (at the top):
of the two samples are drawn from the same popu-
lation distribution, one would expect the mean Rank Student Name Improvement Score Coach
rank for the scores in one sample to be the same as
the mean rank for the scores in the other sample. 11 Kelly 48 Bolt
However, if there is an effect of the independent 10 Ron 45 Bolt
variable on the scores, then one would expect it to 9 Maria 42 Alba
influence their rank order, and hence, one would 8 Carl 38 Bolt
expect the mean ranks to be different for the two 7 Danny 35 Bolt
samples. 6 Brad 32 Alba
This entry discusses the logic and calculation 5 Shannon 28 Alba
of the Mann–Whitney U test and the probability 4 Grace 24 Bolt
of U. 3 Juan 23 Alba
2 Charlene 20 Alba
1 Todd 15 Alba
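Readers who wish to check the worked example with software can use a routine such as the SciPy implementation; the following minimal sketch (the choice of library is illustrative and not part of the original example) applies a two-sided test to the two coaches' improvement scores and should reproduce the U value and the two-tailed probability derived later in this entry.

from scipy.stats import mannwhitneyu

alba = [23, 15, 42, 20, 32, 28]  # Coach Alba's improvement scores
bolt = [24, 38, 48, 45, 35]      # Coach Bolt's improvement scores

# With small, tie-free samples an exact two-sided p value is computed.
u_statistic, p_value = mannwhitneyu(alba, bolt, alternative="two-sided")
print(u_statistic, round(p_value, 3))  # U = 5 for Alba's group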
The Logic of the Test
The logic of the test can be seen by an example. A In the final column of the table, it can be seen
group of children sign up for a tennis camp during that five of Alba’s students are in the bottom six
summer vacation. At the beginning of the course, places.
a tennis expert examines the children on their ten- The calculation of the Mann–Whitney U test is
nis ability and records each child’s performance. undertaken on the ranks, not the original scores.
The children are then randomly allocated to one So the key information from the above table, the
students’ rank plus their coach, is shown in the fol- very similar. For example, in the following table,
lowing table (where A indicates Coach Alba and B the two U values are actually the same:
indicates Coach Bolt):
Rank 1 2 3 4 5 6 7 8 9 10 11
Coach A A A B A A B B A B B

Rank 1 2 3 4 5 6 7 8 9 10 11 U
Coach A B A B A B A B A B A
Alba 0 1 2 3 4 5 15
Bolt 1 2 3 4 5 15

If there really is a difference between the


coaches, we would expect the students from one
So large differences in U values indicate a possi-
coach to be in the bottom positions in the rank
ble difference between the coaches, and similar U
order and the students from the other coach to be
values indicate little difference between the coaches.
in the top positions. However, if there is no differ-
Thus, we have a statistic for testing the hypothesis.
ence between the coaches, we would expect the
The calculated U values of 25 and 5 indicate a pos-
students to be scattered across the ranks. One way
sible difference in the coaches.
to measure this is to take each rank in turn and
consider how many results from the other coach
are below it. For example, the student at Rank 5 is The Calculation of the Test
coached by Coach Alba. There is only one of
If we refer to Coach Alba’s students as Sample 1
Coach Bolt’s students below Rank 5, so a score of
and Coach Bolt’s students as Sample 2, then n1,
1 is given to Alba’s student at Rank 5. This is done
the number of scores in Sample 1, is 6 and n2, the
for all Coach Alba’s students. Then a total score
number of scores in Sample 2, is 5.
for Alba’s students is produced by adding up these
In the worst possible situation (for Coach Alba,
values. This value is called U. The same calcula-
that is!) all Coach Alba’s students would take up
tion is done for Coach Bolt’s students, to produce
the bottom six places. The sum of their ranks
a second U value. The following table shows the
would be 1 þ 2 þ 3 þ 4 þ 5 þ 6 ¼ 21. Now how do
results:
Alba’s students really do? The actual sum of ranks
for Alba’s group, Sample 1, is R1 ¼ 1 þ 2 þ 3 þ
Rank 1 2 3 4 5 6 7 8 9 10 11 U 5 þ 6 þ 9 ¼ 26. The difference produces the for-
Coach A A A B A A B B A B B mula for the Mann–Whitney U statistic: U1 equals
Alba 0 0 0 1 1 3 5 the sum of actual ranks minus the sum of bottom
Bolt 3 5 5 6 6 25 n1 ranks or, expressed as a formula:

n1 ðn1 þ 1Þ
Notice that the U score of 5 indicates that most U1 ¼ R 1  ¼ 26  21 ¼ 5:
of Coach Alba’s students are near the bottom 2
(there are not many of Bolt’s students worse than
them) and the much larger U value of 25 indicates This formula gives us the same value of 5 that
that most of Coach Bolt’s students are near the top was calculated by a different method earlier. But
of the ranks. the formula provides a simple method of calcula-
However, consider what the U scores would be tion, without having to laboriously inspect the
if every one of Alba’s students had made up the ranks, as above. (Notice also that the mean rank
bottom six ranks. In this case, none of Alba’s stu- for Alba’s group is Rn 1 ¼ 26
6 ¼ 4:33; below 6, the
1
dents would have been above any of Bolt’s stu- middle rank for 11 results. This also provides an
dents in the ranks, and the U value would have indication that Alba’s students improve less than
been 0. The U value for Bolt’s students would have Bolt’s.)
been 30. The U statistic can be calculated for Coach
Now consider the alternative situation, when Bolt. If Bolt’s students had been in the bottom
the students from the two coaches are evenly five places, their ranks would have added up to 15
spread across the ranks. Here the U values will be (1 þ 2 þ 3 þ 4 þ 5). In actual fact, the sum of the
ranks of Bolt’s students is 4 þ 7 þ 8 þ 10 þ 11 ¼ improve less than Bolt’s, then there would be
40. So U2 equals the sum of actual ranks minus only 19 ways of getting 5 or less, and the proba-
the sum of bottom n2 ranks or, expressed as a bility by chance would be .041. With this one-
formula: tailed prediction, a U value of 5 would now
produce a significant difference at the signifi-
n2 ðn2 þ 1Þ cance level of p ¼ .05.
U2 ¼ R 2  ¼ 40  15 ¼ 25:
2 However, one does not normally need to work
out the probability for the calculated U values. It
This is a relatively large value, so Bolt’s students can be looked up in statistical tables. Alternatively,
are generally near the top of the ranks. (Notice a software package will give the probability of a U
also that the mean rank for Bolt’s group is value, which can be compared to the chosen signif-
R2 40
n2 ¼ 5 ¼ 8; above 6, the middle for 11 ranks.) icance level.
And the two values of U are U1 ¼ 5 and U2 ¼ 25, The Mann–Whitney U test is a useful test of
the same as produced by the different method small samples (Mann and Whitney, 1947, gave
earlier. tables of the probability of U for samples of n1
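The rank-sum arithmetic can also be scripted directly. The following minimal sketch (the variable names are invented for illustration) recovers the rank sums and both U values for the coaching example and confirms that the two U values add up to n1 × n2.

import numpy as np
from scipy.stats import rankdata

alba = np.array([23, 15, 42, 20, 32, 28])
bolt = np.array([24, 38, 48, 45, 35])

ranks = rankdata(np.concatenate([alba, bolt]))  # ranks 1 to 11, low to high
r1 = ranks[:len(alba)].sum()                    # 26
r2 = ranks[len(alba):].sum()                    # 40

n1, n2 = len(alba), len(bolt)
u1 = r1 - n1 * (n1 + 1) / 2                     # 26 - 21 = 5
u2 = r2 - n2 * (n2 + 1) / 2                     # 40 - 15 = 25

print(u1, u2, u1 + u2 == n1 * n2)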
The sum of two U values will always be n1n2, and n2 up to 8), but with large sample sizes (n1
which in this case is 30. While the two U values and n2 both greater than 20), then the U distribu-
are quite different from each other, indicating tion tends to a normal distribution with a mean of
a separation of the samples into the lower and
n1 n2
upper ranks, in statistical tests a result is significant
only if the probability is less than or equal to the 2
significance level (usually, p ¼ .05). and standard deviation of
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
The Probability of U n1 n2 ðn1 þ n2 þ 1Þ
:
12
In a two-tailed prediction, researchers always test
the smaller value of U because the probability is With large samples, rather than the U test, a z
based on this value. In the example, there are 11 score can be calculated and the probability looked
scores, 6 in one sample and 5 in the other. What is up in the standard normal distribution tables.
the probability of a U value of 5 or smaller when The accuracy of the Mann–Whitney U test is
the null hypothesis is true? There are 462 possible reduced the more tied ranks there are, and so
permutations for the rank order of 6 scores in one where there are a large number of tied ranks,
11! a correction to the test is necessary. Indeed, in
sample and 5 in the other (5!6! ). Only two permu-
tations produce a U value of zero: all one sample this case, it is often worth considering whether
in the bottom ranks or all the other sample in the a more precise measure of the dependent vari-
bottom ranks. It is not difficult to work out (using able is called for.
a mathematical formula for combinations) that
there are also only two possible ways of getting Perry R. Hinton
a U of 1, four ways of getting a U of 2, six ways
See also Dependent Variable; Independent Variable;
of getting a U of 3, 10 ways for a U of 4, and 14
Nonparametric Statistics; Null Hypothesis; One-Tailed
ways for a U of 5. In sum, there are 38 possible
Test; Sample; Significance Level, Concept of;
ways of producing a U value of 5 or less by chance
Significance Level, Interpretation and Construction;
alone. Dividing 38 by 462 gives a probability of
Two-Tailed Test; Wilcoxon Rank Sum Test; z Score
.082 of this result under the null hypothesis. Thus,
for a two-tailed prediction, with sample sizes of 5
and 6, a U value of 5 is not significant at the Further Readings
p ¼ .05 level of significance. Conover, W. J. (1999). Practical nonparametric statistics
It is interesting to note that if we had made (3rd ed.). New York: Wiley.
a one-tailed prediction, that is, specifically pre- Daniel, W. W. (1990). Applied nonparametric statistics
dicted in advance that Alba’s students would (2nd ed.). Boston: PWS-KENT.
Gibbons, J. D., & Chakraborti, S. (2003). Nonparametric
statistical inference (4th ed.). New York: Marcel Dekker.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one
of two random variables is stochastically larger than the other.
Annals of Mathematical Statistics, 18, 50–60.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics
for the behavioural sciences. New York: McGraw-Hill.

100(1 − α)% represents the confidence level corresponding to a
selected value of α between 0 and 1. Let Q(1 − α/2) denote the
100(1 − α/2)th percentile of the sampling distribution of
(θ̂ − θ)/SD(θ̂). A symmetric 100(1 − α)% confidence interval is
given by

θ̂ ± Q(1 − α/2) · SD(θ̂),

leading to a margin of error of

ME(α) = Q(1 − α/2) · SD(θ̂).

MARGIN OF ERROR

In the popular media, the margin of error is the The size of the margin of error is based on three
most frequently quoted measure of statistical accu- factors: (1) the size of the sample, (2) the variabil-
racy for a sample estimate of a population parame- ity of the data being sampled from the population,
ter. Based on the conventional definition of the and (3) the confidence level (assuming that the
measure, the difference between the estimate and conventional 95% level is not employed). The
the targeted parameter should be bounded by the sample size and the population variability are both
margin of error 95% of the time. Thus, only 1 in reflected in the standard error of the estimator,
20 surveys or studies should lead to a result in ^ which decreases as the sample size increases
SDðθÞ,
which the actual estimation error exceeds the mar- and grows in accordance with the dispersion of
gin of error. the population data. The confidence level is repre-
Technically, the margin of error is defined as the sented by the percentile of the sampling distribu-
radius or the half-width of a symmetric confidence tion, Qð1  α2Þ. This percentile becomes larger as α
interval. To formalize this definition, suppose that is decreased and the corresponding confidence
the targeted population parameter is denoted by θ. level 100ð1  αÞ% is increased.
Let θ^ represent an estimator of θ based on the sam- A common problem in research design is sample
ple data. Let SDðθÞ ^ denote the standard deviation size determination. In estimating a parameter θ, an
^
of θ (if known) or an estimator of the standard investigator often wishes to determine the sample
deviation (if unknown). SDðθÞ ^ is often referred to size n required to ensure that the margin of error
as the standard error. does not exceed some predetermined bound B;
Suppose that the sampling distribution of the that is, to find the n that will ensure MEðαÞ ≤ B.
standardized statistic Solving this problem requires specifying the confi-
dence level as well as quantifying the population
ðθ^  θÞ variability. The latter is often accomplished by
SDðθÞ ^ relying on data from pilot or preliminary studies,
or from prior studies that investigate similar phe-
is symmetric about zero. Let Qð0:975Þ denote the nomena. In some instances (such as when the
97.5th percentile of this distribution. (Note that parameter of interest is a proportion), an upper
the 2.5th percentile of the distribution would then bound can be placed on the population variability.
be given by Qð0:975Þ.) A symmetric 95% confi- The use of such a bound results in a conservative
dence interval for θ is defined as θ^ ± Qð0:975Þ sample size determination; that is, the resulting n
SDðθÞ^ . The half-width or radius of such an inter- is at least as large as (and possibly larger than) the
val, ME ¼ Qð0:975ÞSDðθÞ ^ , defines the conven- sample size actually required to achieve the desired
tional margin of error, which is implicitly based on objective.
a 95% confidence level. Two of the most basic problems in statistical
A more general definition of the margin of error inference consist of estimating a population mean
is based on an arbitrary confidence level. Suppose and estimating a population proportion under
random sampling with replacement. The margins The preceding definition for the margin of error
of error for these problems are presented in the assumes that the standard deviation σ is known,
next sections. an assumption that is unrealistic in practice. If σ is
unknown, it must be estimated by the sample stan-
dard deviation s. In this case, the margin of error is
based on the sampling distribution of the statistic (x̄ − μ)/(s/√n).
This sampling distribution corresponds to the Student's t
distribution, either exactly (under normality of the population)
or approximately (in a large sample setting). If Tdf(1 − α/2)
denotes the 100(1 − α/2)th percentile of the t distribution with
df = n − 1 degrees of freedom, then the margin of error for x̄ is
given by

ME(α) = Tdf(1 − α/2) · (s/√n).

Margin of Error for Means

Assume that a random sample of size n is drawn from a quantitative
population with mean μ and standard deviation σ. Let x̄ denote the
sample mean. The standard error of x̄ is then given by σ/√n. Assume
that either (a) the population data may be viewed as normally
distributed or (b) the sample size is ‘‘large’’ (typically, 30 or
greater). The sampling distribution of the standardized statistic

(x̄ − μ)/(σ/√n)

then corresponds to the standard normal distribu- Margin of Error for Proportions
tion, either exactly (under normality of the popula-
tion) or approximately (in a large sample setting, Assume that a random sample of size n is drawn
by virtue of the central limit theorem). Let from a qualitative population where π denotes
Zð1  α2Þ denote the 100ð1  α2Þth percentile of this a proportion based on a characteristic of interest.
distribution. The margin of error for x is then Let p denote the sample proportion. The standard
given by error of p is then given by
   rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
α σ πð1  πÞ
MEðαÞ ¼ Z 1  pffiffiffi : :
2 n n
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
With the conventional 95% confidence level, Here, πð1  πÞ represents the standard deviation
α ¼ 0:05 and Zð0:975Þ ¼ 1:96 ≈ 2, leading to of the binary (0=1) population data, in which each
ME ¼ 2 ðpσffiffin Þ: object is dichotomized according to whether it
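To make these expressions concrete, the following minimal sketch (with made-up values of n, s, and σ, used purely for illustration) evaluates the margin of error for a sample mean with the normal percentile when σ is treated as known and with the t percentile when σ is estimated by s.

import math
from scipy import stats

n, sigma, s = 36, 9.0, 9.0   # made-up sample size and spread values
alpha = 0.05                  # conventional 95% confidence level

z = stats.norm.ppf(1 - alpha / 2)          # about 1.96
me_known_sigma = z * sigma / math.sqrt(n)

t = stats.t.ppf(1 - alpha / 2, df=n - 1)   # slightly larger than z
me_estimated_sigma = t * s / math.sqrt(n)

print(round(me_known_sigma, 2), round(me_estimated_sigma, 2))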
In research design, the sample size needed to exhibits the characteristic in question. The sample
ensure that the margin of error does not exceed version of the standard error is
some bound B is determined by finding the smal- rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
lest n that will ensure pð1  pÞ
:
  n
α σ 2
n ≥ Z 1 :
2 B Assume that the sample size is ‘‘large’’ (typically,
such that nπ and nð1  πÞ are both at least 10). The
In instances in which the population standard sampling distribution of the standardized statistic
deviation σ cannot be estimated based on data col-
lected from earlier studies, a conservative approxi- ðp  πÞ
mation of σ can be made by taking one fourth the rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi!
πð1  πÞ
plausible range of the variable of interest (i.e., ð14Þ
n
[maximum–minimum]).
can then be approximated by the standard normal from the population at random with replacement.
distribution. The margin of error for x is then If the sample is drawn at random without replace-
given by ment, and the sampling fraction is relatively high
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ! (e.g., 5% or more), the formulas should be adjusted
 α pð1  pÞ by a finite population correction (fpc). If N denotes
MEðαÞ ¼ Z 1  :
2 n the size of the population, this correction is given as
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
With the conventional 95% confidence level, ðN  nÞ
qffiffiffiffiffiffiffiffiffiffiffi  fpc ¼ :
pð1pÞ ðN  1Þ
ME ¼ 2 n :
In research design for sample size determination, Employing this correction has the effect of
for applications in which no data exist from previous reducing the margin of error. As n approaches
studies for estimating the proportion of interest, the N, the fpc becomes smaller. When N ¼ n, the
computation is often based on bounding the popula-
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi entire population is sampled. In this case, the fpc
tion standard deviation πð1  πÞ. Using calculus, and the margin of error are zero because the
it can be shown that this quantity achieves a maxi- sample estimate is equal to the population
mum value of 1/2 when the proportion π is 1/2. The parameter and there is no estimation error. In
maximum margin of error is then defined as practice, the fpc is generally ignored because the
  size of the sample is usually small relative to the
 α 1
MEMAX ðαÞ ¼ Z 1  pffiffiffi : size of the population.
2 2 n

The sample size needed to ensure that this margin Generalization to Two Parameters
of error does not exceed some bound B is deter- The preceding definitions for the margin of error
mined by finding the smallest n that will ensure are easily generalized to settings in which two tar-
Zð1αÞ2
n ≥ 2
. geted population parameters are of interest, say θ1
2B
Often in public opinion polls and other surveys, and θ2 . Two parameters are often compared by
a number of population proportions are estimated, estimating their difference, say θ1  θ2. Let θ^1 and
and a single margin of error is quoted for the θ^2 represent estimators of θ1 and θ2 based on the
entire set of estimates. This is generally accom- sample data. Let SDðθ^1  θ^2 Þ denote the standard
plished with the preceding maximum margin of error of the difference in the estimators θ^1  θ^2 .
error MEMAX ðαÞ, which is guaranteed to be at Confidence intervals for θ1  θ2 may be based
least as large as the margin of error for any of the on the standardized statistic
individual estimates in the set. When the conven-
ðθ^1  θ^2 Þ  ðθ1  θ2 Þ
tional 95% confidence level is employed, the maxi- :
mum margin of error has a particularly simple SDðθ^1  θ^2 Þ
form:
As before, suppose that the sampling distribution
1 of this statistic is symmetric about zero, and let
MEMAX ¼ pffiffiffi : Qð0:975Þ denote the 97.5th percentile of this dis-
n
tribution. A symmetric 95% confidence interval
National opinion polls are often based on a sample for θ is defined as
size of roughly 1,100 participants, leading to the
margin of error MEMAX ≈ 0:03, or 3 percentage ðθ^1  θ^2 Þ ± Qð0:975ÞSDðθ^1  θ^2 Þ:
points.
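Both the 3-percentage-point figure for national polls and the sample size required to meet a chosen bound B follow directly from the maximum margin of error. The short sketch below (the helper function names are invented for illustration) evaluates these quantities at the conventional 95% confidence level.

import math
from scipy import stats

def max_margin_of_error(n, alpha=0.05):
    # Worst-case margin of error for a proportion (uses pi = 1/2).
    z = stats.norm.ppf(1 - alpha / 2)
    return z / (2 * math.sqrt(n))

def sample_size_for_bound(B, alpha=0.05):
    # Smallest n keeping the worst-case margin of error at or below B.
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil((z / (2 * B)) ** 2)

print(round(max_margin_of_error(1100), 3))  # about 0.03, or 3 points
print(sample_size_for_bound(0.03))          # roughly 1,068 respondents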
The half-width or radius of such an interval,
ME ¼ Qð0:975ÞSDðθ^1  θ^2 Þ, defines the conven-
Finite Population Correction
tional margin of error for the difference θ^1  θ^2 ,
The preceding formulas for margins of error are which is based on a 95% confidence level. The
based on the assumption that the sample is drawn more general definition based on an arbitrary
confidence level 100ð1  αÞ% may be obtained by statistics that indicates that the sum of a large num-
replacing the 97.5th percentile Qð0:975Þ with the ber of independent random variables is asymptoti-
100ð1  α2Þth percentile Qð1  α2Þ, leading to cally distributed as a normal distribution.
 Markov was a well-trained mathematician who
α
MEðαÞ ¼ Q 1  SDðθ^1  θ^2 Þ : after 1900 emphasized inquiry in probability. After
2 studying sequences of independent chance events,
Joseph E. Cavanaugh and Eric D. Foster he became interested in sequences of mutually
dependent events. This inquiry led to the creation
See also Confidence Intervals; Estimation; Sample Size
of Markov chains.
Planning; Standard Deviation; Standard Error of
Estimate; Standard Error of the Mean; Student’s t Test;
z Distribution Sequences of Chance Events
Markov chains are sequences of chance events. A
Further Readings
series of flips of a fair coin is a typical sequence of
Lohr, S. L. (1999). Sampling: Design and analysis. Pacific chance events. Each coin flip has two possible out-
Grove, CA: Duxbury Press. comes: Either a head (H) appears or a tail (T)
Utts, J. M. (2005). Seeing through statistics (3rd ed.). appears. With a fair coin, a head will appear with
Belmont, CA: Thomson Brooks/Cole. a probability (p) of 1/2 and a tail will appear with
Utts, J. M., & Heckard, R. F. (2007). Mind on statistics
a probability of 1/2.
(3rd ed.). Belmont, CA: Thomson Brooks/Cole.
Successive coin flips are independent of each
other in the sense that the probability of a head or
a tail on the first flip does not affect the probability
MARKOV CHAINS of a head or a tail on the second flip. In the case of
the sequence HT, the pðHTÞ ¼ pðHÞ × pðTÞ ¼
The topic of Markov chains is a well-developed 1=2 × 1=2 ¼ 1=4. Many sequences of chance
topic in probability. There are many fine exposi- events are composed of independent chance events
tions of Markov chains (e.g., Bremaud, 2008; such as coin flips or dice throws. However, some
Feller, 1968; Hoel, Port, & Stone, 1972; Kemeny sequences of chance events are not composed of
& Snell, 1960). Those expositions and others have independent chance events. Some sequences of
informed this concise entry on Markov chains, chance events are composed of events whose
which is not intended to exhaust the topic of Mar- occurrences are influenced by prior chance events.
kov chains. The topic is just too capacious for Markov chains are such sequences of chance
that. This entry provides an exposition of a judi- events.
cious sampling of the major ideas, concepts, and As an example of a sequence of chance events
methods regarding the topic. that involves interdependence of events, let us con-
sider a sequence of events E1 ; E2 ; E3 ; E4 such that
the probability of any of the events after E1 is
Andrei Andreevich Markov
a function of the prior event. Instead of interpret-
Andrei Andreevich Markov (1856–1922) formu- ing the pðE1 ; E2 ; E3 ; E4 ) as the product of proba-
lated the seminal concept in the field of probability bilities of independent events, pðE1 ; E2 ; E3 ; E4 ) is
later known as the Markov chain. Markov was an the product of an initial event probability and the
eminent Russian mathematician who served as conditional probabilities of successive events.
a professor in the Academy of Sciences at the Uni- From this perspective,
versity of St. Petersburg. One of Markov’s teachers
was Pafnuty Chebyshev, a noted mathematician pðE1 ; E2 ; E3 ; E4 Þ ¼ pðE1 Þ × pðE2 jE1 Þ × pðE3 jE2 Þ
who formulated the famous inequality termed the × pðE4 jE3 Þ:
Chebyschev inequality, which is extensively used in
probability and statistics. Markov was the first per- Such a sequence of events is a Markov chain.
son to provide a clear proof of the central limit the- Let us consider a sequence of chance events E1,
orem, a pivotal theorem in probability and E2, E3 ; . . . ; Ej ; . . . ; En. If p(Ej|Ej  1) ¼ p(Ej|Ej  1,
Ej  2 ; . . . ; E1), then the sequence of chance events example. A type of job certification involves
is a Markov chain. For a more formal definition, if a test with three resulting states. With state S1,
X1 ; X2 ; . . . ; Xn are random variables and if one fails the test with a failing score and then
p(Xn ¼ kn|Xn  1 ¼ kn  1) ¼ p(Xn ¼ kn|Xn  1 ¼ maintains that failure status. With state S2, one
kn 1 ; . . . ; X2 ¼ k2 ; X1 ¼ k2 Þ; then X1 ; X2 ; . . . ; Xn passes the test with a low pass score that is
form a Markov chain. inadequate for certification. One then takes the
Conditional probabilities interrelating events test again. Either one attains state S2 again
are important in defining a Markov chain. Com- with a probability of .5 or one passes the test
mon in expositions of Markov chains, conditional with a high pass score and reaches state S3 with
probabilities interrelating events are termed transi- a probability of .5. With state S3, one passes the
tion probabilities interrelating states, events are test with a high pass score that warrants job
termed states, and a set of states is often termed certification.
a system or a state space. The states in a Markov
chain are either finite or countably infinite; this S1 S2 S3
entry will feature systems or state spaces whose S1 1 0 0
states are finite in number. P2 = S2 0 .5 .5
A matrix of transition probabilities is used to
represent the interstate transitions possible for S3 0 0 1
a Markov chain. As an example, consider the fol-
lowing system of states, S1, S2, and S3, with the P2 indicates the transition probabilities among
following matrix of transition probabilities, P1. the three states of the job certification process.
From an examination of P2, state S1 is an absorb-
S1 S2 S3 ing state because p11 ¼ 1 and p1j ¼ 0 for 1 6¼ j,
S1 .1 .8 .1 and state S3 is an absorbing state because p33 ¼ 1
and p3j ¼ 0 for 3 6¼ j. However, state S2 is a tran-
P1 = S2 .5 .3 .2 sient state because there is a nonzero probability
S3 .1 .7 .2 that the state will never be reached again.
To illustrate that state S2 will never be reached
Using the matrix P1, one can see that the tran- again, let us examine what happens as the succes-
sition probability of entering S1 given that one is sive steps in the Markov chain occur. P2 indicates
in S2 is .8. That same probability is represented the transition probabilities among the three
as p12. states of the job certification process. P22 indicates
the transition probabilities after two steps in the
process.
Features of Markov Chains and Their States
S1 S2 S3
There are various types of states in a Markov
chain. Some types of states in a Markov chain S1 1 0 0
relate to the degree to which states recur over time. 2
P2 = S2 0 .25 .75
A recurrent state is one that will return to itself
S3 0 0 1
before an infinite number of steps with probability
1. An absorbing state i is a recurrent state for
which pii ¼ 1 and pij ¼ 0 for i 6¼ j. In other words, After two steps, p22 has decreased to .25 and
if it is not possible to leave a given state, then that p23 has increased to .75. P32 indicates the transition
state is an absorbing state. Second, a state is tran- probabilities after two steps in the process.
sient if it is not recurrent. In other words, if the
S1 S2 S3
probability that a state will occur again before an
infinite number of steps is less than 1, then that S1 1 0 0
state is transient. 3
P2 = S2 0 .06 .94
To illustrate the attributes of absorbing and
S3 0 0 1
transient states, let us consider the following
After three steps, p22 has further decreased to Child IQ


.06 and p23 has further increased to .94. One S1 S2 S3
could surmise that p22 will further decrease and
p23 will further decrease as the number of steps Parental IQ (low) (intermediate) (high)
increases. P22 and P32 suggest that those receiving an S1 (low) .6 .3 .1
intermediate score will pass the test with successive
attempts. P3 = S2 (intermediate) .2 .6 .2
Generalizing to n steps, Pn2 indicates the transi-
S3 (high) .1 .3 .6
tion probabilities after n steps in the process.
The Markov chain indicated by P3 reflects the
S1 S2 S3 view that the IQ of the parent tends to determine
S1
the IQ of the child.
1 0 0
n
All the states in P3 are accessible from each other,
P2 = S2 0 (.5)n 1 − (.5)n and they all communicate with each other. Thus the
S3 0 0 1 states in P3 form a communicating class. In addi-
tion, this Markov chain is irreducible and ergodic.
But what would be the probabilities of the three
As n increases, p22 will approach 0 and p23 will levels of parental IQ if this Markov chain would
approach 1. From an examination of Pn2 , it is clear proceed for many steps? To determine these proba-
that those receiving an intermediate score will bilities, also called stationary probabilities, one
inexorably pass the test with successive attempts at needs to use an important finding in the study of
taking the test. As the number of steps in the pro- Markov chains.
cess increases, individuals will either have failed Let us assume that we have a transition proba-
the job certification procedure or have passed the bility matrix, P, which, for some n, Pn has only
job certification procedure. nonzero entries. Then P is termed a regular matrix.
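The fate of the transient state S2 can be checked by raising P2 to successive powers. The following minimal sketch (illustrative only) prints the row of the matrix corresponding to S2; its middle entry equals .5 raised to the nth power and shrinks toward zero as the number of steps grows.

import numpy as np

P2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]])

for n in (1, 2, 3, 10):
    Pn = np.linalg.matrix_power(P2, n)
    # The S2 row shows where a low-pass candidate stands after n retakes:
    # [0, 0.5**n, 1 - 0.5**n].
    print(n, Pn[1])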
Additional types of states in a Markov chain Then two prominent theorems follow.
relate to the degree to which states can be The first theorem addresses the increasing simi-
reached from other states in the chain. A state i larity of rows in successive powers of P. If P is
is accessible from a different state j if there is a regular transition probability matrix, then Pn
a nonzero probability that state j can be reached becomesQincreasing more similar
from state i some time in the future. Second, Q to a probability
matrix with each row of being the same
a state i communicates with state j if state i is probability vector π ¼ (p1, p2 ; . . . ; pk Þ and with the
accessible from state j and state j is accessible components of π being positive.
from state i. A set of states in a Markov chain is The second theorem indicates the equation that
a communicating class if every pair of states in permits the computation of stationary probabili-
the set communicates with each other. A commu- ties. If P is a regular transition probability matrix,
nicating class is closed if it is not possible to then the row vector π is the unique vector of prob-
reach a state outside the class from a state within abilities that satisfies the equation π ¼ π · P. This
the class. A Markov chain is irreducible if any theorem is one of the more important theorems in
state within the chain is accessible from any the study of Markov chains.
other state in the chain. An irreducible Markov With the equation π ¼ π · P, we can determine
chain is also termed an ergodic Markov chain. the stationary probabilities for P3. Let π3 ¼ (p31,
To illustrate these attributes, let us consider p32, p33) be the vector of stationary probabilities
the following Markov chain and its transition for this three-state Markov chain associated with
probability matrix. This hypothetical chain P3. Then π3 ¼ π3 · P3.
describes the transition from the IQ of the parent
to the IQ of a child of the parent. Let the states .6 .3 .1
be three levels of IQ: high, intermediate, and
ðp31 ; p32 ; p33 Þ ¼ ðp31 ; p32 ; p33 Þ · .2 .6 .2
low. P3 is the transition matrix for this Markov
chain. .1 .3 .6
¼ ð:6p31 þ .2p32 þ .1p33,.3p31 þ .6p32 þ .3p33, .4 .4 .2


.1p31 þ .2p32 þ .6p33) .3 .4 .3
This results in three equations: .2 .4 .4
(1) p31 ¼ .6p31 þ .2p32 þ .1p33
¼ (.4p41 þ .3p42 þ .2p43, :4p41 þ .4p42 þ .4p43,
(2) p32 ¼ .3p31 þ .6p32 þ .3p333 :2p41 þ .3p42 þ .4p43)
(3) p33 ¼ .1p31 þ .2p32 þ .6p33
This results in three equations:
Along with those three equations is the addi-
tional equation: (1) p41 ¼ .4p41 þ .3p42 þ .2p43
(2) p42 ¼ .4p41 þ .4p42 þ .4p43
(4) p31 þ p32 þ p33 ¼ 1.
(3) p43 ¼ .2p41 þ .3p42 þ .4p43
An arithmetic manipulation of these four equations
results in numerical solutions for the three unknowns: Along with those three equations is the addi-
p31 ¼ 2/7 ¼ .286, p32 ¼ 3/7 ¼ .428, and p33 ¼ tional equation:
2/52 ¼ 2/7 ¼ .286. These three stationary probabilities
indicate that there would be a modest trend toward (4) p41 þ p42 þ p43 ¼ 1.
intermediate IQs over many generations.
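The stationary vector can also be found numerically, which is convenient for larger chains. The following minimal sketch treats π as a left eigenvector of the transition matrix associated with eigenvalue 1, a standard equivalent of solving π = π · P by hand, and recovers the stationary probabilities of the parent-to-child IQ chain P3.

import numpy as np

P3 = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.3, 0.6]])

# pi = pi * P3 means pi is a left eigenvector of P3 with eigenvalue 1,
# that is, a right eigenvector of the transposed matrix.
eigenvalues, eigenvectors = np.linalg.eig(P3.T)
stationary = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1))])
stationary = stationary / stationary.sum()

print(stationary)       # approximately [0.286, 0.429, 0.286]
print(stationary @ P3)  # unchanged by one further step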
Let us consider another Markov chain and its An arithmetic manipulation of these four
transition probability matrix. This hypothetical equations results in numerical solutions for the
chain also describes the transition from the IQ of three unknowns: p41 ¼ 3/10 ¼ .3, p42 ¼ 2/5 ¼ .4,
the parent to the IQ of a child of the parent. Let and p43 ¼ 3/10 ¼ .3. These three stationary prob-
the states again be three levels of IQ: high, inter- abilities indicate that there would be a weak
mediate, and low. P4 is the transition matrix for trend toward intermediate IQs over many gen-
this Markov chain. erations with this Markov chain but not as
strong a trend as with the prior Markov chain
depicted with P3.
Child IQ
Not all Markov chains have unique stationary
S1 S2 S3 probabilities. Let us consider the matrix of tran-
Parental IQ (low) (intermediate) (high) sition probabilities, P2, with two absorbing
states.
S1 (low) .4 .4 .2
S1 S2 S3
P4 = S2 (intermediate) .3 .4 .3 S1 1 0 0
S3 (high) .2 .4 .4 P2 = S2 0 .5 .5
S3 0 0 1
The Markov chain indicated by P4 reflects the
view that the IQ of the parent weakly influences To determine the stationary probabilities, one
the IQ of the child. needs to solve the equation: π2 ¼ π2 · P2 with
What would be the probabilities of the π2 ¼ (p21, p22, p23), the vector of stationary proba-
three levels of parental IQ if this Markov chain bilities for this three-state Markov chain.
would proceed for many steps? To determine 1 0 0
these probabilities, also called stationary proba-
bilities, one needs to solve the equation π4 ¼ π4 . ðp21 ; p22 ; p23 Þ ¼ ðp21 ; p22 ; p23 Þ · 0 .5 .5
P4 with π4 ¼ (p41, p42, p43), the vector of station- 0 0 1
ary probabilities for this three-state Markov
chain. ¼ (p21, .5p22 þ .5p23, .5p22 þ .5p23)
This results in three equations: 1 0 0


3
P5 = 0 0 1
1. p21 ¼ p21
0 1 0
2. p22 ¼ þ .5p22
3. p23 ¼ þ .5p22 þ p23 P35 ¼ P5 . One could then generalize to the
following equality: P2nþ1
5 ¼ P5 with n ¼ any nat-
Along with those three equations is the addi- ural number. Regarding the attributes of period-
tional equation: icity or recurrence, state S11 is recurrent at each
step and states S13 and S32 are periodic with
4. p21 þ p22 þ p23 ¼ 1. period 2.
One type of Markov chain that is common-
An arithmetic manipulation of these 4 equations place among expositions of Markov chains is
results in numerical solutions for the three that of the random walk. Imagine a person
unknowns: p21 ¼ 1  p23, p22 ¼ 0, and p23 ¼ named Harry, a frequent visitor to a bar, plan-
1  p21. These three stationary probabilities indicate ning to leave the bar, represented as state S2.
that there could be an infinite number of values for From the bar, Harry could start walking to his
p21 and p23. home and return to the bar, S1, or walk to a park,
Additional types of states in a Markov chain state S2, that is halfway between the bar and his
relate to the rates at which states in the Markov home. If Harry is at the park, state S2, then he
chain return to themselves over time. If there is could start walking to his home and return to
a return to state i starting from state i after every the park or walk to his home, state S3. If Harry
three steps, then state i has a period of 3. In gen- is at his home, then Harry could remain at his
eral, if there is a return to a given state in a Markov home or walk to the park. P6 is the matrix of
chain after every t steps if one starts with that transition probabilities that relates to this ran-
state, then that state is periodic with period t. A dom walk.
state is recurrent or persistent if there is a probabil-
ity of 1 that the state will be reached again. S1 S2 S3
To illustrate these attributes, let us consider the
S1 (bar) .5 .5 0
following example. Let P5 be a matrix of transition
probabilities for a Markov chain with three P6 = S2 (park) .5 0 .5
absorbing states.
S3 (home) 0 .5 .5
S1 S2 S3
To determine the stationary probabilities, one
S1 1 0 0
needs to solve the equation: π6 ¼ π6 · P6 with
P5 = S2 0 0 1 π6 ¼ (p61, p62, p63), the vector of stationary proba-
S3 0 1 0 bilities for this three-state Markov chain.

.5 .5 0
After one step in the Markov chain, P25 indicates
the resulting transition probabilities. ðp61 ; p62 ; p63 Þ ¼ ðp61 ; p62 ; p63 Þ · .5 0 .5
0 .5 .5
1 0 0
2 ¼ ð:5p61 þ .5p62, .5p61 þ .5p63, .5p62 þ .5p63)
P5 = 0 1 0
0 0 1
This results in three equations:

1. p61 ¼ .5p61 þ .5p62


P35 ¼ is the identity matrix. After two steps in
2. p62 ¼ .5p61 þ .5p63
the Markov chain, P35 indicates the resulting transi-
tion probabilities. 3. p63 ¼ .5p62 þ .5p63
Along with those three equations is the additional matching procedure requires defining a notion of
equation distance, selecting the number of matches to be
found, and deciding whether units will be used mul-
4. π61 þ π62 þ π63 ¼ 1. tiple times as a potential match. In applications,
matching is commonly used as a preliminary step in
An arithmetic manipulation of these four equa- the construction of a matched sample, that is, a sam-
tions results in numerical solutions for the three ple of observations that are similar in terms of
unknowns: π61 ¼ π62 ¼ π63 ¼ 1/3 ¼ .33. These sta- observed characteristics, and then some statistical
tionary probabilities indicate that Harry would be procedure is computed with this subsample. Typi-
at any of the three locations with equal likelihood cally, the term matching estimator refers to the case
after many steps in the random walk.2 when the statistical procedure of interest is a point
estimator, such as the sample mean. The idea of
Conclusion matching is usually employed in the context of
observational studies, in which it is assumed that
If there is a sequence of random events such that
selection into treatment, if present, is based on
a future event is dependent only on the present
observable characteristics. More generally, under
event and not on past events, then the sequence is
appropriate assumptions, matching may be used as
likely a Markov chain, and the work of Markov
a way of reducing variability in estimation, combin-
and others may be used to extract useful informa-
ing databases from different sources, dealing with
tion from an analysis of the sequence. The topic of
missing data, and designing sampling strategies,
the Markov chain has become one of the most
among other possibilities. Finally, in the economet-
captivating, generative, and useful topics in proba-
rics literature, the term matching is sometimes used
bility and statistics.
more broadly to refer to a class of estimators that
William M. Bart and Thomas Bart exploit the idea of selection on observables in the
context of program evaluation. This entry focuses
See also Matrix Algebra; Probability, Laws of on the implementation of and statistical inference
procedures for matching.
Further Readings
Bremaud, P. (2008). Markov chains. New York: Springer. Description and Implementation
Feller, W. (1968). An introduction to probability theory and
its applications: Volume 1 (3rd ed.). New York: Wiley. A natural way of describing matching formally is
Hoel, P., Port, S., & Stone, C. (1972). Introduction to in the context of the classical potential outcomes
stochastic processes. Boston: Houghton Mifflin. model. To describe this model, suppose that a ran-
Kemeny, J., & Snell, J. (1960). Finite Markov chains. dom sample of size n is available from a large
Princeton, NJ: D. Van Nostrand. population, which is represented by the collection
of random variables (Yi ; Ti ; Xi Þ, i ¼ 1; 2; . . . ; n;
where Ti ∈ f0; 1g,

MATCHING

Yi equals Y0i when Ti = 0 and Y1i when Ti = 1,
The term matching refers to the procedure of find-
ing for a sample unit other units in the sample that and Xi represents a (possibly high-dimensional)
are closest in terms of observable characteristics. vector of observed characteristics. This model aims
The units selected are usually referred to as matches, to capture the idea that while the set of character-
and after repeating this procedure for all units (or istics Xi is observed for all units, only one of the
a subgroup of them), the resulting subsample of two random variables (Y0i ; Y1i ) is observed for
units is called the matched sample. This idea is typi- each unit, depending on the value of Ti . The
cally implemented across subgroups of a given sam- underlying random variables Y0i and Y1i are usu-
ple, that is, for each unit in one subgroup, matches ally referred to as potential outcomes because they
are found among units of another subgroup. A represent the two potential states for each unit.
For example, this model is routinely used in the To describe a matching procedure in detail, con-
program evaluation literature, where Ti represents sider the special case of matching that uses the
treatment status and Y0i and Y1i represent out- Euclidean distance to obtain M ≥ 1 matches with
comes without and with treatment, respectively. In replacement for the two groups of observations
most applications the goal is to establish statistical defined by Ti ¼ 0 and Ti ¼ 1, using as a reservoir
inference for some characteristic of the distribution of potential matches for each unit i the group
of the potential outcomes such as the mean or opposite to the group this unit belongs to. Then,
quantiles. However, using the available sample for unit i the mth match, m ¼ 1, 2, . . . , M is given
directly to establish inference may lead to impor- by the observation having index jm ðiÞ such that
tant biases in the estimation whenever units have
selected into one of the two possible groups Tjm ðiÞ 6¼ Ti and
(Ti ¼ 0 or Ti ¼ 1). As a consequence, researchers X n
often assume that the selection process, if present, 1fTj 6¼ Ti g1fkXj  Xi k ≤ kXjm ðiÞ  Xi kg¼m:
is based on observable characteristics. This idea is j¼1
formalized by the so-called conditional indepen-
dence assumption: conditionally on Xi ; the ran- (The function 1f·g is the indicator function and k·k
dom variables (Y0i ; Y1i ) are independent of Ti : In represents the Euclidean norm.) In words, for the
other words, under this assumption, units having ith unit, the mth match corresponds to the
the same observable characteristics Xi are assigned mth nearest neighbor among those observations
to each of the two groups (Ti ¼ 0 or Ti ¼ 1Þ inde- belonging to the opposite group of unit i, as mea-
pendently of their potential gains, captured by sured by the Euclidean distance between their
(Y0i ; Y1i ). Thus, this assumption imposes random observable characteristics. For example, if m ¼ 1,
treatment assignment conditional on Xi : This then j1 ðiÞ corresponds to the unit’s index in the
model also assumes some form of overlap or com- opposite group of unit i with the property that
mon support: For some c > 0; c ≤ PðTi ¼ 1jXi Þ ≤ kXj1 ðiÞ ; Xi k ≤ kXj  Xi k for all j such that
1  c: In words, this additional assumption ensures Tj 6¼ Ti ; that is, Xjm ðiÞ is the observation closest to
that there will be observations in both groups hav- Xi among all the observations in the appropriate
ing a common value of observed characteristics if group. Similarly, Xj1 ðiÞ ; Xj2 ðiÞ ; . . . ; XjM ðiÞ are the sec-
the sample size is large enough. The function ond closest, third closest, and so forth, observa-
pðXi Þ ¼ PðTi ¼ 1jXi Þ is known as the propensity tions to Xi , among those observations in the
score and plays an important role in the literature. appropriate subsample. Notice that to simplify the
Finally, it is important to note that for many appli- discussion, this definition assumes existence and
cations of interest, the model described above uniqueness of an observation with index jm ðiÞ. (It
employs stronger assumptions than needed. For is possible to modify the matching procedure to
simplicity, however, the following discussion does account for these problems.)
not address these distinctions. In general, the always observed random vector
This setup naturally motivates matching: obser- Xi may include both discrete and continuous ran-
vations sharing common (or very similar) values of dom variables. When the distribution of (a subvec-
the observable characteristics Xi are assumed to be tor of) Xi is discrete, the matching procedure may
free of any selection biases, rendering the statistical be done exactly in large samples, leading to so-
inference that uses these observations valid. Of called exact matching. However, for those compo-
course, matching is not the only way of conducting nents of Xi that are continuously distributed,
correct inference in this model. Several parametric, matching cannot be done exactly, and therefore in
semiparametric, and nonparametric techniques are any given sample there will be a discrepancy in
available, depending on the object of interest and terms of observable characteristics, sometimes
the assumptions imposed. Nonetheless, matching called the matching discrepancy. This discrepancy
is an attractive procedure because it does not generates a bias that may affect inference even
require employing smoothing techniques and asymptotically.
appears to be less sensitive to some choices of user- The M matches for unit i are given by the obser-
defined tuning parameters. vations with indexes JM ðiÞ ¼ fj1 ðiÞ; . . . ; jM ðiÞg, that
760 Matching

 
is, Yj1 ðiÞ ; Xj1 ðiÞ ; . . . ; YjM ðiÞ ; XjM ðiÞ . This procedure so-called genetic matching, which uses evolutionary
is repeated for the appropriate subsample of units to genetic algorithms to construct the matched sample,
obtain the final matched sample. Once the matched appears to work well with moderate sample sizes.
sample is available, the statistical procedure of inter- This implementation allows for a generalized notion
est may be computed. To this end, the first step is to of distance (a reweighted Euclidean norm that
‘‘recover’’ those counterfactual variables not observed includes the Mahalanobis metric as a particular case)
for each unit, which in the context of matching is and an arbitrary number of matches with and with-
done by imputation. For example, first define out replacement.
8 There exist several generalizations of the basic
< Yi
> if Ti ¼ 0 matching procedure described above, a particularly
^ 1 X
Y0i ¼ Yj if Ti ¼ 1 and important one being the so-called optimal full
>
:M
j ∈ JM ðiÞ
matching. This procedure generalizes the idea of pair
8 or M matching by constructing multiple submatched
> 1 X
< Yj if Ti ¼ 0 samples that may include more than one observation
Y^1i ¼ M j ∈ J ðiÞ from each group. This procedure encompasses the
>
:
M

Yi if Ti ¼ 1 simple matching procedures previously discussed and


enjoys certain demonstrable optimality properties.
that is, for each unit the unobserved counterfactual
variable is imputed using the average of its M
Statistical Inference
matches. Then simple matching estimators are easy
to construct: A matching estimator forP μ1 ^¼ E½Y1i , In recent years, there have been important theoreti-
the mean of Y1i is given by μ ^ 1 ¼ 1n ni¼1 Y 1i ; while cal developments in statistics and econometrics con-
a matching estimator for τ ¼ μ1  μ0 ¼ E½Y1i   cerning matching estimators for average treatment
E½Y0i ; the difference in means between effects under the conditional independence assump-
both groups,
Pn ^ is given by τ^ ¼ μ ^ 0, where
^1  μ tion. These results establish the validity and lack of
1
^ 0 ¼ n i¼1 Y0i . The latter estimand is called the
μ validity of commonly used statistical inference pro-
average treatment effect in the literature of program cedures involving simple matching estimators.
evaluation and has received special attention in the Despite the fact that in some cases, and under
theoretical literature of matching estimation. somewhat restrictive assumptions, exact (finite
Matching may also be carried out using esti- sample) statistical inference results for matching
mated rather than observed random variables. A estimators exist, the most important theoretical
classical example is the so-called propensity score developments currently available have been deri-
matching, which constructs a matched sample ved for large samples and under mild, standard
using the estimated propensity score (rather than assumptions. Naturally, these asymptotic results
the observed Xi ) to measure the proximity bet- have the advantage of being invariant to particular
ween observations. Furthermore, matching may distributional assumptions and the disadvantage of
also be used to estimate other population para- being valid only for large enough samples.
meters of interest, such as quantiles or dispersion First, despite the relative complexity of matching
measures, in a conceptually similar way. Intui- estimators, it has been established that these estima-
tively, in all cases a matching estimator imputes tors for averages with and without replacement
values for otherwise unobserved random variables enjoy root-n consistency and asymptotic normality
using the matched sample. This imputation pro- under reasonable assumptions. In other words, the
cedure coincides with an M nearest neighbor estimators described in the previous section (as well
(M  NN) nonparametric regression estimator. as other variants of them) achieve the parametric
The implementation of matching is based on sev- rate of convergence having a Gaussian limiting dis-
eral user-defined options (metric, number of matches, tribution after appropriate centering and rescaling.
etc.), and therefore numerous variants of this proce- It is important to note that the necessary conditions
dure may be considered. In all cases, a fast and reli- for this result to hold include the restriction that at
able algorithm is needed to construct a matched most one dimension of the observed characteristics
sample. Among the available implementations, the is continuously distributed, regardless of how many
Matrix Algebra 761

discrete covariates are included in the vector of Further Readings


observed characteristics used by the matching pro-
Abadie, A., & Imbens, G. W. (2006). Large sample
cedure. Intuitively, this restriction arises as a conse- properties of matching estimators for average
quence of the bias introduced by the matching treatment effects. Econometrica, 74, 235–267.
discrepancy for continuously distributed observed Abadie, A., & Imbens, G. W. (2008). On the failure of
characteristics, which turns out not to vanish even the bootstrap for matching estimators. Econometrica,
asymptotically when more than one continuous 76, 1537–1557.
covariate are included. This problem may be fixed Abadie, A., & Imbens, G. W. (2009, February). A
at the expense of introducing further bias reduction martingale representation for matching estimators.
techniques that involve nonparametric smoothing National Bureau of Economic Research, working
paper 14756.
procedures, making the ‘‘bias corrected’’ matching
Diamond, A., & Sekhon, J. S. (2008, December).
estimator somehow less appealing. Genetic matching for estimating causal effects:
Second, regarding the (asymptotic) precision of A general multivariate matching method for
matching estimators for averages, it has been achieving balance in observational studies.
shown that these estimators do not achieve the Retrieved February 12, 2010, from http://
minimum possible variance, that is, these estima- sekhon.berkeley.edu
tors are inefficient when compared with other Hansen, B. B., & Klopfer, S. O. (2006). Optimal full
available procedures. However, this efficiency loss matching and related designs via network flows.
is relatively small and decreases fast with the num- Journal of Computational and Graphical Statistics,
ber of matches to be found for each observation. 15, 609–627.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent
Finally, in terms of uncertainty estimates of
developments in the econometrics of program
matching estimators for averages, two important evaluation. Journal of Economic Literature, 47, 5–86.
results are available. First, it has been shown that Rosenbaum, P. (2002). Observational studies. New York:
the classical bootstrap procedure would provide an Springer.
inconsistent estimate of the standard errors of the
matching estimators. For this reason, other resam-
pling techniques must be used, such as m out of n MATRIX ALGEBRA
bootstrap or subsampling, which do deliver consis-
tent standard error estimates under mild regularity
conditions. Second, as an alternative, it is possible to James Joseph Sylvester developed the modern con-
construct a consistent estimator of the standard cept of matrices in the 19th century. For him
errors that does not require explicit estimation of a matrix was an array of numbers. He worked
nonparametric parameters. This estimator uses the with systems of linear equations; matrices provided
matched sample to construct a consistent estimator a convenient way of working with their coeffi-
of the asymptotic (two-piece) variance of the match- cients, and matrix algebra was to generalize num-
ing estimator. ber operations to matrices. Nowadays, matrix
In sum, the main theoretical results available jus- algebra is used in all branches of mathematics and
tify asymptotically the use of classical inference pro- the sciences and constitutes the basis of most statis-
cedures based on the normal distribution, provided tical procedures.
the standard errors are estimated appropriately.
Matrices: Definition
Computer programs implementing matching, which
also compute matching estimators as well as other A matrix is a set of numbers arranged in a table.
statistical procedures based on a matched sample, For example, Toto, Marius, and Olivette are look-
are available in commonly used statistical comput- ing at their possessions, and they are counting how
ing software such as MATLAB, R, and Stata. many balls, cars, coins, and novels they each pos-
sess. Toto has 2 balls, 5 cars, 10 coins, and 20
Matias D. Cattaneo novels. Marius has 1, 2, 3, and 4, and Olivette has
6, 1, 3, and 10. These data can be displayed in
See also Observational Research; Propensity Score a table in which each row represents a person and
Analysis; Selection each column a possession:
762 Matrix Algebra

Person Balls Cars Coins Novels For either convenience or clarity, the number of
rows and columns can also be indicated as sub-
Toto 2 5 10 20
scripts below the matrix name:
Marius 1 2 3 4
Olivette 6 1 3 10 A ¼ A ¼ ½ai;j : ð3Þ
I×J

We can also say that these data are described by


the matrix denoted A equal to Vectors
2 3 A matrix with one column is called a column
2 5 10 20
vector or simply a vector. Vectors are denoted with
A ¼ 41 2 3 4 5: ð1Þ
bold lowercase letters. For example, the first col-
6 1 3 10
umn of matrix A (of Equation 1) is a column vec-
tor that stores the number of balls of Toto,
Matrices are denoted by boldface uppercase Marius, and Olivette. We can call it b (for balls),
letters. and so
To identify a specific element of a matrix, we 2 3
use its row and column numbers. For example, 2
the cell defined by row 3 and column 1 contains b ¼ 4 1 5: ð4Þ
the value 6. We write that a3,1 ¼ 6. With this 6
notation, elements of a matrix are denoted with
the same letter as the matrix but written in Vectors are the building blocks of matrices. For
lowercase italic. The first subscript always gives example, A (of Equation 1) is made of four col-
the row number of the element (i.e., 3), and sec- umn vectors, which represent the number of balls,
ond subscript always gives its column number cars, coins, and novels, respectively.
(i.e., 1).
A generic element of a matrix is identified with
indices such as i and j. So, ai,j is the element at the Norm of a Vector
ith row and jth column of A. The total number of
We can associate to a vector a quantity, related
rows and columns is denoted with the same letters
to its variance and standard deviation, called the
as the indices but in uppercase letters. The matrix
norm or length. The norm of a vector is the square
A has I rows (here I ¼ 3) and J columns (here
root of the sum of squares of the elements. It is
J ¼ 4), and it is made of I × J elements ai,j (here
denoted by putting the name of the vector between
3 × 4 ¼ 12). The term dimensions is often used to
a set of double bars (||). For example, for
refer to the number of rows and columns, so A has
2 3
dimensions I by J. 2
As a shortcut, a matrix can be represented by b ¼ 4 1 5; ð5Þ
its generic element written in brackets. So A with I 2
rows and J columns is denoted

we find
A ¼ ai;j pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffi
2 3 jjxjj ¼ 22 þ 12 þ 22 ¼ 4 þ 1 þ 4 ¼ 9 ¼ 3
a1;1 a1;2    a1;j    a1;J
6 7 ð6Þ
6 a2;1 a2;2    a2;j    a2;J 7
6 7
6 . .. .. .. .. .. 7 Normalization of a Vector
6 .. . . . . . 7 ð2Þ
6 7
¼6 7
6 ai;1 ai;2  ai;j    ai;J 7
6 7 A vector is normalized when its norm is equal to
6 . .. .. .. .. .. 7
6 .. 1. To normalize a vector, we divide each of its ele-
4 . . . . . 7
5
ments by its norm. For example, vector x from
aI;1 aI;2  aI;j    aI;J Equation 5 is transformed into the normalized x as
Matrix Algebra 763

223
In general
3
x 6 7
x¼ ¼ 4 13 5: ð7Þ
jjxjj 2 AþB¼
3 2 3
a1;1 þ b1;1 a1;2 þ b1;2  a1;j þ b1;j  a1;J þ b1;J
6 7
6 a2;1 þ b2;1 a2;2 þ b2;2  a2;j þ b2;j  a2;J þ b2;J 7
Operations for Matrices 6 7
6 .. .. .. .. .. .. 7
6 . . . . . . 7
Transposition 6 7
6 7:
6 ai;1 þ bi;1 ai;2 þ bi;2  ai;j þ bi;j  ai;J þ bi;J 7
If we exchange the roles of the rows and the 6 7
6 .. .. .. .. .. .. 7
columns of a matrix, we transpose it. This opera- 6 . . . . . . 7
4 5
tion is called the transposition, and the new matrix
aI;1 þ bI;1 aI;2 þ bI;2  aI;j þ bI;j  aI;J þ bI;J
is called a transposed matrix. The A matrix trans-
posed is denoted AT. For example: ð11Þ
2 3
2 5 10 20 Matrix addition behaves very much like usual
6 7 addition. Specifically, matrix addition is commuta-
if A ¼ A ¼ 4 1 2 3 4 5; then
3×4
6 1 3 10 tive (i.e., A þ B ¼ B þ AÞ and associative (i.e.,
A þ ½B þ C ¼ ½A þ B þ C).
2 3 ð8Þ
2 1 6
Multiplication of a Matrix by a Scalar
6 5 2 1 7
T 6
T 7
A ¼ A ¼6 7: To differentiate matrices from the usual num-
4×3 4 10 3 3 5
bers, we call the latter scalar numbers or simply
20 4 10 scalars. To multiply a matrix by a scalar, multiply
each element of the matrix by this scalar. For
example:
Addition of Matrices
2 3
When two matrices have the same dimensions, 3 4 5 6
6 7
we compute their sum by adding the correspond- 10 × B ¼ 10 × 4 2 4 6 85
ing elements. For example, with 1 2 3 5
2 3 2 3 ð12Þ
2 5 10 20 30 40 50 60
6 7 6 7
A ¼ 41 2 3 4 5and ¼ 4 20 40 60 80 5:
6 1 3 10 10 20 30 50
2 3 ð9Þ
3 4 5 6
6 7
B ¼ 42 4 6 8 5; Multiplication: Product or Products?
1 2 3 5
There are several ways of generalizing the con-
we find cept of product to matrices. We will look at the
most frequently used of these matrix products.
2 3
2þ3 5 þ 4 10 þ 5 20 þ 6 Each of these products will behave like the product
6 7 between scalars when the matrices have dimen-
A þ B ¼ 41 þ 2 2þ4 3þ6 4þ8 5
sions 1 × 1.
6þ1 1þ2 3þ3 10 þ 5
2 3
5 9 15 26
6 7 Hadamard Product
¼ 43 6 9 12 5:
When generalizing product to matrices, the first
7 3 6 15
approach is to multiply the corresponding ele-
ð10Þ ments of the two matrices that we want to
764 Matrix Algebra

multiply. This is called the Hadamard product, for matrices with the same dimensions. Formally,
denoted by . The Hadamard product exists only it is defined as shown below, in matrix 13:

A  B ¼ ½ai;j × bi;j 
2 3
a1;1 × b1;1 a1;2 × b1;2    a1;j × b1;j    a1;J × b1;J
6 7
6 a2;1 × b2;1 a2;2 × b2;2    a2;j × b2;j    a2;J × b2;J 7
6 7
6 .. .. .. .. .. .. 7
6 . . . . . . 7 ð13Þ
6 7
¼6 7:
6 ai;1 × bi;1 ai;2 × bi;2  ai;j × bi;j    ai;J × bi;J 7
6 7
6 .. .. .. .. .. .. 7
6 . . . . . . 7
4 5
aI;1 × bI;1 aI;2 × bI;2  aI;j × bI;j    aI;J × bI;J

For example, with


2 3 2 3
2 5 10 20 3 4 5 6
A ¼ 41 2 3 4 5 and B ¼ 4 2 4 6 8 5; ð14Þ
6 1 3 10 1 2 3 5

we get
2 3 2 3
2 × 3 5 × 4 10 × 5 20 × 6 6 20 50 120
4
A  B ¼ 1×2 2×4 3×6 5
4×8 ¼ 2 84 18 32 5: ð15Þ
6 × 1 1 × 2 3 × 3 10 × 5 6 2 9 50

Standard or Cayley Product


A × B ¼ C , ð16Þ
The Hadamard product is straightforward, but I×J J×K I×K
it is not the matrix product that is used most often.
The most often used product is called the standard or even
or Cayley product, or simply the product (i.e., A B ¼ C : ð17Þ
when the name of the product is not specified, it is I J K I×K
the standard product). Its definition comes from
the original use of matrices to solve equations. Its An element ci;k of the matrix C is computed as
definition looks surprising at first because it is
X
J
defined only when the number of columns of the ci;k ¼ ai;j × bj;k : ð18Þ
first matrix is equal to the number of rows of the j¼1
second matrix. When two matrices can be multi- So ci;k is the sum of J terms, each term being the
plied together, they are called conformable. This product of the corresponding element of the ith
product will have the number of rows of the first row of A with the kth column of B.
matrix and the number of columns of the second For example, let
matrix.
2 3
So, A with I rows and J columns can be multi- 1 2
1 2 3
plied by B with J rows and K columns to give C A¼ and B ¼ 4 3 4 5: ð19Þ
with I rows and K columns. A convenient way of 4 5 5
5 6
checking that two matrices are conformable is to
write the dimensions of the matrices as subscripts. The product of these matrices is denoted
For example, C ¼ A × B ¼ AB (the × sign can be omitted when
Matrix Algebra 765

the context is clear). To compute c2,1 we add three For example, with
terms: (1) the product of the first element of
the second row of A (i.e., 4) with the first element 2 1 1 1
A¼ and B ¼ ð25Þ
of the first column of B (i.e., 1); (2) the product of 2 1 2 2
the second element of the second row of A (i.e., 5)
with the second element of the first column of B we get:
(i.e., 3); and (3) the product of the third element of
the second row of A (i.e., 5) with the third element 2 1 1 1 0 0
AB ¼ ¼ : ð26Þ
of the first column of B (i.e., 5). Formally, the term 2 1 2 2 0 0
c2,1 is obtained as
But
X
J¼3
c2;1 ¼ a2;j × bj;1
1 1 2 1
j¼1 BA ¼
2 2 2 1
¼ ða2;1 Þ × ðb1;1 Þ þ ða2;2 × b2;1 Þ þ ða2;3 × b3;1 Þ

¼ ð4 × 1Þ þ ð5 × 3Þ þ ð6 × 5Þ 4 2
¼ : ð27Þ
¼ 49: 8 4
ð20Þ
Incidently, we can combine transposition and pro-
Matrix C is obtained as duct and get the following equation:

AB ¼ ðABÞT ¼ BT AT : ð28Þ
C ¼ ½ci;k 
X
J¼3
¼ ai;j × bj;k Exotic Product: Kronecker
j¼1
Another product is the Kronecker product,
1×1 þ 2×3 þ 3×5 1×2 þ 2×4 þ 3×6 also called the direct, tensor, or Zehfuss
¼
4×1 þ 5×3 þ 6×5 4×2 þ 5×4 þ 6×6 product. It is denoted  and is defined for all

22 28 matrices. Specifically, with two matrices A ¼ ai,j
¼ : (with dimensions I by J) and B (with dimens-
49 64
ions K and L), the Kronecker product gives
ð21Þ a matrix C (with dimensions (I × K) by (J × L)
defined as
2 3
Properties of the Product a1;1 B a1;2 B    a1;j B    a1;J B
6 7
Like the product between scalars, the pro- 6 a2;1 B a2;2 B    a2;j B    a2;J B 7
6 7
duct between matrices is associative, and distri- 6 . .. .. .. .. .. 7
6 .. . . . . . 7
butive relative to addition. Specifically, for any 6 7
AB¼6 7:
set of three conformable matrices A, B, and C, 6 ai;1 B ai;2 B    ai;j B    ai;J B 7
6 7
6 . .. .. .. .. .. 7
ðABÞC ¼ AðBCÞ ¼ ABC associativity ð22Þ 6 .. . . . . . 7
4 5
aI;1 B aI;2 B    aI;j B    aI;J B
AðB þ CÞ ¼ AB þ AC distributivity: ð23Þ ð29Þ
The matrix products AB and BA do not always
exist, but when they do, these products are not, in For example, with
general, commutative:
6 7
A ¼ ½1 2 3  and B ¼ ð30Þ
AB 6¼ BA: ð24Þ 8 9
766 Matrix Algebra

we get Note that for a symmetric matrix,

AB
A ¼ AT : ð36Þ
1×6 1×7 2×6 2×7 3×6 3×7
¼
1×8 1×9 2×8 2×9 3×8 3×9 A common mistake is to assume that the stan-

6 7 12 14 18 21 dard product of two symmetric matrices is com-
¼ :
8 9 16 18 24 27 mutative. But this is not true, as shown by the
ð31Þ following example. With
2 3 2 3
The Kronecker product is used to write design 1 2 3 1 1 2
matrices. It is an essential tool for the derivation of A ¼ 42 1 4 5 and B ¼ 4 1 1 35 ð37Þ
expected values and sampling distributions. 3 4 1 2 3 1

Special Matrices we get


Certain special matrices have specific names. 2 3
9 12 11
Square and Rectangular Matrices 6 7
AB ¼ 4 11 15 11 5, but
A matrix with the same number of rows and 9 10 19
columns is a square matrix. By contrast, a matrix 2 3 ð38Þ
9 11 9
with different numbers of rows and columns is 6 7
a rectangular matrix. So BA ¼ 4 12 15 10 5:
2 3 11 11 19
1 2 3
A ¼ 44 5 55 ð32Þ
Note, however, that combining Equations 35 and
7 8 0
43 gives for symmetric matrices A and B the fol-
is a square matrix, but lowing equation:
2 3 T
1 2 AB ¼ ðBAÞ : ð39Þ
4
B¼ 4 55 ð33Þ
7 8 Diagonal Matrix
is a rectangular matrix. A square matrix is diagonal when all its ele-
ments except the ones on the diagonal are zero.
Symmetric Matrix Formally, a matrix is diagonal if ai;j ¼ 0 when
i 6¼ j. Thus
A square matrix A with ai;j ¼ aj;i is symmetric.
So 2 3
10 0 0
2 3 A ¼ 4 0 20 0 5 is diagonal: ð40Þ
10 2 3
0 0 30
A¼4 2 20 5 5 ð34Þ
3 5 30 Because only the diagonal elements matter for a
diagonal matrix, we can specify only these diago-
is symmetric, but nal elements. This is done with the following nota-
2 3 tion:
12 2 3
A¼4 4 20 5 5 ð35Þ
7 8 30 A ¼ diagf½a1, 1 ; . . . ; ai, i ; . . . ; aI, I g
 
is not. ¼ diag ½ai;i  : ð41Þ
Matrix Algebra 767

2 3
For example, the previous matrix can be rewritten 2 0 0
1 2 3
as: AC ¼ ×40 4 05
2 3 4 5 6
0 0 6
10 0 0

A ¼ 4 0 20 0 5 ¼ diagf½10; 20; 30g: ð42Þ 2 8 18
0 0 30 ¼ ð48Þ
8 20 36
The operator diag can also be used to isolate the
diagonal of any square matrix. For example, with and also
2 3 2 3
1 2 3 2 0 0

A ¼ 44 5 65 ð43Þ 2 0 1 2 3 6 7
7 8 9 BAC ¼ × ×40 4 05
0 5 4 5 6
0 0 6
we get
4 16 36
82 39 2 3 ¼ :
< 1 2 3 = 1 40 100 180
diagfAg ¼ diag 4 4 5 6 5 ¼ 4 5 5: ð44Þ ð49Þ
: ;
7 8 9 9

Note, incidently, that 2 3


1 0 0 Identity Matrix
diagfdiagfAgg ¼ 4 0 5 0 5: ð45Þ
0 0 9 A diagonal matrix whose diagonal elements are
all equal to 1 is called an identity matrix and is
denoted I. If we need to specify its dimensions, we
Multiplication by a Diagonal Matrix use subscripts such as
Diagonal matrices are often used to multiply by
I ¼
a scalar all the elements of a given row or column. 3×3
2 3
Specifically, when we premultiply a matrix by 1 0 0
a diagonal matrix, the elements of the row of the 6 7
I ¼ 40 1 0 5 ðthis is a 3 × 3 identity matrixÞ:
second matrix are multiplied by the corresponding
diagonal element. Likewise, when we postmultiply 0 0 1
a matrix by a diagonal matrix, the elements of the ð50Þ
column of the first matrix are multiplied by the
corresponding diagonal element. For example, The identity matrix is the neutral element for the
with: standard product. So

1 2 3 2 0 I×A ¼ A×I ¼ A ð51Þ
A¼ B¼
4 5 6 0 5
2 3
2 0 0 , ð46Þ for any matrix A conformable with I. For
6 7 example:
C ¼ 40 4 05
0 0 6 2 3 2 3
1 0 1 2 3
0
6 7 6 7
we get 40 05×44 5 55 ¼
1
2 0 1 2 3
BA ¼ × 0 0 1 7 8 0
0 5 4 5 6 2 3 2 3 2 3
1 2 3 1 0 0 1 2 3
6 7 6 7 6 7
¼
2 4 6
ð47Þ 44 5 55×40 1 05 ¼ 44 5 5 5:
20 25 30 7 8 0 0 0 1 7 8 0
and ð52Þ
768 Matrix Algebra

Matrix Full of Ones and for the standard product:


A matrix whose elements are all equal to 1 is
1 2 1×0 þ 2×0 1×0 þ 2×0
denoted by 1 or, when we need to specify its × 0 ¼
3 4 2×2 3×0 þ 4×0 3×0 þ 4×0
dimensions, by 1 . These matrices are neutral ele-
I×J
0 0
ments for the Hadamard product. So ¼ ¼ 0 :
0 0 2×2

1 2 3 1 1 1 ð59Þ
A  1 ¼  ð53Þ
2×3 2×3 4 5 6 1 1 1


1×1 2×1 3×1 1 2 3 Triangular Matrix
¼ ¼ : ð54Þ
4×1 5×1 6×1 4 5 6 A matrix is lower triangular when ai,j ¼ 0 for
i < j. A matrix is upper triangular when ai,j ¼ 0
The matrices can also be used to compute sums of
for i > j. For example,
rows or columns:
2 3
2 3 10 0 0
1
6 7 A ¼ 4 2 20 0 5 is lower triangular, ð60Þ
½ 1 2 3  × 4 1 5 ¼ ð1 × 1Þ þ ð2 × 1Þ þ ð3 × 1Þ 3 5 30
1
¼ 1 þ 2 þ 3 ¼ 6; and
2 3
ð55Þ 12 2 3
B¼4 0 20 5 5 is upper triangular: ð61Þ
or also 0 0 30

1 2 3
½1 1× ¼ ½5 7 9 : ð56Þ Cross-Product Matrix
4 5 6
A cross-product matrix is obtained by multipli-
cation of a matrix by its transpose. Therefore
Matrix Full of Zeros
a cross-product matrix is square and symmetric.
For example, the matrix:
A matrix whose elements are all equal to 0 is 2 3
the null or zero matrix. It is denoted by 0 or, when 1 1
we need to specify its dimensions, by 0 . Null A ¼ 42 45 ð62Þ
I×J
3 4
matrices are neutral elements for addition:
premultiplied by its transpose
1 2 1þ0 2þ0
þ 0 ¼
3 4 2×2 3þ0 4þ0 1 2 3
AT ¼ ð63Þ
1 4 4
1 2
¼ : ð57Þ
3 4 gives the cross-product matrix
They are also null elements for the Hadamard
AT A
product:
1×1 þ 2×2 þ 3×3 1×1 þ 2×4 þ 3×4
1 2 1×0 2×0 ¼
 0 ¼ 1×1 þ 4×2 þ 4×3 1×1 þ 4×4 þ 4×4
3 4 2×2 3×0 4×0
14 21
¼ :
0 0 14 33
¼ ¼ 0 ð58Þ
0 0 2×2 ð64Þ
Matrix Algebra 769

A Particular Case of Cross-Product (Variances are on the diagonal; covariances are


Matrix: Variance–Covariance off-diagonal.)
A particular case of cross-product matrices is
correlation or covariance matrices. A variance– The Inverse of a Square Matrix
covariance matrix is obtained from a data matrix
An operation similar to division exists, but only
by three steps: (1) subtract the mean of each col-
for (some) square matrices. This operation uses the
umn from each element of this column (this is cen-
notion of inverse operation and defines the inverse
tering), (2) compute the cross-product matrix from
of a matrix. The inverse is defined by analogy with
the centered matrix, and (3) divide each element of
the scalar number case, for which division actually
the cross-product matrix by the number of rows of
corresponds to multiplication by the inverse,
the data matrix. For example, if we take the I ¼ 3
namely,
by J ¼ 2 matrix A,
2 3 a
2 1 ¼ a × b1 with b × b1 ¼ 1: ð70Þ
b
A ¼ 4 5 10 5; ð65Þ
8 10 The inverse of a square matrix A is denoted
A  1. It has the following property:
we obtain the means of each column as
A × A1 ¼ A1 × A ¼ I: ð71Þ
1
m¼ × 1 × A The definition of the inverse of a matrix is simple,
I 1×I I×J
2 3 but its computation is complicated and is best left
2 1 ð66Þ
1 6 7 to computers.
¼ × ½ 1 1 1  × 4 5 10 5 ¼ ½ 5 7 : For example, for
3
8 10 2 3
1 2 1
To center the matrix, we subtract the mean of A ¼ 4 0 1 0 5, ð72Þ
each column from all its elements. This centered 0 0 1
matrix gives the deviations of each element from
the mean of its column. Centering is performed as The inverse is:
22 3 3 2 3
12 1 1 2 1
6 7 6 7
D ¼ A  1 × m ¼ 4 5 10 5  4 1 5 × ½ 5 7  A1 ¼ 40 1 0 5: ð73Þ
J×1 0 0 1
8 10 1
ð67Þ All square matrices do not necessarily have an
2 3 2 3 2 3 inverse. The inverse of a matrix does not exist if
2 1 5 7 3 6
¼ 45 10 5  4 5 7 5 ¼ 4 0 3 5: ð68Þ the rows (and the columns) of this matrix are line-
8 10 5 7 3 3 arly dependent. For example,
2 3
3 4 2
We denote as S the variance–covariance matrix
A ¼ 41 0 25 ð74Þ
derived from A. It is computed as
2 1 3
2 3
3 6 does not have an inverse since the second column
1 1 3 0 3 6 7 is a linear combination of the two other columns:
S ¼ DT D ¼ ×4 0 35
I 3 6 3 3
3 3 ð69Þ 2 3 2 3 2 3 2 3 2 3
4 3 2 6 2
1 18 27 6 9 4 0 5 ¼ 2 × 4 1 5  4 2 5 ¼ 4 2 5  4 2 5: ð75Þ
¼ × ¼ :
3 27 54 9 18 1 2 3 4 3
770 Matrix Algebra

A matrix without an inverse is singular. When A1 Notations and Definition


exists, it is unique.
An eigenvector of matrix A is a vector u that
Inverse matrices are used for solving linear
satisfies the following equation:
equations and least square problems in multiple
regression analysis or analysis of variance. Au ¼ λu; ð79Þ

Inverse of a Diagonal Matrix where l is a scalar called the eigenvalue associated


to the eigenvector. When rewritten, Equation 79
The inverse of a diagonal matrix is easy to com- becomes
pute: The inverse of
  ðA  λIÞu ¼ 0: ð80Þ
A ¼ diag ai;i ð76Þ
Therefore u is eigenvector of A if the multiplica-
is the diagonal matrix tion of u by A changes the length of u but not its
n o   orientation. For example,
A1 ¼ diag a1i;i ¼ diag 1=ai;i : ð77Þ
2 3
A¼ ð81Þ
For example, 2 1
2 3 2 3
1 0 0 1 0 0 has for eigenvectors
4 0 :5 0 5 and 4 0 2 0 5 ð78Þ
0 0 4 0 0 :25 3
u1 ¼ with eigenvalue λ1 ¼ 4 ð82Þ
2
are the inverse of each other.
and
1
The Big Tool: Eigendecomposition u2 ¼ with eigenvalue λ2 ¼ 1 : ð83Þ
1
So far, matrix operations are very similar to opera-
tions with numbers. The next notion is specific to When u1 and u2 are multiplied by A, only their
matrices. This is the idea of decomposing a matrix length changes. That is,
into simpler matrices. A lot of the power of matri-
2 3 3 12 3
ces follows from this. A first decomposition is Au1 ¼ λ1 u1 ¼ ¼ ¼4
called the eigendecomposition, and it applies only 2 1 2 8 2
to square matrices. The generalization of the eigen- ð84Þ
decomposition to rectangular matrices is called the
singular value decomposition. and
Eigenvectors and eigenvalues are numbers
2 3 1 1 1
and vectors associated with square matrices. Au2 ¼ λ2 u2 ¼ ¼ ¼ 1 :
Together they constitute the eigendecomposition. 2 1 1 1 1
Even though the eigendecomposition does not ð85Þ
exist for all square matrices, it has a particularly
simple expression for a class of matrices often This is illustrated in Figure 1.
used in multivariate analysis such as correlation, For convenience, eigenvectors are generally nor-
covariance, or cross-product matrices.The eigen- malized such that
decomposition of these matrices is important in
statistics because it is used to find the maximum uT u ¼ 1: ð86Þ
(or minimum) of functions involving these matri- For the previous example, normalizing the
ces. For example, principal components analysis eigenvectors gives
is obtained from the eigendecomposition of
a covariance or correlation matrix and gives the :8321 :7071
u1 ¼ and u2 ¼ : ð87Þ
least square estimate of the original data matrix. :5547 :7071
Matrix Algebra 771

Au 1
8

u1 1
u2
2
1

3 12 −1

Au
−1 2

(a) (b)

Figure 1 Two Eigenvectors of a Matrix

We can check that Reconstitution of a Matrix


The eigendecomposition can also be used to
2 3 :8321 :83213:3284
¼ ¼4 build back a matrix from its eigenvectors and
2 1 :5547 2:2188 :5547 eigenvalues. This is shown by rewriting Equation
ð88Þ 90 as

and A ¼ UΛU1 : ð92Þ



2 3 :7071 :7071 :7071 For example, because
¼ ¼ 1 :
2 1 :7071 :7071 :7071
1 :2 :2
ð89Þ U ¼ ,
:4 :6

we obtain
Eigenvector and Eigenvalue Matrices
1 3 1 4 0 :2 :2
Traditionally, we store the eigenvectors of A as A ¼ UΛU ¼
2 1 0 1 :4 :6
the columns of a matrix denoted U. Eigenvalues
are stored in a diagonal matrix (denoted Λ). 2 3
¼ :
Therefore, Equation 79 becomes 2 1
ð93Þ
AU ¼ UΛ: ð90Þ
Digression: An Infinity of
For example, with A (from Equation 81), we have Eigenvectors for One Eigenvalue
It is only through a slight abuse of language that
2 3 3 1 3 1 4 0
× ¼ × : we talk about the eigenvector associated with one
2 1 2 1 2 1 0 1
eigenvalue. Any scalar multiple of an eigenvector
ð91Þ is an eigenvector, so for each eigenvalue, there are
772 Matrix Algebra

an infinite number of eigenvectors, all proportional A ¼ UΛUT ð100Þ


to each other. For example,
where UTU ¼ I are the normalized eigenvectors.
1 For example,
ð94Þ
1
3 1
A¼ ð101Þ
is an eigenvector of A: 1 3

2 3 can be decomposed as
: ð95Þ
2 1
A ¼ UΛUT
Therefore, 2 qffiffi qffiffi 32 32 qffiffi qffiffi 3
1 1 4 0 1 1
6 2 27
56 2 27
1 2 ¼ 4 qffiffi qffiffi 54 4 qffiffi qffiffi 5
2× ¼ ð96Þ 1
 12 1
 12
1 2 2 0 2 2
2 3
3 1
is also an eigenvector of A:
¼4 5;

2 3 2 2 1 1 3
¼ ¼ 1 × 2 : ð97Þ
2 1 2 2 1 ð102Þ

with
Positive (Semi)Definite Matrices 2 qffiffi qffiffi 32 qffiffi qffiffi 3
1 1 1 1
0 1 2
4 qffiffi 2
qffiffi 54 qffiffi 2
q2ffiffi 5 ¼ 1; 001: ð103Þ
Some matrices, such as ; do not have
0 0 1
 1 1
 12
2 2 2
eigenvalues. Fortunately, the matrices used often in
statistics belong to a category called positive semide-
finite. The eigendecomposition of these matrices Diagonalization
always exists and has a particularly convenient When a matrix is positive semidefinite, we can
form. A matrix is positive semidefinite when it can rewrite Equation 100 as
be obtained as the product of a matrix by its trans-
pose. This implies that a positive semidefinite A ¼ UΛUT ,  ¼ UT AU: ð104Þ
matrix is always symmetric. So, formally, the matrix
A is positive semidefinite if it can be obtained as This shows that we can transform A into a diago-
nal matrix. Therefore the eigendecomposition of
A ¼ XXT ð98Þ a positive semidefinite matrix is often called its
diagonalization.
for a certain matrix X. Positive semidefinite matrices
include correlation, covariance, and cross-product Another Definition for
matrices. Positive Semidefinite Matrices
The eigenvalues of a positive semidefinite matrix A matrix A is positive semidefinite if for any
are always positive or null. Its eigenvectors are nonzero vector x, we have
composed of real values and are pairwise orthogo-
nal when their eigenvalues are different. This xT Ax ≥ 0 8x: ð105Þ
implies the following equality:
When all the eigenvalues of a matrix are positive,
U1 ¼ UT : ð99Þ the matrix is positive definite. In that case, Equa-
tion 105 becomes
We can, therefore, express the positive semidefinite
matrix A as xT Ax > 0 8x: ð106Þ
Matrix Algebra 773

Trace, Determinant, and Rank detfAg ¼ 16:1168 × 1:1168 × 0 ¼ 0: ð113Þ


The eigenvalues of a matrix are closely related Rank
to three important numbers associated to a square
matrix: trace, determinant, and rank. Finally, the rank of a matrix is the number
of nonzero eigenvalues of the matrix. For our
example,
Trace
rankfAg ¼ 2: ð114Þ
The trace of A, denoted tracefAg, is the sum of
its diagonal elements. For example, with The rank of a matrix gives the dimensionality
of the Euclidean space that can be used to repre-
2 3
1 2 3 sent this matrix. Matrices whose rank is equal to
A ¼ 44 5 65 ð107Þ their dimensions are full rank, and they are invert-
7 8 9 ible. When the rank of a matrix is smaller than its
dimensions, the matrix is not invertible and is
we obtain called rank-deficient, singular, or multicolinear.
For example, matrix A from Equation 107 is
tracefAg ¼ 1 þ 5 þ 9 ¼ 15: ð108Þ a 3 × 3 square matrix, its rank is equal to 2, and
therefore it is rank-deficient and does not have an
The trace of a matrix is also equal to the sum of
inverse.
its eigenvalues:
X
tracefAg ¼ λ‘ ¼ tracefΛg ð109Þ
‘ Statistical Properties of the Eigendecomposition

with Λ being the matrix of the eigenvalues of A. The eigendecomposition is essential in optimiza-
For the previous example, we have tion. For example, principal components analysis
is a technique used to analyze an I × J matrix X in
Λ ¼ diagf16:1168; 1:1168; 0g: ð110Þ which the rows are observations and the columns
are variables. Principal components analysis finds
We can verify that orthogonal row factor scores that ‘‘explain’’ as
X much of the variance of X as possible. They are
tracefAg ¼ λ‘ ¼ 16:1168 obtained as
‘ ð111Þ
þ ð1:1168Þ ¼ 15: F ¼ XP; ð115Þ

where F is the matrix of factor scores and P is the


matrix of loadings of the variables. These loadings
Determinant give the coefficients of the linear combination used
The determinant is important for finding the to compute the factor scores from the variables. In
solution of systems of linear equations (i.e., the addition to Equation 115, we impose the con-
determinant determines the existence of a solution). straints that
The determinant of a matrix is equal to the prod-
FT F ¼ PT XT XP ð116Þ
uct of its eigenvalues. If det{A} is the determinant
of A, is a diagonal matrix (i.e., F is an orthogonal
Y matrix) and that
detfAg ¼ λ‘ with λ‘ being the
‘ ð112Þ PT P ¼ I ð117Þ
‘ th eigenvalue of A:
(i.e., P is an orthonormal matrix). The solution is
For example, the determinant of A from Equa- obtained by using Lagrangian multipliers in which
tion 107 is equal to the constraint from Equation 117 is expressed as
774 Matrix Algebra

the multiplication with a diagonal matrix of Lag- The eigendecomposition decomposes a matrix into
rangian multipliers denoted Λ; in order to give the two simple matrices, and the SVD decomposes
following expression: a rectangular matrix into three simple matrices: two
 orthogonal matrices and one diagonal matrix. The
Λ PT P  I : ð118Þ SVD uses the eigendecomposition of a positive
semidefinite matrix to derive a similar decomposi-
This amounts to defining the following equation: tion for rectangular matrices.
L ¼ FT F  ΛðPT P  IÞ

¼ tracefPT XT XP  ΛðPT P  IÞg: ð119Þ Definitions and Notations


The SVD decomposes matrix A as
The values of P that give the maximum values of
L are found by first computing the derivative of L A ¼ PΔQT ; ð124Þ
relative to P,
∂L where P is the (normalized) eigenvectors of the
¼ 2XT XP  2ΛP; ð120Þ matrix AAT (i.e., PTP ¼ I). The columns of P are
∂P
called the left singular vectors of A. Q is the (nor-
and setting this derivative to zero: malized) eigenvectors of the matrix ATA (i.e.,
QTQ ¼ I). The columns of Q are called the right
XT XP  ΛP ¼ 0 , XT XP ¼ ΛP: ð121Þ singular vectors of A. Δ is the diagonal matrix
1
of the singular values, Δ ¼ Λ2 , with Λ being
Because Λ is diagonal, this is an eigendecompo-
the diagonal matrix of the eigenvalues of AAT
sition problem, Λ is the matrix of eigenvalues of
and ATA.
the positive semidefinite matrix XTX ordered from
The SVD is derived from the eigendecomposition
the largest to the smallest, and P is the matrix of
of a positive semidefinite matrix. This is shown by
eigenvectors of XTX. Finally, the factor matrix is
considering the eigendecomposition of the two posi-
1 tive semidefinite matrices obtained from A, namely,
F ¼ PΛ 2 : ð122Þ
AAT and ATA. If we express these matrices in terms
The variance of the factor scores is equal to the of the SVD of A, we find
eigenvalues:
AAT ¼ PΔQT QΔPT
1 T 1 ð125Þ
FT F ¼ Λ P PΛ ¼ Λ:
2 2 ð123Þ ¼ PΔ2 PT ¼ PΛPT
Because the sum of the eigenvalues is equal to
the trace of XTX, the first factor scores ‘‘extract’’ and
as much of the variance of the original data as
possible, the second factor scores extract as much AT A ¼ QΔPT PΔQT
ð126Þ
of the variance left unexplained by the first factor ¼ QΔ2 QT ¼ QΛQT :
as possible, and so on for the remaining factors.
1
The diagonal elements of the matrix Λ2 , which
This equation shows that Δ is the square root of
are the standard deviations of the factor scores,
Λ, that P is eigenvectors of AAT, and that Q is
are called the singular values of X.
eigenvectors of ATA.
For example, the matrix
2 3
A Tool for Rectangular Matrices: 1:1547 1:1547
The Singular Value Decomposition A ¼ 4 1:0774 0:0774 5 ð127Þ
0:0774 1:0774
The singular value decomposition (SVD) generalizes
the eigendecomposition to rectangular matrices. can be expressed as
Matrix Algebra 775

A ¼ PΔQT For example, with


2 3 2 3
0:8165 0 1 1
6 7 A ¼ 4 1 1 5 ð132Þ
¼ 4 0:4082 0:7071 5
0:4082 0:7071 1 1
ð128Þ
2 0 0:7071 0:7071 we find that the pseudoinverse is equal to
0 1 0:7071 0:7071
2 3 þ :25 :25 :5
1:1547 1:1547 A ¼ : ð133Þ
:25 :25 :5
¼ 4 0:0774 0:0774 5:
0:0774 1:0774 This example shows that the product of a matrix
and its pseudoinverse does not always give the
We can check that identity matrix:
2 3 2 3
0:8165 0 2 1 1
T 6 7 2 0 þ 6 7 :25 :25 :5
AA ¼ 4 0:4082 0:7071 5 AA ¼ 4 1 1 5
0 12 :25 :25 :5
0:4082 0:7071 1 1 ð134Þ

0:8165 0:4082 0:4082 0:3750 0:1250
ð129Þ ¼
0 0:7071 0:7071 0:1250 0:3750
2 3
2:6667 1:3333 1:3333
6 7
¼ 4 1:3333 1:1667 0:1667 5 Pseudoinverse and SVD
1:3333 0:1667 1:1667
The SVD is the building block for the Moore–
and that Penrose pseudoinverse because any matrix A with
SVD equal to PΔQT has for pseudoinverse
0:7071 0:7071 22 0
AT A ¼
0:7071 0:7071 0 12 Aþ ¼ QΔ1 PT : ð135Þ

0:7071 0:7071
ð130Þ For the preceding example, we obtain
0:7071 0:7071

2:5 1:5 þ 0:7071 0:7071 21 0
¼ : A ¼
1:5 2:5 0:7071 0:7071 0 11

0:8165 0:4082 0:4082
Generalized or Pseudoinverse ð136Þ
0 0:7071 0:7071

The inverse of a matrix is defined only for full 0:2887 0:6443 0:3557
rank square matrices. The generalization of the ¼ :
0:2887 0:3557 0:6443
inverse for other matrices is called generalized
inverse, pseudoinverse, or Moore–Penrose inverse Pseudoinverse matrices are used to solve multiple
and is denoted by X þ . The pseudoinverse of A is regression and analysis of variance problems.
the unique matrix that satisfies the following four
constraints: Hervé Abdi and Lynne J. Williams

AAþ A ¼ A ðiÞ See also Analysis of Covariance (ANCOVA); Analysis of


Variance (ANOVA); Canonical Correlation Analysis;
Aþ AAþ ¼ Aþ ðiiÞ
Confirmatory Factor Analysis; Correspondence
þ T
ðAA Þ ¼ AAþ ðsymmetry 1Þ ðiiiÞ Analysis; Discriminant Analysis; General Linear
þ T þ Model; Latent Variable; Mauchly Test; Multiple
ðA AÞ ¼ A A ðsymmetry 2Þ ðivÞ:
Regression; Principal Components Analysis;
ð131Þ Sphericity; Structural Equation Modeling
776 Mauchly Test

Further Readings conduct a Mauchly test on such data, and a test


conducted automatically by statistical software
Abdi, H. (2007). Eigendecomposition: Eigenvalues and
eigenvecteurs. In N. J. Salkind (Ed.), Encyclopedia of will not output p values.
measurement and statistics (pp. 304–308). Thousand Sphericity is a more general form of compound
Oaks, CA: Sage. symmetry, the condition of equal population covari-
Abdi, H. (2007). Singular value decomposition (SVD) and ance (among paired levels) and equal population
generalized singular value decomposition (GSVD). In variance (among levels). Whereas compound sym-
N. J. Salkind (Ed.), Encyclopedia of measurement and metry is a sufficient but not necessary precondition
statistics (pp. 907–912). Thousand Oaks, CA: Sage. for conducting valid repeated measures F tests
Basilevsky, A. (1983). Applied matrix algebra in the (assuming normality), sphericity is both a sufficient
statistical sciences. New York: North-Holland.
and necessary precondition. Historically, statisti-
Graybill, F. A. (1969). Introduction to matrices with
applications in statistics. Belmont, CA: Wadsworth.
cians and social scientists often failed to recognize
Healy, M. J. R. (1986). Matrices for statistics. Oxford, these distinctions between compound symmetry
UK: Oxford University Press. and sphericity, leading to frequent confusion over
Searle, S. R. (1982). Matrices algebra useful for statistics. the definitions of both, as well as the true statistical
New York: Wiley. assumptions that underlie repeated measures
ANOVA. In fact, Mauchly’s definition of sphericity
is what is now considered compound symmetry,
MAUCHLY TEST although the Mauchly test nevertheless assesses
what is now considered sphericity.

The Mauchly test (or Mauchly’s test) assesses the


validity of the sphericity assumption that underlies Implementation and Computation
repeated measures analysis of variance (ANOVA). Like any null hypothesis significance test, the
Developed in 1940 by John W. Mauchly, an elec- Mauchly test assesses the probability of obtaining
trical engineer who codeveloped the first general- a value for the test statistic as extreme as that
purpose computer, the Mauchly test is the default observed given the null hypothesis. In this ins-
test of sphericity in several common statistical soft- tance, the null hypothesis is that of sphericity, and
ware programs. Provided the data are sampled the test statistic is Mauchly’s W. Mathematically,
from a multivariate normal population, a signifi- the null hypothesis of sphericity (and alternative
cant Mauchly test result indicates that the assump- hypothesis of nonsphericity) can be written in
tion of sphericity is untenable. This entry first terms of difference scores:
explains the sphericity assumption and then
describes the implementation and computation of H0 : σ 2y1 y2 ¼ σ 2y1 y3 ¼ σ 2y2 y3 :::
the Mauchly test. The entry ends with a discussion
of the test’s limitations and critiques. H1 : σ 2y1 y2 6¼ σ 2y1 y3 6¼ σ 2y2 y3 :::

(for all k(k  1)/2 unique difference scores created


The Sphericity Assumption from k levels of repeated variable y) or in terms of
The sphericity assumption is the assumption matrix algebra:
that the difference scores of paired levels of the
repeated measures factor have equal population H0 : C0 Σ C ¼ λI
variance. As with the other ANOVA assump- H1 : C0 Σ C ¼ λI;
tions of normality and homogeneneity of vari-
ance, it is important to note that the sphericity where C is any (k  1Þ × ðk orthonormal coeffi-
assumption refers to population parameters cient matrix associated with the hypothesized
rather than sample statistics. Also worth noting repeated measure effect; C0 is the transpose of C;
is that the sphericity assumption by definition is Σ is the k × k population covariance matrix; λ is
always met for designs with only two levels of a positive, scalar number; and I is the (k  1Þ ×
a repeated measures factor. One need not ðk  1Þ identity matrix. Mauchly’s test statistic, W,
Mauchly Test 777

can be expressed concisely only in terms of matrix advocate the use of the local invariant test over the
algebra: Mauchly test.
Finally, some statisticians have called into ques-
jC0 SCj tion the utility of conducting any preliminary test
W¼h ik1 ;
0
trðC SCÞ of sphericity such as the Mauchly test. For repe-
k1
ated measures data sets in the social sciences, they
argue, sphericity is almost always violated to some
where S is the k × k sample covariance matrix. degree, and thus researchers should universally
One can rely on either an approximate or exact correct for this violation (by adjusting df with the
sampling distribution to determine the probability Greenhouse–Geisser and the Huynh–Feldt esti-
value of an obtained W value. Because of the cum- mates). Furthermore, like any significance test, the
bersome computations required to determine exact Mauchly test is limited in its utility by sample size:
p values and the precision of the chi-square appr- For large samples, small violations of sphericity
oximation, even statistical software packages (e.g., often produce significant Mauchly test results, and
SPSS, an IBM company, formerly called PASWÓ for small samples, the Mauchly test often does not
Statistics) typically rely on the latter. The chi- have the power to detect large violations of sphe-
square approximation is based on the statistic ricity. Finally, critics of sphericity testing note that
ðn  1ÞdW with degrees of freedom (df Þ ¼ adoption of the df correction tests only when the
kðk  1Þ=2  1, where Mauchly test reveals significant nonsphericity—as
" # opposed to always adopting such df correction
2
2ðk  1Þ þ ðk  1Þ þ 2 tests—does not produce fewer Type I or II errors
d ¼1 :
6ðk  1Þðn  1Þ under typical testing conditions (as shown by sim-
ulation research).
For critical values for the exact distribution, see Aside from using alternative tests of sphericity
Nagarsenker and Pillai (1973). or forgoing such tests in favor of adjusted df tests,
researchers who collect data on repeated measures
should also consider employing statistical models
Limitations and Critiques
that do not assume sphericity. Of these alternative
The Mauchly test is not robust to nonnormality: models, the most common is multivariate ANOVA
Small departures from multivariate normality in (MANOVA). Power analyses have shown that the
the population distribution can lead to artificially univariate ANOVA approach possesses greater
low or high Type I error (i.e., false positive) rates. power than the MANOVA approach when sample
In particular, heavy-tailed (leptokurtic) distribu- size is small (n < k þ 10) or the sphericity viola-
tions can—under typical sample sizes and signifi- tion is not large (ε > .7) but that the opposite is
cance thresholds—triple or quadruple the number true when sample sizes are large and the sphericity
of Type I errors beyond their expected rate. Res- violation is large.
earchers who conduct a Mauchly test should
therefore examine their data for evidence of non- Samuel T. Moulton
normality and, if necessary, consider applying nor-
malizing transformations before reconducting the See also Analysis of Variance (ANOVA); Bartlett’s Test;
Mauchly test. Greenhouse–Geisser Correction; Homogeneity of
Compared with other tests of sphericity, the Variance; Homoscedasticity; Repeated Measures
Mauchly test is not the most statistically powerful. Design; Sphericity
In particular, the local invariant test (see Cornell,
Young, Seaman, & Kirk, 1992) produces fewer
Type II errors (i.e., false negative) than the Mau- Further Readings
chly test does. This power difference between the Cornell, J. E., Young, D. M., Seaman, S. L., & Kirk, R. E.
two tests is trivially small for large samples and (1992). Power comparisons of eight tests for sphericity
small k/n ratios but noteworthy for small samples in repeated measures designs. Journal of Educational
sizes and large k/n ratios. For this reason, some Statistics, 17, 233–249.
778 MBESS

Huynh, H., & Mandeville, G. K. (1979). Validity (such as the functions contained within the MBESS
conditions in repeated measures designs. Psychological package), a resulting benefit is ‘‘reproducible res-
Bulletin, 86, 964–973. earch,’’ in the sense that a record exists of the exact
Keselman, H. J., Rogan, J. C., Mendoza, J. L., & Breen, analyses performed, with all options and subsam-
L. J. (1980). Testing the validity conditions of repeated
ples denoted. Having a record of the exact analyses,
measures F tests. Psychological Bulletin, 87, 479–481.
Mauchly, J. W. (1940). Significance test for sphericity of
by way of a script file, that were performed is bene-
a normal n-variate distribution. Annals of ficial so that the data analyst can (a) respond to
Mathematical Statistics, 11, 204–209. inquiries regarding the exact analyses, algorithms,
Nagarsenker, B. N., & Pillai, K. C. S. (1973). The and options; (b) modify code for similar analyses
distribution of the sphericity test criterion. Journal of on the same or future data; and (c) provide code
Multivariate Analysis, 3, 226–235. and data so that others can replicate the published
results. Many novel statistical techniques are imple-
mented in R, and in many ways R has become nec-
essary for cutting-edge developments in statistics
MBESS and measurement. In fact, R has even been referred
to as the lingua franca of statistics.
MBESS is an R package that was developed pri- MBESS, developed by Ken Kelley, was first
marily to implement important but nonstand- released publicly in May 2006 and has since incor-
ard methods for the behavioral, educational, and porated functions contributed by others. MBESS
social sciences. The generality and applicability of will continue to be developed for the foreseeable
many of the functions contained in MBESS have future and will remain open source and freely
allowed the package to be used in a variety of available. Although only minimum experience with
other disciplines. Both MBESS and R are open R is required in order to use many of the functions
source and freely available from The R Project’s contained within the MBESS package, in order to
Comprehensive R Archive Network. The MBESS use MBESS to its maximum potential, experience
Web page contains the reference manual, source with R is desirable.
code files, and binaries files. MBESS (and R) is
available for Apple Macintosh, Microsoft Win- Ken Kelley
dows, and Unix/Linux operating systems.
See also Confidence Intervals; Effect Size, Measures of;
The major categories of functions contained in
R; Sample Size Planning
MBESS are (a) estimation of effect sizes (standard-
ized and unstandardized), (b) confidence interval
formation based on central and noncentral distri- Further Readings
butions (t, F, and χ2), (c) sample size planning
de Leeuw, J. (2005). On abandoning XLISP-STAT.
from the accuracy in parameter estimation and
Journal of Statistical Software, 13(7), 1–5.
power analytic perspectives, and (d) miscellaneous Kelley, K. (2006–2008). MBESS [computer software and
functions that allow the user to easily interact with manual]. Accessed February 16, 2010, from http://
R for analyzing and graphing data. Most MBESS cran.r-project.org/web/packages/MBESS/index.html
functions require only summary statistics. MBESS Kelley, K. (2007). Confidence intervals for standardized
thus allows researchers to compute effect sizes and effect sizes: Theory, application, and implementation.
confidence intervals based on summary statistics, Journal of Statistical Software, 20(8), 1–24.
which facilitates using previously reported infor- Kelley, K. (2007). Methods for the behavioral,
mation (e.g., for calculating effect sizes to be educational, and social science: An R package.
included in meta-analyses) or if one is primarily Behavior Research Methods, 39(4), 979–984.
Kelley, K., Lai, K., & Wu, P.-J. (2008). Using R for data
using a program other than R to analyze data but
analysis: A best practice for research. In J. Osbourne
still would like to use the functionality of MBESS. (Ed.), Best practices in quantitative methods
MBESS, like R, is based on a programming envi- (pp. 535–572). Thousand Oaks, CA: Sage.
ronment instead of a point-and-click interface for R Development Core Team. (2008). The R project for
the analysis of data. Because of the necessity to statistical computing. Retrieved February 16, 2010,
write code in order for R to implement functions from http://www.R-project.org/
McNemar’s Test 779

Venebles, W. N., Smith, D. M., & The R Development any observed change while accounting for the
Core Team. (2008). An introduction to R: Notes on R: dependent nature of the sample. To do so, a four-
A programming environment for data analysis and fold table of frequencies must be set up to repre-
graphics. Retrieved February 16, 2010, from http:// sent the first and second sets of responses from the
cran.r-project.org/doc/manuals/R-intro.pdf
same or matched individuals. This table is also
known as a 2 × 2 contingency table and is illus-
Websites trated in Table 1.
Comprehensive R Archive Network:
In this table, Cells A and D represent the discor-
http://cran.r-project.org dant pairs, or individuals whose response changed
MBESS: http://cran.r-project.org/web/packages/MBESS/ from the first to the second time. If an individual
index.html changes from þ to  , he or she is included in
The R Project: http://www.r-project.org Cell A. Conversely, if the individual changes from
 to þ , he or she is tallied in Cell D. Cells B and
C represent individuals who did not change
responses over time, or pairs that are in agreement.
MCNEMAR’S TEST The main purpose of McNemar’s test is determine
whether the proportion of individuals who chan-
McNemar’s test, also known as a test of corre- ged in one direction ( þ to  ) is significantly dif-
lated proportions, is a nonparametric test used ferent from that of individuals who changed in the
with dichotomous nominal or ordinal data to other direction (  to þ ).
determine whether two sample proportions When one is using McNemar’s test, it is unnec-
based on the same individuals are equal. McNe- essary to calculate actual proportions. The differ-
mar’s test is used in many fields, including the ence between the proportions algebraically and
behavioral and biomedical sciences. In short, it conceptually reduces to the difference between the
is a test of symmetry between two related sam- frequencies given in A and D. McNemar’s test then
ples based on the chi-square distribution with 1 assumes that A and D belong to a binomial distri-
degree of freedom (df). bution defined by
McNemar’s test is unique in that it is the only
test that can be used when one or both conditions n ¼ A þ D; p ¼ :05; and q ¼ :05:
being studied are measured using the nominal
scale. It is often used in before–after studies, in Based on this, the expectation under the null
which the same individuals are measured at two hypothesis would be that 12(A þ D) cases would
times, a pretest–posttest, for example. McNemar’s change in one direction and 12(A þ D) cases would
test is also often used in matched-pairs studies, in change in the other direction. Therefore, Ho :
which similar people are exposed to two different A ¼ D: The χ2 formula,
conditions, such as a case–control study. This entry X ðOi  Ei Þ2
details the McNemar’s test formula, provides an χ2 ¼ ;
example to illustrate the test, and examines its Ei
application in research.
where Oi ¼ observed number of cases in the ith
category and Ei ¼ expected number of cases in the
Formula ith category under H0, converts into
McNemar’s test, in its original form, was designed
ðA  ðA þ DÞ=2Þ2 ðD  ðA þ DÞ=2Þ2
only for dichotomous variables (i.e., yes–no, right– 2
χ ¼ þ
ðAþDÞ ðAþDÞ
wrong, effect–no effect) and therefore gives rise to 2 2
proportions. McNemar’s test is a test of the equal-
ity of these proportions to one another given the and then factors into
fact that they are based in part on the same
individual and therefore correlated. More specifi- ðA  DÞ2
χ2 ¼ with df ¼ 1:
cally, McNemar’s test assesses the significance of AþD
780 McNemar’s Test

Table 1 Fourfold Table for Use in Testing Significance Table 2 Form of Table to Show Subjects’ Change in
of Change Voting Decision in Response to Negative
Campaign Ad
After
Before  þ After Campaign Ad
þ A B
Before Campaign Ad No Vote Yes Vote
 C D
Yes vote yn yy
This is McNemar’s test formula. The sample No vote nn ny
distribution is distributed approximately as chi-
square with 1 df.
Statistical Test
Correction for Continuity McNemar’s test is chosen to determine whether
The approximation of the sample distribution there was a significant change in voter behavior.
by the chi-square distribution can present prob- McNemar’s test is appropriate because the study
lems, especially if the expected frequencies are uses two related samples, the data are measured
small. This is because the chi-square is a continu- on a nominal scale, and the researcher is using
ous distribution whereas the sample distribution is a before–after design. McNemar’s test formula as
discrete. The correction for continuity, developed it applies to Table 2 is shown below.
by Frank Yates, is a method for removing this
source of error. It requires the subtraction of 1 ðyn  nyÞ2
χ2 ¼ with df ¼ 1:
from the absolute value of the difference between yn þ ny
A and D prior to squaring. The subsequent for-
With the correction for continuity included, the
mula, including the correction for continuity,
formula becomes
becomes
2
ðjyn  nyj  1Þ
ðjA  Dj  1Þ2 χ2 ¼ :
χ2 ¼ with df ¼ 1: yn þ ny
AþD
Small Expected Frequencies
Hypotheses
When the expected frequency is very small
(12(A þ D) < 5), the binomial test should be used
instead. H0: For those subjects who change, the probability
that any individual will change his or her vote from
yes to no after being shown the campaign ad (that
Example is, Pyn ) is equal to the probability that the individ-
Suppose a researcher was interested in the effect ual will change his or her vote from no to yes (that
of negative political campaign messages on voting is, Pny ), which is equal to 12. More specifically,
behavior. To investigate, the researcher uses a 1
before–after design in which 65 subjects are polled H0 : Pyn ¼ Pny ¼ :
2
twice on whether they would vote for a certain
politician: before and after viewing a nega- H1: For those subjects who change, the probability
tive campaign ad discrediting that politician. The that any individual will change his or her vote
researcher hypothesizes that the negative campaign from yes to no after being shown the negative
message will reduce the number of individuals campaign ad will be significantly greater than the
who will vote for the candidate targeted by the probability that the individual will change his or
negative ad. The data are recorded in the form her vote from no to yes. In other words,
shown in Table 2. The hypothesis test follows; the
data are entirely artificial. H1 : Pyn > Pny :
McNemar’s Test 781

Significance Level cher’s hypothesis that negative campaign ads


significantly decrease the number of individuals
Let α ¼ :01; N ¼ 65, the number of individuals
willing to vote for the targeted candidate.
polled before and after the campaign ad was shown.

Sampling Distribution
Application
McNemar’s test is valuable to the behavioral and
The sampling distribution of χ2 as computed by
biomedical sciences because it gives researchers
McNemar’s test is very closely approximated by the
a way to test for significant effects in dependent
chi-square distribution with df ¼ 1. In this exam-
samples using nominal measurement. It does so by
ple, H1 predicts the direction of the difference and
reducing the difference between proportions to
therefore requires a one-tailed rejection region. This
the difference between discordant pairs and then
region consists of all the χ2 values that are so large
applying the binomial distribution. It has proven
they only have a 1% likelihood of occurring if the
useful in the study of everything from epidemiol-
null hypothesis is true. For a one-tailed test, the crit-
ogy to voting behavior, and it has been modified to
ical value with p < :01 is 7.87944.
fit more specific situations, such as misclassified
Calculation and Decision data, improved sample size estimations, multivari-
ate samples, and clustered matched-pair data.
The artificial results of the study are shown in
Table 3. The table shows that 30 subjects changed M. Ashley Morrison
their vote from yes to no (yn) after seeing the nega-
See also Chi-Square Test; Dichotomous Variable;
tive campaign ad, and 7 subjects changed their
Distribution; Nominal Scale; Nonparametric Statistics;
vote from no to yes (ny).
One-Tailed Test; Ordinal Scale
The other two cells, yy ¼ 11 and nn ¼ 17, rep-
resent those individuals who did not change their
Further Readings
vote after seeing the ad.
For these data, Bowker, A. H. (1948). A test for symmetry in
contingency tables. Journal of the American Statistical
ðyn  nyÞ2 ð30  7Þ2 ð23Þ2 Association, 43, 572–574.
χ2 ¼ ¼ ¼ ¼ 14:30: Eliasziw, M., & Donner, A. (1991). Application of the
yn þ ny ð30 þ 7Þ 37
McNemar test to non-independent matched pair data.
Including the correction for continuity: Statistics in Medicine, 10, 1981–1991.
Hays, W. L. (1994). Statistics (5th ed.). Orlando, FL:
ðjyn  ny  1Þ2 j ðj30  7j  1Þ2 Harcourt Brace.
χ2 ¼ ¼ Klingenberg, B., & Agresti, A. (2006). Multivariate
yn þ ny 30 þ 7 extensions of McNemar’s test. Biometrics, 62,
ð22Þ2 921–928.
¼ ¼ 13:08: Levin, J. R., & Serlin, R. C. (2000). Changing students’
37
perspectives of McNemar’s test of change. Journal of
The critical χ2 value for a one-tailed test at Statistics Education [online], 8(2). Retrieved February
α ¼ :01 is 7.87944. Both 14.30 and 13.08 are 16, 2010, from www.amstat.org/publications/jse/
greater than 7.87944; therefore, the null hypothe- secure/v8n2/levin.cfm
sis is rejected. These results support the resear- Lyles, R. H., Williamson, J. M., Lin, H. M., & Heilig,
C. M. (2005). Extending McNemar’s test: Estimation
and inference when paired binary outcome data are
Table 3 Subjects’ Voting Decision Before and After misclassified. Biometrics, 61, 287–294.
Seeing Negative Campaign Ad McNemar, Q. (1947). Note on sampling error of the
difference between correlated proportions or
After Campaign Ad
percentages. Psychometrika, 12, 153–157.
Before Campaign Ad No Vote Yes Vote Satten, G. A., & Kupper, L. L. (1990). Sample size
Yes vote 30 11 determination for matched-pair case-control studies
No vote 17 7 where the goal is interval estimation of the odds ratio.
Journal of Clinical Epidemiology, 43, 55–59.
782 Mean

Z ∞
Siegel, S. (1956). Nonparametric statistics. New York:
McGraw-Hill. jyjf ðyÞdy < ∞: ð2Þ
∞
Yates, F. (1934). Contingency tables involving small
numbers and the χ2 test. Supplement to Journal of the
Royal Statistical Society, 1, 217–235. Comparing Equation 1 with Equation 2, one
notices immediately that the f(y)dy in Equation 2
mirrors the p(y) in Equation 1, and the integration
in Equation 2 is analogous to the summation in
MEAN Equation 1.
The above definitions help to understand con-
The mean is a parameter that measures the central ceptually the expected value, or the population
location of the distribution of a random variable mean. However, they are seldom used in research
and is an important statistic that is widely repor- to derive the population mean. This is because in
ted in scientific literature. Although the arithmetic most circumstances, either the size of the popula-
mean is the most commonly used statistic in des- tion (discrete random variables) or the true pro-
cribing the central location of the sample data, bability density function (continuous random
other variations of it, such as the truncated mean, variables) is unknown, or the size of the popula-
the interquartile mean, and the geometric mean, tion is so large that it becomes impractical to
may be better suited in a given circumstance. The observe the entire population. The population
characteristics of the data dictate which one of mean is thus an unknown quantity.
them should be used. Regardless of which mean is In statistics, a sample is often taken to estimate
used, the sample mean remains a random variable. the population mean. Results derived from data
It varies with each sample that is taken from the are thus called statistics (in contrast to what are
same population. This entry discusses the use of called parameters in populations). If the distribu-
mean in probability and statistics, differentiates tion of a random variable is known, a probability
between the arithmetic mean and its variations, model may be fitted to the sample data. The popu-
and examines how to determine its appropriate- lation mean is then estimated from the model
ness to the data. parameters. For instance, if a sample can be fitted
with a normal probability distribution model with
parameters μ and σ, the population mean is simply
Use in Probability and Statistics estimated by the parameter μ (and σ 2 as the vari-
In probability, the mean is a parameter that mea- ance). If the sample can be fitted with a Gamma
sures the central location of the distribution of distribution with parameters α and β, the popula-
a random variable. For a real-valued random vari- tion mean is estimated by the product of α and β
able, the mean, or more appropriately the popula- (i.e., αβ), with αβ2 as the variance. For an expo-
tion mean, is the expected value of the random nential random variable with parameter β, the
variable. That is to say, if one observes the random population mean is simply the β, with β2 as the
variable numerous times, the observed values of variance. For a chi-square ðχ2 Þ random variable
the random variable would converge in probability with v degrees of freedom, the population mean is
to the mean. For a discrete random variable with v, with 2v being the variance.
a probability function p(y), the expected value
exists if
Arithmetic Mean
X
ypðyÞ < ∞; ð1Þ When the sample data are not fitted with a known
y probability model, the population mean is often
inferred from the sample mean, a common prac-
where y is the values assigned by the random vari- tice in applied research. The most widely used
able. For a continuous random variable with a sample mean for estimating the population mean
probability density function f ðyÞ, the expected is the arithmetic mean, which is calculated as the
value exists if sum of the observed values of a random variable
Mean 783

divided by the number of observations in the either in interval or in ratio scale. For ordinal
sample. data, the arithmetic mean is not always the most
Formally, for a sample of n observations, appropriate measure of the central location; the
x1 ; x2 ; . . . ; xn on a random variable X, the arith- median is, because it does not require the summa-
metic mean (x) of the sample is defined as tion operation.
Notice further that in Equation 3, each observa-
1 1X n
tion is given an equal weight. Consequently, the
x ¼ ðx1 þ x2 þ    þ xn Þ ¼ xi ; ð3Þ
n n i¼1 arithmetic mean is highly susceptible to extreme
values. Extreme low values would underestimate
P
n the mean, while extreme high values would inflate
where the notation is a succinct representation the mean. One must keep this property of the
i¼1
of the summation of all values from the first to the sample arithmetic mean in mind when using it to
last observation of the sample. For example, a describe research results.
sample consisting of five observations with Because the arithmetic mean is susceptible to
values of 4, 5, 2, 6, and 3 has a mean variability in the sample data, it is often insuffi-
4½¼ ð4 þ 5 þ 2 þ 6 þ 3Þ=5 according to the above cient to report only the sample mean without also
definition. A key property of the mean as defined showing the sample standard deviation. Whereas
above is that the sum of deviations from it is zero. the mean describes the central location of the
If data are grouped, the sample mean can no data, the standard deviation provides information
longer be constructed from each individual mea- about the variability of the data. Two sets of data
surement. Instead, it is defined using the midvalue with the same sample mean, but drastically differ-
of each group interval (xj ) and the corresponding ent standard deviations, inform the reader that
frequency of the group (fj ): either they come from two different populations
or they suffer from variability in quality control in
1X m
the data collection process. Therefore, by reporting
x¼ fj xj , ð4Þ both statistics, one informs the reader of not only
n j¼1
the quality of the data but also the appropriateness
of using these statistics to describe the data, as well
where m is the number of groups, and n is the
as the appropriate choice of statistical methods to
total number of observations in the sample. In
analyze these data subsequently.
Equation 4, fj xj is the total value for the jth group.
A summation of the values of all groups is then
the grand total of the sample, which is equivalent Appropriateness
to the value obtained through summation, as
Whether the mean is an appropriate or inappropri-
defined in Equation 3. For instance, a sample of
ate statistic to describe the data is best illustrated
(n ¼ ) 20 observations is divided into three groups.
by examples of some highly skewed sample data,
The intervals for the three groups are 5 to 9
such as data on the salaries of a corporation, on
(x1 ¼ 7), 10 to 14 (x2 ¼ 12), and 15 to 19
the house prices in a region, on the total family
(x3 ¼ 17), respectively. The corresponding fre-
income in a nation, and so forth. These types of
quency for each group is (f1 ¼ ) 6, (f2 ¼ ) 5, and
social economic data are often distorted by a few
(f3 ¼ ) 9. The sample mean according to Equation
high-income earners or a few high-end properties.
4 is then
The mean is thus an inappropriate statistic to
7  6 þ 12  5 þ 17  9 describe the central location of the data, and the
x¼ ¼ 12:75: median would be a better statistic for the purpose.
20
On the other hand, if one is interested in describ-
Notice that in Equation 3, we summed up the ing the height or the test score of students in
values of all individual observations before arriv- a school, the sample mean would be a good
ing at the sample mean. The summation process is description of the central tendency of the popula-
an arithmetic operation on the data. This requires tion as these types of data often follow a unimodal
that the data be continuous, that is, they must be symmetric distribution.
784 Mean

Variations where the i value beneath indicates that the


summation starts from the (n/4 þ 1)th observa-
Extreme values in a data set, if not inherent in tion of the data set, and the value above sig-
a population, are often erroneous and may have nals that the summation ends at the (3n/4)th
either human or instrumental causes. These so- observation. The 2 above n normalizes the inter-
called outliers are therefore artifacts. In order to quartile mean to the full n observations of the
better estimate the population mean when extreme sample.
values occur in a sample, researchers sometimes The mean is frequently referred to as the aver-
order the observations in a sample from the smal- age. This interchangeable usage sometimes con-
lest to the largest in value and then remove an fuses the reader because the median is sometimes
equal percentage of observations from both the also called the average, such as what is routinely
high end and the low end of the data range before used in reporting house prices. The reader must be
applying the arithmetic mean definition to the careful about which one of these statistics is actu-
sample mean. An example is the awarding of a per- ally being referred to.
formance score to an athlete in a sport competi- The arithmetic mean as defined in Equation 3
tion. Both the highest and the lowest scores given is not always a good measure of the central loca-
by the panel of judges are often removed before tion of the sample in some applications. An
a final mean score is awarded. This variation of example is when the data bear considerable vari-
the arithmetic mean is called the truncated (or ability that has nothing to do with quality con-
trimmed) mean. trol in data collection. Instead, it is inherent in
Suppose that n observations, x1, x2, . . . , xn, the random process that gives rise to the data,
are obtained from a study population and α per- such as the concentration of environmental che-
centage of data points are removed from either micals in the air. Within a given day at a given
end of the data range. The truncated mean (xT ) location, their concentration could vary in mag-
for the sample is then nitude by multiples. Another example is the
growth of bacteria on artificial media. The num-
1 ber of bacteria growing on the media at a given
xT ¼ ðx1þαn þ x2þαn þ    þ xnαn Þ
nð1  2αÞ time may be influenced by the number of bacte-
nX
αn ð5Þ ria on the media at an earlier time, by the
1
¼ xi : amount of media available for growth, by the
nð1  2αÞ i¼1þαn
media type, by the different antibiotics incorpo-
rated in the media, by the micro growing envi-
In reporting the truncated mean, one must give ronment, and so on. The growth of the bacteria
the percentage of the removed data points in rela- proceeds, not in a linear pattern, but in a multi-
tion to the total number of observations, that is, plicative way. The central tendency of these
the value of α, in order to inform the reader how types of data is best described according to their
the truncated mean is arrived at. Even with the product, but not their sum. The geometric mean,
removal of some extreme values, the truncated but not the arithmetic mean, would thus be
mean is still not immune to problematic data, par- closer to the center of the data values. The geo-
ticularly if the sample size n is small. metric mean (xG ) of a sample of n observations,
If the entire first quartile and the entire last x1, x2, . . . , xn, is defined as the nth root of the
quartile of the data points are removed after the product of the n values:
observations of the data set are ordered from the
smallest to the largest in value, the truncated mean !1=n
of the sample is called the interquartile mean. The pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Y
n

interquartile mean can be calculated as follows: xG ¼ n x1 x2 :::xn ¼ xi : ð7Þ


i¼1

2 X
3n=4
x¼ xi , ð6Þ Analogous to the summation representation
n i¼ðn=4Þþ1 Q
n
in Equation 3, the notation is a succinct
i¼1
Mean Comparisons 785

representation of the multiplication of all values of Further Readings


the sample set from the first to the nth observa-
Bickel, P. J. (1965). On some robust estimates of location.
tion. Therefore, all values from the first to the last Annals of Mathematical Statistics, 36, 847–858.
observation are included in the product. Foster, D. M. E., & Phillips, G. M. (1984). The
Comparing Equation 7 with Equation 3, one arithmetic-harmonic mean. Mathematics of
can see that the difference is that the geometric Computation, 42, 183–191.
mean is obtained by multiplying the observations Huber, P. J. (1972). Robust statistics: A review. Annals of
in the sample first and then taking the nth root Mathematical Statistics, 43, 1041–1067.
of their product. In contrast, the arithmetic Ott, L., & Mendenhall, W. (1994). Understanding
mean is calculated by adding up the observations statistics (6th ed.). Pacific Grove, CA: Duxbury.
Prescott, P., & Hogg, R. V. (1977). Trimmed and outer
first and then dividing their sum by the number
means and their variances. American Statistician, 31,
of observations in the sample. Because of the 156–157.
multiplying and the taking-the-nth-root opera- Tung, S. H. (1975). On lower and upper bounds of the
tions, the geometric mean can be applied only to difference between the arithmetic and the geometric
data of positive values, not to data of negative mean. Mathematics of Computation, 29, 834–836.
or zero values. Weerahandi, S., & Zidek, J. V. (1979). A characterization
When the sample size n is large, the product of of the general mean. Canadian Journal of Statistics,
the values of the observations could be very large, 83, 83–90.
and taking the nth root of the product could
be difficult, even with modern computers. One
way to resolve these difficulties is to transform the
value of all observations into a logarithm scale.
The multiplication process then becomes a summa- MEAN COMPARISONS
tion process, and the operation of taking the nth
root of the product is replaced by the division of n The term mean comparisons refers to the compari-
from the logarithm sum. The geometric mean is son of the average of one or more continuous vari-
then obtained by applying an antilogarithm opera- ables over one or more categorical variables. It is
tion to the result. a general term that can refer to a large number of
For example, suppose that n observations, x1, different research questions and study designs. For
x2, . . . , xn, are taken from a random variable. example, one can compare the mean from one
The mean of the logarithmic product of the n sample of data to a hypothetical population value,
values (xlog ) in the sample is compare the means on a single variable from mul-
tiple independent groups, or compare the means
for a single variable for one sample over multiple
1 1X n
measurement occasions. In addition, more com-
xlog ¼ logðx1 x2 :::xn Þ ¼ logðxi Þ, ð8Þ
n n i¼1 plex research designs can employ multiple continu-
ous dependent variables simultaneously, as well as
a combination of multiple groups and multiple
and the geometric mean is then measurement occasions. Overall, mean compari-
sons are of central interest in any experimental
design and many correlational designs when there
xG ¼ antilogðxlog Þ: ð9Þ are existing categorical variables (e.g., gender).
Two primary questions must be asked in any
mean comparison: Are the means statistically dif-
Here, the base of the logarithm scale can be
ferent, and how big are the differences? The for-
either e( ¼ 2.718281828, the base for natural log-
mer question can be answered with a statistical
arithm) or 10. Most often, 10 is used as the base.
test of the difference in means. The latter is ans-
Shihe Fan wered with a standardized measure of effect size.
Together, these more accurately characterize the
See also Median; Random Variable; Standard Deviation nature of mean differences.
786 Mean Comparisons

Statistical Differences assumption that there is population information


available, the t test becomes a much more flexi-
Testing for statistical differences attempts to ans- ble statistical technique. The t test can be used
wer the question of whether the observed differ- with experimental studies to compare two exper-
ences, however large or small, are due to some real imental conditions, with correlational studies to
effect or simply random sampling error. Depending compare existing dichotomous groups (e.g., gen-
on the nature of the data and the specific research der), and with longitudinal studies to compare
question at hand, different statistical tests must be the same sample over two measurement occa-
employed to properly answer the question of sions. An important limiting factor of the t test
whether there are statistical differences in the is that it can compare only two groups at a time;
means. investigations of mean differences in more com-
plex research designs require a more flexible ana-
z Test lytic technique.
The z test is employed when a researcher
wants to answer the question, Is the mean of this
sample statistically different from the mean of Analysis of Variance
the population? Here, the researcher would have
mean and standard deviation information for the Analysis of variance (ANOVA) is even more
population of interest on a particular continuous flexible than the t test in that it can compare
variable and data from a single sample on the multiple groups simultaneously. Because it is
same variable. The observed mean and standard derived from the same statistical model as the t
deviation from the sample would then be com- test (i.e., the general linear model), ANOVA
pared with the population mean and standard answers questions similar to those answered by
deviation. For example, suppose an organiza- the t test. In fact, when comparisons are between
tion, as part of its annual survey process, had two groups, the t test and ANOVA will yield the
collected job satisfaction information from all its exact same conclusions, with the test statistics
employees. The organization then wishes to con- related by the following formula: F ¼ t2. A signif-
duct a follow-up study relating to job satisfac- icant result from an ANOVA will answer
tion with some of its employees and wishes to whether at least one group (or condition) is sta-
make sure the sample drawn is representative of tistically different from at least one other group
the company. Here, the sample mean and stan- (or condition) on the continuous dependent vari-
dard deviation on the job satisfaction variable able. If a significant result is found when there
would be compared with the mean and standard are three or more groups or conditions, post hoc
deviation for the company as a whole and tested tests must be conducted to determine exactly
with a z test. This test, however, has limited where the significant differences lie.
applications in most research settings, for the ANOVA is also the appropriate statistical test
simple fact that information for the population to use to compare means when there are multiple
is rarely available. When population information categorical independent variables that need to
is unavailable, different statistical tests must be tested simultaneously. These tests are typically
be used. called n-way ANOVAs, where the n is the number
of independent variables. Here, means among all
the conditions are compared. This type of analysis
t Test
can answer questions of whether there are signifi-
Unlike the z test, the t test is a widely used cant effects for any of the independent variables
and applicable statistical test. In general, the t individually, as well as whether there are any com-
test is used to compare two groups on a single bined interactive effects with two or more of the
continuous dependent variable. Generally, this independent variables. Again, significant results do
statistical test answers the question, Is the mean not reveal the nature of the relationships; post hoc
of this sample statistically different from the tests must be conducted to determine how the vari-
mean of this other sample? By removing the ables relate.
Mean Comparisons 787

Multivariate ANOVA appropriate multiplicative terms can exactly model


the results from an n-way ANOVA. Two key
Multivariate ANOVA (MANOVA) is a multivar-
points regarding the multiple regression approach
iate extension of ANOVA. Like ANOVA, MAN-
are that the statistical results from the regression
OVA can compare the means of two or more
analysis will always equal the results from the t
groups simultaneously; also, an n-way MANOVA,
test or ANOVA (because all are derived from the
like its n-way ANOVA counterpart, can compare
general linear model) and that post hoc tests are
the means for multiple independent variables at
generally unnecessary because regression weights
the same time. The key difference between these
provide a way to interpret the nature of the mean
two types of analyses is that MANOVA can com-
differences.
pare means on multiple continuous dependent
variables at the same time, whereas ANOVA can
compare means only for a single continuous Effect Sizes
dependent variable. As such, MANOVA answers
Documenting the magnitude of mean differences is
the question of whether there are differences
of even greater importance than testing whether
between any two groups on any of the dependent
two means differ significantly. Psychological the-
variables examined. Effectively, this analysis exam-
ory is advanced further by examining how big
ines only the question of whether something,
mean differences are than by simply noting that
on any of the dependent variables, is different
two groups (or more) are different. Additionally,
between any of the groups examined. As with
practitioners need to know the magnitude of
most of the other analyses, a significant result here
group differences on variables of interest in order
will require extensive post hoc testing to determine
to make informed decisions about the use of those
the precise nature of the differences.
variables. Also, psychological journals are increas-
ingly requiring the reporting of effect sizes as a con-
dition of acceptance for publication. As such,
Regression
appropriate effect sizes to quantify mean differ-
Mean differences can also be assessed via multi- ences need to be examined. Generally, two types
ple regression, although this approach is less com- of effect size measures are commonly used to com-
mon. Regression techniques can assess mean pare means: mean differences and correlational
differences on a single continuous dependent vari- measures.
able between any number of independent variables
with any number of categories per independent
Mean Differences
variable. When there is a single independent vari-
able with two categories, the statistical conclusions Mean difference effect size measures are des-
from entering this variable in a regression will igned specifically to compare two means at a time.
exactly equal those from a t test; interpreting the As such, they are very amenable to providing an
sign of the regression weight will determine the effect size measure when one is comparing two
nature of the mean differences. When there is a groups via a z test or a t test. Mean difference
single independent variable with three or more effect sizes can also be used when one is compar-
categories, this variable must first be transformed ing means with an ANOVA, but each condition
into K  1 ‘‘dummy’’ variables, where K is the needs to be compared with a control group or
number of categories. Each of these dummy focal group. Because the coding for a categorical
variables is then entered into the regression simul- variable is arbitrary, the sign of the mean differ-
taneously, and the overall statistical conclusions ence effect size measure is also arbitrary; in order
from this model will exactly equal those of for effect sizes to be interpretable, the way in
ANOVA; however, interpreting magnitude and which groups are specified must be very clear.
significance of the regression weights from the Three common effect size measures are the simple
regression model will describe the nature of the mean difference, the standardized mean difference,
mean differences, rendering post hoc tests unneces- and the standardized mean difference designed for
sary. Similarly, creating dummy variables with experimental studies.
788 Mean Comparisons

Simple Mean Difference are several slight variants to Equation 1 to address


some statistical issues in various experimental and
The simplest way to quantify the magnitude
correlational settings, but each is very closely
of differences between groups on a single con-
related to Equation 1.
tinuous dependent variable is to compute a sim-
Another advantage of the standardized mean
ple mean difference, M1  M2 , where M1 and
difference is that, because it is on a standardized
M2 are the means for Groups 1 and 2, respec-
metric, some rough guidelines for interpretation of
tively. Despite its simplicity, this measure has
the numbers can be provided. As a general rule,
several shortcomings. The most important limi-
standardized mean differences (i.e., d values) of
tation is that the simple mean difference is scale
d ¼ 0.20 are considered small, d ¼ 0.50 are
and metric dependent, meaning that simple
medium, and d ¼ 0.80 are considered to be large.
mean differences cannot be directly compared
Of course, the meaning of the mean differences
on different scales, or even on the same scale if
must be interpreted from within the research con-
the scale has been scored differently. As such,
text, but these guidelines provide a rough metric
simple mean differences are best used when
with which to evaluate the magnitude of mean dif-
there is a well-established scale whose metric
ferences obtained via Equation 1.
does not change. For example, simple mean dif-
ferences can be used to compare groups on the
SAT (a standardized test for college admissions) Standardized Mean Difference for Experiments
because it is a well-established test whose scor- In experimental studies, it is often the case that
ing does not change from year to year; however, the experimental manipulation will shift the mean
simple mean differences on the SAT cannot be of the experimental group high enough to run into
compared with simple differences on the ACT, the top of the scale (i.e., create a ceiling effect);
because these two tests are not scored on the this event decreases the variability of the experi-
same scale. mental group. As such, using a pooled variance
will actually underestimate the expected variance
Standardized Mean Difference
and overestimate the expected effect size. In this
The standardized mean difference, also known case, researchers often use an alternative measure
as Cohen’s d, addresses the problem of scale or of the standardized mean difference,
metric dependence by first standardizing the means
to a common metric (i.e., a standard deviation MExp  MCon
d¼ , ð2Þ
metric, where the SD equals 1.0). The equation to SDCon
compute the standardized mean difference is
where the Con and Exp subscripts denote the
M1  M2 control and experimental subgroups, respec-
d ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , ð1Þ tively. This measure of effect size can be inter-
2 2
N1 ðSD1 Þ þN2 ðSD2 Þ preted with the same metric and same general
N1 þN2
guidelines as the standardized mean difference
from Equation 1.
where N is the sample size, SD is the standard
deviation, and other terms are defined as before.
Examination of this equation reveals that the
Correlational Effect Sizes
numerator is exactly equal to the simple mean dif-
ference; it is the denominator that standardizes the Correlational-type effect sizes are themselves
mean difference by dividing by the pooled within- reported on two separate metrics. The first is on
group standard deviation. Generally, this measure the same metric as the correlation, and the second
of effect size is preferred because it can be com- is on the variance-accounted-for metric (i.e., the
pared directly across studies and aggregated in correlation-squared metric, or R2). They can be
a meta-analysis. Also, unlike some other measures converted from one to the other by a simple square
of mean differences, it is generally insensitive to or square-root transformation. These types of
differences in sample sizes between groups. There effect sizes answer questions about the magnitude
Mean Comparisons 789

of the relationship between group membership variable but does not describe it; post hoc exami-
and the dependent variable (in the case of correla- nations must be undertaken to understand the
tional effect sizes) or the percentage of variance nature of the relationship. The eta-squared coeffi-
accounted for in the dependent variable by group cient tells the proportion of variance that can be
membership. Most of these effect sizes are insensi- accounted for by group membership.
tive to the coding of the group membership vari-
able. The three main types of effect sizes are the
point-biserial correlation, the eta (or eta-squared) Multiple R
coefficient, and multiple R (or R2). The multiple R or R2 is the effect size derived
from multiple regression techniques. Like a correla-
Point-Biserial Correlation tion or the eta coefficient, the multiple R tells the
Of the effect sizes mentioned here, the point- magnitude of the relationship between the set of
biserial correlation is the only correlational effect categorical independent variables and the continu-
size whose sign is dependent on the coding of the ous dependent variable. Also similarly, the R2 is
categorical variable. It is also the only measure the proportion of variance in the dependent vari-
presented here that requires that there be only two able accounted for by the set of categorical inde-
groups in the categorical variable. The equation to pendent variables. The magnitude for the multiple
compute the point-biserial correlation is R (and R2) will be equal to the eta (and eta2) for
the full ANOVA model; however, substantial post
rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
M1  M2 n1 n2 hoc tests are unnecessary in a multiple regression
rpb ¼ , ð3Þ framework, because careful interpretation of the
SDTot NðN  1Þ
regression weights can describe the nature of the
mean differences.
where SDTot is the total standard deviation across
all groups; n1 and n2 are the sample sizes for
Groups 1 and 2, respectively; and N is total sam-
ple size across the two groups. Though it is a stan- Additional Issues
dard Pearson correlation, it does not range from As research in the social sciences increases at an
zero to ± 1.00; the maximum absolute value is exponential rate, cumulating research findings
about 0.78. The point-biserial correlation is also across studies becomes increasingly important. In
sensitive to the proportion of people in each this context, knowing whether means are statisti-
group; if the proportion of people in each group cally different becomes less important, and docu-
differs substantially from 50%, the maximum menting the magnitude of the difference between
value drops even further away from 1.00. means becomes more important. As such, the
reporting of effect sizes is imperative to allow
Eta
proper accumulation across studies. Unfortunately,
Although the eta coefficient can be interpreted current data accumulation (i.e., meta-analytic)
as a correlation, it is not a form of a Pearson corre- methods require that a single continuous depen-
lation. While the correlation is a measure of the dent variable be compared on a single dichoto-
linear relationship between variables, the eta actu- mous independent variable. Fortunately, although
ally measures any relationship between the cate- multiple estimates of these effect sizes exist, they
gorical independent variable and the continuous can readily be converted to one another. In addi-
dependent variable. Eta-squared is the square of tion, many of the statistical tests can be converted
the eta coefficient and is the ratio of the between- to an appropriate effect size measure.
group variance to the total variance. The eta (and Converting between a point-biserial correla-
eta-squared) can be computed with any number of tion and a standardized mean difference is rela-
independent variables and any number of cate- tively easy if one of them is already available.
gories in each of those categorical variables. The For example, the formula for the conversion of
eta tells the magnitude of the relationship between a point-biserial correlation to a standardized
the categorical variable(s) and the dependent mean difference is
790 Median

rpb Regression; Multivariate Analysis of Variance


d ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , ð4Þ
p1 p2 ð1  r2pb Þ (MANOVA); t Test, Independent Samples;

where d is the standardized mean difference, rpb Further Readings


is the point-biserial correlation, and p1 and p2 are Bonnett, D. G. (2008). Confidence intervals for
the proportions in Groups 1 and 2, respectively. standardized linear contrasts of means. Psychological
The reverse of this formula, for the conversion of Methods, 13, 99–109.
a standardized mean difference to a point-biserial Bonnett, D. G. (2009). Estimating standardized linear
correlation, is contrasts of means with desired precision.
Psychological Methods, 14, 1–5.
Fern, E. F., & Monroe, K. B. (1996). Effect size
d
rpb ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , ð5Þ estimates: Issues and problems in interpretation.
d þ p 1p
2 Journal of Consumer Research, 23, 89–105.
1 2
Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., &
Deering, K. N. (2008). A generally robust approach
where terms are defined as before. If means, stan- for testing hypotheses and setting confidence intervals
dard deviations, and sample sizes are available for for effect sizes. Psychological Methods, 13, 110–129.
each of the groups, then these effect sizes can be McGrath, R. E., & Meyer, G. J. (2006). When effect sizes
computed with Equations 1 through 3. Eta coeffi- disagree: The case of r and d. Psychological Methods,
cients and multiple Rs cannot be converted readily 11, 386–401.
to a point-biserial correlation or a standardized McGraw, K. O., & Wong, S. P. (1992). A common
language effect size statistic. Psychological Bulletin,
mean difference unless there is a single dichoto-
111, 361–365.
mous variable; then these coefficients equal the Ruscio, J. (2008). A probability-based measure of effect
point-biserial correlation. size: Robustness to base rates and other factors.
Most statistical tests are not amenable to ready Psychological Methods, 13, 19–30.
conversion to one of these effect sizes; n-way
ANOVAs, MANOVAs, regressions with more
than one categorical variable, and regressions and
ANOVAs with a single categorical variable with MEDIAN
three or more categories do not convert to either
the point-biserial correlation or the standardized The median is one of the location parameters in
mean difference. However, the F statistic from an probability theory and statistics. (The others are
ANOVA with two categorical variables is exactly the mean and the mode.) For a real valued random
equal to the value of the t statistic via the relation- variable X with a cumulative distribution function
ship F ¼ t2 . To convert from a t statistic to F, the median of X is the unique number that satis-
a point-biserial correlation, the following equation fies FðmÞ ≤ 12 ≥ FðmÞ. In other words, the
must be used: median is the number that separates the upper half
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi from the lower half of a population or a sample. If
t2 a random variable is continuous and has a proba-
rpb ¼ , ð6Þ bility density function, half of the area under the
t2 þ df
probability density function curve would be to the
left of m and the other half to the right of m. For
where t is the value of the t test and df is the
this reason, the median is also called the 50th per-
degrees of freedom for the t test. The point-biserial
centile (the ith percentile is the value such that i%
correlation can then be converted to a standardized
of the observations are below it). In a box plot
mean difference if necessary.
(also called box-and-whisker plot), the median is
Matthew J. Borneman the central line between the lower and the higher
hinge of the box. The location of this central line
See also Analysis of Variance (ANOVA); Cohen’s suggests the central tendency of the underlying
d Statistic; Effect Size, Measures of; Multiple data.
Median 791

The population median, like the population data in interval and ratio scale because it requires
mean, is generally unknown. It must be inferred first the summation of all values in a sample.
from the sample median, just like the use of the The sample median as defined in Equation 1 is
sample mean for inferring the population mean. difficult to use when a population consists of all
In circumstances in which a sample can be fitted integers and a sample is taken with an even num-
with a known probability model, the population ber of observations. Because the median should
median may be obtained directly from the model also be an integer, two medians could result. One
parameters. For instance, a random variable that may be called the lower median and the other, the
follows an exponential distribution with a scale upper median. To avoid calling two medians of
parameter β (a scale parameter is the one that a single sample, an alternative is simply to call the
stretches or shrinks a distribution), the median is upper median the sample median, ignoring the
βln2 (where ln means natural logarithm, which lower one.
has a base e ¼ 2.718281828). If it follows a normal If the sample data are grouped into classes, the
distribution with a location parameter μ and value of the sample median cannot be obtained
a scale parameter σ, the median is μ. For a random according to Equation 1 as the individual values of
variable following a Weibull distribution with a the sample are no longer available. Under such
location parameter μ, a scale parameter α, and a circumstance, the median is calculated for the
a shape parameter γ (a shape parameter is the one particular class that contains the median. Two dif-
that changes the shape of a distribution), the ferent approaches may be taken to achieve the
median is μ þ α (ln2)1/γ . However, not all distribu- same result. One approach starts with the fre-
tions have a median in closed form. Their popula- quency and cumulative frequency (see Ott and
tion median cannot be obtained directly from Mendenhall, 1994):
a probability model but has to be estimated from
w n 
the sample median. m¼Lþ  cfb , ð2aÞ
fm 2

Definition and Calculation where m ¼ the median, L ¼ lower limit of the class
that contains the median, n ¼ total number of
The sample median can be defined similarly, irre-
observations in the sample, cfb ¼ cumulative fre-
spective of the underlying probability distribution
quency for all classes before the class that contains
of a random variable. For a sample of n observa-
the median, fm ¼ frequency for the class that con-
tions, x1 ; x2 ; . . . xn ; taken from a random variable
tains the median, and w ¼ interval width of the
X, rank these observations in an ascending order
classes.
from the smallest to the largest in value; the sam-
The other approach starts with the percentage
ple median, m, is defined as
and cumulative percentage:

xk if n ¼ 2k þ 1 w
m¼ : ð1Þ m¼Lþ ð50  cPb Þ, ð2bÞ
ðxk þ xkþ1 Þ=2 if n ¼ 2k Pm

That is, the sample median is the value of the where 50 ¼ the 50th percentile, cPb ¼ cumulative
middle observation of the ordered statistics if the percentage for all classes before the class that con-
number of observations is odd or the average of tains the median, and Pm ¼ percentage of the class
the value of the two central observations if the that contains the median. Both L and w are
number of observations is even. This is the most defined as in Equation 2a. A more detailed descrip-
widely used definition of the sample median. tion of this approach can be found in Arguing
According to Equation 1, the sample median is With Numbers by Paul Gingrich.
obtained from order statistics. No arithmetical sum- This second approach is essentially a special
mation is involved, in contrast to the operation of case of the approach used to interpolate the dis-
obtaining the sample mean. The sample median can tance to a given percentile in grouped data. To do
therefore be used on data in ordinal, interval, and so, one needs only to replace the 50 in Equation
ratio scale, whereas the sample mean is best used on 2b with a percentile of interest. The percentile
792 Median

within an interval of interest can then be interpo- similar to the one above. One can then apply
lated from the lower percentile bound of the inter- either Equation 2a or Equation 2b to find the class
val width. that contains the median of the responses.
To show the usage of Equations 2a and 2b, con-
sider the scores of 50 participants in a hypothetical
contest, which are assigned into five classes with Determining Which Central
the class interval width ¼ 20 (see Table 1). A
Tendency Measure to Use
glance at the cumulative percentage in the right-
most column of the table suggests that the median As pointed out at the beginning, the mean, the
falls in the 61-to-80 class because it contains the median, and the mode are all location para-
50th percentile of the sample population. Accord- meters that measure the central tendency of
ing to Equation 2a, therefore, L ¼ 61, n ¼ 50, a sample. Which one of them should be used for
cfb ¼ 13, fm ¼ 20, and w ¼ 20. The interpolated reporting a scientific study? The answer to this
value for the median then is question depends on the characteristics of the
data, or more specifically, on the skewness (i.e.,
20 × ð50=2  13Þ the asymmetry of distribution) of the data. A dis-
m ¼ 61 þ ¼ 73:
20 tribution is said to be skewed if one of its two
tails is longer than the other. In statistics litera-
Using Equation 2b, we have cPb ¼ 26, Pm ¼ 40, ture, the mean (μ), median (m), and mode (M)
both L ¼ 61, and w ¼ 20, as before. The interpo- inequality are well known for both continuous
lated value for the median of the 50 scores, then, is and discrete unimodal distributions. The three
statistics occur either in the order of M ≤
20 × ð50  26Þ m ≤ μ or in a reverse order of M ≥ m ≥ μ,
m ¼ 61 þ ¼ 73:
40 depending on whether the random variable is
positively or negatively skewed.
Equations 2a and 2b are equally applicable to For random variables that follow a symmetric
ordinal data. For instance, in a survey on the qual- distribution such as the normal distribution, the
ity of customer services, the answers to the cus- sample mean, median, and mode are equal and
tomer satisfaction question may be scored as can all be used to describe the sample central ten-
dissatisfactory, fairly satisfactory, satisfactory, and dency. Despite this, the median, as well as the
strongly satisfactory. Assign a value of 1, 2, 3, or 4 mode, of a normal distribution is not used as fre-
(or any other ordered integers) to represent each of quently as the mean. This is because the variability
these classes from dissatisfactory to strongly satis- (V) associated with the sample mean is much smal-
factory and summarize the number of responses ler than the V associated with the sample median
corresponding to each of these classes in a table (V[m] ¼ [1.2533]2V[μ]). If a random variable fol-
lows a skewed (i.e., nonsymmetrical) distribution,
the sample mean, median, and mode are not equal.
Table 1 Grouped Data for 50 Hypothetical Test
The median differs substantially from both the
Scores
arithmetic mean and the mode and is a better mea-
Frequency sure of the central tendency of a random sample
Cumulative because the median is the minimizer of the mean
Number of Number of absolute deviation in a sample. Take a sample set
Class Observations Observations % Cumulative % {2, 2, 3, 3, 3, 4, 15} as an example. The median is
3 (so is the mode), which is a far better measure of
0–20 2 2 4 4
the centrality of the data set than the arithmetic
21–40 5 7 10 14
mean of 4.57. The latter is largely influenced by
41–60 6 13 12 26
the last extreme value, 15, and does not ade-
61–80 20 33 40 66
quately describe the central tendency of the data
81–100 17 50 34 100
set. From this simple illustration, it can be con-
50 100
cluded that the sample median should be favored
Meta-Analysis 793

The median is sometimes called the average. This term may be confused with the mean by some people who are not familiar with a specific subject in which this interchangeable usage is frequent. In scientific reporting, this interchangeable use is better avoided.

Advantages

Compared with the sample mean, the sample median has two clear advantages in measuring the central tendency of a sample. The first advantage is that the median can be used for all data measured on ordinal, interval, and ratio scales because it does not involve the mathematical operation of summation, whereas the mean is best used for data measured on interval and ratio scales. The second advantage is that the median gives a measure of central tendency that is more robust than the mean if outlier values are present in the data set because it is not affected by whether the distribution of a random variable is skewed. In fact, the median, not the mean, is the preferred parameter for describing the central tendency of such random variables when their distribution is skewed. Therefore, whether to use the sample median as a central tendency measure depends on the data type. The median is used if a random variable is measured on an ordinal scale or if a random variable produces extreme values in a set. In contrast, the mean is a better measure of the sample central tendency if a random variable is continuous and is measured on an interval or ratio scale, and if data arising from the random variable contain no extreme value.

When one is using the sample median, it helps to remember its four important characteristics, as pointed out by Lyman Ott and William Mendenhall:

1. The median is the central value of a data set, with half of the set above it and half below it.

2. The median is between the largest and the smallest value of the set.

3. The median is free of the influence of extreme values of the set.

4. Only one median exists for the set (except in the difficult case in which an even number of observations is taken from a population consisting of only integers).

Shihe Fan

See also Central Tendency, Measures of; Mean; Mode

Further Readings

Abdous, B., & Theodorescu, R. (1998). Mean, median, mode IV. Statistica Neerlandica, 52, 356–359.
Gingrich, P. (1995). Arguing with numbers: Statistics for the social sciences. Halifax, Nova Scotia, Canada: Fernwood.
Groneveld, A., & Meeden, G. (1977). The mode, median, and mean inequality. American Statistician, 31, 120–121.
Joag-Dev, K. (1989). MAD property of a median: A simple proof. American Statistician, 43, 26–27.
MacGillivray, H. L. (1981). The mean, median, mode inequality and skewness for a class of densities. Australian Journal of Statistics, 23, 247–250.
Ott, L., & Mendenhall, W. (1994). Understanding statistics (6th ed.). Pacific Grove, CA: Duxbury.
Wackerly, D. D., Mendenhall, W., III, & Scheaffer, R. L. (2002). Mathematical statistics with applications (6th ed.). Pacific Grove, CA: Duxbury.

META-ANALYSIS
Meta-analysis is a statistical method that integrates the results of several independent studies considered to be "combinable." It has become one of the major tools for integrating research findings in the social and medical sciences in general and in education and psychology in particular. Although the history of meta-analytic procedures goes all the way back to the early 1900s and the work of Karl Pearson and others, who devised statistical tools to compare studies from different samples, Gene V. Glass coined the term in 1976. Glass, Barry McGaw, and Mary Lee Smith described the essential characteristics of meta-analysis as follows:

1. It is undeniably quantitative, that is, it uses numbers and statistical methods for organizing and extracting information.

2. It does not prejudge research findings in terms of research quality (i.e., no a priori arbitrary and nonempirical criteria of research quality are imposed to exclude a large number of studies).

3. It seeks general conclusions from many separate investigations that address related or identical hypotheses.

Meta-analysis involves developing concise criteria for inclusion (i.e., sampling), searching the literature for relevant studies (i.e., recruitment), coding study variables (i.e., data entry), calculating standardized effect sizes for individual studies, and generating an overall effect size across studies (i.e., data analysis). Unlike primary studies, in which each case in a sample is a unit of analysis, the unit of analysis for meta-analysis is the individual study. The effect sizes calculated from the data in an individual study are analogous to the dependent variable, and the substantive and methodological characteristics affecting the study results are defined as independent variables. Any standardized index that can be used to understand different statistical findings across studies in a common metric can be used as an "effect size." The effect size metric represents both the magnitude and direction of the relation of interest across different primary studies in a standardized metric. A variety of alternatives are available for use with variables that are either continuous or discrete, such as the accumulation of correlations (effect size r), standardized differences between mean scores (effect size d), p values, or z scores (effect size ES). The dependent variable in meta-analysis is computed by transforming the findings of each reviewed study into a common metric that relies on either r or d as the combined statistic.

Meta-analysis is not limited to descriptive reviews of research results but can also examine how and why such findings occur. With the use of multivariate statistical applications, meta-analysis can address multiple hypotheses. It may examine the relation between several variables and account for consistencies as well as inconsistencies within a sample of study findings. Because of the demand for robust research findings and with the advance of statistical procedures, meta-analysis has become one of the major tools for integrating research findings in social and medical science as well as the field of education, where it originated. A recent search of the ERIC database identified more than 618 articles published between 1980 and 2000 that use meta-analysis in their title, as opposed to only 36 written before 1980. In the field of psychology, the gap was 12 versus 1,623, and in the field of medical studies, the difference is even more striking: 7 versus 3,571. Evidence in other fields shows the same trend toward meta-analysis becoming one of the main tools for evidence-based research.

According to the publication manual of the American Psychological Association, a review article organizes, integrates, and critically evaluates already published material. Meta-analysis is only one way of reviewing or summarizing research literature. Narrative review is the more traditional way of reviewing research literature. There are several differences between traditional narrative reviews and meta-analysis. First, because there are very few systematic procedures, the narrative review is more susceptible to subjective bias and therefore more prone to error than are meta-analytic reviews. In the absence of formal guidelines, reviewers of a certain literature can disagree about many critical issues, such as which studies to include and how to support conclusions with a certain degree of quantitative evidence. In an adequately presented meta-analytical study, one should be able to replicate the review by following the procedure reported in the study.

Narrative review and meta-analysis are also different in terms of the scope of the studies that they can review. The narrative review can be inefficient for reviewing 50 or more studies. This is especially true when the reviewer wants to go beyond describing the findings and explain multiple relations among different variables. Unlike narrative reviews, meta-analysis can put together all available data to answer questions about overall study findings and how they can be accounted for by various factors, such as sample and study characteristics. Meta-analysis can, therefore, lead to the identification of various theoretical and empirical factors that may permit a more accurate understanding of the issues being reviewed. Thus, although meta-analysis can provide a better assessment of literature because it is more objective, replicable, and systematic, it is important to note that a narrative description of each study is key to any good meta-analytic review, as it will help a meta-analyst determine which studies to include and what qualitative information about the studies can and should be coded and statistically related to quantitative outcomes in order to evaluate the complexity of the topics being reviewed.
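To make the unit-of-analysis point noted above concrete, the sketch below stores one record per reviewed study, with the standardized effect size playing the role of the dependent variable and study characteristics serving as potential moderators. The field names and values are hypothetical and are not drawn from any actual review.

```python
from dataclasses import dataclass

@dataclass
class StudyRecord:
    study_id: str          # report identification
    year: int              # publication year (a possible moderator)
    sample_size: int
    design: str            # e.g., "randomized" or "quasi-experimental"
    effect_size_d: float   # standardized mean difference reported by the study

# In a meta-analysis the data set holds one row per study, not per participant.
studies = [
    StudyRecord("StudyA", 1992, 120, "randomized", 0.45),
    StudyRecord("StudyB", 1997, 60, "quasi-experimental", 0.30),
    StudyRecord("StudyC", 2003, 200, "randomized", 0.52),
]

# A simple (unweighted) mean effect size across studies
overall_d = sum(s.effect_size_d for s in studies) / len(studies)
print(round(overall_d, 3))
```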
The remainder of this entry addresses the methodological issues associated with meta-analytic research and then describes the steps involved in conducting meta-analysis.

Methodological Issues

Like any other research strategy, meta-analysis is not a perfect solution in research review. Glass summarized the main issues with meta-analysis in four domains: quality, commensurability, selection bias, and nonindependence. The quality problem has been a very controversial issue in meta-analysis. At issue is whether the quality of studies should be included as a selection criterion. To avoid any bias in selection, one option is to include as many studies as possible, regardless of their quality. Others, however, question the practice of including studies of poor quality, as doing so limits the validity of the overall conclusions of the review. The commensurability problem refers to the most common criticism of meta-analysis: that it compares apples and oranges. In other words, meta-analysis is illogical because it mixes constructs from studies that are not the same.

The selection-bias problem refers to the inevitable scrutiny of the claim that the meta-analytic review is comprehensive and nonbiased in its reviewing process. Meta-analysis is not inherently immune from selection bias, as its findings will be biased if there are systematic differences across journal articles, book articles, and unpublished articles. Publication bias is a major threat to the validity of meta-analysis. The file drawer effect, which refers to the fact that published studies are less likely to show statistically nonsignificant results, may introduce a bias in the conclusions of meta-analyses. Not only can the decision of whether to include unpublished studies lead to bias, but decisions about how to obtain data and which studies to include can also contribute to selection bias. The unpublished studies that can be located may thus be an unrepresentative sample of unpublished studies. A review of meta-analytical studies published between 1988 and 1991 indicated that most researchers had searched for unpublished material, yet only 31% included unpublished studies in their review. Although most of these researchers supported the idea of including unpublished data in meta-analysis, only 47% of journal editors supported this practice.

The nonindependence problem in meta-analysis refers to the assumption that each study in the review is taken randomly from a common population, that is, that the individual studies are independent of one another. "Lumping" sets of independent studies can reduce the reliability of estimations of averages or regression equations. Although Glass and his associates argued that the nonindependence assumption is a matter of practicality, they admit that this problem is the one criticism that is not "off the mark and shallow" (Glass et al., p. 229).

Steps

Although meta-analytic reviews can take different forms depending on the field of study and the focus of the review, there are five general steps in conducting meta-analysis. The first step involves defining and clarifying the research question, which includes selecting inclusion criteria. Similar to selecting a sample for an empirical study, inclusion criteria for a meta-analysis have to be specified following a theoretical or empirical guideline. The inclusion criteria greatly affect the conclusions drawn from a meta-analytic review. Moreover, the inclusion criteria are one of the steps in a meta-analytic study where bias or subjectivity comes into play. Two critical issues should be addressed at this stage of meta-analysis: (a) Should unpublished studies be included? and (b) Should the quality of the studies be included as part of the inclusion criteria? There are no clear answers to these questions.
Glass and colleagues, for example, argued against strict inclusion criteria based on assessing study quality a priori because a meta-analysis itself can empirically determine whether study quality is related to variance in reported study findings. While Glass and others argued for inclusion of all studies, including unpublished reports, in order to avoid publication bias toward null findings in the literature, it is possible to empirically assess research quality with a set of methodological variables as part of the meta-analytic data analysis. In other words, instead of eliminating a study based on the reviewer's judgment of its quality, one can empirically test the impact of study quality as a control or moderator variable.

The next step in meta-analysis is to identify studies to be included in the review. This step involves a careful literature search that includes both computerized and manual approaches. Computerized search approaches include using discipline-specific databases such as PsycINFO in psychology, ERIC in education, MEDLINE in medical sciences, or Sociological Abstracts in sociology. Increasingly, searching the Internet with search engines such as Google (or Google Scholar) also helps identify relevant studies for meta-analytic review. All databases must be searched with the same set of keywords and search criteria in order to ensure reliability across the databases. It is also important to keep in mind that several vendors market the most popular databases, such as PsycINFO, and each vendor has a different set of defaults that determine the outcome of any search. It is, therefore, advisable for investigators to generate a single yet detailed logical search code and test it by using various vendors to see if their databases yield the same result.

Although computerized search engines save time and make it possible to identify relevant materials in large databases, they should be complemented with additional search strategies, including manual search. In fields in which there is no universally agreed-on keyword, for example, one can search key publications or citations of classic articles using the Social Science Citation Index, which keeps track of unique citations of each published article. If narrative reviews have been published recently, one can also check the cited articles in those reviews. Finally, once the final review pool is determined, one must also manually check the references in each of the articles to see if there are relevant studies that have not yet been included in the final pool. Each of these postelectronic search steps can also serve as a reliability check to see whether the original search code works well. In other words, if there are too many articles that were not part of the electronically searched pool, then it is possible that the search code was not a valid tool to identify relevant studies for the review. In those circumstances a modified search would be in order.

The third step in meta-analysis is the development of a coding schema. The goal of study coding is to develop a systematic procedure for recording the appropriate data elements from each study. William A. Stock identified six categories of study elements for systematic coding that address both substantive and methodological characteristics: report identification (study identifiers such as year of publication and authors), setting (the location or context of the study), subjects (participant characteristics), methodology (research design characteristics), treatment (procedures), and effect size (the statistical data needed to calculate a common effect size). One can modify these basic categories according to the specific focus of the review and with attention to the overall meta-analytic question and potential moderator factors. To further refine the coding scheme, a small subsample of the data (k = 10) must be piloted with two raters who did not take part in the creation of the coding schema.

The next step is to calculate effect sizes for each study by transforming individual study statistics into a common effect size metric. The goal of effect size transformation is to reflect with a common metric the relative magnitude of the relations reported in various independent studies. The three most commonly used effect size metrics in meta-analytic reviews are Cohen's d, the correlation coefficient r, and the odds ratio. Cohen's d, or effect size d, is a metric that is used when the research involves mean differences or group contrasts. This is a method used in treatment studies or any design that calls for calculating standardized mean differences across groups on a variable that is continuous in nature. The correlation coefficient r can also serve as an effect size metric (or effect size r) when the focus of the review is identification of the direction and magnitude of the association between variables. The odds-ratio effect size is commonly used in epidemiological reviews or in reviews that involve discontinuous variables (e.g., school dropout or diagnosis of a certain condition).
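The sketch below is illustrative only; the summary statistics are hypothetical, and the d-to-r conversion is a standard textbook approximation rather than anything prescribed by this entry. It computes the three metrics just described.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def d_to_r(d):
    """Approximate conversion from d to r, assuming roughly equal group sizes."""
    return d / math.sqrt(d ** 2 + 4)

def odds_ratio(a, b, c, d):
    """Odds ratio from a 2 x 2 table: a, b = events and nonevents in group 1;
    c, d = events and nonevents in group 2."""
    return (a * d) / (b * c)

# Hypothetical summary statistics for a single primary study
d = cohens_d(mean1=105.0, sd1=15.0, n1=40, mean2=100.0, sd2=15.0, n2=40)
print(round(d, 2))                           # about 0.33
print(round(d_to_r(d), 2))                   # about 0.16
print(round(odds_ratio(30, 10, 20, 20), 2))  # 3.0
```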
The calculation of an effect size index also requires a decision about the unit of analysis in a meta-analysis. There are two alternatives. The first alternative is to enter the effect size for each variable separately. For example, if a study reports one correlation on the basis of grade point average and another correlation on the basis of an achievement test score, there will be two different effect sizes for the study, one for grade point average and the other for achievement test score. Similarly, if correlations were reported for girls and boys separately, there will be two effect sizes, one for girls and one for boys. The second alternative is to use each study as the unit of analysis. This can be done by averaging effect sizes across the groups. For example, one could take the mean of the correlations for girls and boys and report a single effect size. Both of these approaches have their shortcomings. The former approach gives too much weight to those studies that have more outcome measures, but the latter approach obscures legitimate theoretical and empirical differences across dependent measures (i.e., gender differences may serve as a moderator in certain meta-analytic reviews).

Mark W. Lipsey and David B. Wilson suggest a third alternative that involves calculating an effect size for each independent sample when the focus of the analysis is the sample characteristics (e.g., age, gender, race) but allowing for multiple effect sizes from a given study when the focus of the analysis is the study characteristics (e.g., multiple indicators of the same construct). In other words, the first alternative can be used to calculate an effect size for each distinct construct in a particular study; this alternative yields specific information for each particular construct being reviewed. The second alternative can be used to answer meta-analytic questions regarding sample characteristics, as well as to calculate the overall magnitude of the correlation.

The final step in meta-analysis involves testing the homogeneity of effect sizes across studies. The variation among study effect sizes can be analyzed using Hedges's Q test of homogeneity. If the studies in a meta-analysis provide a homogeneous estimate of a combined effect size across studies, then it is more likely that the various studies are testing the same hypothesis. However, if these estimates are heterogeneous, that is, if between-study differences are due to unobserved random sources, an effort must be made to identify sample and study characteristics that explain the differences across the studies through the coding process. When combining the outcomes from different studies, one may also choose to use a fixed- or random-effects model. The fixed-effect model assumes that studies in the meta-analysis use identical methods, whereas the random-effects model assumes that studies are a random sample from the universe of all possible studies. The former model considers within-study variability as the only source of variation, while the latter model considers both within-study and between-study variation as sources of differences. Fixed- and random-effects models can yield very different results because fixed-effect models are likely to underestimate, and random-effects models to overestimate, error variance when their assumptions are violated.
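A rough sketch of these final steps is given below. The Q statistic is computed in its usual inverse-variance form, and the between-study variance is estimated with the DerSimonian-Laird formula; both the effect sizes and the choice of estimator are illustrative assumptions on the editor's part, not specifications taken from this entry.

```python
# Hypothetical study effect sizes (d) and their sampling variances
effects   = [0.10, 0.55, 0.40, 0.80]
variances = [0.02, 0.04, 0.03, 0.05]

# Fixed-effect (inverse-variance) pooled estimate
weights = [1.0 / v for v in variances]
pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)

# Homogeneity statistic Q: values large relative to k - 1 degrees of freedom
# suggest heterogeneity, pointing toward moderator analysis or a
# random-effects model.
Q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
df = len(effects) - 1

# DerSimonian-Laird estimate of the between-study variance (tau squared),
# used to form random-effects weights.
c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
tau2 = max(0.0, (Q - df) / c)
re_weights = [1.0 / (v + tau2) for v in variances]
re_pooled = sum(w * d for w, d in zip(re_weights, effects)) / sum(re_weights)

print(round(pooled, 3), round(Q, 2), round(re_pooled, 3))
```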
Thus, meta-analysis, like any other survey research undertaking, is an observational study of evidence. It has its own limitations and therefore should be undertaken rigorously by using well-defined criteria for selecting and coding individual studies, estimating effect size, aggregating significance levels, and integrating effects.

Selcuk R. Sirin

See also Cohen's d Statistic; Effect Size, Measures of; Fixed-Effects Models; Homogeneity of Variance; Inclusion Criteria; "Meta-Analysis of Psychotherapy Outcome Studies"; Mixed- and Random-Effects Models; Odds Ratio; Random-Effects Models

Further Readings

American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Cook, D. J., Guyatt, G. H., Ryan, G., Clifton, J., Buckingham, L., Willan, A., et al. (1993). Should unpublished data be included in meta-analyses? Current convictions and controversies. JAMA, 269, 2749–2753.
Cook, T. D., & Leviton, L. C. (1980). Reviewing the literature: A comparison of traditional methods with meta-analysis. Journal of Personality, 48, 449–472.
Cooper, H. M. (1989). Integrating research: A guide for literature reviews (2nd ed.). Newbury Park, CA: Sage.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Rosenthal, R. (1991). Meta-analytic procedures for social research (Rev. ed.). Newbury Park, CA: Sage.
Stock, W. A. (1994). Systematic coding for research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 125–138). New York: Russell Sage.

"META-ANALYSIS OF PSYCHOTHERAPY OUTCOME STUDIES"

The article "Meta-Analysis of Psychotherapy Outcome Studies," written by Mary Lee Smith and Gene Glass and published in American Psychologist in 1977, initiated the use of meta-analysis as a statistical tool capable of summarizing the results of numerous studies addressing a single topic. In meta-analysis, individual research studies are identified according to established criteria and treated as a population, with the results from each study subjected to coding and entered into a database, where they are statistically analyzed. Smith and Glass pioneered the application of meta-analysis in research related to psychological treatment and education. Their work is considered a major contribution to the scientific literature on psychotherapy and has spurred hundreds of other meta-analytic studies since its publication.

Historical Context

Smith and Glass conducted their research both in response to the lingering criticisms of psychotherapy lodged by Hans Eysenck beginning in 1952 and in an effort to integrate the increasing volume of studies addressing the efficacy of psychological treatment. In a scathing review of psychotherapy, Eysenck had asserted that any benefits derived from treatment could be attributed to the spontaneous remission of psychological symptoms rather than to the therapy applied. His charge prompted numerous studies on the efficacy of treatment, often resulting in variable and conflicting findings.

Prior to the Smith and Glass article, behavioral researchers were forced to rely on a narrative synthesis of results or on an imprecise tallying method to compare outcome studies. Researchers from various theoretical perspectives highlighted studies that supported their work and dismissed or disregarded findings that countered their position. With the addition of meta-analysis to the repertoire of evaluation tools, however, researchers were able to objectively evaluate and refine their understanding of the effects of psychotherapy and other behavioral interventions. Smith and Glass determined that, on average, an individual who had participated in psychotherapy was better off than 75% of those who were not treated. Reanalyses of the Smith and Glass data, as well as more recent meta-analytic studies, have yielded similar results.

Effect Size

Reviewing 375 studies on the efficacy of psychotherapy, Smith and Glass calculated an index of effect size to determine the impact of treatment on patients who received psychotherapy versus those assigned to a control group. The effect size was equal to the difference between the means of the experimental and control groups divided by the standard deviation of the control group. A positive effect size communicated the efficacy of a psychological treatment in standard deviation units. Smith and Glass found an effect size of .68, indicating that after psychological treatment, individuals who had completed therapy were superior to controls by .68 standard deviations, an effect size that is generally classified as moderately large.
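A small worked example of this computation follows. The group summaries are hypothetical and are constructed simply to reproduce an effect size of .68; translating .68 into roughly the 75th percentile assumes normally distributed outcome scores.

```python
from math import erf, sqrt

def glass_delta(mean_treated, mean_control, sd_control):
    """Effect size as Smith and Glass defined it: the mean difference
    divided by the control group's standard deviation."""
    return (mean_treated - mean_control) / sd_control

def normal_cdf(z):
    """Proportion of a standard normal distribution falling below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

delta = glass_delta(mean_treated=56.8, mean_control=50.0, sd_control=10.0)
print(round(delta, 2))              # 0.68
print(round(normal_cdf(delta), 2))  # about 0.75: the average treated person
                                    # scores above roughly 75% of untreated controls
```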
Other Findings

While best known for its contribution to research on the general efficacy of psychotherapy, the Smith and Glass study also examined the relative efficacy of specific approaches to therapy by classifying studies into 10 theoretical types and calculating an effect size for each. Results indicated that approximately 10% of the variance in the effects of treatment could be attributed to the type of therapy employed, although the results were confounded by differences in the individual studies, including the number of variables, the duration of treatment, the severity of the presenting problem, and the means by which progress was evaluated. The authors attempted to address these problems by collapsing the 10 types of therapies into four classes: ego therapies, dynamic therapies, behavioral therapies, and humanistic therapies, and then further collapsing the types of therapy into two superclasses labeled behavioral and nonbehavioral therapies. They concluded that differences among the various types of therapy were negligible. They also asserted that therapists' degrees and credentials were unrelated to the efficacy of treatment, as was the length of therapy.

Criticisms

Publication of the Smith and Glass article prompted a flurry of responses from critics, including Eysenck, who argued that the studies included in the meta-analysis were too heterogeneous to be compared and that many were poorly designed. Some critics pointed out that an unspecified proportion of studies included in the analysis did not feature an untreated control group. Further, some studies did not have a placebo control group to rule out the effects of attention or expectation among patients. A later reanalysis of the data by Janet Landman and Robyn Dawes published in 1982 used more stringent criteria and featured separate analyses that used only studies that included placebo controls. Their analyses reached conclusions that paralleled those of Smith and Glass.

Influence

The Smith and Glass study not only altered the landscape of the psychotherapeutic efficacy battle; it also laid the groundwork for meta-analytic studies investigating a variety of psychological and educational interventions. Their work provided an objective means of determining the outcome of a given intervention, summarizing the results of large numbers of studies, and indicating not only whether a treatment makes a difference, but how much of a difference. At the time of the Smith and Glass publication, the statistical theory of meta-analysis was not yet fully articulated. More recent studies using meta-analysis have addressed the technical problems found in earlier work. As a result, meta-analysis has become an increasingly influential technique in measuring treatment efficacy.

Sarah L. Hastings

See also Control Group; Effect Size, Measures of; Meta-Analysis

Further Readings

Chambliss, C. H. (2000). A review of relevant psychotherapy outcome research. In C. H. Chambliss (Ed.), Psychotherapy and managed care: Reconciling research and reality (pp. 197–214). Boston: Allyn and Bacon.
Eysenck, H. J. (1952). The effects of psychotherapy. Journal of Consulting Psychology, 16, 319–324.
Landman, J. T., & Dawes, R. M. (1982). Psychotherapy outcome: Smith and Glass' conclusions stand up under scrutiny. American Psychologist, 37(5), 504–516.
Lipsey, M., & Wilson, D. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48(12), 1181–1209.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.
Wampold, B. E. (2000). Outcomes of individual counseling and psychotherapy: Empirical evidence addressing two fundamental questions. In S. D. Brown & R. W. Lent (Eds.), Handbook of counseling psychology (3rd ed.). New York: Wiley.

METHODS SECTION

The purpose of a methods section of a research paper is to provide the information by which a study's validity is judged. It must contain enough information so that (a) the study could be repeated by others to evaluate whether the results are reproducible, and (b) others can judge whether the results and conclusions are valid. Therefore, the methods section should provide a clear and precise description of how a study was done and the rationale for the specific procedures chosen.
Historically, the methods section was referred to as the "materials and methods section" to emphasize the two areas that must be addressed. "Materials" referred to what was studied (e.g., humans, animals, tissue cultures), treatments applied, and instruments used. "Methods" referred to the selection of study subjects, data collection, and data analysis. In some fields of study, because "materials" does not apply, alternative headings such as "subjects and methods," "patients and methods," or simply "methods" have been used or recommended.

Below are the items that should be included in a methods section.

Subjects or Participants

If human or animal subjects were used in the study, who the subjects were and how they were relevant to the research question should be described. Any details that are relevant to the study should be included. For humans, these details include gender, age, ethnicity, socioeconomic status, and so forth, when appropriate. For animals, these details include gender, age, strain, weight, and so forth. The researcher should also describe how many subjects were included and how they were selected. The selection criteria and rationale for enrolling subjects into the study must be stated explicitly. For example, the researcher should define study and comparison subjects and the inclusion and exclusion criteria for subjects. If the subjects were human, the type of reward or motivation used to encourage them to participate should be stated. When working with human or animal subjects, there must be a declaration that an ethics or institutional review board has determined that the study protocol adheres to ethical principles. In studies involving animals, the preparations made prior to the beginning of the study must be specified (e.g., use of sedation and anesthesia).

Study Design

The design specifies the sequence of manipulations and measurement procedures that make up the study. Some common designs are experiments (e.g., randomized trials, quasi-experiments), observational studies (e.g., prospective or retrospective cohort, case–control, and cross-sectional studies), qualitative methods (e.g., ethnography, focus groups), and others (e.g., secondary data analysis, literature review, meta-analysis, mathematical derivations, and opinion–editorial pieces). Here is a brief description of the designs. Randomized trials involve the random allocation by the investigator of subjects to different interventions (treatments or conditions). Quasi-experiments involve nonrandom allocation. Both cohort (groups based on exposures) and case–control (groups based on outcomes) studies are longitudinal studies in which exposures and outcomes are measured at different times. Cross-sectional studies measure exposures and outcomes at a single time. Ethnography uses fieldwork to provide a descriptive study of human societies. A focus group is a form of qualitative research in which people assembled in a group are asked about their attitude toward a product or concept. An example of secondary data analysis is the abstraction of data from existing administrative databases. A meta-analysis combines the results of several studies that address a set of related research hypotheses.

Data Collection

The next step in the methods section is a description of the variables that were measured and how these measurements were made. In laboratory and experimental studies, the description of measurement instruments and reagents should include the manufacturer and model, the calibration process, and how measurements were made. In epidemiologic and social studies, the development and pretesting of questionnaires, training of interviewers, data extraction from databases, and conduct of focus groups should be described where appropriate. In some cases, the survey instrument (questionnaire) may be included as an appendix to the research paper.

Data Analysis

The last step in the methods section is to describe the way in which the data will be presented in the results section.
For quantitative data, this step should specify whether and which statistical tests will be used for making the inference. If statistical tests are used, this part of the methods section must specify the significance level, whether the tests are one- or two-sided, and the type of confidence intervals. For qualitative data, a common analysis is observer impression. That is, expert or lay observers examine the data, form an impression, and report their impression in a structured, quantitative form.

The following are some tips for writing the methods section: (a) The writing should be direct and precise. Complex sentence structures and unimportant details should be avoided. (b) The rationale or assumptions on which the methods are based may not always be obvious to the audience and so should be explained clearly. This is particularly true when one is writing for a general audience, as opposed to a subspecialty group. The writer must always keep in mind who the audience is. (c) The methods section should be written in the past tense. (d) Subheadings, such as participants, design, and so forth, may help readers navigate the paper. (e) If the study design is complex, it may be helpful to include a diagram, table, or flowchart to explain the methods used. (f) Results should not be placed in the methods section. However, the researchers may include preliminary results from a pilot test they used to design the main study they are reporting.

The methods section is important because it provides the information the reader needs to judge the study's validity. It should provide a clear and precise description of how a study was conducted and the rationale for the specific study methods and procedures.

Bernard Choi and Anita Pak

See also Discussion Section; Results Section; Validity of Research Conclusions

Further Readings

Branson, R. D. (2004). Anatomy of a research paper. Respiratory Care, 49, 1222–1228.
Hulley, S. B., Newman, T. B., & Cummings, S. R. (1988). The anatomy and physiology of research. In S. B. Hulley & S. R. Cummings (Eds.), Designing clinical research (pp. 1–11). Baltimore: Williams & Wilkins.
Kallet, R. H. (2004). How to write the methods section of a research paper. Respiratory Care, 49, 1229–1232.
Van Damme, H., Michel, L., Ceelen, W., & Malaise, J. (2007). Twelve steps to writing an effective "materials and methods" section. Acta Chirurgica Belgica, 107, 102.

METHOD VARIANCE

Method is what is used in the process of measuring something, and it is a property of the measuring instrument. The term method effects refers to the systematic biases caused by the measuring instrument. Method variance refers to the amount of variance attributable to the methods that are used. In psychological measures, method variance is often defined in relationship to trait variance. Trait variance is the variability in responses due to the underlying attribute that one is measuring. In contrast, method variance is defined as the variability in responses due to characteristics of the measuring instrument. After sketching a short history of method variance, this entry discusses features of measures and method variance analyses and describes approaches for reducing method effects.

A Short History

No measuring instrument is free from error. This is particularly germane in social science research, which relies heavily on self-report instruments. Donald Thomas Campbell was the first to mention the problem of method variance. In 1959, Campbell and Donald W. Fiske described the fallibility inherent in all measures and recommended the use of multiple methods to reduce error. Because no single method can be the gold standard for measurement, they proposed that multiple methods be used to triangulate on the underlying "true" value. The concept was later extended to unobtrusive measures.

Method variance has not been well defined in the literature. The assumption has been that the reader knows what is meant by method variance. It is often described in a roundabout way, in relationship to trait variance. Campbell and Fiske pointed out that there is no fixed demarcation between trait and method.
Depending on the goals of a particular research project, a characteristic may be considered either a method or a trait. Researchers have reported the methods that they use as different tests, questionnaires with different types of answers, self-report and peer ratings, clinician reports, or institutional records, to name a few.

In 1950 Campbell differentiated between structured and nonstructured measures, along with those whose intent was disguised versus measures that were obvious to the test taker. Later Campbell and others described the characteristics associated with unobtrusive methods, such as physical traces and archival records. More recently, Lee Sechrest and colleagues extended this characterization to observable methods.

Others have approached the problem of method from an "itemetric" level, in paper-and-pencil questionnaires. A. Angleitner, O. P. John, and F. Löhr proposed a series of item-level characteristics, including overt reactions, covert reactions, bodily symptoms, wishes and interests, attributes of traits, attitudes and beliefs, biographical facts, others' reactions, and bizarre items.

Obvious Methods

There appear to be obvious, or manifest, features of measurement, and these include stimulus formats, response formats, response categories, raters, direct rating versus summative scale, whether the stimulus or the response is rated, and, finally, opaque versus transparent measures. These method characteristics are usually mentioned in articles to describe the methods used. For example, an abstract may describe a measure as "a 30-item true–false test with three subscales," "a structured interview used to collect school characteristics," or "patient functioning assessed by clinicians using a 5-point scale."

Stimulus and Response Formats

The stimulus format is the way the measure is presented to the participant, such as a written or oral stimulus. The response format refers to the methods used to collect the participant's response and includes written, oral, and graphical approaches. The vast majority of stimulus and response formats used in social science research are written paper-and-pencil tests.

Response Categories

The response categories include the ways an item may be answered. Examples of response categories include multiple-choice items, matching, Likert-type scales, true–false answers, responses to open-ended questions, and visual analogue scales. Close-ended questions are used most frequently, probably because of their ease of administration and scoring. Open-ended questions are used less frequently in social science research. Often the responses to these questions are very short, or the question is left blank. Open-ended questions require extra effort to code. Graphical responses such as visual analogue scales are infrequently used.

Raters

Raters are a salient method characteristic. Self-report instruments comprise the majority of measures. In addition to the self as rater, other raters include, for example, teachers, parents, and peers. Other raters may be used in settings with easy access to them. For example, studies conducted in schools often include teacher ratings and may collect peer and parent ratings. Investigations in medical settings may include ratings by clinicians and nurses.

The observability of the trait in question probably determines the accuracy of the ratings by others. An easily observable trait such as extroversion will probably generate valid ratings. However, characteristics that cannot be seen, particularly those that the respondent chooses to hide, will be harder to rate. Racism is a good example of a characteristic that may not be amenable to ratings.

Direct Versus Summative Scale

This method characteristic refers to the number of items used to measure a characteristic. The respondent may be asked directly about his or her standing on a trait; for example, How extroverted are you? In other instances, multiple-item scales are employed. The items are then summed to estimate the respondent's standing on the trait.
Direct, single items may be sufficient if a trait is obvious and/or the respondent does not care about the results.

Rating the Stimulus Versus Rating the Response

Rating the prestige of colleges or occupations is an example of rating the stimulus; self-report questionnaires for extroversion or conscientiousness are examples of rating the response. The choice depends on the goals of the study.

Opaque Versus Transparent Measures

This method characteristic refers to whether the purpose of a test is easily discerned by the respondent. The Stanford-Binet is obviously a test of intelligence, and the Myers-Briggs Type Indicator inventory measures extroversion. These are transparent tests. If the respondent cannot easily guess the purpose of a test, it is opaque.

Types of Analyses Used for Method Variance

If a single method is used, it is not possible to estimate method effects. Multiple methods are required in an investigation in order to study method effects. When multiple methods are collected, they must be combined in some way to estimate the underlying trait. Composite scores or latent factor models are used to estimate the trait. If the measures in a study have used different sources of error, the resulting trait estimate will contain less method bias.

Estimating the effect of methods is more complicated. Neal Schmitt and Daniel Stutts have provided an excellent summary of the types of analyses that may be used to study method variance. Currently, the most popular method of analysis for multitrait–multimethod matrices is confirmatory factor analysis. However, there are a variety of problems inherent in this method, and generalizability theory analysis shows promise for multitrait–multimethod data.

Does Method Variance Pose a Real Problem?

The extent of variance attributable to methods has not been well studied, although several interesting articles have focused on it. Joseph A. Cote and M. Ronald Buckley examined 70 published studies and reported that trait accounted for more than 40% of the variance and method accounted for approximately 25%. D. Harold Doty and William H. Glick obtained similar results.

Reducing Effects of Methods

A variety of approaches can be used to lessen the effects of methods in research studies. Awareness of the problem is an important first step. The second is to avoid measurement techniques laden with method variance. Third, incorporate multiple measures that use maximally different methods, with different sources of error variance. Finally, the multiple measures can be combined into a trait estimate during analysis. Each course of action reduces the effects of methods in research studies.
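The final step listed above, combining multiple measures into a trait estimate, can be as simple as averaging standardized scores obtained with different methods. The sketch below is only a minimal illustration with hypothetical self-report and observer scores; it is not a procedure prescribed by the entry, which also points to latent factor models for the same purpose.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores to z scores so that different methods share a metric."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

# Hypothetical measures of the same trait obtained with two different methods
self_report = [12, 15, 9, 14, 11]             # e.g., questionnaire totals
observer_ratings = [3.1, 3.8, 2.5, 3.6, 2.9]  # e.g., ratings by peers

z_self = standardize(self_report)
z_obs = standardize(observer_ratings)

# Composite trait estimate: the average of z scores across methods, so that
# method-specific error partially cancels out.
composite = [(a + b) / 2 for a, b in zip(z_self, z_obs)]
print([round(c, 2) for c in composite])
```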
Melinda Fritchoff Davis

See also Bias; Confirmatory Factor Analysis; Construct Validity; Generalizability Theory; Multitrait–Multimethod Matrix; Rating; Triangulation; True Score; Validity of Measurement

Further Readings

Campbell, D. T. (1950). The indirect assessment of social attitudes. Psychological Bulletin, 47, 15–38.
Cote, J. A., & Buckley, M. R. (1987). Estimating trait, method, and error variance: Generalizing across 70 construct validation studies. Journal of Marketing Research, 24, 315–318.
Doty, D. H., & Glick, W. H. (1998). Common methods bias: Does common methods variance really bias results? Organizational Research Methods, 1(4), 374–406.
Schmitt, N., & Stutts, D. M. (1986). Methodology review: Analysis of multitrait-multimethod matrices. Applied Psychological Measurement, 10, 1–22.
Sechrest, L. (1975). Another look at unobtrusive measures: An alternative to what? In W. Sinaiko & L. Broedling (Eds.), Perspectives on attitude assessment: Surveys and their alternatives (pp. 103–116). Washington, DC: Smithsonian Institution.
Sechrest, L., Davis, M. F., Stickle, T., & McKnight, P. (2000). Understanding "method" variance. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 63–88). Thousand Oaks, CA: Sage.
Webb, E. T., Campbell, D. T., Schwartz, R. D., Sechrest, L., & Grove, J. B. (1981). Nonreactive measures in the social sciences (2nd ed.). Boston: Houghton Mifflin.

MISSING DATA, IMPUTATION OF

Imputation involves replacing missing values, or missings, with an estimated value. In a sense, imputation is a prediction solution. It is one of three options for handling missing data. The general principle is to delete when the data are expendable, impute when the data are precious, and segment for the less common situation in which a large data set has a large fissure. Imputation is measured against deletion; it is advantageous when it affords the more accurate data analysis of the two. This entry discusses the differences between imputing and deleting, the types of missings, the criteria for preferring imputation, and various imputation techniques. It closes with application suggestions.

Figure 1   Missing Data Structure

Impute or Delete

The trade-off is between inconvenience and bias. There are two choices for deletion (casewise or pairwise) and several approaches to imputation. Casewise deletion omits entire observations (or cases) with a missing value from all calculations. Pairwise deletion omits observations on a variable-by-variable basis. Casewise deletion sacrifices partial information either for convenience or to accommodate certain statistical techniques. Techniques such as structural equation modeling may require complete data for all the variables, so only casewise deletion is possible for them. For techniques such as calculating correlation coefficients, pairwise deletion will leverage the partial information of the observations, which can be advantageous when one is working with small sample sizes and when missings are not random.

Imputation is the more advantageous technique when (a) the missings are not random, (b) the missings represent a large proportion of the data set, or (c) the data set is small or otherwise precious. If the missings do not occur at random, which is the most common situation, then deleting can create significant bias. For some situations, it is possible to repair the bias through weighting, as in poststratification for surveys. If the data set is small or otherwise precious, then deleting can severely reduce the statistical power or value of the data analysis.
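A minimal sketch of the two deletion choices, using pandas on a tiny hypothetical data frame (the data and the small sample size are contrived simply to make the loss of information under casewise deletion visible):

```python
import numpy as np
import pandas as pd

# Hypothetical data with scattered missing values
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, np.nan, 2.9, 4.2, 4.8],
    "z": [0.5, 1.0, 1.4, np.nan, 2.6],
})

# Casewise (listwise) deletion: drop every row containing a missing value first.
# Only two complete rows remain, so these correlations rest on very little data.
casewise_corr = df.dropna().corr()

# Pairwise deletion: each correlation uses all rows that are complete for that
# particular pair of variables, which is what DataFrame.corr() does by default.
pairwise_corr = df.corr()

print(casewise_corr.round(2))
print(pairwise_corr.round(2))
```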
Imputation can repair the missing data by creating one or more versions of how the data set should appear. By leveraging external knowledge, good technique, or both, it is possible to reduce bias due to missing values. Some techniques offer a quick improvement over deletion. Software is making these techniques faster and sharper; however, the techniques should be conducted by those with appropriate training.

Categorizing Missingness

Missingness can be categorized in two ways: by the physical structure of the missings and by the underlying nature of the missingness. First, the structure of the missings can be due to item or unit missingness, the merging of structurally different data sets, or barriers attributable to the data collection tools. Item missingness refers to the situation in which a single value is missing for a particular observation, and unit missingness refers to the situation in which all the values for an observation are missing. Figure 1 provides an illustration of missingness.

Second, missings can be categorized by the underlying nature of the missingness. The three categories are (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), summarized in Table 1 and discussed below.

Categorizing missings into one of these three groups provides better judgment as to the most appropriate imputation technique and the ramifications of employing that technique. MCAR is the least common, yet the easiest to address. MAR can be thought of as missing partially at random; the point is that there is some pattern that can be leveraged. There are statistical tests for inferring MCAR and MAR. There are many imputation techniques geared toward MAR. The potential of these techniques depends on the degree to which other variables are related to the missings. MNAR is also known as informative missing or nonignorable missingness. It is the most difficult to address. The most promising approach is to use external data to identify and repair this problem.

Table 1   Underlying Nature of Missingness

Missing completely at random (MCAR): Missing values occur completely at random, with no relationship to themselves or the values of any other variables. For example, suppose researchers can only afford to measure temperature for a randomly selected subsample. The remaining observations will be missing completely at random. Most likely approach: delete.

Missing at random (MAR), or missing partially at random: Missing values occur partially at random, occurring relative to observable variables and otherwise randomly. That is, after controlling for all other observable variables, the missing values are random. Most likely approach: impute.

Missing not at random (MNAR), or informative missing: Missing values occur relative to the variable itself and are not random after controlling for all other variables. For example, as temperatures become colder, the thermometer is increasingly likely to fail, and other variables cannot fully explain the pattern in these failures. Most likely approach: impute.
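To make the three categories in Table 1 concrete, the following simulation sketch (thresholds and probabilities are arbitrary illustrative assumptions) generates a temperature variable and then deletes values under each mechanism; the mean of the observed values drifts most under MNAR.

```python
import random

random.seed(0)
n = 1000
temperature = [random.gauss(10, 8) for _ in range(n)]   # variable of interest
altitude = [random.uniform(0, 3000) for _ in range(n)]  # an observed covariate

def mask(values, missing_flags):
    return [None if flag else v for v, flag in zip(values, missing_flags)]

# MCAR: every value has the same 20% chance of being missing.
mcar = mask(temperature, [random.random() < 0.20 for _ in range(n)])

# MAR: missingness depends only on the observed covariate (stations at higher
# altitude report less often), not on temperature itself once altitude is known.
mar = mask(temperature, [random.random() < 0.5 * (a / 3000) for a in altitude])

# MNAR: the thermometer fails more often when it is cold, so missingness
# depends on the unobserved value itself.
mnar = mask(temperature, [random.random() < (0.6 if t < 0 else 0.05) for t in temperature])

for name, column in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    observed = [v for v in column if v is not None]
    print(name, round(sum(observed) / len(observed), 2))
```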
Statistical and Contextual Diagnosis

Analyzing the missings and understanding their context can help researchers infer whether they are MCAR, MAR, or MNAR. There are five considerations: (1) relationships between the missings and other variables, (2) concurring missing patterns, (3) relationships between variables and external information, (4) the context of the analysis, and (5) software.

Relationships Between the Missings and Other Variables

Exploratory data analysis will reveal relationships, which can indicate MAR and lead to an appropriate imputation technique. There are statistical tests for comparing the means and distributions of covariates for missings versus nonmissings. Statistically significant results imply MAR and therefore discount MCAR. A number of other techniques are available to help provide insight, including logistic regression, regression trees, and cluster analysis. The insight from this data analysis should be juxtaposed with the context of the application and the source of the data.

Concurring Missing Patterns

Concurring missings suggest a shared event. Placed within context, these can indicate that the missings were "manufactured" by the data collection tool or the data pipeline. These clues can suggest MCAR, MAR, or MNAR.

Relationships Between Variables and External Information

Comparing the distribution of variables with missing values to external data can reveal that the data are MNAR rather than MCAR. For example, the mean of the nonmissings should be compared with an external estimate of the overall mean.

Context of the Analysis

It is necessary to study the context of the analysis, including the statistical aspects of the problem and the consequences of various imputation techniques. For example, missings belonging to controlled variables in an experiment have different implications from those of observed variables. Missing values do not always represent a failure to measure. Sometimes, these are accurate measurements. For example, if the amount of loan loss is missing, it could be because the loss amount was not entered into the computer or because the loan never defaulted. In the latter case, the value can be thought of as either zero or as "does not apply," whichever is more appropriate for the analysis. It is common for collection tools to record a missing value when a zero is more appropriate.

Software

Software has become more than a tool. Investigators' choices are often limited to those techniques supported by available software.

Imputation Techniques

Imputation techniques are optimal when missing values are not all equal, as in MAR. All the techniques raise statistical issues. Criteria for choosing a technique include the underlying nature of the missings (MCAR, MAR, or MNAR), speed, implications regarding bias and variance estimation, and considerations related to preserving the natural distribution and correlation structure. A final consideration is to avoid extrapolating outside the range space of the data. Some missing values have logical bounds. This entry discusses five types of techniques: (1) substitution, (2) regression (least squares), (3) Bayesian methods, (4) the maximum likelihood estimation (MLE)–expectation maximization algorithm, and (5) multiple imputation.

Substitution Techniques

One quick solution is to substitute the mean, median, series mean, a linear interpolation, and so forth, for the missings. One drawback of using the global mean or median is that values are repeated. This can create a spike in the distribution. It can be avoided by substituting a local mean or median. These quick substitutions tend to result in underestimating the variance and can inflate or deflate correlations.
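A minimal sketch of mean and median substitution with hypothetical values; the printed variances illustrate the shrinkage cautioned against above.

```python
from statistics import mean, median, pvariance

raw = [4.0, 7.0, None, 5.0, None, 9.0, 6.0]
observed = [v for v in raw if v is not None]

mean_filled = [v if v is not None else mean(observed) for v in raw]
median_filled = [v if v is not None else median(observed) for v in raw]

print(round(pvariance(observed), 2))       # variance of the observed values
print(round(pvariance(mean_filled), 2))    # smaller: repeated means add no spread
print(round(pvariance(median_filled), 2))  # similar shrinkage for the median fill
```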
For categorical variables, characteristic analysis compares the mean of the missings to the means of the other categories. The missings are assigned the category with the closest mean.

Hot deck and cold deck are techniques for imputing real data into the missings, with or without replacement. For hot deck, the donor data are the same data set, and for cold deck, the donor data are another data set. Hot deck avoids extrapolating outside the range space of the data set, and it better preserves the natural distribution than does imputation of a mean. Both tend to be better for MAR.

Regression (Least Squares)

Regression-based imputation predicts the missings on the basis of ordinary least-squares or weighted least-squares modeling of the nonmissing data. This assumes that relationships among the nonmissing data extrapolate to the missing-value space. This technique assumes that the data are MAR and not MCAR. It creates bias depending on the degree to which the model is overfit. As always, validation techniques such as bootstrapping or data splitting will curb the amount of overfitting.

Regression-based imputation underestimates the variance. Statisticians have studied the addition of random errors to the imputed values as a technique to correct this underestimation. The random errors can come from a designated distribution or from the observed data.

Regression-based imputation does not preserve the natural distribution or respect the associations between variables. Also, it repeats imputed values when the independent variables are identical.

Bayesian Methods

The approximate Bayesian bootstrap uses logistic regression to predict missing and nonmissing values for the dependent variable, y, based on the observed x values. The observations are then grouped on the basis of the probability of the value being missing. Candidate imputation values are randomly selected, with replacement, from the same group.

MLE–Expectation Maximization Algorithm

The expectation maximization algorithm is an iterative, two-step approach for finding an MLE for imputation. The initial step consists of deriving an expectation based on latent variables. This is followed by a maximization step, computing the MLE. The technique assumes an underlying distribution, such as the normal, mixed normal, or Student's t.

The MLE method assumes that missing values are MAR (as opposed to MCAR) and shares with regression the problem of overfitting. MLE is considered to be stronger than regression and to make fewer assumptions.

Multiple Imputation

Multiple imputation leverages another imputation technique to impute and reimpute the missings. This technique creates multiple versions of the data set, analyzes each one, and then combines the results, usually by averaging. The advantages are that this process is easier than MLE, is robust to departures from underlying assumptions, and provides better estimates of variance than regression does.
MIXED- AND RANDOM-EFFECTS MODELS

Data that are collected or generated in the context of any practical problem always exhibit variability. This variability calls for the use of appropriate statistical methodology for the data analysis. Data that are obtained from designed experiments are typically analyzed using a model that takes into consideration the various sources or factors that could account for the variability in the data. Here the term experiment denotes the process by which data are generated on the basis of planned changes in one or more input variables that are expected to influence the response. The plan or layout used to carry out the experiment is referred to as an experimental design or design of the experiment. The analysis of the data is based on an appropriate statistical model that accommodates the various factors that explain the variability in the data. If all the factors are fixed, that is, nonrandom, the model is referred to as a fixed-effects model. If all the factors are random, the model is referred to as a random-effects model. On the other hand, if the experiment involves fixed as well as random factors, the model is referred to as a mixed-effects model. In this entry, mixed- as well as random-effects models are introduced through some simple research design examples. Data analysis based on such models is briefly commented on.

A Simple Random-Effects Model

Here is a simple example, taken from Douglas C. Montgomery's book on experimental designs. A manufacturer wishes to investigate the research question of whether batches of raw materials furnished by a supplier differ significantly in their calcium content. Suppose data on the calcium content will be obtained on five batches that were received in 1 day. Furthermore, suppose six determinations of the calcium content will be made on each batch. Here batch is an input variable, and the response is the calcium content. The input variable is also called a factor. This is an example of a single-factor experiment, the factor being batch. The different possible categories of the factor are referred to as levels of the factor. Thus the factor batch has five levels. Note that on each level of the factor (i.e., on each batch), we obtain the same number of observations, namely six. In such a case, the data are called balanced. In some applications, it can happen that the numbers of observations obtained on each level of the factor are not the same; that is, we have unbalanced data.

In the above example, suppose the purpose of the data analysis is to test whether there is any difference in the calcium content among five given batches of raw materials obtained in one day. In experimental design terminology, we want to test whether the five batches have the same effects. This example, as stated, involves a factor (namely batch) having fixed effects. The reason is that there is nothing random about the batches themselves; the manufacturer has five batches given to him on a single day, and he wishes to make a comparison of the calcium content among the five given batches. However, there are many practical problems in which the factor could have random effects. In the context of the same example, suppose a large number of batches of raw materials are available in the warehouse, and the manufacturer does not have the resources to obtain data on all the batches regarding their calcium content. A natural option in this case is to collect data on a sample of batches, randomly selected from the population of available batches. Random selection is done to ensure that we have a representative sample of batches. Note that if another random selection is made, a different set of five batches could have been selected. If five batches are selected randomly, we then have a factor having random effects. Note that the purpose of our data analysis is not to draw conclusions regarding the calcium content of the five batches randomly selected; rather, we would like to use the random sample of five batches to draw conclusions regarding the population of all batches. The difference between fixed effects and random effects should now be clear. In the fixed-effects case, we have a given number of levels of a factor, and the purpose of the data analysis is to make comparisons among these given levels only, based on the responses that have been obtained. In the random-effects case, we make a random selection of a few levels of the factor (from a population of levels), and the responses are obtained on the randomly selected levels only. However, the purpose of the analysis is to make inferences concerning the population of all levels.
In order to make the concepts more concrete, let y_ij denote the jth response obtained on the ith level of the factor, where j = 1, 2, ..., n, and i = 1, 2, ..., a. Here a denotes the number of levels of the factor, and n denotes the number of responses obtained on each level. For our example, a = 5, n = 6, and y_ij is the jth determination of the calcium content from the ith batch of raw material. The data analysis can be done assuming the following structure for the y_ij's, referred to as a model:

y_ij = μ + τ_i + e_ij,   (1)

where μ is a common mean, the quantity τ_i represents the effect due to the ith level of the factor (effect due to the ith batch), and e_ij represents experimental error. The e_ij's are assumed to be random, following a normal distribution with mean zero and variance σ². In the fixed-effects case, the τ_i's are fixed unknown parameters, and the problem of interest is to test whether the τ_i's are equal. The model for the y_ij is now referred to as a fixed-effects model. In this case, the restriction Σ_{i=1}^{a} τ_i = 0 can be assumed, without loss of generality. In the random-effects case, the τ_i's are assumed to be random variables following a normal distribution with mean zero and variance σ²_τ. The model for the y_ij is now referred to as a random-effects model. Note that σ²_τ is a population variance; that is, it represents the variability among the population of levels of the factor. Now the problem of interest is to test the hypothesis that σ²_τ is zero. If this hypothesis is accepted, then the conclusion is that the different levels of the factor do not exhibit significant variability among them. In the context of the example, if the batches are randomly selected, and if the hypothesis σ²_τ = 0 is not rejected, then the data support the conclusion that there is no significant variability among the different batches in the population.
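As a concrete illustration of Model 1, the short simulation below is a sketch only: the batch and determination counts follow the calcium example (a = 5, n = 6), but the mean and variance values are invented. It generates balanced data with randomly selected batch effects and then recovers σ²_τ and σ²_e from the one-way ANOVA mean squares, together with the F test of σ²_τ = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a, n = 5, 6                      # batches (levels) and determinations per batch
mu = 23.0                        # common mean (illustrative value)
sigma_tau, sigma_e = 0.8, 0.5    # illustrative standard deviations

tau = rng.normal(0.0, sigma_tau, size=a)                       # random batch effects
y = mu + tau[:, None] + rng.normal(0.0, sigma_e, size=(a, n))

ybar_i = y.mean(axis=1)          # batch means
ybar = y.mean()                  # grand mean

ms_batch = n * ((ybar_i - ybar) ** 2).sum() / (a - 1)
ms_error = ((y - ybar_i[:, None]) ** 2).sum() / (a * (n - 1))

# Method-of-moments (ANOVA) estimates of the variance components:
# E(MS_batch) = sigma_e^2 + n * sigma_tau^2 and E(MS_error) = sigma_e^2.
var_e_hat = ms_error
var_tau_hat = max((ms_batch - ms_error) / n, 0.0)

# Test of H0: sigma_tau^2 = 0 using F = MS_batch / MS_error.
F = ms_batch / ms_error
p_value = stats.f.sf(F, a - 1, a * (n - 1))
print(var_tau_hat, var_e_hat, F, p_value)
```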
Mixed- and Random-Effects Models for Multifactor Experiments

In the context of the same example, suppose the six calcium content measurements on each batch are made by six different operators. While carrying out the measuring process, there could be differences among the operators. In other words, in addition to the effect due to the batches, there exist effects due to the operators as well, accounting for the differences among them. A possible model that could capture both the batch effects and the operator effects is

y_ij = μ + τ_i + β_j + e_ij,   (2)

i = 1, 2, ..., a, j = 1, 2, ..., b, where y_ij is the calcium content measurement obtained from the ith batch by the jth operator; β_j is the effect due to the jth operator; and μ, the τ_i's, and the e_ij's are as defined before. In the context of the example, a = 5 and b = 6. Now there are two input variables, that is, batches and operators, that are expected to influence the response (i.e., the calcium content measurement). This is an example of a two-factor experiment. Note that if the batches as well as the operators are randomly selected, then the τ_i's, as well as the β_j's, become random variables; the above model is then called a random-effects model. However, if only the batches are randomly selected (so that the τ_i's are random), but the measurements are taken by a given group of operators (so that the β_j's are fixed unknown parameters), then we have a mixed-effects model. That is, the model involves fixed effects corresponding to the given levels of one factor and random effects corresponding to a second factor, whose levels are randomly selected. When the β_j's are fixed, the restriction Σ_{j=1}^{b} β_j = 0 may be assumed. For a random-effects model, independent normal distributions are typically assumed for the τ_i's, for the β_j's, and the e_ij's, similar to that for Model 1. When the effects due to a factor are random, the hypothesis of interest is whether the corresponding variance is zero. In the fixed-effects case, we test whether the effects are the same for the different levels of the factor.

Note that Model 2 makes a rather strong assumption, namely, that the combined effect due to the two factors, batch and operator, can be written as the sum of an effect due to the batch and an effect due to the operator. In other words, there is no interaction between the two factors. In practice, such an assumption may not always hold when responses are obtained based on the combined effects of two or more factors. If interaction is present, the model should include the combined effect due to the two factors. However, now multiple measurements are necessary to carry out the data
analysis. Thus, suppose each operator makes three determinations of the calcium content on each batch. Denote by y_ijk the kth calcium content measurement on the ith batch by the jth operator. When there is interaction, the assumed model is

y_ijk = μ + τ_i + β_j + γ_ij + e_ijk,   (3)

where i = 1, 2, ..., a; j = 1, 2, ..., b; and k = 1, 2, ..., n (say). Now τ_i represents an average effect due to the ith batch. In other words, consider the combined effect due to the ith batch and the jth operator, and average it over j = 1, 2, ..., b. We refer to τ_i as the main effect due to the ith batch. Similarly, β_j is the main effect due to the jth operator. The quantity γ_ij is the interaction between the ith batch and the jth operator. If a set of given batches and operators is available for the experiment (i.e., there is no random selection), then the τ_i's, β_j's, and γ_ij's are all fixed, and Equation 3 is then a fixed-effects model. On the other hand, if the batches are randomly selected, whereas the operators are given, then the τ_i's and γ_ij's are random, but the β_j's are fixed. Thus Model 3 now becomes a mixed-effects model. However, if the batches and operators are both randomly selected, then all the effects in Equation 3 are random, resulting in a random-effects model. Equation 3 is referred to as a two-way classification model with interaction. If the batches are randomly selected, whereas the operators are given, then a formal derivation of Model 3 will result in the conditions Σ_{j=1}^{b} β_j = 0 and Σ_{j=1}^{b} γ_ij = 0 for every i. That is, the random variables γ_ij satisfy a restriction. In view of this, the γ_ij's corresponding to a fixed i will not be independent among themselves. The normality assumptions are typically made on all the random quantities.

In the context of Model 1, note that two observations from the same batch are correlated. In fact σ²_τ is also the covariance between two observations from the same batch. Other examples of correlated data where mixed- and random-effects models are appropriate include longitudinal data, clustered data, and repeated measures data. By including random effects in the model, it is possible for researchers to account for multiple sources of variation. This is indeed the purpose of using mixed- and random-effects models for analyzing data in the physical and engineering sciences, medical and biological sciences, social sciences, and so forth.

Data Analysis Based on Mixed- and Random-Effects Models

When the same number of observations is obtained on the various level combinations of the factors, the data are said to be balanced. For Model 3, balanced data correspond to the situation in which exactly n calcium content measurements (say, n = 6) are obtained by each operator from each batch. Thus the observations are y_ijk; k = 1, 2, ..., n; j = 1, 2, ..., b; and i = 1, 2, ..., a. On the other hand, unbalanced data correspond to the situation in which the numbers of calcium content determinations obtained by the different operators are not all the same for all the batches. For example, suppose there are five operators, and the first four make six measurements each on the calcium content from each batch, whereas the fifth operator could make only three observations from each batch because of time constraints; we then have unbalanced data. It could also be the case that an operator, say the first operator, makes six calcium content determinations from the first batch but only five each from the remaining batches. For unbalanced data, if n_ij denotes the number of calcium content determinations made by the jth operator on the ith batch, the observations are y_ijk; k = 1, 2, ..., n_ij; j = 1, 2, ..., b; and i = 1, 2, ..., a. The analysis of unbalanced data is considerably more complicated under mixed- and random-effects models, even under normality assumptions. The case of balanced data is somewhat simpler.

Analysis of Balanced Data

Consider the simple Model 1 with balanced data, along with the normality assumptions for the distributions of the τ_i's and the e_ij's, with variances σ²_τ and σ²_e, respectively. The purpose of the data analysis can be to estimate the variances σ²_τ and σ²_e, to test the null hypothesis that σ²_τ = 0, and to compute a confidence interval for σ²_τ and sometimes for the ratios σ²_τ/σ²_e and σ²_τ/(σ²_τ + σ²_e). Note that the ratio σ²_τ/σ²_e provides information on the relative magnitude of σ²_τ compared with that of σ²_e. If the variability in the data is mostly due to the variability among the different levels of the factor, σ²_τ is expected to be large compared with σ²_e, and the hypothesis σ²_τ = 0 is expected to be rejected.
Also note that since the variance of the observations, that is, the variance of the y_ij's in Model 1, is simply the sum σ²_τ + σ²_e, the ratio σ²_τ/(σ²_τ + σ²_e) is the fraction of the total variance that is due to the variability among the different levels of the factor. Thus the individual variances as well as the above ratios have practical meaning and significance.

Now consider Model 3 with random effects and with the normality assumptions τ_i ~ N(0, σ²_τ), β_j ~ N(0, σ²_β), γ_ij ~ N(0, σ²_γ), and e_ijk ~ N(0, σ²_e), where all the random variables are assumed to be independently distributed. Now the problems of interest include the estimation of the different variances and testing the hypothesis that the random-effects variances are zeros. For example, if the hypothesis σ²_γ = 0 cannot be rejected, we conclude that there is no significant interaction. If Model 3 is a mixed-effects model, then the normality assumptions are made on the effects that are random. Note, however, that the γ_ij's, although random, will no longer be independent in the mixed-effects case, in view of the restriction Σ_{j=1}^{b} γ_ij = 0 for every i.

The usual analysis of variance (ANOVA) decomposition can be used to arrive at statistical procedures to address all the above problems. To define the various ANOVA sums of squares for Model 3 in the context of our example on calcium content determination from different batches using different operators, let

ȳ_{ij·} = (1/n) Σ_{k=1}^{n} y_ijk,   ȳ_{i··} = (1/bn) Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk,
ȳ_{·j·} = (1/an) Σ_{i=1}^{a} Σ_{k=1}^{n} y_ijk,   ȳ_{···} = (1/abn) Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} y_ijk.

If SS_τ, SS_β, SS_γ, and SS_e denote the ANOVA sums of squares due to the batches, operators, interaction, and error, respectively, these are given by

SS_τ = bn Σ_{i=1}^{a} (ȳ_{i··} − ȳ_{···})²,   SS_β = an Σ_{j=1}^{b} (ȳ_{·j·} − ȳ_{···})²,
SS_γ = n Σ_{i=1}^{a} Σ_{j=1}^{b} (ȳ_{ij·} − ȳ_{···})² − SS_τ − SS_β,
SS_e = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (y_ijk − ȳ_{ij·})².
The following table shows the ANOVA table and the expected mean squares in the mixed-effects case and in the random-effects case; these are available in a number of books dealing with mixed- and random-effects models, in particular in Montgomery's book on experimental designs. In the mixed-effects case, the β_j's are fixed, but the τ_i's and γ_ij's are random. In other words, the batches are randomly selected, but the operators consist of a fixed group. In the random-effects case, the β_j's, the τ_i's, and the γ_ij's are all random. In the table, the notations MS_τ and so forth are used to denote mean squares.

Table 1   ANOVA and Expected Mean Squares Under Model 3 With Mixed Effects and With Random Effects, When the Data Are Balanced

Source of Variability | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | Expected Mean Square (Mixed-Effects Case) | Expected Mean Square (Random-Effects Case)
Batches      | SS_τ | a − 1          | MS_τ | σ²_e + bn σ²_τ                                   | σ²_e + n σ²_γ + bn σ²_τ
Operators    | SS_β | b − 1          | MS_β | σ²_e + n σ²_γ + an Σ_{j=1}^{b} β_j² / (b − 1)     | σ²_e + n σ²_γ + an σ²_β
Interaction  | SS_γ | (a − 1)(b − 1) | MS_γ | σ²_e + n σ²_γ                                    | σ²_e + n σ²_γ
Error        | SS_e | ab(n − 1)      | MS_e | σ²_e                                             | σ²_e

Note that the expected values can be quite different depending on whether we are in the mixed-effects setup or the random-effects setup. Also, when an expected value is a linear combination of only the variances, then the sum of squares, divided by the expected value, has a chi-square distribution. For example, in Table 1, in the mixed-effects case, SS_τ/(σ²_e + bn σ²_τ) follows a chi-square distribution with a − 1 degrees of freedom. However, if an expected value also involves the fixed-effects parameters, then the chi-square distribution holds under the appropriate hypothesis concerning the fixed effects. Thus, in Table 1, in the mixed-effects case, SS_β/(σ²_e + n σ²_γ) follows a chi-square distribution with b − 1 degrees of freedom, under the hypothesis β_1 = β_2 = ... = β_b = 0. Furthermore, for testing the various hypotheses, the denominator of the F ratio is not always the mean square due to error. If one compares the expected values in Table 1, one can see that for testing σ²_τ = 0 in the mixed-effects case, the F ratio is MS_τ/MS_e. However, for testing β_1 = β_2 = ... = β_b = 0 in the mixed-effects case, the F ratio is MS_β/MS_γ. In view of this, it is necessary to know the expected values before we can decide on the appropriate F ratio for testing a hypothesis under mixed- and random-effects models. Fortunately, procedures are available for the easy calculation of the expected values when the data are balanced. The expected values in Table 1 immediately provide us with F ratios for testing all the different hypotheses in the mixed-effects case, as well as in the random-effects case. This is so because, under the hypothesis, we can identify exactly two sums of squares having the same expected value. However, this is not always the case. Sometimes it becomes necessary to use a test statistic that is a ratio of sums of appropriate mean squares in both the numerator and the denominator. The test is then carried out using an approximate F distribution. This procedure is known as the Satterthwaite approximation.
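For balanced data the quantities above are direct to compute. The sketch below is illustrative only: the data array is simulated rather than taken from an experiment, and its dimensions (a = 5 batches, b = 6 operators, n = 3 replicates) are assumptions. It evaluates the four sums of squares defined earlier and forms the F ratios implied by Table 1 for the mixed-effects case, MS_τ/MS_e for the batch variance and MS_β/MS_γ for the fixed operator effects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b, n = 5, 6, 3                        # batches, operators, replicates per cell
y = 23.0 + rng.normal(size=(a, b, n))    # illustrative balanced data y[i, j, k]

ybar_ij = y.mean(axis=2)                 # cell means
ybar_i = y.mean(axis=(1, 2))             # batch means
ybar_j = y.mean(axis=(0, 2))             # operator means
ybar = y.mean()                          # grand mean

ss_tau = b * n * ((ybar_i - ybar) ** 2).sum()
ss_beta = a * n * ((ybar_j - ybar) ** 2).sum()
ss_gamma = n * ((ybar_ij - ybar) ** 2).sum() - ss_tau - ss_beta
ss_e = ((y - ybar_ij[:, :, None]) ** 2).sum()

ms_tau = ss_tau / (a - 1)
ms_beta = ss_beta / (b - 1)
ms_gamma = ss_gamma / ((a - 1) * (b - 1))
ms_e = ss_e / (a * b * (n - 1))

# Mixed-effects case (batches random, operators fixed), following Table 1:
F_batch = ms_tau / ms_e            # tests sigma_tau^2 = 0
F_operator = ms_beta / ms_gamma    # tests beta_1 = ... = beta_b = 0
print(F_batch, stats.f.sf(F_batch, a - 1, a * b * (n - 1)))
print(F_operator, stats.f.sf(F_operator, b - 1, (a - 1) * (b - 1)))
```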
examples.
Analysis of Unbalanced Data Thomas Mathew
The nice formulas and procedures available for See also Analysis of Variance (ANOVA); Experimental
mixed- and random-effects models with balanced Design; Factorial Design; Fixed-Effects Models;
data are not available in the case of unbalanced Random-Effects Models; Simple Main Effects
data. While some exact procedures can be derived
in the case of a single factor experiment, that is,
Model 1, such is not the case when we have a mul- Further Readings
tifactor experiment. One option is to analyze the Littell, R. C., Milliken, G. A., Stroup, W. W., &
data with likelihood-based procedures. That is, Wolfinger, R. D. (1996). SAS system for mixed
one can estimate the parameters by maximizing models. Cary, NC: SAS Publishing.
the likelihood and then test the relevant hypothe- Montgomery, D. C. (2009). Design and analysis of
ses with likelihood ratio tests. The computations experiments (7th ed.). New York: Wiley.
have to be carried out by available software. Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects
As for estimating the random-effects variances, models in S and S-Plus. New York: Springer-Verlag.
a point to note is that estimates can be obtained Verbeke, G., & Molenberghs, G. (1997). Linear mixed
on the basis of either the likelihood or the res- models in practice: A SAS-oriented approach (Lecture
notes in statistics, Vol. 126). New York: Springer-
tricted likelihood. Restricted likelihood is free of
Verlag.
the fixed-effects parameters. The resulting esti- Verbeke, G., & Molenberghs, G. (2000). Linear mixed
mates of the variances are referred to as restricted models for longitudinal data. New York: Springer-
maximum likelihood (REML) estimates. REML Verlag.
estimates are preferred to maximum likelihood West, B. T., Welch, K. B., & Galecki, A. T. (2007). Linear
estimates because the REML estimates reduce (or mixed models: A practical guide using statistical
eliminate) the bias in the estimates. software. New York: Chapman & Hall/CRC.
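As one minimal illustration in Python (not drawn from the books below, and with an assumed file name and column layout), a likelihood-based fit of a model like Model 2, with operator as a fixed effect and batch as a random effect, can be requested from statsmodels; fit(reml=True) asks for REML rather than maximum likelihood estimates, and the same call works for balanced or unbalanced data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# "calcium.csv" and its columns (y, batch, operator) are assumed for
# illustration; any long-format data set with one row per measurement works.
df = pd.read_csv("calcium.csv")

# Fixed operator effects, random batch effects, REML estimation.
model = smf.mixedlm("y ~ C(operator)", data=df, groups=df["batch"])
result = model.fit(reml=True)
print(result.summary())   # fixed-effect estimates and the batch variance component
```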

Thomas Mathew

See also Analysis of Variance (ANOVA); Experimental Design; Factorial Design; Fixed-Effects Models; Random-Effects Models; Simple Main Effects

Further Readings

Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS system for mixed models. Cary, NC: SAS Publishing.
Montgomery, D. C. (2009). Design and analysis of experiments (7th ed.). New York: Wiley.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-Plus. New York: Springer-Verlag.
Verbeke, G., & Molenberghs, G. (1997). Linear mixed models in practice: A SAS-oriented approach (Lecture notes in statistics, Vol. 126). New York: Springer-Verlag.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer-Verlag.
West, B. T., Welch, K. B., & Galecki, A. T. (2007). Linear mixed models: A practical guide using statistical software. New York: Chapman & Hall/CRC.

MIXED METHODS DESIGN

Mixed methods is a research orientation that possesses unique purposes and techniques. It
integrates techniques from quantitative and qualitative paradigms to tackle research questions that can be best addressed by mixing these two traditional approaches. As long as 40 years ago, scholars noted that quantitative and qualitative research were not antithetical and that every research process, through practical necessity, should include aspects of both quantitative and qualitative methodology. In order to achieve more useful and meaningful results in any study, it is essential to consider the actual needs and purposes of a research problem to determine the methods to be implemented. The literature on mixed methods design is vast, and contributions have been made by scholars from myriad disciplines in the social sciences. Therefore, this entry is grounded in the work of these scholars. This entry provides a historical overview of mixed methods as a paradigm for research, establishes differences between quantitative and qualitative designs, shows how qualitative and quantitative methods can be integrated to address different types of research questions, and illustrates some implications for using mixed methods. Though still new as an approach to research, mixed methods design is expected to soon dominate the social and behavioral sciences.

The objective of social science research is to understand the complexity of human behavior and experience. The task of the researcher, whose role is to describe and explain this complexity, is limited by his or her methodological repertoire. As tradition shows, different methods often are best applied to different kinds of research. Having the opportunity to apply various methods to a single research question can broaden the dimensions and scope of that research and perhaps lead to a more precise and holistic perspective of human behavior and experience. Research is not knowledge itself, but a process in which knowledge is constructed through step-by-step data gathering.

Data are gathered most typically through two distinct classical approaches—qualitative and quantitative. The use of both these approaches for a single study, although sometimes controversial, is becoming more widespread in social science. Methods are really "design" components that include the following: (a) the relationship between the researcher and research "subjects," (b) details of the experimental environment (place, time, etc.), (c) sampling and data collection methods, (d) data analysis strategies, and (e) knowledge dissemination. The design of a study thus leads to the choice of method strategy. The framework for a study, then, depends on the phenomenon being studied, with the participants and relevant theories informing the research design. Most study designs today need to include both quantitative and qualitative methods for gathering effective data and can thereby incorporate a more expansive set of assumptions and a broader worldview.

Mixing methods (or multiple-methods design) is generally acknowledged as being more pertinent to modern research than using a single approach. Quantitative and qualitative methods may rely more on single data collection methods. For example, whereas a quantitative study may rely on surveys for collecting data, a qualitative study may rely on observations or open-ended questions. However, it is also possible that each of these approaches may use multiple data collection methods. Mixed methods design "triangulates" these two types of methods. When these two methods are used within a single research study, different types of data are combined to answer the research question—a defining feature of mixed methods. This approach is already standard in most major designs. For example, in social sciences, interviews and participant observation form a large part of research and are often combined with other data (e.g., biological markers).

Even though the integration of these two research models is considered fairly novel (emerging significantly in the 1960s), the practice of integrating these two models has a long history. Researchers have often combined these methods, if perhaps only for particular portions of their investigations. Mixed methods research was more common in earlier periods when methods were less specialized and compartmentalized and when there was less orthodoxy in method selection. Researchers observed and cross-tabulated, recognizing that each methodology alone could be inadequate. Synthesis of these two classic approaches in data gathering and interpretation does not necessarily mean that they are wholly combined or that they are uniform. Often they need to be employed separately within a single research design so as not to corrupt either process.

Important factors to consider when one is using mixed methods can be summarized as follows. Mixed methods researchers agree that
there are some resonances between the two paradigms that encourage mutual use. The distinctions between these two methods cannot necessarily be reconciled. Indeed, this "tension" can produce more meaningful interactions and thus new results. Combination of qualitative and quantitative methods must be accomplished productively so that the integrity of each approach is not violated: Methodological congruence needs to be maintained so that data collection and analytical strategies are not jeopardized and can be consistent. The two seemingly antithetical research approaches can be productively combined in a pragmatic, interactive, and integrative design model. The two "classical" methods can complement each other and make a study more successful and resourceful by eliminating the possibility of distortion by strict adherence to a single formal theory.

Qualitative and Quantitative Data

Qualitative and quantitative distinctions are grounded in two contrasting approaches to categorizing and explaining data. Different paradigms produce and use different types of data. Early studies distinguished the two methods according to the kind of data collected, whether textual or numerical. The classic qualitative approach includes study of real-life settings, focus on participants' context, inductive generation of theory, open-ended data collection, analytical strategies based on textual data, and use of narrative forms of analysis and presentation. Basically, the qualitative method refers to a research paradigm that addresses interpretation and socially constructed realities. The classic quantitative approach encompasses hypothesis formulation based on precedence, experiment, control groups and variables, comparative analysis, sampling, standardization of data collection, statistics, and the concept of causality. Quantitative design refers to a research paradigm that hypothesizes relationships between variables in an objective way.

Quantitative methods are related to deductivist approaches, positivism, data variance, and factual causation. Qualitative methods include inductive approaches, constructivism, and textual information. In general, quantitative design relies on comparisons of measurements and frequencies across categories and correlations between variables, whereas the qualitative method concentrates on events within a context, relying on meaning and process. When the two are used together, data can be transformed. Essentially, "qualitized" data can represent data collected using quantitative methods that are converted into narratives that are analyzed qualitatively. "Quantitized" data represent data collected using qualitative methods that can be converted into numerical codes and analyzed statistically. Many research problems are not linear. Purpose drives the research questions. The course of the study, however, may change as it progresses, leading possibly to different questions and the need to alter method design. As in any rigorous research, mixed methods allows for the research question and purpose to lead the design.

Historical Overview

In the Handbook of qualitative research, Norman K. Denzin and Yvonna S. Lincoln classified four historic periods in research history for the social sciences. Their classification shows an evolution from strict quantitative methodology, through a gradual implementation and acceptance of qualitative methods, to a merging of the two: (1) traditional (quantitative), 1900 to 1950; (2) modernist, 1950 to 1970; (3) ascendance of constructivism, 1970 to 1990; and (4) pragmatism and the "compatibility thesis" (discussed later), 1990 to the present.

Quantitative methodology, and its paradigm, positivism, dominated methodological orientation during the first half of the 20th century. This "traditional" period, although primarily focused on quantitative methods, did include some mixed method approaches without directly acknowledging implementation of qualitative data: Studies often made extensive use of interviews and researcher observations, as demonstrated in the Hawthorne effect. In the natural sciences, such as biology, paleontology, and geology, goals and methods that typically would be considered qualitative (naturalistic settings, inductive approaches, narrative description, and focus on context and single cases) have been integrated with those that were regarded as quantitative (experimental manipulation, controls and variables, hypothesis testing, theory verification, and measurement and analysis of samples) for more than a century.
After World War II, positivism began to be discredited, which led to its "intellectual" successor, postpositivism. Postpositivism (still largely in the domain of the quantitative method) asserts that research data are influenced by the values of the researchers, the theories used by the researchers, and the researchers' individually constructed realities. During this period, some of the first explicit mixed method designs began to emerge. While there was no distinctive categorization of mixed methods, numerous studies began to employ components of its design, especially in the human sciences. Data obtained from participant observation (qualitative information) was often implemented, for example, to explain quantitative results from a field experiment.

The subsequent "modernist" period, or "Golden Age" (1950–1970), has been demarcated, then, by two trends: positivism's losing its stronghold and research methods that began to incorporate "multimethods." The discrediting of positivism resulted in methods that were more radical than those of postpositivism. From 1970 to 1985—defined by some scholars as the "qualitative revolution"—qualitative researchers became more vocal in their criticisms of pure quantitative approaches and proposed new methods associated with constructivism, which began to gain wider acceptance. In the years from 1970 to 1990, qualitative methods, along with mixed method syntheses, were becoming more eminent. In the 1970s, the combination of data sources and multiple methods was becoming more fashionable, and new paradigms, such as interpretivism and naturalism, were gaining precedence and validity.

In defense of a "paradigm of purity," a period known as the paradigm wars took place. Different philosophical camps held that quantitative and qualitative methods could not be combined; such a "blending" would corrupt accurate scientific research. Compatibility between quantitative and qualitative methods, according to these proponents of quantitative methods, was impossible due to the distinction of the paradigms. Researchers who combined these methods were doomed to fail because of the inherent differences in the underlying systems. Qualitative researchers defined such "purist" traditions as being based on "received" paradigms (paradigms preexisting a study that are automatically accepted as givens), and they argued against the prejudices and restrictions of positivism and postpositivism. They maintained that mixed methods were already being employed in numerous studies.

The period of pragmatism and compatibility (1990–the present) as defined by Denzin and Lincoln constitutes the establishment of mixed methods as a separate field. Mixed methodologists are not representative of either the traditional (quantitative) or "revolutionary" (qualitative) camps. In order to validate this new field, mixed methodologists had to show a link between epistemology and method and demonstrate that quantitative and qualitative methods were compatible. One of the main concerns in mixing methods was to determine whether it was also viable to mix paradigms—a concept that circumscribes an interface, in practice, between epistemology (historically learned assumptions) and methodology. A new paradigm, pragmatism, effectively combines these two approaches and allows researchers to implement them in a complementary way.

Pragmatism addresses the philosophical aspect of a paradigm by concentrating on what works. Paradigms, under pragmatism, do not represent the primary organizing principle for mixed methods practice. Believing that paradigms (socially constructed) are malleable assumptions that change through history, pragmatists make design decisions based on what is practical, contextually compatible, and consequential. Decisions about methodology are not based solely on congruence with established philosophical assumptions but are founded on a methodology's ability to further the particular research questions within a specified context. Because of the complexity of most contexts under research, pragmatists incorporate a dual focus between sense making and value making. Pragmatic research decisions, grounded in the actual context being studied, lead to a logical design of inquiry that has been termed fitness for purpose. Mixed methodologies are the result. Pragmatism demonstrates that singular paradigm beliefs are not intrinsically connected to specific methodologies; rather, methods and techniques are developed from multiple paradigms.

Researchers began to believe that the concept of a single best paradigm was a relic of the past and that multiple, diverse perspectives were critical to addressing the complexity of a pluralistic
society. They proposed what they defined as the dialectical stance: Opposing views (paradigms) are valid and provide for more realistic interaction. Multiple paradigms, then, are considered a foundation for mixed methods research. Researchers, therefore, need to determine which paradigms are best for a particular mixed methods design for a specific study.

Currently, researchers in social and behavioral studies generally comprise three groups: quantitatively oriented researchers, primarily interested in numerical and statistical analyses; qualitatively oriented researchers, primarily interested in analysis of narrative data; and mixed methodologists, who are interested in working with both quantitative and qualitative data. The differences between the three groups (particularly between quantitatively and qualitatively oriented researchers) have often been characterized as the paradigm wars. These three movements continue to evolve simultaneously, and all three have been practiced concurrently. Mixed methodology is in its adolescent stage as scholars work to determine how to best integrate different methods.

Integrated Design Models

A. Tashakkori and C. Teddlie have referred to three categories of multiple-method designs: multimethod research, mixed methods research, and mixed model research. The terms multimethod and mixed method are often confused, but they actually refer to different processes. In multimethod studies, research questions use both quantitative and qualitative procedures, but the process is applied principally to quantitative studies. This method is most often implemented in an interrelated series of projects whose research questions are theoretically driven. Multimethod research is essentially complete in itself and uses simultaneous and sequential designs.

Mixed methods studies, the primary concern of this entry, encompass both mixed methods and mixed model designs. This type of research implements qualitative and quantitative data collection and analysis techniques in parallel phases or sequentially. Mixed methods (combined methods) are distinguished from mixed model designs (combined quantitative and qualitative methods in all phases of the research). In mixed methods design, the "mixing" occurs in the type of questions asked and in the inferences that evolve. Mixed model research is implemented in all stages of the study (questions, methods, data collection, analysis, and inferences).

The predominant approach to mixing methods encompasses two basic types of design: component and integrated. In component designs, methods remain distinct and are used for discrete aspects of the research. Integrative design incorporates substantial integration of methods. Although typologies help researchers organize actual use of both methods, use of typologies as an organizing tool demonstrates a lingering linear concept that refers more to the duality of quantitative and qualitative methods than to the recognition and implementation of multiple paradigms. Design components (based on objectives, frameworks, questions, and validity strategies), when organized by typology, are perceived as separate entities rather than as interactive parts of a whole. This kind of typology illustrates a pluralism that "combines" methods without actually integrating them.

Triangulation and Validity

Triangulation is a method that combines different theoretical perspectives within a single study. As applied to mixed methods, triangulation determines an unknown point from two or more known points, that is, collection of data from different sources, which improves validity of results. In The Research Act, Denzin argued that a hypothesis explored under various methods is more valid than one tested under only one method. Triangulation in methods, where differing processes are implemented, maximizes the validity of the research: Convergence of results from different measurements enhances validity and verification. It was also argued that using different methods, and possibly a faulty commonality of framework, could lead to increased error in results. Triangulation may not increase validity but does increase consistency in methodology: Though empirical results may be conflicting, they are not inherently damaging but render a more holistic picture.

Triangulation allows for the exploration of both theoretical and empirical observation (inductive and deductive), two distinct types of knowledge that can be implemented as a methodological
"map" and are logically connected. A researcher can structure a logical study, and the tools needed for organizing and analyzing data, only if the theoretical framework is established prior to empirical observations. Triangulation often leads to a situation in which different findings do not converge or complement each other. Divergence of results, however, may lead to additional valid explanations of the study. Divergence, in this case, can be reflective of a logical reconciliation of quantitative and qualitative methods. It can lead to a productive process in which initial concepts need to be modified and adapted to differing study results.

Recently, two new approaches for mixing methods have been introduced: an interactive approach, in which the design components are integrated and mutually influence each other, and a conceptual approach, using an analysis of the fundamental differences between quantitative and qualitative research. The interactive method, as employed in architecture, engineering, and art, is neither linear nor cyclic. It is a schematic method that addresses data in a mutually ongoing arrangement. This design model is a tool that focuses on analyzing the research question rather than providing a template for creating a study type. This more qualitative approach to mixed methods design emphasizes particularity, context, comprehensiveness, and the process by which a particular combination of qualitative and quantitative components develops in practice, in contrast to the categorization and comparison of data typical of the pure quantitative approach.

Implications for Mixed Methods

As the body of research regarding the role of the environment and its impact on the individual has developed, the status and acceptance of mixed methods research in many of the applied disciplines is accelerating. This acceptance has been influenced by the historical development of these disciplines and an acknowledgment of a desire to move away from traditional paradigms of positivism and postpositivism. The key contributions of mixed methods have been to an understanding of individual factors that contribute to social outcomes, the study of social determinants of medical and social problems, the study of service utilization and delivery, and translational research into meaningful practice.

Mixed methods research may bridge postmodern critiques of scientific inquiry and the growing interest in qualitative research. Mixed methods research provides an opportunity to test research questions, hypotheses, and theory and to acknowledge the phenomena of human experience. Quantitative methods support the ability to generalize findings to the general population. However, quantitative approaches that are well regarded by researchers may not necessarily be comprehensible or useful to lay individuals. Qualitative approaches can help contextualize problems in narrative forms and thus can be more meaningful to lay individuals. Mixing these two methods offers the potential for researchers to understand, contextualize, and develop interventions.

Mixed methods have been used to examine and implement a wide range of research topics, including instrument design, validation of constructs, the relationship of constructs, and theory development or disconfirmation. Mixed methods are rooted, for one example, in the framework of feminist approaches whereby the study of participants' lives and personal interpretations of their lives has implications in research. In terms of data analysis, content analysis is a way for scientists to confirm hypotheses and to gather qualitative data from study participants through different methods (e.g., grounded theory, phenomenological, narrative). The application of triangulation methodology is invaluable in mixed methods research.

While there are certainly advantages to employing mixed methods in research, their use also presents significant challenges. Perhaps the most significant issue to consider is the amount of time associated with the design and implementation of mixed methods. In addition to time restrictions, costs or barriers to obtaining funding to carry out mixed methods research are a consideration.

Conclusion

Rather than choosing one paradigm or method over another, researchers often use multiple and mixed methods. Implementing these newer combinations of methods better supports the modern complexities of social behavior, and the changing
perceptions of reality and knowledge better serve the purposes of the framework of new studies in social science research. The classic quantitative and qualitative models alone cannot encompass the interplay between theoretical and empirical knowledge. Simply, combining methods makes common sense and serves the purposes of complex analyses. Methodological strategies are tools for inquiry and represent collections of strategies that corroborate a particular perspective. The strength of mixed methods is that research can evolve comprehensively and adapt to empirical changes, thus going beyond the traditional dualism of quantitative and qualitative methods, redefining and reflecting the nature of social reality.

Paradigms are social constructions, culturally and historically embedded as discourse practices, and contain their own set of assumptions. As social constructions, paradigms are changeable and dynamic. The complexity and pluralism of our contemporary world require rejecting investigative constraints of singular methods and implementing more diverse and integrative methods that can better address research questions and evolving social constructions. Knowledge and information change with time and mirror evolving social perceptions and needs. Newer paradigms and belief systems can help transcend and expand old dualisms and contribute to redefining the nature of social reality and knowledge.

Scholars generally agree that it is possible to use qualitative and quantitative methods to answer objective–value and subjective–constructivist questions, to include both inductive–exploratory and deductive–confirmatory questions in a single study, to mix different orientations, and to integrate qualitative and quantitative data in one or more stages of research, and that many research questions can only be answered with a mixed methods design. Traditional approaches meant aligning oneself to either quantitative or qualitative methods. Modern scholars believe that if research is to go forward, this dichotomy needs to be fully reconciled.

Rogério M. Pinto

See also Critical Theory; Grounded Theory; Mixed Model Design; Qualitative Research; Quantitative Research

Further Readings

Creswell, J. W. (2003). Research design: Qualitative, quantitative and mixed methods approach (2nd ed.). Thousand Oaks, CA: Sage.
Denzin, N. K., & Lincoln, Y. S. (Eds.). (2000). Handbook of qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Tashakkori, A., & Teddlie, C. (Eds.). (2003). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: Sage.

MIXED MODEL DESIGN

Mixed model designs are an extension of the general linear model, as in analysis of variance (ANOVA) designs. There is no common term for the mixed model design. Researchers sometimes refer to split-plot designs, randomized complete block, nested, two-way mixed ANOVAs, and certain repeated measures designs as mixed models. Also, mixed model designs may be restrictive or nonrestrictive. The restrictive model is used most often because it is more general, thus allowing for broader applications. A mixed model may be thought of as two models in one: a fixed-effects model and a random-effects model. Regardless of the name, statisticians generally agree that when interest is in both fixed and random effects, the design may be classified as a mixed model. Mixed model analyses are used to study research problems in a broad array of fields, ranging from education to agriculture, sociology, psychology, biology, manufacturing, and economics.

Purpose of the Test

A mixed model analysis is appropriate if one is interested in a between-subjects effect (fixed effect) in addition to within-subjects effects (random effects), or in exploring alternative covariance structures on which to model data with between- and within-subjects effects. A variable may be fixed or random. Random effects allow the researcher to generalize beyond the sample.

Fixed and random effects are a major feature distinguishing the mixed model from the standard repeated measures design. The standard two-way repeated measures design examines repeated
measures on the same subjects. These are within-subjects designs for two factors with two or more levels. In the two-way mixed model design, two factors, one for within-subjects and one for between-subjects, are always included in the model. Each factor has two or more levels. For example, in a study to determine the preferred time of day for undergraduate and graduate college students to exercise at a gym, time of day would be a within-subjects factor with three levels: 5:00 a.m., 1:00 p.m., and 9:00 p.m.; and student classification as undergraduate or graduate would be two levels of a between-subjects factor. The dependent variable for such a study could be a score on a workout preference scale. A design with three levels on a random factor and two levels on a fixed factor is written as a 2 × 3 mixed model design.

Fixed and Random Effects

Fixed Effects

Fixed effects, also known as between-subjects effects, are those in which each subject is a member of either one group or another, but not more than one group. All levels of the factor may be included, or only selected levels. In other words, subjects are measured on only one of the designated levels of the factor, such as undergraduate or graduate. Other examples of fixed effects are gender, membership in a control group or an experimental group, marital status, and religious affiliation.

Random Effects

Random effects, also known as within-subjects effects, are those in which measures of each level of a factor are taken on each subject, and the effects may vary from one measure to another over the levels of the factor. Variability in the dependent variable can be attributed to differences in the random factor. In the previous example, all subjects would be measured across all levels of the time-of-day factor for exercising at a gym. In a study in which time is a random effect and gender is a fixed effect, the interaction of time and gender is also a random effect. Other examples of random effects are number of trials, in which each subject experiences each trial, or repeated doses of medication that each subject receives. Random effects are the measures from the repeated trials, measures after the time intervals of some activities, or repeated measures of some function such as blood pressure, strength level, endurance, or achievement. The mixed model design may be applied when the sample comprises large units, such as school districts, military bases, and universities, and the variability among the units, rather than the differences in means, is of interest. Examining random effects allows researchers to make inferences to a larger population.

Assumptions

As with other inferential statistical procedures, the data for a mixed model analysis must meet certain statistical assumptions if trustworthy generalizations are to be made from the sample to the larger population. Assumptions apply to both the between- and within-subjects effects. The between-subjects assumptions are the same as those in a standard ANOVA: independence of scores; normality; and equal variances, known as homogeneity of variance. Assumptions for the within-subjects effects are independence of scores and normality of the distribution of scores in the larger population. The mixed model also assumes that there is a linear relationship between the dependent and independent variables. In addition, the complexity of the mixed model design requires the assumption of equality of variances of the difference scores for all pairs of scores at all levels of the within-subjects factor and equal covariances for the between-subjects factor. This assumption is known as the sphericity assumption. Sphericity is especially important to the mixed model analysis.

Sphericity Assumption

The sphericity assumption may be thought of as the homogeneity-of-variance assumption for repeated measures. This assumption can be tested by conducting correlations between and among all levels of repeated measures factors and using Bartlett's test of sphericity. A significant probability level (p value) means that the data are correlated and the sphericity assumption is violated. However, if the data are uncorrelated, then sphericity can be assumed. Multivariate ANOVA (MANOVA) procedures do not require that the sphericity
assumption be met; consequently, statisticians suggest using the MANOVA results rather than those in the standard repeated measures analysis. Mixed model designs can accommodate complex covariance patterns for repeated measures with three or more levels on a factor, eliminating the need to make adjustments. Many different variance–covariance structures are available for fitting data to a mixed model design. Most statistical software can invoke the mixed model design and produce parameter estimates and tests of significance.
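The equality-of-variances-of-difference-scores condition described above can be inspected directly. The sketch below is only an illustration (the repeated measures matrix is simulated, and the subject and level counts are assumptions): it computes the variance of the difference scores for every pair of within-subjects levels; roughly equal values are consistent with sphericity, while widely differing values suggest a violation.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
subjects, levels = 30, 3
y = rng.normal(size=(subjects, levels))   # rows: subjects; columns: repeated measures

# Variance of the difference scores for each pair of within-subjects levels.
for i, j in combinations(range(levels), 2):
    d = y[:, i] - y[:, j]
    print(f"levels {i + 1}-{j + 1}: variance of differences = {d.var(ddof=1):.3f}")
```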
Variance–Covariance Structures

Relationships between levels of the repeated measures are specified in a covariance structure. Several different covariance structures are available from which to choose. The variance–covariance structure of a data set is helpful to determine whether the data fit a specific model. A few of the most common covariance structures are introduced in the following paragraphs.

Diagonal, or Variance, Components Structure

The diagonal, or variance, components structure is the simplest of the covariance patterns. This model, characterized by 1s on the diagonal and 0s on the off-diagonals, is also known as the identity matrix. The variance components pattern displays constant variance and no correlation between the elements. This is an unsatisfactory covariance structure for a mixed model design with repeated measures because the measures from the same subject are not independent.

Compound Symmetry

Compound symmetry displays a covariance pattern among the multiple levels of a single factor in which all the off-diagonal elements are constant (equal) and all diagonal elements are constant (equal) regardless of the time lapse between measurements; thus such a pattern of equal variances and covariances indicates that the sphericity assumption is satisfied. The compound symmetry structure is a special case of the simple variance component model that assumes independent measures and homogeneous variance.

First-Order Autoregressive Structure

The first-order autoregressive structure, or AR(1), model is useful when there are evenly spaced time intervals between the measurements. The covariance between any two levels is a function of the spacing between the measures. The closer the measurements are in time, the higher will be the correlations between adjacent measures. The AR(1) pattern displays a homogeneous structure, with equal variances and covariances that decrease exponentially as the time between measures increases. For example, this exponential function can be seen when an initial measure is taken and repeated at 1 year, 2 years, 3 years, and so forth. Smaller correlations will be observed for observations separated by 3 years than for those separated by 2 years. The model consists of a parameter for the variance of the observations and a parameter for the correlation between adjacent observations. The AR(1) is a special case of the Toeplitz, as discussed later in this entry.
Unstructured Covariance Structure

The variances for each level of a repeated measures factor and the covariances in an unstructured variance–covariance matrix are all different. The unstructured covariance structure (UN), also known as a general covariance structure, is the most heterogeneous of the covariance structures. The UN possibly offers the best fit because every covariance entry can be unique. However, this heterogeneity introduces more complexity in the model and more parameters to be estimated. Unlike the first-order autoregressive structure, which produces two parameters, the UN pattern requires a number of parameters equal to n(n + 1)/2, where n is the number of repeated measures for a factor. For example, 10 parameters are estimated for four repeated measures: 4(4 + 1)/2 = 10.

Toeplitz Covariance Structure

In the Toeplitz covariance model, correlations at the same distance are the same as in the AR(1) model; however, there is no exponential effect. The number of parameters depends on the number of distances between measures. For example, the distance between the initial measure and the second measure would be one parameter; the distance
Mixed Model Design 821

between the second measure and the third measure a covariance structure that accommodates vari-
would be another parameter, and so forth. Like ance heterogeneity is appropriate.
the AR(1) model, the Toeplitz is a suitable choice Another procedure for evaluating a covariance
for evenly spaced measures. matrix involves creating several different probable
models using both the maximum likelihood and the
restricted maximum likelihood methods of parame-
First Order: Ante-Dependence
ter estimation. The objective is to select the covari-
The first-order ante-dependence model is a more ance structure that gives the best fit of the data to
general model than the Toeplitz or the AR(1) mod- the model. Information criterion measures, pro-
els. Covariances are dependent on the product of duced as part of the results of each mixed model
the variances at the two points of interest, and cor- procedure, indicate a relative goodness of fit of the
relations are weighted by the variances of the two data, thus providing guidance in model evaluation
points of interest. For example, a correlation of and selection. The information criteria measures for
.70 for points 1 and 2 and a correlation of .20 for the same data set under different models (different
points 2 and 3 would produce a correlation of .14 covariance structures) and estimated with different
for points 1 and 3. This model requires 2n  1 methods can be compared; usually, the information
parameters to be estimated, where n is the number criterion with the smallest value indicates a better fit
of repeated measures for a factor. of the data to the model. Several different criterion
measures can be produced as part of the statistical
analysis. It is not uncommon for the information
Evaluating Covariance Models
criteria measures to be very close in value.
The data should be examined prior to the anal-
ysis to verify whether the mixed model design or
Hypothesis Testing
the standard repeated measures design is the
appropriate procedure. Assuming that the mixed The number of null hypotheses formulated for a
model procedure is appropriate for the data, the mixed model design depends on the number of
next step is to select the covariance structure that factors in the study. A null hypothesis should be
best models the data. The sphericity test alone is generated for each factor and for every combina-
not an adequate criterion by which to select tion of factors. A mixed model analysis with one
a model. A comparison of information criteria for fixed effect and one random effect generates three
several probable models with different covariance null hypotheses. One null hypothesis would be
structures that uses the maximum likelihood and stated for the fixed effects; another null hypothesis
restricted maximum likelihood estimation methods would be stated for the random effects; and a third
is helpful in selecting the best model. hypothesis would be stated for the interaction of
One procedure for evaluating a covariance the fixed and random effects. If more than one
structure involves creating a mixed model with fixed or random factor is included, multiple inter-
an unstructured covariance matrix and examin- actions may be of interest. The mixed model
ing graphs of the error covariance and correla- design allows researchers to select only the interac-
tion matrices. Using the residuals, the error tions in which they are interested.
covariances or correlations can be plotted sepa- The omnibus F test is used to test each null
rately for each start time, as in a trend analysis. hypothesis for mean differences across levels of the
For example, declining correlations or covar- main effects and interaction effects. The sample
iances with increasing time lapses between mea- means for each factor main effect are compared
sures indicate that an AR(1) or ante-dependence to ascertain whether the difference between the
structure is appropriate. For trend analysis, means can be attributed to the factor rather than
trends with the same mean have approximately to chance. Interaction effects are tested to ascertain
the same variance. This pattern can also be whether a difference between the means of the
observed on a graph with lines showing multiple fixed effects between subjects and the means of
trends. If the means or the lines on the graph are each level of the random effects within subjects is
markedly different and the lines do not overlap, significantly different from zero. In other words,
822 Mixed Model Design

the data are examined to ascertain the extent to of significance allows the researcher to reject or
which changes in one factor are observed across retain the null hypothesis that the variance of the
levels of the other factor. random effect is zero in the population. A nonsig-
nificant random effect can be dropped from the
model, and the analysis can be repeated with one
Interpretation of Results
or more other random effects.
Several tables of computer output are produced for Interaction effects between the fixed and ran-
a mixed model design. The tables allow res- dom effects are also included as variance estimates.
earchers to check the fit of the data to the model Interaction effects are interpreted on the basis of
selected and interpret results for the null hypotheses. their levels of significance. For all effects, if the
95% confidence interval contains zero, the respec-
tive effects are nonsignificant. The residual param-
Model Dimension Table
eter estimates the unexplained variance in the
A model dimension table shows the fixed and dependent variable after controlling for fixed
random effects and the number of levels for each, effects, random effects, and interaction effects.
type of covariance structure selected, and the num-
ber of parameters estimated. For example, AR(1)
and compound symmetry covariance matrices
estimate two parameters whereas the number of
parameters varies for a UN based on the number
of repeated measures for a factor. Advantages
The advantages of the mixed model compensate
Information Criteria Table for the complexity of the design. A major advan-
tage is that the requirement of independence of
Goodness-of-fit statistics are displayed in an
individual observations does not need to be met as
information criteria table. Information criteria
in the general linear model or regression proce-
can be compared when different covariance
dures. The groups formed for higher-level analysis
structures and/or estimation methods are speci-
such as in nested designs and repeated measures
fied for the model. The tables resulting from
are assumed to be independent; that is, they are
different models can be used to compare one
assumed to have similar covariance structures. In
model with another. Information criteria are
the mixed model design, a wide variety of covari-
interpreted such that a smaller value means
ance structures may be specified, thus enabling the
a better fit of the data to the model.
researcher to select the covariance structure that
provides the model of best fit. Equal numbers of
Fixed Effects, Random Effects, repeated observations for each subject are not
and Interaction Effects required, making the mixed model design desirable
for balanced and unbalanced designs. Measures
Parameter estimates for the fixed, random, and
for all subjects need not be taken at the same
interaction effects are presented in separate tables.
points in time. All existing data are incorporated
Results of the fixed effects allow the researcher to
into the analysis even though there may be missing
reject or retain the null hypothesis of no relation-
data points for some cases. Finally, mixed model
ship between the fixed factors and the dependent
designs, unlike general linear models, can be
variable. The level of significance (p value) for
applied to data at a lower level that are contained
each fixed effect will indicate the extent to which
(nested) within a higher level, as in hierarchical
the fixed factor or factors have an effect different
linear models.
from zero on the dependent variable.
A table of estimates of covariance parameters Marie Kraska
indicates the extent to which random factors have
an effect on the dependent variable. Random See also Hierarchical Linear Modeling; Latin Square
effects are reported as variance estimates. The level Design; Sphericity; Split-Plot Factorial Design
Mode 823

Further Readings History


Al-Marshadi, A. H. (2007). The new approach to guide The mode is an unusual statistic among those
the selection of the covariance structure in mixed defined in the field of statistics and probability. It
model. Research Journal of Medicine and Medical
is a counting term, and the concept of counting
Sciences, 2(2), 88–97.
elements in categories dates back prior to human
Gurka, M. J. (2006). Selecting the best linear mixed model
under REML. American Statistician, 60(1), 19–26. civilization. Recognizing the maximum of a cate-
Morrell, C. H., & Brant, L. J. (2000). Lines in random gory, be it the maximum number of predators, the
effects plots from the linear mixed-effects model. maximum number of food sources, and so forth, is
American Statistician, 54(1), 1–4. evolutionarily advantageous.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & The mathematician Karl Pearson is often cited
Wasserman, W. (1996). Applied linear statistical as the first person to use the concept of the mode
models (4th ed.). Boston: McGraw-Hill. in a statistical context. Pearson, however, also used
Pan, W. (2001). Akaike’s information criterion in a number of other descriptors for the concept,
generalized estimating equations. Biometrics, 57(1),
including the ‘‘maximum of theory’’ and the ‘‘ordi-
120–125.
Schwarz, C. J. (1993). The mixed model ANOVA: The
nate of maximum frequency.’’
truth, the computer packages, the books. Part I:
Balanced data. American Statistician, 47(1), 48–59.
VanLeeuwen, D. M. (1997). A note on the covariance Calculation
structure in a linear model. American Statistician,
51(2), 140–144. In order to calculate the mode of a distribution, it
Voss, D. T. (1999). Resolving the mixed models is helpful first to group the data into like categories
controversy. American Statistician, 53(4), 352–356. and to determine the frequency of each observa-
Wolfinger, R. (1993). Covariance structure selection in tion. For small samples, it is often easy to find the
general mixed models. Communications in Statistics— mode by looking at the results. For example, if
Simulation and Computation, 22, 1079–1106.
one were to roll a die 12 times and get the follow-
ing results,
MODE f1, 3, 2, 4, 6, 3, 4, 3, 5, 2, 5, 6g,

Together with the mean and the median, the mode it is fairly easy to see that the mode is 3.
is one of the main measurements of the central ten- However, if one were to roll the die 40 times
dency of a sample or a population. The mode is and list the results, the mode is less obvious:
particularly important in social research because it
f6, 5, 5, 4, 4, 1, 6, 6, 3, 4, 4, 4, 2, 5,
is the only measure of central tendency that is rele-
vant for any data set. That being said, it rarely 5, 4, 4, 1, 2, 1, 4, 5, 5, 1, 3, 5, 2, 4,
receives a great deal of attention in statistics 2, 4, 2, 4, 4, 6, 5, 2, 1, 1, 4, 5g:
courses. The purpose of this entry is to identify the
role of the mode in relation to the median and the In Table 1, the data are grouped by frequency,
mean for summarizing various types of data. making it obvious that the mode is 4.

Definition Table 1 Data of Frequency of Die Rolled 40 Times

The mode is generally defined as the most frequent Number Observations


observation or element in the distribution. Unlike
1 6
the mean and the median, there can be more
2 6
than one mode. A sample or a population with one
3 2
mode is unimodal. One with two modes is bimodal,
4 13
one with three modes is trimodal, and so forth. In
5 9
general, if there is more than one mode, one can
6 4
say that a sample or a distribution is multimodal.
824 Mode

Table 2 Frequencies in Which Individual Data Are


Grouped Into Ranges of Categories
Classes (income Frequencies (number
by categories) in each category)
0–$20k 12
$20k–$40k 23
$60k–$60k 42
$60k–$80k 25
$80k–$100k 9
$100k–$120k 2
Total 113
5 15

Often, a statistician will be faced with a table of


frequencies in which the individual data have been Figure 1 Example of Modes for Continuous
grouped into ranges of categories. Table 2 gives an Distributions
example of such a table. We can see that the cate-
gory of incomes between $40,000 and $60,000 data, although some organizing of the data may
has the largest number of members. We can call be necessary.
this the modal class. Yadolah Dodge has outlined For continuous data, the concept of mode is less
a method for calculating a more precise estimate obvious. No value occurs more than once in con-
for the mode in such circumstances: tinuous probability distributions, and therefore, no
  single value can be defined as the mode with the
d1
mode ¼ L1 þ ×c, discrete definition. Instead, for continuous distri-
d1 þ d2 butions, the mode occurs at a local maximum in
the data. For example, in Figure 1, the modes
where L1 ¼ lower value of the modal category;
occur at 5 and 15.
d1 ¼ difference between the number in the modal
As Figure 1 shows, for continuous distributions,
class and the class below; d2 ¼ difference between
a distribution can be defined as multimodal even
the number in the modal class and the class above;
when the local maximum at 5 is greater than the
and c ¼ length of the interval within the modal
local maximum at 15.
class. (This interval length should be common for
all intervals.)
In this particular example, the mode would be The Stevens Classification System
  In social science statistical courses, data are
19
mode ¼ $40; 000 þ ×$20; 000: often classified according to data scales outlined
19 þ 17
by Stanley Smith Stevens.
Therefore, the mode would be estimated to be According to Stevens’s classification system,
$50,556. data can be identified as nominal, ordinal, interval,
or ratio. As Stevens points out, the mode is the
only measure of central tendency applicable to all
The Mode for Various data scales.
Types of Data Classification Data that comply with ratio scales can be either
continuous or discrete and are amenable to all
Discrete and Continuous Data
forms of statistical analysis. In mathematical
In mathematical statistics, distributions of terms, data can be drawn from the real, integer,
data are often discussed under the heading of or natural number systems. Although the mode
discrete or continuous distributions. As the pre- applies as a measure of central tendency for such
vious examples have shown, computing the data, it is not usually the most useful measurement
mode is relatively straightforward with discrete of central tendency. An example of ratio-scale data
Mode 825

would be the total net worth of a randomly only in order of value. For example, a person
selected sample of individuals. So much variability could be asked to rank items on a scale of 1 to 5
is possible in the possible outcomes that unless the in terms of his or her favorite. Likert-type scales
data are grouped into discrete categories (say are an example of this sort of data. A value of 5 is
increments of $5,000 or $10,000) the mode does greater than a value of 4, but an increment from 4
not summarize the central tendency of the data to 5 does not necessarily represent the same
well by itself. increase in preference that an increase from 1 to 2
Interval data are similar to ratio data in that it does. Because of these characteristics, reporting
is possible to carry out detailed mathematical the median and the mode for this type of data
operations on them. As a result, it is possible to makes sense, but the mean does not.
take the mean and the median as measures of cen- In the case of nominal-scale data, the mode is
tral tendency. However, interval data lack a true the only meaningful measure of central tendency.
0 value. Nominal data tell the analyst nothing about the
For example, in measuring household size, it order of the data. In fact, data values do not even
is conceivable that a household can possess very need to be labeled as numbers. One might be inter-
large numbers of members, but generally this is ested in which number a randomly selected group
rare. Many households have fewer than 10 mem- of hockey players at a hockey camp wear on their
bers; however, some modern extended families jersey back home in their regular hockey league.
might run well into double digits. At the extreme, One could select from two groups:
it is possible to observe medieval royal or aristo-
cratic households with potentially hundreds of Group 1 : f1, 3, 4, 4, 4, 7, 7, 10, 11, 99g;
members. However, it is nonsensical to state that Group 2 : f1, 4, 8, 9, 11, 44, 99, 99, 99, 99g:
a household has zero members. It is also nonsensi-
cal to say that a specific household has 1.75 mem- For these groups, taking the mean and the
bers. However, it is possible to say that the mean median are meaningless as measures of central ten-
household size in a geographic region (a country, dency (the number 11 does not represent more
province, or city) is 2.75 or 2.2 or some other value than 4). However, the mode of Group 1 is 4,
number. For interval data, the mode, the median, and the mode of Group 2 is 99. With some back-
and the mean frequently provide valuable but dif- ground information about hockey, the anal-
ferent information about the central tendencies of yst could hypothesize which populations the two
the data. The mean may be heavily influenced by groups are drawn from. Group 2 appears to be
large low-prevalence values in the data; however, made up of a younger group of players whose
the median and the mode are much less influenced childhood hero is Wayne Gretzky (number 99),
by them. and Group 1 is likely made up of an older group
As an example of the role of the mode in sum- of fans of Bobby Orr (number 4).
marizing interval data, if a researcher were inter- Another interesting property of the mode is that
ested in comparing the household size on different the data do not actually need to be organized as
streets, A Street and B Street, he or she might visit numbers, nor do they need to be translated into
both and record the following household sizes: numbers. For example, an analyst might be inter-
ested in the first names of CEOs of large corp-
A Street : f3, 1, 6, 2, 1, 1, 2, 3, 2, 4, 2, 1, 4g;
orations in the 1950s. Examining a particular
B Street : f2, 5, 3, 6, 7, 9, 3, 4, 1, 2, 1g: newspaper article, the analyst might find the fol-
lowing names:
Comparing the measures of central tendency (A
Street: Mean ¼ 2.5, Median ¼ 2, Mode ¼ 1 and 2;
B Street: Mean ¼ 3.8, Median ¼ 3, Mode ¼ 3) gives Names : fTed, Gerald, John, Martin, John,
a clearer picture of the nature of the streets than Peter, Phil, Peter, Simon, Albert, Johng
any one measure of central tendency in isolation.
For data in ordinal scales, not only is there no In this example, the mode would be John, with
absolute zero, but one can also rank the elements three listings. As this example shows, the mode is
826 Models

particularly useful for textual analysis. Unlike the f30,000, 30,000, 30,000, 30,000, 30,000,
median and the mean, it is possible to take counts 15,000, 15,000, 15,000, 5,000, 5,000g:
of the occurrence of words in documents, speech,
or database files and to carry out an analysis from
Now the mode is $30,000, the median is
such a starting point.
(30,000 þ 15,000)/2 ¼ $22,500, and the mean ¼
Examination of the variation of nominal data
$20,500.
is also possible by examining the frequencies
Finally, the following observations are possible:
of occurrences of entries. In the above example,
it is possible to summarize the results by
f30,000, 30,000, 30,000, 15,000, 15,000,
stating that John represents 3/11 (27.3%) of the
entries, Peter represents 2/11 (18.2%) of the 15,000, 15,000, 15,000, 5,000, 5,000g:
entries and that each other name represents 1/11
(9.1%) of the entries. Through a comparison of In this case, the mode is $15,000, the median is
frequencies of occurrence, a better picture of the $15,000, and the mean is $17,500. The distribu-
distribution of the entries emerges even if one tion has much less skew and is nearly symmetric.
does not have access to other measures of central The concept of skew can be dealt with using
tendency. very complex methods in mathematical statistics;
however, often simply reporting the mode together
with the median and the mean is a useful method
of forming first impressions about the nature of
A Tool for Measuring a distribution.
the Skew of a Distribution
Gregory P. Butler
Skew in a distribution is a complex topic; however,
comparing the mode to the median and the mean See also Central Tendency, Measures of; Interval Scale;
can be useful as a simple method of determining Likert Scaling; Mean; Median; Nominal Scale
whether data in a distribution are skewed. A sim-
ple example is as follows. In a workplace a com-
Further Readings
pany offers free college tuition for the children of
its employees, and the administrator of the plan Dodge, Y. (1993). Statistique: Dictionnaire
is interested in the amount of tuition that the pro- encyclopédique [Statistics: Encyclopedic dictionary].
gram may have to pay. For simplicity, take three Paris: Dunod.
Magnello, M. E. (2009). Karl Pearson and the
different levels of tuition. Private university tuition
establishment of mathematical statistics. International
is set at $30,000, out-of-state-student public uni-
Statistical Review, 77(1), 3–29.
versity tuition is set at $15,000, and in-state tuition Pearson, K. (1895). Contribution to the mathematical
is set at $5,000. theory of evolution: Skew variation in homogenous
In this example, 10 students qualify for tuition material. Philosophical Transactions of the Royal
coverage for the current year. The distribution of Society of London, 186(1), 344–434.
tuition amounts for each student is as follows: Stevens, S. S. (1946). On the theory of scales of
measurement. Science, 103(2684), 677–680.
f5,000, 5,000, 5,000, 5,000, 5,000, 15,000,
15,000, 15,000, 30,000, 30,000g
MODELS
Hence, the mode is $5,000. The median
is equal to (15,000 þ 5,000)/2 ¼ $10,000. The It used to be said that models were dispensable aids
mean ¼ $13,000. The tuition costs are skewed to formulating and understanding scientific theo-
toward low tuition, and therefore the distribu- ries, perhaps even props for poor thinkers. This
tion has a negative skew. negative view of the cognitive value of models in
A positively skewed distribution would occur if science contrasts with today’s view that they are an
the tuition payments were as follows: essential part of the development of theories, and
Models 827

more besides. Contemporary studies of scientific form. A scale model of an aircraft prototype, for
practice make it clear that models play genuine example, may be built to test its basic aerody-
and indispensable cognitive roles in science, provid- namic features in a wind tunnel.
ing a basis for scientific reasoning. This entry
describes types and functions of models commonly
Analogue Models
used in scientific research.
Analogue, or analogical, models express relevant
relations of analogy between the model and the
Types of Models
reality being represented. Analogue models are
Given that just about anything can be a model important in the development of scientific theories.
of something for someone, there is an enormous The requirement for analogical modeling often
diversity of models in science. The many senses of stems from the need to learn about the nature of
the word model that stem from this bewildering hidden entities postulated by a theory. Analogue
variety Max Wartofsky has referred to as the models also serve to assess the plausibility of our
‘‘model muddle.’’ It is not surprising, then, that the new understanding of those entities.
wide diversity of models in science has not been Analogical models employ the pragmatic strat-
captured by some unitary account. However, philo- egy of conceiving of unknown causal mechanisms
sophers such as Max Black, Peter Achinstein, and in terms of what is already familiar and well under-
Rom Harré have provided useful typologies that stood. Well-known examples of models that have
impose some order on the variety of available mod- resulted from this strategy are the molecular model
els. Here, discussion is confined to four different of gases, based on an analogy with billiard balls in
types of model that are used in science: scale mod- a container; the model of natural selection, based
els, analogue models, mathematical models, and on an analogy with artificial selection; and, the
theoretical models. computational model of the mind, based on an
analogy with the computer.
To understand the nature of analogical model-
Scale Models
ing, it is helpful to distinguish between a model,
As their name suggests, scale models involve the source of the model, and the subject of the
a change of scale. They are always models of model. From the known nature and behavior of
something, and they typically reduce selected the source, one builds an analogue model of the
properties of the objects they represent. Thus, unknown subject or causal mechanism. To take
a model airplane stands as a miniaturized repre- the biological example just noted, Charles
sentation of a real airplane. However, scale mod- Darwin fashioned his model of the subject of
els can stand as a magnified representation of an natural selection by reasoning analogically from
object, such as a small insect. Although scale the source of the known nature and behavior of
models are constructed to provide a good resem- the process of artificial selection. In this way,
blance to the object or property being modeled, analogue models play an important creative role
they represent only selected relevant features of in theory development. However, this role
the object. Thus, a model airplane will almost requires the source from which the model is
always represent the fuselage and wings of the drawn to be different from the subject that
real airplane being modeled, but it will seldom is modeled. For example, the modern computer
represent the interior of the aircraft. Scale mod- is a well-known source for the modeling of
els are a class of iconic models because they liter- human cognition, although our cognitive appa-
ally depict the features of interest in the original. ratus is not generally thought to be a real com-
However, not all iconic models are scale models, puter. Models in which the source and the
as for example James Watson and Francis Crick’s subject are different are sometimes called para-
physical model of the helical structure of the morphs. Models in which the source and the
DNA molecule. Scale models are usually built in subject are the same are sometimes called home-
order to present the properties of interest in the omorphs. The paramorph can be an iconic, or
original object in an accessible and manipulable pictorial, representation of real or imagined
828 Models

things. It is iconic paramorphs that feature Theoretical Models


centrally in the creative process of theory devel-
Finally, the important class of models known
opment through analogical modeling.
as theoretical models abounds in science. Unlike
In evaluating the aptness of an analogical
scale models, theoretical models are constructed
model, the analogy between its source and subject
and described through use of the scientist’s imag-
must be assessed, and for this one needs to con-
ination; they are not constructed as physical
sider the structure of analogies. The structure of
objects. Further, unlike scale, analogical, and
analogies in models comprises a positive analogy
mathematical models, the properties of theoreti-
in which source and subject are alike, a negative
cal models are better known than the subject
analogy in which source and subject are unlike,
matter that is being modeled.
and a neutral analogy where we have no reliable
A theoretical model of an object, real or imag-
knowledge about matched attributes in the source
ined, comprises a set of assumptions about that
and subject of the model. The negative analogy is
object. The Watson–Crick model of the DNA mol-
irrelevant for purposes of analogical modeling.
ecule and Markov models of human and animal
Because we are essentially ignorant of the nature
learning are two examples of the innumerable the-
of the hypothetical mechanism of the subject apart
oretical models to be found in science. Theoretical
from our knowledge of the source of the model,
models typically describe an object by ascribing to
we are unable to specify any negative analogy
it an inner mechanism or structure. This mecha-
between the model and the mechanism being mod-
nism or structure is frequently invoked in order to
eled. Thus, in considering the plausibility of an
explain the behavior of the object. Theoretical
analogue model, one considers the balance of
models are acknowledged for their simplifying
the positive and neutral analogies. This is where
approximation to the object being modeled, and
the relevance of the source for the model is
they can often be combined with other theoretical
spelled out.
models to help provide a comprehensive under-
An example of an analogue model is Rom
standing of the object.
Harré’s rule-model of microsocial interaction, in
which Erving Goffman’s dramaturgical perspective
provides the source model for understanding the Data, Models, and Theories
underlying causal mechanisms involved in the pro-
Data Models
duction of ceremonial, argumentative, and other
forms of social interaction. In the 1960s, Patrick Suppes drew attention to
the fact that science employs a hierarchy of models.
He pointed out that theoretical models, which are
high in the hierarchy, are not compared directly
Mathematical Models
with empirical data, but with models of the data,
In science, particularly in the social and behav- which are lower in the hierarchy.
ioral sciences, models are sometimes expressed in Data on their own are intractable. They are often
terms of mathematical equations. Mathematical rich, complex, and messy and, because of these
models offer an abstract symbolic representation characteristics, cannot be explained. Their intracta-
of the domain of interest. These models are often bility is overcome by reducing them into simpler
regarded as formalized theories in which the sys- and more manageable forms. In this way, data are
tem modeled is projected on the abstract domain reworked into models of data. Statistical methods
of sets and functions, which can be manipulated in play a prominent role in this regard, facilitating
terms of numerical reasoning, typically with the operations to do with assessing the quality of the
help of a computer. For example, factor analysis is data, the patterns they contain, and the generaliza-
a mathematical model of the relations between tions to which they give rise. Because of their tracta-
manifest and latent variables, in which each mani- bility, models of the data can be explained and used
fest variable is regarded as a linear function of as evidence for or against theoretical models. For
a common set of latent variables and a latent vari- this reason, they are of considerable importance
able that is unique to the manifest variable. in science.
Models 829

Models and Theories can be argued that in science, models and theories
are different representational devices. Consistent
The relationship between models and theories is
with this distinction between models and theories,
difficult to draw, particularly given that they can
William Wimsatt has argued that science often
both be conceptualized in different ways. Some
adopts a deliberate strategy of adopting false mod-
have suggested that theories are intended as true
els as a means by which we can obtain truer theo-
descriptions of the real world, whereas models
ries. This is done by localizing errors in models in
need not be about the world, and therefore need
order eliminate other errors in theories.
not be true. Others have drawn the distinction by
claiming that theories are more abstract and gen-
eral than models. For example, evolutionary psy- Abstraction and Idealization
chological theory can be taken as a prototype for
It is often said that models provide a simplified
the more specific models it engenders, such as
depiction of the complex domains they often
those of differential parental investment and the
represent. The simplification is usually achieved
evolution of brain size. Relatedly, Ronald Giere
through two processes: abstraction and idealiza-
has argued that a scientific theory is best under-
tion. Abstraction involves the deliberate elimina-
stood as comprising a family of models and a set
tion of those properties of the target that are not
of theoretical hypotheses that identify things in the
considered essential to the understanding of that
world that apply to a model in the family.
target. This can be achieved in various ways; for
Yet another characterization of models takes
example, one can ignore the properties, even
them to be largely independent of theories. In argu-
though they continue to exist; one can eliminate
ing that models are ‘‘autonomous agents’’ that
them in controlled experiments; or one can set the
mediate between theories and phenomena, Mar-
values of unwanted variables to zero in simula-
garet Morrison contends that they are not fully
tions. By contrast, idealization involves transform-
derived from theory or data. Instead, they are tech-
ing a property in a system into one that is related,
nologies that allow one to connect abstract theories
but which possesses desirable features introduced
with empirical phenomena. Some have suggested
by the modeler. Taking a spheroid object to be
that the idea of models as mediators does not apply
spherical, representing a curvilinear relation in lin-
to the behavioral and biological sciences because
ear form, and assuming that an agent is perfectly
there is no appreciable gap between fundamental
rational are all examples of idealization. Although
theory and phenomena in which models can
the terms abstraction and idealization are some-
mediate.
time used interchangeably, they clearly refer to dif-
ferent processes. Each can take place without the
other, and idealization can in fact take place with-
The Functions of Models out simplification.
Representation Brian D. Haig
Models can variously be used for the purposes
See also A Priori Monte Carlo Simulation; Exploratory
of systematization, explanation, prediction, con-
Factor Analysis; General Linear Model; Hierarchical
trol, calculation, derivation, and so on. In good
Linear Modeling; Latent Growth Modeling; Multilevel
part, models serve these purposes because they can
Modeling; Scientific Method; Structural Equation
often be taken as devices that represent parts of
Modeling
the world. In science, representation is arguably
the main function of models. However, unlike sci-
entific theories, models are generally not thought Further Readings
to be the sort of things that can be true or false. Abrantes, P. (1999). Analogical reasoning and modeling
Instead, we may think of models as having a kind in the sciences. Foundations of Science, 4, 237–270.
of similarity relationship with the object that is Black, M. (1962). Models and metaphors: Studies in
being modeled. With analogical models, for exam- language and philosophy. Ithaca, NY: Cornell
ple, the similarity relationship is one of analogy. It University Press.
830 Monte Carlo Simulation

Giere, R. (1988). Explaining science. Chicago: University A Monte Carlo simulation study is a systematic
of Chicago Press. investigation of the properties of some quantitative
Harré, R. (1976). The constructive role of models. In method under a variety of conditions in which
L. Collins (Ed.), The use of models in the social a set of Monte Carlo simulations is performed.
sciences (pp. 16–43). London: Tavistock.
Thus, a Monte Carlo simulation study consists of
MacCallum, R. C. (2003). Working with imperfect
models. Multivariate Behavioral Research, 38,
the findings from applying a Monte Carlo simula-
113–139. tion to a variety of conditions. The goal of a Monte
Morgan, M., & Morrison, M. (Eds.). (1999). Models as Carlo simulation study is often to make general
mediators. Cambridge, UK: Cambridge University statements about the various properties of the
Press. quantitative method under a wide range of situa-
Suppes, P. (1962). Models of data. In E. Nagel, P. Suppes, tions. So as to discern the properties of the
& A. Tarski (Eds.). Logic, methodology, and quantitative method generally, and to search for
philosophy of science: Proceedings of the 1960 inter-action effects in particular, a fully crossed fac-
International Congress (pp. 252–261). Stanford, CA:
torial design is often used, and a Monte Carlo sim-
Stanford University Press.
ulation is performed for each combination of the
Wartofsky, M. (1979). Models: Representation and the
scientific understanding. Dordrecht, the Netherlands: situations in the factorial design. After the data
Reidel. have been collected from the Monte Carlo simula-
Wimsatt, W. C. (1987). False models as means to truer tion study, analysis of the data is necessary so that
theories. In M. Nitecki & A. Hoffman (Eds.), Neutral the properties of the quantitative procedure can be
models in biology (pp. 23–55). London: Oxford discerned. Because such a large number of replica-
University Press. tions (e.g., 10,000) are performed for each condi-
tion, the summary findings from the Monte Carlo
simulations are often regarded as essentially popu-
lation values, although confidence intervals for the
estimates is desirable.
MONTE CARLO SIMULATION The general rationale of Monte Carlo simula-
tions is to assess various properties of estimators
A Monte Carlo simulation is a methodological and/or procedures that are not otherwise mathe-
technique used to evaluate the empirical properties matically tractable. A special case of this is com-
of some quantitative method by generating ran- paring the nominal and empirical values (e.g.,
dom data from a population with known proper- Type I error rate, statistical power, standard error)
ties, fitting a particular model to the generated of a quantitative method. Nominal values are
data, collecting relevant information of interest, those that are specified by the analyst (i.e., they
and replicating the entire procedure a large num- represent the desired), whereas empirical values
ber of times (e.g., 10,000) in order to obtain prop- are those observed (i.e., they represent the actual)
erties of the fitted model under the specified from the Monte Carlo simulation study. Ideally,
condition(s). Monte Carlo simulations are gener- the nominal and empirical values are equivalent,
ally used when analytic properties of the model but this is not always the case. Verification that the
under the specified conditions are not known or nominal and empirical values are consistent can be
are unattainable. Such is often the case when no the primary motivation for using a Monte Carlo
closed-form solutions exist, either theoretically or simulation study.
given the current state of knowledge, for the As an example, under certain assumptions the
particular method under the set of conditions of standardized mean difference follows a known dis-
interest. When analytic properties are known for tribution, which in this case allows for exact ana-
a particular set of conditions, Monte Carlo simula- lytic confidence intervals to be constructed for the
tion is unnecessary. Due to the computational population standardized mean difference. One of
tediousness of Monte Carlo methods because of the assumptions on which the analytic procedure
the large number of calculations necessary, in prac- is based is that in the population, the scores within
tice they are essentially always implemented with each of the two groups distribute normally. In
one or more computers. order to evaluate the effectiveness of the (analytic)
Monte Carlo Simulation 831

approach to confidence interval formation when procedure or model works in the specified input
the normality assumption is not satisfied, Ken Kel- conditions.
ley implemented a Monte Carlo simulation study A particular implementation of the Monte
and compared the nominal and empirical confi- Carlo method is a method known as Markov
dence interval coverage rates. Kelley also com- Chain Monte Carlo, which is a method used
pared the analytic approach to confidence interval to sample from various probability distributions
formation using two bootstrap approaches so as to based on a specified model in order to form sam-
determine whether the bootstrap performed better ple means for approximating expectations. Mar-
than the analytic approach under certain types of kov Chain Monte Carlo techniques are most
nonnormal data. Such comparisons require Monte often used in the Bayesian approach to statistical
Carlo simulation studies because no formula-based inference, but they can also be used in the fre-
comparisons are available as the analytic proce- quentist approach.
dure is based on the normality assumption, which The term Monte Carlo was coined in the mid-
was (purposely) not realized in the Monte Carlo 1940s by Nicholas Metropolis while working at
simulation study. the Los Alamos National Laboratory with Stanis-
As another example, under certain assumptions law Ulam and John von Neumann, who proposed
and an asymptotically large sample size, the sam- the general idea and formalized how determinate
ple root mean square error of approximation mathematical problems could be solved with ran-
(RMSEA) follows a known distribution, which dom sampling from a specified model a large num-
allows confidence intervals to be constructed for ber of times, because of the games of chance
the population RMSEA. However, the effective- commonly played in Monte Carlo, Monaco, with
ness of the confidence interval procedure had not the idea of repeating a process a larger number
been well known for finite, and in particular small, of times and then examining the outcomes. The
sample sizes. Patrick Curran and colleagues have Monte Carlo method essentially replaced what
evaluated the effectiveness of the (analytic) con- was previously termed statistical sampling. Statisti-
fidence interval procedure for the population cal sampling was used famously by William Sealy
RMSEA by specifying a model with a known Gossett, who published under the name Student,
population RMSEA, generating data, forming before finalizing the statistical theory of the t dis-
a confidence interval for the population RMSEA, tribution and was reported in his paper to show
and replicating the procedure a large number of a comparison of empirical and nominal properties
times. Of interest was the bias when estimating the of the t distribution.
population RMSEA from sample data and the pro-
portion of confidence intervals that correctly Ken Kelley
bracketed the known population RMSEA, so as to
See also A Priori Monte Carlo Simulation; Law of Large
determine whether the empirical confidence inter-
Numbers; Normality Assumption
val coverage was equal to the nominal confidence
interval coverage (e.g., 90%).
A Monte Carlo simulation is a special case of
Further Readings
a more general method termed the Monte Carlo
method. The Monte Carlo method, in general, Browne, M. W., & Cudeck, R. (1993). Alternative ways
uses many sets of randomly generated data under of assessing model fit. In K. A. Bollen & J. S. Long
some input specifications and applies a particular (Eds.), Testing structural equation models (pp. 136–
procedure or model to each set of the randomly 162). Newbury Park, CA: Sage.
generated data so that the output of interest from Currran, P. J., Bollen, K. A., Chen, F., Paxton, P., &
Kirby, J. B. (2003). Finite sampling properties of the
each fit of the procedure or model to the randomly
point estimates and confidence intervals of the
generated data can be obtained and evaluated. RMSEA. Sociological Methods Research, 32,
Because of the large number of results of interest 208–252.
from the fitted procedure or model to the Gilks, W. R., Richardson, S., & Spiegelhalter, D. J.
randomly generated data sets, the summary of (1996). Introducing Markov chain Monte Carlo. In
the results describes the properties of how the W. R. Gilks, S. Richardson, & D. J. Spiegelhalter
832 Mortality

(Eds.), Markov chain Monte Carlo in practice sex and age. Important subgroup mortality rates,
(pp. 1–20). New York: Chapman & Hall. as recognized by the World Health Organization,
Kelley, K. (2005). The effects of nonnormal distributions include the neonatal mortality rate, or deaths
on confidence intervals around the standardized mean during the first 28 days of life per 1,000 live
difference: Bootstrap and parametric confidence
births; the infant mortality rate, or the probabil-
intervals. Educational & Psychological Measurement,
65, 51–69.
ity of a child born in a specific year or period
Metropolis, N. (1987). The beginning of the Monte dying before reaching the age of 1 year; and the
Carlo method. Los Alamos Science, 125–130. maternal mortality rate, or the number of mater-
Student. (1908). The probable error of the mean. nal deaths due to childbearing per 100,000 live
Biometrika, 6, 1–25. births. The adult mortality rate refers to death
rate between 15 and 60 years of age. Age-specific
mortality rates refer to the number of deaths in
a year (per 100,000 individuals) for individuals
MORTALITY of a certain age bracket. In comparing mortality
rates between groups, age and other demograph-
Mortality refers to death as a study endpoint or ics must be borne in mind. Mortality rates may
outcome. Broader aspects of the study of death be standardized to adjust for differences in the
and dying are embraced in the term thanatology. age distributions of populations.
Survival is an antonym for mortality. Mortality
may be an outcome variable in populations or
Use in Research Studies
samples, associated with treatments or risk fac-
tors. It may be a confounder of other outcomes Mortality and survival are central outcomes in
due to resultant missing data or to biases a variety of research settings.
induced when attrition due to death results in
structural changes in a sample. Mortality is an Clinical Trials
event that establishes a metric for the end of the
life span. Time to death is frequently used as an In clinical trials studying treatments for life-
outcome and, less frequently, as a predictor vari- threatening illnesses, survival rate is often the pri-
able. This entry discusses the use and analysis of mary outcome measure. Survival rate is evaluated
mortality data in research studies. as 1 minus the corresponding mortality rate. Ran-
domized controlled trials are used to compare sur-
vival rates in patients receiving a new treatment to
Population Mortality Rates that in patients receiving a standard or placebo
Nearly all governments maintain records of treatment. The latter is commonly known as the
deaths. Thus many studies of mortality are based control group. Such trials should be designed to
on populations rather than samples. The most recruit sufficient numbers of patients and to follow
common index of death in a specific group is its them for long enough to observe deaths likely to
mortality rate. Interpretation of a mortality rate occur due to the illness to ensure adequate statisti-
requires definition of the time, causes of death, cal power to detect differences in rates.
and groups involved. Mortality rates are usually
specified as the number of deaths in a year per Epidemiological Studies
1,000 individuals, or in circumstances where
mortality is rarer, per 100,000 individuals. A Epidemiological studies of mortality compare
mortality rate may be cause specific, that is, refer rates of death across different groups defined by
to death due to a single condition, such as a dis- demographic measures, by risk factors, by expo-
ease or type of event or exposure. All-cause mor- sures, or by location.
tality refers to all deaths regardless of their
Prospective Studies
cause. Mortality rates are often calculated for
whole populations but can be expected to vary Many studies make an initial assessment of a
as a function of demographic variables, notably sample of interest and follow up with participants
Mortality 833

at least once, but often at multiple, regular inter- Source of Information


vals. Such studies are in a strong position to make
In many countries, registries of deaths may be
causal inferences by examining the association of
accessed by researchers and searched by name and
previously measured variables with survival out-
date of birth. Newspaper death notices may also
come. A variation on this design has measurement
be a source of information. Ideally, multiple meth-
occasions defined by an event such as an episode
ods of tracking the vital status of participants
of illness or a change in personal circumstances
should be used to ensure that deaths are not
(for example, unemployment, marriage, parent-
missed. It is critical that local legal and ethical
hood, criminal conviction). In studies of this type,
requirements are adhered to when data or personal
the death of participants may be an outcome of
information is used in this way. In prospective
interest or a source of missing data.
studies, vital status may be collected when follow-
up interviews are arranged and at any other points
Retrospective Studies of contact.
It is also possible to conduct retrospective stud-
ies using previously collected data to examine risk Follow-Up of Subjects
factors for mortality. For example, a previously
conducted cross-sectional study may be augmented Studies of mortality frequently need to follow
with the current vital status information of up with participants, often over an extended time,
participants. to accrue sufficient deaths for statistical power and
rigor. The amount of time required will depend on
the population being studied. Prospective studies,
Case–Control Studies particularly clinical trials, are commonly designed
Case–control studies involve recruiting a sample to follow participants from the time they enter the
of participants who are classified on the basis of study until either the time of death or the study’s
a certain condition (cases) and a separate sample end. When definitive ascertainment of vital status
of participants who do not meet the requirements cannot be achieved, uncertainty remains about
of the condition (controls). Frequently cases are whether the participant is dead, has withdrawn
individuals diagnosed with an illness who are still from the study or has simply lost contact.
living, but this design can be applied to those who In retrospective studies, it may be more difficult
have died from a specific cause. When the cases to track down participants, as their current infor-
and controls are appropriately matched on con- mation may be incomplete or may have changed
founding variables, the odds of risk factor status in over time. Participants may have moved or changed
cases versus that in controls can be calculated. names, which makes tracking more difficult, espe-
This odds ratio can be shown to estimate relative cially when such information is not updated on
risk of illness given exposure. This design is partic- a regular basis. Losing track of participants can
ularly useful for studying risk factors for uncom- lead to problems of response bias.
mon conditions and associated mortality where it In prospective studies, researchers should collect
would be difficult to find a sufficiently large num- sufficient tracking information at study commence-
ber of cases in an acceptable period. ment, such as comprehensive contact information
for the participant and for their associates (e.g.,
relatives, friends, medical providers). Tracking
information can be updated at each data collection
Assessment of Vital Status
juncture. The drawback of prospective studies,
Vital status is an indicator for the living status of however, is that following participants over long
an individual; that is, it indicates whether a person periods can be very costly in terms of time
is alive or dead at a particular time. In studies that and money. In the case of retrospective studies,
incorporate mortality as one of their study end- tracking is performed only once based on existing
points, information on vital status of participants information, causing minimal time and financial
must be collected and updated. burden.
834 Mortality

Analysis of Mortality Data may be an important missing data problem. This


applies particularly to studies of older or unwell
Statistical Methods samples. Survivor bias can occur when individuals
In mortality studies, the length of time from with an attribute or risk factor of interest have
study entry to death, commonly known as survival a higher mortality rate than other members of the
time, is frequently specified as the primary out- sample. This can result in a risk factor for early
come measure, with participants being followed mortality appearing to be protective against the
over time on their vital status. However, due to development of diseases of old age, such as Alzhei-
time constraints and limited resources, studies are mer’s disease. In this case, investigators can evalu-
designed to follow participants for a predetermined ate the chance of developing Alzheimer’s disease
period. Thus not all participants will be followed using a competing risks analysis incorporating
until death. This gives rise to censored data. For death as a competing event.
participants who are still alive at the end of a study, Methods of handling missingness due to other
all that can be said is they have survived at least as factors may be applied equally when mortality is
long as the period of observation or follow-up and the cause of missing observations. There is no
are not yet dead. Sophisticated statistical methods inherent objection to the use of multiple imputa-
are required in the presence of censored data. tion to create values of deceased participants. Of
Collectively known as survival analysis, these tech- more concern is whether the assumptions underly-
niques may be applied for any time-to-event analy- ing the handling of missing observations are met.
ses, but they originated and are strongly associated At best these techniques allow missingness to be
with the modeling and prediction of time to death. associated with observed variables but not with
A variety of techniques have been developed, the status of the missing observations themselves
including Kaplan–Meier plots, Cox proportional (missing at random). In many situations this
hazards models, and frailty models. assumption is not likely to be met, and the miss-
Survival time is usually modeled using Cox pro- ingness mechanism is thus nonignorable and must
portional hazards regression models, which take be explicitly modeled. These procedures involve
into account both survival time and whether the specifying mechanisms regarding the relationship
observation is censored. In Cox regression, sur- between mortality and missing observations that
vival time is considered a continuous outcome. cannot be evaluated from the data themselves.
Many statistical software packages also provide Often the results of such analyses are speculative
methods for modeling survival time that is mea- and should be accompanied by sensitivity analyses.
sured discretely (for example, when only wave of
Philip J. Batterham, Andrew J. Mackinnon,
measurement is available). An alternative method
and Kally Yuen
is to simultaneously model additional outcomes,
such as dementia or institutionalization, with See also Clinical Trial; Cohort Design; Last Observation
mortality in a latent random-effects model, called Carried Forward; Latent Growth Modeling;
a frailty model. Latent growth models—a form of Longitudinal Design; Missing Data, Imputation of;
structural equation modeling—are another way of Mixed- and Random-Effects Models; Survival
examining relationships between risk factors over Analysis
time when modeling survival. Time to death can
also be used as a predictor variable, in place of or
in addition to chronological age or time in study. Further Readings
This measure may be a more useful indicator of
Dufouil, C., Brayne, C., & Clayton, D. (2004). Analysis
biological aging than age (time since birth) in stud- of longitudinal studies with death and drop-out: A
ies of late-life physical and cognitive function. case study. Statistics in Medicine, 23, 2215–2226.
Harel, O., Hofer, S. M., Hoffman, L., Pedersen, N. L., &
Mortality and Missing Data Johansson, B. (2007). Population inference with
mortality and attrition in longitudinal studies on
In longitudinal studies in which mortality is aging: A two-stage multiple imputation method.
not the outcome of interest, participant mortality Experimental Aging Research, 33, 187–203.
Multilevel Modeling 835

Hosmer, D. W., Lemeshow, S., & May, S. (2008). History and Advantages
Applied survival analysis: Regression modeling of time
to event data. New York: Wiley Interscience. Statistical analyses conducted within an MLM
Ripatti, S., Gatz, M., Pedersen, N. L., & Palmgren, J. framework date back to the late 19th century and
(2003). Three-state frailty model for age at onset of
the work of George Airy in astronomy, but
dementia and death in Swedish twins. Genetic
the basic specifications used today were greatly
Epidemiology, 24, 139–149.
Schlesselman, J. J. (1982). Case-control studies: advanced in the 20th century by Ronald Fisher
Design, conduct, analysis. New York: Oxford and Churchill Eisenhart’s introduction of fixed-
University Press. and random-effects modeling. MLM permits the
Whalley, L. J., & Deary, I. J. (2001). Longitudinal analysis of interdependent data without violating
cohort study of childhood IQ and survival up to age the assumptions of standard multiple regression. A
76. British Medical Journal, 322, 819–822. critical statistic for determining the degree of inter-
World Health Organization. (2008). WHO Statistical relatedness in one’s data is the intraclass correla-
Information System: Indicator definitions and tion (ICC). The ICC is calculated as the ratio of
metadata, 2008. Retrieved August 15, 2008, from
between-group variance to between-subject vari-
http://www.who.int/whosis/indicators/compendium/
2008/en
ance, divided by total variance. The degree to
which the ICC affects alpha levels is dependent on
the size of a sample; small ICCs inflate alpha in
large samples, whereas large ICCs will inflate
alpha in small samples. A high ICC suggests that
MULTILEVEL MODELING the assumption of independence is violated. When
the ICC is high, using traditional methods such as
Multilevel modeling (MLM) is a regression-based multiple linear regression is problematic because
approach for handling nested and clustered data. ignoring the interdependence in the data will often
Nested data (sometimes referred to as person– yield biased results by artificially inflating the sam-
period data) occurs when research designs include ple size in the analysis, which can lead to statisti-
multiple measurements for each individual, and cally significant findings that are not based on
this approach allows researchers to examine how random sampling. In addition, it is important to
participants differ, as well as how individuals vary account for the nested structure of the data—that
across measurement periods. A good example of is, nonindependence—to generate an accurate
nested data is repeated measurements taken from model of the variation in the data that is due to
people over time; in this situation, the repeated differences between groups and between subjects
measurements are nested under each person. Clus- after accounting for within differences within
tered data involves a hierarchical structure such groups and within subjects. Because variation
that individuals in the same group are hypothe- within groups and within individuals usually
sized to be more similar to each other than to accounts for most of the total variance, disregard-
other groups. A good example of clustered data is ing this information will bias these estimates.
the study of classrooms within different schools; in In addition to its ability to handle nonindepen-
this situation, classrooms are embedded within the dent data, an advantage of MLM is that more
schools. Standard (ordinary least squares [OLS]) traditional approaches for studying repeated mea-
regression approaches assume that each obser- sures, such as repeated measures analysis of vari-
vation in a data set is independent. Thus, it is ance (ANOVA), assume that data are completely
immediately obvious that nested and hierarchically balanced with, for example, the same number of
structured data violate this assumption of indepen- students per classroom or equivalent measure-
dence. MLM techniques arose to address this limi- ments for each individual. Missing data or unbal-
tation of OLS regression. As discussed below, anced designs cannot be accommodated with
however, most of the common MLM techniques repeated measures ANOVA and are dropped from
are extensions of OLS regression and are accessible further analysis. MLM techniques were designed
to anyone with a basic working knowledge of mul- to use an iterative process of model estimation by
tiple regression. which all data can be used in analysis; the two
836 Multilevel Modeling

most common approaches are maximum and common structures are person–period data and
restricted maximum likelihood estimation (both of clustered data. Person–period data examine both
which are discussed later in this entry). between- and within-individual variation, with the
latter examining how an individual varies across
a measurement period. Most studies having this
Important Distinctions
design include longitudinal data that examines
Multilevel models are also referred to as hierarchi- individual growth. An example might be a daily
cal linear models, mixed models, general linear diary study in which each diary entry or measure-
mixed models, latent curve growth models, vari- ment (Level 1) is nested within an individual
ance components analysis, random coefficients (Level 2). Ostensibly, a researcher might be inter-
models, or nested or clustered models. These terms ested in examining change within an individual’s
are appropriate and correct, depending on the field daily measurements across time or might be inves-
of study, but the multiple names can also lead to tigating how daily ratings differ as a function of
confusion and apprehension associated with using between-individual factors such as personality,
this form of statistical analysis. For instance, the intelligence, age, and so forth. Thus, at the within-
hierarchical linear model 6.0, developed by Tony person level (Level 1), there may be predictors
Bryk and Steve Raudenbush, is a statistical pro- associated with an individual’s rating at any given
gram that can handle both nested and clustered occasion, but between-person (Level 2) variables
data sets, but MLM can be conducted in other may also exist that moderate the strength of that
popular statistical programs, including SAS, SPSS association. As opposed to assuming that indivi-
(an IBM company, formerly called PASWâ Statis- duals’ responses are independent, the assumption
tics), and R, as well as a host of software for ana- of MLM is that these responses are inherently
lyzing structural equation models (e.g., LISREL, related and more similar within an individual than
MPlus). MLM can be used for nonlinear models as they are across individuals.
well, such as those associated with trajectories of A similar logic applies for clustered data. A
change and growth, which is why the terms hierar- common example of hierarchically structured data
chical linear modeling and general linear mixed in the education literature assumes that students in
models can be misleading. In addition, latent curve the same classroom will be more similar to each
growth analysis, a structural equation modeling other than to students in another class. This might
technique used to fit different curves associated result from being exposed to the same teacher,
with individual trajectories of change, is statisti- materials, class activities, teaching approach, and
cally identical to regression-based MLM. The dif- so forth. Thus, students’ individual responses
ferent approaches recognize that data is not always (Level 1) are considered nested within classrooms
nested within a hierarchical structure, and also that (Level 2). A researcher might be interested in
data may exhibit nonlinear trajectories such as examining individual students’ performance on an
quadratic or discontinuous change. MLM also is arithmetic test to ascertain if variability is due to
referred to as mixed models analysis by resear- differences among students (Level 1) or between
chers interested in the differences between subjects classrooms (Level 2). Similar to its use with
and groups and within subjects and groups that person–period data, MLM in this example can
account for variance in their data. Finally, variance investigate within-classroom variability at the low-
components analysis and random coefficients mod- est level and between-classroom variability at the
els refers to variance that is assumed to be random highest level of the hierarchy.
across groups or individuals as opposed to fixed, as The main distinction between these data struc-
is assumed in single-level regression; MLM is tures rests in the information that is gleaned for
referred to by this terminology as well. data analysis. For person–period data, one can
make inferences regarding variability within per-
son responses or trajectories of change over a time,
Person–Period and Clustered Data
which may help answer questions related to the
As mentioned, MLM can be used flexibly with study of change. For clustered data, one can study
multiple types of data structures. Two of the most differences among and within groups, which may
Multilevel Modeling 837

help answer questions regarding program evalua- estimates the individual growth parameters of the
tion. Perhaps not surprisingly, these two structures intercept and slope for each individual by the fol-
also can be combined, such as when multiple arith- lowing equations:
metic exams are given over time (Level 1) and
nested within each student (Level 2) who remains Level 2 : β0i ¼ γ00 þ ζ0i
assigned to a classroom (Level 3). A thorough dis- β1i ¼ γ10 þ ζ1i ð2Þ
cussion of this three-level example is beyond the
scope of the present entry, but the topic is raised where ζ0i and ζ1i indicate that the Level 2 out-
as an example of the flexibility and sophistication comes (β0i and β1i, the intercept and the slope
of MLM. from the Level 1 model) each have a residual term,
while γ00 represents the grand mean and γ10 indi-
The Multilevel Model cates the grand slope for the sample. This means
that the intercept and slope are expected to vary
This discussion follows the formal notation intro- across individuals and will deviate from the aver-
duced by Bryk and Raudenbush, Judith Singer, age intercept and slope of the entire sample.
John Willet, and others. A standard two-level By substituting the Level 2 equations into the
equation for the lower and higher levels of a hierar- Level 1 equation, one can derive the collapsed
chy that includes a predictor at Level 1 is pre- model:
sented first, followed by a combined equation
showing the collapsed single-level model. This last Yij ¼ ½ðγ00 þ ζ0i Þ þ ðγ10 þ ζ1i Þ þ εij
step is important because, depending on which ð3Þ
software is chosen, the two-level model (e.g., Yij ¼ ðγ00 þ γ10Þ þ ðζ1i þ ζ0i þ εij Þ:
HLM 6.0) or the collapsed model (e.g., SAS Proc-
Mixed) may require an explicit equation. It is
important to note, also, that these equations are Types of Questions Answered
the same for any two-level person–period or clus- Using Multilevel Models
tered data set. This section details the most useful and common
The two-level model is presented below in its approaches for examining multilevel data and is
simplest form: organized in a stepwise fashion, with each con-
sequent model adding more information and
Level 1 : Yij ¼ β0i þ β1i þ εij , ð1Þ
complexity.
where i refers to individual and j refers to time, β0i
is the intercept for this linear model, and β1i is the
Unconditional Means Model
slope for the trajectory of change. Notice that
the Level 1 equation looks almost identical to the This approach is analogous to a one-way
equation used for a simple linear regression. The ANOVA examining the random effect, or variance
main differences are the error term εij and the int- in means, across individuals in a person–period
roduction of subscripts i and j. The error term sig- data set, or across groups with clustered data. This
nifies random measurement error associated with model is run without any predictors at Level 1,
data that, contrary to the slope, deviate from line- which is equivalent to the model included in Equa-
arity. For the earlier daily diary (individual–period) tion 1 without any Level 1 predictors (i.e., the
example, the Level 1 equation details that individ- unconditional means model with only the intercept
ual i’s rating at time j is dependent on his or her and error term). It is by running this model that
first rating (β0i) and the slope of linear change one can determine the ICC and assess whether
(β1i) between time (or occasion) 1 and time j (note a multilevel analysis is indeed warranted. Thus,
that β1i does not always represent time, but more the unconditional means model provides an esti-
generally represents a 1-unit change from baseline mate of how much variance exists between groups
in the time-varying Level 1 predictor; the equa- and between subjects, as well as within groups and
tions are identical, however). The Level 2 equation within subjects in the sample.
838 Multilevel Modeling

Means-as-Outcomes Model Error Variance and Covariance


This model attempts to explain the variation The standard MLM for assessing change can be
that occurs in individual or group means as a func- analyzed with simple multiple regression tech-
tion of a Level 2 variable. For example, perhaps it niques, but it is important to recognize that a mul-
is believed that extroversion explains the mean dif- tilevel approach should use a specialized error
ferences in happiness people report across days, or term (i.e., a term that specifies why and in what
that the number of pop quizzes administered will way data vary). As presented in the collapsed
predict the mean arithmetic scores for a class. It is model (Equation 3) above, the error term in this
hypothesized that with inclusion of a Level 2 pre- simplified regression equation has three compo-
dictor, the individual or group mean will be altered nents: the error associated with each Level 1 equa-
compared with that of the unconditional means tion and two terms associated with the error
model. In order to ascertain whether this associa- around the intercept and the slope, respectively.
tion holds true, one compares the difference in the Singer was among the first to point out that these
between-subject and between-group variance in associated error terms render OLS regression anal-
the unconditional means model with the difference ysis unfit for nested data because it assumes that
in the between-subject and between-group vari- the exhibited variability should approximate a mul-
ance in the means-as-outcomes models. One may tivariate normal distribution. If this were true, the
also examine this model if significant between- mean of the residuals would be zero, the residuals
groups error variance is exhibited in the uncondi- would be independent of one another (they would
tional means model because the model should be not covary), and the population variance of the
respecified to account for this Level 2 variation. residuals would be equivalent for each mea-
surement. Nested data sets do not adhere to these
Random Coefficient Model assumptions; whereas it is expected that there will
be independence among individuals and/or groups,
As opposed to the means-as-outcomes model, it is also assumed that the variance within person
the random coefficient model attempts to partition and/or group will covary and each measurement
variability within groups and within subjects as and/or person will be correlated to the others.
a function of a Level 1 predictor. For the previous Specifically, for longitudinal, or person–period,
examples, this approach will yield the average data, special consideration must be given to the
intercept of happiness and the average slope of time dependence of the measurements for each
extroversion and happiness of all individuals for person. There are multiple approaches that one
person–period data or the average intercept for can take when specifying how the error term
arithmetic scores and the average slope of pop should be structured in an MLM. The following is
quizzes and arithmetic scores of all classrooms. In a brief and basic description of these various
addition, this approach tests for significant differ- approaches.
ences between each individual and each classroom.
Average intercept and slope estimates are yielded
in the output as fixed effects, while a significant p Partitioning Variance Between Groups
value for the variance of the slope estimate indi-
cates that the slope varies across individuals and Random Intercept
classrooms.
This approach is also called the variance com-
ponents, or between-groups compound symmetric
Intercepts and Slopes as Outcomes Model
structure. It is commonly used with longitudinal
The intercepts and slopes as outcomes model data in which subjects are allowed to vary and
includes both Level 1 and Level 2 predictors. This each subject has a mean intercept or score that
step is completed only if significant variability is deviates from the population (or all-subjects)
accounted for by Level 2 and Level 1 predictors. mean. In this instance, the residual intercepts
As stated earlier, model testing occurs in a stepwise are thought to be normally distributed with a mean
fashion, with this model serving as the final step. of zero.
Multilevel Modeling 839

Random Intercept and Slope ratings may be more similar on Days 1 and 2 com-
pared with Days 1 and 15.
This approach is ideal for data in which sub-
jects are measured on different time schedules, and
it allows each subject to deviate from the popula- Model Estimation Methods
tion in terms of both intercept and slope. Thus,
As stated above, MLM differs from repeated
each individual can have a different growth trajec-
measures ANOVA in that it uses all available data.
tory even if the hypothesis is that the population
To accomplish this, most statistical packages use
will approximate similar shapes in its growth tra-
a form of maximum likelihood (ML) estimation.
jectory. Using this method, one can specify the resi-
ML estimations are favored because it is assumed
duals for the intercept and slopes to be zero or
that they converge on population estimates, that
nonzero and alter the variance structures by group
the sampling distribution is equivalent to the
such that they are equal or unbalanced.
known variance, and that the standard error
derived from the use of this method is smaller than
from other approaches. These advantages apply
Partitioning Variance Within Groups
only with large samples because ML estimations
are biased toward large samples and their variance
Unstructured
estimation may become unreliable with a smaller
An alternative approach is to estimate the within- data set. There are two types of ML techniques,
group random effects. This approach assumes full ML and restricted ML. Full ML assumes that
that each subject and/or group is independent the dependent variable is normally distributed, and
with equivalent variance components. In this the mean is based on the regression coefficients
approach, the variance can differ at any time, and the variance components. Restricted ML uses
and covariance can exist between all the vari- the least squares residuals that remain after the
ance components. This is the default within- influence of the fixed effects are removed and only
groups variance structure in most statistical soft- the variance components remain. With ML algo-
ware packages and should serve as a starting rithms, a statistic of fit is usually compared across
point in model testing unless theory or experi- models to reveal which model best accounts
mental design favor another approach. for variance in the dependent variable (e.g., the
deviance statistic). Multiple authors suggest that
Within-Group Compound Symmetric when using restricted ML, one should make sure
that models include the same fixed effects and that
This approach constrains the variance and
only the random effects vary, because one wants to
covariance to a single value. Doing so assumes that
make sure the fixed effects are accounted for
the variance is the same regardless of the time the
equivalently across models.
individual was measured or the subject within the
A second class of estimation methods are exten-
group and that the correlation between measure-
sions of OLS estimation. Generalized least squares
ments will be equivalent. A subspecification of this
(GLS) estimation allows the residuals to be autocor-
error structure is the heterogeneous compound
related and have more dispersion of the variances,
symmetric that dictates the variance is a single
but it requires that the actual amount of autocorre-
value, but the covariance between measurements
lation and dispersion be known in the population
can differ.
in order for one to accurately estimate the true
error in the covariance matrix. In order to account
Autoregressive
for this, GLS uses the estimated error covariance
Perhaps the most useful error structure for longi- matrix as the true error covariance matrix, and
tudinal data, this approach dictates that variance is then it estimates the fixed effects and associated
the same at all times but that covariance decreases standard errors. Another approach, iterative GLS,
as measurement occasions are further apart. From is merely an extension of GLS and uses iterations
a theoretical standpoint, this may not be the case in that repeatedly estimate and refit the model until
clustered data sets, but one can see how a person’s either the model is ideally converged or the
840 Multiple Comparison Tests

maximum number of iterations has occurred. This Singer, J. D. (1998). Using SAS PROC MIXED to fit
method also works only with large and relatively multilevel models, hierarchical models, and individual
balanced data. growth models. Journal of Educational & Behavioral
Statistics, 23, 323–355.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal
Other Applications data analysis: Modeling change and event occurrence.
New York: Oxford University Press.
The applications of MLM techniques are virtually Wallace, D., & Green, S. B. (2002). Analysis of repeated
limitless. Dyadic analytic methods are being used measures designs with linear mixed models. In D. S.
in MLM software, with individuals considered Moskowitz & S. L. Hershberger (Eds.), Modeling
nested within dyads. Mediation and moderation intraindividual variability with repeated measures
analysis are possible both within levels of the hier- data: Methods and applications (pp. 103–134).
archy and across levels. Moderated mediation and Mahwah, NJ: Lawrence Erlbaum.
mediated moderation principles, relatively new to
the literature, also can be applied within a MLM Websites
framework. Scientific Software International. Hierarchical
Linear and Nonlinear Modeling (HLM):
Resources http://www.ssicentral.com/hlm
UCLA Stat Computing Portal:
MLM workshops are offered by many private http://statcomp.ats.ucla.edu
companies and university-based educational pro-
grams. Articles on MLM are easily located on
academic databases. Another resource is the Uni-
versity of California–Los Angeles Stat Comput- MULTIPLE COMPARISON TESTS
ing Portal, which has links to pages and articles
of interest directly related to different aspects of Many research projects involve testing multiple
MLM. Finally, those interested in exploring research hypotheses. These research hypotheses
MLM will find it easily accessible as some pro- could be evaluated using comparisons of means,
grams are designed specifically for MLM analy- bivariate correlations, regressions, and so forth, and
ses (e.g., HLM 6.0), but many of the more in fact most studies consist of a mixture of different
commonly used statistical packages (e.g., SAS types of test statistics. An important consideration
and SPSS) have the same capabilities. when conducting multiple tests of significance is
how to deal with the increased likelihood (relative
Lauren A. Lee and David A. Sbarra to conducting a single test of significance) of falsely
declaring one (or more) hypotheses statistically sig-
See also Analysis of Variance (ANOVA); Growth Curve;
nificant, titled the multiple comparisons problem.
Hierarchical Linear Modeling; Intraclass Correlation;
This multiple comparisons problem is especially rel-
Latent Growth Modeling; Longitudinal Design;
evant to the topic of research design because the
Mixed- and Random-Effects Models; Mixed Model
issues associated with the multiple comparisons
Design; Nested Factor Design; Regression Artifacts;
problem relate directly to designing studies (i.e.,
Repeated Measures Design; Structural Equation
number and nature of variables to include) and
Modeling; Time-Series Study; Variance
deriving a data analysis strategy for the study.
This entry introduces the multiple comparisons
Further Readings
problem and discusses some of the strategies that
Heck, R. H., & Thomas, S. L. (2000). An introduction to have been proposed for dealing with it.
multilevel modeling techniques. Mahwah, NJ:
Lawrence Erlbaum.
Kreft, I., & de Leeuw, J. (1998). Introducing multilevel The Multiple Comparisons Problem
modeling. Thousand Oaks, CA: Sage.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical To help clarify the multiple comparisons problem,
linear model: Applications and data analysis methods imagine a soldier who needed to cross fields con-
(2nd ed.). Thousand Oaks, CA: Sage. taining land mines in order to obtain supplies. It is
Multiple Comparison Tests 841

clear that the more fields the individual crosses, in linear models; (h) evaluating multiple para-
the greater the probability that he or she will acti- meters simultaneously in a structural equation
vate a land mine; likewise, researchers conducting model; and (i) analyzing multiple brain voxels for
many tests of significance have an increased chance stimulation in functional magnetic resonance
of erroneously finding tests significant. It is impor- imaging research. Further, as stated previously,
tant to note that although the issue of multiple most studies involve a mixture of many different
hypothesis tests has been labeled the multiple com- types of test statistics.
parisons problem, most likely because a lot of the An important factor in understanding the multi-
research on multiple comparisons has come within ple comparisons problem is understanding the dif-
the framework of mean comparisons, it applies to ferent ways in which a researcher can ‘‘group’’ his
any situation in which multiple tests of significance or her tests of significance. For example, suppose,
are being performed. in the study looking at whether student ratings dif-
Imagine that a researcher is interested in deter- fer across instruction formats, that there was also
mining whether overall course ratings differ for lec- another independent variable, the sex of the ins-
ture, seminar, or computer-mediated instruction tructor. There would now be two ‘‘main effect’’
formats. In this type of experiment, researchers are variables (instruction format and sex of the ins-
often interested in whether significant differences tructor) and potentially an interaction between
exist between any pair of formats, for example, do instruction format and sex of the instructor. The
the ratings of students in lecture-format classes dif- researcher might want to group the hypotheses
fer from the ratings of students in seminar-format tested under each of the main effect (e.g., pairwise
classes. The multiple comparisons problem in this comparisons) and interaction (e.g., simple effect
situation is that in order to compare each format in tests) hypotheses into separate ‘‘families’’ (groups
a pairwise manner, three tests of significance need of related hypotheses) that are considered simulta-
to be conducted (i.e., comparing the means of lec- neously in the decision process. Therefore, control
ture vs. seminar, lecture vs. computer-mediated, of the Type I error (error of rejecting a true null
and seminar vs. computer-mediated instruction for- hypothesis) rate might be imposed separately for
mats). There are numerous ways of addressing the each family, or in other words, the Type I error
multiple comparisons problem and dealing with the rate for each of the main effect and interaction
increased likelihood of falsely declaring tests families is maintained at α. On the other hand, the
significant. researcher may prefer to treat the entire set of tests
for all main effects and interactions as one family,
depending on the nature of the analyses and the
Common Multiple Testing Situations
way in which inferences regarding the results will
There are many different settings in which be made. The point is that when researchers con-
researchers conduct null hypothesis testing, and duct multiple tests of significance, they must make
the following are just a few of the more common important decisions about how these tests are
settings in which multiplicity issues arise: (a) con- related, and these decisions will directly affect the
ducting pairwise and/or complex contrasts in a power and Type I error rates for both the individ-
linear model with categorical variables; (b) con- ual tests and for the group of tests conducted in
ducting multiple main effect and interaction tests the study.
in a factorial analysis of variance (ANOVA) or
multiple regression setting; (c) analyzing multiple
Type I Error Control
simple effect, interaction contrast, or simple slope
tests when analyzing interactions in linear models; Researchers testing multiple hypotheses, each with
(d) analyzing multiple univariate ANOVAs after a specified Type I error probability (α), risk an
a significant multivariate ANOVA (MANOVA); increase in the overall probability of committing
(e) analyzing multiple correlation coefficients; (f) a Type I error as the number of tests increases. In
assessing the significance of multiple factor load- some cases, it is very important to control for Type
ings or factor correlations in factor analysis; (g) I errors. For example, if the goal of the researcher
analyzing multiple dependent variables separately comparing the three classroom instruction formats
842 Multiple Comparison Tests

described earlier was to identify the most effective matter, how many tests the researcher might con-
instruction method that would then be adopted in duct over his or her career. Second, real differences
schools, it would be important to ensure that between treatment groups are more likely with
a method was not selected as superior by chance a greater number of treatment groups. Therefore,
(i.e., a Type I error). On the other hand, if the goal emphasis in experiments should be not on control-
of the research was simply to identify the best ling for unlikely Type I errors but on obtaining the
classroom formats for future research, the risks most power for detecting even small differences
associated with Type I errors would be reduced, between treatments. The third argument is the
whereas the risk of not identifying a possibly supe- issue of (in)consistency. With more conservative
rior method (i.e., a Type II error) would be error rates, different conclusions can be found
increased. When many hypotheses are being tested, regarding the same hypothesis, even if the test sta-
researchers must specify not only the level of sig- tistics are identical, because the per-test α level
nificance, but also the unit of analysis over which depends on the number of comparisons being
Type I error control will be applied. For example, made. Last, one of the primary advantages of αPT
the researcher comparing the lecture, seminar, and control is convenience. Each of the tests is evalu-
computer-mediated instruction formats must deter- ated with any appropriate test statistic and com-
mine how Type I error control will be imposed pared to an α-level critical value.
over the three pairwise tests. If the probability of The primary disadvantage of αPT control is that
committing a Type I error is set at α for each com- the probability of making at least one Type I error
parison, then the probability that at least one Type increases as the number of tests increases. The
I error is committed over all three pairwise com- actual increase in the probability depends, among
parisons can be much higher than α. On the other other factors, on the degree of correlation among
hand, if the probability of committing a Type I the tests. For independent tests, the probability of
error is set at α for all tests conducted, then the a Type I error with T tests is 1  (1  αÞT ,
probability of committing a Type I error for each whereas for nonindependent tests (e.g., all pairwise
of the comparisons can be much lower than α. comparisons, multiple path coefficients in struc-
The conclusions of an experiment can be greatly tural equation modeling), the probability of a Type
affected by the unit of analysis over which Type I I error is less than 1  (1  αÞT . In general, the
error control is imposed. more tests that a researcher conducts in his or her
experiment, the more likely it is that one (or more)
will be significant simply by chance.
Units of Analysis
Several different units of analysis (i.e., error Familywise Error Rate
rates) have been proposed in the multiple compari- The familywise error rate (αFW) is defined as
son literature. The majority of the discussion has the probability of falsely rejecting one or more
focused on the per-test and familywise error rates, hypotheses in a family of hypotheses. Control-
although other error rates, such as the false discov- ling αFW is recommended when some effects are
ery rate, have recently been proposed. likely to be nonsignificant; when the researcher
is prepared to perform many tests of significance
Per-Test Error Rate
in order to find a significant result; when the
Controlling the per-test error rate (αPT) involves researcher’s analysis is exploratory, yet he or she
simply setting the α level for each test (αT ) equal still wants to be confident that a significant result
to the global α level. Recommendations for con- is real; and when replication of the experiment is
trolling αPT center on a few simple but convincing unlikely.
arguments. First, it can be argued that the natural Although many multiple comparison proce-
unit of analysis is the test. In other words, each test dures purport to control αFW, procedures are said
should be considered independent of how many to provide strong αFW control if αFW is maintained
other tests are being conducted as part of that spe- at approximately α when all population means are
cific analysis or that particular study or, for that equal (complete null) and when the complete null
Multiple Comparison Tests 843

is not true but multiple subsets of the population αPT as the α level for each test is equal to the
means are equal (partial nulls). Procedures that global α level and can therefore be used seamlessly
control αFW for the complete null, but not for par- with any test statistic. It is important to note that
tial nulls, provide weak αFW control. the procedures introduced here are only a small
The main advantage of αFW control is that the subset of the procedures that are available, and for
probability of making a Type I error does not the procedures that are presented, specific details
increase with the number of comparisons con- are not provided. Please see specific sections of the
ducted in the experiment. One of the main disad- encyclopedia for details on these procedures.
vantages of procedures that control αFW is that αT
decreases, often substantially, as the number of
tests increases. Therefore, procedures that control Familywise Error Controlling Procedures
αFW have reduced power for detecting treatment for Any Multiple Testing Environment
effects when there are many comparisons, increas-
ing the potential for inconsistent results between Bonferroni
experiments.
This simple-to-use procedure sets αT ¼ α/T,
where T represents the number of tests being per-
False Discovery Rate formed. The important assumption of the Bonfer-
The false discovery rate represents a compromise roni procedure is that the tests being conducted
between strict αFW control and liberal αPT control. are independent. When this assumption is violated
Specifically, the false discovery rate (aFDR ) is (and it commonly is), the procedure will be too
defined as the expected ratio (Q) of the number of conservative.
erroneous rejections (V) to the total number of
rejections (R ¼V þ S), where S represents the num- Dunn-Sidák
ber of true rejections. Therefore, EðQÞ ¼ The Dunn-Sidák procedure is a more powerful
EðV=½V þ SÞ ¼ EðV=RÞ. version of the original Bonferroni procedure. With
If all null hypotheses are true, αFDR ¼ αFW . On the Dunn-Sidák procedure, αT ¼ 1  (1  α)1/T.
the other hand, if some null hypotheses are false,
αFDR ≤ αFW , resulting in weak control of αFM . As Holm
a result, any procedure that controls αFW also con-
Sture Holm proposed a sequential modification
trols αFDR, but procedures that control αFDR can be
of the original Bonferroni procedure that can be
much more powerful than those that control αFW,
substantially more powerful than the Bonferroni
especially when a large number of tests are per-
or Dunn-Sidák procedures.
formed, and do not entirely dismiss the multiplicity
issue (as with αPT control). Although some research-
Hochberg
ers recommend exclusive use of αFDR control, it is
often recommended that αFDR control be reserved Yosef Hochberg proposed a modified segue
for exploratory research, nonsimultaneous inference to Bonferroni procedure that combined Simes’s
(e.g., if one had multiple dependent variables and inequality with Holm’s testing procedure to create
separate inferences would be made for each), and a multiple comparison procedure that can be more
very large family sizes (e.g., as in an investigation of powerful and simpler than the Holm procedure.
potential activation of thousands of brain voxels in
functional magnetic resonance imaging).
Familywise Error Controlling Procedures
for Pairwise Multiple Comparison Tests
Multiple Comparison Procedures
Tukey
This section introduces some of the multiple
comparison procedures that are available for con- John Tukey proposed the honestly significant
trolling αFW and αFDR. Recall that no multiple difference procedure, which accounts for depen-
comparison procedure is necessary for controlling dencies among the pairwise comparisons and is
844 Multiple Regression

maximally powerful for simultaneous testing of Further Readings


all pairwise comparisons.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the
false discovery rate: A practical and powerful
Hayter approach to multiple testing. Journal of the Royal
Anthony Hayter proposed a modification to Statistical Society B, 57, 289–300.
Hancock, G. R., & Klockars, A. J. (1996). The quest for
Fisher’s least significant difference procedure that
α: Developments in multiple comparison procedures in
would provide strong control over αFW. the quarter century since Games (1971). Review of
Educational Research, 66, 269–306.
REGWQ Hochberg, Y., & Tamhane, A. C. (1987). Multiple
comparison procedures. New York: Wiley.
Ryan proposed a modification to the Newman–
Hsu, J. C. (1996). Multiple comparisons: Theory and
Keuls procedure that ensures that αFW is main-
methods. New York: Chapman & Hall/CRC.
tained at α, even in the presence of multiple partial Toothaker, L. (1993). Multiple comparison procedures.
null hypotheses. Ryan’s original procedure became Newbury Park, CA: Sage.
known as the REGWQ after modifications to the Westfall, P. H., Tobia, R. D., Rom, D., Wolfinger, R. D., &
procedure by Einot, Gabriel, and Welsch. Hochberg, Y. (1999). Multiple comparisons and multiple
tests using the SAS system. Cary, NC: SAS Institute.

False Discovery Rate–Controlling Procedures


for Any Multiple Testing Environment MULTIPLE REGRESSION
FDR-BH Multiple regression is a general and flexible statisti-
cal method for analyzing associations between two
Yoav Benjamini and Hochberg proposed the or more independent variables and a single depen-
original procedure for controlling the αFDR, which dent variable. As a general statistical technique, mul-
is a sequential modified Bonferroni procedure. tiple regression can be employed to predict values of
a particular variable based on knowledge of its
FDR-BY association with known values of other variables,
Benjamini and Daniel Yekutieli proposed a mod- and it can be used to test scientific hypotheses about
ified αFDR controlling procedure that would main- whether and to what extent certain independent
tain αFDR ¼ α under any dependency structure variables explain variation in a dependent variable
among the tests. of interest. As a flexible statistical method, multiple
When multiple tests of significance are per- regression can be used to test associations among
formed, the best form of Type I error control continuous as well as categorical variables, and it
depends on the nature and goals of the research. can be used to test associations between individual
The practical implications associated with the sta- independent variables and a dependent variable, as
tistical conclusions of the research should take pre- well as interactions among multiple independent
cedence when selecting a form of Type I error variables and a dependent variable. In this entry, dif-
control. ferent approaches to the use of multiple regression
are presented, along with explanations of the more
Robert A. Cribbie commonly used statistics in multiple regression,
methods of conduction multiple regression analysis,
See also Bonferroni Procedure; Coefficient Alpha; Error
and the assumptions of multiple regression.
Rates; False Positive; Holm’s Sequential Bonferroni
Procedure; Honestly Significant Difference (HSD)
Test; Mean Comparisons; Newman–Keuls Test and Approaches to Using Multiple Regression
Tukey Test; Pairwise Comparisons; Post Hoc
Prediction
Comparisons; Scheffé Test; Significance Level,
Concept of; Tukey’s Honestly Significant Difference One common application of multiple regression
(HSD); Type I Error is for predicting values of a particular dependent
Multiple Regression 845

variable based on knowledge of its association its age is. A 5-year-old tree will be 10 feet tall, an
with certain independent variables. In this context, 8-year-old tree will be 16 feet tall, and so on.
the independent variables are commonly referred At this point two important issues must be con-
to as predictor variables and the dependent vari- sidered. First, virtually any time one is working with
able is characterized as the criterion variable. In variables from people, animals, plants, and so forth,
applied settings, it is often desirable for one to be there are no perfect linear associations. Sometimes
able to predict a score on a criterion variable by students with high ACT scores do poorly in college
using information that is available in certain pre- whereas some students with low ACT scores do well
dictor variables. For example, in the life insurance in college. This shows how there can always be
industry, actuarial scientists use complex regres- some error when one uses regression to predict
sion models to predict, on the basis of certain pre- values on a criterion variable. The stronger the asso-
dictor variables, how long a person will live. In ciation between the predictor and criterion variable,
scholastic settings, college and university admis- the less error there will be in that prediction.
sions offices will use predictors such as high school Accordingly, regression is based on the line of best
grade point average (GPA) and ACT scores to pre- fit, which is simply the line that will best describe or
dict an applicant’s college GPA, even before he or capture the relationship between X and Y by mini-
she has entered the university. mizing the extent to which any data points fall off
Multiple regression is most commonly used to that line.
predict values of a criterion variable based on lin- A college admissions committee wants to be
ear associations with predictor variables. A brief able to predict the graduating GPA of the students
example using simple regression easily illustrates whom they admit. The ACT score is useful for this,
how this works. Assume that a horticulturist devel- but as noted above, it does not have a perfect asso-
oped a new hybrid maple tree that grows exactly ciation with college GPA, so there is some error in
2 feet for every year that it is alive. If the height of that prediction. This is where multiple regression
the tree was the criterion variable and the age of becomes very useful. By taking into account the
the tree was the predictor variable, one could accu- association of additional predictor variables with
rately describe the relationship between the age college GPA, one can further minimize the error in
and height of the tree with the formula for predicting college GPA. For example, the admis-
a straight line, which is also the formula for a sim- sions committee might also collect information on
ple regression equation: high school GPA and use that in conjunction with
the ACT score to predict college GPA. In this case,
Y ¼ bX þ a, the regression equation would be

where Y is the value of the dependent, or criterion, Y 0 ¼ b1 X1 þ b2 X2 þ a


variable; X is the value of the independent, or pre-
where Y0 is the predicted value of Y (college GPA),
dictor, variable; b is a regression coefficient that
X1 and X2 are the values of the predictor variables
describes the slope of the line; and a is the Y inter-
(ACT score and high school GPA), b1 and b2 are
cept. The Y intercept is the value of Y when X is 0.
the regression coefficients by which X1 and X2 are
Returning to the hybrid tree example, the exact
multiplied to get Y0, and a is the intercept (i.e.,
relationship between the tree’s age and height
value of Y when X1 and X2 are both 0). In this
could be described as follows:
particular case, the intercept serves only an arith-
metic function as it has no practical interpretabil-
height ¼ 2ðage in yearsÞ þ 0
ity because having a high school GPA of 0 and an
ACT score of 0 is meaningless.
Notice that the Y intercept is 0 in this case
because at 0 years of age, the tree has 0 height. At
that point, it is just seed in the ground. It is clear
Explanation
how knowledge of the relationship between the
tree’s age and height could be used to easily predict In social scientific contexts, multiple regression
the height of any given tree by just knowing what is rarely used to predict unknown values on
846 Multiple Regression

a criterion variable. In social scientific research, where rYX1 is the Pearson correlation between Y
values of the independent and dependent variables and X1, rX1 X2 is the Pearson correlation between
are almost always known. In such cases multiple X1 and X2, and so on, and sY is the standard devi-
regression is used to test whether and to what ation of variable Y, sX1 is the standard deviation of
extent the independent variables explain the depen- X1, and so on. The partial regression coefficients
dent variable. Most often the researcher has theo- are also referred to as unstandardized regression
ries and hypotheses that specify causal relations coefficients because they represent the value by
among the independent variables and the depen- which one would multiply the raw X1 or X2 score
dent variable. Multiple regression is a useful tool in order to arrive at Y. In the salary example, these
for testing such hypotheses. For example, an econ- coefficients could look something like this:
omist is interested in testing a hypothesis about the
determinants of workers’ salaries. The model being Y ¼ 745:67X1 þ 104:36X2 þ 11,325:
tested could be depicted as follows:
This means that subjects’ annual salaries are
family of origin SES → education → salary, best described by an equation whereby their family
of origin SES is multiplied by 745.67, their years
where SES stands for socioeconomic status. of formal education are multiplied by 104.36, and
In this simple model, the economist hypo- these products are added to 11,325. Notice how
thesizes that the SES of one’s family of origin the regression coefficient for SES is much larger
will influence how much formal education one than that for years of formal education. Although
acquires, which in turn will predict one’s salary. If it might be tempting to assume that family of
the economist collected data on these three vari- origin SES is weighted more heavily than years of
ables from a sample of workers, the hypotheses formal education, this would not necessarily be
could be tested with a multiple regression model correct. The magnitude of an unstandardized
that is comparable to the one presented previously regression coefficient is strongly influenced by the
in the college GPA example: units of measurement used to assess the indepen-
dent variable with which it is associated. In this
Y ¼ b1 X1 þ b2 X2 þ a example, assume that SES is measured on a 5-
point scale (Levels 1–5) and years of formal educa-
In this case, what was Y0 is now Y because the tion, at least in the sample, runs from 7 to 20.
value of Y is known. It is useful to deconstruct the These differing scale ranges have a profound effect
components of this equation to show how they on the magnitude of each regression coefficient,
can be used to test various aspects of the econo- rendering them incomparable.
mist’s model. However, it is often the case that researchers
In the equation above, b1 and b2 are the partial want to understand the relative importance of each
regression coefficients. They are the weights by independent variable for explaining variation in
which one multiplies the value of X1 and X2 when the dependent variable. In other words, which is
all variables are in the equation. In other words, the more powerful determinant of people’s sal-
they represent the expected change in Y, per unit aries, their education or their family of origin’s
of X when all other variables are accounted for, or socioeconomic status? This question can be evalu-
held constant. Computationally, the values of b1 ated by examining the standardized regression
and b2 can be determined easily by simply know- coefficient, or β. Computationally, β can be deter-
ing the zero-order correlations among all possible mined by the following formulas:
pairwise combinations of Y, X1, and X2, as well as rYX1 rYX2 rX1 X2 rYX2 rYX1 rX1 X2
the standard deviations of the three variables: β1 ¼ 2
; β2 ¼ :
1r X1 X2 1r2 X1 X2
rYX1 rYX2 rX1 X2 sY
b1 ¼ · ; The components of these formulas are identical to
1r2 X1 X2 sX1
those for the unstandardized regression coefficients,
rYX2 rYX1 rX1 X2 sY
b2 ¼ · , but they lack multiplication by the ratio of standard
1r2 X1 X2 sX2 deviations of Y and X1 and X2. Incidentally, one can
easily convert β to b with the following formulas, which illustrate their relationship:

b1 = β1 (sY / s1),   b2 = β2 (sY / s2),

where sY is the standard deviation of variable Y, and so on.

Standardized regression coefficients can be thought of as the weight by which one would multiply a standardized score (or z score) for each independent variable in order to arrive at the z score for the dependent variable. Because z scores essentially equate all variables on the same scale, researchers are inclined to make comparisons about the relative impact of each independent variable by comparing their associated standardized regression coefficients, sometimes called beta weights.

In the economist's hypothesized model of workers' salaries, there are several subhypotheses or research questions that can be evaluated. For example, the model presumes that both family of origin SES and education will exert a causal influence on annual salary. One can get a sense of which variable has a greater impact on salary by comparing their beta weights. However, it is also important to ask whether either of the independent variables is a significant predictor of salary. In effect, these tests ask whether each independent variable explains a statistically significant portion of the variance in the dependent variable, independent of that explained by the other independent variable(s) also in the regression equation. This can be accomplished by dividing the β by its standard error (SEβ). This ratio is distributed as t with degrees of freedom = n − k − 1, where n is the sample size and k is the number of independent variables in the regression analysis. Stated more formally,

t = β / SEβ.

If this ratio is significant, that implies that the particular independent variable uniquely explains a statistically significant portion of the variance in the dependent variable. These t tests of the statistical significance of each independent variable are routinely provided by computer programs that conduct multiple regression analyses. They play an important role in testing hypotheses about the role of each independent variable in explaining the dependent variable.

In addition to concerns about the statistical significance and relative importance of each independent variable for explaining the dependent variable, it is important to understand the collective function of the independent variables for explaining the dependent variable. In this case, the question is whether the independent variables collectively explain a significant portion of the variance in scores on the dependent variable. This question is evaluated with the multiple correlation coefficient. Just as a simple bivariate correlation is represented by r, the multiple correlation coefficient is represented by R. In most contexts, data analysts prefer to use R² to understand the association between the independent variables and the dependent variable. This is because the squared multiple correlation coefficient can be thought of as the percentage of variance in the dependent variable that is collectively explained by the independent variables. So, an R² value of .65 implies that 65% of the variance in the dependent variable is explained by the combination of independent variables. In the case of two independent variables, the formula for the squared multiple correlation coefficient can be explained as a function of the various pairwise correlations among the independent and dependent variables:

R² = (r²YX1 + r²YX2 − 2 rYX1 rYX2 rX1X2) / (1 − r²X1X2).

In cases with more than two independent variables, this formula becomes much more complex, requiring the use of matrix algebra. In such cases, calculation of R² is ordinarily left to a computer.

The question of whether the collection of independent variables explains a statistically significant amount of variance in the dependent variable can be approached by testing the multiple correlation coefficient for statistical significance. The test can be carried out by the following formula:

F = [R² (n − k − 1)] / [(1 − R²) k].

This test is distributed as F with df = k in the numerator and n − k − 1 in the denominator, where n is the sample size and k is the number of independent variables.
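As a minimal numerical illustration of the two formulas above, the following Python sketch computes R² and the corresponding F test; the pairwise correlations and the sample size are hypothetical values chosen only for the example.

import numpy as np
from scipy import stats

# Hypothetical pairwise correlations among Y, X1, and X2, and a hypothetical sample size
r_yx1, r_yx2, r_x1x2 = 0.50, 0.45, 0.30
n, k = 120, 2  # n = sample size, k = number of independent variables

# Squared multiple correlation for the two-predictor case
r2 = (r_yx1**2 + r_yx2**2 - 2 * r_yx1 * r_yx2 * r_x1x2) / (1 - r_x1x2**2)

# F test of the multiple correlation coefficient, df = k and n - k - 1
F = (r2 * (n - k - 1)) / ((1 - r2) * k)
p = stats.f.sf(F, k, n - k - 1)
print(round(r2, 3), round(F, 2), round(p, 4))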
Two important features of the test for significance of the multiple correlation coefficient require discussion. First, notice how the sample size, n, appears in the numerator. This implies that, all other things held constant, the larger the sample size, the larger the F ratio will be. That means that the statistical significance of the multiple correlation coefficient is more probable as the sample size increases. Second, the amount of variation in the dependent variable that is not explained by the independent variables, indexed by 1 − R² (this is called error or residual variance), is multiplied by the number of independent variables, k. This implies that, all other things held equal, the larger the number of independent variables, the larger the denominator, and hence the smaller the F ratio. This illustrates how there is something of a penalty for using a lot of independent variables in a regression analysis. When trying to explain scores on a dependent variable, such as salary, it might be tempting to use a large number of predictors so as to take into account as many possible causal factors as possible. However, as this formula shows, this significance test favors parsimonious models that use only a few key predictor variables.

Methods of Variable Entry

Computer programs used for multiple regression provide several options for the order of entry of each independent variable into the regression equation. The order of entry can make a difference in the results obtained and therefore becomes an important analytic consideration. In hierarchical regression, the data analyst specifies a particular order of entry of the independent variables, usually in separate steps for each. Although there are multiple possible logics by which one would specify a particular order of entry, perhaps the most common is that of causal priority. Ordinarily, one would enter independent variables in order from the most distal to the most proximal causes. In the previous example of the workers' salaries, a hierarchical regression analysis would enter family of origin SES into the equation first, followed by years of formal education. As a general rule, in hierarchical entry, an independent variable entered into the equation later should never be the cause of an independent variable entered into the equation earlier. Naturally, hierarchical regression analysis is facilitated by having a priori theories and hypotheses that specify a particular order of causal priority.

Another method of entry that is based purely on empirical rather than theoretical considerations is stepwise entry. In this case, the data analyst specifies the full complement of potential independent variables to the computer program and allows it to enter or not enter these variables into the regression equation, based on the strength of their unique association with the dependent variable. The program keeps entering independent variables up to the point at which addition of any further variables would no longer explain any statistically significant increment of variance in the dependent variable. Stepwise analysis is often used when the researcher has a large collection of independent variables and little theory to explain or guide their ordering or even their role in explaining the dependent variable. Because stepwise regression analysis capitalizes on chance and relies on a post hoc rationale, its use is often discouraged in social scientific contexts.
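A hierarchical entry of this kind can be sketched with ordinary least squares by examining how much R² increases at each step. The data below are simulated under hypothetical effects and are not taken from the workers' salaries example; the variable names are assumptions made for the illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 200
ses = rng.normal(size=n)                # hypothetical distal predictor (family-of-origin SES)
educ = 0.5 * ses + rng.normal(size=n)   # hypothetical proximal predictor (years of education)
salary = 0.3 * ses + 0.6 * educ + rng.normal(size=n)

def r_squared(y, predictors):
    # Ordinary least squares fit with an intercept, returning R-squared
    X = np.column_stack([np.ones(len(y))] + predictors)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid.var() / y.var()

r2_step1 = r_squared(salary, [ses])          # Step 1: distal cause entered first
r2_step2 = r_squared(salary, [ses, educ])    # Step 2: proximal cause added
print(round(r2_step1, 3), round(r2_step2, 3), round(r2_step2 - r2_step1, 3))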
Assumptions of Multiple Regression

Multiple regression is most appropriately used as a data analytic tool when certain assumptions about the data are met. First, the data should be collected through independent random sampling. Independent means that the data provided by one participant must be entirely unrelated to the data provided by another participant. Cases in which husbands and wives, college roommates, or doctors and their patients both provide data would violate this assumption. Second, multiple regression analysis assumes that there are linear relationships between the independent variables and the dependent variable. When this is not the case, a more complex version of multiple regression known as nonlinear regression must be employed. A third assumption of multiple regression is that at each possible value of each independent variable, the dependent variable must be normally distributed. However, multiple regression is reasonably robust in the case of modest violations of this assumption. Finally, for each possible value of each independent variable, the variance of the residuals or errors in predicting Y (i.e., Y′ − Y) must be consistent. This is known as the homoscedasticity assumption. Returning to the workers' salaries example, it would be important that at each level of family-of-origin SES (Levels 1–5), the degree of error in predicting workers' salaries was comparable. If the salary predicted by the regression equation was within ± $5,000 for everyone at Level 1 SES, but it was within ± $36,000 for everyone at Level 3 SES, the homoscedasticity assumption would be violated because there is far greater variability in residuals at the higher compared with lower SES levels. When this happens, the validity of significance tests in multiple regression becomes compromised.
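A rough numerical check of this assumption can be sketched as follows. The data, model, and SES levels are hypothetical and serve only to show how residual spread can be compared across levels of a predictor.

import numpy as np

rng = np.random.default_rng(1)
n = 300
ses_level = rng.integers(1, 6, size=n)          # hypothetical SES levels 1-5
educ = rng.normal(12, 2, size=n)
salary = 2000 * ses_level + 1500 * educ + rng.normal(0, 5000, size=n)

X = np.column_stack([np.ones(n), ses_level, educ])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
residuals = salary - X @ b

# Similar residual spread across SES levels is consistent with homoscedasticity
for level in range(1, 6):
    print(level, round(residuals[ses_level == level].std(), 1))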
Chris Segrin

See also Bivariate Regression; Coefficients of Correlation, Alienation, and Determination; Correlation; Logistic Regression; Pearson Product-Moment Correlation Coefficient; Regression Coefficient

Further Readings

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Sage.
Berry, W. D. (1993). Understanding regression assumptions. Thousand Oaks, CA: Sage.
Cohen, J., Cohen, P., West, S., & Aiken, L. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York: Wadsworth.

MULTIPLE TREATMENT INTERFERENCE

Multiple treatment interference is a threat to the internal validity of a group design. A problem occurs when participants in one group have received all or some of a treatment in addition to the one assigned as part of an experimental or quasi-experimental design. In these situations, the researcher cannot determine what, if any, influence on the outcome is associated with the nominal treatment and what variance is associated with some other treatment or condition. In terms of independent and dependent variable designations, multiple treatment interference occurs when participants were meant to be assigned to one level of the independent variable (e.g., a certain group with a researcher assigned condition) but were functionally at a different level of the variable (e.g., they received some of the treatment meant for a comparison group). Consequently, valid conclusions about cause and effect are difficult to make.

There are several situations that can result in multiple treatment interference, and they can occur in either experimental designs (which have random assignment of participants to groups or levels of the independent variable) or quasi-experimental designs (which do not have random assignment to groups). One situation might find one or more participants in one group receiving accidentally, in addition to their designated treatment, the treatment meant for a second group. This can happen administratively in medicine studies, for example, if subjects receive both the drug they are meant to receive and, accidentally, are also given the drug meant for a comparison group. If benefits are found in both groups or in the group meant to receive a placebo (for example), it is unclear whether effects are due to the experimental drug, the placebo, or a combination of the two. The ability to isolate the effects of the experimental drug or (more generally in research design) the independent variable on the outcome variable is the strength of a good research design, and consequently, strong research designs attempt to avoid the threat of multiple treatment interference. A second situation involving multiple treatment interference is more common, especially in the social sciences. Imagine an educational researcher interested in the effects of a new method of reading instruction. The researcher has arranged for one elementary teacher in a school building to use the experimental approach and another elementary teacher to use the traditional method. Scores on a reading test are collected from both classrooms as part of a pre–post test design. The design looks like this:

Experimental Group:
Pretest → 12 weeks of instruction → Posttest

Comparison Group:
Pretest → 12 weeks of instruction → Posttest

If the study were conducted as planned, comparisons of posttest means for the two groups, perhaps after controlling for initial differences found at the time of the pretest, would provide fairly valid evidence of the comparative effectiveness of the new method. The conclusion is predicated, though, on the assumption that students in the traditional classroom were not exposed to the experimental method. Often, in the real world, it is difficult to keep participants in the control or comparison group free from the "contamination" of the experimental treatment. In the case of this example, the teacher in the comparison group may have used some of the techniques or strategies included in the experimental approach. He may have done this inadvertently, or intentionally, deciding that ethically he should use the best methods he knows. Contamination might also have been caused by the students' partial exposure to the new instructional approach in some circumstance outside of the classroom—a family member in the other classroom may have brought homework home and shared it with a student, for example.

Both of these examples describe instances of multiple treatment interference. Because of participants' exposure to more than only the intended treatment, it becomes difficult for researchers to establish relationships between well-defined variables. When a researcher believes that multiple treatments might be effective or wishes to investigate the effect of multiple treatments, however, a design can be applied that controls the combinations of treatments and investigates the consequences of multiple treatments. Drug studies sometimes are interested in identifying the benefits or disadvantages of various interactions or combinations of treatments, for example. A study interested in exploring the effects of multiple treatments might look like this (assuming random assignment):

Group 1:
Treatment 1 → Measure outcome
Group 2:
Treatment 2 → Measure outcome
Group 3:
Treatment 1, followed by Treatment 2 → Measure outcome
Group 4:
Treatment 2, followed by Treatment 1 → Measure outcome

A statistical comparison of the four groups' outcome means would identify the optimum treatment—Treatment 1, Treatment 2, or a particular sequence of both treatments.
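For illustration, such a four-group comparison of outcome means can be run as a one-way analysis of variance. The outcome scores below are simulated, hypothetical values and do not come from any study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical outcomes for Groups 1-4 (Treatment 1, Treatment 2, 1 then 2, 2 then 1)
g1 = rng.normal(50, 10, 30)
g2 = rng.normal(53, 10, 30)
g3 = rng.normal(57, 10, 30)
g4 = rng.normal(55, 10, 30)

F, p = stats.f_oneway(g1, g2, g3, g4)   # omnibus test across the four groups
print([round(g.mean(), 1) for g in (g1, g2, g3, g4)], round(F, 2), round(p, 4))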
Bruce Frey

See also Experimental Design; Quasi-Experimental Design

Further Readings

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.

MULTITRAIT–MULTIMETHOD MATRIX

The multitrait–multimethod (MTMM) matrix contains the correlations between variables when each variable represents a trait–method unit, that is, the measurement of a trait (e.g., extroversion, neuroticism) by a specific method (e.g., self-report, peer report). In order to obtain the matrix, each trait has to be measured by the same set of methods. This makes it possible to arrange the correlations in such a way that the correlations between different traits measured by the same method can be separated from the correlations between different traits measured by different methods. The MTMM matrix was recommended by Donald T. Campbell and Donald W. Fiske as a means of measuring convergent and discriminant validity. This entry discusses the structure, evaluation, and analysis approaches of the MTMM matrix.

Structure of the MTMM Matrix

Table 1 shows a prototypical MTMM matrix for three traits measured by three methods. An MTMM matrix consists of two major parts: monomethod blocks and heteromethod blocks.

Table 1   Multitrait–Multimethod Matrix for Three Traits Measured by Three Methods

                        Method 1                 Method 2                 Method 3
Traits               Trait 1 Trait 2 Trait 3  Trait 1 Trait 2 Trait 3  Trait 1 Trait 2 Trait 3
Method 1  Trait 1    (.90)
          Trait 2     .10    (.90)
          Trait 3     .10     .10    (.90)
Method 2  Trait 1     .50     .12     .12    (.80)
          Trait 2     .14     .50     .12     .20    (.80)
          Trait 3     .14     .14     .50     .20     .20    (.80)
Method 3  Trait 1     .50     .05     .05     .40     .03     .03    (.85)
          Trait 2     .04     .50     .05     .02     .40     .03     .30    (.85)
          Trait 3     .04     .04     .50     .02     .02     .40     .30     .30    (.85)

Notes: The correlations are artificial. Reliabilities are in parentheses. Heterotrait–monomethod correlations are in the gray subdiagonals. Heterotrait–heteromethod correlations are enclosed by a broken line. Monotrait–heteromethod correlations in the convergent validity diagonals are in bold type.

Monomethod Blocks

The monomethod blocks contain the correlations between variables that belong to the same method. In Table 1 there are three monomethod blocks, one for each method. Each monomethod block consists of two parts. The first part (reliability diagonals) contains the reliabilities of the measures. The second part (the heterotrait–monomethod triangles) includes the correlations between different traits that are measured by the same methods. The reliabilities can be considered as monotrait–monomethod correlations.

Heteromethod Blocks

The heteromethod blocks comprise the correlations between traits that were measured by different methods. Table 1 contains three heteromethod blocks, one for each combination of the three methods. A heteromethod block consists of two parts. The validity diagonal (monotrait–heteromethod correlations) contains the correlations of the same traits measured by different methods. The heterotrait–heteromethod triangles cover the correlations of different traits measured by different methods.

Criteria for Evaluating the MTMM Matrix

Campbell and Fiske described four properties an MTMM matrix should show when convergent and discriminant validity is present:

1. The correlations in the validity diagonals (monotrait–heteromethod correlations) should be significantly different from 0 and they should be large. These correlations indicate convergent validity.

2. The heterotrait–heteromethod correlations should be smaller than the monotrait–heteromethod correlations (discriminant validity).

3. The heterotrait–monomethod correlations should be smaller than the monotrait–heteromethod correlations (discriminant validity).

4. The same pattern of trait intercorrelations should be shown in all heterotrait triangles in the monomethod as well as in the heteromethod blocks (discriminant validity).
Limitations of These Rules

These four requirements have been developed by Campbell and Fiske because they are easy-to-apply rules for evaluating an MTMM matrix with respect to its convergent and discriminant validity. They are, however, restricted in several ways. The application of these criteria is difficult if the different measures differ in their reliabilities. In this case the correlations, which are correlations of observed variables, can be distorted by measurement error in different ways, and differences between correlations could only be due to differences in reliabilities. Moreover, there is no statistical test of whether these criteria are fulfilled in a specific application. Finally, the MTMM matrix is not explained by a statistical model allowing the separation of different sources of variance that are due to trait, method, and error influences. Modern psychometric approaches complete these criteria and circumvent some of these problems.

Modern Psychometric Approaches for Analyzing the MTMM Matrix

Many statistical approaches, such as analysis of variance and generalizability theory, multilevel modeling, and item response theory, have been applied to analyze MTMM data sets. Among all methods, direct product models and models of confirmatory factor analyses have been the most often applied and influential approaches.

Direct Product Models

The basic idea of direct product models is that each correlation of an MTMM matrix is assumed to be a product of two correlations: a correlation between traits and a correlation between methods. For each combination of traits, there is a correlation indicating discriminant validity, and for each combination of methods, there is a correlation indicating the degree of convergent validity. For example, the heterotrait–heteromethod correlation Cor(T1M1, T2M2) between a trait T1 (measured by a method M1) and a trait T2 (measured by a method M2) is the product of the correlations Cor(T1, T2) and Cor(M1, M2):

Cor(T1M1, T2M2) = Cor(T1, T2) × Cor(M1, M2).

If the two traits are measured by the same method (e.g., M1), the method intercorrelation is 1 (Cor[M1, M1] = 1), and the observed correlation equals the correlation between the two traits:

Cor(T1M1, T2M1) = Cor(T1, T2).

If the two traits, however, are measured by different methods, the correlation of the traits is attenuated by the correlation of the methods. Hence, the smaller the convergent validity, the smaller are the expected correlations between the traits.

The four properties of the MTMM matrix proposed by Campbell and Fiske can be evaluated by the correlations of the direct product model:

1. The correlations between two methods (convergent validity) should be large.

2. The second property is always fulfilled when the correlation between two traits is smaller than 1, because the direct product model always implies, in this case,

Cor(T1, T1) × Cor(M1, M2) > Cor(T1, T2) × Cor(M1, M2).

3. The third property is satisfied when the correlations between methods are larger than the correlations between traits because Cor(T1, T1) = Cor(M1, M1) = 1, and in this case,

Cor(T1, T1) × Cor(M1, M2) > Cor(T1, T2) × Cor(M1, M1).

4. The fourth requirement is always fulfilled if the direct product model holds because the trait intercorrelations in all heteromethod blocks are weighted by the same method correlation. This makes sure that the ratio of two trait correlations is the same for all monomethod and heteromethod blocks.
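The direct product structure can be written compactly in code. In the sketch below the trait and method correlation matrices are hypothetical values chosen for illustration; ordering the nine trait–method units method by method turns the direct product rule into a Kronecker product. The composite direct product model would additionally model measurement error.

import numpy as np

# Hypothetical trait and method correlation matrices
trait_cor = np.array([[1.0, 0.3, 0.2],
                      [0.3, 1.0, 0.4],
                      [0.2, 0.4, 1.0]])
method_cor = np.array([[1.0, 0.6, 0.5],
                       [0.6, 1.0, 0.7],
                       [0.5, 0.7, 1.0]])

# Direct product rule: Cor(TjMk, TlMm) = Cor(Tj, Tl) * Cor(Mk, Mm)
implied = np.kron(method_cor, trait_cor)   # 9 x 9 implied MTMM correlation matrix
print(implied.round(2))

# Example: heterotrait-heteromethod correlation of T1M1 with T2M2
print(trait_cor[0, 1] * method_cor[0, 1])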
The direct product model has been extended by Michael Browne to the composite direct product model by the consideration of measurement error. Direct product models are reasonable models for analyzing MTMM matrices. They are, however, also limited in at least two respects. First, they do not allow the decomposition of variance in components due to trait, method, and error influences. Second, they presuppose that the correlations between the traits do not differ between the different monomethod blocks.

Models of Confirmatory Factor Analysis

During recent years many different MTMM models of confirmatory factor analysis (CFA) have been developed. Whereas the first models were developed for decomposing the classical MTMM matrix that is characterized by a single indicator for each trait–method unit, more recently formulated models consider multiple indicators for each trait–method unit.

Single Indicator Models

Single indicator models of CFA decompose the observed variables into different components representing trait, method, and error influences. Moreover, they differ in the assumptions they make concerning the homogeneity of method and trait effects, as well as admissible correlations between trait and method factors. Keith Widaman, for example, describes a taxonomy of 16 MTMM-CFA models by combining four different types of trait structures (no trait factor, general trait factor, several orthogonal trait factors, several oblique trait factors) with four types of method structures (no method factor, general method factor, several orthogonal method factors, several oblique method factors).

In the most general single-indicator MTMM model, an observed variable Yjk, indicating a trait j measured by a method k, is decomposed into a trait factor Tj, a method factor Mk, and a residual variable Ejk:

Yjk = αjk + λTjk Tj + λMjk Mk + Ejk,

where αjk is an intercept, λTjk is a trait loading, and λMjk is a method loading. The trait factors are allowed to be correlated. The correlations indicate the degree of discriminant validity. Also, the method factors are allowed to be correlated. These method intercorrelations show whether method effects generalize across methods.
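For illustration, the decomposition just described can be simulated directly. The loadings, factor correlations, and residual standard deviation below are hypothetical, and the method factors are generated as uncorrelated for simplicity; the resulting correlation matrix of the nine observed variables has the MTMM structure implied by the single-indicator model.

import numpy as np

rng = np.random.default_rng(3)
n = 500
# Hypothetical correlated trait scores and independent method scores
traits = rng.multivariate_normal([0, 0, 0], [[1, .4, .3], [.4, 1, .5], [.3, .5, 1]], size=n)
methods = rng.normal(size=(n, 3))

# One observed variable per trait-method unit: Yjk = a + lT * Tj + lM * Mk + E
a, lT, lM = 0.0, 0.8, 0.4   # hypothetical intercept and loadings
Y = np.zeros((n, 9))
for j in range(3):          # traits
    for k in range(3):      # methods
        e = rng.normal(scale=0.5, size=n)
        Y[:, 3 * k + j] = a + lT * traits[:, j] + lM * methods[:, k] + e

print(np.corrcoef(Y, rowvar=False).round(2))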
This model is called a correlated-trait–correlated-method (CTCM) model and is depicted in Figure 1.

Figure 1   Correlated-Trait–Correlated-Method Model (path diagram relating the observed variables Y11 through Y33 and their residuals E11 through E33 to correlated trait factors Trait 1 to Trait 3 and correlated method factors Method 1 to Method 3)

The model has several advantages. It allows researchers to decompose the variance of an observed variable into the variance due to trait, method, and error influences. Convergent validity is given when the variances due to the method factors are small. In the ideal case of perfect convergent validity, the variances of all method factors would be 0, meaning that there are no method factors in the model (so-called correlated trait model). Perfect discriminant validity could be present if the trait factors are uncorrelated. Method factors could be correlated, indicating that the different methods can be related in a different way. However, aside from these advantages, this model is affected by serious statistical and conceptual problems. One statistical problem is that the model is not generally identified, which means that there are data constellations in which it is not possible to estimate the parameters of the model. For example, if all factor loadings do not differ from each other, the model is not identified. Moreover, applications of this model often show nonadmissible parameter estimates (e.g., negative variances of method factors), and the estimation process often does not converge. From a more conceptual point of view, the model has been criticized because it allows correlations between method factors. These correlations make the interpretation of the correlations between trait factors more difficult. If, for example, all trait factors are uncorrelated, this would indicate perfect discriminant validity. However, if all method factors are correlated, it is difficult to interpret the uncorrelatedness of the trait factors as perfect discriminant validity because the method factor correlations represent a portion of variance shared by all variables that might be due to a general trait effect.

The major problems of the CTCM model are caused by the correlated method factors. According to Michael Eid, Tanja Lischetzke, and Fridtjof Nussbeck, the problems of the CTCM model can be circumvented by dropping the correlations between the method factors or by dropping one method factor. A CTCM model without correlations between method factors is called a correlated-trait–uncorrelated-method (CTUM) model. This model is a special case of the model depicted in Figure 1 but with uncorrelated method factors. This model is reasonable if correlations between method factors are not expected. According to Eid and colleagues, this is the case when interchangeable methods are considered. Interchangeable methods are methods that are randomly chosen from a set of methods. If one considers different raters as different methods, an example of interchangeable raters (methods) is randomly selected students rating their teacher. If one randomly selects three students for each teacher and if one assigns these three students randomly to three rater groups, the three method factors would represent the deviation of individual raters from the expected (mean) rating of the teacher (the trait scores). Because the three raters are interchangeable, correlations between the method factors would not be expected. Hence, applying the CTUM model in the case of interchangeable raters would circumvent the problems of the CTCM model.

The situation, however, is quite different in the case of structurally different methods. An example of structurally different methods is a self-rating, a rating by the parents, and a rating by the teacher. In this case, the three raters are not interchangeable but are structurally different. In this case, the CTUM model is not reasonable as it may not adequately represent the fact that teachers and parents can share a common view that is not shared with the student (correlations of method effects). In this case, dropping one method factor solves the problem of the CTCM model in many cases. This model with one method factor less than the number of methods considered is called the correlated-trait–correlated-(method − 1) model (CTC[M − 1]). This model is a special case of the model depicted in Figure 1 but with one method factor less. If the first method in Figure 1 is the self-report, the second method is the teacher report, and the third method is the parent report, dropping the first method factor would imply that the three trait factors equal the true-score variables of the self-reports. The self-report method would play the role of the reference method that has to be chosen in this model. Hence, in the CTC(M − 1) model, the trait factor is completely confounded with the reference method. The method factors have a clear meaning. They indicate the deviation of the true (error-free) other reports from the value predicted by the self-report. A method effect is that (error-free) part of a nonreference method that cannot be predicted by the reference method. The correlations between the two method factors would then indicate that the two other raters (teachers and parents) share a common view of the child that is not shared by the child herself or himself. This model allows contrasting methods, but it does not contain common "method-free" trait factors. It is doubtful that such a common trait factor has a reasonable meaning in the case of structurally different methods.

All single indicator models presented so far assume that the method effects belonging to one method are unidimensional as there is one common method factor for each method. This assumption could be too strong as method effects could be trait specific. Trait-specific method effects are part of the residual in the models presented so far. That means that reliability will be underestimated because a part of the residual is due to method effects and not due to measurement error. Moreover, the models may not fit the data in the case of trait-specific method effects. If the assumption of unidimensional method factors for a method is too strong, the method factors can be dropped and replaced by correlated residuals. For example, if one replaces the method factors in the CTUM model by correlations of residuals belonging to the same methods, one obtains the correlated-trait–correlated-uniqueness (CTCU) model. However, in this model the reliabilities are underestimated
because method effects are now part of the residuals. Moreover, the CTUM model does not allow correlations between residuals of different methods. This might be necessary in the case of structurally different methods. Problems that are caused by trait-specific method effects can be appropriately handled in multiple indicator models.

Multiple Indicator Models

In multiple indicator models, there are several indicators for one trait–method unit. In the less restrictive model, there is one factor for all indicators belonging to the same trait–method unit. The correlations between these factors constitute a latent MTMM matrix. The correlation coefficients of this latent MTMM matrix are not distorted by measurement error and allow a more appropriate application of the Campbell and Fiske criteria for evaluating the MTMM matrix. Multiple indicator models allow the definition of trait-specific method factors and, therefore, the separation of measurement error and method-specific influences in a more appropriate way. Eid and colleagues have shown how different models of CFA can be defined for different types of methods. In the case of interchangeable methods, a multilevel CFA model can be applied that allows the specification of trait-specific method effects. In contrast to the extension of the CTCU model to multiple indicators, the multilevel approach has the advantage that the number of methods (e.g., raters) can differ between targets. In the case of structurally different raters, an extension of the CTC(M − 1) model to multiple indicators can be applied. This model allows a researcher to test specific hypotheses about the generalizability of method effects across traits and methods. In the case of a combination of structurally different and interchangeable methods, a multilevel CTC(M − 1) model would be appropriate.

Michael Eid

See also Construct Validity; "Convergent and Discriminant Validation by the Multitrait–Multimethod Matrix"; MBESS; Structural Equation Modeling

Further Readings

Browne, M. W. (1984). The decomposition of multitrait-multimethod matrices. British Journal of Mathematical & Statistical Psychology, 37, 1–21.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Dumenci, L. (2000). Multitrait-multimethod analysis. In S. D. Brown & H. E. A. Tinsley (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 583–611). San Diego, CA: Academic Press.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241–261.
Eid, M. (2006). Methodological approaches for analyzing multimethod data. In M. Eid & E. Diener (Eds.), Handbook of multimethod measurement in psychology (pp. 223–230). Washington, DC: American Psychological Association.
Eid, M., & Diener, E. (2006). Handbook of multimethod measurement in psychology. Washington, DC: American Psychological Association.
Eid, M., Nussbeck, F. W., Geiser, C., Cole, D. A., Gollwitzer, M., & Lischetzke, T. (2008). Structural equation modeling of multitrait-multimethod data: Different models for different types of methods. Psychological Methods, 13, 230–253.
Kenny, D. A. (1976). An empirical application of confirmatory factor analysis to the multitrait-multimethod matrix. Journal of Experimental Social Psychology, 12, 247–252.
Marsh, H. W., & Grayson, D. (1995). Latent variable models of multitrait-multimethod data. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 177–198). Thousand Oaks, CA: Sage.
Shrout, P. E., & Fiske, S. T. (Eds.). (1995). Personality research, methods, and theory: A festschrift honoring Donald W. Fiske. Hillsdale, NJ: Lawrence Erlbaum.

MULTIVALUED TREATMENT EFFECTS

The term multivalued treatment effects broadly refers to a collection of population parameters that capture the impact of a given treatment assigned to each observational unit, when this treatment status takes multiple values. In general, treatment levels may be finite or infinite as well as ordinal or cardinal. When the treatment effect of interest is the mean
outcome for each treatment level, the resulting population parameter is typically called the dose–response function in the statistical literature, regardless of whether the treatment levels are finite or infinite. The analysis of multivalued treatment effects has several distinct features when compared with the analysis of binary treatment effects, including the following: (a) A comparison or control group is not always clearly defined, (b) new parameters of interest arise capturing distinct phenomena such as nonlinearities or tipping points, (c) in most cases correct statistical inferences require the joint estimation of all treatment effects (as opposed to the estimation of each treatment effect at a time), and (d) efficiency gains in statistical inferences may be obtained by exploiting known restrictions among the multivalued treatment effects. This entry discusses the treatment effect model and statistical inference procedures for multivalued treatment effects.

Treatment Effect Model and Population Parameters

A general statistical treatment effect model with multivalued treatment assignments is easily described in the context of the classical potential outcomes model. This model assumes that each unit i in a population has an underlying collection of potential outcome random variables {Yi(t) : t ∈ T}, where T denotes the collection of possible treatment assignments. The random variables Yi(t) are usually called potential outcomes because they represent the random outcome that unit i would have under treatment regime t ∈ T. For each unit i and for any two treatment levels, t1 and t2, it is always possible to define the individual treatment effect given by Yi(t1) − Yi(t2), which may or may not be a degenerate random variable. However, because units are not observed under different treatment regimes simultaneously, such comparisons are not feasible. This idea, known as the fundamental problem of causal inference, is formalized in the model by assuming that for each unit i only (Yi, Ti) is observed, where Yi = Yi(Ti) and Ti ∈ T. In words, for each unit i, only the potential outcome for treatment level Ti = t is observed while all other (counterfactual) outcomes are missing. Of course, in most applications, which treatment each unit has taken up is not random and hence further assumptions would be needed to identify the treatment effect of interest.

A binary treatment effect model has T = {0, 1}, a finite multivalued treatment effect model has T = {0, 1, . . . , J} for some positive integer J, and a continuous treatment effect model has T = [0, 1]. (Note that the values in T are ordinal, that is, they may be seen just as normalizations of the underlying real treatment levels in a given application.) Many applications focus on a binary treatment effects model and base the analysis on the comparison of two groups, usually called treatment group (Ti = 1) and control group (Ti = 0). A multivalued treatment may be collapsed into a binary treatment, but this procedure usually would imply some important loss of information in the analysis. Important phenomena such as nonlinearities, differential effects across treatment levels, or tipping points cannot be captured by a binary treatment effect model.

Typical examples of multivalued treatment effects are comparisons between some characteristic of the distributions of the potential outcomes. Well-known examples are mean and quantile comparisons, although in many applications other features of these distributions may be of interest. For example, assuming, to simplify the discussion, that the random potential outcomes are equal for all units (this holds, for instance, in the context of random sampling), the mean of the potential outcome under treatment regime t ∈ T is given by μ(t) = E[Yi(t)]. The collection of these means is the so-called dose–response function. Using this estimand, it is possible to construct different multivalued treatment effects of interest, such as pairwise comparisons (e.g., μ(t2) − μ(t1)) or differences in pairwise comparisons, which would capture the idea of nonlinear treatment effects. (In the particular case of binary treatment effects, the only possible pairwise comparison is μ(1) − μ(0), which is called the average treatment effect.) Using the dose–response function, it is also possible to consider other treatment effects that arise as nonlinear transformations of μ(t), such as ratios, incremental changes, tipping points, or the maximal treatment effect μ* = max_{t ∈ T} μ(t), among many other possibilities.
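As an illustration of the dose–response function and of pairwise comparisons, the following sketch assumes a hypothetical randomized treatment with four levels, so that group means are valid estimates of μ(t); the data are simulated and the functional form is invented for the example.

import numpy as np

rng = np.random.default_rng(4)
n = 1000
t = rng.integers(0, 4, size=n)                        # hypothetical treatment levels 0-3
y = 2.0 + 1.5 * t - 0.2 * t**2 + rng.normal(size=n)   # hypothetical outcomes

# Sample analog of the dose-response function mu(t) = E[Yi(t)] under random assignment
mu_hat = np.array([y[t == level].mean() for level in range(4)])
print(mu_hat.round(2))

# Pairwise comparisons and a difference of pairwise comparisons (a nonlinearity check)
print(round(mu_hat[1] - mu_hat[0], 2), round(mu_hat[2] - mu_hat[1], 2))
print(round((mu_hat[2] - mu_hat[1]) - (mu_hat[1] - mu_hat[0]), 2))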
All these multivalued treatment effects are constructed on the basis of the mean of the potential outcomes, but similar estimands may be considered that are based on quantiles, dispersion measures, or other characteristics of the underlying potential outcome distribution. Conducting valid hypothesis testing about these treatment effects requires in most cases the joint estimation of the underlying multivalued treatment effects.

Statistical Inference

There exists a vast theoretical literature proposing and analyzing different statistical inference procedures for multivalued treatment effects. This large literature may be characterized in terms of the key identifying assumption underlying the treatment effect model. This key assumption usually takes the form of a (local) independence or orthogonality condition, such as (a) a conditional independence assumption, which assumes that conditional on a set of observable characteristics, selection into treatment is random, or (b) an instrumental variables assumption, which assumes the existence of variables that induce exogenous changes in the treatment assignment. With the use of an identifying assumption (together with other standard model assumptions), it has been shown in the statistical and econometrics literatures that several parametric, semiparametric, and nonparametric procedures allow for optimal joint inference in the context of multivalued treatments. These results are typically obtained with the use of large sample theory and justify (asymptotically) the use of classical statistical inference procedures involving multiple treatment levels.

Matias D. Cattaneo

See also Multiple Treatment Interference; Observational Research; Propensity Score Analysis; Selection; Treatment(s)

Further Readings

Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155, 138–154.
Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, Part I: Causal models, structural models and econometric policy evaluation. In J. J. Heckman & E. E. Leamer (Eds.), Handbook of econometrics (Vol. 6B, pp. 4779–4874). Amsterdam: North-Holland.
Imai, K., & van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 99, 854–866.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47, 5–86.
Rosenbaum, P. (2002). Observational studies. New York: Springer.

MULTIVARIATE ANALYSIS OF VARIANCE (MANOVA)

Multivariate analysis of variance (MANOVA) designs are appropriate when multiple dependent variables are included in the analysis. The dependent variables should represent continuous measures (i.e., interval or ratio data). Dependent variables should be moderately correlated. If there is no correlation at all, MANOVA offers no improvement over an analysis of variance (ANOVA); if the variables are highly correlated, the same variable may be measured more than once. In many MANOVA situations, multiple independent variables, called factors, with multiple levels are included. The independent variables should be categorical (qualitative). Unlike ANOVA procedures that analyze differences across two or more groups on one dependent variable, MANOVA procedures analyze differences across two or more groups on two or more dependent variables. Investigating two or more dependent variables simultaneously is important in various disciplines, ranging from the natural and physical sciences to government and business and to the behavioral and social sciences. Many research questions cannot be answered adequately by an investigation of only one dependent variable because treatments in experimental studies are likely to affect subjects in more than one way. The focus of this entry is on the various types of MANOVA procedures and associated assumptions. The logic of MANOVA and advantages and disadvantages of MANOVA are included.
MANOVA is a special case of the general linear models. MANOVA may be represented in a basic linear equation as Y = Xβ + ε, where Y represents a vector of dependent variables, X represents a matrix of independent variables, β represents a vector of weighted regression coefficients, and ε represents a vector of error terms. Calculations for the multivariate procedures are based on matrix algebra, making hand calculations virtually impossible. For example, the null hypothesis for MANOVA states no difference among the population mean vectors. The form of the omnibus null hypothesis is written as H0: μ1 = μ2 = · · · = μk. It is important to remember that the means displayed in the null hypothesis represent mean vectors for the population, rather than the population means. The complexity of MANOVA calculations requires the use of statistical software for computing.

Logic of MANOVA

MANOVA procedures evaluate differences in population means on more than one dependent variable across levels of a factor. MANOVA uses a linear combination of the dependent variables to form a new dependent variable that minimizes within-group variance and maximizes between-group differences. The new variable is used in an ANOVA to compare differences among the groups. Use of the newly formed dependent variable in the analysis decreases the Type I error (error of rejecting a true null hypothesis) rate. The linear combination reveals a more complete picture of the characteristic or attribute under study. For example, a social scientist may be interested in the kinds of attitudes that people have toward the environment based on their attitudes about global warming. In such a case, analysis of only one dependent variable (attitude about global warming) is not completely representative of the attitudes that people have toward the environment. Multiple measures, such as attitude toward recycling, willingness to purchase environmentally friendly products, and willingness to conserve water and energy, will give a more holistic view of attitudes toward the environment. In other words, MANOVA analyzes the composite of several variables, rather than analyzing several variables individually.

Advantages of MANOVA Designs

MANOVA procedures control for experiment-wide error rate, whereas multiple univariate procedures increase the Type I error rate, which can lead to rejection of a true null hypothesis. For example, analysis of group differences on three dependent variables would require three univariate tests. If the alpha level is set at .05, there is a 95% chance of not making a Type I error. The following calculations show how the 95% error rate is compounded with three univariate tests: (.95)(.95)(.95) = .857 and 1 − .857 = .143, or 14.3%, which is an unacceptable error rate. In addition, univariate tests do not account for the intercorrelations among variables, thus risking loss of valuable information. Furthermore, MANOVA decreases the Type II error (error of not rejecting a false null hypothesis) rate by detecting group differences that appear only through the combination of two or more dependent variables.
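The compounding just described can be checked with a few lines of code. The sketch below is illustrative only; it simply generalizes the (.95)(.95)(.95) calculation to other numbers of separate univariate tests.

alpha = 0.05
m = 3                                  # number of univariate tests, as in the example above
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 3))            # 0.143 for three tests at alpha = .05

# The same compounding for other numbers of dependent variables
for m in (2, 3, 5, 10):
    print(m, round(1 - (1 - alpha) ** m, 3))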
Disadvantages of MANOVA Designs

MANOVA procedures are more complex than univariate procedures; thus, outcomes may be ambiguous and difficult to interpret. The power of MANOVA may actually reveal statistically significant differences when multiple univariate tests may not show differences. Statistical power is the probability of rejecting the null hypothesis when the null is false. (Power = 1 − β.) The difference in outcomes between ANOVA and MANOVA results from the overlapping of the distributions for each of the groups with the dependent variables in separate analyses. In the MANOVA procedure, the linear combination of dependent variables is used for the analysis. Finally, more assumptions are required for MANOVA than for ANOVA.

Assumptions of MANOVA

The mathematical underpinnings of inferential statistics require that certain statistical assumptions be met. Assumptions for MANOVA designs are (a) multivariate normality, (b) homoscedasticity, (c) linearity, and (d) independence and randomness.

Multivariate Normality

Observations on all dependent variables are multivariately normally distributed for each level within each group and for all linear combinations of the dependent variables. Joint normality in more than two dimensions is difficult to assess; however, tests for univariate normality on each of the variables are recommended. Univariate normality, a prerequisite to multivariate normality, can be assessed graphically and statistically. For example, a quantile–quantile plot resembling a straight line suggests normality. While normality of the univariate tests does not mean that the data are multivariately normal, such tests are useful in evaluating the assumption. MANOVA is insensitive (robust) to moderate departures from normality for large data sets and in situations in which the violations are due to skewed data rather than outliers.

A scatterplot for pairs of variables for each group can reveal data points located far from the pattern produced by the other observations. Mahalanobis distance (distance of each case from the centroid of all the remaining cases) is used to detect multivariate outliers. Significance of Mahalanobis distance is evaluated as a chi-square statistic. A case may be considered an outlier if its Mahalanobis distance is statistically significant at the p < .0001 level. Other graphical techniques, such as box plots and stem-and-leaf plots, may be used to assess univariate normality. Two additional descriptive statistics related to normality are skewness and kurtosis.

Skewness

Skewness refers to the symmetry of the distribution about the mean. Statistical values for skewness range from ± ∞. A perfectly symmetrical distribution will yield a value of zero. In a positively skewed distribution, observations cluster to the left of the mean on the normal distribution curve with the right tail on the curve extended with a small number of cases. The opposite is true for a negatively skewed distribution. Observations cluster to the right of the mean on the normal distribution curve with the left tail on the curve extended with a small number of cases. In general, a skewness value of .7 or .8 is cause for concern and suggests that data transformations may be appropriate.

Kurtosis

Kurtosis refers to the degree of peakedness or flatness of a sample distribution compared with a normal distribution. Kurtosis values may be positive or negative to indicate a high peak or flatness near the mean, respectively. Values within ± 2 standard deviations from the mean or ± 3 standard deviations from the mean are generally considered within the normal range. A normal distribution has zero kurtosis. In addition to graphical techniques, the Shapiro–Wilk W statistic and the Kolmogorov–Smirnov statistic with Lilliefors significance levels are used to assess normality. Statistically significant W or Kolmogorov–Smirnov test results indicate that the distribution is nonnormal.
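A minimal screening sketch along these lines is shown below. It assumes hypothetical scores on three dependent variables and uses SciPy's implementations of skewness, excess kurtosis, and the Shapiro–Wilk test, plus Mahalanobis distances evaluated against a chi-square distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))          # hypothetical scores on three dependent variables

for j in range(X.shape[1]):
    col = X[:, j]
    w, p = stats.shapiro(col)          # Shapiro-Wilk W test of univariate normality
    print(j, round(stats.skew(col), 2), round(stats.kurtosis(col), 2), round(p, 3))

# Mahalanobis distances for multivariate outlier screening (chi-square with df = 3)
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
outliers = np.where(stats.chi2.sf(d2, df=X.shape[1]) < .0001)[0]
print(outliers)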
Homoscedasticity

The variance and covariance matrices for all dependent variables across groups are assumed to be equal. George Box's M statistic tests the null hypothesis of equality of the observed covariance matrices for the dependent variables for each group. A nonsignificant F value with the alpha level set at .001 from Box's M indicates equality of the covariance matrices. MANOVA procedures can tolerate moderate departures from equal variance–covariance matrices when sample sizes are similar.

Linearity

MANOVA procedures are based on linear combinations of the dependent variables; therefore, it is assumed that linear relationships exist among all pairs of dependent variables and all pairs of covariates across all groups. Consequently, linearity is important for all dependent variable–covariate pairs as well. Linearity may be assessed by examining scatterplots of pairs of dependent variables for each group. The scatterplot displays an elliptical shape to indicate a linear relationship. If both variables are not normally distributed, the assumption of linearity will not hold, and an elliptical shape will not be displayed on the scatterplot. If the linearity assumption is not met, data transformations may be necessary to establish linearity.

Independence and Randomness

Observations are independent of one another. That is, the score for one participant is independent of scores of any other participants for each variable. Randomness means that the sample was randomly selected from the population of interest.

Research Questions for MANOVA Designs

Multivariate analyses cover a broad range of statistical procedures. Common questions for which MANOVA procedures are appropriate are as follows: What are the mean differences between two levels of one independent variable for multiple dependent variables? What are the mean differences between or among multiple levels of one independent variable on multiple dependent variables? What are the effects of multiple independent variables on multiple dependent variables? What are the interactions among independent variables on one dependent variable or on a combination of dependent variables? What are the mean differences between or among groups when repeated measures are used in a MANOVA design? What are the effects of multiple levels of an independent variable on multiple dependent variables when effects of concomitant variables are removed from the analysis? What is the amount of shared variance among a set of variables when variables are grouped around a common theme? What are the relationships among variables that may be useful for predicting group membership? These sample questions provide a general sense of the broad range of questions that may be answered with MANOVA procedures.

Types of MANOVA Designs

Hotelling's T2

Problems for MANOVA can be structured in different ways. For example, a researcher may wish to examine the difference between males and females on number of vehicle accidents in the past 5 years and years of driving experience. In this case, the researcher has one dichotomous independent variable (gender) and two dependent variables (number of accidents and years of driving experience). The problem is to determine the difference between the weighted sample mean vectors (centroids) of a multivariate data set. This form of MANOVA is known as the multivariate analog to the Student's t test, and it is referred to as Hotelling's T2 statistic, named after Harold Hotelling for his work on the multivariate T2 distribution. Calculation of T2 is based on the combination of two Student's t ratios and their pooled estimate of correlation. The resulting T2 is converted into an F statistic and distributed as an F distribution.

One-Way MANOVA

Another variation of the MANOVA procedure is useful for investigating the effects of one multilevel independent variable (factor) on two or more dependent variables. An investigation of differences in mathematics achievement and motivation for students assigned to three different teaching methods is such a situation. For this problem, the researcher has one multilevel factor (teaching method with three levels) and two dependent variables (mathematics test scores and scores on a motivation scale). The objective is to determine the differences among the mean vectors for groups on the dependent variables, as well as differences among groups for the linear combinations of the dependent variables. This form of MANOVA extends Hotelling's T2 to more than two groups; it is known as the one-way MANOVA, and it can be thought of as the MANOVA analog of the one-way F situation. Results of the MANOVA produce four multivariate test statistics: Pillai's trace, Wilks's lambda (Λ), Hotelling's trace, and Roy's largest root. Usually results will not differ for the first three tests when applied to a two-group study; however, for studies involving more than two groups, tests may yield different results. The Wilks's Λ is the test statistic reported most often in publications. The value of Wilks's Λ ranges from 0 to 1. A small value of Λ indicates statistically significant differences among the groups or treatment effects. Wilks's Λ, the associated F value, hypotheses and error degrees of freedom, and the p value are usually reported. A significant F value is one that is greater than the critical value of F at predetermined degrees of freedom for a preset level of significance. As a general rule, tables for critical values of F and accompanying degrees of freedom are published as appendixes in many research and statistics books.
Factorial MANOVA

Another common variation of multivariate procedures is known as the factorial MANOVA. In this design, the effects of multiple factors on multiple dependent variables are examined. For example, the effects of geographic location and level of education on job satisfaction and attitudes toward work may be investigated via a factorial MANOVA. Geographic location with four levels and level of education with two levels are the factors. Geographic location could be coded as 1 = south, 2 = west, 3 = north, and 4 = east; level of education could be coded as 1 = college graduate and 0 = not college graduate. The MANOVA procedure will produce the main effects for each of the factors, as well as the interaction between the factors. For this example, three new dependent variables will be created to maximize group differences: one dependent variable to maximize the differences in geographic location and the linear combination of job satisfaction and attitudes toward work; one dependent variable to maximize the differences in education and the linear combination of job satisfaction and attitudes toward work; and another dependent variable to maximize separation among the groups for the interaction between geographic location and level of education. As in the previous designs, the factorial MANOVA produces Pillai's trace, Wilks's Λ, Hotelling's trace, and Roy's largest root. The multiple levels in factorial designs may produce slightly different values for the test statistics, even though these differences do not usually affect statistical significance. Wilks's Λ, associated F statistic, degrees of freedom, and the p value are usually reported in publications.

K Group MANOVA

MANOVA designs with three or more groups are known as K group MANOVAs. Like other multivariate designs, the null hypothesis tests whether differences between the mean vectors of K groups on the combination of dependent variables are due to chance. As with the factorial design, the K group MANOVA produces the main effects for each factor, as well as the interactions between factors. The same statistical tests and reporting requirements apply to the K group situation as to the factorial MANOVA.

Doubly Multivariate Designs

The purpose of doubly multivariate studies is to test for statistically significant group differences over time across a set of response variables measured at each time while accounting for the correlation among responses. A design would be considered doubly multivariate when multiple conceptually dissimilar dependent variables are measured across multiple time periods, as in a repeated measures study. For example, a study to compare problem-solving strategies of intrinsically and extrinsically motivated learners in different test situations could involve two dependent measures (score on a mathematics test and score on a reading test) taken at three different times (before a unit of instruction on problem solving, immediately following the instruction, and 6 weeks after the instruction) for each participant. Type of learner and test situation would be between-subjects factors and time would be a within-subjects factor.

Multivariate Analysis of Covariance

A blend of analysis of covariance and MANOVA, called multivariate analysis of covariance (MANCOVA), allows the researcher to control for the effects of one or more covariates. MANCOVA allows the researcher to control for sources of variation within multiple variables. In the earlier example on attitudes toward the environment, the effects of concomitant variables such as number of people living in a household, age of head of household, gender, annual income, and education level can be statistically removed from the analysis with MANCOVA.

Factor Analysis

MANOVA is useful as a data reduction procedure to condense a large number of variables into a smaller, more definitive set of hypothetical constructs. This procedure is known as factor analysis. Factor analysis is especially useful in survey research to reduce a large number of variables (survey items) to a smaller number of hypothetical variables by identifying variables that group or cluster together. For example, two or more dependent variables in a data set may measure the same entity or construct. If this is the case, the variables may be combined to form a new hypothetical variable. For example, a survey of students' attitudes toward work may include 40 related items, whereas a factor
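Assuming the statsmodels package is available, a factorial design of the kind described earlier (the geographic location × education example) can be requested in a few lines; the data file and column names below are hypothetical, and a continuous covariate could be added to the right-hand side of the formula to approximate the MANCOVA idea.

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    df = pd.read_csv("job_survey.csv")   # hypothetical file with the coded variables
    m = MANOVA.from_formula(
        "satisfaction + attitude ~ C(location) * C(education)", data=df)
    print(m.mv_test())   # reports Wilks' lambda, Pillai's trace, Hotelling-Lawley, Roy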
Factor Analysis

MANOVA is useful as a data reduction procedure to condense a large number of variables into a smaller, more definitive set of hypothetical constructs. This procedure is known as factor analysis. Factor analysis is especially useful in survey research to reduce a large number of variables (survey items) to a smaller number of hypothetical variables by identifying variables that group or cluster together. For example, two or more dependent variables in a data set may measure the same entity or construct. If this is the case, the variables may be combined to form a new hypothetical variable. For example, a survey of students' attitudes toward work may include 40 related items, whereas a factor analysis may reveal three underlying hypothetical constructs.

Discriminant Analysis

A common use of discriminant analysis (DA) is prediction of group membership by maximizing the linear combination of multiple quantitative independent variables that best portrays differences among groups. For example, a college may wish to group incoming students based on their likelihood of being graduated. This is called predictive DA. Also, DA is used to describe differences among groups by identifying discriminant functions based on uncorrelated linear combinations of the independent variables. This technique may be useful following a MANOVA analysis and is called descriptive DA.

Marie Kraska

See also Analysis of Variance (ANOVA); Discriminant Analysis; Multivariate Normal Distribution; Principal Components Analysis; Random Sampling; Repeated Measures Design

Further Readings

Green, S. B., & Salkind, N. J. (2008). Using SPSS for Windows and Macintosh: Analyzing and understanding data (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Johnson, R. A., & Wichern, D. W. (2002). Applied multivariate statistical analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Rencher, A. C. (2002). Methods of multivariate analysis (2nd ed.). San Francisco: Wiley.
Stevens, J. P. (2001). Applied multivariate statistics for the social sciences (4th ed.). Hillsdale, NJ: Lawrence Erlbaum.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn and Bacon.
Tatsuoka, M. M. (1988). Multivariate analysis: Techniques for educational and psychological research (2nd ed.). New York: Macmillan.
Timm, N. H. (2002). Applied multivariate analysis. New York: Springer-Verlag.
Weerahandi, S. (2004). Generalized inference in repeated measures: Exact methods in MANOVA and mixed models. San Francisco: Wiley.

MULTIVARIATE NORMAL DISTRIBUTION

One of the most familiar distributions in statistics is the normal or Gaussian distribution. It has two parameters, corresponding to the first two moments (mean and variance). Once these parameters are known, the distribution is completely specified. The multivariate normal distribution is a generalization of the normal distribution and also has a prominent role in probability theory and statistics. Its parameters include not only the means and variances of the individual variables in a multivariate set but also the correlations between those variables. The success of the multivariate normal distribution is due to its mathematical tractability and to the multivariate central limit theorem, which states that the sampling distributions of many multivariate statistics are normal, regardless of the parent distribution. Thus, the multivariate normal distribution is very useful in many statistical problems, such as multiple linear regressions and sampling distributions.

Probability Density Function

If X = (X_1, \ldots, X_n)' is a multivariate normal random vector, denoted X ~ N(μ, Σ) or X ~ N_n(μ, Σ), then its density is given by

f_X(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu) \right\},

where μ = (μ_1, ..., μ_n)' = E(X) is a vector whose components are the expectations E(X_1), ..., E(X_n), and Σ is the nonsingular (n × n) variance–covariance matrix whose diagonal terms are variances and off-diagonal terms are covariances:

\Sigma = V(X) = E\left[ (X - \mu)(X - \mu)' \right] =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{pmatrix}.

Note that the covariance matrix Σ is symmetric and positive definite. The (i, j)th element is given by σ_ij = E[(X_i − μ_i)(X_j − μ_j)] and σ_i² ≡ σ_ii = V(X_i).
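As a quick numerical check of the density formula above, the following sketch evaluates it directly and compares the result with SciPy's implementation; the mean vector, covariance matrix, and evaluation point are hypothetical.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    x = np.array([0.5, 0.5])

    d = x - mu
    pdf_manual = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
        (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))
    print(pdf_manual, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree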
An important special case of the multivariate normal distribution is the bivariate normal. If (X_1, X_2)' ~ N_2(μ, Σ), where

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \quad
\rho \equiv \mathrm{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1 \sigma_2},

then the bivariate density is given by

f_{X_1, X_2}(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}
\exp\left\{ -\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x - \mu_1}{\sigma_1}\right)\left(\frac{y - \mu_2}{\sigma_2}\right) + \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)} \right\}.

Let X = (X_1, X_2)'; the joint density can be rewritten in matrix notation as

f_X(x) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu) \right\}.

Multivariate Normal Density Contours

The contour levels of f_X(x), that is, the set of points in R^n for which f_X(x) is constant, satisfy

(x - \mu)' \Sigma^{-1} (x - \mu) = c^2.

These surfaces are n-dimensional ellipsoids centered at μ, whose axes of symmetry are given by the principal components (the eigenvectors) of Σ. Specifically, the length of the ellipsoid along the ith axis is c\sqrt{\lambda_i}, where λ_i is the ith eigenvalue associated with the eigenvector e_i (recall that eigenvectors e_i and eigenvalues λ_i are solutions to Σe_i = λ_i e_i for i = 1, ..., n).

Some Basic Properties

The following list presents some important properties involving the multivariate normal distribution.

1. The first two moments of a multivariate normal distribution, namely μ and Σ, completely characterize the distribution. In other words, if X and Y are both multivariate normal with the same first two moments, then they are similarly distributed.

2. Let X = (X_1, ..., X_n)' be a multivariate normal random vector with mean μ and covariance matrix Σ, and let α' = (α_1, ..., α_n) ∈ R^n \ {0}. The linear combination Y = α'X = α_1 X_1 + ... + α_n X_n is normal with mean E(Y) = α'μ and variance

V(Y) = \alpha' \Sigma \alpha = \sum_{i=1}^{n} \alpha_i^2 V(X_i) + \sum\sum_{i \neq j} \alpha_i \alpha_j \mathrm{Cov}(X_i, X_j).

Also, if α'X is normal with mean α'μ and variance α'Σα for all possible α, then X must be a multivariate normal random vector with mean μ and covariance matrix Σ (X ~ N_n(μ, Σ)).

3. More generally, let X = (X_1, ..., X_n)' be a multivariate normal random vector with mean μ and covariance matrix Σ, and let A ∈ R^{m × n} be a full rank matrix with m ≤ n; the set of linear combinations Y = (Y_1, ..., Y_m)' = AX is multivariate normally distributed with mean Aμ and covariance matrix AΣA'. Also, if Y = AX + b, where b is an m × 1 vector of constants, then Y is multivariate normally distributed with mean Aμ + b and covariance matrix AΣA'.

4. If X_i and X_j are jointly normally distributed, then they are independent if and only if Cov(X_i, X_j) = 0. Note that it is not necessarily true that uncorrelated univariate normal random variables are independent. Indeed, two random variables that are marginally normally distributed may fail to be jointly normally distributed.

5. Let Z = (Z_1, ..., Z_n)', where Z_i ~ i.i.d. N(0, 1) (where i.i.d. = independent and identically distributed). Z is said to be standard multivariate normal, denoted Z ~ N(0, I_n), and it can be shown that E[Z] = 0 and V(Z) = I_n, where I_n denotes the unit matrix of order n. The joint density of vector Z is given by

f_Z(z) = \prod_{i=1}^{n} f_{Z_i}(z_i) = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} z'z \right).

The density f_Z(z) is symmetric and unimodal with mode equal to zero. The contour levels of f_Z(z), that is, the set of points in R^n for which f_Z(z) is constant, are defined by
z'z = \sum_{i=1}^{n} z_i^2 = c^2,

where c ≥ 0. The contour levels of f_Z(z) are concentric circles in R^n centered at zero.

6. If Y_1, ..., Y_n ~ ind N(μ_i, σ_i²), then σ_ij = 0 for all i ≠ j, and it follows that Σ is a diagonal matrix. Thus, if

\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{pmatrix},

then

\Sigma^{-1} = \begin{pmatrix}
1/\sigma_1^2 & 0 & \cdots & 0 \\
0 & 1/\sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1/\sigma_n^2
\end{pmatrix},

so that

(y - \mu)' \Sigma^{-1} (y - \mu) = \sum_{j=1}^{n} \left( y_j - \mu_j \right)^2 / \sigma_j^2.

Note also that, as Σ is diagonal, we have |Σ| = σ_1² σ_2² ... σ_n². The joint density becomes

f_Y(y) = \prod_{i=1}^{n} f_{Y_i}(y_i; \mu_i, \sigma_i^2)
= \prod_{i=1}^{n} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left( \frac{y_i - \mu_i}{\sigma_i} \right)^2 \right\}.

Thus, f_Y(y) reduces to the product of univariate normal densities.

Moment Generating Function

Let Z = (Z_1, ..., Z_n)', where Z_i ~ i.i.d. N(0, 1). As previously seen, Z ~ N(0, I_n) is referred to as a standard multivariate normal vector, and the density of Z is given by

f_Z(z) = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} z'z \right).

The moment generating function of Z is obtained as follows:

M_Z(t) = E\left[ e^{t'Z} \right] = (2\pi)^{-n/2} \int_{R^n} \exp\{ t'z - z'z/2 \}\, dz
= \prod_{i=1}^{n} \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left( t_i z_i - \frac{1}{2} z_i^2 \right) dz_i
= \prod_{i=1}^{n} M_{Z_i}(t_i)
= E\left[ e^{t_1 Z_1} \right] E\left[ e^{t_2 Z_2} \right] \cdots E\left[ e^{t_n Z_n} \right]
= \exp\left( \tfrac{1}{2} t_1^2 \right) \exp\left( \tfrac{1}{2} t_2^2 \right) \cdots \exp\left( \tfrac{1}{2} t_n^2 \right)
= \exp\left( \frac{1}{2} \sum_{i=1}^{n} t_i^2 \right)
= \exp\left( \frac{1}{2} t't \right).

To obtain the moment generating function of the generalized location-scale family, let X = μ + Σ^{1/2} Z, where Σ^{1/2} Σ^{1/2}' = Σ (Σ^{1/2} is obtained via the Cholesky decomposition of Σ), so that X ~ N(μ, Σ). Hence,

M_X(t) = E\left[ e^{t'X} \right]
= E\left[ \exp\left( t'\mu + t'\Sigma^{1/2} Z \right) \right]
= e^{t'\mu}\, E\left[ \exp\left( t'\Sigma^{1/2} Z \right) \right]
= e^{t'\mu}\, M_Z\!\left( \Sigma^{1/2\prime} t \right)
= e^{t'\mu} \exp\left\{ \frac{1}{2} \left( \Sigma^{1/2\prime} t \right)' \left( \Sigma^{1/2\prime} t \right) \right\}
= \exp\left( t'\mu + \frac{1}{2} t'\Sigma t \right).
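The closure of the multivariate normal under linear transformations (properties 2 and 3 above) is easy to check numerically. The sketch below, with hypothetical values for μ, Σ, A, and b, compares the sample mean and covariance of Y = AX + b with Aμ + b and AΣA'.

    import numpy as np

    rng = np.random.default_rng(2)
    mu = np.array([1.0, -2.0, 0.5])
    Sigma = np.array([[1.0, 0.3, 0.2],
                      [0.3, 2.0, 0.4],
                      [0.2, 0.4, 1.5]])
    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 2.0, -1.0]])     # full-rank 2 x 3 matrix
    b = np.array([0.5, 1.0])

    x = rng.multivariate_normal(mu, Sigma, size=200_000)
    y = x @ A.T + b                       # Y = AX + b for each sampled X
    print(y.mean(axis=0), A @ mu + b)                 # means agree
    print(np.cov(y, rowvar=False), A @ Sigma @ A.T)   # covariances agree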
Simulation

To generate a sample of observations from a random vector Z ~ N(0, I_n), note that each of the n components of vector Z is independent and identically distributed standard univariate normal, for which simulation methods are well known. Let X = μ + Σ^{1/2} Z, where Σ^{1/2} Σ^{1/2}' = Σ, so that X ~ N(μ, Σ). Realizations of X can be obtained from the generated samples z as μ + Σ^{1/2} z, where Σ^{1/2} can be computed via the Cholesky decomposition.

Cumulative Distribution Function

The cumulative distribution function is the probability that all values in the random vector X are less than or equal to the values in the random vector x (Pr(X ≤ x)). Although there is no closed form for the cumulative distribution function of the multivariate normal, it can be calculated numerically by generating a large sample of observations and computing the fraction that satisfies X ≤ x.

Marginal Distributions

Let Y = (Y_1, ..., Y_n)' ~ N(μ, Σ). Suppose that Y is partitioned into two subvectors, Y_(1) and Y_(2), so that we can write Y = (Y_(1)', Y_(2)')', where Y_(1) = (Y_1, ..., Y_p)' and Y_(2) = (Y_{p+1}, ..., Y_n)'. Let μ and Σ be partitioned accordingly, that is, μ = (μ_(1)', μ_(2)')' and

\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Sigma_{21} = \Sigma_{12}',

where μ_(i) = E(Y_(i)), Σ_ii = V(Y_(i)), i = 1, 2, and Σ_12 = Cov(Y_(1), Y_(2)).

Then, it can be shown that the distributions of the two subvectors Y_(1) and Y_(2) are multivariate normal, defined as follows:

Y_{(1)} \sim N(\mu_{(1)}, \Sigma_{11}) \quad \text{and} \quad Y_{(2)} \sim N(\mu_{(2)}, \Sigma_{22}).

This result means that

• Each of the Y_i's is univariate normal.
• All possible subvectors are multivariate normal.
• All marginal distributions are multivariate normal.

Moreover, if Σ_12 = 0 (Y_(1) and Y_(2) are uncorrelated), then Y_(1) and Y_(2) are statistically independent. Recall that the covariance of two independent random variables is always zero, but the opposite need not be true. Thus, Y_(1) and Y_(2) are statistically independent if and only if Σ_12 = Σ_12' = 0.
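The simulation and cumulative-distribution calculations described in the preceding sections can be combined in a few lines of NumPy; the parameter values and the evaluation point x0 below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])

    L = np.linalg.cholesky(Sigma)            # lower-triangular factor with L @ L.T = Sigma
    z = rng.standard_normal((100_000, 2))    # i.i.d. N(0, 1) draws
    x = mu + z @ L.T                         # realizations of X ~ N(mu, Sigma)

    x0 = np.array([1.0, 1.5])
    print(np.mean(np.all(x <= x0, axis=1)))  # empirical estimate of Pr(X <= x0)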
Conditional Normal Distributions

Let Y = (Y_1, ..., Y_n)' ~ N(μ, Σ). Suppose that Y is partitioned into two subvectors, Y_(1) and Y_(2), in the same manner as in the previous section. The conditional distribution of Y_(1) given Y_(2) is multivariate normal, characterized by its mean and covariance matrix as follows:

Y_{(1)} \mid Y_{(2)} = y_{(2)} \sim
N\left( \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).

To verify this, let X = Y_(1) − μ_(1) − Σ_12 Σ_22^{-1}(Y_(2) − μ_(2)). As a linear combination of the elements of Y, X is also normally distributed, with mean zero and variance V(X) = E[XX'], given by

E[XX'] = E\left[ \left\{ Y_{(1)} - \mu_{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\left( Y_{(2)} - \mu_{(2)} \right) \right\}
\left\{ \left( Y_{(1)} - \mu_{(1)} \right)' - \left( Y_{(2)} - \mu_{(2)} \right)' \Sigma_{22}^{-1}\Sigma_{21} \right\} \right]
= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21}
= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

Moreover, if we consider (X', Y_(2)')', which is also multivariate normal, we obtain the following covariance term:

E\left[ (X - 0)\left( Y_{(2)} - \mu_{(2)} \right)' \right]
= E\left[ \left\{ Y_{(1)} - \mu_{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\left( Y_{(2)} - \mu_{(2)} \right) \right\}\left( Y_{(2)} - \mu_{(2)} \right)' \right]
= \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0.

This implies that X and Y_(2) are independent, and we can write

\begin{pmatrix} X \\ Y_{(2)} \end{pmatrix} \sim
N\left( \begin{pmatrix} 0 \\ \mu_{(2)} \end{pmatrix},
\begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right).

As

Y_{(1)} = X + \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( Y_{(2)} - \mu_{(2)} \right),

conditional on Y_(2) = y_(2), Y_(1) is normally distributed with mean

\underbrace{E(X)}_{0} + \mu_{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right)

and variance V(X) = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.

In the bivariate normal case, we replace Y_(1) and Y_(2) by Y_1 and Y_2, and the conditional distribution of Y_1 given Y_2 is normal as follows:

Y_1 \mid (Y_2 = y_2) \sim N\left( \mu_1 + \rho\frac{\sigma_1}{\sigma_2}\left( y_2 - \mu_2 \right),\; \sigma_1^2\left( 1 - \rho^2 \right) \right).

Another special case is with Y_(1) = Y_1 and Y_(2) = (Y_2, ..., Y_n)', so that

Y_1 \mid \left( Y_{(2)} = y_{(2)} \right) \sim
N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}\left( y_{(2)} - \mu_{(2)} \right),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).

The mean of this particular conditional distribution (μ_1 + Σ_12 Σ_22^{-1}(y_(2) − μ_(2))) is referred to as a multiple regression function of Y_1 on Y_(2), with the regression coefficient vector being β = Σ_12 Σ_22^{-1}.

Partial Correlation

Let Y = (Y_1, ..., Y_n)' ~ N(μ, Σ). The covariance (correlation) between two of the univariate random variables in Y, say Y_i and Y_j, is determined from the (i, j)th entry of Σ. The partial correlation concept considers the correlation of Y_i and Y_j when conditioning on a set of other variables in Y. Let C = Σ_11 − Σ_12 Σ_22^{-1} Σ_21, which is the variance of the conditional distribution of Y_(1) given Y_(2) = y_(2), and denote the (i, j)th element of C by σ_ij|(p+1, ..., n), where p is the dimension of the vector Y_(1). The partial correlation of Y_i and Y_j, given Y_(2), is defined by

\rho_{ij|(p+1, \ldots, n)} = \frac{\sigma_{ij|(p+1, \ldots, n)}}{\sqrt{\sigma_{ii|(p+1, \ldots, n)}\, \sigma_{jj|(p+1, \ldots, n)}}}.

Testing for Multivariate Normality

If a vector is multivariate normally distributed, each individual component follows a univariate Gaussian distribution. However, univariate normality of each variable in a set is not sufficient for multivariate normality. Therefore, to test the adequacy of the multivariate normality assumption, several authors have presented a battery of tests consistent with the multivariate framework. For this purpose, Kanti V. Mardia proposed multivariate extensions of the skewness and kurtosis statistics, which are extensively used in the univariate framework to test for normality.

Let X_i = (X_{1i}, ..., X_{ni})', i = 1, ..., m, be a random sample from an n-variate distribution, X̄ the vector of sample means, and S_X the sample covariance matrix. The n-variate skewness and kurtosis statistics, denoted respectively by b_{1,n} and b_{2,n}, are defined as follows:

b_{1,n} = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ \left( X_i - \bar{X} \right)' S_X^{-1} \left( X_j - \bar{X} \right) \right]^3

b_{2,n} = \frac{1}{m} \sum_{i=1}^{m} \left[ \left( X_i - \bar{X} \right)' S_X^{-1} \left( X_i - \bar{X} \right) \right]^2.

We can verify that these statistics reduce to the well-known univariate skewness and kurtosis for n = 1. Based on their asymptotic distributions, the multivariate skewness and kurtosis defined above can be used to test for multivariate normality. If X is multivariate normal, then b_{1,n} and b_{2,n} have expected values 0 and n(n + 2). It can also be shown that for large m, the limiting distribution of (m/6)b_{1,n} is a chi-square with n(n + 1)(n + 2)/6 degrees of freedom, and the limiting distribution of \sqrt{m}\,\left[ b_{2,n} - n(n + 2) \right] / \sqrt{8n(n + 2)} is N(0, 1).
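A minimal sketch of Mardia's statistics as defined above is given below; it uses the maximum likelihood (divide-by-m) covariance estimate, which is one common convention, together with the chi-square and normal approximations stated in the text. The function name and test data are hypothetical.

    import numpy as np
    from scipy.stats import chi2, norm

    def mardia(x):
        # b_{1,n} and b_{2,n} as defined above; x has shape (m, n)
        m, n = x.shape
        d = x - x.mean(axis=0)
        s_inv = np.linalg.inv(np.cov(x, rowvar=False, bias=True))
        g = d @ s_inv @ d.T                 # entries (Xi - Xbar)' S^-1 (Xj - Xbar)
        b1 = (g ** 3).sum() / m ** 2
        b2 = (np.diag(g) ** 2).sum() / m
        p_skew = chi2.sf(m * b1 / 6, df=n * (n + 1) * (n + 2) / 6)
        z_kurt = np.sqrt(m) * (b2 - n * (n + 2)) / np.sqrt(8 * n * (n + 2))
        p_kurt = 2 * norm.sf(abs(z_kurt))
        return b1, b2, p_skew, p_kurt

    rng = np.random.default_rng(3)
    sample = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)
    print(mardia(sample))   # large p values are expected for truly normal data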
The Gaussian Copula Function

Recently in multivariate modeling, much attention has been paid to copula functions. A copula is a function that links an n-dimensional distribution function to its one-dimensional margins and is itself a continuous distribution function characterizing the dependence structure of the model. Sklar's theorem states that under appropriate conditions, the joint density can be written as a product of the marginal densities and the copula density. Several copula families are available that can incorporate the relationships between random variables. Among these families, the Gaussian copula encodes dependence in precisely the same way as the multivariate normal distribution does, using only pairwise correlations among the variables. However, it does so for variables with arbitrary margins. A multivariate normal distribution arises whenever univariate normal margins are linked through a Gaussian copula. The Gaussian copula function is defined by

C(u_1, u_2, \ldots, u_n) = \Phi_\rho^n\left( \Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_n) \right),

where \Phi_\rho^n denotes the joint distribution function of the n-variate standard normal distribution with linear correlation matrix ρ, and \Phi^{-1} denotes the inverse of the univariate standard normal distribution function. In the bivariate case, the copula expression can be written as

C(u, v) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)}
\frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{ -\frac{x^2 - 2\rho x y + y^2}{2\left( 1 - \rho^2 \right)} \right\} dx\, dy,

where ρ is the usual linear correlation coefficient of the corresponding bivariate normal distribution.

Multiple Linear Regression and Sampling Distribution

In this section, the multiple regression model and the associated sampling distribution are presented as an illustration of the usefulness of the multivariate normal distribution in statistics. Consider the following multiple linear regression model:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_q X_q + \varepsilon,

where Y is the response variable, (X_1, X_2, ..., X_q)' is a vector representing a set of q explanatory variables, and ε is the error term. Note that the simple linear regression model is a special case with q = 1.

Suppose that we have n observations on Y and on each of the explanatory variables, that is,

Y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_q X_{1q} + \varepsilon_1
Y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_q X_{2q} + \varepsilon_2
\vdots
Y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_q X_{nq} + \varepsilon_n,

where E(ε_i) = 0, var(ε_i) = σ², and cov(ε_i, ε_j) = 0 for i ≠ j. We can rewrite this system as

\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} =
\begin{pmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1q} \\
1 & X_{21} & X_{22} & \cdots & X_{2q} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{nq}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},

or else

Y = X\beta + \varepsilon,

where E(ε) = 0 and cov(ε) = σ²I_n. Note that in the previous expression, ε is a multivariate normal random vector whereas Xβ is a vector of constants. Thus, Y is a linear combination of a multivariate normally distributed vector. It follows that Y is also multivariate normal with mean E(Y) = Xβ and covariance cov(Y) = σ²I_n.
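Anticipating the least squares estimator derived in the remainder of this section, the following sketch simulates repeated samples from Y = Xβ + ε and compares the empirical covariance of the resulting estimates with σ²(X'X)^{-1}; the design matrix and parameter values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(4)
    n, q, sigma = 50, 2, 1.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])   # design matrix with intercept
    beta = np.array([1.0, 2.0, -0.5])
    XtX_inv = np.linalg.inv(X.T @ X)

    estimates = []
    for _ in range(5_000):                       # repeated sampling from Y = X beta + eps
        y = X @ beta + rng.normal(0, sigma, n)
        estimates.append(XtX_inv @ X.T @ y)      # least squares estimate (X'X)^(-1) X'Y
    estimates = np.array(estimates)

    print(estimates.mean(axis=0))                # approximately beta
    print(np.cov(estimates, rowvar=False))       # approximately sigma^2 (X'X)^(-1)
    print(sigma ** 2 * XtX_inv)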
The goal of an analysis of data of this form is to estimate the regression parameter β. The least squares estimate of β is found by minimizing the sum of squared deviations

\sum_{j=1}^{n} \left( Y_j - \beta_0 - \beta_1 X_{j1} - \beta_2 X_{j2} - \cdots - \beta_q X_{jq} \right)^2.

In matrix terms, this sum of squared deviations may be written

(Y - X\beta)'(Y - X\beta) = \varepsilon'\varepsilon.

If X is of full rank, then the least squares estimator for β is given by

\hat{\beta} = (X'X)^{-1} X'Y.

It is of interest to characterize the probability distribution of an estimator. If we refer to the multiple regression problem, the estimator β̂ depends on the response Y. Therefore, the properties of its distribution will depend on those of Y. More specifically, the distribution of β̂ is multivariate normal as follows:

\hat{\beta} \sim N\left( \beta,\; \sigma^2 (X'X)^{-1} \right).

This result can be used to obtain estimated standard errors for the components of β̂, that is, estimates of the standard deviation of the sampling distributions of each component of β̂.

Chiraz Labidi

See also Central Limit Theorem; Coefficients of Correlation, Alienation, and Determination; Copula Functions; Kurtosis; Multiple Regression; Normal Distribution; Normality Assumption; Partial Correlation; Sampling Distributions

Further Readings

Balakrishnan, N., & Nevrozov, V. B. (2003). A primer on statistical distributions. New York: Wiley.
Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions. New York: Wiley.
Kotz, S., Balakrishnan, N., & Johnson, N. L. (2000). Continuous multivariate distributions, Volume 1: Models and applications. New York: Wiley.
Mardia, K. V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519–530.
Paolella, M. S. (2007). Intermediate probability: A computational approach. New York: Wiley.
Rose, C., & Smith, M. D. (1996). The multivariate normal distribution. Mathematica, 6, 32–37.
Sklar, A. (1959). N-dimensional distribution functions and their margins. Publications de l'Institut de Statistique de l'Université de Paris, 8, 229–231.
Tong, Y. L. (1990). The multivariate normal distribution. New York: Springer-Verlag.
N
NARRATIVE RESEARCH

Narrative research aims to explore and conceptualize human experience as it is represented in textual form. Aiming for an in-depth exploration of the meanings people assign to their experiences, narrative researchers work with small samples of participants to obtain rich and free-ranging discourse. The emphasis is on storied experience. Generally, this takes the form of interviewing people around the topic of interest, but it might also involve the analysis of written documents. Narrative research as a mode of inquiry is used by researchers from a wide variety of disciplines, which include anthropology, communication studies, cultural studies, economics, education, history, linguistics, medicine, nursing, psychology, social work, and sociology. It encompasses a range of research approaches including ethnography, phenomenology, grounded theory, narratology, action research, and literary analysis, as well as such interpretive stances as feminism, social constructionism, symbolic interactionism, and psychoanalysis. This entry discusses several aspects of narrative research, including its epistemological grounding, procedures, analysis, products, and advantages and disadvantages.

Epistemological Grounding

The epistemological grounding for narrative research is on a continuum of postmodern philosophical ideas in that there is a respect for the relativity and multiplicity of truth in regard to the human sciences. Narrative researchers rely on the epistemological arguments of such philosophers as Paul Ricoeur, Martin Heidegger, Edmund Husserl, Wilhelm Dilthey, Ludwig Wittgenstein, Mikhail Bakhtin, Jean-Francois Lyotard, and Hans-Georg Gadamer. Although narrative researchers differ in their view of the possibility of objectively conceived "reality," most agree with Donald Spence's distinction between narrative and historical truth. Factuality is of less interest than how events are understood and organized, and all knowledge is presumed to be socially constructed.

Ricoeur, in his seminal work Time and Narrative, argues that time is organized and experienced narratively; narratives bring order and meaning to the constantly changing flux. In its simplest form, our experience is internally ordered as "this happened, then that happened" with some (often causal) connecting link in between. Narrative is also central to how we conceive of ourselves; we create stories of ourselves to connect our actions, mark our identity, and distinguish ourselves from others.

Questions about how people construct themselves and others in various contexts, under various conditions, are the focus of narrative research. Narrative research paradigms, in contrast to hypothesis-testing ones, have as their aims describing and understanding rather than measuring and predicting, focusing on meaning rather than causation and frequency, interpretation rather than statistical analysis, and recognizing the importance of
language and discourse rather than reduction to numerical representation. These approaches are holistic rather than atomistic, concern themselves with particularity rather than universals, are interested in the cultural context rather than trying to be context-free, and give overarching significance to subjectivity rather than questing for some kind of objectivity.

Narrative research orients itself toward understanding human complexity, especially in those cases where the many variables that contribute to human life cannot be controlled. Narrative research aims to take into account—and interpretively account for—the multiple perspectives of both the researched and researcher.

Jerome Bruner has most championed the legitimization of what he calls "narrative modes of knowing," which privileges the particulars of lived experience rather than constructs about variables and classes. It aims for the understanding of lives in context rather than through a prefigured and narrowing lens. Meaning is not inherent in an act or experience but is constructed through social discourse. Meaning is generated by the linkages the participant makes between aspects of the life he or she is living and by the explicit linkages the researcher makes between this understanding and interpretation, which is meaning constructed at another level of analysis.

Life Is a Story

One major presupposition of narrative research is that humans experience their lives in emplotted forms resembling stories or at least communicate about their experiences in this way. People use narrative as a form of constructing their views of the world; time itself is constructed narratively. Important events are represented as taking place through time, having roots in the past, and extending in their implications into the future. Life narratives are also contextual in that persons act within situational contexts that are both immediate and more broadly societal. The focus of research is on what the individuals think they are doing and why they think they are doing so. Behavior, then, is always understood in the individual's context, however he or she might construct it. Thus, narratives can be examined for personal meanings, cultural meanings, and the interaction between these.

Narrative researchers invite participants to describe in detail—tell the story of—either a particular event or a significant aspect or time of life (e.g., a turning point), or they ask participants to narrate an entire life story. Narration of experience, whether of specific events or entire life histories, involves the subjectivity of the actor, with attendant wishes, conflicts, goals, opinions, emotions, worldviews, and morals, all of which are open to the gaze of the researcher. Such narratives also either implicitly or explicitly involve settings that include those others who are directly involved in the events being related and also involve all those relationships that have influenced the narrator in ever-widening social circles. The person is assumed to be speaking from a specific position in culture and in historical time. Some of this positionality is reflected in the use of language and concepts with which a person understands her or his life. Other aspects of context are made explicit as the researcher is mindful of the person's experience of herself or himself in terms of gender, race, culture, age, social class, sexual orientation, nationality, etc. Participants are viewed as unique individuals with particularity in terms of social location; a person is not viewed as representative of some universal and interchangeable, randomly selected "subject."

People, however, do not "have" stories of their lives; they create them for the circumstance in which the story will be told. No two interviewers will obtain exactly the same story from an individual interviewee. Therefore, a thoroughly reflexive analysis of the parameters and influences on the interview situation replaces concern with reliability.

A narrative can be defined as a story of a sequence of events. Narratives are organized so as to place meanings retrospectively on events, with events described in such a way as to express the meanings the narrator wishes to convey. Narrative is a way of understanding one's own (and others') action, of organizing personal experience, both internal and external, into a meaningful whole. This involves attributing agency to the characters in the narrative and inferring causal links between the events. In the classic formulation, a narrative is an account with three components: a beginning, a middle, and an end. William Labov depicts all narratives as having clauses that
orient the reader to the story, tell about the events, or evaluate the story—that is, instruct the listener or reader as to how the story is to be understood. The evaluation of events is of primary interest to the narrative researcher because this represents the ways in which the narrator constructs a meaning (or set of meanings) within the narrative. Such meanings, however, are not viewed to be either singular or static. Some narrative theorists have argued that the process of creating an autobiographical narrative is itself transforming of self because the self that is fashioned in the present extends into the future. Thus, narrative research is viewed to be investigating a self that is alive and evolving, a self that can shift meanings, rather than a fixed entity.

Narrative researchers might also consider the ways in which the act of narration is performative. Telling a story constructs a self and might be used to accomplish a social purpose such as defending the self or entertaining someone. Thus, the focus is not only on the content of what is communicated in the narrative but also on how the narrator constructs the story and the social locations from which the narrator speaks. Society and culture also enable and constrain certain kinds of stories; meaning making is always embedded in the concepts that are culturally available at a particular time, and these might be of interest in a narrative research project. Narrative researchers, then, attend to the myriad versions of self, reality, and experience that the storyteller produces through the telling.

Procedures

Narrative research begins with a conceptual question derived from existing knowledge and a plan to explore this question through the narratives of people whose experience might illuminate the question. Most narrative research involves personal interviews, most often individual, but sometimes in groups. Some narrative researchers might (also) use personal documents such as journals, diaries, memoirs, or films as bases for their analyses. Narrative research uses whatever storied materials are available or can be produced from the kinds of people who might have personal knowledge and experiences to bring to bear on the research question.

In interview-based designs, which are the most widespread form of narrative research, participants who fit into the subgroup of interest are invited to be interviewed at length (generally 1–4 hours). Interviews are recorded and then transcribed. The narrative researcher creates "experience-near" questions related to the conceptual question that might be used to encourage participants to tell about their experiences. This might be a request for a full life story or it might be a question about a particular aspect of life experience such as life transitions, important relationships, or responses to disruptive life events.

Narrative research meticulously attends to the process of the interview that is organized in as unstructured a way as possible. The narrative researcher endeavors to orient the participant to the question of interest in the research and then intervene only to encourage the participant to continue the narration or to clarify what seems confusing to the researcher. Inviting stories, the interviewer asks the participant to detail his or her experiences in rich and specific narration. The interviewer takes an empathic stance toward the interviewees, trying to understand their experience of self and world from their point of view. Elliott Mishler, however, points out that no matter how much the interviewer or researcher attempts to put aside his or her own biases or associations to the interview content, the researcher has impact on what is told and this must be acknowledged and reflected on. Because such interviews usually elicit highly personal material, confidentiality and respect for the interviewee must be assured. The ethics of the interview are carefully considered in advance, during the interview itself, and in preparation of the research report.

Narrative research questions tend to focus on individual, developmental, and social processes that reflect how experience is constructed both internally and externally. Addressing questions that cannot be answered definitively, narrative research embraces multiple interpretations rather than aiming to develop a single truth. Rooted in a postmodern epistemology, narrative approaches to research respect the relativity of knowing—the meanings of the participant filtered through the mind of the researcher with all its assumptions and a priori meanings. Knowledge is presumed to be constructed rather than discovered and is assumed
to be localized, perspectival, and occurring within intersubjective relationships to both participants and readers. "Method" then becomes not a set of procedures and techniques but ways of thinking about inquiry, modes of exploring questions, and creative approaches to offering one's constructed findings to the scholarly community. All communication is through language that is understood to be always ambiguous and open to interpretation. Thus, the analytic framework of narrative research is in hermeneutics, which is the science of imprecise and always shifting meanings.

Analysis

The analysis of narrative research texts is primarily aimed at inductively understanding the meanings of the participant and organizing them at some more conceptual level of understanding. This might involve a close reading of an individual's interview texts, which includes coding for particular themes or extracting significant passages for discussion in the report. The researcher looks inductively for patterns, and the kinds of patterns recognized might reflect the researcher's prior knowledge about the phenomena. The process of analysis is one of piecing together data, making the invisible apparent, deciding what is significant and what is insignificant, and linking seemingly unrelated facets of experience together. Analysis is a creative process of organizing data so the analytic scheme will emerge. Texts are read multiple times in what Friedrich Schleiermacher termed a "hermeneutic circle," a process in which the whole illuminates the parts that in turn offers a fuller and more complex picture of the whole, which then leads to a better understanding of the parts, and so on.

Narrative researchers focus first on the voices within each narrative, attending to the layering of voices (subject positions) and their interaction, as well as the continuities, ambiguities, and disjunctions expressed. The researcher pays attention to both the content of the narration ("the told") and the structure of the narration ("the telling"). Narrative analysts might also pay attention to what is unsaid or unsayable by looking at the structure of the narrative discourse and markers of omissions. After each participant's story is understood as well as possible, cross-case analysis might be performed to discover patterns across individual narrative interview texts or to explore what might create differences between people in their narrated experiences.

There are many approaches to analyses, with some researchers focusing on meanings through content and others searching through deconstructing the use and structure of language as another set of markers to meanings. In some cases, researchers aim to depict the layers of experience detailed in the narratives, preserving the point of view, or voice, of the interviewee. At other times, researchers might try to go beyond what is said and regard the narrated text as a form of disguise; this is especially true when what is sought are unconscious processes or culturally determined aspects of experience that are embedded rather than conscious.

The linguistic emphasis in some branches of narrative inquiry considers the ways in which language organizes both thought and experience. Other researchers recognize the shaping function of language but treat language as transparent as they focus more on the content of meanings that might be created out of life events.

The purpose of narrative research is to produce a deep understanding of dynamic processes. No effort is made to generalize about populations. Thus, statistics, which aims to represent populations and the distribution of variables within them, has little or no place in narrative research. Rather, knowledge is viewed to be localized in the analysis of the particular people studied, and generalization about processes that might apply to other populations is left to the reader. That is, in a report about the challenges of immigration in a particular population, the reader might find details of the interactive processes that might illuminate the struggles of another population in a different locale—or even people confronting other life transitions.

Narrative research avoids having a predetermined theory about the person that the interview or the life story is expected to support. Although no one is entirely free of preconceived ideas and expectations, narrative researchers try to come to their narrators as listeners open to the surprising variation in their social world and private lives. Although narrative researchers try to be as knowledgeable as possible about the themes that they
are studying to be maximally sensitive to nuances of meaning, they are on guard against inflicting meaning in the service of their own ends.

Products

Reports of narrative research privilege the words of the participants, in what Clifford Geertz calls "thick description," and present both some of the raw data of the text as well as the analysis. Offering as evidence the contextualized words of the narrator lends credence to the analysis suggested by the researcher. The language of the research report is often near to experience as lived rather than as obscured by scientific jargon. Even Sigmund Freud struggled with the problem of making the study of experience scientific, commenting in 1893 that the nature of the subject was responsible for his works reading more like short stories than customary scientific reports. The aim of a narrative research report is to offer interpretation in a form that is faithful to the phenomena. In place of form-neutral "objectivized" language, many narrative researchers concern themselves with the poetics of their reports and strive to embody the phenomena in the language they use to convey their meanings. Narrative researchers stay respectful of their participants and reflect on how they are representing "the other" in the published report.

Some recent narrative research has concerned such topics as how people experience immigration, illness, identity, divorce, recovery from addictions, belief systems, and many other aspects of human experience. Any life experiences that people can narrate or represent become fertile ground for narrative research questions. The unity of a life resides in a construction of its narrative, a form in which hopes, dreams, despairs, doubts, plans, and emotions are all phrased.

Although narrative research is generally concerned with individuals' experience, some narrative researchers also consider narratives that particular collectives (societies, groups, or organizations) tell about themselves, their histories, their dominant mythologies, and their aspirations. Just as personal narratives create personal identity, group narratives serve to bond a community and distinguish it from other collectives.

A good narrative research report will detail a holistic overview of the phenomena under study, capturing data from the inside of the actors with a view to understanding and conceptualizing their meaning making in the contexts within which they live. Narrative researchers recognize that many interpretations of their observations are possible, and they argue their interpretive framework through careful description of what they have observed.

Narrative researchers also recognize that they themselves are narrators as they present their organization and interpretation of their data. They endeavor to make their work as interpreters transparent, writing about their own interactions with their participants and their data and remaining mindful of their own social location and personal predilections. This reflexive view of researcher as narrator opens questions about the representation of the other and the nature of interpretive authority, and these are addressed rather than elided.

Advantages and Disadvantages

A major appeal of narrative research is the opportunity to be exploratory and make discoveries free of the regimentation of prefabricated hypotheses, contrived variables, control groups, and statistics. Narrative research can be used to challenge conceptual hegemony in the social sciences or to extend the explanatory power of abstract theoretical ideas. Some of the most paradigm-defining conceptual revolutions in the study of human experience have come from narrative research—Sigmund Freud, Erik Erikson, and Carol Gilligan being the most prominent examples. New narrative researchers, however, struggle with the vaguely defined procedures on which this research depends and with the fact that interesting results cannot be guaranteed in advance. Narrative research is also labor intensive, particularly in the analysis phase, where text must be read and reread as insights and interpretations develop.

Narrative research is not generalizable to populations but rather highlights the particularities of experience. Many narrative researchers, however, endeavor to place the individual narratives they present in a broader frame, comparing and contrasting their conclusions with the work of others with related concerns. All people are like all other people, like some other people—and also are unique. Readers of narrative research are invited
explicitly to apply what is learned to contexts that are meaningful to them.

Narrative research opens possibilities for social change by giving voice to marginalized groups, representing unusual or traumatic experiences that are not conducive to control group designs, and by investigating the ways in which social life (and attendant oppression) is mediated through metanarratives. Readers, then, are challenged to understand their own or their society's stories in new ways at both experiential and theoretical levels.

Ruthellen Josselson

See also Case-Only Design; Case Study; Naturalistic Observation; Observational Research; Observations

Further Readings

Andrews, M. (2007). Shaping history: Narratives of political change. Cambridge, UK: Cambridge University Press.
Bruner, J. (1990). Acts of meaning. Cambridge, MA: Harvard University Press.
Clandinnin, J. (Ed.). (2007). The handbook of narrative inquiry. Thousand Oaks, CA: Sage.
Freeman, M. (1993). Rewriting the self: History, memory, narrative. London: Routledge.
Hinchman, L. P., & Hinchman, S. K. (Eds.). (1997). Memory, identity, community: The idea of narrative in the human sciences. Albany: State University of New York.
Josselson, R., Lieblich, A., & McAdams, D. P. (Eds.). (2003). Up close and personal: The teaching and learning of narrative research. Washington, DC: APA Books.
Lieblich, A., Tuval-Mashiach, R., & Zilber, T. (1998). Narrative research: Reading, analysis and interpretation. Thousand Oaks, CA: Sage.
McAdams, D. P. (1993). The stories we live by: Personal myths and the making of the self. New York: Morrow.
Mishler, E. (2004). Historians of the self: Restorying lives, revising identities. Research in Human Development, 1, 101–121.
Polkinghorne, D. (1988). Narrative knowing and the human sciences. Albany: State University of New York.
Rosenwald, G. C., & Ochberg, R. L. (Eds.). (1992). Storied lives: The cultural politics of self-understanding. New Haven, CT: Yale University Press.
Sarbin, T. R. (1986). Narrative psychology: The storied nature of human conduct. New York: Praeger.
Spence, D. (1982). Narrative truth and historical truth. New York: Norton.

NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION

The National Council on Measurement in Education (NCME) is the sole professional association devoted to the scientific study and improvement of educational measurement. Originally founded in the United States in February 1938, it currently includes members from countries throughout the world. As of November 2008, it has approximately 2,000 members, including professional members and graduate students. Professional members work in university settings; national, state, and local government settings (typically departments of education); testing companies; and other industrial settings. Graduate student members are typically enrolled in graduate programs housed in schools of education or departments of psychology.

Since its founding, the organization's name has changed several times. Originally founded as the National Association of Teachers of Educational Measurement, its name was changed to National Council on Measurements Used in Education in 1943. The organization took its current name in 1960.

The mission statement of NCME affirms that it is "incorporated exclusively for scientific, educational, literary, and charitable purposes." In addition, the mission statement describes two major purposes of NCME. The first is the "encouragement of scholarly efforts to: Advance the science of measurement in the field of education; improve measurement instruments and procedures for their administration, scoring, interpretation, and use; and improve applications of measurement in assessment of individuals and evaluations of educational programs." The second focuses on dissemination: "Dissemination of knowledge about: Theory, techniques, and instrumentation available for measurement of educationally relevant human, institutional, and social characteristics; procedures appropriate to the interpretation and use of such techniques and instruments; and applications of educational measurement in individual and group evaluation studies."

These purpose statements underscore the NCME's focus on supporting research on educational tests and testing, development of improved
assessments in education, and disseminating information regarding new developments in educational testing and the proper use of tests. To accomplish these goals, NCME hosts an annual conference each year (jointly scheduled with the annual conference of the American Educational Research Association), publishes two highly regarded journals covering research and practice in educational measurement, and partners with other professional organizations to develop and disseminate guidelines and standards for appropriate educational assessment practices and to further the understanding of the strengths and limitations of educational tests. In the following sections, some of the most important activities of NCME are described.

Dissemination Activities

The NCME publishes two peer-reviewed journals, both of which have four volumes per year. The first is the Journal of Educational Measurement (JEM), which was first published in 1963. JEM publishes original research related to educational assessment, particularly advances in statistical techniques such as equating tests (maintaining score scales over time), test calibration (e.g., using item response theory), and validity issues related to appropriate test development and use (e.g., techniques for evaluating item and test bias). It also publishes reviews of books related to educational measurement and theoretical articles related to major issues and developments in educational measurement (e.g., reliability and validity theory). The second journal published by NCME is Educational Measurement: Issues and Practice (EM:IP), which was first published in 1982. EM:IP focuses on more applied issues, typically less statistical in nature, that are of broad interest to measurement practitioners. According to the EM:IP page on the NCME website, the primary purpose of EM:IP is "to promote a better understanding of educational measurement and to encourage reasoned debate on current issues of practical importance to educators and the public. EM:IP also provides one means of communication among NCME members and between NCME."

In addition to the two journals, NCME also publishes the Instructional Topics in Educational Measurement Series (ITEMS), which are instructional units on specific measurement topics of interest to measurement researchers and practitioners. The ITEMS units, which first appeared in 1987, are available for free and can be downloaded from the ITEMS page on the NCME website. As of 2010, there are 22 modules covering a broad range of topics such as how to equate tests, evaluate differential item functioning, or set achievement level standards on tests.

NCME has also partnered with other organizations to publish books and other materials designed to promote fair or improved practices related to educational measurement. The most significant partnership has been with the Joint Committee on Testing Standards, which produced the Standards for Educational and Psychological Testing in 1999 as well as the four previous versions of those standards (in 1954, 1966, 1974, and 1985). NCME also partnered with the American Council on Education to produce four versions of the highly acclaimed book Educational Measurement. Two books on evaluating teachers were also sponsored by NCME: the Handbook of Teacher Evaluation and the New Handbook of Teacher Evaluation: Assessing Elementary and Secondary School Teachers.

In addition to publishing journals, instructional modules, and books, NCME has also partnered with other professional organizations to publish material to inform educators or the general public about important measurement issues. For example, in 1990 it partnered with the American Federation of Teachers (AFT) and the National Education Association (NEA) to produce the Standards for Teacher Competence in the Educational Assessment of Students. It has also been an active member on the Joint Committee on Testing Practices (JCTP) that produced the ABCs of Testing, which is a video and booklet designed to inform parents and other lay audiences about the use of tests in schools and about important characteristics of quality educational assessments. NCME also worked with JCTP to produce the Code of Fair Testing Practices, which describes the responsibilities test developers and test users have for ensuring fair and appropriate testing practices. NCME disseminates this document for free at its website. NCME also publishes a quarterly newsletter, which can also be downloaded for free from its website.

Annual Conference

In addition to the aforementioned publications, the NCME's annual conference is another mechanism
with which it helps disseminate new findings and research on educational measurement. The annual conference, typically held in March or April, features three full days of paper sessions, symposia, invited speakers, and poster sessions in which psychometricians and other measurement practitioners can learn and dialog about new developments and issues in educational measurement and research. About 1,200 members attend the conference each year.

Governance Structure

The governance structure of NCME consists of an Executive Committee (President, President-Elect, and Past President) and a six-member Board of Directors. The Board of Directors includes all elected positions, including the President-Elect (also referred to as Vice President). In addition, NCME has 20 volunteer committees that are run by its members. Examples of these committees include the Outreach and Partnership Committee and the Diversity Issues and Testing Committee.

Joining NCME

NCME is open to all professionals, and membership includes subscriptions to the NCME Newsletter, JEM, and EM:IP. Graduate students can join for a reduced rate and can receive all three publications as part of their membership. All professionals interested in staying current with respect to new developments and research related to assessing students are encouraged to become members. To join the NCME visit the NCME website or write the NCME Central Office at 2810 Crossroads Drive, Suite 3800, Madison WI, 53718.

Stephen G. Sireci

See also American Educational Research Association; American Statistical Association; "On the Theory of Scales of Measurement"

Further Readings

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Federation of Teachers, National Education Association, & National Council on Measurement in Education. (1990). Standards for teacher competence in the educational assessment of students. Educational Measurement: Issues and Practice, 9, 30–34.
Brennan, R. L. (Ed.). (2006). Educational measurement (4th ed.). Washington, DC: American Council on Education/Praeger.
Coffman, W. (1989). Past presidents' committee: A look at the past, present, and future of NCME. 1. A look at the past. (ERIC Document Reproduction Service No. ED308242).
Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC: American Psychological Association.
Joint Committee on Testing Practices. (1993). ABCs of testing. Washington, DC: National Council on Measurement in Education.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. Washington, DC: American Psychological Association.
Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). Washington, DC: American Council on Education.
Millman, J. (1981). Handbook of teacher evaluation. Beverly Hills, CA: Sage.
Millman, J., & Darling-Hammond, L. (1990). The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Newbury Park, CA: Sage.
Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Websites

National Council on Measurement in Education: http://www.ncme.org

NATURAL EXPERIMENTS

Natural experiments are designs that occur in nature and permit a test of an otherwise untestable hypothesis and thereby provide leverage to disentangle variables or processes that would otherwise be inherently confounded. Experiments in nature do not, by definition, have the sort of leverage that traditional experiments have because they were not manufactured to precise methodological detail;
they are fortuitous. They do, however, have distinct advantages over observational studies and might, in some circumstances, address questions that randomized controlled trials could not address. A key feature of natural experiments is that they offer insight into causal processes, which is one reason why they have an established role in developmental science.

Natural experiments represent an important research tool because of the methodological limits of naturalistic and experimental designs and the need to triangulate and confirm findings across multiple research designs. Notwithstanding their own set of practical limitations and threats to generalizability of the results, natural experiments have the potential to deconfound alternative models and accounts and thereby contribute significantly to developmental science and other areas of research. This entry discusses natural experiments in the context of other research designs and then illustrates how their use in developmental science has provided information about the relationship between early exposure to stress and children's development.

The Scientific Context of Natural Experiments

The value of natural experiments is best appreciated when viewed in the context of other designs. A brief discussion of other designs is therefore illustrative. Observational or naturalistic studies—cross-sectional or longitudinal assessments in which individuals are observed and no experimental influence is brought to bear on them—generally cannot address causal claims. That is because a range of methodological threats, including selection biases and coincidental or spurious associations, undermine causal claims. So, for example, in the developmental and clinical psychology literature, there is considerable interest in understanding the impact of parental mental health—maternal depression is probably the most studied example—on children's physical and mental development. Dozens of studies have addressed this question using a variety of samples and measures. However, almost none of these studies—even large-scale cohort and population studies—are equipped to identify causal mechanisms for several reasons, including (a) genetic transmission is confounded with family processes and other psychosocial risks; (b) maternal depression is virtually always accompanied by other risks that are also reliably linked with children's maladjustment, such as poor parenting and marital conflict; and (c) mate selection for psychiatric disorders means that depressed mothers are more likely to have a partner with a mental illness, which confounds any specific "effect" that investigators might wish to attribute to maternal depression per se. Most of the major risk factors relevant to psychological well-being and public health co-occur; in general terms, risk exposures are not distributed randomly in the population. Indeed, one of the more useful lessons from developmental science has been to demonstrate the ways in which exposures to risk accrue in development.

One response to the problems of selection bias or confounded risk exposure is to address the problem analytically. That is, even if, for example, maternal depression is inherently linked with compromised parenting and family conflict, the "effect" of maternal depression might nevertheless be derived if the confounded variables (compromised parenting and family conflict) are statistically controlled for. There are some problems with that solution, however. If risk processes are confounded in nature, then statistically controlling for one or the other is not a satisfying solution; interpretations of the maternal depression "effect" will be possible but probably not (ecologically) valid. Sampling design strategies to obtain the same kind of leverage, such as sampling families with depressed mothers only if there is an absence of family conflict, will yield an unrepresentative sample of affected families with minimal generalizability. Case-control designs try to gain some leverage over cohort observational studies by tracking a group or groups of individuals, some of whom have a condition(s) of interest. Differences between groups are inferred to be attributable to the condition(s) of interest because the groups were matched on key factors. That is not always possible, and the relevant factors to control for are not always known; as a result, between-group and even within-subject variation in these designs is subject to confounders.
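To make the analytic-control strategy just described concrete, the sketch below shows the kind of covariate-adjusted regression such studies fit. It is only an illustration under stated assumptions, not a reproduction of any study cited in this entry: the variable names (child_adjustment, maternal_depression, parenting_quality, marital_conflict) are hypothetical, the data are simulated, and the model is an ordinary least squares fit via the statsmodels formula interface.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for a cohort of families; all variable names are hypothetical.
rng = np.random.default_rng(0)
n = 500
parenting_quality = rng.normal(size=n)
marital_conflict = rng.normal(size=n)
# Maternal depression co-occurs with the other risks, mimicking the confounding described above.
maternal_depression = 0.5 * marital_conflict - 0.4 * parenting_quality + rng.normal(size=n)
child_adjustment = (0.6 * parenting_quality - 0.5 * marital_conflict
                    - 0.2 * maternal_depression + rng.normal(size=n))
df = pd.DataFrame({
    "child_adjustment": child_adjustment,
    "maternal_depression": maternal_depression,
    "parenting_quality": parenting_quality,
    "marital_conflict": marital_conflict,
})

# Unadjusted association versus the covariate-adjusted coefficient for maternal depression.
unadjusted = smf.ols("child_adjustment ~ maternal_depression", data=df).fit()
adjusted = smf.ols("child_adjustment ~ maternal_depression + parenting_quality"
                   " + marital_conflict", data=df).fit()
print(unadjusted.params["maternal_depression"], adjusted.params["maternal_depression"])
```

In data simulated this way the adjusted coefficient moves toward the value used to generate the outcome, but, as the entry emphasizes, such adjustment does not by itself settle whether an ecologically valid causal interpretation is warranted.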
A potential methodological solution is offered by experimental designs. So, for example, testing the maternal depression hypothesis referred to previously might be possible to the extent that some affected mothers are randomly assigned to treatment for depression. That would offer greater purchase on the question of whether maternal depression per se was a causal contributor to children's adjustment difficulties. Interestingly, intervention studies have shown that there are a great many questions about causal processes that emerge even after a successful trial. For example, cognitive-behavioral treatment might successfully resolve maternal depression and, as a result, children of the treated mothers might show improved outcomes relative to children whose depressed mothers were not treated. It would not necessarily follow, however, that altering maternal depression was the causal mediator affecting child behavior. It might be that children's behavior improved because the no-longer-depressed mothers could engage as parents in a more effective manner and there was a decrease in inter-parental conflict, or any of several other secondary effects of the depression treatment. In other words, questions about causal mechanisms are not necessarily resolved fully by experimental designs.

Investigators in applied settings are also aware that some contexts are simply not amenable to randomized control. School-based interventions sometimes hit resistance to random assignment because principals, teachers, or parents object to the idea that some children needing intervention might not get the presumed better treatment. Court systems are often nonreceptive experimental proving grounds. That is, no matter how compelling data from a randomized control trial might be to address a particular question, there are circumstances in which a randomized control trial is extremely impractical or unethical.

Natural experiments are, therefore, particularly valuable where traditional nonexperimental designs might be scientifically inadequate or where experimental designs may be impractical or unethical. And natural experiments are useful scientific tools even where other designs might be judged as capable of testing the hypothesis of interest. That is because of the need for findings to be confirmed not only by multiple studies but also by multiple designs. That is, natural experiments can provide a helpful additional scientific "check" on findings generated from naturalistic or experimental studies. There are many illustrations of the problems in relying on findings from a single design. Researchers are now accustomed to defining an effect or association as robust if it is replicated across samples and measures. Also, a finding should replicate across designs. No single research sample and no single research design is satisfactory for testing causal hypotheses or inferring causal mechanisms.

Finally, identifying natural experiments can be an engaging and creative process, and studies based on natural experiments are far less expensive and arduous to investigate—they occur naturally—than those using conventional research designs; they can also be common. Thus, dramatic shifts in income might be exploited to investigate income dynamics and children's well-being; cohort changes in the rates of specific risks (e.g., divorce) might be used to examine psychosocial accounts for children's adjustment problems. Hypotheses about genetic and/or psychosocial risk exposure might be addressed using adoption and twin designs, and many studies exploit the arbitrariness of the age cut-off for school to contrast exposure with maturation accounts of reading, language, and mathematical ability; the impact of compulsory schooling; and many other practical and conceptual questions.

Like all other forms of research design, natural experiments have their own special set of limitations. But they can offer both novel and confirmatory findings. Examples of how natural experiments have informed the debate on early risk exposure and children's development are reviewed below.

Natural Experiments to Examine the Long-Term Effects of Early Risk Exposure

Understanding the degree to which, and by what mechanisms, early exposure to stress has long-term effects is a primary question for developmental science with far-reaching clinical and policy applications. This area of inquiry has been extensively and experimentally studied in animal models. But animal studies are inadequate for deriving clinical and public health meaning; research in humans is essential. However, sound investigation to inform the debate in humans has been overshadowed by claims that might overplay the evidence, as in the case of extending animal findings to humans
willy-nilly. The situation is compounded by the general lack of relevant human studies that have leverage for deriving claims about early experience and exposure per se. That is, despite the hundreds of studies that assess children's exposure to early risk, almost none can differentiate the effects of early risk exposure from later risk exposure because the exposure to risk—maltreatment, poverty, and parental mental illness—is continuous rather than precisely timed or specific to the child's early life. Intervention studies have played an important role in this debate, and many studies now show long-term effects of early interventions. In contrast, because developmental timing was not separated from intervention intensity in most cases, these studies do not resolve issues about early experience as such. In other words, conventional research designs have not had much success in tackling major questions about early experience. It is not surprising, then, that natural experiments have played such a central role in this line of investigation.

Several different forms of natural experiments to study the effects of early exposure have been reported. One important line of inquiry is from the Dutch birth cohort exposed to prenatal famine during the Nazi blockade. Alan S. Brown and colleagues found that the rate of adult unipolar and bipolar depression requiring hospitalization was increased among those whose mothers experienced starvation during the second and third trimesters of pregnancy. The ability of the study to contrast rates of disorder among individuals whose mothers were and were not pregnant during the famine allowed unprecedented experimental "control" on timing of exposure. A second feature that makes the study a natural experiment is that it capitalized on a situation that is ethically unacceptable and so impossible to design on purpose.

Another line of study that has informed the early experience debate concerns individuals whose caregiving experience undergoes a radical change—far more radical than any traditional psychological intervention could create. Of course, radical changes in caregiving do not happen ordinarily. A notable exception is those children who are removed from abusive homes and placed into nonabusive or therapeutic settings (e.g., foster care). Studies of children in foster care are, therefore, significant because this is a population for whom the long-term outcomes are generally poor and because this is the context in which natural experiments of altering care are conducted. Clearly, there are inherent complications, but the findings have provided some of the most interesting data in clinical and developmental psychology.

An even more extreme context involves children who experienced gross deprivation via institutional care and were then adopted into low-/normal-risk homes. There are many studies of this sort. One is the English and Romanian Adoptees (ERA) study, which is a long-term follow-up of children who were adopted into England after institutional rearing in Romania; the study also includes an early adopted sample of children in England as a comparison group. A major feature of this particular natural experiment—and what makes it and similar studies of exinstitutionalized children noteworthy—is that there was a remarkable discontinuity in caregiving experience, from the most severe to a normal-risk setting. That feature offers unparalleled leverage for testing the hypothesis that it is early caregiving risk that has persisting effects on long-term development. The success of the natural experiment design depends on many considerations, including the representativeness of the families who adopted from Romania to the general population of families, for example. A full account of this issue is not within the scope of this entry, but it is clear that the impact of findings from natural experiments needs to be judged in relation to the kinds of sampling and other methodological features.

The findings from studies of exinstitutionalized samples correspond across studies. So, for example, there is little doubt now from long-term follow-up assessments that early caregiving deprivation can have long-term impact on attachment and intellectual development, with a sizable minority of children showing persisting deficits many years after the removal from the institutional setting and despite many years in a resourceful, caring home environment. Findings also show that individual differences in response to early severe deprivation are substantial and just as continuous. Research into the effects of early experience has depended on these natural experiments because conventional research designs were either impractical or unethical.

Thomas G. O'Connor

See also Case-Only Design; Case Study; Narrative Research; Observational Research; Observations

Further Readings

Anderson, G. L., Limacher, M., Assaf, A. R., Bassford, T., Beresford, S. A., Black, H., et al. (2004). Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: The women's health initiative randomized controlled trial. Journal of the American Medical Association, 291, 1701–1712.
Beckett, C., Maughan, B., Rutter, M., Castle, J., Colvert, E., Groothues, C., et al. (2006). Do the effects of early severe deprivation on cognition persist into early adolescence? Findings from the English and Romanian adoptees study. Child Development, 77, 696–711.
Campbell, F. A., Pungello, E. P., Johnson, S. M., Burchinal, M., & Ramey, C. T. (2001). The development of cognitive and academic abilities: Growth curves from an early childhood educational experiment. Developmental Psychology, 37, 231–242.
Collishaw, S., Goodman, R., Pickles, A., & Maughan, B. (2007). Modelling the contribution of changes in family life to time trends in adolescent conduct problems. Social Science and Medicine, 65, 2576–2587.
Grady, D., Rubin, S. M., Petitti, D. B., Fox, C. S., Black, D., Ettinger, B., et al. (1992). Hormone therapy to prevent disease and prolong life in postmenopausal women. Annals of Internal Medicine, 117, 1016–1037.
O'Connor, T. G., Caspi, A., DeFries, J. C., & Plomin, R. (2000). Are associations between parental divorce and children's adjustment genetically mediated? An adoption study. Developmental Psychology, 36, 429–437.
Tizard, B., & Rees, J. (1975). The effect of early institutional rearing on the behavioral problems and affectional relationships of four-year-old children. Journal of Child Psychology and Psychiatry, 16, 61–73.

NATURALISTIC INQUIRY

Naturalistic inquiry is an approach to understanding the social world in which the researcher observes, describes, and interprets the experiences and actions of specific people and groups in societal and cultural context. It is a research tradition that encompasses qualitative research methods originally developed in anthropology and sociology, including participant observation, direct observation, ethnographic methods, case studies, grounded theory, unobtrusive methods, and field research methods. Working in the places where people live and work, naturalistic researchers draw on observations, interviews, and other sources of descriptive data, as well as their own subjective experiences, to create rich, evocative descriptions and interpretations of social phenomena. Naturalistic inquiry designs are valuable for exploratory research, particularly when relevant theoretical frameworks are not available or when little is known about the people to be investigated. The characteristics, methods, indicators of quality, philosophical foundations, history, disadvantages, and advantages of naturalistic research designs are described below.

Characteristics of Naturalistic Research

Naturalistic inquiry involves the study of a single case, usually a self-identified group or community. Self-identified group members are conscious of boundaries that set them apart from others. When qualitative (naturalistic) researchers select a case for study, they do so because it is of interest in its own right. The aim is not to find a representative case from which to generalize findings to other, similar individuals or groups. It is to develop interpretations and local theories that afford deep insights into the human experience.

Naturalistic inquiry is conducted in the field, within communities, homes, schools, churches, hospitals, public agencies, businesses, and other settings. Naturalistic researchers spend large amounts of time interacting directly with participants. The researcher is the research instrument, engaging in daily activities and conversations with group members to understand their experiences and points of view. Within this tradition, language is considered a key source of insight into socially constructed worlds. Researchers record participants' words and actions in detail with minimal interpretation. Although focused on words, narratives, and discourse, naturalistic researchers learn through all of their senses. They collect data at the following experiential levels: cognitive, social, affective, physical, and political/ideological. This strategy adds
depth and texture to the body of data qualitative researchers describe, analyze, and interpret.

Naturalistic researchers study research problems and questions that are initially stated broadly then gradually narrowed during the course of the study. In non-naturalistic, experimental research designs, terms are defined, research hypotheses stated, and procedures for data collection established in advance before the study begins. In contrast, qualitative research designs develop over time as researchers formulate new understandings and refine their research questions. Throughout the research process, naturalistic researchers modify their methodological strategies to obtain the kinds of data required to shed light on more focused or intriguing questions. One goal of naturalistic inquiry is to generate new questions that will lead to improved observations and interpretations, which will in turn foster the formulation of still better questions. The process is circular but ends when the researcher has created an account that seems to capture and make sense of all the data at hand.

Naturalistic Research Methods

General Process

When naturalistic researchers conduct field research, they typically go through the following common sequence of steps:

1. Gaining access to and entering the field site
2. Gathering data
3. Ensuring accuracy and trustworthiness (verifying and cross-checking findings)
4. Analyzing data (begins almost immediately and continues throughout the study)
5. Formulating interpretations (also an ongoing process)
6. Writing up findings
7. Member checking (sharing conclusions and conferring with participants)
8. Leaving the field site

Sampling

Naturalistic researchers employ purposive rather than representative or random sampling methods. Participants are selected based on the purpose of the study and the questions under investigation, which are refined as the study proceeds. This strategy might increase the possibility that unusual cases will be identified and included in the study. Purposive sampling supports the development of theories grounded in empirical data tied to specific local settings.

Analyzing and Interpreting Data

The first step in qualitative data analysis involves transforming experiences, conversations, and observations into text (data). When naturalistic researchers analyze data, they review field notes, interview transcripts, journals, summaries, and other documents looking for repeated patterns (words, phrases, actions, or events) that are salient by virtue of their frequency. In some instances, the researcher might use descriptive statistics to identify and represent these patterns.

Interpretation refers to making sense of what these patterns or themes might mean, developing explanations, and making connections between the data and relevant studies or theoretical frameworks. For example, reasoning by analogy, researchers might note parallels between athletic events and anthropological descriptions of ritual processes. Naturalistic researchers draw on their own understanding of social, psychological, and economic theory as they formulate accounts of their findings. They work inductively, from the ground up, and eventually develop location-specific theories or accounts based on analysis of primary data.

As a by-product of this process, new research questions emerge. Whereas traditional researchers establish hypotheses prior to the start of their studies, qualitative researchers formulate broad research questions or problem statements at the start, then reformulate or develop new questions as the study proceeds. The terms grounded theory, inductive analysis, and content analysis, although not synonymous, refer to this process of making sense of and interpreting data.

Evaluating Quality

The standards used to evaluate the adequacy of traditional, quantitative studies should not be used to assess naturalistic research projects. Quantitative
and qualitative researchers work within distinct traditions that rest on different philosophical assumptions, employ different methods, and produce different products. Qualitative researchers argue among themselves about how best to evaluate naturalistic inquiry projects, and there is little consensus on whether it is possible or appropriate to establish common standards by which such studies might be judged. However, many characteristics are widely considered to be indicators of merit in the design of naturalistic inquiry projects.

Immersion

Good qualitative studies are time consuming. Researchers must become well acquainted with the field site and its inhabitants as well as the wider context within which the site is located. They also immerse themselves in the data analysis process, through which they read, review, and summarize their data.

Transparency and Rigor

When writing up qualitative research projects, researchers must put themselves in the text, describing how the work was conducted, how they interacted with participants, how and why they decided to proceed as they did, and noting how participants might have been affected by these interactions. Whether the focus is on interview transcripts, visual materials, or field research notes, the analytical process requires meticulous attention to detail and an inductive, bottom-up process of reasoning that should be made clear to the reader.

Reflexivity

Naturalistic inquirers do not seek to attain objectivity, but they must find ways to articulate and manage their subjective experiences. Evidence of one or more forms of reflexivity is expected in naturalistic inquiry projects. Positional reflexivity calls on researchers to attend to their personal experiences—past and present—and describe how their own personal characteristics (power, gender, ethnicity, and other intangibles) played a part in their interactions with and understandings of participants. Textual reflexivity involves skeptical, self-critical consideration of how authors (and the professional communities in which they work) employ language to construct their representations of the social world. A third form of reflexivity examines how participants and the researchers who study them create social order through practical, goal-oriented actions and discourse.

Comprehensiveness and Scope

The cultural anthropologist Clifford Geertz used the term thick description to convey the level of rich detail typical of qualitative, ethnographic descriptions. When writing qualitative research reports, researchers place the study site and findings as a whole within societal and cultural contexts. Effective reports also incorporate multiple perspectives, including perspectives of participants from all walks of life within a single community or organization, for example.

Accuracy

Researchers are expected to describe the steps taken to verify findings and interpretations. Strategies for verification include triangulation (using and confirming congruence among multiple sources of information), member checking (negotiating conclusions with participants), and auditing (critical review of the research design, processes, and conclusions by an expert).

Claims and Warrants

In well-designed studies, naturalistic researchers ensure that their conclusions are supported by empirical evidence. Furthermore, they recognize that their conclusions follow logically from the design of the study, including the review of pertinent literature, data collection, analysis, interpretation, and the researcher's inferential process.

Attention to Ethics

Researchers should describe the steps taken to protect participants from harm and discuss any ethical issues that arose during the course of the study.

Fair Return

Naturalistic inquiry projects are time consuming not only for researchers but also for participants,
who teach researchers about their ways of life and share their perspectives as interviewees. Researchers should describe what steps they took to compensate or provide fair return to participants for their help. Research leads to concrete benefits for researchers (degree completion or career advancement). Researchers must examine what benefits participants will gain as a result of the work and design their studies to ensure reciprocity (balanced rewards).

Coherence

Good studies call for well-written and compelling research reports. Standards for writing are genre specific. Postmodern authors defy tradition through experimentation and deliberate violations of writing conventions. For example, some authors avoid writing in clear, straightforward prose to express more accurately the complexities inherent in the social world and within the representational process.

Veracity

A good qualitative report brings the setting and its residents to life. Readers who have worked or lived in similar settings find the report credible because it reflects aspects of their own experiences.

Illumination

Good naturalistic studies go beyond mere description to offer new insights into social and psychological phenomena. Readers should learn something new and important about the social world and the people studied, and they might also gain a deeper understanding of their own ways of life.

Philosophical Foundations

Traditional scientific methods rest on philosophical assumptions associated with logical positivism. When working within this framework, researchers formulate hypotheses that are drawn from established theoretical frameworks, define variables by stipulating the processes used to measure them, collect data to test their hypotheses, and report their findings objectively. Objectivity is attained through separation of the researcher from participants and by dispassionate analysis and interpretation of results. In contrast, naturalistic researchers tap into their own subjective experiences as a source of data, seeking experiences that will afford them an intuitive understanding of social phenomena through empathy and subjectivity. Qualitative researchers use their subjective experiences as a source of data to be carefully described, analyzed, and shared with those who read their research reports.

For the naturalistic inquirer, objectivity and detachment are neither possible nor desirable. Human experiences are invariably influenced by the methods used to study them. The process of being studied affects all humans who become subjects of scientific attention. The presence of an observer affects those observed. Furthermore, the observer is changed through engaging with and observing the other. Objectivity is always a matter of degree.

Qualitative researchers are typically far less concerned about objectivity as this term is understood within traditional research approaches than with intersubjectivity. Intersubjectivity is the process by which humans share common experiences and subscribe to shared understandings of reality. Naturalistic researchers seek involvement and engagement rather than detachment and distance. They believe that humans are not rational beings and cannot be understood adequately through objective, disembodied analysis. Authors critically examine how their theoretical assumptions, personal histories, and methodological decisions might have influenced findings and interpretations (positional reflexivity). In a related vein, naturalistic researchers do not believe that political neutrality is possible or helpful. Within some qualitative research traditions, researchers collaborate with participants to bring about community-based political and economic change (social justice).

Qualitative researchers reject determinism, the idea that human behaviors are lawful and can be predicted. Traditional scientists try to discover relationships among variables that remain consistent across individuals beyond the experimental setting. Naturalistic inquiry rests on the belief that studying humans requires different methods than those used to study the material world. Advocates emphasize that no shared, universal reality remains
constant over time and across cultural groups. The phenomena of most interest to naturalistic researchers are socially constructed, constantly changing, and multiple. Naturalistic researchers hold that all human phenomena occur within particular contexts and cannot be interpreted or understood apart from these contexts.

History

The principles that guide naturalistic research methods were developed in biology, anthropology, and sociology. Biologist Charles Darwin developed the natural history method, which employs detailed observation of the natural world directed by specific research questions and theory building based on analysis of patterns in the data, followed by confirmation (testing) with additional observations in the field. Qualitative researchers use similar strategies, which transform experiential, qualitative information gathered in the field into data amenable to systematic investigation, analysis, and theory development.

Ancient adventurers, writers, and missionaries wrote the first naturalistic accounts, describing the exotic people they encountered on their travels. During the early decades of the 20th century, cultural anthropologists and sociologists pioneered the use of ethnographic research methods for the scientific study of social phenomena. Ethnography is both a naturalistic research methodology and a written report that describes field study findings. Although there are many different ethnographic genres, all of them employ direct observation of naturally occurring events in the field. Early in the 20th century, University of Chicago sociologists used ethnographic methods to study urban life, producing pioneering studies of immigrants, crime, work, youth, and group relations. Sociologist Herbert Blumer, drawing on George Herbert Mead, William I. Thomas, and John Dewey, developed a rationale for the naturalistic study of the social world. In the 1970s, social scientists articulated ideas and theoretical issues pertinent to naturalistic inquiry. Interest in qualitative research methods grew. In the mid-1980s, Yvonna Lincoln and Egon Guba published Naturalistic Inquiry, which provided a detailed critique of positivism and examined implications for social research. Highlighting the features that set qualitative research apart from other methods, these authors also translated key concepts across what they thought were profoundly different paradigms (disciplinary worldviews). In recent years, qualitative researchers considered the implications of critical, feminist, postmodern, and poststructural theories for their enterprise. The recognition or rediscovery that researchers create the phenomena they study and that language plays an important part in this process has inspired methodological innovations and lively discussions. The discourse on naturalistic inquiry remains complex and ever changing. New issues and controversies emerge every year, reflecting philosophical debates within and across many academic fields.

Methodological Disadvantages and Advantages

Disadvantages

Many areas are not suited to naturalistic investigation. Naturalistic research designs cannot uncover cause and effect relationships, and they cannot help researchers evaluate the effectiveness of specific medical treatments, school curricula, or parenting styles. They do not allow researchers to measure particular attributes (motivation, reading ability, or test anxiety) or to predict the outcomes of interventions with any degree of precision. Qualitative research permits only claims about the specific case under study. Generalizations beyond the research site are not appropriate. Furthermore, naturalistic researchers cannot set up logical conditions whereby they can demonstrate their own assumptions to be false.

Naturalistic inquiry is time consuming and difficult. Qualitative methods might seem to be easier to use than traditional experimental and survey methods because they do not require mastery of technical statistical and analytical methods. However, naturalistic inquiry is one of the most challenging research approaches to learn and employ. Qualitative researchers tailor methods to suit each project, revising data-collection strategies as questions and research foci emerge. Naturalistic researchers must have a high tolerance for uncertainty and the ability to work independently for extended periods of time, and these researchers must also be able to think creatively under pressure.

Advantages

Once controversial, naturalistic research methods are now used in social psychology, developmental psychology, qualitative sociology, and anthropology. Researchers in professional schools (education, nursing, health sciences, law, social work, and counseling) and applied fields (regional planning, library science, program evaluation, information science, and sports administration) employ naturalistic strategies to investigate social phenomena. Naturalistic approaches are well suited to the study of groups about which little is known. They are also holistic and comprehensive. Qualitative researchers try to tell the whole story, in context. A well-written report has some of the same characteristics as a good novel, bringing the lives of participants to life. Naturalistic methods help researchers understand how people view the world, what they value, and how these values and cognitive schemas are reflected in practices and social structures. Through the study of groups unlike their own, researchers learn that many different ways are available to raise children, teach, heal, maintain social order, and initiate change. Readers learn about the extraordinary variety of designs for living and adaptive strategies humans have created, thus broadening awareness of possibilities beyond conventional ways of life. Thus, naturalistic inquiry can provide insights that deepen our understanding of the human experience and generate new theoretical insights. For researchers, the process of performing qualitative research extends and intensifies the senses and provides interesting and gratifying experiences as relationships are formed with the participants from whom and with whom one learns.

Jan Armstrong

See also Ethnography; Grounded Theory; Interviewing; Naturalistic Observation; Qualitative Research

Further Readings

Blumer, H. (1969). Symbolic interactionism: Perspective and method. Englewood Cliffs, NJ: Prentice Hall.
Bogdan, R., & Bicklen, S. K. (2006). Qualitative research for education: An introduction to theories and methods (5th ed.). Boston: Allyn & Bacon.
Creswell, J. W. (2006). Qualitative inquiry and research design: Choosing among five approaches (2nd ed.). Thousand Oaks, CA: Sage.
Denzin, N. K., & Lincoln, Y. S. (2005). Sage handbook of qualitative research (3rd ed.). Thousand Oaks, CA: Sage.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Macbeth, D. (2001). On "reflexivity" in qualitative research: Two readings, and a third. Qualitative Inquiry, 7, 35–68.
Marshall, C., & Rossman, G. B. (2006). Designing qualitative research (4th ed.). Thousand Oaks, CA: Sage.
Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5, 465–478.

NATURALISTIC OBSERVATION

Naturalistic observation is a nonexperimental, primarily qualitative research method in which organisms are studied in their natural settings. Behaviors or other phenomena of interest are observed and recorded by the researcher, whose presence might be either known or unknown to the subjects. This approach falls within the broader category of field study, or research conducted outside the laboratory or institution of learning. No manipulation of the environment is involved in naturalistic observation, as the activities of interest are those manifested in everyday situations. This method is frequently employed during the initial stage of a research project, both for its wealth of descriptive value and as a foundation for hypotheses that might later be tested experimentally.

Zoologists, naturalists, and ethologists have long relied on naturalistic observation for a comprehensive picture of the variables coexisting with specific animal behaviors. Charles Darwin's 5-year voyage aboard the H.M.S. Beagle, an expedition that culminated in his theory of evolution and the publication of his book On the Origin of Species in 1859, is a paradigm of research based on this method. Studies of interactions within the social structures of primates by both Dian Fossey and Jane Goodall relied on observation in the subjects' native habitats. Konrad Lorenz, Niko Tinbergen, and Karl von Frisch advanced the
understanding of communication among animal species through naturalistic observation, introducing such terminology as imprinting, fixed action pattern, sign stimulus, and releaser to the scientific lexicon. All the investigations mentioned here were notable for their strong ecological validity, as they were conducted within a context reflective of the normal life experiences of the subjects. It is highly doubtful that the same richness of content could have been obtained in an artificial environment devoid of concurrent factors that would have normally accompanied the observed behaviors.

The instances in which naturalistic observation also yields valuable insight to psychologists, social scientists, anthropologists, ethnographers, and behavioral scientists in the study of human behavior are many. For example, social deficits symptomatic of certain psychological or developmental disorders (such as autism, childhood aggression, or anxiety) might be evidenced more clearly in a typical context than under simulated conditions. The dynamics within a marital or family relationship likewise tend to be most perceptible when the participants interact as they would under everyday circumstances. In the study of broader cultural phenomena, a researcher might collect data by living among the population of interest and witnessing activities that could only be observed in a real-life situation after earning their trust and their acceptance as an "insider."

This entry begins with the historic origins of naturalistic observation. Next, the four types of naturalistic observation are described and naturalistic observation and experimental methods are compared. Last, this entry briefly discusses the future direction of naturalistic observation.

Historic Origins

The field of qualitative research gained prominence in the United States during the early 20th century. Its emergence as a recognized method of scientific investigation was taking place simultaneously in Europe, although the literature generated by many of these proponents was not available in the Western Hemisphere until after World War II. At the University of Chicago, such eminent researchers as Robert Park, John Dewey, Margaret Mead, and Charles Cooley contributed greatly to the development of participant observation methodology in the 1920s and 1930s. The approach became widely adopted among anthropologists during these same two decades. In Mead's 1928 study "Coming of Age in Samoa," data were collected while she resided among the inhabitants of a small Samoan village, making possible her groundbreaking revelations on the lives of girls and women in this island society.

Over the years, naturalistic observation became a widely used technique throughout the many scientific disciplines concerned with human behavior. Among its best-known practitioners was Jean Piaget, who based his theory of cognitive development on observations of his own children throughout the various stages of their maturation; in addition, he would watch other children at play, listening to and recording their interactions. Jeremy Tunstall conducted a study of fishermen in the English seaport of Hull by living among them and working beside them, a sojourn that led to the publication of his book Fishermen: The Sociology of an Extreme Occupation in 1962. Stanley Milgram employed naturalistic observation in an investigation on the phenomenon of "familiar strangers" (people who encountered but never spoke to one another) among city dwellers, watching railway commuters day after day as they waited to board the train to their workplaces in New York City. At his Family Research Laboratory in Seattle, Washington, John Gottman has used audiovisual monitoring as a component of his marriage counseling program since 1986. Couples stay overnight in a fabricated apartment at the laboratory, and both qualitative data (such as verbal interactions, proxemics, and kinesics) and quantitative data (such as heart rate, pulse amplitude, and skin conductivity) are collected. In his book The Seven Principles for Making Marriage Work, Gottman reported that the evidence gathered during this phase of therapy enabled him to predict whether a marriage would fail or succeed with 91% accuracy.

Types of Naturalistic Observation

Naturalistic observation might be divided into four distinct categories. Each differs from the others in terms of basic definitions, distinguishing features, strengths and limitations, and appropriateness for specific research designs.

Overt Participant Observation

In this study design, subjects are aware that they are being observed and are apprised of the research purpose prior to data collection. The investigator participates in the activities of the subjects being studied and might do this by frequenting certain social venues, taking part in the affairs of an organization, or living as a member of a community. For example, a researcher might travel with a tour group to observe how people cope with inability to communicate in a country where an unfamiliar language is spoken.

The nature of this study design obviates several ethical concerns, as no deception is involved. However, reactivity to the presence of the observer might compromise internal validity. The Hawthorne effect, in which behavioral and performance-related changes (usually positive) occur as a result of the experimenter's attention, is one form of reactivity. Other problems might ensue if subjects change their behaviors after learning the purpose of the experiment. These artifacts might include social desirability bias, attempts to confirm any hypotheses stated or suggested by the investigator, or noncooperation with the aim of disconfirming the investigator's hypotheses. Reluctant subjects might also anticipate the presence of the investigator and might take measures to avoid being observed.

Covert Participant Observation

In many circumstances, disclosure of the investigator's purpose would jeopardize the successful gathering of data. In such cases, covert participant observation allows direct involvement in the subjects' activities without revealing that a study is being conducted. The observer might join or pretend to join an organization; assume the role of a student, instructor, or supervisor; or otherwise mingle unobtrusively with subjects to gain access to relevant information. Observation via the internet might involve signing up for a website membership using a fictitious identity and rationale for interest in participation. A study in which investigators pretend to be cult members to gather information on the dynamics of indoctrination is one instance of covert participant observation.

Certain distinct benefits are associated with this approach. As subjects remain oblivious to the presence of the researcher, reactivity to the test situation is eliminated. Frequently, it is easier to win the confidence of subjects if they believe the researcher to be a peer. However, measures taken to maintain secrecy might also restrict the range of observation. Of even greater concern is the potential breach of ethics represented by involving subjects in a study without their informed consent. In all such cases, it is important to consider whether the deception involved is justified by the potential benefits to be reaped. The ethical principles of such professional organizations as the American Psychological Association, American Medical Association, and American Counseling Association stress the overarching goal of promoting the welfare and respecting the dignity of the client or patient. This can be summarized as an aspirational guideline to do no harm. Apart from avoiding physical injury, it is necessary to consider and obviate any aspects of the study design that could do psychological or emotional damage. Even when potentially detrimental effects fail to materialize, the practitioner is responsible for upholding the principle of integrity in his or her professional conduct and for performing research honestly and without misrepresentation. Finally, subjects might eventually learn of their involuntary inclusion in research, and the consequent potential for litigation cannot be taken lightly. The issue of whether to extenuate deception to gain information that might not otherwise be procured merits serious deliberation, and is not treated casually in research design.

Overt Nonparticipant Observation

The primary distinction between this method and overt participant observation is the role of the observer, who remains separate from the subjects being studied. Of the two procedures, nonparticipant observation is implemented in research more frequently. Subjects acknowledge the study being conducted and the presence of the investigator, who observes and records data but does not mingle with the subjects. For instance, a researcher might stand at a subway station, watching commuters and surreptitiously noting the frequency of discourteous behaviors during rush hour as opposed to off-peak travel times.

A major advantage of overt nonparticipant observation is the investigator's freedom to use
various tools and instruments openly, thus enabling easier and more complete recording of observations. (In contrast, a covert observer might be forced to write hasty notes on pieces of paper to avoid suspicion and to attempt reconstruction of fine details from memory later on.) Still, artifacts associated with awareness of the investigator's presence might persist, even though observation from a distance might tend to exert less influence on subjects' behavior. In addition, there is virtually no opportunity to question subjects should the researcher wish to obtain subsequent clarification of the meaning attached to an event. The observer might, thus, commit the error of making subjective interpretations based on inconclusive evidence.

Covert Nonparticipant Observation

This procedure involves observation conducted apart from the subjects being studied. As in covert participant observation, the identity of the investigator is not revealed. Data are often secretly recorded and hidden; alternatively, observations might be documented at a later time when the investigator is away from the subjects. Witnessing events by means of electronic devices is also a form of covert nonparticipant observation. For example, the researcher might watch a videotape of children at recess to observe peer aggression.

The covert nonparticipant observer enjoys the advantages of candid subject behavior as well as the availability of apparatus with which to record data immediately. However, as in covert participant observation, measures taken to preserve anonymity might also curtail access to the full range of observations. Remote surveillance might similarly offer only a limited glimpse of the sphere of contextual factors, thereby diminishing the usefulness of the data. Finally, the previously discussed ethical infractions associated with any form of covert observation, as well as the potential legal repercussions, make using this method highly controversial.

Comparing Naturalistic Observation With Experimental Methods

The advantages offered by naturalistic observation are many, whether in conjunction with experimental research or as the primary constituent of a study. First, there is less formal planning than in the experimental method, and also more flexibility is involved in accommodating change throughout the research process. These attributes make for an ideal preliminary procedure, one that might serve to lay the groundwork for a more focused investigation. As mentioned earlier, unexpected observations might generate new hypotheses, thereby contributing to the comprehensiveness of any research based thereon.

By remaining unobtrusive, the observer has access to behaviors that are more characteristic, more spontaneous, and more diverse than those one might witness in a laboratory setting. In many instances, such events simply cannot be examined in a laboratory setting. To learn about the natural behavior of a wild animal species, the workplace dynamics of a corporate entity, or the culturally prescribed roles within an isolated society, the investigator must conduct observations in the subjects' day-to-day environment. This requirement ensures a greater degree of ecological validity than one could expect to achieve in a simulated environment. However, there are no implications for increased external validity. As subjects are observed by happenstance, not selected according to a sampling procedure, representativeness cannot be guaranteed. Any conclusions drawn must necessarily be limited to the sample studied and cannot generalize to the population.

There are other drawbacks to naturalistic observation vis-à-vis experimental methods. One of these is the inability to control the environment in which subjects are being observed. Consequently, the experimenter can derive descriptive data from observation but cannot establish cause-and-effect relationships. Not only does this preclude explanation of why behaviors occur, but also it limits the prediction of behaviors. Additionally, the natural conditions observed are unique in all instances, thus rendering replication unfeasible.

The potential for experimenter bias is also significant. Whereas the number of times a behavior is recorded and the duration of the episode are both unambiguous measures, the naturalistic observer lacks a clear-cut system for measuring the extent or magnitude of a behavior. Perception of events might thus be influenced by any number of factors, including personal worldview. An especially problematic situation might arise when the observer is informed of the hypothesis and of the conditions under
investigation, as this might lead to seeking confirmatory evidence. Another possible error is that of the observer recording data in an interpretative rather than a descriptive manner, which can result in an ex post facto conclusion of causality. The researcher's involvement with the group in participant observation might constitute an additional source of bias. Objectivity can suffer because of group influence, and data might also be colored by a strong positive or negative impression of the subjects.

Experimental approaches to research differ from naturalistic observation on a number of salient points. One primary advantage of the true experiment is that a hypothesis can be tested and a cause-and-effect relationship can be demonstrated. The independent variable of interest is systematically manipulated, and the effects of this manipulation on the dependent variable are observed. Because the researcher controls the environment in which the study is conducted, it is thus possible to eliminate confounding variables. Besides enabling attribution of causality, this design also provides evidence of why a behavior occurs and allows prediction of when and under what conditions the behavior is likely to occur again. Unlike naturalistic observation, an experimental study can possess internal and external validity, although the controls inherent in this approach can diminish ecological validity, as it might be difficult to eliminate extraneous variables while maintaining some semblance of a real-world setting.

An additional benefit of experimental research is the relative stability of the environment in which the researcher conducts the study. In contrast, participant observation might entail a high degree of stress and personal risk when working with certain groups (such as gang members or prison inmates). This method also demands investment of considerable time and expense, and the setting might not be conducive to management of other responsibilities.

Although experimental design is regarded as a more conclusive method than naturalistic observation and is more widely used in science, it is not suitable for all research. Ethical and legal guidelines might forbid an experimental treatment if it is judged capable of harming subjects. For example, in studying the progression of a viral infection such as HIV, the investigator is prohibited from causing subjects to contract the illness and must instead recruit those who have already tested positive for the virus. Similarly, only preexisting psychiatric conditions (such as posttraumatic stress disorder) are studied, as subjects cannot be exposed to manipulations that could cause psychological or emotional trauma. Certain factors might be difficult or impossible to measure, as is the case with various cognitive processes.

The researcher's choice of method can either contribute to or, conversely, erode scientific rigor. If a convenience sample is used, if there are too few subjects in the sample, if randomization is flawed, or if the sample is otherwise not representative of the population from which it is selected, then the study will not yield generalizable results. The use of an instrument with insufficient reliability and validity might similarly undermine the experimental design. Nonetheless, bias and human error are universal in all areas of research. Self-awareness, critical thinking, and meticulous research methods can do much to minimize their ill effects.

Future Directions

Robert Elliott and his colleagues proposed new guidelines for the publication of qualitative research studies in 1999, with the goal of encouraging legitimization, quality control, and subsequent development of this approach. Their exposition of both the traditional value and the current evolution of qualitative research was a compelling argument in support of its function not only as a precursor to experimental investigations but also as a method that addressed a different category of questions and therefore merited recognition in its own right. Given the ongoing presence of nonexperimental approaches in college and university curricula and in the current literature, it is likely that naturalistic observation will continue to play a vital role in scientific research.

Barbara M. Wells

See also Descriptive Statistics; Ecological Validity; Naturalistic Inquiry; Observational Research; Observations; Qualitative Research

Further Readings

Davidson, B., Worrall, L., & Hickson, L. (2003). Identifying the communication activities of older
890 Nested Factor Design

people with aphasia: Evidence from naturalistic treatments when therapists or treatment centers
observation. Aphasiology, 17; 243–264. provide one treatment to more than one partici-
Education Forum. Primary research methods. Retrieved pant. Because each therapist or treatment center
August 24, 2008, from http://www.educationforum. provides only one treatment, the provider or treat-
co.uk/Health/primarymethods.ppt
ment center factor is nested under only one level of
Elliott, R., Fisher, C. T., & Rennie, D. L. (1999).
Evolving guidelines for publication of qualitative
the treatment factor. Nested factor designs are also
research studies in psychology and related fields. common in educational research in which class-
British Journal of Clinical Psychology, 38; 215–229. rooms of students are nested within classroom
Fernald, D. (1999). Research methods. In Psychology. interventions. For example, researchers commonly
Upper Saddle River, NJ: Prentice Hall. assign whole classrooms to different levels of
Mehl, R. (2007). Eavesdropping on health: A naturalistic a classroom-intervention factor. Thus, each class-
observation approach for social health research. Social room, or cluster, is assigned to only one level of the
and Personality Psychology Compass, 1; 359–380. intervention factor and is said to be nested under
Messer, S. C., & Gross, A. M. (1995). Childhood
this factor. Ignoring a nested factor in the evalua-
depression and family interaction: A naturalistic
tion of a design can lead to consequences that are
observation study. Journal of Clinical Child
Psychology, 24; 77–88. detrimental to the validity of statistical decisions.
Pepler, D. J., & Craig, W. M. (1995). A peek behind the The main reason for this is that the observations
fence: Naturalistic observations of aggressive children within the levels of a nested factor are likely to not
with remote audiovisual recording. Developmental be independent of each other but related. The mag-
Psychology, 31; 548–553. nitude of this relationship can be expressed by
Spata, A. V. (2003). Research methods: Science and a so-called intraclass correlation coefficient ρI .
diversity. New York: Wiley. The focus of this entry is on the most common
Weinrott, M. R., & Jones, R. R. (1984). Overt versus nested design: the two-level nested design. This entry
covert assessment of observer reliability. Child
discusses whether nested factors are random or fixed
Development, 55; 1125–1137.
effects and the implications of nested designs on sta-
tistical power. In addition, the criteria to determine
which model to use and the consequences of ignor-
ing nested factors are also examined.
NESTED FACTOR DESIGN
Two-Level Nested Factor Design
In nested factor design, two or more factors are
not completely crossed; that is, the design does not The most common nested design involves two fac-
include each possible combination of the levels of tors with a factor B nested within the levels of
the factors. Rather, one or more factors are nested a second factor A. The linear structural model for
within the levels of another factor. For example, in this design can be given as follows:
a design in which a factor (factor B) has four levels
and is nested within the two levels of a second fac- Yijk ¼ μ þ αj þ βkðjÞ þ εijk ; ð1Þ
tor (factor A), levels 1 and 2 of factor B would only
occur in combination with level 1 of factor A and where Yijk is the observation for the ith subject
levels 3 and 4 of factor B would only be combined (i ¼ 1; 2; . . . ; nÞ in the jth level of factor A (j ¼ 1;
with level 2 of factor A. In other words, in a nested 2; . . . ; pÞ and the kth level of the nested factor
factor design, there are cells that are empty. In the B (k ¼ 1; 2; . . . ; qÞ; μ is the grand mean, αj is the
described design, for example, no observations are effect for the jth treatment, βkðjÞ is the effect of the
made for the combination of level 1 of factor A kth provider nested under the jth treatment, and εijk
and level 3 of factor B. When a factor B is nested is the error of the observation (within cell variance).
under a factor A, this is denoted as B(A). In more Note that because factors A and B are not completely
complex designs, a factor can also be nested under crossed, the model does not include an interaction
combinations of other factors. A common example term because it cannot be estimated separately from
in which factors are nested is within treatments, the error term. More generally speaking, because
for example, the evaluation of psychological nested factor designs have not as many cells as
Nested Factor Design 891

completely crossed designs, one cannot perform all generalize to the levels of the factor that are not
tests for main effects and interactions. included in the study (the universe of levels),
The assumptions of the nested model are that then a nested factor should be treated as a random
the effects of the fixed factor A sum up to zero factor. The assumption that the levels included in
X the study are representative of an underlying
αj ¼ 0; ð2Þ population requires that the levels are drawn at
j random from the universe of levels. Thus, nested
factor levels are treated like subjects who are also
that errors are normally distributed and have an
considered to be random samples from the popula-
expected value of zero
tion of all subjects. The resulting model is called
  a mixed model and assumes that the effects of fac-
εijk ∼ N 0; σ 2ε ; and that ð3Þ
iðjkÞ tor B are normally distributed with a mean of zero,
specifically:
εiðjkÞ ; αj ; and βkðjÞ are pairwise independent: ð4Þ  
βkðjÞ ∼ N 0; σ 2β : ð6Þ
kðjÞ
In the next section, the focus is on nested factor
designs with two factors with one of the factors A nested factor is correctly conceptualized as
nested under the other. More complex models a fixed factor if a researcher only seeks to make
can be built analogously. For example, the model an inference about the specific levels of the
equation for a design with two crossed factors, A nested factor included in his or her study and if
and B, and a third factor nested within factor C is the levels included in the study are not drawn at
described by the following structural model: random from a universe of levels. For example,
if a researcher wants to make an inference about
Yijkm ¼ μ þ αi þ βj þ γ kðiÞ þ ðαβÞjkðiÞ þεijkm : ð5Þ the specific treatment centers (nested within dif-
ferent treatments) that are included in her study,
Nested Factors as Random and Fixed Effects the nested factor is correctly modeled as a fixed
effect. The corresponding assumption of the
In experimental and quasi-experimental designs, fixed model is
the factor under which the second factor is nested X
is almost always conceptualized as a fixed factor. βkðjÞ ¼ 0; ð7Þ
That is, a researcher seeks to make inferences kðjÞ

about the specific levels of the factor included


in the study and does not consider them random that is, the effects of the nested factor add up to
samples from a population of levels. For example, zero within each level of factor A. The statistical
if a researcher compares different treatment groups conclusion is then conditional on the factor levels
or sets of experimental stimuli with different char- included in the study.
acteristics, the researcher’s goal is to make infer- In both the mixed model and the fixed model,
ences about the specific treatments implemented the variance for the total model is given by
in the study or the stimulus properties that are
σ 2total ¼ σ 2A þ σ 2B þ σ 2within : ð8Þ
manipulated.
The assumptions concerning the nested factor In the population model, the proportion of vari-
depend on whether it is considered to be a random ance accounted for by factor A can be expressed as
or a fixed effect. Although in nested designs the
nested factor is often considered to be random, σ 2A σ 2A
ω2 ¼ ¼ ; ð9Þ
there is no necessity to do this. The question of σ 2total σ 2A þ σ 2B þ σ 2within
whether a factor should be treated as random or
fixed is generally answered as follows: If the levels that is, the variance caused by factor A divided by
of a nested factor that are included in a study are the total variance (i.e., the sum of the variance
considered to be a random sample of an underly- caused by the treatments, variance caused by pro-
ing population of levels and if it is intended to viders, and within-cell variance). Alternatively, the
892 Nested Factor Design

treatment effect can be expressed as the partial The main difference between the mixed and
effect size the fixed model is that in the mixed model, the
expected mean square for factor A contains a term
σ 2A that includes the variance caused by the nested fac-
ω2part ¼ ; ð10Þ
σ 2A þ σ 2within tor B (viz., nσ 2B ), whereas in the fixed model, the
expected mean square for treatment effects con-
that is, the variance caused by factor A divided by tains no such term. Consequently, in the mixed-
the sum of the variance caused by the treatments model case, the correct denominator to calculate
and within-cell variance. The partial effect size the test statistic for factor A is the mean square for
reflects the effectiveness of the factor A indepen- the nested factor, namely
dent of additional effects of the nested factor B.
The effect resulting from the nested factor can be MSA
analogously defined as partial effect size, as follows: Fmixed ¼ : ð12Þ
MSB
σ 2B
ω2B ¼ ¼ ρI : ð11Þ Note that the degrees of freedom of the denomi-
σ 2B þ σ 2within nator are exclusively a function of the number of
levels of the factors A and B and are not influenced
If the nested factor is modeled as random, this
by the number of subjects within each cell of the
effect is equal to the intraclass correlation coeffi-
design.
cient ρI . This means that intraclass correlations ρI
In the fixed-model case, the correct denominator
are partial effect sizes of the nested factor B (i.e.,
is the mean square for within cell variation, namely
independent of the effects of factor A). The intra-
class correlation coefficient, which represents the MSA
relative amount of variation attributable to the Ffixed ¼ : ð13Þ
MSwithin
nested factor, is also a measure of the similarity of
the observations within the levels of the nested fac- Note that the within-cell variation does not
tors. It is, therefore, a measure of the degree to include the variation resulting from the nested factor
which the assumption of independence—required and that the degrees of freedom of the denominator
if the nested factor is ignored in the analysis and are largely determined by the number of subjects.
individual observations are the unit of analysis—is The different ways the tests statistics for the
violated. Ignoring the nested factor if the intraclass non-nested factor A are calculated reflect the dif-
correlation is not zero can lead to serious prob- ferent underlying model assumptions. In the mixed
lems, especially to alpha inflation. model, levels of the nested factor are treated as
a random sample from an underlying universe of
levels. Because variation caused by the levels of the
Sample Statistics
nested factor sampled in a particular study will
The source tables for the mixed model (factor A randomly vary across repetitions of a study, this
fixed and factor B random) and the fixed model variation is considered to be error. In the fixed
(both factors fixed) are presented in Table 1. model, it is assumed that the levels of the nested

Table 1 Sources of Variance and Expected Mean Squares for Nested Design: Factor A Fixed and Nested Factor B
Random (Mixed Model) Versus Factor A Fixed and Nested Factor B Fixed (Fixed Model)
Source SS df MS E(MS) Mixed Model E(MS) Fixed Model
SSA 2 2 2
A SSA p1 p1 σ w þ nσ B þ ½p=ðp  1Þnqσ A σ 2w þ [p/(p  1)]nqσ 2A
SSB
B(A) SSB p(q  1) pðq1Þ σ 2w þ nσ 2B σ 2w þ [q/(q  1)]nσ 2B
SSwithin
Within cell (w) SSw pq(n  1) pqðn1Þ σ 2w σ 2w
Notes: The number of levels of factor A is represented by p; the number of levels of the nested factor B within each level of factor
A is represented by q; and the number of subjects within each level of factor B is represented by n: df ¼ degrees of freedom; SS ¼
sum of squares; MS ¼ mean square; E(MS) ¼ expected mean square; A ¼ factor A; B ¼ nested factor B; w ¼ within cell.
Nested Factor Design 893

factor included in a particular study will not vary As a result, the population effect size of factor
across replications of a study, and variation from A can be estimated by the following formula for
the nested factor is removed from the estimated the nonpartial effect size:
error.
SSA  ðp  1ÞMSB
^ 2mixed ¼
ω : ð21Þ
SSA  ðp  1ÞMSB
Effect Size Estimates in Nested Factor Designs þ pqðMSB  MSwithin Þ
The mixed and the fixed models vary with þ pqnMSwithin
respect to how population effect sizes are esti-
mated. First, population effects are typically not Accordingly, the partial effect size for factor A
estimated by the sample effect size, namely can be estimated with the formula

SSA SSA  ðp  1ÞMSB


η2A ¼ ; ð14Þ ^ 2mixed; partial ¼
ω :
SStotal SSA  ðp  1ÞMSB þ pqnMSwithin
ð22Þ
and
SSA In the fixed model, the population effect is also
η2part ¼ ð15Þ estimated by Equation 16, but the estimates of the
SSA þ SSwithin
factor A and nested factor B variance components
for the nonpartial and the partial effect of the non- differ from the mixed model (cf. Table 1), namely:
nested factor A, because this measure is biased in
0 ðp  1ÞðMSA  MSwithin Þ
that it overestimates the true population effect. σ^ 2A ¼ ð23Þ
Rather, the population effect is correctly estimated npq
with an effect size measure named omega square:
and
0 q1
σ^ 2A σ^ 2B ¼ ðMSB  MSwithin Þ: ð24Þ
^ 2A
ω ¼ ð16Þ qn
σ^ 2A þ σ^ 2B þ σ^ 2within
As a consequence, the nonpartial population
and effect size of factor A for the fixed-effects model
σ^ 2A can be estimated by the following formula:
^ 2part ¼
ω ð17Þ
σ^ 2A þ σ^ 2within SSA  ðp  1ÞMSwithin
^ 2fixed ¼
ω : ð25Þ
SStotal þ MSwithin
for the estimated nonpartial and partial population
effects of the non-nested factor A. The nature of The corresponding estimated partial population
the exact formula estimating the population effect effect can be estimated with the following
on the basis of sample statistics depends on how formula:
the individual variance components are estimated
in the different analysis of variance (ANOVA) SSA  ðp  1ÞMSwithin
models. In the mixed model, the variance compo- ^ 2fixed; partial ¼
ω :
SSA  ðp  1ÞMSwithin þ pqnMSwithin
nents are estimated as follows (see also Table 1):
ð26Þ
ðp  1ÞðMSA  MSB Þ
σ^ 2A ¼ ; ð18Þ
npq
Statistical Power in Nested Factor Designs
It is important to consider the implication of
σ^ 2B ¼ ðMSB  MSwithin Þ=n; ð19Þ nested factor designs on the statistical power of
a study. The consequences a nested factor design
and
will have on the power of a study will vary dra-
σ^ 2within ¼ MSwithin : ð20Þ matically depending on whether the nested factor
894 Nested Factor Design

is modeled correctly as a random or as a fixed fac- that nested factors should be treated as random
tor. In the mixed-effects model, statistical power effects by default. Nested factors should also be
mainly depends on the number of levels of the treated as random if the levels of the nested factors
nested factor, whereas power is largely indepen- are randomly assigned to the levels of the non-
dent of the number of subjects within each level of nested factor. In the absence of random sampling
the nested factor. In fact, a mixed-model ANOVA from a population, random assignment can be
with, for instance, a nested factor with two levels used as a basis of statistical inference. Under the
nested within the levels of a higher order factor random assignment model, the statistical inference
with two levels for each treatment essentially has can be interpreted as applying to possible reran-
the statistical power of a t test with two degrees of domizations of the subjects in the sample.
freedom. In the mixed model, power is also nega- If a researcher seeks to make an inference about
tively related to the magnitude of the effect of the the specific levels of the nested factors included in
nested factor. Studies with random nested factors the study, a fixed-effects model should be used.
should be designed accordingly with a sufficient Any (statistical) inference made on the basis of the
number of levels of the nested factor, especially if fixed model is restricted to the specific levels of the
a researcher expects large effects of the nested fac- nested factor as they were realized in the study.
tor (i.e., a large intraclass correlation). The question of which model should be used
In the fixed model, statistical power is mainly in the absence of random sampling and random
determined by the number of subjects and remains assignment is debatable. Some authors argue that
largely unaffected by the number of levels of the the mixed model should be used regardless of
nested factor. Moreover, the power of the fixed- whether random sampling or random assignment
model test increases with increasing nested factor is involved. Other authors argue that in this case,
effects because the fixed-effects model residualizes a mixed-effects model is not justified and the fixed-
the F-test denominator (expected within subject effects model should be used, with an explicit
variance) for nested factor variance. acknowledgement that it does not allow a general-
ization of the obtained results. The choice between
the mixed and the fixed model is less critical if the
Criteria to Determine the Correct Model
effects of the nested factor are zero. In this case,
Two principal possibilities exist for dealing with the mixed and the fixed model reach the same con-
nested effects: Nested factors can be treated clusions when the null hypothesis is true even if the
as random factors leading to a mixed-model mixed model is assumed to be a valid statistical
ANOVA, or nested factors might be treated as model for the study. In particular, the mixed model
fixed factors leading to a fixed model ANOVA. does not lead to inflated Type I error levels. The
There are potential risks associated with choos- fixed-effects analysis, however, can have dramati-
ing the incorrect model in nested factor designs. cally greater power when the alternative hypothesis
The incorrect use of the fixed model might lead is true. It has to be emphasized, however, that any
to overestimations of effect sizes and inflated choice between the mixed and the fixed model
Type I error rates. In contrast, the incorrect use should not be guided by statistical power consid-
of the mixed model might lead to serious under- erations alone.
estimations of effect sizes and inflated Type II
errors (lack of power). It is, therefore, important
Consequences of Ignoring Nested Factors
to choose the correct model to analyze a nested-
factor design. Although the choice between the two different
If the levels of a nested factor have been ran- models to analyze nested factor designs may be
domly sampled from a universe of population difficult, ignoring the nested factor is always
levels and the goal of a researcher is to generalize a wrong decision. If the mixed model is the correct
to this universe of levels, the mixed model has model and there are nested factor effects (i.e., the
to be used. Because the generalization of results is intraclass correlation is different from zero), then
commonly recognized as an important aim of sta- ignoring a nested factor, and thus the dependence
tistical hypothesis testing, many authors emphasize of observations within the subjects within the
Network Analysis 895

levels of the nested factor, leads to inflated Type I the relationships between these factors. These rela-
error rates and an overestimation of population tionships are illustrated in a diagrammatic net-
effects. Some authors have suggested that after work consisting of nodes (i.e., causal factors) and
a preliminary test (with a liberal alpha level) shows arcs representing the relationships between nodes.
that there are no significant nested factor effects, it The technique captures the complexities of peo-
is safe to remove the nested factor from the analy- ple’s cognitive representations of causal attribu-
sis. Monte-Carlo studies have shown, however, tions for a given phenomenon. This entry discusses
that these preliminary tests are typically not pow- the history, techniques, applications, and limita-
erful enough (even with a liberal alpha level) to tions of network analysis.
detect meaningful nested-factor effects.
However, if the fixed-effects model correctly
History
describes the data, ignoring the nested factor will
lead to an increase in Type II error levels (i.e., Network analysis was developed to account for
a loss in statistical power) and an underestima- individuals’ relatively complex and sophisticated
tion of population effects. Both tendencies are explanations of human behavior. It is underpinned
positively related to the magnitude of the nested by the notion of a perceived causal structure,
factor effect. which Harold Kelly described as being implicit in
the cognitive representation of attributions. The
Matthias Siemer perceived causal structure constitutes a temporally
ordered network of interconnected causes and
See also Cluster Sampling; Fixed-Effects Models;
effects. Properties of the structure include the fol-
Hierarchical Linear Modeling; Intraclass Correlation;
lowing: direction (past–future), extent (proximal–
Mixed-and Random-Effects Models; Multilevel
distal), patterning (simple–complex), components
Modeling; Random-Effects Models
of varying stability–instability, and features rang-
ing from actual to potential. The structure
Further Readings produced might be sparse or dense in nature,
Maxwell, S. E., & Delaney, H. D. (2004). Designing depending on the number of causal factors identi-
experiments and analyzing data. A model comparison fied. Network analysis comprises a group of tech-
perspective (2nd ed.). Mahwah, NJ: Lawrence niques developed in sociology and social
Erlbaum. anthropology, and it provides a method for gener-
Siemer, M., & Joormann, J. (2003). Power and measures ating and analyzing perceived causal networks,
of effect size in analysis of variance with fixed versus their structural properties, and the complex chains
random nested factors. Psychological Methods, 8, of relationships between causes and effects.
497–517.
Wampold, B. E., & Serlin, R. C. (2000). The consequence
of ignoring a nested factor on measures of effect size Network Analysis Techniques
in analysis of variance. Psychological Methods, 5;
425–433. Network analysis can be conducted using semi-
Zucker, D. M. (1990). An analysis of variance pitfall: structured interviews, diagram methods, and mat-
The fixed effects analysis in a nested design. rix methods. Although interviews provide detailed
Educational and Psychological Measurement, 50, individual networks, difficulties arise in that indi-
731–738. vidual structures cannot be combined, and causal
structures of different groups cannot be compared.
The diagram method involves either the spatial
arrangement of cards containing putative causes or
NETWORK ANALYSIS the participant directly drawing the structure.
Participants can both choose from a given set of
Network analysis elicits and models perceptions potential causal factors and incorporate other
of the causes of a phenomenon. Typically, respon- personally relevant factors into their network.
dents are provided with a set of putative causal In addition, the strength of causal paths can be
factors for a focal event and are asked to consider rated. Although these methods have the virtue of
896 Network Analysis

ensuring only the most important causal links investigation of all possible links, and as it does
are elicited, they might potentially oversimplify not rely on participants’ recall, it would be
respondents’ belief structures, often revealing only expec-ted to produce more reliable results.
sparse networks.
The matrix technique employs an adjacency
Applications
grid with the causes of a focal event presented
vertically and horizontally along its top and side. Network analysis has been applied to diverse areas
Participants rate the causal relationship for every to analyze belief structures. Domains that have
pairwise combination. Early studies used a binary been examined include lay understandings of
scale to indicate the presence/absence of causal social issues (e.g., loneliness, poverty), politics
links; however, this method does not reveal the (e.g., the 2nd Iraq war, September 11th), and more
strength of the causal links. Consequently, recent recently illness attributions for health problems
studies have used Likert scales whereby partici- (e.g., work-based stress, coronary heart disease,
pants rate the strength of each causal relationship. lower back pain, and obesity).
A criterion is applied to these ratings to establish The hypothetical network (Figure 1) illustrates
which of the resulting causal links should be some properties of network analysis. For example,
regarded as consensually endorsed and, therefore, the illness is believed to have three causes: stress,
contributing to the network. smoking, and family history. Both stress and smok-
Early studies adopted a minimum systems crite- ing are proximal causes, whereas family history is
rion (MSC), the value at which all causes are a more distal cause. In addition to a direct effect
included in the system, to determine the network of stress, the network shows a belief that stress
nodes. Accordingly, causal links are added hierar- also has an indirect effect on illness, as stress
chically to the network, in the order of mean causes smoking. Finally, there is a reciprocal rela-
strength, until the MSC is reached. It is generally tionship (bidirectional arrow) between the illness
accompanied by the cause-to-link ratio, which is and stress, such that stress causes the illness, and
the ratio of the number of extra links required to having the illness causes stress.
include a new cause in the network. Network con-
struction stops if this requirement is too high,
Limitations
reducing overall endorsement of the network. An
alternative criterion is inductive eliminative analysis There are several unresolved issues regarding the
(IEA), wherein every network produced when establishment of networks and the selection of cut-
working toward the MSC is checked for endorse- off points. Comparative network analysis studies
ment. Originally developed to deal with binary are necessary to compare and evaluate the differ-
adjacency matrices, networks were deemed con- ential effectiveness of the individual network anal-
sensual if endorsed by at least 50% of partici- ysis methods. The criteria for selection of cut-off
pants. However, the introduction of Likert scales points for the network, such as the MSC and cause
necessitated a modified form of IEA, whereby an to link, have also been criticized as atheoretical,
item average criterion (IAC) was adopted. The producing extremely large networks that represent
mean strength of a participant’s endorsement of an aggregate rather than a consensual solution.
all items on a network must be above the IAC, Although IEA resolves some of these issues,
which is usually set at 3 or 4 on a 5-point scale,
depending on the overall link strength. In early
Family
research, the diagrammatic networks produced
History
using these methods were topological, not spa-
Smoking
tial. However, recent studies have subjected the
matrices of causal ratings to multidimensional Stress
scaling analysis to determine the spatial structure Illness
of networks. Thus, proximal and distal effects
can be easily represented. The matrix method
has the advantage of ensuring the exhaustive Figure 1 Example of Network Diagram
Newman–Keuls Test and Tukey Test 897

producing networks that tend to be more consen- Tukey test is most commonly used in other disci-
sual, smaller, and easier to interpret, the cut-off plines. An advantage of the Tukey test is to keep
points (50% criterion and IAC) are established the level of the Type I error (i.e., finding a differ-
arbitrarily and, thus, might be contested. ence when none exists) equal to the chosen alpha
level (e.g., α ¼ :05 or α ¼ :01). An additional
Amy Brogan and David Hevey advantage of the Tukey test is to allow the compu-
tation of confidence intervals for the differences
See also Cause and Effect; Graphical Display of Data;
between the means. Although the Newman–Keuls
Likert Scaling
test has more power than the Tukey test, the exact
value of the probability of making a Type I error
Further Readings
of the Newman–Keuls test cannot be computed
Kelley, H. H. (1983). Perceived causal structures. In because of the sequential nature of this test. In
J. Jaspars, F. D. Fincham, & M. Hewstone (Eds.), addition, because the criterion changes for each
Attribution theory and research: Conceptual, level of the Newman–Keuls test, confidence inter-
developmental and social dimensions (pp. 343–369). vals cannot be computed around the differences
London: Academic Press. between means. Therefore, selecting whether to
Knoke, D., & Kuklinski, J. H. (1982). Network analysis. use the Tukey or Newman–Keuls test depends on
Beverly Hills, CA: Sage.
whether additional power is required to detect sig-
nificant differences between means.

NEWMAN–KEULS TEST Studentized Range and Student’s q


AND TUKEY TEST Both the Tukey and Newman–Keuls tests use
a sampling distribution derived by Willam Gosset
An analysis of variance (ANOVA) indicates (who was working for Guiness and decided to
whether several means come from the same popu- publish under the pseudonym of ‘‘Student’’ because
lation. Such a procedure is called an omnibus test, of Guiness’s confidentiality policy). This distribu-
because it tests the whole set of means at once tion, which is called the Studentized Range or
(omnibus means ‘‘for all’’ in Latin). In an ANOVA Student’s q; is similar to a t-distribution. It corre-
omnibus test, a significant result indicates that at sponds to the sampling distribution of the largest
least two groups differ from each other, but it does difference between two means coming from a set
not identify the groups that differ. So an ANOVA of A means (when A ¼ 2, the q distribution corre-
is generally followed by an analysis whose goal is sponds to the usual Student’s tÞ.
to identify the pattern of differences in the results. In practice, one computes a criterion denoted
This analysis is often performed by evaluating all qobserved , which evaluates the difference between the
the pairs of means to decide which ones show a sig- means of two groups. This criterion is computed as
nificant difference. In a general framework, this Mi ·  Mj ·
approach, which is called a pairwise comparison, qobserved ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 ; ð1Þ
is a specific case of an ‘‘a posteriori contrast analy- MSerror 1S
sis,’’ but it is specific enough to be studied in itself.
Two of the most common methods of pairwise where Mi and Mj are the group means being com-
comparisons are the Tukey test and the Newman– pared, MSerror is the mean square error from
Keuls test. Both tests are based on the ‘‘Studentized the previously computed ANOVA (i.e., this is the
range’’ or ‘‘Student’s q:’’ They differ in that the mean square used for the denominator of the omni-
Newman–Keuls test is a sequential test designed to bus F ratio), and S is the number of observations per
have more power than the Tukey test. group (the groups are assumed to be of equal size).
Choosing between the Tukey and Newman– Once the qobserved is computed, it is then com-
Keuls tests is not straightforward and there is no pared with a qcritical value from a table of critical
consensus on this issue. The Newman–Keuls test is values (see Table 1). The value of qcritical depends
most frequently used in psychology, whereas the on the α-level, the degrees of freedom ν ¼ N K;
898 Newman–Keuls Test and Tukey Test

Table 1 Table of Critical Values of the Studentized Range q


R ¼ Range (Number of Groups)
v2 2 3 4 5 6 7 8 9 10 12 14 16 18 20
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49 6.79 7.03 7.24 7.43 7.59
5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10 9.10 9.48 10.08 10.32 10.54
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16 6.43 6.66 6.85 7.02 7.17
4.95 5.92 6.54 7.01 7.37 7.68 7.94 8.10 8.37 8.71 9.00 9.24 9.46 9.65
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92 6.18 6.39 6.57 6.73 6.87
4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86 8.18 8.44 8.66 8.85 9.03
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74 5.98 6.19 6.36 6.51 6.64
4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49 7.78 8.03 8.23 8.41 8.57
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60 5.83 6.03 6.19 6.34 6.47
4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21 7.49 7.71 7.91 8.08 8.23
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49 5.71 5.90 6.06 6.20 6.33
4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99 7.25 7.47 7.65 7.81 7.95
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39 5.62 5.80 5.95 6.09 6.21
4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81 7.06 7.27 7.44 7.59 7.73
13 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32 5.53 5.71 5.86 6.00 6.11
4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.67 6.90 7.10 7.27 7.42 7.55
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25 5.46 5.64 5.79 5.92 6.03
4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54 6.77 6.96 7.13 7.27 7.40
15 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20 5.40 5.57 5.72 5.85 5.96
4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44 6.66 6.85 7.00 7.14 7.26
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15 5.35 5.52 5.66 5.79 5.90
4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35 6.56 6.74 6.90 7.03 7.15
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11 5.31 5.47 5.61 5.73 5.84
4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27 6.48 6.66 6.81 6.94 7.05
18 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07 5.27 5.43 5.57 5.69 5.79
4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20 6.41 6.58 6.73 6.85 6.97
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04 5.23 5.39 5.53 5.65 5.75
4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14 6.34 6.51 6.65 6.78 6.89
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01 5.20 5.36 5.49 5.61 5.71
4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09 6.29 6.45 6.59 6.71 6.82
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92 5.10 5.25 5.38 5.44 5.59
3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92 6.11 6.26 6.39 6.51 6.61
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82 5.00 5.15 5.27 5.38 5.48
3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76 5.93 6.08 6.20 6.31 6.41
40 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73 4.90 5.04 5.16 5.27 5.36
3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60 5.76 5.90 6.02 6.12 6.21
60 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65 4.81 4.94 5.06 5.15 5.24
3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45 5.60 5.73 5.84 5.93 6.02
120 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56 4.71 4.84 4.95 5.04 5.13
3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30 5.44 5.56 5.66 5.75 5.83
∞ 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.47 4.62 4.74 4.85 4.93 5.01
3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16 5.29 5.40 5.49 5.57 5.65
Note: Studentized range q distribution table of critical values for α = .05 and α = .01.
Newman–Keuls Test and Tukey Test 899

where N is the total number of participants and K difference implies not rejecting the null hypothesis
is the number of groups, and on a parameter R; for any other difference.
which is the number of means being tested. For If the null hypothesis is rejected for the largest
example, in a group of K ¼ 5 means ordered from difference, the two differences with a range of
smallest to largest, A  1 are examined. These means will be tested
with R ¼ A  1. When the null hypothesis for
M1 < M2 < M3 < M4 < M5 a given pair of means cannot be rejected, none of
the differences included in that difference will be
R ¼ 5 when comparing M5 with M1 ; however,
tested. If the null hypothesis is rejected, then the
R ¼ 3 when comparing M3 with M1 .
procedure is reiterated for a range of A  2 (i.e.,
R ¼ A  2). The procedure is reiterated until all
F range means have been tested or have been declared non-
significant by implication.
Some statistics textbooks refer to a pseudo-F It takes some experience to determine which
distribution called the ‘‘F range’’ or ‘‘Frange ,’’ rather comparisons are implied by other comparisons.
than the Studentized q distribution. The Frange can Figure 1 describes the structure of implication for
be computed easily from q using the following a set of 5 means numbered from 1 (the smallest) to
formula: 5 (the largest). The pairwise comparisons implied
q2 by another comparison are obtained by following
Frange ¼ : ð2Þ the arrows. When the null hypothesis cannot be
2
rejected for one pairwise comparison, then all the
comparisons included in it are crossed out so that
Tukey Test they are not tested.

For the Tukey test, qobserved (see Equation 1) is Example


computed between any pair of means that need to
be tested. Then, qcritical value is determined using An example will help describe the use of the
R ¼ total number of means. The qcritical value is Tukey and Newman–Keuls tests and Figure 1. We
the same for all pairwise comparisons. Using the will use the results of a (fictitious) replication
previous example, R ¼ 5 for all comparisons. of a classic experiment on eyewitness testimony
by Elizabeth F. Loftus and John C. Palmer. This
experiment tested the influence of question word-
Newman–Keuls Test ing on the answers given by eyewitnesses. The
authors presented a film of a multiple-car accident
The Newman–Keuls test is similar to the Tukey
to their participants. After viewing the film, parti-
test, except that the Newman–Keuls test is
cipants were asked to answer several specific ques-
a sequential test in which qcritical depends on the
tions about the accident. Among the questions,
range of each pair of means. To facilitate the
one question about the speed of the car was pre-
exposition, we suppose that the means are ordered
sented in five different versions:
from the smallest to the largest. Hence, M1 is the
smallest mean and MA is the largest mean. 1. Hit: About how fast were the cars going when
The Newman–Keuls test starts exactly like the they hit each other?
Tukey test. The largest difference between two 2. Smash: About how fast were the cars going
means is selected. The range of this difference is when they smashed into each other?
R ¼ A: A qobserved is computed using Equation 1,
3. Collide: About how fast were the cars going
and that value is compared with the critical value, when they collided with each other?
qcritical , in the critical values table using α, ν, and
R: The null hypothesis can be rejected if qobserved is 4. Bump: About how fast were the cars going
greater than qcritical . If the null hypothesis cannot when they bumped into each other?
be rejected, then the test stops here because not 5. Contact: About how fast were the cars going
rejecting the null hypothesis for the largest when they contacted each other?
900 Newman–Keuls Test and Tukey Test

M1. − M5. Table 3 Absolute Values of qobserved for the Data


A from Table 2
Experimental Group
M1. − M4. M2. − M5. A−1 M1 M2 M3 M4 M5
Contact Hit 1 Bump Collide Smash
30 35 38 41 46
M1. − M3. M2. − M4. M3. − M5. M1 ¼ 30 Contact 0 1.77 ns 2.83 ns 3.89 ns 5.66**
A−2 M2 ¼ 35 Hit 0 1.06 ns 2.12 ns 3.89 ns
M3 ¼ 38 Bump 0 1.06 ns 2.83 ns
M4 ¼ 41 Collide 0 1.77 ns
M1. − M2. M2. − M3. M3. − M4. M4. − M5. A−3 M5 ¼ 46 Smash 0
Notes: For the Tukey test, qobserved is significant at α ¼ :05
(or at the α ¼ :01 levelÞ if qobserved is larger than qcritical ¼
Figure 1 Structure of Implication of the Pairwise 4:04ðqcritical ¼ 4:93Þ:  p < :05:  p < :01:
Comparisons When A = 5 for the
Newman–Keuls Test
Table 4 Presentation of the Results of the Tukey Test
Notes: Means are numbered from 1 (the smallest) to 5 (the for the Data from Table 2
largest). The pairwise comparisons implied by another one
Experimental Group
are obtained by following the arrows. When the null
hypothesis cannot be rejected for one pairwise comparison, M1 M2 M3 M4 M5
then all the comparisons included in it can be crossed out to Contact Hit 1 Bump Collide Smash
omit them from testing. 30 35 38 41 46
M1 ¼ 30 Contact 0 5.00 ns 8.00 ns 11.00 ns 16.00**
Table 2 A Set of Data to Illustrate the Tukey and M2 ¼ 35 Hit 0 3.00 ns 6.00 ns 11.00 ns
Newman–Keuls Tests M3 ¼ 38 Bump 0 3.00 ns 8.00 ns
Experimental Group M4 ¼ 41 Collide 0 5.00 ns
Contact Hit Bump Collide Smash M5 ¼ 46 Smash 0
21 23 35 44 39 Notes: * p < .05. * * p < .01. The qobserved is greater than
20 30 35 40 44 qcriticalð5Þ and H0 is rejected for the largest pair.
26 34 52 33 51
46 51 29 45 47 from the previously calculated ANOVA is 80.00, the
35 20 54 45 50 value of qobserved for the difference between M1 and
13 38 32 30 45 M2 (i.e., ‘‘contact’’ and ‘‘hit’’) is equal to
41 34 30 46 39
30 44 42 34 51 M1  M2
42 41 50 49 39
qobserved ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

MSerror 1S
26 35 21 44 55
M1 M2 M3 M4 M5 35:00  30:00
¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1
Ma 30.00 35.00 38.00 41.00 46.00
80:00 10
Notes: S ¼ 10: MSerror ¼ 80:00:
5
¼ pffiffiffi
In our replication we used 50 participants (10 8
in each group); their responses are given in ¼ 1:77:
Table 2.
The values of qobserved are shown in Table 3. With
Tukey’s approach, each qobserved is declared signifi-
Tukey Test
cant at the α ¼ :05 level (or the α ¼ :01 level) if it
For the Tukey test, the qobserved values are com- is larger than the critical value obtained for this
puted between every pair of means using Equation 1. alpha level from the table with R ¼ 5 and
For example, taking into account that the MSerror ν ¼ N  K ¼ 45 degrees of freedom (45 is not in
Newman–Keuls Test and Tukey Test 901

the table so 40 is used instead). The qcriticalð5Þ;α¼:05 is Now we proceed to test the means with a range
equal to 4.04 and the qcriticalð5Þ;α¼:01 is equal to 4.93. of 4, namely the differences (M4  M1 Þ and
When performing pairwise comparisons, it is (M5  M2 Þ. With α ¼ :05; R ¼ 4 and 45 degrees
customary to report the table of differences between of freedom, qcriticalð4Þ ¼ 3:79: Both differences are
means with an indication of their significance (e.g., declared significant at the .05 level [qobservedð4Þ ¼
one star meaning significant at the .05 level, and 3.89 in both cases]. We then proceed to test the
two stars meaning significant at the .01 level). This comparisons with a range of 3. The value of qcritical
is shown in Table 4. is now 3.44. The differences (M3  M1 Þ and
(M5  M3 Þ, both with a qobserved of 2.83, are
declared nonsignificant. Furthermore, the difference
Newman–Keuls Test (M4  M2 Þ, with a qobserved of 2.12, is also declared
Note that for the Newman–Keuls test, the nonsignificant. Hence, the null hypothesis for these
group means are ordered from the smallest to the differences cannot be rejected, and all comparisons
largest. The test starts by evaluating the largest implied by these differences should be crossed out.
difference that corresponds to the difference That is, we do not test any difference with a range
between M1 and M5 (i.e., ‘‘contact’’ and ‘‘smash’’). of A  3 ½i:e:; ðM2  M1 Þ, (M3  M2 Þ; ðM4  M3 Þ,
For α ¼ :05, R ¼ 5 and ν ¼ N  K ¼ 45 degrees and (M5  M4 Þ. Because the comparisons with
of freedom, the critical value of q is 4.04 (using a range of 3 have already been tested and found to
the ν value of 40 in the table). This value is be nonsignificant, any comparisons with a range of
denoted as qcriticalð5Þ ¼ 4:04. The qobserved is com- 2 will consequently be declared nonsignificant as
puted from Equation 1 (see also Table 3) as they are implied or included in the range of 3 (i.e.,
the test has been performed implicitly).
M5  M1 As for the Tukey test, the results of the
qobserved ¼ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
  ¼ 5:66: ð3Þ
Newman–Keuls tests are often presented with
1
MSerror the values of the pairwise differences between the
S means and with stars indicating the significance

Critical values of q/nk

M1. − M5. A 4.04


5.66 **

M1. − M4. M2. − M5. A−1 3.79


3.89 * 3.89 *

M1. − M3. M2. − M4. M3. − M5. A−2 3.44


2.83 ns 2.12 ns 2.83 ns

M1. − M2. M2. − M3. M3. − M4. M4. − M5. A − 3 2.86

Figure 2 Newman–Keuls Test for the Data from a Replication of Loftus & Palmer (1974)
Note: The number below each range is the qobserved for that range.
902 Nominal Scale

Table 5 Presentation of the Results of the Newman– This form of scale does not require the use of
Keuls Test for the Data from Table 2 numeric values or categories ranked by class, but
Experimental Group simply unique identifiers to label each distinct
category. Often regarded as the most basic form
M1 M2 M3 M4 M5 of measurement, nominal scales are used to cate-
Contact Hit 1 Bump Collide Smash gorize and analyze data in many disciplines. His-
30 35 38 41 46 torically identified through the work of
*
M1 = 30 0 5.00 ns 8.00 ns 11.00 16.00** psychophysicist Stanley Stevens, use of this scale
Contact has shaped research design and continues to
M2 = 35 0 3.00 ns 6.00 ns 11.00* impact on current research practice. This entry
Hit presents key concepts, Stevens’s hierarchy of
M3 = 38 0 3.00 ns 8.00 ns measurement scales, and an example demon-
Bump strating the properties of the nominal scale.
M4 = 41 0 5.00 ns
Collide
M5 = 46 0 Key Concepts
Smash The nominal scale, which is often referred to as
Notes: * p < . 05. **
p < :01: the unordered categorical or discrete scale, is used
to assign individual datum into categories. Cate-
level (see Table 5). The comparison of Table 5 and gories in the nominal scale are mutually exclusive
Table 4 confirms that the Newman–Keuls test is and collectively exhaustive. They are mutually
more powerful than the Tukey test. exclusive because the same label is not assigned to
different categories and different labels are not
Herve Abdi and Lynne J. Williams assigned to events or objects of the same category.
Categories in the nominal scale are collectively
See also Analysis of Variance (ANOVA); Bonferroni
exhaustive because they encompass the full range
Procedure; Holm’s Sequential Bonferroni Procedure;
of possible observations so that each event or
Honestly Significant Difference (HSD) Test; Multiple
object can be categorized. The nominal scale holds
Comparison Tests; Pairwise Comparisons; Post Hoc
two additional properties. The first property is that
Comparisons; Scheffe Test
all categories are equal. Unlike in other scales,
such as ordinal, interval, or ratio scales, categories
Further Readings
in the nominal scale are not ranked. Each category
Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. has a unique identifier, which might or might not
(2009). Experimental design and analysis for be numeric, which simply acts as a label to distin-
psychology. Oxford, UK: Oxford University Press. guish categories. The second property is that the
Dudoit S., & van der Laan, M. (2008). Multiple testing
nominal scale is invariant under any transforma-
procedures with applications to genomics. New York:
tion or operation that preserves the relationship
Springer-Verlag.
Hochberg, Y., & Tamhane, A. C. (1987). Multiple between individuals and their identifiers.
comparison procedures. New York: Wiley. Some of the most common types of nominal
Jaccard, J., Becker, M. A., & Wood, G. (1984). Pairwise scales used in research include sex (male/female),
multiple comparison procedures: A review. marital status (married or common-law/widowed/
Psychological Bulletin, 94, 589–596. divorced/never-married), town of residence, and
questions requiring binary responses (yes/no).

NOMINAL SCALE
Stevens’s Hierarchy
A nominal scale is a scale of measurement used to In the mid-1940s, Harvard psychophysicist
assign events or objects into discrete categories. Stanley Stevens wrote the influential article ‘‘On
Nominal Scale 903

the Theory of Scales of Measurement,’’ pub- Table 1 Class List for Attendance on May 1
lished in Science in 1946. In this article, Stevens
Arrives by Attendance
described a hierarchy of measurement scales that
Student ID School Bus on May 1
includes nominal, ordinal, interval, and ratio
scales. Based on basic empirical operations, 001 Yes Absent
mathematical group structure, and statistical 002 Yes Absent
procedures deemed permissible, this hierarchy 003 Yes Present
has been used in textbooks worldwide and con- 004 Yes Absent
tinues to shape statistical reasoning used to 005 Yes Absent
guide the design of statistical software packages 006 Yes Absent
today. 007 Yes Absent
Under Stevens’s hierarchy, the primary, and 008 Yes Absent
arguably only, use for nominal scales is to deter- 009 Yes Absent
mine equality, that is, to determine whether the 010 No Present
object of interest falls into the category of inter- 011 No Present
est by possessing the properties identified for 012 No Present
that category. Stevens argued that no other 013 No Present
determinations were permissible, whereas others 014 No Absent
argued that even though other determinations 015 No Present
were permissible, they would, in effect, be mean-
ingless. A less argued property of the nominal
scale is that it is invariant under any transforma- The header row denotes the names of the vari-
tion. When taking attendance in a classroom, for able to be categorized and each row contains an
example, those in attendance might be assigned individual student record. Student 001, for exam-
1, whereas those who are absent might be ple, uses the school bus and is absent on the day in
assigned 2. This nominal scale could be replaced question. An appropriate nominal scale to catego-
by another nominal scale, where ‘‘1’’ is replaced rize class attendance would involve two categories:
by the label ‘‘present’’ and ‘‘2’’ is replaced by the absent or present. Note that these categories are
label ‘‘absent.’’ The transformation is considered mutually exclusive (a student cannot be both pres-
invariant because the identity of each individual ent and absent), collectively exhaustive (the cate-
is preserved. Given the limited determinations gories cover all possible observations), and each is
deemed permissible, Stevens proposed a restric- equal in value.
tion on analysis for nominal scales. Only basic Permissible statistics for the attendance variable
statistics are deemed permissible or meaningful would include frequency, mode, and contingency
for the nominal scale, including frequency, mode correlation. Using the previously provided class
as the sole measure of central tendency, and list, the frequency of those present is 6 and those
contingency correlation. Despite much criticism absent is 9. The mode, or the most common obser-
during the past 50 years, statistical software vation, is ‘‘absent.’’ Contingency tables could be
developed during the past decade has sustained constructed to answer questions about the popula-
the use of Stevens’s terminology and permissibil- tion. If, for example, a contingency table was used
ity in its architecture.
Table 2 Contingency Table for Attendance and
Arrives by School Bus
Example: Attendance in the Classroom Attendance
Again, attendance in the classroom can serve as Absent Present Total
an example to demonstrate some properties of the Arrives by Yes 8 1 9
nominal scale. After taking attendance, the infor- school bus No 1 5 6
mation has been recorded in the class list as illus-
Total 9 6 15
trated in Table 1.
904 Nomograms

to classify students using the two variables atten- intervention studies that have greater statistical
dance and arrives by school bus, then Table 2 power by targeting the enrollment of patients with
could be constructed. the highest risk of disease. In addition, nomograms
The results of the Fisher’s exact test for contin- rely on well-designed studies to validate the accu-
gency table analysis show that those who arrive racy of their predictions.
by school bus were significantly more likely to
be absent than those who arrive by some other
Deriving Outcome Probabilities
means. One might then conclude that the school
bus was late. All medical decisions are based on the predicted
Deborah J. Carr

See also Chi-Square Test; Frequency Table; Mode; "On the Theory of Scales of Measurement"; Ordinal Scale

Further Readings

Duncan, O. D. (1984). Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398–407.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47, 65–72.

NOMOGRAMS

Nomograms are graphical representations of equations that predict medical outcomes. Nomograms use a points-based system whereby a patient accumulates points based on the levels of his or her risk factors. The cumulative points total is associated with a prediction, such as the predicted probability of treatment failure in the future. Nomograms can improve research design, and well-designed research is crucial for the creation of accurate nomograms. Nomograms are important to research design because they can help identify the characteristics of high-risk patients while highlighting which interventions are likely to have the greatest treatment effects. Nomograms have demonstrated better accuracy than both risk grouping systems and physician judgment. This improved accuracy should allow researchers to design intervention studies that have greater statistical power by targeting the enrollment of patients with the highest risk of disease. In addition, nomograms rely on well-designed studies to validate the accuracy of their predictions.

Deriving Outcome Probabilities

All medical decisions are based on the predicted probability of different outcomes. Imagine a 35-year-old patient who presents to a physician with a 6-month history of cough. A doctor in Chicago might recommend a test for asthma, which is a common cause of chronic cough. If the same patient presented to a clinic in rural Africa, the physician might be likely to test for tuberculosis. Both physicians might be making sound recommendations based on the predicted probability of disease in their locale. These physicians are making clinical decisions based on the overall probability of disease in the population. These types of decisions are better than arbitrary treatment but treat all patients the same.

A more sophisticated method for medical decision making is risk stratification. Physicians will frequently assign patients to different risk groups when making treatment decisions. Risk group assignment will generally provide better predicted probabilities than estimating risk according to the overall population. In the previous cough example, a variety of other factors might impact the predicted risk of tuberculosis (e.g., fever, exposure to tuberculosis, and history of tuberculosis vaccine) that physicians are trained to explore. Most risk stratification performed in clinical practice is based on rough estimates that simply order patients into levels of risk, such as high risk, medium risk, or low risk. Nomograms provide precise probability estimates that generally make more accurate assessments of risk.

A problem with risk stratification arises when continuous variables are turned into categorical variables. Physicians frequently commit dichotomized cutoffs of continuous laboratory values to memory to guide clinical decision making. For example, blood pressure cutoffs are used to guide treatment decisions for hypertension. Imagine a new blood test called serum marker A. Research shows that tuberculosis patients with serum marker A levels greater than 50 are at an increased
accuracy should allow researchers to design marker A levels greater than 50 are at an increased
[Figure 1 appears here. It is a hypothetical points-based nomogram, marked "NOT FOR CLINICAL USE," with a Points axis (0–100) at the top; predictor axes for Fever (yes/no), Age, Cough/Hemoptysis (yes/no), Marker A (0–100), History of Tb Vaccine (yes/no), and Intubated (yes/no); a Total Points axis (0–400); and a Mortality Probability axis (.01–.975).]

INSTRUCTIONS: Locate the tic mark associated with the value of each predictor variable. Use a straight edge to find the corresponding points on the top axis for each variable. Calculate the total points by summing the individual points for all of the variables. Draw a vertical line from the value on the total points axis to the bottom axis in order to determine the hypothetical mortality probability from tuberculosis.

Figure 1   Hypothetical Nomogram for Predicting Mortality Risk in Tuberculosis (Tb)
risk for dying from tuberculosis. In reality, patients with a value of 51 might have similar risks compared with patients with a value of 49. In contrast, a patient with a value of 49 would be considered to have the same low risk as a patient whose serum level of marker A is 1. Nomograms allow for predictor variables to be maintained as continuous values while allowing numerous risk factors to be considered simultaneously. In addition, more complex models can be constructed that account for interactions.

Figure 1 illustrates a hypothetical nomogram designed to predict the mortality probability for patients with tuberculosis. The directions for using the nomogram are contained in the legend. One glance at the nomogram allows the user to determine quickly which predictors have the greatest potential impact on the probability. Fever has a relatively short axis and can contribute less than 25 possible points. In contrast, whether the patient required intubation has a much greater possible impact on the predicted probability of mortality.

Nomograms like the one shown in Figure 1 are created from the coefficients obtained by the statistical model (e.g., logistic regression or Cox proportional hazards regression) and are only as precise as the paper graphics. However, the coefficients used to create the paper-based nomogram can be used to calculate the exact probability. Similarly, the coefficients can be plugged into a Microsoft Excel spreadsheet or built into a computer interface that will automatically calculate the probability based on the user inputs.
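Because a nomogram is a graphical rendering of a fitted model, the same prediction can be computed directly from the coefficients. The sketch below assumes a logistic regression with purely hypothetical coefficient values; it is not the model behind Figure 1:

```python
# Hypothetical sketch: exact predicted probability from logistic regression
# coefficients, as a computer-based alternative to reading a paper nomogram.
# All coefficient values and the example patient are invented for illustration.
import math

coef = {"intercept": -4.0, "fever": 0.4, "age": 0.03,
        "marker_a": 0.02, "intubated": 2.1}
patient = {"fever": 1, "age": 60, "marker_a": 55, "intubated": 0}

# Linear predictor, then inverse logit to obtain the predicted probability
lp = coef["intercept"] + sum(coef[k] * patient[k] for k in patient)
probability = 1.0 / (1.0 + math.exp(-lp))
print(f"predicted mortality probability = {probability:.3f}")
```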
Why Are Nomograms Important to Research Design?

Nomograms provide an improved ability to identify the correct patient population for clinical

studies. The statistical power in prospective studies in the original data set and is available for subse-
with dichotomous clinical outcomes is derived quent random selection. The random selection of
from the number of events. (Enrolling excessive patients is continued until a dataset that is the
numbers of patients who do not develop the event same size as the original data has been formed.
of interest is an inefficient use of resources.) The model is applied (i.e., fit) to the bootstrap
For instance, let us suppose that a new controver- data and the model is graded on its ability to
sial medication has been developed for treating predict accurately the outcome of patients in
patients with tuberculosis. The medication either the original data (apparent accuracy) or
shows promise in animal studies, but it also seems the bootstrap sample (unbiased accuracy). Alter-
to carry a high risk of serious toxicity and even natively, the original data can be partitioned ran-
death in some individuals. Researchers might want domly. The model is fit to only a portion of the
to determine whether the medication improves original data and the outcome is predicted in the
survival in some patients with tuberculosis. The remaining subset. The bootstrap method has
nomogram in Figure 1 could be used to identify the added benefit that the sample size used for
which patients are at highest risk of dying, are the model fitting is not reduced.
most likely to benefit from the new drug, and
therefore should be enrolled in a drug trial. The
Evaluating Model Accuracy
nomogram could also be tested in this fashion
using a randomized clinical trial design. One arm As mentioned previously, the models’ predic-
of the study could be randomized to usual care, tions are evaluated on their ability to discrimi-
whereas the treatment arm is randomized to use nate between pairs of discordant patients
the Tb nomogram and then to receive the usual (patients who had different outcomes). The
care if the risk of mortality is low or the experi- resultant evaluation is called a concordance
mental drug if the risk of mortality is high. index or c-statistic. The concordance index is
simply the proportion of the time that the model
accurately assigns a higher risk to the patient
Validation
with the outcome. The c-statistic can vary from
The estimated probability obtained from nomo- 0.50 (equivalent to the flip of a coin) to 1.0 (per-
grams like the one in Figure 1 is generally much fect discrimination). The c-statistic provides an
more accurate than rough probabilities obtained objective method for evaluating model accuracy,
by risk stratification and should help both patients but the minimum c-statistic needed to claim that
and physicians make better treatment decisions. a model has good accuracy depends on the spe-
However, nomograms are only as good as the data cific condition and is somewhat subjective. How-
that were used for their creation. But, predicted ever, models are generally not evaluated in
probabilities can be graded (validated) on their isolation. Models can be compared head-to-head
ability to discriminate between pairs of patients either with one another or with physician judg-
who have different outcomes (discordant pairs). ment. In this case, the most accurate model can
The grading can be performed using either a valida- generally be identified as the one with the high-
tion data set that was created with the same data- est concordance index.
base used to create the prediction model (internal However, to grade a model fully, it is also
validation) or with external data (external valida- necessary to determine a model’s calibration. Cali-
tion). Ideally, a nomogram should be validated in bration is a measure of how close a model’s predic-
an external database before it is widely used in tion compares with the actual outcome and is
heterogeneous patient populations. frequently displayed by plotting the predicted
A validation data set using the original data can probability (or value) versus the actual proportion
be created either with the use of bootstrapping or with the outcome (or actual value). The concor-
by dividing the data set into random partitions. In dance index is simply a ‘‘rank’’ test that orders
the bootstrap method, a random patient is selected patients according to risk. A model can theoreti-
and a copy of the patient’s data is added to the val- cally have a great concordance index but poor cali-
idation data set. The patient’s record is maintained bration. For instance, a model might rank patients

appropriately while significantly overestimating or Kattan, M. W. (2003). Nomograms are superior to


underestimating the probability (or value) in all of staging and risk grouping systems for identifying high-
the patients. risk patients: Preoperative application in prostate
cancer. Current Opinion in Urology, 13; 111–116.

Conclusion
Designing efficient clinical research, especially
when designing prospective studies, relies on accu- NONCLASSICAL EXPERIMENTER
rate predictions of the possible outcomes. Nomo- EFFECTS
grams provide an opportunity for researchers to
easily identify the target population that will be Experimenter effects denominate effects where an
predicted to have the highest incidence of events outcome seems to be a result of an experimental
and will therefore keep the necessary sample size intervention but is actually caused by conscious or
low. Paper-based nomograms provide an excellent unconscious effects the experimenter has on how
medium for easily displaying risk probabilities and data are produced or processed. This could be
do not require a computer or calculator. The coef- through inadvertently measuring one group differ-
ficients used to construct the nomogram can be ently from another one, treating a group of people
used to create a computer-based prediction tool. or animals that are known to receive or to have
However, nomograms are only as good as the received the intervention differently compared with
data that were used in their creation, and no the control group, or biasing the data otherwise.
nomogram can provide a perfect prediction. Ulti- Normally, such processes happen inadvertently
mately, the best evaluation of a nomogram is made because of expectation and because participants
by validating the prediction accuracy of a nomo- sense the desired outcome in some way and hence
gram on an external data set and comparing comply or try to please the experimenter. Control
the concordance index with another prediction procedures, such as blinding (keeping participants
method that was validated using the same data. and/or experimenters unaware of a study’s critical
The validation of nomograms provides another aspects), are designed to keep such effects at bay.
opportunity for research design. Prospective stud- Whenever the channels by which such effects are
ies that collect all the predictor variables needed to transmitted are potentially known or knowable,
calculate a specific nomogram are ideal for deter- the effect is known as a classical experimenter
mining a nomogram’s accuracy. In addition, more effect. They normally operate through the
randomized controlled trials that compare nomo- known senses and very often by subliminal per-
gram-derived treatment recommendations versus ception. If an experiment is designed to exclude
standard of care are needed to promote the use of such classical channels of information transfer,
nomograms in medicine. because it is testing some claims of anomalous
Brian J. Wells and Michael Kattan cognition, and such differential effects of experi-
menters still happen, then these effects are called
See also Decision Rule; Evidence-Based Decision nonclassical experimenter effects, because there
Making; Probability, Laws of is no currently accepted model to understand
how such effects might have occurred in the first
place.
Further Readings
Harrell, F. E., Jr. (1996). Multivariate prognostic models:
Issues in developing models, evaluating assumptions Empirical Evidence
and accuracy, and measuring and predicting errors.
Statistics in Medicine, 15, 361.
This effect has been known in parapsychological
Harrell, F. E., Jr., Califf, R. M., Pryor, D. B., Lee, K. L., research for awhile. Several studies reported that
& Rosati, R. A. (1982). Evaluating the yield of parapsychological effects were found in some stud-
medical tests. Journal of the American Medical ies, whereas in other studies with the same experi-
Association, 247; 2543–2546. mental procedure, the effects were not shown. A

well-known experiment that has shown such a was no indication how the individual in question
nonclassical experimenter effect is one where could have potentially biased this blinded system,
a parapsychological researcher who had previously although such tampering, and hence a classical
produced replicable results with a certain experi- experimenter effect, could not be excluded.
mental setup invited a skeptical colleague into her The nonclassical experimenter effect has been
laboratory to replicate the experiment with her. shown repeatedly in parapsychological research.
They ran the same experiment together; half of the The source of this effect is unclear. If the idea
subjects were introduced to the experimental behind parapsychology that intention can affect
procedures by the enthusiastic experimenter and physical systems without direct interaction is at all
half by the skeptical experimenter. The experimen- sensible and worth any consideration, then there is
tal task was to influence a participant’s arousal no reason why the intention of an experimenter
remotely, measured by electrodermal activity, via should be left out of an experimental system in
intention only according to a random sequence. question. Furthermore, one could argue that if the
The two participants were separated from each intention of an experimental participant could
other and housed in shielded chambers. Otherwise, affect a system without direct interaction, then the
all procedures were the same. Although the enthu- intention of the experimenter could do the same.
siastic researcher could replicate the previous Strictly speaking, any nonclassical experimenter
results, the skeptical researcher produced null effect defies experimental control and calls into
results. This finding occurred even though there question the concept of experimental control.
was no way of transferring the information in the
experiment itself. This result was replicated in
another study in the skeptical researcher’s labora- Theoretical Considerations
tory, where again the enthusiastic researcher could When it comes to understanding such effects, they
replicate the findings but the skeptic could not. are probably among the strongest empirical facts
There are also several studies reported where more that point to a partially constructivist view of the
than one experimenter interacted with the partici- world that is also embedded in some spiritual
pants. If these studies are evaluated separately for worldviews such as in the Buddhist, Vedanta, or
each experimenter, it could be shown that some other mystical concepts. Here, our mental con-
experimenters find consistently significant results structs, intentions, thoughts, and wishes are not
whereas others do not. These are not only explor- only reflections of the world or idle mental opera-
atory findings because some of these studies could tions that might affect the world indirectly by
be repeated and the experimenter effects were being responsible for our future actions but also
hypothesized. could be viewed as constituents and creators of
Another experimental example are the so-called reality itself. This is difficult to understand within
memory-of-water effects, where Jacques Benve- the accepted scientific framework of the world.
niste, who was a French immunologist, had Hence, such effects and a constructivist concept of
claimed that water mixed with an immunogenic reality also point to the limits of the validity of our
substance and successively diluted in steps to current worldview. For such effects to be scientifi-
a point where no original molecules were present cally viable concepts, researchers need to envisage
would still have a measurable effect. Blinded a world in which mental and physical acts can
experiments produced some results, sometimes interact with each other directly. Such effects make
replicable and sometimes not. Later, he claimed us aware of the fact that we constantly partition
that such effects can also be digitized, recorded, our world into compartments and pieces that are
and played back via a digital medium. A definitive useful for certain purposes, for instance, for the
investigation could show that these effects only purpose of technical control, but do not necessar-
happened when one particular experimenter was ily describe reality as such. In this sense, they
present who was known to be indebted to remind us of the constructivist basis of science and
Benveniste and wanted the experiments to work. the whole scientific enterprise.
Although a large group of observers with special-
ists from different disciplines were present, there Harald Walach and Stefan Schmidt
See also Experimenter Expectancy Effect; Hawthorne Effect; Rosenthal Effect

Further Readings

Collins, H. M. (1985). Changing order. Beverly Hills, CA: Sage.
Jonas, W. B., Ives, J. A., Rollwagen, F., Denman, D. W., Hintz, K., Hammer, M., et al. (2006). Can specific biological signals be digitized? FASEB Journal, 20, 23–28.
Kennedy, J. E., & Taddonio, J. L. (1976). Experimenter effects in parapsychological research. Journal of Parapsychology, 40, 1–33.
Palmer, J. (1997). The challenge of experimenter psi. European Journal of Parapsychology, 13, 110–125.
Smith, M. D. (2003). The role of the experimenter in parapsychological research. Journal of Consciousness Studies, 10, 69–84.
Walach, H., & Schmidt, S. (1997). Empirical evidence for a non-classical experimenter effect: An experimental, double-blind investigation of unconventional information transfer. Journal of Scientific Exploration, 11, 59–68.
Watt, C. A., & Ramakers, P. (2003). Experimenter effects with a remote facilitation of attention focusing task: A study with multiple believer and disbeliever experimenters. Journal of Parapsychology, 67, 99–116.
Wiseman, R., & Schlitz, M. (1997). Experimenter effects and the remote detection of staring. Journal of Parapsychology, 61, 197–208.

NONDIRECTIONAL HYPOTHESES

A nondirectional hypothesis is a type of alternative hypothesis used in statistical significance testing. For a research question, two rival hypotheses are formed. The null hypothesis states that there is no difference between the variables being compared or that any difference that does exist can be explained by chance. The alternative hypothesis states that an observed difference is likely to be genuine and not likely to have occurred by chance alone. Sometimes called a two-tailed test, a test of a nondirectional alternative hypothesis does not state the direction of the difference; it indicates only that a difference exists. In contrast, a directional alternative hypothesis specifies the direction of the tested relationship, stating that one variable is predicted to be larger or smaller than the null value, but not both. Choosing a nondirectional or directional alternative hypothesis is a basic step in conducting a significance test and should be based on the research question and prior study in the area. The designation of a study's hypotheses should be made prior to analysis of data and should not change once analysis has been implemented.

For example, in a study examining the effectiveness of a learning strategies intervention, a treatment group and a control group of students are compared. The null hypothesis states that there is no difference in mean scores between the two groups. The nondirectional alternative hypothesis states that there is a difference between the mean scores of the two groups but does not specify which group is expected to be larger or smaller. In contrast, a directional alternative hypothesis might state that the mean of the treatment group will be larger than the mean of the control group. The null and the nondirectional alternative hypothesis could be stated as follows:

Null Hypothesis: H_0: \mu_1 - \mu_2 = 0.

Nondirectional Alternative Hypothesis: H_1: \mu_1 - \mu_2 \neq 0.

A common application of nondirectional hypothesis testing involves conducting a t test and comparing the means of two groups. After calculating the t statistic, one can determine the critical value of t that designates the null hypothesis rejection region for a nondirectional or two-tailed test of significance. This critical value will depend on the degrees of freedom in the sample and the desired probability level, which is usually .05. The rejection region will be represented on both sides of the probability curve because a nondirectional hypothesis is sensitive to either a larger or a smaller effect.

Figure 1 shows a distribution in which, at the 95% confidence level, the solid regions at the top and bottom of the distribution represent 2.5% accumulated probability in each tail. If the calculated value for t exceeds the critical value at either tail of the distribution, then the null hypothesis can be rejected.
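A minimal sketch of this nondirectional decision rule, using simulated group data (only the critical value for df = 30 matches Figure 1; everything else is illustrative), might look as follows:

```python
# Sketch: two-tailed (nondirectional) t test. The critical value for
# df = 30 and alpha = .05 corresponds to the 2.0423 reported in Figure 1;
# the group data are simulated for illustration only.
import numpy as np
from scipy import stats

alpha, df = 0.05, 30
critical_value = stats.t.ppf(1 - alpha / 2, df)   # approximately 2.0423

rng = np.random.default_rng(0)
treatment = rng.normal(52, 10, size=16)
control = rng.normal(48, 10, size=16)            # 16 + 16 - 2 = 30 df

t_stat, p_two_tailed = stats.ttest_ind(treatment, control)  # two-sided by default
reject_null = abs(t_stat) > critical_value
print(critical_value, t_stat, p_two_tailed, reject_null)
```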
[Figure 1 appears here: a t distribution (density from 0.00 to 0.45 on the y axis, t values from −4 to 4 on the x axis) with shaded bottom and top rejection regions in the two tails.]

Figure 1   Nondirectional t Test with df = 30
Note: Alpha = 0.05, critical value = 2.0423.

In contrast, the rejection region of a directional alternative hypothesis, or one-tailed test, would be represented on only one side of the distribution, because the hypothesis would choose a smaller or larger effect, but not both. In this instance, the critical value for t will be smaller because all 5% probability will be represented on one tail.

Much debate has occurred over the appropriate use of nondirectional and directional hypotheses in significance testing. Because the critical rejection values for a nondirectional test are higher than those for a directional test, it is a more conservative approach and the most commonly used. However, when prior research supports the use of a directional hypothesis test, a significant effect is easier to find and, thus, represents more statistical power. The researcher should state clearly the chosen alternative hypothesis and the rationale for this decision.

Gail Tiemann, Neal Kingston, Jie Chen, and Fei Gu

See also Alternative Hypotheses; Directional Hypothesis; Hypothesis

Further Readings

Marks, M. R. (1951). Two kinds of experiment distinguished in terms of statistical operations. Psychological Review, 60, 179–184.
Nolan, S., & Heinzen, T. (2008). Statistics for the behavioral sciences (1st ed.). New York: Worth Publishers.
Pillemer, D. B. (1991). One- versus two-tailed hypothesis tests in contemporary educational research. Educational Researcher, 20, 13–17.

NONEXPERIMENTAL DESIGNS

Nonexperimental designs include research designs in which an experimenter simply either describes a group or examines relationships between preexisting groups. The members of the groups are not randomly assigned and an independent variable is not manipulated by the experimenter; thus, no conclusions about causal relationships between variables in the study can be drawn. Generally, little attempt is made to control for threats to internal validity in nonexperimental designs. Nonexperimental designs are used simply to answer questions about groups or about whether group differences exist. The conclusions drawn from nonexperimental research are primarily descriptive in nature. Any attempts to draw conclusions about causal relationships based on nonexperimental research are made post hoc.

This entry begins by detailing the differences between nonexperimental and other research designs. Next, this entry discusses types of nonexperimental designs and the potential threats to internal validity that nonexperimental designs present. Last, this entry examines the benefits of using nonexperimental designs.

Differences Among Experimental, Quasi-Experimental, and Nonexperimental Designs

The crucial differences between the three main categories of research design lie in the assignment of participants to groups and in the manipulation of an independent variable. In experimental designs, members are randomly assigned to groups and the experimenter manipulates the values of the independent variable so that causal relationships might be established or denied. In quasi-experimental and nonexperimental designs, the groups already exist. The experimenter cannot randomly assign the participants to groups because either the groups were already established before the experimenter began his or her research or the groups are being established by someone other than the researcher for

a purpose other than the experiment. In quasi- also referred to as differential or ex post facto
experimental designs, the experimenter can still designs), correlational designs, developmental
manipulate the value of the independent variable, designs, one-group pretest–posttest designs, and
even though the groups to be compared are already finally posttest only nonequivalent group designs.
established. In nonexperimental designs, the groups
already exist and the experimenter cannot or does
Comparative Designs
not attempt to manipulate an independent variable.
The experimenter is simply comparing the existing In these designs, two or more groups are com-
groups based on a variable that the researcher did pared on one or more measures. The experimenter
not manipulate. The researcher simply compares might collect quantitative data and look for statis-
what is already established. Because he or she can- tically significant differences between groups, or
not manipulate the independent variable, it is the experimenter might collect qualitative data
impossible to establish a causal relationship and compare the groups in a more descriptive
between the variables measured in a nonexperimen- manner. Of course, the experimenter might also
tal design. use mixed methods and do both of the previously
A nonexperimental design might be used when mentioned strategies. Conclusions can be drawn
an experimenter would like to know about the about whether differences exist between groups,
relationship between two variables, like the fre- but the reasons for the differences cannot be
quency of doctor visits for people who are obese drawn conclusively. The study described previ-
compared with those who are of healthy weight or ously regarding obese, healthy weight, and under-
are underweight. Clearly, from both an ethical and weight people’s doctor visits is an example of
logistical standpoint, an experimenter could not a comparative design.
simply select three groups of people randomly
from a population and make one of the groups
Causal-Comparative, Differential, or Ex Post
obese, one of the groups healthy weight, and one
Facto Research Designs
of the groups underweight. The experimenter
could, however, find obese, healthy weight, and Nonexperimental research that is conducted
underweight people and record the number of doc- when values of a dependent variable are compared
tor visits the members of each of these groups have based on a categorical independent variable is
to look at the relationship between the variables of often referred to as a causal-comparative or a differ-
interest. This nonexperimental design might yield ential design. In these designs, the groups are deter-
important conclusions even though a causal rela- mined by their values on some preexisting
tionship could not clearly be established between categorical variable, like gender. This design is also
the variables. sometimes called ex post facto for that reason; the
group membership is determined after the fact.
After determining group membership, the groups
Types of Nonexperimental Designs
are compared on the other measured dependent
Although the researcher does not assign partici- variable. The researcher then tests for statistically
pants to groups in nonexperimental design, he or significant differences in the dependent variable
she can usually still determine what is measured between groups. Even though this design is referred
and when it will be measured. So despite the lack to as causal comparative, a causal relationship can-
of control in aspects of the experiment that are not be established using this design.
generally important to researchers, there are still
ways in which the experimenter can control the
Correlational Designs
data collection process to obtain interesting and
useful data. Various authors classify nonexperi- In correlational designs, the experimenter
mental designs in a variety of ways. In the sub- measures two or more nonmanipulated variables
sequent section, six types of frequently used for each participant to ascertain whether linear
nonexperimental designs are discussed: compara- relationships exist between the variables. The
tive designs, causal-comparative designs (which are researcher might use the correlations to conduct

subsequent regression analyses for predicting the variables possibly causing change over time. As
values of one variable from another. No conclu- with all nonexperimental designs, the researcher
sions about causal relationships can be drawn does not control the independent variable. How-
from correlational designs. It is important to note, ever, this design is generally used when a researcher
however, that correlational analyses might also be knows that an intervention of some kind will be
used to analyze data from experimental or quasi- taking place in the future. Thus, although the
experimental designs. researcher is not manipulating an independent var-
iable, someone else is. When the researcher knows
this will occur before it happens, he or she can col-
Developmental Designs
lect pretest data, which are simply data collected
When a researcher is interested in developmen- before the intervention. An example of this design
tal changes that occur over time, he or she might would be if a professor wants to study the impact
choose to examine the relationship between age of a new campus-wide recycling program that will
and other dependent variables of interest. Clearly, be implemented soon. The professor might want
the researcher cannot manipulate age, so develop- to collect data on the amount of recycling that
mental studies are often conducted using nonex- occurs on campus before the program and on atti-
perimental designs. The researcher might find tudes about recycling before the implementation
groups of people at different developmental stages of the program. Then, perhaps 6 months after the
or ages and compare them on some characteristics. implementation, the professor might want to col-
This is essentially a form of a differential or lect the same kind of data again. Although the pro-
causal-comparative design in that group member- fessor did not manipulate the independent variable
ship is determined by one’s value of a categorical of the recycling program and did not randomly
variable. Although age is not inherently a categori- assign students to be exposed to the program, con-
cal variable, when people are grouped together clusions about changes that occurred after the
based on categories of ages, age acts as a categori- program can still be drawn. Given the lack of
cal variable. manipulation of the independent variable and the
Alternatively, the researcher might investigate lack of random assignment of participants, the
one group of people over time in a longitudinal study is nonexperimental research.
study to examine the relationship between age
and the variables of interest. For example, the
Posttest-Only Nonequivalent
researcher might be interested in looking at how
Control Group Design
self-efficacy in mathematics changes as children
grow up. He or she might measure the math self- In this type of between-subjects design, two
efficacy of a group of students in 1st grade and nonequivalent groups of participants are com-
then measure that same group again in the 3rd, pared. In nonexperimental research, the groups are
5th, 7th, 9th, and 11th grades. In this case, chil- almost always nonequivalent because the partici-
dren were not randomly assigned to groups and pants are not randomly assigned to groups.
the independent variable (age) was not manipu- Because the researcher also does not control the
lated by the experimenter. These two characteris- intervention, this design is used when a researcher
tics of the study qualify it as nonexperimental wants to study the impact of an intervention that
research. already occurred. Given that the researcher cannot
collect pretest data, he or she collects posttest data.
However, to draw any conclusions about the post-
One-Group Pretest–Posttest Design
test data, the researcher collects data from two
In this within-subjects design, each individual in groups, one that received the treatment or inter-
a group is measured once before and once after vention, and one that did not. For example, if one
a treatment. In this design, the researcher is not is interested in knowing how participating in
examining differences between groups but examin- extracurricular sports during high school affects
ing differences across time in one group. The students’ attitudes about the importance of physi-
researcher does not control for possible extraneous cal fitness in adulthood, an experimenter might

survey students during the final semester of their that are between careers of their choosing opt for
senior year. The researcher could survey a group jobs in food service. Another possibility is that peo-
that participated in sports and a group that did ple with more education are more satisfied with
not. Clearly, he or she could not randomly assign their jobs, and people in academia tend to be the
students to participate or not participate. In this most educated, followed by those in business and
case, he or she also could not compare the atti- then those in food service. Thus, if the researcher
tudes prior to participating with those after partici- found that academics are the most satisfied, it
pating. Obviously with no pretest data and with might be because of their jobs, or it might be
groups that are nonequivalent, the conclusions because of their education. These proposed ratio-
drawn from these studies might be lacking in inter- nales are purely speculative; however, they demon-
nal validity. strate how internal validity might be threatened by
self-selection. In both cases, a third variable exists
that contributes to the differences between groups.
Threats to Internal Validity
Third variables can threaten internal validity in
Internal validity is important in experimental numerous ways with nonexperimental research.
research designs. It allows one to draw unambigu-
ous conclusions about the relationship between
Assignment Bias
two variables. When there is more than one possi-
ble explanation for the relationship between Like self-selection, the assignment of partici-
variables, the internal validity of the study is pants to groups in a nonrandom method can
threatened. Because the experimenter has little create a threat to internal validity. Although parti-
control over potential confounding variables in cipants do not always self-select into groups used
nonexperimental research, the internal validity can in nonexperimental designs, when they do not self-
be threatened in numerous ways. select, they are generally assigned to a group for
a particular reason by someone other than the
researcher. For example, if a researcher wanted
Self-Selection
to compare the vocabulary acquisition of students
The most predominant threat with nonexperi- exposed to bilingual teachers in elementary
mental designs is caused by the self-selection that schools, he or she might compare students taught
often occurs by the participants. Participants in by bilingual teachers with students taught by
nonexperimental designs often join the groups to monolingual teachers in one school. Students might
be compared because of an interest in the group or have been assigned to their classes for reasons
because of life circumstances that place them in related to their skill level in vocabulary related
those groups. For example, if a researcher wanted tasks, like reading. Thus, any relationship the
to compare the job satisfaction levels of people in researcher finds might be caused not by the expo-
three different kinds of careers like business, acade- sure to a bilingual teacher but by a third variable
mia, and food service, he or she would have to use like reading level.
three groups of people that either intentionally
chose those careers or ended up in their careers
History and Maturation
because of life circumstances. Either way, the
employees in those careers are likely to be in those In nonexperimental designs, an experimenter
different careers because they are different in other might simply look for changes across time in
ways, like educational background, skills, and a group. Because the experimenter does not con-
interests. Thus, if the researcher finds differences in trol the manipulation of the independent variable
job satisfaction levels, they might be because the or group assignment, both history and maturation
participants are in different careers or they might can affect the measures collected from the partici-
be because people who are more satisfied with pants. Some uncontrolled event (history) might
themselves overall choose careers in business, occur that might confuse the conclusions drawn
whereas those who do not consider their satisfac- by the experimenter. For example, in the job
tion in life choose careers in academia, and those satisfaction study above, if the researcher was

looking at changes in job satisfaction over time design to acquire as much information about the
and during the course of the study the stock mar- program’s effectiveness as possible rather than sim-
ket crashed, then many of those with careers in ply to not attempt to study the effectiveness of the
business might have become more dissatisfied with program.
their jobs because of that event. However, a stock Even though nonexperimental designs give the
market crash might not have affected academics experimenter little control over the experimental
and food service workers to the same extent that it process, the experimenter can improve the reliabil-
affected business workers. Thus, the conclusions ity of the findings by replicating the study. Addi-
that might be formed about the dissatisfaction tionally, one important feature of nonexperimental
of business employees would not have internal designs is the possibility of stronger ecological
validity. validity than one might obtain with a controlled,
Similarly, in the vocabulary achievement exam- experimental design. Given that nonexperimental
ple above, one would expect elementary students’ designs are often conducted with preexisting inter-
vocabularies to improve simply because of matura- ventions with ‘‘real people’’ in the ‘‘real world,’’
tion over the course of a school year. Thus, if the rather than participants in a laboratory, the find-
experimenter only examined differences in vocabu- ings are often more likely to be true to other real-
lary levels for students with bilingual teachers over world situations.
the course of the school year, then he or she might
draw erroneous conclusions about the relationship Jill H. Lohmeier
between vocabulary performance and exposure to
See also Experimental Design; Internal Validity; Quasi-
a bilingual teacher when in fact no relationship
Experimental Design; Random Assignment; Research
exists. This maturation of the students would be
Design Principles; Threats to Validity; Validity of
a threat to the internal validity of that study.
Research Conclusions

Benefits of Using Nonexperimental Designs


Nonexperimental designs are often relied on when Further Readings
a researcher has a question that requires a large DePoy, E., & Gilson, S. F. (2003). Evaluation practice.
group that cannot easily be assigned to groups. Pacific Grove, CA: Brooks/Cole.
They might also be used when the population of Gravetter, F. J., & Forzano, L. B. (2009). Research
interest is small and hard to access. Sometimes, methods for the behavioral sciences, Belmont, CA:
nonexperimental designs are used when a researcher Wadsworth Cengage Learning.
simply wants to know something about a popula- Johnson, B., & Christensen, L. B. (2007). Educational
tion but does not actually have a research research. Thousand Oaks, CA: Sage.
Kerlinger, F. N., & Lee, H. B. (2000). Foundations of
hypothesis.
behavioral research (4th ed.). Belmont, CA:
Although experimental designs are often used in Wadsworth Cengage Learning.
both the hard and social sciences, there are numer- McMillan, J. H., & Schumacher, S. (2006). Research in
ous occasions in social sciences in which it is simply education: Evidence-based inquiry (6th ed.). Boston:
not possible to use an experimental design. This is Pearson Education, Inc.
especially true in fields like education or program Payne, D. A. (1994). Designing educational project and
evaluation, where the programs to be studied can- program evaluations: A practical overview based on
not simply be offered to a random set of partici- research and experience. Dordrecht, the Netherlands:
pants from the population of interest. Rather, the Kluwer Academic Publishers.
educational program might already be in use in cer- Punch, K. F. (1998) Introduction to social research:
Quantitative & qualitative approaches. Thousand
tain schools or classrooms. The researcher might
Oaks, CA: Sage.
have to determine how conclusions can be drawn Smith, E. R., & Mackie, D. M. (2007). Social psychology
based on the already established samples that are (3rd ed.). Philadelphia: Psychology Press.
either participating or not participating in the inter- Spector, P. E. (1990). Research designs, series:
vention that is not controlled by the researcher. In Quantitative applications in the social sciences.
such cases, it is preferable to use a nonexperimental Newbury Park, CA: Sage.
NONPARAMETRIC STATISTICS

Nonparametric statistics refer to methods of measurement that do not rely on assumptions that the data are drawn from a specific distribution. Nonparametric statistical methods have been widely used in various kinds of research designs to make statistical inferences. In practice, when the normality assumption on the measurements is not satisfied, parametric statistical methods might provide misleading results. In contrast, nonparametric methods make much less stringent distributional assumptions on the measurements. They are valid methods regardless of the underlying distributions of the observations. Because of this attractive advantage, ever since the first introduction of nonparametric tests in the last century, many different types of nonparametric tests have been developed to analyze various types of experimental designs. Such designs encompass the one-sample design, two-sample design, randomized-block design, two-way factorial design, repeated measurements design, and higher-way layouts. The numbers of observations in the experimental conditions could be equal or unequal. The targeted inferences include the comparisons of treatment effects, the existence of interaction effects, ordered inferences of the effects, and multiple comparisons of the effects. All these methods share the same feature that instead of using the actual observed measurements, they use the ranked values to form the statistics. By discarding the actual measurements, the methods gain robustness to the underlying distributions and to the potential contamination of outliers. This gain of robustness comes only at the price of losing a relatively small amount of efficiency. In this entry, a brief review of the existing nonparametric methods is provided to facilitate their application in practical settings.

Tests for One or Multiple Populations

The Mann–Whitney–Wilcoxon (MWW) test is a nonparametric test to determine whether two samples of observations are drawn from the same distribution. The null hypothesis specifies that the two probability distributions are identical. It is one of the best-known nonparametric tests. It was first proposed by Frank Wilcoxon in 1945 for equal sample sizes, and it was later extended to arbitrary sample sizes by Henry B. Mann and Donald R. Whitney in 1947.

To obtain the statistic, the observations are first ranked without regard to which sample they are in. Then, for samples 1 and 2, the sums of ranks R_1 and R_2, respectively, are computed. The statistic takes the form

U = \min\left[ R_1 - \frac{N_1(N_1+1)}{2},\; R_2 - \frac{N_2(N_2+1)}{2} \right],

where N_1 and N_2 denote the sample sizes. For small samples, the distribution of the statistic is tabulated. However, for sample sizes greater than 20, the statistic can be normalized into

z = \frac{U - N_1 N_2 / 2}{\sqrt{N_1 N_2 (N_1 + N_2 + 1)/12}}.

The significance of the normalized statistic z can be assessed using the standard normal table. As a test to compare two populations, MWW is in spirit very similar to the parametric two-sample t test. In comparison, the parametric t test is more powerful if the data are drawn from a normal distribution. In contrast, if the distributional assumption is violated in practice, then MWW is more powerful than its parametric counterpart. In terms of efficiency, under the normality assumption, the efficiency of the MWW test is 95% of that of the t test, which implies that to achieve the same power, the t test will need 5% fewer data points than the MWW test. Under other, non-normal, especially heavy-tailed distributions, the efficiency of MWW could be much higher.
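A minimal sketch of this computation, with invented data, ranks the pooled observations, forms the rank sums, and applies the normal approximation; scipy's built-in routine is included only as a cross-check (it reports U for the first sample, which need not equal the minimum):

```python
# Sketch: Mann-Whitney-Wilcoxon U from rank sums, with the normal
# approximation described in the text. Sample values are illustrative only.
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

x = np.array([1.1, 2.4, 3.0, 4.8, 5.2, 6.9, 7.3, 8.8, 9.1, 10.5])
y = np.array([0.8, 1.9, 2.2, 2.9, 3.5, 4.1, 4.6, 5.0, 5.9, 6.4])
n1, n2 = len(x), len(y)

ranks = rankdata(np.concatenate([x, y]))          # rank the pooled sample
r1, r2 = ranks[:n1].sum(), ranks[n1:].sum()       # rank sums per sample
u = min(r1 - n1 * (n1 + 1) / 2, r2 - n2 * (n2 + 1) / 2)
z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

u_scipy, p = mannwhitneyu(x, y, alternative="two-sided")
print(u, z, u_scipy, p)
```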
For the case of two related samples or repeated measurements on a single sample, the Wilcoxon signed-rank test is a nonparametric alternative to the paired Student's t test. The null hypothesis to be tested is that the median of the paired differences is equal to zero. Let (X_1, Y_1), ..., (X_N, Y_N) denote the paired observations. The researcher computes the differences d_i = X_i - Y_i, i = 1, ..., N, and omits those differences with zero values. Then, the remaining differences are ranked without regard to sign. The researcher then computes W_+ and W_- as
the sums of the ranks corresponding to the positive and negative differences. If the alternative hypothesis specifies that the median is greater than, less than, or unequal to zero, then the test statistic will be W_+, W_-, or \min(W_-, W_+), respectively. For small samples with N less than 30, tables of critical values are available, whereas when N is large, a normal approximation can be used to assess the significance. Note that the Wilcoxon signed-rank test can be used directly on one sample to test whether the population median is zero or not.
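A short sketch of this paired procedure, again with invented values and scipy's implementation as a cross-check:

```python
# Sketch: Wilcoxon signed-rank test for paired data (values are illustrative).
import numpy as np
from scipy.stats import rankdata, wilcoxon

x = np.array([8.2, 7.1, 6.4, 9.0, 5.5, 7.7, 6.9, 8.4])
y = np.array([7.5, 7.3, 5.9, 8.1, 5.6, 6.8, 6.1, 8.0])

d = x - y
d = d[d != 0]                 # omit zero differences
r = rankdata(np.abs(d))       # rank |d| without regard to sign
w_plus = r[d > 0].sum()
w_minus = r[d < 0].sum()

stat, p = wilcoxon(x, y)      # two-sided by default
print(w_plus, w_minus, stat, p)
```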
For comparisons of multiple populations, the nonparametric counterpart of the analysis of variance (ANOVA) test is the Kruskal–Wallis k-sample test proposed by William H. Kruskal and W. Allen Wallis in 1952. Given independent random samples of sizes N_1, N_2, ..., N_k drawn from k populations, the null hypothesis is that all the k populations are identical and have the same median; the alternative hypothesis is that at least one of the populations has a median different from the others. Let N denote the total number of measurements in the k samples, N = \sum_{i=1}^{k} N_i. Let R_i denote the sum of the ranks associated with the ith sample. It can be shown that the grand mean rank is (N+1)/2, whereas the sample mean rank for the ith sample is R_i/N_i. The test statistic takes the form

\frac{12}{N(N+1)} \sum_{i=1}^{k} N_i \left( \frac{R_i}{N_i} - \frac{N+1}{2} \right)^2 .

The term R_i/N_i - (N+1)/2 measures the deviation of the ith sample rank mean away from the grand rank mean. The term 12/[N(N+1)] is the inverse of the variance of the total summation of ranks, and therefore it serves as a standardization factor. When N is fairly large, the asymptotic distribution of the Kruskal–Wallis statistic can be approximated by the chi-squared distribution with k - 1 degrees of freedom.
squared distribution with k  1 degrees of freedom.
where i ¼ 1; . . . ; I and j ¼ 1; . . . ; J index levels
Tests for Factorial Design for factors
P A and B, respectively, and n ¼ 1; . . . ;
nij ; N ¼ ij nij : We wish to test the hypothesis:
Test for Main Effects
H0 : βj ¼ 08j versus H1 : βj 6¼ 0 for some j: To
In practice, data are often generated from address the problem of unbalance in designs, let us
experiments with several factors. To assess the examine the composition of a traditional rank.
Nonparametric Statistics 917

Define the function uðxÞ ¼ 1 if x ≥ 0; and in which the general inverse of the covariance
uðxÞ ¼ 0 if x < 0 and note that matrix is employed. The statistic TM is invariant
with respect to choices of the general inverses.
nij
XXX When the design is balanced, the test statistic is
Rijn ¼ uðXijn  Xi0 j0 n0 Þ: equivalent to the Hora–Conover statistic. This sta-
0 0 0
i j n
tistic TM converges to a central χ2J  1 as N → ∞:
Thus, the overall rankings do not adjust for
Test for Nested Effects
different sample sizes in unbalanced designs. To
address this problem, we define the notion of Often, practitioners might speculate that the dif-
a weighted rank. ferent factors might not act separately on the
response and interactions might exist between the
Definition factors. In light of such a consideration, one could
study an unbalanced two-way layout with an inter-
Let  ¼ Xijn ; i ¼ 1; . . . ; I; j ¼ 1; . . . ; J; n ¼ 1;
. . . ; nij g be a collection of random variables. The action effect. Let Xijn ði ¼ 1; .P . . ; I;
P j ¼ 1; . . . ; J;
weighted rank of Xijn within this set is n ¼ 1; . . . ; nij Þ be a set of N ¼ i j nij indepen-
" # dent random variables with the model
 NX 1 X
Rijn ¼ uðXijn  Xi0 j0 n0 Þ ; Xijn ¼ θ þ αi þ βj þ γ ij þ εijn ; ð2Þ
IJ 0 0 ni0 j0 0
ij n
X where i and j index levels
where N ¼ nij : P for factors
P APand B,
respectively. Assume i αi ¼ j βj ¼ i γ ij ¼
ij P
γ
j ij ¼ 0 and ε ijn are independent and identically
Define SN* ¼ ½SN* ðjÞ; j ¼ 1; . . . ; J to be a distributed (i.i.d.) random variables with absolute
vector of weighted linear rank statistics with continuous cumulative distribution function (cdf)
P Pnij
components SN ðjÞ ¼ IJ1 i n1 n¼1 Rijn : Let F. Let δij ¼ βj þ γ ij : To test for nested effect, we
ij
* P consider testing H0 : δij ¼ 0, 8i and j; versus
SN ¼ 1J j SN* ðjÞ: Denote the covariance matrix of H1 : δij 6¼ 0; for some i and j: The nested effect can
SN* as Σ ¼ ðσ b;b0 Þ; with b; b0 ¼ 1; . . . ; J: To estimate be viewed as the combined overall effect of the
Σ; We construct a variable Cbijn treatment either through its own main effect or
through its interaction with the block factor.
1 The same technique of using weighted rank can
Cbijn ¼  ðRijn =NÞ; j 6¼ b be applied in testing for nested effects. Define
IJ2 ρ ij
J1 
ð1Þ SN* ði; jÞ ¼ 1=nij P R * and let SN* be the IJ vector
ðR =NÞ; j ¼ b · n ijn
IJ2 ρib ibn of ½SN* ði; jÞ; 1 ≤ i ≤ I; 1 ≤ j ≤ J: Construct a contrast
0 P P P matrix B ¼ II  ðIJ  1J JJ Þ; such that the ij element
Let σ^ N ðb; b Þ ¼ i j n  ðCbijm  C  b Þ2 ; and P
0 P P P 0
ij ·
0
of BSN* is SN* ði; jÞ  1J Jb¼1 ði; bÞ: Let Γ denote the
b  b b
σ^ N ðb; b Þ ¼ i j n0 ðCijn  Cij · ÞðCijn  C  b Þ:
0
ij ·
0
covariance matrix of SN* : To facilitate the estima-
Let Σ ^ N be the J × J matrix of ½^σ N ðb; b Þ; b; b ¼ tion of Γ; we define the following variables:
*
1; . . . ; J: The fact that converges to HðXijn Þ
Rijn =N
almost surely leads to the fact that under H0 ; 8j; Cði;jÞ ðXabn Þ ¼
8
N ðΣN  ΣN Þ → 0 a.s. elementwise.
1 ^ nij
>
>
> N X
Construct a contrast matrix A ¼ IJ  1J JJ : The >
> uðXabn Xijk Þ for ða; bÞ 6¼ ði; jÞ:
< IJnab nij k¼1
generalized Hora–Conover statistic proposed by
n0 0
Xin Gao and Mayer Alvo for the main effects in >
> N P 1 X ij
>
>
unbalanced designs takes the form : lIJnij 0 0 n 0 0
> uðXijn  Xi0 j0 n0 Þ for ða; bÞ ¼ ði; jÞ
ði ;j Þði;jÞ i j 0
n ¼1
0
TM ¼ ^ N A0 Þ ðAS * Þ;
ðASN* Þ ðAΣ ð3Þ
N

Let Γ^ N be the IJ matrix with elements ordinal data. Compared with the classic ANOVA
X models, this nonparametric framework is different
0 0
γ^ N ½ði; jÞ; ði ; j Þ ¼ ðCði;jÞ ÞðXabn Þ in two aspects: First, the normality assumption is
a;b;n relaxed; second, it not only includes the commonly
 ði;jÞ
0 0 0 0
 ði ;j Þ ðXab Þ; used location models but also encompasses other
C ½Xab · ½Cði ;j Þ ðXabn Þ  C
arbitrary models with different cells having differ-
 ði;jÞ ðXab · Þ ¼ 1 P Cði;jÞ ðXabn Þ: It can be ent distributions. Under this nonparametric setting,
where C the hypotheses can be formulated in terms of lin-
pffiffiffiffiffi na b n
proved that 1= N ð^ γ N  γ N Þ → 0 almost surely ear contrasts of the distribution functions. Accord-
elementwise. The proposed test statistic TN for ing to Akritas and Arnold’s method, Fij can be
nested effects takes the form decomposed as follows:
0
^ N B0 Þ ðBS * Þ:
ðBSN* Þ ðBΓ Fij ðyÞ ¼ MðyÞ þ Ai ðyÞ þ Bj ðyÞ þ Cij ðyÞ;
N
P P P
Under H0 : δij ¼ 0, 8i, j, the proposed statistic where i Ai ¼ j Bj ¼ 0; and i Cij ¼ 0; for all j;
P
TN converges to a central χ2IðJ1Þ as N → ∞: and j Cij ¼ 0; for all i: It follows that
M ¼ F · · ; Ai ¼ Fi ·  M; Bj ¼ F · j  M; and Cij ¼
Tests for Pure Nonparametric Models Fij  Fi ·  F · j þ M; where the subscript ‘‘ · ’’ denotes
summing over all values of the index. Denote the
The previous discussion has been focused on the treatment factor as factor A and the block factor as
linear model with the error distribution unspeci- factor B. The overall nonparametric hypotheses
fied. To reduce the model assumption, Michael of no treatment main effects and no treatment
G. Akritas, Steven F. Arnold, and Edgar Brunner simple factor effects are specified as follows:
have proposed a nonparametric framework in
which the structures of the designs are no longer H0 ðAÞ : Fi ·  F · · ¼ 0; 8i ¼ 1; . . . ; I;
restricted to linear location models. The nonpara-
H0 ðA|BÞ : Fij  F · j ¼ 0; 8i ¼ 1; . . . ; I:; 8j
metric hypotheses are formulated in terms of linear
contrasts of normalized distribution functions. ¼ 1; . . . ; J:
One advantage of the nonparametric hypotheses is
that the parametric hypotheses in linear models The hypothesis H0 ðA|BÞ implies that the
are implied by the nonparametric hypotheses. Fur- treatment has no effect on the response either
thermore, the nonparametric hypotheses are not through the main effects or through the interaction
restricted to continuous distribution functions and effects.
therefore any models with discrete observations This framework especially accommodates the
might also be included in this setup. analysis of interaction effects in a unified manner. In
Under this nonparametric setup, the response literature, testing interactions using ranking methods
variables in a two-way unbalanced layout with I have been a controversial issue for a long time.
treatments and J blocks can be described by the The problem pertaining to the analysis of interac-
following model: tion effects is because the interaction effects based
on cell means can be artificially removed or intro-
Xijn : Fij ðxÞ; i ¼ 1; . . . ; I; j ¼ 1; . . . ; J; duced after certain nonlinear transformations. As
ð2Þ rank statistics are invariant to nonlinear monotone
n ¼ 1; . . . ; nij ;
transformations, they cannot be used to test for
hypotheses that are not invariant to monotone
where Fij ðxÞ ¼ 12 ½Fijþ ðxÞ þ Fij ðxÞ denotes the
transformations. To address this problem, Akritas
normalized-version of the distribution function, and Arnold proposed to define nonparametric inter-
Fijþ ðxÞ ¼ PðXijn ≤ xÞ denotes the right continuous action effects in terms of linear contrasts of the

version, and Fijn ðxÞ ¼ PðXijn < xÞ denotes the left distribution functions. Such nonparametric formula-
continuous version. The normalized version of the tion of interaction effects is invariant to monotone
distribution function accommodates both ties and transformations. The nonparametric hypothesis of
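To make the last point concrete, the following sketch (an illustrative toy example; the 2 × 2 cell means and the exponential transformation are invented here and are not taken from Akritas and Arnold's work) shows how a monotone transformation can introduce an interaction between cell means that is absent on the original scale.

```python
import numpy as np

# Cell means of a 2 x 2 design with NO interaction on the original scale:
# the effect of factor B (+2) is the same at both levels of factor A.
means = np.array([[1.0, 3.0],
                  [4.0, 6.0]])

def interaction_contrast(m):
    """Classical interaction contrast of cell means:
    m[0,0] - m[0,1] - m[1,0] + m[1,1]; zero means additivity."""
    return m[0, 0] - m[0, 1] - m[1, 0] + m[1, 1]

print(interaction_contrast(means))          # 0.0  -> no interaction of cell means
print(interaction_contrast(np.exp(means)))  # != 0 -> interaction appears after exp()
print(interaction_contrast(np.log(means)))  # != 0 -> and after log()
```

Because a monotone transformation leaves the ranks of the observations unchanged, a rank statistic cannot distinguish the two situations, which is why the nonparametric interaction hypothesis is instead defined through linear contrasts of the distribution functions.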
The nonparametric hypothesis of no interaction assumes the additivity of the distribution functions and is defined as follows:

$$H_0(AB)\!: F_{ij} - \bar{F}_{i\cdot} - \bar{F}_{\cdot j} + \bar{F}_{\cdot\cdot} = 0, \quad \forall i = 1, \ldots, I; \; \forall j = 1, \ldots, J.$$

This implies that the distribution in the (i, j)th cell, $F_{ij}$, is a mixture of two distributions, one depending on i and the other depending on j, and that the mixture parameter is the same for all (i, j). It is noted that all the nonparametric hypotheses bear representations analogous to their parametric counterparts, except that the parametric means are replaced by the distribution functions. Furthermore, all the nonparametric hypotheses are stronger than their parametric counterparts.

As no parameters are involved in the general model in Equation 4, the distribution functions $F_{ij}(x)$ can be used to quantify the treatment main effects, the treatment simple factor effects, and the interaction effects. To achieve this goal, Brunner and M. L. Puri considered the unweighted relative effects, which take the form

$$\pi_{ij} = \int H^{*} \, dF_{ij}, \quad i = 1, \ldots, I; \; j = 1, \ldots, J, \qquad (5)$$

where $H^{*}(x) = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} F_{ij}(x)$ is the average distribution function in the experiment. Denote $F = (F_{11}, \ldots, F_{IJ})'$ as the vector of the distribution functions. In general, the measure $\pi_{ij}$ should be interpreted as the probability that a random variable generated from $F_{ij}$ will tend to be larger than a random variable generated from the average distribution function $H^{*}$. In the special case of shift models, or by assuming noncrossing cumulative distribution functions, the relative effects $\pi_{ij}$ define a stochastic order by $F_{ij} <, =, > H^{*}$ according as $\pi_{ij} <, =, > \frac{1}{2}$. In practice, the definition of a relative effect is particularly convenient for ordinal data. As differences of means can be defined only on metric scales, the parametric approach of comparing two treatments based on means is not applicable on ordinal scales. However, the points of an ordinal scale can be ordered by size, which can be used to form the estimate of the nonparametric relative effects.

Let $\pi = (\pi_{11}, \ldots, \pi_{IJ})' = \int H^{*} \, dF$ denote the vector of the relative effects. The relative effects $\pi_{ij}$ can be estimated by replacing the distribution functions $F_{ij}(x)$ by their empirical counterparts

$$\hat{F}_{ij}(x) = \frac{1}{2}\left[ \hat{F}_{ij}^{+}(x) + \hat{F}_{ij}^{-}(x) \right] = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} c(x - X_{ijk}),$$

where $c(u) = \frac{1}{2}\left[ c^{+}(u) + c^{-}(u) \right]$ denotes the indicator function with $c^{+}(u) = 0$ or 1 depending on whether $u < 0$ or $u \ge 0$, and $c^{-}(u) = 0$ or 1 depending on whether $u \le 0$ or $u > 0$. The vector of the empirical distribution functions is denoted by $\hat{F} = (\hat{F}_{11}, \ldots, \hat{F}_{IJ})'$. The average empirical distribution function is denoted as $\hat{H}^{*}(x) = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} \hat{F}_{ij}(x)$. Then the vector of the unweighted relative effects is estimated unbiasedly by

$$\hat{\pi} = \int \hat{H}^{*} \, d\hat{F} = \frac{1}{N} \bar{R}^{*},$$

where $N = \sum_i \sum_j n_{ij}$, $\bar{R}^{*} = (\bar{R}^{*}_{11\cdot}, \ldots, \bar{R}^{*}_{IJ\cdot})'$, $\bar{R}^{*}_{ij\cdot} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} R^{*}_{ijk}$, and

$$R^{*}_{ijk} = \frac{N}{IJ} \sum_{i'=1}^{I} \sum_{j'=1}^{J} \frac{1}{n_{i'j'}} \sum_{k'=1}^{n_{i'j'}} c(X_{ijk} - X_{i'j'k'}).$$

The latter is the weighted rank of the observation $X_{ijk}$ among all the observations proposed by Sebastian Domhof, Gao, and Alvo.

Focusing on the problem of hypothesis testing in unbalanced two-way layouts, consider $H_0\!: CF = 0$ versus $H_1\!: CF \ne 0$, with C being an $IJ \times IJ$ contrast matrix. It can be shown that tests of main effects, nested effects, or interaction effects can all be expressed in this general form. Assume that as $\min\{n_{ij}\} \to \infty$, $\lim_{N \to \infty} n_{ij}/N = \lambda_{ij}$, and $\sigma^2_{ij} = \mathrm{var}[H^{*}(X_{ijk})] > 0$ for all i and j. Then an application of the Lindeberg–Feller theorem yields $\sqrt{N}(C\hat{\pi}) \xrightarrow{d} N(0, CVC')$, with $V = \mathrm{diag}(\lambda_{11}^{-1}\sigma^2_{11}, \ldots, \lambda_{IJ}^{-1}\sigma^2_{IJ})$. The asymptotic variances $\sigma^2_{ij}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, can be estimated from the pseudo-ranks as

$$\hat{\sigma}^2_{ij} = \frac{1}{N^2(n_{ij}-1)} \sum_{k=1}^{n_{ij}} \left( R^{*}_{ijk} - \bar{R}^{*}_{ij\cdot} \right)^2,$$

and the estimated covariance matrix takes the form $\hat{V} = \mathrm{diag}(\frac{1}{\lambda_{11}}\hat{\sigma}^2_{11}, \ldots, \frac{1}{\lambda_{IJ}}\hat{\sigma}^2_{IJ})$. The fact that $R^{*}_{ijk}/N$ converges to $H^{*}(X_{ijk})$ almost surely leads to the result that $\hat{V} - V \to 0$ almost surely elementwise. Therefore, we consider the test statistic

$$T = N \hat{\pi}' C' (C\hat{V}C')^{-} C\hat{\pi},$$

where $(C\hat{V}C')^{-}$ denotes the generalized inverse of $(C\hat{V}C')$. According to Slutsky's theorem, because $\hat{V}$ is consistent, we have $T \xrightarrow{d} \chi^2_f$, where the degrees of freedom $f = \mathrm{rank}(C)$.
Conclusions

Nonparametric methods have wide applicability in the analysis of experimental data because of their robustness to the underlying distributions and their insensitivity to potential outliers. In this entry, various methods to test for different effects in a factorial design have been reviewed. It is worth emphasizing that, in situations where the normality assumption is violated, nonparametric methods are powerful alternative approaches for making sound statistical inferences from experimental data.

Xin Gao

See also Distribution; Normal Distribution; Nonparametric Statistics for the Behavioral Sciences; Null Hypothesis; Research Hypothesis

Further Readings

Akritas, M. G., & Arnold, S. F. (1994). Fully nonparametric hypotheses for factorial designs I: Multivariate repeated measures designs. Journal of the American Statistical Association, 89, 336–343.
Akritas, M., Arnold, S., & Brunner, E. (1997). Nonparametric hypotheses and rank statistics for unbalanced factorial designs. Journal of the American Statistical Association, 92, 258–265.
Akritas, M. G., & Brunner, E. (1997). A unified approach to rank tests in mixed models. Journal of Statistical Planning and Inference, 61, 249–277.
Brunner, E., Puri, M. L., & Sun, S. (1995). Nonparametric methods for stratified two-sample designs with application to multi-clinic trials. Journal of the American Statistical Association, 90, 1004–1014.
Brunner, E., & Munzel, U. (2002). Nichtparametrische Datenanalyse. Heidelberg, Germany: Springer-Verlag.
Conover, W. J., & Iman, R. L. (1976). On some alternative procedures using ranks for the analysis of experimental designs. Communications in Statistics, A5, 1349–1368.
Domhof, S. (2001). Nichtparametrische relative Effekte (Unpublished dissertation). University of Göttingen, Göttingen, Germany.
Gao, X., & Alvo, M. (2005). A unified nonparametric approach for unbalanced factorial designs. Journal of the American Statistical Association, 100, 926–941.
Hora, S. C., & Conover, W. J. (1984). The F statistic in the two-way layout with rank-score transformed data. Journal of the American Statistical Association, 79, 668–673.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583–621.
Mann, H., & Whitney, D. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18, 50–60.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1, 80–83.

NONPARAMETRIC STATISTICS FOR THE BEHAVIORAL SCIENCES

Sidney Siegel (January 4, 1916–November 29, 1961) was a psychologist trained at Stanford University. He spent nearly his entire career as a professor at Pennsylvania State University. He is known for his contributions to nonparametric statistics, including the development, with John Tukey, of the Siegel–Tukey test, a test for differences in scale between groups. Arguably, he is most well known for his book, Nonparametric Statistics for the Behavioral Sciences, the first edition of which was published by McGraw-Hill in 1956. After Siegel's death, a second edition was published (1988), adding N. John Castellan, Jr., as coauthor. Nonparametric Statistics for the Behavioral Sciences is the first text to provide a practitioner's introduction to nonparametric statistics. By its copious use of examples and its straightforward "how to" approach to the most frequently used nonparametric tests, this text was the first accessible introduction to nonparametric statistics for the nonmathematician. In that sense, it represents an important step forward in the analysis and presentation of non-normal data, particularly in the field of psychology.

The organization of the book is designed to assist the researcher in choosing the correct nonparametric test. After the introduction, the second chapter introduces the basic principles of hypothesis testing, including the definitions of the null and alternative hypotheses, the size of the test, Type I and Type II errors, power, sampling distributions, and the decision rule. Chapter 3 describes the factors that influence the choice of the correct test.
After explaining some common parametric assumptions and the circumstances under which nonparametric tests should be used, the text gives a basic outline of how the proper statistical test should be chosen. Tests are distinguished from one another in two important ways. First, tests are distinguished by their capability of analyzing data of varying levels of measurement. For example, the χ² goodness-of-fit test can be applied to nominal data, whereas the Kolmogorov–Smirnov test requires at least the ordinal level of measurement. Second, tests are distinguished in terms of the type of samples to be analyzed. For example, two-sample paired tests are distinguished from tests applicable to k independent samples, which are in turn distinguished from tests of correlation, and so on. Tests included in the text include the following: the binomial test, the sign test, the signed-rank test, tests for data displayed in two-way tables, the Mann–Whitney U test, the Kruskal–Wallis test, and others. Also included are extensive tables of critical values for the various tests discussed in the text.

Because nonparametric tests make fewer assumptions than parametric tests, they are generally less powerful than the parametric alternatives. The text compares the various tests presented with their parametric analogues in terms of power efficiency. Power efficiency is defined as the percent decrease in sample size required for the parametric test to achieve the same power as that of the nonparametric test when the test is performed on data that do, in fact, satisfy the assumptions of the parametric test.

This work is important because it seeks to present nonparametric statistics in a way that is "completely intelligible to the reader whose mathematical training is limited to elementary algebra" (Siegel, 1956, p. 4). It is replete with examples to demonstrate the application of these tests in contexts that are familiar to psychologists and other social scientists. The text is organized so that the user, knowing the specific level of measurement and type(s) of samples being analyzed, can immediately identify several nonparametric tests that might be applied to his or her data.

Included for each test are a description of its function (under what circumstances this particular test should be used), its rationale, and its method (a heuristic description of why the test works and how the test statistic is calculated), including any modifications that exist and the procedure for dealing with ties, both large- and small-sample examples, a numbered list of steps for performing the test, and other references for a more in-depth description of the test.

Gregory Michaelson and Michael Hardin

See also Distribution; Nonparametric Statistics; Normal Distribution; Null Hypothesis; Research Hypothesis

Further Readings

Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.

NONPROBABILITY SAMPLING

The two kinds of sampling techniques are probability and nonprobability sampling. Probability sampling is based on the notion that the people or events chosen are selected because they are representative of the entire population. Nonprobability sampling refers to procedures in which researchers select their sample elements not based on a predetermined probability. This entry examines the application, limitations, and utility of nonprobability sampling procedures. Conceptual and empirical strategies for using nonprobability sampling techniques more effectively are also discussed.

Sampling Procedures

Probability Sampling

There are many different types of probability sampling procedures. More common ones include simple, systematic, stratified, multistage, and cluster sampling. Probability sampling allows one to have confidence that the results are accurate and unbiased, and it allows one to estimate how precise the data are likely to be. The data from a properly drawn sample are superior to data drawn from individuals who just show up at a meeting or who perhaps speak the loudest and convey their personal thoughts and sentiments. The critical issues in sampling include whether to use a probability sample, the sampling frame (the set of people that have a chance of being selected and how well it corresponds to the population studied), the size of the sample,
the sample design (particularly the strategy used to sample people, schools, households, etc.), and the response rate. The details of the sample design, including size and selection procedures, influence the precision of sample estimates regarding how likely the sample is to approximate population characteristics. The use of standardized measurement tools and procedures also helps to assure comparable responses.

Nonprobability Sampling

Nonprobability sampling is conducted without knowledge about whether those chosen in the sample are representative of the entire population. In some instances, the researcher does not have sufficient information about the population to undertake probability sampling. The researcher might not even know who or how many people or events make up the population. In other instances, nonprobability sampling is based on a specific research purpose, the availability of subjects, or a variety of other nonstatistical criteria. Applied social and behavioral researchers often face challenges and dilemmas in using a random sample, because such samples in real-world research are "hard to reach" or not readily available. Even if researchers have contact with hard-to-reach samples, they might be unable to obtain a complete sampling frame because of peculiarities of the study phenomenon. This is especially true when studying vulnerable or stigmatized populations, such as children exposed to domestic violence, emancipated foster care youth, or runaway teenagers. Consider, for instance, the challenges of surveying adults with a diagnosis of paranoid personality disorder. This is not a subgroup that is likely to agree to sign a researcher's informed consent form, let alone complete a lengthy battery of psychological instruments asking a series of personal questions.

Applied researchers often encounter other practical dilemmas when choosing a sampling method. For instance, there might be limited research resources. Because of limitations in funding, time, and other resources necessary for conducting large-scale research, researchers often find it difficult to use large samples. Researchers employed by a single site or agency might be unable to access subjects served by other agencies located in other sites. It is not a coincidence that recruiting study subjects from a single site or agency is one of the most popular methods among studies using nonprobability procedures. The barriers preventing a large-scale multisite collaboration among researchers can be formidable and difficult to overcome.

Statistical Theories About Sampling Procedures

Because a significant number of studies employ nonprobability samples and at the same time apply inferential statistics, it is important to understand the consequences. In scientific research, there are many reasons to observe elements of a sample rather than a population. Advantages of using sample data include reduced cost, greater speed, greater scope, and greater accuracy. However, there is no reason to use a biased sample that does not represent a target population. Scientists have given this topic a rigorous treatment and have developed several statistical theories about sampling procedures. The central limit theorem, which is usually presented in introductory statistics courses, forms the foundation of probability-sampling techniques. At the core of this theorem are the proven relationships between the mean of a sampling distribution and the mean of a population, between the standard deviation of a sampling distribution (known as the standard error) and the standard deviation of a population, and between the normal sampling distribution and the possibly non-normal population distribution.

Statisticians have developed various formulas for estimating how closely the sample statistics are clustered around the population's true values under various types of sampling designs, including simple random sampling, systematic sampling, stratified sampling, clustered sampling, and multistage clustered and stratified sampling. These formulas become the yardstick for determining adequate sample size. Calculating the adequacy of a probabilistic sample's size is generally straightforward and can be estimated mathematically based on preselected parameters and objectives (i.e., x statistical power with y confidence intervals). In practice, however, sampling error, the key component required by the formulas for figuring out a needed sample size, is often unknown to researchers.
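A brief simulation illustrates the central limit theorem relationships described above; the exponential population, the sample size, and the number of replications below are arbitrary choices made only for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 20_000                       # sample size and number of repeated samples

# A clearly non-normal (right-skewed) population: exponential with mean 1 and SD 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())                 # ~1.00: mean of the sampling distribution = population mean
print(sample_means.std())                  # ~0.14: standard error, close to population SD / sqrt(n)
print(1.0 / np.sqrt(n))                    # 0.1414 for comparison
print(stats.skew(sample_means))            # near 0: the sampling distribution is roughly normal
```

Even though the population is strongly skewed, the distribution of the sample means is centered on the population mean, has a standard deviation close to σ/√n (the standard error), and is approximately normal.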
In such instances, which often involve quasi-experimental designs, Jacob Cohen's framework of statistical power analysis is employed instead. This framework concerns the balance among four elements of a study: sample size, effect size (the difference between comparison groups), the probability of making a Type I error, and the probability of rejecting a false null hypothesis, or power. Studies using small nonprobability samples, for example, are likely to have inadequate power (well below the .85 convention indicating adequate power). As a consequence, studies employing sophisticated analytical models might not meet the required statistical criteria. The ordinary least-squares regression model, for example, makes five statistical assumptions about the data, and most of those assumptions require a randomized process for data gathering. Violating statistical assumptions in a regression analysis refers to the presence of one or more detrimental problems such as heteroscedasticity, autocorrelation, non-normality, multicollinearity, and others. Multicollinearity problems are particularly likely to occur in nonprobability studies in which data were gathered through a sampling procedure with hidden selection bias and/or with small sample sizes. Violating statistical assumptions might increase the risk of producing biased and inefficient estimates of regression coefficients and an exaggerated R².

Guidelines and Recommendations

This brief review demonstrates the importance of using probability sampling; however, probability sampling cannot be used in all instances. Therefore, the following questions must be addressed: Given the sampling dilemmas, what should researchers do? How can researchers using nonprobability sampling exercise caution in reporting findings or undertake remedial measures? Does nonprobability sampling necessarily produce adverse consequences? It is difficult to offer precise remedial measures to correct the most commonly encountered problems associated with the use of nonprobability samples, because such measures vary by the nature of the research questions and the type of data researchers employ in their studies. Instead of offering specific measures, the following strategies are offered to address the conceptual and empirical dilemmas in using nonprobability samples.

George Judge et al. caution researchers to be aware of the assumptions embedded in the statistical models they employ, to be sensitive to departures of the data from those assumptions, and to be willing to take remedial measures. John Neter et al. recommend that researchers always perform diagnostic tests to investigate departures of the data from the statistical assumptions and take corrective measures if detrimental problems are present. In theory, all research should use probabilistic sampling methodology, but in practice this is difficult, especially for hard-to-reach, hidden, or stigmatized populations. Much of social science research can hardly be performed in a laboratory. It is important to stress that the results of a study are meaningful if they are interpreted appropriately and used in conjunction with statistical theories. Theory, design, analysis, and interpretation are all closely connected.

Researchers are also advised to study compelling populations and compelling questions. This most often involves purposive samples in which the research population has some special significance. Most commonly used samples, particularly in applied research, are purposive. Purposive sampling is more applicable in exploratory studies and studies contributing new knowledge. Therefore, it is imperative for researchers to conduct a thorough literature review to understand the "edge of the field" and whether the study population or question is a new or significant contribution. How does this study contribute uniquely to the existing research knowledge? Purposive samples are selected based on predetermined criteria related to the research. Research that is field oriented and not concerned with statistical generalizability often uses nonprobabilistic samples. This is especially true in qualitative research studies. Adequate sample size typically relies on the notion of "saturation," or the point at which no new information or themes are obtained from the data. In qualitative research practice, this can be a challenging determination.

Researchers should also address subject recruitment issues to reduce selection bias. If possible, researchers should use consecutive admissions, including all cases during a representative time frame. They should describe the population in greater detail to allow for cross-study comparisons.
Other researchers will benefit from additional data and descriptors that provide a more comprehensive picture of the characteristics of the study population. It is critical in reporting results (for both probability and nonprobability sampling) to tell the reader who was or was not given a chance to be selected. Then, to the extent that it is known, researchers should tell the reader how those omitted from the study are the same as or different from those included. Conducting diagnostics comparing omitted or lost cases with the known study subjects can help in this regard. Ultimately, it is important for researchers to indicate clearly and discuss the limits of generalizability and external validity.

In addition, researchers are advised to make efforts to assure that a study sample provides adequate statistical power for hypothesis testing. It has been shown that, other things being equal, a large sample always produces more efficient and unbiased estimates about population true parameters than a small sample. When the use of a nonprobability sample is inevitable, researchers should carefully weigh the pros and cons that are associated with different study designs and choose a sample size that is as large as possible.

Another strategy is for researchers to engage in multiagency research collaborations that generate samples across agencies and/or across sites. In one study, because of limited resources, Brent Benda and Robert Flynn Corwyn found it unfeasible to draw a nationally representative sample to test the mediating versus moderating effects of religion on crime. To deal with the challenge, they used a comparison between two carefully chosen sites: random samples selected from two public high schools involving 360 adolescents in the inner city of a large east coast metropolitan area, and simple random samples involving 477 adolescents from three rural public high schools in an impoverished southern state. The resultant data undoubtedly had greater external validity than studies based on either site alone.

If possible, researchers should use national samples to run secondary data analyses. These databases were created by probability sampling and are deemed to have a high degree of representativeness and other desirable properties. The drawback is that these databases are likely to be useful for only a minority of research questions.

Finally, does nonprobability sampling necessarily produce adverse consequences? Shenyang Guo and David L. Hussey have shown that a homogeneous sample produced by nonprobability sampling is better than a less homogeneous sample produced by probability sampling in prediction. Remember, regression is a leading method used by applied researchers employing inferential statistics. Regression-type models also include simple linear regression, multiple regression, logistic regression, structural equation modeling, analysis of variance (ANOVA), multivariate analysis of variance (MANOVA), and analysis of covariance (ANCOVA). In a regression analysis, a residual is defined as the difference between the observed value and the model-predicted value of the dependent variable. Researchers are concerned about this measure because it is the model with the smallest sample residual that gives the most accurate predictions about sample subjects. Statistics such as Theil's U can gauge the scope of sample residuals; Theil's U is a modified version of the root-mean-square error measuring the magnitude of the overall sample residual. The statistic ranges from zero to one, with a value closer to zero indicating a smaller overall residual. In this regard, nonprobability samples can be more homogeneous than a random sample. Using regression coefficients (including an intercept) to represent study subjects, it is much easier to obtain an accurate estimate for a homogeneous sample than for a heterogeneous sample. The consequence, therefore, is that small homogeneous samples generated by a nonprobability sampling procedure might produce more accurate predictions about sample subjects. Therefore, if the task is not to infer statistics from sample to population, using a nonprobability sample is a better strategy than using a probability sample.

With the explosive growth of the World Wide Web and other new electronic technologies such as search monkeys, nonprobability sampling remains an easy way to obtain feedback and collect information. It is convenient, verifiable, and low cost, particularly when compared with face-to-face paper-and-pencil questionnaires. Along with the benefits of new technologies, however, the previous cautions apply and might be even more important given the ease with which larger samples might be obtained.

David L. Hussey
See also Naturalistic Inquiry; Probability Sampling; Sampling; Selection

Further Readings

Benda, B. B., & Corwyn, R. F. (2001). Are the effects of religion on crime mediated, moderated, and misrepresented by inappropriate measures? Journal of Social Service Research, 27, 57–86.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Guest, G., Bunce, A., & Johnson, L. (2006). How many interviews are enough? An experiment with data saturation and variability. Field Methods, 18, 59–82.
Guo, S., & Hussey, D. (2004). Nonprobability sampling in social work research. Journal of Social Service Research, 30, 1–18.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., & Lee, T. C. (1985). The theory and practice of econometrics (2nd ed.). New York: Wiley.
Kennedy, P. (1985). A guide to econometrics (2nd ed.). Cambridge: MIT Press.
Kish, L. (1965). Survey sampling. New York: Wiley.
Mech, E. V., & Che-Man Fung, C. (1999). Placement restrictiveness and educational achievement among emancipated foster youth. Research on Social Work Practice, 9, 213–228.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear regression models (3rd ed.). Chicago: Irwin.
Nugent, W. R., Bruley, C., & Allen, P. (1998). The effects of aggression replacement training on antisocial behavior in a runaway shelter. Research on Social Work Practice, 8, 637–656.
Peled, E., & Edleson, J. L. (1998). Predicting children's domestic violence service participation and completion. Research on Social Work Practice, 8, 698–712.
Rubin, A., & Babbie, E. (1997). Research methods for social work (3rd ed.). Pacific Grove, CA: Brooks/Cole.
Theil, H. (1966). Applied economic forecasting. Amsterdam, the Netherlands: Elsevier.

NONSIGNIFICANCE

This entry defines nonsignificance within the context of null hypothesis significance testing (NHST), the dominant scientific statistical method for making inferences about populations based on sample data. Emphasis is placed on the three routes to nonsignificance: a real lack of effect in the population; failure to detect a real effect because of an insufficiently large sample; or failure to detect a real effect because of a methodological flaw. Of greatest importance is the recognition that nonsignificance is not affirmative evidence of the absence of an effect in the population.

Nonsignificance is the determination in NHST that no statistically significant effect (e.g., correlation, difference between means, or dependence of proportions) can be inferred for a population. NHST typically involves a statistical test (e.g., a t test) performed on a sample to infer whether two or more variables are related in a population. Studies often have a high probability of failing to reject a false null hypothesis (i.e., of committing a Type II, or false negative, error), thereby returning a nonsignificant result even when an effect is present in the population.

In its most common form, a null hypothesis posits that the means on some measurable variable for two groups are equal to each other. A statistically significant difference would indicate that the probability that a true null hypothesis is erroneously rejected (a Type I, or false positive, error) is below some desired threshold (α), which is typically .05. As a result, statistical significance refers to a conclusion that there likely is a difference in the means of the two population groups. In contrast, nonsignificance refers to the finding that the two means do not significantly differ from each other (a failure to reject the null hypothesis). Importantly, nonsignificance does not indicate that the null hypothesis is true; it indicates only that one cannot rule out chance and random variation as explanations for the observed differences. In this sense, NHST is analogous to an American criminal trial, in which there is a presumption of innocence (equality), the burden of proof is on demonstrating guilt (difference), and a failure to convict (reject the null hypothesis) results only in a verdict of "not guilty" (not significant), which does not confer innocence (equality).

Nonsignificant findings might accurately reflect the absence of an effect, or they might be caused by a research design flaw leading to low statistical power and a Type II error.
Statistical power is defined as the probability of detecting an existing effect (rejecting a false null hypothesis) and can be calculated ex ante given the population effect size (or an estimate thereof), the desired significance level (e.g., .05), and the sample size.

Type II errors resulting from insufficient statistical power can result from several factors. First, small samples yield lower power because they are simply less likely than large samples to be representative of the population, and they lead to larger estimates of the standard error. The standard error is estimated as the sample standard deviation divided by the square root of the sample size; therefore, the smaller the sample, the bigger the estimate of the standard error will be. Because the standard error is the denominator in significance test equations, the bigger it is, the less likely the test statistic will be large enough to reject the null hypothesis. Small samples also contribute to nonsignificance because sample size (specifically, the degrees of freedom derived from it) is an explicit factor in calculations of significance levels (p values). Low power can also result from imprecise measurement, which might result in excessive variance. This too will cause the denominator in the test statistic calculation to be large, thereby underestimating the magnitude of the effect.

Type II error can also result from flawed methodology, wherein variables are operationalized inappropriately. If variables are not manipulated or measured well, the real relationship between the intended variables will be more difficult to discern from the data. This issue is often referred to as construct validity. A nonrepresentative sample, even if it is large, or a misspecified model might also prevent the detection of an existing effect.

To reduce the likelihood of nonsignificance resulting from Type II errors, an a priori power analysis can determine the necessary sample size to provide the desired likelihood of rejecting the null hypothesis if it is false. The suggested convention for statistical power is .8. Such a level would allow a researcher to say with 80% confidence that no Type II error had been committed and, in the event of nonsignificant findings, that no effect exists.
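As a concrete example of such an a priori calculation, the sketch below uses the common normal-approximation formula for a two-sample, two-tailed comparison; the effect sizes shown (d = 0.5 and d = 0.2) are arbitrary illustrations, not values from this entry.

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample, two-tailed test of a
    standardized mean difference d (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

print(round(n_per_group(0.5)))   # ~63 participants per group for a medium effect, d = 0.5
print(round(n_per_group(0.2)))   # ~392 per group for a small effect, d = 0.2
```

Exact t-based versions of this calculation are available in standard software (e.g., the power routines in statsmodels or G*Power) and typically give slightly larger sample sizes.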
Some have critiqued the practice of reporting significance tests alone, given that with a .05 criterion a result of .049 is deemed statistically significant whereas one of .051 is not, which artificially dichotomizes the determination of significance. An overreliance on statistical significance also results in a bias among published research toward studies with significant findings, contributing to a documented upward bias in the effect sizes of published studies. Directional hypotheses might also be an issue. For accurate determination of significance, directionality should be specified in the design phase, as directional hypotheses (e.g., predicting that one particular mean will be higher than the other) have twice the statistical power of nondirectional hypotheses, provided the results are in the hypothesized direction.

Many of these problems can be avoided with a careful research design that incorporates a sufficiently large sample based on an a priori power analysis. Other suggestions include reporting effect sizes (e.g., Cohen's d) and confidence intervals to convey more information about the magnitude of effects relative to variance and about how close the results are to being determined significant. Additional options include reporting the power of the tests performed or the sample size needed to determine significance for a given effect size. Additionally, meta-analysis (combining effects across multiple studies, even those that are nonsignificant) can provide more powerful and reliable assessments of relations among variables.

Christopher Finn and Jack Glaser

See also Null Hypothesis; Power; Power Analysis; Significance, Statistical; Type II Error

Further Readings

Berry, E. M., Coustere-Yakir, C., & Grover, N. B. (1998). The significance of non-significance. QJM, 91, 647–653.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638–641.

NORMAL DISTRIBUTION
The normal distribution, which is also called a Gaussian distribution, bell curve, or normal curve, is commonly known for its bell shape (see Figure 1) and is defined by a mathematical formula. It is a member of families of distributions such as the exponential, monotone likelihood ratio, Pearson, stable, and symmetric power families. Many biological, physical, and psychological measurements, as well as measurement errors, are thought to approximate normal distributions. It is one of the most broadly used distributions for describing continuous variables.

Figure 1  The Normal Distribution

The normal curve has played an essential role in statistics. Consequently, research and theory have grown and evolved because of the properties of the normal curve. This entry first describes the characteristics of the normal distribution, followed by a discussion of its applications. Lastly, this entry gives a brief history of the normal distribution.

Characteristics of the Normal Distribution

The normal distribution has several properties, including the following:

• The curve is completely determined by the mean (average) and the standard deviation (the spread about the mean, or girth).
• The mean is at the center of this symmetrical curve (which is also the peak, or maximum ordinate, of the curve).
• The distribution is unimodal; the mean, median, and mode are the same.
• The curve is asymptotic. Values trail off symmetrically from the mean in both directions, indefinitely (the tails never touch the x axis), and form the two tails of the distribution.
• The curve is infinitely divisible.
• The skewness and kurtosis are both zero.

Different normal density curves exist because a normal distribution is determined by the mean and the standard deviation. Figure 2 presents three normal distributions: one with a mean of 100 and a standard deviation of 20, another with a mean of 100 and a standard deviation of 40, and a third with a mean of 80 and a standard deviation of 20. The curves differ with respect to their spread and their height. There can be an unlimited number of normal distributions because there are an infinite number of means and standard deviations, as well as combinations of those means and standard deviations.

Figure 2  Examples of Three Different Normal Distributions (curves shown: μ = 80, σ = 20; μ = 100, σ = 20; μ = 100, σ = 40)

Although sometimes labeled a "bell curve," the curve does not always resemble a bell shape (e.g., the distribution with a mean of 100 and a standard deviation of 40 is flatter in this instance because of the scale being used for the x and y axes). Also, not all bell-shaped curves are normal distributions. What determines whether a curve is a normal distribution is not its appearance but its mathematical function. The normal distribution is a mathematical curve defined by the probability density function (PDF):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2},$$
where

f(x) = the density of the function, or the height of the curve (usually plotted on the y axis), for a particular variable x (usually plotted on the x axis) with a normal distribution
σ = standard deviation of the distribution
π = the constant 3.1416
e = base of Napierian logarithms, 2.7183
μ = mean of the distribution

This equation dictates the shape of the normal distribution. Normal distributions with the same mean and standard deviation are identical in form. The curve is symmetrical about its mean because (x − μ) is squared. Furthermore, because the exponent is negative, the more x deviates from the mean (a large positive or a large negative number), the smaller f(x) becomes, approaching but never reaching zero. This explains why the tails of the distribution never touch the x axis.

The total area under the normal distribution curve equals one (100%). Mathematically, this is represented by the following formula:

$$\int_{-\infty}^{+\infty} f(x)\, dx = 1.$$

A probability distribution of a variable X can also be described by its cumulative distribution function (CDF). The mathematical formula for the CDF is

$$F(x) = p(X \le x) = \int_{-\infty}^{x} f(t)\, dt = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2 / 2\sigma^2}\, dt.$$

F(x) gives the probability that any randomly chosen variable X, with a normal distribution, is less than or equal to x, the value of interest. The normal CDF curve is presented in Figure 3 along with the normal PDF curve. The value of the CDF at a point on the x axis equals the area of the PDF up to that same value. CDF values range from 0 to 1.

Figure 3  The Normal CDF and the Normal PDF Curves

Applications

The term normal was coined to reflect that this probability distribution is commonly found, not that other probability distributions are somehow not "normal." Although at one time it was extolled as the distribution of all traits, it is now recognized that many traits are best described by other distributions. Exponential distributions, Lévy distributions, and Poisson distributions are some examples of the numerous other distributions that exist.

Some variables that can be said to be approximated by the normal curve include height, weight, personality traits, intelligence, and memory. They are approximated because their real distributions would not be identical to a normal distribution. For instance, the real distributions of these variables might not be perfectly symmetrical, their curves might bend slightly, and their tails do not go to infinity. (The tails of the normal distribution go to infinity on the x axis. This is not possible for the variables mentioned previously; e.g., certain heights for a human being would be impossible.) The normal distribution reflects the best shape of the data. It is an idealized version of the data (see Figure 4), is described by a mathematical function (the PDF), and, because many variables are thought to approximate the normal distribution, can be used as a model to describe the distribution of the data. Thus, it is possible to take advantage of many of the normal curve's strengths.
Figure 4  An Example of a Normal Curve and a Curve Fitted to the Data

One strength of these normal curves is that they all share a characteristic: probabilities can be identified. However, because normal distributions can take on any mean and/or standard deviation, calculations for each normal distribution would be required to determine the area or probability for a given score or the area between two scores. To avoid these lengthy calculations, the values are converted to standard scores (also called standard units or z scores). Many statistics textbooks present a table of probabilities that is based on a distribution called the standard normal distribution. The standard normal curve has a mean (μ) of zero and a standard deviation (σ) of one. The probabilities for the standard normal distribution have been calculated and are known, and because the values of any normal distribution can be converted to standard scores, the probabilities for each normal curve do not need to be calculated but can be found through conversion to z scores. The density function of the standard normal distribution is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$

Figure 5 presents a standard normal curve: 68.27% of the values in a particular data set lie between −1σ and +1σ (34.13% lie between the mean and +1σ, and 34.13% lie between the mean and −1σ); 95.45% of the values lie between −2σ and +2σ (13.59% of the cases lie between +1σ and +2σ, 13.59% lie between −2σ and −1σ, and (2 × 13.59%) + (2 × 34.13%) = 95.45%); and 99.73% of all values lie between −3σ and +3σ. Because 99.73% of all values lie between −3 and +3 standard deviations, a value found outside of this range would be rare.

Figure 5  The Standard Normal Distribution and Its Percentiles (tail areas: 0.13% beyond ±3σ; 2.14% between 2σ and 3σ; 13.59% between 1σ and 2σ; 34.13% between the mean and 1σ)

Given that all normal curves share the same mathematical function and have the same area under the curve, variables that are normally distributed can be converted to a standard normal distribution using the following formula:

$$z = \frac{(X - \mu)}{\sigma},$$

where

z = standard score on x, representing how many standard deviations a raw score (x) is above or below the mean
X = score of interest for the normally distributed data
μ = mean for the normally distributed data
σ = standard deviation for the normally distributed data

In standardizing a set of scores, the scores do not become normal. In other words, the shape of a distribution of scores does not change (i.e., does not become normal) by converting them to z scores. Once scores are standardized, the percentile (proportion of scores) for a given data point or score can be determined easily.
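The point that standardizing does not normalize a distribution is easy to verify numerically; in the sketch below (an arbitrary right-skewed sample invented for the illustration), the skewness is identical before and after conversion to z scores, even though the mean becomes 0 and the standard deviation becomes 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1_000)   # a clearly right-skewed sample

z = (x - x.mean()) / x.std()                 # convert to z scores

print(z.mean(), z.std())                     # ~0 and 1 by construction
print(stats.skew(x), stats.skew(z))          # identical skewness: the shape is unchanged
```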
To illustrate, if the average score for an intelligence measure is 100 and the standard deviation is 10, then using the above formula and the table of the area of the standard normal distribution, it is possible to determine the percentage (proportion) of cases that have scored less than 120. First, the calculation of the z score is required:

$$z = \frac{(X - \mu)}{\sigma} = \frac{(120 - 100)}{10} = 2.$$

A z score of +2 means that the score of 120 is 2 standard deviations above the mean. To determine the percentile for this z score, this value has to be looked up in a table of standardized values of the normal distribution (i.e., a z table). The value of 2 is looked up in a standard normal-curve areas table (not presented here, but one can be found in most statistics textbooks) and the corresponding value of .4772 is found (this is the area between the mean and a z value of 2). This means that the probability of observing a score between 100 (the mean in this example) and 120 (the score of interest in this example) is .4772. The standard normal distribution is symmetrical around its mean, so 50% of all scores are 100 and less. To determine what proportion of individuals score below 120, the value below 100 has to be added to the value between 100 and 120. Therefore, 50% is added to 47.72%, resulting in 97.72%. Thus, 97.72% of individuals are expected to score worse than or equal to 120. Conversely, if interested in determining the proportion of individuals who would score better than 120, .4772 would be subtracted from .5, and this would equal .0228, which means that 2.28% of individuals would be expected to score better than or equal to 120.

Should a person wish to know the proportion of people who obtained an IQ between 85 and 95, a series of the previous calculations would have to be conducted. In this example, the z scores for 85 and for 95 would have to be calculated first (assuming the same mean and standard deviation as in the previous example):

$$\text{For } 85\!: \; z = \frac{(X - \mu)}{\sigma} = \frac{(85 - 100)}{10} = -1.5.$$
$$\text{For } 95\!: \; z = \frac{(X - \mu)}{\sigma} = \frac{(95 - 100)}{10} = -0.5.$$

The areas under the negative z scores are the same as the areas under the identical positive z scores because the standard normal distribution is symmetrical about its mean (not all tables present both the positive and negative z scores). In looking up the table for z = 1.5 and z = 0.5, the values .4332 and .1915, respectively, are obtained. To determine the proportion of IQ scores between 85 and 95, .1915 has to be subtracted from .4332; .4332 − .1915 = .2417, so 24.17% of IQ scores are found between 85 and 95.

For both examples presented here, the proportions are estimates based on mathematical calculations and do not represent actual observations. Therefore, in the previous example, it is estimated that 97.72% of IQ scores are expected to be less than or equal to 120, and it is estimated that 24.17% are expected to be found between 85 and 95, but what would actually be observed might be different.

The normal distribution is also important because of its numerous mathematical properties. Assuming that the data of interest are normally distributed allows researchers to apply calculations that can only be applied to data that share the characteristics of a normal curve. For instance, many scores such as percentiles, t scores (scores that have been converted to standard scores and subsequently modified such that their mean is 50 and standard deviation is 10), and stanines (scores that have been changed to a value from 1 to 9 depending on their location in the distribution; e.g., a score found in the top 4% of the distribution is given a value of 9, and a score found in the middle 20% of the distribution is given a value of 5) are calculated based on the normal distribution. Many statistics rely on the normal distribution, as they are based on the assumption that directly observable scores are normally distributed or have a distribution that approximates normality. Some statistics that assume the variables under study are normally distributed include t, F, and χ². Furthermore, the normal distribution can be used as an approximation for some other distributions.

To determine whether a given set of data follows a normal distribution, an examination of skewness and kurtosis, the probability–probability (P-P) plot, or the results of normality tests such as the Kolmogorov–Smirnov test, the Lilliefors test, and the Shapiro–Wilk test can be conducted.
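The worked IQ percentages above, and the normality checks just listed, can be reproduced with a few lines of code; in the sketch below, the simulated sample and its size are illustrative assumptions, not data from this entry.

```python
import numpy as np
from scipy import stats

# Reproducing the worked IQ examples (mean 100, standard deviation 10).
iq = stats.norm(loc=100, scale=10)
print(iq.cdf(120))                   # ~0.9772: proportion scoring at or below 120
print(1 - iq.cdf(120))               # ~0.0228: proportion scoring above 120
print(iq.cdf(95) - iq.cdf(85))       # ~0.2417: proportion between 85 and 95

# Normality checks on a simulated sample.
rng = np.random.default_rng(7)
sample = rng.normal(100, 10, size=200)
print(stats.skew(sample), stats.kurtosis(sample))   # both near 0 for normal data
print(stats.shapiro(sample))                        # Shapiro-Wilk test
print(stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1))))
# Note: estimating the mean and SD from the same sample makes this K-S variant
# approximate; the Lilliefors correction addresses exactly that situation.
```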
If the data do not reflect a normal distribution, then the researcher has to determine whether a few outliers are influencing the distribution of the data, whether data transformation will be necessary, or whether nonparametric statistics will be used to analyze the data, for instance.

Many measurements (latent variables) and phenomena are assumed to be normally distributed (and thus can be approximated by the normal distribution). For instance, intelligence, weight, height, abilities, and personality traits can each be said to follow a normal distribution. However, realistically, researchers deal with data that come from populations that do not perfectly follow a normal distribution, or whose distributions are not actually known. The Central Limit Theorem (also known as the second fundamental theorem of probability) partly takes care of this problem. One important element of the Central Limit Theorem states that when the sample size is large, the sampling distribution of the sample means will approach the normal curve even if the population distribution is not normal. This allows researchers to be less concerned about whether the population distributions follow a normal distribution or not.

These descriptions and applications apply to the univariate normal distribution (i.e., the normal distribution of a single variable). When two (bivariate normal distribution) or more variables are considered, the multivariate normal distribution is important for examining the relations among those variables and for using multivariate statistics.

Whether many variables are actually normally distributed is a point of debate for many social researchers. For instance, the view that certain personality traits are normally distributed can never be observed, as the constructs are not actually measured directly. Many variables are measured using discrete rather than continuous scales. Furthermore, large sample sizes are not always obtained; thus, the normal curve might not actually fit those data well.

History of the Normal Distribution

The first known documentation of the normal distribution was written by Galileo in the 17th century in his description of random errors found in measurements by astronomers. Abraham de Moivre is credited with its first appearance, in an article he published in 1733. Pierre Simon de Laplace developed the first general Central Limit Theorem in the early 1800s (an important element in the application of the normal distribution) and described the normal distribution. Carl Friedrich Gauss independently discovered the normal curve and its properties at around the same time as de Laplace and was interested primarily in its application to errors of observation in astronomy. It was consequently used extensively for describing errors. Adolphe Quetelet extended the use of the normal curve beyond errors, believing it could be used to describe phenomena in the social sciences, not just physics. Sir Francis Galton, in the late 19th century, extended Quetelet's work and applied the normal curve to other psychological measurements.

Adelheid A. M. Nicol

See also Central Limit Theorem; Data Cleaning; Multivariate Normal Distribution; Nonparametric Statistics; Normality Assumption; Normalizing Data; Parametric Statistics; Percentile Rank; Sampling Distributions

Further Readings

Hays, W. L. (1994). Statistics. Orlando, FL: Harcourt.
Hopkins, K. D., & Glass, G. V. (1978). Basic statistics for the behavioral sciences. Englewood Cliffs, NJ: Prentice Hall.
King, B. M., & Minium, E. M. (2006). Statistical reasoning in psychology and education. Hoboken, NJ: Wiley.
Levin, J., & Fox, J. A. (2006). Elementary statistics in social research. Boston: Allyn & Bacon.
Lewis, D. G. (1957). The normal distribution of intelligence: A critique. British Journal of Psychology, 48, 98–104.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
Patel, J. K., & Read, C. B. (1996). Handbook of the normal distribution. New York: Marcel Dekker.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Belknap Press of Harvard University Press.
Snyder, D. M. (1986). On the theoretical derivation of the normal distribution for psychological phenomena. Psychological Reports, 59, 399–404.
Thode, H. (2002). Testing for normality. New York: Marcel Dekker.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. Journal of Experimental Education, 67, 55–68.
NORMALITY ASSUMPTION

The normal distribution (also called the Gaussian distribution, named after Carl Friedrich Gauss, a German scientist and mathematician who justified the least squares method in 1809) is the most widely used family of statistical distributions on which many statistical tests are based. Many measurements of physical and psychological phenomena can be approximated by the normal distribution; hence the widespread utility of the distribution. In many areas of research, a sample is identified on which measurements of particular phenomena are made. These measurements are then statistically tested, via hypothesis testing, to determine whether the observations differ by more than chance. Assuming the test is valid, an inference can be made about the population from which the sample is drawn.

Hypothesis testing involves assumptions about the underlying distribution of the sample data. Three key assumptions, in order of importance, are independence, common variance, and normality. The term normality assumption arises when the researcher asserts that the distribution of the data follows a normal distribution. Parametric and nonparametric tests are commonly based on the same assumptions, with the exception that nonparametric tests do not require the normality assumption.

Independence refers to the correlation between observations of a sample. For example, if you could order the observations in a sample by time, and observations that are closer together in time are more similar and observations further apart in time are less similar, then we would say the observations are not independent but correlated or dependent on time. If the correlation between observations is positive, then the Type I error is inflated (the Type I error level is the probability of rejecting the null hypothesis when it is true and is traditionally defined by alpha and set at .05). If the correlation is negative, then the Type I error is deflated. Even modest levels of correlation can have substantial impacts on the Type I error level (for a correlation of .2 the alpha is .11, whereas for a correlation of .5, the alpha level is .26). Independence of observations is difficult to assess. With no formal statistical tests widely in use, knowledge of the substantive area is paramount, and a thorough understanding of how the data were generated is required for valid statistical analysis and interpretation to be undertaken.

Common variance (often referred to as homogeneity of variance) refers to the concept that all samples drawn have similar variability. For example, if you were testing the difference in height between two samples of people, one from Town A and the other from Town B, the test assumes that the variance of height in Town A is similar to that of Town B. In 1953, G. E. P. Box demonstrated that for even modest sample sizes, most tests are robust to this assumption, and differences of up to 3-fold in variance do not greatly affect the Type I error level. Many statistical tests are available to ascertain whether the variances are equal among different samples (including the Bartlett–Kendall test, Levene's test, and the Brown–Forsythe test). These tests for the homogeneity of variance are sensitive to normality departures, and as such they might indicate that the common variance assumption does not hold, although the validity of the test is not in question.
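As an illustration of how such variance checks are run in practice, the following sketch uses Python's SciPy library with simulated Town A and Town B heights. It is not part of the original entry; the data and library choice are assumptions made purely for the example.

```python
# Illustrative sketch (not from the entry): checking equality of variances
# for two independent samples with SciPy. Heights are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
town_a = rng.normal(loc=170, scale=6, size=80)   # heights, Town A
town_b = rng.normal(loc=171, scale=9, size=80)   # heights, Town B (larger spread)

# Levene's test with median centering, i.e., the Brown-Forsythe variant,
# which is comparatively robust to departures from normality
w, p_levene = stats.levene(town_a, town_b, center="median")

# Bartlett's test, more powerful under normality but sensitive to non-normality
t, p_bartlett = stats.bartlett(town_a, town_b)

print(f"Brown-Forsythe/Levene: W = {w:.2f}, p = {p_levene:.4f}")
print(f"Bartlett:              T = {t:.2f}, p = {p_bartlett:.4f}")
```

Because Bartlett's test itself assumes normality, the median-centered Levene (Brown–Forsythe) statistic is usually the safer default when that assumption is in doubt.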
Although it is the least important of the assumptions when considering hypothesis testing, the normality assumption should not be ignored. Many statistical tests and methods employed in research require that a variable or variables be normally distributed. Aspects of normality include the following: (a) the possible values of the quantity being studied can vary from negative infinity to positive infinity; (b) there will be symmetry in the data, in that observed values will fall with equal probability above and below the true population mean value, as a process of unbiased, random variability; and (c) the width or spread of the distribution of observed values around the true mean will be determined by the standard deviation of the distribution (with predictable percentages of observed values falling within bounds defined by multiples of the true distribution standard deviation). Real-world data might behave differently from the normal distribution for several reasons. Many physiological and behavioral variables cannot truly have infinite range, and distributions might be truncated at true zero values. Skewness (asymmetry), or greater spread in observed values than predicted by the standard deviation, might also result in situations where the distribution of observed values is not determined exclusively by random variability, and it also might be a result of unidentified systematic influences (or unmeasured predictors of the outcome).

The statistical tests assume that the data follow a normal distribution to preserve the tests' validity. When undertaking regression models, the normality assumption applies to the error term of the model (often called the residuals) and not the original data; hence, it is often misunderstood in this context. It should be noted that the normality assumption is sufficient, but not necessary, for the validity of many hypothesis tests. The remainder of this entry focuses on the assessment of normality and the transformation of data that are not normally distributed.

Assessing Normality

A researcher can assess the normality of variables in several ways. To say a variable is normally distributed indicates that the distribution of observations for that variable follows the normal distribution. So in essence, if you examined the distribution graphically, it would look similar to the typical bell-shaped normal curve. A histogram, a box-and-whisker plot, or a normal quartile plot (often called a Q-Q plot) can be created to inspect the normality of the data visually. With many data analysis packages, a histogram can be requested with the normal distribution superimposed to aid in this assessment. Other standard measures of distribution exist and include skewness and kurtosis. Skewness refers to the symmetry of the distribution, in which right-skewed distributions have a long tail pointing to the right and left-skewed distributions have a long tail pointing to the left. Kurtosis refers to the peakedness of the distribution.

A box-and-whisker plot is created with five numeric summaries of the variable: the minimum value, the lower quartile, the median, the upper quartile, and the maximum value. The box is formed by the lower and upper quartiles bisected by the median. Whiskers are formed on the box plot by drawing a line from the lowest edge of the box (lower quartile) to the minimum value and from the highest edge of the box (upper quartile) to the maximum value. If the variable has a normal distribution, the box will be bisected in the middle by the median, and both whiskers will be of equal length.

A normal quartile plot compares the spacing of the data with that of the normal distribution. If the data being examined are approximately normal, then more observations should be clustered around the mean and only a few observations should exist in each of the tails. The vertical axis of the plot displays the actual data, whereas the horizontal axis displays the quartiles from the normal distribution (expected z scores). If the data are normally distributed, the resulting plot will form a straight line with a slope of 1. If the line demonstrates an upward bending curve, the data are right skewed, whereas if the line demonstrates a downward bending curve, the data are left skewed. If the line has an S-shape, it indicates that the data are kurtotic.

Several common statistical tests were designed to assess normality. These would include, but are not limited to, the Kolmogorov–Smirnov, the Shapiro–Wilk, the Anderson–Darling, and the Lilliefors tests. In each case, the test calculates a test statistic under the null hypothesis that the sample is drawn from a normal distribution. If the associated p value for the test statistic is greater than the selected alpha level, then one does not reject the null hypothesis that the data were drawn from a normal distribution. Some tests can be modified to test samples against other statistical distributions. The Shapiro–Wilk and the Anderson–Darling tests have been noted to perform better with small sample sizes. With all the tests, small deviations from normality can lead to a rejection of the null hypothesis, and therefore the tests should be used and interpreted with caution.
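A minimal sketch of these assessments follows. It is not part of the original entry: the data are simulated, and the use of SciPy and Matplotlib is an assumption made for illustration. It runs the Shapiro–Wilk, Anderson–Darling, and Kolmogorov–Smirnov tests and draws the normal quantile (Q-Q) plot described above.

```python
# Illustrative sketch (not from the entry): formal normality tests and
# a normal quantile (Q-Q) plot on a deliberately right-skewed sample.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.5, size=60)   # simulated, right-skewed

w, p_shapiro = stats.shapiro(x)                   # Shapiro-Wilk
a_res = stats.anderson(x, dist="norm")            # Anderson-Darling
# Kolmogorov-Smirnov against a normal with the sample mean and SD.
# With estimated parameters this is the Lilliefors setting, so the
# plain KS p value is only approximate.
d, p_ks = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"Shapiro-Wilk:       W  = {w:.3f}, p = {p_shapiro:.4f}")
print(f"Anderson-Darling:   A2 = {a_res.statistic:.3f}, "
      f"5% critical value = {a_res.critical_values[2]:.3f}")
print(f"Kolmogorov-Smirnov: D  = {d:.3f}, p (approx.) = {p_ks:.4f}")

stats.probplot(x, dist="norm", plot=plt)          # Q-Q plot; curvature = skew
plt.show()
```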
When a violation of the normality assumption is observed, it might be a sign that a better statistical model can be found, so exploring why the assumption is violated might be fruitful. Non-normality of the error term might indicate that the resulting error is greater than expected under the assumption of true random variability and (especially when the distribution of the data is asymmetrical) might suggest that the observations come from more than one "true" underlying population. Additional variables could be added to the model (or the study) to predict systematically observed values not yet in the model, thereby moving more information to the linear predictor. Similarly, non-normality might reflect that variables in the model are incorrectly specified (such as assuming there is a linear association between a continuous predictor variable and the outcome).

Research is often concerned with more than one variable, and with regression analysis or statistical modeling, the assumption is that the combination of variables under study follows a multivariate normal distribution. There are no direct tests for multivariate normality, and as such, each variable under consideration is considered individually for normality. If all the variables under study are normally distributed, then another assumption is made that the variables combined are multivariate normal. Although this assumption is made, it is not always the case that variables are normal individually and collectively. Note that when assessing normality in a regression modeling situation, the assessment of normality should be undertaken with the error term (residuals).

Note of Caution: Small Sample Sizes

When assessing normality with small sample sizes (samples with fewer than approximately 50 observations), caution should be exercised. Both the visual aids (histogram, box-and-whisker plot, and normal quartile plot) and the statistical tests (Kolmogorov–Smirnov, Shapiro–Wilk, Anderson–Darling, and Lilliefors tests) can provide misleading results. Departures from normality are difficult to detect with small sample sizes, largely because of the power of the test. The power of the statistical tests decreases as the significance level is decreased (as the statistical test is made more stringent) and increases as the sample size increases. So, with small sample sizes, the statistical tests will nearly always indicate acceptance of the null hypothesis even though departures from normality could be large. Likewise, with large sample sizes, the statistical tests become powerful, and often minor, inconsequential departures from normality would lead the researcher to reject the null hypothesis.

What to Do If Data Are Not Normally Distributed

If, after assessment, the data are not normally distributed, a transformation of the non-normal variables might improve normality. If after transformation the variable meets the normality assumption, the transformed variable can be substituted in the analysis. Interpretation of a transformed variable in an analysis needs to be undertaken with caution, as the scale of the variable will be related to the transformation and not the original units.

Based on Frederick Mosteller and John Tukey's Ladder of Powers, if a researcher needs to remove right skewness from the data, then he or she moves "down" the ladder of power by applying a transformation smaller than 1, such as the square root, cube root, logarithm, or reciprocal. If the researcher needs to remove left skewness from the data, then he or she moves "up" the ladder of power by applying a transformation larger than 1, such as squaring or cubing.
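A brief sketch of moving down the ladder of powers follows. It is not part of the original entry; the scores are hypothetical, and NumPy and SciPy are assumed for the illustration.

```python
# Illustrative sketch (not from the entry): transformations from the ladder
# of powers applied to hypothetical right-skewed scores.
import numpy as np
from scipy import stats

y = np.array([310, 355, 362, 400, 428, 455, 503, 560, 640, 980], dtype=float)

down_sqrt = np.sqrt(y)        # a small step down the ladder
down_log = np.log(y)          # a larger step down (requires positive values)
down_recip = 1.0 / y          # strongest; note it reverses the score ordering
up_square = y ** 2            # moving "up" instead, for left-skewed data

for name, v in [("raw", y), ("sqrt", down_sqrt), ("log", down_log)]:
    print(f"{name:>5}: skewness = {stats.skew(v):.2f}")
```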
Many analysts have tried other means to avoid non-normality of the error term, including categorizing the variable, truncating or eliminating extreme values from the distribution of the original variable, or restricting the study or experiment to observations within a narrower range of the original measure (where the residuals observed might form a "normal" pattern). None of these is ideal, as they might affect the measurement properties of the original variable and create problems with bias of estimates of interest and/or loss of statistical power in the analysis.

Jason D. Pole and Susan J. Bondy

See also Central Limit Theorem; Homogeneity of Variance; Law of Large Numbers; Normal Distribution; Type I Error; Type II Error; Variance

Further Readings

Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318–335.
Holgersson, H. E. T. (2006). A graphical method for assessing multivariate normality. Computational Statistics, 21, 141–149.
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169.

NORMALIZING DATA

Researchers often want to compare scores or sets of scores obtained on different scales. For example, how do we compare a score of 85 in a cooking contest with a score of 100 on an IQ test? To do so, we need to "eliminate" the unit of measurement; this operation means to normalize the data. There are two main types of normalization. The first type of normalization originates from linear algebra and treats the data as a vector in a multidimensional space. In this context, to normalize the data is to transform the data vector into a new vector whose norm (i.e., length) is equal to one. The second type of normalization originates from statistics and eliminates the unit of measurement by transforming the data into new scores with a mean of 0 and a standard deviation of 1. These transformed scores are known as z scores.

Normalization to a Norm of One

The Norm of a Vector

In linear algebra, the norm of a vector measures its length, which is equal to the Euclidean distance of the endpoint of this vector to the origin of the vector space. This quantity is computed (from the Pythagorean theorem) as the square root of the sum of the squared elements of the vector. For example, consider the following data vector denoted y:

\[\mathbf{y} = (35,\ 36,\ 46,\ 68,\ 70)^{\top}. \tag{1}\]

The norm of vector y is denoted ||y|| and is computed as

\[\|\mathbf{y}\| = \sqrt{35^2 + 36^2 + 46^2 + 68^2 + 70^2} = \sqrt{14{,}161} = 119. \tag{2}\]

Normalizing With the Norm

To normalize y, we divide each element by ||y|| = 119. The normalized vector, denoted ỹ, is equal to

\[\tilde{\mathbf{y}} = \left(\tfrac{35}{119},\ \tfrac{36}{119},\ \tfrac{46}{119},\ \tfrac{68}{119},\ \tfrac{70}{119}\right)^{\top} = (0.2941,\ 0.3025,\ 0.3866,\ 0.5714,\ 0.5882)^{\top}. \tag{3}\]

The norm of vector ỹ is now equal to one:

\[\|\tilde{\mathbf{y}}\| = \sqrt{0.2941^2 + 0.3025^2 + 0.3866^2 + 0.5714^2 + 0.5882^2} = \sqrt{1} = 1. \tag{4}\]
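The same computation can be checked numerically. The following sketch is an illustrative addition, not part of the original entry; it simply reproduces Equations 1 through 4 with NumPy.

```python
# Illustrative sketch (not from the entry): normalizing the example vector
# y = (35, 36, 46, 68, 70) to a norm of one with NumPy.
import numpy as np

y = np.array([35, 36, 46, 68, 70], dtype=float)
norm = np.linalg.norm(y)          # sqrt(35**2 + ... + 70**2) = 119.0
y_tilde = y / norm                # the normalized vector of Equation 3

print(norm)                       # 119.0
print(np.round(y_tilde, 4))       # [0.2941 0.3025 0.3866 0.5714 0.5882]
print(np.linalg.norm(y_tilde))    # 1.0 (up to floating-point rounding)
```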
Normalization Using Centering and Standard Deviation: z Scores

The Standard Deviation of a Set of Scores

Recall that the standard deviation of a set of scores expresses the dispersion of the scores around their mean. A set of N scores, each denoted Yn, whose mean is equal to M, has a standard deviation denoted Ŝ, which is computed as

\[\hat{S} = \sqrt{\frac{\sum (Y_n - M)^2}{N - 1}}. \tag{5}\]

For example, the scores from vector y (see Equation 1) have a mean of 51 and a standard deviation of

\[\hat{S} = \sqrt{\frac{(35-51)^2 + (36-51)^2 + (46-51)^2 + (68-51)^2 + (70-51)^2}{5-1}} = \frac{1}{2}\sqrt{(-16)^2 + 15^2 + (-5)^2 + 17^2 + 19^2} = 17. \tag{6}\]

z Scores: Normalizing With the Standard Deviation

To normalize a set of scores using the standard deviation, we divide each score by the standard deviation of this set of scores. In this context, we almost always subtract the mean of the scores from each score prior to dividing by the standard deviation. This normalization is known as z scores. Formally, a set of N scores, each denoted Yn, whose mean is equal to M and whose standard deviation is equal to Ŝ, is transformed into z scores as

\[z_n = \frac{Y_n - M}{\hat{S}}. \tag{7}\]

With elementary algebraic manipulations, it can be shown that a set of z scores has a mean equal to zero and a standard deviation of one. Therefore, z scores constitute a unit-free measure that can be used to compare observations measured with different units.

Example

For example, the scores from vector y (see Equation 1) have a mean of 51 and a standard deviation of 17. These scores can be transformed into the vector z of z scores as

\[\mathbf{z} = \left(\tfrac{35-51}{17},\ \tfrac{36-51}{17},\ \tfrac{46-51}{17},\ \tfrac{68-51}{17},\ \tfrac{70-51}{17}\right)^{\top} = \left(-\tfrac{16}{17},\ -\tfrac{15}{17},\ -\tfrac{5}{17},\ \tfrac{17}{17},\ \tfrac{19}{17}\right)^{\top} = (-0.9412,\ -0.8824,\ -0.2941,\ 1.0000,\ 1.1176)^{\top}. \tag{8}\]

The mean of vector z is now equal to zero, and its standard deviation is equal to one.
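The z-score computation in Equations 5 through 8 can likewise be reproduced numerically. The following NumPy sketch is an illustrative addition, not part of the original entry.

```python
# Illustrative sketch (not from the entry): converting the same scores to
# z scores with NumPy, using the sample standard deviation (ddof=1).
import numpy as np

y = np.array([35, 36, 46, 68, 70], dtype=float)
m = y.mean()                      # 51.0
s = y.std(ddof=1)                 # 17.0
z = (y - m) / s

print(np.round(z, 4))             # [-0.9412 -0.8824 -0.2941  1.      1.1176]
# mean of z is approximately 0 and its sample SD is approximately 1
print(round(z.mean(), 10), round(z.std(ddof=1), 10))
```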
Hervé Abdi and Lynne J. Williams

See also Mean; Normal Distribution; Standard Deviation; Standardization; Standardized Score; Variance; z Score

Further Readings

Abdi, H., Edelman, B., Valentin, D., & Dowling, W. J. (2009). Experimental design and analysis for psychology. Oxford, UK: Oxford University Press.
Date, C. J., Darwen, H., & Lorentzos, N. A. (2003). Temporal data and the relational model. San Francisco: Morgan Kaufmann.
Myatt, G. J., & Johnson, W. P. (2009). Making sense of data II: A practical guide to data visualization, advanced data mining methods, and applications. Hoboken, NJ: Wiley.

NUISANCE VARIABLE

A nuisance variable is an unwanted variable that is typically correlated with the hypothesized independent variable within an experimental study but is typically of no interest to the researcher. It might be a characteristic of the participants under study or any unintended influence on an experimental manipulation. A nuisance variable causes greater variability in the distribution of a sample's scores and affects all the groups measured within a given sample (e.g., treatment and control).

Whereas the distribution of scores changes because of the nuisance variable, the location (or the measure of central tendency) of the distribution remains the same, even when a nuisance variable is present. Figure 1 is an example using participants' scores on a measure that assesses accuracy in responding to simple math problems. Note that the mean is 5.00 for each of the two samples, yet the distributions are actually different. The range for the group without the nuisance variable is 4, with a standard deviation of 1.19. However, the range for the group with the nuisance variable is 8, with a standard deviation of 2.05.

Figure 1   Frequency Distributions With and Without Nuisance Variable
[Two frequency histograms of math accuracy scores (0–10), panels (a) and (b)]
Notes: (a) Without nuisance variable (N = 240). (b) With nuisance variable (N = 240).
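The pattern summarized in Figure 1 (an essentially unchanged mean accompanied by an inflated spread) can be mimicked in a small simulation. The sketch below is an illustrative addition, not part of the original entry; the nuisance noise term and all values are hypothetical.

```python
# Illustrative sketch (not from the entry): a nuisance variable adds extra
# within-group variability without shifting the center of the distribution.
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(loc=5.0, scale=1.2, size=240).round()
nuisance = rng.normal(loc=0.0, scale=1.7, size=240).round()   # e.g., anxiety
with_nuisance = np.clip(true_scores + nuisance, 0, 10)        # keep 0-10 scale

for label, scores in [("without", true_scores), ("with", with_nuisance)]:
    print(f"{label:>7} nuisance: mean = {scores.mean():.2f}, "
          f"SD = {scores.std(ddof=1):.2f}, range = {np.ptp(scores):.0f}")
```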
For the first distribution, participants' math accuracy scores are relatively similar and cluster around the mean of the distribution. There are fewer very high scores and fewer very low scores when compared with the distribution in which the nuisance variable is operating. Within the distribution with the nuisance variable, a wider spread is observed, and there are fewer scores at the mean than in the distribution without the nuisance variable.

In an experimental study, a nuisance variable affects within-group differences for both the treatment group and the control group. When a nuisance variable is present, the spread of scores for each group increases, which makes it more difficult to observe effects that might be attributed to the independent variable (i.e., the treatment effect). When there is greater spread within the distributions of the treatment group and the control group, there is more overlap between the two. This makes the differences between the two groups less clear and distinct.

Figure 2 is an example using treatment and control groups, in which the independent variable is an extracurricular math tutoring program and is received only by the treatment group. Both samples are tested after the administration of the math tutoring program on measures of math accuracy. In this case, a nuisance variable might be the participant's level of anxiety, but it could just as easily be an external characteristic of the experiment, such as the amount of superfluous noise in the room where the measure of math accuracy is being administered. In considering the effects of the nuisance variable, the distributions of the two groups within the sample might look similar to that in Figure 2.

Figure 2   Frequency Distributions of Treatment and Control Groups With and Without Nuisance Variable
[Two frequency histograms of math accuracy scores (0–10) for the control and treatment groups, panels (a) and (b)]
Notes: (a) Without nuisance variable. (b) With nuisance variable.

In the distribution without the nuisance variable, the variation in participants' anxiety is reduced or eliminated, and it is clear that the differences in participants' observed math scores are, more than likely, caused by the manipulation of the independent variable only (i.e., the administration of a math tutorial to the treatment group but not the control group).

In the distribution with the nuisance variable, there is greater variation in the amount of anxiety experienced by participants in both the treatment group and the control group, and if the participant's anxiety influences his or her math accuracy, then greater variation will be observed in the distributions of participants' scores on the measure of math accuracy. For example, someone who performs poorly when feeling anxious might have scored a 4 on math accuracy if this nuisance variable had been removed. However, now that he or she is feeling anxious, his or her score might drop to a 2. Alternatively, others might perform better when experiencing anxiety. In this case, a participant might have originally scored a 6 if he or she felt no anxiety but actually scored an 8 because of the anxiety. In this case, with the nuisance variable present, it becomes more difficult to detect whether statistically significant differences exist between the two groups in question because of the greater variance within the groups.

Researchers aim to control the influences of nuisance variables methodologically and/or statistically, so that differences in the dependent variable might be attributed more clearly to the manipulation of the independent variable. In exercising methodological control, a researcher might choose only those participants who do not experience anxiety when taking tests of this nature. This would eliminate any variation in anxiety and, in turn, any influence that anxiety might have on participants' math accuracy scores. In exercising statistical control, a researcher might employ regression techniques to control for any variation caused by a nuisance variable. However, in this case, it becomes important to specify and measure potential nuisance variables before and during the experiment. In the example used previously, participants could be given a measure of anxiety along with the measure of math accuracy. Multiple regression models might then be used to statistically control for the influence of anxiety on math accuracy scores alongside any experimental treatment that might be administered.
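The statistical-control strategy just described can be sketched with an ordinary least squares model that includes both the treatment indicator and the measured nuisance variable. This example is an illustrative addition, not the entry's own analysis; the data are simulated, and the use of pandas and statsmodels is an assumption.

```python
# Illustrative sketch (not from the entry): statistically controlling for a
# measured nuisance variable (anxiety) while estimating a treatment effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame({
    "treatment": np.repeat([0, 1], n // 2),     # 0 = control, 1 = tutoring
    "anxiety": rng.normal(5, 2, n),             # measured nuisance variable
})
df["math_accuracy"] = (5 + 1.0 * df["treatment"]
                       - 0.4 * df["anxiety"]
                       + rng.normal(0, 1, n))

model = smf.ols("math_accuracy ~ treatment + anxiety", data=df).fit()
print(model.params)        # treatment effect adjusted for anxiety
print(model.pvalues)
```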
The term nuisance variable is often used alongside the terms extraneous and confounding variable. Whereas an extraneous variable influences differences observed between groups, a nuisance variable influences differences observed within groups. By eliminating the effects of nuisance variables, the tests of the null hypothesis become more powerful in uncovering group differences.

Cynthia R. Davis

See also Confounding; Control Variables; Statistical Control

Further Readings

Breaugh, J. A. (2006). Rethinking the control of nuisance variables in theory testing. Journal of Business and Psychology, 20, 429–443.
Meehl, P. (1970). Nuisance variables and the ex post facto design. In M. Radner & S. Winokur (Eds.), Minnesota studies in the philosophy of science: Vol. IV. Analyses of theories and methods of physics and psychology (pp. 373–402). Minneapolis: University of Minnesota Press.

NULL HYPOTHESIS

In many sciences, including ecology, medicine, and psychology, null hypothesis significance testing (NHST) is the primary means by which the numbers comprising the data from some experiment are translated into conclusions about the question(s) that the experiment was designed to address. This entry first provides a brief description of NHST, and within the context of NHST, it defines the most common incarnation of a null hypothesis. Second, this entry sketches other less common forms of a null hypothesis. Third, this entry articulates several problems with using null hypothesis-based data analysis procedures.

Null Hypothesis Significance Testing and the Null Hypothesis

Most experiments entail measuring the effect(s) of some number of independent variables on some dependent variable.

An Example Experiment

In the simplest sort of experimental design, one measures the effect of a single independent variable, such as the amount of information held in short-term memory, on a single dependent variable, such as the reaction time to scan through this information. To pick a somewhat arbitrary example from cognitive psychology, consider what is known as a Sternberg experiment, in which a short sequence of memory digits (e.g., "34291") is read to an observer who must then decide whether a single, subsequently presented test digit was part of the sequence. Thus, for instance, given the memory digits above, the correct answer would be "yes" for a test digit of "2" but "no" for a test digit of "8." The independent variable of "amount of information held in short-term memory" can be implemented by varying set size, which is the number of memory digits presented: In different conditions, the set size might be, say, 1, 3, 5 (as in the example), or 8 presented memory digits. The number of different set sizes (here 4) is more generally referred to as the number of levels of the independent variable. The dependent variable is the reaction time measured from the appearance of the test digit to the observer's response. Of interest in general is the degree to which the magnitude of the dependent variable (here, reaction time) depends on the level of the independent variable (here, set size).

Sample and Population Means

Typically, the principal dependent variable takes the form of a mean. In this example, the mean reaction time for a given set size could be computed across observers. Such a computed mean is called a sample mean, referring to its having been computed across an observed sample of numbers. A sample mean is construed as an estimate of a corresponding population mean, which is what the mean value of the dependent variable would be if all observers in the relevant population were to participate in a given condition of the experiment. Generally, conclusions from experiments are meant to apply to population means. Therefore, the measured sample means are only interesting insofar as they are estimates of the corresponding population means.

Notationally, the sample means are referred to as the Mj's, whereas the population means are referred to as the μj's. For both sample and population means, the subscript "j" indexes the level of the independent variable; thus, in our example, M2 would refer to the observed mean reaction time of the second set-size level (i.e., set size = 3), and likewise, μ2 would refer to the corresponding, unobservable population mean reaction time corresponding to set size = 3.

Two Competing Hypotheses

NHST entails establishing and evaluating two mutually exclusive and exhaustive hypotheses about the relation between the independent variable and the dependent variable. Usually, and in its simplest form, the null hypothesis (abbreviated H0) is that the independent variable has no effect on the dependent variable, whereas the alternative hypothesis (abbreviated H1) is that the independent variable has some effect on the dependent variable. Note an important asymmetry between a null hypothesis and an alternative hypothesis: A null hypothesis is an exact hypothesis, whereas an alternative hypothesis is an inexact hypothesis. By this it is meant that a null hypothesis can be correct in only one way, viz, the μj's are all equal to one another, whereas there are an infinite number of ways in which the μj's can be different from one another (i.e., an infinite number of ways in which an alternative hypothesis can be true).

Decisions Based on Data

Having established a null and an alternative hypothesis that are mutually exclusive and exhaustive, the experimental data are used to—roughly speaking—decide between them. The technical manner by which one makes such a decision is beyond the scope of this entry, but two remarks about the process are appropriate here.

1. A major ingredient in the decision is the variability of the Mj's. To the degree that the Mj's are close to one another, evidence ensues for possible equality of the μj's and, ipso facto, validity of the null hypothesis. Conversely, to the degree that the Mj's differ from one another, evidence ensues for associated differences among the μj's and, ipso facto, validity of the alternative hypothesis.

2. The asymmetry between the null hypothesis (which is exact) and the alternative hypothesis (which is inexact) sketched previously implies an associated asymmetry in conclusions about their validity. If the Mj's differ sufficiently, one "rejects the null hypothesis" in favor of accepting the alternative hypothesis. However, if the Mj's do not differ sufficiently, one does not "accept the null hypothesis" but rather one "fails to reject the null hypothesis." The reason for the awkward, but logically necessary, wording of the latter conclusion is that, because the alternative hypothesis is inexact, one cannot generally distinguish a genuinely true null hypothesis on the one hand from an alternative hypothesis entailing small differences among the μj's on the other hand.
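For the one-factor Sternberg example, the decision between H0 and H1 is commonly reached with a one-way analysis of variance. The sketch below is an illustrative addition, not part of the original entry; the reaction times are simulated, and SciPy is assumed for the computation.

```python
# Illustrative sketch (not from the entry): a one-way ANOVA on simulated
# Sternberg-task reaction times, leading to "reject" or "fail to reject" H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
set_sizes = [1, 3, 5, 8]
# Simulated per-observer reaction times (ms), roughly 35 ms per memory digit
groups = [rng.normal(400 + 35 * k, 60, size=20) for k in set_sizes]

f, p = stats.f_oneway(*groups)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"F = {f:.2f}, p = {p:.4g} -> {decision}")
```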
Multifactor Designs: Multiple Null Hypothesis–Alternative Hypothesis Pairings

So far, this entry has described a simple design in which the effect of a single independent variable on a single dependent variable is examined. Many, if not most, experiments use multiple independent variables and are known as multifactor designs ("factor" and "independent variable" are synonymous). Continuing with the example experiment, imagine that in addition to measuring the effects of set size on reaction time in a Sternberg task, one also wanted to measure simultaneously the effects on reaction time of the test digit's visual contrast (informally, the degree to which the test digit stands out against the background). One might then factorially combine the four levels of set size (now called "factor 1") with, say, two levels, "high contrast" and "low contrast," of test-digit contrast (now called "factor 2"). Combining the four set-size levels with the two test-digit contrast levels would yield 4 × 2 = 8 separate conditions.

Typically, three independent NHST procedures would then be carried out, entailing three null hypothesis–alternative hypothesis pairings. They are as follows:

1. For the set-size main effect:

H0: Averaged over the two test-digit contrasts, there is no set-size effect.
H1: Averaged over the two test-digit contrasts, there is a set-size effect.

2. For the test-digit contrast main effect:

H0: Averaged over the four set sizes, there is no test-digit contrast effect.
H1: Averaged over the four set sizes, there is a test-digit contrast effect.

3. For the set-size by test-digit contrast interaction:

Two independent variables are said to interact if the effect of one independent variable depends on the level of the other independent variable. As with the main effects, interaction effects are immediately identifiable with respect to the Mj's; however, again as with main effects, the goal is to decide whether interaction effects exist with respect to the corresponding μj's. As with the main effects, NHST involves pitting a null hypothesis against an associated alternative hypothesis.

H0: With respect to the μj's, set size and test-digit contrast do not interact.
H1: With respect to the μj's, set size and test-digit contrast do interact.

The logic of carrying out NHST with respect to interactions is the same as the logic of carrying out NHST with respect to main effects. In particular, with interactions as with main effects, one can reject a null hypothesis of no interaction, but one cannot accept a null hypothesis of no interaction.
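The three pairings listed above correspond to the three rows of a two-way ANOVA table. The following sketch is an illustrative addition, not part of the original entry; the reaction times are simulated, and pandas and statsmodels are assumed for the computation.

```python
# Illustrative sketch (not from the entry): the two main effects and the
# interaction of the 4 x 2 design (set size, test-digit contrast).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(11)
rows = []
for set_size in [1, 3, 5, 8]:
    for contrast in ["high", "low"]:
        rt = rng.normal(400 + 35 * set_size + (30 if contrast == "low" else 0),
                        60, size=20)
        rows += [{"set_size": set_size, "contrast": contrast, "rt": r} for r in rt]
df = pd.DataFrame(rows)

model = smf.ols("rt ~ C(set_size) * C(contrast)", data=df).fit()
# Rows of the table: set-size main effect, contrast main effect, interaction
print(anova_lm(model, typ=2))
```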
Non-Zero-Effect Null Hypotheses

The null hypotheses described above imply "no effect" of one sort or another—either no main effect of some independent variable or no interaction between two independent variables. This kind of no-effect null hypothesis is by far the most common null hypothesis to be found in the literature. Technically, however, a null hypothesis can be any exact hypothesis; that is, the null hypothesis of "all μj's are equal to one another" is but one special case of what a null hypothesis can be.

To illustrate another form, let us continue with the first, simpler Sternberg-task example (set size is the only independent variable), but imagine that prior research justifies the assumption that the relation between set size and reaction time is linear. Suppose also that research with digits has yielded the conclusion that reaction time increases by 35 ms for every additional digit held in short-term memory; that is, if reaction time were plotted against set size, the resulting function would be linear with a slope of 35 ms.

Now, let us imagine that the Sternberg experiment is done with words rather than digits. One could establish the null hypothesis that "short-term memory processing proceeds at the same rate with words as it does with digits" (i.e., that the slope of the reaction time versus set-size function would be 35 ms for words just as it is known to be with digits). The alternative hypothesis would then be "for words, the function's slope is anything other than 35 ms." Again, the fundamental distinction between a null and alternative hypothesis is that the null hypothesis is exact (35 ms/digit), whereas the alternative hypothesis is inexact (anything else). This distinction would again drive the asymmetry between conclusions that was articulated previously: A particular pattern of empirical results could logically allow "rejection of the null hypothesis," that is, "acceptance of the alternative hypothesis," but not "acceptance of the null hypothesis."

Problems With Null Hypothesis Significance Testing

No description of NHST in general, or a null hypothesis in particular, is complete without at least a brief account of the serious problems that accrue when NHST is the sole statistical technique used for making inferences about the μj's from the Mj's. Briefly, three major problems involving a null hypothesis as the centerpiece of data analysis are discussed below.

A Null Hypothesis Cannot Be Literally True

In most sciences, it is almost a self-evident truth that any independent variable must have some effect, even if small, on any dependent variable. This is certainly true in psychology. In the Sternberg task, to illustrate, it is simply implausible that set size would have literally zero effect on reaction time (i.e., that the μj's corresponding to the different set sizes would be identical to an infinite number of decimal places). Therefore, rejecting a null hypothesis—which, as noted, is the only strong conclusion that is possible within the context of NHST—tells the investigator nothing that the investigator should have been able to realize was true beforehand. Most investigators do not recognize this, but that does not prevent it from being so.

Human Nature Makes Acceptance of a Null Hypothesis Almost Irresistible

Earlier, this entry detailed why it is logically forbidden to accept a null hypothesis. However, human nature dictates that people do not like to make weak yet complicated conclusions such as "We fail to reject the null hypothesis." Scientific investigators, generally being humans, are not exceptions. Instead, a "fail to reject" decision, which is dutifully made in an article's results section, often morphs into "the null hypothesis is true" in the article's discussion and conclusions sections. This kind of sloppiness, although understandable, has led to no end of confusion and general scientific mischief within numerous disciplines.

Null Hypothesis Significance Testing Emphasizes Barren, Dichotomous Conclusions

Earlier, this entry described that the pattern of population means—the relations among the unobservable μj's—is of primary interest in most scientific experiments and that the observable Mj's are estimates of the μj's. Accordingly, it should be of great interest to assess how good the Mj's are as estimates of the μj's. If, to use an extreme example, the Mj's were perfect estimates of the μj's, there would be no need for statistical analysis: The answers to any question about the μj's would be immediately available from the data. To the degree that the estimates are less good, one must exercise concomitant caution in using the Mj's to make inferences about the μj's.

None of this is relevant within the process of NHST, which does not in any way emphasize the degree to which the Mj's are good estimates of the μj's. In its typical form, NHST allows only a limited assessment of the nature of the μj's: Are they all equal or not? Typically, the "no" or "not necessarily no" conclusion that emerges from this process is insufficient to evaluate the totality of what the data might potentially reveal about the nature of the μj's.

An alternative that is gradually emerging within several NHST-heavy sciences—an alternative that is common in the natural sciences—is the use of confidence intervals, which assess directly how good a Mj is as an estimate of the corresponding μj. Briefly, a confidence interval is an interval constructed around a sample mean that, with some pre-specified probability (typically 95%), includes the corresponding population mean. A glance at a set of plotted Mj's with associated plotted confidence intervals provides immediate and intuitive information about (a) the most likely pattern of the μj's and (b) the reliability of the pattern of Mj's as an estimate of the pattern of μj's. This in turn provides immediate and intuitive information both about the relatively uninteresting question of whether some null hypothesis is true and about the much more interesting questions of what the pattern of μj's actually is and how much belief can be placed in it based on the data at hand.
Geoffrey R. Loftus

See also Confidence Intervals; Hypothesis; Research Hypothesis; Research Question

Further Readings

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.
Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.
Fidler, F., Burgman, M., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539–1544.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7, 1–20.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Lecoutre, M. P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of null hypothesis significance tests. International Journal of Psychology, 38, 37–45.
Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.
Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33–38.

NUREMBERG CODE

The term Nuremberg Code refers to the set of standards for conducting research with human subjects that was developed in 1947 at the end of World War II, in the trial of 23 Nazi doctors and scientists in Nuremberg, Germany, for war crimes that included medical experiments on persons designated as non-German nationals. The trial of individual Nazi leaders by the Nuremberg War Crimes Tribunal, the supranational institution charged with determining justice in the transition to democracy, set a vital precedent for international jurisprudence.

The Nuremberg Code was designed to protect the autonomy and rights of human subjects in medical research, as compared with the Hippocratic Oath applied in the therapeutic, paternalistic patient-physician relationship. It is recognized as initiating the modern international human rights movement during social construction of ethical codes, with the Universal Declaration of Human Rights in 1954. Human subjects abuse by Nazi physicians occurred despite German guidelines for protection in experimentation, as noted by Michael Branigan and Judith Boss. Because the international use of prisoners in research had grown during World War II, the Code required that children, prisoners, and patients in mental institutions were not to be used as subjects in experiments. However, it was reinterpreted to expand medical research in the Declaration of Helsinki by the World Medical Association in 1964.

The legal judgment in the Nuremberg trial by a panel of American and European physicians and scientists contained 10 moral, ethical, and legal requirements to guide researchers in experiments with human subjects. These requirements are as follows: (1) voluntary informed consent based on legal capacity and without coercion is essential; (2) research should be designed to produce results for the good of society that are not obtainable by other means; (3) human subjects research should be based on prior animal research; (4) physical and mental suffering must be avoided; (5) no research should be conducted for which death or disabling injury is anticipated; (6) risks should be justified by anticipated humanitarian benefits; (7) precautions and facilities should be provided to protect research subjects against potential injury, disability, or death; (8) research should only be conducted by qualified scientists; (9) the subject should be able to end the study during the research; and (10) the scientist should be able to end the research at any stage if potential for injury, disability, or death of the subject is recognized.

Impact on Human Subjects Research

At the time the Nuremberg Code was formulated, many viewed it as created in response to Nazi medical experimentation and without legal authority in the United States and Europe. Some American scientists considered the guidelines implicit in their human subjects research, applying to nontherapeutic research in wartime. The informed consent requirement was later incorporated into biomedical research, and physicians continued to be guided by the Hippocratic Oath for clinical research.

Reinterpretation of the Nuremberg Code in the Declaration of Helsinki for medical research modified requirements for informed consent and subject recruitment, particularly in pediatrics, psychiatry, and research with prisoners. Therapeutic research was distinguished from nontherapeutic research, and therapeutic privilege was legitimated in the patient-physician relationship.

However, social and biomedical change, as well as the roles of scientists and ethicists in research
and technology, led to the creation of an explicitly subjects-centered approach to human rights. This has been incorporated into research in medicine and public health, and in behavioral and social sciences.

Biomedical, Behavioral, and Community Research

With the enactment of federal civil and patient rights legislation, the erosion of public trust, and the extensive criticism of ethical violations and discrimination in the Tuskegee syphilis experiments by the United States Public Health Service (1932–1971), biomedical and behavioral research with human subjects became regulated by academic and hospital-based institutional review boards. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research was established in the United States in 1974, with requirements for review boards in institutions supported by the Department of Health, Education and Welfare.

In 1979, the Belmont Report set four ethical principles for human subjects research: (1) beneficence and nonmalfeasance, to maximize benefits and minimize risk of harm; (2) respect for autonomy in decision-making and protection of those with limited autonomy; (3) justice, for fair treatment; and (4) equitable distribution of benefits and risks. The Council for International Organizations of Medical Sciences and the World Health Organization formulated International Ethical Guidelines for Biomedical Research Involving Human Subjects for research ethics committees in 1982. Yet in 1987, the United States Supreme Court refused to endorse the Nuremberg Code as binding on all research, and it was not until 1997 that national security research had to be based on informed consent.

Institutional review boards (IRBs) were established to monitor informed consent and avoid risk and exploitation of vulnerable populations. Research organizations and Veterans Administration facilities might be accredited by the Association for Accreditation of Human Research Protection Programs. Although IRBs originated for biomedical and clinical research, their purview extends to behavioral and social sciences. Nancy Shore and colleagues note that public health emphasis on community-based participatory research is shifting focus from protection of individual subjects to ethical relationships with community members and organizations as partners.

Yet global clinical drug trials by pharmaceutical companies and researchers' efforts to limit restrictions on placebo-controlled trials have contributed to the substitution of "Good Clinical Practice Rules" for the Declaration of Helsinki by the U.S. Food and Drug Administration in 2004. The development of rules by regulators and drug industry trade groups and the approval by untrained local ethics committees in developing countries could diminish voluntary consent and benefits for research subjects.

Subsequent change in application of the Nuremberg Code has occurred with the use of prisoners in clinical drug trials, particularly for HIV drugs, according to Branigan and Boss. This practice is illegal for researchers who receive federal support but is legal in some states. The inclusion of prisoners and recruitment of subjects from ethnic or racial minority groups for clinical trials might offer them potential benefits, although it must be balanced against risk and need for justice.

Sue Gena Lurie

See also Declaration of Helsinki

Further Readings

Beauchamp, D., & Steinbock, B. (1999). New ethics for the public's health. Oxford, UK: Oxford University Press.
Branigan, M., & Boss, J. (2001). Human and animal experimentation. In Healthcare ethics in a diverse society. Mountain View, CA: Mayfield.
Cohen, J., Bankert, E., & Cooper, J. (2006). History and ethics. CITI course in the protection of human research subjects. Retrieved December 6, 2006, from http://www.citiprogram.org/members/courseandexam/moduletext
Elster, J. (2004). Closing the books: Transitional justice in historical perspective. New York: Cambridge University Press.
Farmer, P. (2005). Rethinking health and human rights. In Pathologies of power. Berkeley: University of California Press.
Levine, R. (1981). The Nuremberg Code. In Ethics and regulation of clinical research. Baltimore: Urban & Schwarzenberg.
Levine, R. (1996). The Institutional Review Board. In S. S. Coughlin & T. L. Beauchamp (Eds.), Ethics and epidemiology. New York: Oxford University Press.
Lifton, R. (2000). The Nazi doctors. New York: Basic Books.
Morrison, E. (2008). Health care ethics: Critical issues for the 21st century. Sudbury, MA: Jones & Bartlett.
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Washington, DC: U.S. Department of Health, Education and Welfare.
Shah, S. (2006). The body hunters: Testing new drugs on the world's poorest patients. New York: New Press.
Shore, N., Wong, K., Seifer, S., Grignon, J., & Gamble, V. (2008). Introduction: Advancing the ethics of community-based participatory research. Journal of Empirical Research on Human Research Ethics, 3, 1–4.

NVIVO

NVivo provides software tools to assist a researcher from the time of conceptualization of a project through to its completion. Although NVivo is software that is designed primarily for researchers undertaking analysis of qualitative (text and multimedia) data, its usefulness extends to researchers engaged in any kind of research. The tools provided by NVivo assist in the following:

• Tracking and management of data sources and information about these sources
• Tracking and linking ideas associated with or derived from data sources
• Searching for terms or concepts
• Indexing or coding text or multimedia information for easy retrieval
• Organizing codes to provide a conceptual framework for a study
• Querying relationships between concepts, themes, or categories
• Building and drawing visual models with links to data

Particular and unique strengths of NVivo lie in its ability to facilitate work involving complex data sources in a variety of formats, in the range of query tools it offers, and in its ability to link quantitative with qualitative data.

The consideration of NVivo is relevant to the practical task of research design in two senses: Its tools are useful when designing or preparing for a research project, and if it is to be used also for the analysis of qualitative or mixed-methods data, then consideration needs to be given to designing and planning for its use.

Designing With NVivo

NVivo can assist in the research design process in (at least) three ways, regardless of the methodological approach to be adopted in the research: keeping a research journal, working with literature, and building conceptual models.

Keeping a Journal

Keeping a record of decisions made when planning and conducting a research project, tracking events occurring during the project (foreseen and unforeseen), or even recording random thoughts about the project will assist an investigator to prepare an accurate record of the methods adopted for the project and the rationale for those methods. Journaling serves also to stimulate thinking, to prevent loss of ideas that might be worthy of follow-up, and to provide an audit trail of development in thinking toward final conclusions.

NVivo can work with text that has been recorded using Microsoft Word, or a journal can be recorded progressively within NVivo. Some researchers keep all their notes in a single document, whereas others prefer to make several documents, perhaps to separate methodological from substantive issues. The critical contribution that NVivo can make here is to assist the researcher in keeping track of their ideas through coding the content of the journal. A coding system in NVivo works rather like an index but with a bonus. The text in the document is highlighted and tagged with a code (a label that the researcher devises). Codes might be theoretically based and designed a priori, or they can be created or modified (renamed, rearranged, split, combined, or content recoded) in an emergent way as the project proceeds. The bonus of using codes in NVivo is that all materials that have been coded in a particular way can be retrieved together, and if needed for clarification, any coded segment can be shown
within its original context. A research journal is a notoriously messy document, comprising random thoughts, notes of conversations, ideas from reading, and perhaps even carefully considered strategies that entered in no particular sequence. By coding their journal, researchers can find instantly any thoughts or information they have on any particular aspect of their project, regardless of how messy the original document was. This different view of what has been written not only brings order out of chaos and retrieves long-forgotten thoughts but also prompts deeper thinking and perhaps reconceptualization of that topic after visualizing all the material on one topic together.

Working With Literature

NVivo's coding system can be used also to index, retrieve, and synthesize what is learned from reading across the substantive, theoretical, or methodological literature during the design phase of a project. In the same way that a journal can be coded, either notes derived from reading or the text of published articles can be coded for retrieval and reconsideration according to either an a priori or emergent system (or a combination thereof). Thus, the researcher can locate and bring together all their material from any of their references on, for example, the concept of equity, debates about the use of R², or the role of antioxidants in preventing cancer. With appropriate setting up, the author(s) and year of publication for any segment of coded text can be retrieved alongside the text, which facilitates the preparation of a written review of the literature. The database created in NVivo becomes available for this and many more projects, and it serves as an ongoing database for designing, conducting, and writing up future projects.

A preliminary review of the literature can be extended into a more thorough analysis by drawing on NVivo's tools for recording and using information about sources (referred to as attributes) in comparative analyses or by examining the relationship between, say, perspectives on one topic and what is said about another. Recorded information about each reference could be used, for example, to review changes over time in perspectives on a particular concept or perhaps to compare the key concerns of North American versus European writers. The associations between codes, or between attributes and codes, could be used to review the relationship between an author's theoretical perspective and his or her understanding of the likely impact of a planned intervention.

Searching text in NVivo provides a useful supplement to coding. One could search the reports of a group of related studies, for example, for the alternative words "extraneous OR incidental OR unintended" to find anything written about the potential impact of extraneous variables on the kind of experiment being planned.

Building Conceptual Models

Researchers often find it useful at the start of a project to "map" their ideas about their experimental design or about what they are expecting to find from their data gathering. Doing so can help to identify all the factors that will possibly impinge on the research process and to clarify the pathways by which different elements will impact on the process and its outcomes. As a conceptual or process model is drawn, fresh awareness of sampling or validity issues might be prompted and solutions sought. NVivo provides a modeling tool in which items and their links can be shown. A variety of shapes can be used in designing the model. Project items such as codes, cases, or attributes can be added to the model, and where coding is present, these codes provide a direct link back to the data they represent. Labels can be added to links (which might be associative, unidirectional, or bidirectional), and styles (e.g., color, fill, and font) can be used to emphasize the significance of different items or links. The items can be grouped so that they can be turned on or off in the display. The models can be archived, allowing the researcher to continue to modify their model as their understanding grows while keeping a historical record of their developing ideas.

Designing for Analysis With NVivo

Where the intention is to use NVivo for analysis of qualitative or mixed-methods data, there are
946 NVivo

several points to be aware of at the design stage of the project and again when preparing data to facilitate the effective (and efficient) use of NVivo's coding and query tools.

NVivo's data management system is based on the notion of the "case," with the case being best thought of as the unit of analysis for the current study. Most often, this is an individual; alternatively, the case might be a workgroup, a document, a site, or a message. The information gathered for a case might come from a single document, such as an interview or response to a survey. It might be across several documents, such as repeat interviews in a longitudinal study, or when data about the case come through multiple sources or methods; the information might come from several parts of a document, such as for a member of a focus group. All the information for a case is held together, along with attribute data relating to that case, in a node (NVivo's name for the "buckets" that hold coding)—one for each case. Additionally, sources (and nodes) can be organized in sets. Thus, for example, in a study of children with attention-deficit/hyperactivity disorder (ADHD), one might interview the children, their parents, their teachers and their doctors, observe the families in interaction, and record a number of demographic and scaled measures about each child. All the information relating to each child would be referenced in a case node for that child, allowing for within-case and cross-case analysis, but as well, one could create a set of child interviews, a set of teacher interviews, and so on. Once the content of the various data sources has also been coded for concepts and themes, the flexibility of this data management system allows for a wide range of analysis strategies.

As well as recording notes, verbatim text, or multimedia data, the astute researcher will plan for recording demographic and other categorical or scaled information when designing data collection. This information will be entered into NVivo as attributes of cases and used for comparative analyses—a valuable tool not only for comparing patterns of differences across subgroups but also for revealing subtleties within the text. Attribute data can be entered interactively within the software, or it can be entered in a spreadsheet or database as case-by-variable data and imported using a text file format.
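As a rough illustration of the case-by-variable layout described above, the short Python sketch below builds a small attribute table with one row per case and one column per attribute and writes it out as a tab-delimited text file. The case names, attribute columns, and use of pandas are assumptions made for this illustration; they are not NVivo's import specification.

    # Sketch: a case-by-variable attribute table, one row per case,
    # saved as a tab-delimited text file (column names are hypothetical).
    import pandas as pd

    attributes = pd.DataFrame(
        {
            "Case": ["Child01", "Child02", "Child03"],   # common case identifiers
            "Age": [9, 10, 8],
            "Gender": ["F", "M", "F"],
            "ADHD_subtype": ["inattentive", "combined", "hyperactive"],
            "Symptom_score": [23, 31, 27],
        }
    )

    # Write the table as plain text so it can be imported alongside the
    # coded qualitative data and matched on the case identifier.
    attributes.to_csv("case_attributes.txt", sep="\t", index=False)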
Data Preparation

Most often, data imported into NVivo are documents prepared in Word, although it is also possible to import and code video, audio, and image data, and .pdf documents. There is no spell-check function in NVivo, so careful editing of the file before import is recommended (this is especially so if someone other than the interviewer has done the transcription).

Where each file to be imported represents unstructured or semistructured material for a single case, detailed preparation for importing is not necessary. Where, however, Word files each contain information for several cases, such as from a focus group or a series of internet messages, or if the individual's responses are to a series of structured questions (such as from a survey), there is much to be gained by a careful preparation of the file. Specifically, the application of styled headings which identify each case (i.e., who was speaking or whose response this was) or each question will allow for the use of an automatic coding tool to assign passages of text to individual cases, to code responses for the questions being answered, or both.

Video and audio data can be imported and coded without transcription. Alternatively, the transcription can be imported along with the multimedia file, or it can be transcribed within NVivo.

Designing for Integration of Qualitative and Quantitative Data and Analyses

If the intention is to link quantitative and qualitative data for each case, it is critical to have a common identifier for both sources of data, so that it can be matched within NVivo. Quantitative data are most commonly used as a basis for subgroup comparisons of open responses, or these data can be used to define subsets of the data as the basis for specific or repeated analyses.

To go beyond simply retrieving coded text, NVivo's query function facilitates asking relational questions of the data. These include comparative and pattern analyses that are typically presented in matrix format (a qualitative cross-tabulation with both numbers and text available); questions about associations between, say, conditions and actions; questions to explore negative cases; and questions that arise as a consequence of asking other questions (so that results from one question are fed into another). Queries can be saved so that they can be run again, with more data or with a different subset of data. Relevant text is retrieved for review and drawing inferences; patterns of coding reflected in numbers of sources, cases, or words are available in numeric form or as charts. All coding information, including results from queries, can be exported in numeric form for subsequent statistical analysis if appropriate—but always with the supporting text readily available in the NVivo database to give substance to the numbers.

Pat Bazeley

See also Demographics; Focus Group; Interviewing; Literature Review; Mixed Methods Design; Mixed Model Design; Observations; Planning Research; Qualitative Research

Further Readings

Bazeley, P. (2006). The contribution of computer software to integrating qualitative and quantitative data and analyses. Research in the Schools, 13, 64–73.
Bazeley, P. (2007). Qualitative data analysis with NVivo. Thousand Oaks, CA: Sage.
Richards, L. (2005). Handling qualitative data. Thousand Oaks, CA: Sage.

Websites

QSR International: http://www.qsrinternational.com
Research Support Pty. Limited: http://www.researchsupport.com.au
O
OBSERVATIONAL RESEARCH

The observation of human and animal behavior has been referred to as the sine qua non of science, and indeed, any research concerning behavior ultimately is based on observation. A more specific term, naturalistic observation, traditionally has referred to a set of research methods wherein the emphasis is on capturing the dynamic or temporal nature of behavior in the environment where it naturally occurs, rather than in a laboratory where it is experimentally induced or manipulated. What is unique about the more general notion of observational research, however, and what has made it so valuable to science is the fact that the process of direct systematic observation (that is, the what, when, where, and how of observation) can be controlled to varying degrees, as necessary, while still permitting behavior to occur naturally and over time. Indeed, the control of what Roger Barker referred to as "the stream of behavior," in his 1962 book by that title, may range from a simple specification of certain aspects of the context for comparative purposes (e.g., diurnal vs. nocturnal behaviors) to a full experimental design involving the random assignment of participants to strictly specified conditions.

Even the most casual observations have been included among these research methods, but they typically involve, at a minimum, a systematic process of specifying, selecting, and sampling behaviors for observation. The behaviors considered might be maximally inclusive, such as in the case of the ethogram, which attempts to provide a comprehensive description of all of the characteristic behavior patterns of a species, or they might be restricted to a much smaller set of behaviors, such as the social behaviors of jackdaws, as studied by the Nobel Prize–winning ethologist Konrad Lorenz, or the facial expressions of emotion in humans, as studied by the psychologist Paul Ekman. Thus, the versatile set of measurement methods referred to as observational research emphasizes temporally dynamic behaviors as they naturally occur, although the conditions of observation and the breadth of behaviors observed will vary with the research question(s) at hand.

Because of the nature of observational research, it is often better suited to hypothesis generation than to hypothesis testing. When hypothesis testing does occur, it is limited to the study of the relationship(s) between/among behaviors, rather than to the causal links between them, as is the focus of experimental methods with single or limited behavioral observations and fully randomized designs. This entry discusses several aspects of observational research: its origins, the approaches, special considerations, and the future of observational research.

Origins

Historically, observational research has its roots in the naturalistic observational methods of Charles Darwin and other naturalists studying nonhuman animals. The work of these 19th-century scientists spawned the field of ethology, which is defined as the study of the behavior of animals in their natural habitats. Observational methods are the primary research tools of the ethologist. In the study of human behaviors, a comparable approach is that of ethnography, which combines several research techniques (observations, interviews, and archival and/or physical trace measures) in a long-term investigation of a group or culture. This technique also involves immersion and even participation in the group being studied in a method commonly referred to as participant observation.

The use of observational research methods of various kinds can be found in all of the social sciences—including, but not limited to, anthropology, sociology, psychology, communication, political science, and economics—and in fields that range from business to biology, and from education to entomology. These methods have been applied in innumerable settings, from church services to prisons to psychiatric wards to college classrooms, to name a few.

Distinctions Among Methods

Whether studying humans or other animals, one of the important distinctions among observational research methods is whether the observer's presence is overt or obtrusive to the participants or covert or unobtrusive. In the former case, researchers must be wary of the problem of reactivity of measurement; that is, of measurement procedures where the act of measuring may, in all likelihood, change the behavior being measured. Reactivity can operate in a number of ways. For example, the physical space occupied by an observer under a particular tree or in the corner of a room may militate against the occurrence of the behaviors that would naturally occur in that particular location. More likely, at least in the case of the study of human behavior, participants may attempt to control their behaviors in order to project a certain image. One notable example in this regard has been termed evaluation apprehension. Specifically, human participants who know that they are being observed might feel apprehensive about being judged or evaluated and might attempt to behave in ways that they believe put them in the most positive light, as opposed to behaving as they would naturally in the absence of an observer. For example, anthropological linguists have observed the hypercorrection of speech pronunciation "errors" in lower- and working-class women when reading a list of words to an experimenter compared to when speaking casually. Presumably, compared to upper-middle-class speakers, they felt a greater need to "speak properly" when it was obvious that their pronunciation was the focus of attention. Although various techniques exist for limiting the effects of evaluation apprehension, obtrusive observational techniques can never fully guarantee the nonreactivity of their measurements.

In the case of unobtrusive observation, participants in the research are not made aware that they are being observed (at least not at the time of observation). This can effectively eliminate the problem of measurement reactivity, but it presents another issue to consider when the research participants are humans; namely, the ethics of making such observations. In practice, ethical considerations have resulted in limits to the kinds of behaviors that can be observed unobtrusively, as well as to the techniques (for example, the use of recording devices) that can be employed. If the behavior occurs in a public place where the person being observed cannot reasonably expect complete privacy, the observations may be considered acceptable. Another guideline involves the notion of minimal risk. Generally speaking, procedures that involve no greater risk to participants than they might encounter in everyday life are considered acceptable. Before making unobtrusive observations, researchers should take steps to solicit the opinions of colleagues and others who might be familiar with issues of privacy, confidentiality, and minimal risk in the kinds of situations involved in the research. Research conducted at institutions that receive federal funding will have an institutional review board composed of researchers and community members who review research protocols involving human participants and who will assist researchers in determining appropriate ethical procedures in these and other circumstances.

Special Considerations

Observational research approaches generally include many more observations or data points than typical experimental approaches, but they, too, are
reductionistic in nature; that is, although relatively more behaviors are observed and assessed, not all behaviors that occur during data collection may be studied. This fact raises some special considerations.

How Will the Behaviors Being Studied Be Segmented?

Aristotle claimed that "natural" categories are those that "carve at the joint." Some behaviors do seem to segment relatively easily via their observable features, such as speaking turns in conversation, or the beginning and end of an eye blink. For many other behaviors, beginnings and endings may not be so clear. Moreover, research has shown that observers asked to segment behaviors into the smallest units they found to be natural and meaningful formed different impressions than observers asked to segment behaviors into the largest units they found natural and meaningful, despite observing the same videotaped series of behaviors. The small-unit observers also were more confident of their impressions. Consumers of observational research findings should keep in mind that different strategies for segmenting behavior may result in different kinds of observations and inferences.

How Will Behavior Be Classified or Coded?

The central component of all observational systems is sometimes called a behavior code, which is a detailed description of the behaviors and/or events to be observed and recorded. Often, this code is referred to as a taxonomy of behavior. The best taxonomies consist of a set of categories with the features of being mutually exclusive (that is, every instance of an observed behavior fits into one and only one category of the taxonomy) and exhaustive (that is, every instance of an observed behavior fits into one of the available categories of the taxonomy).
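To make the mutually exclusive and exhaustive requirements concrete, the following Python sketch treats a behavior code as a simple lookup structure and flags observed events that either fit no category or fit more than one. The categories and events are invented for illustration and are not drawn from any published taxonomy.

    # Sketch: a toy taxonomy of conversational behavior (categories invented).
    # Each observed event should map to exactly one category; anything
    # unmapped signals that the taxonomy is not yet exhaustive.
    taxonomy = {
        "question": {"asks question", "requests clarification"},
        "statement": {"gives opinion", "gives information"},
        "backchannel": {"nods", "says mm-hm"},
    }

    def classify(event):
        matches = [code for code, behaviors in taxonomy.items() if event in behaviors]
        if len(matches) == 1:
            return matches[0]
        if not matches:
            return "UNCLASSIFIED"  # taxonomy is not exhaustive for this event
        raise ValueError(f"{event!r} fits {matches}: categories are not mutually exclusive")

    observed = ["asks question", "nods", "laughs"]
    print([classify(e) for e in observed])  # ['question', 'backchannel', 'UNCLASSIFIED']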
Are the Classifications of Observed Behaviors Reliable Ones?

The coding of behaviors according to the categories of a taxonomy has, as a necessary condition, that the coding judgments are reliable ones. In the case of intrarater reliability, this means that an observer should make the same judgments regarding behavior codes if the behaviors are observed and classified again at another time. In the case of interrater reliability, two (or more) judges independently viewing the behaviors should make the same classifications or judgments. Although in practice reliability estimates seldom involve perfect agreement between judgments made at different times or by different coders, there are standards of disagreement accepted by researchers based upon the computations of certain descriptive and inferential statistics. The appropriate statistic(s) to use to make a determination of reliability depends upon the nature of the codes/variables being used. Correlations often are computed for continuous variables or codes (that is, for classifications that vary along some continuum; for example, degrees of displayed aggression), and Cohen's kappa coefficients often are computed for discrete or categorical variables or codes; for example, types of hand gestures.
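As a worked illustration of the kind of chance-corrected agreement statistic mentioned above, the Python sketch below computes Cohen's kappa for two coders who classified the same 10 events; the gesture labels and codings are hypothetical.

    # Sketch: Cohen's kappa for two coders assigning categorical codes
    # (hypothetical hand-gesture labels) to the same 10 observed events.
    from collections import Counter

    coder_a = ["point", "wave", "point", "beat", "wave", "point", "beat", "beat", "wave", "point"]
    coder_b = ["point", "wave", "beat",  "beat", "wave", "point", "beat", "wave", "wave", "point"]

    n = len(coder_a)
    observed_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # 0.80

    # Expected chance agreement, from each coder's marginal proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    expected_agreement = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)  # 0.33

    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
    print(round(kappa, 2))  # about 0.70

Values near 1 indicate agreement well beyond chance; conventions for what counts as acceptable vary across research areas and coding schemes.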
What Behaviors Will Be Sampled?

The key to sampling is that there is a sufficient amount and appropriate kind of sampling performed such that one represents the desired population of behaviors (and contexts and types of participants) to which one would want to generalize. Various sampling procedures exist, as do statistics to help one ascertain the number of observations necessary to test the reliability of the measurement scheme employed and/or test hypotheses about the observations (for example, power analyses and tests of effect size).

Problems Associated With Observational Research

Despite all of the advantages inherent in making observations of ongoing behavior, a number of problems are typical of this type of research. Prominent among them is the fact that the development and implementation of reliable codes can be time-consuming and expensive, often requiring huge data sets to achieve representative samples and the use of recording equipment to facilitate reliable measurement. Special methods may be needed to prevent, or at least test for, what has been called observer drift. This term refers to the fact that, with prolonged observations, observers may be more likely to forget coding details, become fatigued, experience decreased motivation and attention, and/or learn confounding habits. Finally, observational methods cannot be applied to hypotheses concerning phenomena not susceptible to direct observation, such as cognitive or affective variables. Indeed, care must be taken by researchers to be sure that actual observations (e.g., he smiled or the corners of his mouth were upturned or the zygomaticus major muscle was contracted) and not inferences (e.g., he was happy) are recorded as data.

Future Outlook

With the increasing availability and sophistication of computer technology, researchers employing observational research methods have been able to search for more complicated patterns of behavior, not just within an individual's behavior over time, but among interactants in dyads and groups as well. Whether the topic is family interaction patterns, courtship behaviors in Drosophila, or patterns of nonverbal behavior in doctor-patient interactions, a collection of multivariate statistical tools, including factor analyses, time-series analyses, and t-pattern analyses, has become available to the researcher to assist him or her in detecting the hidden yet powerful patterns of behavior that are available for observation.

Carol Toris

See also Cause and Effect; Cohen's Kappa; Correlation; Effect Size, Measures of; Experimental Design; Hypothesis; Laboratory Experiments; Multivariate Analysis of Variance (MANOVA); Naturalistic Observation; Power Analysis; Reactive Arrangements; Reliability; Sample Size Planning; Unit of Analysis

Further Readings

Barker, R. G. (Ed.). (1963). The stream of behavior: Explorations of its structure and content. New York: Appleton-Century-Crofts.
Campbell, D. T., & Stanley, J. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Ekman, P. (1982). Methods for measuring facial action. In K. R. Scherer & P. Ekman (Eds.), Handbook of methods in nonverbal behavior research. Cambridge, UK: Cambridge University Press.
Hawkins, R. P. (1982). Developing a behavior code. In D. P. Hartmann (Ed.), Using observers to study behavior. San Francisco: Jossey-Bass.
Jones, R. (1985). Research methods in the social and behavioral sciences. Sunderland, MA: Sinauer Associates, Inc.
Longabaugh, R. (1980). The systematic observation of behavior in naturalistic settings. In H. Triandis (Ed.), The handbook of cross-cultural psychology: II, Methodology. Boston: Allyn & Bacon.
Magnusson, M. S. (2005). Understanding social interaction: Discovering hidden structure with model and algorithms. In L. Anolli, S. Duncan, Jr., M. S. Magnusson, & G. Riva (Eds.), The hidden structure of interaction. Amsterdam: IOS Press.
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Mahwah, NJ: Lawrence Erlbaum.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals. American Psychologist, 54, 594–604.

OBSERVATIONS

Observations refer to watching and recording the occurrence of specific behaviors during an episode of interest. The observational method can be employed in the laboratory as well as a wide variety of other settings to obtain a detailed picture of how behavior unfolds. This entry discusses types of observational design, methods for collecting observations, and potential pitfalls that may be encountered.

Types of Observational Designs

There are two types of observational design: naturalistic and laboratory observations. Naturalistic observations entail watching and recording behaviors in everyday environments such as animal colonies, playgrounds, classrooms, and retail settings. The main advantage of naturalistic observation is that it affords researchers the opportunity to study the behavior of animals and people in their natural settings. Disadvantages associated with naturalistic observations are lack of control over the setting; thus, confounding factors may come into play. Also, the behavior of interest may be extremely
infrequent and unlikely to be captured during observational sessions.

Laboratory observations involve watching and recording behaviors in a laboratory setting. The advantage of laboratory observations is that researchers can structure them to elicit certain behaviors by asking participants to discuss a particular topic or complete a specific task. The major disadvantage is that participants may behave unnaturally because of the contrived nature of the laboratory.

Collecting Observations

Specifying the Behavior of Interest

The first step in collecting observations is to specify the behavior(s) of interest. This often consists of formulating an operational definition, or precisely describing what constitutes an occurrence of each type of behavior. For instance, physical aggression may be operationally defined as hitting, kicking, or biting another person. Thus, when any of these behaviors occur, the researcher would record an instance of physical aggression. Researchers often create coding manuals, which include operational definitions and examples of the behaviors of interest, to use as a reference guide when observing complex behaviors.

Recording Observations

Researchers use different methods for recording observations dependent on the behavior of interest and the setting. One dimension that may differ is whether observations are made live or while watching a recording of the episode. Researchers often choose to record simple behaviors live and in real time as they occur. In situations characterized by complexity, researchers often elect to make a DVD/video recording of the episode. This gives researchers more flexibility in recording a variety of behaviors, most notably the ability to progress at their own speed, or to watch behaviors again if needed.

Similarly, researchers may choose to record behavior by making hand tallies or using computers. If the behavior of interest is simple, researchers may choose to record each time a behavior occurs. Many researchers, however, choose to use a computer to facilitate observational research. Several computer programs are available for recording observations. Computer entry allows for exact timing of behavior so that researchers can determine time lags between particular instances of behavior.

Another choice in recording behavior is whether to use time or event sampling. Time sampling involves dividing the observational time session into short time periods and recording any occurrences of the behavior of interest. For instance, researchers studying classroom participation might divide a 1-hour class into twelve 5-minute intervals. They could then record if students participated in each interval and then determine the percentage of intervals that included student participation. A second option is to use event sampling, recording each behavior of interest as it occurs. Researchers using the event sampling technique in the classroom participation example would record each time a student participated in class and would ultimately calculate the frequency of student participation across the class period.
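The classroom example can be expressed as a short calculation. The Python sketch below summarizes one invented record of participation both ways: time sampling reports the percentage of 5-minute intervals containing any participation, and event sampling reports the total frequency of events.

    # Sketch: time sampling vs. event sampling for one 60-minute class.
    # Each number is the minute at which a student participated (made-up data).
    participation_minutes = [3, 4, 17, 21, 22, 23, 48, 59]

    interval_length = 5                # twelve 5-minute intervals
    n_intervals = 60 // interval_length

    # Time sampling: which intervals contained at least one participation event?
    intervals_with_participation = {m // interval_length for m in participation_minutes}
    percent_intervals = 100 * len(intervals_with_participation) / n_intervals

    # Event sampling: simply count every participation event.
    event_frequency = len(participation_minutes)

    print(f"{percent_intervals:.0f}% of intervals included participation")  # 42%
    print(f"{event_frequency} participation events in the class period")    # 8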
Ensuring Interrater Reliability

Interrater reliability refers to agreement among observers and is necessary for scientifically sound observations. Several steps can be taken to ensure high interrater reliability. First, new observers often receive detailed training before they begin to code observational data. In many cases, novice observers also practice with DVD/video recordings to gain experience with the observational task. Finally, two or more observers often code a percentage of the episodes to ensure that agreement remains high.

Potential Pitfalls

There are two potential dangers in collecting observations: observer influence and observer bias. Observer influence refers to changes in behavior in response to the presence of an observer. Individuals may be cognizant of the observer and alter their behavior, often in a more positive direction. The term Hawthorne effect is often used to refer to these increases in positive (e.g., prosocial, productive) behavior in response to an observer. Steps can be taken in order to reduce observer influence, including utilization of adaptation periods during which observers immerse themselves in the environment prior to data collection so that the subjects of their observation become accustomed to their presence.

Another potential danger is observer bias, in which observers' knowledge of the study hypotheses influences their recording of behavior. Observers may notice and note more behavior that is congruent with the study hypotheses than actually occurs. At the same time, they may not notice and note behavior that is incongruent with the study hypotheses. One means of lessening observer bias is to limit the information given to observers regarding the study hypotheses.

Lisa H. Rosen and Marion K. Underwood

See also Hawthorne Effect; Interrater Reliability; Naturalistic Observation; Observational Research

Further Readings

Dallos, R. (2006). Observational methods. In G. Breakwell, S. Hammond, C. Fife-Schaw, & J. Smith (Eds.), Research methods in psychology (pp. 124–145). Thousand Oaks, CA: Sage.
Margolin, G., Oliver, P. H., Gordis, E. B., O'Hearn, H. G., Medina, A. M., Ghosh, C. M., & Morland, L. (1998). The nuts and bolts of behavioral observation of marital and family interaction. Clinical Child and Family Psychology Review, 1, 195–213.
Pope, C., & Mays, N. (2006). Observational methods. In C. Pope & N. Mays (Eds.), Qualitative research in health care (pp. 32–42). Malden, MA: Blackwell.

OCCAM'S RAZOR

Occam's Razor (also spelled Ockham) is known as the principle of parsimony or the economy of hypotheses. It is a philosophical principle dictating that, all things being equal, simplicity is preferred over complexity. Traditionally, the Razor has been used as a philosophical heuristic for choosing between competing theories, but the principle is also useful for defining methods for empirical inquiry, selecting scientific hypotheses, and refining statistical models. According to Occam's Razor, a tool with fewer working parts ought to be selected over one with many, provided they are equally functional. Likewise, a straightforward explanation ought to be believed over one that requires many separate contingencies.

For instance, there are a number of possible reasons why a light bulb does not turn on when a switch is flipped: Aliens could have abducted the light bulb, the power could be out, or the filament within the bulb has burned out. The explanation requiring aliens is exceedingly complex, as it necessitates the existence of an unknown life form, a planet from which they have come, a motive for taking light bulbs, and so on. A power outage is not as complicated, but still requires an intricate chain of events, such as a storm, accident, or engineering problem. The simplest of these theories is that the light bulb has simply burned out. All theories provide explanations, but vary in complexity. Until proof corroborating one account surfaces, Occam's Razor requires that the simplest explanation be preferred above the others. Thus, the logical—and most likely correct—hypothesis is that the light bulb has burned out.

This entry begins with a brief history of Occam's Razor. It then discusses the implications for research. The entry concludes with some caveats related to the use of Occam's Razor.

History

Occam's Razor is named for the 14th-century English theologian, philosopher, and friar William of Occam. William, who was presumably from the city of Occam, famously suggested that "entities should not be multiplied beyond necessity." To do so, he explained, implied vanity and needlessly increased the chances of error. This principle had been formalized since the time of Aristotle, but Occam's unabashed and consistent use of the Razor helped Occam become one of the foremost critics of Thomas Aquinas.

Implications for Scientific Research

The reasons for emphasizing simplicity when conceptualizing and conducting research may seem obvious. Simple designs reduce the chance of experimenter error, increase the clarity of the results, obviate needlessly complex statistical analyses, conserve valuable resources, and curtail
potential confounds. As such, the Razor can be a helpful guide when attempting to produce an optimal research design. Although often implemented intuitively, it may be helpful to review and refine proposed research methods with the Razor in mind.

Just as any number of tools can be used to accomplish a particular job, there are many potential methodological designs for each research question. Occam's Razor suggests that a tool with fewer working parts is preferable to one that is needlessly complicated. A correlation design that necessitates only examining government records may be more appropriate than an experimental design that necessitates recruiting, assigning, manipu- … with two independent variables. Because the third variable does not contribute information to the model, Occam's Razor can be used to cut it away.
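One way to make this pruning concrete is to compare the fit of a regression model with and without the uninformative predictor using an index that penalizes complexity, such as adjusted R². The Python sketch below does this with simulated data in which the third predictor is pure noise; the adjusted-R² criterion, the variable names, and the simulation are assumptions made for this illustration.

    # Sketch: compare a two-predictor and a three-predictor regression when the
    # third predictor carries no information (simulated data).
    import numpy as np

    def adjusted_r2(y, X):
        """Fit ordinary least squares and return adjusted R-squared."""
        X1 = np.column_stack([np.ones(len(y)), X])       # add intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        residuals = y - X1 @ beta
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        n, k = X1.shape
        return 1 - (ss_res / (n - k)) / (ss_tot / (n - 1))

    rng = np.random.default_rng(0)
    n = 200
    x1, x2, noise_predictor = rng.normal(size=(3, n))
    y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

    # Adding the noise predictor yields essentially no improvement once the
    # penalty for the extra parameter is applied.
    print(adjusted_r2(y, np.column_stack([x1, x2])))
    print(adjusted_r2(y, np.column_stack([x1, x2, noise_predictor])))

If the simpler model fits essentially as well, the Razor favors dropping the extra predictor.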
Caveats

In practice, a strict adherence to Occam's Razor is usually impossible or ill-advised, as it is rare to find any two models or theories that are equivalent in all ways except complexity. Often, when some portion of a method, hypothesis, or theory is cut away, some explanatory or logical value must be sacrificed. In the previously mentioned case of the regression analysis, the addition of a third indepen-
