CONTEXT  All assessment data, like other scientific experimental data, must be reproducible in order to be meaningfully interpreted.

PURPOSE  The purpose of this paper is to discuss applications of reliability to the most common assessment methods in medical education. Typical methods of estimating reliability are discussed intuitively and non-mathematically.

SUMMARY  Reliability refers to the consistency of assessment outcomes. The exact type of consistency of greatest interest depends on the type of assessment, its purpose and the consequential use of the data. Written tests of cognitive achievement look to internal test consistency, using estimation methods derived from the test-retest design. Rater-based assessment data, such as ratings of clinical performance on the wards, require interrater consistency or agreement. Objective structured clinical examinations, simulated patient examinations and other performance-type assessments generally require generalisability theory analysis to account for various sources of measurement error in complex designs and to estimate the consistency of the generalisations to a universe or domain of skills.

CONCLUSIONS  Reliability is a major source of validity evidence for assessments. Low reliability indicates that large variations in scores can be expected upon retesting. Inconsistent assessment scores are difficult or impossible to interpret meaningfully and thus reduce validity evidence. Reliability coefficients allow the quantification and estimation of the random errors of measurement in assessments, such that overall assessment can be improved.

KEYWORDS  education, medical, undergraduate/*standards; educational measurement/*standards; reproducibility of results.

Medical Education 2004; 38: 1006–1012
doi:10.1046/j.1365-2929.2004.01932.x

INTRODUCTION

This article discusses reliability of assessments in medical education and presents examples of various methods used to estimate reliability. It expands the brief discussion of reliability by Crossley et al.1 in an earlier paper in this series and discusses uses of generalisability theory, which have been described in detail elsewhere.2-4 The emphasis of this paper is applied and practical, rather than theoretical.

What is reliability? In its most straightforward definition, reliability refers to the reproducibility of assessment data or scores, over time or occasions. Notice that this definition refers to reproducing scores or data, so that, just like validity, reliability is a characteristic of the result or outcome of the assessment, not the measuring instrument itself. Feldt and Brennan5 suggest that: "Quantification of the consistency and inconsistency in examinee performance constitutes the essence of reliability analysis." (p 105)

This paper explores the importance of reliability in assessments, some types of reliability that are commonly used in medical education and their methods of estimation, and the potential impact on students of using assessments with low reliability.

If an examination were to be repeated, examinees would be expected to obtain nearly the same scores at the second testing as at the first testing. While this test-retest concept is the foundation of most of the reliability estimates used in medical education, the test-retest design is rarely, if ever, used in actual practice, as it is logistically so difficult to carry out.

Happily, measurement statisticians sorted out ways to estimate the test-retest condition many years ago, from a single testing.11 The logic is this: divide the test into 2 random halves, perhaps scoring the even-numbered items as the first test and the odd-numbered questions as the second test. Assuming that a single construct is measured by the entire test, the 2 random half tests are a reasonable proxy for 2 complete tests administered to the same group of examinees. Further, the correlation of the scores from the 2 random half tests approximates the test-retest reproducibility of the examination scores. (Note that this is the reliability of only half of the test and a further calculation must be applied, using the Spearman-Brown prophecy formula, in order to determine the reliability of the complete examination.12)
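
As a concrete illustration of this split-half logic, the following minimal sketch (hypothetical Python code with invented item scores, not taken from the paper) correlates scores on the odd- and even-numbered items and then applies the Spearman-Brown prophecy formula to project the reliability of the full-length test.

    # Split-half reliability with the Spearman-Brown correction.
    # Illustrative sketch only; the item responses below are invented.
    import numpy as np

    # Rows = examinees, columns = dichotomously scored items (1 = correct).
    items = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 0, 1, 1],
        [1, 1, 1, 0, 1, 1, 0, 1],
    ])

    odd_half = items[:, 0::2].sum(axis=1)    # "first test": odd-numbered items
    even_half = items[:, 1::2].sum(axis=1)   # "second test": even-numbered items

    # Correlation between the 2 half-test scores = reliability of a half-length test.
    r_half = np.corrcoef(odd_half, even_half)[0, 1]

    # Spearman-Brown prophecy formula projects the full-length reliability.
    r_full = (2 * r_half) / (1 + r_half)
    print(f"half-test correlation = {r_half:.2f}, full-test reliability = {r_full:.2f}")
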
A further statistical derivation (making a few assumptions about the test and the statistical characteristics of the items) allows one to estimate the internal consistency reliability from all possible ways to split the test into 2 halves: this is Cronbach's alpha coefficient, which can be used with polytomous data (0, 1, 2, 3, 4 ... n) and is the more general form of the KR 20 coefficient, which can be used only with dichotomously scored items (0, 1), such as typically found on selected-response tests.
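
Coefficient alpha itself is straightforward to compute from an examinee-by-item score matrix as alpha = [k / (k - 1)] x [1 - (sum of item variances / variance of total scores)]. The sketch below is a minimal, hypothetical Python illustration (the rating data are invented, not drawn from the paper); applied to dichotomously scored 0/1 items, the same function yields the KR 20 value.

    # Cronbach's alpha for an examinee-by-item score matrix.
    # For dichotomous (0/1) items this reduces to the KR 20 coefficient.
    # Illustrative sketch; the polytomous ratings below are invented.
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    ratings = np.array([
        [4, 3, 4, 3],
        [2, 2, 1, 2],
        [3, 3, 4, 3],
        [1, 0, 1, 2],
        [4, 4, 3, 4],
    ])
    print(f"alpha = {cronbach_alpha(ratings):.2f}")
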
A high internal consistency reliability estimate for a written test indicates that the test scores would be about the same, if the test were to be repeated at a later time. Moreover, the random errors of measurement (e.g. examinee fatigue or inattention, intra-individual differences, blind guessing and so on) are reasonably low, so that the test scores exemplify an important source of validity evidence: score reproducibility.

RELIABILITY OF RATER DATA

Ward evaluation of clinical performance is often used in medical education as a major assessment method in clinical education. In addition, oral examinations are used in some settings in which raters or judges evaluate the spoken or oral performance of medical students or residents. Both are examples of assessments that depend on the consistency of raters and ratings for their reproducibility or reliability.

For all assessments that depend on human raters or judges for their primary source of data, the reliability or consistency of greatest interest is that of the rater or judge. The largest threat to the reproducibility of such clinical or oral ratings is rater inconsistency or low interrater reproducibility. (Technically, in most designs, raters or judges are nested or confounded with the items they rate or the cases, or both, so that it is often impossible to directly estimate the error associated with raters except in the context of items and cases.)

The internal consistency (alpha) reliability of the rating scale (all items rated for each student) may be of some marginal interest to establish some communality for the construct assessed by the rating scale, but interrater reliability is surely the most important type of reliability to estimate for rater-type assessments.

There are many ways to estimate interrater reliability, depending on the statistical elegance desired by the investigator. The simplest type of interrater reliability is percent agreement, such that for each item rated, the agreement of the 2 (or more) independent raters is calculated. Percent-agreement statistics may be acceptable for in-house or everyday use, but would likely not be acceptable to manuscript reviewers and editors of high quality publications, as these statistics do not account for the chance occurrence of agreement. The kappa13 statistic (a type of correlation coefficient) does account for the random-chance occurrence of rater agreement and is therefore sometimes used as an interrater reliability estimate, particularly for individual questions, rated by 2 independent raters. (The phi14 coefficient is the same general type of correlation coefficient, but does not correct for chance occurrence of agreement and therefore tends to overestimate true rater agreement.)
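
To show why chance correction matters, the brief sketch below (hypothetical Python with invented pass/fail judgements, not data from the paper) computes both the simple percent agreement and Cohen's kappa, kappa = (observed agreement - chance agreement) / (1 - chance agreement), for 2 independent raters.

    # Percent agreement versus Cohen's kappa for 2 independent raters.
    # Illustrative sketch only; the category judgements below are invented.
    from collections import Counter

    rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
    rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
    n = len(rater_a)

    # Simple percent agreement: proportion of cases on which the raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal category proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))

    # Cohen's kappa corrects the observed agreement for chance agreement.
    kappa = (p_observed - p_chance) / (1 - p_chance)
    print(f"percent agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
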
The most elegant estimates of interrater agreement use generalisability theory (GT) analysis.2-4 From a properly designed GT study, one can estimate variance components for all the variables of interest in the design: the persons, the raters and the items. An examination of the percentage of variance associated with each variable in the design is often instructive in understanding fully the measurement error due to each design variable. From these variance components, the generalisability coefficient can be calculated, indicating how well these particular raters and these specific items represent the whole universe of raters and items, how representative this particular assessment is with respect to all possible similar assessments and, therefore, how much we can trust the consistency of the ratings. In addition, a direct assessment of the rater error can be estimated in such a study as an index of interrater consistency.
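
As a hedged illustration of the idea, the sketch below estimates variance components and a generalisability coefficient for the simplest possible design: persons fully crossed with raters, one rating per cell. The Python code and the ratings are hypothetical and invented for illustration; real G studies in this area usually involve additional facets (items, cases, occasions), as described in the sources cited above.

    # One-facet generalisability study: persons crossed with raters (p x r design).
    # Variance components are estimated from the ANOVA expected mean squares.
    # Illustrative sketch only; the ratings are invented.
    import numpy as np

    # Rows = persons (examinees), columns = raters; each cell is one rating.
    scores = np.array([
        [7, 8, 7],
        [5, 6, 5],
        [9, 9, 8],
        [4, 5, 3],
        [6, 7, 6],
    ], dtype=float)

    n_p, n_r = scores.shape
    grand = scores.mean()
    person_dev = scores.mean(axis=1) - grand
    rater_dev = scores.mean(axis=0) - grand

    ms_p = n_r * (person_dev ** 2).sum() / (n_p - 1)          # persons mean square
    ms_r = n_p * (rater_dev ** 2).sum() / (n_r - 1)           # raters mean square
    ss_resid = (((scores - grand) ** 2).sum()
                - n_r * (person_dev ** 2).sum() - n_p * (rater_dev ** 2).sum())
    ms_pr = ss_resid / ((n_p - 1) * (n_r - 1))                # residual mean square

    # Estimated variance components.
    var_pr = ms_pr                    # person x rater interaction / error
    var_p = (ms_p - ms_pr) / n_r      # variance between persons ("true" score variance)
    var_r = (ms_r - ms_pr) / n_p      # variance attributable to raters

    # Relative generalisability coefficient for the mean of n_r raters.
    g_coef = var_p / (var_p + var_pr / n_r)
    print(f"persons = {var_p:.2f}, raters = {var_r:.2f}, residual = {var_pr:.2f}")
    print(f"G coefficient for {n_r} raters = {g_coef:.2f}")
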
A slightly less elegant, but perhaps more accessible, method of estimating interrater reliability is by use of the intraclass correlation coefficient.15 Intraclass correlation uses analysis of variance (ANOVA), as does generalisability theory analysis, to estimate the variance associated with factors in the reliability design. The strength of intraclass correlation used for interrater reliability is that it is easily computed in commonly available statistical software and it permits the estimation of both the actual interrater reliability of the n raters used in the study as well as the reliability of a single rater, which is often of greater interest. Additionally, missing ratings, which are common in these datasets, can be managed by the intraclass correlation.
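
A minimal sketch of this approach, again with hypothetical Python code and invented ratings, is given below. It computes the consistency form of the intraclass correlation from the two-way persons-by-raters ANOVA, reporting both the single-rater reliability and the reliability of the mean of the k raters actually used; the handling of missing ratings, which the intraclass correlation also permits, is not shown here.

    # Intraclass correlation (consistency form) from a persons x raters ANOVA,
    # giving the single-rater reliability and the reliability of the mean of k raters.
    # Illustrative sketch only; the ratings below are invented.
    import numpy as np

    def icc_consistency(scores: np.ndarray) -> tuple:
        n_p, k = scores.shape
        grand = scores.mean()
        person_dev = scores.mean(axis=1) - grand
        rater_dev = scores.mean(axis=0) - grand
        ms_between = k * (person_dev ** 2).sum() / (n_p - 1)       # between-person MS
        ss_resid = (((scores - grand) ** 2).sum()
                    - k * (person_dev ** 2).sum() - n_p * (rater_dev ** 2).sum())
        ms_resid = ss_resid / ((n_p - 1) * (k - 1))                # residual MS
        icc_single = (ms_between - ms_resid) / (ms_between + (k - 1) * ms_resid)
        icc_average = (ms_between - ms_resid) / ms_between
        return icc_single, icc_average

    ratings = np.array([[7, 8, 7], [5, 6, 5], [9, 9, 8], [4, 5, 3], [6, 7, 6]], dtype=float)
    single, average = icc_consistency(ratings)
    print(f"single-rater ICC = {single:.2f}, 3-rater ICC = {average:.2f}")
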
RELIABILITY OF PERFORMANCE EXAMINATIONS: OSCES AND SPS

Some of the most important constructs assessed in medical education are concerned with behaviour and skills such as communication, history taking, diagnostic problem solving and patient management in the clinical setting. While ward-type evaluations attempt to assess some of these skills in the real setting (which tends to lower reliability due to the interference of many uncontrolled variables and a lack of standardisation), simulated patient (SP) examinations and objective structured clinical examinations (OSCEs) can be used to assess such skills in a more standardised, controlled fashion.

Performance examinations pose a special challenge for reliability analysis. Because the items rated in a performance examination are typically nested in a case, such as an OSCE, the unit of reliability analysis must necessarily be the case, not the item. One statistical assumption of all reliability analyses is that the items are locally independent, which means that all items must be reasonably independent of one another. Items nested in sets, such as an OSCE, an SP examination, a key features item set16 or a testlet17 of multiple choice questions (MCQs), generally violate this assumption of local independence. Thus, the case set must be used as the unit of reliability analysis. Practically, this means that if one administers a 20-station OSCE, with each station having 5 items, the reliability analysis must use the 20 OSCE scores, not the 100 individual item scores. The reliability estimate for 20 observations will almost certainly be lower than that for 100 observations.
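
The practical consequence of using the case rather than the item as the unit of analysis can be illustrated with a small simulation. The Python sketch below is hypothetical (the 20-station, 5-item structure follows the example above, but the data are simulated, with a case-specificity effect built in); it computes coefficient alpha over the 20 station scores and, for comparison, the inflated value obtained by ignoring local dependence and treating all 100 items as independent.

    # Unit of analysis for an OSCE: score each station (case) first, then estimate
    # reliability across the 20 station scores rather than the 100 individual items.
    # Illustrative simulation only; all data are generated, not real.
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                                / items.sum(axis=1).var(ddof=1))

    rng = np.random.default_rng(0)
    n_examinees, n_stations, items_per_station = 50, 20, 5

    # Simulated 0/1 item scores with an examinee ability effect plus a
    # person-by-station (case specificity) effect shared by items within a station.
    ability = rng.normal(0, 1, size=(n_examinees, 1, 1))
    case_effect = rng.normal(0, 1, size=(n_examinees, n_stations, 1))
    noise = rng.normal(0, 1, size=(n_examinees, n_stations, items_per_station))
    p_correct = 1 / (1 + np.exp(-(ability + case_effect + noise)))
    item_scores = rng.binomial(1, p_correct)

    station_scores = item_scores.sum(axis=2)                 # 50 x 20 station totals

    alpha_stations = cronbach_alpha(station_scores)          # appropriate analysis
    alpha_items = cronbach_alpha(item_scores.reshape(n_examinees, -1))  # ignores nesting
    print(f"alpha over 20 stations = {alpha_stations:.2f}; over 100 items = {alpha_items:.2f}")
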
The greatest threat to reliable measurement in performance examinations is case specificity, as is well documented.18,19 Complex performance assessments require a complex reliability model, such as generalisability theory analysis, to properly estimate sources of measurement error variance in the design and to ultimately estimate how consistently a particular sample of cases, examinees and SPs can be generalised to the universe or domain.

HOW ARE RELIABILITY COEFFICIENTS USED IN ASSESSMENT?

There are many ways to use reliability estimates in assessments. One practical use of the reliability coefficient is in the calculation of the standard error of measurement (SEM). The SEM for the entire distribution of scores on an assessment is given by the formula:12

SEM = standard deviation × √(1 − reliability)

This SEM can be used to form confidence bands around the observed assessment score, indicating the precision of measurement, given the reliability of the assessment, for each score level.
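
As a worked example (with invented numbers, purely for illustration), the short Python sketch below applies this formula and builds an approximate 95% confidence band around an observed score.

    # Standard error of measurement (SEM) and a 95% confidence band around an
    # observed score. The numbers are invented purely for illustration.
    import math

    sd = 10.0            # standard deviation of the score distribution
    reliability = 0.90   # reliability coefficient of the assessment
    observed = 75.0      # an examinee's observed score

    sem = sd * math.sqrt(1 - reliability)     # SEM = SD x sqrt(1 - reliability)
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"SEM = {sem:.1f}; 95% band: {low:.1f} to {high:.1f}")

    # With reliability 0.90 the SEM is about 3.2 points, so the band is roughly 69-81;
    # a lower reliability of 0.60 would widen the SEM to about 6.3 points.
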
HOW MUCH RELIABILITY IS ENOUGH?

The most frequently asked question about reliability may be: how high must the reliability coefficient be in order to use the assessment data? The answer depends, of course, on the purpose of the assessment, its ultimate use and the consequences resulting from the assessment. If the stakes are extremely high, the reliability must be high in order to defensibly support the validity evidence for the measure. Various authors, textbook writers and researchers offer a variety of opinions on this issue, but most educational measurement professionals suggest a reliability of at least 0.90 for very high stakes assessments, such as licensure or certification examinations in medicine, which have major consequences for examinees and society. For more moderate stakes assessments, such as major end-of-course examinations, somewhat lower reliability coefficients are generally considered acceptable.

If possible, obtain pretest or tryout data from assessments before they are used as live or scored questions. This may be nearly impossible for classroom-type assessments and even for some larger-scale medical school assessments. However, it is possible to bank effective test questions or performance cases in secure item pools for reuse later. Given the great cost and difficulty of creating effective, reliable test questions and performance prompts, securing effective questions and prompts which have solid psychometric characteristics can add greatly to reliable measurement.

CONCLUSION

Reliability estimates the amount of random measurement error in assessments. All reliability analyses are concerned with some type of consistency of measurement. For written tests, the internal test consistency is generally most important, estimated by reliability indices such as Cronbach's alpha or the Kuder-Richardson formula 20. The internal consistency coefficients are all derived from the test-retest design and approximate the results of such test-retest experiments.

Test questions and performance prompts should be unambiguous and clearly written and should, if possible, be critiqued by content-expert reviewers. Pretesting, item tryout and item banking are recommended as means of improving the reliability of assessments in medical education, wherever possible.

ACKNOWLEDGEMENTS

The author wishes to thank Brian Jolly PhD and an anonymous reviewer for their critical reviews of this manuscript.

FUNDING

There was no external funding for this project.

ETHICAL APPROVAL

This review-type paper had no requirement for University of Illinois at Chicago Institutional Review Board (IRB) approval and, thus, was not submitted for review or approval.

REFERENCES

11 Stanley JC. Reliability. In: Thorndike RL, ed. Educational Measurement. 2nd edn. Washington: American Council on Education 1971: 356–442.
12 Crocker L, Algina J. Introduction to Classical and Modern Test Theory. Belmont, California: Wadsworth Group/Thomson Learning 1986.
13 Cohen J. A coefficient of agreement for nominal scales. Educ Psych Measurement 1960;10:37–47.
14 Howell DC. Statistical Methods for Psychology. 5th edn. Pacific Grove, California: Duxbury 2002.
15 Ebel RL. Estimation of the reliability of ratings. Psychometrika 1951;16:407–24.
16 Page G, Bordage G. The Medical Council of Canada's Key Features Project: a more valid written examination of clinical decision making skills. Acad Med 1995;70:104–10.
17 Thissen D, Wainer H, eds. Test Scoring. Mahwah, New Jersey: Lawrence Erlbaum 2001.
18 Elstein A, Shulman L, Sprafka S. Medical Problem-Solving: An Analysis of Clinical Reasoning. Cambridge, Massachusetts: Harvard University Press 1978.
19 van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardised patients: state of the art. Teach Learn Med 1990;2:58–76.
20 Subkoviak MJ. A practitioner's guide to computation and interpretation of reliability indices for mastery tests. J Educ Measurement 1988;25:47–55.
21 Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues Pract 1996;15(1):22–9.

Received 8 January 2004; editorial comments to author 5 February 2004; accepted for publication 8 March 2004