A Systematic Review of Methods for Evaluating Rating Quality in Language Assessment

Article
Language Testing
1–32
© The Author(s) 2017
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532216686999
journals.sagepub.com/home/ltj
Stefanie A. Wind and Meghan E. Peterson

The University of Alabama, USA
Abstract
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has
become increasingly popular in high-stakes language assessments worldwide. Using a systematic
literature review, the purpose of this study is to identify and explore the dominant methods
for evaluating rating quality within the context of research on large-scale rater-mediated
language assessments. Results from the review of 259 methodological and applied studies
reveal an emphasis on inter-rater reliability as evidence of rating quality that persists across
methodological and applied studies, studies primarily focused on rating quality and studies not
primarily focused on rating quality, and across multiple language constructs. Additional findings
suggest discrepancies in rating designs used in empirical research and practical concerns in
performance assessment systems. Taken together, the findings from this study highlight the
reliance upon aggregate-level information that is not specific to individual raters or specific
facets of an assessment context as evidence of rating quality in rater-mediated assessments.
In order to inform the interpretation and use of ratings, as well as the improvement of rater-
mediated assessment systems, rating quality indices are needed that go beyond group-level
indicators of inter-rater reliability, and provide diagnostic evidence of rating quality specific
to individual raters, students, and other facets of the assessment system. These indicators are
available based on modern measurement techniques, such as Rasch measurement theory and
other item response theory approaches. Implications are discussed as they relate to validity,
reliability/precision, and fairness for rater-mediated assessments.
Keywords
Language assessment, rater effects, rater-mediated assessment, rating quality, raters
Corresponding author:
Stefanie A. Wind, Educational Studies in Psychology, Research Methodology, and Counseling, The University
of Alabama, 313C Carmichael Hall, USA.
Email: swind@ua.edu
Assessments that include constructed-response components have become increasingly
popular in high-stakes language assessments worldwide. In particular, assessments that
require rater judgment (i.e., rater-mediated assessments; Eckes, 2015; Engelhard, 2013)
play a central role in many international assessments, such as the Program for International
Student Assessment (OECD, 2012), the Test of English as a Foreign Language (Jamieson
& Poonpon, 2013), the ACCESS for ELLs assessment (Assessing Comprehension and
Communication in English State to State for English Language Learners; Wisconsin
Center for Educational Research, 2016), and the oral proficiency and writing proficiency
components of the American Council on the Teaching of Foreign Languages (ACTFL)
assessments (ACTFL, 2012).
A central consideration in the interpretation of results from rater-mediated assess-
ments is the quality of ratings (Hamp-Lyons, 2007; Johnson, Penny, & Gordon, 2009);
concerns related to rating quality are prevalent both in research on performance
assessment in general (e.g., Clauser, 2000; Lane & Stone, 2006) and in language
assessment research more specifically (e.g., Hamp-Lyons, 2007; Huot, 1990;
McNamara, 1996). As a result, researchers have proposed a plethora of quality-control
indicators for ratings that reflect a variety of measurement frameworks. Because evi-
dence of rating quality is a key component of the psychometric quality of rater-
mediated assessments (AERA, APA, & NCME, 2014), different methods for evaluating
ratings reflect multiple perspectives on several key aspects of these assessment systems,
including the following: (a) what properties of ratings constitute evidence of psycho-
metric quality; (b) the information used to inform rater employment and remediation
decisions; and (c) conclusions about the interpretation of ratings as indicators of student
performance.
In this study, we use a systematic literature review to identify and explore the domi-
nant methods used to evaluate rating quality within the context of rater-mediated lan-
guage assessments in order to consider the implications of these methods in terms of the
foundational areas of validity, reliability/precision, and fairness, as set forth in the revised
Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014).
The synthesis of this research reflects current views about what constitutes evidence of
rating quality to support the interpretation and use of ratings for their intended purpose.
Purpose
The purpose of this study is to identify and explore the dominant methods for evaluating
rating quality within the context of research on rater-mediated language assessments in
order to consider the implications of these methods as evidence to support the interpreta-
tion and use of scores. We considered six research questions:
1. In studies of rater-mediated language assessments, what statistical methods did
researchers use for evaluating the quality of ratings?
2. In applied studies of rater-mediated language assessments, what indices of rating
quality did researchers report? In methodological studies of rater-mediated lan-
guage assessments, what indices of rating quality did researchers report?
3. In studies of rater-mediated language assessments in which rating quality was the
primary focus, what indices of rating quality did researchers report? In studies of
rater-mediated language assessments in which rating quality was not the primary
focus, what indices of rating quality did researchers report?
4. In studies of rater-mediated language assessments of different constructs (e.g.,
reading, writing, speaking), what indices of rating quality did researchers report?
5. In studies in which researchers simulated data based on characteristics of opera-
tional language assessments, what indices of rating quality did the researchers
report?
6. In studies of rater-mediated language assessments, what information about the
rating designs employed did researchers report?
This study contributes to research on language assessment in several ways. First, it pro-
vides an overview of the methods currently used to evaluate ratings in the context of
language assessment research. Several reviews of literature are available that include
discussions of methods for evaluating rating quality. Specifically, Saal, Downey, and
Lahey (1980) conducted a review and meta-analysis of methods for evaluating rating
quality that appeared in three psychological journals between 1975 and 1977, and con-
cluded that the terms used to describe various aspects of rating quality were inconsistent.
More recently, Myford and Wolfe (2003) discussed a range of descriptions of rater effects
(i.e., rater errors), researchers’ interpretations of their implications, and indices employed
to detect these effects, and also concluded that the operational definitions and procedures
for identifying these effects vary across researchers. Focusing on the values of indicators
of rating quality, such as reliability coefficients, Meadows and Billington (2005) and
Tisi, Whitehouse, Maughan, and Burdett (2013) conducted reviews of literature on
rater-mediated assessments across content areas. The authors of both of these reviews
concluded that ratings in performance assessments are generally unreliable, and proposed
alternative scoring strategies that could be used to improve rater reliability. Despite
these previous reviews, little research has systematically synthesized the techniques
that have been applied specifically within the context of language assessment.
Furthermore, our study presents an adaptation of Engelhard’s (2013) theoretical
framework for classifying measurement techniques to methods for evaluating rating
quality. Finally, we highlight the implications of various approaches to evaluating ratings
in terms of validity, reliability/precision, and fairness that can be used to inform the
selection and interpretation of rating quality indices in research and practice.
Theoretical framework
This study is organized around the concept of research traditions that can be used to
classify measurement theories (Engelhard, 2008, 2013). Using an adapted version of
the framework presented by Engelhard (2013; also see Wind, 2014), we use two major
research traditions to classify the major measurement theories developed during the
20th century: (1) the observed ratings tradition, and (2) the scaled ratings tradition. In
this section, we provide an overview of the features and models that characterize the
observed ratings and the scaled ratings traditions (see Table 1 for a summary of these
features).

Table 1. Methods for evaluating ratings within the observed ratings and scaled ratings
traditions.

A. Essential features
Observed ratings tradition:
• Focus on decomposing observed ratings into sources of error
• Linear models
Scaled ratings tradition:
• Focus on describing individual raters, students, and other facets on a scale with
equal units (i.e., a linear continuum) that represents the latent variable
• Non-linear models

B. Key models
Observed ratings tradition:
• Analysis of variance
• Classical test theory
• Factor analysis
• Generalizability theory
• Regression models
• Structural equation modeling
Scaled ratings tradition:
• Psychophysical models and absolute scaling
• Item response theory models
• Nonparametric item response theory
• Hierarchical generalized linear models
• Signal detection theory models

C. Underlying questions for evaluating ratings
Observed ratings tradition:
• How consistently do raters score the same student responses?
• What are the sources of error that contribute to variation in ratings?
Scaled ratings tradition:
• How can raters, student responses, and other facets be mapped onto a linear
continuum that represents the construct?
• How closely do observed ratings match the expectations of the model?

Note: This table and the theoretical framework are adapted from Engelhard (2013), who used the term
“test-score tradition” to describe the characteristics listed within the observed ratings tradition, and the
term “scaling tradition” to describe the characteristics listed within the scaled ratings tradition.
Observed ratings tradition

The first major tradition in Engelhard’s (2013) original framework for classifying meas-
urement theories is the test-score tradition. In this study, we extend the characteristics of
the test-score tradition to the context of rater-mediated assessments using the term
observed ratings tradition. When applied to the context of rater-mediated assessments,
methods for evaluating rating quality based on the observed ratings tradition reflect a
focus on decomposing observed ratings into sources related to true scores and errors.
Measurement models within this tradition include classical test theory (Gulliksen, 1950),
analysis of variance (ANOVA; Fisher, 1925), generalizability theory (Brennan, 1997;
Cronbach, Gleser, Nanda, & Rajaratnam, 1972), regression models (e.g., Neter &
Wasserman, 1974), traditional factor analysis (Spearman, 1927; Thurstone, 1935, 1947;
see McDonald, 1985), and structural equation modeling (Joreskog, 2007). Each of these
approaches reflects a focus on evaluating the consistency of observed ratings, where ordinal
rating scale categories are treated as equally spaced units, and ratings are summed or
averaged across raters, tasks, and other components of an assessment system.
The following two questions summarize the underlying concerns that characterize
methods for evaluating ratings within the observed ratings tradition:
• How consistently do raters score the same student responses?
• What are the sources of error that contribute to variation in ratings?
Across the observed ratings tradition, a major theme is the use of correlation coefficients
to explore the consistency, or reliability, of ratings. Essentially, research within the
observed ratings tradition emphasizes the use of linear models, and focuses on identify-
ing and describing the influence of measurement error on the consistency of ratings as an
indicator of the quality of measurement procedures.
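As a minimal sketch of the kind of consistency index this tradition relies on, the following compares a correlation coefficient with a simple exact-agreement proportion for two raters. The function names and the ratings are invented for illustration and do not come from any reviewed study. The example is built so that one rater is uniformly more severe than the other, which suggests why a group-level correlation alone may say little about individual raters.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


def exact_agreement(x, y):
    """Proportion of performances that received identical ratings."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Invented ratings: rater B is always one category more severe than rater A.
rater_a = [2, 3, 4, 5, 6]
rater_b = [1, 2, 3, 4, 5]

# The correlation is at its maximum even though the raters never agree exactly.
print(round(pearson_correlation(rater_a, rater_b), 3))
print(exact_agreement(rater_a, rater_b))
```

Note that the correlation treats the ordinal rating categories as equally spaced numbers, which is the assumption described above for this tradition.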
Scaled ratings tradition

The second major tradition in Engelhard’s (2013) original framework for classifying
measurement theories is the scaling tradition. In this study, we extend the characteristics
of the scaling tradition to the context of rater-mediated assessments using the term scaled
ratings tradition. When methods based on the scaled ratings tradition are used to evalu-
ate rating quality, measurement techniques include models that examine individual rater
judgments in order to describe student achievement, rater severity, and other relevant
components of an assessment system (e.g., prompts) using a scale with equal units.
Approaches within the scaled ratings tradition that have been applied to rater-mediated
assessments include Rasch measurement theory (Rasch, 1960), item response theory
(IRT; Lord, 1980), nonparametric item response theory (Mokken, 1971), and modern
factor analyses in the form of multidimensional item response theory (Reckase, 2009).
Additional techniques that reflect the scaled ratings tradition include multilevel models
that incorporate IRT models (e.g., hierarchical generalized linear models; Muckle &
Karabatsos, 2009), longitudinal rater modeling with splines (Dobria, 2011), and hierar-
chical rater modeling based on signal detection theory (DeCarlo, 2005; DeCarlo, Kim, &
Johnson, 2011).
In particular, polytomous IRT models are recognized as useful tools within the scaled
ratings tradition for examining a variety of aspects of rater-mediated assessments. IRT
models that are widely used in current research on rater-mediated assessments can be
broadly categorized in terms of their origins in the work of Rasch (1960) or Birnbaum
(1968). When they are applied to rater-mediated assessments, Rasch models define the
probability of a rating in a particular category of the rating scale as a function of rater
severity and student achievement (i.e., the level of achievement or ability reflected in a
student’s performance). Furthermore, these models are based on the requirement of
rater-invariant measurement. When rater-invariant measurement is achieved, student
achievement can be described independently of the particular raters who scored their
performances, and rater severity can be described independently of the particular students
whose performances they scored (Engelhard, 2013; Wright & Stone, 1979). Researchers
who conduct Rasch measurement research related to rater-mediated assessments focus
on evaluating the degree to which rater-invariant measurement is achieved in order to
improve measurement procedures and guide further investigations.
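To make the idea of a rating probability that depends on rater severity and student achievement concrete, here is a minimal sketch under an assumed rating scale formulation of a many-facet Rasch model. The text above does not give a formula, so the parameterization, the function names, and the numeric values (`theta`, `severity`, `thresholds`) are illustrative assumptions rather than the authors' specification.

```python
from math import exp


def category_probabilities(theta, severity, thresholds):
    """Probability of each rating category 0..K under an assumed
    Rasch rating scale formulation.

    theta      -- student achievement (logits)
    severity   -- rater severity (logits)
    thresholds -- K category thresholds shared by all raters (logits)
    """
    numerators = [1.0]  # category 0: empty sum, exp(0) = 1
    running_sum = 0.0
    for tau in thresholds:
        running_sum += theta - severity - tau
        numerators.append(exp(running_sum))
    denominator = sum(numerators)
    return [value / denominator for value in numerators]


def expected_rating(probabilities):
    """Model-expected rating: sum of category * probability."""
    return sum(k * p for k, p in enumerate(probabilities))


# Same student, same scale, two hypothetical raters who differ only in severity.
thresholds = [-1.0, 1.0]
lenient = category_probabilities(theta=0.5, severity=-0.5, thresholds=thresholds)
severe = category_probabilities(theta=0.5, severity=0.5, thresholds=thresholds)

print([round(p, 3) for p in lenient], round(expected_rating(lenient), 3))
print([round(p, 3) for p in severe], round(expected_rating(severe), 3))
```

In this sketch only the difference between `theta` and `severity` enters the model, which is one way to see how student and rater locations can be placed on a common scale.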
On the other hand, Birnbaum models (e.g., Birnbaum, 1957, 1968) that have been
adapted for use with rater-mediated assessments incorporate additional variables beyond
rater severity and student achievement. Specifically, several researchers have proposed
IRT models adapted from Birnbaum models that include parameters related to rater dis-
crimination (Myford & Wolfe, 2004, p. 219; Patz, Wilson, & Hoskens, 1997; Rost, 1988;
Wolfe, 1998). The major implication of these models is that the probability of a rating in
a particular category of the rating scale is calculated as a function of rater severity, stu-
dent achievement, and the degree to which raters distinguish among students with differ-
ent levels of achievement (i.e., discrimination or slope).
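As a sketch of how a discrimination parameter can enter such a model, one possible parameterization adds a rater-specific slope to a rating scale formulation. This is an assumption for illustration; the cited models differ in their details. The symbols are illustrative notation, not taken from the article: $\theta_n$ is the achievement of student $n$, $\lambda_r$ the severity of rater $r$, $\alpha_r$ the discrimination of rater $r$, $\tau_j$ the threshold for category $j$, and $K$ the highest category.

```latex
P(X_{nr} = k) =
  \frac{\exp\left[\sum_{j=1}^{k} \alpha_r \left(\theta_n - \lambda_r - \tau_j\right)\right]}
       {\sum_{m=0}^{K} \exp\left[\sum_{j=1}^{m} \alpha_r \left(\theta_n - \lambda_r - \tau_j\right)\right]},
  \qquad k = 0, 1, \ldots, K,
```

where the empty sum (for $k = 0$ or $m = 0$) is defined as zero. Fixing $\alpha_r = 1$ for every rater removes the slope and leaves a model in which only severity and achievement determine the rating probabilities.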
The following two questions summarize the underlying concerns that characterize
methods for evaluating ratings within the scaled ratings tradition:
• How can raters, student responses, and other facets be mapped onto a linear
continuum that represents the construct?
• How closely do observed ratings match the expectations of the model?
Across the scaled ratings tradition, rating quality indicators focus on evaluating the
match between observed ratings and the ratings that would be expected based on the
selected model.
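The comparison of observed and model-expected ratings can be sketched with residual-based summaries. The mean-square statistics below (an unweighted "outfit" and an information-weighted "infit") are widely used in Rasch analyses, but the text above does not name them, so treat this as an assumed illustration. The function name and the input format, pairs of an observed rating and its model category probabilities, are hypothetical.

```python
def fit_statistics(observations):
    """Return (infit, outfit) mean squares for one rater.

    observations -- list of (observed_rating, category_probabilities) pairs,
                    where category_probabilities[k] is the model probability
                    of rating k. Assumes every model variance is above zero.
    """
    squared_residuals = []
    variances = []
    squared_standardized = []
    for observed, probabilities in observations:
        expected = sum(k * p for k, p in enumerate(probabilities))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probabilities))
        residual = observed - expected
        squared_residuals.append(residual ** 2)
        variances.append(variance)
        squared_standardized.append(residual ** 2 / variance)
    outfit = sum(squared_standardized) / len(squared_standardized)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


# Invented example: four ratings on a 0-2 scale with identical model probabilities.
model_probabilities = [0.25, 0.5, 0.25]
ratings = [1, 2, 0, 1]
infit, outfit = fit_statistics([(x, model_probabilities) for x in ratings])
print(infit, outfit)
```

Values near 1 indicate that the observed variation is about what the model expects; markedly larger or smaller values would flag a rater for closer inspection.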
Methods
We conducted a systematic review (Petticrew & Roberts, 2006) of empirical, peer-
reviewed journal articles that included studies based on rater-mediated assessments.
Although other publication outlets, such as technical reports and book chapters, often
describe applications of and advances in methods for evaluating rating quality, we lim-
ited our review to journal articles in order to provide an overview of the current range of
methods that are discussed among language assessment researchers in these outlets. This
section includes details regarding our inclusion criteria and methods for synthesizing
selected research.
Search and selection process

Data used in this study included articles from top-tier educational research journals in
which researchers described empirical and methodological studies of rater-mediated lan-
guage assessments published between January 1980 and January 2016; this time period is
particularly germane to the current study in light of the increased emphasis on performance
assessments since around the 1980s (Clauser, 2000; Lane & Stone, 2006; Linn, Baker, &
Dunbar, 1991). Articles that met the selection criteria included descriptions of methodo-
logical or applied studies in which researchers applied one primary quantitative technique
for evaluating rating quality. We did not include studies in which researchers used multiple
methods and/or compared multiple methods for evaluating ratings. For the purpose of this
review, we defined applied studies as empirical articles in which methods for evaluating
rating quality were used to address research questions aimed at practical concerns related
to a particular language assessment or language assessment in general. On the other hand,
we defined methodological studies as empirical analyses used to develop or improve an
analytic technique that was not specific to a particular assessment. We selected applied
studies within the context of language assessment, and methodological studies in which
researchers described methods that could be applied to rater-mediated educational achievement
assessments within the context of language assessment. We did not include methodological
articles that described self-report rating scales, presented computer software,
or offered commentaries on existing methods.
In order to be included in the review, it was not necessary that the primary focus of a
study was on the methods or conclusions related to rating quality; rather, we only required
that some quantitative technique be used to evaluate the ratings. The scope of our litera-
ture search was limited to articles published in official journals of the major professional
organizations for educational research in the USA: the American Educational Research
Association (AERA) and the National Council on Measurement in Education (NCME),
along with the official journal of the International Test Commission (ITC). Our review
also included empirical and methodological articles published in top-tier journals within
the fields of applied educational measurement and language assessment. Specifically, we
identified journals in these categories based on impact factors (> 0.9), and other indica-
tors of quality, including sponsorship by a national or international association related to
educational assessment or language/literacy instruction. We conducted the search using
the advanced search options in Google Scholar. Because the journals included in the
review were identified prior to the search, we conducted the search one journal at a time,
and limited the inclusion criteria in Google Scholar to the journal of interest and the
selected time period.
While recognizing the known limitations related to impact factors (Togia & Tsigilis,
2006), we found the value of 0.9 to be a useful cutoff point for identifying top-tier
research journals within which to conduct our search. For example, this value reflects
approximately the third quartile of the distribution of the most-recent impact factors of
education journals reported in Togia and Tsigilis’ discussion of impact factors in educa-
tion journals. Furthermore, our use of this value resulted in the identification of all of the
official journals published by the AERA, NCME, and the ITC. Nonetheless, this critical
value and the other selection criteria, including the limitation to journal articles, limited
the scope of the review and resulted in the exclusion of some publications that, if
included, may have resulted in different findings from the review; this limitation is dis-
cussed further at the end of the paper.
Table 2 provides a list of the journals included in the review. We identified articles
using the following keywords and their plural variants as terms that appeared in the
title, keywords, or abstract of the manuscripts: (1) “constructed response”; (2) “lan-
guage assessment”; (3) “performance assessment”; (4) “rater”; (5) “rating quality”; (6)
“rating scale”; and (7) “rubric.” During the first round of data collection, each author
evaluated the title, abstract, and key words of studies identified using these search
terms against the inclusion criteria, and selected possible relevant manuscripts. Next,
we discussed the suitability of each study. Using the inclusion and exclusion criteria,
we selected 259 articles.
Table 2. Selected journals (selected articles, N = 259).

| Journal category | n | % | Journals included in search* | n | % |
| --- | --- | --- | --- | --- | --- |
| American Educational Research Association (AERA) | 21 | 8.11 | American Educational Research Journal | 10 | 3.86 |
| | | | Educational Evaluation and Policy Analysis | 2 | 0.77 |
| | | | Educational Researcher | 2 | 0.77 |
| | | | Journal of Educational and Behavioral Statistics | 7 | 2.70 |
| National Council on Measurement in Education (NCME) | 20 | 7.72 | Educational Measurement: Issues and Practice | 5 | 1.93 |
| | | | Journal of Educational Measurement | 15 | 5.79 |
| International Test Commission (ITC) | 6 | 2.32 | International Journal of Testing | 6 | 2.32 |
| Applied educational measurement** | 56 | 21.62 | Applied Measurement in Education | 12 | 4.63 |
| | | | Applied Psychological Measurement | 22 | 8.49 |
| | | | Assessment in Education: Principles, Policy & Practice | 6 | 2.32 |
| | | | Educational Assessment | 3 | 1.16 |
| | | | Educational & Psychological Measurement | 13 | 5.02 |
| Language assessment** | 156 | 60.23 | Assessing Writing | 55 | 21.24 |
| | | | Journal of Research in Reading | 28 | 10.81 |
| | | | Language Assessment Quarterly | 11 | 4.25 |
| | | | Language Testing | 33 | 12.74 |
| | | | Reading Research Quarterly | 21 | 8.11 |
| | | | Research in the Teaching of English | 8 | 3.09 |

*Does not include journals focused on reviews of research.
**Selected journals in these categories were identified based on impact factors (> 0.9), and other indicators
of quality, including sponsorship by a national or international association related to educational
assessment or language/literacy instruction.
Data extraction
Following Petticrew and Roberts (2006), we prepared summaries of each article that
included key details from each of the 259 selected studies. We coded each study accord-
ing to five major characteristics: (1) Type (methodological or applied); (2) Method used
to evaluate rating quality; (3) Language construct measured using ratings or, where
applicable, simulated ratings; (4) Focus (whether the study was primarily focused on
methods for evaluating rating quality or if the rating quality indices were only applied in
service to a larger goal); and (5) Rating design (system for collecting ratings).
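The five coded characteristics can be pictured as one record per study. The sketch below is only an illustration of such a record: the class name, field names, and validation rules are hypothetical (the authors coded the articles in NVivo, not in code), and the rating design labels are the ones that appear in Table 3.

```python
from dataclasses import dataclass

# Allowed values, taken from the category labels reported in the review.
STUDY_TYPES = {"applied", "methodological"}
RATING_DESIGNS = {
    "fully crossed",
    "incomplete, but connected",
    "disconnected",
    "multiple designs",
    "no details given",
}


@dataclass(frozen=True)
class CodedStudy:
    """One coded article (hypothetical structure for illustration)."""

    study_type: str               # (1) methodological or applied
    method: str                   # (2) method used to evaluate rating quality
    construct: str                # (3) language construct, or "simulated"
    rating_quality_primary: bool  # (4) focus of the study
    rating_design: str            # (5) system for collecting ratings

    def __post_init__(self):
        if self.study_type not in STUDY_TYPES:
            raise ValueError(f"unknown study type: {self.study_type!r}")
        if self.rating_design not in RATING_DESIGNS:
            raise ValueError(f"unknown rating design: {self.rating_design!r}")


# Invented example record, not an article from the review.
example = CodedStudy(
    study_type="applied",
    method="Inter-rater reliability",
    construct="L2 writing",
    rating_quality_primary=False,
    rating_design="fully crossed",
)
print(example)
```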
Appendix A presents the coding framework used in the systematic review. We pre-
pared an initial list of codes that included methods for evaluating rating quality based on
several approaches within the observed ratings and scaled ratings traditions. The final list
of codes emerged during the analysis as we identified methods that appeared in the
selected studies. In terms of language constructs, we prepared an initial list using the
Standards For Communicative Competence set forth by the ACTFL (2012), which
included reading, writing, listening, and speaking in students’ first language (L1) or in
another language (L2), and multiple language constructs. Studies in which simulated
data were used were also identified.
We used NVivo, Version 11 (NVivo qualitative data analysis software, 2015) to code
the articles. Prior to coding the entire set of articles, both authors scored a common set of
three randomly selected articles from each of the four categories of journals listed in
Table 2, for a total agreement set of 12 articles. We resolved any disagreements and clari-
fied the coding scheme until we reached complete agreement. We divided the remaining
articles and coded them. Prior to the final analyses, the first author reviewed each of the
second author’s coded articles, and we discussed and resolved any disagreements.
Data analysis
Our data analysis procedures involved classifying each of the 259 studies within the
observed ratings or scaled ratings tradition. Next, we examined the results using tabula-
tions of the frequencies of methods for evaluating rating quality within each classifica-
tion (i.e., matrix coding; Miles, Huberman, & Saldana, 2014). Specifically, we used the
NVivo program to generate tabular displays of the frequencies of each method across
methodological and applied studies, across studies in which rating quality was the pri-
mary focus, and studies in which rating quality was not the primary focus, across research
related to the assessment of different language constructs, within studies based on simu-
lated data, and across research that made use of different rating designs.
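The tabulation step amounts to counting studies in each combination of two codes. A minimal sketch, assuming each coded study is stored as a dictionary with hypothetical keys, is shown below; the four records are invented and are not data from the review, which used NVivo for this step.

```python
from collections import Counter


def matrix_coding(studies, row_key, column_key):
    """Count studies for every (row code, column code) combination."""
    return Counter((study[row_key], study[column_key]) for study in studies)


# Invented records for illustration only.
coded_studies = [
    {"method": "Inter-rater reliability", "type": "applied"},
    {"method": "Inter-rater reliability", "type": "applied"},
    {"method": "Inter-rater reliability", "type": "methodological"},
    {"method": "Rasch measurement theory", "type": "methodological"},
]

table = matrix_coding(coded_studies, row_key="method", column_key="type")
for (method, study_type), count in sorted(table.items()):
    print(f"{method:28s} {study_type:15s} {count}")
```

Swapping `column_key` for another coded characteristic (focus, construct, or rating design) would give the other cross-tabulations described above.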
Results
Overall, results from our analysis of the 259 selected studies indicated a range of meth-
ods for evaluating rating quality in language assessment research; Appendix A includes
a description of each of the methods that we identified. Table 2 includes the overall
results in terms of the distribution of selected studies across journals. In this section, we
present results using descriptive statistics to describe patterns observed across the 259
selected studies.
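The descriptive statistics in this section are counts and percentages of the 259 selected studies. As a small worked example, the category-level percentages in Table 2 can be reproduced from the counts; the dictionary keys are shortened labels and the helper name is hypothetical.

```python
TOTAL_ARTICLES = 259

# Counts of selected articles per journal category, as reported in Table 2.
category_counts = {
    "AERA": 21,
    "NCME": 20,
    "ITC": 6,
    "Applied educational measurement": 56,
    "Language assessment": 156,
}


def percent_of_total(count, total=TOTAL_ARTICLES):
    """Percentage of all selected articles, rounded to two decimals."""
    return round(100 * count / total, 2)


percentages = {name: percent_of_total(n) for name, n in category_counts.items()}
for name, value in percentages.items():
    print(f"{name}: {value}%")

# The category counts account for every selected article.
print(sum(category_counts.values()) == TOTAL_ARTICLES)
```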
Overall characteristics
Table 3 includes a summary of the characteristics of the 259 selected studies in terms of
the major coding categories. The literature review included more applied studies (n =
160, 61.78%) than methodological studies (n = 99, 38.22%). Three dominant methods
for evaluating rating quality appeared: Inter-rater reliability (n = 80, 30.89%), rater
agreement (n = 52, 20.08%), and Rasch measurement theory (n = 48, 18.53%). In terms
of the language constructs, we found that most research on rater-mediated language
assessment was focused on L1 writing (n = 101, 39.00%), followed by L1 reading (n =
43, 16.60%). Although most studies included real data, a substantial proportion used
simulated data (n = 49, 18.92%). Approximately equal proportions of the studies were
focused on rating quality specifically (n = 136, 52.51%) as those in which rating quality
was not the primary focus (n = 123, 47.49%). Finally, nearly half of the studies involved
fully crossed rating designs in which all raters scored all language performances (n =
106, 40.93%).

Table 3. Overall characteristics across selected studies (N = 259).

| Characteristic | Category | Number of articles | % of total |
| --- | --- | --- | --- |
| Type | Applied | 160 | 61.78 |
| | Methodological | 99 | 38.22 |
| Method for evaluating rating quality | Inter-rater reliability | 80 | 30.89 |
| | Rater agreement | 52 | 20.08 |
| | Rasch measurement theory | 48 | 18.53 |
| | Generalizability theory | 23 | 8.88 |
| | Expert agreement | 11 | 4.25 |
| | Correlations with external variable | 10 | 3.86 |
| | Other IRT | 9 | 3.47 |
| | Factor analysis or structural equation modeling | 7 | 2.70 |
| | Regression or ANOVA | 7 | 2.70 |
| | Hierarchical linear model | 5 | 1.93 |
| | Decision consistency/accuracy | 3 | 1.16 |
| | Mokken scaling | 1 | 0.39 |
| | Signal detection | 1 | 0.39 |
| | Odds ratio | 1 | 0.39 |
| | Intra-rater reliability | 1 | 0.39 |
| Language construct | L1 writing | 101 | 39.00 |
| | L1 reading | 43 | 16.60 |
| | L2 writing | 34 | 13.13 |
| | L2 speaking | 16 | 6.18 |
| | Multiple | 5 | 1.93 |
| | L2 reading | 5 | 1.93 |
| | L1 speaking | 5 | 1.93 |
| | L1 listening | 1 | 0.39 |
| Simulation | Studies based on simulated data | 49 | 18.92 |
| Focus | Rating quality primary | 136 | 52.51 |
| | Rating quality not primary | 123 | 47.49 |
| Rating design | Fully crossed | 106 | 40.93 |
| | No details given | 77 | 29.73 |
| | Incomplete, but connected | 64 | 24.71 |
| | Disconnected | 8 | 3.09 |
| | Multiple designs | 4 | 1.54 |
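The rating design categories reported in Table 3 can be told apart mechanically once a design is written as a set of (rater, performance) pairs. The sketch below assumes the usual reading of the labels: "fully crossed" means every rater scored every performance, as defined above, while "connected" is assumed here to mean that all raters and performances form a single linked network through shared ratings. The function name and the example designs are hypothetical.

```python
from collections import defaultdict, deque


def classify_design(ratings):
    """Classify a rating design given (rater, performance) pairs."""
    pairs = set(ratings)
    if not pairs:
        raise ValueError("at least one rating is required")
    raters = {rater for rater, _ in pairs}
    performances = {performance for _, performance in pairs}
    if len(pairs) == len(raters) * len(performances):
        return "fully crossed"

    # Build a two-mode graph linking raters to the performances they scored.
    neighbours = defaultdict(set)
    for rater, performance in pairs:
        neighbours[("rater", rater)].add(("performance", performance))
        neighbours[("performance", performance)].add(("rater", rater))

    # Breadth-first search from an arbitrary node to test connectivity.
    start = next(iter(neighbours))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in neighbours[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    if len(seen) == len(neighbours):
        return "incomplete, but connected"
    return "disconnected"


crossed = {(r, p) for r in ("A", "B") for p in (1, 2, 3)}
linked = {("A", 1), ("A", 2), ("B", 2), ("B", 3)}  # performance 2 links A and B
separate = {("A", 1), ("B", 2)}                    # no shared performances

print(classify_design(crossed))
print(classify_design(linked))
print(classify_design(separate))
```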
Research traditions
Our analysis of the 259 selected studies revealed 14 quantitative techniques for evaluat-
ing ratings (see Table 4). When considered in terms of the theoretical framework, the
methods for evaluating ratings indicate a dominance of the observed ratings tradition in
Table 4. Methods for evaluating rating quality across study characteristics.
Entries are n (%).

| Tradition | Method for evaluating rating quality | Total (N = 259) | A. Type: Applied (n = 160) | A. Type: Methodological (n = 99) | B. Focus: Rating quality primary (n = 136) | B. Focus: Rating quality not primary (n = 123) |
| --- | --- | --- | --- | --- | --- | --- |
| Scaled ratings | Rasch measurement theory | 48 (81.36) | 29 (100.00) | 19 (63.33) | 38 (80.85) | 10 (83.33) |
| Scaled ratings | Other IRT | 9 (15.25) | 0 (0.00) | 9 (30.00) | 7 (14.89) | 2 (16.67) |
| Scaled ratings | Mokken scaling | 1 (1.69) | 0 (0.00) | 1 (3.33) | 1 (2.13) | 0 (0.00) |
| Scaled ratings | Signal detection theory | 1 (1.69) | 0 (0.00) | 1 (3.33) | 1 (2.13) | 0 (0.00) |
| Scaled ratings | Total across scaled ratings tradition | 59 (100) | 29 (100) | 30 (100) | 47 (100) | 12 (100) |
| Observed ratings | Inter-rater reliability | 80 (40.00) | 63 (48.09) | 17 (24.64) | 21 (23.60) | 59 (53.15) |
| Observed ratings | Rater agreement | 52 (26.00) | 38 (29.01) | 14 (20.29) | 18 (20.22) | 34 (30.63) |
| Observed ratings | Generalizability theory | 23 (11.50) | 12 (9.16) | 11 (15.94) | 13 (14.61) | 10 (9.01) |
| Observed ratings | Expert agreement | 11 (5.50) | 4 (3.05) | 7 (10.14) | 10 (11.24) | 1 (0.90) |
| Observed ratings | Correlations with external variable | 10 (5.00) | 5 (3.82) | 5 (7.25) | 8 (8.99) | 2 (1.80) |
| Observed ratings | Factor analysis or structural equation modeling | 7 (3.50) | 1 (0.76) | 6 (8.70) | 6 (6.74) | 1 (0.90) |
| Observed ratings | Regression or ANOVA | 7 (3.50) | 5 (3.82) | 2 (2.90) | 5 (5.62) | 2 (1.80) |
| Observed ratings | Hierarchical linear model* | 5 (2.50) | 1 (0.76) | 4 (5.80) | 4 (4.49) | 1 (0.90) |
| Observed ratings | Decision consistency/accuracy | 3 (1.50) | 1 (0.76) | 2 (2.90) | 2 (2.25) | 1 (0.90) |
| Observed ratings | Intra-rater reliability | 1 (0.50) | 0 (0.00) | 1 (1.45) | 1 (1.12) | 0 (0.00) |
| Observed ratings | Odds ratio | 1 (0.50) | 1 (0.76) | 0 (0.00) | 1 (1.12) | 0 (0.00) |
| Observed ratings | Total across observed ratings tradition | 200 (100) | 131 (100) | 69 (100) | 89 (100) | 111 (100) |

(Continued)
Table 4. (Continued)
12
Method for evaluating rating quality                C. Language construct
                                                    L1 writing    L1 reading    L2 writing    L2 speaking   Multiple      L2 reading    L1 speaking   L1 listening
                                                    (n = 101)     (n = 43)      (n = 34)      (n = 16)      (n = 5)       (n = 5)       (n = 5)       (n = 1)
                                                      n        %    n        %    n        %    n        %    n        %    n        %    n        %    n        %
Scaled ratings tradition
  Rasch measurement theory                           20    83.33    1   100.00   14   100.00    8   100.00    0     0.00    0     0.00    3    75.00    0     0.00
  Other IRT                                           2     8.33    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    1    25.00    0     0.00
  Mokken scaling                                      1     4.17    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
  Signal detection theory                             1     4.17    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
Total across scaled ratings tradition                24      100    1      100   14      100    8      100    0        0    0        0    4      100    0        0
Observed ratings tradition
  Inter-rater reliability                            33    42.86   21    50.00    7    35.00    4    50.00    3    60.00    3    60.00    0     0.00    0     0.00
  Rater agreement                                    13    16.88   18    42.86    5    25.00    1    12.50    1    20.00    2    40.00    0     0.00    1   100.00
  Generalizability theory                             8    10.39    1     2.38    5    25.00    1    12.50    1    20.00    0     0.00    0     0.00    0     0.00
  Expert agreement                                    6     7.79    0     0.00    0     0.00    1    12.50    0     0.00    0     0.00    0     0.00    0     0.00
  Correlations with external variable                 5     6.49    1     2.38    1     5.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
  Factor analysis or structural equation modeling     4     5.19    1     2.38    0     0.00    0     0.00    0     0.00    0     0.00    1   100.00    0     0.00
  Regression or ANOVA                                 3     3.90    0     0.00    1     5.00    1    12.50    0     0.00    0     0.00    0     0.00    0     0.00
  Hierarchical linear model                           2     2.60    0     0.00    1     5.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
  Decision consistency/accuracy                       2     2.60    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
  Intra-rater reliability                             0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
  Odds ratio                                          1     1.30    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00    0     0.00
Total across observed ratings tradition              77      100   42      100   20      100    8      100    5      100    5      100    1      100    1      100
Table 4. (Continued)
Method for evaluating rating quality                D. Simulation     E. Rating design
                                                    Studies based on  Fully crossed     No details given  Incomplete, but   Disconnected
                                                    simulated data                                        connected
                                                    (n = 49)          (n = 106)         (n = 77)          (n = 64)          (n = 8)
                                                      n        %        n        %        n        %        n        %        n        %
Scaled ratings tradition
  Rasch measurement theory                            2    25.00       24    88.89       10    76.92       14    73.68        0     0.00
  Other IRT                                           6    75.00        2     7.41        2    15.38        5    26.32        0     0.00
  Mokken scaling                                      0     0.00        1     3.70        0     0.00        0     0.00        0     0.00
  Signal detection theory                             0     0.00        0     0.00        1     7.69        0     0.00        0     0.00
Total across scaled ratings tradition                 8      100       27      100       13      100       19      100        0      100
Observed ratings tradition
  Inter-rater reliability                             9    21.95       30    37.97       23    35.94       24    53.33        1    12.50
  Rater agreement                                    11    26.83       23    29.11       16    25.00       13    28.89        0     0.00
  Generalizability theory                             7    17.07       10    12.66        6     9.38        3     6.67        4    50.00
  Expert agreement (accuracy)                         4     9.76        5     6.33        5     7.81        1     2.22        0     0.00
  Correlations with external variable                 3     7.32        4     5.06        4     6.25        0     0.00        1    12.50
  Factor analysis or SEM                              1     2.44        2     2.53        2     3.13        1     2.22        1    12.50
  Regression or ANOVA                                 2     4.88        4     5.06        1     1.56        2     4.44        0     0.00
  Decision consistency/accuracy                       2     4.88        0     0.00        3     4.69        1     2.22        1    12.50
  Hierarchical linear model                           1     2.44        0     0.00        3     4.69        0     0.00        0     0.00
  Intra-rater reliability                             1     2.44        1     1.27        0     0.00        0     0.00        0     0.00
  Odds ratios                                         0     0.00        0     0.00        1     1.56        0     0.00        0     0.00
Total across observed ratings tradition              41      100       79      100       64      100       45      100        8      100
*The articles included in the hierarchical linear model (HLM) category listed within the “Observed ratings tradition” did not include the use of hierarchical
generalized linear models (HGLMs). Articles in which HGLMs were applied were included in the “Other IRT” classification within the scaled ratings tradition.
research on rater-mediated language assessment, with 200 of the 259 studies (77.22%)
classified within the observed ratings tradition, compared to 59 studies (22.78%) classi-
fied within the scaled ratings tradition.
When examining the methods identified within each tradition, it is interesting to note
that, whereas methods classified within the observed ratings tradition included three pre-
vailing techniques that make up more than 10% of the studies classified within the
observed ratings tradition (inter-rater reliability: n = 80, 40%; rater agreement: n = 52,
26%; and generalizability theory: n = 23, 11.50%), observed techniques within the scaled
ratings tradition were most often based on Rasch measurement theory (n = 48, 81.36%).
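The percentages reported above follow directly from the study counts. As a quick arithmetic check, the short Python sketch below recomputes them from the counts given in the text; the function and variable names are illustrative only and are not part of the original analysis.

```python
# Counts reported in the text for the full sample of 259 studies.
observed_total = 200  # studies classified within the observed ratings tradition
scaled_total = 59     # studies classified within the scaled ratings tradition


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


all_studies = observed_total + scaled_total
tradition_shares = {
    "observed": share(observed_total, all_studies),  # 77.22
    "scaled": share(scaled_total, all_studies),      # 22.78
}

# The three prevailing observed ratings methods, as counted in the text.
observed_methods = {
    "inter-rater reliability": 80,
    "rater agreement": 52,
    "generalizability theory": 23,
}
observed_shares = {m: share(n, observed_total) for m, n in observed_methods.items()}

# Rasch measurement theory within the scaled ratings tradition.
rasch_share = share(48, scaled_total)  # 81.36
```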
Applied and methodological research

Next, we examined the frequency of each of the 14 identified methods within applied
and methodological research (see Table 4, column A). In terms of the scaled ratings tradi-
tion, our results indicate that researchers used methods based on Rasch measurement
theory most often for both applied (n = 29, 100%) and methodological research (n = 19,
63.33%). Whereas additional methods for evaluating ratings based on the scaled ratings
tradition appeared in methodological research, Rasch measurement theory was the only
method observed based on the scaled ratings tradition for applied studies. We observed
similar patterns across both types of research in terms of the observed ratings tradition.
Specifically, we found a shared ordering of three dominant methods across both
applied and methodological research, where inter-rater reliability was the most prevalent
(applied: n = 63, 48.09%; methodological: n = 17, 24.64%), followed by rater agreement
(applied: n = 38, 29.01%; methodological: n = 14, 20.29%), and generalizability theory
(applied: n = 12, 9.16%; methodological: n = 11, 15.94%).
Focus on rating quality

Table 4, column B describes observed methods for evaluating rating quality across stud-
ies in which rating quality was the primary focus and studies in which rating quality was
only examined in service to a larger goal. In terms of the scaled ratings tradition, we
observed similar patterns across research foci. Specifically, Rasch models were the most
common method to evaluate rating quality based on the scaled ratings tradition in
research focused on rating quality (n = 38, 80.85%), and in research not focused on
rating quality (n = 10, 83.33%), followed by other IRT models (primary focus: n = 7,
14.89%; not primary focus: n = 2, 16.67%). When rating quality was the primary focus,
we observed only one instance each of two additional methods based on the scaled ratings
tradition that did not appear in research in which rating quality was not the primary
focus (Mokken scaling: n = 1, 2.13%; signal detection models: n = 1, 2.13%).
In terms of the observed ratings tradition, we also observed a similar ordering of
three dominant methods for evaluating rating quality across the two research foci.
Specifically, inter-rater reliability was the most common observed ratings approach
(primary focus: n = 21, 23.60%; not primary focus: n = 59, 53.15%), followed by rater
agreement (primary focus: n = 18, 20.22%; not primary focus: n = 34, 30.63%), and
generalizability theory (primary focus: n = 13, 14.61%; not primary focus: n = 10, 9.01%).
Language constructs
Table 4, column C presents the observed frequencies of methods for evaluating ratings
across the language constructs. In terms of L1 assessments of reading, writing, speaking,
and listening, the results in Table 4, column C indicate some variation in methods for
evaluating ratings across constructs. Within L1 writing assessments (n = 101), methods
based on the observed ratings tradition were most prevalent (n = 77, 76%) and followed
the pattern observed within the entire sample of studies, where inter-rater reliability
analyses were used most often (n = 33, 42.86%). When researchers used methods based on
the scaled ratings tradition in the context of L1 writing assessments, Rasch measurement
theory was the most frequent approach (n = 20, 83.33%). We observed a similar emphasis on
test-score methods for evaluating ratings in assessments of L1 reading (n = 43), where
we only identified one study in which researchers used methods based on the scaled rat-
ings tradition; the authors of this study used Rasch measurement theory to evaluate rat-
ings. The remaining studies related to L1 reading included inter-rater reliability as the
most common approach within the observed ratings tradition (n = 21, 50%), followed by
rater agreement (n = 18, 42.86%). On the other hand, researchers who evaluated ratings
within L1 speaking assessments most frequently drew upon the scaled ratings tradition
through the use of Rasch measurement theory (n = 3, 75%); however, we identified only
a small number of studies related to this construct (n = 5), such that the dominance of the
scaled ratings tradition observed here may not be generalizable across research on L1
speaking assessments. Similarly, we identified only one instance of an L1 listening
assessment; the authors of this article used inter-rater reliability coefficients as evidence
of rating quality.
Within selected studies in which researchers examined rater-mediated language
assessments in L2, we found similar patterns to L1 assessments. Within L2 writing
assessment research (n = 34), researchers used methods based on the observed ratings
tradition slightly more frequently (n = 20) than methods based on the scaled ratings tradition (n = 14),
where results within the research traditions followed the general pattern observed within
the entire sample of selected studies. Specifically, Rasch measurement theory was the
only scaled ratings tradition method for evaluating ratings (n = 14, 100%), and research-
ers used inter-rater reliability studies most frequently (n = 7, 35%) among methods based
on the observed ratings tradition. Within research on L2 speaking (n = 16), an equal
number of studies applied scaling and test-score methods. Similar to L2 writing, Rasch
measurement theory was the only method used to evaluate ratings based on the scaled
ratings tradition (n = 8, 100%), and researchers used inter-rater reliability studies most
frequently (n = 4, 50%) among methods based on the observed ratings tradition. When
researchers evaluated ratings in L2 reading assessments (n = 5), rating quality analyses
only included methods based on the observed ratings tradition, including inter-rater reli-
ability (n = 3, 60%), and rater agreement (n = 2, 40%). However, the small number of
studies included in which researchers described L2 reading assessments may limit the
generalizability of these results.
Finally, among studies that included multiple language constructs (n = 5), researchers
drew exclusively upon the observed ratings tradition to evaluate ratings, with inter-rater
reliability used most often (n = 3, 60%), followed by rater agreement (n = 1, 20%) and
generalizability theory (n = 1, 20%). As noted above, the small number of studies of
multiple language constructs identified for this review may limit the generalizability of
this focus on the observed ratings tradition.
Studies based on simulated data

Because our literature search included methodological studies, it was somewhat unsur-
prising that we identified a large number of studies based on simulated data (n = 49).
Among research based on simulated data, IRT modeling was the most common method
for evaluating rating quality based on the scaled ratings tradition (n = 6, 75%), followed
by Rasch models (n = 2, 25%). On the other hand, we observed rater agreement (n = 11,
26.83%), inter-rater reliability (n = 9, 21.95%), and generalizability theory (n = 7,
17.07%) most often among the observed ratings methods for evaluating the quality of
ratings in simulated data.
Rating designs
Table 4, column E presents methods for evaluating rating quality across rating designs.
Overall, these results indicate that most research on rater-mediated language assessments
is based on fully crossed (i.e., complete) rating designs (n = 106), or does not give explicit
consideration to the topic of rating designs (n = 77). In general, the results within each of
these categories follow the pattern observed in the overall sample, where the most com-
mon method for evaluating ratings based on the scaled ratings tradition is Rasch meas-
urement theory, and methods based on the observed ratings tradition are dominated by
inter-rater reliability, rater agreement, and generalizability theory.
Discussion
The purpose of this study was to identify and explore the dominant methods used to
evaluate rating quality within the context of rater-mediated language assessments in
order to consider the implications of the use of these methods as evidence of psychomet-
ric quality. Using an adaptation of Engelhard’s (2013) theoretical framework based on
research traditions, we classified methods for evaluating rating quality that focused on
the consistency and sources of error within observed ratings within the observed ratings
tradition, and we classified methods for evaluating rating quality that focused on the
description of raters, students, and other facets on a scale with equal units (i.e., a linear
continuum) that represents a latent variable within the scaled ratings tradition.
We used this framework to examine 259 peer-reviewed articles in which researchers
described the use of a single quantitative technique for evaluating the quality of ratings
in a rater-mediated language assessment between 1980 and 2016. Similar to the findings
reported in previous reviews (Meadows & Billington, 2005; Myford & Wolfe, 2003;
Saal et al., 1980; Tisi et al., 2013), our results suggested that researchers during the
selected time period applied a wide range of techniques to evaluate rating quality that
reflect both the observed ratings and scaled ratings traditions. Furthermore, our results
indicated that researchers’ use of techniques within these traditions varied across applied
and methodological research, studies in which researchers primarily focused on rating
quality and studies in which rating quality was not of primary interest, language
constructs, studies based on simulated data, and studies in which different rating designs
were employed. Although we observed variation related to these characteristics, our
results indicated an overall dominance of methods based on the observed ratings
tradition for all language constructs except L2 writing and speaking, where researchers
applied methods based on the observed ratings and scaled ratings traditions nearly
equally. In general, researchers used inter-rater reliability and rater agreement most
often among methods based on the observed ratings tradition, and Rasch measurement
theory most often among methods based on the scaled ratings tradition.
Implications in terms of the Standards

Because evidence of rating quality is used to support claims regarding the psychometric
quality of rater-mediated language assessments, it is relevant to consider the implications
of the identified methods in terms of the foundational areas for evaluating the psycho-
metric quality of assessments set forth in the recent revision of the Standards for
Educational and Psychological Testing (AERA et al., 2014). Using the conceptualiza-
tions presented in the recent revision of the Standards, we include a brief discussion of
the results as they relate to the foundational areas of validity, reliability/precision, and
fairness. We describe these three topics separately in order to match the current perspec-
tive set forth in the Standards; however, raters play a central role in all three of these
aspects of psychometric quality.
Validity
According to the Standards (AERA et al., 2014), validity is “the degree to which accu-
mulated evidence and theory support a specific interpretation of test scores for a given
use of a test” (p. 11). In the context of rater-mediated assessments, raters and rating qual-
ity play a central role in the interpretation of test scores as indicators of student standing
on a construct. Because rater judgments are the primary channel through which informa-
tion about student achievement is provided, raters can be viewed as a type of “lens” that
mediates the interpretation of student performances in terms of a construct (Cooksey,
1996a, 1996b; Cooksey, Freebody, & Davidson, 1986; Cooksey, Freebody, & Wyatt-
Smith, 2007; Engelhard, 2013; Hogarth & Karelaia, 2007; Thompson, Foster, Cole, &
Dowding, 2005). Consequently, evidence about the quality of ratings provides insight
into the degree to which ratings can be interpreted and used for a particular purpose (i.e.,
validity evidence).
Despite the centrality of raters in these assessments, most discussions of validity for
performance assessment systems, including the discussion of validity issues related to
performance assessments in the Standards, do not explicitly focus on the role of the rater
in performance assessments; rather, considerations related to rating quality generally
appear in discussions of reliability. Our finding of an emphasis on inter-rater reliability
and rater agreement as evidence of rating quality reflects the general perspective that
rating quality considerations are not a key source of validity evidence for performance
assessments. Further, our findings suggest a general perspective that methods that
describe the overall consistency of ratings are sufficient descriptions of rating quality
to support the interpretation and use of scores from rater-mediated assessments.
In light of the central role of raters and rating quality in rater-mediated assessments,
these group-level indicators do not provide sufficient validity evidence to support the
interpretation and use of ratings. In addition to these indices, it is essential that language
assessment researchers consider the use of methods that result in diagnostic information
about rating quality related to individual raters, students, and other assessment compo-
nents as additional information to support validity arguments for rater-mediated lan-
guage assessments, such as those available within the scaled ratings tradition.
Reliability/precision
The recent revision of the Standards (AERA et al., 2014) defines reliability/precision as
the “consistency over replications of the testing procedure” (p. 35). In contrast to valid-
ity, discussions related to reliability/precision in the context of performance assessments,
including the Standards, often include the explicit consideration of raters. In particular,
the Standards chapter on reliability/precision describes the role of measurement error in
rater-mediated assessments as follows: “if raters are used to assign scores to responses,
the variability in scores over qualified raters is a source of error” (p. 33). The perspective
set forth in the Standards reflects the observed ratings tradition, where high levels of
rater agreement and reliability are presented as sufficient sources of evidence of psycho-
metric quality for performance assessments. Reflecting the Standards, the finding in the
current study of an overall emphasis on inter-rater reliability as evidence of rating quality
suggests an emphasis in rater-mediated assessment research on identifying rater consist-
ency as evidence of high-quality ratings.
In order to maximize consistency, most assessment systems incorporate rater training
exercises and qualification requirements prior to operational scoring (Johnson et al.,
2009). Despite these efforts, research on performance assessments indicates that differ-
ences in rater severity persist beyond training (e.g., Knoch, Read, & von Randow, 2007;
Raczynski, Cohen, Engelhard, & Lu, 2015; Weigle, 1998). Recognizing these persistent
differences, it is essential that indicators of rating quality go beyond the dominant meth-
ods observed in this review, which focus on rater consistency, and incorporate informa-
tion about individual raters into estimates of student achievement.
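To illustrate why consistency indices alone can leave severity differences undetected, the following minimal sketch uses hypothetical ratings (invented for illustration, not taken from any reviewed study). It assumes a Pearson correlation as the inter-rater reliability index and exact agreement as the agreement index; the reviewed studies used a variety of coefficients, so this is one possible operationalization rather than the one any particular study applied.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def exact_agreement(x, y):
    """Proportion of performances that received identical ratings."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical ratings of eight essays on a 1-6 scale.
rater_a = [2, 3, 3, 4, 4, 5, 5, 6]
rater_b = [1, 2, 2, 3, 3, 4, 4, 5]  # always one point more severe than rater A

r = pearson(rater_a, rater_b)                  # rank ordering is identical
agreement = exact_agreement(rater_a, rater_b)  # yet no rating matches exactly
severity_gap = mean(rater_a) - mean(rater_b)   # systematic one-point difference
```

In this constructed case the correlation is perfect even though the second rater is uniformly more severe, so a student's score would depend on which rater happened to score the essay. A group-level consistency coefficient does not reveal that dependence; a rater-level severity estimate does.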
In particular, when rating quality is considered from the perspective of the scaled rat-
ings tradition, indicators of rating quality can be used to evaluate the reliability/precision
of rater-mediated assessment systems in terms of individual raters, students, and other
facets. Generally, these indicators are used to evaluate measurement precision in terms
of the degree to which students, raters, and other facets in an assessment system are
appropriately matched (i.e., targeted). In other words, because methods based on the
scaled ratings tradition result in estimates of student achievement, rater severity, and
other facets on a scale that represents the construct, it is possible to evaluate the align-
ment among the estimated locations of raters, students, and other facets as indicators of
precision. Closer alignment (i.e., better targeting) results in more-precise estimates
(Embretson, 1996). Rather than focusing on the decomposition of error variance into
overall sources of measurement error, methods for evaluating rating quality based on the
scaled ratings tradition focus on examining the precision of measurement.
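The link between targeting and precision can be sketched numerically. The example below assumes a rating scale formulation of the many-facet Rasch model with only two facets (student achievement and rater severity), in which the log-odds of a rating in category k rather than k-1 equal theta - severity - tau_k. The article does not specify a model, so this formulation, the threshold values, and the function names are all assumptions made for illustration.

```python
import math


def category_probs(theta, severity, thresholds):
    """Category probabilities for one student-rater encounter.

    Assumed model: log(P_k / P_{k-1}) = theta - severity - tau_k.
    """
    # Cumulative sums give the log-numerators; category 0 is fixed at 0.
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + theta - severity - tau)
    denom = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / denom for v in log_num]


def information(theta, severity, thresholds):
    """Statistical information of one rating, i.e. the model variance of the rating."""
    probs = category_probs(theta, severity, thresholds)
    expected = sum(k * p for k, p in enumerate(probs))
    return sum((k - expected) ** 2 * p for k, p in enumerate(probs))


thresholds = [-1.0, 0.0, 1.0]  # hypothetical thresholds for a four-category scale

# Student and rater located at the same point on the scale (well targeted).
well_targeted = information(theta=0.0, severity=0.0, thresholds=thresholds)
# Rater three logits more severe than the student's location (poorly targeted).
poorly_targeted = information(theta=0.0, severity=3.0, thresholds=thresholds)
```

Under these assumptions the well-targeted encounter carries more information than the poorly targeted one, and because the standard error of an estimate shrinks as accumulated information grows, closer alignment yields more precise estimates.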
Fairness
The third foundational area in the revised Standards is fairness, which is defined as
“responsiveness to individual characteristics and testing contexts so that test scores will
yield valid interpretations for intended uses” (p. 50). In the context of rater-mediated
assessment systems, fairness concerns are often discussed in relation to construct-
irrelevant influences on rater judgment. These influences may include characteristics of
students, such as demographic variables, characteristics of raters, such as previous
experiences or levels of training, or characteristics of the assessment context, such as
the administration platform for composing essays.
When considered in light of this conceptualization of fairness, the dominance of inter-
rater reliability and rater agreement identified in this review suggests that research on
rater-mediated language assessments often does not include sufficient evidence to explic-
itly evaluate the potential influence of construct-irrelevant influences on rater judgment.
In order to obtain empirical evidence of fairness in a rater-mediated assessment, methods
for evaluating rating quality must be employed that allow for the explicit consideration
of these potential influences within a measurement framework based on fundamental
measurement properties.
In contrast to overall measures of inter-rater reliability, a variety of methods are
available within both the observed ratings and scaled ratings traditions that facilitate the
exploration of the influence of a variety of potentially construct-irrelevant variables.
Within the observed ratings tradition, these methods include generalizability theory and
its recent extensions (Marcoulides & Drezner, 1997, 2000). Within the scaled ratings
tradition, differential rater functioning analyses can be used to identify potential areas
for further investigation, including qualitative studies, in order to more fully explore the
quality of a rater-mediated assessment in terms of fairness. Rather than focusing on
overall indices of rater consistency, the full consideration of fairness concerns related to
raters requires methods that allow for the direct examination of potential construct-
irrelevant influences on raters’ judgmental processes at the individual rater level.
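As a rough illustration of what rater-level evidence might look like, the sketch below computes, for each rater and student subgroup, the mean difference between that rater's rating and the mean rating of the other raters on the same performance. This is a simple descriptive screen on observed ratings, not the Rasch-based differential rater functioning analysis referred to above; the data, subgroup labels, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def subgroup_gaps(ratings, group_of):
    """Mean deviation of each rater from co-raters, split by student subgroup.

    ratings:  dict mapping student -> dict mapping rater -> rating
    group_of: dict mapping student -> subgroup label
    """
    deviations = defaultdict(list)  # (rater, subgroup) -> list of deviations
    for student, by_rater in ratings.items():
        for rater, score in by_rater.items():
            others = [v for r, v in by_rater.items() if r != rater]
            if others:  # skip performances scored by a single rater
                deviations[(rater, group_of[student])].append(score - mean(others))
    return {key: mean(vals) for key, vals in deviations.items()}


# Hypothetical data: rater C scores subgroup Y one point lower than co-raters do.
ratings = {
    "s1": {"A": 4, "B": 4, "C": 4},
    "s2": {"A": 3, "B": 3, "C": 3},
    "s3": {"A": 4, "B": 4, "C": 3},
    "s4": {"A": 5, "B": 5, "C": 4},
}
group_of = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Y"}

gaps = subgroup_gaps(ratings, group_of)
```

A rater whose gap differs markedly between subgroups, as rater C's does here, would be a candidate for closer study. An overall agreement or reliability coefficient computed across all raters would not isolate this pattern.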
Conclusions
Next, we present and discuss conclusions as they relate to the research questions used to
guide this study.
In studies of rater-mediated language assessments, what statistical methods did researchers
use for evaluating the quality of ratings?
Overall, the results revealed an emphasis on techniques based on the observed ratings
tradition, with inter-rater reliability, rater agreement, and traditional applications of
generalizability theory (e.g., Brennan, 2000; Shavelson & Webb, 1991) as the most
commonly observed methods for evaluating ratings. These three methods result in
aggregate-level information that is not specific to individual raters or specific facets of
an assessment context. Although generalizability theory analyses can be specified to
describe sources of variance that are unique to a particular assessment context,
and relatively recent developments in generalizability theory methods (e.g., Marcoulides
& Drezner, 1997, 2000) facilitate the examination of individual students and raters, these
developments did not appear among the studies identified in the current review. Rather,
the identified studies included traditional applications of generalizability theory
(Brennan, 2000; Shavelson & Webb, 1991) that do not result in information about meas-
urement quality in terms of individual raters that is necessary in order to guide rater
training or remediation procedures and inform the interpretation and use of scores
(Eckes, 2015; McNamara, 1996).
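The aggregate nature of a traditional generalizability analysis can be seen in a minimal sketch. The example below assumes the simplest case, a fully crossed students-by-raters design with one rating per cell, and estimates variance components from the usual expected mean squares; the data and function name are hypothetical and the sketch is not a reconstruction of any reviewed study's analysis.

```python
def g_study(ratings):
    """Variance components for a fully crossed students x raters design.

    ratings[p][r] is the rating given to student p by rater r.
    """
    n_p, n_r = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n_p * n_r)
    student_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[r] for row in ratings) / n_p for r in range(n_r)]

    # Sums of squares for students, raters, and the residual.
    ss_p = n_r * sum((m - grand) ** 2 for m in student_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected mean squares; negative estimates are set to zero.
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)

    # Relative G coefficient for a mean rating across n_r raters.
    g_coef = var_p / (var_p + var_res / n_r)
    return {"student": var_p, "rater": var_r, "residual": var_res, "g": g_coef}


# Hypothetical ratings: five students (rows) scored by three raters (columns).
ratings = [
    [2, 1, 2],
    [3, 2, 3],
    [4, 3, 3],
    [5, 4, 5],
    [6, 5, 5],
]
components = g_study(ratings)
```

The output contains a single rater variance component and a single coefficient for the whole design. Nothing in it identifies which rater is more severe or less consistent, which is the rater-level information needed to guide training or remediation.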
On the other hand, our results indicated that Rasch measurement theory was the domi-
nant technique for evaluating ratings when methods based on the scaled ratings tradition
were applied. This inclination toward Rasch measurement theory was particularly salient
within studies that described rater-mediated assessments in L2 writing and speaking. Our
finding that Rasch-based methods were the most common within the scaled ratings tradi-
tion reflects the advocacy for this approach in language testing in general over the last
four decades (McNamara & Knoch, 2012), within the context of L2 assessments (Eckes,
2005, 2015; McNamara, 1996), as well as recognition of its practical utility for assess-
ment development and revision among language assessment scholars. In contrast to the
three dominant methods based on the observed ratings tradition, indices of rating quality
based on the Rasch model provide individual-rater-level indices of rating quality that
describe the degree to which raters meet the requirements for invariant measurement.
Rasch indices also include diagnostic information related to rater severity, rating scale
category use, unexpected responses, and a variety of other rater effects (Eckes, 2015;
Engelhard, 2013; McNamara, 1996; Myford & Wolfe, 2003, 2004).
Overall, the range of methods identified in this review reflects those identified in other
relatively recent reviews of literature on rating quality in general (Meadows & Billington,
2005; Myford & Wolfe, 2003; Tisi et al., 2013), with the most commonly observed tech-
niques based on inter-rater reliability, rater agreement, applications of traditional gener-
alizability theory methods (e.g., Brennan, 2000; Shavelson & Webb, 1991), and Rasch
measurement theory. This finding suggests that the methods employed by language
assessment researchers to evaluate rating quality may reflect general views about rating
quality indices across a variety of content areas.
In applied studies of rater-mediated language assessments, what indices of rating quality did
researchers report? In methodological studies of rater-mediated language assessments, what
indices of rating quality did researchers report?
Our results indicated similar techniques across methodological and applied research that
reflect the overall focus on inter-rater reliability in studies of rater-mediated language
assessments. Although we observed several techniques in methodological studies that
did not appear in the applied research, this finding suggests an overall shared understand-
ing of available techniques across both types of research.
In studies of rater-mediated language assessments in which rating quality was the primary
focus, what indices of rating quality did researchers report? In studies of rater-mediated
language assessments in which rating quality was not the primary focus, what indices of rating
quality did researchers report?
As we observed related to methodological and applied research, the results indicated the
same overall dominance of methods based on the observed ratings tradition across stud-
ies in which rating quality was a primary focus and studies in which rating quality was
not the primary focus. However, it is interesting to note that researchers applied a slightly
more diverse range of methods based on the scaled ratings tradition in studies where rat-
ing quality was the primary focus compared to studies in which rating quality was not the
primary focus. This finding suggests a potentially greater awareness of the
implications of different rating quality indices among researchers whose primary focus was on
rating quality. In particular, when rating quality was of primary interest, researchers used
additional techniques classified within the scaled ratings tradition. As we noted above,
these methods often go beyond group-level rating quality indices and provide additional
diagnostic indices that can be used to inform the interpretation of the validity, reliability/
precision, and fairness of ratings. The somewhat more limited range of methods that appeared
among studies in which researchers were not explicitly focused on rating quality reflects
McNamara and Knoch’s (2012) observation that “the professional background of many
language testers lies outside measurement” (p. 4), and suggests a need for increased
communication across the psychometric and language communities.
In studies of rater-mediated language assessments of different constructs (e.g., reading, writing,
speaking), what indices of rating quality did researchers report?
We identified studies related to L1 writing most frequently. This finding suggests a poten-
tially greater emphasis on rater-mediated assessments in research within this domain,
compared to other language constructs. Further, we observed more variation in techniques
for evaluating ratings within L1 writing assessment than the other constructs. However,
with the exception of the small sample of L1 speaking assessments (n = 5), and L2 speak-
ing assessments, we observed rating quality indices based on the observed ratings tradi-
tion most often across all of the language constructs observed in the selected studies.
Although methods based on the observed ratings tradition generally dominated the
selected studies, it is interesting to note the slightly more balanced distribution of meth-
ods based on the observed ratings and scaled ratings traditions among studies of L2 writ-
ing and L2 speaking. Further, whereas Rasch measurement theory was the only scaled
ratings tradition method that was applied in studies of both of these constructs, we
observed a variety of methods based on the observed ratings tradition. This finding sug-
gests that researchers who study L2 writing and speaking assessments may be familiar
with Rasch-based techniques for evaluating rating quality as an additional tool besides
methods based on the observed ratings tradition.
In studies in which researchers simulated data based on characteristics of operational language
assessments, what indices of rating quality did the researchers report?
Among the simulation studies identified in this review, the most common methods for
evaluating rating quality were rater agreement and inter-rater reliability. This finding suggests
that the emphasis on rating quality indices based on observed ratings persists across real
and simulated data. When researchers applied methods based on the scaled ratings tradi-
tion to simulated data, IRT techniques were most common.
In studies of rater-mediated language assessments, what information about the rating designs
employed did researchers report?
In terms of rating designs, the results indicated a prevalence of fully crossed data collec-
tion systems, in which every rater scored every student on every assessment task. This
finding suggests a potential disconnection between research and practice related to rater-
mediated assessments. Specifically, practical considerations in most operational assess-
ment systems limit the use of fully crossed rating designs. Although several methods for
evaluating rating quality are well suited to handle a variety of rating designs that involve
incomplete data (e.g., methods based on Rasch measurement theory and other item
response models), the explicit consideration of rating quality in the context of opera-
tional rating designs would potentially provide a more concrete connection between
research and practice in language assessment. The prevalence of fully crossed rating
designs may also reflect our focus on manuscripts published in peer-reviewed journals.
Had our search included other outlets in which research on rating quality is reported,
such as technical documentation, less-connected designs may have appeared more often,
as these designs reflect operational assessment systems.
In addition, a particularly concerning finding from this review was the result that the
authors of 77 of the studies included in this review did not provide any details regarding
the rating designs used to collect data. This result signals a potential lack of awareness among
language assessment researchers regarding the significance of connectivity in assess-
ment designs, particularly when methods are employed whose interpretation depends on
sufficient connectivity across facets, such as Rasch and item response theory models.
Furthermore, this finding signals a weakness in language assessment research related to
performance assessments with regard to the description of data collection methods.
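Connectivity can be checked directly from the record of who rated whom. The sketch below is one way to do so, under the simplifying assumption that only the rater and student facets matter: it treats raters and students as nodes of a graph, links each rater to every student that rater scored, and classifies the design using the categories in Table 4. The function name and example designs are hypothetical.

```python
def classify_design(assignments):
    """Classify a rating design from a list of (rater, student) pairs."""
    raters = {r for r, _ in assignments}
    students = {s for _, s in assignments}
    if len(set(assignments)) == len(raters) * len(students):
        return "fully crossed"  # every rater scored every student

    # Union-find over rater and student nodes.
    parent = {("rater", r): ("rater", r) for r in raters}
    parent.update({("student", s): ("student", s) for s in students})

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Each rating links a rater to a student.
    for r, s in assignments:
        parent[find(("rater", r))] = find(("student", s))

    components = {find(node) for node in parent}
    return "incomplete but connected" if len(components) == 1 else "disconnected"


crossed = [(r, s) for r in "AB" for s in (1, 2, 3)]
linked = [("A", 1), ("A", 2), ("B", 2), ("B", 3)]  # student 2 links the raters
split = [("A", 1), ("A", 2), ("B", 3), ("B", 4)]   # no common students
designs = [classify_design(d) for d in (crossed, linked, split)]
```

In the third design the two raters share no students, so their severities cannot be compared on a common scale without further assumptions. Reporting the rating design lets readers judge whether such comparisons are warranted.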
Limitations
When considering the results from this study, it is important to note several limitations.
First, our review only included peer-reviewed journal articles within a subset of educa-
tional measurement and language/literacy journals. Our focus on journal articles in this
review resulted in the exclusion of other outlets in which methods for evaluating rating
quality are often employed, such as technical reports, books, and chapters in edited vol-
umes. Similarly, although we used the selection criteria for journals to identify top-tier,
international research outlets, the omission of other journals within the fields of educa-
tional measurement, language/literacy instruction, and language/literacy assessment
may have limited the generalizability of the findings. In particular, the journals that we
selected are primarily published in the USA and the UK – thus potentially limiting the
generalizability of the current findings to these national contexts. Furthermore, our use
of a critical value for impact factors to inform journal selection may have resulted in the
omission of research that, if included, would have revealed different patterns in the meth-
ods used to evaluate rating quality in language assessment.
Another important limitation to note is related to our exclusive focus in this study on
articles that included only a single method for evaluating rating quality. Comparison
studies that include multiple methods reflect an important component of methodological
research whose examination may provide additional insight into the development and
advancement of techniques for evaluating rating quality. Furthermore, our analysis did
not include an examination of methods for evaluating rating quality across time.
Accordingly, it is not possible to draw conclusions regarding the development of and
changes in the frequency of methods for evaluating rating quality from a historical
perspective.
Additionally, it is important to consider the implications of differences in methods for
evaluating rating quality across different types of constructed response tasks (e.g.,
extended essay responses or short answer tasks), different scoring schemes (e.g., ana-
lytic, holistic, trait-based), and the stakes associated with ratings on these tasks. In par-
ticular, constructed response tasks that are used across measures of reading, writing,
listening, and speaking often reflect different formats and scoring schemes. For example,
whereas writing assessments frequently include extended response tasks, such as essays,
reading and listening assessments often include shorter constructed-response tasks.
These differences may influence researchers’ choices of indicators of rating quality that
are of interest – thus potentially limiting the comparability of methods for evaluating
rating quality across different types of assessments. Nonetheless, all constructed response
assessments that involve rater judgments invoke a judgmental process whose evaluation
is of interest as evidence of psychometric quality.
Directions for future research

Several directions for future research are of note. First, future research should include a
review of methods for evaluating rating quality in language assessments that is not lim-
ited to peer-reviewed journal outlets. Because outlets besides journals, such as technical
reports, are often used to disseminate results related to operational assessments, an
examination of the methods reported outside of peer-reviewed journals may shed addi-
tional light on the techniques used in practice to evaluate rating quality. Similarly, a
broader search that includes journals besides those included in the current review may
reveal additional techniques or different patterns of methods for evaluating rating quality
that were not identified within the current list of journals.
Additional research could also include studies in which multiple methods for evaluat-
ing ratings are employed, and those in which multiple methods are compared. Because
these studies often result in the selection of “better” approaches, an examination of these
studies would likely reveal methodological perspectives and advancements related to
methods for evaluating rating quality that are not observed in studies based on single
methods. Furthermore, an examination of methods for evaluating rating quality across
time would provide interesting insight into the development and popularity of techniques
that could be interpreted in light of historical contexts. In particular, these findings could
be considered alongside those of McNamara and Knoch (2012), whose review provided
some historical insight into the use of models based on Rasch measurement theory in
language assessment.
Finally, an examination of methods for evaluating rating quality specific to different
formats of constructed response assessment tasks, scoring schemes, and assessments
with different consequences may shed additional light on the selection of rating quality
indicators that was not captured in the combined analysis of these types of assessments.
Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this
article.
References
American Council on the Teaching of Foreign Languages (ACTFL). (2012). ACTFL proficiency
guidelines. Alexandria, VA. Retrieved from www.actfl.org/publications/guidelines-and-
manuals/actfl-proficiency-guidelines-2012
American Educational Research Association (AERA), American Psychological Association
(APA), & National Council on Measurement in Education (NCME). (2014). Standards for
educational and psychological testing. Washington, DC: AERA.
Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach.
Language Testing, 27(4), 515–535. https://doi.org/10.1177/0265532210368717
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and
rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293.
https://doi.org/10.1080/0969594X.2010.526585
Bechger, T. M., Maris, G., & Hsiao, Y. P. (2010). Detecting halo effects in performance-based
examinations. Applied Psychological Measurement, 34(8), 607–619. https://doi.org/10.1177/
0146621610367897
Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for various decision
making problems. Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability,
Part 5. In Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-
Wesley.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational
Measurement: Issues and Practice, 16(4), 14–20.
Brennan, R. L. (2000). Performance assessments from the perspective of generalizability
theory. Applied Psychological Measurement, 24(4), 339–353. https://doi.org/10.1177/
01466210022031796
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of
essays: differences by gender, ethnicity, and country. Applied Measurement in Education,
25(1), 27–40. https://doi.org/10.1080/08957347.2012.635502
Burke, J., & Cizek, G. (2006). Effects of composition mode and self-perceived computer skills
on essay scores of sixth graders. Assessing Writing, 11, 148–166. https://doi.org/10.1016/j.
asw.2006.11.003
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assess-
ments. Applied Psychological Measurement, 24(4), 310–324. https://doi.org/10.1177/
01466210022031778
Cooksey, R. W. (1996a). Judgment analysis: Theory, methods, and applications (Vol. xv). San
Diego, CA: Academic Press.
Cooksey, R. W. (1996b). The methodology of social judgement theory. Thinking & Reasoning,
2(2–3), 141–174. https://doi.org/10.1080/135467896394483
Cooksey, R. W., Freebody, P., & Davidson, G. R. (1986). Teachers’ predictions of children’s
early reading achievement: An application of social judgment theory. American Educational
Research Journal, 23(1), 41. https://doi.org/10.2307/1163041
Cooksey, R. W., Freebody, P., & Wyatt-Smith, C. (2007). Assessment as judgment-in-context:
Analysing how teachers evaluate students’ writing. Educational Research and Evaluation,
13(5), 401–434. https://doi.org/10.1080/13803610701728311
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavio-
ral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
DeCarlo, L. T. (2005). A model of rater behavior in essay grading based on signal detection theory.
Journal of Educational Measurement, 42(1), 53–76.
DeCarlo, L. T., Kim, Y., & Johnson, M. S. (2011). A hierarchical rater model for constructed
responses, with a signal detection rater model: Hierarchical signal detection rater model.
Journal of Educational Measurement, 48(3), 333–356. https://doi.org/10.1111/j.1745–
3984.2011.00143.x
Dobria, L. (2011). Longitudinal rater modeling with splines. University of Illinois at Chicago.
Retrieved from http://gradworks.umi.com/34/72/3472389.html
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assess-
ments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221. https://
doi.org/10.1207/s15434311laq0203_2
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-
mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang.
Egan, O., & Archer, P. (1985). The accuracy of teachers’ ratings of ability: A regression model.
American Educational Research Journal, 22(1), 25–34. https://doi.org/http://aer.sagepub.
com.libdata.lib.ua.edu/content/22/1/25.full.pdf+html
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8(4),
341–349. https://doi.org/10.1037/1040–3590.8.4.341
Engelhard, G. (2008). Historical perspectives on invariant measurement: Guttman, Rasch, and
Mokken. Measurement, 6(3), 155–189. https://doi.org/10.1080/15366360802197792
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and
health sciences. New York: Routledge.
Farnia, F., & Geva, E. (2013). Growth and predictors of change in English language learn-
ers’ reading comprehension. Journal of Research in Reading, 36(4), 389–421. https://doi.
org/10.1111/jrir.12003
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Fisicaro, S. A., & Lance, C. E. (1990). Implications of three causal models for the measure-
ment of halo error. Applied Psychological Measurement, 14(4), 419–429. https://doi.
org/10.1177/014662169001400407
Fritz, E., & Ruegg, R. (2013). Rater sensitivity to lexical accuracy, sophistication and range when
assessing writing. Assessing Writing, 18, 173–181. https://doi.org/10.1016/j.asw.2013.02.001
Gao, X., & Brennan, R. L. (2001). Variability of estimated variance components and related
statistics in a performance assessment. Applied Measurement in Education, 14(2), 191–203.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hamp-Lyons, L. (2007). Worrying about rating. Assessing Writing, 12(1), 1–9. https://doi.
org/10.1016/j.asw.2007.05.002
Hogarth, R. M., & Karelaia, N. (2007). Heuristic and linear models of judgment: Matching rules
and environments. Psychological Review, 114(3), 733–758. https://doi.org/10.1037/0033–
295X.114.3.733
Huang, J. (2008). How accurate are ESL students’ holistic writing scores on large-scale
assessments?—A generalizability theory approach. Assessing Writing, 13, 201–218. https://
doi.org/10.1016/j.asw.2008.10.002
Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends.
Review of Educational Research, 60(2), 237–263.
Jamieson, J., & Poonpon, K. (2013). Developing analytic rating guides for TOEFL IBT’s inte-
grated speaking tasks (ETS Research Report Series No. RR-13–13) (p. i-93). Princeton, NJ:
Educational Testing Service. Retrieved from http://doi.wiley.com/10.1002/j.2333–8504.2013.
tb02320.x
Johnson, R. L., Penny, J., Gordon, B., Shumate, S. R., & Fisher, S. P. (2005). Resolving score
differences in the rating of writing samples: Does discussion improve the accuracy of scores?
Language Assessment Quarterly, 2(2), 117–146. https://doi.org/10.1207/s15434311laq0202_2
Johnson, R. L., Penny, J. A., & Gordon, B. (2009). Assessing performance: Designing, scoring,
and validating performance tasks. New York: The Guilford Press.
Johnson, D., & VanBrackle, L. (2011). Linguistic discrimination in writing assessment: How
raters react to African American “errors,” ESL errors, and standard English errors on a
state-mandated writing exam. Assessing Writing, 17, 35–54. https://doi.org/10.1016/j.asw.
2011.10.001
Joreskog, K. G. (2007). Factor analysis and its extensions. In Factor analysis at 100: Historical
developments and future directions (pp. 47–77). Mahwah, NJ: Lawrence Erlbaum.
Kachchaf, R., & Solano-Flores, G. (2012). Rater language background as a source of measure-
ment error in the testing of English language learners. Applied Measurement in Education,
25(2), 162–177. https://doi.org/10.1080/08957347.2012.660366
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it com-
pare with face-to-face training? Assessing Writing, 12(1), 26–43. https://doi.org/10.1016/j.
asw.2007.04.001
Lane, S., & Stone, C. (2006). Performance assessment. American Council on Education and
Praeger, 387–431.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assess-
ment: Expectations and validation criteria. Educational Researcher, 20(8), 15. https://doi.
org/10.2307/1176232
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum.
Marcoulides, G. A., & Drezner, Z. (1997). A method for analyzing performance assessments. In
M. Wilson, K. Draney, & G. Engelhard Jr. (Eds.), Objective measurement: Theory into
practice (Vol. 4, pp. 261–278). Greenwich, CT: Ablex.
Marcoulides, G. A., & Drezner, Z. (2000). A procedure for detecting pattern clustering in
measurement designs. In M. Wilson & G. Engelhard Jr. (Eds.), Objective measurement: Theory into
practice (Vol. 5, pp. 287–302). Greenwich, CT: Ablex.
McCarty, F. A., Oshima, T. C., & Raju, N. S. (2007). Identifying possible sources of differential
functioning using differential bundle functioning with polytomously scored data. Applied
Measurement in Education, 20(2), 205–225. https://doi.org/10.1080/0895734070130166
McDonald, R. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.
McNamara, T. (1996). Measuring second language performance. London and New York:
Longman.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in lan-
guage testing. Language Testing, 29(4), 555–576. https://doi.org/10.1177/0265532211430367
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. London:
AQA for the National Assessment Agency. Retrieved from https://cerp.aqa.org.uk/sites/
default/files/pdf_upload/CERP_RP_MM_01052005.pdf
Michael, W. B., Cooper, T., Shaffer, P., & Wallis, E. (1980). A comparison of the reliability and
validity of ratings of student performance on essay examinations by professors of English
and by professors in other disciplines. Educational and Psychological Measurement, 40(1),
183–195. https://doi.org/10.1177/001316448004000131
Miles, M. B., Huberman, A. M., & Saldana, J. (2014). Qualitative data analysis (3rd ed.).
Thousand Oaks, CA: SAGE Publications. Retrieved from https://books.google.com/books/
about/Qualitative_Data_Analysis.html?id=3CNrUbTu6CsC
Mokken, R. J. (1971). A Theory and Procedure of Scale Analysis. The Hague: Mouton/ Berlin:
De Gruyter.
Muckle, T., & Karabatsos, G. (2009). Hierarchical generalized linear models for the analysis of
judge ratings. Journal of Educational Measurement, 46(2), 198–219.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet
Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet
Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
Neter, J., & Wasserman, W. (1974). Applied linear statistical models: Regression, analysis of
variance, and experimental designs. Homewood, IL: Richard D. Irwin.
NVivo qualitative data analysis software. (2015). (Version 11) [Windows]. QSR International Pty
Ltd.
OECD. (2012). PISA 2009 Technical Report. OECD Publishing. Retrieved from http://www.oecd-
ilibrary.org/education/pisa-2009-technical-report_9789264167872-en
Patz, R. J., Wilson, M. J., & Hoskens, M. (1997). Optimal rating procedures and methodology
for NAEP open-ended items (Working Paper No. 97–37). Washington, DC: U.S.
Department of Education, Office of Educational Research and Improvement, National Center
for Education Statistics. Retrieved from http://nces.ed.gov/pubs97/9737.pdf
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model
for rated test items and its application to large-scale educational assessment data. Journal of
Educational and Behavioral Statistics, 27(4), 341–384.
Penny, J., & Johnson, R. (2011). The accuracy of performance task scores after resolution of
rater disagreement: A Monte Carlo study. Assessing Writing, 16, 221–236. https://doi.
org/10.1016/j.asw.2011.06.001
Petticrew, M., & Roberts, H. (Eds.). (2006). Systematic reviews in the social sciences. Oxford,
UK: Blackwell. Retrieved from http://doi.wiley.com/10.1002/9780470754887
Plakans, L., & Gebril, A. (2012). A close investigation into source use in integrated second lan-
guage writing tasks. Assessing Writing, 17, 18–34. https://doi.org/10.1016/j.asw.2011.09.002
Popp, S. E. O., Ryan, J. M., & Thompson, M. S. (2009). The critical role of anchor paper selec-
tion in writing assessment. Applied Measurement in Education, 22(3), 255–271. https://doi.
org/10.1080/08957340902984026
Raczynski, K. R., Cohen, A. S., Engelhard, G., & Lu, Z. (2015). Comparing the effectiveness of
self-paced and collaborative frame-of-reference training on rater accuracy in a large-scale
writing assessment: Comparing rater training methods. Journal of Educational Measurement,
52(3), 301–318. https://doi.org/10.1111/jedm.12079
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests (Expanded
edition, 1980. Chicago, IL: University of Chicago Press). Copenhagen, Denmark: Danish
Institute for Educational Research.
Reckase, M. D. (2009). Multidimensional item response theory (1st ed.). New York: Springer.
Rost, J. (1988). Rating scale analysis with latent class models. Psychometrika, 53(3), 327–348.
https://doi.org/10.1007/BF02294216
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric
quality of rating data. Psychological Bulletin, 88(2), 413–428.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation
modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks, CA:
SAGE Publications.
Spearman, C. (1927). The abilities of man: Their nature and measurement. New York: Macmillan.
Sykes, R. C., Ito, K., & Wang, Z. (2008). Effects of assigning raters to items. Educational
Measurement: Issues and Practice, 27(1), 47–55. https://doi.org/10.1111/j.1745-3992.
2008.00114.x
Thompson, C. A., Foster, A., Cole, I., & Dowding, D. W. (2005). Using social judgement theory to
model nurses’ use of clinical information in critical care education. Nurse Education Today,
25(1), 68–77. https://doi.org/10.1016/j.nedt.2004.10.003
Thurstone, L. L. (1935). The vectors of mind. Chicago, IL: University of Chicago Press.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
Tisi, J., Whitehouse, G., Maughan, S., & Burdett, N. (2013). A review of literature on marking
reliability research (Report commissioned by the Office of Qualifications and Examinations
Regulation). Slough, UK: National Foundation for Educational Research. Retrieved from
www.nfer.ac.uk/publications/MARK01/MARK01.pdf
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2),
263–287.
Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test
indicators of writing ability. Language Testing, 27(3), 335–353. https://doi.org/10.1177/
0265532210364406
Wilson, J., Olinghouse, N., McCoach, D. B., Santangelo, T., & Andrada, G. (2015). Comparing
the accuracy of different scoring methods for identifying sixth graders at risk of failing a state
writing assessment. Assessing Writing, 27, 11–23. https://doi.org/10.1016/j.asw.2015.06.003
Wind, S. A. (2014). Evaluating rater-mediated assessment with Rasch measurement theory and
Mokken scaling. Emory University, USA. Retrieved from http://gradworks.umi.com.libdata.
lib.ua.edu/36/34/3634388.html
Wind, S. A., & Engelhard, G. (2015). Exploring rating quality in rater-mediated assessments
using Mokken scale analysis. Educational and Psychological Measurement. https://doi.
org/10.1177/0013164415604704
Wisconsin Center for Educational Research, University of Wisconsin-Madison. (2016). ACCESS
for ELLs 2.0 Interpretative Guide for Score Reports: Kindergarten – Grade 12. Madison, WI.
Wolfe, E. W. (1998). A two-parameter logistic rater model (2PLRM): Detecting rater harshness
and centrality. Presented at the American Educational Research Association, San Diego, CA.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Zhang, Y., & Elder, C. (2014). Investigating native and non-native English-speaking teacher
raters’ judgements of oral proficiency in the College English Test-Spoken English Test
(CET-SET). Assessment in Education: Principles, Policy & Practice, 21(3), 306–325.
https://doi.org/10.1080/0969594X.2013.845547
Appendix A. Coding framework.
Characteristic Codes Example

A. Type
Methodological: “The purpose of this study was to extend the procedures for dichotomous DBF to the polytomous case and to illustrate how DBF analysis can be conducted with polytomous scoring, common to psychological and educational rating scales.” (McCarty, Oshima, & Raju, 2007, p. 205)
Applied: “For two essay questions – Question 1 given to one sample (N1 = 100) and Question 2 to a second group (N2 = 100) – ratings of student performance rendered by professors of English and by professors in other disciplines were compared for reliability and concurrent validity. From the data analyses it was concluded that the reliability and validity of the ratings provided by professors outside of English departments and by professors in English departments were nearly comparable.” (Michael, Cooper, Shaffer, & Wallis, 1980, p. 183)
B. Method for evaluating ratings
Other item response theory: “In this article we develop Patz’s (1996) hierarchical rater model (HRM) for polytomous item response data scored by multiple raters, and show how it can be used to scale examinees and items, to model aspects of consensus among raters, and to model individual rater severity and consistency effects.” (Patz, Junker, Johnson, & Mariano, 2002, p. 341)
Mokken scaling: “In this study, adaptations of the polytomous formulations of Mokken’s (1971) MH and DM models (Molenaar, 1982, 1997) are adapted in order to arrive at a suite of Mokken-based indicators of data quality for rater-mediated assessments.” (Wind & Engelhard, 2015, p. 6)
Rasch model: “Observed ratings were analyzed using three- and four-facet Rasch (one-parameter logistic) models.” (Popp, Ryan, & Thompson, 2009, p. 255)
Latent class analysis: “Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM-SDT model is applied to data from a large-scale assessment and is shown to provide a useful summary of various aspects of the raters’ performance.” (DeCarlo, Kim, & Johnson, 2011, p. 333)
Correlations with external variable: “Correlations between both human and e-rater scores and non-test indicators were moderate but consistent, providing criterion-related validity evidence for the use of e-rater along with human scores.” (Weigle, 2010, p. 335)
Decision consistency/accuracy: “Classification accuracy was measured using the area under the ROC curve.” (Wilson, Olinghouse, McCoach, Santangelo, & Andrada, 2015, p. 11)
Factor analysis or SEM: “A structural equation modeling (SEM) approach is used to estimate the variance components in the writing scores.” (Schoonen, 2005, p. 1)
Generalizability theory: “Using generalizability theory, we examined the amount of score variation due to student (the object of measurement) and four sources of measurement error.” (Kachchaf & Solano-Flores, 2012, p. 162)
Hierarchical linear model: “This study adopted a multilevel modeling (MLM) approach to examine the contribution of rater and essay factors to variability in ESL essay holistic scores.” (Barkaoui, 2010, p. 515)
Rater agreement: “The percentage of readers’ scores in perfect agreement across score categories ranged from 58 to 76%.” (Burke & Cizek, 2006, p. 156)
Expert agreement (accuracy): “Accuracy of experts is determined by the Georgia Writing Assessment by examining their typical level of agreement with the scores assigned by the validation committee to exemplar papers.” (R. L. Johnson, Penny, Gordon, Shumate, & Fisher, 2005, p. 129)
Intra-rater reliability: “For each rater, an index of observed halo was calculated for each pair of rating dimensions as the correlation between ratings across ratees.” (Fisicaro & Lance, 1990, p. 425)
Inter-rater reliability: “The Pearson inter-rater reliability coefficient was r = 0.75.” (Plakans & Gebril, 2012, p. 23)
Regression or ANOVA: “Using a between-subjects ANOVA design, it was found that raters were sensitive to accuracy, but not range or sophistication, when rating essays for lexis.” (Fritz & Ruegg, 2013, p. 173)
Odds ratios: “Using a log-linear model to generate odds ratios for comparison of essays with these error types, results indicate linguistic discrimination against African American ‘errors’ and a leniency for ESL errors in writing assessment.” (D. Johnson & VanBrackle, 2011, p. 35)
C. Language construct measured using ratings
L1 reading: “The basic data for the present study are sets of ratings and standardized test scores in the areas of Mathematics and English reading.” (Egan & Archer, 1985, p. 28)
L1 writing: “The sample for Study 1 includes all essays written in response to the independent prompt on the TOEFL iBT administered worldwide from January 2008 through October 2008. This essay is one of two essays that constitute the TOEFL iBT writing score; this ‘independent’ task asks examinees to support an opinion on the provided topic.” (Bridgeman, Trapani, & Attali, 2012, p. 30)
L1 speaking: “Here, the examination for speaking is considered … Each assignment presents the examinee with a practical situation, and he or she responds by speaking aloud. The utterances are recorded and sent to two independent raters for judgment. The raters are chosen from a file of available raters such that no rater is assigned the same examinee twice. Raters are instructed to listen to the performance on each assignment and answer a set of questions concerning different aspects such as tempo, content, or vocabulary.” (Bechger, Maris, & Hsiao, 2010, p. 613)
L2 reading: “This study modelled reading comprehension trajectories in Grades 4 to 6 English language learners (ELLs = 400), with different home language backgrounds.” (Farnia & Geva, 2013, p. 389)
L2 writing: “The TOEFL iBT is a measure of English language skills aimed at students who want to pursue undergraduate or graduate studies in universities in which English is the language of instruction.” (Bridgeman et al., 2012, p. 30)
L2 speaking: “This study investigates the impact of raters’ language background on their judgments of the speaking performance in the College English Test-Spoken English Test (CET-SET) of China.” (Zhang & Elder, 2014, p. 306)
Multiple language constructs: “This study empirically investigates sampling variability of estimated variance components using data collected in several years for a listening and writing performance assessment.” (Gao & Brennan, 2001, p. 191)
Simulated data/Other: “To examine the efficacy of the resolution models in the improvement of the accuracy of scores, we completed a Monte Carlo study in which we varied critical factors of the scoring and resolution process. The Monte Carlo computer simulation allowed us to manipulate multiple factors (i.e., independent variables) to examine their impact on an outcome variable (e.g., validity coefficients).” (Penny & Johnson, 2011, p. 226)
D. Focus
Rating quality primary: “Using generalizability theory, this study examined both the rating variability and reliability of ESL students’ writing in the provincial English examinations in Canada.” (Huang, 2008, p. 201)
Rating quality not primary: “This study was conducted to gather evidence regarding effects of the mode of writing (handwritten vs. word-processed) on compositional quality in a sample of sixth grade students.” (Burke & Cizek, 2006, p. 148)
E. Level of connectivity in the rating design
Disconnected: “The students were asked to complete three separate writing tasks (paragraph format writing task for poetry, essay format writing task for literary prose, and original composition) and each writing task received scores from two different independent raters.” (Huang, 2008, p. 204)
Fully crossed: “We attained a crossed design in which each of the eight raters scored all the student responses to all the items.” (Kachchaf & Solano-Flores, 2012, p. 167)
Incomplete but connected: “The 168 essays were first compiled into batches of 24 essays each (4 essays × 2 prompts × 3 proficiency levels) and then batches were randomly assigned to raters. Each batch included a common set of six essays and a randomly selected set of 18 essays unique to each batch. This was done to ensure ‘linkage’ between all facets for multi-faceted Rasch analyses of scores (Linacre 2009; Myford and Wolfe 2000; see ‘Data analysis’ section below).” (Barkaoui, 2011, p. 282)
Multiple designs: “[Constructed response (CR)] item responses for the students taking the six tests were allocated to raters in three different ways or modalities. The first scoring modality (SM1) consisted of a single reader scoring all of a student’s CR responses. SM2 assigned each of a student’s CR responses to a different rater (item specific) while SM3 split the subset of CR items into rater item blocks or RIBs of approximately one-third of the CR items, with a different rater assigned to each RIB of student’s responses.” (Sykes, Ito, & Wang, 2008, p. 48)
No details given: (N/A)
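Two of the aggregate-level indices illustrated in the coding framework, exact rater agreement and the Pearson inter-rater correlation, can be sketched in a few lines. The example below is a hedged illustration rather than a reproduction of any reviewed study's analysis: the function names and the two vectors of scores are hypothetical, and both indices summarize a pair of raters as a whole rather than providing the rater-level diagnostic information that this review calls for.

```python
from math import sqrt


def exact_agreement(scores_a, scores_b):
    """Proportion of responses to which two raters assigned identical scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation between two raters' scores on the same responses."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    ss_a = sum((a - mean_a) ** 2 for a in scores_a)
    ss_b = sum((b - mean_b) ** 2 for b in scores_b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("correlation is undefined when a rater's scores do not vary")
    return cross / sqrt(ss_a * ss_b)


# Hypothetical holistic scores from two raters on the same six essays.
rater_1 = [3, 4, 2, 5, 4, 3]
rater_2 = [3, 4, 3, 5, 3, 3]

print("exact agreement:", round(exact_agreement(rater_1, rater_2), 2))
print("Pearson r:", round(pearson_r(rater_1, rater_2), 2))
```

The two indices answer different questions: agreement counts identical scores, whereas the correlation reflects consistent rank ordering and can remain high even when one rater is systematically more severe than the other.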
