Article
A systematic review of methods for evaluating rating quality in language assessment
Wind and Peterson
Language Testing
1–32
© The Author(s) 2017
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532216686999
https://doi.org/10.1177/0265532216686999
journals.sagepub.com/home/ltj
Abstract
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has
become increasingly popular in high-stakes language assessments worldwide. Using a systematic
literature review, the purpose of this study is to identify and explore the dominant methods
for evaluating rating quality within the context of research on large-scale rater-mediated
language assessments. Results from the review of 259 methodological and applied studies
reveal an emphasis on inter-rater reliability as evidence of rating quality that persists across
methodological and applied studies, studies primarily focused on rating quality and studies not
primarily focused on rating quality, and across multiple language constructs. Additional findings
suggest discrepancies in rating designs used in empirical research and practical concerns in
performance assessment systems. Taken together, the findings from this study highlight the
reliance upon aggregate-level information that is not specific to individual raters or specific
facets of an assessment context as evidence of rating quality in rater-mediated assessments.
In order to inform the interpretation and use of ratings, as well as the improvement of rater-
mediated assessment systems, rating quality indices are needed that go beyond group-level
indicators of inter-rater reliability, and provide diagnostic evidence of rating quality specific
to individual raters, students, and other facets of the assessment system. These indicators are
available based on modern measurement techniques, such as Rasch measurement theory and
other item response theory approaches. Implications are discussed as they relate to validity,
reliability/precision, and fairness for rater-mediated assessments.
Keywords
Language assessment, rater effects, rater-mediated assessment, rating quality, raters
Corresponding author:
Stefanie A. Wind, Educational Studies in Psychology, Research Methodology, and Counseling, The University
of Alabama, 313C Carmichael Hall, USA.
Email: swind@ua.edu
Purpose
The purpose of this study is to identify and explore the dominant methods for evaluating
rating quality within the context of research on rater-mediated language assessments in
order to consider the implications of these methods as evidence to support the interpreta-
tion and use of scores. We considered six research questions:
This study contributes to research on language assessment in several ways. First, it pro-
vides an overview of the methods currently used to evaluate ratings in the context of
language assessment research. Several reviews of literature are available that include
discussions of methods for evaluating rating quality. Specifically, Saal, Downey, and
Lahey (1980) conducted a review and meta-analysis of methods for evaluating rating
quality that appeared in three psychological journals between 1975 and 1977, and con-
cluded that the terms used to describe various aspects of rating quality were inconsistent.
More recently, Myford and Wolfe (2003) discussed a range of descriptions of rater effects
(i.e., rater errors), researchers’ interpretations of their implications, and indices employed
to detect these effects, and also concluded that the operational definitions and procedures
for identifying these effects vary across researchers. Focusing on the values of indicators
of rating quality, such as reliability coefficients, Meadows and Billington (2005) and
Tisi, Whitehouse, Maughan, and Burdett (2013) conducted reviews of literature on rater-
mediated assessments across content areas. The authors of both of these reviews con-
cluded that ratings in performance assessments are generally unreliable, and proposed
alternative strategies to scoring that could be used to improve rater reliability. Despite
these previous reviews of literature, research is limited that systematically synthesizes
techniques that have been applied specifically within the context of language assess-
ment. Furthermore, our study presents an adaptation of Engelhard’s (2013) theoretical
framework for classifying measurement techniques to methods for evaluating rating
quality. Finally, we highlight the implications of various approaches to evaluating ratings
in terms of validity, reliability/precision, and fairness that can be used to inform the
selection and interpretation of rating quality indices in research and practice.
Theoretical framework
This study is organized around the concept of research traditions that can be used to
classify measurement theories (Engelhard, 2008, 2013). Using an adapted version of
the framework presented by Engelhard (2013; also see Wind, 2014), we use two major
research traditions to classify the major measurement theories developed during the
20th century: (1) the observed ratings tradition, and (2) the scaled ratings tradition. In
this section, we provide an overview of the features and models that characterize the
observed ratings and the scaled ratings traditions (see Table 1 for a summary of these
features).
Table 1. Methods for evaluating ratings within the observed ratings and scaled ratings
traditions.
Note: This table and the theoretical framework are adapted from Engelhard (2013), who used the term
“test-score tradition” to describe the characteristics listed within the observed ratings tradition, and the
term “scaling tradition” to describe the characteristics listed within the scaled ratings tradition.
The following two questions summarize the underlying concerns that characterize
methods for evaluating ratings within the observed ratings tradition:
Across the observed ratings tradition, a major theme is the use of correlation coefficients
to explore the consistency, or reliability, of ratings. Essentially, research within the
observed ratings tradition emphasizes the use of linear models, and focuses on identify-
ing and describing the influence of measurement error on the consistency of ratings as an
indicator of the quality of measurement procedures.
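As a minimal illustration of the correlation-based indices that characterize this tradition, the following Python sketch computes a Pearson correlation and an exact agreement rate for two raters. The ratings, the function names, and the rating scale are hypothetical and are not drawn from any of the reviewed studies.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two raters' scores for the same performances."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def exact_agreement(x, y):
    """Proportion of performances that received identical ratings."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical ratings: rater B scores every performance one category higher.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 6]

consistency = pearson_r(rater_a, rater_b)
agreement = exact_agreement(rater_a, rater_b)
```

In this constructed case the two raters rank the performances identically, so the correlation is perfect, yet they never assign the same rating. A group-level consistency coefficient alone would not reveal the systematic difference between the two raters.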
On the other hand, Birnbaum models (e.g., Birnbaum, 1957, 1968) that have been
adapted for use with rater-mediated assessments incorporate additional variables beyond
rater severity and student achievement. Specifically, several researchers have proposed
IRT models adapted from Birnbaum models that include parameters related to rater dis-
crimination (Myford & Wolfe, 2004, p. 219; Patz, Wilson, & Hoskens, 1997; Rost, 1988;
Wolfe, 1998). The major implication of these models is that the probability of a rating in
a particular category of the rating scale is calculated as a function of rater severity, stu-
dent achievement, and the degree to which raters distinguish among students with differ-
ent levels of achievement (i.e., discrimination or slope).
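The role of the three components named above can be sketched with a deliberately simplified, dichotomous logistic function. This is an illustrative assumption rather than any of the cited models, which address ratings in multiple ordered categories; the function and parameter names are hypothetical.

```python
from math import exp

def prob_positive_rating(theta, severity, discrimination=1.0):
    """Probability of the higher of two rating categories.

    theta          -- student achievement (logits)
    severity       -- rater severity (logits)
    discrimination -- rater slope; 1.0 reduces to a Rasch-type model
    """
    return 1.0 / (1.0 + exp(-discrimination * (theta - severity)))

# A student located half a logit above a rater's severity, judged by
# hypothetical raters who differ only in discrimination.
for slope in (0.5, 1.0, 2.0):
    p = prob_positive_rating(theta=0.5, severity=0.0, discrimination=slope)
    print(f"slope={slope}: P(higher category)={p:.3f}")
```

Under these assumptions, a rater with a larger slope separates students on either side of the rater's severity more sharply, which is the sense of "discrimination" used in the paragraph above.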
The following two questions summarize the underlying concerns that characterize
methods for evaluating ratings within the scaled ratings tradition:
• How can raters, student responses, and other facets be mapped onto a linear continuum
that represents the construct?
• How closely do observed ratings match the expectations of the model?
Across the scaled ratings tradition, rating quality indicators focus on evaluating the
match between observed ratings and the ratings that would be expected based on the
selected model.
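The comparison of observed and expected ratings can be sketched as follows, assuming a dichotomous Rasch-type model and treating student and rater locations as known. The unweighted mean square shown here (often called outfit) is one common fit summary; the data and function names are hypothetical.

```python
from math import exp

def rasch_prob(theta, severity):
    """Model-expected probability of the higher rating category."""
    return 1.0 / (1.0 + exp(-(theta - severity)))

def outfit_mean_square(observed, thetas, severity):
    """Mean of squared standardized residuals for one rater.

    observed -- 0/1 ratings assigned by the rater
    thetas   -- achievement estimates for the rated students (logits)
    severity -- the rater's severity estimate (logits)
    """
    squared_residuals = []
    for x, theta in zip(observed, thetas):
        p = rasch_prob(theta, severity)          # expected rating
        squared_residuals.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(squared_residuals) / len(squared_residuals)

# Hypothetical rater whose ratings follow the model's expectations closely...
fit_ok = outfit_mean_square([1, 1, 0, 0], [2.0, 1.0, -1.0, -2.0], severity=0.0)
# ...and one who gives low ratings to high achievers and high ratings to low achievers.
fit_poor = outfit_mean_square([0, 0, 1, 1], [2.0, 1.0, -1.0, -2.0], severity=0.0)
```

Because the statistic is computed separately for each rater, it provides the kind of rater-level diagnostic information that aggregate consistency indices do not.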
Methods
We conducted a systematic review (Petticrew & Roberts, 2006) of empirical, peer-
reviewed journal articles that included studies based on rater-mediated assessments.
Although other publication outlets, such as technical reports and book chapters, often
describe applications of and advances in methods for evaluating rating quality, we lim-
ited our review to journal articles in order to provide an overview of the current range of
methods that are discussed among language assessment researchers in these outlets. This
section includes details regarding our inclusion criteria and methods for synthesizing
selected research.
Journal category / Journal                                     n       %
American Educational Research Association (AERA)              21    8.11
    American Educational Research Journal                     10    3.86
    Educational Evaluation and Policy Analysis                 2    0.77
    Educational Researcher                                     2    0.77
    Journal of Educational and Behavioral Statistics           7    2.70
National Council on Measurement in Education (NCME)           20    7.72
    Educational Measurement: Issues and Practice               5    1.93
    Journal of Educational Measurement                        15    5.79
International Test Commission (ITC)                            6    2.32
    International Journal of Testing                           6    2.32
Applied educational measurement**                             56   21.62
    Applied Measurement in Education                          12    4.63
    Applied Psychological Measurement                         22    8.49
    Assessment in Education: Principles, Policy & Practice     6    2.32
    Educational Assessment                                     3    1.16
    Educational & Psychological Measurement                   13    5.02
Language assessment**                                        156   60.23
    Assessing Writing                                         55   21.24
    Journal of Research in Reading                            28   10.81
    Language Assessment Quarterly                             11    4.25
    Language Testing                                          33   12.74
    Reading Research Quarterly                                21    8.11
    Research in the Teaching of English                        8    3.09
Data extraction
Following Petticrew and Roberts (2006), we prepared summaries of each article that
included key details from each of the 259 selected studies. We coded each study accord-
ing to five major characteristics: (1) Type (methodological or applied); (2) Method used
to evaluate rating quality; (3) Language construct measured using ratings or, where
applicable, simulated ratings; (4) Focus (whether the study was primarily focused on
methods for evaluating rating quality or if the rating quality indices were only applied in
service to a larger goal); and (5) Rating design (system for collecting ratings).
Appendix A presents the coding framework used in the systematic review. We pre-
pared an initial list of codes that included methods for evaluating rating quality based on
several approaches within the observed ratings and scaled ratings traditions. The final list
of codes emerged during the analysis as we identified methods that appeared in the
selected studies. In terms of language constructs, we prepared an initial list using the
Standards For Communicative Competence set forth by the ACTFL (2012), which
included reading, writing, listening, and speaking in students’ first language (L1) or in
another language (L2), and multiple language constructs. Studies in which simulated
data were used were also identified.
We used NVivo, Version 11 (NVivo qualitative data analysis software, 2015) to code
the articles. Prior to coding the entire set of articles, both authors scored a common set of
three randomly selected articles from each of the four categories of journals listed in
Table 2, for a total agreement set of 12 articles. We resolved any disagreements and clari-
fied the coding scheme until we reached complete agreement. We divided the remaining
articles and coded them. Prior to the final analyses, the first author reviewed each of the
second author’s coded articles, and we discussed and resolved any disagreements.
Data analysis
Our data analysis procedures involved classifying each of the 259 studies within the
observed ratings or scaled ratings tradition. Next, we examined the results using tabulations
of the frequencies of methods for evaluating rating quality within each classification
(i.e., matrix coding; Miles, Huberman, & Saldaña, 2014). Specifically, we used the
NVivo program to generate tabular displays of the frequencies of each method across
methodological and applied studies, across studies in which rating quality was the pri-
mary focus, and studies in which rating quality was not the primary focus, across research
related to the assessment of different language constructs, within studies based on simu-
lated data, and across research that made use of different rating designs.
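The tabulation step can be sketched in a few lines of Python. This is not the NVivo procedure used in the review; it is a minimal stand-in that counts hypothetical coded records in each cell of a method-by-characteristic matrix, with all record contents and function names invented for illustration.

```python
from collections import Counter

# Hypothetical coded records; the reviewed studies themselves are not reproduced here.
coded_studies = [
    {"method": "Inter-rater reliability", "study_type": "Applied"},
    {"method": "Inter-rater reliability", "study_type": "Applied"},
    {"method": "Rasch measurement theory", "study_type": "Methodological"},
    {"method": "Rater agreement", "study_type": "Applied"},
    {"method": "Rasch measurement theory", "study_type": "Applied"},
]

def matrix_coding(records, row_key, col_key):
    """Count records in each (row, column) cell of a coding matrix."""
    return Counter((rec[row_key], rec[col_key]) for rec in records)

def column_percentages(cells):
    """Express each cell count as a percentage of its column total."""
    col_totals = Counter()
    for (_, col), n in cells.items():
        col_totals[col] += n
    return {cell: round(100 * n / col_totals[cell[1]], 2) for cell, n in cells.items()}

cells = matrix_coding(coded_studies, "method", "study_type")
percentages = column_percentages(cells)
```

Swapping `"study_type"` for another coded characteristic (for example, a hypothetical `"rating_design"` field) would produce the corresponding cross-tabulation.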
Results
Overall, results from our analysis of the 259 selected studies indicated a range of meth-
ods for evaluating rating quality in language assessment research; Appendix A includes
a description of each of the methods that we identified. Table 2 includes the overall
results in terms of the distribution of selected studies across journals. In this section, we
present results using descriptive statistics to describe patterns observed across the 259
selected studies.
Overall characteristics
Table 3 includes a summary of the characteristics of the 259 selected studies in terms of
the major coding categories. The literature review included more applied studies (n =
160, 61.78%) than methodological studies (n = 99, 38.22%). Three dominant methods
for evaluating rating quality appeared: Inter-rater reliability (n = 80, 30.89%), rater
agreement (n = 52, 20.08%), and Rasch measurement theory (n = 48, 18.53%). In terms
of the language constructs, we found that most research on rater-mediated language
assessment was focused on L1 writing (n = 101, 39.00%), followed by L1 reading (n =
43, 16.60%). Although most studies included real data, a substantial proportion used
simulated data (n = 49, 18.92%). The studies were split roughly evenly between those focused
primarily on rating quality (n = 136, 52.51%) and those in which rating quality was not the
primary focus (n = 123, 47.49%). Finally, about two fifths of the studies involved
fully crossed rating designs in which all raters scored all language performances (n =
106, 40.93%).
Research traditions
Our analysis of the 259 selected studies revealed 14 quantitative techniques for evaluat-
ing ratings (see Table 4). When considered in terms of the theoretical framework, the
methods for evaluating ratings indicate a dominance of the observed ratings tradition in
Table 4. Methods for evaluating rating quality across study characteristics.
                                                  Total           A. Type                         B. Focus on rating quality
                                                  (N = 259)       Applied         Methodological  Rating quality  Rating quality
                                                                  (n = 160)                       primary         not primary
Method for evaluating rating quality                n      %        n      %        n      %        n      %        n      %
Scaled ratings tradition
  Rasch measurement theory                         48  81.36       29 100.00       19  63.33       38  80.85       10  83.33
  Other IRT                                         9  15.25        0   0.00        9  30.00        7  14.89        2  16.67
  Mokken scaling                                    1   1.69        0   0.00        1   3.33        1   2.13        0   0.00
  Signal detection theory                           1   1.69        0   0.00        1   3.33        1   2.13        0   0.00
  Total across scaled ratings tradition            59    100       29    100       30    100       47    100       12    100
Observed ratings tradition
  Inter-rater reliability                          80  40.00       63  48.09       17  24.64       21  23.60       59  53.15
  Rater agreement                                  52  26.00       38  29.01       14  20.29       18  20.22       34  30.63
  Generalizability theory                          23  11.50       12   9.16       11  15.94       13  14.61       10   9.01
  Expert agreement                                 11   5.50        4   3.05        7  10.14       10  11.24        1   0.90
  Correlations with external variable              10   5.00        5   3.82        5   7.25        8   8.99        2   1.80
  Factor analysis or structural equation modeling   7   3.50        1   0.76        6   8.70        6   6.74        1   0.90
  Regression or ANOVA                               7   3.50        5   3.82        2   2.90        5   5.62        2   1.80
  Hierarchical linear model*                        5   2.50        1   0.76        4   5.80        4   4.49        1   0.90
  Decision consistency/accuracy                     3   1.50        1   0.76        2   2.90        2   2.25        1   0.90
  Intra-rater reliability                           1   0.50        0   0.00        1   1.45        1   1.12        0   0.00
  Odds ratio                                        1   0.50        1   0.76        0   0.00        1   1.12        0   0.00
  Total across observed ratings tradition         200    100      131    100       69    100       89    100      111    100
Table 4. (Continued)
                                                  C. Language construct
Method for evaluating rating quality                n      %     n      %     n      %     n      %     n      %     n      %     n      %     n      %
Scaled ratings tradition
  Rasch measurement theory                         20  83.33     1 100.00    14 100.00     8 100.00     0   0.00     0   0.00     3  75.00     0   0.00
  Other IRT                                         2   8.33     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     1  25.00     0   0.00
  Mokken scaling                                    1   4.17     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Signal detection theory                           1   4.17     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Total across scaled ratings tradition            24    100     1    100    14    100     8    100     0      0     0      0     4    100     0      0
Observed ratings tradition
  Inter-rater reliability                          33  42.86    21  50.00     7  35.00     4  50.00     3  60.00     3  60.00     0   0.00     0   0.00
  Rater agreement                                  13  16.88    18  42.86     5  25.00     1  12.50     1  20.00     2  40.00     0   0.00     1 100.00
  Generalizability theory                           8  10.39     1   2.38     5  25.00     1  12.50     1  20.00     0   0.00     0   0.00     0   0.00
  Expert agreement                                  6   7.79     0   0.00     0   0.00     1  12.50     0   0.00     0   0.00     0   0.00     0   0.00
  Correlations with external variable               5   6.49     1   2.38     1   5.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Factor analysis or structural equation modeling   4   5.19     1   2.38     0   0.00     0   0.00     0   0.00     0   0.00     1 100.00     0   0.00
  Regression or ANOVA                               3   3.90     0   0.00     1   5.00     1  12.50     0   0.00     0   0.00     0   0.00     0   0.00
  Hierarchical linear model                         2   2.60     0   0.00     1   5.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Decision consistency/accuracy                     2   2.60     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Intra-rater reliability                           0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Odds ratio                                        1   1.30     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00     0   0.00
  Total across observed ratings tradition          77    100    42    100    20    100     8    100     5    100     5    100     1    100     1    100
Table 4. (Continued)
                                                  D. Simulation   E. Rating design
                                                  studies
Method for evaluating rating quality                n      %        n      %        n      %        n      %        n      %
Scaled ratings tradition
  Rasch measurement theory                          2  25.00       24  88.89       10  76.92       14  73.68        0   0.00
  Other IRT                                         6  75.00        2   7.41        2  15.38        5  26.32        0   0.00
  Mokken scaling                                    0   0.00        1   3.70        0   0.00        0   0.00        0   0.00
  Signal detection theory                           0   0.00        0   0.00        1   7.69        0   0.00        0   0.00
  Total across scaled ratings tradition             8    100       27    100       13    100       19    100        0    100
Observed ratings tradition
  Inter-rater reliability                           9  21.95       30  37.97       23  35.94       24  53.33        1  12.50
  Rater agreement                                  11  26.83       23  29.11       16  25.00       13  28.89        0   0.00
  Generalizability theory                           7  17.07       10  12.66        6   9.38        3   6.67        4  50.00
  Expert agreement (accuracy)                       4   9.76        5   6.33        5   7.81        1   2.22        0   0.00
  Correlations with external variable               3   7.32        4   5.06        4   6.25        0   0.00        1  12.50
  Factor analysis or SEM                            1   2.44        2   2.53        2   3.13        1   2.22        1  12.50
  Regression or ANOVA                               2   4.88        4   5.06        1   1.56        2   4.44        0   0.00
  Decision consistency/accuracy                     2   4.88        0   0.00        3   4.69        1   2.22        1  12.50
  Hierarchical linear model                         1   2.44        0   0.00        3   4.69        0   0.00        0   0.00
  Intra-rater reliability                           1   2.44        1   1.27        0   0.00        0   0.00        0   0.00
  Odds ratios                                       0   0.00        0   0.00        1   1.56        0   0.00        0   0.00
  Total across observed ratings tradition          41    100       79    100       64    100       45    100        8    100
*The articles included in the hierarchical linear model (HLM) category listed within the “Observed ratings tradition”
did not include the use of hierarchical generalized linear models (HGLMs). Articles in which HGLMs were applied were
included in the “Other IRT” classification within the scaled ratings tradition.
research on rater-mediated language assessment, with 200 of the 259 studies (77.22%)
classified within the observed ratings tradition, compared to 59 studies (22.78%) classi-
fied within the scaled ratings tradition.
The distribution of methods also differed within each tradition. Three techniques each
accounted for more than 10% of the studies classified within the observed ratings
tradition (inter-rater reliability: n = 80, 40.00%; rater agreement: n = 52, 26.00%; and
generalizability theory: n = 23, 11.50%), whereas studies within the scaled ratings
tradition were most often based on Rasch measurement theory (n = 48, 81.36%).
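Because the percentages reported in this section use different denominators (all 259 studies versus the studies within one tradition), a short arithmetic check may help readers keep them apart. The counts below are taken from the text and Table 4; the variable and function names are ours.

```python
def percent(count, total):
    """Percentage rounded to two decimals, as reported in the tables."""
    return round(100 * count / total, 2)

TOTAL_STUDIES = 259
OBSERVED_TOTAL = 200   # studies classified in the observed ratings tradition
SCALED_TOTAL = 59      # studies classified in the scaled ratings tradition

# Share of all studies falling in each tradition.
across_traditions = {
    "observed": percent(OBSERVED_TOTAL, TOTAL_STUDIES),
    "scaled": percent(SCALED_TOTAL, TOTAL_STUDIES),
}

# Share of each method within its own tradition.
within_observed = {
    "Inter-rater reliability": percent(80, OBSERVED_TOTAL),
    "Rater agreement": percent(52, OBSERVED_TOTAL),
    "Generalizability theory": percent(23, OBSERVED_TOTAL),
}
rasch_within_scaled = percent(48, SCALED_TOTAL)

# The same Rasch count expressed against all studies.
rasch_overall = percent(48, TOTAL_STUDIES)
```

The same 48 Rasch studies therefore correspond to 81.36% of the scaled ratings tradition but 18.53% of the full sample.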
Language constructs
Table 4, column C presents the observed frequencies of methods for evaluating ratings
across the language constructs. In terms of L1 assessments of reading, writing, speaking,
and listening, the results in Table 4, column C indicate some variation in methods for
evaluating ratings across constructs. Within L1 writing assessments (n = 101), methods
based on the observed ratings tradition were most prevalent (n = 77, 76%) and followed
the pattern observed within the entire sample of studies, where inter-rater reliability anal-
yses were used most often (n = 33, 42.86%). When researchers used methods based on
the scaled ratings tradition in the context of L1 writing assessments, Rasch measurement theory
was the most frequent approach (n = 20, 83.33%). We observed a similar emphasis on
methods from the observed ratings tradition in assessments of L1 reading (n = 43), where
we only identified one study in which researchers used methods based on the scaled rat-
ings tradition; the authors of this study used Rasch measurement theory to evaluate rat-
ings. The remaining studies related to L1 reading included inter-rater reliability as the
most common approach within the observed ratings tradition (n = 21, 50%), followed by
rater agreement (n = 18, 42.86%). On the other hand, researchers who evaluated ratings
within L1 speaking assessments most frequently drew upon the scaled ratings tradition
through the use of Rasch measurement theory (n = 3, 75%); however, we identified only
a small number of studies related to this construct (n = 5), such that the dominance of the
scaled ratings tradition observed here may not be generalizable across research on L1
speaking assessments. Similarly, we identified only one instance of an L1 listening
assessment; the authors of this article used inter-rater reliability coefficients as evidence
of rating quality.
Within selected studies in which researchers examined rater-mediated language
assessments in L2, we found similar patterns to L1 assessments. Within L2 writing
assessment research (n = 34), researchers used methods based on the observed ratings tradition
slightly more frequently (n = 20) compared to the scaled ratings tradition (n = 14),
where results within the research traditions followed the general pattern observed within
the entire sample of selected studies. Specifically, Rasch measurement theory was the
only scaled ratings tradition method for evaluating ratings (n = 14, 100%), and research-
ers used inter-rater reliability studies most frequently (n = 7, 35%) among methods based
on the observed ratings tradition. Within research on L2 speaking (n = 16), an equal
number of studies applied methods from the scaled ratings and observed ratings traditions
(n = 8 each). Similar to L2 writing, Rasch
measurement theory was the only method used to evaluate ratings based on the scaled
ratings tradition (n = 8, 100%), and researchers used inter-rater reliability studies most
frequently (n = 4, 50%) among methods based on the observed ratings tradition. When
researchers evaluated ratings in L2 reading assessments (n = 5), rating quality analyses
included only methods based on the observed ratings tradition: inter-rater reliability
(n = 3, 60%) and rater agreement (n = 2, 40%). However, the small number of included
studies that described L2 reading assessments may limit the generalizability of these
results.
Finally, among studies that included multiple language constructs (n = 5), researchers
drew exclusively upon the observed ratings tradition to evaluate ratings, with inter-rater
reliability used most often (n = 3, 60%), followed by rater agreement (n = 1, 20%) and
Rating designs
Table 4, column E presents methods for evaluating rating quality across rating designs.
Overall, these results indicate that most research on rater-mediated language assessments
is based on fully crossed (i.e., complete) rating designs (n = 106), or does not give explicit
consideration to the topic of rating designs (n = 77). In general, the results within each of
these categories follow the pattern observed in the overall sample, where the most com-
mon method for evaluating ratings based on the scaled ratings tradition is Rasch meas-
urement theory, and methods based on the observed ratings tradition are dominated by
inter-rater reliability, rater agreement, and generalizability theory.
Discussion
The purpose of this study was to identify and explore the dominant methods used to
evaluate rating quality within the context of rater-mediated language assessments in
order to consider the implications of the use of these methods as evidence of psychometric
quality. Using an adaptation of Engelhard’s (2013) theoretical framework based on
research traditions, we classified within the observed ratings tradition those methods that
focus on the consistency of observed ratings and the sources of error in them, and we
classified within the scaled ratings tradition those methods that focus on locating raters,
students, and other facets on a scale with equal units (i.e., a linear continuum) that
represents a latent variable.
We used this framework to examine 259 peer-reviewed articles in which researchers
described the use of a single quantitative technique for evaluating the quality of ratings
in a rater-mediated language assessment between 1980 and 2016. Similar to the findings
reported in previous reviews (Meadows & Billington, 2005; Myford & Wolfe, 2003;
Saal et al., 1980; Tisi et al., 2013), our results suggested that researchers during the
selected time period applied a wide range of techniques to evaluate rating quality that
reflect both the observed ratings and scaled ratings tradition. Furthermore, our results
indicated that researchers’ use of techniques within these traditions varied across applied
Validity
According to the Standards (AERA et al., 2014), validity is “the degree to which accu-
mulated evidence and theory support a specific interpretation of test scores for a given
use of a test” (p. 11). In the context of rater-mediated assessments, raters and rating qual-
ity play a central role in the interpretation of test scores as indicators of student standing
on a construct. Because rater judgments are the primary channel through which informa-
tion about student achievement is provided, raters can be viewed as a type of “lens” that
mediates the interpretation of student performances in terms of a construct (Cooksey,
1996a, 1996b; Cooksey, Freebody, & Davidson, 1986; Cooksey, Freebody, & Wyatt-
Smith, 2007; Engelhard, 2013; Hogarth & Karelaia, 2007; Thompson, Foster, Cole, &
Dowding, 2005). Consequently, evidence about the quality of ratings provides insight
into the degree to which ratings can be interpreted and used for a particular purpose (i.e.,
validity evidence).
Despite the centrality of raters in these assessments, most discussions of validity for
performance assessment systems, including the discussion of validity issues related to
performance assessments in the Standards, do not explicitly focus on the role of the rater
in performance assessments; rather, considerations related to rating quality generally
appear in discussions of reliability. Our finding of an emphasis on inter-rater reliability
and rater agreement as evidence of rating quality reflects the general perspective that
rating quality considerations are not a key source of validity evidence for performance
assessments. Further, our findings suggest a general perspective that methods that
describe the overall consistency of ratings are sufficient descriptions of rating quality
to support the interpretation and use of scores from rater-mediated assessments.
In light of the central role of raters and rating quality in rater-mediated assessments,
these group-level indicators do not provide sufficient validity evidence to support the
interpretation and use of ratings. In addition to these indices, it is essential that language
assessment researchers consider the use of methods that result in diagnostic information
about rating quality related to individual raters, students, and other assessment compo-
nents as additional information to support validity arguments for rater-mediated lan-
guage assessments, such as those available within the scaled ratings tradition.
Reliability/precision
The recent revision of the Standards (AERA et al., 2014) defines reliability/precision as
the “consistency over replications of the testing procedure” (p. 35). In contrast to valid-
ity, discussions related to reliability/precision in the context of performance assessments,
including the Standards, often include the explicit consideration of raters. In particular,
the Standards chapter on reliability/precision describes the role of measurement error in
rater-mediated assessments as follows: “if raters are used to assign scores to responses,
the variability in scores over qualified raters is a source of error” (p. 33). The perspective
set forth in the Standards reflects the observed ratings tradition, where high levels of
rater agreement and reliability are presented as sufficient sources of evidence of psycho-
metric quality for performance assessments. Reflecting the Standards, the finding in the
current study of an overall emphasis on inter-rater reliability as evidence of rating quality
suggests an emphasis in rater-mediated assessment research on identifying rater consist-
ency as evidence of high-quality ratings.
In order to maximize consistency, most assessment systems incorporate rater training
exercises and qualification requirements prior to operational scoring (Johnson et al.,
2009). Despite these efforts, research on performance assessments indicates that differ-
ences in rater severity persist beyond training (e.g., Knoch, Read, & von Randow, 2007;
Raczynski, Cohen, Engelhard, & Lu, 2015; Weigle, 1998). Recognizing these persistent
differences, it is essential that indicators of rating quality go beyond the dominant meth-
ods observed in this review, which focus on rater consistency, and incorporate informa-
tion about individual raters into estimates of student achievement.
In particular, when rating quality is considered from the perspective of the scaled rat-
ings tradition, indicators of rating quality can be used to evaluate the reliability/precision
of rater-mediated assessment systems in terms of individual raters, students, and other
facets. Generally, these indicators are used to evaluate measurement precision in terms
of the degree to which students, raters, and other facets in an assessment system are
appropriately matched (i.e., targeted). In other words, because methods based on the
scaled ratings tradition result in estimates of student achievement, rater severity, and
other facets on a scale that represents the construct, it is possible to evaluate the align-
ment among the estimated locations of raters, students, and other facets as indicators of
precision. Closer alignment (i.e., better targeting) results in more-precise estimates
(Embretson, 1996). Rather than focusing on the decomposition of error variance into
overall sources of measurement error, methods for evaluating rating quality based on the
scaled ratings tradition focus on examining the precision of measurement.
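The link between targeting and precision can be sketched under a dichotomous Rasch-type simplification, in which each rating contributes information p(1 - p) and the standard error of a student's estimate is the inverse square root of the summed information. The rater severities below are hypothetical, and the function names are ours.

```python
from math import exp, sqrt

def rasch_prob(theta, severity):
    """Probability of the higher rating category under a dichotomous Rasch-type model."""
    return 1.0 / (1.0 + exp(-(theta - severity)))

def standard_error(theta, severities):
    """Standard error of a student's achievement estimate, given the raters who scored them."""
    information = 0.0
    for severity in severities:
        p = rasch_prob(theta, severity)
        information += p * (1.0 - p)   # information contributed by one rating
    return 1.0 / sqrt(information)

# Four raters whose severities match the student's location (well targeted)...
se_targeted = standard_error(theta=0.0, severities=[0.0, 0.0, 0.0, 0.0])
# ...versus four raters located three logits away from the student (poorly targeted).
se_mistargeted = standard_error(theta=0.0, severities=[3.0, 3.0, 3.0, 3.0])
```

With the same number of ratings, the poorly targeted design yields a larger standard error, which is the sense in which closer alignment produces more precise estimates.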
Fairness
The third foundational area in the revised Standards is fairness, which is defined as
“responsiveness to individual characteristics and testing contexts so that test scores will
yield valid interpretations for intended uses” (p. 50). In the context of rater-mediated
assessment systems, fairness concerns are often discussed in relation to construct-
irrelevant influences on rater judgment. These influences may include characteristics of
students, such as demographic variables, characteristics of raters, such as previous
experiences or levels of training, or characteristics of the assessment context, such as
the administration platform for composing essays.
When considered in light of this conceptualization of fairness, the dominance of inter-rater
reliability and rater agreement identified in this review suggests that research on
rater-mediated language assessments often does not include sufficient evidence to explicitly
evaluate the potential effects of construct-irrelevant influences on rater judgment.
In order to obtain empirical evidence of fairness in a rater-mediated assessment, methods
for evaluating rating quality must be employed that allow for the explicit consideration
of these potential influences within a measurement framework based on fundamental
measurement properties.
In contrast to overall measures of inter-rater reliability, a variety of methods are
available within both the observed ratings and scaled ratings traditions that facilitate the
exploration of the influence of a variety of potentially construct-irrelevant variables.
Within the observed ratings tradition, these methods include generalizability theory and
its recent extensions (Marcoulides & Drezner, 1997, 2000). Within the scaled ratings
tradition, differential rater functioning analyses can be used to identify potential areas
for further investigation, including qualitative studies, in order to more fully explore the
quality of a rater-mediated assessment in terms of fairness. Rather than focusing on
overall indices of rater consistency, the full consideration of fairness concerns related to
raters requires methods that allow for the direct examination of potential construct-
irrelevant influences on raters’ judgmental processes at the individual rater level.
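To make concrete what an overall index of rater consistency summarizes, the following sketch computes two common group-level statistics, exact agreement and Cohen's kappa, for a pair of raters. The ratings are invented for illustration and the function names are hypothetical; the sketch also assumes that chance agreement is below 1, since kappa is undefined otherwise.

```python
from collections import Counter


def exact_agreement(ratings_a, ratings_b):
    """Proportion of performances given identical ratings by both raters."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = exact_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the product of each rater's marginal proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Invented ratings on a four-category scale for eight performances.
rater_1 = [1, 2, 3, 3, 2, 1, 4, 4]
rater_2 = [1, 2, 3, 2, 2, 1, 4, 3]

agreement = exact_agreement(rater_1, rater_2)
kappa = cohen_kappa(rater_1, rater_2)
print(f"Exact agreement: {agreement:.2f}")
print(f"Cohen's kappa:   {kappa:.2f}")
```

Both statistics reduce the pair of raters to a single number. Neither indicates which rater is more severe, nor whether the two disagreements are associated with characteristics of the students or of the assessment context, which is the kind of information that rater-level analyses are intended to provide.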
Conclusions
Next, we present and discuss conclusions as they relate to the research questions used to
guide this study.
Overall, the results revealed an emphasis on techniques based on the observed ratings
tradition, with inter-rater reliability, rater agreement, and traditional applications of
generalizability theory (e.g., Brennan, 2000; Shavelson & Webb, 1991) as the most
commonly observed methods for evaluating ratings. These three methods result in
In applied studies of rater-mediated language assessments, what indices of rating quality did
researchers report? In methodological studies of rater-mediated language assessments, what
indices of rating quality did researchers report?
Our results indicated similar techniques across methodological and applied research that
reflect the overall focus on inter-rater reliability in studies of rater-mediated language
assessments. Although we observed several techniques in methodological studies that
did not appear in the applied research, this finding suggests an overall shared understand-
ing of available techniques across both types of research.
In studies of rater-mediated language assessments in which rating quality was the primary
focus, what indices of rating quality did researchers report? In studies of rater-mediated
language assessments in which rating quality was not the primary focus, what indices of rating
quality did researchers report?
As we observed related to methodological and applied research, the results indicated the
same overall dominance of methods based on the observed ratings tradition across stud-
ies in which rating quality was a primary focus and studies in which rating quality was
not the primary focus. However, it is interesting to note that researchers applied a slightly
more diverse range of methods based on the scaled ratings tradition in studies where rat-
ing quality was the primary focus compared to studies in which rating quality was not the
primary focus. This finding suggests a potentially greater awareness of the implications
of different rating quality indices among researchers whose primary focus was on
rating quality. In particular, when rating quality was of primary interest, researchers used
additional techniques classified within the scaled ratings tradition. As we noted above,
these methods often go beyond group-level rating quality indices and provide additional
diagnostic indices that can be used to inform the interpretation of the validity, reliability/
precision, and fairness of ratings. The somewhat more limited range of methods that appeared
among studies in which researchers were not explicitly focused on rating quality reflects
McNamara and Knoch’s (2012) observation that “the professional background of many
language testers lies outside measurement”, and suggests a need for increased communi-
cation across the psychometric and language communities (p. 4).
We identified studies related to L1 writing most frequently. This finding suggests a poten-
tially greater emphasis on rater-mediated assessments in research within this domain,
compared to other language constructs. Further, we observed more variation in techniques
for evaluating ratings within L1 writing assessment than the other constructs. However,
with the exception of the small sample of L1 speaking assessments (n = 5), and L2 speak-
ing assessments, we observed rating quality indices based on the observed ratings tradi-
tion most often across all of the language constructs observed in the selected studies.
Although methods based on the observed ratings tradition generally dominated the
selected studies, it is interesting to note the slightly more balanced distribution of meth-
ods based on the observed ratings and scaled ratings traditions among studies of L2 writ-
ing and L2 speaking. Further, whereas Rasch measurement theory was the only scaled
ratings tradition method that was applied in studies of both of these constructs, we
observed a variety of methods based on the observed ratings tradition. This finding sug-
gests that researchers who study L2 writing and speaking assessments may be familiar
with Rasch-based techniques for evaluating rating quality as an additional tool besides
methods based on the observed ratings tradition.
Among the simulation studies identified in this review, the most common methods for
evaluating rating quality were rater agreement and rater reliability. This finding suggests
that the emphasis on rating quality indices based on observed ratings persists across real
and simulated data. When researchers applied methods based on the scaled ratings tradi-
tion to simulated data, IRT techniques were most common.
In studies of rater-mediated language assessments, what information about the rating designs
employed did researchers report?
In terms of rating designs, the results indicated a prevalence of fully crossed data collec-
tion systems, in which every rater scored every student on every assessment task. This
finding suggests a potential disconnection between research and practice related to rater-
mediated assessments. Specifically, practical considerations in most operational assess-
ment systems limit the use of fully crossed rating designs. Although several methods for
evaluating rating quality are well suited to handle a variety of rating designs that involve
incomplete data (e.g., methods based on Rasch measurement theory and other item
response models), the explicit consideration of rating quality in the context of opera-
tional rating designs would potentially provide a more concrete connection between
research and practice in language assessment. The prevalence of fully crossed rating
designs may also reflect our focus on manuscripts published in peer-reviewed journals.
Had our search included other outlets in which research on rating quality is reported,
such as technical documentation, less-connected designs may have appeared more often,
as these designs reflect operational assessment systems.
In addition, a particularly concerning finding from this review was the result that the
authors of 77 of the studies included in this review did not provide any details regarding
the rating designs used to collect data. This result signals a potential lack of awareness among
language assessment researchers regarding the significance of connectivity in assess-
ment designs, particularly when methods are employed whose interpretation depends on
sufficient connectivity across facets, such as Rasch and item response theory models.
Furthermore, this finding signals a weakness in language assessment research related to
performance assessments with regard to the description of data collection methods.
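Whether a rating design is connected can be checked directly from the record of who rated whom. The sketch below treats raters and students as nodes in a graph, links a rater to every student he or she scored, and uses a breadth-first search to test whether all nodes can be reached from one another. The function name, the data layout (a list of rater and student pairs), and the three example designs are assumptions made for illustration; operational designs would usually include additional facets such as tasks.

```python
from collections import defaultdict, deque


def is_connected(ratings):
    """Return True if every rater and student is linked through shared ratings.

    `ratings` is an iterable of (rater, student) pairs.
    """
    graph = defaultdict(set)
    for rater, student in ratings:
        rater_node, student_node = ("rater", rater), ("student", student)
        graph[rater_node].add(student_node)
        graph[student_node].add(rater_node)
    if not graph:
        return True
    # Breadth-first search from an arbitrary node.
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)


raters = ["R1", "R2", "R3"]
students = ["S1", "S2", "S3", "S4"]

# Fully crossed: every rater scores every student.
fully_crossed = [(r, s) for r in raters for s in students]

# Incomplete but linked: neighboring raters share one student.
linked = [("R1", "S1"), ("R1", "S2"), ("R2", "S2"),
          ("R2", "S3"), ("R3", "S3"), ("R3", "S4")]

# Disconnected: R3 and S4 share no ratings with the rest of the design.
disconnected = [("R1", "S1"), ("R1", "S2"), ("R2", "S2"),
                ("R2", "S3"), ("R3", "S4")]

for name, design in [("fully crossed", fully_crossed),
                     ("linked", linked),
                     ("disconnected", disconnected)]:
    print(f"{name}: connected = {is_connected(design)}")
```

In the disconnected design, the severity of R3 cannot be compared with that of R1 or R2 on a common scale, which is why the interpretation of Rasch and item response theory estimates depends on the design being reported.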
Limitations
When considering the results from this study, it is important to note several limitations.
First, our review only included peer-reviewed journal articles within a subset of educa-
tional measurement and language/literacy journals. Our focus on journal articles in this
review resulted in the exclusion of other outlets in which methods for evaluating rating
quality are often employed, such as technical reports, books, and chapters in edited vol-
umes. Similarly, although we used the selection criteria for journals to identify top-tier,
international research outlets, the omission of other journals within the fields of educa-
tional measurement, language/literacy instruction, and language/literacy assessment
may have limited generalizability of the findings. In particular, the journals that we
selected are primarily published in the USA and the UK – thus potentially limiting the
generalizability of the current findings to these national contexts. Furthermore, our use
of a critical value for impact factors to inform journal selection may have resulted in the
omission of research that, if included, would have revealed different patterns in the meth-
ods used to evaluate rating quality in language assessment.
Another important limitation to note is related to our exclusive focus in this study on
articles that included only a single method for evaluating rating quality. Comparison
studies that include multiple methods reflect an important component of methodological
research whose examination may provide additional insight into the development and
advancement of techniques for evaluating rating quality. Furthermore, our analysis did
not include an examination of methods for evaluating rating quality across time.
Accordingly, it is not possible to draw conclusions regarding the development of and
changes in the frequency of methods for evaluating rating quality from a historical
perspective.
Additionally, it is important to consider the implications of differences in methods for
evaluating rating quality across different types of constructed response tasks (e.g.,
extended essay responses or short answer tasks), different scoring schemes (e.g., ana-
lytic, holistic, trait-based), and the stakes associated with ratings on these tasks. In par-
ticular, constructed response tasks that are used across measures of reading, writing,
listening, and speaking often reflect different formats and scoring schemes. For example,
whereas writing assessments frequently include extended response tasks, such as essays,
reading and listening assessments often include shorter constructed-response tasks.
These differences may influence researchers’ choices of indicators of rating quality that
are of interest – thus potentially limiting the comparability of methods for evaluating
rating quality across different types of assessments. Nonetheless, all constructed response
assessments that involve rater judgments invoke a judgmental process whose evaluation
is of interest as evidence of psychometric quality.
be considered alongside those of McNamara and Knoch (2012), whose review provided
some historical insight into the use of models based on Rasch measurement theory in
language assessment.
Finally, an examination of methods for evaluating rating quality specific to different
formats of constructed response assessment tasks, scoring schemes, and assessments
with different consequences may shed additional light on the selection of rating quality
indicators that was not captured in the combined analysis of these types of assessments.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this
article.
References
American Council on the Teaching of Foreign Languages (ACTFL). (2012). ACTFL proficiency
guidelines. Alexandria, VA. Retrieved from www.actfl.org/publications/guidelines-and-
manuals/actfl-proficiency-guidelines-2012
American Educational Research Association (AERA), American Psychological Association
(APA), & National Council on Measurement in Education (NCME). (2014). Standards for
educational and psychological testing. Washington, DC: AERA.
Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach.
Language Testing, 27(4), 515–535. https://doi.org/10.1177/0265532210368717
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and
rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293.
https://doi.org/10.1080/0969594X.2010.526585
Bechger, T. M., Maris, G., & Hsiao, Y. P. (2010). Detecting halo effects in performance-based
examinations. Applied Psychological Measurement, 34(8), 607–619. https://doi.org/10.1177/
0146621610367897
Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for various decision
making problems. Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability,
Part 5. In Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-
Wesley.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational
Measurement: Issues and Practice, 16(4), 14–20.
Brennan, R. L. (2000). Performance assessments from the perspective of generalizability
theory. Applied Psychological Measurement, 24(4), 339–353. https://doi.org/10.1177/
01466210022031796
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of
essays: Differences by gender, ethnicity, and country. Applied Measurement in Education,
25(1), 27–40. https://doi.org/10.1080/08957347.2012.635502
Burke, J., & Cizek, G. (2006). Effects of composition mode and self-perceived computer skills
on essay scores of sixth graders. Assessing Writing, 11, 148–166. https://doi.org/10.1016/j.
asw.2006.11.003
Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assess-
ments. Applied Psychological Measurement, 24(4), 310–324. https://doi.org/10.1177/
01466210022031778
Cooksey, R. W. (1996a). Judgment analysis: Theory, methods, and applications. San
Diego, CA: Academic Press.
Cooksey, R. W. (1996b). The methodology of social judgement theory. Thinking & Reasoning,
2(2–3), 141–174. https://doi.org/10.1080/135467896394483
Cooksey, R. W., Freebody, P., & Davidson, G. R. (1986). Teachers’ predictions of children’s
early reading achievement: An application of social judgment theory. American Educational
Research Journal, 23(1), 41. https://doi.org/10.2307/1163041
Cooksey, R. W., Freebody, P., & Wyatt-Smith, C. (2007). Assessment as judgment-in-context:
Analysing how teachers evaluate students’ writing. Educational Research and Evaluation,
13(5), 401–434. https://doi.org/10.1080/13803610701728311
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavio-
ral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
DeCarlo, L. T. (2005). A model of rater behavior in essay grading based on signal detection theory.
Journal of Educational Measurement, 42(1), 53–76.
DeCarlo, L. T., Kim, Y., & Johnson, M. S. (2011). A hierarchical rater model for constructed
responses, with a signal detection rater model: Hierarchical signal detection rater model.
Journal of Educational Measurement, 48(3), 333–356.
https://doi.org/10.1111/j.1745-3984.2011.00143.x
Dobria, L. (2011). Longitudinal rater modeling with splines. University of Illinois at Chicago.
Retrieved from http://gradworks.umi.com/34/72/3472389.html
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assess-
ments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221. https://
doi.org/10.1207/s15434311laq0203_2
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-
mediated assessments (2nd ed.). Frankfurt am Main: Peter Lang.
Egan, O., & Archer, P. (1985). The accuracy of teachers’ ratings of ability: A regression model.
American Educational Research Journal, 22(1), 25–34. https://doi.org/http://aer.sagepub.
com.libdata.lib.ua.edu/content/22/1/25.full.pdf+html
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8(4),
341–349. https://doi.org/10.1037/1040-3590.8.4.341
Engelhard, G. (2008). Historical perspectives on invariant measurement: Guttman, Rasch, and
Mokken. Measurement, 6(3), 155–189. https://doi.org/10.1080/15366360802197792
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and
health sciences. New York: Routledge.
Farnia, F., & Geva, E. (2013). Growth and predictors of change in English language learn-
ers’ reading comprehension. Journal of Research in Reading, 36(4), 389–421. https://doi.
org/10.1111/jrir.12003
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Fisicaro, S. A., & Lance, C. E. (1990). Implications of three causal models for the measure-
ment of halo error. Applied Psychological Measurement, 14(4), 419–429. https://doi.
org/10.1177/014662169001400407
Fritz, E., & Ruegg, R. (2013). Rater sensitivity to lexical accuracy, sophistication and range when
assessing writing. Assessing Writing, 18, 173–181. https://doi.org/10.1016/j.asw.2013.02.001
Gao, X., & Brennan, R. L. (2001). Variability of estimated variance components and related
statistics in a performance assessment. Applied Measurement in Education, 14(2), 191–203.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
McNamara, T. (1996). Measuring second language performance. London and New York:
Longman.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in lan-
guage testing. Language Testing, 29(4), 555–576. https://doi.org/10.1177/0265532211430367
Meadows, M., & Billington, L. (2005). A review of the literature on marking reliability. London:
AQA for the National Assessment Agency. Retrieved from https://cerp.aqa.org.uk/sites/
default/files/pdf_upload/CERP_RP_MM_01052005.pdf
Michael, W. B., Cooper, T., Shaffer, P., & Wallis, E. (1980). A comparison of the reliability and
validity of ratings of student performance on essay examinations by professors of English
and by professors in other disciplines. Educational and Psychological Measurement, 40(1),
183–195. https://doi.org/10.1177/001316448004000131
Miles, M. B., Huberman, A. M., & Saldana, J. (2014). Qualitative data analysis (3rd ed.).
Thousand Oaks, CA: SAGE Publications. Retrieved from https://books.google.com/books/
about/Qualitative_Data_Analysis.html?id=3CNrUbTu6CsC
Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague: Mouton; Berlin:
De Gruyter.
Muckle, T., & Karabatsos, G. (2009). Hierarchical generalized linear models for the analysis of
judge ratings. Journal of Educational Measurement, 46(2), 198–219.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet
Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet
Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
Neter, J., & Wasserman, W. (1974). Applied linear statistical models: Regression, analysis of
variance, and experimental designs. Homewood, IL: Richard D. Irwin.
NVivo qualitative data analysis software. (2015). (Version 11) [Windows]. QSR International Pty
Ltd.
OECD. (2012). PISA 2009 Technical Report. OECD Publishing. Retrieved from http://www.oecd-
ilibrary.org/education/pisa-2009-technical-report_9789264167872-en
Patz, R. J., Wilson, M. J., & Hoskens, M. (1997). Optimal rating procedures and methodology
for NAEP open-ended items (Working Paper No. 97–37). Washington, DC: U.S.
Department of Education, Office of Educational Research and Improvement, National Center
for Education Statistics. Retrieved from http://nces.ed.gov/pubs97/9737.pdf
Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model
for rated test items and its application to large-scale educational assessment data. Journal of
Educational and Behavioral Statistics, 27(4), 341–384.
Penny, J., & Johnson, R. (2011). The accuracy of performance task scores after resolution of
rater disagreement: A Monte Carlo study. Assessing Writing, 16, 221–236. https://doi.
org/10.1016/j.asw.2011.06.001
Petticrew, M., & Roberts, H. (Eds.). (2006). Systematic reviews in the social sciences. Oxford,
UK: Blackwell. Retrieved from http://doi.wiley.com/10.1002/9780470754887
Plakans, L., & Gebril, A. (2012). A close investigation into source use in integrated second lan-
guage writing tasks. Assessing Writing, 17, 18–34. https://doi.org/10.1016/j.asw.2011.09.002
Popp, S. E. O., Ryan, J. M., & Thompson, M. S. (2009). The critical role of anchor paper selec-
tion in writing assessment. Applied Measurement in Education, 22(3), 255–271. https://doi.
org/10.1080/08957340902984026
Raczynski, K. R., Cohen, A. S., Engelhard, G., & Lu, Z. (2015). Comparing the effectiveness of
self-paced and collaborative frame-of-reference training on rater accuracy in a large-scale
writing assessment: Comparing rater training methods. Journal of Educational Measurement,
52(3), 301–318. https://doi.org/10.1111/jedm.12079
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests (Expanded
edition, 1980. Chicago, IL: University of Chicago Press). Copenhagen, Denmark: Danish
Institute for Educational Research.
Reckase, M. D. (2009). Multidimensional item response theory (1st ed.). New York: Springer.
Rost, J. (1988). Rating scale analysis with latent class models. Psychometrika, 53(3), 327–348.
https://doi.org/10.1007/BF02294216
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric
quality of rating data. Psychological Bulletin, 88(2), 413–428.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation
modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Thousand Oaks, CA:
SAGE Publications.
Spearman, C. (1927). The abilities of man: Their nature and measurement. New York: Macmillan.
Sykes, R. C., Ito, K., & Wang, Z. (2008). Effects of assigning raters to items. Educational
Measurement: Issues and Practice, 27(1), 47–55. https://doi.org/10.1111/j.1745-3992.
2008.00114.x
Thompson, C. A., Foster, A., Cole, I., & Dowding, D. W. (2005). Using social judgement theory to
model nurses’ use of clinical information in critical care education. Nurse Education Today,
25(1), 68–77. https://doi.org/10.1016/j.nedt.2004.10.003
Thurstone, L. L. (1935). The vectors of mind. Chicago, IL: University of Chicago Press.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
Tisi, J., Whitehouse, G., Maughan, S., & Burdett, N. (2013). A review of literature on marking
reliability research (Report commissioned by the Office of Qualifications and Examinations
Regulation). Slough, UK: National Foundation for Educational Research. Retrieved from
www.nfer.ac.uk/publications/MARK01/MARK01.pdf
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2),
263–287.
Weigle, S. C. (2010). Validation of automated scores of TOEFL iBT tasks against non-test
indicators of writing ability. Language Testing, 27(3), 335–353. https://doi.org/10.1177/
0265532210364406
Wilson, J., Olinghouse, N., McCoach, D. B., Santangelo, T., & Andrada, G. (2015). Comparing
the accuracy of different scoring methods for identifying sixth graders at risk of failing a state
writing assessment. Assessing Writing, 27, 11–23. https://doi.org/10.1016/j.asw.2015.06.003
Wind, S. A. (2014). Evaluating rater-mediated assessment with Rasch measurement theory and
Mokken scaling. Emory University, USA. Retrieved from http://gradworks.umi.com.libdata.
lib.ua.edu/36/34/3634388.html
Wind, S. A., & Engelhard, G. (2015). Exploring rating quality in rater-mediated assessments
using Mokken Scale Analysis. Educational and Psychological Measurement. https://doi.
org/10.1177/0013164415604704
Wisconsin Center for Educational Research, University of Wisconsin-Madison. (2016). ACCESS
for ELLs 2.0 Interpretative Guide for Score Reports: Kindergarten – Grade 12. Madison, WI.
Wolfe, E. W. (1998). A two-parameter logistic rater model (2PLRM): Detecting rater harshness
and centrality. Presented at the American Educational Research Association, San Diego, CA.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Zhang, Y., & Elder, C. (2014). Investigating native and non-native English-speaking teacher
raters’ judgements of oral proficiency in the College English Test-Spoken English Test
(CET-SET). Assessment in Education: Principles, Policy & Practice, 21(3), 306–325.
https://doi.org/10.1080/0969594X.2013.845547
Appendix A. (Continued)