





Language Testing 2009 26 (2) 187–217

An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach
Youn-Hee Kim University of Toronto, Canada

This study used a mixed methods research approach to examine how native English-speaking (NS) and non-native English-speaking (NNS) teachers assess students' oral English performance. The evaluation behaviors of two groups of teachers (12 Canadian NS teachers and 12 Korean NNS teachers) were compared with regard to internal consistency, severity, and evaluation criteria. Results of a Many-faceted Rasch Measurement analysis showed that most of the NS and NNS teachers maintained acceptable levels of internal consistency, with only one or two inconsistent raters in each group. The two groups of teachers also exhibited similar severity patterns across different tasks. However, substantial dissimilarities emerged in the evaluation criteria teachers used to assess students' performance. A qualitative analysis demonstrated that the judgments of the NS teachers were more detailed and elaborate than those of the NNS teachers in the areas of pronunciation, specific grammar use, and the accuracy of transferred information. These findings are used as the basis for a discussion of NS versus NNS teachers as language assessors on the one hand and the usefulness of mixed methods inquiries on the other.

Keywords: mixed methods, NS and NNS, oral English performance assessment, many-faceted Rasch Measurement

In the complex world of language assessment, the presence of raters is one of the features that distinguish performance assessment from traditional assessment. While scores in traditional fixed response assessments (e.g., multiple-choice tests) are elicited solely from the interaction between test-takers and tasks, it is possible that the final scores awarded by a rater could be affected by variables


inherent to that rater (McNamara, 1996). Use of a rater for performance assessment therefore adds a new dimension of interaction to the process of assessment, and makes monitoring of reliability and validity even more crucial. The increasing interest in rater variability has also given rise to issues of eligibility; in particular, the question of whether native speakers should be the only "norm maker[s]" (Kachru, 1985) in language assessment has inspired heated debate among language professionals. The normative system of native speakers has long been assumed in English proficiency tests (Taylor, 2006), and it is therefore unsurprising that large-scale, high-stakes tests such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS) rendered their assessments using native English-speaking ability as a benchmark (Hill, 1997; Lowenberg, 2000, 2002). However, the current status of English as a language of international communication has caused language professionals to reconsider whether native speakers should be the only acceptable standard (Taylor, 2006). Indeed, non-native English speakers outnumber native English speakers internationally (Crystal, 2003; Graddol, 1997; Jenkins, 2003; Lowenberg, 2000), and localization of the language has occurred in expanding circle countries such as China, Korea, Japan, and Russia (Kachru, 1985, 1992; Lowenberg, 2000). These developments suggest that new avenues of opportunity may be opening for non-native English speakers as language assessors. This study, in line with the global spread of English as a lingua franca, investigates how native English-speaking (NS) and non-native English-speaking (NNS) teachers evaluate students' oral English performance in a classroom setting. A mixed methods approach will be utilized to address the following research questions:

1) Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students' oral English performance?
2) Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students' oral English performance?
3) How do NS and NNS teachers differ in drawing on evaluation criteria when they comment on students' oral English performance?


I Review of the literature

A great deal of research exploring rater variability in second language oral performance assessment has been conducted, with a number of early studies focusing on the impact of raters' different backgrounds (Barnwell, 1989; Brown, 1995; Chalhoub-Deville, 1995; Chalhoub-Deville & Wigglesworth, 2005; Fayer & Krasinski, 1987; Galloway, 1980; Hadden, 1991). In general, teachers and non-native speakers were shown to be more severe in their assessments than non-teachers and native speakers, but the outcomes of some studies contradicted one another. This may be explained by their use of different native languages, small rater samples, and different methodologies (Brown, 1995; Chalhoub-Deville, 1995). For example, in a study of raters' professional backgrounds, Hadden (1991) investigated how native English-speaking teachers and non-teachers perceive the competence of Chinese students in spoken English. She found that teachers tended to be more severe than non-teachers as far as linguistic ability was concerned, but that there were no significant differences in such areas as comprehensibility, social acceptability, personality, and body language. Chalhoub-Deville (1995), on the other hand, comparing three different rater groups (i.e., native Arabic-speaking teachers living in the USA, non-teaching native Arabic speakers living in the USA, and non-teaching native Arabic speakers living in Lebanon), found that teachers attended more to the creativity and adequacy of information in a narration task than to linguistic features. Chalhoub-Deville suggested that the discrepant findings of the two studies could be due to the fact that her study focused on modern standard Arabic (MSA), whereas Hadden's study focused on English. Another line of research has focused on raters' different linguistic backgrounds. Fayer and Krasinski (1987) examined how the English-speaking performance of Puerto Rican students was perceived by native English-speaking raters and native Spanish-speaking raters. The results showed that non-native raters tended to be more severe in general and to express more annoyance when rating linguistic forms, and that pronunciation and hesitation were the most distracting factors for both sets of raters. However, this was somewhat at odds with Brown's (1995) study, which found that while native speakers tended to be more severe than non-native speakers, the difference was not significant. Brown concluded that "there is little evidence that native speakers are more suitable than non-native speakers ... However, the way in which they perceive


the items (assessment criteria) and the way in which they apply the scale do differ" (p. 13). Studies of raters with diverse linguistic and professional backgrounds have also been conducted. Comparing native and non-native Spanish speakers with or without teaching experience, Galloway (1980) found that non-native teachers tended to focus on grammatical forms and reacted more negatively to nonverbal behavior and slow speech, while non-teaching native speakers seemed to place more emphasis on content and on supporting students' attempts at self-expression. Conversely, Barnwell (1989) reported that untrained native Spanish speakers provided more severe assessments than an ACTFL-trained Spanish rater. This result conflicts with that of Galloway (1980), who found that untrained native speakers were more lenient than teachers. Barnwell suggested that both studies were small in scope, and that it was therefore premature to draw conclusions about native speakers' responses to non-native speaking performance. Hill (1997) further pointed out that the use of two different versions of rating scales in Barnwell's study, one of which was presented in English and the other in Spanish, remains questionable. One recent study of rater behavior focused on the effect of country of origin and task on evaluations of students' oral English performance. Chalhoub-Deville and Wigglesworth (2005) investigated whether native English-speaking teachers who live in different English-speaking countries (i.e., Australia, Canada, the UK, and the USA) exhibited significantly different rating behaviors in their assessments of students' performance on three Test of Spoken English (TSE) tasks, namely 1) give and support an opinion, 2) picture-based narration, and 3) presentation, which require different linguistic, functional, and cognitive strategies. MANOVA results indicated significant variability among the different groups of native English-speaking teachers across all three tasks, with teachers residing in the UK the most severe and those in the USA the most lenient across the board; however, the very small effect size (η² = 0.01) suggested that little difference exists among different groups of native English-speaking teachers. Although the above studies provide some evidence that raters' linguistic and professional backgrounds influence their evaluation behavior, further research is needed for two reasons. First, most extant studies are not grounded in finely tuned methodologies. In some early studies (e.g., Fayer & Krasinski, 1987; Galloway, 1980;


Hadden, 1991), raters were simply asked to assess speech samples of less than four minutes' length without reference to a carefully designed rating scale. Also, having raters assess only one type of speech sample did not take the potential systematic effect of task type on task performance into consideration. Had the task types varied, raters could have assessed diverse oral language output, which in turn might have elicited unknown or unexpected rating behaviors. Second, to my knowledge, no previous studies have attempted to use both quantitative and qualitative rating protocols to investigate differences between native and non-native English-speaking teachers' judgments of their students' oral English performance. A mixed methods approach, known as the "third methodological movement" (Tashakkori & Teddlie, 2003, p. ix), incorporates quantitative and qualitative research methods and techniques into a single study and has the potential to reduce the biases inherent in one method while enhancing the validity of inquiry (Greene, Caracelli, & Graham, 1989). However, all previous studies that have examined native and non-native English-speaking raters' behavior in oral language performance assessment have been conducted using only a quantitative framework, preventing researchers from probing research phenomena from diverse data sources and perspectives. The mixed methods approach of the present study seeks to enhance understanding of raters' behavior by investigating not only the scores assigned by NS and NNS teachers but also how they assess students' oral English performance.

II Methodology

1 Research design overview

The underlying research framework of this study is based on both expansion and complementarity mixed methods designs, which are most commonly used in empirical mixed methods evaluation studies (see Greene et al., 1989 for a review of mixed methods evaluation designs). The expansion design was considered particularly well suited to this study because it would offer a comprehensive and diverse illustration of rating behavior, examining both the product that the teachers generate (i.e., the numeric scores awarded to students) and the process that they go through (i.e., evaluative comments) in their assessment of students' oral English performance (Greene et al., 1989). The complementarity design was included to


provide greater understanding of the NS and NNS teachers' rating behaviors by investigating the overlapping but different aspects of rater behavior that different methods might elicit (Greene et al., 1989). Intramethod mixing, in which a single method concurrently or sequentially incorporates quantitative and qualitative components (Johnson & Turner, 2003), was the selected guiding procedure. The same weight was given to both quantitative and qualitative methods, with neither method dominating the other.

2 Participants

Ten Korean students were selected from a college-level language institute in Montreal, Canada, and were informed about the research project and the test to participate in the study. The students were drawn from class levels ranging from beginner to advanced, so that the student sample would include differing levels of English proficiency. The language institute sorted students into one of five class levels according to their aggregate scores on a placement test measuring four English language skills (listening, reading, speaking, and writing): Level I for students with the lowest English proficiency, up to Level V for students with the highest English proficiency. Table 1 shows the distribution of the student sample across the five class levels. For the teacher samples, a concurrent mixed methods sampling procedure was used in which a single sample produced data for both the quantitative and qualitative elements of the study (Teddlie & Yu, 2007). Twelve native English-speaking Canadian teachers of English and 12 non-native English-speaking Korean teachers of English constituted the NS and NNS teacher groups, respectively. In order to ensure that the teachers were sufficiently qualified, certain participation criteria were outlined: 1) at least one year of prior experience teaching an English conversation course to non-native English speakers in a college-level language institution; 2) at least one graduate degree in a field related to linguistics or language education; and 3) high proficiency in spoken English for Korean teachers of English. Teachers' background information
Table 1   Distribution of students across class levels

Level                 I    II   III   IV   V
Number of students    1    1    3     3    2


was obtained via a questionnaire after their student evaluations were completed: all of the NNS teachers had lived in English-speaking countries for one to seven years for academic purposes, and their self-assessed English proficiency levels ranged from advanced (six teachers) to near-native (six teachers); none of the NNS teachers indicated their self-assessed English proficiency levels at or below an upper-intermediate level. In addition, nine NS and eight NNS teachers reported having taken graduate-level courses specifically in Second Language Testing and Evaluation, and four NS and one NNS teacher had been trained as raters of spoken English.

3 Instruments

A semi-direct oral English test was developed for the study. The purpose of the test was to assess the overall oral communicative language ability of non-native English speakers within an academic context. Throughout the test, communicative language ability would be evidenced by the effective use of language knowledge and strategic competence (Bachman & Palmer, 1996). Initial test development began with the identification of the target language use domain, target language tasks, and task characteristics (Bachman & Palmer, 1996). The test tasks were selected and revised to reflect potential test-takers' language proficiency and topical knowledge, as well as task difficulty and interest. An effort was also made to select test tasks related to hypothetical situations that could occur within an academic context. In developing the test, the guiding principles of the Simulated Oral Proficiency Interview (SOPI) were referenced. The test consisted of three different task types in order to assess the diverse oral language output of test-takers: picture-based, situation-based, and topic-based. The picture-based task required test-takers to describe or narrate visual information, such as describing the layout of a library (Task 1, [T1]), explaining the library services based on a provided informational note (Task 2, [T2]), narrating a story from six sequential pictures (Task 4, [T4]), and describing a graph of human life expectancy (Task 7, [T7]). The situation-based task required test-takers to perform the appropriate pragmatic function in a hypothetical situation, such as congratulating a friend on being admitted to school (Task 3, [T3]). Finally, the topic-based task required test-takers to offer


their opinions on a given topic, such as explaining their personal preferences for either individual or group work (Task 5, [T5]), discussing the harmful effects of Internet use (Task 6, [T6]), and suggesting reasons for an increase in human life expectancy (Task 8, [T8]). The test was administered in a computer-mediated indirect interview format. The indirect method was selected because the intervention of interlocutors in a direct speaking test might affect the reliability of test performance (Stansfield & Kenyon, 1992a, 1992b). Although the lexical density produced in direct and indirect speaking tests has been found to differ (O'Loughlin, 1995), it has consistently been reported that scores from indirect speaking tests have a high correlation with those from direct speaking tests (Clark & Swinton, 1979, 1980; O'Loughlin, 1995; Stansfield, Kenyon, Paiva, Doyle, Ulsh, & Antonia, 1990). In order to effectively and economically facilitate an understanding of the task without providing test-takers with a lot of vocabulary (Underhill, 1987), each task was accompanied by visual stimuli. The test lasted approximately 25 minutes, 8 of which were allotted for responses. A four-point rating scale was developed for rating (see Appendix A). It had four levels, labeled 1, 2, 3, and 4. A response of "I don't know" or no response was automatically rated NR (Not Ratable). The rating scale only clarified the degree of communicative success without addressing specific evaluation criteria. Because this study aimed to investigate how the teachers commented on the students' oral communicative ability and defined the evaluation criteria to be measured, the rating scale did not provide teachers with any information about which evaluation features to draw on. To deal with cases in which teachers might sit on the fence, an even number of levels was sought in the rating scale. Moreover, in order not to cause a cognitive and psychological overload on the teachers, six levels were set as the upper limit during the initial stage of the rating scale development. Throughout the trials, however, the six levels describing the degree of successfulness of communication proved to be indistinguishable without dependence on the adjacent levels. More importantly, teachers who participated in the trials did not use all six levels of the rating scale in their evaluations. For these reasons, the rating scale was trimmed to four levels, enabling the teachers to consistently distinguish each level from the others.


4 Procedure

The test was administered individually to each of 10 Korean students, and their speech responses were simultaneously recorded as digital sound files. The order of the students' test response sets was randomized to minimize a potential ordering effect, and then 12 of the possible test response sets were distributed to both groups of teachers. A meeting was held with each teacher in order to explain the research project and to go over the scoring procedure, which had two phases: 1) rating the students' test responses according to the four-point rating scale; and 2) justifying those ratings by providing written comments either in English or in Korean. While the NS teachers were asked to write comments in English, the NNS teachers were asked to write comments in Korean (which were later translated into English). The rationale for requiring teachers' comments was that they would not only supply the evaluation criteria that the teachers drew on to infer students' oral proficiency, but also help to identify the construct being measured. The teachers were allowed to control the playing, stopping, and replaying of test responses and to listen to them as many times as they wanted. After rating a single task response by one student according to the rating scale, they justified their ratings by writing down their reasons or comments. They then moved on to the next task response of that student. The teachers rated and commented on 80 test responses (10 students × 8 tasks). To decrease the subject expectancy effect, the teachers were told that the purpose of the study was to investigate teachers' rating behavior, and the comparison of different teacher groups was not explicitly mentioned. The two groups of teachers were therefore unaware of each other. In addition, a minimum amount of information about the students (i.e., education level, current visa status, etc.) was provided to the teachers. Meetings with the NS teachers were held in Montreal, Canada, and meetings with the NNS teachers followed in Daegu, Korea. Each meeting lasted approximately 30 minutes.

5 Data analyses

Both quantitative and qualitative data were collected. The quantitative data consisted of 1,727 valid ratings, awarded by 24 teachers to 80 sample responses by 10 students on eight tasks. Each


teacher rated every student's performance on every task, so that the data matrix was fully crossed. A rating of NR (Not Ratable) was treated as missing data; there were eight such cases among the 80 speech samples. In addition, one teacher failed to make one rating. (Of the 1,920 possible ratings from 24 teachers and 80 responses, the 192 ratings for the eight NR responses and the one omitted rating account for the 1,727 valid ratings.) The qualitative data included 3,295 written comments. Both types of data were analyzed in a concurrent manner: a Many-faceted Rasch Measurement (Linacre, 1989) was used to analyze the quantitative ratings, and typology development and data transformation (Caracelli & Greene, 1993) guided the analysis of the qualitative written comments. The quantitative and qualitative research approaches were integrated at a later stage (rather than at the outset of the research process), when the findings from both methods were interpreted and the study was concluded. Since the nature of the component designs to which this study belongs does not permit enough room to combine the two approaches (Caracelli & Greene, 1997), the different methods tended to remain distinct throughout the study. Figure 1 summarizes the overall data analysis procedures.

a Quantitative data analysis: The data were analyzed using the FACETS computer program, Version 3.57.0 (Linacre, 2005). Four facets were specified: student, teacher, teacher group, and task. The teacher group facet was entered as a dummy facet and anchored at zero. A hybrid Many-faceted Rasch Measurement Model (Myford & Wolfe, 2004a) was used to differentially apply the Rating Scale Model to teachers and tasks, and the Partial Credit Model to teacher groups. Three different types of statistical analysis were carried out to investigate teachers' internal consistency, based on: 1) fit statistics; 2) proportions of large standard residuals between observed and expected scores (Myford & Wolfe, 2000); and 3) a single rater-rest of the raters (SR/ROR) correlation (Myford & Wolfe, 2004a). The multiple analyses were intended to strengthen the validity of inferences drawn about raters' internal consistency through converging evidence, and to minimize any bias that is inherent to a particular analysis. Teachers' severity measures were also examined in three different ways, based on: 1) task difficulty measures, 2) a bias analysis between teacher groups and tasks, and 3) a bias analysis between individual teachers and tasks.
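For reference, the generic many-facet Rasch model underlying this kind of analysis can be sketched as

log(P_nijk / P_nij(k-1)) = θ_n - δ_i - α_j - τ_k

where P_nijk is the probability of student n receiving a rating in category k from teacher j on task i, θ_n is the student's proficiency, δ_i the task's difficulty, α_j the teacher's severity, and τ_k the difficulty of scale category k relative to category k-1. This is only the standard rating-scale formulation; the hybrid application described above (Rating Scale Model for teachers and tasks, Partial Credit Model for teacher groups) modifies how the threshold terms τ_k are parameterized.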

Figure 1   Flowchart of the data analysis procedure. The 1,727 ratings fed three checks of teachers' internal consistency (fit statistics, proportions of large standard residuals, and the single rater-rest of the raters (SR/ROR) correlation) and three checks of teachers' severity (task difficulty measures and bias analyses at the group and individual levels); the 3,295 written comments fed the analysis of teachers' evaluation criteria (typology development, followed by data transformation, i.e., quantification of evaluation features for cross-comparison).

b Qualitative data analysis: The written comments were analyzed based on evaluation criteria, with each written comment constituting one criterion. Comments that provided only evaluative adjectives without offering evaluative substance (e.g., "accurate," "clear," and so on) were excluded from the analysis so as not to misjudge the evaluative intent. The 3,295 written comments were open-coded so that the evaluation criteria that the teachers drew upon emerged. Nineteen recurring evaluation criteria were identified (see Appendix B for definitions and specific examples). Once I had coded and analyzed the teachers' comments, a second coder conducted an independent examination of the original uncoded comments of 10 teachers (five NS and five NNS teachers); our results reached approximately 95% agreement (for a detailed description of the coding procedures,


see Kim, 2005). The 19 evaluative criteria were compared across the two teacher groups through a frequency analysis.
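As a minimal sketch of the kind of frequency comparison described here (illustrative only; the criterion labels and the data structure are hypothetical stand-ins for the coded comments):

from collections import Counter

# Each coded comment is represented as a (teacher_group, criterion) pair produced
# by the open-coding step. The pairs below are hypothetical placeholders, not the
# study's data.
coded_comments = [
    ("NS", "pronunciation"), ("NS", "vocabulary"), ("NS", "overall language use"),
    ("NNS", "pronunciation"), ("NNS", "vocabulary"), ("NNS", "intelligibility"),
]

def criterion_frequencies(comments, group):
    """Tally how often each evaluation criterion appears for one teacher group."""
    counts = Counter(criterion for g, criterion in comments if g == group)
    total = sum(counts.values())
    # Report both the raw count and the percentage of that group's comments.
    return {c: (n, round(100.0 * n / total, 2)) for c, n in counts.items()}

for group in ("NS", "NNS"):
    print(group, criterion_frequencies(coded_comments, group))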

III Results and discussion

1 Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students' oral English performance?

To examine fit statistics, the infit indices of each teacher were assessed. Teachers' fit statistics indicate the degree to which each teacher is internally consistent in his or her ratings. Determining an acceptable range of infit mean squares for teachers is not a clear-cut process (Myford & Wolfe, 2004a); indeed, there are no straightforward rules for interpreting fit statistics, or for setting upper and lower limits. As Myford and Wolfe (2004a) noted, such decisions are related to the assessment context and depend on the targeted use of the test results. If the stakes are high, tight quality control limits such as mean squares of 0.8 to 1.2 would be set on multiple-choice tests (Linacre & Williams, 1998); however, in the case of low-stakes tests, looser limits would be allowed. Wright and Linacre (1994) proposed mean square values of 0.6 to 1.4 as reasonable values for data in which a rating scale is involved, with the caveat that the ranges are likely to vary depending on the particulars of the test situation. In the present study, the lower and upper quality control limits were set at 0.5 and 1.5, respectively (Lunz & Stahl, 1990), given the test's rating scale and the fact that it investigates teachers' rating behaviors in a classroom setting rather than those of trained raters in a high-stakes test setting. Infit mean square values greater than 1.5 indicate significant misfit, or a high degree of inconsistency in the ratings. On the other hand, infit mean square values less than 0.5 indicate overfit, or a lack of variability in their scoring. The fit statistics in Table 2 show that three teachers, NS10, NNS6, and NNS7, have misfit values. None of the teachers show overfit rating patterns. Another analysis was carried out based on proportions of large standard residuals between observed and expected scores in order to more precisely identify the teachers whose rating patterns differed greatly from the model expectations.

Table 2   Teacher measurement report

Teacher   Obsvd     FairM     Measure    Model   Infit   Outfit   PtBis
          average   average   (logits)   S.E.    MnSq    MnSq
NS10      2.9       2.78      -0.60      0.20    1.51    1.37     0.56
NNS10     2.9       2.74      -0.52      0.20    1.26    1.21     0.58
NNS11     2.8       2.63      -0.29      0.19    1.09    0.94     0.55
NNS1      2.7       2.52      -0.07      0.19    0.85    0.74     0.57
NS9       2.7       2.43      0.11       0.19    1.34    1.43     0.51
NS5       2.6       2.37      0.23       0.19    1.07    1.28     0.53
NNS9      2.6       2.35      0.26       0.19    1.29    1.46     0.50
NS12      2.6       2.32      0.33       0.19    0.96    1.12     0.54
NNS7      2.6       2.32      0.33       0.19    1.54    1.29     0.49
NNS5      2.5       2.29      0.40       0.19    0.81    0.82     0.57
NS7       2.5       2.27      0.44       0.19    1.11    1.12     0.53
NS11      2.5       2.25      0.47       0.19    1.00    0.94     0.53
NS4       2.5       2.22      0.54       0.19    0.52    0.48     0.60
NNS4      2.5       2.22      0.54       0.19    0.52    0.48     0.60
NNS12     2.4       2.17      0.65       0.19    0.83    0.97     0.56
NNS2      2.4       2.13      0.72       0.19    0.69    0.68     0.57
NS3       2.4       2.08      0.83       0.19    0.77    1.03     0.57
NNS3      2.4       2.08      0.83       0.19    0.85    0.73     0.59
NS2       2.3       2.02      0.97       0.19    0.67    0.69     0.57
NS8       2.3       1.99      1.05       0.19    0.78    0.77     0.59
NS6       2.2       1.91      1.23       0.19    1.30    1.41     0.53
NNS6      2.2       1.84      1.38       0.19    1.61    1.74     0.49
NS1       2.1       1.75      1.60       0.20    0.68    0.60     0.58
NNS8      2.1       1.73      1.64       0.20    0.85    0.72     0.56
Mean      2.5       2.22      0.54       0.19    1.00    1.00     0.55
S.D.      0.2       0.27      0.58       0.00    0.31    0.33     0.03

RMSE (model) = 0.19; Adj. S.D. = 0.55; Separation = 2.87; Reliability (not inter-rater) = 0.89; Fixed (all same) χ² = 214.7, d.f. = 23; Significance (probability) = .00

Note: The SR/ROR correlation is presented as the point-biserial correlation (PtBis) in the FACETS output.
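As a small illustration of the quality-control check described above, the following sketch classifies infit mean squares against the 0.5-1.5 limits adopted here (the values are taken from Table 2 for the teachers discussed in the text):

def classify_infit(infit_mnsq, lower=0.5, upper=1.5):
    """Classify an infit mean square against the quality-control limits used here."""
    if infit_mnsq > upper:
        return "misfit (inconsistent ratings)"
    if infit_mnsq < lower:
        return "overfit (too little variability)"
    return "acceptable"

# Infit mean squares from Table 2 for the teachers discussed in the text.
for teacher, infit in [("NS10", 1.51), ("NNS7", 1.54), ("NNS6", 1.61), ("NS4", 0.52)]:
    print(teacher, infit, classify_infit(infit))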

According to Myford and Wolfe (2000), investigating the proportion to which each rater is involved with the large standard residuals between observed and expected scores can provide useful information about rater behavior. If raters are interchangeable, it should be expected that all raters would be assigned the same proportion of large standard residuals, according to the proportion of total ratings that they make (Myford & Wolfe, 2000). Based on the number of large standard residuals and ratings that all raters make and that each rater makes, they suggest that the null proportion of large standard residuals for each rater (π) and the observed proportion of large standard residuals for each rater (P_r) can be computed using equations (1) and (2):



π = N_u / N_t                                                  (1)

where N_u = the total number of large standard residuals and N_t = the total number of ratings.

P_r = N_ur / N_tr                                              (2)

where N_ur = the number of large standard residuals made by rater r and N_tr = the number of ratings made by rater r. An inconsistent rating will occur when the observed proportion exceeds the null proportion beyond the acceptable deviation (Myford & Wolfe, 2000). Thus, Myford and Wolfe propose that the frequency of unexpected ratings (Z_p) can be calculated using equation (3). According to them, if the Z_p value for a rater is below +2, it indicates that the unexpected ratings that he or she made are random error; however, if the value is above +2, the rater is considered to be exercising an inconsistent rating pattern.

Z_p = (P_r - π) / √(π(1 - π) / N_tr)                           (3)
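A minimal sketch of this flagging procedure in Python, assuming the standardized-proportion form of Z_p shown in equation (3) (the rater IDs and counts in the example are hypothetical):

import math

def zp_statistics(large_residual_counts, rating_counts):
    """Compute the null proportion (pi), each rater's observed proportion (P_r),
    and the Z_p index of equation (3) for a set of raters.

    Both arguments are dicts mapping a rater ID to a count.
    """
    n_u = sum(large_residual_counts.values())      # total large standard residuals
    n_t = sum(rating_counts.values())              # total ratings
    pi = n_u / n_t                                 # null proportion, equation (1)

    results = {}
    for rater, n_tr in rating_counts.items():
        p_r = large_residual_counts.get(rater, 0) / n_tr          # equation (2)
        z_p = (p_r - pi) / math.sqrt(pi * (1 - pi) / n_tr)        # equation (3)
        results[rater] = {"P_r": round(p_r, 3), "Z_p": round(z_p, 2),
                          "inconsistent": z_p > 2}                # flag Z_p > +2
    return results

# Hypothetical counts for three raters, each with 72 ratable responses.
print(zp_statistics({"A": 9, "B": 3, "C": 2}, {"A": 72, "B": 72, "C": 72}))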

In this study, an unexpected observation was reported if the standardized residual was greater than +2, which was the case in 89 out of a total of 1,727 responses. When rating consistency was examined, one NS teacher and two NNS teachers were found to exhibit inconsistent rating patterns, a result similar to what was found in the fit analysis. The two NNS teachers whose observed Z_p values were greater than +2 were NNS6 and NNS7, who had been flagged as misfitting teachers by their infit indices. Interestingly, the analysis of NS teachers showed that it was NS9, not NS10, who had Z_p values greater than +2. This may be because NS10 produced only a small number of unexpected ratings which did not produce large residuals. That small Z_p value indicates that while the teacher gave a few ratings that were somewhat unexpectedly higher (or lower) than the model would expect, those ratings were not highly unexpected (C. Myford, personal communication, May 31, 2005). Myford and Wolfe (2004a, 2004b) introduced the more advanced Many-faceted Rasch Measurement application to detect raters' consistency based on the single rater-rest of the raters (SR/ROR) correlation. When raters exhibit randomness, they are flagged with


significantly large infit and outfit mean square indices; however, significantly large infit and outfit mean square indices may also indicate other rater effects (Myford & Wolfe, 2004a, 2004b). Thus, Myford and Wolfe suggested that it is important to examine significantly low SR/ROR correlations as well. More specifically, they suggested that randomness will be detected when infit and outfit mean square indices are significantly larger than 1 and SR/ROR correlations are significantly lower than those of other raters. Four teachers appeared to be inconsistent: NS9, NNS6, NNS7, and NNS9 showed not only large fit indices but also low SR/ROR correlations. When compared relatively, NS9, NNS7, and NNS9 seemed to be on the borderline in their consistency, whereas NNS6 was obviously signaled as an inconsistent teacher. In summary, the three different types of statistical approaches showed converging evidence; most of the NS and NNS teachers were consistent in their ratings, with one or two teachers from each group showing inconsistent rating patterns. This result implies that the two groups rarely differed in terms of internal consistency, and that the NNS teachers were as dependable as the NS teachers in assessing students' oral English performance.

2 Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students' oral English performance?

The analysis was carried out in order to identify whether the two groups of teachers showed similar severity measures across different tasks. Given that task difficulty is determined to some extent by raters' severity in a performance assessment setting, comparison of task difficulty measures is considered a legitimate approach. Figure 2 shows the task difficulty derived from the NS and the NNS groups of teachers. As can be seen, the ratings of the NS group were slightly more diverse across tasks, with task difficulty measures ranging from -0.53 logits to 0.97 logits, with a 1.50 logit spread; in the NNS group's ratings, the range of task difficulty measures was similar to that of the NS group, though slightly narrower: from -0.59 logits to 0.82 logits, with a 1.41 logit spread. Figure 2 also shows that both groups exhibited generally similar patterns in task difficulty measures. Task 6 was given the highest difficulty measure by both groups of teachers, and Tasks 3 and 2 were given the lowest difficulty measure by the NS and the NNS teacher groups, respectively.




Figure 2 Task difficulty measures by NS and NNS teacher groups

A bias analysis was carried out to further explore the potential interaction between teacher groups and tasks. In the bias analysis, an estimate of the extent to which a teacher group was biased toward a particular task is standardized to a Z-score. When the Z-score values in a bias analysis fall between -2 and +2, that group of teachers is thought to be scoring a task without significant bias. Where the values fall below -2, that group of teachers is scoring a task leniently compared with the way they have assessed other tasks, suggesting a significant interaction between the group and the task. By the same token, where the values are above +2, that group of teachers is thought to be rating that task more severely than other tasks. As the bias slopes of Figure 3 illustrate, neither of the two groups of teachers was positively or negatively biased toward any particular tasks; thus, the NS and NNS teacher groups do not appear to have any significant interactions with particular tasks. A bias analysis between individual teachers and tasks confirmed the result of the previous analysis. While an interaction was found between individual teachers and tasks, no bias emerged toward a particular task from a particular group of teachers. Strikingly, certain teachers from each group showed exactly the same bias patterns on particular tasks. As shown in Table 3, one teacher from each group exhibited significantly lenient rating patterns on Tasks 1 and 4, and significantly severe patterns on Task 7. Two NS teachers exhibited conflicting rating patterns on Task 6: NS11 showed a significantly

Figure 3   Bias analysis between teacher groups and tasks

more lenient pattern of ratings, while NS9 showed the exact reverse pattern; that is, NS9 rated Task 6 significantly more severely. It is very interesting that one teacher from each group showed the same bias patterns on Tasks 1, 4, and 7, since it implies that the ratings of these two teachers may be interchangeable in that they display the same bias patterns. In summary, the NS and NNS teachers seem to have behaved similarly in terms of severity, and this is confirmed by both the task difficulty measures and the two bias analyses.
Table 3   Bias analysis: Interactions between teachers and tasks

Teacher   Task   Obs-Exp   Bias measure   Model   Z-score   Infit
                 average   (logits)       S.E.              MnSq
NS11      T6     0.54      1.26           0.55    2.29      0.9
NS9       T4     0.38      1.23           0.58    2.13      1.5
NNS9      T4     0.43      1.22           0.55    2.19      1.5
NNS12     T1     0.47      1.18           0.53    2.24      0.7
NS3       T1     0.44      1.06           0.50    2.11      0.8
NS5       T6     0.43      1.01           0.55    1.84      1.3
NNS6      T6     0.34      1.06           0.69    1.54      3.0
NS9       T6     0.49      1.21           0.58    2.09      2.1
NS3       T6     0.44      1.21           0.64    1.90      0.7
NS6       T7     0.60      1.90           0.65    2.92      1.1
NNS6      T7     0.60      2.02           0.69    2.93      1.1
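The flagged interactions in Table 3 can be reproduced from the reported bias measures and model standard errors. A minimal sketch, assuming (as the table's columns suggest) that the Z-score is simply the bias measure divided by its standard error; the magnitudes below are taken from Table 3, but the lenient/severe sign convention is not reproduced:

def flag_biased_interactions(rows, threshold=2.0):
    """Flag teacher-task interactions whose standardized bias exceeds the threshold.

    rows: (teacher, task, bias_measure_in_logits, model_standard_error) tuples.
    """
    flagged = []
    for teacher, task, bias, se in rows:
        z = bias / se                    # standardized bias (the Z-score column)
        if abs(z) > threshold:           # |Z| > 2 marks a significant interaction
            flagged.append((teacher, task, round(z, 2)))
    return flagged

# Magnitudes from Table 3.
table3 = [
    ("NS11", "T6", 1.26, 0.55), ("NS9", "T4", 1.23, 0.58), ("NNS9", "T4", 1.22, 0.55),
    ("NNS12", "T1", 1.18, 0.53), ("NS3", "T1", 1.06, 0.50), ("NS5", "T6", 1.01, 0.55),
    ("NNS6", "T6", 1.06, 0.69), ("NS9", "T6", 1.21, 0.58), ("NS3", "T6", 1.21, 0.64),
    ("NS6", "T7", 1.90, 0.65), ("NNS6", "T7", 2.02, 0.69),
]
print(flag_biased_interactions(table3))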


The overall results of the multiple quantitative analyses also show that the NS and NNS teachers appeared to reach an agreement as to the score a test-taker should be awarded, exhibiting little difference in internal consistency and severity. Given that the quantitative outcomes that the two groups of teachers generated were rarely different, the research focus now turns to qualitative analyses of the processes the teachers went through in their assessments.

3 How do NS and NNS teachers differ in drawing on evaluation criteria when they comment on students' oral English performance?

In order to illustrate teachers' evaluation patterns, their written comments were analyzed qualitatively. While the quantitative approach to their ratings provided initial insights into the teachers' evaluation behavior, a more comprehensive and enriched understanding was anticipated from the complementary inclusion of a qualitative approach. The mixed methods design was intended to enhance the study's depth and breadth, elucidating dimensions that might have been obscured by the use of a solely quantitative method. When comments from both groups were reviewed, a variety of themes emerged. Nineteen evaluation criteria were identified, and they were quantified and compared between the NS and NNS teacher groups. Figure 4 illustrates the frequency of comments made by the NS and NNS teacher groups for the 19 evaluation criteria. The analysis was conducted using comments drawn from all eight tasks; the NS and NNS teacher groups could not be compared for individual tasks because they made very few comments (fewer than 10) on some evaluation criteria related to a particular task. The comparison was therefore based on comments for all tasks rather than for each individual task. Interestingly, the total number of comments made by the two groups differed distinctly: while the NS group made 2,123 comments, the NNS group made only 1,172. This may be because providing students with detailed evaluative comments on their performance is not as widely used in an EFL context as traditional fixed response assessment. Figure 4 also shows that the NS group provided more comments than the NNS group for all but two of the evaluation criteria: accuracy of transferred information and completeness of discourse. Still, the overall number of comments for these two criteria was similar in the two teacher groups (46 vs. 50 comments


for accuracy of transferred information; 53 vs. 66 comments for completeness of discourse). When the evaluation criteria emphasized by the two teacher groups were examined, the NS group was found to draw most frequently on overall language use (13.46% of all comments), pronunciation (11.47%), vocabulary (11.42%), fluency (9.33%), and specific grammar use (6.70%). The NNS group emphasized pronunciation (15.23% of all comments), vocabulary (14.47%), intelligibility (7.69%), overall language use (7.00%), and coherence (5.68%). These trends indicate that the two teacher groups shared common ideas about the ways in which the students' performance should be assessed. Although the NS and NNS groups differed in that the NS group made more comments across most of the evaluation criteria, both groups considered vocabulary, pronunciation, and overall language use to be the primary evaluation criteria. The NS teachers provided more detailed and elaborate comments, often singling out a word or phrase from students' speech responses and using it as a springboard for justifying their evaluative comments. For example, when evaluating pronunciation, the NS teachers commented that "some small pronunciation issue (can/can't & show/saw) causes confusion," "some words mispronounced (e.g., reverse for reserve, arrive for alive)," "pronunciation difficulty, l/r, d/t, f/p, vowels, i/e," "pronunciation occasionally unclear (e.g., really)," "sometimes pronunciation is not clear, especially at word onsets," etc. The explicit pinpointing of pronunciation errors

Figure 4   Frequency distribution of the comments by NS and NNS teacher groups


might imply that the NS teachers tended to be sensitive or strict in terms of phonological accuracy. It can also be interpreted to suggest that the NS teachers were less tolerant of or more easily distracted by phonological errors made by non-native English speakers. These findings are somewhat contradictory to those of previous studies (e.g., Brown, 1995; Fayer & Krasinski, 1987) that indicated native speakers are less concerned about or annoyed by non-native speech features as long as they are intelligible. This inconsistency might ultimately be due to the different methodological approaches employed in the studies. While this study examined non-native speakers' phonological features through a qualitative lens, the previous studies focused on the quantitative scores awarded on pronunciation as one analytic evaluation criterion. When the comments provided by the NNS teachers on pronunciation were examined, they were somewhat different. Although pronunciation was one of the most frequently mentioned evaluation criteria and constituted 15.23% of the total comments, the NNS teachers were more general in their evaluation comments. Instead of identifying problems with specific phonological features, they tended to focus on the overall quality of students' pronunciation performance. For example, their comments included "problems with pronunciation," "problems with word stress," "hard to follow due to pronunciation," "good description of library but problems with pronunciation (often only with effort can words be understood)," etc. It appears that the NNS teachers were less influenced by phonological accuracy than by global comprehensibility or intelligibility. In other words, as long as students' oral performance was intelligible or comprehensible, the NNS teachers did not seem to be interested in the micro-level of phonological performance. Intelligibility was the third most frequently mentioned evaluation criterion by the NNS teachers, confirming that their attention was more focused on overall phonological performance or intelligibility than on specific phonological accuracy. Another possible explanation might be that, as one of the reviewers of this article suggested, the NNS teachers were more familiar with the students' English pronunciation than the NS teachers because the NNS teachers shared the same first language background with the students. Similar patterns appeared in the evaluation criteria of specific grammar use and accuracy of transferred information. The NS teachers provided more detailed feedback on specific aspects of grammar use, making more comments compared to the NNS teachers


(152 vs. 29 comments). For example, when evaluating students' performance on Task 1 (describing the layout of a library), the NS teachers paid more attention to accurate use of prepositions than to other grammatical features. They further pointed out that accurate use of prepositions might facilitate listeners' visualization of given information, for example, by stating "prepositions of place could be more precise (e.g., in front of computers)" and "incorrect or vague use of prepositions of place hinders visualization." The same observations were also made on Task 4 (narrating a story from six sequential pictures) and Task 7 (describing a graph of human life expectancy). Tasks 4 and 7 were similar in that students had to describe events that had taken place in the past in order to complete them successfully. It was therefore essential for students to be comfortable with a variety of verb tenses (past, past progressive, past perfect, present, and future) so as not to confuse their listeners. As was the case with preposition use, the NS teachers were more aware than the NNS teachers of the precise use of verb tenses, as their comments make manifest: "successfully recounted in the past with complex structure (i.e., past perfect, past progressive)," "changing verb tense caused some confusion," "all recounted in present tense," "tense accuracy is important for listener comprehension in this task," and "minor error in verb tense (didn't use future in reference to 2010 at first)." By contrast, the NNS teachers neither responsively nor meticulously cited the use of prepositions or verb tenses. Their 29 total comments on specific grammar use were often too short to enable interpretation of their judgments (e.g., "no prepositions," "wrong tense," "problems with prepositions," and "problems with tense"), suggesting that the NNS teachers were less distracted by the misuse of prepositions and verb tenses than the NS teachers, consistent with Galloway's (1980) findings. Speculating as to why native and non-native speakers had different perceptions of the extent to which linguistic errors disrupt communication, Galloway noted that confusion of tense "may not have caused problems for the non-native speaker, but it did seem to impede communication seriously for the native speaker" (p. 432). Although the native language group in the Galloway study was quite different from that of the present study (i.e., native Spanish speakers as opposed to native Korean speakers), her conjectures are noteworthy. The responses of the two teacher groups to the accuracy of transferred information followed the same pattern. Although the NNS


teachers provided more comments than did the NS teachers (50 vs. 46, respectively), their characteristics were dissimilar. This was especially evident in Task 2 (explaining the library services based on a provided informational note) and Task 7 (describing a graph of human life expectancy), where students were asked to verbalize literal and numeric information. On these two tasks, the NS teachers appeared very attentive to the accuracy of transmitted information, and jotted down content errors whenever they occurred. For example, they pointed out every inconsistency between the provided visual information and the transferred verbalized information, commenting "some key information inaccurate (e.g., confused renewals for grads & undergrads; fines of $50/day 50/day)," "some incorrect info (e.g., closing time of 9:00 pm instead of 6:00 pm)," "gradually accurate at first, then less so when talking about fines (e.g., $50 50)," "some incorrect information (the gap between men and women was smallest in 1930, NOT 2000)," etc. By contrast, the NNS teachers were primarily concerned with whether the delivered information was generally correct, for example, commenting "accurate info," "not very inaccurate info," or "provided wrong information." The NNS teachers' global judgments on the accuracy of transmitted information raise the question of whether the NNS teachers were not as attentive as the NS teachers to specific aspects of content accuracy, as long as the speech was comprehensible. It may simply be that the NNS teachers considered content errors to be simple mistakes that should not be used to misrepresent students' overall oral English proficiency. The tendency of the NNS teachers to provide less detailed, less elaborate comments than the NS teachers on certain evaluation criteria requires careful interpretation. NNS teachers who teach daily in an EFL context may be poorly informed about how to evaluate students' language performance without depending on numeric scores and traditional fixed response assessment. Although there have been recent advances in performance assessment in the EFL context, it has been pointed out that NNS teachers have not been effectively trained to assess students' performance (Lee, 2007). This different evaluation culture might have contributed to the dissimilar evaluation patterns for the NS and NNS teachers. The different evaluation behaviors might also be attributable to a methodological matter. Because this study was intended only to capture teachers' evaluation behavior, those who participated in the study were not told that they should make their comments as specific as


possible, which might have influenced the NNS teachers' lack of evaluative comments. For example, the NNS teachers may simply have noted the major characteristics of students' oral output, focusing on overall quality without considering the granularity of their own comments. As one of the reviewers suggested, it is also possible that the NNS teachers did not orient their comments toward providing feedback for the students. To suggest that the NNS teachers did not identify linguistic errors as accurately as did the NS teachers would therefore be premature, and more evidence needs to be gathered to address the specific ways in which the NS and NNS teachers provided students with feedback related to those linguistic errors.

IV Conclusion and implications

This study has examined how a sample of NS and NNS teachers assessed students' oral English performance from comprehensive perspectives. A variety of test tasks were employed, enabling the teachers to exhibit varied rating behaviors while assessing diverse oral language output. The teachers not only exhibited different severity measures, but they also drew on different evaluation criteria across different tasks. These findings suggest that employing multiple tasks might be useful in capturing diverse rater behaviors. Three different statistical approaches were used to compare teachers' internal consistency, and they revealed almost identical patterns. Most of the NS and NNS teachers maintained acceptable levels of internal consistency, with only one or two teachers from each group identified as inconsistent raters. Similar results were obtained when the severity of the two groups was compared. Of the eight individual tasks, both teacher groups were most severe on Task 6, and neither was positively or negatively biased toward a particular task. More interestingly, a bias analysis carried out for individual teachers and individual tasks showed that one teacher from each group exhibited exactly the same bias patterns on certain tasks. A striking disparity, however, appeared in the NS and NNS teachers' evaluation criteria for students' performance. The NS teachers provided far more comments than the NNS teachers with regard to students' performance across almost all of the evaluation criteria. A qualitative analysis further showed the NS teachers to be more detailed and elaborate in their comments than were the NNS teachers. This observation arose


from their judgments on pronunciation, specific grammar use, and the accuracy of transferred information. The comparable internal consistency and severity patterns that the NS and NNS teachers exhibited appear to support the assertion that NNS teachers can function as assessors as reliably as NS teachers can. Although the NS teachers provided more detailed and elaborate comments, the study has not provided evidence of how different qualitative evaluation approaches interact with students and which evaluation method would be more beneficial to them. Therefore, the study's results offer no indication that NNS teachers should be denied positions as assessors simply because they do not own the language "by primogeniture and due of birth" (Widdowson, 1994, p. 379). Considering that assessment practices can be truly valid only when all contextual factors are considered, the involvement of native speakers in an assessment setting should not be interpreted as a panacea. By the same token, an inquiry into validity is a complicated quest, and no validity claims are one-size-fits-all. In a sense, NNS teachers could be more compelling or sensitive assessors than NS teachers in expanding circle countries (Kachru, 1985), since the former might be more familiar with the instructional objectives and curriculum goals of indigenous educational systems. Further research is therefore warranted to investigate the effectiveness of NNS teachers within their local educational systems. This study has shown that by combining quantitative and qualitative research methods, a comprehensive understanding of research phenomena can be achieved via paradigmatic and methodological pluralism. Diverse paradigms and multiple research methods enabled diverse social phenomena to be explored from different angles; the inclusion of a qualitative analysis provided insight into the different ways in which NS and NNS teachers assessed students' oral language performance, above and beyond findings from the quantitative analysis alone. Collecting diverse data also helped to overcome the limitations of the aforementioned previous studies, which depended solely on numeric data to investigate raters' behavior in oral language performance assessment. Several methodological limitations and suggestions should be noted. First, this study's results cannot be generalized to other populations. Only Canadian and Korean English teachers were included in the sample, and most of these were well-qualified and experienced, with at least one graduate degree related to linguistics or language education. Limiting the research outcomes to the


specific context in which this study was carried out will make the interpretations of the study more valid. The use of other qualitative approaches is also recommended. The only qualitative data collected were written comments, which failed to offer a full account of the teachers' in-depth rating behavior. Those behaviors could be further investigated using verbal protocols or in-depth interviews for a fuller picture of what the teachers consider effective language performance. As one of the reviewers pointed out, it might also be interesting to investigate whether the comments made by the NS and NNS teachers tap different constructs of underlying oral proficiency and thereby result in different rating scales. Lastly, further research is suggested to examine the extent to which the semi-direct oral test and the rating scale employed in this study represent the construct of underlying oral proficiency.

Acknowledgements

I would like to acknowledge that this research project was funded by the Social Sciences and Humanities Research Council of Canada through McGill University's Institutional Grant. My sincere appreciation goes to Carolyn Turner for her patience, insight, and guidance, which inspired me to complete this research project. I am also very grateful to Eunice Jang, Alister Cumming, and Merrill Swain for their valuable comments and suggestions on an earlier version of this article. Thanks are also due to three anonymous reviewers of Language Testing for their helpful comments.

References
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press. Barnwell, D. (1989). Nave native speakers and judgments of oral proficiency in Spanish. Language Testing, 6, 152163. Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 115. Caracelli, V. J. & Greene, J. C. (1993). Data analysis strategies for mixedmethod evaluation designs. Educational Evaluation and Policy Analysis, 15, 195207. Caracelli, V. J. & Greene, J. C. (1997). Crafting mixed method evaluation designs. In Greene, J. C. & Caracelli, V. J., editors, Advances in mixedmethod evaluation: The challenges and benefits of integrating diverse

212 An investigation into native and non-native teachers judgments


paradigms. New Directions for Evaluation no. 74 (pp. 1932). San Francisco: Jossey-Bass. Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12, 1633. Chalhoub-Deville, M. & Wigglesworth, G. (2005). Rater judgment and English language speaking proficiency. World Englishes, 24, 383391. Clark, J. L. D. & Swinton, S. S. (1979). An exploration of speaking proficiency measures in the TOEFL context (TOEFL Research Report No. RR-04). Princeton, NJ: Educational Testing Service. Clark, J. L. D. & Swinton, S. S. (1980). The test of spoken English as a measure of communicative ability in English-medium instructional settings (TOEFL Research Report No. RR-07). Princeton, NJ: Educational Testing Service. Crystal, D. (2003). English as a global language. Cambridge: Cambridge University Press. Fayer, J. M. & Krasinski, E. (1987). Native and nonnative judgments of intelligibility and irritation. Language Learning, 37, 313326. Galloway, V. B. (1980). Perceptions of the communicative efforts of American students of Spanish. Modern Language Journal, 64, 428433. Graddol, D. (1997). The future of English?: A guide to forecasting the popularity of English in the 21st century. London, UK: The British Council. Greene, J. C., Caracelli, V. J. & Graham, W. F. (1989). Toward a conceptual framework for mixed-method evaluation design. Educational Evaluation and Policy Analysis, 11, 255274. Hadden, B. L. (1991). Teacher and nonteacher perceptions of second-language communication. Language Learning, 41, 124. Hill, K. (1997). Who should be the judge?: The use of non-native speakers as raters on a test of English as an international language. In Huhta, A., Kohonen, V., Kurki-Suonio, L., & Luoma, S., editors, Current developments and alternatives in language assessment: Proceedings of LTRC 96 (pp. 275290). Jyvskyl: University of Jyvskyl and University of Tampere. Jenkins, J. (2003). World Englishes: A resource book for students. New York: Routledge. Johnson, B. & Turner, L. A. (2003). Data collection strategies in mixed methods research. In Tashakkori, A. & Teddlie, C., editors, Handbook of mixed methods in social and behavioral research (pp. 297319). Thousand Oaks, CA: Sage. Kachru, B. B. (1985). Standards, codification and sociolinguistic realism: The English language in the outer circle. In Quirk, R. & Widdowson, H., editors, English in the world: Teaching and learning the language and literatures (pp. 1130). Cambridge: Cambridge University Press. Kachru, B. B. (1992). The other side of English. In Kachru, B. B., editors, The other tongue: English across cultures (pp. 115). Urbana, IL: University of Illinois Press.

Youn-Hee Kim

213

Kim, Y-H. (2005). An investigation into variability of tasks and teacher-judges in second language oral performance assessment. Unpublished masters thesis, McGill University, Montreal, Quebec, Canada. Lee, H-K. (2007). A study on the English teacher quality as an English instructor and as an assessor in the Korean secondary school. English Teaching, 62, 309330. Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press. Linacre, J. M. (2005). A users guide to facets: Rasch-model computer programs. [Computer software and manual]. Retrieved April 10, 2005, from www.winsteps.com. Linacre, J. M. & Williams, J. (1998). How much is enough? Rasch Measurement: Transactions of the Rasch Measurement SIG, 12, 653. Lowenberg, P. H. (2000). Assessing English proficiency in the global context: The significance of non-native norms. In Kam, H. W., editor, Language in the global context: Implications for the language classroom (pp. 207228). Singapore: SEAMEO Regional Language Center. Lowenberg, P. H. (2002). Assessing English proficiency in the Expanding Circle. World Englishes, 21, 431435. Lunz, M. E. & Stahl, J. A. (1990). Judge severity and consistency across grading periods. Evaluation and the health professions, 13, 425444. McNamara, T. F. (1996). Measuring second language performance. London: Longman. Myford, C. M. & Wolfe, E. W. (2000). Monitoring sources of variability within the test of spoken English Assessment System (TOEFL Research Report No. RR-65). Princeton, NJ: Educational Testing Service. Myford, C. M. & Wolfe, E. W. (2004a). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. In Smith, Jr., E. V. & Smith, R. M., editors, Introduction toRasch measurement (pp. 460517). Maple Grove, MN: JAM Press. Myford, C. M. & Wolfe, E. W. (2004b). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. In Smith, Jr., E. V. & Smith, R. M., editors, Introduction to Rasch measurement. Maple Grove, MN: JAM Press, 518574. OLoughlin, K. (1995). Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing, 12, 217237. Stansfield, C. W. & Kenyon, D. M. (1992a). The development and validation of a simulated oral proficiency interview. The Modern Language Journal, 72, 129141. Stansfield, C. W. & Kenyon, D. M. (1992b). Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System, 20, 347364. Stansfield, C. W., Kenyon, D. M., Paiva, R., Doyle, F., Ulsh, I., & Antonia, M. (1990). The development and validation of the Portuguese Speaking Test, Hispania, 73, 641651.

214 An investigation into native and non-native teachers judgments


Tashakkori, A. & Teddlie, C., editors (2003). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: Sage. Taylor, L. B. (2006). The changing landscape of English: Implications for language assessment, ELT Journal, 60, 5160. Teddlie, C. & Yu, F. (2007). Mixed methods sampling: A typology with examples. Journal of Mixed Methods Research, 1, 77100. Underhill, N. (1987). Testing spoken language: A handbook of oral testing techniques. Cambridge: Cambridge University Press. Widdowson, H. G. (1994). The ownership of English. TESOL Quarterly, 28, 377388. Wright, B. D. & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement: Transactions of the Rasch Measurement SIG, 8, 370.


Appendix A: Rating scale for the oral English test


4  Overall communication is almost always successful; little or no listener effort is required.
3  Overall communication is generally successful; some listener effort is required.
2  Overall communication is less successful; more listener effort is required.
1  Overall communication is generally unsuccessful; a great deal of listener effort is required.

Notes:
1. Communication is defined as an examinee's ability to both address a given task and get a message across.
2. A score of 4 does not necessarily mean speech is comparable to that of native English speakers.
3. No response, or a response of "I don't know", is automatically rated NR (Not Ratable).

Appendix B: Definitions and examples of the evaluation criteria


1. Understanding the task: the degree to which a speaker understands the given task
Examples: Didn't seem to understand the task; didn't understand everything about the task.
2. Overall task accomplishment: the degree to which a speaker accomplishes the general demands of the task
Examples: Generally accomplished the task; task not really well accomplished; successfully accomplished task.
3. Strength of argument: the degree to which the argument of the response is robust
Examples: Good range of points raised; good statement of main reason presented; arguments quite strong.
4. Accuracy of transferred information: the degree to which a speaker transfers the given information accurately
Examples: Misinterpretation of information (e.g., graduate renewals for undergrads, $50 a day for book overdue?); incorrect information (e.g., 9pm instead of 6pm).
5. Topic relevance: the degree to which the content of the response is relevant to the topic
Examples: Not all points relevant; suddenly addressing irrelevant topic (i.e., focusing on physically harmful effects of laptops rather than on harmful effects of the internet).
6. Overall language use: the degree to which the language component of the response is of good and appropriate quality
Examples: Generally good use of language; native-like language; very limited language.
7. Vocabulary: the degree to which vocabulary used in the response is of good and appropriate quality
Examples: Good choice of vocabulary; some unusual vocabulary choices (e.g., "he crossed a girl").
8. Pronunciation: the degree to which pronunciation of the response is of good quality and clarity
Examples: Native-like pronunciation; pronunciation difficulty (e.g., l/r, d/t, vowels, i/e); mispronunciation of some words (e.g., circulation).
9. Fluency: the degree to which the response is fluent without too much hesitation
Examples: Choppy, halted; pausing, halting, stalling, periods of silence; smooth flow of speech.
10. Intelligibility: the degree to which the response is intelligible or comprehensible
Examples: Hard to understand language (a great deal of listener work required); almost always understandable language.
11. Sentence structure: the degree to which the sentential structure of the response is of good quality and complexity
Examples: Cannot make complex sentences; telegraphic speech; took risk with more complex sentence structure.
12. General grammar use: the degree to which the general grammatical use is of good quality
Examples: Generally good grammar; some problems with grammar; few grammatical errors.
13. Specific grammar use: the degree to which the micro-level of grammatical use is of good quality
Examples: Omission of articles; incorrect or vague use of prepositions of place; good use of past progressive.
14. Socio-cultural appropriateness: the degree to which the response is appropriate in a social and cultural sense
Examples: Cultural/pragmatic issue (a little formal to congratulate a friend); little congratulations, more advice (culturally not appropriate).
15. Contextual appropriateness: the degree to which the response is appropriate to the intended communicative goals of a given situation
Examples: Appropriate language for a given situation; student response would have been appropriate if Monica had expressed worry about going to graduate school.
16. Coherence: the degree to which the response is developed in a coherent manner
Examples: Good use of linking words; great time markers; organized answer.
17. Supplement of details: the degree to which sufficient information or details are provided for effective communication
Examples: Provides enough details for effective explanation about the graph; student only made one general comment about the graph without referring to specifics; lacks enough information with logical explanation.
18. Completeness of discourse: the degree to which the discourse of the response is organized in a complete manner
Examples: Incomplete speech; no reference to conclusion; end not finished.
19. Elaboration of argument: the degree to which the argument of the response is elaborated
Examples: Mentioned his arguments but did not explain them; good elaboration of reasons; connect ideas smoothly by elaborating his arguments.
