An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach
Youn-Hee Kim University of Toronto, Canada
This study used a mixed methods research approach to examine how native English-speaking (NS) and non-native English-speaking (NNS) teachers assess students' oral English performance. The evaluation behaviors of two groups of teachers (12 Canadian NS teachers and 12 Korean NNS teachers) were compared with regard to internal consistency, severity, and evaluation criteria. Results of a Many-faceted Rasch Measurement analysis showed that most of the NS and NNS teachers maintained acceptable levels of internal consistency, with only one or two inconsistent raters in each group. The two groups of teachers also exhibited similar severity patterns across different tasks. However, substantial dissimilarities emerged in the evaluation criteria teachers used to assess students' performance. A qualitative analysis demonstrated that the judgments of the NS teachers were more detailed and elaborate than those of the NNS teachers in the areas of pronunciation, specific grammar use, and the accuracy of transferred information. These findings are used as the basis for a discussion of NS versus NNS teachers as language assessors on the one hand and the usefulness of mixed methods inquiries on the other.

Keywords: mixed methods, NS and NNS, oral English performance assessment, many-faceted Rasch Measurement
Address for correspondence: Youn-Hee Kim, Modern Language Center, Ontario Institute for Studies in Education, University of Toronto, 252 Bloor Street West, Toronto, Ont., Canada, M5S 1V6; email: younkim@oise.utoronto.ca

© The Author(s), 2009. DOI: 10.1177/0265532208101010

In the complex world of language assessment, the presence of raters is one of the features that distinguish performance assessment from traditional assessment. While scores in traditional fixed-response assessments (e.g., multiple-choice tests) are elicited solely from the interaction between test-takers and tasks, it is possible that the final scores awarded by a rater could be affected by variables
inherent to that rater (McNamara, 1996). Use of a rater for performance assessment therefore adds a new dimension of interaction to the process of assessment, and makes monitoring of reliability and validity even more crucial.

The increasing interest in rater variability has also given rise to issues of eligibility; in particular, the question of whether native speakers should be the only 'norm maker[s]' (Kachru, 1985) in language assessment has inspired heated debate among language professionals. The normative system of native speakers has long been assumed in English proficiency tests (Taylor, 2006), and it is therefore unsurprising that large-scale, high-stakes tests such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS) have rendered their assessments using native English-speaking ability as a benchmark (Hill, 1997; Lowenberg, 2000, 2002). However, the current status of English as a language of international communication has caused language professionals to reconsider whether native speakers should be the only acceptable standard (Taylor, 2006). Indeed, non-native English speakers outnumber native English speakers internationally (Crystal, 2003; Graddol, 1997; Jenkins, 2003; Lowenberg, 2000), and localization of the language has occurred in expanding circle countries such as China, Korea, Japan, and Russia (Kachru, 1985, 1992; Lowenberg, 2000). These developments suggest that new avenues of opportunity may be opening for non-native English speakers as language assessors.

This study, in line with the global spread of English as a lingua franca, investigates how native English-speaking (NS) and non-native English-speaking (NNS) teachers evaluate students' oral English performance in a classroom setting. A mixed methods approach will be utilized to address the following research questions:

1) Do NS and NNS teachers exhibit similar levels of internal consistency when they assess students' oral English performance?
2) Do NS and NNS teachers exhibit interchangeable severity across different tasks when they assess students' oral English performance?
3) How do NS and NNS teachers differ in drawing on evaluation criteria when they comment on students' oral English performance?
I Review of the literature

A great deal of research exploring rater variability in second language oral performance assessment has been conducted, with a number of early studies focusing on the impact of raters' different backgrounds (Barnwell, 1989; Brown, 1995; Chalhoub-Deville, 1995; Chalhoub-Deville & Wigglesworth, 2005; Fayer & Krasinski, 1987; Galloway, 1980; Hadden, 1991). In general, teachers and non-native speakers were shown to be more severe in their assessments than non-teachers and native speakers, but the outcomes of some studies contradicted one another. This may be explained by their use of different native languages, small rater samples, and different methodologies (Brown, 1995; Chalhoub-Deville, 1995).

For example, in a study of raters' professional backgrounds, Hadden (1991) investigated how native English-speaking teachers and non-teachers perceive the competence of Chinese students in spoken English. She found that teachers tended to be more severe than non-teachers as far as linguistic ability was concerned, but that there were no significant differences in such areas as comprehensibility, social acceptability, personality, and body language. Chalhoub-Deville (1995), on the other hand, comparing three different rater groups (i.e., native Arabic-speaking teachers living in the USA, non-teaching native Arabic speakers living in the USA, and non-teaching native Arabic speakers living in Lebanon), found that teachers attended more to the creativity and adequacy of information in a narration task than to linguistic features. Chalhoub-Deville suggested that the discrepant findings of the two studies could be due to the fact that her study focused on modern standard Arabic (MSA), whereas Hadden's study focused on English.

Another line of research has focused on raters' different linguistic backgrounds. Fayer and Krasinski (1987) examined how the English-speaking performance of Puerto Rican students was perceived by native English-speaking raters and native Spanish-speaking raters. The results showed that non-native raters tended to be more severe in general and to express more annoyance when rating linguistic forms, and that pronunciation and hesitation were the most distracting factors for both sets of raters. However, this was somewhat at odds with Brown's (1995) study, which found that while native speakers tended to be more severe than non-native speakers, the difference was not significant. Brown concluded that 'there is little evidence that native speakers are more suitable than non-native speakers … However, the way in which they perceive
the items (assessment criteria) and the way in which they apply the scale do differ' (p. 13).

Studies of raters with diverse linguistic and professional backgrounds have also been conducted. Comparing native and non-native Spanish speakers with or without teaching experience, Galloway (1980) found that non-native teachers tended to focus on grammatical forms and reacted more negatively to nonverbal behavior and slow speech, while non-teaching native speakers seemed to place more emphasis on content and on supporting students' attempts at self-expression. Conversely, Barnwell (1989) reported that untrained native Spanish speakers provided more severe assessments than an ACTFL-trained Spanish rater. This result conflicts with that of Galloway (1980), who found that untrained native speakers were more lenient than teachers. Barnwell suggested that both studies were small in scope, and that it was therefore premature to draw conclusions about native speakers' responses to non-native speaking performance. Hill (1997) further pointed out that the use of two different versions of rating scales in Barnwell's study, one of which was presented in English and the other in Spanish, remains questionable.

One recent study of rater behavior focused on the effect of country of origin and task on evaluations of students' oral English performance. Chalhoub-Deville and Wigglesworth (2005) investigated whether native English-speaking teachers who live in different English-speaking countries (i.e., Australia, Canada, the UK, and the USA) exhibited significantly different rating behaviors in their assessments of students' performance on three Test of Spoken English (TSE) tasks, namely 1) give and support an opinion, 2) picture-based narration, and 3) presentation, which require different linguistic, functional, and cognitive strategies. MANOVA results indicated significant variability among the different groups of native English-speaking teachers across all three tasks, with teachers residing in the UK the most severe and those in the USA the most lenient across the board; however, the very small effect size (η² = 0.01) suggested that little difference exists among different groups of native English-speaking teachers.

Although the above studies provide some evidence that raters' linguistic and professional backgrounds influence their evaluation behavior, further research is needed for two reasons. First, most extant studies are not grounded in finely tuned methodologies. In some early studies (e.g., Fayer & Krasinski, 1987; Galloway, 1980;
Hadden, 1991), raters were simply asked to assess speech samples of less than four minutes in length without reference to a carefully designed rating scale. Also, having raters assess only one type of speech sample did not take the potential systematic effect of task type on task performance into consideration. Had the task types varied, raters could have assessed diverse oral language output, which in turn might have elicited unknown or unexpected rating behaviors. Second, to my knowledge, no previous studies have attempted to use both quantitative and qualitative rating protocols to investigate differences between native and non-native English-speaking teachers' judgments of their students' oral English performance.

A mixed methods approach, known as the 'third methodological movement' (Tashakkori & Teddlie, 2003, p. ix), incorporates quantitative and qualitative research methods and techniques into a single study and has the potential to reduce the biases inherent in one method while enhancing the validity of inquiry (Greene, Caracelli, & Graham, 1989). However, all previous studies that have examined native and non-native English-speaking raters' behavior in oral language performance assessment have been conducted using only a quantitative framework, preventing researchers from probing research phenomena from diverse data sources and perspectives. The mixed methods approach of the present study seeks to enhance understanding of raters' behavior by investigating not only the scores assigned by NS and NNS teachers but also how they assess students' oral English performance.
II Methodology

1 Research design overview

The underlying research framework of this study is based on both expansion and complementarity mixed methods designs, which are most commonly used in empirical mixed methods evaluation studies (see Greene et al., 1989, for a review of mixed methods evaluation designs). The expansion design was considered particularly well suited to this study because it would offer a comprehensive and diverse illustration of rating behavior, examining both the product that the teachers generate (i.e., the numeric scores awarded to students) and the process that they go through (i.e., evaluative comments) in their assessment of students' oral English performance (Greene et al., 1989). The complementarity design was included to
provide greater understanding of the NS and NNS teachers' rating behaviors by investigating the overlapping but different aspects of rater behavior that different methods might elicit (Greene et al., 1989). Intramethod mixing, in which a single method concurrently or sequentially incorporates quantitative and qualitative components (Johnson & Turner, 2003), was the selected guiding procedure. The same weight was given to both quantitative and qualitative methods, with neither method dominating the other.

2 Participants

Ten Korean students were selected from a college-level language institute in Montreal, Canada, and were informed about the research project and the test before participating in the study. The students were drawn from class levels ranging from beginner to advanced, so that the student sample would include differing levels of English proficiency. The language institute sorted students into one of five class levels according to their aggregate scores on a placement test measuring four English language skills (listening, reading, speaking, and writing): Level I for students with the lowest English proficiency, up to Level V for students with the highest English proficiency. Table 1 shows the distribution of the student sample across the five class levels.

For the teacher samples, a concurrent mixed methods sampling procedure was used in which a single sample produced data for both the quantitative and qualitative elements of the study (Teddlie & Yu, 2007). Twelve native English-speaking Canadian teachers of English and 12 non-native English-speaking Korean teachers of English constituted the NS and NNS teacher groups, respectively. In order to ensure that the teachers were sufficiently qualified, certain participation criteria were outlined: 1) at least one year of prior experience teaching an English conversation course to non-native English speakers in a college-level language institution; 2) at least one graduate degree in a field related to linguistics or language education; and 3) high proficiency in spoken English for Korean teachers of English.
Table 1  Distribution of students across class levels

Level                 I    II   III   IV   V
Number of students    1    1    3     3    2
Teachers' background information was obtained via a questionnaire after their student evaluations were completed: all of the NNS teachers had lived in English-speaking countries for one to seven years for academic purposes, and their self-assessed English proficiency levels ranged from advanced (six teachers) to near-native (six teachers); none of the NNS teachers rated their English proficiency at or below an upper-intermediate level. In addition, nine NS and eight NNS teachers reported having taken graduate-level courses specifically in Second Language Testing and Evaluation, and four NS and one NNS teacher had been trained as raters of spoken English.
3 Instruments

A semi-direct oral English test was developed for the study. The purpose of the test was to assess the overall oral communicative language ability of non-native English speakers within an academic context. Throughout the test, communicative language ability would be evidenced by the effective use of language knowledge and strategic competence (Bachman & Palmer, 1996). Initial test development began with the identification of the target language use domain, target language tasks, and task characteristics (Bachman & Palmer, 1996). The test tasks were selected and revised to reflect potential test-takers' language proficiency and topical knowledge, as well as task difficulty and interest. An effort was also made to select test tasks related to hypothetical situations that could occur within an academic context. In developing the test, the guiding principles of the Simulated Oral Proficiency Interview (SOPI) were referenced.

The test consisted of three different task types in order to assess the diverse oral language output of test-takers: picture-based, situation-based, and topic-based. The picture-based task required test-takers to describe or narrate visual information, such as describing the layout of a library (Task 1, [T1]), explaining the library services based on a provided informational note (Task 2, [T2]), narrating a story from six sequential pictures (Task 4, [T4]), and describing a graph of human life expectancy (Task 7, [T7]). The situation-based task required test-takers to perform the appropriate pragmatic function in a hypothetical situation, such as congratulating a friend on being admitted to school (Task 3, [T3]). Finally, the topic-based task required test-takers to offer
their opinions on a given topic, such as explaining their personal preferences for either individual or group work (Task 5, [T5]), discussing the harmful effects of Internet use (Task 6, [T6]), and suggesting reasons for an increase in human life expectancy (Task 8, [T8]).

The test was administered in a computer-mediated indirect interview format. The indirect method was selected because the intervention of interlocutors in a direct speaking test might affect the reliability of test performance (Stansfield & Kenyon, 1992a, 1992b). Although the lexical density produced in direct and indirect speaking tests has been found to differ (O'Loughlin, 1995), it has consistently been reported that scores from indirect speaking tests correlate highly with those from direct speaking tests (Clark & Swinton, 1979, 1980; O'Loughlin, 1995; Stansfield, Kenyon, Paiva, Doyle, Ulsh, & Antonia, 1990). In order to effectively and economically facilitate an understanding of the task without providing test-takers with a lot of vocabulary (Underhill, 1987), each task was accompanied by visual stimuli. The test lasted approximately 25 minutes, 8 of which were allotted for responses.

A four-point rating scale was developed for rating (see Appendix A). It had four levels, labeled 1, 2, 3, and 4. A response of 'I don't know' or no response was automatically rated NR (Not Ratable). The rating scale only clarified the degree of communicative success without addressing specific evaluation criteria. Because this study aimed to investigate how the teachers commented on the students' oral communicative ability and defined the evaluation criteria to be measured, the rating scale did not provide teachers with any information about which evaluation features to draw on. To discourage teachers from sitting on the fence, an even number of levels was sought for the rating scale. Moreover, in order not to impose a cognitive and psychological overload on the teachers, six levels were set as the upper limit during the initial stage of rating scale development. Throughout the trials, however, the six levels describing the degree of communicative success proved difficult to distinguish without reference to adjacent levels. More importantly, teachers who participated in the trials did not use all six levels of the rating scale in their evaluations. For these reasons, the rating scale was trimmed to four levels, enabling the teachers to consistently distinguish each level from the others.
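As a compact summary of the instrument just described, the test blueprint and rating scale can be represented as a simple data structure. This is an illustrative sketch only; the task descriptions are paraphrased and the identifiers are hypothetical, not part of the original test materials.

    # Illustrative sketch of the test blueprint described above.
    # Task descriptions are paraphrased; names are hypothetical.
    TASKS = {
        "T1": ("picture-based",   "describe the layout of a library"),
        "T2": ("picture-based",   "explain library services from an informational note"),
        "T3": ("situation-based", "congratulate a friend admitted to school"),
        "T4": ("picture-based",   "narrate a story from six sequential pictures"),
        "T5": ("topic-based",     "preference for individual versus group work"),
        "T6": ("topic-based",     "harmful effects of Internet use"),
        "T7": ("picture-based",   "describe a graph of human life expectancy"),
        "T8": ("topic-based",     "reasons for an increase in human life expectancy"),
    }

    RATING_SCALE = (1, 2, 3, 4)   # degree of communicative success only
    NOT_RATABLE = "NR"            # awarded for 'I don't know' or no response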
4 Procedure

The test was administered individually to each of the 10 Korean students, and their speech responses were simultaneously recorded as digital sound files. The order of the students' test response sets was randomized to minimize a potential ordering effect, and 12 of the possible orderings of the test response sets were then distributed to both groups of teachers. A meeting was held with each teacher in order to explain the research project and to go over the scoring procedure, which had two phases: 1) rating the students' test responses according to the four-point rating scale; and 2) justifying those ratings by providing written comments either in English or in Korean. While the NS teachers were asked to write comments in English, the NNS teachers were asked to write comments in Korean (which were later translated into English). The rationale for requiring teachers' comments was that they would not only reveal the evaluation criteria the teachers drew on to infer students' oral proficiency, but also help to identify the construct being measured.

The teachers were allowed to control the playing, stopping, and replaying of test responses and to listen to them as many times as they wanted. After rating a single task response by one student according to the rating scale, they justified their rating by writing down their reasons or comments. They then moved on to the next task response of that student. The teachers rated and commented on 80 test responses (10 students × 8 tasks). To decrease the subject expectancy effect, the teachers were told that the purpose of the study was to investigate teachers' rating behavior, and the comparison of different teacher groups was not explicitly mentioned. The two groups of teachers were therefore unaware of each other. In addition, only minimal information about the students (e.g., education level, current visa status) was provided to the teachers. Meetings with the NS teachers were held in Montreal, Canada, and meetings with the NNS teachers followed in Daegu, Korea. Each meeting lasted approximately 30 minutes.

5 Data analyses

Both quantitative and qualitative data were collected. The quantitative data consisted of 1,727 valid ratings, awarded by 24 teachers to 80 sample responses by 10 students on eight tasks. Each
teacher rated every student's performance on every task, so that the data matrix was fully crossed. A rating of NR (Not Ratable) was treated as missing data; there were eight such cases among the 80 speech samples. In addition, one teacher failed to make one rating. The qualitative data included 3,295 written comments. Both types of data were analyzed in a concurrent manner: Many-faceted Rasch Measurement (Linacre, 1989) was used to analyze the quantitative ratings, and typology development and data transformation (Caracelli & Greene, 1993) guided the analysis of the qualitative written comments. The quantitative and qualitative research approaches were integrated at a later stage (rather than at the outset of the research process), when the findings from both methods were interpreted and the study was concluded. Since the component designs to which this study belongs do not leave much room for combining the two approaches during analysis (Caracelli & Greene, 1997), the different methods tended to remain distinct throughout the study. Figure 1 summarizes the overall data analysis procedures.

a Quantitative data analysis: The data were analyzed using the FACETS computer program, Version 3.57.0 (Linacre, 2005). Four facets were specified: student, teacher, teacher group, and task. The teacher group facet was entered as a dummy facet and anchored at zero. A hybrid Many-faceted Rasch Measurement Model (Myford & Wolfe, 2004a) was used to differentially apply the Rating Scale Model to teachers and tasks, and the Partial Credit Model to teacher groups. Three different types of statistical analysis were carried out to investigate teachers' internal consistency, based on: 1) fit statistics; 2) proportions of large standard residuals between observed and expected scores (Myford & Wolfe, 2000); and 3) a single rater-rest of the raters (SR/ROR) correlation (Myford & Wolfe, 2004a). The multiple analyses were intended to strengthen the validity of inferences drawn about raters' internal consistency through converging evidence, and to minimize any bias inherent to a particular analysis. Teachers' severity measures were also examined in three different ways, based on: 1) task difficulty measures, 2) a bias analysis between teacher groups and tasks, and 3) a bias analysis between individual teachers and tasks.
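For readers unfamiliar with the model, a standard many-facet Rasch formulation (following Linacre, 1989) expresses the log-odds of student n receiving rating category k rather than k-1 from teacher j on task i as a sum of facet parameters. The exact parameterization of the hybrid model used in this study may differ, so the following is offered only as a general reference:

    \[
    \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
    \]

where \theta_n is the proficiency of student n, \delta_i the difficulty of task i, \alpha_j the severity of teacher j, and \tau_k the threshold of rating category k. Under a Partial Credit specification applied to teacher groups, the category thresholds would additionally vary by group (\tau_{gk}).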
Figure 1  Overall data analysis procedures (internal consistency indices: fit statistics; proportions of large standard residuals; single rater-rest of the raters (SR/ROR) correlation)
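To illustrate what two of these internal consistency indices involve, the sketch below computes, for each rater, a single rater-rest of the raters correlation and a proportion of large standardized residuals from a fully crossed ratings array. It is a simplified illustration with simulated data and hypothetical variable names; the study itself used the FACETS program, whose residuals are based on Rasch model expectations rather than the simple mean-based expectations used here.

    import numpy as np

    # Simulated fully crossed data: 24 raters x 80 responses (10 students x 8 tasks),
    # ratings on the 1-4 scale, with np.nan marking NR / missing ratings.
    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 5, size=(24, 80)).astype(float)
    ratings[rng.random(ratings.shape) < 0.02] = np.nan

    def sr_ror_correlations(r):
        """Correlate each rater's scores with the mean score of all other raters."""
        corrs = []
        for j in range(r.shape[0]):
            rest_mean = np.nanmean(np.delete(r, j, axis=0), axis=0)
            ok = ~np.isnan(r[j]) & ~np.isnan(rest_mean)   # commonly rated responses
            corrs.append(np.corrcoef(r[j][ok], rest_mean[ok])[0, 1])
        return np.array(corrs)

    def large_residual_proportions(r, threshold=2.0):
        """Per-rater proportion of standardized residuals beyond +/- threshold.

        Expected scores are approximated by each response's mean rating across
        raters (a simplification of the model-based expectations FACETS uses)."""
        expected = np.nanmean(r, axis=0)
        resid = r - expected
        z = resid / np.nanstd(resid)
        large = np.abs(np.nan_to_num(z, nan=0.0)) > threshold
        return large.sum(axis=1) / np.sum(~np.isnan(r), axis=1)

    print(np.round(sr_ror_correlations(ratings), 2))
    print(np.round(large_residual_proportions(ratings), 2))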