the four main skills of listening, speaking, reading, and writing in English. My goal in this
review is to ascertain which of the three tests meets my criteria of practicality (i.e., ease of use,
time to take the test, time to receive results) and provides sufficient evidence of validity and
reliability (i.e., fairness of scoring, consistency of rating, whether actual purposes match what the test
intends). This knowledge will help me determine which test best suits the academic ELLs in the
educational context that I am interested in. Below, I've organized each test review summary into
Tables 1, 2, and 3.
Table 1
ACTFL - American Council on the Teaching of Foreign Languages
Publisher
ACTFL, Inc., 1001 N. Fairfax St., STE 200, Alexandria, Virginia 22314; Phone:
(703) 894-2900; http://www.actfl.org/professional-development/proficiencyassessments-the-actfl-testing-office
LTI - Language Testing International (exclusive licensee of ACTFL),
445 Hamilton Avenue, Suite 1104, White Plains, NY 10601; Tel: 914-963-7110,
Toll Free: 1-800-486-8444; http://www.languagetesting.com/
Date of 1st publication
ACTFL Proficiency Guidelines were first published in 1986.
Target population
Secondary education, higher education, and beyond. Tests available in many
different target languages (ACTFL Assessments Brochure, 2012).
Cost
Scoring
Evidence for reliability

Table 2
TOEFL iBT - Test of English as a Foreign Language, Internet-Based Test (ETS)
Structure (parts and item types)
Reading: 60-80 min., 36-56 multiple-choice questions, read 3-4 passages from
academic texts and answer questions
Listening: 60-80 min., 34-51 multiple-choice questions, listen to lectures,
classroom discussions and conversations, then answer questions.
10 minute break
Speaking: 20 min., 6 tasks, express an opinion on a familiar topic; speak based on
reading and listening tasks.
Preparation time: 15-30 seconds depending on task
Response time: 45-60 seconds depending on task
Writing: 50 min., 2 tasks, write essay responses based on reading and listening
tasks, support an opinion in writing.
In the actual test, candidates can take notes while listening and reading and use
them to complete the essay.
In the actual test, 3 minutes are allowed to read the passage and 20-30 minutes
to plan and write a response. Typically, an effective response to the first essay
is 150 to 225 words, with a 300-word minimum for the second essay.
Test takers with disabilities may request additional time to read the passage
and write the response (TOEFL iBT Sample Test Questions, 2016).
Scoring
A total score of up to 120 is computed across the four language skill sections.
Reading: 0-30 (High = 22-30, Intermediate = 15-21, Low = 0-14)
Listening: 0-30 (High = 22-30, Intermediate = 15-21, Low = 0-14)
The Reading and Listening sections are scored by computer with a score range
from 0 to 30. The Reading section has 36-56 tasks based on reading passages
from academic texts and answering questions. The Listening section has 34-51
tasks based on listening to lectures, classroom discussions and conversations, then
answering questions (ETS Website, 2016).
Speaking: 0-30 (Good = 26-30, Fair = 18-25, Limited = 10-17, Weak = 0-9)
ETS-certified test scorers rate responses for six tasks from 0 to 4 based on a
Speaking Rubric. The sum is converted to a scaled score of 0 to 30.
Writing: 0-30 (Good = 24-30, Fair = 17-23, Limited = 1-16)
Two tasks are rated from 0 to 5 based on a Writing Rubric for Integrated and
Independent tasks. The sum is converted to a scaled score of 0 to 30. The writing
section is scored by: evaluating the integrated writing task for development,
organization, grammar, vocabulary, accuracy and completeness (ETS Website,
2016). The independent essay is scored on overall writing quality, including
development, organization, grammar and vocabulary.
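The raw-to-scaled conversions described above can be illustrated with a simple linear rescaling. This is only a sketch: ETS converts raw rubric sums with its own published score tables, which need not be exactly linear, so the numbers below are illustrative rather than official.

```python
def to_scaled(raw_sum: float, max_raw: float, max_scaled: float = 30.0) -> float:
    """Linearly rescale a raw rubric sum onto the 0-30 section scale.

    Illustration only: the actual ETS conversion uses lookup tables,
    not necessarily a straight linear mapping.
    """
    return raw_sum / max_raw * max_scaled

# Speaking: six tasks each rated 0-4, so the raw sum ranges 0-24.
print(to_scaled(18, 24))  # -> 22.5
# Writing: two tasks each rated 0-5, so the raw sum ranges 0-10.
print(to_scaled(8, 10))   # -> 24.0
```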
Human rating: multiple, rigorously trained raters score tests anonymously.
ETS raters are continually monitored to ensure fairness and the highest
quality (ETS Website, 2016).
Software rating: e-rater automated scoring technology is used alongside human
ratings to score the independent and integrated writing tasks. Combining
human judgment for content and meaning with automated scoring for linguistic
features ensures consistent, high-quality scores (ETS Website, 2016).
Statistical distribution of the scores
The data presented in the table below are based on test takers who took the TOEFL
iBT test between January 2015 and December 2015, limited to examinees who indicated
that they were applying for admission to colleges or universities as undergraduate
students.
Table 5. Percentile Ranks for TOEFL iBT Scores: Undergraduate-Level Students
Scale   Reading   Listening   Speaking   Writing   Total
Mean    19.2      19.2        20.1       20.2      79
S.D.    6.9       7.0         4.6        5.0       21
(Test and Score Data Summary for TOEFL iBT Tests, 2015).
SEM
Score       Scale   SEM
Reading     0-30    3.35
Listening   0-30    3.20
Speaking    0-30    1.62
Writing     0-30    2.76
Total       0-120   5.64
(Reliability and Comparability of TOEFL iBT Scores, 2011, p. 5).
Evidence for reliability
Score       Reliability Est.
Reading     .85
Listening   .85
Speaking    .88
Writing     .74
Total       .94
As the table above shows, the reliability estimates for the Reading, Listening,
Speaking, and Total scores are relatively high, while the reliability of the
Writing score is somewhat lower. This is a typical result for writing measures
composed of only two tasks (Breland, Bridgeman, & Fowles, 1999)
(Reliability and Comparability of TOEFL iBT Scores, 2011, p. 5).
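The SEM and reliability figures reported above are linked by the standard classical-test-theory relationship SEM = SD * sqrt(1 - reliability). A minimal sketch of that relationship is below; note that the published ETS SEMs come from a different sample than the 2015 standard deviations, so the computed value will not match the reported 3.35 exactly.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement under classical test theory:
    SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustration with the 2015 undergraduate Reading figures (SD = 6.9)
# and the published reliability estimate (.85). The result differs from
# ETS's reported SEM of 3.35 because the two statistics were computed
# on different samples in different years.
print(round(sem(6.9, 0.85), 2))  # -> 2.67
```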
Zhang (February 2008) compared the test scores of more than 12,000
examinees who were identified as having taken two TOEFL iBT tests within a
period of one month. The correlations of their scores on the two test forms
were 0.77 for the listening and writing sections, 0.78 for reading, 0.84 for
speaking, and 0.91 for the total test score. Because these measures of reliability
take into account additional sources of variability, they are typically lower than
internal consistency measures. Nevertheless, they indicate a high degree of
consistency in the rank ordering of the scores of these test repeaters
(Reliability and Comparability of TOEFL iBT Scores, 2011, p. 5).
Evidence for validity
TOEFL iBT validity studies: http://www.ets.org/toefl/research/topics/validity
One study, conducted by Sawaki, Stricker, and Oranje (2009), examined the loadings
of first-order factors on higher-order factors:
Turning to the higher-order factor loadings in Table 4, all the four
sections had high loadings, ranging from .78 to .97. This supports the
presence of a common underlying dimension that is strongly related
to the Reading, Listening, Speaking and Writing trait factors.
However, it is notable that the loading of the Speaking factor is
Table 3
IELTS - International English Language Testing System
Date of 1st publication
Target population
Cost
Purpose
Structure (parts and item types)
Scoring
relationships and connections between facts in the text versus opinions and
theories, ability to skim and scan text for specific information, and ability to read
for detail, understand a detailed description, and relate it to information
presented in the form of a diagram (IELTS Website, 2016).
Academic and GT Writing: Each task is assessed independently. Task 2 carries
more weight than Task 1. Detailed performance descriptors apply to both the
Academic and General Training Modules and are based on the following
criteria: Task Achievement/Response, Coherence and Cohesion, Lexical
Resource, and Grammatical Range and Accuracy (IELTS Website, 2016).
Statistical distribution of the scores
Table 1: Mean, standard deviation and standard error of measurement of
Listening and Reading (2014)
Module             Mean   SD
Listening          6.1    1.3
Academic Reading   6.0    1.2
GT Reading         6.0    1.4
(IELTS Test Performance, 2014).
SEM
Module             SEM
Listening          0.391
Academic Reading   0.378
GT Reading         0.418
(IELTS Test Performance, 2014).
Evidence for reliability
Table 2.3: Reliability Estimates for the IELTS Modules (IELTS, 2004)
Module             M      SD
Listening          .90    0.02
Academic Reading   .85    0.02
GT Reading         .90    0.02
(O'Sullivan, 2005).
Shaw (2004) shows reliability estimates from .77 to .85 for a series of research
studies done during a revision project (O'Sullivan, 2005).
Evidence for validity
O'Sullivan (2005) states in his review of the IELTS that University of Cambridge
ESOL Examinations claims evidence of construct-related validity through "the
use of expert judgement in operationalizing the construct" in addition to
"empirical evidence provided through statistical analysis of test responses"
(O'Sullivan, 2005, p. 77). However, a later study by Moore, Morton, and
Price (2012) analyzed the construct validity of the IELTS Academic Reading
Test; the researchers found a divergence between the two domains of the IELTS
corpus and an academic corpus.
instruction, and depending on the group of learners, other content-areas of focus (i.e., fields of
study, careers, etc.) may be included in classroom materials.
In comparing the three published proficiency tests reviewed against the teaching context
above, I have concluded that the TOEFL iBT would be the most appropriate test for this
particular group of ELLs. Given the practicality of the TOEFL iBT (internet-based, about 4
hours total, numerical scores), along with the empirical evidence for the validity and
reliability of its results, the IELTS and ACTFL are less suitable for my context of learners.
Though the other two tests are designed for the same target population and a similar
purpose, the TOEFL iBT is more practical than the IELTS because it is strictly internet-
and computer-based, and TOEFL iBT scores are more widely recognized by universities than
ACTFL scores, which are not numerical. Multiple third-party research studies also provide
evidence of validity and reliability for the TOEFL iBT; I could not find as much
research regarding the validity or reliability of the IELTS. The fact that the TOEFL iBT Writing
test uses both computer and human ratings provides further evidence of rater reliability.
Though the IELTS is the shortest test of the three (2 hours & 45 min.), it requires
test-takers to transfer their answers to an answer sheet, sometimes with no extra time
allowed for the transfer. This extra step seems redundant and time-consuming for
test-takers. They are also penalized for poor spelling and grammar in their answer book
for the Listening and Reading modules of the IELTS (IELTS Website, 2016), as opposed to
the multiple-choice question types presented on the TOEFL iBT and ACTFL. The IELTS
General Training track provides a good
alternative for learners seeking jobs abroad, but there is not much research to prove the construct
or content validity of this test. I believe the least effective assessment of the tests reviewed is the
IELTS, due to a lack of evidence of validity and repetitive (and outdated) hand-written
procedures on some of the modules (Reading and Listening).
Some benefits of ACTFL assessments include proficiency tests in multiple languages,
instantaneous results, and a volume of research validated by independent
researchers and consulting groups. Although ACTFL assessments are easy to use,
auto-graded, and provide results instantaneously, employers and universities must be
familiar with the ACTFL five-point scale from Novice to Distinguished levels and
sub-levels (low, mid, high). ACTFL assessments also provide no data regarding the
statistical distribution of scores or the standard error of measurement, which are
often used to establish evidence of reliability. The ACTFL can also be extremely
time-consuming for test-takers, with the Reading and Listening tests maxing out at
125 minutes each for four-level tests and the Writing test maxing out at 80 minutes.
References
ETS Website (2016). TOEFL iBT Sample Test Questions. Retrieved from
http://www.ets.org/Media/Tests/TOEFL/pdf/SampleQuestions.pdf
ETS Website (2016). TOEFL Home. Retrieved from http://www.ets.org/toefl/
IELTS Website (2016). Retrieved from https://www.ielts.org/
LTI Website (2016). Language Testing International. Exclusive Licensee of ACTFL. Retrieved
from http://www.languagetesting.com/
Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching.
Upper Saddle River, NJ: Pearson.
Moore, T., Morton, J., & Price, S. (2012). Construct validity in the IELTS Academic Reading
test: A comparison of reading requirements in IELTS test items and in university study.
In L. Taylor & C. J. Weir (Eds.), IELTS Collected Papers 2: Research in reading and
listening assessment (pp. 120-211). Cambridge: Cambridge University Press.
O'Sullivan, B. (2005). International English Language Testing System (IELTS). In S. Stoynoff
& C. A. Chapelle (Eds.), ESOL tests and testing (pp. 73-78). Alexandria, VA: TESOL.
Reliability and comparability of TOEFL iBT scores. (2011). Insight: TOEFL iBT Research, 1(3),
1-8. Retrieved from http://www.ets.org/s/toefl/pdf/toefl_ibt_research_s1v3.pdf
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009) Factor structure of the TOEFL Internet-based
test. Language Testing, 26(1), 5-30. doi:10.1177/0265532208097335
Stoynoff, S., & Chapelle, C. A. (2005). ESOL tests and testing: A resource for teachers and
administrators. Alexandria, VA: TESOL.
SWA Consulting, Inc. (2012). Reliability Study of ACTFL OPIc in Spanish, English, and
Arabic for the ACE Review. Retrieved from http://www.languagetesting.com/wp-content/uploads/2013/08/actfl-opic-reliability-2012.pdf
SWA Consulting, Inc. (2004). Preliminary Reliability and Validity Findings for the ACTFL
Writing Proficiency Test. SWA Technical Report 2004-C04-R01. Retrieved from
http://www.languagetesting.com/wp-content/uploads/2013/08/ACTFL-WPT-Technical-Report-2004.pdf
SWA Consulting, Inc. (2008). Two Studies Investigating the Reliability and Validity of the
English ACTFL OPIc with Korean Test Takers. The ACTFL OPIc Validation Project
Technical Report. Retrieved from http://d2k4mc04236t2s.cloudfront.net/wp-content/uploads/2013/08/ACTFL-OPIc-English-Validation-2008.pdf
Test and Score Data Summary for TOEFL iBT Tests (2015). January 2015 - December 2015
Test Data. Retrieved from https://www.ets.org/s/toefl/pdf/94227_unlweb.pdf