
Principles of Language Assessment - Practicality, Reliability, Validity, Authenticity, and Washback


A. Practicality

An effective test is practical. This means that it

Is not excessively expensive,

Stays within appropriate time constraints,

Is relatively easy to administer, and

Has a scoring/evaluation procedure that is specific and time-efficient.

A test that is prohibitively expensive is impractical. A test of language proficiency that takes a
student five hours to complete is impractical: it consumes more time (and money) than necessary
to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for
a group of several hundred test-takers and only a handful of examiners. A test that takes a few
minutes for a student to take and several hours for an examiner to evaluate is impractical for
most classroom situations.

B. Reliability

A reliable test is consistent and dependable. If you give the same test to the same student or
matched students on two different occasions, the test should yield similar results. The issue of the
reliability of a test may best be addressed by considering a number of factors that may contribute
to the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p.
804): fluctuations in the student, in scoring, in test administration, and in the test itself.

Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a bad
day, anxiety, and other physical or psychological factors, which may make an observed score
deviate from one's true score. Also included in this category are such factors as a test-taker's
test-wiseness or strategies for efficient test taking (Mousavi, 2002, p. 804).

Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability
occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack
of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story
above about the placement test, the initial scoring plan for the dictations was found to be
unreliable: that is, the two scorers were not applying the same standards.
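
To make the idea of scorer consistency concrete, the short Python sketch below (not part of the
original text) computes Cohen's kappa, one common index of inter-rater agreement, for two
hypothetical raters who have marked the same set of dictations as pass or fail; all labels and
scores are invented for illustration.

    from collections import Counter

    # Hypothetical pass/fail judgments by two raters on the same six dictations
    rater_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
    rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

    def cohens_kappa(a, b):
        n = len(a)
        observed = sum(1 for x, y in zip(a, b) if x == y) / n   # raw agreement
        freq_a, freq_b = Counter(a), Counter(b)
        categories = set(a) | set(b)
        # agreement expected by chance, from each rater's marginal frequencies
        expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
        return (observed - expected) / (1 - expected)

    print(round(cohens_kappa(rater_a, rater_b), 2))  # about 0.67 here; values near 1.0 mean consistent scoring

A low kappa would signal exactly the kind of rater disagreement described above and would suggest
revisiting the scoring criteria before trusting the results.
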
Test Administration Reliability

Unreliability may also result from the conditions in which the test is administered. I once
witnessed the administration of a test of aural comprehension in which a tape recorder played
items for comprehension, but because of street noise outside the building, students sitting next to
windows could not hear the tape accurately. This was a clear case of unreliability caused by the
conditions of the test administration. Other sources of unreliability are found in photocopying
variations, the amount of light in different parts of the room, variations in temperature, and even
the condition of desks and chairs.

Test Reliability

Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-
takers may become fatigued by the time they reach the later items and hastily respond incorrectly.
Timed tests may discriminate against students who do not perform well on a test with a time limit.
We all know people (and you may be included in this category!) who know the course material
perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written
test items (that are ambiguous or that have more than one correct answer) may be a further source
of test unreliability.

C. Validity

By far the most complex criterion of an effective test, and arguably the most important principle, is
validity, the extent to which inferences made from assessment results are appropriate,
meaningful, and useful in terms of the purpose of the assessment (Gronlund, 1998, p. 226). A valid
test of reading ability actually measures reading ability, not 20/20 vision, nor previous knowledge
in a subject, nor some other variable of questionable relevance. To measure writing ability, one
might ask students to write as many words as they can in 15 minutes, then simply count the words
for the final score. Such a test would be easy to administer (practical), and the scoring quite
dependable (reliable). But it would not constitute a valid test of writing ability without some
consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas,
among other factors.

Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it
requires the test-takers to perform the behavior that is being measured, it can claim content-
related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002;
Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly
define the achievement that you are measuring.

Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called criterion-related
evidence, also referred to as criterion-related validity, or the extent to which the criterion of the
test has actually been reached. You will recall that in Chapter 1 it was noted that most classroom-
based assessment with teacher-designed tests fits the concept of criterion-referenced assessment.
In such tests, specified classroom objectives are measured, and implied predetermined levels of
performance are expected to be reached (e.g., 80 percent is considered a minimal passing grade).
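
As a hedged illustration of criterion-referenced scoring (the scores and the 50-point test length
below are hypothetical, not taken from the text), the following Python sketch simply checks each
learner's raw score against a predetermined 80 percent cut-off rather than against other test-takers:

    # Hypothetical criterion-referenced check: compare each score with a fixed cut-off
    cut_off = 0.80                                   # predetermined level of performance
    max_points = 50                                  # assumed test length
    raw_scores = {"Student A": 44, "Student B": 38, "Student C": 41}

    for name, points in raw_scores.items():
        percentage = points / max_points
        status = "meets criterion" if percentage >= cut_off else "below criterion"
        print(f"{name}: {percentage:.0%} -> {status}")
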

Construct-Related Evidence

A third kind of evidence that can support validity, but one that does not play as large a role for
classroom teachers, is construct-related validity, commonly referred to as construct validity. A
construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our
universe of perceptions. Constructs may or may not be directly or empirically measured; their
verification often requires inferential data.

Consequential Validity

As well as the three widely accepted forms of evidence above that may be introduced to support
the validity of an assessment, two other categories may be of some interest and utility in your own
quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and
Brindley (2001), among others, underscore the potential importance of the consequences of using
an assessment. Consequential validity encompasses all the consequences of a test, including such
considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-
takers, its effect on the learner, and the (intended and unintended) social consequences of a test's
interpretation and use.

Face Validity

An important facet of consequential validity is the extent to which students view the assessment
as fair, relevant, and useful for improving learning (Gronlund, 1998, p. 210), or what is popularly
known as face validity. Face validity refers to the degree to which a test looks right, and appears
to measure the knowledge or abilities it claims to measure, based on the subjective judgment of
the examinees who take it, the administrative personnel who decide on its use, and other
psychometrically unsophisticated observers (Mousavi, 2002, p. 244).

D. Authenticity

A fourth major principle of language testing is authenticity, a concept that is a little slippery to
define, especially within the art and science of evaluating and designing tests. Bachman and
Palmer (1996, p. 23) define authenticity as the degree of correspondence of the characteristics of
a given language test task to the features of a target language task, and then suggest an agenda
for identifying those target language tasks and for transforming them into valid test items.

E. Washback

A facet of consequential validity, discussed above, is the effect of testing on teaching and
learning (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback.
In large-scale assessment, washback generally refers to the effects a test has on instruction in
terms of how students prepare for the test. Cram courses and teaching to the test are
examples of such washback. Another form of washback that occurs more in classroom assessment
is the information that washes back to students in the form of useful diagnoses of strengths and
weaknesses. Washback also includes the effects of an assessment on teaching and learning prior
to the assessment itself, that is, on preparation for the assessment.

F. Applying Principles to the Evaluation of Classroom Tests

The five principles of practicality, reliability, validity, authenticity, and washback go a long way
toward providing useful guidelines for both evaluating an existing assessment procedure and
designing one on your own. Quizzes, tests, final exams, and standardized proficiency tests can all
be scrutinized through these five lenses.

Abstract
The purpose of this study is to reveal the reliability, practicality, and validity level of the
entrance exam for the teacher training program at the National Institute of Education (NIE) in
Cambodia, and to examine the knowledge of the students and the test itself when the examination
was administered in 2010. The study drew on documents and on interviews with two teachers and
two students at NIE. On the basis of these findings, the current study contributes to the
understanding of the practicality, reliability, and validity level of the test items and provides
suggestions for improving the design of the entrance exam at NIE in the future.

Introduction
The National Institute of Education (NIE) is a well-known school in Cambodia. It is charged with
responsibility for the training of Cambodia's school teachers and school administrators. It
comprises two departments - the Education Department, which trains lower and upper secondary
school teachers in the sciences and social sciences, and the Planning and Management Department,
which trains school principals, inspectors, supervisors and office administrators to plan and
evaluate the quality of education throughout the country. It has an enrollment of about 900
students every year (MoEYS). This paper therefore considers the reliability, practicality, and
validity level of a representative English language test that was part of the entrance examination
for the National Institute of Education in 2010; in order to understand the nature of the English
test, the analysis also sets out the context in which the entrance examination was taken by the
students or trainees. All students or trainees must take the entrance examination: those who pass
have a chance to study there, while those who fail must wait until the next year and take the exam
again.

Literature review
Reliability
According to Clark (1975), reliability is in fact a prerequisite to validity in performance
assessment in the sense that the test must provide consistent, replicable information about
candidates' language performance. Jones (1979) added that no test can achieve its intended purpose
if the test results are unreliable. Reliability in a performance test depends on two significant
variables: (1) the simulation of the test tasks, and (2) the consistency of the ratings. Four types
of reliability have drawn serious attention: (1) inter-examiner reliability, (2) intra-examiner
reliability, (3) inter-rater reliability, and (4) intra-rater reliability.
Jones also mentioned that since the administration of performance tests may vary in
different contexts at different times, it may result in inconsistent ratings for the same
examinee on different performance tests. Therefore, attention should be devoted to
inter-examiner and intra-examiner reliability, which concern consistency in eliciting test
performance from the test takers (Jones, 1979).
In addition, performance tests require human or mechanical raters' judgments. The
reliability issue is generally more complicated when tests involve human raters
because human judgments involve subjective interpretation on the part of the rater and
may thus lead to disagreement (McNamara, 1996). Inter-rater and intra-rater reliability
are the main considerations when investigating the issue of rater disagreement. Inter-
rater reliability has to do with the consistency between two or more raters who
evaluate the same test performance (Jones, 1979). For inter-rater reliability, it is of
primary interest to examine if the observations over raters are consistent or not, which
may be estimated through the application of generalizability (Crocker & Algina, 1986).
Intra-rater reliability concerns the consistency of one rater for the same test
performance at different times (Jones, 1979). Both inter- and intra-rater reliability
deserve close attention in that test scores are likely to vary from rater to rater or even
from the same rater (Clark, 1979).
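
As a rough illustration of how these two forms of consistency might be checked numerically (the
scores below are hypothetical, not data from the study), the same Pearson correlation can be
applied to two raters scoring the same performances (inter-rater) or to one rater scoring them on
two occasions (intra-rater):

    import math

    def pearson(x, y):
        # Pearson correlation between two lists of scores
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    rater_1 = [15, 18, 12, 20, 16]          # rater 1's scores for five performances
    rater_2 = [14, 19, 11, 20, 15]          # rater 2's scores for the same performances
    rater_1_later = [16, 18, 13, 19, 16]    # rater 1 rescoring the same performances later

    print(round(pearson(rater_1, rater_2), 2))        # inter-rater consistency (about 0.98 here)
    print(round(pearson(rater_1, rater_1_later), 2))  # intra-rater consistency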

Practicality
Practicality refers to the economy of time, effort, and money in testing. In other words, a test
should be easy to design, easy to administer, easy to mark, and its results easy to interpret
(Bachman and Palmer, 1996). Moreover, according to Brown (2004), a practical test needs to stay
within financial limitations and appropriate time constraints and be easy to administer, score,
and interpret.

Validity
The term validity refers to whether or not a test measures what it intends to measure.
On a test with high validity the items will be closely linked to the test's intended focus.
For many certification and licensure tests this means that the items will be highly
related to a specific job or occupation. If a test has poor validity then it does not
measure the job-related content and competencies it ought to (Bachman and Palmer,
1996).

Content Validity
Content validity refers to the connections between the test items and the subject-
related tasks. The test should evaluate only the content related to the field of study, in a
manner that is sufficiently representative, relevant, and comprehensible (Bachman and
Palmer, 1996).
Hughes (1989) stated that a test is said to have content validity if its content constitutes a
representative sample of the language skills, structures, etc. with which it is meant to
be concerned. Hughes (1989) also said that in order to judge whether or not a test has
content validity, we need a specification of the skills or structures that it is meant to
cover; such a specification allows a principled selection of elements for inclusion in the test.
The greater a test's content validity, the more likely it is to be an accurate measure of what
it is supposed to measure.

Construct Validity
According to Bachman and Palmer (1996), construct validity implies using the construct
correctly (concepts, ideas, notions); it seeks agreement between a theoretical concept and a
specific measuring device or procedure. Derewianka (1999) said that construct validity is
concerned with the extent to which the assessment task reflects the theoretical assumptions
underpinning its construction. Moreover, Brown (2004) also noted that a construct is any theory,
hypothesis, or model that attempts to explain observed phenomena in our universe of perception.

Research questions:
What is the reliability level of the entrance exam for the teacher training program at NIE?
What is the practicality level of the entrance exam for the teacher training program at NIE?
What is the validity level of the entrance exam for the teacher training program at NIE?

Method
Participant
Two teachers who teach English at NIE and two students who study at NIE were selected to
participate in one-on-one interviews. The teachers were invited to participate in this study
because they have considerable experience in designing and scoring the tests at NIE, and the two
students are trainees there who had experience of taking the entrance exam. More importantly,
all of them were genuinely willing to share their experiences with the interviewer. As a result,
sufficient information could be obtained through in-depth interviews with the four participants
about the practicality, reliability, and validity level of the entrance exam for the teacher
training program at NIE.

Procedure
The two teachers and the two students were invited to one-on-one interviews at different times.
Appointments were made at least two days before each interview took place, to give the
participants more time to prepare for the interview process. During the interviews, notes were
taken, the sessions were recorded, and the main objective of the research was explained to the
four participants. After recording and note-taking, the participants' speech was transcribed into
written scripts, since this made it easier to write the results and discussion required at the end
of this research. The results from the participants and the findings from the literature review
were then compared to find similarities and differences, because the purpose of this research is
to further understand the practicality, reliability, and validity of the entrance exam for the
teacher training program at the National Institute of Education (NIE).

Instrument
The instruments were selected to collect the data needed for the research. These instruments gave
information about specific points to evaluate in the research, such as preferences and the
practicality, reliability, and validity level of the entrance exam for the teacher training program
at NIE. At the end of data collection, the information gathered through these instruments was what
was needed to address the main purpose of the research.
