
UNIT 3

WHAT MAKES A GOOD TEST?

3.0 Introduction
Unit 1 of this block introduced you to the aims and purposes of testing and
helped you understand what language testing involves. Unit 2 dealt with
the objectives of teaching and testing, what is worth testing and how. In
this unit we will examine the inherent characteristics of a good test.
A good test is based on certain general principles of testing. Whatever the
kind of test or the scoring procedure, a good test will tell us whether
certain abilities do or do not exist in the language user, and to what
extent they exist. This information has to be true. The two important
terms associated with this truth value of test information are validity
and reliability. This unit will deal with the key principles of
language testing:

validity
reliability
authenticity
interactiveness
practicality and
impact

A test is an instrument of measurement and we have to make sure that the
instrument itself is reliable and dependable. The test instrument itself will
have to be evaluated against certain well-defined principles.
By the end of this unit you should be able to

identify the criteria of a good test and

evaluate the tests that you make for your students

3.1 What makes a good test?


Why do we test? We test because we need to know if the objectives of
learning/teaching of a particular area of language ability have been
fulfilled or not. This involves assessing how much the learner has learnt
and how effective the learning/teaching processes have been. Only if we
establish this can we think about taking the learners to the next level of
learning. If we find that the learning has not been effective, we may
consider changing the instructional methods. Sometimes the learning may
have been successful but the test has not elicited the abilities that the
learners have; alternatively, the learning may not have taken place but the
test results make us think that it has been successful.
The need to assess the worth of a test can arise in two types of situation:
a. You along with a team have prepared a test for some specific purpose
and wish to evaluate it.
b. You wish to use a readymade test for a specific purpose.
In both cases you will need some criteria for evaluating the test
instruments. Various criteria are available for judging the quality of a test,
or rather, a testing procedure, which involves the test itself and the use that
is made of the information that is provided.
The questions we ask are:

Is the test valid?
Is the test reliable?
Is the test authentic?
Is the test interactive?
Is the test practical?
Does it have a washback effect?

The first two, validity and reliability, are technical concepts. The others
are considerations related to the usability and usefulness of the test in
practical situations. We will deal with each of these questions in the
following sections.
Please remember that no test is perfect. The goal is to constantly improve
the validity, reliability and usability of tests.

3.2 Validity
Every test has a specific purpose. How well does the test fulfill this
purpose? How successful is it in eliciting the information that it seeks to
elicit? Does it accurately measure the ability it seeks to measure? The
answers to these questions indicate the validity of a test.
Validity has been defined as "the degree to which a test measures what it
claims, or purports, to be measuring" (Brown, 1996, p. 231).
This will include the test design, the scoring procedure and the
interpretation of test performance.
There are three main kinds of validity:

Construct validity
Content validity and
Criterion validity

Let us study each of these now.


3.2.1 Construct validity
Before we explore this kind of validity, let us understand what the term
construct means. "A construct, or psychological construct as it is also
called, is an attribute, proficiency, ability, or skill that happens in the
human brain and is defined by established theories. For example, 'overall
English language proficiency' is a construct. It exists in theory and has
been observed to exist in practice" (Brown, 2000; italics mine).
Intelligence is a construct. It happens in the human brain and is defined by
established theories. It can be tested.

Activity A

With the help of the above definition, tick the items that may be called a
construct.
reading
writing
voting
counting
spelling
telephoning
inferring
sequencing
Discussion
Except for voting and telephoning, all the others are examples of constructs.
All of them are mental abilities. Reading, writing and spelling are
language-related abilities; counting, inferring and sequencing are cognitive
abilities. Each of them can be tested.
If we want to test if a person can read, we give him/her a reading test. This
test will specifically measure reading ability. We recognise that reading is
an ability distinct from writing or speaking or listening or grammatical
control. The ability can be isolated for testing.
When we talk of validity in testing, we primarily mean construct validity.
The other kinds of validity fall under construct validity.
How do we say that a test has construct validity? If the test results show
the existence or non-existence of an ability, we say it has construct validity.
For example, if we give a six-year-old child a string of 25 beads and ask
the child to count them, we can tell from what the child does whether she/he
can count or not. This test of counting has construct validity.

It is easy to say that a test has construct validity in a direct test, say, of
writing or speaking. It is difficult to establish this in indirect tests. We
need empirical evidence to prove this.
If we test writing indirectly through multiple choice tests, the performance
on these tests should correspond with performance on longer pieces of
writing.
If we test reading comprehension, we must be sure that the items test only
comprehension and not the grammaticality of the written answers.
The two primary forms of construct validity are its two subordinate forms,
content validity and criterion validity.

                     Construct validity
                    /                  \
          Content validity       Criterion validity

3.2.2 Content validity

As the name suggests, this relates to the content of the test. How is the
content determined? This depends on the area or domain being tested.
This content will already have been laid down in the course objectives.
The test should measure the ability/abilities in this domain appropriately,
and at the same time include adequate samples from the domain.
Let us look at an example of a sample CBSE Class IX test paper and see
how it aims to establish content validity. Only the outline of the contents is
given, not the actual test paper.
SAMPLE QUESTION PAPER
Time: 3 hours                                         Maximum marks: 80

SECTION A: READING (40 periods) - 15 marks
Two unseen passages with a variety of comprehension questions, including
4 marks for word-attack skills such as word formation and word meaning.
A.1 250-300 words in length (6 marks): a factual passage (e.g. instruction,
description, report) or a literary passage (e.g. an extract from fiction,
drama, poetry, essay or biography). In the case of a poetry extract, the
text may be shorter than 150 words.
A.2 400-500 words in length (9 marks): a factual passage or a discursive
passage involving opinion (an argumentative, persuasive or interpretative
text). It will include questions on word-attack skills for 4 marks.
The total length of the two passages is between 650 and 800 words.

SECTION B: WRITING (63 periods) - 25 marks
B.1 and B.2 Short compositions of not more than 50 words each, e.g. notice,
message or short postcard. (5 marks each)
B.3 Composition based on a verbal stimulus such as an advertisement, notice,
newspaper cutting, table, diary extract, notes, letter or other forms of
correspondence. Word limit: 200 words (for a letter: 150 words for the body
of the letter only). (7 marks)
B.4 Composition based on a visual stimulus such as a diagram, picture,
graph, map, cartoon or flow chart. Word limit: 150-200 words. (8 marks)
One of the longer (7/8 mark) questions will draw on the thematic content
of the Main Course Book. The 150-word composition will carry 7 marks and
the 200-word composition 8 marks.

SECTION C: GRAMMAR (42 periods) - 15 marks
A variety of short questions (5) involving the use of particular structures
within a context (i.e. not in isolated sentences). Test types used will
include gap-filling, cloze, sentence completion, reordering word groups in
sentences, editing, dialogue completion and sentence transformation.
The grammar syllabus will be sampled each year, with marks allotted for
verb forms, sentence structures and other areas.

SECTION D: LITERATURE (65 periods) - 25 marks
D.1 and D.2 Two extracts from different poems from the prescribed reader,
each followed by two or three questions to test local and global
comprehension of the set text. Each extract will carry 4 marks. Word limit:
one or two lines for each answer.
D.3 One question (with or without an extract) testing global or local
comprehension of a poem or a play from the prescribed reader. Word limit:
75-100 words. (4 marks)
D.4 Up to three questions based on one of the drama texts from the
prescribed reader to test local and global comprehension of the set text.
An extract may or may not be used. Word limit: one or two lines for each
question if an extract is given; if an extract is not given, 75 words.
(4 marks)
D.5 One question based on one of the prose texts from the prescribed reader
to test global comprehension and extrapolation beyond the text. Word limit:
50-75 words. (3 marks)
D.6 One extended question based on one of the prose texts from the
prescribed reader to test global comprehension and extrapolation beyond the
text. Word limit: 150-175 words. Questions will test comprehension at
different levels: literal, inferential and evaluative. (6 marks)

Note: Since Continuous and Comprehensive Evaluation is to be followed, the
weighting will be as follows:
Assignments to test listening skills: 10%
Project to test speaking skills: 10%
Formative testing (unit tests): 20%
Mid-term (half-yearly/cumulative test): 20%
Annual examination: 40%

The four areas of content are specified, namely reading, writing, grammar
and literature. Writing and literature get 25 marks each; reading and
grammar 15 marks each, out of a total of 80 marks.
You will see that 55 of the 80 marks are devoted to language work and 25 to
questions based on the prescribed reader.
The course of instruction will have a similar distribution. The number of
periods is also indicated. This follows the objectives of a school English
course at Class IX level.
What has been described above is the content of the course. Content
validity in an achievement test would imply the extent to which the areas
covered during the course appear in the test paper and whether there are
adequate samples of each area.
We might argue that literature and memory-based questions do not reflect
language ability. In school contexts, however, there is a great reliance on
textbooks for language teaching and hence both teachers and students feel
that there should be questions based on the prescribed textbook. Very few
language papers in the academic set-up do without memory questions,
though we may question the construct that is being tested here. At best, it
may be said that overall language ability is indirectly tested.

To partially get around this problem, the question paper above reproduces
the lines from the textbook where an extract is given. Yet the questions
still depend upon remembering the context of the poems and the prose
pieces.
If you look at the objectives in any language syllabus, you will not find any
reference to prescribed texts. You will find only language abilities. The
abilities stated in the objectives should form the content of a language test.
The above example illustrates what we mean by content validity.
A proficiency test will also have similar content specification. Here is the
test content specification for the CBSE Class X Proficiency Test.
The test is text independent, i.e. it is not based on a set text or syllabus. As a
Proficiency Test, it tests both skills and knowledge. There is a balance between
key aspects of language: for instance, reading skills, involving language
knowledge as well as the ability to process meanings through inference, analysis,
comparison and evaluation; and knowledge of grammar and vocabulary to the extent
required for general communicative tasks. Specific skills involved in writing
are also important, e.g. awareness of the structure of simple written texts, how
they are organized, and the kinds of formats that are used in letters, for
instance.
Reading is given 30 marks as it is the basis for grammar and writing, and because
it is important in the further studies which students have to undertake in their
later academic work.
One reading passage is of a narrative type (it may be an extract from a story)
and tests candidates' understanding of events, characters and descriptions, and
also their perception of meanings which are implicit in the details of the story.
One passage is a nonfictional text containing information, argument, opinion,
facts and ideas. Reading of this kind is focused on the ability to arrive at the
gist of an idea/argument, to correctly separate opinions from facts (which
implies some ability to analyse), to distinguish main ideas from subordinate
ones, and to understand the tone or viewpoint, e.g. humorous, ironic, serious, etc.
The third passage is a short poem, around 20 lines. This is to test if candidates
can understand language which is composed differently: it is not linear, has
hidden meanings and unusual expressions, and uses sound effects (e.g. rhyme),
simile and metaphor, which convey meaning indirectly rather than directly.
Vocabulary is given 20 marks as:
1. Vocabulary is central in reading comprehension, where it is essential to
meaning.
2. It is also tested separately in order to test the range of knowledge of words.
It is to be kept in mind that the level of vocabulary is such as is commonly
found in the texts prescribed in the school readers at Classes 9 and 10.
Grammar + Writing (30 + 20): The MCQ format does not allow testing of writing
skills, as writing is integrative of other skills and needs to be tested through
production. However, it is felt that awareness of the components of writing can
be tested here, e.g. format of letters, paragraph organization, linkage between
sentences, etc. These are also part of language knowledge.
A Cloze Test has been included as it is the most global and comprehensive test
of language. It consists of a passage where the first sentence sets the context
and subsequently there are blanks at regular intervals. Filling these blanks
requires an overall understanding of the meaning and of the language items that
will fit into the particular context of the sentences. For example, when the use
of articles is explained in isolation, a user cannot understand where a certain
article needs to be used/omitted. It is only in context that a decision
regarding the use of an article is made. Therefore a cloze test is the best way
to know if a student can infer from the context whether an article is to be used
or not. The same applies to other items like prepositions, conjunctions, etc.
The format and break-up of the parts of the test is as follows:
Maximum marks: 100
Time: 2½ hours
The question paper contains 100 questions. All questions are compulsory.
The questions are divided into the following parts (Reading: 3 passages):
A Unseen passage (narrative type): Q. 1-10
B Unseen passage (informative type): Q. 11-20
C Poem: Q. 21-30
D Vocabulary: Q. 31-50
E Grammar: Q. 51-65
F Writing: Q. 66-85
G Cloze: Q. 86-100

Notice the first statement. It states that it is text independent and it tests
both skills and knowledge.
The domain of the test is delineated. The question paper must reflect the
description of the test content.
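The cloze design described in the specification above is mechanical enough
to sketch in code. Here is a rough, illustrative generator in Python,
assuming the common fixed-ratio deletion procedure; the passage, the
deletion interval and the function name are all invented for the example,
and real cloze construction would also involve piloting the items:

# A toy cloze-test generator: keep the first sentence intact to set the
# context, then replace every nth word of the rest with a numbered blank.

def make_cloze(passage: str, n: int = 7) -> str:
    first, _, rest = passage.partition(". ")   # first sentence stays whole
    words = rest.split()
    blank_no = 0
    for i in range(n - 1, len(words), n):      # every nth word becomes a blank
        blank_no += 1
        words[i] = f"__({blank_no})__"
    return first + ". " + " ".join(words)

text = ("The village school reopened after the floods. Children walked "
        "along the muddy road carrying their books in plastic bags, and "
        "the teachers greeted them at the gate with smiles of relief.")
print(make_cloze(text))
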
One important question is the coverage of the different areas and the
adequacy of the sampling. This is conditioned by the length of the test and
of the sub-tests. Obviously, 3 items are not enough to cover the whole area
of tense. The principle of coverage is readily seen in the case of a general
English paper related to a reader consisting of several selections. While
no test can be long enough to include a number of questions on each
lesson, the overall plan of the paper (choice within sections only) can
enhance the coverage of the content.
Content validity cannot be represented by a numerical index. It is
established qualitatively by a process of inspection. Both the apparent
nature of an item and what actually happens when a student attempts it
need to be scrutinized. It is important to look at what the item actually
requires the examinee to do, rather than the format alone (short answer,
long essay, etc.).

3.2.3 Criterion validity

What is a criterion? The dictionary meaning of criterion is 'a standard by
which something can be judged or decided'.
Criterion validity relates to the degree to which the results on the test
agree with those provided by some independent and highly dependable
assessment of the candidate's ability. The independent assessment is thus
the criterion measure against which the test is validated (Hughes, 1989: 27).
It compares the test with other measures or outcomes (the criteria) already
held to be valid. For example, employee selection tests are often validated
against measures of job performance (the criterion), and IQ tests are often
validated against measures of academic performance (the criterion).
We had said earlier that there are two subtypes of construct validity:
content validity and criterion validity. Criterion validity, in turn, has two
subtypes: concurrent validity and predictive validity.

                     Construct validity
                    /                  \
          Content validity       Criterion validity
                                  /             \
                      Concurrent validity   Predictive validity

3.2.3.1 Concurrent validity

We will take up concurrent validity first. Concurrent means 'at the same
time as'. Let us understand this with an example.
Let us suppose we have to conduct a writing test in, say, 1 hour's time. We
find that we cannot cover the entire content of the course within this
limited time. We may need 3 hours to include all the components. We
prepare a 3-hour test of 18 items. This is the long test. We do not have
the time to make all our students appear for this 3-hour test, so we select
6 representative items for the shorter test.
If we have about 40 students in the group, we randomly select 10 students.
We make these 10 students appear for the long test. They, along with the
rest of the class, appear for the short test a little before or after the long test.

The scores of the 10-student sample on the long test and the short test are
compared. If they correspond well, then the short test is a valid
representation of the items in the long test.
This is called concurrent validity. The scores on the longer test are the
criterion against which the validity of the scores on the shorter test is
judged.
A long test may not be the only criterion. As a teacher you are in a position
to judge the abilities of your students. You expect a high degree of
agreement between your subjective assessment and your students' test
performance. In other words, the test should bring out the best of the
students' ability. The test should be a valid measure of a candidate's
ability.
Degrees of agreement between a test and the criterion can be statistically
measured by the correlation coefficient, which will be dealt with in Unit 3,
Block IV. Perfect agreement between two sets of scores will result in a
coefficient of 1, and a total lack of correlation will result in 0.
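To make the idea concrete, here is a minimal sketch in Python of how such a
validity coefficient could be computed for the situation described above;
the scores are invented purely for illustration:

from statistics import mean, stdev

# Hypothetical scores of the 10-student sample on the two tests.
long_test  = [15, 12, 17, 9, 14, 11, 16, 8, 13, 10]   # 3-hour, 18-item test
short_test = [5, 4, 6, 2, 5, 3, 6, 2, 4, 3]           # shorter, 6-item test

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

r = pearson(long_test, short_test)
print(f"Validity coefficient: {r:.2f}")
# A value close to 1 suggests the short test is a valid stand-in for the
# long (criterion) test; a value near 0 suggests it is not.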

3.2.3.2 Predictive validity

Predictive validity relates to the degree to which performance on the test
corresponds to performance on some future activity. Let us suppose you
appear for an entrance test to get admission into a research programme.
You are selected on the basis of your performance on this test. How well
you cope with the requirements of the research programme determines the
validity of the entrance test.
Let us take another example. Based on a placement test, we group students
into basic, lower-intermediate and upper-intermediate. How good was our
judgement in grouping the students according to their level of ability? Do
we find that the judgement was fairly accurate? Do we find that there are
quite a number of students in groups higher or lower than their levels of
ability? If accurate, the test has predictive validity. If not, the test loses out
on predictive validity.
In terms of predictive validity, the criterion is always connected to the
purpose for which the test is administered.
This validity can also be established by the correlation coefficient measure.

3.2.4 Face validity

This is not really related to the quality of a test but to its appearance. A
speaking test should elicit speech; if we ask candidates to write down or
tick the right remark or response in a situation presented in writing, the
test does not have face validity.
If candidates are asked to write about some subject-area topic in an
English test, it will fail to elicit the right kind of response. Finally, it
should also be said that if a test is not scored properly, its validity
suffers. This will be further discussed under reliability.
In this section, we have discussed validity and its different forms. When
we say a test is valid, we are saying that the test result or score is a
dependable measure of a person's ability.
Review question I
Match the statements in Column A with the terms in Column B.

Column A
a. The quality of a test that indicates fairly accurately the existence or
non-existence of an ability in an individual.
b. The quality of a test that ensures a fair and representative coverage of
the items that can reflect an ability.
c. A mental ability that can be defined and can be shown to exist or not
exist in an individual.
d. The correlation between performance on a test and some independent and
dependable assessment of that ability.
e. The correlation between an individual's performance on a test and his/her
performance on future tasks that are represented in the test.
f. The correlation between performance on a longer test and performance on a
shorter representative version of it.
g. The correspondence between the appearance of the test content and the
ability it seeks to test.

Column B
1. Predictive validity
2. Face validity
3. Construct
4. Content validity
5. Concurrent validity
6. Construct validity
7. Criterion validity

1. ______  2. ______  3. ______  4. ______  5. ______  6. ______  7. ______

3.3 Reliability

We said that validity is an inherent quality of a test. The content of the
test is one aspect of testing; the other aspect relates to the conditions
that affect performance on a test.

Activity B

Given below are five test situations. Tick the items that you think will lead to
a true reflection of ability.
1. Candidates appearing for a Common Medical PG Entrance test were
surprised to find that the 200-mark paper that they had come prepared
for turned out to be a 100-mark paper. As soon as the mistake was
realised, the test papers were replaced by the 200-mark paper. This
resulted in a half-hour delay in starting the test.
2. Candidates appearing for a mathematical ability test were given four
different versions of a test covering the same content areas on four
successive days.
3. Candidates were assessed on the basis of one examination at the end of
a year-long course of instruction.
4. Candidates appeared for a test under strict police vigilance during a
local political disturbance.
5. Candidates had to submit a collection of writings that they had done
during the course for evaluation when they appeared for the final
writing test.

Discussion
Items 2 and 5 are true indicators of ability. These tests can be said to be
reliable.
In situation 2, the candidates are tested for the same ability again and
again. This evens out any differences arising from one's psychological
state on a particular occasion.

In situation 5, the candidates are tested not only by a formal test but also
by all the writing that they had done during the course. This offers a
variety of writing samples which can be judged.
Let us now examine the other three situations.
In situation 1, candidates will tend to be psychologically affected by the
administrative mistake and may not perform to the best of their ability.
In situation 3, one final examination may not be a reliable source of
information. A candidate may not be in the best of health; some other
candidate may have had a mental setback; a third candidate may not have
studied through the year but crammed for the examination. Several
samples of ability are required to judge a candidate.
Situation 4 is stressful. Candidates may experience a great deal of
anxiety while they are writing the exam. This may affect their
performance.

The activity would have helped you see that reliability is very important
for the interpretations we make about a candidate's ability based on test
scores. Assuming that the test is measuring a stable characteristic
(construct), different observations made using the same test should yield
consistent results.
To ensure reliability we need the scores on several tests. There is always an
element of chance each time a test is administered. At several stages in the
total process of constructing and using an ability test, the question of
error of measurement, in a statistical sense, can arise. At each stage a
decision has to be made which, even if sound in itself, is only one among
many possible decisions at that point. The occurrence of a particular
decision is governed by chance factors, and so is a source of error in the
sense of variability. That is, when the test is repeated, some other factor
might influence the item construction. This becomes an error of
measurement.

Let us understand this with examples. Consider a 50-item test of basic
grammar for Class X which involves multiple-choice and open-ended
questions. We call this Test A. Now let us see what can happen with other
tests on the same content and in the same format.
a) Test A contains a particular combination of testing points covering
word order, concord, tenses, etc.

Other test writers working independently to produce Tests B, C and D
would almost inevitably select a different mix to make 50 items. There
is no rule which indicates what the 50 testing points of a general
grammar test are. (Note that content and construct validity are not
affected at all.)
b) Given the points selected for inclusion in Test A, each item has to be
housed in suitable sentences to provide the semantic-syntactic context
for the appropriate rule.
Other test writers, given the same list of points, will certainly use other
sentence contexts.

c) Each item of Test A that is in the multiple-choice format has a number
of distractors.
Other test writers could use a wide range of other possible distractors
even if the correct option is identical in every case.

d) For each supply-type item, scoring guidelines are provided.
Other test writers may prepare different scoring guidelines. There is no
rule which says that when the passive construction is to be supplied,
errors in concord should (or should not) be penalised.

e) One examiner marking the supply items will exercise his/her
judgement in a particular way. No scoring key can be so exhaustive
that examiner judgement is totally eliminated.
The same examiner encountering the same script earlier or later (in the
pile of scripts), or at a different time (with regard to concentration,
mood, etc.), may exercise his/her judgement rather differently. Other
examiners will vary in the way they exercise their judgement in
assessing open-ended items.

f) A student encounters the items in a particular order. Encountering
items easy for him/her early in the test and so being encouraged, or
encountering difficult items and being discouraged, can influence
his/her overall performance.
The particular order is a matter of chance. A different order might have
led to a different pattern of encouragement/discouragement and hence a
different overall performance.


The list above has indicated a number of points in the testing procedure at
which a particular decision is taken or outcome obtained. Thus the score
secured by a candidate in a test is affected by a unique combination of
chance factors. A different combination, again, purely the result of chance,
might have resulted in a different score.
Although these examples relate to test content, it is not the validity of the
items themselves that is being questioned. Each test may be valid by
itself. What is at issue here are the variable factors that may influence test
performance.

3.3.1 Reliability of test scores

The very nature of an ability test thus tends to make test scores
undependable or unreliable. How different would a new score (on
another occasion, or with another set of factors of the type listed above) be
from the one actually obtained? The issue of reliability lies in this question
and the answer to it. If Test A appears to be (or is) very vulnerable to the
operation of these chance factors, or errors of measurement, its reliability
is low.
However, the reliability of a test can be enhanced if we take care of the
following:

careful specification and planning, to make independently constructed
tests A, B, C and D very similar
instructions in the test booklet, and instructions and arrangements
during test administration, to minimise the influence of the varying
psychological states of the examinees
clear and adequate scoring guidelines, to minimise inter-rater
variation

3.3.2 Scorer reliability

Scorer reliability is another important factor. The same answer script
assessed by two different raters should get more or less the same score.
This is fairly straightforward in objective-type tests, where the answer key
can be provided to the examiners.
Language tests, however, cannot be entirely objective if they are to be
truly valid. We need subjective-type items. We mentioned earlier that the
difference between objective and subjective testing is not so much one of
format as one of assessment.


To ensure scorer reliability, detailed criteria or guidelines for marking long
answers need to be given to the scorers. Scorers have to be trained in
using the guidelines. If the scores of two scorers are very divergent, a third
scorer will have to examine the script.
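
As a rough illustration, the divergence check described above could be
operationalised as follows; the marks and the tolerance are invented for
the example:

# Hypothetical marks (out of 10) given by two raters to the same six scripts.
rater_1 = [7, 5, 8, 3, 9, 6]
rater_2 = [6, 5, 9, 6, 8, 6]

TOLERANCE = 2   # assumed maximum acceptable difference between raters

for script_no, (a, b) in enumerate(zip(rater_1, rater_2), start=1):
    if abs(a - b) > TOLERANCE:
        # Divergent marks: the script goes to a third scorer.
        print(f"Script {script_no}: raters gave {a} and {b} - third rater needed")
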
3.3.3 Measures of reliability

The reliability of a test can be estimated empirically. It is defined as the
agreement between two sets of scores obtained from the same sample of
tested examinees, where each set of scores represents one of the possible
occasions or versions of the test.
Thus we can look at the agreement between
a. test and re-test scores for the same test
b. scores on two parallel forms of the test based on the same test design
c. scores on split halves of the same test
d. scores of two scorers of the same set of answer scripts.
In each case an index of the agreement between scores, called the
reliability coefficient, is obtained.
We will discuss this statistical measure in Unit 3, Block IV.
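To give a feel for option (c), here is a minimal sketch of a split-half
estimate: the test is split into odd- and even-numbered items, the two
half-scores are correlated, and the half-test correlation is stepped up to
full-test length with the Spearman-Brown formula. The 0/1 item responses
are invented purely for illustration:

from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# One row per examinee, one column per item (1 = correct, 0 = wrong).
item_responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
]

odd_half  = [sum(row[0::2]) for row in item_responses]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in item_responses]   # items 2, 4, 6, 8

r_half = pearson(odd_half, even_half)
reliability = 2 * r_half / (1 + r_half)   # Spearman-Brown correction
print(f"Split-half reliability: {reliability:.2f}")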

3.3.4 How to make tests reliable

Having considered the importance of reliability in testing, let us now look
at how we can ensure reliability in our tests. Given below are useful tips
adapted from Arthur Hughes' Testing for Language Teachers (1989: 44-49).
1. Adequate samples of behavior. There should be a large number of test
items so that we can be sure that the candidate really has the ability. The
items should be independent of each other. For example, if two items in a
reading test test the same point, then the second item is not really an
additional item: if the candidate gets the first one wrong, he/she is sure to
get the second one wrong too. The greater the number of items, the
higher the reliability. Yet the test should not be so long that it makes
the candidate bored and tired.
2. Discrimination between good and weak students. The items should
clearly discriminate between good students and weak students. A test
that does not discriminate well between the two lacks reliability. Very
easy items or very difficult items do not discriminate well. A good test
will begin with a few easy non-discriminatory items and then move on
to more challenging and discriminating items. We will deal with
Facility Value and Discrimination Index in Unit 2, Block IV; a rough
illustration of how the two indices are computed follows this list.

3. Restricted choice of questions. Restrict the focus of the topics of essays.
This will give us more reliable information than if a wide choice is
given. However, this should not restrict the candidates' performance so
much that it affects the validity of the test. It also allows for more
reliable comparison between candidates' performances: if students
choose different kinds of topics, judging them by a common yardstick
becomes difficult.
4. Clear and unambiguous items. If different candidates interpret an
item in different ways, then the item is not reliable. One way of
ensuring clarity is by pre-testing the items or asking other teachers to
examine them.
5. Clear and explicit instructions. Candidates should not be misled by
confusing instructions. Reliability is affected if a good student fails a
test because of not understanding what was expected. The test writer
should not assume that the candidate will work out the intended
expectation on his/her own.
6. Familiar formats and question-paper patterns. This was discussed under
face validity. Candidates should not be taken by surprise by a changed
question-paper pattern, even though the content and constructs may be
valid.

7. Objective assessment. Clear guidelines for scoring need to be given if
the questions are open-ended. Right responses should be decided upon
and the answer key provided to the scorers.
8. Multiple, independent scoring. Answer scripts should be scored by
trained independent scorers to achieve reliability of scoring.
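Here is the illustration promised under tip 2: a minimal sketch of the
facility value (the proportion of examinees answering an item correctly)
and of a simple upper-lower discrimination index. The 0/1 responses and
the top/bottom-third grouping are assumptions made for the example:

# One row per examinee, one column per item (1 = correct), with the rows
# already sorted from the strongest examinee to the weakest by total score.
item_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

n = len(item_responses)
third = n // 3                                    # size of upper/lower groups
upper, lower = item_responses[:third], item_responses[-third:]

for i in range(len(item_responses[0])):
    facility = sum(row[i] for row in item_responses) / n
    discrimination = (sum(row[i] for row in upper)
                      - sum(row[i] for row in lower)) / third
    print(f"Item {i + 1}: facility value = {facility:.2f}, "
          f"discrimination index = {discrimination:+.2f}")

# An item that everyone gets right (or wrong) has a discrimination index
# near 0: it tells us nothing about who the stronger students are.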

3.3.5 Relationship between validity and reliability

Though validity and reliability are based on different questions about the
nature of a test, there is a definite relationship between them. Reliability is
a prerequisite for validity: a test cannot be valid if it is not reliable.
Reliability, however, does not in itself guarantee validity. To put it
formally, reliability is a necessary but not sufficient condition for validity.
We noted earlier that low reliability arises from the vulnerability of tests to
various errors of measurement. Thus the score that a person obtains on an
unreliable test is not his/her true score but a measure representing ability
influenced by other factors. This, in turn, is one of the factors which
reduce validity. Validity requires that the test measure what it claims to
measure and not other things. Hence the need for high reliability before
validity becomes possible.
The essence of validity lies in soundness and appropriateness. A test may
be very reliable, that is, the scores are stable, but it might still be
measuring the wrong thing. The answer to a vocabulary item, for example,
may be arrived at through guessing.
In this section we looked at what we mean by reliability and its
relationship with validity. The two together form the basis of a good test.

Review question II
Tick the items that make a test reliable.
1. Limited number of test items.

2. Open-ended questions.

3. Good discrimination between strong and weak candidates.

4. Adequate samples of behavior.

5. Overlaps in a series of questions on the same area.

6. Multiple, independent scorers.

7. Convergence in the scores of multiple scorers.

8. Ambiguous items.

9. Clear, explicit instructions.

10. Expected test paper formats.

3.4 Authenticity

Validity and reliability are traditionally accepted principles in testing and
have been applied to system-based testing. With the shift in approach to
communicative language testing, authenticity is an additional principle that
is seen to be important for validity. Performance on language tests should
indicate how well the candidate can perform on real language-use tasks.
These are called Target Language Use (TLU) tasks. If you are taking a
course in editing and publishing in order to take up an editorial job in a
publishing firm, the test should include tasks that you will be expected to
do in that field. If the test is a general language proficiency test, it might
be valid, but it may not be authentic, because editing for publishing
requires a specialised orientation.
You may think that this is similar to predictive validity. The difference is
this: predictive validity is an indirect way of judging suitability for specific
tasks, while the content of an authentic test is more direct in that it uses
tasks similar to those that will be encountered in future fields of profession,
occupation or higher education. Bachman and Palmer define authenticity
as the degree of correspondence of the characteristics of a given language
test task to the features of a TLU task (1996: 23). The following diagram
represents this relationship.

Characteristics of the TLU task  <--- Authenticity --->  Characteristics of the test task

Let us now look at the reasons why authenticity in test tasks is important.
One of the purposes of tests is that we can make generalisations about
performance on future tasks based on our interpretations of the test scores.
That is, we want to match test performance with non-test performance in
the Target Language Use domains. Authenticity is thus related to construct
validity, which addresses the question: does the test give accurate
information on the ability that it claims to test?
Try this activity to see what we mean.

Activity C

Given below are two test tasks. State which task is more authentic and why.
1. Write a story with the help of the outline given below.
2. Write a notice to be put up on your school notice-board announcing a
poetry-reading competition to be held in your school, with the help of the
information given.
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
Discussion
Task 2 is more authentic than Task 1.
What are the reasons?
Wherever we work or study, at some point or the other we may have to
write notices for various purposes. Notices belong to the informational
genre of writing. Hence, if we give a test task involving notice-writing, it
indicates the candidate's ability to write notices in real life.
Story-writing is a useful language-developing activity in the school setting,
but beyond that, not many of us will become story-writers or novelists.
Thus, though story-writing as a language-development activity is valid in
itself, it is not something in which everyone engages in real life, and hence,
as a test task, it is less authentic than notice-writing.

The second reason is that students perceive authentic tasks as more
relevant and useful for the kinds of tasks they will have to do later in the
TLU domain, and hence are better motivated to do them than textbook-type
tasks. This motivation fosters a positive affective attitude towards the test,
and students will be encouraged to give their best to the test task. This
relates both to the topic of the task and to the task type.
An example of a test task from an English course for the automobile
industry follows:

Read the text about car production and complete the flow chart below.
BUILT TO ORDER
As soon as a car is ordered and a delivery date agreed, weekly and daily
production schedules are created and sent to outside suppliers and the company's
own pre-assembly stations. This is to make sure that all the necessary
components arrive on time.
First of all, a small data carrier contains all the customer's specifications and
communicates wirelessly with control units along the production line. In the body
shop the floor pan, wheel arches, side panels, and roof are welded together by
robots to make the frame of the car. The add-on parts (the doors, boot lid, and
bonnet) are then mounted to make the body-in-white.
The finished body shell then goes into the paint shop, where the data carrier
determines the colour. In final assembly, the interior and exterior parts (for
example, the front and rear bumpers, headlights, windscreen and other windows)
are fitted. After quality control and a final check, the finished car can be released.
It is now ready for delivery to its new owner.
From English for the Automobile Industry, Marie Cavanagh, OUP, 2007, p. 14

[Flow chart task: a sequence of boxes for students to complete, beginning
with 'Car ordered, delivery date agreed' and ending with 'Finished car
released'.]

Both the topic and the task type will be of interest to automobile
engineering students. The focus is not on language forms, but they have to
read the text, gather information from it, and fill in the flow chart
according to what they read. All this involves processing language and
transferring information to a non-text format. The students may have to
read such texts during their course of study as well as in their occupations
later on. They will thus perceive its relevance and usefulness and be
motivated to do the task well.
Compare this with the following test task for the same group of students.
Read the following poem and underline the figures of speech used in it. Then
explain each of these.

This task is literary in nature, and not all engineering students will see its
usefulness or relevance to their needs for English in their work domain.
It must, however, be said that the objectives of a particular language course
must be made clear to the learners so that they understand the rationale for
particular tasks that are done during classroom learning as well as the test
tasks. This will ensure that the perceptions of course developers, teachers,
paper-setters, examiners and the candidates converge towards a common
goal.
I am saying this because many teachers and students come with the
conditioned view that a language course is meant to give exposure through
literary and semi-literary texts and that this would invisibly lead to
language development. While this may happen to a certain extent, it may
not lead to equipping the learners with the kind of skills that they need to
employ for authentic language tasks.
Another factor to be remembered is that complete congruence of test tasks
and TLU domain tasks is difficult to achieve. What we can aim at is a
close approximation to them.
Closely related to authenticity is interactiveness of test tasks.

3.5 Interactiveness

Bachman and Palmer define interactiveness as the extent and type of
involvement of the test taker's individual characteristics in accomplishing a
test task (1996: 25).
The three elements involved in individual characteristics are the test
taker's:

language ability (language knowledge and strategic competence, or
metacognitive strategies)
topical knowledge and
affective schemata

Interactiveness of a test task is the extent to which these three elements are
engaged while doing the test task.
For example, a student of sociology will find a language test task which
has something related to sociology as a test input more engaging than a test
input from the domain of natural science.
The following figure from Bachman and Palmer illustrates the relationship
between the three elements and the characteristics of a language test task:

[Figure: language ability (language knowledge and metacognitive
strategies), topical knowledge and affective schemata all bear on the
characteristics of the language test task.]

Authenticity pertains to the correspondence between test tasks and TLU
tasks. Interactiveness pertains to the relationship between the individual's
characteristics and the test task. Interactiveness is a quality of any task.
To bring out this distinction, Bachman and Palmer give four examples of
test tasks.

The first example (A) is a test for the selection of typists. The test task
involves the ability either to listen to and type out a spoken text or to copy
out a written text. A typist may or may not really understand what he/she is
typing, but as a test of typing ability the task relates to the future
work-domain ability. Such a task is highly authentic because the test task
relates to the TLU task (what the typist has to do at work), but it is not very
interactive because the typist does not really process the text or interact
with it. It is mechanical copying on the typewriter. (You might be familiar
with this example from the typists of your own institution.)
The second example (B) is a test for the same selection of typists. The
candidates are asked to make small talk about food, the weather, clothing,
etc. in an interview format. They are also allowed to choose a topic that
interests them. This task involves a lot of interaction but has no connection
with a typist's job. Hence this task is more interactive but less authentic.
The third example (C) is a vocabulary task for international students
entering an American university. The test involves matching a list of
words in one column with their meanings in a second column. This test is
expected to give diagnostic information on the students' ability to read
academic texts. But the test task has no relation to the TLU domain,
because the candidates will never be asked to match words with their
meanings in their studies at the university. It is not very authentic. The
task also does not demand much language knowledge or use of
metacognitive strategies. Hence it is not very interactive either.
The last example (D) is a test task for prospective salespersons involving a
role-play between the salesperson and a customer. The examiner takes on
the role of a customer and the candidate has to sell his/her product. This
task is authentic because it relates to the TLU domain tasks. It is also
highly interactive because it involves all areas of language knowledge,
strategic competence and topical knowledge. This task is thus both highly
authentic and highly interactive.
Both interactiveness and authenticity are relative concepts. We say that a
task is high or low on these qualities, not that it is authentic/inauthentic or
interactive/non-interactive. The characteristics are not inherent in the tasks
but depend upon the individuals taking the test and the intended purposes
of the test.
Please note that while interactiveness is more prominent in oral tasks,
reading and writing tasks also involve interactiveness. The individual's
engagement with the task is what determines interactiveness.
This discussion will help you especially when you are studying the unit on
testing speaking skills.

3.6 Practicality

As you were reading about these principles, you must have been
wondering whether all this is really practicable and if tests can always be
made according to these principles. This raises the issue of practicality or
usability.
A concern for all the principles mentioned above may lead to the devising
of tests that are not very cost- and time-effective. We are concerned with
educational testing, that is, procedures that you as a language teacher can
use and even develop yourself. Moreover, the test should not be so strange
that students and even the other stakeholders (parents, employers) do not
accept it. It is also important that the use of a new test has, on balance,
a positive effect on the teaching-learning process. This will be discussed in
the next section under washback.
A test may be very valid, reliable, authentic and interactive, and yet it may
require a disproportionate amount of teachers' (and students') time and
hence not be practicable. Two important considerations are: a) the
resources that will be required to develop an operational test that will have
a balance of all the qualities mentioned above, and b) the allocation and
management of the time that is available.
Bachman and Palmer define practicality as the relationship between the
resources that will be required in the design, development and use of the
test and the resources that will be available for these activities. If the
available resources are equal to or greater than the required resources, the
test is practical; if the available resources are less than the required
resources, the test is not practical.
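Bachman and Palmer express this, in effect, as a simple ratio:

    practicality = available resources / required resources

A value of 1 or more indicates a practical test; a value below 1 signals that
either the test design must be pared down or more resources must be found.
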
This is why we need Continuous Comprehensive Evaluation. What is not
practicable in limited-time, large-scale testing can be done in an
open-ended manner during the course of instruction. This allows for a
variety of task types. Authenticity can be better ensured with greater
allocations of resources and time. Interactiveness, especially in speaking
activities, can be better tested through items that involve peer interaction
rather than communication with the examiner or a recording machine.
Project-based learning is an accepted educational practice. It involves
out-of-classroom field work: surveys, data collection of various kinds,
creative tasks, etc. These can be valid and reliable measures of a
candidate's ability which cannot, however, be tested in a timed
examination.
In this section we tried to put the principles of testing in perspective, so
that you do not get the feeling that the principles are paramount and that
we have to subject all our tests to them rigidly. Practicality, we saw, is the
final consideration.

Review question III

Complete the following statements:
1. The relationship between individual characteristics and the characteristics
of the test task is called _______________.
2. A number of underlying abilities can be tested through a wider range of
tasks not practicable in a timed test through _______________.
3. The relationship between required resources and available resources is
called _______________.
4. The relationship between test tasks and Target Language Use domain
tasks is called _______________.
5. The three elements involved in individual characteristics are
_______________.

3.7 Impact

Until this point we have discussed the principles of validity, reliability,
authenticity, interactiveness and practicality. All these principles are
directly related to the test content and how the test is administered.
Another important factor is the test's impact on all the stakeholders.
Do the activity below.
Activity D

Consider the Class XII examination conducted either by the Central or the
State Boards. List all the stakeholders in this examination.
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________
____________________________________________________________

Discussion
Match your list with the following:
1. Students
2. Teachers
3. The school system
4. The examining boards
5. The higher education admitting authorities
6. Employers
The school-leaving examination is a landmark in the career path of
students. Their levels of achievement serve as an indicator for successive
batches of students with regard to the investment of effort they will have to
make. This examination decides their future, their choice of career and
institutions of higher education. Those who cannot continue their education
will have this certificate for applying for various jobs available at that
level.
Teachers have a very high stake in examination results as they are
perceived to be primarily responsible for the preparation of the students for
the examination.
School administrations take these results as a yardstick of the performance
of their schools. If the results are good with students getting national or
state level ranks, it serves as publicity for their schools. If the results are
not satisfactory, they take remedial measures to ensure that their
institutions do not fall behind the others in standards.

The exam results help examining boards in reviewing administrative
procedures as well as test-paper effectiveness. Examination reforms are
undertaken on the basis of the exam results. If a change in the pattern of the
question paper is found necessary, they take steps to ensure that this is
notified to all the people concerned.
The Class XII examination results are held as criteria for admission in
various institutions of tertiary education, both professional and general.
Even where entrance examinations are conducted, the marks at this
examination are also considered.
For many jobs which do not require higher qualifications, Class XII
examination scores are a criterion for selection for appointment. Even
when higher qualifications are obtained the school-leaving certificate
marks are also considered as an indicator of the academic standing of
individuals.

We thus see that a test is a matter of interest and concern to various
stakeholders. We have examples of the Common Admission Test (CAT) for
admission into management programmes in the country; common medical
and engineering entrance tests for admission into professional colleges; the
Scholastic Aptitude Test (SAT) for admission to universities in the USA; and
other such examinations at state, national and international levels. The
scores obtained on these tests tell colleges how well a candidate knows the
subject and how well he/she can apply it. These tests are standardized and
have high predictive validity.

Washback
That apart, one of the most important points about testing is the effect it
has on teaching. Based on test results, teaching content, techniques and
methods change. Objectives can be reformulated on the basis of test
results. Students also change their learning strategies based on their test
performance. Tests can thus have a beneficial washback effect on the whole
instructional process. Hence it is very important to ensure that tests are
valid and reliable.
Tests should be based directly on the objectives of the course to ensure that
the course objectives have been fulfilled. If the test results are positive, it
indicates that the objectives are realistic and achievable. If the test results
are not encouraging, then the course of instruction including the materials
used should be reviewed. In an extreme case it might be that the objectives
are unrealistic. Thus we see that the test is the final determinant of the
course of instruction.


3.8 Summary
In this unit we examined the various characteristics of a good test. We
discussed validity and its three types (construct, content and criterion) and
their sub-components. We saw how reliability is an important concomitant
of validity. We then moved on to the principles that are very relevant to
communicative language testing, namely authenticity and interactiveness.
Finally, we said that all these principles are subject to the condition of
their being usable and practicable. While validity and reliability are
absolutely essential criteria, the principles of authenticity and
interactiveness are relative to the intended purposes and the resources
available.
The unit ended with a section on the impact that tests have on various
stakeholders in the educational system.
In the next unit we will discuss test formats and their appropriateness for
testing various levels of linguistic and cognitive abilities.

3.9 Sources
Bachman, L. F. and Palmer, A. S. 1996. Language Testing in Practice.
Oxford: Oxford University Press.
Hughes, Arthur. 1989. Testing for Language Teachers. Cambridge:
Cambridge University Press.

3.10 Answers to review questions

Review question I
1 e; 2 g; 3 c; 4 b; 5 f; 6 a; 7 d

Review question II
Reliable: 3, 4, 6, 7, 9, 10. Not reliable: 1, 2, 5, 8.

Review question III
1. Interactiveness
2. Continuous Comprehensive Evaluation
3. Practicality
4. Authenticity
5. Language ability, topical knowledge and affective schemata
