CHAPTER I

INTRODUCTION

A. Background of the Problem

There are as many different tests of foreign language skills as there are
reasons for testing them. However, one thing that holds true for any test is that
there is no such thing as perfection. Human fallibility plays a part in this, but
it is also a result of the need to bear in mind certain principles when constructing a
test, principles which contain certain contradictions, making it
impossible to design The Ultimate Test. The aim here is to set out the principles
used when constructing language tests, to suggest examples of how they can be
applied, and to point out the areas of conflict that make test design so tricky. It
should also be noted that while this discussion considers these principles only in
relation to language tests, they could well be applied to tests in many other
subjects. The difference tends to be that every field has its own way of referring
to and grouping the issues that will be discussed here.

Principles of language assessment can and should be applied to formal
tests, but with the recognition that these principles also apply to
assessments of all kinds. These principles will be used to evaluate an existing,
previously published, or newly created test.

According to Douglas Brown in his book Language Assessment,
to know whether a test is effective we have to ask questions such as: Can it be
given within appropriate administrative constraints? Is it dependable? Does it
accurately measure what you want it to measure?

These and other questions help to identify five cardinal criteria for “testing
a test”: practicality, reliability, validity, authenticity, and washback. This paper
discusses three of these principles of language assessment, namely practicality,
reliability, and validity, since there is not enough time to discuss all of them
adequately.

B. Formulation of the Problem

The problems to be discussed in this study can be formulated as follows:

1. What is Practicality?
2. What is Validity?
3. What is Reliability?
C. Purposes of the Study

The purpose of this study is to provide knowledge, specifically knowledge
about the use of language testing principles. Following the problems formulated
above, the purposes are:

1. To know about Practicality.
2. To know about Validity.
3. To know about Reliability.

CHAPTER II
CONCEPTUAL AND THEORETICAL FRAMEWORK

A. Practicality

An effective test is practical. This means the test:1

 Is not excessively expensive,
 Stays within appropriate time constraints,
 Is relatively easy to administer, and
 Has a scoring/evaluation procedure that is specific and time-
efficient.
A test that is extremely expensive is impractical. A test of language skills
that takes a student five hours to complete is impractical: it consumes more time
(and money) than necessary to achieve its objective. A test that requires
individual one-on-one proctoring is impractical for a group of several hundred
test-takers and only a handful of examiners. A test that takes a few minutes for a
student to take and several hours for an examiner to evaluate is impractical for
most classroom settings, and a test that can be scored only by computer is
impractical if it takes place a thousand miles away from the nearest computer.
The value and quality of a test sometimes depend on such mundane, practical
considerations.

Practicality, sometimes called manageability, refers to the need to ensure that
the assessment requirements are appropriate to the intended learning outcomes of
a program, that in their operation they do not distort the learning/training
process, and that they do not make unreasonable demands on the time and
resources available to learner, teacher/trainer, and/or assessor.

In some circumstances it would be possible, in theory at least, to improve
reliability and validity by doubling the number of assessments and the time
available for assessment within a program. However, the costs, in terms of time,
resources, impact on the quality of the learning process, and the motivation of the
learners, are likely to make such an approach counterproductive and unmanageable.
It would fail to meet the necessary practicality requirements.
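
The diminishing return from simply lengthening assessment, noted above, is not
quantified in the sources cited here, but classical test theory offers the
standard Spearman-Brown prophecy formula as a rough guide:

    r_new = (k × r) / (1 + (k − 1) × r)

where r is the reliability of the original test and k is the factor by which its
length is changed. Doubling (k = 2) a test with reliability 0.70 would be expected
to raise reliability only to (2 × 0.70) / (1 + 0.70) ≈ 0.82, which illustrates why
doubling assessment time buys a modest gain at considerable practical cost.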

1. H. Douglas Brown, Language Assessment: Principles and Classroom Practices (San Francisco: Longman, 2004), p. 19.

B. Reliability
1. Definition of reliability

A reliable test is consistent and dependable. If you give the same test to
the same student or matched students on two different occasions, the test
should yield similar results. The issue of the reliability of a test may be
addressed by considering a number of factors that may contribute to its
unreliability. Consider the following possibilities: fluctuations in the student,
in scoring, in the test administration, and in the test itself.2

Reliability concerns the accuracy and precision of a measuring instrument in
a measurement procedure. The reliability coefficient indicates the stability of
the scores obtained by individuals, reflecting how reproducible those scores are.
A score is called stable if the score obtained at one time and the scores obtained
at other times are relatively the same. In other words, reliability in the sense
of stability means that the subjects of the measurement will rank relatively
similarly on separate testings with equivalent test kits (Singh, 1986; Thorndike,
1991).3
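
As an illustration (an addition of ours, not taken from the sources above), the
stability described here is commonly estimated as the correlation between scores
from two administrations of the same test. A minimal Python sketch, assuming
hypothetical scores for the same five students on two occasions:

    # Minimal sketch: test-retest reliability as the Pearson correlation
    # between two administrations of the same test (hypothetical scores).
    from statistics import mean

    def pearson(xs, ys):
        # Pearson correlation coefficient between two equal-length score lists.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    # Hypothetical scores of five students on two occasions.
    first_administration = [78, 85, 62, 90, 71]
    second_administration = [80, 83, 65, 88, 70]

    print(pearson(first_administration, second_administration))

A coefficient close to 1.0 indicates the stability Singh and Thorndike describe;
a low coefficient signals unreliability.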

In terms of language, reliability derives from the words rely and ability.
When combined, the two words converge on an understanding of the ability of a
measuring instrument to be trusted and to serve as a basis for decision-making.
According to Anastasi and Urbina (1997), in this context the reliability of a test
points to the extent to which individual differences in test scores can be deemed
to be caused by actual differences in the characteristic being measured, and how
far they can be attributed to chance error. In line with that opinion, Suryabrata
(2000) states that, in the broadest sense, the reliability of a measuring
instrument refers to the extent to which differences in the scores obtained
reflect actual differences in the attribute measured.

2. Brown, Language Assessment ..., pp. 20-21.
3. Jurnal Psikologi Universitas Diponegoro, Vol. 3, No. 1, June 2006.

2. Rater reliability

Inter-rater unreliability occurs when two or more scorers yield inconsistent
scores on the same test, possibly because of lack of attention to scoring
criteria, inexperience, inattention, or even preconceived biases. In the story
above about the placement test, the initial scoring plan was found to be
unreliable; that is, the two scorers were not applying the same standards.4
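
As a concrete illustration (our addition, not from Brown), agreement between two
raters on categorical scores is often summarized with Cohen's kappa, which
discounts the agreement expected by chance. A minimal Python sketch with
hypothetical pass/fail ratings:

    # Minimal sketch (hypothetical data): Cohen's kappa as a measure of
    # inter-rater agreement on categorical ratings such as pass/fail.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        # Agreement expected by chance, from each rater's marginal proportions.
        expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    a = ["pass", "pass", "fail", "pass", "fail", "pass"]
    b = ["pass", "fail", "fail", "pass", "fail", "pass"]
    print(cohens_kappa(a, b))  # 1.0 = perfect agreement; near 0 = chance level

Here the two hypothetical raters disagree on one script out of six, giving a kappa
of about 0.67; two raters who were not applying the same standards would score
much lower.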

3. Test Administration Reliability

Unreliability may also result from the conditions in which the test is
administered. I once witnessed the administration of a test of aural
comprehension in which a tape recorder played items for comprehension, but
because of street noise outside the building, students sitting next to windows
could not hear the tape accurately. This was a clear case of unreliability caused
by the conditions of the test administration. Other sources of unreliability are
found in photocopying variations, the amount of light in different parts of the
room, variations in temperature, and even the condition of the desks and chairs.5

4. Test Reliability

Sometimes the nature of the test itself can cause measurement errors. If a
test is too long, test-takers may become fatigued by the time they reach the later
items and hastily respond incorrectly. Timed tests may discriminate against
students who do not perform well on a test with a time limit. We all know people
(and you may be included in this category) who “know” the course material
perfectly but who are adversely affected by the presence of a clock ticking away.
Poorly written test items (that are ambiguous or that have more than one correct
answer) may be a further source of test unreliability.6

5. Ways to make tests more reliable

As we have seen, there are two components of test reliability: the
performance of candidates from occasion to occasion, and the reliability of the
scoring. We will begin by suggesting ways of achieving consistent performances
from candidates and then turn our attention to scorer reliability.

4. Brown, Language Assessment ..., p. 21.
5. Brown, Language Assessment ..., p. 21.
6. Brown, Language Assessment ..., p. 22.

a. Take enough samples of behavior.
Other things being equal, the more items you have on a test, the more
reliable that test will be. This seems intuitively right. If we wanted to know how
good an archer someone was, we wouldn't rely on the evidence of a single shot at
the target. That one shot could be quite unrepresentative of their ability. To be
satisfied that we had a really reliable measure of the ability, we would want to
see a large number of shots at the target.
b. Exclude items which do not discriminate well between weaker and stronger
students.
Items on which strong students and weak students perform with similar
degrees of success contribute little to the reliability of a test (a sketch of one
common discrimination index follows this list).
c. Do not allow candidates too much freedom.
In some kinds of language test there is a tendency to offer candidates a
choice of questions and then to allow them a great deal of freedom in the way that
they answer the ones they have chosen.
d. Write unambiguous items
It is essential that candidates should not be presented with items whose
meaning is not clear or to which there is an acceptable answer which the test
writer has not anticipated.
e. Provide clear and explicit instructions
This applies both to written and oral instructions. If it is possible for
candidates to misinterpret what they are asked to do, then on some occasions
some of them certainly will.
f. Ensure that tests are well laid out and perfectly legible.
Too often, institutional tests are badly typed (or handwritten), have too
much text in too small a space, and are poorly reproduced. As a result, students
are faced with additional tasks which are not meant to measure their language
ability. Their variable performance on these unwanted tasks will lower the
reliability of the test.
7

g. Make candidates familiar with format and testing techniques.
If any aspect of a test is unfamiliar to candidates, they are likely to
perform less well than they would otherwise (on subsequently taking a parallel
version, for example). For this reason, effort must be made to ensure that all
candidates have the opportunity to learn just what will be required of them.
h. Provide uniform and non-distracting conditions of administration.
The greater the differences between one administration of a test and another,
the greater the differences one can expect between a candidate's performance on
the two occasions. Great care should be taken to ensure uniformity.
i. Use items that permit scoring which is as objective as possible.
This may appear to be a recommendation to use multiple choice items, which
permit completely objective scoring. This is not intended.
j. Make comparisons between candidates as direct as possible.
This reinforces the suggestion already made that candidates should not be
given a choice of items and that they should be limited in the way that they are
allowed to respond.
k. Provide a detailed scoring key.
This should specify acceptable answers and assign points for partially
correct responses. For high scorer reliability the key should be as detailed as
possible in its assignment of points.
l. Train scorers.
This is especially important where scoring is most subjective.
m. Identify candidates by number, not name.
Scorers inevitably have expectations of candidates that they know. Except in
purely objective testing, this will affect the way that they score. Studies have
shown that even where the candidates are unknown to the scorers, the name on a
script will make a significant difference to the scores given.
n. Employ multiple, independent scoring.
As a general rule, and certainly where testing is subjective, all scripts
should be scored by at least two independent scorers. Neither scorer should know
how the other has scored a test paper.
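
To make point (b) above concrete (our illustration, not from Hughes), a simple
classical discrimination index compares how the strongest and weakest candidates
fare on each item. A minimal Python sketch with hypothetical responses:

    # Minimal sketch (hypothetical data): upper-lower discrimination index
    # D = p_upper - p_lower, where p is the proportion answering the item
    # correctly in the top and bottom thirds ranked by total test score.

    def discrimination_index(item_correct, total_scores, fraction=1/3):
        # item_correct: 1/0 per candidate; total_scores: totals per candidate.
        ranked = sorted(zip(total_scores, item_correct), reverse=True)
        k = max(1, int(len(ranked) * fraction))
        upper = [correct for _, correct in ranked[:k]]
        lower = [correct for _, correct in ranked[-k:]]
        return sum(upper) / k - sum(lower) / k

    # Hypothetical: nine candidates; 1 = answered this item correctly.
    item = [1, 1, 1, 1, 0, 1, 0, 0, 0]
    totals = [95, 90, 88, 80, 75, 70, 60, 55, 40]
    print(discrimination_index(item, totals))  # 1.0 here: a strong item

Items with an index near zero (or negative) are the ones point (b) says to
exclude, since strong and weak students succeed on them at similar rates.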
8

C. Validity

Validity is the accuracy aspect of measurement.7 Validity is described as the
degree to which a research study measures what it intends to measure. A valid
measuring instrument is capable not only of producing the right kind of data but
also of producing accurate data. There are several types of validity:

1. Content validity

The first form of evidence relates to the content of the test. A test is said
to have content validity if its content constitutes a representative sample of the
language skills, structures, etc., with which it is meant to be concerned.

The test would have content validity only if it included a proper sample of
the relevant structures. Just what the relevant structures are will depend, of
course, upon the purpose of the test.

What is the importance of content validity? First, the greater a test's
content validity, the more likely it is to be an accurate measure of what it is
supposed to measure. A test in which major areas identified in the specification
are under-represented (or not represented at all) is unlikely to be accurate.
Secondly, such a test is likely to have a harmful backwash effect: areas that are
not tested are likely to become areas ignored in teaching and learning. For this
reason, content validation should be carried out while the test is being
developed; it should not wait until the test is already being used.8
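
As a small illustration of carrying out content validation during development
(our addition, not from Hughes), one can tag each draft item with the
specification area it samples and list the areas left uncovered. A minimal Python
sketch with a hypothetical specification:

    # Minimal sketch (hypothetical data): checking a draft test's coverage
    # of its specification while the test is being developed.
    specification = {"present simple", "past simple", "modals", "conditionals"}
    test_items = [
        ("item 1", "present simple"),
        ("item 2", "past simple"),
        ("item 3", "past simple"),
    ]

    covered = {area for _, area in test_items}
    print("Missing areas:", specification - covered)
    # -> Missing areas: {'modals', 'conditionals'}  (order may vary)

A real specification would also weight areas by importance, but even this simple
check flags the under-representation the paragraph above warns about.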

2. Criterion-related validity

The second form of evidence of a test's construct validity relates to the
degree to which results on the test agree with those provided by some independent
and highly dependable assessment of the candidate's ability.

7. Arthur Hughes, Testing for Language Teachers, Second Edition (United Kingdom: Cambridge University Press, 2003), p. 26.
8. Hughes, Testing for ..., p. 27.

There are essentially two kinds of criterion-related validity: concurrent
validity and predictive validity.9 Concurrent validity is established when the
test and the criterion are administered at about the same time. To exemplify this
kind of validation in achievement testing, let us consider a situation where
course objectives call for an oral component as part of the final achievement
test. From the point of view of content validity, this will depend on how many of
the functions are tested in the component, and how representative they are of the
complete set of functions included in the objectives. It should be said that the
criterion for concurrent validation is not necessarily a proven, longer test. A
test may be validated against, for example, teacher assessments of their students,
provided that the assessments themselves can be relied on. This would be
appropriate where a test was developed that claimed to be measuring something
different from all existing tests.

The second kind of criterion-related validity is predictive validity. This
concerns the degree to which a test can predict candidates' future performance.10
An example would be how well a proficiency test could predict a student's ability
to cope with a graduate course at a British university. The criterion measure here
might be an assessment of the student's English as perceived by his or her
supervisor at the university, or it could be the outcome of the course (pass/fail,
etc.). Content validity, concurrent validity, and predictive validity all have a
part to play in the development of a test. For instance, in developing an English
placement test for language schools, Hughes et al. (1996) validated test content
against the content of three popular course books used by language schools in
Britain, compared students' performance on the test with their performance on the
existing placement tests of a number of language schools, and then examined the
success of the test in placing students in classes. Only when this process was
complete (and minor changes made on the basis of the results obtained) was the
test published.
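
Both kinds of criterion-related validity are typically reported as a validity
coefficient, i.e. a correlation between test scores and the criterion measure
(our illustration, not a procedure given by Hughes). Reusing the pearson helper
sketched in the reliability section, on hypothetical data:

    # Minimal sketch (hypothetical data): criterion-related validity as the
    # correlation between test scores and an independent criterion measure.
    # Reuses pearson() from the test-retest reliability sketch above.
    placement_test = [55, 72, 80, 64, 90, 47]     # scores on the new test
    teacher_ratings = [50, 70, 85, 60, 92, 45]    # independent criterion

    print(pearson(placement_test, teacher_ratings))

A high coefficient against a concurrent criterion (such as trusted teacher
assessments) supports concurrent validity; a high coefficient against a later
outcome (such as course pass/fail) supports predictive validity.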

9. Hughes, Testing for ..., p. 27.
10. Hughes, Testing for ..., p. 27.

3. Other forms of evidence for construct validity

The word “construct” refers to any underlying ability (or trait) that is
hypothesized in a theory of language ability.11 A construct is any theory,
hypothesis, or model that attempts to explain observed phenomena in our universe
of perceptions. Constructs may or may not be directly or empirically measurable;
their verification often requires inferential data.

Construct validity is a major issue in validating large-scale standardized
tests of proficiency. Because such tests must, for economic reasons, adhere to the
principle of practicality, and because they must sample a limited number of
domains of language, they may not be able to contain all the content of a
particular field or skill. The ability to guess the meaning of unknown words from
context, referred to above, would be an example.

4. Consequential Validity

Consequential validity encompasses all the consequences of a test,
including such considerations as its accuracy in measuring intended criteria, its
impact on the preparation of test-takers, its effect on the learner, and the
social consequences of a test's interpretation and use.12

5. Face Validity

A test is said to have face validity if it looks as if it measures what it is
supposed to measure.13 For example, a test that pretended to measure
pronunciation ability but did not require the test-taker to speak (and there have
been some) might be thought to lack face validity. This would be true even if the
test's construct and criterion-related validity could be demonstrated. Face
validity is not a scientific notion and is not seen as providing evidence for
construct validity, yet it can be very important. A test which does not have face
validity may not be accepted by candidates, teachers, education authorities, or
employers. It may simply not be used; and if it is used, the candidates' reaction
to it may mean that they do not perform on it in a way that truly reflects their
ability. Novel techniques, particularly those which provide indirect measures,
have to be introduced slowly, with care, and with convincing explanations.

11. Hughes, Testing for ..., p. 31.
12. Hughes, Testing for ..., p. 26.
13. Hughes, Testing for ..., p. 33.


6. How to Make Tests More Valid

In the development of a high-stakes test, which may significantly affect the
lives of those who take it, there is an obligation to carry out a full validation
exercise before the test becomes operational.

In the case of teacher-made tests, full validation is unlikely to be possible.
In these circumstances, I would recommend the following:

First, write explicit specifications for the test which take account of all that
is known about the constructs that are to be measured. Make sure that you include
a representative sample of the content of these in the test.

Second, whenever feasible, use direct testing. If for some reason it is
decided that indirect testing is necessary, reference should be made to the
research literature to confirm that measurement of the relevant underlying
constructs has been demonstrated using the testing techniques that are to be
employed (this may often result in disappointment, another reason for favouring
direct testing).

Third, make sure that the scoring of responses relates directly to what is
being tested.

Finally, do everything possible to make the test reliable. If a test is not
reliable, it cannot be valid.14

14. Hughes, Testing for ..., p. 34.

CHAPTER III
CONCLUSION

A. Conclusion

Practicality refers to the need to ensure that the assessment requirements are
appropriate to the intended learning outcomes of a program, that in their
operation they do not distort the learning/training process, and that they do not
make unreasonable demands on the time and resources available to learner,
teacher/trainer, and/or assessor.

A reliable test is consistent and dependable. If you give the same test to
the same student or matched students on two different occasions, the test should
yield similar results.

Reliability concerns the accuracy and precision of a measuring instrument in
a measurement procedure. The reliability coefficient indicates the stability of
the scores obtained by individuals, reflecting how reproducible those scores are.

Validity is the accuracy aspect of measurement. Validity is described as the
degree to which a research study measures what it intends to measure. A valid
measuring instrument is capable not only of producing the right kind of data but
also of producing accurate data.
