
TEST RELIABILITY

Definitions

• Reliability refers to the consistency of test scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under varying examining conditions (Anastasi and Urbina, 1996).
Definitions

• Reliability is the extent to which a score or measure is free from measurement error. Theoretically, reliability is the ratio of true score variance to observed score variance (Kaplan and Saccuzzo, 2011).
• Reliability refers to the consistency in measurement; the
extent to which measurements differ from occasion to
occasion as a function of measurement error (Cohen and
Swerdlik, 2009).
The Concept of Reliability

• Reliability underlies the computation of the ERROR OF MEASUREMENT of a single score.
• In psychological testing, error does not imply that a mistake has been made. Instead, it implies that there will always be some inaccuracy in our measurements.
The Concept of Reliability

• Test reliability indicates the extent to which individual differences in test scores are attributable to TRUE DIFFERENCES in the characteristic under consideration, and the extent to which they are attributable to CHANCE ERRORS.
The Concept of Reliability

• Measures of test reliability make it possible to estimate what proportion of the total variance of the test scores is ERROR VARIANCE.
• Error variance represents any condition that is irrelevant to the purpose of the test.
• It is reduced by controlling the test environment, instructions, time limit, rapport, etc.
Sources of Error Variance (Cohen and Swerdlik, 2009)

• Test Construction
• Item sampling or content sampling: the variation among items within a test, as well as the variation among items between tests.
• A test taker may score higher when the sampled items (those made part of the test taken) happen to be familiar; had other, unfamiliar items from the same domain been included instead, the score would have been lower.
Sources of Error Variance (Cohen and Swerdlik, 2009)

• Test Administration
• Test environment, e.g., room temperature, level of lighting, ventilation, changes in weather, broken pencil point, and noise
• Test-taker variables, e.g., emotional problems, physical discomfort, lack of sleep, illness, fatigue, drugs or medications taken, worry
• Examiner-related variables, e.g., physical appearance and demeanor, manner of speaking, emphasis on certain words (unknowingly providing clues), head nodding, and other nonverbal gestures
Sources of Error Variance (Cohen and Swerdlik, 2009)

• Test Scoring and Interpretation


• Hand-scoring versus machine scoring
• Objective versus subjective scoring
Sources of Error Variance (Cohen and Swerdlik, 2009)

• Despite optimum testing conditions, however, no test is a perfectly reliable instrument.
• Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient (r).
Basics of Test Score Theory (Kaplan and Saccuzzo, 2011)

• Each person has a TRUE SCORE that would be obtained if there were no ERRORS in measurement.
• However, because measuring instruments are imperfect, the score obtained for each person almost always differs from the person’s true ability or characteristic.
Basics of Test Score Theory (Kaplan and Saccuzzo, 2011)

• The difference between the TRUE SCORE and the OBSERVED SCORE results from MEASUREMENT ERROR.
• Symbolically, the following describes the concept:

X (observed score) = T (true score) + E (error)

• A major assumption in classical test theory is that errors in measurement are random.
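As a rough illustration (a sketch on made-up numbers, not from the cited texts), the snippet below simulates X = T + E with random error: averaging many observed scores recovers the true score, because random errors cancel out.

import random

random.seed(42)

TRUE_SCORE = 100   # the person's (unobservable) true score
ERROR_SD = 5       # standard deviation of the random measurement error

# Each observed score X = T + E, with E drawn at random around zero.
observed = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(1000)]

mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 2))   # close to 100: random errors cancel on average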
The Domain Sampling Model (Kaplan and Saccuzzo, 2011)

• This model considers the problems created by using a limited number of items to represent a larger and more complicated construct.
• For example, to evaluate one’s spelling ability, instead of using the entire number of words in the dictionary to comprise the items of the test, we decide to use a SAMPLE of words.
The Domain Sampling Model (Kaplan and Saccuzzo, 2011)

All words in the dictionary → Number of words spelled correctly → Percentage form → This is your TRUE SCORE

Sample of items → Number of words in the sample spelled correctly → Percentage form → This is your OBSERVED SCORE
The Domain Sampling Model (Kaplan and Saccuzzo, 2011)

• The task in reliability analysis is to estimate HOW MUCH ERROR we would make by using the score from the shorter test as an estimate of the test-taker’s true ability.
• As the sample gets larger, it represents the domain score more and more accurately. As a result, the greater the number of items, the higher the reliability.
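A small simulation can make this concrete (an illustrative sketch with invented abilities and made-up parameters; statistics.correlation requires Python 3.10+): scores from larger random item samples correlate more strongly with true ability.

import random
from statistics import correlation  # Pearson r; Python 3.10+

random.seed(0)

# Each of 200 hypothetical persons knows some proportion of the word domain.
abilities = [random.uniform(0.3, 0.9) for _ in range(200)]

def test_score(ability, n_items):
    # Proportion correct on a random sample of n_items words.
    return sum(random.random() < ability for _ in range(n_items)) / n_items

for n_items in (5, 50, 500):
    scores = [test_score(a, n_items) for a in abilities]
    print(n_items, round(correlation(scores, abilities), 3))
# The correlation with true ability rises as the item sample grows.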
The Domain Sampling Model (Kaplan and Saccuzzo, 2011)

• When tests are constructed, each item is a sample of the ability or behavior to be measured.
• When testing your spelling ability, for example, we could use 5 words, 100 words, or 5,000 words.
The Domain Sampling Model (Kaplan and Saccuzzo, 2011)

• Reliability can be estimated from the correlation of the observed test score with the true score.
• But because true scores are NOT available, they can only be estimated.
• Given that items are randomly drawn from a given domain, each test or group of items should yield an unbiased estimate of the true score.
• Because of sampling error, however, different random samples of items might give different estimates of the true score.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Repeating the identical test on a second occasion
• The reliability coefficient is simply the correlation between
the scores obtained by the same persons on the two
administrations of the test.
• The same test is administered at two different times
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• It is of value only if we are measuring characteristics that do not change over time (e.g., IQ).
• If an IQ test administered at two points in time produces different scores, we might conclude that the lack of correspondence is due to random measurement error.
• Usually, we do not assume that a person got smarter or less so in the time between tests.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Tests that measure some constantly changing characteristic are not appropriate for test-retest evaluation.
• Relatively easy to evaluate: just administer the same test on two well-specified occasions and then find the correlation between scores from the two administrations of the test, using the coefficient of correlation.
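For instance (hypothetical scores; statistics.correlation requires Python 3.10+):

from statistics import correlation  # Pearson r; Python 3.10+

# Hypothetical scores for the same five persons on two administrations.
time1 = [85, 92, 78, 88, 95]
time2 = [83, 94, 80, 85, 96]

print(round(correlation(time1, time2), 3))  # the test-retest reliability coefficient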
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Retest reliability shows the extent to which scores on a test
can be generalized over different occasions. The higher the
reliability, the less susceptible the scores are to random daily
changes in the condition of the test-takers or the test
environment.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• When retest reliability is reported in the test manual, the
interval over which it was measured should always be
specified.
• Retest correlations decrease progressively as this interval
lengthens.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
Two possible negative effects can arise when measuring test-retest reliability:
• Carryover Effect
• Occurs when the first testing session influences scores from the second session
• For example, test takers sometimes remember their answers from the first time they took the test
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Carryover Effect
• Carryover effects are of concern only when the changes over time are random. In cases where the changes are systematic, they do not harm the reliability.
• An example of a systematic carryover is when everyone’s score improves by exactly 5 points. In this case, no new variability occurs.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Carryover Effect
• Random carryover effects occur when the changes are not predictable from earlier scores, or when something affects some BUT NOT ALL test takers.
• If something affects all test takers equally, then the results are uniformly affected, and no net error occurs.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• Practice Effects
• Some skills improve with practice.
• When a test is given a second time, test takers score better because they have sharpened their skills by having taken the test the first time.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
The time interval between testing sessions must be selected and evaluated carefully. If the two administrations of the test are very close in time, there is a relatively great risk of carryover and practice effects.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 1. TEST-RETEST RELIABILITY (or Time Sampling)
• However, as the time interval between testing sessions INCREASES, many other factors can intervene to affect scores.
• A well-evaluated test will have many retest correlations associated with different time intervals between testing sessions.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 2. ALTERNATE FORM RELIABILITY (or Item Sampling)
• Also called “Equivalent Forms” or “Parallel Forms” Reliability
• An alternative to test-retest reliability, it makes use of alternate or parallel forms of the test.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan and Saccuzzo, 2011)
• 2. ALTERNATE FORM RELIABILITY (or Item Sampling)
• The same persons can thus be tested with one form on the first occasion and with another, equivalent form, on the second occasion.
• The correlation between the scores obtained on the two
forms represents the reliability coefficient of the test.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 2. ALTERNATE FORM RELIABILITY (or Item Sampling)
• Measures both temporal stability and consistency of responses to different item samples.
• Alternate form reliability must always be accompanied by a statement of the length of the interval between test administrations, as well as the relevant intervening experiences.
• Item sampling is IMPORTANT. Non-equivalence between the two forms of the test represents an error variance resulting from content sampling.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 2. ALTERNATE FORM RELIABILITY (or Item Sampling)
• In the development of alternate forms, care should be exercised to ensure that they are truly parallel.
• Same number of items
• Same type and content
• Equal range and level of difficulty
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 2. ALTERNATE FORM RELIABILITY (or Item Sampling)
• Limitations
• Can only reduce, but not totally eliminate, PRACTICE EFFECTS
• Sometimes, the two forms are administered to
the same group of people on the same day.
When both forms of the test are given on the
same day, the only sources of variation are
random error and the difference between the
forms of the test.
• This type of reliability testing can be quite
burdensome, considering that you have to
develop two forms of the same test.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 3. SPLIT-HALF RELIABILITY
• In split-half reliability, a test is given and divided into halves that are scored separately. The results of one half of the test are then compared with the results of the other.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 3. SPLIT-HALF RELIABILITY
• How to divide the test into two halves?
• Divide the test randomly into two halves
• Calculate a score for the first half of the items and another score for the second half
• Although convenient, this method can cause problems if the questions in the second half are more difficult
• Use an odd-even system
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 3. SPLIT-HALF RELIABILITY
• The correlation (between the two halves) is usually an underestimate, because each subset is only half as long as the full test. It is less reliable because it has fewer items.
• To correct for half length, one can apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 3. SPLIT-HALF RELIABILITY
Corrected r = 2r / (1 + r)

where Corrected r is the estimated correlation between the two halves of the test if each had had the total number of items, and r is the correlation between the two halves of the test.
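A minimal sketch of the odd-even split and the correction, on invented 0/1 item scores (statistics.correlation requires Python 3.10+):

from statistics import correlation  # Pearson r; Python 3.10+

# Hypothetical 0/1 item scores for six persons on a 10-item test.
items = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
]

# Odd-even split: score each half separately.
odd_scores = [sum(row[0::2]) for row in items]
even_scores = [sum(row[1::2]) for row in items]

r = correlation(odd_scores, even_scores)   # half-length reliability
corrected_r = (2 * r) / (1 + r)            # Spearman-Brown correction
print(round(r, 3), round(corrected_r, 3))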
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 3. SPLIT-HALF RELIABILITY
• The Spearman-Brown formula is advisable for use only when the two halves of the test have equal variances. Otherwise, Cronbach’s coefficient alpha can be used. This general reliability coefficient provides a lower-bound estimate of reliability.
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 4. KR20 Formula

• Also known as Kuder-Richardson 20, it calculates the reliability of a test in which the items are dichotomous, scored 0 or 1 (usually for wrong or right).
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 4. KR20 Formula
The formula is:

KR20 = [N / (N − 1)] × [(s² − Σpq) / s²]
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 4. KR20 Formula
KR20 = [N / (N − 1)] × [(s² − Σpq) / s²]

Where:
KR20 = the reliability estimate
N = the number of items on the test
s² = the variance of the total test score
p = the proportion of people getting each item correct (this is found separately for each item)
q = the proportion of people getting each item incorrect; for each item, q equals 1 − p
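A small sketch of the computation on hypothetical dichotomous data (the variance is taken in population form, s² = Σ(x − mean)² / n; texts differ on this choice):

from statistics import pvariance  # population variance

# Hypothetical 0/1 responses: rows are persons, columns are items.
data = [
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

N = len(data[0])                      # number of items
totals = [sum(row) for row in data]   # each person's total score
s2 = pvariance(totals)                # variance of the total scores

# Sum of p*q over items (p = proportion correct, q = 1 - p).
sum_pq = 0.0
for i in range(N):
    p = sum(row[i] for row in data) / len(data)
    sum_pq += p * (1 - p)

kr20 = (N / (N - 1)) * ((s2 - sum_pq) / s2)
print(round(kr20, 3))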
Types of Reliability (Anastasi and Urbina, 1996; Kaplan
and Saccuzzo, 2011)
• 5. Coefficient alpha
• Developed by Cronbach to estimate the internal consistency of tests in which the items are not scored as 0 or 1 (wrong or right).
• Applicable for many personality and attitude scales.
• The SPSS software provides a convenient way of determining the coefficient alpha.
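Coefficient alpha generalizes KR20 by replacing Σpq with the sum of the item variances. A minimal sketch on invented Likert-type data (population variances, as above):

from statistics import pvariance  # population variance

# Hypothetical 1-5 Likert responses: rows are persons, columns are items.
data = [
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]

N = len(data[0])
totals = [sum(row) for row in data]

item_vars = [pvariance([row[i] for row in data]) for i in range(N)]
alpha = (N / (N - 1)) * (1 - sum(item_vars) / pvariance(totals))
print(round(alpha, 3))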
How reliable is reliable?

• Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.
• In clinical settings, high reliability is extremely important (i.e., reliability of .90 to .95).
What to do about low reliability

• Increase the number of items (see the general Spearman-Brown formula sketched below).
• The decision to expand the test from the original number of items to the number suggested by the formula must depend on economic and practical considerations.
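The general (“prophecy”) form of the Spearman-Brown formula estimates the reliability of a test lengthened by a factor k; a sketch, assuming the standard formula r_new = k·r / (1 + (k − 1)·r):

def spearman_brown(r, k):
    # Predicted reliability when a test with reliability r is made k times longer.
    return (k * r) / (1 + (k - 1) * r)

print(round(spearman_brown(0.70, 2), 3))  # doubling a .70 test gives about .82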
What to do about low reliability

• Factor and item analysis


• The items in the test must measure the same thing.
• Examine the correlation between each item and the total score for the test. When the correlation between performance on a single item and the total test score is low, the item is probably measuring something different from the other items in the test (discriminability analysis); see the sketch below.
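A quick sketch of that item-total check on invented 0/1 responses (statistics.correlation requires Python 3.10+):

from statistics import correlation  # Pearson r; Python 3.10+

# Hypothetical 0/1 responses: rows are persons, columns are items.
data = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

totals = [sum(row) for row in data]
for i in range(len(data[0])):
    item = [row[i] for row in data]
    # A low item-total correlation flags an item measuring something different.
    print(i + 1, round(correlation(item, totals), 3))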
What to do about low reliability

• Correction for attenuation
• Potential correlations are attenuated, or diminished, by measurement error.
• Fortunately, there are procedures that can help determine what the correlation between two measures would have been if they had not been measured with error.
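The standard such procedure (not spelled out in the slides; this is the textbook correction for attenuation) divides the observed correlation by the square root of the product of the two reliabilities:

import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    # Estimated correlation between true scores, given the observed
    # correlation r_xy and the reliabilities r_xx and r_yy of the measures.
    return r_xy / math.sqrt(r_xx * r_yy)

# An observed r of .40 between tests with reliabilities .70 and .80:
print(round(correct_for_attenuation(0.40, 0.70, 0.80), 3))  # about .53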
THANK YOU!
