
ESSENTIALS OF VALIDITY AND

RELIABILITY
BY: SUSANA URBINA
RELIABILITY

The term reliability suggests trustworthiness.


Reliability in measurement implies consistency and precision.
Reliability, then, is a quality of test scores that suggests they are
sufficiently consistent and free from measurement error to be
useful.
The fact is that the quality of reliability is one that, if present,
belongs not to tests but to test scores.

Reliability is a characteristic of test scores, rather than of tests


themselves.

Measurement error may be defined as any fluctuation in scores


that results from factors related to the measurement process that
are irrelevant to what is being measured.
RELIABILITY

TRUTH AND ERROR IN PSYCHOLOGICAL MEASUREMENT


One of the most enduring approaches to the topic of
reliability is the classical test theory notion of the true score.

THE CONCEPT OF THE TRUE SCORE IN INDIVIDUAL DATA


In classical test theory, an individual's true score is
conceptualized as the average score in a hypothetical
distribution of scores that would be obtained if the individual
took the same test an infinite number of times.
Sample variance (s²) was defined as the average amount of
variability in a group of scores.
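As a minimal illustration of this definition, the following Python sketch (with hypothetical scores, not taken from any actual test) computes a sample variance as the average squared deviation from the mean; dividing by N - 1 instead of N gives the unbiased estimator.

# Illustrative sketch only: sample variance of hypothetical test scores.
scores = [82, 90, 75, 88, 95, 70]                # hypothetical raw scores

mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)  # N in the denominator
print(f"mean = {mean:.2f}, s^2 = {variance:.2f}")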
SOURCES OF ERROR IN PSYCHOLOGICAL
TESTING
(a) the context in which the testing takes place
(b) the test taker
(c) the test itself.
Interscorer Differences - the label assigned to the errors that
may enter into scores whenever the element of subjectivity
plays a part in scoring a test.
Scorer Reliability - the basic method for estimating error due
to interscorer differences consists of having at least two
different individuals score the same set of tests, so that for
each test taker's performance two or more independent
scores are generated.
The correlations between the sets of scores generated in this
fashion are indexes of scorer reliability.
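A minimal Python sketch of this procedure, using hypothetical ratings and NumPy's correlation function, might look as follows.

# Hypothetical example: two independent scorers rate the same seven tests;
# their Pearson correlation serves as an index of scorer reliability.
import numpy as np

scorer_a = np.array([12, 15, 9, 20, 17, 11, 14])   # scores assigned by scorer A
scorer_b = np.array([13, 14, 10, 19, 18, 10, 15])  # same tests, scored by scorer B

scorer_reliability = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(f"scorer reliability r = {scorer_reliability:.2f}")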
Time Sampling Error - refers to the variability inherent in
test scores as a function of the fact that they are obtained at
one point in time rather than at another.

traits, which are construed to be relatively enduring
characteristics, and
states, which are by definition temporary conditions.
Test-Retest Reliability (or stability coefficient) - administer
the same test on two different occasions, separated by a
certain time interval, to one or more groups of individuals.
This indicates which scores are likely to fluctuate as a result of
time sampling error.
Content Sampling Error - the term used to label the trait-
irrelevant variability that can enter into test scores as a result
of fortuitous factors related to the content of the specific
items included in a test

Alternate-Form Reliability
Alternate-form reliability procedures are intended to estimate
the amount of error in test scores that is attributable to
content sampling error.
To investigate this kind of reliability, two or more different
forms of the test (identical in purpose but differing in
specific content) need to be prepared and administered to
the same group of subjects.
Split-Half Reliability - create two scores for each person by
splitting the test into halves.
Two issues need to be considered in splitting a test into comparable halves:
(a) whether some test items differ systematically from other
items across the length of the test, and
(b) whether speed plays a significant role in test performance.

The Spearman-Brown (S-B) formula is applied to the half-test
correlation (rhh) to obtain the reliability estimate for the
full-length test.
SPEARMAN-BROWN (S-B) FORMULA

rS-B = (n × rxx) / [1 + (n - 1) × rxx]

where
rS-B = Spearman-Brown estimate of a reliability coefficient,
n = the multiplier by which test length is to be increased or
decreased, and
rxx = the reliability coefficient obtained with the original test
length.
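A minimal sketch of split-half reliability with the Spearman-Brown correction, using hypothetical 0/1 item responses (an odd-even split and n = 2 are assumed), is shown below.

# Illustrative sketch: odd-even split-half correlation corrected to full length
# with the Spearman-Brown formula. Data are hypothetical.
import numpy as np

# Rows = test takers, columns = items (1 = correct, 0 = incorrect).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1],
])

odd_half = item_scores[:, 0::2].sum(axis=1)    # total score on odd-numbered items
even_half = item_scores[:, 1::2].sum(axis=1)   # total score on even-numbered items
r_hh = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation

def spearman_brown(r_xx, n):
    """Estimated reliability when test length is multiplied by n."""
    return n * r_xx / (1 + (n - 1) * r_xx)

print(f"r_hh = {r_hh:.2f}, full-length estimate = {spearman_brown(r_hh, 2):.2f}")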
MAGNITUDE OF SPLIT-HALF RELIABILITY
COEFFICIENTS.
Systematic differences across test items can occur for a
variety of reasons; for example, tests may start with the easiest
items and get progressively more difficult, or may be divided
into parts or subtests that cover different content.
Tests such as the Wonderlic Personnel Test are structured in a
spiral-omnibus format.
When test performance depends primarily on speed, items
are usually pegged at a low enough difficulty level for all test
takers to complete correctly, but time limits are set so that
most test takers will not be able to finish the test.
Inter-item Inconsistency - refers to error in scores that results
from fluctuations in items across an entire test, as opposed to
the content sampling error emanating from the particular
configuration of items included in the test as a whole.

Content Heterogeneity - results from the inclusion of items


or sets of items that tap content knowledge or psychological
functions that differ from those tapped by other items in the
same test
Internal Consistency Measures are statistical procedures
designed to assess the extent of inconsistency across test
items.
Split-half reliability coefficients accomplish this to some
extent.
The two formulas most frequently used to calculate
interitem consistency are the Kuder-Richardson formula 20 (K-
R 20) and coefficient alpha (α), also known as Cronbach's
alpha.
The magnitude of the K-R 20 and alpha coefficients is a
function of two factors:
(a) the number of items in the test, and
(b) the ratio of variability in test takers' performance across all
the items in the test to total test score variance (a minimal
computational sketch follows).
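The following sketch (hypothetical 0/1 item data; population, N-denominator variances are used so that the two coefficients coincide for dichotomous items) illustrates both formulas.

# Illustrative sketch: Cronbach's alpha and KR-20 for a matrix of item scores
# (rows = test takers, columns = items). Data are hypothetical.
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0)              # per-item variances (N denominator)
    total_var = items.sum(axis=1).var()        # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(items):
    """KR-20 for dichotomous (0/1) items; equivalent to alpha for such items."""
    k = items.shape[1]
    p = items.mean(axis=0)                     # proportion passing each item
    total_var = items.sum(axis=1).var()
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical 0/1 item responses.
data = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
print(f"alpha = {cronbach_alpha(data):.2f}, KR-20 = {kr20(data):.2f}")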
TIME AND CONTENT SAMPLING ERROR
COMBINED
Delayed alternate-form reliability coefficients can be
calculated when two or more alternate forms of the same
test are administered on two different occasions, separated
by a certain time interval, to one or more groups of
individuals
EVALUATION OF ERROR FROM MULTIPLE
SOURCES
Generalizability Theory or G Theory is an extension of classical test
theory that uses analysis of variance (ANOVA) methods to
evaluate the combined effects of multiple sources of error variance
on test scores simultaneously.
It also allows for the evaluation of the interaction effects from
different types of error sources.
G theory was introduced in the early 1960s.

The Item Response Theory (IRT) Approach to Reliability - the
demands of large-scale and computer adaptive testing have been
rapidly spurring the development and application of IRT-based
approaches in the past few decades.
RELIABILITY CONSIDERATIONS IN TEST
INTERPRETATION
1. Acknowledge and quantify the margin of error in obtained
test scores.
2. Evaluate the statistical significance of the difference
between obtained scores to help determine the import of
those differences in terms of what the scores represent
Quantifying Error in Test Scores: The Standard Error of
Measurement
Reliability data are used to derive the upper and lower limits
of the range within which test takers' true scores are likely
to fall.
A confidence interval is calculated for an obtained score on
the basis of the estimated reliability of scores from the test in
question.
The size of the interval depends on the level of probability
that is chosen.
Standard errors of measurement (SEMs) for test scores and
standard errors for test score differences (SEdiffs), both of
which are derived from estimates of score reliability, are
essential pieces of information because
1. SEMs provide confidence intervals for obtained test scores
that alert test users to the fact that scores are subject to
fluctuation due to measurement error, and
2. the confidence intervals obtained with the use of SEdiff
statistics forestall the overvaluation of score differences that
may be insignificant in light of measurement error (see the
sketch below).
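A minimal sketch of these quantities, assuming a hypothetical scale with SD = 15, an obtained score of 110, and illustrative reliability values, follows.

# Illustrative sketch: SEM, a confidence interval for an obtained score, and
# SEdiff for the difference between two scores. All values are hypothetical.
import math

def sem(sd, r_xx):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r_xx)

def confidence_interval(obtained, sd, r_xx, z=1.96):
    """Approximate 95% band (z = 1.96) around an obtained score."""
    margin = z * sem(sd, r_xx)
    return obtained - margin, obtained + margin

def se_diff(sd, r_xx, r_yy):
    """SEdiff = SD * sqrt(2 - r_xx - r_yy) for two scores on the same scale."""
    return sd * math.sqrt(2 - r_xx - r_yy)

low, high = confidence_interval(obtained=110, sd=15, r_xx=0.90)
print(f"95% confidence interval: {low:.1f} to {high:.1f}")
print(f"SEdiff for two such scores: {se_diff(15, 0.90, 0.85):.1f}")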
ESSENTIALS OF VALIDITY

Validity is defined as the degree to which all the accumulated
evidence supports the intended interpretation of test scores for
the proposed purpose.
Validity is always a matter of degree rather than an all-or-none
determination.
Validity is a matter of judgments that pertain to test scores as they
are employed for a given purpose and in a given context
Validation (the process whereby validity evidence is
gathered) begins with an explicit statement by the test developer
of the conceptual framework and rationale for a test, but is in its
very nature open-ended because it includes all the information
that adds to our understanding of test results.
validation is the joint responsibility of the test developer
[who provides evidence and a rationale for the intended use
of the test] and the test user [who evaluates the available
evidence within the context in which the test is to be used]

According to Messick (1989, p. 13), validity is "an integrated
evaluative judgment of the degree to which empirical
evidence and theoretical rationales support the adequacy and
appropriateness of inferences and actions based on test scores
or other modes of assessment."
THE CLASSIC DEFINITION OF VALIDITY

The first definition of validity, as the extent to which a test
measures what it purports to measure, was formulated in
1921 by the National Association of the Directors of
Educational Research (T. B. Rogers, 1995, p. 25).
The view that test validity concerns what the test measures
and how well it does so (Anastasi & Urbina, p. 113).
The most famous example of this is E. G. Boring's 1923 definition
of intelligence as "whatever it is that intelligence tests
measure."
WHAT IS A CONSTRUCT?

A construct is anything devised by the human mind that is not directly
observable.
Constructs are abstractions that may refer to concepts, ideas, theoretical
entities, hypotheses, or inventions of many sorts.
Psychological constructs differ widely in terms of
their breadth and complexity,
their potential applicability, and
the degree of abstraction required to infer them from the available data.
Synonyms: The terms construct and latent variable are often used
interchangeably.
A latent variable is a characteristic that presumably underlies some
observed phenomenon but is not directly measurable or observable. All
psychological traits are latent variables, or constructs, as are the labels
given to factors that emerge from factor analytic research, such as verbal
comprehension or neuroticism.
THE INTEGRATIVE FUNCTION OF
CONSTRUCTS IN TEST VALIDATION
1. To designate the traits, processes, knowledge stores, or
characteristics whose presence and extent we wish to
ascertain through the specific behavior samples collected
by the tests.
2. To designate the inferences that may be made on the basis
of test scores.

One of the earliest formulations was Cronbach's (1949)
classification of validity into two types, namely, logical and
empirical.
Cronbach suggested the use of the term construct validity to
designate the nomological net, or network of
interrelationships between and among theoretical and
observable elements that support a construct.
According to Embretson (p. 180), construct representation
research is concerned with identifying the theoretical
mechanisms that underlie task performance.
From an information-processing perspective, the goal of
construct representation is task decomposition. The process
of task decomposition can be applied to a variety of cognitive
tasks, including interpersonal inferences and social
judgments.
Nomothetic span, on the other hand, concerns the network
of relationships of a test to other measures.
It refers to the strength, frequency, and pattern of
significant relations between test scores and other measures.
ROLE OF SOURCES OF VALIDITY EVIDENCE:

1. Construct representation research is concerned primarily with
identifying differences in the test's tasks, whereas
nomothetic span research is concerned with differences
among test takers.
2. Validation of the construct-representation aspect of test
tasks is independent of the supporting evidence that may be
gathered in terms of the nomothetic span of test scores, and
vice versa
Convergent validity evidence - evidence regarding the
similarity, or identity, of the constructs that different measures
are evaluating, typically based on consistently high correlations
between those measures.
discriminant validity evidence, based on consistently low
correlations between measures that are supposed to differ,
also may be used to substantiate the identities of the
constructs they tap.
The Multitrait-Multimethod Matrix - In an effort to organize
the collection and presentation of convergent and
discriminant validation data, D. T. Campbell and Fiske (1959)
proposed a design they called the multitrait-multimethod
matrix (MTMMM).
The Multitrait-Multimethod Matrix refers to a validation
strategy that requires the collection of data on two or more
distinct traits (e.g., anxiety, affiliation, and dominance) by
two or more different methods (e.g., self-report
questionnaires, behavioral observations, and projective
techniques).
The resulting matrix includes (see the illustrative sketch below):
(a) reliability coefficients for each measure,
(b) correlations between scores on the same trait assessed by
different methods (i.e., convergent validity data),
(c) correlations between scores on different traits measured by
the same methods, and
(d) correlations between scores on different traits assessed by
different methods (c and d both constitute discriminant validity data).
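The sketch below is a simplified, hypothetical illustration of the MTMM logic with two traits and two methods (simulated data); pairs of measures sharing a trait are flagged as convergent, the rest as discriminant.

# Illustrative MTMM-style layout: hypothetical scores for two traits
# (anxiety, dominance) measured by two methods (self-report, observer).
import numpy as np

rng = np.random.default_rng(0)
n = 50
anxiety = rng.normal(size=n)        # simulated latent trait levels
dominance = rng.normal(size=n)

measures = {
    "anxiety_selfreport":   anxiety + 0.5 * rng.normal(size=n),
    "anxiety_observer":     anxiety + 0.5 * rng.normal(size=n),
    "dominance_selfreport": dominance + 0.5 * rng.normal(size=n),
    "dominance_observer":   dominance + 0.5 * rng.normal(size=n),
}
names = list(measures)
r = np.corrcoef(np.array(list(measures.values())))

for i in range(len(names)):
    for j in range(i):
        same_trait = names[i].split("_")[0] == names[j].split("_")[0]
        kind = "convergent" if same_trait else "discriminant"
        print(f"{names[i]} vs {names[j]}: r = {r[i, j]:.2f} ({kind})")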
Age Differentiation - the criterion of age differentiation is one of
the oldest sources of evidence for validating ability tests
Experimental Results - In the area of ability testing, this evidence
derives primarily from pre- and posttest score differences
following interventions aimed at remediating deficiencies or
upgrading performance in various cognitive and intellectual skills.
Factor Analysis - reduces the number of dimensions needed to
describe data derived from a large number of measures.
Factors are not real entities, although they are often discussed as
though they were. They are simply constructs or latent variables that
may be inferred from the patterns of covariance revealed by
statistical analyses.
TWO BASIC WAYS TO CONDUCT FACTOR
ANALYSES
1. The original approach to the method is exploratory in nature
and thus is known as exploratory factor analysis, or EFA; it
sets out to discover which factors (i.e., latent variables or
constructs) underlie the variables under analysis.
EFA starts with a correlation matrix, a table that displays the
intercorrelations among the scores obtained by a sample of
individuals on a wide variety of tests (or subtests or items).
The end product of factor analyses is a factor matrix, which is
a table that lists the loadings of each one of the original
variables on the factors extracted from the analyses.
Factor loadings are correlations between the original
measures in the correlation matrix and the factors that have
been extracted (see the EFA sketch after this list).
2. A more recent approach is called confirmatory factor
analysis (CFA) because it sets out to test hypotheses, or to
confirm theories, about factors that are already presumed
to exist.
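A minimal EFA sketch, assuming scikit-learn is available and using simulated data in which six measures are driven by two hypothetical factors, is shown below; the extracted loading matrix plays the role of the factor matrix described above.

# Illustrative sketch of exploratory factor analysis on simulated data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
verbal = rng.normal(size=n)       # hypothetical "verbal" factor
spatial = rng.normal(size=n)      # hypothetical "spatial" factor

# Six observed measures: three driven mainly by each factor, plus noise.
X = np.column_stack([
    verbal + 0.4 * rng.normal(size=n),
    verbal + 0.4 * rng.normal(size=n),
    verbal + 0.4 * rng.normal(size=n),
    spatial + 0.4 * rng.normal(size=n),
    spatial + 0.4 * rng.normal(size=n),
    spatial + 0.4 * rng.normal(size=n),
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print("Factor loadings (rows = factors, columns = measures):")
print(np.round(fa.components_, 2))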

Structural equation modeling (SEM) - One rapidly evolving


set of procedures that can be used to test the plausibility of
hypothesized interrelationships among constructs as well as
the relationships between constructs and the measures used
to assess them
The correspondence between the data and the models is
evaluated with statistics, appropriately named goodness-of-
fit statistics.
VALIDITY EVIDENCE BASED ON RELATIONSHIPS
BETWEEN TEST SCORES AND CRITERIA
Some Essential Facts About Criteria
Merriam-Webster's Collegiate Dictionary (1995) defines criterion as
"a standard on which a judgment or decision may be based" or "a
characteristic mark or trait."
Criterion measures are indexes of the criteria that tests are
designed to assess or predict and that are gathered independently
of the test in question.

Criterion measures or estimates may be naturally dichotomous
(e.g., graduating vs. dropping out) or artificially dichotomized
(e.g., success vs. failure); polytomous (e.g., diagnoses of anxiety vs.
mood vs. dissociative disorders); or continuous (e.g., grade point
average).
The validity of test scores is evaluated in terms of hit rates.

Hit rates typically indicate the percent of correct decisions


or classifications made with the use of test scores, although
mean differences and suitable correlation indexes may also
be used.
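A minimal sketch of a hit rate for a hypothetical cutoff score and a dichotomous criterion follows.

# Illustrative sketch: hit rate for a hypothetical cutoff used to predict a
# dichotomous criterion (1 = success, 0 = failure). All values are made up.
test_scores = [45, 62, 58, 70, 39, 55, 66, 48, 73, 51]   # hypothetical predictor scores
criterion   = [0,  1,  1,  1,  0,  0,  1,  0,  1,  1]    # hypothetical outcomes
cutoff = 54                                               # hypothetical selection cutoff

predicted = [1 if s >= cutoff else 0 for s in test_scores]
hits = sum(p == c for p, c in zip(predicted, criterion))
print(f"Hit rate = {hits / len(criterion):.0%}")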
CRITERION-RELATED VALIDATION
PROCEDURES
Concurrent and Predictive Validation

Concurrent validation evidence is gathered when indexes of


the criteria that test scores are meant to assess are available
at the time the validation studies are conducted.

Predictive validation evidence is relevant for test scores that


are meant to be used to make decisions based on estimating
future levels of performance or behavioral outcomes.
Concurrent Validation Example: The Whitaker Index of
Schizophrenic Thinking
The WIST was designed to identify the kind of thinking
impairment that often accompanies schizophrenic syndromes.
Each of its two forms (A and B) consists of 25 multiple choice
items.

Correlations between test scores and criterion measures (rxy)


are usually called validity coefficients.
There is bound to be some error in the estimates of the
criterion made by using the test scores. This error is gauged by
the standard error of estimate
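A minimal sketch of the standard error of estimate, with hypothetical values (the criterion SD and validity coefficient are chosen purely for illustration), is shown below.

# Illustrative sketch: the standard error of estimate gauges the error made
# when criterion standing is estimated from test scores.
import math

def standard_error_of_estimate(sd_criterion, r_xy):
    """SEest = SD_y * sqrt(1 - r_xy**2)."""
    return sd_criterion * math.sqrt(1 - r_xy ** 2)

# Hypothetical example: criterion SD = 0.6 GPA points, validity r_xy = .50.
print(f"SEest = {standard_error_of_estimate(0.6, 0.50):.2f}")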
Criterion contamination - the validity of criterion measures
can be eroded when those who are charged with determining
the criterion standing of individuals in the validation samples
have access to scores on the test that is used as a predictor.
Replication of predictor-criterion relationships on separate
samples is a process known as cross-validation.
Some reduction in the magnitude of the original R, or
shrinkage, is expected upon cross-validation
A moderator variable is any characteristic of a subgroup of
persons in a sample that influences the degree of correlation
between two other variables.
Differential validity, in the context of test bias, refers to
differences in the size of the correlations obtained between
predictors and criteria for members of different groups.
The problem of differential validity is also referred to as slope
bias.

Differential prediction, on the other hand, occurs when test


scores underpredict or overpredict the criterion performance
of one group compared to the other.
This problem is labeled intercept bias
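The sketch below illustrates, with simulated data, how separate within-group regression lines can be compared: differing slopes would suggest slope bias (differential validity), while similar slopes with differing intercepts would suggest intercept bias (differential prediction).

# Illustrative sketch: fit a separate regression line (criterion on predictor)
# within each of two hypothetical groups and compare slopes and intercepts.
import numpy as np

def regression_line(predictor, criterion):
    """Return (slope, intercept) of the least-squares line."""
    slope, intercept = np.polyfit(predictor, criterion, deg=1)
    return slope, intercept

rng = np.random.default_rng(1)
x_a = rng.normal(50, 10, size=100)                       # group A predictor scores
y_a = 0.05 * x_a + 1.0 + rng.normal(0, 0.3, size=100)    # group A criterion
x_b = rng.normal(50, 10, size=100)                       # group B predictor scores
y_b = 0.05 * x_b + 0.7 + rng.normal(0, 0.3, size=100)    # lower intercept for group B

slope_a, int_a = regression_line(x_a, y_a)
slope_b, int_b = regression_line(x_b, y_b)
print(f"Group A: slope {slope_a:.3f}, intercept {int_a:.2f}")
print(f"Group B: slope {slope_b:.3f}, intercept {int_b:.2f}")
# Similar slopes with different intercepts suggest intercept bias: a single
# regression line would over- or underpredict for one of the groups.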
Stereotype threat, which refers to the deleterious effects
that fear of confirming negative racial stereotypes seems to
have on the test performance of some minority group
members.

Meta-analyses rely on a series of quantitative procedures


that provide for the synthesis and integration of the results
obtained from the research literature on a given subject.
Individual validation studies have often relied on statistical
significance levels that stress the avoidance of Type I errors (i.e.,
incorrectly rejecting the null hypothesis of no difference when it
is true) while neglecting the possibility of Type II errors (i.e.,
incorrectly accepting the null hypothesis when it is false).
Discriminant functions involve the application of weighted
combinations of scores on the predictors
Another traditional strategy that can be used for both
selection and classification problems is synthetic validation
