
Validity

MEAM 607 – Advanced Test and Measurement


By: Sherwin Trinidad
INTENDED LEARNING OUTCOMES

1. Define validity and the kinds of evidence that support it
2. Create a table of specifications
3. Differentiate between the types of validity evidence
4. Perform valid interpretation and use of tests
SESSION MAP: EVIDENCES OF VALIDITY

• Construct-related evidence
• Content-related evidence
• Criterion-related evidence
• Valid interpretation and use of tests
WHAT IS VALIDITY?

Validity refers to whether or not an assessment measures what it is supposed to measure. Even if a test is reliable, it may not provide valid results.
VALIDITY QUESTION:

• To what extent does your assessment measure what you say it does [and is it as useful as you claim]?
Very important points. Validity is:
• A matter of degree ("how valid")
• Always specific to a particular purpose ("validity for…")
• A unitary concept: three kinds of evidence combine into one judgment ("how valid?")
• Inferred from evidence; it cannot be directly measured
3 CATEGORIES OF VALIDITY EVIDENCE

• Construct-related evidence
• Content-related evidence
• Criterion-related evidence

CONSTRUCT-RELATED EVIDENCE
What is a construct?

A hypothetical quality or trait (e.g., extraversion, intelligence, mathematical reasoning ability) that we use to explain some pattern of behavior (e.g., good at making new friends, learns quickly, good in all math courses).
Construct-Related Evidence

The extent to which an assessment measures the construct (e.g., reading ability, intelligence, anxiety) the test purports to measure.
Some kinds of evidence:
 See if items behave the same way (if the test is meant to measure a single construct)
 Analyze the mental processes the items require
 Compare scores of known groups
 Compare scores before and after treatment (do they change in the way your theory says they will and will not?)
 Correlate scores with other constructs: do they correlate well, and poorly, in the pattern expected? (see the sketch below)
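A minimal sketch of the last kind of evidence, convergent and discriminant correlations, using NumPy. The score arrays and construct names are entirely hypothetical, for illustration only:

    import numpy as np

    # Hypothetical scores for 10 students (illustration only).
    math_test   = np.array([12, 15,  9, 18, 14, 11, 17, 13, 16, 10])  # new math test
    math_grades = np.array([78, 85, 70, 92, 82, 74, 90, 80, 88, 72])  # related construct
    shyness     = np.array([ 3,  1,  4,  2,  5,  3,  1,  4,  2,  5])  # unrelated construct

    # Convergent evidence: scores should correlate strongly with a related construct.
    r_convergent = np.corrcoef(math_test, math_grades)[0, 1]

    # Discriminant evidence: scores should correlate weakly with an unrelated construct.
    r_discriminant = np.corrcoef(math_test, shyness)[0, 1]

    print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")

A high convergent r together with a low discriminant r matches the pattern the theory predicts.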
Important Points:

• usually assessed after the fact
• usually requires test scores
• is a complex, extended logical process; cannot be quantified
CONTENT-RELATED EVIDENCE

What is an achievement domain?

A carefully specified set or range of learning outcomes (content and mental skills). In short, your set of instructional targets.
Content-Related Evidence

The extent to which an assessment's tasks provide a relevant and representative sample of the domain of outcomes you are intending to measure.
The evidence:

 Most useful type of validity evidence for classroom tests
 The domain is defined by learning objectives
 Items are chosen with a table of specifications (see the sketch below)
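A minimal sketch of building a table of specifications, assuming hypothetical objectives and weights; item counts are allocated in proportion to instructional emphasis:

    # Build a simple table of specifications: allocate items to objectives
    # in proportion to instructional emphasis (hypothetical weights).
    objectives = {
        "Recall basic concepts": 0.20,
        "Apply procedures": 0.40,
        "Analyze and interpret": 0.30,
        "Evaluate arguments": 0.10,
    }
    total_items = 40

    print(f"{'Objective':<25}{'Weight':>8}{'Items':>7}")
    for objective, weight in objectives.items():
        n_items = round(weight * total_items)
        print(f"{objective:<25}{weight:>8.0%}{n_items:>7}")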
Important Points:
• Is an attempt to build validity into the test rather than assess it after the fact
• The sample can be faulty in many ways: inappropriate vocabulary, unclear directions, omission of higher-order skills, failure to reflect the content or weight of what was actually taught
• "Face validity" (superficial appearance) or a label does not provide evidence of validity
• Assumes that test administration and scoring were proper
CRITERION-RELATED EVIDENCE

What is a criterion?

A valued performance or outcome (e.g., scores high on a standardized achievement test in math, later does well in an algebra class) that we believe might, or should, be related to what we are measuring (e.g., knowledge of basic mathematical concepts).
Criterion-Related Evidence

The extent to which a test's scores correlate with some valued performance outside the test (the criterion).
The evidence:

 Concurrent correlations (relate scores to a different current performance)
 Predictive correlations (predict a future performance), as in the sketch below
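A minimal sketch of a predictive validity coefficient, assuming hypothetical entrance-test scores and later algebra grades:

    import numpy as np

    # Hypothetical data: entrance-test scores and later algebra grades.
    entrance_scores = np.array([45, 60, 52, 70, 58, 65, 48, 72, 55, 63])
    algebra_grades  = np.array([75, 84, 79, 90, 80, 86, 76, 93, 78, 85])

    # The predictive validity coefficient is the correlation between
    # test scores and the future criterion performance.
    r_predictive = np.corrcoef(entrance_scores, algebra_grades)[0, 1]
    print(f"predictive validity coefficient = {r_predictive:.2f}")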
Important Points:
• Always requires test scores
• Is quantified (i.e., a number)
• Must be interpreted cautiously, because irrelevant factors can raise or lower validity coefficients (unreliability, restricted spread of scores, etc.)
• A good "criterion" is often hard to find
• Can be used to create "expectancy tables" (see the sketch below)
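A minimal sketch of an expectancy table, assuming hypothetical score bands and pass/fail outcomes; it shows the proportion of students who succeeded at each score level:

    from collections import Counter

    # Hypothetical (score band, passed course?) pairs.
    records = [("low", False), ("low", False), ("low", True),
               ("mid", True), ("mid", False), ("mid", True),
               ("high", True), ("high", True), ("high", True)]

    # Expectancy table: proportion passing within each score band.
    counts = Counter(band for band, _ in records)
    passes = Counter(band for band, passed in records if passed)
    for band in ("low", "mid", "high"):
        print(f"{band:>5}: {passes[band] / counts[band]:.0%} passed")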
VALID INTERPRETATION AND USE OF TESTS
Item difficulty or P

• The percentage of students who correctly answered an item
• Also referred to as the p-value
• Ranges from 0% to 100%, or more typically written as a proportion, 0.00 to 1.00
• The higher the value, the easier the item (see the sketch below)
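A minimal sketch of the p-value computation, assuming item responses scored 1 for correct and 0 for incorrect (hypothetical data):

    # Item difficulty (p-value): proportion of students answering correctly.
    # responses: 1 = correct, 0 = incorrect (hypothetical data).
    responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
    p_value = sum(responses) / len(responses)
    print(f"p = {p_value:.2f}")  # 0.70: a moderately easy item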
Item difficulty or P

P-values above 0.90 indicate very easy items that you should not reuse in subsequent tests. If almost all students responded correctly, the item probably addresses a concept not worth testing.
Item difficulty or P

Ideal difficulty by format (roughly midway between the chance score and a perfect score):
• Five-response multiple-choice: .60
• Four-response multiple-choice: .62
• Three-response multiple-choice: .66
• True-false (two-response multiple-choice): .75
Item discrimination or R(IT)

• The relationship between how well students performed on the item and their total test score
• Also referred to as the point-biserial correlation (PBS)
• Ranges from -1.00 to +1.00
• The higher the value, the more discriminating the item (see the sketch below)
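A minimal sketch of the point-biserial index, assuming hypothetical 0/1 item scores and total test scores; for a dichotomous item it equals the Pearson correlation with the totals:

    import numpy as np

    # Hypothetical data: one item's 0/1 scores and the students' total test scores.
    item_scores  = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
    total_scores = np.array([38, 22, 35, 40, 25, 33, 20, 37, 36, 28])

    # Point-biserial: Pearson correlation between a dichotomous item and totals.
    r_it = np.corrcoef(item_scores, total_scores)[0, 1]
    print(f"R(IT) = {r_it:.2f}")

In practice the item is often removed from the total before correlating (the corrected item-total correlation), so the item does not inflate the coefficient by correlating with itself.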
Item discrimination or R(IT)

• A highly discriminating item indicates that students with high test scores responded correctly whereas students with low test scores responded incorrectly.
• Remove items with discrimination values near or below zero: a negative value means that students who performed poorly on the test did better on the item than students who performed well. Such an item is confusing.
Item discrimination or R(IT)

Evaluate items using four guidelines for classroom test discrimination values:
• 0.40 or higher: very good items
• 0.30 to 0.39: good items
• 0.20 to 0.29: fairly good items
• 0.19 or less: poor items
Reliability coefficient or ALPHA

• An index of how free a test score is from measurement error
• Ranges from 0.00 to 1.00
• The higher the value, the more reliable the test score
• Typically a measure of internal consistency, indicating how well the items correlate with one another (see the sketch below)
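A minimal sketch of Cronbach's alpha from an item-score matrix, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); the response matrix is hypothetical:

    import numpy as np

    # Hypothetical 0/1 item-score matrix: rows = students, columns = items.
    scores = np.array([[1, 1, 0, 1, 1],
                       [0, 1, 0, 0, 1],
                       [1, 1, 1, 1, 1],
                       [0, 0, 0, 1, 0],
                       [1, 1, 1, 0, 1],
                       [1, 0, 1, 1, 1]])

    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores

    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"alpha = {alpha:.2f}")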
Reliability Coefficient or ALPHA

• High reliability indicates that the items are measuring the same construct (e.g., knowledge of how to calculate integrals)
• Two ways to improve test reliability: 1) increase the number of items, or 2) use items with high discrimination values
Reliability Interpretation

.90 and above  Excellent reliability; at the level of the best standardized tests
.80 - .90      Very good for a classroom test
.70 - .80      Good for a classroom test; in the range of most. There are probably a few items that could be improved.
.60 - .70      Somewhat low. This test should be supplemented by other measures to determine grades. There are probably some items that could be improved.
.50 - .60      Suggests a need to revise the test, unless it is quite short (ten or fewer items). The test must be supplemented by other measures for grading.
Below .50      Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
References

• DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage Publications.
• Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
• Lord, F. M. (1952). The relation of the reliability of multiple-choice tests to the distribution of item difficulties. Psychometrika, 17(2), 181-194.
References
• http://korbedpsych.com/R09eValidity.html

• http://www1.udel.edu/educ/gottfredson/451/unit3-
chap4.htm

• https://slideplayer.com/slide/5737496/
Thank you for Listening!
