EFPA
Users of this document and its contents are required by EFPA to acknowledge this source with the following text:
The EFPA Test Review Criteria were largely modelled on the form and content of the British
Psychological Society's (BPS) test review criteria and criteria developed by the Dutch Committee on
Tests and Testing (COTAN) of the Dutch Association of Psychologists (NIP). EFPA is grateful to the
Test Review Form Version 4.2.6
09-04-2013
Page 1
CONTENTS
1 Introduction
PART 1
2 General description
3 Classification
PART 2
9 Norms
9.1 Norm-referenced interpretation
10 Reliability
11 Validity
13 Final evaluation
PART 3 BIBLIOGRAPHY
1 Introduction
The main goal of the EFPA Test Review Model is to provide a description and a detailed and rigorous
assessment of the psychological assessment tests, scales and questionnaires used in the fields of Work,
Education, Health and other contexts. This information will be made available to test users and
professionals in order to improve tests and testing and help them to make the right assessment decisions.
The EFPA Test Review Model is part of the information strategy of the EFPA, which aims to provide
evaluations of all necessary technical information about tests in order to enhance their use (Evers et al.,
2012; Muñiz & Bartram, 2007). Following the Standards for Educational and Psychological Testing, the
label "test" is used for any "... evaluative device or procedure in which a sample of an examinee's behaviour in
a specified domain is obtained and subsequently evaluated and scored using a standardized process"
(American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education, 1999, p. 3). Therefore, this review model applies to all instruments that are
covered under this definition, whether called a scale, questionnaire, projective technique, or otherwise.
The original version of the EFPA test review model was produced from a number of sources,
including the BPS Test Review Evaluation Form (developed by Newland Park Associates Limited, NPAL,
and later adopted by the BPS Steering Committee on Test Standards); the Spanish Questionnaire for the
Evaluation of Psychometric Tests (developed by the Spanish Psychological Association) and the Rating
System for Test Quality (developed by the Dutch Committee on Tests and Testing of the Dutch Association
of Psychologists). Much of the content was adapted with permission from the review proforma originally
developed in 1989 by Newland Park Associates Ltd for a review of tests used by training agents in the UK
(see Bartram, Lindley & Foster, 1990). This was subsequently used and further developed for a series of
BPS reviews of instruments for use in occupational assessment (e.g., Bartram, Lindley, & Foster, 1992;
Lindley et al., 2001).
The first version of the EFPA review model was compiled and edited by Dave Bartram (Bartram,
2002a, 2002b) following an initial EFPA workshop in March 2000 and subsequent rounds of consultation.
A major update and revision was carried out by Patricia Lindley, Dave Bartram, and Natalie Kennedy for
use in the BPS review system (Lindley et al., 2004). This was subsequently adopted by EFPA in 2005
(Lindley et al., 2005) with minor revisions in 2008 (Lindley et al., 2008). The current version of the model
has been prepared by a Task Force of the EFPA Board of Assessment, whose members are Arne Evers
(Chair, the Netherlands), Carmen Hagemeister (Germany), Andreas Høstmælingen (Norway), Patricia
Lindley (UK), José Muñiz (Spain), and Anders Sjöberg (Sweden). In this version the notes and checklist
for translated and adapted tests produced by Pat Lindley and the Consultant Editors of the UK test
reviews have been integrated (Lindley, 2009). The texts of some major updated passages are based on
the revised Dutch rating system for test quality (Evers, Lucassen, Meijer, & Sijtsma, 2010; Evers, Sijtsma,
Lucassen, & Meijer, 2010).
The EFPA test review model is divided into three main parts. In the first part (Description of the
instrument) all the features of the test evaluated are described in detail. In the second part (Evaluation of
the instrument) the fundamental properties of the test are evaluated: Test materials, norms, reliability,
validity, and computer generated reports, including a global final evaluation. In the third part
(Bibliography), the references used in the review are included.
As important as the model itself is the proper implementation of the model. The current version of the
model is intended for use by two independent reviewers, in a peer review process similar to the usual
evaluation of scientific papers and projects. A consulting editor will oversee the reviews and may call in a
third reviewer if significant discrepancies between the two reviews are found. Some variations in the
procedure are possible, whilst ensuring the competence and independence of the reviewers, as well as
the consulting editor. EFPA recommends that the evaluations in these reviews are directed towards
qualified practising test users, though they should also be of interest to academics, test authors and
specialists in psychometrics and psychological testing.
Another key issue is the publication of the results of a test's evaluation. The results should be
available for all professionals and users (either paid or for free). A good option is that results are available
on the website of the National Psychological Association, although they could also be published by third
parties or in other media such as journals or books.
The intention of making this model widely available is to encourage the harmonisation of review
procedures and criteria across Europe. Although harmonisation is one of the objectives of the model,
another objective is to offer a system for test reviews to countries which do not have their own review
procedures. Local issues may necessitate changes in the EFPA Test Review Model or in
the review procedures when countries start to use the Model. Therefore, the Model is called a "Model" to
stress that local adaptations are possible to guarantee a better fit with local needs.
Comments on the EFPA test review model are welcomed in the hope that the experiences of users
will be instrumental in improving and clarifying the processes.
PART 1
2 General description
This section of the form should provide the basic information needed to identify the instrument and where
to obtain it. It should give the title of the instrument, the publisher and/or distributor, the author(s),
the date of original publication and the date of the version that is being reviewed.
The questions 2.1.1 through 2.7.3 should be straightforward. They are factual information, although some
judgment will be needed to complete information regarding content domains.
Reviewer¹
Current date
2.1.1 Instrument name
2.1.2 Short name of the test (if applicable)
2.2 Original test name (if the local version is an adaptation)
2.3
2.4
2.5
2.6
2.7.1
2.7.2
2.7.3
¹ Each country can decide either to publish the reviewers' names when the integrated review is published or to opt for
anonymous reviewing.
² This information should be filled in by the editor or the administration.
General description of the instrument
Short stand-alone non-evaluative description (200-600 words)
A concise non-evaluative description of the instrument should be given here. The description should
provide the reader with a clear idea of what the instrument claims to be - what it contains, the scales it
purports to measure etc. It should be as neutral as possible in tone. It should describe what the
instrument is, the scales it measures, its intended use, the availability and type of norm groups, general
points of interest or unusual features and any relevant historical background. This description may be
quite short (200-300 words). However, for some of the more complex multi-scale instruments, it will need
to be longer (300-600 words). It should be written so that it can stand alone as a description of the
instrument. As a consequence it may repeat some of the more specific information provided in response
to sections 2-6. It should outline all versions of the instrument that are available and referred to on
subsequent pages.
This item should be answered from information provided by the publisher and checked for accuracy by
the reviewer.
3 Classification
3.1
3.2
Clinical
Advice, guidance and career choice
Educational
Forensic
General health, life and well-being
Neurological
Sports and Leisure
Work and Occupational
Other (please describe):
3.3
3.4
3.5
Response mode
This item should be answered from information provided by the publisher.
If any special pieces of equipment (other than those indicated in the list of options, e.g. a digital
recorder) are required, they should be described here. In addition, any special testing conditions
should be described. "Standard testing conditions" are assumed to be available for proctored/supervised
assessment. These would include a quiet, well-lit and well-ventilated room with adequate desk space and
seating for the necessary administrator(s) and candidate(s).
3.6
3.8
Ipsativity
As mentioned in 3.7 multiple choice mixed
scale alternatives may result in ipsative
scores. Distinctive for ipsative scores is
that the score on each scale or dimension
is constrained by the scores on the other
scales or dimensions. In fully ipsative
instruments the sum of the scale scores is
constant for each person. Other scoring
procedures can result in ipsativity (e.g.
subtraction of each person's overall mean
from each of their scale scores).
No, multiple choice mixed scale alternatives not resulting in ipsative scores
Not relevant
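The mean-subtraction procedure mentioned in 3.8 can be illustrated with a minimal sketch. The function name `ipsatize` and the raw scores are hypothetical and serve only as an illustration; they are not part of the EFPA model.

```python
def ipsatize(scale_scores):
    """Subtract a person's overall mean from each of their scale scores."""
    mean = sum(scale_scores) / len(scale_scores)
    return [score - mean for score in scale_scores]


# Hypothetical raw scores of two persons on the same four scales.
person_a = [12, 8, 15, 5]
person_b = [20, 18, 22, 20]

for raw in (person_a, person_b):
    ipsative = ipsatize(raw)
    # The ipsatized scale scores sum to the same constant (zero) for every
    # person, so each score is constrained by the scores on the other scales.
    print(ipsative, sum(ipsative))
```

Because the sum is fixed, a high score on one scale necessarily implies lower scores elsewhere, which is the constraint that 3.8 describes.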
3.9
3.10
3.12
experienced.
Preparation:
Administration:
Scoring:
Analysis:
Feedback:
3.13
4.1
Scores
This item should be completed by
reference to the publisher's information
and the manuals and documentation.
(Brief description of the scoring system to
obtain global and partial scores,
correction for guessing, qualitative
interpretation aids, etc.)
4.3
Percentile Based Scores
Centiles
5-grade classification: 10:20:40:20:10 centile
splits
Deciles
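As an illustration of the percentile-based score types listed above, the following minimal sketch converts a centile into a 5-grade classification with 10:20:40:20:10 splits and into a decile. The function names, the numbering of grades from 1 (lowest) to 5 (highest), and the rule that a centile falling exactly on a boundary belongs to the lower band are assumptions made for this example, not EFPA prescriptions.

```python
import math


def five_grade(centile):
    """Map a centile (0-100) to a grade 1-5 using 10:20:40:20:10 splits."""
    if not 0 <= centile <= 100:
        raise ValueError("centile must lie between 0 and 100")
    # Cumulative upper bounds of grades 1 to 4 (10, 10+20, 10+20+40, ...).
    upper_bounds = [10, 30, 70, 90]
    for grade, bound in enumerate(upper_bounds, start=1):
        if centile <= bound:  # assumption: boundary values go to the lower band
            return grade
    return 5


def decile(centile):
    """Map a centile (0-100) to a decile 1-10."""
    if not 0 <= centile <= 100:
        raise ValueError("centile must lie between 0 and 100")
    return max(1, math.ceil(centile / 10))


print(five_grade(50), decile(50))
print(five_grade(95), decile(95))
```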
4.4
Computer generated reports
Note that this section is purely descriptive. Evaluations of the reports should be given in the
Evaluation part of the review.
For instances where there are multiple generated reports available, please complete items 5.2 - 5.13 for
each report or substantive report section (copy pages as necessary). This classification system could be
used to describe two reports provided by a system; for example, Report 1 may be intended for the test
taker or other untrained users, and Report 2 for a trained user who is competent in the use of the
instrument and understands how to interpret it.
5.1
5.3
and graphics
Unrelated
5.5
5.6
5.9
Directive/stipulative
Guidance/suggests hypotheses
Other (please describe):
Yes
No
6.1
Documentation provided by the distributor as part of the test package (select all that apply)
User Manual
Technical (psychometric) manual
Supplementary technical information and
updates (e.g. local norms, local validation
studies etc.)
Books and articles of related interest
Paper
CD or DVD
Internet download
Other (specify):
(Items 6.3 - 6.5 cover costs. This information is likely to be the most quickly out of date. It is
recommended that the supplier or publisher is contacted as near the time of publication of the review as
possible, to provide current information for these items.)
Start-up costs
Price of a complete set of materials (all manuals and other material sufficient for at least one sample
administration). Specify how many test takers could be assessed with the materials obtained for
start-up costs, and whether these costs include materials for recurrent assessment.
This item should try to identify the 'set-up' cost. That is, the costs involved in obtaining a full
reference set of materials, scoring keys and so on. It only includes training costs if the instrument is a
'closed' one - where there will be an unavoidable specific training cost, regardless of the prior
qualification level of the user. In such cases, the training element in the cost should be made explicit.
The initial costs do NOT include costs of general-purpose equipment (such as computers, DVD players
and so on). However, the need for these should be mentioned. In general, define: any special training
costs; costs of administrator's manual; technical manual(s); specimen or reference set of materials;
initial software costs, etc.
6.3.2
Recurrent costs
Specify, where appropriate, recurrent costs of administration and scoring separately from costs of
interpretation (see 6.4.1 - 6.5).
This item is concerned with the on-going cost of using the instrument. It should give the cost of the
instrument materials (answer sheets, non-reusable or reusable question booklets, profile sheets,
computer usage release codes or dongle units, etc.) per person per administration. Note that in most
cases, for paper-based administration such materials are not available singly but tend to be supplied in
packs of 10, 25 or 50.
Itemise any annual or per capita licence fees (including software release codes where relevant), costs
of purchases or leasing re-usable materials, and per candidate costs of non-reusable
by
6.6
None
Test specific accreditation
Accreditation in general achievement testing:
measures of maximum performance in
attainment (equivalent to EFPA Level 2)
Accreditation in general ability and aptitude
testing: measures of maximum performance in
relation to potential for attainment (equivalent to
EFPA Level 2)
Accreditation in general personality and
assessment: measures of typical behaviour,
attitudes and preferences (equivalent to EFPA
Level 2)
Other (specify):
None
Practitioner psychologist with qualification in the
relevant area of application
Practitioner psychologist
Research psychologist
Non-psychologist academic researcher
Practitioner in relevant related professions
(therapy, medicine, counselling, education,
human resources etc.). Specify:
EFPA Test User Qualification Level 1 or national
equivalent
EFPA Test User Qualification Level 2 or national
equivalent
Specialist qualification equivalent to EFPA Test
User Standard Level 3
Other (indicate):
PART 2
1. Manual and/or reports that are supplied by the publisher for the user:
These are always supplied by the publisher/distributor before the instrument is accepted by the
reviewing organisation and form the core materials for the review.
2. Open information that is available in the academic or other literature:
This is generally sourced by the reviewer and the reviewer may make use of this information in the
review and the instrument may be evaluated as having (or having not) made reference to the
information in its manual.
3. Information held by the publisher that is not formally published or distributed:
The distributor/publisher may make this available at the outset or may send it when the review is sent
back to the publisher to check for factual accuracy. The reviewer should make use of this information
but note very clearly at the beginning of the comments on the technical information that "the rating
given in this review refers to materials held by the publisher/distributor that are not [normally]
supplied to test users". If these materials contain valuable information, the overall evaluation should
recommend that the publisher publishes these reports and/or makes them available to test purchasers.
4. Information that is commercial in confidence:
In some instances, publishers may have technically important material that they are unwilling to make
public for commercial reasons. In practice there is very little protection available for intellectual property
to test developers (copyright law being about the only recourse). Such information could include
reports that cover the development of particular scoring algorithms, test or item generation procedures
and report generation technology. Where the content of such reports might be important in making a
judgement in a review, the association or organisation responsible for the review could offer to enter
into a non-disclosure agreement with the publisher. This agreement would be binding on the reviewers
and the editor. The reviewer could then evaluate the information and comment on the technical aspects
and the overall evaluation to the effect that "the rating given in this review refers to materials held by
the publisher/distributor that have been examined by the reviewers on a commercial in confidence
basis. These are not supplied to end users."
Explanation of ratings
All sections are scored using the following rating system (see table on next page). Detailed descriptions
giving anchor-points for each rating are provided.
Where a [ 0 ] or [ 1 ] rating is provided on an attribute that is regarded as critical to the safe use of an
instrument, the review will recommend that the instrument should only be used in exceptional
circumstances by highly skilled experts or in research.
The instrument review needs to indicate which, given the nature of the instrument and its intended use,
are the critical technical qualities. It is suggested that the convention to adopt is that ratings of these
critical qualities are then shown in bold print.
In the following sections, overall ratings of the adequacy of information relating to validity, reliability and
norms are shown, by default, in bold.
Any instrument with one or more [ 0 ] or [ 1 ] ratings regarding attributes that are regarded as
critical to the safe use of that instrument, shall not be deemed to have met the minimum standard.
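The minimum-standard rule stated above can be expressed as a short check. This is a minimal sketch under stated assumptions: ratings are held in a dictionary with `None` standing for [n/a], the attribute names are hypothetical, and a critical attribute rated [n/a] is simply skipped, which the text above does not specify.

```python
def meets_minimum_standard(ratings, critical_attributes):
    """Return False if any critical attribute is rated 0 or 1.

    ratings: dict mapping attribute name -> rating (0-4, or None for n/a).
    critical_attributes: names of the attributes regarded as critical
    to the safe use of the instrument.
    """
    for attribute in critical_attributes:
        rating = ratings.get(attribute)
        if rating is not None and rating <= 1:
            return False
    return True


# Hypothetical ratings for one instrument.
ratings = {"norms": 3, "reliability": 1, "validity": 2, "documentation": 0}

print(meets_minimum_standard(ratings, {"norms", "reliability", "validity"}))
print(meets_minimum_standard(ratings, {"norms", "validity"}))
```

In the first call the [ 1 ] rating on a critical attribute means the minimum standard is not met; in the second call the low ratings fall on attributes that were not designated as critical.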
Rating
Explanation*
[n/a]
Inadequate
Adequate
Good
Excellent
* A five point scale is defined by EFPA but each user can concatenate the points on the scale (for example
combining points 3 and 4 into a single point). The only constraint is that there must be a distinction made
between inadequate (or worse) on the one hand and adequate (or better) on the other. Descriptive terms or
symbols such as stars or smiley faces may be used in place of numbers. Where the five point scale is
replaced or customized, the user should provide a key that links the points and the nomenclature to the five
point scale of EFPA.
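A customised scale and its key can be checked against the constraint in the note above. The sketch below assumes that the EFPA points are numbered 0 to 4 and that points 0 and 1 count as inadequate or worse; the star labels and the function name are hypothetical.

```python
# Hypothetical key linking custom labels to the EFPA points they cover.
custom_key = {
    "no stars": [0, 1],
    "one star": [2],
    "two stars": [3, 4],  # points 3 and 4 combined into a single point
}


def key_is_valid(key):
    """Check that a custom key respects the EFPA constraint.

    Every EFPA point 0-4 must be covered exactly once, and no custom
    label may mix inadequate-or-worse points (0, 1) with
    adequate-or-better points (2, 3, 4).
    """
    covered = sorted(point for points in key.values() for point in points)
    if covered != [0, 1, 2, 3, 4]:
        return False
    return all(max(points) <= 1 or min(points) >= 2 for points in key.values())


print(key_is_valid(custom_key))
print(key_is_valid({"low": [0, 1, 2], "high": [3, 4]}))
```

The second key fails the check because its "low" label merges an adequate rating with inadequate ones, which removes the one distinction that must be preserved.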
In this section a number of ratings need to be given to various aspects or attributes of the documentation
supplied with the instrument (or package). The term documentation is taken to cover all those materials
supplied or readily available to the qualified user: e.g. the administrator's manual; technical handbooks;
booklets of norms; manual supplements; updates from publishers/suppliers and so on.
Suppliers are asked to provide a complete set of such materials for each Reviewer. If you think there is
something which users are supplied with which is not contained in the information sent to you for review,
please contact your review editor.
Rating
n/a
n/a
n/a
and the analysis model
n/a
n/a
n/a
Rating
n/a
Development
Excellent: Full details of item sources, development of
stimulus material according to accepted guidelines (e.g.
Haladyna, Downing, & Rodriguez, 2002; Moreno,
Martinez, & Muñiz, 2006), piloting, item analyses,
comparison studies and changes made during
development trials.
n/a
n/a
n/a
Standardisation
Excellent: Clear and detailed information provided about
sizes and sources of standardisation sample and
standardisation procedure.
Norms
Excellent: Clear and detailed information provided about
sizes and sources of norms groups, representativeness,
conditions of assessment etc.
n/a
n/a
n/a
n/a
Reliability
Excellent: Excellent explanation of reliability and
standard error of measurement (SEM), and a
comprehensive range of internal consistency, temporal
stability and/or inter-scorer and inter-judge reliability
measures and the resulting SEMs provided with
explanations of their relevance, and the generalisability
of the assessment instrument.
Construct validity
Excellent: Excellent explanation of construct validity with
a wide range of studies clearly and fairly described.
Criterion validity
Excellent: Excellent explanation of criterion validity with
a wide range of studies clearly and fairly described.
n/a
Rating
n/a
n/a
For norming
Excellent: Clear and detailed information provided, with
checks described to deal with possible errors in
norming.
n/a
n/a
n/a
n/a
n/a
n/a
n/a
Restrictions on use
Excellent: Clear descriptions of who should and who
should not be assessed, with well-explained
justifications for restrictions (e.g. types of disability,
literacy levels required etc.)
n/a
n/a
Overall adequacy
This overall rating for section 7 is obtained by using
judgment based on the overall ratings given for the subsections 7.1, 7.2, and 7.3.
Quality of the test materials
8.1 Quality of the test materials of paper-and-pencil tests (this sub-section can be skipped if not
applicable)
Items to be rated n/a or 0 to 4
8.1.1
Rating
n/a
n/a
n/a
n/a
n/a
n/a
Ease with which the test taker can understand the task
Quality of the test materials of CBT and WBT
(This sub-section can be skipped if not applicable)
Items to be rated n/a or 0 to 4
8.2.1
Rating
n/a
n/a
n/a
n/a
n/a
n/a
n/a
Ease with which the test taker can understand the task
n/a
Norms
General guidance on assigning ratings for this section
It is difficult to set clear criteria for rating the technical qualities of an instrument. These notes provide
some guidance on the kinds of values to be associated with inadequate, adequate, good and excellent ratings.
However, these are intended to act as guides only. The nature of the instrument, its area of application, the
quality of the data on which norms are based, and the types of decisions that it will be used for should all
affect the way in which ratings are awarded.
To give meaning to a raw test score two ways of scaling or categorizing raw scores can be distinguished
(American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education, 1999). First, a set of scaled scores or norms may be derived from the
distribution of raw scores of a reference group. This is called norm-referenced interpretation (see subsection 9.1). Second, standards may be derived from a domain of skills or subject matter to be mastered
(domain-referenced interpretation) or cut scores may be derived from the results of empirical validity
research (criterion-referenced interpretation) (see sub-section 9.2). With the latter two possibilities raw
scores will be categorized in two (for example "pass" or "fail") or more different score ranges, e.g. to assign
patients in different score ranges to different treatment programs, to assign pupils scoring below a critical
score to remedial teaching, or to accept or reject applicants in personnel selection.
9.1
Norm-referenced interpretation
(This sub-section can be skipped if not applicable)
The nature of the sample:
The balance of sources of the sample (e.g. a sample that is 95% German, with 2% British and 3% Italian,
is not a real international sample). A sample could be weighted to better reflect its different constituents.
The equivalence of the background (employment, education, conditions of testing, etc.) of the different
parts of the sample. Norm samples that do not allow this to be evaluated are insufficient.
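The remark that a sample could be weighted to better reflect its constituents can be illustrated with simple post-stratification weights (target share divided by sample share). This is only a sketch: the function name and the equal target shares are assumptions made for illustration and are not part of the review model.

```python
def poststratification_weights(sample_shares, target_shares):
    """Return one weight per group: target share / sample share."""
    weights = {}
    for group, sample_share in sample_shares.items():
        if sample_share <= 0:
            raise ValueError(f"group {group!r} has no respondents to weight")
        weights[group] = target_shares[group] / sample_share
    return weights


# Composition quoted in the text: 95% German, 2% British, 3% Italian.
sample = {"German": 0.95, "British": 0.02, "Italian": 0.03}
# Hypothetical target: the three groups should count equally.
target = {"German": 1 / 3, "British": 1 / 3, "Italian": 1 / 3}

weights = poststratification_weights(sample, target)
for group, weight in weights.items():
    print(f"{group}: weight {weight:.2f}")
```

Weights far above 1, as for the two small groups here, mean that a handful of respondents would dominate the weighted norms, which is one way to see why such a sample is hard to treat as truly international.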
The type of measure:
Where there are measures which have little or no verbal content, there will be less impact of
translation. This will apply to performance tests and to some extent to abstract and diagrammatic
reasoning tests, where there should be less impact on the scores.
The equivalence of the test version used with the different language samples.
There should be evidence that all the language versions are well translated/adapted
Is there any evidence that any of the groups have completed the test in a non-primary language?
The absence of these sources of evidence should be commented on in the Reviewers' comments at the end
of the section. Guidance about the generalisability of the norms beyond the groups included in the
international norms should be included in the manual for the instrument; e.g., if a norm is made up of
20% German, 20% French, 20% Italian, 20% British and 20% Dutch, it might be appropriate to use it as a
comparison group for Swiss or Belgian candidates, but it may not be appropriate to use it as a comparison
for a group of Chinese applicants.
9.1
Norm-referenced interpretation
Where an instrument is designed for use without recourse to norms or reference groups (e.g.,
ipsative tests designed for intra-individual comparisons only), the "not applicable" category should
be used rather than "no information given". However, the reviewer should evaluate whether the
reasoning to provide no norms is justified, otherwise the category "no information given" must be
used.
Appropriateness for local use, whether local or international norms
Note that for adapted tests only local (nationally based) or truly international norms are eligible
for the ratings 2, 3 or 4, even if construct equivalence across cultures is found. Where
measurement invariance issues arise, separate norms should be provided for (sub)groups and any
issues encountered should be explained.
Not applicable
n/a
No information given
1
Local sample(s) that do(es) not fit well with the relevant application domain but could be
used with caution
Local country samples or relevant international samples with good relevance for
intended application
n/a
Adequate general population norms and/or range of norm tables, or adequate norms for
some but not all intended applications
Excellent range of sample relevant, age-related and sex-related norms with information
about other differences within groups (e.g. ethnic group mix)
Sample size (classical norming)
For most purposes, samples of fewer than 200 test takers will be too small, as the resolution
provided in the tails of the distribution will be very small. The SEmean for a z-score with
N = 200 is 0.071 of the SD, or just better than one T-score point. Although this degree of
inaccuracy may have only minor consequences in the centre of the distribution, the impact at the
tails of the distribution can be quite large (and these may be the score ranges that are most
relevant for the decisions to be taken on the basis of the test scores). If there are
international norms, then in general, because of their heterogeneity, these need to be larger
than the typical requirements for local samples. Different guideline figures are given for
low-stakes and high-stakes use. Generally, high-stakes use is where a non-trivial decision is
based at least in part on the test score (e
Low-stakes use
High-stakes decisions
n/a
e.g. 200-299
e.g. 200-299
e.g. 300-399
e.g. 400-999
e.g. 1000
Inadequate sample size
Adequate sample size
e.g. 1000
Excellent sample size
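The figure of 0.071 SD quoted for N = 200 follows from the standard error of a mean, SD / sqrt(N). The sketch below reproduces it and converts it to T-score points. The function names are illustrative, and the T-scale SD of 10 is the conventional value rather than something stated in this form.

```python
import math


def se_mean_in_sd_units(n):
    """Standard error of the mean, expressed in SD units (i.e. for z-scores)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)


def se_mean_in_t_points(n, t_sd=10.0):
    """The same standard error on a T-score scale (SD assumed to be 10)."""
    return t_sd * se_mean_in_sd_units(n)


for n in (100, 200, 400, 1000):
    print(n, round(se_mean_in_sd_units(n), 3), round(se_mean_in_t_points(n), 2))
```

Because the standard error shrinks only with the square root of N, doubling the norm group from 200 to 400 reduces the error by about 29%, not by half.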
9.1.4
n/a
Good sample size (e.g. 8 subgroups with 100 - 149 respondents each)
Excellent sample size (e.g. 8 subgroups with at least 150 respondents each)
9.1.5
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
[ ]
Non-probability sample - convenience
Non-probability sample - quota
[ ]
[ ]
Non-probability sample - snowball
Non-probability sample - purposive
[ ]
[ ]
9.1.6
n/a
Adequate
Good
9.1.7
n/a
0
with minimal analysis
Excellent range of analyses and discussion of relevant issues relating to use and
interpretation
9.1.8
How old are the normative studies?
Not applicable
9.1.9
n/a
Excellent, norms less than 10 years old
performance tests)
n/a
[ ]
[ ]
Application norms
9.2
[ ]
Criterion-referenced interpretation
(This sub-section can be skipped if not applicable)
To determine the critical score(s) one can differentiate between procedures that make use of the judgment
of experts (these methods are also referred to as domain-referenced norming, see sub-category 9.2.1)
and procedures that make use of actual data with respect to the relation between the test score and an
external criterion (referred to as criterion-referenced in the restricted sense, see sub-category 9.2.2).
9.2.1
Domain-referenced norming
9.2.1.1
If the judgment of experts is used to determine the critical score, are the judges appropriately
selected and trained?
Judges should have knowledge of the content domain of the test and they should be
appropriately trained in judging (the work of) test takers and in the use of the standard setting
procedure applied. The procedure of the selection of judges and the training offered must be
described.
Not applicable
9.2.1.2
n/a
No information given
Inadequate
Adequate
Good
Excellent
If the judgment of experts is used to determine the critical score, is the number of judges used
adequate?
The required number of judges depends on the tasks and the contexts. The numbers
suggested should be considered as an absolute minimum.
Not applicable
n/a
No information given
0
9.2.1.3
If the judgment of experts is used to determine the critical score, which standard setting
procedure is reported? (select one)
Nedelsky
[ ]
Angoff
[ ]
Ebel
[ ]
[ ]
[ ]
Beuk
[ ]
Hofstee
[ ]
Other, describe:
9.2.1.4
[ ]
If the judgment of experts is used to determine the critical score, which method to compute
inter-rater agreement is reported? (select one)
Coefficient p0
[ ]
Coefficient Kappa
[ ]
Coefficient Livingston
[ ]
[ ]
[ ]
Other, describe:
9.2.1.5
[ ]
If the judgment of experts is used to determine the critical score, what is the size of the inter-rater agreement coefficients (e.g. Kappa or ICC)?
In the scientific literature there are no unequivocal standards for the interpretation of these
kinds of coefficients, although generally values below .60 are considered insufficient. Below
the classification of Shrout (1998) is followed. Using the classification needs some caution,
because the prevalence or base rate may affect the value of Kappa.
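The caution about prevalence can be made concrete with a small calculation. The sketch below computes observed agreement (p0) and Cohen's Kappa from a judges-by-judges agreement table; the two tables are made-up data chosen so that p0 is identical while the base rate differs, and the function names are illustrative only.

```python
def observed_agreement(table):
    """Proportion of cases on the diagonal of a square agreement table (p0)."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


def cohen_kappa(table):
    """Cohen's Kappa: agreement corrected for chance agreement from the marginals."""
    total = sum(sum(row) for row in table)
    k = len(table)
    p0 = observed_agreement(table)
    row_shares = [sum(table[i]) / total for i in range(k)]
    col_shares = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    pe = sum(r * c for r, c in zip(row_shares, col_shares))
    return (p0 - pe) / (1.0 - pe)


# Rows: judge A (pass, fail); columns: judge B (pass, fail). Made-up counts.
balanced = [[45, 5], [5, 45]]   # about half of the candidates pass
skewed = [[85, 5], [5, 5]]      # about 90% of the candidates pass

for name, table in (("balanced", balanced), ("skewed", skewed)):
    print(name, round(observed_agreement(table), 2), round(cohen_kappa(table), 2))
```

Both tables show 90% raw agreement, yet Kappa is clearly lower for the skewed table, so a fixed cut-off such as .60 should be read together with the base rate.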
Not applicable
9.2.1.6
No information given
Not applicable
9.2.1.7
No information given
[ ]
[ ]
[ ]
9.2.2
Criterion-referenced norming
9.2.2.1
If the critical score is based on empirical research, what are the results and the quality of this
research?
To answer this question no explicit guidelines can be given as to which level of relationship is
acceptable, not only because what is considered high or low may differ for each criterion to
be predicted, but also because prediction results will be influenced by other variables such as
base rate or prevalence. Therefore, the reviewer has to rely on his/her expertise for his/her
judgment. Also the composition of the sample used for this research (is it similar to the group
for which the test is intended, more heterogeneous, or more homogeneous?) and the size of
this group must be taken into account.
n/a
Not applicable
9.2.2.2
No information given
Inadequate
Adequate
Good
Excellent
How old are the normative studies?
n/a
Not applicable
9.2.2.3
No information given
[ ]
[ ]
Overall adequacy
This overall rating is obtained by using judgment based on the ratings given for items 9.1 -
9.2.2.3.
The overall rating for norm-referenced interpretation can never be higher than the rating for
the sample-size item, but it can be lower, depending on the other information provided. Of this
other information, the representativeness and the ageing of the norms are especially relevant.
If non-probability norm groups are used, the quality of the norms can at most be rated as
"adequate", and only when the description of the norm group shows that the distribution on
relevant variables is similar to that of the target or referred group. The overall rating should
reflect the characteristics of the largest and most meaningful norms rather than an "average"
across all published norms.
The overall rating for criterion-referenced interpretation, in case judges are used to
determine the critical score, can never be higher than the rating for the size of the
inter-rater agreement, but it can be lower, depending on the other information provided. Of
this other information, the correct application of the method concerned and the quality,
training and number of the judges are especially important. If the critical score is based on
empirical research, the rating can never be higher than the rating for item 9.2.2.1, but it can
be lower when the studies are too old.
n/a
Not applicable
No information given
Inadequate
Adequate
Good
Excellent
Reviewers' comments on the norms: Brief report about the norms and their history, including information
on provisions made by the publisher/author for updating norms on a regular basis. Comments pertaining to
non-local norms should be made here.
10
Reliability
General guidance on assigning ratings for this section
Reliability refers to the degree to which scores are free from measurement error variance (i.e. a range of
expected measurement error). For reliability, the guidelines are based on the need to have a small
standard error for estimates of reliability. Guideline criteria for reliability are given in relation to two
distinct contexts: the use of instruments to make decisions about groups of people (e.g. organisational
diagnosis) and their use for making individual assessments. Reliability requirements are higher for the
latter than for the former. Other factors can also affect reliability requirements, such as the kind of
decisions made and whether scales are interpreted on their own or aggregated with other scales into a
composite scale. In the latter case the reliability of the composite should be the focus for rating, not the
reliabilities of the components.
When an instrument has been translated and/or adapted from a non-local context, one could apply
reliability evidence of the original version to support the quality of the translated/adapted version. In this
case evidence of equivalence of the measure in a new language to the original should be proposed.
Without this it is not possible to generalise findings in one country/language version to another. For
internal consistency reliability evidence based on local groups is preferable, however, as this evidence is
more accurate and usually easy to get. For some guidelines with respect to establishing equivalence see
the introduction of the section on Validity. An aide memoire of critical points for comment when an
instrument has been translated and/or adapted from a non-local context is included in the Appendix.
It is difficult to set clear criteria for rating the technical qualities of an instrument. These notes provide
some guidance on the values to be associated with inadequate, adequate, good and excellent ratings.
However these are intended to act as guides only. The nature of the instrument, its area of application, the
quality of the data on which reliability estimates are based, and the types of decisions that it will be used
for should all affect the way in which ratings are awarded. Under some conditions a reliability of 0.70 is
fine; under others it would be inadequate. For these reasons, summary ratings should be based on your
judgment and expertise as a reviewer and not simply derived by averaging sets of ratings.
In order to provide some idea of the range and distribution of values associated with the various scales
that make up an instrument, enter the number of scales in each section. For example, if an instrument
being used for group-level decisions had 15 scales of which five had retest reliabilities lower than 0.6, six
between 0.60 and 0.70 and the other four in the 0.70 to 0.80 range, the median stability could be judged
as adequate (being the category in which the median of the 15 values falls). If more than one study is
concerned, first the median value per scale should be computed, taking the sample sizes into account; in
some cases results from a meta-analysis may be available, these can be judged in the same way. This
would be entered as:
Number of scales (if applicable)    M*
[-]                                 0
[5]                                 1
[6]                                 2
[4]                                 3
[0]                                 4
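The worked example (15 scales: five rated inadequate, six adequate, four good) can be checked with a few lines of code. This is a minimal sketch: the helper name is hypothetical, and the mapping of the codes 1 to 4 onto Inadequate, Adequate, Good and Excellent is assumed from the order of the rating options used in this form.

```python
from statistics import median_low

# Assumed coding of the rating categories (0 = no information given).
RATING_LABELS = {1: "Inadequate", 2: "Adequate", 3: "Good", 4: "Excellent"}


def median_rating(scale_counts):
    """Return the rating category in which the median scale falls.

    scale_counts maps a rating code to the number of scales with that rating.
    """
    ratings = []
    for rating, count in sorted(scale_counts.items()):
        ratings.extend([rating] * count)
    if not ratings:
        raise ValueError("no scales were rated")
    # median_low always returns an existing category, never a half-way value.
    return median_low(ratings)


example_counts = {1: 5, 2: 6, 3: 4, 4: 0}
m = median_rating(example_counts)
print(m, RATING_LABELS[m])
```

With 15 values the median is the 8th value in rank order, which lies among the six scales in the adequate category, in line with the judgment given in the text.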
For each of the possible ratings example values are given for guidance only - especially the distinctions
between Adequate, Good and Excellent. For high stakes decisions, such as personnel selection, these
example values will be .10 higher. However, it needs to be noted that decisions are often based on
aggregate scale scores. Aggregates may have much higher reliabilities than their component primary
scales. For example, primary scales in a multi-scale instrument may have reliabilities around 0.70 while
Big Five secondary aggregate scales based on these can have reliabilities in the 0.90s. Good test
manuals will report the reliabilities of secondary as well as primary scales.
It is realised that it may be impossible to calculate actual median figures in many cases. What is required
is your best estimate, given the information provided in the documentation. There is space to add
comments. You can note here any concerns you have about the accuracy of your estimates. For example,
in some cases, a very high level of internal consistency might be commented on as indicating a bloated
specific.
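Why aggregates can reach reliabilities in the 0.90s when their primary scales are around 0.70 can be seen from the classical formula for the reliability of a sum of components (one minus the summed error variance over the composite variance). The sketch below is a simplification under stated assumptions: standardised components, equal reliabilities, equal intercorrelations and uncorrelated errors. The six scales and the intercorrelation of 0.50 are hypothetical values, not figures from the form.

```python
def composite_reliability(k, component_reliability, mean_intercorrelation):
    """Reliability of an unweighted sum of k standardised scales.

    Assumes every scale has the same reliability, every pair of scales
    correlates mean_intercorrelation, and measurement errors are uncorrelated.
    """
    composite_variance = k + k * (k - 1) * mean_intercorrelation
    error_variance = k * (1.0 - component_reliability)
    return 1.0 - error_variance / composite_variance


# Hypothetical secondary scale built from six primary scales.
print(round(composite_reliability(6, 0.70, 0.50), 3))
# With weakly related components the gain is much smaller.
print(round(composite_reliability(6, 0.70, 0.10), 3))
```

The gain depends strongly on how highly the components correlate, which is one reason why the manual should report the reliabilities of secondary scales directly rather than leave them to be inferred.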
10
Reliability
10.1
Only one reliability coefficient given (for each scale or subscale)
Only one estimate of standard error of measurement given (for each scale or subscale)
Reliability coefficients for a number of different groups (for each scale or subscale)
Internal consistency
The use of internal consistency coefficients is not sensible for assessing the reliability of speed
tests, heterogeneous scales (also called empirical or criterion-keyed scales; Cronbach,
1970), effect indicators (Nunnally & Bernstein, 1994) and emergent traits (Schneider & Hough,
1995). In these cases all items concerning internal consistency should be marked "not
applicable". It is also biased as a method for estimating reliability of ipsative scales. Alternate
form or retest measures are more appropriate for these scale types.
Internal consistency coefficients give a better estimate of reliability than split-half coefficients
corrected with the Spearman-Brown formula. Therefore, the use of split-halves is only justified
if, for any reason, information about the answers on individual items is not available. Split-half
coefficients can be reported in item 10.7 (Other methods).
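For readers who want to see the two kinds of coefficient side by side, the sketch below computes Cronbach's alpha from item-level scores and applies the Spearman-Brown correction to a split-half correlation. The scores are made up for illustration and the function names are not taken from any particular package.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores holds one row of item scores per test taker."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


def spearman_brown(half_correlation):
    """Step a correlation between two test halves up to full test length."""
    return 2.0 * half_correlation / (1.0 + half_correlation)


# Made-up scores of five test takers on four items.
scores = [
    [1, 2, 1, 2],
    [2, 2, 3, 3],
    [3, 4, 3, 4],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
]
print(round(cronbach_alpha(scores), 3))
print(round(spearman_brown(0.5), 3))
```

Alpha needs the item-level answers, whereas the split-half route only needs the two half scores, which matches the situation in which the text allows split-halves to be used.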
10.2.1
Sample size
Not applicable
n/a
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
10.2.3
Kind of coefficients reported (select as many as applicable)
Not applicable
n/a
Lambda-2
Other, describe: ..
Number of scales
(if applicable)
Size of coefficients
M*
Not applicable
10.2.4
n/a
No information given
[ ]
[ ]
[ ]
[ ]
[ ]
. do not match the intended test takers, leading to more favourable coefficients (e.g.
inflation by artificial heterogeneity)
n/a
10.3
Sample size
Not applicable
10.3.2
n/a
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
Number of scales
(if applicable)
M*
Not applicable
n/a
No information given
[ ]
[ ]
[ ]
10.3.3
[ ]
[ ]
Data provided about the test-retest interval (select or fill in test-retest interval)
10.3.4
Not applicable
n/a
No information given
. do not match the intended test takers, leading to more favourable coefficients (e.g.
inflation by artificial heterogeneity)
Not applicable
n/a
10.4
10.4.1
Sample size
Not applicable
n/a
10.4.2
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
Are the assumptions for parallelism* met for the different versions of the test for which
equivalence reliability is investigated?
*Note that tests can be considered to be parallel tests if in the same group the mean scores,
variances and correlations with other tests are the same.
Not applicable
10.4.3
n/a
No information given
Inadequate
Adequate
Good
Excellent
Number of scales
(if applicable)
Not applicable
M*
n/a
No information given
[ ]
[ ]
[ ]
[ ]
[ ]
10.4.4
n/a
10.5
10.5.1
Sample size
It is difficult to give uniform guidelines for the adequacy of sample sizes in case IRT methods
for the estimation of reliability are used, because the requirements are different in function of
the item response format and the item response model used. Dependent on the item response
model used minimum values for adequate sample sizes are: 200 for 1-parameter studies, 400
for 2-parameter studies, and 700 for 3-parameter studies (based on Parshall, Davey, Spray, &
Kalohn, 2001). These values apply to dichotomous models, but can be of some guidance for
the reviewer when polytomous models are used for which the sample sizes may be smaller.
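The guide values above lend themselves to a simple look-up. The sketch below encodes only the three minimum values quoted in the text for dichotomous models; the names are hypothetical, and the result is meant as a first screen, not as a substitute for the reviewer's judgment (for instance with polytomous models).

```python
# Minimum sample sizes quoted in the text, keyed by number of item parameters.
MIN_SAMPLE_SIZE = {1: 200, 2: 400, 3: 700}


def meets_minimum_sample_size(n_parameters, sample_size):
    """Check a study's sample size against the guide value for its IRT model."""
    if n_parameters not in MIN_SAMPLE_SIZE:
        raise ValueError("guide values exist only for 1-, 2- and 3-parameter models")
    return sample_size >= MIN_SAMPLE_SIZE[n_parameters]


for n_parameters, sample_size in ((1, 250), (2, 350), (3, 700)):
    print(n_parameters, sample_size, meets_minimum_sample_size(n_parameters, sample_size))
```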
n/a
Kind of coefficients reported (select as many as applicable)
The first method gives the reliability of the estimated latent trait, which in IRT replaces
the estimated true score, i.e. the test score (see Embretson & Reise, 2000). The second
method is based on information about the individual items and gives an estimate of
reliability when the typical requirements for IRT are met (Mokken, 1971). The third method
gives an estimate of the measurement precision relative to the position on the latent trait.
[ ]
Rho
[ ]
[ ]
Others, describe:
[ ]
n/a
Number of scales
(if applicable)
M*
n/a
No information given
[ ]
[ ]
[ ]
[ ]
[ ]
10.6
Inter-rater reliability
If the scoring of a test involves no judgmental processes (e.g. simply summing the scores of
multiple-choice items), this type of reliability is not required and all items concerning inter-rater
reliability should be marked not applicable. Note that although inter-rater reliability may not
apply to the test as a whole, it may apply to one or more subtests (e.g. some subtests of an
intelligence test).
Sample size
Not applicable
10.6.2
n/a
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
n/a
[ ]
[ ]
10.6.3
[ ]
[ ]
[ ]
Size of coefficients
To some methods mentioned in 10.6.2 the guide
numbers may not apply as no rs are computed.
Number of scales
(if applicable)
M*
n/a
No information given
[ ]
[ ]
[ ]
[ ]
[ ]
10.7
10.7.1
Sample size
Not applicable
n/a
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
10.7.2
10.7.3
Results
Number of scales
(if applicable)
n/a
No information given
Inadequate
M*
[ ]
[ ]
10.8
Adequate
[ ]
Good
[ ]
Excellent
[ ]
Overall Adequacy
This overall rating is obtained by using judgment based on the ratings given for items 10.1 -
10.7.3. Do not simply average numbers to obtain an overall rating.
For some instruments, internal consistency may be inappropriate (broad traits or scale
aggregates), in which case more emphasis on the retest data should be placed. In other cases
(state measures), retest reliabilities would be inappropriate, so emphasis should be placed on
internal consistencies. For your final judgment you should also take into account:
whether the test is used for individual assessment or to make decisions on groups of people
the nature of the decision (high-stakes vs. low-stakes)
whether one or more (types of) reliability studies are reported
whether also standard errors of measurement are provided
procedural issues, e.g. group size, number of reliability studies, heterogeneity of the
group(s) on which the coefficient are computed, number of raters if inter-rater agreement is
computed, length of the test-retest interval, etc.
comprehensiveness of the reporting on the reliability studies.
No information given
Inadequate
Adequate
Good
Excellent
Reviewers' comments on Reliability: Underline the strong and weak aspects of the evidence of
reliability available. Comments pertaining to equivalence/reliability generalisation should also be made
here (if applicable).
11 Validity
General guidance on assigning ratings for this section
Validity is the extent to which a test serves its purpose: can one draw the conclusions from the test scores
which one has in mind? In the literature many types of validity are differentiated, e.g. Drenth and Sijtsma
(2006, pp. 334-340) mention eight different types. The differentiations may have to do with the purpose of
validation or with the process of validation by specific techniques of data analysis. In the last decades of
the past century there was a growing consensus that validity should be considered as a unitary concept
and that differentiations in types of validity should be considered as different ways of gathering evidence
only (American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education, 1999). Borsboom, Mellenbergh, and Van Heerden (2004) state
that a test is valid for measuring an attribute if variation in the attribute causally produces variation in the
measured outcomes. Although this is a different approach, these authors also consider a
differentiation between types of validity not to be relevant.
However, whichever approach to validity one prefers, for a standardised judgment it is necessary to
structure the concept of validity a bit. For this reason, separate sub-sections on construct and criterion
validity are differentiated. Depending on the purpose of the test one of these aspects of validity may be
more relevant than the other. However, it is realized that construct validity is the more fundamental
concept and that evidence on criterion validity may add to establishing the construct validity of a test.
It is also realized that a test may have different validities depending on the type of decisions made with
the test, the type of samples used, etc. However, inherent in a test review system is that one quality
judgment is made about the (construct or criterion) validity of a test. This judgment should be a reflection
of the quality of the evidence supporting the claim that the test can be used for the interpretations that are
stated in the manual. The broader the intended applications, the more validity evidence the
author/publisher should deliver. Note that the final rating for construct and criterion validity will be a kind of
average of this evidence and that there may be situations or groups for which the test may have higher or
lower validities (or for which the validity may not have been studied at all).
When an instrument has been translated and/or adapted from a non-local context, evidence of
equivalence of the measure in the new language to the original should be provided. Without this it is not
possible to generalise findings in one country/language version to another. Examples of evidence of
equivalence:
Invariance in construct structure, e.g. via factor structure or correlation with standard measures.
Similar criterion-related validity, e.g. similar profile of correlations of a multi-scale instrument with
independent external criteria such as ratings of job competencies.
Items show similar patterns of scale loadings, e.g. items correlate in the same pattern with other scales;
strongest/weakest loading items are similar in original and new languages.
Bilingual candidates have similar profiles in two languages (cf. alternate form reliability).
Validity generalisation needs stronger evidence when translating tests across linguistic families (e.g. from
an Indo-European to a Semitic language). In such a situation equivalence is under greater threat because
of the differences in language structure and cultural differences. However, validity generalisation might be
inferred from evidence of validity invariance in previous translations when a test has been translated into
multiple languages. This applies, for instance, if a Swedish test has already been translated into French,
German and Italian and has been shown to be equivalent in these languages.
In considering the whole issue of equivalence, it may be useful to follow Van de Vijver and Poortinga's
(2005) classification:
Structural / functional equivalence
There is evidence that the source and target language versions measure the same psychological
constructs across groups. This is generally demonstrated by showing that patterns of correlations
between variables are the same across groups.
Measurement unit equivalence
There is evidence that the measurement units are the same, but there are different origins across
groups (i.e. individual differences found in group A can be compared with differences found in group
B, but the absolute raw scores for A and B are not directly comparable without some form of
rescaling).
Scalar / Full score equivalence
The same measurement unit and the same origin (i.e. raw scores have the same meanings and can
be compared across groups).
The benchmarks and the notes in the sub-sections 11.1 and 11.2 provide some guidance on the values to
be associated with inadequate, adequate, good and excellent ratings. However these are intended to act
as guides only. The nature of the instrument, its area of application, the quality of the data on which
validity estimates are based, and the types of decisions that it will be used for should all affect the way in
which ratings are awarded. For validity, guidelines on sample sizes are based on power analysis of the
sample sizes needed to find moderate sized validities if they exist.
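The power-analysis reasoning behind such sample-size guidelines can be sketched as follows. This is only an illustration using the textbook Fisher z approximation for a two-sided test; the function name, the default alpha of 0.05 and the default power of 0.80 are assumptions made here and are not taken from the EFPA model.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a population correlation r.

    Fisher z approximation, two-sided test. Hypothetical helper: the
    defaults are conventional choices, not values prescribed by the form.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


if __name__ == "__main__":
    # Smaller validities need markedly larger samples to be detected.
    for r in (0.20, 0.30, 0.50):
        print(f"r = {r:.2f}: n is roughly {n_for_correlation(r)}")
```

Under these assumptions the required sample grows quickly as the expected validity coefficient shrinks, which is why a single small study is rarely sufficient evidence.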
11.1
Construct validity
The purpose of construct validation is to find an answer to the question whether the test actually measures
the intended construct or, partly or mainly, something else. Common methods for the investigation of
construct validity are exploratory or confirmatory factor analysis, item-test correlations, comparison of
mean scores of groups for which score differences may be expected, testing for invariance of factor
structure and item-bias (DIF) for different groups, correlations with other instruments which are intended to
measure the same (convergent validity) or different constructs (discriminant validity),
Multi-Trait-Multi-Method research (MTMM), IRT methodology and (quasi-)experimental designs.
11.1
Construct validity
11.1.1
[ ]
[ ]
[ ]
[ ]
Testing for invariance of structure and differential item functioning across groups
[ ]
[ ]
[ ]
MTMM correlations
[ ]
IRT methodology
[ ]
(Quasi-)Experimental Designs
[ ]
Other, describe:
[ ]
Do the results of (exploratory or confirmatory) factor analysis support the structure of the
test?
11.1.3
No information given
Inadequate
Adequate
Good
Excellent
11.1.4
No information given
Inadequate
Adequate
Good
Excellent
Is the factor structure invariant across groups and/or is the test free of item-bias (DIF)?
This kind of research can be carried out on the basis of models within classical test theory or the
IRT framework. If item-bias is found, the effect on the total score should be estimated (small
effects are acceptable).
No information given
Inadequate
Adequate
11.1.5
Good
Excellent
Are the differences in mean scores between relevant groups as expected?
E.g. pupils in group 8 are expected to score higher than pupils in group 6 on a test for
numerical proficiency; children with an ADHD diagnosis should score higher on a test for
hyperactivity than children not diagnosed with ADHD; salespersons should score higher on a
test of commercial knowledge than the average working population. Even if the results are in
the expected direction, this type of research is usually inconclusive with respect to the
construct validity of the test. However, the value of this type of research is that when
expected differences are not shown, this would raise serious doubts about the construct
validity of the test.
No information given
Inadequate
Adequate
Good
Excellent
Median and range of the correlations between the test and tests measuring similar
constructs
An essential element of the process of construct validation is correlating the test score(s)
with scales from similar instruments, the so-called congruent or convergent validity. The
guidelines on congruent validity coefficients need to be interpreted flexibly. Where two very
similar instruments have been correlated (with data obtained concurrently) we would expect
to find correlations of 0.60 or more for adequate. Where the instruments are less similar, or
administration sessions are separated by some time interval, lower values may be
adequate. When evaluating congruent validity, care should be taken when interpreting very
high correlations. When correlations are above 0.90, the likelihood is that the scales in
question are measuring exactly the same construct. This is not a problem if the scales in
question represent a new scale and an established marker. It would be a problem though, if
the scale(s) in question was (were) meant to be adding useful variance to what other scales
already measure. The guidelines given concern correlations that are not adjusted for
common-method variance or attenuation. Therefore, the reliabilities of both instruments
should also be taken into account when judging the congruent validity coefficients. E.g., when
both instruments have a reliability of .75, the maximum correlation between the instruments
is .56. If reliabilities are higher, higher correlations are to be expected.
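The effect of unreliability on an observed correlation can be sketched with the classical attenuation relation. This is a minimal illustration under classical test theory assumptions; the function names are hypothetical. In that relation the ceiling on the observed correlation is the square root of the product of the two reliabilities, while the product itself is the ceiling on the shared variance (r squared). With both reliabilities at .75 the product is about .56 and its square root is .75.

```python
from math import sqrt


def max_observed_correlation(rel_x, rel_y):
    """Ceiling on the observed correlation when the true-score correlation is 1."""
    return sqrt(rel_x * rel_y)


def disattenuated(r_xy, rel_x, rel_y):
    """Classical correction for attenuation of an observed correlation."""
    return r_xy / sqrt(rel_x * rel_y)


if __name__ == "__main__":
    rel = 0.75
    print("ceiling on r:", max_observed_correlation(rel, rel))
    print("ceiling on shared variance (r squared):", rel * rel)
    # An observed 0.60 between two such instruments, corrected for attenuation:
    print("corrected r:", disattenuated(0.60, rel, rel))
```

The benchmarks in this form apply to uncorrected coefficients, so a corrected value like the one printed above is context for the reviewer and should not be rated directly against them.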
11.1.7
No information given
Excellent (r ≥ 0.75)
Do the correlations with other instruments show good discriminant validity with respect to
constructs that the test is not supposed to measure?
11.1.8
No information given
Inadequate
Adequate
Good
Excellent
11.1.9
No information given
Inadequate
Adequate
Good
Excellent
No information given
Inadequate
Adequate
Good
Excellent
Sample sizes
The guidelines below concern studies within the classical test theory framework. For the
estimation of item-parameters within IRT methodology adequate sample sizes are: more
than 200 for 1-parameter studies, more than 400 for 2-parameter studies and more than 700
for 3-parameter studies (based on Parshall, Davey, Spray, & Kalohn, 2001).
11.1.11
No information given
One large (e.g. sample size more than 200) or more than one adequate sized study
Inadequate quality
Adequate quality
Good quality
Excellent quality with wide range of relevant markers for convergent and divergent
validation
No information given
Inadequate
Adequate
Good
Excellent
11.2
Criterion-related validity
Criterion-related evidence of validity (concurrent and predictive validity) refers to studies where real-world
criterion measures (i.e. not other instrument scores) have been correlated with scales. Predictive studies
generally refer to situations where assessment was carried out at a qualitatively different point in time to
the criterion measurement - e.g. for a work-related selection measure intended to predict job success, the
instrument would have been carried out at the time of selection - rather than just being a matter of how
long the time interval was between instrument and criterion measurement. Studies can also be post-dictive,
for example, where scores on a potential selection test are correlated with job incumbents' earlier
line manager ratings of performance. Basically, evidence of criterion validity is required for all kinds of
tests. However, when it is explicitly stated in the manual that test use does not serve prediction purposes
(such as educational tests that measure progress), criterion validity can be considered not applicable.
11.2
Criterion-related validity
11.2.1
Type of criterion study or studies (select as many as are applicable)
Predictive
Concurrent
Post-dictive
11.2.2
Sample sizes
No information given
11.2.3
One large (e.g. sample size more than 200) or more than one adequate sized study
Inadequate quality
Adequate quality
Good quality
Excellent quality with respect to reliability and representation of the criterion construct
11.2.4
Strength of the relation between test and criteria
It is difficult to set clear criteria for rating the size of the criterion validity
coefficients of an instrument. A criterion-related validity of 0.20 can have considerable
utility in some situations, while one of 0.40 might be of little value in others. A
coefficient of 0.30 may be considered good in personnel selection, whereas in educational
situations higher coefficients are common. For these reasons, ratings should be based on
your judgment and expertise as a reviewer and not simply derived by averaging sets of
correlation coefficients. The guidelines given are based on Hemphill (2003; see also Meyer
et al., 2001) and refer to correlations that are not corrected for attenuation in either
predictor or criterion. However, coefficients may be corrected for restriction of range.
The ranges given below concern validity coefficients, as correlations between tests and
criteria are the most commonly used way of representing criterion validity. However,
particularly for use in clinical situations, data on the sensitivity and specificity of a
test may give more useful information on the relation between a test and a criterion. ROC
curves are a popular way of quantifying sensitivity and specificity. Swets (1988) gives an
overview of the values of ROC curves in different areas. For certain types of medical
diagnosis the values are between 0.81 and 0.97, for lie detection between 0.70 and 0.95,
and for educational achievement (pass/fail) between 0.71 and 0.94. These values can be used
as guidelines, but it is left to the expertise of the reviewer to decide to what extent the
test can make a useful contribution to the decision concerned. The same holds when other
indices are reported, such as the positive and negative predictive value of a test, the
likelihood ratio, etc.
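How sensitivity, specificity and a ROC value relate to test scores can be sketched as follows. This is a minimal illustration that assumes the ROC values referred to above are areas under the ROC curve. The function names, the cut-off and the example data are hypothetical and not taken from any manual.

```python
def sensitivity_specificity(scores, has_condition, cutoff):
    """Sensitivity and specificity when scores >= cutoff are flagged as positive."""
    tp = fn = tn = fp = 0
    for score, condition in zip(scores, has_condition):
        flagged = score >= cutoff
        if condition and flagged:
            tp += 1
        elif condition:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def area_under_roc(scores, has_condition):
    """Probability that a random case outscores a random non-case (ties count half)."""
    pos = [s for s, c in zip(scores, has_condition) if c]
    neg = [s for s, c in zip(scores, has_condition) if not c]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical screening data: test score and criterion status per person.
    scores = [12, 15, 9, 20, 7, 14, 18, 5]
    has_condition = [True, True, False, True, False, False, True, False]
    print(sensitivity_specificity(scores, has_condition, cutoff=12))
    print(area_under_roc(scores, has_condition))
```

Sensitivity and specificity depend on the chosen cut-off, whereas the area under the curve summarises discrimination across all cut-offs, which is why the two kinds of index are usually reported together.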
No information given
Excellent (r ≥ 0.50)
No information given
Inadequate
Adequate
Good
Excellent
Overall validity
When judging overall validity, it is important to bear in mind the importance placed on construct
validity as the best indicator of whether a test measures what it claims to measure. In some cases,
the main evidence of this could be in the form of criterion-related studies. Such a test might have
a rating of 'adequate' or better for criterion-related validity and a less than adequate one for
construct validity. In general, the rating for overall validity will be equal to either the
construct validity or the criterion-related validity rating, whichever is the greater. However,
depending on the purpose of the test, one of these types of evidence may be considered more relevant
than the other. The rating for overall validity should not be regarded as an average or as the
lowest common denominator.
11.3
Validity - overall adequacy
This overall rating is obtained by using judgment based on the ratings given for items
11.1.1 to 11.2.6. Do not simply average numbers to obtain an overall rating.
No information given
Inadequate
Adequate
Good
Excellent
12
Quality of computer generated reports
Judging computer-based reports is made difficult by the fact that many suppliers will, understandably, wish
to protect their intellectual property in the algorithms and scoring rules. In practice, sufficient information
should be available for review purposes from the technical manual describing the development of the
reporting process and its rationale, and through the running of a sample of test cases of score
configurations. Ideally the documentation should also describe the procedures that were used to test the
report generation for accuracy, consistency and relevance. For the purpose of reviewing, at least three
reports based on different score profiles, including the actual scores, should be provided, even if the
algorithms for generating the reports are confidential.
For each of the following attributes, some questions are stated that should help you make a judgment, and
a definition of an 'excellent' (4) rating is provided.
Items are to be rated n/a or 0 to 4; benchmarks are provided for an 'excellent' (4) rating.
12.1
Scope or coverage
Reports can be seen as varying in both their breadth and their specificity. Reports may also vary
in the range of people for whom they are suitable. In some cases it may be that separate tailored
reports are provided for different groups of recipients.
Does the report cover the range of attributes measured by the instrument?
Does it do so at a level of specificity justifiable in terms of the level of detail obtainable from
the instrument scores?
Can the 'granularity' of the report (i.e. the number of distinct score bands on a scale that are
used to map onto different text units used in the report) be justified in terms of the scales'
measurement errors?
Is the report designed for the same populations of people for whom the instrument was
developed? (e.g. groups for whom the norm groups are relevant, or for whom there is
relevant criterion data etc.).
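The question about report granularity can be made concrete by expressing the width of a score band in units of the standard error of measurement (SEM). This is only a sketch: the classical formula SEM = SD * sqrt(1 - reliability) is standard, but the function names, the equal-width bands and the example values are assumptions made here, and the form itself sets no numerical threshold.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM for a scale with the given SD and reliability."""
    return sd * sqrt(1.0 - reliability)


def band_width_in_sem(scale_range, n_bands, sd, reliability):
    """Width of one equal-width score band, expressed in SEM units."""
    band_width = scale_range / n_bands
    return band_width / standard_error_of_measurement(sd, reliability)


if __name__ == "__main__":
    # Hypothetical T-score scale: SD 10, reliability .84, reported range 20 to 80.
    sd, reliability, scale_range = 10.0, 0.84, 60.0
    for n_bands in (5, 20):
        ratio = band_width_in_sem(scale_range, n_bands, sd, reliability)
        print(f"{n_bands} bands: each band spans {ratio:.2f} SEM")
```

Bands narrower than about one SEM separate scores that the instrument cannot reliably distinguish, which a reviewer may take as a sign that the report is more fine-grained than the measurement supports.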
No information given
Inadequate
Adequate
Good
Excellent: Excellent fit between the scope of the instrument and the scope of the report,
with the level of specificity in the report being matched to the level of detail measured by
the scales. Good use made of all the scores reported from the instrument.
12.2
Reliability
How consistent are the reports in their interpretation of similar sets of score data?
If report content is varied (e.g. by random selection from equivalent text units), is this done in
a satisfactory manner?
Is the interpretation of scores and the differences between scores justifiable in terms of the
scale measurement errors?
No information given
Inadequate
Adequate
Good
Relevance or validity
The linkage between the instrument and the content of the report may be explained either within
the report or be separately documented. Where reports are based on clinical judgment, the
process by which the expert(s) produced the content and the rules relating scores to content
should be documented.
How strong is the relationship between the content of the report and the scores on the
instrument? To what degree does the report go beyond or diverge from the information
provided by the instrument scores?
Does the report content relate clearly to the characteristics measured by the instrument?
Does it provide reasonable inferences about criteria to which we might expect such
characteristics to be related?
What empirical evidence is provided to show that these relationships actually exist?
It is relevant to consider both the construct validity of a report (i.e. the extent to which it provides
an interpretation that is in line with the definition of the underlying constructs) and criterion
validity (i.e. where statements are made that can be linked back to empirical data).
No information given
Inadequate
Adequate
Good
Excellent: Relationship between the scales and the report content, with clear
justifications provided.
Is the content of the report and the language used likely to create impressions of
inappropriateness for certain groups?
Does the report make clear any areas of possible bias in the results of the instrument?
Are alternate language forms available? If so, have adequate steps been taken to ensure
their equivalence?
No information given
Inadequate
Adequate
Good
Excellent: Clear warnings and explanations of possible bias, available in all relevant user
languages.
Acceptability
This will depend substantially on the complexity of the language used in the report, the
complexity of the constructs being described and the purpose for which it is intended.
Is the form and content of the report likely to be acceptable to the intended recipients?
Is the report written in a language that is appropriate for the likely levels of numeracy and
literacy of the intended reader?
No information given
Inadequate
Adequate
Length
This is also an aspect of Practicality and should be reflected in the rating given for that, but
overly long reports may also indicate over-interpretation of scores. The length of reports is
therefore also rated separately. Generally, reports that on average take more than one page per
scale (excluding title pages, copyright notices, etc.) may be overlong and over-interpreted.
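The rule of thumb above can be sketched as a simple check. This is only an illustrative sketch: the function names, the parameters and the default threshold of one page per scale are assumptions made for the example and are not part of the review model. A flagged report still needs the reviewer's judgement.

```python
def pages_per_scale(total_pages, non_content_pages, n_scales):
    """Average number of content pages per scale.

    non_content_pages covers title pages, copyright notices and similar
    material that the guideline excludes from the count.
    """
    if n_scales <= 0:
        raise ValueError("n_scales must be a positive number")
    content_pages = total_pages - non_content_pages
    if content_pages < 0:
        raise ValueError("non_content_pages cannot exceed total_pages")
    return content_pages / n_scales


def may_be_overlong(total_pages, non_content_pages, n_scales, threshold=1.0):
    """True if the report averages more than `threshold` pages per scale.

    The default threshold of 1.0 mirrors the "more than one page per scale"
    guideline; it is a prompt for closer inspection, not a verdict.
    """
    return pages_per_scale(total_pages, non_content_pages, n_scales) > threshold
```

For example, a 12-page report with 2 title and copyright pages covering 5 scales averages 2 pages per scale and would be flagged, whereas a 7-page report with the same front matter averages exactly 1 page per scale and would not.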
No information given
Inadequate
Adequate
Good
Excellent
12.7
Overall adequacy of computer generated reports
This overall rating is obtained by using judgement based on the ratings given for items 12.1 -
12.6. Do not simply average the numbers to obtain an overall rating.
No information given
Inadequate
Adequate
Good
Excellent
13
Final evaluation
Evaluative report of the test
This section should contain a concise, clearly argued judgement about the test. It should describe
its pros and cons, and give some general recommendations about how and when it might be used,
together with warnings (where necessary) about when it should not be used. A summary of any
positive or negative points raised in relation to adapted and translated tests should be included
here. A checklist of important considerations for such instruments is appended as an aide
memoire to the notes from the relevant sections. Comment on these only where appropriate.
The evaluation should cover topics such as the appropriateness of the instrument for various
assessment functions or areas of application; any special training or special knowledge
required; whether the training requirements are set at the right level; ease of use; the quality
and quantity of information provided by the supplier and whether there is important information
that is not supplied to users; and whether there are issues arising from the instrument being
translated or adapted (see Appendix). Include comments on any research that is known to be
under way, and on the supplier's plans for future developments, refinements, etc.
Norms
10
Reliability-overall
11
Validity-overall
12
PART 3
BIBLIOGRAPHY
APPENDIX
An aide memoire of critical points for comment when an instrument has been translated and/or
adapted from a non-local context
Development
Evidence or discussion of
Basic psychometric properties
Norms
A local norm is provided
Non-local norm
International norms
The nature of the sample
The type of measure
The equivalence of the test version
Similarities of scores in different samples
Guidance about generalising the norms
Equivalence/ Reliability/Validity
Invariance in construct structure
Similar criterion related validity
Similar patterns of scale loadings
Alternate form reliability
Validity generalisation
Validity generalisation needs strong evidence
Validity generalisation can be inferred