Abstract
Classification systems help orthopaedic surgeons characterize a problem, suggest
a potential prognosis, and offer guidance in determining the optimal treatment
method for a particular condition. Classification systems also play a key role in
the reporting of clinical and epidemiologic data, allowing uniform comparison
and documentation of like conditions. A useful classification system is reliable
and valid. Although the measurement of validity is often difficult and sometimes impractical, reliabilityas summarized by intraobserver and interobserver reliabilityis easy to measure and should serve as a minimum standard for
validation. Reliability is measured by the kappa value, which distinguishes true
agreement of various observations from agreement due to chance alone. Some
commonly used classifications of musculoskeletal conditions have not proved to
be reliable when critically evaluated.
J Am Acad Orthop Surg 2002;10:290-297
Assessment of Reliability
Classifications of musculoskeletal
conditions have at least two central
functions. First, accurate classification characterizes the nature of a
problem and then guides treatment
decision making, ultimately improving outcomes. Second, accurate classification establishes an
expected outcome for the natural
history of a condition or injury, thus
forming a basis for uniform reporting of results for various surgical
and nonsurgical treatments. This
allows the comparison of results
from different centers purportedly
treating the same entity.
A successful classification system
must be both reliable and valid.
Reliability reflects the precision of a
classification system; in general, it
refers to interobserver reliability, the
agreement between different observers. Intraobserver reliability is
the agreement of one observer's repeated classifications of an entity.
Figure 1  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs of subcapital hip fractures.

                      Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            50           15          65
Nondisplaced         15           20          35
Total                65           35         100

Observed agreement = (50 + 20)/100 = 0.70
Chance-expected agreement = (65/100)(65/100) + (35/100)(35/100) = 0.545
Kappa = (0.70 - 0.545)/(1 - 0.545) = 0.34
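The arithmetic in Figure 1 follows the usual kappa formula, kappa = (observed agreement - chance agreement)/(1 - chance agreement). The short Python sketch below reproduces it for a 2 x 2 agreement table; the function and variable names are illustrative and are not part of the article.

```python
# Minimal sketch: Cohen's kappa for an agreement table (names are illustrative).

def cohens_kappa(table):
    """table[i][j] = cases rated category i by surgeon No. 1 and category j by surgeon No. 2."""
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    p_observed = sum(table[i][i] for i in range(len(table))) / total
    # Chance-expected agreement: sum over categories of the products of the marginal proportions.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Figure 1: rows are surgeon No. 1, columns are surgeon No. 2 (displaced, nondisplaced).
figure1 = [[50, 15],
           [15, 20]]
print(round(cohens_kappa(figure1), 2))  # observed 0.70, chance 0.545, kappa = 0.34
```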
Figure 2  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs, with a higher prevalence of displaced fractures than in Figure 1.

                      Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            65           15          80
Nondisplaced         15            5          20
Total                80           20         100

Observed agreement = (65 + 5)/100 = 0.70
Chance-expected agreement = (80/100)(80/100) + (20/100)(20/100) = 0.68
Kappa = (0.70 - 0.68)/(1 - 0.68) = 0.06
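Re-running the same illustrative function on the Figure 2 data shows the prevalence effect: observed agreement is again 0.70, but chance agreement rises to 0.68, so kappa falls to about 0.06.

```python
# Figure 2 data, computed with the cohens_kappa sketch shown after Figure 1.
# Same observed agreement (0.70), but displaced fractures predominate.
figure2 = [[65, 15],
           [15, 5]]
print(round(cohens_kappa(figure2), 2))  # chance agreement 0.68, kappa is about 0.06
```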
Sources of Disagreement
As mentioned, any given classification system must have a high degree
of reliability or precision. The degree of observer agreement obtained
is affected by many factors, including the precision of the classification
system. To improve reliability, these
other sources of disagreement must be identified and minimized.
Assessment of Commonly Used Orthopaedic Classification Systems
Although many classification systems have been widely adopted and
frequently used in orthopaedic surgery to guide treatment decisions,
few have been scientifically tested
for their reliability. A high degree of
reliability or precision should be a
minimum requirement before any
classification system is adopted. The
results of several recent studies that
have tested various orthopaedic classifications for their intraobserver and
interobserver reliability are summarized in Table 1.12-21
In general, the reliability of the
listed classification systems would
be considered low and probably
unacceptable. Despite this lack of
reliability, these systems are commonly used. Although Table 1 lists
only a limited number of systems,
they were chosen because they have
been subjected to reliability testing.
Many other classification systems
commonly cited in the literature
have not been tested; consequently,
there is no evidence that they are or
are not reliable. In fact, most classification systems for medical conditions and injuries that have been
tested have levels of agreement that
are considered unacceptably low.22,23
There is no reason to believe that
the classification systems that have
not been tested would fare any better. Four of the studies listed in
Table 1 are discussed in detail below.
In one study,15 kappa values for the AO type ranged from 0.67 for residents to 0.86 for attending surgeons.
Again, with more detailed AO subgroups, kappa values decreased
progressively. When all 27 categories were included, kappa values
ranged from 0.25 to 0.42. The conclusions of this study were that the
use of AO types A, B, and C produced levels of reliability that were
high and acceptable. However, subclassification into groups and subgroups was unreliable. The clinical
utility of using only the three types
was not addressed and awaits further study.
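To put values such as 0.86 and 0.25 in context, the verbal benchmarks of Landis and Koch5 are often used to label kappa ranges. The cut points in the sketch below come from that reference, not from this article, and the helper function is illustrative only.

```python
# Sketch of the Landis and Koch (1977) descriptive benchmarks for kappa values.
# The cut points are conventions from that reference, not from this article.

def landis_koch_label(kappa):
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.86))  # "almost perfect"
print(landis_koch_label(0.25))  # "fair"
```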
Several important aspects of this
study, aside from the results, merit
mention. This study showed that
it is not only the classification system that is being tested but also the observer. For
any classification system tested, it
is important to document the observers' experience because this
can substantially affect reliability.
One omission in this study15 was
the lack of discussion of observed
agreement and the prevalence of
fracture categories; these factors
have a distinct effect on observer
variability.
Campbell et al13 looked at the
reliability of acetabular bone defect
classifications in revision hip arthroplasty. One group of observers
included the originators of the classification system. This is the ultimate way to remove observer bias;
however, it lacks generalizability
because the originators would be
expected to have unusually high
levels of reliability. In this study,
preoperative radiographs of 33 hips
were shown to three different
groups of observers on two occasions at least 2 weeks apart. The
groups of observers were the three
originators, three reconstructive
orthopaedic surgeons, and three
senior residents. The three classifications assessed were those attributed to Gross,24 Paprosky,25 and the
American Academy of Orthopaedic
Surgeons.26 The unweighted kappa values for the three classifications are summarized in Table 1.
Table 1
Intraobserver and Interobserver Agreement in Orthopaedic Classification Systems

Study | Classification | Assessors | Intraobserver kappa (observed agreement, %) | Interobserver kappa (observed agreement, %)
Brady et al12 | Periprosthetic femur fractures (Vancouver) | Reconstructive orthopaedic surgeons, including originator; residents | 0.73-0.83* | 0.60-0.65*
Campbell et al13 | Acetabular bone defect in revision total hip (AAOS26) | Reconstructive orthopaedic surgeons, including originators | 0.05-0.75* | 0.11-0.28*
Campbell et al13 | Acetabular bone defect in revision total hip (Gross24) | Reconstructive orthopaedic surgeons, including originators | 0.33-0.55* | 0.19-0.62*
Campbell et al13 | Acetabular bone defect in revision total hip (Paprosky25) | Reconstructive orthopaedic surgeons, including originators | 0.27-0.60* | 0.17-0.41*
Ward et al14 | … | Pediatric orthopaedic surgeons | 0.20-0.44*, 0.32-0.59 (45-61) | 0.01-0.42*, 0.05-0.55 (14-61)
… | … | Attending surgeons, fellows, residents, nonclinicians | 0.25-0.42* | 0.33*
Sidor et al16 | Proximal humerus (Neer) | Shoulder surgeon, radiologist, residents | 0.50-0.83* (62-86) | 0.43-0.58*
Siebenrock et al17 | Proximal humerus (Neer) | Shoulder surgeons | 0.46-0.71 | 0.25-0.51
Siebenrock et al17 | Proximal humerus (AO/ASIF) | Shoulder surgeons | 0.43-0.54 | 0.36-0.49
McCaskie et al18 | Quality of cement grade in THA | Experts in THA, consultants, residents | 0.07-0.63* | 0.04*
Lenke et al19 | Scoliosis (King) | Spine surgeons | 0.34-0.95* (56-85) | 0.21-0.63* (55)
Cummings et al20 | Scoliosis (King) | Pediatric orthopaedic surgeons, spine surgeons, residents | 0.44-0.72* | 0.44*
Haddad et al21 | Femoral bone defect in revision total hip (AAOS,30 Mallory,28 Paprosky et al29) | Reconstructive orthopaedic surgeons | 0.43-0.62* | 0.12-0.29*

* Unweighted kappa; values without an asterisk are weighted kappa.
Summary
Classification systems are tools for
identifying injury patterns, assessing
prognoses, and guiding treatment
decisions. Many classification systems have been published and widely adopted in orthopaedics without
information available on their reliability. Classification systems should
consistently produce the same results.
A system should, at a minimum,
have a high degree of intraobserver
and interobserver reliability. Few
systems have been tested for this reliability, but those that have been
tested generally fall short of acceptable levels of reliability. Because
most classification systems have poor
reliability, their use to differentiate
treatments and suggest outcomes is
not warranted. A system that has not
been tested cannot be assumed to be
reliable. The systems used by orthopaedic surgeons must be tested for
reliability, and if a system is not
found to be reliable, it should be
modified or its use seriously questioned. Improving reliability involves
looking at many components of the
classification process.1
Methodologies exist to assess
classifications, with the kappa value
the standard for measuring observer reliability. Once a system is
found to be reliable, the next step is
to prove its utility. Only when a
system is shown to be reliable
should it be widely adopted by the
medical community. This should
not be construed to mean that
untested classification systems, or
those with disappointing reliability,
are without value. Systems are
needed to categorize or define surgical problems before surgery in order to plan appropriate approaches
and techniques. Classification systems provide a discipline to help
define pathology as well as a language for communicating about it. Surgeons can then
confirm or refine proposed preoperative categories by careful intraoperative observation of the actual
findings. Furthermore, submission
of classification systems to statistical analysis highlights their inherent flaws and lays the groundwork
for their improvement.
References
1. Wright JG, Feinstein AR: Improving the
reliability of orthopaedic measurements.
J Bone Joint Surg Br 1992;74:287-291.
2. Cohen J: A coefficient of agreement
for nominal scales. Educational and
Psychological Measurement 1960;20:37-46.
3. Kristiansen B, Andersen UL, Olsen CA,
Varmarken JE: The Neer classification
of fractures of the proximal humerus:
An assessment of interobserver variation. Skeletal Radiol 1988;17:420-422.
4. Kraemer HC: Extension of the kappa
coefficient. Biometrics 1980;36:207-216.
5. Landis JR, Koch GG: The measurement
of observer agreement for categorical
data. Biometrics 1977;33:159-174.
6. Svanholm H, Starklint H, Gundersen
HJ, Fabricius J, Barlebo H, Olsen S:
Reproducibility of histomorphologic
diagnoses with special reference to the
kappa statistic. APMIS 1989;97:689-698.
7. Feinstein AR, Cicchetti DV: High
agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol
1990;43:543-549.
8. Cicchetti DV, Feinstein AR: High
agreement but low kappa: II. Resolving
the paradoxes. J Clin Epidemiol 1990;43:
551-558.
9. Byrt T, Bishop J, Carlin JB: Bias,
prevalence and kappa. J Clin Epidemiol
1993;46:423-429.
10. Clinical disagreement: I. How often it
occurs and why. Can Med Assoc J 1980;123:499-504.
11. Lowe RW, Hayes TD, Kaye J, Bagg RJ,
Luekens CA: Standing roentgenograms in spondylolisthesis. Clin
Orthop 1976;117:80-84.
12. Brady OH, Garbuz DS, Masri BA,
Duncan CP: The reliability and validity of the Vancouver classification of
femoral fractures after hip replacement. J Arthroplasty 2000;15:59-62.