
Classification Systems in Orthopaedics

Donald S. Garbuz, MD, MHSc, FRCSC, Bassam A. Masri, MD, FRCSC,


John Esdaile, MD, MPH, FRCPC, and Clive P. Duncan, MD, FRCSC

Abstract
Classification systems help orthopaedic surgeons characterize a problem, suggest
a potential prognosis, and offer guidance in determining the optimal treatment
method for a particular condition. Classification systems also play a key role in
the reporting of clinical and epidemiologic data, allowing uniform comparison
and documentation of like conditions. A useful classification system is reliable
and valid. Although the measurement of validity is often difficult and sometimes impractical, reliability, as summarized by intraobserver and interobserver reliability, is easy to measure and should serve as a minimum standard for
validation. Reliability is measured by the kappa value, which distinguishes true
agreement of various observations from agreement due to chance alone. Some
commonly used classifications of musculoskeletal conditions have not proved to
be reliable when critically evaluated.
J Am Acad Orthop Surg 2002;10:290-297


Assessment of Reliability
Classifications of musculoskeletal
conditions have at least two central
functions. First, accurate classification characterizes the nature of a
problem and then guides treatment
decision making, ultimately improving outcomes. Second, accurate classification establishes an
expected outcome for the natural
history of a condition or injury, thus
forming a basis for uniform reporting of results for various surgical
and nonsurgical treatments. This
allows the comparison of results
from different centers purportedly
treating the same entity.
A successful classification system
must be both reliable and valid.
Reliability reflects the precision of a
classification system; in general, it
refers to interobserver reliability, the
agreement between different observers. Intraobserver reliability is
the agreement of one observer's repeated classifications of an entity.


The validity of a classification system reflects the accuracy with
which the classification system
describes the true pathologic process. A valid classification system
correctly categorizes the attribute of
interest and accurately describes the
actual process that is occurring.1 To
measure or quantify validity, the
classification of interest must be
compared to some gold standard.
If the surgeon is classifying bone
stock loss prior to revision hip arthroplasty, the gold standard could
potentially be intraoperative assessment of bone loss. Validation of the
classification system would require
a high correlation between the preoperative radiographs and the intraoperative findings. In this example,
the radiographic findings would be
considered hard data because different observers can confirm the
radiographic findings. Intraoperative findings, on the other hand, would be considered soft data because independent confirmation of this intraoperative assessment is often impossible. This problem with the validation phase affects many commonly used classification systems that are based on radiographic criteria, and it introduces the element of observer bias to the validation process. Because of the difficulty of measuring validity, it is critical that classification systems have at least a high degree of reliability.


Dr. Garbuz is Assistant Professor, Department of Orthopaedics, University of British
Columbia, Vancouver, BC, Canada. Dr. Masri
is Associate Professor and Head, Division of
Reconstructive Orthopaedics, University of
British Columbia. Dr. Esdaile is Professor and
Head, Division of Rheumatology, University of
British Columbia. Dr. Duncan is Professor
and Chairman, Department of Orthopaedics,
University of British Columbia.
Reprint requests: Dr. Garbuz, Laurel Pavilion,
Third Floor, 910 West Tenth Avenue,
Vancouver, BC, Canada V5Z 4E3.
Copyright 2002 by the American Academy of
Orthopaedic Surgeons.


Classifications and measurements in general must be reliable to be assessed as valid. However, because confirming validity is difficult, many commonly used classification systems can be shown to be reliable yet not valid. On preoperative radiographs of a patient with a
hip fracture, for example, two observers may categorize the fracture
as Garden type 3. This measurement is reliable because of interobserver agreement. However, if the
intraoperative findings are of a
Garden type 4 fracture, then the
classification on radiographs, although reliable, is not valid (ie, is
inaccurate). A minimum criterion
for the acceptance of any classification or measurement, therefore, is a
high degree of both interobserver
and intraobserver reliability. Once
a classification system has been
shown to have acceptable reliability,
then testing for validity is appropriate. If the degree of reliability is
low, however, then the classification
system will have limited utility.
Initial efforts to measure reliability looked only at observed agreement: the percentage of times that
different observers categorized
their observations the same. This
concept is illustrated in Figure 1, a
situation in which the two surgeons agree 70% of the time. In
1960, Cohen2 introduced the kappa
value (or kappa statistic) as a measure to assess agreement that occurred above and beyond that
related to chance alone. Today the
kappa value and its variants are the
most accepted methods of measuring observer agreement for categorical data.
Figure 1 demonstrates how the
kappa value is used and how it differs from the simple measurement
of observed agreement. In this
hypothetical example, observed
agreement is calculated as the percentage of times both surgeons
agree whether fractures were displaced or nondisplaced; it does not
take into account the fact that they
may have agreed by chance alone.
To calculate the percentage of chance agreement, it is assumed that each surgeon will choose a category independently of the other.

Figure 1  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs of subcapital hip fractures.

                     Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            50           15           65
Nondisplaced         15           20           35
Total                65           35          100

Observed agreement (Po) = (50 + 20)/100 = 0.70
Chance agreement (Pc) = (65/100 × 65/100) + (35/100 × 35/100) = 0.545
Agreement beyond chance (κ) = (0.70 − 0.545)/(1 − 0.545) = 0.34

The marginal totals are then used to calculate the agreement expected by chance alone; in Figure 1, this is 0.545.
To calculate the kappa value, the
observed agreement (Po) minus the
chance agreement (Pc) is divided by
the maximum possible agreement
that is not related to chance (1 − Pc):
κ = (Po − Pc) / (1 − Pc)
This example is the simplest
case of two observers and two categories. The kappa value can be
used for multiple categories and
multiple observers in a similar
manner.
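For readers who wish to verify the arithmetic, the two-observer, two-category case can be sketched in a few lines of Python using the counts from Figure 1; the function name and the use of numpy are illustrative choices rather than part of the original report.

    import numpy as np

    def cohen_kappa(table):
        # Unweighted kappa for a square agreement table
        # (rows: observer 1, columns: observer 2).
        table = np.asarray(table, dtype=float)
        n = table.sum()
        p_o = np.trace(table) / n              # observed agreement, Po
        p_row = table.sum(axis=1) / n          # marginal proportions, observer 1
        p_col = table.sum(axis=0) / n          # marginal proportions, observer 2
        p_c = float(np.dot(p_row, p_col))      # chance agreement, Pc
        return (p_o - p_c) / (1.0 - p_c)

    # Counts from Figure 1: Po = 0.70, Pc = 0.545
    print(round(cohen_kappa([[50, 15], [15, 20]]), 2))   # prints 0.34

Running the sketch on the Figure 1 counts reproduces the kappa value of 0.34 shown in the figure.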
In analyzing categorical data,
which the kappa value is designed
to measure, there will be cases in
which disagreement between various categories may not have as profound an impact as disagreement
between other categories. For this
reason, categorical data are divided
into two types: nominal (unranked),
in which all categorical differences
are equally important, and ordinal
(ranked), in which disagreement
between some categories has a more

profound impact than disagreement between other categories. An example of nominal data is eye color; an
example of ordinal data is the AO
classification, in which each subsequent class denotes an increase in
severity of the fracture.
The kappa value can be unweighted or weighted depending
on whether the data are nominal or
ordinal. Unweighted kappa values
should always be used with unranked data. When ordinal data
are being analyzed, however, a
decision must be made whether or
not to weight the kappa value.
Weighting has the advantage of
giving some credit to partial agreement, whereas the unweighted
kappa value treats all disagreements as equal. A good example of
appropriate use of the weighted
kappa value is in a study by
Kristiansen et al3 of interobserver
agreement in the Neer classification of proximal humeral fractures.
This well-known classification has
four categories of fractures, from
nondisplaced or minimally displaced to four-part fractures.
Weighting was appropriate in this
case because disagreement


between a two-part and three-part fracture is not as serious as disagreement between a nondisplaced fracture and a four-part fracture. By weighting kappa values, one can account for the different levels of importance between levels of disagreement. If a weighted kappa value is determined to be appropriate, the weighting scheme must be specified in advance, because the weights chosen will dramatically affect the kappa value. In addition, when reporting studies that have used a weighted kappa value, the weighting scheme must be documented clearly. One problem with weighting is that, without uniform weighting schemes, it is difficult to generalize across studies. A larger sample size will narrow the confidence interval, but it does not automatically affect the number of categories.
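To illustrate how a weighted kappa value gives partial credit for near-miss disagreements, the sketch below applies linear (or, optionally, quadratic) disagreement weights, one common convention; the 4 × 4 table is invented purely for illustration and is not taken from Kristiansen et al3 or any other study cited here.

    import numpy as np

    def weighted_kappa(table, weights="linear"):
        # Weighted kappa for ordinal categories:
        # near-miss disagreements are penalized less than distant ones.
        table = np.asarray(table, dtype=float)
        k = table.shape[0]
        p_obs = table / table.sum()
        p_exp = np.outer(p_obs.sum(axis=1), p_obs.sum(axis=0))  # chance-expected proportions
        i, j = np.indices((k, k))
        d = np.abs(i - j) / (k - 1)            # disagreement weight, 0 on the diagonal
        if weights == "quadratic":
            d = d ** 2
        return 1.0 - (d * p_obs).sum() / (d * p_exp).sum()

    # Hypothetical two-observer table for a four-category (Neer-style) grading,
    # invented for illustration only:
    t = [[20,  5,  0,  0],
         [ 4, 15,  6,  0],
         [ 0,  5, 18,  3],
         [ 0,  0,  2, 22]]
    print(round(weighted_kappa(t), 2), round(weighted_kappa(t, "quadratic"), 2))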

Although the kappa value has become the most widely accepted method to measure observer agreement, interpretation is difficult. Values obtained range from −1.0 (complete disagreement) through 0.0 (chance agreement) to 1.0 (complete agreement). Hypothesis testing has limited usefulness when the kappa value is used because it allows the researcher to see only whether the obtained agreement is significantly different from zero or chance agreement, revealing nothing about the extent of agreement. Consequently, when kappa values are obtained for assessing classifications of musculoskeletal conditions, hypothesis testing has almost no role. As Kraemer stated, "It is insufficient to demonstrate merely the nonrandomness of diagnostic procedures; one requires assurance of substantial agreement between observations."4 This statement is equally applicable to classifications used in orthopaedics.

To assess the strength of agreement obtained with a given kappa value, two different benchmarks have gained widespread use in orthopaedics and other branches of medicine. The most widely adopted criteria for assessing the extent of agreement are those of Landis and Koch:5
κ > 0.80, almost perfect;
κ = 0.61 to 0.80, substantial;
κ = 0.41 to 0.60, moderate;
κ = 0.21 to 0.40, fair;
κ = 0.00 to 0.20, slight; and
κ < 0.00, poor.
Although these criteria have gained widespread acceptance, the values were chosen arbitrarily and were never intended to serve as general benchmarks. The criteria of Svanholm et al,6 while less widely used, are more stringent than those of Landis and Koch and are perhaps more practical for use in medicine. Like Landis and Koch, Svanholm et al chose arbitrary values:
κ ≥ 0.75, excellent;
κ = 0.51 to 0.74, good; and
κ ≤ 0.50, poor.
When reviewing reports of studies on agreement of classification systems, readers should look at the actual kappa value and not just at the arbitrary categories described here.

Although the interpretation of a given kappa value is difficult, it is clear that the higher the value, the more reliable the classification system. When interpreting a given kappa value, the impact of prevalence and bias must be considered. Feinstein and Cicchetti7,8 refer to these as the two paradoxes of high observed agreement and low kappa values. Most important is the effect that the prevalence (base rate) can have on the kappa value. Prevalence refers to the number of times a given category is selected. In general, as the proportion of cases in one category approaches 0% or 100%, the kappa value will decrease for any given observed agreement. In Figure 2, the same two hypothetical orthopaedic surgeons as in Figure 1 review and categorize 100 different radiographs. The observed agreement is the same as in Figure 1, 0.70. However, the agreement beyond chance (kappa value) is 0.06. The main difference between Figures 1 and 2 is the marginal totals, or the underlying prevalence of displaced and nondisplaced fractures, defined as the proportion of displaced and nondisplaced fractures.

Figure 2  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs, with a higher prevalence of displaced fractures than in Figure 1.

                     Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            65           15           80
Nondisplaced         15            5           20
Total                80           20          100

Observed agreement (Po) = (65 + 5)/100 = 0.70
Chance agreement (Pc) = (80/100 × 80/100) + (20/100 × 20/100) = 0.68
Agreement beyond chance (κ) = (0.70 − 0.68)/(1 − 0.68) = 0.06


If one category has a very high prevalence, there can be paradoxically high observed agreement yet low kappa values (although to some extent this can be the result of the way chance agreement is calculated). The effect of prevalence on
kappa values must be kept in mind
when interpreting studies of observer variability. The prevalence,
observed agreement, and kappa
values should be clearly stated in
any report on classification reliability. Certainly a study with a low
kappa value and extreme prevalence rate will not represent the
same level of disagreement as will
a low kappa value in a sample with
a balanced prevalence rate.
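The prevalence effect can be reproduced directly from the two hypothetical tables. The short sketch below uses the Figure 1 cell counts and the Figure 2 counts implied by its marginal totals and observed agreement; at an identical observed agreement of 0.70, the kappa value falls from 0.34 to 0.06.

    def kappa_2x2(a, b, c, d):
        # a, d: cells where the two surgeons agree; b, c: cells where they disagree.
        n = a + b + c + d
        p_o = (a + d) / n
        p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
        return (p_o - p_c) / (1 - p_c)

    print(round(kappa_2x2(50, 15, 15, 20), 2))   # Figure 1, balanced prevalence: 0.34
    print(round(kappa_2x2(65, 15, 15, 5), 2))    # Figure 2, skewed prevalence: 0.06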
Bias (systematic difference) is the
second factor that can affect the
kappa value. Bias has a lesser effect
than does prevalence, however. As
bias increases, kappa values paradoxically will increase, although
this is usually seen only when
kappa values are low. To assess the
extent of bias in observer agreement studies, Byrt et al9 have suggested measuring a bias index, but
this has not been widely adopted.
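For completeness, one commonly cited formulation of prevalence and bias indices for a 2 × 2 table is sketched below; because the exact definitions of Byrt et al9 are not reproduced in this article, these expressions should be read as an assumption for illustration rather than as the published index.

    def prevalence_index(a, b, c, d):
        # |a - d| / n: how unevenly the two categories occur (assumed formulation).
        return abs(a - d) / (a + b + c + d)

    def bias_index(a, b, c, d):
        # |b - c| / n: systematic difference between the two observers (assumed formulation).
        return abs(b - c) / (a + b + c + d)

    # Figure 2 counts: large prevalence imbalance (0.60) but no observer bias (0.0).
    print(prevalence_index(65, 15, 15, 5), bias_index(65, 15, 15, 5))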
Although the kappa value, influenced by prevalence and bias, measures agreement, it is not the only
measure of the precision of a classification system. Many other factors
can affect both observer agreement
and disagreement.

Sources of Disagreement
As mentioned, any given classification system must have a high degree
of reliability or precision. The degree of observer agreement obtained
is affected by many factors, including the precision of the classification
system. To improve reliability, these
other sources of disagreement must


be understood and minimized.


Once this is done, the reliability of
the classification system itself can be
accurately estimated.
Three sources of disagreement or
variability have been described:1,10
the clinician (observer), the patient
(examined), and the procedure
(examination). Each of these can
affect the reliability of classifications
in clinical practice and studies that
examine classifications and their
reliability.
Clinician variability arises from
the process by which information is
observed and interpreted. The information can be obtained from different sources, such as history,
physical examination, or radiographic examination. These raw
data are often then converted into
categories. Wright and Feinstein1
called the criteria used to put the
raw data into categories "conversion criteria." Disagreement can occur
when the findings are observed or
when they are organized into the
arbitrary categories commonly used
in classification systems.
An example of variability in the
observational process is the measurement of the center edge angle of
Wiberg. Inconsistent choice of the
edge of the acetabulum will lead to
variations in the measurements
obtained (Fig. 3).
As a result of the emphasis on
arbitrary criteria for the various categories in a classification system, an
observer may make measurements
that do not meet all of the criteria of
a category. The observer will then
choose the closest matching category.
Another observer may disagree
about the choice of closest category
and choose another. Such variability
in the use of conversion criteria is
common and is the result of trying
to convert the continuous spectrum
of clinical data into arbitrary and
finite categories.
The particular state being measured will vary depending on when and how it is measured; this results in patient variability.

Figure 3  Anteroposterior radiograph of a dysplastic hip, showing the difficulty in
defining the true margin of the acetabulum
when measuring the center edge angle of
Wiberg (solid lines). The apparent lateral
edge of the acetabulum (arrow) is really a
superimposition of the true anterior and
posterior portions of the superior rim of
the acetabulum. Inconsistent choice
among observers may lead to errors in
measurement.

A good example is the variation obtained in measuring the degree of
is in a standing compared with a
supine position.11 To minimize patient variability, examinations should
be performed in a consistent, standardized fashion.
The final source of variability is
the procedure itself. This often
refers to technical aspects, such as
the taking of a radiograph. If the
exposures of two radiographs of the
same patient's hip are different, for
example, then classification of the
degree of osteopenia, which depends on the degree of exposure,
will differ as a result of the variability. Standardization of technique
will help reduce this source of variability.



These three sources of variation
apply to all measurement processes.
Reducing the variability of classification systems is not just a matter of improving a classification system itself; that is only one aspect by which the reliability and utility of classification systems can be improved. Understanding these sources of measurement variability and how to minimize them is critically important.1,10

Assessment of Commonly
Used Orthopaedic
Classification Systems
Although many classification systems have been widely adopted and
frequently used in orthopaedic surgery to guide treatment decisions,
few have been scientifically tested
for their reliability. A high degree of
reliability or precision should be a
minimum requirement before any
classification system is adopted. The
results of several recent studies that
have tested various orthopaedic classifications for their intraobserver and
interobserver reliability are summarized in Table 1.12-21
In general, the reliability of the
listed classification systems would
be considered low and probably
unacceptable. Despite this lack of
reliability, these systems are commonly used. Although Table 1 lists
only a limited number of systems,
they were chosen because they have
been subjected to reliability testing.
Many other classification systems
commonly cited in the literature
have not been tested; consequently,
there is no evidence that they are or
are not reliable. In fact, most classification systems for medical conditions and injuries that have been
tested have levels of agreement that
are considered unacceptably low.22,23
There is no reason to believe that
the classification systems that have
not been tested would fare any better. Four of the studies listed in
Table 1 are discussed in detail to


highlight the methodology that should be used to assess the reliability of any classification system: the
AO classification of distal radius
fractures, 15 the classification of
acetabular bone defect in revision
hip arthroplasty,13 the Severin classification of congenital dislocation
of the hip,14 and the Vancouver classification of periprosthetic fractures
of the femur.12
Kreder et al15 assessed the reliability of the AO classification of
distal radius fractures. This classification system divides fractures
into three types based on whether
the fracture is extra-articular (type
A), partial articular (type B), or complete articular (type C). These fracture types can then be divided into
groups, which are further divided
into subgroups with 27 possible
combinations. Thirty radiographs
of distal radial fractures were presented to observers on two occasions. Before classifying the radiographs, a 30-minute review of the
AO classification was conducted.
Assessors also had a handout, which
they were encouraged to use when
classifying the fractures. There
were 36 observers in all, including
attending surgeons, clinical fellows,
residents, and nonclinicians. These
groups were chosen to ascertain
whether the type of observer had an
influence on the reliability of the
classification. In this study, an
unweighted kappa value was used.
The authors evaluated intraobserver
and interobserver reliability for AO
type, AO group, and AO subgroup.
The criteria of Landis and Koch 5
were used to grade the levels of
agreement. Interobserver agreement was highest for the initial AO
type, and it decreased for groups
and subgroups as the number of
categories increased. This should be
expected because, as the number of
categories increases, there is more
opportunity for disagreement.
Intraobserver agreement showed
similar results. Kappa values for

AO type ranged from 0.67 for residents to 0.86 for attending surgeons.
Again, with more detailed AO subgroups, kappa values decreased
progressively. When all 27 categories were included, kappa values
ranged from 0.25 to 0.42. The conclusions of this study were that the
use of AO types A, B, and C produced levels of reliability that were
high and acceptable. However, subclassification into groups and subgroups was unreliable. The clinical
utility of using only the three types
was not addressed and awaits further study.
Several important aspects of this
study, aside from the results, merit
mention. This study showed that not only the classification system but also the observer is tested. For any classification system tested, it is important to document the observers' experience because this
can substantially affect reliability.
One omission in this study15 was
the lack of discussion of observed
agreement and the prevalence of
fracture categories; these factors
have a distinct effect on observer
variability.
Campbell et al 13 looked at the
reliability of acetabular bone defect
classifications in revision hip arthroplasty. One group of observers
included the originators of the classification system. This is the ultimate way to remove observer bias;
however, it lacks generalizability
because the originators would be
expected to have unusually high
levels of reliability. In this study,
preoperative radiographs of 33 hips
were shown to three different
groups of observers on two occasions at least 2 weeks apart. The
groups of observers were the three
originators, three reconstructive
orthopaedic surgeons, and three
senior residents. The three classifications assessed were those attributed to Gross,24 Paprosky,25 and the
American Academy of Orthopaedic
Surgeons.26 The unweighted kappa value was used to assess the level of agreement.


Table 1
Intraobserver and Interobserver Agreement in Orthopaedic Classification Systems
(Observed agreement in %, where reported; κ ranges as reported. * unweighted κ; † weighted κ.)

Brady et al12 | Periprosthetic femur fractures (Vancouver) | Reconstructive orthopaedic surgeons, including originator; residents | Intraobserver κ: 0.73-0.83* | Interobserver κ: 0.60-0.65*
Campbell et al13 | Acetabular bone defect in revision total hip (AAOS26) | Reconstructive orthopaedic surgeons, including originators | Intraobserver κ: 0.05-0.75* | Interobserver κ: 0.11-0.28*
Campbell et al13 | Acetabular bone defect in revision total hip (Gross24) | Reconstructive orthopaedic surgeons, including originators | Intraobserver κ: 0.33-0.55* | Interobserver κ: 0.19-0.62*
Campbell et al13 | Acetabular bone defect in revision total hip (Paprosky25) | Reconstructive orthopaedic surgeons, including originators | Intraobserver κ: 0.27-0.60* | Interobserver κ: 0.17-0.41*
Ward et al14 | Congenital hip dislocation (Severin) | Pediatric orthopaedic surgeons | Intraobserver: 45-61% observed; κ 0.20-0.44*, 0.32-0.59† | Interobserver: 14-61% observed; κ 0.01-0.42*, 0.05-0.55†
Kreder et al15 | Distal radius (AO) | Attending surgeons, fellows, residents, nonclinicians | Intraobserver κ: 0.25-0.42* | Interobserver κ: 0.33*
Sidor et al16 | Proximal humerus (Neer) | Shoulder surgeon, radiologist, residents | Intraobserver: 62-86% observed; κ 0.50-0.83* | Interobserver κ: 0.43-0.58*
Siebenrock et al17 | Proximal humerus (Neer) | Shoulder surgeons | Intraobserver κ: 0.46-0.71† | Interobserver κ: 0.25-0.51†
Siebenrock et al17 | Proximal humerus (AO/ASIF) | Shoulder surgeons | Intraobserver κ: 0.43-0.54† | Interobserver κ: 0.36-0.49†
McCaskie et al18 | Quality of cement grade in THA | Experts in THA, consultants, residents | Intraobserver κ: 0.07-0.63* | Interobserver κ: 0.04*
Lenke et al19 | Scoliosis (King) | Spine surgeons | Intraobserver: 56-85% observed; κ 0.34-0.95* | Interobserver: 55% observed; κ 0.21-0.63*
Cummings et al20 | Scoliosis (King) | Pediatric orthopaedic surgeons, spine surgeons, residents | Intraobserver κ: 0.44-0.72* | Interobserver κ: 0.44*
Haddad et al21 | Femoral bone defect in revision total hip (AAOS,30 Mallory,28 Paprosky et al29) | Reconstructive orthopaedic surgeons | Intraobserver κ: 0.43-0.62* | Interobserver κ: 0.12-0.29*

* Unweighted kappa value. † Weighted kappa value.



As expected, the originators had
higher levels of intraobserver agreement than did the other two observer
groups (AAOS, 0.57; Gross, 0.59;
Paprosky, 0.75). However, levels of
agreement fell markedly when tested
by surgeons other than the originators. This study underscores the importance of the qualifications of the
observers in studies that measure
reliability. To test the classification
system itself, experts would be the
initial optimal choice, as was the
case in this study.13 However, even
if the originators have acceptable
agreement, this result should not be
generalized. Because most classification systems are developed for
widespread use, reliability must be
high among all observers for a system to have clinical utility. Hence,
although the originators of the classifications of femoral bone loss were
not included in a similar study21 at
the same center, the conclusions of
the study remain valuable with respect to the reliability of femoral
bone loss classifications in the hands
of orthopaedic surgeons other than
the originators.
Ward et al14 evaluated the Severin
classification, which is used to assess the radiographic appearance
of the hip after treatment for congenital dislocation. This system has
six main categories ranging from
normal to recurrent dislocation and
is reported to be a prognostic indicator. Despite its widespread acceptance, it was not tested for reliability
until 1997. The authors made every
effort to test only the classification
system by minimizing other potential sources of disagreement. All
identifying markers were removed
from 56 radiographs of hips treated
by open reduction. Four fellowship-trained pediatric orthopaedic
surgeons who routinely treated congenital dislocation of the hip independently rated the radiographs.
Before classifying the hips, the


observers were given a detailed description of the Severin classification.


Eight weeks later, three observers
repeated the classifying exercise.
The radiographs were presented in
a different order in an attempt to
minimize recall bias. Both weighted
and unweighted kappa values were
calculated. Observed agreement
also was calculated and reported so
that the possibility of a high observed agreement with a low kappa
value would be apparent. The
kappa values, whether weighted or
unweighted, were low, usually less
than 0.50. The authors of this study
used the arbitrary criteria of
Svanholm et al6 to grade their agreement and concluded that this classification scheme is unreliable and
should not be widely used. This
study demonstrated the methodology that should be used when testing classification systems. It eliminated other sources of disagreement
and focused on the precision of the
classification system itself.
The Vancouver classification of
periprosthetic femur fractures is an
example of a system that was tested
for reliability prior to its widespread adoption and use. 12 The
first description was published in
1995.27 Shortly afterward, testing
began on the reliability and the
validity of this system. The methodology was similar to that described in the three previous studies. Reliability was acceptable for
the three experienced reconstructive orthopaedic surgeons tested,
including the originator. To assess
generalizability, three senior residents also were assessed for their
intraobserver and interobserver
reliability. The kappa values for
this group were nearly identical to
those of the three expert surgeons.
This study confirmed that the
Vancouver classification is both
reliable and valid. With these two
criteria met, this system can be recommended for widespread use and
can subsequently be assessed for its

value in guiding treatment and outlining prognosis.

Summary
Classification systems are tools for
identifying injury patterns, assessing
prognoses, and guiding treatment
decisions. Many classification systems have been published and widely adopted in orthopaedics without
information available on their reliability. Classification systems should
consistently produce the same results.
A system should, at a minimum,
have a high degree of intraobserver
and interobserver reliability. Few
systems have been tested for this reliability, but those that have been
tested generally fall short of acceptable levels of reliability. Because
most classification systems have poor
reliability, their use to differentiate
treatments and suggest outcomes is
not warranted. A system that has not
been tested cannot be assumed to be
reliable. The systems used by orthopaedic surgeons must be tested for
reliability, and if a system is not
found to be reliable, it should be
modified or its use seriously questioned. Improving reliability involves
looking at many components of the
classification process.1
Methodologies exist to assess
classifications, with the kappa value
the standard for measuring observer reliability. Once a system is
found to be reliable, the next step is
to prove its utility. Only when a
system is shown to be reliable
should it be widely adopted by the
medical community. This should
not be construed to mean that
untested classification systems, or
those with disappointing reliability,
are without value. Systems are
needed to categorize or define surgical problems before surgery in order to plan appropriate approaches
and techniques. Classification systems provide a discipline to help
define pathology as well as a language to describe that pathology.


However, it is necessary to recognize the limitations of existing classification systems and the need to

confirm or refine proposed preoperative categories by careful intraoperative observation of the actual
findings. Furthermore, submission

of classification systems to statistical analysis highlights their inherent flaws and lays the groundwork
for their improvement.

References

1. Wright JG, Feinstein AR: Improving the reliability of orthopaedic measurements. J Bone Joint Surg Br 1992;74:287-291.
2. Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960;20:37-46.
3. Kristiansen B, Andersen UL, Olsen CA, Varmarken JE: The Neer classification of fractures of the proximal humerus: An assessment of interobserver variation. Skeletal Radiol 1988;17:420-422.
4. Kraemer HC: Extension of the kappa coefficient. Biometrics 1980;36:207-216.
5. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
6. Svanholm H, Starklint H, Gundersen HJ, Fabricius J, Barlebo H, Olsen S: Reproducibility of histomorphologic diagnoses with special reference to the kappa statistic. APMIS 1989;97:689-698.
7. Feinstein AR, Cicchetti DV: High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-549.
8. Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-558.
9. Byrt T, Bishop J, Carlin JB: Bias, prevalence and kappa. J Clin Epidemiol 1993;46:423-429.
10. Clinical disagreement: I. How often it occurs and why. Can Med Assoc J 1980;123:499-504.
11. Lowe RW, Hayes TD, Kaye J, Bagg RJ, Luekens CA: Standing roentgenograms in spondylolisthesis. Clin Orthop 1976;117:80-84.
12. Brady OH, Garbuz DS, Masri BA, Duncan CP: The reliability and validity of the Vancouver classification of femoral fractures after hip replacement. J Arthroplasty 2000;15:59-62.
13. Campbell DG, Garbuz DS, Masri BA, Duncan CP: Reliability of acetabular bone defect classification systems in revision total hip arthroplasty. J Arthroplasty 2001;16:83-86.
14. Ward WT, Vogt M, Grudziak JS, Tumer Y, Cook PC, Fitch RD: Severin classification system for evaluation of the results of operative treatment of congenital dislocation of the hip: A study of intraobserver and interobserver reliability. J Bone Joint Surg Am 1997;79:656-663.
15. Kreder HJ, Hanel DP, McKee M, Jupiter J, McGillivary G, Swiontkowski MF: Consistency of AO fracture classification for the distal radius. J Bone Joint Surg Br 1996;78:726-731.
16. Sidor ML, Zuckerman JD, Lyon T, Koval K, Cuomo F, Schoenberg N: The Neer classification system for proximal humeral fractures: An assessment of interobserver reliability and intraobserver reproducibility. J Bone Joint Surg Am 1993;75:1745-1750.
17. Siebenrock KA, Gerber C: The reproducibility of classification of fractures of the proximal end of the humerus. J Bone Joint Surg Am 1993;75:1751-1755.
18. McCaskie AW, Brown AR, Thompson JR, Gregg PJ: Radiological evaluation of the interfaces after cemented total hip replacement: Interobserver and intraobserver agreement. J Bone Joint Surg Br 1996;78:191-194.
19. Lenke LG, Betz RR, Bridwell KH, et al: Intraobserver and interobserver reliability of the classification of thoracic adolescent idiopathic scoliosis. J Bone Joint Surg Am 1998;80:1097-1106.
20. Cummings RJ, Loveless EA, Campbell J, Samelson S, Mazur JM: Interobserver reliability and intraobserver reproducibility of the system of King et al. for the classification of adolescent idiopathic scoliosis. J Bone Joint Surg Am 1998;80:1107-1111.
21. Haddad FS, Masri BA, Garbuz DS, Duncan CP: Femoral bone loss in total hip arthroplasty: Classification and preoperative planning. J Bone Joint Surg Am 1999;81:1483-1498.
22. Koran LM: The reliability of clinical methods, data and judgments (first of two parts). N Engl J Med 1975;293:642-646.
23. Koran LM: The reliability of clinical methods, data and judgments (second of two parts). N Engl J Med 1975;293:695-701.
24. Garbuz D, Morsi E, Mohamed N, Gross AE: Classification and reconstruction in revision acetabular arthroplasty with bone stock deficiency. Clin Orthop 1996;324:98-107.
25. Paprosky WG, Perona PG, Lawrence JM: Acetabular defect classification and surgical reconstruction in revision arthroplasty: A 6-year follow-up evaluation. J Arthroplasty 1994;9:33-44.
26. D'Antonio JA, Capello WN, Borden LS: Classification and management of acetabular abnormalities in total hip arthroplasty. Clin Orthop 1989;243:126-137.
27. Duncan CP, Masri BA: Fractures of the femur after hip replacement. Instr Course Lect 1995;44:293-304.
28. Mallory TH: Preparation of the proximal femur in cementless total hip revision. Clin Orthop 1988;235:47-60.
29. Paprosky WG, Lawrence J, Cameron H: Femoral defect classification: Clinical application. Orthop Rev 1990;19(suppl 9):9-15.
30. D'Antonio J, McCarthy JC, Bargar WL, et al: Classification of femoral abnormalities in total hip arthroplasty. Clin Orthop 1993;296:133-139.
