Abstract

Several researchers have illustrated that constraints can improve the results of a variety of clustering algorithms. However, there can be a large variation in this improvement, even for a fixed number of constraints for a given data set. We present the first attempt to provide insight into this phenomenon by characterizing two constraint set properties: inconsistency and incoherence. We show that these measures are strongly anti-correlated with clustering algorithm performance. Since they can be computed prior to clustering, these measures can aid in deciding which constraints to use in practice.

Introduction and Motivation

The last five years have seen extensive work on incorporating instance-level constraints into clustering methods (Wagstaff et al. 2001; Klein, Kamvar, & Manning 2002; Xing et al. 2003; Bilenko, Basu, & Mooney 2004; Bar-Hillel et al. 2005). Instance-level constraints specify that two items must be placed into the same cluster (must-link, ML) or different clusters (cannot-link, CL). This semi-supervised approach has led to improved performance on several real-world applications, such as noun phrase coreference resolution and GPS-based map refinement (Wagstaff et al. 2001), person identification from surveillance camera clips (Bar-Hillel et al. 2005) and landscape detection from hyperspectral data (Lu & Leen 2005).

However, the common practice of presenting results using learning curves, which average performance over multiple constraint sets of the same size, obscures important details. For example, we took four UCI data sets (Blake & Merz 1998) and, for each, generated 100 randomly selected constraint sets, each containing 50 constraints. We then clustered each data set with COP-KMeans (Wagstaff et al. 2001), using the same initialization for each clustering run. We observed that, even when the number of constraints is fixed, the accuracy of the output partition measured using the Rand Index (Rand 1971) varies greatly, by 4 to 11% (Table 1). Since the starting point for each run was held fixed, the only source of variation is the constraint set. We have identified two constraint set properties that help explain these variations: inconsistency and incoherence.

                    Accuracy
Data set       Min    Mean    Max
Glass          67.6   69.9    72.3
Iris           82.2   88.4    93.4
Ionosphere     58.2   60.1    62.3
Wine           68.0   71.3    74.3

Table 1: Variation in accuracy (Rand Index) obtained by COP-KMeans using 50 randomly generated constraints and a fixed starting point, over 100 trials.

∗ Abstract type: original, unpublished research. This work was supported by NSF ITR #0325329.
Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Quantifying Inconsistency/Incoherence

Inconsistency is the amount of conflict between the constraints and the underlying objective function and search bias of an algorithm. Our measure of inconsistency quantifies the degree to which the algorithm cannot “figure out” the constraints on its own. Given an algorithm A, we generate partition P_A by running A on the data set without any constraints. We then calculate the fraction of constraints in constraint set C that are unsatisfied by P_A, and average this measure over multiple unconstrained clustering runs.

Incoherence is the amount of internal conflict between the constraints in set C, given a distance metric D. It is not algorithm dependent. Constraints can be incoherent with respect to distance and/or direction. Intuitively, the points involved in an ML constraint should be close together, while points involved in a CL constraint should be far apart, as viewed by the ideal distance metric. In fact, this line of reasoning is what motivates some of the metric-learning constrained clustering algorithms (Klein, Kamvar, & Manning 2002; Bilenko, Basu, & Mooney 2004). We define distance incoherence as the fraction of constraint pairs (each pair consisting of an ML and a CL constraint) for which the separation of the points in the ML constraint is more than that of the CL constraint.

Distance aside, the directional position of a constraint in the feature space is also important. An ML (or CL) constraint can be viewed as imposing an attractive (or repulsive) force in the feature space within its vicinity. An ML/CL constraint pair is directionally incoherent if they exert contradictory forces in the same vicinity. To determine if two constraints, c1 and c2, are incoherent, we treat the constraints as line segments and compute their projected overlap, or how much the projection of c1 along the direction of c2 overlaps with (interferes with) c2. We define directional incoherence as the fraction of constraint pairs that have a non-zero projected overlap.
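The three measures defined in this section can be sketched in a few lines of Python. This is only an illustrative reading of the definitions, not the authors' implementation. All function names are hypothetical. It assumes that constraints are stored as (i, j, kind) tuples with kind "ML" or "CL", that the metric D is Euclidean, and that an ML/CL pair counts as directionally incoherent when the projected overlap is non-zero in either direction. The text does not spell out these details.

```python
from itertools import product

import numpy as np


def inconsistency(constraints, partitions):
    """Mean fraction of constraints violated by unconstrained partitions.

    `partitions` is a list of label arrays, one per unconstrained run.
    """
    rates = []
    for labels in partitions:
        violated = 0
        for i, j, kind in constraints:
            same = labels[i] == labels[j]
            # An ML is violated when split; a CL is violated when merged.
            if (kind == "ML" and not same) or (kind == "CL" and same):
                violated += 1
        rates.append(violated / len(constraints))
    return float(np.mean(rates))


def _ml_cl_pairs(constraints):
    """All (ML, CL) constraint pairs, as pairs of index tuples."""
    ml = [(i, j) for i, j, kind in constraints if kind == "ML"]
    cl = [(i, j) for i, j, kind in constraints if kind == "CL"]
    return list(product(ml, cl))


def distance_incoherence(X, constraints):
    """Fraction of ML/CL pairs whose ML points are farther apart than the CL points."""
    pairs = _ml_cl_pairs(constraints)
    if not pairs:
        return 0.0
    bad = 0
    for (a, b), (c, d) in pairs:
        # Assumption: Euclidean distance stands in for the metric D.
        if np.linalg.norm(X[a] - X[b]) > np.linalg.norm(X[c] - X[d]):
            bad += 1
    return bad / len(pairs)


def projected_overlap(seg1, seg2):
    """Length over which seg1, projected onto the line of seg2, overlaps seg2."""
    p, q = seg2
    direction = q - p
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    # Scalar positions of seg1's endpoints along seg2, measured from p.
    lo, hi = sorted(float(np.dot(e - p, unit)) for e in seg1)
    return max(0.0, min(hi, length) - max(lo, 0.0))


def directional_incoherence(X, constraints):
    """Fraction of ML/CL pairs with a non-zero projected overlap."""
    pairs = _ml_cl_pairs(constraints)
    if not pairs:
        return 0.0
    bad = 0
    for (a, b), (c, d) in pairs:
        ml_seg, cl_seg = (X[a], X[b]), (X[c], X[d])
        # Assumption: overlap in either projection direction counts.
        if projected_overlap(ml_seg, cl_seg) > 0 or projected_overlap(cl_seg, ml_seg) > 0:
            bad += 1
    return bad / len(pairs)
```

Both incoherence functions need only the data matrix and the constraint set, which reflects the statement that incoherence is not algorithm dependent. The inconsistency function additionally needs the label vectors from several unconstrained runs of the algorithm under study.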
[Figure 1: three scatter plots of accuracy against (a) inconsistency, (b) distance incoherence, and (c) direction incoherence. Legend: CKM, PKM, MKM, MPKM.]
(a) Inconsistency (r = −0.80) (b) Distance Incoherence (r = −0.82) (c) Direction Incoherence (r = −0.74)
Figure 1: Mean accuracy of algorithms, as a function of problem inconsistency and incoherence, over four UCI data sets. The
line is a linear fit to the data that shows strong negative correlations (r values, 99% confidence).