I. INTRODUCTION
CLUSTERING is a technique that tries to group similar objects into the same groups, called clusters, and dissimilar objects into different clusters [1]. There is no general consensus on what exactly a cluster is; in fact, it is generally acknowledged that the problem is ill-defined [2]. Various algorithms use slightly different definitions of a cluster, e.g., based on the distance to the closest cluster center or on the density of points in a neighborhood. Unlike supervised learning, where labeled data are used to train a model that is afterwards used to classify unseen data, clustering belongs to the category of unsupervised problems. Clustering is a more difficult and challenging problem than classification [3]. Clustering has applications in many fields, including data mining, machine learning, marketing, biology, chemistry, astronomy, psychology, and spatial database technology.
Probably due to this interdisciplinary scope, most respected authors ([4], [5]) define clustering in a vague way, leaving space for several interpretations. Generally, clustering algorithms are designed to capture the notion of grouping in the same way a human observer does. The ultimate goal would be the detection of structures in higher dimensions, where humans fail. How to evaluate such methods is another problem. In this contribution we focus on patterns that are easily detected by a human; yet many algorithms fail even such a test. A detailed overview of many algorithms and their applications, including recently proposed clustering methods, can be found in [6]. So far, hundreds of algorithms have been proposed; some assign items to exactly one cluster, while others allow fuzzy assignment to many clusters. There are methods based on cluster prototypes, mixture models, graph structures, density, or grid-based models.
Fig. 1. Overview of the Chameleon approach: starting from the data set, construct a sparse k-nearest neighbor graph, partition the graph, and merge the partitions to obtain the final clusters. Diagram courtesy of Karypis et al. [7].
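As a rough illustration of the first phase (not the authors' implementation), the sparse k-NN graph can be built with off-the-shelf tools. The sketch below assumes scikit-learn and takes edge weights as reciprocal distances, one common choice; the partitioning and merging phases would then operate on W:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

# Toy data standing in for one of the benchmark datasets.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Phase 1: sparse k-nearest neighbor graph; k = 2*log(n) is the paper's default.
k = int(2 * np.log(len(X)))          # n = 300 -> k = 11
D = kneighbors_graph(X, n_neighbors=k, mode="distance")

# Convert distances to similarity weights (reciprocal distance; one common choice).
W = D.copy()
W.data = 1.0 / W.data
W = W.maximum(W.T)                   # symmetrize: keep edges seen in either direction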
$$RCL(C_i, C_j) = \frac{\bar{S}(C_i, C_j)}{\frac{|EC_i|}{|EC_i| + |EC_j|}\,\bar{S}(C_i) + \frac{|EC_j|}{|EC_i| + |EC_j|}\,\bar{S}(C_j)} \quad (1)$$

$$RIC(C_i, C_j) = \frac{|EC_{i,j}|}{\frac{|EC_i| + |EC_j|}{2}} \quad (2) \qquad = \frac{2\,|EC_{i,j}|}{|EC_i| + |EC_j|} \quad (3)$$

where $|EC_{i,j}|$ and $|EC_i|$ are interconnectivity properties: the number of edges between clusters $C_i$ and $C_j$, respectively the number of edges inside cluster $C_i$. $\bar{S}(C_i, C_j)$ and $\bar{S}(C_i)$ denote closeness properties: the average weight of all edges between clusters $C_i$ and $C_j$, respectively the average weight of all edges crossing the bisection of cluster $C_i$ (an example is shown in Figure 2):

$$B_{C_i} = \mathrm{bisect}(C_i) \quad (4)$$

$$\bar{S}(C_i) = \frac{1}{|B_{C_i}|} \sum_{e \in B_{C_i}} w(e) \quad (5)$$

$$\bar{S}(C_i, C_j) = \frac{1}{|EC_{i,j}|} \sum_{e \in EC_{i,j}} w(e) \quad (6)$$
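To make Eqs. (1)-(6) concrete, the following sketch computes both measures from a dense weight matrix. It is illustrative only: for brevity it substitutes the average internal edge weight for the bisection-based S-bar(Ci) of Eqs. (4)-(5):

import numpy as np

def ric_rcl(W, a, b):
    """Relative interconnectivity and closeness of clusters Ci, Cj (Eqs. 1-3).

    W    -- dense symmetric weight matrix of the k-NN graph (0 = no edge)
    a, b -- integer index arrays selecting the points of Ci and Cj
    """
    between = W[np.ix_(a, b)]              # weights of edges in EC_{i,j}
    in_a = W[np.ix_(a, a)]                 # weights of edges inside Ci
    in_b = W[np.ix_(b, b)]

    n_ab = np.count_nonzero(between)       # |EC_{i,j}|
    n_a = np.count_nonzero(in_a) // 2      # |EC_i| (symmetric storage counts twice)
    n_b = np.count_nonzero(in_b) // 2

    ric = n_ab / ((n_a + n_b) / 2.0)                       # Eqs. (2)-(3)

    s_ab = between[between > 0].mean()                     # S-bar(Ci, Cj), Eq. (6)
    s_a = in_a[in_a > 0].mean()   # stand-in for the bisection-based S-bar(Ci)
    s_b = in_b[in_b > 0].mean()   # of Eqs. (4)-(5)
    rcl = s_ab / (n_a / (n_a + n_b) * s_a + n_b / (n_a + n_b) * s_b)  # Eq. (1)
    return ric, rcl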
TABLE I
COMPARISON OF THE PARTITIONING METHODS
[Computational time of hMETIS and RB with F-M as a function of dataset size.]

TABLE II
CHAMELEON 2 PARAMETERS

Parameter    Description                   Default value
k            number of neighbors (k-NN)    2 log(n)
psize        max. partition size           max{5, n/100}
-            interconnectivity priority    1.0
-            closeness priority            2.0
similarity   determines merging order      Shatovska
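The defaults in Table II depend only on the dataset size n; a hypothetical helper that reproduces them (the base of the logarithm is not stated in the table; the natural logarithm is assumed here):

import math

def default_params(n):
    """Default Chameleon 2 settings from Table II (hypothetical helper)."""
    return {
        "k": int(2 * math.log(n)),      # neighborhood size; natural log assumed
        "psize": max(5, n // 100),      # maximum partition size
        "similarity": "Shatovska",      # criterion that determines merging order
    }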
1) Improved similarity measure: Shatovska et al. [21] propose a modified similarity measure which proved to be more robust than the original function based only on interconnectivity and closeness. We incorporated this function into the Chameleon algorithm, and our experiments showed that the achieved results are always at least as good as with the original measure, and most of the time even better. The complete improved formula [21] can be written as:
$$Sim_{shat}(C_i, C_j) = RCL_S(C_i, C_j) \cdot RIC_S(C_i, C_j) \cdot \rho(C_i, C_j) \quad (7)$$

$$RCL_S(C_i, C_j) = \frac{\bar{S}_s(C_i, C_j)}{\frac{|EC_i|}{|EC_i| + |EC_j|}\,\bar{S}_s(C_i) + \frac{|EC_j|}{|EC_i| + |EC_j|}\,\bar{S}_s(C_j)} \quad (8)$$

$$RIC_S(C_i, C_j) = \frac{\min\{\hat{S}_s(C_i), \hat{S}_s(C_j)\}}{\max\{\hat{S}_s(C_i), \hat{S}_s(C_j)\}} \quad (9)$$

$$\rho(C_i, C_j) = \frac{|EC_{i,j}|}{\min(|EC_i|, |EC_j|)} \quad (10)$$

$$\hat{S}_s(C_i) = \frac{1}{|EC_i|} \sum_{e \in C_i} w(e) \quad (11)$$
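Under the same simplified graph representation as in the earlier sketch, the measure of Eqs. (7)-(11) can be outlined as follows; note that S-hat_s(Ci) of Eq. (11) is exactly the average internal edge weight, so no bisection is required:

import numpy as np

def sim_shatovska(W, a, b):
    """Shatovska et al. similarity of Eqs. (7)-(11) (illustrative sketch)."""
    between = W[np.ix_(a, b)]
    in_a, in_b = W[np.ix_(a, a)], W[np.ix_(b, b)]

    n_ab = np.count_nonzero(between)            # |EC_{i,j}|
    n_a = np.count_nonzero(in_a) // 2           # |EC_i|
    n_b = np.count_nonzero(in_b) // 2           # |EC_j|

    s_ab = between[between > 0].mean()          # S-bar_s(Ci, Cj)
    s_a = in_a[in_a > 0].mean()                 # S-hat_s(Ci), Eq. (11)
    s_b = in_b[in_b > 0].mean()

    rcl_s = s_ab / (n_a / (n_a + n_b) * s_a + n_b / (n_a + n_b) * s_b)  # (8)
    ric_s = min(s_a, s_b) / max(s_a, s_b)                               # (9)
    rho = n_ab / min(n_a, n_b)                                          # (10)
    return rcl_s * ric_s * rho                                          # (7)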
TABLE III
DATASETS USED FOR EXPERIMENTS.

Dataset            d   n      classes  source
aggregation        2   788    7        [27]
atom               3   800    2        [28]
chainlink          3   1000   2        [28]
chameleon-t4.8k    2   8000   7        [29]
chameleon-t5.8k    2   8000   9        [29]
chameleon-t7.10k   2   10000  8        [29]
chameleon-t8.8k    2   8000   8        [29]
compound           2   399    6        [30]
cure-t2-4k         2   4000   7        [8]
D31                2   3100   31       [31]
DS-850             2   850    5        [32]
diamond9           2   3000   9        [33]
flame              2   240    2        [34]
jain               2   373    2        [35]
long1              2   1000   2        [36]
longsquare         2   900    6        [36]
lsun               2   400    3        [28]
pathbased          2   300    3        [37]
s-set1             2   5000   15       [38]
spiralsquare       2   1500   6        [36]
target             2   770    6        [28]
triangle1          2   1000   4        [36]
twodiamonds        2   800    2        [28]
wingnut            2   1016   2        [28]
VI. EXPERIMENTS
We evaluated Chameleon 2 against several popular algorithms. For the evaluation of clusterings we used Normalized Mutual Information (NMIsqrt) as defined by Strehl and Ghosh in [19]. NMI computes the agreement between a clustering and the ground-truth labels, which we provide for each dataset. An NMI value of 1.0 means complete agreement of the clustering with the external labels, while 0.0 means the complete opposite. Another popular criterion for external evaluation is the Adjusted Rand Index, which would provide very similar results in this case.
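For reference, scikit-learn's normalized_mutual_info_score with the geometric average corresponds to NMIsqrt; for example:

from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground-truth classes
labels_pred = [1, 1, 0, 0, 2, 2]   # clustering; label names are irrelevant

# average_method="geometric" yields NMI_sqrt (geometric mean of entropies).
print(normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method="geometric"))   # -> 1.0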
It is not feasible to benchmark our algorithm against every other existing algorithm. However, we tried to select a representative algorithm from each of several distinguishable groups.
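Such a benchmark can be approximated with off-the-shelf baselines. The sketch below uses scikit-learn stand-ins for the competitors of Table IV; the dataset and all parameters are placeholders, not the settings used in our experiments:

from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import normalized_mutual_info_score

# Stand-in dataset; eps/min_samples and n_clusters are placeholders.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

algorithms = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
    "HAC-SL": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "HAC-AL": AgglomerativeClustering(n_clusters=2, linkage="average"),
    "HAC-CL": AgglomerativeClustering(n_clusters=2, linkage="complete"),
    "HAC-WL": AgglomerativeClustering(n_clusters=2, linkage="ward"),
}
for name, algo in algorithms.items():
    pred = algo.fit_predict(X)
    nmi = normalized_mutual_info_score(y, pred, average_method="geometric")
    print(f"{name}: NMI = {nmi:.2f}")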
TABLE IV
CLUSTERING BENCHMARK ON DATASETS USED IN THE LITERATURE.

Dataset            Ch2-auto  Ch2-nd1  Ch2-Std  DBSCAN  HAC-AL  HAC-CL  HAC-SL  HAC-WL  k-means
aggregation        0.99      0.99     0.98     0.98    0.97    0.87    0.89    0.94    0.85
atom               1.00      0.99     1.00     0.99    0.59    0.57    1.00    1.00    0.29
chainlink          1.00      0.99     1.00     1.00    0.55    0.51    1.00    0.55    0.07
chameleon-t4.8k    0.88      0.89     0.86     0.95    0.67    0.63    0.86    0.65    0.59
chameleon-t5.8k    0.82      0.87     0.82     0.94    0.82    0.68    0.80    0.82    0.77
chameleon-t7.10k   0.86      0.90     0.90     0.97    0.68    0.63    0.87    0.67    0.58
chameleon-t8.8k    0.88      0.89     0.88     0.89    0.69    0.66    0.86    0.68    0.57
compound           0.96      0.95     0.95     0.92    0.85    0.82    0.85    0.82    0.72
cure-t2-4k         0.88      0.91     0.87     0.88    0.83    0.72    0.82    0.78    0.69
D31                0.96      0.96     0.94     0.88    0.95    0.95    0.87    0.95    0.92
diamond9           0.99      0.98     0.97     0.98    1.00    1.00    0.99    1.00    0.95
DS-850             0.98      0.95     0.99     0.98    0.98    0.62    0.99    0.69    0.57
flame              0.87      0.86     0.91     0.90    0.80    0.70    0.84    0.59    0.43
jain               1.00      0.93     1.00     0.89    0.70    0.70    0.86    0.52    0.37
long1              1.00      0.97     1.00     0.99    0.62    0.55    1.00    0.55    0.02
longsquare         0.98      0.97     0.98     0.94    0.90    0.83    0.93    0.84    0.81
lsun               1.00      0.99     1.00     1.00    0.82    0.83    1.00    0.73    0.54
pathbased          0.90      0.81     0.86     0.89    0.71    0.58    0.70    0.62    0.55
s-set1             1.00      0.98     1.00     0.97    0.98    0.97    0.96    0.98    0.95
spiralsquare       0.91      0.93     0.99     0.98    0.74    0.78    0.92    0.67    0.64
target             0.94      0.96     0.94     0.99    0.74    0.70    1.00    0.69    0.69
triangle1          1.00      0.97     1.00     1.00    0.98    0.91    1.00    1.00    0.93
twodiamonds        0.99      0.99     1.00     1.00    0.99    0.97    0.93    1.00    1.00
wingnut            0.97      0.91     0.97     1.00    1.00    1.00    1.00    1.00    0.77
$$d(C_x, C_y) = \frac{1}{Sim(C_x, C_y)} \quad (14)$$
The main reason for this change is that, over time, cluster similarity can increase and thus the distance decreases; in the standard representation this would mean that the dendrogram [...]

[...] works with a minimal error rate on all of the tested data. However, by configuring each phase of the algorithm, Chameleon 2
is able to correctly identify clusters in basically any dataset.
Therefore, Chameleon 2 can also be viewed as a general robust
clustering framework which can be adjusted for a wide range
of specific problems.
APPENDIX A
DATASET VISUALIZATIONS
All datasets used in our experiments contain distinguishable patterns. In the case of datasets contaminated by noise, clusters are areas with a high density of data points.
ACKNOWLEDGMENT
We would like to thank Petr Bartunek, Ph.D., from the IMG CAS institute for supporting our research and letting us publish all details of our work. This research is partially supported by CTU grant SGS15/117/OHK3/1T/18, "New data processing methods for data mining", and by the Program NPU I (LO1419) of the Ministry of Education, Youth and Sports of the Czech Republic.
REFERENCES
[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 1990.
[2] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith, "Meta Clustering," in Proceedings of the Sixth International Conference on Data Mining, ser. ICDM '06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 107-118.
[3] A. K. Jain, "Data Clustering: 50 Years Beyond K-Means," Pattern Recognition Letters, 2010.
[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
[5] B. S. Everitt, Cluster Analysis. Edward Arnold, 1993.
[6] C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[7] G. Karypis, E. Han, and V. Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling," Computer, vol. 32, no. 8, pp. 68-75, August 1999.
[8] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," in ACM SIGMOD Record, vol. 27, no. 2. ACM, 1998, pp. 73-84.
[9] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," in ICDE, M. Kitsuregawa, M. P. Papazoglou, and C. Pu, Eds. IEEE Computer Society, 1999, pp. 512-521.
[10] S. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
[11] G. Ball and D. Hall, "ISODATA: A Novel Method of Data Analysis and Pattern Classification," Stanford Research Institute, Menlo Park, Tech. Rep., 1965.
[12] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281-297.
[13] G. N. Lance and W. T. Williams, "A General Theory of Classificatory Sorting Strategies," The Computer Journal, vol. 9, no. 4, pp. 373-380, 1967.
[14] A. K. Jain, A. Topchy, M. H. C. Law, and J. M. Buhmann, "Landscape of Clustering Algorithms," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 1. Washington, DC, USA: IEEE Computer Society, 2004, pp. 260-263.
[15] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[17] R. A. Jarvis and E. A. Patrick, "Clustering Using a Similarity Measure Based on Shared Near Neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, 1973.
[18] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in KDD, E. Simoudis, J. Han, and U. M. Fayyad, Eds. AAAI Press, 1996, pp. 226-231.
[19] A. Strehl and J. Ghosh, "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions," Journal of Machine Learning Research (JMLR), vol. 3, pp. 583-617, December 2002.
Fig. 10. Visualization of datasets used in experiments with ground truth assignments.