DOI: 10.1007/s00357-010-9069-1
MCS: A Method for Finding the Number of Clusters

Ahmed N. Albatineh
Florida International University, U.S.A.

Magdalena Niewiadomska-Bugaj
Western Michigan University, U.S.A.
Abstract: This paper proposes a maximum clustering similarity (MCS) method for
determining the number of clusters in a data set by studying the behavior of similarity indices that compare the partitions produced by two (of several) clustering methods. The similarity between
the two clusterings is calculated at the same number of clusters, using the indices of
Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K) each corrected for chance
agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated
bivariate normal data, and further extended for use in circular data. Its performance
is compared to the criteria discussed in Tibshirani, Walther, and Hastie (2001). The
proposed method is not based on any distributional or data assumptions, which makes it widely applicable to any type of data that can be clustered using at least two clustering algorithms.
Keywords: Similarity index; Clustering algorithm; Circular data; Bivariate normal
mixture; Correction for chance agreement; Gap statistic; Number of clusters; Comparing partitions.
The authors thank Willem Heiser and two anonymous referees for helpful comments
and valuable suggestions on an earlier draft of this paper.
Authors' Addresses: Ahmed N. Albatineh, Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, Tel: +1 305-348-4909, Fax: +1 305-348-4901, email: aalbatin@fiu.edu; Magdalena Niewiadomska-Bugaj, Department of Statistics, Western Michigan University, Kalamazoo, MI, email: m.bugaj@wmich.edu.
1. Introduction
One approach to determining the number of clusters is replication analysis, proposed by Breckenridge (1989). Later, Dudoit and Fridlyand (2002) generalized Breckenridge's approach by proposing a prediction-based resampling approach (Clest) to find the number of clusters.
This paper proposes a method for finding the number of clusters in a
data set based on the maximum clustering similarity between two partitions
of the same data set. The paper is organized as follows: Section 2 presents
an overview of similarity indices, Section 3 develops the proposed method,
Section 4 shows the simulation results using different data structures, while
Section 5 extends the proposed method to circular data and applies it to a real data set representing the movements of 76 turtles after an experimental treatment. Section 6 compares the performance of the proposed method with other criteria discussed in Tibshirani et al. (2001). Finally, Section 7 provides conclusions and final comments.
2. Similarity Indices
Suppose two clustering methods, A and B, are applied to the same set of $n$ objects, producing $I$ and $J$ clusters, respectively. Let $m_{ij}$ denote the number of objects placed in cluster $i$ by method A and in cluster $j$ by method B, and collect these counts in the $I \times J$ matching matrix $M = [m_{ij}]$, with row totals $m_{i+} = \sum_{j=1}^{J} m_{ij}$ and column totals $m_{+j} = \sum_{i=1}^{I} m_{ij}$. Each of the $N = n(n-1)/2$ pairs of objects falls into one of four categories:

Table 1. Number of pairs of objects

                               Method A:
  Method B:             same cluster   different clusters   Total
  same cluster               a                 b            a + b
  different clusters         c                 d            c + d
  Total                    a + c             b + d      N = n(n-1)/2

In terms of the matching counts,

$$a = \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} - n\right), \tag{2.1}$$

$$b = \sum_{j=1}^{J}\binom{m_{+j}}{2} - \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\left(\sum_{j=1}^{J} m_{+j}^{2} - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}\right), \tag{2.2}$$

$$c = \sum_{i=1}^{I}\binom{m_{i+}}{2} - \sum_{i=1}^{I}\sum_{j=1}^{J}\binom{m_{ij}}{2} = \frac{1}{2}\left(\sum_{i=1}^{I} m_{i+}^{2} - \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}\right), \tag{2.3}$$

$$d = \binom{n}{2} - a - b - c = \frac{1}{2}\left(n^{2} + \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} - \sum_{i=1}^{I} m_{i+}^{2} - \sum_{j=1}^{J} m_{+j}^{2}\right). \tag{2.4}$$
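In computations, these pair counts follow directly from the matching matrix; the following R sketch (our illustration, not code from the paper; the name pair.counts is ours) implements (2.1)-(2.4):

    # Sketch: pair counts a, b, c, d for two partitions of the same objects,
    # given as label vectors; implements equations (2.1)-(2.4).
    pair.counts <- function(labels1, labels2) {
      M  <- table(labels1, labels2)       # matching matrix m_ij
      n  <- sum(M)
      s  <- sum(M^2)                      # sum of m_ij^2
      a  <- (s - n) / 2                   # (2.1): pairs together in both
      b  <- (sum(colSums(M)^2) - s) / 2   # (2.2): together under B only
      cc <- (sum(rowSums(M)^2) - s) / 2   # (2.3): together under A only
      d  <- choose(n, 2) - a - b - cc     # (2.4): together under neither
      c(a = a, b = b, c = cc, d = d)
    }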
Indices that can be expressed in the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}$ are members of the L family, as discussed in Albatineh, Niewiadomska-Bugaj, and Mihalko (2006). Members of the L family used in this paper include Rand (R) (1971), Fowlkes and Mallows (FM) (1983), and Kulczynski (K) (1927), which can be written in terms of $a, b, c, d$ or $m_{ij}$ as

$$R = \frac{a+d}{a+b+c+d} = 1 - \frac{\sum_{i=1}^{I} m_{i+}^{2} + \sum_{j=1}^{J} m_{+j}^{2}}{n(n-1)} + \frac{2}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}, \tag{2.5}$$

$$FM = \frac{a}{\sqrt{(a+b)(a+c)}} = \frac{-n}{\sqrt{(\sum_{i=1}^{I} m_{i+}^{2}-n)(\sum_{j=1}^{J} m_{+j}^{2}-n)}} + \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}}{\sqrt{(\sum_{i=1}^{I} m_{i+}^{2}-n)(\sum_{j=1}^{J} m_{+j}^{2}-n)}}, \tag{2.6}$$

$$K = \frac{1}{2}\left(\frac{a}{a+b}+\frac{a}{a+c}\right) = -\frac{n\left(\sum_{i=1}^{I} m_{i+}^{2}+\sum_{j=1}^{J} m_{+j}^{2}-2n\right)}{2(\sum_{i=1}^{I} m_{i+}^{2}-n)(\sum_{j=1}^{J} m_{+j}^{2}-n)} + \frac{\sum_{i=1}^{I} m_{i+}^{2}+\sum_{j=1}^{J} m_{+j}^{2}-2n}{2(\sum_{i=1}^{I} m_{i+}^{2}-n)(\sum_{j=1}^{J} m_{+j}^{2}-n)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}. \tag{2.7}$$
In general, any similarity index SI, when corrected for chance agreement, takes the form

$$CSI = \frac{SI - E(SI)}{1 - E(SI)}, \tag{2.8}$$
where the expectation E(SI) is conditional upon fixed sets of marginal counts in the matrix M (fixed class sizes in both partitions), and 1 is the theoretical maximum of the index. Albatineh (2010) derived means and variances for any member of the L family. Once corrected for chance agreement, the indices R, FM, and K will be denoted by CR, CFM, and CK, respectively. Non-members of L are not considered in this paper, because their correction for chance agreement is not straightforward; it is deferred to another paper. Albatineh et al. (2006) have shown that as the cluster sizes increase, the difference between indices corrected using the expectation proposed by Morey and Agresti (1984) (asymptotic expectation) and that proposed by Hubert and Arabie (1985) (exact expectation) becomes negligible. Consequently, we chose to use the simpler correction based on the asymptotic expectation (Morey and Agresti 1984). Correction by elimination of the chance effect is similar to the proposal of Guttman (1941) in his measure of nominal association (later denoted by λ by Goodman and Kruskal (1954)) and to the measure of interjudge agreement proposed by Cohen (1960); see Albatineh et al. (2006) for discussion.
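To illustrate the correction, the following R sketch (again our own illustration; the name corrected.indices is ours) computes CR, CFM, and CK by writing each index in the L-family form $SI = \alpha + \beta \sum_i\sum_j m_{ij}^2$ and replacing $\sum_i\sum_j m_{ij}^2$ in $E(SI)$ by $\sum_i m_{i+}^2 \sum_j m_{+j}^2 / n^2$, our reading of the asymptotic expectation of Morey and Agresti (1984):

    # Sketch: corrected similarity indices CR, CFM, CK for two partitions.
    # Each index has the form SI = alpha + beta * sum(m_ij^2), so
    # E(SI) = alpha + beta * E(sum(m_ij^2)); here E(sum(m_ij^2)) is taken
    # as the asymptotic expectation si * sj / n^2.
    corrected.indices <- function(labels1, labels2) {
      M  <- table(labels1, labels2)
      n  <- sum(M)
      s  <- sum(M^2)                 # observed sum of m_ij^2
      si <- sum(rowSums(M)^2)        # sum of m_i+^2
      sj <- sum(colSums(M)^2)        # sum of m_+j^2
      es <- si * sj / n^2            # expected sum of m_ij^2 under chance

      csi <- function(alpha, beta)   # equation (2.8) for SI = alpha + beta*s
        beta * (s - es) / (1 - alpha - beta * es)

      root <- sqrt((si - n) * (sj - n))
      c(CR  = csi(1 - (si + sj) / (n * (n - 1)), 2 / (n * (n - 1))),  # (2.5)
        CFM = csi(-n / root, 1 / root),                               # (2.6)
        CK  = csi(-n * (si + sj - 2 * n) / (2 * root^2),
                  (si + sj - 2 * n) / (2 * root^2)))                  # (2.7)
    }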
3. Proposed Method
The proposed MCS method finds the number of clusters based on the maximum similarity between pairs of partitions for the same data set. The MCS method has no distributional assumptions; it only assumes that the data can be clustered using at least two of the available clustering algorithms. Seven clustering algorithms are chosen for the simulations presented in this paper: single linkage, average linkage, complete linkage, Ward's minimum variance method, the centroid method, McQuitty's method, and the K-means method, all of which are available in the R statistical software (2007). The proposed method can be summarized as follows¹:
1. Use any two clustering algorithms to cluster the same data into the same number of clusters, k = 2, 3, . . . , M.
2. Use any of the corrected similarity measures discussed by Albatineh et al. (2006) to calculate the similarity between the obtained sets of k clusters at k = 2, 3, . . . , M.
3. Record the number of clusters for which the corrected similarity measures attain a maximum value.
4. Repeat steps (1)-(3) for all $\binom{P}{2}$ pairs of the P algorithms.
5. Determine the most frequent number of clusters ($\hat{M}$) found in step (3). Such a number is a candidate for the optimal number of clusters (an R sketch of steps 1-3 follows this list).
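For illustration, a minimal R sketch of steps 1-3 for one pair of algorithms (average linkage and K-means, an arbitrary choice; mcs.pair is our name, and corrected.indices is the sketch from Section 2):

    # Sketch of MCS steps 1-3 for one pair of clustering algorithms.
    mcs.pair <- function(x, M = 10) {
      hc  <- hclust(dist(x), method = "average")
      sim <- sapply(2:M, function(k) {
        p1 <- cutree(hc, k = k)                # step 1: k clusters, algorithm A
        p2 <- kmeans(x, centers = k)$cluster   # step 1: k clusters, algorithm B
        corrected.indices(p1, p2)["CR"]        # step 2: corrected similarity at k
      })
      (2:M)[which.max(sim)]                    # step 3: k with maximum similarity
    }
    # Steps 4-5: repeat over all choose(P, 2) pairs of the P algorithms and
    # take the most frequent result as the candidate number of clusters.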
4. Simulations Results
The data sets used in this section were generated from a mixture of
bivariate normal distributions with specified means and variancecovariance
matrix. For the requested number of clusters (k = 2, 3, . . . , M ), 1000 data
sets were generated from the same distribution, clustered using two clustering algorithms with similarity indices been calculated and averaged over
the 1000 data sets. To evaluate the performance of MCS method, examples of known data structure/shape (round, elongated, different covariance,
high dimensions, and circular) are considered. In the simulations, it is not
our intention to vary systematically certain design factors, but to present
examples similar to those used in Tibshirani et al. (2001), see Brusco and
Steinley (2007) for simulations design. The following section presents examples of specific clustering structures selected to evaluate performance of
the proposed MCS method.
1. A referee suggested formulating the MCS method as follows: MCS calculates, for all $k$ with $2 \le k \le M$ and all clustering algorithms $j = 1, 2, \ldots, P$, the corresponding $k$-partitions $P_k^j$; then determines, for each pair of algorithms $(j, j')$, the number of classes $k_{jj'} := \arg\max_k \mathrm{sim}(P_k^j, P_k^{j'})$ with maximum similarity between $P_k^j$ and $P_k^{j'}$; and finally chooses, as the estimated number of classes, the integer $\hat{M}$ that has maximum frequency among the $\binom{P}{2}$ numbers $k_{12}, \ldots, k_{P,P-1}$.
The data used in the simulations were generated from the examples below. Figure 1 presents a sample run from each of these examples.
1. Five compact clusters: the data for this example were generated from a mixture of bivariate normal distributions with equal mixing proportions and with means and variance-covariance matrix given by
$$\mu_1 = \begin{pmatrix} 5 \\ 5 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 5 \\ 16 \end{pmatrix},\; \mu_3 = \begin{pmatrix} 10 \\ 11 \end{pmatrix},\; \mu_4 = \begin{pmatrix} 15 \\ 16 \end{pmatrix},\; \mu_5 = \begin{pmatrix} 15 \\ 5 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 2.25 & 0.5 \\ 0.5 & 2.25 \end{pmatrix}.$$
2. Two elongated clusters: the data for this example consist of two elongated clusters of size 100 and 400, generated from a mixture of bivariate normal distributions with means and variance-covariance matrix given by
$$\mu_1 = \begin{pmatrix} 5 \\ 5 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 5 \\ 10 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$
3. Three elongated clusters: the data for this example consist of 500 observations comprising three clusters of roughly equal size, generated from a mixture of bivariate normal distributions with means and variance-covariance matrix given by
$$\mu_1 = \begin{pmatrix} 5 \\ 5 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 7 \\ 10 \end{pmatrix},\; \mu_3 = \begin{pmatrix} 5 \\ 15 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & 0.80 \\ 0.80 & 1 \end{pmatrix}.$$
4. Different covariance structure: the data for this example consist of two clusters of size 250 each, generated from a mixture of bivariate normal distributions with means and variance-covariance matrices given by
$$\mu_1 = \begin{pmatrix} 10 \\ 10 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 10 \\ 17 \end{pmatrix}, \quad \Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.$$
Figure 1. A sample data set from each of Examples 1-4 (panels: Example 1, Example 2, Example 3, Example 4).
5. Two clusters in four dimensions: the data for this example were generated from a mixture of two four-dimensional normal distributions with means and common variance-covariance matrix given by
$$\mu_1 = \begin{pmatrix} 5 \\ 5 \\ 5 \\ 5 \end{pmatrix},\; \mu_2 = \begin{pmatrix} 10 \\ 10 \\ 10 \\ 10 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1.5 & 1.0 & 1.5 & 0.5 \\ 1.0 & 2.0 & 1.5 & 1.8 \\ 1.5 & 1.5 & 2.0 & 1.0 \\ 0.5 & 1.8 & 1.0 & 2.0 \end{pmatrix}.$$
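For concreteness, a data set like those of Example 1 can be generated as in the following R sketch (assuming the MASS package for multivariate normal sampling; 100 points per cluster give the 500 observations used in the simulations):

    # Sketch: one simulated data set for Example 1 (five compact clusters,
    # equal mixing proportions, common covariance matrix).
    library(MASS)
    mu    <- list(c(5, 5), c(5, 16), c(10, 11), c(15, 16), c(15, 5))
    Sigma <- matrix(c(2.25, 0.5, 0.5, 2.25), nrow = 2)
    x     <- do.call(rbind, lapply(mu, function(m) mvrnorm(100, mu = m, Sigma = Sigma)))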
For each example, 1000 data sets of size 500 each were generated and clustered by pairs of the seven different algorithms, using the Euclidean distance to compute the distances between points. The similarity between the two clusterings was calculated using the indices R, FM, and K (equations 2.5, 2.6, 2.7) corrected for chance agreement. For each pair of algorithms, the average of each similarity index was calculated over the 1000 data sets. Because 7 clustering algorithms (21 combinations) and 8 data examples are used, only a part of the simulation results is tabulated, due to the simulation size and the space needed (see Appendix); the results for all 168 cases are summarized in Table 3.
Tables 5-12 present a portion of the simulation results. As summarized in Table 3, out of 168 cases MCS indicated the correct number of clusters in 110 cases (65.48%), a wrong number of clusters in 38 cases (22.62%), and gave no indication (the index value increased as the number of clusters increased) in 20 cases (11.90%). Albatineh et al. (2006) showed that the R, FM, and K indices are members of the L family. Under some conditions, members of the L family were shown to be identical after correction for chance agreement (regardless of the expectation used). That explains why the values of the corrected indices CR, CFM, and CK were very close in the simulations.
5. Circular Data
Cluster analysis for circular data has received little attention in the literature. In fact, the most widely known texts on circular statistics, by Batschelet (1981), Fisher (1993), and Mardia and Jupp (2000), do not mention any criterion that would be useful for identifying the number of clusters in a circular data set. The statistics used for linear data cannot be applied to circular data because they do not account for its circular nature (i.e., 5° and 355° are only 10 degrees apart); see Kaufman and Rousseeuw (1990) for a discussion. Lund (1999) proposed a statistic for determining the optimal number of clusters in a circular data set; the number of clusters is the value at which Lund's statistic is maximized. Later, Baragona (2003) indicated that Lund's statistic performs satisfactorily if there is an equal number of observations in each cluster, which is rare in practice; otherwise Lund's statistic identifies only large clusters. This clearly indicates the need for other criteria.
5.1 Five Clusters Example
The data for this example consist of five clusters of size 50 each, generated from a mixture of unimodal von Mises distributions $vm_i(\mu_i, \kappa)$ with different mean directions $\mu_i$ and the same concentration parameter $\kappa$.

Figure 2. Five clusters from a mixture of von Mises distributions with equal sizes, different mean directions and same concentration parameter.
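A sample of this kind can be drawn, for instance, with the circular package in R; the mean directions and the concentration parameter below are illustrative stand-ins, not the paper's exact values:

    # Sketch: five von Mises clusters of size 50 each; the mean directions
    # (equally spaced here) and kappa = 8 are illustrative assumptions.
    library(circular)
    deg   <- c(0, 72, 144, 216, 288)
    theta <- unlist(lapply(deg, function(m)
      rvonmises(50, mu = circular(m, units = "degrees"), kappa = 8)))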
5.2 Turtles Data Example

This example extends the application of the proposed method to the turtles data, which represent the movements of 76 turtles after an experimental treatment (see Yang and Pan (1997)). The authors analyzed this data set using their proposed Fuzzy C-Directions (FCD) clustering algorithm. Figure 3 shows a plot of the turtles data. A 76 × 76 distance matrix for the turtles data was calculated using the distance measure

$$d_{ij} = \frac{1}{2}\left(1 - \cos(\theta_i - \theta_j)\right),$$

where $\theta_i$ and $\theta_j$ are two circular observations. The distance matrix obtained was clustered using the previous clustering methods, requesting 2, 3, . . . , 10 clusters, and the similarity between the two clusterings was calculated using the same indices.

Figure 3. Plot of the turtles data.

Table 12 presents the results for this data set. Table 3 reveals that 20 out of 21 combinations identified the number of clusters present to be two. In fact, when Yang and Pan (1997) discussed this data set using their proposed FCD clustering algorithm, they indicated that it had two clusters.
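In R, the distance matrix and a clustering with a requested number of clusters can be obtained as in the following sketch (circ.dist is our helper name; the random angles stand in for the 76 observed turtle directions):

    # Sketch: circular distance d_ij = (1 - cos(theta_i - theta_j)) / 2,
    # fed to a hierarchical algorithm via a 'dist' object.
    circ.dist <- function(theta)
      as.dist(0.5 * (1 - cos(outer(theta, theta, "-"))))

    theta     <- runif(76, 0, 2 * pi)  # stand-in for the turtle directions (radians)
    hc        <- hclust(circ.dist(theta), method = "average")
    partition <- cutree(hc, k = 2)     # request k = 2, the number found for the turtles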
6. Comparison With Other Methods
Many criteria² have been proposed in the literature for finding the number of clusters; a good summary can be found in Gordon (1999). Recently, Tibshirani et al. (2001) proposed the Gap statistic for finding the number of clusters in a given data set and compared it to the criteria of Calinski and Harabasz (CH) (1974), which was among those with the best performance in Milligan and Cooper (1985), of Krzanowski and Lai (KL) (1985), of Hartigan (H) (1975), and to the Silhouette statistic (Silhouette) proposed by Kaufman and Rousseeuw (1990). In this section, we compare the performance of the MCS method to all the criteria discussed in Tibshirani et al. (2001). The values of MCS, CH, KL, H, Silhouette, Gap/unif, and Gap/pc are calculated, and the frequency of obtaining the correct number of clusters is reported in Table 2. In the simulations, we proceed in the same manner as Tibshirani et al. (2001) by simulating 50 data sets from each of our first six examples. The number of times the correct number of clusters was identified is reported. The only exception is that we did not consider the null model (the one-cluster case), because hierarchical procedures assume the number of clusters to be greater than one and some of the criteria are not defined for one cluster. Because our method requires two clustering algorithms, we have chosen (arbitrarily) the average linkage method and K-means for use in MCS, while the average linkage method alone was used for the other criteria.

2. The initial comparison of MCS was made with the Cubic Clustering Criterion (CCC), Pseudo F, and Pseudo T², which are options in the SAS PROC CLUSTER procedure. The MCS outperformed those criteria, but one reviewer suggested a comparison with more recent ones and provided a reference for the criteria described in this section.
It should be noted that the
average linkage and K-means clustering methods were used by Tibshirani et
al. (2001). As shown in Table 2, the KL and H criteria have the worst performance in all examples. The MCS and CH criteria produced comparable
results in three of the six examples. The MCS outperformed CH in the two
elongated clusters example, while CH performed better in examples (1) and
(6). Also, MCS produced results comparable to the Gap statistic in all examples, except example (1), in which the Gap statistic performed relatively better. Finally, MCS produced results comparable to the Silhouette method in four of the six examples, while the Silhouette method performed better than MCS in examples (1) and (6). It must be mentioned that the criteria of CH, KL, H, and Gap cannot be implemented for circular data because their input is a data matrix rather than a distance matrix; hence the comparison is not performed for circular data.
To show the consequences of using a wrong clustering algorithm, Table 4 shows simulation results for 50 data sets, each containing two elongated clusters (see Section 4.1). The MCS used a combination of the K-means and single linkage methods, while the other criteria used the single linkage method. It is clear that MCS, CH, and Silhouette outperformed KL, H, and the Gap statistic (the Gap statistic had the best performance in Tibshirani et al. (2001)). This example shows the need to choose not only a good criterion for finding the number of clusters, but also a proper clustering algorithm.
Table 2. Frequency of finding the correct number of clusters for 50 data sets, using K-means vs. the average linkage method for MCS and the average linkage method for the other criteria. If a row total is smaller than 50, the number of clusters found was greater than 10 in the remaining data sets; the starred column is the correct number of clusters, and its entries are the frequencies of finding it.

Example 1: Five separated clusters
Criteria     2   3   4   5*  6   7   8   9  10
MCS          0   9   4  37   0   0   0   0   0
CH           0   0   0  50   0   0   0   0   0
KL          10   0   0  26   2   0   0   4   8
H            0   0   4  25  14   7   2   0   0
Silhouette   0   0   0  50   0   0   0   0   0
Gap/unif     0   0   0  49   1   0   0   0   0
Gap/pc       0   0   0  50   0   0   0   0   0

Example 2: Two elongated clusters
Criteria     2*  3   4   5   6   7   8   9  10
MCS         50   0   0   0   0   0   0   0   0
CH          25   2   1   2   5   0   5   3   7
KL           0   2   4   4  15   5   4   6  10
H           11   4   2   4   5   2   7   5   4
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    44   4   2   0   0   0   0   0   0
Gap/pc      44   2   2   2   0   0   0   0   0

Example 3: Three elongated clusters
Criteria     2   3*  4   5   6   7   8   9  10
MCS          3  47   0   0   0   0   0   0   0
CH           0  50   0   0   0   0   0   0   0
KL           0  25   8   6   5   1   2   2   1
H            0  13   5   2   2   5   4   2   3
Silhouette   0  50   0   0   0   0   0   0   0
Gap/unif     0  49   1   0   0   0   0   0   0
Gap/pc       0  50   0   0   0   0   0   0   0

Example 4: Two clusters with different covariance
Criteria     2*  3   4   5   6   7   8   9  10
MCS         50   0   0   0   0   0   0   0   0
CH          50   0   0   0   0   0   0   0   0
KL          29   1   0   1  11   3   2   2   1
H            9   9   9   4   2   2   6   4   2
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    50   0   0   0   0   0   0   0   0
Gap/pc      50   0   0   0   0   0   0   0   0

Example 5: Two clusters in four dimensions
Criteria     2*  3   4   5   6   7   8   9  10
MCS         46   4   0   0   0   0   0   0   0
CH          50   0   0   0   0   0   0   0   0
KL          29   0   2   0   2   4   4   5   4
H           11   8   7  13   2   2   2   2   2
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    34   7   1   8   0   0   0   0   0
Gap/pc      48   2   0   0   0   0   0   0   0

Example 6: Four clusters in four dimensions
Criteria     2   3   4*  5   6   7   8   9  10
MCS          7   1  39   3   0   0   0   0   0
CH           0   0  47   3   0   0   0   0   0
KL           0   0  28   8   3   3   2   1   5
H            0   0  21   9   2   5   2   3   6
Silhouette   0   0  48   2   0   0   0   0   0
Gap/unif     0   0  40   7   0   3   0   0   0
Gap/pc       0   0  46   4   0   0   0   0   0
Table 3. Simulation results for evaluating MCS using seven clustering algorithms (single linkage (1), average linkage (2), complete linkage (3), Ward's (4), centroid (5), McQuitty (6), and K-means (7)) with eight data structures/examples, where a check mark indicates maximum similarity at the correct number of clusters, an 'x' indicates maximum similarity at a wrong number of clusters, and a '?' indicates that no maximum was attained.

[Table body not fully recoverable from the source: rows for the 21 algorithm pairs, e.g. (1,2), (1,3), (3,4), (3,5), (2,5), (2,6), (5,7), (6,7), (2,7), over data sets 1-8; the check marks were lost in extraction.]
Table 4. Frequency of finding the correct number of clusters for 50 data sets, with K-means vs. the single linkage method used in MCS and the single linkage method for the other criteria; the column for the correct number of clusters gives the frequency of identifying it.

[Table body not recoverable from the source: rows for the criteria MCS, CH, KL, H, Silhouette, Gap/unif, and Gap/pc; the counts were lost in extraction.]
7. Conclusions and Final Comments
The KL and H criteria have the worst performance in all examples. The MCS and CH criteria produced comparable results in three of the six examples. The MCS outperformed CH in the two elongated clusters example, while CH performed better in examples (1) and (6). The MCS produced results comparable to the Gap statistic in all examples, except example (1), in which the Gap statistic performed better. Also, the MCS produced results comparable to the Silhouette method in four of the six examples, while the Silhouette method performed better than MCS in examples (1) and (6). In an example of two elongated clusters generated from the data structure described in Section 4.1, the MCS along with the CH and Silhouette methods outperformed the Gap statistic (the Gap statistic had the best performance in Tibshirani et al. (2001)).
The proposed method can be used to choose the similarity index (the index at which the maximum occurred) as well as the clustering algorithm used to cluster the data (either algorithm of the combination for which the maximum occurred). Also, the MCS method can be used to check how similar two algorithms are in clustering a data set (a value of 1 would mean that the clusterings are identical). The MCS method does not require any distributional or data assumptions, unlike other criteria; it only assumes that a given data set can be clustered using two clustering algorithms. Finally, one of the advantages of this method is that when the similarity indices in more than two combinations of clustering algorithms attain their maximum at the same number of clusters, this gives some assurance that such a number is the true number of clusters. Moreover, unlike other methods that yield a wrong number of clusters, our method sometimes does not attain a maximum value, i.e., the index values increase as the number of clusters increases, a sort of inconclusive test, which we think is better than reporting a wrong answer. In cases of multiple maxima, the opinion of the clustering practitioner (geneticist, biologist, ecologist, etc.) in determining whether such a splitting is realistic can be of great value.
Throughout all the simulations and the application to a real data example, we found that the combination of the average linkage method and Ward's method correctly identified the number of clusters in all examples. In addition, the CR, CFM, and CK indices are recommended because of their consistent performance in identifying the correct number of clusters. All simulations were performed using the R statistical software (2007).
8. Appendix
[Table 5: averages of corrected similarity indices for different combinations of clustering algorithms; the values did not survive extraction and the table is not reproduced here.]
Table 6. Example 2 simulations: averages of corrected similarity indices for different combinations with two elongated clusters.
[Bodies of Table 6 and Table 7 not recoverable: only the row labels (Index: CR, CFM, CK) and scattered values survived extraction.]
Table 8. Example 4 simulations: averages of corrected similarity indices for different combinations with two clusters of different covariances.
[Bodies of Table 8 and Table 9 not recoverable: only the row labels (Index: CR, CFM, CK) and scattered values survived extraction.]
Table 10. Example 6 simulations: averages of corrected similarity indices for different combinations with four clusters of equal size from four dimensional data.
[Bodies of Table 10 and Table 11 not recoverable: only the row labels (Index: CR, CFM, CK) and scattered values survived extraction.]
Table 12. Example 8 results: values of similarity indices for different combinations of clustering algorithms for the turtles data.

[Table body only partially recoverable: at k = 2 several algorithm pairs attain corrected index values of exactly 1 (identical two-cluster partitions), with the remaining recoverable k = 2 values near 0.936 and 0.825; the full table did not survive extraction.]
References
ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D.P. (2006), "On Similarity Indices and Correction for Chance Agreement", Journal of Classification, 23, 301–313.
ALBATINEH, A.N. (2010), "Means and Variances for a Family of Similarity Indices Used in Cluster Analysis", Journal of Statistical Planning and Inference, 140, 2828–2838.
ANDREWS, D.F. (1972), "Plots of High Dimensional Data", Biometrics, 28, 125–136.
BANFIELD, J.D., and RAFTERY, A.E. (1993), "Model-based Gaussian and Non-Gaussian Clustering", Biometrics, 49, 803–821.
BARAGONA, R. (2003), "Further Results on Lund's Statistic for Identifying Clusters in a Circular Data Set with Application to Time Series", Communications in Statistics: Simulation and Computation, 32, 943–952.
BATSCHELET, E. (1981), Circular Statistics in Biology, London: Academic Press.
BOCK, H.H. (1985), "On Some Significance Tests in Cluster Analysis", Journal of Classification, 2, 77–108.
BRECKENRIDGE, J.N. (1989), "Replicating Cluster Analysis: Method, Consistency, and Validity", Multivariate Behavioral Research, 24, 147–161.
BRUSCO, M.J., and STEINLEY, D. (2007), "A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning", Psychometrika, 72, 583–600.
CALINSKI, T., and HARABASZ, J. (1974), "A Dendrite Method for Cluster Analysis", Communications in Statistics, 3, 1–27.
COHEN, J. (1960), "A Coefficient of Agreement for Nominal Scales", Educational and Psychological Measurement, 20, 37–46.
DUDOIT, S., and FRIDLYAND, J. (2002), "A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset", Genome Biology, 3, 1–21.
EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press.
FISHER, N.I. (1993), Statistical Analysis of Circular Data, Cambridge: Cambridge University Press.
FOWLKES, E.B., and MALLOWS, C.L. (1983), "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association, 78, 553–569.
FRALEY, C., and RAFTERY, A.E. (1998), "How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis", The Computer Journal, 41, 578–588.
GOODMAN, L., and KRUSKAL, W. (1954), "Measures of Association for Cross Classifications", Journal of the American Statistical Association, 49, 732–764.
GORDON, A.D. (1999), Classification (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
GUTTMAN, L. (1941), "An Outline of the Statistical Theory of Prediction", in The Prediction of Personal Adjustment, ed. P. Horst, New York: Social Science Research Council.
HARDY, A. (1994), "An Examination of Procedures for Determining the Number of Clusters in a Data Set", in New Approaches in Classification and Data Analysis, eds. E. Diday et al., Paris: Springer-Verlag, pp. 178–185.
HARDY, A. (1996), "On the Number of Clusters", Computational Statistics and Data Analysis, 23, 83–96.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.
HUBERT, L., and ARABIE, P. (1985), "Comparing Partitions", Journal of Classification, 2, 193–218.
JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, New Jersey: Prentice Hall.
KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons.
KOZIOL, J.A. (1990), "Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike's Information Criterion", Methods of Information in Medicine, 29, 200–204.