
Journal of Classification 28 (2011)

DOI: 10.1007/s00357-010-9069-1

MCS: A Method for Finding the Number of Clusters


Ahmed N. Albatineh
Florida International University, U.S.A.

Magdalena Niewiadomska-Bugaj
Western Michigan University, U.S.A.

Abstract: This paper proposes a maximum clustering similarity (MCS) method for
determining the number of clusters in a data set by studying the behavior of similarity indices comparing two (of several) clustering methods. The similarity between
the two clusterings is calculated at the same number of clusters, using the indices of
Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K) each corrected for chance
agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated
bivariate normal data, and further extended for use in circular data. Its performance
is compared to the criteria discussed in Tibshirani, Walther, and Hastie (2001). The
proposed method is not based on any distributional or data assumption which makes
it widely applicable to any type of data that can be clustered using at least two clustering algorithms.
Keywords: Similarity index; Clustering algorithm; Circular data; Bivariate normal
mixture; Correction for chance agreement; Gap statistic; Number of clusters; Comparing partitions.

The authors thank Willem Heiser and two anonymous referees for helpful comments
and valuable suggestions on an earlier draft of this paper.
Authors' Addresses: Ahmed N. Albatineh, Department of Epidemiology and Biostatistics, Florida International University, Miami, FL, Tel: +1 305-348-4909, Fax: +1 305-348-4901, email: aalbatin@fiu.edu; Magdalena Niewiadomska-Bugaj, Department of Statistics,
Western Michigan University, Kalamazoo, MI, email: m.bugaj@wmich.edu.


1. Introduction

One of the main problems in cluster analysis is to determine how many clusters exist in a given data set when the only information available consists of the actual observations. Many papers in the cluster analysis literature propose and study methods for determining the correct number of clusters in a given data set. Everitt, Landau, and Leese (2001) argued that there is no optimal criterion for determining the number of clusters in a given data set; see also Hartigan (1975), Bock (1985), Hardy (1996), and Gordon (1999) for further discussion. In a survey paper, Milligan and Cooper (1985) performed simulations to compare thirty procedures for determining the number of clusters. Koziol (1990) used Akaike's
information criterion with multinomial data in identifying the appropriate
number of clusters of tumor types with similar profiles of cell surface antigens. Sugar and James (2003) developed a nonparametric method based
on distortion, a quantity that measures the average distance, per dimension,
between each observation and its closest cluster center. Hardy (1994) compared three methods based on the hypervolume clustering criterion with
four other methods available in the CLUSTAN software (see Wishart 1978).
On the other hand, Peck, Fisher, and Van Ness (1989) developed a bootstrap
based procedure for obtaining approximate confidence bounds on the number of clusters. Milligan and Cooper (1986) used similarity indices to assess
ability of structure recovery for different clustering methods (similarity between the original clusters and the resulting clustering).
It is to be noted that some of the methods for determining the number
of clusters in a data set are based on maximizing (minimizing) an objective
function. For example, Marriott (1971) suggested taking the number of clusters to be the value k that minimizes $k^2 |W|$, where W is the within-groups dispersion matrix. Other methods for determining the optimal number of clusters are
model based, in the sense that the actual observations are assumed to follow
a mixture of distributions and the number of subpopulations is to be found,
see Banfield and Raftery (1993) and Fraley and Raftery (1998) for a discussion. For example, Wolfe's test (Wolfe 1970), which is based on the assumption of multivariate normality, is a likelihood ratio criterion for testing the hypothesis of k clusters vs. k - 1 clusters. Krolak-Schwerdt and Eckes (1992)
suggested a graph theoretic criterion named GRAPH for determining the
number of clusters in a data set. However, their algorithm usually terminates
with fewer groups than are actually present in the data set. To overcome this
problem, Vassilliou, Tambouratzis, Koutras, and Bersimis (2004) introduced
a distance measure, based on success runs theory and Andrews curves (Andrews 1972), as an alternative to the standard Euclidean distance, and incorporated it in the GRAPH procedure. Another approach to find the number of


clusters is the replication analysis proposed by Breckenridge (1989). Later, Dudoit and Fridlyand (2002) generalized Breckenridge's approach by proposing
a prediction-based resampling approach (Clest) to find the number of clusters.
This paper proposes a method for finding the number of clusters in a
data set based on the maximum clustering similarity between two partitions
of the same data set. The paper is organized as follows: Section 2 presents
an overview of similarity indices, Section 3 develops the proposed method,
Section 4 shows the simulation results using different data structures, while
Section 5 extends the use of the proposed method to circular data and applies it to a real data set representing the movements of 76 turtles after an experimental treatment. Section 6 compares the performance of the proposed
method with other criteria discussed in Tibshirani et al. (2001). Finally, Section 7 provides conclusions and final comments.
2. Similarity Indices

A standard approach in comparing two clusterings of the same data


set is to calculate the similarity between such clusterings (partitions) using
similarity indices. Since the clusters are not predefined, the similarity of different clusterings is usually based on the number of pairs of data points that
are (not) placed into the same cluster according to each procedure. Consequently, a 2 × 2 similarity table (see Table 1) is formed, where a, b, c, d, and
N are defined as:
a: Number of pairs located in the same cluster using both clustering methods.
b(c): Number of pairs located in the same cluster by method A(B), while in
separate clusters according to method B(A).
d: Number of pairs not clustered together by either of the two methods.

The total number of pairs is $N = a + b + c + d = \binom{n}{2} = \frac{n(n-1)}{2}$, where n is the number of observations to be clustered.
Entries of Table 1 can also be expressed in terms of counts in the matching matrix $M = [m_{ij}]$ obtained for two clustering algorithms producing I and J clusters, respectively ($i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$). Entry $m_{ij}$ is the number of observations classified to cluster i according to method A and to cluster j according to method B. The entries in Table 1 can be expressed in terms of the $m_{ij}$'s (see Jain and Dubes 1988) as
$$a = \sum_{i=1}^{I}\sum_{j=1}^{J} \binom{m_{ij}}{2} = \frac{1}{2}\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} - n\right) \qquad (2.1)$$


Table 1. Binary Counts for Two Clustering Methods

                                             Method B
Method A (number of pairs)   in the same clusters   in different clusters   Total
in the same clusters                   a                      b              a+b
in different clusters                  c                      d              c+d
Total                                 a+c                    b+d              N

$$b = \sum_{j=1}^{J} \binom{m_{+j}}{2} - \sum_{i=1}^{I}\sum_{j=1}^{J} \binom{m_{ij}}{2} = \frac{1}{2}\sum_{j=1}^{J} m_{+j}^{2} - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} \qquad (2.2)$$

$$c = \sum_{i=1}^{I} \binom{m_{i+}}{2} - \sum_{i=1}^{I}\sum_{j=1}^{J} \binom{m_{ij}}{2} = \frac{1}{2}\sum_{i=1}^{I} m_{i+}^{2} - \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} \qquad (2.3)$$

$$d = \binom{n}{2} - a - b - c = \frac{1}{2}\left(\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} + n^{2} - \sum_{i=1}^{I} m_{i+}^{2} - \sum_{j=1}^{J} m_{+j}^{2}\right) \qquad (2.4)$$

Indices that can be expressed in the form $SI = \alpha + \beta \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}$ are members of the L family, as discussed in Albatineh, Niewiadomska-Bugaj, and Mihalko (2006). Some members of the L family used in this paper include Rand (R) (1971), Fowlkes and Mallows (FM) (1983), and Kulczynski (K) (1927), which can be written in terms of a, b, c, d or $m_{ij}$ as

$$R = \frac{a+d}{a+b+c+d} = 1 - \frac{1}{n(n-1)}\left(\sum_{i=1}^{I} m_{i+}^{2} + \sum_{j=1}^{J} m_{+j}^{2}\right) + \frac{2}{n(n-1)}\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} \qquad (2.5)$$

$$FM = \frac{a}{\sqrt{(a+b)(a+c)}} = \frac{-n}{\sqrt{\left(\sum_{i=1}^{I} m_{i+}^{2} - n\right)\left(\sum_{j=1}^{J} m_{+j}^{2} - n\right)}} + \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2}}{\sqrt{\left(\sum_{i=1}^{I} m_{i+}^{2} - n\right)\left(\sum_{j=1}^{J} m_{+j}^{2} - n\right)}} \qquad (2.6)$$

$$K = \frac{1}{2}\left(\frac{a}{a+b} + \frac{a}{a+c}\right) = \frac{-n\left(\sum_{i=1}^{I} m_{i+}^{2} + \sum_{j=1}^{J} m_{+j}^{2} - 2n\right)}{2\left(\sum_{i=1}^{I} m_{i+}^{2} - n\right)\left(\sum_{j=1}^{J} m_{+j}^{2} - n\right)} + \frac{\sum_{i=1}^{I} m_{i+}^{2} + \sum_{j=1}^{J} m_{+j}^{2} - 2n}{2\left(\sum_{i=1}^{I} m_{i+}^{2} - n\right)\left(\sum_{j=1}^{J} m_{+j}^{2} - n\right)} \sum_{i=1}^{I}\sum_{j=1}^{J} m_{ij}^{2} \qquad (2.7)$$

In general, any similarity index SI, when corrected for chance agreement, takes the form

$$CSI = \frac{SI - E(SI)}{1 - E(SI)} \qquad (2.8)$$
where the expectation E(SI) is conditional upon fixed sets of marginal counts in the matrix M (fixed class sizes in both partitions) and 1 is the theoretical maximum of the index. Albatineh (2010) derived means and variances for any member of the L family. Once corrected for chance agreement, the indices R, FM, and K will be denoted by CR, CFM, and CK, respectively. Non-members of L are not considered in this paper because their correction for chance agreement is not straightforward; it is thus deferred to another paper. Albatineh et al. (2006) have shown that as the cluster size increases, the difference between indices corrected using the expectation proposed by Morey and Agresti (1984) (asymptotic expectation) and that proposed by Hubert and Arabie (1985) (exact expectation) becomes negligible. Consequently, we chose to use the simpler correction based on the asymptotic expectation (Morey and Agresti 1984). Correction by elimination of the chance effect is similar to the proposal of Guttman (1941) in his measure of nominal association (later denoted by λ by Goodman and Kruskal 1954) and to the measure of interjudge agreement proposed by Cohen (1960); see Albatineh et al. (2006) for discussion.
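As an illustration of correction (2.8), the following R sketch computes the corrected Rand index CR. It assumes the asymptotic expectation $E(\sum\sum m_{ij}^2) = (\sum_i m_{i+}^2)(\sum_j m_{+j}^2)/n^2$ attributed to Morey and Agresti (1984); the helper name is ours, not the authors':

# Sketch: corrected Rand index CR via equations (2.5) and (2.8),
# using the asymptotic expectation of sum m_ij^2 (an assumption here).
corrected_rand <- function(labels_A, labels_B) {
  M <- table(labels_A, labels_B)
  n <- sum(M); S <- sum(M^2)
  A <- sum(rowSums(M)^2); B <- sum(colSums(M)^2)
  r      <- 1 - (A + B) / (n * (n - 1)) + 2 * S / (n * (n - 1))    # (2.5)
  E_S    <- A * B / n^2                  # asymptotic E(sum m_ij^2)
  E_rand <- 1 - (A + B) / (n * (n - 1)) + 2 * E_S / (n * (n - 1))  # E(R)
  (r - E_rand) / (1 - E_rand)            # CR, equation (2.8)
}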
3. Proposed Method

Let $x_1, x_2, \ldots, x_n$ be n vectors corresponding to n objects to be clustered by two different clustering methods. Suppose that the maximum possible number of clusters is M, where $2 \le M \le n$. For P clustering algorithms there are $\binom{P}{2}$ combinations of algorithms. The MCS method seeks the most frequent $M^*$, $2 \le M^* \le M$, at which the similarity between two clusterings with the same number of clusters is maximal. The MCS method is different from the replication and prediction-based resampling methods proposed by Breckenridge (1989) and Dudoit and Fridlyand (2002). The latter methods randomly split the original data into two nonoverlapping sets, while MCS uses two clustering algorithms to obtain two


partitions for the same data set. The MCS method has no distributional assumptions. It only assumes that data can be clustered using at least two of the available clustering algorithms. Seven clustering algorithms are chosen for the simulations presented in this paper: single linkage, average linkage, complete linkage, Ward's minimum variance method, the centroid method, McQuitty's method, and K-means, all of which are available in the R statistical software (2007). The proposed method can be summarized as follows¹:
1. Use any two clustering algorithms to cluster the same data into the same number of clusters, k = 2, 3, ..., M.
2. Use any of the corrected similarity measures discussed by Albatineh et al. (2006) to calculate the similarity between the obtained sets of k clusters at k = 2, 3, ..., M.
3. Record the number of clusters for which the corrected similarity measures attain a maximum value.
4. Repeat steps (1)-(3) for all the $\binom{P}{2}$ combinations.
5. Determine the most frequent number of clusters ($M^*$) in step (3). Such a number will be a candidate for the optimal number of clusters. A condensed R sketch of these steps is given after this list.
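The following is a minimal R sketch of steps 1-5 for a single pair of algorithms (average linkage and K-means, the pair also used in Section 6); the function names are ours, and corrected_rand() refers to the helper sketched in Section 2:

# Sketch: candidate number of clusters from one pair of algorithms.
mcs_candidate <- function(x, M_max = 10) {
  hc <- hclust(dist(x), method = "average")
  ks <- 2:M_max
  sims <- sapply(ks, function(k) {
    part_A <- cutree(hc, k = k)               # partition 1 with k clusters
    part_B <- kmeans(x, centers = k)$cluster  # partition 2 with k clusters
    corrected_rand(part_A, part_B)            # corrected similarity at k
  })
  ks[which.max(sims)]   # step 3: k at which the index is maximal
}
# Steps 4-5: repeat over all choose(P, 2) pairs of algorithms and take the
# most frequent candidate as the MCS estimate of the number of clusters.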
4. Simulation Results

The data sets used in this section were generated from a mixture of bivariate normal distributions with specified means and variance-covariance matrix. For the requested number of clusters (k = 2, 3, ..., M), 1000 data sets were generated from the same distribution and clustered using two clustering algorithms, with the similarity indices calculated and averaged over the 1000 data sets. To evaluate the performance of the MCS method, examples of known data structure/shape (round, elongated, different covariance, high dimensions, and circular) are considered. In the simulations, it is not our intention to vary certain design factors systematically, but to present examples similar to those used in Tibshirani et al. (2001); see Brusco and Steinley (2007) for simulation design. The following section presents examples of specific clustering structures selected to evaluate the performance of the proposed MCS method.
1. A referee suggested formulating the MCS method as follows: MCS calculates, for all k with $2 \le k \le M$ and all clustering algorithms $j = 1, 2, \ldots, P$, the corresponding k-cluster partitions $P_{kj}$; then determines, for each pair of algorithms $j, j'$, the class number $k_{jj'} := \arg\max_k \, sim(P_{kj}, P_{kj'})$ with maximum similarity between $P_{kj}$ and $P_{kj'}$; and finally chooses as the estimated number of classes the integer $M^*$ that has maximum frequency among the $\binom{P}{2}$ numbers $k_{12}, \ldots, k_{P,P-1}$.


4.1 Data Generation

The data used in the simulations were generated from the examples below; an R sketch for generating one such data set is given after the list. Figure 1 presents a sample run from each of those examples.
1. Five compact clusters: the data for this example were generated from a mixture of bivariate normal distributions with equal mixing proportions and means and variance-covariance matrix given by $\mu_1 = (5, 5)^T$, $\mu_2 = (5, 16)^T$, $\mu_3 = (10, 11)^T$, $\mu_4 = (15, 16)^T$, $\mu_5 = (15, 5)^T$, and $\Sigma = \begin{pmatrix} 2.25 & 0.5 \\ 0.5 & 2.25 \end{pmatrix}$.
2. Two elongated clusters: the data for this example consist of two elongated clusters of size 100 and 400 generated from a mixture of bivariate normal distributions with means and variance-covariance matrix given by $\mu_1 = (5, 5)^T$, $\mu_2 = (5, 10)^T$, and $\Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$.
3. Three elongated clusters: the data for this example consist of 500 observations comprising three clusters of roughly equal size, generated from a mixture of bivariate normal distributions with means and variance-covariance matrix given by $\mu_1 = (5, 5)^T$, $\mu_2 = (7, 10)^T$, $\mu_3 = (5, 15)^T$, with $\Sigma = \begin{pmatrix} 1 & 0.80 \\ 0.80 & 1 \end{pmatrix}$.
4. Different covariance structure: the data for this example consist of two clusters of size 250 each, generated from a mixture of bivariate normal distributions with means and variance-covariance matrices given by $\mu_1 = (10, 10)^T$, $\mu_2 = (10, 17)^T$, with $\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $\Sigma_2 = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$.


Figure 1. Sample data sets representing Examples 1 through 4.

5. Two four-dimensional clusters: for this example two clusters of size 250 each were generated from a mixture of multivariate normal distributions $N_4(\mu, \Sigma)$, where $\mu_1 = (5, 5, 5, 5)^T$, $\mu_2 = (10, 10, 10, 10)^T$, and

$$\Sigma = \begin{pmatrix} 1.5 & 1.0 & 1.5 & 0.5 \\ 1.0 & 2.0 & 1.5 & 1.8 \\ 1.5 & 1.5 & 2.0 & 1.0 \\ 0.5 & 1.8 & 1.0 & 2.0 \end{pmatrix}.$$

6. Four four-dimensional clusters: in this example four clusters of size 125 each were generated from a mixture of multivariate normal distributions $N_4(\mu, \Sigma)$, with the same covariance matrix as in the previous example but with means given by $\mu_1 = (5, 5, 5, 5)^T$, $\mu_2 = (10, 10, 10, 10)^T$, $\mu_3 = (5, 5, 10, 10)^T$, and $\mu_4 = (10, 10, 5, 5)^T$.
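As an illustration, one data set for Example 1 can be generated in R as follows; the use of MASS::mvrnorm is our choice, since the paper does not specify a generator:

# Sketch: one sample of size 500 for Example 1 (five compact clusters,
# equal mixing proportions, so 100 observations per component).
library(MASS)
mu    <- list(c(5, 5), c(5, 16), c(10, 11), c(15, 16), c(15, 5))
Sigma <- matrix(c(2.25, 0.5, 0.5, 2.25), nrow = 2)
x     <- do.call(rbind, lapply(mu, function(m) mvrnorm(100, mu = m, Sigma = Sigma)))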


For each example, 1000 data sets of size 500 each were generated and clustered by any two of the seven different algorithms, using the Euclidean distance measure to compute the distances between the points. The similarity between the two clusterings is calculated using the indices R, FM, and K (equations 2.5, 2.6, 2.7) corrected for chance agreement. For each pair of algorithms, the average of each similarity index is calculated over the 1000 data sets. Because 7 clustering algorithms (21 combinations) and 8 examples of data are used, only a part of the simulation results is tabulated (see the Appendix), but the results for all 168 cases are summarized in Table 3.
Tables 5-12 present a portion of the simulation results. As summarized in Table 3, out of 168 cases, MCS indicated the correct number of clusters in 110 (65.48%) cases, a wrong number of clusters in 38 (22.62%) cases, and gave no indication (the index value increased as the number of clusters increased) in 20 (11.90%) cases. Albatineh et al. (2006) showed that the R, FM, and K indices are members of the L family. Under some conditions, members of the L family were shown to be identical after correction for chance agreement (regardless of the expectation used). That explains why in the simulations the values of the corrected indices CR, CFM, and CK were very close.
5. Extension to Circular Data

Cluster analysis for circular data has received little attention in the literature. In fact, the most widely known texts on circular statistics by Batschelet (1981), Fisher (1993), and Mardia and Jupp (2000) do not mention any criterion that would be useful for identifying the number of clusters in a circular data set. The statistics used for linear data cannot be applied to circular data because they do not account for its circular nature (e.g., 5° and 355° are only 10 degrees apart); see Kaufman and Rousseeuw (1990) for a discussion. Lund (1999) proposed a statistic for determining the optimal number of clusters in a circular data set; the number of clusters is the value at which Lund's statistic is maximized. Later, Baragona (2003) indicated that Lund's statistic performs satisfactorily if there is an equal number of observations in each cluster, which is rare in practice; otherwise Lund's statistic identifies only large clusters. This clearly indicates the need for other criteria.
5.1 Five Clusters Example

The data for this example consist of five clusters of size 50 each, generated from a mixture of von Mises distributions $vM(\mu_i, \kappa)$, the unimodal distribution on the circle with mean direction $\mu_i$ and concentration parameter $\kappa$, with $\mu = (\mu_1, \mu_2, \mu_3, \mu_4, \mu_5)^T = (5, 10, 15, 20, 25)^T$ and $\kappa = 15$.

Figure 2. Five clusters from a mixture of von Mises distributions with equal sizes, different mean directions, and the same concentration parameter.

Since we have a total of 250 observations, a distance matrix of size 250 × 250 was calculated using the distance measure $d_{ij} = \frac{1}{2}(1 - \cos(\theta_i - \theta_j))$, where $\theta_i$ and $\theta_j$ are two circular observations. Clearly, the range of $d_{ij}$ is [0, 1], with zero attained for two identical angles and one when the two angles are 180° apart. The obtained distance matrix was clustered using the seven clustering methods by requesting 2, 3, ..., 10 clusters. An example of five clusters from a von Mises mixture is shown in Figure 2. For each of the 1000 generated data sets we used different pairs of clustering algorithms and obtained values of the same similarity indices; hence, each entry in the table is the average of 1000 values of the index. Table 11 presents the simulation results for this example. Out of 21 combinations, MCS identified the number of clusters correctly in 14 combinations (67%), while seven combinations (33%) indicated a wrong number of clusters. A short sketch of the distance computation and clustering in R is given below.
5.2 Turtles Data Example

This example extends the application of the proposed method to the turtles data, which represent movements of 76 turtles after an experimental treatment (see Yang and Pan 1997). The authors analyzed this data set using their proposed Fuzzy C-Directions (FCD) clustering algorithm. Figure 3 shows a plot of the turtles data. A distance matrix of size 76 × 76 for the turtles data was calculated using the distance measure $d_{ij} = \frac{1}{2}(1 - \cos(\theta_i - \theta_j))$, where $\theta_i, \theta_j$ are two circular observations. The distance matrix obtained was clustered using the previous clustering methods by requesting 2, 3, ..., 10 clusters, and the similarity between the two clusterings was calculated using the same indices.


Figure 3. Movements of 76 turtles after an experimental treatment.

Table 12 presents the simulation results for these data. Table 3 reveals that 20 out of 21 combinations identified the number of clusters present to be two. In fact, when Yang and Pan (1997) discussed this data set using their proposed FCD clustering algorithm, they indicated that it had two clusters.
6. Comparison With Other Methods

Many criteria² have been proposed in the literature to find the number of clusters; a good summary can be found in Gordon (1999). Recently, Tibshirani et al. (2001) proposed the Gap statistic to find the number of clusters in a given data set and compared it to the criterion of Calinski and Harabasz (CH) (1974), which was among those with the best performance in Milligan and Cooper (1985), Krzanowski and Lai (KL) (1985), Hartigan (H) (1975), and the Silhouette statistic (Silhouette) proposed by Kaufman and Rousseeuw (1990). In this section, we compare the performance of the MCS method to all the criteria discussed in Tibshirani et al. (2001). The values of MCS, CH, KL, H, Silhouette, Gap/unif, and Gap/pc are calculated, and the frequency of obtaining the correct number of clusters is reported in Table 2. In the simulations, we follow Tibshirani et al. (2001) by simulating 50 data sets from each of our first six examples. The number of times the correct number of clusters was identified is reported. The only exception is that we did not consider the null model (one-cluster case), because hierarchical procedures would assume the number of clusters to be greater than one, and some of the criteria are not defined for one cluster. Because our method requires two clustering algorithms, we have chosen (arbitrarily) the average linkage method and K-means for use in MCS,
2. The initial comparison of MCS was made with the Cubic Clustering Criterion (CCC), Pseudo F, and Pseudo T², which are options in the SAS PROC CLUSTER procedure. The MCS outperformed those criteria, but one reviewer suggested a comparison with more recent ones and provided a reference for the criteria described in this section.


while the average linkage method was used for the other criteria. It is to be noted that the average linkage and K-means clustering methods were also used by Tibshirani et al. (2001). As shown in Table 2, the KL and H criteria have the worst performance in all examples. The MCS and CH criteria produced comparable results in three of the six examples. The MCS outperformed CH in the two elongated clusters example, while CH performed better in Examples (1) and (6). Also, MCS produced results comparable to the Gap statistic in all examples, except Example (1), in which the Gap statistic performed relatively better. Finally, MCS produced results comparable to the Silhouette method in four of the six examples, while the Silhouette method performed better than MCS in Examples (1) and (6). It must be mentioned that the criteria of CH, KL, H, and Gap cannot be implemented for circular data because their input is a data matrix rather than a distance matrix; hence a comparison is not performed for circular data.
In order to show the consequences of using the wrong clustering algorithm, Table 4 shows simulation results for 50 data sets, each containing two elongated clusters (see Section 4.1). The MCS used a combination of K-means and single linkage, while the other criteria used the single linkage method. It is clear that MCS, CH, and Silhouette outperformed KL, H, and the Gap statistic (the Gap statistic had the best performance in Tibshirani et al. 2001). This example shows the need to choose not only a good criterion for finding the number of clusters, but also a proper clustering algorithm.
7. Discussion and Conclusions

The proposed MCS method for determining the number of clusters uses the behavior of similarity indices between two clusterings of the same data set to obtain the correct number of clusters. The number of clusters at which the similarity index attains its maximum is a candidate for the true number of clusters. Seven clustering algorithms were used, namely: single linkage, average linkage, complete linkage, Ward's method, the centroid method, McQuitty's method, and K-means.
Hence a total of 21 combinations of algorithms were implemented in
the simulations. Eight examples of data sets were used, hence a total of 168
cases are reported in Table 3. The indices of R, FM, and K corrected for
chance agreement were used to calculate similarity between the clusterings.
Table 3 presents a summary of the simulation results. Out of 168 cases, MCS
indicated the correct number of clusters in 110 (65.48%) cases, wrong number of clusters in 38 (22.62%) cases, and gave no indication (index increased
as the number of clusters increased) in 20 (11.90%) cases. The MCS method
was applied to circular data. Five clusters were generated from a mixture of
von Mises distributions with equal cluster sizes, different mean directions,


Table 2. Frequency of finding the correct number of clusters for 50 data sets, using K-means vs. average linkage for MCS and the average linkage method for the other criteria (if a row total is smaller than 50, the number of clusters found exceeded 10 in some runs; the starred column marks the correct number of clusters).

Example 1: Five separated clusters
Criteria     2   3   4   5*  6   7   8   9  10
MCS          0   9   4  37   0   0   0   0   0
CH           0   0   0  50   0   0   0   0   0
KL          10   0   0  26   2   0   0   4   8
H            0   0   4  25  14   7   2   0   0
Silhouette   0   0   0  50   0   0   0   0   0
Gap/unif     0   0   0  49   1   0   0   0   0
Gap/pc       0   0   0  50   0   0   0   0   0

Example 2: Two elongated clusters
Criteria     2*  3   4   5   6   7   8   9  10
MCS         50   0   0   0   0   0   0   0   0
CH          25   2   1   2   5   0   5   3   7
KL           0   2   4   4  15   5   4   6  10
H           11   4   2   4   5   2   7   5   4
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    44   4   2   0   0   0   0   0   0
Gap/pc      44   2   2   2   0   0   0   0   0

Example 3: Three elongated clusters
Criteria     2   3*  4   5   6   7   8   9  10
MCS          3  47   0   0   0   0   0   0   0
CH           0  50   0   0   0   0   0   0   0
KL           0  25   8   6   5   1   2   2   1
H            0  13   5   2   2   5   4   2   3
Silhouette   0  50   0   0   0   0   0   0   0
Gap/unif     0  49   1   0   0   0   0   0   0
Gap/pc       0  50   0   0   0   0   0   0   0

Example 4: Two clusters with different covariance
Criteria     2*  3   4   5   6   7   8   9  10
MCS         50   0   0   0   0   0   0   0   0
CH          50   0   0   0   0   0   0   0   0
KL          29   1   0   1  11   3   2   2   1
H            9   9   9   4   2   2   6   4   2
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    50   0   0   0   0   0   0   0   0
Gap/pc      50   0   0   0   0   0   0   0   0

Example 5: Two clusters in four dimensions
Criteria     2*  3   4   5   6   7   8   9  10
MCS         46   4   0   0   0   0   0   0   0
CH          50   0   0   0   0   0   0   0   0
KL          29   0   2   0   2   4   4   5   4
H           11   8   7  13   2   2   2   2   2
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    34   7   1   8   0   0   0   0   0
Gap/pc      48   2   0   0   0   0   0   0   0

Example 6: Four clusters in four dimensions
Criteria     2   3   4*  5   6   7   8   9  10
MCS          7   1  39   3   0   0   0   0   0
CH           0   0  47   3   0   0   0   0   0
KL           0   0  28   8   3   3   2   1   5
H            0   0  21   9   2   5   2   3   6
Silhouette   0   0  48   2   0   0   0   0   0
Gap/unif     0   0  40   7   0   3   0   0   0
Gap/pc       0   0  46   4   0   0   0   0   0

and the same concentration parameter. Out of 21 combinations, MCS identified the number of clusters correctly in 14 combinations, while seven combinations indicated the wrong number of clusters. When MCS was applied to the real turtles data example, 20 out of 21 combinations identified two as the number of clusters. This conforms with the results obtained by Yang and Pan (1997) using their FCD algorithm.
Finally, the proposed method is compared to six criteria discussed in
Tibshirani et al. (2001) with results presented in Table 2. It is clear that the


Table 3. Simulation results for evaluating MCS using seven clustering algorithms (single linkage (1), average linkage (2), complete linkage (3), Ward's (4), centroid (5), McQuitty's (6), and K-means (7)) with eight data structures/examples, where a check mark (✓) indicates maximum similarity at the correct number of clusters, an x indicates maximum similarity at the wrong number of clusters, and a ? indicates no maximum attained.

Combination   Data: 1  2  3  4  5  6  7  8
(1,2)               ?  ✓  x  ✓  ?  x  x  ✓
(1,3)               ?  x  x  ✓  ?  x  x  ✓
(1,4)               ?  ✓  x  ✓  ?  ✓  x  ✓
(1,5)               ?  x  x  x  ?  ?  x  ✓
(1,6)               ?  x  ✓  ✓  ?  x  x  ✓
(1,7)               ?  ✓  ✓  ✓  ?  x  x  ✓
(2,3)               ✓  ?  ✓  ✓  ✓  x  ✓  ✓
(2,4)               ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓
(2,5)               x  x  x  ✓  x  x  ✓  ✓
(2,6)               ✓  x  ✓  ✓  ✓  x  ✓  ✓
(2,7)               ✓  ✓  ✓  ✓  ✓  x  ✓  ✓
(3,4)               ✓  ?  ✓  ✓  ✓  ✓  ✓  ✓
(3,5)               ✓  ?  ✓  ✓  ✓  ?  ✓  ✓
(3,6)               ✓  ?  x  ✓  ✓  ?  ✓  ✓
(3,7)               ✓  x  ✓  ✓  ✓  ✓  ✓  ✓
(4,5)               ✓  ✓  ✓  ✓  ✓  x  ✓  ✓
(4,6)               ✓  ?  ✓  ✓  ✓  x  ✓  ✓
(4,7)               ✓  ✓  ✓  ✓  ✓  ✓  ✓  x
(5,6)               ✓  x  ✓  ✓  ✓  x  x  ✓
(5,7)               ✓  ✓  ✓  ✓  ✓  x  ✓  ✓
(6,7)               ✓  x  ✓  ✓  ✓  x  ✓  ✓

Table 4. Frequency of finding the correct number of clusters for 50 data sets, with K-means vs. single linkage used in MCS and the single linkage method for the other criteria (the starred column marks the correct number of clusters).

Data: Two elongated clusters from Section 4.1
Criteria     2*  3   4   5   6   7   8   9  10
MCS         50   0   0   0   0   0   0   0   0
CH          50   0   0   0   0   0   0   0   0
KL           7   1   2   5   6   4  11  10   4
H           36  11   3   0   0   0   0   0   0
Silhouette  50   0   0   0   0   0   0   0   0
Gap/unif    18  18  10   1   3   0   0   0   0
Gap/pc      21  18   7   4   0   0   0   0   0


the KL and H criteria have the worst performance in all examples. The MCS and CH criteria produced comparable results in three of the six examples. The MCS outperformed CH in the two elongated clusters example, while CH performed better in Examples (1) and (6). The MCS produced results comparable to the Gap statistic in all examples except Example 1, in which the Gap statistic performed better. Also, the MCS produced results comparable to the Silhouette method in four of the six examples, while the Silhouette method performed better than MCS in Examples (1) and (6). In an example of two elongated clusters generated from the data structure described in Section 4.1, the MCS, along with the CH and Silhouette methods, outperformed the Gap statistic (the Gap statistic had the best performance in Tibshirani et al. 2001).
The proposed method can be used to choose the similarity index (the index at which the maximum occurred) as well as the clustering algorithm to cluster the data (either algorithm of the combination for which the maximum occurred). Also, the MCS method can be used to check how similar two algorithms are in clustering a data set (a value of 1 would mean that the clusterings are identical). The MCS method does not require any distributional or data assumptions, unlike other criteria. It only assumes that a given data set can be clustered using two clustering algorithms. Finally, one of the advantages of this method is that when the similarity indices in more than two combinations of clustering algorithms attain their maximum at the same number of clusters, this gives some assurance that such a number is the true number of clusters. Moreover, unlike other methods that yield a wrong number of clusters, our method sometimes does not attain a maximum value, i.e., index values increase as the number of clusters increases, a sort of inconclusive result which we think is better than reporting a wrong answer. In cases of multiple maxima, the opinion of the clustering practitioner (geneticist, biologist, ecologist, etc.) in determining whether such a splitting is realistic can be of great value.
Throughout all the simulations and an application to a real data example, we found that the combination of the average linkage method and Ward's method identified the number of clusters correctly in all examples. In addition, the indices CR, CFM, and CK are recommended because of their consistent performance in identifying the correct number of clusters. All simulations were performed using the R statistical software (2007).
8. Appendix

Due to space limitations and the size of the simulations, Tables 5-12 present part of the simulations of the 168 cases (21 combinations × 8 examples). Full results for all 168 cases can be found in Table 3.



Table 5. Example 1 simulations: averages of corrected similarity indices for different combinations with five compact clusters.

Complete Linkage Method vs Ward's Method
Index     2        3          4        5        6        7        8        9        10
CR     0.65004  0.53683    0.69906  0.91260  0.85036  0.78547  0.72651  0.67894  0.64848
CFM    0.65011  0.53871    0.69924  0.91271  0.85057  0.78583  0.72723  0.68024  0.65064
CK     0.65018  0.54062    0.69943  0.91281  0.85077  0.78619  0.72796  0.68154  0.65282

Average Linkage Method vs Complete Linkage Method
Index     2        3          4        5        6        7        8        9        10
CR     0.64912  0.5564996  0.70188  0.91376  0.88220  0.83674  0.79259  0.74452  0.70424
CFM    0.64917  0.558391   0.70210  0.91389  0.88334  0.84024  0.79929  0.75543  0.71830
CK     0.64922  0.5603049  0.70232  0.91403  0.88449  0.84376  0.80607  0.76657  0.73277

Centroid Method vs Ward's Method
Index     2        3          4        5        6        7        8        9        10
CR     0.32028  0.92145    0.88340  0.93499  0.89937  0.83960  0.77343  0.70390  0.63780
CFM    0.32754  0.92292    0.88543  0.93679  0.90112  0.84502  0.78521  0.72434  0.66816
CK     0.33506  0.92450    0.88758  0.93864  0.90288  0.85048  0.79719  0.74548  0.70022

Centroid Method vs Complete Linkage Method
Index     2        3          4        5        6        7        8        9        10
CR     0.23156  0.54921    0.67903  0.89276  0.88019  0.83788  0.78991  0.74012  0.69657
CFM    0.23684  0.55159    0.68073  0.89457  0.88157  0.84197  0.79862  0.75439  0.71684
CK     0.24232  0.55406    0.68252  0.89642  0.88296  0.84609  0.80746  0.76903  0.73786

McQuitty's Method vs Complete Linkage Method
Index     2        3          4        5        6        7        8        9        10
CR     0.47311  0.52134    0.67093  0.84104  0.81074  0.77566  0.73478  0.69492  0.66518
CFM    0.47441  0.52311    0.67178  0.84183  0.81149  0.77649  0.73602  0.69655  0.66684
CK     0.47577  0.52490    0.67264  0.84263  0.81224  0.77732  0.73728  0.69819  0.66851

McQuitty's Method vs Ward's Method
Index     2        3          4        5        6        7        8        9        10
CR     0.49027  0.51738    0.67992  0.86662  0.81818  0.75473  0.70311  0.66069  0.62303
CFM    0.49146  0.51831    0.68053  0.86743  0.81898  0.75586  0.70492  0.66354  0.62783
CK     0.49269  0.51924    0.68116  0.86826  0.81979  0.75699  0.70674  0.66642  0.63270

Average Linkage Method vs McQuitty's Method
Index     2        3          4        5        6        7        8        9        10
CR     0.49546  0.51036    0.68440  0.86609  0.84900  0.81934  0.78660  0.75035  0.71973
CFM    0.49674  0.51136    0.68504  0.86692  0.84977  0.82170  0.79124  0.75748  0.72901
CK     0.49806  0.51237    0.68571  0.86776  0.85053  0.82407  0.79592  0.76473  0.73849

Average Linkage Method vs K-means Method
Index     2        3          4        5        6        7        8        9        10
CR     0.62644  0.59843    0.69315  0.91272  0.87991  0.83104  0.77355  0.72708  0.68083
CFM    0.62654  0.60042    0.69331  0.91386  0.88146  0.83558  0.78227  0.73952  0.69727
CK     0.62664  0.60243    0.69347  0.91501  0.88303  0.84016  0.79112  0.75226  0.71425

Complete Linkage Method vs K-means Method
Index     2        3          4        5        6        7        8        9        10
CR     0.49211  0.47842    0.63378  0.87190  0.82878  0.76588  0.70174  0.65849  0.61846
CFM    0.49221  0.47991    0.63410  0.87299  0.82932  0.76644  0.70252  0.65947  0.61965
CK     0.49232  0.48141    0.63443  0.87409  0.82987  0.76701  0.70330  0.66046  0.62086

Single Linkage Method vs Complete Linkage Method
Index     2        3          4        5        6        7        8        9        10
CR     0.07791  0.13569    0.19481  0.22475  0.25269  0.26455  0.27519  0.27848  0.2797
CFM    0.07972  0.14381    0.22199  0.27092  0.30528  0.32092  0.33713  0.34383  0.34813
CK     0.08164  0.15295    0.25624  0.33481  0.37883  0.40026  0.42485  0.43751  0.44660

Table 6. Example 2 simulations: averages of corrected similarity indices for different combinations with two elongated clusters.

Single Linkage Method vs Average Linkage Method
Index     2        3          4        5        6        7          8        9        10
CR     0.95653  0.95292    0.86050  0.69407  0.51860  0.41740    0.36000  0.31356  0.28528
CFM    0.95655  0.95405    0.86646  0.71196  0.55141  0.46055    0.40992  0.36910  0.34427
CK     0.95658  0.95523    0.87294  0.73211  0.58949  0.51246    0.47208  0.44118  0.42329

Single Linkage Method vs Ward's Method
Index     2        3          4        5        6        7          8        9        10
CR     0.96431  0.45522    0.30044  0.26377  0.21360  0.1790864  0.16164  0.14443  0.12868
CFM    0.96434  0.49247    0.35471  0.32164  0.27592  0.2435887  0.22732  0.21089  0.19536
CK     0.96438  0.53471    0.42435  0.39945  0.36631  0.3438198  0.33380  0.32373  0.31428

Average Linkage Method vs Ward's Method
Index     2        3          4        5        6        7          8        9        10
CR     0.97204  0.45353    0.33703  0.38518  0.43076  0.46066    0.48865  0.49789  0.50397
CFM    0.97205  0.48705    0.37951  0.42209  0.46581  0.49478    0.52213  0.53139  0.53837
CK     0.97206  0.52479    0.43196  0.46727  0.50738  0.53375    0.55951  0.56854  0.57634

Centroid Method vs Ward's Method
Index     2        3          4        5        6        7          8        9        10
CR     0.84295  0.45057    0.31186  0.30206  0.32764  0.36396    0.39518  0.41580  0.41785
CFM    0.84299  0.48597    0.36081  0.34970  0.37548  0.41234    0.44262  0.46433  0.46840
CK     0.84305  0.52593    0.42228  0.34970  0.43685  0.47327    0.50036  0.52199  0.52816

Centroid Method vs K-means Method
Index     2        3          4        5        6        7          8        9        10
CR     0.83457  0.44117    0.33145  0.30187  0.32649  0.35930    0.39394  0.40245  0.40305
CFM    0.83472  0.47727    0.37789  0.35202  0.37550  0.40682    0.44022  0.45011  0.45310
CK     0.83487  0.51864    0.43523  0.41652  0.43847  0.46658    0.49634  0.50708  0.51249

Single Linkage Method vs K-means Method
Index     2        3          4        5        6        7          8        9        10
CR     0.92571  0.44303    0.31283  0.25204  0.21123  0.18231    0.15891  0.14088  0.12562
CFM    0.92585  0.48067    0.36447  0.31011  0.27305  0.24646    0.22432  0.20667  0.19165
CK     0.92599  0.52402    0.42976  0.38915  0.36289  0.34528    0.33089  0.31923  0.31018

Average Linkage Method vs K-means Method
Index     2        3          4        5        6        7          8        9        10
CR     0.93740  0.44071    0.35897  0.37310  0.42233  0.44914    0.46573  0.46755  0.47213
CFM    0.93752  0.47504    0.39987  0.41196  0.45649  0.48190    0.49831  0.50138  0.50565
CK     0.93765  0.51423    0.44968  0.45992  0.49698  0.51922    0.53480  0.53917  0.54279

Ward's Method vs K-means Method
Index     2        3          4        5        6        7          8        9        10
CR     0.95597  0.70448    0.65948  0.60898  0.61564  0.61935    0.62415  0.62102  0.61104
CFM    0.95607  0.70658    0.66071  0.61036  0.61658  0.62042    0.62528  0.62200  0.61204
CK     0.95617  0.70881    0.66196  0.61174  0.61753  0.62151    0.62642  0.62299  0.61305

Average Linkage Method vs Centroid Method
Index     2        3          4        5        6        7          8        9        10
CR     0.87808  0.94753    0.85236  0.72858  0.64364  0.63214    0.63614  0.64775  0.65633
CFM    0.87811  0.94853    0.85739  0.73966  0.65798  0.64567    0.64644  0.65687  0.66594
CK     0.87814  0.94959    0.86281  0.75179  0.67372  0.66055    0.65756  0.66660  0.67616

Single Linkage Method vs Centroid Method
Index     2        3          4        5        6        7          8        9        10
CR     0.86289  0.97091    0.93001  0.84352  0.70759  0.57757    0.47834  0.41056  0.36486
CFM    0.86293  0.97125    0.93175  0.84984  0.72354  0.60494    0.51554  0.45512  0.41483
CK     0.86298  0.97161    0.93357  0.85665  0.74124  0.63625    0.55936  0.50907  0.47693



Table 7. Example 3 simulations: averages of corrected similarity indices for different combinations with three elongated clusters.

Ward's Method vs Centroid Method
Index     2        3        4         5         6        7        8        9         10
CR     0.92899  0.93870  0.85346   0.72466   0.60074  0.54461  0.51321  0.49809   0.49538
CFM    0.92898  0.94187  0.85788   0.74021   0.63129  0.58087  0.55327  0.54117   0.53972
CK     0.92897  0.94530  0.86234   0.75620   0.6639   0.62032  0.59748  0.58934   0.58952

Average Linkage Method vs Ward's Method
Index     2        3        4         5         6        7        8        9         10
CR     0.92982  0.97536  0.85441   0.72272   0.62018  0.60151  0.59904  0.59855   0.60933
CFM    0.92982  0.97572  0.85821   0.73496   0.64203  0.62353  0.62041  0.61968   0.62884
CK     0.92982  0.97610  0.86205   0.74751   0.66505  0.64691  0.64312  0.64204   0.64935

Complete Linkage Method vs Ward's Method
Index     2        3        4         5         6        7        8        9         10
CR     0.36584  0.76641  0.71054   0.65234   0.66338  0.64497  0.63336  0.62583   0.62307
CFM    0.36620  0.76750  0.71118   0.65322   0.66576  0.64785  0.63714  0.63112   0.62846
CK     0.36656  0.76861  0.71183   0.65409   0.66817  0.65078  0.64097  0.63651   0.63393

McQuitty's Method vs Ward's Method
Index     2        3        4         5         6        7        8        9         10
CR     0.39266  0.76964  0.67919   0.63464   0.64290  0.62538  0.60689  0.59887   0.59468
CFM    0.39291  0.77303  0.68193   0.63751   0.64824  0.63195  0.61520  0.60873   0.60511
CK     0.39314  0.77654  0.68475   0.64044   0.65370  0.63869  0.62373  0.61889   0.61587

Average Linkage Method vs McQuitty's Method
Index     2        3        4         5         6        7        8        9         10
CR     0.38458  0.76413  0.73155   0.66058   0.67526  0.66848  0.66328  0.65844   0.64973
CFM    0.38488  0.76784  0.73388   0.66711   0.68524  0.67746  0.67002  0.66337   0.65311
CK     0.38518  0.77170  0.73624   0.67379   0.69556  0.68674  0.67694  0.66841   0.65655

Average Linkage Method vs Complete Linkage Method
Index     2        3        4         5         6        7        8        9         10
CR     0.37545  0.75901  0.740304  0.66620   0.65424  0.64678  0.64247  0.64818   0.64557
CFM    0.37585  0.76016  0.74337   0.67533   0.66939  0.66006  0.65312  0.65668   0.65207
CK     0.37625  0.76134  0.74646   0.684701  0.68519  0.67388  0.66417  0.66544   0.65872

Single Linkage Method vs K-means Method
Index     2        3        4         5         6        7        8        9         10
CR     0.54102  0.60844  0.55547   0.51597   0.48753  0.46514  0.44312  0.42008   0.40265
CFM    0.54129  0.63745  0.59174   0.55779   0.53401  0.51636  0.49928  0.48118   0.46875
CK     0.54156  0.66886  0.63296   0.60739   0.59088  0.57951  0.56905  0.55780   0.55189

Average Linkage Method vs K-means Method
Index     2        3        4         5         6        7        8        9         10
CR     0.59029  0.96907  0.84033   0.73021   0.65079  0.60810  0.59001  0.58479   0.58085
CFM    0.59059  0.96955  0.84454   0.74162   0.66681  0.62452  0.60548  0.59981   0.59486
CK     0.59088  0.97007  0.84879   0.75332   0.68356  0.64185  0.62180  0.61557   0.60948

Ward's Method vs K-means Method
Index     2        3        4         5         6        7        8        9         10
CR     0.58651  0.97527  0.78775   0.69469   0.66469  0.63758  0.62521  0.61362   0.60833
CFM    0.58679  0.97527  0.78779   0.69535   0.66694  0.63986  0.62742  0.61588   0.61033
CK     0.58708  0.97527  0.78784   0.69602   0.66921  0.64217  0.62966  0.61818   0.61238

Single Linkage Method vs Average Linkage Method
Index     2        3        4         5         6        7        8        9         10
CR     0.85999  0.60645  0.65168   0.69178   0.69569  0.67266  0.62293  0.595668  0.56626
CFM    0.85999  0.63574  0.67766   0.71476   0.71875  0.69835  0.65483  0.63228   0.60804
CK     0.85999  0.66745  0.70592   0.74003   0.74466  0.72746  0.69149  0.67466   0.65623

Table 8. Example 4 simulations: averages of corrected similarity indices for different combinations with two clusters of different covariances.

Single Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.91539  0.74105  0.60694  0.50670  0.42467  0.36546  0.32503  0.28801  0.25797
CFM    0.91539  0.75372  0.63708  0.55094  0.48095  0.43040  0.39563  0.36386  0.33763
CK     0.91539  0.76671  0.66999  0.60182  0.54902  0.51256  0.48859  0.46777  0.45114

Average Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.99831  0.74418  0.59939  0.50949  0.46044  0.43592  0.43389  0.43545  0.44650
CFM    0.99831  0.75580  0.62373  0.54045  0.49326  0.46780  0.46347  0.46223  0.47088
CK     0.99831  0.76769  0.65007  0.57508  0.53063  0.50413  0.49690  0.49203  0.49762

Complete Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.99216  0.69366  0.55100  0.49808  0.48692  0.47521  0.46841  0.46429  0.46324
CFM    0.99217  0.69524  0.55459  0.50249  0.49110  0.47950  0.47314  0.46931  0.46861
CK     0.99217  0.69684  0.55826  0.50702  0.49539  0.48390  0.47800  0.47446  0.47412

Ward's Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.99882  0.69004  0.55290  0.52160  0.50825  0.50634  0.50426  0.49822  0.51071
CFM    0.99882  0.69021  0.55674  0.52556  0.51201  0.50874  0.50681  0.50042  0.51233
CK     0.99882  0.69037  0.56065  0.52965  0.51596  0.51127  0.50949  0.50275  0.51401

Centroid Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.99858  0.74895  0.60826  0.50619  0.43491  0.37893  0.35007  0.33465  0.32170
CFM    0.99858  0.76127  0.63566  0.54618  0.48361  0.43561  0.41004  0.39635  0.38505
CK     0.99858  0.77388  0.66542  0.59174  0.54125  0.50497  0.48526  0.47479  0.46621

McQuitty's Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.96003  0.68903  0.55893  0.50522  0.47679  0.45561  0.45347  0.44148  0.44179
CFM    0.96014  0.69192  0.56464  0.51100  0.48326  0.46362  0.46221  0.45143  0.45254
CK     0.96025  0.69485  0.57058  0.51700  0.48997  0.47197  0.47132  0.46183  0.46381

Single Linkage Method vs Average Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.92171  0.97179  0.94557  0.88723  0.79463  0.69753  0.61712  0.53383  0.48141
CFM    0.92171  0.97200  0.94691  0.89153  0.80630  0.71885  0.64799  0.57600  0.53133
CK     0.92171  0.97223  0.94827  0.89598  0.81868  0.74198  0.68228  0.62407  0.58960

Single Linkage Method vs Complete Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.93464  0.79531  0.61582  0.52772  0.46864  0.42227  0.38707  0.35979  0.33353
CFM    0.93464  0.80371  0.64527  0.56917  0.51887  0.47970  0.45001  0.42718  0.40511
CK     0.93465  0.81227  0.67719  0.61574  0.57718  0.54863  0.52770  0.51250  0.49825

Single Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     0.92160  0.75966  0.53926  0.44353  0.36912  0.32141  0.28499  0.25564  0.23372
CFM    0.92160  0.77091  0.57868  0.49677  0.43339  0.39248  0.36069  0.33483  0.31524
CK     0.92160  0.78240  0.62229  0.55893  0.51312  0.48501  0.46369  0.44714  0.43494

Single Linkage Method vs Centroid Method
Index     2        3        4        5        6        7        8        9        10
CR     0.93988  0.97822  0.97661  0.96132  0.94116  0.91090  0.87761  0.83687  0.79474
CFM    0.93988  0.97826  0.97676  0.96178  0.94224  0.91333  0.88185  0.84405  0.80530
CK     0.93988  0.97831  0.97691  0.96225  0.94333  0.91580  0.88618  0.85146  0.81626



Table 9. Example 5 simulations: averages of corrected similarity indices for different combinations with two clusters of equal size from four-dimensional data.

Complete Linkage Method vs Average Linkage Method
Index     2        3        4        5        6         7        8        9        10
CR     0.80111  0.65827  0.57417  0.55031  0.52342   0.49996  0.49284  0.48219  0.47747
CFM    0.80139  0.66580  0.58940  0.56675  0.53858   0.51539  0.50778  0.49744  0.49241
CK     0.80167  0.67352  0.60555  0.58432  0.55483   0.53193  0.52370  0.51366  0.50821

Complete Linkage Method vs Ward's Method
Index     2        3        4        5        6         7        8        9        10
CR     0.80394  0.62272  0.57120  0.54383  0.51458   0.49976  0.49264  0.48944  0.48714
CFM    0.80418  0.62383  0.57386  0.54656  0.51791   0.50357  0.49643  0.49339  0.49145
CK     0.80442  0.62496  0.57659  0.54934  0.52130   0.50747  0.50029  0.49742  0.49584

McQuitty's Method vs Average Linkage Method
Index     2        3        4        5        6         7        8        9        10
CR     0.68686  0.63269  0.58239  0.54705  0.51910   0.50390  0.49765  0.49674  0.49001
CFM    0.68757  0.63915  0.59363  0.55867  0.53052   0.51383  0.50730  0.50585  0.49853
CK     0.68831  0.64578  0.60546  0.57100  0.54267   0.52431  0.51746  0.51541  0.50740

McQuitty's Method vs Complete Linkage Method
Index     2        3        4        5        6         7        8        9        10
CR     0.65035  0.56944  0.54988  0.52629  0.50406   0.48783  0.48042  0.47789  0.47927
CFM    0.65115  0.57203  0.55343  0.52954  0.50738   0.49120  0.48350  0.48130  0.48262
CK     0.65198  0.57469  0.55708  0.53286  0.51077   0.49465  0.48666  0.48479  0.48605

McQuitty's Method vs Ward's Method
Index     2        3        4        5        6         7        8        9        10
CR     0.70471  0.57957  0.53807  0.50911  0.48645   0.47067  0.46534  0.46212  0.46498
CFM    0.70532  0.58177  0.54282  0.51480  0.49364   0.47815  0.47386  0.47137  0.47485
CK     0.70595  0.58403  0.54773  0.52067  0.50112   0.48591  0.48271  0.48101  0.48512

Average Linkage Method vs Ward's Method
Index     2        3        4        5        6         7        8        9        10
CR     0.91965  0.71391  0.54787  0.51494  0.50182   0.48288  0.47469  0.46259  0.45859
CFM    0.91966  0.72317  0.56977  0.53928  0.52818   0.51007  0.50189  0.49207  0.48843
CK     0.91967  0.73264  0.59332  0.56586  0.55727   0.54001  0.53171  0.52446  0.52112

Centroid Method vs Ward's Method
Index     2        3        4        5        6         7        8        9        10
CR     0.87267  0.70869  0.51689  0.43369  0.37518   0.32981  0.30204  0.28510  0.27233
CFM    0.87267  0.72031  0.55069  0.47977  0.42887   0.38940  0.36527  0.35084  0.34070
CK     0.87268  0.73224  0.58775  0.53286  0.49353   0.46405  0.44678  0.43757  0.43251

Average Linkage Method vs Centroid Method
Index     2        3        4        5        6         7        8        9        10
CR     0.87104  0.89140  0.82785  0.74094  0.66310   0.61041  0.57536  0.54991  0.52953
CFM    0.87105  0.89254  0.83299  0.75244  0.68153   0.63449  0.60321  0.58074  0.56300
CK     0.87105  0.89369  0.83831  0.76457  0.70127   0.66061  0.63374  0.61479  0.60027

Average Linkage Method vs K-means Method
Index     2        3        4        5        6         7        8        9        10
CR     0.92458  0.68329  0.51317  0.50659  0.51048   0.50096  0.49479  0.48666  0.47674
CFM    0.92458  0.69419  0.53908  0.53283  0.53661   0.52706  0.52145  0.51270  0.50317
CK     0.92459  0.70537  0.56729  0.56166  0.56529   0.55562  0.55050  0.54094  0.53185

Single Linkage Method vs K-means Method
Index     2        3        4        5        6         7        8        9        10
CR     0.00003  0.00166  0.00255  0.00802  0.012306  0.01469  0.01855  0.02331  0.03138
CFM    0.00003  0.00176  0.00297  0.00991  0.01578   0.01969  0.02543  0.03269  0.04442
CK     0.00004  0.00192  0.00365  0.01315  0.02193   0.02897  0.03830  0.05043  0.06848

Table 10. Example 6 simulations: averages of corrected similarity indices for different combinations with four clusters of equal size from four-dimensional data.

Average Linkage Method vs Ward's Method
Index     2         3        4        5        6        7        8        9        10
CR     0.65450   0.72457  0.92155  0.88138  0.80187  0.71797  0.64693  0.62407  0.60964
CFM    0.65450   0.72602  0.92422  0.88365  0.80897  0.73177  0.66799  0.64662  0.63372
CK     0.654512  0.72751  0.92701  0.88595  0.81616  0.74595  0.69001  0.67035  0.65918

Single Linkage Method vs Ward's Method
Index     2         3        4        5        6        7        8        9        10
CR     0.21793   0.43324  0.70519  0.63510  0.56439  0.49513  0.42604  0.39462  0.36843
CFM    0.22100   0.43651  0.72632  0.66630  0.60650  0.54877  0.49142  0.46501  0.44313
CK     0.22415   0.44008  0.74845  0.69955  0.65276  0.61008  0.56993  0.55193  0.53786

Complete Linkage Method vs Ward's Method
Index     2         3        4        5        6        7        8        9        10
CR     0.26267   0.38208  0.58802  0.58046  0.56521  0.58940  0.61890  0.62213  0.62039
CFM    0.26391   0.38315  0.59002  0.58148  0.56612  0.59033  0.62059  0.62384  0.62237
CK     0.26522   0.38425  0.59206  0.58250  0.56703  0.59128  0.62229  0.62555  0.62436

Ward's Method vs K-means Method
Index     2         3        4        5        6        7        8        9        10
CR     0.53323   0.55687  0.84121  0.79872  0.72993  0.68872  0.67633  0.66686  0.65298
CFM    0.53328   0.55754  0.84177  0.79889  0.73013  0.68915  0.67730  0.66800  0.65402
CK     0.53332   0.55820  0.84234  0.79906  0.73034  0.68958  0.67828  0.66914  0.65507

Average Linkage Method vs Complete Linkage Method
Index     2         3        4        5        6        7        8        9        10
CR     0.23469   0.37346  0.58022  0.59722  0.56997  0.58759  0.59880  0.60873  0.61364
CFM    0.23585   0.37483  0.58277  0.59831  0.57397  0.59613  0.61095  0.62382  0.62896
CK     0.23708   0.37623  0.58537  0.59942  0.57803  0.60486  0.62348  0.63949  0.64491

Single Linkage Method vs Complete Linkage Method
Index     2         3        4        5        6        7        8        9        10
CR     0.02401   0.21655  0.37302  0.37642  0.36096  0.37419  0.38508  0.38851  0.38592
CFM    0.02436   0.21827  0.37977  0.39137  0.38535  0.41035  0.43201  0.44550  0.45003
CK     0.02473   0.22016  0.38677  0.40721  0.41204  0.45128  0.48675  0.51383  0.52857

Single Linkage Method vs Average Linkage Method
Index     2         3        4        5        6        7        8        9        10
CR     0.23546   0.41243  0.65991  0.69758  0.68986  0.67630  0.65372  0.62868  0.59543
CFM    0.23822   0.41505  0.67774  0.71858  0.71274  0.70209  0.68295  0.66222  0.63436
CK     0.24103   0.41787  0.69636  0.74045  0.73668  0.72934  0.71415  0.69859  0.67740

Centroid Method vs Ward's Method
Index     2         3        4        5        6        7        8        9        10
CR     0.22034   0.50027  0.63512  0.73982  0.74191  0.69227  0.61808  0.57902  0.54105
CFM    0.22067   0.50679  0.65523  0.75527  0.75680  0.71304  0.65012  0.61738  0.58567
CK     0.22102   0.51370  0.67742  0.77211  0.77269  0.73484  0.68432  0.65882  0.63470

Centroid Method vs K-means Method
Index     2         3        4        5        6        7        8        9        10
CR     0.15408   0.36771  0.58976  0.70220  0.71334  0.68739  0.64383  0.60278  0.56743
CFM    0.15443   0.37385  0.60812  0.71858  0.72930  0.70760  0.67070  0.63643  0.60732
CK     0.15479   0.38041  0.62835  0.73654  0.74639  0.72882  0.69913  0.67248  0.65074

Centroid Method vs McQuitty's Method
Index     2         3        4        5        6        7        8        9        10
CR     0.11679   0.30278  0.47826  0.57193  0.64753  0.70210  0.69516  0.67470  0.65265
CFM    0.11763   0.30662  0.48674  0.58012  0.65567  0.71419  0.71216  0.69622  0.67749
CK     0.11854   0.31076  0.49587  0.58901  0.66432  0.72682  0.72386  0.71873  0.70367



Table 11. Example 7 simulations: averages of corrected similarity indices for different combinations with five clusters of equal size of 50 generated from von Mises distributions.

Ward's Method vs Centroid Method
Index     2        3        4        5        6        7        8        9        10
CR     0.82242  0.87971  0.89193  0.95261  0.90551  0.86045  0.81754  0.77751  0.75944
CFM    0.82243  0.87973  0.89197  0.95262  0.90599  0.86189  0.82007  0.78134  0.76443
CK     0.82243  0.87975  0.89201  0.95262  0.90647  0.86334  0.82261  0.78521  0.76948

Average Linkage Method vs Centroid Method
Index     2        3        4        5        6        7        8        9        10
CR     0.86036  0.93084  0.94837  0.98232  0.96903  0.96046  0.95194  0.94687  0.94254
CFM    0.86036  0.93085  0.94839  0.98232  0.96914  0.96067  0.95226  0.94725  0.94303
CK     0.86037  0.93086  0.94841  0.98232  0.96925  0.96088  0.95257  0.94763  0.94353

Average Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     0.77612  0.86283  0.89062  0.95346  0.90495  0.85802  0.82209  0.78834  0.76039
CFM    0.77613  0.86285  0.89065  0.95346  0.90537  0.85916  0.82420  0.79145  0.76463
CK     0.77614  0.86288  0.89069  0.95346  0.90579  0.86031  0.82631  0.79459  0.76892

McQuitty's Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     0.36449  0.55455  0.63252  0.79261  0.78401  0.76077  0.74632  0.72839  0.72832
CFM    0.36455  0.55470  0.63278  0.79317  0.78419  0.76105  0.74703  0.72903  0.72945
CK     0.36461  0.55486  0.63303  0.79372  0.78437  0.76132  0.74703  0.72968  0.73059

Average Linkage Method vs Complete Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.40105  0.60359  0.68768  0.87560  0.84704  0.82317  0.79151  0.76814  0.74952
CFM    0.40143  0.60368  0.68785  0.87574  0.84750  0.82449  0.79350  0.77077  0.75243
CK     0.40182  0.60378  0.68801  0.87587  0.84797  0.82581  0.79551  0.77342  0.75537

Complete Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     0.39617  0.60881  0.69037  0.87099  0.84012  0.79851  0.77162  0.75747  0.75066
CFM    0.39660  0.60889  0.69052  0.87114  0.84023  0.79874  0.77197  0.75795  0.75146
CK     0.39704  0.60898  0.69067  0.87129  0.84035  0.79898  0.77232  0.75844  0.75227

Average Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.62009  0.71678  0.71216  0.90023  0.88003  0.82476  0.78489  0.74822  0.73041
CFM    0.62013  0.71693  0.71239  0.90107  0.88063  0.82618  0.78699  0.75084  0.73346
CK     0.62017  0.71708  0.71263  0.90192  0.88125  0.82760  0.78911  0.75349  0.73655

Complete Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.53702  0.62906  0.65916  0.82970  0.81336  0.76772  0.72985  0.71275  0.70424
CFM    0.53744  0.62919  0.65932  0.83062  0.81359  0.76801  0.73026  0.71339  0.70521
CK     0.53787  0.62931  0.65947  0.83154  0.81383  0.76830  0.73068  0.71403  0.70618

Ward's Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.61988  0.71177  0.71186  0.89799  0.85540  0.79695  0.76397  0.73797  0.73006
CFM    0.61993  0.71191  0.71210  0.89888  0.85560  0.79710  0.76432  0.73872  0.73159
CK     0.61997  0.71206  0.71233  0.89977  0.85580  0.79726  0.76468  0.73947  0.73312

Single Linkage Method vs Average Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.38677  0.57192  0.70123  0.76820  0.80937  0.83504  0.82245  0.80680  0.78072
CFM    0.38872  0.57715  0.70811  0.78263  0.82026  0.84348  0.83114  0.81682  0.79288
CK     0.39074  0.58280  0.71544  0.79793  0.83166  0.85227  0.84013  0.82720  0.80543

Table 12. Example 8 results: values of similarity indices for different combinations of clustering algorithms for the turtles data.

Centroid Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     1        0.44584  0.49709  0.57658  0.58042  0.59431  0.58071  0.63305  0.63977
CFM    1        0.47163  0.53384  0.61199  0.61942  0.60989  0.59730  0.63567  0.64093
CK     1        0.49966  0.57464  0.65033  0.66191  0.62594  0.61443  0.63831  0.64210

Single Linkage Method vs Average Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     1        0.81101  0.98169  0.68783  0.68713  0.36449  0.38877  0.23668  0.21991
CFM    1        0.81516  0.98173  0.70467  0.70390  0.41365  0.44525  0.31194  0.29584
CK     1        0.81935  0.98178  0.72214  0.72130  0.47244  0.51360  0.42094  0.40859

Single Linkage Method vs Complete Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.93632  0.43062  0.48184  0.48821  0.46927  0.30655  0.31048  0.24524  0.18891
CFM    0.93656  0.46601  0.51831  0.52675  0.51071  0.37108  0.37801  0.31956  0.26746
CK     0.93680  0.50598  0.55890  0.56981  0.55757  0.45507  0.46653  0.42569  0.39153

Average Linkage Method vs Complete Linkage Method
Index     2        3        4        5        6        7        8        9        10
CR     0.93632  0.36514  0.46312  0.58186  0.56669  0.61340  0.64709  0.82379  0.74083
CFM    0.93656  0.38248  0.49650  0.59113  0.57826  0.61864  0.65187  0.82393  0.74311
CK     0.93680  0.40106  0.53346  0.60060  0.59015  0.62393  0.65668  0.82407  0.74539

McQuitty's Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     1        0.41348  0.74559  0.50460  0.59201  0.53575  0.57063  0.56408  0.56873
CFM    1        0.45294  0.74897  0.52083  0.59218  0.54188  0.57903  0.56779  0.57079
CK     1        0.49831  0.75238  0.53774  0.59235  0.54809  0.58758  0.57286  0.57152

Complete Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     0.93632  0.45461  0.47706  0.48013  0.43933  0.65421  0.69348  0.75624  0.76981
CFM    0.93656  0.45485  0.47707  0.48638  0.44530  0.65736  0.69825  0.76080  0.76994
CK     0.93680  0.45509  0.47707  0.49274  0.45137  0.66052  0.70306  0.76538  0.77006

Average Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     1        0.44584  0.48121  0.55801  0.58042  0.59431  0.58071  0.63305  0.63977
CFM    1        0.47163  0.51506  0.59016  0.61942  0.60989  0.59730  0.63567  0.64093
CK     1        0.49966  0.55247  0.62479  0.66191  0.62594  0.61443  0.63831  0.64210

Single Linkage Method vs Ward's Method
Index     2        3        4        5        6        7        8        9        10
CR     1        0.41348  0.49709  0.34290  0.35048  0.22354  0.24670  0.19622  0.19564
CFM    1        0.45294  0.53384  0.39590  0.40995  0.28593  0.32125  0.27411  0.27368
CK     1        0.49831  0.57464  0.46087  0.48410  0.37304  0.42766  0.39522  0.39519

Single Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.82460  0.41348  0.46065  0.46909  0.34853  0.19077  0.17907  0.13152  0.11834
CFM    0.82618  0.45294  0.50216  0.51357  0.40908  0.24639  0.23944  0.18198  0.17428
CK     0.82776  0.49831  0.54923  0.56429  0.48494  0.32502  0.32846  0.25955  0.26689

Average Linkage Method vs K-means Method
Index     2        3        4        5        6        7        8        9        10
CR     0.82460  0.41315  0.45723  0.40665  0.54449  0.64346  0.55773  0.60939  0.68538
CFM    0.82618  0.44414  0.49398  0.41499  0.58249  0.65982  0.57735  0.61253  0.69254
CK     0.82776  0.47862  0.53511  0.42356  0.62402  0.67667  0.59777  0.61568  0.69978


References

ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D.P. (2006), "On Similarity Indices and Correction for Chance Agreement", Journal of Classification, 23, 301–313.
ALBATINEH, A.N. (2010), "Means and Variances for a Family of Similarity Indices Used in Cluster Analysis", Journal of Statistical Planning and Inference, 140, 2828–2838.
ANDREWS, D.F. (1972), "Plots of High Dimensional Data", Biometrics, 28, 125–136.
BANFIELD, J.D., and RAFTERY, A.E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering", Biometrics, 49, 803–821.
BARAGONA, R. (2003), "Further Results on Lund's Statistic for Identifying Cluster in a Circular Data Set with Application to Time Series", Communications in Statistics: Simulation and Computation, 32, 943–952.
BATSCHELET, E. (1981), Circular Statistics in Biology, London: Academic Press.
BOCK, H.H. (1985), "On Some Significance Tests in Cluster Analysis", Journal of Classification, 2, 77–108.
BRECKENRIDGE, J.N. (1989), "Replicating Cluster Analysis: Method, Consistency, and Validity", Multivariate Behavioral Research, 24, 147–161.
BRUSCO, M.J., and STEINLEY, D. (2007), "A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning", Psychometrika, 72, 583–600.
CALINSKI, T., and HARABASZ, J. (1974), "A Dendrite Method for Cluster Analysis", Communications in Statistics, 3, 1–27.
COHEN, J. (1960), "A Coefficient of Agreement for Nominal Scales", Educational and Psychological Measurement, 20, 37–46.
DUDOIT, S., and FRIDLYAND, J. (2002), "A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Dataset", Genome Biology, 3, 1–21.
EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press.
FISHER, N.I. (1993), Statistical Analysis of Circular Data, Cambridge: Cambridge University Press.
FOWLKES, E.B., and MALLOWS, C.L. (1983), "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association, 78, 553–569.
FRALEY, C., and RAFTERY, A.E. (1998), "How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis", The Computer Journal, 41, 578–588.
GOODMAN, L., and KRUSKAL, W. (1954), "Measures of Association for Cross Classifications", Journal of the American Statistical Association, 49, 732–764.
GORDON, A.D. (1999), Classification (2nd ed.), Boca Raton: Chapman & Hall/CRC.
GUTTMAN, L. (1941), "An Outline of the Statistical Theory of Prediction", in The Prediction of Personal Adjustment, ed. P. Horst, New York: Social Science Research Council.
HARDY, A. (1994), "An Examination of Procedures for Determining the Number of Clusters in a Data Set", in New Approaches in Classification and Data Analysis, eds. E. Diday et al., Paris: Springer-Verlag, pp. 178–185.
HARDY, A. (1996), "On the Number of Clusters", Computational Statistics and Data Analysis, 23, 83–96.
HARTIGAN, J.A. (1975), Clustering Algorithms, New York: Wiley.
HUBERT, L., and ARABIE, P. (1985), "Comparing Partitions", Journal of Classification, 2, 193–218.
JAIN, A.K., and DUBES, R.C. (1988), Algorithms for Clustering Data, New Jersey: Prentice Hall.
KAUFMAN, L., and ROUSSEEUW, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: John Wiley & Sons.
KOZIOL, J.A. (1990), "Cluster Analysis of Antigenic Profiles of Tumors: Selection of Number of Clusters Using Akaike's Information Criterion", Methods of Information in Medicine, 29, 200–204.

KROLAK-SCHWERDT, S., and ECKES, T. (1992), "A Graph Theoretic Criterion for Determining the Number of Clusters in a Data Set", Multivariate Behavioral Research, 27, 541–565.
KRZANOWSKI, W.J., and LAI, Y.T. (1988), "A Criterion for Determining the Number of Groups in a Data Set Using Sum of Squares Clustering", Biometrics, 44, 23–34.
KULCZYNSKI, S. (1927), "Die Pflanzenassoziationen der Pieninen", Bulletin International de l'Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles, Série B, Supplément II, 57–203.
LANGE, T., ROTH, V., BRAUN, M.L., and BUHMANN, J.M. (2004), "Stability-Based Validation of Clustering Solutions", Neural Computation, 16, 1299–1323.
LUND, U. (1999), "Cluster Analysis for Directional Data", Communications in Statistics: Simulation and Computation, 28, 1001–1009.
MARDIA, K.V., and JUPP, P.E. (2000), Directional Statistics, Chichester: John Wiley & Sons Ltd.
MARRIOTT, F.H.C. (1971), "Practical Problems in a Method of Cluster Analysis", Biometrics, 27, 501–514.
MILLIGAN, G., and COOPER, M. (1985), "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, 50, 159–179.
MILLIGAN, G., and COOPER, M. (1986), "A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis", Multivariate Behavioral Research, 21, 441–458.
MILLIGAN, G., SOON, S., and SOKOL, L. (1983), "The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, 40–47.
MOREY, L., and AGRESTI, A. (1984), "The Measurement of Classification Agreement: An Adjustment to the Rand Statistic for Chance Agreement", Educational and Psychological Measurement, 44, 33–37.
PECK, R., FISHER, L., and VAN NESS, J. (1989), "Approximate Confidence Intervals for the Number of Clusters", Journal of the American Statistical Association, 84, 184–191.
R Development Core Team (2007), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing, ISBN 3-900051-07-0, URL http://www.R-project.org.
RAND, W. (1971), "Objective Criteria for the Evaluation of Clustering Methods", Journal of the American Statistical Association, 66, 846–850.
SAXENA, P.C., and NAVANEETHAM, K. (1991), "The Effect of Cluster Size, Dimensionality, and Number of Clusters on Recovery of True Cluster Structure Through Chernoff-Type Faces", The Statistician, 40, 415–425.
SAXENA, P.C., and NAVANEETHAM, K. (1993), "Comparison of Chernoff-Type Face and Non-Graphical Methods for Clustering Multivariate Observations", Computational Statistics and Data Analysis, 15, 63–79.
STEINLEY, D., and BRUSCO, M.J. (2007), "Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques", Journal of Classification, 24, 99–121.
STRUYF, A., HUBERT, M., and ROUSSEEUW, P.J. (1997), "Integrating Robust Clustering Techniques in S-PLUS", Computational Statistics and Data Analysis, 26, 17–37.
SUGAR, C.A., and JAMES, G.M. (2003), "Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach", Journal of the American Statistical Association, 98, 750–763.
TIBSHIRANI, R., WALTHER, G., and HASTIE, T. (2001), "Estimating the Number of Clusters in a Data Set via the Gap Statistic", Journal of the Royal Statistical Society, Series B, 63, 411–423.
VASSILIOU, A., TAMBOURATZIS, D.G., KOUTRAS, M.V., and BERSIMIS, S. (2004), "A New Similarity Measure and Its Use in Determining the Number of Clusters in a Multivariate Data Set", Communications in Statistics: Theory and Methods, 33, 1643–1666.
WISHART, D. (1978), CLUSTAN User Manual (3rd ed.), Edinburgh: Program Library Unit, University of Edinburgh.


WOLFE, J.H. (1970), "Pattern Clustering by Multivariate Mixture Analysis", Multivariate Behavioral Research, 5, 329–350.
YANG, M-S., and PAN, J-A. (1997), "On Fuzzy Clustering of Directional Data", Fuzzy Sets and Systems, 91, 319–326.
