Abstract—This paper presents an algorithm to automatically determine the number of clusters in a given input data set, under a mixture of Gaussians assumption. Our algorithm extends the Expectation-Maximization clustering approach by starting with a single cluster assumption for the data, and recursively splitting one of the clusters in order to find a tighter fit. An Information Criterion parameter is used to make a selection between the current and previous model after each split. We build this approach upon prior work done on both the K-Means and Expectation-Maximization algorithms. We also present a novel idea for intelligent cluster splitting which minimizes convergence time and substantially improves accuracy.

Keywords-clustering, unsupervised learning, expectation-maximization, mixture of Gaussians

I. INTRODUCTION

The clustering problem is defined as follows: given a set of n-dimensional input vectors {x1, . . . , xm}, we want to group them into an appropriate number of clusters such that points in the same cluster are positionally coherent. Such algorithms are useful for image compression, bioinformatics, pattern matching, data mining, astrophysics and many other fields. A common approach to the clustering problem is to assume a Gaussian Mixture Model, in which the input data are assumed to have been generated by selecting one of k Gaussian distributions and drawing the input vector from the chosen distribution. Each cluster is thus represented by a single distribution. The Expectation-Maximization algorithm [3] is a well-known method for estimating the parameters of such a mixture at maximum likelihood; however, it requires prior knowledge of the number of clusters in the data (k).

Determining this value is a fundamental problem in data clustering, and has been attempted using information-theoretic [5], silhouette-based [7], [8] and goodness-of-fit [4], [6] methods. The X-Means algorithm [10] is an Information Criterion based approach to this problem, developed for use with the K-Means algorithm. X-Means works by alternately applying two operations: the K-Means algorithm (Improve-params) to optimally detect the clusters for a chosen value of k, and cluster splitting (Improve-structure) to optimize the value of k according to an Information Criterion.

One of the major problems with X-Means is that it assumes an identical spherical Gaussian for the data. Because of this, it tends to overfit data in elliptical clusters [6], or in an input set with clusters of varying size. The G-Means and PG-Means algorithms try to solve this problem by projecting the data onto one dimension and running a statistical goodness-of-fit test. This approach leads to better performance for non-spherical distributions; however, projections may not work optimally for all data sets. A projection can collapse the data from many clusters together, neglecting differences in density, so multiple projections are required for accuracy [4].

Our algorithm uses an Information Criterion test similar to X-Means, but differs by considering each cluster to be generated by a general multivariate Gaussian distribution. This allows each distribution to take a non-spherical shape, and permits accurate computation of the likelihood of the model. We use Expectation-Maximization instead of K-Means for greater accuracy in estimating the parameters. Further, relying on multivariate Gaussian distributions allows us to use the directional properties of the distributions as an aid for intelligent splitting. As we show in our test results, not only does this reduce convergence time, it also remarkably improves the accuracy of our algorithm.

II. CONCEPTS AND DEFINITIONS

A. Mixture of Gaussians and Expectation-Maximization

In the Gaussian Mixture model, the input data set I = {x1, . . . , xm}, where xi ∈ R^n, is assumed to be sampled from a set of distributions L = {A1, . . . , Ak} such that the probability density is given by

p(x_i) = \sum_{j=1}^{k} \Phi_j \, G(x_i \mid A_j)

where A_j denotes a Gaussian distribution characterized by mean \mu_j and covariance matrix \Sigma_j, \Phi_j denotes the normalized weight of the jth distribution, and k is the number of distributions.
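As a concrete illustration (our own sketch, not from the paper), the following Python snippet evaluates this mixture density in one dimension; the component parameters `weights`, `means`, and `variances` are hypothetical examples.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian G(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """p(x) = sum_j Phi_j * G(x | A_j) for a 1-D mixture of Gaussians."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Hypothetical two-component mixture with Phi = (0.6, 0.4)
weights = [0.6, 0.4]
means = [0.0, 5.0]
variances = [1.0, 2.0]
p = mixture_density(0.0, weights, means, variances)
```

Note that the weights Φj must sum to one for p to be a proper density.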
The likelihood of the data set is given by

L = \prod_{i=1}^{m} p(x_i)

The Expectation-Maximization algorithm EM(I, L) maximizes the likelihood for the given data set by repeating the following two steps for all i ∈ {1, ..., m} and j ∈ {1, ..., k} until convergence [9].

1. Expectation Step

w_{ij} \leftarrow P(A_j \mid x_i)

Equivalently,

w_{ij} \leftarrow \frac{\Phi_j}{(2\pi)^{n/2} \, |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right)

2. Maximization Step

\Phi_j \leftarrow \frac{1}{m} \sum_{i=1}^{m} w_{ij}

\mu_j \leftarrow \frac{\sum_{i=1}^{m} w_{ij} \, x_i}{\sum_{i=1}^{m} w_{ij}}

\Sigma_j \leftarrow \frac{\sum_{i=1}^{m} w_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{m} w_{ij}}

B. Maximum Likelihood and Information Criterion

Increasing the number of clusters in the mixture model increases the dimensionality of the model, causing a monotonic increase in its likelihood. If we were to focus on finding the maximum-likelihood model with any number of clusters, we would ultimately end up with a model in which every data point is the sole member of its own cluster. Obviously, we wish to avoid such a construction, and hence we must choose a criterion that does not depend solely on likelihood.

An Information Criterion parameter is used for selection among models with different numbers of parameters. It seeks to balance the increase in likelihood due to additional parameters by introducing a penalty term for each parameter. Of the various Information Criteria available, the Bayesian Information Criterion (BIC), or Schwarz Criterion [11], has been shown to work best with a mixture of Gaussians [12]:

BIC = 2 \log L - f \log |I|

where f is the number of free parameters. For a mixture of k Gaussians in d dimensions, f is given by

f = (k - 1) + kd + k \, \frac{d(d+1)}{2}

Other Information Criterion measures, such as Akaike's Information Criterion [1] or the Integrated Completed Likelihood [2], may also be used.

III. ALGORITHM

The algorithm functions by alternating Expectation-Maximization and Split operations, as depicted in Illustration 1. In the first figure, the EM algorithm is applied to a single cluster, resulting in (a). The obtained distribution is split in (b), and the parameters are maximized once again to get (c). This is repeated to get the final set of distributions in (e). A formal version of the algorithm is given in Figure 3. The following lines describe the algorithm:

I is the set of input vectors
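To make the alternation concrete, the E and M updates of Section II, together with the BIC score used to compare models, can be sketched in Python for the one-dimensional case. This is our simplified illustration of the equations above, not the authors' implementation; for d = 1 the free-parameter count reduces to f = (k − 1) + k + k = 3k − 1.

```python
import math

def em_1d(xs, means, variances, weights, n_iter=50):
    """Run EM for a 1-D mixture of k Gaussians.

    Updates the parameter lists in place and returns them."""
    m, k = len(xs), len(means)
    for _ in range(n_iter):
        # Expectation step: w_ij = P(A_j | x_i)
        w = []
        for x in xs:
            num = [weights[j]
                   * math.exp(-(x - means[j]) ** 2 / (2.0 * variances[j]))
                   / math.sqrt(2.0 * math.pi * variances[j])
                   for j in range(k)]
            s = sum(num)
            w.append([v / s for v in num])
        # Maximization step: re-estimate Phi_j, mu_j, Sigma_j
        for j in range(k):
            wj = max(sum(w[i][j] for i in range(m)), 1e-12)  # guard empty component
            weights[j] = wj / m
            means[j] = sum(w[i][j] * xs[i] for i in range(m)) / wj
            variances[j] = max(sum(w[i][j] * (xs[i] - means[j]) ** 2
                                   for i in range(m)) / wj, 1e-6)
    return means, variances, weights

def bic_1d(xs, means, variances, weights):
    """BIC = 2 log L - f log|I|, with f = 3k - 1 free parameters for d = 1."""
    k = len(means)
    log_l = sum(math.log(sum(weights[j]
                             * math.exp(-(x - means[j]) ** 2 / (2.0 * variances[j]))
                             / math.sqrt(2.0 * math.pi * variances[j])
                             for j in range(k)))
                for x in xs)
    f = 3 * k - 1
    return 2.0 * log_l - f * math.log(len(xs))
```

Fitting, say, k = 2 components to data concentrated near two distant values should recover the two group centres, and the resulting BIC can then be compared against a single-component fit to decide whether the split is kept.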
Figure 4. Performance of the algorithm (avg no. of clusters on the horizontal axis)

\mu_{new_{1,2}} = \mu_j \pm \frac{1}{2} \begin{pmatrix} D \cos\theta \\ D \sin\theta \end{pmatrix}

where D is proportional to the length of the major axis.
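For illustration, here is a minimal Python sketch of this split rule (our own, assuming θ is the orientation of the cluster's major axis obtained from the 2×2 covariance matrix, and taking D as the square root of the largest eigenvalue up to a scale factor):

```python
import math

def split_means(mu, cov, scale=1.0):
    """Given a 2-D mean mu and covariance cov = [[a, b], [b, c]], place two
    new means at mu +/- (1/2) * (D cos(theta), D sin(theta)), where theta is
    the major-axis orientation and D is proportional to the major-axis length."""
    a, b = cov[0]
    _, c = cov[1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)                 # major-axis angle
    lam_max = 0.5 * (a + c) + math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    d = scale * math.sqrt(lam_max)                           # D, up to a scale factor
    dx, dy = 0.5 * d * math.cos(theta), 0.5 * d * math.sin(theta)
    return (mu[0] + dx, mu[1] + dy), (mu[0] - dx, mu[1] - dy)
```

For an axis-aligned elongated cluster, e.g. cov = [[4, 0], [0, 1]], the two new means are displaced along the x-axis, the direction of greatest spread.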
IV. RESULTS

We tested our algorithm on a set of two-dimensional test images generated according to the mixture of Gaussians assumption. The results are shown in Figures 4 and 5. The number of iterations required is roughly linear in the number of clusters. The error in detection was measured as the root mean squared difference between the actual and detected number of clusters in the data. The error was very low, especially when the SPLIT procedure was used (under 2%). The intelligent splitting algorithm consistently outperforms the random split algorithm in both accuracy and performance. Its usage improves the average accuracy of the algorithm by more than 90%.

Figure 5. Accuracy of the algorithm (% RMS error with and without SPLIT vs. avg no. of clusters)
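The stated error measure can be computed as follows (a small illustrative sketch; the trial counts are hypothetical examples, not the paper's data):

```python
import math

def rms_cluster_error(actual, detected):
    """Root mean squared difference between actual and detected cluster counts."""
    return math.sqrt(sum((a - d) ** 2 for a, d in zip(actual, detected))
                     / len(actual))

# Hypothetical trials: true vs. detected number of clusters
err = rms_cluster_error([3, 5, 7, 9], [3, 5, 8, 9])
```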
V. CONCLUSION
In conclusion, our algorithm presents a general approach
to use any Information Criterion to automatically detect the
number of clusters during Expectation-Maximization cluster
analysis. Based on the results of prior research, we chose the Schwarz criterion in our algorithm; however, the approach may be applied with any criterion that provides better performance for the data under consideration.
Figure 6. Overfitting in X-Means

Interestingly, not only did our SPLIT algorithm help the performance of the algorithm, it was observed that it improved its accuracy to a remarkable extent. Randomly
splitting the clusters often tends to under-fit the data in the
case when the algorithm gets stuck in local minima. Our split
method reduces this problem by placing the new clusters in
the vicinity of their final positions. Although we have derived it in two dimensions, it can easily be generalized to multi-dimensional problems.

We believe that our algorithm is a useful method to automatically detect the number of clusters in Gaussian mixture data, with particular significance when non-spherical distributions may be present and multi-dimensional data projection is not desired.
Figure 7. Image with our algorithm
VI. ACKNOWLEDGEMENT
VII. REFERENCES
[1] H. Akaike, "A new look at the statistical model identification," IEEE
Transactions on Automatic Control , vol.19, no.6, pp. 716-723, Dec
1974
[2] C. Biernacki, G. Celeux and G. Govaert, "Assessing a mixture model
for clustering with the integrated completed likelihood," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol.22,
no.7, pp.719-725, Jul 2000
[3] A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum likelihood
from incomplete data via the em algorithm," Journal of the Royal
Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1-38,
1977
[4] Y. Feng, G. Hamerly and C. Elkan, "PG-means: learning the number of clusters in data," The 12th Annual Conference on Neural Information Processing Systems (NIPS), 2006.
[5] E. Gokcay, and J.C. Principe, "Information theoretic clustering,"
IEEE Transactions on Pattern Analysis and Machine Intelligence ,
vol.24, no.2, pp.158-171, Feb 2002
[6] G. Hamerly and C. Elkan, "Learning the k in k-means," Advances in
Neural Information Processing Systems, vol. 16, 2003.
[7] S. Lamrous and M. Taileb, "Divisive Hierarchical K-Means," International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, pp. 18-18, Nov. 28-Dec. 1, 2006.
[8] R. Lletí, M. C. Ortiz, L. A. Sarabia, and M. S. Sánchez, "Selecting
variables for k-means cluster analysis by using a genetic algorithm
that optimises the silhouettes," Analytica Chimica Acta, vol. 515, no.
1, pp. 87-100, July 2004.
[9] A. Ng, "Mixtures of Gaussians and the EM algorithm," [Online]. Available: http://see.stanford.edu/materials/aimlcs229/cs229-notes7b.pdf [Accessed: Oct 19, 2009]
[10] D. Pelleg and A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 727-734.
[11] G. Schwarz, "Estimating the dimension of a model," The Annals of
Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[12] R. Steele and A. Raftery, "Performance of Bayesian Model Selection
Criteria for Gaussian Mixture Models" , Technical Report no. 559,
Department of Statistics, University of Washington, Sep 2009