K-means Algorithm
Abstract
k-means is a rather simple but well-known algorithm for grouping objects, i.e., clustering. All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to be identified. Each object can be thought of as being represented by a feature vector in an n-dimensional space, n being the number of features used to describe the objects to be clustered. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are assigned to the center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until it converges. The algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the choice of initial centers, and the computation of new average centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we discuss the k-means clustering algorithm, its implementation, and its application to the problem of unsupervised learning.
Contents
Abstract
1. Introduction
2. The k-means algorithm
3. How the k-means clustering algorithm works
4. Task Formulation
4.1 K-means implementation
4.2 Estimation of parameters of a Gaussian mixture
4.3 Unsupervised learning
5. Limitations
6. Difficulties with k-means
7. Available software
8. Applications of the k-Means Clustering Algorithm
9. Conclusion
References
1 Introduction
The k-means algorithm is a simple iterative clustering algorithm that partitions a given dataset into a user-specified number of clusters, k. The algorithm is simple to implement and run, relatively fast, easy to adapt, and common in practice. It is historically one of the most important algorithms in data mining. k-means in its essential form was discovered independently by several researchers across different disciplines, most notably Lloyd (1957, 1982), Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967). A detailed history of k-means, along with descriptions of several variations, is given by Jain and Dubes. Gray and Neuhoff provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms.
In the rest of this project, we describe how k-means works and discuss its limitations, the difficulties it raises, and some of its applications.
2 The k-means algorithm

The k-means algorithm applies to a dataset D = {x_i | i = 1, ..., N} of N points, where x_i denotes the i-th data point. As mentioned in the introduction, k-means is a clustering algorithm that partitions D into k clusters of points. That is, the k-means algorithm clusters all of the data points in D
such that each point x_i falls in one and only one of the k partitions. One can keep track of which point is in which cluster by assigning each point a cluster ID. Points with the same cluster ID are in the same cluster, while points with different cluster IDs are in different clusters. One can denote this with a cluster membership vector m of length N, where m_i is the cluster ID of x_i.
The value of k is an input to the base algorithm. Typically, the value for k is based on criteria such as prior knowledge of how many clusters actually appear in D, how many clusters are desired for the current application, or the types of clusters found by exploring/experimenting with different values of k. How k is chosen is not necessary for understanding how k-means partitions the dataset D, and we will discuss how to choose k when it is not prespecified in a later section.
In k-means, each of the k clusters is represented by a single point in R^d. Let us denote this set of cluster representatives as the set

C = {c_j | j = 1, ..., k}.

These k cluster representatives are also called the cluster means or cluster centroids. In clustering algorithms, points are grouped by some notion of closeness or similarity. In k-means, the default measure of closeness is the Euclidean distance. In particular, one can readily show that k-means attempts to minimize the following non-negative cost function:

Cost = \sum_{i=1}^{N} \min_{j} \lVert x_i - c_j \rVert_2^2        (1)
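To make this concrete, the following MATLAB sketch computes the membership vector m and the cost of Equation (1) for a given set of centroids. The helper name assign_points is hypothetical (not from any referenced toolbox), and the data and centroids are assumed to be stored row-wise:

function [m, cost] = assign_points(X, C)
% Assign each point to its nearest centroid and evaluate Equation (1).
% X is an N-by-d data matrix; C is a k-by-d matrix of centroids.
    N = size(X, 1);
    k = size(C, 1);
    D = zeros(N, k);
    for j = 1:k
        diffs = X - repmat(C(j, :), N, 1);   % x_i - c_j for every point i
        D(:, j) = sum(diffs .^ 2, 2);        % squared Euclidean distances
    end
    [dmin, m] = min(D, [], 2);               % m(i) = ID of the closest centroid
    cost = sum(dmin);                        % cost of Equation (1)
end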
3 How the k-means clustering algorithm works

If the number of data points is less than the number of clusters, then each data point is itself taken as a cluster centroid and given a cluster number. If the number of data points is greater than the number of clusters, then for each data point we calculate the distance to all centroids and find the minimum; the point is said to belong to the cluster whose centroid is at minimum distance from it.

Since we are not sure about the exact location of the centroids, we need to adjust the centroid locations based on the currently assigned data, and then reassign all the data to the new centroids. This process is repeated until no data point moves to another cluster anymore. Mathematically, this loop can be proved to converge (a minimal implementation sketch is given after the list below). Convergence will always occur because:
1. Each reassignment in step 2 decreases the sum of distances from each training sample to its group's centroid.
2. There are only finitely many partitions of the training examples into k clusters.
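A minimal sketch of this loop in MATLAB, assuming the assign_points helper above and the random-points initialization described in the abstract; simple_kmeans is a hypothetical name, not the course-provided kminovec:

function [m, C] = simple_kmeans(X, k)
% Basic k-means loop. X is an N-by-d data matrix, k the number of clusters.
    N = size(X, 1);
    idx = randperm(N);
    C = X(idx(1:k), :);              % initialization: k random data points
    m = zeros(N, 1);
    while true
        m_old = m;
        m = assign_points(X, C);     % step 1: assign points to nearest centroid
        if isequal(m, m_old)         % stop when no point changes cluster
            break;
        end
        for j = 1:k                  % step 2: recompute each centroid
            members = (m == j);
            if any(members)          % keep the old centroid if a cluster empties
                C(j, :) = mean(X(members, :), 1);
            end
        end
    end
end

For example, [m, C] = simple_kmeans(X, 3) partitions the rows of X into three clusters.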
4 Task Formulation
4.1 K-means implementation
The k-means algorithm described in the previous sections is implemented and then applied to the tasks below.

4.2 Estimation of parameters of a Gaussian mixture

The data are modeled by a mixture of three Gaussians,

p(x) = \sum_{j=1}^{3} P(j) \, N(x \mid \mu_j, \Sigma_j)

where N(μ_j, Σ_j) denotes a normal distribution with mean value μ_j and covariance Σ_j, and P(j) denotes the weight of the j-th Gaussian within the mixture. The task is, for given input data x_1, x_2, ..., x_N, to estimate the mixture parameters μ_j, Σ_j, P(j).
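Under the hard assignments produced by k-means, the maximum-likelihood estimates of these parameters take a simple closed form: μ_j is the average of the points classified to the j-th cluster, Σ_j is their sample covariance, and P(j) is the fraction of points in the cluster. A minimal sketch of this re-estimation step, as used in the tasks below (reestimate_gmm is a hypothetical helper; every cluster is assumed non-empty):

function [mu, Sigma, P] = reestimate_gmm(X, m, k)
% ML re-estimation of mixture parameters from hard cluster assignments.
% X is N-by-d, m is the N-by-1 membership vector, k the number of clusters.
    [N, d] = size(X);
    mu = zeros(k, d);
    Sigma = zeros(d, d, k);
    P = zeros(k, 1);
    for j = 1:k
        Xj = X(m == j, :);                   % points classified to cluster j
        Nj = size(Xj, 1);
        mu(j, :) = mean(Xj, 1);              % ML estimate of the mean
        centered = Xj - repmat(mu(j, :), Nj, 1);
        Sigma(:, :, j) = (centered' * centered) / Nj;   % ML covariance
        P(j) = Nj / N;                       % relative number of points
    end
end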
Tasks:
1. In each iteration of the implemented k-means algorithm, re-estimate the means μ_j and covariances Σ_j using the maximum likelihood method. P(j) will be the relative number (percentage) of data points classified to the j-th cluster.
2. In each iteration, plot the total likelihood L of the estimated parameters μ_j, Σ_j, P(j).

4.3 Unsupervised learning

1. For each of the images of letters in image_data.mat, compute the two features used for clustering: the difference between the pixel sums of the left and right halves of the image, and the difference between the pixel sums of its top and bottom halves (see the code listing at the end of this document).
2. Using the k-means method, classify the images into three classes. In each iteration, display the means μ_j, the current classification, and the likelihood L.
3. After the iteration stops, compute and display the average image of each of the three classes. To display the final classification, you can use the show_class function.
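As an illustration of step 3, the average image of each class is a pixelwise mean over the images assigned to it; a minimal sketch in base MATLAB (m is the membership vector produced by k-means; the course-provided show_class can be used for display instead):

% Compute and display the average image of each of the three classes.
% data.images is H-by-W-by-N; m is the N-by-1 membership vector.
for j = 1:3
    avg = mean(data.images(:, :, m == j), 3);   % pixelwise average image
    subplot(1, 3, j);
    imagesc(avg); colormap gray; axis image;
    title(sprintf('Class %d', j));
end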
[Figures: Class 2, Class 3]
5 Limitations
The greedy-descent nature of k-means on a non-convex cost implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations. In other words, initializing the set of cluster representatives C differently can lead to very different clusters, even on the same dataset D. A poor initialization can lead to very poor clusters.
The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids and then selecting the best result, or by doing limited local search about the converged solution. Other approaches include methods that attempt to keep k-means from converging to local minima. A number of different initialization methods have also been proposed, along with discussions of other limitations of k-means.
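A minimal sketch of the multiple-restart remedy, reusing the hypothetical simple_kmeans and assign_points helpers from earlier sections:

% Run k-means several times and keep the lowest-cost solution.
best_cost = Inf;
for run = 1:10
    [m, C] = simple_kmeans(X, k);       % each run starts from a new random init
    [~, cost] = assign_points(X, C);    % cost of Equation (1) at convergence
    if cost < best_cost
        best_cost = cost; best_m = m; best_C = C;
    end
end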
As mentioned, choosing the optimal value of k may be difficult. If one has knowledge about the dataset, such as the number of partitions that naturally comprise the dataset, then that knowledge can be used to choose k. Otherwise, one must use some other criteria to choose k, thus solving the model selection problem. One naive solution is to try several different values of k and choose the clustering which minimizes the k-means objective function
(Equation 1). Unfortunately, the value of the objective function is not as informative as one would hope in this case. For example, the cost of the optimal solution decreases with increasing k until it hits zero when the number of clusters equals the number of distinct data points. This makes it more difficult to use the objective function to (a) directly compare solutions with different numbers of clusters and (b) find the optimal value of k.
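This monotone decrease is easy to observe by sweeping over candidate values of k and plotting the resulting objective; a sketch with the same hypothetical helpers:

% The converged k-means cost shrinks as k grows, so it cannot be used
% on its own to compare solutions with different numbers of clusters.
ks = 1:10;
costs = zeros(size(ks));
for t = 1:numel(ks)
    [~, C] = simple_kmeans(X, ks(t));
    [~, costs(t)] = assign_points(X, C);
end
plot(ks, costs, 'o-'); xlabel('k'); ylabel('k-means cost');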
7 Available software
Because of the k-means algorithm's simplicity, effectiveness, and historical importance, software to run the k-means algorithm is readily available in several forms. It is a standard feature in many popular data mining software packages. For example, it can be found in Weka or in SAS under the FASTCLUS procedure. It is also commonly included as an add-on to existing software. For example, several implementations of k-means are available as parts of various toolboxes in Matlab. k-means is also available in Microsoft Excel after adding XLMiner. Finally, several stand-alone versions of k-means exist and can easily be found on the Internet. The algorithm is also straightforward to code, and the reader is encouraged to create their own implementation of k-means as an exercise.
8 Applications of the k-Means Clustering Algorithm

The k-means algorithm has been applied in many areas, including optical character recognition and speech recognition, among others. Because of this, an approach based on k-means can be used to solve the practical problem where simple MLSE is not enough.
9 Conclusion
This project set out to explain the k-means clustering algorithm and its application to the problem of unsupervised learning. The k-means algorithm is
a simple iterative clustering algorithm that partitions a dataset into k clusters.
At its core, the algorithm works by iterating over two steps:
1) clustering all points in the dataset based on the distance between each
point and its closest cluster representative, and
2) re-estimating the cluster representatives.
Limitations of the k-means algorithm include the sensitivity of k-means to
initialization and determining the value of k. Despite its drawbacks, k-means
remains the most widely used partitional clustering algorithm in practice. The
algorithm is simple, easily understandable and reasonably scalable, and can be
easily modified to deal with different scenarios such as semi-supervised
learning or streaming data. Continual improvements and generalizations of the
basic algorithm have ensured its continued relevance and gradually increased
its effectiveness as well.
References
1. http://www.ideal.ece.utexas.edu/papers/km.pdf
2. http://www.science.uva.nl/research/ias/alumni/m.sc.theses/theses/NoahLaith.doc
3. http://cw.felk.cvut.cz/cmp/courses/ae4b33rpz/Labs/kmeans/index_en.html
MATLAB code for the tasks:

% Plot the Gaussian mixture data and run the course's k-means routine.
ppatterns(gmm.X, gmm.y);
axis([-3 3 -3 3]);
model = kminovec(gmm.X, 3, 10, 1, gmm);
figure(gcf); plot(model.L);            % plot the likelihood L over the iterations

%% Part 3: features for the letter images
data = load('image_data.mat');
for i = 1:size(data.images, 3)
    % pixel-sum difference: left half minus right half of the image
    pX(i) = sum(sum(data.images(:, 1:floor(end/2), i))) ...
          - sum(sum(data.images(:, (floor(end/2)+1):end, i)));
    % pixel-sum difference: top half minus bottom half of the image
    pY(i) = sum(sum(data.images(1:floor(end/2), :, i))) ...
          - sum(sum(data.images((floor(end/2)+1):end, :, i)));
end
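With the two features collected into an N-by-2 matrix, the images can then be clustered into three classes, either with the course's kminovec or, as a sketch, with the hypothetical simple_kmeans helper from Section 3:

% Sketch: cluster the letter images into three classes using the two features.
X = [pX(:), pY(:)];               % N-by-2 feature matrix, one row per image
[m, C] = simple_kmeans(X, 3);     % m(i) is the class assigned to the i-th image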