
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.

Types of clustering

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not, however, work with arbitrary feature combinations as general subspace methods do. But this special case deserves attention due to its applications in bioinformatics.

Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

Distance measure

An important step in most clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin is 2 under the 1-norm, √2 under the 2-norm, and 1 under the infinity-norm.

Common distance functions:

 The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

 The Manhattan distance (aka taxicab norm or 1-norm)

 The maximum norm (aka infinity norm)

 The Mahalanobis distance corrects data for different scales and correlations in the variables

 The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.

 The Hamming distance measures the minimum number of substitutions required to change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. (A true metric gives symmetric measures of distance.)
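As a concrete illustration of the first three distance functions above, the following is a minimal C# sketch, written for this overview rather than taken from any particular library (it assumes using System;), for d-dimensional points stored as double arrays:

public static class Distances
{
    //2-norm (Euclidean, "as the crow flies")
    public static double Euclidean(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    //1-norm (Manhattan / taxicab)
    public static double Manhattan(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++) sum += Math.Abs(a[i] - b[i]);
        return sum;
    }

    //infinity norm (maximum)
    public static double Maximum(double[] a, double[] b)
    {
        double max = 0.0;
        for (int i = 0; i < a.Length; i++) max = Math.Max(max, Math.Abs(a[i] - b[i]));
        return max;
    }
}

For the point (1, 1) and the origin from the example above, these return 1.414…, 2 and 1 respectively, matching the 2-norm, 1-norm and infinity-norm distances.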

Hierarchical clustering

Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.

Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.

Any valid metric may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.

Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.

Agglomerative hierarchical clustering

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

[Figure: Raw data]

The hierarchical clustering dendrogram would be as such:

[Figure: Traditional representation]

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:

 The maximum distance between elements of each cluster (also called complete linkage clustering): max { d(x, y) : x ∈ A, y ∈ B }

 The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(x, y) : x ∈ A, y ∈ B }

 The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): (1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)

 The sum of all intra-cluster variance.

 The increase in variance for the cluster being merged (Ward's criterion).

 The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
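To make the first three linkage criteria concrete, the following C# sketch (illustrative only, assuming using System; and using System.Collections.Generic;) computes complete, single and average linkage between two clusters A and B, given any pairwise distance function d such as the ones listed in the Distance measure section:

public static double CompleteLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //maximum pairwise distance between the two clusters
    double max = double.MinValue;
    foreach (double[] a in A)
        foreach (double[] b in B)
            max = Math.Max(max, d(a, b));
    return max;
}

public static double SingleLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //minimum pairwise distance between the two clusters
    double min = double.MaxValue;
    foreach (double[] a in A)
        foreach (double[] b in B)
            min = Math.Min(min, d(a, b));
    return min;
}

public static double AverageLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //mean of all pairwise distances between the two clusters (as in UPGMA)
    double sum = 0.0;
    foreach (double[] a in A)
        foreach (double[] b in B)
            sum += d(a, b);
    return sum / (A.Count * B.Count);
}

For example, SingleLinkage(A, B, Distances.Euclidean) returns the distance used by single-linkage clustering when deciding whether to merge A and B.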

Partitional clustering
K-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
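The centroid computation generalizes directly to any number of points. The following short sketch (illustrative, assuming using System.Collections.Generic;) returns the per-dimension arithmetic mean of a set of d-dimensional points:

public static double[] Centroid(IList<double[]> points)
{
    //per-dimension arithmetic mean of all points
    int dimensions = points[0].Length;
    double[] centroid = new double[dimensions];
    foreach (double[] p in points)
        for (int d = 0; d < dimensions; d++)
            centroid[d] += p[d];
    for (int d = 0; d < dimensions; d++)
        centroid[d] /= points.Count;
    return centroid;
}

Calling it with the two points X and Y from the example gives exactly the Z defined above.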

The algorithm steps are:[1]

 Choose the number of clusters, k.

 Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.

 Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.

 Recompute the new cluster centers.

 Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments (the k-means++ algorithm addresses this problem by seeking to choose better starting clusters). It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case. For such datasets the k-medoids variant is appropriate. An alternative, using a different criterion for which points are best assigned to which centre, is k-medians clustering.

k-means clustering

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims
to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar
to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural
clusters in the data as well as in the iterative refinement approach employed by both algorithms.

Description
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

arg min_S Σ_{i=1}^{k} Σ_{xj ∈ Si} ‖xj − μi‖²

where μi is the mean of points in Si.
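As a sketch of how this objective can be evaluated for a given partition (illustrative code only, reusing the Centroid helper sketched earlier), the WCSS is just the sum of squared distances of each point to its cluster mean:

public static double WithinClusterSumOfSquares(List<List<double[]>> clusters)
{
    double wcss = 0.0;
    foreach (List<double[]> cluster in clusters)
    {
        double[] mean = Centroid(cluster);          //μi, the mean of points in Si
        foreach (double[] x in cluster)
            for (int d = 0; d < x.Length; d++)
                wcss += (x[d] - mean[d]) * (x[d] - mean[d]);
    }
    return wcss;
}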

History

The term "k-means" was first used by James MacQueen in 1967,[1] though the idea goes back to Hugo Steinhaus in

1956.[2] The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation,

though it wasn't published until 1982.[3]

Algorithms

Regarding computational complexity, the k-means clustering problem is:

 NP-hard in a general Euclidean space of dimension d, even for 2 clusters [4][5]

 NP-hard for a general number of clusters k, even in the plane [6]

 If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered [7]

Thus, a variety of heuristic algorithms are generally used.

Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.

Given an initial set of k means m1^(1), …, mk^(1), which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:[8]

Assignment step: Assign each observation to the cluster with the closest mean (i.e. partition the observations according to the Voronoi diagram generated by the means).

Update step: Calculate the new means to be the centroid of the observations in the cluster.

The algorithm is deemed to have converged when the assignments no longer change.
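The two alternating steps can be written down compactly. The sketch below is an illustrative rendering of the description above (it is not the Silverlight implementation given later in this document) and assumes using System.Collections.Generic; plus the Centroid helper sketched earlier:

public static double SquaredDistance(double[] a, double[] b)
{
    double sum = 0.0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
}

public static int[] Lloyd(double[][] points, double[][] means, int maxIterations)
{
    int[] assignment = new int[points.Length];
    for (int i = 0; i < assignment.Length; i++) assignment[i] = -1;

    for (int iteration = 0; iteration < maxIterations; iteration++)
    {
        bool changed = false;

        //assignment step: each observation goes to the cluster with the closest mean
        for (int i = 0; i < points.Length; i++)
        {
            int best = 0;
            double bestDistance = double.MaxValue;
            for (int j = 0; j < means.Length; j++)
            {
                double distance = SquaredDistance(points[i], means[j]);
                if (distance < bestDistance) { bestDistance = distance; best = j; }
            }
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }

        //converged: the assignments no longer change
        if (!changed) break;

        //update step: each mean becomes the centroid of its assigned observations
        for (int j = 0; j < means.Length; j++)
        {
            List<double[]> members = new List<double[]>();
            for (int i = 0; i < points.Length; i++)
                if (assignment[i] == j) members.Add(points[i]);
            if (members.Count > 0) means[j] = Centroid(members);
        }
    }
    return assignment;
}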

 Demonstration of the standard algorithm

1) k initial "means" (in this case k = 3) are randomly selected from the data set (shown in color).

2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

3) The centroid of each of the k clusters becomes the new means.

4) Steps 2 and 3 are repeated until convergence has been reached.
As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: in particular it has been shown that there exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that is 2^Ω(n), to converge.[9][10] These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.[11]

The "assignment" step is also referred to as expectation step, the "update step" as maximization step,

making this algorithm a variant of the generalized expectation-maximization algorithm.

Variations

 The expectation-maximization algorithm (EM algorithm) maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.

 k-means++ seeks to choose better starting clusters (see the seeding sketch after this list).

 The filtering algorithm uses kd-trees to speed up each k-means step.[12]

 Some methods attempt to speed up each k-means step using coresets[13] or the triangle inequality.[14]

 Escape local optima by swapping points between clusters.[15]
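Of these variations, the k-means++ seeding idea is simple enough to sketch: the first center is picked uniformly at random, and each further center is picked with probability proportional to its squared distance from the nearest center chosen so far. The code below is an illustrative sketch of that idea (not a reference implementation), reusing the SquaredDistance helper from the previous sketch and assuming using System; and using System.Collections.Generic;:

public static List<double[]> SeedKMeansPlusPlus(double[][] points, int k, Random rng)
{
    //first center: uniform at random
    List<double[]> centers = new List<double[]> { points[rng.Next(points.Length)] };

    while (centers.Count < k)
    {
        //squared distance of every point to its nearest already-chosen center
        double[] d2 = new double[points.Length];
        double total = 0.0;
        for (int i = 0; i < points.Length; i++)
        {
            double nearest = double.MaxValue;
            foreach (double[] c in centers)
                nearest = Math.Min(nearest, SquaredDistance(points[i], c));
            d2[i] = nearest;
            total += nearest;
        }

        //draw the next center with probability proportional to d2
        double r = rng.NextDouble() * total;
        int chosen = 0;
        double accumulated = d2[0];
        while (accumulated < r && chosen < points.Length - 1)
        {
            chosen++;
            accumulated += d2[chosen];
        }
        centers.Add(points[chosen]);
    }
    return centers;
}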

Discussion

[Figure: k-means clustering result for the Iris flower data set and actual species visualized using ELKI. Cluster means are marked using larger, semi-transparent symbols.]

[Figure: k-means clustering and EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equi-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set.]

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

 Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

 The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When, for example, applying k-means with a value of k = 3 to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the clustering algorithm. It works very well on some data sets, while failing miserably on others.

The result of k-means can also be seen as the Voronoi cells of the cluster means. Since data is split halfway between cluster means, this can lead to suboptimal splits as can be seen in the "mouse" example. The Gaussian models used by the expectation-maximization algorithm (which can be seen as a generalization of k-means) are more flexible here by having both variances and covariances. The EM result is thus able to accommodate clusters of variable size much better than k-means as well as correlated clusters (not in this example).

Applications of the algorithm


Image segmentation

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient in forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.[16]
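As a hedged illustration only (the weight, array layout and parameter names here are assumptions for this sketch, not the scheme used in the cited work), one common way to set this up is to map each pixel to a feature vector that concatenates weighted coordinates with its color channels, and then cluster those vectors with any of the k-means routines discussed above:

public static double[][] PixelFeatures(byte[,,] rgb, int width, int height, double spatialWeight)
{
    //one 5-dimensional feature vector per pixel: weighted (x, y) plus R, G, B
    double[][] features = new double[width * height][];
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            features[y * width + x] = new double[]
            {
                spatialWeight * x,
                spatialWeight * y,
                rgb[x, y, 0],   //R
                rgb[x, y, 1],   //G
                rgb[x, y, 2]    //B
            };
    return features;
}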

K-Means Algorithm

Posted by cincoutprabu on Aug-03-2010


Languages: C#, Silverlight


This article gives a short introduction to clustering and then explains the K-means algorithm using a live demo in Silverlight. The demo can be used to understand the working of the k-means algorithm through user-defined data points. The full source code in C# and Silverlight is available for download below.

Machine Learning and Clustering

Machine learning is a scientific discipline used to automatically learn in order to understand complex patterns and
make intelligent decisions based on data. This computational learning can be supervised or unsupervised. Data
Mining is the process of extracting useful patterns from large volumes of data. Uncovering hidden patterns in data
using data mining techniques will be very useful for businesses, scientists and governments.

Clustering is the process of organizing a set of items into subsets (called clusters) so that items in the same cluster are similar. The similarity between items can be defined by a function or a formula, based on the context. For example, the Euclidean distance between two points acts as a similarity function for a list of points/co-ordinates in space. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. The term clustering can also refer to automatic classification, numerical taxonomy, typological analysis etc. For more information on clustering, see http://en.wikipedia.org/wiki/Cluster_analysis.
Data Structures for this Article

We illustrate the k-means algorithm using a set of points in 2-dimensional (2D) space. The following data-structure classes are created. The Point class represents a point in 2D space. The PointCollection class represents a set of points and/or a cluster.

using System.Collections.Generic;

public class Point
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }
}
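The DoKMeans code shown later calls AddPoint and RemovePoint on PointCollection and relies on Centroid being kept up to date, but those members are not reproduced in this excerpt. The following is a hedged sketch of what they plausibly look like, based on the article's statement that the centroid is recalculated whenever a point is added or removed (the original article may define them differently):

//Sketch only: PointCollection with the members assumed by DoKMeans below.
public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }

    //hides List<Point>.AddRange so the centroid is initialized when a cluster is first filled
    public new void AddRange(IEnumerable<Point> points)
    {
        base.AddRange(points);
        UpdateCentroid();
    }

    public void AddPoint(Point point)
    {
        this.Add(point);
        UpdateCentroid();
    }

    public Point RemovePoint(Point point)
    {
        this.Remove(point);
        UpdateCentroid();
        return point;
    }

    private void UpdateCentroid()
    {
        //centroid = per-coordinate mean of the points currently in the cluster
        if (this.Count == 0) { Centroid = null; return; }
        double sumX = 0.0, sumY = 0.0;
        foreach (Point p in this) { sumX += p.X; sumY += p.Y; }
        Centroid = new Point { X = sumX / this.Count, Y = sumY / this.Count };
    }
}

With members like these in place, the centroid of each cluster is initialized when DoKMeans first fills the clusters and refreshed on every move.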

K-Means Algorithm

K-Means is a simple clustering algorithm used to divide a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related (in terms of the similarity function) to all objects of that cluster.

Since we know the number of clusters to be formed, the objects in the input list are initially divided into random groups, that is, each object is assigned to a random cluster. After this, the algorithm iteratively refines each group by moving objects from irrelevant groups to relevant ones. The relevance is defined by the similarity measure or function. Whenever a new object is added to or removed from a cluster, its centroid is updated or recalculated. Each iteration is intended to increase the similarity between the points inside each cluster. This iterative refinement is continued until all the clusters become stable, i.e. there is no further movement of objects between clusters. For more information on the k-means algorithm, see http://en.wikipedia.org/wiki/K-means_clustering. The k-means algorithm is also referred to as Lloyd's algorithm.

The K-means algorithm can be used for grouping any set of objects whose similarity measure can be defined numerically. For example, a set of records in a relational-database table can be divided into clusters based on any numerical field of the table: the set of customers or employees can be divided based on attributes/properties like age, income, date-of-join, etc. In such cases, the similarity measure has to be defined based on that attribute.

The following code implements the K-means algorithm, using the data-structures defined above.

public static List<PointCollection> DoKMeans(PointCollection points, int clusterCount)
{
    //divide points into equal clusters
    List<PointCollection> allClusters = new List<PointCollection>();
    List<List<Point>> allGroups = ListUtility.SplitList<Point>(points, clusterCount);
    foreach (List<Point> group in allGroups)
    {
        PointCollection cluster = new PointCollection();
        cluster.AddRange(group);
        allClusters.Add(cluster);
    }

    //start k-means clustering
    int movements = 1;
    while (movements > 0)
    {
        movements = 0;

        foreach (PointCollection cluster in allClusters) //for all clusters
        {
            for (int pointIndex = 0; pointIndex < cluster.Count; pointIndex++) //for all points in each cluster
            {
                Point point = cluster[pointIndex];

                int nearestCluster = FindNearestCluster(allClusters, point);
                if (nearestCluster != allClusters.IndexOf(cluster)) //if point has moved
                {
                    if (cluster.Count > 1) //each cluster shall have minimum one point
                    {
                        Point removedPoint = cluster.RemovePoint(point);
                        allClusters[nearestCluster].AddPoint(removedPoint);
                        movements += 1;
                    }
                }
            }
        }
    }

    return (allClusters);
}

The SplitList() function defined in the ListUtility class is used to split a list of objects into an equal number of groups. This is explained in more detail in this article. The FindNearestCluster() function finds the cluster that is nearest (in terms of Euclidean distance) to the given point.
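FindNearestCluster() is likewise not reproduced in this excerpt. A plausible sketch, assuming it simply compares the point against each cluster's Centroid using the FindDistance() helper shown below, is:

//Sketch only: the original article defines FindNearestCluster elsewhere.
public static int FindNearestCluster(List<PointCollection> allClusters, Point point)
{
    int nearestClusterIndex = -1;
    double minimumDistance = double.MaxValue;

    for (int k = 0; k < allClusters.Count; k++) //compare against every cluster centroid
    {
        double distance = FindDistance(point, allClusters[k].Centroid);
        if (distance < minimumDistance)
        {
            minimumDistance = distance;
            nearestClusterIndex = k;
        }
    }

    return (nearestClusterIndex);
}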

The following function finds the Euclidean distance between two points in 2D space.

public static double FindDistance(Point pt1, Point pt2)
{
    double x1 = pt1.X, y1 = pt1.Y;
    double x2 = pt2.X, y2 = pt2.Y;

    //find euclidean distance
    double distance = Math.Sqrt(Math.Pow(x2 - x1, 2.0) + Math.Pow(y2 - y1, 2.0));
    return (distance);
}
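To show how these pieces fit together, here is a small usage sketch (the coordinates are arbitrary; it assumes the PointCollection members sketched earlier, the ListUtility.SplitList() helper the article refers to, and using System; with using System.Collections.Generic;):

//Usage sketch: cluster four hand-picked 2D points into two groups.
PointCollection points = new PointCollection();
points.AddRange(new List<Point>
{
    new Point { Id = 1, X = 1.0, Y = 1.0 },
    new Point { Id = 2, X = 1.5, Y = 2.0 },
    new Point { Id = 3, X = 8.0, Y = 8.5 },
    new Point { Id = 4, X = 9.0, Y = 8.0 }
});

List<PointCollection> clusters = DoKMeans(points, 2);
foreach (PointCollection cluster in clusters)
{
    Console.Write("Cluster with centroid (" + cluster.Centroid.X + ", " + cluster.Centroid.Y + "):");
    foreach (Point p in cluster)
        Console.Write(" " + p.Id);
    Console.WriteLine();
}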
