
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.

Types of clustering

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not, however, work with arbitrary feature combinations as general subspace methods do. But this special case deserves attention due to its applications in bioinformatics.

Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

Distance measure

An important step in most clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x = 1, y = 0) and the origin (x = 0, y = 0) is always 1 according to the usual norms, but the distance between the point (x = 1, y = 1) and the origin is 2 under the 1-norm, √2 under the 2-norm, and 1 under the infinity-norm.

Common distance functions:

 The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

 The Manhattan distance (aka taxicab norm or 1-norm)

 The maximum norm (aka infinity norm)

 The Mahalanobis distance corrects data for different scales and correlations in the variables

 The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.

 The Hamming distance measures the minimum number of substitutions required to change one member into another.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. Many of the distance functions listed above have the property that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. (A true metric gives symmetric measures of distance.)
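As a concrete illustration of the first three distance functions above, the following is a minimal C# sketch, written for this overview rather than taken from any particular library (it assumes using System;), for d-dimensional points stored as double arrays:

public static class Distances
{
    //2-norm (Euclidean, "as the crow flies")
    public static double Euclidean(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    //1-norm (Manhattan / taxicab)
    public static double Manhattan(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++) sum += Math.Abs(a[i] - b[i]);
        return sum;
    }

    //infinity norm (maximum)
    public static double Maximum(double[] a, double[] b)
    {
        double max = 0.0;
        for (int i = 0; i < a.Length; i++) max = Math.Max(max, Math.Abs(a[i] - b[i]));
        return max;
    }
}

For the point (1, 1) and the origin from the example above, these return 1.414…, 2 and 1 respectively, matching the 2-norm, 1-norm and infinity-norm distances.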

Hierarchical clustering

Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.

Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.

Any valid metric may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.

Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.

Agglomerative hierarchical clustering

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

[Figure: Raw data]

The hierarchical clustering dendrogram would be as such:

[Figure: Traditional representation]

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:

 The maximum distance between elements of each cluster (also called complete linkage clustering): max { d(x, y) : x ∈ A, y ∈ B }

 The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(x, y) : x ∈ A, y ∈ B }

 The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): (1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)

 The sum of all intra-cluster variance.

 The increase in variance for the cluster being merged (Ward's criterion).

 The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
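To make the first three linkage criteria concrete, the following C# sketch (illustrative only, assuming using System; and using System.Collections.Generic;) computes complete, single and average linkage between two clusters A and B, given any pairwise distance function d such as the ones listed in the Distance measure section:

public static double CompleteLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //maximum pairwise distance between the two clusters
    double max = double.MinValue;
    foreach (double[] a in A)
        foreach (double[] b in B)
            max = Math.Max(max, d(a, b));
    return max;
}

public static double SingleLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //minimum pairwise distance between the two clusters
    double min = double.MaxValue;
    foreach (double[] a in A)
        foreach (double[] b in B)
            min = Math.Min(min, d(a, b));
    return min;
}

public static double AverageLinkage(List<double[]> A, List<double[]> B, Func<double[], double[], double> d)
{
    //mean of all pairwise distances between the two clusters (as in UPGMA)
    double sum = 0.0;
    foreach (double[] a in A)
        foreach (double[] b in B)
            sum += d(a, b);
    return sum / (A.Count * B.Count);
}

For example, SingleLinkage(A, B, Distances.Euclidean) returns the distance used by single-linkage clustering when deciding whether to merge A and B.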

Partitional clustering
K-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
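The centroid computation generalizes directly to any number of points. The following short sketch (illustrative, assuming using System.Collections.Generic;) returns the per-dimension arithmetic mean of a set of d-dimensional points:

public static double[] Centroid(IList<double[]> points)
{
    //per-dimension arithmetic mean of all points
    int dimensions = points[0].Length;
    double[] centroid = new double[dimensions];
    foreach (double[] p in points)
        for (int d = 0; d < dimensions; d++)
            centroid[d] += p[d];
    for (int d = 0; d < dimensions; d++)
        centroid[d] /= points.Count;
    return centroid;
}

Calling it with the two points X and Y from the example gives exactly the Z defined above.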

The algorithm steps are:[1]

 Choose the number of clusters, k.

 Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.

 Assign each point to the nearest cluster center, where "nearest" is defined with respect to one of the distance measures discussed above.

 Recompute the new cluster centers.

 Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments (the k-means++ algorithm addresses this problem by seeking to choose better starting clusters). It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case. For such datasets the k-medoids variant is appropriate. An alternative, using a different criterion for which points are best assigned to which centre, is k-medians clustering.

k-means clustering

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims
to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar
to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural
clusters in the data as well as in the iterative refinement approach employed by both algorithms.

Description
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

arg min_S Σ_{i=1}^{k} Σ_{xj ∈ Si} ‖xj − μi‖²

where μi is the mean of points in Si.
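As a sketch of how this objective can be evaluated for a given partition (illustrative code only, reusing the Centroid helper sketched earlier), the WCSS is just the sum of squared distances of each point to its cluster mean:

public static double WithinClusterSumOfSquares(List<List<double[]>> clusters)
{
    double wcss = 0.0;
    foreach (List<double[]> cluster in clusters)
    {
        double[] mean = Centroid(cluster);          //μi, the mean of points in Si
        foreach (double[] x in cluster)
            for (int d = 0; d < x.Length; d++)
                wcss += (x[d] - mean[d]) * (x[d] - mean[d]);
    }
    return wcss;
}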

History

The term "k-means" was first used by James MacQueen in 1967,[1] though the idea goes back to Hugo Steinhaus in

1956.[2] The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation,

though it wasn't published until 1982.[3]

Algorithms

Regarding computational complexity, the k-means clustering problem is:

 NP-hard in a general Euclidean space of dimension d, even for 2 clusters [4][5]

 NP-hard for a general number of clusters k, even in the plane [6]

 If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered [7]

Thus, a variety of heuristic algorithms are generally used.

Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.

Given an initial set of k means m1^(1), …, mk^(1), which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:[8]

Assignment step: Assign each observation to the cluster with the closest mean (i.e. partition the observations according to the Voronoi diagram generated by the means).

Update step: Calculate the new means to be the centroid of the observations in the cluster.

The algorithm is deemed to have converged when the assignments no longer change.
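The two alternating steps can be written down compactly. The sketch below is an illustrative rendering of the description above (it is not the Silverlight implementation given later in this document) and assumes using System.Collections.Generic; plus the Centroid helper sketched earlier:

public static double SquaredDistance(double[] a, double[] b)
{
    double sum = 0.0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
}

public static int[] Lloyd(double[][] points, double[][] means, int maxIterations)
{
    int[] assignment = new int[points.Length];
    for (int i = 0; i < assignment.Length; i++) assignment[i] = -1;

    for (int iteration = 0; iteration < maxIterations; iteration++)
    {
        bool changed = false;

        //assignment step: each observation goes to the cluster with the closest mean
        for (int i = 0; i < points.Length; i++)
        {
            int best = 0;
            double bestDistance = double.MaxValue;
            for (int j = 0; j < means.Length; j++)
            {
                double distance = SquaredDistance(points[i], means[j]);
                if (distance < bestDistance) { bestDistance = distance; best = j; }
            }
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }

        //converged: the assignments no longer change
        if (!changed) break;

        //update step: each mean becomes the centroid of its assigned observations
        for (int j = 0; j < means.Length; j++)
        {
            List<double[]> members = new List<double[]>();
            for (int i = 0; i < points.Length; i++)
                if (assignment[i] == j) members.Add(points[i]);
            if (members.Count > 0) means[j] = Centroid(members);
        }
    }
    return assignment;
}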

 Demonstration of the standard algorithm

1) k initial "means" (in this case k = 3) are randomly selected from the data set (shown in color).

2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

3) The centroid of each of the k clusters becomes the new means.

4) Steps 2 and 3 are repeated until convergence has been reached.
As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: in particular it has been shown that there exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that is 2^Ω(n), to converge.[9][10] These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.[11]

The "assignment" step is also referred to as expectation step, the "update step" as maximization step,

making this algorithm a variant of the generalized expectation-maximization algorithm.

Variations

 The expectation-maximization algorithm (EM algorithm) maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.

 k-means++ seeks to choose better starting clusters (see the seeding sketch after this list).

 The filtering algorithm uses kd-trees to speed up each k-means step.[12]

 Some methods attempt to speed up each k-means step using coresets[13] or the triangle inequality.[14]

 Escape local optima by swapping points between clusters.[15]
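Of these variations, the k-means++ seeding idea is simple enough to sketch: the first center is picked uniformly at random, and each further center is picked with probability proportional to its squared distance from the nearest center chosen so far. The code below is an illustrative sketch of that idea (not a reference implementation), reusing the SquaredDistance helper from the previous sketch and assuming using System; and using System.Collections.Generic;:

public static List<double[]> SeedKMeansPlusPlus(double[][] points, int k, Random rng)
{
    //first center: uniform at random
    List<double[]> centers = new List<double[]> { points[rng.Next(points.Length)] };

    while (centers.Count < k)
    {
        //squared distance of every point to its nearest already-chosen center
        double[] d2 = new double[points.Length];
        double total = 0.0;
        for (int i = 0; i < points.Length; i++)
        {
            double nearest = double.MaxValue;
            foreach (double[] c in centers)
                nearest = Math.Min(nearest, SquaredDistance(points[i], c));
            d2[i] = nearest;
            total += nearest;
        }

        //draw the next center with probability proportional to d2
        double r = rng.NextDouble() * total;
        int chosen = 0;
        double accumulated = d2[0];
        while (accumulated < r && chosen < points.Length - 1)
        {
            chosen++;
            accumulated += d2[chosen];
        }
        centers.Add(points[chosen]);
    }
    return centers;
}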

Discussion

[Figure: k-means clustering result for the Iris flower data set and actual species visualized using ELKI. Cluster means are marked using larger, semi-transparent symbols.]

[Figure: k-means clustering and EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equi-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set.]

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

 Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

 The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When, for example, applying k-means with a value of k = 3 to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the clustering algorithm. It works very well on some data sets, while failing miserably on others.

The result of k-means can also be seen as the Voronoi cells of the cluster means. Since data is split halfway between cluster means, this can lead to suboptimal splits as can be seen in the "mouse" example. The Gaussian models used by the expectation-maximization algorithm (which can be seen as a generalization of k-means) are more flexible here by having both variances and covariances. The EM result is thus able to accommodate clusters of variable size much better than k-means as well as correlated clusters (not in this example).

Applications of the algorithm


Image segmentation

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient in forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.[16]
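As a hedged illustration only (the weight, array layout and parameter names here are assumptions for this sketch, not the scheme used in the cited work), one common way to set this up is to map each pixel to a feature vector that concatenates weighted coordinates with its color channels, and then cluster those vectors with any of the k-means routines discussed above:

public static double[][] PixelFeatures(byte[,,] rgb, int width, int height, double spatialWeight)
{
    //one 5-dimensional feature vector per pixel: weighted (x, y) plus R, G, B
    double[][] features = new double[width * height][];
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            features[y * width + x] = new double[]
            {
                spatialWeight * x,
                spatialWeight * y,
                rgb[x, y, 0],   //R
                rgb[x, y, 1],   //G
                rgb[x, y, 2]    //B
            };
    return features;
}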

K-Means Algorithm

Posted by cincoutprabu on Aug-03-2010


Languages: C#, Silverlight


This article gives a short introduction to clustering and then explains the K-means algorithm using a live demo in Silverlight. The demo can be used to understand the working of the k-means algorithm through user-defined data points. The full source code in C# and Silverlight is available for download below.

Machine Learning and Clustering

Machine learning is a scientific discipline used to automatically learn in order to understand complex patterns and
make intelligent decisions based on data. This computational learning can be supervised or unsupervised. Data
Mining is the process of extracting useful patterns from large volumes of data. Uncovering hidden patterns in data
using data mining techniques will be very useful for businesses, scientists and governments.

Clustering is the process of organizing a set of items into subsets (called clusters) so that items in the same cluster are similar. The similarity between items can be defined by a function or a formula, based on the context. For example, the Euclidean distance between two points acts as a similarity function for a list of points/co-ordinates in space. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. The term clustering can also refer to automatic classification, numerical taxonomy, typological analysis etc. For more information on clustering, see http://en.wikipedia.org/wiki/Cluster_analysis.
Data Structures for this Article

We illustrate the k-means algorithm using a set of points in 2-dimensional (2D) space. The following data-structure classes are created. The Point class represents a point in 2D space. The PointCollection class represents a set of points and/or a cluster.

using System.Collections.Generic;

public class Point
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }
}
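The DoKMeans code shown later calls AddPoint and RemovePoint on PointCollection and relies on Centroid being kept up to date, but those members are not reproduced in this excerpt. The following is a hedged sketch of what they plausibly look like, based on the article's statement that the centroid is recalculated whenever a point is added or removed (the original article may define them differently):

//Sketch only: PointCollection with the members assumed by DoKMeans below.
public class PointCollection : List<Point>
{
    public Point Centroid { get; set; }

    //hides List<Point>.AddRange so the centroid is initialized when a cluster is first filled
    public new void AddRange(IEnumerable<Point> points)
    {
        base.AddRange(points);
        UpdateCentroid();
    }

    public void AddPoint(Point point)
    {
        this.Add(point);
        UpdateCentroid();
    }

    public Point RemovePoint(Point point)
    {
        this.Remove(point);
        UpdateCentroid();
        return point;
    }

    private void UpdateCentroid()
    {
        //centroid = per-coordinate mean of the points currently in the cluster
        if (this.Count == 0) { Centroid = null; return; }
        double sumX = 0.0, sumY = 0.0;
        foreach (Point p in this) { sumX += p.X; sumY += p.Y; }
        Centroid = new Point { X = sumX / this.Count, Y = sumY / this.Count };
    }
}

With members like these in place, the centroid of each cluster is initialized when DoKMeans first fills the clusters and refreshed on every move.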

K-Means Algorithm

K-Means is a simple clustering algorithm used to divide a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related (in terms of the similarity function) to all objects of that cluster.

Since we know the number of clusters to be formed, the objects in the input list are initially divided into random groups, that is, each object is assigned to a random cluster. After this, the algorithm iteratively refines each group by moving objects from irrelevant groups to relevant ones. The relevance is defined by the similarity measure or function. Whenever a new object is added to or removed from a cluster, its centroid is updated or recalculated. Each iteration is intended to increase the similarity between the points inside each cluster. This iterative refinement is continued until all the clusters become stable, i.e. there is no further movement of objects between clusters. For more information on the k-means algorithm, see http://en.wikipedia.org/wiki/K-means_clustering. The k-means algorithm is also referred to as Lloyd's algorithm.

The K-means algorithm can be used for grouping any set of objects whose similarity measure can be defined numerically. For example, a set of records in a relational-database table can be divided into clusters based on any numerical field of the table: the set of customers or employees can be divided based on attributes/properties like age, income, date-of-join, etc. In such cases, the similarity measure has to be defined based on that attribute.

The following code implements the K-means algorithm, using the data-structures defined above.

public static List<PointCollection> DoKMeans(PointCollection points, int clusterCount)
{
    //divide points into equal clusters
    List<PointCollection> allClusters = new List<PointCollection>();
    List<List<Point>> allGroups = ListUtility.SplitList<Point>(points, clusterCount);
    foreach (List<Point> group in allGroups)
    {
        PointCollection cluster = new PointCollection();
        cluster.AddRange(group);
        allClusters.Add(cluster);
    }

    //start k-means clustering
    int movements = 1;
    while (movements > 0)
    {
        movements = 0;

        foreach (PointCollection cluster in allClusters) //for all clusters
        {
            for (int pointIndex = 0; pointIndex < cluster.Count; pointIndex++) //for all points in each cluster
            {
                Point point = cluster[pointIndex];

                int nearestCluster = FindNearestCluster(allClusters, point);
                if (nearestCluster != allClusters.IndexOf(cluster)) //if point has moved
                {
                    if (cluster.Count > 1) //each cluster shall have minimum one point
                    {
                        Point removedPoint = cluster.RemovePoint(point);
                        allClusters[nearestCluster].AddPoint(removedPoint);
                        movements += 1;
                    }
                }
            }
        }
    }

    return (allClusters);
}

The SplitList() function defined in the ListUtility class is used to split a list of objects into an equal number of groups. This is explained in more detail in this article. The FindNearestCluster() function finds the cluster that is nearest (in terms of Euclidean distance) to the given point.
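FindNearestCluster() is likewise not reproduced in this excerpt. A plausible sketch, assuming it simply compares the point against each cluster's Centroid using the FindDistance() helper shown below, is:

//Sketch only: the original article defines FindNearestCluster elsewhere.
public static int FindNearestCluster(List<PointCollection> allClusters, Point point)
{
    int nearestClusterIndex = -1;
    double minimumDistance = double.MaxValue;

    for (int k = 0; k < allClusters.Count; k++) //compare against every cluster centroid
    {
        double distance = FindDistance(point, allClusters[k].Centroid);
        if (distance < minimumDistance)
        {
            minimumDistance = distance;
            nearestClusterIndex = k;
        }
    }

    return (nearestClusterIndex);
}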

The following function finds the Euclidean distance between two points in 2D space.

public static double FindDistance(Point pt1, Point pt2)
{
    double x1 = pt1.X, y1 = pt1.Y;
    double x2 = pt2.X, y2 = pt2.Y;

    //find euclidean distance
    double distance = Math.Sqrt(Math.Pow(x2 - x1, 2.0) + Math.Pow(y2 - y1, 2.0));
    return (distance);
}
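To show how these pieces fit together, here is a small usage sketch (the coordinates are arbitrary; it assumes the PointCollection members sketched earlier, the ListUtility.SplitList() helper the article refers to, and using System; with using System.Collections.Generic;):

//Usage sketch: cluster four hand-picked 2D points into two groups.
PointCollection points = new PointCollection();
points.AddRange(new List<Point>
{
    new Point { Id = 1, X = 1.0, Y = 1.0 },
    new Point { Id = 2, X = 1.5, Y = 2.0 },
    new Point { Id = 3, X = 8.0, Y = 8.5 },
    new Point { Id = 4, X = 9.0, Y = 8.0 }
});

List<PointCollection> clusters = DoKMeans(points, 2);
foreach (PointCollection cluster in clusters)
{
    Console.Write("Cluster with centroid (" + cluster.Centroid.X + ", " + cluster.Centroid.Y + "):");
    foreach (Point p in cluster)
        Console.Write(" " + p.Id);
    Console.WriteLine();
}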
