

has the highest sum of similarities to all non-medoid points, excluding those points that are more
similar to one of the currently chosen initial medoids.
While the K-medoid algorithm is relatively simple, it should be clear that it is expensive compared
to K-means. More recent improvements of the K-medoids algorithm have better efficiency than the
basic algorithm, but are still relatively expensive and will not be discussed here. Relevant references
may be found in the bibliographic remarks.

5.7 Hierarchical Clustering


A hierarchical clustering algorithm is any algorithm that produces a hierarchical clustering as defined earlier. More specifically, the goal of such algorithms is to produce a sequence of nested clusters, ranging from singleton clusters of individual points to an all-inclusive cluster. As mentioned, this hierarchy of clusters is often graphically represented by a dendrogram, as illustrated by Figures 5.3 and 5.4. A dendrogram captures the process by which a hierarchical clustering is generated by showing the order in which clusters are merged (bottom-up view) or split (top-down view).
One of the attractions of hierarchical techniques is that they correspond to taxonomies that are very common in the biological sciences, e.g., kingdom, phylum, genus, species, and so on. (Some cluster analysis work occurs under the name of mathematical taxonomy.) Another attractive feature is
that hierarchical techniques do not assume any particular number of clusters. Instead, any desired
number of clusters can be obtained by cutting the dendrogram at the proper level. Also, hierarchical
techniques are sometimes thought to produce better quality clusters.

5.7.1 Agglomeration and Division


There are two basic approaches to generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires defining the notion of cluster proximity. Agglomerative techniques are the most popular, and most of this section will be spent describing them.

Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step and how to do the splitting.

Sample Data
In the examples that follow, we shall use the following data, which consists of six two-dimensional points, to illustrate the behavior of the various hierarchical clustering algorithms. The x and y coordinates of the points and the distances between them are shown, respectively, in Tables 5.6 and 5.7. The points themselves are shown in Figure 5.24.


point    x coordinate    y coordinate
p1       0.4005          0.5306
p2       0.2148          0.3854
p3       0.3457          0.3156
p4       0.2652          0.1875
p5       0.0789          0.4139
p6       0.4548          0.3022

Table 5.6. X-Y coordinates of six points.

         p1        p2        p3        p4        p5        p6
p1       0.0000    0.2357    0.2218    0.3688    0.3421    0.2347
p2       0.2357    0.0000    0.1483    0.2042    0.1388    0.2540
p3       0.2218    0.1483    0.0000    0.1513    0.2843    0.1100
p4       0.3688    0.2042    0.1513    0.0000    0.2932    0.2216
p5       0.3421    0.1388    0.2843    0.2932    0.0000    0.3921
p6       0.2347    0.2540    0.1100    0.2216    0.3921    0.0000

Table 5.7. Distance Matrix for Six Points.

Figure 5.24. Set of Six Two-dimensional Points.
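The distances in Table 5.7 appear to be the Euclidean distances between the coordinates in Table 5.6. For readers who wish to reproduce the table, a minimal sketch (assuming NumPy and SciPy are available) is:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# x-y coordinates of the six points from Table 5.6.
points = np.array([
    [0.4005, 0.5306],  # p1
    [0.2148, 0.3854],  # p2
    [0.3457, 0.3156],  # p3
    [0.2652, 0.1875],  # p4
    [0.0789, 0.4139],  # p5
    [0.4548, 0.3022],  # p6
])

# Pairwise Euclidean distances; squareform expands the condensed
# distance vector into the full 6 x 6 matrix of Table 5.7.
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(np.round(dist_matrix, 4))

The points and dist_matrix arrays defined here are reused in the later sketches in this section.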

5.7.2 Divisive Algorithms

As mentioned, divisive techniques are less common. We have already seen an example of this type of technique, bisecting K-means, which was described in Section 5.6.1. Another simple divisive hierarchical technique, which we shall refer to as MST, starts with the minimum spanning tree of the proximity graph.
Conceptually, the minimum spanning tree of the proximity graph is built by starting with a tree that consists of any single point. In successive steps, we look for the closest pair of points, p and q, such that one point, p, is in the current tree and one, q, is not. We add q to the tree and put an edge between p and q. Figure 5.25 shows the MST for the points in Figure 5.24.

Figure 5.25. Minimum Spanning Tree for Set of Six Two-dimensional Points.
The MST divisive hierarchical algorithm is shown below. This approach is the divisive version of
the single link agglomerative technique that we will see shortly. Indeed, the hierarchical clustering
produced by MST is the same as that produced by single link. See Figure 5.27.
Algorithm 5 MST Divisive Hierarchical Clustering Algorithm
1: Compute a minimum spanning tree for the proximity graph.
2: repeat
3:    Create a new cluster by breaking the link corresponding to the largest distance (smallest similarity).
4: until Only singleton clusters remain
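The following is a minimal sketch of this algorithm (not the textbook's code) using SciPy's graph routines; it assumes the dist_matrix array computed earlier. Breaking the k - 1 largest MST edges leaves k connected components, which are the clusters at that level of the hierarchy.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(dist_matrix, num_clusters):
    # Step 1: compute a minimum spanning tree of the complete proximity graph.
    mst = minimum_spanning_tree(dist_matrix).toarray()
    # Repeatedly break the link corresponding to the largest remaining distance
    # until the desired number of clusters (connected components) is reached.
    for _ in range(num_clusters - 1):
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels

# Example: cutting the two longest MST edges yields three clusters.
# print(mst_divisive(dist_matrix, 3))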

5.7.3 Basic Agglomerative Hierarchical Clustering Algorithms


Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with individual points as clusters, successively merge two clusters until only one cluster remains. This approach is expressed more formally in Algorithm 6.
Algorithm 6 Basic Agglomerative Hierarchical Clustering Algorithm
1: Compute the proximity graph, if necessary.
2: repeat
3:    Merge the closest two clusters.
4:    Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains
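As an illustration of Algorithm 6, here is a deliberately simple sketch (not an optimized implementation). It keeps clusters as sets of point indices and takes a user-supplied cluster_proximity function, such as the ones defined in the next section; for clarity it recomputes proximities each pass instead of maintaining and updating a proximity matrix as in step 4.

from itertools import combinations

def agglomerate(dist_matrix, cluster_proximity):
    # Start with each point as a singleton cluster.
    clusters = [{i} for i in range(len(dist_matrix))]
    merges = []  # record of (cluster_a, cluster_b, proximity at which they merge)
    while len(clusters) > 1:
        # Merge the closest pair of clusters under the chosen proximity.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_proximity(pair[0], pair[1], dist_matrix))
        merges.append((a, b, cluster_proximity(a, b, dist_matrix)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return merges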

Figure 5.26. Definition of Cluster Proximity: (a) MIN; (b) MAX; (c) Group Average.

5.7.4 Defining Proximity Between Clusters


The key step of the previous algorithm is the calculation of the proximity between two clusters, and
this is where the various agglomerative hierarchical techniques differ. Cluster proximity is typically
defined by a conceptual view of the clusters. For example, if we view a cluster as being represented
by all the points, then we can take the proximity between the closest two points in different clusters
as the proximity between the two clusters. This defines the MIN technique. Alternatively, we can
take the proximity between the farthest two points in different clusters to be our definition of cluster
proximity. This defines the MAX technique. (Notice that the names MIN and MAX are appropriate only if our proximities are distances; thus, many prefer the alternative names, which are, respectively, single link and complete link. However, we shall prefer the terms MIN and MAX for their brevity.) Also, we can average the pairwise proximities of all pairs of points from the two different clusters. This yields the group average technique. These three approaches are graphically
illustrated by Figure 5.26.
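For concreteness, the three definitions can be written as small functions that could be passed to the agglomerative sketch given earlier (the function names are ours, not standard identifiers; dist_matrix is indexed by point, and clusters are sets of point indices):

def single_link(cluster_a, cluster_b, dist_matrix):
    # MIN: proximity is the smallest pairwise distance across the two clusters.
    return min(dist_matrix[i][j] for i in cluster_a for j in cluster_b)

def complete_link(cluster_a, cluster_b, dist_matrix):
    # MAX: proximity is the largest pairwise distance across the two clusters.
    return max(dist_matrix[i][j] for i in cluster_a for j in cluster_b)

def group_average(cluster_a, cluster_b, dist_matrix):
    # Group average: the mean of all pairwise distances between the two clusters.
    return (sum(dist_matrix[i][j] for i in cluster_a for j in cluster_b)
            / (len(cluster_a) * len(cluster_b)))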
If, instead, we represent each cluster by a centroid, then we find that different definitions of cluster proximity are more natural. For the centroid approach, the cluster proximity is defined as the proximity between cluster centroids. An alternative technique, Ward's method, also assumes that a cluster is represented by its centroid. However, it measures the proximity between two clusters in terms of the increase in the SSE that results from merging the two clusters into one. Like K-means, Ward's method attempts to minimize the sum of the squared distances of points from their cluster centroids.

Time and Space Complexity


Hierarchical clustering techniques typically use a proximity matrix. This requires the computation and storage of m² proximities, where m is the number of data points, which is a factor that limits the size of data sets that can be processed. It is possible to compute the proximities on the fly and save space, but this increases computation time. Overall, the time required for hierarchical clustering is O(m² log m).

5.7.5 MIN or Single Link


For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined
to be the minimum of the distance (maximum of the similarity) between any two points in the
different clusters. (The technique is called single link because, if you start with all points as
singleton clusters, and add links between points, strongest links first, then these single links combine
the points into clusters.) Single link is good at handling non-elliptical shapes, but is sensitive to
noise and outliers. Figure 5.27 shows the result of applying MIN to our example data set of six
points.
Figure 5.27. Single Link Clustering of Six Points: (a) single link clustering; (b) single link dendrogram.

Figure 5.27a shows the nested clusters as a sequence of nested ellipses, while Figure 5.27b shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram
reflects the distance of the two clusters. For instance, from Table 5.7, we see that the distance
between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in
the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by
dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843,
0.3921) = 0.1483.
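These merge heights can be checked with SciPy's hierarchical clustering routines (a sketch, assuming the points array defined earlier; note that SciPy numbers the points 0-5, so p3 and p6 appear as 2 and 5):

from scipy.cluster.hierarchy import linkage

# Each row of the linkage matrix is (cluster_i, cluster_j, merge distance, size).
# For single link on the six points, the first merges occur at 0.1100 (p3 and p6),
# 0.1388 (p2 and p5), and 0.1483 ({3, 6} with {2, 5}), matching the text.
Z = linkage(points, method='single', metric='euclidean')
print(Z)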

5.7.6 MAX or Complete Link or CLIQUE


For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined to be the maximum of the distance (minimum of the similarity) between any two points in
the different clusters. (The technique is called complete link because, if you start with all points
as singleton clusters, and add links between points, strongest links first, then a group of points is
not a cluster until all the points in it are completely linked, i.e., form a clique.) Complete link is less susceptible to noise and outliers, but it can break large clusters and tends to favor globular shapes.
Figure 5.28 shows the results of applying MAX to the sample data set of six points. Again, points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of with {2, 5} or {1}, because dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.
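The same comparison can be checked directly from the distance matrix (a small check, assuming the dist_matrix array defined earlier; indices are 0-based, so p1, ..., p6 correspond to 0, ..., 5):

# Complete link distance from {p3, p6} to each candidate cluster.
d_36_4  = max(dist_matrix[2][3], dist_matrix[5][3])            # 0.2216
d_36_25 = max(dist_matrix[2][1], dist_matrix[5][1],
              dist_matrix[2][4], dist_matrix[5][4])            # 0.3921
d_36_1  = max(dist_matrix[2][0], dist_matrix[5][0])            # 0.2347
print(d_36_4, d_36_25, d_36_1)  # the smallest is d_36_4, so {3, 6} merges with {4}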

5.7.7 Group Average


For the group average version of hierarchical clustering, the proximity of two clusters is defined to
be the average of the pairwise proximities between all pairs of points in the different clusters. Notice
that this is an intermediate approach between MIN and MAX. This is expressed by the following
equation:
proximity(cluster1, cluster2) = [ Σ p1∈cluster1 Σ p2∈cluster2 proximity(p1, p2) ] / [ size(cluster1) × size(cluster2) ]          (5.17)

Figure 5.28. Complete Link Clustering of Six Points: (a) complete link clustering; (b) complete link dendrogram.


Figure 5.29 shows the results of applying group average to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 × 1) = 0.2751, dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 × 1) = 0.2889, and dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(3 × 2) = 0.2627. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), clusters {3, 6, 4} and {2, 5} are merged at the fourth stage.
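These values can again be checked directly from the distance matrix (a small check, assuming dist_matrix from earlier; clusters are written with 0-based indices, so {3, 6, 4} becomes {2, 5, 3}, {2, 5} becomes {1, 4}, and {1} becomes {0}):

def avg_dist(cluster_a, cluster_b):
    # Group average proximity (Equation 5.17) computed over the distance matrix.
    return (sum(dist_matrix[i][j] for i in cluster_a for j in cluster_b)
            / (len(cluster_a) * len(cluster_b)))

print(avg_dist({2, 5, 3}, {0}))     # approximately 0.2751
print(avg_dist({1, 4}, {0}))        # approximately 0.2889
print(avg_dist({2, 5, 3}, {1, 4}))  # approximately 0.2627, the smallest of the three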

5.7.8 Ward's Method and Centroid Methods


For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged. Thus, this method uses the same objective function as K-means clustering. While it may seem that this makes the technique somewhat distinct from other hierarchical techniques, some algebra will show that it is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them. Figure 5.30 shows the results of applying Ward's method to the sample data set of six points. The resulting clustering is somewhat different from those produced by MIN, MAX, and group average.
Centroid methods calculate the proximity between two clusters as the distance between the centroids of the clusters. These techniques may seem similar to K-means, but as we have remarked, Ward's method is the correct hierarchical analogue.
Centroid methods also have a characteristic, often considered bad, that the other hierarchical clustering techniques we have discussed do not possess: the possibility of inversions. To be specific, two clusters that are merged may be more similar (less distant) than the pair of clusters that were merged in a previous step. For the other methods, the similarity of the clusters being merged monotonically decreases (the distance between merged clusters monotonically increases) as we proceed from singleton clusters to one all-inclusive cluster.
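Ward's merge cost can be computed without rescanning the points by using a standard identity (our assumption here, not stated in the text): the increase in SSE from merging clusters A and B equals nA·nB/(nA + nB) times the squared distance between their centroids. A minimal sketch under that assumption:

import numpy as np

def ward_merge_cost(points_a, points_b):
    # Increase in SSE from merging two clusters, via the centroid identity:
    #   delta_SSE = (n_a * n_b) / (n_a + n_b) * ||centroid_a - centroid_b||^2
    points_a, points_b = np.asarray(points_a), np.asarray(points_b)
    n_a, n_b = len(points_a), len(points_b)
    diff = points_a.mean(axis=0) - points_b.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * np.dot(diff, diff)

# Example: cost of merging {p3, p6} with {p4}, using the coordinates of Table 5.6.
# print(ward_merge_cost(points[[2, 5]], points[[3]]))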

Figure 5.29. Group Average Clustering of Six Points: (a) group average clustering; (b) group average dendrogram.

Figure 5.30. Ward's Clustering of Six Points: (a) Ward's clustering; (b) Ward's dendrogram.


5.7.9 Key Issues in Hierarchical Clustering


Lack of a Global Objective Function
Previously, we mentioned that hierarchical clustering cannot be viewed as globally optimizing an
objective function. Instead, hierarchical clustering techniques use various criteria to decide locally,
at each step, which clusters should be joined (or split for divisive approaches). This approach yields
clustering algorithms that avoid the difficulty of trying to solve a hard combinatorial optimization
problem. (As noted previously, the general clustering problem for objective functions such as minimizing SSE is NP-hard.) Furthermore, such approaches do not have problems with local minima or difficulties in choosing initial points. Of course, the time complexity of O(m² log m) and the space complexity of O(m²) are prohibitive in many cases.

The Impact of Cluster Size


Another aspect of agglomerative hierarchical clustering that should be considered is how to treat
the relative sizes of the pairs of clusters that may be merged. (Note that this discussion only
applies to cluster proximity schemes that involve sums, i.e., centroid and group average.) There are
basically two schemes: weighted and unweighted. Weighted schemes treat all clusters equally, and
thus, objects in smaller clusters effectively have larger weight. Unweighted schemes treat all objects
equally. Unweighted schemes are more popular, and in our previous discussions about the centroid
and group average techniques, we discussed only the unweighted versions.
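As a small illustration of the difference (a sketch; the UPGMA/WPGMA naming is the common terminology, not the text's), the unweighted scheme averages over all point pairs, which amounts to weighting the two cluster-level proximities by cluster size, while the weighted scheme simply averages the two cluster-level proximities, so points in the smaller cluster effectively count for more:

def unweighted_group_average(p_aq, p_bq, n_a, n_b):
    # UPGMA: every point pair contributes equally, so larger clusters weigh more.
    return (n_a * p_aq + n_b * p_bq) / (n_a + n_b)

def weighted_group_average(p_aq, p_bq):
    # WPGMA: the two merged clusters are treated equally, regardless of size.
    return (p_aq + p_bq) / 2.0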

Merging Decisions are Final


Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining
two clusters since they have access to the proximity matrix. However, once a decision is made to
merge two clusters, this decision cannot be undone at a later time. This prevents a local optimization
criterion from becoming a global optimization criterion.
For example, in Ward's method, the "minimize squared error" criterion from K-means is used in deciding which clusters to merge. However, this does not result in a clustering that could be used to
solve the K-means problem. Even though the local, per-step decisions try to minimize the squared
error, the clusters produced on any level, except perhaps the very lowest levels, do not represent an
optimal clustering from a minimize global squared error point of view. Furthermore, the clusters
are not even stable, in the sense that a point in a cluster may be closer to the centroid of some
other cluster than to the centroid of its current cluster.
However, Ward's method can be used as a robust method of initializing a K-means clustering. Thus, a local "minimize squared error" objective function does seem to have some connection with a global "minimize squared error" objective function.
Finally, it is possible to attempt to fix up the hierarchical clustering produced by hierarchical
clustering techniques. One idea is to move branches of the tree around so as to improve some global
objective function. Another idea is to refine the clusters produced by a hierarchical technique by
using an approach similar to that used for the multi-level refinement of graph partitions.

5.7.10 The Lance-Williams Formula for Cluster Proximity


Any of the cluster proximities that we discussed in this section can be viewed as a choice of different
parameters (in the Lance-Williams formula shown below in equation 5.18) for the proximity between
clusters Q and R, where R is formed by merging clusters A and B. (Note that in this formula p(·, ·) is a proximity function.) In words, this formula says that after you merge clusters A and B to form
cluster R, then the proximity of the new cluster, R, to an existing cluster, Q, is a linear function of the proximities of Q from the original clusters A and B:

p(R, Q) = αA p(A, Q) + αB p(B, Q) + β p(A, B) + γ |p(A, Q) - p(B, Q)|          (5.18)

Table 5.8 shows the values of these coefficients for the techniques that we discussed; nA, nB, and nQ are the number of points in clusters A, B, and Q, respectively.

Table 5.8. Lance-Williams Coefficients for Common Hierarchical Clustering Approaches

Clustering Method    αA                       αB                       β                        γ
MIN                  1/2                      1/2                      0                        -1/2
MAX                  1/2                      1/2                      0                        1/2
Group Average        nA/(nA+nB)               nB/(nA+nB)               0                        0
Centroid             nA/(nA+nB)               nB/(nA+nB)               -nA·nB/(nA+nB)²          0
Ward's               (nA+nQ)/(nA+nB+nQ)       (nB+nQ)/(nA+nB+nQ)       -nQ/(nA+nB+nQ)           0

Any hierarchical technique that can be phrased in this way does not need the original points,
only the proximity matrix, which is updated as clustering occurs. However, while a general formula
is nice, especially for implementation, it is often easier to understand the different hierarchical
methods by looking directly at the definition of cluster proximity that each method uses, which was
the approach taken in our previous discussion.
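As an illustration (a sketch, not a full implementation), the following applies a single Lance-Williams update: given the proximities p(A, Q), p(B, Q), and p(A, B), it returns the proximity of the merged cluster R = A ∪ B to Q without going back to the original points. The MIN coefficients used in the example come from Table 5.8.

def lance_williams(p_aq, p_bq, p_ab, alpha_a, alpha_b, beta, gamma):
    # Equation 5.18: proximity of the merged cluster R to an existing cluster Q.
    return (alpha_a * p_aq + alpha_b * p_bq
            + beta * p_ab + gamma * abs(p_aq - p_bq))

# MIN (single link): alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2,
# which reduces to min(p(A, Q), p(B, Q)).
# Example: merging A = {3} and B = {6}; proximity of R = {3, 6} to Q = {2}.
print(lance_williams(0.1483, 0.2540, 0.1100, 0.5, 0.5, 0.0, -0.5))  # 0.1483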

5.8 Density-Based Clustering


In this section, we describe some clustering algorithms that use the density-based definition of a
cluster. In particular, we will focus on the algorithm DBSCAN, which illustrates a number of
important concepts. In addition, we will also examine an extension of DBSCAN, DENCLUE. There
are many other density-based clustering algorithms, and we will see some of these in other sections.
In particular, CLIQUE and MAFIA are two density-based clustering algorithms that are specifically
designed for handling clusters in high-dimensional data, and we discuss them later in Section 5.9,
the section on subspace clustering.

5.8.1 DBSCAN
DBSCAN is a density-based clustering algorithm that works with a number of different distance metrics. After DBSCAN has processed a set of data points, a point will either be in a cluster or will
be classified as a noise point. Furthermore, DBSCAN also makes a distinction between the points in
clusters, classifying some as core points, i.e., points in the interior of a cluster, and some as border
points, i.e., points on the edge of a cluster. Informally, any two core points that are close enough
are put in the same cluster. Likewise, any border point that is close enough to a core point is put
in the same cluster as the core point. Noise points are discarded.
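To make this concrete, here is a minimal sketch (not the full DBSCAN algorithm) that labels points as core, border, or noise from a distance matrix, assuming the usual two user-specified parameters: a neighborhood radius (commonly called Eps) and a minimum number of neighbors (MinPts). The parameter values in the commented example are chosen only for illustration.

def classify_points(dist_matrix, eps, min_pts):
    n = len(dist_matrix)
    # Neighborhood of each point: all points within distance eps (including itself).
    neighbors = [{j for j in range(n) if dist_matrix[i][j] <= eps} for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # A border point is not a core point but lies within eps of some core point.
    border = {i for i in range(n)
              if i not in core and any(j in core for j in neighbors[i])}
    noise = set(range(n)) - core - border
    return core, border, noise

# Example on the six sample points:
# print(classify_points(dist_matrix, eps=0.16, min_pts=2))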

Classification of Points according to Density


Figure 5.31 graphically illustrates the concepts of a core, border, and noise point with a collection
of two-dimensional points, while the following text provides a more precise description.
Core points. These are points that are in the interior of a cluster. A point is a core point if there
are enough points in its neighborhood, i.e., if the number of points within a given neighborhood
