
Data Mining – Chapter 4

Cluster Analysis
Part 2

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.
Recap
• Considered two types of clustering algorithms –
partitioning type and hierarchical type.

• Discussed a partitioning algorithm called K-Means. It requires the user to specify the number of clusters and their seeds, and assigns each object to its nearest cluster centre using Euclidean distance.

• A hierarchical algorithm called the Agglomerative Method was also discussed. It starts with one cluster for each object and merges the pair of clusters that are nearest, continuing until one large cluster includes all the objects. Distances between clusters need to be computed; possible approaches are single link, complete link and others.

Recap
• In our examples the attributes are not independent: there is a strong correlation between the students' marks, and a strong correlation between Olympic medal counts and per capita income.

• Examples used are always very simple and too small to illustrate the methods fully.

• For K-Means, it is difficult to guess the number of clusters and the starting seeds, and there is no generally accepted procedure for determining them.

• How to interpret results?

K-Means
K-Means is a greedy, iterative-improvement algorithm. It starts from a random selection of cluster seeds, which is then improved by reassigning objects until no further improvement is possible.

All the data in the dataset must be processed in each iteration. If the dataset is very large and cannot fit in main memory, the process may become inefficient.

K-Means
Finds a local minimum.

To have a better chance of finding the global minimum, the method should be restarted with a new set of cluster seeds (and perhaps a different number of clusters). The results of several such runs can then be evaluated based on within-group variation and between-group variation, and the best result accepted.

K-Means
Define the within-cluster variation I and the between-cluster variation E:

I = sum of squares of pairwise distances between objects in the same cluster
E = sum of squares of pairwise distances between the central points of the clusters

One possible objective could then be to maximize E/I. This ratio could also be used to judge the quality of different iterations.
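As an illustration (not part of the original slides), the following sketch shows how I, E and the ratio E/I could be computed for a given assignment of objects to clusters, using NumPy. The function names are illustrative only.

import numpy as np

def within_variation(X, labels):
    # I: sum of squared pairwise distances between objects in the same cluster
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        diffs = pts[:, None, :] - pts[None, :, :]
        total += (diffs ** 2).sum() / 2.0   # each pair is counted twice above
    return total

def between_variation(X, labels):
    # E: sum of squared pairwise distances between the cluster centres
    centres = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    diffs = centres[:, None, :] - centres[None, :, :]
    return (diffs ** 2).sum() / 2.0

def quality(X, labels):
    # a larger E/I suggests tight, well separated clusters
    return between_variation(X, labels) / within_variation(X, labels)

Several K-Means runs with different seeds could then be compared by their E/I values and the best one kept.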

K-Means
A small I shows that the cluster is tight; a large I shows that the cluster has much variation.

Similarly, a small E (relative to, say, the sum of the I values) shows that the clusters are not well separated, and perhaps the clusters found are not very good. A large E shows good separation between the clusters.

Hierarchical Methods
Quite different from K-Means. No need to specify the
number of clusters or cluster seeds.

The Agglomerative method starts with each object in an individual cluster and then merges the two closest clusters to build larger and larger clusters. The divisive approach is the opposite of that.

Results depend on the distance measure used.

Not easy to evaluate the quality of the results.

Distance between Clusters

Single Link method (nearest neighbour): the distance between two clusters is the distance between the two closest points, one from each cluster. Chains can be formed (why?) and a long string of points may be assigned to the same cluster.

Complete Linkage method (furthest neighbour): the distance between two clusters is the distance between the two furthest points, one from each cluster. It does not allow chains.

Distance between Clusters

Centroid method: the distance between two clusters is the distance between the centroids, or centres of gravity, of the two clusters.

Unweighted pair-group average method: the distance between two clusters is the average distance between all pairs of objects, one from each cluster. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
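As an illustration (not from the slides), the single link, complete link and pair-group average distances between two clusters can all be read off the same p x n matrix of pairwise distances. A minimal sketch in NumPy, with an illustrative function name:

import numpy as np

def cluster_link_distances(A, B):
    # A is a p x d array of objects, B an n x d array of objects
    diffs = A[:, None, :] - B[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))   # p x n matrix of Euclidean distances
    # single link, complete link and pair-group average distances
    return d.min(), d.max(), d.mean()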

Distance between Clusters

Ward’s method: the distance between two clusters is the increase in the total within-cluster sum of squares caused by merging them, i.e. the within-cluster sum of squares of the merged cluster minus the total within-cluster sum of squares of the two clusters taken separately.
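These linkage criteria, including Ward’s method, are also available in SciPy’s hierarchical clustering routines. A small sketch, assuming some random data purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 4)                      # 20 objects with 4 attributes
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)              # build the agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)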

Single Link
Consider two clusters, each with 5 objects. The smallest distance needs to be computed.

[Figure: two clusters of points, labelled A and B.]
Complete Link
Consider two clusters, each with 5 objects. The largest distance needs to be computed.

[Figure: two clusters of points, labelled A and B.]
Centroid Link
Consider two clusters. The distance between the centroids needs to be computed.

[Figure: two clusters of points, labelled A and B, each with its centroid marked C.]
Pair-Group Average
Consider two clusters, each with 4 objects. All 16 pairwise distances need to be computed and their average found.

[Figure: two clusters of points, labelled A and B.]
Divisive Clustering

Opposite to Agglomerative. Less commonly used.

Start with one cluster containing all the objects and seek to split it into two clusters, which are themselves then split into even smaller clusters.

The split may be based on using one variable at a time or using all the variables together.

The algorithm terminates when a termination condition is met or when each cluster has only one object.
Divisive Clustering

There are two types of divisive methods.

The split may be based on using one variable at a time or using all the variables together.

Monothetic – split a cluster using only one variable at a time. How does one choose the variable?

Polythetic – split a cluster using all of the attributes together. How to allocate objects to clusters?

Example
Student Age Marks1 Marks2 Marks3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77

Algorithm
A typical (polythetic) divisive algorithm works like the
following:
1. Decide on a method of measuring the distance
between two objects and decide a threshold distance.
2. Create a distance matrix by computing distances
between all pairs of objects within the cluster. Sort
these distances in ascending order.
3. Find the two most dissimilar objects (i.e. the two
objects that have the largest distance between them).

Algorithm
4. If the distance is smaller than a pre-specified threshold
and there is no other cluster that needs to be divided
then stop.
5. Use the pair of objects identified in Step 3 as the seeds of a K-Means type algorithm to create two new clusters. Examine all the objects and place each object in the cluster whose seed is nearer to it.
6. If there is only one object in each cluster then stop; otherwise continue with Step 2. A sketch of one split step is given below.
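A rough sketch (not the author’s code) of one pass through Steps 2 to 5 for a single cluster, using the Manhattan distance as in the example that follows:

import numpy as np

def split_cluster(X):
    # Step 2: Manhattan distance matrix over all pairs of objects
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Step 3: the two most dissimilar objects become the seeds
    i, j = np.unravel_index(np.argmax(D), D.shape)
    # Step 5: place each object in the cluster whose seed is nearer
    first = D[:, i] <= D[:, j]
    return X[first], X[~first]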

Divisive Method
Two major issues that need resolving are:
• Which cluster to split next?
• How to split the cluster?

Which cluster to split?
Number of possibilities:
• Split the clusters in some sequential order
• Split the cluster that has the largest number of objects
• Split the cluster that has the largest variation within it.
The first two approaches are clearly very simple, but the third is better since splitting the cluster that has the most variation is a sound criterion, as sketched below.
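A small sketch (illustrative only) of the third option, choosing the cluster with the largest within-cluster variation as the next one to split:

import numpy as np

def next_cluster_to_split(clusters):
    # clusters is a list of 2-D arrays, one array of objects per cluster
    variations = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
    return int(np.argmax(variations))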

How to split a cluster?
A simple approach is to split the cluster based on the distances between the objects in the cluster, as outlined in the algorithm. A distance matrix is created and the two most dissimilar objects are selected as the seeds of two new clusters. A method like K-Means may then be used to split the cluster.

Example
We now consider the simple example of students’ marks and use the Manhattan distance between two objects, since it is simple to compute without a computer.

We first calculate the distance matrix.
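A minimal sketch (not part of the slides) of computing this distance matrix with NumPy and pandas; the variable names are assumptions.

import numpy as np
import pandas as pd

students = pd.DataFrame(
    {"Age":    [18, 18, 23, 20, 22, 19, 20, 21, 19, 47],
     "Marks1": [73, 79, 70, 55, 85, 91, 70, 53, 82, 75],
     "Marks2": [75, 85, 70, 55, 86, 90, 65, 56, 82, 76],
     "Marks3": [57, 75, 52, 55, 87, 89, 60, 59, 60, 77]},
    index=["S%d" % i for i in range(1, 11)])

X = students.to_numpy()
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # Manhattan distances
dist_matrix = pd.DataFrame(D, index=students.index, columns=students.index)
print(dist_matrix)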

Example
Student Age Marks1 Marks2 Marks3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77
Distance Matrix

[Table: 10 x 10 matrix of Manhattan distances between the students S1 to S10.]
Example
• The matrix gives the distance of each object from every other object.
• The largest distance is between S8 and S9. They become the seeds of two new clusters, and a K-Means style assignment is used to split the group into two clusters.

Example
• Distances of other objects from S8 and S9 are:

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S8 44 74 40 8 91 104 28 0 115 98
S9 20 22 36 60 37 46 30 115 0 99
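For example, the distance of S1 from S9 is |18 - 19| + |73 - 82| + |75 - 82| + |57 - 60| = 1 + 9 + 7 + 3 = 20, the value shown in the S9 row.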

The two clusters are:

C1: S8, S4, S7, S10
C2: S9, S1, S2, S3, S5, S6
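For instance, S10 is at distance 98 from S8 and 99 from S9, so it joins C1, while S5 is at distance 91 from S8 and only 37 from S9, so it joins C2.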

Next split
• We now decide which cluster to split next and then repeat the process.
• The two clusters are:
C1: S8, S4, S7, S10
C2: S9, S1, S2, S3, S5, S6
• We may want to split the larger cluster first.
• So we find the largest distance within C2 first. The information is available in the distance matrix given earlier.

Distances in C2

[Table: Manhattan distances between the objects in C2, taken from the full distance matrix.]
Next split
• In cluster C2 the largest distance is 86, between S3 and S6. C2 can be split with these as seeds.
• In cluster C1 (distance matrix on the next slide) the largest distance is 98, between S8 and S10. C1 can be split with these as seeds.

Distances in C1

[Table: Manhattan distances between the objects in C1, taken from the full distance matrix.]
Next

• We stop short of finishing the example.

• The result might look something like what is given on the next slide.

Sample Result

[Figure: dendrogram of the resulting hierarchy, with the ten students as its leaves.]

Hierarchical Clustering
• The ordering produced can be useful in gaining some
insight into the data.
• The major difficulty is that once an object is allocated to a cluster it cannot be moved to another cluster, even if the initial allocation was incorrect.
• Different distance metrics can produce different results.

Validating Clusters

Difficult problem since no objective metric exists.

Often a subjective measure must be used, e.g. does the result provide the user with insight into the data? The validity is therefore in the eye of the beholder.

Statistical testing may be possible – one could test the hypothesis that there are no clusters.

Test data may be available in which the clusters are known.
Summary
There is a bewildering array of clustering methods; they are based on diverse underlying principles and often lead to qualitatively different results.

Little work has been done on reasoning about clustering independently of any particular method. What basic properties ought a clustering to obey? Three such properties could be scale-invariance, richness and consistency.

