Documente Academic
Documente Profesional
Documente Cultură
Cluster Analysis
Part 2
9/11/2019 ©GKGupta 2
Recap
• In our examples, the attributes are not independent.
There is strong correlation between students’ marks and
strong correlation between Olympic medals and per
capita income.
9/11/2019 ©GKGupta 3
K-Means
Iterative-improvement greedy algorithm. Random
selection of clusters to start which is then improved by
reassignment of objects until no improvement is possible.
9/11/2019 ©GKGupta 4
K-Means
Finds a local minimum.
9/11/2019 ©GKGupta 5
K-Means
Define within cluster variation = I
Define between clusters variation = E
9/11/2019 ©GKGupta 6
K-Means
Small I shows that the cluster is tight. Large I shows that
the cluster has much variation.
9/11/2019 ©GKGupta 7
Hierarchical Methods
Quite different from K-Means. No need to specify the
number of clusters or cluster seeds.
9/11/2019 ©GKGupta 8
Distance between Clusters
9/11/2019 ©GKGupta 9
Distance between Clusters
9/11/2019 ©GKGupta 10
Distance between Clusters
9/11/2019 ©GKGupta 11
Single Link
B
Consider
two clusters B B
Each with B
5 objects.
Need B
the
smallest
distance A
A
to be
computed. A
A
A
9/11/2019 ©GKGupta 12
Complete Link
B
Consider
two clusters B B
each with B
5 objects.
Need B
the
largest
distance A
A
to be
computed. A
A
A
9/11/2019 ©GKGupta 13
Centroid Link
B
Consider
two clusters. B C
B
Need B
the
distance B
between the
centroids
to be A
A
computed.
A C
A
A
9/11/2019 ©GKGupta 14
Pair-Group Average
Consider two clusters
each with B B
4 objects. B
Need 16
distances B
to be
computed
and average A
A
found.
A
A
9/11/2019 ©GKGupta 15
Divisive Clustering
Start with one cluster that has all the objects and seek
to split the cluster into two which themselves are then
split into even smaller clusters.
9/11/2019 ©GKGupta 17
Example
Student Age Marks1 Marks2 Marks3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77
9/11/2019 ©GKGupta 18
Algorithm
A typical (polythetic) divisive algorithm works like the
following:
1. Decide on a method of measuring the distance
between two objects and decide a threshold distance.
2. Create a distance matrix by computing distances
between all pairs of objects within the cluster. Sort
these distances in ascending order.
3. Find the two most dissimilar objects (i.e. the two
objects that have the largest distance between them).
9/11/2019 ©GKGupta 19
Algorithm
4. If the distance is smaller than a pre-specified threshold
and there is no other cluster that needs to be divided
then stop.
5. Use the pair of objects identified in Step 3 as seeds of
a K-means type algorithm to create two new clusters.
Examine all objects. Place each object in the cluster
which has the seed with a smaller distance.
6. If there is only one object in each cluster then stop
otherwise continue with Step 2.
9/11/2019 ©GKGupta 20
Divisive Method
Two major issues that need resolving are:
• Which cluster to split next?
• How to split the cluster?
9/11/2019 ©GKGupta 21
Which cluster to split?
Number of possibilities:
• Split the clusters in some sequential order
• Split the cluster that has the largest number of objects
• Split the cluster that has the largest variation within it.
The first two approaches are clearly very simple but the
third approach is better since it is based on splitting a
cluster that has the most variation which is a sound
criterion.
9/11/2019 ©GKGupta 22
How to split a cluster?
A simple approach for splitting a cluster is to split the
cluster based on distances between the objects in the
cluster as outlined in the algorithm.
A distance matrix is created and the two most dissimilar
objects are selected as seeds of two new clusters. A
method like the K-Means method may then be used to
split the cluster.
9/11/2019 ©GKGupta 23
Example
We now consider the simple example about students’
marks and use the distance between two objects as the
Manhattan distance since it is simple to compute without a
computer.
9/11/2019 ©GKGupta 24
Example
Student Age Marks1 Marks2 Marks3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77
9/11/2019 ©GKGupta 25
Distance Matrix
9/11/2019 ©GKGupta 26
Example
• The matrix gives distance of each object with every
other object.
• The largest distance is between S8 and S9. They
become the seeds of two new clusters. Use K-means to
split the group in two clusters.
9/11/2019 ©GKGupta 27
Example
• Distances of other objects from S8 and S9 are:
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
S8 44 74 40 8 91 104 28 0 115 98
S9 20 22 36 60 37 46 30 115 0 99
9/11/2019 ©GKGupta 28
Next split
• We now decide which cluster to split next and then
repeat the process.
• The two Clusters are:
C1 S8, S4, S7, S10
C2 S9, S1, S2, S3, S5, S6
• We may want to split the larger cluster first
• Find the largest distance in C2 first. The information is
available in the distance matrix given earlier.
9/11/2019 ©GKGupta 29
Distances in C2
9/11/2019 ©GKGupta 30
Next split
• In cluster C2 the largest distance is 86 between S3 and
S6. C1 can be split with these as seeds.
• In cluster C1 (distance matrix on next slide) the largest
distance is 98 between S8 and S10. C1 can be split with
these seeds.
9/11/2019 ©GKGupta 31
Distances in C1
9/11/2019 ©GKGupta 32
Next
9/11/2019 ©GKGupta 33
Sample Result
S1 S3 S7 S2 S9 S4 S8 S5 S6 S2 S9
9/11/2019 ©GKGupta 34
Hierarchical Clustering
• The ordering produced can be useful in gaining some
insight into the data.
• The major difficulty is that once an object is allocated to
a cluster it cannot be moved to another cluster even if
the initial allocation was incorrect
• Different distance metrics can produce different results
9/11/2019 ©GKGupta 35
Validating Clusters
9/11/2019 ©GKGupta 37