Clustering: Partitioning Methods: Data Mining and Text Mining (UIC 583 at Politecnico di Milano)

Clustering: Partitioning Methods
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi

How Do Partitioning Methods Work?

Given n objects and k clusters, find a partition into k clusters that minimizes a given score

Each of the k clusters is usually identified by its centroid $C_m$, where $m \in \{1, \dots, k\}$ is the cluster identifier

Sum of squares is a rather typical score for partitioning methods:

$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

where $K_m$ is the set of objects $t_{mi}$ assigned to cluster $m$

The global optimum can be found by exhaustively enumerating all partitions, which is feasible only for very small n

In practice, heuristic methods are used (k-means and k-medoids)


K-means Example (step 1)

[Figure: data points on X/Y axes with cluster centers C1, C2, C3]
Pick 3 initial cluster centers (randomly)

K-means Example (step 2)

Assign each point to the closest cluster center

K-means Example (step 3)

Move each cluster center to the mean of each cluster

K-means Example (step 4)

Reassign points that are now closest to a different cluster center

K-means Example (step 5)

Re-compute cluster means

K-means Example (step 6)

Re-compute cluster means

K-means Clustering

Works with numeric data

Pick a number (K) of cluster centers (at random)
Assign every item to its nearest cluster center
(e.g. using Euclidean distance)

Move each cluster center to the mean of its assigned items

Repeat assignment and moving steps until convergence
(change in cluster assignments less than a threshold)
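
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the toy data are assumptions made for the example, and convergence is taken to mean "no assignment changed" (a threshold of zero).

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Pick K distinct data points at random as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every item to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Convergence: stop when no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of its assigned items.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On data this well separated the two groups are recovered whichever points are drawn as initial centers; on harder data the result depends on the random start.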


Evaluating K-means Clusters

Most common measure is the sum of squared error (SSE)

For each point, the error is the distance to the nearest cluster center
To get SSE, we square these errors and sum them:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$

Given two clusterings, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
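
As a small worked example of the SSE definition, the sketch below scores two clusterings of the same four points, using each cluster's mean as its representative point. The helper names `sse` and `centroid` and the data are assumptions chosen for illustration.

```python
import math


def sse(clusters, representatives):
    """Sum, over all clusters, of squared distances from each point to its representative."""
    return sum(
        math.dist(m_i, x) ** 2
        for c_i, m_i in zip(clusters, representatives)
        for x in c_i
    )


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Clustering A groups the points that lie close together.
good = [points[:2], points[2:]]
# Clustering B pairs each point with a far-away one.
poor = [[points[0], points[2]], [points[1], points[3]]]

sse_good = sse(good, [centroid(c) for c in good])  # 4.0
sse_poor = sse(poor, [centroid(c) for c in poor])  # 100.0

# With K equal to the number of points, every point is its own centroid.
sse_k4 = sse([[p] for p in points], points)  # 0.0
```

The last line shows why a low SSE alone is not evidence of a good clustering: raising K always offers a way to push SSE down.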


Two different K-means Clusterings

[Figure: three scatter plots of the same data (axes x, y): Original Points, Optimal Clustering, Sub-optimal Clustering]

Importance of Choosing the Initial Centroids

[Figure: one run of K-means on the same points (axes x, y), shown at Iterations 1 to 6]

[Figure: a second run from different initial centroids, shown at Iterations 1 to 5]

Why Is Selecting the Best Initial Centroids Difficult?

If there are K real clusters, then the chance of selecting one centroid from each cluster is small
Chance is relatively small when K is large
If clusters are the same size, n, then

$$P = \frac{\text{number of ways to select one centroid from each cluster}}{\text{number of ways to select } K \text{ centroids}} = \frac{K!\, n^K}{(Kn)^K} = \frac{K!}{K^K}$$

For example, if K = 10, then probability = 10!/10^10 = 0.00036

Sometimes the initial centroids will readjust themselves in the right way, and sometimes they don't
Consider an example of five pairs of clusters
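
The K = 10 figure can be checked directly. The snippet evaluates the ratio K!/K^K that the example uses; the function name is an assumption for illustration.

```python
import math


def prob_one_centroid_per_cluster(k):
    """K!/K^K: chance that K random centroids fall one in each of K equal-size clusters."""
    return math.factorial(k) / k**k


p2 = prob_one_centroid_per_cluster(2)    # 0.5
p10 = prob_one_centroid_per_cluster(10)  # about 0.00036
```

Even at K = 2 only half of the random starts place one centroid in each cluster, and the chance falls off quickly as K grows.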


10 Clusters Example

[Figure: K-means on five pairs of clusters, shown at Iterations 1 to 4]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example

[Figure: K-means on the same data, shown at Iterations 1 to 4]
Starting with some pairs of clusters having three initial centroids, while others have only one

Dealing with the Initial Centroids Issue

Multiple runs help, but probability is not on your side

Sample and use another clustering method (hierarchical?) to determine initial centroids

Select more than k initial centroids and then select among these initial centroids

Postprocessing
Bisecting K-means, which is not as susceptible to initialization issues
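
The first option, multiple runs, might look like the sketch below: repeat K-means from different random starts and keep the run with the lowest SSE. The function names, the `runs` and `seed` parameters, and the data are assumptions for illustration; the K-means inside is a deliberately compact version.

```python
import math
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from random initial centers; returns (sse, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to its cluster mean; an empty cluster keeps its center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return sse, centers


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-means with different random starts and keep the lowest-SSE result."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[0])


# Hypothetical toy data: three small groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
best_sse, best_centers = best_of_runs(points, k=3, runs=10, seed=0)
```

Restarts can only help as far as some run happens to start well, which is the sense in which probability is not on your side when K is large.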


Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)
Each assignment updates zero or two centroids
More expensive
Introduces an order dependency
Never get an empty cluster
Can use weights to change the impact
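
A sketch of the incremental approach follows, assuming running-mean updates. When a point stays in its cluster no centroid changes; when it moves, exactly the source and destination centroids change. The function names and the guard against emptying a cluster are assumptions of this sketch.

```python
import math


def move_point(x, src, dst, centroids, sizes):
    """Move x from cluster src to dst, updating exactly those two centroids."""
    n_src, n_dst = sizes[src], sizes[dst]
    # Remove x from the source mean: (n*m - x) / (n - 1).
    centroids[src] = tuple(
        (n_src * m - xi) / (n_src - 1) for m, xi in zip(centroids[src], x)
    )
    # Add x to the destination mean: (n*m + x) / (n + 1).
    centroids[dst] = tuple(
        (n_dst * m + xi) / (n_dst + 1) for m, xi in zip(centroids[dst], x)
    )
    sizes[src], sizes[dst] = n_src - 1, n_dst + 1


def incremental_pass(points, labels, centroids, sizes):
    """One pass over the points; centroids change right after each reassignment."""
    moved = 0
    for i, x in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))
        # Guard (an assumption of this sketch): never move a cluster's last member.
        if nearest != labels[i] and sizes[labels[i]] > 1:
            move_point(x, labels[i], nearest, centroids, sizes)
            labels[i] = nearest
            moved += 1
    return moved


# Hypothetical starting state: the point (9, 0) sits in the wrong cluster.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 0, 1]
centroids = [(10.0 / 3.0, 0.0), (10.0, 0.0)]
sizes = [3, 1]
moved = incremental_pass(points, labels, centroids, sizes)
```

Because each move changes the centroids seen by the points that follow, visiting the points in a different order can give a different result, which is the order dependency noted above.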


Pre-processing and Post-processing

Pre-processing
Normalize the data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent outliers
Split loose clusters, i.e., clusters with relatively high SSE
Merge clusters that are close and
that have relatively low SSE
These steps can be used during the clustering process
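
As one possible way to normalize the data, the sketch below rescales every attribute to zero mean and unit standard deviation, so that an attribute measured on a large scale does not dominate the Euclidean distance. The choice of z-score scaling, the function name, and the data are assumptions; the slide does not say which normalization is meant.

```python
import statistics


def zscore_columns(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (zero spread) is mapped to 0.0.
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical data: the second attribute is about 1000 times larger than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
normalized = zscore_columns(rows)
```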


Bisecting K-means

Variant of K-means that can produce
a partitional or a hierarchical clustering


Bisecting K-means Example


Limitations of K-means

K-means has problems when clusters have
Differing sizes
Differing densities
Non-globular shapes
K-means also has problems when the data contains outliers


Limitations of K-means: Differing Sizes

[Figure: Original Points next to K-means (3 Clusters)]


Limitations of K-means: Differing Density

[Figure: Original Points]



Limitations of K-means: Non-globular Shapes

[Figure: Original Points]



Overcoming K-means Limitations

[Figure: Original Points next to K-means Clusters]

One solution is to use many clusters. K-means then finds parts of clusters, which need to be put together afterwards.

[Figures: two further examples, each showing Original Points next to K-means Clusters]

K-Means Clustering Summary

Strength
Relatively efficient
Often terminates at a local optimum
The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes


K-Means Clustering Summary

Advantages
Simple, understandable
Items automatically assigned to clusters
Disadvantages
Must pick the number of clusters beforehand
All items forced into a cluster
Too sensitive to outliers


Variations of the K-Means Method

A few variants of k-means differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
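
The two ingredients of k-modes can be sketched as follows. The dissimilarity used here is simple matching (count of attributes that differ), which is an assumption since the slide does not name a specific measure; the mode update takes the most frequent value of each attribute. Function names and data are illustrative.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(cluster):
    """Most frequent value of each attribute across the cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


# Hypothetical cluster of categorical objects (color, size, shape).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)  # ("red", "small", "round")
distance = matching_dissimilarity(("blue", "large", "round"), mode)  # 2
```

These two functions take the place of the Euclidean distance and the mean in the K-means loop; the assignment and update steps are otherwise unchanged.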


K-Medoids Clustering

Instead of mean, use medians of each cluster

Mean of 1, 3, 5, 7, 9 is 5
Mean of 1, 3, 5, 7, 1009 is 205
Median of 1, 3, 5, 7, 1009 is 5
Median is not affected by extreme values
For large databases, use sampling
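
The arithmetic above can be reproduced with the standard library; the variable names are assumptions for illustration.

```python
import statistics

values = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

mean_plain = statistics.mean(values)            # 5
mean_outlier = statistics.mean(with_outlier)    # 205
median_outlier = statistics.median(with_outlier)  # 5
```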


K-Medoids Clustering

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets
CLARA (Kaufman & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): Randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
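
The swap idea behind PAM can be sketched as below. This is a simplified illustration, not the published algorithm: it accepts any swap that lowers the total distance as soon as it finds one, and the function names and data are assumptions.

```python
import math


def total_distance(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def pam(points, initial_medoids):
    """Swap medoids with non-medoids while a swap lowers the total distance."""
    medoids = list(initial_medoids)
    best = total_distance(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i by this non-medoid.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_distance(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, best


# Hypothetical toy data; both initial medoids start in the same group.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
medoids, cost = pam(points, [(1.0, 1.0), (2.0, 1.0)])
```

Each trial swap rescans all points, and every medoid is tried against every non-medoid, which gives a feel for why this style of search does not scale well to large data sets.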

