$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
Pick 3 initial cluster centers (randomly)
[Figure: data points in the X-Y plane with cluster centers C1, C2, C3]
Prof. Pier Luca Lanzi
Assign each point to the closest cluster center
Move each cluster center to the mean of each cluster
Reassign points closest to a different new cluster center
Re-compute cluster means
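The steps walked through above (pick centers, assign points, move centers, repeat) can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means sketch; names and defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Pick k distinct data points as the initial cluster centers (randomly).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point is reassigned to a different center.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each cluster center to the mean of its cluster.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, labels

# Usage on two small, well-separated groups of points.
centers, labels = kmeans(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], k=2
)
```

The loop ends either when the assignment stops changing or after `max_iter` passes; the result depends on the random initial centers.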
K-means Clustering
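$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^2$$
where $C_i$ is the $i$-th cluster and $m_i$ is its mean.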
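The sum of squared errors (SSE) that K-means tries to reduce can be computed directly from a set of points, their cluster centers, and the cluster assignment. The sketch below assumes squared Euclidean distance; the function name `sse` and its argument layout are hypothetical.

```python
import numpy as np

def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] lines up each point with the center of its own cluster.
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())

# Four points on a line, two clusters, every point exactly 1 away from its center.
total = sse([[0, 0], [2, 0], [10, 0], [12, 0]],
            centers=[[1, 0], [11, 0]],
            labels=[0, 0, 1, 1])
# total == 4.0
```

For a fixed number of clusters, a lower SSE indicates a tighter clustering, which gives one way to compare two runs of the algorithm on the same data.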
[Figure: x-y scatter plots titled Original Points, Optimal Clustering, and Sub-optimal Clustering]
[Figure: a K-means run shown panel by panel as x-y scatter plots; surviving panel titles: Iteration 2, Iteration 3, Iteration 4, Iteration 5, Iteration 6]
[Figure: a second K-means run shown panel by panel as x-y scatter plots; surviving panel titles: Iteration 2, Iteration 3, Iteration 4, Iteration 5]
10 Clusters Example
[Figure: x-y scatter plots for Iteration 1 through Iteration 4]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: x-y scatter plots for Iteration 1 through Iteration 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Postprocessing
Bisecting K-means, not as susceptible to initialization issues
Pre-processing
Normalize the data
Eliminate outliers
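As one possible reading of "Normalize the data", the sketch below standardizes each feature to zero mean and unit standard deviation (z-scores), so that no single feature dominates the distance computation. The slide does not say which normalization is intended; z-scoring and the function name `zscore` are assumptions made for illustration.

```python
import numpy as np

def zscore(points):
    """Scale every column to zero mean and unit standard deviation."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    std = points.std(axis=0)
    # Assumption: a constant feature is left at zero instead of dividing by zero.
    std[std == 0] = 1.0
    return (points - mean) / std

# The second feature is 100 times larger than the first before scaling.
scaled = zscore([[1, 100], [2, 200], [3, 300]])
```

After scaling, both columns in the example contribute equally to Euclidean distances.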
Post-processing
Eliminate small clusters that may represent outliers
Split loose clusters, i.e., clusters with relatively high SSE
Merge clusters that are close and that have relatively low SSE
These steps can be used during the clustering process
Bisecting K-means
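A common formulation of bisecting K-means starts with all points in one cluster and repeatedly splits one cluster in two with basic K-means (K = 2) until the desired number of clusters is reached. The sketch below is one such variant, not necessarily the one presented in the lecture: it assumes the cluster with the highest SSE is split next, that each split is tried a few times and the lowest-SSE split is kept, and that the data contain at least `k` distinct points. All function and parameter names are hypothetical.

```python
import numpy as np

def two_means(points, rng, max_iter=50):
    """Split one cluster in two with basic K-means (K = 2); returns 0/1 labels."""
    centers = points[rng.choice(len(points), size=2, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for it in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(2):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    if len(points) == 0:
        return 0.0
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    clusters = [np.arange(len(points))]  # one cluster holding every point index
    while len(clusters) < k:
        # Assumption: split the cluster with the highest SSE.
        worst = max(range(len(clusters)),
                    key=lambda i: cluster_sse(points[clusters[i]]))
        idx = clusters.pop(worst)
        # Try several 2-way splits and keep the one with the lowest total SSE.
        best_split, best_cost = None, np.inf
        for _ in range(trials):
            labels = two_means(points[idx], rng)
            cost = sum(cluster_sse(points[idx][labels == j]) for j in range(2))
            if cost < best_cost:
                best_split, best_cost = labels, cost
        clusters.append(idx[best_split == 0])
        clusters.append(idx[best_split == 1])
    return clusters  # list of index arrays, one per cluster

groups = bisecting_kmeans(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
     [20, 0], [20, 1], [21, 0]], k=3
)
```

Because every split only has to place two centers, a poor random start affects one bisection at a time, which is one way to read the remark that the method is less susceptible to initialization issues.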
Limitations of K-means
[Figures: six side-by-side comparisons of Original Points with the K-means result; two are titled K-means (3 Clusters), one K-means (2 Clusters), and three K-means Clusters]
Strength
Relatively efficient
Often terminates at a local optimum
The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weakness
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Advantages
Simple, understandable
Items automatically assigned to clusters
Disadvantages
Must pick the number of clusters beforehand
All items forced into a cluster
Too sensitive to outliers
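The sensitivity to outliers comes from the mean itself: one extreme value can drag a cluster center far from the bulk of the points. The small illustration below compares the mean with a medoid, taken here to be the data point with the smallest total distance to all other points. The numbers and the helper name `medoid` are made up for the example.

```python
import numpy as np

def medoid(values):
    """Return the data point with the smallest total distance to all the others."""
    values = np.asarray(values, dtype=float)
    total = np.abs(values[:, None] - values[None, :]).sum(axis=1)
    return float(values[total.argmin()])

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(clean, 100.0)

mean_before = float(clean.mean())        # 3.0
mean_after = float(with_outlier.mean())  # 115 / 6, about 19.17
medoid_before = medoid(clean)            # 3.0
medoid_after = medoid(with_outlier)      # 3.0
```

A single outlier moves the mean from 3.0 to about 19.17, while the medoid stays at an actual data point near the center of the group.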
K-Medoids Clustering