Example of Squared Error of a Cluster

The squared-error criterion: E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|², where m_i is the mean of cluster C_i.
K-means

Initialization:
- Arbitrarily choose k objects as the initial cluster centers (centroids)
Iterate until no change:
- For each object Oi, calculate the distances between Oi and the k centroids, and assign Oi to the cluster with the nearest centroid
- Recompute each centroid as the mean of its cluster
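The loop above can be sketched in plain Python. This is a minimal illustration (not the lecture's code), assuming 2-D points and squared Euclidean distance:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch on 2-D points."""
    rng = random.Random(seed)
    # Initialization: arbitrarily choose k objects as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:  # no change: stop
            break
        centroids = new
    return centroids, clusters
```

On two well-separated groups of three points each, the procedure converges to one centroid per group after a couple of iterations.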
k-Means Clustering Method

[Figure: k-means iterations on 2-D points plotted on 10×10 axes. First, objects are assigned to the cluster of the nearest current mean; then the means are relocated and the objects are reassigned to the new clusters, repeating until no change.]
Variations of the k-Means Method

Aspects in which variants of k-means differ:
- Selection of the initial k centroids (e.g., choose the k farthest points)
- Dissimilarity calculation (e.g., use Manhattan distance)
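The "k farthest points" initialization can be sketched as follows; this is an illustrative greedy version (the function name is mine): start from one object and repeatedly add the object farthest from the centroids chosen so far.

```python
def farthest_point_init(points, k):
    """Pick k initial centroids: start from the first object, then repeatedly
    add the object whose distance to its nearest chosen centroid is largest."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = [points[0]]
    while len(centroids) < k:
        # choose the point farthest from all centroids selected so far
        centroids.append(max(points, key=lambda p: min(d2(p, c) for c in centroids)))
    return centroids
```

This tends to spread the initial centroids across the data, which usually speeds up convergence compared with a purely random choice.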
Strengths of the k-Means Method

- Relatively efficient for large data sets: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t ≪ n
Weaknesses of the k-Means Method

- Applicable only when the mean is defined; for categorical data, the k-modes algorithm replaces means with modes
- Not suited to discovering clusters of arbitrary shape; density-based algorithms address this
k-modes Algorithm

Handling categorical data: k-modes (Huang'98)
- Replaces the means of clusters with modes
- Given n records in a cluster, the mode is the record made up of the most frequent attribute values
- Uses new dissimilarity measures to deal with categorical objects

Example cluster:

  age      income   student   credit_rating
  <=30     high     no        fair
  <=30     high     no        excellent
  31...40  high     no        fair
  >40      medium   no        fair
  >40      low      yes       fair
  >40      low      yes       excellent
  31...40  low      yes       excellent
  <=30     medium   no        fair
  <=30     low      yes       fair
  >40      medium   yes       fair
  <=30     medium   yes       excellent
  31...40  medium   no        excellent
  31...40  high     yes       fair

In the example cluster, mode = (<=30, medium, yes, fair)
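The mode of the example cluster and a simple-matching dissimilarity (count of mismatching attributes, one common choice for categorical objects) can be computed directly; this sketch uses the cluster from the slide:

```python
from collections import Counter

# The example cluster from the slide: (age, income, student, credit_rating).
cluster = [
    ("<=30", "high", "no", "fair"),
    ("<=30", "high", "no", "excellent"),
    ("31...40", "high", "no", "fair"),
    (">40", "medium", "no", "fair"),
    (">40", "low", "yes", "fair"),
    (">40", "low", "yes", "excellent"),
    ("31...40", "low", "yes", "excellent"),
    ("<=30", "medium", "no", "fair"),
    ("<=30", "low", "yes", "fair"),
    (">40", "medium", "yes", "fair"),
    ("<=30", "medium", "yes", "excellent"),
    ("31...40", "medium", "no", "excellent"),
    ("31...40", "high", "yes", "fair"),
]

def mode(records):
    """Mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def simple_matching(a, b):
    """Dissimilarity for categorical objects: number of mismatching attributes."""
    return sum(x != y for x, y in zip(a, b))
```

Running `mode(cluster)` reproduces the slide's result, (<=30, medium, yes, fair).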
A Problem of k-means

Sensitive to outliers:
- Outlier: an object with extremely large (or small) values
- May substantially distort the distribution of the data

[Figure: two cluster centers (+) dragged toward an outlier.]
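A tiny 1-D example shows the effect: a single outlier pulls the mean far from the bulk of the data, while the medoid (the actual object minimizing total distance to the others) stays inside it. The data values here are mine, chosen for illustration:

```python
data = [1, 2, 3, 4, 100]  # 100 is the outlier

# The mean is pulled far from the bulk of the data by the outlier.
mean = sum(data) / len(data)  # 110 / 5 = 22.0

# The medoid must be one of the objects: pick the one minimizing total distance.
medoid = min(data, key=lambda m: sum(abs(x - m) for x in data))  # 3
```

The mean lands at 22.0, far from the four small values, whereas the medoid is 3, a representative object of the main group.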
k-Medoids Clustering Method

[Figure: the same 2-D data set clustered by k-means (left) and k-medoids (right); a medoid is an actual object of the cluster, so it is less affected by outliers.]
PAM (Partitioning Around Medoids) (1987)

Uses the absolute-error criterion E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − o_i|, where o_i is the medoid of cluster C_i.

For each pair of a medoid m and a non-medoid object h:
- Compute E_h − E_m, the change in total cost if m is swapped with h
- Negative: swapping brings benefit
- Choose the swap with the minimum swapping cost
Four Swapping Cases

When a medoid m is to be swapped with a non-medoid object h, check each of the other non-medoid objects j:
- Case 1: j is in the cluster of m, and after the swap j is closest to another medoid k: reassign j to k
- Case 2: j is in the cluster of m, and after the swap j is closest to h: reassign j to h
- Case 3: j is in the cluster of another medoid k and stays closest to k: no change
- Case 4: j is in the cluster of another medoid k but becomes closer to h: reassign j to h
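The swap evaluation can be sketched directly from the definition: the cost of a swap is the total absolute error after the swap minus the error before it. A minimal 1-D sketch (function names are mine, distances are absolute differences):

```python
def total_cost(points, medoids):
    """Absolute-error criterion: each object contributes its distance
    to the nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def swap_cost(points, medoids, m, h):
    """TC_mh = E(after swapping medoid m with non-medoid h) - E(before).
    Negative means the swap brings benefit."""
    after = [x for x in medoids if x != m] + [h]
    return total_cost(points, after) - total_cost(points, medoids)
```

For example, with points [1, 2, 3, 8, 9, 10, 25] and medoids {2, 25}, swapping 25 for 9 has cost 20 − 23 = −3, so the swap is beneficial.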
PAM Clustering: total swapping cost TC_mh = Σ_j C_jmh

[Figure: 2-D panels on 10×10 axes illustrating the contribution C_jmh of an object j in each swapping case, showing the medoid m, another medoid k, the candidate h, and the reassignment of j; Case 1 and Case 3 are labeled.]
Strength and Weakness of PAM

- Strength: more robust than k-means in the presence of outliers, because a medoid is an actual object and is less influenced by extreme values
- Weakness: works well for small data sets but does not scale: each iteration examines all medoid/non-medoid swaps, costing O(k(n − k)²)
CLARA (Clustering Large Applications) (1990)

CLARA applies PAM to several random samples of the data set and returns the best set of medoids found.
CLARA - Algorithm

Set mincost to MAXIMUM;
Repeat q times                       // draw q samples
    Create S by drawing s objects randomly from D;
    Generate the set of medoids M from S by applying the PAM algorithm;
    Compute cost(M, D);
    If cost(M, D) < mincost
        mincost = cost(M, D);
        bestset = M;
    Endif;
Endrepeat;
Return bestset;
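The pseudocode translates almost line for line into Python. This sketch (names and the simplistic greedy PAM are mine, and distances are 1-D absolute differences for brevity) runs PAM on q samples and keeps the medoid set that is cheapest on the full data set:

```python
import random

def cost(medoids, data):
    """Absolute error of a medoid set over the data."""
    return sum(min(abs(p - m) for m in medoids) for p in data)

def pam(sample, k, rng):
    """Minimal PAM: start from k random medoids, repeatedly apply a
    beneficial medoid/non-medoid swap until no swap improves the cost."""
    medoids = rng.sample(sample, k)
    best = cost(medoids, sample)
    improved = True
    while improved:
        improved = False
        for m in medoids:
            for h in sample:
                if h in medoids:
                    continue
                cand = [x for x in medoids if x != m] + [h]
                c = cost(cand, sample)
                if c < best:
                    best, medoids, improved = c, cand, True
                    break
            if improved:
                break
    return medoids

def clara(data, k, q=5, s=40, seed=0):
    """CLARA: PAM on q random samples of size s; keep the medoid set
    that is cheapest on the FULL data set."""
    rng = random.Random(seed)
    mincost, bestset = float("inf"), None
    for _ in range(q):
        sample_ = rng.sample(data, min(s, len(data)))
        m = pam(sample_, k, rng)
        c = cost(m, data)
        if c < mincost:
            mincost, bestset = c, m
    return bestset, mincost
```

Note that the medoids are chosen from a sample, but the final cost is always evaluated against the whole data set D, exactly as in the pseudocode.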
Complexity of CLARA

Set mincost to MAXIMUM;                                  // O(1)
Repeat q times                                           // O((s−k)²·k + (n−k)·k) per iteration
    Create S by drawing s objects randomly from D;       // O(1)
    Generate the set of medoids M from S
    by applying the PAM algorithm;                       // O((s−k)²·k)
    Compute cost(M, D);                                  // O((n−k)·k)
    If cost(M, D) < mincost                              // O(1)
        mincost = cost(M, D);
        bestset = M;
    Endif;
Endrepeat;
Return bestset;
Strengths and Weaknesses of CLARA

Strength:
- Handles larger data sets than PAM (e.g., 1,000 objects in 10 clusters)

Weaknesses:
- Efficiency depends on the sample size
- A good clustering of the samples will not necessarily represent a good clustering of the whole data set if the samples are biased
CLARANS ("Randomized" CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- CLARANS draws a sample of the solution space dynamically
- A solution is a set of k medoids
- The solution space contains C(n, k) solutions in total (n choose k)
- The solution space can be represented by a graph where every node is a potential solution, i.e., a set of k medoids
Graph Abstraction

- Every node is a potential solution (a set of k medoids)
- Every node is associated with a squared error
- Two nodes are adjacent if they differ by exactly one medoid
- Every node has k(n − k) adjacent nodes: each neighbor of {O_1, O_2, …, O_k} replaces one medoid with one of the n − k non-medoid objects, e.g., {O_{k+1}, O_2, …, O_k}, …
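Both counts, C(n, k) solutions in total and k(n − k) neighbors per node, are easy to verify by brute-force enumeration on a tiny instance; the numbers n = 6, k = 2 below are just for illustration:

```python
from itertools import combinations
from math import comb

n, k = 6, 2
objects = list(range(n))
node = set(objects[:k])  # one concrete solution: the first k objects

# Neighbors are the k-subsets sharing exactly k-1 objects with the node,
# i.e., solutions that differ from it by exactly one medoid.
neighbors = [s for s in combinations(objects, k) if len(node - set(s)) == 1]

solution_space_size = comb(n, k)   # C(6, 2) = 15 solutions in total
neighbor_count = len(neighbors)    # k * (n - k) = 2 * 4 = 8 neighbors
```

Enumerating all 2-subsets of 6 objects gives 15 solutions, of which exactly 8 = k(n − k) are adjacent to any fixed node.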
CLARANS

[Figure: from a current node C, at most maxneighbor randomly chosen neighbors N are compared with C; a cheaper neighbor becomes the new current node, otherwise C is declared a local minimum. The search restarts numlocal times, and the best node over all local minima is returned.]
CLARANS - Algorithm

Set mincost to MAXIMUM;
For i = 1 to h do                    // find h local optima
    Randomly select a node as the current node C in the graph;
    J = 1;                           // counter of examined neighbors
    Repeat
        Randomly select a neighbor N of C;
        If Cost(N, D) < Cost(C, D)
            Assign N as the current node C;
            J = 1;
        Else J++;
        Endif;
    Until J > m                      // m = maxneighbor
    Update mincost (and the best node) with Cost(C, D) if applicable;
Endfor;
Return bestnode;
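The randomized search can be sketched compactly. This illustration (parameter names follow the figure, 1-D absolute distances and the data are my simplifications) restarts numlocal times and, at each step, tries at most maxneighbor random neighbors before declaring a local minimum:

```python
import random

def cost(medoids, data):
    """Absolute error of a solution (a set of k medoids) on 1-D data."""
    return sum(min(abs(p - m) for m in medoids) for p in data)

def clarans(data, k, numlocal=3, maxneighbor=20, seed=0):
    """Randomized search over the graph of k-medoid solutions."""
    rng = random.Random(seed)
    mincost, bestnode = float("inf"), None
    for _ in range(numlocal):
        current = rng.sample(data, k)        # random starting node
        j = 1
        while j <= maxneighbor:
            # A random neighbor differs from current in exactly one medoid.
            out = rng.choice(current)
            inn = rng.choice([p for p in data if p not in current])
            neighbor = [x for x in current if x != out] + [inn]
            if cost(neighbor, data) < cost(current, data):
                current, j = neighbor, 1     # move and reset the counter
            else:
                j += 1
        if cost(current, data) < mincost:    # keep the best local minimum
            mincost, bestnode = cost(current, data), current
    return bestnode, mincost
```

Unlike PAM, which examines all k(n − k) neighbors at every step, this search inspects only a bounded random sample of them, which is what makes CLARANS scale to larger data sets.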
Graph Abstraction (k-means, k-modes, k-medoids)

- Each vertex is a set of k representative objects (means, modes, or medoids)
- Each iteration produces a new set of k representative objects with lower overall dissimilarity
- The iterations correspond to a hill-descent process in a landscape (graph) of vertices
Comparison with PAM

PAM can be viewed as a search for a minimum in the same graph (landscape):
- At each step, all adjacent vertices are examined; the one with the deepest descent is chosen as the next set of k medoids
- The search continues until a minimum is reached
- For large n and k (e.g., n = 1,000, k = 10), examining all k(n − k) adjacent vertices is time-consuming, so PAM is inefficient for large data sets

CLARANS vs. PAM:
- For large and medium data sets, CLARANS is clearly much more efficient than PAM
- For small data sets, CLARANS also outperforms PAM significantly
When n = 80, CLARANS is 5 times faster than PAM, while the cluster quality is the same.
Comparison with CLARA

CLARANS vs. CLARA:
- CLARANS is always able to find clusterings of better quality than those found by CLARA, although CLARANS may use much more time than CLARA
- When the time used is the same, CLARANS is still better than CLARA
Hierarchies of Co-expressed Genes and Coherent Patterns

The interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge.
A Subtle Situation

[Figure: a group A and its subgroups A1 and A2.]