What is clustering?
• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• This is the classic case of unsupervised learning
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality

[Figure: a cluster of points with a few outliers marked]
• Basic questions:
– What does “similar” mean?
– What is a good partition of the objects? I.e., how is the quality of a solution measured?
– How do we find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also known as dimensions)
• Real-valued attributes/variables
– e.g., salary, height
• Binary attributes
– e.g., gender (M/F), has_cancer (T/F)
• Ordinal/ranked attributes
– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
• If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space
• Data matrix (n tuples/objects × d attributes):

$$\begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{nd}
\end{pmatrix}$$
• Distance matrix over the n objects (symmetric, so only the lower triangle is shown):

$$\begin{pmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{pmatrix}$$
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

• Jaccard distance between binary vectors X and Y:

$$Jdist(X, Y) = 1 - JSim(X, Y)$$

• Example:

      Q1 Q2 Q3 Q4 Q5 Q6
  X    1  0  0  1  1  1
  Y    0  1  1  0  1  0

  JSim(X, Y) = 1/6, Jdist(X, Y) = 5/6
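The example above can be checked with a few lines of Python (function names are illustrative):

```python
def jaccard_sim(x, y):
    """Jaccard similarity of two equal-length binary (0/1) vectors."""
    inter = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return inter / union if union else 1.0  # two all-zero vectors: treat as identical

def jaccard_dist(x, y):
    return 1.0 - jaccard_sim(x, y)

# The X/Y example from the slide
X = [1, 0, 0, 1, 1, 1]
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_sim(X, Y))   # 1/6: only Q5 is 1 in both, all six questions are 1 in at least one
print(jaccard_dist(X, Y))  # 5/6
```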
Distance functions for real-valued vectors
• Lp norms or Minkowski distance:

$$L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p} = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

• If p = 1, L1 is the Manhattan (city-block) distance:

$$L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$$
Distance functions for real-valued vectors
• If p = 2, L2 is the Euclidean distance:

$$d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2}$$

• Weighted Euclidean distance (with a weight w_i for each attribute):

$$d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2}$$
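These distances can be sketched in a few lines of Python (function names are illustrative):

```python
def minkowski(x, y, p=2):
    """L_p (Minkowski) distance between two real-valued vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def weighted_euclidean(x, y, w):
    """Euclidean distance with a weight w_i per attribute."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)) ** 0.5

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, p=1))              # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, p=2))              # Euclidean: sqrt(9 + 16) = 5.0
print(weighted_euclidean(x, y, [1, 1]))  # unit weights reduce to plain Euclidean: 5.0
```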
Algorithms: basic concept
Partitioning Clustering
K-means
The k-means problem
• Given a set X of n points in a d-dimensional
space and an integer k
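The standard heuristic for this problem is Lloyd's algorithm, which alternates assignment and centroid-update steps. A minimal pure-Python sketch (the random initialization and all names are illustrative, not the slide's own):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # illustrative init: k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```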
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• Divisive approach:
– Start with one, all-inclusive cluster
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)
[Figure: starting situation of agglomerative clustering — each point p1, p2, …, p12 is its own cluster, with the corresponding distance/proximity matrix]
Intermediate State
• After some merging steps, we have some clusters

[Figure: clusters C1–C5 over points p1…p12, with their distance/proximity matrix]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance matrix.

[Figure: clusters C1–C5 before the merge, with their distance/proximity matrix]
After Merging
• “How do we update the distance matrix?”

[Figure: distance/proximity matrix after merging — the row and column for the new cluster C2 ∪ C5 (its distances to C1, C3, and C4) are marked “?”]
Distance between two clusters
• Each cluster is a set of points
• Example similarity matrix for five items I1–I5:

      I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00
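Under single link (MIN), the proximity of two clusters is the similarity of their closest pair, and agglomerative clustering repeatedly merges the most similar pair of clusters. A pure-Python sketch on the similarity matrix above (function and variable names are illustrative):

```python
def single_link_merges(sim):
    """Agglomerative single-link clustering on a similarity matrix.

    sim: dict mapping frozenset({i, j}) -> similarity.
    Returns the merge sequence as (cluster_a, cluster_b, similarity) tuples.
    """
    items = sorted({i for pair in sim for i in pair})
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Single-link similarity of two clusters = max over cross pairs.
        link = lambda a, b: max(sim[frozenset({x, y})] for x in a for y in b)
        a, b = max(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda ab: link(*ab))
        merges.append((set(a), set(b), link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Similarity matrix from the slide (items 1..5 stand for I1..I5)
pairs = {(1, 2): 0.90, (1, 3): 0.10, (1, 4): 0.65, (1, 5): 0.20,
         (2, 3): 0.70, (2, 4): 0.60, (2, 5): 0.50,
         (3, 4): 0.40, (3, 5): 0.30, (4, 5): 0.80}
sim = {frozenset(k): v for k, v in pairs.items()}
for m in single_link_merges(sim):
    print(m)  # merges {1,2} first (0.90), then {4,5} (0.80), then I3 joins {1,2} (0.70)
```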
Single-link clustering: example

[Figure: single-link clustering of points 1–6 and the resulting dendrogram; leaves ordered 3, 6, 2, 5, 4, 1, with merge heights between about 0.05 and 0.2]
Complete-link clustering: example

[Figure: complete-link clustering of points 1–6 and the resulting dendrogram; leaves ordered 3, 6, 4, 1, 2, 5, with merge heights up to about 0.4]
Average-link clustering: example
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$D_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$$
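As a quick check of the average-link formula, the distance between two small clusters can be computed directly. In this sketch (names are illustrative) distances are derived from the slide's similarity matrix as d = 1 − sim:

```python
def d_avg(ci, cj, dist):
    """Average-link distance: mean pairwise distance across the two clusters."""
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

# Distances between members of {I1, I2} and {I4, I5}, as 1 - similarity
sim = {(1, 2): 0.90, (1, 4): 0.65, (1, 5): 0.20, (2, 4): 0.60, (2, 5): 0.50}
dist = lambda x, y: 1.0 - sim[tuple(sorted((x, y)))]
print(d_avg({1, 2}, {4, 5}, dist))  # (0.35 + 0.80 + 0.40 + 0.50) / 4 = 0.5125
```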
[Figure: average-link clustering of points 1–6 and the resulting dendrogram; leaves ordered 3, 6, 4, 1, 2, 5, with merge heights up to about 0.25]
• Strengths
– Less susceptible to noise and outliers
Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN (single link), MAX (complete link), and Group Average, showing the different cluster shapes each criterion produces]
Practical considerations
• Too few or too many clusters?
• One large cluster and many small ones?
• Profiling clusters – compare with the complete sample
Mixed data distance computation
• Idea: use a distance measure between 0 and 1 for each variable
• Aggregate = average distance over the variables
• Binary (asymmetric/symmetric), nominal: Jaccard distance, proportion of mismatches, etc.
• Interval-scaled: |x_if − x_jf| / R_f
– x_if: value for object i in variable f
– R_f: range of variable f over all objects
• Ordinal: use normalized ranks, then treat like interval-scaled based on the range
– normalized rank: z_if = (r_if − 1) / (M_f − 1)
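This aggregate mixed-type distance (a Gower-style average) can be sketched as follows; the variable-type encoding and all names are illustrative, and ordinal variables are assumed already converted to normalized ranks:

```python
def mixed_distance(obj_i, obj_j, var_types, ranges):
    """Average per-variable distance in [0, 1] over mixed-type variables.

    var_types[f] is 'binary', 'nominal', or 'interval'; normalized ordinal
    ranks are treated as interval-scaled with range 1.
    ranges[f] is the range R_f of variable f (interval variables only).
    """
    total = 0.0
    for f, t in enumerate(var_types):
        a, b = obj_i[f], obj_j[f]
        if t in ("binary", "nominal"):
            total += 0.0 if a == b else 1.0   # simple mismatch contribution
        else:                                  # interval-scaled: |x_if - x_jf| / R_f
            total += abs(a - b) / ranges[f]
    return total / len(var_types)

# Two objects described by (gender, salary, military rank as normalized rank)
types = ["binary", "interval", "interval"]
ranges = [None, 50000.0, 1.0]
print(mixed_distance(("M", 30000.0, 0.5), ("F", 40000.0, 1.0), types, ranges))
# (1 + 10000/50000 + 0.5/1) / 3 = 1.7 / 3
```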
Thank you