
Clustering

What is clustering?
• A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups
• This is a classic case of unsupervised learning
[Figure: a good clustering minimizes intra-cluster distances and maximizes inter-cluster distances]
Outliers
• Outliers are objects that do not belong to any cluster or
form clusters of very small cardinality

[Figure: one large cluster plus a few isolated points labelled as outliers]

• In some applications we are interested in discovering outliers, not
  clusters (outlier analysis)
Why do we cluster?
• Clustering: given a collection of data objects, group them so that
  – objects are similar to one another within the same cluster
  – objects are dissimilar to the objects in other clusters

• Clustering results are used:


– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Applications of clustering?
• Image Processing
– cluster images based on their visual content
• Web
– Cluster groups of users based on their access
patterns on webpages
– Cluster webpages based on their content
Example – segmenting for
efficient marketing
• Attitude towards driving (survey statements):
  – I prefer leisurely driving
  – I like driving with an open top
  – My car must be environment-friendly
  – I like driving a car that stands out from others
  – I like driving fast
  – My car must reflect my personality
  – Modern technology makes driving easier
  – My car must be comfortable
  – My car must be safe
  – I depend on my own abilities
  – Electronics makes driving safer
• Clustering the respondents yields segments such as Speed Oriented,
  Comfort Oriented, and Safety Oriented
The clustering task
• Group observations so that the observations belonging to the same
  group are similar, whereas observations in different groups are
  different

• Basic questions:
  – What does “similar” mean?
  – What is a good partition of the objects? I.e., how is the quality
    of a solution measured?
  – How do we find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also
known as dimensions)
• Real-value attributes/variables
– e.g., salary, height

• Binary attributes
– e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes


– e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes
– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
• If all d dimensions are real-valued then we can visualize each data
  point as a point in d-dimensional space

• If all d dimensions are binary then we can think of each data point
  as a binary vector
Distance functions
• The distance d(x, y) between two objects x and y is a metric if
  – d(x, y) ≥ 0 (non-negativity)
  – d(x, x) = 0 (isolation)
  – d(x, y) = d(y, x) (symmetry)
  – d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)

• The definitions of distance functions are usually different for
  real, boolean, categorical, and ordinal variables

• Weights may be associated with different variables based on the
  application and data semantics
Data Structures
• Data matrix (n objects/tuples × d attributes/dimensions):

  \[
  \begin{pmatrix}
  x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\
  \vdots &        & \vdots &        & \vdots \\
  x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\
  \vdots &        & \vdots &        & \vdots \\
  x_{n1} & \cdots & x_{nf} & \cdots & x_{nd}
  \end{pmatrix}
  \]

• Distance matrix (n × n over the objects; symmetric, so only the
  lower triangle is shown):

  \[
  \begin{pmatrix}
  0      &        &        &        &   \\
  d(2,1) & 0      &        &        &   \\
  d(3,1) & d(3,2) & 0      &        &   \\
  \vdots & \vdots & \vdots & \ddots &   \\
  d(n,1) & d(n,2) & \cdots & \cdots & 0
  \end{pmatrix}
  \]
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:

  \[ JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \]

• Jaccard distance between binary vectors X and Y:

  \[ Jdist(X, Y) = 1 - JSim(X, Y) \]

• Example:
         Q1  Q2  Q3  Q4  Q5  Q6
    X     1   0   0   1   1   1
    Y     0   1   1   0   1   0

  – JSim = 1/6
  – Jdist = 5/6
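A minimal sketch, not from the slides, of how the Jaccard similarity and distance for binary vectors can be computed in plain Python; it reproduces the X/Y example above (the function names are illustrative):

```python
def jaccard_similarity(x, y):
    """|X intersect Y| / |X union Y| for equal-length binary 0/1 vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # |X ∩ Y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)  # |X ∪ Y|
    return both / either if either else 1.0  # convention for two all-zero vectors

def jaccard_distance(x, y):
    return 1.0 - jaccard_similarity(x, y)

X = [1, 0, 0, 1, 1, 1]   # Q1..Q6 from the example above
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_similarity(X, Y))  # 1/6 ≈ 0.167
print(jaccard_distance(X, Y))    # 5/6 ≈ 0.833
```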
Distance functions for real-valued vectors
• Lp norms, or Minkowski distance:

  \[
  L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p}
            = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
  \]

  where p is a positive integer

• If p = 1, L1 is the Manhattan (or city block) distance:

  \[
  L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|
  \]
Distance functions for real-valued vectors
• If p = 2, L2 is the Euclidean distance:

  \[
  d(x, y) = \sqrt{ |x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2 }
  \]

• One can also use weighted distances, e.g. a weighted Euclidean
  distance

  \[
  d(x, y) = \sqrt{ w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2 }
  \]

  or its unsquared (weighted L1) counterpart

  \[
  d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \cdots + w_d |x_d - y_d|
  \]
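A minimal sketch (assuming NumPy, which the slides do not mention) of the distance functions above: Minkowski (Lp), Manhattan (L1), Euclidean (L2), and a weighted Euclidean variant.

```python
import numpy as np

def minkowski(x, y, p):
    """Lp / Minkowski distance for a positive integer p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def manhattan(x, y):
    """L1 (city block) distance."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """L2 distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):
    """L2 distance with per-dimension weights w."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(manhattan(x, y))     # 3 + 2 + 0 = 5.0
print(euclidean(x, y))     # sqrt(9 + 4) ≈ 3.606
print(minkowski(x, y, 3))  # (27 + 8) ** (1/3) ≈ 3.271
print(weighted_euclidean(x, y, np.array([0.5, 1.0, 2.0])))  # sqrt(4.5 + 4) ≈ 2.915
```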
Algorithms: basic concept

• Construct a partition of a set of n objects into a set of k clusters

  – Hierarchical clustering
    • Single linkage
    • Complete linkage
    • Average linkage

  – Partitioning clustering
    • k-means
The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k centers {c1, c2, …, ck} in the d-dimensional
  space, forming clusters {C1, C2, …, Ck}, such that the overall
  (squared) distance of the points to their closest center is
  minimized
The k-means algorithm
• One way of solving the k-means problem:

  1. Randomly pick k cluster centers {c1, …, ck}
  2. For each i, set the cluster Ci to be the set of points in X that
     are closer to ci than they are to cj for all j ≠ i
  3. For each i, let ci be the center of cluster Ci (the mean of the
     vectors in Ci)
  4. Repeat steps 2–3 until convergence (no further reduction in the
     overall distance)
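A minimal sketch of the loop above (assuming NumPy; the helper name `kmeans` and the uniform random initialization are illustrative choices, not the only ones):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k cluster centers among the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs; the algorithm should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```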


k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
  – Clusters of different densities
  – Clusters of different sizes

• Outliers can also cause a problem (Example?)


Some alternatives to random
initialization of the central points
• Multiple runs
  – Helps, but probability is not on your side

• Select the initial set of centers by methods other than uniform
  random sampling, e.g. pick points that are far apart from each other
  as cluster centers; the k-means++ algorithm does this by choosing
  each new center with probability proportional to its squared
  distance from the centers already chosen (see the sketch below)
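A minimal sketch (assuming NumPy; the function name is illustrative) of k-means++-style seeding: the first center is picked at random, and each subsequent center is sampled with probability proportional to its squared distance from the centers chosen so far, so the seeds tend to be spread apart.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                     # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 4, 8)])
print(kmeanspp_init(X, k=3))  # typically one seed near each of the three blobs
```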
What is the right number of clusters?
• …or who sets the value of k?

• For n points to be clustered, consider the case where k = n. What is
  the value of the error function?

• What happens when k = 1?

• Since we want to minimize the error, why don’t we always select
  k = n? (See the sketch below.)
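A minimal sketch (assuming scikit-learn, which the slides do not mention) illustrating why k = n is not a useful answer: the within-cluster error keeps shrinking as k grows and reaches zero at k = n, so in practice one looks for the "elbow" where adding more clusters stops paying off.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])  # 3 true clusters

for k in (1, 2, 3, 4, 5, 10, len(X)):
    err = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k:2d}  within-cluster sum of squares = {err:.2f}")
# Largest error at k = 1, a sharp drop until k = 3, then a long flat
# tail; at k = n every point is its own center and the error is 0.
```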
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits
[Figure: nested clusters over points 1–6 and the corresponding
dendrogram, with merge heights on the vertical axis and leaves ordered
1 3 2 5 4 6]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained
by ‘cutting’ the dendrogram at the proper level

• Hierarchical clustering may correspond to meaningful taxonomies
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one
      cluster (or k clusters) is left

  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a
      single point (or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance
  matrix
  – Merge or split one cluster at a time
Agglomerative clustering algorithm

• Most popular hierarchical clustering technique

• Basic algorithm:
  1. Compute the distance matrix between the input data points
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains

• The key operation is the computation of the distance between two
  clusters
  – Different definitions of the distance between clusters lead to
    different algorithms (see the sketch below)
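A minimal sketch (assuming SciPy, which the slides do not mention) of the agglomerative algorithm above; the `method` argument selects how the inter-cluster distance is defined ('single', 'complete', 'average', …), which is exactly the choice discussed in the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])

Z = linkage(X, method='single')                  # the sequence of merges (the dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')  # 'cut' the dendrogram into 2 clusters
print(labels)
```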
Input/ Initial setting
• Start with clusters of individual points and a distance/proximity
  matrix

[Figure: the proximity matrix indexed by points p1, p2, p3, p4, p5, …,
and the points p1–p12 in the data space]
Intermediate State
• After some merging steps, we have some clusters C1, …, C5

[Figure: the proximity matrix indexed by clusters C1–C5, and the
current clusters drawn over the points p1–p12]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance
  matrix

[Figure: the proximity matrix indexed by C1–C5, with C2 and C5 about
to be merged]
After Merging
• “How do we update the distance matrix?”

[Figure: the proximity matrix after the merge, with a new row and
column for C2 ∪ C5 whose distances to C1, C3, and C4 still have to be
filled in]
Distance between two clusters
• Each cluster is a set of points

• How do we define the distance between two sets of points?
  – Lots of alternatives
  – Not an easy task
Distance between two clusters
• Single-link distance between clusters Ci and Cj
is the minimum distance between any object
in Ci and any object in Cj

• The distance is defined by the two most similar objects

  \[ D_{sl}(C_i, C_j) = \min_{x,y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\} \]
Single-link clustering: example
• Determined by one pair of points, i.e., by one
link in the proximity graph.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Single-link clustering: example

[Figure: nested clusters (left) and dendrogram (right) for single-link
clustering of points 1–6; leaves ordered 3 6 2 5 4 1]


Distance between two clusters
• Complete-link distance between clusters Ci
and Cj is the maximum distance between any
object in Ci and any object in Cj

• The distance is defined by the two most dissimilar objects

  \[ D_{cl}(C_i, C_j) = \max_{x,y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\} \]
Complete-link clustering: example
• Distance between clusters is determined by
the two most distant points in the different
clusters

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Complete-link clustering: example

[Figure: nested clusters (left) and dendrogram (right) for
complete-link clustering of points 1–6; leaves ordered 3 6 4 1 2 5]


Distance between two clusters
• Group average distance between clusters Ci
and Cj is the average distance between any
object in Ci and any object in Cj

Davg Ci , C j  
1
Ci  C j
 d ( x, y )
xCi , yC j
Average-link clustering: example
• Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Average-link clustering: example
[Figure: nested clusters (left) and dendrogram (right) for
average-link clustering of points 1–6; leaves ordered 3 6 4 1 2 5]


Average-link clustering
• Compromise between Single and Complete
Link

• Strengths
– Less susceptible to noise and outliers
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN (single link), MAX
(complete link), and Group Average, illustrating the different
groupings each linkage produces]
Practical considerations
• Too few or too many clusters?
• One large cluster and many small ones?
• Profiling clusters – compare with the complete sample
Mixed data distance computation
• Idea: use a distance measure between 0 and 1 for each variable
• Aggregate = the average distance over the variables
• Binary (asymmetric/symmetric), nominal: Jaccard distance, proportion
  of mismatches, etc.
• Interval-scaled: |xif – xjf| / Rf
  – xif: value for object i in variable f
  – Rf: range of variable f over all objects
• Ordinal: use normalized ranks, then treat like interval-scaled values
  based on the range
  – normalized rank: zif = (rif – 1) / (Mf – 1), where rif is the rank
    of object i on variable f and Mf is the number of ordered states
• A sketch of this aggregation is given below
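A minimal sketch (assuming NumPy; the function and type names are illustrative) of the aggregation above: each variable contributes a distance in [0, 1] and the result is the average over the variables.

```python
import numpy as np

def mixed_distance(xi, xj, types, ranges):
    per_var = []
    for f, t in enumerate(types):
        if t == 'nominal':                      # nominal / binary: mismatch counts as 1
            per_var.append(0.0 if xi[f] == xj[f] else 1.0)
        elif t == 'interval':                   # |xif - xjf| / Rf
            per_var.append(abs(xi[f] - xj[f]) / ranges[f])
        elif t == 'ordinal':                    # normalized rank, then like interval
            Mf = ranges[f]                      # number of ordered states
            zi = (xi[f] - 1) / (Mf - 1)
            zj = (xj[f] - 1) / (Mf - 1)
            per_var.append(abs(zi - zj))
    return float(np.mean(per_var))

# salary (interval, range 50000), gender (nominal), rank (ordinal, 4 levels)
x = (30000, 'M', 2)
y = (45000, 'F', 4)
print(mixed_distance(x, y, ('interval', 'nominal', 'ordinal'), (50000, None, 4)))
# (0.30 + 1.00 + 0.667) / 3 ≈ 0.656
```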
Thank you
