
Clustering

What is clustering?
• A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups
• This is a classic case of unsupervised learning
[Figure: a good clustering minimizes intra-cluster distances and maximizes inter-cluster distances]
Outliers
• Outliers are objects that do not belong to any cluster or
form clusters of very small cardinality

[Figure: one large cluster plus a few isolated points labelled as outliers]

• In some applications we are interested in discovering outliers, not
  clusters (outlier analysis)
Why do we cluster?
• Clustering: given a collection of data objects, group them so that
  – objects are similar to one another within the same cluster
  – objects are dissimilar to the objects in other clusters

• Clustering results are used:


– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Applications of clustering?
• Image Processing
– cluster images based on their visual content
• Web
– Cluster groups of users based on their access
patterns on webpages
– Cluster webpages based on their content
Example – segmenting for
efficient marketing
• Attitude towards driving (survey statements):
  – I prefer leisurely driving
  – I like driving with an open top
  – My car must be environment-friendly
  – I like driving a car that stands out from others
  – I like driving fast
  – My car must reflect my personality
  – Modern technology makes driving easier
  – My car must be comfortable
  – My car must be safe
  – I depend on my own abilities
  – Electronics makes driving safer
• Clustering the respondents yields segments such as Speed Oriented,
  Comfort Oriented, and Safety Oriented
The clustering task
• Group observations so that the observations belonging to the same
  group are similar, whereas observations in different groups are
  different

• Basic questions:
  – What does “similar” mean?
  – What is a good partition of the objects? I.e., how is the quality
    of a solution measured?
  – How do we find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also
known as dimensions)
• Real-value attributes/variables
– e.g., salary, height

• Binary attributes
– e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes


– e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes
– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
• If all d dimensions are real-valued then we can visualize each data
  point as a point in d-dimensional space

• If all d dimensions are binary then we can think of each data point
  as a binary vector
Distance functions
• The distance d(x, y) between two objects x and y is a metric if
  – d(x, y) ≥ 0 (non-negativity)
  – d(x, x) = 0 (isolation)
  – d(x, y) = d(y, x) (symmetry)
  – d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)

• The definitions of distance functions are usually different for
  real, boolean, categorical, and ordinal variables

• Weights may be associated with different variables based on the
  application and data semantics
Data Structures
• Data matrix (n objects/tuples × d attributes/dimensions):

  \[
  \begin{pmatrix}
  x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\
  \vdots &        & \vdots &        & \vdots \\
  x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\
  \vdots &        & \vdots &        & \vdots \\
  x_{n1} & \cdots & x_{nf} & \cdots & x_{nd}
  \end{pmatrix}
  \]

• Distance matrix (n × n over the objects; symmetric, so only the
  lower triangle is shown):

  \[
  \begin{pmatrix}
  0      &        &        &        &   \\
  d(2,1) & 0      &        &        &   \\
  d(3,1) & d(3,2) & 0      &        &   \\
  \vdots & \vdots & \vdots & \ddots &   \\
  d(n,1) & d(n,2) & \cdots & \cdots & 0
  \end{pmatrix}
  \]
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:

  \[ JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \]

• Jaccard distance between binary vectors X and Y:

  \[ Jdist(X, Y) = 1 - JSim(X, Y) \]

• Example:
         Q1  Q2  Q3  Q4  Q5  Q6
    X     1   0   0   1   1   1
    Y     0   1   1   0   1   0

  – JSim = 1/6
  – Jdist = 5/6
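A minimal sketch, not from the slides, of how the Jaccard similarity and distance for binary vectors can be computed in plain Python; it reproduces the X/Y example above (the function names are illustrative):

```python
def jaccard_similarity(x, y):
    """|X intersect Y| / |X union Y| for equal-length binary 0/1 vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # |X ∩ Y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)  # |X ∪ Y|
    return both / either if either else 1.0  # convention for two all-zero vectors

def jaccard_distance(x, y):
    return 1.0 - jaccard_similarity(x, y)

X = [1, 0, 0, 1, 1, 1]   # Q1..Q6 from the example above
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_similarity(X, Y))  # 1/6 ≈ 0.167
print(jaccard_distance(X, Y))    # 5/6 ≈ 0.833
```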
Distance functions for real-valued vectors
• Lp norms, or Minkowski distance:

  \[
  L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p}
            = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
  \]

  where p is a positive integer

• If p = 1, L1 is the Manhattan (or city block) distance:

  \[
  L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|
  \]
Distance functions for real-valued vectors
• If p = 2, L2 is the Euclidean distance:

  \[
  d(x, y) = \sqrt{ |x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2 }
  \]

• One can also use weighted distances, e.g. a weighted Euclidean
  distance

  \[
  d(x, y) = \sqrt{ w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2 }
  \]

  or its unsquared (weighted L1) counterpart

  \[
  d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \cdots + w_d |x_d - y_d|
  \]
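A minimal sketch (assuming NumPy, which the slides do not mention) of the distance functions above: Minkowski (Lp), Manhattan (L1), Euclidean (L2), and a weighted Euclidean variant.

```python
import numpy as np

def minkowski(x, y, p):
    """Lp / Minkowski distance for a positive integer p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def manhattan(x, y):
    """L1 (city block) distance."""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """L2 distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):
    """L2 distance with per-dimension weights w."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(manhattan(x, y))     # 3 + 2 + 0 = 5.0
print(euclidean(x, y))     # sqrt(9 + 4) ≈ 3.606
print(minkowski(x, y, 3))  # (27 + 8) ** (1/3) ≈ 3.271
print(weighted_euclidean(x, y, np.array([0.5, 1.0, 2.0])))  # sqrt(4.5 + 4) ≈ 2.915
```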
Algorithms: basic concept

• Construct a partition of a set of n objects into a set of k clusters

  – Hierarchical clustering
    • Single linkage
    • Complete linkage
    • Average linkage

  – Partitioning clustering
    • k-means
The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k centers {c1, c2, …, ck} in the d-dimensional
  space, forming clusters {C1, C2, …, Ck}, such that the overall
  (squared) distance of the points to their closest center is
  minimized
The k-means algorithm
• One way of solving the k-means problem:

  1. Randomly pick k cluster centers {c1, …, ck}
  2. For each i, set the cluster Ci to be the set of points in X that
     are closer to ci than they are to cj for all j ≠ i
  3. For each i, let ci be the center of cluster Ci (the mean of the
     vectors in Ci)
  4. Repeat steps 2–3 until convergence (no further reduction in the
     overall distance)
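A minimal sketch of the loop above (assuming NumPy; the helper name `kmeans` and the uniform random initialization are illustrative choices, not the only ones):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k cluster centers among the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs; the algorithm should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```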


k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
  – Clusters of different densities
  – Clusters of different sizes

• Outliers can also cause a problem (Example?)


Some alternatives to random
initialization of the central points
• Multiple runs
  – Helps, but probability is not on your side

• Select the initial set of centers by methods other than uniform
  random sampling, e.g. pick points that are far apart from each other
  as cluster centers; the k-means++ algorithm does this by choosing
  each new center with probability proportional to its squared
  distance from the centers already chosen (see the sketch below)
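A minimal sketch (assuming NumPy; the function name is illustrative) of k-means++-style seeding: the first center is picked at random, and each subsequent center is sampled with probability proportional to its squared distance from the centers chosen so far, so the seeds tend to be spread apart.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                     # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 4, 8)])
print(kmeanspp_init(X, k=3))  # typically one seed near each of the three blobs
```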
What is the right number of clusters?
• …or who sets the value of k?

• For n points to be clustered, consider the case where k = n. What is
  the value of the error function?

• What happens when k = 1?

• Since we want to minimize the error, why don’t we always select
  k = n? (See the sketch below.)
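A minimal sketch (assuming scikit-learn, which the slides do not mention) illustrating why k = n is not a useful answer: the within-cluster error keeps shrinking as k grows and reaches zero at k = n, so in practice one looks for the "elbow" where adding more clusters stops paying off.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])  # 3 true clusters

for k in (1, 2, 3, 4, 5, 10, len(X)):
    err = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k:2d}  within-cluster sum of squares = {err:.2f}")
# Largest error at k = 1, a sharp drop until k = 3, then a long flat
# tail; at k = n every point is its own center and the error is 0.
```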
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits
[Figure: nested clusters over points 1–6 and the corresponding
dendrogram, with merge heights on the vertical axis and leaves ordered
1 3 2 5 4 6]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained
by ‘cutting’ the dendrogram at the proper level

• Hierarchical clustering may correspond to meaningful taxonomies
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one
      cluster (or k clusters) is left

  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a
      single point (or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance
  matrix
  – Merge or split one cluster at a time
Agglomerative clustering algorithm

• Most popular hierarchical clustering technique

• Basic algorithm:
  1. Compute the distance matrix between the input data points
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains

• The key operation is the computation of the distance between two
  clusters
  – Different definitions of the distance between clusters lead to
    different algorithms (see the sketch below)
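A minimal sketch (assuming SciPy, which the slides do not mention) of the agglomerative algorithm above; the `method` argument selects how the inter-cluster distance is defined ('single', 'complete', 'average', …), which is exactly the choice discussed in the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])

Z = linkage(X, method='single')                  # the sequence of merges (the dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')  # 'cut' the dendrogram into 2 clusters
print(labels)
```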
Input/ Initial setting
• Start with clusters of individual points and a distance/proximity
  matrix

[Figure: the proximity matrix indexed by points p1, p2, p3, p4, p5, …,
and the points p1–p12 in the data space]
Intermediate State
• After some merging steps, we have some clusters C1, …, C5

[Figure: the proximity matrix indexed by clusters C1–C5, and the
current clusters drawn over the points p1–p12]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance
  matrix

[Figure: the proximity matrix indexed by C1–C5, with C2 and C5 about
to be merged]
After Merging
• “How do we update the distance matrix?”

[Figure: the proximity matrix after the merge, with a new row and
column for C2 ∪ C5 whose distances to C1, C3, and C4 still have to be
filled in]
Distance between two clusters
• Each cluster is a set of points

• How do we define the distance between two sets of points?
  – Lots of alternatives
  – Not an easy task
Distance between two clusters
• Single-link distance between clusters Ci and Cj
is the minimum distance between any object
in Ci and any object in Cj

• The distance is defined by the two most similar objects

  \[ D_{sl}(C_i, C_j) = \min_{x,y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\} \]
Single-link clustering: example
• Determined by one pair of points, i.e., by one
link in the proximity graph.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Single-link clustering: example

[Figure: nested clusters (left) and dendrogram (right) for single-link
clustering of points 1–6; leaves ordered 3 6 2 5 4 1]


Distance between two clusters
• Complete-link distance between clusters Ci
and Cj is the maximum distance between any
object in Ci and any object in Cj

• The distance is defined by the two most dissimilar objects

  \[ D_{cl}(C_i, C_j) = \max_{x,y} \{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\} \]
Complete-link clustering: example
• Distance between clusters is determined by
the two most distant points in the different
clusters

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Complete-link clustering: example

[Figure: nested clusters (left) and dendrogram (right) for
complete-link clustering of points 1–6; leaves ordered 3 6 4 1 2 5]


Distance between two clusters
• Group average distance between clusters Ci
and Cj is the average distance between any
object in Ci and any object in Cj

Davg Ci , C j  
1
Ci  C j
 d ( x, y )
xCi , yC j
Average-link clustering: example
• Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00
Average-link clustering: example
[Figure: nested clusters (left) and dendrogram (right) for
average-link clustering of points 1–6; leaves ordered 3 6 4 1 2 5]


Average-link clustering
• Compromise between Single and Complete
Link

• Strengths
– Less susceptible to noise and outliers
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN (single link), MAX
(complete link), and Group Average, illustrating the different
groupings each linkage produces]
Practical considerations
• Too few or too many clusters?
• One large cluster and many small ones?
• Profiling clusters – compare with the complete sample
Mixed data distance computation
• Idea: use a distance measure between 0 and 1 for each variable
• Aggregate = the average distance over the variables
• Binary (asymmetric/symmetric), nominal: Jaccard distance, proportion
  of mismatches, etc.
• Interval-scaled: |xif – xjf| / Rf
  – xif: value for object i in variable f
  – Rf: range of variable f over all objects
• Ordinal: use normalized ranks, then treat like interval-scaled values
  based on the range
  – normalized rank: zif = (rif – 1) / (Mf – 1), where rif is the rank
    of object i on variable f and Mf is the number of ordered states
• A sketch of this aggregation is given below
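A minimal sketch (assuming NumPy; the function and type names are illustrative) of the aggregation above: each variable contributes a distance in [0, 1] and the result is the average over the variables.

```python
import numpy as np

def mixed_distance(xi, xj, types, ranges):
    per_var = []
    for f, t in enumerate(types):
        if t == 'nominal':                      # nominal / binary: mismatch counts as 1
            per_var.append(0.0 if xi[f] == xj[f] else 1.0)
        elif t == 'interval':                   # |xif - xjf| / Rf
            per_var.append(abs(xi[f] - xj[f]) / ranges[f])
        elif t == 'ordinal':                    # normalized rank, then like interval
            Mf = ranges[f]                      # number of ordered states
            zi = (xi[f] - 1) / (Mf - 1)
            zj = (xj[f] - 1) / (Mf - 1)
            per_var.append(abs(zi - zj))
    return float(np.mean(per_var))

# salary (interval, range 50000), gender (nominal), rank (ordinal, 4 levels)
x = (30000, 'M', 2)
y = (45000, 'F', 4)
print(mixed_distance(x, y, ('interval', 'nominal', 'ordinal'), (50000, None, 4)))
# (0.30 + 1.00 + 0.667) / 3 ≈ 0.656
```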
Thank you
