Data Mining: Concepts and Techniques
Chapter 8: Cluster Analysis
May 5, 2015
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Outlier Analysis
Summary
Pattern Recognition
Spatial Data Analysis
  Create thematic maps in GIS by clustering feature spaces
  Detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
Scalability
High dimensionality
Data Structures
Data matrix (two modes):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (one mode):
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
\]
Interval-scaled variables:
Binary variables:
Interval-valued variables
Standardize data:
  Calculate the mean: \( m_f = \tfrac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf}) \)
  Calculate the mean absolute deviation: \( s_f = \tfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|) \)
  Calculate the standardized measurement (z-score): \( z_{if} = \dfrac{x_{if} - m_f}{s_f} \)
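A minimal Python sketch of this standardization step (the function name `standardize` is illustrative, not from the text); it computes the mean, the mean absolute deviation, and the z-scores for one variable:

```python
def standardize(values):
    """Standardize one variable f: z_if = (x_if - m_f) / s_f, where m_f is the
    mean and s_f is the mean absolute deviation (more outlier-robust than the
    standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

standardize([2.0, 4.0, 6.0])  # m_f = 4, s_f = 4/3
```

The mean absolute deviation is preferred here because outliers inflate it less than they inflate the standard deviation, so outliers remain detectable after standardization.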
Some popular distance measures include the Minkowski distance:
\( d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} \)
where q is a positive integer.
If q = 1, d is the Manhattan distance:
\( d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \)
If q = 2, d is the Euclidean distance:
\( d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \)
Properties:
  \( d(i,j) \ge 0 \)
  \( d(i,i) = 0 \)
  \( d(i,j) = d(j,i) \)
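A small sketch of the Minkowski family in Python (the helper name `minkowski` is illustrative); q = 1 and q = 2 recover the Manhattan and Euclidean special cases:

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between points x and y.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

minkowski((0, 0), (3, 4), q=1)  # Manhattan: 7.0
minkowski((0, 0), (3, 4), q=2)  # Euclidean: 5.0
```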
Binary Variables
A contingency table for binary data (rows: object i, columns: object j):

             1        0        sum
      1      a        b        a + b
      0      c        d        c + d
      sum    a + c    b + d    p

Jaccard coefficient (when the binary variable is asymmetric):
\( d(i,j) = \dfrac{b + c}{a + b + c} \)
Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
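A sketch of the asymmetric binary dissimilarity applied to this table (function and variable names are illustrative); Y and P are coded 1, N is coded 0, and gender, a symmetric attribute, is left out:

```python
def asym_binary_dissim(i, j):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary variables;
    negative matches (d) are ignored as uninformative."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # positive matches
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 with Y/P -> 1 and N -> 0; gender omitted
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
asym_binary_dissim(jack, mary)  # (0 + 1) / (2 + 0 + 1) = 1/3
```

By this measure Jack and Mary are the most similar pair, and Jim and Mary the least similar.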
Nominal Variables
Ordinal Variables
Ratio-Scaled Variables
Methods:
Variables of Mixed Types
A weighted formula combines the effects of the different variable types:
\( d(i,j) = \dfrac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \)
If f is binary or nominal: \( d_{ij}^{(f)} = 0 \) if \( x_{if} = x_{jf} \), otherwise \( d_{ij}^{(f)} = 1 \)
If f is ordinal or ratio-scaled: compute ranks \( r_{if} \) and treat \( z_{if} = \dfrac{r_{if} - 1}{M_f - 1} \) as interval-scaled
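A simplified sketch of the weighted mixed-type formula (names illustrative), covering only the binary/nominal case and interval values already scaled to [0, 1]; the indicator δ is set to 0 when a value is missing:

```python
def mixed_dissim(i, j, types):
    """d(i, j) = sum_f delta_ij^(f) d_ij^(f) / sum_f delta_ij^(f) over
    mixed-type variables; only nominal and pre-scaled interval shown."""
    num = den = 0.0
    for x, y, t in zip(i, j, types):
        if x is None or y is None:   # delta = 0: value missing for i or j
            continue
        if t == "nominal":
            d = 0.0 if x == y else 1.0
        else:                        # interval, assumed scaled to [0, 1]
            d = abs(x - y)
        num += d
        den += 1.0
    return num / den

mixed_dissim(("red", 0.2), ("blue", 0.7), ("nominal", "interval"))  # 0.75
```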
Example
[figure: clustering iterations on a 2-D point set; axes 0–10]
Strength
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
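A minimal sketch of the k-means (Lloyd's) procedure these properties describe, assuming squared Euclidean distance and random initial centers (all names illustrative); each of the t iterations costs O(kn), giving the O(tkn) bound above:

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's k-means: assign each point to its nearest center, then
    recompute each center as the mean of its cluster; repeat to convergence
    (often a local optimum)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
               else centers[c]                 # keep center if cluster empty
               for c, cl in enumerate(clusters)]
        if new == centers:                     # assignments stabilized
            break
        centers = new
    return centers
```

On well-separated data this converges in a handful of iterations; restarting from several random initializations is a common guard against poor local optima.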
[figure: PAM (k-medoids) swap illustration — current medoid i and candidate h, with swapping cost Cjih = 0 in the case shown; axes 0–10]
Weakness:
Hierarchical Clustering

[figure: agglomerative (AGNES), Step 0 → Step 4: a and b merge into ab; d and e merge into de, then c joins to form cde, and finally abcde. Divisive (DIANA), Step 4 → Step 0, proceeds in the reverse direction.]
Go on in a non-descending fashion
[figure: merge steps on a 2-D point set; axes 0–10]
BIRCH (1996)
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: \( \sum_{i=1}^{N} X_i \) (linear sum of the points)
SS: \( \sum_{i=1}^{N} X_i^2 \) (square sum of the points)

CF = (5, (16, 30), (54, 190)) for the five points (3,4), (2,6), (4,5), (4,7), (3,8)
[figure: the five points plotted on 0–10 axes]
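The CF triple for a set of points can be computed directly; a short sketch (function name illustrative), reproducing the example above:

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, and per-dimension
    square sum. CFs are additive, so merging two sub-clusters just adds
    their CF entries component-wise."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))          # linear sum
    ss = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum
    return n, ls, ss

clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
# (5, (16, 30), (54, 190))
```

Additivity is the key property BIRCH exploits: a node's CF summarizes its whole subtree without storing the points themselves.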
CF Tree (branching factor B = 7, leaf capacity L = 6)

Root: entries CF1, CF2, CF3, …, CF6 with pointers child1, child2, child3, …, child6
Non-leaf node: entries CF1, CF2, CF3, …, CF5 with pointers child1, child2, child3, …, child5
Leaf nodes: one holding CF1, CF2, …, CF6 and one holding CF1, CF2, …, CF4, chained by prev/next pointers
Drawbacks of Distance-Based Methods
Eliminate outliers
By random sampling
Data partitioning: p = 2 partitions, s/p = 25 sample points per partition
[figure: sample points in x–y space split into two partitions and partially clustered]
\( Sim(T_1, T_2) = \dfrac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \)
Example: \( T_1 = \{1,2,3\} \), \( T_2 = \{3,4,5\} \):
\( Sim(T_1, T_2) = \dfrac{|\{3\}|}{|\{1,2,3,4,5\}|} = \dfrac{1}{5} = 0.2 \)
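The set-based similarity above is plain Jaccard similarity over transactions; a one-function sketch (name illustrative):

```python
def jaccard_sim(t1, t2):
    """Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| for transactions viewed as sets."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

jaccard_sim({1, 2, 3}, {3, 4, 5})  # |{3}| / |{1,2,3,4,5}| = 0.2
```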
ROCK: Algorithm
Draw a random sample
Cluster with links
Label data on disk
CHAMELEON
Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Density-Based Clustering
Methods
Clustering
based on density (local cluster criterion),
such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination
condition
Several interesting studies:
DBSCAN: Ester et al. (KDD'96)
OPTICS: Ankerst et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)
CLIQUE: Agrawal et al. (SIGMOD'98)
Density-Based Clustering: Background
Two parameters:
  Eps: maximum radius of the neighbourhood
  MinPts: minimum number of points in an Eps-neighbourhood of a point
\( N_{Eps}(q) = \{ p \mid dist(p, q) \le Eps \} \)
Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to \( N_{Eps}(q) \)
  2) core point condition: \( |N_{Eps}(q)| \ge MinPts \)
[figure: p in the Eps-neighbourhood of q; MinPts = 5, Eps = 1 cm]
Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
[figure: a chain through p1 linking q to p; a point o from which both p and q are density-reachable]
[figure: core point and its Eps-neighbourhood; Eps = 1 cm, MinPts = 5]
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Index-based:
  k = number of dimensions
  N = 20
  p = 75%
  M = N(1 − p) = 5
Complexity: O(kN²)
Core distance and reachability distance
[figure: core distance of object o; reachability-distances of p1 and p2 from o; MinPts = 5]
[figure: reachability plot — reachability-distance (undefined at cluster boundaries) plotted in the cluster order of the objects]
Major features
Example
Influence function:
\( f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} \)
Density function:
\( f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \)
Gradient:
\( \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \)
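The Gaussian influence and density functions translate directly into code; a minimal sketch (function names illustrative, `sigma` is the σ parameter):

```python
from math import exp, dist  # math.dist requires Python 3.8+

def gaussian_influence(x, y, sigma):
    """f_Gaussian(x, y) = exp(-d(x, y)^2 / (2 sigma^2))"""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """f^D_Gaussian(x): total influence of all data points at location x."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)
```

DENCLUE then follows the gradient of this density upward from each point to find its density attractor; nearby points are negligible only when they are many σ away.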
Density Attractor
STING: A Statistical
Information Grid Approach (2)
STING: A Statistical
Information Grid Approach (3)
WaveCluster (1998)
Input parameters:
WaveCluster (1998)
Quantization
Transformation
WaveCluster (1998)
Identify clusters:
Density threshold = 3
[figure: CLIQUE example — dense units in the Salary (×10,000, 0–7) vs. age (20–60) plane and the Vacation (weeks, 0–7) vs. age plane; their intersection identifies a candidate cluster in the 3-D (age, salary, vacation) subspace]
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist
in those subspaces
It is insensitive to the order of records in input and
does not presume some canonical data distribution
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be
degraded at the expense of simplicity of the method
Model-Based Clustering
Methods
Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
Conceptual clustering
COBWEB (Fisher87)
COBWEB Clustering
Method
A classification tree
More on Statistical-Based
Clustering
Limitations of COBWEB
The assumption that the attributes are
independent of each other is often too strong
because correlation may exist
Not suitable for clustering large databases: skewed tree and expensive probability distribution computation
CLASSIT
an extension of COBWEB for incremental clustering
of continuous data
suffers similar problems as COBWEB
AutoClass (Cheeseman and Stutz, 1996)
Uses Bayesian statistical analysis to estimate the
number of clusters
Popular in industry
Other Model-Based
Clustering Methods
Outlier Discovery: Statistical Approaches
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
Summary