Data Mining
Data warehouses and OLAP (On-Line Analytical Processing)
Association Rules Mining
Clustering: Hierarchical and Partitional approaches
Classification: Decision Trees and Bayesian classifiers
Sequential Patterns Mining
Advanced topics: outlier detection, web mining
Optimization problem
Minimize the total squared error:
E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|²
where m_i is the mean of cluster C_i.
Strength
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
PAM works effectively for small data sets, but does not scale well to large data sets
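The O(tkn) behaviour above can be seen in a minimal sketch. This is an illustrative pure-Python k-means for 2-D points, not the slides' own code; the function name, data, and seed are ours:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to the nearest mean,
    recompute the means, and repeat for a fixed number of iterations."""
    random.seed(seed)
    means = random.sample(points, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance to each current mean
            j = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2 +
                                  (p[1] - means[j][1]) ** 2)
            clusters[j].append(p)
        for j, c in enumerate(clusters):
            if c:  # keep the old mean if a cluster emptied out
                means[j] = (sum(p[0] for p in c) / len(c),
                            sum(p[1] for p in c) / len(c))
    return means, clusters
```

Each iteration costs O(kn) distance computations, giving the O(tkn) total; the random initialization is why it may stop at a local optimum.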
[Figure: PAM swap cost on 2-D point plots (axes 0-10), showing the cost contribution C_jih of a non-medoid object j when medoid i is swapped with candidate h; in one of the cases C_jih = 0]
Hierarchical Clustering
[Figure: agglomerative (AGNES) vs. divisive (DIANA) on objects a, b, c, d, e. AGNES works bottom-up: Step 1 merges a and b into ab, then d and e into de, then cde, until Step 4 yields abcde; DIANA works top-down, splitting in the reverse order.]
Go on in a non-descending fashion: the dissimilarity of successive merges never decreases.
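The bottom-up (AGNES-style) process can be sketched in a few lines. The 1-D coordinates for a..e and the single-linkage distance are our own illustrative assumptions, chosen so the merge order matches the figure:

```python
def agglomerative(points, linkage=min):
    """Bottom-up (AGNES-style) clustering sketch: repeatedly merge the
    two closest clusters until one remains; returns the merge history."""
    clusters = [frozenset([name]) for name in points]
    history = []
    while len(clusters) > 1:
        # single-linkage distance between two clusters: closest pair
        def dist(a, b):
            return linkage(abs(points[x] - points[y]) for x in a for y in b)
        a, b = min(((a, b)
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda pair: dist(*pair))
        clusters.remove(a)
        clusters.remove(b)
        merged = a | b
        clusters.append(merged)
        history.append(merged)
    return history
```

With hypothetical positions a=1.0, b=1.5, c=5.0, d=8.0, e=8.5, the merges come out as ab, de, cde, abcde, the same sequence the figure shows.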
[Figure: the top-down (divisive) approach illustrated on 2-D scatter plots, axes 0-10]
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS = Σ_{i=1}^{N} X_i (linear sum of the points)
SS = Σ_{i=1}^{N} X_i² (square sum of the points)
Example, for the data points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
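The CF entries in the example can be checked directly; `cf` is a hypothetical helper name, not part of BIRCH itself:

```python
def cf(points):
    """Clustering Feature of a set of 2-D points: (N, LS, SS)."""
    n = len(points)
    ls = (sum(x for x, _ in points),          # linear sum, per coordinate
          sum(y for _, y in points))
    ss = (sum(x * x for x, _ in points),      # square sum, per coordinate
          sum(y * y for _, y in points))
    return n, ls, ss
```

CF vectors are additive: the CF of a merged cluster is the component-wise sum of the two CFs, which is what lets BIRCH maintain them incrementally in the CF tree.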
CF Tree (branching factor B = 7, leaf capacity L = 6)
[Figure: CF tree structure. The root and non-leaf nodes hold entries (CF_1, child_1), ..., (CF_i, child_i), at most B per node, where each CF_i summarizes the subtree under child_i; leaf nodes hold at most L CF entries and are linked into a chain via prev/next pointers.]
Drawbacks of Distance-Based Methods
Good only for clusters of convex shape, similar size and density, and only if k can be reasonably estimated
Eliminate outliers by random sampling
Example: with a sample of size s = 50 and p = 2 partitions, each partition holds s/p = 25 points and is partially clustered into s/pq = 5 clusters (so q = 5).
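The sampling arithmetic above can be sketched as follows; `partition_sample` and its argument names are our own illustrative choices, with q taken as the pre-clustering factor implied by s/pq = 5:

```python
def partition_sample(sample, p, q):
    """Split a random sample of size s into p partitions of s/p points
    each; each partition is then partially clustered into s/(p*q)
    clusters before the final clustering pass."""
    s = len(sample)
    size = s // p
    partitions = [sample[i * size:(i + 1) * size] for i in range(p)]
    clusters_per_partition = s // (p * q)
    return partitions, clusters_per_partition
```

For s = 50, p = 2, q = 5 this reproduces the numbers on the slide: two partitions of 25 points, each reduced to 5 partial clusters.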
Computational complexity: O(n² + n·m_m·m_a + n²·log n), where m_a is the average and m_m the maximum number of neighbors of a point
Basic ideas:
Similarity function and neighbors:
Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
Example: let T1 = {1,2,3}, T2 = {3,4,5}. Then
Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
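This similarity function is the Jaccard coefficient, a one-liner over Python sets; `neighbors` and the threshold in the comment are our illustrative additions, not part of the slides:

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions (sets of items):
    |T1 ∩ T2| / |T1 ∪ T2|."""
    return len(t1 & t2) / len(t1 | t2)

def neighbors(t, transactions, theta):
    """T's neighbors: the transactions whose similarity to T is at
    least the threshold theta (links in ROCK count common neighbors)."""
    return [u for u in transactions if sim(t, u) >= theta]
```

The slide's example falls out directly: sim({1,2,3}, {3,4,5}) = 1/5 = 0.2.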
ROCK: Algorithm
Draw a random sample
Cluster with links
Label the data on disk
It partitions each dimension into the same number of equal-length intervals
A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
Partition the data space and find the number of points that lie inside each cell of the partition
Identify clusters:
[Figure: CLIQUE example with density threshold τ = 3. Dense units are found in the (age, salary) and (age, vacation) planes; salary is in units of $10,000 (0-7), vacation in weeks (0-7), age ranges 20-60. Intersecting the dense regions identifies a cluster in the 3-D (age, salary, vacation) subspace.]
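The dense-unit step can be sketched as follows. `dense_units`, the assumed [0, 10) data range, and the example grid are illustrative assumptions; the threshold is a fraction of all points, as the text describes:

```python
from collections import Counter

def dense_units(points, intervals, tau, lo=0.0, hi=10.0):
    """Grid-based density sketch: partition each dimension into
    `intervals` equal-length intervals (data range [lo, hi) assumed),
    count points per cell, and keep the cells whose fraction of the
    total points exceeds the density threshold tau."""
    width = (hi - lo) / intervals
    def cell(p):
        # clamp boundary values into the last interval
        return tuple(min(int((v - lo) / width), intervals - 1) for v in p)
    counts = Counter(cell(p) for p in points)
    return {c for c, n in counts.items() if n / len(points) > tau}
```

Identifying clusters then amounts to joining adjacent dense cells, dimension by dimension.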
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
It is insensitive to the order of records in input and does not
presume some canonical data distribution
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be degraded, the price paid for the simplicity of the method
EM Algorithm
E-step, the probability that data point d_i belongs to cluster c_k:
P(d_i ∈ c_k) = w_k · Pr(d_i | c_k) / Σ_j w_j · Pr(d_i | c_j)
M-step, re-estimating the mean of cluster c_k over the m data points:
c_k = Σ_{i=1}^{m} d_i · P(d_i ∈ c_k) / Σ_{i=1}^{m} P(d_i ∈ c_k)
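The two updates can be sketched for a one-dimensional Gaussian mixture; `em_step`, the sample data, and the Gaussian form chosen for Pr(d_i | c_k) are illustrative assumptions, not specified by the slides:

```python
import math

def em_step(data, means, sigmas, weights):
    """One EM iteration for a 1-D Gaussian mixture.
    E-step: P(d_i in c_k) = w_k Pr(d_i|c_k) / sum_j w_j Pr(d_i|c_j)
    M-step: new mean of c_k = sum_i d_i P(d_i in c_k) / sum_i P(d_i in c_k)."""
    def gauss(x, m, s):
        return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    k = len(means)
    # E-step: membership probabilities (responsibilities) per data point
    resp = []
    for d in data:
        num = [weights[j] * gauss(d, means[j], sigmas[j]) for j in range(k)]
        z = sum(num)
        resp.append([n / z for n in num])
    # M-step: re-estimate the means and the mixture weights
    new_means = [sum(d * r[j] for d, r in zip(data, resp)) /
                 sum(r[j] for r in resp) for j in range(k)]
    new_weights = [sum(r[j] for r in resp) / len(data) for j in range(k)]
    return new_means, new_weights
```

Starting from rough means, one step already pulls each mean toward the center of the data it best explains; iterating to convergence gives the maximum-likelihood fit.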