Specific Readings
Textbook:
Chapter 1
Chapter 2: 2.1-2.4
Chapter 3: 3.1-3.4
Chapter 4: 4.1.1-4.1.2, 4.2.1
Chapter 5
Chapter 6: 6.1-6.5, 6.9.1, 6.12, 6.13, 6.14
Chapter 7: 7.1-7.4
Papers:
Apriori Paper: R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules. VLDB 1994
MaxMiner Paper: R. J. Bayardo Jr.: Efficiently Mining Long Patterns from Databases. SIGMOD 1998
SLIQ Paper: M. Mehta, R. Agrawal, J. Rissanen: SLIQ: A Fast Scalable Classifier for Data Mining. EDBT 1996
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Clustering High-Dimensional Data
Clustering high-dimensional data
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measures become almost meaningless: in high dimensions, points grow nearly equidistant (see the sketch after this list)
Clusters may exist only in some subspaces
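To see the equi-distance effect directly, here is a small numpy experiment (sample size and dimensions are arbitrary choices): the relative gap between the nearest and farthest point from a query shrinks as the dimension grows.

```python
import numpy as np

# As dimensionality d grows, points drawn uniformly in the unit cube
# become nearly equidistant from a query point, so distance-based
# cluster structure fades away.
rng = np.random.default_rng(0)
n = 1000
for d in (2, 10, 100, 1000):
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast (max-min)/min = {contrast:.3f}")
```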
Methods
Feature transformation: only effective if most dimensions are relevant
PCA and SVD are useful only when features are highly correlated/redundant (a sketch follows this list)
Feature selection: wrapper or filter approaches
useful for finding a subspace in which the data form clear clusters
Subspace clustering: find clusters in all possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering
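As a concrete illustration of the feature-transformation route above, a minimal scikit-learn sketch (the synthetic data, dimensions, and parameters are arbitrary choices):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Two clusters living in a 2-D latent space, embedded in 50 correlated
# dimensions: PCA recovers a low-dimensional subspace where the clusters
# separate, which is exactly when feature transformation helps.
rng = np.random.default_rng(1)
latent = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0.0, 5.0)])
X = latent @ rng.normal(size=(2, 50)) + rng.normal(scale=0.1, size=(200, 50))

X_low = PCA(n_components=2).fit_transform(X)   # project onto top 2 components
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_low)
```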
The Curse of Dimensionality
(graphs adapted from Parsons et al., KDD Explorations 2004)
[Figure: scatter plots of age vs. vacation time (weeks); clusters that are hidden in the full space emerge in lower-dimensional subspaces]
Strength and Weakness of CLIQUE
Strength
automatically finds the subspaces of highest dimensionality in which high-density clusters exist
insensitive to the order of the input records, and does not presume any canonical data distribution
scales linearly with the size of the input, and scales well as the number of dimensions in the data increases
Weakness
the accuracy of the clustering result may suffer because of the simplicity of the method
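For intuition, a toy sketch in the spirit of CLIQUE's bottom-up dense-unit search (the grid resolution xi, the density threshold tau, the function name, and stopping at 2-D units are all simplifications here, not the published algorithm):

```python
import numpy as np
from itertools import combinations
from collections import Counter

def dense_units(X, xi=10, tau=0.05):
    """Find dense grid units in 1-D subspaces, then join them
    Apriori-style into candidate 2-D dense units."""
    n, d = X.shape
    # Partition each dimension into xi equal-width intervals.
    spans = np.ptp(X, axis=0) + 1e-12
    cells = np.clip(((X - X.min(axis=0)) / spans * xi).astype(int), 0, xi - 1)
    # 1-D dense units: (dimension, cell) pairs covering >= tau of all points.
    one_d = {(j, c) for j in range(d)
             for c, cnt in Counter(cells[:, j].tolist()).items()
             if cnt >= tau * n}
    # Join step: a 2-D unit can only be dense if both 1-D projections are.
    two_d = set()
    for (j1, c1), (j2, c2) in combinations(sorted(one_d), 2):
        if j1 != j2:
            inside = (cells[:, j1] == c1) & (cells[:, j2] == c2)
            if inside.sum() >= tau * n:
                two_d.add(((j1, c1), (j2, c2)))
    return one_d, two_d
```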
Frequent Pattern-Based Approach
Typical methods
Frequent-term-based document clustering
Clustering by pattern similarity in micro-array data (p-Clustering)
Clustering by Pattern Similarity (p-Clustering)
Right: the raw micro-array data shows 3 genes and their values in a multi-dimensional space; it is difficult to spot their patterns there.
Bottom: some subsets of the dimensions form clean shift and scaling patterns.
Why p-Clustering?
Microarray data analysis may need to
cluster on thousands of dimensions (attributes)
discover both shift and scaling patterns
Clustering with a Euclidean distance measure? It cannot find shift patterns.
Clustering on the derived attributes $A_{ij} = a_i - a_j$? That introduces N(N-1) dimensions.
Bi-cluster using the mean squared residue score of a sub-matrix $(I, J)$:

$$H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} \big( d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} \big)^2$$

where

$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} d_{ij}$$
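A small numpy sketch of this score (the function name and the example sub-matrix are illustrative); a pure shift pattern scores exactly 0:

```python
import numpy as np

def mean_squared_residue(D, rows, cols):
    """Mean squared residue H(I, J) of the sub-matrix D[rows][:, cols];
    0 means a perfect shift pattern across the chosen rows and columns."""
    S = D[np.ix_(rows, cols)]
    d_iJ = S.mean(axis=1, keepdims=True)   # row means over J
    d_Ij = S.mean(axis=0, keepdims=True)   # column means over I
    d_IJ = S.mean()                        # grand mean
    return float(((S - d_iJ - d_Ij + d_IJ) ** 2).mean())

# Rows are shifts of one another, so this sub-matrix is an ideal bicluster.
D = np.array([[1.0, 4.0, 2.0],
              [3.0, 6.0, 4.0],
              [0.0, 3.0, 1.0]])
print(mean_squared_residue(D, [0, 1, 2], [0, 1, 2]))   # -> 0.0
```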
[Figure: x-y scatter plots of the partitions produced by K-means and complete link on random data]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels (a sketch follows this list).
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information, i.e., using only the data itself.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
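For aspect 2, a common external index is the adjusted Rand index; below is a minimal scikit-learn sketch (the two label vectors are made-up examples):

```python
from sklearn.metrics import adjusted_rand_score

# Compare a clustering against externally given class labels.
# ARI is 1 for identical partitions and close to 0 for random agreement.
true_labels    = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # known classes
cluster_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # hypothetical clustering output
print(adjusted_rand_score(true_labels, cluster_labels))
```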
[Figure: well-separated clusters in the x-y plane and the corresponding point-by-point similarity matrix]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering and its similarity matrix with points sorted by cluster label]
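The block structure in these matrix views comes from ordering the points by cluster label; a minimal sketch of that trick (the function name is illustrative, and X and labels are assumed to come from any clustering run, e.g., DBSCAN above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_sorted_similarity(X, labels):
    """Show the pairwise-similarity matrix with points grouped by cluster
    label; well-separated clusters appear as bright diagonal blocks."""
    order = np.argsort(labels)
    dist = squareform(pdist(X[order]))      # pairwise Euclidean distances
    sim = 1.0 - dist / dist.max()           # rescale to a [0, 1] similarity
    plt.imshow(sim, cmap="viridis")
    plt.xlabel("Points"); plt.ylabel("Points")
    plt.colorbar(label="Similarity")
    plt.show()
```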
Using Similarity Matrix for Cluster Validation
[Figure: K-means clustering and its similarity matrix with points sorted by cluster label]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: complete-link clustering of random data and its similarity matrix with points sorted by cluster label]
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (labeled 1-7) on a more complicated data set and the corresponding similarity matrix]
Internal Measures: SSE
Clusters in more complicated figures are not well separated.
Internal index: used to measure the goodness of a clustering structure without respect to external information, e.g., SSE.
SSE is good for comparing two clusterings or two clusters (average SSE).
It can also be used to estimate the number of clusters; the elbow sketch after the figure below shows how.
[Figure: two-dimensional data set with well-separated clusters, and its SSE vs. K curve for K = 2 to 30]
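A sketch of the elbow heuristic behind such a curve; scikit-learn's KMeans exposes the SSE to the nearest centroids as inertia_, and the synthetic blob data below is an arbitrary example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Run k-means over a range of K and record the SSE (inertia_).
# A pronounced "elbow" in SSE vs. K suggests the natural cluster count.
X, _ = make_blobs(n_samples=500, centers=10, random_state=0)
for k in range(2, 16):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k:2d}  SSE={sse:.1f}")
```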
Internal Measures: SSE
SSE curve for a more complicated data set
[Figure: a more complicated data set with labeled sub-clusters, its SSE curve, and an illustration of cluster cohesion vs. separation]
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of cohesion and separation, but for individual points as well as for clusters and clusterings.
For an individual point $i$:
calculate $a$ = the average distance of $i$ to the points in its own cluster
calculate $b$ = the minimum, over the other clusters, of the average distance of $i$ to the points in that cluster
The silhouette coefficient of the point is then $s = (b - a) / \max(a, b)$.
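A plain-numpy sketch of this definition (the helper name is illustrative; scikit-learn's silhouette_samples computes the same per-point quantity):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette(X, labels):
    """s(i) = (b - a) / max(a, b) for each point i, where a is the mean
    distance to i's own cluster and b is the smallest mean distance to
    any other cluster. Assumes at least two clusters."""
    labels = np.asarray(labels)
    D = cdist(X, X)                      # all pairwise distances
    s = np.zeros(len(X))
    for i, li in enumerate(labels):
        own = labels == li
        own[i] = False                   # exclude the point itself
        a = D[i, own].mean() if own.any() else 0.0
        b = min(D[i, labels == lj].mean()
                for lj in np.unique(labels) if lj != li)
        s[i] = (b - a) / max(a, b)
    return s
```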