
NEW LINK BASED APPROACH FOR CATEGORICAL DATA CLUSTERING

KNOWLEDGE AND DATA ENGINEERING

By CHIRANTH B O, 4th Sem M.Tech

Presentation Outline
Introduction to Clustering
Abstract
Existing System
Proposed System
Experimental Design
Experimental Results
Conclusion
July 22, 2013 2

Clustering
Introduction
Clustering: grouping similar kinds of data.
Data clustering concerns how to group a set of objects based on the similarity of their attributes.

Main methods:
Partitioning: K-Means
Hierarchical: BIRCH, ROCK
Density-based: DBSCAN

A good clustering method produces high-quality clusters with:
high intra-class similarity
low inter-class similarity
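The quality criterion above can be made concrete with a small Python sketch. It assumes a simple-matching similarity for categorical records (the fraction of attributes on which two records agree); the slides do not name a similarity measure, so this choice and the function names are illustrative only.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def matching_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def intra_inter_similarity(
    records: List[Sequence[str]], labels: List[int]
) -> Tuple[float, float]:
    """Return (mean intra-class similarity, mean inter-class similarity)."""
    intra: List[float] = []
    inter: List[float] = []
    for i, j in combinations(range(len(records)), 2):
        s = matching_similarity(records[i], records[j])
        # Pairs in the same cluster count as intra-class, others as inter-class.
        (intra if labels[i] == labels[j] else inter).append(s)

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


records = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "small")]
labels = [0, 0, 1, 1]
intra, inter = intra_inter_similarity(records, labels)
print(intra, inter)
```

A clustering is better, in this sense, when the first number is high and the second is low.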



ABSTRACT
Existing categorical data clustering methods generate results from incomplete information, which degrades the quality of the clustering result. This paper presents a new link-based approach for categorical data clustering that improves results by discovering unknown entries through the similarity between clusters.


Existing Methods

K-means cannot cluster categorical data directly.

SQUEEZER and CACTUS generate the final clustering using incomplete information. Many data entries are left unknown.


Proposed Methods

The link-based approach improves the matrix by discovering the unknown entries.

An efficient link-based algorithm is used to find the similarity between clusters.



Introduction to NLCD

Designed for very large data sets:

Time and memory are limited
Only one scan of the data is necessary
Does not need the whole data set in advance

Two key modules:

Scan the database to build a binary matrix.
Build the refined matrix using the Weighted Triple-Quality (WTQ) algorithm.
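The first module can be sketched in Python as follows. The slides do not specify the layout of the binary matrix, so this sketch assumes the usual cluster-ensemble layout: one row per data point, one column per cluster of each base clustering, and a 1 where the point belongs to that cluster. The function name and input format are hypothetical.

```python
from typing import List, Tuple


def binary_association_matrix(
    base_clusterings: List[List[int]],
) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Build the binary matrix for a set of base clusterings.

    base_clusterings[m][i] is the label that clustering m gives to point i.
    Returns (matrix, columns), where columns[j] = (clustering index, label).
    """
    n_points = len(base_clusterings[0])
    columns: List[Tuple[int, int]] = []
    for m, labels in enumerate(base_clusterings):
        if len(labels) != n_points:
            raise ValueError("every clustering must label every data point")
        for label in sorted(set(labels)):
            columns.append((m, label))

    col_index = {col: j for j, col in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in range(n_points)]
    for m, labels in enumerate(base_clusterings):
        for i, label in enumerate(labels):
            matrix[i][col_index[(m, label)]] = 1
    return matrix, columns


# Two base clusterings of four data points (illustrative labels).
matrix, columns = binary_association_matrix([[0, 0, 1, 1], [0, 1, 1, 1]])
for row in matrix:
    print(row)
```

Every 0 in this matrix is a candidate "unknown entry" that the second module would replace with a similarity value.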



Basic process
[Figure: Dataset X is passed to M base clusterings (Clustering 1, Clustering 2, ..., Clustering M), whose results are combined by a consensus function.]


Clustering

PairWise-Similarity Matrix

Binary Matrix


Weighted Triple Quality


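ALGORITHM: WTQ($G$, $C_x$, $C_y$)
$G = (V, W)$, a weighted graph, where $C_x, C_y \in V$;
$N_k \subset V$, the set of adjacent neighbors of $C_k \in V$;
$W_k = \sum_{C_t \in N_k} w_{tk}$;
$WTQ_{xy}$, the WTQ measure of $C_x$ and $C_y$;

$WTQ_{xy} \leftarrow 0$
For each $c \in N_x$
    If $c \in N_y$
        $WTQ_{xy} \leftarrow WTQ_{xy} + \frac{1}{W_c}$
Return $WTQ_{xy}$

Following that, the similarity between clusters $C_x$ and $C_y$ can be estimated by

$Sim(C_x, C_y) = \frac{WTQ_{xy}}{WTQ_{max}} \times DC$

where $WTQ_{max}$ is the largest WTQ value over all pairs of clusters and $DC \in (0, 1]$ is a constant decay factor.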
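A minimal Python sketch of the WTQ measure, assuming the cluster network is stored as a symmetric adjacency dictionary with strictly positive edge weights. The normalization by the largest WTQ value and the decay factor `dc` follow the link-based cluster ensemble formulation as an assumption; the default of 0.9 and the toy graph are illustrative only.

```python
from typing import Dict

# Assumed representation: graph[node][neighbor] = edge weight (symmetric, > 0).
Graph = Dict[str, Dict[str, float]]


def wtq(graph: Graph, cx: str, cy: str) -> float:
    """Weighted Triple-Quality: sum 1/W_c over common neighbors c of cx and cy."""
    total = 0.0
    neighbors_y = graph[cy]
    for c in graph[cx]:
        if c in neighbors_y:
            w_c = sum(graph[c].values())  # W_c: total weight of edges incident to c
            total += 1.0 / w_c
    return total


def similarity(graph: Graph, cx: str, cy: str, dc: float = 0.9) -> float:
    """Sim(cx, cy) = WTQ_xy / WTQ_max * dc, for two distinct clusters."""
    nodes = list(graph)
    wtq_max = max(
        (wtq(graph, a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]),
        default=0.0,
    )
    if wtq_max == 0.0:
        return 0.0  # no pair of clusters shares a neighbor
    return wtq(graph, cx, cy) / wtq_max * dc


# Toy network: A and B are not linked directly but share the neighbor C.
graph: Graph = {
    "A": {"C": 0.5},
    "B": {"C": 0.25},
    "C": {"A": 0.5, "B": 0.25},
}
print(wtq(graph, "A", "B"))
print(similarity(graph, "A", "B"))
```

Recomputing `wtq_max` on every call keeps the sketch short; for a real cluster network it would be computed once and cached.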

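Cluster Network

$w_{xy} \in W$, where $C_x, C_y \in V$

Overlapping members:

$w_{xy} = \frac{|L_x \cap L_y|}{|L_x \cup L_y|}$

where $L_z$ is the set of data points belonging to cluster $C_z$.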


Experimental Results
Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Initial threshold (T): 0.0
Page size (P): 1024 bytes

Experimental Results
K-Means clustering
No   Time   D      # Scan      DS   Time   D      # Scan
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

NLCD clustering
No   Time   D      # Scan      DS   Time   D      # Scan
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
3    11.4   3.95   2           3o   12.2   3.99   2


Conclusions
A new link-based clustering approach that stores the clustering features in a matrix.
Given a limited amount of main memory, NLCD can minimize the time required for I/O.
The problem of constructing the refined matrix is efficiently resolved using the similarity among categorical clusters.

Future Work
First, an extensive study of the behavior of other link-based similarity measures within this problem context.
Second, the new method will be applied to specific domains, including tourism and medical data sets.



Q&A

Thank you for your patience

