
Clustering

The term clustering is used in several research communities to describe methods for grouping unlabeled data.

It is a widely used technique in data mining applications for:
Pattern analysis
Grouping
Decision-making
Image segmentation

Different names of Clustering


Unsupervised learning in pattern recognition
Numerical taxonomy in biology and ecology
Typology in social sciences
Partition in graph theory

Main interest of Clustering


Partition a finite set of points in a multidimensional space into classes, called clusters, so that points belonging to the same class are similar and points belonging to different classes are dissimilar.

Types of Clustering techniques

Hierarchical clustering: produces a nested series of partitions by merging or splitting clusters based on a similarity criterion.

Partitional clustering: identifies the partition that optimizes a clustering criterion.

Types of data in Clustering


Numerical data
Categorical data
Interval data
Binary data

Data to be used
Mixed-type data
The majority of mixed data is described by a combination of numeric and nominal-valued features.

In such situations, the standard distance measures for numeric data are inappropriate for assessing the dissimilarity between two observations, because the variables are qualitative as well as quantitative.

Difficulties that arise in mixed data

Similarity between categorical values is not taken into consideration, which may lead to counter-intuitive clustering results.
A common workaround is to transform categorical values into a set of binary numeric values and then apply the metric distance measures used in numeric clustering.
This transformation leads to a loss of semantics and a waste of storage.
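
To illustrate the loss of semantics, here is a minimal Python sketch. It assumes a hypothetical attribute with three values (`sedan`, `coupe`, `truck`); these names are illustrative and are not taken from the data sets in the study. After binary encoding, every pair of distinct values sits at the same Euclidean distance, so any idea that two values are closer to each other than to a third is lost, and each value now occupies as many numeric columns as there are categories.

```python
import math

def one_hot(value, categories):
    """Encode one categorical value as a list of 0/1 indicators."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(a, b):
    """Standard Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical category values, used only for illustration.
categories = ["sedan", "coupe", "truck"]
encoded = {c: one_hot(c, categories) for c in categories}

# Every pair of distinct categories ends up equally far apart.
d_sedan_coupe = euclidean(encoded["sedan"], encoded["coupe"])
d_sedan_truck = euclidean(encoded["sedan"], encoded["truck"])

# One categorical column has become len(categories) numeric columns.
columns_after_encoding = len(encoded["sedan"])
```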

Cluster Ensemble
A method that combines several runs of different clustering algorithms to obtain a common partition of the original dataset, aiming to consolidate the results from a portfolio of individual clustering results.
Proposed by He et al. (2002).
Uses a divide-and-conquer technique to handle mixed data.

Overview of Cluster Ensemble

Cluster Ensemble algorithm

Clustering Algorithms for Categorical Data

CACTUS algorithm
COOLCAT algorithm
ROCK algorithm

CACTUS algorithm
CACTUS, or Categorical Clustering Using Summaries
Introduced by Ganti et al.
A fast summarization-based algorithm
Composed of three phases: summarization, clustering, and validation

Fig. 2. The CACTUS Algorithm.


1. Compute the inter-attribute summaries and intra-attribute summaries from the database {summarization phase};
2. Analyze each attribute to compute all cluster projections on it, and then synthesize candidate clusters on sets of attributes from the cluster projections on individual attributes {clustering phase};
3. Compute the actual clusters from the candidate clusters {validation phase}.

COOLCAT algorithm

Proposed by Barbara et al. (2002)
Uses the notion of entropy to group records of data
Capable of clustering large data sets of records with categorical attributes
Stable for different sample sizes and parameter settings
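
The entropy idea can be sketched as follows. This is a minimal illustration, not the COOLCAT implementation: it assumes that attributes are treated as independent, so the entropy of a cluster is the sum of its per-attribute entropies, and that a partition is scored by the size-weighted average entropy of its clusters. The records and function names are hypothetical.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cluster_entropy(records):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    if not records:
        return 0.0
    return sum(attribute_entropy(column) for column in zip(*records))

def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters of a partition."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical records with two categorical attributes.
records = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]

# Grouping identical records together gives the lower expected entropy.
homogeneous = [records[:2], records[2:]]
mixed = [[records[0], records[2]], [records[1], records[3]]]
homogeneous_score = expected_entropy(homogeneous)
mixed_score = expected_entropy(mixed)
```

Under these assumptions, a partition with a lower score groups more similar records together, which is the sense in which entropy can guide the assignment of records to clusters.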

ROCK algorithm
Robust Clustering using Links
Proposed by Guha et al.
An agglomerative hierarchical clustering algorithm that employs links to merge clusters
Uses a link-based similarity measure to assess the similarity between two data points and between clusters
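
A minimal sketch of the link idea follows. It assumes Jaccard similarity between records viewed as sets of items, a similarity threshold `theta` that decides whether two records are neighbors, and a link count equal to the number of neighbors two records share. Excluding a record from its own neighbor set is a simplifying assumption of this sketch, and the records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two records viewed as sets of items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def neighbors(points, theta):
    """Map each point index to the indices whose similarity is >= theta."""
    return {
        i: {j for j, q in enumerate(points) if j != i and jaccard(p, q) >= theta}
        for i, p in enumerate(points)
    }

def links(points, theta):
    """Number of common neighbors for every pair of points."""
    nbrs = neighbors(points, theta)
    n = len(points)
    return {(i, j): len(nbrs[i] & nbrs[j]) for i in range(n) for j in range(i + 1, n)}

# Hypothetical categorical records (for example, market baskets).
points = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y", "z"}]
link_counts = links(points, theta=0.5)
```

Pairs with many shared neighbors are good candidates for merging, whereas the last record shares no neighbors with the others and stays apart.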


Clustering algorithms for numeric data

CURE algorithm
BIRCH algorithm

CURE algorithm
Clustering Using Representatives
Employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-points extremes
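
One way to picture the middle ground is to describe each cluster by a few representative points instead of a single centroid or all of its points. The sketch below is an illustration under stated assumptions, not the CURE implementation: representatives are picked with a farthest-point heuristic and then moved toward the centroid by a hypothetical fraction `alpha`. The sample points and parameter values are made up.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        chosen.append(max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def shrink(reps, center, alpha):
    """Move each representative toward the centroid by a fraction alpha."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in reps]

# Hypothetical cluster: four corners of a square plus its center.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
reps = scattered_points(cluster, c=2)
shrunk = shrink(reps, centroid(cluster), alpha=0.5)
```

With `alpha = 0` the representatives stay on the cluster boundary (closer to the all-points extreme), and with `alpha = 1` they collapse onto the centroid.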

BIRCH algorithm
Balanced Iterative Reducing and Clustering using Hierarchies
Uses a hierarchical data structure called a CF-tree to partition the incoming data points in an incremental and dynamic way
Based on two factors: the branching factor B and the threshold T
Can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans
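
The entries of a CF-tree can be sketched as clustering features. The example below is a simplified illustration, not the BIRCH implementation: it assumes each entry stores the number of points, their linear sum, and their sum of squared norms, and that a new point is absorbed only if the merged entry's radius stays within the threshold T. The class and function names are hypothetical, and the tree structure governed by the branching factor B is omitted.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    """Clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    ls: tuple
    ss: float

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CFs are additive, so merging needs no access to the raw points."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root mean squared distance of the points from the centroid."""
        variance = self.ss / self.n - sum(x * x for x in self.centroid())
        return math.sqrt(max(variance, 0.0))

def try_absorb(entry, point, threshold):
    """Return the merged entry if its radius stays within the threshold T."""
    merged = entry.merge(CF.from_point(point))
    return merged if merged.radius() <= threshold else None

entry = CF.from_point((0.0, 0.0))
near = try_absorb(entry, (2.0, 0.0), threshold=1.5)   # absorbed
far = try_absorb(entry, (10.0, 0.0), threshold=1.5)   # rejected
```

Because the summaries are additive, each incoming point can be handled once and then discarded, which is what makes a single scan of the data feasible.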

Distance Measure
The distance between two points is used to assess the similarity among instances of a population.
It measures dissimilarity, that is, the discordance between two objects.
It can be normalized and converted to a similarity measure.

Distance Measure
A distance is a quantitative variable satisfying at least the first three of the following conditions:
Distance is always greater than or equal to zero: $d_{ij} \ge 0$
Distance is zero if and only if it is measured to itself: $d_{ii} = 0$
Distance is symmetric: $d_{ij} = d_{ji}$
Distance satisfies the triangle inequality: $d_{ij} \le d_{ik} + d_{jk}$
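
As a rough illustration, the four conditions can be checked numerically on a matrix of pairwise distances. The sketch below uses hypothetical one-dimensional points. It shows that the absolute difference satisfies all four conditions, while the squared difference satisfies the first three but violates the triangle inequality. The check covers only $d_{ii} = 0$, not the "only if" direction.

```python
from itertools import product

def check_distance_conditions(d, tol=1e-12):
    """Report which of the four distance conditions hold for matrix d."""
    idx = range(len(d))
    return {
        "non_negative": all(d[i][j] >= -tol for i, j in product(idx, idx)),
        "zero_to_itself": all(abs(d[i][i]) <= tol for i in idx),
        "symmetric": all(abs(d[i][j] - d[j][i]) <= tol for i, j in product(idx, idx)),
        "triangle": all(
            d[i][j] <= d[i][k] + d[j][k] + tol for i, j, k in product(idx, idx, idx)
        ),
    }

# Hypothetical one-dimensional points.
xs = [0.0, 3.0, 7.0]
absolute = [[abs(a - b) for b in xs] for a in xs]
squared = [[(a - b) ** 2 for b in xs] for a in xs]

absolute_report = check_distance_conditions(absolute)
squared_report = check_distance_conditions(squared)
```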

Distance Measure for numeric data


An algorithm based on variance, which considers the position of each observation relative to the mean of the set.

Distance Measure for categorical data


Distance hierarchy
Composed of concept nodes and links
An extended structure based on the concept hierarchy
Organizes concepts into different levels of abstraction

Higher-level nodes represent general concepts.
Lower-level nodes represent more specific concepts.
Each link is associated with a weight representing a distance. For simplicity, each link weight is set to 1.
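
With every link weight set to 1, the distance between two concepts can be read as the number of links on the path that joins them through their closest common ancestor. The sketch below is a simplified illustration that measures distances between nodes only. The hierarchy of drinks is hypothetical and serves only as an example.

```python
def path_to_root(node, parent):
    """List the nodes from the given node up to the root of the hierarchy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hierarchy_distance(a, b, parent, weight=1.0):
    """Total link weight on the path between a and b via their common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return weight * (steps_from_a[n] + steps_from_b)
    raise ValueError("nodes are not in the same hierarchy")

# Hypothetical concept hierarchy: child -> parent.
parent = {
    "Coke": "Soft drink",
    "Pepsi": "Soft drink",
    "Latte": "Coffee",
    "Soft drink": "Drink",
    "Coffee": "Drink",
}

same_branch = hierarchy_distance("Coke", "Pepsi", parent)
different_branch = hierarchy_distance("Coke", "Latte", parent)
```

Values that share a more specific parent end up closer than values that meet only at a general concept, which is the semantic information that binary encoding discards.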

Computation on CLUTO Package
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.
It provides tools for analyzing the discovered clusters, to understand the relations between the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions.
CLUTO produced the dendrograms for the study.

Cluster mixed-type data using the cluster ensemble technique, employing existing algorithms for numeric and categorical data sets. Specifically, the study endeavors to:
Employ other existing clustering algorithms, such as CURE and BIRCH for numeric data sets and CACTUS, COOLCAT, and ROCK for categorical data sets, within the algorithm framework of cluster ensemble;
Test the efficiency of each combined clustering algorithm using the auto data, the adult data set, and the data of Emballonurid species;
Compare the results to the clustering algorithm used by He et al. (2005) in terms of clustering quality and scalability; and
Determine which integrated clustering algorithm achieves both high-quality clustering results and scalability.


Contribute to the field of cluster analysis for clustering data with numeric and categorical features.
Provide additional knowledge for determining which clustering algorithm is better integrated with mixed data when using the cluster ensemble method.

The study aims to employ alternative clustering algorithms for numeric and categorical data sets, and to explore the capabilities of different clustering algorithms and the characteristics of different types of data sets using the algorithm framework of cluster ensemble.

Construction of Cluster Ensemble Based Algorithm
The steps involved in the algorithm framework of the cluster ensemble technique are the following:
1. The original mixed data set is divided into two sub-data sets: the pure categorical data set and the pure numeric data set.
2. Existing well-established clustering algorithms designed for different types of data sets are employed to produce the corresponding clusters: CURE and BIRCH for numeric data sets, and CACTUS, COOLCAT, and ROCK for categorical data sets, as shown in Table 1.
3. The clustering results on the categorical and numeric data sets are combined as a categorical data set.
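
The framework steps can be sketched in a few lines of Python. This is only an illustration of the data flow: the two clustering functions are toy stand-ins (a split at the mean, and grouping of identical values) used in place of CURE or BIRCH and of CACTUS, COOLCAT, or ROCK, and the records, column names, and label format are hypothetical.

```python
def split_mixed(records, numeric_cols, categorical_cols):
    """Step 1: divide mixed records into numeric and categorical parts."""
    numeric = [tuple(r[c] for c in numeric_cols) for r in records]
    categorical = [tuple(r[c] for c in categorical_cols) for r in records]
    return numeric, categorical

def toy_numeric_clusterer(rows):
    """Stand-in for CURE/BIRCH: split on the mean of the first column."""
    mean = sum(r[0] for r in rows) / len(rows)
    return [0 if r[0] <= mean else 1 for r in rows]

def toy_categorical_clusterer(rows):
    """Stand-in for CACTUS/COOLCAT/ROCK: identical rows share a label."""
    ids = {}
    return [ids.setdefault(r, len(ids)) for r in rows]

def combine(cat_labels, num_labels):
    """Step 3: treat the two label vectors as a new categorical data set."""
    return [(f"cat{c}", f"num{n}") for c, n in zip(cat_labels, num_labels)]

# Hypothetical mixed records.
records = [
    {"price": 10.0, "body": "sedan"},
    {"price": 12.0, "body": "sedan"},
    {"price": 50.0, "body": "truck"},
    {"price": 55.0, "body": "truck"},
]

numeric, categorical = split_mixed(records, ["price"], ["body"])
combined = combine(toy_categorical_clusterer(categorical), toy_numeric_clusterer(numeric))
```

Each row of `combined` is a purely categorical record whose attributes are cluster labels, so the combined result can be handled with methods for categorical data.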

Materials and Methods

Table 1: Combinations of algorithms to be employed for categorical and numeric data sets

Algorithms for categorical data sets | Algorithms for numeric data sets | Combination of algorithms
CACTUS | CURE | CACTUS-CURE
CACTUS | BIRCH | CACTUS-BIRCH
COOLCAT | CURE | COOLCAT-CURE
COOLCAT | BIRCH | COOLCAT-BIRCH
ROCK | CURE | ROCK-CURE
ROCK | BIRCH | ROCK-BIRCH
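
The six combinations in Table 1 are every pairing of one categorical algorithm with one numeric algorithm. A short sketch of how such an experiment grid could be enumerated (the variable names are illustrative):

```python
from itertools import product

categorical_algorithms = ["CACTUS", "COOLCAT", "ROCK"]
numeric_algorithms = ["CURE", "BIRCH"]

# Pair every categorical algorithm with every numeric algorithm.
combinations = [
    f"{cat}-{num}" for cat, num in product(categorical_algorithms, numeric_algorithms)
]
```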


Datasets
The datasets that will be used in the study are the auto data, the adult dataset, and the data of the Emballonurid species. The data contain no missing values. The auto data and the adult dataset will be used to demonstrate the algorithm framework of cluster ensemble, and the data of the Emballonurid species will serve as an application.

