SPtata 2

Clustering
The term clustering is used in several research communities to describe methods for grouping unlabeled data.
It is a widely used technique in data mining applications for pattern analysis, grouping, decision-making, and image segmentation.
Different names of Clustering

Clustering is known as unsupervised learning in pattern recognition, numerical taxonomy in biology and ecology, typology in the social sciences, and partition in graph theory.
Main interest of Clustering

The main interest is to partition a finite set of points in a multidimensional space into classes, called clusters, so that points belonging to the same class are similar and points belonging to different classes are dissimilar.
Types of Clustering techniques
Hierarchical clustering: produces a nested series of partitions by merging or splitting clusters based on a similarity criterion.
Partitional clustering: identifies the partition that optimizes a clustering criterion.
Types of data in Clustering

Numerical data
Categorical data
Interval data
Binary data
Data to be used
Mixed-type data
The majority of mixed data is described by a combination of numeric and nominal-valued features.
In such situations, the standard distance measures for numeric data are inappropriate for assessing the dissimilarity between two observations, because the variables are not only quantitative but also qualitative.
Difficulties that arise in mixed data

The similarity between categorical values is not taken into consideration, which may lead to counter-intuitive clustering results.
One approach is to transform categorical values into a set of binary numeric values and then apply the metric distance measures used in numeric clustering.
This transformation leads to a loss of semantics and a waste of storage.
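The loss of semantics can be illustrated with a minimal sketch. It assumes a hypothetical `color` attribute and plain one-hot encoding (the attribute values and function names are made up for illustration). After the transformation, every pair of distinct categories ends up at the same distance, so the intuition that two values are "closer" than two others is gone, and each record needs one slot per category.

```python
import math

def one_hot(value, categories):
    """Encode a categorical value as a binary vector, one slot per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(a, b):
    """Standard Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical attribute values, chosen only for illustration.
categories = ["red", "crimson", "blue"]
encoded = {c: one_hot(c, categories) for c in categories}

# "red" vs "crimson" and "red" vs "blue" come out equally far apart.
d_red_crimson = euclidean(encoded["red"], encoded["crimson"])
d_red_blue = euclidean(encoded["red"], encoded["blue"])
print(d_red_crimson, d_red_blue)
```

The vector length grows with the number of distinct values, which is the storage cost mentioned above.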
Cluster Ensemble
A method that combines several runs of different clustering algorithms to obtain a common partition of the original dataset, aiming to consolidate the results from a portfolio of individual clustering results.
Proposed by He et al. (2002).
Uses a divide-and-conquer technique to handle mixed data.
Overview of Cluster Ensemble
Cluster Ensemble algorithm
Clustering Algorithm for Categorical Data

CACTUS algorithm
COOLCAT algorithm
ROCK algorithm
CACTUS algorithm
CACTUS (Categorical Clustering Using Summaries) was introduced by Ganti et al.
It is a fast summarization-based algorithm composed of three phases:
Summarization phase
Clustering phase
Validation phase
Fig. 2. The CACTUS Algorithm.

1. Compute the inter-attribute summaries and intra-attribute summaries from the database {summarization phase};
2. Analyze each attribute to compute all cluster projections on it, and then synthesize candidate clusters on sets of attributes from the cluster projections on individual attributes {clustering phase};
3. Compute the actual clusters from the candidate clusters {validation phase}.
COOLCAT algorithm
Proposed by Barbara et al. (2002).
Uses the notion of entropy to group records of data.
Capable of clustering large data sets of records with categorical attributes.
Stable for different sample sizes and parameter settings.
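The entropy idea can be sketched as follows. This is only an illustration of an entropy-based clustering criterion, not the COOLCAT procedure itself. It assumes attributes are treated as independent, so a cluster's entropy is the sum of its per-attribute entropies, and it scores a whole clustering by the size-weighted average of cluster entropies. The function names and the toy records are hypothetical.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cluster_entropy(records):
    """Sum of per-attribute entropies (assumes independent attributes)."""
    return sum(attribute_entropy(list(col)) for col in zip(*records))

def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters; lower is better."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical records with two categorical attributes.
a, b = ("red", "small"), ("blue", "large")
homogeneous = [[a, a], [b, b]]   # similar records grouped together
mixed = [[a, b], [a, b]]         # dissimilar records grouped together

print(expected_entropy(homogeneous), expected_entropy(mixed))
```

Grouping identical records gives an expected entropy of zero, while mixing them raises it, which is the sense in which entropy can guide the grouping of records.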
ROCK algorithm
Robust Clustering using Links
Proposed by Guha et al.
An agglomerative hierarchical clustering algorithm that employs links to merge clusters.
Uses a link-based similarity measure to measure the similarity between two data points and between clusters.
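A minimal sketch of the link idea follows. It assumes that records are sets of items, that two records are neighbors when their Jaccard similarity reaches a threshold `theta`, and that the link count between two records is the number of neighbors they share. The merging step of the algorithm is not shown, a point is not counted as its own neighbor here, and the sample records and names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of items."""
    return len(a & b) / len(a | b)

def neighbors(points, theta):
    """Map each point index to the indices whose similarity is >= theta."""
    nbrs = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if jaccard(points[i], points[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def link(nbrs, i, j):
    """Number of common neighbors shared by points i and j."""
    return len(nbrs[i] & nbrs[j])

# Hypothetical categorical records represented as item sets.
points = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y", "z"}]
nbrs = neighbors(points, theta=0.5)

print(link(nbrs, 0, 1), link(nbrs, 0, 3))
```

Points 0 and 1 share point 2 as a neighbor, so they have one link, while point 3 has no neighbors and therefore no links to anything.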
Clustering algorithm for numeric data
CURE algorithm
BIRCH algorithm
CURE algorithm
Clustering Using Representatives
Employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-points extremes.
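The "middle ground" can be sketched with representative points. The sketch assumes a cluster is summarized by `c` well-scattered points, picked here by a simple farthest-point rule, which are then shrunk toward the centroid by a fraction `alpha`. With `alpha = 1` every representative collapses onto the centroid, and with `alpha = 0` the scattered points are kept as they are. The parameter names and the sample cluster are illustrative assumptions, and the hierarchical merging itself is not shown.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def representatives(points, c, alpha):
    """Pick up to c scattered points, then shrink them toward the centroid."""
    center = centroid(points)
    # Start with the point farthest from the centroid.
    chosen = [max(points, key=lambda p: dist(p, center))]
    while len(chosen) < min(c, len(points)):
        remaining = [p for p in points if p not in chosen]
        # Next pick: the point farthest from all points chosen so far.
        chosen.append(max(remaining, key=lambda p: min(dist(p, q) for q in chosen)))
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in chosen]

# Hypothetical two-dimensional cluster.
cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
reps = representatives(cluster, c=2, alpha=0.5)
print(reps)
```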
BIRCH algorithm
Balanced Iterative Reducing and Clustering using Hierarchies
Uses a hierarchical data structure called the CF-tree for partitioning the incoming data points in an incremental and dynamic way.
Based on two factors: the branching factor B and the threshold T.
Can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans.
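The incremental behavior rests on the clustering feature (CF) stored in each tree entry. A minimal sketch, assuming a CF is the triple (number of points, linear sum, sum of squared norms) and that the threshold T bounds the radius of an entry. The class and helper names are hypothetical, and the CF-tree with its branching factor B is not built here.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int      # number of points summarized
    ls: tuple   # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CFs are additive, so two summaries merge without revisiting data."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

def can_absorb(cf, point, threshold):
    """Assumed rule: absorb a point only if the radius stays within T."""
    return cf.merge(CF.from_point(point)).radius() <= threshold

cf = CF.from_point((0, 0)).merge(CF.from_point((2, 0)))
print(cf.centroid(), cf.radius())
print(can_absorb(cf, (1, 0), 1.0), can_absorb(cf, (10, 0), 1.0))
```

Because the summary is additive, a new point can be folded in during a single pass over the data, which is what makes one scan sufficient for a first clustering.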
Distance Measure
The distance between two points is used to assess the similarity among instances of a population.
It measures dissimilarity, that is, the discordance between two objects.
It can be normalized and converted to a similarity measure.
Distance is a quantitative variable satisfying at least the first three of the following conditions:
1. Distance is always greater than or equal to zero: $d_{ij} \ge 0$
2. Distance is zero if and only if it is measured from an object to itself: $d_{ii} = 0$
3. Distance is symmetric: $d_{ij} = d_{ji}$
4. Distance satisfies the triangle inequality: $d_{ij} \le d_{ik} + d_{jk}$
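The conditions can be checked numerically on a small sample. The sketch below compares the Euclidean distance, which meets all four conditions, with the squared Euclidean distance, which meets the first three but can violate the triangle inequality. The function names and sample points are chosen only for illustration, and a check on a finite sample is not a proof.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def metric_properties(points, d, tol=1e-9):
    """Test the four distance conditions on every pair or triple of points."""
    pairs = list(product(points, repeat=2))
    triples = product(points, repeat=3)
    return {
        "non_negative": all(d(a, b) >= 0 for a, b in pairs),
        "identity": all((d(a, b) <= tol) == (a == b) for a, b in pairs),
        "symmetric": all(abs(d(a, b) - d(b, a)) <= tol for a, b in pairs),
        "triangle": all(d(a, b) <= d(a, c) + d(c, b) + tol for a, b, c in triples),
    }

sample = [(0, 0), (3, 4), (6, 8)]
euclid = metric_properties(sample, euclidean)
squared = metric_properties(sample, squared_euclidean)
print(euclid)
print(squared)
```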
Distance Measure for numeric data

A variance-based algorithm considers the position of each observation relative to the mean of the set.
Distance Measure for categorical data

Distance hierarchy
Composed of concept nodes and links.
An extended structure based on the concept hierarchy, which organizes concepts into different levels of abstraction.
Higher-level nodes represent general concepts.
Lower-level nodes represent more specific concepts.
Each link is associated with a weight representing a distance. For simplicity, each link weight is set to 1.
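With every link weight set to 1, the distance between two concepts reduces to the number of links on the path between them. A minimal sketch, assuming the hierarchy is a tree stored as a child-to-parent map and that distances are measured between nodes only. The concept names are hypothetical and the function names are not from the source.

```python
def path_to_root(node, parent):
    """List of nodes from the given node up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hierarchy_distance(a, b, parent, weight=1.0):
    """Total link weight on the path from a to b through their common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    steps_in_b = {n: i for i, n in enumerate(up_b)}
    for steps_a, n in enumerate(up_a):
        if n in steps_in_b:          # first shared ancestor
            return weight * (steps_a + steps_in_b[n])
    raise ValueError("nodes are not in the same hierarchy")

# Hypothetical concept hierarchy: Any -> {Drink, Food}, Drink -> {Coffee, Tea}, Food -> {Bread}.
parent = {"Drink": "Any", "Food": "Any",
          "Coffee": "Drink", "Tea": "Drink", "Bread": "Food"}

print(hierarchy_distance("Coffee", "Tea", parent))
print(hierarchy_distance("Coffee", "Bread", parent))
```

Two drinks are two links apart, while a drink and a food are four links apart, so the hierarchy keeps the similarity between categorical values that a binary encoding discards.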
Computation on the CLUTO package
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.
It provides tools for analyzing the discovered clusters to understand the relations between the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions.
CLUTO produced the dendrograms for the study.
The study clusters mixed-type data using the cluster ensemble technique, employing existing algorithms for numeric and categorical data sets. Specifically, it endeavors to:
1. Employ other existing clustering algorithms, namely CURE and BIRCH for numeric data sets and CACTUS, COOLCAT and ROCK for categorical data sets, within the algorithm framework of cluster ensemble.
2. Test the efficiency of each combined clustering algorithm using the auto data, the adult data set, and the data of the Emballonurid species.
3. Compare the results to the clustering algorithm used by He et al. (2005) in terms of clustering quality and scalability.
4. Determine which integrated clustering algorithm achieves both high-quality clustering results and scalability.

The study contributes to the field of cluster analysis for data with numeric and categorical features. It provides additional knowledge for determining which clustering algorithms are better integrated for mixed data when using the cluster ensemble method.
The study aims to employ alternative clustering algorithms for numeric and categorical data sets, and to explore the capabilities of different clustering algorithms and the characteristics of different types of data sets using the algorithm framework of cluster ensemble.
Construction of Cluster Ensemble Based Algorithm
The steps involved in the algorithm framework of the cluster ensemble technique are the following:
1. The original mixed data set is divided into two sub-data sets: the pure categorical data set and the pure numeric data set.
2. Existing well-established clustering algorithms designed for different types of data sets are employed to produce corresponding clusters: CURE and BIRCH for numeric data sets, and CACTUS, COOLCAT and ROCK for categorical data sets, as shown in Table 1.
3. The clustering results on the categorical and numeric data sets are combined as a categorical dataset.
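The steps above can be sketched as a small pipeline. The two clusterers below are toy placeholders, not CURE, BIRCH, CACTUS, COOLCAT or ROCK. They exist only so that the split-and-combine framework runs end to end, and every function name and sample record is a hypothetical illustration.

```python
def split_mixed(records, numeric_idx, categorical_idx):
    """Step 1: divide mixed records into numeric and categorical sub-data sets."""
    numeric = [tuple(r[i] for i in numeric_idx) for r in records]
    categorical = [tuple(r[i] for i in categorical_idx) for r in records]
    return numeric, categorical

def toy_numeric_clusterer(rows, cut=5.0):
    """Placeholder for a numeric algorithm: threshold on the first feature."""
    return ["N0" if row[0] < cut else "N1" for row in rows]

def toy_categorical_clusterer(rows):
    """Placeholder for a categorical algorithm: one cluster per distinct row."""
    seen = {}
    return [f"C{seen.setdefault(row, len(seen))}" for row in rows]

def cluster_ensemble(records, numeric_idx, categorical_idx,
                     numeric_clusterer, categorical_clusterer):
    numeric, categorical = split_mixed(records, numeric_idx, categorical_idx)
    num_labels = numeric_clusterer(numeric)           # step 2, numeric part
    cat_labels = categorical_clusterer(categorical)   # step 2, categorical part
    # Step 3: the two label columns form a new, purely categorical data set.
    return list(zip(cat_labels, num_labels))

# Hypothetical mixed records: (numeric feature, categorical feature).
records = [(1.0, "a"), (1.2, "a"), (9.0, "b"), (9.5, "a")]
combined = cluster_ensemble(records, [0], [1],
                            toy_numeric_clusterer, toy_categorical_clusterer)
print(combined)
```

Passing the clusterers as arguments mirrors Table 1: each pairing of a categorical and a numeric algorithm is one call with a different pair of functions.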
Materials and Methods
Table 1: Combinations of algorithms to be employed for categorical and numeric data sets
Algorithms for categorical data sets    Algorithms for numeric data sets    Combination of algorithms
CACTUS                                  CURE                                CACTUS-CURE
CACTUS                                  BIRCH                               CACTUS-BIRCH
COOLCAT                                 CURE                                COOLCAT-CURE
COOLCAT                                 BIRCH                               COOLCAT-BIRCH
ROCK                                    CURE                                ROCK-CURE
ROCK                                    BIRCH                               ROCK-BIRCH
Datasets
The datasets that will be used in the study are the auto data, the adult dataset, and the data of the Emballonurid species. The data contain no missing values. The auto data and the adult dataset will be used to demonstrate the algorithm framework of cluster ensemble, and the data of the Emballonurid species will serve as an application.
