The term clustering is used in several research communities to describe methods for grouping unlabeled data.
Clustering
It is a widely used technique in data mining applications for pattern analysis, grouping, decision-making, and image segmentation.
Hierarchical clustering produces a nested series of partitions by merging or splitting clusters according to a similarity criterion.
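The merging variant can be illustrated with a minimal sketch. It assumes one-dimensional points and single-link distance (the smallest gap between any two members), neither of which is specified on the slide; the function name `single_link_merges` is hypothetical.

```python
def single_link_merges(points):
    """Repeatedly merge the two closest clusters and record each merge."""
    clusters = [[p] for p in points]  # start with one cluster per point
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: closest pair across the two clusters.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


history = single_link_merges([1.0, 2.0, 6.0, 7.0, 15.0])
for left, right, distance in history:
    print(left, "+", right, "at distance", distance)
```

Each entry in `history` is one level of the nested series of partitions; reading it from first to last corresponds to reading a dendrogram from the leaves upward.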
Data to be used
Mixed-type data
The majority of mixed data is described by a combination of numeric and nominal-valued features.
In such situations, the standard distance measures for numeric data are inappropriate for assessing the dissimilarity between two observations, because the variables are qualitative as well as quantitative.
Cluster Ensemble
A cluster ensemble is a method that combines several runs of different clustering algorithms to obtain a common partition of the original dataset, consolidating the results from a portfolio of individual clusterings. The approach proposed by He et al. (2002) uses a divide-and-conquer technique to cluster mixed data.
CACTUS algorithm
Introduced by Ganti et al., CACTUS (Clustering Categorical Data Using Summaries) is a fast summarization-based algorithm composed of three phases: a summarization phase, a clustering phase, and a validation phase.
COOLCAT algorithm
COOLCAT was proposed by Barbara et al. (2002). It uses the notion of entropy to group records of data, is capable of clustering large data sets of records with categorical attributes, and is stable across different sample sizes and parameter settings.
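The entropy criterion can be sketched as follows. This is an illustration only, assuming attributes are treated as independent so that a cluster's entropy is the sum of its per-attribute entropies, and that a clustering is scored by the size-weighted average over clusters; the function names and toy records are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(records):
    """Sum of per-attribute entropies (in bits) for one cluster of categorical records."""
    n = len(records)
    total = 0.0
    for column in zip(*records):  # iterate over attributes
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters; lower means purer clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


a = [("red", "small"), ("red", "small")]
b = [("blue", "large"), ("blue", "large")]

pure = expected_entropy([a, b])                          # identical records grouped together
mixed = expected_entropy([[a[0], b[0]], [a[1], b[1]]])   # dissimilar records grouped together
print(pure, mixed)
```

Grouping identical records gives the lower score, which is the direction an entropy-based method would prefer.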
ROCK algorithm
ROCK (Robust Clustering using Links), proposed by Guha et al., is an agglomerative hierarchical clustering algorithm that employs links to merge clusters. It uses a link-based similarity measure to assess the similarity between two data points and between clusters.
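A minimal sketch of the link idea, under stated assumptions: two records are taken to be neighbors when their Jaccard similarity reaches a threshold `theta`, the link count of a pair is the number of neighbors they share, and a record is not counted as its own neighbor. The threshold value and the toy baskets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of categorical items."""
    return len(a & b) / len(a | b)


def neighbors(points, theta):
    """Map each point index to the indices of other points with similarity >= theta."""
    return {
        i: {j for j, q in enumerate(points) if j != i and jaccard(p, q) >= theta}
        for i, p in enumerate(points)
    }


def links(points, theta):
    """Number of common neighbors for every pair of points."""
    nbrs = neighbors(points, theta)
    n = len(points)
    return {(i, j): len(nbrs[i] & nbrs[j]) for i in range(n) for j in range(i + 1, n)}


baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y", "z"}]
link_counts = links(baskets, theta=0.5)
print(link_counts)
```

The first three baskets each share one neighbor with one another, while the fourth shares none, so a link-based merge would keep it apart.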
CURE algorithm
CURE (Clustering Using Representatives) employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes.
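One way to picture the middle ground is to represent a cluster by several points that are pulled toward the centroid by a shrink factor. The sketch below assumes that reading; the parameter name `alpha` and the toy cluster are hypothetical, and the selection of well-scattered representatives is left out.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def shrink(representatives, center, alpha):
    """Move each representative a fraction alpha of the way toward the center."""
    return [
        tuple(r + alpha * (c - r) for r, c in zip(rep, center))
        for rep in representatives
    ]


cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
center = centroid(cluster)

all_points = shrink(cluster, center, alpha=0.0)     # representatives unchanged
middle = shrink(cluster, center, alpha=0.5)         # halfway toward the centroid
centroid_only = shrink(cluster, center, alpha=1.0)  # every representative collapses to the centroid
print(middle)
```

With `alpha=0.0` the cluster is described by its original points, with `alpha=1.0` by its centroid alone, and values in between give the intermediate representation.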
BIRCH algorithm
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) uses a hierarchical data structure called a CF-tree to partition incoming data points incrementally and dynamically. The tree is governed by two parameters: the branching factor B and the threshold T. BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans.
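The incremental behaviour rests on a compact summary kept for each subcluster. The sketch below assumes the usual clustering feature triple (count, linear sum, sum of squares) and assumes the threshold T is applied to the subcluster radius; it covers only the absorb-or-reject test for one entry, not the tree itself or the branching factor. All function names are hypothetical.

```python
from math import sqrt


def cf_of(point):
    """Clustering feature of a single point: (count, linear sum, sum of squares)."""
    return (1, tuple(point), sum(x * x for x in point))


def cf_merge(a, b):
    """Clustering features are additive, so merging is component-wise addition."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def cf_radius(cf):
    """Root-mean-square distance of the summarized points from their centroid."""
    n, ls, ss = cf
    centroid_sq = sum((x / n) ** 2 for x in ls)
    return sqrt(max(ss / n - centroid_sq, 0.0))


def try_absorb(cf, point, threshold):
    """Absorb the point if the resulting radius stays within the threshold."""
    candidate = cf_merge(cf, cf_of(point))
    if cf_radius(candidate) <= threshold:
        return candidate, True
    return cf, False


entry = cf_of((0.0, 0.0))
entry, absorbed_near = try_absorb(entry, (2.0, 0.0), threshold=1.5)
entry, absorbed_far = try_absorb(entry, (10.0, 0.0), threshold=1.5)
print(entry, absorbed_near, absorbed_far)
```

Because the summary is additive, a point can be absorbed without revisiting earlier points, which is what makes a single scan of the data feasible.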
Distance Measure
The distance between two points is used to assess the similarity among instances of a population. Distance measures dissimilarity,
that is, the discordance between two objects. It can be normalized and converted to a similarity measure.
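One simple conversion, shown here as an assumption rather than the method used in the study, divides each distance by the largest observed distance and subtracts the result from 1. The function name `to_similarity` is hypothetical.

```python
def to_similarity(distances):
    """Scale distances to [0, 1] by the maximum, then flip so 1 means identical."""
    d_max = max(distances)
    if d_max == 0:
        return [1.0 for _ in distances]  # all objects coincide
    return [1.0 - d / d_max for d in distances]


similarities = to_similarity([0.0, 2.0, 4.0])
print(similarities)
```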
A distance is a quantitative variable satisfying at least the first three of the following conditions:
1. $d_{ij} \ge 0$: distance is always greater than or equal to zero.
2. $d_{ii} = 0$: distance is zero if and only if it is measured from an object to itself.
3. $d_{ij} = d_{ji}$: distance is symmetric.
4. $d_{ij} \le d_{ik} + d_{jk}$: distance satisfies the triangle inequality.
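The four conditions can be checked numerically on a small sample. The sketch below is illustrative: it tests Euclidean distance, which satisfies all four, and squared Euclidean distance, which satisfies the first three but fails the triangle inequality. The function name `is_metric` and the sample points are hypothetical.

```python
from itertools import product
from math import dist, isclose


def is_metric(points, d, tol=1e-9):
    """Check the four distance conditions on every triple of sample points."""
    for a, b, c in product(points, repeat=3):
        if d(a, b) < 0:
            return False                      # non-negativity
        if (d(a, b) <= tol) != (a == b):
            return False                      # zero only for identical objects
        if not isclose(d(a, b), d(b, a), abs_tol=tol):
            return False                      # symmetry
        if d(a, b) > d(a, c) + d(c, b) + tol:
            return False                      # triangle inequality
    return True


def squared_euclidean(a, b):
    return dist(a, b) ** 2


sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
euclidean_ok = is_metric(sample, dist)
squared_ok = is_metric(sample, squared_euclidean)
print(euclidean_ok, squared_ok)
```

A check on a finite sample can only refute the conditions, not prove them in general, so a passing result should be read as "no violation found".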
Higher-level nodes represent general concepts; lower-level nodes represent more specific concepts. Each link is associated with a weight representing a distance. For simplicity, each link weight is set to 1.
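With every link weight set to 1, the distance between two concepts reduces to the number of links on the path joining them. A minimal sketch, assuming the hierarchy is a tree stored as a child-to-parent map; the toy hierarchy and function names are hypothetical.

```python
def path_to_root(node, parent):
    """List of nodes from the given node up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def hierarchy_distance(a, b, parent):
    """Number of unit-weight links between a and b via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a hierarchy")


# Hypothetical concept hierarchy: general concepts near the root, specific ones at the leaves.
parent = {"mammal": "animal", "bird": "animal", "bat": "mammal", "dog": "mammal"}

same = hierarchy_distance("bat", "bat", parent)
siblings = hierarchy_distance("bat", "dog", parent)
cousins = hierarchy_distance("bat", "bird", parent)
print(same, siblings, cousins)
```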
Computation on the CLUTO Package
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. It provides tools for analyzing the discovered clusters, to understand the relations between the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions. CLUTO produced the dendrograms for the study.
The study clusters mixed-type data using the cluster ensemble technique, employing existing algorithms for numeric and categorical data sets. Specifically, it endeavors to:
1. Employ other existing clustering algorithms, namely CURE and BIRCH for numeric data sets and CACTUS, COOLCAT and ROCK for categorical data sets, within the algorithm framework of cluster ensemble.
2. Test the efficiency of each combined clustering algorithm using the auto data, the adult data set and the data of Emballonurid species.
3. Compare the results with the clustering algorithm used by He et al. (2005) in terms of clustering quality and scalability.
4. Determine which integrated clustering algorithm achieves both high-quality clustering results and scalability.
The study contributes to the field of cluster analysis for data with numeric and categorical features. It provides additional knowledge for determining which clustering algorithms are better integrated for mixed data when using the cluster ensemble method.
The study aims to employ alternative clustering algorithms for numeric and categorical data sets, and to explore the capabilities of different clustering algorithms and the characteristics of different types of data sets using the algorithm framework of cluster ensemble.
Construction of Cluster Ensemble Based Algorithm
The steps involved in the algorithm framework of the cluster ensemble technique are the following:
1. The original mixed data set is divided into two sub-data sets: the pure categorical data set and the pure numeric data set.
2. Existing well-established clustering algorithms designed for different types of data sets are employed to produce the corresponding clusters: CURE and BIRCH for numeric data sets, and CACTUS, COOLCAT and ROCK for categorical data sets, as shown in Table 1.
3. The clustering results on the categorical and numeric data sets are combined as a categorical dataset.
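The three steps can be sketched end to end. This is an illustration of the data flow only: the two clusterers below are trivial stand-ins, not CURE, BIRCH, CACTUS, COOLCAT or ROCK, and every function name and record is hypothetical.

```python
def split_mixed(records, numeric_idx, categorical_idx):
    """Step 1: divide the mixed data set into numeric and categorical sub-data sets."""
    numeric = [tuple(r[i] for i in numeric_idx) for r in records]
    categorical = [tuple(r[i] for i in categorical_idx) for r in records]
    return numeric, categorical


def toy_numeric_clusterer(rows):
    """Stand-in for a numeric algorithm: split on the mean of the first feature."""
    mean = sum(r[0] for r in rows) / len(rows)
    return [f"N{int(r[0] > mean)}" for r in rows]


def toy_categorical_clusterer(rows):
    """Stand-in for a categorical algorithm: one cluster per distinct record."""
    ids = {}
    return [f"C{ids.setdefault(r, len(ids))}" for r in rows]


def combine(categorical_labels, numeric_labels):
    """Step 3: treat the two label columns as a new, purely categorical data set."""
    return list(zip(categorical_labels, numeric_labels))


records = [(1.0, "sedan"), (1.2, "sedan"), (5.0, "truck"), (5.5, "truck")]
numeric, categorical = split_mixed(records, numeric_idx=[0], categorical_idx=[1])

# Step 2: cluster each sub-data set with an algorithm suited to its type.
combined = combine(toy_categorical_clusterer(categorical), toy_numeric_clusterer(numeric))
print(combined)
```

Each row of `combined` holds one record's cluster labels from both sub-data sets, so the result has only nominal-valued features.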
Table 1: Combinations of algorithms to be employed for categorical and numeric data sets

Algorithm for categorical data sets | Algorithm for numeric data sets | Combination of algorithms
CACTUS                              | CURE                            | CACTUS-CURE
CACTUS                              | BIRCH                           | CACTUS-BIRCH
COOLCAT                             | CURE                            | COOLCAT-CURE
COOLCAT                             | BIRCH                           | COOLCAT-BIRCH
ROCK                                | CURE                            | ROCK-CURE
ROCK                                | BIRCH                           | ROCK-BIRCH
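The six combinations are the Cartesian product of the three categorical algorithms with the two numeric ones, which can be enumerated as a quick check. The variable names are illustrative.

```python
from itertools import product

categorical_algorithms = ["CACTUS", "COOLCAT", "ROCK"]
numeric_algorithms = ["CURE", "BIRCH"]

# Pair every categorical algorithm with every numeric algorithm.
combinations = [
    f"{cat}-{num}" for cat, num in product(categorical_algorithms, numeric_algorithms)
]
print(combinations)
```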
Datasets
The datasets that will be used in the study are the auto data, the adult dataset, and the data of the Emballonurid species. The data contain no missing values. The auto data and the adult dataset will be used to demonstrate the algorithm framework of cluster ensemble, and the data of Emballonurid species will serve as an application.