
International Journal of Advanced Engineering Research and Technology (IJAERT) 276

Volume 3 Issue 8, August 2015, ISSN No.: 2348 8190

SIMILARITY BASED CLUSTERING TECHNIQUES IN UNCERTAIN
DATASETS: A LITERATURE SURVEY

Mrs. S. Vydehi,
Professor & Head,
Dr. SNS Rajalakshmi College of Arts and Science,
Coimbatore-641049.

Dr. M. Punithavalli,
Associate Professor,
Bharathiar University,
Coimbatore-641046.

Abstract
Mining fruitful information has seen a remarkable growth of interest in today's world. To provide an overview, numerous well-performed studies are reviewed and summarized here to identify the challenges in existing real world applications. Similarity based clustering is a problem with applications in a wide assortment of fields, and it has recently attracted a large amount of research. Comparing the frequently occurring patterns of each class and grouping them is called similarity based clustering. Real world data are frequently very large and may contain outliers and uncertainty. Hence, similarity based clustering of such data using distance measures is a significant issue in the data mining process. Various similarity based clustering techniques and algorithms have been proposed by numerous authors to support the clustering of real data. These techniques and their effectiveness in numerous applications are compared with a view to developing an improved technique for the clustering problem. This paper offers a survey of the similarity based clustering algorithms available for real world datasets. Moreover, the uniqueness and limitations of previous research are discussed, and several promising topics for future study are identified. Furthermore, the areas that utilize similarity based clustering are summarized.
Keywords: Data Mining, Similarity based Clustering, Machine Learning, Unsupervised Learning, Feature Extraction, Feature Selection.

I. INTRODUCTION

Real world data management has become an interesting research topic among data mining researchers. In particular, similarity based clustering of uncertain datasets has attracted considerable interest. Data mining is normally constrained by three limited resources: sample size, time, and memory. Recently, time and memory have become the bottleneck for machine learning applications. Clustering is an unsupervised learning process for grouping a dataset into subgroups. Data may be either static or dynamic. A static data item is a single record with its own characteristics, whereas dynamic data are not single records but arrive as a continuous sequence referred to as stream data. A data stream
is an ordered sequence of points x1, ..., xn. These data can be read or accessed only once, or a small number of times. A time series is a sequence of real numbers, each indicating a value at a point in time. In recent real world applications, data flow continuously from a stream at high speed, producing more examples over time. Traditional algorithms cannot keep up with the high-speed arrival of real world data; for this reason, new algorithms and techniques have been developed for real time data processing.
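Because stream elements can be read only once, an algorithm must update its state incrementally rather than revisit past points. The following is a minimal sketch of one-pass, leader-style stream clustering with incremental centroid updates; the distance threshold and data are hypothetical choices for illustration, not from any specific paper.

```python
# One-pass (single-scan) clustering sketch for a 1-D data stream:
# each point is seen exactly once; it joins the nearest centroid if
# close enough, otherwise it starts a new cluster. The threshold is
# a hypothetical tuning parameter.

def stream_cluster(stream, threshold=2.0):
    centroids = []   # running means, one per cluster
    counts = []      # number of points absorbed per cluster
    for x in stream:
        if centroids:
            j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            if abs(x - centroids[j]) <= threshold:
                counts[j] += 1
                # incremental mean update: no past points are stored
                centroids[j] += (x - centroids[j]) / counts[j]
                continue
        centroids.append(float(x))
        counts.append(1)
    return centroids, counts

centroids, counts = stream_cluster([1.0, 1.2, 0.9, 10.0, 10.3, 1.1])
```

Each arriving point costs one distance scan over the current centroids, so memory and time stay bounded regardless of how long the stream runs.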
Real world data are being generated at an unprecedented rate in almost every application domain: medical analysis, daily fluctuations of the stock market, fault diagnosis, dynamic scientific experiments, electrical power demand, position updates of moving objects in location based services, readings from sensor networks, biological and medical experimental observations, and so on. Traditionally, clustering is treated as a batch procedure. Most clustering techniques fall into two major categories: partitional clustering and hierarchical clustering [1]. These are the two key approaches for achieving effectiveness and efficiency with time series data. A time series experiment requires multiple arrays, which makes it very expensive. Dimensionality reduction techniques can be divided into two groups: (i) feature extraction and (ii) feature selection. Feature extraction techniques derive a set of new features from the original attributes, whereas feature selection picks a subset of the original attributes. There have been numerous textbooks [5] and publications on clustering of scientific data in a variety of areas such as taxonomy, agriculture [2], remote sensing [3], and process control [4]. This paper
presents a survey on various similarity based clustering
algorithms available for real world datasets. Moreover, the distinctiveness and limitations of previous research are discussed, and several achievable topics for future study are identified. Furthermore, the areas to which similarity based clustering has been applied are also summarized.

www.ijaert.org
The rest of the paper is organized as follows. Section II reviews the concept of similarity based clustering and gives an overview of the clustering process in various techniques. Section III briefly discusses possible future extensions of the work. Section IV concludes the paper with a brief discussion.
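The distinction between the two dimensionality reduction strategies drawn in the introduction can be sketched concretely: feature extraction builds new features as combinations of all original attributes, while feature selection keeps a subset of them unchanged. A small NumPy sketch on toy data follows; the variance threshold is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 3] *= 0.01           # a nearly constant, uninformative attribute

# Feature extraction: PCA via SVD derives k NEW features that are
# linear combinations of every original attribute.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T          # project onto the top-2 components

# Feature selection: keep a SUBSET of the original attributes,
# here those whose variance exceeds a (hypothetical) threshold.
keep = Xc.var(axis=0) > 0.1
X_selected = X[:, keep]
```

Extraction here yields 2 synthetic columns; selection drops only the low-variance fourth column and leaves the surviving attributes interpretable as-is.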

II. RELATED WORK

(S. Abiteboul et al., 1987) represent a set of possible worlds using an incomplete information database. The representation techniques they study form a hierarchy that generalizes relations of constants. This hierarchy ranges from the very simple Codd-table (i.e., a relation of constants and distinct variables called nulls, which stand for values that are present but unknown) to much more complex mechanisms involving views on conditioned-tables (i.e., queries on Codd-tables together with conditions).
In the past few years, powerful generalizations of the Euclidean k-means problem have been made, such as Bregman clustering, co-clustering (i.e., simultaneous clustering of the rows and columns of an input matrix), and tensor clustering. Like k-means, these more general problems also suffer from the NP-hardness of the associated optimization. Researchers have developed approximation algorithms of varying degrees of sophistication for k-means, k-medians, and more recently for Bregman clustering; however, there seemed to be no approximation algorithms for Bregman co-clustering and tensor clustering. (Banerjee et al., 2008) derive the first (to their knowledge) guaranteed methods for these increasingly important clustering settings. Going beyond Bregman divergences, they also prove an approximation factor for tensor clustering with arbitrary separable metrics. Through extensive experiments they evaluate the characteristics of their method and show that it also has practical impact.
Clustering uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges both in modeling similarity between uncertain objects and in developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, had not been considered in measuring similarity between uncertain objects. (Bin Jiang et al., 2011) systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous or discrete random variable, respectively. They use the well-known Kullback-Leibler (KL) divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naive implementation is very costly; in particular, computing the exact KL divergence in the continuous case is very expensive or even infeasible. To tackle the problem, they estimate the KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Their extensive experimental results verify the effectiveness, efficiency, and scalability of the approach.
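In the discrete case the KL divergence used for uncertain-object similarity has a simple closed form. The sketch below uses toy rating distributions chosen to mirror the example above: two products with the same mean rating but very different spread, which geometric distance between means cannot separate. The smoothing constant is a hypothetical choice.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    eps guards against zero probabilities (a hypothetical smoothing choice)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Two products over ratings 1..5 with the SAME mean (3.0) but different
# spread: the distance between the means is zero, yet the distributions
# are clearly dissimilar.
narrow = [0.05, 0.10, 0.70, 0.10, 0.05]   # ratings concentrated at 3
wide   = [0.30, 0.15, 0.10, 0.15, 0.30]   # ratings at the extremes

# KL is asymmetric, so a symmetrised form is convenient for clustering.
sym = 0.5 * (kl_divergence(narrow, wide) + kl_divergence(wide, narrow))
```

The symmetrised divergence comes out clearly positive, so a distribution-aware clustering method can still separate the two objects.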
A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance, and relative entropy, have been used for clustering. (Arindam Banerjee, Srujana Merugu et al., 2005) propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means, the Linde-Buzo-Gray (LBG) algorithm, and information-theoretic clustering, which arise from special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, they show that there is a bijection between regular exponential families and a large class of Bregman divergences, which they call regular Bregman divergences. This result enables an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, they discuss the connection between rate distortion theory and Bregman clustering and present an information-theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
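A Bregman divergence is generated by a strictly convex function phi via d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>. The sketch below checks numerically, on toy vectors, that phi(x) = ||x||^2 recovers squared Euclidean distance (the k-means loss) and that negative entropy recovers KL divergence, which is how the paper's framework subsumes both k-means and information-theoretic clustering.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 generates squared Euclidean distance (k-means loss).
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2 * x

# phi(p) = sum p log p (negative entropy) generates KL divergence.
negent = lambda p: np.sum(p * np.log(p))
negent_grad = lambda p: np.log(p) + 1

x = np.array([1.0, 3.0])
y = np.array([2.0, 1.0])
d_sq = bregman(sq, sq_grad, x, y)          # equals ||x - y||^2 = 5

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
d_kl = bregman(negent, negent_grad, p, q)  # equals sum_i p_i log(p_i / q_i)
```

Because the cluster mean minimizes the expected divergence for any Bregman divergence, the familiar k-means centroid update carries over unchanged to this whole family.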
(David M. Blei et al., 2003) described latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics; each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. They present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation, and report results in document modeling, text classification, and collaborative filtering, comparing against a mixture-of-unigrams model and the probabilistic LSI model.
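LDA's three-level generative story (a topic mixture drawn per document, then a topic drawn per word) can be sketched with NumPy. The vocabulary, topic count, Dirichlet parameter, and hand-fixed topic-word distributions below are toy assumptions for illustration; in practice the topic-word distributions are learned from data.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["gene", "cell", "stock", "market", "model", "data"]
n_topics, alpha, n_words = 2, 0.5, 8

# Per-topic word distributions, fixed by hand as a toy assumption.
beta = np.array([
    [0.40, 0.40, 0.02, 0.02, 0.08, 0.08],   # a "biology" topic
    [0.02, 0.02, 0.40, 0.40, 0.08, 0.08],   # a "finance" topic
])

def generate_document():
    theta = rng.dirichlet([alpha] * n_topics)        # document-level topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # choose a topic per word
        words.append(vocab[rng.choice(len(vocab), p=beta[z])])
    return theta, words

theta, words = generate_document()
```

Inference in LDA runs this process in reverse: given only the words, it estimates the hidden theta and beta, and theta then serves as the low-dimensional document representation mentioned above.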
The author presents a Monte Carlo algorithm to find approximate solutions of the traveling salesman problem. The algorithm randomly generates permutations of the stations of the traveling salesman's trip, with probability depending on the length of the corresponding route. Reasoning by analogy with statistical thermodynamics, the probability is given by the Boltzmann-Gibbs distribution. Surprisingly, using this simple algorithm one can get very close to the optimal solution of the problem, or even find the true optimum, as demonstrated on several examples. The author conjectures that the analogy with thermodynamics can offer new insight into optimization problems and can suggest efficient algorithms for solving them.
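The Boltzmann-Gibbs acceptance rule described above (always accept a shorter tour; accept a longer one with probability exp(-delta/T), where T is gradually lowered) can be sketched as simulated annealing for the TSP. The city layout, initial temperature, and cooling schedule below are hypothetical choices, not values from the paper.

```python
import math
import random

def tour_length(tour, cities):
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(cities, t0=10.0, cooling=0.995, steps=20000, seed=1):
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    cur_len = tour_length(tour, cities)
    best, best_len, t = tour[:], cur_len, t0
    for _ in range(steps):
        i, j = sorted(rng.sample(range(len(cities)), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]   # 2-opt style reversal
        delta = tour_length(cand, cities) - cur_len
        # Boltzmann-Gibbs rule: downhill moves always accepted,
        # uphill moves with probability exp(-delta / t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            tour, cur_len = cand, cur_len + delta
            if cur_len < best_len:
                best, best_len = tour[:], cur_len
        t *= cooling               # gradually "cool" the system
    return best, best_len

# Eight cities on a circle: the optimal tour follows the perimeter.
cities = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]
best, best_len = anneal_tsp(cities)
```

At high temperature the walk escapes local minima by occasionally accepting worse tours; as T falls toward zero the search becomes greedy and settles into a (near-)optimal tour.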
As a popular search mechanism, keyword search has been applied to retrieve useful data from documents, texts, graphs, and even relational databases. However, there had been no work on keyword search over uncertain graph data, even though uncertain graphs are widely used in many real applications, such as modeling road networks, influence detection in social networks, and data analysis on PPI networks. Therefore, (Ye Yuan et al., 2013) study the problem of top-k keyword search over uncertain graph data. Following the usual answer definition for keyword search over deterministic graphs, the authors consider a subtree of the uncertain graph to be an answer to a keyword query if 1) it contains all the keywords; 2) it has a high score (defined by users or applications) based on keyword matching; and 3) it has low uncertainty. Due to the presence of uncertainty, keyword search over uncertain graphs is much harder. Therefore, to improve search efficiency, they employ a filtering-and-verification strategy based on a probabilistic keyword index, PKIndex. For each keyword, the authors compute path-based top-k probabilities offline and attach these values to PKIndex in an optimal, compressed way. In the filtering phase, they perform existence, path-based, and tree-based probabilistic pruning, which filters out most false subtrees. In the verification phase, they propose a sampling algorithm to verify the candidates. Extensive experimental results demonstrate the effectiveness of the proposed algorithms.
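The sampling-based verification idea can be illustrated with Monte Carlo estimation over possible worlds: in an uncertain graph each edge exists independently with its own probability, and a candidate answer subtree counts only in worlds where all of its edges survive. The graph and edge probabilities below are toy assumptions, not the paper's actual algorithm.

```python
import random

# Uncertain graph: each edge exists independently with the given probability.
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "d"): 0.3}

def subtree_probability(subtree_edges, edge_probs, samples=200000, seed=7):
    """Monte Carlo estimate of the probability that ALL edges of a
    candidate answer subtree exist in a sampled possible world."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        if all(rng.random() < edge_probs[e] for e in subtree_edges):
            hits += 1
    return hits / samples

# Candidate subtree {a-b, b-c}: by edge independence the exact
# probability is 0.9 * 0.8 = 0.72; the estimate converges to it.
est = subtree_probability([("a", "b"), ("b", "c")], edges)
```

With enough samples the estimate concentrates around the exact product, which is why sampling is a practical verification step when candidates have already been pruned down by the index.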
A novel distance based clustering algorithm was developed for tagging a face dataset, which resembles a photo album. The authors (Chunhui Zhu et al., 2011) introduced a new dissimilarity measure, called the Rank-Order distance, which measures the dissimilarity between two faces using their neighboring information in the dataset. The measure is motivated by the observation that faces of the same person typically share their top neighbors. Precisely, for each face, the authors produce a ranking order list by sorting all other faces in the dataset by absolute distance; the Rank-Order distance between two faces is then computed from their ranking orders. Using this distance, a Rank-Order distance based clustering algorithm is designed that iteratively groups all faces into a small number of clusters for effective tagging. The proposed algorithm outperforms competitive clustering algorithms in terms of both precision/recall and efficiency.
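The rank-order idea (faces of the same person tend to share top neighbors) can be sketched as follows: the directed term sums, over a's neighbor list up to b's position, the rank each of those neighbors holds in b's list, and the symmetric distance normalizes the two directed sums. The neighbor lists are toy assumptions and the formula is a simplified reading of the measure described above.

```python
def asym_rank_dist(list_a, list_b, b):
    """Directed term d(a, b): for every face appearing in a's ranked
    neighbor list up to and including b, add its rank in b's list."""
    rank_in_b = {f: r + 1 for r, f in enumerate(list_b)}
    rank_in_b[b] = 0                      # b has rank 0 in its own list
    prefix = list_a[:list_a.index(b) + 1]
    return sum(rank_in_b.get(f, len(list_b) + 1) for f in prefix)

def rank_order_distance(a, b, orders):
    d_ab = asym_rank_dist(orders[a], orders[b], b)
    d_ba = asym_rank_dist(orders[b], orders[a], a)
    denom = min(orders[a].index(b), orders[b].index(a)) or 1   # guard against 0
    return (d_ab + d_ba) / denom

# Toy neighbor lists (closest first); x and y share most top neighbors,
# so their rank-order distance is small compared with x and z.
orders = {
    "x": ["y", "p", "q", "z", "r"],
    "y": ["x", "p", "q", "r", "z"],
    "z": ["r", "q", "p", "x", "y"],
}
d_xy = rank_order_distance("x", "y", orders)
d_xz = rank_order_distance("x", "z", orders)
```

Faces that rank each other first get distance 0 here, while faces whose neighborhoods disagree accumulate large rank sums, which is the signal the clustering step thresholds on.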

III. FUTURE WORK

Similarity based clustering is a difficult task in real applications spanning a wide assortment of fields, and it has recently attracted a large amount of research. The present study provides a way to investigate existing clustering algorithms and techniques for real time data, and helps give directions for future enhancement. Future research can be directed toward the following aspects: similarity based clustering in high dimensional data, coping with the increased effort of the computational process, and predicting fruitful, accurate information from real datasets.

IV. CONCLUSION

In recent years, the management and processing of so-called real data has become a subject of active research in numerous fields of computer science, e.g., distributed systems, database systems, and data mining. A great deal of research has been carried out in this field to develop efficient clustering techniques for real data. Real world data are frequently very large and may contain outliers; hence, careful examination of previously proposed algorithms is necessary. In this paper we surveyed the current studies on similarity based clustering. These studies are organized into categories depending on whether they work directly with the original data. Most clustering algorithms are not capable of distinguishing between real and random patterns. In addition, this paper discusses possible high dimensional problems with real datasets. The application areas are summarized with a brief description of the data used. The uniqueness and drawbacks of past studies and some possible topics for further study are also discussed. Future work aims to develop an effective clustering algorithm for time series data streams.

REFERENCES
[1] S. Abiteboul, P.C. Kanellakis, and G. Grahne, "On the Representation and Querying of Sets of Possible Worlds," Proc. ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), 1987.
[2] A. Banerjee, S. Jegelka, and S. Sra, "Approximation Algorithms for Bregman Co-clustering and Tensor Clustering," 2008.
[3] B. Jiang, J. Pei, Y. Tao, and X. Lin, "Clustering Uncertain Data Based on Probability Distribution Similarity," IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2011.
[4] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," Journal of Machine Learning Research 6 (2005), pp. 1705-1749.
[5] Y. Yuan, G. Wang, L. Chen, and H. Wang, "Efficient Keyword Search on Uncertain Graph Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 12, Dec. 2013, pp. 2767-2779, doi:10.1109/TKDE.2012.222.
[6] C. Zhu, F. Wen, and J. Sun, "A Rank-Order Distance Based Clustering Algorithm for Face Tagging," IEEE International Conference, June 2011.

