
Bonfring International Journal of Data Mining, Vol. 1, Special Issue, December 2011, p. 18
ISSN 2250-107X | © 2011 Bonfring
Abstract--- Data mining is the process of extracting patterns from data. It is now broadly used in profiling applications such as marketing, surveillance, fraud detection and scientific discovery, as well as in bioinformatics research. This survey gives particular attention to the classification and clustering approaches of data mining. Clustering in data mining must contend with very large datasets containing several classes of objects of different types, which imposes distinctive computational requirements on clustering algorithms. Classification, in turn, is a data mining task rooted in machine learning that is used to identify group membership for data samples. Classification approaches such as decision tree induction, Bayesian networks, the k-nearest neighbor classifier, case-based reasoning, genetic algorithms and fuzzy logic techniques are widely used in many areas. The aim of this survey is to give a wide-ranging evaluation of the different classification and clustering techniques in data mining. The survey analyzes clustering and classification in detail and finally concludes which of the two is better suited to various fields.

Index Terms--- Data Mining, Clustering, Classification, Knowledge Extraction, Support Vector Machine, K-Means Clustering
I. INTRODUCTION
The main aim of this survey is to present a complete review of the various clustering and classification approaches in data mining. Clustering is the partitioning of data into groups of similar objects. Representing the data by a smaller number of clusters necessarily loses fine detail but achieves generalization: many data objects are summarized by a few clusters, so the data are modeled through their clusters. Viewed as data modeling, clustering sits in a historical context rooted in mathematics, statistics and numerical analysis. Data mining therefore comprises the collection and management of data as well as analysis and prediction. Clustering is unsupervised learning: it studies the observations without prior instances, and no predefined class labels exist for the data points. Cluster analysis is used in a number of applications such as data exploration, image processing and market analysis. It helps in discovering the

R. Malathi Ravindran, Research Scholar, Assistant Professor of MCA,
NGM College, Pollachi.
Dr.N. Nalayini, Associate Professor, Department of Computer Science,
NGM College, Pollachi.
distribution of patterns and the correlations among data objects [1].
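As a minimal illustration of the partitional clustering described above, the following sketch implements k-means in plain Python; the toy data, the choice of the first k points as initial centroids, and k = 2 are illustrative assumptions, not part of any cited method.

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means: alternate nearest-centroid assignment and mean update."""
    centroids = [list(p) for p in points[:k]]  # first k points as seeds
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        for i, c in enumerate(clusters):  # recompute centroids as cluster means
            if c:
                centroids[i] = [sum(col) / len(c) for col in zip(*c)]
    return centroids, clusters

# Two well-separated 2-D groups: k-means summarizes six points by two clusters.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

Even though both seeds here fall in the same group, the alternating updates pull one centroid across to the far group within a few iterations, which is the "generalization by few clusters" idea in miniature.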
Classification is another data mining task, used to classify the required data effectively. The goal is to predict the value of a user-specified goal attribute (the class label) based on the values of the other attributes, known as predictive attributes. Data mining algorithms for this task can follow three different learning approaches: supervised, unsupervised or semi-supervised. In supervised learning, the algorithm works with a set of examples whose labels are known; the labels are nominal values in a classification task and numerical values in a regression task. In unsupervised learning, the labels in the dataset are unknown and the algorithm typically aims at grouping examples according to the similarity of their attribute values, which characterizes a clustering task. Finally, semi-supervised learning is generally used when a small subset of labeled examples is available together with a large number of unlabeled examples.

This work studies the data mining framework comprising clustering and classification, and provides an analysis of existing classification methods across several areas.
II. SURVEY OF CLUSTERING AND CLASSIFICATION TECHNIQUES
Density-based algorithms in data mining need a metric space, and their usual setting is spatial data clustering (Han et al. 2001; Kolatch 2001). To make the computation practical, various data indices such as the R*-tree are built. Classic indices are useful only with reasonably low-dimensional data. The DENCLUE algorithm, which is in fact a combination of density-based clustering and grid-based preprocessing, is less affected by data dimensionality.
A frequent requirement is to bound the number of points per cluster. Unfortunately, the k-means algorithm often produces some very small (in certain implementations, empty) clusters. A modification of the k-means objective function and of the k-means updates that incorporates lower limits on cluster volumes is discussed by Bradley et al. (2000) [4]; it involves soft assignments of data points, with coefficients subject to linear-programming constraints. Banerjee and Ghosh (2002) [5] introduced another modification of the k-means algorithm: their objective function corresponds to an isotropic Gaussian mixture with widths inversely proportional to the number of points in the clusters, which results in frequency-sensitive k-means. Strehl and Ghosh (2000) [6] presented a balanced clustering approach that converts the task into a graph partitioning problem.
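The frequency-sensitive idea can be sketched in a few lines: the distance to each centroid is inflated by the number of points that centroid has already won, which penalizes crowded clusters and avoids empty ones. This is an illustrative online sketch, not the exact objective of [5]; the integer toy coordinates are chosen so the result is deterministic.

```python
import math

def freq_sensitive_labels(points, centroids):
    """Assign points one at a time; the effective distance to a centroid is
    its Euclidean distance scaled by the centroid's current win count, so
    clusters that have already absorbed many points become less attractive."""
    counts = [1] * len(centroids)
    labels = []
    for p in points:
        j = min(range(len(centroids)),
                key=lambda i: counts[i] * math.dist(p, centroids[i]))
        counts[j] += 1
        labels.append(j)
    return labels

# Four identical points: plain nearest-centroid assignment would send all of
# them to centroid 0, but the frequency-sensitive rule spills into cluster 1.
labels = freq_sensitive_labels([(0, 0)] * 4, [(1, 0), (3, 0)])  # [0, 0, 0, 1]
```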
A Thorough Investigation on the Clustering and
Classification Techniques in Various Applications
R. Malathi Ravindran and Dr.N. Nalayini
T. Sakamoto et al. [10] performed a phylogenetic analysis of whole genome sequences from isolates obtained from the Arctic and from Japan and Asia, which revealed six distinct clusters in HBV/B. Within each HBV genotype C subgroup, several clusters with genomic similarity to one another can be found.
Two genotypes, B and C, are present among the 200-plus HBV DNA sequences gathered specifically for this project. While genotype B HBV appears to be a homogeneous group [11], the phylogenetic tree shows that three main clusters are already present within genotype C among the HBV strains collected [12].
Subgrouping of HBV genotype C based on the intersubgroup variation of the nucleotide sequence is discussed in [13]. This is in agreement with earlier phylogenetic investigations using the full-length sequences available in GenBank. The main reason for discovering markers separately in each of the clusters (subgenotypes) obtained from the clustering analysis is that these subgenotypes exhibit mutations caused by geographical diversity, which are not markers for carcinogenic diagnosis.
Cases reporting lifetime IDU were almost exclusively infected with genotype D, and all 12 cases who reported injecting within six months prior to diagnosis were infected with genotype D. It must be noted that, because social desirability or recall biases might cause underreporting of recent Injection Drug Use (IDU) [14], the number of cases reporting lifetime IDU may be more indicative of risk than the number reporting recent drug use. Regardless, these findings indicate that IDU is a major route of HBV transmission in BC and that clustering exists based on the phylogenetic analysis. To this end, targeted vaccination may be necessary to decrease the transmission of HBV (genotype D) in such high-risk populations as IDUs and incarcerated individuals.
Rule learning using an evolutionary algorithm performs a global search and can handle attribute interactions better than existing classification approaches [15][16]. Moreover, the classification rules produced are simple and easily interpretable by human experts, who regularly reason in a manner closely analogous to such rules.
K.B. Xu et al. [17] introduced a weighted Choquet integral approach based on a fuzzy measure, which acts as an aggregation tool that maps the feature space onto a real axis optimally with respect to an error criterion; the classifying attribute is then analyzed numerically on that axis, which makes the classification simple. Implementing the classifier requires finding the unknown parameters, namely the values of the fuzzy measure and the weight function. This can be done by running an adaptive genetic algorithm on the given training data. The new classifier was tested by recovering known parameters from a set of artificial training data generated from those parameters, and it also performs well on various real-world datasets. Beyond discriminating between classes, the method can also learn the scaling requirements and the respective significance indices of the feature attributes, together with the relationships among them. These parameter values can be used to shortlist significant feature attributes and thereby reduce the complexity (dimensionality) of the classification problem.
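The aggregation step can be illustrated with the discrete Choquet integral itself. The fuzzy measure below is a hand-set example for two features; in [17] these values would be learned by the genetic algorithm, and the weighted variant additionally applies a learned weight function to the features.

```python
def choquet_integral(values, mu):
    """Discrete Choquet integral of feature values w.r.t. a fuzzy measure mu.
    mu maps frozensets of feature indices to [0, 1], with mu({}) = 0 and
    mu(all features) = 1; non-additivity encodes feature interaction."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # ascending
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        coalition = frozenset(order[pos:])  # features whose value >= values[i]
        total += (values[i] - prev) * mu[coalition]
        prev = values[i]
    return total

# Superadditive measure: features 0 and 1 are worth more together than apart.
mu = {frozenset(): 0.0, frozenset({0}): 0.2,
      frozenset({1}): 0.5, frozenset({0, 1}): 1.0}
score = choquet_integral([3.0, 1.0], mu)  # 1.0 * mu({0,1}) + 2.0 * mu({0}) = 1.4
```

A plain weighted sum with weights 0.2 and 0.5 would give 1.1 here; the Choquet integral gives 1.4 because the measure rewards the two features acting jointly, which is exactly the interaction effect a linear aggregator cannot express.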
C.C. Chang et al. [18] aim to help users easily apply SVMs to their real applications. Their LIBSVM library has attained widespread popularity in machine learning and in many other fields, and the SVM classifier is widely used for classification in data mining and elsewhere. Nevertheless, some difficulties remain, such as solving the SVM optimization problems, theoretical convergence, multi-class classification, probability estimates and parameter selection.
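LIBSVM itself solves the dual problem with an SMO-type decomposition method. As a much simpler illustration of what an SVM optimizes, the sketch below trains a bias-free linear SVM on the primal hinge-loss objective with a Pegasos-style stochastic subgradient method; the toy data, the regularization constant and the step schedule are illustrative assumptions, not LIBSVM's algorithm.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style SGD on (lam/2)*||w||^2 + mean hinge loss; labels are +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # shuffled pass over data
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # hinge loss active: step toward y[i] * X[i]
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

X = [(1.0, 2.0), (2.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
pred = [1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1 for x in X]
```

The kernel trick and the parameter-selection difficulties mentioned above arise when this linear decision function is replaced by a kernel expansion over support vectors.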
A decision tree [20] is a tree-structured classifier, learned by a recursive tree-growing process. Each test, corresponding to an attribute, is evaluated on the training data by means of a test criterion function, which assigns each test a score based on how well it partitions the dataset. The test with the best score is chosen and placed at the root of the tree. The subtrees of each node are then grown recursively by applying the same algorithm to the instances in each leaf. The algorithm stops when the current node contains either all positive or all negative instances.
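That recursive growing procedure can be sketched directly. Here the test criterion function scores a threshold split by the number of instances misclassified under a majority vote in each branch (lower is better), a simplification of the usual impurity measures such as Gini or information gain; the data and stopping rules are illustrative.

```python
def grow_tree(rows, labels):
    """Recursive tree growing: score every attribute/threshold test, put the
    best at the root, recurse on both branches, stop on pure nodes."""
    if len(set(labels)) == 1:  # all instances agree: make a leaf
        return ("leaf", labels[0])
    best = None
    for f in range(len(rows[0])):
        for t in sorted(set(r[f] for r in rows)):  # candidate thresholds
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            # score: instances misclassified by each branch's majority label
            score = sum(len(b) - max(map(b.count, set(b))) for b in (left, right))
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no informative split left: majority-vote leaf
        return ("leaf", max(set(labels), key=labels.count))
    _, f, t = best
    lpart = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    rpart = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return ("node", f, t,
            grow_tree([r for r, _ in lpart], [l for _, l in lpart]),
            grow_tree([r for r, _ in rpart], [l for _, l in rpart]))

def predict(tree, row):
    while tree[0] == "node":
        _, f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree[1]

tree = grow_tree([(1.0,), (2.0,), (8.0,), (9.0,)], ["neg", "neg", "pos", "pos"])
```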
Naive Bayes classifiers frequently work well even in difficult real-world conditions. H. Zhang [19] investigated why: what matters is the dependence distribution, i.e., how the local dependence of each node distributes in each class, evenly or unevenly, and how the local dependencies of all the nodes work together, consistently supporting or opposing a certain classification. Hence, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly across the classes, or if they cancel each other out.
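A Gaussian naive Bayes classifier makes the independence assumption concrete: the class-conditional joint likelihood factorizes into per-feature univariate Gaussians. The toy data and the variance-smoothing constant below are illustrative assumptions.

```python
import math

def fit_gnb(X, y):
    """Per class, store the prior and the per-feature mean/variance; the naive
    independence assumption lets the joint likelihood factorize over features."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        stats = []
        for f in range(len(X[0])):
            vals = [r[f] for r in rows]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-9  # smoothed
            stats.append((mean, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def predict_gnb(model, x):
    def log_posterior(c):  # log prior + sum of per-feature Gaussian log-densities
        prior, stats = model[c]
        lp = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return lp
    return max(model, key=log_posterior)

model = fit_gnb([(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)], [0, 0, 0, 1, 1, 1])
```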
S. Mika et al. [21] introduced a fast training algorithm for the kernel Fisher discriminant classifier. The authors use a greedy approximation technique whose empirical scaling behavior improves on the state of the art by more than an order of magnitude, thereby rendering the kernel Fisher algorithm a feasible option for large datasets as well.











Limitations of Conventional Clustering and Classification Approaches

K-Means Clustering:
- Difficulty in comparing the quality of the clusters produced (e.g., different initial partitions or values of K affect the outcome).
- The fixed number of clusters makes it difficult to predict what K should be.
- Does not work well with non-globular clusters.
- Different initial partitions can result in different final clusters; it is helpful to rerun the algorithm with the same as well as different K values and compare the results.

Fuzzy C-Means Clustering:
- Computes the neighborhood term in each iteration step, which is very time-consuming.

Support Vector Machine Classifier:
- Takes a long time for classification.
- An important practical question that is not entirely solved is the selection of the kernel function parameters.

K-NN Classifier:
- The main disadvantage of the KNN algorithm is that it is a lazy learner, i.e., it does not learn anything from the training data and simply uses the training data itself for classification.
III. PROBLEMS AND DIRECTIONS
The aim of the clustering component is to determine whether clusters exist based on the phylogenetic tree analysis. If clusters are identified, each cluster will be examined independently, because this reduces the noise produced by related data differences and yields much better classification accuracy.

In earlier works, the clustering approaches do not combine the clustering results with optimization methods; as a result, more noise remains in the data and the accuracy suffers. Classification is an important data mining task used in many areas. For HCC analysis and prediction, the classification model should have high sensitivity, specificity and appropriate accuracy. The learned model should also give a clear indication of the degree of influence of each attribute on the classification goal, and of whether there are any interactions among the predictive attributes.
In recent years there has been a great deal of research in clustering and classification, and a number of techniques have been proposed. Fuzzy-based clustering techniques provide significant results, with higher clustering accuracy in less clustering time, and Evolutionary Algorithm (EA) based clustering also gives significant results. Likewise, swarm intelligence based classification techniques provide higher accuracy in less classification time. Classification algorithms based on swarm intelligence and neural networks have been widely used in applications such as gene classification and cancer classification.

Swarm intelligence approaches include Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), Glowworm Swarm Optimization (GSO), etc. Neural network based approaches include the Artificial Neural Network (ANN), the Fuzzy Neural Network (FNN) and neuro-fuzzy approaches.
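As a minimal illustration of the swarm idea underlying these classifiers, the sketch below is a plain PSO minimizing a sphere function; the inertia and acceleration constants are common textbook choices, not values taken from any cited work.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, seed=0):
    """Minimal PSO: each particle tracks its personal best, the swarm shares a
    global best, and velocities blend inertia with pulls toward both bests."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social coefficients
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:  # improved personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:  # improved global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

best, best_val = pso_minimize(lambda p: sum(x * x for x in p), dim=2)
```

A PSO-based classifier typically replaces the sphere function with a classification-error or rule-quality objective over candidate model parameters.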

Advantages of Neural Network and Swarm Intelligence Based Clustering and Classification

Evolutionary Clustering Algorithm:
- Provides higher classification accuracy with less classification time.
- Performs well even on large datasets.

GA Clustering:
- Provides an optimal solution in the clustering results.

PSO and ABC Based Classification:
- Nature-inspired algorithms with a lower error rate.
- Give optimal classification results for large datasets.

IV. CONCLUSION
This work has clearly surveyed existing classification and clustering methods, studying their individual characteristics and specific functionality. It motivates the proposal of a new clustering method that groups data efficiently and reduces noisy data within the cluster groups. The existing classification methods discussed in this survey achieve good accuracy, but alternative classification methods are still needed to improve the accuracy of the results in various applications. In future work, different evolutionary classification algorithms will be used to improve the classification accuracy.
REFERENCES
[1] S. Anitha Elavarasi, J. Akilandeswari and B. Sathiyabhama, "A Survey on Partition Clustering Algorithms", January 2011.
[2] J. Han, M. Kamber and A.K.H. Tung, "Spatial clustering methods in data mining: A survey", in H. Miller and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
[3] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey", 2001.
[4] P.S. Bradley, K.P. Bennett and A. Demiriz, "Constrained K-Means Clustering", Technical Report MSR-TR-2000-65, Microsoft Research, Redmond, WA, 2000.
[5] A. Banerjee and J. Ghosh, "On Scaling Up Balanced Clustering Algorithms", in Proceedings of the 2nd SIAM International Conference on Data Mining, pp. 333-349, Arlington, VA, 2002.
[6] A. Strehl and J. Ghosh, "A Scalable Approach to Balanced, High-Dimensional Clustering of Market Baskets", in Proceedings of the 17th International Conference on High Performance Computing, Springer LNCS, pp. 525-536, Bangalore, India, 2000.
[7] T. Sakamoto, Y. Tanaka, J. Simonetti, C. Osiowy, M.L. Børresen, A. Koch, F. Kurbanov, M. Sugiyama, G.Y. Minuk, B.J. McMahon, T. Joh and M. Mizokami, "Classification of Hepatitis B Virus Genotype B into 2 Major Types Based on Characterization of a Novel Subgenotype in Arctic Indigenous Populations", J. Infectious Diseases, vol. 196, pp. 1487-1492, 2007.
[8] F. Sugauchi, H. Kumada, H. Sakugawa, M. Komatsu, H. Niitsuma, H. Watanabe, Y. Akahane, H. Tokita, T. Kato, Y. Tanaka, E. Orito, R. Ueda, Y. Miyakawa and M. Mizokami, "Two Subtypes of Genotype B (Ba and Bj) of Hepatitis B Virus in Japan", Clinical Infectious Diseases, vol. 38, pp. 1222-1228, 2004.
[9] H.L.Y. Chan, S.K.W. Tsui, E.Y.T. Ng, P.C.H. Tse, K.S. Leung, K.H. Lee, T. Mok, A. Bartholomeusz, T.C.C. Au and J.J.Y. Song, "Epidemiological and Virological Characteristics of Two Subgroups of Genotype C Hepatitis Virus", J. Infectious Diseases, vol. 191, pp. 2022-2032, 2005.
[10] S.M. Bowyer and J.G.M. Sim, "Relationships within and between Genotypes of Hepatitis B Virus at Points Across the Genome: Footprints of Recombination in Certain Isolates", J. General Virology, vol. 81, pp. 379-392, 2000.
[11] T.E. Perlis, D.C. Des Jarlais, S.R. Friedman et al., "Audio-Computerized Self-Interviewing versus Face-to-Face Interviewing for Research Data Collection at Drug Abuse Treatment Programs", Addiction, vol. 99, pp. 885-896, 2004.
[12] A.A. Freitas, "A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery", in A. Ghosh and S. Tsutsui (Eds.), Advances in Evolutionary Computation, Springer-Verlag, 2002.
[13] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn and A.K. Jain, "Dimensionality Reduction Using Genetic Algorithms", IEEE Trans. Evolutionary Computing, vol. 4, no. 2, pp. 164-171, July 2000.
[14] K.B. Xu, Z.Y. Wang, P.A. Heng and K.S. Leung, "Classification by Nonlinear Integral Projections", IEEE Trans. Fuzzy Systems, vol. 11, no. 2, pp. 187-201, Apr. 2003.
[15] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines", Software, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[17] H. Zhang, "The Optimality of Naive Bayes", in Proc. 17th International Florida Artificial Intelligence Research Society (FLAIRS) Conference, 2004.
[18] Data Mining Tools See5 and C5.0, Software, http://www.rulequest.com/see5-info.html, May 2006.
[19] S. Mika, A.J. Smola and B. Schölkopf, "An Improved Training Algorithm for Fisher Kernel Discriminants", in T. Jaakkola and T. Richardson (Eds.), Proc. Artificial Intelligence and Statistics (AISTATS '01), pp. 98-104, 2001.