
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 2, March – April 2014 ISSN 2278-6856

Comparison and Analysis of Various Clustering Methods in Data Mining on an Education Data Set Using the WEKA Tool

Suman¹ and Mrs. Pooja Mittal²

¹ Student of Master of Technology, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India
² Assistant Professor, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India
Abstract: Data mining is used to find hidden patterns of information and relationships within large data sets, which is very useful in decision making. Clustering is a very important technique in data mining: it divides the data into groups, with each group containing data that are similar to one another and dissimilar from the data in other groups. Clustering uses various notions to create these groups; a cluster may be a group with low distances among its members, a dense area of the data space, an interval, or a particular statistical distribution. This paper provides a comparison of various clustering algorithms such as k-means clustering, hierarchical clustering, density-based clustering and grid-based clustering. We compare the performance of these major clustering algorithms with respect to their ability to build clusters correctly class-wise. The performance of the techniques is presented and compared using the clustering tool WEKA.

Keywords: Data mining, clustering, k-means clustering, hierarchical clustering, DBSCAN clustering, grid clustering.
I. Introduction
Data mining is also known as knowledge discovery. It is an important subfield of computer science with the computational ability to discover patterns in large data sets. The main objective of data mining is to discover data and patterns and to store them in an understandable form. Data mining applications are used in almost every field to manage records and other information. Data mining is a stepwise process that converts raw data into meaningful information (data mining follows a sequence of steps to discover the hidden data and patterns). Data mining has various techniques, each with its own capabilities, but in this paper we concentrate on clustering techniques and their methods.
II. Clustering
In this technique we split the data into groups known as clusters. Each cluster contains homogeneous data, which are heterogeneous with respect to the data of the other clusters. A data object is assigned to a cluster according to the attribute values describing the object. Clustering is used in many fields such as education, industry and agriculture. Clustering uses unsupervised learning techniques: each cluster is represented by its centre, and we can tell which cluster an input vector belongs to by measuring a similarity metric between the input vector and all cluster centres and determining which cluster is nearest or most similar [1] (a small illustrative sketch of this assignment rule is given at the end of this section). The main clustering methods are the following:

• Partitioning methods
  o k-means
  o k-medoids
• Hierarchical methods
  o Agglomerative
  o Divisive
• Grid-based methods
• Density-based methods
  o DBSCAN

Fig 2.1: Methods of clustering techniques
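As a minimal sketch of the assignment rule described above (the class name and the two-dimensional centres are our own illustration, not from any particular tool), the input vector is compared with every cluster centre under Euclidean distance and assigned to the nearest one:

    // Minimal sketch of the assignment rule: an input vector belongs to the
    // cluster whose centre is nearest under a distance metric (Euclidean here).
    public class NearestCentre {
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // Returns the index of the cluster centre nearest to the input vector.
        static int nearestCluster(double[] input, double[][] centres) {
            int best = 0;
            for (int i = 1; i < centres.length; i++) {
                if (distance(input, centres[i]) < distance(input, centres[best])) {
                    best = i;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            double[][] centres = {{0.0, 0.0}, {5.0, 5.0}};
            System.out.println(nearestCluster(new double[]{4.0, 4.5}, centres)); // prints 1
        }
    }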
III. Weka
WEKA was developed by the University of Waikato (New Zealand) and its first modern form was implemented in 1997. It is open source, meaning it is available for public use. WEKA is written in the Java language and contains a GUI for interacting with data files and producing visual results. The front view of WEKA is shown in Figure 3.1.

Figure 3.1: Front view of the WEKA tool

The GUI Chooser consists of four buttons:
• Explorer: an environment for exploring data with WEKA.
• Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command-line interface. [8]
IV. Dataset
For performing the comparative analysis we need data sets. In this research an education data set is taken; this data set is very helpful for researchers. We can apply the data directly in the data mining tools and predict the result.
V. Methodology
The methodology is simple. The education data set, in the form of several data sets of student records, is loaded into WEKA. In WEKA the different clustering algorithms are applied, and useful results are predicted that will be very helpful for new users and new researchers.
categorical data is statistical method. K mean is mainly
VI. Performing clustering on WEKA
To perform cluster analysis, the data set is loaded into WEKA as shown in Figure 6.1. WEKA supports the CSV and ARFF data set formats; here a CSV data set is used, with 2197 instances and 9 attributes in total.

Figure 6.1: Loading the data set into WEKA

After loading, many options are shown in the interface. To perform clustering [10] we click on the Cluster button. We then choose which algorithm to apply to the data, as shown in Fig. 6.2, and click the OK button. (The same steps can also be carried out through WEKA's Java API, as sketched at the end of this section.)

Fig. 6.2: Various clustering algorithms in WEKA

VII. Partitioning methods
As the name suggests, in this method we divide the objects into clusters (groups), and each cluster contains at least one element. The method follows an iterative process by which we can relocate an object from one group to another, more relevant group. This method is effective for small to medium-sized data sets. Examples of partitioning methods include k-means and k-medoids [2].

VII (I) K-Means Algorithm
K-means is a centroid-based technique. The algorithm takes an input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. K-means is mainly based on the distance between an object and the cluster mean, and it computes a new mean for each cluster. Categorical data can be clustered by assigning rank values, which is a statistical method; here the categorical data have been converted into numeric data by assigning rank values [3].

Algorithm: the inputs are k, the number of clusters, and D, a data set containing n objects; the output is a set of k clusters. The algorithm follows these steps:
Step 1: Randomly choose k objects from D as the initial cluster centres.
Step 2: Calculate the distance from each data point to each cluster.
Step 3: If a data point is closest to its own cluster, leave it where it is; if it is not, move it into the closest cluster.
Step 4: Repeat steps 2 and 3 until the most relevant cluster is found for each data point.
Step 5: Update the cluster means by calculating the mean value of the objects in each cluster.
Step 6: Stop (every data point is located in a properly positioned cluster).

K-means was then applied in the WEKA tool; Table 7.1 shows the result.
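As a minimal sketch of these WEKA steps in code (assuming WEKA is on the classpath; the file name students.csv is an illustrative assumption), the data set is loaded and a k-means model with k = 2 is built, printing the squared error and cluster assignments of the kind summarised in Table 7.1:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansOnWeka {
        public static void main(String[] args) throws Exception {
            // DataSource reads both CSV and ARFF; the file name is assumed.
            Instances data = new DataSource("students.csv").getDataSet();

            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(2); // two clusters, as in Table 7.1
            kMeans.buildClusterer(data);
            System.out.println("Within-cluster squared error: " + kMeans.getSquaredError());

            // Summarise how the instances fall into the clusters.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kMeans);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }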
Table 7.1: k-means clustering results

Dataset Name    | Attributes and Instances      | Clustered Instances        | Time to build model | Squared Error | No. of Iterations
Civil           | Instances: 446, Attributes: 9 | 0: 247 (55%), 1: 199 (45%) | 0.02 seconds        | 13.5          | 3
Computer and IT | Instances: 452, Attributes: 9 | 0: 206 (46%), 1: 246 (54%) | 0.02 seconds        | 15.6          | 5
E.C.E           | Instances: 539, Attributes: 9 | 0: 317 (59%), 1: 222 (41%) | 0.27 seconds        | 16.03         | 5
Mechanical      | Instances: 760, Attributes: 9 | 0: 327 (43%), 1: 433 (57%) | 0.06 seconds        | 22.7          | 3

Fig. 7.1: Comparison between attributes of k-means
VII (II) K-Medoids Algorithm
This is a variation of the k-means algorithm that is less sensitive to outliers [5]. Instead of the mean, an actual object is used to represent each cluster, with one representative object per cluster. Clusters are generated by grouping points that are close to their respective representatives. The function used for classification is a measure of the dissimilarity between the points in a cluster and their representative [5]. The partitioning is done by minimising the sum of the dissimilarities between each object and its cluster representative. This criterion is called the absolute-error criterion:

Sum of absolute error E = Σ_{i=1}^{N} Σ_{p ∈ C_i} dist(p, o_i)

where p represents an object in the data set, o_i is the i-th representative, and N is the number of clusters.
Two well-known types of k-medoids clustering [6] are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications).
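As an illustration (our own sketch, not the paper's code), the absolute-error criterion can be computed directly once every object is assigned to its nearest representative:

    // Sketch of the absolute-error criterion: each object p contributes
    // dist(p, o_i), its distance to the representative (medoid) of its own
    // cluster, i.e. the nearest medoid under this assignment.
    public class AbsoluteError {
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        static double absoluteError(double[][] points, double[][] medoids) {
            double total = 0.0;
            for (double[] p : points) {
                double nearest = Double.MAX_VALUE;
                for (double[] m : medoids) {
                    nearest = Math.min(nearest, distance(p, m));
                }
                total += nearest;
            }
            return total;
        }
    }

PAM repeatedly swaps a medoid with a non-medoid object and keeps the swap whenever this criterion decreases.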
VIII. Hierarchical Clustering
This method provides a tree relationship between the clusters. In this method the number of clusters initially matches the number of data points: if we have n data points, we begin with n clusters. It is of two types:

Agglomerative (bottom-up): a bottom-up approach that starts from sub-clusters, then merges the sub-clusters and builds one big cluster at the top.

Figure 8.1: Hierarchical clustering process [7]

Divisive (top-down): works in the opposite way to the agglomerative approach. It starts from the top with one big cluster and decomposes it into smaller clusters, thus starting at the top and reaching the bottom. Table 8.1 shows the result.

Table 8.1: Hierarchical clustering results

Dataset Name    | Attributes and Instances      | Clustered Instances         | Time to build model
Civil           | Instances: 446, Attributes: 9 | 0: 445 (100%), 1: 1 (0%)    | 2.09 seconds
Computer and IT | Instances: 452, Attributes: 9 | 0: 305 (67%), 1: 147 (33%)  | 4.02 seconds
E.C.E           | Instances: 539, Attributes: 9 | 0: 538 (100%), 1: 1 (0%)    | 3.53 seconds
Mechanical      | Instances: 760, Attributes: 9 | 0: 758 (100%), 1: 2 (0%)    | 13 seconds

Fig. 8.2: Comparison between attributes of hierarchical clustering
IX. Grid-based methods
The grid-based clustering approach uses a multi-resolution grid data structure. It quantises the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. We present two examples: STING and CLIQUE.

STING (Statistical Information Grid): STING is used mainly with numerical values. It is a grid-based multi-resolution clustering technique in which the numerical attributes are computed and stored in rectangular cells. The quality of the clustering produced by this method is directly related to the granularity of the bottom-most layer, approaching the result of DBSCAN as the granularity approaches zero [2].

CLIQUE (Clustering In QUEst): CLIQUE was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. It is a subspace partitioning algorithm introduced in 1998.
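As a toy illustration of the grid idea (our own sketch, not STING or CLIQUE themselves), every object can be mapped to exactly one cell by quantising each attribute into a fixed number of equal-width intervals:

    // Toy sketch of a grid structure: quantise each attribute into
    // cellsPerDim equal-width intervals so each object falls in one cell.
    public class GridCells {
        static int[] cellOf(double[] point, double[] min, double[] max, int cellsPerDim) {
            int[] cell = new int[point.length];
            for (int d = 0; d < point.length; d++) {
                double width = (max[d] - min[d]) / cellsPerDim;
                int idx = (int) ((point[d] - min[d]) / width);
                cell[d] = Math.min(idx, cellsPerDim - 1); // clamp the upper boundary
            }
            return cell;
        }

        public static void main(String[] args) {
            double[] min = {0.0, 0.0}, max = {10.0, 10.0};
            int[] cell = cellOf(new double[]{7.3, 2.1}, min, max, 4);
            System.out.println(cell[0] + "," + cell[1]); // prints 2,0
        }
    }

Statistical summaries (counts, means, etc.) are then stored per cell, and clustering operates on the cells rather than on the individual objects.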
X. Density-based methods
X.I. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It uses the concepts of "density reachability" and "density connectivity", both of which depend on two input parameters: the size of the epsilon neighbourhood (ε) and the minimum number of points (MinPts) in terms of the local distribution of nearest neighbours. The parameter ε controls the size of the neighbourhood and hence the size of the clusters. The algorithm starts with an arbitrary point that has not been visited [4]. DBSCAN is an important clustering technique that is widely used in the scientific literature. Density is measured by the number of objects nearest to the cluster.
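The role of the two input parameters can be illustrated with a sketch of DBSCAN's core-point test only (not a full implementation; the parameter names follow the description above):

    // A point is a core point when its epsilon-neighbourhood contains at
    // least minPts points (counting the point itself, as is conventional).
    public class CorePointTest {
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        static boolean isCorePoint(double[] p, double[][] data, double eps, int minPts) {
            int neighbours = 0;
            for (double[] q : data) {
                if (distance(p, q) <= eps) {
                    neighbours++;
                }
            }
            return neighbours >= minPts;
        }
    }

Clusters are then grown by connecting core points whose neighbourhoods overlap (density reachability), and points that belong to no cluster are reported as noise.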
Table 10.1: DBSCAN clustering results

Dataset Name    | Attributes and Instances      | Clustered Instances | Time to build model
Civil           | Instances: 446, Attributes: 9 | 446                 | 4.63 seconds
Computer and IT | Instances: 452, Attributes: 9 | 452                 | 6.13 seconds
E.C.E           | Instances: 539, Attributes: 9 | 539                 | 11.83 seconds
Mechanical      | Instances: 760, Attributes: 9 | 760                 | 23.95 seconds
X.II. OPTICS stands for Ordering Points To Identify the Clustering Structure. DBSCAN burdens the user with choosing the input parameters; moreover, different parts of the data could require different parameters [5]. OPTICS is an algorithm for finding density-based clusters in spatial data which addresses one of DBSCAN's major weaknesses, namely detecting meaningful clusters in data of varying density.
Table 10.2: OPTICS clustering results

Dataset Name    | Attributes and Instances      | Clustered Instances | Time to build model
Civil           | Instances: 446, Attributes: 9 | 446                 | 5.42 seconds
Computer and IT | Instances: 452, Attributes: 9 | 452                 | 6.73 seconds
E.C.E           | Instances: 539, Attributes: 9 | 539                 | 9.81 seconds
Mechanical      | Instances: 760, Attributes: 9 | 760                 | 23.85 seconds

XI. Experimental results
Here the various clustering methods were applied to the student record data and compared using the WEKA tool. From these comparisons we find which method performs better. Fig. 11.1 shows the comparison according to the time taken to build a model.

Fig. 11.1: Comparison according to the time taken to build a model

According to this result we can say that k-means provides better results than the other methods. However, on the basis of this single attribute we cannot use k-means every time; any of the other methods can be used when time is not the important criterion.
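The build-time comparison itself can be reproduced with a small sketch (the file name is an assumption; the DBSCAN and OPTICS implementations are packaged separately in some WEKA versions and are omitted here):

    import weka.clusterers.AbstractClusterer;
    import weka.clusterers.HierarchicalClusterer;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildTimeComparison {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("students.csv").getDataSet(); // assumed file name
            AbstractClusterer[] clusterers = {
                new SimpleKMeans(), new HierarchicalClusterer()
            };
            for (AbstractClusterer c : clusterers) {
                long start = System.currentTimeMillis();
                c.buildClusterer(data); // the quantity reported in the tables above
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(c.getClass().getSimpleName() + ": " + elapsed + " ms");
            }
        }
    }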

XII. Conclusion
Data mining covers every field of our life; it is mainly used in banking, education, business, etc. In this paper we have provided an overview and comparison of the classes of clustering algorithms: partitioning, hierarchical, density-based and grid-based methods. Under the partitioning methods we have applied k-means and its variant k-medoids in the WEKA tool. Under hierarchical clustering we have discussed the two approaches, top-down and bottom-up. We have also applied the DBSCAN and OPTICS algorithms under the density-based methods. Finally, we have used the STING and CLIQUE algorithms under the grid-based methods. This comparative study of data mining techniques is shown in the tables above. Thus we can say that every technique is important in its own functional area, and the capability of data mining techniques can be improved by removing their limitations.

References
[1] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[2] Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, 2007.
[3] Patnaik, Sovan Kumar, Soumya Sahoo, and Dillip Kumar Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach," International Journal of Computer Applications, 43.2: 1-3, 2012.
[4] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[5] Pavel Berkhin, "Survey of Clustering Data Mining Techniques," 2002.
[6] C. Y. Lin, M. Wu, J. A. Bloom, I. J. Cox, and M. Miller, "Rotation, scale, and translation resilient public watermarking for images," IEEE Trans. Image Processing, vol. 10, no. 5, pp. 767-782, May 2001.
[7] Pallavi, Sunila Godara, "A Comparative Performance Analysis of Clustering Algorithms," International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 1, Issue 3, pp. 441-445.
[8] Bharat Chaudhari, Manan Parikh, "A Comparative Study of Clustering Algorithms Using Weka Tools," International Journal of Application or Innovation in Engineering & Management (IJAIEM).
[9] Meila, M. and Heckerman, D. (February 1998). An experimental comparison of several clustering and initialization methods. Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA.

