
© May 2017 | IJIRT | Volume 3 Issue 12 | ISSN: 2349-6002

AN OPTIMIZED K-MEANS ALGORITHM

Miss Geeta Rani¹ and Mr. Sachin Shrivastava²

¹,² Department of Computer Science & Engineering, MDU (Rohtak)
Abstract—Data mining is a new technology that is developing together with databases and artificial intelligence. It is the process of extracting credible, novel, effective and understandable patterns from databases. Cluster analysis is an important data mining technique used to find data segmentation and pattern information. By clustering the data, people can obtain the data distribution, observe the characteristics of each cluster, and study particular clusters further.
Cluster analysis is one of the main analytical methods of data mining, and the method chosen directly influences the clustering result. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings, in particular that k-means calculates the distance of each data point from each cluster centre. Calculating this distance in each iteration makes the algorithm inefficient. This paper introduces an optimized algorithm which solves this problem. This is done by introducing a simple data structure that stores some information in every iteration and uses this information in the next iteration.
Index Terms—Data mining, k-means algorithm, Clustering

I. INTRODUCTION
Data mining is also known as knowledge discovery in databases (KDD). It is commonly defined as the process of searching for useful patterns or knowledge in data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization. "Computers have promised us a fountain of wisdom but delivered a flood of data" (William J. Frawley, Gregory Piatetsky-Shapiro). Clustering is a process of grouping data objects into disjoint clusters so that the data in the same cluster are similar, but data belonging to different clusters differ. A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. This paper describes the standard k-means clustering algorithm and analyses its drawbacks, such as the fact that k-means calculates the distance of each data point from each cluster centre. Calculating this distance in each iteration makes the algorithm inefficient. This paper introduces an optimized algorithm which solves this problem.
The process, which is called k-means, appears to give partitions which are reasonably efficient in the sense of within-class variance, corroborated to some extent by mathematical analysis and practical experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer.
K-means clustering is a type of unsupervised learning. It is used when the user has unlabeled data, that is, data with no defined categories or groups. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.

What Is Clustering?
Clustering is defined as the grouping of a particular set of objects, based on their properties and characteristics, according to their similarities. Clustering can be analyzed as hard partitioning or soft partitioning. In hard partitioning, each object either belongs to a cluster completely or does not belong to it at all. In soft partitioning, every object belongs to each cluster to a certain degree.
There are several different ways to implement this partitioning, based on distinct models:
Centralized – each cluster is represented by a single mean vector, and an object's value is compared to these mean values
Distributed – the cluster is built using statistical distributions
Connectivity – the connectivity in these models is based on a distance function between elements
Group – algorithms have only group information
Graph – cluster organization and the relationship between members are defined by a graph-linked structure

IJIRT 144530 INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH IN TECHNOLOGY 126



Density – members of the cluster are grouped by regions where observations are dense and similar

II. ANALYSIS OF STANDARD K-MEANS ALGORITHM
According to the basic k-means algorithm, the clusters depend entirely on the selection of the initial cluster centroids. K data elements are selected as initial centers; then the distances of all data elements to these centers are calculated using the Euclidean distance formula. Each data element is moved to the cluster whose centroid is nearest. The process continues until no more changes occur.
Input: Number of desired clusters K, data objects {d1, d2, …, dn}.
Output: A set of K clusters.
Steps:
1. Randomly select K data objects from data set D as initial centers.
2. Repeat:
   a. Calculate the distance between each data object di (1 <= i <= n) and all K cluster centers Cj (1 <= j <= K), and assign each data object di to the nearest cluster.
   b. For each cluster j (1 <= j <= K), recalculate the cluster center.
3. Until there is no change in the cluster centers.
The time complexity of the k-means algorithm is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations. Data clustering is a data exploration technique that allows objects with similar characteristics to be grouped together.
The k-means algorithm always converges to a local minimum. Before converging, it calculates the distance of each data object from each cluster centre in every iteration of the loop.

III. HOW THE K-MEAN CLUSTERING ALGORITHM WORKS

The Drawbacks of the K-Means Algorithm
1. Like many clustering methods, the k-means algorithm assumes that the number of clusters k in the database is known beforehand, which is not necessarily true in real-world applications.
2. As an iterative technique, the k-means algorithm is especially sensitive to the selection of the initial centers.
3. The k-means algorithm may converge to local minima.

IV. PROPOSED WORK
Optimized K-Means Clustering Algorithm
The optimized k-means algorithm is divided into two phases.
Phase-I. In this phase, the cluster size is fixed, and the output of the phase forms the initial clusters. Here, the input array of elements is scanned and split up into sub-arrays, which represent the initial clusters.
Phase-II. In the second phase, the cluster sizes vary, and the output of this phase is the finalized clusters. The initial clusters are the inputs for this phase. The centroids of these initial clusters are computed first, and the distances of the data elements from these centroids are then calculated. Data elements whose distance to their own centroid is less than or equal to their distance to every other centroid remain in the same cluster; otherwise they are moved to the appropriate cluster. The entire process continues until no changes in the clusters are detected.

Steps of Algorithm:
The algorithm is divided into two phases.


In Phase-I, we find the initial clusters, while in Phase-II, data elements are moved to the appropriate clusters.
Phase-I: To find the initial clusters
Input: Array {a1, a2, a3, …, an}, K // number of required clusters
Output: A set of initial clusters.
Steps:
1. Find the size of each cluster Si (1 <= i <= K) as Floor(n/K), where n = number of data points Dp (a1, a2, a3, …, an) and K = number of clusters.
2. Create K arrays Ak.
3. Move data points (Dp) from the input array to Ak until Si = Floor(n/K).
4. Continue Step 3 until all Dp are removed from the input array.
5. Exit with K initial clusters.

Phase-II: To find the final clusters
Input: A set of initial clusters.
Output: A set of K clusters.
Steps:
1. Compute the arithmetic mean M of each initial cluster Ci.
2. Set 1 <= j <= K.
3. Compute the distance D of all Dp to the mean M of each initial cluster Cj.
4. If the distance D between Dp and the mean M of its own cluster is less than or equal to its distances to the other means Mi (1 <= i <= K), then Dp stays in the same cluster.
5. Else, Dp is assigned to the cluster Ci for which D is smallest.
6. For each cluster Cj (1 <= j <= K), recompute M and move Dp until there is no change in the clusters.

I. Comparison of Basic and Proposed K-Mean Clustering Algorithm:
• In the basic k-means algorithm, the initial centroids are selected randomly from the input data, so the clusters vary from one run to another; as a result, the number of iterations and the total elapsed time also change in each run on the same data.
• In the proposed k-means algorithm, the initial centroids are calculated, and since the data is the same, the calculations are the same in every run, so the number of iterations remains constant and the elapsed time is also improved. This is why the proposed k-means clustering algorithm is more efficient than the basic k-means algorithm.
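The difference between the two initializations can be illustrated with a minimal sketch. It assumes one-dimensional numeric data and absolute difference as the distance, and it is not the authors' implementation. The function names (`assign`, `means`, `refine`, `basic_kmeans`, `proposed_kmeans`) are hypothetical. Two further choices are assumptions of this sketch, since the text does not specify them: when n is not divisible by K, the leftover elements join the last initial cluster, and an empty cluster keeps its previous center.

```python
import random


def assign(x, centers, current=None):
    """Index of the nearest center; on a tie the point stays where it is (Phase-II rule)."""
    best = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
    if current is not None and abs(x - centers[current]) <= abs(x - centers[best]):
        return current
    return best


def means(data, labels, old):
    """Arithmetic mean of each cluster; an empty cluster keeps its old center (assumption)."""
    out = []
    for j in range(len(old)):
        members = [x for x, lab in zip(data, labels) if lab == j]
        out.append(sum(members) / len(members) if members else old[j])
    return out


def refine(data, centers, labels=None):
    """Reassign points and recompute means until no point changes cluster.

    Returns (labels, centers, iterations), where iterations counts the passes
    in which at least one point changed cluster.
    """
    labels = labels or [None] * len(data)
    iterations = 0
    while True:
        new = [assign(x, centers, lab) for x, lab in zip(data, labels)]
        if new == labels:
            return labels, centers, iterations
        labels, centers = new, means(data, new, centers)
        iterations += 1


def basic_kmeans(data, k, seed=None):
    """Basic k-means: K data objects chosen at random serve as initial centers."""
    centers = [float(c) for c in random.Random(seed).sample(data, k)]
    return refine(data, centers)


def proposed_kmeans(data, k):
    """Two-phase variant: fixed-size initial clusters, then refinement (needs n >= k)."""
    size = len(data) // k  # Phase-I: cluster size Floor(n/K)
    # Consecutive chunks of the input array; leftovers go to the last cluster (assumption).
    labels = [min(i // size, k - 1) for i in range(len(data))]
    centers = means(data, labels, [0.0] * k)  # Phase-II, step 1
    return refine(data, centers, labels)


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
print(proposed_kmeans(data, 3))
# → ([0, 0, 1, 1, 0, 2, 2, 1, 2], [3.0, 11.0, 25.0], 1)
```

Because `proposed_kmeans` involves no random choice, repeated calls on the same array return the same clusters and the same iteration count, while the result of `basic_kmeans` depends on the `seed` used to pick the initial centers.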


V. CONCLUSION
In our work we have combined various similarity measures to generate an effective matching function. The effectiveness of the matching function depends upon all similarity measures, based on the weights given by a genetic algorithm. So, to have an effective matching function, both semantic and syntactic aspects should be taken into consideration while choosing similarity measures. In basic k-means clustering, the initial clusters are based on randomly selected centroids.
In this paper an enhanced k-means algorithm is introduced and compared with the basic k-means algorithm. The enhanced k-means clustering algorithm can be used with any type of integer data. Compared with the basic k-means clustering algorithm, performance in terms of number of iterations and time complexity is improved.

ACKNOWLEDGMENT
I would like to express my gratitude to my guide, Mr. Sachin Shrivastava, for his support, guidance and help throughout this research. The research on "An Optimized K-Means Algorithm" was given to me as part of the curriculum of the two-year master's degree in computer science & engineering. I have tried my best to present this information as clearly as possible using basic terms. I would be failing in my duty if I did not acknowledge the esteemed scholarly guidance, assistance and knowledge.
