
International Journal of Computer Information Systems, Vol. 3, No. 4, 2011

Feature Extraction Using K-means Clustering: An Approach & Implementation


Kharabela Parida (1), Sumanta Kumar Mandal (2), Sudhansu Sekhar Das (3), Alok Ranjan Tripathy (4)
(1,2,3,4) Dept. of Computer Science, College of Engineering Bhubaneswar, Bhubaneswar, Odisha, India. Emails: kharabelaparida@gmail.com, sumantab4u2@gmail.com, sudhansu.das09@rediffmail.com, tripathyalok@gmail.com.
Abstract- In attempting to classify real-world objects or concepts using computational methods, the selection of an appropriate representation is of considerable importance. Feature extraction is the process of deriving new features from the original features in order to reduce the cost of feature measurement, increase classifier efficiency, and allow higher classification accuracy. Many current feature extraction techniques involve linear transformations of the original pattern vectors to new vectors of lower dimensionality. In today's scenario, much importance is given to classifying data using fewer attributes, with the accuracy of the classification being the major challenge. Clustering and classification are two important tasks of data mining. In this work, we propose a feature extraction method for classification using K-means clustering. The new model is proposed to reduce the input feature space, which not only decreases the learning time of classifiers but also improves the prediction accuracy according to the chosen relevance criterion. The raw data is preprocessed and normalized, and the data points are then clustered using the k-means technique. Feature vectors for all the classes are generated by extracting the most relevant features from the corresponding clusters and are used for further classification. A KNN classifier is used to perform the classification task. Experiments are conducted on several datasets, and the accuracy obtained by performing dataset-specific feature extraction is compared with the generic feature extraction scheme. The algorithm performs relatively well with respect to classification results when compared with the specific feature extraction technique.

Keywords: Cluster Analysis, K-means, KNN

I. INTRODUCTION

Cluster analysis is the process of grouping objects into subsets that have meaning in the context of a particular problem. The objects are thereby organized into an efficient representation that characterizes the population being sampled. Unlike classification, clustering does not rely on predefined classes. Clustering is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects [1, 2]. It can uncover previously undetected relationships in a complex data set. Many applications for cluster analysis exist. For example, in a business application, cluster analysis can be used to discover and characterize customer groups for marketing purposes.

II. CLUSTER ANALYSIS

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids and several other methods have also been built into many statistical analysis software packages. In machine learning, clustering is an example of unsupervised learning [3]. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples.
In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research include the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.

Two types of clustering algorithms are nonhierarchical and hierarchical. In nonhierarchical clustering, such as the k-means algorithm [4], the relationship between clusters is undetermined. Hierarchical clustering repeatedly links pairs of clusters until every data object is included in the hierarchy. With both of these approaches, an important issue is how to determine the similarity between two objects, so that clusters can be formed from objects with a high similarity to each other. Commonly, distance functions, such as the Manhattan and Euclidean distance functions, are used to determine similarity. A distance function yields a higher value for pairs of objects that are less similar to one another. Sometimes a similarity function is used instead, which yields higher values for pairs that are more similar.

Data clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying the data set into k clusters is often referred to as k-clustering. Simply speaking, k-means clustering is an algorithm that groups objects into K groups based on their attributes or features, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids. Thus the purpose of K-means clustering is to classify the data.

III. PROBLEM FORMULATION

The aim of this process is to develop a generic system, which takes raw data as its input. The most important aspect in developing a generic system is feature extraction. The input to the system is in the form of instances {xi, ti}, where xi is the set of attributes and ti is the label of the instance. The aim is to select a set of attributes yi, a subset of xi, such that yi represents the instance completely and is useful for classification. The parameter involved in the selection of yi is the accuracy of classification, i.e., using yi the classifier must be able to match ti. The steps involved in the system are given below.

Input: Raw data
Output: Extracted features
1. PreprocessData( );
2. ClusterData( ); {generates cluster centres for each cluster based on the number of classes}
3. ExtractFeatures( ); {uses the cluster centres to choose the attributes that contribute most towards differentiating the classes}
4. ClassifyData( ); {classifies data based on the features selected}

IV. SYSTEM ARCHITECTURE

The system consists of four modules: pre-processing, clustering, feature extraction and classification. The overall system architecture is shown in Figure 1. The raw data is passed through the system; it can be in the form of any numerical data or in the form of waves. Appropriate techniques are applied to obtain the preprocessed data. Next, the data is passed through the clustering phase, which returns the cluster centres. Feature extraction is then performed to obtain the attributes that can completely represent a given instance.
The features selected, along with the pre-processed data, are then passed through the classifier for testing. Higher classification accuracy indicates better quality of the extracted features. The system can be tested with various datasets to check the generic nature of the feature extraction process. A minimal end-to-end sketch of this pipeline is given below.
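For illustration only, the following Python sketch walks through the four stages under simple assumptions: it uses the Iris dataset, off-the-shelf scikit-learn components, and keeps the two most relevant attributes. None of this reflects the authors' MATLAB implementation or their exact datasets.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# 1. Preprocess: normalise all attributes to a common range.
X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# 2. Cluster: one cluster per class; the labels are not used in this step.
n_classes = len(np.unique(y))
centres = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(X).cluster_centers_

# 3. Extract features: rank attributes by the minimum pairwise distance
#    between cluster centres along each dimension, and keep the best ones.
pair_dists = np.abs(centres[:, None, :] - centres[None, :, :])   # shape (k, k, n_attributes)
iu = np.triu_indices(n_classes, k=1)
min_dist = pair_dists[iu].min(axis=0)                             # per-attribute minimum
selected = np.argsort(min_dist)[::-1][:2]                         # keep the 2 most relevant

# 4. Classify with k-NN on the selected attributes only.
X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("accuracy:", knn.score(X_te, y_te))

The individual stages are described in more detail in the following subsections.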



Figure 1: Proposed system. Raw data is pre-processed (normalisation), clustered using k-means to obtain cluster centres, features are extracted, and the testing data is classified to produce the result.

A. Preprocess Data

The pre-processing stage is performed to convert all attributes of the data into a numeric form that can be used by the clustering process. This is extremely useful for reducing the dimension of the dataset. Another form of preprocessing is normalization. The values of some attributes may vary over very different ranges; to reduce the effect of such attributes, all attribute values are normalized to lie in some common range, like [1, 5].

B. Cluster Data

Clustering is an important step, as it is an essential precursor to feature extraction. The input for feature extraction is the preprocessed data, with the labels stripped off. Clustering is a form of unsupervised learning that helps to find the inherent structure in the data. Using clustering, it is possible to find similar points without actually knowing the labels, and hence to identify the attributes that make a point similar to some points and dissimilar to others. Many clustering algorithms have been developed and studied, for example, k-means and the fuzzy c-means (FCM) clustering algorithms. The K-means technique has proved to be more general and useful in the case of overlapping clusters, a common scenario in some real datasets. A brief description of the K-means algorithm is given below.

Algorithm I: K-Means( )

Begin
  Initialize the matrix M by taking the objects as rows and the attributes as columns
  Initialize the matrix C, which contains the initial cluster centroids C1, C2, ..., Ck
  Initialize the group matrix G(0)
  while (G(i) != G(i+1))
    Determine the new centroid coordinates
    Determine the distance of each object to the centroids
    Re-arrange the group matrix
  endwhile
End

Once the cluster centres have been computed, the membership matrix is updated according to the location of the cluster centres. To calculate the new value of a point with respect to a particular cluster, the distance of that point from that cluster centre as well as the distances of the point from all other cluster centres are taken into account. The change in the centroid matrix is computed. If this change is lower than a predefined threshold, then the process is stopped; otherwise, new cluster centres are calculated and the centroid matrix is updated with respect to the new cluster centres. The iteration continues till the change in the membership matrix is minimized.
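As a concrete illustration of the algorithm above, a minimal NumPy implementation of k-means might look as follows. This is a sketch, not the authors' MATLAB code; the random-sampling initialisation and the convergence test on centroid movement are assumptions.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means: returns (centres, labels) for data matrix X (objects x attributes)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids C1..Ck
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Distance of each object to every centroid, then re-arrange the group matrix.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Determine the new centroid coordinates (keep the old centre if a cluster is empty).
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # Stop when the change in the centroid matrix falls below the threshold.
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return centres, labels

Here the labels array plays the role of the group matrix G in the pseudocode, and convergence is declared when the centroids stop moving by more than the threshold.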
C. Extract Features

An important step in building a generic data mining system is a feature extraction process that can be applied to different datasets. A generic feature extraction process should be built that does not involve the nature of the attributes, but only the attribute values. It is evident from FCM that, even though the cluster centres are obtained in k-dimensional space, where k is the number of attributes, the nature of the attributes does not contribute to the cluster centres. Hence, these cluster centres can be used to choose the attributes that distinguish between dissimilar points.

Consider an n-class problem with k attributes. Let (c_1, c_2, ..., c_n) be the n cluster centres, each expressed in k dimensions as c_i = [c_i1, c_i2, ..., c_ik]. Intuitively, attributes along which the cluster centres are far apart are suited for classification, since the classes are better separated in those dimensions. Consider first the case n = 2, where there are just two centres. The distance between the cluster centres along attribute i is dist_i = |c_1i - c_2i|. Let dist be the vector containing the elements dist_i for all dimensions. The number of attributes j to pass to the classifier is then chosen, and the selected attributes (i_1, i_2, ..., i_j) are obtained as

  i_1 = arg max_i (dist_i),
  i_m = arg max_i (dist_i) over the attributes not already selected, for m = 2, 3, ..., j;

that is, the attributes are picked in decreasing order of the distance between the centres along that attribute.

However, for a multi-class problem (n > 2), the feature selection is not as straightforward, because it is not obvious which pair of cluster centres should be used in the computation of dist_i. One solution is to use, for each attribute, the two cluster centres whose distance along that attribute is minimum. There are n(n - 1)/2 pairwise distances to consider, computed as

  dist_i,jl = |c_ji - c_li| for all cluster pairs j, l such that j ≠ l,
  dist_i = min over j, l of dist_i,jl.

Once dist_i is calculated, the features can be extracted in the same way as in the two-class problem, and the algorithm is given below.

Algorithm II: Extract Features( )
Begin
  n: number of cluster centres; k: number of attributes
  Attr ← empty list
  for i = 1 to k
    dist ← ∅
    for j = 1 to n - 1
      for l = j + 1 to n
        dist ← dist ∪ { |c_ji - c_li| }
      endfor
    endfor
    mindist_i ← min(dist)
  endfor
  for m = 1 to k
    Attr ← Attr ∪ { arg max_i (mindist_i) over attributes i not already in Attr }
  endfor
End

The algorithm generates a list of attributes, Attr, ordered by their relevance: the most relevant features appear at the beginning of Attr and the least relevant at the end. To perform classification, the required number of most relevant features is selected.
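A compact NumPy sketch of this centre-distance ranking, one possible reading of Algorithm II (the name rank_attributes and the toy centres are chosen here for illustration), is:

import numpy as np

def rank_attributes(centres):
    """Rank attributes by how well they separate cluster centres.

    centres: array of shape (k, n) with k cluster centres in n attribute dimensions.
    Returns attribute indices ordered from most to least relevant, i.e. by decreasing
    minimum pairwise centre distance along each attribute.
    """
    k, n = centres.shape
    # |c_{j,i} - c_{l,i}| for every pair of centres (j, l) and every attribute i.
    pair_dists = np.abs(centres[:, None, :] - centres[None, :, :])   # shape (k, k, n)
    j_idx, l_idx = np.triu_indices(k, k=1)                            # all pairs with j < l
    mindist = pair_dists[j_idx, l_idx, :].min(axis=0)                 # per-attribute minimum
    return np.argsort(mindist)[::-1]                                  # most relevant first

# Example: attribute 0 separates the centres much better than attribute 1.
centres = np.array([[0.0, 0.5], [1.0, 0.55], [2.0, 0.45]])
print(rank_attributes(centres))   # -> [0 1]

With the example centres above, attribute 0 separates every pair of centres by at least 1.0, while attribute 1 separates them by at most 0.1, so attribute 0 is ranked first.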

D. Classify Data

After the feature extraction phase, the quality of the extracted features is quantified by the accuracy of the classifier: the better the features, the higher the classification accuracy. The k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor. The same method can be used for regression, by assigning the property value for the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. (A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is a generalization of linear interpolation.)

The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data. Nearest neighbor rules in effect compute the decision boundary in an implicit manner; it is also possible to compute the decision boundary explicitly, and to do so efficiently so that the computational complexity is a function of the boundary complexity.

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. Usually Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric such as the overlap metric (or Hamming distance) can be used. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.

A drawback of the basic majority-voting classification is that the classes with more frequent examples tend to dominate the prediction of the new vector, as they tend to come up in the k nearest neighbors simply because of their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors.
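To make the classification step concrete, a minimal self-contained k-NN classifier in Python might look like the following. It is a sketch assuming Euclidean distance and plain majority voting, not the authors' implementation; the function name knn_predict and the toy data are illustrative.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each row of X_test by majority vote among its k nearest training samples."""
    predictions = []
    for x in X_test:
        # Euclidean distance from the query point to every training example.
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
        votes = Counter(y_train[nearest])               # count class labels among them
        predictions.append(votes.most_common(1)[0][0])  # most frequent label wins
    return np.array(predictions)

# Toy usage: two well-separated classes in 2-D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.05, 0.1], [0.95, 1.0]]), k=3))  # -> [0 1]

With k = 3, each toy query falls clearly inside one of the two classes, so the printed labels are [0 1].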
V. IMPLEMENTATION AND RESULTS

The method was implemented in MATLAB on different datasets, and the resulting accuracy compares favorably with the PCA and LDA methods. In the first dataset, 3 attributes are used and 3 clusters are formed. In Figure 2.2, 4 attributes are used and 3 clusters are formed. In the last figure, Figure 2.3, three attributes are used for the Swiss dataset. The accuracy of classification degrades as the number of attributes taken from the dataset is reduced. A sketch of this kind of experiment, accuracy as a function of the number of selected attributes, is given below.
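For illustration, the following Python sketch shows how such an accuracy-versus-number-of-attributes curve can be produced. The dataset, the train/test split and the use of scikit-learn are assumptions; they do not reproduce the authors' MATLAB experiments or their random datasets.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Rank attributes by the minimum pairwise distance between cluster centres (Algorithm II).
centres = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr).cluster_centers_
pair = np.abs(centres[:, None, :] - centres[None, :, :])
j, l = np.triu_indices(len(centres), k=1)
order = np.argsort(pair[j, l, :].min(axis=0))[::-1]     # most relevant attribute first

# Classification accuracy as the number of selected attributes is reduced.
for m in range(X.shape[1], 0, -1):
    feats = order[:m]
    acc = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, feats], y_tr).score(X_te[:, feats], y_te)
    print(f"{m} attributes -> accuracy {acc:.3f}")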


Figure 2.1: Clustering using random dataset 1
Figure 2.2: Clustering using random dataset 2
Figure 2.3: Clustering using the Swiss dataset

The following two graphs show the classification accuracy as the number of attributes taken from a randomized dataset is reduced. Accuracy decreases when fewer attributes are chosen. In Figure 2.4, the accuracy is measured taking 3 clusters into account, and Figure 2.5 is tested on a random dataset taking 6 clusters into account.

Figure 2.4: Classification accuracy vs. number of attributes, taking 3 clusters
Figure 2.5: Classification accuracy vs. number of attributes, taking 6 clusters


VI. CONCLUSION

The K-means clustering process for feature extraction is a powerful method for reducing a number of observed variables into a smaller number of artificial variables that account for most of the variance in the data set. This extraction method compares well with PCA because it attempts to group the data by user-specified criteria, and it is particularly useful when a data reduction procedure is needed that makes no assumptions concerning an underlying causal structure responsible for the covariation in the data. Although the method is effective and has good running-time complexity, it is an unsupervised method and therefore does not always give a good view of the class structure. Overall, this method is useful for smaller datasets for obtaining better accuracy while reducing the number of features in the dataset.
REFERENCES

[1] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park. Fast algorithms for projected clustering. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 61-72, 1999.
[2] Frisvad. Cluster Analysis. Based on H.C. Romesburg: Cluster Analysis for Researchers, Lifetime Learning Publications, Belmont, CA, 1984, and P.H.A. Sneath and R.R. Sokal: Numerical Taxonomy, Freeman, San Francisco, CA, 1973.
[3] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd Edition, March 2006. ISBN 1-55860-901-6.
[4] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y. Wu (2002). "An efficient k-means clustering algorithm: Analysis and implementation". IEEE Transactions on Pattern Analysis and Machine Intelligence 24: 881-892.
[5] D.W. Aha and R.L. Bankert. A comparative evaluation of sequential feature selection algorithms. Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, 1995.

AUTHORS' PROFILE

Kharabela Parida completed his B.Tech in Computer Science & Engineering from BPUT University. He is now pursuing his M.Tech in Computer Science & Engineering at College of Engineering Bhubaneswar, Odisha. His research areas are data mining, ad-hoc networks, etc.

Sumanta Kumar Mandal completed his MCA from BPUT University. He is now pursuing his M.Tech in Computer Science & Engineering at College of Engineering Bhubaneswar, Odisha. His research areas are data mining, mobile communication, ad-hoc networks, etc.

Sudhansu Sekhar Das completed his MCA from IGNOU. He is now pursuing his M.Tech in Computer Science & Engineering at College of Engineering Bhubaneswar, Odisha. His research areas are data mining, ad-hoc networks, project management, etc.

Alok Ranjan Tripathy completed his M.Tech from Utkal University. He is presently working at College of Engineering Bhubaneswar as an Asst. Professor in the Department of Computer Science & Engineering. His research areas are parallel processing, mobile communication, sensor networks, etc.
