
K-means with Four Different Distance Metrics

Mentor
Zenun Kastrati
zenun.kastrati@uni-pr.edu

Student
Rinor Dreshaj
rinordreshaj@gmail.com

ABSTRACT
The k-means algorithm partitions data into k distinct clusters, but it does not tell you whether that is the correct number of clusters. Your data might naturally contain five clusters, but if you feed k-means the number 3 you will get three clusters back. Those clusters will be bigger, looser and more awkwardly shaped than if you had asked for five. The power of the k-means algorithm lies in its computational efficiency and the ease with which it can be used. Different distance metrics are used to find similar data objects, which leads to robust algorithms for data mining functionalities such as classification and clustering. This paper discusses the results obtained by implementing the k-means algorithm with four different distance metrics (Euclidean, Euclidean squared, Manhattan and Chebyshev) on ninety-dimensional data, along with a comparative study of those results. Results are displayed with the help of histograms.

General Terms
Algorithms, Measurement, Performance.

Keywords
Centroids, clustering, metrics, normalization.

1. INTRODUCTION
Clustering is the grouping of data: dividing a large data set into smaller data sets that share some similarity. A well-known clustering algorithm in unsupervised machine learning is K-means. The K-means algorithm takes n observations (data points) and groups them into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid).

2. DISTANCE METRICS
2.1 Euclidean Distance
Euclidean distance computes the root of the sum of squared differences between the coordinates of a pair of objects.

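For two n-dimensional points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), this is the standard formula

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
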
2.2 Manhattan Distance


Manhattan distance computes the sum of the absolute differences between the coordinates of a pair of objects.
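
In the same notation:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|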

2.3 Chebyshev Distance


Chebyshev distance is also known as maximum value distance; it is computed as the maximum of the absolute differences between the coordinates of a pair of objects.
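
In the same notation:

d(x, y) = \max_{1 \le i \le n} |x_i - y_i|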

2.4 Euclidean Squared Distance


Euclidean squared distance computes the sum of the squared differences between the coordinates of a pair of objects.
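
In the same notation, it is the Euclidean distance without the final square root:

d(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2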


3. RESULTS
The results obtained after implementing K-means with the four different distance metrics are shown using histograms. All experiments were performed on test data. The results obtained by using the Euclidean distance metric, i.e. basic k-means, are shown in figure 1.


Figure 1
The results obtained by using the Manhattan distance metric are shown in figure 2.


Figure 2
The results obtained by using the Euclidean squared distance metric are shown in figure 3.


Figure 3


The results obtained by using the Chebyshev distance metric are shown in figure 4.


Figure 4

4. IMPLEMENTATION
K-means for this project was implemented using `node.js`. The main file of the project is called `index.js`. The method that initiates the algorithm is `init()`, and the code that performs the classification is located in the `kMeansAlgorithm()` function.
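
Figure 5 is not reproduced here, so the following is only a minimal sketch of the general shape such a function could have; the parameter names and the convergence test are assumptions, not the project's actual code:

```js
// Minimal k-means sketch (hypothetical; figure 5 shows the project's code).
// points: array of numeric arrays; centroids: k starting centers;
// distanceFn: one of the metrics from section 2.
function kMeansAlgorithm(points, centroids, distanceFn, maxIterations = 100) {
  let assignments = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: attach every point to its nearest centroid.
    assignments = points.map(p => {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (distanceFn(p, centroids[c]) < distanceFn(p, centroids[best])) best = c;
      }
      return best;
    });
    // Update step: move each centroid to the mean of its assigned points.
    const updated = centroids.map((centroid, c) => {
      const members = points.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return centroid; // leave an empty cluster in place
      return centroid.map((_, d) =>
        members.reduce((sum, p) => sum + p[d], 0) / members.length);
    });
    if (JSON.stringify(updated) === JSON.stringify(centroids)) break; // converged
    centroids = updated;
  }
  return { centroids, assignments };
}
```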

Figure 5 - kMeansAlgorithm

The part of the code that reads the data from the file and assigns it to variables is shown in figure 6.

Figure 6 - Reading data from files

The code in the figure above also calls a parseData() function, which parses the data; its implementation is shown in figure 7.

Figure 7 - Parsing data
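
Figures 6 and 7 are likewise not reproduced; the sketch below shows one plausible shape for this code. The file name data.txt and the whitespace-separated, one-observation-per-line format are assumptions:

```js
// Hypothetical reading/parsing step; file name and format are assumed.
const fs = require('fs');

function parseData(raw) {
  return raw
    .split('\n')
    .filter(line => line.trim().length > 0)   // drop empty lines
    .map(line => line.trim().split(/\s+/).map(Number));
}

const points = parseData(fs.readFileSync('data.txt', 'utf8'));
```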
4.1 Normalization
Also part of the K-means algorithm is the module for normalization of the data. This is located in the file called `normalization.js`.



Figure 8 - Normalization
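
The text does not show the contents of normalization.js; a common choice, assumed in the sketch below, is per-dimension min-max scaling into [0, 1]:

```js
// Assumed per-dimension min-max normalization; the actual scheme in
// normalization.js is not shown in the text.
function normalize(points) {
  const dims = points[0].length;
  const mins = Array(dims).fill(Infinity);
  const maxs = Array(dims).fill(-Infinity);
  for (const p of points) {
    for (let d = 0; d < dims; d++) {
      if (p[d] < mins[d]) mins[d] = p[d];
      if (p[d] > maxs[d]) maxs[d] = p[d];
    }
  }
  // Scale every coordinate into [0, 1]; a constant dimension maps to 0.
  return points.map(p =>
    p.map((v, d) => (maxs[d] === mins[d] ? 0 : (v - mins[d]) / (maxs[d] - mins[d]))));
}
```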

4.2 Distances Implementation


4.2.1 Euclidean Distance

Figure 9 - Euclidean Distance

4.2.2 Manhattan Distance


Figure 10 - Manhattan Distance

4.2.3 Chebyshev Distance

Figure 11 - Chebyshev Distance

4.2.4 Euclidean Squared Distance

Figure 12 - Euclidean Squared Distance

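As a compact reference, the four metrics of section 2 can be written as plain functions over equal-length numeric arrays; this is a sketch, not the project's code (that is shown in figures 9-12):

```js
// The four metrics from section 2 as plain functions (reference sketch).
function euclideanSquared(a, b) {
  return a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0);
}

function euclidean(a, b) {
  return Math.sqrt(euclideanSquared(a, b));
}

function manhattan(a, b) {
  return a.reduce((sum, v, i) => sum + Math.abs(v - b[i]), 0);
}

function chebyshev(a, b) {
  return a.reduce((max, v, i) => Math.max(max, Math.abs(v - b[i])), 0);
}
```
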
5. RSS FUNCTION ANALYSIS
5.1 RSS Implementation
The implementation of the RSS function is shown in figure 13.

Figure 13 - RSS Implementation
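
Conceptually, RSS sums the squared Euclidean distance from every point to its assigned centroid; a sketch, reusing euclideanSquared() from above:

```js
// RSS: sum of squared Euclidean distances from each point to its assigned
// centroid (a sketch; figure 13 shows the project's implementation).
function rss(points, centroids, assignments) {
  return points.reduce((total, p, i) =>
    total + euclideanSquared(p, centroids[assignments[i]]), 0);
}
```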

5.2 Analysis with different values of K
The RSS coefficient went down as the number of clusters K was increased.

Figure 14 - K = 15
Figure 15 - K = 12
Figure 16 - K = 6
Figure 17 - K = 3
Figure 18 - K = 2

The dependency between K and the RSS function is displayed in figure 19.

Figure 19 - K/RSS Plot

6. OPTIMIZATION
6.1 Optimal number of clusters
The optimal choice of k strikes a balance between maximum compression of the data (using a single cluster) and maximum accuracy (assigning each data point to its own cluster). There are several categories of methods for making this decision.
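One widely used method in this category is the elbow heuristic: run the algorithm for a range of k values and pick the k beyond which RSS stops dropping sharply. A sketch built on the functions above (the first-k-points seeding is a deliberate simplification):

```js
// Elbow heuristic: compute the RSS curve over k = 1..maxK and inspect it
// for the bend. Reuses kMeansAlgorithm(), euclidean() and rss() from above.
function elbowScan(points, maxK) {
  const curve = [];
  for (let k = 1; k <= maxK; k++) {
    const initial = points.slice(0, k).map(p => p.slice()); // naive seeding
    const { centroids, assignments } = kMeansAlgorithm(points, initial, euclidean);
    curve.push({ k, rss: rss(points, centroids, assignments) });
  }
  return curve; // plot rss against k and pick the bend
}
```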


6.2 Empty Clusters
Because the data were spread across different dimensions and only 15 centroids (implying 15 clusters) were given, no empty clusters were obtained while classifying the data with any of the distance metrics. If the number of clusters were decreased, the size of each cluster would simply increase. One solution for empty clusters is relocation: if we relocate any empty cluster centers, the algorithm will still converge, provided this happens only a limited number of times. However, if we have to relocate too often, the algorithm might not terminate.
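
The paper does not prescribe a specific relocation rule; one common heuristic, sketched below, moves an empty cluster's centroid onto the point that is currently farthest from its own centroid:

```js
// Relocation sketch: an empty cluster's centroid is moved onto the point
// that is currently worst served by its assigned centroid.
function relocateEmpty(points, centroids, assignments, distanceFn) {
  return centroids.map((centroid, c) => {
    if (assignments.includes(c)) return centroid; // cluster c is not empty
    let worstIdx = 0;
    let worstDist = -Infinity;
    points.forEach((p, i) => {
      const d = distanceFn(p, centroids[assignments[i]]);
      if (d > worstDist) { worstDist = d; worstIdx = i; }
    });
    return points[worstIdx].slice(); // copy the point as the new center
  });
}
```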

7. CONCLUSION

K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster. During the implementation of k-means with four different distance metrics, it was observed that the selection of the distance metric plays a very important role in clustering, so it should be made carefully. In conclusion, the K-means implemented with the Euclidean distance metric has the best performance, while K-means based on the Chebyshev distance metric performs worst.

8. REFERENCES

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] K-means with Three different Distance Metrics. International Journal of Computer Applications (0975-8887), Volume 67, No. 10, April 2013.
[3] D. T. Pham, S. S. Dimov, and C. D. Nguyen. Selection of K in K-means clustering. Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.
