Mentor
Zenun Kastrati
zenun.kastrati@uni-pr.edu
Student
Rinor Dreshaj
rinordreshaj@gmail.com
ABSTRACT
The k-means algorithm partitions data into "k" distinct clusters, but it does not tell you whether that is the correct number of clusters. A data set might naturally contain 5 clusters, yet if k-means is given the number 3 it will return 3 clusters. Those clusters will be bigger, looser and more awkwardly shaped than if the algorithm had been asked to find 5. The power of the k-means algorithm lies in its computational efficiency and its ease of use. Different distance metrics are used to find similar data objects, which leads to robust algorithms for data mining functionalities such as classification and clustering. This paper discusses the results obtained by implementing the k-means algorithm with four different distance metrics, Euclidean, Euclidean squared, Manhattan and Chebyshev, together with a comparative study of these results for ninety-dimensional data. Results are displayed with the help of histograms.
General Terms
Algorithms, Measurement, Performance.
Keywords
Centroids, clustering, metrics, normalisation.
1. INTRODUCTION
Clustering is the grouping of data, i.e. the division of a large data set into smaller data sets whose members share some similarity. A well-known clustering algorithm in unsupervised machine learning is K-Means clustering. The K-Means algorithm takes n observations (data points) and groups them into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid).
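To make the procedure concrete, here is a minimal JavaScript sketch (JavaScript being the language of the paper's node.js implementation) of the k-means loop together with the four distance metrics named in the abstract. The function names and the naive first-k seeding are illustrative, not the paper's actual code.

```javascript
// The four distance metrics compared in the paper.
function euclideanSquared(a, b) {
  return a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);
}
function euclidean(a, b) {
  return Math.sqrt(euclideanSquared(a, b));
}
function manhattan(a, b) {
  return a.reduce((s, v, i) => s + Math.abs(v - b[i]), 0);
}
function chebyshev(a, b) {
  return a.reduce((s, v, i) => Math.max(s, Math.abs(v - b[i])), 0);
}

// Minimal k-means: repeat assignment and update until labels stop changing.
function kMeans(points, k, dist = euclidean, maxIter = 100) {
  // Naive seeding: take the first k points as initial centroids.
  let centroids = points.slice(0, k).map(p => p.slice());
  const labels = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIter; iter++) {
    let changed = false;
    // Assignment step: attach each point to its nearest centroid.
    points.forEach((p, i) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach((c, j) => {
        const d = dist(p, c);
        if (d < bestDist) { bestDist = d; best = j; }
      });
      if (labels[i] !== best) { labels[i] = best; changed = true; }
    });
    if (!changed) break; // converged

    // Update step: recompute each centroid as the mean of its members.
    centroids = centroids.map((c, j) => {
      const members = points.filter((_, i) => labels[i] === j);
      if (members.length === 0) return c; // keep empty clusters in place
      return c.map((_, d) =>
        members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return { centroids, labels };
}
```

Passing a different metric function as `dist` is all it takes to reproduce the four variants compared in the paper.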
3. RESULTS
Results that are obtained after the implementation of K-means
using 4 various distance metrics are shown using histograms. All
the experiments are performed on test data. The results obtained
by using Euclidean distance metric i.e. basic k-means are shown
in fig 1.
2. DISTANCE METRICS
2.1 Euclidian Distance
Euclidean Distance Euclidean distance computes the root of
square difference between co-ordinates of pair of objects.
Figure 1
The results obtained by using Manhattan distance metric i.e. kmeans are shown in fig.2.
Figure 2
The results obtained by using Euclidian squared distance
metric i.e. k-means are shown in fig 3.
Figure 5 - kMeansAlgorithm
The part of code that reads the data from the file and assigns that
data to the variables is
Figure 3
the code in the figure above shows that also a parseData() function
is called, which function does the parse of data and the
implementation of that is shown on the figure below.
4.1 Normalization
Figure 4
.
4. IMPLEMENTATION
K-means for this project was implemented using `node.js`. The
main file for the project is called `index.js`. The method that
initiates the algorithm is init(); . The code that does the
Figure 11 - Chebyshev Distance
Figure 14 - K = 15
6. OPTIMIZATION
6.1 Optimal number of clusters
The optimal choice of k strikes a balance between maximum compression of the data (a single cluster) and maximum accuracy (each data point as its own cluster). There are several categories of methods for making this decision.
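One common method in this family is the elbow heuristic, based on the residual sum of squares (RSS) whose dependency on K the paper plots. The sketch below shows how RSS can be computed for a given clustering and how the flattening point of the curve can be picked; the relative-improvement threshold is illustrative, not from the paper.

```javascript
// Residual sum of squares: total squared distance of each point to its
// assigned centroid. Lower is better; plotting RSS against K gives the
// "elbow" curve used to choose K.
function rss(points, centroids, labels) {
  let total = 0;
  points.forEach((p, i) => {
    const c = centroids[labels[i]];
    for (let d = 0; d < p.length; d++) total += (p[d] - c[d]) ** 2;
  });
  return total;
}

// Elbow heuristic sketch. rssByK[i] holds the RSS obtained with k = i + 1
// clusters; return the k after which the relative improvement from adding
// one more cluster drops below the threshold.
function elbow(rssByK, threshold = 0.1) {
  for (let i = 1; i < rssByK.length; i++) {
    const drop = (rssByK[i - 1] - rssByK[i]) / rssByK[i - 1];
    if (drop < threshold) return i; // going past k = i barely helps
  }
  return rssByK.length;
}
```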
Figure 16 - K = 6
Figure 17 - K = 3
Figure 18 - K = 2
The dependency between K and the RSS function is displayed in Figure 19.
7. CONCLUSION
8. REFERENCES