1) Hierarchical clustering
Which clustering algorithm is appropriate depends on the type of data available and the
particular purpose of the analysis. A good clustering algorithm should ideally produce
clusters that are meaningful for that purpose.
In the case of clustering, the problem is to group a given collection of unlabeled patterns into
meaningful clusters. In a sense, labels are associated with the clusters as well, but these category
labels are data driven; that is, they are obtained solely from the data.
Pros:
2. It works fast with large data sets, since the time complexity is O(nkl), where n is the number
of data points, k is the number of clusters, and l is the number of iterations.
3. It relies on Euclidean distance, which makes it work well with numeric values.
Cons:
1. The initial choice of the value of K is very important, but there is no proper guideline
for choosing it, and different values of K will generate different numbers of clusters.
2. The initial cluster centroids are very important: if a centroid starts far from the true
cluster center of the data, the algorithm may take many iterations to converge, which sometimes
leads to incorrect clustering.
3. K-means does not cope well with data sets that contain noise.
Irrespective of the algorithm, the distance between two data points is commonly calculated in one of three ways:
--------------------------------------------------------------------------------------------------------------------------------
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
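These three metrics can be sketched in NumPy as follows (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # City-block distance: sum of absolute differences.
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # Generalises both: p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(euclidean(x, y))     # 5.0
print(manhattan(x, y))     # 7.0
print(minkowski(x, y, 2))  # 5.0 (matches Euclidean)
```

Note that Minkowski with p=2 reproduces Euclidean and with p=1 reproduces Manhattan, which is why it is listed as the general form.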
K-Means:
--------------------------------------------------------------------------------------------
K-means is an unsupervised, non-deterministic, numerical, iterative
method of clustering.
In k-means, each cluster is represented by
the mean value of the objects in the cluster. Here we partition a set
of n objects into k clusters so that inter-cluster similarity is low
and intra-cluster similarity is high. Similarity is measured in
terms of the mean value of the objects in a cluster.
K-means algorithm:
1. Select 'k' cluster centers randomly.
2. Calculate the distance between each data point and the cluster
centers using the Euclidean distance metric, e.g.:
np.sqrt(np.sum((x - y) ** 2))
3. Each data point is assigned to the cluster center whose distance
from the point is the minimum over all cluster centers.
4. A new center is calculated for each cluster as the mean of the
data points assigned to it, e.g. np.mean(cluster_i_points, axis=0),
where cluster_i_points are the ci data points in the ith cluster.
5. The distance between each data point and the newly obtained
cluster centers is recalculated.
6. If no data point was reassigned, stop; otherwise repeat
steps 3 to 5.
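The six steps above can be sketched as a minimal NumPy implementation (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following steps 1-6 above.
    X is an (n, d) array of data points; k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for it in range(max_iter):
        # Step 2: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 3: assign each point to its nearest center.
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no point was reassigned.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each center as the mean of its points.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
        # Step 5 (recomputing distances) happens at the top of the next loop.
    return centers, labels

# Two well-separated groups, which is the case k-means handles best.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centers, labels = kmeans(X, k=2)
```

The `seed` parameter only makes the random initialization reproducible; as noted below, different initial centers can lead to different final clusterings.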
Advantages:
1. Fast, robust, and easy to understand.
2. Relatively efficient: O(tknd), where n is the number of objects,
k is the number of clusters, d is the number of dimensions of each
object, and t is the number of iterations. Normally, k, t, d << n.
3. Gives the best results when the data sets are distinct or well
separated from each other.
Disadvantages:
1. The learning algorithm requires a priori specification of the
number of cluster centers
2. The use of exclusive assignment: if two clusters overlap
heavily, k-means will not be able to resolve that
there are two clusters.
3. The learning algorithm is not invariant to non-linear
transformations, i.e. with a different representation of the data,
different results are obtained (data represented in
Cartesian coordinates and in polar coordinates will give
different results).
4. Euclidean distance measures can unequally weight
underlying factors.
5. The learning algorithm finds only a local optimum of the
squared error function.
6. A poor random choice of the initial cluster centers may not
lead to a fruitful result.
7. The algorithm does not work well for categorical data, i.e. it
is applicable only when the mean is defined.
8. It is unable to handle noisy data and outliers.
9. The algorithm fails for non-linear data sets.
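Disadvantage 8 is easy to see numerically: because each center is a mean, a single outlier can drag it far away from the rest of its cluster. A small illustration:

```python
import numpy as np

# Five points that clearly belong together, plus one outlier.
cluster = np.array([1.0, 2.0, 1.5, 2.5, 2.0])
with_outlier = np.append(cluster, 100.0)

print(np.mean(cluster))       # 1.8 -- representative of the group
print(np.mean(with_outlier))  # ~18.17 -- dragged far outside the group
```

A centroid at ~18.17 is nowhere near any of the original five points, so subsequent nearest-center assignments are distorted by one bad value.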