
K-Means:

In recent years, agricultural and environmental data have increased at exponential rates owing to the widespread use of automated data collection tools and systems. Yield data from precision agriculture applications have become one of the recent contributors to this increase. Huge amounts of data collected by weather forecasting, remote sensing and geographic information systems have already been in use for a long time. In addition, the progressive and intensive use of sensor networks and computers in cultivated areas, barns and poultry houses has played a significant role in the growth of agricultural data. Cluster Analysis, one of the most popular Data Mining techniques, can be used in agricultural data analysis. For instance, it is believed that Data Mining and Cluster Analysis should be a part of agriculture because they can improve the accuracy of decision systems (Tiwari & Misra 2011).

Cluster Analysis is defined as a collection of unsupervised classification techniques for grouping objects or segmenting datasets into subsets of data called clusters. With an appropriate clustering algorithm, a cluster is formed from objects that are more similar to each other than to the objects in different clusters. In other words, cluster analysis assigns to the same cluster similar objects that share common characteristics based on their features. Although there are several different ways to categorize them, clustering algorithms can generally be grouped into three categories: hierarchical, non-hierarchical (flat) and mixture techniques. Although hundreds of algorithms exist, in practice the use of many of them has been limited by their complexity, their efficiency and their availability in currently used statistical software. The choice of a good algorithm to run on a certain dataset depends on many criteria, such as data size, data structure and the goals of the Cluster Analysis (Velmurugan 2012; Bora & Gupta 2014). As reported in many studies (e.g. Dong et al. 2011; Kuar & Kuar 2013), the non-hierarchical partitioning algorithms are the most widely used in practice. Since their introduction by MacQueen (1967), K-Means and its successor derivatives have been the most popular algorithms in exploratory data analysis and Data Mining applications for over half a century.
K-Means is a well-known clustering algorithm that partitions a given dataset into clusters. It needs a parameter k, representing the number of clusters, which must be known or fixed a priori before the cluster analysis begins. K-Means is reported to be fast, robust and simple to implement, and, as reported in many studies, it gives comparatively good results when the clusters in a dataset are distinct or well separated. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point of the given dataset and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point, k new centroids must be recalculated as the barycentres of the clusters resulting from the previous step. After these k new centroids are obtained, a new binding is made between the same data points and the nearest new centroid, and a loop is generated. As a result of this loop, the k centroids change their location step by step until no more changes occur; in other words, the centroids no longer move. Finally, the algorithm aims at minimizing an objective function, in this case a squared-error function.
Algorithm:
Step 1: Specify the number of clusters K.
Step 2: Initialize the centroids by shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
Step 3: Repeat Steps 4-6 until there is no change to the centroids, i.e. until the assignment of data points to clusters is no longer changing.
Step 4: Compute the sum of the squared distances between the data points and all centroids.
Step 5: Assign each data point to the closest cluster (centroid).
Step 6: Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
The approach K-Means follows to solve the problem is called Expectation-Maximization: the E-step assigns the data points to the closest cluster, and the M-step computes the centroid of each cluster. Below is a breakdown of how this can be solved mathematically.

The objective function is:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \,\lVert x^{i} - \mu_{k} \rVert^{2}    (1)

where w_ik = 1 if data point x^i belongs to cluster k and w_ik = 0 otherwise, and μ_k is the centroid of x^i's cluster.
It is a minimization problem in two parts. We first minimize J with respect to w_ik while treating μ_k as fixed; then we minimize J with respect to μ_k while treating w_ik as fixed. Technically speaking, we first differentiate J with respect to w_ik and update the cluster assignments (E-step); we then differentiate J with respect to μ_k and recompute the centroids after the cluster assignments from the previous step (M-step). The E-step is therefore:
\frac{\partial J}{\partial w_{ik}} = \sum_{i=1}^{m} \sum_{k=1}^{K} \lVert x^{i} - \mu_{k} \rVert^{2}

\Rightarrow\; w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x^{i} - \mu_{j} \rVert^{2} \\ 0 & \text{otherwise} \end{cases}    (2)
In other words, assign the data point x^i to the closest cluster, as judged by its squared distance from the cluster's centroid. And the M-step is:
\frac{\partial J}{\partial \mu_{k}} = 2 \sum_{i=1}^{m} w_{ik} (x^{i} - \mu_{k}) = 0

\Rightarrow\; \mu_{k} = \frac{\sum_{i=1}^{m} w_{ik} x^{i}}{\sum_{i=1}^{m} w_{ik}}    (3)

This translates to recomputing the centroid of each cluster to reflect the new assignments.
A few things to note here:
 Since clustering algorithms, including K-Means, use distance-based measurements to determine the similarity between data points, it is recommended to standardize the data to have a mean of zero and a standard deviation of one, because the features of almost any dataset have different units of measurement (such as age vs. income).
 Given the iterative nature of K-Means and the random initialization of the centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and never converge to the global optimum. It is therefore recommended to run the algorithm with several different centroid initializations and pick the results of the run that yielded the lowest sum of squared distances.
 The assignment of examples no longer changing is equivalent to no change in the within-cluster variation:

\frac{1}{m_{k}} \sum_{i=1}^{m_{k}} \lVert x^{i} - \mu_{c_{k}} \rVert^{2}    (4)
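To make the steps above concrete, here is a minimal NumPy sketch (our illustration, not code from the source): it implements the E-step of equation (2) and the M-step of equation (3), tracks the objective of equation (1), and follows the advice above by keeping the best of several random initializations; for brevity, empty clusters are not handled.

import numpy as np

def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm); a sketch, not a production implementation."""
    rng = np.random.default_rng(seed)
    best_J, best = np.inf, None
    for _ in range(n_init):  # several initializations, keep the lowest objective
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 2
        for _ in range(max_iter):  # Step 3: loop until the centroids stop moving
            # E-step, eq. (2): squared distances to all centroids (Step 4),
            # then assign each point to its nearest centroid (Step 5).
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # M-step, eq. (3): recompute each centroid as its cluster mean (Step 6).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        J = d2[np.arange(len(X)), labels].sum()  # objective, eq. (1)
        if J < best_J:
            best_J, best = J, (labels, centroids)
    return best

# Standardize to zero mean and unit variance before clustering, as noted above.
X = np.random.default_rng(1).normal(size=(100, 2))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
labels, centroids = kmeans(Xs, k=3)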

EXAMPLE:
As a simple illustration of a k-means algorithm, consider the
following data set consisting of the scores of two variables on
each of seven individuals:

Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added. This leads to the following series of steps:
Step   Cluster 1                               Cluster 2
       Individual   Mean Vector (centroid)     Individual   Mean Vector (centroid)
1      1            (1.0, 1.0)                 4            (5.0, 7.0)
2      1, 2         (1.2, 1.5)                 4            (5.0, 7.0)
3      1, 2, 3      (1.8, 2.3)                 4            (5.0, 7.0)
4      1, 2, 3      (1.8, 2.3)                 4, 5         (4.2, 6.0)
5      1, 2, 3      (1.8, 2.3)                 4, 5, 6      (4.3, 5.7)
6      1, 2, 3      (1.8, 2.3)                 4, 5, 6, 7   (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
            Individual   Mean Vector (centroid)
Cluster 1   1, 2, 3      (1.8, 2.3)
Cluster 2   4, 5, 6, 7   (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So we compare each individual's distance to its own cluster mean and to that of the opposite cluster, and we find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to that of its own (Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean (which is not the case for individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
            Individual      Mean Vector (centroid)
Cluster 1   1, 2            (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   (3.9, 5.1)
 
The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer to its own cluster mean than to that of the other cluster, so the iteration stops and the latest partitioning is taken as the final cluster solution.
Also, it is possible that the k-means algorithm will not reach a final solution. In that case it would be a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
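This worked example can be reproduced with scikit-learn (a sketch under the stated setup; scikit-learn updates all assignments in each pass rather than sequentially as above, so the intermediate steps differ, but with these initial centroids it converges to the same final partition):

import numpy as np
from sklearn.cluster import KMeans

# The seven individuals from the table above (columns A and B).
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Seed the two centroids with the two individuals furthest apart (1 and 4).
init = np.array([[1.0, 1.0], [5.0, 7.0]])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)           # [0 0 1 1 1 1 1]: individuals 1-2 vs. 3-7
print(km.cluster_centers_)  # approximately [[1.25 1.5], [3.9 5.1]]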
WHY ONLY K-MEANS ALGORITHM:
K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes. The objective of K-Means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-Means looks for a fixed number (k) of clusters in a dataset. A cluster refers to a collection of data points aggregated together because of certain similarities. You define a target number k, which refers to the number of centroids you need in the dataset; a centroid is the imaginary or real location representing the centre of a cluster. Every data point is allocated to one of the clusters by reducing the within-cluster sum of squares. In other words, the K-Means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible. The 'means' in K-Means refers to the averaging of the data, that is, to finding the centroid.
However, K-Means also has drawbacks:

 Strong sensitivity to outliers and noise.
 Poor performance on non-circular cluster shapes.
 The number of clusters and the initial seed values must be specified beforehand.
 Low capability to escape a local optimum.

FUZZY SET THEORY:

Fuzzy set theory enables one to work in uncertain and ambiguous situations and to solve ill-posed problems or problems with incomplete information.
Fuzziness is a language concept; its main strength lies in the vagueness of the symbols it uses and of how it defines them. Consider a set of tables in a lobby. In classical set theory we would ask: is it a table? We would have only two possible answers, yes or no. If we code yes as 1 and no as 0, we would have the pair of answers (0, 1). At the end we would collect all the elements coded 1 and obtain the set of tables in the lobby. We may then ask: what objects in the lobby can function as a table? We could answer that tables, boxes, desks, among others, can function as a table. This set is not uniquely defined; it depends entirely on what we mean by the word function. Words like this have many shades of meaning and depend on the circumstances of the situation. Thus, we may say that the set of objects in the lobby that can function as a table is a fuzzy set, because we have not crisply defined the criteria for the membership of an element in the set. Objects such as tables, desks and boxes may function as a table to a certain degree, although fuzziness is a feature of their representation in symbols and is normally a property of models or languages.
MEMBERSHIP FUNCTIONS:
A membership function for a fuzzy set A on the universe of discourse X is defined as µA: X → [0, 1], where each element of X is mapped to a value between 0 and 1. This value, called the membership value or degree of membership, quantifies the grade of membership of the element x of X in the fuzzy set A.
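As a small illustration (the objects and their degrees are invented for this sketch, echoing the lobby example above), a fuzzy set over a finite universe of discourse can be represented directly as a mapping from elements to membership degrees:

# Hypothetical fuzzy set A = "can function as a table" over objects in a lobby.
mu_A = {"table": 1.0, "desk": 0.9, "box": 0.6, "chair": 0.3, "lamp": 0.0}

def membership(x):
    """mu_A: X -> [0, 1]; elements outside the mapping get degree 0."""
    return mu_A.get(x, 0.0)

print(membership("box"))  # 0.6: a box functions as a table to degree 0.6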

Membership functions also allow us to represent a fuzzy set graphically: the x axis represents the universe of discourse, whereas the y axis represents the degrees of membership in the [0, 1] interval.

Simple functions are used to build membership functions; because we are defining fuzzy concepts, using more complex functions does not add precision.

TRIANGULAR MEMBERSHIP FUNCTIONS:


The triangular curve is a function of a vector x and depends on three scalar parameters a, b and c:

\mu_{A}(x; a, b, c) = \begin{cases} \dfrac{x-a}{b-a} & a < x \le b \\ \dfrac{c-x}{c-b} & b < x \le c \\ 0 & \text{otherwise} \end{cases}
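A minimal Python sketch of this function, written in the equivalent min-max form max(min((x − a)/(b − a), (c − x)/(c − b)), 0), which the worked example further below also uses:

def triangular(x, a, b, c):
    """Triangular membership function with feet a and c and core (peak) at b."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)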

The three main basic features involved in characterizing a membership function are the following.
Core: The core of a membership function for some fuzzy set A is defined as the region of the universe characterized by complete membership in the set. The core comprises those elements x of the universe such that µA(x) = 1.
Support: The support of a membership function for some fuzzy set A is defined as the region of the universe characterized by nonzero membership in the set. The support comprises those elements x of the universe such that µA(x) > 0.
Boundary: The boundary of a membership function for some fuzzy set A is defined as the region of the universe containing elements that have nonzero but not complete membership in the set. The boundary comprises those elements x of the universe such that 0 < µA(x) < 1.
For the triangular function above, for example, the core is the single point x = b, the support is the open interval (a, c), and the boundary is (a, b) ∪ (b, c).

ALGORITHM:
Step 1: Build the initial membership functions from the attribute values of the training examples.
Step 2: Build an initial decision table.
Step 3: Simplify the decision table.
Step 4: Simplify the membership functions.
Step 5: Generate fuzzy rules from the decision table.

Example:
Consider the following sample clinical dataset:
PID   AGE   GENDER   WEIGHT   SUGAR   FEVER   BP
111   18    M        54       108     101     90
112   32    M        65       123     104     80
113   43    M        70       116     97      60
114   54    F        84       94      90      100
115   65    F        66       82      107     110
116   36    M        66       237     107     120

The table shows the sample clinical trial dataset. Boundary values have to be set up for all attributes except the PatientID, Gender and Blood group attributes, and the membership value in [0, 1] of each attribute value is computed with respect to these boundary values. For example, if a patient has diabetes, the blood sugar value is important for diagnosis. The range of blood sugar values is shown below.
Table 1. Blood sugar level chart

LOW    NORMAL    HIGH
<30    30-100    >100

The boundary values for sugar are based on this chart: a and c are boundary values and b is the core value, so a = 30, b = 100 and c = 320 (the maximum value). The sugar level of PatientID 116 is 230. The fuzzification is executed as per the triangular membership equation derived above; the calculation is given below.
Triangle(230; 30, 100, 320) = max(min((230 − 30)/(100 − 30), (320 − 230)/(320 − 100)), 0)
= max(min(200/70, 90/220), 0)
= max(min(2.8571, 0.4090), 0)
= 0.4090
Figure 1 shows the result on the triangular membership shape.

Fig. 1. Indication of the fuzzy value in the triangular membership shape
The fuzzy value of the sugar reading 230 is therefore 0.4090. From this calculation it is observed that PatientID 116 has diabetes in a severe condition, and the result suggests that patient 116 should be treated immediately. This formulation is applied to the values of all attributes except the PatientID, Gender and Blood group attributes.
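Using the triangular function sketched earlier with the same assumed boundaries, this fuzzification is a one-liner:

# Fuzzify the blood sugar reading 230 with the boundaries a=30, b=100, c=320.
mu = triangular(230, a=30, b=100, c=320)
print(round(mu, 4))  # 0.4091, matching the approximately 0.4090 computed above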
WHY FUZZY SET THEORY:
Fuzzy set theory has been shown to be a useful tool for describing situations in which the data are imprecise or vague. Fuzzy sets handle such situations by attributing a degree to which a certain object belongs to a set. In real life, however, a person may assume that an object x belongs to a set A to a certain degree and yet not be entirely sure about it; in other words, there may be hesitation or uncertainty about the membership degree of x in A. In fuzzy set theory there is no means of incorporating that hesitation into the membership degrees.
