
Cluster Analysis

Amandeep Singh
30 June 2014

Cluster Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.
Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Cluster analysis is also called classification analysis or numerical taxonomy.
Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, in order to develop the classification rule.
In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.

The k-means algorithm

The k-means algorithm is perhaps the most often used clustering method.
Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques.
The goal is to minimize the differences within each cluster and maximize the differences between clusters.
This involves assigning each of the n examples to one of the k clusters, where k is a number that has been defined ahead of time.


The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers.
These centers are the catalyst that spurs the remaining examples to fall into place.
The points are chosen by selecting k random examples from the training dataset.
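As an illustration, here is a minimal sketch of this initialization step in Python using NumPy; the dataset X here is made up for demonstration:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example data: 20 points with 2 features each
X = rng.random((20, 2))

k = 3  # number of clusters, fixed ahead of time

# Select k random examples from the training data as initial centers
centers = X[rng.choice(len(X), size=k, replace=False)]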

Because we hope to identify three clusters, k = 3 points are selected.

Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used.
Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is as follows:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
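This formula translates directly into code; a sketch continuing the NumPy example above:

def euclidean_distance(x, y):
    # Square the per-feature differences, sum them, and take the square root
    return np.sqrt(np.sum((x - y) ** 2))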

The three cluster centers partition the examples into three segments, labeled Cluster A, Cluster B, and Cluster C.
The dashed lines indicate the boundaries of the Voronoi diagram created by the cluster centers.

A Voronoi diagram indicates the areas that are closer to one cluster center than to any other; the vertex where all three boundaries meet is at the maximal distance from all three cluster centers.
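The assignment phase computes exactly this partition: every example is labeled with its nearest center. A sketch, continuing the example above:

def assign_clusters(X, centers):
    # Distance from every example to every center: shape (n_examples, k)
    distances = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    # Each example gets the index of its nearest center
    return distances.argmin(axis=1)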

Once the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase.
The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the mean value of the points currently assigned to that cluster.
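A sketch of the centroid calculation (for brevity, it assumes no cluster ends up empty):

def update_centers(X, labels, k):
    # Each new center is the mean of the points currently assigned to it
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])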


Because the cluster boundaries have been adjusted according to the repositioned centers, Cluster A is able to claim an additional example from Cluster B.

Two more points have been reassigned from Cluster B to Cluster A during this phase, as they are now closer to the centroid of A than to that of B. This leads to another update of the centroids.
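Putting the assignment and update phases together, the loop repeats until the centers stop moving. A minimal sketch built from the helpers defined above:

def k_means(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: k random examples serve as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        labels = assign_clusters(X, centers)       # assignment phase
        new_centers = update_centers(X, labels, k)  # update phase
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: the algorithm has converged
        centers = new_centers
    return labels, centers

labels, centers = k_means(X, k=3)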

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm can be sensitive to the randomly chosen cluster centers.
Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected.

Ideally, you will have some a priori knowledge (that is, a prior belief) about the true groupings, and you can begin applying k-means using this information.
For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards.
In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited.

Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example:
The number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list.
If the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k.
The homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters.
Because you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k such that there are diminishing returns beyond that point.
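One common measure for the elbow method is the total within-cluster sum of squares, which shrinks as k grows (scikit-learn's KMeans exposes the same quantity as its inertia_ attribute). A sketch using the k_means helper above:

def total_withinss(X, labels, centers):
    # Total within-cluster sum of squared distances to each centroid
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))

scores = []
for k in range(1, 10):
    labels, centers = k_means(X, k)
    scores.append(total_withinss(X, labels, centers))
# Plotting scores against k reveals the bend ("elbow") where
# adding more clusters yields diminishing returns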

This value of k is known as the elbow point, because the bend in the resulting plot resembles an elbow.
