
Cluster Analysis

Amandeep Singh
30 June 2014

Cluster Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.
Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Cluster analysis is also called classification analysis or numerical taxonomy.
Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, in order to develop the classification rule.
In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.

The k-means algorithm

The k-means algorithm is perhaps the most often used clustering method.
Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques.
The goal is to minimize the differences within each cluster and maximize the differences between clusters.
This involves assigning each of the n examples to one of the k clusters, where k is a number that has been defined ahead of time.


The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers.
These centers are the catalyst that spurs the remaining examples to fall into place.
The points are chosen by selecting k random examples from the training dataset.
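As an illustration, here is a minimal sketch of this initialization step in Python using NumPy; the dataset X here is made up for demonstration:

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example data: 20 points with 2 features each
X = rng.random((20, 2))

k = 3  # number of clusters, fixed ahead of time

# Select k random examples from the training data as initial centers
centers = X[rng.choice(len(X), size=k, replace=False)]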

Because we hope to identify three clusters, k = 3 points are selected.

Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used.
Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is as follows:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
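This formula translates directly into code; a sketch continuing the NumPy example above:

def euclidean_distance(x, y):
    # Square the per-feature differences, sum them, and take the square root
    return np.sqrt(np.sum((x - y) ** 2))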

The three cluster centers partition the examples into three segments, labeled Cluster A, Cluster B, and Cluster C.
The dashed lines indicate the boundaries of the Voronoi diagram created by the cluster centers.

A Voronoi diagram indicates the areas that are closer to one cluster center than to any other; the vertex where all three boundaries meet is at the maximal distance from all three cluster centers.
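The assignment phase computes exactly this partition: every example is labeled with its nearest center. A sketch, continuing the example above:

def assign_clusters(X, centers):
    # Distance from every example to every center: shape (n_examples, k)
    distances = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    # Each example gets the index of its nearest center
    return distances.argmin(axis=1)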

Once the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase.
The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the mean value of the points currently assigned to that cluster.
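A sketch of the centroid calculation (for brevity, it assumes no cluster ends up empty):

def update_centers(X, labels, k):
    # Each new center is the mean of the points currently assigned to it
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])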


Because the cluster boundaries have been adjusted according to the repositioned centers, Cluster A is able to claim an additional example from Cluster B.

Two more points have been reassigned from Cluster B to Cluster A during this phase, as they are now closer to the centroid of A than to that of B. This leads to another update of the centroids.
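Putting the assignment and update phases together, the loop repeats until the centers stop moving. A minimal sketch built from the helpers defined above:

def k_means(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: k random examples serve as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        labels = assign_clusters(X, centers)       # assignment phase
        new_centers = update_centers(X, labels, k)  # update phase
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: the algorithm has converged
        centers = new_centers
    return labels, centers

labels, centers = k_means(X, k=3)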

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm can be sensitive to the randomly chosen cluster centers.
Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected.

Ideally, you will have some a priori knowledge (that is, a prior belief) about the true groupings, and you can begin applying k-means using this information.
For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards.
In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited.

Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example:
The number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list.
If the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k.
The homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters.
Because you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k such that there are diminishing returns beyond that point.
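One common measure for the elbow method is the total within-cluster sum of squares, which shrinks as k grows (scikit-learn's KMeans exposes the same quantity as its inertia_ attribute). A sketch using the k_means helper above:

def total_withinss(X, labels, centers):
    # Total within-cluster sum of squared distances to each centroid
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))

scores = []
for k in range(1, 10):
    labels, centers = k_means(X, k)
    scores.append(total_withinss(X, labels, centers))
# Plotting scores against k reveals the bend ("elbow") where
# adding more clusters yields diminishing returns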

This value of k is known as the elbow point, because the bend in the resulting plot resembles an elbow.
