QEM

Cluster Analysis

Jacob Eskildsen

Cluster Analysis

Overview
Idea & purpose
Similarity measures
Dissimilarity measures
Hierarchical clustering
Non-hierarchical clustering

Cluster Analysis Overview


What type of relationship is being examined?

Dependence: how many variables are being predicted?
  Multiple relationships of dependent and independent variables
    Structural equation modelling
  Several dependent variables in a single relationship: what is the measurement scale of the dependent variable?
    Metric: what is the measurement scale of the predictor variable?
      Metric: Canonical correlation
      Nonmetric: MANOVA
    Nonmetric: Canonical correlation with dummy variables
  One dependent variable in a single relationship: what is the measurement scale of the dependent variable?
    Metric: Multiple regression, Conjoint analysis
    Nonmetric: Multiple discriminant analysis, Linear probability models

Independence: is the structure of relationships among
  Variables: Factor analysis, Principal components
  Cases/respondents: Cluster analysis
  Objects: how are the attributes measured?
    Metric: Multidimensional scaling
    Nonmetric: Multidimensional scaling, Correspondence analysis

Cluster Analysis Idea & purpose


The objective is to discover natural groupings of the items (respondents or variables).
Cluster analysis is more primitive than classification, since no assumptions are made regarding the number of groups.
Grouping is done on the basis of similarities or distances (dissimilarities) between the items.

Cluster Analysis Similarity measures


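Euclidean distance
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})}$$

Statistical distance
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'\mathbf{A}(\mathbf{x} - \mathbf{y})}$$
ordinarily $\mathbf{A} = \mathbf{S}^{-1}$

Minkowski metric
$$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$$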
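
The three distances above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the example vectors and the diagonal matrix `A` are assumptions made for the example, and `A` is passed in directly rather than estimated from data.

```python
import math

def euclidean(x, y):
    """Euclidean distance: sqrt((x - y)'(x - y))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def statistical(x, y, A):
    """Statistical distance sqrt((x - y)' A (x - y)); A is a p x p list of rows."""
    diff = [a - b for a, b in zip(x, y)]
    p = len(diff)
    quad = sum(diff[i] * A[i][j] * diff[j] for i in range(p) for j in range(p))
    return math.sqrt(quad)

def minkowski(x, y, m):
    """Minkowski metric: (sum_i |x_i - y_i|^m)^(1/m)."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Hypothetical example data (not from the slides).
x, y = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical A: inverse of a diagonal covariance matrix with variances 9 and 16.
A = [[1.0 / 9.0, 0.0], [0.0, 1.0 / 16.0]]

print(euclidean(x, y))              # 5.0
print(statistical(x, y, identity))  # same as the Euclidean distance
print(statistical(x, y, A))         # each difference is scaled by its variance
print(minkowski(x, y, 1))           # 7.0 (city-block distance)
print(minkowski(x, y, 2))           # m = 2 reproduces the Euclidean distance
```

With A equal to the identity matrix the statistical distance reduces to the Euclidean distance, and the Minkowski metric with m = 2 does the same.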

Cluster Analysis Dissimilarity measures

Canberra metric
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$

Czekanowski coefficient
$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$
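
A short Python sketch of the two measures above, assuming nonnegative variables. The function names and example vectors are illustrative assumptions, and skipping terms with x_i + y_i = 0 in the Canberra metric is a convention chosen here to avoid division by zero.

```python
def canberra(x, y):
    """Canberra metric: sum_i |x_i - y_i| / (x_i + y_i)."""
    # Assumption: terms with x_i + y_i == 0 are skipped to avoid 0/0.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def czekanowski(x, y):
    """Czekanowski coefficient: 1 - 2 * sum_i min(x_i, y_i) / sum_i (x_i + y_i)."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total

# Hypothetical nonnegative example data (not from the slides).
x, y = [1.0, 3.0, 0.0], [3.0, 1.0, 4.0]

print(canberra(x, y))     # 2/4 + 2/4 + 4/4 = 2.0
print(czekanowski(x, y))  # 1 - 2 * 2 / 12
print(czekanowski(x, x))  # identical items have dissimilarity 0.0
```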

Cluster Analysis Hierarchical clustering


Many different methods exist (SPSS has 37 similarity or distance measures and 7 clustering methods).
The 3 most common are:
Single linkage
Complete linkage
Average linkage

Cluster Analysis Hierarchical clustering


Single linkage
Groups are formed from the individual items by merging nearest neighbors, where nearest neighbor means the smallest distance or largest similarity. When clusters U and V are merged, the distance from the new cluster (UV) to any other cluster W is
$$d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$$
The quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest neighbors of clusters U and W and of clusters V and W, respectively. At each step the two clusters with the smallest distance are merged.

Cluster Analysis Hierarchical clustering


Complete linkage
Like single linkage, with one exception: the distance between clusters is determined by the distance between their furthest neighbors. Complete linkage ensures that all items in a cluster are within some maximum distance (minimum similarity) of each other. When clusters U and V are merged, the distance from the new cluster (UV) to any other cluster W is
$$d_{(UV)W} = \max\{d_{UW}, d_{VW}\}$$
The quantities $d_{UW}$ and $d_{VW}$ are the distances between the furthest neighbors of clusters U and W and of clusters V and W, respectively. As in single linkage, the two clusters with the smallest distance are merged at each step.

Cluster Analysis Hierarchical clustering


Average linkage
Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of each pair belongs to each cluster. When clusters U and V are merged, the distance from the new cluster (UV) to any other cluster W is
$$d_{(UV)W} = \frac{\sum_{i} \sum_{k} d_{ik}}{N_{(UV)} N_W}$$
The quantity $d_{ik}$ is the distance between object i in cluster (UV) and object k in cluster W, and $N_{(UV)}$ and $N_W$ are the numbers of items in the two clusters. As before, the two clusters with the smallest distance are merged at each step. Average linkage is the default in SPSS.
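
The three linkage rules differ only in how the distance from a merged cluster (UV) to another cluster W is updated. The sketch below illustrates this with a small agglomerative procedure in plain Python. The function name `agglomerate`, the use of Euclidean distance between items and the example points are assumptions for illustration, and this is not how SPSS implements the methods. For average linkage it uses the size-weighted mean of d_UW and d_VW, which equals the sum of all pairwise distances divided by N_(UV) N_W.

```python
import math

def agglomerate(points, method="single"):
    """Merge clusters until one remains.

    Returns the merge history as (members_of_U, members_of_V, distance) tuples.
    """
    # Start with every item in its own cluster, keyed by an integer id.
    clusters = {i: [i] for i in range(len(points))}
    # Distances between clusters, keyed by the pair of cluster ids.
    dist = {frozenset((i, j)): math.dist(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    history = []
    next_id = len(points)
    while len(clusters) > 1:
        # Merge the two clusters with the smallest distance.
        pair = min(dist, key=dist.get)
        u, v = sorted(pair)
        history.append((clusters[u], clusters[v], dist.pop(pair)))
        n_u, n_v = len(clusters[u]), len(clusters[v])
        merged = clusters.pop(u) + clusters.pop(v)
        # Update the distance from the new cluster (UV) to every other cluster W.
        for w in clusters:
            d_uw = dist.pop(frozenset((u, w)))
            d_vw = dist.pop(frozenset((v, w)))
            if method == "single":
                d_new = min(d_uw, d_vw)
            elif method == "complete":
                d_new = max(d_uw, d_vw)
            elif method == "average":
                d_new = (n_u * d_uw + n_v * d_vw) / (n_u + n_v)
            else:
                raise ValueError("method must be single, complete or average")
            dist[frozenset((next_id, w))] = d_new
        clusters[next_id] = merged
        next_id += 1
    return history

# Hypothetical one-dimensional example data (not from the slides).
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
for method in ("single", "complete", "average"):
    heights = [round(d, 3) for _, _, d in agglomerate(points, method)]
    print(method, heights)
```

On this example the merge order is the same for all three methods, but the distances at which the merges happen differ: single linkage gives the smallest merge distances and complete linkage the largest.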

Cluster Analysis Hierarchical clustering


The results of a cluster analysis can be displayed graphically as a dendrogram.

Cluster Analysis Non-hierarchical clustering


K-means method
Composed of three steps:
(1) Partition the items into K initial clusters.
(2) Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
(3) Repeat step 2 until no more reassignments take place.
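
The three steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the caller supplies the initial partition with no empty cluster, distances are Euclidean, and an item is never moved if it is the only member of its cluster. The function names and example data are hypothetical, and statistical packages may use a different variant of the algorithm.

```python
import math

def centroid(points, members):
    """Coordinate-wise mean of the points whose indices are in members."""
    dim = len(points[0])
    return tuple(sum(points[i][d] for i in members) / len(members)
                 for d in range(dim))

def k_means(points, initial_labels, k):
    """Return (labels, centers) after sequential reassignment of items."""
    labels = list(initial_labels)
    # Step 1: the initial partition into K clusters is given by initial_labels.
    members = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    centers = [centroid(points, m) for m in members]
    changed = True
    while changed:  # Step 3: repeat until a full pass makes no reassignment.
        changed = False
        for i, p in enumerate(points):  # Step 2: go through the list of items.
            old = labels[i]
            new = min(range(k), key=lambda c: math.dist(p, centers[c]))
            # Assumption: never empty a cluster by moving its only member.
            if new != old and len(members[old]) > 1:
                members[old].remove(i)
                members[new].append(i)
                labels[i] = new
                # Recalculate centroids of the losing and receiving clusters.
                centers[old] = centroid(points, members[old])
                centers[new] = centroid(points, members[new])
                changed = True
    return labels, centers

# Hypothetical example data and a deliberately poor initial partition.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels, centers = k_means(points, [0, 1, 0, 1, 0, 1], k=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [(1.0,), (11.0,)]
```

The result depends on the initial partition and on the order of the items, so in practice it is common to try several starting partitions.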

Cluster Analysis Example


Are there different segments among first-year bachelor students at Aarhus School of Business?
K-means with 2, 3, 4 and 5 clusters.
[Figure: Sum of Squares of Distance to Cluster Centers. Proportion of sum of squares (0 to 1) plotted against number of clusters (1 to 5).]
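
One way to produce the quantity in such a plot is to divide the sum of squared distances to the cluster centers by the same sum for a single cluster containing all items. The sketch below assumes that this is how the proportion is defined, which the slide does not state. The function name and data are hypothetical.

```python
def sum_sq_to_centers(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for c in set(labels):
        group = [p for p, lab in zip(points, labels) if lab == c]
        dim = len(group[0])
        center = [sum(p[d] for p in group) / len(group) for d in range(dim)]
        total += sum((p[d] - center[d]) ** 2 for p in group for d in range(dim))
    return total

# Hypothetical example data (not the student data from the slides).
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
one_cluster = sum_sq_to_centers(points, [0] * len(points))
two_clusters = sum_sq_to_centers(points, [0, 0, 0, 1, 1, 1])

print(one_cluster)                 # 154.0
print(two_clusters)                # 4.0
print(two_clusters / one_cluster)  # proportion remaining with 2 clusters
```

Plotting this proportion against the number of clusters shows where adding a further cluster stops reducing the sum of squares by much.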

Cluster Analysis Example


The solution with 3 clusters is chosen
ANOVA
                                   Cluster Mean Square   df   Error Mean Square    df         F   Sig.
Seminar in Descriptive Statistics               35.828    2               1.401   246    25.567   .000
Economics                                      207.503    2               2.371   246    87.533   .000
Business Computing                             147.590    2               2.027   246    72.802   .000
Business Statistics                            724.605    2               1.900   246   381.357   .000
Mathematics                                    455.028    2               2.548   246   178.552   .000
Managerial Economics                           327.398    2               1.502   246   218.012   .000

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Final Cluster Centers
                                   Cluster 1   Cluster 2   Cluster 3
Seminar in Descriptive Statistics        8.4         7.3         7.2
Economics                                7.6         5.7         3.8
Business Computing                       7.7         6.3         4.3
Business Statistics                      7.9         4.2          .8
Mathematics                              8.2         5.8         2.1
Managerial Economics                     8.8         6.6         3.7

Number of Cases in each Cluster
Cluster 1    135.000
Cluster 2     89.000
Cluster 3     25.000
Valid        249.000
Missing       14.000
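
Each F value in the ANOVA output is the cluster mean square divided by the error mean square. For Business Statistics, 724.605 / 1.900 is about 381.4, which agrees with the reported 381.357 up to rounding of the mean squares. The sketch below shows how such a ratio could be computed for one variable from the grades in each cluster. The function name and the grades are hypothetical, not the student data.

```python
def anova_f(groups):
    """Return (cluster mean square, error mean square, F) for one variable."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-cluster sum of squares, with k - 1 degrees of freedom.
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-cluster (error) sum of squares, with n - k degrees of freedom.
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    cluster_ms = between / (k - 1)
    error_ms = within / (n - k)
    return cluster_ms, error_ms, cluster_ms / error_ms

# Hypothetical grades for one course in three clusters.
groups = [[8.0, 9.0, 10.0], [6.0, 7.0, 8.0], [3.0, 4.0, 5.0]]
cluster_ms, error_ms, f = anova_f(groups)
print(cluster_ms, error_ms, f)
```

With 3 clusters and 249 valid cases the degrees of freedom are 3 - 1 = 2 and 249 - 3 = 246, as in the output above.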
