Cluster Analysis

QEM
Cluster Analysis
Jacob Eskildsen
Cluster Analysis
Overview
Idea & purpose
Similarity measures
Dissimilarity measures
Hierarchical clustering
Non-hierarchical clustering
Cluster Analysis Overview

What type of relationship is being examined?

Dependence: How many variables are being predicted?
    Multiple relationships of dependent and independent variables: Structural equation modelling
    Several dependent variables in a single relationship: What is the measurement scale of the dependent variable?
        Metric: What is the measurement scale of the predictor variable?
            Metric: Canonical correlation
            Nonmetric: MANOVA
        Nonmetric: Canonical correlation with dummy variables
    One dependent variable in a single relationship: What is the measurement scale of the dependent variable?
        Metric: Multiple regression, Conjoint analysis
        Nonmetric: Multiple discriminant analysis, Linear probability models

Independence: Is the structure of relationships among:
    Variables: Factor analysis, Principal components
    Cases/respondents: Cluster analysis
    Objects: How are the attributes measured?
        Metric or nonmetric: Multidimensional scaling
        Nonmetric: Correspondence analysis
Cluster Analysis Idea & purpose

The objective is to discover natural groupings of the items (respondents or variables).
More primitive than classification, since no assumptions are made regarding the number of groups.
Grouping is done on the basis of similarities or distances (dissimilarities) between the items.
Cluster Analysis Similarity measures

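Euclidean distance
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})}$$

Statistical distance
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' \mathbf{A} (\mathbf{x} - \mathbf{y})}$$
ordinarily $\mathbf{A} = \mathbf{S}^{-1}$, where $\mathbf{S}$ is the sample covariance matrix

Minkowski metric
$$d(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$$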
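The three distances above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example vectors are assumptions, and the matrix A is passed in directly (for the statistical distance it would ordinarily be the inverse of the sample covariance matrix, which is not computed here).

```python
import math

def euclidean(x, y):
    """sqrt((x - y)'(x - y))"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def statistical(x, y, A):
    """sqrt((x - y)' A (x - y)), with A given as a list of rows."""
    diff = [a - b for a, b in zip(x, y)]
    # A times diff, one row of A at a time
    A_diff = [sum(a_ij * d_j for a_ij, d_j in zip(row, diff)) for row in A]
    return math.sqrt(sum(d * v for d, v in zip(diff, A_diff)))

def minkowski(x, y, m):
    """[sum_i |x_i - y_i|^m]^(1/m)"""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Hypothetical example vectors (not from the slides)
x = [1.0, 2.0]
y = [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(euclidean(x, y))              # 5.0
print(statistical(x, y, identity))  # 5.0, since A = I reduces to Euclidean distance
print(minkowski(x, y, 1))           # 7.0, the city-block distance
print(minkowski(x, y, 2))           # 5.0, the Euclidean distance again
```

With m = 2 the Minkowski metric coincides with the Euclidean distance, and with A equal to the identity matrix so does the statistical distance.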
Cluster Analysis Dissimilarity measures
Canberra metric
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i}$$

Czekanowski coefficient
$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$$
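As a rough sketch, both measures can be computed directly from their definitions. Both are meant for nonnegative variables. The function names and example vectors are assumptions, and skipping terms where x_i + y_i = 0 in the Canberra metric is one possible convention, not something the slides specify.

```python
def canberra(x, y):
    """sum_i |x_i - y_i| / (x_i + y_i) for nonnegative variables."""
    # Assumption: terms with x_i + y_i == 0 are skipped to avoid division by zero.
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def czekanowski(x, y):
    """1 - 2 * sum_i min(x_i, y_i) / sum_i (x_i + y_i) for nonnegative variables."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(a + b for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / total

# Hypothetical example vectors (not from the slides)
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]

print(canberra(x, y))     # 1.0  (2/4 + 0/4 + 2/4)
print(czekanowski(x, y))  # 1 - 2*4/12, approximately 0.333
```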
Cluster Analysis Hierarchical clustering

Many different methods exist (SPSS has 37 similarity or distance measures and 7 clustering methods).
The 3 most common are:
Single linkage
Complete linkage
Average linkage

Single linkage
Groups are formed from the individual items by merging nearest neighbors, where the term nearest neighbor connotes the smallest distance or largest similarity.
At each step the two clusters with the smallest distance are merged. Once clusters U and V have been merged, the distance between the new cluster (UV) and any other cluster W is
$$d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$$
The quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest neighbors of clusters U and W and of clusters V and W, respectively.

Complete linkage
Like single linkage with one exception: the distance between clusters is determined by the distance between the furthest neighbors.
Complete linkage ensures that all items in a cluster are within some maximum distance (minimum similarity) of each other.
At each step the two clusters with the smallest distance are merged. Once clusters U and V have been merged, the distance between the new cluster (UV) and any other cluster W is
$$d_{(UV)W} = \max\{d_{UW}, d_{VW}\}$$
The quantities $d_{UW}$ and $d_{VW}$ are the distances between the furthest neighbors of clusters U and W and of clusters V and W, respectively.

Average linkage
Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster.
At each step the two clusters with the smallest distance are merged. Once clusters U and V have been merged, the distance between the new cluster (UV) and any other cluster W is
$$d_{(UV)W} = \frac{\sum_{i} \sum_{k} d_{ik}}{N_{(UV)} N_W}$$
The quantity $d_{ik}$ is the distance between object i in cluster (UV) and object k in cluster W, and $N_{(UV)}$ and $N_W$ are the numbers of items in the two clusters.
Average linkage is the default in SPSS.
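The three linkage rules differ only in how the distance from a newly merged cluster (UV) to another cluster W is updated. The following is a minimal sketch of agglomerative clustering on a distance matrix, not the SPSS implementation. The function name `agglomerate` and the 5 x 5 example matrix are assumptions made for illustration. For average linkage the update uses the fact that the average over all pairs equals (N_U d_UW + N_V d_VW) / (N_U + N_V).

```python
def agglomerate(dist, linkage="single"):
    """Agglomerative clustering on a symmetric distance matrix (list of rows).

    Returns the merge history as (members_of_U, members_of_V, merge_distance).
    """
    n = len(dist)
    clusters = {i: (i,) for i in range(n)}
    # d[(a, b)] with a < b is the current distance between cluster ids a and b
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest current distance
        (u, v), height = min(d.items(), key=lambda item: item[1])
        merges.append((clusters[u], clusters[v], height))
        n_u, n_v = len(clusters[u]), len(clusters[v])
        updated = {}
        for w in clusters:
            if w in (u, v):
                continue
            d_uw = d[(min(u, w), max(u, w))]
            d_vw = d[(min(v, w), max(v, w))]
            if linkage == "single":
                updated[w] = min(d_uw, d_vw)
            elif linkage == "complete":
                updated[w] = max(d_uw, d_vw)
            elif linkage == "average":
                updated[w] = (n_u * d_uw + n_v * d_vw) / (n_u + n_v)
            else:
                raise ValueError("unknown linkage: " + linkage)
        # Drop distances involving U or V, then register the merged cluster
        d = {pair: val for pair, val in d.items() if u not in pair and v not in pair}
        clusters[next_id] = clusters.pop(u) + clusters.pop(v)
        for w, val in updated.items():
            d[(w, next_id)] = val
        next_id += 1
    return merges

# Hypothetical distances between five items (not from the slides)
D = [
    [0.0, 9.0, 3.0, 6.0, 11.0],
    [9.0, 0.0, 7.0, 5.0, 10.0],
    [3.0, 7.0, 0.0, 9.0, 2.0],
    [6.0, 5.0, 9.0, 0.0, 8.0],
    [11.0, 10.0, 2.0, 8.0, 0.0],
]

for method in ("single", "complete", "average"):
    heights = [round(h, 3) for _, _, h in agglomerate(D, method)]
    print(method, heights)
# single   [2.0, 3.0, 5.0, 6.0]
# complete [2.0, 5.0, 9.0, 11.0]
# average  [2.0, 5.0, 7.0, 8.167]
```

All three methods start by merging the same closest pair, but the later merge distances, and in general the merge order, depend on the linkage rule.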

The results of a cluster analysis can be displayed graphically in a dendrogram.
Cluster Analysis Non-hierarchical clustering

K-means method
Composed of three steps:
(1) Partition the items into K initial clusters.
(2) Proceed through the list of items, assigning an item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
(3) Repeat step 2 until no more reassignments take place.
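The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the SPSS procedure: squared Euclidean distance is used to find the nearest centroid, an item is moved only when another centroid is strictly closer, and the last item of a cluster is never moved so that no cluster becomes empty. The function names and the four example items are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(items, initial_labels):
    labels = list(initial_labels)  # step 1: the initial partition

    def members(c):
        return [p for p, lab in zip(items, labels) if lab == c]

    centroids = {c: centroid(members(c)) for c in sorted(set(labels))}
    moved = True
    while moved:  # step 3: repeat until a full pass makes no reassignment
        moved = False
        for i, p in enumerate(items):  # step 2: go through the list of items
            current = labels[i]
            if labels.count(current) == 1:
                continue  # assumption: never empty a cluster
            nearest = min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            if sq_dist(p, centroids[nearest]) < sq_dist(p, centroids[current]):
                labels[i] = nearest
                # Recalculate centroids for the losing and the receiving cluster
                centroids[current] = centroid(members(current))
                centroids[nearest] = centroid(members(nearest))
                moved = True
    return labels, centroids

# Four hypothetical two-dimensional items, initially split into two pairs
items = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
labels, centroids = k_means(items, [0, 0, 1, 1])
print(labels)     # [0, 1, 1, 1]
print(centroids)  # {0: [5.0, 3.0], 1: [-1.0, -1.0]}
```

Because each move to a strictly nearer centroid lowers the total within-cluster sum of squares, the loop ends after a finite number of passes. The final partition can still depend on the initial clusters.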
Cluster Analysis Example

Are there different segments among first-year bachelor students at Aarhus School of Business?
K-means with 2, 3, 4 and 5 clusters
[Figure: Sum of Squares of Distance to Cluster Centers. Vertical axis: proportion of sum of squares (0 to 1). Horizontal axis: number of clusters (1 to 5).]
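The slide does not say how the plotted proportion was computed. One plausible reading, taken here as an assumption, is the within-cluster sum of squared distances to the cluster centers divided by the same quantity for a single cluster. A small sketch of that calculation, with hypothetical function names and data:

```python
def within_ss(items, labels):
    """Sum of squared distances from each item to the center of its cluster."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(items, labels) if lab == c]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, center))
    return total

def proportion_of_ss(items, labels):
    """Within-cluster sum of squares relative to the one-cluster solution."""
    one_cluster = within_ss(items, [0] * len(items))
    return within_ss(items, labels) / one_cluster

# Four hypothetical items and a two-cluster assignment (not the student data)
items = [(5.0, 3.0), (-1.0, 1.0), (1.0, -2.0), (-3.0, -2.0)]
print(proportion_of_ss(items, [0, 0, 0, 0]))  # 1.0 by construction
print(proportion_of_ss(items, [0, 1, 1, 1]))  # 14/53, approximately 0.264
```

Under this reading the curve starts at 1 for one cluster and falls as clusters are added. A common heuristic is to pick the number of clusters after which the decrease levels off.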
Cluster Analysis Example

The solution with 3 clusters is chosen
ANOVA

| | Cluster Mean Square | Cluster df | Error Mean Square | Error df | F | Sig. |
|---|---|---|---|---|---|---|
| Seminar in Descriptive Statistics | 35.828 | 2 | 1.401 | 246 | 25.567 | .000 |
| Economics | 207.503 | 2 | 2.371 | 246 | 87.533 | .000 |
| Business Computing | 147.590 | 2 | 2.027 | 246 | 72.802 | .000 |
| Business Statistics | 724.605 | 2 | 1.900 | 246 | 381.357 | .000 |
| Mathematics | 455.028 | 2 | 2.548 | 246 | 178.552 | .000 |
| Managerial Economics | 327.398 | 2 | 1.502 | 246 | 218.012 | .000 |

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Final Cluster Centers

| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Seminar in Descriptive Statistics | 8.4 | 7.3 | 7.2 |
| Economics | 7.6 | 5.7 | 3.8 |
| Business Computing | 7.7 | 6.3 | 4.3 |
| Business Statistics | 7.9 | 4.2 | .8 |
| Mathematics | 8.2 | 5.8 | 2.1 |
| Managerial Economics | 8.8 | 6.6 | 3.7 |

Number of Cases in each Cluster

| | Number of cases |
|---|---|
| Cluster 1 | 135.000 |
| Cluster 2 | 89.000 |
| Cluster 3 | 25.000 |
| Valid | 249.000 |
| Missing | 14.000 |
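Each F value in the ANOVA output is the cluster mean square divided by the error mean square. As a quick arithmetic check, the sketch below recomputes the ratios from the printed values, listed in the row order of the ANOVA output. Small differences from the reported F values are expected, because the printed mean squares are rounded to three decimals. The variable names are illustrative.

```python
# Values copied from the ANOVA output, in row order
cluster_ms = [35.828, 207.503, 147.590, 724.605, 455.028, 327.398]
error_ms = [1.401, 2.371, 2.027, 1.900, 2.548, 1.502]
reported_f = [25.567, 87.533, 72.802, 381.357, 178.552, 218.012]

# F = cluster mean square / error mean square
ratios = [c / e for c, e in zip(cluster_ms, error_ms)]

for ratio, f in zip(ratios, reported_f):
    relative_gap = abs(ratio - f) / f
    print(f"recomputed {ratio:8.3f}   reported {f:8.3f}   relative gap {relative_gap:.5f}")
```

The recomputed ratios agree with the reported F values to within about 0.1 percent. This confirms how the column is built, not that the significance levels are meaningful; the caveat in the note about the F tests still applies.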