Sunteți pe pagina 1din 57

1

Multivariate Data Analysis Using Statgraphics Centurion: Part 2


Dr. Neil W. Polhemus Statpoint Technologies, Inc.

Multivariate Statistical Methods


The simultaneous observation and analysis of more than one response variable. *Primary Uses
1. Data reduction or structural simplification 2. Sorting and grouping 3. Investigation of the dependence among variables 4. Prediction 5. Hypothesis construction and testing

*Johnson and Wichern, Applied Multivariate Statistical Analysis

Data set: U.S. Demographics

Variables
Education: percent completing college Income per Capita: in 2011 dollars Urban Percentage: percent living in urban areas Population Density: people per square mile Non-English Language: percent that dont speak English Median Age: in years Unemployment Rate: for 2012 Percent Female: of general population Crime Rate: violent crimes per 100,000 Manufacturing: % of GDP Infant Mortality: 2006 Poverty Rate: percentage below poverty level

Election 2012: results of 2012 presidential election (Obama or Romney)

Urban Percentage

Percent Female

Poverty Rate

Star Glyphs

Data Input

10

11

Cluster Analysis
Method for assigning objects to groups so that objects in the same group are more similar to each other than they are to objects in other groups.
Some questions: 1. How many natural groups are there? 2. Which objects belong to each group?

12

Data Input

13

Analysis Options

14

Dendrogram Nearest Neighbor

15

Dendrogram Furthest Neighbor

16

Agglomeration Distance Plot

17

Dendrogram - 4 Clusters

18

Save Cluster Numbers

19

Map of Cluster Numbers

20

Method of k-Means
1. Starts with a set of typical objects (seeds) for each cluster. 2. Each object is assigned to group with nearest seed. 3. Centroids of groups are computed. 4. Objects are assigned to different groups if nearer to a different centroid. 5. Steps 3 and 4 are repeated until no change.

21

Method of k-Means

22

Seeds
Row 5 = California Row 32 = New York

Row 34 = North Dakota


Row 43 = Texas

23

Map for k-Means

24

Classification
The classification problem deals with the assignment of objects to known groups. Goals include:
1. Understanding what differentiates the groups. 2. Predicting group membership for unclassified objects.

25

Methods
Discriminant Analysis (on the Centurion Relate menu) based on Fishers linear discriminant functions.
Neural Network Classifier (on the Centurion Relate menu) based on a Bayesian classifier with priors and costs. Uniwin (companion program to Statgraphics) includes methods for dealing with qualitative factors.

26

(Linear) Discriminant Analysis


Constructs linear functions of the standardized classification variables:

D j d j1 Z1 d j 2 Z 2 ... d jp Z p
These functions are derived so as to maximize the separation of the groups. If there are g groups, there are g-1 discriminant functions.

27

Example

28

Data Input

29

Analysis Options

30

All Variables

31

Discriminant Function Plot

32

Backward selection

33

Classification Functions
These functions create a score for each group:
C j c j1 X 1 c j 2 X 2 ... c jp X p c j 0

Objects are assigned to whatever group has the largest value of (Cj * priorj) where priorj is the prior probability of belonging to group j.

34

Classification Functions

35

Classification Summary

36

Classification Table

37

Validation Set
May leave some of the objects out of the training set and use them to validate the results.

Correctly classified 8 out of 10 omitted states.

38

Neural Network Classifier


Implements a nonparametric method for classifying objects based on the product of 3 quantities:
1. 2. 3. The estimated density function in the neighborhood of the object (given a specified value of s). The prior probabilities of belonging to each group. The costs of misclassifying cases that belong to a given group.

39

Bivariate Density (s = 0.5)


Biv ariate Density Obama

Biv ariate Density Romney

(X 0.001) 4
density

3 2 1 0 35 55 12 75 95 115 8 16 20

24

Pov erty rate

(X 0.001) 5 4 3 2 1 0
density

24 20 16 55 12 75 95 115 8 Pov erty rate

35

Urban Percentage

Urban Percentage

40

Schematic Diagram

41

Data Input

42

Analysis Options

43

Classification Summary

44

Reduced Set of Variables

45

Some Improvement

46

Classification Plot

47

Classification Plot

48

UNIWIN Plus from Sigma Plus


Software package written by our longtime colleague Christian Charles at Sigma Plus in France.
Contains additional features for multivariate analysis. Reads Statgraphics data files.

49

Uniwin Cluster and Classification


Handle qualitative variables.

50

New Variables

51

Uniwin Cluster Analysis

52

Dendrogram

53

Uniwin Clusters

54

Uniwin Classification

55

Classification Results

Wrongly classified only Nevada (only cattle state to vote for Obama).

56

More Information
Statgraphics Centurion: www.statgraphics.com
Uniwin Plus: www.statgraphics.fr or www.sigmaplus.fr Or send e-mail to info@statgraphics.com

Join the Statgraphics Community on:

Follow us on

S-ar putea să vă placă și