Methods in
Research
13 – 14 January 2011
6/9/2016 2
Contact Information
Objective
Method
Machinery
Agenda
Norm
Development of SPSS
Contact Information
• Mr. Chang Yun Fah
• E-mail: changyf@utar.edu.my
• Tel: 03-41079802
• Department of Mathematical and Actuarial
Sciences, Faculty of Engineering and
Science
Objective
• Use the analytical functions of SPSS.
• Process data and generate statistics for
ANOVA and MANOVA tests.
• Process data and generate statistics for linear
regression analysis and logit analysis.
• Process data and generate statistics for
principal component analysis and factor
analysis.
• Process data and generate statistics for
clustering analysis.
• Process data and generate statistics for
discriminant analysis.
Method
Method of Learning: Presentation, Case study, Exercise, Discussion
Machine & Software
• SPSS 14.0 for Windows
• Student package
• Release 14.0.0 on
• needed a downloadable hotfix to be
installed in order to be compatible with
Windows Vista.
Agenda
Day 1: COMMAND; One-Way, Two-Way and Three-Way ANOVA
Day 2: Principal Component Analysis and Factor Analysis; Clustering Analysis
Navigate
• Problem analysis
• Data collection
• Tools selection
• Statistical analysis
• Interpretation
• Writing report
History
Release history
SPSS 15.0.1 - November 2006
SPSS 16.0.2 - April 2008
SPSS Statistics 17.0.1 - December 2008
PASW Statistics 17.0.3 - September 2009
PASW Statistics 18.0 - August 2009
PASW Statistics 18.0.1 - December 2009
PASW Statistics 18.0.2 - April 2010
One-Way, Two-Way and Three-Way ANOVA
One-Way ANOVA
Analysis of variance (ANOVA) is an extension of
the independent-samples t-test (one independent
variable/factor with 2 levels/groups).
It deals with an experiment involving one
dependent variable and one factor (independent
variable).
It compares the means of 2 or more levels
(treatments) of the factor.
Completely Randomized Design
In general, there will be $a$ levels of the factor (that is, $a$ treatments) and $n$ replicates of the experiment, run in
random order.
The objective is to test hypotheses about the equality of the
$a$ treatment means.
$N = a \times n$ total runs
$H_0: \tau_1 = \tau_2 = \dots = \tau_a = 0$
$H_1: \tau_i \neq 0$ for at least one $i$.
ANOVA table:
Source of variation: Sum of Squares; DF; MS; F0
Between treatments: $SST = n\sum_{i=1}^{a}(\bar{y}_{i.}-\bar{y}_{..})^2 = \frac{1}{n}\sum_{i=1}^{a} y_{i.}^2 - \frac{y_{..}^2}{N}$; DF $= a-1$; $MST = \frac{SST}{a-1}$; $F_0 = \frac{MST}{MSE}$
Error (within treatments): $SSE = SSTO - SST$; DF $= N-a$; $MSE = \frac{SSE}{N-a}$
Total: $SSTO = \sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij}-\bar{y}_{..})^2 = \sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij}^2 - \frac{y_{..}^2}{N}$; DF $= N-1$
Exercise:
Observations
Power (W) 1 2 3 4 5 Total Average
160 575 542 530 539 570 2756 551.2
180 565 593 590 579 610 2937 587.4
200 600 651 610 637 629 3127 625.4
220 725 700 715 685 710 3535 707.0
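The sums of squares in the ANOVA table can be checked by hand for the power data above. The following is a minimal sketch in plain Python rather than SPSS; the function name `one_way_anova` is illustrative and not part of the course material.

```python
def one_way_anova(groups):
    """Return (SST, SSE, F0) for a one-way layout given as a list of lists."""
    a = len(groups)                           # number of treatments
    N = sum(len(g) for g in groups)           # total number of runs
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / N         # y..^2 / N
    # Between-treatment sum of squares: sum(y_i.^2 / n_i) - y..^2 / N
    sst = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    # Total sum of squares: sum(y_ij^2) - y..^2 / N
    ssto = sum(y ** 2 for g in groups for y in g) - correction
    sse = ssto - sst                          # error (within treatments)
    f0 = (sst / (a - 1)) / (sse / (N - a))    # MST / MSE
    return sst, sse, f0


# Observations from the table above, one row per power setting
power_data = [
    [575, 542, 530, 539, 570],   # 160 W
    [565, 593, 590, 579, 610],   # 180 W
    [600, 651, 610, 637, 629],   # 200 W
    [725, 700, 715, 685, 710],   # 220 W
]

sst, sse, f0 = one_way_anova(power_data)
print(round(sst, 2), round(sse, 2), round(f0, 2))
```

With a = 4 and N = 20, F0 is compared against the F distribution with 3 and 16 degrees of freedom.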
Open SPSS file ‘Data 1’: Summaries for
personal savings and personal income based
on ethnic group. What is your ‘conclusion’?
Are they statistically different?
Descriptives: Personal income
Group N Mean Std. Deviation Std. Error 95% CI for Mean (Lower Bound) 95% CI for Mean (Upper Bound) Minimum Maximum
Malay 11 $1,011.8182 $438.01411 $132.066 $717.5563 $1,306.0801 $650.00 $2100.00
Chinese 15 $1,127.3333 $340.95384 $88.03390 $938.5194 $1,316.1473 $600.00 $1600.00
Indian 7 $1,138.5714 $367.48826 $138.898 $798.7015 $1,478.4414 $600.00 $1670.00
Foreigner 6 $1,238.3333 $406.91113 $166.121 $811.3063 $1,665.3604 $590.00 $1700.00
Total 39 $1,113.8462 $376.92394 $60.35614 $991.6615 $1,236.0308 $590.00 $2100.00
Rearrange the personal savings and ethnic group from Questionnaire 1
(Data1) into the completely randomized design format:
Ethnic Psavings
3 32
4 45
2 34
2 56
3 32
2 26
2 23
2 27
3 38
2 21
4 48
4 49
4 12
2 27
1 11
1 43
1 18
1 19
1 11
2 24
Observations
Treatment 1 2 3 4 5 6 7 8
1 11 43 18 19 11
2 34 56 26 23 27 21 27 24
3 32 32 38
4 45 48 49 12
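The rearrangement into the completely randomized design layout is a grouping of the (Ethnic, Psavings) pairs by treatment. A minimal Python sketch follows, assuming the twenty pairs listed on the slide; the helper name `to_crd` is hypothetical.

```python
from collections import defaultdict

# (ethnic code, personal savings) pairs in the order listed on the slide
records = [
    (3, 32), (4, 45), (2, 34), (2, 56), (3, 32), (2, 26), (2, 23),
    (2, 27), (3, 38), (2, 21), (4, 48), (4, 49), (4, 12), (2, 27),
    (1, 11), (1, 43), (1, 18), (1, 19), (1, 11), (2, 24),
]


def to_crd(pairs):
    """Group observations by treatment, keeping the original order."""
    layout = defaultdict(list)
    for treatment, value in pairs:
        layout[treatment].append(value)
    return dict(sorted(layout.items()))


crd = to_crd(records)
for treatment, observations in crd.items():
    print(treatment, observations)
```

Each printed row corresponds to one treatment row of the design table; the rows have unequal lengths because the groups are unbalanced.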
1. Analyze menu > Compare Means > One-Way ANOVA
2. Move dependent variable to ‘Dependent List’
3. Move independent variable to ‘Factor’
4. Click ‘Options’, select the Statistics and Means plot
5. Continue, then OK
Open SPSS file ‘Data1’: Conduct a One-Way ANOVA by using
Personal savings as dependent variable and Ethnic as factor.
Use LSD test for multiple comparison.
Exercise
Get the One-Way ANOVA for Personal income as
dependent variable and Ethnic as factor
Multiple Comparison Test (Post Hoc Test)
Dunnett’s test: Comparing treatment means with a control
group:
H0: μi-μa= 0 vs H1: μi-μa ≠ 0
Example 1: the Malays are the benchmark for personal savings.
Example 2: the ASEAN countries’ GDP for the last 4
consecutive years were compared. The GDP obtained in
2006 is used as the base.
Multiple Comparisons
(I) Ethnic group (J) Ethnic group Mean Difference (I-J) Std. Error Sig. 95% CI Lower Bound 95% CI Upper Bound
Malay Chinese -5.188 4.642 .271 -14.61 4.24
Malay Indian -14.597* 5.653 .014 -26.07 -3.12
Malay Foreigner -18.288* 5.934 .004 -30.34 -6.24
Chinese Malay 5.188 4.642 .271 -4.24 14.61
Chinese Indian -9.410 5.352 .087 -20.28 1.46
Chinese Foreigner -13.100* 5.648 .026 -24.57 -1.63
Indian Malay 14.597* 5.653 .014 3.12 26.07
Indian Chinese 9.410 5.352 .087 -1.46 20.28
Indian Foreigner -3.690 6.505 .574 -16.90 9.52
Foreigner Malay 18.288* 5.934 .004 6.24 30.34
Foreigner Chinese 13.100* 5.648 .026 1.63 24.57
Foreigner Indian 3.690 6.505 .574 -9.52 16.90
*. The mean difference is significant at the .05 level.
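The confidence bounds in the table can be reproduced from each mean difference and its standard error. The sketch below is illustrative only: the critical value 2.0301 is an assumption (the two-sided 5% t quantile for 35 error degrees of freedom, inferred from 39 cases in 4 groups), and the function names are hypothetical.

```python
T_CRIT = 2.0301  # assumed: t quantile for alpha = 0.05 (two-sided), 35 error df

# (mean difference I-J, standard error) copied from the table
comparisons = {
    ("Malay", "Chinese"): (-5.188, 4.642),
    ("Malay", "Indian"): (-14.597, 5.653),
    ("Malay", "Foreigner"): (-18.288, 5.934),
    ("Chinese", "Indian"): (-9.410, 5.352),
    ("Chinese", "Foreigner"): (-13.100, 5.648),
    ("Indian", "Foreigner"): (-3.690, 6.505),
}


def lsd_interval(mean_diff, std_error, t_crit=T_CRIT):
    """95% confidence interval for a pairwise mean difference."""
    half_width = t_crit * std_error
    return mean_diff - half_width, mean_diff + half_width


def is_significant(mean_diff, std_error, t_crit=T_CRIT):
    """A difference is significant when its interval excludes zero."""
    return abs(mean_diff) > t_crit * std_error


for pair, (diff, se) in comparisons.items():
    low, high = lsd_interval(diff, se)
    print(pair, round(low, 2), round(high, 2), is_significant(diff, se))
```

Only the pairs whose interval excludes zero carry an asterisk in the SPSS output.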
Writing it in report 2!
Analysis of the data showed that the amount of savings differed among
the ethnic groups with the foreign workers having the highest savings at
RM40830 followed by the Indians, Chinese and lastly Malays. Further
analysis using one-way ANOVA technique revealed that the difference
was significant at the 0.05 level. LSD test was conducted to detect
which ethnic group differed from the other ethnic groups. The result
showed the mean difference of savings between Malays and the Chinese
was RM5188 and this was not significant at 0.05 (p=0.271). But the mean
differences in savings between the Malays and the Indians and the
foreigners were RM14597 and RM18288 respectively and they were both
significant at 0.05 with the probabilities of error being p=0.014 and
p=0.004 respectively. At the 0.05 level, the Chinese workers had
significant savings differences from the foreigners but no significant
differences with the Malays and the Indians. The Indians had a significant
savings difference from the Malays but not the other ethnic groups. Lastly,
the foreigners had significant savings differences from the Malays and the
Chinese but not the Indians.
Perform the multiple comparison by assuming Malay is a
control group.
Multiple Comparisons
Mean
Difference 95% Confidence Interval
(I) Ethnic group (J) Ethnic group (I-J) Std. Error Sig. Lower Bound Upper Bound
Chinese Malay 5.188 4.642 .563 -6.27 16.64
Indian Malay 14.597* 5.653 .038 .65 28.55
Foreigner Malay 18.288* 5.934 .011 3.64 32.93
*. The mean difference is significant at the .05 level.
a. Dunnett t-tests treat one group as a control, and compare all other groups against it.
Kruskal-Wallis H Test (nonparametric alternative to one-way ANOVA)
Open SPSS file ‘Data11’
Analyze menu > Nonparametric Tests > K Independent Samples
Test Statistics(a,b)
(RM)
Chi-Square 9.632
df 2
Asymp. Sig. .008
a. Kruskal Wallis Test
b. Grouping Variable: Country
Writing it in report 3!
Interaction Effects
3. Whether the means of perception were
significantly influenced by both the types of
companies and citizenships of workers
(interaction effects)
Open SPSS file ‘Data15’:
the data suggested that foreign workers responded favorably to the
question whether they received good treatment from their employers, but
the situation was reversed for companies owned by local investors.
Interaction Plot
Analyze menu > General Linear Model > Univariate
Multiple Comparison: click ‘Post Hoc’, move factors (can be >2 factors) to ‘Post Hoc Tests for’, select the tests, then Continue.
Click ‘Options’, click ‘Descriptive statistics’, then Continue and OK.
Descriptive Statistics
Mean
Difference 95% Confidence Interval
(I) OWNER (J) OWNER (I-J) Std. Error Sig. Lower Bound Upper Bound
foreign joint-venture -.50 .434 .486 -1.55 .55
local 1.30* .434 .011 .25 2.35
joint-venture foreign .50 .434 .486 -.55 1.55
local 1.80* .434 .000 .75 2.85
local foreign -1.30* .434 .011 -2.35 -.25
joint-venture -1.80* .434 .000 -2.85 -.75
Based on observed means.
*. The mean difference is significant at the .05 level.
Second, there was no significant difference in the levels of satisfaction
between local workers and the Vietnamese workers with F=0.319 (p=0.575).
This means that citizenship did not have significant influence on satisfaction.
Third, there was an interaction effect in which both ownership of company
and workers’ citizenship together exerted significant influence on workers’
satisfaction with F=10.949 and significant at the 0.05 level (p=0.000).
A closer look at the interaction effects (figure) revealed that there was a big
gap between the level of satisfaction of local and Vietnamese workers in
foreign-owned companies, with the latter showing higher satisfaction than
the former. In joint-venture companies, Vietnamese workers continued to
show higher satisfaction than local workers but the gap was narrowed,
whereas in local companies, the situation was reversed where local workers
indicated a higher level of satisfaction than Vietnamese workers.
Case of No Interaction Effects
Tests of Between-Subjects Effects
Parallel lines
Tests of Between-Subjects Effects
It is hard to
determine the
interaction effects
based on a plot with
3 or more lines
Three-Way ANOVA
It is a 3-factor factorial design.
One dependent variable in numeric scale
and 3 explanatory variables (factors) either
all in numeric scale or a combination of
numeric and categorical scales.
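$$y_{ijkl} = \mu + \tau_i + \beta_j + \gamma_k + (\tau\beta)_{ij} + (\tau\gamma)_{ik} + (\beta\gamma)_{jk} + (\tau\beta\gamma)_{ijk} + \varepsilon_{ijkl}$$
where $i = 1,2,\dots,a$; $j = 1,2,\dots,b$; $k = 1,2,\dots,c$; $l = 1,2,\dots,n$.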
Analyze menu > General Linear Model > Univariate
Tests of Between-Subjects Effects
Dependent Variable: PERCEPTION
Source Type III Sum of Squares df Mean Square F Sig.
Corrected Model 172.717a 11 15.702 8.416 .000
Intercept 544.027 1 544.027 291.588 .000
Ethnicity 27.432 2 13.716 7.352 .005
Education 5.767 1 5.767 3.091 .096
City 10.607 1 10.607 5.685 .028
Ethnicity * Education 65.617 2 32.809 17.585 .000
Ethnicity * City 3.832 2 1.916 1.027 .378
Education * City 1.179 1 1.179 .632 .437
Ethnicity * Education * City 23.096 2 11.548 6.189 .009
Error 33.583 18 1.866
Total 869.000 30
Corrected Total 206.300 29
a. R Squared = .837 (Adjusted R Squared = .738)
• There is a significant difference in perception towards efficiency of the police among the respondents of different ethnicities at 0.05 level (p=0.005).
• There is no significant difference in perception towards the efficiency of the police among respondents of different educational backgrounds at 0.05 (p=0.096).
• There is a significant difference in perception towards the efficiency of the police among respondents of different locations at 0.05 (p=0.028).
• No interaction effect between ethnic and location factors.
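Each F value in the table is the mean square of the effect divided by the error mean square. The sketch below recomputes them from the sums of squares and degrees of freedom; it is an illustrative check, and the function name `f_ratios` is hypothetical.

```python
# (Type III sum of squares, df) per source, copied from the table
effects = {
    "Ethnicity": (27.432, 2),
    "Education": (5.767, 1),
    "City": (10.607, 1),
    "Ethnicity * Education": (65.617, 2),
    "Ethnicity * City": (3.832, 2),
    "Education * City": (1.179, 1),
    "Ethnicity * Education * City": (23.096, 2),
}
ERROR_SS, ERROR_DF = 33.583, 18


def f_ratios(effects, error_ss, error_df):
    """F = (SS_effect / df_effect) / (SS_error / df_error) for every source."""
    mse = error_ss / error_df
    return {name: (ss / df) / mse for name, (ss, df) in effects.items()}


ratios = f_ratios(effects, ERROR_SS, ERROR_DF)
for name, f_value in ratios.items():
    print(f"{name}: F = {f_value:.3f}")
```

Small differences in the last decimal place come from the rounding of the printed sums of squares.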
Controlling for the location factor, this study revealed that there was a
significant interaction effect in perception towards efficiency in the police in
terms of ethnicity among respondents living in small cities. In small cities,
lowly educated Malay respondents demonstrated a relatively high regard for
the police compared with the highly educated Malays, a similar pattern
observable among the Chinese respondents but with a wider gap between
the lowly and highly educated. However, this phenomenon was not
noticeable among the Indian respondents; in fact this ethnic group showed a
reverse situation in which the highly educated Indians tended to have a
relatively more favorable perception towards the police compared with the
lowly educated Indians. Moreover, the gap between the two groups of
educational attainment among the Indians was much wider compared with
the other two ethnic groups, namely the Malays and the Chinese.
A similar interaction effect can be seen among respondents living in big
cities. Generally the Malays demonstrated the highest regard for the police,
followed by the Chinese and the Indians. The lowly educated Malays
demonstrated a relatively high regard for the police; a similar pattern which
was observable among the Chinese respondents but at relatively low levels
and a narrower gap between the two groups. However, the situation is
reversed among the Indians where perception towards efficiency of the
police among the highly educated of this ethnic group was relatively higher
than that of the lowly educated. In general, the Indians were found to have
very low regard for the police compared with the other two ethnic groups,
and there was not much difference between the lowly and highly educated
respondents among members of this ethnic group in terms of the level of
respect for the police.
Exercise using ‘Data15’: construct the 3-way ANOVA
using three factors: Ownership, Worker, and Activity
Difficult to interpret 3 lines
Linear Regression Analysis and
Logit Analysis
Simple linear regression
Fitted line
Multiple linear regression and model
checking
Significance test
Model selection
Nonlinear regression and curve estimation
Dummy variable in regression
Logit analysis
Simple Linear Regression
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \dots, n$$
where:
• The intercept β0 and the slope β1 are unknown
constants (parameters)
• The regressor (independent or predictor variable)
xi is a known constant (fixed)
• εi is the random error.
• yi is the value of the response (dependent)
variable in the i-th trial/observation
Assumptions:
1) The error terms εi are Normally (needed in inferences and
MLE) and Independently Distributed with mean E(εi)=0 and
constant variance Var(εi)=σ2: $\varepsilon_i \sim NID(0, \sigma^2)$
2) The errors (thus, the yi also) are uncorrelated with each other in
successive observations: $Cov(\varepsilon_i, \varepsilon_j) = 0 \ \forall i \neq j$
3) The relationship between the variables y and x should be linear.
1. Analyze menu > Regression > Linear
2. Move one variable to ‘Dependent’, one variable to ‘Independent’
3. Click ‘Statistics’ and select types of statistics: Estimates, Model fit, R2
4. Continue, then OK
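Coefficients
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} = -48.650$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 296.813$$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
Estimated Mobile phone bill = 296.813 – 48.65*CGPA
The standardized intercept is always zero; the standardized slope (= -0.324) is now comparable to other slopes, if any.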
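The estimators can be computed directly from their definitions. The sketch below is a minimal illustration in plain Python; the function name `least_squares` and the five data points are hypothetical and are not the course data set.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the fitted line using S_xy / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical observations: CGPA and monthly mobile phone bill
cgpa = [2.0, 2.5, 3.0, 3.5, 4.0]
bill = [210.0, 160.0, 150.0, 130.0, 100.0]

b0, b1 = least_squares(cgpa, bill)
print(f"fitted line: bill = {b0:.1f} + ({b1:.1f}) * CGPA")
```

A negative slope, as in the SPSS output, means the fitted bill falls as CGPA rises.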
ANOVA(b)
Model Sum of Squares df Mean Square F Sig.
1 Regression 41792.297 1 41792.297 5.615 .022a
Residual 357292.9 48 7443.603
Total 399085.2 49
a. Predictors: (Constant), CGPA (max 4.0)
b. Dependent Variable: Handphone bills
p-value<0.05 means the regression model is significant.
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate R Square Change F Change df1 df2 Sig. F Change
1 .324a .105 .086 86.276 .105 5.615 1 48 .022
a. Predictors: (Constant), CGPA (max 4.0)
• Coefficient of correlation (R): measures the strength of relationship between the dependent and independent variables.
• Coefficient of determination (R2): the variation of the dependent variable explained by the model (%).
• When the difference between Adjusted R2 and R2 is large, the linear model is not appropriate or some ‘important’ independent variables are missing.
• Std error of the estimate: standard errors of the predicted values.
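The Model Summary figures follow from the two sums of squares in the ANOVA table. The sketch below recomputes them; the function name `regression_summary` is illustrative, and n = 50 is inferred from the total degrees of freedom (49).

```python
import math


def regression_summary(ss_regression, ss_residual, n, k):
    """Summary statistics for a regression with k predictors and n cases."""
    ss_total = ss_regression + ss_residual
    r2 = ss_regression / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = ss_residual / (n - k - 1)
    return {
        "R": math.sqrt(r2),
        "R2": r2,
        "adj_R2": adj_r2,
        "std_error": math.sqrt(mse),        # standard error of the estimate
        "F": (ss_regression / k) / mse,
    }


summary = regression_summary(41792.297, 357292.9, n=50, k=1)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

R is reported as a positive number here; the direction of the relationship comes from the sign of the slope.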
Writing it in report 6!
One of the hypotheses in this study is that several personal characteristics of
the youths interviewed can predict the monthly amount they spent on the
mobile phone. Simple linear regression analysis using CGPA as the
explanatory variable produced a significant result with F=5.615 and
significant at the 0.05 level (p=0.022). There was a weak and inverse
relationship between CGPA and expenditure on mobile phone with a
correlation coefficient of -0.324. The derived R2 was rather small at 0.105,
indicating only 10.5% variations in expenditure on mobile phone were
explained by academic performance measured in CGPA. The resultant
model from the analysis is Y= 296.813 – 48.65*X where Y is expenditure on
mobile phone and X is CGPA. It showed that the higher the CGPA of the
respondent, the lower the amount spent on mobile phone bills.
Note: Although the ANOVA test shows that the model is significant, the small R
and R2 values indicate that CGPA is not a good predictor of expenditure on
mobile phone.
Scatter plot: click ‘Add fit line at total’ to draw the fitted line. One point is a possible outlier.
Multiple Linear Regression
It involves one dependent variable and a set of
several independent variables.
The assumptions of simple linear regression still
hold, e.g. linearity, normality, uncorrelated errors and
equal variance.
IMPORTANT: At the end of MLR, we try to
select a model that has a minimum number of
independent variables with acceptable model
performance (e.g. large coefficient of
determination)
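Matrix form of multiple linear regression model
$$y = X\beta + \varepsilon$$
where $y = [y_1 \ y_2 \ \cdots \ y_n]'$, $\beta = [\beta_0 \ \beta_1 \ \cdots \ \beta_k]'$, $\varepsilon = [\varepsilon_1 \ \varepsilon_2 \ \cdots \ \varepsilon_n]'$ and
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$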
Open SPSS file ‘Data17’
1. Analyze menu > Regression > Linear
2. Move one variable to ‘Dependent’, regressors to ‘Independent’
3. Click ‘Statistics’ and select types of statistics: Estimates, Model fit, Descriptive
4. Continue, then click ‘Plots’
Move Dependent to ‘Y’ and ZRESID (or others) to ‘X’ (residual plot to check assumptions), click ‘Histogram’ and ‘Normal Probability Plot’, then Continue and OK.
Variables available for plotting:
• Dependent variable
• Standardized predicted values
• Standardized residuals
• Deleted residuals
• Adjusted predicted values
• Studentized residuals
• Studentized deleted residuals
Do the data meet the conditions and assumptions?
1) Do the explanatory variables have a relatively strong linear
relationship with the dependent variable?
YES, except variable AGE
Correlations
Columns: Job Satisfaction, Working Experience (Years), INCOME (RM), AGE (Years), SEX
Pearson Correlation
Job Satisfaction 1.000 .660 .719 -.045 -.407
Working Experience (Years) .660 1.000 .482 .135 -.232
INCOME (RM) .719 .482 1.000 -.093 -.183
AGE (Years) -.045 .135 -.093 1.000 .166
SEX -.407 -.232 -.183 .166 1.000
Sig. (1-tailed)
Job Satisfaction . .000 .000 .406 .013
Working Experience (Years) .000 . .003 .239 .109
INCOME (RM) .000 .003 . .313 .166
AGE (Years) .406 .239 .313 . .191
SEX .013 .109 .166 .191 .
N
Job Satisfaction 30 30 30 30 30
Working Experience (Years) 30 30 30 30 30
INCOME (RM) 30 30 30 30 30
AGE (Years) 30 30 30 30 30
SEX 30 30 30 30 30
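Residuals $e_i$: the realized or observed values of the errors.
Standardized Residuals: $d_i = \frac{e_i}{\sqrt{MS_E}}$, with zero mean and approximately unit variance ($\sqrt{MS_E}$ is the average standard deviation).
Studentized Residuals: $r_i = \frac{e_i}{\sqrt{MS_E\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]}}$, which uses the exact standard error and is useful in regression diagnosis.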
2) Does the normality assumption hold?
YES, since the histogram is
approximately bell-shaped and the points
lie approximately along a straight line
in the Normal P-P plot
Heavy-tailed/long tailed distribution:
•The points show a sharp upward and
downward curve at both extremes.
Negative/left
skewed
Positive/right
skewed
3) Is the error variance constant?
ANOVA(b)
Model Sum of Squares df Mean Square F Sig.
1 Regression 126.678 4 31.670 14.234 .000a
Residual 55.622 25 2.225
Total 182.300 29
a. Predictors: (Constant), SEX, AGE (Years), INCOME (RM), Working Experience (Years)
b. Dependent Variable: Job Satisfaction
The overall regression model was significant at 0.05 level (p=0.000)
JobStat = 1.406 + 0.413Exp + 0.001Income – 0.003Age – 1.134Sex
Writing it in report 7!
This study examines the possibility of several personal characteristics of
workers in explaining job satisfaction in the firm. Job satisfaction was in a
9-point Likert scale in which 1 denotes extreme dissatisfaction and 9
denotes extreme satisfaction, working experience in years, monthly
income in RM, age in years and sex a categorical variable in which 0 is
for female and 1 for male. Initial analysis using the Pearson correlation
method found that the dependent variable had relatively strong
correlations with the independent variables (except age), that the normality
assumption held, but that the error variance was not constant. (Assuming that a
transformation method was applied to deal with the error variance.) In
general the analysis yielded a significant regression model with F value
of 14.234 and significant at the 0.05 level. The derived model is
JobStat = 1.406 + 0.413Exp + 0.001Income – 0.003Age – 1.134Sex
Job satisfaction was found to be positively correlated with working
experience and income but had an inverse relationship with age and
sex. High job satisfaction was associated with high income, longer
working experience and young workers. This study also found that
female workers were more likely to be more satisfied than their male
counterparts. Taking the regression model as a whole it was found that
the four independent variables were able to explain 69.5% of the
variance in levels of job satisfaction among the studied workers. At the
individual level, income was the most significant variable in explaining
variations in levels of job satisfaction, followed by working experience
and sex, while age did not have significant impact.
Model Selection
• Enter (Regression): all variables in a block are entered
in a single step.
• Remove: all variables in a block are removed in a
single step.
• Stepwise: At each step, the independent variable not in
the model that has the smallest probability of F is
entered, if that probability is sufficiently small. Variables
already in the regression equation are removed if their
probability of F becomes sufficiently large.
• Backward Elimination: all variables are entered into the
equation and then sequentially removed. The variable
with the smallest partial correlation with the dependent
variable that meets the elimination criterion is removed
first. The procedure is repeated until no variable
in the model satisfies the removal criteria.
• Forward Selection: variables are sequentially entered
into the model. The variable with the largest
positive or negative correlation with the dependent
variable that satisfies the entry criterion enters first.
The procedure is repeated until no variables
meet the entry criterion.
Choose ‘Stepwise’ method.
Repeat by using ‘Remove’, ‘Backward’, ‘Forward’ methods.
Variables Entered/Removed(a)
Model Variables Entered Variables Removed Method
1 INCOME (RM) . Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2 Working Experience (Years) . Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
3 SEX . Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
a. Dependent Variable: Job Satisfaction
Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .719a .517 .500 1.773
2 .803b .645 .619 1.548
3 .834c .695 .660 1.463
a. Predictors: (Constant), INCOME (RM)
b. Predictors: (Constant), INCOME (RM), Working Experience (Years)
c. Predictors: (Constant), INCOME (RM), Working Experience (Years), SEX
Models 1, 2 and 3 are all significant, but Model 3 is preferred because there is a large increase in R2 values from Model 2 to Model 3.
ANOVA(d)
Model Sum of Squares df Mean Square F Sig.
1 Regression 94.261 1 94.261 29.979 .000a
Residual 88.039 28 3.144
Total 182.300 29
2 Regression 117.581 2 58.790 24.527 .000b
Residual 64.719 27 2.397
Total 182.300 29
3 Regression 126.657 3 42.219 19.727 .000c
Residual 55.643 26 2.140
Total 182.300 29
a. Predictors: (Constant), INCOME (RM)
b. Predictors: (Constant), INCOME (RM), Working Experience (Years)
c. Predictors: (Constant), INCOME (RM), Working Experience (Years), SEX
d. Dependent Variable: Job Satisfaction
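One way to judge each stepwise step is the partial F statistic for the variable just added: the gain in regression sum of squares divided by the residual mean square of the larger model. The sketch below is an illustration using the ANOVA figures above; the function name `partial_f` is hypothetical and the helper only handles one added variable per step.

```python
# (regression SS, residual SS, residual df) for the three stepwise models
models = [
    (94.261, 88.039, 28),    # Model 1: INCOME
    (117.581, 64.719, 27),   # Model 2: + Working Experience
    (126.657, 55.643, 26),   # Model 3: + SEX
]
TOTAL_SS = 182.300


def partial_f(reduced, full):
    """F statistic for the single variable added between two nested models."""
    extra_ss = full[0] - reduced[0]      # gain in regression sum of squares
    mse_full = full[1] / full[2]         # residual mean square, larger model
    return extra_ss / mse_full


r2_values = [ss_reg / TOTAL_SS for ss_reg, _, _ in models]
f_changes = [partial_f(models[i - 1], models[i]) for i in range(1, len(models))]

for step, (r2, f_change) in enumerate(zip(r2_values[1:], f_changes), start=2):
    print(f"Model {step}: R2 = {r2:.3f}, F change = {f_change:.3f}")
```

For a single added variable, the F change equals the square of that variable's t statistic in the larger model, which gives a cross-check against the t values SPSS reports for Working Experience and SEX.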
Coefficients(a)
Model B (Unstandardized) Std. Error Beta (Standardized) t Sig.
1 (Constant) 1.875 .609 3.078 .005
INCOME (RM) .001 .000 .719 5.475 .000
2 (Constant) .344 .724 .475 .638
INCOME (RM) .001 .000 .522 3.990 .000
Working Experience (Years) .458 .147 .408 3.119 .004
3 (Constant) 1.320 .832 1.586 .125
INCOME (RM) .001 .000 .501 4.034 .000
Working Experience (Years) .410 .141 .365 2.913 .007
SEX -1.145 .556 -.230 -2.059 .050
a. Dependent Variable: Job Satisfaction
Nonlinear Regression and Curve Estimation
Open SPSS file ‘Data18’ and create the scatter plot with
fitted line
The straight line does not fit
the data well. There
are two ways to solve
this problem:
1. Transform the data
using an appropriate
transformation (say
log)
2. Fit the data using curve
estimation
Transform the y and x values using logarithm: y => ln y
and x => ln x
Model Summary and Parameter Estimates
Writing it in report 8!
This study examines the relationship between property value and
distance with the hypothesis that property values are the highest in
areas near the city centre but decrease with the distance from the city
centre. Initial analysis using simple linear regression yielded a
significant but weak relationship with only about 60% variations in
property values explained by distance. Double-log or power
transformation (growth and exponential as well) using natural logarithm
for both variables resulted in a better R2 of 0.895, indicating that
approximately 90% of the variation in property values is explained. This
study concludes that there is a non-linear relationship between property
values and distance from the city
centre with the following equation:
VALUE = 1990114 / DISTANCE^4.253
where VALUE is average property values in RM per 1000m3 and
DISTANCE is distance in kilometer from the city centre.
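The double-log transformation turns the power model into a straight line: ln VALUE = ln 1990114 − 4.253 ln DISTANCE. The sketch below fits a power law by ordinary least squares on the logged values. It is illustrative only: the distances are hypothetical, the values are generated from the reported equation instead of the real ‘Data18’ file, and the function names are assumptions.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on (ln x, ln y)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    lx_bar = sum(lx) / n
    ly_bar = sum(ly) / n
    s_xy = sum((u - lx_bar) * (v - ly_bar) for u, v in zip(lx, ly))
    s_xx = sum((u - lx_bar) ** 2 for u in lx)
    b = s_xy / s_xx                         # slope on the log-log scale
    a = math.exp(ly_bar - b * lx_bar)       # back-transform the intercept
    return a, b


def predicted_value(distance_km):
    """Property value from the equation reported in the text."""
    return 1990114 / distance_km ** 4.253


# Hypothetical distances (km); values follow the reported equation exactly
distances = [2.0, 4.0, 8.0, 16.0]
values = [predicted_value(d) for d in distances]

a, b = fit_power_law(distances, values)
print(f"VALUE = {a:.0f} * DISTANCE^{b:.3f}")
```

Because the synthetic values lie exactly on the curve, the fit recovers the reported constants; real data would scatter around the line on the log-log scale.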
Dummy Variable
Open SPSS file ‘Data10b’: the scatter plot indicates that
public perception has a strong negative relationship with age.
Careful investigation reveals that the
dummy variable, family background, could possibly influence
public opinion.
Method 1: split the variables into two groups manually,
based on non-drug addict family and drug-addict family
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 9.043 .597 15.158 .000
Age -.104 .014 -.713 -7.686 .000
Family -2.647 .453 -.542 -5.842 .000
a. Dependent Variable: Perception
ANOVA(b)
Model Sum of Squares df Mean Square F Sig.
1 Regression 71.889 1 71.889 43.224 .000a
Residual 26.611 16 1.663
Total 98.500 17
a. Predictors: (Constant), AGE0
b. Dependent Variable: PERCEP0
ANOVA(b)
Model Sum of Squares df Mean Square F Sig.
1 Regression 40.355 1 40.355 21.945 .000a
Residual 29.423 16 1.839
Total 69.778 17
a. Predictors: (Constant), AGE1
b. Dependent Variable: PERCEP1
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 9.903 .782 12.665 .000
AGE0 -.127 .019 -.854 -6.575 .000
a. Dependent Variable: PERCEP0
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 5.769 .693 8.326 .000
AGE1 -.085 .018 -.760 -4.685 .000
a. Dependent Variable: PERCEP1
Method 2: Use split file command of the SPSS (see module 1)
Model Summary
Change Statistics
Adjusted Std. Error of R Square
Family Model R R Square R Square the Estimate Change F Change df1 df2 Sig. F Change
non-drug addict family 1 .854a .730 .713 1.290 .730 43.224 1 16 .000
drug-addict family 1 .760a .578 .552 1.356 .578 21.945 1 16 .000
a. Predictors: (Constant), Age
ANOVAb
Sum of
Family Model Squares df Mean Square F Sig.
non-drug addict family 1 Regression 71.889 1 71.889 43.224 .000a
Residual 26.611 16 1.663
Total 98.500 17
drug-addict family 1 Regression 40.355 1 40.355 21.945 .000a
Residual 29.423 16 1.839
Total 69.778 17
a. Predictors: (Constant), Age
b. Dependent Variable: Perception
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Family Model B Std. Error Beta t Sig.
non-drug addict family 1 (Constant) 9.903 .782 12.665 .000
Age -.127 .019 -.854 -6.575 .000
drug-addict family 1 (Constant) 5.769 .693 8.326 .000
Age -.085 .018 -.760 -4.685 .000
a. Dependent Variable: Perception
Model Summary
ANOVAb
Sum of
Model Squares df Mean Square F Sig.
1 Regression 172.013 4 43.003 31.031 .000a
Residual 42.960 31 1.386
Total 214.972 35
a. Predictors: (Constant), Ethnic2, Age, Ethnic1, Family
b. Dependent Variable: Perception
Coefficientsa
Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 8.139 .631 12.907 .000
Age -.084 .013 -.579 -6.445 .000
Family -1.705 .483 -.349 -3.528 .001
Ethnic1 .663 .445 .136 1.492 .146
Ethnic2 -1.378 .543 -.278 -2.539 .016
a. Dependent Variable: Perception
Writing it in report 9!
One of the objectives of this study is to examine whether several personal
background characteristics of the surveyed public influence the shaping of
their opinion towards the proposal that drug addicts be sent to an
uninhabited island as a measure to rehabilitate them. There are three
independent variables, namely age, family background and ethnicity,
where the last two are dummy variables. Analysis using multiple linear
regression taking perception as the dependent variable yielded a
significant model (F=31.03, p=0.000) with R2=0.80 implying 80% of
variations in perception are explained by the independent variables. This
implies that the independent variables exerted strong and significant
influences in shaping public opinion. The derived model is as follows:
PERCEPTION=8.139 – 0.084*AGE – 1.705*FAMILY + 0.663*ETHNIC1 –
1.378*ETHNIC2
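A dummy variable shifts the intercept of the regression line without changing the slope on age. The sketch below evaluates the derived model with the coefficients from the Coefficients table. It assumes FAMILY = 1 marks a drug-addict family (inferred from the sign of the coefficient); the coding of ETHNIC1 and ETHNIC2 is not stated on the slides, so they are left as plain 0/1 inputs. The function name is hypothetical.

```python
def predict_perception(age, family, ethnic1, ethnic2):
    """Predicted perception score from the fitted multiple regression."""
    return (8.139
            - 0.084 * age
            - 1.705 * family      # assumed: 1 = drug-addict family
            + 0.663 * ethnic1     # 0/1 dummy, coding not given in the slides
            - 1.378 * ethnic2)    # 0/1 dummy, coding not given in the slides


# Same age, reference ethnic group, differing only in family background
for age in (30, 50):
    without = predict_perception(age, family=0, ethnic1=0, ethnic2=0)
    with_addict = predict_perception(age, family=1, ethnic1=0, ethnic2=0)
    print(age, round(without, 3), round(with_addict, 3),
          round(with_addict - without, 3))
```

The gap between the two predictions is the FAMILY coefficient at every age, which is what a parallel shift of the fitted line means.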
Regardless of family background and ethnicity, age had an inverse but
strong (WHY?) and significant influence on perception in that older
respondents tended to be less favorable towards the proposal. In terms
of ethnic influence, the regression model indicated that the Malays
generally showed more favorable response to the proposal followed by
the Indians and the Chinese in that order. In terms of family experience
in drug addiction, respondents from non-drug addict families were more
favorable to the proposal of sending drug addicts to the island compared
with those from drug addict families.
Logit Analysis/ Logistic Regression
It is used when the dependent variable is in
categorical scale.
It predicts the probability of the outcome category for
individual cases or observations, modelling the log-odds
as a linear function of the predictors.
Advantage: it does not require the assumptions of
multivariate normality and equal variance-covariance.
Open ‘Data19[dengue fine]’
Fine=whether the house was fined by the local authority (1=fined, 0=not fined)
Size=size of the house indicated by the floor area
Age=average age of family members
Pot=number of outdoor flower pots in the house compound
Family=total number of persons in the family
Helper=whether the house employs a housemaid (1=yes, 0=no)
House1=house location (1=squatter, 0=other types of houses)
House2=house location (1=flat, 0=other types of houses)
Analyze menu → Regression → Binary Logistic → move one variable to ‘Dependent’ and the regressors to ‘Covariates’
The prediction equation for the probability of a household being fined by the
local authority is:
$$\mathrm{Prob}(y=1)=\frac{\exp(6.141-0.001\,\mathrm{Size}-0.269\,\mathrm{Age}-0.049\,\mathrm{Pot}+0.324\,\mathrm{Family}-0.741\,\mathrm{Help}+4.773\,\mathrm{H1}+3.879\,\mathrm{H2})}{1+\exp(6.141-0.001\,\mathrm{Size}-0.269\,\mathrm{Age}-0.049\,\mathrm{Pot}+0.324\,\mathrm{Family}-0.741\,\mathrm{Help}+4.773\,\mathrm{H1}+3.879\,\mathrm{H2})}$$
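The prediction equation can be sketched in Python as follows. The coefficients are those of the equation above; the function name, the dictionary layout and the example household are assumptions made for illustration.

```python
import math

INTERCEPT = 6.141
COEFFS = {"Size": -0.001, "Age": -0.269, "Pot": -0.049, "Family": 0.324,
          "Help": -0.741, "H1": 4.773, "H2": 3.879}

def prob_fined(household):
    """Return Prob(y = 1) for one household given as a dict of predictor values."""
    z = INTERCEPT + sum(COEFFS[name] * household[name] for name in COEFFS)
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical household, for illustration only
example = {"Size": 1200, "Age": 30, "Pot": 5, "Family": 6,
           "Help": 0, "H1": 0, "H2": 1}
p = prob_fined(example)
print(round(p, 3), "fined" if p >= 0.5 else "not fined")  # cut value 0.5
```

The final line applies the same cut value of .500 that SPSS uses in its classification table.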
−2 Log likelihood measures how poorly the model predicts the decisions. The
smaller the statistic, the better the model.
Cox & Snell R square and Nagelkerke R square can be interpreted roughly like R
square in multiple linear regression. Cox & Snell R2 cannot reach the
maximum of 1, whereas Nagelkerke R2 can reach 1.
Classification Table(a)
                               Predicted: Actual FINE
Observed                       not fined   fined   Percentage Correct
Step 1   FINE   not fined         10         1          90.9
                fined              1         9          90.0
         Overall Percentage                             90.5
a. The cut value is .500
Out of 11 households that were actually not fined, 10 were correctly predicted not fined (90.9% accuracy).
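The percentages in the classification table follow directly from the four counts. A short Python sketch (the function and variable names are illustrative):

```python
# Rows are observed groups, columns are predicted groups (cut value 0.5)
table = {
    "not fined": {"not fined": 10, "fined": 1},
    "fined": {"not fined": 1, "fined": 9},
}

def percent_correct(table):
    """Return per-group and overall percentage of correct predictions."""
    per_group = {g: 100 * row[g] / sum(row.values()) for g, row in table.items()}
    correct = sum(table[g][g] for g in table)
    total = sum(sum(row.values()) for row in table.values())
    return per_group, 100 * correct / total

per_group, overall = percent_correct(table)
print({g: round(p, 1) for g, p in per_group.items()}, round(overall, 1))
```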
Casewise List
The parameter estimates showed that houses of bigger sizes, with more
members, occupied by family members dominated by the young, which
did not employ housemaids, and were located in squatter and flat areas
had greater probability of being negligent and thus fined by the local
authority. The goodness-of-fit statistic was found to be good with the
observed chi-square estimate of 23.908 and significance at the 0.05 level
(p=0.02). It is indicated that out of the total 11 households that were
actually not fined by the local authority, 10 households were predicted not
fined by the logit model resulting in a 90.9% success, and out of the 10
households that were actually fined, 9 households were predicted fined
resulting in a 90% success, giving the overall model predictability of
90.5%. The study concludes that the household characteristics selected
were able to predict the probability of the households being fined by the
local health authority for failing to prevent the spread of dengue fever in
the community under study.
Principal Component Analysis and Factor Analysis
Comparing PCA and Factor Analysis
PCA: to identify a relatively small number of factors that can be used to represent relationships among sets of many interrelated variables (reduce dimension).
FA: same.
PCA: to identify the underlying, not directly observable, constructs (hidden variables).
FA: same.
PCA: all the variation in a given population is contained within the variables used to define that population.
FA: only part of the variation in a given population is contained within the variables used to define that population.
PCA: used in deterministic approach studies.
FA: used in studies that use a more flexible experimental approach.
PCA: Objective: to select a number of components that explain as much of the total variance as possible.
FA: Objective: the factors (components) are selected mainly to explain the interrelationship among the original variables.
PCA: its value for a given individual is relatively simple to compute and interpret.
FA: emphasis on obtaining easily understandable factors that convey the essential information contained in the original set of variables.
Open SPSS file ‘Data20’ → Analyze menu → Data Reduction → Factor → move all selected variables to ‘Variables’ → Descriptives → select ‘Initial solution’, ‘Coefficients’ and ‘Univariate descriptives’ → Continue
Extraction → choose ‘Principal components’ method → select ‘Correlation matrix’, ‘Unrotated factor solution’, ‘Scree plot’ and ‘Eigenvalues over: 1’ → Continue
Rotation → choose ‘Varimax’ method → select ‘Rotated solution’ and ‘Loading plots’ → Continue
For Factor Analysis, choose ‘Principal axis factoring’, ‘Alpha factoring’ or ‘Image factoring’ as the extraction method.
Economic Sector Variable Definition
Employment Sector Variable
Agriculture, hunting and forestry AGRIC
Fishing FISH
Mining and quarrying MINE
Manufacturing MANU
Electricity, gas and water supply ELECT
Construction CONST
Wholesale, retail, repair of motor vehicle, personnel WSALE
Hotels and restaurants HOTEL
Transport, storage and communication TRANS
Financial intermediation FINANCE
Real estate, renting and business activities ESTATE
Public admin, defence, social security ADMIN
Education EDU
Health and social work HEALTH
Other community, social and personal services OTHERS
Housemaid MAID
Step #17: saving the factor scores as variables yielded the table below.
There are only 4 optimum factors (FAC1_1 to FAC4_1). These factor
scores are listed after the last variable in the Data View window.
Determine the number of components using Scree Plot
Determine the number of components using % of variance
The first 4 components contributed 82.685% cumulative % of variance, whereas the contribution of the 5th component is small (5.231% of variance), which is not significant.
Step #10 Extract:
• If you put the ‘eigenvalues over’ = 1, the optimum # of components is obtained
• If the eigenvalues over = 0, the # of components = the number of variables
• You can also set the number of components desired
Total Variance Explained
            Initial Eigenvalues                       Extraction Sums of Squared Loadings       Rotation Sums of Squared Loadings
Component   Total       % of Variance  Cumulative %   Total   % of Variance  Cumulative %       Total   % of Variance  Cumulative %
1           6.738       42.112         42.112         6.738   42.112         42.112             5.790   36.185         36.185
2           3.023       18.892         61.004         3.023   18.892         61.004             3.147   19.672         55.857
3           2.432       15.199         76.203         2.432   15.199         76.203             2.994   18.712         74.568
4           1.037       6.482          82.685         1.037   6.482          82.685             1.299   8.117          82.685
5           .837        5.231          87.916
6           .767        4.796          92.711
7           .414        2.585          95.297
8           .297        1.854          97.151
9           .208        1.301          98.452
10          .131        .821           99.273
11          .064        .399           99.672
12          .042        .260           99.933
13          .011        .067           100.000
14          1.10E-016   6.85E-016      100.000
15          -6.6E-017   -4.11E-016     100.000
16          -1.9E-016   -1.17E-015     100.000
Extraction Method: Principal Component Analysis.
The Rotation Sums of Squared Loadings columns show the contributions of the first 4 components after rotation.
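The ‘% of Variance’ column and the ‘Eigenvalues over: 1’ rule can be reproduced from the initial eigenvalues alone. The sketch below assumes the analysis was run on the correlation matrix (as selected in the Extraction step), so that the total variance equals the number of variables, here 16; the variable names are illustrative.

```python
# Initial eigenvalues from the Total Variance Explained table
# (the last three, which are numerically zero, are written as 0.0)
eigenvalues = [6.738, 3.023, 2.432, 1.037, 0.837, 0.767, 0.414, 0.297,
               0.208, 0.131, 0.064, 0.042, 0.011, 0.0, 0.0, 0.0]

# For a correlation matrix the total variance equals the number of variables
total_variance = len(eigenvalues)

pct_variance = [100 * ev / total_variance for ev in eigenvalues]

# 'Eigenvalues over: 1' keeps only the components whose eigenvalue exceeds 1
n_components = sum(1 for ev in eigenvalues if ev > 1)
cumulative_pct = sum(pct_variance[:n_components])

print(n_components, round(cumulative_pct, 1))
```

Small differences in the third decimal place relative to the SPSS table come from the eigenvalues being rounded to three decimals here.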
Component matrix before rotation
Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .932    .286    .011   -.148
ESTATE     .929    .242    .056    .084
MAID       .910    .180   -.327   -.027
TRANS      .839    .053   -.241    .198
AGRIC     -.797    .164   -.435   -.277
OTHERs     .761    .136   -.385   -.121
WSALE      .687    .425    .269   -.329
EDU       -.636    .362    .429   -.144
MANU       .330   -.815    .203    .364
CONST     -.268    .778    .349    .100
ELECT      .599    .677   -.032    .188
ADMIN     -.528    .565    .283   -.072
HOTEL      .167   -.293    .770    .063
HEALTH     .410   -.188    .656   -.382
FISH      -.563    .204   -.583    .040
MINE      -.233    .529    .251    .643
Extraction Method: Principal Component Analysis.
a. 4 components extracted.

Component matrix after rotation
Rotated Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .939   -.038    .271   -.121
MAID       .939   -.274   -.020   -.105
ESTATE     .901   -.139    .304    .091
ELECT      .826    .255   -.015    .325
OTHERs     .805   -.245   -.105   -.205
TRANS      .795   -.401    .049    .089
WSALE      .743    .310    .392   -.179
MANU      -.135   -.823    .488    .112
CONST      .044    .797    .002    .413
ADMIN     -.259    .756   -.078    .197
EDU       -.472    .707    .075    .106
HOTEL     -.149   -.049    .821    .111
HEALTH     .146    .070    .810   -.313
FISH      -.277    .143   -.776    .025
AGRIC     -.519    .355   -.691   -.235
MINE      -.034    .360   -.047    .823
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 7 iterations.

Component 1 consists of variables ‘Finance’, ‘Maid’, ‘Estate’, ‘Elect’, ‘Others’, ‘Trans’ and ‘Wsale’ because they have largest coefficient values as compared to components 2, 3 and 4.
Component 2 consists of variables Manu, Const, Admin and Edu.
Component 3 consists of variables Hotel, Health, Fish and Agric.
Component 4 consists of variable Mine only.
Hidden Variables:
• Component 1: Dominated by Tertiary Employment (Y1)
• Component 2: Lack in Manufacturing (Y2)
• Component 3: Dominant in Tourist-Related Activities (Y3)
• Component 4: Dependent on Mining (Y4)
Instead of using 16 variables, the researcher may use these 4 components to study the problem concerned.

Rotated Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .939   -.038    .271   -.121
MAID       .939   -.274   -.020   -.105
ESTATE     .901   -.139    .304    .091
ELECT      .826    .255   -.015    .325
OTHERs     .805   -.245   -.105   -.205
TRANS      .795   -.401    .049    .089
WSALE      .743    .310    .392   -.179
MANU      -.135   -.823    .488    .112
CONST      .044    .797    .002    .413
ADMIN     -.259    .756   -.078    .197
EDU       -.472    .707    .075    .106
HOTEL     -.149   -.049    .821    .111
HEALTH     .146    .070    .810   -.313
FISH      -.277    .143   -.776    .025
AGRIC     -.519    .355   -.691   -.235
MINE      -.034    .360   -.047    .823
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 7 iterations.
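The rule used to allocate variables to components (each variable goes to the component on which its rotated loading is largest in absolute value) can be sketched in Python. The loadings are copied from the Rotated Component Matrix; the function name is illustrative.

```python
# Rotated component loadings (Varimax) from the Rotated Component Matrix
loadings = {
    "FINANCE": (.939, -.038, .271, -.121),
    "MAID":    (.939, -.274, -.020, -.105),
    "ESTATE":  (.901, -.139, .304, .091),
    "ELECT":   (.826, .255, -.015, .325),
    "OTHERs":  (.805, -.245, -.105, -.205),
    "TRANS":   (.795, -.401, .049, .089),
    "WSALE":   (.743, .310, .392, -.179),
    "MANU":    (-.135, -.823, .488, .112),
    "CONST":   (.044, .797, .002, .413),
    "ADMIN":   (-.259, .756, -.078, .197),
    "EDU":     (-.472, .707, .075, .106),
    "HOTEL":   (-.149, -.049, .821, .111),
    "HEALTH":  (.146, .070, .810, -.313),
    "FISH":    (-.277, .143, -.776, .025),
    "AGRIC":   (-.519, .355, -.691, -.235),
    "MINE":    (-.034, .360, -.047, .823),
}

def dominant_component(row):
    """Return the 1-based index of the component with the largest |loading|."""
    return max(range(len(row)), key=lambda j: abs(row[j])) + 1

groups = {}
for variable, row in loadings.items():
    groups.setdefault(dominant_component(row), []).append(variable)

for component in sorted(groups):
    print(component, groups[component])
```

Note that the sign of a loading matters for interpretation: MANU loads negatively on component 2, which is why that component is read as a lack in manufacturing.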
Scatter plot: the relative positions of states in terms of dominance of
employment in the tertiary sector (component 1) and problem of lack in
manufacturing (component 2). You may plot the scatter diagrams for any
combination of the components.
Plot using the factor scores obtained from Step #17
Writing it in report
In this study, principal component analysis was used to portray the
basic structure of economic development of states in Malaysia
based on the percentage of employment in sixteen economic sub-
sectors, namely agriculture (including hunting & forestry), fishing,
mining (including quarrying), manufacturing, utility (including
electricity, gas & water supply), construction, wholesale (including
retail, repair of motor vehicle & personal trade), hotels and
restaurants, transport (including storage & communication),
financial intermediation, real estate (including renting and business
activities), administration, housemaid services, education, health
and other community, social and personal services. Initial analysis
showed variations in the values of the variables and correlations
among several variables, both indicating the possibility of using this
technique. The initial solution resulted in four components (factors) that
had eigenvalues of more than one, with a total of 82.7% of variance
explained.
The first component accounted for 42.1% of the variance explained,
the second component contributed 18.9%, the third component 15.2%
and the fourth component 6.5%. Rotation using the varimax method
did not change the total variance explained, but it redistributed
the variance among the components: the share of the first component
fell from 42.1% to 36.2%, while the shares of the second, third and
fourth components rose to 19.7%, 18.7% and 8.1% respectively, so that
the contributions of these components to the underlying economic
structure of the country studied were made clearer.
Seven employment sub-sector variables, namely in finance,
housemaid services, real estate, utility, transport and communication,
wholesale and others loaded high on Component 1. Employment in
manufacturing, construction, administration and educational services
loaded high on Component 2. Employment in hotel, health services,
fishing and agriculture loaded high on Component 3. The only
variable contributing to the formation of Component 4 was
employment in mining. In this study, the first component was labeled
‘Dominated by tertiary employment’, the second component ‘Lack in
manufacturing’, the third ‘Dominant in tourist-related activities’ and the
fourth ‘Dependent on mining’.
Figure 1 maps out the position of each state in the employment
structure based on factor scores. Kuala Lumpur was positioned far
ahead of other states in the first component of employment
dimension, followed by Selangor in the middle, while other states
flocked together with generally low scores. On the other extreme,
Perlis and Kelantan, and to a lesser extent, Terengganu, showed not
only low scores for tertiary activities but also lack of employment in
manufacturing. Seen from the second figure which was defined as
the dominance of employment in tertiary activities and tourism,
Kuala Lumpur did quite well, positioned positively far from other
states. States such as Melaka, Penang and Kelantan did quite well
in tourism but greatly lacked in tertiary activities.
From the third figure where employment in tertiary activities was
seen together with mining, the position of Terengganu clearly
eclipsed that of other states. But because several other sub-sectors
such as utility, construction and agriculture too loaded (either
positively or negatively) quite heavily on this component, Melaka
was also positioned near Terengganu. When scores for component 2
depicting lack in manufacturing and component 3 for significant
contribution to tourism in employment were put together, the position
of Kelantan, Perlis, Terengganu and Kuala Lumpur as a group
separated from other states was very clear. Relative to other states,
Penang did well both in manufacturing and tourism as shown by the
negative and low scores. When components 2 and 4 were plotted
together, Terengganu was found to be positioned far from other
states, indicating a problem of lack in manufacturing and dependence
on mining, while Kelantan and, to a lesser extent, Perlis were behind
other states in both sub-sectors. Finally, when components 3 and 4
were plotted in a two-dimension map, the two states were positioned
in two extremes where Melaka lacked in mining, fishing and
agriculture but did well in tourism, and Sabah did well in agriculture
but not quite well in mining.
Better tertiary employment
This procedure attempts to identify relatively
homogeneous groups of cases (or variables)
based on selected characteristics, using an
algorithm that starts with each case (or variable)
in a separate cluster and combines clusters until
only one is left
It can be used to reduce dimension or reduce
cases. Since the variables or cases in the same
group are homogeneous, we can choose only
one of them to represent the whole group
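The agglomerative idea described above (start with every case in its own cluster and merge the two closest clusters until one is left) can be sketched in plain Python. This is an illustration under stated assumptions, not the SPSS implementation: it uses Euclidean distance, the data are hypothetical, and the `linkage` argument is a simplification that covers only furthest-neighbour (`max`) and nearest-neighbour (`min`) linkage.

```python
import math
from itertools import combinations

def agglomerate(points, linkage=max):
    """Merge the two closest clusters until one is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of case indices.
    """
    clusters = [(i,) for i in range(len(points))]
    history = []

    def cluster_distance(pair):
        a, b = pair
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        history.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history

# Hypothetical two-variable data for five cases
cases = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
history = agglomerate(cases)
for a, b, d in history:
    print(a, b, round(d, 3))
```

The merge history plays the role of the agglomeration schedule: reading it from the top gives the order in which cases are combined, and cutting it early gives a solution with more clusters.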
Measure
Allows you to specify the distance or similarity measure to be used
in clustering. Select the type of data and the appropriate
distance or similarity measure:
1. Interval: Available alternatives are Euclidean distance,
squared Euclidean distance, cosine, Pearson correlation,
Chebychev, block, Minkowski, and customized
2. Counts: Available alternatives are chi-square measure and phi-
square measure
3. Binary: Available alternatives are Euclidean distance, squared
Euclidean distance, size difference, pattern difference,
variance, dispersion, shape, simple matching, phi 4-point
correlation, lambda, Anderberg's D, dice, Hamann, Jaccard,
Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers
and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and
Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and
Sneath 5, Yule's Y, and Yule's Q
Euclidean Distance
Let the data on m variables (Var 1, Var 2, …, Var m) and n observations be
recorded in a table with one column per variable and one row per observation.
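Distance Matrix
$$D = \begin{pmatrix} d_{11}=0 & d_{12} & \cdots & d_{1m} \\ d_{21} & 0 & \cdots & d_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nm}=0 \end{pmatrix}$$
The diagonal entries = 0 means there is no distance between a variable/case and itself.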
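A distance matrix of this kind can be computed as below. The sketch assumes the usual Euclidean distance (the square root of the sum of squared differences across variables) and uses hypothetical data; the function name is illustrative.

```python
import math

def euclidean_distance_matrix(rows):
    """Return the square matrix of Euclidean distances between all rows."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s))) for s in rows]
            for r in rows]

# Hypothetical data: 3 cases measured on 2 variables
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = euclidean_distance_matrix(data)
for row in D:
    print(row)
```

Passing cases as rows gives distances between cases; passing the transposed table gives distances between variables.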
Between (Average) Groups Linkage
Open SPSS file ‘Data20’ → Analyze menu → Classify → Hierarchical Cluster → move all selected variables to ‘Variables’ → move categorical variable to ‘Label Cases by’ → select ‘Cases’ under Cluster → Statistics → Agglomeration schedule → Continue
Plots → Dendrogram → Continue → Method → select cluster method ‘Furthest neighbor’ → select measure ‘Pearson correlation’ → Continue → OK
Number of clusters
Case 1 2 3 4 5 6 7 8 9 10 11 12 13
11:Sarawak X X X X X X X X X X X X X
X X X X X X X X X X X X X
10:Sabah X X X X X X X X X X X X X
X X X X X
6:Pahang X X X X X X X X X X X X X
X X X X X X X
8:Perlis X X X X X X X X X X X X X
X X X X X X X X X X X
3:Kelantan X X X X X X X X X X X X X
X X X
13:Terengganu X X X X X X X X X X X X X
X X X X
7:Perak X X X X X X X X X X X X X
X X X X X X X X X X
5:N.Sembilan X X X X X X X X X X X X X
X X X X X X X X X X X X
2:Kedah X X X X X X X X X X X X X
X
14:Kuala Lumpur X X X X X X X X X X X X X
X X
12:Selangor X X X X X X X X X X X X X
X X X X X X
9:Penang X X X X X X X X X X X X X
X X X X X X X X
4:Melaka X X X X X X X X X X X X X
X X X X X X X X X
1:Johor X X X X X X X X X X X X X
1 cluster 3 clusters
2 clusters 4 clusters
Dendrogram
Cluster 1: Sabah, Sarawak, Kelantan, Perlis, Pahang
Cluster 2: Kedah, N. Sembilan, Perak, Terengganu
Cluster 3: Johor, Melaka, Penang, Selangor
Cluster 4: KL
The number of clusters can be adjusted according to the
desired degrees of similarity
Clustering variable
K-Means Clustering
This procedure attempts to identify relatively
homogeneous groups of cases based on selected
characteristics, using an algorithm that can handle large
numbers of cases. However, the algorithm requires you
to specify the number of clusters
You can specify initial cluster centers if you know this
information
You can select one of two methods for classifying cases,
either updating cluster centers (centers will change)
iteratively or classifying only
The k-means cluster analysis command is efficient
primarily because it does not compute the distances
between all pairs of cases, as do many clustering
algorithms
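The ‘iterate and classify’ idea can be sketched in a few lines of Python. This is a minimal illustration with hypothetical data and user-supplied initial centers, not the SPSS algorithm itself; the function name and the `max_iter` parameter are assumptions.

```python
import math

def k_means(points, centers, max_iter=10):
    """Assign points to the nearest center, then update centers; repeat."""
    for _ in range(max_iter):
        # Classification step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # centers no longer change
        centers = new_centers
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

Each pass only measures the distance from every case to the k centers, never between pairs of cases, which is why the method scales to large numbers of cases.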
It is a tool designed to assign cases to a fixed
number of groups (clusters) whose
characteristics are not yet known but are based
on a set of specified variables. It is most useful
when you want to classify a large number
(thousands) of cases
A good cluster analysis is:
Efficient. Uses as few clusters as possible
Effective. Captures all statistically and commercially
important clusters. For example, a cluster with five
customers may be statistically different but not very
profitable
Analyze menu → Classify → K-Means Cluster → move all selected variables to ‘Variables’ → select the number of clusters needed → create a new file to store results → ‘Iterate and classify’ → define the maximum number of iterations → Continue → Save → click the boxes to save and continue → Options → select ‘Statistics’ → Continue → OK
Initial Cluster Centers
           Cluster
           1       2       3
AGRIC      10.0    1.3     23.2
FISH       .9      .2      2.2
MINE       .3      .3      .5
MANU       30.2    20.1    13.8
ELECT      .6      .9      .6
CONST      7.8     10.0    11.0
WSALE      15.4    19.6    15.3
HOTEL      7.2     6.3     6.1
TRANS      4.7     7.1     3.9
FINANCE    1.9     6.0     1.2
ESTATE     3.0     8.0     2.1
ADMIN      7.1     7.0     8.5
EDU        5.5     4.7     6.7
HEALTH     2.0     2.2     1.7
OTHERs     1.7     2.9     1.7
MAID       1.9     3.4     1.6

Number of Cases in each Cluster
Cluster    1       6.000
           2       2.000
           3       6.000
Valid              14.000
Missing            .000

The final cluster centres are computed as the mean for each variable within each final cluster.
The final cluster centres reflect the characteristics of the typical case for each cluster.
ANOVA
           Cluster                 Error
           Mean Square    df       Mean Square    df      F         Sig.
AGRIC      462.014        2        33.735         11      13.696    .001
FISH       4.296          2        1.139          11      3.771     .057
MINE       .064           2        .070           11      .916      .429
MANU       407.604        2        27.601         11      14.768    .001
ELECT      .073           2        .023           11      3.184     .081
CONST      14.986         2        5.299          11      2.828     .102
WSALE      15.184         2        3.097          11      4.903     .030
HOTEL      1.764          2        1.229          11      1.435     .279
TRANS      7.688          2        .426           11      18.037    .000
FINANCE    18.092         2        .579           11      31.242    .000
ESTATE     26.271         2        .764           11      34.375    .000
ADMIN      3.908          2        2.818          11      1.387     .290
EDU        3.922          2        1.390          11      2.821     .103
HEALTH     .167           2        .282           11      .590      .571
OTHERs     1.202          2        .214           11      5.618     .021
MAID       2.535          2        .198           11      12.774    .001
The F tests should be used only for descriptive purposes because the clusters have been
chosen to maximize the differences among cases in different clusters. The observed
significance levels are not corrected for this and thus cannot be interpreted as tests of the
hypothesis that the cluster means are equal.

The ANOVA table indicates which variables contribute the most to your cluster solution.
Variables with large F values (or small p-values) provide the greatest separation between clusters.
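Each F value in the ANOVA table is the cluster mean square divided by the error mean square. The sketch below recomputes F for a few of the variables and ranks them; only a subset of rows is included to keep it short, and the variable names are illustrative.

```python
# (cluster mean square, error mean square) for a few variables from the ANOVA table
mean_squares = {
    "AGRIC": (462.014, 33.735),
    "MANU": (407.604, 27.601),
    "FINANCE": (18.092, 0.579),
    "ESTATE": (26.271, 0.764),
    "HEALTH": (0.167, 0.282),
}

# F = cluster mean square / error mean square
f_values = {name: between / within
            for name, (between, within) in mean_squares.items()}

# Rank variables from most to least separating (descriptive use only)
for name in sorted(f_values, key=f_values.get, reverse=True):
    print(f"{name:8s} F = {f_values[name]:.3f}")
```

The recomputed values differ slightly from the table in the second decimal place because the mean squares shown are already rounded.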
A new SPSS file was created to store the results
Plot of Distances from Cluster Center by Cluster Membership
This is a diagnostic plot that helps you to find outliers within clusters.
► Click the Graphs menu and select Chart Builder.
► Click the Gallery tab, select Boxplot from the list of chart types, and drag and drop the Simple Boxplot icon onto the canvas.
► Drag and drop Distance of Case from its Classification Cluster Center onto the y axis.
► Drag and drop Cluster Number of Case onto the x axis.
► Click OK to create the boxplot.
Discriminant Analysis
Discriminant analysis is used to model the value
of a dependent categorical variable based on its
relationship to one or more predictors
Given a set of independent variables,
discriminant analysis attempts to find linear
combinations of those variables that best
separate the groups of cases. These
combinations are called discriminant functions
and have the form displayed in the equation
$$d_{ik} = b_{0k} + b_{1k}x_{i1} + \cdots + b_{pk}x_{ip}$$
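Reading the equation as the value of the k-th discriminant function for case i, with constant b_0k, coefficients b_1k to b_pk and predictor values x_i1 to x_ip (an interpretation assumed here, since the symbol definitions are not shown), a score can be computed as in this sketch. The function name and the numbers are hypothetical.

```python
def discriminant_score(coefficients, predictors):
    """Compute d = b0 + b1*x1 + ... + bp*xp for one case and one function.

    coefficients: [b0, b1, ..., bp]; predictors: [x1, ..., xp].
    """
    if len(coefficients) != len(predictors) + 1:
        raise ValueError("expected one coefficient per predictor plus a constant")
    b0, *b = coefficients
    return b0 + sum(bj * xj for bj, xj in zip(b, predictors))

# Hypothetical coefficients and predictor values, for illustration only
print(discriminant_score([1.0, 0.5, -2.0], [4.0, 1.5]))
```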
The discriminant model has the following assumptions:
1. The predictors are not highly correlated with each other
2. The mean and variance of a given predictor are not
correlated
3. The correlation between two predictors is constant across
groups
4. The values of each predictor have a normal distribution
Exercise:
Suppose you are a political analyst. You want to identify characteristics
that are indicative of voters who are likely to vote for a given US
presidential candidate, and to use those characteristics to identify
supporters and opponents.
Analyze menu → Classify → Discriminant → select ‘candidate’ as the grouping variable → Define range: minimum=1, maximum=2 → Continue
select Means, Univariate ANOVA, Box’s M → select ‘Fisher’s’ and ‘Unstandardized’ → select ‘Within-groups correlation’ → Continue
Classification Statistics
The classification functions are used to assign cases to
groups.
Prior Probabilities for Groups
There is a separate function for each group. For each case, a
classification score is computed for each function. The discriminant model
assigns the case to the group whose classification function obtained the
highest score.
Classification function for Bush:
Y1 = −46.455 + 0.589 Age − 4.613 AgeCategory + 5.761 School − 9.436 Degree + 6.604 Sex
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .020(a)      100.0           100.0          .141
a. First 1 canonical discriminant functions were used in the analysis.
Wilks' lambda is a measure of how well each function separates cases into
groups. It is equal to the proportion of the total variance in the discriminant
scores not explained by differences among the groups. Smaller values of Wilks'
lambda indicate greater discriminatory ability of the function.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .980            22.038        5    .001
The associated chi-square statistic tests the hypothesis that the means
of the functions listed are equal across groups. The small significance
value indicates that the discriminant function does better than chance at
separating the groups.
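With a single discriminant function, the eigenvalue, Wilks' lambda and the canonical correlation in the tables above are linked by simple identities: lambda = 1/(1 + eigenvalue), and the squared canonical correlation is eigenvalue/(1 + eigenvalue). A short check in Python, using the rounded eigenvalue of .020 (so the results agree with the tables only to about two decimals):

```python
import math

eigenvalue = 0.020  # from the eigenvalue table (single discriminant function)

# With one function: Wilks' lambda = 1 / (1 + eigenvalue)
wilks_lambda = 1 / (1 + eigenvalue)

# Canonical correlation = sqrt(eigenvalue / (1 + eigenvalue))
canonical_corr = math.sqrt(eigenvalue / (1 + eigenvalue))

print(round(wilks_lambda, 3), round(canonical_corr, 3))
```

A lambda this close to 1 means the function, although better than chance, explains only a small share of the variance in the discriminant scores.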
Checking Collinearity of Predictors: The within-groups correlation matrix
shows the correlations between the predictors.
Pooled Within-Groups Matrices (Correlation)
                                   AGE OF       age          HIGHEST YEAR OF     RS HIGHEST   RESPONDENTS
                                   RESPONDENT   categories   SCHOOL COMPLETED    DEGREE       SEX
AGE OF RESPONDENT                  1.000        .943         -.306               -.246        .037
age categories                     .943         1.000        -.247               -.189        .021
HIGHEST YEAR OF SCHOOL COMPLETED   -.306        -.247        1.000               .870         -.068
RS HIGHEST DEGREE                  -.246        -.189        .870                1.000        -.066
RESPONDENTS SEX                    .037         .021         -.068               -.066        1.000
Tests of Equality of Group Means
                                   Wilks' Lambda   F        df1   df2    Sig.
AGE OF RESPONDENT                  1.000           .069     1     1100   .794
age categories                     1.000           .194     1     1100   .660
HIGHEST YEAR OF SCHOOL COMPLETED   1.000           .147     1     1100   .701
RS HIGHEST DEGREE                  1.000           .042     1     1100   .837
RESPONDENTS SEX                    .985            16.822   1     1100   .000
Each test displays the results of a one-way ANOVA for the independent variable using the grouping variable as the factor. If the significance value is greater than 0.10, the variable probably does not contribute to the model.
Wilks' lambda is another measure of a variable's potential. Smaller values indicate the variable is better at discriminating between groups.
Standardized Canonical Discriminant Function Coefficients
                                   Function 1
AGE OF RESPONDENT                  -1.497
age categories                     1.453
HIGHEST YEAR OF SCHOOL COMPLETED   -.210
RS HIGHEST DEGREE                  .104
RESPONDENTS SEX                    .886
The standardized coefficients allow you to compare variables measured on different scales. Coefficients with large absolute values correspond to variables with greater discriminating ability. It downgrades the importance of Sex.
Since the structure matrix is unaffected by collinearity, it's safe to say that this
collinearity has inflated the importance of Age, Age Categories, Highest year of
school completed and Highest degree in the standardized coefficients table.
Thus, voter’s sex best discriminates between supporters and opponents.
Checking for Correlation of Group Means and Variances: The group
statistics table reveals a potentially more serious problem. For all five
predictors, larger group means tend to associate with larger group
standard deviations.
Group Statistics
Model Validation
The classification table shows the practical results of using the discriminant model.
Classification Results(b,c,d)
                                                        Predicted Group Membership
                                        VOTE FOR        Bush     Clinton   Total
Cases Selected       Original           Count  Bush     240      226       466
                                               Clinton  250      386       636
                                        %      Bush     51.5     48.5      100.0
                                               Clinton  39.3     60.7      100.0
                     Cross-validated(a) Count  Bush     238      228       466
                                               Clinton  255      381       636
                                        %      Bush     51.1     48.9      100.0
                                               Clinton  40.1     59.9      100.0
Cases Not Selected   Original           Count  Bush     105      90        195
                                               Clinton  120      151       271
                                        %      Bush     53.8     46.2      100.0
                                               Clinton  44.3     55.7      100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 56.8% of selected original grouped cases correctly classified.
c. 54.9% of unselected original grouped cases correctly classified.
d. 56.2% of selected cross-validated grouped cases correctly classified.
Of the cases used to create the model, 240 of the 466 voters who voted for Bush and 386 of the 636 voters who voted for Clinton are classified correctly. Overall, 56.8% of the cases are classified correctly.
Classification rates based on the same cases used to build the model tend to be optimistic. The cross-validated section of the table attempts to correct this by classifying each case while leaving it out from the model calculations.
Combined-groups. Creates an all-groups
scatterplot of the first two discriminant function
values. If there is only one function, a histogram
is displayed instead.
Territorial map. A plot of the boundaries used to
classify cases into groups based on function
values. The numbers correspond to groups into
which cases are classified. The mean for each
group is indicated by an asterisk within its
boundaries. The map is not displayed if there is
only one discriminant function.
SPSS also produces an ASCII territorial map plot which shows the relative
location of the boundaries of the different categories.
[Territorial map figure; annotation: Territory for Group 2]
The Discriminant Analysis procedure is useful
for modeling the relationship between a
categorical dependent variable and one or more
scale independent variables.
If your dependent variable is scale, use the
Linear Regression procedure.
Alternatively, if your dependent variable is scale,
try the GLM Univariate procedure.
If your predictors are multicollinear and you want
to reduce their number, use the Factor Analysis
procedure.
Differences between DA and CA
In clustering, the category of the object is unknown.
However, we know the rule to classify (usually based on
distance) and we also know the features (independent
variables) that can describe the classification of the
object. There is no training example to examine whether
the classification is correct or not. Thus, the objects are
assigned into groups merely based on the given rule.
In discriminant analysis, object groups and several
training examples of objects that have been grouped are
known. The model of classification is also given (e.g.
linear or quadratic) and we want to know the best fit
parameters of the model that can best separate the
objects based on the training samples.
Neural Network Method
1. Neurones (nodes)
2. Synapses (weights)
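[Figure: a single neuron with inputs p1, p2, p3, weights w1, w2, w3, a bias input of 1 with weight b, activation function f and output a]
$$a = f(p_1 w_1 + p_2 w_2 + p_3 w_3 + b) = f\left(\sum_i p_i w_i + b\right)$$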
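The output of a single neuron can be sketched directly from the formula. The activation function f is not specified on the slide, so the logistic sigmoid used by default below is an assumption; the function name and the input values are hypothetical.

```python
import math

def neuron_output(inputs, weights, bias, activation=None):
    """Compute a = f(sum(p_i * w_i) + b) for a single neuron."""
    if activation is None:
        # Assumed activation: the logistic sigmoid
        activation = lambda x: 1.0 / (1.0 + math.exp(-x))
    net = sum(p * w for p, w in zip(inputs, weights)) + bias
    return activation(net)

# Hypothetical inputs, weights and bias
print(neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.0], bias=0.0))
```

Passing a different `activation` (for example the identity function) changes only f; the weighted sum plus bias stays the same.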
Information is distributed
Input
Desired
Output
Recurrency
Nodes connect back to other nodes or themselves
Information flow is multidirectional
Sense of time and memory of previous state(s)
The MLP procedure can find more complex relationships, while the RBF procedure is faster.
Partitions: use 70% training : 30% testing
Output: specify network structure and network performance methods
Main Reference