Documente Academic
Documente Profesional
Documente Cultură
2-1
2-2
2-3
Examination Phases
Graphical examination.
Identify and evaluate missing values.
Identify and deal with outliers.
Check whether statistical assumptions are met.
Develop a preliminary understanding of your data.
2-4
Graphical
Examination
Shape:
Histogram
Bar Chart
Box & Whisker plot
Stem and Leaf plot
Relationships:
Scatterplot
Outliers
2-5
X19 - Satisfaction
30
20
Frequency
10
Std. Dev = 1.19
Mean = 6.92
N = 100.00
0
4.50
5.50
5.00
6.50
6.00
7.50
7.00
8.50
8.00
9.50
9.00
10.00
X19 - Satisfaction
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-6
5.
5.
6.
6.
7.
7.
8.
8.
9.
9.
10 .
012
5567777899
0112344444
5567777999
01144
55666777899
000122234
55556667777778
001111222333333444
56699999
00
2-7
Valid
5.0
5.1
5.2
5.5
5.6
5.7
5.8
5.9
6.0
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.9
7.0
7.1
7.4
Frequency
Percent
Valid Percent
1
1.0
1.0
1
1.0
1.0
1
1.0
1.0
2
2.0
2.0
1
1.0
1.0
4
4.0
4.0
1
1.0
1.0
2
2.0
2.0
1
1.0
1.0
2
2.0
2.0
1
1.0
1.0
1
1.0
1.0
5
5.0
5.0
2
2.0
2.0
1
1.0
1.0
4
4.0
4.0
3
3.0
3.0
1
1.0
1.0
2
2.0
2.0
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2
2.0
2.0
Cumulative
Percent
1.0
2.0
3.0
5.0
6.0
10.0
11.0
13.0
14.0
16.0
17.0
18.0
23.0
25.0
26.0
30.0
33.0
34.0
36.0
2-8
38.0
11
10
13
X 6 - P ro d u ct Q u a lity
Median
5
4
N=
32
35
33
1 to 5 years
Over 5 years
X1 - Customer Type
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-9
X 1 9 - S a tis fa c tio n
4
4
10
11
X6 - Product Quality
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-10
Missing Data
Impact . . .
Reduces sample size available for analysis.
Can distort results.
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-11
2-12
Missing Data
Strategies for handling missing data . . .
use observations with complete data
only;
delete case(s) and/or variable(s);
estimate missing values.
2-13
Rules of Thumb 21
How Much Missing Data Is Too Much?
2-14
Rules of Thumb 23
Imputation of Missing Data
2-15
Outlier
Outlier = an observation/response with a
unique combination of characteristics
identifiable as distinctly different from the
other observations/responses.
Issue: Is the observation/response
representative of the population?
2-16
Procedural Error.
Extraordinary Event.
Extraordinary Observations.
Observations unique in their
combination of values.
2-17
Identify outliers.
Describe outliers.
Delete or Retain?
2-18
Identifying Outliers
2-19
Rules of Thumb 24
Outlier Detection
Univariate methods examine all metric variables to identify unique or
extreme observations.
For small samples (80 or fewer observations), outliers typically are defined
as cases with standard scores of 2.5 or greater.
For larger sample sizes, increase the threshold value of standard scores up
to 4.
If standard scores are not used, identify cases falling outside the ranges of
2.5 versus 4 standard deviations, depending on the sample size.
Bivariate methods focus their use on specific variable relationships, such
as the independent versus dependent variables:
o use scatterplots with confidence intervals at a specified Alpha level.
Multivariate methods best suited for examining a complete variate, such as
the independent variables in regression or the variables in factor analysis:
o threshold levels for the D2/df measure should be very conservative (.005
or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger
samples.
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-20
Multivariate Assumptions
Normality
Linearity
Homoscedasticity
Non-correlated Errors
Data Transformations?
2-21
Testing Assumptions
Normality assumptions
Visual check of histogram.
Kurtosis.
Normal probability plot.
Homoscedasticity
Equal variances across independent
variables.
Levene test (univariate).
Boxs M (multivariate).
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-22
Rules of Thumb 25
2-23
Data Transformations ?
Data transformations . . . provide a means of
modifying variables for one of two reasons:
1. To correct violations of the statistical
assumptions underlying the multivariate
techniques, or
2. To improve the relationship (correlation)
between the variables.
2-24
Rules of Thumb 26
Transforming Data
To judge the potential impact of a transformation, calculate the ratio of
the variables mean to its standard deviation:
o Noticeable effects should occur when the ratio is less than 4.
o When the transformation can be performed on either of two variables,
select the variable with the smallest ratio .
Transformations should be applied to the independent variables except
in the case of heteroscedasticity.
Heteroscedasticity can be remedied only by the transformation of the
dependent variable in a dependence relationship. If a heteroscedastic
relationship is also nonlinear, the dependent variable, and perhaps the
independent variables, must be transformed.
Transformations may change the interpretation of the variables. For
example, transforming variables by taking their logarithm translates the
relationship into a measure of proportional change (elasticity). Always
be sure to explore thoroughly the possible interpretations of the
transformed variables.
Use variables in their original (untransformed) format when profiling or
interpreting results.
Copyright 2010 Pearson Education, Inc., publishing as Prentice-Hall.
2-25
Dummy Variable
Dummy variable . . . a nonmetric independent
variable that has two (or more) distinct levels
that are coded 0 and 1. These variables act as
replacement variables to enable nonmetric
variables to be used as metric variables.
2-26
X1
X2
Physician
Attorney
Professor
2-27
2-28
Examining Data
Learning Checkpoint
1. Why examine your data?
2. What are the principal aspects of
data that need to be examined?
3. What approaches would you use?
2-29