Analysis
Exploratory Data Analysis (EDA)
Descriptive Statistics
Graphical
Data driven
Part I:
The Basics
or
“I got mean and deviant and now I’m considered normal”
Part II:
Exploratory Data Analysis
or
“I ask Skew how to recover from kurtosis and only hear ‘Get
out, liar!’”
What is data?
Categorical (Qualitative)
• Nominal scales – number is just a symbol that identifies a quality
• 0=male, 1=female
• 1=green, 2=blue, 3=red, 4=white
• Ordinal – rank order
X=T+e
Organizing your data in a spreadsheet
Stacked data (columns): Subject, condition, score
3 2 5
2 3 8 4
3 3 7 1
Variable Summaries
(dev = score minus the variable's mean; dev^2 = squared deviation)

                                  before        during        after
Subject  before  during  after   dev   dev^2   dev   dev^2   dev   dev^2
1        3       2       7       0     0       -3    9       2     4
2        3       8       4       0     0       3     9       -1    1
3        3       7       3       0     0       2     4       -2    4
4        3       2       6       0     0       -3    9       1     1
5        3       8       4       0     0       3     9       -1    1
6        3       1       6       0     0       -4    16      1     1
7        3       9       3       0     0       4     16      -2    4
8        3       3       6       0     0       -2    4       1     1
9        3       9       4       0     0       4     16      -1    1
10       3       1       7       0     0       -4    16      2     4
Sum =    30      50      50      0     0       0     108     0     22
/n       10      10      10            10*           10            10
Mean =   3       5       5       VAR = 0             10.8          2.2

*Actually, you divide by n-1 because it is a sample and not a population, but you get the idea…
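The arithmetic in the table can be checked in a few lines. This is a minimal sketch using only Python's standard library; the variable and function names are illustrative, and the scores are the ten before, during, and after values from the table. It prints the sum of squared deviations, the variance dividing by n (as in the table), and the variance dividing by n-1 (as in the footnote).

```python
from statistics import mean, pvariance, variance

# Scores copied from the table (10 subjects per condition).
before = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]
after = [7, 4, 3, 6, 4, 6, 3, 6, 4, 7]

def sum_of_squares(scores):
    """Sum of squared deviations from the mean."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores)

for name, scores in [("before", before), ("during", during), ("after", after)]:
    print(
        name,
        "SS =", sum_of_squares(scores),
        "divide by n:", pvariance(scores),               # population variance
        "divide by n-1:", round(variance(scores), 2),    # sample variance
    )
```

Dividing by n-1 gives a slightly larger value than dividing by n, and the difference shrinks as the sample grows.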
Variance continued
[Figure: three panels plotting the scores of cases 1 to 10 around the mean]
leptokurtic
platykurtic
Standard deviation
High SD Low SD
Interpreting standard deviation (SD)
Second, you can then interpret any individual score in terms of the SD.
For example, take a score of 55 in two different distributions:
• mean = 50, SD = 10: the score is 0.5 standard deviation units from the mean (not much)
• mean = 50, SD = 1: the score is 5 standard deviation units from the mean (a lot!)
Standardized scores (Z)
Third, you can use SDs to create standardized scores: put each score into units of SD, so that the rescaled scores have a mean of 0 and an SD of 1. This changes the scale of the scores, not the shape of their distribution.
Subtract the mean from each score and divide by the SD:
Z = (X – mean)/SD
This is truly an amazing thing
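As a minimal sketch of the Z formula, the following Python standardizes the ten "during" scores from the variance table. The function name is illustrative, and the choice of the population SD (dividing by n) is an assumption made to match that table; swap in `statistics.stdev` to divide by n-1 instead.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Standardize scores: subtract the mean, then divide by the SD."""
    m = mean(scores)
    sd = pstdev(scores)  # assumption: population SD (divide by n)
    return [(x - m) / sd for x in scores]

during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]
z = z_scores(during)

print([round(v, 2) for v in z])
# The standardized scores should have mean 0 and SD 1 (up to rounding error).
print(round(abs(mean(z)), 10), round(pstdev(z), 10))
```

The standardized scores keep the same rank order as the raw scores; only the units change.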
Standardized normal distribution
• Precision has increased

Subject  before  during  after
1        3.1     2.3     7
2        3.2     8.8     4.2
3        2.8     7.1     3.2
4        3.3     2.3     6.7
5        3.3     8.6     4.5
6        3.3     1.5     6.6
7        2.8     9.1     3.4
8        3       3.3     6.5
9        3.1     9.5     4.1
10       3       1       7.3

Visual display of univariate data
• Histograms
• Stem and Leaf plots
• Boxplots
[Figure: histograms of before, during, and after (y-axis: Frequency). The "before" histogram reports Mean = 3.09, Std. Dev = .19, N = 10; a second panel reports Mean = 6.4, Std. Dev = 4.03, N = 10.]
Stem and Leaf plots
Before: N = 10  Median = 3.1  Quartiles = 3, 3.3
   2 : 88
   3 : 00112333

During: N = 10  Median = 5.2  Quartiles = 2.3, 8.8
  -1 : 0
  -0 :
   0 :
   1 : 5
   2 : 33
   3 : 3
   4 :
   5 :
   6 :
   7 : 1
   8 : 68
   9 : 15

After: N = 10  Median = 5.5  Quartiles = 4.1, 6.7
   3 : 24
   4 : 125
   5 :
   6 : 567
   7 : 3
   High: 17
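The medians and quartiles reported with the stem and leaf displays can be reproduced from the raw values. This is a minimal sketch in Python; the function name is illustrative, the values are read off the displays (including the -1.0 in "during" and the 17 in "after"), and computing each quartile as the median of the lower or upper half of the sorted data is an assumption that happens to match the reported numbers. Other quartile conventions can give slightly different results.

```python
from statistics import median

def median_and_quartiles(scores):
    """Return (median, lower quartile, upper quartile).

    Assumption: each quartile is the median of the lower or upper half
    of the sorted data (for odd n the middle value is left out of both).
    Needs at least two scores.
    """
    s = sorted(scores)
    half = len(s) // 2
    return median(s), median(s[:half]), median(s[-half:])

# Values read off the stem and leaf displays.
before = [2.8, 2.8, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 3.3, 3.3]
during = [-1.0, 1.5, 2.3, 2.3, 3.3, 7.1, 8.6, 8.8, 9.1, 9.5]
after = [3.2, 3.4, 4.1, 4.2, 4.5, 6.5, 6.6, 6.7, 7.3, 17.0]

for name, scores in [("before", before), ("during", during), ("after", after)]:
    med, q1, q3 = median_and_quartiles(scores)
    print(name, "median =", round(med, 2), "quartiles =", round(q1, 2), round(q3, 2))
```

Note that the median and quartiles barely react to the extreme values, which is why they are useful when the data may contain mistakes.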
Boxplots
The lower and upper bounds of each box are the 25th and 75th percentiles (the box spans the interquartile range)
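To make the box concrete, here is a minimal Python sketch that computes the 25th and 75th percentiles, the interquartile range, and any points flagged as outliers. The function name is illustrative. The 1.5 x IQR cutoff is a common boxplot convention and an assumption here, not something stated on the slide; the quartiles are taken as medians of the lower and upper halves of the sorted data. The values are the "during" and "after" scores from the stem and leaf displays.

```python
from statistics import median

def box_summary(scores, k=1.5):
    """Return (q1, q3, iqr, outliers) for a boxplot.

    Assumptions: quartiles are medians of the lower and upper halves of
    the sorted data; points beyond k * IQR from the box count as outliers.
    """
    s = sorted(scores)
    half = len(s) // 2
    q1, q3 = median(s[:half]), median(s[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in s if x < low_fence or x > high_fence]
    return q1, q3, iqr, outliers

during = [-1.0, 1.5, 2.3, 2.3, 3.3, 7.1, 8.6, 8.8, 9.1, 9.5]
after = [3.2, 3.4, 4.1, 4.2, 4.5, 6.5, 6.6, 6.7, 7.3, 17.0]

print("during:", box_summary(during))
print("after:", box_summary(after))
```

Under these assumptions the 17 in "after" is flagged, while the -1.0 in "during" is not, because the "during" scores are widely spread. A value can be wrong without being a statistical outlier, so checking the valid range of each variable is a separate step.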
[Figure: boxplots of four variables (N = 10 each), followed by boxplots of NORMAL and EXP]
Q-Q Plots
[Figure: plots for distributions$NORMAL (N=100) and distributions$EXP (N=100), including normal Q-Q plots]
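A normal Q-Q plot pairs each sorted score with the standard normal quantile expected at its rank; points close to a straight line suggest an approximately normal distribution. The following is a minimal sketch in Python that computes the points without drawing them. The function name is illustrative, and the plotting positions (i - 0.5)/n are an assumption; statistical packages use several slightly different conventions.

```python
from statistics import NormalDist

def normal_qq_points(scores):
    """Pair each sorted score with a standard normal quantile."""
    s = sorted(scores)
    n = len(s)
    # Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(theoretical, s))

# "after" scores from the stem and leaf display, including the value 17.
after = [3.2, 3.4, 4.1, 4.2, 4.5, 6.5, 6.6, 6.7, 7.3, 17.0]

for quantile, score in normal_qq_points(after):
    print(round(quantile, 2), score)
```

In the printed pairs the last point sits far above the trend of the others, which is how an outlier shows up on a Q-Q plot.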
So…what do you do?
So, you have examined each variable for mistakes, outliers and
distribution and made any necessary alterations. Now what?
Look at the relationship between 2 (or more) variables at a time
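One simple numeric summary of the relationship between two variables is the Pearson correlation r, together with the t statistic for testing whether r differs from zero. This is a minimal sketch in Python using the integer "during" and "after" scores from the variance table earlier in the deck; the function names are illustrative, and the t formula assumes the usual test with n-2 degrees of freedom.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for a correlation r based on n pairs (n - 2 df)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

during = [2, 8, 7, 2, 8, 1, 9, 3, 9, 1]
after = [7, 4, 3, 6, 4, 6, 3, 6, 4, 7]

r = pearson_r(during, after)
print("r =", round(r, 2), "t =", round(t_for_r(r, len(during)), 2))
```

A single number like r can hide mistakes and outliers, so it is best read alongside a scatter plot of the same two variables.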
Visual Displays of Bivariate Data
[Figure: plots of the NORMAL and EXP variables]
Intro to Scatter plots
[Figure: scatterplot matrix of BEFORE, DURING, AFTER, and FOLLOWUP (N=10 each). Diagonal panels are normal Q-Q plots against Standard Normal Quantiles; off-diagonal panels are pairwise scatter plots annotated with r, B, t, p, and N.]
Summary statistics shown on the Q-Q panels:
• DURING: M=5.15, Sd=3.67, Sk=-0.19, K=-1.51
• AFTER: M=6.35, Sd=3.82, Sk=2.01*, K=3.12*
• FOLLOWUP: M=5.89, Sd=2.43, Sk=0.09, K=-1.29
Scatter plot annotations, in the order extracted:
• r=-0.18, B=-0.01, t=-0.53, p=0.61, N=10
• r=-0.57, B=-0.6, t=-1.97, p=0.08, N=10
• r=-0.33, B=-0.22, t=-0.99, p=0.35, N=10
• r=0.18, B=0.01, t=0.52, p=0.62, N=10
• r=-0.57, B=-0.55, t=-1.97, p=0.08, N=10
• r=0.34, B=0.22, t=1.04, p=0.33, N=10
• r=0.19, B=0.01, t=0.53, p=0.61, N=10
• r=-0.33, B=-0.5, t=-0.99, p=0.35, N=10
• r=0.34, B=0.54, t=1.04, p=0.33, N=10
With Outlier and Out of Range Value
[Figure: normal Q-Q plots and scatter plots of DURING and AFTER (N=10 each).]
• DURING: M=5.15, Sd=3.67, Sk=-0.19, K=-1.51
• AFTER: M=6.35, Sd=3.82, Sk=2.01*, K=3.12*
• DURING and AFTER: r=-0.57, t=-1.97, p=0.08, N=10 (B=-0.6 in one panel, B=-0.55 in the other)
[Figure: the same panels redrawn, first with AFTnew (N=9) against DURING, then with AFTnew against DURnew (N=10).]
• DURnew: M=5.35, Sd=3.37, Sk=0.00, K=-1.81
• AFTnew and DURnew: r=-0.92, B=-0.41, t=-6.4, p=0, N=9
• It is very important to pay attention to the scale that you are using
when you are plotting.
• Compare the following graphs created from identical data
[Figure: graphs of the mean (y-axis: Mean) for before, during, after, and followup, drawn with different y-axis scales]
Summary
• Examine all your variables thoroughly and carefully before you begin
analysis
• Use visual displays whenever possible
• Transform each variable as necessary to deal with mistakes, outliers,
and distributions
Resources on line
http://www.statsoftinc.com/textbook/stathome.html
http://www.cs.uni.edu/~campbell/stat/lectures.html
http://www.psychstat.smsu.edu/sbk00.htm
http://davidmlane.com/hyperstat/
http://bcs.whfreeman.com/ips4e/pages/bcs-main.asp?v=category&s=00010&n=99000&i=99010.01&o=
http://trochim.human.cornell.edu/selstat/ssstart.htm
http://www.math.yorku.ca/SCS/StatResource.html#DataVis
Recommended Reading
• http://www.execpc.com/~helberg/pitfalls