
Stat 146

Handout 1

John Wilder Tukey (1915-2000)

1. Introduction

1.1 What is EDA?

Definition
Exploratory data analysis is a subfield of applied statistics that is concerned with the investigation
of the collected or transformed data to reveal patterns, peculiarities and relationships using visual
displays, resistant statistics and a thorough examination of the residuals.
Approach
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.

Focus
The EDA approach is precisely that: an approach. It is not a set of techniques, but an
attitude/philosophy about how a data analysis should be carried out.
Philosophy
EDA is not identical to statistical graphics. Statistical graphics is a collection of techniques, all
graphically based and each focusing on one aspect of data characterization. EDA has a wider
scope: it is an approach to data analysis that postpones the usual assumptions about what kind
of model the data follow in favor of the more direct approach of allowing the data themselves to
reveal their underlying structure and model. EDA is not a mere collection of techniques; it is a
philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret.
EDA relies heavily on the collection of techniques that we call "statistical graphics", but it is not
identical to statistical graphics per se.

Techniques
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the
heavy reliance on graphics is that the main role of EDA is to explore open-mindedly, and graphics
give the analyst unparalleled power to do so: they entice the data to reveal their structural
secrets and keep the analyst ready to gain new, often unsuspected, insight into the data. Graphics
draw this power from their combination with the natural pattern-recognition capabilities that we
all possess.
1.2 How Does Exploratory Data Analysis Differ from Classical Data Analysis?

EDA is a data analysis approach. What other data analysis approaches exist and how does EDA
differ from these other approaches? Three popular data analysis approaches are:
1. Classical
2. Exploratory (EDA)
3. Bayesian
These three approaches are similar in that they all start with a general science problem and all
yield science conclusions. The difference is the sequence and focus of the intermediate steps.
For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions

Focusing on EDA versus classical, these two approaches differ as follows:


1. Models
2. Focus
3. Techniques
4. Rigor
5. Data Treatment
6. Assumptions


1.3 Main Themes in EDA


Resistance
- is a matter of insensitivity to misbehavior in the data.
- An analysis or summary is resistant if an arbitrary change in any small part of the data
produces only a small change in the analysis or summary.
- A resistant method pays most attention to the main body of the data and little to outliers.
- In theoretical discussions, one seeks to limit the effect of any small change in the
sample: minor perturbations in all the data, or drastic shifts in a small fraction of the
data.
- When concerned with only some of the possible small changes, we might speak of
resistance to wild values or resistance to rounding and grouping.
Resistance vs. Robustness: Robustness generally implies insensitivity to departures from
the assumptions surrounding an underlying probabilistic model.
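
The idea of resistance can be illustrated with a minimal sketch. The two batches below are hypothetical (they do not come from the handout); the second differs from the first only in one wild value. The mean is pulled far away by that single value, while the median does not move.

```python
from statistics import mean, median

# Hypothetical batch of well-behaved values (an assumption, not handout data).
clean = [28.0, 29.5, 30.0, 30.5, 32.0]
# The same batch with one value replaced by a wild value.
contaminated = [28.0, 29.5, 30.0, 30.5, 320.0]

# How far each summary moves when a small part of the data changes arbitrarily.
mean_shift = abs(mean(contaminated) - mean(clean))
median_shift = abs(median(contaminated) - median(clean))

print(f"mean:   {mean(clean):.1f} -> {mean(contaminated):.1f} (shift {mean_shift:.1f})")
print(f"median: {median(clean):.1f} -> {median(contaminated):.1f} (shift {median_shift:.1f})")
```

In this sense the median is a resistant summary and the mean is not.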
Residuals
- are what remain after a summary or fitted model has been subtracted out, according to
the schematic equation residual = data - fit.
- EDA asserts that an analysis of a set of data is not complete without examination of
the residuals.
- This examination should take advantage of the tendency of resistant methods to provide
a clear separation between dominant behavior and unusual behavior.
- As in more traditional cases, residuals can warn of important systematic aspects of data
behavior that may need attention, such as curvature, nonadditivity, and nonconstancy of
variability.
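
As a rough sketch of why a resistant fit separates dominant from unusual behavior, the example below takes the schematic equation in its usual form, residual = data - fit, and applies it to a hypothetical batch (an assumption, not handout data) with one unusual value. The "fit" here is simply a single summary value: the median (resistant) or the mean (not resistant).

```python
from statistics import mean, median

# Hypothetical batch with one unusual value (an assumption for illustration).
data = [28.0, 29.5, 30.0, 30.5, 32.0, 51.0]

def residuals(values, fit):
    """Apply the schematic equation: residual = data - fit."""
    return [round(v - fit, 2) for v in values]

res_median = residuals(data, median(data))  # resistant fit
res_mean = residuals(data, mean(data))      # non-resistant fit

print("median fit:", median(data), res_median)
print("mean fit:  ", round(mean(data), 2), res_mean)
```

With the median as the fit, the residuals of the main body stay small and the unusual value stands out more clearly than it does with the mean, which the unusual value has pulled toward itself.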
Re-expression
- Involves the question of what scale would help to simplify the analysis of data.
- EDA emphasizes the benefits of considering, at an early stage, whether the original
scale of measurement for the data is satisfactory. If not, a re-expression into another
scale may help promote symmetry, constancy of variability, straightness of
relationship, or additivity of effect, depending on the structure.
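
A small sketch of re-expression promoting symmetry follows. The batch is hypothetical and deliberately chosen so that a logarithmic scale works perfectly; the `skew_ratio` helper is an illustrative measure introduced here, not one defined in the handout.

```python
import math
from statistics import median

# Hypothetical right-skewed batch (an assumption for illustration).
raw = [1, 2, 4, 8, 16, 32, 64]
# Re-express the same batch on a log (base 2) scale.
logged = [math.log2(v) for v in raw]

def skew_ratio(values):
    """(max - median) / (median - min); 1.0 means the extremes are balanced."""
    m = median(values)
    return (max(values) - m) / (m - min(values))

print("raw scale: ", skew_ratio(raw))
print("log2 scale:", skew_ratio(logged))
```

On the raw scale the upper extreme lies much farther from the median than the lower extreme; on the log scale the two distances are equal. Real data rarely behave this neatly, so the choice of scale usually needs some trial and inspection.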

Revelation
- utilizes displays; answers the need to see behavior of data, of fits, of diagnostic
measures, and of residuals so as to grasp unexpected features as well as familiar
regularities.
- emphasizes the visual display, including new graphical techniques.


Hypothetical Data Matrix Containing 4 Variables and 20 Observations

obs               X1      X2      X3      X4
1               32.3    33.2    24.7    29.7
2               28.0    34.2    29.4    30.2
3               31.4    27.0    28.5    28.7
4               29.5    33.0    25.6    27.3
5               40.0    35.8    27.6    31.3
6               20.0    34.6    32.0    29.5
7               26.0    24.2    28.2    26.3
8               28.6    34.9    40.9    29.9
9               27.7    25.1    37.5    29.8
10              27.0    37.3    26.3    30.1
11              17.5    22.7    33.9    37.9
12              31.0    25.4    36.7    27.6
13              32.0    25.8    25.2    30.3
14              30.5    38.2    23.8    22.1
15              34.0    26.5    26.1    28.1
16              42.5    38.4    28.2    26.5
17              35.0    26.8    31.8    30.5
18              29.0    21.6    39.7    27.4
19              25.0    33.5    19.1    51.0
20              33.0    21.8    34.8    25.8
Mean            30.0    30.0    30.0    30.0
Std. Deviation   5.8     5.8     5.8     5.8

[Figure: Univariate Scatterplots Comparing Data Distributions of Four Hypothetical Variables.
Vertical axis: Variable; horizontal axis: Data Values.]
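
The point of the hypothetical data matrix can be checked with a short sketch: the four variables share the same mean and standard deviation, yet their medians and extremes differ. The values are transcribed from the data matrix; the `summarize` helper and the choice of median, minimum, and maximum as the comparison are assumptions made for illustration.

```python
from statistics import mean, median, stdev

# Values transcribed from the handout's hypothetical data matrix.
data = {
    "X1": [32.3, 28.0, 31.4, 29.5, 40.0, 20.0, 26.0, 28.6, 27.7, 27.0,
           17.5, 31.0, 32.0, 30.5, 34.0, 42.5, 35.0, 29.0, 25.0, 33.0],
    "X2": [33.2, 34.2, 27.0, 33.0, 35.8, 34.6, 24.2, 34.9, 25.1, 37.3,
           22.7, 25.4, 25.8, 38.2, 26.5, 38.4, 26.8, 21.6, 33.5, 21.8],
    "X3": [24.7, 29.4, 28.5, 25.6, 27.6, 32.0, 28.2, 40.9, 37.5, 26.3,
           33.9, 36.7, 25.2, 23.8, 26.1, 28.2, 31.8, 39.7, 19.1, 34.8],
    "X4": [29.7, 30.2, 28.7, 27.3, 31.3, 29.5, 26.3, 29.9, 29.8, 30.1,
           37.9, 27.6, 30.3, 22.1, 28.1, 26.5, 30.5, 27.4, 51.0, 25.8],
}

def summarize(values):
    """Return classical summaries (mean, sd) next to median and extremes."""
    return {
        "mean": round(mean(values), 1),
        "sd": round(stdev(values), 1),      # sample standard deviation (n - 1)
        "median": round(median(values), 2),
        "min": min(values),
        "max": max(values),
    }

summary = {name: summarize(values) for name, values in data.items()}
for name, s in summary.items():
    print(name, s)
```

The mean and standard deviation columns agree across all four variables, so any difference between them shows up only in the other summaries or in a display of the values themselves.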


1.4 An EDA/Graphics Example

A simple, classic (Anscombe) example of the central role that graphics play in terms of providing
insight into a data set starts with the following data set:
 X       Y
10    8.04
 8    6.95
13    7.58
 9    8.81
11    8.33
14    9.96
 6    7.24
 4    4.26
12   10.84
 7    4.82
 5    5.68

If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y
as a function of X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
The above quantitative analysis, although valuable, gives us only limited insight into the data.
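
The summary statistics above can be reproduced with a short sketch, assuming an ordinary least-squares fit of Y on X and a residual standard deviation computed with n - 2 degrees of freedom (the handout does not state which formulas it used). The data are the X and Y values listed above.

```python
import math

# X and Y values from the data set above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Ordinary least-squares line and correlation coefficient.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
correlation = sxy / math.sqrt(sxx * syy)

# Residual standard deviation with n - 2 degrees of freedom (an assumption).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

print(f"N = {n}, mean of X = {mean_x:.1f}, mean of Y = {mean_y:.1f}")
print(f"intercept = {intercept:.1f}, slope = {slope:.1f}")
print(f"residual sd = {residual_sd:.3f}, correlation = {correlation:.3f}")
```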


In contrast, the following simple scatter plot of the data

[Figure: scatter plot of Y versus X]

suggests the following:


1. the data set "behaves like" a linear curve with some scatter;
2. there is no justification for a more complicated model (e.g., quadratic);
3. there are no outliers;
4. the vertical spread of the data appears to be of equal height irrespective of the X-value; this
indicates that the data are equally precise throughout and so a "regular" (that is, equi-weighted) fit
is appropriate.
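
For readers without plotting software, a crude character-based scatter plot of the same X and Y values can be sketched as follows. The `ascii_scatter` helper and its grid size are assumptions made for illustration; it is only a rough stand-in for a proper graphical display.

```python
# X and Y values from the data set in this section.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def ascii_scatter(xs, ys, width=40, height=12):
    """Draw points as '*' on a character grid; Y increases upward."""
    grid = [[" "] * width for _ in range(height)]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    for xi, yi in zip(xs, ys):
        # Scale each coordinate onto the grid.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line).rstrip() for line in grid)

plot = ascii_scatter(x, y)
print(plot)
```

Even at this resolution the upward, roughly straight-line trend with scatter should be visible.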
