
Exploratory Data Analysis (EDA) in the data analysis process

Module B2 Session 13
SADC Course in Statistics

Learning Objectives
Students should be able to:
Construct a dot plot for a numeric variable split by a categorical variable
Apply EDA concepts to a large dataset
Explain the use of Excel's pivot tables and filters in the EDA process
Explain the importance of EDA for data checking and at the start of the analysis
Relate EDA to the principles of official statistics

To put your footer here go to View > Header and Footer

EDA with small and large data sets


Session 12:
Stressed the importance of EDA
Introduced 2 new tools (dot plots and stem and leaf plots)
Practised with small data sets

In this session we scale up


Look at large data sets
The tools do not scale up easily
But the concepts do scale up
EDA becomes even more crucial

Most data sets are large!
At least compared with teaching examples

The essence of a stem and leaf plot


The leaf shows the next digit.
This can be useful in the exploration phase.

[Figure: the data (5.3 5.4 6.0 .. 11.1 11.9) shown as a stem and leaf plot and as a stacked dot plot]
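
The idea that the stem holds the leading digits and the leaf holds the next digit can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the sample values are assumptions made up for this sketch (they are not the data on the slide), and the values are assumed to be recorded to one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot for values with one decimal place."""
    rows = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)           # e.g. 5.1 -> 51, avoids float truncation
        stem, leaf = divmod(tenths, 10)  # stem = whole number, leaf = next digit
        rows[stem].append(leaf)
    lines = []
    # Empty stems are kept so that gaps in the data stay visible
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Made-up values, for illustration only
sample = [4.8, 5.1, 5.1, 5.6, 6.2, 6.9, 7.0, 7.4, 9.3]
for line in stem_and_leaf(sample):
    print(line)
```

Each printed row keeps the actual digits of every observation, which is the property the stem and leaf plot is valued for at the exploration stage.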


What are the key points?


We look at individual data points, not summaries, at this stage
This is general for EDA

The stem and leaf plot in particular keeps the actual numbers as far as possible
This can be important

An example uses the Tanzania survey


Tanzania agriculture survey


This is the variable we wish to explore. It is a value between 0 and 100

The data in Excel


The variable to explore before analysis

How to explore this value


Can we do a stem and leaf plot?
It could be done by hand in Excel, but there are 16628 values!
Even if automated, that is too many!

The essence of a stem and leaf plot is to look at all the possible values

Try a pivot table


A powerful feature in Excel
Used previously on categorical data
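
For readers working outside Excel, the same idea (a count of how often each distinct value of a numeric variable occurs) can be sketched with Python's standard library. The function name `value_counts` and the percentages below are assumptions invented for this sketch; they are not the Tanzania survey data.

```python
from collections import Counter

def value_counts(values):
    """Count how often each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

# Hypothetical percentages between 0 and 100, for illustration only
percentages = [0, 50, 50, 25, 100, 2, 50, 75, 33, 100, 0, 50]

for value, count in value_counts(percentages):
    print(f"{value:>3}  {count}")
```

Like the pivot table, this lists every value that actually occurs, so heaping at round numbers or unexpected values becomes visible even when there are far too many records for a stem and leaf plot.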


The pivot table


Some results


What do you deduce?


There are oddities in rounding
Perhaps enumerator differences
Can this question be answered to 1%?

So what should be done before analysis?
First look further at the data
Excel can help: it can drill down to examine individual records

The concept:
Use the table to look for oddities
Then examine them in more detail


Drilling down: an example


Make the 6 corresponding to 2% the active cell
Then double-click to give the detail

4 of these values are from the same village, so the same enumerator
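
Drilling down amounts to selecting the individual records behind one cell of the table and then looking at their other columns. A minimal Python sketch follows, assuming hypothetical records with made-up field names (`household`, `village`, `pct`); it does not reproduce the survey records.

```python
from collections import Counter

# Hypothetical records, for illustration only; the field names are assumptions
records = [
    {"household": 1, "village": "A", "pct": 2},
    {"household": 2, "village": "A", "pct": 2},
    {"household": 3, "village": "B", "pct": 50},
    {"household": 4, "village": "A", "pct": 2},
    {"household": 5, "village": "C", "pct": 2},
    {"household": 6, "village": "B", "pct": 25},
]

def drill_down(rows, column, value):
    """Return the individual records whose `column` equals `value`."""
    return [r for r in rows if r[column] == value]

# Step 1: select the records behind one cell of the table
detail = drill_down(records, "pct", 2)

# Step 2: see whether the odd values cluster in particular villages
by_village = Counter(r["village"] for r in detail)
print(by_village.most_common())
```

If most of the unusual values come from one village, as in this invented example, that points to the enumerator rather than to the households.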


What do you conclude? Technique and results


Technique
Stem and leaf plots when looking at small datasets
Pivot tables when datasets are large

But the principle is general
Numbers must be looked at carefully!
The principle can be adapted to the data, which can be explored effectively in Excel

Results
Did enumerators have different interpretations of the precision required in the percentages?
This needs further exploration, and the analysis needs to take account of it

Another new element in this session


Exploratory analysis includes
looking for oddities in the data

Unexplained oddities cause variation
that can make it difficult to detect the pattern, because they add unnecessary noise to the data

How do you tame the variation?
One way is to examine related variables
This is important in the analysis (the next slide is a repeat from Session 3)

It is also a key weapon in data exploration
and is covered in the practical

Slide from Module B2 Session 3


To do good statistics you must
fight the curse of variation

Two main strategies to overcome variation

1. Take enough observations
In the Tanzania survey there were 3223 households just from this one region

2. Measure characteristics that explain variation
Variation itself is not necessarily the problem
Variation you do not understand is the problem

Here we start understanding variation


at the exploration stage

Practical: three parts


Tanzania data
Practice what has been done in these slides

Dot plots split by a factor
Demonstration and practice

Swaziland data
Apply the concepts, checking factors as well as numeric columns

Then the key points are reviewed


Points for review after the practical


Looking for individual problems
and surprising patterns

Exploratory graphics
need to help the analyst and data checker (see the dot plots on the next slide)

Tables are also useful


especially with the facility to drill down

Look at individual variables


and at records as a whole

Trust your common sense


It is useful to estimate results
And question the computer if they are very different

Dot plots - yield by variety

Outliers (typing errors) are clear, but only because of the 2nd variable
They are not outliers overall
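
The point that a value can be odd within its group without being odd overall can also be checked numerically. The sketch below is an illustration under stated assumptions, not the method used in the session: the yields and variety names are invented, and the flagging rule (more than `k` median absolute deviations from the median) is a crude stand-in for what the eye does with a dot plot.

```python
from statistics import median

def flag_outliers(values, k=5.0):
    """Flag values more than k median absolute deviations from the median.

    A crude rule chosen for this sketch. If more than half the values are
    identical the deviation is zero and every other value is flagged.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) > k * mad]

# Hypothetical yields by variety, for illustration only
yields = {
    "old": [1.1, 1.3, 1.2, 1.4, 3.9],  # 3.9 looks like a typing error
    "new": [3.8, 4.1, 4.0, 3.7, 1.2],  # so does 1.2
}

# Ignoring the second variable: nothing stands out
overall = [v for group in yields.values() for v in group]
print("overall:", flag_outliers(overall))

# Split by variety: the odd values become clear
for variety, values in yields.items():
    print(variety, flag_outliers(values))
```

Taken together, all values lie in a range that both varieties cover, so nothing is flagged; split by variety, one value in each group is far from the rest.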

EDA is a continuous process


EDA is effectively a continuation of the data checking process
The example on the previous slide shows how some oddities only become clear once the analysis is undertaken

This continues into the formal analysis


where it involves looking at the residuals

They are the unexplained variation


As discussed in Session 3!

So analysis is not just a set of rules
It is a thoughtful process
Where you become the data detective!

Swaziland data was for checking


Investigating the column called Presence

What does 0 mean?

Why are there blanks?


Next steps:

1. Look at the questionnaire


2. Select these records

You are becoming detectives!


Codes for the column

The codes seem clear enough. The zeros and blanks are still a puzzle
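
Checking a coded column against its code list, and then selecting the records that do not match, can be sketched as follows. The valid codes and the column values here are hypothetical, invented for this sketch; the real codes are those on the questionnaire.

```python
from collections import Counter

# Hypothetical code list, for illustration only; the real codes
# come from the questionnaire
VALID_CODES = {"1", "2", "3"}

# Hypothetical column values as read from a file; "" is a blank cell
presence = ["1", "2", "", "1", "0", "3", "", "2", "1"]

# Tabulate every value that occurs, including blanks and zeros
counts = Counter(presence)

# Values that are not in the code list, with their frequencies
unexpected = {code: n for code, n in counts.items() if code not in VALID_CODES}
print(unexpected)

# Positions of the records to select and inspect as a whole
rows_to_check = [i for i, code in enumerate(presence) if code not in VALID_CODES]
print(rows_to_check)
```

The selected records can then be examined across all their columns, which is the step the following slides carry out with Excel's filter.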



Selecting the blank records

Missing also

Too young and all the same


Crop code not recognised

Areas too large



i.e. serious problems with the whole record



Dot plot of area by Presence

Odd crop areas were ALL associated with odd codes for the column PRESENCE

It was found to be a data transfer problem with one byte missing in these records

Checking data quality and EDA


Where              | Why                                  | How                    | By whom
Before data entry  | To ensure complete data set received | Manual check           | Supervisor
During data entry  | To highlight anomalies               | Double check           | Supervisor and helpers
Before analysis    | As above                             | Filter, dot plots etc. | Analyst/statistician
During analysis    | Remain critical                      | Residuals              | Analyst/statistician

Importance: principles of official statistics


Principle 2: Professional standards
It is unprofessional to analyse the data and report results without exploring critically at all stages

Principle 4: Prevention of misuse


We risk misusing the data unless we explore the data critically

Principle 5: Sources of statistics


Includes a requirement to avoid undue burden on respondents
We must process the data fully and effectively. This needs EDA
Otherwise the burden imposed on respondents is to some extent wasted

Can you now:


Apply EDA concepts to a large dataset
Explain the importance of EDA for data checking and at the start of the analysis
Relate EDA to the principles of official statistics


Now you can organise the data for analysis
And then do an exploratory analysis

We show next how the analysis is easy IF your objectives are clear
