Exploratory data analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics in easy-to-understand form, often with visual graphs,
without using a statistical model or having formulated a hypothesis. Exploratory data analysis
was promoted by John Tukey to encourage statisticians to examine their data sets visually and to
formulate hypotheses that could be tested on new data sets. These visualization capabilities also
allow statisticians to identify outliers, trends and patterns in data that merit further study.
In descriptive statistics, a box plot (also known as a box-and-whisker diagram) is a convenient
way of graphically depicting groups of numerical data through their five-number summaries,
namely: the smallest observation (sample minimum), the first quartile (Q1), the second quartile
(also called the median) (Q2), the third quartile (Q3), and the largest observation (sample
maximum). A box plot may also indicate which observations, if any, might be considered outliers.
Tukey promoted the use of the five-number summary of numerical data because the quantiles are
defined for all distributions and, moreover, they are more robust to skewed or heavy-tailed
distributions than the moment-based measures.
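As a minimal sketch of the five-number summary, the Python snippet below computes it for the 48 student ages used in the stem and leaf example of these notes. The function name five_number_summary is hypothetical, and the "inclusive" interpolation rule of statistics.quantiles is an assumption: textbooks use several quartile conventions, and they can give slightly different values of Q1 and Q3.

```python
import statistics

# Ages of the 48 students from the stem and leaf example in these notes.
ages = [
    22, 19, 30, 21, 41, 24, 19, 23, 27, 32, 30, 20,
    22, 36, 24, 26, 39, 20, 21, 19, 19, 19, 22, 30,
    31, 17, 18, 21, 26, 21, 25, 21, 22, 22, 20, 40,
    23, 19, 21, 17, 20, 33, 22, 31, 19, 24, 37, 22,
]


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    # Assumption: the "inclusive" interpolation rule; other quartile
    # conventions can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summary = five_number_summary(ages)
print(summary)  # (17, 20.0, 22.0, 27.75, 41)
```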
How to draw a box plot: A box plot can be horizontal or vertical. To construct a horizontal box
plot, we draw a horizontal axis that is scaled to the data. Above the axis a rectangular box is
drawn with its left and right sides at Q1 and Q3 respectively, and a vertical line segment is
drawn at Q2. A left whisker is drawn as a horizontal line segment from the midpoint of the left
side of the box to the minimum, and a right whisker is drawn as a line segment from the midpoint
of the right side of the box to the maximum. Note that the length of the box is equal to Q3 - Q1,
i.e., the interquartile range (IQR). The left and right whiskers represent the first and fourth
quarters of the data respectively, while the two middle quarters of the data are represented by
the two sections of the box, one to the left of the median line and the other to the right of it.
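The construction above can be imitated in plain text. The following Python sketch maps a five-number summary onto a row of characters: dashes for the whiskers, brackets for the sides of the box at Q1 and Q3, and a vertical bar for the median. The function name ascii_box_plot, the width parameter and the symbols are all hypothetical choices, and the summary values passed in are assumed for illustration (they match the student ages in these notes under one quartile convention).

```python
def ascii_box_plot(minimum, q1, q2, q3, maximum, width=60):
    """Render a one-line horizontal box plot from a five-number summary."""
    span = maximum - minimum
    if span == 0:
        raise ValueError("all values are identical; nothing to draw")

    def column(value):
        # Map a data value onto a character column of the scaled axis.
        return round((value - minimum) / span * (width - 1))

    lo, left, mid, right, hi = (column(v) for v in (minimum, q1, q2, q3, maximum))
    row = [" "] * width
    for i in range(lo, left):
        row[i] = "-"              # left whisker: minimum up to Q1
    for i in range(left, right + 1):
        row[i] = "="              # box: Q1 to Q3, length proportional to the IQR
    for i in range(right + 1, hi + 1):
        row[i] = "-"              # right whisker: Q3 up to maximum
    row[left] = "["               # left side of the box at Q1
    row[right] = "]"              # right side of the box at Q3
    row[mid] = "|"                # median line at Q2
    return "".join(row)


# Assumed five-number summary, used only for illustration.
plot = ascii_box_plot(17, 20, 22, 27.75, 41)
print(plot)
```

A long right whisker and a median sitting in the left part of the box, as here, suggest a distribution that is skewed to the right.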
Detecting outliers: Sometimes we are interested in picking out observations that seem to be
much larger or much smaller than most of the observations. Such atypical observations are called
outliers. To detect outliers in a box plot we proceed as follows: we construct inner fences to
the left and right of the box at a distance of 1.5 times the IQR from its sides, i.e., at
Q1 - 1.5 IQR and Q3 + 1.5 IQR. Similarly, outer fences are constructed on either side at a
distance of 3 times the IQR, i.e., at Q1 - 3 IQR and Q3 + 3 IQR, where IQR denotes the
interquartile range. Observations that fall between the inner and outer fences are called
suspected outliers, while observations that lie beyond the outer fences are called outliers.
These observations are marked with an asterisk (*), and the whiskers are drawn only to the most
extreme values within or on the inner fences. When we are analysing a set of data, suspected
outliers deserve a closer look and outliers should be looked at very carefully.
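The fence rule can be sketched in Python as follows, using the student ages from the stem and leaf example in these notes. The function name classify_outliers is hypothetical, and the quartiles again rely on the "inclusive" rule of statistics.quantiles, which is an assumption; a different quartile convention moves the fences slightly and can change which points are flagged.

```python
import statistics

# Ages of the 48 students from the stem and leaf example in these notes.
ages = [
    22, 19, 30, 21, 41, 24, 19, 23, 27, 32, 30, 20,
    22, 36, 24, 26, 39, 20, 21, 19, 19, 19, 22, 30,
    31, 17, 18, 21, 26, 21, 25, 21, 22, 22, 20, 40,
    23, 19, 21, 17, 20, 33, 22, 31, 19, 24, 37, 22,
]


def classify_outliers(values):
    """Return inner fences, outer fences, suspected outliers, outliers, whisker ends."""
    # Assumption: "inclusive" quartile convention.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

    suspected, outliers, inside = [], [], []
    for v in sorted(values):
        if v < outer[0] or v > outer[1]:
            outliers.append(v)        # beyond the outer fences
        elif v < inner[0] or v > inner[1]:
            suspected.append(v)       # between the inner and outer fences
        else:
            inside.append(v)          # within or on the inner fences
    # Whiskers stop at the most extreme values within or on the inner fences.
    whiskers = (inside[0], inside[-1])
    return inner, outer, suspected, outliers, whiskers


inner, outer, suspected, outliers, whiskers = classify_outliers(ages)
print("inner fences:", inner)         # (8.375, 39.375)
print("outer fences:", outer)         # (-3.25, 51.0)
print("suspected outliers:", suspected)  # [40, 41]
print("outliers:", outliers)          # []
print("whisker ends:", whiskers)      # (17, 39)
```

Under this convention the ages 40 and 41 are suspected outliers, so the right whisker would stop at 39 rather than at the maximum.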
Use: From the box plot we can have a quick view of the centre (given by Q2). The spacings
between the different parts of the box plot help to indicate the degree of dispersion and skewness
in the data. It also helps to identify outliers.
Advantage: Organizing data in a box plot by using the five key values of the five-number summary
is an efficient way of dealing with data sets that are too large for other displays, such as stem
and leaf plots.
Because of the small size of a boxplot, it is easy to display and compare several box plots in a
small space. A box plot is a good alternative or complement to a histogram and is usually better
for showing several simultaneous comparisons.
Disadvantage: The issue with handling such large amounts of data in a box plot is that the exact
values and details of the distribution of results are not retained.
Stem and Leaf display
A stem and leaf display is a graphical method of displaying data. It is particularly useful when
the data are not too numerous. Stem and Leaf display is a "shorthand" notation for representing
numbers. We break each number into 2 parts. The last digit is called the leaf, and the rest of the
number is called the stem. For example the number 75 has a stem of 7 and a leaf of 5. The
number 129 has a stem of 12 and a leaf of 9. We then collect all numbers with the same stem and
place them in a row. Let us illustrate the stem and leaf display with an example. Let us consider
the ages of 48 students in a statistics course:
22  19  30  21  41  24  19  23  27  32  30  20
22  36  24  26  39  20  21  19  19  19  22  30
31  17  18  21  26  21  25  21  22  22  20  40
23  19  21  17  20  33  22  31  19  24  37  22
Here we observe that the lowest age is 17, and the highest age is 41. The stem for 17 is 1, and the
stem for 41 is 4. So we start off by making a table with all the possible stems, 1 through 4, and
write each leaf next to its stem. The resulting stem and leaf display will look as follows. (See
class notes.) Next we arrange the leaves in increasing order and get an ordered stem and leaf
plot. This is displayed below. (See class notes.) Occasionally, if one or more of the rows has too
much information, we can divide each row into two halves (leaves 0 to 4 and leaves 5 to 9) and get
a stem and leaf display as follows. (Refer to class notes.)
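The procedure described above can be sketched in Python. This is only an illustration under stated assumptions and is not the display from the class notes: the function name stem_and_leaf, the split parameter and the "stem | leaves" layout are hypothetical choices. With split=True each stem gets two rows, one for leaves 0 to 4 and one for leaves 5 to 9.

```python
from collections import defaultdict

# Ages of the 48 students listed above.
ages = [
    22, 19, 30, 21, 41, 24, 19, 23, 27, 32, 30, 20,
    22, 36, 24, 26, 39, 20, 21, 19, 19, 19, 22, 30,
    31, 17, 18, 21, 26, 21, 25, 21, 22, 22, 20, 40,
    23, 19, 21, 17, 20, 33, 22, 31, 19, 24, 37, 22,
]


def stem_and_leaf(values, split=False):
    """Return the rows of an ordered stem and leaf display as strings."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # last digit is the leaf, the rest is the stem
        # With split=True, leaves 5-9 go to a second row for the same stem.
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)

    first_stem, last_stem = min(rows)[0], max(rows)[0]
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(first_stem, last_stem + 1):
        for half in halves:
            # Empty rows are kept so the shape of the distribution is not distorted.
            leaves = " ".join(str(leaf) for leaf in rows.get((stem, half), []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines


display = stem_and_leaf(ages)
print("\n".join(display))
# 1 | 7 7 8 9 9 9 9 9 9 9
# 2 | 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 4 4 4 5 6 6 7
# 3 | 0 0 0 1 1 2 3 6 7 9
# 4 | 0 1

split_display = stem_and_leaf(ages, split=True)
print("\n".join(split_display))
```

Because the input is sorted before the leaves are collected, the rows come out already ordered; the unordered display would simply append the leaves in the order in which the ages were recorded.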
Advantages: Here we retain the original values of the variable, which are otherwise lost in a box
plot or histogram. Moreover, if the leaves are carefully aligned vertically, the table has the
same effect as a histogram, so that we can identify the shape of the distribution as well.
Sample percentiles can be found easily from a stem and leaf display. A back-to-back stem and leaf
diagram is used to compare two groups of data.
Disadvantage: This presentation of data is not suitable when the data are numerous, since we
retain every observation. Moreover, it is not possible to compare three or more groups of data
in a single stem and leaf display.