
BOX-PLOT WITH FENCES

Applied Statistics and Computing Lab Indian School of Business


Learning goals
Why go beyond a basic box-plot?
What are fences?
How is a box-plot with fences constructed?
How does one interpret such a plot?
What are the gains and limitations?


Box-plot with fences


Can we modify the basic box-plot so that it helps in detecting unusual observations?
A box-plot with fences can be useful.
What are fences? Let us take a look at a figure!


Box-plot with fences (contd.)


Source: http://en.wikipedia.org/wiki/Boxplot

Basis for fences


From the previous figure, we see that for normally distributed data, about 99.3% of the data lie in the interval between the inner fences,
$(Q_1 - 1.5(Q_3 - Q_1),\ Q_3 + 1.5(Q_3 - Q_1))$

Also, only about 3 out of a million observations (about 0.0003%) are expected to lie outside the interval between the outer fences,
$(Q_1 - 3(Q_3 - Q_1),\ Q_3 + 3(Q_3 - Q_1))$
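The two coverage figures can be checked numerically. The following is a minimal sketch for the standard normal distribution, using its theoretical quartiles; the variable names are chosen here for illustration only.

```r
# Theoretical quartiles of the standard normal distribution
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1

# Inner fences: 1.5 * IQR beyond the quartiles
inner <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
# Outer fences: 3 * IQR beyond the quartiles
outer <- c(q1 - 3 * iqr, q3 + 3 * iqr)

# Probability of falling between the inner fences (about 0.993)
p_inner <- pnorm(inner[2]) - pnorm(inner[1])

# Probability of falling outside the outer fences (a few in a million)
p_outside_outer <- 1 - (pnorm(outer[2]) - pnorm(outer[1]))

print(p_inner)
print(p_outside_outer)
```

These are population values. In a sample, the quartiles are estimated from the data, so the observed proportions will differ somewhat.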


Box-plot with fences


[Figure: box-plot with fences, with points labelled "Suspected outlier" and "Outlier"]

Visuals from Aczel A., Sounderpandian J., Complete Business Statistics
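The labelling in the figure can be sketched in R. This sketch assumes a common convention: points between the inner and outer fences are "suspected outliers", and points beyond the outer fences are "outliers". The function name `classify_fences` and the data are made up for illustration. Quartiles are computed with `quantile()`, which can differ slightly from the hinges that R's `boxplot()` uses.

```r
# Hypothetical helper: label each observation relative to the fences
classify_fences <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
  outer <- c(q[1] - 3 * iqr,   q[2] + 3 * iqr)

  status <- rep("regular", length(x))
  # Beyond the inner fences: at least a suspected outlier
  status[x < inner[1] | x > inner[2]] <- "suspected outlier"
  # Beyond the outer fences: upgrade the label to outlier
  status[x < outer[1] | x > outer[2]] <- "outlier"
  status
}

# Made-up scores, for illustration only
scores <- c(52, 55, 57, 58, 60, 61, 63, 64, 66, 80, 120)
status <- classify_fences(scores)
print(data.frame(scores, status))
```

For these made-up scores the quartiles are 57.5 and 65, so the inner fences are 46.25 and 76.25 and the outer fences are 35 and 87.5. The value 80 is flagged as a suspected outlier and 120 as an outlier.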

Box-plot with fences


Box-plots with fences are useful in identifying unusual observations.
What are unusual observations?
A box-plot serves only as a diagnostic. It is not a test of significance.
Caution: Even for a random sample from a normal distribution, about 7 out of a thousand sample points can lie outside the inner fences and about 3 out of a million can lie outside the outer fences. Thus, when dealing with large data sets, one has to be careful about declaring outliers on the basis of a box-plot. Sometimes, simulation-based methods are used for this purpose. For more information, see Robert Dawson (2011).
Sometimes only the inner fences are used; this is the default for the boxplot command in R.
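The caution can be illustrated with a small simulation. This is a sketch only: the seed and sample size are arbitrary choices. It draws a large normal sample and counts the points that R's `boxplot.stats()` flags beyond the inner fences (`coef = 1.5`, the same multiplier as the default `range` in `boxplot()`).

```r
set.seed(1)          # arbitrary seed, for reproducibility
n <- 100000          # arbitrary large sample size
x <- rnorm(n)        # data with no contamination at all

# Points beyond the inner fences, as R's box-plot would draw them
flagged <- boxplot.stats(x, coef = 1.5)$out

# Proportion flagged: close to 0.007 even though nothing is "wrong"
prop_flagged <- length(flagged) / n
print(length(flagged))
print(prop_flagged)
```

With a sample this large, several hundred perfectly ordinary points are drawn as unusual. This is why a box-plot should be read as a diagnostic and not as a test.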


Comparison of data


Visuals from Aczel A., Sounderpandian J., Complete Business Statistics

Box-plot of final exam scores


Box-plots of all the scores


Box-plots of three minors


Box-plots indicating means


Interpretation of the Box-plot


In the box-plot corresponding to the scores in the second semester exam, we have 3 unusual observations among 50. Under normality, we expect only about 7 in a thousand observations (about 0.35 in a sample of 50) to lie outside the inner fences. Thus one needs to probe into these unusual observations.
The distribution of scores of the second semester exam appears to be symmetric, but may have slightly longer tails in view of the unusual observations, situated symmetrically below and above the fences.
From the box-plots corresponding to the three minors, it appears that
The distribution of scores in the First minor is skewed to the right,
The distributions of scores in the Second and Third minors are symmetric and are somewhat similar, and
The median scores of the three minors seem to be close (we shall examine this further when we deal with the notched box-plots).

There is an unusual observation in the box-plot of scores of the First semester exam, with a value of about 18. We know that the GPA is out of 10. Thus this is an outlier!
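The comparison of "3 among 50" with "7 in a thousand" can be made concrete with a rough calculation. This sketch assumes each observation independently falls outside the inner fences with probability 0.007. That is an approximation, because in a small sample the fences are themselves estimated from the data, so the true rate is somewhat higher (the point made by Dawson, 2011).

```r
n_obs <- 50      # number of scores in the box-plot
p_out <- 0.007   # approximate chance of lying outside the inner fences

# Expected number of unusual observations under normality
expected <- n_obs * p_out

# Approximate probability of seeing 3 or more such observations
p_three_or_more <- pbinom(2, size = n_obs, prob = p_out, lower.tail = FALSE)

print(expected)
print(p_three_or_more)
```

Under this approximation, three flagged points in fifty is well above what normality alone would suggest, which supports probing these observations further. It is not a formal test.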


Gains from a box-plot with fences


As we saw,
We can identify unusual observations
We can examine the tail behaviour
We can compare two or more variables or datasets more easily
However, we cannot get modal information from these plots!


R-codes

Plot: Boxplot (of single variable)
R-code: boxplot(variable name)

Plot: Boxplot (of all the variables in a dataset)
R-code: boxplot(name of data as input in R)

Plot: Boxplot (of k distinct variables from a dataset)
R-code: boxplot(dataname$variable 1 name, dataname$variable 2 name, ..., dataname$variable k name)

Plot: Boxplot with means (can be drawn for one or many variables at the same time)
R-code: boxplot(variable specification)
        points(y = colMeans(variables specification), x = 1:(total number of variables in a box-plot))
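The templates above can be tried on any numeric data. This sketch uses the four numeric columns of R's built-in `iris` dataset as a stand-in, since the exam-score data from the slides is not available here. The object name `dat` and the plot titles are illustrative choices.

```r
# Stand-in data (an assumption): numeric columns of the built-in iris dataset
dat <- iris[, 1:4]

# Box-plot of a single variable
boxplot(dat$Sepal.Length, main = "Single variable")

# Box-plot of all the variables in a dataset
bp <- boxplot(dat, main = "All variables")

# Box-plot of k distinct variables from a dataset (here k = 2)
boxplot(dat$Sepal.Length, dat$Petal.Length,
        names = c("Sepal.Length", "Petal.Length"))

# Box-plot with means: draw the boxes, then overlay one point per variable
boxplot(dat, main = "Box-plots with means")
points(x = 1:ncol(dat), y = colMeans(dat), pch = 19, col = "red")
```

`boxplot()` also returns its summary invisibly. Here `bp$stats` holds the whisker ends, hinges and median for each variable, and `bp$out` holds the points drawn beyond the inner fences.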


Thank you
