DESCRIPTIVE STATISTICS
Paul Razafimandimby
Montanuniversitat Leoben
Contents
Introduction
Organizing and visualization of the data
Visualization of correlation
Measures of Central Tendency/Location
Measures of variation or dispersion
Appendix: Calculation of some parameters for grouped data.
Introduction
Roughly speaking, statistics is the science of gaining knowledge from numerical
and categorical data. It deals with the collection, analysis and interpretation
of data, and with drawing conclusions from the collected data. A population is
the collection or set of all individuals under consideration in a statistical
study. A sample is a part or subset of the population from which information
is collected.
Descriptive and inferential statistics are interrelated: before drawing
conclusions from a statistical investigation it is necessary to organize and
summarize the information collected from a sample. Moreover, the knowledge
gained from descriptive statistics usually suggests the appropriate method
or approach to be used for inferential statistics.
1. Describe the research problem. For instance, we want to know the average
age of MUL students.
2. Define the population and the sample on which we will conduct the study. In
very simple terms, a population is the collection or set of all
individuals under consideration in a statistical study. In our example, the
population is the set of all MUL students (from first-year to PhD students).
A sample is a part or subset of the population from which information
is collected. The sample could be a set of 100 students randomly interviewed
by 10 volunteers at 5 building entrances of the university tomorrow
between 7:00 and 9:00 am.
3. Collect the data. We send 10 volunteers to interview 100 students at 5
building entrances of the university tomorrow between 7:00 and 9:00 am.
4. Conduct a descriptive data analysis. After collecting the data we need to
organize it. For instance:
We could form a table containing the (relative) frequency and cumulative
(relative) frequency of each class of the sample.
We could plot the data to visualize some of its properties.
We could study the central tendency of the population/sample by calculating
measures of location such as the mean, median, mode, ....
We could also study the dispersion of the population/sample through the
calculation of the range, variance, standard deviation, coefficient of
skewness, kurtosis, interquartile range, ....
All of these terms will be, or have been, defined where appropriate.
Organizing and visualization of the data
As defined above, this branch of statistics deals with the organization and the
summary of information from the collected data. But before we organize our
data we need to specify our variate or (random) variable.
\[ RF(C_i) = \frac{N_i}{\sum_{j=1}^{n} N_j}. \]
The relative frequencies of all classes sum to 1, or 100%. The cumulative
(relative) frequency of a class C_i is the sum of the relative frequencies of
all classes up to and including the class C_i:
\[ CF(C_i) = \sum_{j=1}^{i} RF(C_j). \]
Note that the cumulative frequency makes sense only for quantitative and
ordinal variables.
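The relative and cumulative relative frequencies can be checked numerically. The following is a minimal Matlab sketch; the class counts in Ni are made-up illustrative values, not data from these notes.

```matlab
% Hypothetical absolute frequencies N_i of five ordered classes C_1..C_5
Ni = [5 4 3 7 1];

% Relative frequency: RF(C_i) = N_i / (sum of all N_j)
RF = Ni / sum(Ni);

% Cumulative relative frequency: CF(C_i) = RF(C_1) + ... + RF(C_i)
CF = cumsum(RF);

% First row: relative frequencies, second row: cumulative frequencies
disp([RF; CF])
```

The last entry of CF is 1 (up to rounding), which is a quick sanity check that the relative frequencies sum to 100%.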
Age=load('Age.txt');
and create a frequency table from it
tabulate(Age);
39 5 5.00%
40 4 4.00%
But it also lists the ages 1-16, which have zero counts and which we do not
want. To get the right table we have to remove these rows. For this purpose,
let us store the table in a 40 x 3 matrix called T
T=tabulate(Age);
and remove the rows T(i,:), for i = 1, ..., 16.
T(1:16,:)=[];
Now we recreate the frequency table
Freq_Table=table(T(:,1),T(:,2),T(:,3),'VariableNames',{'Age','Count','Percent'})
Freq_Table =
Age Count Percent
17 5 5
18 4 4
19 3 3
20 7 7
21 2 2
22 5 5
23 9 9
24 5 5
25 7 7
26 3 3
27 3 3
28 2 2
29 1 1
30 1 1
31 3 3
32 4 4
33 6 6
34 4 4
35 4 4
36 5 5
37 2 2
38 6 6
39 5 5
40 4 4
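As an alternative that avoids the zero-count rows altogether, a frequency table can be sketched with base Matlab functions only. This is one possible approach, not the method used in these notes; the vector AgeDemo is a small made-up sample used purely for illustration.

```matlab
% Hypothetical ages (not the Age.txt data)
AgeDemo = [17 18 17 20 23 23 18 23 40 17];

% Distinct observed values (sorted) and, for each observation,
% the index of its value in vals
[vals, ~, idx] = unique(AgeDemo(:));

counts  = accumarray(idx, 1);               % absolute frequency of each value
percent = 100 * counts / numel(AgeDemo);    % relative frequency in percent

TDemo = [vals counts percent];              % same three-column layout as T
disp(TDemo)
```

Only values that actually occur appear in TDemo, so no rows have to be deleted afterwards.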
We can also export our table into a txt, xls, ... file.
writetable(Freq_Table,'Freq_Table_Age.txt','Delimiter',' ');
To visualize our data we can plot the frequencies versus the classes. For an
ordinal or qualitative variable we usually use a pie chart or a bar graph. Note
that in a bar graph the bars do not touch each other. A bar graph is also used
to visualize discrete quantitative data, i.e., data where each class is
described by a single number. For the visualization of continuous quantitative
data, i.e., data where each class is an interval, we usually draw a histogram.
The bars of a histogram do touch each other.
The usual method to form the frequency table of continuous quantitative data
is as follows.
bar(T(:,2))
But in this case, the x-axis contains unwanted values and does not cover the
whole range of our variable classes. To remedy this we can specify the
locations of the bars along the x-axis as follows
bar(17:1:40,T(:,2))
Which is equivalent to
bar(T(:,1),T(:,2));
We can also draw a histogram for our data. For instance, we will cover the
range between the min and max values of our observations by intervals of the
same length, say, [17,22), [22,27), .... Here is how we do it in Matlab
histage=histogram(Age,[17:5:45]);
pie(Age);
Well! This looks awful. Let us just do the pie chart of the first 5 students and
label them
pie(Age(1:5), {'Stud1', 'Stud2', 'Stud3', 'Stud4', 'Stud5'});
Visualization of correlation
Graphs are also very useful to give an intuition of the correlation between
variables. For example, we want to know whether smoking is one of the factors
of cancer and which type of cancer is mostly caused by smoking. For this let us
download a data set from
http://lib.stat.cmu.edu/DASL/Stories/cigcancer.html
I named the data file smoke_cancer.txt and loaded it into Matlab by using the
dataset command.
smokeds=dataset('File','smoke_cancer.txt');
We can now visualize the correlation between smoking and, let us say, bladder
cancer and lung cancer
subplot(2,1,1)
scatter(smokeds.CIG,smokeds.BLAD)
title('CIG vs BLAD')
subplot(2,1,2)
scatter(smokeds.CIG,smokeds.LUNG)
title('CIG vs LUNG');
It seems that CIG and LUNG have a positive linear correlation. Let us see
whether we can draw something from bar graphs of the two variables
bar(smokeds.LUNG, 'c')
hold on
bar(smokeds.BLAD, 'r')
hold off
Measures of Central Tendency/Location
A measure of location is a typical or central value which describes well the
location of the data. We mainly have three measures of location: the mean, the
median and the mode.
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i. \]
Note that when the data is grouped in classes C_i, i = 1, ..., n, with class
values X_i and frequencies f_i, then the mean is defined by
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{n} f_i X_i. \]
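The two definitions of the mean agree when each class is a single value. The sketch below illustrates this with made-up class values X and frequencies f (both are assumptions for the example, not data from these notes).

```matlab
X = [17 18 20];   % hypothetical class values X_i
f = [2 1 3];      % hypothetical class frequencies f_i
N = sum(f);       % total number of observations

% Grouped definition: (1/N) * sum of f_i * X_i
meanGrouped = sum(f .* X) / N;

% Ungrouped definition: expand the classes back into raw observations
raw     = repelem(X, f);     % [17 17 18 20 20 20]
meanRaw = mean(raw);

fprintf('grouped: %.4f, ungrouped: %.4f\n', meanGrouped, meanRaw);
```

When the classes are intervals, X_i is typically taken as the class midpoint, and the grouped mean is then only an approximation of the mean of the raw data.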
Median or Middle is the middle value which divides the observations into two
equal parts. If the data is ungrouped and sorted in increasing order,
X_1 \le X_2 \le ... \le X_n, then the median is defined by
\[ Med = X_{(n+1)/2}, \]
if n is odd, and
\[ Med = \frac{X_{n/2} + X_{n/2+1}}{2}, \]
if n is even.
age7=[23,24,16,19,30,28,33];
age7s=sort(age7);
Medage7=age7s((length(age7)+1)/2);
Example again! Now let us look at ungrouped data with an even number of
observations. For this, take 8 MUL students
age8=[23,24,16,19,30,28,33,40];
age8s=sort(age8);
Medage8=(age8s((length(age8))/2)+age8s(length(age8)/2+1))/2;
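The odd and even cases can be combined into a single expression by averaging the two middle positions floor((n+1)/2) and ceil((n+1)/2), which coincide when n is odd. The following sketch uses the two age vectors from the examples; the helper names pick and medianOf are introduced here for illustration only.

```matlab
age7 = [23,24,16,19,30,28,33];
age8 = [23,24,16,19,30,28,33,40];

% Helper to index into the result of an expression
pick = @(s, k) s(k);

% Average of the two middle order statistics (identical if n is odd)
medianOf = @(v) (pick(sort(v), floor((numel(v) + 1) / 2)) + ...
                 pick(sort(v), ceil((numel(v) + 1) / 2))) / 2;

Med7 = medianOf(age7);   % 24
Med8 = medianOf(age8);   % 26
```

Both values agree with the built-in median function.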
Warning: The above formula/procedure for the median does not work well for
grouped data (especially when the observed values are grouped into intervals).
For grouped data, the formula/procedure for finding the median is more
complicated and it gives only an estimate of the median; we will give the
method for finding it in the appendix. Nevertheless, it is relatively simple to
find a median class, which is basically the first interval whose cumulative
frequency is at least N/2. However, we can apply the above procedure in our
example of 100 MUL students.
Class mode is the most frequently occurring class, i.e., it is the class which
has the highest count. In our example, the mode or modal class is the age with
the highest frequency (which is 9), i.e., 23. For grouped data we only have
a complicated formula/procedure, which will be given in the appendix.
Fortunately, with Matlab we do not need to worry about these formulas; the
software will do it for us (but you should read books and understand the
procedure).
Mean_age2=mean(Age);
Mean_age2 is equivalent to the second definition of the mean, i.e.,
\[ \mathrm{Mean\_age2} = \frac{1}{100} \sum_{i=17}^{40} f_i \, i, \]
where f_i is the number of students of age i.
Med_age=median(Age);
which returns 26.5.
Mode=mode(Age);
which gives us 23. This is also the modal class, as we grouped our data in a
discrete way.
Measures of variation or dispersion
The measures of dispersion given in the first lecture note are valid for
ungrouped data, but their meaning is the same for grouped data. For grouped
data we give them below. Let the sample of size N be grouped into n classes
with values X_i, frequencies f_i and mean \bar{X}, so that
N = \sum_{i=1}^{n} f_i. The variance and the standard deviation of the sample
are respectively defined by:
\[ S^2 = \frac{1}{N-1} \sum_{i=1}^{n} f_i (X_i - \bar{X})^2, \qquad S = \sqrt{S^2}. \]
Equivalently,
\[ S^2 = \frac{1}{N-1} \left[ \sum_{i=1}^{n} f_i X_i^2 - \frac{1}{N} \left( \sum_{i=1}^{n} f_i X_i \right)^2 \right], \qquad S = \sqrt{S^2}. \]
As for ungrouped data, we can also define the r-th moment and the r-th central
moment. They are respectively defined by
\[ M'_r = \frac{1}{N} \sum_{i=1}^{n} f_i X_i^r, \]
\[ M_r = \frac{1}{N} \sum_{i=1}^{n} f_i (X_i - \bar{X})^r. \]
Now these parameters can be used to define the coefficients of skewness and
kurtosis, whose definitions are exactly the same as for ungrouped data.
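The grouped formulas can be verified against the corresponding computations on the raw observations. The sketch below assumes made-up class values X and frequencies f, writes N for the sample size (the sum of the f_i), and takes skewness as M_3 / M_2^(3/2) and kurtosis as M_4 / M_2^2; it uses base Matlab functions only.

```matlab
X = [17 18 20];   % hypothetical class values X_i
f = [2 1 3];      % hypothetical class frequencies f_i
N = sum(f);       % sample size
Xbar = sum(f .* X) / N;

% Variance: definition and computational form
S2_def  = sum(f .* (X - Xbar).^2) / (N - 1);
S2_comp = (sum(f .* X.^2) - sum(f .* X)^2 / N) / (N - 1);
S = sqrt(S2_def);

% r-th central moment for grouped data, as a function of r
M = @(r) sum(f .* (X - Xbar).^r) / N;
SkewGrouped = M(3) / M(2)^(3/2);
KurtGrouped = M(4) / M(2)^2;

% Same quantities computed from the expanded raw observations
raw     = repelem(X, f);
SkewRaw = mean((raw - mean(raw)).^3) / mean((raw - mean(raw)).^2)^(3/2);
KurtRaw = mean((raw - mean(raw)).^4) / mean((raw - mean(raw)).^2)^2;
```

Here S2_def, S2_comp and var(raw) coincide up to rounding, and the grouped skewness and kurtosis match the values computed from the raw data.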
Kurtf=mean((Age-Mean_age2).^4)/(mean((Age-Mean_age2).^2))^2;
Skewf=mean((Age-Mean_age2).^3)/(mean((Age-Mean_age2).^2))^(3/2);
We compare them with values returned by the Matlab functions kurtosis and
skewness
Kurtf-kurtosis(Age);
Skewf-skewness(Age);
Interquartile range. The k-th percentile is the value of the observed variable
which has a cumulative frequency equal to k/100.
The first quartile, the second quartile and the third quartile correspond to the
values with cumulative frequencies 25%, 50% and 75%, respectively.
The interquartile range is the difference between the third quartile and the
first quartile. It is a range within which the middle half of the data lie.
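As an illustration, the quartiles of the age example can be read off the cumulative counts of the frequency table given earlier. The sketch assumes one particular convention, namely that the k-th percentile is the first value whose cumulative frequency reaches k/100; Matlab's built-in quantile and prctile functions interpolate and may return slightly different values.

```matlab
% Ages and counts copied from the frequency table of the 100 MUL students
ages   = 17:40;
counts = [5 4 3 7 2 5 9 5 7 3 3 2 1 1 3 4 6 4 4 5 2 6 5 4];

N      = sum(counts);      % 100 observations
cumCnt = cumsum(counts);   % cumulative (absolute) frequencies

% First age whose cumulative count reaches 25%, 50% and 75% of N
Q1 = ages(find(cumCnt >= 0.25 * N, 1));   % 22
Q2 = ages(find(cumCnt >= 0.50 * N, 1));   % 26
Q3 = ages(find(cumCnt >= 0.75 * N, 1));   % 35

iqRange = Q3 - Q1;                        % 13
```

Under this convention the middle half of the students is between 22 and 35 years old.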
The gap between classes is the difference between the lower limit of one class
and the upper limit of the previous class. For example, assume that our classes
are the intervals (a_i, b_i), i = 1, ..., n. The gap is
\[ gap = a_{i+1} - b_i. \]
The lower and upper class boundaries are then
\[ \bar{a}_i = a_i - gap/2, \qquad \bar{b}_i = b_i + gap/2. \]
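A short sketch of the class-boundary computation, taking the gap as the lower limit of a class minus the upper limit of the previous class and assuming it is the same between all consecutive classes. The class limits aLim and bLim are made-up values for illustration.

```matlab
aLim = [0 10 20 30];   % hypothetical lower class limits a_i
bLim = [9 19 29 39];   % hypothetical upper class limits b_i

% Gap between consecutive classes (assumed constant)
gap = aLim(2) - bLim(1);   % 1

% Class boundaries: widen every class by half a gap on each side
aBar = aLim - gap / 2;     % [-0.5  9.5 19.5 29.5]
bBar = bLim + gap / 2;     % [ 9.5 19.5 29.5 39.5]
```

After this adjustment the upper boundary of each class equals the lower boundary of the next one, so the classes cover the range without holes.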
Now we are ready to estimate the median, quartiles and interquartile range
of grouped data. Follow the steps below to calculate the median:
1. Form the cumulative frequency table and insert in it the ranges of class
boundaries.
2. Call N the total frequency, which is also the total number of observations
or individuals in the sample.
3. Locate the median class, i.e., find the class which contains the N/2-th
individual. Call it C_m = (a_m, b_m), and call
\bar{C}_m = (\bar{a}_m, \bar{b}_m) the interval between its lower and upper
class boundaries.
4. Apply the following formula to find the median:
\[ Median = \bar{a}_m + \frac{N/2 - F_b}{f_m} (\bar{b}_m - \bar{a}_m), \]
where F_b is the cumulative frequency of the class before the median class and
f_m is the frequency of the median class.
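A similar argument can be used to compute the first quartile (N/4) and the
third quartile (3N/4). Let \alpha \in \{1, 3\} and let (\bar{a}_Q, \bar{b}_Q)
be the class boundaries of the class containing the \alpha N/4-th individual.
Then
\[ Q_\alpha = \bar{a}_Q + \frac{\alpha N/4 - F_b}{f_Q} (\bar{b}_Q - \bar{a}_Q), \]
where f_Q is the frequency of that class and F_b is the cumulative frequency of
the class before it.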
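The interpolation formulas for the median and the quartiles can be sketched as follows. The class limits and frequencies are made-up values (not the exercise data), the gap is taken as the lower limit of a class minus the upper limit of the previous class, and Fb denotes the cumulative frequency of the class before the located class.

```matlab
aLim = [0 10 20 30];   % hypothetical lower class limits
bLim = [9 19 29 39];   % hypothetical upper class limits
f    = [5 15 20 10];   % hypothetical class frequencies

gap = aLim(2) - bLim(1);
lb  = aLim - gap / 2;  % lower class boundaries
ub  = bLim + gap / 2;  % upper class boundaries
N   = sum(f);          % total frequency
CF  = cumsum(f);       % cumulative frequencies

% Median: first class whose cumulative frequency reaches N/2
m  = find(CF >= N / 2, 1);
Fb = CF(m) - f(m);     % cumulative frequency before the median class
Median = lb(m) + (N / 2 - Fb) / f(m) * (ub(m) - lb(m));   % 22

% First and third quartile: same interpolation at N/4 and 3N/4
Q = zeros(1, 3);
for alpha = [1 3]
    q  = find(CF >= alpha * N / 4, 1);
    Fq = CF(q) - f(q);
    Q(alpha) = lb(q) + (alpha * N / 4 - Fq) / f(q) * (ub(q) - lb(q));
end
iqRange = Q(3) - Q(1); % 28.25 - 14.5 = 13.75
```

The formula assumes that the observations are spread evenly inside the located class, which is why the result is only an estimate.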
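The mode of grouped data is estimated from the modal class, with class
boundaries (\bar{a}_{mo}, \bar{b}_{mo}) and frequency f_{mo}, by
\[ Mode = \bar{a}_{mo} + \frac{f_{mo} - f_b}{2 f_{mo} - (f_a + f_b)} (\bar{b}_{mo} - \bar{a}_{mo}), \]
where
f_b and f_a are respectively the frequencies of the classes before and after
the modal class.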
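A minimal sketch of the mode estimate for grouped data, using the standard interpolation in which the numerator is the frequency of the modal class minus the frequency of the class before it. The class limits and frequencies are made-up values, and a missing neighbouring class is treated as having frequency 0, which is an assumption of this sketch.

```matlab
aLim = [0 10 20 30];   % hypothetical lower class limits
bLim = [9 19 29 39];   % hypothetical upper class limits
f    = [5 15 20 10];   % hypothetical class frequencies

gap = aLim(2) - bLim(1);
lb  = aLim - gap / 2;  % lower class boundaries
ub  = bLim + gap / 2;  % upper class boundaries

% Modal class: the class with the highest frequency
[fmo, mo] = max(f);

% Frequencies of the classes before and after the modal class
% (padding with 0 handles a modal class at either end)
fpad = [0 f 0];
fb = fpad(mo);         % class before the modal class
fa = fpad(mo + 2);     % class after the modal class

Mode = lb(mo) + (fmo - fb) / (2 * fmo - (fa + fb)) * (ub(mo) - lb(mo));
```

With these numbers the modal class is the third one and the estimate is 19.5 + (5/15) * 10, roughly 22.83.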
Exercise: Find the median, the interquartile range and the mode of the
following grouped data.
Time to travel to work    Frequency
1-10                      8
11-20                     14
21-30                     12
31-40                     9
41-50                     7