DESCRIPTIVE STATISTICS
2.1 INTRODUCTION
Raw data
Example 1
Here is a list of questions asked in a large statistics class and the raw data given by one
of the students:
1.
2.
3.
4.
5.
Example 2
A frequency distribution for qualitative data lists all categories and the
number of elements that belong to each of the categories.
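FORMULA
Relative frequency of a category = $\frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}$
Percentage = (Relative frequency) × 100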
Example 3
A sample of UUM staff-owned vehicles produced by Proton was identified and the
make of each noted. The resulting sample follows (W = Wira, Is = Iswara, Wj =
Waja, St = Satria, P = Perdana, Sv = Savvy):
Construct a frequency distribution table for these data with their relative frequency
and percentage.
W    Is   Wj   Wj   St   W    W    Is   Sv   W
P    W    Wj   W    W    Is   Wj   Sv   Is   W
Is   Is   W    P    W    P    W    W    Sv   St
Is   W    W    Wj   St   W    Is   Wj   Wj   P
St   W    St   W    Wj   Wj   Wj   W    W    Sv
Solution:
Category   Frequency   Relative Frequency   Percentage (%)
Wira       19          0.38                 38
Iswara     8           0.16                 16
Perdana    4           0.08                 8
Waja       10          0.20                 20
Satria     5           0.10                 10
Savvy      4           0.08                 8
Total      50          1.00                 100
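The tallying in Example 3 can also be checked with a short script. This is only a sketch: the function name `frequency_table` is a hypothetical helper, and the sample is typed in the order listed above.

```python
from collections import Counter

# Sample from Example 3 (W = Wira, Is = Iswara, Wj = Waja,
# St = Satria, P = Perdana, Sv = Savvy).
sample = (
    "W Is Wj Wj St W W Is Sv W "
    "P W Wj W W Is Wj Sv Is W "
    "Is Is W P W P W W Sv St "
    "Is W W Wj St W Is Wj Wj P "
    "St W St W Wj Wj Wj W W Sv"
).split()


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (f, f / total, 100 * f / total) for cat, f in counts.items()}


table = frequency_table(sample)
for cat, (f, rel, pct) in table.items():
    print(cat.ljust(3), str(f).rjust(3), f"{rel:.2f}", f"{pct:.0f}%")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on the tally.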
It has
o
o
o
o
[Figure: bar chart of Types of Vehicle against Frequency]
Example 4
Suppose we want to illustrate the information below, representing the number of
people participating in the activities offered by an outdoor pursuits centre during
June of three consecutive years.
            2004   2005   2006
Climbing    21     34     36
Caving      10     12     21
Walking     75     85     100
Sailing     36     36     40
Total       142    167    191
Solution:
[Figure: bar chart of Number of participants by Year (2004 to 2006) for Climbing, Caving, Walking and Sailing]
The bar graphs for relative frequency and percentage distributions can be
[Figure: bar chart of Number of participants (0 to 120) by Year (2004 to 2006) for Climbing, Caving, Walking and Sailing]
b) Pie Chart
The whole pie represents the total sample or population. The pie is
divided into different portions that represent the different categories.
Frequency   Relative Frequency   Angle Size
54          0.27                 360 × 0.27 = 97.2°
36          0.18                 360 × 0.18 = 64.8°
28          0.14                 360 × 0.14 = 50.4°
28          0.14                 360 × 0.14 = 50.4°
22          0.11                 360 × 0.11 = 39.6°
16          0.08                 360 × 0.08 = 28.8°
16          0.08                 360 × 0.08 = 28.8°
200         1.00                 360°
A line graph represents data that occur over a specific period of time.
Line graphs are more popular than all other graphs combined because
their visual characteristics reveal data trends clearly and these graphs
are easy to create.
When analyzing the graph, look for a trend or pattern that occurs over
the time period.
Another thing to look for is the slope, or steepness, of the line. A line
that is steep over a specific time period indicates a rapid increase or
decrease over that period.
Data collected on the same element for the same variable at different
points in time or for different periods of time are called time series data.
Line graphs compare two variables: one is plotted along the x-axis
(horizontal) and the other along the y-axis (vertical).
The y-axis in a line graph usually indicates quantity (e.g., RM, number of
sales, litres) or percentage, while the horizontal x-axis often measures
units of time. As a result, the line graph is often viewed as a time series
graph.
Example 6
A transit manager wishes to use the following data for a presentation showing
how Port Authority Transit ridership has changed over the years. Draw a time
series graph for the data and summarize the findings.
Year   Ridership (in millions)
1990   88.0
1991   85.0
1992   75.7
1993   76.6
1994   75.4
Solution:
[Figure: time series graph of ridership (in millions) against Year, 1990 to 1994]
The graph shows a decline in ridership through 1992 and then leveling off for the years
1993 and 1994.
EXERCISE 1
Chapter 2: Descriptive Statistics
a.
b.
c.
CK
CC
CK
D
C
CC
CC
C
D
CK
O
CK
C
CC
2. The frequency distribution table below shows the sales of certain products at ZeeZee
Company over a certain period. Find the relative frequency and the percentage for
each product. Then, construct a pie chart using the information obtained.
Type of Product   Frequency   Relative Frequency   Percentage   Angle Size
A                 13
B                 12
C                 5
D                 9
E                 11
3. Draw a time series graph to represent the data for the number of worldwide airline
fatalities for the given years.
Year   No. of fatalities
1990   440
1991   510
1992   990
1993   801
1994   732
1995   557
1996   1132
4. A questionnaire about how people get news resulted in the following information
from 25 respondents (N = newspaper, T = television, R = radio, M = magazine).
N   R   M   T   T
N   N   M   R   R
R   T   N   M   R
T   M   R   N   N
T   R   N   M   N
6.
Export
28
30
32
24
Import
20
28
17
14
10
Quantity (mm)
435
512
163
721
664
Perlis
Kedah
Pulau Pinang
Perak
Selangor
Wilayah Persekutuan
Kuala Lumpur
Negeri Sembilan
Melaka
Johor
Pahang
Terengganu
Kelantan
Sarawak
Sabah
1003
390
223
876
1050
1255
986
878
456
In a stem-and-leaf display of quantitative data, each value is divided into two
portions: a stem and a leaf. The leaves for each stem are then shown
separately in a display.
Example 7
25   36   14   12   13   41   9    11
38   10   12   44   5    31   13   12
28   22   23   37   18   7    6    19
Solution:
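A stem-and-leaf display for the Example 7 data can be sketched in a few lines. The sketch assumes the tens digit is the stem and the units digit is the leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

data = [25, 36, 14, 12, 13, 41, 9, 11, 38, 10, 12, 44,
        5, 31, 13, 12, 28, 22, 23, 37, 18, 7, 6, 19]


def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(display)


display = stem_and_leaf(data)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Sorting the values first gives a ranked display, in which the leaves of each stem appear in increasing order.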
A frequency distribution for quantitative data lists all the classes and the
number of values that belong to each class.
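The class boundary is given by the midpoint of the upper limit of one
class and the lower limit of the next class. It is also called the real class limit.
To find the midpoint of the upper limit of the first class and the lower limit
of the second class, we divide the sum of these two limits by 2.
e.g.: class boundary = $\frac{400 + 401}{2} = 400.5$
FORMULA
Class width = Upper boundary − Lower boundary
e.g.: Width of the first class = 600.5 − 400.5 = 200
FORMULA
Class midpoint = $\frac{\text{Lower limit} + \text{Upper limit}}{2}$
e.g.: $\frac{401 + 600}{2} = 500.5$
1. Number of classes,
$c = 1 + 3.3 \log n$
where
c is the no. of classes
n is the no. of observations in the data set.
2. Class width,
FORMULA
Class width = $\frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$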
Example 8
i) Number of classes, c
= 1 + 3.3 log 30
= 1 + 3.3(1.48)
= 5.89 ≈ 6 classes
ii) Class width
= $\frac{242 - 135}{6}$
= 17.8
≈ 18
Tally        f
|||| ||||    10
||           2
||||         5
|||| |       6
|||          3
||||         4
             Σf = 30
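The two calculations in Example 8 can be reproduced as follows. This is a sketch: the function names are hypothetical, and rounding both results up to the next whole number is an assumption that happens to agree with the rounding used in the example.

```python
import math


def number_of_classes(n):
    """c = 1 + 3.3 log10(n), rounded up to a whole number (assumed rounding)."""
    return math.ceil(1 + 3.3 * math.log10(n))


def class_width(largest, smallest, classes):
    """(largest - smallest) / classes, rounded up (assumed rounding)."""
    return math.ceil((largest - smallest) / classes)


c = number_of_classes(30)           # 30 observations, as in Example 8
width = class_width(242, 135, c)    # largest = 242, smallest = 135
print(c, width)
```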
Example 9
(Refer example 8)
Table 2.11: Relative Frequency and Percentage Distributions
Class Boundaries            Relative Frequency   %
134.5 to less than 152.5    0.3333               33.33
152.5 to less than 170.5    0.0667               6.67
170.5 to less than 188.5    0.1667               16.67
188.5 to less than 206.5    0.2000               20.00
206.5 to less than 224.5    0.1000               10.00
224.5 to less than 242.5    0.1333               13.33
Total                       1.0                  100%
Example 10
(Refer example 8)
b) Polygon
Example 11
For a very large data set, as the number of classes is increased (and the width of
classes is decreased), the frequency polygon eventually becomes a smooth
curve called a frequency distribution curve or simply a frequency curve.
c) Shape of Histogram
Same as polygon.
Symmetric histograms
Describing data using graphs helps us gain insight into the main characteristics of the
data.
When interpreting a graph, we should be very cautious. We should observe
carefully whether the frequency axis has been truncated or whether any axis has
been unnecessarily shortened or stretched.
Cumulative
Frequency
Class Boundaries
10
2
5
6
3
4
Ogive
An ogive is a curve drawn for the cumulative frequency distribution by joining
with straight lines the dots marked above the upper boundaries of classes at
heights equal to the cumulative frequencies of respective classes.
Two types of ogive:
(i) Ogive less than
(ii) Ogive more than
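Example 13
Earnings (RM)   Number of students (f)
30 – 39         5
40 – 49         6
50 – 59         6
60 – 69         3
70 – 79         3
80 – 89         7
Total           30
Earnings (RM)     Cumulative Frequency (F)
Less than 29.5    0
Less than 39.5    5
Less than 49.5    11
Less than 59.5    17
Less than 69.5    20
Less than 79.5    23
Less than 89.5    30
[Figure: "less than" ogive of Cumulative Frequency against Earnings]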
Example 14
(Ogive More Than)
Earnings (RM)   Number of students (f)
30 – 39         5
40 – 49         6
50 – 59         6
60 – 69         3
70 – 79         3
80 – 89         7
Total           30
Earnings (RM)     Cumulative Frequency (F)
More than 29.5    30
More than 39.5    25
More than 49.5    19
More than 59.5    13
More than 69.5    10
More than 79.5    7
More than 89.5    0
[Figure: "more than" ogive of Cumulative Frequency against Earnings]
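The cumulative frequencies plotted in the two ogives can be generated from the class frequencies. The sketch below uses the earnings data from Examples 13 and 14; the variable names are illustrative only.

```python
from itertools import accumulate

# Class boundaries and frequencies for the earnings data.
boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
frequencies = [5, 6, 6, 3, 3, 7]

# "Less than" cumulative frequency at each boundary, starting from 0.
less_than = [0] + list(accumulate(frequencies))

# "More than" cumulative frequency: whatever remains above each boundary.
total = sum(frequencies)
more_than = [total - f for f in less_than]

for b, lt, mt in zip(boundaries, less_than, more_than):
    print(b, lt, mt)
```

Each ogive is then drawn by plotting these values against the boundaries and joining the points with straight lines.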
2.3.6 Box-Plot
[Figure: box plots marking the smallest value, Q1, median, Q3 and largest value]
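Mean
FORMULA
Population mean: $\mu = \frac{\sum x}{N}$
Sample mean: $\bar{x} = \frac{\sum x}{n}$
where:
$\sum x$ = the sum of all values
N = the population size
n = the sample size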
Example 15
The following data give the prices (rounded to thousand RM) of five homes sold
recently in Sekayang.
158   189   265   127   191
Solution:
$\bar{x} = \frac{\sum x}{n} = \frac{158 + 189 + 265 + 127 + 191}{5} = \frac{930}{5} = 186$
Thus, these five homes were sold for an average price of RM186 thousand, that is,
RM186 000.
The mean has the advantage that its calculation includes each value of
the data set.
Weighted Mean
FORMULA
Weighted mean: $\bar{x}_w = \frac{\sum wx}{\sum w}$
where w is a weight.
Example 16
Consider the data on electrical components purchased from a factory in the table
below:
Type    Quantity (w)   Cost/unit (x)
        1200           RM3.00
        500            RM3.40
        2500           RM2.80
        1000           RM2.90
        800            RM3.25
Total   6000
Solution:
$\bar{x}_w = \frac{\sum wx}{\sum w}$
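The weighted mean for Example 16 can be evaluated with the sketch below. It assumes that the quantities (total 6000) are the weights w and the unit costs are the values x; `weighted_mean` is a hypothetical helper.

```python
# Quantities purchased (assumed to be the weights w) and cost per unit in RM (x).
weights = [1200, 500, 2500, 1000, 800]
costs = [3.00, 3.40, 2.80, 2.90, 3.25]


def weighted_mean(values, weights):
    """Sum of w * x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


mean_cost = weighted_mean(costs, weights)
print(round(mean_cost, 2))
```

Because the largest purchase is at the lowest unit cost, the weighted mean falls below the simple average of the five unit costs.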
Median
Median is the value of the middle term in a data set that has been
ranked in increasing order.
Depth of Median = $\frac{n + 1}{2}$
Step 3: Determine the value of the Median.
Example 17
Solution:
(1)
(2)
Example 18
Find the median for the following data:
10   5   19   8   3   15
Solution:
(1) Rank the data in increasing order
3   5   8   10   15   19
(2) Depth of Median = $\frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5$
(3) Median = $\frac{8 + 10}{2} = 9$
The median gives the center of a histogram, with half of the data values
to the left of (or, less than) the median and half to the right of (or, more
than) the median.
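The ranking and depth steps can be sketched as a small function. The name `median_by_depth` is hypothetical; the data are those of Example 18.

```python
def median_by_depth(values):
    """Median of ungrouped data using depth = (n + 1) / 2."""
    ranked = sorted(values)                 # rank the data in increasing order
    depth = (len(ranked) + 1) / 2
    lower = ranked[int(depth) - 1]          # value at the whole-number depth
    if depth == int(depth):
        return lower                        # odd n: the middle value itself
    upper = ranked[int(depth)]
    return (lower + upper) / 2              # even n: average the two middle values


median = median_by_depth([10, 5, 19, 8, 3, 15])
print(median)
```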
Mode
Mode is the value that occurs with the highest frequency in a data set.
Example 19
1. What is the mode for the given data?
77   69   74   81   71   68   74   73
A major shortcoming of the mode is that a data set may have none or
may have more than one mode.
One advantage of the mode is that it can be calculated for both kinds of
data, quantitative and qualitative.
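Mean
FORMULA
Population: $\mu = \frac{\sum fx}{N}$
Sample: $\bar{x} = \frac{\sum fx}{n}$
where x is the class midpoint and f is the class frequency.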
Example 20
The following table gives the frequency distribution of the number of orders received
each day during the past 50 days at the office of a mail-order company. Calculate the
mean.
Number of orders   f
10 – 12            4
13 – 15            12
16 – 18            20
19 – 21            14
                   n = 50
Solution:
Because the data set includes only 50 days, it represents a sample. The value of
$\sum fx$ is calculated in the following table:
Number of orders   f        x    fx
10 – 12            4        11   44
13 – 15            12       14   168
16 – 18            20       17   340
19 – 21            14       20   280
                   n = 50        $\sum fx = 832$
$\bar{x} = \frac{\sum fx}{n} = \frac{832}{50} = 16.64$
Thus, this mail-order company received an average of 16.64 orders per day during
these 50 days.
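The same calculation can be sketched in code, taking each class midpoint as the representative value x. The helper name `grouped_mean` is hypothetical.

```python
# Class limits and frequencies from Example 20.
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]


def grouped_mean(classes, freqs):
    """Mean of grouped data: sum of f * x over n, with x the class midpoint."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)


mean_orders = grouped_mean(classes, freqs)
print(mean_orders)
```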
Median
Step 1: Construct the cumulative frequency distribution.
Step 2: Decide the class that contains the median.
The median class is the first class whose cumulative frequency is
at least n/2.
$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$
Where:
n = the total frequency
F = the cumulative frequency before the median class
i = the class width
$L_m$ = the lower boundary of the median class
$f_m$ = the frequency of the median class
Example 21
Time to travel to work   Frequency
1 – 10                   8
11 – 20                  14
21 – 30                  12
31 – 40                  9
41 – 50                  7
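Solution:
1st Step: Construct the cumulative frequency distribution
Time to travel to work   Frequency   Cumulative Frequency
1 – 10                   8           8
11 – 20                  14          22
21 – 30                  12          34
31 – 40                  9           43
41 – 50                  7           50
2nd Step: n/2 = 50/2 = 25, so the median class is 21 – 30, the first class with a
cumulative frequency of at least 25.
3rd Step: $\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i = 20.5 + \left(\frac{25 - 22}{12}\right) 10 = 23$
Thus, 25 persons take less than 23 minutes to travel to work and another 25
persons take more than 23 minutes to travel to work.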
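The steps for the grouped median can be sketched as below, using the travel-time table. The function name and argument layout are assumptions made for illustration, and equal class widths are assumed.

```python
from itertools import accumulate

# Lower class boundaries, frequencies and class width for the travel-time data.
lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [8, 14, 12, 9, 7]
width = 10


def grouped_median(lower_boundaries, freqs, width):
    """Median = L_m + ((n/2 - F) / f_m) * i for grouped data."""
    n = sum(freqs)
    cumulative = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency is at least n/2.
    m = next(k for k, cf in enumerate(cumulative) if cf >= n / 2)
    before = cumulative[m - 1] if m > 0 else 0   # F, cumulative frequency before it
    return lower_boundaries[m] + (n / 2 - before) / freqs[m] * width


median_time = grouped_median(lower_boundaries, freqs, width)
print(median_time)
```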
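Mode
Mode is the value that has the highest frequency in a data set.
For grouped data, class mode (or, modal class) is the class with the
highest frequency.
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
Where:
$L_{mo}$ = the lower boundary of the class mode
$\Delta_1$ = the difference between the frequency of the class mode and the frequency of the class before it
$\Delta_2$ = the difference between the frequency of the class mode and the frequency of the class after it
i = the class width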
Example 22
Class     Frequency
1 – 10    8
11 – 20   14
21 – 30   12
31 – 40   9
41 – 50   7
Solution:
Based on the table,
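One way to evaluate the grouped mode for this table is sketched below. It assumes the usual form Mode = L_mo + (Δ1 / (Δ1 + Δ2)) i, with Δ1 and Δ2 the differences between the modal class frequency and the frequencies of the classes before and after it. Treating a missing neighbouring class as frequency 0 is an assumption, and `grouped_mode` is a hypothetical helper.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode of grouped data; a missing neighbour class counts as frequency 0."""
    m = freqs.index(max(freqs))                       # modal class (first if tied)
    before = freqs[m - 1] if m > 0 else 0
    after = freqs[m + 1] if len(freqs) - 1 > m else 0
    d1 = freqs[m] - before                            # delta 1
    d2 = freqs[m] - after                             # delta 2
    return lower_boundaries[m] + d1 / (d1 + d2) * width


mode_time = grouped_mode([0.5, 10.5, 20.5, 30.5, 40.5], [8, 14, 12, 9, 7], 10)
print(mode_time)
```

Here the modal class is 11 – 20, so the lower boundary is 10.5, Δ1 = 14 − 8 and Δ2 = 14 − 12.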
For a symmetrical histogram and frequency curve with one peak, the
value of the mean, median and mode are identical and they lie at the
center of the distribution.
Mean, median, and mode for a symmetric histogram and frequency distribution curve
(2)
For a histogram and a frequency curve skewed to the right, the value of
the mean is the largest, that of the mode is the smallest, and the value
of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve
skewed to the right
(3)
For a histogram and a frequency curve skewed to the left, the value of
the mean is the smallest and that of the mode is the largest and the
value of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve
skewed to the left
The measures of central tendency such as mean, median and mode do not
reveal the whole picture of the distribution of a data set.
Two data sets with the same mean may have completely different spreads.
The variation among the values of observations for one data set may be
much larger or smaller than for the other data set.
Example 23
Solution:
Range = Largest value − Smallest value
= 267 277 − 49 651
= 217 626
Disadvantages:
o
A standard deviation value tells how closely the values of a data set are
clustered around the mean.
A lower value of the standard deviation indicates that the values of the data set are
spread over a relatively smaller range around the mean.
A larger value of the standard deviation indicates that the values of the data set are
spread over a relatively larger range around the mean (far from the mean).
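FORMULA
Population variance: $\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$
Sample variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$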
Example 24
Let x denote the total production (in units) of each company.
Company   Production
A         62
B         93
C         126
D         75
E         34
Solution:
Company   Production (x)   $x^2$
A         62               3844
B         93               8649
C         126              15876
D         75               5625
E         34               1156
          $\sum x = 390$   $\sum x^2 = 35150$
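The variance and standard deviation for the production data can be computed with the short-cut form that uses the sums of x and x². The sketch assumes the five companies are treated as a sample (divisor n − 1); the function name is hypothetical.

```python
import math

production = [62, 93, 126, 75, 34]


def sample_variance(values):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


s2 = sample_variance(production)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))
```

If the five companies were instead the whole population, the divisor would be N rather than n − 1.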
The value of the variance and the standard deviation are never
negative. Also, larger values of variance or standard deviation indicate
greater amounts of variation.
Range
FORMULA
Class      Frequency
41 – 50    1
51 – 60    3
61 – 70    7
71 – 80    13
81 – 90    10
91 – 100   6
Total      40
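FORMULA
Population variance: $\sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{N}}{N}$
Sample variance: $s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$
Standard Deviation:
Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$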
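Example 25
Find the variance and standard deviation for the following data:
No. of orders   f
10 – 12         4
13 – 15         12
16 – 18         20
19 – 21         14
Total           n = 50
Solution:
No. of orders   f        x    fx    $fx^2$
10 – 12         4        11   44    484
13 – 15         12       14   168   2352
16 – 18         20       17   340   5780
19 – 21         14       20   280   5600
Total           n = 50        832   14216
Variance, $s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} = \frac{14216 - \frac{(832)^2}{50}}{50 - 1} = 7.5820$
Standard Deviation, $s = \sqrt{7.5820} = 2.75$
Thus, the standard deviation of the number of orders received at the office of this
mail-order company during the past 50 days is 2.75.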
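The grouped-data calculation can be sketched as follows, again taking class midpoints as x and treating the 50 days as a sample. The helper name is hypothetical.

```python
import math

classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]


def grouped_sample_variance(classes, freqs):
    """(sum of f*x^2 - (sum of f*x)^2 / n) / (n - 1), with x the class midpoint."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


variance = grouped_sample_variance(classes, freqs)
std_dev = math.sqrt(variance)
print(round(variance, 4), round(std_dev, 2))
```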
To compare two or more distributions that have different units, based on their
dispersion, OR
To compare two or more distributions that have the same unit but a big difference in
their mean values.
$CV = \frac{s}{\bar{x}} \times 100\%$ (sample)
$CV = \frac{\sigma}{\mu} \times 100\%$ (population)
Example 26
Solution:
$CV_1 = \frac{20}{700} \times 100\% = 2.86\%$
$CV_2 = \frac{20}{1070} \times 100\% = 1.87\%$
The monthly salary for group 1 workers is more dispersed compared to group 2.
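The comparison in Example 26 amounts to two divisions, sketched below. The function name is hypothetical; the figures (standard deviation 20, means 700 and 1070) are those used in the solution.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation divided by the mean, times 100."""
    return std_dev / mean * 100


cv1 = coefficient_of_variation(20, 700)
cv2 = coefficient_of_variation(20, 1070)
print(round(cv1, 2), round(cv2, 2))
```

Both groups have the same standard deviation, so the group with the smaller mean has the larger relative dispersion.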
Quartiles
Quartiles are three summary measures that divide a ranked data set into
four equal parts.
Depth of Q1 = $\frac{n + 1}{4}$
Depth of Q3 = $\frac{3(n + 1)}{4}$
Example 27
The table below lists the total revenue for the 11 top tourism companies in Malaysia.
109.7   79.9   121.2   76.4   80.2   82.1
79.4    89.3   98.0    103.5  86.8
Solution:
Step 1: Arrange the data in increasing order
76.4   79.4   79.9   80.2   82.1   86.8   89.3   98.0   103.5   109.7   121.2
Step 2: Determine the depth for Q1 and Q3
Depth of Q1 = $\frac{n + 1}{4} = \frac{11 + 1}{4} = 3$
Depth of Q3 = $\frac{3(n + 1)}{4} = \frac{3(11 + 1)}{4} = 9$
Q1 = 79.9 ; Q3 = 103.5
Example 28
The table below lists the total revenue for the 12 top tourism companies in Malaysia.
109.7   79.9   74.1   98.0   103.5   86.8
121.2   76.4   80.2   82.1   79.4    89.3
Solution:
Step 1: Arrange the data in increasing order
74.1   76.4   79.4   79.9   80.2   82.1   86.8   89.3   98.0   103.5   109.7   121.2
Step 2: Determine the depth for Q1 and Q3
Depth of Q1 = $\frac{n + 1}{4} = \frac{12 + 1}{4} = 3.25$
Depth of Q3 = $\frac{3(n + 1)}{4} = \frac{3(12 + 1)}{4} = 9.75$
Q1 = 79.4 + 0.25 (79.9 − 79.4) = 79.525
Q3 = 98.0 + 0.75 (103.5 − 98.0) = 102.125
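The depth rule, including the interpolation used when the depth is not a whole number, can be sketched as follows. The function names are hypothetical, and the sketch assumes at least three observations so that both depths fall inside the ranked list.

```python
def value_at_depth(ranked, depth):
    """Value at a (possibly fractional) depth in a ranked list, counting from 1."""
    whole = int(depth)
    fraction = depth - whole
    value = ranked[whole - 1]
    if fraction:
        # Move the fractional part of the way towards the next ranked value.
        value += fraction * (ranked[whole] - ranked[whole - 1])
    return value


def quartiles(values):
    """Q1 and Q3 using depths (n + 1)/4 and 3(n + 1)/4."""
    ranked = sorted(values)
    n = len(ranked)
    return (value_at_depth(ranked, (n + 1) / 4),
            value_at_depth(ranked, 3 * (n + 1) / 4))


revenue = [109.7, 79.9, 74.1, 98.0, 103.5, 86.8,
           121.2, 76.4, 80.2, 82.1, 79.4, 89.3]
q1, q3 = quartiles(revenue)
print(round(q1, 3), round(q3, 3))
```

Note that software packages use several different quartile conventions, so library functions may give slightly different values from this depth rule.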
Interquartile Range
The difference between the third quartile and the first quartile for a data
set.
FORMULA
IQR = Q3 − Q1
Example 29
By referring to example 28, calculate the IQR.
Solution:
IQR = Q3 − Q1 = 102.125 − 79.525 = 22.6
$Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i$
$Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i$
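Example 30
Refer to example 22, find Q1 and Q3
Solution:
Class     Frequency   Cumulative Frequency
1 – 10    8           8
11 – 20   14          22
21 – 30   12          34
31 – 40   9           43
41 – 50   7           50
Class Q1: $\frac{n}{4} = \frac{50}{4} = 12.5$, so class Q1 is 11 – 20.
Therefore,
$Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i = 10.5 + \left(\frac{12.5 - 8}{14}\right) 10 = 13.7143$
Class Q3: $\frac{3n}{4} = \frac{3(50)}{4} = 37.5$, so class Q3 is 31 – 40.
$Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i = 30.5 + \left(\frac{37.5 - 34}{9}\right) 10 = 34.3889$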
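Both grouped quartiles follow the same pattern, so they can share one function. In this sketch `grouped_quartile` is a hypothetical helper taking k = 1 or 3 (k = 2 gives the median), and equal class widths are assumed.

```python
from itertools import accumulate


def grouped_quartile(lower_boundaries, freqs, width, k):
    """k-th quartile of grouped data: L + ((k*n/4 - F) / f) * i."""
    n = sum(freqs)
    target = k * n / 4
    cumulative = list(accumulate(freqs))
    # Quartile class: first class whose cumulative frequency reaches the target.
    q = next(idx for idx, cf in enumerate(cumulative) if cf >= target)
    before = cumulative[q - 1] if q > 0 else 0       # F
    return lower_boundaries[q] + (target - before) / freqs[q] * width


lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [8, 14, 12, 9, 7]
q1 = grouped_quartile(lower_boundaries, freqs, 10, 1)
q3 = grouped_quartile(lower_boundaries, freqs, 10, 3)
print(round(q1, 4), round(q3, 4), round(q3 - q1, 4))
```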
Interquartile Range
FORMULA
IQR = Q3 − Q1
Example 31
Refer to example 30, calculate the IQR.
Solution:
IQR = Q3 − Q1 = 34.3889 − 13.7143 = 20.6746
$S_k = \frac{\text{mean} - \text{mode}}{s}$ or $S_k = \frac{3(\text{mean} - \text{median})}{s}$
If $S_k = 0$, the distribution is symmetric.
Example 32
The duration of stay of cancer patients warded in Hospital Seberang Jaya was recorded in a
frequency distribution. From the record, the mean is 28 days, the median is 25 days
and the mode is 23 days. The standard deviation is 4.2 days.
Solution:
This distribution is right skewed because the mean is the largest value.
$S_k = \frac{\text{Mean} - \text{Mode}}{s} = \frac{28 - 23}{4.2} = 1.1905$
OR
$S_k = \frac{3(\text{Mean} - \text{Median})}{s} = \frac{3(28 - 25)}{4.2} = 2.1429$
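Both skewness coefficients in Example 32 can be checked with the sketch below; the function names are hypothetical.

```python
def skewness_from_mode(mean, mode, std_dev):
    """Sk = (mean - mode) / s."""
    return (mean - mode) / std_dev


def skewness_from_median(mean, median, std_dev):
    """Sk = 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev


sk_mode = skewness_from_mode(28, 23, 4.2)
sk_median = skewness_from_median(28, 25, 4.2)
print(round(sk_mode, 4), round(sk_median, 4))
```

Both values are positive, which agrees with the conclusion that the distribution is skewed to the right.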
ADDITIONAL INFORMATION
Use of Standard Deviation
1. Chebyshev's Theorem
According to Chebyshev's Theorem, for any number k greater than 1, at least
$\left(1 - \frac{1}{k^2}\right)$ of the data values lie within k standard deviations of the mean.
e.g.: for k = 2, $1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75$ @ 75%
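The bound given by Chebyshev's Theorem is easy to tabulate for several values of k. The function name below is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if not k > 1:
        raise ValueError("Chebyshev's Theorem applies only for k greater than 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

The bound holds for a distribution of any shape, which is why it is weaker than the percentages quoted for bell-shaped distributions.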
2. Empirical Rule
For a bell-shaped distribution, approximately
1. 68% of the observations lie within one standard deviation of the mean.
2. 95% of the observations lie within two standard deviations of the mean.
3. 99.7% of the observations lie within three standard deviations of the mean.
Measure of Position
1.
2.
EXERCISE 2
1. A survey research company asks 100 people how many times they have been to
the dentist in the last five years. Their grouped responses appear below.
Number of Visits   Number of Responses
0 – 4              16
5 – 9              25
10 – 14            48
15 – 19            11
2. A researcher asked 25 consumers: "How much would you pay for a television
adapter that provides Internet access?" Their grouped responses are as follows:
Amount ($)   Number of Responses
0 – 99       2
100 – 199    2
200 – 249    3
250 – 299    3
300 – 349    6
350 – 399    3
400 – 499    4
500 – 999    2
3. The following data give the pairs of shoes sold per day by a particular
shoe store in the last 20 days.
85   89   90   86   89   71   70   76   79   77
80   89   83   70   83   65   75   90   76   86
Calculate the
a. mean and interpret the value.
b. median and interpret the value.
c. mode and interpret the value.
d. standard deviation.
4. The following data show the serving time (in minutes) for 40
customers in a post office:
2.0   4.5   2.5   2.9   4.2   2.9   3.5   3.2
2.9   4.0   3.0   3.8   2.5   2.3   2.1   3.1
3.6   4.3   4.7   2.6   4.1   4.6   2.8   5.1
2.7   2.6   4.4   3.5   2.7   3.9   2.9   2.9
2.5   3.7   3.3   2.8   3.5   3.1   3.0   2.4
a. Construct a frequency distribution table with a class width of 0.5.
5. In a survey of a class of final semester students, the following data were obtained for
the number of text books owned.
Number of students
12
9
11
15
10
8
Find the average number of text books for the class. Use the weighted mean.
6. The following data represent the ages of 15 people buying lift tickets at a ski area.
15   30   25   53   26   28   17   40   38   20   16   35   60   31   21
Frequency
5
14
25
7
6
3