DESCRIPTIVE STATISTICS
2.1 INTRODUCTION
Example 1
Here is a list of questions asked in a large statistics class and the “raw data” given by one
of the students:
Example 2
FORMULA
\[ \text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \]
Example 3
A sample of UUM staff-owned vehicles produced by Proton was identified and the
make of each noted. The resulting sample follows (W = Wira, Is = Iswara, Wj =
Waja, St = Satria, P = Perdana, Sv = Savvy):
Construct a frequency distribution table for these data with their relative frequency
and percentage.
W W P Is Is P Is W St Wj
Is W W Wj Is W W Is W Wj
Wj Is Wj Sv W W W Wj St W
Wj Sv W Is P Sv Wj Wj W W
St W W W W St St P Wj Sv
Solution:
Category   Frequency   Relative Frequency   Percentage (%)
Wira       19          0.38                 38
Iswara     8           0.16                 16
Perdana    4           0.08                 8
Waja       10          0.20                 20
Satria     5           0.10                 10
Savvy      4           0.08                 8
Total      50          1.00                 100
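The counting behind this table can be sketched in Python. This is only an illustration: the function name `frequency_table` is an assumption and is not part of the course material.

```python
from collections import Counter

# Raw sample from Example 3 (W = Wira, Is = Iswara, Wj = Waja,
# St = Satria, P = Perdana, Sv = Savvy).
sample = """
W W P Is Is P Is W St Wj
Is W W Wj Is W W Is W Wj
Wj Is Wj Sv W W W Wj St W
Wj Sv W Is P Sv Wj Wj W W
St W W W W St St P Wj Sv
""".split()

def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (f, f / total, 100 * f / total) for cat, f in counts.items()}

table = frequency_table(sample)
for cat, (f, rel, pct) in table.items():
    print(f"{cat:<3} {f:>3} {rel:.2f} {pct:.0f}%")
```

Each relative frequency is the category frequency divided by the sum of all frequencies, as in the formula above.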
To construct a horizontal bar chart, mark the various categories on the vertical
axis and mark the frequencies on the horizontal axis.
[Figure: horizontal bar chart with the categories on the vertical axis and frequency (0 to 20) on the horizontal axis]
To construct a component bar chart, all categories are placed in one bar and every
bar is divided into components.
The height of each component should tally with the frequency it represents.
Example 4
Solution:
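[Figure: bar chart of the number of participants (vertical axis, 0 to 200) by year (2004, 2005, 2006), with a legend for sailing, walking, caving and climbing]
[Figure: bar chart of the number of participants (vertical axis, 0 to 120) by year (2004, 2005, 2006), with a legend for climbing, caving, walking and sailing]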
b) Pie Chart
Example 5
Movie Genres      Frequency   Relative Frequency   Angle Size
Comedy            54          0.27                 360 × 0.27 = 97.2°
Action            36          0.18                 360 × 0.18 = 64.8°
Romance           28          0.14                 360 × 0.14 = 50.4°
Drama             28          0.14                 360 × 0.14 = 50.4°
Horror            22          0.11                 360 × 0.11 = 39.6°
Foreign           16          0.08                 360 × 0.08 = 28.8°
Science Fiction   16          0.08                 360 × 0.08 = 28.8°
Total             200         1.00                 360°
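The angle column can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative only.

```python
# Frequencies of movie genres from Example 5.
genres = {
    "Comedy": 54, "Action": 36, "Romance": 28, "Drama": 28,
    "Horror": 22, "Foreign": 16, "Science Fiction": 16,
}

total = sum(genres.values())

# Angle of each pie slice = 360 degrees x relative frequency.
angles = {genre: 360 * f / total for genre, f in genres.items()}

for genre, angle in angles.items():
    print(f"{genre:<16} {genres[genre] / total:.2f} {angle:.1f}")
```

The angles add up to 360 degrees because the relative frequencies add up to 1.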
A time series graph represents data that occur over a specific period of time.
Line graphs are widely used because their visual characteristics reveal
data trends clearly and they are easy to create.
When analyzing the graph, look for a trend or pattern that occurs over
the time period.
For example, check whether the line is ascending (indicating an increase over time) or
descending (indicating a decrease over time).
Another thing to look for is the slope, or steepness, of the line. A line
that is steep over a specific time period indicates a rapid increase or
decrease over that period.
Two data sets can be compared on the same graph (called a
compound time series graph) if two lines are used.
Data collected on the same element for the same variable at different
points in time or for different periods of time are called time series data.
A line graph is a visual comparison of how two variables—shown on the
x- and y-axes—are related or vary with each other. It shows related
information by drawing a continuous line between all the points on a
grid.
Line graphs compare two variables: one is plotted along the x-axis
(horizontal) and the other along the y-axis (vertical).
The y-axis in a line graph usually indicates quantity (e.g., RM, number
of sales, litres) or percentage, while the horizontal x-axis often measures
units of time. As a result, the line graph is often viewed as a time series
graph.
Example 6
A transit manager wishes to use the following data for a presentation showing
how Port Authority Transit ridership has changed over the years. Draw a time
series graph for the data and summarize the findings.
Year   Ridership (in millions)
1990   88.0
1991   85.0
1992   75.7
1993   76.6
1994   75.4
Solution:
[Figure: time series graph of ridership (in millions, vertical axis from 75 to 89) against year (1990 to 1994)]
The graph shows a decline in ridership through 1992 and then leveling off for the years
1993 and 1994.
EXERCISE 1
C CK CK C CC D O C
CK CC D CC C CK CK CC
2. The frequency distribution table shows the sales of certain products at ZeeZee
Company. Each product is listed with the frequency of its sales in a certain
period. Find the relative frequency and the percentage for each product. Then,
construct a pie chart using the information obtained.
4. A questionnaire about how people get news resulted in the following information
from 25 respondents (N = newspaper, T = television, R = radio, M = magazine).
N N R T T
R N T M R
M M N R N
T R M N M
T R R N N
5. The given information shows the export and import trade, in million RM, for four
months of sales in a certain year. Using the information provided, present the data
in a component bar graph.
In a stem-and-leaf display of quantitative data, each value is divided into two
portions: a stem and a leaf. The leaves for each stem are then shown
separately in a display.
The display shows the pattern of the data.
It also shows which values are repeated most often.
Example 7
25 12 9 10 5 12 23 7
36 13 11 12 31 28 37 6
14 41 38 44 13 22 18 19
Solution:
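One way to build the display for these data is sketched below in Python. It assumes the tens digit is used as the stem and the units digit as the leaf; the function name `stem_and_leaf` is illustrative only.

```python
from collections import defaultdict

# Data from Example 7.
data = [25, 12, 9, 10, 5, 12, 23, 7,
        36, 13, 11, 12, 31, 28, 37, 6,
        14, 41, 38, 44, 13, 22, 18, 19]

def stem_and_leaf(values):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(data)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Sorting the values first means the leaves on each stem appear in increasing order, which makes repeated values easy to spot.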
To find the midpoint of the upper limit of the first class and the lower limit
of the second class, we divide the sum of these two limits by 2.
e.g.:
\[ \text{Class boundary} = \frac{400 + 401}{2} = 400.5 \]
FORMULA
Class width = Upper boundary – Lower boundary
e.g.:
Width of the first class = 600.5 – 400.5 = 200
e.g.:
\[ \text{Midpoint of the 1st class} = \frac{401 + 600}{2} = 500.5 \]
2. Class width,
FORMULA
\[ i > \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}} \]
\[ i > \frac{\text{Range}}{c} \]
Example 8
The following data give the total home runs hit by all players of each of the 30 Major
League Baseball teams during the 2004 season.
FORMULA
\[ \text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Sum of all frequencies}} = \frac{f}{\sum f} \]
\[ \text{Percentage} = (\text{Relative frequency}) \times 100 \]
Example 9
(Refer to Example 8)
Example 10
(Refer to Example 8)
[Figure: frequency histogram for Table 2.9; horizontal axis: total home runs, with class boundaries 134.5, 152.5, 170.5, 188.5, 206.5, 224.5, 242.5; vertical axis: frequency]
b) Polygon
Example 11
[Figure: frequency polygon for Table 2.11]
For a very large data set, as the number of classes is increased (and the width of
classes is decreased), the frequency polygon eventually becomes a smooth
curve called a frequency distribution curve or simply a frequency curve.
c) Shape of Histogram
Same as polygon.
Symmetric histograms
Describing data using graphs helps us gain insight into the main characteristics of the
data.
When interpreting a graph, we should be very cautious. We should observe
carefully whether the frequency axis has been truncated or whether any axis has
been unnecessarily shortened or stretched.
Example 12
Using the frequency distribution of table 2.11,
Ogive
An ogive is a curve drawn for the cumulative frequency distribution by joining
with straight lines the dots marked above the upper boundaries of classes at
heights equal to the cumulative frequencies of respective classes.
Two types of ogive:
(i) ogive less than
(ii) ogive greater than
Example 13
(Ogive Less Than)
Earnings (RM)   Number of students (f)   Earnings (RM)   Cumulative Frequency (F)
[Figure: "less than" ogive; horizontal axis: earnings, with class boundaries 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5; vertical axis: cumulative frequency, 0 to 35]
SQQS1013 Elementary Statistics, Chapter 2: Descriptive Statistics
Example 14
(Ogive More Than)
[Figure: "more than" ogive; horizontal axis: earnings, with class boundaries 29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5; vertical axis: cumulative frequency, 0 to 35]
2.3.6 Box-Plot
[Figure: box plot marking the smallest value, Q1, the median, Q3 and the largest value]
Example 15
The following data give the prices (rounded to thousand RM) of five homes sold
recently in Sekayang.
Solution:
Thus, these five homes were sold for an average price of RM186 thousand, that is,
RM186,000.
The mean has the advantage that its calculation includes each value of
the data set.
Weighted Mean
Example 16
Consider the data on electrical components purchased from a factory in the table
below:
Solution:
\[ \bar{x}_w = \frac{\sum wx}{\sum w} = \frac{1200(3) + 500(3.4) + 2500(2.8) + 1000(2.9) + 800(3.25)}{1200 + 500 + 2500 + 1000 + 800} = \frac{17800}{6000} = 2.967 \]
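The same calculation can be checked with a short Python sketch. The weights and values are taken from the working above; the function name `weighted_mean` is illustrative only.

```python
# Weights (w) and values (x) as they appear in the solution of Example 16.
weights = [1200, 500, 2500, 1000, 800]
values = [3, 3.4, 2.8, 2.9, 3.25]

def weighted_mean(values, weights):
    """Weighted mean: sum of w * x divided by sum of w."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

xw = weighted_mean(values, weights)
print(round(xw, 3))  # 2.967
```

Values with larger weights pull the result towards themselves, which is why the answer sits closer to 2.8 (weight 2500) than to 3.4 (weight 500).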
Median
Median is the value of the middle term in a data set that has been
ranked in increasing order.
Procedure for finding the Median
Step 1: Rank the data set in increasing order.
Example 17
Solution:
Example 18
Find the median for the following data:
10 5 19 8 3 15
Solution:
Therefore the median lies midway between the 3rd and 4th positions of the
ranked data set (3 5 8 10 15 19).
\[ \text{Median} = \frac{8 + 10}{2} = 9 \]
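The procedure (rank the data, then take the middle term or the average of the two middle terms) can be sketched in Python. The hand-written version is for illustration; the standard library's `statistics.median` gives the same result.

```python
import statistics

# Data from Example 18.
data = [10, 5, 19, 8, 3, 15]

def median(values):
    """Median of ungrouped data: middle term of the ranked values."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                     # odd n: single middle term
    return (ranked[mid - 1] + ranked[mid]) / 2  # even n: average of two middle terms

print(median(data))             # 9.0
print(statistics.median(data))  # 9.0
```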
The median gives the center of a histogram, with half of the data values
to the left of (or, less than) the median and half to the right of (or, more
than) the median.
The advantage of using the median is that it is not influenced by outliers.
Mode
Mode is the value that occurs with the highest frequency in a data set.
Example 19
1. What is the mode for the given data?
77 69 74 81 71 68 74 73
Solution:
1. Mode = 74 (it occurs twice; every other value occurs once)
2. Mode =
A major shortcoming of the mode is that a data set may have none or
may have more than one mode.
One advantage of the mode is that it can be calculated for both kinds of
data, quantitative and qualitative.
Mean
FORMULA
Mean for population data:
\[ \mu = \frac{\sum fx}{N} \]
Mean for sample data:
\[ \bar{x} = \frac{\sum fx}{n} \]
where x is the midpoint and f is the frequency of a class.
Example 20
The following table gives the frequency distribution of the number of orders received
each day during the past 50 days at the office of a mail-order company. Calculate the
mean.
Number of orders f
10 – 12 4
13 – 15 12
16 – 18 20
19 – 21 14
n = 50
Solution:
Because the data set includes only 50 days, it represents a sample. The value of
Σfx is calculated in the following table:
Number of orders   f    x    fx
10 – 12            4    11   44
13 – 15            12   14   168
16 – 18            20   17   340
19 – 21            14   20   280
                   n = 50    Σfx = 832
\[ \bar{x} = \frac{\sum fx}{n} = \frac{832}{50} = 16.64 \]
Thus, this mail-order company received an average of 16.64 orders per day during
these 50 days.
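A minimal Python sketch of the grouped-data mean, using the table from Example 20. The function name `grouped_mean` is illustrative only.

```python
# Frequency distribution from Example 20: (lower limit, upper limit, frequency).
orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

def grouped_mean(classes):
    """Mean of grouped data: sum of f * x over sum of f, with x the class midpoint."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f

mean_orders = grouped_mean(orders)
print(round(mean_orders, 2))  # 16.64
```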
Median
Step 1: Construct the cumulative frequency distribution.
Step 2: Decide which class contains the median.
The median class (class median) is the first class whose cumulative frequency is
at least n/2.
Step 3: Find the median by using the following formula:
FORMULA
\[ \text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i \]
Where:
n = the total frequency
F = the total frequency before the median class
i = the class width
L_m = the lower boundary of the median class
f_m = the frequency of the median class
Example 21
Based on the grouped data below, find the median:
Solution:
Thus, 25 persons take less than 23 minutes to travel to work and another 25
persons take more than 23 minutes to travel to work.
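The three steps for the grouped median can be sketched in Python. Because the table for Example 21 is not reproduced here, the sketch reuses the order data from Example 20 purely as an illustration. It assumes integer class limits, so each boundary is taken as the limit minus or plus 0.5; the function name `grouped_median` is hypothetical.

```python
# Frequency distribution from Example 20: (lower limit, upper limit, frequency).
orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

def grouped_median(classes):
    """Median = L_m + ((n/2 - F) / f_m) * i for classes in increasing order."""
    n = sum(f for _, _, f in classes)
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, upper, f in classes:
        # The median class is the first whose cumulative frequency reaches n/2.
        if cumulative + f >= n / 2:
            lower_boundary = lower - 0.5               # assumes integer class limits
            width = (upper + 0.5) - lower_boundary     # i: class width
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency distribution is empty")

median_orders = grouped_median(orders)
print(round(median_orders, 2))  # 16.85
```

Here n/2 = 25, the median class is 16 – 18 (cumulative frequency 36), and the formula gives 15.5 + (25 − 16)/20 × 3 = 16.85.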
Mode
Mode is the value that has the highest frequency in a data set.
For grouped data, class mode (or, modal class) is the class with the
highest frequency.
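Formula of mode for grouped data:
FORMULA
\[ \text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i \]
Where:
L_mo is the lower boundary of the class mode
Δ1 is the difference between the frequency of the class mode and that of the class before it
Δ2 is the difference between the frequency of the class mode and that of the class after it
i is the class width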
Example 22
Based on the grouped data below, find the mode.
Solution:
(1) For a symmetrical histogram and frequency curve with one peak, the
value of the mean, median and mode are identical and they lie at the
center of the distribution.
Mean, median, and mode for a symmetric histogram and frequency distribution curve
(2) For a histogram and a frequency curve skewed to the right, the value of
the mean is the largest, that of the mode is the smallest, and the value
of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve
skewed to the right
(3) For a histogram and a frequency curve skewed to the left, the value of
the mean is the smallest and that of the mode is the largest and the
value of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve
skewed to the left
The measures of central tendency such as mean, median and mode do not
reveal the whole picture of the distribution of a data set.
Two data sets with the same mean may have completely different spreads.
The variation among the values of observations for one data set may be
much larger or smaller than for the other data set.
Range
FORMULA
Range = Largest value – Smallest value
Example 23
Solution:
Disadvantages:
o It is influenced by outliers.
o It is based on two values only; all other values in the data set are ignored.
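Variance for population:
\[ \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \]
Variance for sample:
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]
FORMULA
Standard Deviation for population:
\[ \sigma = \sqrt{\sigma^2} \]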
Example 24
Let x denote the total production (in units) of each company:
Company Production
A 62
B 93
C 126
D 75
E 34
Find the variance and standard deviation.
Solution:
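The shortcut formulas can be applied to these five values as sketched below in Python. The example does not say whether the five companies form a sample or a population, so the sketch computes both; which one applies is an assumption left to the reader.

```python
import math

# Production figures from Example 24.
production = {"A": 62, "B": 93, "C": 126, "D": 75, "E": 34}

x = list(production.values())
n = len(x)
sum_x = sum(x)                    # sum of x
sum_x2 = sum(v * v for v in x)    # sum of x squared

# Shared numerator of both formulas: sum(x^2) - (sum x)^2 / n.
numerator = sum_x2 - sum_x ** 2 / n

sample_variance = numerator / (n - 1)   # divide by n - 1 for a sample
population_variance = numerator / n     # divide by N for a population

print(sample_variance, math.sqrt(sample_variance))
print(population_variance, math.sqrt(population_variance))
```

The only difference between the two results is the divisor, so the sample variance is always the larger of the two.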
Range
FORMULA
Range = Upper bound of last class – Lower bound of first class
Class Frequency
41 – 50 1
51 – 60 3
61 – 70 7
71 – 80 13
81 – 90 10
91 - 100 6
Total 40
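Variance for population:
\[ \sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{N}}{N} \]
Variance for sample:
\[ s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} \]
FORMULA
Standard Deviation:
Population: \( \sigma = \sqrt{\sigma^2} \)
Sample: \( s = \sqrt{s^2} \)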
Example 25
Find the variance and standard deviation for the following data:
No. of order f
10 – 12 4
13 – 15 12
16 – 18 20
19 – 21 14
Total n = 50
Solution:
Variance,
Standard Deviation,
Thus, the standard deviation of the number of orders received at the office of this mail-
order company during the past 50 days is 2.75.
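The working for this example can be sketched in Python using the sample formula for grouped data, with x taken as the class midpoint. The function name `grouped_sample_variance` is illustrative only.

```python
import math

# Frequency distribution from Example 25: (lower limit, upper limit, frequency).
orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

def grouped_sample_variance(classes):
    """Sample variance of grouped data using class midpoints as x."""
    n = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

variance = grouped_sample_variance(orders)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))  # 7.58 2.75
```

Here Σfx = 832 and Σfx² = 14216, so the variance is (14216 − 832²/50)/49 ≈ 7.58.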
Used to compare the dispersion of two or more distributions that have different
units, OR
to compare two or more distributions that have the same unit but a big difference in
their means.
Also called the modified coefficient or the coefficient of variation, CV.
FORMULA
\[ CV = \frac{s}{\bar{x}} \times 100\% \quad \text{(sample)} \]
\[ CV = \frac{\sigma}{\mu} \times 100\% \quad \text{(population)} \]
Example 26
The mean and standard deviation of the monthly salary for two groups of workers
at ABC Company are as follows. Group 1: 700 and 20; Group 2: 1070 and 20. Find the
CV for each group and determine which group is more dispersed.
Solution:
\[ CV_1 = \frac{20}{700} \times 100\% = 2.86\% \]
\[ CV_2 = \frac{20}{1070} \times 100\% = 1.87\% \]
The monthly salary of the group 1 workers is more dispersed than that of group 2.
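The comparison can be sketched in Python. The function name `coefficient_of_variation` is illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation divided by mean, times 100."""
    return std_dev / mean * 100

# Mean and standard deviation of monthly salary from Example 26.
cv1 = coefficient_of_variation(20, 700)
cv2 = coefficient_of_variation(20, 1070)

more_dispersed = "Group 1" if cv1 > cv2 else "Group 2"
print(round(cv1, 2), round(cv2, 2), more_dispersed)  # 2.86 1.87 Group 1
```

Both groups have the same standard deviation, so the group with the smaller mean has the larger relative dispersion.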
Quartiles
Quartiles are three summary measures that divide a ranked data set into
four equal parts.
109.7 79.9 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5
86.8
Solution:
Step 1: Rank the data in increasing order.
76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
Step 2: Determine the depth for Q1 and Q3
\[ \text{Depth of } Q_1 = \frac{n + 1}{4} = \frac{11 + 1}{4} = 3 \]
\[ \text{Depth of } Q_3 = \frac{3(n + 1)}{4} = \frac{3(11 + 1)}{4} = 9 \]
Step 3: Read off the 3rd and 9th values of the ranked data.
Q1 = 79.9 ; Q3 = 103.5
Example 28
The table below lists the total revenue for the 12 top tourism companies in Malaysia.
Solution:
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
\[ \text{Depth of } Q_1 = \frac{n + 1}{4} = \frac{12 + 1}{4} = 3.25 \]
\[ \text{Depth of } Q_3 = \frac{3(n + 1)}{4} = \frac{3(12 + 1)}{4} = 9.75 \]
Q1 therefore lies between the 3rd and 4th ranked values (79.4 and 79.9), and Q3
lies between the 9th and 10th ranked values (98.0 and 103.5).
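A possible way to turn a depth into a quartile value is sketched below in Python. It assumes linear interpolation between neighbouring ranked values when the depth is fractional; the text above does not state which rule is intended, so treat the interpolated results as one common convention. The function name `value_at_depth` is hypothetical.

```python
def value_at_depth(ranked, depth):
    """Value at a 1-based, possibly fractional, depth in ranked data.

    Assumption: fractional depths are resolved by linear interpolation
    between the two neighbouring ranked values.
    """
    whole = int(depth)
    frac = depth - whole
    if frac == 0:
        return ranked[whole - 1]
    return ranked[whole - 1] + frac * (ranked[whole] - ranked[whole - 1])

# Ranked data with n = 11 (whole-number depths 3 and 9).
ranked_11 = [76.4, 79.4, 79.9, 80.2, 82.1, 86.8, 89.3, 98.0, 103.5, 109.7, 121.2]
q1_11 = value_at_depth(ranked_11, (len(ranked_11) + 1) / 4)
q3_11 = value_at_depth(ranked_11, 3 * (len(ranked_11) + 1) / 4)

# Ranked data from Example 28 with n = 12 (depths 3.25 and 9.75).
ranked_12 = [74.1] + ranked_11
q1_12 = value_at_depth(ranked_12, (len(ranked_12) + 1) / 4)
q3_12 = value_at_depth(ranked_12, 3 * (len(ranked_12) + 1) / 4)

print(q1_11, q3_11)
print(round(q1_12, 3), round(q3_12, 3))
```

With whole-number depths the function simply returns the ranked value at that position, so it reproduces Q1 = 79.9 and Q3 = 103.5 for the 11-value data set.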
Interquartile Range
The difference between the third quartile and the first quartile for a data
set.
FORMULA
IQR = Q3 – Q1
Example 29
By referring to example 28, calculate the IQR.
Solution:
\[ Q_3 = L_{Q_3} + \left( \frac{\frac{3n}{4} - F}{f_{Q_3}} \right) i \]
Example 30
Refer to Example 22; find Q1 and Q3.
Solution:
\[ \text{Class } Q_1 = \frac{n}{4} = \frac{50}{4} = 12.5 \]
Therefore,
\[ Q_1 = L_{Q_1} + \left( \frac{\frac{n}{4} - F}{f_{Q_1}} \right) i = 10.5 + \left( \frac{12.5 - 8}{14} \right) 10 = 13.7143 \]
\[ \text{Class } Q_3 = \frac{3n}{4} = \frac{3(50)}{4} = 37.5 \]
Therefore,
\[ Q_3 = L_{Q_3} + \left( \frac{\frac{3n}{4} - F}{f_{Q_3}} \right) i = 30.5 + \left( \frac{37.5 - 34}{9} \right) 10 = 34.3889 \]
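The two substitutions above follow the same pattern, which can be sketched as one Python function. The inputs are the numbers used in the working of Example 30; the function name `grouped_quartile` and its parameter names are illustrative only.

```python
def grouped_quartile(position, lower_boundary, cum_before, class_freq, width):
    """Quartile of grouped data: L + ((position - F) / f) * i.

    position is n/4 for Q1 and 3n/4 for Q3.
    """
    return lower_boundary + (position - cum_before) / class_freq * width

n = 50
q1 = grouped_quartile(n / 4, lower_boundary=10.5, cum_before=8, class_freq=14, width=10)
q3 = grouped_quartile(3 * n / 4, lower_boundary=30.5, cum_before=34, class_freq=9, width=10)
iqr = q3 - q1

print(round(q1, 4), round(q3, 4), round(iqr, 4))
```

The interquartile range then follows directly as the difference between the two results.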
Interquartile Range
FORMULA
IQR = Q3 – Q1
Example 31
Refer to example 30, calculate the IQR.
Solution:
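Pearson's coefficient of skewness:
\[ S_k = \frac{\text{Mean} - \text{Mode}}{s} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{s} \]
If Sk is positive: right skewed
If Sk is negative: left skewed
If Sk = 0: symmetric
If Sk takes a value in (-0.9999, -0.0001) or (0.0001, 0.9999): approximately
symmetric.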
Example 32
The duration of stay of cancer patients warded in Hospital Seberang Jaya was recorded in a
frequency distribution. From the record, the mean is 28 days, the median is 25 days
and the mode is 23 days. The standard deviation is 4.2 days.
a. What is the type of distribution?
b. Find the skewness coefficient.
Solution:
This distribution is right skewed because the mean is the largest value.
\[ S_k = \frac{\text{Mean} - \text{Mode}}{s} = \frac{28 - 23}{4.2} = 1.1905 \]
OR
\[ S_k = \frac{3(\text{Mean} - \text{Median})}{s} = \frac{3(28 - 25)}{4.2} = 2.1429 \]
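Both versions of the coefficient can be checked with a short Python sketch. The function names are illustrative only.

```python
def skewness_from_mode(mean, mode, std_dev):
    """Skewness coefficient using the mode: (mean - mode) / s."""
    return (mean - mode) / std_dev

def skewness_from_median(mean, median, std_dev):
    """Skewness coefficient using the median: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev

# Summary measures from Example 32 (in days).
sk_mode = skewness_from_mode(28, 23, 4.2)
sk_median = skewness_from_median(28, 25, 4.2)

print(round(sk_mode, 4), round(sk_median, 4))  # 1.1905 2.1429
```

The two formulas give different values, but both are positive, which agrees with the right-skewed shape.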
ADDITIONAL INFORMATION
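1. Chebyshev’s Theorem
For any number k greater than 1, at least
\[ 1 - \frac{1}{k^2} \]
of the data values lie within k standard deviations of the mean. For k = 2:
\[ 1 - \frac{1}{2^2} = 0.75, \text{ that is, } 75\% \]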
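The bound is easy to tabulate for other values of k, as in this Python sketch. The function name `chebyshev_bound` is illustrative only.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 4))
```

For k = 2 the function returns 0.75, matching the 75% worked out above.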
2. Empirical Rule
1. 68% of the observations lie within one standard deviation of the mean.
2. 95% of the observations lie within two standard deviations of the mean.
3. 99.7% of the observations lie within three standard deviations of the mean.
Measure of Position
EXERCISE 2
1. A survey research company asks 100 people how many times they have been to
the dentist in the last five years. Their grouped responses appear below.
2. A researcher asked 25 consumers: “How much would you pay for a television
adapter that provides Internet access?” Their grouped responses are as follows:
Amount ($) Number of Responses
0 – 99 2
100 – 199 2
200 – 249 3
250 – 299 3
300 – 349 6
350 – 399 3
400 – 499 4
500 – 999 2
Calculate the mean, variance, and standard deviation.
3. The following data give the pairs of shoes sold per day by a particular
shoe store in the last 20 days.
85 90 89 70 79 80 83 83 75 76
89 86 71 76 77 89 70 65 90 86
Calculate the
a. mean and interpret the value.
b. median and interpret the value.
c. mode and interpret the value.
d. standard deviation.
4. The following data show the serving time (in minutes) for 40
customers in a post office:
5. In a survey of a class of final-semester students, data were obtained on
the number of textbooks owned.
Find the average number of textbooks for the class. Use the weighted mean.
6. The following data represent the ages of 15 people buying lift tickets at a ski area.
15 25 26 17 38 16 60 21
30 53 28 40 20 35 31
7. A student scores 60 on a mathematics test that has a mean of 54 and a standard
deviation of 3, and she scores 80 on a history test with a mean of 75 and a
standard deviation of 2. On which test did she perform better?
8. The following table gives the distribution of the share price of ABC Company, which
was listed on the BSKL in 2005.