Chapter 2

Organizing Data
At the end of this chapter, the students should be able to:
 Organize qualitative and quantitative data by using tables and graphs.

2.1 Definition of Raw Data

Raw data is information obtained by observing the values of a variable. Data
obtained by observing values of a qualitative variable (non-numerical values) are referred
to as qualitative data. Data obtained by observing values of a quantitative variable
(numerical values) are referred to as quantitative data.

Example 2.1:
The following is an example of qualitative data on the models of national cars used by
25 UTHM students.

Wira Gen-2 Gen-2 Perdana Iswara
Gen-2 Wira Kembara Wira Wira
Perdana Wira Gen-2 Gen-2 Perdana
Iswara Wira Iswara Wira Iswara
Kembara Kembara Gen-2 Iswara Wira

Example 2.2:
The following is an example of quantitative data on the monthly petrol expenses
(in Ringgit Malaysia) of 25 UTHM students.

256 219 125 169 143
185 135 202 157 161
190 155 245 210 210
350 107 176 124 200
150 172 209 165 150

2.2 Organizing and Displaying Qualitative Data

Qualitative data can be organized by using frequency distribution tables. They can
also be displayed in graphs such as bar charts or pie charts.

2.2.1 Frequency Distribution For Qualitative Data

A frequency distribution for qualitative data lists all categories and the number of
elements that belong to each of the categories.

Example 2.3:
Using the data in Example 2.1, construct the frequency distribution table.

Solution:
The model of national car is the variable in this example. This (qualitative) variable is
classified into five categories: Wira, Gen-2, Perdana, Iswara, and Kembara.

Step 1: We record these categories in the first column of Table 2.1.

Step 2: We note each student’s response from the given raw data and mark a tally,
denoted by the symbol “│”, in the second column beside the category that it falls in.

Step 3: We record the total tallies (frequency) for each category in the third column. The
sum of this column gives the total frequency, which is the sample size.

Table 2.1: Frequency distribution of model of national cars.

Model     Tally        Number of students (f)
Wira      │││││ │││    8
Gen-2     │││││ │      6
Perdana   │││          3
Iswara    │││││        5
Kembara   │││          3
Total                  25
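
The tallying steps can also be sketched in a few lines of Python. This is only an illustration, assuming the raw data of Example 2.1 is typed in as a list; the names `models` and `frequency` are chosen here and do not come from the text.

```python
from collections import Counter

# Raw data from Example 2.1 (25 responses).
models = [
    "Wira", "Gen-2", "Gen-2", "Perdana", "Iswara",
    "Gen-2", "Wira", "Kembara", "Wira", "Wira",
    "Perdana", "Wira", "Gen-2", "Gen-2", "Perdana",
    "Iswara", "Wira", "Iswara", "Wira", "Iswara",
    "Kembara", "Kembara", "Gen-2", "Iswara", "Wira",
]

# Counter tallies every category, which mirrors Steps 1 to 3.
frequency = Counter(models)

for model, f in frequency.items():
    print(f"{model:<8} {f}")
print(f"{'Total':<8} {sum(frequency.values())}")
```

The total printed on the last line should equal the sample size, which is a quick check that no response was missed.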

2.2.2 Relative Frequency and Percentage Distribution

The relative frequency of a category is obtained by dividing the frequency of that
category by the total frequency:

\[ \text{Relative frequency, } R_f = \frac{\text{frequency of that category}}{\text{total frequency}} = \frac{f}{\sum f} \]

A percentage distribution can be presented as:

\[ \text{Percentage} = (\text{relative frequency}) \times 100 = R_f \times 100 \]

Example 2.4:
Construct the relative frequency and percentage distribution for the data in Table 2.1
(in Example 2.3).

Solution:
The relative frequency and percentage are calculated using the formulae above.

Table 2.2: Relative frequency and percentage distribution of the model of national cars.
Model Frequency (f) Relative frequency (Rf) Percentage
Wira 8 8/25 = 0.32 0.32 (100) = 32%
Gen-2 6 6/25 = 0.24 0.24 (100) = 24%
Perdana 3 3/25 = 0.12 0.12 (100) = 12%
Iswara 5 5/25 = 0.20 0.20 (100) = 20%
Kembara 3 3/25 = 0.12 0.12 (100) = 12%
Total ∑f =25 ∑Rf =1.00 ∑% = 100%
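
As a rough check of Table 2.2, the two formulae can be applied in Python. This is a minimal sketch that assumes the frequencies of Table 2.1 are entered by hand; the dictionary names are illustrative only.

```python
# Frequencies taken from Table 2.1.
frequency = {"Wira": 8, "Gen-2": 6, "Perdana": 3, "Iswara": 5, "Kembara": 3}
total = sum(frequency.values())

# Rf = f / sum(f), percentage = Rf * 100
relative = {model: f / total for model, f in frequency.items()}
percentage = {model: rf * 100 for model, rf in relative.items()}

for model in frequency:
    print(f"{model:<8} {frequency[model]:>2} {relative[model]:.2f} {percentage[model]:.0f}%")
```

The relative frequencies should add up to 1 and the percentages to 100, apart from small rounding effects.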

2.2.3 Graphing of Qualitative Data

Two types of graphs are commonly used for qualitative data: the bar graph and the pie
chart. A bar graph is a graph composed of bars whose heights are the frequencies of the
different categories. A bar graph displays graphically the same information concerning
qualitative data that a frequency distribution shows in tabular form.

Example 2.5:
Draw a bar chart to represent the data in Table 2.1.

Solution:
Figure 2.1 shows the bar chart (2D and 3D) for the data in Table 2.1.

[Figure 2.1: Bar chart of the frequency distribution for model of national cars used by 25 UTHM students. Horizontal axis: model (Wira, Gen-2, Perdana, Iswara, Kembara); vertical axis: frequency.]

A pie chart is more commonly used to show percentages, although frequencies or
relative frequencies can also be used. The pie (or circle) is divided into different
portions that represent the percentages that belong to different categories.

Angle = Relative frequency × 360° = Rf × 360°


[Figure 2.1, second panel: Frequency distribution for model of national cars used by 25 UTHM students; vertical axis: frequency.]

Example 2.6:
Construct a pie chart for the data in Table 2.1.

Solution:
First, the angle size for each category should be calculated.

Model     Relative frequency (Rf)   Angle size (°)
Wira      8/25 = 0.32               0.32 (360) = 115.2
Gen-2     6/25 = 0.24               0.24 (360) = 86.4
Perdana   3/25 = 0.12               0.12 (360) = 43.2
Iswara    5/25 = 0.20               0.20 (360) = 72.0
Kembara   3/25 = 0.12               0.12 (360) = 43.2
Total     ∑Rf = 1.00                ∑Angle = 360.0
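
The angle sizes can be reproduced with a short Python sketch. It assumes the frequencies of Table 2.1 and applies Angle = Rf × 360; the variable names are illustrative.

```python
# Frequencies taken from Table 2.1.
frequency = {"Wira": 8, "Gen-2": 6, "Perdana": 3, "Iswara": 5, "Kembara": 3}
total = sum(frequency.values())

# Angle = relative frequency * 360 degrees
angle = {model: f / total * 360 for model, f in frequency.items()}

for model, a in angle.items():
    print(f"{model:<8} {a:6.1f}")
print(f"{'Total':<8} {sum(angle.values()):6.1f}")
```

The angles should add up to 360 degrees, which confirms that the slices fill the whole circle.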

Figure 2.2 shows the pie chart (2D and 3D) for the data in Table 2.2.

[Figure 2.2: Pie chart for frequency distribution for model of national cars used by 25 UTHM students.]

Exercise 2.1:
1. The following list gives the academic ranks for the 25 male faculty members at a
mechanical faculty:

instructor Assistant prof Assistant prof instructor Associate prof


Assistant prof Associate prof Assistant prof Full professor Associate prof
instructor Assistant prof Full professor Associate prof Assistant prof
instructor Assistant prof Assistant prof Associate prof Assistant prof
Full professor Assistant prof Assistant prof Assistant prof Associate prof

a. Give a frequency distribution for these data.


b. Give the relative frequencies and percentages for these data.
c. Draw a bar chart and a pie chart for the relative frequency distribution.

2. The pie chart below shows the percentage of foreign professional workers in
Malaysia from January to June 2013. Given that the number of foreign professional
workers in Malaysia during that period is 12 705, construct a frequency
distribution and draw a bar chart for the frequency distribution.

[Pie chart: Foreign professional workers in Malaysia. Asean 27%, India 20%, Japan 12%, China 8%, U.K. 7%, U.S. 3%, Others 23%.]

2.3 Organizing and Displaying Quantitative Data

In Section 2.2, the organization of qualitative data was introduced. In this section, we
will learn how to organize and display quantitative data.

2.3.1 Frequency Distribution

A frequency distribution for quantitative data lists all intervals (classes) and the number
of observations that belong to each interval. It shows how the frequencies are distributed
over these intervals.
Table 2.3 gives the frequency distribution for the cholesterol values of 45 patients in a
cardiac rehabilitation study. Give the lower and upper class limits and boundaries as well as
the class marks for each class.

Table 2.3: Frequency distribution of cholesterol value of 45 patients.

Cholesterol value Frequency (f)


170 to 189 3
190 to 209 10
210 to 229 17
230 to 249 13
250 to 269 2

Observe the table above. There are 5 classes (intervals), each with a class width of 20. The
values in the first column are class limits: the ending point (upper limit) of a class is not
the same as the starting point (lower limit) of the next class. In other words, there is a gap
between two consecutive classes. Class boundaries, in contrast, are chosen so that the
ending point (upper boundary) of a class is the same as the starting point (lower boundary)
of the next class. In other words, there is continuity between classes.

Tips for class boundaries:

If the data are integers, the boundaries can be obtained by adjusting the limits by ±0.5.
If the data are real numbers with one decimal place, the boundaries can be obtained by adjusting the limits by ±0.05, and so on.

Steps to construct a frequency distribution table

Step 1: Determine the number of classes (approximation only).

The number of classes generally varies from 5 to 20, depending mainly on the number of
observations in the data set. To calculate the number of classes, we can apply Sturges’s
rule or the power of k rule as follows:

Sturges’s rule:
\[ k = 1 + 3.3 \log_{10}(n) \]

Power of k rule (choose the largest integer k that satisfies the inequality):
\[ 2^k < n \]

where k is the number of classes (intervals) and n is the number of observations.
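
Both rules are easy to evaluate numerically. The sketch below assumes the logarithm is base 10 and reads the power of k rule as "the largest integer k with 2^k < n"; the function names are hypothetical helpers, not part of any package.

```python
import math


def sturges_classes(n: int) -> float:
    """Sturges's rule as stated in the text: k = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)


def power_rule_classes(n: int) -> int:
    """Largest integer k such that 2**k < n (assumed reading of the rule)."""
    k = 0
    while 2 ** (k + 1) < n:
        k += 1
    return k


n = 40
print(round(sturges_classes(n), 2))  # unrounded guide value for k
print(power_rule_classes(n))
```

Sturges's rule returns a non-integer value, so the final number of classes is still a judgment call between the neighbouring integers.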

Step 2: determine the class width/size

Class width (C) = Range / Number of Classes = (Data max – Data Min) / k.

Step 3: determine the starting point or the lower limit of first class.

- Value of the starting point must be ≤ the smallest value (data minimum) in the data set.
- Normally the smallest value in the data set will be chosen as the starting point.

Step 4: Determine the class boundaries.

First, we determine the upper boundary of each class. The lower boundary of each
class has the same value as the upper boundary of the previous class.

\[ \text{class boundary} = \frac{\text{upper limit of one class} + \text{lower limit of the next class}}{2} \]

Note:
a. We can check for errors by finding the class width using both formulas below. The
answers must be the same if there is no error.
Class width = upper boundary – lower boundary
= (upper class limit – lower class limit) + 1 (for integer data)
b. We may include the class midpoint (class mark) in the table.

\[ \text{class midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2} = \frac{\text{lower boundary} + \text{upper boundary}}{2} \]

Step 5: Determine the number of observations (frequency) that fall in each class.
By using these steps, Table 2.3 can be presented completely.

Class        Lower limit   Upper limit   Lower boundary   Upper boundary   Class mark (midpoint)
170 to 189   170           189           169.5            189.5            179.5
190 to 209   190           209           189.5            209.5            199.5
210 to 229   210           229           209.5            229.5            219.5
230 to 249   230           249           229.5            249.5            239.5
250 to 269   250           269           249.5            269.5            259.5
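
The boundaries and midpoints in the completed table can be derived from the class limits alone. The following Python sketch assumes the class limits of Table 2.3 and integer data, so that the gap between consecutive classes is 1; all names are illustrative.

```python
# Class limits from Table 2.3.
limits = [(170, 189), (190, 209), (210, 229), (230, 249), (250, 269)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]

rows = []
for lower, upper in limits:
    lower_b = lower - gap / 2          # lower class boundary
    upper_b = upper + gap / 2          # upper class boundary
    midpoint = (lower + upper) / 2     # class mark
    width = upper_b - lower_b          # class width
    rows.append((lower_b, upper_b, midpoint, width))
    print(f"{lower} to {upper}: {lower_b} {upper_b} {midpoint} width={width}")
```

Every class should report the same width, which is the error check described in the note.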

Example 2.7
Suppose you are considering investing in a mutual fund. You collected the data in Table 2.4,
which presents the three-year rate of return (in percent) for a simple random sample of 40
small capitalization growth mutual funds. Construct a frequency distribution table for these
data.

Table 2.4: Three-year return rate of 40 mutual funds.


27 13 23 32 18 24 18 15
17 29 30 48 32 15 21 37
11 22 12 11 26 13 27 19
24 18 46 18 24 31 20 19
36 17 17 23 38 22 16 29

Solution:
Step 1: Determine the number of classes.
Sturges’s rule:
\[ k = 1 + 3.3 \log_{10}(n) = 1 + 3.3 \log_{10}(40) \approx 6.29 \]

We may choose either 6 or 7 classes. Let us choose 7 classes for the data.
Power of k rule:
\[ 2^k < n \;\Rightarrow\; 2^k < 40 \;\Rightarrow\; 2^5 = 32 < 40 \]
By using this rule, we may choose 5 intervals or classes.

Step 2: Determine the class width.

Class width = Range / number of classes
= (48 – 11) / 7
≈ 5.29, which is rounded up to 6
Step 3: Determine the starting point.
Let us choose 11 (the minimum value) as the starting point.
Steps 4 and 5: The class boundaries and frequencies are given in the following table.

Class limit   Class boundary   Frequency
[11 – 16]     [10.5 – 16.5]    8
[17 – 22]     [16.5 – 22.5]    13
[23 – 28]     [22.5 – 28.5]    8
[29 – 34]     [28.5 – 34.5]    6
[35 – 40]     [34.5 – 40.5]    3
[41 – 46]     [40.5 – 46.5]    1
[47 – 52]     [46.5 – 52.5]    1
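
The whole construction for Example 2.7 can be sketched in Python as follows. The sketch assumes 7 classes, the minimum value as the starting point, and integer data (so boundaries lie 0.5 outside the limits); the variable names are illustrative.

```python
import math

# Three-year rates of return from Table 2.4 (40 funds).
returns = [
    27, 13, 23, 32, 18, 24, 18, 15,
    17, 29, 30, 48, 32, 15, 21, 37,
    11, 22, 12, 11, 26, 13, 27, 19,
    24, 18, 46, 18, 24, 31, 20, 19,
    36, 17, 17, 23, 38, 22, 16, 29,
]

k = 7                                                  # chosen number of classes
width = math.ceil((max(returns) - min(returns)) / k)   # range / k, rounded up
start = min(returns)                                   # lower limit of first class

table = []
for i in range(k):
    lower = start + i * width
    upper = lower + width - 1                          # integer data
    f = sum(lower <= x <= upper for x in returns)      # observations in class
    table.append((lower, upper, lower - 0.5, upper + 0.5, f))
    print(f"[{lower} - {upper}]  [{lower - 0.5} - {upper + 0.5}]  {f}")
```

The frequencies should add up to 40, the number of funds in the sample.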

2.3.2 Relative Frequency and Percentage Distribution


The relative frequency of a class is obtained by dividing the frequency of that class
by the total frequency:

\[ \text{Relative frequency, } R_f = \frac{\text{frequency of that class}}{\text{total frequency}} = \frac{f}{\sum f} \]

A percentage distribution can be presented as:

\[ \text{Percentage} = (\text{relative frequency}) \times 100 = R_f \times 100 \]

Example 2.8:
Construct a relative frequency and percentage distribution of the data in Table 2.3.

Table 2.4: Relative frequency and percentage distribution for cholesterol value.

Cholesterol value   Frequency   Relative frequency   Percentage
170 to 189          3           3/45 ≈ 0.0667        3/45 (100) ≈ 6.67%
190 to 209          10          10/45 ≈ 0.2222       10/45 (100) ≈ 22.22%
210 to 229          17          17/45 ≈ 0.3778       17/45 (100) ≈ 37.78%
230 to 249          13          13/45 ≈ 0.2889       13/45 (100) ≈ 28.89%
250 to 269          2           2/45 ≈ 0.0444        2/45 (100) ≈ 4.44%
Total               45          1.00                 100%

2.3.3 Cumulative frequency distribution


By using Table 2.4, the cumulative frequency can be given in Table 2.5.
Table 2.5: Cumulative frequency and percentage distribution for cholesterol value.
Cholesterol value frequency Cumulative frequency
170 to 189 3 3
190 to 209 10 13
210 to 229 17 30
230 to 249 13 43
250 to 269 2 45
Total 45
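
A cumulative frequency column is just a running total of the frequency column. A minimal Python sketch, assuming the frequencies of Table 2.3 are typed in by hand:

```python
from itertools import accumulate

classes = ["170 to 189", "190 to 209", "210 to 229", "230 to 249", "250 to 269"]
frequency = [3, 10, 17, 13, 2]

# Running total of the frequencies, class by class.
cumulative = list(accumulate(frequency))

for label, f, cf in zip(classes, frequency, cumulative):
    print(f"{label}  {f:>2}  {cf:>2}")
```

The last cumulative frequency always equals the total number of observations.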

2.3.4 Graphing of (grouped) Quantitative Data

Grouped quantitative data can be displayed by using histograms, polygons,
and ogives.

Histograms
A histogram is similar to a bar graph and can be drawn for a frequency distribution, a
relative frequency distribution or a percentage distribution. To draw a histogram, the
horizontal axis (x-axis) is marked with the classes (or class boundaries) and the vertical
axis (y-axis) is marked with the frequencies (or relative frequencies or percentages). The
heights of the bars represent the frequency of each class.

Example 2.9:
Construct a histogram for data in Table 2.3.

Solution:
[Figure 2.3: Histogram of cholesterol value. Horizontal axis: intervals; vertical axis: frequency.]

Polygon
A polygon is a line graph that is formed by joining the points plotted by the midpoint and
frequency (or relative frequency or percentage) of each class with straight lines.

Example 2.10:
Construct a polygon for data in Table 2.3.

Solution:

[Figure 2.4: Polygon of cholesterol value.]

Ogive
The curve that is constructed from the cumulative frequency distribution is called an ogive.
The class boundaries are marked on the horizontal axis and the cumulative frequencies are
marked on the vertical axis. Each of the dots in the graph is plotted by the upper class
boundary as the x-coordinate and the cumulative frequency as the y-coordinate for each of
the classes. Next, a curve or straight line is drawn through each of these points.
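
The points of an ogive can be listed before any drawing is done. This sketch assumes the upper class boundaries and frequencies of the cholesterol data in Table 2.3; plotting itself is left to whichever tool is at hand.

```python
from itertools import accumulate

# Upper class boundaries and frequencies for the cholesterol data.
upper_boundaries = [189.5, 209.5, 229.5, 249.5, 269.5]
frequency = [3, 10, 17, 13, 2]

# Each ogive point is (upper class boundary, cumulative frequency).
ogive_points = list(zip(upper_boundaries, accumulate(frequency)))

for x, y in ogive_points:
    print(x, y)
```

Joining these points in order gives a curve that never decreases and ends at the total frequency.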

[Figure 2.5: Ogive of cholesterol value.]

Exercise 2.2:
1. Refer to Table in Example 2.7.
a. Construct the relative frequency and cumulative frequency distribution.
b. Draw a relative frequency histogram and a relative frequency polygon for the data
on the same graph.
c. Illustrate the cumulative distribution with a suitable graph.

2. The following data gives the amounts of electrical bills (rounded to nearest RM) for the
past one month for 30 families.

75 34 47 26 56 29 48 42 33 67
38 41 63 55 61 73 61 76 46 51
55 42 35 39 45 71 24 47 67 52

a. Construct a frequency distribution table by taking 21 as the lower limit of the first
class and 10 as the width of each class.
b. Calculate the relative frequency, percentage and class boundaries for each class.
c. Construct a cumulative frequency distribution table and represent it graphically.
d. What percentage of the families have a monthly electrical bill of RM 61 or more,
and of RM 40 or less?

3. Refer to the frequency distribution given in this table to find the following.


a. The boundaries for class c to d
b. The class mark for the class e to f
c. The width for the class g to i
d. The lower class limit for the class g to i
e. The total number of observations

class frequency
a to b f1
c to d f2
e to f f3
g to i f4
j to k f5

2.4 Implementation in Computer Sciences and Software Packages

Example 2.11:
Given a set of values for some variable, we want to organize and describe these values in a
more meaningful way than just listing the raw data. For example, Mary converted a Java
library for matrix manipulation into JavaScript, and she was interested in the time behavior
of some of the functions. In one test, she generated 100 random matrices of size 70 x 60 and
used the JavaScript Date object to measure the run-time of the pseudo-inverse of each matrix
in milliseconds. How should Mary describe the following data set that she has recorded?

318, 314, 315, 315, 313, 314, 315, 314, 314, 315, 313, 313, 315, 313, 314, 315, 314, 314,
315, 316, 315, 315, 314, 314, 314, 314, 314, 315, 314, 314, 316, 315, 314, 314, 315, 315,
316, 315, 313, 314, 313, 314, 314, 313, 313, 313, 315, 313, 312, 312, 313, 316, 313, 315,
315, 315, 313, 313, 312, 314, 314, 313, 313, 315, 314, 314, 315, 314, 314, 315, 313, 313,
314, 312, 312, 316, 314, 315, 315, 315, 315, 315, 314, 314, 313, 314, 314, 315, 313, 315,
316, 314, 315, 314, 323, 314, 314, 315, 314, 310

A more useful way for Mary is to organize the raw data and to look at the distribution. This
shows the number of occurrences, or frequency, of each value in her data set, from which
she can observe the time behavior of the pseudo-inverse function. Therefore, she can
construct a frequency distribution table and generate a graph or chart based on the table.

To approximate the number of classes, Sturges’s rule gives

\[ k = 1 + 3.3 \log_{10}(100) = 7.6 \approx 8 \text{ classes.} \]

Frequency Distribution Table

Run-time (ms)   Number of occurrences   Relative frequency   Percentage (%)
310             1                       1/100 = 0.01         0.01 × 100 = 1
312             5                       5/100 = 0.05         0.05 × 100 = 5
313             20                      20/100 = 0.20        0.20 × 100 = 20
314             36                      36/100 = 0.36        0.36 × 100 = 36
315             30                      30/100 = 0.30        0.30 × 100 = 30
316             6                       6/100 = 0.06         0.06 × 100 = 6
318             1                       1/100 = 0.01         0.01 × 100 = 1
323             1                       1/100 = 0.01         0.01 × 100 = 1
Total           ∑f = 100                ∑Rf = 1              ∑% = 100
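
The same table can be produced programmatically from the raw run-times. The Python sketch below assumes the 100 recorded values are pasted into a list named `run_times` (an illustrative name) and counts each distinct value.

```python
from collections import Counter

# Run-times in milliseconds, as recorded in Example 2.11.
run_times = [
    318, 314, 315, 315, 313, 314, 315, 314, 314, 315, 313, 313, 315, 313, 314, 315, 314, 314,
    315, 316, 315, 315, 314, 314, 314, 314, 314, 315, 314, 314, 316, 315, 314, 314, 315, 315,
    316, 315, 313, 314, 313, 314, 314, 313, 313, 313, 315, 313, 312, 312, 313, 316, 313, 315,
    315, 315, 313, 313, 312, 314, 314, 313, 313, 315, 314, 314, 315, 314, 314, 315, 313, 313,
    314, 312, 312, 316, 314, 315, 315, 315, 315, 315, 314, 314, 313, 314, 314, 315, 313, 315,
    316, 314, 315, 314, 323, 314, 314, 315, 314, 310,
]

n = len(run_times)
frequency = Counter(run_times)

# One row per distinct run-time: value, frequency, relative frequency, percentage.
for t in sorted(frequency):
    f = frequency[t]
    print(f"{t}  {f:>3}  {f / n:.2f}  {f / n * 100:.0f}")
```

Because each distinct run-time is treated as its own row, the table has as many rows as there are distinct values in the data.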

[Figure: Frequency distribution of run-time for the pseudo-inverse function of random matrices of size 70 x 60, measured using the JavaScript Date object. Horizontal axis: run-time (ms); vertical axis: frequency.]

Based on the frequency distribution table, Mary observes the number of occurrences of each
run-time of the pseudo-inverse function for random matrices of size 70 x 60, measured using
the JavaScript Date object. The time values are clustered in the 310-318 ms range, and there
is an odd outlier at 323 ms. The most frequent run-time is 314 ms, and the least frequent are
310, 318 and 323 ms. Moreover, the bar chart shows that the run-times fall most frequently
between 313 and 315 ms.

Practice statistical tools to organize and describe raw data


Based on previous example, we can practice statistical functions, analysis and graphs using
Microsoft Excel, SPSS or Matlab.

1. Using Excel

Consider the previous set of data as in Table _, and enter the data into an Excel worksheet.
The data consist of the time range (ms) and the number of occurrences or frequency. Then
select the data in A2 to A15 and B2 to B15 simultaneously, click on ‘charts’ and choose a
chart type. To edit the data of the x-axis and y-axis, right click on the graph and select
‘select data’.

Answer:

2. Using SPSS Package

IBM SPSS Statistics can also be used for statistical analysis. The difference between Excel
and SPSS is that SPSS cannot interpret Excel formulas, so any cell in Excel that is derived
from a formula will not be read in. SPSS data files contain only data and the metadata that
describe the data (format, labels, value labels, missing values, etc.).

To insert the data set, we click on the ‘file’ tab, then ‘new’, and select ‘data’ to open the
data editor. We can also import the data from Excel, csv data, text data, etc. by
clicking on the ‘file’ tab, then ‘import data’, and selecting the data file. To edit the data
description, we can click on the ‘variable view’ tab. Then, click on the ‘analyze’ tab to
analyse the data set or the ‘graphs’ tab to generate a graph as in the following figure. After
generating the graph, double click on the graph figure to open the chart editor, where we
can edit the data and labels on the x-axis and y-axis, the title of the graph, etc.

3. Using Matlab

Unlike in Excel, in Matlab we need to write our own code to run statistical functions and
analyses and to call graphing tools. First, we open a new script in which we write the code.
Then, we create our own ‘filename.mat’ file to hold the data (time range (time) and number
of occurrences (occ)). Based on the following figure, 1) we can directly insert the data into
an empty 4x8 data matrix (after generating line 4, click on ‘data’ in the workspace) and then
save the data matrix (insert line 21 after line 4), or 2) we can write the data as in lines 6-9.
After we insert the data, we calculate the relative frequency and percentage distribution as
in lines 13-18. To plot the bar chart, we call the graphing tool as in line 20 and save the 4x8
data matrix in the ‘filename.mat’ (freq_dist.mat) file as in line 21. After we finish writing
the code, we click ‘run’ on the editor tab to run each line of the code. To look at the 4x8
data matrix, click ‘data’ in the workspace section.