Chapter 2: Collection, Presentation, and Organization of Data

2.1. Preliminaries

Measurement is the process of determining the value or label of the variable based on what has
been observed.

Elementary unit is the individual or object on which a variable is measured.

2.1.1. Quantitative vs. Qualitative Variables

Qualitative or categorical variable is a variable that yields categorical responses. Examples of
qualitative variables are Sex, Civil Status, Student Number, and Telephone Number.

Quantitative variable is a variable that takes on numerical values representing an amount or quantity.
Examples of quantitative variables are height, age, and weight. There are two kinds of quantitative
variables: discrete and continuous variables.

Discrete variable is a variable which can assume a finite, or at most countably infinite, number
of values. Examples are the number of trees in the Sunken Garden, and the number of car accidents
on University Avenue in one year.

Continuous variable is a variable which can assume infinitely many values that can be stated
using an interval. Examples are height and weight.

Remarks

• Some variables, although numerical in nature, are still categorical variables (for example,
Student Number and Telephone Number).
• One technique to know whether a specific variable has a fixed unit is to try adding or
subtracting its values. If adding or subtracting them makes sense, then the
variable has a fixed unit.
• Most variables with units (meters, hours, Celsius) are quantitative.

Exercises 2.1.1. Determine whether the variables given below are qualitative or quantitative
variables. If quantitative, determine whether the variable is a discrete or continuous variable.

1. Number of deaths in Game of Thrones
2. Temperature in Celsius.
3. Student name
4. Student’s STS bracket
5. Rankings in a contest – (Champion, 1st Runner Up, etc.)
6. The number of coin tosses until a head comes up
7. Number of UPCAT applicants
8. The time it takes to get from your house to UP

2.1.2. Levels of Measurement

1. Nominal level of measurement is the level of measurement which only satisfies the property
that the numbers in the system are used to classify a person/object into distinct,
nonoverlapping, and exhaustive categories.

Remarks

• There is no ordering of the possible values of a variable measured in the nominal scale.
• A variable measured in the nominal level is categorical/qualitative in nature, which means
that doing arithmetic operations on its values is meaningless.
• These variables are mainly used for identification.
• Examples of variables measured in the nominal level: Sex, Plate number, Student number

2. Ordinal level of measurement satisfies the following properties:


i. The numbers in the system are used to classify a person/object into distinct, nonoverlapping,
and exhaustive categories.
ii. The system arranges the categories according to magnitude.

Remarks

• Variables measured in the ordinal level of measurement are still categorical in nature.
• Mathematical operations are still meaningless for values of variable measured in the
ordinal level.
• Examples: Likert scale, Rankings, IMDB ratings, Yelp Ratings

3. Interval level of measurement only satisfies the following properties:


i. The numbers in the system are used to classify a person/object into distinct, nonoverlapping,
and exhaustive categories.
ii. The system arranges the categories according to magnitude.
iii. The system has a fixed unit of measurement representing a set size throughout the scale

Remarks

• The third property basically states that the distances between each interval on the scale are
equal right along the scale from the low end to the high end.
• In the interval level, there is no absolute zero. Absolute zero means that a score of zero really
means that “none exists.”
• Values of a variable measured in the interval level can now be added or subtracted. However,
multiplication and division are still meaningless mathematical operations for the interval level.
• Example: Temperature in Celsius

4. Ratio Level of Measurement satisfies the following properties:


i. The numbers in the system are used to classify a person/object into distinct, nonoverlapping,
and exhaustive categories.
ii. The system arranges the categories according to magnitude.
iii. The system has a fixed unit of measurement representing a set size throughout the scale
iv. It has an absolute zero

Remarks

• Absolute zero means that a score of zero really means that “none exists.”
• Examples: Scores in exam, Kelvin scale, Income
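
The difference between the interval and ratio levels can be checked numerically. The sketch below is only an illustration: the two temperatures are made-up values, and the 273.15 offset is the standard Celsius-to-Kelvin conversion. It shows that differences agree on both scales, while the ratio computed from Celsius values does not carry over to Kelvin.

```python
# Two temperatures measured on the Celsius (interval) scale.
# The values 10 and 20 are illustrative, not taken from the text.
c_low, c_high = 10.0, 20.0

# The same temperatures on the Kelvin (ratio) scale, which has an absolute zero.
k_low, k_high = c_low + 273.15, c_high + 273.15

# Differences agree on both scales because the unit size is the same.
diff_celsius = c_high - c_low
diff_kelvin = k_high - k_low

# Ratios disagree: "20 is twice 10" is true of the Celsius labels only,
# not of the temperatures measured from absolute zero.
ratio_celsius = c_high / c_low
ratio_kelvin = k_high / k_low

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

Because the Celsius zero is arbitrary, only the Kelvin ratio describes how much "more temperature" there is.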

Exercises 2.1.2. Determine what level of measurement is used to measure the following variables.

1. Time (in minutes)
2. Allowance (in pesos)
3. Religious belief (1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other)
4. Rank in a competition
5. Cost of a fast food meal
6. Years of work experience

2.2. Data Collection Methods

1. Registration Method – recording certain identifying or descriptive statistics.
Examples: Car conference, email account registration.

2. Observation Method – using the senses to record phenomena such as behavioral patterns,
gestures, etc. as they occur.
Examples: recording the number of car accidents on a specific street (by watching CCTV
footage), watching the behavior of children in a day-care facility.

3. Survey Method – uses questionnaires to ask questions about the topic at hand. When
conducting a survey, respondents are the people who answer the questions, and enumerators
are the people who ask the questions.
Main Types of surveys:
a. Self-administered questionnaires
b. personal interviews
c. online surveys

Self-Administered Questionnaire | Personal Interviews
Obtained information is limited to subjects’ written answers to pre-arranged questions | Missing information and vague responses are minimized with proper probing of the interviewer
Lower response rate | Higher response rate through call-backs
It can be administered to a large number of people simultaneously | It is administered to a person or group one at a time
Respondents may feel freer to express views and are less pressured to answer immediately | Respondents may feel more cautious, particularly in answering sensitive questions, for fear of disapproval
It is more appropriate for obtaining objective information | It is more appropriate for obtaining information about complex, emotionally-laden topics or probing sentiments underlying an expressed opinion

4. Focus Group Discussion – a method of collection of data wherein a moderator follows a focus
group discussion guide to direct a freewheeling discussion among a small group of people.

5. Experimental Method – a method of collecting data where there is direct human intervention
on the conditions that may affect the values of the variable of interest.

Some Terminologies in Experiments

• Explanatory variables are the factors under study.
• Response variable is the variable which states the “effect” of the explanatory variable.
• Factor levels/Treatments are the categories of the explanatory variable.
• Extraneous variables are other variables that we believe to have an effect on the
response variable but that we do not want to study. Oftentimes, we control them or hold them
constant.

Example 2.2.1. You want to study the effect of different daily water intake (5 ml, 10 ml, 15 ml,
20 ml) of a mongo seedling on its height in centimeters.
Response Variable: Height of the mongo seedling
Explanatory Variable: Amount of water
Factor levels: 5 ml, 10 ml, 15 ml, and 20 ml
Extraneous variables: Fertility of soil, amount of sunlight, parent of the mongo seedlings

6. Use of Documented Data – data that come from documents and published media such as
journals, books, and newspapers.
• Primary Data: data documented by the primary source
• Secondary Data: data documented by a secondary source. An individual/agency, other
than the data collectors, documented this data.

Remarks

• The experimental method is the most appropriate method of data collection if we want to
study a cause-and-effect relationship. This is because, in experiments, there is more control over
other variables that have an effect on the value of the response variable.
• One disadvantage of the experimental method is that the results of the study might be
more unnatural if there are too many extraneous variables being controlled.
• The observation method is superior to surveys when we are dealing with non-verbal behaviors.
• One disadvantage of the observation method is that the environment under study may be
difficult to enter.
• Surveys are more appropriate for cases when we want to extract verbal information (opinions
and ideas).
• A focus group discussion is oftentimes conducted before the actual survey to construct the
actual questionnaire.

2.3. Sampling

2.3.1. Basic Concepts in Sampling

Census/Complete Enumeration is the process of gathering information from every unit in the
population.

Sampling is the process of obtaining information from the units in the selected sample.

Why we resort to sampling: 1. Cost-efficiency of sampling. 2. Feasibility

Census
▪ Not always possible to get timely, accurate, and economic data
▪ Costly, if the number of units in the population is too large

Sampling
▪ Reduced cost
▪ Greater speed
▪ Greater scope
▪ Greater accuracy

Terminologies in Sampling

Target Population is the population we want to study.

Sampled Population is the population from where we actually select the sample.

Elementary unit/element is a member of the population whose measurement on the variable
of interest is what we wish to examine.

Sampling unit is a unit of the population that we select in our sample.

Sampling frame or frame is a list or map showing all the sampling units in the population

Remarks

• It is ideal that the target population and sampled population are equivalent. However, they
are oftentimes not identical.
• If the sampled population is not equivalent to the target population, it is more appropriate
to apply the conclusion to the sampled population.
• There are cases wherein the sampling unit is not equivalent to the elementary unit.

Example 2.3.1. A news program wants to know whether all Filipinos are in favor of a
proposed tax increase, so they posted on their Twitter account a poll about it.

In this scenario, the target population is the set of all Filipinos. However, the sampled
population consists only of those Filipinos who have Twitter accounts and saw the poll. The
sampling unit and elementary unit are the Filipino Twitter users who answered the online
poll.

Example 2.3.2. A researcher wants to conduct a survey about social media usage of public high
school students in Manila City. He randomly selected 10 public high schools in Manila City, and all
the students in the selected high schools will be included in the study.

In this example, the target population and sampled population are the same: all public high school
students in Manila City (assuming that the sampling frame is updated). The elementary units are the
public high school students in Manila City, while the sampling units are the public high schools in
Manila City.

Errors while conducting sampling

Sampling error is the error attributed to the variation present among the computed values of the
statistic from different possible samples consisting of n elements.

Nonsampling error is the error from other sources apart from sampling fluctuations. Some causes of
this are errors in the frame, errors in the questionnaire, errors in the sample selection, etc. There
are two main types of nonsampling error:

▪ Error in the implementation of the design is the error when the sampling design is not
implemented properly.

▪ Measurement error is the error when the actual value of the variable is not captured.

Example 2.3.3. Illustration of Sampling Error

{𝑌1 , 𝑌2 , … , 𝑌𝑁 } : Population

{𝑦1 , 𝑦2 , … , 𝑦𝑛 } : Sample

𝑁: Population Size (number of elements in the population)

𝑛: Sample Size (number of elementary units in the sample)

You want to get a sample of 5 students from a class of 15 students. You want to estimate the mean
grade of the whole class in mathematics. The grades are 87, 88, 96, 75, 76, 89, 87, 89, 78, 75, 90, 92,
85, 84, and 78.

87 is Y1, 88 is Y2, ..., 78 is Y15

True Mean = 84.6

Sample 1: {y1= Y3, y2=Y12, y3=Y7, y4=Y9, y5=Y1} = {96, 92, 87, 78, 87}.

Estimated Mean = 88

Sample 2: {y1= Y2, y2=Y7, y3=Y8, y4=Y1, y5=Y15} = {88, 87, 89, 87, 78}.

Estimated Mean = 85.8

Notice that the value of the mean differs based on the sample selected. This error is the
sampling error.
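
The arithmetic in Example 2.3.3 can be reproduced with a short script. This is a minimal sketch that uses only the grades and the two samples listed above; the helper name `sample_mean` is a hypothetical label introduced here for illustration.

```python
from statistics import mean

# Grades of the N = 15 students, in the order Y1, Y2, ..., Y15.
population = [87, 88, 96, 75, 76, 89, 87, 89, 78, 75, 90, 92, 85, 84, 78]

def sample_mean(positions):
    """Mean of the population values at the given 1-based positions."""
    return mean(population[i - 1] for i in positions)

true_mean = mean(population)
mean_1 = sample_mean([3, 12, 7, 9, 1])   # Sample 1: Y3, Y12, Y7, Y9, Y1
mean_2 = sample_mean([2, 7, 8, 1, 15])   # Sample 2: Y2, Y7, Y8, Y1, Y15

print(true_mean, mean_1, mean_2)
# How far each estimate is from the true mean for that particular sample.
print(mean_1 - true_mean, mean_2 - true_mean)
```

Each choice of five positions gives its own estimate, and the spread of those estimates around the true mean is what the sampling error describes.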

Types of Sampling

Probability Sampling refers to methods or sampling plans that give every element a known and
nonzero probability of being included in the sample (known as the inclusion probability).

Nonprobability Sampling refers to methods or sampling plans wherein at least one of the two
conditions of probability sampling (known and nonzero inclusion probabilities) is not satisfied.

2.3.2. Types of Probability Sampling

I. Simple Random Sampling is a probability sampling method wherein all possible subsets
consisting of n elements selected from the N elements of the population have the same chances
of selection.
a. Simple Random Sampling Without Replacement (SRSWOR) – all the n elements in
the sample must be distinct from each other.

b. Simple Random Sampling With Replacement (SRSWR) – all the n elements in the
sample need not be distinct from each other.

Steps in picking a Simple Random Sample


i. Make a list of the sampling units and number them from 1 to N.
ii. Select n numbers from 1 to N using some random process, for example, the table
of random numbers.
iii. The sample consists of the units corresponding to the selected random numbers.
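
In practice, the random process in step ii is often a pseudorandom number generator instead of a printed table. The sketch below assumes Python's standard `random` module as that generator; the function names `srswor` and `srswr`, the seed, and the values N = 52 and n = 5 are illustrative choices, not part of the notes.

```python
import random

def srswor(N, n, rng):
    """Simple random sample without replacement: n distinct labels from 1..N."""
    return rng.sample(range(1, N + 1), n)

def srswr(N, n, rng):
    """Simple random sample with replacement: labels from 1..N may repeat."""
    return [rng.randint(1, N) for _ in range(n)]

rng = random.Random(2024)  # fixed seed (an arbitrary choice) for repeatable runs
without_replacement = srswor(52, 5, rng)
with_replacement = srswr(52, 5, rng)

# The selected numbers are the labels of the units that enter the sample.
print(without_replacement)
print(with_replacement)
```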

Advantages and Disadvantages of Using SRS

Advantages
▪ The theory involved is much easier to understand than the theory behind other sampling
methods.
▪ Inferential methods are simple and easy.

Disadvantages
▪ The sample chosen may be widely spread, thus entailing higher transportation cost.
▪ A population frame, or list, is needed.
▪ Less precise estimates result if the population is heterogeneous with respect to the
characteristic under study.

ASIDE: Using the Table of Random Numbers


1. Determine the population size N, sample size n, and the number of digits in N which we
will denote with m.
Example: N=52, n=5, and 52 has 2 digits, therefore m=2.
2. Select a starting point. The starting point can be any row and any m consecutive
columns.
Example: Let us say the starting point is row 21, and columns 24 and 25 (since m = 2).
The starting point is 33.
3. Read the numbers in the selected column/s, starting with the selected row then move on
by going down the rows. The numbers selected are the numbers from 1 to N. Skip the
number 00 and the numbers greater than N. If n numbers must be distinct, then skip all
numbers that you have already selected. If you reach the last row and have not yet
generated n numbers, proceed to the next m columns starting with row 00. Stop once
you obtain the n numbers.
Example: 33, 23, 43, 24, 46
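
The skipping rules in step 3 can be written as a small function. This is a sketch under stated assumptions: the function name `read_random_numbers` is hypothetical, and the list `column` is made-up data standing in for the entries read down the chosen columns, not values from an actual table of random numbers.

```python
def read_random_numbers(entries, N, n, distinct=True):
    """Select n labels from 1..N by reading m-digit table entries in order.

    entries: m-digit strings read down the chosen columns (hypothetical data).
    Entries equal to zero or greater than N are skipped; repeats are skipped
    too when the sample must consist of distinct units.
    """
    selected = []
    for entry in entries:
        value = int(entry)
        if value == 0 or value > N:
            continue  # outside the range 1..N
        if distinct and value in selected:
            continue  # already chosen and repeats are not allowed
        selected.append(value)
        if len(selected) == n:
            return selected
    raise ValueError("ran out of table entries before selecting n numbers")

# Hypothetical two-digit entries, NOT taken from an actual table.
column = ["33", "23", "77", "43", "00", "24", "23", "46", "12"]
sample = read_random_numbers(column, N=52, n=5)
print(sample)
```

Passing `distinct=False` keeps repeated labels, which corresponds to sampling with replacement.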

Exercises 2.3.1.

1. From a population of 900 employees in a factory, we need to find a sample of size 5 using
simple random sampling without replacement. Use the table of random numbers to get the
labels of the five employees that are to be included in the sample. (Start in row 19, columns
12 to 12+m-1).
Answer: 362, 365, 163, 545, 500
2. From a population of 30 students in a Stat 101 class, we are tasked to get a sample of 6
students using SRS With Replacement. Use the table of random numbers to get the six
students that are to be included in the sample. (Start in row 10, columns 08 to 08+m-1).
3. SRSWR, N=500, n=4, start at row 4, column 9 to 9+m-1
4. SRSWOR, N=90, n=3, start at row 15, columns 28 to 28+m-1

II. Systematic Sampling is a probability sampling method wherein the selection of the first
element is at random and the selection of the other elements in the sample is systematic by
subsequently taking every kth element from the random start, where k is the sampling interval.
Steps in picking a Systematic Sample
i. Number the units of the population consecutively from 1 to N.
ii. Let k be equal to ⟦𝑁⁄𝑛⟧
iii. Select the random start 𝑟, where 1 ≤ 𝑟 ≤ 𝑘. The unit corresponding to 𝑟 is the
first unit of the sample.
iv. The other units of the sample correspond to 𝑟 + 𝑘, 𝑟 + 2𝑘, … 𝑟 + (𝑛 − 1)𝑘.
v. Variation (Circular list): The random start 𝑟 is selected from 1 to N. Starting
from 𝑟, take every kth element for the other elements of the sample, treating the
list as circular so that the count continues from unit 1 after unit N.
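
The two variants can be sketched in code as follows. The sketch assumes that ⟦N/n⟧ means the integer part of N/n and that the random start r has already been drawn; the function names are hypothetical, and N = 30, n = 10 with starts 3 and 22 are illustrative inputs.

```python
def systematic_linear(N, n, r):
    """Linear systematic sample: r, r + k, ..., r + (n - 1)k with k = N // n."""
    k = N // n
    if not 1 <= r <= k:
        raise ValueError("random start must satisfy 1 <= r <= k")
    return [r + j * k for j in range(n)]

def systematic_circular(N, n, r):
    """Circular systematic sample: start anywhere in 1..N and wrap around."""
    k = N // n
    if not 1 <= r <= N:
        raise ValueError("random start must satisfy 1 <= r <= N")
    # Shift to 0-based positions, step by k modulo N, then shift back.
    return [(r - 1 + j * k) % N + 1 for j in range(n)]

linear = systematic_linear(30, 10, 3)
circular = systematic_circular(30, 10, 22)
print(linear)
print(circular)
```

The modulo step is what makes the list circular: once the count passes unit N it continues from unit 1.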

Advantages and Disadvantages of Using Systematic Sampling

Advantages
▪ It is easier to draw the sample and often easier to execute without mistakes than SRS.
▪ It is possible to select a sample in the field without a sampling frame.
▪ The systematic sample is spread more evenly over the population.

Disadvantages
▪ If periodic regularities are found in the list, a systematic sample may consist of only
similar types.
▪ Knowledge of the structure of the population is necessary for its most effective use.

III. Stratified Random Sampling is a probability sampling method where we divide the
population into nonoverlapping subpopulations or strata, and then select one sample from
each stratum. The sample consists of all the samples in the different strata.
Steps in picking a Stratified Random Sample
i. Divide population into strata. Ideally, each stratum must consist of homogeneous
units.
ii. After the population has been stratified, a random sample is selected from each
stratum.
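
The two steps above can be sketched as follows, assuming the strata and the per-stratum sample sizes are already known. The stratum names, unit labels, allocation, and seed are hypothetical values chosen for illustration, and each stratum is sampled by SRSWOR.

```python
import random

def stratified_sample(strata, allocation, rng):
    """Draw an SRSWOR of allocation[name] units from each stratum."""
    return {
        name: rng.sample(units, allocation[name])
        for name, units in strata.items()
    }

# Hypothetical population: unit labels grouped by stratum.
strata = {
    "A": list(range(1, 41)),    # 40 units
    "B": list(range(41, 101)),  # 60 units
}
allocation = {"A": 4, "B": 6}   # sample size per stratum (assumed given)

rng = random.Random(7)  # arbitrary seed so the run is repeatable
sample = stratified_sample(strata, allocation, rng)
print(sample)
```

The overall sample is the union of the per-stratum samples, so every stratum is guaranteed to be represented.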

Advantages and Disadvantages of Using Stratified Random Sampling

Advantages
▪ Stratification may produce a gain in precision in the estimates of characteristics of the
population.
▪ It allows for more comprehensive data analysis since information is provided for each
stratum.

Disadvantages
▪ The stratification of the population may require additional prior information about the
population and its strata.
▪ A listing of the population for each stratum is needed.

Sample Allocations in Stratified Sampling

1. Equal Allocation – you will allocate your sample size equally on each stratum.
2. Proportional Allocation – you will allocate your sample size proportional to the size of
each stratum.
Example 2.3.4. Suppose you have a population of 100 students that consists of 40 males
and 60 females. You want to select a sample of size 10 using stratified sampling.

If your allocation method is equal allocation, you will get 5 students from the 40 male
students and another 5 from the female students.

If you choose to do proportional allocation, you have to select 4 students from the 40 male
students, and 6 students from the 60 female students.
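
The two allocation rules in this example can be expressed as short functions. This is a minimal sketch: the function names are hypothetical, the symbols n_h and N_h (sample size and population size of stratum h) are notation introduced here, and the equal-allocation helper assumes n divides evenly among the strata.

```python
def equal_allocation(stratum_sizes, n):
    """Split n equally across strata (assumes n is divisible by their number)."""
    share = n // len(stratum_sizes)
    return {name: share for name in stratum_sizes}

def proportional_allocation(stratum_sizes, n):
    """Allocate n in proportion to stratum size: n_h = n * N_h / N."""
    N = sum(stratum_sizes.values())
    return {name: n * size / N for name, size in stratum_sizes.items()}

sizes = {"male": 40, "female": 60}
equal = equal_allocation(sizes, 10)
proportional = proportional_allocation(sizes, 10)
print(equal)
print(proportional)
```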

IV. Cluster Sampling is a probability sampling method wherein we divide the population into
nonoverlapping groups or clusters consisting of one or more elements, and then select a
sample of clusters. The sample will consist of all the elements in the selected clusters.
Steps in picking a Cluster Sample
i. Decide on how to divide the population into non-overlapping clusters. Ideally,
each cluster must consist of heterogeneous units. Number the clusters from 1 to
N.
ii. Select n numbers from 1 to N using some random process. The clusters
corresponding to the selected numbers form the sample of clusters.
iii. Observe all elements in the sample of clusters.
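
A minimal sketch of these steps is given below. The cluster names, student labels, number of clusters selected, and seed are all hypothetical, and Python's `random` module stands in for the random process.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Select n_clusters clusters at random and observe every element in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    elements = [unit for name in chosen for unit in clusters[name]]
    return chosen, elements

# Hypothetical clusters: school name -> list of student labels.
clusters = {
    "School 1": ["s01", "s02", "s03"],
    "School 2": ["s04", "s05"],
    "School 3": ["s06", "s07", "s08", "s09"],
    "School 4": ["s10"],
}

rng = random.Random(11)  # arbitrary seed
chosen, elements = cluster_sample(clusters, 2, rng)
print(chosen)
print(elements)
```

Only the list of clusters is needed to draw the sample; the elements are listed only for the clusters that were selected, and the final sample size depends on which clusters come up.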

Advantages and Disadvantages of Using Cluster Sampling

Advantages
▪ A population list of elements is not needed; only a population list of clusters is required.
Listing cost is reduced.
▪ Transportation cost is reduced if clusters are geographic units.

Disadvantages
▪ The costs and problems of statistical analysis are greater.
▪ Estimation procedures are more difficult.

V. Multistage Sampling: a probability sampling method where there is a hierarchical
configuration of sampling units and we select a sample of these units in stages.
Steps in picking a Multistage Sample
i. Stage 1: Divide the population into primary stage units (PSU) then draw a sample
of PSUs.
ii. Stage 2: Each selected PSU is subdivided into second-stage units (SSU) then a
sample of SSUs is drawn.
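
The two stages can be sketched as below. This is only one possible reading: it assumes SRSWOR at both stages, and the barangay and household labels, the sample sizes, and the seed are hypothetical.

```python
import random

def two_stage_sample(psus, n_psu, n_ssu, rng):
    """Stage 1: sample n_psu PSUs. Stage 2: sample n_ssu SSUs in each one."""
    selected_psus = rng.sample(sorted(psus), n_psu)
    return {
        # Take at most n_ssu units, in case a PSU has fewer SSUs than that.
        psu: rng.sample(psus[psu], min(n_ssu, len(psus[psu])))
        for psu in selected_psus
    }

# Hypothetical frame: barangay (PSU) -> household labels (SSUs).
psus = {
    "Barangay A": ["A1", "A2", "A3", "A4"],
    "Barangay B": ["B1", "B2", "B3"],
    "Barangay C": ["C1", "C2", "C3", "C4", "C5"],
}

rng = random.Random(3)  # arbitrary seed
sample = two_stage_sample(psus, n_psu=2, n_ssu=2, rng=rng)
print(sample)
```

Unlike cluster sampling, only a sample of the units inside each selected PSU is observed.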

Advantages and Disadvantages of Using Multistage Sampling

Advantages
▪ Transportation and listing cost are reduced.

Disadvantages
▪ Estimation procedure is difficult, especially when the primary stage units are not of the
same size.
▪ Estimation procedure gets more complicated as the number of sampling stages increases.
▪ The sampling procedure entails much planning before selection is done.

ASIDE: More examples on Probability Sampling

Suppose we wish to conduct a sample survey. The population consists of the N=30 members of
Dumbledore’s Army, and we wish to select a sample of n=10 members.

No.  Name                      House       No. in house
01   Bell, Katie               Gryffindor  1
02   Brown, Lavender           Gryffindor  2
03   Creevey, Colin            Gryffindor  3
04   Creevey, Dennis           Gryffindor  4
05   Finnigan, Seamus          Gryffindor  5
06   Granger, Hermione         Gryffindor  6
07   Johnson, Angelina         Gryffindor  7
08   Jordan, Lee               Gryffindor  8
09   Longbottom, Neville       Gryffindor  9
10   Patil, Parvati            Gryffindor  10
11   Potter, Harry             Gryffindor  11
12   Spinnet, Alicia           Gryffindor  12
13   Thomas, Dean              Gryffindor  13
14   Weasley, Fred             Gryffindor  14
15   Weasley, George           Gryffindor  15
16   Weasley, Ginny            Gryffindor  16
17   Weasley, Ron              Gryffindor  17
18   Abbott, Hannah            Hufflepuff  1
19   Bones, Susan              Hufflepuff  2
20   Finch-Fletchley, Justin   Hufflepuff  3
21   Macmillan, Ernie          Hufflepuff  4
22   Smith, Zacharias          Hufflepuff  5
23   Boot, Terry               Ravenclaw   1
24   Chang, Cho                Ravenclaw   2
25   Corner, Michael           Ravenclaw   3
26   Edgecombe, Marietta       Ravenclaw   4
27   Goldstein, Anthony        Ravenclaw   5
28   Lovegood, Luna            Ravenclaw   6
29   Patil, Padma              Ravenclaw   7
30   Reynolds, Maisy           Ravenclaw   8

1. Use simple random sampling with replacement to select n=10. Use row 25 cols 09 to
09+m-1

04 Creevey, Dennis
06 Granger, Hermione
16 Weasley, Ginny
08 Jordan, Lee
16 Weasley, Ginny
11 Potter, Harry
15 Weasley, George
18 Abbott, Hannah
27 Goldstein, Anthony
02 Brown, Lavender

2. Use simple random sampling without replacement to select n=10. Use row 23 cols 16
to 16+m-1

25 Corner, Michael
18 Abbott, Hannah
29 Patil, Padma
14 Weasley, Fred
01 Bell, Katie
12 Spinnet, Alicia
11 Potter, Harry
13 Thomas, Dean
17 Weasley, Ron
07 Johnson, Angelina

3. Use stratified random sampling with equal allocation (SRSWOR per stratum). Use row
4 cols 4 to 4+m-1 for Gryffindor, row 7 cols 8 to 8+m-1 for Hufflepuff, row 10 cols 5 to
5+m-1 for Ravenclaw

Since we have equal allocation the number of elements per stratum is 10/3 ≈ 3.3333 ≈ 3.
For the extra single element, I will allocate it to Gryffindor, since this house has the greatest
number of Dumbledore’s Army members.

Gryffindor
13 Thomas, Dean
02 Brown, Lavender
12 Spinnet, Alicia
11 Potter, Harry

Hufflepuff
2 Bones, Susan
3 Finch-Fletchley, Justin
1 Abbott, Hannah

Ravenclaw
6 Lovegood, Luna
2 Chang, Cho
5 Goldstein, Anthony

4. Use stratified random sampling with proportional allocation (SRSWOR per stratum).
Use row 01 cols 16 to 16+m-1 for Gryffindor, row 20 cols 17 to 17+m-1 for Hufflepuff,
and row 16 cols 18 to 18+m-1 for Ravenclaw

Gryffindor: 17/30 ≈ 0.5667, so n1 = 10 × 0.5667 ≈ 5.667 ≈ 6
Hufflepuff: 5/30 ≈ 0.1667, so n2 = 10 × 0.1667 ≈ 1.667 ≈ 2
Ravenclaw: 8/30 ≈ 0.2667, so n3 = 10 × 0.2667 ≈ 2.667 ≈ 3

The rounded allocations add up to 6 + 2 + 3 = 11, one more than the required n = 10. To bring
the total back to 10, I will deduct one element from Gryffindor.

Gryffindor
01 Bell, Katie
12 Spinnet, Alicia
14 Weasley, Fred
11 Potter, Harry
13 Thomas, Dean

Hufflepuff
02 Bones, Susan
05 Smith, Zacharias

Ravenclaw
06 Lovegood, Luna
04 Edgecombe, Marietta
08 Reynolds, Maisy
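
The allocation arithmetic in item 4 can be sketched in code. The function name is hypothetical, and the adjustment rule (the largest stratum absorbs any surplus or shortfall after rounding) is an assumption that mirrors the choice made above, not a general rule.

```python
import math

def rounded_proportional_allocation(stratum_sizes, n):
    """Proportional allocation rounded to whole units, then adjusted to sum to n.

    Assumption: any surplus or shortfall after rounding is absorbed by the
    largest stratum, mirroring the adjustment made in the worked example.
    """
    N = sum(stratum_sizes.values())
    # Round half up; Python's built-in round() rounds halves to even instead.
    alloc = {h: math.floor(n * size / N + 0.5) for h, size in stratum_sizes.items()}
    largest = max(stratum_sizes, key=stratum_sizes.get)
    alloc[largest] -= sum(alloc.values()) - n
    return alloc

houses = {"Gryffindor": 17, "Hufflepuff": 5, "Ravenclaw": 8}
allocation = rounded_proportional_allocation(houses, 10)
print(allocation)
```

Other adjustment rules are possible, for example deducting from the stratum whose share was rounded up the most; the notes do not prescribe one.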

5. Use Systematic Sampling with linear listing. To get the random start, use row 3 cols 16
to 16+m-1

k=30/10 = 3
Since we are considering a linear listing, r can be 1, 2, or 3.

Using the table of random numbers, r = 3

The sample selected is composed of the following elements:

03 Creevey, Colin
06 Granger, Hermione
09 Longbottom, Neville
12 Spinnet, Alicia
15 Weasley, George
18 Abbott, Hannah
21 Macmillan, Ernie
24 Chang, Cho
27 Goldstein, Anthony
30 Reynolds, Maisy

6. Use systematic Sampling with circular listing. To get the random start, use row 2 cols
14 to 14+m-1

With k = 3 and the random start r = 22 read from the table, the count wraps around to 01
after 28. The sample selected is composed of the following elements:

22 Smith, Zacharias
25 Corner, Michael
28 Lovegood, Luna
01 Bell, Katie
04 Creevey, Dennis
07 Johnson, Angelina
10 Patil, Parvati
13 Thomas, Dean
16 Weasley, Ginny
19 Bones, Susan

7. Use Cluster Sampling where the houses are the clusters (1 = Gryffindor, 2 = Hufflepuff,
3 = Ravenclaw) where only one cluster will be selected. Use row 14 cols 14 to 14+m-1

Based on the table of random numbers, the cluster selected is 1 = Gryffindor. So all the
students from Gryffindor who are members of Dumbledore’s Army are included in the
sample.

2.3.2. Types of Nonprobability Sampling

I. Convenience Sampling – the sample consists of elements that are most accessible or
easiest to contact.

Example 2.3.2.1. A group of social scientists is interested in studying the socio-economic
profile of persons with Acquired Immune Deficiency Syndrome. In most cases,
subjects with the disease will not admit in an interview that they are carriers. There is no
complete list of people with AIDS, and because of the Data Privacy Act, the researchers cannot
get a list of patients from the hospitals.

II. Judgment/Purposive Sampling - Instead of just selecting any sampling unit that is easy
to reach, there is now an attempt to come up with a representative sample.

Example 2.3.2.2. A researcher may use a particular district, province, or city to be the
sample cluster in representing their population of interest. For instance, the researcher
can identify a specific district of Quezon City whose households have the same profile in
terms of the socioeconomic characteristics as the households in the whole Quezon City.

III. Quota Sampling - The nonprobability sampling version of stratified sampling, wherein
you divide the population first into different groups and you select a sample from each
group.

Example 2.3.2.3. A researcher wishes to study people’s views on birth control and
religion. Census results showed that 70% of the population are Christians, 20% are
Muslims, and 10% are non-believers. The researcher then selects a sample reflecting these
proportions to represent the three religious groups.

2.4. Presentation of Data

Textual Presentation – presentation of data incorporated into a paragraph of text.

Example 2.4.1. (Example of Presentation of Data Using Textual Presentation)

She added that from 2012 to 2015, the Philippine National Police (PNP) recorded a total of
27,823 cases of child in conflict with the law.

Furthermore, the PNP data showed that from 2002 to 2015, the percentage of offenses
committed by children in the total number of crimes recorded is very much negligible, Ancheta-
Templa said.

Specifically, the PNP data revealed that the percent distribution of crime committed by adults
is higher at 98 percent, while those involving children is only 2 percent.

Source: Vera-Ruiz E. (October 3, 2018), “Lowering criminal age never resulted in lower crime rates
says DSWD usec,” Manila Bulletin.

Advantages and Disadvantages of Textual Presentation

Advantages
▪ Simplest and most appropriate approach when there are only a few numbers to be presented
▪ Gives emphasis to significant figures and comparisons

Disadvantages
▪ When a large mass of quantitative data is included in a text or paragraph, the presentation
becomes almost incomprehensible.
▪ Written paragraphs can be tiresome to read especially if the same words are repeated so
many times.

Tabular Presentation - the systematic organization of data in rows and columns.

Example 2.4.2. Curse Words used by P. Duterte in his speeches from June 2016 – June 2017

Curse Word    Spelled as is    Used asterisks
PI            248              741
Ga*o          72               143
Ya*a          95               32
Sh*t          41               66

Source: Bueza M. (June 27, 2017). “What were Duterte’s favorite words in his speeches?” Rappler.

Advantages of Tabular Presentation

▪ More concise than textual presentation
▪ Easy to understand
▪ Presents data in greater detail than a graph
▪ Facilitates comparisons and analysis of relationships among different categories

Graphical Presentation - A graph or chart is a device for showing numerical values or relationships
in pictorial form.

Characteristics of a good graph

1. Accuracy - It should not be deceptive, distorted, or in any way susceptible to wrong
interpretations due to careless construction.

2. Clarity - An effective chart can be easily read and understood. There should be an
unambiguous representation of the facts.

3. Appearance - It is one that is designed and constructed to attract and hold attention by
holding a neat, dignified, professional appearance.

4. Simplicity - The basic design of a statistical chart should be simple, straight-forward, and not
loaded with trivial symbols or ornamentation.

Types of Chart

1. Line Chart
When to use: Line charts are most commonly used in presenting historical data. Use it when
you want to focus on the movement of the series over time. You can also use if you want to
compare the trend of two or more time series data.
Rules: The ratio of height to width should be 2:3 or 3:4.
Types:
a. Single Line Chart
b. Multiple Line Chart

Example 2.4.3. CO2 Emissions in the World from 1960 to 2014 (metric tons per capita)

Source: World Bank

2. Column Chart/Vertical Bar Chart


When to use: Use column charts when presenting time series data that focuses on the
magnitude of the series instead of the movement. You should also use it if you want to show
data changes over a period of time among several items.
Rules: The space between bars is around one-fourth the width of the bars. For single time
series, only use one shade of color for the bars.
Types:
a. Simple Column Chart
b. Group Column Chart
c. Subdivided Column Chart
d. 100% Subdivided Column Chart
e. Net deviation Column Chart

Example 2.4.4. Bar Graph of the number of P. Duterte’s speeches from June 2016 – June 2017

Source: Rappler

3. Horizontal Bar Chart


When to use: Use when we are concerned in presenting categorical data. Its main purpose is
to present magnitude per category.
Rules: The space between bars should be between one-fifth and one-half the width
of the bars. The bars should be arranged according to their length (either
ascending or descending). The “Others” section should be the first or last category in the
graph.
Types:
a. Simple Bar Chart
b. Grouped Bar Chart
c. Subdivided Bar Chart
d. Subdivided 100% Bar Chart
e. Pictograph (use when you want to get attention of the reader). In pictographs, a single
symbol accounts for a particular unit of measurement.

Example 2.4.5. Number of Attacks and Casualties in 2016 in the Philippines due to Terrorism by
Attack type

Source: Global Terrorism Database

4. Pie Chart
When to use: Use to present categorical data when the focus is on showing the component parts
relative to the total in terms of the percentage distribution.
Rules: Use pie charts when there are fewer than 6 categories. The categories should be arranged
according to magnitude. Plot the biggest slice at 12 o’clock. The “Others” section should be
plotted last.

Example 2.4.6. Philippine Population by Region

Source: AsiaSociety.Org


2.5. Organization of Data

Raw data are the data in their original form: {X_1, X_2, X_3, …, X_n}

Array or sorted data is an ordered arrangement of the data according to magnitude:
{X_(1), X_(2), X_(3), …, X_(n)}, where X_(1) is the smallest observation and X_(n) is the largest.

The frequency distribution table is a way of summarizing data by showing the number of
observations that belong in the different categories or classes. We also refer to this as grouped data.

1. Single value grouping is a frequency distribution where the classes are distinct values of the
variable.
2. Grouping by class intervals is a frequency distribution where the classes are the intervals.

Terminologies

• Class interval is the range of values that belong in the class or category
• Class frequency is the number of observations that belong in a class interval.
• Class limits are the end numbers used to define the class interval (Upper Class Limit, Lower
Class Limit)
• Open class interval is a class interval with no lower class limit or no upper class limit.
• Closed class interval is a class interval with both lower and upper class limits.
• Class boundaries are the true class limits. Each boundary is the midpoint between the upper
class limit of one class and the lower class limit of the next class (Upper class boundary,
Lower class boundary).
• Class size is the size of the class interval (the difference between two successive upper class
boundaries/limits or two successive lower class boundaries/limits).
• Class mark is the midpoint of a class interval.
• Relative frequency is the class frequency divided by the total number of observations.
• Relative frequency percentage is the relative frequency multiplied by 100.
• Less than cumulative frequency distribution (< CFD) shows the number of observations with
values smaller than or equal to the upper class boundary. To compute it, start from the
first class interval and add the class frequencies cumulatively up to the last class interval.
• Greater than cumulative frequency distribution (> CFD) shows the number of observations
with values larger than or equal to the lower class boundary. To compute it, start from
the last class interval and add the class frequencies cumulatively up to the first class
interval.
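
The terms above can be illustrated with a short Python sketch. The class limits and frequencies below are made-up illustrative values (they do not come from any dataset in these notes), and the variable names are only suggestions.

```python
from itertools import accumulate

# Illustrative class limits and frequencies (made-up values for demonstration).
class_limits = [(10, 19), (20, 29), (30, 39)]
frequencies = [2, 5, 3]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]

# Class boundaries sit halfway across that gap.
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in class_limits]

# Class mark: midpoint of each class interval.
class_marks = [(lo + hi) / 2 for lo, hi in class_limits]

# Class size: difference between two successive lower class limits.
class_size = class_limits[1][0] - class_limits[0][0]

# Relative frequency: class frequency divided by the number of observations.
n = sum(frequencies)
relative_freq = [f / n for f in frequencies]

# <CFD: running total from the first class to the last.
less_than_cfd = list(accumulate(frequencies))

# >CFD: running total from the last class back to the first.
greater_than_cfd = list(accumulate(reversed(frequencies)))[::-1]

print(boundaries, class_marks, class_size)
print(relative_freq, less_than_cfd, greater_than_cfd)
```

Note that the last entry of the <CFD and the first entry of the >CFD both equal the total number of observations.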

How to make a frequency distribution table

1. Determine the number of classes


Sturges’s Rule: K = 1 + 3.322 log(n)
Where log is the base-10 logarithm, K is the number of class intervals, and n is the number
of observations. Note that this formula tends to yield too few classes when n is greater than 200.
2. Determine the range
Range = highest observed value – lowest observed value
3. Determine the class size
Class Size = Range/K
4. List the class intervals

5. Tally the observed values in each class interval.
6. Compute for the class boundaries, class marks, relative frequency, and relative frequency
percentage if deemed necessary.

Example 2.5.1. Given below are the final grades of Stat 101 students arranged in an array. Create
a frequency distribution table (FDT), histogram, and ogives.

50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 92
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96

𝐾 = 1 + 3.322 log(𝑛)
𝐾 = 1 + 3.322 log(110) = 7.7815
𝑅𝑎𝑛𝑔𝑒 = ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑜𝑏𝑠 − 𝑙𝑜𝑤𝑒𝑠𝑡 𝑜𝑏𝑠 = 96 − 50 = 46
𝐶′ = 𝑅𝑎𝑛𝑔𝑒 / 𝐾 = 46 / 7.7815 = 5.9115
Round off C’ to the same number of decimal places as the original dataset, call the result C, and
use C as the class size. The grades are whole numbers, so 𝐶 = 6.
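
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `sturges_classes` is hypothetical, and the rounding step assumes whole-number data as in this example.

```python
import math


def sturges_classes(n):
    """Suggested number of classes, K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


n = 110                      # number of grades in the example
k = sturges_classes(n)       # number of classes suggested by Sturges's rule
data_range = 96 - 50         # highest observation minus lowest observation
c_prime = data_range / k     # unrounded class size C'
class_size = round(c_prime)  # whole-number data, so round to 0 decimal places

print(f"K = {k:.4f}, C' = {c_prime:.4f}, C = {class_size}")
```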

In this example, we will choose the lowest class limit to be the lowest observation. Therefore, the
lowest class limit is 50.

Class    Class      Class      Class  Relative   Relative Frequency  <CFD  >CFD
Limit    Frequency  Boundary   Mark   Frequency  Percentage

50-55    10         49.5-55.5  52.5   0.0909      9.09%               10    110
56-61     6         55.5-61.5  58.5   0.0545      5.45%               16    100
62-67     8         61.5-67.5  64.5   0.0727      7.27%               24     94
68-73    24         67.5-73.5  70.5   0.2182     21.82%               48     86
74-79    22         73.5-79.5  76.5   0.2000     20.00%               70     62
80-85    24         79.5-85.5  82.5   0.2182     21.82%               94     40
86-91    12         85.5-91.5  88.5   0.1091     10.91%              106     16
92-97     4         91.5-97.5  94.5   0.0364      3.64%              110      4
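
The table above can be reproduced from the raw grades with a short Python sketch. The function `frequency_table` and its parameters are hypothetical names chosen for illustration; the `unit` parameter assumes the data are recorded to the nearest whole number, so each upper class limit is one unit below the next lower class limit.

```python
# Stat 101 final grades, copied row by row from the array in the example.
GRADES_TEXT = """
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 92
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
"""
grades = [int(token) for token in GRADES_TEXT.split()]


def frequency_table(data, lowest_limit, class_size, unit=1):
    """Tally data into class intervals of equal size, starting at lowest_limit."""
    n = len(data)
    rows = []
    lower = lowest_limit
    less_cfd = 0
    while lower <= max(data):
        upper = lower + class_size - unit   # upper class limit
        freq = sum(lower <= x <= upper for x in data)
        greater_cfd = n - less_cfd          # observations >= lower class boundary
        less_cfd += freq                    # observations <= upper class boundary
        rows.append({
            "limits": (lower, upper),
            "freq": freq,
            "boundaries": (lower - unit / 2, upper + unit / 2),
            "mark": (lower + upper) / 2,
            "rel_freq": round(freq / n, 4),
            "less_cfd": less_cfd,
            "greater_cfd": greater_cfd,
        })
        lower += class_size
    return rows


table = frequency_table(grades, lowest_limit=50, class_size=6)
for row in table:
    print(row)
```

Sturges's rule suggested about 7.78 classes, but the loop keeps adding classes until the highest observation (96) is covered, which is why the table ends up with eight class intervals.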


Graphical Presentations of Frequency Distribution

1. Frequency Histogram
It shows the distribution of the observed values. It plots the class boundaries on the
horizontal axis and the class frequencies on the vertical axis. We represent each class
frequency by a vertical bar whose height is equal to the frequency of the class interval.
With classes of equal size, the total area of the bars is proportional to the total number of
observations. If the histogram can be divided vertically into two halves such that one half
is a mirror image of the other, then the distribution of the dataset is symmetric; otherwise,
it is skewed.

2. Relative Frequency/Relative Frequency Percentage Histogram
It is like the frequency histogram, but instead of the frequency, the vertical axis shows
the relative frequency (or the relative frequency percentage).

3. Ogives
It is the plot of the cumulative frequency distribution. The less-than ogive is the plot of
the <CFD against the upper class boundaries. The greater-than ogive is the plot of the >CFD
against the lower class boundaries. When the two ogives are superimposed, the horizontal
coordinate of their intersection gives the median of the data. If the less-than ogive looks
like an S-shape and the greater-than ogive looks like an inverted S-shape, then the
distribution of the dataset looks like a symmetric bell-shaped curve.
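
As a rough check of the median claim, the sketch below locates the intersection of the two ogives for the Stat 101 grades by linear interpolation. It assumes the usual plotting convention that the less-than ogive starts at 0 at the lowest class boundary, the greater-than ogive ends at 0 at the highest class boundary, and the points are joined by straight line segments. The function name `ogive_intersection` is hypothetical.

```python
# Class boundaries and class frequencies from the Stat 101 grades table.
boundaries = [49.5, 55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5, 97.5]
frequencies = [10, 6, 8, 24, 22, 24, 12, 4]

n = sum(frequencies)

# Less-than ogive: cumulative count at each boundary, starting from 0.
less_than = [0]
for f in frequencies:
    less_than.append(less_than[-1] + f)

# Greater-than ogive: observations at or above each boundary.
greater_than = [n - c for c in less_than]


def ogive_intersection(xs, rising, falling):
    """Return the x where two piecewise-linear curves cross, or None."""
    for i in range(len(xs) - 1):
        d0 = rising[i] - falling[i]
        d1 = rising[i + 1] - falling[i + 1]
        if d0 <= 0 <= d1 and d0 != d1:
            t = -d0 / (d1 - d0)  # fraction of the way across the segment
            return xs[i] + t * (xs[i + 1] - xs[i])
    return None


median_estimate = ogive_intersection(boundaries, less_than, greater_than)
print(f"Estimated median from the ogives: {median_estimate:.2f}")
```

The two curves cross where both cumulative counts equal n/2 = 55, which falls inside the 73.5-79.5 class. The result is an estimate based on grouped data, so it can differ slightly from the median of the raw observations.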

Example 2.5.2. Histogram and ogives of the Stat 101 grades dataset
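
A rough text rendering of the frequency histogram for the Stat 101 grades can be sketched as follows, using the class boundaries and class frequencies from the frequency distribution table. The bars are drawn horizontally only because that is simpler to print; the function name `text_histogram` is hypothetical.

```python
# Class boundaries and class frequencies from the Stat 101 grades table.
labels = [
    "49.5-55.5", "55.5-61.5", "61.5-67.5", "67.5-73.5",
    "73.5-79.5", "79.5-85.5", "85.5-91.5", "91.5-97.5",
]
frequencies = [10, 6, 8, 24, 22, 24, 12, 4]


def text_histogram(labels, freqs, symbol="#"):
    """Return one line per class: the label, a bar of symbols, and the count."""
    return [f"{label} | {symbol * f} {f}" for label, f in zip(labels, freqs)]


lines = text_histogram(labels, frequencies)
print("\n".join(lines))
```

Dividing each frequency by the total number of observations before drawing the bars would give the relative frequency version of the same picture.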


The stem-and-leaf display (SALD) is a histogram-like summary of the data where the digits of the
data values replace the bars in representing the frequencies.

The difference between an ordinary histogram and the stem-and-leaf display is that the display
does not lose any information on the actual data values, because the data values themselves are
used to create a pictorial form of the distribution of the data.

In constructing the SALD, we split each observation into two parts, referred to as the stem
and the leaf. Example: Suppose 356 is one of the observations. We can separate the digits this way:

Stem Leaf

3 | 56

Steps in creating a SALD

1. Choose the common division point at which each observation will be split into its
stem and leaf components.
2. In a vertical column, list the smallest stem value up to the largest stem value.
3. Draw a vertical line to the right of the stem value.
4. Record the leaf portion of the first observation in the row corresponding to its stem value.
Do the same for all of the observations
5. Sort the leaves within each stem row from lowest to highest. Maintain spacing in between
the leaves.
6. Indicate the unit of the leaves to allow the recreation of the actual data values from the
display. For example, for the entry 35 | 6:
Unit = 0.1 represents 356 x 0.1 = 35.6
Unit = 1 represents 356 x 1 = 356
Unit = 10 represents 356 x 10 = 3560

Example 2.5.3. Listed are the typing speeds (net words per minute) of 20 secretarial
applicants:
68, 52, 65, 58, 46, 72, 75, 35, 61, 55, 91, 63, 84, 69, 66, 47, 55, 45, 32, 71. Create a
stem-and-leaf display.
In this example, we will choose the common division point to be between the tens and ones
places. Example: 3 | 2
3 | 25
4 | 567
5 | 2558
6 | 135689
7 | 125
8 | 4
9 | 1
Unit = 1
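
The steps for building this display can be sketched in Python as follows. The function name `stem_and_leaf` is hypothetical, and the sketch assumes whole-number data split between the tens and ones places (leaf unit = 1), as in this example.

```python
from collections import defaultdict

# Typing speeds (net words per minute) from the example.
speeds = [68, 52, 65, 58, 46, 72, 75, 35, 61, 55,
          91, 63, 84, 69, 66, 47, 55, 45, 32, 71]


def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted list of leaves (ones digit)."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):          # sorting first keeps leaves in order
        stem, leaf = divmod(value, 10)    # split between the tens and ones places
        leaves_by_stem[stem].append(leaf)
    # List every stem from the smallest to the largest, even if it has no leaves.
    first, last = min(leaves_by_stem), max(leaves_by_stem)
    return {stem: leaves_by_stem.get(stem, []) for stem in range(first, last + 1)}


display = stem_and_leaf(speeds)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
print("Unit = 1")
```

Counting the leaves in each row gives the same frequencies a histogram with classes 30-39, 40-49, and so on would show, while the individual values remain recoverable.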


Exercises 2.5.1. Create a FDT, frequency histogram, and less-than and greater-than ogives for the
following dataset.

The following data show the prices of 30 different types of notebooks (Elem. Stat pg. 183 #4)
38.12 64.76 34.29 44.37 56.29
63.10 45.45 66.25 58.96 38.95
39.42 55.50 46.35 38.78 66.91
53.50 43.15 67.29 45.26 44.32
34.07 45.05 44.13 67.69 45.65
66.55 62.13 36.87 45.77 43.89

Stat 101 Class Activity #1

I. Given the frequency distribution of weights in pounds of 50 male college students, fill in the
blanks. (Elementary Statistics page 184 Exercise #6)

Class      Class         Class    Frequency  Relative   RFP    <CFD   >CFD
limits     boundaries    Marks               Frequency
112-120    a.____-120.5  b.____   2          0.04       4%     2      50
121-129    120.5-c.___   125      7          d.______   e.___  f.___  g.__
130-138    129.5-138.5   h._____  10         0.20       20%    i.___  41
139-147    138.5-j.____  143      k.____     0.24       24%    31     l.___
148-156    m.___-156.5   152      11         0.22       22%    42     19
157-165    156.5-165.5   n.____   5          o.______   p.___  q.__   r.___
s.___-174  165.5-174.5   170      3          0.06       6%     50     t.___
Total                             u.____     v.______
II. The lengths of power failures in Hawkins, Indiana in 1983 are recorded in the following table.
13 22 35 40 45 48 52 60 66
15 22 37 42 45 49 55 60 66
18 24 37 42 47 49 56 61 68
21 28 37 43 47 50 58 61 69
21 28 39 44 48 51 59 62 78

1. Construct a frequency distribution table with the following parts: Class limits, Class
frequency, Class boundaries, Class marks, Relative frequency, <CFD, and >CFD. Take note of
the following instructions:
a. Strictly follow Sturges’ guide in making a frequency distribution table. Show all solutions
for the number of classes, the range and the class size.
b. Use the minimum value of your raw data as the lower class limit of the first class interval.
c. Follow the natural rule in rounding off numbers.
d. The relative frequency should be rounded off to 4 decimal places.
2. Construct a frequency histogram. Interpret.
3. Construct a less-than ogive and a greater-than ogive plotted on the same graph. Interpret.
4. Construct an ordinary stem and leaf display. Interpret.
