
CHAPTER 2

DESCRIPTIVE DATA
SQQS1013 ELEMENTARY STATISTICS
ORGANIZING AND
VISUALIZING DATA
Objectives
In this chapter you learn:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Organizing and visualizing a mix of variables.
• The challenge in organizing and visualizing variables.
2.1 INTRODUCTION
Example:
Here is a list of questions asked in a large statistics class and the
data values given by one of the students:
i. What is your sex (m=male, f=female)?
ii. How many hours did you sleep last night?
iii. Randomly pick a letter – S or Q.
iv. What is your height in inches?
v. What’s the fastest you’ve ever driven a car (mph)?
Raw data - Data recorded in the sequence in which they were originally collected, before being processed or ranked.

Array data - Raw data that are arranged in ascending or descending order.
PRESENTATION OF DATA
• Organizing data creates both tabular and visual summaries.
• Summaries both guide further exploration and sometimes facilitate decision making.
• Visual summaries enable rapid review of larger amounts of data and show possible significant patterns.
• Often, the Organize and Visualize steps in DCOVA (Define, Collect, Organize, Visualize, Analyze) occur concurrently.
2.2 PRESENTATION OF
QUALITATIVE DATA
2.2.1 Organizing Categorical Data
Categorical data are organized by tallying:
• One categorical variable: summary table.
• Two categorical variables: contingency table.
• A summary table tallies the frequencies or percentages of
items in a set of categories so that you can see differences
between categories.

Table 2.3 Main Reason Young Adults Shop Online

Reason For Shopping Online?           Frequency   Percent
Better prices                               555       37%
Avoiding holiday crowds or hassles          435       29%
Convenience                                 270       18%
Better selection                            195       13%
Ships directly                               45        3%
Total                                      1500      100%

Percentage = (Frequency / Total frequency) × 100%

Source: Data extracted and adapted from "Main Reason Young Adults Shop Online?" USA Today, December 5, 2012, p. 1A.
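The percentage column of a summary table can be reproduced with a few lines of Python. This is a minimal sketch using the frequencies of Table 2.3; the function name `summary_table` is illustrative and not part of the course material.

```python
# Frequencies from Table 2.3 (main reason young adults shop online).
frequencies = {
    "Better prices": 555,
    "Avoiding holiday crowds or hassles": 435,
    "Convenience": 270,
    "Better selection": 195,
    "Ships directly": 45,
}


def summary_table(freqs):
    """Return {category: (frequency, percentage)} for one categorical variable."""
    total = sum(freqs.values())
    # Percentage = frequency / total frequency * 100
    return {category: (f, 100 * f / total) for category, f in freqs.items()}


table = summary_table(frequencies)
for category, (freq, pct) in table.items():
    print(f"{category:<36}{freq:>6}{pct:>7.0f}%")
```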
• A contingency table:
– Helps organize two or more categorical variables.
– Is used to study patterns that may exist between the responses of two or more categorical variables.
– Cross-tabulates, or tallies jointly, the responses of the categorical variables.
– For two variables, places the tallies for one variable in the rows and the tallies for the second variable in the columns.
Example 2.1 - Contingency Table
• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.

Table 2.4 Contingency Table Showing Frequency of Invoices Categorized By Size and The Presence Of Errors

                 No Errors   Errors   Total
Small Amount           170       20     190
Medium Amount          100       40     140
Large Amount            65        5      70
Total                  335       65     400
Contingency Table Based On Percentage Of Overall Total

Each frequency in Table 2.4 is divided by the overall total of 400:
42.50% = 170 / 400
25.00% = 100 / 400
16.25% = 65 / 400

                 No Errors   Errors    Total
Small Amount        42.50%    5.00%   47.50%
Medium Amount       25.00%   10.00%   35.00%
Large Amount        16.25%    1.25%   17.50%
Total               83.75%   16.25%   100.0%

83.75% of sampled invoices have no errors and 47.50% of sampled invoices are for small amounts.
Contingency Table Based On Percentage of Row Totals

Each frequency in Table 2.4 is divided by its row total:
89.47% = 170 / 190
71.43% = 100 / 140
92.86% = 65 / 70

                 No Errors   Errors    Total
Small Amount        89.47%   10.53%   100.0%
Medium Amount       71.43%   28.57%   100.0%
Large Amount        92.86%    7.14%   100.0%
Total               83.75%   16.25%   100.0%

Medium invoices have a larger chance (28.57%) of having errors than small (10.53%) or large (7.14%) invoices.
Contingency Table Based On Percentage Of Column Totals

Each frequency in Table 2.4 is divided by its column total:
50.75% = 170 / 335
30.77% = 20 / 65

                 No Errors   Errors    Total
Small Amount        50.75%   30.77%   47.50%
Medium Amount       29.85%   61.54%   35.00%
Large Amount        19.40%    7.69%   17.50%
Total               100.0%   100.0%   100.0%

An invoice with errors has a 61.54% chance of being of medium size.
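The three percentage views of a contingency table differ only in the denominator. The following sketch recomputes them from the frequencies of Table 2.4; the variable names and the helper `pct` are illustrative choices, not part of the original notes.

```python
# Frequencies from Table 2.4: each row is (no errors, errors) for one invoice size.
counts = {
    "Small": (170, 20),
    "Medium": (100, 40),
    "Large": (65, 5),
}

grand_total = sum(sum(row) for row in counts.values())        # 400
column_totals = [sum(col) for col in zip(*counts.values())]   # [335, 65]


def pct(part, whole):
    """Percentage rounded to two decimals, matching the tables in the notes."""
    return round(100 * part / whole, 2)


# Denominator = overall total, row total, or column total.
overall = {size: [pct(c, grand_total) for c in row] for size, row in counts.items()}
by_row = {size: [pct(c, sum(row)) for c in row] for size, row in counts.items()}
by_column = {
    size: [pct(c, t) for c, t in zip(row, column_totals)]
    for size, row in counts.items()
}

print(overall["Small"])
print(by_row["Medium"])
print(by_column["Medium"])
```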
2.2.2 Visualizing Categorical Data
Categorical data are visualized according to how they were organized:
• Summary table for one variable: bar chart, Pareto chart, pie or doughnut chart.
• Contingency table for two variables: component / multiple bar chart, doughnut chart.
The Bar Chart
• The bar chart visualizes a categorical variable as a series of bars. The length of each bar represents either the frequency or percentage of values for each category. Each bar is separated by a space called a gap.

[Bar chart of the reasons for shopping online, drawn from the percentages in Table 2.3.]
The Pie Chart
• The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

[Pie chart of the reasons for shopping online, drawn from the percentages in Table 2.3.]
The Doughnut Chart
• The doughnut chart is the outer part of a circle broken up into pieces that represent categories. The size of each piece of the doughnut varies according to the percentage in each category.

[Doughnut Chart of Reasons to Shop Online, drawn from the percentages in Table 2.3.]
The Pareto Chart
• Used to portray categorical data (nominal scale).
• A vertical bar chart, where categories are shown in descending order of frequency.
• A cumulative polygon is shown in the same graph.
• Used to separate the "vital few" from the "trivial many."


The Pareto Chart (con't)
Table 2.5 Ordered Summary Table For Causes Of Incomplete ATM Transactions

Cause                       Frequency   Percent   Cumulative Percent
Warped card jammed                365    50.41%               50.41%
Card unreadable                   234    32.32%               82.73%
ATM malfunctions                   32     4.42%               87.15%
ATM out of cash                    28     3.87%               91.02%
Invalid amount requested           23     3.18%               94.20%
Wrong keystroke                    23     3.18%               97.38%
Lack of funds in account           19     2.62%              100.00%
Total                             724   100.00%

Source: Data extracted from A. Bhalla, "Don't Misuse the Pareto Principle," Six Sigma Forum Magazine, May 2009, pp. 15–18.
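The ordered summary table behind a Pareto chart can be built by sorting the categories by frequency and accumulating the counts. The sketch below reproduces the percent and cumulative percent columns of Table 2.5; the tuple layout of `pareto` is an illustrative choice.

```python
from itertools import accumulate

# Frequencies from Table 2.5 (causes of incomplete ATM transactions).
causes = {
    "Warped card jammed": 365,
    "Card unreadable": 234,
    "ATM malfunctions": 32,
    "ATM out of cash": 28,
    "Invalid amount requested": 23,
    "Wrong keystroke": 23,
    "Lack of funds in account": 19,
}

# Descending order of frequency, as required for a Pareto chart.
ordered = sorted(causes.items(), key=lambda item: item[1], reverse=True)
total = sum(causes.values())
running = list(accumulate(freq for _, freq in ordered))

# Each row: (cause, frequency, percent, cumulative percent).
pareto = [
    (cause, freq, round(100 * freq / total, 2), round(100 * cum / total, 2))
    for (cause, freq), cum in zip(ordered, running)
]

for cause, freq, percent, cumulative in pareto:
    print(f"{cause:<26}{freq:>5}{percent:>8.2f}%{cumulative:>9.2f}%")
```

The cumulative percentages are computed from the running counts rather than by adding rounded percentages, which avoids accumulated rounding error.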
[Pareto chart of the causes of incomplete ATM transactions, with the "vital few" categories marked.]
Multiple (Side By Side) Bar Charts
• The side by side bar chart represents the data from a contingency table.

                 No Errors   Errors    Total
Small Amount        50.75%   30.77%   47.50%
Medium Amount       29.85%   61.54%   35.00%
Large Amount        19.40%    7.69%   17.50%
Total               100.0%   100.0%   100.0%

[Side by side bar chart "Invoice Size Split Out By Errors & No Errors": horizontal axis 0.0% to 70.0%; bars for small, medium, and large invoices, grouped by Errors and No Errors.]

Invoices with errors are much more likely to be of medium size (61.5% vs 30.8% & 7.7%).
Component Bar Charts
• The component bar chart represents the data from a contingency table.

[Component bar chart "Invoice Size Split Out By Errors & No Errors": categories small, medium, and large amount on the horizontal axis; vertical axis 0.00% to 120.00%; built from the column percentages of the invoice contingency table.]
Doughnut Charts
• A doughnut chart can be used to represent the data from a contingency table.

[Doughnut chart "Invoice Size & Errors": inner ring for invoices with errors, outer ring for invoices with no errors; segments for small, medium, and large invoices, built from the column percentages of the invoice contingency table.]
EXERCISE 2.1
A recent consumer survey on holiday shopping reveals the following information on the types of stores at which consumers plan to shop.

Types of Stores                            % of Customers
Stand-alone "big box" stores                           54
Traditional mall                                       61
Local independent stores not in a mall                 35
Strip mall or mini mall                                25
Town hall mall                                         14
I do not plan to shop at any of these                   9

i. Construct a bar chart for the types of stores customers plan to shop at.
ii. Construct a pie chart for the types of stores customers plan to shop at.
iii. Which type of store do the most customers plan to shop at?
iv. What percentage do the top 2 categories of stores that customers plan to shop at make up out of the 6 categories of shopping preferences?
v. What percentage of the customers surveyed mentioned that they did not plan to shop at any of these stores?
2.3 PRESENTATION OF
QUANTITATIVE DATA
2.3.1 Organizing Quantitative Data

Numerical data are organized as an ordered array, frequency distributions, or cumulative distributions.
Ordered Array
• An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
• Shows range (minimum value to maximum value).
• May help identify outliers (unusual observations).

Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Students
18 18 19 19 20 21
23 28 32 33 41 45
Frequency Distribution
• The frequency distribution is a summary table in which the data are arranged into numerically ordered classes (grouped data).
• You must give attention to:
i. Selecting the appropriate number of class groupings, c, for the table (Sturges' Rule), where n is the number of observations and log is the base-10 logarithm:
$$c = 1 + 3.3 \log n$$
c may be rounded up or rounded down.
ii. Determining a suitable width, i, of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping:
$$i > \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}} = \frac{\text{Range}}{c}$$
i must always be rounded up.
iii. Choosing the starting point of the 1st class: use the smallest value in the data set.
Example 2.2
Frequency Distribution Example

A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature.
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Construct a frequency distribution from the data given.
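The class-selection steps can be sketched in Python for the temperature data. This is a hedged illustration, not the official worked solution: it assumes whole-number data, rounds c to the nearest integer (one of the permitted choices), and takes the width as the smallest whole number strictly greater than range / c.

```python
import math

temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

n = len(temperatures)

# Sturges' Rule: c = 1 + 3.3 log10(n); rounding to the nearest integer is an assumption.
c = round(1 + 3.3 * math.log10(n))

# Class width: smallest whole number strictly greater than range / c.
data_range = max(temperatures) - min(temperatures)
i = math.floor(data_range / c) + 1

# Class limits, starting the first class at the smallest value in the data set.
start = min(temperatures)
classes = [(start + k * i, start + (k + 1) * i - 1) for k in range(c)]

print(c, i)
print(classes)
```

With these choices the classes start at 12 and have width 10, matching the class limits 12 - 21, 22 - 31, and so on, used for this data set in the later examples.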
Why Use a Frequency Distribution?
• It condenses the raw data into a more useful form.
• It allows for a quick visual interpretation of the data.
• It enables the determination of the major characteristics of the data set including where
the data are concentrated / clustered.

Frequency Distribution: Tips


• As the size of the data set increases, the impact of alterations in the selection of class
boundaries is greatly reduced.

• When comparing two or more groups with different sample sizes, you must use either a
relative frequency or a percentage distribution.
2.3.2 Visualizing Numerical Data
Numerical data are visualized as follows:
• Ordered array: stem-and-leaf display.
• Frequency distributions and cumulative distributions: histogram, polygon, ogive.
Stem-and-Leaf Display
• A simple way to see how the data are distributed and where concentrations of data exist.

METHOD: Separate the sorted data series into leading digits (the stems) and the trailing digits (the leaves).
Stem and Leaf Display
• A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.

Age of Surveyed College Students

Day Students
16 17 17 18 18 18
19 19 20 20 21 22
22 25 27 32 38 42

Night Students
18 18 19 19 20 21
23 28 32 33 41 45

Day Students          Night Students
Stem   Leaf           Stem   Leaf
1      67788899       1      8899
2      0012257        2      0138
3      28             3      23
4      2              4      15
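The METHOD above maps directly to integer division: for two-digit values the stem is the tens digit and the leaf is the units digit. A minimal sketch, assuming non-negative two-digit data (the function name `stem_and_leaf` is illustrative):

```python
from collections import defaultdict

day_students = [16, 17, 17, 18, 18, 18, 19, 19, 20,
                20, 21, 22, 22, 25, 27, 32, 38, 42]
night_students = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]


def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its leaves (units digits)."""
    display = defaultdict(str)
    for value in sorted(values):          # sort first so leaves appear in order
        display[value // 10] += str(value % 10)
    return dict(display)


for stem, leaves in stem_and_leaf(day_students).items():
    print(stem, leaves)
```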
The Histogram
• A vertical bar chart of the data in a frequency distribution is called a histogram.
• In a histogram there are no gaps between adjacent bars.
• The class boundaries (or class midpoints) are shown on the horizontal axis.
• The vertical axis is either frequency, relative frequency, or percentage.
• The height of the bars represents the frequency, relative frequency, or percentage.
Class      Frequency   Relative Frequency   Percentage
12 - 21            3                  .15           15
22 - 31            6                  .30           30
32 - 41            5                  .25           25
42 - 51            4                  .20           20
52 - 61            2                  .10           10
Total             20                 1.00          100

[Histogram: Temperature, with frequency on the vertical axis.]

(In a percentage histogram, the vertical axis would be defined to show the percentage of observations per class.)
The Polygon
• A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
• The cumulative percentage polygon, or ogive, displays the variable of interest along the X axis, and the cumulative percentages along the Y axis.
• Useful when there are two or more groups to compare.

[Figures: the frequency polygon and the percentage polygon, useful when comparing two or more groups.]
Ogive
• An ogive is a curve drawn for the cumulative frequency distribution.
• Two types of ogive:
(1) "less than" ogive
(2) "greater than" ogive
• Steps:
– Build a table of cumulative frequency.
– Draw the x and y axes. Label x = class boundaries, y = cumulative frequencies.
– Plot the graph using the appropriate class boundary.
– Join the first appropriate class boundary to the consecutive points.
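The first step, building the cumulative frequency table, can be sketched as follows for the temperature classes (12 - 21 through 52 - 61). The sketch assumes integer class limits, so each class boundary lies half a unit outside the limits; the list names are illustrative.

```python
from itertools import accumulate

# Class limits and frequencies from the temperature frequency distribution.
classes = [(12, 21), (22, 31), (32, 41), (42, 51), (52, 61)]
frequencies = [3, 6, 5, 4, 2]

# Class boundaries: half a unit below each lower limit, plus the last upper boundary.
boundaries = [lower - 0.5 for lower, _ in classes] + [classes[-1][1] + 0.5]

total = sum(frequencies)
less_than = [0] + list(accumulate(frequencies))   # number of values below each boundary
greater_than = [total - cf for cf in less_than]   # number of values above each boundary

for b, lt, gt in zip(boundaries, less_than, greater_than):
    print(f"{b:>5}  less than: {lt:>2}  greater than: {gt:>2}")
```

Plotting `less_than` against `boundaries` gives the "less than" ogive, and plotting `greater_than` gives the "greater than" ogive.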
2.3.3 Visualizing Two Numerical Variables
Two numerical variables are visualized with a scatter plot or a time-series plot.
The Scatter Plot
• Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables.
• One variable is measured on the vertical axis and the other variable is measured on the horizontal axis.
• Scatter plots are used to examine possible relationships between two numerical variables.
Scatter Plot Example

Volume per day   Cost per day
            23            125
            26            140
            29            146
            33            160
            38            167
            42            170
            50            188
            55            195
            60            200
The Time Series Plot
• A time-series plot is used to study patterns in the values of a numeric variable over time.
• In a time-series plot, the numeric variable is measured on the vertical axis and the time period is measured on the horizontal axis.
Time Series Plot Example

Number of franchises, 2007 to 2015

Year   Number of Franchises
2007                     43
2008                     54
2009                     60
2010                     73
2011                     82
2012                     95
2013                    107
2014                     99
2015                     95

[Time-series plot: number of franchises (vertical axis, 0 to 120) against year (horizontal axis, 2007 to 2015).]
EXERCISE 2.2
The histogram below represents scores achieved by 200 job applicants on a personality profile.

[Histogram: relative frequency (vertical axis, 0.00 to 0.30) against score (horizontal axis, 0 to 70 in classes of width 10). Three classes have a relative frequency of 0.20 and four have a relative frequency of 0.10.]

i. What percentage of the job applicants scored between 10 and 20?
ii. What percentage of the job applicants scored below 50?
iii. How many job applicants scored from 30 to below 60?
iv. How many job applicants scored 50 or above?
v. 90% of the job applicants scored above or equal to ________.
vi. Half of the job applicants scored below ________.
NUMERICAL
DESCRIPTIVE MEASURE
Objectives
In this topic, you learn to:
• Describe the properties of central tendency, variation, and
shape in numerical variables.
• Construct and interpret a boxplot.
Summary
• The central tendency is the extent to which the values of a numerical variable group around a typical or central value.
• The variation is the amount of dispersion or scattering away from a central value that the values of a numerical variable show.
• The shape is the pattern of the distribution of values from the lowest value to the highest value.
2.4 MEASURE OF
CENTRAL TENDENCY
2.4.1 MEAN
2.4.1.1 UNGROUPED DATA

• The arithmetic mean (often just called the "mean") is the most common measure of central tendency.
• Mean = sum of values divided by the number of values.
• Affected by extreme values (outliers).
• Population mean, $\mu$, if the data come from a population.
• Sample mean, $\bar{X}$, if the data come from a sample.
For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where $\bar{X}$ is pronounced "x-bar", $X_i$ is the ith observed value, and $n$ is the sample size.
EXAMPLE 2.3

Data: 11, 12, 13, 14, 15 (plotted on a scale from 11 to 20)
Mean = (11 + 12 + 13 + 14 + 15) / 5 = 65 / 5 = 13

Data: 11, 12, 13, 14, 20 (plotted on a scale from 11 to 20)
Mean = (11 + 12 + 13 + 14 + 20) / 5 = 70 / 5 = 14
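2.4.1 MEAN
2.4.1.2 GROUPED DATA

$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n}$$

where $\bar{X}$ is pronounced "x-bar", $X_i$ is the midpoint of the ith class, $f_i$ is the frequency of the ith class, and $\sum f_i$ is the total frequency (the sample size). Here the sums run over the classes.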
EXAMPLE 2.3
a. During a semester, a student took five exams. The population of exam scores is 78, 83, 92, 68, and 85. Find the mean. (406, 81.2)
b. The following table shows the speeds (in km/h) of 30 cars measured at a certain checkpoint. Find the mean. (1504, 50.13)
41 53 58 67 33 61 43 45 42 67
39 48 36 47 34 59 57 54 65 69
63 42 60 48 66 30 30 46 52 49
c. The following table presents the daily high temperatures recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Approximate the mean of the daily high temperature. (34.5)

Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
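The three parts of this example can be checked with a short script. This is a sketch only: `mean` and `grouped_mean` are illustrative helper names, and the grouped version uses class midpoints as stand-ins for the unknown individual values, so it approximates the mean.

```python
exam_scores = [78, 83, 92, 68, 85]
speeds = [41, 53, 58, 67, 33, 61, 43, 45, 42, 67,
          39, 48, 36, 47, 34, 59, 57, 54, 65, 69,
          63, 42, 60, 48, 66, 30, 30, 46, 52, 49]
classes = [(12, 21), (22, 31), (32, 41), (42, 51), (52, 61)]
frequencies = [3, 6, 5, 4, 2]


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def grouped_mean(class_limits, freqs):
    """Approximate mean: sum of f * midpoint divided by the total frequency."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)


print(sum(exam_scores), mean(exam_scores))
print(sum(speeds), round(mean(speeds), 2))
print(grouped_mean(classes, frequencies))
```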
2.4.2 MEDIAN
2.4.2.1 UNGROUPED DATA

• In an ordered array, the median is the "middle" number (50% above, 50% below).
• Less sensitive than the mean to extreme values.

[Two dot plots on a scale from 11 to 20; Median = 13 in both.]
Procedure for Computing the Median
1. Arrange the data in ascending order.
2. Find the location of the median when the values are in numerical order (smallest to largest):

$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$

– If the number of values is odd, the median is the middle number.
– If the number of values is even, the median is the average of the two middle numbers.
3. Find the median.
2.4.2 MEDIAN
2.4.2.2 GROUPED DATA
• Procedure:
1. Construct the cumulative frequency distribution.
2. Determine the median class.
3. Compute the median:

$$\text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i$$

where $L_m$ is the lower boundary of the median class, $n$ is the total frequency, $F$ is the cumulative frequency before the median class, $f_m$ is the frequency of the median class, and $i$ is the class width.
EXAMPLE 2.4
a. During a semester, a student took five exams. The population of exam scores is 78, 83, 92, 68, and 85. Find the median. (83)
b. One of the goals of medical research is to develop treatments that reduce the time spent in recovery. Eight patients undergo a new surgical procedure, and the number of days spent in recovery for each is as follows. Find the median. (17)
c. The following table presents the daily high temperatures recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Approximate the median of the daily high temperature. (33.5)

Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
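Parts (a) and (c) can be verified with the sketch below. The helper names are illustrative; `grouped_median` assumes integer class limits (boundaries half a unit outside the limits) and equal class widths.

```python
exam_scores = [78, 83, 92, 68, 85]
classes = [(12, 21), (22, 31), (32, 41), (42, 51), (52, 61)]
frequencies = [3, 6, 5, 4, 2]


def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def grouped_median(class_limits, freqs):
    """Median = L_m + ((n/2 - F) / f_m) * i for the first class whose cumulative frequency reaches n/2."""
    n = sum(freqs)
    width = class_limits[1][0] - class_limits[0][0]   # assumes equal class widths
    cumulative = 0                                    # F: cumulative frequency before the class
    for (lower, _), f in zip(class_limits, freqs):
        if cumulative + f >= n / 2:
            return (lower - 0.5) + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")


print(median(exam_scores))
print(grouped_median(classes, frequencies))
```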
2.4.3 MODE
2.4.3.1 UNGROUPED DATA
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.

[Two dot plots: on a scale from 0 to 14, Mode = 9; on a scale from 0 to 6, No Mode.]
2.4.3 MODE
2.4.3.2 GROUPED DATA
• Determine the modal class (class mode): the class with the highest frequency.
• Use the following formula:

$$\text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i$$

where $L_{mo}$ is the lower boundary of the modal class, $\Delta_1$ is the difference between the frequency of the modal class and the frequency of the class before it, $\Delta_2$ is the difference between the frequency of the modal class and the frequency of the class after it, and $i$ is the class width.
Approximating mode using histogram

[Histogram: frequency against number of text messages, with class boundaries -0.5, 49.5, 99.5, 149.5, 199.5, 249.5, 299.5.]

MODE = 140
EXAMPLE 2.5
a. Ten students were asked how many siblings they had. The results, arranged in order, were
0 1 1 1 1 2 2 3 3 6
Find the mode of this data set. (1)
b. The following table presents the daily high temperatures recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Approximate the mode of the daily high temperature. (29.0)

Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
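Both parts can be reproduced with the sketch below. The helper names are illustrative. `grouped_mode` takes the difference between the modal class and the class before it as the first difference, and the difference between the modal class and the class after it as the second, which is the reading that yields the stated answer of 29.0. It assumes integer class limits, equal widths, and a single modal class.

```python
from collections import Counter

siblings = [0, 1, 1, 1, 1, 2, 2, 3, 3, 6]
classes = [(12, 21), (22, 31), (32, 41), (42, 51), (52, 61)]
frequencies = [3, 6, 5, 4, 2]


def modes(values):
    """Return all values with the highest count; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


def grouped_mode(class_limits, freqs):
    """Mode = L_mo + (d1 / (d1 + d2)) * i for the class with the highest frequency."""
    m = freqs.index(max(freqs))                        # index of the modal class
    before = freqs[m - 1] if m > 0 else 0              # frequency of the class before
    after = freqs[m + 1] if m < len(freqs) - 1 else 0  # frequency of the class after
    d1, d2 = freqs[m] - before, freqs[m] - after
    width = class_limits[1][0] - class_limits[0][0]    # assumes equal class widths
    return (class_limits[m][0] - 0.5) + d1 / (d1 + d2) * width


print(modes(siblings))
print(grouped_mode(classes, frequencies))
```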
Which Measure to Choose?
• The mean is generally used, unless extreme values (outliers) exist.
• The median is often used when outliers exist, since the median is not sensitive to extreme values. For example, median home prices may be reported for a region because the median is less sensitive to outliers.
• In many situations it makes sense to report both the mean and the median.
Describing the Shape of a Data Set
• The mean and median measure the center of a data set in different ways. When a data set is symmetric with a single peak, the mean, median and mode are equal.
• When a data set is skewed to the right, there are large values in the right tail. Because the median is resistant while the mean is not, the mean is generally more affected by these large values. Therefore, for a data set that is skewed to the right, the mean is often greater than the median, which is in turn greater than the mode.
• Similarly, when a data set is skewed to the left, the mean is often less than the median, which is in turn less than the mode.

i. Approximately Symmetric
Relationship between the mean, median and mode: the mean, median and mode are approximately the same.

ii. Skewed to the Right
Relationship between the mean, median and mode: the mean is noticeably greater than the median, which is greater than the mode.

iii. Skewed to the Left
Relationship between the mean, median and mode: the mean is noticeably less than the median, which is less than the mode.
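Summary of Measures of Central Tendency

Measure: Mean
  Ungrouped data: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
  Grouped data: $\bar{X} = \frac{\sum f_i X_i}{\sum f_i}$

Measure: Mode
  Ungrouped data: the value with the highest frequency (there may be more than one)
  Grouped data: $\text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i$

Measure: Median
  Ungrouped data: the value at position $\frac{n + 1}{2}$ in the ordered data
  Grouped data: $\text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i$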
2.5 MEASURE OF
POSITION

Measures of position are techniques that divide a set of data into equal groups. To determine a measure of position, the data must be sorted from lowest to highest. The measures of position covered here are percentiles and quartiles.
2.5.1 PERCENTILES
• The mean and median of a data set describe the center of a
distribution (quantitative).
• For some data it is often useful to compute measures of positions
other than the center, to get a more detailed description of the
distribution.
• Percentiles provide a way to do this. Percentiles divide a data set into
hundredths.
• Definition: For a number p between 1 and 99, the pth percentile
separates the lowest p% of the data from the highest (100 – p)%.

2.5.1 PERCENTILES
UNGROUPED DATA
• First, the data need to be arranged in increasing order.
• To compute the data value corresponding to a given percentile, p, in a data set of n values, compute the position L:

$$L = \frac{p}{100} \cdot n$$

– If L is a whole number, then the pth percentile is the average of the number in position L and the number in position (L+1).
– If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position corresponding to the rounded-up value.

• To compute the percentile corresponding to a given data value, X:

$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{n} \cdot 100$$

– Round the result to the nearest whole number.
EXAMPLE 2.6
A teacher gives a 20-point test to 10 students. The scores are shown
here.
18 15 12 6 8 2 3 5 20 10
1. Find the values corresponding to the 25th and 60th percentiles (5, 11).

2. Find the percentile ranks of scores of 6 and 12 (35, 65).
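The two percentile procedures can be sketched in Python. Two formulas are assumed here: the position L = (p / 100) · n, and the percentile rank ((number of values below X) + 0.5) / n · 100. With these assumptions the sketch reproduces the answers given in Example 2.6. The function names are illustrative.

```python
import math

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]


def percentile_value(values, p):
    """Data value at the pth percentile (1 <= p <= 99), using position L = p * n / 100."""
    ordered = sorted(values)
    position = p * len(ordered) / 100
    if position == int(position):
        k = int(position)
        # Whole-number position: average the values in positions L and L + 1.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round the position up and take the value found there.
    return ordered[math.ceil(position) - 1]


def percentile_rank(values, x):
    """Percentile corresponding to data value x, rounded to the nearest whole number."""
    below = sum(1 for v in values if v < x)
    return round(100 * (below + 0.5) / len(values))


print(percentile_value(scores, 25), percentile_value(scores, 60))
print(percentile_rank(scores, 6), percentile_rank(scores, 12))
```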

2.5.2 QUARTILES
• There are 3 percentiles that are used more often than the others: the 25th, the 50th, and the 75th.
• These percentiles divide the data into 4 parts, each of which contains approximately one quarter of the data.
• Thus, these 3 percentiles are called quartiles.
• You can visualize the distribution of the values for a numerical variable by:
– Computing the quartiles.
– Computing the five-number summary.
– Constructing a boxplot.
2.5.2 QUARTILE MEASURES
2.5.2.1 UNGROUPED DATA
• Quartiles split the ranked data into 4 segments with an equal number of values per segment.

  25%  |  25%  |  25%  |  25%
       Q1      Q2      Q3

• The first quartile, Q1, is the value for which 25% of the values are smaller and 75% are larger.
• Q2 is the same as the median (50% of the values are smaller and 50% are larger).
• Only 25% of the values are greater than the third quartile, Q3, which separates the lowest 75% of the data from the highest 25%.
• Determining quartiles:
i. Arrange the data in ascending order.
ii. Find the 25th and 75th percentiles, or find the depth of Q1 and Q3.
iii. Determine the values based on the positions.
2.5.2 QUARTILE MEASURES
2.5.2.2 GROUPED DATA
• Recall the procedure for approximating the median using grouped data.
• Determining quartiles:
– Construct the cumulative frequency distribution.
– Determine the quartile classes:
– Q1 class
– Q3 class
– Find the values.
EXAMPLE 2.7
• Following are final exam scores, arranged in increasing order, for 28 students.
58 59 62 64 67 68 69 71 73 74 74 75 76 76
76 77 78 78 78 82 82 84 86 87 87 88 91 97

a. Find the 1st quartile of the scores (70).
b. Find the 3rd quartile of the scores (83).
EXAMPLE 2.8
The following table presents the daily high temperatures recorded by a manufacturer of insulation for 20 randomly selected winter days (refer to Example 2.2). Calculate Q1 and Q3.

Class      Frequency   Cumulative Frequency
12 - 21            3                      3
22 - 31            6                      9
32 - 41            5                     14
42 - 51            4                     18
52 - 61            2                     20
Total             20
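One way to approach this example is to reuse the grouped-median formula with n/2 replaced by n/4 for Q1 and by 3n/4 for Q3. That substitution is an assumption made for this sketch, by analogy with the median procedure the notes ask you to recall. The function `grouped_quantile` is a hypothetical helper, and it assumes integer class limits and equal class widths.

```python
classes = [(12, 21), (22, 31), (32, 41), (42, 51), (52, 61)]
frequencies = [3, 6, 5, 4, 2]


def grouped_quantile(class_limits, freqs, fraction):
    """L + ((fraction * n - F) / f) * i for the first class whose cumulative frequency reaches fraction * n."""
    n = sum(freqs)
    target = fraction * n
    width = class_limits[1][0] - class_limits[0][0]   # assumes equal class widths
    cumulative = 0                                    # F: cumulative frequency before the class
    for (lower, _), f in zip(class_limits, freqs):
        if cumulative + f >= target:
            return (lower - 0.5) + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")


q1 = grouped_quantile(classes, frequencies, 0.25)
q3 = grouped_quantile(classes, frequencies, 0.75)
print(round(q1, 2), round(q3, 2))
```

Under these assumptions Q1 falls in the class 22 - 31 and Q3 in the class 42 - 51; with a fraction of 0.5 the same function returns the grouped median of 33.5.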
Conclusions: Measures of Position
Measurement (ungrouped and grouped data):
• Percentiles
• 1st quartile
• 3rd quartile
2.6 MEASURE OF
DISPERSION
Measures of variation: range, interquartile range, variance, standard deviation, and coefficient of variation.

• Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: two distributions with the same center but different variation.]
2.6.1 THE RANGE
2.6.1.1 UNGROUPED DATA
• Simplest measure of variation.
• Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example:
[Dot plot on a scale from 0 to 14; the smallest value is 1 and the largest value is 13.]

Range = 13 - 1 = 12
2.6.1 THE RANGE
2.6.1.2 GROUPED DATA

Class       Frequency
41 – 50             1
51 – 60             3
61 – 70             7
71 – 80            13
81 – 90            10
91 – 100            6
Total              40

Upper bound of last class = 100.5
Lower bound of first class = 40.5
Range = 100.5 – 40.5 = 60
Why The Range Can Be Misleading
• Does not account for how the data are distributed.

[Two dot plots on a scale from 7 to 12 with different distributions; Range = 12 - 7 = 5 in both.]

• Sensitive to outliers.

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
EXAMPLE 2.9
The following table presents the average monthly temperature, in degrees Fahrenheit,
for the cities of San Francisco and St. Louis. Compute the range for each city.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San Francisco 51 54 55 56 58 60 60 61 63 62 58 52
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35
Source: National Weather Service
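A short sketch of the range calculation for this example, together with the grouped-data version used earlier (upper boundary of the last class minus lower boundary of the first class). The function names are illustrative, and `grouped_range` assumes integer class limits.

```python
san_francisco = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]
st_louis = [30, 35, 44, 57, 66, 75, 79, 78, 70, 59, 45, 35]


def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


def grouped_range(class_limits):
    """Upper boundary of the last class minus lower boundary of the first class."""
    return (class_limits[-1][1] + 0.5) - (class_limits[0][0] - 0.5)


print(data_range(san_francisco), data_range(st_louis))
print(grouped_range([(41, 50), (51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]))
```

San Francisco's monthly temperatures span 12 degrees, while St. Louis's span 49 degrees.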

2.6.2 INTERQUARTILE RANGE (IQR)
• Quartiles can be used as a rough measurement of variability.
• The interquartile range is the range of the middle 50% of the data.
• The IQR is a measure of variability that is not influenced by outliers or extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.
• It is defined as the third quartile minus the first quartile.

IQR = Q3 – Q1
EXAMPLE 2.10
The table below lists the total revenue for the top 12 tourism companies in Malaysia.
109.7 79.9 74.1 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5 86.8
Determine the interquartile range of the data.

Ordered data:
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2

Answer using the 25th and 75th percentiles: Q1 = 79.65, Q3 = 100.75, IQR = 21.1
Answer using the depths of Q1 and Q3: Q1 = 79.5, Q3 = 102.1, IQR = 22.6
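The two sets of answers quoted for this example correspond to two different quartile conventions. The sketch below assumes the percentile rule with position L = (p / 100) · n for the first, and linear interpolation at depths (n + 1) / 4 and 3(n + 1) / 4 for the second. Both are assumptions chosen because they reproduce the quoted figures, and the function names are hypothetical.

```python
import math

revenue = [109.7, 79.9, 74.1, 121.2, 76.4, 80.2,
           82.1, 79.4, 89.3, 98.0, 103.5, 86.8]


def quartiles_by_percentile(values):
    """Q1 and Q3 as the 25th and 75th percentiles, using position L = p * n / 100."""
    ordered = sorted(values)
    n = len(ordered)

    def value_at(p):
        position = p * n / 100
        if position == int(position):
            k = int(position)
            return (ordered[k - 1] + ordered[k]) / 2   # average positions L and L + 1
        return ordered[math.ceil(position) - 1]        # round the position up

    return value_at(25), value_at(75)


def quartiles_by_depth(values):
    """Q1 and Q3 by linear interpolation at depths (n + 1) / 4 and 3 (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)

    def value_at(depth):
        k = int(depth)                                 # whole part of the depth
        fraction = depth - k
        if fraction == 0:
            return ordered[k - 1]
        return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])

    return value_at((n + 1) / 4), value_at(3 * (n + 1) / 4)


for method in (quartiles_by_percentile, quartiles_by_depth):
    q1, q3 = method(revenue)
    print(method.__name__, round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```

Either convention is acceptable as long as it is applied consistently; the IQR differs slightly (21.1 versus about 22.6) because the quartile positions differ.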

2.6.3 VARIANCE
• Although the range is easy to compute, it is not often used in practice. The
reason is that the range involves only two values from the data set; the largest
and smallest.
• The measures of spread that are most often used are the variance and the
standard deviation, which use every value in the data set.
• When a data set has a small amount of spread, like the San Francisco
temperatures, most of the values will be close to the mean. When a data set has
a larger amount of spread, more of the data values will be far from the mean.
• The variance is a measure of how far the values in a data set are from the
mean, on the average.
• The variance is computed slightly differently for populations and samples.

Population variance:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

where N is the number of observations in the population.

Sample variance:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

where n is the number of observations in the sample.

• In the sample formula, the mean μ is replaced by the sample mean and the denominator is n – 1 instead of N. The sample variance is denoted by s².
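Sample Variance

Ungrouped data
• 1st: Find the sum of the values ($\sum x$).
• 2nd: Square each value and find the sum of the squares ($\sum x^2$).
• 3rd: Calculate the sample variance:

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$

Grouped data
• 1st: Compute the midpoint (x) for each class.
• 2nd: Multiply the midpoint by the class frequency ($fx$). Find the sum ($\sum fx$).
• 3rd: Square the midpoint ($x^2$) and multiply by the frequency ($fx^2$), then sum the values ($\sum fx^2$).
• 4th: Calculate the sample variance, where $n = \sum f$:

$$s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n - 1}$$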
EXAMPLE 2.11
A company that manufactures batteries is testing a new type of battery designed for
laptop computers. They measure the lifetimes, in hours, of six batteries, and the results
are presented in the following table. Find the variance of the lifetimes. (2)

Battery Lifetime 3 4 6 5 4 2
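A minimal sketch of the calculation for this example, using the definition of the variance as the average squared deviation from the mean. The function names are illustrative; the population version is included only to show the effect of the different denominator.

```python
lifetimes = [3, 4, 6, 5, 4, 2]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


print(sample_variance(lifetimes))
print(round(population_variance(lifetimes), 4))
```

The mean lifetime is 4 hours and the squared deviations sum to 10, so the sample variance is 10 / 5 = 2 squared hours.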

EXAMPLE 2.12

No. of text messages sent   No. of students (frequency, f)   Class midpoint, x        fx
0 – 49                                                  10                24.5     245.0
50 – 99                                                  5                74.5     372.5
100 – 149                                               13               124.5    1618.5
150 – 199                                               11               174.5    1919.5
200 – 249                                                7               224.5    1571.5
250 – 299                                                4               274.5    1098.0
Total                                                                             6825
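The grouped calculation can be continued from the fx column. The sketch assumes the computational formula s² = (Σfx² − (Σfx)² / n) / (n − 1) with n = Σf; this formula is an assumption consistent with the grouped-data steps (sum fx, then sum fx²), and the variable names are illustrative.

```python
import math

classes = [(0, 49), (50, 99), (100, 149), (150, 199), (200, 249), (250, 299)]
frequencies = [10, 5, 13, 11, 7, 4]

midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(frequencies)

sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))        # total of the fx column
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))   # total of the fx^2 column

# Assumed computational formula for the grouped sample variance.
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2)
print(round(variance, 2), round(std_dev, 2))
```

Because the class midpoints stand in for the individual values, the result is an approximation of the variance of the raw data.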

2.6.4 STANDARD DEVIATION
• Because the variance is computed using squared deviations, the units of the
variance are the squared units of the data.
• For example, in the battery lifetime example (Example 2.11), the units of the data are hours, and
the units of the variance are squared hours.
• In most situations, it is better to use a measure of spread that has the same
units as the data.
• We do this simply by taking the square root of the variance. This quantity is
called the standard deviation.
• The standard deviation of a sample is denoted s, and the standard deviation
of a population is denoted by σ.

Important properties of standard
deviation
• The standard deviation is a measure of variation of all values from the
mean.
• The value of the standard deviation is usually positive (it is never
negative).
• The value of the standard deviation can increase dramatically with the
inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation are the same as the units of the
original data values.

Comparing Standard Deviations

[Figure: two distributions, one with a smaller standard deviation and one with a larger standard deviation.]


Summary Characteristics
 The more the data are spread out, the greater the range, variance, and standard
deviation.

 The more the data are concentrated, the smaller the range, variance, and
standard deviation.

 If the values are all the same (no variation), all these measures will be zero.

 None of these measures are ever negative.


2.6.5 THE COEFFICIENT OF
VARIATION
• Measures relative variation.

• Always in percentage (%).

• Shows variation relative to mean.

• Can be used to compare the variability of two or more sets of data measured in different units.

$$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$$
EXAMPLE 2.13 Comparing Coefficients of
Variation
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.

• Stock B:
– Mean price last year = $100.
– Standard deviation = $5.
Comparing Coefficients of Variation (con’t)
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.

• Stock C:
– Mean price last year = $8.
– Standard deviation = $2.
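Applying the CV formula to the three stocks makes the comparison concrete. This is a small sketch; the dictionary layout and function name are illustrative.

```python
# (mean price last year, standard deviation) for each stock in Example 2.13.
stocks = {
    "A": (50, 5),
    "B": (100, 5),
    "C": (8, 2),
}


def coefficient_of_variation(std_dev, mean):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100


cv = {name: coefficient_of_variation(s, m) for name, (m, s) in stocks.items()}
for name, value in cv.items():
    print(f"Stock {name}: CV = {value:.0f}%")
```

Stocks A and B have the same standard deviation, but B is less variable relative to its mean price (5% against 10%). Stock C has the smallest standard deviation of the three yet the largest coefficient of variation (25%).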
Conclusions: Measures of Dispersion
Measurement (ungrouped and grouped data):
• Range
• Interquartile range: IQR = Q3 – Q1
• Variance
• Standard deviation
2.7 MEASURE OF
SKEWNESS/SHAPE
• Describes how data are distributed.
• Two useful shape-related statistics are:
– Skewness: measures the extent to which data values are not symmetrical.
– Kurtosis: measures the peakedness of the curve of the distribution, that is, how sharply the curve rises approaching the center of the distribution.
2.7.1 COEFFICIENT OF SKEWNESS
• Measures the extent to which data are not symmetrical.
• To determine the skewness of the data:
– If the value is positive, the data are right skewed.
– If the value is negative, the data are left skewed.
– If the value is 0, the data are symmetric.

                     Left-Skewed     Symmetric       Right-Skewed
Mean vs. median      Mean < Median   Mean = Median   Median < Mean
Skewness statistic   < 0             0               > 0
2.7.2 KURTOSIS
Measures how sharply the curve rises approaching the center of the distribution.
• Sharper peak than bell-shaped: Kurtosis > 0
• Bell-shaped: Kurtosis = 0
• Flatter than bell-shaped: Kurtosis < 0
The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
• Xlargest.
• Third Quartile (Q3).
• Median (Q2).
• First Quartile (Q1).
• Xsmallest.

• This summary is more informative when it is displayed on a diagram drawn to scale.
• A graphic display that accomplishes this is known as a box-and-whiskers display (boxplot).
Five Number Summary and
The Boxplot
• The Boxplot: a graphical display of the data based on the five-number summary:

Xsmallest -- Q1 -- Median -- Q3 -- Xlargest

Example:
[Boxplot: 25% of the data lie in each of the four intervals between Xsmallest, Q1, Median, Q3, and Xlargest.]
Calculating The Interquartile Range

Example:

X minimum   Q1   Median (Q2)   Q3   X maximum
       12   30            45   57          70

25% of the data lie in each of the four intervals.

Interquartile range = 57 – 30 = 27
Five Number Summary:
Shape of Boxplots
• If data are symmetric around the median, then the box and central line are centered between the endpoints.
• A boxplot can be shown in either a vertical or horizontal orientation.

Distribution Shape and
The Boxplot
[Three boxplots marked Q1, Q2, Q3: left-skewed, symmetric, and right-skewed.]
Chapter Summary
In this chapter we covered:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Describing the properties of central tendency, variation, and shape in
numerical variables.
• Constructing and interpreting a boxplot.
