
Week 3:

Descriptive statistics for continuous/grouped data

MCD2080: Business Statistics
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to
you by or on behalf of Monash University pursuant to Part
VB of the Copyright Act 1968 (the Act).
The material in this communication may be subject to
copyright under the Act. Any further reproduction or
communication of this material by you may be the subject of
copyright protection under the Act.
Do not remove this notice.
Descriptive statistics for continuous/grouped data
Tables
Frequency,
Relative frequency,
Percentage relative frequency,
Cumulative frequency,
Relative cumulative frequency

Graphs
Histograms
Frequency polygons (for equal and unequal
class intervals)
Ogive (how to read percentile from an ogive)

Numerical measures
measures of location from grouped data
(mean, median, mode)
measures of dispersion from grouped data
(range, variance, standard deviation, CV)
measures of relative standing
(percentiles, IQR)
Shape of distributions
symmetric, positively skewed, negatively skewed
Describing distributions
How to go about compiling a report from a set of
data
Tabular Methods for continuous data
Example 2: Length of phone calls
A Melbourne firm is interested in the length of interstate
telephone calls made by its employees.
The firm instituted a program to encourage employees
to reduce the number of calls longer than 14 minutes.
A sample of 30 interstate phone calls is taken (is this
enough?)
It is desired to extract information from this data on
employee behaviour with regard to making interstate
calls of duration more than 14 minutes.
Tabular Methods for continuous data
Example 2: Length of phone calls
The raw data set (call lengths in minutes):
11.8 3.6 16.6 13.5 4.8 8.3
8.9 9.1 8.1 2.3 12.1 6.1
10.2 8.0 11.4 6.8 9.6 19.5
15.3 12.3 8.5 15.9 18.7 11.7
6.2 11.2 10.4 7.2 5.5 14.5
Tabular Methods for continuous data: grouped frequency
distribution: sort into several classes (bins)
Usually between 5 and 20 bins.
The number depends partly on the amount of data.
Aim for enough smoothing that the shape of the
data is clear, without losing all detail.
Any thresholds of particular interest should form a
class boundary.
Duration of calls example:
We are interested in calls longer than 14 minutes, so 14 should be a class boundary.
No call is shorter than 2 minutes or longer than 20 minutes.
We have only 30 observations, so we can't afford too many
classes.
Taking all this into account:

We will have 6 class intervals, each 3 minutes wide, in the phone calls example.
Refer to pages 47-48 of the textbook.
Tabular Methods for continuous data:
Grouped frequency distribution
Class (bin)   Frequency           Relative      Cumulative      Relative cumulative
              (number of calls)   frequency     frequency       frequency
2 - 5         3                   3/30 = 0.1    3               0.1
5 - 8         6                   6/30 = 0.2    3+6 = 9         0.1+0.2 = 0.3
8 - 11        8                   0.267         3+6+8 = 17      0.1+0.2+0.267 = 0.567
11 - 14       7                   0.233         24              0.8
14 - 17       4                   0.133         28              0.933
17 - 20       2                   0.067         30              1.0
Total         30                  1.0
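The table above can be rebuilt from the raw data. The following is a minimal sketch, assuming the six classes run from 2 to 20 minutes in steps of 3 and that each class includes its upper boundary but not its lower one; the names `calls`, `boundaries` and `frequency_table` are illustrative, not part of the course material.

```python
# Call durations in minutes (the raw data set from Example 2).
calls = [
    11.8, 3.6, 16.6, 13.5, 4.8, 8.3,
    8.9, 9.1, 8.1, 2.3, 12.1, 6.1,
    10.2, 8.0, 11.4, 6.8, 9.6, 19.5,
    15.3, 12.3, 8.5, 15.9, 18.7, 11.7,
    6.2, 11.2, 10.4, 7.2, 5.5, 14.5,
]

# Assumed class boundaries: six classes of width 3, from 2 to 20 minutes.
boundaries = [2, 5, 8, 11, 14, 17, 20]


def frequency_table(data, edges):
    """Return rows (lower, upper, freq, rel_freq, cum_freq, rel_cum_freq).

    Each class is (lower, upper]: it includes its upper boundary
    but not its lower one.
    """
    n = len(data)
    rows = []
    cumulative = 0
    for lower, upper in zip(edges, edges[1:]):
        freq = sum(1 for x in data if lower < x <= upper)
        cumulative += freq
        rows.append((lower, upper, freq, freq / n, cumulative, cumulative / n))
    return rows


table = frequency_table(calls, boundaries)
for lower, upper, freq, rel, cum, rel_cum in table:
    print(f"{lower:>2} - {upper:<2}  {freq:>2}  {rel:.3f}  {cum:>2}  {rel_cum:.3f}")
```

The call of exactly 8 minutes is counted in the 5 - 8 class under this convention; with the opposite convention it would move to 8 - 11.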
Tabular Methods for continuous data: grouped
frequency distribution
No overlap between classes.
(Use the Excel convention on class boundaries: a
class includes its upper boundary, but not its lower one.)
Classes cover all the data.
Relative frequency: the proportion of
observations in each interval (class).
Cumulative frequency: the number of observations
up to and including each upper class boundary.
Relative cumulative frequency: the same count expressed as a proportion of all observations.
Tabular Methods for continuous data: grouped frequency
distribution
Based on our sample of 30 calls:
What percentage of calls are less than 14 minutes in
length?
What percentage of calls are between 8 and 11
minutes long?
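The two questions about calls under 14 minutes and calls between 8 and 11 minutes can be answered from the grouped frequencies alone. This is a small sketch, assuming the upper-boundary-inclusive convention (so "less than 14" is read as "14 or less", which makes no difference here because no call lasts exactly 14 minutes); the helper names are hypothetical.

```python
# Frequencies from the grouped table, keyed by (lower, upper) class limits.
frequencies = {
    (2, 5): 3, (5, 8): 6, (8, 11): 8,
    (11, 14): 7, (14, 17): 4, (17, 20): 2,
}
n = sum(frequencies.values())  # 30 calls in the sample


def percent_up_to(boundary):
    """Percentage of calls in classes whose upper limit is <= boundary."""
    count = sum(f for (lower, upper), f in frequencies.items() if upper <= boundary)
    return 100 * count / n


def percent_in_class(lower, upper):
    """Percentage of calls falling in one class of the table."""
    return 100 * frequencies[(lower, upper)] / n


up_to_14 = percent_up_to(14)                 # relative cumulative frequency at 14
between_8_and_11 = percent_in_class(8, 11)   # relative frequency of one class
longer_than_14 = 100 - up_to_14              # the calls the firm wants to reduce

print(f"Up to 14 minutes:   {up_to_14:.1f}%")
print(f"Between 8 and 11:   {between_8_and_11:.1f}%")
print(f"Longer than 14:     {longer_than_14:.1f}%")
```

The first answer is a relative cumulative frequency and the second is a relative frequency, which is why the table carries both columns.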
Graphical Methods for continuous data: Histograms
A histogram is a bar chart displaying frequency
against class intervals, where the height of each bar is
chosen so that the area of the bar represents
the (relative) frequency.
The widths of the classes must therefore be taken into account.
For equal class widths, the heights are proportional to the frequencies, as in a
simple bar chart of the frequencies.
This is called a frequency histogram.
The same idea gives a relative frequency or percentage frequency
histogram.
Graphical Methods for continuous data:
Percentage frequency histogram
[Figure: percentage frequency histogram. Horizontal axis: length of calls, with classes <= 2, 2 - <= 5, 5 - <= 8, 8 - <= 11, 11 - <= 14, 14 - <= 17, 17 - <= 20 and > 20. Vertical axis: percentages, from 0% to 30%.]
Proportion of interstate calls in each 3 minute band.
Variable widths of class intervals
Suppose instead the final class was 17 - 29.
If height corresponded to frequency:
[Figure: "Length of telephone calls". Horizontal axis: length (minutes), ticks from 3.5 to 27.5 in steps of 3. Vertical axis: frequency, from 0 to 10.]
Wrong!
Variable widths of class intervals
Instead, make area correspond to frequency.
Use frequency density as the height for unequal class intervals.
[Figure: "Call lengths: Frequency density histogram". Horizontal axis: length (minutes), ticks from 3.5 to 27.5 in steps of 3. Vertical axis: frequency density, from 0 to 3.]
Correct!
Class   Class limits   Class midpoint   Frequency   Frequency density *
1       2 - 5          3.5              3           3/3 = 1
2       5 - 8          6.5              6           6/3 = 2
3       8 - 11         9.5              8           8/3 = 2.67
4       11 - 14        12.5             7           7/3 = 2.33
5       14 - 17        15.5             4           4/3 = 1.33
6       17 - 29        23               2           2/12 = 0.17
* Frequency density = frequency / class width
Class width = upper limit - lower limit
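The frequency density calculation can be checked with a few lines of code. This sketch uses the class limits and frequencies from the unequal-width example (final class 17 - 29); the function name `frequency_density` is illustrative.

```python
# (lower limit, upper limit, frequency) for each class; the last class is wider.
classes = [
    (2, 5, 3), (5, 8, 6), (8, 11, 8),
    (11, 14, 7), (14, 17, 4), (17, 29, 2),
]


def frequency_density(lower, upper, freq):
    """Bar height such that height * class width equals the frequency."""
    return freq / (upper - lower)


densities = [frequency_density(lower, upper, freq) for lower, upper, freq in classes]

# Check that bar area (height * width) recovers each frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

for (lower, upper, freq), d in zip(classes, densities):
    print(f"{lower:>2} - {upper:<2}  frequency {freq}  density {d:.2f}")
```

Plotting the raw frequency of 2 over the 12-minute-wide class would make its bar area four times too large relative to the 3-minute classes, which is the error shown in the "Wrong!" chart.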
Graphical Methods for continuous data:
Frequency Polygons
A Frequency Polygon displays frequency or frequency
density above class midpoints.
A Relative Frequency Polygon does the same for relative
frequency.
The polygon is usually closed by adding one
extra class with zero frequency at each end of the
distribution and extending a straight line to its midpoint.
It is useful for getting a general idea of the shape of the
data distribution (similar to the Frequency Histogram).
More than one distribution can be graphed on the same
axes.
If the intervals have different widths, use a frequency density
polygon.
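The points of a frequency polygon can be listed before plotting. The sketch below assumes the six equal-width classes of the phone calls example and closes the polygon with one zero-frequency class at each end, taking each extra class to have the same width as its neighbour; `polygon_points` is a hypothetical helper.

```python
def polygon_points(edges, freqs):
    """Return (midpoint, frequency) pairs, closed with a zero at each end."""
    midpoints = [(lower + upper) / 2 for lower, upper in zip(edges, edges[1:])]
    first_width = edges[1] - edges[0]
    last_width = edges[-1] - edges[-2]
    # Midpoints of the two extra classes, one before the first class
    # and one after the last class, each with zero frequency.
    start = (edges[0] - first_width / 2, 0)
    end = (edges[-1] + last_width / 2, 0)
    return [start] + list(zip(midpoints, freqs)) + [end]


points = polygon_points([2, 5, 8, 11, 14, 17, 20], [3, 6, 8, 7, 4, 2])
for x, y in points:
    print(f"midpoint {x:>4}  frequency {y}")
```

Joining these points in order with straight lines gives the polygon; for unequal class widths the frequencies would be replaced by frequency densities.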
Graphical Methods for continuous data:
Frequency Polygon
[Figure: "Call duration: frequency polygon". Horizontal axis: call duration in minutes, class midpoints 0.5, 3.5, 6.5, 9.5, 12.5, 15.5, 18.5, 21.5. Vertical axis: frequency, from 0 to 9.]
Add two extra classes
with 0 frequency
Duration of interstate calls - Ogive
Vertical axis of an ogive: cumulative percentage frequency.
Plot the values corresponding to the endpoints of the intervals.
Join the points by straight lines.
Read off deciles, quartiles, etc.
[Figure: "Duration of interstate calls - Ogive". Horizontal axis: length of call (minutes), from 0 to 20. Vertical axis: cumulative percentage, from 0% to 100%.]
Values read from the ogive:
Q1 = 7.2
Median = 10.2
Q3 = 13.3
90th percentile = 16.2
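Reading a percentile from an ogive amounts to linear interpolation between the plotted points. The sketch below assumes the class boundaries and cumulative frequencies of the phone calls table; `percentile_from_ogive` is an illustrative name. Its results agree with the values read from the graph to within about 0.1 minute, the difference being graph-reading precision.

```python
# Ogive points: class boundaries and the cumulative frequency at each one.
boundaries = [2, 5, 8, 11, 14, 17, 20]
cum_freq = [0, 3, 9, 17, 24, 28, 30]


def percentile_from_ogive(p, edges, cumulative):
    """Estimate the p-th percentile (0 <= p <= 100) by linear interpolation."""
    total = cumulative[-1]
    target = p / 100 * total  # cumulative frequency we are looking for
    for i in range(1, len(edges)):
        if cumulative[i] >= target:
            lower, upper = edges[i - 1], edges[i]
            below, up_to = cumulative[i - 1], cumulative[i]
            if up_to == below:  # empty class: nothing to interpolate along
                return lower
            return lower + (target - below) / (up_to - below) * (upper - lower)
    return edges[-1]


q1 = percentile_from_ogive(25, boundaries, cum_freq)
median = percentile_from_ogive(50, boundaries, cum_freq)
q3 = percentile_from_ogive(75, boundaries, cum_freq)
p90 = percentile_from_ogive(90, boundaries, cum_freq)

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, P90 = {p90:.2f}")
```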
Measures of Location and dispersion from raw data
Duration of interstate calls example
30 observations on lengths of interstate calls
Sum = 308.1
Mean = 308.1/30 = 10.27 minutes.
Quartiles from the position formula, position = (n + 1) \frac{p}{100}
(Use listing on next slide)

Median = ____ , Q1 = ____ , and Q3 = ____ minutes
Range = Maximum - Minimum
IQR = Q3 - Q1
Sample variance = 18.36 minutes-squared, using
s^2 = \frac{1}{30 - 1} \sum_{i=1}^{30} (x_i - \bar{x})^2
Sample standard deviation = 4.28 minutes
Coefficient of variation = 100 * standard deviation / mean
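The raw-data figures for the 30 call lengths can be reproduced with Python's standard `statistics` module. This is a sketch only; the variable names are illustrative, and `statistics.variance` is used because it divides by n - 1, matching the sample variance formula.

```python
import statistics

# Call durations in minutes (the raw data set from Example 2).
calls = [
    11.8, 3.6, 16.6, 13.5, 4.8, 8.3,
    8.9, 9.1, 8.1, 2.3, 12.1, 6.1,
    10.2, 8.0, 11.4, 6.8, 9.6, 19.5,
    15.3, 12.3, 8.5, 15.9, 18.7, 11.7,
    6.2, 11.2, 10.4, 7.2, 5.5, 14.5,
]

total = sum(calls)
mean = statistics.mean(calls)
variance = statistics.variance(calls)   # sample variance, divisor n - 1
std_dev = statistics.stdev(calls)       # square root of the sample variance
data_range = max(calls) - min(calls)    # maximum - minimum
cv = 100 * std_dev / mean               # coefficient of variation, in percent

print(f"Sum = {total:.1f}, mean = {mean:.2f} minutes")
print(f"Sample variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"Range = {data_range:.1f} minutes, CV = {cv:.1f}%")
```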
Calculation of Quartiles from raw data
Position Value Position Value
1 2.3 16 10.2
2 3.6 17 10.4
3 4.8 18 11.2
4 5.5 19 11.4
5 6.1 20 11.7
6 6.2 21 11.8
7 6.8 22 12.1
8 7.2 23 12.3
9 8 24 13.5
10 8.1 25 14.5
11 8.3 26 15.3
12 8.5 27 15.9
13 8.9 28 16.6
14 9.1 29 18.7
15 9.6 30 19.5
First quartile position:
(n + 1) \frac{p}{100} = 31 \times \frac{25}{100} = 7.75
Median position:
(n + 1) \frac{p}{100} = 31 \times \frac{50}{100} = 15.5
Halfway between the 15th and 16th
positions: average of 9.6 and 10.2,
or
9.6 + 0.5*(10.2-9.6) = 9.9
Third quartile position?
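The position formula can be written as a short function. This is a minimal sketch, assuming the position (n + 1)p/100 is counted from 1 in the sorted list and that a fractional position is resolved by interpolating between the two neighbouring values, as in the median calculation above; `percentile_by_position` is a hypothetical name.

```python
import math

# Call durations in minutes, sorted to match the position listing.
sorted_calls = sorted([
    11.8, 3.6, 16.6, 13.5, 4.8, 8.3,
    8.9, 9.1, 8.1, 2.3, 12.1, 6.1,
    10.2, 8.0, 11.4, 6.8, 9.6, 19.5,
    15.3, 12.3, 8.5, 15.9, 18.7, 11.7,
    6.2, 11.2, 10.4, 7.2, 5.5, 14.5,
])


def percentile_by_position(sorted_data, p):
    """p-th percentile using the position (n + 1) * p / 100 (1-based)."""
    n = len(sorted_data)
    position = (n + 1) * p / 100
    whole = math.floor(position)
    fraction = position - whole
    if whole < 1:       # position falls before the first observation
        return sorted_data[0]
    if whole >= n:      # position falls at or beyond the last observation
        return sorted_data[-1]
    lower_value = sorted_data[whole - 1]   # value at the whole-number position
    upper_value = sorted_data[whole]       # value at the next position
    return lower_value + fraction * (upper_value - lower_value)


median = percentile_by_position(sorted_calls, 50)
q1 = percentile_by_position(sorted_calls, 25)
q3 = percentile_by_position(sorted_calls, 75)
print(f"Median = {median:.1f} minutes")
```

The median printed here matches the 9.9 minutes worked out by hand; `q1` and `q3` can be used to check the remaining quartiles and the IQR.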
Measures of Location and Dispersion from grouped data:
Duration of interstate calls example
Quartiles from the ogive (the position formula is not applicable to
grouped data):
Median = 10.2, Q1 = 7.2, and Q3 = 13.3 minutes
IQR = Q3 - Q1 = 13.3 - 7.2 = 6.1 minutes
For the following, use midpoint estimation
(see following slides):
Mean
Sample variance
Sample standard deviation
Modal class
Numerical Measures for Grouped continuous data
Notation for grouped frequency distribution
k = number of classes
m_j = midpoint of the j-th class
f_j = frequency of the j-th class
Population size: N = \sum_{j=1}^{k} f_j
Sample size: n = \sum_{j=1}^{k} f_j
Sample mean: \bar{x} \approx \frac{1}{n} \sum_{j=1}^{k} f_j m_j
Population mean: \mu \approx \frac{1}{N} \sum_{j=1}^{k} f_j m_j
Numerical Measures for grouped data
Population Variance: \sigma^2 \approx \frac{1}{N} \sum_{j=1}^{k} f_j (m_j - \mu)^2
Population Standard Deviation: \sigma = \sqrt{\sigma^2}
Sample Variance: s^2 \approx \frac{1}{n - 1} \sum_{j=1}^{k} f_j (m_j - \bar{x})^2
Sample Standard Deviation: s = \sqrt{s^2}
Calculating approximate mean and variance from
grouped Frequency Distribution for interstate calls
Class     Freq.   Lower   Upper   Class      Freq x      Freq x (midpoint - mean)^2
          f_j     bound   bound   midpoint   midpoint    f_j (m_j - \bar{x})^2
                                  m_j        f_j m_j
2 - 5     3       2       5       3.5        10.5        142.83
5 - 8     6       5       8       6.5        39          91.26
8 - 11    8       8       11      9.5        76          6.48
11 - 14   7       11      14      12.5       87.5        30.87
14 - 17   4       14      17      15.5       62          104.04
17 - 20   2       17      20      18.5       37          131.22
Total     30                                 312         506.7
Class midpoint = (lower limit + upper limit)/2
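The midpoint estimates for grouped data can be reproduced in code. The sketch below assumes the six classes of the phone calls table and treats every observation as sitting at its class midpoint; variable names are illustrative, and picking the modal class by highest frequency is only appropriate here because the classes have equal width.

```python
import math

# (lower bound, upper bound, frequency) for each class.
classes = [
    (2, 5, 3), (5, 8, 6), (8, 11, 8),
    (11, 14, 7), (14, 17, 4), (17, 20, 2),
]

n = sum(freq for _, _, freq in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [freq for _, _, freq in classes]

# Approximate mean: sum of frequency x midpoint, divided by n.
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
mean = sum_fm / n

# Approximate sample variance: sum of f * (m - mean)^2, divided by n - 1.
sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
variance = sum_sq / (n - 1)
std_dev = math.sqrt(variance)

# Modal class: the class with the highest frequency (equal widths assumed).
modal_lower, modal_upper, _ = max(classes, key=lambda c: c[2])

print(f"Mean ~ {mean:.1f}, variance ~ {variance:.2f}, std. dev. ~ {std_dev:.2f}")
print(f"Modal class: {modal_lower} - {modal_upper} minutes")
```

The grouped estimates differ slightly from the raw-data values because the midpoint stands in for every observation in its class.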
Summary measures based on grouped frequency
distribution
Approximate measures based on the grouped frequency table for
STD calls:
Mean ≈ 312/30 = 10.4 minutes
Variance ≈ 506.7/(30-1) = 17.47 minutes-squared
Std. dev. ≈ √17.47 = 4.18 minutes

The exact measures based on the raw data of STD calls:
Mean = 10.27 minutes, Variance = 18.36 minutes-squared,
and standard deviation = 4.28 minutes.
One advantage of grouped data is that we can identify the
modal class.
Modal class is 8 - 11 minutes.
Shapes of distributions
Distributions are often
symmetric and bell-shaped, with a
single mode (unimodal).
Data distributions may have
more than one mode, and may
not be symmetric.
More than one mode: bimodal
or multimodal.
Asymmetry is referred to as
skewness.
Skewness is often associated
with outliers.
Detecting Outliers
Example from previous lecture
[Figure: "Number of Television sets per household". Horizontal axis: number of televisions, from 0 to 10. Vertical axis: frequency, from 0 to 20. One observation is labelled "Outlier".]
Shapes of distributions
The best way to get information on your
data's distribution is to plot a
histogram!
1. Modality: is the distribution
unimodal, bimodal or
multimodal?
2. Symmetry: is the distribution
symmetric or skewed?
Measures of location can tell
us quite a bit about the shape
of the distribution, especially its
skewness.
[Figure panels: symmetric unimodal; symmetric bimodal.]
Shapes of distributions
[Figure panels:]
Positively skewed (skewed right)
Negatively skewed (skewed left)
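Measures of Location & Skewness
Symmetric distribution:
mean = median (= mode if
unimodal)
[Figure: symmetric curve with mean = median = mode marked at the centre.]
Skewed-Right:
mean > median (> mode if
unimodal)
[Figure: right-skewed curve with the mode, median and mean marked.]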
Measures of Location & Skewness
Skewed-Left:
mean < median
(< mode if
unimodal)
[Figure: left-skewed curve with the mode, median and mean marked.]
So the difference between the mean and the median (or the
mean and the mode) can tell us whether the distribution is
skewed and, if so, in which direction.
Please note: this rule holds in general, but there can be exceptions to it.
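The mean-versus-median rule can be applied to the phone calls data as a quick check. This is a sketch of the rule of thumb only, not a formal skewness measure; the function `skew_direction` and its `tolerance` parameter are hypothetical, and as noted the rule has exceptions.

```python
import statistics

# Call durations in minutes (the raw data set from Example 2).
calls = [
    11.8, 3.6, 16.6, 13.5, 4.8, 8.3,
    8.9, 9.1, 8.1, 2.3, 12.1, 6.1,
    10.2, 8.0, 11.4, 6.8, 9.6, 19.5,
    15.3, 12.3, 8.5, 15.9, 18.7, 11.7,
    6.2, 11.2, 10.4, 7.2, 5.5, 14.5,
]


def skew_direction(mean, median, tolerance=0.0):
    """Rule of thumb: compare mean and median to suggest a skew direction.

    `tolerance` is an assumed threshold below which the two measures are
    treated as equal; choosing it is a judgment call.
    """
    if mean - median > tolerance:
        return "skewed right"
    if median - mean > tolerance:
        return "skewed left"
    return "roughly symmetric"


mean = statistics.mean(calls)
median = statistics.median(calls)  # average of the 15th and 16th sorted values
direction = skew_direction(mean, median)

print(f"Mean = {mean:.2f}, median = {median:.2f}: {direction}")
```

Here the mean exceeds the median by well under half a minute, so the suggestion of positive skew is weak and should be checked against the histogram.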
Case Studies in Descriptive Statistics
Presenting case studies: tell a story
Identify the variable of interest.
Choose appropriate tabular and graphical methods.
Calculate numerical summary statistics.
Summarize your findings in non-technical language: what do
your tables, graphs, and statistics tell you about the data and the
variable of interest?
Point out any interesting or surprising results.
Explain differences between statistics that are supposedly
measuring the same thing, e.g. measures of central location.
When comparing distributions, consider location, dispersion, shape, etc.
Comment on outliers, if any.
Tutorial: week 3
Tutorial:
Exercise Q13 - Q15
Computing Lab:
Excel exercises 18-19.
When completed, work on Assignment 1.
Next week
Basic Probability

Reading:
Unit Guide, Section 2.1.
Selvanathan, Chapter 4, Sections 4.1 - 4.5
