Chapter2 Descriptive Statistics New

STAT 2606 Course Notes
Chapter 2 Descriptive Statistics


Once a sample has been selected, we compute descriptive statistics to describe features of the sample.
Descriptive statistics may be either numerical (mean, median, variance) or graphical (histogram, boxplot,
barplot, etc.). There are two main types of numerical descriptive statistics: measures of central tendency and
measures of variation.

2.1. Measures of Central Tendency
In practice, we obtain data as a result of a random experiment. We next learn techniques to summarize and
describe a dataset.
Notation: We denote the n observations of the random variable X by $x_1, x_2, \ldots, x_n$.
Central Tendency: We consider the sample mean, the sample median and the mode.
Definition: The sample mean is the average of the n values, and is denoted $\bar{x}$.
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Definition: The sample median divides the ordered data into two equal halves, and is denoted $M_d$.
To find $M_d$:
First order the observations from smallest to largest.
If n is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.

Definition: The mode of a population or sample is the measurement that occurs most frequently, and is
denoted $M_O$.
NOTE: Sometimes the highest frequency occurs at two or more different measurements, in which case the data are
multimodal. If exactly two modes exist, the data are bimodal.



Example: A study measured the amount of time (in minutes) taken for students to move from one class to
another. The n = 10 measurements are listed below:
9, 8, 10, 10, 12, 6, 11, 10, 12, 8.
Thus, $x_1 = 9$, $x_2 = 8$, ..., $x_{10} = 8$. Find the sample mean, sample median and mode.

Solution:
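The hand calculation can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and are not part of the course notes.

```python
from statistics import mean, median, multimode

# Class-change times (minutes) from the example above
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

x_bar = mean(times)        # sum of the observations divided by n
m_d = median(times)        # n is even, so the average of the two middle values
modes = multimode(times)   # every value tied for the highest frequency

print(x_bar, m_d, modes)
```

For these data the sketch gives a mean of 9.6, a median of 10 and a single mode of 10.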

Definition: The 100p-th percentile is a data value such that approximately 100p% of the observations are at or
below this value and approximately 100(1-p)% of them are above this value.
Special cases:
The 25th percentile is also called the first quartile ($Q_1$).
The 75th percentile is also called the third quartile ($Q_3$).
The 50th percentile is the median and is also called the second quartile ($Q_2$).
Note that the quartiles divide the data into four equal parts.



Example (cont'd): In our previous example of time taken to arrive at class, compute the first and third quartiles.
Solution:
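One possible way to compute the quartiles is sketched below in Python. It assumes the convention that the first and third quartiles are the medians of the lower and upper halves of the ordered data; the helper name `quartiles` is hypothetical. Other conventions, such as those that interpolate between positions, can give slightly different values.

```python
from statistics import median

# Class-change times (minutes) from the example
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2      # when n is odd, the median is counted in both halves
    return median(xs[:half]), median(xs[-half:])

q1, q3 = quartiles(times)
print(q1, q3)
```

Under this assumed convention the ordered data split into 6, 8, 8, 9, 10 and 10, 10, 11, 12, 12, so the sketch returns 8 and 11.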

2.2. Measures of Dispersion (Variability)
The variability or scatter in the data may be measured by the sample variance, the sample standard deviation,
the range and the interquartile range.
Definition: The sample variance (denoted $s^2$) is calculated as:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Definition: The sample standard deviation (denoted s) is calculated as the square root of the sample variance:
$s = \sqrt{s^2}$
Definition: The range is the difference between the maximum and minimum values:
Range = $\max(x_i) - \min(x_i)$
Definition: The interquartile range (denoted IQR) is the difference between the third and first quartiles:
$IQR = Q_3 - Q_1$



Example (cont'd): For the previous data on amount of time taken to arrive at class, compute the sample
variance, sample standard deviation, range and interquartile range.
Solution:
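As a check on the hand calculation, the four measures can be sketched in Python with the standard-library `statistics` module. The quartile convention (medians of the lower and upper halves of the ordered data) is an assumption; other conventions may give a slightly different IQR.

```python
from statistics import median, stdev, variance

# Class-change times (minutes) from the example
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

s2 = variance(times)                  # sum of squared deviations divided by n - 1
s = stdev(times)                      # square root of the sample variance
data_range = max(times) - min(times)  # maximum minus minimum

xs = sorted(times)
half = (len(xs) + 1) // 2             # assumed convention: quartiles are medians of the halves
iqr = median(xs[-half:]) - median(xs[:half])

print(round(s2, 2), round(s, 3), data_range, iqr)
```

The squared deviations from the mean of 9.6 sum to 32.4, so the sample variance is 32.4 / 9 = 3.6 and the sample standard deviation is about 1.897. The range is 12 - 6 = 6 and, under the assumed convention, the IQR is 11 - 8 = 3.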



Sample vs. Population
Note that the data are a sample of observations that have been selected from a larger population of
observations.
We may consider the mean of all measurements in the population, which is called the population mean ($\mu$).
If there are N equally likely observations in a finite population, then
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Similarly, the variability in the population is defined by the population variance ($\sigma^2$). If there are N observations
in the population, then
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The square root of $\sigma^2$ is the population standard deviation ($\sigma$), i.e. $\sigma = \sqrt{\sigma^2}$.
NOTE: The sample mean $\bar{x}$ estimates the population mean $\mu$, and the sample variance $s^2$ estimates the
population variance $\sigma^2$. The sample mean and sample variance are sometimes called point estimates of the
population mean and population variance, respectively.

Furthermore, we have the following definitions:
Statistic: a numerical measure which is computed from a sample (e.g. $\bar{x}$, $s^2$).
Parameter: a numerical measure which is computed from a population (e.g. $\mu$, $\sigma^2$).
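The only computational difference between the population variance and the sample variance is the divisor (N versus n - 1). The Python sketch below illustrates this by treating the ten class-change times from the earlier example as if they were an entire population. That treatment is purely an assumption for illustration, since the notes regard them as a sample.

```python
from statistics import pvariance, variance

# The ten class-change times, treated here (for illustration only) as a whole population
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

as_population = pvariance(times)   # sum of squared deviations divided by N
as_sample = variance(times)        # the same sum divided by n - 1

print(round(as_population, 2), round(as_sample, 2))
```

The sum of squared deviations is 32.4 in both cases, giving 32.4 / 10 = 3.24 with the population formula and 32.4 / 9 = 3.6 with the sample formula.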

Empirical Rule (p. 49)
To interpret the population standard deviation, we consider the Empirical Rule (see Figure 2.22 on page 49).
For a normally distributed population with mean $\mu$ and standard deviation $\sigma$:
1. 68.26% of the population measurements are within (plus or minus) one standard deviation of the
mean. In other words, 68.26% of the measurements lie in the interval $[\mu - \sigma, \mu + \sigma] = [\mu \pm \sigma]$.
2. 95.44% of the population measurements are within (plus or minus) two standard deviations of the
mean. In other words, 95.44% of the measurements lie in the interval $[\mu - 2\sigma, \mu + 2\sigma] = [\mu \pm 2\sigma]$.
3. 99.73% of the population measurements are within (plus or minus) three standard deviations of the
mean. In other words, 99.73% of the measurements lie in the interval $[\mu - 3\sigma, \mu + 3\sigma] = [\mu \pm 3\sigma]$.

NOTE: The intervals $[\mu \pm \sigma]$, $[\mu \pm 2\sigma]$ and $[\mu \pm 3\sigma]$ are sometimes called tolerance intervals around $\mu$
containing 68.26%, 95.44% and 99.73% of the measurements, respectively.
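The Empirical Rule percentages can be reproduced from the standard normal distribution. The sketch below assumes Python's standard-library `statistics.NormalDist`; the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {100 * p:.2f}%")
```

Computed this way the percentages round to 68.27%, 95.45% and 99.73%. The values 68.26% and 95.44% quoted above are what results from doubling four-decimal normal table entries (0.3413 and 0.4772).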


Chebyshev's Theorem (p. 52)
If we believe the Empirical Rule does not hold for a particular population, we may use Chebyshev's Theorem to
find an interval containing a specified percentage of the individual measurements of the population.
Chebyshev's Theorem: Consider any population that has mean $\mu$ and standard deviation $\sigma$. Then for any value
of k greater than 1, at least $100(1 - 1/k^2)$ percent of the population measurements lie in the interval $[\mu \pm k\sigma]$.
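To get a feel for the bound 100(1 - 1/k^2), it can be evaluated for a few values of k. The following Python sketch is only an illustration; the function name `chebyshev_lower_bound` is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 100 * (1 - 1 / k**2)

bounds = {k: round(chebyshev_lower_bound(k), 2) for k in (1.5, 2, 3)}
print(bounds)
```

For k = 2 the bound is 75% and for k = 3 it is about 88.89%. Both are lower than the Empirical Rule values because the theorem makes no assumption about the shape of the population.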

Definition: To determine the relative location of any value x in a population or sample, we may calculate the
z-score, as follows:
$z = \frac{x - \text{mean}}{\text{standard deviation}}$

Example (cont'd): For the data on class arrival times, consider the following:
a) Assuming the distribution of arrival times is approximately normal, calculate estimates of
the tolerance intervals containing 68.26%, 95.44% and 99.73% of the population of arrival times.
b) If a student's arrival time is 13 minutes, should this be considered unusually high?
c) Calculate the z-score for the arrival time in (b).
Solution:
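A rough numerical check of parts (a) and (c) is sketched below in Python. It assumes that the sample mean and sample standard deviation are used as point estimates of the population mean and standard deviation; the variable names are illustrative.

```python
from statistics import mean, stdev

# Class-change times (minutes) from the earlier example
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]
x_bar, s = mean(times), stdev(times)   # point estimates of the population mean and sd

# Estimated tolerance intervals: mean plus or minus k standard deviations
intervals = {k: (x_bar - k * s, x_bar + k * s) for k in (1, 2, 3)}
for k, (low, high) in intervals.items():
    print(f"mean +/- {k}s: [{low:.2f}, {high:.2f}]")

# z-score of an arrival time of 13 minutes
z_score = (13 - x_bar) / s
print(f"z = {z_score:.2f}")
```

With a mean of 9.6 and a standard deviation of about 1.897, the estimated intervals are roughly [7.70, 11.50], [5.81, 13.39] and [3.91, 15.29]. The z-score for 13 minutes is about 1.79, which places that time inside the estimated two-standard-deviation interval.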



2.3. Graphs: Quantitative Variables
2.3.1. Stem-and-leaf plot (p. 26)
A stem-and-leaf plot is a plot of quantitative data for which the stem is a leading part of a data value, and the
leaf is the remaining part of the data value. The stem-and-leaf plot is used to display the shape of the
distribution of values.
Construction of a stem-and-leaf plot:
1. Decide which units will be used for the stems and the leaves. As a general rule, consider between 5
and 20 stems.
2. Order the observations in increasing order.
3. Place the stems in a column with the smallest stem at the top of the column and the largest stem at the
bottom.
4. Enter the leaf for each measurement into the row corresponding to the proper stem. Arrange the
leaves so they are in increasing order from left to right.

Example: Consider the n = 40 observations on the breaking strengths of trash bags selected during a pilot
construction, as outlined in Table 2.9 (p. 37).
Solution:
Since the observations have a decimal point, select the stem unit as the first two digits (using a unit of 1) and
the leaf unit as the last digit (using a unit of 0.1).
Order the observations from smallest to largest: 21.3, 21.6, 21.9, 22.0, ..., 24.5
Then we have the stem and leaf plot from MINITAB as follows:

Stem-and-Leaf Display: Trash_Bag

Stem-and-leaf of Trash_Bag  N = 40
Leaf Unit = 0.10

   1   21  3
   3   21  69
   9   22  002344
  18   22  555677889
 (12)  23  000111223344
  10   23  566899
   4   24  123
   1   24  5
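The construction steps can be sketched in Python as follows. Table 2.9 is not reproduced in these notes, so the 40 values below are read back from the stems and leaves of the MINITAB display and should be treated as an assumption. The function name `stem_and_leaf` is hypothetical. Each stem is split into two rows (leaves 0 to 4 and leaves 5 to 9) to mirror the display.

```python
from collections import defaultdict

# Breaking strengths read back from the MINITAB display (assumed to match Table 2.9)
strengths = [
    21.3, 21.6, 21.9,
    22.0, 22.0, 22.2, 22.3, 22.4, 22.4,
    22.5, 22.5, 22.5, 22.6, 22.7, 22.7, 22.8, 22.8, 22.9,
    23.0, 23.0, 23.0, 23.1, 23.1, 23.1, 23.2, 23.2, 23.3, 23.3, 23.4, 23.4,
    23.5, 23.6, 23.6, 23.8, 23.9, 23.9,
    24.1, 24.2, 24.3, 24.5,
]

def stem_and_leaf(values, leaf_unit=0.1):
    """Return (stem, leaves) rows, each stem split into leaves 0-4 and leaves 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        units = round(v / leaf_unit)       # e.g. 21.3 -> 213
        stem, leaf = divmod(units, 10)     # 213 -> stem 21, leaf 3
        rows[(stem, leaf >= 5)].append(leaf)
    return [(stem, "".join(str(leaf) for leaf in leaves))
            for (stem, _), leaves in sorted(rows.items())]

rows = stem_and_leaf(strengths)
for stem, leaves in rows:
    print(f"{stem:>3} | {leaves}")
```

The sketch prints only the stem and leaf columns. The leftmost column of the MINITAB display (cumulative counts from each end, with the count of the median row in parentheses) is not reproduced.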


2.3.2. Histogram / Frequency Histogram (p. 30)
A histogram or frequency histogram is a visual summary of the data that may also be used to describe the
shape of the distribution of a quantitative variable.
Construction of a histogram (p. 30):
1. Divide the horizontal axis into sub-intervals, preferably of equal length. Each sub-interval represents a
range of values for the random variable. Typically, we use between 5 and 20 intervals or classes. Often
we choose the number of classes to be $\sqrt{n}$, or the largest whole number b such that $2^b < n$.
2. Different statistical packages use different techniques to determine the number of classes. The default
method typically works well.
3. Order the observations from smallest to largest and compute the class length:
Class length = (Upper unit - Lower unit) / (No. of classes)
4. A class is often called a bin.
5. For each bin, construct a rectangle whose height is equal to either the frequency or the relative
frequency.
6. Note the relative frequency = (bin frequency) / (total no. of observations).

Example (cont'd): Consider the n = 40 observations on the breaking strengths of trash bags selected during a
pilot construction, as outlined in Table 2.9 (p. 37).
We first sort the data in ascending order:
21.3, 21.6, 21.9, ..., 24.3, 24.5
We have $2^5 = 32$ and $2^6 = 64$, so we consider k = 5 classes.
The data ranges from a minimum of 21.3 to a maximum of 24.5, so consider a lower end unit of 21 and an
upper end unit of 25.
Then, the class length is (25-21)/5 = 0.8.
We organize the data into classes as follows:
Class:               21 to < 21.8    21.8 to < 22.6    ...    24.2 to 25
Frequency:           2               10                ...    3
Relative Frequency:  2/40 = 0.05     10/40 = 0.25      ...    3/40 = 0.075
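The binning can be sketched in Python as follows. The data are an assumption: they are read back from the leaves of the MINITAB stem-and-leaf display, because Table 2.9 is not reproduced here. To avoid floating-point trouble at class boundaries such as 21.8, the sketch works in tenths (21.3 is stored as 213).

```python
# Leaves read off the MINITAB stem-and-leaf display, keyed by stem (assumed complete)
display = {21: "369", 22: "002344555677889", 23: "000111223344566899", 24: "1235"}
tenths = [stem * 10 + int(leaf) for stem, leaves in display.items() for leaf in leaves]

lower, upper, k = 210, 250, 5      # 21.0 and 25.0 in tenths, five classes
width = (upper - lower) // k       # class length: 8 tenths, i.e. 0.8

freq = [0] * k
for t in tenths:
    i = min((t - lower) // width, k - 1)   # the last class includes the upper end
    freq[i] += 1

rel_freq = [f / len(tenths) for f in freq]

for i, (f, r) in enumerate(zip(freq, rel_freq)):
    start = (lower + i * width) / 10
    print(f"{start:.1f} to {start + width / 10:.1f}: frequency {f}, relative frequency {r:.3f}")
```

For the first, second and last classes the sketch reproduces the frequencies 2, 10 and 3 shown in the table above.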


The computer output of the histogram of the data is provided below:

[Figure: Histogram of Trash Bag Strengths. Horizontal axis: trash (21 to 25). Vertical axis: Frequency (0 to 15).]

To describe the shape of the distribution, consider the following:

Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
Left-Skewed: Mean < Median < Mode


Example: Consider the following histograms. Describe the shape of the distribution.

[Figure: three frequency histograms, labelled (a), (b) and (c), of the variables x1 (horizontal axis 4 to 14, frequency 0 to 20), x2 (horizontal axis 0 to 10, frequency 0 to 35) and x3 (horizontal axis -5 to 25, frequency 0 to 35).]


2.3.3. Boxplots (p. 60)
A boxplot or box-and-whisker plot describes simultaneously the central tendency and dispersion of a dataset. It
will also detect departures from symmetry, and atypical observations or outliers.
Construction of a boxplot:
1. Construct a box from the first quartile $Q_1$ to the third quartile $Q_3$, with a line drawn through the box
at the median.
2. Note the height of the box is the interquartile range (IQR).
3. At the bottom of the box, draw a whisker down to the smallest value that is within 1.5 times the
interquartile range of $Q_1$.
4. At the top of the box, draw a whisker up to the largest value that is within 1.5 times the interquartile
range of $Q_3$.
5. Values that lie more than 1.5 × IQR above or below the box (beyond the inner fences) are
designated with a point. We consider these values to be mild outliers or atypical points. Outlying points
should be investigated, since each may be a genuine extreme value or the result of an input error.
6. Values that lie more than 3 × IQR above or below the box (beyond the outer fences) are called extreme
outliers.
NOTE: Boxplots may be represented vertically or horizontally.



Example: We consider a boxplot of the following dataset:
1 2 3 5 5 6 7 8 19 20 22 23 27 29 50
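The quantities behind the boxplot of this dataset can be sketched in Python. The quartile convention is an assumption: here the quartiles are taken as the medians of the lower and upper halves of the ordered data, with the median counted in both halves when n is odd (often called Tukey's hinges). Other conventions move the fences slightly, which can change whether a borderline value is flagged. The helper name `hinges` is hypothetical.

```python
from statistics import median

data = [1, 2, 3, 5, 5, 6, 7, 8, 19, 20, 22, 23, 27, 29, 50]

def hinges(values):
    """Q1 and Q3 as medians of the lower and upper halves (median shared when n is odd)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    return median(xs[:half]), median(xs[-half:])

q1, q3 = hinges(data)
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences

# Whiskers stop at the most extreme values still inside the inner fences
inside = [x for x in data if inner[0] <= x <= inner[1]]
whiskers = (min(inside), max(inside))

mild = [x for x in data if not inner[0] <= x <= inner[1] and outer[0] <= x <= outer[1]]
extreme = [x for x in data if not outer[0] <= x <= outer[1]]

print(q1, median(data), q3, iqr, whiskers, mild, extreme)
```

Under this assumed convention, Q1 = 5, the median is 8, Q3 = 22.5 and the IQR is 17.5. The inner fences are -21.25 and 48.75, so the whiskers run from 1 to 29 and the value 50 is flagged as a mild outlier, with no extreme outliers.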

Describe the shape of the distribution:

[Figure: boxplot of the dataset on an axis from 0 to 50, with labels marking the atypical value, the largest value within 1.5 IQR of Q3, Q3, the median, Q1, the smallest value within 1.5 IQR of Q1, and the IQR.]


Example (cont'd): Consider the boxplot of the trash bag strengths (Table 2.9, p. 37), presented previously.
Describe the shape of the distribution. Are there any potential outliers?

[Figure: Boxplot of Trash Bag Strengths, on an axis from 21.5 to 24.5.]


2.4. Graphs: Qualitative Variables
Qualitative variables may also be graphed in order to compare categories. We consider bar charts, pie charts and
Pareto charts.
Since we are considering categories, we estimate the population proportion:

Definition: We define the population proportion (p) to be the proportion of all population elements that are
contained in the category of interest.

Definition: To estimate p, we consider the sample proportion:
$\hat{p}$ = the proportion of sample elements which are contained in the category of interest.

Example (Exercise 2-49): Construct a pie chart, bar graph and Pareto chart of the following data on an equity
fund.

Category                % of Fund
Cash                          0.1
Fixed Income                  0.2
Canadian Equity              89.6
U.S. Equity                   5.1
International Equity          0.7
Other                         4.3

Solution: Before graphing, first order the categories from largest to smallest:

Category                % of Fund    Cumulative %
Canadian Equity              89.6            89.6
U.S. Equity                   5.1            94.7
Other                         4.3            99.0
International Equity          0.7            99.7
Fixed Income                  0.2            99.9
Cash                          0.1           100.0
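The ordering and the cumulative percentages used for the Pareto chart can be reproduced with a short Python sketch. The dictionary name `fund` and the other variable names are illustrative; the sketch produces only the table, not the chart.

```python
from itertools import accumulate

# Percentage of the equity fund held in each category (from the exercise)
fund = {
    "Cash": 0.1,
    "Fixed Income": 0.2,
    "Canadian Equity": 89.6,
    "U.S. Equity": 5.1,
    "International Equity": 0.7,
    "Other": 4.3,
}

# Order the categories from largest to smallest share
ordered = sorted(fund.items(), key=lambda item: item[1], reverse=True)

# Running total of the ordered percentages, rounded to one decimal place
cumulative = [round(total, 1) for total in accumulate(pct for _, pct in ordered)]

for (category, pct), cum in zip(ordered, cumulative):
    print(f"{category:<22}{pct:>6.1f}{cum:>8.1f}")
```

The rounding step only removes floating-point noise from the running total; the percentages themselves are used as given.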



The pie chart as produced by MINITAB:

In the bar chart, the frequencies (or relative frequencies) of the categories are represented by vertical bars.
Relative frequencies are labelled Percent in MINITAB:



The Pareto chart includes frequencies (or relative frequencies) displayed as vertical bars, arranged from highest
to lowest, as well as the cumulative distribution. Note that MINITAB groups the three smallest categories into
"Others" to graph the Pareto chart.

Suggested Exercises (Chapter 2): 2.5, 2.7, 2.13, 2.15, 2.31, 2.33, 2.41, 2.45.
