Consider a sample of n measurements $x_1, x_2, \ldots, x_n$.
Central Tendency: We consider the sample mean, the sample median and the mode.
Definition: The sample mean is the average of the n values, and is denoted $\bar{x}$:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
Definition: The sample median divides the data into equal halves, and is denoted $M_d$.
To find $M_d$:
First order the observations from smallest to largest.
If n is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.
Definition: The mode of a population or sample is the measurement that occurs most frequently, and is denoted $M_O$.
NOTE: Sometimes the highest frequency occurs at two or more different measurements, hence the data is
multimodal. If exactly two modes exist, the data is bimodal.
STAT 2606 Course Notes
Chapter 2 Descriptive Statistics
Example: A study measured the amount of time (in minutes) taken for students to move from one class to
another. The n = 10 measurements are listed below:
9, 8, 10, 10, 12, 6, 11, 10, 12, 8.
Thus, $x_1 = 9$, $x_2 = 8$, ..., $x_{10} = 8$.
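As a quick check of the definitions above, the three measures of central tendency can be computed for the ten class-change times with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and are not part of the course notes.

```python
from statistics import mean, median, multimode

# Class-change times (minutes) from the example, n = 10
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

x_bar = mean(times)       # sum of the values divided by n
m_d = median(times)       # n is even, so the average of the two middle values
modes = multimode(times)  # every value that attains the highest frequency

print(f"mean = {x_bar}, median = {m_d}, mode(s) = {modes}")
```

The values sum to 96, so the mean is 9.6. The ordered data are 6, 8, 8, 9, 10, 10, 10, 11, 12, 12, so the two middle values are both 10 and the only mode is 10.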
The 25th percentile is also called the first quartile ($Q_1$).
The 75th percentile is also called the third quartile ($Q_3$).
The 50th percentile is the median and is also called the second quartile ($Q_2$).
Note that the quartiles divide the data into four equal parts.
Example (cont'd): In our previous example of time taken to arrive at class, compute the first and third quartiles.
Solution:
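Quartile conventions differ between textbooks and software, and these notes do not state which rule is intended, so the following is only a sketch. It assumes Python's `statistics.quantiles`, whose `exclusive` and `inclusive` methods interpolate between ordered values in slightly different ways; neither is claimed to be the method used in the course.

```python
from statistics import quantiles

# Class-change times (minutes) from the example, n = 10
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

# Two common interpolation conventions; they need not agree on small samples.
q_excl = quantiles(times, n=4, method="exclusive")
q_incl = quantiles(times, n=4, method="inclusive")

for name, (q1, q2, q3) in [("exclusive", q_excl), ("inclusive", q_incl)]:
    print(f"{name}: Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

Both conventions agree on the median, but the first and third quartiles can differ slightly, so a hand calculation should follow whichever rule the course textbook specifies.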
2.2. Measures of Dispersion (Variability)
The variability or scatter in the data may be measured by the sample variance, the sample standard deviation,
the range and the interquartile range.
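Definition: The sample variance (denoted $s^2$) is calculated as:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$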
Definition: The sample standard deviation (denoted s) is calculated as the square root of the sample variance:
$s = \sqrt{s^2}$
Definition: The range is the difference between the maximum and minimum values:
$\text{Range} = \max(x_i) - \min(x_i)$
Definition: The interquartile range (denoted IQR) is the difference between the third and first quartiles:
$IQR = Q_3 - Q_1$
Example (cont'd): For the previous data on amount of time taken to arrive at class, compute the sample variance, sample standard deviation, range and interquartile range.
Solution:
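The dispersion measures can be checked numerically. The sketch below computes the sample variance in two algebraically equivalent ways (squared deviations from the mean, and the sum-of-squares shortcut), then the standard deviation and the range. The interquartile range is left out because its value depends on which quartile convention is adopted. Variable names are illustrative.

```python
import math

# Class-change times (minutes) from the example, n = 10
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]
n = len(times)
x_bar = sum(times) / n

# Definitional form: squared deviations from the mean, divided by n - 1
s2 = sum((x - x_bar) ** 2 for x in times) / (n - 1)

# Shortcut form: (sum of squares - (sum of values)^2 / n) / (n - 1)
s2_shortcut = (sum(x * x for x in times) - sum(times) ** 2 / n) / (n - 1)

s = math.sqrt(s2)                     # sample standard deviation
data_range = max(times) - min(times)  # maximum minus minimum

print(f"s^2 = {s2:.2f} (shortcut: {s2_shortcut:.2f}), s = {s:.3f}, range = {data_range}")
```

Here the sum of the values is 96 and the sum of their squares is 954, so both forms give 32.4 / 9 = 3.6 for the sample variance.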
Sample vs. Population
Note that the data are a sample of observations that have been selected from a larger population of
observations.
We may consider the mean of all measurements in the population, which is called the population mean ($\mu$).
If there are N equally likely observations in a finite population, then
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Similarly, the variability in the population is defined by the population variance ($\sigma^2$):
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The square root of $\sigma^2$ is the population standard deviation, $\sigma$.
NOTE: The sample mean $\bar{x}$ estimates the population mean $\mu$, and the sample variance $s^2$ estimates the population variance $\sigma^2$. Sometimes the sample mean and sample variance are called point estimates of the population mean and population variance, respectively.
Furthermore, we have the following definitions:
Statistic: a numerical measure which is computed from a sample (e.g. $\bar{x}$, $s^2$).
Parameter: a numerical measure which is computed from a population (e.g. $\mu$, $\sigma^2$).
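The difference between the two divisors (n - 1 for a sample, N for a population) can be seen with the `variance` and `pvariance` functions in Python's `statistics` module. In this sketch the ten class-change times are treated, purely for illustration, first as a sample and then as if they were an entire population; that second reading is a hypothetical and not a claim made in the notes.

```python
from statistics import variance, pvariance

values = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

sample_var = variance(values)       # statistic: divides by n - 1
population_var = pvariance(values)  # parameter: divides by N (hypothetical population)

print(f"sample variance = {sample_var:.2f}, population variance = {population_var:.2f}")
```

The sum of squared deviations is 32.4 in both cases; only the divisor changes (9 versus 10).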
Empirical Rule (p. 49)
To interpret the population standard deviation, we consider the Empirical Rule (see Figure 2.22 on page 49).
For a normally distributed population with mean $\mu$ and standard deviation $\sigma$:
1. 68.26% of the population measurements are within (plus or minus) one standard deviation of the mean. In other words, 68.26% of the measurements lie in the interval $[\mu - \sigma, \mu + \sigma] = [\mu \pm \sigma]$.
2. 95.44% of the population measurements are within (plus or minus) two standard deviations of the mean. In other words, 95.44% of the measurements lie in the interval $[\mu - 2\sigma, \mu + 2\sigma] = [\mu \pm 2\sigma]$.
3. 99.73% of the population measurements are within (plus or minus) three standard deviations of the mean. In other words, 99.73% of the measurements lie in the interval $[\mu - 3\sigma, \mu + 3\sigma] = [\mu \pm 3\sigma]$.
NOTE: The intervals $[\mu \pm \sigma]$, $[\mu \pm 2\sigma]$ and $[\mu \pm 3\sigma]$ are sometimes called tolerance intervals around $\mu$ containing 68.26%, 95.44% and 99.73% of the measurements, respectively.
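The Empirical Rule percentages can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability that a normal measurement lies within k standard deviations of the mean
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {100 * p:.2f}%")
```

Computed directly, the first two figures round to 68.27% and 95.45%. The values 68.26% and 95.44% quoted in the rule are what one obtains by doubling four-decimal normal table entries (0.3413 and 0.4772).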
Chebyshev's Theorem (p. 52)
If we believe the Empirical Rule does not hold for a particular population, we may use Chebyshev's Theorem to find an interval containing a specified percentage of the individual measurements of the population.
Chebyshev's Theorem: Consider any population that has mean $\mu$ and standard deviation $\sigma$. Then for any value of k greater than 1, at least $100(1 - 1/k^2)$ percent of the population measurements lie in the interval $[\mu \pm k\sigma]$.
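To get a feel for the bound in Chebyshev's Theorem, it can be evaluated for a few values of k. The helper name `chebyshev_lower_bound` is hypothetical and chosen only for this sketch.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 100 * (1 - 1 / k**2)

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}%")
```

For k = 2 the bound is 75% and for k = 3 it is about 88.89%. These guarantees are weaker than the Empirical Rule figures because they hold for any population, not only a normal one.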
Definition: To determine the relative location of any value in a population or sample, we may calculate the z-score, as follows:
$z = \frac{x - \text{mean}}{\text{standard deviation}}$
Example (cont'd): For the data on class arrival times, consider the following:
a) Assuming the arrival times are approximately normally distributed, calculate estimates of the tolerance intervals containing 68.26%, 95.44% and 99.73% of the population of arrival times.
b) If a student's arrival time is 13 minutes, should this be considered unusually high?
c) Calculate the z-score for the arrival time in (b).
Solution:
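One way to organise the calculation is sketched below. It assumes, as the question suggests, that the sample mean and sample standard deviation are used as point estimates of $\mu$ and $\sigma$; the variable names are illustrative.

```python
from statistics import mean, stdev

# Class-change times (minutes) from the example, n = 10
times = [9, 8, 10, 10, 12, 6, 11, 10, 12, 8]

x_bar = mean(times)  # point estimate of the population mean
s = stdev(times)     # point estimate of the population standard deviation

# Estimated tolerance intervals [x_bar - k*s, x_bar + k*s] for k = 1, 2, 3
intervals = {k: (x_bar - k * s, x_bar + k * s) for k in (1, 2, 3)}
for k, (low, high) in intervals.items():
    print(f"k = {k}: [{low:.2f}, {high:.2f}]")

# z-score of an arrival time of 13 minutes
z = (13 - x_bar) / s
print(f"z = {z:.2f}")
```

With these estimates the z-score is (13 - 9.6) / 1.897, roughly 1.79, so 13 minutes falls inside the estimated two-standard-deviation interval but outside the one-standard-deviation interval.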
2.3. Graphs: Quantitative Variables
2.3.1. Stem-and-leaf plot (p. 26)
A stem-and-leaf plot is a plot of quantitative data for which the stem is a leading part of a data value, and the
leaf is the remaining part of the data value. The stem-and-leaf plot is used to display the shape of the
distribution of values.
Construction of a stem-and-leaf plot:
1. Decide which units will be used for the stems and the leaves. As a general rule, consider between 5
and 20 stems.
2. Order the observations in increasing order.
3. Place the stems in a column with the smallest stem at the top of the column and the largest stem at the bottom.
4. Enter the leaf for each measurement into the row corresponding to the proper stem. Arrange the
leaves so they are in increasing order from left to right.
Example: Consider the n = 40 observations on the breaking strengths of trash bags selected during a pilot
construction, as outlined in Table 2.9 (p. 37).
Solution:
Since the observations have a decimal point, select the stem unit as the first two digits (using a unit of 1) and
the leaf unit as the last digit (using a unit of 0.1).
Order the observations from smallest to largest: 21.3, 21.6, 21.9, 22.0, ..., 24.5
Then we have the stem and leaf plot from MINITAB as follows:
Stem-and-Leaf Display: Trash_Bag
Stem-and-leaf of Trash_Bag N = 40
Leaf Unit = 0.10
1 21 3
3 21 69
9 22 002344
18 22 555677889
(12) 23 000111223344
10 23 566899
4 24 123
1 24 5
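The construction steps can be sketched in Python. The 40 values below are read off the stem-and-leaf display itself (Table 2.9 is not reproduced in these notes), and the function name `stem_and_leaf` is hypothetical. Like the MINITAB output, the sketch splits each stem into two rows (leaves 0 to 4, then 5 to 9), but it does not reproduce the depth counts in MINITAB's first column.

```python
from collections import defaultdict

# Breaking strengths read off the MINITAB display (n = 40)
strengths = [
    21.3, 21.6, 21.9, 22.0, 22.0, 22.2, 22.3, 22.4, 22.4, 22.5,
    22.5, 22.5, 22.6, 22.7, 22.7, 22.8, 22.8, 22.9, 23.0, 23.0,
    23.0, 23.1, 23.1, 23.1, 23.2, 23.2, 23.3, 23.3, 23.4, 23.4,
    23.5, 23.6, 23.6, 23.8, 23.9, 23.9, 24.1, 24.2, 24.3, 24.5,
]

def stem_and_leaf(values, leaf_unit=0.1):
    """Return display rows, splitting each stem into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)    # 21.3 -> 213 when leaf_unit = 0.1
        stem, leaf = divmod(scaled, 10)  # 213 -> stem 21, leaf 3
        rows[(stem, leaf >= 5)].append(str(leaf))
    # Stem halves with no leaves are simply omitted in this sketch.
    return [f"{stem} {''.join(leaves)}" for (stem, _), leaves in sorted(rows.items())]

for row in stem_and_leaf(strengths):
    print(row)
```

Sorting first guarantees that the leaves in each row appear in increasing order, as required by step 4 of the construction.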
2.3.2. Histogram / Frequency Histogram (p. 30)
A histogram or frequency histogram is a visual summary of the data that may also be used to describe the
shape of the distribution of a quantitative variable.
Construction of a histogram (p. 30):
1. Divide the horizontal axis into sub-intervals, preferably of equal length. Each sub-interval represents a range of values for the random variable. Typically, we use between 5 and 20 intervals or classes. Often we choose the number of classes = $\sqrt{n}$ or the largest number b such that $2^b < n$.
2. Different statistical packages use different techniques to determine the number of classes. The default
method typically works well.
3. Order the observations from smallest to largest and compute the class length:
Class length = (Upper unit - Lower unit) / (No. of classes)
4. A class is often called a bin.
5. For each bin, construct a rectangle whose height is equal to either the frequency or the relative frequency.
6. Note the relative frequency = (bin frequency) / (total no. of observations).
Example (cont'd): Consider the n = 40 observations on the breaking strengths of trash bags selected during a pilot construction, as outlined in Table 2.9 (p. 37).
We first sort the data in ascending order:
21.3, 21.6, 21.9, ..., 24.3, 24.5
We have $2^5 = 32$ and $2^6 = 64$, so we consider k = 5 classes.
The data ranges from a minimum of 21.3 to a maximum of 24.5, so consider a lower end unit of 21 and an
upper end unit of 25.
Then, the class length is (25-21)/5 = 0.8.
We organize the data into classes as follows:
Class:                21 to < 21.8    21.8 to < 22.6    ...    24.2 to 25
Frequency:            2               10                ...    3
Relative Frequency:   2/40 = 0.05     10/40 = 0.25      ...    3/40 = 0.075
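The class frequencies can be tallied programmatically. The sketch below takes the 40 strengths as read off the earlier stem-and-leaf display, uses the lower unit 21, upper unit 25 and k = 5 classes from the example, and works in integer tenths so that boundary values such as 22.6 are compared exactly. Each class is closed on the left and open on the right, except the last, which includes the upper unit. All names are illustrative.

```python
from bisect import bisect_right

# Breaking strengths read off the stem-and-leaf display (n = 40)
strengths = [
    21.3, 21.6, 21.9, 22.0, 22.0, 22.2, 22.3, 22.4, 22.4, 22.5,
    22.5, 22.5, 22.6, 22.7, 22.7, 22.8, 22.8, 22.9, 23.0, 23.0,
    23.0, 23.1, 23.1, 23.1, 23.2, 23.2, 23.3, 23.3, 23.4, 23.4,
    23.5, 23.6, 23.6, 23.8, 23.9, 23.9, 24.1, 24.2, 24.3, 24.5,
]

k = 5                            # number of classes
lower_t, upper_t = 210, 250      # lower and upper units (21 and 25) in tenths
width_t = (upper_t - lower_t) // k           # class length: 8 tenths = 0.8
edges = [lower_t + i * width_t for i in range(k + 1)]

values_t = [round(x * 10) for x in strengths]  # 21.3 -> 213

counts = [0] * k
for v in values_t:
    # Index of the class whose left edge is <= v; the top edge joins the last class.
    idx = min(bisect_right(edges, v) - 1, k - 1)
    counts[idx] += 1

for i, c in enumerate(counts):
    print(f"{edges[i] / 10:.1f} to {edges[i + 1] / 10:.1f}: "
          f"frequency {c}, relative frequency {c / len(strengths):.3f}")
```

The first, second and last counts agree with the frequencies 2, 10 and 3 shown in the table.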
The computer output of the histogram of the data is provided below:
[Figure: Histogram of Trash Bag Strengths. Horizontal axis: trash (21 to 25); vertical axis: Frequency (0 to 15).]
To describe the shape of the distribution, consider the following:
Right-Skewed: Mode < Median < Mean
Left-Skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Example: Consider the following histograms. Describe the shape of the distribution.
[Figure: three histograms, labelled (a), (b) and (c), of the variables x1, x2 and x3. Vertical axis: Frequency.]
2.3.3. Boxplots (p. 60)
A boxplot or box-and-whisker plot describes simultaneously the central tendency and dispersion of a dataset. It
will also detect departures from symmetry, and atypical observations or outliers.
Construction of a boxplot:
1. Construct a box from the first quartile $Q_1$ to the third quartile $Q_3$.
2. Draw a line inside the box at the median.
3. At the bottom of the box, draw a whisker down to the smallest value that is within 1.5 times the interquartile range of $Q_1$.
4. At the top of the box, draw a whisker up to the largest value that is within 1.5 times the interquartile range of $Q_3$.
5. Values that are farther than a distance of 1.5 × IQR (inner fences) above or below the box are designated with a point. We consider these values to be mild outliers or atypical points. Outlying points should be investigated, since they may be extreme values or the result of an input error.
6. Values farther than a distance of 3 × IQR (outer fences) above or below the box are called extreme outliers.
NOTE: Boxplots may be represented vertically or horizontally.
Example: We consider a boxplot of the following dataset on the next page:
1 2 3 5 5 6 7 8 19 20 22 23 27 29 50
Describe the shape of the distribution:
[Figure: annotated vertical boxplot of the dataset, axis from 0 to 50. Labels: Atypical Value; Largest value within 1.5 IQR of Q3; Q3; Median; Q1; Smallest value within 1.5 IQR of Q1; IQR (the height of the box).]
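The fences and whiskers for this dataset can be worked out as in the sketch below. It assumes the `inclusive` method of Python's `statistics.quantiles` for the quartiles, which is an assumption and not necessarily the rule used to draw the figure. The choice matters here: with the `exclusive` method $Q_3$ = 23, so the upper inner fence is exactly 50 and the largest value sits on the fence rather than beyond it.

```python
from statistics import quantiles

data = [1, 2, 3, 5, 5, 6, 7, 8, 19, 20, 22, 23, 27, 29, 50]

# Quartiles under an assumed convention (see the note above)
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)      # outer fences

# Whiskers stop at the most extreme values still inside the inner fences
whisker_low = min(x for x in data if x >= inner[0])
whisker_high = max(x for x in data if x <= inner[1])

# Mild outliers lie beyond the inner fences but not beyond the outer fences
mild = [x for x in data if not inner[0] <= x <= inner[1] and outer[0] <= x <= outer[1]]
extreme = [x for x in data if not outer[0] <= x <= outer[1]]

print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}; mild: {mild}; extreme: {extreme}")
```

Under this convention the value 50 is flagged as a mild outlier, and the long upper whisker relative to the lower one points to a right-skewed distribution.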
Example (cont'd): Consider the boxplot of the trash bag strengths (Table 2.9, p. 37), presented previously. Describe the shape of the distribution. Are there any potential outliers?
[Figure: Boxplot of Trash Bag Strengths. Axis from 21.5 to 24.5.]
2.4. Graphs: Qualitative Variables
Qualitative variables may also be graphed in order to compare categories. We consider bar charts, pie charts and Pareto charts.
Since we are considering categories, we estimate the population proportion:
Definition: We define the population proportion (p) to be the proportion of all population elements that are
contained in the category of interest.
Definition: To estimate p, we consider the sample proportion:
$\hat{p}$ = the proportion of sample elements which are contained in the category of interest.
Example (Exercise 2-49): Construct a pie chart, bar graph and Pareto chart of the following data on an equity fund.
Category                % of Fund
Cash                    0.1
Fixed Income            0.2
Canadian Equity         89.6
U.S. Equity             5.1
International Equity    0.7
Other                   4.3
Solution: Before graphing, first order the categories from largest to smallest:
Category                % of Fund    Cumulative %
Canadian Equity         89.6         89.6
U.S. Equity             5.1          94.7
Other                   4.3          99.0
International Equity    0.7          99.7
Fixed Income            0.2          99.9
Cash                    0.1          100.0
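The ordering and the cumulative percentages used for a Pareto chart can be generated as in the following sketch. The dictionary name `fund` is illustrative, and the running totals are rounded to one decimal place to match the precision of the data.

```python
from itertools import accumulate

# Percentage of the fund held in each category (from the exercise)
fund = {
    "Cash": 0.1,
    "Fixed Income": 0.2,
    "Canadian Equity": 89.6,
    "U.S. Equity": 5.1,
    "International Equity": 0.7,
    "Other": 4.3,
}

# Order the categories from largest to smallest share
ordered = sorted(fund.items(), key=lambda item: item[1], reverse=True)

# Running total of the ordered percentages
cumulative = [round(total, 1) for total in accumulate(pct for _, pct in ordered)]

for (category, pct), cum in zip(ordered, cumulative):
    print(f"{category:<22}{pct:>6.1f}{cum:>8.1f}")
```

The final cumulative value is 100.0, which is a useful check that no category has been dropped.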
The pie chart as calculated by MINITAB:
In the bar chart, the frequencies (or relative frequencies) of the categories are represented by vertical bars. Relative frequencies are labelled Percent in MINITAB:
The Pareto chart includes frequencies (or relative frequencies) displayed as vertical bars, arranged from highest to lowest, as well as the cumulative distribution. Note that MINITAB groups the three smallest categories into "Others" to graph the Pareto chart.
Suggested Exercises (Chapter 2): 2.5, 2.7, 2.13, 2.15, 2.31, 2.33, 2.41, 2.45.