Basic Tools For Data Collection, Organization and Description

PHE1012 Statistics
Trimester 1, Year 2016-17

BEng (Hons)
Pharmaceutical Engineering
Instructor
Fung Ho-Ki
Phone: 6592-2197
Email: Hoki.Fung@SingaporeTech.edu.sg
Slide 2
Tentative Schedule
Slide 3
Times & Venues

Lecture
Tue 1300-1500 @ LT2A
Thu 1300-1400 @ LT2A
Tutorials
T1: Tue 1500-1600 @ SR2A
T2: Thu 1500-1600 @ SR2A
T3: Thu 1400-1500 @ SR2A
Quiz
Tentatively on Sep 29 (Thu) 1300-1400, room TBA
Midterm
Tentatively on Oct 27 (Thu) after lecture, room TBA
Slide 4
Textbooks
Recommended Main Textbook
Introduction to Probability and Statistics,
International Edition (14th Edition),
William Mendenhall, Robert J. Beaver &
Barbara M. Beaver,
ISBN-13: 9781133111504
Library has few copies
Slide 5
Assessment
CA may include a small project on Minitab (details TBA)
Slide 6
Basic Tools for Data Collection, Organization and Description
Slide 7
What is Statistics?
Statistics - the science of collecting and analyzing data
(a set of measurements) in large quantities
When first presented with a set of measurements, we
need to find a way to organize and summarize it
The branch of statistics that presents techniques for
describing sets of measurements is called descriptive
statistics (e.g., bar charts, pie charts, line charts,
numerical tables, etc.)
Sometimes it may be too expensive or time consuming
to enumerate the entire population. We may only have
a sample from the population. The branch of statistics
that deals with making inferences about population
characteristics from information contained in a sample
is called inferential statistics.
Slide 8
Variables & Data

A variable is a characteristic that changes or varies
over time and/or for different individuals or objects
under consideration, e.g., hair color, white blood
cell count, time to failure of a computer component,
etc.
An experimental unit is the individual or object on
which a variable is measured
A measurement results when a variable is actually
measured on an experimental unit
A population is the set of all measurements of
interest to the investigator
A sample is a subset of measurements selected
from the population of interest
Slide 9
Example 1
Variable
Hair color
Experimental unit
Person
Typical Measurements
Brown, black, blonde, etc.
Slide 10
Example 2
Variable
Time until a light bulb burns out
Experimental unit
Light bulb
Typical Measurements
1500 hours, 1535.5 hours, etc.
Slide 11
Variables & Data

Univariate data: One variable is measured on a
single experimental unit
Bivariate data: Two variables are measured on
a single experimental unit
Multivariate data: More than two variables are
measured on a single experimental unit
Slide 12
Types of Variables
Variables
Qualitative
Quantitative
Discrete
Continuous
Slide 13
Qualitative Variables
Qualitative variables measure a quality or
characteristic on each experimental unit.
They produce data that can be categorized
according to similarities or differences in kind;
hence they are often called categorical data.
Examples:
Hair color (black, brown, blonde, ...)
Gender (male, female)
DNA bases (adenine (A), guanine (G), thymine (T), cytosine (C))
Amino acid type (alanine, glutamine, methionine, ...)
Slide 14
Quantitative Variables
Quantitative variables measure a numerical
quantity or amount on each experimental unit.
Two types of quantitative variables:
A discrete variable can assume only a finite or countable number of values.
A continuous variable can assume the infinitely many values corresponding to the points on a line interval.
Slide 15
Examples
Total number of workers in a
pharmaceutical manufacturing plant:
Quantitative discrete
Operating temperature / pressure in the
distillation column in the plant
Quantitative continuous
My blood type
Qualitative
Slide 16
Graphs for Categorical Data

After the data have been collected, they can be
consolidated and summarized to show:
o what values of the variable have been measured
o how often each value has occurred
We can construct a statistical table to display the data
graphically as a data distribution
For a qualitative (categorical) variable, "how often" can be measured in 3 different ways:
Frequency, or number of measurements in each
category
Relative frequency, or proportion of measurements
in each category
The percentage of measurements in each category
Slide 17
Graphs for Categorical Data

Let n be the total number of measurements in the set.
Relative frequency = Frequency / n
Percent = 100 × Relative frequency
Sum of the frequencies is always n
Sum of the relative frequencies is 1
Sum of the percentages is 100%
Slide 18
Example
A bag of M&Ms contains 25 candies:
Raw Data:
Statistical Table:
Color    Frequency   Relative Frequency   Percent
Red      3           3/25 = .12           12%
Blue     6           6/25 = .24           24%
Green    4           4/25 = .16           16%
Orange   5           5/25 = .20           20%
Brown    3           3/25 = .12           12%
Yellow   4           4/25 = .16           16%
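The table above can be reproduced with a short Python sketch. The list `candies` is a hypothetical reconstruction of the raw data from the frequencies in the table (the original ordering of the 25 candies is not known), and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical raw data rebuilt from the table's frequencies;
# the real order of the 25 candies on the slide is unknown.
candies = (["Red"] * 3 + ["Blue"] * 6 + ["Green"] * 4 +
           ["Orange"] * 5 + ["Brown"] * 3 + ["Yellow"] * 4)

n = len(candies)                      # total number of measurements
frequency = Counter(candies)          # count per category
relative_frequency = {c: f / n for c, f in frequency.items()}
percent = {c: 100 * rf for c, rf in relative_frequency.items()}

for color in frequency:
    print(f"{color:<7} {frequency[color]:>2}  "
          f"{relative_frequency[color]:.2f}  {percent[color]:.0f}%")
```

The three sums (frequencies, relative frequencies, percentages) can be checked against n, 1 and 100% respectively.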
Slide 19
Example
[Bar chart: frequency (vertical axis) for each color (horizontal axis): Brown, Yellow, Red, Blue, Orange, Green]
[Pie chart: Brown 12.0%, Green 16.0%, Yellow 16.0%, Orange 20.0%, Red 12.0%, Blue 24.0%]
Slide 20
Graphs for Quantitative Data

A single quantitative variable measured for different
population segments or for different categories of
classification can be graphed using a pie or bar
chart.
A single quantitative variable measured over time is
called a time series. Time series data are most
effectively presented on a line chart with time as the
horizontal axis.
Month   Sept     Oct      Nov      Dec      Jan      Feb      Mar
CPI     178.10   177.60   177.50   177.30   177.60   178.00   178.60
Source: Bureau of Labor Statistics, CPI: All Urban Consumers, Seasonally Adjusted

Slide 21
Graphs for Quantitative Data

DOTPLOTS
The simplest graph for quantitative data
Plots the measurements as points on a
horizontal axis, stacking the points that
duplicate existing points.
Example: The set 4, 5, 5, 7, 6
STEM & LEAF PLOTS

Slide 22
Stem & Leaf Plots

A simple graph for quantitative data
Uses the actual numerical values of each data
point
Divide each measurement into two parts: the stem and
the leaf.
List the stems in a column, with a vertical line to their
right.
For each measurement, record the leaf portion in the
same row as its matching stem.
Order the leaves from lowest to highest in each stem.
Provide a key to your coding.
Slide 23
Stem & Leaf Plots

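The prices ($) of 18 brands of a particular product:
90 70 70 70 75 70 65 68 60
74 70 95 75 70 68 65 40 65
Stem and leaf plot (leaves in data order):
4 | 0
5 |
6 | 5 8 0 8 5 5
7 | 0 0 0 5 0 4 0 5 0
8 |
9 | 0 5
Reorder (leaves sorted within each stem):
4 | 0
5 |
6 | 0 5 5 5 8 8
7 | 0 0 0 0 0 0 4 5 5
8 |
9 | 0 5
Key: stem = tens digit, leaf = units digit (6 | 5 represents $65)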
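The stem-and-leaf procedure can be sketched in Python as follows, using the 18 prices from this example. The function name `stem_and_leaf` is illustrative, and the sketch assumes two-digit whole-number data with the tens digit as the stem.

```python
# The 18 prices from the example.
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60,
          74, 70, 95, 75, 70, 68, 65, 40, 65]

def stem_and_leaf(data):
    """Return {stem: sorted leaves}; stem = tens digit, leaf = units digit."""
    plot = {}
    for value in data:
        stem, leaf = divmod(value, 10)
        plot.setdefault(stem, []).append(leaf)
    # Keep empty stems between the smallest and largest stem,
    # and order the leaves from lowest to highest.
    return {stem: sorted(plot.get(stem, []))
            for stem in range(min(plot), max(plot) + 1)}

plot = stem_and_leaf(prices)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Keeping the empty stems (5 and 8 here) preserves the shape of the distribution, so the gap between $40 and the rest of the prices stays visible.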
Slide 24
Interpreting Graphs (1)

Location and Spread
Where is the data centered on the horizontal axis, and how does it spread out from the center?
Slide 25
Interpreting Graphs (2)
Shapes
Mound shaped and symmetric (mirror images)
Skewed right: a few unusually large measurements
Skewed left: a few unusually small measurements
Bimodal: two local peaks
Slide 26
Interpreting Graphs (3)
Outliers
Are there any strange or unusual measurements that stand out in the data set?
[Figure panels: No Outliers; Outlier]
Slide 27
Relative Frequency Histograms

A relative frequency histogram for a quantitative
data set is a bar graph in which the height of the
bar shows how often (measured as a proportion
or relative frequency) measurements fall in a
particular class or subinterval. The classes or
subintervals are plotted along the horizontal axis.
Create intervals
Stack and draw bars
Slide 28
Relative Frequency Histograms
Divide the range of the data into 5-12 subintervals of equal length; the more data available, the more subintervals you need.
Calculate the approximate width of the subinterval as Range / number of subintervals.
Round the approximate width up to a convenient value.
Use the method of left inclusion: include the left endpoint of each subinterval, but not the right, in your tally.
Create a statistical table including the subintervals,
their frequencies and relative frequencies.
Slide 29
Relative Frequency Histograms
Draw the relative frequency histogram, plotting
the subintervals on the horizontal axis and the
relative frequencies on the vertical axis.
The height of the bar represents:
The proportion of measurements falling in
that class or subinterval.
The probability that a single measurement,
drawn at random from the set, will belong to
that class or subinterval.
Slide 30
Example
The ages of 50 professors at a university:
34 42 34 43 48 31 59 50 70 36
34 30 63 48 66 43 52 43 40 32
52 26 59 44 35 58 36 58 50 37
43 53 43 52 44 62 49 34 48 53
39 45 41 35 36 62 34 38 28 53
We choose to use 6 intervals.
Minimum class width = (70 - 26)/6 = 7.33
Convenient class width = 8
Use 6 classes of length 8, starting at 25.
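As a minimal sketch of these steps, the following Python code builds the classes and tallies the ages with left inclusion. Rounding the width up with `math.ceil` is one way to pick a "convenient" value and is an assumption here; the starting point 25 and the 6 classes are the choices made in the example.

```python
import math

ages = [34, 42, 34, 43, 48, 31, 59, 50, 70, 36,
        34, 30, 63, 48, 66, 43, 52, 43, 40, 32,
        52, 26, 59, 44, 35, 58, 36, 58, 50, 37,
        43, 53, 43, 52, 44, 62, 49, 34, 48, 53,
        39, 45, 41, 35, 36, 62, 34, 38, 28, 53]

num_classes = 6
start = 25                                             # chosen lower boundary

approx_width = (max(ages) - min(ages)) / num_classes   # (70 - 26) / 6
width = math.ceil(approx_width)                        # rounded up to 8

# Left inclusion: class i covers [start + i*width, start + (i+1)*width).
counts = [0] * num_classes
for age in ages:
    counts[(age - start) // width] += 1

for i, count in enumerate(counts):
    left = start + i * width
    print(f"{left} to < {left + width}: {count}  ({count / len(ages):.2f})")
```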
Slide 31
Example
Age          Frequency   Relative Frequency   Percent
25 to < 33   5           5/50 = .10           10%
33 to < 41   14          14/50 = .28          28%
41 to < 49   13          13/50 = .26          26%
49 to < 57   9           9/50 = .18           18%
57 to < 65   7           7/50 = .14           14%
65 to < 73   2           2/50 = .04           4%
[Relative frequency histogram: relative frequency (0 to 14/50) on the vertical axis; ages with class boundaries 25, 33, 41, 49, 57, 65, 73 on the horizontal axis]
Slide 32
Describing the Distribution

[Relative frequency histogram of the ages, as on the previous slide]
Shape?
Any outliers?
What proportion of the tenured faculty are younger than 41?
What is the probability that a randomly selected faculty member is 49 or older?
Slide 33
Describing Data with Numerical Measures
Graphical methods may not always be
sufficient for describing data.
Numerical measures can be created for
both populations and samples.
A parameter is a numerical descriptive
measure calculated for a population.
A statistic is a numerical descriptive
measure calculated for a sample.
Slide 34
Measures of the Center

A measure along the horizontal axis of the
data distribution that locates the center of
the distribution.
Slide 35
Arithmetic Mean or Average

The mean of a set of measurements is the
sum of the measurements divided by the
total number of measurements.
$\bar{x} = \frac{\sum x_i}{n}$
where n = number of measurements
$\sum x_i$ = sum of all the measurements
Slide 36
Example 1
The set: 2, 9, 1, 5, 6
What is their arithmetic mean?
If we were able to enumerate the whole population, the population mean would be called $\mu$ (the Greek letter mu).
Slide 37
Median
The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
The position of the median is
.5(n + 1)
once the measurements have been
ordered.
Slide 38
Example 2
The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
Sort: 2, 3, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th ordered measurement = 5
The set: 2, 4, 9, 8, 6, 5 (n = 6)
Sort: 2, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5 (average of the 3rd and 4th measurements)
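The mean formula and the .5(n + 1) position rule for the median can be sketched in Python as below. The function names are illustrative; Python's standard `statistics` module offers `mean` and `median` functions that give the same results.

```python
def mean(data):
    """Sum of the measurements divided by the number of measurements."""
    return sum(data) / len(data)

def median(data):
    """Middle measurement, located at position .5(n + 1) after sorting."""
    ordered = sorted(data)
    position = 0.5 * (len(ordered) + 1)   # 1-based position
    lower = int(position)
    if position == lower:                 # odd n: a single middle value
        return ordered[lower - 1]
    # even n: average the two measurements on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2

print(mean([2, 9, 1, 5, 6]))
print(median([2, 4, 9, 8, 6, 5, 3]))
print(median([2, 4, 9, 8, 6, 5]))
```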
Slide 39
Mode
The mode is the measurement which occurs
most frequently.
The set: 2, 4, 9, 8, 8, 5, 3
The mode is 8, which occurs twice
The set: 2, 2, 9, 8, 8, 5, 3
There are two modes, 8 and 2 (bimodal)
The set: 2, 4, 9, 8, 5, 3
There is no mode (each value is unique).
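A small Python sketch of the three cases above. The convention of returning an empty list when every value is unique follows the slide's "no mode" case; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s); empty list if every value is unique."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []                      # each value occurs once: no mode
    return sorted(value for value, c in counts.items() if c == highest)

print(modes([2, 4, 9, 8, 8, 5, 3]))    # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))    # bimodal
print(modes([2, 4, 9, 8, 5, 3]))       # no mode
```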
Slide 40
Extreme Values (1)

The mean is more easily affected by
extremely large or small values than the
median.
The median is often used as a measure of center when the distribution is skewed.
Slide 41
Extreme Values (2)

Symmetric: Mean = Median
Skewed right: Mean > Median
Skewed left: Mean < Median
Slide 42
Measures of Variability
A measure along the horizontal axis of
the data distribution that describes the
spread of the distribution from the
center.
Slide 43
The Range
The range, R, of a set of n measurements is
the difference between the largest and
smallest measurements.
Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
The range is R = 14 - 5 = 9.
Quick and easy, but only uses
2 of the 5 measurements.
Slide 44
The Variance
The variance is a measure of variability that uses all the measurements. It measures the average deviation of the measurements about their mean.
Data: 5, 12, 6, 8, 14
$\bar{x} = \frac{45}{5} = 9$
[Dotplot of the data on an axis from 4 to 14]
Slide 45
The Variance
The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean $\mu$.
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n - 1).
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Slide 46
The Standard Deviation

In calculating the variance, we squared all
of the deviations, and in doing so changed
the scale of the measurements.
To return this measure of variability to the
original units of measure, we calculate the
standard deviation, the positive square
root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Slide 47
Example 3
2 Ways to calculate Sample Variance:
1. Use the Definition Formula:
x_i    x_i - \bar{x}    (x_i - \bar{x})^2
5      -4               16
12     3                9
6      -3               9
8      -1               1
14     5                25
Sum 45   0              60
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{60}{4} = 15$
Slide 48
Example 3
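2. Use the Computing Formula for $s^2$:
x_i    x_i^2
5      25
12     144
6      36
8      64
14     196
Sum 45   465
$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{465 - \frac{45^2}{5}}{4} = \frac{465 - 405}{4} = 15$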
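Both routes to the sample variance can be checked with a few lines of Python on the petal data (5, 12, 6, 8, 14). This is only a sketch; the variable names are illustrative.

```python
import math

petals = [5, 12, 6, 8, 14]
n = len(petals)
x_bar = sum(petals) / n                                  # 45 / 5

# 1. Definition formula: squared deviations about the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in petals) / (n - 1)

# 2. Computing formula: uses only the sum of x and the sum of x squared.
sum_x = sum(petals)
sum_x2 = sum(x * x for x in petals)
s2_computing = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)                             # standard deviation
print(s2_definition, s2_computing, round(s, 3))
```

The computing formula avoids calculating each deviation, which is convenient by hand; both forms give the same value.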
Slide 49
Some Notes
The value of s is never negative; it is zero only when all the measurements are identical.
The larger the value of $s^2$ or s, the larger the variability of the data set.
Why divide by n - 1?
The sample standard deviation s is often used to estimate the population standard deviation $\sigma$. Dividing by n - 1 gives us a better estimate of $\sigma$.
Slide 50
Tchebysheff's Theorem
Theorem: Given a number k greater than or equal to 1 and a set of n measurements, at least $1 - \frac{1}{k^2}$ of the measurements will lie within k standard deviations of the mean.
Applies to any set of measurements. Can be used for either samples ($\bar{x}$ and s) or for a population ($\mu$ and $\sigma$).
Slide 51
Tchebysheff's Theorem
The mean and variance of a sample of n = 25 measurements are 75 and 100 (i.e., $\bar{x} = 75$ and $s^2 = 100$).
At least 3/4 of the 25 measurements lie in the interval $\bar{x} \pm 2s = 75 \pm 2(10)$ (i.e., 55 to 95).
At least 8/9 of the 25 measurements lie in the interval $\bar{x} \pm 3s = 75 \pm 3(10)$ (i.e., 45 to 105).
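The bound from Tchebysheff's Theorem is simple to compute. The sketch below applies it to the sample with mean 75 and variance 100; the function name `tchebysheff_bound` is illustrative.

```python
import math

def tchebysheff_bound(k):
    """Minimum fraction of measurements within k standard deviations (k >= 1)."""
    if k < 1:
        raise ValueError("the theorem requires k >= 1")
    return 1 - 1 / k ** 2

x_bar, variance = 75, 100
s = math.sqrt(variance)               # standard deviation = 10

for k in (2, 3):
    low, high = x_bar - k * s, x_bar + k * s
    print(f"k = {k}: at least {tchebysheff_bound(k):.3f} "
          f"of the measurements lie in {low:.0f} to {high:.0f}")
```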
Slide 52
The Empirical Rule

Given a distribution of measurements
that is approximately mound-shaped:
The interval $\mu \pm \sigma$ contains approximately 68% of the measurements.
The interval $\mu \pm 2\sigma$ contains approximately 95% of the measurements.
The interval $\mu \pm 3\sigma$ contains approximately 99.7% of the measurements.
Note: it does not work for all data sets (it only works well for mound-shaped distributions).
Slide 53
Example 1
Raw Data
The ages of 50 professors at a university:
34 42 34 43 48 31 59 50 70 36
34 30 63 48 66 43 52 43 40 32
52 26 59 44 35 58 36 58 50 37
43 53 43 52 44 62 49 34 48 53
39 45 41 35 36 62 34 38 28 53
$\bar{x} = 44.9$
$s = 10.73$
[Relative frequency histogram of the ages, with class boundaries 25, 33, 41, 49, 57, 65, 73]
Shape? Skewed right
Slide 54
Example 1
k   \bar{x} ± ks     Interval         Proportion in Interval   Tchebysheff    Empirical Rule
1   44.9 ± 10.73     34.17 to 55.63   31/50 (.62)              At least 0     ≈ .68
2   44.9 ± 21.46     23.44 to 66.36   49/50 (.98)              At least .75   ≈ .95
3   44.9 ± 32.19     12.71 to 77.09   50/50 (1.00)             At least .89   ≈ .997
Do the actual proportions in the three intervals agree with those given by Tchebysheff's Theorem?
Do they agree with the Empirical Rule?
Why or why not?
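The "Proportion in Interval" column can be recomputed from the raw ages. This sketch uses Python's standard `statistics` module; the helper `proportion_within` is an illustrative name, not part of any library.

```python
import statistics

ages = [34, 42, 34, 43, 48, 31, 59, 50, 70, 36,
        34, 30, 63, 48, 66, 43, 52, 43, 40, 32,
        52, 26, 59, 44, 35, 58, 36, 58, 50, 37,
        43, 53, 43, 52, 44, 62, 49, 34, 48, 53,
        39, 45, 41, 35, 36, 62, 34, 38, 28, 53]

x_bar = statistics.mean(ages)
s = statistics.stdev(ages)            # sample standard deviation (n - 1)

def proportion_within(data, center, spread, k):
    """Fraction of the data lying within k * spread of center."""
    inside = [x for x in data if abs(x - center) <= k * spread]
    return len(inside) / len(data)

results = {k: proportion_within(ages, x_bar, s, k) for k in (1, 2, 3)}
for k, proportion in results.items():
    print(f"k = {k}: {proportion:.2f} observed, "
          f"Tchebysheff lower bound {1 - 1 / k ** 2:.2f}")
```

Every observed proportion is at least as large as the Tchebysheff bound, as the theorem guarantees for any data set; how closely the proportions track the Empirical Rule depends on how mound-shaped the distribution is.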
Slide 55
Example 2
A distribution is relatively mound-shaped with mean 50 and standard deviation 10.
a) What proportion of measurements will fall between 40 and 60?
b) What proportion of measurements will fall between 30 and 70?
c) What proportion of measurements will fall between 30 and 60?
d) If a measurement is chosen at random from this distribution, what is the
probability that it will be 60?
Slide 56
Approximating s
From Tchebysheff's Theorem and the Empirical Rule, we know that
$R \approx 4\text{-}6\,s$
To approximate the standard deviation of a set of measurements, we can use:
$s \approx R/4$
or $s \approx R/6$ for a large data set.
Slide 57
Measures of Relative Standing

Where does one particular measurement stand
in relation to the other measurements in the
data set?
How many standard deviations away from the
mean does the measurement lie? This is
measured by the z-score.
$z\text{-score} = \frac{x - \bar{x}}{s}$
Suppose $\bar{x} = 5$ and s = 2.
Then x = 9 lies z = (9 - 5)/2 = 2 standard deviations from the mean.

Slide 58
The Z-score
From Tchebysheff's Theorem and the Empirical Rule:
At least 3/4 and more likely 95% of measurements lie within 2 standard deviations of the mean.
At least 8/9 and more likely 99.7% of measurements lie within 3 standard deviations of the mean.
z-scores between -2 and 2 are not unusual. z-scores should not be more than 3 in absolute value. z-scores larger than 3 in absolute value would indicate a possible outlier.
[Figure: z-score axis with regions labelled Not unusual, Somewhat unusual, and Outlier (both tails)]
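A minimal Python sketch of the z-score and the rule of thumb above. The cut-offs at 2 and 3 come from the slide; treating a z-score exactly equal to 2 or 3 as belonging to the milder category is an assumption made here, and the function names are illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def describe(z):
    """Rule-of-thumb label for a z-score (boundary handling is an assumption)."""
    size = abs(z)
    if size <= 2:
        return "not unusual"
    if size <= 3:
        return "somewhat unusual"
    return "possible outlier"

z = z_score(9, mean=5, std_dev=2)     # the example with mean 5 and s = 2
print(z, describe(z))
```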
Slide 59
Measures of Relative Standing

How many measurements lie below the
measurement of interest? This is measured
by the pth percentile.
[Figure: distribution with p% of the measurements below x, the p-th percentile, and (100 - p)% above]
Slide 60
Quartiles and the IQR

The lower quartile (Q1) is the value of x
which is larger than 25% and less than
75% of the ordered measurements.
The upper quartile (Q3) is the value of x
which is larger than 75% and less than
25% of the ordered measurements.
The range of the "middle 50%" of the measurements is the interquartile range,
IQR = Q3 - Q1
Slide 61
Calculating Sample Quartiles

The lower and upper quartiles (Q1 and Q3) can be calculated as follows:
The position of Q1 is .25(n + 1)
The position of Q3 is .75(n + 1)
once the measurements have been ordered. If the positions are not integers, find the quartiles by interpolation.
Slide 62
Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or
Q1 = 65 + .75(65 - 65) = 65.
Slide 63
Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
Q3 is 1/4 of the way between the 14th and 15th
ordered measurements, or
Q3 = 74 + .25(75 - 74) = 74.25
and
IQR = Q3 - Q1 = 74.25 - 65 = 9.25
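The position-and-interpolation rule can be sketched in Python as follows. Note that statistical software uses several different quartile conventions, so other tools may give slightly different values; this sketch follows the .25(n + 1) and .75(n + 1) positions from these slides and assumes at least 3 measurements. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position by linear interpolation."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """Return (Q1, Q3) using positions .25(n + 1) and .75(n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3

prices = [40, 60, 65, 65, 65, 68, 68, 70, 70,
          70, 70, 70, 70, 74, 75, 75, 90, 95]
q1, q3 = quartiles(prices)
iqr = q3 - q1
print(q1, q3, iqr)
```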
Slide 64
Bivariate Data
When two variables are measured on a single experimental unit, the resulting data are called bivariate data.
You can describe each variable individually, and you can also explore the relationship between the two variables.
When both of the variables are quantitative, call one variable x and the other y. A single measurement is a pair of numbers (x, y) that can be plotted using a two-dimensional graph called a scatterplot.
[Figure: the point (2, 5) plotted on x and y axes, at x = 2 and y = 5]
Slide 65
Describing the Scatterplot
What pattern or form do you see?

Straight line upward or downward
Curve or no pattern at all (just random
scattering of points)
How strong is the pattern?
Strong: all of the points follow the pattern
exactly
Weak: relationship barely visible
Are there any unusual observations?
Clusters or outliers
Slide 66
Examples of Scatterplots
Positive linear - strong
Negative linear - weak
Curvilinear
No relationship
Slide 67
The Correlation Coefficient

The strength and direction of the relationship
between x and y are measured using the
correlation coefficient, r.
$r = \frac{s_{xy}}{s_x s_y}$
where
$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$
$s_x$ = standard deviation of the x's
$s_y$ = standard deviation of the y's

Slide 68
Covariance
The new quantity $s_{xy}$ is called the covariance between x and y, and is defined as:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
It is a measure of how much the two variables change together. There is also a computing formula for covariance:
$s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1}$
Slide 69
Example 4
x    14    15    17    19    16
y    178   230   240   275   200
[Scatterplot: y (180 to 280) against x (14 to 19)]
The scatterplot indicates a positive linear relationship.
Slide 70
Example 4
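x      y      xy
14     178    2492
15     230    3450
17     240    4080
19     275    5225
16     200    3200
Sum 81   1123   18447
$\bar{x} = 16.2$, $s_x = 1.924$
$\bar{y} = 224.6$, $s_y = 37.360$
Calculate
$s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1} = \frac{18447 - \frac{(81)(1123)}{5}}{4} = 63.6$
$r = \frac{s_{xy}}{s_x s_y} = \frac{63.6}{1.924(37.36)} = .885$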
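The calculation in Example 4 can be reproduced with the computing formulas. This is a sketch in plain Python; the helper `sample_std` is an illustrative name.

```python
import math

x = [14, 15, 17, 19, 16]
y = [178, 230, 240, 275, 200]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Computing formula for the covariance.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

def sample_std(values):
    """Sample standard deviation via the computing formula for s^2."""
    m = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return math.sqrt((total_sq - total ** 2 / m) / (m - 1))

s_x, s_y = sample_std(x), sample_std(y)
r = s_xy / (s_x * s_y)
print(round(s_xy, 1), round(s_x, 3), round(s_y, 3), round(r, 3))
```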
Slide 71
Interpreting r
$-1 \le r \le 1$: Sign of r indicates direction of the linear relationship.
$r \approx 0$: Weak relationship; random scatter of points
$r \approx 1$ or $-1$: Strong relationship; either positive or negative
$r = 1$ or $-1$: All points fall exactly on a straight line.
Slide 72
The Regression Line

Sometimes x and y are related in a particular way: the value of y depends on the value of x.
y = dependent variable
x = independent variable
The form of the linear relationship between x and y
can be described by fitting a line as best we can
through the points. This is the regression line (or
least-squares line*),
y = a + bx.
a = y-intercept of the line
b = slope of the line
* The best fit in the least-squares sense minimizes the sum of squared residuals, a residual
being the difference between a data point and the fitted value provided by the regression line.
Slide 73
The Regression Line

To find the slope and y-intercept of the best fitting line, use:
$b = r \frac{s_y}{s_x}$
$a = \bar{y} - b\bar{x}$
The least squares regression line is y = a + bx
[Figure: scatterplot with the fitted regression line; x from 14 to 19, y from 180 to 280]
Slide 74
Example 5
x      y      xy
14     178    2492
15     230    3450
17     240    4080
19     275    5225
16     200    3200
Sum 81   1123   18447
Recall
$\bar{x} = 16.2$, $s_x = 1.9235$
$\bar{y} = 224.6$, $s_y = 37.3604$
$r = .885$
$b = r \frac{s_y}{s_x} = (.885) \frac{37.3604}{1.9235} = 17.189$
$a = \bar{y} - b\bar{x} = 224.6 - 17.189(16.2) = -53.86$
Regression line: $y = -53.86 + 17.189x$
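The slope and intercept in Example 5 can be checked with a short Python sketch that uses the standard `statistics` module for the means and standard deviations. The function `predict` is a hypothetical helper added for illustration.

```python
import statistics

x = [14, 15, 17, 19, 16]
y = [178, 230, 240, 275, 200]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Covariance from the definition formula, then the correlation coefficient.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = s_xy / (s_x * s_y)

b = r * s_y / s_x                     # slope
a = y_bar - b * x_bar                 # y-intercept

def predict(x_value):
    """Fitted value on the least-squares line y = a + b x."""
    return a + b * x_value

print(f"y = {a:.2f} + {b:.3f} x")
```

Because a is defined as the mean of y minus b times the mean of x, the fitted line always passes through the point of means (16.2, 224.6).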
