Instructor
Fung Ho-Ki
Phone: 6592-2197
Email: Hoki.Fung@SingaporeTech.edu.sg
Slide 2
Tentative Schedule
Slide 3
Quiz
Tentatively on Sep 29 (Thu), 1300-1400, room TBA
Midterm
Tentatively on Oct 27 (Thu) after lecture, room TBA
Slide 4
Textbooks
Recommended Main Textbook
Introduction to Probability and Statistics,
International Edition (14th Edition),
William Mendenhall, Robert J. Beaver &
Barbara M. Beaver,
ISBN-13: 9781133111504
Library has few copies
Slide 5
Assessment
Slide 6
What is Statistics?
Statistics - the science of collecting and analyzing data
(a set of measurements) in large quantities
When first presented with a set of measurements, we
need to find a way to organize and summarize it
The branch of statistics that presents techniques for
describing sets of measurements is called descriptive
statistics (e.g., bar charts, pie charts, line charts,
numerical tables, etc.)
Sometimes it may be too expensive or time-consuming
to enumerate the entire population. We may only have
a sample from the population. The branch of statistics
that deals with making inferences about population
characteristics from information contained in a sample
is called inferential statistics.
Slide 8
Example 1
Variable
Hair color
Experimental unit
Person
Typical Measurements
Brown, black, blonde, etc.
Slide 10
Example 2
Variable
Time until a light bulb burns out
Experimental unit
Light bulb
Typical Measurements
1500 hours, 1535.5 hours, etc.
Slide 11
Slide 12
Types of Variables
Variables
Qualitative
Quantitative
Discrete
Continuous
Slide 13
Qualitative Variables
Qualitative variables measure a quality or
characteristic on each experimental unit.
They produce data that can be categorized
according to similarities or differences in kind;
hence they are often called categorical data.
Examples:
Hair color (black, brown, blonde, ...)
Gender (male, female)
DNA-bases (adenine(A), guanine(G),
thymine(T), cytosine(C))
Amino acid type (alanine, glutamine,
methionine, ...)
Slide 14
Quantitative Variables
Quantitative variables measure a numerical
quantity or amount on each experimental unit.
Two types of quantitative variables:
A discrete variable can assume only a finite
or countable number of values.
A continuous variable can assume the
infinitely many values corresponding to the
points on a line interval.
Slide 15
Examples
Total number of workers in a
pharmaceutical manufacturing plant:
Quantitative discrete
Operating temperature / pressure in the
distillation column in the plant
Quantitative continuous
My blood type
Qualitative
Slide 16
Slide 18
Example
A bag of M&Ms contains 25 candies:
Raw Data: [figure not reproduced]
Statistical Table:
Color     Frequency   Relative Frequency   Percent
Red       3           3/25 = .12           12%
Blue      6           6/25 = .24           24%
Green     4           4/25 = .16           16%
Orange    5           5/25 = .20           20%
Brown     3           3/25 = .12           12%
Yellow    4           4/25 = .16           16%
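A minimal Python sketch of how such a table can be built from raw data. The list of candies is rebuilt here from the frequencies in the table, since the raw data itself is not reproduced; the names `counts`, `candies` and `table` are illustrative only.

```python
from collections import Counter

# Frequencies as given in the table; the raw list is reconstructed from them.
counts = {"Red": 3, "Blue": 6, "Green": 4, "Orange": 5, "Brown": 3, "Yellow": 4}
candies = [color for color, k in counts.items() for _ in range(k)]

n = len(candies)
freq = Counter(candies)

# For each color: (frequency, relative frequency, percent)
table = {color: (f, f / n, 100 * f / n) for color, f in freq.items()}

for color, (f, rel, pct) in table.items():
    print(f"{color:<8}{f:>3}   {rel:.2f}   {pct:.0f}%")
```

The relative frequencies always add up to 1, and the percents to 100%.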
Slide 19
Example
[Bar chart: Frequency (0 to 6) by Color: Brown, Yellow, Red, Blue, Orange, Green]
[Pie chart: Brown 12.0%, Green 16.0%, Yellow 16.0%, Orange 20.0%, Red 12.0%, Blue 24.0%]
Slide 20
[Line chart, source: Bureau of Labor Statistics. Months on the horizontal axis (Oct, Nov, Dec, Jan, Feb, Mar); values shown: 178.10, 177.60, 177.50, 177.30, 177.60, 178.00, 178.60]
Slide 23
[Stem-and-leaf plot. The leaves within each stem are reordered from smallest to largest, e.g. 580855 becomes 055588 and 000504050 becomes 000000455.]
Slide 24
No Outliers
Outlier
Slide 27
Create intervals
Slide 28
Slide 30
Example
The ages of 50 professors at a university:
34 48 70 63 52 52 35 50 37 43 53 43 52 44
42 31 36 48 43 26 58 62 49 34 48 53 39 45
34 59 34 66 40 59 36 41 35 36 62 34 38 28
43 50 30 43 32 44 58 53
Example
Age          Frequency   Relative Frequency   Percent
25 to < 33   5           5/50 = .10           10%
33 to < 41   14          14/50 = .28          28%
41 to < 49   13          13/50 = .26          26%
49 to < 57   9           9/50 = .18           18%
57 to < 65   7           7/50 = .14           14%
65 to < 73   2           2/50 = .04           4%
[Relative frequency histogram: relative frequency (0 to 14/50) against Ages, class boundaries 25, 33, 41, 49, 57, 65, 73]
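The class counts behind this histogram can be checked with a short Python sketch. It assumes the six classes of width 8 starting at 25, each including its lower boundary and excluding its upper one; the variable names are illustrative.

```python
# Ages of the 50 professors, as listed in the example.
ages = [
    34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
    42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
    34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
    43, 50, 30, 43, 32, 44, 58, 53,
]

edges = [25, 33, 41, 49, 57, 65, 73]  # class boundaries

# Count the ages in each class "lo to < hi".
freq = [sum(lo <= a < hi for a in ages) for lo, hi in zip(edges, edges[1:])]
rel_freq = [f / len(ages) for f in freq]

for (lo, hi), f, rel in zip(zip(edges, edges[1:]), freq, rel_freq):
    print(f"{lo} to < {hi}: {f:>2}  {rel:.2f}")
```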
Slide 32
[Relative frequency histogram of Ages, class boundaries 25, 33, 41, 49, 57, 65, 73]
Shape?
Any outliers?
What proportion of the tenured faculty are
younger than 41?
Slide 33
Slide 34
Slide 35
\bar{x} = \frac{\sum x_i}{n}
where n = number of measurements
Slide 36
Example 1
The set: 2, 9, 1, 5, 6
What is their arithmetic mean?
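A quick check in Python, as a sketch: sum the measurements and divide by n. The name `x_bar` is only a label for the sample mean.

```python
data = [2, 9, 1, 5, 6]

n = len(data)            # number of measurements
x_bar = sum(data) / n    # arithmetic mean: 23 / 5

print(x_bar)
```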
Median
The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
The position of the median is
.5(n + 1)
once the measurements have been
ordered.
Slide 38
Example 2
The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
Sort: 2, 3, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th ordered measurement = 5
The set: 2, 4, 9, 8, 6, 5 (n = 6)
Sort: 2, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements
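The position rule can be sketched in Python as follows. The function name `median_by_position` is made up for this illustration; it assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Median via the .5(n + 1) position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    pos = 0.5 * (n + 1)      # 1-based position of the median
    lower = int(pos)         # whole part of the position
    if pos == lower:         # odd n: the position is a whole number
        return ordered[lower - 1]
    # even n: average the two measurements on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_position([2, 4, 9, 8, 6, 5, 3]))
print(median_by_position([2, 4, 9, 8, 6, 5]))
```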
Slide 39
Mode
The mode is the measurement which occurs
most frequently.
The set: 2, 4, 9, 8, 8, 5, 3
The mode is 8, which occurs twice
The set: 2, 2, 9, 8, 8, 5, 3
There are two modes, 8 and 2 (bimodal)
The set: 2, 4, 9, 8, 5, 3
There is no mode (each value is unique).
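The three cases can be reproduced with a small Python sketch. The helper `modes` is hypothetical, and it follows the convention used above that a set in which every value is unique has no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if all are unique."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []            # each value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 9, 8, 8, 5, 3]))
print(modes([2, 2, 9, 8, 8, 5, 3]))
print(modes([2, 4, 9, 8, 5, 3]))
```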
Slide 40
Slide 42
Measures of Variability
A measure along the horizontal axis of
the data distribution that describes the
spread of the distribution from the
center.
Slide 43
The Range
The range, R, of a set of n measurements is
the difference between the largest and
smallest measurements.
Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
The range is R = 14 - 5 = 9.
Quick and easy, but only uses
2 of the 5 measurements.
Slide 44
The Variance
The variance is a measure of variability
that uses all the measurements. It
measures the average squared deviation of the
measurements about their mean.
Data: 5, 12, 6, 8, 14
\bar{x} = 45/5 = 9
[Dot plot of the data on a horizontal axis running from 4 to 14]
Slide 45
The Variance
The variance of a population of N
measurements is the average of the
squared deviations of the measurements
about their mean \mu:
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
Slide 46
Example 3
2 Ways to calculate Sample Variance:
1. Use the Definition Formula:
x_i    x_i - \bar{x}    (x_i - \bar{x})^2
5      -4               16
12     3                9
6      -3               9
8      -1               1
14     5                25
Sum 45 0                60
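The table can be turned into a sample variance with a short Python sketch. It assumes the usual definition of the sample variance, the sum of squared deviations divided by n - 1; the variable names are illustrative.

```python
data = [5, 12, 6, 8, 14]

n = len(data)
x_bar = sum(data) / n                          # 45 / 5 = 9

deviations = [x - x_bar for x in data]         # x_i - x_bar
sum_sq_dev = sum(d ** 2 for d in deviations)   # sum of squared deviations

s2 = sum_sq_dev / (n - 1)   # sample variance
s = s2 ** 0.5               # sample standard deviation

print(sum(deviations), sum_sq_dev, s2, round(s, 3))
```

The deviations always sum to 0, which is why they are squared before averaging.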
Slide 48
Example 3
2. Use the Computing Formula for s^2:
x_i    x_i^2
5      25
12     144
6      36
8      64
14     196
Sum 45 465
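A sketch of the computing formula in Python, assuming the shortcut form s^2 = (\sum x_i^2 - (\sum x_i)^2 / n) / (n - 1). Only the two column sums from the table are needed.

```python
data = [5, 12, 6, 8, 14]

n = len(data)
sum_x = sum(data)                    # sum of x_i
sum_x2 = sum(x * x for x in data)    # sum of x_i squared

# Computing (shortcut) formula for the sample variance.
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = s2 ** 0.5

print(sum_x, sum_x2, s2, round(s, 3))
```

Both routes give the same variance; the shortcut avoids computing each deviation.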
Slide 49
Some Notes
The value of s is never negative (s \ge 0).
The larger the value of s^2 or s, the larger
the variability of the data set.
Why divide by n - 1?
The sample standard deviation s is often used
to estimate the population standard deviation
\sigma. Dividing by n - 1 gives us a better estimate
of \sigma.
Slide 50
Tchebysheff's Theorem
Theorem: Given a number k greater than or equal
to 1 and a set of n measurements, at least 1 - (1/k^2)
of the measurements will lie within k standard
deviations of the mean.
Applies to any set of measurements. Can be used
for either samples (\bar{x} and s) or for a population (\mu
and \sigma).
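The bound is easy to tabulate. A minimal Python sketch, with `tchebysheff_bound` as an illustrative helper name:

```python
def tchebysheff_bound(k):
    """Minimum proportion of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("the theorem is stated for k >= 1")
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(tchebysheff_bound(k), 3))
```

For k = 1 the bound is 0, so the theorem says nothing useful about one standard deviation; it becomes informative from k = 2 (at least 3/4) and k = 3 (at least 8/9).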
Slide 51
Tchebysheff's Theorem
Slide 52
Example 1
Raw Data
The ages of 50 professors at a university:
34 48 70 63 52 52 35 50 37 43 53 43 52 44
42 31 36 48 43 26 58 62 49 34 48 53 39 45
34 59 34 66 40 59 36 41 35 36 62 34 38 28
43 50 30 43 32 44 58 53
\bar{x} = 44.9
s = 10.73
[Relative frequency histogram: relative frequency (0 to 14/50) against Ages, class boundaries 25, 33, 41, 49, 57, 65, 73]
Slide 54
Example 1
k   \bar{x} ± ks     Interval          Proportion in Interval   Tchebysheff     Empirical Rule
1   44.9 ± 10.73     34.17 to 55.63    31/50 (.62)              At least 0      ≈ .68
2   44.9 ± 21.46     23.44 to 66.36    49/50 (.98)              At least .75    ≈ .95
3   44.9 ± 32.19     12.71 to 77.09    50/50 (1.00)             At least .89    ≈ .997
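The proportions in this table can be recomputed from the 50 ages with a Python sketch like the one below. It uses `statistics.stdev`, which divides by n - 1; the dictionary `counts` is an illustrative name.

```python
from statistics import mean, stdev

# Ages of the 50 professors, as listed in the example.
ages = [
    34, 48, 70, 63, 52, 52, 35, 50, 37, 43, 53, 43, 52, 44,
    42, 31, 36, 48, 43, 26, 58, 62, 49, 34, 48, 53, 39, 45,
    34, 59, 34, 66, 40, 59, 36, 41, 35, 36, 62, 34, 38, 28,
    43, 50, 30, 43, 32, 44, 58, 53,
]

x_bar = mean(ages)   # sample mean
s = stdev(ages)      # sample standard deviation

counts = {}
for k in (1, 2, 3):
    lo, hi = x_bar - k * s, x_bar + k * s
    counts[k] = sum(lo <= a <= hi for a in ages)   # ages inside the interval
    bound = 1 - 1 / k ** 2                         # Tchebysheff lower bound
    print(f"k={k}: {lo:.2f} to {hi:.2f}, "
          f"{counts[k]}/{len(ages)} inside, at least {bound:.2f}")
```

The observed proportions respect the Tchebysheff bounds and sit close to the Empirical Rule values, as expected for mound-shaped data.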
Example 2
A distribution is relatively mound-shaped with mean 50 and
standard deviation 10.
a) What proportion of measurements will fall between 40 and 60?
b) What proportion of measurements will fall between 30 and 70?
c) What proportion of measurements will fall between 30 and 60?
d) If a measurement is chosen at random from this distribution, what is the
probability that it will be 60?
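Parts a) to c) can be approached with the Empirical Rule. The Python sketch below assumes the distribution is symmetric about its mean, so that half of each central proportion lies on either side; the values are approximations, and the names `EMPIRICAL` and `one_side` are illustrative.

```python
# Empirical Rule: approximate proportion within k standard deviations.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def one_side(k):
    """Approximate proportion between the mean and k standard deviations
    on one side, assuming symmetry about the mean."""
    return EMPIRICAL[k] / 2

mean, sd = 50, 10

a = EMPIRICAL[1]                  # 40 to 60 is mean +/- 1 sd
b = EMPIRICAL[2]                  # 30 to 70 is mean +/- 2 sd
c = one_side(2) + one_side(1)     # 30 to 50 plus 50 to 60

print(a, b, round(c, 3))
```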
Slide 56
Approximating S
From Tchebysheff's Theorem and the
Empirical Rule, we know that
R \approx 4s to 6s
s \approx R/4
or s \approx R/6 for a large data set.
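As a rough check, the approximation can be applied to the professors' ages, whose smallest and largest values are 26 and 70 and whose standard deviation was given as s = 10.73. A Python sketch:

```python
smallest, largest = 26, 70   # extremes of the 50 ages
s_reported = 10.73           # standard deviation given for the ages

R = largest - smallest       # range
s_approx = R / 4             # range approximation to s

print(R, s_approx, s_reported)
```

The approximation gives 11, reasonably close to 10.73. It is a sanity check on a calculated s rather than a substitute for it.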
Slide 57
z-score = \frac{x - \bar{x}}{s}
Suppose s = 2, \bar{x} = 5 and x = 9.
Then x lies 4 units, or 2s, above the mean, so z = (9 - 5)/2 = 2.
The Z-score
From Tchebysheff's Theorem and the Empirical Rule:
At least 3/4 and more likely 95% of measurements lie within
2 standard deviations of the mean.
At least 8/9 and more likely 99.7% of measurements lie
within 3 standard deviations of the mean.
z-scores between -2 and 2 are not unusual. z-scores should not
be more than 3 in absolute value. z-scores larger than 3 in
absolute value would indicate a possible outlier.
[Number line of z-scores: between -2 and 2 not unusual; between 2 and 3 in absolute value somewhat unusual; beyond 3 in absolute value outlier]
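A minimal Python sketch of the z-score and the rule of thumb above. The function names are illustrative, and how the exact boundaries |z| = 2 and |z| = 3 are classified is an assumption made here.

```python
def z_score(x, x_bar, s):
    """Distance of x from the mean, in standard deviations."""
    return (x - x_bar) / s

def classify(z):
    """Rule-of-thumb label for a z-score (boundary handling is assumed)."""
    if abs(z) <= 2:
        return "not unusual"
    if abs(z) <= 3:
        return "somewhat unusual"
    return "possible outlier"

z = z_score(9, x_bar=5, s=2)
print(z, classify(z))
print(classify(-2.5), classify(3.4))
```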
Slide 59
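[Figure: the p-th percentile is the value of x that is greater than p% of the measurements and less than the remaining (100 - p)%]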
Slide 60
Position of Q3 = .75(n + 1)
Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
Q1 is 3/4 of the way between the 4th and 5th ordered
measurements, or
Q1 = 65 + .75(65 - 65) = 65.
Slide 63
Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
Q3 is 1/4 of the way between the 14th and 15th
ordered measurements, or
Q3 = 74 + .25(75 - 74) = 74.25
and
IQR = Q3 - Q1 = 74.25 - 65 = 9.25
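The quartile calculation can be sketched in Python using the .25(n + 1) and .75(n + 1) positions with linear interpolation between neighbouring ordered measurements. The helper `value_at_position` is a made-up name for this illustration.

```python
def value_at_position(ordered, pos):
    """Value at a (possibly fractional) 1-based position in ordered data."""
    lower = int(pos)      # whole part of the position
    frac = pos - lower    # fraction of the way to the next measurement
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = sorted([40, 60, 65, 65, 65, 68, 68, 70, 70,
               70, 70, 70, 70, 74, 75, 75, 90, 95])
n = len(data)

q1 = value_at_position(data, 0.25 * (n + 1))   # position 4.75
q3 = value_at_position(data, 0.75 * (n + 1))   # position 14.25
iqr = q3 - q1

print(q1, q3, iqr)
```

Other software may use a different quartile convention and give slightly different values.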
Slide 64
Bivariate Data
Slide 65
Examples of Scatterplots
Curvilinear
No relationship
Slide 67
r = \frac{s_{xy}}{s_x s_y}
where
s_x^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}
Covariance
The new quantity sxy is called the covariance
between x and y, and is defined as:
s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1}
Slide 69
Example 4
x     y
14    178
15    230
17    240
19    275
16    200
[Scatterplot of y (180 to 280) against x (14 to 19)]
The scatterplot indicates a positive linear relationship.
Slide 70
Example 4
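x     y      xy
14    178    2492
15    230    3450
17    240    4080
19    275    5225
16    200    3200
Sum 81 1123  18447
\bar{x} = 16.2, s_x = 1.924
\bar{y} = 224.6, s_y = 37.360
Calculate
s_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{n - 1} = \frac{18447 - \frac{(81)(1123)}{5}}{4} = 63.6
r = \frac{s_{xy}}{s_x s_y} = \frac{63.6}{1.924(37.36)} = .885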
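The same calculation as a Python sketch, using the computing formula for the covariance. The helper `covariance` is an illustrative name; applying it to a variable with itself gives that variable's sample variance.

```python
from math import sqrt

x = [14, 15, 17, 19, 16]
y = [178, 230, 240, 275, 200]

def covariance(u, v):
    """Sample covariance via the computing formula (divides by n - 1)."""
    n = len(u)
    sum_uv = sum(a * b for a, b in zip(u, v))
    return (sum_uv - sum(u) * sum(v) / n) / (n - 1)

s_xy = covariance(x, y)
s_x = sqrt(covariance(x, x))   # covariance of x with itself is s_x squared
s_y = sqrt(covariance(y, y))
r = s_xy / (s_x * s_y)

print(round(s_xy, 1), round(s_x, 3), round(s_y, 3), round(r, 3))
```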
Interpreting r
-1 \le r \le 1
r \approx 0
r = 1 or -1
b = r \frac{s_y}{s_x}
a = \bar{y} - b\bar{x}
[Scatterplot of y (180 to 260) against x (14 to 19)]
Slide 74
Example 5
x     y      xy
14    178    2492
15    230    3450
17    240    4080
19    275    5225
16    200    3200
Sum 81 1123  18447
Recall
\bar{x} = 16.2, s_x = 1.9235
\bar{y} = 224.6, s_y = 37.3604
r = .885
b = r \frac{s_y}{s_x} = (.885) \frac{37.3604}{1.9235} = 17.189
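A short Python sketch of the slope calculation, followed by the intercept from a = \bar{y} - b\bar{x}. It starts from the rounded summary values recalled above, so the last decimals may differ slightly from a calculation on the raw data.

```python
# Summary values recalled in the example (already rounded).
r = 0.885
s_x, s_y = 1.9235, 37.3604
x_bar, y_bar = 16.2, 224.6

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

print(round(b, 3), round(a, 2))
```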