
PHE1012 Statistics

Trimester 1, Year 2016-17


BEng (Hons)
Pharmaceutical Engineering

Instructor
Fung Ho-Ki
Phone: 6592-2197
Email: Hoki.Fung@SingaporeTech.edu.sg

Slide 2

Tentative Schedule

Slide 3

Times & Venues


Lecture
Tue 1300-1500 @ LT2A
Thu 1300-1400 @ LT2A
Tutorials
T1: Tue 1500-1600 @ SR2A
T2: Thu 1500-1600 @ SR2A
T3: Thu 1400-1500 @ SR2A

Quiz
Tentatively on Sep 29 (Thu) 1300-1400, room TBA
Midterm
Tentatively on Oct 27 (Thu) after lecture, room TBA

Slide 4

Textbooks
Recommended Main Textbook
Introduction to Probability and Statistics,
International Edition (14th Edition),
William Mendenhall, Robert J. Beaver &
Barbara M. Beaver,
ISBN-13: 9781133111504
Library has few copies

Slide 5

Assessment

CA may include a small project on Minitab (details TBA)

Slide 6

Basic Tools for Data Collection, Organization and Description

What is Statistics?
Statistics - the science of collecting and analyzing data
(a set of measurements) in large quantities
When first presented with a set of measurements, we
need to find a way to organize and summarize it
The branch of statistics that presents techniques for
describing sets of measurements is called descriptive
statistics (e.g., bar charts, pie charts, line charts,
numerical tables, etc.)
Sometimes it may be too expensive or time consuming
to enumerate the entire population. We may only have
a sample from the population. The branch of statistics
that deals with making inferences about population
characteristics from information contained in a sample
is called inferential statistics.
Slide 8

Variables & Data


A variable is a characteristic that changes or varies
over time and/or for different individuals or objects
under consideration, e.g., hair color, white blood
cell count, time to failure of a computer component,
etc.
An experimental unit is the individual or object on
which a variable is measured
A measurement results when a variable is actually
measured on an experimental unit
A population is the set of all measurements of
interest to the investigator
A sample is a subset of measurements selected
from the population of interest
Slide 9

Example 1
Variable
Hair color
Experimental unit
Person
Typical Measurements
Brown, black, blonde, etc.

Slide 10

Example 2
Variable
Time until a light bulb burns out
Experimental unit
Light bulb
Typical Measurements
1500 hours, 1535.5 hours, etc.
Slide 11

Variables & Data


Univariate data: One variable is measured on a
single experimental unit
Bivariate data: Two variables are measured on
a single experimental unit
Multivariate data: More than two variables are
measured on a single experimental unit

Slide 12

Types of Variables
Variables

Qualitative

Quantitative

Discrete

Continuous
Slide 13

Qualitative Variables
Qualitative variables measure a quality or
characteristic on each experimental unit.
They produce data that can be categorized
according to similarities or differences in kind;
hence they are often called categorical data.
Examples:
Hair color (black, brown, blonde, ...)
Gender (male, female)
DNA bases (adenine (A), guanine (G), thymine (T), cytosine (C))
Amino acid type (alanine, glutamine, methionine, ...)
Slide 14

Quantitative Variables
Quantitative variables measure a numerical
quantity or amount on each experimental unit.
Two types of quantitative variables:
A discrete variable can assume only a finite or countable number of values.
A continuous variable can assume the infinitely many values corresponding to the points on a line interval.

Slide 15

Examples
Total number of workers in a
pharmaceutical manufacturing plant:
Quantitative discrete
Operating temperature / pressure in the
distillation column in the plant
Quantitative continuous
My blood type
Qualitative

Slide 16

Graphs for Categorical Data


After the data have been collected, they can be
consolidated and summarized to show:
o what values of the variable have been measured
o how often each value has occurred
We can construct a statistical table to display the data
graphically as a data distribution
For a qualitative (categorical) variable, "how often" can be measured in three different ways:
Frequency: the number of measurements in each category
Relative frequency: the proportion of measurements in each category
Percentage: the percentage of measurements in each category
Slide 17

Graphs for Categorical Data


Let n be the total number of measurements in the set.
$\text{Relative frequency} = \dfrac{\text{Frequency}}{n}$
$\text{Percent} = 100 \times \text{Relative frequency}$
The sum of the frequencies is always n.
The sum of the relative frequencies is 1.
The sum of the percentages is 100%.

Slide 18

Example
A bag of M&Ms contains 25 candies.
Raw data: (shown as a figure on the slide; not reproduced)

Statistical table:

Color    Tally     Frequency   Relative Frequency   Percent
Red      |||       3           3/25 = .12           12%
Blue     ||||| |   6           6/25 = .24           24%
Green    ||||      4           4/25 = .16           16%
Orange   |||||     5           5/25 = .20           20%
Brown    |||       3           3/25 = .12           12%
Yellow   ||||      4           4/25 = .16           16%
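The frequency, relative frequency and percent columns can be sketched in a few lines of Python. This is only an illustration: the slide's raw data is not available, so the `candies` list here is a hypothetical reconstruction built from the six frequencies in the table.

```python
from collections import Counter

# Hypothetical raw data: rebuilt from the frequencies in the table,
# because the original raw list is not reproduced in these notes.
candies = (["Red"] * 3 + ["Blue"] * 6 + ["Green"] * 4 +
           ["Orange"] * 5 + ["Brown"] * 3 + ["Yellow"] * 4)

n = len(candies)            # total number of measurements
counts = Counter(candies)   # frequency of each category

# Each row: (category, frequency, relative frequency, percent)
table = [(color, f, f / n, 100 * f / n) for color, f in counts.items()]

for color, f, rel, pct in table:
    print(f"{color:<8}{f:>3}{rel:>8.2f}{pct:>7.0f}%")
```

Summing the second column gives n, the third column sums to 1 and the fourth to 100, which is a quick check on any hand-built table.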
Slide 19

Example
Bar chart: Frequency (vertical axis, 0 to 6) against Color (horizontal axis), with bars for Brown, Yellow, Red, Blue, Orange and Green.

Pie chart: Brown 12.0%, Green 16.0%, Yellow 16.0%, Orange 20.0%, Red 12.0%, Blue 24.0%.

Slide 20

Graphs for Quantitative Data


A single quantitative variable measured for different
population segments or for different categories of
classification can be graphed using a pie or bar
chart.
A single quantitative variable measured over time is
called a time series. Time series data are most
effectively presented on a line chart with time as the
horizontal axis.
Month   Sept     Oct      Nov      Dec      Jan      Feb      Mar
CPI     178.10   177.60   177.50   177.30   177.60   178.00   178.60

CPI: All Urban Consumers, Seasonally Adjusted (source: Bureau of Labor Statistics)


Slide 21

Graphs for Quantitative Data


DOTPLOTS
The simplest graph for quantitative data
Plots the measurements as points on a
horizontal axis, stacking the points that
duplicate existing points.
Example: The set 4, 5, 5, 7, 6

STEM & LEAF PLOTS


Slide 22

Stem & Leaf Plots


A simple graph for quantitative data
Uses the actual numerical values of each data
point
Divide each measurement into two parts: the stem and
the leaf.
List the stems in a column, with a vertical line to their
right.
For each measurement, record the leaf portion in the
same row as its matching stem.
Order the leaves from lowest to highest in each stem.
Provide a key to your coding.

Slide 23

Stem & Leaf Plots


The prices ($) of 18 brands of a particular product:
90 70 70 70 75 70 65 68 74
70 95 75 70 68 65 40 60 65

Stem and leaf plot:        Reordered:
4 | 0                      4 | 0
5 |                        5 |
6 | 5 8 0 8 5 5            6 | 0 5 5 5 8 8
7 | 0 0 0 5 0 4 0 5 0      7 | 0 0 0 0 0 0 4 5 5
8 |                        8 |
9 | 0 5                    9 | 0 5
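The construction steps for a stem and leaf plot can be sketched in Python as follows. The function name `stem_and_leaf` and the choice of tens digit as stem and units digit as leaf are assumptions made for this illustration; they fit the 18 prices but would need adjusting for other data.

```python
from collections import defaultdict

prices = [90, 70, 70, 70, 75, 70, 65, 68, 74,
          70, 95, 75, 70, 68, 65, 40, 60, 65]

def stem_and_leaf(data):
    """Return {stem: ordered leaves}, using tens as stem and units as leaf."""
    leaves = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    # Keep empty stems so that gaps in the data remain visible.
    return {s: sorted(leaves[s]) for s in range(min(leaves), max(leaves) + 1)}

plot = stem_and_leaf(prices)
for stem, leaf_list in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaf_list)}")
print("Key: 6 | 5 means $65")
```

Keeping the empty stems 5 and 8 in the output is a deliberate choice: it makes 40 and the two values in the 90s stand apart from the main body of the data.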
Slide 24

Interpreting Graphs (1)


Location and Spread
Where is the data centered on the horizontal axis, and how does it spread out from the center?
Slide 25

Interpreting Graphs (2)


Shapes
Mound shaped and symmetric
(mirror images)
Skewed right: a few unusually large measurements
Skewed left: a few unusually small measurements
Bimodal: two local peaks
Slide 26

Interpreting Graphs (3)


Outliers
Are there any strange or unusual measurements that stand out in the data set?
(Figure: two dotplots, one labeled "No Outliers" and one labeled "Outlier".)

Slide 27

Relative Frequency Histograms


A relative frequency histogram for a quantitative
data set is a bar graph in which the height of the
bar shows how often (measured as a proportion
or relative frequency) measurements fall in a
particular class or subinterval. The classes or
subintervals are plotted along the horizontal axis.

Create intervals

Stack and draw bars

Slide 28

Relative Frequency Histograms


Divide the range of the data into 5-12 subintervals of
equal length; the more data available, the more
subintervals you need
Calculate the approximate width of the subinterval as
Range/number of subintervals.
Round the approximate width up to a convenient
value.
Use the method of left inclusion: include the left endpoint of each subinterval in your tally, but not the right.
Create a statistical table including the subintervals,
their frequencies and relative frequencies.
Slide 29

Relative Frequency Histograms


Draw the relative frequency histogram, plotting
the subintervals on the horizontal axis and the
relative frequencies on the vertical axis.
The height of the bar represents:
The proportion of measurements falling in
that class or subinterval.
The probability that a single measurement,
drawn at random from the set, will belong to
that class or subinterval.

Slide 30

Example
The ages of 50 professors at a university:

34 42 34 43 48 31 59 50 70 36
34 30 63 48 66 43 52 43 40 32
52 26 59 44 35 58 36 58 50 37
43 53 43 52 44 62 49 34 48 53
39 45 41 35 36 62 34 38 28 53

We choose to use 6 intervals.


Minimum class width = (70 - 26)/6 = 7.33
Convenient class width = 8
Use 6 classes of length 8, starting at 25.
Slide 31

Example
Age          Tally              Frequency   Relative Frequency   Percent
25 to < 33   |||||              5           5/50 = .10           10%
33 to < 41   ||||| ||||| ||||   14          14/50 = .28          28%
41 to < 49   ||||| ||||| |||    13          13/50 = .26          26%
49 to < 57   ||||| ||||         9           9/50 = .18           18%
57 to < 65   ||||| ||           7           7/50 = .14           14%
65 to < 73   ||                 2           2/50 = .04           4%

(Relative frequency histogram: Ages on the horizontal axis with class boundaries 25, 33, 41, 49, 57, 65, 73; relative frequency on the vertical axis from 0 to 14/50.)
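The class tally for the professors' ages can be sketched in Python. The helper name `frequency_table` and its parameters are chosen for this illustration only; the starting point 25, width 8 and 6 classes come from the example, and the `[lo, hi)` test implements the method of left inclusion.

```python
ages = [34, 42, 34, 43, 48, 31, 59, 50, 70, 36,
        34, 30, 63, 48, 66, 43, 52, 43, 40, 32,
        52, 26, 59, 44, 35, 58, 36, 58, 50, 37,
        43, 53, 43, 52, 44, 62, 49, 34, 48, 53,
        39, 45, 41, 35, 36, 62, 34, 38, 28, 53]

def frequency_table(data, start, width, n_classes):
    """Tally data into equal-width classes [lo, hi) (left inclusion)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)   # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[k] += 1
    n = len(data)
    # Each row: (left endpoint, right endpoint, frequency, relative frequency)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]

table = frequency_table(ages, start=25, width=8, n_classes=6)
for lo, hi, freq, rel in table:
    print(f"{lo} to < {hi}: {freq:>2}  {rel:.2f}")
```

Plotting the relative frequencies as bar heights over these classes gives the relative frequency histogram.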

Slide 32

Describing the Distribution


(Relative frequency histogram of the ages: class boundaries 25, 33, 41, 49, 57, 65, 73 on the horizontal axis; relative frequency on the vertical axis.)

Shape?
Any outliers?
What proportion of the tenured faculty are younger than 41?
What is the probability that a randomly selected faculty member is 49 or older?

Slide 33

Describing Data with Numerical Measures
Graphical methods may not always be
sufficient for describing data.
Numerical measures can be created for
both populations and samples.
A parameter is a numerical descriptive
measure calculated for a population.
A statistic is a numerical descriptive
measure calculated for a sample.

Slide 34

Measures of the Center


A measure along the horizontal axis of the
data distribution that locates the center of
the distribution.

Slide 35

Arithmetic Mean or Average


The mean of a set of measurements is the
sum of the measurements divided by the
total number of measurements.

$\bar{x} = \dfrac{\sum x_i}{n}$
where n = number of measurements and $\sum x_i$ = sum of all the measurements

Slide 36

Example 1
The set: 2, 9, 1, 5, 6
What is their arithmetic mean?
$\bar{x} = \dfrac{2 + 9 + 1 + 5 + 6}{5} = \dfrac{23}{5} = 4.6$

If we were able to enumerate the whole population, the population mean would be called $\mu$ (the Greek letter mu).
Slide 37

Median
The median of a set of measurements
is the middle measurement when the
measurements are ranked from
smallest to largest.
The position of the median is

.5(n + 1)
once the measurements have been
ordered.
Slide 38

Example 2
The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
Sort: 2, 3, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(7 + 1) = 4th
Median = 4th ordered measurement = 5

The set: 2, 4, 9, 8, 6, 5 (n = 6)
Sort: 2, 4, 5, 6, 8, 9
Position: .5(n + 1) = .5(6 + 1) = 3.5th
Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements

Slide 39

Mode
The mode is the measurement which occurs
most frequently.
The set: 2, 4, 9, 8, 8, 5, 3
The mode is 8, which occurs twice
The set: 2, 2, 9, 8, 8, 5, 3
There are two modes, 8 and 2 (bimodal)
The set: 2, 4, 9, 8, 5, 3
There is no mode (each value is unique).
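The three measures of center can be checked against the small sets used on these slides. This sketch uses Python's standard `statistics` module for the mean and median; the helper `modes` is written for this illustration so that it follows the slides' convention of reporting no mode when every value is unique.

```python
from collections import Counter
from statistics import mean, median

def modes(data):
    """Most frequent value(s); empty list if every value occurs only once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []   # slide convention: no mode when all values are unique
    return sorted(v for v, c in counts.items() if c == top)

print(mean([2, 9, 1, 5, 6]))            # arithmetic mean
print(median([2, 4, 9, 8, 6, 5, 3]))    # n odd: the middle measurement
print(median([2, 4, 9, 8, 6, 5]))       # n even: average of the middle two
print(modes([2, 4, 9, 8, 8, 5, 3]))     # one mode
print(modes([2, 2, 9, 8, 8, 5, 3]))     # bimodal
print(modes([2, 4, 9, 8, 5, 3]))        # no mode
```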
Slide 40

Extreme Values (1)


The mean is more easily affected by extremely large or small values than the median.

The median is often used as a measure of center when the distribution is skewed.
Slide 41

Extreme Values (2)


Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median

Slide 42

Measures of Variability
A measure along the horizontal axis of
the data distribution that describes the
spread of the distribution from the
center.

Slide 43

The Range
The range, R, of a set of n measurements is
the difference between the largest and
smallest measurements.
Example: A botanist records the number of
petals on 5 flowers:
5, 12, 6, 8, 14
The range is R = 14 - 5 = 9.
Quick and easy, but only uses
2 of the 5 measurements.
Slide 44

The Variance
The variance is a measure of variability that uses all the measurements. It measures the average squared deviation of the measurements about their mean.
Data: 5, 12, 6, 8, 14
$\bar{x} = \dfrac{45}{5} = 9$
(Dotplot of the data on an axis from 4 to 14.)
Slide 45

The Variance
The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean $\mu$:
$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$

The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n - 1):
$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$

Slide 46

The Standard Deviation


In calculating the variance, we squared all
of the deviations, and in doing so changed
the scale of the measurements.
To return this measure of variability to the
original units of measure, we calculate the
standard deviation, the positive square
root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Slide 47

Example 3
Two ways to calculate the sample variance:
1. Use the definition formula:

$x_i$     $x_i - \bar{x}$   $(x_i - \bar{x})^2$
5         -4                16
12        3                 9
6         -3                9
8         -1                1
14        5                 25
Sum 45    0                 60

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{60}{4} = 15$
Slide 48

Example 3
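2. Use the computing formula for $s^2$:
$s^2 = \dfrac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1}$

$x_i$     $x_i^2$
5         25
12        144
6         36
8         64
14        196
Sum 45    465

$s^2 = \dfrac{465 - 45^2/5}{4} = \dfrac{60}{4} = 15$, so $s = \sqrt{15} \approx 3.87$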
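Both routes to the sample variance can be verified numerically. The sketch below applies the definition formula and the computing formula to the petal counts; the variable names are chosen for this illustration.

```python
import math

petals = [5, 12, 6, 8, 14]
n = len(petals)
x_bar = sum(petals) / n

# Definition formula: sum of squared deviations divided by (n - 1)
s2_definition = sum((x - x_bar) ** 2 for x in petals) / (n - 1)

# Computing formula: (sum of squares - (sum)^2 / n) divided by (n - 1)
s2_computing = (sum(x * x for x in petals) - sum(petals) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)   # back to the original units

print(x_bar, s2_definition, s2_computing, round(s, 2))
```

The two formulas agree exactly here. The computing formula avoids calculating each deviation by hand, although on a computer with large values and small spread it can lose precision, so the definition formula is the safer default in code.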
Slide 49

Some Notes
The value of s is NEVER negative (it is 0 only when all the measurements are equal).
The larger the value of $s^2$ or s, the larger the variability of the data set.
Why divide by n - 1?
The sample standard deviation s is often used to estimate the population standard deviation $\sigma$. Dividing by n - 1 gives us a better estimate of $\sigma$.

Slide 50

Tchebysheff's Theorem
Theorem: Given a number k greater than or equal to 1 and a set of n measurements, at least $1 - 1/k^2$ of the measurements will lie within k standard deviations of the mean.
Applies to any set of measurements. Can be used for either a sample ($\bar{x}$ and s) or a population ($\mu$ and $\sigma$).

Slide 51

Tchebysheff's Theorem

The mean and variance of a sample of n = 25 measurements are 75 and 100 (i.e., $\bar{x} = 75$ and $s^2 = 100$, so s = 10).
At least 3/4 of the 25 measurements lie in the interval $\bar{x} \pm 2s = 75 \pm 2(10)$ (i.e., 55 to 95).
At least 8/9 of the 25 measurements lie in the interval $\bar{x} \pm 3s = 75 \pm 3(10)$ (i.e., 45 to 105).

Slide 52

The Empirical Rule


Given a distribution of measurements that is approximately mound-shaped:
The interval $\mu \pm \sigma$ contains approximately 68% of the measurements.
The interval $\mu \pm 2\sigma$ contains approximately 95% of the measurements.
The interval $\mu \pm 3\sigma$ contains approximately 99.7% of the measurements.

Note: the rule does not work for all data sets; it only works well for mound-shaped distributions.
Slide 53

Example 1
Raw Data
The ages of 50 professors at a university:
34 42 34 43 48 31 59 50 70 36
34 30 63 48 66 43 52 43 40 32
52 26 59 44 35 58 36 58 50 37
43 53 43 52 44 62 49 34 48 53
39 45 41 35 36 62 34 38 28 53

$\bar{x} = 44.9$
$s = 10.73$

(Relative frequency histogram of the ages: class boundaries 25, 33, 41, 49, 57, 65, 73 on the horizontal axis; relative frequency on the vertical axis.)

Shape? Skewed right

Slide 54

Example 1
k   $\bar{x} \pm ks$   Interval         Proportion in Interval   Tchebysheff    Empirical Rule
1   44.9 ± 10.73       34.17 to 55.63   31/50 (.62)              At least 0     .68
2   44.9 ± 21.46       23.44 to 66.36   49/50 (.98)              At least .75   .95
3   44.9 ± 32.19       12.71 to 77.09   50/50 (1.00)             At least .89   .997

Do the actual proportions in the three intervals agree with those given by Tchebysheff's Theorem?
Do they agree with the Empirical Rule?
Why or why not?
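The comparison in this example can be reproduced with a short Python sketch. It counts how many of the 50 ages fall within k sample standard deviations of the sample mean and sets that proportion beside the Tchebysheff lower bound and the Empirical Rule value; the variable names are illustrative.

```python
from statistics import mean, stdev

ages = [34, 42, 34, 43, 48, 31, 59, 50, 70, 36,
        34, 30, 63, 48, 66, 43, 52, 43, 40, 32,
        52, 26, 59, 44, 35, 58, 36, 58, 50, 37,
        43, 53, 43, 52, 44, 62, 49, 34, 48, 53,
        39, 45, 41, 35, 36, 62, 34, 38, 28, 53]

x_bar = mean(ages)
s = stdev(ages)                           # sample standard deviation (n - 1)
empirical = {1: 0.68, 2: 0.95, 3: 0.997}  # Empirical Rule proportions

rows = []
for k in (1, 2, 3):
    lo, hi = x_bar - k * s, x_bar + k * s
    inside = sum(lo <= x <= hi for x in ages)
    tchebysheff = 1 - 1 / k ** 2          # lower bound, valid for any data
    rows.append((k, inside / len(ages), tchebysheff, empirical[k]))

for k, actual, bound, rule in rows:
    print(f"k={k}: actual {actual:.2f}, at least {bound:.2f}, empirical {rule}")
```

Every actual proportion meets the Tchebysheff bound, as it must for any data set. The agreement with the Empirical Rule is only approximate, which is consistent with the histogram being somewhat skewed to the right.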
Slide 55

Example 2
A distribution is relatively mound-shaped with mean 50 and standard deviation 10.
a) What proportion of measurements will fall between 40 and 60?
b) What proportion of measurements will fall between 30 and 70?
c) What proportion of measurements will fall between 30 and 60?
d) If a measurement is chosen at random from this distribution, what is the
probability that it will be 60?

Slide 56

Approximating s
From Tchebysheff's Theorem and the Empirical Rule, we know that
$R \approx 4s$ to $6s$

To approximate the standard deviation of a set of measurements, we can use:
$s \approx R/4$, or $s \approx R/6$ for a large data set.
Slide 57

Measures of Relative Standing


Where does one particular measurement stand
in relation to the other measurements in the
data set?
How many standard deviations away from the
mean does the measurement lie? This is
measured by the z-score.

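$z\text{-score} = \dfrac{x - \bar{x}}{s}$

Suppose $\bar{x} = 5$ and s = 2. Then x = 9 lies 4 units above the mean, so z = (9 - 5)/2 = 2: x = 9 lies 2 standard deviations from the mean.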


Slide 58

The Z-score
From Tchebysheff's Theorem and the Empirical Rule:
At least 3/4 and more likely 95% of measurements lie within 2 standard deviations of the mean.
At least 8/9 and more likely 99.7% of measurements lie within 3 standard deviations of the mean.
z-scores between -2 and 2 are not unusual. z-scores should not be more than 3 in absolute value. z-scores larger than 3 in absolute value would indicate a possible outlier.

(Figure: z-score axis. Between -2 and 2: not unusual. Between 2 and 3 in absolute value: somewhat unusual. Beyond 3 in absolute value: outlier.)
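A z-score and the rough classification above can be sketched in Python. The function names and the exact treatment of the boundary values 2 and 3 are assumptions made for this illustration; the slides only give the three bands.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def classify(z):
    """Label a z-score using the cut-offs at 2 and 3 in absolute value."""
    size = abs(z)
    if size <= 2:
        return "not unusual"
    if size <= 3:
        return "somewhat unusual"
    return "possible outlier"

z = z_score(9, mean=5, sd=2)   # the example with x = 9, mean 5, s = 2
print(z, classify(z))
```

These labels are screening guidelines, not tests: a measurement flagged as a possible outlier still needs to be checked against how the data were collected.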
Slide 59

Measures of Relative Standing


How many measurements lie below the measurement of interest? This is measured by the pth percentile.
(Figure: p% of the measurements lie below the pth percentile x, and (100 - p)% lie above it.)

Slide 60

Quartiles and the IQR


The lower quartile (Q1) is the value of x
which is larger than 25% and less than
75% of the ordered measurements.
The upper quartile (Q3) is the value of x
which is larger than 75% and less than
25% of the ordered measurements.
The range of the middle 50% of the
measurements is the interquartile range,
IQR = Q3 - Q1
Slide 61

Calculating Sample Quartiles


The lower and upper quartiles (Q1 and Q3) can be calculated as follows:
The position of Q1 is .25(n + 1)
The position of Q3 is .75(n + 1)
once the measurements have been ordered. If the positions are not integers, find the quartiles by interpolation.
Slide 62

Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25

Q1 is 3/4 of the way between the 4th and 5th ordered measurements, or
Q1 = 65 + .75(65 - 65) = 65.

Slide 63

Example 3
Raw data:
40 60 65 65 65 68 68 70 70
70 70 70 70 74 75 75 90 95
Position of Q1 = .25(18 + 1) = 4.75
Position of Q3 = .75(18 + 1) = 14.25
Q3 is 1/4 of the way between the 14th and 15th
ordered measurements, or
Q3 = 74 + .25(75 - 74) = 74.25
and
IQR = Q3 - Q1 = 74.25 - 65 = 9.25
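The position-and-interpolation rule for quartiles can be sketched in Python. The helper name `value_at_position` is an assumption for this illustration; it implements the fraction × (n + 1) position rule from these slides, which differs from the default quartile method in some software packages.

```python
def value_at_position(sorted_data, fraction):
    """Value at 1-based position fraction * (n + 1), found by interpolation."""
    n = len(sorted_data)
    pos = fraction * (n + 1)
    pos = min(max(pos, 1), n)      # keep the position inside the data
    lower = int(pos)               # ordered measurement just below pos
    weight = pos - lower           # how far to move toward the next one
    upper = min(lower + 1, n)
    lo_val, hi_val = sorted_data[lower - 1], sorted_data[upper - 1]
    return lo_val + weight * (hi_val - lo_val)

prices = sorted([40, 60, 65, 65, 65, 68, 68, 70, 70,
                 70, 70, 70, 70, 74, 75, 75, 90, 95])

q1 = value_at_position(prices, 0.25)   # position 4.75
q3 = value_at_position(prices, 0.75)   # position 14.25
iqr = q3 - q1
print(q1, q3, iqr)
```

Calling the same helper with fraction 0.5 gives the median, since .5(n + 1) is the median position used earlier.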
Slide 64

Bivariate Data

When two variables are measured on a single experimental unit, the resulting data are called bivariate data.
You can describe each variable individually, and you can also explore the relationship between the two variables.
When both of the variables are quantitative, call one variable x and the other y. A single measurement is a pair of numbers (x, y) that can be plotted using a two-dimensional graph called a scatterplot.
(Figure: the point (2, 5) plotted at x = 2, y = 5.)

Slide 65

Describing the Scatterplot

What pattern or form do you see?


Straight line upward or downward
Curve or no pattern at all (just random
scattering of points)
How strong is the pattern?
Strong: all of the points follow the pattern
exactly
Weak: relationship barely visible
Are there any unusual observations?
Clusters or outliers
Slide 66

Examples of Scatterplots

Positive linear - strong

Negative linear - weak

Curvilinear

No relationship
Slide 67

The Correlation Coefficient


The strength and direction of the relationship
between x and y are measured using the
correlation coefficient, r.

$r = \dfrac{s_{xy}}{s_x s_y}$

where

$s_x^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1}$

$s_x$ = standard deviation of the x's
$s_y$ = standard deviation of the y's


Slide 68

Covariance
The new quantity $s_{xy}$ is called the covariance between x and y, and is defined as:

$s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

It is a measure of how much the two variables change together. There is also a computing formula for covariance:

$s_{xy} = \dfrac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1}$
Slide 69

Example 4
x    14    15    17    19    16
y    178   230   240   275   200

(Scatterplot: x on the horizontal axis from 14 to 19, y on the vertical axis from 180 to 280.)
The scatterplot indicates a positive linear relationship.
Slide 70

Example 4
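x        y      xy
14       178    2492
15       230    3450
17       240    4080
19       275    5225
16       200    3200
Sum 81   1123   18447

$\bar{x} = 16.2$, $s_x = 1.924$
$\bar{y} = 224.6$, $s_y = 37.360$

Calculate

$s_{xy} = \dfrac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{n - 1} = \dfrac{18447 - \dfrac{(81)(1123)}{5}}{4} = 63.6$

$r = \dfrac{s_{xy}}{s_x s_y} = \dfrac{63.6}{1.924(37.36)} = .885$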
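The covariance and correlation for this example can be checked with a short Python sketch that follows the computing formula. The helper `sample_sd` and the variable names are chosen for this illustration.

```python
import math

x = [14, 15, 17, 19, 16]
y = [178, 230, 240, 275, 200]
n = len(x)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# Computing formula for the covariance
sum_xy = sum(a * b for a, b in zip(x, y))
s_xy = (sum_xy - sum(x) * sum(y) / n) / (n - 1)

s_x, s_y = sample_sd(x), sample_sd(y)
r = s_xy / (s_x * s_y)

print(sum_xy, round(s_xy, 1), round(s_x, 3), round(s_y, 3), round(r, 3))
```

The positive sign of r matches the upward trend in the scatterplot, and a value near 1 indicates a fairly strong linear relationship.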

Interpreting r
$-1 \le r \le 1$: Sign of r indicates direction of the linear relationship.
$r \approx 0$: Weak relationship; random scatter of points.
$r \approx 1$ or $-1$: Strong relationship; either positive or negative.
$r = 1$ or $-1$: All points fall exactly on a straight line.
Slide 72

The Regression Line


Sometimes x and y are related in a particular way: the value of y depends on the value of x.
y = dependent variable
x = independent variable
The form of the linear relationship between x and y
can be described by fitting a line as best we can
through the points. This is the regression line (or
least-squares line*),
y = a + bx.
a = y-intercept of the line
b = slope of the line
* The best fit in the least-squares sense minimizes the sum of squared residuals, a residual
being the difference between a data point and the fitted value provided by the regression line.
Slide 73

The Regression Line


To find the slope and y-intercept of the best fitting line, use:

$b = r\dfrac{s_y}{s_x}$
$a = \bar{y} - b\bar{x}$

The least squares regression line is y = a + bx.
(Figure: scatterplot of y against x with the fitted line; x from 14 to 19, y from 180 to 280.)

Slide 74

Example 5
x        y      xy
14       178    2492
15       230    3450
17       240    4080
19       275    5225
16       200    3200
Sum 81   1123   18447

Recall
$\bar{x} = 16.2$, $s_x = 1.9235$
$\bar{y} = 224.6$, $s_y = 37.3604$
$r = .885$

$b = r\dfrac{s_y}{s_x} = (.885)\dfrac{37.3604}{1.9235} = 17.189$

$a = \bar{y} - b\bar{x} = 224.6 - 17.189(16.2) = -53.86$

Regression line: $y = -53.86 + 17.189x$
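The slope and intercept can be reproduced in Python from the same five (x, y) pairs. This sketch uses the relations b = r(s_y/s_x) and a = ȳ - b·x̄ from the slides; the helper and variable names are illustrative.

```python
import math

x = [14, 15, 17, 19, 16]
y = [178, 230, 240, 275, 200]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

def sample_sd(values, m):
    """Sample standard deviation about the mean m (divides by n - 1)."""
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

s_x, s_y = sample_sd(x, x_bar), sample_sd(y, y_bar)
s_xy = sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y)) / (n - 1)
r = s_xy / (s_x * s_y)

b = r * s_y / s_x          # slope
a = y_bar - b * x_bar      # y-intercept

def predict(x_new):
    """Fitted value on the least-squares line y = a + b x."""
    return a + b * x_new

print(round(b, 3), round(a, 2), round(predict(16), 1))
```

Note that the intercept is negative. The fitted line describes the data only over the observed range of x (14 to 19), so the intercept should not be read as a prediction at x = 0.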
