
Descriptive statistics

1.1
Lecture 1: An Introduction to
MATH1041

Course aims

This course provides an introduction to statistics: the study of collecting, analysing, and interpreting data.

Statistics plays a fundamental role in quantitative research (research involving data). Some examples of fields in which quantitative research plays a major role are psychology, biology, physics, economics, . . .

1.2
What is statistics?

statistic: a summary of data (which are measures of events)

field of statistics: the collecting, analysing and understanding of data measured with uncertainty

Who needs to know about statistical methods? Anyone who collects, analyses or wants to understand data. (And a lot of people do!)

1.3
An example

Does smoking when you're pregnant affect your child's development?

A study was conducted (Johns et al. 1993) using guinea pigs to address
this question. Ten pregnant guinea pigs were injected with nicotine
tartrate, and ten were not. Offspring were then given an intelligence
test, a maze through which they had to pass to find food.

Results (number of errors the offspring made in the maze):

Group       Sample size   Mean   Standard deviation
Control     10            23.4   12.3
Treatment   10            44.3   21.5

Is there evidence that guinea pigs in the treatment group (those whose
mums were smokers) were slower learners, on average?
1.4
Skills to be developed
In this course, you will learn how to approach designing studies and
analysing data to answer research questions like the above. In partic-
ular, at the end of this course, you will be able to:
1. Recognise which analysis procedure is appropriate for a given re-
search problem involving one or two variables.
2. Understand principles of study design.
3. Apply probability theory to practical problems.
4. Apply statistical procedures on a computer using R/RStudio.
5. Interpret computer output for a statistical procedure.
6. Calculate confidence intervals and conduct hypothesis tests by hand
for small datasets.
7. Understand the usefulness of Statistics in your professional area.

1.6
Lecture 2: Graphs
During this lecture, we will meet common graphs used for visualising
data.

Graphing data is a key step in analyses: it is important to use the appropriate graph(s) for your situation!

Introduction: the role of graphs

Quantitative or categorical?

Recommended graphical tools

Class survey data

1.7
Introduction: the role of graphs

Data → Information

Data are just a bunch of numbers.

A major goal of Statistics is to make them informative.

1.8
Tools for Making Data Informative

Graphical tools (today's class).

Summary measures (next class).

Which type of graph to use depends on:

whether you are summarising one variable or looking at the relationship between two variables; and

whether the variables are quantitative or categorical (qualitative).

1.9
Quantitative or categorical?
A categorical variable places an individual into one of several cate-
gories.

A quantitative variable takes numerical values, measured on a scale.

Which of the following variables are quantitative, and which are cate-
gorical?
gender
satisfaction with UNSW (from 0 to 10)
time travelling to UNSW
method of travelling to UNSW

1.10
Recommended graphical tools
If you want to summarise one variable:
and it is quantitative: a histogram or boxplot.
and it is categorical: a bar graph (or bar chart).

If you want to explore the relationship between two variables:


and both are quantitative: a scatterplot.
and both are categorical: a clustered bar chart or a jittered
scatterplot.
and one is categorical, the other quantitative: comparative
boxplots or comparative histograms.

1.11
What sort of graph would you use to summarise:
gender of MATH1041 students
satisfaction with UNSW (from 0 to 10)
time travelling to UNSW
method of travelling to UNSW

1.12
What to look for in a graph

When commenting on a graph of a quantitative variable, consider:

the location (where most of the data are) and spread (or variability) of the data;

the shape of the data (symmetric, left-skewed or right-skewed?); and

whether there are any unusual observations.

1.13
Change in location
[Histogram: Variable (1-8) vs Frequency, showing a shift in location]

1.14
Spread

[Histograms labelled 'Larger Spread' and 'Smaller Spread'; Variable (0-10) vs Frequency]

1.15
Typical shapes: symmetric

[Histogram of a symmetric distribution; Variable (1-9) vs Frequency]

1.16
Typical shapes: skewed to the left

[Histogram of a left-skewed distribution; Variable (1-10) vs Frequency]

1.17
Typical shapes: skewed to the right

[Histogram of a right-skewed distribution; Variable (1-10) vs Frequency]

1.18
The following histogram depicts the scoring average of players from
the National Basketball Association (NBA) up to the 2008 season.

Histogram of points per game

[Histogram: points per game (2-30) vs Frequency (0-120)]
1.19
Comment on the location, spread and shape of the histogram.

1.20
Identify the variable(s) involved in the following questions, whether
they are quantitative or categorical, and what sort of graph you would
use to answer the questions:

Do male and female MATH1041 students have different levels of satisfaction with UNSW?

How much do students pay for a haircut?

Does the amount you pay for a haircut depend on your gender?

1.21
Another way to think about it:

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

1.22
Identify the variable(s) involved in the following question, whether they are quantitative or categorical, and what sort of graph you would use to answer the question:

Is there a relationship between how much the class spends on their hair and how much they would charge for their labour?

1.23
Some Other Graphical Tools

Stem-and-leaf plots (e.g. page 10 of Moore et al.)

Pie charts (controversial: can usually be bettered by a bar chart)

Time plots (e.g. pages 20-22 of Moore et al.). Suitable for time-ordered data. Common in the financial pages of newspapers.

Dot plots: the poor person's histogram.

1.24
Statistics Packages

Graphs (and indeed most statistical procedures) are most easily implemented using a computer and a statistics package specially developed for data analysis. Common programs used for statistics:
SAS
SPSS (PASW)
Excel
R/RStudio (we'll use both R and RStudio; they were used for most graphs in the lecture notes)
Minitab
S+ (S-PLUS)

1.25
Fancy graphs

We have covered some fundamental graphical tools.

But new tools are constantly being developed and modified.

Depending on the problem at hand, there is nothing to stop you devising your own graphical display!

A good example of an improvised graphical display is the moving bubble plot used by Prof Hans Rosling in:
http://www.youtube.com/watch?v=jbkSRLYSojo

1.26
Class Survey Data
During the remainder of the class, we will look at other results from the 'getting to know the class' exercise.

These graphs illustrate what is regarded by statisticians as best practice.

(These graphs will also be available on UNSW Moodle soon.)

1.28
Lectures 3-4: Numerical summaries
This lecture, we will meet common types of numerical summaries of data: ways of summarising the key properties of data using a few numbers.
Introduction

Summaries of categorical variables

Summaries of quantitative variables

Five-number summaries

Outlier detection

Linear transformations

1.29
Introduction
From last lecture:

Data → Information

We learnt that data are just a bunch of numbers.

A major goal of Statistics is to make them informative.

1.30
Tools for Making Data Informative

Graphical tools (last class).

Numerical summaries (this week's classes).

1.31
Data analysis for one or two variables

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

useful numbers:   (this lecture)

1.32
Relationship to Textbook

Graphical tools:
Section 1.1 Displaying Distributions with Graphs

Numerical summaries:
Section 1.2 Describing Distributions with Numbers

1.33
Types of numerical summary

Examples of numerical summaries are:

proportions or percentages

mean or average

median

interquartile range (IQR)

standard deviation

1.34
Recommended numerical summaries

If you want to summarise one categorical variable:

a table of frequencies or percentages

If you want to summarise one quantitative variable:

                      location    spread
Commonly used:        mean (x̄)    standard deviation (s)
Robust to outliers:   median (M)  interquartile range (IQR)

1.35
Data analysis for one or two variables

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

useful numbers:   table of      mean and sd
                  frequencies
1.36
Summaries of categorical variables
Consider the data from the class survey last lecture:

Is gender a quantitative or categorical variable?

What type of numerical summary would you use for the gender of MATH1041 students?

1.37
Numerical Summary of Gender

Gender   Frequency   %
Female   240         60
Male     160         40

1.38
Summaries of quantitative variables

Measures of location

Given measurements of a quantitative variable, an obvious question is: how large (or small) are the values?

Measures of location tell us how large (or small) the typical value is.

1.39
Satisfaction with UNSW

The mean satisfaction rating with UNSW is

7.84

1.40
The mean is just another name for what is commonly called the average of a set of numbers.

A common notation for the mean (used in the textbook) is x̄ ('x-bar'):

x̄ = (x1 + x2 + ... + xn) / n

The mean also has a physical interpretation as the centre of gravity of the data.

1.41
Travel Times to UNSW

The mean travel time to UNSW is 51.99 minutes.

1.42
Labour cost

The mean cost of labour is $842,655.86.

1.43
The Mean Can be Heavily Influenced by Outliers

The problem is that a couple of entrepreneurs gave unrealistic answers: someone this semester said $100 million!

If the entrepreneurs are removed (any value over $1,000), then the mean changes from

$842,655.86
to
$80.52

1.44
The Median: an Alternative to the Mean

Even after removing big outliers, the mean is still not describing the typical labour cost very well.

A more satisfactory numerical summary in this case is the median:

The median labour cost is $50.

1.45
Definition of the median

The median is the middle value.

For n values sorted as x1, x2, ..., xn, the median is:

x_{(n+1)/2} if n is odd; and

the average of x_{n/2} and x_{n/2+1} if n is even.

Textbook notation: the textbook refers to the median as M.
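The odd/even rule above can be sketched in a few lines of Python (the course software is R/RStudio, but the logic is identical; the data in the example calls are made up):

```python
def median(values):
    """Median via the textbook rule: the middle value for odd n,
    the average of the two middle values for even n."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]  # x_{(n+1)/2}, shifted to 0-based indexing
    return (xs[n // 2 - 1] + xs[n // 2]) / 2  # average of x_{n/2} and x_{n/2+1}

print(median([5, 7, 8]))     # 7
print(median([5, 7, 8, 9]))  # 7.5
```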

1.46
Computing the Median

A sample of 26 UNSW satisfaction ratings led to the following stem-and-leaf plot.

Data: 5, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9

55
6
7777777777
888888888
9999

What is the median of these data?

Answer:

1.47
Often Mean and Median are Close

                x̄           M
UNSW satisf.    7.84        8
Travel time     51.99 min   50 min

1.48
Medians and Boxplots

Recall the UNSW satisfaction dataset from a few slides ago:

[Boxplot of UNSW satisfaction rating, on a scale from 6 to 9]

Medians correspond to the horizontal bar in the boxplot!


1.49
How about the edges of the box?

These correspond to the medians of the lower and upper half of the
data; and are called:

Q1 = first quartile

Q3 = third quartile

(note: the median can be thought of as the second quartile, M = Q2)
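The 'medians of the two halves' idea can be sketched in Python (a sketch only: the example data are hypothetical, and for odd n the middle value is excluded from both halves, following the textbook's convention):

```python
def med(xs):
    """Median of an already-sorted list."""
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data
    (for odd n the middle value is excluded from both halves)."""
    xs = sorted(values)
    n = len(xs)
    return med(xs[: n // 2]), med(xs[(n + 1) // 2:])

print(quartiles([2, 4, 4, 5, 6, 7, 8, 9]))  # (4.0, 7.5)
```

Note that other conventions for quartiles exist; software packages may give slightly different answers on small datasets.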


1.50
Class Exercise: Computing Q1 and Q3.

Again, recall the UNSW satisfaction dataset from a few slides ago:

55
6
7777777777
888888888
9999

What are Q1 and Q3 for these data?

Answer:

1.51
Measures of spread

Asthma example

A few years ago a colleague was involved in a study that explored possible genetic differences between asthmatics and non-asthmatics.

The histograms on the next slide show

FENO = fraction of expired nitric oxide (a biomarker for asthma)

for two groups A and B with genetic differences.


1.52
[Histograms of FENO (0-50) for Group A and for Group B]


1.53
Spread of a Set of Data

The two groups do not differ considerably in their central location.

But they do differ substantially in their spread.

[Comparative boxplots of FENO (roughly 10-50) for groups A and B]

1.54
Simple Measure of Spread

A simple measure of spread is the height of the box part of the boxplot.
From earlier slides this is:

Q3 - Q1 = interquartile range = IQR

1.55
IQR for Small Example

For the following dataset (26 UNSW satisfaction scores):

55
6
7777777777
888888888
9999

recall that Q1 = 7 and Q3 = 8.

This means that the interquartile range is:

IQR = Q3 - Q1 = 8 - 7 = 1.

1.56
Standard Deviation: another measure of spread

Another measure of spread is the standard deviation, denoted by s and calculated as

s = sqrt( [ (x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)² ] / (n - 1) )

where x1, x2, ..., xn denote the data.

Note: it is easy to calculate s using statistics mode on your calculator. There is a standard deviation button: usually σn-1 or sx.

1.57
Example
Recall our sample data on satisfaction with UNSW:

n = 26

x1 = 5, x2 = 5, x3 = 6, x4 = 7, ..., x26 = 9.

You can use your calculator to show that for this dataset,

x̄ ≈ 7.46

s = sqrt( [ (5 - 7.46)² + (5 - 7.46)² + ... + (9 - 7.46)² ] / 25 ) ≈ 1.067

The standard deviation for this data set is 1.067 (to 3 decimal places).
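The calculator arithmetic can be checked with a short script (a Python sketch; the course uses R/RStudio, but the formula is the same):

```python
import math

# The 26 satisfaction ratings from the worked example
scores = [5, 5, 6] + [7] * 10 + [8] * 9 + [9] * 4

n = len(scores)
xbar = sum(scores) / n  # sample mean, approx 7.46
s = math.sqrt(sum((x - xbar) ** 2 for x in scores) / (n - 1))  # approx 1.067
print(round(xbar, 2), round(s, 3))
```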
1.58
Often IQR and s are Similar

                IQR      s
UNSW satisf.    2        1.3
Hair cost       $34.50   $5436.79

1.59
The Standard Deviation Can be Heavily Influenced by Outliers

For the labour cost data we get:

s = $7,632,285.18

But if the outliers (values greater than $1,000) are removed from the sample, then we get

s = $141.89

This still seems a little high though... probably because there are still some people who replied with pretty large values.
1.60
IQR is Hardly Affected by Outliers

For the full labour costs data we get

IQR = $67.50

With the outliers (values greater than $1,000) omitted we get

IQR = $46.25

1.61
Five-Number Summaries
The textbook advocates the five-number summary:

Min. Q1 M Q3 Max.

where Min. and Max. are the smallest and largest values.

1.62
Class Exercise: Five-Number Summary

A poll of age in years of 20 randomly chosen students led to the data:

18 18
19 19 19 19 19 19
20 20
21 21 21 21
22
23
24 24
25

29

Determine the five-number summary (Min., Q1, M, Q3, Max.) for these data.
1.63
Answer:

1.64
Data analysis for one or two variables

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

useful numbers:   table of      mean and sd or
                  frequencies   5-number summary

1.65
Boxplot Terminology

The textbook (Moore et al.) uses the terms boxplot and modified boxplot.

Boxplot: stems go from the box to the minimum and maximum (a visual representation of the five-number summary).

Modified boxplot: stems use the 1.5 × IQR criterion for outliers (all boxplots given in my slides are of this variety).

In this course, as is more common in practice, we will call the latter a boxplot (without the 'modified'): Moore et al. are the exception rather than the rule.

1.66
Outlier Identification
How do we decide if an observation is an outlier?

There is no clear-cut answer, but Moore et al. recommend the 1.5 × IQR criterion.

The 1.5 × IQR Criterion for Outliers:

An observation is a suspected outlier if it is:

more than 1.5 × IQR lower than Q1; or

more than 1.5 × IQR higher than Q3.
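The criterion can be sketched in Python, reusing the satisfaction data (Q1 = 7 and Q3 = 8 from the earlier slides):

```python
def suspected_outliers(xs, q1, q3):
    """Return the observations lying more than 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

# The 26 satisfaction ratings, with Q1 = 7 and Q3 = 8 from the earlier slides
scores = [5, 5, 6] + [7] * 10 + [8] * 9 + [9] * 4
print(suspected_outliers(scores, q1=7, q3=8))  # [5, 5]
```

Here IQR = 1, so the fences are 5.5 and 9.5, and the two ratings of 5 are flagged as suspected outliers.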

1.67
Class Exercise: 1.5 × IQR Criterion

Apply the suspected-outlier rule to the following age data.


18 18
19 19 19 19 19 19
20 20
21 21 21 21
22
23
24 24
25

29

Min. Q1 M Q3 Max.
18 19 20.5 22.5 29

1.68
Answer:

1.69
Linear Transformations

A linear transformation transforms a variable x into a new variable xnew as follows:

xnew = a + b x

where x is some quantitative measurement (e.g. travel time, height, temperature).

1.70
Linear Transformations: Change of Units

Suppose we measure

x = travel time in minutes

but then convert this to

xnew = travel time in hours.

e.g. 45 minutes → 0.75 hours.

Then xnew = a + b x, where a = 0 and b = 1/60.

1.71
Other examples

change of units   a        b
mins to hours     0        1/60
hours to mins     0        60
mm to metres      0        0.001
°C to °F          32       1.8
°F to °C          -160/9   5/9

1.72
Linear Transformations Don't Change the Shape of the Data

The next slide shows histograms for maximum daily temperatures in Melbourne, in both Celsius and Fahrenheit.

1.73
[Histograms of maximum Melbourne temperatures in °C and in °F: the same shape on different scales; y-axis relative frequency]

1.74
How Linear Transformations Affect Central Location

Recall that the main measures of central location are

x̄ = mean and M = median.

If xnew = a + b x, then:

x̄new = a + b x̄
Mnew = a + b M

1.75
How Linear Transformations Affect Spread

Recall that the main measures of spread are

s = standard deviation and IQR = interquartile range.

If xnew = a + b x, then:

snew = |b| s
IQRnew = |b| IQR

1.76
Class Exercise: Transformation of Numerical Summaries

Convert the following numerical summaries from Celsius to Fahrenheit.

        Celsius   Fahrenheit
x̄       30
M       25
s       10
IQR     15

Note: F = (9/5) C + 32.
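One way to check your hand conversions is a short Python sketch of the linear-transformation rules from the previous slides:

```python
def transform_summaries(a, b, mean, median, sd, iqr):
    """Push x_new = a + b*x through each summary: location summaries get
    a + b*(...), spread summaries get |b|*(...)."""
    return (a + b * mean, a + b * median, abs(b) * sd, abs(b) * iqr)

# Celsius -> Fahrenheit: a = 32, b = 9/5, using the values in the table
print(transform_summaries(32, 9 / 5, mean=30, median=25, sd=10, iqr=15))
```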

1.77
Additional Exercise 1-1 (based on Exercise 1.45 of the 4th Edition)

How much do users pay for Internet service? Here are the monthly
fees (in dollars) paid by a random sample of 50 users of commercial
Internet service providers in August 2000:

8 9 10 10 12 13 14 15 15 15
15 18 18 19 19 20 20 20 20 20
20 20 20 20 20 20 20 20 21 21
21 21 21 22 22 22 22 22 22 22
22 22 23 25 29 30 35 40 40 50

(a) Obtain a five-number summary of these data.

(b) Which observations are suspected outliers by the 1.5 × IQR criterion?
(c) Construct a boxplot of these data. What pattern do you see?
(d) Suppose the unit of measurement is cents rather than dollars.
What is the five-number summary for this unit of measurement?
1.78
Transformations and

Relationships Between Variables

2.1
Lectures 1-2: Transformations and Relationships Between Variables
This lecture, we will learn about the role of transformation in statistics,
and introduce some ideas that are useful for studying the relationship
between two quantitative variables.

Transformations:
Introduction
Common transformations
Effect of transformation on location and scale
Relationships Between Variables:
Relationships between two variables when (at least) one is cate-
gorical
Relationships between two quantitative variables
Outliers in Scatterplots
Correlation (r)
Cautions about the use of r

2.2
Transformations
Consider xnew, formed as some function of a variable x. Examples:

xnew = x²    xnew = 32 + (9/5) x    xnew = log(x)

If xnew is a function of x, we say xnew is a transformation of x. The act of calculating xnew is referred to as transforming x.

A general formula for a transformation is:

xnew = f (x)
where the function f () tells us how x has been transformed.

2.3
Types of transformation

There are two key types of transformations: linear and non-linear.

A linear transformation is so called because a plot of xnew against x is a straight line.

Are the following linear or non-linear transformations?

xnew = x²

xnew = 32 + (9/5) x

xnew = 10 log(x)
2.4
Why transform data?

Linear transformation changes the scale

and leaves the shape unchanged. Linear transformation is used when a different scale is desired (e.g. degrees Celsius not degrees Fahrenheit).

Non-linear transformation changes the shape

Non-linear transformations are a useful way to handle strongly-skewed data, and data with outliers. (Recall that many numerical summaries don't work well for such data.)

It is often possible to transform strongly-skewed data so that it is approximately symmetric, if you choose your transformation carefully!

(This is good because approximately symmetric data, with no outliers, is easier to analyse.)
2.5
How transformation changes the shape of data.

If observations are positive and have a long right tail (right skew), concave (concave-down) transformations (e.g. log(x) or √x) pull the large values down farther than they pull the central or small values.

[Scatterplot: rainfall from seeded clouds (0-3,000) against log(seeded), illustrating how the log pulls in the long right tail]

Convex (concave-up) transformations work the other way.

2.6
Common transformations
When your data take positive values (x > 0) and are right-skewed, these transformations might make them more symmetric:

xnew = √x

xnew = x^(1/4)

xnew = log x

These transformations are all concave down, hence they reduce the length of the right tail. They are listed in increasing order of strength; that is, for strongly skewed data, xnew = log x is more likely to work than xnew = √x.

They are also monotonically increasing; that is, as x gets larger, xnew gets larger.

(A transformation which didn't have this property would be hard to interpret!)
2.7
Examples:

2.8
Log transformation

The logarithmic or log-transformation is particularly important:

xnew = loga x

where a is the base; commonly log10 x, loge x or log2 x.

(The base doesn't matter: it only affects the scale, not the shape.)

Logs have the following key property:

log(ab) = log a + log b

and more generally,

log(x1 x2 ⋯ xn) = log x1 + log x2 + ⋯ + log xn


In words, the logarithm transforms multiplicative to additive.
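A one-line numerical check of this property (Python; the values 3 and 7 are arbitrary):

```python
import math

a, b = 3.0, 7.0  # arbitrary positive numbers
print(math.isclose(math.log(a * b), math.log(a) + math.log(b)))  # True
```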
2.9
Many variables can often be understood as the outcome of a series of multiplicative processes.

Examples:
Wealth

Size

Profit

Population

Log-transforming such variables turns the processes from multiplicative into additive.

2.10
Example:

[Histogram of Population for first-world countries (0-350): strongly right-skewed]

[Histogram of log(Population) (0-6): much more symmetric]
2.11
Situations where we need a different transformation

Left-skewed data: this is less common. But if data are left-skewed and negative, then -x is right-skewed and positive, in which case the transformations previously discussed can be applied to -x.

e.g. xnew = log(-x) takes negative, left-skewed values of x and tries to make them more symmetric.

2.12
Proportions: if data only take values between 0 and 1, try the logit transformation:

xnew = log( x / (1 - x) )

(This stretches data over the whole real line, from -∞ to ∞.)

Right-skewed with zeros: this often happens when data are counts. The problem is that you can't take logs, because log 0 is undefined. Try:

xnew = log(x + 1)
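Both transformations can be sketched in Python (the function names here are mine, not standard library names):

```python
import math

def logit(p):
    """Logit transform for proportions strictly between 0 and 1."""
    return math.log(p / (1 - p))

def log_shift(x):
    """log(x + 1), usable for right-skewed counts that include zeros."""
    return math.log(x + 1)

print(logit(0.5))    # 0.0
print(log_shift(0))  # 0.0
```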

2.13
Exercise

Suggest a possible transformation (if you think any is required) to make the following variables more symmetric:
Gross domestic product of countries

IQ

Body mass of different animals

2.14
Survey exercise

Consider the survey data. What transformation (if any) would you
suggest for the following variables, to make them more symmetric?
Hair cut cost

Day of the month you were born

Labour cost

2.15
Effect of transformation on location and scale
Recall that when we do a linear transformation:

effect on measures of location:

xnew = a + b x  ⇒  x̄new = a + b x̄

(and similarly for other measures of location);

effect on measures of spread:

xnew = a + b x  ⇒  snew = |b| s

(and similarly for other measures of spread).

2.16
Effect of transformation on location
and shape
Well when we do a non-linear transformation this all falls apart:

effect on measures of location:

xnew = f (x) x
new =???
(and similarly for other measures of location).

effect on measures of spread:

xnew = f (x) snew =???


(and similarly for other measures of spread).

2.17
There is no rule for how measures of location and spread change under non-linear transformation: they can change in unpredictable ways.

This makes measures of location and spread, calculated on transformed data, difficult to understand in terms of the original measurement units.

For this reason:

Avoid transforming data if you can! It complicates interpretation.

If you have to transform, stick with logs if possible, as the log-scale is still fairly easy to interpret.

2.18
Exercise

Consider the following hair cost data:

10 20 50 35 15

Calculate the average hair cut cost in dollars, x̄.

log10(x)-transform the data and recalculate the sample mean, x̄new.

Does 10^(x̄new) = x̄? Comment.
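One way to check your answer (a Python sketch):

```python
import math

costs = [10, 20, 50, 35, 15]  # hair cut costs in dollars
xbar = sum(costs) / len(costs)  # arithmetic mean: 26.0
xbar_new = sum(math.log10(c) for c in costs) / len(costs)  # mean of the logs
back = 10 ** xbar_new  # the geometric mean, roughly 22.1
print(xbar, back)
```

Back-transforming the mean of the logs gives the geometric mean, which is generally smaller than the arithmetic mean; this is exactly the point of the 'Comment' part.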

2.19
Relationships Between Variables
So far, we've talked about how to study one variable.

Now we'll think about how to study the relationship between two variables, in two situations:
When at least one of them is categorical.

When both are quantitative.

2.20
Data analysis for one or two variables

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

useful numbers:   table of      mean and sd or   (this lecture)
                  frequencies   5-number summ.
2.21
Relationships between two variables when (at least) one is categorical
To study the relationship between one categorical variable and another variable, break the data up by categories, summarise the data in each category using the appropriate one-variable method, and compare!

Quantitative variable     Comparing a quant. var. across categories
Histogram                 Comparative histograms
Boxplot                   Comparative boxplots
Five-number summary       Separate five-number summaries

Categorical variable      Comparing a categ. var. across categories
Bar chart                 Clustered bar chart
Table of frequencies      Two-way table of frequencies

2.22
Example: gender and hair cut cost

To study how gender and hair cut cost are related:


Summarise hair cut cost for females.

Summarise hair cut cost for males.

Compare!

2.23
A numerical summary of the relationship between gender and hair cost:
five-number summaries by group

Min. Q1 M Q3 Max.
Females 0 25 40 80 90909
Males 0 15 20 25 260

2.24
Example: study area and gender

To examine how study area and gender are related:


Summarise gender for Aviation students.

Summarise gender for Life Science students.

...

Compare!
(Or you could do it the other way around, summarising study area for
each gender, with similar results)

2.25
A graphical summary of the relationship between study area and gender: a clustered bar chart

[Clustered bar chart: counts (0-80) by study area (Other, Life Sci, Aviation, Science) and gender (female, male)]

2.26
A numerical summary of the relationship between study area and gen-
der: two-way table (of frequencies)

                Females   Males
Aviation        16        52
Life science    91        52
Other science   54        56
Other           99        74

2.27
Relationships between two
quantitative variables
Relationships between two quantitative variables are best explored
through a scatterplot.

From the scatterplot, we can get a sense for the nature of the rela-
tionship between the two variables.

2.28
Nature of Relationship

existent versus non-existent

strong versus weak

increasing versus decreasing

linear versus non-linear

2.29
Nature of Relationships for Examples

relationship                nature
elec. use vs. temp.         decreasing; non-linear; reasonably strong
log(elec. use) vs. temp.    decreasing; linear; reasonably strong
log(income) vs. age         non-linear; mild strength

2.30
Temperature vs Electricity usage

[Scatterplot: Temperature (°C, -5 to 25) vs Electricity usage (kWh per month, 20-100)]

2.31
Temperature vs log(Electricity usage)

[Scatterplot: Temperature (°C, -5 to 25) vs log(usage), roughly 2.5-4.5]

2.32
Age vs log(income)

[Scatterplot: Age (years, 20-60) vs log(income), roughly 11-15]

2.33
Outliers in Scatterplots
Outliers could represent some unexplainable anomalies in the data.

However, outliers could also reveal possible systematic structure worthy of further investigation.

The following plot shows votes for Reform Party candidates in the 2000 presidential election versus 1996, for each county in Florida, USA.

2.34
Comparison of votes

[Scatterplot: Perot 1996 votes (0-40,000) vs Buchanan 2000 votes (0-3,500); Palm Beach county is a striking outlier]

See: http://www.asktog.com/columns/042ButterflyBallot.html

2.35
Correlation (r)

What is correlation?

In the electricity usage example, a relationship between the logarithm of electricity use and temperature was observed graphically.

Can we summarise this numerically?

2.36
Solution

Correlation is a number that measures the strength of the linear relationship between two variables.

2.37
Temperature vs log(Electricity)

[Scatterplot: Temperature (°C, -5 to 25) vs log(Electricity usage), roughly 2.5-4.5]

2.38
Formula for the correlation coefficient r

r = (1 / (n - 1)) Σ [ ((xi - x̄) / sx) × ((yi - ȳ) / sy) ]

xi denotes the x-axis values
yi denotes the y-axis values
x̄ = mean of the x's
sx = standard deviation of the x's
ȳ = mean of the y's
sy = standard deviation of the y's
Σ = summation
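The formula translates directly into code (a Python sketch; the example data are made up):

```python
import math

def correlation(xs, ys):
    """r as the (n-1)-averaged product of standardised x and y values."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear: r = 1 (up to rounding)
print(correlation([1, 2, 3], [6, 4, 2]))        # perfectly decreasing: r = -1
```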

2.39
Properties of r

For any data set:

-1 ≤ r ≤ 1

r close to 1: strong positive linear relationship
r close to -1: strong negative linear relationship
r close to 0: weak or non-existent linear relationship

2.40
[Six example scatterplots with correlations of magnitude 0.96, 0.95, 0.78, 0.74, 0.01 and 0.01, ranging from strong linear relationships to essentially none]

2.41
Data analysis for one or two variables

                  one variable                   two variables
variable type:    categorical   quantitative     both           one categorical,   both
                                                 categorical    one quantitative   quantitative

useful graphs:    bar chart     histogram or     clustered      comparative        scatterplot
                                boxplot          bar chart      boxplots

useful numbers:   table of      mean and sd or   2-way table    5-num for          correlation
                  frequencies   5-number summ.   of freq.       each group         (this lecture)
2.42
Cautions about the use of r
In this section we discuss cautionary notes about r:
r is only useful for describing linear relationships; and

r is sensitive to outliers.

2.43
Correlation Does Not Describe Non-linear Relationships

The next slide shows some data from the National Basketball Association (NBA) 2007-08 season on mean points per game and age.

2.44
Age vs Average points per game (NBA 07/08)

[Scatterplot: Age (years, 20-34) vs average points per game (roughly 6-11), showing a clearly curved pattern; r = 0.046]
2.45
Comments on Previous Slide:

There is clearly a very strong relationship between mean points per game and age.

Getting r near 0 is not a contradiction, since correlation only measures the strength of linear relationships, not non-linear relationships like the one we have here.

2.46
Correlation is Sensitive to Outliers

[Scatterplot of Endotoxin (0-500) vs pIL-13 (0-4,000); with one extreme point included, r = 0.253]

2.47
[The same scatterplot with the extreme point removed (Endotoxin 0-300); r = 0.0088]

2.48
endotoxin = a measure of the poisonous content of dust
pIL-13 = a measure of immunological activity

The apparent correlation between endotoxin and pIL-13 was mostly driven by a single data point.

Conclusion:

Beware of outliers when using correlation!

Plotting the data is always recommended.
2.49
2.50
Least-Squares Regression

2.51
Lectures 3-4: Least-Squares Regression
Introduction

Least squares regression

Using the Least-Squares Line for Prediction

Connection Between Regression and Correlation

Measuring the Strength of a Regression: r2

2.52
Introduction
Explanatory and response variables

Regression is used to study causal relationships (and, more generally, for predicting one variable from another).

Causal relationship: a relationship between two variables in which one variable is believed to cause changes in the other variable.

When we are looking at a causal relationship between two variables, we say that the explanatory variable explains or causes changes in the response variable.

Note: it is conventional in plots to put the explanatory variable on the x-axis and the response variable on the y-axis.
2.54
Age vs log(income)

[Scatterplot: Age (years, 20-60) vs log(income), roughly 11-15]

2.55
What is Regression?

In Statistics, regression means:

A method of explaining how a response variable is related to explana-


tory variables.

e.g. What factors are important in determining birthweight? We can


use regression to find how birthweight (the response variable) can be
explained by other variables (explanatory variables).

We will talk about fitting a regression line between two variables: a straight line that describes how the response variable (y) is related to the explanatory variable (x).

2.56
Explaining Electricity Usage

In the electricity usage example we can think of the


explanatory variable: temperature

partially explaining the


response variable: log(electricity use)

So we could fit a regression line to explain how log(electricity use) is


related to temperature.

2.57
[Scatterplot: Temperature (°C) against log(electricity usage)]

2.58
Least-squares regression
Least-squares is a mathematical method for determining a line of
best fit through the scatterplot points.

2.59
[Scatterplot: Temperature (°C) against log(electricity usage), with the fitted line ŷ = 4.32 − 0.06 × Temp]

2.60
How is a least-squares regression fitted?

The line is chosen to minimise the sum of the squared lengths of the arrows (the vertical distances from each data point to the line).
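This minimisation can be checked numerically. The following sketch (Python for illustration; invented toy data) computes the least-squares line from the standard formulas and confirms that nudging the intercept or slope in any direction only increases the sum of squared errors:

```python
# Invented toy data; b0 and b1 from the standard least-squares formulas.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

def sse(intercept, slope):
    # Sum of the squared "arrow lengths" for a candidate line.
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

best = sse(b0, b1)
# Perturbing the fitted line in any direction can only increase the sum.
assert best <= sse(b0 + 0.1, b1) and best <= sse(b0 - 0.1, b1)
assert best <= sse(b0, b1 + 0.1) and best <= sse(b0, b1 - 0.1)
```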

2.61
Mathematical Representation of Regression
Line

Straight lines have the algebraic form

y = b0 + b1 x
where

b0 = intercept on y-axis (x=0)


b1 = slope of the line

The value of b1 is often more interesting than b0 since it represents


the magnitude of the effect of x on y.

2.62
Textbook Notation for Least-Squares Regres-
sion

Since y is used for the original data values of the y-axis variable, we use

ŷ

to denote the y-values on the least-squares line; i.e.

ŷ = b0 + b1 x.

The values of ŷ are called fitted values.

Note: the textbook writes the regression line as

ŷ = a + b x

we use b0 as the y-intercept instead of a and b1 as the slope instead of b, a notation which will come in handy for you later.
2.63
Using the Least-Squares Line for
Prediction
We have:

predicted log(electricity usage) = 4.32 − 0.057 × temperature.

Next month's average temperature will be about 15°C.

How much electricity do we expect to use?

This is a problem called prediction.

Answer:
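One way to sketch the arithmetic (Python for illustration; this assumes the natural logarithm was used, which the slides do not state):

```python
import math

# Fitted line from the slides: predicted log(usage) = 4.32 - 0.057 * temp.
temperature = 15
log_usage = 4.32 - 0.057 * temperature     # predicted log(electricity usage)
usage = math.exp(log_usage)                # back-transform (assumes natural log)

print(round(log_usage, 3))                 # 3.465
print(round(usage, 1))                     # about 32 units of electricity
```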

2.64
Exercise: Prediction

Consider the results from the Men's Large Hill Ski Jump competition during the 2014 Sochi Olympic Games.

2.65
[Scatterplot (Men's Large Hill Ski Jump scores, Sochi 2014): Round 1 score against Final score, with the fitted line ŷ = 21 + 1.81 × Round 1]
The equation of the regression line on the previous page is:

ŷ = 21 + 1.81x

where ŷ is the predicted score in the final round and x is the score in Round 1.

Suppose an athlete scored 137 in Round 1 but had to drop out of the Final because of an injury. Predict what the final round score would have been had they competed.

Answer:
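A sketch of the corresponding arithmetic (Python for illustration):

```python
# Plug x = 137 into the fitted line y-hat = 21 + 1.81x.
round1_score = 137
predicted_final = 21 + 1.81 * round1_score
print(round(predicted_final, 2))   # 268.97
```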

2.66
Data analysis for one or two variables
(for each case: useful graphs; useful numbers)

one variable:
  categorical: bar chart; table of frequencies
  quantitative: boxplot or histogram; mean and sd, or 5-number summary

two variables:
  both categorical: clustered bar chart; 2-way table of frequencies
  one categorical, one quantitative: comparative boxplots; mean and sd or 5-number summary for each group
  both quantitative (this lecture): scatterplot; correlation or regression

2.67
Connection Between Regression and
Correlation
Least-squares fits a line to the x and y points with a slope intimately connected to the correlation between the points.

2.68
Slope and Correlation are Kind of the Same

Imagine that we standardise our data (to Z-scores):

zx = (x − x̄) / sx
zy = (y − ȳ) / sy

then fit a regression line.

Let's see what the regression line looks like.

2.69
[Scatterplot: standardised Temperature (zx) against standardised log(usage) (zy), with the fitted line ẑy = −0.90 zx]

r = −0.90: the slope and correlation are the same!

2.70
Least-Squares Line for General Scatterplots

The least-squares line is

ŷ = b0 + b1 x.

In the previous slides we saw that

b1 = r = correlation coefficient

provided that we work with standardised data.

For unstandardised data we have the following general formulae for b0 and b1:

b1 = r (sy / sx)   and   b0 = ȳ − b1 x̄
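These formulae are easy to verify numerically. A sketch (Python for illustration; invented toy data) checks that b1 = r(sy/sx) agrees with the direct least-squares slope Sxy/Sxx:

```python
# Sketch (invented toy data): check b1 = r * (sy / sx) against the direct
# least-squares slope Sxy / Sxx, with b0 = y-bar - b1 * x-bar.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 3.5, 4.5, 4.5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = sxy / (sxx * syy) ** 0.5
sx = (sxx / (n - 1)) ** 0.5        # sample standard deviation of x
sy = (syy / (n - 1)) ** 0.5        # sample standard deviation of y

b1 = r * (sy / sx)                 # slope via the correlation
b0 = my - b1 * mx                  # intercept

assert abs(b1 - sxy / sxx) < 1e-12  # same as the direct formula
```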

2.71
How Regression Differs from Correlation

In correlation the x and y variables are on equal footing.

In regression

x = explanatory variable

y = response variable

Use regression when you want to try and explain or predict one variable
(y) using another (x).

2.72
Measuring the Strength of a
Regression: r2
An important quantity throughout regression is the r2 value. It mea-
sures the strength of the regression.

For simple linear least-squares regression

r2 = square of correlation r.

Note that:
0 ≤ r2 ≤ 1.

For other regression models (not covered in MATH1041) the formula


for r2 is more complicated.

2.73
Why r2?

We already know (last lecture) that r measures the strength of linear


relationship between y and x. Why calculate r2 as well?

r2 has a special interpretation:

r2 = (variance of ŷ values) / (variance of y values)

(Note that variance is the square of the standard deviation.)

So r2 is the % of variation in y that is explained by the linear regression.
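This identity can be checked on toy data. A sketch (Python for illustration; the data are invented):

```python
# Check r^2 = variance(fitted values) / variance(y) on invented toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.2, 4.8, 6.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

b1 = sxy / sxx
b0 = my - b1 * mx
fitted = [b0 + b1 * a for a in x]

mf = sum(fitted) / n                   # equals y-bar for a least-squares fit
var_fitted = sum((f - mf) ** 2 for f in fitted) / (n - 1)
var_y = syy / (n - 1)

r_squared = sxy ** 2 / (sxx * syy)     # square of the correlation
assert abs(r_squared - var_fitted / var_y) < 1e-12
```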

2.74
r2 as Percentage Explained

An example of how to interpret r2 is as follows:

Temperature explains 81% of the variation in log(electricity usage).

2.75
The next few slides show how r2 measures the strength of the regression.

It uses the result:

r2 = square of the correlation between the y-values and the ŷ-values

2.76
[Scatterplot: Temperature (°C) against log(electricity usage), showing the data together with the fitted values]

2.77
[Scatterplot: fitted values against log(electricity usage): r = 0.9, r2 = 0.81]

2.78
Comments on Previous Slides

Regression is stronger if the correlation between

y and ŷ

is higher.

2.79
Aside: Using r2 to Quantify Non-Linear Relationships

[Scatterplot: Age (20 to 34) against average points per game (2008 NBA season): r2 = 0.002]

2.80
A small r2 doesn't necessarily mean there is no relationship.

The r2 value discussed so far in this course only assesses linear rela-
tionships.

2.81
Bonus regression material: Cautions
about regression
Regression is not always appropriate

Residual plots

Outliers and influential observations

Lurking variables

Extrapolation

2.82
Regression is not always appropriate
Fitting the least-squares regression line to a set of data is not always
appropriate.

We should always check the appropriateness of the basic regression


assumptions before proceeding.

2.83
The Anscombe Examples

Data A Data B Data C Data D


x y x y x y x y
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 8 5.56
12 10.84 12 9.13 12 8.15 8 7.91
7 4.82 7 7.26 7 6.42 8 6.89
5 5.68 5 4.74 5 5.73 19 12.50

2.84
For all 4 data sets, the least-squares regression line is

ŷ = 3 + 0.5x

But in some cases, linear regression is appropriate; in other cases, it is not!

plot the data to check if regression is a good idea
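The claim is easy to verify numerically. A sketch (Python for illustration) fitting Data A and Data D from the table above; both give essentially the same line ŷ = 3 + 0.5x despite looking completely different when plotted:

```python
# Fit Data A and Data D from the Anscombe table with least squares.
def least_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1            # (intercept, slope)

xa = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ya = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
xd = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
yd = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

b0_a, b1_a = least_squares(xa, ya)
b0_d, b1_d = least_squares(xd, yd)
# Both fits are (to rounding) the same line: intercept 3, slope 0.5.
```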

2.85
[Scatterplots of the four Anscombe data sets A, B, C and D]
2.86
[The same four scatterplots, each with the common fitted line overlaid]
2.87
Regression Assumptions

When fitting a least-squares line to data on y and x we are assuming:

y = b0 + b1x + error

where the error corresponds to random scatter about the line.

So how do we check these assumptions?

2.88
Residual plots
The residuals from a least-squares regression are obtained by subtracting the fitted values (also called predicted values) from the response values:

residual = y − ŷ

The residuals allow easier checking of the regression assumptions through


residual plots.
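One quick sanity check worth knowing: for a least-squares fit the residuals always sum to (numerically) zero. A sketch (Python for illustration; invented data):

```python
# For a least-squares fit, the residuals y - y-hat sum to zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 2.8, 4.1, 4.9, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
assert abs(sum(residuals)) < 1e-9
```

Because the residuals are centred on zero, any pattern left in them (against x) signals a problem with the straight-line assumption.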

2.89
The residuals are the
length (and direction) of
the arrows

2.90
A residual plot is a scatterplot of the

residuals

against the
explanatory variable x.

2.91
[Residual plot: residuals against Temperature (°C)]

2.92
Interpreting Residual Plots

If the regression line catches the overall pattern of the data, there
should be no pattern in the residuals.

Interpreting residual plots is a delicate art. We will give a few illustrative examples.

2.93
Janka hardness example

Janka hardness is a structural property of timber. It is difficult to


measure, so the density of the timber is often used for prediction.

The following plot shows log(Janka hardness) versus density.

2.94
A linear regression fit:

[Scatterplot (Janka hardness data): density against log(hardness), with the fitted line]

2.95
The residual plot below has a rough arch-shaped pattern indicating
the linearity assumption is not appropriate.

[Residual plot for the linear regression: residuals against density, showing a rough arch-shaped pattern]

2.96
A quadratic regression fit is better:

[Scatterplot (Janka hardness data): density against log(hardness), with the fitted quadratic curve]

2.97
The arch-shaped pattern in the residual plot is gone! Much better fit.

[Residual plot for the quadratic regression: residuals against density, with no remaining pattern]

2.98
Outliers and influential observations

Car Purchase Example

Barbara runs a catering business and wants to buy a small fleet of used
Mitsubishi Lancer cars.

She collects 39 ads from the newspapers and arrives at the data set
plotted on the following slide.

Her goal is to predict price based on age so that she has a better idea
of what a fleet should cost.

2.99
[Scatterplot (car price/age data): Age (years) against Price ($AUD, 0 to 20,000), all 39 cars]

2.100
Attaching a Special Cause to an Outlier

Barbara looks through the newspaper ads and finds that the ad for the highest-priced Mitsubishi Lancer was as follows:

MITSUBISHI, LANCER EVO 2006, excellent condition, air cond.,


leather interior, power mirrors, 18 mag wheels, satellite Navigation,
television! $19,000. Phone Steve 555 1234 after 6 weekdays.

None of the other ads promised anywhere near as much as this one.

Should Barbara include this car in analyses, or is she justified in removing it?

Answer:

2.101
[Scatterplot (car price/age data): Age (years) against Price ($AUD, 0 to 8,000), with the outlier removed]

2.102
[Residual plot for the car price/age data: residuals against Age (years)]

2.103
When is it Alright to Remove Outliers?

There is no clear-cut answer and it depends on the situation.

It is worthwhile to find out if removing the outlier changes your conclusions: if it doesn't, then the outlier doesn't present a problem!

If an outlier does influence your results, it should only be removed if you have a good reason (e.g. a silly answer to a question, or the hotted-up car example).

If you don't have a good reason for removing it, try presenting both sets of results (with and without the outlier) or look into robust regression methods.

2.104
Lurking Variables
The next slide shows data on the relationship between steals per game and rebounds per game for the 2007–08 NBA season.

2.106
[Scatterplot: Rebounds per game against Steals per game]

2.107
The results are surprising:

The relationship between steals per game and rebounds per game is either weak or non-existent. The r2 is only 0.056.

The least-squares regression line indicates that steals per game increases only slightly as the number of rebounds per game increases.

A stronger relationship was expected: players who get more rebounds should also get more steals.

Explanation:

There are other variables lurking around e.g. the position you play.

2.108
[Scatterplot: Rebounds per game against Steals per game, with players grouped by position (Centre, Forward, Guard)]

2.109
Recall that only 5.6% of variation in steals was explained by rebounds.

Within positions, r2 values are much higher:

Position r2
Centre 0.27
Forward 0.34
Guard 0.40

2.110
What to Do with Lurking Variables?

To get a better picture of the effect of rebounds per game on steals per game, we need to also account for lurking variables such as position in the regression analysis.

To do this, we could use a method of analysis known as analysis of


covariance (related to multiple regression, Chapter 11 of Moore et
al).

2.111
Extrapolation

A linear regression equation should only be used to make predictions


for values of the explanatory variable within the range of the actual
data! Otherwise, you are extrapolating!

2.112
Example: US Farm Population

The following slides show the population of farmers in the United


States from 1935 to 1980.

2.113
[Plot: U.S. farm population (millions) against Year, 1940 to 1980, declining steadily over the period]

2.114
The regression equation for predicting farm population (ŷ) from year (x) is

ŷ = 1166.93 − 0.587x

If we used this line to predict the U.S. farm population in 2020, we would get a value of −18.81 million!
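A sketch of why this is absurd (Python for illustration): the same line that gives sensible values inside the data range produces an impossible negative population when extrapolated to 2020.

```python
# Fitted line from the slides: y-hat = 1166.93 - 0.587 * year (millions).
def predicted_farm_population(year):
    return 1166.93 - 0.587 * year

print(round(predicted_farm_population(1960), 2))   # inside the data range
print(round(predicted_farm_population(2020), 2))   # -18.81 million: impossible!
```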

2.115
Take-home messages

Remember:

Fitting the least-squares regression line to a set of data is not


always appropriate.

We should always check the appropriateness of the basic regression assumptions before proceeding.

Beware of lurking variables! Just because X is related to Y does


not mean that X causes Y : there might be something else going
on involving some lurking variable. . .

Only use the regression line to make predictions for values of X


that are included in your data!

2.116
2.117
Design of experiments
Based on Moore et al.: Introduction and Section 3.1

3.1
Lectures 1–2: Design of experiments
Ways to obtain data

Observational studies vs experiments

Principles of experimental design

Types of experiments

Cautions about experiments

3.2
Ways to obtain data
From Moore et al.:

Statistics is the science of collecting, organising, and interpreting


numerical facts, which we call data.

In the next two lectures we will focus on the important question:


How do we obtain data?

This question is of fundamental importance in any research study.

If there is a problem with the way we have collected data, this leads
to problems with the analysis and interpretation of the data.

If your data collection is flawed, the whole study is flawed!

3.3
A strategy for using data in research:
Identify the key research question: the question you want to
discover the answer to.

Decide on the population to be studied: the set of people/things to which your research question refers.

Decide which variables to measure.


These should be informed by your research question.

Obtain data from the population of interest to answer your research


question.

3.4
There are a few ways to obtain data:
Anecdotal data: Haphazardly collected data (such as data from your own experience).

Anecdotal data is not usually a sound basis for drawing conclusions.

e.g. A few of my friends are left-handed and are good at maths.


Does this mean that left-handers are good at maths?

Available data: Data previously produced (possibly for some other purpose).

e.g. The Australian Bureau of Statistics has lots of interesting data,


that can be used to answer many interesting research questions.

Collect your own data! We will focus on issues that come up when
collecting your own data.

3.5
Census vs sample?

When collecting your own data, you can take a census or take a
sample.

If we survey the whole population this is called a census.

Usually, we cannot survey the whole population (or we just don't have the time!), so we take a sample.

Samples are often very informative, so taking a census is often a waste


of time. For example, you could predict the winner of an election with
surprising accuracy from just 1000 Australians!

3.6
Example

Consider the following research question:

How long does it take UNSW students to get to uni, on average?

1. What is the population of interest?

2. What variable(s) do we wish to measure?

3. How could we obtain data to answer this question?

3.7
Observational Study or Experiment
Observational Study: Individuals are observed and variables of inter-
est are measured, but there is no attempt to influence responses.

e.g. An on-line survey of MATH1041 students is carried out to see


if travel time to UNSW varies according to how you travel to UNSW.

Experiment: Some treatment is deliberately imposed on individuals,


and we observe their responses.

e.g. Is it faster to drive to UNSW or take public transport? 10


randomly selected students are asked to drive, 10 randomly selected
students are asked to take public transport. Time to UNSW is
compared.

3.8
Observational studies and association

We can use an observational study to find an association between two variables.

But often in science we are concerned with causation, not association:


The moon's gravitational pull causes the tides.

Low temperature applied to water causes ice to form.

Mating causes gestation.

Problem with observational studies:


association does not mean causation

3.9
There are many possible explanations for an association:
Common response. e.g. Ice-cream sales and heat stroke cases.

Causation. e.g. moon position and tides.

Confounding. e.g. Parent BMI and Child BMI, confounded by


diet/lifestyle.

The following slides are diagrams of these example types of association.

3.10
Common response

Temperature (the cause) drives both ice cream sales and heat stroke cases, which produces an association between them.

3.11
Causation

The position of the moon causes the tides: here the association really is due to cause and effect.

3.12
Confounding

Parent BMI and child BMI are associated, but confounding variables (child diet, habits, and other factors) also cause changes in child BMI, so the association need not be causal.

3.13
Why do an experiment?

An experiment allows us to demonstrate causation.

We can make an intervention (the cause) and see whether or not there
is an effect!

In a carefully designed experiment, the intervention is the only possible


explanation for any effect we observe, so we have demonstrated a
cause-and-effect link.

3.14
Example: walking is faster than catching a bus!

The MATH1041 survey reveals that students who walk to UNSW have
shorter travel times, on average, than people who catch a bus.

Does this mean that walking gets you to UNSW faster than catching
a bus?

Identify a confounding variable that could explain our observed pattern.


Describe an experiment you could do to test if walking to UNSW is
faster than catching a bus.

Answer:

3.15
Principles of experimental design
We will use the following important definitions to describe experimental
designs:
Subjects: Individuals on which the experiment is done.

Treatment: A specific experimental condition applied to subjects.

Factor: An explanatory variable in the experiment, i.e. a variable that is manipulated in different treatments.

Levels: Each treatment is formed by combining a specific level of each of the factors in the experiment.

Response variable: The variable of primary interest that is measured on subjects, after treatment.

Note: The textbook defines subjects slightly differently, and instead


sometimes refers to them as experimental units.
3.16
Example: effects of advertising

What are the effects of repeated exposure to advertising?

Sixty students view a 40-minute TV program, but the program includes


ads for a new smart-phone.

Different students see different ads: some students see a 30-second ad, some see a 90-second ad. Each ad is shown either once, three times, or five times.

Students are then asked if they intend to purchase the new smart-
phone.
1. What are the subjects?
2. What factors are considered?
3. What are the levels of each factor?
4. How many treatments are there in total? What are these treat-
ments?
5. What is the response variable?

3.17
Answer:

3.18
We often use a diagram to summarise experimental designs, as below:

3.19
All experiments should employ the following principles:

Compare two or more treatments. One treatment should be a suitably


chosen control group, to which the treatment of interest can be
compared.

Randomise assignment of subjects to treatments (to remove selection


bias).

Repeat the treatment on many subjects, to reduce chance variation.

3.20
Example: is yawning contagious?

Watch the Mythbusters video:

http://dsc.discovery.com/tv-shows/mythbusters/videos/is-yawning-contagious-minimyth.htm

and answer the following questions:


1. What are the subjects?

2. What treatments are applied?

3. Experiments should always Compare, Randomise, and Repeat.

Did the Mythbusters do all of these steps?


3.21
How do you randomise?

To randomly assign subjects to treatment groups, the simplest way is


to use a computer:
Label all experimental subjects.

Assign each a random number.

Sort the random numbers.

Treatment groups are formed by selecting subjects in order from


the sorted list.
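The steps above can be sketched as follows (Python for illustration; the course uses R, where sample() does the same job; the subject labels are invented):

```python
import random

# Label subjects, attach random numbers, sort by them, then slice the
# sorted list into equal-sized treatment groups. Seeded for reproducibility.
rng = random.Random(1041)

subjects = [f"subject_{i}" for i in range(1, 13)]   # 12 labelled subjects
shuffled = sorted(subjects, key=lambda s: rng.random())

n_groups = 3
size = len(subjects) // n_groups
groups = [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]
```

Every subject ends up in exactly one group, and which group is determined purely by chance.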

3.22
Choice of control group

The control group should differ from other treatments only in the
application of the treatment of interest.

A common type of control group is a placebo: a sham treatment.

This is sometimes called an experimental control. It controls for unforeseen effects that experimental manipulation may have on subjects.

e.g. the placebo effect: patients often feel better when a doctor gives them a treatment, no matter what the treatment is!

3.23
Example: effects of smoking during pregnancy

Johns et al explored the effect of smoking while pregnant on foetal


development, by injecting nicotine into pregnant guinea pigs. They
injected 15 pregnant guinea pigs, twice daily, with nicotine hydrogen
tartrate (0.5, 1.5 and 2.5 mg/kg) in saline solution, and compared the
offspring to a control group of 5 pregnant guinea pigs injected with
saline solution.

Why was the control group injected with saline?

3.24
Types of experiment
We will meet a few common types of experiment.

Randomised Comparative experiment


Subjects are randomly allocated to one of several treatments, and
responses are compared across treatment groups.

The guinea pig study above is an example of a randomised comparative


experiment.

3.25
Matched pairs design
We break subjects into pairs (that have similar properties) and apply
each of two treatments to one subject from each pair.

Matched pairs designs can produce much more precise results than
complete randomisation, because we are controlling for variation in
response across pairs.

Common examples:
Identical twin studies: allow us to control for genetics!
Before-after experiments: we take two measurements on each subject, and control for variation across subjects.

3.26
Example: Driving skills and mobile phone use

Does use of a mobile phone affect driving skills?

To answer this question, we use a driving simulator to test students' driving skills (by measuring reaction time). We want to know if talking on a mobile phone affects driving skills by slowing reaction time.

Consider two experiments:


1. 20 students (out of 40) are randomly assigned to a mobile phone
treatment. They use the simulator while talking on their phone.
The other 20 students use the simulator while not on the phone.

2. 20 students are asked to use the driving simulator twice: once with a mobile phone, once without, with the order of the two treatments randomised.
3.27
What is the name for each of these two types of experiment?

Which experiment do you think provides better information for answering the research question?

3.28
Randomised block designs

A block is a group of subjects known before the experiment to be


similar in some way that might affect their response to treatment.

In a randomised block design, the random assignment of subjects to


treatments is carried out separately within each block.

A matched pairs design is a special case of a randomised block design,


where there are two treatments and all blocks have size two.

3.29
Example: response to cancer treatment

Suppose there are three treatments for cancer being studied. White
blood cell count is used to monitor response to treatment. It is be-
lieved that men and women may respond differently to the different
treatments.

How should we assign subjects to the three treatments?

3.30
Cautions about experiments
Below are some common problems when designing experiments:

Choose an appropriate control. The only thing that should vary


across treatments is the factor(s) of interest. (Use a placebo?)

Beware of bias. If the administrator of the treatment knows what is being applied, this may (even subconsciously) bias the way they work with the subject. In a double-blind experiment, neither the subject nor the administrator of the treatment knows what is being applied.

Repeat the entire treatment for different subjects. When repeating a treatment for different subjects, it is important that all treatment steps are repeated.

Try to make your experiment realistic. For experiments to say anything about the real world, they need to duplicate real-world conditions!
e.g. Nicotine-injections to pregnant guinea pigs stunt development.
But what does that mean for people who smoke while pregnant???
3.31
Example: oven temperatures

A Food Technology Honours student is interested in the effects of oven


temperature on water content of bread.

She prepares 20 loaves of bread, using dough prepared in exactly the


same way. 10 loaves of bread are baked simultaneously in an oven that
has been heated to 200 degrees Celsius, and the other 10 loaves of
bread are baked at the same time, in a different oven that was heated
to 220 degrees Celsius.

She would like to compare the water content of loaves across treat-
ments, to draw conclusions about the effects of oven temperature on
water content of bread.

Can you see any problems with this experiment?


3.32
Example: mites on cotton plants

Alex is a biology PhD student interested in the effects of mite infestation on growth of cotton plants.

He randomly assigns 15 (out of 30) cotton seedlings to a mite


treatment, which he applies by crumbling up mite-infested leaves and
placing near the base of each seedling. The remaining 15 no-mite
seedlings do not receive any such treatment.

Alex then compares the total biomass of cotton plants in the mite and
no-mite treatments, to explore whether there is an effect of mites on
cotton plant growth.

Can you see any problems with Alex's experiment?

3.33
3.34
Sampling designs
and towards statistical inference
Based on Moore et al.: Section 3.2

3.35
Lectures 3–4: Sampling designs and
towards inference
Introduction

The simple random sample (SRS)

Other sampling designs

Cautions about sample surveys

Towards Statistical Inference:


Sampling distributions

Bias and variability

3.36
Introduction
Consider the following question:

How long do students take to get to UNSW?

It is not feasible to ask all UNSW students what their travel time is, so we sample UNSW students.

How we choose to sample is our sample design.

3.37
Some definitions

Population (usually large, sometimes theoretical): The entire group


of individuals about whom we want information.

e.g. All UNSW students

Sample (often moderate or small in size): The part of the population


we actually examine, to gather information.

e.g. UNSW students who answer a travel time questionnaire.

Design: How we choose the sample.

e.g. Ask MATH1041 students to complete an on-line questionnaire.

3.38
The simple random sample (SRS)
The most common type of sample is a simple random sample.

We will sometimes refer to a simple random sample as an SRS.

From Moore et al.:

A simple random sample of size n consists of n individuals


from the population, chosen in such a way that every possible
combination of n individuals has equal chance to be the sample
actually selected.

3.39
To create a SRS of size n there are four steps:
1. Create a dataset with all elements of the population in the first
column.

2. Assign a random number to each element of the population, put


these in the second column.

3. Sort the dataset by the random number column.

4. Your SRS is the first n elements in the sorted list.
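The four steps can be sketched as follows (Python for illustration; the course uses R, and the population of 100 students is hypothetical):

```python
import random

# Step 1: the population list; Step 2: a random number for each element;
# Step 3: sort by the random numbers; Step 4: keep the first n.
rng = random.Random(2024)   # seeded for reproducibility

population = [f"student_{i}" for i in range(1, 101)]    # hypothetical roll
keyed = sorted((rng.random(), person) for person in population)
n = 5
srs = [person for _, person in keyed[:n]]
```

Because every ordering of the random numbers is equally likely, every combination of n individuals has the same chance of being selected, which is exactly the SRS definition.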

3.40
Example: a SRS of MATH1041 students

How long do MATH1041 students take to get to UNSW?

We will take a SRS of 5 MATH1041 students, to answer this question,


using a roll of all MATH1041 students enrolled by the start of March.

3.41
Properties of simple random samples

By using random sampling, our data become subject to the laws


of probability. This enables statistical inference (as considered in
Weeks 7-12).

A particularly important result (discussed later) is that measurements from a SRS are independently and identically distributed.

An SRS is representative of the population from which sampling is done: it eliminates favouritism (sampling bias).

An SRS can be difficult or impossible to obtain: you need a list of all elements of the population.
e.g. Could you take an SRS of illegal immigrants?

3.42
Other sampling designs
We have met simple random samples.

Some other ways to sample:


Voluntary sample: People choose themselves for the study by responding to a general appeal.
e.g. Respondents to the MATH1041 on-line survey.

Stratified random sample: The population is divided into groups of


similar individuals called strata. Then we choose a SRS from each
stratum.
e.g. To estimate average hair-cut cost of MATH1041 students, we
take an SRS from male MATH1041 students and another from
female MATH1041 students.

3.43
Multi-stage sample: We sample successively smaller groups from the
population in stages, resulting in a sample that consists of clusters
of individuals.

This is more convenient for large or widely dispersed populations.


e.g. Sampling 200 Sydney houses: Take an SRS of 10 suburbs,
then take an SRS of 20 houses within each suburb.
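A stratified random sample can be sketched as follows (Python for illustration; the strata and sizes are hypothetical, not real MATH1041 data): an SRS is taken separately within each stratum.

```python
import random

# Hypothetical strata; an SRS of 5 from each, seeded for reproducibility.
rng = random.Random(7)

strata = {
    "stratum_A": [f"A{i}" for i in range(30)],
    "stratum_B": [f"B{i}" for i in range(50)],
}
sample = []
for members in strata.values():
    sample.extend(rng.sample(members, 5))   # an SRS within this stratum
```

Each stratum is guaranteed to be represented, unlike a plain SRS of the whole population.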

3.44
Example

In a study of the literacy of high school students, we want to choose


a sample of 24 students to represent a high school.

How would you do this?

Answer:

3.45
Cautions about sample surveys
Here are some common problems to watch out for when sampling:
Undercoverage: Some groups in the population are not included in
the sampling process.
e.g. MATH1041 students who enrolled since Monday (they aren't on the roll I just sampled from!).
Non-response: Some individuals can't be contacted/don't respond.
e.g. MATH1041 students who didn't attend this week's lectures!
Response bias: Interviewer technique may bias replies.
Questionnaire design: Wording is important!
e.g. Compare the following two questions:
Would you drink tertiary-treated sewage?
Would you drink recycled water?

3.46
Towards statistical inference
How long do MATH1041 students take to get to UNSW?

For a SRS of five MATH1041 students, average travel time to UNSW is 52 minutes.

But is this the true average travel time of all MATH1041 students?

It's not, but it's an estimate of the average travel time of all MATH1041 students.

3.47
Some definitions:
Parameter: A number which describes some aspect of a population.

In practice a parameter is usually unknown, but we are interested


in knowing it.

e.g. The true average travel time of all MATH1041 students.

Statistic: A number which describes some aspect of a sample.

Often we use a statistic to estimate an unknown parameter.

e.g. The average travel time of an SRS of 5 MATH1041 students.

3.48
Sampling distributions
If we took a different SRS of 5 MATH1041 students, would we get
the same estimated average travel time?

No: our statistic will vary from sample to sample. This is called sampling variability.

How large is the sampling variability? We could estimate sampling


variability by taking multiple random samples of 5 MATH1041 students,
and estimating the average travel time for each SRS.

In practice this would be tedious, but often we can use a computer to imitate random behaviour via simulation.

3.49
The sampling distribution of a statistic is the distribution of values
taken by the statistic in all possible samples of the same size from the
same population.

By taking 1000 samples, calculating x̄, and drawing a histogram, we
get a very good estimate of the sampling distribution of x̄.

But:
Most of the time we won't have a population we can sample from
in simulations. Without the survey data, we couldn't have done the
simulation!

In many cases, such as those we will consider in weeks 7-9, we can use
probability theory to work out the sampling distribution.

3.50
Bias and variability
Two key features of the sampling distribution of a statistic are its bias
and variability:
Bias: concerns the centre of the sampling distribution.

A statistic is unbiased if the mean of its sampling distribution equals


the true value of the parameter being estimated.
Variability: of a statistic is described by the spread of the sampling
distribution.

3.51
3.52
How do we reduce bias?

Use random sampling!


An SRS from the entire population of interest has no selection bias.

How do we reduce variability?

Increase the sample size n!


The larger n is, the smaller the variability of the statistic.

Also, sometimes it is possible to reduce variability by improving
sampling protocols (so that there is less measurement error).
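The effect of n can be seen by repeating the earlier simulation idea with two sample sizes; a hedged Python sketch (the population is invented for illustration, as before):

```python
import random
import statistics

random.seed(2)

# Invented population of measurements (an assumption for illustration).
population = [random.gauss(50, 20) for _ in range(5000)]

def sd_of_sample_mean(n, reps=2000):
    """Estimate the variability of the sample mean for SRSs of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

small_n, large_n = sd_of_sample_mean(5), sd_of_sample_mean(50)
print(small_n > large_n)  # larger n gives smaller variability
```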

3.53
3.54
3.55
3.56
Probability
Based on Moore et al.: Sections 4.1–4.2

4.1
Lectures 1–2: Probability
Randomness and probability

Probability models

Probability rules

Assigning probabilities to outcomes

Independence

4.2
Randomness and probability
A phenomenon is random if individual outcomes are uncertain but
there is nonetheless a regular distribution of outcomes in a large num-
ber of repetitions.

e.g. Tossing a fair coin: the outcome is a head (H) or a tail (T). We
don't know if the coin will land heads or tails, but we know that
after many coin tosses we will have about as many heads as tails.

The probability of an outcome of a random phenomenon is the
proportion of times the outcome will occur, in a large number of
repetitions: the long-run frequency.

e.g. When we toss a fair coin, the probability of heads and the proba-
bility of tails equal 0.5. That is, P (H) = 0.5 and P (T ) = 0.5.
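This long-run idea is easy to watch by simulation; here is a quick sketch in Python (the course itself uses R):

```python
import random

random.seed(3)

# Toss a fair coin 10,000 times and track the proportion of heads.
tosses = [random.choice("HT") for _ in range(10_000)]
prop_heads = tosses.count("H") / len(tosses)

# With this many tosses the proportion settles very close to P(H) = 0.5.
print(prop_heads)
```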

4.3
Why study probability?

Recall from last week that when doing an experiment, it is important


to randomly assign subjects to treatment groups.

Recall from last lecture that when doing a survey, it is important to


randomly select subjects for inclusion in our sample (to ensure our
sample is representative).

The data we collect from an experiment or an observational study can


then be treated as random phenomena. So we need to learn basic
rules of probability and probability modelling, so we can learn how to
interpret data!

4.4
Probability Models
To describe a random phenomenon using a probability model, we re-
quire two components:
1. A description of all possible outcomes that can be observed. This
is known as the sample space.

2. The probability of each outcome or set of outcomes.

4.5
Important definitions

When constructing a probability model, we make use of the following


definitions:

Sample space: The sample space of a random phenomenon, S, is the


set of all possible outcomes.

Events: The event A is an outcome or a set of outcomes of a random


phenomenon.

P (A): We will write the probability of an event A as P (A) for short.

4.6
Example
Consider the following random phenomena:
1. A coin is tossed.

2. Ten randomly selected patients are given a treatment. We are


interested in the number of patients who are cured.

3. The mass of a randomly selected airplane (to the nearest tonne).

In each case, write down the sample space of all possible events.

4.7
Probability rules
The following rules follow from the definition of probability:
Rule 1 Any probability is a number between 0 and 1 (inclusive).
That is, for all events A,
0 ≤ P (A) ≤ 1.

Rule 2 If S is the sample space of a probability model,


P (S) = 1.

Rule 3 Two events A and B are disjoint or mutually exclusive if


they can never occur together. If A and B are disjoint, then
P (A or B) = P (A) + P (B).
This is the addition rule for disjoint events.

Rule 4 The complement of any event A (written as Aᶜ) is the event
that A does not occur. The complement rule states that
P (Aᶜ) = 1 − P (A).
4.8
Assigning probabilities to outcomes
How do we assign probabilities to outcomes?

There are a few methods:


Long run empirical observation.
e.g. toss a coin 10 000 times (!) to estimate the proportion of
heads.

If outcomes are equally likely, count the number of equally likely


outcomes in each event.

Use theoretical ideas to derive probabilities, under a set of assump-


tions.

Subjective or expert opinion.

4.9
Equally likely outcomes

If a random phenomenon has k possible outcomes, each equally likely,


then each outcome has probability 1/k.

This means that


P (A) = (count of outcomes in A) / (count of outcomes in S)
      = (count of outcomes in A) / k

4.10
Example reporting to the Board

Adam, Becca, Christina and Damian are on a committee, and they


need to choose two people to report their recommendations to the
Board of Directors. They choose these two people at random (as a
simple random sample).

1. What is the sample space of possible outcomes, when choosing


two committee members to report to the Board?

2. What is the probability of each outcome?

3. What is the probability that Damian doesn't have to report to the


Board of Directors?

4.11
Binomial coefficients counting combinations

Often, when counting outcomes, we wish to count the total number


of possible ways of choosing k subjects out of n.
 
This is known as a binomial coefficient, and it is written as C(n, k)
(read "n choose k"):

C(n, k) = n! / (k! (n − k)!)
        = [n × (n − 1) × ⋯ × 1] / {[(n − k) × (n − k − 1) × ⋯ × 1] × [k × (k − 1) × ⋯ × 1]}
        = [n × (n − 1) × ⋯ × (n − k + 1)] / [k × (k − 1) × ⋯ × 1]

Looks messy! The best way to compute C(n, k) is using your calculator.
(Sometimes called nCr.)
4.12
Example

The total number of ways of choosing two committee members to


report to the Board in the previous example is:

C(4, 2) = 4! / (2! 2!) = (4 × 3 × 2 × 1) / [(2 × 1) × (2 × 1)] = (4 × 3) / (2 × 1) = 6

There were six ways to select two committee members, out of four
(these make up the sample space we listed previously).

Warning: Binomial coefficients get very big (and messy) very quickly,
e.g. C(40, 15) = 40,225,345,056.

Your calculator starts to struggle calculating C(n, k) when n and k are
in the hundreds.
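If you are at a computer rather than a calculator, binomial coefficients can be computed exactly; for instance, in Python (the course uses R, where the equivalent is `choose(n, k)`):

```python
from math import comb  # exact integer binomial coefficients

print(comb(4, 2))    # 6 ways to choose 2 committee members out of 4
print(comb(40, 15))  # 40225345056 -- too big for many calculators
```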
4.13
Multiple choice quiz

A test consists of five True/False questions. A student decides to


answer the questions randomly True or False, without even reading
the question.

The student ends up getting three questions correct.

How many ways could they have got three answers correct?

Answer:

4.14
OzLotto
In a game of OzLotto, you choose 7 numbers, out of the integers
from 1 to 45. Seven numbers are drawn at random, and if all are your
numbers, you win the Jackpot.

How many ways are there of choosing 7 numbers out of 45?

What is your chance of winning the jackpot?

Answer:

4.15
Independence
An important idea in research is that of independence.

Often the research question of interest involves exploring whether two


variables are independent of each other.

Is birthweight independent of whether or not a mother smokes while


pregnant?
Is on-time performance related to which airline you fly with?

4.16
The idea of independence of events is defined formally in terms of
probability.
Rule 5 Two events A and B are independent if knowing whether one
occurs or not does not change the probability that the other occurs.

If events A and B are independent then

P (A and B) = P (A) P (B).

This is the multiplication rule for independent events.

4.17
Example reporting to the Board (again!)

Two people out of Adam, Becca, Christina and Damian need to report
to the Board of Directors. They choose these two people at random
(as a simple random sample).

1. What is the chance that Damo has to report to the Board of Directors?

2. What is the chance that Becca has to report to the Board?

3. Are these two events independent?

4.18
Example gender bias in promotion

Below is a table of the proportion of people promoted at a workplace,


by gender.

                  Promoted?
           Yes     No     Total
Male       0.45    0.15   0.60
Female     0.30    0.10   0.40
Total      0.75    0.25   1.00

Is there gender bias in promotion? Or is promotion independent of


gender?

4.19
Additional Exercise

Assume that forty percent of Australians approve of the Prime
Minister's performance. We take a random sample of five Australians
and ask them if they approve of the Prime Minister's performance.

Find the probability that:

1. None of the five people approve of him.

2. Exactly two people approve of him.

3. No more than two people approve of him.

4.20
Answer:

4.21
4.22
General Probability Rules
Based on Moore et al.: Section 4.5

4.23
Lectures 3–4: General Probability
Rules
We will meet some useful probability rules, and the important idea of
conditional probability.
More addition rules

Conditional probability defined

Conditional probability and independence

Conditional probability rules

4.24
More addition rules
Addition rule for many disjoint events:
If A, B and C are disjoint events, then

P (one or more of A, B, C) = P (A) + P (B) + P (C)


This rule extends to any number of disjoint events.

Addition rule for any two events


For any two events A and B,

P (A or B) = P (A) + P (B) P (A and B)

Note that if A and B are disjoint, then P (A and B) = 0 and so


P (A or B) = P (A) + P (B) as in Rule 3.

4.25
Example UNSW student population

The UNSW website reported that at the start of 2004, amongst com-
mencing first year UNSW students, 87% had enrolled full-time, 15%
were international students, and 14% were both full-time and interna-
tional students.
1. What is the proportion of UNSW students that were full-time or
international?

2. What is the proportion of UNSW students that were part-time?

3. What is the proportion of UNSW students that were part-time and


international?

4. What is the proportion of UNSW students that were part-time or


international?

4.26
Conditional probability defined
The probability of an event can change if we know some other event
has occurred.

For example, consider the following table of first year UNSW enrolment
data:

Full-time Part-time
Local 0.73 0.12
International 0.14 0.01

Overall, 73+14=87% of students are enrolled full-time, so if we ran-


domly select a student, the probability that they are full-time is 0.87.

But if we know that the student is international, what is the chance


that they are full-time?

4.27
For two events A and B (where P (A) ≠ 0), the conditional proba-
bility of B given A is

P (B | A) = P (A and B) / P (A)

In this formula A is the information that we are given, and B is the


event whose probability we are computing.

4.28
Example student enrolments

Full-time Part-time
Local 0.73 0.12
International 0.14 0.01

If we know that John is an international student, what is the chance


that he is studying full-time?
P (Full-time | International) = P (Full-time and International) / P (International)
                              = 0.14 / (0.14 + 0.01) = 14/15
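The same arithmetic as a small Python sketch (the joint proportions are taken from the enrolment table on this slide; the course itself uses R):

```python
# Joint proportions from the first-year enrolment table.
p_ft_intl = 0.14  # P(Full-time and International)
p_pt_intl = 0.01  # P(Part-time and International)

p_intl = p_ft_intl + p_pt_intl        # P(International)
p_ft_given_intl = p_ft_intl / p_intl  # conditional probability

print(round(p_ft_given_intl, 4))  # 14/15, about 0.9333
```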

If we know that Anna is studying full-time, what is the chance that


she is an international student?

4.29
You can think about P (B | A) as a branch on a tree diagram: it is
the branch that leaves from the event A and goes to the event B:

4.30
A is the new S

Consider again the definition of conditional probability: for two events
A and B (where P (A) ≠ 0), the conditional probability of B given A is

P (B | A) = P (A and B) / P (A)

This looks a bit like how you calculate probabilities from equally likely
outcomes (from last lecture):

P (B) = (count of outcomes in B) / (count of outcomes in S)
The essential difference is that we have replaced the sample space S
with A.

This is because we know that A has occurred (it is given), so A
becomes the new sample space.
4.31
A good diagram of this idea:

[Diagram: a Venn-style picture of P (B | A) = P (A and B) / P (A).
Refocus on those parts of B which are in the conditioning event A,
then rescale by dividing by P (A).]
4.32
Conditional probability and
independence
Recall that A and B are independent if
P (A and B) = P (A) P (B)

Conditional probability offers an equivalent but perhaps more natural


way of defining independence:

A and B are independent if


P (B | A) = P (B)

In words: The probability of the event B is the same no matter


whether or not we know that A happened.

This is one of the most important uses of conditional probability it


enables the study of independence of variables.
4.33
Example student enrolments
Overall, 87% of students are full-time. 15% of students are interna-
tional, and 14 out of every 15 international students are full-time.

Is whether or not you are full-time independent of whether or not you


are an international student?

4.34
Conditional probability rules
There are three key rules to know about conditional probabilities, but
we only cover two in this course:
Multiplication rule

P (A and B) = P (A)P (B | A)

Law of Total Probability For any two events A and B,

P (B) = P (A)P (B | A) + P (Aᶜ)P (B | Aᶜ)

We will discuss both of these rules.

4.35
The multiplication rule

For any events A and B,

P (A and B) = P (A) P (B | A)

This rule follows from the definition of conditional probability (by re-
arranging it).

4.36
Example student enrolments

In 2004, 15% of UNSW students were international, and 14 out of


every 15 international students were full-time.

What is the proportion of UNSW students who were both full-time and
international?

4.37
On a tree diagram, the multiplication rule tells us that to find proba-
bilities of two events happening, just multiply the probabilities across
branches.

4.38
The multiplication rule for more than two
events

This rule can also be generalised to more than two events, e.g. for any
events A, B and C,
P (A and B and C) = P (A) × P (B | A) × P (C | A and B).

Wanna be a Wallaby?

About 40% of high school rugby players go on to play for a club.


About 5% of club players go on to play first grade rugby. About 1%
of first grade rugby players get to play as a Wallaby (Australian Rugby
Team).

What is the chance that a high school rugby player will get to play for
the Wallabies?
4.39
The Law of Total Probability

For any two events A and B,

P (B) = P (A)P (B | A) + P (Aᶜ)P (B | Aᶜ).
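As a sanity check of this rule, the enrolment numbers from the earlier slides can be plugged in (a Python sketch; the overall full-time proportion of 0.87 appeared in the enrolment table above):

```python
# A = international, so Ac = local; B = full-time.
p_A = 0.15              # P(international)
p_B_given_A = 14 / 15   # P(full-time | international)
p_B_given_Ac = 73 / 85  # P(full-time | local)

p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_Ac
print(round(p_B, 2))    # 0.87, matching the enrolment table
```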

Example student enrolments

In 2004, 15% of UNSW students were international, and 14 out of
every 15 international students were full-time. Of the students who
were not international, 73 out of 85 were full-time.

What is the proportion of UNSW students who were full-time?

4.40
On a tree diagram, the law of total probability tells us that to find
the probability of the event B, we find all the outcomes involving B,
multiply across branches to find the probability of these outcomes, and
sum.

4.41
The law of total probability for multiple out-
comes

The law of total probability extends to when there are more than two
possible disjoint events to sum.

e.g. If one and only one of A1, A2 and A3 occurs, then

P (B) = P (A1)P (B | A1) + P (A2)P (B | A2) + P (A3)P (B | A3)

See textbook example 4.47 to see this rule in action.


(see example 4.46 in the 7th edition)

4.42
4.43
4.44
Random variables
Based on Moore et al.: Section 4.3

5.1
Lectures 1–2: Random variables
Random variables defined

Discrete random variables

The binomial distribution

Continuous random variables

5.2
Random variables defined
A Random Variable is a variable whose value is a numerical outcome
of a random phenomenon.

Examples:
Number of children in a family.

Height of a person.

Time travelled to UNSW.

We usually use upper-case letters from near the end of the alphabet
to denote random variables, e.g. X or Y . We reserve Z for a special
type of random variable that we will meet next week.
5.3
Discrete random variables
A random variable X is discrete if X has a countable number of
possible values, say x1, x2, x3, x4, . . . .

The probability distribution of X lists the values and their probabili-


ties:

Value of X    x1  x2  x3  ⋯  xk
Probability   p1  p2  p3  ⋯  pk

These probabilities must satisfy two requirements:


1. Every pi is a number between 0 and 1.

2. p1 + p2 + ⋯ + pk = 1.

5.4
Examples of discrete random variables:
Number of children in a family.

Number of winning lotto numbers that you selected.

The outcome from rolling a six-sided die.

5.5
Probability histograms

We can illustrate the probability distribution of a discrete random vari-


able using a probability histogram a histogram constructed using
the true probabilities for the variable (rather than relative frequencies
from sample data).

5.6
Some examples

Example Malcolm's sub-committee


Malcolm assigns three public servants to a sub-committee. Each se-
lected public servant has a 50:50 chance of being female.

Consider X, the number of females on the sub-committee.


1. Is X a discrete random variable?

2. What are the possible values of X?

3. Find the probability distribution of X.

4. Construct a probability histogram for X.

5.7
Example Rock-paper-scissors
Bill and Ben play rock-paper-scissors. Bill chooses one of rock, paper
and scissors at random. If Bill and Ben pick the same object, they play
again, until someone wins.

Let X be the number of times Bill and Ben play until there is a winner.

1. What is the chance that Bill and Ben pick different objects first
time around? That is, what's P (X = 1)?

2. What is the chance neither Bill nor Ben wins until the 4th time
they play?

3. Is X a discrete random variable?

4. Find the probability distribution of X.

5.8
Example Texas hold 'em
In the variation of poker known as Texas hold 'em, each player receives
two cards. Here is the probability distribution of the number of aces
you get in a two-card hand:

Number of aces 0 1 2
Probability 0.85 0.14 0.01
1. Verify this is a proper probability distribution for a discrete random variable.

2. Find P (X > 0).

3. Find P (X = 1 | X > 0).

4. Draw the probability histogram. Comment on its shape.

5.9
Example left-handers
About 12% of Australians are left-handed. Three people are sitting
down at lunch (assume these are a random sample of Australians with
respect to handedness).

Let X be the number of left-handers in the sample.


1. What are the possible values of X?

2. Find the probability distribution of X.

3. Find P (X = 2).

4. Find P (X ≤ 2).
This is an important example known as the binomial distribution.

5.10
The Binomial Distribution
A special type of discrete random variable which is very useful in prac-
tice is the binomial distribution. It is used as a model for counts
in a table of frequencies for a categorical variable that takes only two
values.

Suppose an experiment is performed independently n times. At each
trial, there are two possible outcomes: success and failure. Each
trial has probability p of success, independently of all other trials.

Let X = the number of successful trials. X can be 0, 1, 2, . . ., n.

The probability distribution is given by:

P (X = k) = C(n, k) p^k (1 − p)^(n−k).

We say X has a Binomial Distribution, which we write for short as:

X ~ B(n, p).
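The formula can be coded directly; a sketch in Python (the helper name `binom_pmf` is mine, and the setting is the sub-committee example with n = 3 and p = 0.5):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Number of females on a 3-person sub-committee, X ~ B(3, 0.5).
probs = [binom_pmf(k, 3, 0.5) for k in range(4)]
print(probs)       # [0.125, 0.375, 0.375, 0.125]
print(sum(probs))  # the probabilities sum to 1
```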
5.11
The binomial distribution is important because it is useful in modelling
any variable that has only two possible outcomes:
Number of females on a sub-committee (out of 3 sub-committee
positions).

Number of patients whose treatment is successful (out of 23 on


whom the treatment was tried).

Number of male salmon that find a mate (out of a population of


457 salmon).

Number of on-time QANTAS flights (out of 52 flights).

The key assumptions are that the trials we are counting need to be
independent and to have the same probability of success.
(As it turns out, this can be guaranteed by random sampling!)

5.12
Exercise: recognising binomial situations

Which of the following can be modelled by a binomial distribution?


1. A random sample of 10 packets of cornflakes is selected from a
production line and their contents are weighed. X is the number
of packets weighing more than 750g.

2. There is a large shipment of electrical components (say 100,000).


A random sample of 20 is selected. Let X be the number of faulty
components out of the 20.

3. A couple continues to have children until the first girl is born. Let
X be the total number of children.

5.13
Answer:

5.14
Exercise: Calculating binomial probabilities

In each of the following,


1. What is the distribution of X?

2. List the values X can take and their probabilities.

3. What is the probability of passing the test? (Getting at least half


the questions correct.)

5.15
A test consists of five True/False questions. A student decides to
answer the questions randomly True or False, without even reading
the question. Let X be the number of correct answers the student
gets out of 5.

A test consists of six multiple choice questions each of which has


5 possible answers listed (only one of which is correct). A student
answers the questions randomly, without even reading the question.
Let X be the number of correct answers the student gets out of
6.

5.16
Continuous random variables
A random variable X is continuous if it takes all values in some
interval of real numbers.

This means that a continuous random variable can take infinitely many
values: any value at all in the interval on which it is defined.

Discrete or continuous?
Consider the following random variables:
Number of children in a family.

Height of a person.

A randomly chosen number that takes any value between 0 and 1.


Which of these are continuous random variables?
5.17
Density curves

The probability distribution of a continuous random variable can't be
written down in a table (because there are infinitely many possible
values). Instead, it can be described using a mathematical function,
known as a density curve.

A density curve is a curve that describes the overall pattern of a
distribution. The area under the curve in any range of values is the
proportion of all observations that fall within that range: the proba-
bility of observing a value in that range.

A density curve can be thought of as a smoothed-out histogram.

5.18
5.19
The density curve as a smoothed-out histogram.

5.20
Density curves are mathematical functions which can be used to de-
scribe the probabilistic behaviour of measurements of interest.

A density curve has two key properties:


It never takes a negative value (no such thing as a negative prob-
ability!)

The area under the whole curve is 1.

5.21
Some example density curves:
birthweight survival time after heart transplant

waiting time for next bus duration of geyser eruptions

5.22
Why all this talk about density curves?

Any continuous, quantitative variable can be described using a density


curve.

Density curves are important, because if you know the density curve
of a variable, you can calculate any probability you want about this
variable!

e.g. If you know the density curve of the life expectancy of patients
using beta blockers (heart medication), then you could calculate the
proportion of patients using beta blockers who will live for another 40
years.

5.23
How do you calculate probabilities using a
density curve?

The area under the density curve in the range of values you are inter-
ested in tells you the probability of observing a value in that range.

To calculate a probability from a density curve:


Sometimes you can use geometry (area of a triangle etc).

Sometimes we can use R (we will do this later in the course).


Sometimes you need to use integration, P (a < X < b) = ∫_a^b f (x) dx,
for density curve f (x) (not considered in the course).
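A fourth option worth knowing about is simulation: draw many values and count the fraction landing in the range. A Python sketch for the uniform density on (0, 1), where the geometric answer is just the rectangle area b − a (the specific endpoints are my choice for illustration):

```python
import random

random.seed(4)

a, b = 0.25, 0.75
exact = b - a  # area of a rectangle of height 1 and width b - a

# Monte Carlo estimate of the same area under the density curve.
draws = [random.random() for _ in range(100_000)]
estimate = sum(a < x < b for x in draws) / len(draws)

print(exact)
print(round(estimate, 2))  # close to the exact area of 0.5
```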

5.24
Example uniform random numbers

A number X is chosen at random between 0 and 1.

1. Sketch the density curve of X.

2. Label the y-axis of your plot as appropriate (recalling that the area
under the whole density curve must be 1).

3. Find P (X = 1/4).

4. Find P (0 ≤ X ≤ 1/2).

5. Find P (1/4 ≤ X ≤ 1/2).

6. Find a general formula that gives us P (a ≤ X ≤ b) for any a and b
such that 0 ≤ a ≤ b ≤ 1.

5.25
5.26
Means and variances
of random variables
Based on Moore et al.: Section 4.4

5.27
Lectures 3–4: Means and variances
of random variables
The mean of a random variable

Rules for means

The variance of a random variable

Rules for variances of random variables

Other rules for means and variances

5.28
Example betting on a coin toss

Imagine your friend comes to you with the following proposition:

Let's play a game. I toss a (fair) coin, and:


If it lands heads, I give you $2.

If it lands tails, you give me $2.

Would you play?

5.29
The mean of a random variable
One way to answer this question is:

How much money will you win per game in the long run?

What we would like to calculate here is the mean of the random


variable.

We can speak of the mean of a random variable in the same way as


we speak of the mean of a set of observations (the sample mean, the
ordinary average).

The mean of a random variable is the long-run average of the
values it can take: an average of the values weighted according to
the probabilities or density function.

5.30
Notation for Means

We write the mean of the random variable X as μX , whereas we write
the mean of a sample as x̄.

That is: when talking about a set of data we use

mean = x̄.

When talking about a random variable and its probability distribu-
tion (or density curve) we use

mean = μ.

μ (pronounced mi-oo) is the Greek letter m – m for Mean!

5.31
Example betting on a coin toss

Let X be the amount won (in dollars). X has probability distribution:

Value of X -2 2
Probability 0.5 0.5

If you play 100 games, how many wins would you expect? How many
losses? Whats the average?

5.32
The mean of a discrete random variable

Suppose X is a discrete random variable whose probability distribution


is:

Value of X    x1  x2  x3  ...  xk
Probability   p1  p2  p3  ...  pk

Then the mean of X, μX , is found as

μX = x1p1 + x2p2 + ⋯ + xkpk
   = Σ_{i=1}^{k} xi pi

5.33
Example Texas hold 'em
In the variation of poker known as Texas hold 'em, each player receives
two cards. Here is the probability distribution of the number of aces
you get in a two-card hand:

Number of aces 0 1 2
Probability 0.85 0.14 0.01

What's the long-run average number of aces per game?


That is, if you played a huge number of games of Texas hold 'em, and
found the average number of Aces you were dealt each game, what
would you expect the answer to be?

I would expect about 85% of values to be 0, about 14% to be 1, and
about 1% to be 2. So I'd expect the average to be about:

0 × 0.85 + 1 × 0.14 + 2 × 0.01 = 0.16
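This weighted-average calculation is one line of code; a Python sketch with the aces distribution from the slide (the course itself uses R):

```python
values = [0, 1, 2]          # number of aces in a two-card hand
probs = [0.85, 0.14, 0.01]  # probabilities from the slide

# Mean of a discrete random variable: values weighted by probabilities.
mu = sum(x * p for x, p in zip(values, probs))
print(round(mu, 2))  # 0.16 aces per hand in the long run
```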

5.34
Example rolling a fair die

Let X be the number that Xena gets when she rolls a 6-sided die
(which has equally likely outcomes 1, 2, 3, 4, 5 and 6).

What is the probability distribution of X?


What is the mean of X?
Answer:

5.35
The law of large numbers

The law of large numbers is that if you take a large number of inde-
pendent observations of a random variable, then the sample mean x̄
will be close to the true mean μX . The larger the sample size, the
closer we expect x̄ to be to the true mean μX .

This is commonly thought of as the law of averages.
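The law can be watched in action by tracking a running average; here is a Python sketch with die rolls (true mean 3.5), since the haircut data aren't reproduced here:

```python
import random

random.seed(5)

rolls = [random.randint(1, 6) for _ in range(20_000)]

def mean_of_first(n):
    """Running average of the first n rolls."""
    return sum(rolls[:n]) / n

# The running average drifts toward the true mean of 3.5 as n grows.
for n in (10, 100, 20_000):
    print(n, round(mean_of_first(n), 2))
```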

5.36
We can see the law of large numbers in action by looking at the hair
cut cost data.

Average hair cut cost of first two students: $37.5

Average hair cut cost of first twenty students: $66.3

Average hair cut cost of first hundred students: $47.71

Average hair cut cost for the population of all MATH1041 students:
$50.7

5.37
A graph of average hair cut cost against sample size:

5.38
Rules for means
If X is a random variable with mean μX , and Y is a random variable
with mean μY , and a and b are fixed numbers, then:
1. The variable a + bX has mean:

μ(a+bX) = a + bμX

2. The variable X + Y has mean:

μ(X+Y ) = μX + μY

These rules can be combined, by saying that the variable aX + bY has
mean:

μ(aX+bY ) = aμX + bμY

5.39
Example rolling a fair die twice

Xena rolls her 6-sided die twice. Let Y = X1 + X2 be the sum of the
values from her two rolls.

What is the mean of Y ?
Answer:

5.40
Example Texas hold em variation

Here is the probability distribution of the number of aces you get in a


two-card hand:

Number of aces 0 1 2
Probability 0.85 0.14 0.01

Steven is bored of poker so he offers his playing partners a bet: if you
give him $2 every round, he will give you $10 for every ace you get.

Should you take him up on his offer, or will Steven win out in the long
run?
Answer:

5.41
Example betting big on a coin toss

Imagine your friend comes to you with the following proposition:

Let's play another game. I toss a (fair) coin, and:


If it lands heads, I give you $2,000.

If it lands tails, you give me $2,000.


Would you play?

What's the mean amount won in the long run?

Is your answer different to when the bet was for $2?

5.42
Many people give different answers for the $2 and $2,000 games, even
though the mean is the same. Clearly the mean isn't everything
when it comes to investment decisions. There is something else going
on. . .
risk!

One way to measure risk is to think about how variable your gain is,
by studying the variance.

5.43
Variance of a random variable
The mean is a measure of the centre of the distribution of a random
variable. The usual measure of spread of a random variable is its
variance.

To distinguish this quantity from the variance of a sample, which is
written as s², the variance of a random variable is written as σ².

The variance of a random variable, σ², is defined as the average of the
squared deviation (X − μX )² of the variable X from its mean μX .

5.44
Suppose X is a discrete random variable whose probability distribution
is:

Value of X    x1  x2  x3  ⋯  xk
Probability   p1  p2  p3  ⋯  pk

The variance of X is

σX² = (x1 − μX )²p1 + (x2 − μX )²p2 + ⋯ + (xk − μX )²pk
    = Σ_{i=1}^{k} (xi − μX )²pi

where μX is the mean of the random variable X.

The standard deviation of X is

σX = √(σX²)
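As a quick worked check of these formulas, the $2 coin-toss bet from earlier (win +$2 or lose $2, each with probability 0.5) can be computed in a few lines of Python:

```python
values = [-2, 2]   # amount won on the $2 coin-toss bet
probs = [0.5, 0.5]

# Mean, then variance as the probability-weighted squared deviations.
mu = sum(x * p for x, p in zip(values, probs))
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
sd = var ** 0.5

print(mu, var, sd)  # mean 0.0, variance 4.0, standard deviation 2.0
```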

5.45
Examples

1. Find the variance and standard deviation of X, the number Xena


gets when she rolls a fair die.

2. Find the variance of X, the number of aces in a two-card hand of


Texas hold 'em.
Answer:

5.46
Rules for variances
Rule 1. If X is a random variable and a and b are fixed numbers, then:

σ(a+bX)² = b²σX².

Rule 2. If X and Y are independent random variables, then X + Y
and X − Y have variances:

σ(X+Y )² = σX² + σY²
σ(X−Y )² = σX² + σY²

Rule 3. If X and Y are random variables with correlation ρ, then X + Y
and X − Y have variances:

σ(X+Y )² = σX² + σY² + 2ρσX σY
σ(X−Y )² = σX² + σY² − 2ρσX σY

The correlation ρ is defined for random variables in a similar way to
the sample correlation r.
5.47
Examples

1. Find the variance and standard deviation of Y = X1 + X2, Xena's
total score when she rolls a fair die twice.

2. Find the standard deviation of the total amount you expect to lose
playing Steven's game as described previously.

5.48
Rules for variances
Rule 4. If X and Y are independent random variables and a and b
are fixed numbers, then:

σ(aX+bY )² = a²σX² + b²σY².
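Rule 4 can be checked by simulation; a Python sketch with two independent fair dice (the variance of a single die roll is 35/12, and a = 2, b = 3 are my choice for illustration):

```python
import random
import statistics

random.seed(6)

a, b = 2, 3
xs = [random.randint(1, 6) for _ in range(50_000)]
ys = [random.randint(1, 6) for _ in range(50_000)]  # independent of xs

# Simulated variance of aX + bY versus the Rule 4 prediction.
sim = statistics.pvariance([a * x + b * y for x, y in zip(xs, ys)])
theory = a**2 * (35 / 12) + b**2 * (35 / 12)

print(round(theory, 2))  # 37.92
print(round(sim, 2))     # lands close to the theoretical value
```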

5.49
Example seed weights

A laboratory scale reports that its measurement error is 10mg. That is,
if you put the same object on the scales multiple times, the standard
deviation of measurements will be 10mg.

Dan requires accurate measurements of seed size, so he decides to


weigh each seed twice rather than once, and to average these weights.
Let X1 be the reported weight of the seed the first time it is weighed,
and let X2 be the weight the second time.
1. Find the variance of X1.

2. Find the variance and standard deviation of the average weight,
(X1 + X2)/2.

3. Is Dan onto a good thing, weighing his seeds more than once?
5.50
5.51
Other rules for means and variances

Non-linear transformations

We know how μX and σX² change under linear transformation, but what
about under non-linear transformations?

If Y = f (X) for some non-linear transformation f (), then:

μY = ???    σY² = ???

It is possible to work out μY and σY² (discussed in MATH2801 but not
this course), but the point is that they are not some known function
of μX and σX².

e.g. if Y = log X then μY ≠ log μX , and the actual answer for how μY
relates to μX depends on the distribution of your data.
5.52
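A quick simulation sketch of the log example (the exponential variable here is my own choice of a skewed positive variable, not from the slides):

```r
# Illustrating that for a non-linear transformation Y = log(X),
# mu_Y is NOT simply log(mu_X)
set.seed(1)
x <- rexp(1e5, rate = 1)   # a skewed positive variable with mean 1

mean(log(x))               # mu_Y: noticeably below 0
log(mean(x))               # log(mu_X): close to 0
```

The gap between the two numbers is real (for log, the mean of the transformed values is always below the transform of the mean), and its size depends on the distribution of X.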
Mean and variance of continuous random variables

The definitions of means and variances in this lecture were for discrete random variables only. There are corresponding definitions for continuous random variables. Although these are given here, you will not be expected to know them.

If X is a continuous random variable with density curve f(x), then:

The mean of X is:

μ_X = ∫ x f(x) dx    (integrating over all possible x)

The variance of X is:

σ²_X = ∫ (x - μ)² f(x) dx    (integrating over all possible x)
5.53
The rules for means and variances given previously apply to continuous random variables also.

As is the case for the sample mean of a dataset, the mean μ_X of a density curve can be thought of as a "centre of gravity".

5.54
[Figure: the mean as the centre of gravity of a density curve]
5.55
The Normal Distribution

6.1
Lectures 1-2: The normal distribution
This lecture, we will meet the normal distribution, a very important distribution in statistics. We will be using this distribution throughout this course, both to model data and to make inferences from data.
Normal curves

The normality assumption

The 68-95-99.7 Rule for Normal Curves

Calculating Probabilities Using the Standard Normal Distribution

Reverse Use of Standard Normal Probabilities

Calculating Probabilities for any Normal Variable

6.2
Density curves: revision

Recall that a density curve is a mathematical function that describes


the overall pattern of a distribution like a smoothed-out histogram.

For any particular range of values, the area under the curve is the
probability of an observation falling in that range.

6.3
Normal Curves
Normal curves are a special type of density curve.

Normal curves are mathematical functions such as

y = (1/√(2π)) e^(-x²/2).

The normal curve is very important in statistics because it is a good


probability model for many quantities of interest to us in data analysis.

The following slide shows 5 different normal curves.

6.4
[Figure: five different normal curves, plotted over the range -10 to 10]

6.5
Why So Many Normal Curves?

Answer: We need this big family of normal curves to handle different


combinations of central location and spread.

The general form of the equation for a normal curve is:

y = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)).

It has two parameters (values we need to know to calculate a curve):

μ, the mean, measures location; and

σ, the standard deviation, measures spread.

6.6
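The formula above can be checked against R's built-in normal density function dnorm() (a sketch; the values μ = 100, σ = 15 are the IQ parameters used later in this lecture):

```r
# The normal curve formula agrees with R's built-in density dnorm()
mu <- 100; sigma <- 15
x  <- 110

y_formula <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
y_dnorm   <- dnorm(x, mean = mu, sd = sigma)

# y_formula and y_dnorm are identical (up to machine precision)
```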
Normal distribution shorthand

An example of this is:

X has a normal distribution with mean 100 and standard deviation 15.

Shorthand: X ~ N(100, 15)
6.7
Warning: The Moore et al. textbook normal distribution shorthand is
not universal!

Many other books/research papers use

X ~ N(100, 225)    (note 225 = 15²)

for the example from the previous slide.

Moore et al.    Many Others
N(μ, σ)         N(μ, σ²)

6.8
The Normality Assumption
The normal distribution is a good model for a data set if its histogram
looks like a normal curve.

This can be re-phrased as:

The normality assumption is reasonable.

6.9
Measurements Often Approximately Normal:
birthweights
heights of adults
intelligence quotient (IQ)
final assessment scores

Measurements Usually Non-normal:


time until some event (bus arrival, earthquake)
chemical concentrations (e.g. hydrogen in solutions)
house prices

6.10
Warning: always check the normality assump-
tion!

Just because our IQ measurements were normal last year doesn't necessarily imply that they will be this year.

It's best to check the normality of every new dataset.

6.11
Is the normality assumption reasonable?
Melbourne max. temps.

[Figure: histogram of Melbourne maximum temperatures (Celsius), with density on the vertical axis]

6.12
Graphical Assessment of Normality Assump-
tion

Boxplots Almost useless

Histograms Somewhat helpful


(but can be misleading)

Normal quantile plots Best


(tailor-made for the job)

6.13
What a Normal Distribution Should Look Like

Histogram of Normal SRS

[Figure: histogram of a simple random sample from a normal distribution; frequency against data value]
6.14
What a Normal Quantile Plot Should Look Like

Normal Quantile plot of Normal SRS

[Figure: normal quantile plot of a normal SRS; sample quantiles against normal quantiles, falling close to a straight line]
6.15
Right Skewed Data

Histogram of Right Skewed Data

[Figure: histogram of right skewed data]
6.16
Right Skewed Data

Normal Quantile plot of Right Skew Data

[Figure: normal quantile plot of right skewed data, curving away from a straight line]
6.17
Left Skewed Data

Histogram of Left Skewed Data

[Figure: histogram of left skewed data]
6.18
Left Skewed Data

Normal Quantile plot of Left Skew Data

[Figure: normal quantile plot of left skewed data, curving away from a straight line]
6.19
Normal Quantile Plot for Melbourne Temperatures

Normal Quantile plot of Melbourne Temperature data


[Figure: normal quantile plot of the Melbourne temperature data]
6.20
How do Normal Quantile Plots work?

A normal quantile plot compares the observed data (temperature, etc.)


to a set of values calculated from a normal distribution (normal quan-
tiles) which represent what we would expect to see for a normally
distributed variable.

We check to see how closely the data follow a straight line.

Data follow a straight line
⇒ Data are close to what we expect of a normal variable
⇒ The normal approximation is good
6.21
The normal quantiles (z-scores) are on the horizontal axis. These are calculated as the n values that break the normal curve into a set of shapes with equal area.

[Figure: standard normal curve divided into regions of equal area; the z-scores on the horizontal axis are chosen so that they split the area equally]
6.22
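In R, a normal quantile plot can be drawn with the built-in qqnorm() and qqline() functions. A sketch using simulated data (the "temperature-like" numbers below are made up for illustration):

```r
# A normal quantile plot in R: qqnorm() plots sample quantiles
# against normal quantiles, and qqline() adds a reference line
set.seed(42)
x <- rnorm(100, mean = 25, sd = 5)   # simulated, roughly temperature-like data

qqnorm(x)
qqline(x)
# For normal data, the points fall close to the straight line
```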
Normality Should Not Be Treated as Yes/No

We shouldn't say that data are "normal" or "not". Instead it is better to describe the extent to which the normality assumption holds for the data set at hand, using a scale such as:

excellent
good
fair
poor
hopeless

6.23
Extent of Normality for Class Survey Data

data normality
Melb. temp.
UNSW satisf.
travel time
hair cost
labour

6.24
The 68-95-99.7 Rule for Normal
Curves
For normal measurements:
About 68% of data fall within σ of the mean
(that is, within one standard deviation of their mean).

About 95% of data fall within 2σ of the mean.

About 99.7% of data fall within 3σ of the mean.

6.25
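The rule can be verified with R's pnorm() function, which gives P(Z < z) for a standard normal variable (met formally later in this lecture):

```r
# Checking the 68-95-99.7 rule for a standard normal variable
p1 <- pnorm(1) - pnorm(-1)   # within 1 sd: about 0.683
p2 <- pnorm(2) - pnorm(-2)   # within 2 sd: about 0.954
p3 <- pnorm(3) - pnorm(-3)   # within 3 sd: about 0.997
```

So "68-95-99.7" is itself an approximation of the exact values 68.3%, 95.4% and 99.7%.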
The 68-95-99.7 Rule for Normal
Curves

[Figure: normal curve showing 68% of data within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3]

6.26
Example: Intelligence Quotient (IQ)

IQ historically follows a normal distribution with

μ = mean = 100 and σ = standard deviation = 15.

Application of the 68-95-99.7 Rule leads to


About 68% of people have IQ between 85 and 115.

About 95% of people have IQ between 70 and 130.

About 99.7% of people have IQ between 55 and 145.

6.27
Calculating Probabilities for the
Standard Normal Distribution
The standard normal distribution is the one for which

μ = 0 and σ = 1.

The most common symbol for standard normal variables is Z.

Shorthand: Z is N(0, 1).

6.28
Probability that Z < 1.4

Probabilities concerning Z correspond to the area under the curve over


the corresponding region.

[Figure: standard normal curve with the area to the left of 1.4 shaded, representing P(Z < 1.4)]
6.29
It can be found by using statistical software such as R, via the command pnorm():
pnorm(1.4) = P(Z < 1.4) = 0.9192

Alternatively, we say that 91.92% of observations of Z are less than 1.4.

6.30
P(-1.39 < Z < 0.43)

[Figure: standard normal curve with the area between -1.39 and 0.43 shaded]

6.31
P(-1.39 < Z < 0.43) = P(Z < 0.43) - P(Z < -1.39)
= 0.6664 - 0.0823
= 0.5841

About 58% of Z values are between -1.39 and 0.43.

6.32
Probability/Proportion Calculations Concern-
ing Z

The textbook likes to say:

"What proportion of observations on Z take values less than 1.4?"

Other books might say:

"What is the probability that Z is less than 1.4?"

Or, for short,

"What is P(Z < 1.4)?"

Regardless of how the question is asked, we need a computer to answer


the question.

6.33
Class Exercise: Probabilities Concerning Z

1. What is the probability that Z < 0.78?

2. What is the probability that 1 < Z < 2?

3. What is the probability that Z > 1.4? (Hint: total area is 1).

Hint for all questions: Draw diagrams!

If available, you can use R, e.g. for Q1:

R function: pnorm(-0.78)

6.34
Answer:

6.35
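One way to check the class exercise with R (a sketch; draw the diagrams first, as the hint suggests):

```r
# 1. P(Z < -0.78)
q1 <- pnorm(-0.78)              # about 0.218

# 2. P(-1 < Z < 2) = P(Z < 2) - P(Z < -1)
q2 <- pnorm(2) - pnorm(-1)      # about 0.819

# 3. P(Z > 1.4) = 1 - P(Z < 1.4), since the total area is 1
q3 <- 1 - pnorm(1.4)            # about 0.081
```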
Reverse Use of Standard Normal
Probabilities
Sometimes we need to use standard normal probabilities in reverse.

Example: Find the number c for which the proportion of Z values exceeding c is 10%.

More concisely:
P(Z > c) = 0.1. What is c?

[Figure: standard normal curve with area 0.9 to the left of c and area 0.1 to the right]
6.36
Solution

From the diagram we have

c = 1.28

i.e. 1.28 is the number that 10% of Z values exceed.

It is easier to use R, via the qnorm() function:

R function: qnorm(0.9)

6.37
Class Exercise: Reverse Use of Standard Normal Probability Ta-
bles

33% of Z values are less than which number?

Answer:

qnorm(0.33) = - 0.44

6.38
Calculating Probabilities for any
Normal Variable
Now we know how to compute probabilities concerning Z, a N(0, 1) variable.

But how do we compute probabilities concerning X, a N(μ, σ) variable?

6.39
Example

Let
X = Intelligence Quotient (IQ) of humans.

Assuming
X is N (100, 15)

What proportion of humans have IQ more than 110?

6.40
Standardising Transformation

RESULT

If
X is N(μ, σ)

then
Z = (X - μ)/σ is N(0, 1).

This is a useful result because we have tables for N(0, 1) variables!

6.41
Link to Linear Transformations

The result on the previous slide is just a linear transformation

Z = a + bX

where
a = -μ/σ and b = 1/σ.

6.42
Temperature Analogy

The metric system applies a linear transformation to Fahrenheit measurements to obtain Celsius measurements, since the latter is more natural (e.g. 0°C is freezing, 100°C is boiling).

Statistics applies a linear transformation on general normal measure-


ments to obtain standard normal measurements, because we can
use R to find probabilities for standard normal measurements.

From a mathematical viewpoint, these are exactly the same type of


operation.

6.43
Back to the IQ Example

X is human IQ and is a N(100, 15) variable. What proportion of humans have IQ more than 110?

P(X > 110) = P(X - 100 > 110 - 100)
= P(X - 100 > 10)
= P((X - 100)/15 > 10/15)
= P(Z > 10/15)
= P(Z > 0.67)

The proportion of humans with IQ exceeding 110 is the same as the probability that Z > 0.67.

6.44
P(Z > 0.67)

[Figure: standard normal curve with the area to the right of 0.67 shaded]

6.45
Answer:

Using R (i.e. pnorm(0.67)) we have P(Z < 0.67) = 0.7486 and
1 - 0.7486 = 0.2514.

So about 25% of people have an IQ exceeding 110.

Note that you could also use 1 - pnorm(0.67).

6.46
Birthweight Example

On the island nation of Tonga it has been observed that birthweights


in grams have a normal distribution with

mean = μ = 3500 and standard deviation = σ = 600.

What proportion of Tongan babies have birthweights between 2000g


and 3000g?

6.47
Answer:

6.48
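A sketch of the birthweight calculation in R (the slides leave the answer blank for lectures):

```r
# P(2000 < X < 3000) for birthweights X ~ N(3500, 600)
p <- pnorm(3000, mean = 3500, sd = 600) - pnorm(2000, mean = 3500, sd = 600)
p       # about 0.196, i.e. roughly 20% of Tongan babies

# Equivalently, by standardising: the z-scores are -5/6 and -2.5
p_z <- pnorm(-5/6) - pnorm(-2.5)
```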
Additional Exercise 06-1 (based on Exercise 1.82 of 4th Edition)

The length of human pregnancies from conception to birth varies ac-


cording to a distribution that is approximately normal with mean 266
days and standard deviation 16 days. Use the 68-95-99.7 rule to answer
the following questions.
(a) Between what values do the lengths of the middle 95% of all preg-
nancies fall?
(b) How short are the shortest 2.5% of all pregnancies? How long do
the longest 2.5% last?
(c) Almost all (99.7%) of human pregnancies fall in what range of
lengths?

6.49
6.50
Additional Exercise 06-2 (based on Exercise 1.90 of 4th Edition)

Find the value z of a standard normal variable Z that satisfies each


of the following conditions. (If using tables, report the value of z
that comes closest to satisfying the condition.) In each case, sketch a
standard normal curve with your value of z marked on the axis.
(a) The point z with 25% of the observations falling below it.
(b) The point z with 40% of the observations falling above it.

6.51
6.52
Sampling distribution
of counts and proportions
Based on Moore et al.: Introduction and Section 5.1

6.53
Lectures 3-4: Sampling distribution of counts and proportions
We will study the sampling distribution of binomial counts and sample proportions, important in understanding gender imbalance, political polls and more . . .
The sampling distribution of a statistic

Mean and variance of a binomial variable

Sample proportions

Mean and variance of sample proportions

The normal approximation to binomial counts and proportions

How large does n need to be?

6.54
The sampling distribution of a
statistic
A 2015 poll by the Lowy Institute of 1,200 randomly selected Aus-
tralians found that 63% of Australians think our government should
commit to significant emission reductions so that other countries will
be encouraged to do the same.

https://www.lowyinstitute.org/lowyinstitutepollinteractive/climate-change-and-energy/

But how accurate is this estimate?

6.55
When we have collected some data, we usually want to calculate a
statistic that summarises some key quantity of interest.
(Such as the 63% in the poll.)

Statistics obtained from random samples or randomised experiments


are random variables because their values vary from sample to sample.
(If you sampled a different 1,200 Australians, would you expect to get
exactly 63% again?)

6.56
The sampling distribution of a statistic is the probability distribution
of values taken by this random variable.

If we know the sampling distribution of a statistic, we can understand


how this statistic will behave, from one sample to the next. This is
important for understanding how reliable a statistic is.

6.57
This week we will study the sampling distribution of binomial counts
and sample proportions, such as the poll example.

This will help us answer questions such as:

How accurate is the 63% in the poll?

Does this mean that most Australians want the government to lead
the way internationally on climate change policy?

6.58
Mean and variance of a binomial
variable

Binomial Revision:
Recall the binomial distribution from last week, the most common type of discrete random variable encountered in practice.

The binomial distribution is used to model the counts in a table of


frequencies for a categorical variable that takes only two values.

If we have a sample of size n, and each observation has probability p of being a "success", then the number of successes X can take the values k = 0, 1, 2, . . . , n with probability distribution given by:

P(X = k) = (n choose k) p^k (1 - p)^(n-k).

6.59
Consider the following variables:
# Australians (in a random sample of 1,200) wanting significant emission reductions to help combat climate change.

# Females in MATH1041 (out of 400 enrolments).

# People who watched the Aust. Open Women's Final (out of 400 enrolments).

# Patients who survived surgery (out of 50 patients).

Which of the above can be modelled as a binomial random variable?

6.60
Revision exercise:
Assume 20% of students watched the Aust. Open Women's Final.

You have a coffee with four students (assume they are a random sample).

Let X be the number, out of the four, who watched the program.
1. What is the probability that none of the four students watched the program?

2. What is the probability that exactly one of the four watched it?

3. Write down, in a table, the probability distribution of X.

6.61
Mean and variance of a binomial variable

There is a general rule that can be used for calculating means and variances of binomial variables.

If X ~ B(n, p) then

μ_X = np
σ²_X = np(1 - p)
σ_X = √(np(1 - p))

6.62
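The rule can be checked numerically against the full binomial distribution; a sketch using n = 4 and p = 0.2, as in the Women's Final example:

```r
n <- 4; p <- 0.2
k <- 0:n
probs <- dbinom(k, size = n, prob = p)       # P(X = k) for each k

# Mean and variance computed directly from the distribution
mu_direct  <- sum(k * probs)
var_direct <- sum((k - mu_direct)^2 * probs)

# Mean and variance from the rule
mu_rule  <- n * p              # 0.8
var_rule <- n * p * (1 - p)    # 0.64
sd_rule  <- sqrt(var_rule)     # 0.8
```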
Assume 20% of students watched the Aust. Open Women's Final.

You have a coffee with four students (assume they are a random sample).

Let X be the number, out of the four, who watched the Aust. Open Women's Final.
1. What is the mean (expected value) of X?
2. What is the variance of X?
3. What is the standard deviation of X?
Answer:

6.63
Sample proportions
Recall that the binomial distribution has two parameters: n (the number of trials) and p (the probability of success).

We always know n, but usually we don't know p.

How could we estimate p, if we don't know its true value?

6.64
We can estimate p using our observations, by calculating the sample proportion p̂:

p̂ = X/n

For example, if we have 5 trials and 2 successes, our sample proportion is p̂ = 2/5 = 0.4.

6.65
p̂ is a statistic and we want to understand its sampling distribution (so that we understand how reliably it can estimate p in practice).

We can derive the sampling distribution of p̂ using the binomial distribution and the fact that p̂ = X/n.

6.66
Aust. Open Women's Final example (continued)

Recall the Aust. Open Women's Final example, where we sampled 4 people and wanted to know if they watched it.

Let p̂ be the proportion of the four who watched the show. So p̂ = X/4.

What are the possible values and probabilities for p̂?

6.67
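The possible values and probabilities for p̂ follow directly from the binomial distribution of X. A sketch for n = 4, p = 0.2:

```r
n <- 4; p <- 0.2
x <- 0:n
phat  <- x / n                            # possible values: 0, 0.25, 0.5, 0.75, 1
probs <- dbinom(x, size = n, prob = p)    # P(phat = x/n) = P(X = x)

# Tabulate the sampling distribution of phat
rbind(phat = phat, prob = round(probs, 4))
```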
Exercises

Recall the following examples, for which you have already derived the
distribution of X in Week 4.

A test consists of five True/False questions. A student decides to answer the questions randomly True or False, without even reading the question. Let X be the number of correct answers the student gets out of 5.

A test consists of six multiple choice questions, each of which has 5 possible answers listed (only one of which is correct). A student answers the questions randomly, without even reading the question. Let X be the number of correct answers the student gets out of 6.

6.68
For each case,
1. list the values p̂ = X/n can take and their probabilities.

2. Draw the probability histogram for p̂.

6.69
Mean & variance of sample proportions
We know that if X ~ B(n, p), then:

μ_X = np    σ²_X = np(1 - p)    σ_X = √(np(1 - p)).

Therefore, since the sample proportion is p̂ = X/n,

μ_p̂ = p    σ²_p̂ = p(1 - p)/n    σ_p̂ = √(p(1 - p)/n).

6.70
Example: Poll

A 2015 poll by the Lowy Institute of 1,200 randomly selected Aus-


tralians found that 63% of Australians think our government should
commit to significant emission reductions so that other countries will
be encouraged to do the same.

But how accurate is this estimate?

Assume that actually only 50% of Australians think Australia should


make a significant commitment.
(i.e. there is no overall majority of Australians that agree).

Now let X be the number of people, out of a sample of 1,200, who


would pay more on their electricity bill.

6.71
1. What is the probability distribution of X?

2. Determine the mean, variance and standard deviation of X.

3. Determine the mean, variance and standard deviation of p̂ = X/n.

6.72
The Normal approximation for
binomial counts
Let's say, for the poll example, we want to find out how likely it is that as many as 63% of Australians in a sample of 1,200 say they want a significant emission reduction commitment, if the true proportion p were actually equal to 0.5. How could we calculate this probability?

One way to do this is to use a normal approximation to the bino-


mial. This approximation works very well for polls and many other
applications.

6.73
Key result
If n is large enough, the distribution of

X ~ B(n, p)

can be well approximated by the normal distribution with the same mean and variance:

Y ~ N(np, √(np(1 - p))).

Similarly, p̂ can be well approximated by

p̂ ~ N(p, √(p(1 - p)/n)).
6.74
Example: Poll

6.75
We can use this result to calculate probabilities for binomial problems
where n is large.

In the poll example, let's assume that the probability that a respondent says they would pay more is actually 0.5
(i.e. there is no overall majority of Australians who would pay more).

What's the chance 63% or more want a significant emission reduction commitment, in our sample of 1,200?

6.76
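A sketch of this calculation in R, comparing the normal approximation with the exact binomial probability:

```r
n <- 1200; p <- 0.5
mu    <- n * p                    # 600
sigma <- sqrt(n * p * (1 - p))    # about 17.3

# Normal approximation: P(phat >= 0.63) = P(X >= 756)
p_normal <- pnorm(756, mean = mu, sd = sigma, lower.tail = FALSE)

# Exact binomial: P(X >= 756) = 1 - P(X <= 755)
p_exact <- pbinom(755, size = n, prob = p, lower.tail = FALSE)

# Both probabilities are vanishingly small: observing 63% in a sample
# of 1,200 would be astonishing if the true proportion were 0.5
```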
How large does n need to be?
We can use a normal approximation to the binomial if n is large
enough . . . but when is n large enough?

One rule of thumb is that:

if np ≥ 10 and n(1 - p) ≥ 10

then n is large enough to use the normal approximation.

6.77
Example

Consider the following distributions:

1. X ~ B(5, 0.25)

2. X ~ B(15, 0.25)

3. X ~ B(40, 0.25)

In which cases do you think the normality assumption would be rea-


sonable?

6.78
[Figure: normal approximation Y ~ N(3.75, 1.677) to the binomial X ~ B(15, 0.25); the three panels show the approximate P(Y ≥ 4), the continuity-corrected P(Y ≥ 3.5), and the exact P(X ≥ 4)]
6.79
Example: leaky tanks

Environmentalists claim that one in four underground petrol storage


tanks in service stations is leaky. To investigate the claim, researchers
propose to take a random sample of 50 tanks, empty them and pressure
test them for leaks.
Let X be the number of leaky tanks out of the 50.

Assuming that the environmentalists are correct, find:


1. The probability distribution of X.
2. The mean, variance and standard deviation of X.
3. The probability that fewer than ten of the 50 tanks are leaky.

6.80
6.81
6.82
6.83
6.84
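A sketch of the leaky-tanks calculation in R (the answer slides are left blank for lectures):

```r
n <- 50; p <- 0.25

# 1. X ~ B(50, 0.25)

# 2. Mean, variance and standard deviation of X
mu <- n * p                    # 12.5
v  <- n * p * (1 - p)          # 9.375
s  <- sqrt(v)                  # about 3.06

# 3. P(fewer than ten leaky tanks) = P(X <= 9), exactly
p_exact <- pbinom(9, size = n, prob = p)

# Normal approximation (valid here since np and n(1-p) are both >= 10),
# using a continuity correction of 0.5
p_approx <- pnorm(9.5, mean = mu, sd = s)
```

The approximate and exact probabilities agree closely, as the rule of thumb predicts.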
The Central Limit Theorem
and its applications
Based on Moore et al.: Sections 5.2, 6.1

7.1
In this week's lectures we will learn more about the sampling distribution of X̄, the average. It is important to understand averages because they are used a lot. . .

Average has over one billion hits on Google!

Examples:
In TV ratings: The Masterchef season average tumbled to 1.127
million from 1.873 million viewers

In science: The average male release [of pleasure hormones from


illegal drugs] was 50% to 200% higher than the average female
release.

Another example is average yield of a new crop. . .

7.2
Example: corn yields

In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5

We want to estimate the average yield (μ) for this new variety, to see if average yield is higher for this corn variety.

We estimate average yield as x̄ = 124.14 bushels per acre.

How accurate is this estimate of average yield?

7.3
Lectures 1-2: The Central Limit Theorem
In this lecture we study special aspects of the sample mean X̄, leading to the Central Limit Theorem, the most important theorem in statistics!

Revision: Mean and variance of linear combinations

The mean and standard error of X̄

The sampling distribution of X̄ for a normal random sample

The Central Limit Theorem

Some examples

7.4
Revision: Mean and variance of
linear combinations
If X and Y are independent random variables, and a and b are con-
stants, what can we say about the linear combination aX + bY ?

In particular:
the mean;

the variance;

the standard deviation; and

the distribution?

7.5
Example

Suppose X1, X2 and X3 are independent random variables, each of which has mean 1 and standard deviation 2.

What is the mean and standard deviation of:

1. X1 + (1/3)X2

2. the total: T = X1 + X2 + X3

3. the sample mean: X̄ = (1/3)(X1 + X2 + X3)

7.6

The mean and standard error of X̄
A special and very important case of linear combinations of random variables is the sample mean X̄ for a random sample.

Suppose we have a SRS of size n from a population X with a mean μ and a variance σ².

What can we say about the sample average X̄?
7.7
Consider the sample x1, x2, . . . , xn to be values of independent random variables X1, X2, . . . , Xn which have the same distribution as X.

X̄ = (X1 + X2 + · · · + Xn)/n = (1/n)X1 + (1/n)X2 + · · · + (1/n)Xn.

What value will X̄ take, on average?

How will the variance and standard deviation for X̄ compare to those for X?

7.8
In general, for a random sample of size n:

μ_X̄ = μ,    σ²_X̄ = σ²/n,    σ_X̄ = σ/√n

We will refer to σ_X̄ as the standard error of X̄.

(Note that this means that as n gets larger and larger, σ_X̄ gets closer and closer to zero. Does this make sense?)

7.9
Example: corn yields

In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5

We estimate average yield as x̄ = 124.14 bushels per acre.

We can assume that the standard deviation of corn yield is σ = 10.

What is the standard error of X̄?

7.10
The sampling distribution of X̄ for a normal random sample
Let X and Y be two independent normally distributed random variables. Then the linear combination aX + bY has a normal distribution, with mean and variance

μ_(aX+bY) = aμ_X + bμ_Y    σ²_(aX+bY) = a²σ²_X + b²σ²_Y.

This can be extended to a linear combination of n independent normal random variables.
That is, a linear combination of n independent normal random variables is normally distributed.

7.11
Example

Suppose X1, X2 and X3 are independent random variables, each of which is distributed N(1, 2).

What is the distribution of

(i) X1 + (1/3)X2

(ii) the total: T = X1 + X2 + X3

(iii) the mean: X̄ = (1/3)(X1 + X2 + X3)

7.12
Recall the sample mean X̄ is a linear combination of random variables. Hence the following important result:

If we have an SRS of size n from a normally distributed population, with mean μ and a variance σ², then the sampling distribution of the mean is:

X̄ ~ N(μ, σ/√n)

7.13
Example: test scores

The scores on an exam have a normal distribution with mean μ = 20.8 and σ = 4.8.
1. A student is chosen at random. What is the probability the student
scores 23 or more?

2. Take an SRS of 25 students who took the test. What is the mean
and variance of the sample mean?

3. What is the probability that the mean score is 23 or more?

Answer:

7.14
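A sketch of the test-scores calculation in R (the slides leave the answer blank for lectures):

```r
mu <- 20.8; sigma <- 4.8; n <- 25

# 1. One random student: P(X >= 23)
p_one <- pnorm(23, mean = mu, sd = sigma, lower.tail = FALSE)   # about 0.32

# 2. Sample mean of 25 students: mean mu, standard error sigma/sqrt(n)
se <- sigma / sqrt(n)          # 0.96, so variance 0.96^2

# 3. P(sample mean >= 23): much smaller, because averages vary less
p_mean <- pnorm(23, mean = mu, sd = se, lower.tail = FALSE)     # about 0.011
```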
The Central Limit Theorem
In words:

When we take a SRS from a population and measure the sample average of a variable, then the distribution of the average is approximately normal.

Irrespective of the distribution of the variable we measured!

7.15
In notation:

Consider an SRS in which we measure the variable X that has population mean μ and population standard deviation σ.

Then for sufficiently large n, it is approximately true that

X̄ ≈ N(μ, σ/√n)

Irrespective of the sampling distribution of X!

Try out the applet:


http://onlinestatbook.com/stat_sim/sampling_dist/index.html
This will give you more of a sense for how the Central Limit Theorem
works, and how well it works in different situations.
7.16
Some important things to note about the Central Limit Theorem:
As the sample size n gets bigger, the approximation gets more accurate.

The sample size n which is required for this to be sufficiently accurate depends on the distribution of X, the variable we measured.

If the population is very non-normal, for example very skewed, then n needs to be larger.

The result does apply even when the population is discrete (e.g. the normal approximation to the binomial).

7.17
Why is the Central Limit Theorem (CLT) so
important?

If we know the sampling distribution of a statistic, we can understand its behaviour (for example, how accurate it is for a small sample).

The Central Limit Theorem tells us the form of the approximate sampling distribution of an average X̄. If we know the population mean μ and standard deviation σ, we know the approximate sampling distribution of X̄.

7.18
Some examples
The following are some examples to help see where the Central Limit Theorem is useful. These examples are just a starting point; we're going to use the CLT plenty more times in the rest of this course!

7.19
Example: corn yields

In crop research into a new variety of corn, the yields in bushels per acre for 15 plots are found.

Previous studies for a similar variety suggest that the mean corn yield should be 110 bushels/acre, and the standard deviation should be σ = 10 bushels/acre.

What is the sampling distribution of X̄, the average yield of 15 plots?

The 15 plots had an average yield of x̄ = 124.14 bushels/acre. What is the chance of getting an average yield this high or higher, if the mean is actually 110?

7.20
Answer:

7.21
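A sketch of the corn-yield calculation in R (the slides leave the answer blank for lectures):

```r
mu <- 110; sigma <- 10; n <- 15

# By the CLT, Xbar ~ N(110, sigma/sqrt(n)) approximately
se <- sigma / sqrt(n)          # standard error, about 2.58

# Chance of an average yield of 124.14 or higher if the mean is really 110
p <- pnorm(124.14, mean = mu, sd = se, lower.tail = FALSE)

# p is tiny (the observed mean is more than 5 standard errors above 110),
# which is strong evidence that the new variety yields more on average
```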
Example: pH

We would like to estimate the pH of a dam to within 0.2 of its true value (μ). But repeated measurements using our instrument have a standard deviation of 1! We can improve on this standard deviation by averaging repeated measurements of pH (X̄).

How many independent pH measurements (n) should we take so that the probability that the sample average is within 0.2 of the true pH is at least 0.95?

7.22
Answer:

7.23
7.24
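One way to solve the pH question (a sketch; the slides leave the answer blank for lectures). Since 95% of a normal distribution lies within 1.96 standard deviations of its mean, we need 1.96·σ/√n ≤ 0.2:

```r
sigma  <- 1            # sd of a single pH measurement
target <- 0.2          # want the sample average within 0.2 of mu
z <- qnorm(0.975)      # 1.96: cuts off the middle 95% of a normal curve

# Need z * sigma / sqrt(n) <= target, i.e. n >= (z * sigma / target)^2
n <- ceiling((z * sigma / target)^2)
n                      # 97 measurements
```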
Lectures 3-4: The Central Limit Theorem and Confidence intervals
Based on Moore et al.: Sections 5.2-6.1

In this lecture we will use simulations and examples from the Week 1 survey to demonstrate the Central Limit Theorem, the single most important theorem in statistics.

Then you will meet your first method of inference: the confidence interval for μ.

7.25
Lectures 3-4: The Central Limit Theorem and Confidence intervals
Revision: The Central Limit Theorem

Examples, real and simulated

How large is a large enough n?

Confidence intervals

A confidence interval for μ, when σ is known

Interpreting confidence intervals

Example: speeding?

7.26
Revision: The Central Limit Theorem

Consider a SRS in which we measure the variable X that has population mean μ and population standard deviation σ. Then it is approximately true that:

X̄ ≈ N(μ, σ/√n)

Irrespective of the sampling distribution of X!

(The average of a random sample from any variable is approximately normal, if the sample size is large enough.)

7.27
Examples, real and simulated

Averaging Uniform Random Variables

Now we consider the random variable X which is the uniform random variable on (0, 1), i.e. a random number between 0 and 1. (Do you remember the density function?)

7.28
Example 1: distribution of X

If we take a random sample of 100 values from X, what sort of distribution do we expect to see in the sample?

We will simulate some samples to see what sort of samples appear.

7.29
The Central Limit Theorem is about the sampling distribution of the mean.

Example 2: distribution of X̄ for n = 2
If we take 100 samples, each of size n = 2, and take the mean of each sample, what sort of distribution do we expect to see?

We will simulate some samples, find their means and look at the distribution of the sample values. (What do we expect?)

7.30
Example 3: distribution of X̄ for n = 4
If we take 100 samples, each of size n = 4, and take the means of each sample, what sort of distribution do we expect to see?

We will simulate some samples, find their means and look at the distribution of the sample values. (What do we expect?)

7.31
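The simulations for Examples 2 and 3 can be sketched in a few lines of base R (one possible version; the lecture runs its own):

```r
# Simulating the sampling distribution of the mean of uniform(0,1) values
set.seed(2025)
means_n2 <- replicate(100, mean(runif(2)))   # Example 2: 100 means, n = 2
means_n4 <- replicate(100, mean(runif(4)))   # Example 3: 100 means, n = 4

# Compare the histograms: the n = 4 means cluster more tightly
# around 0.5 and already look more bell-shaped than the n = 2 means
par(mfrow = c(1, 2))
hist(means_n2, main = "n = 2", xlab = "sample mean")
hist(means_n4, main = "n = 4", xlab = "sample mean")
```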
Another example: your birthday!
Recall that in the Week 1 survey you were asked what day of the month
you were born on.

How is this variable distributed?

Answer:

7.32
You were also asked what day of the month your mother and your father were born on. Consider the average of your parents' birthdays.

How is average birthday distributed (n = 2)?

Answer:

7.33
You were also asked what day of the month your youngest brother
(/sister/cousin) was born on. Consider the average of all 4 birthdays
for your family.

How is average birthday distributed (n = 4)?

Answer:

7.34
Another example: travel time

Consider the travel time data from the survey.

How good is the normal approximation for travel time?

What is the distribution of average travel time of five students? We


can answer this question by repeated sampling groups of 5 students,
and studying the distribution of the average of each sample of 5.

How good is the normal approximation for average travel time (n = 5)?

What about the distribution of average travel time of 25 students?

How good is the normal approximation for average travel time (n =


25)?

7.35
Summary:

7.36
How large is a large enough n?
The Central Limit Theorem says that the average of a random sample from any variable is approximately normal, if the sample size is large enough.

But how approximate? And how large is large enough?

7.37
From simulations using different distributions you may notice that:
As the sample size n gets bigger, the approximation gets more
accurate.

The sample size n which is required for this to be sufficiently accu-


rate depends on the distribution of X, the variable we measured.

If the population is very non-normal, in particular when it is very skewed, then n needs to be larger.

7.38
The Central Limit Theorem says that X̄ is approximately normal, if sample size is large enough. Here are some rough rules of thumb about whether sample size is large enough for X̄ to be well approximated by a normal distribution.

How large a sample size you need depends on how close your data (X)
are to being normal:
If the normal approximation for your data is good, you don't need many observations at all!

If the normal approximation is fair (i.e. not noticeably skewed) then


15 observations should be enough.

If the normal approximation is poor, you need 40 observations (or


sometimes more).

7.39
Confidence intervals
Now we will talk about an important application of the Central Limit
Theorem: constructing a confidence interval for μ when we don't
know anything about a variable except its standard deviation and
the sample we have collected.

The confidence interval is a very powerful tool: it allows us to make
general statements (inferences) about the average of a whole
population (μ) based on just a sample!

7.40
Example corn yield

In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5

We want to estimate the average yield (μ) for this new variety, to see
if average yield is higher for this corn variety.

Assume (from previous studies etc.) that the population standard
deviation is known to be σ = 10.

Can we find a range of values that we are pretty sure contains μ?

7.41
Note that for this sample x̄ = 124.14.

Let X be the yield (in bushels/acre) of a random plot of corn.

We know that X has mean μ and standard deviation 10.

We need to further assume that:

The normal assumption is fair for X (because n = 15).
(How could we check this? Look at a normal quantile plot!)

The sample is random.

7.42
Then, because of the central limit theorem, the sampling distribution
of the mean is normal with:

μ_X̄ = μ and σ_X̄ = 10/√15,

that is, X̄ ~ N(μ, 10/√15).

7.43
We can use what we know about normal distributions to find a range
of values, a confidence interval, that we are pretty sure contains μ.

This is our reasoning:

For any normal distribution, 95% of values fall within about 2
standard deviations of the mean (more exactly, within 1.96 standard
deviations).

The standard deviation of X̄ is 10/√15 and the mean of X̄ is μ.

95% of the time, X̄ is within 1.96 × 10/√15 of μ.

We can be 95% confident that the true mean μ is somewhere in
the range:

124.14 ± 1.96 × 10/√15 = (119.1, 129.2).

7.44
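The interval on this slide can be checked numerically. The course uses RStudio; here is the same arithmetic sketched in Python:

```python
from math import sqrt
from statistics import mean

# Corn yields (bushels/acre) for the 15 plots, from slide 7.41
yields = [138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8,
          109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5]

sigma = 10                      # population sd, assumed known
n = len(yields)
xbar = mean(yields)             # sample mean, 124.14

half_width = 1.96 * sigma / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(f"xbar = {xbar:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

This reproduces the interval (119.1, 129.2) quoted on the slide.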
A confidence interval for μ, when σ is known
Assuming that:
we have a set of n independent observations from any variable
with a known standard deviation σ; and

X̄ is normally distributed,
then a 95% confidence interval for μ from a SRS of size n is:

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n).

7.45
More generally:

If:
we have a set of independent observations from any variable with
a known standard deviation σ; and

we can assume X̄ is normally distributed,
then a level C confidence interval for μ from a SRS of size n is obtained
from

(x̄ − z* σ/√n, x̄ + z* σ/√n),

where z* is the number such that if Z ~ N(0, 1), then P(−z* < Z < z*) = C.

So we can say something (and often something quite specific) about
the average of the whole population (μ), even though we have just a
sample from the population!
7.46
How to pick z*

Where did the 1.96 come from for a 95% confidence interval?

The goal is to find the value z* such that the middle 95% of the area
under a Standard Normal curve will be between −z* and z*.

That leaves 100 − 95 = 5% of the area in the tails (2.5% in each,
due to the symmetry of the Normal curve).

7.47
[Figure: standard normal density for 95% confidence. The middle area
of 0.95 lies between −1.96 and 1.96, with area 0.025 in each tail.]

Can you find the value of z* for a 97% confidence interval?


7.48
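In RStudio, z* comes from qnorm(); the same inverse-CDF lookup can be sketched with Python's standard library:

```python
from statistics import NormalDist

def z_star(C):
    """Critical value z* with P(-z* < Z < z*) = C for Z ~ N(0, 1)."""
    # The middle area C leaves (1 - C)/2 in each tail,
    # so z* is the (1 + C)/2 quantile of N(0, 1).
    return NormalDist().inv_cdf((1 + C) / 2)

for C in (0.90, 0.95, 0.97, 0.99):
    print(f"{C:.0%} confidence: z* = {z_star(C):.3f}")
```

For the 97% question on this slide, z* is about 2.17.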
Interpreting confidence intervals
When we constructed a 95% confidence interval for average corn yield,
the answer was (119.1 , 129.2).

How do we interpret this confidence interval?

7.49
It is important to remember that average corn yield (μ) is a fixed
parameter that doesn't vary. μ is either in the 95% confidence interval
or it isn't, and if we repeated the estimation process lots of times,
95% of our confidence intervals would contain μ, while 5% wouldn't
contain μ.

A good way to summarise the confidence interval is to say: "We
are 95% confident that average yield is between 119.1 and 129.2
bushels/acre."

Strictly speaking, it is not true that there is 95% probability that μ is
between 119.1 and 129.2. Hence we refer to 95% confidence rather
than 95% probability.

7.50
Example speeding?
The speed limit in residential areas is 50 km/hr. A concerned resident
on Barker St measured speeds of traffic during a 10-minute period and
obtained the following results:

46 54 54 45 48 46 61 50 46 54
51 50 43 59 46 41 38 54 58 50

From other studies, the standard deviation of residential speed is known
to be about σ = 6 km/hr.

How fast is the average car going? Is there evidence that on average,
cars exceed the speed limit?

7.51
Answer:

We have x̄ = 49.7 and n = 20.

So putting it all together we get:

49.7 ± 1.96 × 6/√20 = 49.7 ± 2.63 = (47.1, 52.3).

7.52
Confidence intervals
and significance testing
Based on Moore et al.: Sections 6.1–6.2

8.1
Lectures 1–2: Understanding
confidence intervals
The margin of error

How to decrease the margin of error

What sample size should I use?

Checking assumptions

Extras

8.2
The margin of error
Recall that if we have a random sample from any variable with a
known standard deviation σ, then a level C confidence interval for μ
from a SRS of size n is obtained from:

(x̄ − z* σ/√n, x̄ + z* σ/√n)

where z* is the number such that if Z ~ N(0, 1), then we have the
following probability

P(−z* < Z < z*) = C.

8.3
For example, because P(−1.96 < Z < 1.96) = 0.95, a
95% confidence interval for μ is:

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n).

Similarly, because P(−1.645 < Z < 1.645) = 0.90, a
90% confidence interval for μ is:

(x̄ − 1.645 σ/√n, x̄ + 1.645 σ/√n).

One more: because P(−2.576 < Z < 2.576) = 0.99, a
99% confidence interval for μ is:

(x̄ − 2.576 σ/√n, x̄ + 2.576 σ/√n).

8.4
We can rewrite this confidence interval as:

(x̄ − m, x̄ + m)

where m is the margin of error and

m = z* σ/√n.

In the case of a 95% confidence interval,

m = 1.96 σ/√n.

In the case of a 90% confidence interval,

m = 1.645 σ/√n.

8.5
The margin of error controls the width (or length) of the confidence
interval.

The smaller the margin of error, the more precisely we can estimate μ.

Ideally, we'd like the margin of error to be as small as possible!

8.6
How to decrease the margin of
error?
Example body temperature

A university measured the body temperatures of 106 healthy adults to
determine if the long-held belief that body temperatures average 37
degrees Celsius is accurate.

We want to estimate the average body temperature (μ) for healthy
adults. Assume the population standard deviation is known, σ = 0.4.
We have that x̄ = 36.78 degrees.

Find 90%, 95% and 99% confidence intervals for μ.

8.7
Answer:

Notice that the higher the desired confidence level, the larger the mar-
gin of error.
8.8
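The lecture works this answer in RStudio; an equivalent Python sketch of the three intervals:

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 36.78, 0.4, 106
se = sigma / sqrt(n)            # standard error of the mean

cis = {}
for C in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf((1 + C) / 2)   # z* for level C
    m = z * se                              # margin of error
    cis[C] = (xbar - m, xbar + m)
    print(f"{C:.0%} CI: ({cis[C][0]:.3f}, {cis[C][1]:.3f}), margin = {m:.3f}")
```

The intervals are roughly (36.716, 36.844), (36.704, 36.856) and (36.680, 36.880): the higher the confidence level, the wider the interval.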
Example body temperature

Now consider what would happen if they had sampled n = 200 healthy
adults instead of n = 106, and the average body temperature of these
200 individuals was 36.78 degrees.

As previously, assume the standard deviation is known to be σ = 0.4,
and we are interested in the true average body temperature of healthy
adults, μ.

Find a 95% confidence interval for the mean body temperature and
compare it to the confidence interval when n = 106.

8.9
Answer:

Notice that the larger the sample size, the smaller the margin of error.
8.10
Example body temperature

Consider the original body temperature example again, so n = 106 and
x̄ = 36.78, but now let's think about what would happen if the
population standard deviation were known to be σ = 0.2 rather than
σ = 0.4.

Find a 95% confidence interval for the mean body temperature for
σ = 0.2 and compare it to the confidence interval when σ = 0.4.

8.11
Answer:

Notice that the smaller the standard deviation σ, the smaller the margin
of error.
8.12
There are three ways to reduce the margin of error

m = z* σ/√n.

Reduce your confidence level (reduce the value of z*). The
cost of doing this is that you are less confident that your interval
captures the true mean μ.

Increase your sample size (increase n, hence reduce 1/√n). This is
usually your best option.

Measure your variables more precisely (decrease σ). Sometimes
(but not always) it is possible through careful sampling protocols
to reduce measurement error and hence reduce σ.

8.13
What sample size should I use?
The margin of error is often used in practice to determine the sample
size to use in a particular study.

By solving for n in our expression for the margin of error, we can decide
on an appropriate sample size, if we are given:
The margin of error that is desired.

How much variability to expect.

8.14
Example body temperature

A hospital would like to estimate average body temperature of healthy
adults to within 0.1 degrees of its true value (with 95% confidence).

As previously, assume σ = 0.4 degrees.

How many healthy adults would need to be used in a study, for the
margin of error on our estimate of μ to be no more than 0.1 degrees?

8.15
Answer:

We have 0.1 = 1.96 × 0.4/√n.

Solving for n, √n = 7.84, so n = 61.5: sample at least 62 healthy adults.

8.16
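For the question on slide 8.15 (margin of error at most 0.1 degrees, σ = 0.4, 95% confidence), solving m = z σ/√n for n gives n = (z σ/m)². A quick check in Python (the course would use RStudio):

```python
from math import ceil

sigma = 0.4    # known population sd (degrees)
m_max = 0.1    # largest acceptable margin of error (degrees)
z = 1.96       # z* for 95% confidence

# m = z * sigma / sqrt(n)  =>  n = (z * sigma / m)**2
n = (z * sigma / m_max) ** 2
print(f"n = {n:.2f}, so sample at least {ceil(n)} healthy adults")
```

Since n must be a whole number and the margin must not exceed 0.1, we round up.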
Checking assumptions

(x̄ − z* σ/√n, x̄ + z* σ/√n)

There are two key assumptions when constructing a confidence interval
for μ:
we have a set of n independent observations from any variable
with a known standard deviation σ (where possible, we take a SRS
to guarantee independence is satisfied); and

X̄ is normally distributed (the Central Limit Theorem helps us with
this; see slide 7.39 for details).

8.17
Example body temperature

A university measured the body temperatures of 106 healthy adults.

We want to estimate the average body temperature (μ) for healthy
adults, to see if it differs from the expected 37 degrees.

Assume the population mean μ is unknown and that the population
standard deviation is known, σ = 0.4.

What assumptions are required to construct a confidence interval for
μ? If possible, check these assumptions.

8.18
We assume that:
The body temperatures of the adults are: independent.
(Which is OK if the sample is random.)

X̄ is: approximately normal.
(A normal quantile plot suggests the data have a fair-to-good
approximation to normality. By the Central Limit Theorem, the sample
mean of n = 106 such measurements will have a very good
approximation to normality.)

8.19
[Figure: Normal Quantile Plot of Body Temperatures. Temperature
(roughly 36.0 to 37.5 degrees) is plotted against z-score (−2 to 2);
the points lie close to a straight line.]

8.20
Confidence intervals: TRUE or FALSE?

If you have a 95% confidence interval (100.5, 110.2), there is
probability 95% that the true population value is between 100.5 and
110.2.

It is better to have a narrow confidence interval than a wide one,
as it gives us more certain information (if the same confidence level
is used each time).

If your study involves twenty 95% confidence intervals, then you
expect about one of them to be wrong.

8.21
8.22
Lectures 3–4: Tests of significance
Based on Moore et al.: Section 6.2

In this lecture you will meet the most widely used method of
inference about a population from a sample: significance testing, or
hypothesis testing.
Some examples

The reasoning of hypothesis tests

Procedure for hypothesis testing

The Z-test for the population mean

P -values and significance

One-sided vs two-sided testing

8.23
Often, you will come across P -values in your studies:
Paired t-tests indicated that there was no change in attitude (P >
0.05).

The results showed that expression of EphA2 was obviously
increased in gastric cancer tissues (P < 0.01).

There was no significant difference (p > 0.05) between the first
and second responses.
A P -value is a key statistic for making a decision in a hypothesis
test. Today we will learn how to conduct hypothesis tests and how to
interpret their P -values.

8.24
Some examples
1. Body temperature Again

A university measured the body temperatures of 106 healthy adults
and observed that the average body temperature was 36.78 degrees.

Assume that the population standard deviation is known, σ = 0.4.

It is widely believed that the average body temperature for healthy
adults is 37 degrees. Is there evidence that average body temperature
is actually less than 37 degrees?

8.25
2. Corn again

In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5

Assume that the population standard deviation is known, σ = 10.

Average yield for the old corn variety was 110 bushels per acre. Is
there evidence that this new variety has higher yield?

8.26
3. Calcium, pregnancy and Central Americans

The level of calcium (in mg/dl) in healthy young adults varies with
mean μ = 9.5 and standard deviation σ = 0.4.

A clinic in rural Guatemala measures blood calcium for 180 healthy
pregnant women at their first visit for prenatal care. The mean is
x̄ = 9.58.

Is this an indication that the mean calcium level is different from 9.5
in these women?

8.27
The reasoning of hypothesis tests
In hypothesis testing, we have a particular claim about a population
parameter (a null hypothesis) that we want to test, and we have a
sample of data we can use to test the claim.

Null hypothesis in each of the previous examples:

1. Mean body temperature is 37 degrees (or μ = 37).

2. Mean yield for old corn variety was 110 bushels per acre (or
μ = 110).

3. Mean calcium level is 9.5 mg/dl (or μ = 9.5).

8.28
We use the sample data to test the null hypothesis by asking the
question: How much evidence do these data provide against the null
hypothesis?

To measure how much evidence the data provide against the null
hypothesis, we use probability: we calculate a P-value, the probability
of observing a test statistic as or more extreme than the observed
outcome, if the null hypothesis were true.

We can calculate a P -value whenever we have a sample estimator of


the parameter of interest and we know its density function under the
null hypothesis.

8.29
Example 1: Body temperature

If our null hypothesis is that μ = 37, we want to know: Could μ be
37? Or do we have convincing evidence that μ < 37?

The sample mean of the 106 healthy adults is 36.78 degrees.

P(X̄ ≤ 36.78) measures how unusual this sample mean is (if the true
mean were μ = 37).

We can calculate this probability if we know the distribution of X̄,
assuming that μ = 37.

8.30
P(X̄ ≤ 36.78)

If this probability is relatively large, we could expect to get a value like
x̄ = 36.78 quite often, and therefore the data do not provide much
evidence against the claim that the true mean is 37.

However, if this probability is small, a value of x̄ = 36.78 is quite
unusual, so this data provides strong evidence against the original claim
that the true mean is 37.

In other words, the smaller this P-value is, the more evidence we have
that the original claim (μ = 37) is not true!

8.31
Procedure for hypothesis testing
There are several steps to complete when doing any hypothesis test:
State the null and alternative hypotheses.

Calculate the test statistic and its null distribution.

Calculate the P -value.

Conclusion.

8.32
The null and alternative hypotheses

The statement being tested is the null hypothesis H0.

In the body temperature example:


H0 : μ = 37

To assess the strength of evidence against the null hypothesis we need
to specify the alternative hypothesis Ha.

In the body temperature example:

the alternative hypothesis is that the mean body temperature is less
than 37 degrees:
Ha : μ < 37
8.33
Calculate the test statistic and its null distri-
bution

A test statistic is a standardised measure of the difference between
the sample estimate and the value of the population parameter in the
null hypothesis.

In the body temperature example:

z = (x̄ − μ0)/(σ/√n) = (36.78 − 37)/(0.4/√106) = −5.66

So the test statistic is equal to −5.66.

But why did we use Z = (X̄ − μ0)/(σ/√n)?

8.34
In order to make inferences using this test statistic, we need to know
its null distribution the sampling distribution of the test statistic
when H0 is true.

In the body temperature example:

If we assume that X1, X2, . . . , Xn ~ N(μ, σ), then

Z = (X̄ − μ0)/(σ/√n) ~ N(0, 1)

So we can now use the standard normal distribution to find
probabilities.

8.35
Calculate the P -value

The P-value is our measure of how much evidence there is against
H0 in the direction of the alternative hypothesis.

It is defined as the probability that the test statistic would take a value
as extreme or more extreme than the value we actually observed, if H0
is true.

The smaller the P -value, the more extreme the test statistic is and so
the stronger the evidence against H0.

8.36
In the body temperature example:

If H0 : μ = 37 were actually true, we would not expect
z = (x̄ − 37)/(σ/√n) to take a very negative value very often.

P-value = P((X̄ − 37)/(σ/√n) < (36.78 − 37)/(0.4/√106))
        = P(Z < −0.22/0.03885)
        = P(Z < −5.66)   (for this bit we can use RStudio)
        = 7.5 × 10⁻⁹.

8.37
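The whole calculation on this slide, sketched in Python (in RStudio the last step is pnorm(-5.66)):

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 36.78, 37, 0.4, 106

z = (xbar - mu0) / (sigma / sqrt(n))   # test statistic
p_value = NormalDist().cdf(z)          # one-sided: Ha is mu < 37
print(f"z = {z:.2f}, P-value = {p_value:.2g}")
```

Because Ha is one-sided (μ < 37), only the lower tail counts towards the P-value.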
Conclusion

Now we reach a conclusion about how much evidence there is against
H0.

If the P-value is small, the test statistic is extreme and so there is
evidence against H0.

If our P-value is large, the test statistic is not extreme and so there is
no evidence against H0.

8.38
In the body temperature example:

The P-value is 7.5 × 10⁻⁹.

This is extremely small, suggesting that getting an observed value of
36.78 (or smaller), if μ = 37 is true, is highly unlikely.

That is, there is very strong evidence against the null hypothesis being
true (H0 : μ = 37) and we reject it in favour of the alternative
hypothesis Ha : μ < 37.

8.39
The Z-test for a population mean
Often we want to test a hypothesis about the population mean μ of
some variable X.

Assuming that:
we have a set of n independent observations from any variable
with a known standard deviation σ; and

X̄ is approximately normal,
then we can use as a test statistic:

z = (x̄ − μ0)/(σ/√n).

If H0 is true, z comes from a standard normal distribution, Z ~ N(0, 1).

8.40
2. Corn again

In crop research into a new variety of corn, a sample of 15 plots gave
a mean yield of 124.14 bushels per acre.

Assume that the population standard deviation is known, σ = 10.

Average yield for the old corn variety was 110 bushels per acre. Is
there evidence that this new variety has higher yield?

8.41
Answer:

Recall from last week that a 95% confidence interval for μ was
(119.1, 129.2). Do the results of the hypothesis test coincide with
this interval?
8.42
P -values and significance
Recall that if the P -value is small then we conclude we have evidence
against H0 in favour of Ha.

If the P -value is not small then we conclude that our test statistic is
not extreme so we have no evidence against H0.

But how small is small?

8.43
It is common to choose a significance level α and to use this as a
guide for drawing conclusions from P-values.

If the P-value ≤ α, then the data are statistically significant at level α.

The most common value of α used in practice is 0.05, hence results
are often considered statistically significant if the P-value is smaller
than 0.05.

But in principle there is no reason why you couldn't use a different
significance level. . .

8.44
Below is an alternative way of interpreting P-values. Note that
P-values measure strength of evidence: the smaller P is, the more
evidence there is against H0.

8.45
One-sided vs two-sided testing
There are various possible types of alternative hypotheses, when
H0 : μ = μ0.

One-sided alternatives:
Ha : μ > μ0

Ha : μ < μ0

Two-sided alternative:
Ha : μ ≠ μ0

8.46
The alternative hypothesis Ha is important for determining the P -value,
because it tells us what sorts of departures from H0 we are interested
in measuring.

If Ha : μ > μ0 then the P-value = P(Z ≥ z).

If Ha : μ < μ0 then the P-value = P(Z ≤ z).

If Ha : μ ≠ μ0 then the P-value = 2P(Z ≥ |z|).

8.47
Calcium, pregnancy and Central Americans

The level of calcium (in mg/dl) in healthy young adults varies with
mean μ = 9.5 and standard deviation σ = 0.4.

A clinic in rural Guatemala measures blood calcium for 180 healthy
pregnant women at their first visit for prenatal care. The mean is
x̄ = 9.58.

Is this an indication that the mean calcium level is different from 9.5
in these women?

8.48
Answer:

8.49
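A sketch of this two-sided test in Python (the lecture's own answer is worked in RStudio):

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 9.58, 9.5, 0.4, 180

z = (xbar - mu0) / (sigma / sqrt(n))            # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided: Ha is mu != 9.5
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

z is about 2.68 and P about 0.007, so there is strong evidence that the mean calcium level in these women differs from 9.5.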
Fast food

A fast food outlet claims that the average caloric content of its meals
is 800, and the standard deviation is 25.
A consumer protection group tested 12 meals and found the average
number of calories was 873.

Is there enough evidence to reject the claim? Use α = 0.02.

8.50
Answer:

8.51
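A sketch of this test, plus the matching 95% confidence interval (which reappears in the summary table on slide 9.28), in Python:

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 873, 800, 25, 12
alpha = 0.02
se = sigma / sqrt(n)

z = (xbar - mu0) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided test
print(f"z = {z:.2f}, P-value = {p_value:.2g}")
print("reject H0" if p_value <= alpha else "do not reject H0")

# 95% confidence interval for mu, for comparison
ci = (xbar - 1.96 * se, xbar + 1.96 * se)
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

z is about 10.1, so far in the tail that the P-value underflows to 0 in double precision: reject H0 at α = 0.02. Equivalently, the 95% CI (858.9, 887.1) does not contain 800.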
8.52
Inference about a
population proportion
Based on Moore et al.: Section 8.1

9.1
Lectures 1–2: Inference about a
population proportion
Today we will meet a set of practical tools for making inference about
binary (yes/no) variables.
Introduction

One-sample test for p

Confidence interval for the true proportion p

Some emerging patterns

* Emphasis is on Section 8.1. Plus-four confidence intervals are excluded.

9.2
Data analysis for one or two variables

One variable:
  categorical: useful graph: bar chart; useful numbers: table of
    frequencies
  quantitative: useful graph: histogram; useful numbers: mean and sd,
    or 5-number summary; useful test: 1-sample Z-test for μ (if σ
    known); useful for inference: CI for μ

Two variables:
  both categorical: useful graph: clustered bar chart; useful numbers:
    2-way table of frequencies
  one categorical, one quantitative: useful graph: comparative
    boxplots; useful numbers: 5-number summary for each group
  both quantitative: useful graph: scatterplot; useful numbers:
    correlation or regression

(The 1-sample Z-test and CI for μ: last week.)

9.3
Data analysis for one or two variables

One variable:
  categorical: useful graph: bar chart; useful numbers: table of
    frequencies
  quantitative: useful graph: histogram; useful numbers: mean and sd,
    or 5-number summary; useful test: 1-sample Z-test for μ (if σ
    known); useful for inference: CI for μ

Two variables:
  both categorical: useful graph: clustered bar chart; useful numbers:
    2-way table of frequencies
  one categorical, one quantitative: useful graph: comparative
    boxplots; useful numbers: 5-number summary for each group
  both quantitative: useful graph: scatterplot; useful numbers:
    correlation or regression

(This lecture: inference for one categorical (binary) variable.)
9.4
Introduction

1. Birth months and spring

Many mammals have a higher likelihood of giving birth to offspring in
Spring than at other times of the year. Is this true of humans too?

Out of 400 MATH1041 students, 108 of them were born in the spring.

Does this provide evidence that people are more likely to be born in
Spring?

9.5
2. TV ratings and MATH1041
According to OzTAM, 6.88% of people in Sydney watched the
Australian Open Women's Tennis Final. In contrast, out of 400
MATH1041 respondents, 49 students watched this event on television.

Is there evidence that MATH1041 students have different viewing
patterns to the remainder of Sydney-siders?

9.6
3. Poll

In the poll results on 2 May 2016, 2951 randomly selected Australians
were asked "If a Federal election for the House of Representatives were
being held today, which party would receive your first preference?"

The results revealed that 51% of Australians prefer Labor, on a
two-party preferred basis.

In the federal election in September 2013, 46.5% of people voted for
Labor. Is there evidence that intended voting patterns have changed?

9.7
In all of these examples, we have observed a binary variable (a cate-
gorical variable with only two possible responses):
1. Birth month: Spring vs not Spring.

2. TV ratings: watched the Aust. Open Women's Final vs didn't watch
it.

3. Poll: Labor vs Coalition

9.8
Equivalently, we can think of our data as n trials in which we have
counted the number of successes:
1. Birth month: n = 400 trials, X = 108 successes.

2. TV ratings: n = 400 trials, X = 49 successes.

3. Poll: n = 2951 trials, with 51% the observed proportion of successes.

We would like to make inferences about the probability of success p.

9.9
Week 6 revision

If we have an SRS of size n from a population with unknown proportion
p of successes, the sample proportion is:

p̂ = X/n

p̂ has standard error

σ_p̂ = √(p(1 − p)/n)

and p̂ is approximately normal.

The normal approximation is good if both np ≥ 10 and n(1 − p) ≥ 10.

9.10
So if the true proportion of successes is p, then

(p̂ − p)/√(p(1 − p)/n)

has a standard normal distribution (approximately, by the CLT), that
is:

(p̂ − p)/√(p(1 − p)/n) ~ N(0, 1).

9.11
One-sample test for p
When studying a binary variable, the proportion of successes p is of
key interest. Usually p is unknown but we may have ideas (hypotheses)
about what it should be.

Assuming that:
We have a sample of n independent measurements of any binary
variable; and

p̂ is approximately normal,
then we can test:
H0 : p = p0
using the test statistic

z = (p̂ − p0)/√(p0(1 − p0)/n)

If H0 is true, z comes from the standard normal distribution.


9.12
So to calculate the P-value, we compare z to Z ~ N(0, 1):

If Ha : p > p0 then the P-value = P(Z ≥ z).

If Ha : p < p0 then the P-value = P(Z ≤ z).

If Ha : p ≠ p0 then the P-value = 2P(Z ≥ |z|).

9.13
Birth months and spring

Many mammals have a higher likelihood of giving birth to offspring in
Spring than at other times of the year. Is this true of humans too?

Out of 400 MATH1041 students, 108 of them were born in the spring.

Does this provide evidence that people are more likely to be born in
Spring?

9.14
Answer:

9.15
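One hedged sketch of the calculation, taking p0 = 3/12 = 0.25 as the null probability of a spring birth (this treats all months as equally long, an approximation), in Python rather than the course's RStudio:

```python
from math import sqrt
from statistics import NormalDist

x, n = 108, 400
p0 = 3 / 12                         # null: 3 spring months out of 12
p_hat = x / n                       # observed proportion, 0.27

se0 = sqrt(p0 * (1 - p0) / n)       # standard error under H0
z = (p_hat - p0) / se0
p_value = 1 - NormalDist().cdf(z)   # one-sided: Ha is p > 0.25
print(f"p_hat = {p_hat}, z = {z:.2f}, P-value = {p_value:.3f}")
```

P is about 0.18, which is not small, so these data provide no real evidence that people are more likely to be born in Spring.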
Political poll

In a recent political poll, 2951 randomly selected Australians were asked
who they would vote for if the federal election were held today. 51%
would vote for Labor, on a two-party preferred basis.

In the federal election in September 2013, 46.5% of people voted for
Labor. Is there evidence that intended voting patterns have changed?

What assumptions did we make? Are these assumptions reasonable?

9.16
Answer:

9.17
Confidence interval for the true
proportion p
Assuming that:
We have a sample of n independent measurements of any binary
variable; and

p̂ is approximately normal,
then an approximate level C confidence interval for the probability of
success p is

(p̂ − z* √(p̂(1 − p̂)/n), p̂ + z* √(p̂(1 − p̂)/n))

where z* is the number such that if Z ~ N(0, 1), then

P(−z* < Z < z*) = C.

This confidence interval works well if np̂ ≥ 10 and n(1 − p̂) ≥ 10.
Note: the textbook says np̂ ≥ 15, but we'll use 10.
9.18
Why is the standard error different?

Note that the standard error involves p̂, whereas in the hypothesis test,
the standard error involved p0. Why?

In a hypothesis test for p, we proceed under the assumption that p has
a true value equal to p0.

However, when you create a confidence interval, there is no specific
value p0 which we assume to be the true value, so we rely on the
sample estimate p̂.

9.19
Political Poll

In the recent poll, out of n = 2951 randomly selected Australian voters,
51% said they would vote Labor ahead of the Coalition.
Specify a range of values which you're pretty sure (95% sure, say)
contains the true proportion of Labor voters.

Answer:

9.20
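A sketch of the interval in Python (in RStudio only arithmetic is needed once z* = 1.96 is known):

```python
from math import sqrt

p_hat, n = 0.51, 2951

# Check the normal approximation: n*p_hat >= 10 and n*(1 - p_hat) >= 10
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10

se = sqrt(p_hat * (1 - p_hat) / n)   # standard error uses p_hat, not p0
m = 1.96 * se                        # 95% margin of error
ci = (p_hat - m, p_hat + m)
print(f"95% CI for p: ({ci[0]:.3f}, {ci[1]:.3f})")
```

We are 95% confident the true Labor proportion is between about 49.2% and 52.8%; note that the interval contains 50%.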
Data analysis for one or two variables

One variable:
  categorical: useful graph: bar chart; useful numbers: table of
    frequencies; useful test: 1-sample test for p (for a binary
    variable); useful for inference: CI for p (for a binary variable)
  quantitative: useful graph: histogram; useful numbers: mean and sd,
    or 5-number summary; useful test: 1-sample Z-test for μ (if σ
    known); useful for inference: CI for μ

Two variables:
  both categorical: useful graph: clustered bar chart; useful numbers:
    2-way table of frequencies
  one categorical, one quantitative: useful graph: comparative
    boxplots; useful numbers: 5-number summary for each group
  both quantitative: useful graph: scatterplot; useful numbers:
    correlation or regression

(This lecture: the test and CI for p.)
9.21
The plus four confidence interval

The textbook emphasises a modification of the confidence interval
formula for p, which it calls the "Plus four" confidence interval for a
sample proportion. In small samples, this gives a slightly more
accurate confidence interval, and Moore et al. claim that this
alternative formula works well whenever n ≥ 10.

The plus four formula is not widely used and we will not use it in
MATH1041: just stick with the formula on slide 9.18. If n is small, you
are better off using the binomial distribution directly for inference.

9.22
Some emerging patterns
In our three weeks studying methods of inference, you might have
noticed some patterns across the different methods. Here are two
patterns:
The test statistics have a common form; and

Confidence intervals and hypothesis tests are related.

9.23
The test statistics have a common form

You may have noticed that all the test statistics we have used so far
have had the form

(estimate − hypothesised value) / (standard error of estimate)

(In fact all Z statistics have this form. So do t statistics, which we will
meet next lecture!)

This type of test statistic is known as a Wald statistic.

In Week 11 we will meet another form of test statistic.

9.24
Confidence intervals and hypothesis tests are
related

So far we have considered hypothesis tests about a parameter (p, μ),
and we have also considered how to construct a confidence interval for
this parameter.

How are these two types of inference procedure,
confidence intervals and hypothesis testing,
related?

9.25
TV ratings and MATH1041

According to OzTAM, 6.88% of people in Sydney watched the
Australian Open Women's Tennis Final. In contrast, out of 400
MATH1041 respondents, 49 students watched this event on television.

Is there evidence that MATH1041 students have different viewing
patterns to the remainder of Sydney-siders?
Answer this question both using a hypothesis test and a 95%
confidence interval, and compare results.

9.26
Answer:

9.27
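A sketch of both procedures side by side in Python (the lecture would work this in RStudio):

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 49, 400, 0.0688
p_hat = x / n                                   # 0.1225

# Hypothesis test of H0: p = 0.0688 against Ha: p != 0.0688
se0 = sqrt(p0 * (1 - p0) / n)                   # standard error under H0
z = (p_hat - p0) / se0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, two-sided P-value = {p_value:.2g}")

# 95% confidence interval (standard error uses p_hat)
se = sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The P-value is far below 0.05, and, equivalently, the 95% CI (roughly 0.090 to 0.155) does not contain p0 = 0.0688: both procedures point to different viewing patterns among MATH1041 students.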
Declaring statistical significance at the 0.05 level for a two-sided test
is equivalent to a 95% confidence interval of the parameter of interest
not containing the null value.

Mathematically, the two procedures are equivalent (in the sense that
every two-sided hypothesis test has a matching confidence interval)!

Summary for 2-sided test examples from the last few weeks:

Example              H0          2-sided P     95% CI
Fast food (Week 8)   μ = 800     P < 0.0002    (858.9, 887.1)
TV ratings           p = 6.88%

9.28
When to use which procedure?

If confidence intervals and hypothesis tests are mathematically
equivalent, why not just use one all the time instead of worrying about
having both?

Confidence intervals and hypothesis tests put the focus on different
aspects of the analysis:

CIs focus on estimation of the unknown parameter: we specify a
range of values we are confident contains the parameter.

Hypothesis tests focus on a specific hypothesis about the parameter
and how much evidence we have against this hypothesis.

Which method is more appropriate depends on your research question.


9.29
9.30
Lectures 3–4: Inference about μ
using the t-distribution
Based on Moore et al.: Section 7.1

Today we will meet a set of practical tools for making inference about
the mean of a quantitative variable.
What to do when σ is not known?

The t Distribution

Confidence Intervals for the Mean (with σ not known)

The One-Sample t test

Assumptions

9.31
What to do when σ is not known?
For a quantitative random variable X with mean μ and variance σ², we
have established that you can estimate μ from a random sample.

A level C confidence interval for μ is

(x̄ − z* σ/√n, x̄ + z* σ/√n)

where z* satisfies P(−z* < Z < z*) = C where Z ~ N(0, 1).

Problem: the confidence interval is a function of σ, the population
standard deviation.
But usually, we don't know σ.

9.32
To construct a confidence interval for μ, we used:

(X̄ − μ)/(σ/√n) ~ N(0, 1)

If we don't know σ (and in practice, we don't) then we can estimate
it using the sample standard deviation s. But

(x̄ − μ)/(s/√n)

does not come from a normal distribution, because s varies from one
sample to the next. It is approximately normal for large n, but not
small n.

This means that we can't use:

(x̄ − z* s/√n, x̄ + z* s/√n)

as a confidence interval for μ if n is small.
9.33
The following slide is an example: it is a histogram of (x̄ − μ)/(s/√n)
for a sample of 6 observations from a normal distribution.

The histogram approximates the density function of (x̄ − μ)/(s/√n); it
was made using thousands of samples of size 6.

Its distribution is not standard normal (the red line): there is higher
density in the tails (because s is not exactly σ all the time).
9.34
[Figure: histogram of (X̄ − μ)/(S/√n) for samples of 6 values from the
normal distribution, with the Z ~ N(0, 1) density overlaid; the
histogram has heavier tails than the normal curve.]
9.35
Using the standard normal distribution to construct a confidence inter-
val for when is not known, wouldnt work well, for small samples.

It can actually be shown that in this case, sample means are farther
than 1.96 standard errors from the mean 10.6% of the time, not 5%
of the time!

This means that using:

    ( x̄ − 1.96 s/√n , x̄ + 1.96 s/√n )

would give us an 89.4% confidence interval, not a 95% confidence
interval!

9.36
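The 10.6% figure above can be checked by simulation. The course works in RStudio, but the following is an illustrative sketch using only Python's standard library: draw many samples of size 6 from a normal distribution and count how often the one-sample t-statistic lands farther than 1.96 standard errors from the mean.

```python
# Monte Carlo check of the claim on the previous slide: for n = 6,
# |(xbar - mu) / (s / sqrt(n))| exceeds 1.96 about 10.6% of the time,
# not 5%. (Illustrative sketch; not part of the course materials.)
import math
import random
import statistics

random.seed(1)            # fixed seed so the simulation is reproducible
n, reps = 6, 100_000
exceed = 0
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]   # true mean mu = 0
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)                      # sample standard deviation
    t = xbar / (s / math.sqrt(n))
    if abs(t) > 1.96:
        exceed += 1

print(exceed / reps)      # roughly 0.106
```

With other seeds the estimate varies slightly, but it stays near 0.106 = 2P(t5 > 1.96) rather than 0.05.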
Even for larger samples than n = 6,

    (X̄ − μ) / (S/√n) ≁ N(0, 1)

although the approximation gets better as n increases.

For n ≥ 30 the standard normal is a fairly good approximation.

9.37
[Figure: histograms of (X̄ − μ)/(S/√n) for n = 6, n = 11 and n = 25,
each compared with the Z ∼ N(0, 1) density; the normal approximation
improves as n increases.]
9.38
The t distribution
Instead of using the normal distribution to make inferences about μ,
we use the t distribution when σ is unknown.

If we have a SRS from a normal random variable with

    X ∼ N(μ, σ)

then the statistic

    t = (x̄ − μ) / (s/√n)

has the t-distribution with n − 1 degrees of freedom, which we
write in shorthand as t(n − 1) or t_{n−1}.

9.39
The t-distribution has a known density function. However, this density
function is a bit complicated and its integral does not have a closed
form solution. So, as for the normal distribution, we find probabilities
and critical values using computers; that is, we use RStudio.

What does the density curve of a t-distribution look like?

9.40
[Figure: density curves of the t-distribution with 1, 5 and 10 degrees
of freedom vs N(0, 1). The central 95% of each distribution lies between
±12.706 for t1, ±2.571 for t5, ±2.228 for t10, and ±1.96 for Z ∼ N(0, 1).]
9.41
Properties of the t-distribution:
It is actually a family of distributions, t1, t2, . . . , t∞.

The density depends on its degrees of freedom, often called ν.

We need to know ν to find probabilities (using RStudio).

The density is bell shaped and symmetric around zero.

As ν increases, the t-distribution becomes more short-tailed, and
progressively closer to the standard normal.

In RStudio we use the command pt() to find probabilities and the
command qt() to find quantiles of the t distribution.

9.42
Confidence Intervals for the Mean
(with σ not known)
We can construct confidence intervals for μ without having any
knowledge of σ.

Assuming that:
we have a set of n independent observations of a variable; and

the distribution of the variable X is approximately normal,

then a level C confidence interval for μ is

    x̄ ± t* s/√n

where t* is the value from the t(n − 1) distribution for which the area
between −t* and t* is C.

In practice, this formula works well even when our data are not normal,
as long as X̄ is approximately normal.
9.43
Example: corn again

In crop research into a new variety of corn, the yields in bushels per
acre for 15 plots are:
138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8
109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5

We want to estimate the average yield (μ) for this new variety. We do
not know the population standard deviation σ.

x̄ = 124.14 and s ≈ 11.96.

Find a 95% confidence interval for the true mean yield μ.

9.44
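The corn interval can be cross-checked with a minimal Python sketch using only the standard library (the course uses RStudio, where qt(0.975, 14) gives the critical value; here t* = 2.145 is hard-coded from t-tables):

```python
# Sketch of the corn confidence interval (illustrative, not the course's
# RStudio workflow). t_star = 2.145 is the 95% critical value of t(14).
import math
import statistics

yields = [138.0, 139.1, 113.0, 132.5, 140.7, 109.7, 118.0, 134.8,
          109.6, 127.3, 115.6, 130.4, 130.2, 117.7, 105.5]

n = len(yields)                      # 15 plots
xbar = statistics.mean(yields)       # about 124.14
s = statistics.stdev(yields)         # about 11.96
t_star = 2.145                       # 95% critical value of t(14)

margin = t_star * s / math.sqrt(n)
print((xbar - margin, xbar + margin))   # roughly (117.5, 130.8)
```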
Answer:

9.45
Example: lead in soil

A new method of extracting lead from soil has been developed, and
we would like to know what the average amount of lead extracted is
(in parts per million). When tried on 27 specimens of soil, it yielded a
mean of 83 ppm lead and a standard deviation of 10 ppm.

Construct a 99% confidence interval for the average amount of lead


released from the soil.

Answer:

9.46
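A quick standard-library sketch of this interval (illustrative only; in RStudio, qt(0.995, 26) gives the 99% critical value of t(26), hard-coded here as 2.779 from t-tables):

```python
# 99% confidence interval for the mean lead released (sketch).
import math

n, xbar, s = 27, 83.0, 10.0
t_star = 2.779                      # 99% critical value of t(26)

margin = t_star * s / math.sqrt(n)
lo, hi = xbar - margin, xbar + margin
print(lo, hi)                       # roughly (77.7, 88.3) ppm
```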
The One-Sample t Test
Often we want to test a hypothesis about the population mean μ of
some variable X, without any knowledge of σ.

Assuming that:
we have a set of n independent observations of a variable; and

the distribution of the variable X is approximately normal,

then the test statistic

    t = (x̄ − μ0) / (s/√n)

has a t(n − 1) distribution under the hypothesis H0 : μ = μ0.

In practice, this formula works well even when our data are not normal,
as long as X̄ is approximately normal.

9.47
As always, the alternative hypothesis Ha is important for determining
the P-value. Ha tells us what sorts of departures from H0 we are
interested in measuring.

If Ha : μ > μ0 then the P-value = P(T ≥ t).

If Ha : μ < μ0 then the P-value = P(T ≤ t).

If Ha : μ ≠ μ0 then the P-value = 2P(T ≥ |t|).

9.48
Example: lead in soil
The amount of lead in a certain type of soil, when released by a
standard extraction method, has a mean of 86 parts per million (ppm).
A new and cheaper extraction method is tried on 27 specimens of the
soil, yielding a mean of 83 ppm lead and a standard deviation of 10
ppm.

Is there significant evidence that the new method frees less lead from
the soil?

Answer:

9.49
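A standard-library sketch of the test statistic (illustrative; the course uses RStudio, where pt(t, 26) gives the one-sided P-value, reported later in these slides as about 0.07):

```python
# One-sample t statistic for the lead-in-soil test of H0: mu = 86 (sketch).
import math

n, xbar, s, mu0 = 27, 83.0, 10.0, 86.0
t = (xbar - mu0) / (s / math.sqrt(n))
print(t)    # about -1.56, compared against the t(26) distribution
```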
9.50
Assumptions

Note that for T = (X̄ − μ)/(S/√n) to have a t(n − 1) distribution, we
made two assumptions:
we have a set of n independent observations of a variable; and

the distribution of the variable X is approximately normal.

We can guarantee that the first assumption is satisfied by. . .

(taking a simple random sample).

But how important is the normality assumption?

(How robust is the t-distribution result to non-normal data?)

What do you do if your data aren't normal?

9.51
The t-distribution result works whenever the Central Limit Theorem
works. Hence Moore et al. say (page 432):
Small samples: Do not use t procedures if there are outliers present
or if the data are more than slightly skewed (data must have a good
approximation to normality).

Moderate samples (e.g. 15 ≤ n < 40): Do not use t procedures if
there are outliers or if data are strongly skewed (only need a fair
approximation to normality).

Large samples (e.g. n ≥ 40): t procedures work well even for
strongly skewed data (data can have a poor approximation to
normality), although they can still be adversely affected by gross
outliers.
(These are basically the same rules introduced for the Central Limit
Theorem in slide 7.37.)

9.52
What do you do if you're worried about non-normality?
Use another family of distributions to model the data. This is an
advanced topic outside the scope of MATH1041.

If the data are right-skewed then use a data transformation such
as x_new = log(x) or x_new = √x that aims to turn the data into
approximately normal data. See Week 2 lecture for some other
transformations.

Use a distribution-free procedure such as the sign test (Moore et
al. page 439). Such procedures do not assume that the population
distribution has any specific form (but they usually don't work as
well as t-tests).

9.53
Example: hair cut costs

A survey of 160 male MATH1041 students found that the average


amount they spent at the hairdressers was $23.46, with a standard
deviation of $23.84.

The university hairdresser charges $35 for a male haircut.

Is there evidence that the hairdressers are overcharging? (In the sense
that they are charging more than the average amount a male student
would pay?)

What assumptions did you make to answer this question? Are these
assumptions reasonable?

9.54
Answer:

9.55
Additional exercise

The mean cholesterol level of Australians aged 20–39 is 230 mg/dL.

In a nutrition study, a group of 24 people in this age range who eat


primarily a macrobiotic diet are measured and the mean cholesterol
level was found to be 175 mg/dL with a standard deviation of 35
mg/dL.

Test the hypothesis that the group of people on a macrobiotic diet


have cholesterol levels different from those of the general population.

What assumptions must be satisfied for your testing approach to be


valid?

9.56
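The test statistic for this exercise can be sketched with the standard library (illustrative; RStudio's pt() would supply the P-value for the two-sided test against t(23)):

```python
# One-sample t statistic for the macrobiotic-diet exercise,
# testing H0: mu = 230 against Ha: mu != 230 (sketch).
import math

n, xbar, s, mu0 = 24, 175.0, 35.0, 230.0
t = (xbar - mu0) / (s / math.sqrt(n))
print(t)    # about -7.7; |t| is far beyond any t(23) critical value
```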
9.57
9.58
9.59
9.60
Use and abuse of hypothesis
testing
Based on Moore et al.: Section 6.3

10.1
Lectures 1-2: Use and abuse of
hypothesis tests/Comparing Means
Using statistical tests wisely is not necessarily easy to do. We will work
through some important cautions when using a hypothesis test.

Use and abuse of hypothesis testing:


Revision: one-sample test statistics and assumptions

Check your assumptions

Practical versus statistical significance

Common misuses of hypothesis tests

Comparing Means:
Introduction

The two sample t-test


10.2
Revision: one-sample test statistics
and assumptions
In Weeks 8–9 we have met the following three types of one-sample
tests:

Test type          Z-test for μ (σ known)  t-test for μ (σ unknown)  1-samp test for p
H0 :               μ = μ0                  μ = μ0                    p = p0
Test statistic     z = (x̄ − μ0)/(σ/√n)     t = (x̄ − μ0)/(s/√n)       z = (p̂ − p0)/√(p0(1 − p0)/n)
Null distribution  N(0, 1)                 t(n − 1)                  N(0, 1)
Assumptions: Observations are independent;
x̄ or p̂ comes from a normal distribution.
10.3
Check your assumptions
In all of the above one-sample tests, there are two important assump-
tions. Your conclusions may not be valid if the assumptions are not
met.
1. Observations are independent. You can guarantee that this as-
sumption is satisfied by taking a SRS.

2. The sample mean X̄ or sample proportion p̂ is approximately
normal. If n is large enough, the Central Limit Theorem ensures this
is true. But for small samples, this depends on the distribution of
your data.
Always check assumptions with your data.

10.4
Fast food again

Recall the fast food example from Week 8:

A consumer protection group tested 12 meals to see if the average


number of calories was 800. In Week 8 we used a Z-test to test this
claim.

What assumptions were made in doing the Z-test?


What could be done in collecting data to ensure assumptions are sat-
isfied?
Use the following graph to check assumptions.

10.5
10.6
Calcium, pregnancy and Central Americans again

A clinic in rural Guatemala measures blood calcium for 180 healthy


pregnant women at their first visit for prenatal care. We conducted a
Z-test to see if the mean calcium level is significantly different from
9.5.

Do you think the assumptions of the Z-test would be satisfied in this


case?

10.7
Practical vs statistical significance
Statistical significance is attained when the P -value is below some
chosen significance level.

Practical significance is attained when the departure from the null


hypothesis is big enough to have practical importance.

Just because an effect has statistical significance doesn't mean it is of
practical significance. From a dataset with a really large sample size, a
test statistic can be significant even though there is only a really small
(and unimportant) departure from the null hypothesis.

10.8
Example: factory working hours

Hours worked in a day is recorded for 996 factory workers. The mean
hours worked is 7.02 hours, standard deviation 0.3 hours.

Does this sample provide evidence that factory workers work longer
than 7 hour days?

Use the appropriate method of inference to answer the above question.

10.9
Answer:

Hypotheses: H0 : μ = 7    Ha : μ > 7

Test statistic: t = (x̄ − 7)/(s/√n) ≈ 2.10,
which comes from t995 if H0 is true.

P-value: P(T > 2.1) = 0.018 (P(T > 2.1) < 0.02 using tables alone)

Conclusion: There is some evidence that the factory workers work
longer than 7 hour days.

10.10
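The test statistic in that answer can be reproduced with a standard-library sketch (illustrative; the P-value would come from pt() against t(995) in RStudio):

```python
# t statistic for the factory working hours example (sketch).
import math

n, xbar, s, mu0 = 996, 7.02, 0.3, 7.0
t = (xbar - mu0) / (s / math.sqrt(n))
print(t)    # about 2.10
```

Note the practical-significance point: the estimated departure from 7 hours is only 0.02 hours (about 72 seconds), yet the huge sample makes it statistically significant.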
Textbook exercise

A study with 5000 subjects reported a result which was statistically


significant at the 5% level.

Explain why this result might not be particularly large or important.

Answer:

10.11
Common misuses of hypothesis tests
Unfortunately, hypothesis testing is often misused in practice. Some
particularly common abuses of the method:
To conclude that the null hypothesis is true.

To search for significance to test lots of hypotheses in the


same dataset.

To test a hypothesis on the same data the hypothesis was gener-


ated from (the mean appears to be close to 100. Is it significantly
different from 100?)

To test claims you know aren't true anyway.

10.12
Concluding that the null hypothesis is true

You can never prove a theory; in the same way, you can never prove
a null hypothesis!

Recall the lead example from Week 9: we tested H0 : μ = 86 using
27 soil specimens, with x̄ = 83 and s = 10. We calculated a P-value
of about 0.07 and concluded that there was little evidence against H0.

A 95% confidence interval for μ is (79.0, 87.0).

So clearly the true mean could be 86 (so H0 could be true), but the
true mean could also be 85, 84.5, or 80.2. . .

10.13
Have the results of a hypothesis test been misinterpreted in any of the
following statements?
Paired t-tests indicated that there was no change in attitude (P >
0.05)

The results showed that expression of EphA2 was obviously in-


creased in gastric cancer tissues (P < 0.01)

There was no significant difference (p > 0.05) between the first


and second responses

10.14
When interpreting a large P-value, it is often useful to construct a
confidence interval for the key parameter of interest; this helps us
understand whether or not a practically significant effect is possible
despite it not being statistically significant.

We usually try to plan studies to avoid such a situation through
appropriate choice of sample size.

Using margin-of-error or power calculations, we can choose n so that


we are likely to detect any practically significant departures from H0.
Hence if H0 is retained we can conclude that there is no practically
significant departure from H0.

10.15
Searching for significance

Hypothesis testing is designed for the situation where you have a theory
or hypothesis you want to test; you then design an experiment or study
to test this hypothesis.

While it is easy using a computer to take a given dataset and conduct
lots of hypothesis tests, searching for significant P-values, this is an
abuse of the method. Search for patterns by all means, but don't use
P-values in this setting.

10.16
Exercise: searching for significance

A physiologist is testing whether there is an effect of a new fungicide


on plant size. He measures plant size in 15 different ways, and for
each variable he tests for a fungicide effect using a hypothesis test
with significance level = 0.05.

Assume that this fungicide has no effect on plant size.


What is the chance of finding a significant result from a single
hypothesis test, in this case?
What is the chance of finding a significant result from at least one of
the 15 tests carried out?

10.17
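Under the additional assumption that the 15 tests are independent, the second probability is 1 minus the chance that all 15 tests retain H0. A one-line standard-library sketch:

```python
# Family-wise error sketch for the fungicide exercise, assuming the 15
# tests are independent (an assumption, stated on this slide's terms).
alpha, k = 0.05, 15
p_single = alpha                     # chance of a false positive in one test
p_any = 1 - (1 - alpha) ** k         # chance of at least one false positive
print(p_single, p_any)               # 0.05 and about 0.54
```

So even with no real fungicide effect, there is better than an even chance of at least one "significant" result among the 15 tests.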
Comparing two means
Based on Moore et al.: Section 7.2

10.18
Data analysis for one or two variables

One variable:
  categorical:  graph: bar chart; numbers: table of frequencies;
                test: 1-samp test for p (for binary var); inference: CI for p
  quantitative: graph: histogram or boxplot; numbers: mean and sd, or
                5-number summary; test: 1-sample t-test for μ (or Z-test
                if σ known); inference: CI for μ

Two variables:
  both categorical:         graph: clustered bar chart;
                            numbers: 2-way table of frequencies
  one categorical, one      graph: comparative boxplots;
  quantitative (this week): numbers: 5-num summary for each group
  both quantitative:        graph: scatterplot;
                            numbers: correlation or regression
10.19
Do ravens fly to gunshots?

(Ecology 2005, 86:1057–1060)

Do ravens intentionally fly towards gunshot sounds (to scavenge on


the carcass they expect to find)?

Crow White addressed this question by counting raven numbers at a


location, firing a gun, then counting raven numbers 10 minutes later.
He did this in 12 locations. Results:

location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3

Is there evidence that ravens fly towards the location of gunshots?

10.20
Pregnancy and smoking

(Journal of General Psychology 1993, 120(1): 49–63)


What is the effect of a mother's smoking during pregnancy on the
resulting offspring?

A randomised controlled experiment investigated this by injecting
pregnant guinea pigs with 0.5 mg/kg nicotine hydrogen tartrate in saline
solution. The control group was injected with a saline solution without
nicotine.

The learning capabilities of the offspring of these guinea pigs were
then measured using the number of errors that were made trying to
find food in a maze.

10.21
Results:

Group Sample size Mean Standard deviation


Control 10 23.4 12.3
Treatment 10 44.3 21.5

Is there a detrimental effect of nicotine on the development of the


guinea pig offspring?

10.22
Introduction
Hypothesis tests for data such as the above are often called two-sample
tests because they involve the comparison of two samples from
two different populations.

Population   Variable   Mean   Standard deviation
1            X1         μ1     σ1
2            X2         μ2     σ2

We want to compare the means of the two populations, μ1 and μ2.

The hypotheses we are interested in are

    H0 : μ1 = μ2    Ha : one of  μ1 < μ2,  μ1 ≠ μ2,  μ1 > μ2
10.23
Another way to think about it: we have measured two variables on
each subject: a quantitative variable and a binary variable.

e.g. when comparing

In a two-sample test, we want to test if these two variables are inde-


pendent.

10.24
The two-sample t-test
We have two samples: one from X1, one from X2. We want to test
H0 : μ1 = μ2.

If we assume that:
the two samples are independent of each other;

each sample consists of independent observations of a variable;

the distribution of each variable is approximately normal*; and

both X1 and X2 have the same standard deviation σ,

then we can test H0 by calculating

    t = (x̄1 − x̄2) / ( sp √(1/n1 + 1/n2) )
    where sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

and measuring how far t is from 0. If this statistic is large and positive,
it suggests that μ1 > μ2; if it is large and negative, it suggests that
μ1 < μ2.
10.25
We know that if H0 is true, and if the above assumptions are satisfied,
then T ∼ t(n1 + n2 − 2). So to calculate the P-value:

If Ha : μ1 > μ2 then the P-value = P(T ≥ t).

If Ha : μ1 < μ2 then the P-value = P(T ≤ t).

If Ha : μ1 ≠ μ2 then the P-value = 2P(T ≥ |t|).

* In practice, this t-test works well even when our data are not normal,
as long as x̄1 − x̄2 comes from an approximately normal distribution.
10.26
Pregnancy and smoking

The number of errors made by guinea pigs in the maze, classified by


treatment group, is summarised as:

Group Sample size Mean Standard deviation


Control 10 23.4 12.3
Treatment 10 44.3 21.5

Is there evidence that nicotine treatment detrimentally affects the
development of guinea pig offspring?
velopment of guinea pig offspring?

Answer:

10.27
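The pooled standard deviation and t statistic for the guinea pig data can be reproduced with a standard-library sketch (illustrative; the P-value against t(18) would come from pt() in RStudio):

```python
# Two-sample t statistic for the guinea pig data (sketch).
import math

n1, x1, s1 = 10, 23.4, 12.3    # control group
n2, x2, s2 = 10, 44.3, 21.5    # treatment group

# Pooled standard deviation, as defined on the previous slide.
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = (x2 - x1) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(sp, t)    # sp about 17.5, t about 2.67 on 18 degrees of freedom
```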
10.28
Lectures 3–4: Comparing two means
Understanding the two sample t-test

Confidence interval for the difference between means

Assumptions of two-sample t procedures

The paired t-test

Paired t vs two-sample t

10.29
Understanding the formula for the
two-sample t-test
The two-sample t-test is based on the following result:

If we take two samples, one from X1, which has mean μ1, and one
from X2, which has mean μ2, then

    ( (x̄1 − x̄2) − (μ1 − μ2) ) / ( sp √(1/n1 + 1/n2) )

has a t-distribution with n1 + n2 − 2 degrees of freedom if the
following assumptions are satisfied:
the two samples are independent of each other;

each sample consists of independent observations of a variable;

the distribution of each variable is approximately normal; and

both X1 and X2 have the same standard deviation σ.

10.30
Why do we use a formula that looks like this?

The statistic has a similar form to the one-sample t-statistic, with a
few key differences:
The null hypothesis is H0 : μ1 − μ2 = 0, so μ1 − μ2 is our parameter
of interest. We estimate it using x̄1 − x̄2.

If X̄1 ∼ N(μ1, σ/√n1) and X̄2 ∼ N(μ2, σ/√n2), and if X̄1 and X̄2 are
independent, the distribution of the sample mean difference is:

    X̄1 − X̄2 ∼ N( μ1 − μ2, σ √(1/n1 + 1/n2) ).

The formula for sp is a two-sample generalisation of the formula
for sample standard deviation:

    s = √( Σ (xi − x̄)² / (n − 1) )        (sum over i = 1, . . . , n)
    sp = √( ( Σ (x1i − x̄1)² + Σ (x2i − x̄2)² ) / ((n1 − 1) + (n2 − 1)) )

The degrees of freedom is the sum of n − 1 for each sample:

    (n1 − 1) + (n2 − 1) = n1 + n2 − 2.
10.31
Confidence interval for the mean difference

We can calculate confidence intervals for the difference between the
population means, μ1 − μ2, using the same result used for two-sample
t-tests:

We have two samples of a variable, and we assume that:

the two samples are independent of each other;

each sample consists of independent observations of a variable;

the distribution of each variable is approximately normal; and

both have the same standard deviation σ.

10.32
Let

    t = ( (x̄1 − x̄2) − (μ1 − μ2) ) / ( sp √(1/n1 + 1/n2) )
    where sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ).

Then t comes from a t(n1 + n2 − 2) distribution.

Using the same arguments as previously (e.g. slide 7.48) we can
construct a level C confidence interval for μ1 − μ2 using

    x̄1 − x̄2 ± t* sp √(1/n1 + 1/n2)

where t* is the value from the t(n1 + n2 − 2) distribution for which the
area between −t* and t* is C.

10.33
Pregnancy and smoking

The number of errors made by guinea pigs in the maze, classified by


treatment group, is summarised as:

Group Sample size Mean Standard deviation


Control 10 23.4 12.3
Treatment 10 44.3 21.5

How big is the average difference in number of errors, between control


and treatment groups?
Use a 95% confidence interval to answer this question.

10.34
Answer:

10.35
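This interval can be sketched with the standard library (illustrative; t* = 2.101 is the 95% critical value of t(18), which qt(0.975, 18) would give in RStudio):

```python
# 95% confidence interval for mu_treatment - mu_control in the guinea
# pig example (sketch).
import math

n1, x1, s1 = 10, 23.4, 12.3    # control
n2, x2, s2 = 10, 44.3, 21.5    # treatment
t_star = 2.101                 # 95% critical value of t(18)

sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
margin = t_star * sp * math.sqrt(1 / n1 + 1 / n2)
lo, hi = (x2 - x1) - margin, (x2 - x1) + margin
print(lo, hi)    # roughly (4.4, 37.4) more errors, on average
```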
Assumptions of two-sample t
procedures
In both the two-sample t-test and the confidence interval for μ1 − μ2,
we needed to make four assumptions, in order to be able to say that
our statistic has a t(n1 + n2 − 2) distribution:
1. the two samples are independent of each other
(ensured by sampling from different populations);

2. each sample consists of independent observations of a variable
(ensured by taking an SRS from each population);

3. the distribution of each variable is approximately normal
(in practice, we only actually need X̄1 − X̄2 to be normal; use slide
7.43 rules for n1 + n2 to check if n1 and n2 are large enough);

4. both have the same standard deviation σ
(you can't guarantee this, but you can check if s1 and s2 are
similar).
10.36
What if my standard deviations aren't equal?

Regarding assumption 4: what if X1 and X2 do not have equal standard
deviation?

Simulation can be used to show that this assumption doesn't actually
matter if n1 = n2.

The only time you run into problems is when both n1/n2 and σ1/σ2
are very different from 1 (they differ from 1 by at least a factor of two,
say).

In this situation we have a problem.

10.37
Two possible solutions to this problem:
Transform data so that the standard deviations are similar. This
simple approach often solves the problem (and it often removes
skew at the same time).

Use a test statistic that does not assume equal standard deviations.
We will not use this technique in MATH1041, but see pages 447–454
of Moore et al. for details.

10.38
Pregnancy and smoking

Recall the study in which 10 pregnant guinea pigs were used as a


control and 10 were given a nicotine treatment. The number of errors
made by their offspring in a maze were:

Group Sample size Mean Standard deviation


Control 10 23.4 12.3
Treatment 10 44.3 21.5

What assumptions did we make in doing a two-sample t-test? Are


these assumptions valid here?

Answer:

10.39
[Figure: normal quantile plot of the guinea pig control group (sample
quantiles vs normal quantiles).]
10.40
[Figure: normal quantile plot of the guinea pig treatment group (sample
quantiles vs normal quantiles).]

10.41
The paired t-test
Do ravens fly to gunshots?
(Ecology 2005, 86:1057–1060)

Do ravens intentionally fly towards gunshot sounds? Raven numbers


at 12 locations before and after firing a gun:

location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3

Is there evidence that ravens fly towards the location of gunshots?

10.42
For this example, it is not appropriate to conduct a two-sample t-test,
because our two sets of samples are not independent.

If our two samples are not independent, we cannot say that

    σ_{X̄1 − X̄2} = σ √(1/n1 + 1/n2)

(if σ_{X1} = σ_{X2} = σ) and so we cannot use our 2-sample t-test.

In this example, observations are paired.

(Textbook: matched pairs)

10.43
The Paired t Test

We have a paired sample: we have pairs of measurements on X1 and


X2, and we want to test if the means are equal, that is, if the mean
difference is zero.

We do this by analysing the differences using a one-sample t-test of


H0 : = 0, where is the true mean difference.

10.44
Assuming that:
we have a set of n independent pairs of observations of a variable;
and

the distribution of the paired differences is approximately normal,

then the test statistic we calculate on the differences:

    t = x̄ / (s/√n)

has a t(n − 1) distribution under the hypothesis H0 : μ = 0.

In practice, this formula works well even when our differences are not
normal, as long as X̄ is approximately normal.

10.45
As always, the alternative hypothesis Ha is important for determining
the P-value. Ha tells us what sorts of departures from H0 we are
interested in measuring.

If Ha : μ > 0 then the P-value = P(T ≥ t).

If Ha : μ < 0 then the P-value = P(T ≤ t).

If Ha : μ ≠ 0 then the P-value = 2P(T ≥ |t|).

10.46
Do ravens fly to gunshots?

location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
difference

Is there evidence that ravens fly towards the sound of gunshots?

Answer:

First, calculate the variable

    difference = after − before.

The hypotheses of interest (about μ, the true mean of the differences)
are:

    H0 : μ = 0    Ha : μ > 0.
10.47
The P-value is:

    P( t_{n−1} > d̄/(s/√n) ) = P( t11 > 1.08/(1.88/√12) )
                             ≈ P( t11 > 1.995 )

Conclusion: using RStudio we have 1-pt(1.995,11) = 0.035, and
so we conclude that there is some evidence that ravens fly towards the
sound of gunshots.
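The paired analysis of the raven data can be reproduced with a standard-library sketch (illustrative; the course does this in RStudio):

```python
# Paired t statistic for the raven data: analyse the differences
# (after - before) with a one-sample t-test of H0: mu = 0 (sketch).
import math
import statistics

before = [1, 0, 1, 0, 0, 0, 0, 5, 1, 1, 2, 0]
after  = [2, 3, 2, 2, 1, 1, 2, 2, 4, 2, 0, 3]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
dbar = statistics.mean(diffs)      # about 1.08
s = statistics.stdev(diffs)        # about 1.88
t = dbar / (s / math.sqrt(n))
print(t)                           # about 2.0 on 11 degrees of freedom
```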
Confidence Intervals for the Mean
Difference from a Paired Sample
We can construct a confidence interval for the mean of the differences
μ, using the same method as in Week 9, except analysing the paired
differences instead of the original data.

Assuming that:
we have a set of n independent pairs of observations of a variable;

the distribution of the paired differences is approximately normal,

then a level C confidence interval for μ can be calculated from the
paired differences as:

    x̄ ± t* s/√n

where t* is the value from the t(n − 1) distribution for which the area
between −t* and t* is C.

In practice, this formula works well even when our paired differences
are not normal, as long as X̄ is approximately normal.
10.48
Do ravens fly to gunshots?

location 1 2 3 4 5 6 7 8 9 10 11 12
before 1 0 1 0 0 0 0 5 1 1 2 0
after 2 3 2 2 1 1 2 2 4 2 0 3
difference 1 3 1 2 1 1 2 -3 3 1 -2 3

How big is the increase in average number of ravens, following a
gunshot? Use an appropriate method of inference to answer this question.

Answer:

10.49
Speed cameras
Below is the number of people exceeding the speed limit (in a month)
before and after speed cameras were installed at four locations (data
from Sydney Morning Herald, 22/9/03).

location Concord West Edgecliff Wentworthville Green Valley


before 5719 7535 6254 2200
after 1786 2228 528 260

Is there evidence that introducing speed cameras reduced the number


of people speeding? How big was the reduction in number of people
speeding per month per camera, on average?

10.50
Answer:

10.51
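Because the before and after counts come from the same four locations, a paired analysis of the differences is appropriate. A standard-library sketch (illustrative; the P-value against t(3) would come from pt() in RStudio):

```python
# Paired t statistic for the speed camera data (sketch).
import math
import statistics

before = [5719, 7535, 6254, 2200]
after  = [1786, 2228, 528, 260]

diffs = [a - b for a, b in zip(after, before)]
dbar = statistics.mean(diffs)          # about -4226.5 speeders per month
s = statistics.stdev(diffs)
t = dbar / (s / math.sqrt(len(diffs)))
print(dbar, t)                         # t about -5.0 on 3 degrees of freedom
```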
Paired t-test vs two sample t-test
Why collect paired data?

Using a matched pairs design can control for other factors that may
be important.
e.g. raven data: the response variable was number of ravens, which
can vary greatly with location. Which of the following approaches do
you think would be more efficient?
Taking before-after measurements at 12 locations (as in the ex-
ample given previously); and

Measuring # ravens at 12 places with no gunshot, and comparing


this to # ravens at 12 different places after a gunshot.

10.52
OK, well why not analyse all two-sample datasets using
a paired t-test?

If we have two balanced samples (n1 = n2), then why not always use
a paired t-test, to avoid confusion about which test to use?

If we have two independent samples, we should not use a paired t-test,
because this discards useful information.

Degrees of freedom are a measure of how much information we have


about the sample standard deviation, and the degrees of freedom are
half as big for a paired t-test as for a two sample t-test using the same
data.

10.53
Data analysis for one or two variables

One variable:
  categorical:  graph: bar chart; numbers: table of frequencies;
                test: 1-samp test for p (for binary var); inference: CI for p
  quantitative: graph: histogram or boxplot; numbers: mean and sd, or
                5-number summary; test: 1-sample t-test for μ (or Z-test
                if σ known); inference: CI for μ
                (paired data: analyse the differences)

Two variables:
  both categorical:            graph: clustered bar chart;
                               numbers: 2-way table of frequencies
  one categorical, one quant.  graph: comparative boxplots;
  (this lecture):              numbers: 5-num summary for each group;
                               test: 2-sample t-test (for binary + quant);
                               inference: CI for μ1 − μ2
  both quantitative:           graph: scatterplot;
                               numbers: correlation or regression
10.54
10.55
Relationships between
categorical variables
Based on Moore et al.: Chapter 9

11.1
Lectures 1–2: Data analysis for
two-way tables
Summarising the association between two categorical variables
Two-way tables

Visualising two-way tables

Simpson's paradox
Inference for two-way tables
Introduction

χ² tests for categorical data

11.2
Data analysis for one or two variables

One variable:
  categorical:  graph: bar chart; numbers: table of frequencies;
                test: 1-samp test for p (for binary var); inference: CI for p
  quantitative: graph: histogram or boxplot; numbers: mean and sd, or
                5-number summary; test: 1-sample t-test for μ (or Z-test
                if σ known); inference: CI for μ
                (paired data: analyse the differences)

Two variables:
  both categorical (this week): graph: clustered bar chart;
                               numbers: 2-way table of frequencies
  one categorical, one quant.: graph: comparative boxplots;
                               numbers: 5-num summary for each group;
                               test: 2-sample t-test (for binary + quant);
                               inference: CI for μ1 − μ2
  both quantitative:           graph: scatterplot;
                               numbers: correlation or regression
11.3
Introduction
1. Gender and study area

In a survey of MATH1041 students, the gender of students and their


study area was recorded.

How should we summarise the results, graphically and numerically?

Is there evidence that the gender ratio is different in different subject


areas?

11.4
2. Lateness of QANTAS planes

The Sydney Morning Herald (Saturday, 16th August 2003) conducted


a study to compare how late QANTAS flights depart, compared with
Virgin Blue flights. The following results were obtained for
afternoon/evening flights:

Virgin Blue QANTAS


0-3 min late 22 11
3-15 min late 6 21
> 15 min late 0 2

Is there evidence that one airline runs late more often than the other?

11.5
3. Cancers and where they occur
(From Roberts et al., Pathology (1981) 13:763–70.)

Malignant melanomas from 366 NSW patients were classified by Cancer
Type and Cancer Site:

Cancer site
Cancer Type Head/neck Trunk Extremities
Superficial spreading 16 54 115
Nodular 19 33 73
Indeterminate 11 17 28

Is there evidence that where a cancer occurs is related to type of


cancer?

11.6
In all of the above examples, there were two categorical variables:
1. Gender and Study area
2. Airline and Lateness

3. Cancer Type and Cancer Site

11.7
In particular, we are interested in assessing whether or not there is an
association between the two categorical variables.
1. Is there an association between gender and study area?

2. Is there an association between Airline and lateness?

3. Is there an association between Cancer Site and Cancer Type?

11.8
Two-way tables
To numerically summarise the association between two categorical vari-
ables, we use a two-way table.

To construct a two-way table (of frequencies):


Choose one variable as the row variable. List the different cate-
gories of this variable in different rows.

Make the other variable the column variable (by listing categories
in different columns).

Count the number of subjects in each of the possible combinations


of categories. Record the count in the corresponding row/column.

11.9
Example: area of study and gender

Female Male
Life Sci 38 31
Medicine 11 13
Science 51 40
Other 9 14

So for example there are 38 female students studying Life Sciences.

11.10
Example: On-time planes and airlines

The following is a subset of the Sydney Morning Herald on-time
performance data for QANTAS and Virgin Blue.

Airline: Min. late: Airline: Min. late:


QANTAS 0-3 QANTAS 0-3
QANTAS 3-15 Virgin 3-15
Virgin 0-3 Virgin 0-3
Virgin 0-3 Virgin 0-3
QANTAS 3-15 QANTAS 3-15
Virgin 3-15 QANTAS 0-3
QANTAS 3-15 QANTAS 3-15

Construct a two-way table to summarise these data.

11.11
Answer:
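The counting itself can be sketched on a computer. The course uses R/RStudio; the following is an illustrative Python version of the same tally (not part of the slides):

```python
from collections import Counter

# The 14 flights from the slide, as (airline, lateness) pairs,
# reading the two side-by-side columns top to bottom.
flights = [
    ("QANTAS", "0-3"), ("QANTAS", "3-15"), ("Virgin", "0-3"),
    ("Virgin", "0-3"), ("QANTAS", "3-15"), ("Virgin", "3-15"),
    ("QANTAS", "3-15"),
    ("QANTAS", "0-3"), ("Virgin", "3-15"), ("Virgin", "0-3"),
    ("Virgin", "0-3"), ("QANTAS", "3-15"), ("QANTAS", "0-3"),
    ("QANTAS", "3-15"),
]

# Counting each (airline, lateness) combination gives the two-way table.
table = Counter(flights)
for airline in ("QANTAS", "Virgin"):
    row = [table[(airline, late)] for late in ("0-3", "3-15")]
    print(airline, row)  # QANTAS [3, 5], Virgin [4, 2]
```

In R the same table is produced by `table(airline, late)`.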

11.12
A two-way table can be thought of as a summary of the joint distribution of the two categorical variables (because we look at the frequencies in the two variables jointly).

We could also construct one-way tables to look at the marginal distribution of each variable.

11.13
(One-way) table of frequencies for study area:

Life Sci 69
Medicine 24
Science 91
Other 23

(One-way) table of frequencies for gender:

Female Male
109 98

11.14
These are sometimes called the marginal distributions of gender and
study area because they can be written into the margins of the two-way
table:

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Note that each marginal total is just the sum of the corresponding
row/column of the two-way table.
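The marginal totals are just row and column sums. As an illustrative sketch of the arithmetic (in Python here; in the course you would use R):

```python
# Two-way table of gender by study area from the survey slide.
table = {
    "Life Sci": {"Female": 38, "Male": 31},
    "Medicine": {"Female": 11, "Male": 13},
    "Science":  {"Female": 51, "Male": 40},
    "Other":    {"Female": 9,  "Male": 14},
}

# Marginal distribution of study area: sum across genders in each row.
area_totals = {area: sum(counts.values()) for area, counts in table.items()}

# Marginal distribution of gender: sum down each column.
gender_totals = {g: sum(counts[g] for counts in table.values())
                 for g in ("Female", "Male")}

print(area_totals)    # Life Sci 69, Medicine 24, Science 91, Other 23
print(gender_totals)  # Female 109, Male 98
```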

11.15
Example: On-time planes and airlines

The following is a subset of the Sydney Morning Herald on-time performance data for QANTAS and Virgin Blue.

Airline: Min. late: Airline: Min. late:


QANTAS 0-3 QANTAS 0-3
QANTAS 3-15 Virgin 3-15
Virgin 0-3 Virgin 0-3
Virgin 0-3 Virgin 0-3
QANTAS 3-15 QANTAS 3-15
Virgin 3-15 QANTAS 0-3
QANTAS 3-15 QANTAS 3-15

Estimate the marginal distributions of the Min. late and Airline variables, and add them to your two-way table.

11.16
Conditional distributions

A final aspect of two-way tables that is sometimes useful is to explore the conditional distribution of one variable given another. This is calculated by conditioning on one of the variables and then looking at the distribution of the other.

This corresponds directly to the idea of conditional probability introduced in Week 6.

11.17
Example: Gender and study area

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Estimate the conditional distribution of Gender given Study Area.

11.18
Conditional distribution of gender for Life Science students:

Female Male
38/69 (55.1%) 31/69 (44.9%)

Conditional distribution of gender for Science students:

Female Male
51/91 (56.0%) 40/91 (44.0%)
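A conditional distribution is found by dividing each cell by its row (or column) total. An illustrative Python sketch of that division (the slides themselves use R):

```python
# Two rows of the gender-by-study-area table.
table = {
    "Life Sci": {"Female": 38, "Male": 31},
    "Science":  {"Female": 51, "Male": 40},
}

# Conditional distribution of gender given a study area:
# divide each cell count by the row total.
def conditional(row):
    total = sum(row.values())
    return {g: count / total for g, count in row.items()}

life_sci = conditional(table["Life Sci"])
science = conditional(table["Science"])
print(round(life_sci["Female"] * 100, 1))  # 55.1 (% female among Life Sci)
print(round(science["Female"] * 100, 1))   # 56.0 (% female among Science)
```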

11.19
Visualising Two-Way Tables
It is not easy (but possible) to assess association by staring at the
numbers in a two-way table.

A better option is often to visualise the data using one of the following:
Multiple bar charts

Clustered bar charts

Bar charts of the conditional distribution

11.20
Example: Gender and study area

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Construct a graphical summary of the association between gender and


study area.

Below are a few possible ways of doing this:

11.21
Multiple bar charts of study area for each gender:
[Two bar charts of study-area counts (Aviation, Life Sci, Other, Science): one for female students, one for male students.]

11.22
Multiple bar charts of gender for each study area:
[Four bar charts of gender counts (female, male): one for each of Life Sci, Science, Aviation and Other.]

11.23
A clustered bar chart of gender clustered by study area:

[Clustered bar chart: for each study area (Other, Life Sci, Aviation, Science), adjacent bars show female and male counts.]

11.24
A bar chart of conditional probabilities of being female:

[Bar chart of the proportion of females (0 to 1) in each study area: Aviation, Life Sci, Other, Science.]

11.25
Simpson's Paradox
We discussed lurking variables in Week 3 for quantitative variables, and how relationships are distorted when lurking variables are ignored.

The same phenomenon occurs with categorical variables, and is known as Simpson's Paradox.

11.26
On-time airlines

The Department of Transportation in the USA ranks airlines in terms of their on-time performance.

Based on 1991 data, America West received a better ranking than Alaska Airlines, because of the following data:

On-time Late
Alaska Airlines 2062 317
America West 5041 476

Caulkins et al. Operations Research (1993) 41, 710–720.

11.27
The conditional probabilities of each airline being on-time are

Alaska Airlines America West


86.7% on-time 91.4% on-time

and so America West planes were on-time more often.

11.28
But what happens when we consider the effects of the lurking variable:
Where the plane departed from

For this we need a three-way table of frequencies!


(For the categorical variables Airline, On-time and Where the plane
departed from)

11.29
Consider on-time performance of planes from each airline, but calcu-
lated separately according to whether they departed from Seattle or
Phoenix:

Seattle Phoenix
On-time Late On-time Late
Alaska Airlines 1841 305 221 12
America West 201 61 4840 415

11.30
On-time performance, conditional on airline and where the plane de-
parted from, can be summarised as follows:

Seattle Phoenix
Alaska Airlines 85.8% on-time 94.8% on-time
America West 76.7% on-time 92.1% on-time

At both airports, Alaska Airlines planes were more likely to run on-time than America West!

11.31
In summary:

Alaska Airlines has better on-time performance for planes leaving Seattle.

Alaska Airlines has better on-time performance for planes leaving Phoenix (and indeed for all airports in the study).

America West planes had better on-time performance overall (?!).

This is an example of Simpson's paradox.
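The reversal can be checked directly from the counts. An illustrative Python sketch (the course software is R, but the arithmetic is the same):

```python
# (on-time, late) counts for each airline, split by departure airport
# (the lurking variable).
data = {
    "Alaska Airlines": {"Seattle": (1841, 305), "Phoenix": (221, 12)},
    "America West":    {"Seattle": (201, 61),   "Phoenix": (4840, 415)},
}

def on_time_rate(on_time, late):
    return on_time / (on_time + late)

rates = {}
for airline, airports in data.items():
    total_on = sum(c[0] for c in airports.values())
    total_late = sum(c[1] for c in airports.values())
    rates[airline] = {
        "Seattle": on_time_rate(*airports["Seattle"]),
        "Phoenix": on_time_rate(*airports["Phoenix"]),
        "overall": on_time_rate(total_on, total_late),
    }
    print(airline, {k: round(v, 3) for k, v in rates[airline].items()})
```

Alaska Airlines wins at each airport separately (0.858 vs 0.767 in Seattle, 0.948 vs 0.921 in Phoenix), yet loses overall (0.867 vs 0.914): Simpson's paradox.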

11.32
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.

This reversal is called Simpson's paradox.

11.33
Why did this happen?

The reason why this happened in the airline example was that where a
plane departs from is an important lurking variable that is associated
with both airline and on-time performance:
Alaska Airlines sends most of its planes to airports like Seattle, whereas America West sends most planes through Phoenix.

Planes are more likely to be late when they depart an airport like Seattle (which doesn't have great weather) than when departing an airport like Phoenix.

11.34
Useful diagram:

11.35
What does this mean for me?

Any time you find an association in an observational study, you need to think about whether there are any lurking variables that might explain the association.

If there are, you should include these in your analyses, to avoid Simpson's paradox.

Or: design an experiment to remove these confounding factors.

11.36
Inference for two-way tables:
Introduction
Gender and study area
In a survey of MATH1041 students, the gender of students and their study area were recorded:

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Is there evidence that the gender ratio is different in different subject areas?

11.37
For this problem we need to test the hypothesis:

H0 : No association between gender and study area


against

Ha : There is an association between gender and study area

Note that another way to state H0 is

H0 : Study area is independent of gender


But what will we use as a test statistic? And what is the null distribution of this test statistic?

11.38
χ² tests for categorical data
To test for an association between two categorical variables:
Display observed counts in a two-way table.

Compute expected counts under H0:

expected count = (row total × column total) / n.

Compare observed to expected counts using:

X² = Σ (observed count − expected count)² / expected count.

Under H0, X² has a chi-square distribution with (r − 1)(c − 1) degrees of freedom (for an r × c table).

11.39
This chi-square test is appropriate when:
The n observations are independent.

The sample size is large enough that all expected counts > 10.

11.40
Expected counts under H0

Under the hypothesis that two categorical variables are independent, the expected counts can be calculated using the marginal counts in their respective rows and columns of the two-way table:

expected count = (row total × column total) / n.

11.41
Example Gender and study area

In a survey of MATH1041 students in Week 1, the gender of students and their study area were recorded.

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Find the expected counts, under the hypothesis that study area is
independent of gender.

11.42
Answer:
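The expected counts can be cross-checked on a computer. An illustrative Python sketch of the formula expected = row total × column total / n (in the course you would read these off R output):

```python
# Observed two-way table (rows: study areas; columns: Female, Male).
observed = {
    "Life Sci": (38, 31),
    "Medicine": (11, 13),
    "Science":  (51, 40),
    "Other":    (9, 14),
}

col_totals = [sum(row[j] for row in observed.values()) for j in (0, 1)]  # 109, 98
n = sum(col_totals)  # 207

# expected count = row total * column total / n, for every cell.
expected = {
    area: tuple(sum(row) * col_totals[j] / n for j in (0, 1))
    for area, row in observed.items()
}
for area, (f, m) in expected.items():
    print(area, round(f, 2), round(m, 2))
```

This gives, for example, 36.33 expected females in Life Sci (69 × 109 / 207). All expected counts exceed 10, so the chi-square test is appropriate here.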

11.43
Where does this formula come from?

P(Female) = 109/207

and if H0 is true then gender and study area are independent, i.e.

P(Female | Life Sci) = P(Female).

So under H0, we expect that the number of students who are female and in Life Sci is about

(# Life Sci) × P(Female) = 69 × 109/207 = (69 × 109)/207

which is just the formula

expected count = (row total × column total) / n.

11.44
Computing the chi-square test statistic

We use as a test statistic

X² = Σ (observed count − expected count)² / expected count

summed across all r × c cells in the table.

This statistic will be large if the observed counts are far from expected counts (that is, the larger X² is, the more evidence there is against H0).

11.45
Example Gender and study area
In a survey of MATH1041 students in Week 1, the gender of students and their study area were recorded.

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Find the value of the chi-square test statistic.

11.46
Answer:
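As a sketch of the computation (illustrative Python; in R, `chisq.test` does this in one call):

```python
observed = {
    "Life Sci": (38, 31),
    "Medicine": (11, 13),
    "Science":  (51, 40),
    "Other":    (9, 14),
}
col_totals = [sum(r[j] for r in observed.values()) for j in (0, 1)]
n = sum(col_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected over cells.
X2 = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        exp = row_total * col_totals[j] / n
        X2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (2 - 1)  # (r - 1)(c - 1) = 3
print(round(X2, 2), df)  # 2.72 on 3 degrees of freedom
```

Since 2.72 is well below 7.81 (the 5% critical value of χ²(3) from standard tables), there is no evidence of an association here.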

11.47
Finding the P-value for a chi-square test

Under H0, if we have an SRS for which all expected counts are larger than 10, then X² has a chi-square distribution with degrees of freedom (r − 1)(c − 1), where there are r rows and c columns in the two-way table.

So we can calculate the P-value as:

P(χ² > X²)

where χ² has a chi-square distribution with (r − 1)(c − 1) degrees of freedom. We write this as χ²(df) or χ²_df, where df = (r − 1)(c − 1).

11.48
The chi-what distribution?

We have not previously met the chi-square distribution ("chi" is pronounced as "kai", rhymes with "by"). Here are its important properties:

Like t, it is indexed by degrees of freedom (df).

It only takes positive values: X² > 0.

We will use tables to calculate probabilities.

Unlike t and Z, it is not symmetrical but right skewed.

χ²(df) has mean df.

(Also, χ² is related to Z ~ N(0, 1): χ²(1) = Z², and χ²(n) = Σᵢ₌₁ⁿ Zᵢ² for independent Zᵢ ~ N(0, 1).)
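The χ²(1) = Z² relationship can be illustrated by simulation. A hypothetical Python sketch (not part of the slides): squaring standard normal draws should give values whose mean is close to 1, since χ²(df) has mean df.

```python
import random

# Simulate: the square of a standard normal behaves like chi-square(1),
# whose mean is 1 (the mean of chi-square(df) is df).
random.seed(1)  # fixed seed so the sketch is reproducible
draws = [random.gauss(0, 1) ** 2 for _ in range(20000)]
mean_z2 = sum(draws) / len(draws)
print(round(mean_z2, 2))  # close to 1
```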

11.49
Some example chi-square density curves:
[Density curves of the chi-square distribution with 2, 4 and 8 degrees of freedom, each plotted over the range 0 to 20.]

11.50
Example Gender and study area
In a survey of MATH1041 students in Week 1, the gender of students and their study area were recorded.

Female Male Total


Life Sci 38 31 69
Medicine 11 13 24
Science 51 40 91
Other 9 14 23
Total 109 98 207

Is there evidence of an association between gender and study area?

Find the P-value of a chi-square test. Hence state your conclusion in plain language.

11.51
11.52
Inference for two-way tables
(Continued)
Based on Moore et al.: Introduction and Section 9.2

11.53
Lectures 3–4: Inference for two-way tables
Today we will continue to discuss the key methods for making inferences about the association between two categorical variables.
χ² tests for categorical data: more exercises

Comparing two proportions

11.54
Recall the strategy for chi-square tests for categorical data:

From an SRS of size n, we can test for an association between two categorical variables using a chi-square test:
Display observed counts in a two-way table.

Compute expected counts under H0:

expected count = (row total × column total) / n.

Compare observed to expected counts using:

X² = Σ (observed count − expected count)² / expected count.

P-value: compare X² to the chi-square distribution.

11.55
Cancers and where they occur

(From Roberts et al., Pathology (1981) 13:763–70.)

Malignant melanomas from 366 NSW patients were classified by Cancer Type and Cancer Site:

Cancer site
Cancer Type Head/neck Trunk Extremities
Superficial spreading 16 54 115
Nodular 19 33 73
Indeterminate 11 17 28

Is there evidence that where a cancer occurs is related to type of cancer?

Answer:

11.56
Lateness of QANTAS planes

The Sydney Morning Herald (Saturday, 16th August 2003) conducted a study to compare how late QANTAS flights depart, compared with Virgin Blue flights. The following results were obtained for afternoon/evening flights:

Virgin Blue QANTAS


0-3 min late 22 11
3-15 min late 6 21
> 15 min late 0 2

Is there evidence that one airline runs late more often than the other?

11.57
Answer:
First note that the expected counts in the "> 15 min late" row will be too small (they must be at least 10), so we will combine these with the 3-15 min late category:

Virgin Blue QANTAS


0-3 min late 22 11
> 3 min late 6 23

11.58
Comparing two proportions
When comparing two sample proportions, the data could be written as
a two-way table with two rows and two columns.

Unemployment rate

In April 2009, a Roy Morgan poll of 4,315 Australians found that 7.1%
were unemployed. A similar poll of 4,914 Australians more recently
found that 10.4% were unemployed.

Is there evidence that the unemployment rate changed?

11.59
First, consider the unemployment poll results expressed as the propor-
tion of unemployed people:

Poll date n X p̂ = X/n
April 2009 4,315 304 0.071
Recent poll 4,914 511 0.104

Re-express these results as a two-way table.

Answer:

11.60
When comparing proportions from two populations, there are two ways
we could do it:

As we have seen, we could use a chi-square test to test:

H0 : Success/failure is independent of population.

Or we could follow an alternate procedure demonstrated in Section 8.2 of Moore et al. utilising the proportions directly.

These two methods are mathematically equivalent, so we get the same result using either method.

11.61
Is there evidence that the unemployment rate changed?
Use a chi-square test to answer this question.
Answer:
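A sketch of the chi-square calculation for this 2 × 2 table (illustrative Python; in R you would use `chisq.test` or `prop.test`):

```python
# Re-express the two polls as a 2x2 table of counts:
# rows = poll, columns = (unemployed, employed).
table = [
    (304, 4315 - 304),   # April 2009
    (511, 4914 - 511),   # recent poll
]

col_totals = [sum(row[j] for row in table) for j in (0, 1)]
n = sum(col_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
X2 = 0.0
for row in table:
    row_total = sum(row)
    for j, obs in enumerate(row):
        exp = row_total * col_totals[j] / n
        X2 += (obs - exp) ** 2 / exp

print(round(X2, 1))  # about 32.1 on (2 - 1)(2 - 1) = 1 degree of freedom
```

Since 32.1 is far above 3.84 (the 5% critical value of χ²(1) from tables), there is very strong evidence that the unemployment rate changed.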

11.62
11.63
11.64
Inference for regression
Based on Moore et al.: Chapter 10

12.1
Inference for Regression
Today we will discuss inference for linear regression: a key statistical tool for making inferences about whether two quantitative variables are related.
Introduction

Linear regression revision

Why inference for regression?

Inferences about the slope β1

A bit more about regression inference

12.2
Data analysis for one or two variables

One variable, categorical:
useful graphs: bar chart; useful numbers: table of frequencies; useful test: 1-sample test for p (for a binary variable); useful inference: CI for p.

One variable, quantitative:
useful graphs: histogram or boxplot; useful numbers: mean and sd, or 5-number summary; useful test: 1-sample t-test (or Z-test if σ known); useful inference: CI for μ. (Paired data? Analyse differences.)

Two variables, both categorical:
useful graphs: clustered bar chart; useful numbers: 2-way table of frequencies; useful test: χ² test for independence. (Last week.)

Two variables, one categorical, one quantitative:
useful graphs: comparative boxplots; useful numbers: 5-number summary for each group; useful test: 2-sample t-test (for binary + quantitative); useful inference: CI for μ1 − μ2.

Two variables, both quantitative:
useful graphs: scatterplot; useful numbers: correlation or regression.
12.3
Data analysis for one or two variables

One variable, categorical:
useful graphs: bar chart; useful numbers: table of frequencies; useful test: 1-sample test for p (for a binary variable); useful inference: CI for p.

One variable, quantitative:
useful graphs: histogram or boxplot; useful numbers: mean and sd, or 5-number summary; useful test: 1-sample t-test (or Z-test if σ known); useful inference: CI for μ. (Paired data? Analyse differences.)

Two variables, both categorical:
useful graphs: clustered bar chart; useful numbers: 2-way table of frequencies; useful test: χ² test for independence.

Two variables, one categorical, one quantitative:
useful graphs: comparative boxplots; useful numbers: 5-number summary for each group; useful test: 2-sample t-test (for binary + quantitative); useful inference: CI for μ1 − μ2.

Two variables, both quantitative:
useful graphs: scatterplot; useful numbers: correlation or regression. (This lecture.)
12.4
Introduction
By the end of this section, we will be able to analyse data to address
questions such as the one below.

Is water quality in creeks related to the size of the catchment area?

Water catchment area and an index measuring water quality (IBI) are calculated for a sample of 20 creeks:
Catchment area (km2) 29 49 28 8 57 26
Water quality (IBI) 61 85 46 53 55 85
Is there evidence that water quality in a stream is related to water catchment area?

12.5
Drinking and blood alcohol levels

A study was conducted at a lunch function at the University of Western Sydney, to determine how many drinks you can have in an hour and still have a blood alcohol content (BAC) under the legal limit, 0.05. The following table gives blood alcohol content and number of standard drinks of white wine consumed in the hour for some of the subjects:
# drinks 4 4 6 6 3 3 1 3
BAC 0.025 0.04 0.07 0.065 0.015 0.02 0 0.015 . . .

How do you estimate the relationship between number of drinks and BAC?

12.6
Linear regression revision

When can we use regression?

We can use regression when we have measurements on two quantitative variables, an explanatory variable (X) and a response variable (Y), and we want to know the answer to a question like one of the following:

Are these two variables related? How is Y related to X?

Can we predict Y using X?

What is the predicted value of Y, for an observation on X?

12.7
To use linear regression, we need the data to be linearly related, i.e.
something like one of the following:

12.8
r² revision
r² measures the strength of the linear regression, which takes on the values:

0 ≤ r² ≤ 1

and is computed by

r² = (variance of ŷ values) / (variance of y values).

So r² is the % of variation in y that is explained by the linear regression.
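As an illustrative check of this definition (Python sketch, using the supermarket display-space data that appears later in these notes; in the course you would read R² off R output):

```python
x = [6, 3, 6, 9, 3, 9, 6, 3, 9]                    # display space (feet^2)
y = [526, 421, 581, 630, 412, 560, 434, 443, 590]  # sales ($/week)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares slope and intercept.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

# r^2 = variance of fitted values / variance of y values.
def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

r2 = var(fitted) / var(y)
print(round(r2, 3))  # 0.751: about 75% of variation in sales explained
```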

12.9
In Week 2 we fitted a linear regression to our data:

y = b0 + b1x + error

where b0 is the y-intercept, b1 is the slope of the line, and we assume the error is random scatter around the line.

We can use linear regression to calculate predicted values for y, using x to make the predictions. These fitted values for y are calculated using ŷ = b0 + b1x.

To check if we have random scatter around the line, we construct a residual plot, that is, a plot of y − ŷ versus x.

12.10
Least squares regression
A good way to estimate the line that best fits the data (for predicting
Y from X) is to use least squares regression.

[Diagram: scatterplot with fitted line; the line is chosen to minimise the sum of the squared lengths of the arrows (the vertical distances from the points to the line).]

12.11
To fit a least squares regression line, calculate the intercept and slope of the line using:

b1 = r · sy/sx   and   b0 = ȳ − b1·x̄.

The line constructed in this way will minimise the sum of the squared
errors as on the previous slide.
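An illustrative sketch of these formulas (Python, applied to the supermarket display-space data from this lecture; in the course you would use R's `lm`):

```python
import math

x = [6, 3, 6, 9, 3, 9, 6, 3, 9]                    # display space (feet^2)
y = [526, 421, 581, 630, 412, 560, 434, 443, 590]  # sales ($/week)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Standard deviations and correlation.
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Slope and intercept from the slide's formulas.
b1 = r * sy / sx
b0 = ybar - b1 * xbar
print(round(b1, 1), round(b0, 2))  # 28.0 and 342.78
```

The slope 28.0 matches the value quoted in the supermarket exercise later in this lecture.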

12.12
Why inference for regression?
Recall that often we only have a sample, and we want to make inferences about the population.

For linear regression, we want to make inferences about the true regression line:

y = β0 + β1x

based on an estimate of this line from a sample:

ŷ = b0 + b1x.

12.13
We need to take into account the sampling error in estimating the regression line: how much error is there in our estimates of b0 and b1, due to us only having a sample of data, rather than the whole population?

12.14
The fit for a sample of 20 streams:

95
90
85
80
Water Quality (IBI)

75
70
65
60
55
50
45
40 yb = 49.79 + 0.46 x
35
30
25
0 5 10 15 20 25 30 35 40 45 50 55 60
Water catchment area (square km)

12.15
The fit for a different sample of 20 streams:

95
90
85
80
Water Quality (IBI)

75
70
65
60
55
50
45
40 yb = 55.35 + 0.45 x
35
30
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
Water catchment area (square km)

12.16
Consider the following questions:

Is there a relationship between water catchment area and water quality?

How does average water quality index change as water catchment area increases by 1 square km?

What is the average water quality index equal to when the water catchment area is 40 km²?

12.17
In each case we want to say something general using the relationship
between water catchment area and water quality across all streams,
based on just a sample.

So we want to make inferences about the true regression line:

y = β0 + β1x

based on just our sample of 20 streams:

ŷ = b0 + b1x.

12.18
Inferences about the slope β1
In linear regression, the slope β1 is usually of primary interest: β1 tells us how Y and X are related.

If X changes by 1 unit, Y changes (on average) by β1.

β1 = 0 ⇒ no relationship between Y and X.

β1 > 0 ⇒ increasing relationship between Y and X.

β1 < 0 ⇒ decreasing relationship between Y and X.

12.19
Useful graphs:

12.20
Now we'll meet the key result for making inferences about β1.

Assume that:

yi = β0 + β1xi + εi

where each εi is independently sampled from a normal distribution with mean 0 and standard deviation σ.

This is sometimes known as the linear regression model.

12.21
If the linear regression model is appropriate,

(b1 − β1) / SE(b1) ~ t(n − 2)

where SE(b1) is the estimated standard error (SE) for b1.

We will not discuss how to calculate SE(b1) by hand, but you will need to know where to find it in computer output. . .

12.22
12.23
Testing for a relationship between Y and X

We can use this result to test for evidence of a relationship between Y and X. If there is no relationship between Y and X, then β1 = 0. Do you understand why?

We can test H0 : β1 = 0 using the test statistic

t = b1 / SE(b1)

which comes from a t(n − 2) distribution if H0 is true.

12.24
As always, the alternative hypothesis Ha is important for determining the P-value. Ha tells us what sorts of departures from H0 we are interested in measuring.

If Ha : β1 > 0 then the P-value = P(T ≥ t), where T ~ t(n − 2).

If Ha : β1 < 0 then the P-value = P(T ≤ t).

If Ha : β1 ≠ 0 then the P-value = 2P(T ≥ |t|).

12.25
Note that the R linear regression output automatically calculates the t-statistic and two-sided P-value for a test of this hypothesis.

Can you find the test statistic in the water quality output? (slides
12.37 and 12.38)

What is the P-value when Ha : β1 > 0?

12.26
Answer:

12.27
Exercise: Supermarket display space

An experiment was conducted in a supermarket to explore the relation between the amount of display space allotted to a brand of coffee and its weekly sales. This yielded the following data:
space (feet²) 6 3 6 9 3 9 6 3 9
sales ($/week) 526 421 581 630 412 560 434 443 590
A linear regression was fitted, and the estimated slope was 28.0 with
a standard error of 6.1.

Is there evidence that coffee sales increase with display space?

12.28
Answer:
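A sketch of the test-statistic calculation (illustrative Python; the critical value quoted is an assumed value from standard t-tables):

```python
# Estimated slope and its standard error from the fitted regression.
b1, se_b1 = 28.0, 6.1
n = 9

# Test H0: beta1 = 0 against Ha: beta1 > 0 (sales increase with space).
t = b1 / se_b1
df = n - 2
print(round(t, 2), df)  # t = 4.59 on 7 degrees of freedom
```

With Ha: β1 > 0 the P-value is P(T ≥ 4.59); since the 0.5% upper critical value of t(7) is about 3.50 (from tables), P < 0.005, so there is strong evidence that coffee sales increase with display space.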

12.29
Constructing a confidence interval for 1

Often we are interested not only in testing for a relationship between Y and X, but also in finding out how Y changes as X changes. The true slope β1 tells us this information.

A level C confidence interval for β1 is:

( b1 − t*·SE(b1) , b1 + t*·SE(b1) )

where t* is the value from the t(n − 2) distribution for which the area between −t* and t* is C.

12.30
Exercise: Supermarket display space

Consider again the experiment that was conducted to explore the relation between the amount of display space allotted to a brand of coffee and its weekly sales.

A linear regression was fitted to 9 datapoints, and the estimated slope was 28.0 with a standard error of 6.1.

Construct a 95% CI for the amount by which sales is expected to increase if display space is increased by 1 square foot.

Answer:
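A sketch of the interval calculation (illustrative Python; the t* value is taken from standard t-tables):

```python
b1, se_b1 = 28.0, 6.1
df = 9 - 2  # n - 2 = 7

# 95% critical value t* from the t(7) table (about 2.365).
t_star = 2.365
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(round(ci[0], 2), round(ci[1], 2))  # roughly (13.57, 42.43)
```

So we estimate that each extra square foot of display space increases weekly sales by somewhere between about $14 and $42, with 95% confidence.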

12.31
Assumptions
Recall the linear regression model, which we assume is true when making inferences about β1:

For each data point (xi, yi) we assume that

yi = β0 + β1xi + εi

where each εi is independently sampled from a normal distribution with mean 0 and standard deviation σ.

12.32
This model can be broken down into four key assumptions:

The mean of Y (μY) has a linear relationship with X.

The errors from the line (εi) are normally distributed.

The Yi observations are independent.

The errors from the line (εi) have the same variance at each x value.
How important are these assumptions?
What do we need to check for each assumption?

12.33
The importance of assumptions

Straight line relationship between y and x is a crucial assumption: there is no point fitting a straight line to Y and X if they are not linearly related!

Errors are normally distributed: it turns out that the CLT guarantees that b1 is approximately normal even if errors are not normal, so this is not an important assumption for large n. Use the checks outlined on slide 7.37 to see if any departure from normality is potentially important.

12.34
The Yi are independent: The subjects must be unrelated. How
could you guarantee that this is satisfied?

Errors from the line have the same spread for all x: The standard errors of parameter estimates can be biased if this assumption is not satisfied.

We need to check the first, second and last assumptions for our data (although the second is not important for large sample sizes).

12.35
Residual vs fits plot

You met residual vs fits plots in Week 2; they are useful for detecting whether linear regression is reasonable for your data.

A residual vs fits plot (y − ŷ vs ŷ) can be used to check if the data are linearly related and if the variance is constant for different values of x.

There should be no pattern on a residual plot; if there is, we cannot make inferences about the true regression line using the methods described in this lecture.

12.36
e.g. The water quality data:
[Residual plot for the water quality data: residuals versus fitted values.]

12.37
If there is a distinct pattern, we can't use standard methods to make inferences about the regression line.
e.g. if there is a U-shaped pattern, the relationship is non-linear (maybe quadratic):

[Two panels: a scatterplot of Y against X showing a curved trend, and the corresponding plot of residuals versus fitted values showing a U-shaped pattern.]

and so we should not be fitting a straight line to the data.

12.38
e.g. if there is a fan shape, then the spread increases with x:

[Two panels: a scatterplot of Y against X with spread increasing in X, and the corresponding plot of residuals versus fitted values showing a fan shape.]

and so we should not be assuming that the errors from the line have the same spread for all x, and we need to use a different method of inference (not covered in MATH1041).

12.39
Normal quantile plot of residuals
Because we are assuming that errors from the regression line are normally distributed, we can check this using a normal quantile plot.

As it turns out, CLT guarantees that b1 is approximately normal even if errors are not normal, as long as n is large enough. Use the rules of thumb outlined on slide 7.37 to see if any departure from normality is potentially important.

12.40
Does this plot suggest that any linear regression model assumptions are reasonable for the water quality data?

[Normal quantile plot of the regression residuals for the water quality data: regression residuals versus theoretical quantiles.]

12.41
Sales/display space example
Is it reasonable to fit a linear regression? List the assumptions and check them using the following plots.

[Two diagnostic plots for the sales/display space regression: residuals versus fitted values, and a normal quantile plot of the regression residuals.]

12.42
Answer:

12.43
A bit more about regression inference

Standard error:
A formula for SE(b1) is:

SE(b1) = s / (sx · √(n − 1))

where s and sx are the standard deviations of the errors (ε) and X-values, respectively.

Notice that SE(b1) decreases as:

the sample size (n) increases;

the variability in the X values increases (sx); and

the variability around the fitted line decreases (s).

We can control the first two of these quantities in our study design, so these are the ones to change if you want to get a more precise estimate of b1.
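An illustrative check of this formula on the supermarket display-space data (Python sketch; here s is estimated as the standard deviation of the residuals, dividing by n − 2 as is standard for regression, an assumption of this sketch):

```python
import math

x = [6, 3, 6, 9, 3, 9, 6, 3, 9]
y = [526, 421, 581, 630, 412, 560, 434, 443, 590]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares fit.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

# s: standard deviation of the residuals (divide by n - 2 for regression).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# sx: standard deviation of the x values.
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

se_b1 = s / (sx * math.sqrt(n - 1))
print(round(se_b1, 1))  # 6.1, matching the standard error quoted earlier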
12.44
Why does CLT ensure b1 is approximately normal?

As it turns out, the Central Limit Theorem doesn't just work for averages and sums, but for weighted sums too:

b1 = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / Σᵢ₌₁ⁿ (xi − x̄)²

which is a weighted sum of the residuals (yi − ȳ), the weights equalling xi − x̄.

So CLT implies that b1 is approximately normal, if n is large enough, irrespective of the distribution of residuals.

12.45
Exercise: Drinking and blood alcohol levels
A study was conducted on 22 staff/students at a lunch function at the
University of Western Sydney, to determine how many drinks you can
have in an hour and still have a blood alcohol content (BAC) under
the legal limit, 0.05.

[Scatterplot of blood alcohol level (mg/mL) against number of standard drinks of wine (1 to 9).]
12.46
Use the following results to answer these questions:
Is the relationship between number of drinks and BAC statistically
significant at the 0.05 level?

How strong is the relationship?

Construct a 95% confidence interval for the average increase in BAC when you have another drink.

Predict your blood alcohol level after 4 standard drinks.

What assumptions did you make in the above? Do you think these assumptions are reasonable?

12.47
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004375 0.014402 -0.304 0.76441
Wine 0.010889 0.002988 3.644 0.00161

Residual standard error: 0.02728 on 20 degrees of freedom


Multiple R-Squared: 0.399, Adjusted R-squared: 0.369
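A sketch of how the questions above can be answered from this output (illustrative Python; the t* value is an assumed value from t-tables):

```python
# Coefficients from the R output above.
b0, b1 = -0.004375, 0.010889
se_b1 = 0.002988
df = 20

# Test statistic for H0: beta1 = 0 (matches the "t value" column).
t = b1 / se_b1
print(round(t, 3))  # 3.644, so significant at the 0.05 level

# 95% CI for the slope; t* for t(20) is about 2.086 (from tables).
t_star = 2.086
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(round(ci[0], 4), round(ci[1], 4))

# Predicted BAC after 4 standard drinks.
pred = b0 + b1 * 4
print(round(pred, 3))  # about 0.039, just under the 0.05 limit
```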

12.48
[Two diagnostic plots for the BAC regression: residuals versus fitted values, and a normal QQ plot of the regression residuals.]

12.49
Data analysis for one or two variables

One variable, categorical:
useful graphs: bar chart; useful numbers: table of frequencies; useful test: 1-sample test for p (for a binary variable); useful inference: CI for p.

One variable, quantitative:
useful graphs: histogram or boxplot; useful numbers: mean and sd, or 5-number summary; useful test: 1-sample t-test (or Z-test if σ known); useful inference: CI for μ. (Paired data? Analyse differences.)

Two variables, both categorical:
useful graphs: clustered bar chart; useful numbers: 2-way table of frequencies; useful test: χ² test for independence.

Two variables, one categorical, one quantitative:
useful graphs: comparative boxplots; useful numbers: 5-number summary for each group; useful test: 2-sample t-test (for binary + quantitative); useful inference: CI for μ1 − μ2.

Two variables, both quantitative:
useful graphs: scatterplot; useful numbers: correlation or regression; useful test: test for the regression slope (β1); useful inference: CI for β1. (This lecture.)
12.50
Revision

12.51
Which method do you use when?

We've discussed lots of different ways to analyse data.

Which method do you use in which situation?

The way to work this out is to look at what the key research question is that you are trying to answer.

(It might also help to look at the data to be analysed.)

12.52
See what the research question tells you about the data and the analysis goal; then you will know how to analyse the data.

How you analyse data depends on the data:

How many variables are involved in the research question?

Are they quantitative or categorical?

(If comparing two samples, are they paired?)

How you analyse data depends on the analysis goal:

Is this a descriptive study (graphs/numerical summaries), or do we want to use the sample to make general statements (inferences)?

If making inferences, is there a specific claim we want to test?

12.53
Example
Recall the following exercise:

A survey of 224 male MATH1041 students found that the average amount they spent at the hairdressers was $23.04, with a standard deviation of $33.02.

The university hairdresser charges $35 for a male haircut.

Is there evidence that the hairdressers are overcharging? (In the sense that they are charging more than the average amount a male student would pay?)

What does the research question tell us about the data and the analysis
goal?

12.54
Answer:

The data to be analysed:


# variables in the research question:

Quantitative or categorical:
The analysis goal:
Descriptive study or making inferences?

Specific claim to test?

12.55
The following slides summarise the methods of data analysis discussed
in this course, and where to find the lecture notes on each method.

Note that:

The columns of the graphic are about the data (how many variables, quantitative or categorical).

The rows are about the analysis goal (numerical or graphical summary, hypothesis test or confidence interval).

See if you can use these slides to work out which method of analysis should be used for the haircut example above.

Try using these slides to work out which analysis method to use for each data analysis question in last year's exam!

12.56
Data analysis for one or two variables

One variable, categorical:
useful graphs: bar chart; useful numbers: table of frequencies; useful test: 1-sample test for p (for a binary variable); useful inference: CI for p.

One variable, quantitative:
useful graphs: histogram or boxplot; useful numbers: mean and sd, or 5-number summary; useful test: 1-sample t-test (or Z-test if σ known); useful inference: CI for μ. (Paired data? Analyse differences.)

Two variables, both categorical:
useful graphs: clustered bar chart; useful numbers: 2-way table of frequencies; useful test: χ² test for independence.

Two variables, one categorical, one quantitative:
useful graphs: comparative boxplots; useful numbers: 5-number summary for each group; useful test: 2-sample t-test (for binary + quantitative); useful inference: CI for μ1 − μ2.

Two variables, both quantitative:
useful graphs: scatterplot; useful numbers: correlation or regression; useful test: test for the regression slope (β1); useful inference: CI for β1.
12.57
Where to find these methods in your notes

Graphs and numerical summaries: Week 1.

One variable, categorical (1-sample test and CI for p): Week 9, lectures 1–2.

One variable, quantitative (1-sample t-test, or Z-test if σ known; CI for μ): Week 9, lectures 3–4. Paired data: Week 10, lecture 4.

Two variables, both categorical (χ² test for independence): Week 11.

Two variables, one categorical, one quantitative (2-sample t-test; CI for μ1 − μ2): Week 10, lectures 2–4.

Two variables, both quantitative (test and CI for the regression slope β1): Week 12.
12.58
End of a Long Journey
All MATH1041 concepts have now been presented.

So if you have been keeping up, then you are now statistically literate.

CONGRATULATIONS!

12.59
