
Sampling

Probability sample
Non-probability sample
Statistical inference
Sampling error

Source: inference.ppt - Aki Taanila

Probability sample
Goal: a representative sample, i.e. a miniature of the population.
You can use simple random sampling, systematic sampling, stratified sampling, cluster sampling, or a combination of these methods to obtain a probability sample.
With a probability sample you can draw conclusions about the whole population.

Simple Random
(Diagram: a sample is drawn completely at random from the population.)

Systematic
Select a sampling interval, e.g. every fifth element.
Choose one element at random from among the first five (or whatever the sampling interval is).
Starting from the chosen element, pick every fifth element (or whatever the interval is).
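
A minimal sketch of systematic sampling in Python (the numbered population and the interval of 5 are illustrative assumptions):

import random

def systematic_sample(population, interval=5):
    # Choose a random start among the first `interval` elements,
    # then take every `interval`-th element from there on.
    start = random.randrange(interval)
    return population[start::interval]

units = list(range(1, 101))                      # e.g. 100 numbered units
print(systematic_sample(units, interval=5))      # 20 units, every 5th from a random start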

Stratified
Divide the population into groups (strata), e.g. age groups 18-29, 30-49, 50-64, 65+, and draw a random sample from every stratum. This guarantees that all the groups are represented as in the population.
Proportional allocation: each stratum gets a share of the sample equal to its share of the population.
Even allocation: equally sized samples from every stratum, useful when the aim is to compare groups.

Cluster
Divide the population into clusters (schools, districts, ...).
Choose some of the clusters at random.
Draw a sample from the chosen clusters using an appropriate sampling method (or investigate the chosen clusters in their entirety).
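
A minimal sketch of cluster sampling (the cluster structure and the choice of 2 clusters with 5 units each are illustrative assumptions):

import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None):
    # clusters: dict mapping cluster name -> list of units.
    chosen = random.sample(list(clusters), n_clusters)        # pick clusters at random
    sample = []
    for name in chosen:
        units = clusters[name]
        if n_per_cluster is None:
            sample.extend(units)                              # investigate the whole cluster
        else:
            sample.extend(random.sample(units, n_per_cluster))
    return sample

schools = {f"school_{i}": list(range(i * 100, i * 100 + 30)) for i in range(10)}
print(len(cluster_sample(schools, n_clusters=2, n_per_cluster=5)))   # 10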

Non-probability sample
When a sample is not drawn randomly, it is called a non-probability sample.
For example, using the most easily available elements, as in self-selecting surveys or street interviews.
With a non-probability sample you should not draw conclusions about the whole population.

Statistical inference
Statistical inference: drawing conclusions about the whole population on the basis of a sample.
Precondition for statistical inference: the sample is randomly selected from the population (i.e. it is a probability sample).

Sampling Error
(Diagram: population mean 40.8; sample 1 mean 40.5, sample 2 mean 40.3, sample 3 mean 41.4.)
Different samples from the same population give different results.
This variation is due to chance.

Sampling distributions
Mean: normal distribution or t-distribution
Proportion: normal distribution

In real life, calculating the parameters of populations is prohibitive because populations are very large.
Rather than investigating the whole population, we take a sample, calculate a statistic related to the parameter of interest, and make an inference.
The sampling distribution of the statistic is the tool that tells us how close the statistic is to the parameter.

Distribution of a statistic
Statistics follow distributions too, but the distribution of a statistic is a theoretical construct.
Statisticians pose a thought experiment: how much would the value of the statistic fluctuate if one could repeat a particular study over and over again with different samples of the same size?
By answering this question, statisticians are able to pinpoint exactly how much uncertainty is associated with a given statistic.

Sampling distribution
Most statistical inference methods are based on sampling distributions.
You can apply statistical inference without knowing sampling distributions.
Still, it is useful to know at least the basic idea of a sampling distribution.

Notation for Samples and Populations

Statistic (from sample)           Parameter (from entire population)
x̄ = sample mean                   μ = population mean
s² = sample variance              σ² = population variance
s = sample standard deviation     σ = population standard deviation

Properties of the Standard Deviation, s

1. s measures the variability in a sample of measurements. It is a measure of how much the sample values deviate from the sample mean.
2. s is a nonnegative number. If all the numbers in a sample are equal, the standard deviation is zero; this is the smallest possible value of the standard deviation.
3. When comparing two samples of data, the sample that is more variable will have the larger standard deviation.

9.2 The Sampling Distribution of the Sample Mean


When you take a sample, compute a statistic, repeat the
process a large number of times, and then make a
histogram of the statistics you observed, you are
examining the sampling distribution of the statistic.
Under special conditions, some of these distributions
(histograms) will begin to resemble a normal distribution.

Sampling Distribution of the Mean

A fair die is thrown infinitely many times, with the random variable X = # of spots on any throw.
The probability distribution of X is:

x      1    2    3    4    5    6
P(x)  1/6  1/6  1/6  1/6  1/6  1/6

and the mean and variance are calculated as well:
μ = Σ x·P(x) = 3.5 and σ² = Σ (x − μ)²·P(x) ≈ 2.92

Throwing a die twice: the sample mean

All 36 equally likely samples of size n = 2 and their sample means x̄:

Sample  x̄      Sample  x̄      Sample  x̄
1,1     1.0    3,1     2.0    5,1     3.0
1,2     1.5    3,2     2.5    5,2     3.5
1,3     2.0    3,3     3.0    5,3     4.0
1,4     2.5    3,4     3.5    5,4     4.5
1,5     3.0    3,5     4.0    5,5     5.0
1,6     3.5    3,6     4.5    5,6     5.5
2,1     1.5    4,1     2.5    6,1     3.5
2,2     2.0    4,2     3.0    6,2     4.0
2,3     2.5    4,3     3.5    6,3     4.5
2,4     3.0    4,4     4.0    6,4     5.0
2,5     3.5    4,5     4.5    6,5     5.5
2,6     4.0    4,6     5.0    6,6     6.0

Sampling Distribution of Two Dice

A sampling distribution is created by looking at all samples of size n = 2 (i.e. two dice) and their means.
While there are 36 possible samples of size 2, there are only 11 possible values for x̄, and some (e.g. x̄ = 3.5) occur more frequently than others (e.g. x̄ = 1).
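
A minimal simulation of this sampling distribution in Python (the 100,000 replications are an arbitrary illustrative choice):

import random
from collections import Counter

reps = 100_000
means = [(random.randint(1, 6) + random.randint(1, 6)) / 2 for _ in range(reps)]

counts = Counter(means)
for value in sorted(counts):
    print(f"x-bar = {value:3.1f}   relative frequency = {counts[value] / reps:.3f}")
# The relative frequencies approach 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36.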

Sampling Distribution of Two Dice

The sampling distribution of x̄ is shown below:

x̄      1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
P(x̄)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Compare

Compare the distribution of X (uniform on the values 1, ..., 6) with the sampling distribution of x̄ above: x̄ is centred on the same value but is less spread out.

As well, note that μ_x̄ = μ and σ²_x̄ = σ²/2.

Generalize
We can generalize the mean and variance of the sampling distribution of two dice,
μ_x̄ = μ and σ²_x̄ = σ²/2,
to n dice:
μ_x̄ = μ and σ²_x̄ = σ²/n.
The standard deviation of the sampling distribution is called the standard error:
σ_x̄ = σ/√n.

Central Limit Theorem

The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size.
The larger the sample size, the more closely the sampling distribution of x̄ will resemble a normal distribution.
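
A minimal sketch illustrating the CLT by simulation (the skewed exponential population and the chosen sample sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

for n in (5, 10, 25):
    # 10,000 samples of size n from an exponential population with mean 1 and sd 1
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of x-bar = {sample_means.mean():.3f}, "
          f"sd of x-bar = {sample_means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
# As n grows, a histogram of the sample means looks more and more normal,
# and its standard deviation shrinks like sigma / sqrt(n).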

Sampling Distribution of the Mean (throwing n dice)

n = 5:    μ_x̄ = 3.5,  σ²_x̄ = 2.9167/5  ≈ 0.5833
n = 10:   μ_x̄ = 3.5,  σ²_x̄ = 2.9167/10 ≈ 0.2917
n = 25:   μ_x̄ = 3.5,  σ²_x̄ = 2.9167/25 ≈ 0.1167

Central Limit Theorem

If the population is normal, then x̄ is normally distributed for all values of n.
If the population is non-normal, then x̄ is approximately normal only for larger values of n.
In many practical situations, a sample size of 30 may be sufficiently large to allow us to use the normal distribution as an approximation for the sampling distribution of x̄.

Sampling Distribution of the Sample Mean

1. μ_x̄ = μ
2. σ_x̄ = σ/√n
3. If X is normal, x̄ is normal. If X is nonnormal, x̄ is approximately normal for sufficiently large sample sizes.
Note: the definition of "sufficiently large" depends on the extent of nonnormality of X (e.g. heavily skewed, multimodal).

Interpreting Standard Deviation

The standard deviation can be used to
compare the variability of several distributions, and
make a statement about the general shape of a distribution.

The empirical rule: if a sample of observations has a mound-shaped distribution, then the interval
(x̄ − s, x̄ + s) contains approximately 68% of the measurements,
(x̄ − 2s, x̄ + 2s) contains approximately 95% of the measurements, and
(x̄ − 3s, x̄ + 3s) contains approximately 99.7% of the measurements.
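
A quick numeric check of the empirical rule on simulated mound-shaped data (the normal sample of size 10,000 is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=10_000)      # mound-shaped sample
xbar, s = x.mean(), x.std(ddof=1)

for k, target in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    share = np.mean((x > xbar - k * s) & (x < xbar + k * s))
    print(f"within {k}s of the mean: {share:.1%} (empirical rule: about {target})")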

Parameter Estimation
Parameter and its estimate
Error margin

Parameter estimation
The objective is to estimate an unknown population parameter using a value calculated from the sample.
The parameter may be, for example, a mean or a proportion.

Statistic (from sample)            estimates    Parameter (from entire population)
Mean: x̄                            estimates    μ
Standard deviation: s              estimates    σ
Proportion: p̂                      estimates    p

Population: the mean μ is unknown. From a sample we can give
a point estimate: x̄ = 50, or
an interval estimate: "I am 95% confident that μ is between 40 and 60."
Parameter = statistic ± its error

Error margin
A value calculated from the sample is the best guess when estimating the corresponding population value.
The estimate is still uncertain due to sampling error.
The error margin is a measure of this uncertainty.
Using the error margin you can state a confidence interval: estimate ± error margin.

Error margin for mean - σ known

If the population standard deviation σ is known, then the error margin for the population mean is
1.96 · σ/√n

We can be 95% sure that the population mean lies in the interval (95% confidence interval):
x̄ − 1.96·σ/√n ≤ μ ≤ x̄ + 1.96·σ/√n
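
A minimal sketch of this interval in Python (the sample values and the assumed known σ = 4 are illustrative):

import math

sample = [18.2, 21.5, 19.8, 22.1, 20.4, 19.1, 20.9, 21.7]   # hypothetical data
n = len(sample)
xbar = sum(sample) / n
sigma = 4.0                                  # assumed known population standard deviation

error_margin = 1.96 * sigma / math.sqrt(n)
print(f"95% CI for the mean: {xbar - error_margin:.2f} to {xbar + error_margin:.2f}")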

Error margin for mean - σ unknown

If the population standard deviation is unknown, then the error margin for the population mean is
t_critical · s/√n

We can be 95% sure that the population mean lies in the interval (95% confidence interval):
x̄ − t_critical·s/√n ≤ μ ≤ x̄ + t_critical·s/√n
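
A minimal sketch using the t-distribution (the sample values and the use of scipy.stats for the critical value are illustrative assumptions):

import math
from scipy import stats

sample = [18.2, 21.5, 19.8, 22.1, 20.4, 19.1, 20.9, 21.7]       # hypothetical data
n = len(sample)
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))    # sample standard deviation

t_critical = stats.t.ppf(0.975, df=n - 1)                        # two-sided 95% critical value
error_margin = t_critical * s / math.sqrt(n)
print(f"95% CI for the mean: {xbar - error_margin:.2f} to {xbar + error_margin:.2f}")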

Confidence level
The confidence level can be selected to be different from 95%.
If the population standard deviation σ is known, then the critical value is calculated from the normal distribution.
Ex. In Excel, =-NORMSINV(0,005) gives the critical value for the 99% confidence level (0,005 is half of 0,01).
If the population standard deviation σ is unknown, then the critical value is calculated from the t-distribution.
Ex. In Excel, =TINV(0,01;79) gives the critical value when the sample size is 80 and the confidence level is 99%.
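
Equivalent critical values can be computed in Python with scipy.stats (a sketch; the 99% level and the sample size of 80 follow the Excel examples above):

from scipy import stats

# Normal critical value for 99% confidence (sigma known):
z_crit = stats.norm.ppf(1 - 0.005)           # counterpart of =-NORMSINV(0,005)
print(round(z_crit, 3))                       # about 2.576

# t critical value for 99% confidence with sample size 80 (sigma unknown):
t_crit = stats.t.ppf(1 - 0.005, df=79)        # counterpart of =TINV(0,01;79)
print(round(t_crit, 3))                       # about 2.64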

Inference
Two ways to make inference:
Estimation of parameters
* Point estimation (x̄ or p̂)
* Interval estimation
Hypothesis testing

Hypothesis testing
Null hypothesis
Alternative hypothesis
2-tailed or 1-tailed
P-value

Hypothesis 1
A hypothesis is a belief concerning a parameter.
The parameter may be a population mean, proportion, correlation coefficient, ...
"I believe that the mean weight of cereal packages is 300 grams!"

Hypothesis 2
The null hypothesis is the prevalent opinion, previous knowledge, a basic assumption, the prevailing theory, ...
The alternative hypothesis is the rival opinion.
The null hypothesis is assumed to be true as long as we find no evidence against it.
If a sample gives strong enough evidence against the null hypothesis, then the alternative hypothesis comes into force.

Hypothesis examples
H0: The mean height of males equals 174 cm. H1: The mean height is greater than 174 cm.
H0: Half of the population is in favour of the nuclear power plant. H1: More than half of the population is in favour of the nuclear power plant.
H0: The amount of overtime work is equal for males and females. H1: The amount of overtime work is not equal for males and females.
H0: There is no correlation between the interest rate and the gold price. H1: There is a correlation between the interest rate and the gold price.

2-tailed Test
Use a 2-tailed test if there is no reason for a 1-tailed test.
In a 2-tailed test, deviations from the null hypothesis in both directions are interesting.
The alternative hypothesis takes the form "different from".

1-tailed Test
In a 1-tailed test we know beforehand that only deviations in one direction are possible or interesting.
The alternative hypothesis takes the form "less than" or "greater than".

Logic behind hypothesis testing

(Diagram: a population of people. The prevalent opinion is that the mean age in the group is 50 (the null hypothesis). A random sample gives a mean age of 45.)

"Reject the null hypothesis! The sample mean is only 45!"

Risk of being wrong

Not guilty until proved otherwise! Likewise, the null hypothesis remains valid until proved otherwise.
Sometimes an innocent person is proved guilty. The same may happen in hypothesis testing: we may reject the null hypothesis although it is true.
There is always a risk of being wrong when we reject the null hypothesis; the risk is due to sampling error.

Significance Level
When we reject the null hypothesis there is a risk of drawing a wrong conclusion.
The risk of drawing a wrong conclusion (called the p-value, or observed significance level) can be calculated.
The researcher decides the maximum risk (called the significance level) he or she is ready to take.
The usual significance level is 5%.

P-value
We start from the basic assumption that the null hypothesis is true.
The p-value is the probability of getting a value equal to or more extreme than the sample result, given that the null hypothesis is true.
Decision rule: if the p-value is less than 5%, reject the null hypothesis; if the p-value is 5% or more, the null hypothesis remains valid.
In any case, you must give the p-value as a justification for your decision.

Steps in hypothesis testing

1. Set the null hypothesis and the alternative hypothesis.
2. Calculate the p-value.
3. Decision rule: if the p-value is less than 5%, reject the null hypothesis; otherwise the null hypothesis remains valid. In any case, give the p-value as a justification for your decision.

Testing mean
Null hypothesis: the mean equals x0.
Alternative hypothesis (2-tailed): the mean is different from x0.
Alternative hypothesis (1-tailed): the mean is less than x0.
Alternative hypothesis (1-tailed): the mean is greater than x0.

Testing mean - σ known

Calculate the standardized sample mean:
z = (x̄ − μ) / (σ/√n)

Calculate the p-value, which indicates how likely it is to get this kind of value if we assume that the null hypothesis is true.
In Excel you can calculate the p-value: =NORMSDIST(-ABS(z))
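
A minimal sketch of this z-test in Python (the summary numbers and the null mean of 300 echo the cereal-package example and are illustrative assumptions):

import math
from scipy import stats

xbar, mu0, sigma, n = 297.5, 300.0, 10.0, 64     # hypothetical summary values
z = (xbar - mu0) / (sigma / math.sqrt(n))

p_one_tailed = stats.norm.cdf(-abs(z))           # counterpart of =NORMSDIST(-ABS(z))
p_two_tailed = 2 * p_one_tailed
print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")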

Testing mean - σ unknown

Calculate the standardized sample mean:
t = (x̄ − μ) / (s/√n)

Calculate the p-value, which indicates how likely it is to get this kind of value if we assume that the null hypothesis is true.
In Excel you can calculate the p-value: =TDIST(ABS(t),degrees of freedom,tails); in this case the degrees of freedom equal n-1, and tails defines whether you use a one-tailed (1) or two-tailed (2) test.
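
A minimal sketch of the corresponding t-test (the raw data are illustrative; scipy.stats.ttest_1samp is used for convenience):

from scipy import stats

weights = [298.2, 301.5, 296.3, 299.1, 302.4, 295.8, 300.2, 297.7]   # hypothetical weights
t_stat, p_two_tailed = stats.ttest_1samp(weights, popmean=300.0)      # two-tailed p-value

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.4f}")
# The one-tailed p-value is half of the two-tailed p-value when the sample mean
# deviates in the direction stated by the alternative hypothesis.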

The Central Limit Theorem

If all possible random samples, each of size n, are taken from any population with mean μ and standard deviation σ, the sampling distribution of the sample means (averages) will:
1. have mean: μ_x̄ = μ
2. have standard deviation: σ_x̄ = σ/√n
3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n)

Central Limit Theorem caveats for small samples

For small samples:
The sample standard deviation is an imprecise estimate of the true standard deviation (σ); this imprecision changes the distribution to a t-distribution.
A t-distribution approaches a normal distribution for large n (around 100), but has fatter tails for small n (<100).
If the underlying distribution is non-normal, the distribution of the means may be non-normal.
More on t-distributions next week!

Summary: Single population mean (large n)

Hypothesis test:
Z = (observed mean − null mean) / (s/√n)

Confidence interval:
confidence interval = observed mean ± Z(α/2) · (s/√n)

Single population mean (small n, normally distributed trait)

Hypothesis test:
T(n−1) = (observed mean − null mean) / (s/√n)

Confidence interval:
confidence interval = observed mean ± T(n−1, α/2) · (s/√n)

Examples of Sample Statistics:


Single population mean
Single population proportion
Difference in means (t-test)
Difference in proportions (Z-test)
Odds ratio/risk ratio
Correlation coefficient
Regression coefficient

Distribution of a correlation coefficient

Computer simulation:
1. Specify the true correlation coefficient: correlation coefficient = 0.15.
2. Select a random sample of 100 virtual men from the population.
3. Calculate the correlation coefficient for the sample.
4. Repeat steps (2) and (3) 15,000 times.
5. Explore the distribution of the 15,000 correlation coefficients.
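
A minimal version of this simulation in Python (the bivariate normal population used to generate data with true correlation 0.15 is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(2)
true_r, n, reps = 0.15, 100, 15_000
cov = [[1.0, true_r], [true_r, 1.0]]                     # population correlation 0.15

r_values = np.empty(reps)
for i in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T   # sample of 100 "men"
    r_values[i] = np.corrcoef(x, y)[0, 1]

print(f"mean of r = {r_values.mean():.3f}, standard error = {r_values.std():.3f}")
# Roughly: mean 0.15 and standard error about 0.10, as reported on the next slide.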

Distribution of a correlation coefficient
Normally distributed!
Mean = 0.15 (true correlation)
Standard error = 0.10

Distribution of a correlation coefficient, in general

1. Shape of the distribution:
normally distributed for large samples,
t-distribution for small samples (n < 100).
2. Mean = the true correlation coefficient (r).
3. Standard error ≈ √((1 − r²)/n)

Many statistics follow normal (or t-) distributions

Means / differences in means (t-distribution for small samples)
Proportions / differences in proportions
Regression coefficients (t-distribution for small samples)
Natural log of the odds ratio

Estimation (confidence intervals)


What is a good estimate for the true mean
vitamin D in the population (the population
parameter)?
63 nmol/L +/- margin of error

95% confidence interval


Goal: capture the true effect (e.g., the true
mean) most of the time.
A 95% confidence interval should include the
true effect about 95% of the time.
A 99% confidence interval should include the
true effect about 99% of the time.

Recall the 68-95-99.7 rule for normal distributions! There is a 95% chance that the sample mean will fall within two standard errors of the true mean: 62 ± 2 × 3.3 = 55.4 nmol/L to 68.6 nmol/L.

(Diagram: mean − 2 standard errors = 55.4; mean = 62; mean + 2 standard errors = 68.6.)

To be precise, 95% of observations fall between Z = −1.96 and Z = +1.96 (so the 2 is a rounded number).

95% confidence interval


There is a 95% chance that the sample mean is
between 55.4 nmol/L and 68.6 nmol/L
For every sample mean in this range, sample mean
+/- 2 standard errors will include the true mean:
For example, if the sample mean is 68.6 nmol/L:
95% CI = 68.6 +/- 6.6 = 62.0 to 75.2
This interval just hits the true mean, 62.0.

95% confidence interval

Thus, for normally distributed statistics, the formula for the 95% confidence interval is:
sample statistic ± 2 × (standard error)

Examples:
95% CI for mean vitamin D: 63 nmol/L ± 2 × (3.3) = 56.4 to 69.6 nmol/L
95% CI for the correlation coefficient: 0.15 ± 2 × (0.1) = -0.05 to 0.35

Simulation of 20 studies of 100 men

(Figure: 95% confidence intervals for the mean vitamin D from each of the 20 simulated studies; the vertical line indicates the true mean, 62. Only 1 confidence interval missed the true mean.)

Confidence Intervals give:

* A plausible range of values for a population parameter.
* The precision of an estimate. (When sampling variability is high, the confidence interval will be wide to reflect the uncertainty of the observation.)
* Statistical significance (if the 95% CI does not cross the null value, the result is significant at 0.05).

Confidence Intervals

point estimate ± (measure of how confident we want to be) × (standard error)

Point estimate: the value of the statistic in my sample (e.g., mean, odds ratio, etc.).
Measure of confidence: from a Z table or a T table, depending on the sampling distribution of the statistic.
Standard error: the standard error of the statistic.

Common Z levels of confidence

Commonly used confidence levels are 90%, 95%, and 99%.

Confidence level    Z value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.08
99.9%               3.27

99% confidence intervals

99% CI for mean vitamin D: 63 nmol/L ± 2.6 × (3.3) = 54.4 to 71.6 nmol/L
99% CI for the correlation coefficient: 0.15 ± 2.6 × (0.1) = -0.11 to 0.41
