
Sampling

Probability sample
Non-probability sample
Statistical inference
Sampling error

Source: inference.ppt - Aki Taanila

Probability sample
Goal: a representative sample, i.e. a miniature of the population.
You can use simple random sampling, systematic sampling, stratified sampling, cluster sampling, or a combination of these methods to obtain a probability sample.
With a probability sample you can draw conclusions about the whole population.

Simple Random
(Diagram: a sample is drawn completely at random from the population.)

Systematic
Select a sampling interval, e.g. every fifth element.
Choose one element at random from among the first five (or whatever the sampling interval is).
Starting from the chosen element, pick every fifth element (or whatever the interval is).
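
A minimal sketch of systematic sampling in Python (the numbered population and the interval of 5 are illustrative assumptions):

import random

def systematic_sample(population, interval=5):
    # Choose a random start among the first `interval` elements,
    # then take every `interval`-th element from there on.
    start = random.randrange(interval)
    return population[start::interval]

units = list(range(1, 101))                      # e.g. 100 numbered units
print(systematic_sample(units, interval=5))      # 20 units, every 5th from a random start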

Stratified
Divide the population into groups (strata), e.g. age groups 18-29, 30-49, 50-64, 65+, and draw a random sample from every stratum. This guarantees that all the groups are represented as in the population.
Proportional allocation: each stratum gets a share of the sample equal to its share of the population.
Even allocation: equally sized samples from every stratum, useful when the aim is to compare groups.

Cluster
Divide the population into clusters (schools, districts, ...).
Choose some of the clusters at random.
Draw a sample from the chosen clusters using an appropriate sampling method (or investigate the chosen clusters in their entirety).
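
A minimal sketch of cluster sampling (the cluster structure and the choice of 2 clusters with 5 units each are illustrative assumptions):

import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None):
    # clusters: dict mapping cluster name -> list of units.
    chosen = random.sample(list(clusters), n_clusters)        # pick clusters at random
    sample = []
    for name in chosen:
        units = clusters[name]
        if n_per_cluster is None:
            sample.extend(units)                              # investigate the whole cluster
        else:
            sample.extend(random.sample(units, n_per_cluster))
    return sample

schools = {f"school_{i}": list(range(i * 100, i * 100 + 30)) for i in range(10)}
print(len(cluster_sample(schools, n_clusters=2, n_per_cluster=5)))   # 10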

Non-probability sample
When a sample is not drawn randomly, it is called a non-probability sample.
For example, using the most easily available elements, as in self-selecting surveys or street interviews.
With a non-probability sample you should not draw conclusions about the whole population.

Statistical inference
Statistical inference: drawing conclusions about the whole population on the basis of a sample.
Precondition for statistical inference: the sample is randomly selected from the population (i.e. it is a probability sample).

Sampling Error
(Diagram: population mean 40.8; sample 1 mean 40.5, sample 2 mean 40.3, sample 3 mean 41.4.)
Different samples from the same population give different results.
This variation is due to chance.

Sampling distributions
Mean: normal distribution or t-distribution
Proportion: normal distribution

In real life, calculating the parameters of populations is prohibitive because populations are very large.
Rather than investigating the whole population, we take a sample, calculate a statistic related to the parameter of interest, and make an inference.
The sampling distribution of the statistic is the tool that tells us how close the statistic is to the parameter.

Distribution of a statistic
Statistics follow distributions too, but the distribution of a statistic is a theoretical construct.
Statisticians pose a thought experiment: how much would the value of the statistic fluctuate if one could repeat a particular study over and over again with different samples of the same size?
By answering this question, statisticians are able to pinpoint exactly how much uncertainty is associated with a given statistic.

Sampling distribution
Most statistical inference methods are based on sampling distributions.
You can apply statistical inference without knowing sampling distributions.
Still, it is useful to know at least the basic idea of a sampling distribution.

Notation for Samples and Populations

Statistic (from sample)           Parameter (from entire population)
x̄ = sample mean                   μ = population mean
s² = sample variance              σ² = population variance
s = sample standard deviation     σ = population standard deviation

Properties of the Standard Deviation, s

1. s measures the variability in a sample of measurements. It is a measure of how much the sample values deviate from the sample mean.
2. s is a nonnegative number. If all the numbers in a sample are equal, the standard deviation is zero; this is the smallest possible value of the standard deviation.
3. When comparing two samples of data, the sample that is more variable will have the larger standard deviation.

9.2 The Sampling Distribution of the Sample Mean


When you take a sample, compute a statistic, repeat the
process a large number of times, and then make a
histogram of the statistics you observed, you are
examining the sampling distribution of the statistic.
Under special conditions, some of these distributions
(histograms) will begin to resemble a normal distribution.

Sampling Distribution of the Mean

A fair die is thrown infinitely many times, with the random variable X = # of spots on any throw.
The probability distribution of X is:

x      1    2    3    4    5    6
P(x)  1/6  1/6  1/6  1/6  1/6  1/6

and the mean and variance are calculated as well:
μ = Σ x·P(x) = 3.5 and σ² = Σ (x − μ)²·P(x) ≈ 2.92

Throwing a die twice: the sample mean

All 36 equally likely samples of size n = 2 and their sample means x̄:

Sample  x̄      Sample  x̄      Sample  x̄
1,1     1.0    3,1     2.0    5,1     3.0
1,2     1.5    3,2     2.5    5,2     3.5
1,3     2.0    3,3     3.0    5,3     4.0
1,4     2.5    3,4     3.5    5,4     4.5
1,5     3.0    3,5     4.0    5,5     5.0
1,6     3.5    3,6     4.5    5,6     5.5
2,1     1.5    4,1     2.5    6,1     3.5
2,2     2.0    4,2     3.0    6,2     4.0
2,3     2.5    4,3     3.5    6,3     4.5
2,4     3.0    4,4     4.0    6,4     5.0
2,5     3.5    4,5     4.5    6,5     5.5
2,6     4.0    4,6     5.0    6,6     6.0

Sampling Distribution of Two Dice

A sampling distribution is created by looking at all samples of size n = 2 (i.e. two dice) and their means.
While there are 36 possible samples of size 2, there are only 11 possible values for x̄, and some (e.g. x̄ = 3.5) occur more frequently than others (e.g. x̄ = 1).
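
A minimal simulation of this sampling distribution in Python (the 100,000 replications are an arbitrary illustrative choice):

import random
from collections import Counter

reps = 100_000
means = [(random.randint(1, 6) + random.randint(1, 6)) / 2 for _ in range(reps)]

counts = Counter(means)
for value in sorted(counts):
    print(f"x-bar = {value:3.1f}   relative frequency = {counts[value] / reps:.3f}")
# The relative frequencies approach 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36.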

Sampling Distribution of Two Dice

The sampling distribution of x̄ is shown below:

x̄      1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
P(x̄)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Compare

Compare the distribution of X (uniform on the values 1, ..., 6) with the sampling distribution of x̄ above: x̄ is centred on the same value but is less spread out.

As well, note that μ_x̄ = μ and σ²_x̄ = σ²/2.

Generalize
We can generalize the mean and variance of the sampling distribution of two dice,
μ_x̄ = μ and σ²_x̄ = σ²/2,
to n dice:
μ_x̄ = μ and σ²_x̄ = σ²/n.
The standard deviation of the sampling distribution is called the standard error:
σ_x̄ = σ/√n.

Central Limit Theorem

The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size.
The larger the sample size, the more closely the sampling distribution of x̄ will resemble a normal distribution.
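
A minimal sketch illustrating the CLT by simulation (the skewed exponential population and the chosen sample sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

for n in (5, 10, 25):
    # 10,000 samples of size n from an exponential population with mean 1 and sd 1
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of x-bar = {sample_means.mean():.3f}, "
          f"sd of x-bar = {sample_means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
# As n grows, a histogram of the sample means looks more and more normal,
# and its standard deviation shrinks like sigma / sqrt(n).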

Sampling Distribution of the Mean (throwing n dice)

n = 5:    μ_x̄ = 3.5,  σ²_x̄ = 2.9167/5  ≈ 0.5833
n = 10:   μ_x̄ = 3.5,  σ²_x̄ = 2.9167/10 ≈ 0.2917
n = 25:   μ_x̄ = 3.5,  σ²_x̄ = 2.9167/25 ≈ 0.1167

Central Limit Theorem

If the population is normal, then x̄ is normally distributed for all values of n.
If the population is non-normal, then x̄ is approximately normal only for larger values of n.
In many practical situations, a sample size of 30 may be sufficiently large to allow us to use the normal distribution as an approximation for the sampling distribution of x̄.

Sampling Distribution of the Sample Mean

1. μ_x̄ = μ
2. σ_x̄ = σ/√n
3. If X is normal, x̄ is normal. If X is nonnormal, x̄ is approximately normal for sufficiently large sample sizes.
Note: the definition of "sufficiently large" depends on the extent of nonnormality of X (e.g. heavily skewed, multimodal).

Interpreting Standard Deviation

The standard deviation can be used to
compare the variability of several distributions, and
make a statement about the general shape of a distribution.

The empirical rule: if a sample of observations has a mound-shaped distribution, then the interval
(x̄ − s, x̄ + s) contains approximately 68% of the measurements,
(x̄ − 2s, x̄ + 2s) contains approximately 95% of the measurements, and
(x̄ − 3s, x̄ + 3s) contains approximately 99.7% of the measurements.
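
A quick numeric check of the empirical rule on simulated mound-shaped data (the normal sample of size 10,000 is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=10_000)      # mound-shaped sample
xbar, s = x.mean(), x.std(ddof=1)

for k, target in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    share = np.mean((x > xbar - k * s) & (x < xbar + k * s))
    print(f"within {k}s of the mean: {share:.1%} (empirical rule: about {target})")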

Parameter Estimation
Parameter and its estimate
Error margin

Parameter estimation
The objective is to estimate an unknown population parameter using a value calculated from the sample.
The parameter may be, for example, a mean or a proportion.

Statistic (from sample)            estimates    Parameter (from entire population)
Mean: x̄                            estimates    μ
Standard deviation: s              estimates    σ
Proportion: p̂                      estimates    p

Population: the mean μ is unknown. From a sample we can give
a point estimate: x̄ = 50, or
an interval estimate: "I am 95% confident that μ is between 40 and 60."
Parameter = statistic ± its error

Error margin
A value calculated from the sample is the best guess when estimating the corresponding population value.
The estimate is still uncertain due to sampling error.
The error margin is a measure of this uncertainty.
Using the error margin you can state a confidence interval: estimate ± error margin.

Error margin for mean - σ known

If the population standard deviation σ is known, then the error margin for the population mean is
1.96 · σ/√n

We can be 95% sure that the population mean lies in the interval (95% confidence interval):
x̄ − 1.96·σ/√n ≤ μ ≤ x̄ + 1.96·σ/√n
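
A minimal sketch of this interval in Python (the sample values and the assumed known σ = 4 are illustrative):

import math

sample = [18.2, 21.5, 19.8, 22.1, 20.4, 19.1, 20.9, 21.7]   # hypothetical data
n = len(sample)
xbar = sum(sample) / n
sigma = 4.0                                  # assumed known population standard deviation

error_margin = 1.96 * sigma / math.sqrt(n)
print(f"95% CI for the mean: {xbar - error_margin:.2f} to {xbar + error_margin:.2f}")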

Error margin for mean - σ unknown

If the population standard deviation is unknown, then the error margin for the population mean is
t_critical · s/√n

We can be 95% sure that the population mean lies in the interval (95% confidence interval):
x̄ − t_critical·s/√n ≤ μ ≤ x̄ + t_critical·s/√n
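
A minimal sketch using the t-distribution (the sample values and the use of scipy.stats for the critical value are illustrative assumptions):

import math
from scipy import stats

sample = [18.2, 21.5, 19.8, 22.1, 20.4, 19.1, 20.9, 21.7]       # hypothetical data
n = len(sample)
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))    # sample standard deviation

t_critical = stats.t.ppf(0.975, df=n - 1)                        # two-sided 95% critical value
error_margin = t_critical * s / math.sqrt(n)
print(f"95% CI for the mean: {xbar - error_margin:.2f} to {xbar + error_margin:.2f}")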

Confidence level
The confidence level can be selected to be different from 95%.
If the population standard deviation σ is known, then the critical value is calculated from the normal distribution.
Ex. In Excel, =-NORMSINV(0,005) gives the critical value for the 99% confidence level (0,005 is half of 0,01).
If the population standard deviation σ is unknown, then the critical value is calculated from the t-distribution.
Ex. In Excel, =TINV(0,01;79) gives the critical value when the sample size is 80 and the confidence level is 99%.
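
Equivalent critical values can be computed in Python with scipy.stats (a sketch; the 99% level and the sample size of 80 follow the Excel examples above):

from scipy import stats

# Normal critical value for 99% confidence (sigma known):
z_crit = stats.norm.ppf(1 - 0.005)           # counterpart of =-NORMSINV(0,005)
print(round(z_crit, 3))                       # about 2.576

# t critical value for 99% confidence with sample size 80 (sigma unknown):
t_crit = stats.t.ppf(1 - 0.005, df=79)        # counterpart of =TINV(0,01;79)
print(round(t_crit, 3))                       # about 2.64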

Inference
Two ways to make inference:
Estimation of parameters
* Point estimation (x̄ or p̂)
* Interval estimation
Hypothesis testing

Hypothesis testing
Null hypothesis
Alternative hypothesis
2-tailed or 1-tailed
P-value

Hypothesis 1
A hypothesis is a belief concerning a parameter.
The parameter may be a population mean, proportion, correlation coefficient, ...
"I believe that the mean weight of cereal packages is 300 grams!"

Hypothesis 2
The null hypothesis is the prevalent opinion, previous knowledge, a basic assumption, the prevailing theory, ...
The alternative hypothesis is the rival opinion.
The null hypothesis is assumed to be true as long as we find no evidence against it.
If a sample gives strong enough evidence against the null hypothesis, then the alternative hypothesis comes into force.

Hypothesis examples
H0: The mean height of males equals 174 cm. H1: The mean height is greater than 174 cm.
H0: Half of the population is in favour of the nuclear power plant. H1: More than half of the population is in favour of the nuclear power plant.
H0: The amount of overtime work is equal for males and females. H1: The amount of overtime work is not equal for males and females.
H0: There is no correlation between the interest rate and the gold price. H1: There is a correlation between the interest rate and the gold price.

2-tailed Test
Use a 2-tailed test if there is no reason for a 1-tailed test.
In a 2-tailed test, deviations from the null hypothesis in both directions are interesting.
The alternative hypothesis takes the form "different from".

1-tailed Test
In a 1-tailed test we know beforehand that only deviations in one direction are possible or interesting.
The alternative hypothesis takes the form "less than" or "greater than".

Logic behind hypothesis testing

(Diagram: a population of people. The prevalent opinion is that the mean age in the group is 50 (the null hypothesis). A random sample gives a mean age of 45.)

"Reject the null hypothesis! The sample mean is only 45!"

Risk of being wrong

Not guilty until proved otherwise! Likewise, the null hypothesis remains valid until proved otherwise.
Sometimes an innocent person is proved guilty. The same may happen in hypothesis testing: we may reject the null hypothesis although it is true.
There is always a risk of being wrong when we reject the null hypothesis; the risk is due to sampling error.

Significance Level
When we reject the null hypothesis there is a risk of drawing a wrong conclusion.
The risk of drawing a wrong conclusion (called the p-value, or observed significance level) can be calculated.
The researcher decides the maximum risk (called the significance level) he or she is ready to take.
The usual significance level is 5%.

P-value
We start from the basic assumption that the null hypothesis is true.
The p-value is the probability of getting a value equal to or more extreme than the sample result, given that the null hypothesis is true.
Decision rule: if the p-value is less than 5%, reject the null hypothesis; if the p-value is 5% or more, the null hypothesis remains valid.
In any case, you must give the p-value as a justification for your decision.

Steps in hypothesis testing

1. Set the null hypothesis and the alternative hypothesis.
2. Calculate the p-value.
3. Decision rule: if the p-value is less than 5%, reject the null hypothesis; otherwise the null hypothesis remains valid. In any case, give the p-value as a justification for your decision.

Testing mean
Null hypothesis: the mean equals x0.
Alternative hypothesis (2-tailed): the mean is different from x0.
Alternative hypothesis (1-tailed): the mean is less than x0.
Alternative hypothesis (1-tailed): the mean is greater than x0.

Testing mean - σ known

Calculate the standardized sample mean:
z = (x̄ − μ) / (σ/√n)

Calculate the p-value, which indicates how likely it is to get this kind of value if we assume that the null hypothesis is true.
In Excel you can calculate the p-value: =NORMSDIST(-ABS(z))
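
A minimal sketch of this z-test in Python (the summary numbers and the null mean of 300 echo the cereal-package example and are illustrative assumptions):

import math
from scipy import stats

xbar, mu0, sigma, n = 297.5, 300.0, 10.0, 64     # hypothetical summary values
z = (xbar - mu0) / (sigma / math.sqrt(n))

p_one_tailed = stats.norm.cdf(-abs(z))           # counterpart of =NORMSDIST(-ABS(z))
p_two_tailed = 2 * p_one_tailed
print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")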

Testing mean - σ unknown

Calculate the standardized sample mean:
t = (x̄ − μ) / (s/√n)

Calculate the p-value, which indicates how likely it is to get this kind of value if we assume that the null hypothesis is true.
In Excel you can calculate the p-value: =TDIST(ABS(t),degrees of freedom,tails); in this case the degrees of freedom equal n-1, and tails defines whether you use a one-tailed (1) or two-tailed (2) test.
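
A minimal sketch of the corresponding t-test (the raw data are illustrative; scipy.stats.ttest_1samp is used for convenience):

from scipy import stats

weights = [298.2, 301.5, 296.3, 299.1, 302.4, 295.8, 300.2, 297.7]   # hypothetical weights
t_stat, p_two_tailed = stats.ttest_1samp(weights, popmean=300.0)      # two-tailed p-value

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.4f}")
# The one-tailed p-value is half of the two-tailed p-value when the sample mean
# deviates in the direction stated by the alternative hypothesis.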

The Central Limit Theorem

If all possible random samples, each of size n, are taken from any population with mean μ and standard deviation σ, the sampling distribution of the sample means (averages) will:
1. have mean: μ_x̄ = μ
2. have standard deviation: σ_x̄ = σ/√n
3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n)

Central Limit Theorem caveats for small samples

For small samples:
The sample standard deviation is an imprecise estimate of the true standard deviation (σ); this imprecision changes the distribution to a t-distribution.
A t-distribution approaches a normal distribution for large n (around 100), but has fatter tails for small n (<100).
If the underlying distribution is non-normal, the distribution of the means may be non-normal.
More on t-distributions next week!

Summary: Single population mean (large n)

Hypothesis test:
Z = (observed mean − null mean) / (s/√n)

Confidence interval:
confidence interval = observed mean ± Z(α/2) · (s/√n)

Single population mean (small n, normally distributed trait)

Hypothesis test:
T(n−1) = (observed mean − null mean) / (s/√n)

Confidence interval:
confidence interval = observed mean ± T(n−1, α/2) · (s/√n)

Examples of Sample Statistics:


Single population mean
Single population proportion
Difference in means (t-test)
Difference in proportions (Z-test)
Odds ratio/risk ratio
Correlation coefficient
Regression coefficient

Distribution of a correlation coefficient

Computer simulation:
1. Specify the true correlation coefficient: correlation coefficient = 0.15.
2. Select a random sample of 100 virtual men from the population.
3. Calculate the correlation coefficient for the sample.
4. Repeat steps (2) and (3) 15,000 times.
5. Explore the distribution of the 15,000 correlation coefficients.
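
A minimal version of this simulation in Python (the bivariate normal population used to generate data with true correlation 0.15 is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(2)
true_r, n, reps = 0.15, 100, 15_000
cov = [[1.0, true_r], [true_r, 1.0]]                     # population correlation 0.15

r_values = np.empty(reps)
for i in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T   # sample of 100 "men"
    r_values[i] = np.corrcoef(x, y)[0, 1]

print(f"mean of r = {r_values.mean():.3f}, standard error = {r_values.std():.3f}")
# Roughly: mean 0.15 and standard error about 0.10, as reported on the next slide.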

Distribution of a correlation coefficient
Normally distributed!
Mean = 0.15 (true correlation)
Standard error = 0.10

Distribution of a correlation coefficient, in general

1. Shape of the distribution:
normally distributed for large samples,
t-distribution for small samples (n < 100).
2. Mean = the true correlation coefficient (r).
3. Standard error ≈ √((1 − r²)/n)

Many statistics follow normal (or t-) distributions

Means / differences in means (t-distribution for small samples)
Proportions / differences in proportions
Regression coefficients (t-distribution for small samples)
Natural log of the odds ratio

Estimation (confidence intervals)


What is a good estimate for the true mean
vitamin D in the population (the population
parameter)?
63 nmol/L +/- margin of error

95% confidence interval


Goal: capture the true effect (e.g., the true
mean) most of the time.
A 95% confidence interval should include the
true effect about 95% of the time.
A 99% confidence interval should include the
true effect about 99% of the time.

Recall the 68-95-99.7 rule for normal distributions! There is a 95% chance that the sample mean will fall within two standard errors of the true mean: 62 ± 2 × 3.3 = 55.4 nmol/L to 68.6 nmol/L.

(Diagram: mean − 2 standard errors = 55.4; mean = 62; mean + 2 standard errors = 68.6.)

To be precise, 95% of observations fall between Z = −1.96 and Z = +1.96 (so the 2 is a rounded number).

95% confidence interval


There is a 95% chance that the sample mean is
between 55.4 nmol/L and 68.6 nmol/L
For every sample mean in this range, sample mean
+/- 2 standard errors will include the true mean:
For example, if the sample mean is 68.6 nmol/L:
95% CI = 68.6 +/- 6.6 = 62.0 to 75.2
This interval just hits the true mean, 62.0.

95% confidence interval

Thus, for normally distributed statistics, the formula for the 95% confidence interval is:
sample statistic ± 2 × (standard error)

Examples:
95% CI for mean vitamin D: 63 nmol/L ± 2 × (3.3) = 56.4 to 69.6 nmol/L
95% CI for the correlation coefficient: 0.15 ± 2 × (0.1) = -0.05 to 0.35

Simulation of 20 studies of 100 men

(Figure: 95% confidence intervals for the mean vitamin D from each of the 20 simulated studies; the vertical line indicates the true mean, 62. Only 1 confidence interval missed the true mean.)

Confidence Intervals give:

* A plausible range of values for a population parameter.
* The precision of an estimate. (When sampling variability is high, the confidence interval will be wide to reflect the uncertainty of the observation.)
* Statistical significance (if the 95% CI does not cross the null value, the result is significant at 0.05).

Confidence Intervals

point estimate ± (measure of how confident we want to be) × (standard error)

Point estimate: the value of the statistic in my sample (e.g., mean, odds ratio, etc.).
Measure of confidence: from a Z table or a T table, depending on the sampling distribution of the statistic.
Standard error: the standard error of the statistic.

Common Z levels of confidence

Commonly used confidence levels are 90%, 95%, and 99%.

Confidence level    Z value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.08
99.9%               3.27

99% confidence intervals

99% CI for mean vitamin D: 63 nmol/L ± 2.6 × (3.3) = 54.4 to 71.6 nmol/L
99% CI for the correlation coefficient: 0.15 ± 2.6 × (0.1) = -0.11 to 0.41
