
Chapter 7, the Bootstrap

US President George W. Bush awards Bradley Efron the 2005 National Medal of Science in the East Room of the White House, July 27, 2007, in Washington, DC. President Bush awarded 27 National Medals of Science and Technology for 2005 and 2006 at the event.

Brad Efron at Stanford University.

The Bootstrap
Introduced by Brad Efron in 1979

B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 7:1-26, 1979.

B. Efron. Nonparametric estimates of standard error: the jackknife, the bootstrap, and other methods. Biometrika, 68:589-599, 1981.

Bootstrap methods are a class of nonparametric Monte Carlo methods that estimate the distribution of a population by resampling.
2

The Bootstrap
Suppose the only things you have are a sample of n observations:

x1, x2, ..., xn.

You do not want to make an assumption about a parametric distribution for the data; for example, you do not want to assume they are Normally distributed or Student t-distributed.

In other words, you want to be nonparametric.

The MC methods in Ch. 6 of the book can be called the parametric bootstrap because simulation was performed using a parametric model, e.g. N(μ, σ²).
We want to repeat the same types of things in this chapter without using a parametric model.
3

Resampling on a discrete uniform

Assume x = (x1, x2, ..., xn) is an observed random sample from a distribution with unknown cdf F(x).
Choose X* at random from x.
Then P(X* = xi) = 1/n, for i = 1, ..., n.
In other words, X* is distributed as discrete uniform on x.

Resampling creates an i.i.d. sample X*1, X*2, ..., X*n from this discrete uniform on x distribution.
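Resampling with replacement is one line in R. A minimal sketch; the data vector x below is made up for illustration:

```r
# Resample n observations with replacement from the observed sample x.
# Each draw is discrete uniform on x: P(X* = x_i) = 1/n.
x <- c(2.1, 3.5, 1.7, 4.2, 2.8)           # any observed sample (made up here)
set.seed(1)
xstar <- sample(x, size = length(x), replace = TRUE)
# every resampled value must be one of the original observations
all(xstar %in% x)                          # TRUE
```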

Empirical cumulative distribution function (ecdf)

Recall, pg. 36 of Ch 2, repeated in the lecture of Ch 3:
The ecdf Fn(x) is an unbiased estimate of F(x) = P(X ≤ x), and is defined for an observed ordered sample x(1) ≤ x(2) ≤ ... ≤ x(n) by:

Fn(x) = 0,     for x < x(1),
        i/n,   for x(i) ≤ x < x(i+1), i = 1, ..., n − 1,
        1,     for x(n) ≤ x.

The standard error of Fn(x) is sqrt(F(x)[1 − F(x)]/n) ≤ 0.5/sqrt(n).
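The i/n staircase form can be checked directly with R's built-in ecdf(); a small sketch with a made-up sample:

```r
# The ecdf evaluated at the i-th order statistic equals i/n.
x <- c(3.1, 0.8, 2.4, 5.0, 1.6)    # made-up sample, n = 5
Fn <- ecdf(x)                       # step function with jumps of 1/n at each x_i
xs <- sort(x)
Fn(xs)                              # 0.2 0.4 0.6 0.8 1.0, i.e. i/n for i = 1..5
Fn(min(x) - 1)                      # 0: below the smallest observation
Fn(max(x))                          # 1: at and above the largest observation
```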

Resampling on a discrete uniform

So under our resampling scheme:
Assume x = (x1, x2, ..., xn) is an observed random sample from a distribution with unknown cdf F(x).
Choose X* at random from x.
Then P(X* = xi) = 1/n, for i = 1, ..., n.
In other words, X* is distributed as discrete uniform on x.
Resampling creates an i.i.d. sample X*1, X*2, ..., X*n from this discrete uniform on x distribution.
The ecdf Fn is therefore the cdf of X*.
6

Bootstrap sampling scheme

[Diagram: the unknown cdf FX generates the observed sample x = (x1, x2, ..., xn), summarized by its ecdf Fn(x); resampling from Fn(x) produces B bootstrap samples x*(1), x*(2), ..., x*(B), each with its own ecdf Fn*(b)(x*).]

Bootstrap sampling scheme

[Same diagram as the previous slide.]

Chain of convergence: Fn*(b)(x*) → Fn(x) and Fn(x) → F(x); in short, Fn*(b)(x*) ≈ Fn(x) ≈ F(x).

Bootstrap sampling scheme

[Same diagram as before.]

Fn*(b)(x*) ≈ Fn(x) ≈ F(x). Increasing n helps here: the approximation Fn(x) ≈ F(x) improves only as the sample size n grows.

Bootstrap sampling scheme

[Same diagram as before.]

Fn*(b)(x*) ≈ Fn(x) ≈ F(x). Increasing B only helps here: more replicates improve the approximation of the bootstrap distribution to Fn(x), but no value of B brings Fn(x) closer to F(x).

Example 7.1
n = 10, very small

11

For X ~ Poi(2), P(X = 0) = e^(−2) ≈ 0.135, not 0.
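The point of the example can be reproduced in a few lines; a sketch, where the seed and the simulated sample are arbitrary choices:

```r
# With n = 10 draws from Poi(2), the ecdf assigns probability 0 to any
# value that happens not to appear in the sample, even though, e.g.,
# P(X = 0) = exp(-2) is about 0.135 under the true distribution.
set.seed(7)
x <- rpois(10, lambda = 2)
table(x) / 10       # ecdf probabilities: nonzero only on observed values
exp(-2)             # true P(X = 0), about 0.1353
0 %in% x            # if FALSE, the bootstrap can never produce a 0
```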

Estimating a summary (parameter) of the nonparametric distribution

Assume x = (x1, x2, ..., xn) is an observed random sample from a distribution with unknown cdf F(x).
We are interested in some "parameter" θ of F(x), such as the mean of the distribution F(x) or the .975-quantile of F(x).
"Parameter" sounds strange, since it belongs to a nonparametric distribution. It is a naming convention only; instead of the word "parameter", think of θ as any "summary" of F(x).
We can also use bootstrapping to get the empirical distribution of θ̂, and this is a very powerful technique because we can do it for any type of θ and for any distribution F(x).
12

Bootstrap for a summary of interest

[Diagram: θ = θ(FX) is the summary of the true distribution; θ̂ = θ(Fn(x)) is its estimate from the observed sample x = (x1, x2, ..., xn); each bootstrap sample x*(b) yields a replicate θ̂(b) = θ(Fn*(b)).]

The bootstrap estimate of the cdf of θ̂ is the empirical cdf of (θ̂(1), θ̂(2), ..., θ̂(B)).

Bootstrap estimate of standard error

The bootstrap estimate of the standard error of θ̂ is the sample standard deviation of the replicates:

se_boot(θ̂) = sqrt( (1/(B − 1)) Σb (θ̂(b) − θ̄*)² ), where θ̄* = (1/B) Σb θ̂(b).

Note: θ̄*, and not θ̂, is typically used to estimate E[θ̂] in the centering.
14

Example 7.2

Question 1: Estimate the correlation between LSAT and GPA scores based on the random sample above.

Question 2: Write a general bootstrap algorithm for estimating the standard error of the estimate in Question 1.

15

Example 7.2
Question 1: Estimate the correlation between LSAT and GPA scores based on the random sample above.

Solution:

What is θ?
How do we estimate it in R?
16

Example 7.2

17

Question 1: Estimate the correlation between LSAT and GPA scores based on the random sample above.

[Scatterplot of lsat (y-axis, 560 to 660) against gpa (x-axis, 280 to 340).]

The book did not specify, but Pearson's correlation is the standard correlation measure; it measures only a linear relationship. The plot looks linear, so we will use it. If there is a non-linear relationship, it is better to use Spearman's rank correlation.

Example 7.2
Question 1: Estimate the correlation between LSAT and GPA scores based on the random sample above.

Solution:

What is θ? The population correlation coefficient between LSAT and GPA scores.
How do we estimate it in R? With the cor() function, with the default method="pearson" used.

18

> cor(gpa,lsat)
[1] 0.7763745 ANSWER

> cor(gpa,lsat,method="spearman") #for comparison:
[1] 0.7964286

Note we did not use the bootstrap!

Question 2: Inadequate solution

Question 2: Write a general bootstrap algorithm for estimating the standard error of the estimate in Question 1.

The solution from the book is not specific enough!

What is x1, x1*?

Important to note: the data are paired.

19

Question 2: Better solution


The data comprise 15 pairs: x = {xi = (GPAi, LSATi); i = 1, ..., 15}.
For each bootstrap replicate, indexed b = 1, ..., B:
  Generate a random sample of 15 pairs x*(b) = (x*1, x*2, ..., x*15), where x*i is the ith random pair, by sampling with replacement from x.
  Compute θ̂(b) as the sample correlation of the pairs of x*(b).
The bootstrap estimate of the standard error of the correlation is the sample standard deviation of θ̂(1), θ̂(2), ..., θ̂(B).

20
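The algorithm above can be sketched directly in R. The LSAT/GPA values below are transcribed from Efron's 15-school law data used in this example (verify them against the book); the seed and B are arbitrary choices:

```r
# Bootstrap estimate of the SE of the correlation, resampling PAIRS.
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605,
          653, 575, 545, 572, 594)
gpa  <- c(339, 330, 281, 303, 344, 307, 300, 343, 336, 313,
          312, 274, 276, 288, 296)
n <- length(lsat)
theta.hat <- cor(gpa, lsat)              # about 0.776, as on the slide
set.seed(1)
B <- 2000                                # number of bootstrap replicates
theta.b <- numeric(B)
for (b in 1:B) {
  i <- sample(1:n, size = n, replace = TRUE)   # resample indices of pairs
  theta.b[b] <- cor(gpa[i], lsat[i])
}
se.boot <- sd(theta.b)                   # close to the reported 0.136
```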

Reporting standard errors


Question 2: Write a general bootstrap algorithm for estimating the standard error of the estimate in Question 1.

21

The bootstrap estimate of standard error (se) is 0.136. We report the correlation between the LSAT and GPA as 0.776 ± 0.136. ANSWER

Seeing what you are doing

In practice it is always a good idea to inspect the distribution of the B bootstrap estimates of θ.

[Figure: histogram of the B bootstrap replicates of the correlation.]

This distribution does not look normal, and will not look normal no matter how large you make B. That is because n = 15 is a small sample size, and a correlation coefficient from a sample of this size is not normal.
22

Recall θ̂ = 0.776.

The difference between bootstrap and asymptotic Normal intervals

Asymptotic normal theory, assuming a large n, is often used to estimate standard errors for correlation coefficients, even for small n.

The normal estimate of the standard error for this example is 0.115 (this matches (1 − θ̂²)/√(n − 3)).

Recall θ̂ = 0.776; bootstrap estimate of SE = 0.136.


23

Bootstrapping packages in R
Pg. 187, 188
boot (in package boot)
bootstrap (in package bootstrap)
Not covered for this class because here we learn to do this ourselves; you can use these packages later in your job.

24

Bias
The bias of an estimator θ̂ for θ is
bias(θ̂) = E[θ̂ − θ] = E[θ̂] − E[θ] = E[θ̂] − θ.
An estimator is unbiased if bias(θ̂) = 0.
Example 1: If X1, ..., Xn are i.i.d. from a distribution with population mean μ, then the sample mean X̄ = (1/n) Σ Xi is an unbiased estimator for μ.
Proof: E[X̄] = E[(1/n) Σ Xi] = (1/n) E[Σ Xi] = (1/n) Σ E(Xi) = (1/n) Σ μ = (1/n) nμ = μ.

25

E(cX) = cE(X); E(X1 + X2) = E(X1) + E(X2); the Xi are i.i.d. with mean μ.

Bias
The bias of an estimator θ̂ for θ is
bias(θ̂) = E[θ̂ − θ] = E[θ̂] − E[θ] = E[θ̂] − θ.
An estimator is unbiased if bias(θ̂) = 0.
Not every estimator is unbiased. For example, the maximum likelihood estimator of the variance σ² of a population,
σ̂² = (1/n) Σ (Xi − X̄)²,
has bias −σ²/n. That is why the unbiased estimator
s² = (1/(n − 1)) Σ (Xi − X̄)²
is traditionally used; it is also what is implemented in R. Proof next.

26
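The bias −σ²/n can be checked by simulation; a sketch (not from the book), with n, σ², and the replication count chosen arbitrarily:

```r
# Compare the MLE variance estimator (divide by n) with var() (divide by n-1).
set.seed(42)
n <- 5; sigma2 <- 4; m <- 100000
mle <- numeric(m); unb <- numeric(m)
for (j in 1:m) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  mle[j] <- mean((x - mean(x))^2)   # sigma-hat^2: biased, divides by n
  unb[j] <- var(x)                  # s^2: unbiased, divides by n - 1
}
mean(mle) - sigma2   # near the theoretical bias -sigma2/n = -0.8
mean(unb) - sigma2   # near 0
```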

Proof
Assume X1, X2, ..., Xn i.i.d. with mean μ and variance σ². Show σ̂² is biased for σ².
Proof: We just showed E(X̄) = μ. Next note that
Var(X̄) = Var((1/n) Σ Xi) = (1/n²) Var(Σ Xi) = (1/n²) Σ Var(Xi) = (1/n²) nσ² = σ²/n.
Also,
σ̂² = (1/n) Σ (Xi − X̄)² = (1/n) (Σ Xi² − nX̄²) = (1/n) Σ Xi² − X̄².
Hence
E(σ̂²) = (1/n) Σ E(Xi²) − E(X̄²) = (1/n) nE(X1²) − E(X̄²) = E(X1²) − E(X̄²).
Recall Var(X) = E(X²) − E(X)², so E(X²) = Var(X) + E(X)². Therefore
E(σ̂²) = Var(X1) + E(X1)² − [Var(X̄) + E(X̄)²] = Var(X1) + μ² − [Var(X̄) + μ²]
= Var(X1) − Var(X̄) = σ² − σ²/n = ((n − 1)/n) σ².
So bias(σ̂²) = E(σ̂²) − σ² = −σ²/n.

27

Bootstrap estimate of bias

The bootstrap estimate of bias is
bias_boot(θ̂) = θ̄* − θ̂, where θ̄* = (1/B) Σb θ̂(b).

In the Law example:

> cor(gpa,lsat)
[1] 0.7763745

28

Bootstrap estimate of bias

Here θ̂ = 0.7763745, and mean(R) gives θ̄*, the mean of the bootstrap replicates.

A new simulation with B = 2000 in Example 7.4, pg. 189, gives a bootstrap estimate of bias of −0.0058.
29
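The bias formula θ̄* − θ̂ is one extra line on top of the bootstrap loop; a self-contained sketch using the transcribed law data (verify against the book), with an arbitrary seed:

```r
# Bootstrap estimate of bias for the correlation in the law example.
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605,
          653, 575, 545, 572, 594)
gpa  <- c(339, 330, 281, 303, 344, 307, 300, 343, 336, 313,
          312, 274, 276, 288, 296)
n <- length(lsat); B <- 2000
theta.hat <- cor(gpa, lsat)
set.seed(2)
R <- replicate(B, { i <- sample(1:n, replace = TRUE)
                    cor(gpa[i], lsat[i]) })
bias.boot <- mean(R) - theta.hat   # small in magnitude; the book reports -0.0058
```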

Bootstrap confidence intervals

[Diagram as earlier: θ = θ(FX), θ̂ = θ(Fn(x)); the bootstrap samples x*(1), x*(2), ..., x*(B) give replicates θ̂(1), θ̂(2), ..., θ̂(B).]

The bootstrap estimate of the cdf of θ̂ is the empirical cdf of (θ̂(1), θ̂(2), ..., θ̂(B)).

30

Now we need a bootstrap confidence interval (CI) for θ.

Bootstrap confidence intervals

The typical 95% confidence interval (CI) for an estimate θ̂ of θ that you have learned is:
θ̂ ± 1.96 se(θ̂),
where θ̂ is the estimate of θ based on a sample of n observations and se(θ̂) is the standard error of θ̂.
The 95% CI is a random interval that covers the true but unknown θ with probability 0.95.
The number 1.96 is the .975-quantile of the standard normal distribution (q.975). For a general 100(1 − α)% CI, q(1 − α/2) is used.

31

Standard normal = N(0,1)

Bootstrap confidence intervals

Note that the CI θ̂ ± 1.96 se(θ̂) is
symmetric,
based on an asymptotic Normal distribution for θ̂,
only applies where θ̂ is a sample average,
and where the sample size n is large.

For example, you could not apply this method directly when θ̂ is the .95 quantile of some distribution.
32

Bootstrap confidence intervals

There are many bootstrap confidence intervals. We will study the four most commonly used:

Standard normal bootstrap
Basic bootstrap
Bootstrap percentile
Bootstrap t

And then there are even methods to accelerate these intervals; see Sec 7.5 of the book (not covered in this course).

33

Standard normal bootstrap CI

The typical 100(1 − α)% confidence interval (CI) for an estimate θ̂ of θ is θ̂ ± q(1 − α/2) se(θ̂), where q(1 − α/2) is the 1 − α/2 quantile of the standard normal distribution.
Use the typical CI above, but just replace se(θ̂) by the bootstrap estimate of standard error that we previously learned.

34
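A sketch of the standard normal bootstrap CI for a sample mean; the simulated data, seed, B, and α are all arbitrary choices:

```r
# Standard normal bootstrap CI: theta.hat +/- q(1 - alpha/2) * se.boot.
set.seed(3)
x <- rexp(50, rate = 1)           # simulated data; true mean is 1
n <- length(x); B <- 2000; alpha <- 0.05
theta.hat <- mean(x)
theta.b <- replicate(B, mean(sample(x, n, replace = TRUE)))
se.boot <- sd(theta.b)
ci <- theta.hat + c(-1, 1) * qnorm(1 - alpha/2) * se.boot
ci                                 # symmetric about theta.hat by construction
```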

Basic bootstrap CI
The standard normal bootstrap CI is still symmetric and based on the assumption of a normal distribution.
The basic bootstrap CI is more flexible. Instead of using quantiles from the normal distribution, it estimates them from the bootstrap replicates. It is therefore non-symmetric.
The 100(1 − α)% basic bootstrap CI is given by
(2θ̂ − θ̂*(1 − α/2), 2θ̂ − θ̂*(α/2)),
where θ̂*(q) is the sample q-quantile from the ecdf of the bootstrap replicates θ̂*.
The specific form of this interval is complicated to derive, so it is omitted from the course.
35
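The basic bootstrap CI changes only the last step: the quantiles come from the replicates. A sketch with arbitrary simulated exponential data:

```r
# Basic bootstrap CI: (2*theta.hat - q*(1 - alpha/2), 2*theta.hat - q*(alpha/2)).
set.seed(3)
x <- rexp(50, rate = 1)
n <- length(x); B <- 2000; alpha <- 0.05
theta.hat <- mean(x)
theta.b <- replicate(B, mean(sample(x, n, replace = TRUE)))
q <- quantile(theta.b, c(alpha/2, 1 - alpha/2))   # replicate quantiles
ci <- c(2 * theta.hat - q[[2]], 2 * theta.hat - q[[1]])
ci                                 # need not be symmetric about theta.hat
```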

Percentile bootstrap CI
More intuitive, with theoretical advantages and better average properties than the previous intervals.
The 100(1 − α)% percentile bootstrap CI is given by
(θ̂*(α/2), θ̂*(1 − α/2)),
where θ̂*(q) is the sample q-quantile from the ecdf of the bootstrap replicates θ̂*.
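The percentile CI is the simplest of all: just read off the replicate quantiles. A sketch with arbitrary simulated exponential data:

```r
# Percentile bootstrap CI: the alpha/2 and 1 - alpha/2 replicate quantiles.
set.seed(3)
x <- rexp(50, rate = 1)
n <- length(x); B <- 2000; alpha <- 0.05
theta.b <- replicate(B, mean(sample(x, n, replace = TRUE)))
ci <- quantile(theta.b, c(alpha/2, 1 - alpha/2))
ci
```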

The boot package in R

Examples 7.10 and 7.11 compare the three CIs previously shown using the boot.ci function in the R boot package. For this course you should program these intervals yourself and not use the boot package.
36

Bootstrap t CI

37

Based on the idea that standard CIs for the sample mean when the variance is unknown are based on quantiles of a t statistic; the bootstrap t CI estimates those quantiles from studentized replicates.
Note: requires B × B bootstraps (an inner bootstrap inside each replicate, to studentize it).
Complicated, but it has a theoretical advantage over all prior intervals in that it has a statistical accuracy property.
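The slide's formula did not survive extraction, so here is a hedged sketch of the usual bootstrap t construction (a nested bootstrap; the data, names, and tuning values are assumptions, not the book's code):

```r
# Bootstrap t CI for the mean: studentize each replicate with its own
# (inner-bootstrap) standard error, then use the replicate t-quantiles.
set.seed(4)
x <- rexp(30, rate = 1)
n <- length(x); B <- 200; B.inner <- 100; alpha <- 0.05
theta.hat <- mean(x)
t.b <- numeric(B); theta.b <- numeric(B)
for (b in 1:B) {
  xb <- sample(x, n, replace = TRUE)
  theta.b[b] <- mean(xb)
  se.b <- sd(replicate(B.inner, mean(sample(xb, n, replace = TRUE))))
  t.b[b] <- (theta.b[b] - theta.hat) / se.b       # studentized replicate
}
se.boot <- sd(theta.b)
qt.b <- quantile(t.b, c(alpha/2, 1 - alpha/2))    # bootstrap t quantiles
ci <- c(theta.hat - qt.b[[2]] * se.boot,
        theta.hat - qt.b[[1]] * se.boot)
ci                                                # generally not symmetric
```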

End of Chapter 7

38
