
Lecture Notes 13

The Bootstrap

1 Introduction

The bootstrap is a method for estimating the variance of an estimator and for finding approximate confidence intervals for parameters. Although the method is nonparametric, it can be used for inference about parameters in both parametric and nonparametric models.

2 Empirical Distribution

Let X1 , . . . , Xn ∼ P . Recall that the empirical distribution Pn is defined by


$$P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A).$$

In other words, $P_n$ puts mass $1/n$ at each $X_i$. Recall also that a parameter of the form $\theta = T(P)$ is called a statistical functional and that the plug-in estimator is $\hat\theta_n = T(P_n)$.

An iid sample of size n drawn from Pn is called a bootstrap sample, denoted by

X1∗ , . . . , Xn∗ ∼ Pn .

Bootstrap samples play an important role in what follows. Note that drawing an iid sample
X1∗ , . . . , Xn∗ from Pn is equivalent to drawing n observations, with replacement, from the
original data {X1 , . . . , Xn }. Thus, bootstrap sampling is often described as “resampling the
data.”
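To make the resampling concrete, here is a minimal sketch in Python (using NumPy; the simulated data are only a stand-in for a real sample):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=20)   # the observed data X_1, ..., X_n (illustrative)

# A bootstrap sample: n iid draws from P_n, i.e. n draws with replacement from the data.
X_star = rng.choice(X, size=len(X), replace=True)
```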

3 The Bootstrap

Now we give the bootstrap algorithms for estimating the variance of θbn and for constructing
confidence intervals. The explanation of why (and when) the bootstrap works is mainly
deferred until Section 5. Let θbn = g(X1 , . . . , Xn ) denote some estimator.

We would like to find the variance of θbn . Let

VarP (θbn ) = VarP (g(X1 , . . . , Xn )) ≡ Sn (P ).

Note that $\mathrm{Var}_P(\hat\theta_n)$ is some function of $P$ (and $n$), so I have written $\mathrm{Var}_P(\hat\theta_n) = S_n(P)$. If we knew $P$, we could approximate $S_n(P)$ by simulation as follows:

draw $X_1, \ldots, X_n \sim P$ and compute $\hat\theta_n^{(1)} = g(X_1, \ldots, X_n)$
draw $X_1, \ldots, X_n \sim P$ and compute $\hat\theta_n^{(2)} = g(X_1, \ldots, X_n)$
$\vdots$
draw $X_1, \ldots, X_n \sim P$ and compute $\hat\theta_n^{(B)} = g(X_1, \ldots, X_n)$.

Let $s^2$ be the sample variance of $\hat\theta_n^{(1)}, \ldots, \hat\theta_n^{(B)}$. So
$$s^2 = \frac{1}{B}\sum_{j=1}^B \left(\hat\theta_n^{(j)}\right)^2 - \left(\frac{1}{B}\sum_{j=1}^B \hat\theta_n^{(j)}\right)^2.$$

By the law of large numbers,
$$s^2 \xrightarrow{P} \mathbb{E}[\hat\theta_n^2] - (\mathbb{E}[\hat\theta_n])^2 = \mathrm{Var}_P(\hat\theta_n) = S_n(P).$$

Since we can take B as large as we want, we have that s2 ≈ VarP (θbn ). In other words, we
can approximate Sn (P ) by repeatedly simulating n observations from P .
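As a minimal sketch of this known-$P$ simulation (purely for illustration, take $P = N(0,1)$ and let the estimator be the sample median):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 10_000

# Repeatedly simulate n observations from the (hypothetically known) P = N(0, 1)
# and recompute the estimator; the sample variance of the replicates approximates S_n(P).
theta_hat = np.array([np.median(rng.normal(size=n)) for _ in range(B)])
s2 = theta_hat.var()
```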

But we don’t know P . So we estimate Sn (P ) with Sn (Pn ) where Pn is the empirical dis-
tribution. Since Pn is a consistent estimator, we expect that Sn (Pn ) ≈ Sn (P ). In other
words:

Bootstrap approximation of the variance: estimate $S_n(P)$ with $S_n(P_n)$, or in other words,
$$\widehat{\mathrm{Var}_P(\hat\theta_n)} = \mathrm{Var}_{P_n}(\hat\theta_n).$$
But how do we compute Sn (Pn )? We use the simulation method above, except that we
simulate from Pn instead of P . This leads to the following algorithm:

Bootstrap Variance Estimator

1. Draw a bootstrap sample X1∗ , . . . , Xn∗ ∼ Pn . Compute θbn∗ = g(X1∗ , . . . , Xn∗ ).


2. Repeat the previous step $B$ times, yielding estimators $\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B}$.

3. Compute
$$\hat{s} = \sqrt{\frac{1}{B}\sum_{j=1}^B \left(\hat\theta^*_{n,j} - \bar\theta\right)^2}$$
where $\bar\theta = \frac{1}{B}\sum_{j=1}^B \hat\theta^*_{n,j}$.

4. Output sb.
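A minimal Python sketch of this algorithm follows; the function name bootstrap_se and the choice of the sample median as the estimator g are illustrative, not part of the notes.

```python
import numpy as np

def bootstrap_se(x, g, B=10_000, rng=None):
    """Bootstrap estimate of the standard error of theta_hat = g(X_1, ..., X_n)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    # Steps 1-2: draw B bootstrap samples from P_n and recompute the estimator on each.
    theta_star = np.array([g(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    # Step 3: standard deviation of the bootstrap replications (divides by B, as in the notes).
    return theta_star.std()

# Example usage: standard error of the sample median.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
se_hat = bootstrap_se(x, np.median, rng=rng)
```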

You can think about it like this:


$$\frac{1}{B}\sum_{j=1}^B \left(\hat\theta^*_{n,j} - \bar\theta\right)^2 \;\underbrace{\approx}_{\text{simulation error}}\; S_n(P_n) \;\underbrace{\approx}_{\text{estimation error}}\; S_n(P).$$

There are two sources of error in this approximation. The first is due to the fact that $n$ is finite and the second is due to the fact that $B$ is finite. However, we can make $B$ as large as we like. (In practice, it usually suffices to take $B = 10{,}000$.) So we ignore the error due to finite $B$.

Theorem 1 Under appropriate regularity conditions, $\dfrac{\hat{s}^2}{\mathrm{Var}(\hat\theta_n)} \xrightarrow{P} 1$ as $n \to \infty$.

Now we describe the confidence interval algorithm. This will look less intuitive than the variance estimator; I'll explain it in Section 5.

Bootstrap Confidence Interval

1. Draw a bootstrap sample X1∗ , . . . , Xn∗ ∼ Pn . Compute θbn∗ = g(X1∗ , . . . , Xn∗ ).


2. Repeat the previous step $B$ times, yielding estimators $\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B}$.

3. Let
$$\hat F(t) = \frac{1}{B}\sum_{j=1}^B I\left(\sqrt{n}\,(\hat\theta^*_{n,j} - \hat\theta_n) \le t\right).$$

4. Let
$$C_n = \left[\hat\theta_n - \frac{t_{1-\alpha/2}}{\sqrt{n}},\; \hat\theta_n - \frac{t_{\alpha/2}}{\sqrt{n}}\right]$$
where $t_{\alpha/2} = \hat F^{-1}(\alpha/2)$ and $t_{1-\alpha/2} = \hat F^{-1}(1-\alpha/2)$.

5. Output Cn .
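A minimal Python sketch of this confidence-interval algorithm (again the function name and the use of the sample mean are illustrative assumptions):

```python
import numpy as np

def bootstrap_ci(x, g, alpha=0.05, B=10_000, rng=None):
    """Pivotal bootstrap confidence interval for theta = T(P), estimated by theta_hat = g(x)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta_hat = g(x)
    theta_star = np.array([g(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    # Quantiles of F_hat, the conditional distribution of sqrt(n) * (theta*_n - theta_hat).
    t = np.sqrt(n) * (theta_star - theta_hat)
    t_lo, t_hi = np.quantile(t, [alpha / 2, 1 - alpha / 2])
    # C_n = [theta_hat - t_{1-alpha/2}/sqrt(n), theta_hat - t_{alpha/2}/sqrt(n)].
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
lo, hi = bootstrap_ci(x, np.mean, rng=rng)   # 95 percent interval for the mean
```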

Theorem 2 Under appropriate regularity conditions,
$$P(\theta \in C_n) = 1 - \alpha - O\left(\frac{1}{\sqrt{n}}\right)$$
as $n \to \infty$.

See the appendix for a discussion of the regularity conditions.

4 Examples

Example 3 Consider the polynomial regression model $Y = g(X) + \epsilon$ where $X, Y \in \mathbb{R}$ and $g(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ we can estimate $\beta = (\beta_0, \beta_1, \beta_2)$ with the least squares estimator $\hat\beta$. Suppose that $g(x)$ is concave and we are interested in the location at which $g(x)$ is maximized. It is easy to see that the maximum occurs at $x = \theta$ where $\theta = -(1/2)\beta_1/\beta_2$. A point estimate of $\theta$ is $\hat\theta = -(1/2)\hat\beta_1/\hat\beta_2$. Now we use the bootstrap to get a confidence interval for $\theta$. Figure 1 shows 50 points drawn from the above model with $\beta_0 = -1$, $\beta_1 = 2$, $\beta_2 = -1$. The $X_i$'s were sampled uniformly on $[0, 2]$ and we took $\epsilon_i \sim N(0, .2^2)$. In this case, $\theta = 1$. The true and estimated curves are shown in the figure. At the bottom of the plot we show the 95 percent bootstrap confidence interval based on $B = 1{,}000$.
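A sketch of how this example could be reproduced is given below; the notes do not show the code used for the figure, so the details (pair resampling, np.polyfit for the least squares fit) are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 50, 1_000
beta0, beta1, beta2 = -1.0, 2.0, -1.0

# Simulate data from the model in the example.
X = rng.uniform(0, 2, size=n)
Y = beta0 + beta1 * X + beta2 * X**2 + rng.normal(scale=0.2, size=n)

def theta_of(x, y):
    """Least squares quadratic fit; return the location of its maximum, -b1 / (2 b2)."""
    b2, b1, _ = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
    return -0.5 * b1 / b2

theta_hat = theta_of(X, Y)

# Bootstrap: resample the (X_i, Y_i) pairs with replacement, B times.
idx = rng.integers(0, n, size=(B, n))
theta_star = np.array([theta_of(X[i], Y[i]) for i in idx])
t = np.sqrt(n) * (theta_star - theta_hat)
t_lo, t_hi = np.quantile(t, [0.025, 0.975])
ci = (theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n))
```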


Figure 1: 50 points drawn from the model $Y_i = -1 + 2X_i - X_i^2 + \epsilon_i$ where $X_i \sim \mathrm{Uniform}(0, 2)$ and $\epsilon_i \sim N(0, .2^2)$. In this case, the maximum of the polynomial occurs at $\theta = 1$. The true and estimated curves are shown in the figure. At the bottom of the plot we show the 95 percent bootstrap confidence interval based on $B = 1{,}000$.

Example 4 Let $(X_1, Y_1, Z_1), \ldots, (X_n, Y_n, Z_n) \sim P$ where $X_i \in \mathbb{R}$, $Y_i \in \mathbb{R}$, $Z_i \in \mathbb{R}^d$. The partial correlation of $X$ and $Y$ given $Z$ is
$$\theta = -\frac{\Omega_{12}}{\sqrt{\Omega_{11}\Omega_{22}}}$$
where $\Omega = \Sigma^{-1}$ and $\Sigma$ is the covariance matrix of $W = (X, Y, Z)^T$. The partial correlation measures the linear dependence between $X$ and $Y$ after removing the effect of $Z$. For illustration, suppose we generate the data as follows: we take $Z \sim N(0, 1)$, $X = 10Z + \epsilon$ and $Y = 10Z + \delta$ where $\epsilon, \delta \sim N(0, 1)$. The correlation between $X$ and $Y$ is very large, but the partial correlation is 0. We generated $n = 100$ data points from this model. The sample correlation was 0.99. However, the estimated partial correlation was $-0.16$, which is much closer to 0. The 95 percent bootstrap confidence interval is $[-.33, .02]$, which includes the true value, namely, 0.
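Here is a minimal sketch of this example in Python; the model and the sample size follow the text, while the estimator of $\Sigma$ (the sample covariance via np.cov) and the pair resampling are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 1_000

# Simulate from the model in the example: X and Y are both driven by Z.
Z = rng.normal(size=n)
X = 10 * Z + rng.normal(size=n)
Y = 10 * Z + rng.normal(size=n)
W = np.column_stack([X, Y, Z])   # rows are observations of W = (X, Y, Z)

def partial_corr(w):
    """Partial correlation of the first two coordinates: -Omega_12 / sqrt(Omega_11 * Omega_22)."""
    omega = np.linalg.inv(np.cov(w, rowvar=False))
    return -omega[0, 1] / np.sqrt(omega[0, 0] * omega[1, 1])

theta_hat = partial_corr(W)

# Bootstrap confidence interval by resampling rows of W.
idx = rng.integers(0, n, size=(B, n))
theta_star = np.array([partial_corr(W[i]) for i in idx])
t = np.sqrt(n) * (theta_star - theta_hat)
t_lo, t_hi = np.quantile(t, [0.025, 0.975])
ci = (theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n))
```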

5 Why Does the Bootstrap Work?

To explain why the bootstrap works, let us begin with a heuristic. Let

$$F_n(t) = P\left(\sqrt{n}\,(\hat\theta_n - \theta) \le t\right).$$

If we knew $F_n$ we could easily construct a confidence interval. Let
$$C_n = \left[\hat\theta_n - \frac{t_{1-\alpha/2}}{\sqrt{n}},\; \hat\theta_n - \frac{t_{\alpha/2}}{\sqrt{n}}\right]$$
where $t_\alpha = F_n^{-1}(\alpha)$. Then
$$P(\theta \in C_n) = P\left(\hat\theta_n - \frac{t_{1-\alpha/2}}{\sqrt{n}} \le \theta \le \hat\theta_n - \frac{t_{\alpha/2}}{\sqrt{n}}\right) = P\left(t_{\alpha/2} \le \sqrt{n}\,(\hat\theta_n - \theta) \le t_{1-\alpha/2}\right) = F_n(t_{1-\alpha/2}) - F_n(t_{\alpha/2})$$
$$= F_n(F_n^{-1}(1 - \alpha/2)) - F_n(F_n^{-1}(\alpha/2)) = 1 - \frac{\alpha}{2} - \frac{\alpha}{2} = 1 - \alpha.$$

The problem is that we do not know $F_n$. The bootstrap estimates $F_n$ with
$$\hat F_n(t) = P\left(\sqrt{n}\,(\hat\theta^*_n - \hat\theta_n) \le t \,\Big|\, X_1, \ldots, X_n\right).$$
If $\hat F_n \approx F_n$ then the bootstrap will work.

Usually, $F_n$ will be close to some limiting distribution $L$. Similarly, $\hat F_n$ will be close to some limiting distribution $\hat L$. Moreover, $L$ and $\hat L$ will be close, which implies that $F_n$ and $\hat F_n$ are close. In practice, we usually approximate $\hat F_n$ by its Monte Carlo version
$$\bar F(t) = \frac{1}{B}\sum_{j=1}^B I\left(\sqrt{n}\,(\hat\theta^*_{n,j} - \hat\theta_n) \le t\right).$$
But $\bar F$ is close to $\hat F_n$ as long as we take $B$ large. See Figure 2.

Now we will give more detail in a simple, special case. Suppose that X1 , . . . , Xn ∼ P where
Xi has mean µ and variance σ 2 . Suppose we want to construct a confidence interval for µ.

Let $\hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i$ and define
$$F_n(t) = P\left(\sqrt{n}\,(\hat\mu_n - \mu) \le t\right). \qquad (1)$$
We want to show that
$$\hat F_n(t) = P\left(\sqrt{n}\,(\hat\mu^*_n - \hat\mu_n) \le t \,\Big|\, X_1, \ldots, X_n\right)$$
is close to $F_n$.

Theorem 5 (Bootstrap Theorem) Suppose that $\mu_3 = \mathbb{E}|X_i|^3 < \infty$. Then
$$\sup_t \left|\hat F_n(t) - F_n(t)\right| = O_P\left(\frac{1}{\sqrt{n}}\right).$$


Figure 2: The distribution $F_n(t) = P(\sqrt{n}(\hat\theta_n - \theta) \le t)$ is close to some limit distribution $L$. Similarly, the bootstrap distribution $\hat F_n(t) = P(\sqrt{n}(\hat\theta^*_n - \hat\theta_n) \le t \mid X_1, \ldots, X_n)$ is close to some limit distribution $\hat L$. Since $\hat L$ and $L$ are close, it follows that $F_n$ and $\hat F_n$ are close. In practice, we approximate $\hat F_n$ with its Monte Carlo version $\bar F$, which we can make as close to $\hat F_n$ as we like by taking $B$ large.

To prove this result, let us recall the Berry-Esseen Theorem.

Theorem 6 (Berry-Esseen Theorem) Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $\mu_3 = \mathbb{E}[|X_i - \mu|^3] < \infty$. Let $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ be the sample mean and let $\Phi$ be the cdf of a $N(0, 1)$ random variable. Let $Z_n = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}$. Then
$$\sup_z \left|P(Z_n \le z) - \Phi(z)\right| \le \frac{33}{4}\,\frac{\mu_3}{\sigma^3 \sqrt{n}}. \qquad (2)$$

Proof of the Bootstrap Theorem. Let $\Phi_\sigma(t)$ denote the cdf of a Normal with mean 0 and variance $\sigma^2$. Let $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_n)^2$. Thus, $\hat\sigma^2 = \mathrm{Var}(\sqrt{n}(\hat\mu^*_n - \hat\mu_n) \mid X_1, \ldots, X_n)$. Now, by the triangle inequality,
$$\sup_t |\hat F_n(t) - F_n(t)| \le \sup_t |F_n(t) - \Phi_\sigma(t)| + \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| + \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| = I + II + III.$$
Let $Z \sim N(0, 1)$. Then $\sigma Z \sim N(0, \sigma^2)$ and from the Berry-Esseen theorem,
$$I = \sup_t |F_n(t) - \Phi_\sigma(t)| = \sup_t \left|P\left(\sqrt{n}(\hat\mu_n - \mu) \le t\right) - P(\sigma Z \le t)\right| = \sup_t \left|P\left(\frac{\sqrt{n}(\hat\mu_n - \mu)}{\sigma} \le \frac{t}{\sigma}\right) - P\left(Z \le \frac{t}{\sigma}\right)\right| \le \frac{33}{4}\,\frac{\mu_3}{\sigma^3\sqrt{n}}.$$
Using the same argument on the third term, we have that
$$III = \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| \le \frac{33}{4}\,\frac{\hat\mu_3}{\hat\sigma^3\sqrt{n}}$$
where $\hat\mu_3 = \frac{1}{n}\sum_{i=1}^n |X_i - \hat\mu_n|^3$ is the empirical third moment. By the strong law of large numbers, $\hat\mu_3$ converges almost surely to $\mu_3$ and $\hat\sigma$ converges almost surely to $\sigma$. So, almost surely, for all large $n$, $\hat\mu_3 \le 2\mu_3$ and $\hat\sigma \ge (1/2)\sigma$, and hence $III \le \frac{33}{4}\,\frac{2\mu_3}{(\sigma/2)^3\sqrt{n}} = \frac{33}{4}\,\frac{16\mu_3}{\sigma^3\sqrt{n}}$. From the fact that $\hat\sigma - \sigma = O_P(\sqrt{1/n})$ it may be shown that $II = \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| = O_P(\sqrt{1/n})$. (This may be seen by Taylor expanding $\Phi_{\hat\sigma}(t)$ around $\sigma$.) This completes the proof. $\square$
We have shown that $\sup_t |\hat F_n(t) - F_n(t)| = O_P(1/\sqrt{n})$. From this, it may be shown that, for each $0 < \beta < 1$, $t_\beta - z_\beta = O_P(1/\sqrt{n})$. From this, one can prove Theorem 2.

So far we have focused on the mean. Similar theorems may be proved for more general
parameters. The details are complex so we will not discuss them here. More information is
in the appendix. See also Chapter 23 of van der Vaart (1998).

6 The Parametric Bootstrap

The bootstrap can also be used for parametric inference. Suppose that $X_1, \ldots, X_n \sim p(x; \theta)$. Let $\hat\theta$ be the mle. Let $\psi = g(\theta)$ and $\hat\psi = g(\hat\theta)$. To estimate the standard error of $\hat\psi$ we could find the Fisher information and then apply the delta method.

Alternatively, we simply compute the standard deviation of the bootstrap replications $\hat\psi^*_1, \ldots, \hat\psi^*_B$. The only difference is that now we draw the bootstrap samples from $p(x; \hat\theta)$. In other words:
$$X^*_1, \ldots, X^*_n \sim p(x; \hat\theta).$$
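As a minimal sketch of the parametric bootstrap (the exponential model, the parameter $\psi = P(X > 1) = e^{-\theta}$, and all numerical values here are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 10_000

# Hypothetical model: X_i ~ Exponential(rate theta); psi = g(theta) = exp(-theta) = P(X > 1).
x = rng.exponential(scale=1 / 2.0, size=n)   # data generated with theta = 2
theta_hat = 1 / x.mean()                     # MLE of the rate
psi_hat = np.exp(-theta_hat)

# Parametric bootstrap: draw the bootstrap samples from p(x; theta_hat) rather than from P_n.
psi_star = np.empty(B)
for b in range(B):
    x_star = rng.exponential(scale=1 / theta_hat, size=n)
    psi_star[b] = np.exp(-1 / x_star.mean())

se_hat = psi_star.std()   # bootstrap estimate of the standard error of psi_hat
```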

7 A Few Remarks About the Bootstrap

Here are some random remarks about the bootstrap:

1. The bootstrap is nonparametric but it does require some assumptions. You can’t
assume it is always valid. (See the appendix.)

2. The bootstrap is an asymptotic method. Thus the coverage of the confidence interval is $1 - \alpha + r_n$ where, typically, $r_n = C/\sqrt{n}$.

3. There is a related method called the jackknife where the standard error is estimated by
leaving out one observation at a time. However, the bootstrap is valid under weaker
conditions than the jackknife. See Shao and Tu (1995).

4. Another way to construct a bootstrap confidence interval is to set $C = [a, b]$ where $a$ is the $\alpha/2$ quantile of $\hat\theta^*_1, \ldots, \hat\theta^*_B$ and $b$ is the $1 - \alpha/2$ quantile. This is called the percentile interval; a minimal sketch appears after this list. This interval seems very intuitive but does not have the same theoretical support as the interval $C_n$. However, in practice, the percentile interval and $C_n$ are often quite similar.

5. There are many cases where the bootstrap is not formally justified. This is especially true with discrete structures like trees and graphs. Nonetheless, the bootstrap can be used in an informal way to get some intuition about the variability of the procedure. But keep in mind that the formal guarantees may not apply in these cases. For example, see Holmes (2003) for a discussion of the bootstrap applied to phylogenetic trees.

6. There is a method related to the bootstrap called subsampling. In this case, we draw
samples of size m < n without replacement. Subsampling produces valid confidence
intervals under weaker conditions than the bootstrap. See Politis, Romano and Wolf
(1999).

7. There are many modifications of the bootstrap that lead to more accurate confidence
intervals; see Efron (1996).

8. There is a version of the bootstrap that works in high dimensions. We discuss this in
10/36-702.
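Referring back to remark 4, here is a minimal sketch of the percentile interval (the function name and the choice of the median are illustrative):

```python
import numpy as np

def percentile_ci(x, g, alpha=0.05, B=10_000, rng=None):
    """Percentile bootstrap interval: the alpha/2 and 1 - alpha/2 quantiles of theta*_1, ..., theta*_B."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta_star = np.array([g(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(0)
x = rng.normal(size=100)
lo, hi = percentile_ci(x, np.median, rng=rng)
```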

8 Summary

The bootstrap provides nonparametric standard errors and confidence intervals. To draw a bootstrap sample we draw $n$ observations $X^*_1, \ldots, X^*_n$ from the empirical distribution $P_n$. This is equivalent to drawing $n$ observations with replacement from the original data $X_1, \ldots, X_n$. We then compute the estimator $\hat\theta^* = g(X^*_1, \ldots, X^*_n)$. If we repeat this whole process $B$ times we get $\hat\theta^*_1, \ldots, \hat\theta^*_B$. The standard deviation of these values approximates the standard error of $\hat\theta_n = g(X_1, \ldots, X_n)$.

9 References

Efron, B. and Tibshirani, R. (1994). An Introduction to the Bootstrap. CRC Press.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press. Chapter 23.

Appendix

Hadamard Differentiability. The key condition needed for the bootstrap is Hadamard differentiability. Let $D$ and $E$ be normed spaces and let $T : D \to E$. We say that $T$ is Hadamard differentiable at $P \in D$ if there exists a continuous linear map $T'_P : D \to E$ such that
$$\left\| \frac{T(P + tQ_t) - T(P)}{t} - T'_P(Q) \right\| \to 0$$
whenever $t \downarrow 0$ and $Q_t \to Q$.
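For intuition only (this worked example is not in the notes): consider the mean functional $T(P) = \int x \, dP(x)$ on distributions supported on a bounded interval. Since $T$ is linear, $T(P + tQ_t) - T(P) = t \int x \, dQ_t(x)$, so the difference quotient equals $\int x \, dQ_t(x)$, and one may take $T'_P(Q) = \int x \, dQ(x)$. On a bounded interval, integration by parts shows that convergence $Q_t \to Q$ in the supremum norm implies $\int x \, dQ_t(x) \to \int x \, dQ(x)$, so $T$ is Hadamard differentiable at every $P$.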

