
AdaBoost
Freund and Schapire: Experiments with a New Boosting Algorithm
Schapire and Singer: Improved Boosting Algorithms using Confidence-rated Predictions

Ian Fasel, October 23, 2001


Resampling for Classifier Design


Arcing (adaptive reweighting and combining): reusing or selecting data in order to improve classification.

Two most popular:

- Bagging (Breiman, 1994)
- AdaBoost (Freund and Schapire, 1996)

Both combine the results of multiple weak classifiers into a single strong classifier.


The horse-track problem

A gambler wants to write a computer program to accurately predict the winner of horse races. He gathers historical horse-race data.

- Input vector: the odds, whether the track is dry or muddy, the jockey
- Output label: win or lose

Discovers:

- easy to find rules of thumb that are often correct
- hard to find a single rule that is very highly accurate


The gambler's plan

The gambler has decided intuitively that he wants to use arcing, and he wants his final solution to be some simple combination of simply derived rules.

Repeat T times:
1. Examine a small sample of races.
2. Derive a rough rule-of-thumb, e.g.: bet on the horse with the most favorable odds.
3. Select a new sample, derive a 2nd rule-of-thumb, e.g.: if the track is muddy, then bet on the lightest horse; otherwise, choose randomly.


Questions
How to choose samples?
- Select multiple random samples?
- Concentrate only on the errors?

How to combine the rules-of-thumb into a single, accurate rule?



More Formally:

Given: training data $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.

For $t = 1, \ldots, T$:
1. Train the Weak Learner on the training set. Let $h_t : X \to \{-1, +1\}$ represent the classifier obtained after training.
2. Modify the training set somehow.

The final hypothesis $H(x)$ is some combination of all the weak hypotheses:
$$H(x) = f(h(x))$$   (1)

The question is how to modify the training set, and how to combine the weak classifiers.


Bagging
The simplest such algorithm is Bagging (Breiman, 1994).

Algorithm: given $m$ training examples, repeat for $t = 1, \ldots, T$:
- Select, at random with replacement, $m$ training examples.
- Train the learning algorithm on the selected examples to generate hypothesis $h_t$.

The final hypothesis is a simple vote: $H(x) = \mathrm{MAJ}(h_1(x), \ldots, h_T(x))$.
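As a concrete sketch (not from the original slides), the loop below assumes two hypothetical helpers, trainWeakLearner and predictWeak, standing in for whatever base classifier is used; everything else is just sampling with replacement and a majority vote, written in the same MATLAB style as the demo code near the end of these slides.

% Bagging sketch. trainWeakLearner / predictWeak are hypothetical placeholders
% for the base classifier; labels are assumed to be in {-1,+1}.
function H = bagPredict(X, Y, Xtest, T)
  m = size(X, 1);
  votes = zeros(size(Xtest, 1), 1);
  for t = 1:T
    idx = randi(m, m, 1);                    % sample m examples with replacement
    model = trainWeakLearner(X(idx, :), Y(idx));
    votes = votes + predictWeak(model, Xtest);
  end
  H = sign(votes);                           % simple (unweighted) majority vote
end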


Bias-Variance Dilemma



Bagging: Pros and Cons


- Bagging reduces variance.
- It helps improve unstable classifiers, i.e., classifiers where small changes in the training data lead to significantly different classifiers and large changes in accuracy (no proof for this, however).
- It does not reduce bias.



Boosting
Two modifications:
1. Instead of a random sample of the training data, use a weighted sample to focus learning on the most difficult examples.
2. Instead of combining classifiers with an equal vote, use a weighted vote.

Several previous methods (Schapire, 1990; Freund, 1995) were effective, but had limitations. We skip ahead to Freund and Schapire 1996.


AdaBoost (Freund and Schapire, 1996)

Initialize a distribution over the training set: $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:
1. Train the Weak Learner using distribution $D_t$.
2. Choose a weight (or confidence value) $\alpha_t \in \mathbb{R}$.
3. Update the distribution over the training set:
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$   (2)
where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a distribution.

The final vote $H(x)$ is a weighted sum:
$$H(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\!\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$$   (3)
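As a minimal sketch of what one round's reweighting looks like in code (illustrative values only, not the full demo that appears at the end of these slides):

% One round of the update in equation (2), on a toy 4-example problem.
y     = [ 1; -1;  1;  1];        % true labels
h     = [ 1;  1;  1; -1];        % this round's weak-learner outputs
D     = ones(4, 1) / 4;          % current distribution D_t
alpha = 0.5;                     % weight chosen for this round
u = y .* h;                      % +1 where h_t is correct, -1 where wrong
D = D .* exp(-alpha * u);        % misclassified examples gain mass
D = D / sum(D);                  % dividing by the sum is exactly Z_t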


If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time.
[Figure omitted: network nodes 1, 2, ..., n, with repeated bar plots labeled "Importance" over "Example 1" through "Example 6".]

However, the underlying classifier can be anything: decision trees, neural networks, hidden Markov models, ...



How do we pick $\alpha$?
To decide how to pick the $\alpha$'s, we have to understand what the relationship is between the distribution, the $\alpha_t$, and the training error. (Later, we can show that reducing training error this way should reduce test error as well.)



Theorem 1: Training error is minimized by minimizing $Z_t$.

Proof: applying the update rule (2) repeatedly,
$$D_{T+1}(i) = \frac{1}{m}\,\frac{e^{-y_i \alpha_1 h_1(x_i)}}{Z_1} \cdots \frac{e^{-y_i \alpha_T h_T(x_i)}}{Z_T}
            = \frac{e^{-y_i \sum_t \alpha_t h_t(x_i)}}{m \prod_t Z_t}
            = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t}$$   (4)

But if $H(x_i) \neq y_i$ then $y_i f(x_i) \le 0$, implying that $e^{-y_i f(x_i)} \ge 1$. Thus $[[H(x_i) \neq y_i]] \le e^{-y_i f(x_i)}$, and so
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \frac{1}{m} \sum_i e^{-y_i f(x_i)}$$   (5)


Combining these results,
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \frac{1}{m} \sum_i e^{-y_i f(x_i)}
  = \sum_i \left(\prod_t Z_t\right) D_{T+1}(i)
  = \prod_t Z_t$$   (6)
(since $D_{T+1}$ sums to 1).

Thus, we can see that minimizing $Z_t$ will minimize this error bound.

Consequences:
1. We should choose $\alpha_t$ to minimize $Z_t$.
2. We should modify the weak learner to minimize $Z_t$ (instead of, say, squared error).
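As a quick numerical illustration of how strong this product bound is (the numbers are mine, not from the slides): suppose every round achieves $Z_t \le 0.98$. Then
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \;\le\; \prod_{t=1}^{T} Z_t \;\le\; 0.98^{\,T}.$$
With $m = 100$ training examples, $0.98^T < 1/m$ once $T > \ln(0.01)/\ln(0.98) \approx 228$; a bound below $1/m$ forces the training error, which is a multiple of $1/m$, to be exactly zero.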


Finding $\alpha_t$ analytically

Define:
- $u_i = y_i h(x_i)$. (Then $u_i$ is positive if $h(x_i)$ is correct, and negative if it is incorrect. Its magnitude is the confidence.)
- $r = \sum_i D(i)\, u_i$. ($r \in [-1, +1]$; this is a measure of the overall error rate.)

If we restrict $h(x_i) \in [-1, 1]$, we can approximate $Z$:
$$Z = \sum_i D(i)\, e^{-\alpha u_i} \le \sum_i D(i) \left( \frac{1+u_i}{2}\, e^{-\alpha} + \frac{1-u_i}{2}\, e^{\alpha} \right)$$   (7)

Note: if $h(x_i) \in \{-1, +1\}$, this approximation is exact.
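The inequality in (7) is just convexity of the exponential; a one-line derivation (my addition, not on the original slide): for $u \in [-1, 1]$ write $u = \frac{1+u}{2}(+1) + \frac{1-u}{2}(-1)$, so by convexity of $x \mapsto e^{-\alpha x}$,
$$e^{-\alpha u} \;\le\; \frac{1+u}{2}\, e^{-\alpha} + \frac{1-u}{2}\, e^{\alpha},$$
with equality at $u = \pm 1$, which is why the bound is exact for $h(x_i) \in \{-1, +1\}$.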


We can find $\alpha$ to minimize $Z$ analytically by finding the $\alpha$ for which $\frac{dZ(\alpha)}{d\alpha} = 0$,* which gives us
$$\alpha = \frac{1}{2} \ln \frac{1+r}{1-r}$$

Plugging this into equation (7) gives the upper bound
$$Z \le \sqrt{1 - r^2}$$
for a particular $t$, or, plugging into equation (6), we have an upper bound for the training error of $H$:
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \prod_{t=1}^{T} Z_t \le \prod_{t=1}^{T} \sqrt{1 - r_t^2}$$

* For the careful: it can be shown that $\alpha$ chosen this way is guaranteed to minimize $Z$, and that it is unique.
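In code the analytic choice is only a couple of lines (a sketch in the same MATLAB style as the later demo; u = y .* h and the current distribution D are assumed to exist as column vectors):

r     = sum(D .* u);                    % weighted correlation, r in [-1, +1]
alpha = 0.5 * log((1 + r) / (1 - r));   % the closed-form alpha above
Zub   = sqrt(1 - r^2);                  % upper bound on Z_t for this round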


Finding $\alpha$ numerically

When using real-valued hypotheses $h(x_i) \in [-1, 1]$, the best choice of $\alpha$ to minimize $Z$ must be found numerically. If we do this, and find $\alpha^*$ such that $Z'(\alpha^*) = 0$, then
$$Z'(\alpha^*) = \frac{dZ(\alpha)}{d\alpha}\bigg|_{\alpha^*}
             = \frac{d}{d\alpha} \sum_i D(i)\, e^{-\alpha u_i}\bigg|_{\alpha^*}
             = -\sum_i D(i)\, u_i\, e^{-\alpha^* u_i}
             = -Z \sum_i D_{t+1}(i)\, u_i = 0$$

So, either $Z = 0$ (the trivial case), or $\sum_i D_{t+1}(i)\, u_i = 0$.

In words, this means that with respect to the new distribution $D_{t+1}$, the current classifier $h_t$ will perform at chance.
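A sketch of the numerical route using MATLAB's fminbnd (the search interval is an arbitrary assumption; u and D as on the previous slides):

Zfun  = @(a) sum(D .* exp(-a * u));     % Z as a function of alpha
alpha = fminbnd(Zfun, -10, 10);         % bracket is a guess; widen if needed
Dnext = D .* exp(-alpha * u);
Dnext = Dnext / sum(Dnext);             % at the optimum, sum(Dnext .* u) is ~0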


Relation to Naive Bayes rule


If we assume independence, then
$$p(y \mid h_1, h_2, \ldots, h_T) \propto \prod_{i=1}^{T} p(h_i \mid y)$$

If we have independent classifiers $h_1, \ldots, h_T$, then the Bayes-optimal prediction is $\hat{y} = \mathrm{sign}\!\left(\sum_i h_i(x)\right)$. Since AdaBoost attempts to make the hypotheses independent, the intuition is that this is the optimal combination.


Modifying the Weak Learner for AdaBoost


Just as we choose $\alpha_t$ to minimize $Z_t$, it is also sometimes possible to modify the weak learner $h_t$ so that it minimizes $Z$ explicitly. This leads to several modifications of common weak learners:

- A modified rule for branching in C4.5.
- For neural networks trained with backprop, simply multiply the weight change due to $x_i$ by $D(i)$ (a sketch follows below).
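A hypothetical sketch of the backprop case (my code, not from the slides): a single linear unit trained by gradient descent on squared error, with each example's weight change scaled by $D(i)$. X (m x n), y in {-1,+1}, and D are assumed to come from the boosting loop; the learning rate and epoch count are arbitrary.

m  = size(X, 1);
w  = zeros(size(X, 2), 1);
lr = 0.01;                                    % learning rate (arbitrary)
for epoch = 1:50
    for i = 1:m
        err = y(i) - X(i, :) * w;             % residual on example i
        w   = w + lr * D(i) * err * X(i, :)'; % weight change scaled by D(i)
    end
end
h = sign(X * w);                              % resulting weak hypothesis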



Practical advantages of AdaBoost


- Simple and easy to program.
- No parameters to tune (except T).
- Provably effective, provided we can consistently find rough rules of thumb. The goal is to find hypotheses barely better than guessing.
- Can combine with any (or many) classifiers to find weak hypotheses: neural networks, decision trees, simple rules of thumb, nearest-neighbor classifiers, ...


Proof that it is easy to implement

[X, Y] = preprocessTrainingData(params);   % assumed: X is nImages x nFeatures, Y in {-1,+1}
nImages = size(X, 1);

% Initialize weight distribution vector
dist = ones(nImages, 1) * (1/nImages);

for k = 1:nBoosts                          % nBoosts: number of rounds, assumed set elsewhere
    % do the ridge regression, weighting each example by dist
    % (weighting by dist is assumed here so that successive rounds differ)
    z  = 1000;                             % wild guess if this is good
    Dw = diag(dist);
    w(:, k) = pinv(X' * Dw * X + z * eye(size(X, 2))) * (X' * Dw * Y);

    % find u_i = y_i * h_k(x_i) for this round of AdaBoost
    u = sign(X * w(:, k)) .* Y;


    % Find alpha numerically:
    %   Zfun = @(a) sum(dist .* exp(-a * u));
    %   alpha(k) = fminbnd(Zfun, -100, 100);
    % Or find alpha analytically:
    r = sum(dist .* u);
    alpha(k) = 0.5 * log((1 + r) / (1 - r));

    % make new distribution (dividing by its sum is the Z_k normalization)
    udist = dist .* exp(-alpha(k) .* u);
    dist  = udist / sum(udist);
end

% evaluate the weighted vote on held-out data
[X, Y] = preprocessTestData(params);
H = sign( sign(X * w) * alpha' );   % sign( sum_k alpha_k * h_k(x) )
testErr = sum(H .* Y <= 0);         % number of test examples classified incorrectly



Caveats
AdaBoost with decision trees has been referred to as the best off-the-shelf classifier. However:

- If the weak learners are actually quite strong (i.e., the error gets small very quickly), boosting might not help.
- If the hypotheses are too complex, test error might be much larger than training error.



go to ppt slides

