
AdaBoost
Freund and Schapire: Experiments with a New Boosting Algorithm
Schapire and Singer: Improved Boosting Algorithms using Confidence-rated Predictions

Ian Fasel, October 23, 2001


Resampling for Classifier Design


Arcing (adaptive reweighting and combining): reusing or selecting data in order to improve classification.

Two most popular:

- Bagging (Breiman, 1994)
- AdaBoost (Freund and Schapire, 1996)

Both combine the results of multiple weak classifiers into a single strong classifier.


The horse-track problem

A gambler wants to write a computer program to accurately predict the winner of horse races. He gathers historical horse-race data.

- Input vector: the odds, whether the track is dry or muddy, the jockey
- Output label: win or lose

Discovers:

- easy to find rules of thumb that are often correct
- hard to find a single rule that is very highly accurate


The gambler's plan

The gambler has decided intuitively that he wants to use arcing, and he wants his final solution to be some simple combination of simply derived rules.

Repeat T times:
1. Examine a small sample of races.
2. Derive a rough rule-of-thumb, e.g.: bet on the horse with the most favorable odds.
3. Select a new sample, derive a 2nd rule-of-thumb, e.g.: if the track is muddy, then bet on the lightest horse; otherwise, choose randomly.


Questions
How to choose samples?
- Select multiple random samples?
- Concentrate only on the errors?

How to combine the rules-of-thumb into a single, accurate rule?



More Formally:

Given: training data $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.

For $t = 1, \ldots, T$:
1. Train the Weak Learner on the training set. Let $h_t : X \to \{-1, +1\}$ represent the classifier obtained after training.
2. Modify the training set somehow.

The final hypothesis $H(x)$ is some combination of all the weak hypotheses:
$$H(x) = f(h(x))$$   (1)

The question is how to modify the training set, and how to combine the weak classifiers.


Bagging
The simplest such algorithm is Bagging (Breiman, 1994).

Algorithm: given $m$ training examples, repeat for $t = 1, \ldots, T$:
- Select, at random with replacement, $m$ training examples.
- Train the learning algorithm on the selected examples to generate hypothesis $h_t$.

The final hypothesis is a simple vote: $H(x) = \mathrm{MAJ}(h_1(x), \ldots, h_T(x))$.
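As a concrete sketch (not from the original slides), the loop below assumes two hypothetical helpers, trainWeakLearner and predictWeak, standing in for whatever base classifier is used; everything else is just sampling with replacement and a majority vote, written in the same MATLAB style as the demo code near the end of these slides.

% Bagging sketch. trainWeakLearner / predictWeak are hypothetical placeholders
% for the base classifier; labels are assumed to be in {-1,+1}.
function H = bagPredict(X, Y, Xtest, T)
  m = size(X, 1);
  votes = zeros(size(Xtest, 1), 1);
  for t = 1:T
    idx = randi(m, m, 1);                    % sample m examples with replacement
    model = trainWeakLearner(X(idx, :), Y(idx));
    votes = votes + predictWeak(model, Xtest);
  end
  H = sign(votes);                           % simple (unweighted) majority vote
end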


Bias-Variance Dilemma



Bagging: Pros and Cons


- Bagging reduces variance.
- It helps improve unstable classifiers, i.e., classifiers where small changes in the training data lead to significantly different classifiers and large changes in accuracy (no proof for this, however).
- It does not reduce bias.



Boosting
Two modifications:
1. Instead of a random sample of the training data, use a weighted sample to focus learning on the most difficult examples.
2. Instead of combining classifiers with an equal vote, use a weighted vote.

Several previous methods (Schapire, 1990; Freund, 1995) were effective, but had limitations. We skip ahead to Freund and Schapire 1996.


AdaBoost (Freund and Schapire, 1996)

Initialize a distribution over the training set: $D_1(i) = 1/m$.

For $t = 1, \ldots, T$:
1. Train the Weak Learner using distribution $D_t$.
2. Choose a weight (or confidence value) $\alpha_t \in \mathbb{R}$.
3. Update the distribution over the training set:
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$   (2)
where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a distribution.

The final vote $H(x)$ is a weighted sum:
$$H(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\!\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$$   (3)
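As a minimal sketch of what one round's reweighting looks like in code (illustrative values only, not the full demo that appears at the end of these slides):

% One round of the update in equation (2), on a toy 4-example problem.
y     = [ 1; -1;  1;  1];        % true labels
h     = [ 1;  1;  1; -1];        % this round's weak-learner outputs
D     = ones(4, 1) / 4;          % current distribution D_t
alpha = 0.5;                     % weight chosen for this round
u = y .* h;                      % +1 where h_t is correct, -1 where wrong
D = D .* exp(-alpha * u);        % misclassified examples gain mass
D = D / sum(D);                  % dividing by the sum is exactly Z_t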


If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time.
[Figure omitted: network nodes 1, 2, ..., n, with repeated bar plots labeled "Importance" over "Example 1" through "Example 6".]

However, the underlying classifier can be anything: decision trees, neural networks, hidden Markov models, ...



How do we pick $\alpha$?
To decide how to pick the $\alpha$'s, we have to understand what the relationship is between the distribution, the $\alpha_t$, and the training error. (Later, we can show that reducing training error this way should reduce test error as well.)



Theorem 1: Training error is minimized by minimizing $Z_t$.

Proof: applying the update rule (2) repeatedly,
$$D_{T+1}(i) = \frac{1}{m}\,\frac{e^{-y_i \alpha_1 h_1(x_i)}}{Z_1} \cdots \frac{e^{-y_i \alpha_T h_T(x_i)}}{Z_T}
            = \frac{e^{-y_i \sum_t \alpha_t h_t(x_i)}}{m \prod_t Z_t}
            = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t}$$   (4)

But if $H(x_i) \neq y_i$ then $y_i f(x_i) \le 0$, implying that $e^{-y_i f(x_i)} \ge 1$. Thus $[[H(x_i) \neq y_i]] \le e^{-y_i f(x_i)}$, and so
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \frac{1}{m} \sum_i e^{-y_i f(x_i)}$$   (5)


Combining these results,
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \frac{1}{m} \sum_i e^{-y_i f(x_i)}
  = \sum_i \left(\prod_t Z_t\right) D_{T+1}(i)
  = \prod_t Z_t$$   (6)
(since $D_{T+1}$ sums to 1).

Thus, we can see that minimizing $Z_t$ will minimize this error bound.

Consequences:
1. We should choose $\alpha_t$ to minimize $Z_t$.
2. We should modify the weak learner to minimize $Z_t$ (instead of, say, squared error).
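As a quick numerical illustration of how strong this product bound is (the numbers are mine, not from the slides): suppose every round achieves $Z_t \le 0.98$. Then
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \;\le\; \prod_{t=1}^{T} Z_t \;\le\; 0.98^{\,T}.$$
With $m = 100$ training examples, $0.98^T < 1/m$ once $T > \ln(0.01)/\ln(0.98) \approx 228$; a bound below $1/m$ forces the training error, which is a multiple of $1/m$, to be exactly zero.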


Finding $\alpha_t$ analytically

Define:
- $u_i = y_i h(x_i)$. (Then $u_i$ is positive if $h(x_i)$ is correct, and negative if it is incorrect. Its magnitude is the confidence.)
- $r = \sum_i D(i)\, u_i$. ($r \in [-1, +1]$; this is a measure of the overall error rate.)

If we restrict $h(x_i) \in [-1, 1]$, we can approximate $Z$:
$$Z = \sum_i D(i)\, e^{-\alpha u_i} \le \sum_i D(i) \left( \frac{1+u_i}{2}\, e^{-\alpha} + \frac{1-u_i}{2}\, e^{\alpha} \right)$$   (7)

Note: if $h(x_i) \in \{-1, +1\}$, this approximation is exact.
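The inequality in (7) is just convexity of the exponential; a one-line derivation (my addition, not on the original slide): for $u \in [-1, 1]$ write $u = \frac{1+u}{2}(+1) + \frac{1-u}{2}(-1)$, so by convexity of $x \mapsto e^{-\alpha x}$,
$$e^{-\alpha u} \;\le\; \frac{1+u}{2}\, e^{-\alpha} + \frac{1-u}{2}\, e^{\alpha},$$
with equality at $u = \pm 1$, which is why the bound is exact for $h(x_i) \in \{-1, +1\}$.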


We can find $\alpha$ to minimize $Z$ analytically by finding the $\alpha$ for which $\frac{dZ(\alpha)}{d\alpha} = 0$,* which gives us
$$\alpha = \frac{1}{2} \ln \frac{1+r}{1-r}$$

Plugging this into equation (7) gives the upper bound
$$Z \le \sqrt{1 - r^2}$$
for a particular $t$, or, plugging into equation (6), we have an upper bound for the training error of $H$:
$$\frac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \prod_{t=1}^{T} Z_t \le \prod_{t=1}^{T} \sqrt{1 - r_t^2}$$

* For the careful: it can be shown that $\alpha$ chosen this way is guaranteed to minimize $Z$, and that it is unique.
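In code the analytic choice is only a couple of lines (a sketch in the same MATLAB style as the later demo; u = y .* h and the current distribution D are assumed to exist as column vectors):

r     = sum(D .* u);                    % weighted correlation, r in [-1, +1]
alpha = 0.5 * log((1 + r) / (1 - r));   % the closed-form alpha above
Zub   = sqrt(1 - r^2);                  % upper bound on Z_t for this round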


Finding $\alpha$ numerically

When using real-valued hypotheses $h(x_i) \in [-1, 1]$, the best choice of $\alpha$ to minimize $Z$ must be found numerically. If we do this, and find $\alpha^*$ such that $Z'(\alpha^*) = 0$, then
$$Z'(\alpha^*) = \frac{dZ(\alpha)}{d\alpha}\bigg|_{\alpha^*}
             = \frac{d}{d\alpha} \sum_i D(i)\, e^{-\alpha u_i}\bigg|_{\alpha^*}
             = -\sum_i D(i)\, u_i\, e^{-\alpha^* u_i}
             = -Z \sum_i D_{t+1}(i)\, u_i = 0$$

So, either $Z = 0$ (the trivial case), or $\sum_i D_{t+1}(i)\, u_i = 0$.

In words, this means that with respect to the new distribution $D_{t+1}$, the current classifier $h_t$ will perform at chance.
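A sketch of the numerical route using MATLAB's fminbnd (the search interval is an arbitrary assumption; u and D as on the previous slides):

Zfun  = @(a) sum(D .* exp(-a * u));     % Z as a function of alpha
alpha = fminbnd(Zfun, -10, 10);         % bracket is a guess; widen if needed
Dnext = D .* exp(-alpha * u);
Dnext = Dnext / sum(Dnext);             % at the optimum, sum(Dnext .* u) is ~0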


Relation to Naive Bayes rule


If we assume independence, then
$$p(y \mid h_1, h_2, \ldots, h_T) \propto \prod_{i=1}^{T} p(h_i \mid y)$$

If we have independent classifiers $h_1, \ldots, h_T$, then the Bayes-optimal prediction is $\hat{y} = \mathrm{sign}\!\left(\sum_i h_i(x)\right)$. Since AdaBoost attempts to make the hypotheses independent, the intuition is that this is the optimal combination.


Modifying the Weak Learner for AdaBoost


Just as we choose $\alpha_t$ to minimize $Z_t$, it is also sometimes possible to modify the weak learner $h_t$ so that it minimizes $Z$ explicitly. This leads to several modifications of common weak learners:

- A modified rule for branching in C4.5.
- For neural networks trained with backprop, simply multiply the weight change due to $x_i$ by $D(i)$ (a sketch follows below).
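A hypothetical sketch of the backprop case (my code, not from the slides): a single linear unit trained by gradient descent on squared error, with each example's weight change scaled by $D(i)$. X (m x n), y in {-1,+1}, and D are assumed to come from the boosting loop; the learning rate and epoch count are arbitrary.

m  = size(X, 1);
w  = zeros(size(X, 2), 1);
lr = 0.01;                                    % learning rate (arbitrary)
for epoch = 1:50
    for i = 1:m
        err = y(i) - X(i, :) * w;             % residual on example i
        w   = w + lr * D(i) * err * X(i, :)'; % weight change scaled by D(i)
    end
end
h = sign(X * w);                              % resulting weak hypothesis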



Practical advantages of AdaBoost


- Simple and easy to program.
- No parameters to tune (except T).
- Provably effective, provided we can consistently find rough rules of thumb. The goal is to find hypotheses barely better than guessing.
- Can combine with any (or many) classifiers to find weak hypotheses: neural networks, decision trees, simple rules of thumb, nearest-neighbor classifiers, ...


Proof that it is easy to implement

[X, Y] = preprocessTrainingData(params);   % assumed: X is nImages x nFeatures, Y in {-1,+1}
nImages = size(X, 1);

% Initialize weight distribution vector
dist = ones(nImages, 1) * (1/nImages);

for k = 1:nBoosts                          % nBoosts: number of rounds, assumed set elsewhere
    % do the ridge regression, weighting each example by dist
    % (weighting by dist is assumed here so that successive rounds differ)
    z  = 1000;                             % wild guess if this is good
    Dw = diag(dist);
    w(:, k) = pinv(X' * Dw * X + z * eye(size(X, 2))) * (X' * Dw * Y);

    % find u_i = y_i * h_k(x_i) for this round of AdaBoost
    u = sign(X * w(:, k)) .* Y;


    % Find alpha numerically:
    %   Zfun = @(a) sum(dist .* exp(-a * u));
    %   alpha(k) = fminbnd(Zfun, -100, 100);
    % Or find alpha analytically:
    r = sum(dist .* u);
    alpha(k) = 0.5 * log((1 + r) / (1 - r));

    % make new distribution (dividing by its sum is the Z_k normalization)
    udist = dist .* exp(-alpha(k) .* u);
    dist  = udist / sum(udist);
end

% evaluate the weighted vote on held-out data
[X, Y] = preprocessTestData(params);
H = sign( sign(X * w) * alpha' );   % sign( sum_k alpha_k * h_k(x) )
testErr = sum(H .* Y <= 0);         % number of test examples classified incorrectly



Caveats
AdaBoost with decision trees has been referred to as the best off-the-shelf classifier. However:

- If the weak learners are actually quite strong (i.e., the error gets small very quickly), boosting might not help.
- If the hypotheses are too complex, test error might be much larger than training error.



go to ppt slides

