AdaBoost
Freund and Schapire: Experiments with a New Boosting Algorithm
Schapire and Singer: Improved Boosting Algorithms Using Confidence-rated Predictions
Goal: combine the results of multiple weak classifiers into a single strong classifier.
A gambler wants to write a computer program to accurately predict the winner of horse races. He gathers historical horse-race data.
Input vector: odds, dry or muddy track, jockey, . . .
Output label: win or lose
Discovers:
The gambler has decided intuitively that he wants to use arcing, and he wants his final solution to be some simple combination of simply derived rules.
Repeat T times:
1. Examine a small sample of races.
2. Derive a rough rule-of-thumb: e.g., "bet on the horse with the most favorable odds."
3. Select a new sample, derive a 2nd rule-of-thumb: e.g., "if the track is muddy, then bet on the lightest horse; otherwise, choose randomly."
Questions
How to choose the samples?
How to combine the rules-of-thumb?
More Formally:
Given: training data $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
For $t = 1, \ldots, T$:
1. Train the Weak Learner on the training set. Let $h_t : X \to \{-1, +1\}$ represent the classifier obtained after training.
2. Modify the training set somehow.
The final hypothesis $H(x)$ is some combination of all the weak hypotheses:
$H(x) = f(h_1(x), \ldots, h_T(x))$   (1)
The question is how to modify the training set, and how to combine the weak classifiers.
Bagging
The simplest algorithm is called Bagging, used by Breiman (1994).
Algorithm: Given $m$ training examples, repeat for $t = 1, \ldots, T$:
1. Select, at random with replacement, $m$ training examples.
2. Train the learning algorithm on the selected examples to generate hypothesis $h_t$.
The final hypothesis is a simple vote: $H(x) = MAJ(h_1(x), \ldots, h_T(x))$.
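As a minimal MATLAB sketch of this procedure (illustrative only: X, Y, Xtest, T, trainWeakLearner, and predictWeak are hypothetical names, not from the original slides):

% Bagging: X is m-by-d training data, Y is m-by-1 labels in {-1,+1}, T rounds
m = size(X, 1);
models = cell(T, 1);
for t = 1:T
    idx = randi(m, m, 1);                              % sample m examples with replacement
    models{t} = trainWeakLearner(X(idx, :), Y(idx));   % train on the bootstrap sample
end
% Final hypothesis: simple majority vote over the T classifiers
votes = zeros(size(Xtest, 1), T);
for t = 1:T
    votes(:, t) = predictWeak(models{t}, Xtest);       % each column holds one classifier's votes
end
H = sign(sum(votes, 2));                               % sign of the vote total (0 indicates a tie)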
Bagging helps improve unstable classifiers: i.e., those where small changes in the training data lead to significantly different classifiers and large changes in accuracy.
Boosting
Boosting makes two modifications:
1. Instead of a random sample of the training data, use a weighted sample to focus learning on the most difficult examples.
2. Instead of combining the classifiers with an equal vote, use a weighted vote.
Several previous methods (Schapire, 1990; Freund, 1995) were effective, but had limitations. We skip ahead to Freund and Schapire (1996).
Initialize a distribution over the training set: $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
1. Train the Weak Learner using distribution $D_t$.
2. Choose a weight (or confidence value) $\alpha_t \in \mathbb{R}$.
3. Update the distribution over the training set:
$D_{t+1}(i) = \dfrac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$   (2)
where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ will be a distribution.
The final vote $H(x)$ is a weighted sum:
$H(x) = \mathrm{sign}(f(x)), \quad f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$   (3)
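As a rough MATLAB sketch of this loop (trainWeakLearner and predictWeak are hypothetical placeholders for whatever weak learner is used; the choice of $\alpha_t$ shown here anticipates the analytic formula derived below):

m = size(X, 1);
dist   = ones(m, 1) / m;                          % D_1(i) = 1/m
alpha  = zeros(T, 1);
models = cell(T, 1);
for t = 1:T
    models{t} = trainWeakLearner(X, Y, dist);     % weak learner sees the weighted sample
    h = predictWeak(models{t}, X);                % h_t(x_i) in {-1,+1}
    r = sum(dist .* Y .* h);
    alpha(t) = 0.5 * log((1 + r) / (1 - r));      % confidence value for this round
    dist = dist .* exp(-alpha(t) * Y .* h);       % equation (2), before normalization
    dist = dist / sum(dist);                      % dividing by Z_t makes it a distribution
end
% Final weighted vote, equation (3)
f = zeros(size(Xtest, 1), 1);
for t = 1:T
    f = f + alpha(t) * predictWeak(models{t}, Xtest);
end
H = sign(f);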
If the underlying classifiers are linear networks, then AdaBoost builds multilayer perceptrons one node at a time.
[Figure: bar charts of the importance (weight) assigned to training Examples 1-6 as nodes 1, 2, . . . , n are added one at a time.]
However, the underlying classifier can be anything: decision trees, neural networks, hidden Markov models, . . .
How do we pick $\alpha$?
To decide how to pick the $\alpha$'s, we have to understand the relationship between the distribution, the $\alpha_t$, and the training error. (Later, we can show that reducing the training error this way should reduce the test error as well.)
Theorem 1: The training error is minimized by minimizing $Z_t$.
Proof: Unraveling the update rule (2),
$D_{T+1}(i) = \dfrac{1}{m} \cdot \dfrac{e^{-y_i \alpha_1 h_1(x_i)}}{Z_1} \cdots \dfrac{e^{-y_i \alpha_T h_T(x_i)}}{Z_T} = \dfrac{e^{-y_i \sum_t \alpha_t h_t(x_i)}}{m \prod_t Z_t} = \dfrac{e^{-y_i f(x_i)}}{m \prod_t Z_t}$   (4)
But if $H(x_i) \neq y_i$ then $y_i f(x_i) \le 0$, implying that $e^{-y_i f(x_i)} \ge 1$. Thus,
$[[H(x_i) \neq y_i]] \le e^{-y_i f(x_i)}$
and so
$\dfrac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \dfrac{1}{m} \sum_i e^{-y_i f(x_i)}$   (5)
$\dfrac{1}{m} \sum_i e^{-y_i f(x_i)} = \left(\prod_t Z_t\right) \sum_i D_{T+1}(i) = \prod_t Z_t$   (6)
Thus, we can see that minimizing $Z_t$ will minimize this error bound.
Consequences:
1. We should choose $\alpha_t$ to minimize $Z_t$.
2. We should modify the weak learner to minimize $Z_t$ (instead of, say, squared error).
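For intuition, $Z_t$ is easy to evaluate for any candidate $\alpha$ once the margins $u_i = y_i h_t(x_i)$ are known. A toy MATLAB check (the distribution and margins below are invented for illustration):

% Z(alpha) = sum_i D(i) * exp(-alpha * u_i)
computeZ = @(a, dist, u) sum(dist .* exp(-a .* u));

dist = [0.25; 0.25; 0.25; 0.25];   % current distribution over 4 examples
u    = [ 1;    1;   -1;    1  ];   % weak learner is wrong on example 3
computeZ(0.5, dist, u)             % approx 0.867
computeZ(1.0, dist, u)             % approx 0.956
% The analytic choice derived below, alpha = 0.5*log((1+r)/(1-r)) with r = 0.5,
% gives alpha approx 0.549 and the minimum Z = sqrt(1 - r^2) approx 0.866.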
Finding $\alpha_t$ analytically
Define: $u_i = y_i h(x_i)$. (Then $u_i$ is positive if $h(x_i)$ is correct, and negative if it is incorrect. Its magnitude is the confidence.)
$r = \sum_i D(i)\, u_i$. ($r \in [-1, +1]$; this is a measure of the overall error rate.)
If we restrict $h(x_i) \in [-1, 1]$, we can approximate $Z$:
$Z = \sum_i D(i)\, e^{-\alpha u_i} \le \sum_i D(i) \left[ \dfrac{1+u_i}{2} e^{-\alpha} + \dfrac{1-u_i}{2} e^{\alpha} \right]$   (7)
Setting $\dfrac{dZ(\alpha)}{d\alpha} = 0$ and solving gives
$\alpha_t = \dfrac{1}{2} \ln\left(\dfrac{1+r_t}{1-r_t}\right)$
for a particular $t$.* Plugging into equation 6, we have an upper bound for the training error of $H$:
$\dfrac{1}{m} \sum_i [[H(x_i) \neq y_i]] \le \prod_{t=1}^{T} Z_t \le \prod_{t=1}^{T} \sqrt{1 - r_t^2}$
* For the careful: it can be shown that $\alpha$ chosen this way is guaranteed to minimize $Z$, and that it is unique.
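Since $\sum_i D(i) = 1$ and $\sum_i D(i) u_i = r$, the right-hand side of equation 7 equals $\frac{1+r}{2} e^{-\alpha} + \frac{1-r}{2} e^{\alpha}$. A short sketch of the intermediate algebra (minimizing this expression over $\alpha$):
$$\frac{d}{d\alpha}\left[\frac{1+r}{2} e^{-\alpha} + \frac{1-r}{2} e^{\alpha}\right] = -\frac{1+r}{2} e^{-\alpha} + \frac{1-r}{2} e^{\alpha} = 0 \;\Longrightarrow\; e^{2\alpha} = \frac{1+r}{1-r} \;\Longrightarrow\; \alpha = \frac{1}{2}\ln\frac{1+r}{1-r}$$
Substituting back,
$$\frac{1+r}{2}\sqrt{\frac{1-r}{1+r}} + \frac{1-r}{2}\sqrt{\frac{1+r}{1-r}} = \sqrt{1-r^2},$$
which is where the $\sqrt{1-r_t^2}$ factors in the bound come from.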
Finding $\alpha$ numerically
When using real-valued hypotheses $h(x_i) \in [-1, 1]$, the best choice of $\alpha$ to minimize $Z$ must be found numerically. If we do this, and find $\alpha$ such that $Z'(\alpha) = 0$, then
$Z'(\alpha) = \dfrac{dZ(\alpha)}{d\alpha} = -\sum_i D(i)\, u_i\, e^{-\alpha u_i} = -Z \sum_i D_{t+1}(i)\, u_i = 0$
So, either $Z = 0$ (the trivial case), or $\sum_i D_{t+1}(i)\, u_i = 0$.
In words, this means that with respect to the new distribution $D_{t+1}$, the current classifier $h_t$ will perform at chance.
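A minimal MATLAB sketch of the numerical search (assuming dist and u = Y .* h are already computed for the current round; fminbnd is MATLAB's bounded scalar minimizer):

Z = @(a) sum(dist .* exp(-a .* u));   % Z as a function of alpha
alphaStar = fminbnd(Z, -100, 100);    % numerical minimizer of Z

% Sanity check: under the updated distribution, the weighted correlation
% between the current classifier and the labels is (numerically) zero,
% i.e., h_t performs at chance.
newDist = dist .* exp(-alphaStar .* u);
newDist = newDist / sum(newDist);
sum(newDist .* u)                     % approx 0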
If we have $h_1, \ldots, h_t$ independent classifiers, then
$p(y \mid h_1, h_2, \ldots, h_t) \propto p(y) \prod_{i=1}^{t} p(h_i \mid y)$
and the Bayes optimal prediction is $\hat{y} = \mathrm{sign}\left(\sum_i \alpha_i h_i(x)\right)$. Since AdaBoost attempts to make the hypotheses independent, the intuition is that this is the optimal combination.
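A sketch of why independence leads to exactly this weighted vote (assuming each $h_i \in \{-1,+1\}$ errs with probability $\epsilon_i$ regardless of the true class, and a uniform prior on $y$):
$$\log\frac{p(y=+1 \mid h_1,\ldots,h_t)}{p(y=-1 \mid h_1,\ldots,h_t)} = \sum_{i=1}^{t} \log\frac{p(h_i \mid y=+1)}{p(h_i \mid y=-1)} = \sum_{i=1}^{t} h_i \ln\frac{1-\epsilon_i}{\epsilon_i} = 2\sum_{i=1}^{t} \alpha_i h_i, \qquad \alpha_i = \frac{1}{2}\ln\frac{1-\epsilon_i}{\epsilon_i}$$
so the sign of $\sum_i \alpha_i h_i$ is the Bayes optimal prediction. For binary hypotheses this $\alpha_i$ matches AdaBoost's analytic choice, since $r_i = 1 - 2\epsilon_i$ gives $\frac{1+r_i}{1-r_i} = \frac{1-\epsilon_i}{\epsilon_i}$.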
[X, Y] = preprocessTrainingData(params);   % X: nImages-by-nFeatures data, Y: labels in {-1,+1}
nImages = size(X, 1);

% Initialize weight distribution vector
dist = ones(nImages, 1) * (1/nImages);

for k = 1:nBoosts
    % do the ridge regression (the weak learner for this round)
    z = 1000;   % wild guess if this is good
    w(:, k) = pinv(X'*X + z*eye(size(X, 2))) * X' * Y;

    % find alpha for this round of AdaBoost: u(i) = y_i * h_k(x_i)
    u = sign(X * w(:, k)) .* Y;
    % Find alpha numerically:
    % alpha(k) = fminbnd(@(a) sum(dist .* exp(-a .* u)), -100, 100);
    % Or find alpha analytically:
    r = sum(dist .* u);
    alpha(k) = .5 * log((1+r)/(1-r));

    % make new distribution
    udist = dist .* exp(-1 * alpha(k) .* u);
    dist = udist / sum(udist);
end

[X, Y] = preprocessTestData(params);
H = sign( sign(X * w) * alpha(:) );   % weighted vote over the nBoosts weak classifiers
testErr = sum( H .* Y <= 0 );
Caveats
AdaBoost with decision trees has been referred to as the best off-the-shelf classifier. However:
If the weak learners are actually quite strong (i.e., the error gets small very quickly), boosting might not help.
If the hypotheses are too complex, the test error might be much larger than the training error.
go to ppt slides