
Bagging and Boosting in Data Mining

Carolina Ruiz
ruiz@cs.wpi.edu http://www.cs.wpi.edu/~ruiz

Motivation and Background

Problem Definition:

Given: a dataset of instances and a target concept.
Find: a model (e.g., a set of association rules, a decision tree, or a neural network) that helps predict the classification of unseen instances.

Difficulties:

The model should be stable (i.e., it shouldn't depend too much on the input data used to construct it).
The model should be a good predictor (difficult to achieve when the input dataset is small).

Two Approaches

Bagging (Bootstrap Aggregating)

Leo Breiman, UC Berkeley

Boosting

Rob Schapire, AT&T Research
Jerry Friedman, Stanford U.

Bagging

Model Creation:

Create bootstrap replicates of the dataset and fit a model to each one.

Prediction:

Average/vote the predictions of the individual models.

Advantages

Stabilizes unstable methods.
Easy to implement and parallelizable.

Bagging Algorithm

1. Create k bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average/vote the predictions of the k models (see the sketch below).
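The three steps above can be expressed directly in code. The following is a minimal sketch, not Breiman's reference implementation; it assumes NumPy, scikit-learn decision trees as the base learner, integer class labels, and majority voting for prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Steps 1-2: create k bootstrap replicates and fit a model to each."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 3: vote the predictions of the k models (majority class wins)."""
    preds = np.array([m.predict(X) for m in models])   # shape (k, n_instances)
    majority = lambda column: np.bincount(column).argmax()
    return np.apply_along_axis(majority, axis=0, arr=preds)
```

With k models trained this way, an unseen instance is classified by the label most of the bootstrap models agree on; for regression, the vote would simply be replaced by an average.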

Boosting

Creating the model:

Construct a sequence of datasets and models in such a way that a dataset in the sequence weights an instance heavily when the previous model has misclassified it.

Prediction:

Merge the models in the sequence


Advantages:

Improves classification accuracy

Generic Boosting Algorithm


1. Equally weight all instances in the dataset
2. For i = 1 to T:
   2.1. Fit a model to the current dataset
   2.2. Upweight poorly predicted instances
   2.3. Downweight well-predicted instances
3. Merge the models in the sequence to obtain the final model (see the sketch below)
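As an illustration of the loop above, here is a minimal AdaBoost-style sketch (one common instantiation of this generic scheme, not necessarily the exact variant the tutorial has in mind). It assumes binary labels coded as -1/+1, scikit-learn decision stumps as the base learner, and instance weights passed to the learner via `sample_weight`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_fit(X, y, T=50):
    """Assumes y contains labels in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # 1. equally weight all instances
    models, alphas = [], []
    for _ in range(T):                           # 2. for i = 1 to T
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # 2.1. fit a model to the weighted dataset
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - err) / err)    # model weight: low error -> large alpha
        w *= np.exp(-alpha * y * pred)           # 2.2/2.3. upweight misclassified,
        w /= w.sum()                             #          downweight well-predicted
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boosting_predict(models, alphas, X):
    """3. Merge the models: weighted vote, then take the sign."""
    scores = sum(a * m.predict(X) for a, m in zip(models, alphas))
    return np.sign(scores)
```

Misclassified instances gain weight at each round, so later models concentrate on the cases earlier models got wrong; the final prediction is the alpha-weighted vote of the whole sequence.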

Conclusions and References

Boosted naïve Bayes tied for first place in KDD-Cup 1997.

Reference:

John F. Elder and Greg Ridgeway, "Combining Estimators to Improve Performance," KDD-99 tutorial notes.
