
Implementation of Bootstrap Aggregating Algorithm for Scikit-learn Library
Maheshakya Wijewardena
Department of Computer Science and Engineering
Faculty of Engineering
University of Moratuwa
Sri Lanka
maheshakya.10@cse.mrt.ac.lk
Abstract: Bootstrap aggregating (bagging) is designed to improve the stability and accuracy of machine learning algorithms used for classification and regression. The method reduces the variance of a predictor across different datasets, which helps avoid overfitting.
This paper discusses how multiple predictors are generated and how they are combined into an aggregate predictor. Results with and without bagging are compared for several learning algorithms.
Index Terms: Machine learning, data mining, statistics, Python language, Scikit-learn library [1].

1. INTRODUCTION


In general, learning algorithms can be broadly categorized into two groups: linear models and non-linear models. Non-linear models are considered more powerful than linear models because their error rate on a particular dataset is significantly lower. However, an issue arises: non-linear models tend to become highly specific to (overfit) that particular dataset, which can degrade their performance when predicting on other datasets. This results in high variance in the prediction task. Linear models, on the other hand, exhibit high bias, as they are more erroneous in themselves.
This project is concerned with improving the performance of non-linear models, because what the bagging [2] algorithm does, in theory, is remove the variance while keeping the bias the same. (In practice, the algorithm does not completely eliminate the variance; it only reduces it, while increasing the bias slightly.) This reduces the dependency of the model on a particular dataset while retaining high prediction accuracy, benefiting from the low bias.
So far, a general bagging module that can be applied to any learning algorithm has not been implemented in the Scikit-learn library (though the bagging technique is used within certain estimators such as Random Forests and Randomized Logistic Regression). There are many other non-linear learning algorithms, such as decision trees, k-nearest neighbors, support vector machines and gradient boosting, to which bagging methods can be applied. By implementing a general module, the performance of any learning model (typically a non-linear one) can be improved. Thus, the objective of this project is to implement bootstrap aggregating for the Scikit-learn library using its parallel processing interfaces.
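As a rough sketch of the intended usage, the following shows how a generic bagging wrapper of this kind might be applied to a non-linear base learner. The class and parameter names are assumptions modeled on the bagging estimator that later shipped with scikit-learn, not this module's final interface.

```python
# Hypothetical usage sketch of a general bagging wrapper (names are assumed,
# modeled on scikit-learn's ensemble conventions).
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrap an arbitrary non-linear base learner and train the replicates in parallel.
model = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                          n_estimators=10, n_jobs=-1, random_state=0)
model.fit(X, y)
print(model.predict(X[:5]))
```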

2. LITERATURE REVIEW


2.1 Bootstrap Aggregating
In bootstrap aggregating, the base learning algorithm is
run repeatedly in a series of rounds. In particular, on each
round, the base learner is trained on what is often called a
"bootstrap replicate" of the original training set. Suppose the
training set consists of m examples. A bootstrap replicate is then a new training set, also of size m, formed by repeatedly selecting examples uniformly at random and with replacement from the original training set. This means that the same example may appear multiple times in the bootstrap replicate, or not at all.
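As a minimal illustration of drawing one bootstrap replicate (the array names and the value of m are chosen for the example, not taken from the paper):

```python
# Drawing one bootstrap replicate of a training set of m examples.
import numpy as np

rng = np.random.RandomState(0)
m = 10
X = np.arange(m).reshape(-1, 1)          # toy training inputs
y = np.arange(m)                         # toy training targets

indices = rng.randint(0, m, size=m)      # m draws, uniformly at random, with replacement
X_replicate, y_replicate = X[indices], y[indices]

# Some examples appear several times in the replicate, others not at all.
print("sampled indices:", np.sort(indices))
print("left-out (out-of-bag) indices:", np.setdiff1d(np.arange(m), indices))
```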
Bagging methods come in many flavors but mostly differ from one another in the way they draw random subsets of the training set (a code sketch of these sampling schemes is given after the note below):
1. When random subsets of the dataset are drawn as random subsets of the samples, the algorithm is known as Pasting [3].
2. When samples are drawn with replacement, the method is known as Bagging [4].
3. When random subsets of the dataset are drawn as random subsets of the features, the method is known as Random Subspaces [5].
4. Finally, when base estimators are built on subsets of both samples and features, the method is known as Random Patches [6].
Note: the last two methods (which involve bootstrapped features) are not implemented within the scope of this project.
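A rough sketch of how the index sets for these four schemes could be drawn (the subset sizes and variable names below are illustrative assumptions):

```python
# Illustrative index generation for the four resampling flavours listed above.
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 100, 20
max_samples, max_features = 50, 10

# 1. Pasting: a random subset of the samples, drawn without replacement.
pasting_samples = rng.choice(n_samples, size=max_samples, replace=False)

# 2. Bagging: samples drawn with replacement (bootstrap).
bagging_samples = rng.choice(n_samples, size=n_samples, replace=True)

# 3. Random Subspaces: a random subset of the features.
subspace_features = rng.choice(n_features, size=max_features, replace=False)

# 4. Random Patches: random subsets of both samples and features.
patch_samples = rng.choice(n_samples, size=max_samples, replace=True)
patch_features = rng.choice(n_features, size=max_features, replace=False)
```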

The bagging procedure can be presented as follows (Fig. 1).

Fig. 1 Bagging algorithm
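Since Fig. 1 is reproduced here only by its caption, the following is a minimal sketch (not the original figure) of the training and voting procedure it describes, written against the generic scikit-learn estimator interface; the helper names are illustrative and integer class labels are assumed.

```python
# Minimal sketch of bagging for classification: train on bootstrap replicates,
# then combine the individual predictions by majority vote.
import numpy as np
from sklearn.base import clone

def bagging_fit(base_estimator, X, y, n_estimators=10, random_state=0):
    rng = np.random.RandomState(random_state)
    m = X.shape[0]
    estimators = []
    for _ in range(n_estimators):
        idx = rng.randint(0, m, size=m)              # bootstrap replicate
        estimators.append(clone(base_estimator).fit(X[idx], y[idx]))
    return estimators

def bagging_predict(estimators, X):
    votes = np.array([est.predict(X) for est in estimators])  # (n_estimators, n_points)
    # Majority vote per point (assumes non-negative integer class labels).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```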

Regression differs from the above in that, for an input x, it averages the values produced by the estimators generated during the training phase instead of taking a vote.
2.2 Bias-Variance Trade-off [7]
Bias measures the accuracy or quality of the algorithm; high bias means a poor match.
Variance measures the precision or specificity of the match; high variance means a weak match.
These two quantities cannot be minimized independently, hence there is a trade-off.

The true function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ. Given a set of training examples {(xi, yi)}, we fit a hypothesis h(x) = w·x + b to the data so as to minimize the squared error Σi [yi - h(xi)]^2. Now, given a new data point x* with observed value y* = f(x*) + ε, we would like to understand the expected prediction error
E[(y* - h(x*))^2].

In practice we have only one training sample, so we simulate multiple training sets by bootstrapping. Suppose we construct bootstrap replicates of S as S1, ..., SB and apply the learning algorithm to each replicate Sb to obtain the hypothesis hb. Let Tb = S \ Sb be the data points that do not appear in Sb (the out-of-bag points). We compute hb(x) for each x in Tb.
For each data point x we then have the observed value y and several predictions y1, ..., yK. We compute the average prediction h̄, estimate the bias as (h̄ - y) and the variance as Σk (yk - h̄)^2 / (K - 1), and assume that the noise is zero. In addition, note that if we have multiple data points with the same x value we can estimate the noise directly, and that we can also estimate the noise by pooling y values from nearby x values.
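A sketch of this out-of-bag estimation procedure in the regression setting (the function and variable names are illustrative; as in the text, the noise term is assumed to be zero and the per-point bias is h̄ - y):

```python
# Illustrative out-of-bag estimation of (squared) bias and variance.
import numpy as np
from sklearn.base import clone

def oob_bias_variance(base_estimator, X, y, n_replicates=30, random_state=0):
    rng = np.random.RandomState(random_state)
    m = X.shape[0]
    preds = [[] for _ in range(m)]                    # out-of-bag predictions per point
    for _ in range(n_replicates):
        idx = rng.randint(0, m, size=m)               # bootstrap replicate S_b
        oob = np.setdiff1d(np.arange(m), idx)         # T_b = S \ S_b
        est = clone(base_estimator).fit(X[idx], y[idx])
        for i, p in zip(oob, est.predict(X[oob])):
            preds[i].append(p)
    bias_sq, var = [], []
    for i, p in enumerate(preds):
        if len(p) < 2:                                # need several predictions per point
            continue
        p = np.asarray(p)
        h_bar = p.mean()                              # average prediction h-bar
        bias_sq.append((h_bar - y[i]) ** 2)           # squared per-point bias (noise = 0)
        var.append(p.var(ddof=1))                     # sum_k (y_k - h_bar)^2 / (K - 1)
    return np.mean(bias_sq), np.mean(var)
```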
Bagging is essentially an ensemble method; voting, averaging and weighted averaging can all be involved. For classification, we do the following:
For b = 1, ..., B do
    Sb = bootstrap replicate of S
    Apply the learning algorithm to Sb to learn hb
Classify new points by an unweighted vote, i.e. predict the positive class when Σb hb(x) / B > 0.

Assume that our particular training sample S is drawn from some population of possible samples according to P(S). We compute EP[(y* - h(x*))^2] and decompose it into bias, variance and noise. We will use the following lemma [7] to obtain this decomposition. Let Z be a random variable with probability distribution P(Z), and let Z̄ = EP[Z] be the average value of Z.
Lemma: E[(Z - Z̄)^2] = E[Z^2] - Z̄^2
Corollary: E[Z^2] = E[(Z - Z̄)^2] + Z̄^2
Using the above results, we can derive the following [7]:
E[(h(x*) - y*)^2] = E[(h(x*) - h̄(x*))^2]     (Variance)
                  + (h̄(x*) - f(x*))^2        (Bias^2)
                  + E[(y* - f(x*))^2]         (Noise)
                  = Var(h(x*)) + Bias(h(x*))^2 + E[ε^2]
                  = Var(h(x*)) + Bias(h(x*))^2 + σ^2
Expected prediction error = Variance + Bias^2 + Noise
Variance: E[(h(x*) - h̄(x*))^2] describes how much h(x*) varies from one training set S to another.
Bias: h̄(x*) - f(x*) describes the average error of h(x*).
Noise: E[(y* - f(x*))^2] = E[ε^2] = σ^2 describes how much y* varies from f(x*).
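One way to spell out how the lemma gives this decomposition (a standard derivation along the lines of [7], added here for completeness; it assumes the test noise ε = y* - f(x*) has zero mean and is independent of the training sample):

```latex
\begin{align*}
&\text{Let } Z = h(x^*) - y^*, \quad
\bar{Z} = E[Z] = \bar{h}(x^*) - f(x^*) \quad \text{(since } E[y^*] = f(x^*)\text{)}. \\
E[(h(x^*) - y^*)^2]
  &= E[(Z - \bar{Z})^2] + \bar{Z}^2 \quad \text{(by the corollary)} \\
  &= E\big[\big((h(x^*) - \bar{h}(x^*)) - (y^* - f(x^*))\big)^2\big]
     + (\bar{h}(x^*) - f(x^*))^2 \\
  &= E[(h(x^*) - \bar{h}(x^*))^2] + E[\varepsilon^2]
     + (\bar{h}(x^*) - f(x^*))^2 \quad \text{(the cross term vanishes)} \\
  &= \mathrm{Var}(h(x^*)) + \sigma^2 + \mathrm{Bias}(h(x^*))^2 .
\end{align*}
```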

Bagging makes predictions according to ȳ(x) = Σb hb(x) / B; hence, bagging's prediction is h̄(x). If we estimate the bias and variance using the same B bootstrap samples, we will have:
Bias = (h̄ - y)    [same as before]
Variance = Σk (yk - h̄)^2 / (K - 1) = 0,
since each prediction yk is now the bagged prediction h̄ itself.

Hence, according to this approximate way of estimating variance, bagging removes the variance while leaving the bias unchanged. In reality, bagging only reduces the variance and tends to increase the bias slightly.
Models that fit the data poorly have high bias: inflexible models such as linear regression or regression stumps. Models that can fit the data very well have low bias but high variance: flexible models such as nearest-neighbor regression or regression trees. This suggests that bagging a flexible model can reduce the variance while benefiting from the low bias.

2.3 Bootstrap aggregating in the Scikit-learn library [1]
This method has not previously been implemented as a separate module that can be used as a general ensemble method on top of other learning algorithms. However, the technique of bootstrapping is used in Random Forests [8] to build forests of decision trees [9].
3. DESIGN AND IMPLEMENTATION
An overview of the overall design is shown in Fig. 2. It uses the Python [10] shell to get user inputs (data and parameters).

Fig. 2 Overall design

The bootstrap aggregating algorithm has been implemented as a module in the Scikit-learn library [1]. It is designed to use system resources optimally and to avoid processor, memory and time inefficiencies. This is achieved by using the routines provided in the Scikit-learn code base [11] and following its coding guidelines [12].
This design uses the general ensemble classes implemented in the Scikit-learn library. First, a base class for bagging is created, and the subclasses for the classifier and the regressor are derived from it. The base class is extended from BaseEnsemble, while the classifier and regressor are extended from ClassifierMixin and RegressorMixin respectively. These are parent classes implemented in the Scikit-learn code base for ensemble methods.
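A simplified skeleton of this class layout is sketched below. It is illustrative only: the real design derives the base class from scikit-learn's internal BaseEnsemble class, whose constructor takes additional arguments, so BaseEstimator is used here to keep the sketch self-contained.

```python
# Illustrative skeleton of the bagging class hierarchy (not the library code).
from abc import ABCMeta, abstractmethod
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

class BaseBagging(BaseEstimator, metaclass=ABCMeta):
    """Shared bootstrap-sampling and parallel-training logic."""

    def __init__(self, base_estimator=None, n_estimators=10, n_jobs=1):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.n_jobs = n_jobs

    @abstractmethod
    def _aggregate(self, predictions):
        """Combine the per-estimator predictions."""

class BaggingClassifier(ClassifierMixin, BaseBagging):
    def _aggregate(self, predictions):
        ...  # majority vote over the estimators

class BaggingRegressor(RegressorMixin, BaseBagging):
    def _aggregate(self, predictions):
        ...  # average of the estimators' predictions
```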

An overview of the implementation can be depicted as follows.

Training the model:

Fig. 3 Training the model using Bagging

The bagging module takes the learning algorithm, its parameters and the training data as its own parameters at the beginning, in order to build multiple estimators from bootstrapped samples. The training data and system resources are divided among the estimators, which are trained in parallel using the Joblib interface [13]. After the training phase, the trained estimators are held separately in memory.
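A simplified sketch of this parallel training step using Joblib (the actual module relies on scikit-learn's internal helpers; the function names here are illustrative):

```python
# Illustrative parallel fitting of the bagged estimators with Joblib.
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone

def _fit_one_estimator(base_estimator, X, y, seed):
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, X.shape[0], size=X.shape[0])   # bootstrap replicate
    return clone(base_estimator).fit(X[idx], y[idx])

def fit_bagging(base_estimator, X, y, n_estimators=10, n_jobs=-1):
    # Each estimator is fitted on its own replicate in a separate job.
    return Parallel(n_jobs=n_jobs)(
        delayed(_fit_one_estimator)(base_estimator, X, y, seed)
        for seed in range(n_estimators))
```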

Predicting:

Fig. 4 Predicting using trained models

In prediction, all data to be predicted are passed to all trained models. The results are then aggregated using the appropriate criterion: averaging for regression and majority voting for classification. The prediction process also runs in parallel.

In order to obtain class labels in classification, we need to find, for each data point, the probability of it belonging to each class. These probabilities are also calculated in parallel.
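A sketch of the aggregation step (the parallel dispatch is omitted and the function names are illustrative):

```python
# Illustrative aggregation of the per-estimator outputs.
import numpy as np

def aggregate_regression(estimators, X):
    # Average the individual regression predictions.
    return np.mean([est.predict(X) for est in estimators], axis=0)

def aggregate_classification(estimators, X):
    # Average the per-class probabilities and pick the most probable class,
    # which amounts to a (soft) majority vote; the result is a class index,
    # to be mapped back to labels via the estimators' classes_ attribute.
    proba = np.mean([est.predict_proba(X) for est in estimators], axis=0)
    return np.argmax(proba, axis=1)
```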
In this design, only bootstrapped samples are considered. To obtain more randomized datasets, bootstrapped features could also be implemented.

4. RESULTS
The main objective of the bootstrap aggregating module is to reduce the variance of the predictions and improve accuracy. To measure the accuracy of the results with and without bagging, cross-validation [14] has been used.
The Scikit-learn library has its own cross-validation tools; k-fold cross-validation [15] is used in this scenario. Sample datasets are taken from the Scikit-learn datasets module [16]: the Iris dataset is used for classification and the Diabetes dataset for regression.
Shape of the Iris data: (150 data points, 4 features)
Shape of the Diabetes data: (442 data points, 10 features)
A combination of linear and non-linear learning algorithms has been chosen to illustrate the performance on each type. The 3-fold cross-validation is run first without bagging and then with bagging, and the R^2 score [17] (coefficient of determination [18]) is used to measure performance.
The following figures present the results of the tests: for each learning algorithm, the parameters used are listed, followed by the cross-validation scores without and with bagging.
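A sketch of this evaluation set-up for the regression case is given below. It is written against the current sklearn.model_selection interface rather than the sklearn.cross_validation module cited in [15], so treat the exact imports as an assumption; regressors are scored with R^2 by default.

```python
# Illustrative 3-fold cross-validation of a base learner with and without bagging.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

plain = DecisionTreeRegressor(random_state=0)
bagged = BaggingRegressor(plain, n_estimators=10, random_state=0)

# cross_val_score uses the estimator's default scorer, i.e. R^2 for regressors.
print("without bagging:", cross_val_score(plain, X, y, cv=3).mean())
print("with bagging:   ", cross_val_score(bagged, X, y, cv=3).mean())
```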

For classification:

Fig. 5 Results of classification

For regression:

Fig. 6 Results of regression

5. RELATED WORK
The implemented code for the bootstrap aggregating module and its tests can be found in the GitHub repository [19].

6. CONCLUSION
The consumption of system resources by ensemble methods is clearly greater than that of ordinary learning algorithms, as there are parallel processes and the amount of data being processed is considerably higher; the same applies to bootstrap aggregating.
The main consideration is the improvement of prediction accuracy. In the results above, this improvement can be seen for the non-linear models in regression. In classification, however, performance deteriorated slightly. This happened because the number of data points in the Iris dataset is too small for bagging to be applied effectively: when sampling, much of the important information is left out of each replicate. This issue can be overcome when sufficiently large datasets are used for training.
Moreover, a further improvement to this design would be to use feature bootstrapping, which generates more randomized datasets and further reduces the variance.

REFERENCES
[1] Scikit-learn. Available [online]: http://scikit-learn.org/stable/
[2] Wikipedia, Bootstrap Aggregating. Available [online]: http://en.wikipedia.org/wiki/Bootstrap_aggregating
[3] L. Breiman, "Pasting Small Votes for Classification in Large Databases and On-Line", Machine Learning 36, 1999, pp. 85-103.
[4] L. Breiman, "Bagging Predictors", Technical Report No. 421, 1994.
[5] T. Ho, "The random subspace method for constructing decision forests", Pattern Analysis and Machine Intelligence, 20(8), 1998, pp. 832-844.
[6] G. Louppe and P. Geurts, "Ensembles on Random Patches", Machine Learning and Knowledge Discovery in Databases, 2012, pp. 346-361.
[7] T. Dietterich and R. Maclin, "Bias-Variance Tradeoff and Ensemble Methods", CMSC726, Spring 2006, University of Maryland.
[8] Wikipedia, Random Forest. Available [online]: http://en.wikipedia.org/wiki/Random_forest
[9] Wikipedia, Decision tree. Available [online]: http://en.wikipedia.org/wiki/Decision_tree
[10] Python. Available [online]: http://www.python.org/getit/
[11] Scikit-learn, APIs of scikit-learn objects. Available [online]: http://scikit-learn.org/stable/developers/index.html#apis-of-scikit-learn-objects
[12] Scikit-learn, Coding guidelines. Available [online]: http://scikit-learn.org/stable/developers/index.html#coding-guidelines
[13] Joblib, Joblib: running Python functions as pipeline jobs. Available [online]: http://pythonhosted.org/joblib/
[14] Wikipedia, Cross-validation (statistics). Available [online]: http://en.wikipedia.org/wiki/Cross-validation_(statistics)
[15] Scikit-learn, sklearn.cross_validation.KFold. Available [online]: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
[16] Scikit-learn, Dataset loading utilities. Available [online]: http://scikit-learn.org/stable/datasets/
[17] Scikit-learn, sklearn.metrics.r2_score. Available [online]: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
[18] Wikipedia, Coefficient of determination. Available [online]: http://en.wikipedia.org/wiki/Coefficient_of_determination
[19] GitHub, maheshakya / scikit-learn, sklearn / ensemble /. Available [online]: https://github.com/maheshakya/scikit-learn/tree/master/sklearn/ensemble
