\[
E\big[(y^* - h(x^*))^2\big] \;=\;
\underbrace{E\big[(h(x^*) - \bar{h}(x^*))^2\big]}_{\text{Variance}}
\;+\;
\underbrace{\big(\bar{h}(x^*) - f(x^*)\big)^2}_{\text{Bias}}
\;+\;
\underbrace{E\big[(y^* - f(x^*))^2\big]}_{\text{Noise}}
\]
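Here h is the hypothesis learned from a particular training sample, \bar{h} its average over training samples, f the true target function, and y^* the noisy observation at the test point x^*. Bagging attacks the variance term: it averages B hypotheses fitted to bootstrap replicates of the training data. As a minimal sketch, assuming the replicates yield identically distributed hypotheses with variance \sigma^2 and pairwise correlation \rho at x^*:

\[
\operatorname{Var}\!\left[\frac{1}{B}\sum_{b=1}^{B} h_b(x^*)\right]
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]

For uncorrelated hypotheses this falls to \sigma^2/B; bootstrap replicates overlap, so in practice \rho > 0 and the reduction is smaller, but the bias and noise terms are left essentially unchanged.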
[Figure: results for regression (scores without and with bagging)]
4. RESULTS
The main objective of the bootstrap aggregating module is to reduce the variance of predictions and thereby improve accuracy. To measure the accuracy of the results without and with bagging, cross-validation[14] has been used. The scikit-learn library provides its own cross-validation tools; K-fold cross-validation[15] is used in this scenario. Sample datasets are taken from the scikit-learn datasets module[16]: the Iris dataset is used for classification and the Diabetes dataset for regression.
Shape of Iris data: (150 data points, 4 features)
Shape of Diabetes data: (442 data points, 10 features)
A combination of linear and non-linear learning algorithms has been chosen to illustrate the performance on each type. 3-fold cross-validation is run first without bagging and then with bagging, using the R² score[17] (coefficient of determination[18]) to measure performance. The following figures present the results of the tests: for each learning algorithm, the parameters used are listed, followed by its scores without bagging and with bagging.
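A minimal sketch of this evaluation loop is given below, written against the current scikit-learn API (sklearn.model_selection has since replaced the sklearn.cross_validation module cited above); the decision-tree base estimators and the parameter values are illustrative assumptions rather than the exact configurations behind the reported scores.

# 3-fold cross-validation scores without and with bagging, following the
# setup described above. Base estimators and parameters are illustrative.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification on Iris (150 data points, 4 features).
# cross_val_score uses the classifier's default score, i.e. accuracy.
X_iris, y_iris = load_iris(return_X_y=True)
plain_clf = DecisionTreeClassifier(random_state=0)
bagged_clf = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                               n_estimators=10, random_state=0)
print("Iris, without bagging:", cross_val_score(plain_clf, X_iris, y_iris, cv=3).mean())
print("Iris, with bagging   :", cross_val_score(bagged_clf, X_iris, y_iris, cv=3).mean())

# Regression on Diabetes (442 data points, 10 features).
# The regressor's default score is the R^2 coefficient of determination.
X_dia, y_dia = load_diabetes(return_X_y=True)
plain_reg = DecisionTreeRegressor(random_state=0)
bagged_reg = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                              n_estimators=10, random_state=0)
print("Diabetes, without bagging:", cross_val_score(plain_reg, X_dia, y_dia, cv=3).mean())
print("Diabetes, with bagging   :", cross_val_score(bagged_reg, X_dia, y_dia, cv=3).mean())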
5. RELATED WORK
The implemented code for the bootstrap aggregating module and the tests can be found in the GitHub repository[19].
[Figure: results for classification (scores without and with bagging)]
6. CONCLUSION
It is clear that ensemble methods consume more system resources than ordinary learning algorithms, since several estimators run as parallel processes and the amount of data being processed is considerably higher. The same holds for bootstrap aggregating.
The main consideration is the improvement of prediction accuracy. In the results section above, this improvement can be seen for the non-linear models in regression. In classification, however, performance deteriorated slightly. This happened because the Iris dataset contains too few data points for bagging to be effective: when such a small set is resampled, a large share of the important information is left out of each bootstrap sample. This issue can be overcome when sufficiently large datasets are used for training.
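To make this concrete, a bootstrap sample of size n drawn with replacement leaves any particular data point out with probability

\[
\left(1 - \tfrac{1}{n}\right)^{n} \;\approx\; e^{-1} \;\approx\; 0.368,
\]

so each replicate sees only about 63% of the distinct training points; with only 150 Iris examples (50 per class), the omitted third can carry a large share of the class structure.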
Moreover, a further improvement to this design is feature bootstrapping, which generates more randomized training sets and can further reduce the variance of the ensemble's predictions.
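As an illustration, the current scikit-learn bagging estimators already expose feature-sampling parameters in the spirit of the random patches method [6]; the base estimator and parameter values below are arbitrary assumptions, not the module's defaults.

# Random-patches style bagging: resample both data points and features
# for each base estimator (cf. Louppe and Geurts [6]).
# Parameter values are illustrative only.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
patches = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                           n_estimators=25,
                           max_samples=0.8,           # bootstrap 80% of the rows
                           max_features=0.5,          # draw half of the 10 features
                           bootstrap=True,
                           bootstrap_features=True,   # sample features with replacement
                           random_state=0)
print("Random patches, 3-fold R^2:", cross_val_score(patches, X, y, cv=3).mean())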
REFERENCES
[1] Scikit-learn. Available [online]: http://scikit-learn.org/stable/
[2] Wikipedia, Bootstrap Aggregating. Available [online]: http://en.wikipedia.org/wiki/Bootstrap_aggregating
[3] L. Breiman, "Pasting Small Votes for Classification in Large Databases and On-Line", Machine Learning 36, 1999, pp. 85-103.
[4] L. Breiman, "Bagging Predictors", Technical Report No. 421, 1994.
[5] T. Ho, "The random subspace method for constructing decision forests", Pattern Analysis and Machine Intelligence, 20(8), 1998, pp. 832-844.
[6] G. Louppe and P. Geurts, "Ensembles on Random Patches", Machine Learning and Knowledge Discovery in Databases, 2012, pp. 346-361.
[7] T. Dietterich and R. Maclin, "Bias-Variance Tradeoff and Ensemble Methods", CMSC726 Spring 2006, University of Maryland.
[8] Wikipedia, Random Forest. Available [online]: http://en.wikipedia.org/wiki/Random_forest
[9] Wikipedia, Decision tree. Available [online]: http://en.wikipedia.org/wiki/Decision_tree
[10] Python. Available [online]: http://www.python.org/getit/
[11] Scikit-learn, APIs of scikit-learn objects. Available [online]: http://scikit-learn.org/stable/developers/index.html#apis-of-scikit-learn-objects
[12] Scikit-learn, Coding guidelines. Available [online]: http://scikit-learn.org/stable/developers/index.html#coding-guidelines
[13] Joblib, Joblib: running Python functions as pipeline jobs. Available [online]: http://pythonhosted.org/joblib/
[14] Wikipedia, Cross-validation (statistics). Available [online]: http://en.wikipedia.org/wiki/Cross-validation_(statistics)
[15] Scikit-learn, sklearn.cross_validation.KFold. Available [online]: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
[16] Scikit-learn, Dataset loading utilities. Available [online]: http://scikit-learn.org/stable/datasets/
[17] Scikit-learn, sklearn.metrics.r2_score. Available [online]: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
[18] Wikipedia, Coefficient of determination. Available [online]: http://en.wikipedia.org/wiki/Coefficient_of_determination
[19] GitHub, maheshakya/scikit-learn, sklearn/ensemble/. Available [online]: https://github.com/maheshakya/scikit-learn/tree/master/sklearn/ensemble