
Much of our focus so far has been on building a single model that is most accurate.

In practice, data
scientists often construct multiple models and then combine them into a single prediction model. This is
referred to as a model ensemble. Two common techniques for assembling such models
are boosting and bagging. Do some research and define what model ensembles are, why they are
important, and how boosting and bagging function in the construction of ensemble models. Be detailed
and provide references to your research.

What are ‘Model Ensembles’

In machine learning, a model is developed to obtain predictions from a dataset. Rather than developing the single most accurate prediction model possible for a given task, we can generate a set of models and make predictions by aggregating their outputs, which often produces better overall results. A prediction model composed of a set of models whose individual predictions are combined is called a ‘model ensemble’ [1].

Combining predictions from different models tends to yield better results; however, steps should be taken to guard against models whose errors are not independent, since such models often produce skewed results. There are different ways to combine and make use of the predictions from individual models. One naïve approach is to train several different models and then predict using the average of their individual predictions. Another is to generate, say, twenty different decision trees during cross-validation and have them vote on the best classification for a new example.
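
A minimal sketch of the naïve averaging approach, assuming scikit-learn, NumPy, and hypothetical arrays X, y (training data with a continuous target) and X_new (new inputs):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def averaging_ensemble(X, y, X_new):
    # three different model types trained on the same data (illustrative choice)
    models = [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor()]
    # the ensemble prediction is simply the average of the individual predictions
    return np.mean([m.fit(X, y).predict(X_new) for m in models], axis=0)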

Importance of ‘Model Ensembles’

The motivation for ensemble learning is simple and intuitive. Consider an ensemble of five different
models and suppose that we combine their predictions using simple majority voting. For the ensemble to
misclassify a new example, at least three of the five models have to misclassify it. The hope is that this is
much less likely than a misclassification by a single model. Now, the assumption of independence
between the models could be unreasonable, because models are likely to be misled in the same way by
any misleading aspects of the training data. But if the hypotheses are at least a little bit different, thereby
reducing the correlation between their errors, then ensemble learning can be very useful.
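
To make the intuition concrete, suppose (purely for illustration) that each of the five models errs independently with probability 0.2. The probability that a majority of them (three or more) err on the same example is then only about 0.058, as the short Python check below confirms:

from math import comb

p = 0.2  # assumed error rate of each individual model
# probability that at least 3 of the 5 independent models misclassify
ensemble_error = sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(3, 6))
print(ensemble_error)  # ~0.058, far below the single-model error of 0.2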

With this understanding of why a model ensemble can be such a useful tool for better predictions, we can list the two main characteristics of ‘ensemble models’:
1. They build multiple different models from the same dataset by inducing each model using a
modified version of the dataset.

2. They make a prediction by aggregating the predictions of the different models in the ensemble.

For categorical target features, this can be done using different types of voting mechanisms, and for
continuous target features, this can be done using a measure of the central tendency of the different model
predictions, such as the mean or the median. There are two standard approaches to creating ensembles –
boosting and bagging.
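
A small sketch of these two aggregation strategies, assuming a hypothetical NumPy array preds of shape (n_models, n_examples) that holds each model's predictions for the same examples:

import numpy as np

def aggregate_categorical(preds):
    # majority vote per example (assumes integer-encoded class labels)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)

def aggregate_continuous(preds, use_median=False):
    # central tendency per example: mean by default, or median if preferred
    return np.median(preds, axis=0) if use_median else np.mean(preds, axis=0)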

Boosting and Bagging in the construction of ‘Ensemble Models’

Although a model ensemble combines different models to produce better prediction results, it also has to deal with two key limitations rooted in the training set. In practice an ensemble has only a fixed training set to operate upon, and models induced from that same training set must contend with bias and variance. Boosting and bagging are the two standard methods for building effective ensembles, and each addresses one of these limitations. Let us see how.

Boosting: The most widely used ensemble method is boosting, a general method that attempts to ‘boost’ the accuracy of any given learning algorithm. To understand how it works, we first need the ideas of a hypothesis and a weighted training set [2]. A machine learning hypothesis is a candidate model that approximates a target function mapping inputs to outputs. In a weighted training set, each example has an associated weight wj ≥ 0; the higher the weight of an example, the higher the importance attached to it during the learning of a hypothesis. Boosting starts with wj = 1 for all the examples (i.e., a normal training set). From this set, it generates the first hypothesis, h1. This
hypothesis will classify some of the training examples correctly and some incorrectly. We would like the
next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing
the weights of the correctly classified examples. From this new weighted training set, we generate
hypothesis h2. The process continues in this way until we have generated ‘K’ hypotheses, where K is an
input to the boosting algorithm. The final ensemble hypothesis is a weighted-majority combination of all
the K hypotheses, each weighted according to how well it performed on the training set. There are many
variants of the basic boosting idea, with different ways of adjusting the weights and combining the
hypotheses, but the key idea remains the same. One important boosting algorithm is AdaBoost, introduced in 1995 by Freund and Schapire [3]. The algorithm takes as input a training set (x1, y1), (x2, y2), ..., (xm, ym), where each xi belongs to some domain or instance space X and each label yi belongs to some label set Y. AdaBoost calls a given weak (base) learning algorithm repeatedly in a series of rounds t = 1, ..., K. In the first round (t = 1, with all weights wj = 1), it obtains the weak hypothesis h1 with error ε1, where ε1 is calculated by summing the weights of the training instances that the hypothesis misclassifies. Each subsequent round increases the weights of the instances misclassified by the current model and decreases the weights of the instances it classifies correctly. Once the set of models has been created after K rounds, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models. An important observation is that even if the input learning algorithm L is only a weak learner, meaning that L always returns a hypothesis whose accuracy on the training set is slightly better than random guessing (i.e., 50% + ε), AdaBoost will return a combined hypothesis that classifies the training data perfectly for large enough K. Thus, the algorithm boosts the accuracy of the original learning algorithm on the training data. AdaBoost is adaptive in that it adapts to the error rates of the individual weak hypotheses; this is the origin of its name: ‘Ada’ is short for adaptive.
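
The following is a minimal sketch of this reweighting loop (a simplified version, not Freund and Schapire's exact formulation), assuming scikit-learn decision stumps as the weak learner, NumPy, and a hypothetical dataset X, y with labels coded as -1/+1:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, K=50):
    """Train K weak hypotheses on progressively reweighted training data."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform starting weights
    hypotheses, alphas = [], []
    for _ in range(K):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()       # weighted training error
        if err >= 0.5:                           # weak learner no better than chance
            break
        err = max(err, 1e-12)                    # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this hypothesis
        w *= np.exp(-alpha * y * pred)           # raise weights of misclassified examples,
        w /= w.sum()                             # lower weights of correct ones, renormalize
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(X, hypotheses, alphas):
    """Weighted-majority combination of the K hypotheses."""
    scores = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(scores)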

Bagging: The boosting method discussed above deals well with the limitation of bias. We now turn to how another ensemble mechanism, bagging, deals with variance. In practice we only have a single training dataset, so we must find a way to introduce variability between the different models in the ensemble. One approach is bagging (bootstrap aggregating), where each model in the ensemble is trained
on a random sample of the dataset where, importantly, each random sample is the same size as the data
set and sampling with replacement is used. These random samples are known as bootstrap samples,
and one model is induced from each bootstrap sample. The reason that we sample with replacement is that
this will result in duplicates within each of the bootstrap samples, and consequently, every bootstrap
sample will be missing some of the instances from the dataset. As a result, each bootstrap sample will be
different, and this means that models trained on different bootstrap samples will also be different.
Decision tree induction algorithms are particularly well suited to use with bagging. This is because
decision trees are very sensitive to changes in the dataset: a small change in the dataset can result in a
different feature being selected to split the dataset at the root, or high up in the tree, and this can have a
ripple effect throughout the subtrees under this node. Frequently, when bagging is used with decision
trees, the sampling process is extended so that each bootstrap sample only uses a randomly selected
subset of the descriptive features in the dataset. This sampling of the feature set is known as subspace
sampling. Subspace sampling further encourages the diversity of the trees within the ensemble and has
the advantage of reducing the training time for each tree. The combination of bagging, subspace
sampling, and decision trees is known as a random forest model. Once the individual models have been
induced, the ensemble makes predictions by returning the majority vote or the median depending on
the type of prediction required. With M decision tree models, the final hypothesis takes the mean (or the mode) of the individual hypotheses' predictions, which in the ideal case reduces the average error of a model by a factor of M simply by averaging M versions of the model [4]. It is important to note that this result depends on the key assumption that the errors of the individual models are uncorrelated. In practice the errors are typically correlated, and the reduction in overall error is generally much smaller than a factor of M.
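
As a minimal sketch of bagging with subspace sampling (a simplified random-forest-style ensemble), assuming scikit-learn, NumPy, and a hypothetical dataset X, y with integer-encoded class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=100, seed=0):
    """Induce M trees, each from a bootstrap sample and a random feature subset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))                       # size of each feature subset
    models = []
    for _ in range(M):
        rows = rng.integers(0, n, size=n)             # bootstrap sample (with replacement)
        cols = rng.choice(d, size=k, replace=False)   # subspace sampling
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        models.append((tree, cols))
    return models

def bagging_predict(X, models):
    """Majority vote of the M trees (for a categorical target with integer labels)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)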

References:

[1] Fundamentals of Machine Learning for Predictive Data Analytics by John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy

[2] Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig

[3] A Brief Introduction to Boosting by Robert E. Schapire

[4] Pattern Recognition and Machine Learning by Christopher M. Bishop
