
Decision Trees, Boosting, and Random Forest

Notes on statistical learning theory

Kevin Song

Department of Biomedical Informatics


Stanford University School of Medicine

July 14, 2017

Adapted from Profs. Rob Tibshirani and Trevor Hastie


Decision Trees
Overview

- Among the most popular tools in the data mining field.
- In use since 1984 (Breiman, Friedman, Olshen, & Stone).
- Can be extremely powerful when used as ensembles of trees
  (e.g., boosted trees and random forests).



Decision Trees
An illustration

Figure: Decision trees are stepwise-defined functions that can be used to
model data. They can be used for either regression or classification purposes.
Decision Trees
Usage in supervised regression
- For regression (when y is real-valued and continuous): build a
  stepwise-defined function that best approximates the data, using a greedy
  algorithm that minimizes squared-error loss at each split (a minimal sketch
  of this split search appears below the figure).
- Three-dimensional case: imagine building a table whose regions have varying
  heights, each height being the average y value of the data points falling in
  that region. Splits are made greedily, one at a time and one variable at a
  time, to decide which data points are assigned to which table height.

Figure: A plot of a decision tree algorithm's prediction surface.
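As a minimal sketch of the greedy split search described above (the function
name and toy data are illustrative, not from the original lecture), the idea is
to try every candidate threshold on a feature and keep the split that minimizes
the total squared error of the two resulting regions:

import numpy as np

def best_split_1d(x, y):
    # Try every midpoint between consecutive sorted x values and keep the
    # threshold whose left/right means give the smallest total squared error.
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_threshold, best_sse

# Toy stepwise data: the best split should land near x = 4.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4, 1.0, 3.0) + rng.normal(0, 0.2, 200)
print(best_split_1d(x, y))

A real regression tree applies this same search to every feature at every node,
then recurses into the two resulting regions.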


Decision Trees
Usage in supervised classification

- For classification (when y is discrete and categorical): split the data
  (into left and right branches) on the variable that best separates the
  classes. Repeat this procedure recursively until a stopping criterion is
  reached.
- The Gini index is usually minimized to maximize node purity at each split
  point (a short computation sketch follows this list):

      G(p) = \sum_{i=1}^{J} p_i (1 - p_i),

  where J is the number of classes and p_i is the proportion of training data
  points assigned to the i-th class in the region of interest.
- Aside: decision trees are also known as "recursive partitioning" because of
  this; the R package rpart, used for building decision trees, preserves this
  naming convention.
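To make the formula above concrete, here is a short hedged sketch (the function
name is my own, not from the lecture) of computing the Gini impurity of a node
directly from its class labels:

import numpy as np

def gini(labels):
    # G(p) = sum_i p_i * (1 - p_i), with p_i the class proportions in the node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1 - p)).sum())

print(gini(["a", "a", "a", "a"]))  # 0.0: a pure node
print(gini(["a", "a", "b", "b"]))  # 0.5: maximally impure for two classes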



Decision Trees
Pros and cons of single decision trees

Pros:
- High interpretability; not a black box (e.g., "Exactly why was my loan
  declined?").
- To arrive at the model's conclusion, just trace down the branches of the
  tree.
- Fast to train and not computationally intensive for large datasets (unlike
  neural networks, for instance).
- Can accept either categorical or quantitative variable inputs (unlike neural
  networks, which only accept numerical data).
- Can handle sparse data and missing values.
- Feature selection is built into the model.



Decision Trees
Pros and cons of single decision trees
Cons:
- Low accuracy when approximating smooth or linear boundaries (a consequence
  of being a stepwise-defined function).
- Very high variance: a single tree can change substantially even when the
  dataset is perturbed only slightly.

Figure: Top row: true linear boundary; bottom row: true non-linear boundary.
Left column: linear model; right column: tree-based model.
How to improve decision tree algorithms?
Using ensembles of trees

How can we improve the accuracy (bias) and/or stability (variance) of decision
trees while keeping them relatively easy to train?
- We will use ensembles of trees (boosted decision trees and random forests).
- However, in doing so, we largely lose the interpretability of a single tree.



Boosting
Overview

- Among the best off-the-shelf classifiers today (Friedman, The Elements of
  Statistical Learning).
- XGBoost, a regularized gradient-boosting implementation of tree ensembles,
  has been winning many of today's Kaggle competitions, and has even
  outperformed deep neural networks in many cases (a usage sketch follows
  below).
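For reference, a minimal usage sketch of the xgboost Python package's
scikit-learn-style interface is shown below; the synthetic dataset and the
parameter values are arbitrary illustrations, not recommendations from the
lecture.

from xgboost import XGBClassifier   # assumes the xgboost package is installed
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of boosted trees; learning_rate shrinks each tree's
# contribution; reg_lambda is the L2 regularization term on leaf weights.
model = XGBClassifier(n_estimators=200, max_depth=3,
                      learning_rate=0.1, reg_lambda=1.0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))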



Boosting
How does it work?

- Boosting works by constructing many successive trees (i.e., weak learners)
  on the same dataset.
- The final model is a weighted, linear combination of the weak learners
  (written out below).
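In symbols (standard boosting notation, not taken from the original slides),
with weak learners f_1, ..., f_k and learner weights \alpha_1, ..., \alpha_k,
the boosted model is

    F(x) = \sum_{m=1}^{k} \alpha_m f_m(x),

and for a two-class problem the prediction is typically sign(F(x)).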



Boosting
AdaBoost algorithm
AdaBoost, or adaptive boosting, was the first boosting algorithm developed (by
Freund and Schapire). Here is an outline of the algorithm for a classification
task (a rough code sketch follows these steps):
1. Create an empty vector of weights, one for each to-be-created decision
   tree. Initialize a vector of weights over the data points, each set to 1/n.
2. Construct an ordinary decision tree (weak learner) on the entire dataset,
   such that its accuracy is greater than 50% (i.e., better than chance).
   Assign a weight to this weak learner based on its misclassification error.
3. Identify all points misclassified by the previous tree. Increase the
   weights of the misclassified data points and decrease the weights of the
   correctly classified ones. Construct another decision tree, this time
   accounting for the new weights on the data points.
Continue this procedure for k iterations.
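Below is a rough sketch of that loop for a two-class problem with labels coded
as -1/+1, using depth-1 trees (stumps) as the weak learners. The function names
and the specific weight-update formula (the classic discrete AdaBoost update)
are standard choices of mine, not code from the lecture.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=50):
    # y is assumed to be coded as -1 / +1.
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: uniform data-point weights
    learners, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()     # weighted misclassification error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # weak-learner weight
        w = w * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Weighted, linear combination of the weak learners.
    score = sum(a * m.predict(X) for m, a in zip(learners, alphas))
    return np.sign(score)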
Boosting
AdaBoost algorithm

- The final output of the boosted model is a weighted, linear combination of
  the k weak learners: the overall model is a strong learner built from many
  weak ones.
- k is best chosen by cross-validation or by using a separate validation
  dataset (one possible setup is sketched below).
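One way to pick k in practice (an illustrative sketch; the synthetic dataset
and the candidate grid are arbitrary) is to cross-validate over scikit-learn's
AdaBoostClassifier, where k corresponds to the n_estimators parameter.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50, 100, 200]},  # candidate values of k
    cv=5,                                             # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)   # cross-validated choice of k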



Random Forest
Algorithm

The random forest algorithm (Breiman, 2001) is much simpler than boosting and
is based on a method known as bagging, or bootstrap aggregation (a bare-bones
sketch follows below):
1. Create B bootstrap samples of the dataset (i.e., draw B samples of size n,
   with replacement).
2. Fit a decision tree to each bootstrap sample. (A random forest additionally
   considers only a random subset of the features at each candidate split,
   which helps decorrelate the trees.)
- For regression, the final output is the average of the B trees' predictions.
- For classification, the final output is the majority vote of the B trees.
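Here is a bare-bones sketch of the bagging step for regression (function names
are illustrative, not from the lecture); scikit-learn's RandomForestRegressor
and RandomForestClassifier additionally handle the per-split feature
subsampling mentioned above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_fit(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample of size n, with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_trees_predict(trees, X):
    # For regression: average the B trees' predictions.
    # (For classification, take a majority vote instead.)
    return np.mean([t.predict(X) for t in trees], axis=0)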



Boosting versus Random Forest
A comparison

- Boosting and random forests are distant cousins; each is, by itself, a vast
  improvement over a traditional single tree.
- Boosting tends to be more accurate than a random forest, but can suffer from
  overfitting.
- A random forest tends to have a lower chance of overfitting than boosting,
  and is more stable than a boosted model.
- Hence, boosting tends to have low bias but higher variance, while a random
  forest tends to have low variance but somewhat higher bias.
- Ideally, when performing one type of analysis, consider the other as a
  second opinion.

