Decision Tree: Introduction

A decision tree is a powerful method for classification and prediction and for facilitating decision making in sequential decision problems. This entry considers three types of decision trees in some detail. The first is an algorithm for a recommended course of action based on a sequence of information nodes; the second is classification and regression trees; and the third is survival trees.

Decision Trees

Often the medical decision maker will be faced with a sequential decision problem involving decisions that lead to different outcomes depending on chance. If the decision process involves many sequential decisions, then the decision problem becomes difficult to visualize and to implement. Decision trees are indispensable graphical tools in such settings. They allow for intuitive understanding of the problem and can aid in decision making.

A decision tree is a graphical model describing decisions and their possible outcomes. Decision trees consist of three types of nodes (see Figure 1):

1. Decision node: Often represented by squares showing decisions that can be made. Lines emanating from a square show all distinct options available at a node.

2. Chance node: Often represented by circles showing chance outcomes. Chance outcomes are events that can occur but are outside the ability of the decision maker to control.

3. Terminal node: Often represented by triangles or by lines having no further decision nodes or chance nodes. Terminal nodes depict the final outcomes of the decision making process.

[Figure: tree with nodes labeled 1, 2, A, B, and C leading to Outcomes 1 through 7; legend: Decision; Uncertainty (external event).]

Figure 1   Decision trees are graphical models for describing sequential decision problems.

For example, a hospital performing esophagectomies (surgical removal of all or part of the esophagus) for patients with esophageal cancer wishes to define a protocol for what constitutes an adequate lymphadenectomy in terms of total number of regional lymph nodes removed at surgery. The hospital believes that such a protocol should be guided by pathology (available to the surgeon prior to surgery). This information should include histopathologic cell type (squamous cell carcinoma or adenocarcinoma); histopathologic grade (a crude indicator of tumor biology); and depth of tumor invasion (pT classification). It is believed that the number of nodes to be removed should increase with more deeply invasive tumors when histopathologic grade is poorly differentiated and that the number of nodes differs by cell type.

The decision tree in this case is composed predominantly of chance outcomes, these being the results from pathology (cell type, grade, and tumor depth). The surgeon’s only decision is whether to perform the esophagectomy. If the decision is made to operate, then the surgeon follows this decision line on the graph, moving from left to right, using pathology data to eventually determine the terminal node. The terminal node, or final outcome, is the number of lymph nodes to be removed.

Decision trees can in some instances be used to make optimal decisions. To do so, the terminal nodes in the decision tree must be assigned terminal values (sometimes called payoff values or endpoint values). For example, one approach is to assign values to each decision branch and chance branch and define a terminal value as the sum of branch values leading to it. Once terminal values are assigned, tree values are calculated by following terminal values from right to left. To calculate the value of a chance outcome, multiply its value by its probability. The value of a chance node is the sum of these products. To determine the value of a decision node, the cost of each option along each decision line is subtracted from the value already calculated for that line. The result represents the benefit of the decision.

Classification Trees

In many medical settings, the medical decision maker may not know what the decision rule is. Rather, he or she would like to discover the decision rule by using data. In such settings, decision trees are often referred to as classification trees. Classification trees apply to data where the y-value (outcome) is a classification label, such as the disease status of a patient, and the medical decision maker would like to construct a decision rule that predicts the outcome using x-variables (independent variables) available in the data. Because the data set available is just one sample of the underlying population, it is desirable to construct a decision rule that is accurate not only for the data at hand but over external data as well (i.e., the decision rule should have good prediction performance). At the same time, it is helpful to have a decision rule that is understandable. That is, it should not be so complex that the decision maker is left with a black box. Decision trees offer a reasonable way to resolve these two conflicting needs.

Background

The use of tree methods for classification has a history that dates back at least 40 years. Much of the early work emanated from the area of social sciences, starting in the late 1960s, and computational algorithms for automatic construction of classification trees began as early as the 1970s. Algorithms such as the THAID program developed at the Institute for Social Research, University of Michigan, laid the groundwork for recursive partitioning algorithms, the predominant algorithm used by modern-day tree classifiers, such as Classification and Regression Tree (CART).

An Example

Classification trees are decision trees derived using recursive partitioning algorithms that classify each incoming x-data point (case) into one of the class labels for the outcome. A classification tree consists of three types of nodes (see Figure 2):

1. Root node: The top node of the tree comprising all the data.

2. Splitting node: A node that assigns data to a subgroup.

3. Terminal node: Final decision (outcome).

Figure 2 is a CART tree constructed using the breast cancer databases obtained from the University of Wisconsin Hospitals, Madison (available from http://archive.ics.uci.edu/ml). In total, the data comprise 699 patients classified as having either benign or malignant breast cancer. The goal here is to predict true disease status based on nine different variables collected from biopsy.

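The three node types can be illustrated with a small sketch of how a case is dropped down a binary classification tree. This is only an illustration under stated assumptions: the `Node` class and `classify` function are hypothetical names, and the tree is deliberately cut down to a single split. Only the root test (unsize < 2.5) is taken from Figure 2; the two daughters are treated here as terminal nodes, whereas in Figure 2 they are split further.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One node of a binary classification tree (hypothetical structure).

    A splitting node (the root included) carries a test and two daughters;
    a terminal node carries only a class label.
    """
    label: Optional[str] = None                     # set on terminal nodes
    test: Optional[Callable[[dict], bool]] = None   # True sends the case left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_terminal(self) -> bool:
        return self.test is None


def classify(node: Node, case: dict) -> str:
    """Drop a case down the tree until it reaches a terminal node."""
    while not node.is_terminal():
        node = node.left if node.test(case) else node.right
    return node.label


# Simplified tree: the root split comes from Figure 2; stopping after one
# split is an assumption made to keep the example short.
root = Node(
    test=lambda case: case["unsize"] < 2.5,
    left=Node(label="benign"),
    right=Node(label="malignant"),
)

print(classify(root, {"unsize": 1}))  # benign
print(classify(root, {"unsize": 8}))  # malignant
```

Representing the test as a function keeps the sketch agnostic about the kind of split, so a categorical test (such as membership in a set of values) would fit the same structure.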
[Figure: classification tree. The root node (benign, 458/241) splits on Unsize < 2.5 versus Unsize >= 2.5; lower nodes split on Nuclei, Unshape, and Clump. Each node shows its majority class and its benign/malignant counts.]

Figure 2   Classification tree for Wisconsin breast cancer data

Note: Light-shaded and dark-shaded barplots show frequency of data at each node for the two classes: benign (light shaded);
malignant (dark shaded). Terminal nodes are classified by majority voting (i.e., assignment is made to the class label having the
largest frequency). Labels in black given above a splitting node show how data are split depending on a given variable. In some
cases, there are missing data, which are indicated by a question mark.
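Majority voting is simple enough to sketch in a few lines. The counts below are the ones shown in Figure 2 for the root node (458 benign, 241 malignant) and for the leftmost terminal node (416 benign, 5 malignant); the function names are hypothetical, and breaking ties by taking the first label listed is an assumption, since the note does not say how ties are handled.

```python
def majority_label(counts: dict) -> str:
    """Return the class label with the largest frequency in a node.

    Ties go to the label listed first (an arbitrary choice for this sketch).
    """
    return max(counts, key=counts.get)


def majority_fraction(counts: dict) -> float:
    """Fraction of the cases in a node that carry the majority label."""
    return max(counts.values()) / sum(counts.values())


# Benign/malignant counts read from Figure 2.
root_node = {"benign": 458, "malignant": 241}
leftmost_terminal = {"benign": 416, "malignant": 5}

for name, counts in [("root", root_node), ("leftmost", leftmost_terminal)]:
    print(name, majority_label(counts), round(majority_fraction(counts), 3))
# root benign 0.655
# leftmost benign 0.988
```

The majority fraction gives a rough sense of how pure a node is: the root is a mixture of the two classes, while the leftmost terminal node is almost entirely benign.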

The first split of the tree (at the root node) is on the variable “unsize,” measuring uniformity of cell size. All patients having values less than 2.5 for this variable are assigned to the left node (the left daughter node); otherwise they are assigned to the right node (right daughter node). The left and right daughter nodes are then split (in this case, on the variable “unshape” for the right daughter node and on the variable “nuclei” for the left daughter node), and patients are assigned to subgroups defined by these splits. These nodes are then split, and the process is repeated recursively in a procedure called recursive partitioning. When the tree construction is completed, terminal nodes are assigned class labels by majority voting (the class label with the largest frequency). Each patient in a given terminal node is assigned the predicted class label for that terminal node. For example, the leftmost terminal node in Figure 2 is assigned the class label “benign” because 416 of the 421 cases in the node have that label. Looking at Figure 2, one can see that voting heavily favors one class over the other for all terminal nodes, showing that the decision tree is accurately classifying the data. However, it is important to assess accuracy using external data sets or by using cross-validation as well.

Recursive Partitioning

In general, recursive partitioning works as follows. The classification tree is grown starting at the root node, which is the top node of the tree, comprising all the data. The root node is split into two daughter nodes: a left and a right daughter node. In turn, each daughter node is split, with each split giving rise to left and right daughters. The process is repeated in a recursive fashion until the tree cannot be partitioned further due to lack of data or some stopping criterion is reached, resulting in a collection of terminal nodes. The terminal nodes represent a partition of the predictor space into a collection of rectangular regions that do not overlap. It should be noted, though, that this partition may be quite different from what might be found by exhaustively searching over all partitions corresponding to the same number of terminal nodes. However, for many problems, exhaustive searches for globally optimal partitions (in the sense of producing the most homogeneous leaves) are not computationally feasible, and recursive partitioning represents an effective way of undertaking this task by using a one-step procedure instead.

A classification tree as described above is referred to as a binary recursive partitioned tree. Another type of recursively partitioned tree is the multiway recursive partitioned tree. Rather than splitting the parent node into two daughter nodes, such trees use multiway splits that define multiple daughter nodes. However, there is little evidence that multiway splits produce better classifiers, and for this reason, as well as for their simplicity, binary recursive partitioned trees are often favored.

Splitting Rules

The success of CART as a classifier can be largely attributed to the manner in which splits are formed in the tree construction. To define a good split, CART uses an impurity function to measure the decrease in tree impurity for a split. The purity of a tree is a measure of how similar observations in the leaves are to one another. The best split for a node is found by searching over all possible variables and all possible split values and choosing the variable and split that reduce impurity the most. Reduction of tree impurity is a good principle because it encourages the tree to push dissimilar cases apart. Eventually, as the number of nodes increases, and dissimilar cases become separated into daughter nodes, each node in the tree becomes homogeneous and is populated by cases with similar outcomes (recall Figure 2).

There are several impurity functions used. These include the twoing criterion, the entropy criterion, and the Gini index. The Gini index is arguably the most popular. When the outcome has two class labels (the so-called two-class problem), the Gini index corresponds to the variance of the outcome if the class labels are recoded as being 0 and 1.

Stopping Rules

The size of the tree is crucial to the accuracy of the classifier. If the tree is too shallow, terminal nodes will not be pure (outcomes will be heterogeneous), and the accuracy of the classifier will suffer. If the tree is too deep (too many splits), then the number of cases within a terminal node will be small, and the predicted class label will have high variance, again undermining the accuracy of the classifier.

To strike a proper balance, pruning is employed in methodologies such as CART. To determine the optimal size of a tree, the tree is grown to full size (i.e., until all data are spent) and then pruned back. The optimal size is determined using a cost-complexity measure that balances the accuracy of the tree against the size of the tree.

Regression Trees

Decision trees can also be used to analyze data when the y-outcome is a continuous measurement (such as age, blood pressure, ejection fraction for the heart, etc.). Such trees are called regression trees. Regression trees can be constructed using recursive partitioning similar to classification trees. Impurity is measured using mean-square error. The terminal node values in a regression tree are defined as the mean value (average) of outcomes for patients within the terminal node. This is the predicted value for the outcome.

Survival Trees

Time-to-event data are often encountered in the medical sciences. For such data, the analysis focuses on understanding how time-to-event varies in terms of different variables that might be collected for a patient. Time-to-event can be time to death from a certain disease, time until recurrence (for cancer), time until first occurrence of a symptom, or simply all-cause mortality.

The analysis of time-to-event data is often complicated by the presence of censoring. Generally speaking, this means that the event times for some individuals in a study are not observed exactly and are only known to fall within certain time intervals. Right censoring is one of the most common types of censoring encountered. This occurs when the event of interest is observed only if it occurs prior to some prespecified time. For example, a patient might be monitored for 2 weeks without occurrence of a symptom and then released from a hospital. Such a patient is said to be right censored because the time-to-event must exceed 2 weeks, but the exact event time is unknown. Another example of right censoring occurs when patients enter a study at different times and the study is predetermined to end by a certain time. Then, all patients who do not experience an event within the study period are right censored.

Decision trees can be used to analyze right-censored survival data. Such trees are referred to as survival trees. Survival trees can be constructed using recursive partitioning. The measure of impurity plays a key role, as in CART, and this can be defined in many ways. One popular approach is to define impurity using the log-rank test. As in CART, growing a tree by reducing impurity ensures that terminal nodes are populated by individuals with similar behavior. In the case of a survival tree, terminal nodes are composed of patients with similar survival. The terminal node value in a survival tree is the survival function and is estimated using those patients within the terminal node. This differs from classification and regression trees, where terminal node values are a single value (the estimated class label or predicted value for the response, respectively). Figure 3 shows an example of a survival tree.

[Figure: survival tree. The root node splits on NKI70 (p = .007; ≤ 0 versus > 0); the NKI70 > 0 branch splits on TSP (p = .298; ≤ 1 versus > 1). Terminal nodes are Node 2 (n = 44), Node 4 (n = 22), and Node 5 (n = 44), each shown with a survival curve (vertical axis from 0 to 1, horizontal axis from 0 to 20).]

Figure 3   Binary survival tree for breast cancer patients

Note: Independent variables NKI70 and TSP are gene signatures. For example, extreme right terminal node (Node 5) corresponds to presence of both the NKI70 and TSP gene signatures. Underneath each terminal node are Kaplan-Meier survival curves for patients within that node.

Hemant Ishwaran and J. Sunil Rao

See also Decision Trees, Advanced Techniques in Constructing; Recursive Partitioning

Further Readings

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.

LeBlanc, M., & Crowley, J. (1993). Survival trees by goodness of split. Journal of the American Statistical Association, 88, 457–467.

Segal, M. R. (1988). Regression trees for censored data. Biometrics, 44, 35–47.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147.


Decision Trees, Advanced Techniques in Constructing

Decision trees such as classification, regression, and survival trees offer the medical decision maker a comprehensive way to calculate predictors and decision rules in a variety of commonly encountered data settings. However, performance of decision trees on external data sets can sometimes be poor. Aggregating decision trees is a simple way to improve performance, and in some instances, aggregated tree predictors can exhibit state-of-the-art performance.

Decision Boundary

Decision trees, by their very nature, are simple and intuitive to understand. For example, a binary classification tree assigns data by dropping a data point (case) down the tree and moving either left or right through nodes depending on the value of a given variable. The nature of a binary tree ensures that each case is assigned to a unique terminal node. The value for the terminal node (the predicted outcome) defines how the case is classified. By following the path as a case moves down the tree to its terminal node, the decision rule for that case can be read directly off the tree. Such a rule is simple to understand, as it is nothing more than a sequence of simple rules strung together.

The decision boundary, on the other hand, is a more abstract concept. Decision boundaries are estimated by a collection of decision rules for cases taken together; in the case of decision trees, the decision boundary is the boundary produced in the predictor space between classes by the decision tree. Unlike decision rules, decision boundaries are difficult to visualize and interpret for data involving more than one or two variables. However, when the data involve only a few variables, the decision boundary is a powerful way to visualize a classifier and to study its performance.

Consider Figure 1. On the left-hand side is the classification tree for a prostate data set. Here, the outcome is presence or absence of prostate cancer and the independent variables are prostate-specific antigen (PSA) and tumor volume, both having been transformed on the log scale. Each case in the data is classified uniquely depending on the value of these two variables. For example, the leftmost terminal node in Figure 1 is composed of those patients with tumor volumes less than 7.851 and PSA levels less than 2.549 (on the log scale). Terminal node values are assigned by majority voting (i.e., the predicted outcome is the class label with the largest frequency). For this node, there are 54 nondiseased patients and 16 diseased patients, and thus, the predicted class label is nondiseased.

The right-hand side of Figure 1 displays the decision boundary for the tree. The dark-shaded region is the space of all values for PSA and tumor volume that would be classified as nondiseased, whereas the light-shaded regions are those values classified as diseased. Superimposed on the figure,