Statistical inference for tree-based models is in its infancy, and there is a definite lack of formal procedures for inference. However, the method is rapidly gaining popularity as a means of devising prediction rules for rapid and repeated evaluation, as a screening method for variables, as a diagnostic technique to assess the adequacy of other types of models, and simply for summarizing large multivariate data sets. Some reasons for its recent popularity are that:
1. In certain applications, especially where the set of predictors contains a mix of numeric
variables and factors, tree-based models are sometimes easier to interpret and discuss
than linear models.
3. Tree-based models are more adept at capturing nonadditive behavior; the standard
linear model does not allow interactions between variables unless they are pre-specified
and of a particular multiplicative form. MARS can capture this automatically by
specifying a degree = 2 fit.
The form of the fitted surface or smooth obtained from a regression tree is
$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$
where the $c_m$ are constants and the $R_m$ are regions defined by a series of binary splits. If all the predictors are numeric, these regions form a set of disjoint hyper-rectangles with sides parallel to the axes such that
$$\bigcup_{m=1}^{M} R_m = \mathbb{R}^p .$$
Regardless of how the neighborhoods are defined, if we use the least squares criterion
$$\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$$
then the best estimate of the constant in each region is simply the average of the $y_i$ in that region,
$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m).$$
Consider a splitting variable $x_j$ and a split point $s$, and define the pair of regions
$$R_1(j,s) = \{x \mid x_j \le s\} \qquad\text{and}\qquad R_2(j,s) = \{x \mid x_j > s\}.$$
The quality of a candidate split is measured by
$$\mathrm{RSS}(j,s) = \min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 .$$
The goal at any given stage is to find the pair $(j,s)$ such that $\mathrm{RSS}(j,s)$ is minimal, or equivalently such that the overall RSS is maximally reduced. This may seem overwhelming, however it only requires examining at most $(n-1)$ splits for each variable, because the points in a neighborhood only change when the split point $s$ crosses an observed value. If we wish to split into three neighborhoods, i.e. split $R_1(j,s)$ or $R_2(j,s)$ after the first split, we have $(n-1)p$ possibilities for the first split and $(n-2)p$ possibilities for the second split, given the first. In total we have $(n-1)(n-2)p^2$ operations to find the best splits for $M=3$ neighborhoods. In general, for $M$ neighborhoods we have
$$(n-1)(n-2)\cdots(n-M+1)\,p^{M-1}$$
possibilities if all predictors are numeric! This gets too big for an exhaustive search, therefore we use the technique for $M=2$ recursively. This is the basic idea of recursive partitioning. One starts with the first split and obtains $R_1$ and $R_2$ as explained above. This split stays fixed and the same splitting procedure is applied recursively to the two regions $R_1$ and $R_2$. This procedure is repeated until we reach some stopping criterion, such as the nodes becoming homogeneous or containing very few observations. The rpart function uses two such stopping criteria: a node will not be split if it contains fewer than minsplit observations (default = 20), and we can also specify the minimum number of observations in a terminal node via minbucket (default = minsplit/3). A small illustration of the exhaustive search for a single best split is sketched below.
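The following is a small, purely illustrative R function (not part of rpart) that carries out the exhaustive least-squares search for a single best split described above. The data and variable names in the commented example are the ozone variables used later in these notes.

best.split <- function(X, y) {
  # exhaustive search over all predictors and all (at most n-1) candidate split points
  best <- list(rss = Inf, var = NA, s = NA)
  for (j in 1:ncol(X)) {
    xs <- sort(unique(X[, j]))
    if (length(xs) < 2) next
    cand <- (xs[-1] + xs[-length(xs)])/2        # points between consecutive observed values
    for (s in cand) {
      left  <- y[X[, j] <= s]
      right <- y[X[, j] >  s]
      rss   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = colnames(X)[j], s = s)
    }
  }
  best
}
# e.g.  best.split(Ozdata[, c("safb", "inbh")], Ozdata$upoz)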
The figures below, from p. 306 of The Elements of Statistical Learning, show a hypothetical tree fit based on two numeric predictors $X_1$ and $X_2$.
Let's examine these ideas using the ozone pollution data for the Los Angeles Basin
discussed earlier in the course. For simplicity we consider the case where p=2 . Here
we will develop a regression tree using rpart for predicting upper ozone concentration
using the temperature at Sandburg Air Force base and Daggett pressure.
> library(rpart)
> attach(Ozdata)
> oz.rpart <- rpart(upoz ~ inbh + safb)
> summary(oz.rpart)
> plot(oz.rpart)
> text(oz.rpart)
> post(oz.rpart,"Regression Tree for Upper Ozone Concentration")
> plot(oz.rpart,uniform=T,branch=1,compress=T,margin=0.05,cex=.5)
> text(oz.rpart,all=T,use.n=T,fancy=T,cex=.7)
> title(main="Regression Tree for Upper Ozone Concentration")
Example 6.1: Infant Mortality Rates for 77 Largest U.S. Cities in 2000
In this example we will examine how to build regression trees using functions in the
packages rpart and tree. We will also examine use of the maptree package to plot
the results.
> infmort.rpart = rpart(infmort~.,data=City,control=rpart.control(minsplit=10))
> summary(infmort.rpart)
Call:
rpart(formula = infmort ~ ., data = City, minsplit = 10)
n= 77
           CP nsplit  rel error    xerror       xstd
1  0.53569704      0 1.00000000 1.0108944 0.18722505
2  0.10310955      1 0.46430296 0.5912858 0.08956209
3  0.08865804      2 0.36119341 0.6386809 0.09834998
4  0.03838630      3 0.27253537 0.5959633 0.09376897
5  0.03645758      4 0.23414907 0.6205958 0.11162033
6  0.02532618      5 0.19769149 0.6432091 0.11543351
7  0.02242248      6 0.17236531 0.6792245 0.11551694
8  0.01968056      7 0.14994283 0.7060502 0.11773100
9  0.01322338      8 0.13026228 0.6949660 0.11671223
10 0.01040108      9 0.11703890 0.6661967 0.11526389
11 0.01019740     10 0.10663782 0.6749224 0.11583334
12 0.01000000     11 0.09644043 0.6749224 0.11583334
Node number 1: 77 observations,    complexity param=0.535697
  mean=12.03896, MSE=12.31978
  left son=2 (52 obs) right son=3 (25 obs)
  Primary splits:
      pct.black < 29.55 to the left,  improve=0.5356970, (0 missing)
      growth    < -5.55 to the right, improve=0.4818361, (0 missing)
      pct1par   < 31.25 to the left,  improve=0.4493385, (0 missing)
      precip    < 36.45 to the left,  improve=0.3765841, (0 missing)
      laborchg  < 2.85  to the right, improve=0.3481261, (0 missing)
  Surrogate splits:
      growth   < -2.6  to the right, agree=0.896, adj=0.68, (0 split)
      pct1par  < 31.25 to the left,  agree=0.896, adj=0.68, (0 split)
      laborchg < 2.85  to the right, agree=0.857, adj=0.56, (0 split)
      poverty  < 21.5  to the left,  agree=0.844, adj=0.52, (0 split)
      income   < 24711 to the right, agree=0.818, adj=0.44, (0 split)
  (The summary continues with analogous node-by-node listings of primary and surrogate
  splits for the remaining nodes; most of that output is omitted here. For example, one
  of the later nodes has the surrogate splits:)

      hisp.pop < 39157.5 to the right, agree=0.885, adj=0.75, (0 split)
      pctold   < 11.15   to the left,  agree=0.769, adj=0.50, (0 split)
      pct1par  < 23      to the right, agree=0.769, adj=0.50, (0 split)
      pctrent  < 41.85   to the right, agree=0.769, adj=0.50, (0 split)
      precip   < 15.1    to the left,  agree=0.769, adj=0.50, (0 split)
> plot(infmort.rpart)
> text(infmort.rpart)
The package maptree has a function draw.tree() that plots trees slightly differently.
> draw.tree(infmort.rpart)
The cross-validation columns (xerror and xstd) of the cp table are:

    xerror     xstd
1  1.01089 0.187225
2  0.59129 0.089562
3  0.63868 0.098350
4  0.59596 0.093769
5  0.62060 0.111620
6  0.64321 0.115434
7  0.67922 0.115517
8  0.70605 0.117731
9  0.69497 0.116712
10 0.66620 0.115264
11 0.67492 0.115833
12 0.67492 0.115833
> plotcp(infmort.rpart)
> plot(City$infmort,predict(infmort.rpart))
> row.names(City)
[1] "New.York.NY"
[6] "San.Diego.CA"
[11] "San.Jose.CA"
[16] "Columbus.OH"
[21] "El.Paso.TX"
[26] "New.Orleans.LA"
[31] "Long.Beach.CA"
[36] "Albuquerque.NM"
[41] "Tulsa.OK"
[46] "Cincinnati.OH"
[51] "Wichita.KS"
[56] "Tampa.FL"
[61] "Newark.NJ"
[66] "Aurora.CO"
"Lexington.Fayette.KY"
[71] "Jersey.City.NJ"
[76] "Richmond.VA"
"Los.Angeles.CA"
"Dallas.TX"
"Indianapolis.IN"
"Milwaukee.WI"
"Seattle.WA"
"Denver.CO"
"Kansas.City.MO"
"Atlanta.GA"
"Oakland.CA"
"Minneapolis.MN"
"Mesa.AZ"
"Arlington.TX"
"Corpus.Christi.TX"
"Riverside.CA"
"Chicago.IL"
"Phoenix.AZ"
"San.Francisco.CA"
"Memphis.TN"
"Cleveland.OH"
"Fort.Worth.TX"
"Virginia.Beach.VA"
"St.Louis.MO"
"Honolulu.CDP.HI"
"Omaha.NE"
"Colorado.Springs.CO"
"Anaheim.CA"
"Birmingham.AL"
"St.Petersburg.FL"
"Houston.TX"
"Detroit.MI"
"Baltimore.MD"
"Washington.DC"
"Nashville.Davidson.TN"
"Oklahoma.City.OK"
"Charlotte.NC"
"Sacramento.CA"
"Miami.FL"
"Toledo.OH"
"Las.Vegas.NV"
"Louisville.KY"
"Norfolk.VA"
"Rochester.NY"
"Philadelphia.PA"
"San.Antonio.TX"
"Jacksonville.FL"
"Boston.MA"
"Austin.TX"
"Portland.OR"
"Tucson.AZ"
"Fresno.CA"
"Pittsburgh.PA"
"Buffalo.NY"
"Santa.Ana.CA"
"St.Paul.MN"
"Anchorage.AK"
"Baton.Rouge.LA"
"Mobile.AL"
"Akron.OH"
"Raleigh.NC"
"Stockton.CA"
> identify(City$infmort,predict(infmort.rpart),labels=row.names(City))    identify some interesting points in the plot
[1] 14 19 37 44 61 63
> abline(0,1)    adds the line Y-hat = Y to the plot
> post(infmort.rpart)    creates a postscript version of the tree. You will need to download a
postscript viewer add-on for Adobe Reader to open it. Google "Postscript Viewer" and grab the one off
of cnet (http://download.cnet.com/Postscript-Viewer/3000-2094_4-10845650.html).
Using the draw.tree function from the maptree package we can produce the
following display of the full infant mortality regression tree.
> draw.tree(infmort.rpart)
Another function in the maptree library is the group.tree command, which labels the observations according to the terminal nodes they fall in. This can be particularly interesting when the observations have meaningful labels or are spatially distributed.
> infmort.groups = group.tree(infmort.rpart)
> infmort.groups
Here is a little function to display groups of observations in a data set given the group
identifier.
> groups = function(g,dframe) {
    ng <- length(unique(g))
    for(i in 1:ng) {
      cat(paste("GROUP ", i))
      cat("\n")
      cat("=========================================================\n")
      cat(row.names(dframe)[g == i])
      cat("\n\n")
    }
    cat("\n\n")
  }
> groups(infmort.groups,City)
GROUP 1
====================================================================
San.Jose.CA San.Francisco.CA Honolulu.CDP.HI Santa.Ana.CA Anaheim.CA
GROUP 2
==================================
Mesa.AZ Las.Vegas.NV Arlington.TX
GROUP 3
===============================================================================
Los.Angeles.CA San.Diego.CA Dallas.TX San.Antonio.TX El.Paso.TX Austin.TX
Denver.CO Long.Beach.CA Tucson.AZ Albuquerque.NM Fresno.CA Corpus.Christi.TX
Riverside.CA Stockton.CA
GROUP 4
===============================================================================
Phoenix.AZ Fort.Worth.TX Oklahoma.City.OK Sacramento.CA Minneapolis.MN Omaha.NE
Toledo.OH Wichita.KS Colorado.Springs.CO St.Paul.MN Anchorage.AK Aurora.CO
GROUP 5
===============================================================================
Houston.TX Jacksonville.FL Nashville.Davidson.TN Tulsa.OK Miami.FL Tampa.FL
St.Petersburg.FL
GROUP 6
===============================================================================
Indianapolis.IN Seattle.WA Portland.OR Lexington.Fayette.KY Akron.OH Raleigh.NC
GROUP 7
=================================================================
New.York.NY Columbus.OH Boston.MA Virginia.Beach.VA Pittsburgh.PA
GROUP 8
=================================================================
Milwaukee.WI Kansas.City.MO Oakland.CA Cincinnati.OH Rochester.NY
GROUP 9
===============================================================
Charlotte.NC Norfolk.VA Jersey.City.NJ Baton.Rouge.LA Mobile.AL
GROUP 10
============================================================
Chicago.IL New.Orleans.LA St.Louis.MO Buffalo.NY Richmond.VA
GROUP 11
===================================================================
Philadelphia.PA Memphis.TN Cleveland.OH Louisville.KY Birmingham.AL
GROUP 12
==========================================================
Detroit.MI Baltimore.MD Washington.DC Atlanta.GA Newark.NJ
> head(cpus)
By default rpart() uses a complexity penalty of cp = .01, which will prune off more terminal nodes than we might want to consider initially. I will generally use a smaller value of cp (e.g. .001) to obtain a tree that is larger but will likely overfit the data. Also, if you really want a large tree you can use the argument control=rpart.control(minsplit=##,minbucket=##) when calling rpart.
> printcp(cpus.tree)
Regression tree:
rpart(formula = log(Performance) ~ ., data = cpus[, 2:7], cp = 0.001)
Variables actually used in tree construction:
[1] cach chmax chmin mmax syct
Root node error: 228.59/209 = 1.0938
n= 209
    xerror     xstd
1  1.02344 0.098997
2  0.48514 0.049317
3  0.43673 0.043209
4  0.33004 0.033541
5  0.34662 0.034437
6  0.32769 0.034732
7  0.31008 0.031878
8  0.29809 0.030863
9  0.27080 0.028558
10 0.24297 0.026055
11 0.24232 0.026039
12 0.23530 0.025449
13 0.23783 0.025427
14 0.23683 0.025407
15 0.23703 0.025453
16 0.23455 0.025286
(the CP, nsplit, and rel error columns of the printcp table are not reproduced)
> plotcp(cpus.tree)
[plotcp display: cross-validated relative error versus cp, with the corresponding size of tree (2 up to 17 terminal nodes) along the top axis]
> plot(cpus.tree,uniform=T)
> text(cpus.tree,cex=.7)
Prune the tree back to a 10 split, 11 terminal node tree using cp = .0055.
> cpus.tree = rpart(log(Performance)~.,data=cpus[,2:7],cp=.0055)
> post(cpus.tree)
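An equivalent route is to prune the original large tree rather than refit it; a one-line sketch (cpus.big here is a hypothetical name standing in for the cp = .001 fit from above):

cpus.pruned <- prune(cpus.big, cp = 0.0055)   # prune the large tree back to the 11 terminal node tree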
> plot(log(Performance),predict(cpus.tree),ylab="Fitted
Values",main="Fitted Values vs. log(Performance)")
> abline(0,1)
> plot(predict(cpus.tree),resid(cpus.tree),xlab="Fitted Values",
ylab="Residuals",main="Residuals vs. Fitted (cpus.tree)")
> abline(h=0)
Why should averaging help? For a fixed $x$, write $\bar{f}(x) = E[\hat{f}(x)]$ for the expected fitted value over training samples. Then
$$E\big[(Y - \hat{f}(x))^2\big] = E\big[(Y - \bar{f}(x))^2\big] + \operatorname{Var}\big(\hat{f}(x)\big) \;\ge\; E\big[(Y - \bar{f}(x))^2\big],$$
so an averaged (aggregated) predictor can only reduce the expected squared prediction error. For each bootstrap sample $b$ we will obtain an estimated model $\hat{f}_b(x)$ and average:
$$\bar{f}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x).$$
This estimator of $E[Y \mid x]$ should in theory be better than the one obtained from a single fit to the training data. This process of averaging the predicted values at a given $x$ over bootstrap samples is called bagging. Bagging works best when the fitted models vary substantially from one bootstrap sample to the next. Modeling schemes that are complicated and involve the effective estimation of a large number of parameters will benefit from bagging the most. Projection Pursuit, CART, and MARS are examples of algorithms where this is likely to be the case. A small hand-rolled version of the idea is sketched below.
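The following is a minimal sketch of bagging regression trees by hand (it assumes the Ozdata data frame used below; the bagging() function in the ipred package used later does all of this, and more, for you):

library(rpart)

bag.rpart <- function(formula, data, B = 50) {
  # grow B trees, each on a bootstrap sample of the data
  fits <- vector("list", B)
  n <- nrow(data)
  for (b in 1:B) {
    sam <- sample(1:n, n, replace = TRUE)
    fits[[b]] <- rpart(formula, data = data[sam, ])
  }
  fits
}

bag.predict <- function(fits, newdata) {
  # average the predictions from the B bootstrap trees
  rowMeans(sapply(fits, predict, newdata = newdata))
}

# e.g.  oz.bag <- bag.rpart(upoz ~ inbh + safb, Ozdata, B = 50)
#       yhat   <- bag.predict(oz.bag, Ozdata)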
> names(Ozdata)
[1] "day"
"upoz"
Variable Importance
To measure variable importance, do the following. For each bootstrap sample we first compute the Out-of-Bag (OOB) error rate, $PE_b(OOB)$. Next we randomly permute the OOB values on the $j$th variable $X_j$ while leaving the data on all other variables unchanged. If $X_j$ is important, permuting its values will reduce our ability to predict the response successfully for the OOB observations. We then make predictions using the permuted $X_j$ values and all the other predictors unchanged to obtain $PE_b(OOB_j)$, which should be larger than the error rate of the unaltered data. The raw score for $X_j$ is the difference between these two OOB error rates,
$$raw_b(j) = PE_b(OOB_j) - PE_b(OOB), \qquad b = 1, \ldots, B.$$
Finally, average the raw scores over all $B$ trees in the forest,
$$imp(j) = \frac{1}{B}\sum_{b=1}^{B} raw_b(j),$$
to obtain an overall measure of the importance of $X_j$. This measure is called the raw permutation accuracy importance score for the $j$th variable. Assuming the $B$ raw scores are independent from tree to tree, we can compute a straightforward estimate of the standard error by computing the standard deviation of the $raw_b(j)$ values. Dividing the average raw importance score by this standard error gives what is called the mean decrease in accuracy for the $j$th variable. A rough illustration of the permutation idea is sketched below.
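The sketch below illustrates the permutation idea for a single predictor using a fitted randomForest object. It is only an approximation of what randomForest(importance=T) does internally: the package permutes within each tree's own OOB cases, whereas this sketch simply permutes the variable and re-predicts with the whole forest. The function and argument names here are hypothetical.

library(randomForest)

perm.importance <- function(rf, dat, yname, xname, nperm = 25) {
  # baseline error using the forest's predictions for the training data
  base.mse <- mean((dat[[yname]] - predict(rf, newdata = dat))^2)
  diffs <- numeric(nperm)
  for (k in 1:nperm) {
    dperm <- dat
    dperm[[xname]] <- sample(dperm[[xname]])   # break the link between X_j and the response
    diffs[k] <- mean((dat[[yname]] - predict(rf, newdata = dperm))^2) - base.mse
  }
  c(raw = mean(diffs), se = sd(diffs)/sqrt(nperm))
}

# e.g.  perm.importance(oz.rf, Ozdata2, "tupoz", "safb")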
Example 1: L.A. Ozone Levels
> oz.rf = randomForest(tupoz~.,data=Ozdata2,importance=T)
> oz.rf
Call:
randomForest(formula = tupoz ~ ., data = Ozdata2, importance = T)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.05886756
% Var explained: 78.11
Short function to display the mean decrease in accuracy from a random forest fit.

rfimp = function(rffit) {
  barplot(sort(rffit$importance[,1]), horiz = T,
          xlab = "Mean Decrease in Accuracy", main = "Variable Importance")
}
The temperature variables, inversion base temperature and Sandburg AFB temperature,
are clearly the most important predictors in the random forest.
The randomForest command takes a number of options. It should be noted that x and y can be replaced by the usual formula-building nomenclature, namely randomForest(y ~ ., data=Mydata, ...). Some of the more important options are illustrated in the sketch below.
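A hedged sketch of a typical call with some of the more commonly adjusted options (the specific values shown are illustrative, not recommendations from these notes; the comments note the usual defaults):

library(randomForest)

fit <- randomForest(y ~ ., data = Mydata,
                    ntree = 500,          # number of trees grown (default 500)
                    mtry = 3,             # predictors tried at each split (default ~ p/3 for regression)
                    nodesize = 5,         # minimum size of terminal nodes (default 5 for regression)
                    importance = TRUE,    # compute permutation variable importance
                    proximity = FALSE)    # proximity matrix between observations (can be large)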
Estimating RMSEP for random forest models can be done using the errorest function in the ipred package (the same package that provides bagging), as shown below. It is rather slow, so doing more than 10 replicates is not advisable. Each replicate does a 10-fold cross-validation B = 25 times per RMSEP estimate by default. It can be used with a variety of modeling methods; see the errorest help file for examples.
> error.RF = numeric(10)
> for (i in 1:10) error.RF[i] =
errorest(tupoz~.,data=Ozdata2,model=randomForest)$error this is one line.
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2395  0.2424  0.2444  0.2448  0.2466  0.2506
Example 6.4: Predicting San Francisco List Prices of Single Family Homes
Using www.redfin.com it is easy to create interesting data sets regarding the list
price of homes and different characteristics of the home. The map below shows
all the single-family homes listed in San Francisco just south of downtown and
the Golden Gate Bridge. Once you drill down to this level of detail you can
download the prices and home characteristics in an Excel file. From there it is
easy to load the data in JMP, edit it, and then read it into R.
> names(SFhomes)
[1] "ListPrice" "BEDS"
"BATHS"
"SQFT"
"YrBuilt"
"ParkSpots" "Garage"
"LATITUDE"
"LONGITUDE"
> str(SFhomes)
'data.frame':   263 obs. of  9 variables:
 $ ListPrice: int  749000 499900 579000 1295000 688000 224500 378000 140000 530000 399000 ...
 $ BEDS     : int  2 3 4 4 3 2 5 1 2 3 ...
 $ BATHS    : num  1 1 1 3.25 2 1 2 1 1 2 ...
 $ SQFT     : int  1150 1341 1429 2628 1889 995 1400 772 1240 1702 ...
 $ YrBuilt  : int  1931 1927 1937 1937 1939 1944 1923 1915 1925 1908 ...
 $ ParkSpots: int  1 1 1 3 1 1 1 1 1 1 ...
 $ Garage   : Factor w/ 2 levels "Garage","No": 1 1 1 1 1 1 1 2 1 2 ...
 $ LATITUDE : num  37.8 37.7 37.7 37.8 37.8 ...
 $ LONGITUDE: num  -122 -122 -122 -122 -122 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:66] 7 11 23 30 32 34 36 43 55 62 ...
  .. ..- attr(*, "names")= chr [1:66] "7" "11" "23" "30" ...
We will now develop a random forest model for the list price of the home as a function of
the number of bedrooms, number of bathrooms, square footage, year built, number of
parking spots, garage (Garage or No), latitude, and longitude.
> sf.rf = randomForest(ListPrice~.,data=SFhomes,importance=T)
> rfimp(sf.rf,horiz=F)
> plot(SFhomes$ListPrice,predict(sf.rf),xlab="Y",
ylab="Y-hat",main="Predict vs. Actual List Price")
> abline(0,1)
We now resume the model building process using the log of the list price as the response. To develop models we consider different choices for m, the number of predictors chosen in each random subset.
> attributes(sf.rf)
$names
 [1] "call"            "type"            "predicted"       "mse"             "rsq"             "oob.times"
 [7] "importance"      "importanceSD"    "localImportance" "proximity"       "ntree"           "mtry"
[13] "forest"          "coefs"           "y"               "test"            "inbag"           "terms"
$class
[1] "randomForest.formula" "randomForest"
> sf.rf$mtry
[1] 2
(results for smaller values of m omitted)

m = 5 *
> myforest = function(formula,data) {randomForest(formula,data,mtry=5)}
> for (i in 1:10) error.RF[i] =
+
errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2374  0.2391  0.2403  0.2418  0.2433  0.2500
m=6
> myforest = function(formula,data) {randomForest(formula,data,mtry=6)}
> for (i in 1:10) error.RF[i] =
+
errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2362  0.2403  0.2430  0.2439  0.2462  0.2548
As with earlier methods, it is not hard to write our own MCCV function.
> rf.cv = function(fit,data,p=.667,B=100,mtry=fit$mtry,ntree=fit$ntree) {
n <- dim(data)[1]
cv <- rep(0,B)
y = fit$y
for (i in 1:B) {
ss <- floor(n*p)
sam <- sample(1:n,ss,replace=F)
fit2 <- randomForest(formula(fit),data=data[sam,],mtry=mtry,ntree=ntree)
ynew <- predict(fit2,newdata=data[-sam,])
cv[i] <- mean((y[-sam]-ynew)^2)
}
cv
}
This function also allows you to experiment with different values for (mtry) & (ntree).
> results = rf.cv(sf.final,data=SFhomes.log,mtry=2)
> mean(results)
[1] 0.3059704
> results = rf.cv(sf.final,data=SFhomes.log,mtry=3)
> mean(results)
[1] 0.2960605
> results = rf.cv(sf.final,data=SFhomes.log,mtry=4)
> mean(results)
[1] 0.2945276
> results = rf.cv(sf.final,data=SFhomes.log,mtry=5) m = 5 is optimal.
> mean(results)
[1] 0.2889211
> results = rf.cv(sf.final,data=SFhomes.log,mtry=6)
> mean(results)
[1] 0.289072
> par(mfrow=c(2,3))
> partialPlot(sf.final,SFhomes.log,SQFT)
> partialPlot(sf.final,SFhomes.log,LONGITUDE)
> partialPlot(sf.final,SFhomes.log,LATITUDE)
> partialPlot(sf.final,SFhomes.log,BEDS)
> partialPlot(sf.final,SFhomes.log,BATHS)
> partialPlot(sf.final,SFhomes.log,YrBuilt)
> partialPlot(sf.final,SFhomes.log,ParkSpots)
> par(mfrow=c(1,1))      returns the plotting layout to the default
Boosting (R package gbm)
Boosting, like bagging, is a way to combine or average the results of multiple trees in order to improve their predictive ability. Boosting, however, does not simply average trees constructed from bootstrap samples of the original data; rather, it creates a sequence of trees where the next tree in the sequence essentially uses the residuals from the previous trees as the response. This type of approach is referred to as gradient boosting. Using squared error as the measure of fit, the Gradient Tree Boosting Algorithm is given below.
Gradient Tree Boosting Algorithm (Squared Error)
1. Initialize $f_0(x) = \bar{y}$.
2. For $m = 1, \ldots, M$:
   a) For $i = 1, \ldots, n$ compute $r_i = y_i - f_{m-1}(x_i)$, which are simply the residuals from the previous fit.
   b) Fit a regression tree using the $r_i$ as the response, giving terminal node regions $R_{jm}$, $j = 1, \ldots, J_m$.
   c) For $j = 1, 2, \ldots, J_m$ compute the mean of the residuals in each of the terminal nodes; call these $\gamma_{jm}$.
   d) Update the model as follows:
      $$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$$
The algorithm as presented above looks a bit daunting at first, however the graphic below
simplifies the boosting concept considerably.
[Diagram: the boosted fit as a sequence of models $\hat{f}_1(X), \hat{f}_2(X), \hat{f}_3(X), \ldots, \hat{f}_M(X)$, each fit to the residuals of the previous one, whose contributions are added together.]
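Before turning to the gbm package, here is a minimal hand-rolled sketch of the squared-error algorithm above using small rpart trees. It is purely illustrative (the function name is hypothetical and no shrinkage is applied).

library(rpart)

boost.rpart <- function(formula, data, M = 100, maxdepth = 2) {
  yname <- all.vars(formula)[1]
  y <- data[[yname]]
  f <- rep(mean(y), nrow(data))                 # step 1: f0(x) = ybar
  trees <- vector("list", M)
  dat <- data
  for (m in 1:M) {
    dat[[yname]] <- y - f                       # step 2a: residuals as the new response
    trees[[m]] <- rpart(formula, data = dat,
                        control = rpart.control(maxdepth = maxdepth, cp = 0))
    f <- f + predict(trees[[m]], newdata = data)  # steps 2b-2d: add the terminal node means
  }
  list(f0 = mean(y), trees = trees, fitted = f)
}

# e.g.  bfit <- boost.rpart(logCMEDV ~ ., Boston3, M = 100)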
> bos.gbm = gbm(logCMEDV~.,data=Boston3,
                distribution="gaussian",
                n.trees=2000,
                shrinkage=.1,
                interaction.depth=4,        # small Jm
                bag.fraction=0.5,
                train.fraction=0.8,
                n.minobsinnode=10,
                cv.folds=5,
                keep.data=T,
                verbose=T)
CV: 1
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
   1          0.1206          0.1379     0.1000    0.0183
   2          0.1045          0.1189     0.1000    0.0143
   3          0.0901          0.1029     0.1000    0.0138
   4          0.0791          0.0923     0.1000    0.0109
   5          0.0698          0.0833     0.1000    0.0096
   6          0.0611          0.0738     0.1000    0.0071
   7          0.0550          0.0664     0.1000    0.0056
   8          0.0496          0.0600     0.1000    0.0055
   9          0.0453          0.0567     0.1000    0.0040
  10          0.0412          0.0530     0.1000    0.0032
 100          0.0078          0.0242     0.1000   -0.0001

  (The trace continues every 100 iterations up to 2000, and a similar trace is printed for
  CV folds 2 through 5 and for the final fit. In each fold the training deviance keeps
  shrinking toward zero, the validation deviance levels off early, and the Improve column
  becomes slightly negative.)
The verbose output obviously provides a lot of detail about the fitted model that does not look particularly useful. We can plot the performance results using gbm.perf().
> gbm.perf(bos.gbm,method="OOB")
[1] 45
> title(main="Out of Bag and Training Prediction Errors")
[Plot: training and OOB error versus boosting iteration.]
> gbm.perf(bos.gbm,method="test")
[1] 869
[Plot: training and test error versus boosting iteration.]
> gbm.perf(bos.gbm,method="cv")
[1] 159
[Plot: training, test, and 5-fold CV error versus boosting iteration.]
> plot(logCMEDV,yhat)
> cor(logCMEDV,yhat)
[1] 0.9719412
> cor(logCMEDV,yhat)^2 R2 = 94.47%
[1] 0.9446697
Using the number of trees suggested by 5-fold cross-validation instead, i.e. a smaller M:
> plot(logCMEDV,predict(bos.gbm,n.trees=159))
> cor(logCMEDV,predict(bos.gbm,n.trees=159))
[1] 0.9679083
> cor(logCMEDV,predict(bos.gbm,n.trees=159))^2
[1] 0.9368466
We can consider varying the shrinkage parameter as well. For example, we can refit the Boston Housing model with a smaller shrinkage value. This requires a larger number of iterations (M), as evidenced below.
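As a sketch, such a refit might look like the following (the .05 shrinkage value is an illustrative guess, not necessarily the value used to produce the output below):

bos.gbm = gbm(logCMEDV~., data=Boston3, distribution="gaussian",
              n.trees=2000, shrinkage=.05, interaction.depth=4,
              bag.fraction=.5, train.fraction=.8, n.minobsinnode=10,
              cv.folds=5, keep.data=T, verbose=F)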
> gbm.perf(bos.gbm,method="OOB")
[1] 111
> gbm.perf(bos.gbm,method="cv")
[1] 273
> gbm.perf(bos.gbm,method="test")
[1] 1557
The fit is not much different from the model above, in fact it is actually a bit worse.
> cor(logCMEDV,yhat)
[1] 0.9718639
> cor(logCMEDV,yhat)^2
[1] 0.9445194
n.trees = 1557 overfit
We might consider trying larger values for shrinkage, say .15 or .20, but we will stick with the earlier shrinkage choice for subsequent gradient boosting models. In the examples above we used trees with Jm = 4 terminal nodes. We might now consider increasing the model size (Jm) as well. Before exploring tree sizes, however, we will look at the variable importances for our current model.
> best.iter = gbm.perf(bos.gbm,method="test")
> summary(bos.gbm,n.trees=best.iter)
        var    rel.inf
1     LSTAT 37.4903576
2        RM 23.8754098
3      CRIM  9.4464051
4       DIS  8.6265412
5       TAX  4.5582532
6         B  4.4923617
7       AGE  3.3002739
8       NOX  3.1613004
9   PTRATIO  2.3503516
10    INDUS  1.2405061
11      RAD  0.7834407
12     CHAS  0.5500685
13       ZN  0.1247303
We can consider increasing the size of the trees being averaged through the use of the
interaction depth, which determines the number of terminal nodes in each tree (Jm above).
Jm = 6
> bos.gbm2 = gbm(logCMEDV~.,data=Boston3,interaction.depth=6,
n.minobsinnode=10,n.trees=5000,bag.fraction=.5,train.fraction=.8,
cv.folds=5,shrinkage=.1,distribution="gaussian")
> gbm.perf(bos.gbm2,method="OOB")
[1] 103
> gbm.perf(bos.gbm2,method="cv")
[1] 130
> gbm.perf(bos.gbm2,method="test")
[1] 205
> yhat = predict(bos.gbm2,n.trees=best.iter)
> plot(logCMEDV,yhat)
> cor(logCMEDV,yhat)
[1] 0.9656995
> cor(logCMEDV,yhat)^2
[1] 0.9325755
Jm = 8
Untransformed response
Surprisingly the non-transformed response model fits even better, with Jm = 4 only!
The R-square is 96.84%.
> bos.gbm3 = gbm(CMEDV~.,data=Boston.working,interaction.depth=4,
n.minobsinnode=10,n.trees=5000, bag.fraction=.5,train.fraction=.8,
cv.folds=5,shrinkage=.1,distribution="gaussian")
> plot(Boston.working$CMEDV,yhat)
> cor(Boston.working$CMEDV,yhat)
[1] 0.9840717
> cor(Boston.working$CMEDV,yhat)^2
[1] 0.9683971
> library(plotmo)
> plotmo(bos.gbm3)
grid:
> par(mfrow=c(3,3))
> plot.gbm(bos.gbm3,1,n.trees=1344)
> plot.gbm(bos.gbm3,2,n.trees=1344)
> plot.gbm(bos.gbm3,3,n.trees=1344)
> plot.gbm(bos.gbm3,4,n.trees=1344)
> plot.gbm(bos.gbm3,5,n.trees=1344)
> plot.gbm(bos.gbm3,6,n.trees=1344)
> plot.gbm(bos.gbm3,7,n.trees=1344)
> plot.gbm(bos.gbm3,8,n.trees=1344)
> plot.gbm(bos.gbm3,9,n.trees=1344)
> plot.gbm(bos.gbm3,c(12,13,5),n.trees=245)
> cor(tupoz,yhat)
[1] 0.9618522
> cor(tupoz,yhat)^2
[1] 0.9251596
> yhat = predict(oz.gbm,n.trees=46) CV
> plot(tupoz,yhat)
> cor(tupoz,yhat)
[1] 0.9618522
> cor(tupoz,yhat)^2
[1] 0.9251596
Jm = 6
> oz.gbm3 = gbm(tupoz~.,data=Ozdata2,distribution="gaussian",
n.trees=2000,shrinkage=.1,interaction.depth=6,bag.fraction=.5,
train.fraction=.8,n.minobsinnode=10,cv.folds=5,keep.data=T,verbose=T)
> gbm.perf(oz.gbm2,method="OOB")
[1] 44
> gbm.perf(oz.gbm2,method="cv")
[1] 45
> gbm.perf(oz.gbm2,method="test")
[1] 1077
> yhat = predict(oz.gbm2,n.trees=1077)
> plot(tupoz,yhat)
> cor(tupoz,yhat)
[1] 0.9733923
> cor(tupoz,yhat)^2
[1] 0.9474925
> summary(oz.gbm2,n.trees=1077)
    var   rel.inf
1  safb 30.792830
2  inbt 23.339880
3   day 11.620270
4  inbh  8.588146
5  dagg  7.611727
6   hum  7.099951
7  v500  4.513893
8   vis  4.116435
9  wind  2.316868
> par(mfrow=c(3,3))
> plot.gbm(oz.gbm3,1,n.trees=45)
> plot.gbm(oz.gbm3,2,n.trees=45)
> plot.gbm(oz.gbm3,3,n.trees=45)
> plot.gbm(oz.gbm3,4,n.trees=45)
> plot.gbm(oz.gbm3,5,n.trees=45)
> plot.gbm(oz.gbm3,6,n.trees=45)
> plot.gbm(oz.gbm3,7,n.trees=45)
> plot.gbm(oz.gbm3,8,n.trees=45)
> plot.gbm(oz.gbm3,9,n.trees=45)
> plot.gbm(oz.gbm2,c(1,5),n.trees=45)
> plot.gbm(oz.gbm2,c(5,8,1),n.trees=45)
Day is X3 in the conditioning plot. Warning: this takes a very long time to run, even on a small dataset!
Based upon my experimentation with gradient boosted regression trees, the shrinkage value used above seems to be a good starting point. Experimenting with larger trees (Jm) and different shrinkage values (larger or smaller) seems to be worthwhile. For example, with the L.A. ozone data a different shrinkage with Jm = 6 produced a better tree than either of the ones above.
An MCCV cross-validation function for gradient boosted regression trees should take as arguments information about the terminal node size (Jm), which method will be used to choose M (i.e. OOB, CV, or Train/Test), and what shrinkage value to use throughout. The value for M could therefore vary from one bootstrap sample to the next. We definitely do not need to see verbose output from each loop. We also want to consider relatively small values for B because computation time could be an issue.
> Ozdata2 = data.frame(tupoz=Ozdata$upoz^.333,Ozdata[,-10])
> oz.gbm = gbm(tupoz~.,data=Ozdata2,distribution="gaussian",n.trees=2000,shrinkage=.1,
interaction.depth=4, bag.fraction=0.5,train.fraction=.8,n.minobsinnode=10,cv.folds=5,
keep.data=T,verbose=T)
> set.seed(1)
> results = gbm.cv(oz.gbm,Ozdata2,B=25,method="OOB")
> summary(results)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.04471 0.05670 0.06660 0.06614 0.07378 0.08713
> set.seed(1)
> results = gbm.cv(oz.gbm,data=Ozdata2,B=25,method="cv")
> summary(results)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.04596 0.05724 0.06429 0.06491 0.07215 0.08326
> set.seed(1)
> results = gbm.cv(oz.gbm3,data=Ozdata2,B=25,method="test")
> summary(results)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.04669 0.05481 0.06597 0.06544 0.07415 0.08376
gbm.cv = function(fit,data,p=.667,B=10,method="cv",interaction.depth=4,
shrinkage=.1,cv.folds=5) {
n <- dim(data)[1]
y = data[,1] #assumes response is in first column!!
cv <- rep(0,B)
for (i in 1:B) {
ss <- floor(n*p)
sam <- sample(1:n,ss,replace=F)
fit2 <- gbm(formula(data[sam,]),data=data[sam,],
shrinkage=shrinkage,distribution="gaussian",
n.trees=fit$n.trees,bag.fraction=fit$bag.fraction,
train.fraction=fit$train.fraction,
interaction.depth=interaction.depth,n.minobsinnode=10,
cv.folds=cv.folds,keep.data=T,verbose=F)
m = gbm.perf(fit2,method=method,plot.it=F)
ynew = predict(fit2,n.tree=m,newdata=data[-sam,])
cv[i] <- mean((y[-sam]-ynew)^2)
}
cv
}
This MCCV function is useful for choosing values for the interaction depth (Jm) and the shrinkage parameter. Also, the function gbm.cv above assesses the boosted tree models based upon the method employed to choose the number of trees to be averaged to obtain the final fit, with method="cv" recommended.
Another approach would be to fix values for the shrinkage parameter, the number of terminal nodes in the trees used in the boosted fit (Jm), and the number of trees (M) to average, with M chosen using the cross-validation features within the gbm function. We could then write a cross-validation function that takes these tuning parameters as arguments and uses bootstrapped training/test sets to assess predictive performance.
A better approach is to set aside a validation set that is NOT used in the model development
process at all and then fit models using the wide variety of approaches we have examined. We
can then predict the response value for the validation set to get an idea of which method has the
best predictive performance.
medv is the 8th variable in Boston3.

 [1] "B"       "zn"      "chas"    "crim"    "dis"     "indus"   "lstat"   "medv"    "nox"     "ptratio" "rad"     "rm"

 [1] "B"       "chas"    "crim"    "dis"     "indus"   "lstat"   "nox"     "ptratio" "rad"     "rm"
> dim(Boston.yt)
[1] 506 14
> set.seed(1)
> sam = sample(1:506,400,replace=F)
> Boston.train = Boston.yt[sam,]
> Boston.test = Boston.yt[-sam,]
> dim(Boston.train)
[1] 400 14
> dim(Boston.test)
[1] 106 14
RPART
Bagging (the number of trees to average, nbagg=40, was chosen via cross-validation using the training data only)
> bos.bag = bagging(logmedv~.,data=Boston.train,nbagg=40,coob=T)
> ypred = predict(bos.bag,newdata=Boston.test)
> mean((ypred-Boston.test[,1])^2)
[1] 0.02501989
Gradient Boosted Trees (Jm = 5 and shrinkage = .05; these choices were determined using the gbm.cv function above.)
> bos.gbm = gbm(logmedv~.,data=Boston.train,
                distribution="gaussian",
                n.trees=2000,
                interaction.depth=5,
                shrinkage=.05,
                bag.fraction=.5,
                train.fraction=.8,
                cv.folds=5)
> gbm.perf(bos.gbm,method="cv")
[1] 328
> ypred = predict(bos.gbm,newdata=Boston.test,n.trees=328)
> mean((Boston.test$logmedv-ypred)^2)
[1] 0.01806277
[Diagram: a treed regression. Binary splits such as X1 < 10 and X6 > .25 partition the predictor space into regions R1, ..., R4, and within each region a separate linear model E(Y | X in Rk) = βo + β1X1 + ... + βpXp is fit.]
There are two packages in R that perform treed regression, Cubist and party. We will
use the implementation in Cubist for regression problems where the response (Y) is
numeric. The basic function for fitting a treed OLS regression model is cubist().
cubist(x, y,
committees = 1, neighbors=0,
control = cubistControl(), ...)
Arguments

x           a matrix or data frame of predictor variables. Missing data are allowed but (at this time) only numeric, character, and factor values are allowed.

y           a numeric vector of outcome (response) values.

committees  an integer: how many committee models (e.g. boosting iterations) should be used?
control     a cubistControl() object containing fitting options.

neighbors   specifies whether or not to use nearest neighbors in making predictions. The idea behind nearest neighbors is outlined below and is taken from the website www.rulequest.com.
For some applications, the predictive accuracy of a rule-based model can be improved by combining it with
an instance-based or nearest-neighbor model. The latter predicts the target value of a new case by finding
the n most similar cases in the training data, and averaging their target values.
Cubist employs an unusual method for combining rule-based and instance-based models. Cubist finds the n training cases that are "nearest" (most similar) to the case in question. Then, rather than averaging their target values directly, Cubist first adjusts these values using the rule-based model.

The neighbors option instructs Cubist to use composite models of this type. The value of n, the number of nearest neighbors to be used, must be in the range 0 to 9. We can use cross-validation to choose optimal values for the number of committees and the number of nearest neighbors to use, as in the cubist.cv function below.
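As a quick illustration, here is a hedged sketch of fitting a committee model and adding a nearest-neighbor adjustment at prediction time (it assumes the bos.x and bos.y objects used below; the committee and neighbor counts are arbitrary):

library(Cubist)

bos.cub10 <- cubist(x = bos.x, y = bos.y, committees = 10)     # 10 committee models
pred5 <- predict(bos.cub10, newdata = bos.x, neighbors = 5)    # adjust with 5 nearest neighbors
mean((bos.y - pred5)^2)                                        # training MSE, rough comparison only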
cubist.cv = function(x,y,p=.667,B=10,committees=1,neighbors=0) {
n <- length(y)
cv <- rep(0,B)
for (i in 1:B) {
ss <- floor(n*p)
sam <- sample(1:n,ss,replace=F)
fit2 <- cubist(x[sam,],y[sam],committees=committees,neighbors=neighbors)
ynew <- predict(fit2,newdata=x[-sam,],neighbors=neighbors)
cv[i] <- mean((y[-sam]-ynew)^2)
}
cv
}
> bos.cub
Call:
cubist.default(x = bos.x, y = bos.y, committees = 1) no boosting
Number of samples: 506
Number of predictors: 13
Number of committees: 1
Number of rules: 8
> summary(bos.cub)
Call:
cubist.default(x = bos.x, y = bos.y, committees = 1)
Cubist [Release 2.07 GPL Edition] Tue Feb 28 09:14:54 2012
-----------------------------------------------------------

Target attribute `outcome'
Read 506 cases (14 attributes) from undefined.data
Model:
Rule 1: [44 cases, mean 9.196384, range 8.517193 to 9.786954, est err
0.166292]
if
LSTAT > 0.19
NOX > 6.71
then
outcome = 8.09701 + 0.579 DIS + 0.29 NOX - 2.9 LSTAT - 0.201 RM
- 0.0065 CRIM
Rule 2: [20 cases, mean 9.518296, range 8.853665 to 10.22194, est err
0.169161]
if
TAX > 469
LSTAT > 0.19
NOX <= 6.71
then
outcome = 23.867485 - 0.01515 TAX - 0.589 NOX - 0.315 DIS - 0.13 LSTAT
+ 0.0005 RAD - 0.0005 CRIM
Rule 3: [25 cases, mean 9.687253, range 9.169518 to 10.12663, est err 0.104636]
if
RM > 6.24
LSTAT <= 0.19
NOX > 6.47
then
outcome = 10.179097 - 4.83 LSTAT + 0.109 DIS - 0.0034 CRIM + 0.25 B
+ 0.021 RM + 0.0016 RAD - 7e-005 TAX - 0.005 PTRATIO
- 0.007 NOX
Rule 4: [19 cases, mean 9.695155, range 9.449357 to 10.07323, est err
0.159993]
if
TAX <= 469
LSTAT > 0.19
NOX <= 6.71
then
outcome = 10.350003 - 0.1233 CRIM - 0.0027 AGE - 0.042 NOX - 0.51 LSTAT
+ 0.0022 RAD - 0.00011 TAX - 0.008 DIS + 0.023 RM
- 0.007 PTRATIO + 0.1 B
Rule 5: [25 cases, mean 9.881541, range 9.296518 to 10.81978, est err 0.151990]
if
RM <= 6.24
DIS <= 1.88
LSTAT <= 0.19
then
outcome = 12.574974 - 1.199 DIS - 4.64 LSTAT - 0.0123 CRIM
Rule 6: [180 cases, mean 9.892811, range 9.230143 to 10.49681, est err
0.089884]
if
RM <= 6.24
DIS > 1.88
LSTAT <= 0.19
then
outcome = 9.811655 - 1.6 LSTAT + 0.141 RM + 0.83 B - 0.03 DIS
- 0.028 PTRATIO - 0.0016 AGE + 0.0033 RAD - 0.0027 CRIM
- 0.00013 TAX - 0.016 NOX
Rule 7: [10 cases, mean 10.114698, range 9.581903 to 10.81978, est err
0.192493]
if
CRIM > 2.92
RM > 6.24
LSTAT <= 0.19
NOX <= 6.47
then
outcome = 8.469767 - 6.14 LSTAT + 0.361 NOX + 0.032 RM - 0.0019 CRIM
+ 0.14 B - 6e-005 TAX + 0.001 RAD - 0.003 PTRATIO
+ 0.0007 INDUS - 0.002 DIS
Rule 8: [185 cases, mean 10.271324, range 9.615806 to 10.81978, est err
0.073224]
if
CRIM <= 2.92
RM > 6.24
then
outcome = 9.023735 + 0.296 RM - 2.73 LSTAT - 0.00043 TAX - 0.024 PTRATIO
- 0.02 DIS + 0.0042 RAD - 0.0042 CRIM - 0.001 AGE + 0.27 B
+ 0.0027 INDUS - 0.01 NOX
Evaluation on training data (506 cases):

    Average  |error|           0.091012
    Relative |error|               0.30
    Correlation coefficient        0.95

	Attribute usage:
	  Conds  Model

	   84%    91%    RM
	   64%   100%    LSTAT
	   40%   100%    DIS
	   38%   100%    CRIM
	   23%    95%    NOX
	    8%    86%    TAX
	          86%    RAD
	          82%    PTRATIO
	          82%    B
	          76%    AGE
	          38%    INDUS
We can also estimate variable importance by using the varImp( ) command from the
caret library.
> library(caret)
> varImp(bos.cub)
        Overall
RM         87.5
LSTAT      82.0
DIS        70.0
CRIM       69.0
NOX        59.0
TAX        47.0
RAD        43.0
PTRATIO    41.0
B          41.0
AGE        38.0
INDUS      19.0
ZN          0.0
CHAS        0.0
> plot(bos.y,predict(bos.cub,newdata=bos.x),xlab="logCMEDV",
ylab="Fitted Values from Treed Regression")
> abline(0,1,lty=2,col="blue")
A classification tree produces, for each terminal node (neighborhood) $N_A$, estimated class probabilities
$$\hat{P}(\text{class } i \mid \mathbf{x} \in N_A) \quad \text{such that} \quad \sum_{i=1}^{C} \hat{P}(\text{class } i \mid \mathbf{x} \in N_A) = 1.$$
Here $N_A$ is a neighborhood defined by a set of covariates/predictors $\mathbf{x}$. The neighborhoods are found by a series of binary splits chosen to minimize the overall loss of the resulting tree. For classification problems, measuring overall loss can be a bit complicated. One obvious method is to construct classification trees so that the overall misclassification rate is minimized. In fact, this is precisely what the RPART algorithm does by default. However, in classification problems it is oftentimes the case that we wish to incorporate prior knowledge about likely class membership. This knowledge is represented by prior probabilities of an observation being from class i, which we will denote by $\pi_i$. Naturally the priors must be chosen in such a way that they sum to 1. Other information we might want to incorporate into the modeling process is the cost or loss incurred by classifying an object from class i as being from class j and vice versa. With this information provided, we would expect the resulting tree to avoid making the most costly misclassifications on our training data set.
Some notation used by Therneau & Atkinson (1999) for determining the loss at a given node A:

  $n_{iA}$  = number of observations in node A from class i
  $n_i$     = number of observations in the training set from class i
  $n$       = total number of observations in the training set
  $n_A$     = number of observations in node A
  $\pi_i$   = prior probability of being from class i (by default $\pi_i = n_i / n$)
  $L(i,j)$  = loss incurred by classifying a class i object as being from class j
              (i.e. c(j|i) from our discussion of discriminant analysis)
  $\tau(A)$ = predicted class for node A
In general, the loss is specified as a matrix
$$L = \begin{pmatrix}
0 & L(1,2) & L(1,3) & \cdots & L(1,C) \\
L(2,1) & 0 & L(2,3) & \cdots & L(2,C) \\
\vdots & & \ddots & & \vdots \\
L(C,1) & L(C,2) & \cdots & L(C,C-1) & 0
\end{pmatrix}.$$
By default this is a symmetric matrix with $L(i,j) = L(j,i) = 1$ for all $i \neq j$.
Using the notation and concepts presented above, the risk or loss at a node A is given by
$$R(A) = \sum_{i=1}^{C} \pi_i \, L(i, \tau(A)) \, \frac{n_{iA}}{n_i} \cdot \frac{n}{n_A}.$$
Example: Kyphosis Data
> library(rpart)
> data(kyphosis)
> attach(kyphosis)
> names(kyphosis)
[1] "Kyphosis" "Age"
"Number"
"Start"
R(root) = 0.2098765 × 1 × (17/17) × (81/81) × 81 = 17
R(A)    = 0.7901235 × 1 × (8/64) × (81/19) × 19 = 8
(here each node risk R(·) has been multiplied by the node size n_A to give the total loss in the node)
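The arithmetic can be checked directly in R (a small hypothetical snippet using the default priors computed from the kyphosis data):

n  <- 81
ni <- c(absent = 64, present = 17)
pi.hat <- ni/n                                     # default priors n_i / n

pi.hat["present"] * 1 * (17/17) * (81/81) * 81     # root node, predicted absent:  17
pi.hat["absent"]  * 1 * (8/64)  * (81/19) * 19     # node A, predicted present:     8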
Suppose now we have prior beliefs that 65% of patients will not have Kyphosis (absent)
and 35% of patients will have Kyphosis (present).
> k.priors <-rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35)))
> k.priors
n= 81
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 81 28.350000 absent (0.65000000 0.35000000)
2) Start>=12.5 46 3.335294 absent (0.91563089 0.08436911) *
3) Start< 12.5 35 16.453130 present (0.39676840 0.60323160)
6) Age< 34.5 10 1.667647 absent (0.81616742 0.18383258) *
7) Age>=34.5 25 9.049219 present (0.27932897 0.72067103) *
This says that it is 4 times more serious to misclassify a child that actually has kyphosis
(present) as not having it (absent). Again we will use the priors from the previous model
(65% - absent, 35% - present).
> lmat <- matrix(c(0,4,1,0),nrow=2,ncol=2,byrow=T)
> k.priorloss <- rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35),loss=lmat))
> k.priorloss
n= 81
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 81 52.650000 present (0.6500000 0.3500000)
2) Start>=14.5 29 0.000000 absent (1.0000000 0.0000000) *
3) Start< 14.5 52 28.792970 present (0.5038760 0.4961240)
6) Age< 39 15 6.670588 absent (0.8735178 0.1264822) *
7) Age>=39 37 17.275780 present (0.3930053 0.6069947) *
> misclass.rpart(owl.tree)
Table of Misclassification
(row = predicted, col = actual)
                 1  2  3  4  5  6  7
annandalfi      24  2  0  0  0  0  0
argentiventer    1 27  0  2  0  1  0
exulans          0  0 24  0  0  0  1
rajah            0  4  0 11  1  2  0
surifer          0  1  0  0  6  0  0
tiomanicus       0  3  0  3  3 36  0
whiteheadi       0  0  0  0  0  0 27

Misclassification Rate = 0.134
> misclass.rpart(owl.tree2)
Table of Misclassification
(row = predicted, col = actual)
                 1  2  3  4  5  6  7
annandalfi      24  2  0  0  0  0  0
argentiventer    1 27  0  2  0  1  0
exulans          0  0 24  0  0  0  1
rajah            0  4  0 11  1  2  0
surifer          0  1  0  1  9  5  0
tiomanicus       0  3  0  2  0 31  0
whiteheadi       0  0  0  0  0  0 27

Misclassification Rate = 0.145
> plot(owl.tree)
> text(owl.tree,cex=.6)
We can see that teeth row and palantine foramen figure prominently in the classification
rule. Given the analyses we have done previously this is not surprising.
Cross-validation is done in the usual fashion: we leave out a certain percentage of the observations, develop a model from the remaining data, predict back the class of the observations we set aside, calculate the misclassification rate, and then repeat this process a number of times. The function tree.cv3 (a course-provided function, sketched below) leaves out (1/k)×100% of the data at a time to perform the cross-validation.
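A rough sketch of the idea behind tree.cv3 (a hypothetical helper, not the actual course function):

library(rpart)

tree.cv.sketch <- function(fit, data, yname, k = 10, reps = 100) {
  n <- nrow(data)
  miss <- numeric(reps)
  for (r in 1:reps) {
    hold <- sample(1:n, floor(n/k))                     # hold out (1/k)*100% of the data
    fit2 <- rpart(formula(fit), data = data[-hold, ], method = "class")
    pred <- predict(fit2, newdata = data[hold, ], type = "class")
    miss[r] <- mean(pred != data[[yname]][hold])        # hold-out misclassification rate
  }
  miss
}

# e.g.  summary(tree.cv.sketch(owl.tree, OwlDiet, "species"))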
> tree.cv3(owl.tree,k=10,data=OwlDiet,y=species,reps=100)
[Output: 100 Monte Carlo hold-out misclassification rates, each a multiple of 1/17 (17 observations are held out per replicate), ranging from 0.000 to about 0.412.]