A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)

Introduction
Tree based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable to solving any kind of problem at hand (classification or regression).
Methods like decision trees, random forest and gradient boosting are popularly used in all kinds of data science problems. Hence, for every analyst (freshers included), it's important to learn these algorithms and use them for modeling.
This tutorial is meant to help beginners learn tree based modeling from scratch. After successful completion of this tutorial, one is expected to become proficient at using tree based algorithms and building predictive models.
Note: This tutorial requires no prior knowledge of machine learning. However, elementary knowledge of R or Python will be helpful. To get started you can follow the full tutorial in R and the full tutorial in Python.

Table of Contents
1. What is a Decision Tree? How does it work?
2. Regression Trees vs Classification Trees
3. How does a tree decide where to split?
4. What are the key parameters of model building and how can we avoid overfitting in decision trees?
5. Are tree based models better than linear models?
6. Working with Decision Trees in R and Python
7. What are the ensemble methods of tree based models?


8. What is Bagging? How does it work?
9. What is Random Forest? How does it work?
10. What is Boosting? How does it work?
11. Which is more powerful: GBM or Xgboost?
12. Working with GBM in R and Python
13. Working with Xgboost in R and Python
14. Where to Practice?

1. What is a Decision Tree? How does it work?
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator in the input variables.

Example:
Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate students


who play cricket in their leisure time based on the most significant input variable among all three.
This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the variable which creates the best homogeneous sets of students (which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender is able to identify the best homogeneous sets compared to the other two variables.

As mentioned above, a decision tree identifies the most significant variable and the value of it that gives the best homogeneous sets of the population. Now the question which arises is, how does it identify the variable and the split? To do this, decision trees use various algorithms, which we shall discuss in the following section.

Types of Decision Trees
The type of a decision tree is based on the type of target variable we have. It can be of two types:
1. Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. Example: In the above scenario of the student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.

Example: Let's say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Now, as we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for a continuous


variable.

Important Terminology related to Decision Trees
Let's look at the basic terminology used with decision trees:
1. Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is the opposite process of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of the sub-nodes, whereas the sub-nodes are the children of the parent node.

These are the terms commonly used for decision trees. As we know, every algorithm has advantages and disadvantages; below are the important factors which one should know.

Advantages
1. Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive and users can easily relate it to their hypothesis.
2. Useful in Data Exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relation between two or more variables. With the help of decision trees, we can create new variables / features that have better power to predict the target variable. You can refer to the article (Trick to enhance power of regression model) for one such trick. It can also be used in the data exploration stage. For example, if we are working on a problem where we have information available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is not influenced by outliers and missing values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non-Parametric Method: A decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.

Disadvantages
1. Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem gets solved by setting constraints on model parameters and by pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes variables into different bins.

2. Regression Trees vs Classification Trees
We all know that the terminal nodes (or leaves) lie at the bottom of a decision tree. This means that decision trees are typically drawn upside down, such that the leaves are at the bottom and the root is at the top (as shown below).


Both types of trees work almost similarly to each other. Let's look at the primary differences and similarities between classification and regression trees:
1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
2. In the case of a regression tree, the value obtained at a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mean value.
3. In the case of a classification tree, the value (class) obtained at a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mode value.
4. Both trees divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high dimensional boxes.
5. Both trees follow a top-down greedy approach known as recursive binary splitting. We call it top-down because it begins at the top of the tree, where all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is known as greedy because the algorithm cares (looks for the best variable available) only about the current split, and not about future splits which could lead to a better tree.
6. This splitting process is continued until a user defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
7. In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in pruning. Pruning is one of the techniques used to tackle overfitting. We'll learn more about it in a following section.

3. How does a tree decide where to split?
The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of a node increases with respect to the target variable. A decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variable. Let's look at the four most commonly used algorithms in decision trees:

Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.
1. It works with a categorical target variable "Success" or "Failure".
2. It performs only binary splits.
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to Calculate Gini for a split


1. Calculate Gini for the sub-nodes, using the formula: sum of squares of the probabilities of success and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.

Example: Referring to the example used above, we want to segregate the students based on the target variable (playing cricket or not). In the snapshot below, we split the population using two input variables, Gender and Class. Now, I want to identify which split produces more homogeneous sub-nodes using the Gini index.

Split on Gender:
1. Calculate Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
2. Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
3. Calculate weighted Gini for Split Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59

Similarly for Split on Class:
1. Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
2. Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
3. Calculate weighted Gini for Split Class = (14/30)*0.51 + (16/30)*0.51 = 0.51

Above, you can see that the Gini score for Split on Gender is higher than for Split on Class; hence, the node split will take place on Gender.
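For a quick check, here is a minimal Python sketch reproducing the weighted Gini calculation above (using this article's convention Gini = p^2 + q^2, so higher means more homogeneous; the counts come from the student example):

def gini(play, size):
    # Gini for one sub-node: p^2 + q^2, with p the share of students who play cricket
    p = play / size
    return p ** 2 + (1 - p) ** 2

def weighted_gini(groups, total):
    # groups: list of (play_count, group_size) pairs, one per sub-node
    return sum(size / total * gini(play, size) for play, size in groups)

total = 30
print(weighted_gini([(2, 10), (13, 20)], total))   # Gender split -> ~0.59
print(weighted_gini([(6, 14), (9, 16)], total))    # Class split  -> ~0.51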

Chi-Square


It is an algorithm to find out the statistical significance of the differences between the sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
1. It works with a categorical target variable "Success" or "Failure".
2. It can perform two or more splits.
3. The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.
4. The Chi-Square of each node is calculated using the formula:
5. Chi-Square = ((Actual - Expected)^2 / Expected)^(1/2)
6. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-Square for a split:
1. Calculate the Chi-Square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate the Chi-Square of the split as the sum of all Chi-Square values of Success and Failure of each node of the split.

Example: Let's work with the above example that we used to calculate Gini.
Split on Gender:
1. First we populate the node Female: the actual values for "Play Cricket" and "Not Play Cricket" are 2 and 8 respectively.
2. Calculate the expected values for "Play Cricket" and "Not Play Cricket"; here it would be 5 for both, because the parent node has a probability of 50% and we have applied the same probability to the Female count (10).
3. Calculate the deviations by using the formula Actual - Expected. For "Play Cricket" it is (2 - 5 = -3) and for "Not Play Cricket" it is (8 - 5 = 3).
4. Calculate the Chi-Square of the node for "Play Cricket" and "Not Play Cricket" using the formula ((Actual - Expected)^2 / Expected)^(1/2). You can refer to the table below for the calculation.
5. Follow similar steps for calculating the Chi-Square value for the Male node.
6. Now add all Chi-Square values to calculate the Chi-Square for the split on Gender.


Split on Class:
Perform similar steps of calculation for the split on Class and you will come up with the table below.

Above, you can see that Chi-Square also identifies the Gender split as more significant compared to Class.
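As a quick check of the steps above, here is a minimal Python sketch that follows the article's formula Chi-Square = ((Actual - Expected)^2 / Expected)^(1/2) for each of Play / Not Play in a node, with expected counts taken from the parent node's 50% play rate (counts from the student example):

def node_chi(play_actual, size, parent_rate=0.5):
    # Chi-Square contribution of one sub-node (Play Cricket + Not Play Cricket)
    expected_play = size * parent_rate
    expected_not = size * (1 - parent_rate)
    chi_play = ((play_actual - expected_play) ** 2 / expected_play) ** 0.5
    chi_not = (((size - play_actual) - expected_not) ** 2 / expected_not) ** 0.5
    return chi_play + chi_not

# Gender split: Female node (2 of 10 play), Male node (13 of 20 play)
print(node_chi(2, 10) + node_chi(13, 20))   # ~4.58
# Class split: Class IX (6 of 14 play), Class X (9 of 16 play)
print(node_chi(6, 14) + node_chi(9, 16))    # ~1.46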

Information Gain:
Look at the image below and think which node can be described easily. I am sure your answer is C, because it requires less information as all its values are similar. On the other hand, B requires more information to describe it and A requires the maximum information. In other words, we can say that C is a pure node, B is less impure and A is more impure.

Now, we can conclude that a less impure node requires less information to describe it, and a more impure node requires more information. Information theory has a measure to define this degree of disorganization in a system, known as entropy. If a sample is completely homogeneous, then the entropy is zero, and if the sample is equally divided (50% - 50%), it has an entropy of one.


Entropy can be calculated using the formula:

Entropy = -p*log2(p) - q*log2(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is also used with a categorical target variable. It chooses the split which has the lowest entropy compared to the parent node and other splits. The lesser the entropy, the better it is.
Steps to calculate entropy for a split:
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split and calculate the weighted average of all sub-nodes available in the split.

Example: Let's use this method to identify the best split for the student example.
1. Entropy for parent node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1. Here 1 shows that it is an impure node.
2. Entropy for Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72 and for Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93
3. Entropy for split on Gender = weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86
4. Entropy for Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99 and for Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99
5. Entropy for split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest of all, so the tree will split on Gender. We can derive the information gain from entropy as 1 - Entropy.
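For a quick check, here is a minimal Python sketch reproducing the entropy numbers above for the student example:

import math

def entropy(play, size):
    # Entropy of one node: -p*log2(p) - q*log2(q)
    p = play / size
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

def weighted_entropy(groups, total):
    # groups: list of (play_count, group_size) pairs, one per sub-node
    return sum(size / total * entropy(play, size) for play, size in groups)

total = 30
print(entropy(15, 30))                               # parent node   -> 1.0
print(weighted_entropy([(2, 10), (13, 20)], total))  # Gender split  -> ~0.86
print(weighted_entropy([(6, 14), (9, 16)], total))   # Class split   -> ~0.99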

Reduction in Variance
Till now, we have discussed the algorithms for a categorical target variable. Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion to split the population:


Variance = sum((X - X̄)^2) / n

Above, X̄ is the mean of the values, X is an actual value and n is the number of values.
Steps to calculate variance:
1. Calculate the variance for each node.
2. Calculate the variance for each split as the weighted average of the node variances.

Example: Let's assign the numerical value 1 for "play cricket" and 0 for "not playing cricket". Now follow the steps to identify the right split:
1. Variance for Root node: here the mean value is (15*1 + 15*0)/30 = 0.5 and we have 15 ones and 15 zeros. Now the variance would be ((1 - 0.5)^2 + (1 - 0.5)^2 + ... 15 times + (0 - 0.5)^2 + (0 - 0.5)^2 + ... 15 times) / 30, which can be written as (15*(1 - 0.5)^2 + 15*(0 - 0.5)^2)/30 = 0.25
2. Mean of Female node = (2*1 + 8*0)/10 = 0.2 and Variance = (2*(1 - 0.2)^2 + 8*(0 - 0.2)^2)/10 = 0.16
3. Mean of Male node = (13*1 + 7*0)/20 = 0.65 and Variance = (13*(1 - 0.65)^2 + 7*(0 - 0.65)^2)/20 = 0.23
4. Variance for Split on Gender = weighted variance of sub-nodes = (10/30)*0.16 + (20/30)*0.23 = 0.21
5. Mean of Class IX node = (6*1 + 8*0)/14 = 0.43 and Variance = (6*(1 - 0.43)^2 + 8*(0 - 0.43)^2)/14 = 0.24
6. Mean of Class X node = (9*1 + 7*0)/16 = 0.56 and Variance = (9*(1 - 0.56)^2 + 7*(0 - 0.56)^2)/16 = 0.25
7. Variance for Split on Class = (14/30)*0.24 + (16/30)*0.25 = 0.25

Above, you can see that the Gender split has a lower variance compared to the parent node, so the split would take place on the Gender variable.
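For a quick check, here is a minimal Python sketch reproducing the reduction-in-variance numbers above (play cricket = 1, not play = 0):

def node_variance(play, size):
    # Variance of a node whose values are `play` ones and `size - play` zeros
    mean = play / size
    return (play * (1 - mean) ** 2 + (size - play) * (0 - mean) ** 2) / size

def weighted_variance(groups, total):
    # groups: list of (play_count, group_size) pairs, one per sub-node
    return sum(size / total * node_variance(play, size) for play, size in groups)

total = 30
print(node_variance(15, 30))                          # root node    -> 0.25
print(weighted_variance([(2, 10), (13, 20)], total))  # Gender split -> ~0.21
print(weighted_variance([(6, 14), (9, 16)], total))   # Class split  -> ~0.25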
Until here, we learnt about the basics of decision trees and the decision making process involved in choosing the best splits for building a tree model. As I said, a decision tree can be applied to both regression and classification problems. Let's understand these aspects in detail.

4. What are the key parameters of tree modeling and how can we avoid overfitting in decision trees?


Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:
1. Setting constraints on tree size
2. Tree pruning

Let's discuss both of these briefly.

Setting Constraints on Tree Size
This can be done by using various parameters which are used to define a tree. First, let's look at the general structure of a decision tree:

The parameters used for defining a tree are explained below. The parameters described below are irrespective of tool; it is important to understand the role of the parameters used in tree modeling. These parameters are available in R & Python, and a short scikit-learn sketch follows the list below.
1. Minimum samples for a node split
Defines the minimum number of samples (or observations) which are required in a node for it to be considered for splitting.
Used to control overfitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
Too high values can lead to underfitting; hence, it should be tuned using CV.
2. Minimum samples for a terminal node (leaf)
Defines the minimum samples (or observations) required in a terminal node or leaf.
Used to control overfitting, similar to min_samples_split.
Generally, lower values should be chosen for imbalanced class problems, because the regions in which the minority class will be in the majority will be very small.
3. Maximum depth of tree (vertical depth)
The maximum depth of a tree.
Used to control overfitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
Should be tuned using CV.
4. Maximum number of terminal nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
5. Maximum features to consider for a split
The number of features to consider while searching for the best split. These will be randomly selected.
As a thumb rule, the square root of the total number of features works great, but we should check up to 30-40% of the total number of features.
Higher values can lead to overfitting, but this depends on the case.
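For illustration, here is a minimal sketch of how these constraints map onto scikit-learn's DecisionTreeClassifier arguments; the parameter values are purely illustrative, not recommendations:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    min_samples_split=50,   # minimum samples for a node split
    min_samples_leaf=20,    # minimum samples for a terminal node (leaf)
    max_depth=8,            # maximum depth of tree
    max_leaf_nodes=100,     # maximum number of terminal nodes
    max_features="sqrt",    # maximum features to consider for a split
)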

Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping conditions is reached. Let's consider the following case when you're driving:


There are 2 lanes:
1. A lane with cars moving at 80 km/h
2. A lane with trucks moving at 30 km/h

At this instant, you are the yellow car and you have 2 choices:
1. Take a left and overtake the other 2 cars quickly
2. Keep moving in the present lane

Let's analyze these choices. With the former choice, you'll immediately overtake the car ahead, reach behind the truck and start moving at 30 km/h, looking for an opportunity to move back to the right. All cars originally behind you move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in, say, the next 10 seconds. With the latter choice, you sail through at the same speed, cross the trucks and then overtake, maybe, depending on the situation ahead. Greedy you!
This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.
So we know pruning is better. But how do we implement it in a decision tree? The idea is simple.
1. We first grow the decision tree to a large depth.
2. Then we start at the bottom and start removing leaves which are giving us negative returns when compared from the top.


3. Suppose a split is giving us a gain of, say, -10 (a loss of 10) and then the next split on that gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning, we will see that the overall gain is +10 and keep both leaves.

Note that sklearn's decision tree classifier does not currently support pruning. Advanced packages like xgboost have adopted tree pruning in their implementation. But the rpart library in R provides a function to prune. Good for R users!

5. Are tree based models better than linear models?
"If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees?" Many of us have this question. And this is a valid one too.
Actually, you can use either algorithm. It depends on the type of problem you are solving. Let's look at some key factors which will help you decide which algorithm to use:
1. If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree based model.
2. If there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.
3. If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!

6. Working with Decision Trees in R and Python
For R users and Python users, decision trees are quite easy to implement. Let's quickly look at the set


of code snippets which can get you started with this algorithm. For ease of use, I've shared standard code where you'll need to replace your dataset name and variables to get started.
For R users, there are multiple packages available to implement decision trees, such as ctree, rpart, tree etc.

> library(rpart)
> x <- cbind(x_train, y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
> summary(fit)
# Predict Output
> predicted <- predict(fit, x_test)

In the code above:
y_train represents the dependent variable.
x_train represents the independent variables.
x represents the training data.

For Python users, below is the code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test dataset
#Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; here you can change the criterion to gini or entropy (information gain); by default it is gini
# model = tree.DecisionTreeRegressor() for regression
#Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted = model.predict(x_test)

7. What are ensemble methods in tree based modeling?
The literary meaning of the word "ensemble" is group. Ensemble methods involve a group of predictive models working together to achieve better accuracy and model stability. Ensemble methods are known to impart a supreme boost to tree based models.
Like every other model, a tree based model also suffers from the plague of bias and variance. Bias means how much, on average, the predicted values differ from the actual values. Variance means how different the predictions of the model at the same point will be if different samples are taken from the same population.
If you build a small tree, you will get a model with low variance and high bias. How do you manage to balance the trade-off between bias and variance?
Normally, as you increase the complexity of your model, you will see a reduction in prediction error due to lower bias in the model. As you continue to make your model more complex, you end up overfitting your model and your model starts suffering from high variance.
A champion model should maintain a balance between these two types of errors. This is known as the trade-off management of bias-variance errors. Ensemble learning is one way to execute this trade-off analysis.


Some of the commonly used ensemble methods include Bagging, Boosting and Stacking. In this tutorial, we'll focus on Bagging and Boosting in detail.

8. What is Bagging? How does it work?
Bagging is a technique used to reduce the variance of our predictions by combining the results of multiple classifiers modeled on different sub-samples of the same data set. The following figure makes it clearer:


The steps followed in bagging are:
1. Create Multiple Data Sets:
Sampling is done with replacement on the original data and new data sets are formed.
The new data sets can have a fraction of the columns as well as of the rows, which are generally hyper-parameters in a bagging model.
Taking row and column fractions of less than 1 helps in making robust models that are less prone to overfitting.
2. Build Multiple Classifiers:
Classifiers are built on each data set.
Generally the same classifier is modeled on each data set and predictions are made.
3. Combine Classifiers:
The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
The combined values are generally more robust than those of a single model.

Note that the number of models built is not a hyper-parameter here. A higher number of models is always better, or may give similar performance to a lower number. It can be theoretically shown that the variance of the combined predictions is reduced to 1/n (n: number of classifiers) of the original variance, under some assumptions.
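For illustration, here is a minimal sketch of these steps using scikit-learn's BaggingClassifier; it assumes X (predictors), y (target) and x_test are defined as in the earlier snippets, and the values are illustrative only:

from sklearn.ensemble import BaggingClassifier

# Default base learner is a decision tree; each of the 100 models is fit on a
# bootstrap sample drawn with replacement, using 80% of rows and 80% of columns.
model = BaggingClassifier(n_estimators=100, max_samples=0.8, max_features=0.8)
model.fit(X, y)
predicted = model.predict(x_test)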


There are various implementations of bagging models. Random forest is one of them, and we'll discuss it next.

9. What is Random Forest? How does it work?
Random Forest is considered to be a panacea for all data science problems. On a funny note, when you can't think of any algorithm (irrespective of the situation), use random forest!
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensionality reduction, treats missing values and outlier values, and performs other essential steps of data exploration, and it does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

How does it work?
In Random Forest, we grow multiple trees as opposed to a single tree as in the CART model (see the comparison between CART and Random Forest here, part 1 and part 2). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest), and in case of regression, it takes the average of the outputs of the different trees.


It works in the following manner. Each tree is planted and grown as follows:
1. Assume the number of cases in the training set is N. Then a sample of these N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the n trees (i.e., majority votes for classification, average for regression).


To understand this algorithm in more detail using a case study, please read the article Introduction to Random forest - Simplified.

Advantages of Random Forest
This algorithm can solve both types of problems, i.e. classification and regression, and does a decent estimation on both fronts.
One of the benefits of Random Forest which excites me most is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. Further, the model outputs the importance of each variable, which can be a very handy feature (on some random data set).


It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
It has methods for balancing errors in data sets where classes are imbalanced.
The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
Random Forest involves sampling of the input data with replacement, called bootstrap sampling. Here one third of the data is not used for training and can be used for testing. These are called the out of bag samples. The error estimated on these out of bag samples is known as the out of bag error. Study of such error estimates gives evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set (a short sketch follows this list).
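Here is a minimal sketch of the out-of-bag error estimate with scikit-learn's RandomForestClassifier; it assumes X and y are defined as in the earlier snippets, and the number of trees is illustrative:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=500, oob_score=True)
model.fit(X, y)
print(model.oob_score_)   # accuracy estimated on the out-of-bag samples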

Disadvantages of Random Forest
It surely does a good job at classification, but not as good for regression problems, as it does not give precise continuous predictions. In the case of regression, it doesn't predict beyond the range seen in the training data, and it may overfit data sets that are particularly noisy.
Random Forest can feel like a black box approach for statistical modelers; you have very little control over what the model does. You can at best try different parameters and random seeds!

Python & R implementation
Random forests have commonly known implementations in R packages and in Python's scikit-learn. Let's look at the code for building a random forest model in R and Python below:
Python

#Import Library
from sklearn.ensemble import RandomForestClassifier # use RandomForestRegressor for regression problems
#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test dataset
#Create Random Forest object
model = RandomForestClassifier(n_estimators=1000)
#Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted = model.predict(x_test)

R Code

> library(randomForest)
> x <- cbind(x_train, y_train)
# Fitting model
> fit <- randomForest(y_train ~ ., data = x, ntree = 500)
> summary(fit)
# Predict Output
> predicted <- predict(fit, x_test)


10. What is Boosting? How does it work?
Definition: The term "Boosting" refers to a family of algorithms which convert weak learners to strong learners.
Let's understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify spam and non-spam emails using the following criteria. If:
1. The email has only one image file (promotional image): it's a SPAM
2. The email has only link(s): it's a SPAM
3. The email body consists of a sentence like "You won a prize money of $xxxxxx": it's a SPAM
4. The email is from our official domain "Analyticsvidhya.com": Not a SPAM
5. The email is from a known source: Not a SPAM

Above, we've defined multiple rules to classify an email as spam or not spam. But do you think these rules individually are strong enough to successfully classify an email? No.
Individually, these rules are not powerful enough to classify an email as spam or not spam. Therefore, these rules are called weak learners.
To convert a weak learner to a strong learner, we combine the predictions of each weak learner using methods like:
Using average / weighted average
Considering the prediction which has the higher vote

For example: Above, we have defined 5 weak learners. Out of these 5, 3 vote "SPAM" and 2 vote "Not a SPAM". In this case, by default, we'll consider the email as SPAM because we have a higher number (3) of votes for "SPAM".
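Here is a toy Python sketch of that majority vote; the individual rule outputs below are hypothetical placeholders for the five rules above (1 = SPAM, 0 = Not a SPAM):

# Votes from the five weak rules for one example email: 3 say SPAM, 2 say Not a SPAM
weak_rule_votes = [1, 1, 1, 0, 0]
prediction = "SPAM" if sum(weak_rule_votes) > len(weak_rule_votes) / 2 else "Not a SPAM"
print(prediction)   # -> SPAM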

How does it work?


Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule. An immediate question which should pop into your mind is, "How does boosting identify weak rules?"
To find a weak rule, we apply base learning (ML) algorithms with a different distribution each time. Each time a base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
Here's another question which might haunt you, "How do we choose a different distribution for each round?"
For choosing the right distribution, here are the steps:
Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.
Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay higher attention to the observations having prediction error. Then we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs of the weak learners and creates a strong learner which eventually improves the prediction power of the model. Boosting pays higher focus to examples which are mis-classified or have higher errors under the preceding weak rules.
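For illustration, classical boosting algorithms such as AdaBoost follow exactly these re-weighting steps. A minimal scikit-learn sketch, used here only as an example of a boosting workflow (not the focus of this tutorial), assuming X, y and x_test as in the earlier snippets:

from sklearn.ensemble import AdaBoostClassifier

# Each new weak learner pays more attention to observations the previous ones got wrong
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
predicted = model.predict(x_test)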
There are many boosting algorithms which impart an additional boost to a model's accuracy. In this tutorial, we'll learn about the two most commonly used algorithms, i.e. Gradient Boosting (GBM) and XGboost.

11. Which is more powerful: GBM or Xgboost?


I've always admired the boosting capabilities of the xgboost algorithm. At times, I've found that it provides better results compared to a GBM implementation, but at times you might find that the gains are just marginal. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages of Xgboost over GBM:
1. Regularization:
A standard GBM implementation has no regularization like XGBoost does; this regularization helps XGBoost reduce overfitting.
In fact, XGBoost is also known as a "regularized boosting" technique.
2. Parallel Processing:
XGBoost implements parallel processing and is blazingly fast compared to GBM.
But hang on, we know that boosting is a sequential process, so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.
XGBoost also supports implementation on Hadoop.
3. High Flexibility
XGBoost allows users to define custom optimization objectives and evaluation criteria.
This adds a whole new dimension to the model and there is no limit to what we can do.
4. Handling Missing Values
XGBoost has an in-built routine to handle missing values.
The user is required to supply a value different from other observations and pass that as a parameter. XGBoost tries different things as it encounters a missing value at each node and learns which path to take for missing values in the future.
5. Tree Pruning:
A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
Another advantage is that sometimes a split with a negative loss, say -2, may be followed by a split with a positive loss of +10. GBM would stop as it encounters -2. But XGBoost will go deeper and will see a combined effect of +8 for the split and keep both.
6. Built-in Cross-Validation
XGBoost allows the user to run a cross-validation at each iteration of the boosting process, and thus it is easy to get the exact optimum number of boosting iterations in a single run (a minimal sketch follows this list).
This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
7. Continue on Existing Model
The user can start training an XGBoost model from the last iteration of a previous run. This can be of significant advantage in certain specific applications.
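For point 6, here is a minimal sketch of xgboost's built-in cross-validation; it assumes the xgboost Python package is installed and that X and y are defined as in the earlier snippets, and the parameter values are illustrative only:

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
# 5-fold CV with early stopping finds the optimum number of boosting rounds in one run
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    metrics="error", early_stopping_rounds=10)
print(len(cv_results))   # number of boosting rounds retained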
