

Complete Guide to Parameter Tuning in XGBoost (with codes in Python)
MACHINE LEARNING | PYTHON

AARSHAY JAIN, MARCH 1, 2016


Introduction
If things don't go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It's a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities of data.

Building a model using XGBoost is easy. But improving the model using XGBoost is difficult (at least I struggled a lot). This algorithm uses multiple parameters. To improve the model, parameter tuning is a must. It is very difficult to get answers to practical questions like: Which set of parameters should you tune? What is the ideal value of these parameters to obtain optimal output?

This article is best suited to people who are new to XGBoost. In this article, we'll learn the art of parameter tuning along with some useful information about XGBoost. Also, we'll practice this algorithm using a data set in Python.


What should you know?
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article, Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and of parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn't have been possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

Table of Contents
1. The XGBoost Advantage
2. Understanding XGBoost Parameters
3. Tuning Parameters (with Example)


1. The XGBoost Advantage
I've always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages:
1. Regularization:
Standard GBM implementation has no regularization like XGBoost; therefore, regularization also helps to reduce overfitting.
In fact, XGBoost is also known as a 'regularized boosting' technique.
2. Parallel Processing:
XGBoost implements parallel processing and is blazingly faster as compared to GBM.
But hang on, we know that boosting is a sequential process, so how can it be parallelized? We know that each tree can be built only after the previous one, so what stops us from making a tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.
XGBoost also supports implementation on Hadoop.
3. High Flexibility
XGBoost allows users to define custom optimization objectives and evaluation criteria.
This adds a whole new dimension to the model and there is no limit to what we can do.
4. Handling Missing Values
XGBoost has an in-built routine to handle missing values.
The user is required to supply a different value than other observations and pass that as a parameter.
XGBoost tries different things as it encounters a missing value on each node and learns which path to take for missing values in the future.
5. Tree Pruning:
A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
Another advantage is that sometimes a split with a negative loss, say -2, may be followed by a split with a positive loss of +10. GBM would stop as soon as it encounters -2. But XGBoost will go deeper, see the combined effect of +8 for the two splits and keep both.
6. Built-in Cross-Validation
XGBoost allows the user to run a cross-validation at each iteration of the boosting process, and thus it is easy to get the exact optimum number of boosting iterations in a single run (see the sketch after this list).
This is unlike GBM, where we have to run a grid search and only limited values can be tested.
7. Continue on Existing Model
Users can start training an XGBoost model from the last iteration of a previous run. This can be of significant advantage in certain specific applications.
The sklearn implementation of GBM also has this feature, so they are even on this point.
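To make the built-in cross-validation point concrete, here is a minimal sketch of how xgb.cv can pick the number of boosting rounds. The parameter values are illustrative only, and dmatrix is an assumed name for an xgb.DMatrix built from your training data:

import xgboost as xgb

# dmatrix: an xgb.DMatrix of the training data (assumed to exist already)
params = {'objective': 'binary:logistic', 'eta': 0.1, 'max_depth': 5}
cv_results = xgb.cv(params, dmatrix, num_boost_round=1000, nfold=5,
                    metrics='auc', early_stopping_rounds=50)
print len(cv_results)   # number of boosting rounds retained after early stopping

This is the same mechanism the modelfit function defined later in this article relies on.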

I hope you now understand the sheer power of the XGBoost algorithm. Note that these are the points which I could muster. Do you know a few more? Feel free to drop a comment below and I will update the list.
Did I whet your appetite? Good. You can refer to the following web pages for a deeper understanding:
XGBoost Guide - Introduction to Boosted Trees
Words from the Author of XGBoost [Video]

2. XGBoost Parameters
The overall parameters have been divided into 3 categories by the XGBoost authors:
1. General Parameters: Guide the overall functioning
2. Booster Parameters: Guide the individual booster (tree/regression) at each step
3. Learning Task Parameters: Guide the optimization performed

I will give analogies to GBM here and highly recommend reading the GBM article to learn from the very basics.

General Parameters
These define the overall functionality of XGBoost.
1. booster [default=gbtree]
Select the type of model to run at each iteration. It has 2 options:
gbtree: tree-based models
gblinear: linear models
2. silent [default=0]:
Silent mode is activated if it is set to 1, i.e. no running messages will be printed.
It's generally good to keep it 0, as the messages might help in understanding the model.
3. nthread [defaults to the maximum number of threads available if not set]
This is used for parallel processing and the number of cores in the system should be entered.


If you wish to run on all cores, the value should not be entered and the algorithm will detect it automatically.

There are 2 more parameters which are set automatically by XGBoost and you need not worry about them. Let's move on to the booster parameters.

Booster Parameters
Though there are 2 types of boosters, I'll consider only the tree booster here because it always outperforms the linear booster and thus the latter is rarely used.
1. eta [default=0.3]
Analogous to the learning rate in GBM.
Makes the model more robust by shrinking the weights at each step.
Typical final values to be used: 0.01-0.2
2. min_child_weight [default=1]
Defines the minimum sum of weights of all observations required in a child.
This is similar to min_samples_leaf in GBM, but not exactly: this refers to the minimum sum of weights of observations, while GBM uses the minimum number of observations.
Used to control overfitting. Higher values prevent the model from learning relations which might be highly specific to the particular sample selected for a tree.
Too high values can lead to underfitting; hence, it should be tuned using CV.
3. max_depth [default=6]
The maximum depth of a tree, same as in GBM.
Used to control overfitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
Should be tuned using CV.
Typical values: 3-10
4. max_leaf_nodes
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
If this is defined, GBM will ignore max_depth.
5. gamma [default=0]
A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
Makes the algorithm conservative. The values can vary depending on the loss function and


should be tuned.
6. max_delta_step [default=0]
This is the maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
This is generally not used, but you can explore it further if you wish.
7. subsample [default=1]
Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
Lower values make the algorithm more conservative and prevent overfitting, but too small values might lead to underfitting.
Typical values: 0.5-1
8. colsample_bytree [default=1]
Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
Typical values: 0.5-1
9. colsample_bylevel [default=1]
Denotes the subsample ratio of columns for each split, at each level.
I don't use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you feel so.
10. lambda [default=1]
L2 regularization term on weights (analogous to Ridge regression).
This is used to handle the regularization part of XGBoost. Though many data scientists don't use it often, it should be explored to reduce overfitting.
11. alpha [default=0]
L1 regularization term on weights (analogous to Lasso regression).
Can be used in case of very high dimensionality so that the algorithm runs faster when implemented.
12. scale_pos_weight [default=1]
A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each


step.
1. objective [default=reg:linear]
This defines the loss function to be minimized. The most commonly used values are:
binary:logistic - logistic regression for binary classification; returns the predicted probability (not the class)
multi:softmax - multiclass classification using the softmax objective; returns the predicted class (not probabilities). You also need to set an additional num_class parameter defining the number of unique classes.
multi:softprob - same as softmax, but returns the predicted probability of each data point belonging to each class.
2. eval_metric [default according to objective]
The metric to be used for validation data.
The default values are rmse for regression and error for classification.
Typical values are:
rmse - root mean square error
mae - mean absolute error
logloss - negative log-likelihood
error - binary classification error rate (0.5 threshold)
merror - multiclass classification error rate
mlogloss - multiclass logloss
auc - area under the curve
3. seed [default=0]
The random number seed.
Can be used for generating reproducible results and also for parameter tuning.
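To tie the three parameter groups together, here is a minimal sketch of passing them to the native xgboost API. The values are illustrative only, and dmatrix is an assumed name for an xgb.DMatrix built from the training data:

import xgboost as xgb

# dmatrix: an xgb.DMatrix of the training data (assumed to exist already)
params = {
    'booster': 'gbtree',              # general parameter
    'silent': 0,                      # general parameter
    'nthread': 4,                     # general parameter
    'eta': 0.1,                       # booster parameter
    'max_depth': 5,                   # booster parameter
    'min_child_weight': 1,            # booster parameter
    'subsample': 0.8,                 # booster parameter
    'objective': 'binary:logistic',   # learning task parameter
    'eval_metric': 'auc',             # learning task parameter
    'seed': 27
}
model = xgb.train(params, dmatrix, num_boost_round=100)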

If you've been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which uses the sklearn-style naming convention. The parameter names which change are:
1. eta -> learning_rate
2. lambda -> reg_lambda
3. alpha -> reg_alpha

You must be wondering that we have defined everything except something similar to the n_estimators parameter in GBM. Well, this exists as a parameter in XGBClassifier. However, it has to be passed as num_boost_round when calling xgb.train or xgb.cv in the standard xgboost


implementation.
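As a quick illustration of the naming difference, the same settings can be written in both interfaces as below; the values are arbitrary, and dmatrix is again an assumed xgb.DMatrix:

from xgboost.sklearn import XGBClassifier
import xgboost as xgb

# sklearn wrapper: sklearn-style names; n_estimators plays the role of the number of boosting rounds
clf = XGBClassifier(learning_rate=0.1, reg_lambda=1, reg_alpha=0, n_estimators=100)

# native interface: original names; the number of boosting rounds is passed to xgb.train directly
params = {'eta': 0.1, 'lambda': 1, 'alpha': 0}
# model = xgb.train(params, dmatrix, num_boost_round=100)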
I recommend you go through the following parts of the xgboost guide to better understand the parameters and code:
1. XGBoost Parameters (official guide)
2. XGBoost Demo Codes (xgboost GitHub repository)
3. Python API Reference (official guide)

3. Parameter Tuning with Example
We will take the data set from the Data Hackathon 3.x AV hackathon, the same as that used in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps:
1. City variable dropped because of too many categories
2. DOB converted to Age | DOB dropped
3. EMI_Loan_Submitted_Missing created, which is 1 if EMI_Loan_Submitted was missing, else 0 | Original variable EMI_Loan_Submitted dropped
4. EmployerName dropped because of too many categories
5. Existing_EMI imputed with 0 (median) since only 111 values were missing
6. Interest_Rate_Missing created, which is 1 if Interest_Rate was missing, else 0 | Original variable Interest_Rate dropped
7. Lead_Creation_Date dropped because it made little intuitive impact on the outcome
8. Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
9. Loan_Amount_Submitted_Missing created, which is 1 if Loan_Amount_Submitted was missing, else 0 | Original variable Loan_Amount_Submitted dropped
10. Loan_Tenure_Submitted_Missing created, which is 1 if Loan_Tenure_Submitted was missing, else 0 | Original variable Loan_Tenure_Submitted dropped
11. LoggedIn, Salary_Account dropped
12. Processing_Fee_Missing created, which is 1 if Processing_Fee was missing, else 0 | Original variable Processing_Fee dropped
13. Source - top 2 kept as is and all others combined into a 'different' category
14. Numerical and One-Hot Coding performed

For those who have the original data from the competition, you can check out these steps from the


data_preparation iPython notebook in the repository.
Let's start by importing the required libraries and loading the data:

# Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   # Additional sklearn functions
from sklearn.grid_search import GridSearchCV    # Performing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:
1. xgb - this is the direct xgboost library. I will use a specific function, cv, from this library
2. XGBClassifier - this is an sklearn wrapper for XGBoost. It allows us to use sklearn's GridSearchCV with parallel processing in the same way we did for GBM

Before proceeding further, let's define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as it is and use it later for your own models.

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
            nfold=cv_folds, metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    # Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)

    # Plot feature importances:
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts and not coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it. Note that xgboost's sklearn wrapper doesn't have a feature_importances attribute, but it has a get_fscore() function which does the same job.

General Approach for Parameter Tuning


We will use an approach similar to that of GBM here. The various steps to be performed are:
1. Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called cv which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the decided learning rate and number of trees. Note that we can choose different parameters to define a tree, and I'll take up an example here.
3. Tune the regularization parameters (lambda, alpha) for xgboost, which can help reduce model complexity and enhance performance.
4. Lower the learning rate and decide the optimal parameters.

Let us look at a more detailed step-by-step approach.

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters
In order to decide on the boosting parameters, we need to set some initial values of the other parameters. Let's take the following values:
1. max_depth = 5: This should be between 3-10. I've started with 5, but you can choose a different number as well. 4-6 can be good starting points.
2. min_child_weight = 1: A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.
3. gamma = 0: A smaller value like 0.1-0.2 can also be chosen for starting. This will anyway be tuned later.
4. subsample, colsample_bytree = 0.8: This is a commonly used start value. Typical values range between 0.5-0.9.
5. scale_pos_weight = 1: Because of the high class imbalance.

Please note that all the above are just initial estimates and will be tuned later. Let's take the default learning rate of 0.1 here and check the optimum number of trees using the cv function of xgboost. The function defined above will do it for us.

# Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)


As you can see, here we got 140 as the optimal number of estimators for a 0.1 learning rate. Note that this value might be too high for you depending on the power of your system. In that case you can increase the learning rate and re-run the command to get a reduced number of estimators.
Note: You will see the test AUC as 'AUC Score (Test)' in the outputs here, but this would not appear if you try to run the command on your system, as the data is not made public. It's provided here just for reference. The part of the code which generates this output has been removed here.
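For instance, if 140 trees at a 0.1 learning rate is too heavy for your machine, a quick variant of the step above is to raise the learning rate and re-run modelfit. The name xgb1b and the value 0.3 below are purely illustrative:

# Same setup as xgb1, but with a higher learning rate so fewer trees are needed
xgb1b = XGBClassifier(
    learning_rate=0.3,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1b, train, predictors)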


Step 2: Tune max_depth and min_child_weight
We tune these first as they will have the highest impact on the model outcome. To start with, let's set wider ranges and then we will perform another iteration for smaller ranges.
Important Note: I'll be doing some heavy-duty grid searches in this section which can take 15-30 mins or even more time to run depending on your system. You can vary the number of values you are testing based on what your system can handle.

param_test1 = {
    'max_depth': range(3,10,2),
    'min_child_weight': range(1,6,2)
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
    min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_


Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Let's go one step deeper and look for optimum values. We'll search for values 1 above and 1 below the optimum values because we took an interval of two.

param_test2 = {
    'max_depth': [4,5,6],
    'min_child_weight': [4,5,6]
}
gsearch2 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
    min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values as 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as the model performance increases, it becomes exponentially more difficult to achieve even marginal gains in performance. You would have noticed that here we got 6 as the optimum value for min_child_weight, but we haven't tried values higher than 6. We can do that as follows:

param_test2b = {
    'min_child_weight': [6,8,10,12]
}
gsearch2b = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
    min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2b.fit(train[predictors], train[target])

modelfit(gsearch2b.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.

Step 3: Tune gamma
Now let's tune the gamma value using the parameters already tuned above. Gamma can take various values, but I'll check for 5 values here. You can go into more precise values if needed.

param_test3 = {
    'gamma': [i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
    min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch3.fit(train[predictors], train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

This shows that our original value of gamma, i.e. 0, is the optimum one. Before proceeding, a good idea would be to re-calibrate the number of boosting rounds for the updated parameters.

xgb2 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=4,
    min_child_weight=6,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb2, train, predictors)


Here, we can see the improvement in score. So the final parameters are:
max_depth: 4
min_child_weight: 6
gamma: 0

Step 4: Tune subsample and colsample_bytree
The next step would be to try different subsample and colsample_bytree values. Let's do this in 2 stages and take values 0.6, 0.7, 0.8, 0.9 for both to start with.

param_test4 = {
    'subsample': [i/10.0 for i in range(6,10)],
    'colsample_bytree': [i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
    min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch4.fit(train[predictors], train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

Here, we found 0.8 as the optimum value for both subsample and colsample_bytree. Now we should try values in 0.05 intervals around these.

param_test5 = {
    'subsample': [i/100.0 for i in range(75,90,5)],
    'colsample_bytree': [i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
    min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch5.fit(train[predictors], train[target])

Again we got the same values as before. Thus the optimum values are:
subsample: 0.8
colsample_bytree: 0.8

Step 5: Tuning Regularization Parameters
The next step is to apply regularization to reduce overfitting. Many people don't use these parameters much, as gamma provides a substantial way of controlling complexity, but we should always try them. I'll tune the reg_alpha value here and leave it up to you to try different values of reg_lambda.

param_test6 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
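Following the same pattern as the earlier grid searches, the corresponding search for reg_alpha might look like the sketch below; the estimator settings simply mirror gsearch5 and are an assumption here:

gsearch6 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=177, max_depth=4,
    min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test6, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch6.fit(train[predictors], train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_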
