
Didacticiel - Études de cas (Tutorial - Case Studies)    R.R.

1 Theme
Data Mining with R: the rattle package.

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of recent years. Its popularity is well justified (see the KDnuggets poll "Data Mining/Analytic Tools Used", 2011). Among the reasons for this success, two characteristics stand out: (1) we can extend the features of the tool almost indefinitely with packages; (2) we have a programming language which makes it easy to perform sequences of complex operations.

But this second property can also be a drawback. Some users do not want to learn a new programming language before being able to carry out projects. For this reason, tools which define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) remain a valuable alternative for data miners.

In this tutorial, we present the "Rattle" package, which allows data miners to use R without knowing the associated programming language. All the operations are performed with simple clicks, as in any menu-driven software. In addition, all the commands are stored. We can save them in a file and then, in a new working session, easily repeat all the operations. We thus recover one of the important properties that menu-driven tools lack.

To describe the use of the rattle package, we perform an analysis similar to the one suggested by rattle's author in his presentation paper (G.J. Williams, "Rattle: A Data Mining GUI for R", The R Journal, volume 1/2, pages 45-55, December 2009, http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf). We perform the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; assessing the models on the test sample (confusion matrix, error rate, some curves).

2 Dataset
We use the heart [1] data file. We want to explain the occurrence of the DISEASE from the characteristics of patients. We show here the first instances of the dataset.

[1] http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/heart_for_rattle.txt; a description of this data file is available on the following website: http://archive.ics.uci.edu/ml/datasets/Heart+Disease

26 August 2011

3 Data Mining with Rattle


3.1 Loading the rattle package
First, we load the rattle package [library()]. Then, we start the GUI with the command rattle().
> #loading the package
> library(rattle)
> #launching the GUI
> rattle()

In the R console, we have:


From now on, we perform all the operations by clicking on the appropriate menu or button. All these operations are recorded as R commands by rattle. The rattle GUI is displayed.


The use of rattle always follows the same pattern: we define the command by working in the appropriate tab (Data: load the dataset; Explore: some descriptive statistics; Test: some statistical tests; etc.); then we launch the calculations by clicking on the EXECUTER (Execute) button in the toolbar.

3.2 Importing the data file
In the Data tab, we click on the FILENAME button. We select the heart_for_rattle.txt data file.


We specify the column separator: SEPARATOR = \t. Then we click on EXECUTER.


The dataset is loaded. The variable type (discrete or continuous) is automatically detected from the distinct values in each column. We can define the TARGET attribute and the INPUT ones. Finally, we specify the sizes of the training (70% of instances, drawn randomly) and test (30%) samples.
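For readers who want to see what these clicks amount to, here is a minimal sketch in plain R of loading a tab-separated file and drawing a 70% / 30% random partition. It is not the code that rattle generates. The rows written to the temporary file are invented for illustration (only the column names AGE, SEX, CHOL and DISEASE come from this tutorial), and the level names and the seed are assumptions.

```r
# Synthetic stand-in for a tab-separated file such as heart_for_rattle.txt.
# The rows below are invented for illustration; they are NOT the real data.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "AGE\tSEX\tCHOL\tDISEASE",
  "63\tMALE\t233\tnegative",
  "67\tMALE\t286\tpositive",
  "41\tFEMALE\t204\tnegative",
  "56\tMALE\t236\tnegative",
  "62\tFEMALE\t268\tpositive",
  "57\tFEMALE\t354\tnegative",
  "53\tMALE\t203\tpositive",
  "44\tMALE\t263\tnegative",
  "52\tMALE\t199\tnegative",
  "48\tFEMALE\t275\tpositive"
), path)

# Read the file with a tab separator; text columns become factors (discrete).
heart <- read.delim(path, sep = "\t", stringsAsFactors = TRUE)
str(heart)

# 70% / 30% random partition of the rows.
set.seed(42)                          # arbitrary seed, for reproducibility only
n <- nrow(heart)
train_idx <- sample(n, size = round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

train <- heart[train_idx, ]
test  <- heart[test_idx, ]
```

With the real file, `path` would simply point to the downloaded heart_for_rattle.txt.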

3.3 Dataset description


In the Explore tab, we obtain some descriptive statistics about the variables (SUMMARY / SUMMARY option). For the discrete variables, rattle lists the values (levels). For the continuous ones, we have the min, max, mean and quartiles. All the indicators are computed on the learning sample.


With the SUMMARY / DESCRIBE option, we obtain a more detailed description. Among other things, for the continuous variables, the indicators are useful for detecting unusual values (outliers).

Still in the Explore tab, with the DISTRIBUTIONS option, we obtain some graphical representations of the distributions. We have for instance the conditional box plots of AGE and CHOL according to the values of DISEASE.
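A conditional box plot of this kind can be sketched in base R with the formula interface of boxplot(). The data frame below is randomly generated for illustration only; the group means, the sample sizes and the level names "negative" and "positive" are assumptions, not values from the heart dataset.

```r
# Randomly generated illustration data (NOT the heart dataset).
set.seed(1)
d <- data.frame(
  AGE     = round(c(rnorm(20, mean = 52, sd = 9), rnorm(20, mean = 57, sd = 8))),
  DISEASE = factor(rep(c("negative", "positive"), each = 20))
)

# One box per level of DISEASE; the return value holds the plotted statistics.
bp <- boxplot(AGE ~ DISEASE, data = d,
              main = "AGE according to DISEASE", ylab = "AGE")

# Rows: lower whisker, first quartile, median, third quartile, upper whisker.
bp$stats
```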



We can also obtain the conditional distribution functions.


For the discrete variables, we can obtain the mosaic plot of each variable, again according to the values of the target attribute.

For instance, for SEX, men (MALE) are more numerous than women (FEMALE) in the sample, and the proportion of disease is higher among men.
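The numbers behind a mosaic plot are those of a contingency table. The sketch below, in base R, builds such a table, computes the proportion of disease within each sex, and draws the mosaic. The counts are invented for illustration and are not those of the heart dataset; the level names of DISEASE are an assumption.

```r
# Invented counts for illustration (NOT the heart dataset).
SEX     <- factor(c(rep("MALE", 12), rep("FEMALE", 6)))
DISEASE <- factor(c(rep("positive", 7), rep("negative", 5),   # the 12 men
                    rep("positive", 2), rep("negative", 4)))  # the 6 women

# Cross-tabulation of the two discrete variables.
tab <- table(SEX, DISEASE)
tab

# Proportion of each DISEASE value within each SEX (rows sum to 1).
row_prop <- prop.table(tab, margin = 1)
row_prop

# Tile areas of the mosaic are proportional to the cell counts.
mosaicplot(tab, main = "DISEASE by SEX", color = TRUE)
```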

We can also obtain the correlations between the continuous input attributes. The correlations are described in a hierarchical structure. This is useful, for instance, for detecting redundant variables.
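One common way to organize correlations hierarchically is to turn them into dissimilarities and apply hierarchical clustering. The sketch below uses 1 - |r| as the dissimilarity and average linkage; both are choices made for this illustration and are not necessarily what rattle does internally. The three variables are simulated, with x2 built to be nearly redundant with x1.

```r
# Simulated variables: x2 is almost a copy of x1, x3 is independent.
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# Pairwise Pearson correlations.
cc <- cor(X)
round(cc, 2)

# Strongly correlated variables get a dissimilarity close to 0.
hc <- hclust(as.dist(1 - abs(cc)), method = "average")

# Redundant variables are joined at the bottom of the dendrogram.
plot(hc, main = "Variable correlation clusters", xlab = "", sub = "")
```

Here x1 and x2 are merged first, which is the visual cue for redundancy.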


3.4 Data transformation

The "Transform" tab is dedicated to variable transformation. Some usual operators are available (e.g. logarithm, rank, etc.).
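As a rough idea of what such operators do, the two transformations named above correspond to the base R functions log() and rank(). The values below are invented for illustration.

```r
# Invented values for illustration.
chol <- c(233, 286, 229, 250, 204)

# Natural logarithm: compresses large values.
log_chol <- log(chol)

# Rank: replaces each value by its position in the sorted order (1 = smallest).
rank_chol <- rank(chol)

data.frame(chol, log_chol = round(log_chol, 3), rank_chol)
```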


3.5 Supervised learning

This step is at the heart of our analysis. We select the "Model" tab. We want to evaluate three methods: decision tree induction, random forest, logistic regression.


For the decision tree, rattle uses the rpart command from the rpart package. We note the default parameters used. We click on the EXECUTER button. We obtain the rules associated with the tree by clicking on the RULES button.


We can also obtain a graphical representation of the tree with the DRAW option.
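A direct call to rpart looks roughly like the sketch below. To keep it self-contained, it uses the kyphosis dataset shipped with the rpart package as a stand-in for the heart data. The control values written out are the documented defaults of rpart.control(); this sketch does not claim that they are exactly the settings shown in the rattle GUI.

```r
library(rpart)  # recommended package, normally installed with R

# Classification tree on the kyphosis data shipped with rpart (stand-in data).
fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  control = rpart.control(minsplit = 20, minbucket = 7, cp = 0.01, maxdepth = 30)
)

# Text description of the splits, one line per node.
print(fit)

# Graphical representation of the tree.
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)
```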


For the random forest approach, rattle uses the randomForest command from the randomForest package. We obtain the following results with the default settings.

The OOB (out-of-bag) error estimate is 16.5%. We will compare this value to the one obtained on the test set below.

For logistic regression, we use the glm() command. It automatically transforms the discrete predictors into dummy variables. We obtain the following results.
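The dummy coding can be observed directly on a small example. In the sketch below, the data are simulated (the generating coefficients, the sample size and the level names are assumptions made for illustration). The factor SEX enters the formula as is, and glm() expands it into a 0/1 column named SEXMALE, with FEMALE as the reference level.

```r
# Simulated data for illustration (NOT the heart dataset).
set.seed(3)
n <- 200
d <- data.frame(
  AGE = round(rnorm(n, mean = 54, sd = 9)),
  SEX = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)
# Invented generating model for the target.
p <- plogis(-4 + 0.05 * d$AGE + 1.0 * (d$SEX == "MALE"))
d$DISEASE <- factor(ifelse(runif(n) < p, "positive", "negative"))

# Logistic regression; the second level of DISEASE ("positive") is modelled.
fit <- glm(DISEASE ~ AGE + SEX, data = d, family = binomial)

# The design matrix shows the dummy variable created for SEX.
head(model.matrix(fit))

coef(fit)
```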



3.6 Measuring the generalization performance

As the last step of our analysis, we want to evaluate the performance of the classifiers on the test sample (30% of the whole dataset).

We activate the Evaluate tab. First, we want to obtain the confusion matrix and the associated error rate. We select the Error Matrix option. For the Data item, we must select the Testing option. Only the models learned in the Model tab are available here.

We click on the EXECUTER menu. We observe that logistic regression is the best here, with a test error rate of 18.18%.
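The confusion matrix and the error rate derived from it can be computed by hand in base R, as in the sketch below. The actual and predicted labels are invented for illustration; they are not the predictions of any model in this tutorial.

```r
# Invented labels for ten test instances (illustration only).
lev       <- c("negative", "positive")
actual    <- factor(c("negative", "negative", "negative", "negative", "negative",
                      "positive", "positive", "positive", "positive", "positive"),
                    levels = lev)
predicted <- factor(c("negative", "negative", "negative", "negative", "positive",
                      "negative", "positive", "positive", "positive", "positive"),
                    levels = lev)

# Confusion matrix: rows are the actual classes, columns the predicted ones.
cm <- table(Actual = actual, Predicted = predicted)
cm

# Error rate: share of instances lying off the diagonal.
error_rate <- 1 - sum(diag(cm)) / sum(cm)
error_rate
```

With two misclassified instances out of ten, the error rate is 0.2.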

We also note that the OOB error rate (16.5%) seems to underestimate the error rate of the random forest (20.45% on the test set). But because the test set is small, and the test error rate is itself only an estimate of the true error rate, we treat this result with great caution.


Actually, the error rate is not a good criterion here. We note that the differences between the methods rest on a single misclassified instance. In our context, it is perhaps more interesting to use the ROC curve, which highlights the ability of the methods to assign higher scores to the positive instances than to the negative ones (see http://data-mining-tutorials.blogspot.com/2008/11/roc-curve-for-classifier-comparison.html or http://data-mining-tutorials.blogspot.com/2008/10/computing-roc-curve.html).

We select the ROC option in rattle.


According to the AUC criterion, the decision tree is clearly the worst of the three classifiers; the other two are similar in terms of performance. This is not surprising: the decision tree is known to be poorly suited to scoring.
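The AUC has a simple interpretation: it is the proportion of (positive, negative) pairs in which the positive instance receives the higher score. A minimal base R sketch of this computation, using the rank-sum (Mann-Whitney) formula, is given below. The function name auc() and the scores are made up for this illustration; this is not the routine rattle calls.

```r
# AUC from scores via the rank-sum formula (hypothetical helper).
# score:       numeric vector, higher means "more likely positive"
# is_positive: logical vector, TRUE for the positive instances
auc <- function(score, is_positive) {
  r     <- rank(score)            # average ranks, so ties count for one half
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented scores for eight instances (illustration only).
score       <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

auc(score, is_positive)
```

Here 12 of the 16 positive/negative pairs are correctly ordered, so the AUC is 0.75. A classifier that ranks every positive above every negative reaches 1, and random scoring gives about 0.5.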

3.7 R commands associated with the treatments


One of the main criticisms of menu-driven software is that once the process is finalized and we close the software, we have no record of the sequence of operations we performed. In the next working session, it is difficult to reproduce them as before: we need either an excellent memory or careful notes of everything we did.

Rattle overcomes this drawback by translating all the operations performed by the user (each corresponding to a click on the EXECUTER menu) into a sequence of R commands. We can view them in the "Log" tab. We can store these commands (and the comments) in a file. In the next working session, it is very easy to perform the same data processing by loading these commands.
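Replaying a saved script in a new session can be done with the base R function source(). The sketch below writes a tiny stand-in script to a temporary file and runs it. The three commands in it are placeholders invented for illustration, not an actual rattle log, and running them in a separate environment is a design choice of this sketch.

```r
# Stand-in for a saved log file; the commands are placeholders, not rattle output.
log_file <- tempfile(fileext = ".R")
writeLines(c(
  "# commands as they might be saved from a previous session",
  "x <- c(2, 4, 6)",
  "m <- mean(x)"
), log_file)

# Replay the commands in a separate environment to keep the workspace clean.
env <- new.env()
source(log_file, local = env)

# Objects created by the script are now available again.
env$m
```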

4 Rattle under Linux (Ubuntu)


Installing the Rattle package under Linux is not easy. We must carefully follow the description available on the website. In case of problems, a troubleshooting procedure is proposed; this is the one that I used (see http://datamining.togaware.com/survivor/Install_GNU_Linux.html).

Once the installation is complete, Rattle works properly under Linux (Ubuntu), as we see below.

5 Conclusion
In this tutorial, we showed that it is possible to use R without knowing its programming language, with the help of the rattle package. This package is rather specialized in data mining methods. For statisticians, there are other packages such as "R Commander".
