LABORATORY MANUAL
on
DATA MINING
Data Mining Lab DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SRI KOTTAM TULASI REDDY MEMORIAL COLLEGE OF ENGINEERING (Affiliated to JNTU, Hyderabad, Approved by AICTE, Accredited by NBA) KONDAIR, MAHABOOBNAGAR (Dist), AP - 509 !5
1) INTRODUCTION ON WEKA
• WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.
• WEKA is an open source application that is freely available under the GNU General Public License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform. It is user friendly, with a graphical interface that allows for quick setup and operation. WEKA operates on the premise that the user data is available as a flat file or relation. This means that each data object is described by a fixed number of attributes that usually are of a specific type, normally alphanumeric or numeric values. The WEKA application gives novice users a tool to identify hidden information in database and file systems with simple-to-use options and visual interfaces.
• The WEKA workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality.
• The original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (WEKA 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research.

2) ADVANTAGES OF WEKA
• The obvious advantage of a package like WEKA is that a whole range of data preparation, feature selection and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use.
• Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
• A comprehensive collection of data preprocessing and modeling techniques.
• Ease of use due to its graphical user interfaces.
• WEKA supports several standard data mining tasks, more specifically data preprocessing, clustering, classification, regression, visualization, and feature selection.
• All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported).
• WEKA provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query.
• It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using WEKA. Another important area that is not covered is sequence modeling.
• Attribute-Relation File Format (ARFF) is the text file format used by WEKA to store datasets.
• The ARFF file contains two sections: the header and the data section. The first line of the header gives the relation name (@relation).
• Then there is the list of the attributes (@attribute ...). Each attribute is associated with a unique name and a type.
• The type describes the kind of data contained in the variable and what values it can have. The variable types are: numeric, nominal, string and date.
• The class attribute is by default the last one in the list. The header section can also contain comment lines, identified by a '%' at the beginning, which can describe the database content or give the reader information about the author. After that comes the data itself (@data); each line stores the attribute values of a single entry, separated by commas.
• WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of WEKA's machine learning algorithms on a collection of datasets.

Launching WEKA
The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the window are four buttons:
1. Simple CLI. Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
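The ARFF layout described in the introduction (comment lines, a relation name, a list of attributes, then comma-separated data rows) can be illustrated with a small reader. This is only a minimal sketch: the toy relation, its attribute names and the function name parse_arff are assumptions made for illustration, and the reader handles only simple, unquoted files rather than the full ARFF format.

```python
# Minimal ARFF reader (illustrative only; no quoting, sparse data or dates).
SAMPLE_ARFF = """\
% Hypothetical toy relation, not taken from the manual
@relation toy_credit
@attribute duration numeric
@attribute housing {rent, own, free}
@attribute class {good, bad}
@data
12,own,good
48,rent,bad
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type spec)], data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank and comment lines
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
class_attribute = attributes[-1][0]  # by default the last attribute is the class
print(relation, class_attribute, len(rows))
```

Taking the last attribute as the class mirrors WEKA's default; in the Explorer a different class attribute can be chosen from the drop-down list.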
If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause. This User Manual focuses on using the Explorer but does not explain the individual data preprocessing tools and learning algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the book Data Mining (Witten and Frank, 2005).

The WEKA Explorer

Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a data set before starting to explore the data. The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the WEKA bird) stays visible regardless of which section you are in.
Classification

Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data.

Further testing options can be set by clicking on the More options... button:
1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!
7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.
8. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

The Class Attribute
The classifiers in WEKA are designed to be trained to predict a single 'class' attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both. By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier
Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button. When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

The Classifier Output Text
The text in the Classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:
1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.
3. The results of the chosen test mode are broken down thus:
4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

The Result List
After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:
1. View in main window. Shows the output in the main window (just like left-clicking the entry).
2. View in separate window. Opens a new independent window for viewing the results.
3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.
4. Load model. Loads a pre-trained model object from a binary file.
5. Save model. Saves a model object to a binary file. Objects are saved in Java 'serialized object' form.
6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on the data set that has been specified with the Set... button under the Supplied test set option.
7. Visualize classifier errors. Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.
8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier model, if possible (i.e. for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.
9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.
10. Visualize threshold curve. Generates a plot illustrating the tradeoffs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of 'positive' must be greater than 0.5 for the instance to be predicted as 'positive'. The plot can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs false positive rate), and for other types of curves.
11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).
Options are greyed out if they do not apply to the specific set of results.
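The Confusion Matrix and Detailed Accuracy By Class sections of the classifier output can be reproduced by hand on a small example. The sketch below follows the convention stated above (rows are actual classes, columns are predicted classes); the six labels and the function names are made up for illustration and are not WEKA output.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at position `index`."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical labels for six instances (illustration only).
labels = ["good", "bad"]
actual = ["good", "good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good", "good"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(precision_recall(matrix, 0))
```

Reading along a row shows what happened to the instances of one actual class; reading down a column shows which instances received one predicted class.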
CREDIT RISK ASSESSMENT
• The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. We have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
• Credit risk is an investor's risk of loss arising from a borrower who does not make payments as promised. Such an event is called a default. Other terms for credit risk are default risk and counterparty risk.
• Credit risk is most simply defined as the potential that a bank borrower or counterparty will fail to meet its obligations in accordance with agreed terms.
• The goal of credit risk management is to maximise a bank's risk-adjusted rate of return by maintaining credit risk exposure within acceptable parameters.
• Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in individual credits or transactions.
• Banks should also consider the relationships between credit risk and other risks.
• The effective management of credit risk is a critical component of a comprehensive approach to risk management and essential to the long-term success of any banking organisation.
• A good credit assessment means you should be able to qualify, within the limits of your income, for most loans.
Lab Experiments

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment Case Study given to us, the following attributes are found to be applicable for Credit-Risk Assessment:

Total Valid Attributes
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker

Categorical or Nominal attributes (which take true/false, etc. values)
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker

Real-valued attributes
1. duration
2. credit amount
3. installment rate
4. residence
5. age
6. existing credits
7. num_dependents
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
According to me, the following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing credit

Based on the above attributes, we can make a decision whether to give credit or not.

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good
checking_status = >=200 AND num_dependents <= 1 AND installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad

3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

A decision tree is a flow-chart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label. Decision trees can be easily converted into classification rules. Well-known decision tree algorithms include ID3, C4.5 and CART.

J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the "Classify" tab.
2. In the Classify tab, select the Choose option, where a list of different decision trees is available. From that list select J48.
3. Now under Test options, select the Use training set option.
4. Click Start to build the classifier.
5. To view the decision tree, right-click on the result list and select the Visualize tree option, by which the decision tree will be displayed.
6. The obtained decision tree for credit risk assessment is too large to fit on the screen.

The decision tree above is unclear due to the large number of attributes.
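Because the full credit tree is too large to read, a toy example may help to show what the structure defined above means in practice: internal nodes test an attribute, branches are outcomes, and leaves hold a class label. The tree below is hypothetical (it is not the J48 model from the lab), and the dictionary layout and the classify helper are illustrative assumptions.

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}
# Leaf node:     a class label string
# Hypothetical two-level tree, not the J48 model reported in the lab.
toy_tree = {
    "attribute": "checking_status",
    "branches": {
        "no checking": "good",
        "<0": {
            "attribute": "housing",
            "branches": {"own": "good", "rent": "bad"},
        },
    },
}


def classify(tree, instance, default="unknown"):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = instance.get(node["attribute"])
        if value not in node["branches"]:
            return default  # no branch for this value in the toy tree
        node = node["branches"][value]
    return node


print(classify(toy_tree, {"checking_status": "<0", "housing": "rent"}))
```

Each root-to-leaf path in such a structure corresponds to one classification rule, which is why trees convert easily into rules.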
4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100 % training accuracy?

In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset. For example:

IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly and the remaining 14.5% were incorrectly classified. We cannot get 100% training accuracy because, out of the 20 attributes, some are unnecessary and are still analyzed and used in training. In addition, the J48 tree is pruned, so it does not fit every training instance exactly. Due to this the accuracy is affected and hence we cannot get 100% training accuracy.
5. Is testing on the training set as you did above a good idea? Why or why not?

Bad idea: if we take all the data into the training set, then how can we test whether the model classifies correctly or not?

As a rule of thumb, we take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But in the above model we have taken the complete dataset as the training set, which gives 85.5% accuracy. This figure is measured on the same instances the tree was built from, including the unnecessary attributes that do not play a crucial role in credit risk assessment, so it is an optimistic estimate and says little about accuracy on new applicants. The extra attributes also increase the complexity of the tree. If some part of the dataset is used as a training set and the remainder as a test set, the results are more reliable and the computation time is lower. This is why we prefer not to take the complete dataset as the training set.
Use Training Set Result for the table GermanCreditData:

Correctly Classified Instances      855     85.5 %
Incorrectly Classified Instances    145     14.5 %
Kappa statistic                     0.6251
Mean absolute error                 0.2312
Root mean squared error             0.34
Relative absolute error             55.0377 %
Root relative squared error         74.2015 %
Total Number of Instances           1000
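The accuracy and kappa figures in the summary above are both derived from the confusion matrix. The manual does not print that matrix, so the one below is an assumption: a 2x2 matrix (rows actual good/bad, columns predicted good/bad) chosen so that 855 of 1000 instances lie on the diagonal. The sketch shows how Cohen's kappa compares the observed agreement with the agreement expected by chance.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    # Chance agreement: product of row total and column total for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# ASSUMED matrix, not given in the manual: 700 actual good, 300 actual bad,
# with 669 + 186 = 855 instances classified correctly.
assumed_matrix = [[669, 31],
                  [114, 186]]

accuracy = (669 + 186) / 1000
kappa = cohen_kappa(assumed_matrix)
print(accuracy, round(kappa, 4))
```

Kappa is lower than the raw accuracy because a classifier that always answered "good" would already be right on most instances of an unbalanced dataset.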
6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe what cross-validation is briefly. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross-validation:
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing is performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration the subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain a first model, which is tested on D1. The second model is trained on the subsets D1, D3, ..., Dk and tested on D2, and so on.
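The partitioning described above can be sketched in a few lines. This is a plain k-fold split written for illustration: the function name and the seed are assumptions, and unlike WEKA's own evaluation it makes no attempt to stratify the folds by class.

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train, test) index lists for each of the k iterations."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)         # random partitioning
    folds = [indices[i::k] for i in range(k)]    # k folds of near-equal size
    for i in range(k):
        test = folds[i]                          # fold Di is held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1000, 10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

With 1000 instances and 10 folds, every instance is used for testing exactly once and for training nine times.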
1. Select the Classify tab and the J48 decision tree, and under Test options select the Cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions into which the set of instances is divided.
3. A kappa statistic nearing 1 indicates accuracy close to 100%, with almost all errors zeroed out, but in reality there is no such training set that gives 100% accuracy.

Cross Validation Result at folds: 10 for the table GermanCreditData:

Correctly Classified Instances      705     70.5 %
Incorrectly Classified Instances    295     29.5 %
Kappa statistic                     0.2467
Mean absolute error                 0.3467
Root mean squared error             0.4796
Relative absolute error             82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances           1000

Here there are 1000 instances with 100 instances per partition.
Percentage split does not allow 100%; it allows only up to 99.9%.

(Screenshot residue: the only legible values are 106.4373 % and 500.)

Percentage Split Result at 99.9%:

Correctly Classified Instances      0       0 %
Incorrectly Classified Instances    1       100 %
Kappa statistic                     0
Mean absolute error                 0.6667
Root mean squared error             0.6667
Relative absolute error             221.7054 %
Root relative squared error         221.7054 %
Total Number of Instances           1
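The effect of the percentage entered in the % field can be checked with a short hold-out sketch. The function name, the seed and the choice to round the training size to the nearest integer are assumptions made for illustration; they are not taken from WEKA's source.

```python
import random


def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle instance indices and hold out the remainder for testing."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    train_size = round(n_instances * train_percent / 100)  # assumed rounding rule
    return indices[:train_size], indices[train_size:]


train, test = percentage_split(1000, 99.9)
print(len(train), len(test))
```

At 99.9% only a single instance is left for testing, so the reported accuracy can only be 0% or 100%. A split such as 66% leaves a test set large enough to give a meaningful estimate.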
7. Check to see if the data shows a bias against 'foreign workers' (attribute 20), or 'personal-status' (attribute 9). One way to do this (perhaps rather simple minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

The accuracy increases because the two attributes "foreign workers" and "personal status" are not very important in training and analyzing. By removing them, the time has been reduced to some extent, and this results in an increase in the accuracy. The decision tree created from the full dataset is very large compared to the decision tree which we have trained now. This is the main difference between these two decision trees.
If we remove the 9th attribute, the accuracy is further increased to 86.6%, which shows that …
8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17, 21 and click on Invert, then Remove, to remove the remaining attributes.
After we remove 14 attributes, the accuracy has decreased to 76.4%; hence we can further try random combinations of attributes to increase the accuracy.
Percentage split
9. Sometimes, the cost of rejecting an applicant who actually has a good credit (Case 1) might be higher than accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from results obtained in problem 6 (using equal cost)?

In Problem 6, we used equal cost and we trained the decision tree. But here, we consider two cases with different cost. Let us take cost 5 in case 1 and cost 2 in case 2. When we give such costs in both cases and train the decision tree, we can observe that it is almost equal to the decision tree obtained in problem 6.

                 Case 1 (cost 5)    Case 2 (cost 2)
Total Cost       3820               1705
Average cost     3.82               1.705

(These totals follow from the 705 correct and 295 incorrect instances of Problem 6: 705 × 5 + 295 × 1 = 3820 and 705 × 2 + 295 × 1 = 1705.)

We don't find this cost factor in problem 6, as there we use equal cost. This is the major difference between the results of problem 6 and problem 9.

The cost matrices we used here:

Case 1:
5 1
1 5

Case 2:
2 1
1 2
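The total and average cost figures can be reproduced by weighting each cell of the confusion matrix by the matching cell of the cost matrix. The confusion matrix below is an assumption, since the manual reports only 705 correct and 295 incorrect instances. For the two matrices used in this answer, any matrix with those diagonal and off-diagonal totals gives the same result. The function name is illustrative.

```python
def total_cost(confusion, cost):
    """Sum of (instances in a cell) x (cost of that cell); rows actual, columns predicted."""
    size = len(confusion)
    return sum(confusion[i][j] * cost[i][j] for i in range(size) for j in range(size))


# ASSUMED cross-validation confusion matrix: 588 + 117 = 705 correct, 295 incorrect.
assumed_confusion = [[588, 112],
                     [183, 117]]

cost_case1 = [[5, 1], [1, 5]]  # matrix listed in the answer for case 1
cost_case2 = [[2, 1], [1, 2]]  # matrix listed in the answer for case 2

for name, cost in (("case 1", cost_case1), ("case 2", cost_case2)):
    total = total_cost(assumed_confusion, cost)
    print(name, total, total / 1000)

# A conventional cost matrix charges nothing for correct predictions:
# rejecting a good applicant costs 5, accepting a bad one costs 1.
conventional = [[0, 5], [1, 0]]
print(total_cost(assumed_confusion, conventional))
```

Note that the matrices in the answer place the larger cost on the diagonal, which charges correct predictions. The conventional layout at the end of the sketch is closer to what the question describes, but its total depends on the assumed split of the 295 errors.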
1. Select the Classify tab.
2. Select More options from Test options.
3. Tick Cost-sensitive evaluation and go to Set.
7. Then the confusion matrix will be generated and you can find out the difference between the good and bad classes.
8. Check whether the accuracy is changing or not.
10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long complex decision trees, the tree contains many unnecessary attributes and fits the training data very closely. Such a tree has low bias but high variance: it tends to overfit, so its accuracy on unseen data can suffer. This problem can be reduced by considering a simple decision tree. It uses fewer attributes, which raises the bias somewhat but lowers the variance, and the result on new data is often more accurate. So it is a good idea to prefer simple decision trees instead of long complex trees.

1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select All to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.
4. To view the decision tree, right-click on the result list and select the Visualize tree option, by which the decision tree will be displayed.
5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this, set the unpruned option to True.
7. Then press OK and then Start. We find that the tree becomes more complex if it is not pruned.

Visualize tree
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - Explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning:
The idea of using a separate pruning set for pruning (which is applicable to decision trees as well as rule sets) is called reduced-error pruning. The variant that prunes a rule immediately after it has been grown is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual tests. However, this method is much slower.

Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to True and leave the unpruned option as False, since J48 does not apply both at once.
3. Then press OK and then Start.
4. We find that the accuracy has been increased by selecting the reduced-error pruning option.
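The success ratio [p + (N - n)] / T from the explanation above is easy to evaluate directly. The sketch below uses the same symbols as the text; the function name and the example counts are hypothetical.

```python
def rule_success_ratio(p, t, P, T):
    """Success ratio [p + (N - n)] / T of a single rule on the pruning set.

    p: instances the rule gets right      t: instances the rule covers
    P: instances of the predicted class   T: all instances
    """
    n = t - p  # negative instances the rule covers (its mistakes)
    N = T - P  # all negative instances
    return (p + (N - n)) / T


# Hypothetical counts: the rule covers 50 instances and gets 40 right;
# 60 of the 100 instances belong to the predicted class.
ratio = rule_success_ratio(p=40, t=50, P=60, T=100)
print(ratio)
```

The numerator counts the positives the rule covers correctly plus the negatives it correctly leaves uncovered, so a rule is rewarded both for what it catches and for what it avoids.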
12. (Extra Credit): How can you convert a Decision Tree into "if-then-else rules"? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules - one such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers which convert decision trees into "IF-THEN-ELSE" rules.

Converting Decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (Weather), the single attribute for making the decision is "outlook":

outlook:
  sunny -> no
  overcast -> yes
  rainy -> yes
(10/14 instances correct)
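How a one-attribute classifier arrives at such a rule can be sketched as follows. The 14 rows are the standard nominal weather data, reproduced here from memory as an assumption (the manual does not list them), and the one_r function is a simplified illustration rather than WEKA's OneR implementation.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumed contents of the nominal weather data; the last column is the class (play).
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, {value: class}, correct count) with the fewest errors."""
    best = None
    for col, name in enumerate(attributes):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[col]][row[-1]] += 1           # class counts per value
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        if best is None or correct > best[2]:          # ties keep the earlier attribute
            best = (name, rule, correct)
    return best


attribute, rule, correct = one_r(WEATHER, ATTRIBUTES)
print(attribute, rule, f"{correct}/{len(WEATHER)}")
```

On this data humidity scores the same number of correct instances as outlook, so the result depends on the tie-breaking rule; keeping the earlier attribute matches the rule reported above.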
With respect to time, the oneR classifier ranks highest, J48 is in 2nd place and PART gets 3rd place.

              J48     PART    oneR
TIME (sec)    0.12    0.14    0.04
RANK          II      III     I

But if you consider the accuracy, the J48 classifier ranks highest, PART gets second place and oneR gets last place.

                J48     PART    oneR
ACCURACY (%)    70.5    70.2    66.8

1. Open the existing weather.nominal.arff file.
2. Select All.
3. Go to Classify.
4. Start.
The tree is something like an "if-then-else" rule:

If outlook = overcast then play = yes

To view the rules:
1. Go to Choose, then click on Rules, then select PART.
2. Click on Save and Start.
3. Similarly for the oneR algorithm.

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = low then play = yes
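Converting a tree into rules amounts to writing out one rule per root-to-leaf path. The sketch below does this for a small tree that mirrors the three rules above (including the humidity value "low" as written there); the dictionary layout and the function name are illustrative assumptions, not WEKA's internal representation.

```python
def tree_to_rules(node, class_name="play", conditions=()):
    """Return one IF ... THEN rule string for every root-to-leaf path."""
    if not isinstance(node, dict):  # leaf: emit the accumulated conditions
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {class_name} = {node}"]
    rules = []
    for value, subtree in node["branches"].items():
        path = conditions + ((node["attribute"], value),)
        rules.extend(tree_to_rules(subtree, class_name, path))
    return rules


# Small tree mirroring the rules listed above.
weather_tree = {
    "attribute": "outlook",
    "branches": {
        "overcast": "yes",
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "low": "yes"},
        },
    },
}

rules = tree_to_rules(weather_tree)
for rule in rules:
    print(rule)
```

Because every instance follows exactly one path, the rules produced this way are mutually exclusive, so no ELSE ordering is needed between them.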