In this lab, you will use the ID3 classifier, which is the WEKA implementation of the ID3 algorithm. ID3
constructs an unpruned decision tree and handles only nominal attributes without missing values. The
algorithm makes use of entropy, which measures the "degree of doubt" (uncertainty) in the data; it then
selects the attribute to split on by comparing information gains. The following is a quick summary of
the algorithm:
1. For each attribute, compute its entropy with respect to the class attribute.
2. Compute and select the attribute (say A) with highest information gain.
3. Divide the data into separate sets according to the values of A.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute gets removed from consideration. The process stops when there
are no attributes left to consider, or when all the data being considered in a subtree have the
same value for the class attribute.
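The entropy and information-gain computations in steps 1 and 2 can be sketched in a few lines of Python (a minimal illustration; the outlook and play columns below are the 14 instances of the weather.nominal.arff data used later in this lab):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (the "degree of doubt")."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain from splitting `labels` on attribute `values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# The outlook attribute and play class from weather.nominal.arff (14 instances):
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(f"entropy(play) = {entropy(play):.3f}")             # 0.940
print(f"gain(outlook) = {info_gain(outlook, play):.3f}")  # 0.247
```

Outlook's gain (0.247) is the highest of the four weather attributes, so ID3 splits on it first and then recurses (step 5) on the sunny, overcast, and rainy subsets.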
Datasets
We demonstrate the utility of the Decision Tree Classifier using the following datasets. You can find them
in the WEKA data folder (in Windows, it should be C:\Program Files\Weka-3-6\data).
1. weather.nominal.arff dataset. The nominal version of the weather example in the textbook.
2. credit-g.arff dataset. This dataset classifies people described by a set of attributes
as good or bad credit risks.
Your Task
1. The first task here is a warm-up where you will apply the ID3 classifier to
the weather.nominal.arff dataset.
2. Load weather.nominal.arff into the WEKA Explorer.
3. In the Classify tab, select the classifier by clicking the Choose button: classifiers > trees > Id3.
4. In the Test options frame, select Percentage split at 66%.
5. Click the Start button to train the model. In the Classifier output, you will find the decision tree
in the Classifier model section.
The second task is an application which predicts whether a loan applicant is a good or bad credit risk
based on 20 attributes, using the credit dataset.
Questions
From the above, you know how to create decision tree classifiers using the ID3 implementation in
WEKA. Now, please answer the following questions in your lab report (you may need to do more
experiments in WEKA to answer them):
1. In the first task, you will train a very simple decision tree. Please attempt to draw the tree
according to the Classifier output and answer the following questions.
If only one of these two applicants can be approved, which one would you choose? ________
3. How many numeric attributes are there in the credit dataset? ____________________
4. Can we train the decision tree just by removing these numeric attributes instead of discretizing
them?_____________________________ Explain your conclusion by comparing the
performance (error rates) of these two methods in WEKA.
II. Understanding the Confusion Matrix and Interpreting the Output of a Decision Tree using WEKA
Recall is the TP rate (also referred to as sensitivity): what fraction of those that are actually positive
were predicted positive? Recall = TP / actual positives.
Precision is TP / predicted positives: what fraction of those predicted positive are actually positive?
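These two definitions can be checked with a short Python sketch; the cell counts below are hypothetical examples, not taken from an actual WEKA run:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix cell counts."""
    precision = tp / (tp + fp)  # of those predicted positive, how many are correct
    recall = tp / (tp + fn)     # of those actually positive, how many were found
    return precision, recall

# Hypothetical counts: 60 true positives, 20 false positives, 15 false negatives
p, r = precision_recall(tp=60, fp=20, fn=15)
print(p, r)  # 0.75 0.8
```

Note the asymmetry: false positives hurt precision, false negatives hurt recall, which is why both numbers are needed to read a confusion matrix.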
Generating Decision Trees with J48
1. Open credit-g.arff from the Explorer interface. You may also want to open this file in a text editor.
2. Run a decision tree classifier over the data by selecting classifiers > trees > J48 under the Classify
tab.
3. Set a confidenceFactor of 0.2 in the options dialog.
4. Use a test percentage split of 90%.
Observe the output of the classifier. The full decision tree is output for your perusal; you may need to
scroll up for this. The tree may also be viewed in graphical form by right-clicking the run in the Result
list at the bottom-left and selecting Visualize tree, although it may be very cluttered for large trees.
a) How would you assess the performance of the classifier? Hint: check the number of good and
bad cases in the test sample (e.g. using the confusion matrix)
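One way to follow the hint in (a) is to compare the classifier's accuracy against the majority-class baseline computed from the test sample. A minimal sketch with a hypothetical confusion matrix (rows = actual, columns = predicted; the numbers are invented, not from an actual run):

```python
# Hypothetical 2x2 confusion matrix for good/bad credit:
cm = [[65, 5],    # actual good: 65 predicted good, 5 predicted bad
      [20, 10]]   # actual bad:  20 predicted good, 10 predicted bad

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total        # fraction correctly classified
baseline = max(sum(row) for row in cm) / total  # always predict the largest class

print(accuracy, baseline)  # 0.75 0.7
```

A classifier whose accuracy barely exceeds the baseline is adding little beyond the class frequencies; this check matters for credit-g, where 700 of the 1000 instances are good.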
b) What is the effect of the confidenceFactor option? Try increasing or decreasing the value of
this option and observe the results.