
CAP3770 - Lab #4 (Date 04/05/17)

Last Name_________________________ First Name: ______________________

I. Decision Tree Classification: ID3

In this lab, you will use the Id3 classifier, which is the WEKA implementation of the ID3 algorithm.
The ID3 algorithm constructs an unpruned decision tree and can only deal with nominal attributes
without missing values. The algorithm makes use of entropy, which measures the "degree of doubt"
(impurity) in the data, and selects the attribute to split on by comparing information gains. The
following is a quick summary of the algorithm (a worked sketch of the entropy and information-gain
computation follows the list):

1. For each attribute, compute its entropy with respect to the class attribute.
2. Compute and select the attribute (say A) with highest information gain.
3. Divide the data into separate sets according to the values of A.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute gets removed from consideration. The process stops when there
are no attributes left to consider, or when all the data being considered in a subtree have the
same value for the class attribute.
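
Below is a minimal sketch, in plain Java, of the entropy and information-gain computation from
steps 1-2. The counts are the standard ones from the weather.nominal data (9 "yes" / 5 "no"
overall; splitting on outlook gives sunny 2/3, overcast 4/0, rainy 3/2); the class and method
names are illustrative, not part of WEKA.

    public class InfoGainSketch {

        // Entropy of a two-class distribution: -p*log2(p) - q*log2(q).
        static double entropy(double yes, double no) {
            double total = yes + no, e = 0.0;
            for (double c : new double[] {yes, no}) {
                if (c > 0) {
                    double p = c / total;
                    e -= p * (Math.log(p) / Math.log(2));
                }
            }
            return e;
        }

        public static void main(String[] args) {
            double baseline = entropy(9, 5);                   // ~0.940 bits for the class attribute
            double[][] byOutlook = { {2, 3}, {4, 0}, {3, 2} }; // (yes, no) counts per outlook value
            double remainder = 0.0;
            for (double[] s : byOutlook) {
                remainder += ((s[0] + s[1]) / 14.0) * entropy(s[0], s[1]);
            }
            System.out.printf("gain(outlook) = %.3f%n", baseline - remainder); // ~0.247
        }
    }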

Datasets

We demonstrate the utility of the decision tree classifier using the following datasets. You can find them
in the WEKA data folder (in Windows, it should be C:\Program Files\Weka-3-6\data).

1. weather.nominal.arff dataset. The nominal version of the weather example in the textbook.
2. credit-g.arff dataset. This dataset classifies people described by a set of attributes
as good or bad credit risks.

Your Task

1. The first task is a warm-up in which you will apply the Id3 classifier to
the weather.nominal.arff dataset.
2. Load weather.nominal.arff into the WEKA Explorer.
3. In the Classify tab, select the classifier by clicking the Choose button: classifiers → trees → Id3.
4. In the Test options frame, select Percentage split at 66%.
5. Click the Start button to train the model. In the Classifier output, you will find the decision
tree in the Classifier model section. (A programmatic sketch of the same run follows this list.)
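
For reference, here is a minimal sketch of the same run through WEKA's Java API instead of the
Explorer. It assumes a WEKA 3.6 jar on the classpath and the file path shown; both are
assumptions, not part of the lab instructions.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.Id3;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Id3Weather {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/weather.nominal.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);    // "play" is the last attribute

            // Percentage split at 66%: the Explorer shuffles first (default seed 1).
            data.randomize(new java.util.Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            Id3 tree = new Id3();
            tree.buildClassifier(train);
            System.out.println(tree);                        // the textual decision tree

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }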

The second task is an application that predicts whether a loan applicant is a good or bad credit risk
based on 20 attributes, using the credit dataset.

1. First, load the credit-g.arff file into WEKA.


2. When presented with a dataset, it is usually a good idea to visualize it first. In the Visualize tab,
there are scatter plots for every pair of attributes. Click the scatter plot
of checking_status and foreign_worker, and think about this: given two applicants,
one a foreign worker with no checking account and the other a domestic worker with a checking_status
of 300, who will be approved?
3. Since the ID3 algorithm cannot deal with numeric attributes, we need to convert the numeric
attributes to nominal ones. In this case, we use the Discretize filter. In the Preprocess tab, click
the Choose button, then: filters → unsupervised → attribute → Discretize. Click the textbox in the
filter frame and set the bins parameter to 2. Apply the filter and you will find that all numeric
attributes have been discretized into two bins.
4. In the Classify tab, select the classifier by clicking the Choose button: classifiers → trees → Id3.
Use the same test option, Percentage split at 66%.
5. After training the classifier, you can review the resulting model in the Classifier output frame.
Unfortunately, the resulting decision tree is very large (and very deep). Furthermore, WEKA's Id3
implementation is not drawable, so you cannot visualize the tree from the right-click menu. (A
programmatic sketch of this discretize-and-train workflow follows this list.)
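
As with the first task, the same workflow can be scripted against the WEKA Java API; this is a
sketch under the same classpath and file-path assumptions as before:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.Id3;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class Id3Credit {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("data/credit-g.arff");  // assumed path
            raw.setClassIndex(raw.numAttributes() - 1);             // "class" is last

            // Id3 needs nominal attributes, so bin every numeric one into 2 bins.
            Discretize disc = new Discretize();
            disc.setBins(2);
            disc.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, disc);

            data.randomize(new java.util.Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            Id3 tree = new Id3();
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());              // confusion matrix
        }
    }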

Questions

From the above, you should now know how to create decision tree classifiers using the ID3
implementation in WEKA. Now, please answer the following questions in your lab report (you may
need to do more experiments in WEKA to answer them):

1. In the first task, you trained a very simple decision tree. Please attempt to draw the tree
according to the Classifier output and answer the following questions.

a) What is the depth of the tree? ____________________________


b) How many leaf nodes are there in the tree?____________________________
c) How many tree nodes?___________________________________
2. In the second task, according to the scatter plot of checking_status and foreign_worker in
the Visualize tab, consider the following two applicants:

a) A foreign worker who has no checking account.


b) A domestic worker who has a checking_status of 300.

If only one of these two applicants can be approved, which one would you choose? ________

3. How many numeric attributes are there in the credit dataset? ____________________
4. Can we train the decision tree just by removing these numeric attributes instead of discretizing
them?_____________________________ Explain your conclusion by comparing the
performance (error rates) of the two approaches in WEKA. (A sketch of the removal step is given below.)
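
For the removal approach in Question 4, one possible starting point is WEKA's RemoveType filter.
The sketch below is illustrative only: the filter and its -T option are from the standard WEKA
distribution, but verify them against your version before relying on them.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.RemoveType;

    public class DropNumeric {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("data/credit-g.arff");  // assumed path
            raw.setClassIndex(raw.numAttributes() - 1);

            RemoveType rm = new RemoveType();
            rm.setOptions(new String[] {"-T", "numeric"});  // delete all numeric attributes
            rm.setInputFormat(raw);
            Instances nominalOnly = Filter.useFilter(raw, rm);

            System.out.println(nominalOnly.numAttributes() + " attributes remain");
            // Train and evaluate Id3 on nominalOnly exactly as in the discretized
            // case, then compare the two error rates.
        }
    }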

II. Understanding the Confusion Matrix and Interpreting Decision Tree Output in WEKA

The WEKA results output includes the following quantities:


TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive
WEKA confusion matrix if a is taken to be the positive class (e.g. disease):

             a    b    <-- classified as
  actual a   TP   FN
  actual b   FP   TN

WEKA confusion matrix if a is taken to be the negative class (e.g. no disease):

             a    b    <-- classified as
  actual a   TN   FP
  actual b   FN   TP

Recall is the TP rate (also referred to as sensitivity): what fraction of those that are actually
positive were predicted positive? Recall = TP / (TP + FN).

Precision: what fraction of those predicted positive are actually positive? Precision = TP / (TP + FP).
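
A tiny worked example of these definitions; the counts here are made up purely for illustration:

    public class PrecisionRecall {
        public static void main(String[] args) {
            double tp = 40, fn = 10, fp = 20, tn = 30;          // hypothetical counts
            double recall    = tp / (tp + fn);                  // 40/50 = 0.80
            double precision = tp / (tp + fp);                  // 40/60 ≈ 0.67
            double accuracy  = (tp + tn) / (tp + fn + fp + tn); // 70/100 = 0.70
            System.out.printf("recall=%.2f precision=%.2f accuracy=%.2f%n",
                              recall, precision, accuracy);
        }
    }
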
Generating Decision Trees with J48

1. Open credit-g.arff from the Explorer interface. You may also want to open this file in a text editor.
2. Run a decision tree classifier over the data by selecting classifiers → trees → J48 under the Classify
tab.
3. Set a confidenceFactor of 0.2 in the options dialog.
4. Use a test percentage split of 90%.

Observe the output of the classifier. The full decision tree is output for your perusal; you may need to
scroll up for this. The tree may also be viewed in graphical form by right-clicking the run in the Result
list at the bottom-left and selecting Visualize tree, although it may be very cluttered for large trees.
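
For completeness, a sketch of the same J48 run via the Java API (same classpath and file-path
assumptions as the earlier sketches). Note that J48, unlike Id3, handles numeric attributes
directly, so no Discretize step is needed:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Credit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/credit-g.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new java.util.Random(1));

            int trainSize = (int) Math.round(data.numInstances() * 0.90); // 90% split
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();
            tree.setConfidenceFactor(0.2f);              // the confidenceFactor option
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toMatrixString());   // confusion matrix
            System.out.println(eval.toSummaryString());
        }
    }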

a) How would you assess the performance of the classifier? Hint: check the numbers of good and
bad cases in the test sample (e.g. using the confusion matrix).
b) What is the effect of the confidenceFactor option? Try increasing or decreasing the value of
this option and observe the results.
