
DECISION TREE

As a marketing manager, you want a set of customers who are most likely to purchase your product. This is how you can save your marketing budget by finding your audience. As a loan manager, you need to identify risky loan applications to achieve a lower loan default rate. This process of classifying customers into potential and non-potential customers, or loan applications into safe and risky ones, is known as a classification problem. Classification is a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.

Decision Tree Algorithm

A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The algorithm learns to partition the data on the basis of attribute values, and it partitions the tree recursively, in a process called recursive partitioning. This flowchart-like structure helps you in decision making. Its visualization, like a flowchart diagram, easily mimics human-level thinking. That is why decision trees are easy to understand and interpret.

How does the Decision Tree algorithm work?

The basic idea behind any decision tree algorithm is as follows (a code sketch follows the list):

1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.

2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the following conditions matches:

a. All the tuples belong to the same attribute value.

b. There are no more remaining attributes.

c. There are no more instances.
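
To make the recursion concrete, here is a minimal self-contained sketch of this loop in plain Python. It is illustrative only, not the CART implementation used later: the dict-based node representation and the toy 'outlook' data are made up for the example, and Gini impurity (introduced below) stands in as the ASM.

from collections import Counter

def gini(labels):
    #Gini impurity: 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def build_tree(rows, labels, attributes):
    #Stopping conditions a-c from the list above
    if not rows or not attributes or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0] if labels else None
    #1. Pick the attribute whose split has the lowest weighted impurity
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attributes, key=weighted_gini)
    #2. Make it a decision node and break the dataset into subsets
    node = {'attribute': best, 'children': {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        #3. Recurse on each child, removing the attribute just used
        node['children'][value] = build_tree(sub_rows, sub_labels,
                                             attributes - {best})
    return node

#Toy data: three records with a single categorical attribute
rows = [{'outlook': 'sunny'}, {'outlook': 'sunny'}, {'outlook': 'rain'}]
print(build_tree(rows, [0, 0, 1], {'outlook'}))
#-> {'attribute': 'outlook', 'children': {'sunny': 0, 'rain': 1}}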

An attribute selection measure is a method for selecting the splitting criterion that partitions the data in the best possible manner. It is also known as a splitting rule because it helps us determine breakpoints for tuples on a given node. An ASM provides a rank to each feature (or attribute) by explaining the given dataset. The attribute with the best score will be selected as the splitting attribute.

The most popular selection measures are:

1. Information Gain,
2. Gain Ratio, and
3. Gini Index.
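
Of these, Information Gain can be illustrated briefly before we turn to the Gini Index: it measures the drop in entropy from a parent node to its children. Below is a minimal self-contained sketch; the class counts are made up for the example.

import math

def entropy(labels):
    #-sum(p_i * log2(p_i)) over the classes present in the node
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    #Parent entropy minus the weighted average entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

#Hypothetical split of 10 tuples into two child nodes
parent = [1] * 6 + [0] * 4
children = [[1] * 5 + [0], [1] + [0] * 3]
print(information_gain(parent, children))  #about 0.256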

GINI INDEX

Decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to
create split points.

The Gini Index is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions.

Gini = 1 - Sum(Pi**2), where Pi is the probability that a tuple belongs to class i.

The Gini index is used in the classic CART algorithm and is very easy to calculate.
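
As a quick worked illustration of the formula (the class counts below are made up for the example):

def gini(labels):
    #1 minus the sum of squared class probabilities
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

#A node holding 6 tuples of class 1 and 4 of class 0:
#Gini = 1 - (0.6**2 + 0.4**2) = 0.48
print(gini([1] * 6 + [0] * 4))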

Decision Tree Classifier Building in Scikit-learn

STEP 1: Importing Required Libraries

In [18]: #importing packages


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import tree
from sklearn.metrics import roc_auc_score

STEP 2: Loading Data

In [19]: #Load data


data = pd.read_csv('binary11.csv', header=0)
data.shape
data.head()

Out[19]:
   admit  gre   gpa rank
0      0  380  3.61    3
1      1  660  3.67    3
2      1  800  4.00    1
3      1  640  3.19    4
4      0  520  2.93    4

STEP 3: Converting Categorical Variables

In [20]: #Convert categorical variables with numeric values to str type


data['rank'] = data['rank'].astype(str)

#Declare the dependent variable


dep = 'admit'

#Get all categorical variables and create dummies


obj = data.dtypes == object
obj[dep] = False
dummydf = pd.DataFrame()

for i in data.columns[obj]:
    dummy = pd.get_dummies(data[i], drop_first=True)
    dummydf = pd.concat([dummydf, dummy], axis=1)

In [32]: #Merge the dummies and the dataset


data1 = data
data1 = pd.concat([data1, dummydf], axis=1)
obj1 = data1.dtypes == object
#Create your independent and dependent datasets
X = data1.drop(data1.columns[obj1], axis=1)
X = X.drop([dep], axis=1)
Y = data1[dep]

print(X)
print(Y)
#Prefix "V_" to the column names to avoid the purely numeric column
#names generated while creating dummies
X.columns = 'V_' + X.columns
print("X_col\n", X.columns)

gre gpa 2 3 4
0 380 3.61 0 1 0
1 660 3.67 0 1 0
2 800 4.00 0 0 0
3 640 3.19 0 0 1
4 520 2.93 0 0 1

5 760 3.00 1 0 0
6 560 2.98 0 0 0
7 400 3.08 1 0 0
8 540 3.39 0 1 0
9 700 3.92 1 0 0
10 800 4.00 0 0 1
11 440 3.22 0 0 0
12 760 4.00 0 0 0
13 700 3.08 1 0 0
14 700 4.00 0 0 0
15 480 3.44 0 1 0
16 780 3.87 0 0 1
17 360 2.56 0 1 0
18 800 3.75 1 0 0
19 540 3.81 0 0 0
20 500 3.17 0 1 0
21 660 3.63 1 0 0
22 600 2.82 0 0 1
23 680 3.19 0 0 1
24 760 3.35 1 0 0
25 800 3.66 0 0 0
26 620 3.61 0 0 0
27 520 3.74 0 0 1
28 780 3.22 1 0 0
29 520 3.29 0 0 0
.. ... ... .. .. ..
370 540 3.77 1 0 0
371 680 3.76 0 1 0
372 680 2.42 0 0 0
373 620 3.37 0 0 0
374 560 3.78 1 0 0
375 560 3.49 0 0 1
376 620 3.63 1 0 0
377 800 4.00 1 0 0
378 640 3.12 0 1 0
379 540 2.70 1 0 0
380 700 3.65 1 0 0
381 540 3.49 1 0 0
382 540 3.51 1 0 0

383 660 4.00 0 0 0
384 480 2.62 1 0 0
385 420 3.02 0 0 0
386 740 3.86 1 0 0
387 580 3.36 1 0 0
388 640 3.17 1 0 0
389 640 3.51 1 0 0
390 800 3.05 1 0 0
391 660 3.88 1 0 0
392 600 3.38 0 1 0
393 620 3.75 1 0 0
394 460 3.99 0 1 0
395 620 4.00 1 0 0
396 560 3.04 0 1 0
397 460 2.63 1 0 0
398 700 3.65 1 0 0
399 600 3.89 0 1 0

[400 rows x 5 columns]


0 0
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 0
16 0
17 0
18 0
19 1

20 0
21 1
22 0
23 0
24 1
25 1
26 1
27 1
28 1
29 0
..
370 1
371 1
372 1
373 1
374 0
375 0
376 0
377 1
378 0
379 0
380 0
381 1
382 0
383 0
384 1
385 0
386 1
387 0
388 0
389 0
390 1
391 1
392 1
393 1
394 1
395 0
396 0
397 0

398 0
399 0
Name: admit, Length: 400, dtype: int64
X_col
Index(['V_gre', 'V_gpa', 'V_2', 'V_3', 'V_4'], dtype='object')

STEP 4: Splitting Data


To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.

Let's split the dataset by using the function train_test_split(). You need to pass three parameters: features, target, and test set size.

In [22]: #Split into train and test


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

print('Train Data Size - ', X_train.shape[0], '\n')


print('Test Data Size - ', X_test.shape[0], '\n')

Train Data Size - 320

Test Data Size - 80

STEP 5: Building Decision Tree Model

Let's create a Decision Tree Model using Scikit-learn.

In [23]: #Run CART algorithm


#Since post-pruning is not available to us here, we do a grid search on
#the max_depth parameter to find the best depth, i.e. the one where the
#AUC is maximum.
#In general pruning controls the tree growth, so we can approximate it
#by checking different values of max_depth.
modCART = DecisionTreeClassifier()
param_grid = {'max_depth': np.arange(3, 10)}
gridS = GridSearchCV(modCART, param_grid)
gridS.fit(X_train, Y_train)
tree_preds = gridS.predict_proba(X_test)[:, 1]
tree_performance = roc_auc_score(Y_test, tree_preds)

print('DecisionTree: Area under the ROC curve = {}'.format(tree_performance))

DecisionTree: Area under the ROC curve = 0.5445156695156695

In order to make our model robust, we need to determine the maximum depth to which the tree should be grown. For this we create a parameter grid, param_grid, with max_depth values in the range 3-9. It is then passed to GridSearchCV, which fits a tree for each candidate depth and records its cross-validated score.

The best_params_ attribute tells us the maximum depth with the best score, which can then be used to fit and finalize the model as below.

In [24]: #The best depth which gives maximum accuracy


gridS.best_params_

Out[24]: {'max_depth': 6}

In [25]: #View all the scores for different depth combinations


gridS.grid_scores_

C:\Users\Shivani\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[25]: [mean: 0.67500, std: 0.01497, params: {'max_depth': 3},
 mean: 0.67500, std: 0.01555, params: {'max_depth': 4},
 mean: 0.66250, std: 0.00778, params: {'max_depth': 5},
 mean: 0.68125, std: 0.01893, params: {'max_depth': 6},
 mean: 0.64687, std: 0.01849, params: {'max_depth': 7},
 mean: 0.63750, std: 0.02564, params: {'max_depth': 8},
 mean: 0.63125, std: 0.03226, params: {'max_depth': 9}]
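
As the warning says, grid_scores_ is deprecated in favor of cv_results_. The same per-depth summary can be pulled from it; a minimal sketch, using the fitted gridS from above:

#Mean cross-validated score for each candidate depth
for depth, score in zip(gridS.cv_results_['param_max_depth'],
                        gridS.cv_results_['mean_test_score']):
    print('max_depth =', depth, '-> mean score =', score)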

In [26]: #We finalise the model on max_depth of 6


modCART = DecisionTreeClassifier(max_depth=6)
modCART.fit(X_train, Y_train)

Out[26]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

STEP 6: Evaluating Model

Let's estimate how accurately the classifier can predict whether or not an applicant is admitted.

Performance can be measured by comparing the actual test-set values with the predicted values; here we again use the area under the ROC curve.

In [29]: tree_preds = modCART.predict_proba(X_test)[:, 1]


tree_performance = roc_auc_score(Y_test, tree_preds)
print(tree_performance)

0.5445156695156695

We obtained an AUC of 0.5445 on the test set, which is only slightly better than random guessing (an AUC of 0.5). We can try to improve this performance by tuning the parameters of the Decision Tree algorithm.
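
For a complementary view, here is a minimal sketch of computing plain classification accuracy from hard class predictions (assuming modCART, X_test, and Y_test from the previous steps are still in scope):

from sklearn.metrics import accuracy_score

#Hard 0/1 class predictions rather than probabilities
Y_pred = modCART.predict(X_test)
print('Accuracy - ', accuracy_score(Y_test, Y_pred))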

STEP 7: Generating Tree

In [30]: from sklearn.externals.six import StringIO


from IPython.display import Image
from sklearn.tree import export_graphviz
%matplotlib inline
import pydotplus

dot_data = StringIO() #in-memory buffer for the dot output
export_graphviz(modCART, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
#Image(graph.create_png())

#graph.write_pdf("tree.pdf")

In [31]: with open("TREE1.txt", "w") as f:


    tree.export_graphviz(modCART, out_file=f)

You can paste the contents of TREE1.txt into http://www.webgraphviz.com/ to render the tree.
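
As an alternative to the website, the graph object from the previous cell can be rendered locally; a minimal sketch, assuming Graphviz is installed on the machine:

#Render the dot data to a PNG file and display it in the notebook
graph.write_png("tree.png")
Image(filename="tree.png")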
