
DECISION TREE

As a marketing manager, you want a set of customers who are most likely to purchase your product. This is how you can save your marketing budget by finding your audience. As a loan manager, you need to identify risky loan applications to achieve a lower loan default rate. This process of classifying customers into potential and non-potential customers, or loan applications into safe and risky ones, is known as a classification problem. Classification is a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.

Decision Tree Algorithm

A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The algorithm learns to partition the data on the basis of attribute values, and it partitions the tree recursively, in a process called recursive partitioning. This flowchart-like structure helps you in decision making. Its visualization, like a flowchart diagram, easily mimics human-level thinking. That is why decision trees are easy to understand and interpret.

How does the Decision Tree algorithm work?

The basic idea behind any decision tree algorithm is as follows (a code sketch follows the list):

1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.

2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the following conditions matches:

a. All the tuples belong to the same attribute value.

b. There are no more remaining attributes.

c. There are no more instances.
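
To make the recursion concrete, here is a minimal self-contained sketch of this loop in plain Python. It is illustrative only, not the CART implementation used later: the dict-based node representation and the toy 'outlook' data are made up for the example, and Gini impurity (introduced below) stands in as the ASM.

from collections import Counter

def gini(labels):
    #Gini impurity: 1 minus the sum of squared class probabilities
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def build_tree(rows, labels, attributes):
    #Stopping conditions a-c from the list above
    if not rows or not attributes or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0] if labels else None
    #1. Pick the attribute whose split has the lowest weighted impurity
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attributes, key=weighted_gini)
    #2. Make it a decision node and break the dataset into subsets
    node = {'attribute': best, 'children': {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        #3. Recurse on each child, removing the attribute just used
        node['children'][value] = build_tree(sub_rows, sub_labels,
                                             attributes - {best})
    return node

#Toy data: three records with a single categorical attribute
rows = [{'outlook': 'sunny'}, {'outlook': 'sunny'}, {'outlook': 'rain'}]
print(build_tree(rows, [0, 0, 1], {'outlook'}))
#-> {'attribute': 'outlook', 'children': {'sunny': 0, 'rain': 1}}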

An attribute selection measure is a method for selecting the splitting criterion that partitions the data in the best possible manner. It is also known as a splitting rule because it helps us determine breakpoints for tuples on a given node. An ASM provides a rank to each feature (or attribute) by explaining the given dataset. The attribute with the best score will be selected as the splitting attribute.

The most popular selection measures are:

1. Information Gain,
2. Gain Ratio, and
3. Gini Index.
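
Of these, Information Gain can be illustrated briefly before we turn to the Gini Index: it measures the drop in entropy from a parent node to its children. Below is a minimal self-contained sketch; the class counts are made up for the example.

import math

def entropy(labels):
    #-sum(p_i * log2(p_i)) over the classes present in the node
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    #Parent entropy minus the weighted average entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

#Hypothetical split of 10 tuples into two child nodes
parent = [1] * 6 + [0] * 4
children = [[1] * 5 + [0], [1] + [0] * 3]
print(information_gain(parent, children))  #about 0.256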

GINI INDEX

Decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to
create split points.

The Gini Index is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions.

Gini = 1 - Sum(Pi**2), where Pi is the probability that a tuple belongs to class i.

The Gini index is used in the classic CART algorithm and is very easy to calculate.
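
As a quick worked illustration of the formula (the class counts below are made up for the example):

def gini(labels):
    #1 minus the sum of squared class probabilities
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

#A node holding 6 tuples of class 1 and 4 of class 0:
#Gini = 1 - (0.6**2 + 0.4**2) = 0.48
print(gini([1] * 6 + [0] * 4))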

Decision Tree Classifier Building in Scikit-learn

STEP 1: Importing Required Libraries

In [18]: #importing packages


import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import tree
from sklearn.metrics import roc_auc_score

STEP 2: Loading Data

In [19]: #Load data


data = pd.read_csv('binary11.csv', header=0)
data.shape
data.head()

Out[19]:
   admit  gre   gpa rank
0      0  380  3.61    3
1      1  660  3.67    3
2      1  800  4.00    1
3      1  640  3.19    4
4      0  520  2.93    4

STEP 3: Converting Categorical Variables

In [20]: #Convert categorical variables with numeric values to str type


data['rank'] = data['rank'].astype(str)

#Declare the dependent variable


dep = 'admit'

#Get all categorical variables and create dummies


obj = data.dtypes == object
obj[dep] = False
dummydf = pd.DataFrame()

for i in data.columns[obj]:
    dummy = pd.get_dummies(data[i], drop_first=True)
    dummydf = pd.concat([dummydf, dummy], axis=1)

In [32]: #Merge the dummies and the dataset


data1 = data
data1 = pd.concat([data1, dummydf], axis=1)
obj1 = data1.dtypes == object
#Create your independent and dependent datasets
X = data1.drop(data1.columns[obj1], axis=1)
X = X.drop([dep], axis=1)
Y = data1[dep]

print(X)
print(Y)
#Prefix "V_" to the column names to avoid the purely numeric column
#names generated while creating dummies
X.columns = 'V_' + X.columns
print("X_col\n", X.columns)

gre gpa 2 3 4
0 380 3.61 0 1 0
1 660 3.67 0 1 0
2 800 4.00 0 0 0
3 640 3.19 0 0 1
4 520 2.93 0 0 1

5 760 3.00 1 0 0
6 560 2.98 0 0 0
7 400 3.08 1 0 0
8 540 3.39 0 1 0
9 700 3.92 1 0 0
10 800 4.00 0 0 1
11 440 3.22 0 0 0
12 760 4.00 0 0 0
13 700 3.08 1 0 0
14 700 4.00 0 0 0
15 480 3.44 0 1 0
16 780 3.87 0 0 1
17 360 2.56 0 1 0
18 800 3.75 1 0 0
19 540 3.81 0 0 0
20 500 3.17 0 1 0
21 660 3.63 1 0 0
22 600 2.82 0 0 1
23 680 3.19 0 0 1
24 760 3.35 1 0 0
25 800 3.66 0 0 0
26 620 3.61 0 0 0
27 520 3.74 0 0 1
28 780 3.22 1 0 0
29 520 3.29 0 0 0
.. ... ... .. .. ..
370 540 3.77 1 0 0
371 680 3.76 0 1 0
372 680 2.42 0 0 0
373 620 3.37 0 0 0
374 560 3.78 1 0 0
375 560 3.49 0 0 1
376 620 3.63 1 0 0
377 800 4.00 1 0 0
378 640 3.12 0 1 0
379 540 2.70 1 0 0
380 700 3.65 1 0 0
381 540 3.49 1 0 0
382 540 3.51 1 0 0

383 660 4.00 0 0 0
384 480 2.62 1 0 0
385 420 3.02 0 0 0
386 740 3.86 1 0 0
387 580 3.36 1 0 0
388 640 3.17 1 0 0
389 640 3.51 1 0 0
390 800 3.05 1 0 0
391 660 3.88 1 0 0
392 600 3.38 0 1 0
393 620 3.75 1 0 0
394 460 3.99 0 1 0
395 620 4.00 1 0 0
396 560 3.04 0 1 0
397 460 2.63 1 0 0
398 700 3.65 1 0 0
399 600 3.89 0 1 0

[400 rows x 5 columns]


0 0
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 0
16 0
17 0
18 0
19 1

20 0
21 1
22 0
23 0
24 1
25 1
26 1
27 1
28 1
29 0
..
370 1
371 1
372 1
373 1
374 0
375 0
376 0
377 1
378 0
379 0
380 0
381 1
382 0
383 0
384 1
385 0
386 1
387 0
388 0
389 0
390 1
391 1
392 1
393 1
394 1
395 0
396 0
397 0

398 0
399 0
Name: admit, Length: 400, dtype: int64
X_col
Index(['V_gre', 'V_gpa', 'V_2', 'V_3', 'V_4'], dtype='object')

STEP 4: Splitting Data


To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.

Let's split the dataset by using the function train_test_split(). You need to pass three parameters: features, target, and test set size.

In [22]: #Split into train and test


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

print('Train Data Size - ', X_train.shape[0], '\n')


print('Test Data Size - ', X_test.shape[0], '\n')

Train Data Size - 320

Test Data Size - 80

STEP 5: Building Decision Tree Model

Let's create a Decision Tree Model using Scikit-learn.

In [23]: #Run CART algorithm


#Since post-pruning is not available to us here, we do a grid search on
#the max_depth parameter to find the best depth, i.e. the one where the
#AUC is maximum.
#In general pruning controls the tree growth, so we can approximate it
#by checking different values of max_depth.
modCART = DecisionTreeClassifier()
param_grid = {'max_depth': np.arange(3, 10)}
gridS = GridSearchCV(modCART, param_grid)
gridS.fit(X_train, Y_train)
tree_preds = gridS.predict_proba(X_test)[:, 1]
tree_performance = roc_auc_score(Y_test, tree_preds)

print('DecisionTree: Area under the ROC curve = {}'.format(tree_performance))

DecisionTree: Area under the ROC curve = 0.5445156695156695

In order to make our model robust, we need to determine the maximum depth to which the tree should be grown. For this we create a parameter grid, param_grid, with max_depth values in the range 3-9. It is then passed to GridSearchCV, which fits a tree for each candidate depth and records its cross-validated score.

The best_params_ attribute tells us the maximum depth with the best score, which can then be used to fit and finalize the model as below.

In [24]: #The best depth which gives maximum accuracy


gridS.best_params_

Out[24]: {'max_depth': 6}

In [25]: #View all the scores for different depth combinations


gridS.grid_scores_

C:\Users\Shivani\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[25]: [mean: 0.67500, std: 0.01497, params: {'max_depth': 3},
 mean: 0.67500, std: 0.01555, params: {'max_depth': 4},
 mean: 0.66250, std: 0.00778, params: {'max_depth': 5},
 mean: 0.68125, std: 0.01893, params: {'max_depth': 6},
 mean: 0.64687, std: 0.01849, params: {'max_depth': 7},
 mean: 0.63750, std: 0.02564, params: {'max_depth': 8},
 mean: 0.63125, std: 0.03226, params: {'max_depth': 9}]
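
As the warning says, grid_scores_ is deprecated in favor of cv_results_. The same per-depth summary can be pulled from it; a minimal sketch, using the fitted gridS from above:

#Mean cross-validated score for each candidate depth
for depth, score in zip(gridS.cv_results_['param_max_depth'],
                        gridS.cv_results_['mean_test_score']):
    print('max_depth =', depth, '-> mean score =', score)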

In [26]: #We finalise the model on max_depth of 6


modCART = DecisionTreeClassifier(max_depth=6)
modCART.fit(X_train, Y_train)

Out[26]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

STEP 6: Evaluating Model

Let's estimate how accurately the classifier can predict whether or not an applicant is admitted.

Performance can be measured by comparing the actual test-set values with the predicted values; here we again use the area under the ROC curve.

In [29]: tree_preds = modCART.predict_proba(X_test)[:, 1]


tree_performance = roc_auc_score(Y_test, tree_preds)
print(tree_performance)

0.5445156695156695

We obtained an AUC of 0.5445 on the test set, which is only slightly better than random guessing (an AUC of 0.5). We can try to improve this performance by tuning the parameters of the Decision Tree algorithm.
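
For a complementary view, here is a minimal sketch of computing plain classification accuracy from hard class predictions (assuming modCART, X_test, and Y_test from the previous steps are still in scope):

from sklearn.metrics import accuracy_score

#Hard 0/1 class predictions rather than probabilities
Y_pred = modCART.predict(X_test)
print('Accuracy - ', accuracy_score(Y_test, Y_pred))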

STEP 7: Generating Tree

In [30]: from sklearn.externals.six import StringIO


from IPython.display import Image
from sklearn.tree import export_graphviz
%matplotlib inline
import pydotplus

dot_data = StringIO() #in-memory buffer for the dot output
export_graphviz(modCART, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
#Image(graph.create_png())

#graph.write_pdf("tree.pdf")

In [31]: with open("TREE1.txt", "w") as f:


    tree.export_graphviz(modCART, out_file=f)

You can paste the contents of TREE1.txt into http://www.webgraphviz.com/ to render the tree.
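
As an alternative to the website, the graph object from the previous cell can be rendered locally; a minimal sketch, assuming Graphviz is installed on the machine:

#Render the dot data to a PNG file and display it in the notebook
graph.write_png("tree.png")
Image(filename="tree.png")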
