As a marketing manager, you want to find the set of customers who are most likely to purchase your product, so that you can spend your marketing budget on the right audience. As a loan manager, you need to identify risky loan applications to achieve a lower loan default rate. This process of classifying customers into potential and non-potential customers, or loan applications into safe and risky ones, is known as a classification problem. Classification is a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The decision tree is one of the easiest and most popular classification algorithms to understand and interpret.
A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it partitions recursively, a procedure called recursive partitioning. This flowchart-like structure helps you in decision making. Its visualization, like a flowchart diagram, easily mimics human-level thinking. That is why decision trees are easy to understand and interpret.
The basic idea behind any decision tree algorithm is as follows:
1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the following conditions matches: all the tuples belong to the same attribute value, there are no more remaining attributes, or there are no more instances. A minimal sketch of this recursive partitioning is given below.
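To make the recursion concrete, here is a rough sketch of that loop in Python; all_same_class, make_leaf, asm_score, and split_on are hypothetical helpers (the pure-node test, leaf construction, the attribute selection measure, and the per-value split) and are not part of any library.

def build_tree(records, attributes):
    # Stopping conditions: no records left, no attributes left, or a pure node
    if not records or not attributes or all_same_class(records):
        return make_leaf(records)
    # 1. Pick the best attribute with an attribute selection measure (ASM)
    best = max(attributes, key=lambda a: asm_score(records, a))
    # 2. Make it a decision node and break the dataset into smaller subsets
    node = {'attribute': best, 'children': {}}
    for value, subset in split_on(records, best):
        # 3. Repeat the process recursively for each child
        remaining = [a for a in attributes if a != best]
        node['children'][value] = build_tree(subset, remaining)
    return node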
An attribute selection measure is a heuristic for selecting the splitting criterion that partitions the data in the best possible manner. It is also known as a splitting rule because it helps us determine breakpoints for tuples on a given node. An ASM assigns a rank to each feature (or attribute) by how well it explains the given dataset; the attribute with the best score is selected as the splitting attribute. The most popular selection measures are:
1. Information Gain,
2. Gain Ratio, and
3. Gini Index.
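As an illustration of how such a measure ranks an attribute, below is a small sketch of entropy-based information gain using NumPy; the function names are illustrative, not taken from the original notebook.

import numpy as np

def entropy(y):
    # Shannon entropy of an array of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    # Reduction in entropy of the labels y after splitting on feature x
    gain = entropy(y)
    for v in np.unique(x):
        subset = y[x == v]
        gain -= (len(subset) / len(y)) * entropy(subset)
    return gain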
GINI INDEX
The CART (Classification and Regression Trees) decision tree algorithm uses the Gini method to create split points.
The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions.
Gini = 1 - Σ (Pi)^2
The Gini index is used in the classic CART algorithm and is very easy to calculate.
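A quick sketch of this calculation in Python, matching Gini = 1 - Σ (Pi)^2 (the function name is illustrative):

import numpy as np

def gini_index(y):
    # Gini impurity of an array of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array([0, 0, 0, 0])))   # 0.0 for a pure node
print(gini_index(np.array([0, 0, 1, 1])))   # 0.5 for a 50/50 node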
Decision Tree Classifier Building in Scikit-learn
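The cell that loads the admissions data is not shown in this extract; a minimal sketch of how it might look, assuming the data lives in a CSV file named admission.csv (a hypothetical file name):

import pandas as pd

# Load the graduate admissions data (admit, gre, gpa, rank)
data = pd.read_csv("admission.csv")
data.head()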
Out[19]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
STEP 3: Converting Categorical Variables
# obj, dummydf, X, Y, data1 and dep are assumed to have been defined in earlier
# (not shown) cells; obj marks the categorical columns of the data frame.
for i in data.columns[obj]:
    dummy = pd.get_dummies(data[i], drop_first=True)
    dummydf = pd.concat([dummydf, dummy], axis=1)

print(X)
print(Y)

# Appending "V_" to column names in order to avoid the purely numeric
# column names generated while creating dummies
X.columns = 'V_' + X.columns.astype(str)
print("X_col\n", X.columns)
Y = data1[dep]
gre gpa 2 3 4
0 380 3.61 0 1 0
1 660 3.67 0 1 0
2 800 4.00 0 0 0
3 640 3.19 0 0 1
4 520 2.93 0 0 1
5 760 3.00 1 0 0
6 560 2.98 0 0 0
7 400 3.08 1 0 0
8 540 3.39 0 1 0
9 700 3.92 1 0 0
10 800 4.00 0 0 1
11 440 3.22 0 0 0
12 760 4.00 0 0 0
13 700 3.08 1 0 0
14 700 4.00 0 0 0
15 480 3.44 0 1 0
16 780 3.87 0 0 1
17 360 2.56 0 1 0
18 800 3.75 1 0 0
19 540 3.81 0 0 0
20 500 3.17 0 1 0
21 660 3.63 1 0 0
22 600 2.82 0 0 1
23 680 3.19 0 0 1
24 760 3.35 1 0 0
25 800 3.66 0 0 0
26 620 3.61 0 0 0
27 520 3.74 0 0 1
28 780 3.22 1 0 0
29 520 3.29 0 0 0
.. ... ... .. .. ..
370 540 3.77 1 0 0
371 680 3.76 0 1 0
372 680 2.42 0 0 0
373 620 3.37 0 0 0
374 560 3.78 1 0 0
375 560 3.49 0 0 1
376 620 3.63 1 0 0
377 800 4.00 1 0 0
378 640 3.12 0 1 0
379 540 2.70 1 0 0
380 700 3.65 1 0 0
381 540 3.49 1 0 0
382 540 3.51 1 0 0
383 660 4.00 0 0 0
384 480 2.62 1 0 0
385 420 3.02 0 0 0
386 740 3.86 1 0 0
387 580 3.36 1 0 0
388 640 3.17 1 0 0
389 640 3.51 1 0 0
390 800 3.05 1 0 0
391 660 3.88 1 0 0
392 600 3.38 0 1 0
393 620 3.75 1 0 0
394 460 3.99 0 1 0
395 620 4.00 1 0 0
396 560 3.04 0 1 0
397 460 2.63 1 0 0
398 700 3.65 1 0 0
399 600 3.89 0 1 0
20 0
21 1
22 0
23 0
24 1
25 1
26 1
27 1
28 1
29 0
..
370 1
371 1
372 1
373 1
374 0
375 0
376 0
377 1
378 0
379 0
380 0
381 1
382 0
383 0
384 1
385 0
386 1
387 0
388 0
389 0
390 1
391 1
392 1
393 1
394 1
395 0
396 0
397 0
398 0
399 0
Name: admit, Length: 400, dtype: int64
X_col
Index(['V_gre', 'V_gpa', 'V_2', 'V_3', 'V_4'], dtype='object')
Let's split the dataset by using the function train_test_split(). You need to pass three parameters: features, target, and test set size.
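A minimal sketch of that call, assuming a 70/30 split (the exact test size used in the original notebook is not shown here):

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)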
# In general pruning controls how far the tree grows, so we tune it by
# checking different values of max_depth
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

modCART = DecisionTreeClassifier()
param_grid = {'max_depth': np.arange(3, 10)}
gridS = GridSearchCV(modCART, param_grid)
gridS.fit(X_train, Y_train)
tree_preds = gridS.predict_proba(X_test)[:, 1]
tree_performance = roc_auc_score(Y_test, tree_preds)
To make our model robust, we need to determine how deep the tree should be allowed to grow. For this we create a parameter grid, param_grid, with max_depth values from 3 to 9 (np.arange(3, 10)). This grid is passed to GridSearchCV, which fits a tree for every candidate value and records the score of each.
The best-scoring max_depth tells us the maximum depth of the tree, which can then be used to fit and finalize the model, as below.
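The outputs below most likely come from inspecting the fitted grid search; the calls sketched here are an assumption, since the original cells are not visible in this extract.

gridS.best_params_    # best max_depth found by the grid search
gridS.grid_scores_    # per-candidate scores (deprecated; cv_results_ in newer scikit-learn)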
Out[24]: {'max_depth': 6}
C:\Users\Shivani\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[25]: [mean: 0.67500, std: 0.01497, params: {'max_depth': 3},
mean: 0.67500, std: 0.01555, params: {'max_depth': 4},
mean: 0.66250, std: 0.00778, params: {'max_depth': 5},
mean: 0.68125, std: 0.01893, params: {'max_depth': 6},
mean: 0.64687, std: 0.01849, params: {'max_depth': 7},
mean: 0.63750, std: 0.02564, params: {'max_depth': 8},
mean: 0.63125, std: 0.03226, params: {'max_depth': 9}]
Let's estimate how accurately the classifier can predict whether an applicant will be admitted.
Accuracy can be computed by comparing the actual test set values with the predicted values.
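One common way to compute it, sketched here under the assumption that the finalized classifier is named modCART_final (a hypothetical name, fitted on the training data with the max_depth chosen above):

from sklearn import metrics

Y_pred = modCART_final.predict(X_test)
print(metrics.accuracy_score(Y_test, Y_pred))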
0.5445156695156695
Well, we got a classification rate of 54.45%, which is quite low. We can try to improve this accuracy by tuning the parameters of the decision tree algorithm.
STEP 7: Generating Tree
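The export cell itself is not visible in this extract; a minimal sketch using scikit-learn's export_graphviz, again assuming the hypothetical finalized classifier modCART_final:

from sklearn.tree import export_graphviz

dot_data = export_graphviz(modCART_final, out_file=None,
                           feature_names=X.columns,
                           class_names=['not admitted', 'admitted'],
                           filled=True)
print(dot_data)   # DOT source describing the fitted tree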
#graph.write_pdf("tree.pdf")   # optionally render the exported tree to a file (e.g. via pydotplus)
The generated DOT text can also be visualized online by pasting it into http://www.webgraphviz.com/