$$E(C \mid A_k) = \sum_{j=1}^{M_k} p(a_{k,j}) \left[ -\sum_{i=1}^{N} p(c_i \mid a_{k,j}) \log_2 p(c_i \mid a_{k,j}) \right] \qquad (1)$$
where
$E(C \mid A_k)$ = entropy of the classification property of attribute $A_k$
$p(a_{k,j})$ = probability of attribute $k$ being at value $j$
$p(c_i \mid a_{k,j})$ = probability that the class value is $c_i$ when attribute $k$ is at its $j$th value
$M_k$ = total number of values for attribute $a_k$; $j = 1, 2, \ldots, M_k$
$N$ = total number of different classes; $i = 1, 2, \ldots, N$
$K$ = total number of attributes; $k = 1, 2, \ldots, K$
The term in the brackets is called the information. Thus, as Equation 1
implies, entropy is the expected information: the sum of the information in each
possible outcome, weighted by the probability of that outcome. Logarithms are
generally taken to base 2, so that the information is measured in bits.
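As a small worked illustration of Equation 1 (the numbers are hypothetical and are not taken from the study), suppose attribute $A_k$ has $M_k = 2$ equally likely values and there are $N = 2$ classes. If the two classes are split evenly under the first value and only one class occurs under the second value, the expected information would be:

```latex
E(C \mid A_k)
  = \tfrac{1}{2}\left[-\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}\right]
  + \tfrac{1}{2}\left[-1 \cdot \log_2 1\right]
  = \tfrac{1}{2}(1) + \tfrac{1}{2}(0)
  = 0.5 \text{ bits}
```

The first value contributes a full bit of uncertainty about the class and the second contributes none, so on average half a bit remains after the attribute is observed.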
If a set S of records is partitioned into classes $C_1, C_2, C_3, \ldots, C_i$ on the
basis of the categorical attribute, then the information needed to identify the class of
an element of S is denoted by:
$$I(S) = -\left( p_1 \log_2(p_1) + p_2 \log_2(p_2) + \cdots + p_i \log_2(p_i) \right) \qquad (2)$$
Classification and prediction in a data mining ., / J Marm Pure Appl Sci 18 (2002) 159-174
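where $p_i$ is the probability of the partition $C_i$ [8]. The term in the
brackets in Equation 1 has the same form as Equation 2. Thus the entropy in
Equation 1 can be written like this [8]:
$$E(A) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, I(S_i) \qquad (3)$$
Here $S_1, \ldots, S_n$ denote the subsets of S produced by the $n$ values of attribute A.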
Thus the information gain in performing a branching with attribute A can be
calculated with this equation [5]:
$$\mathrm{Gain}(A) = I(S) - E(A) \qquad (4)$$
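The information measure, the expected information after a split, and the resulting gain can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names and the four toy records are hypothetical.

```python
import math
from collections import Counter

def information(labels):
    """I(S): information needed to identify the class of an element, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_information(values, labels):
    """E(A): information of each subset S_i, weighted by |S_i| / |S|."""
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(s) / n * information(s) for s in subsets.values())

def gain(values, labels):
    """Gain(A) = I(S) - E(A)."""
    return information(labels) - expected_information(values, labels)

# Hypothetical records: this attribute separates the two classes perfectly.
values = ["x", "x", "y", "y"]
labels = ["A", "A", "B", "B"]
print(gain(values, labels))  # 1.0
```

An attribute that leaves every subset as mixed as the full set would have a gain of 0, which is why the attribute with the highest gain is preferred for branching.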
The information gain computed for each attribute is used to choose the test
attribute at each node of the decision tree. The attribute with the highest information
gain is chosen as the test attribute for the current node. This attribute minimizes the
information needed to classify the data in the resulting partitions and reduces the
problems which may occur during branching.
After the data are preprocessed, the next step is to compute the information
gains needed to construct the decision tree. For each field in the training data, the
information gain is computed from Equations 1, 2, 3 and 4; the results, in descending
order, are as follows:
1. Gain (min_Account) = 0.342
2. Gain (Loan_Amount) = 0.192
3. Gain (min_Amount) = 0.162
4. Gain (max_Amount) = 0.069
5. Gain (avg_Amount) = 0.051
6. Gain (Card_Type) = 0.044
7. Gain (avg_Account) = 0.039
8. Gain (Age) = 0.032
9. Gain (District) = 0.03
10. Gain (Loan_Duration) = 0.025
11. Gain (max_Account) = 0.029
12. Gain (Sex) = 0
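Selecting the test attribute from these values amounts to taking the field with the largest gain. The sketch below uses the gains exactly as listed above; the variable names are illustrative assumptions, and the snippet only picks the attribute for the root node rather than building a full tree.

```python
# Information gains as listed above, keyed by field name.
gains = {
    "min_Account": 0.342,
    "Loan_Amount": 0.192,
    "min_Amount": 0.162,
    "max_Amount": 0.069,
    "avg_Amount": 0.051,
    "Card_Type": 0.044,
    "avg_Account": 0.039,
    "Age": 0.032,
    "District": 0.03,
    "Loan_Duration": 0.025,
    "max_Account": 0.029,
    "Sex": 0.0,
}

# The field with the highest gain becomes the test attribute of the root node.
root = max(gains, key=gains.get)
# Ranking all fields shows which ones are the strongest branching candidates.
ranked = sorted(gains, key=gains.get, reverse=True)

print(root)        # min_Account
print(ranked[:4])  # ['min_Account', 'Loan_Amount', 'min_Amount', 'max_Amount']
```

The four highest-ranked fields are the same four fields that appear in the classification rules generated from the tree.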
Once these information gains are computed, the software constructs the
following decision tree (Figure 1):
Serhat Özekes and A. Yılmaz Çamurcu / J Marm Pure Appl Sci 18 (2002) 159-174
After the decision tree in Figure 1 is constructed, the software generates
the following classification rules.
If min_Account= '<=0' Then CLASS='B'
If min_Account= '>40K' Then CLASS='A'
If min_Account= '0<...<=10K' Then
    If Loan_Amount= '>325K' Then CLASS='B'
    If Loan_Amount= '100K<...<=125K' Then CLASS='A'
    If Loan_Amount= '125K<...<=150K' Then
        If min_Amount= '10<...<=20' Then CLASS='A'
        If min_Amount= '20<...<=30' Then CLASS='B'
    End If
    If Loan_Amount= '150K<...<=175K' Then
        If min_Amount= '10<...<=20' Then
            If max_Amount= '10K<...<=20K' Then CLASS='A'
            If max_Amount= '40K<...<=50K' Then CLASS='A'
            If max_Amount= '50K<...<=60K' Then CLASS='B'
        End If
        If min_Amount= '20<...<=30' Then CLASS='A'
    End If
    If Loan_Amount= '175K<...<=200K' Then CLASS='B'
    If Loan_Amount= '200K<...<=250K' Then CLASS='B'
    If Loan_Amount= '20K<...<=30K' Then CLASS='A'
    If Loan_Amount= '250K<...<=275K' Then CLASS='B'
    If Loan_Amount= '275K<...<=300K' Then CLASS='B'
    If Loan_Amount= '300K<...<=325K' Then CLASS='A'
    If Loan_Amount= '30K<...<=40K' Then
        If min_Amount= '10<...<=20' Then
            If max_Amount= '0<...<=10K' Then CLASS='A'
            If max_Amount= '20K<...<=30K' Then CLASS='B'
            If max_Amount= '50K<...<=60K' Then CLASS='A'
            If max_Amount= '60K<...<=70K' Then CLASS='A'
        End If
    End If
    If Loan_Amount= '40K<...<=50K' Then CLASS='A'
    If Loan_Amount= '50K<...<=60K' Then CLASS='A'
    If Loan_Amount= '60K<...<=70K' Then
        If min_Amount= '10<...<=20' Then CLASS='B'
        If min_Amount= '20<...<=30' Then CLASS='A'
        If min_Amount= '30<...<=40' Then CLASS='B'
        If min_Amount= '60<...<=70' Then CLASS='B'
    End If
    If Loan_Amount= '70K<...<=80K' Then CLASS='A'
    If Loan_Amount= '80K<...<=90K' Then CLASS='B'
    If Loan_Amount= '90K<...<=100K' Then
        If min_Amount= '<=1' Then CLASS='A'
        If min_Amount= '>1K' Then CLASS='B'
        If min_Amount= '10<...<=20' Then
            If max_Amount= '10K<...<=20K' Then CLASS='A'
            If max_Amount= '30K<...<=40K' Then CLASS='A'
            If max_Amount= '40K<...<=50K' Then CLASS='A'
            If max_Amount= '50K<...<=60K' Then CLASS='A'
            If max_Amount= '70K<...<=80K' Then CLASS='B'
        End If
    End If
End If
If min_Account= '10K<...<=20K' Then CLASS='A'
If min_Account= '20K<...<=30K' Then CLASS='A'
If min_Account= '30K<...<=40K' Then CLASS='A'
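Rules of this shape map naturally onto a nested lookup structure. The following Python sketch transcribes only an excerpt of the rules above and assumes each record has already been discretized into the same interval labels; the names `RULES` and `classify` are hypothetical, and returning `"unknown"` for an unmatched branch is an assumption about how unmatched records might be handled.

```python
# Excerpt of the rule set as nested lookups.
# A string leaf is a class; a (field, branches) tuple is a further test.
RULES = ("min_Account", {
    "<=0": "B",
    ">40K": "A",
    "10K<...<=20K": "A",
    "20K<...<=30K": "A",
    "30K<...<=40K": "A",
    "0<...<=10K": ("Loan_Amount", {
        ">325K": "B",
        "100K<...<=125K": "A",
        "60K<...<=70K": ("min_Amount", {
            "10<...<=20": "B",
            "20<...<=30": "A",
            "30<...<=40": "B",
            "60<...<=70": "B",
        }),
    }),
})

def classify(record, node=RULES):
    """Walk the rules for one discretized record; 'unknown' if no branch matches."""
    while not isinstance(node, str):
        field, branches = node
        node = branches.get(record.get(field), "unknown")
    return node

record = {
    "min_Account": "0<...<=10K",
    "Loan_Amount": "60K<...<=70K",
    "min_Amount": "20<...<=30",
}
print(classify(record))  # A
```

A record whose interval label has no matching branch falls through to `"unknown"`, because the excerpt holds no rule for that combination of values.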
The classification rules are verified with test data, as seen in Figure 2.
For each account, the software applies the classification rules to the
min_Account, Loan_Amount, min_Amount and max_Amount fields and
writes the outcome in the result column. The Loan_state values in the test data are
then compared with the values in the result column to obtain an error rate. The error
rate shown in the message box is 12.76%. Since this error rate is acceptable, the
classification rules may be used to predict the loan states of C and D accounts at the
end of the contract.
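The comparison described above can be sketched as follows. The eight predicted and actual values are made up for illustration and do not reproduce the 12.76% figure; `error_rate` is a hypothetical helper, not part of the study's software.

```python
def error_rate(predicted, actual):
    """Percentage of records whose predicted class differs from the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

# Hypothetical result column and Loan_state column for eight test accounts.
predicted = ["A", "B", "A", "A", "B", "A", "A", "B"]
actual = ["A", "B", "B", "A", "B", "A", "A", "B"]
print(f"{error_rate(predicted, actual):.2f}%")  # 12.50%
```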
Figure 2. The specification of the classification rules error rate.
The predictions for states C and D are shown in Figure 3. The loan state
column shows the current state of each account, and the result column shows the
predicted loan state at the end of the contract. Two cases are called unexpected
cases: a client with loan state C whose state becomes B at the end of the contract,
and a client with loan state D whose state becomes A at the end of the contract.
Figure 3. The prediction of the C and D loan states at the end of the contract.
Of the 448 accounts in states C and D, the predictions generally show that
loan state C will become A and loan state D will become B at the end of the
contract. That is, accounts with no problems so far are generally expected to repay
the loan, and accounts with problems so far are expected not to repay it.
For some accounts, however, the prediction is that a current loan state of C will
become B, or a current loan state of D will become A, at the end of the contract.
There are 45 such cases.
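The distinction between expected and unexpected cases can be expressed as a simple lookup. This is a sketch under the assumption that every account carries a current state (C or D) and a predicted state; the names `EXPECTED` and `label_case` are hypothetical. The final lines only restate the counts reported above as a proportion.

```python
# Expected outcome at the end of the contract for each current loan state.
EXPECTED = {"C": "A", "D": "B"}

def label_case(current_state, predicted_state):
    """Label one prediction as 'expected', 'unexpected' or 'unknown'."""
    if predicted_state not in ("A", "B"):
        return "unknown"  # the rules produced no prediction
    if EXPECTED[current_state] == predicted_state:
        return "expected"
    return "unexpected"

print(label_case("C", "A"))  # expected
print(label_case("D", "A"))  # unexpected

# Share of unexpected cases among the C and D accounts reported above.
share = 45 / 448
print(f"{share:.1%}")  # 10.0%
```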
Therefore it seems possible that some clients who currently have problems
repaying the loan will solve their problems and repay it. Similarly, some clients
who are still repaying their loans without any problem will run into problems and
fail to repay their loans.
The decision tree and classification rules may not be sufficient to make a
prediction for some new data. The states which cannot be predicted are described
as unknown states. These states are shown in Figure 4.
Figure 4. The C and D states that cannot be predicted.
RESULTS AND CONCLUSIONS
Before a data mining application is carried out, the data at hand
and the business problem to be solved must first be analyzed and understood very well.
These two key points are the basic factors that affect the success of the data mining
application. Choosing the right data mining technique is another
important factor in solving the business problem.
The data to be analyzed by data mining techniques may be incomplete,
noisy and inconsistent. Thus the data must be preprocessed before the application
starts. This preprocessing includes data cleaning, data integration, data
transformation and data reduction. The data used in this application were also
preprocessed and arranged for the decision tree technique. One of these
arrangements is the determination of the input variables used in the construction of
the decision tree. Determining the input variables according to the goal is one
of the key points of the decision tree technique.
Evaluating and interpreting the patterns obtained by
applying the right technique to the preprocessed data is another important point in
data mining applications. When experts interpret these patterns,
highly valuable information is obtained. The last step of a data mining application is
presenting this information to the users.
In this study, 10 of the 448 predictions on C and D states are labeled as
unknown states. The decision tree and the classification rules are insufficient for
these states and cannot make any prediction. This is because the data set used in
the application does not provide an adequate training data set, so classification
rules with adequate capacity to predict all C and D states cannot be obtained.
The software designed for this application uses the entropy measure as the
branching criterion. The contribution that the Gini and Twoing [14] criteria, the
other alternatives that can be used as a branching criterion, could make to solving
the above problem remains to be studied.
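For comparison, the sketch below computes the entropy measure next to the Gini impurity in its usual form, one minus the sum of squared class probabilities. It is a hedged illustration only: the helper names and the four sample labels are assumptions, Twoing is not covered, and nothing here indicates how either criterion would perform on the loan data.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class in a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Entropy in bits, the branching measure used in this application."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

# Hypothetical node holding three records of class A and one of class B.
labels = ["A", "A", "A", "B"]
print(round(entropy(labels), 3), round(gini(labels), 3))  # 0.811 0.375
```

Both measures are 0 for a pure node and largest for an even class split, so they often rank candidate splits similarly, though not always identically.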
REFERENCES
[1] Zhong, N.; Zhou, L.: Methodologies for Knowledge Discovery and Data Mining,
The Third Pacific-Asia Conference, Pakdd-99, Beijing, China, April 26-28, 1999 ;
Proceedings, Springer Verlag, (1999).
[2] Fayyad, U.: Mining Databases: Towards Algorithms for Knowledge Discovery,
IEEE Bulletin of the Technical Committee on Data Engineering, 21 (1) (1998) 41-48.
[3] Akpınar, H.: Veri Tabanlarında Bilgi Keşfi ve Veri Madenciliği, İstanbul Üniv.
İşletme Fakültesi Dergisi, 29 (2000) 1.
[4] Berson, A.; Smith, S.; Thearling, K.: Building Data Mining Applications for CRM,
McGraw-Hill Professional Publishing, New York, USA, (2000).
[5] Chaudhuri, S.: Data Mining and Database Systems : Where is the Intersection?, IEEE
Bulletin of the Technical Committee on Data Engineering, 21 (1) (1998) 4 - 8.
[6] 3rd European Conference on Principles and Practice of Knowledge Discovery in
Databases, http://lisp.vse.cz/pkdd99/DATA/data_berka.zip, Access Date: 18 Dec.2001
[7] Seidman, C.: Data Mining with Microsoft SQL Server 2000, Microsoft Press, 1st Ed.,
Washington, USA, (2001).
[8] Güvenç, E.: Student Performance Assessment In Higher Education Using Data
Mining, MSc Thesis, Boğaziçi Univ., Inst. for Graduate Studies in Science and
Engineering, Istanbul, Turkey, (2001).
[9] Ulaş, M.A.: Market Basket Analysis For Data Mining, MSc Thesis, Boğaziçi Univ.,
Inst. for Graduate Studies in Science and Engineering, Istanbul, Turkey, (2001).
[10] Berry, M.J.A.; Linoff, G.S.: Mastering Data Mining: The Art and Science of
Customer Relationship Management, 1st Ed.; John Wiley & Sons, (1999).
[11] Han, J.; Kamber, M.: Data Mining: Concepts and Techniques, 1st Ed.; Morgan
Kaufmann Publishers, San Francisco, USA, (2000).
[12] Agrawal, R.; Imielinski, T.; Swami, A.: Database Mining: A Performance
Perspective, IEEE Transactions on Knowledge and Data Engineering, (1993) 914-925.
[13] Chen, M.; Han, J.; Yu, P.S.: Data Mining: An Overview from Database Perspective,
IEEE Transactions on Knowledge and Data Engineering, 8 (6) (1996).
[14] CART for Windows Users Guide, http://www.salford-systems.com, Access Date: 17
March 2002.
Received December 2002