
Available online at www.sciencedirect.com

Knowledge-Based Systems 21 (2008) 1–5


www.elsevier.com/locate/knosys

Short communication

Data mining method for listed companies' financial distress prediction


Jie Sun, Hui Li

School of Business Administration, Zhejiang Normal University, Jinhua 321004, Zhejiang Province, PR China
Received 21 August 2006; received in revised form 1 November 2006; accepted 16 November 2006
Available online 8 December 2006

Abstract
Data mining techniques are capable of mining valuable knowledge from large and changing databases. This paper puts forward a data mining method combining attribute-oriented induction, information gain, and decision tree, which is suitable for preprocessing financial data and constructing a decision tree model for financial distress prediction. On the basis of financial ratio attributes and one class attribute, and adopting an entropy-based discretization method, a data mining model for listed companies' financial distress prediction is designed.
An empirical experiment with 35 financial ratios and 135 pairs of listed companies as initial samples yielded satisfactory results, which testify to
the feasibility and validity of the proposed data mining method for listed companies' financial distress prediction.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Financial distress prediction; Data mining; Decision tree; Attribute-oriented induction

1. Introduction
Listed companies' financial distress prediction is important to both listed companies and investors. Due to the uncertainty of the business environment and strong competition, even companies with sound operating mechanisms face the possibility of business failure and financial bankruptcy. Whether listed companies' financial distress can be predicted effectively and in time therefore bears on companies' development, the interests of numerous investors, and the order of the capital market.
Early studies of financial distress prediction used statistical techniques such as univariate analysis, multiple discriminant analysis, Logit, and so on [1]. Though these methods use historical samples to create a diagnostic model, they cannot inductively learn from new data dynamically, which greatly affects prediction accuracy. More recently, many studies have demonstrated that artificial intelligence techniques such as neural networks can be an alternative method for financial distress prediction [2]. But a neural network is a black box whose structure and weight values constitute the hidden knowledge for classification, which is difficult for ordinary investors and finance professionals to understand.
In recent years, with the development of information technology, machine learning, and artificial intelligence, a new field of intelligent data analysis, data mining, has appeared and grown rapidly against the awkward background of abundant data but poor knowledge. It also brings new vitality to research on methods for financial distress prediction. On the basis of a large database or data warehouse storing a great number of listed companies' financial data, data mining techniques can dynamically mine valuable hidden knowledge, which can be applied to predict listed companies' financial distress.

2. Data mining method for listed companies' financial distress prediction
2.1. Choice of algorithm

Corresponding author. Tel.: +86 130 9144 2884; fax: +86 24 8350 0616.
E-mail addresses: sjhit@sina.com (J. Sun), lihuihit@sohu.com (H. Li).
0950-7051/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2006.11.003

Data mining is the process of extracting hidden and valuable knowledge from databases, data warehouses, and other information storage media. It has several functions, such as association analysis, classification and prediction, clustering analysis, outlier analysis, and so on. Each of them may have several alternative data mining algorithms [3].
Data mining aimed at listed companies' financial distress prediction belongs to the problem of classification and prediction, whose typical data mining methods include the decision tree classifier, the Bayesian classifier, and the neural network classifier. The Bayesian classifier is based on a class-conditional independence hypothesis that is hard to meet in reality, and neural networks have the deficiency mentioned above. So we choose the decision tree method to form a classifier for listed companies' financial distress prediction, not only because it is not subject to the independence hypothesis and is fast and accurate, but also because the knowledge it produces is easy to understand and use.
Besides, attribute-oriented induction (AOI) and attribute relevance analysis based on information gain (IG) are combined to raise the attributes' concept level and filter weakly related attributes out of the attribute set. This kind of data preprocessing not only improves the mining efficiency of the decision tree algorithm, but also makes the classification knowledge obtained by data mining more meaningful and valuable.
2.1.1. Data preprocessing algorithm combining AOI and IG
AOI can be used to generalize data. It first collects the data relevant to the mining task by a database query operation, and then generalizes the data by counting the number of each attribute's different values. Generally, this process is realized through two operations, attribute reduction and attribute generalization, and the degree of attribute concept-level enhancement is controlled by an attribute generalization threshold [4]. The IG method is based on entropy theory. It is used to eliminate attributes that are irrelevant or only weakly related to the mining task, by calculating each attribute's IG and comparing it to an attribute relevance threshold designed beforehand [5]. The detailed preprocessing algorithm is as follows.

Input. (1) relation database DB; (2) data mining command DMQuery; (3) attribute set a_list; (4) concept level tree or generalization operation of attribute ai, Gen(ai); (5) generalization threshold of attribute ai, gen_thresh(ai); (6) attribute relevance threshold rela_thresh.
Output. Relation after generalization and attribute relevance analysis, Gen_Rela_relation.
Algorithm.
(1) // Obtain data related to the mining task
    get_relevant_data(DMQuery, DB, Work_relation);
(2) // Get the number of each attribute's different values
    scan Work_relation to count tot_valu(ai);
(3) // Attribute reduction
    for each ai in a_list where tot_valu(ai) > gen_thresh(ai)
        if (Gen(ai) does not exist) or (a higher concept level of ai is denoted by another attribute)
            remove_attribute(ai, a_list);
(4) // Attribute generalization
    for each ai in a_list where tot_valu(ai) > gen_thresh(ai)
        while (tot_valu(ai) > gen_thresh(ai))
            generalize(ai, Gen(ai), tot_valu(ai), Work_relation);
(5) // Attribute relevance analysis
    for each ai in a_list
        IG(ai); // get the IG of each attribute
        if IG(ai) < rela_thresh
            remove_attribute(ai, a_list).

In the above algorithm, IG(ai), which gets the IG of each attribute, is calculated as follows.
Suppose S is a data set containing s samples. The class attribute has m different values corresponding to m different classes, denoted Ci, i ∈ {1, 2, ..., m}, and si is the number of samples of class Ci. Then the total information entropy needed to classify the given data set is

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i,   (1)

in which pi is the probability that a random sample belongs to class Ci, namely pi = si/s.
If attribute A has v different values {a1, a2, ..., av}, then data set S can be divided into v subsets {S1, S2, ..., Sv}, where subset Sj is composed of the data samples whose value of attribute A equals aj. Suppose sij is the number of samples belonging to both subset Sj and class Ci; then the information entropy needed to classify the given data set according to attribute A is

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}),   (2)

I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij},   (3)

p_{ij} = \frac{s_{ij}}{s_{1j} + s_{2j} + \cdots + s_{mj}}.   (4)

In this way, the information gain obtained by attribute A is

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A).   (5)
2.1.2. Decision tree algorithm


A decision tree is a tree-shaped decision structure learnt inductively from sample data whose classes are already known. Each non-leaf node of the decision tree represents a test of an attribute value, and each leaf node represents a class [6]. The basic algorithm to generate a decision tree is stated as follows.
Input. Training sample data (all attributes should be discretized); candidate attribute set attribute_list.
Output. Decision tree.
Algorithm: Gen_decision_tree(samples, attribute_list)


(1) Create a node denoted N;
(2) If all samples of node N belong to the same class C, then return N as a leaf node and label it with class C;
(3) If attribute_list is empty, then return N as a leaf node and label it with the class that has the most samples in node N;
(4) Choose the attribute with the biggest IG in attribute_list, and denote it test_attribute;
(5) Label node N with test_attribute;
(6) For each condition test_attribute = ai, produce a branch from node N, and let Si be the set of samples that satisfy the branch condition;
(7) If Si is empty, then label the corresponding leaf node with the class that has the most samples in node N; otherwise label the corresponding branch with the subtree recursively returned by Gen_decision_tree(Si, attribute_list − test_attribute).
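The following Python sketch mirrors steps (1)–(7) above and reuses the info_gain helper from Section 2.1.1; representing the tree as a nested dict and each sample as a dict of attribute values is our illustrative choice, not the authors'.

from collections import Counter

def gen_decision_tree(samples, labels, attribute_list):
    # (2) all samples in one class -> leaf labelled with that class
    if len(set(labels)) == 1:
        return labels[0]
    # (3) no attributes left -> leaf labelled with the majority class
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]
    # (4)-(5) test_attribute = attribute with the biggest IG
    test_attr = max(attribute_list,
                    key=lambda a: info_gain([s[a] for s in samples], labels))
    node = {test_attr: {}}
    rest = [a for a in attribute_list if a != test_attr]
    # (6)-(7) one branch per observed value ai of test_attribute, built
    # recursively; since branches cover only observed values, the
    # empty-Si case of step (7) cannot arise in this sketch.
    for ai in set(s[test_attr] for s in samples):
        idx = [k for k, s in enumerate(samples) if s[test_attr] == ai]
        node[test_attr][ai] = gen_decision_tree(
            [samples[k] for k in idx], [labels[k] for k in idx], rest)
    return node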

2.2. Discretization of continuous-valued attributes


Most financial measure attributes are continuous-valued, but the decision tree algorithm requires that all attributes be discretized. So before data mining we must convert continuous-valued attributes into discrete ones by dividing the continuous value domain into several intervals and replacing the real data with interval symbols. In fact, this process is also the process of constructing concept level trees for continuous-valued attributes, which is the preparation for the preprocessing algorithm in Section 2.1.1. At present, discretization methods include equal-width intervals, equal-frequency intervals, clustering discretization, and entropy-based discretization. Compared with the other methods, entropy-based discretization takes class information into consideration, so the intervals it produces tend to improve classification accuracy [7]. Because data mining for listed companies' financial distress prediction is a classification problem, and the decision tree algorithm chooses its test attribute by the biggest-IG rule, entropy-based discretization is the best choice for discretizing financial measure attributes.
Given a data set S, each value of attribute A can be considered a possible interval boundary T. For example, one possible value v of attribute A divides the data set into two subsets: S1, which satisfies the condition A < v, and S2, which satisfies A ≥ v. So, taking each possible value in turn as the interval boundary, we can calculate the corresponding IG of attribute A according to formulas (1)–(5). Choose the value v* that gives attribute A the biggest IG (namely the smallest information entropy E(A)) as the interval boundary; the value domain of attribute A is then divided into two intervals, A < v* and A ≥ v*. The same method can be used to further subdivide these two intervals, for as long as the information gain of the best split still exceeds the predefined threshold.
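A hedged sketch of this binary split search, again reusing info_entropy from Section 2.1.1; the function name best_boundary and the return convention are ours.

def best_boundary(values, labels):
    # Try each observed value v as boundary T and keep the split with
    # the biggest information gain (smallest weighted entropy E(A)).
    base = info_entropy(labels)
    s = len(labels)
    best_v, best_gain = None, 0.0
    for v in sorted(set(values))[1:]:          # candidate boundaries T
        left = [c for x, c in zip(values, labels) if x < v]
        right = [c for x, c in zip(values, labels) if x >= v]
        e_a = (len(left) / s) * info_entropy(left) \
            + (len(right) / s) * info_entropy(right)
        if base - e_a > best_gain:
            best_v, best_gain = v, base - e_a
    return best_v, best_gain

Recursively applying best_boundary to each resulting interval, and stopping once the gain no longer exceeds the threshold, yields the interval symbols that serve as a concept level tree for Section 2.1.1.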

[Fig. 1. Data mining steps of financial distress prediction: creating data set → data preprocessing → construction of decision tree model → accuracy evaluation → classification and prediction.]

2.3. Data mining steps


Data mining for listed companies' financial distress prediction needs five steps: creating the data set, data preprocessing, constructing the decision tree by inductive learning, accuracy evaluation, and classification and prediction, as shown in Fig. 1.
(1) Creating the data set: drawing relevant data from data sources such as listed companies' publicly disclosed information. Attributes of the data set may include financial measure attributes, the class attribute, and other essential information attributes.
(2) Data preprocessing: consists of discretization of continuous-valued attributes, data generalization and attribute relevance analysis, elimination of outliers, and so on.
(3) Construction of the decision tree model: inductively learning from the preprocessed data by the decision tree algorithm stated in Section 2.1.2, and constructing a decision tree that represents the classification knowledge for listed companies' financial distress prediction.
(4) Accuracy evaluation: evaluating the decision tree model's prediction accuracy on the training data set and the validation data set, respectively.
(5) Classification and prediction: if the decision tree's accuracy is acceptable, it is used to predict listed companies' financial distress, as sketched below.
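Step (5) amounts to walking the tree from root to leaf. With the nested-dict tree from Section 2.1.2, a classifier can be as small as the following sketch (our helper, not part of the paper):

def classify(tree, sample):
    # Descend from the root, following the branch that matches the
    # sample's (discretized) attribute value, until a leaf class is
    # reached; assumes every value seen at prediction time also
    # occurred in training.
    while isinstance(tree, dict):
        attr = next(iter(tree))          # the node's test_attribute
        tree = tree[attr][sample[attr]]
    return tree                          # leaf = predicted class (e.g. ST or NM)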
3. Empirical experiment
3.1. Data collection and preprocessing
The data used in this study were obtained from the China Stock Market and Accounting Research Database (CSMAR). Following principles such as summarization, measurability, and sensitivity [8], the initial financial ratio set is composed of 35 financial ratios, which cover profitability ratios, activity ratios, short-term debt ratios, long-term debt ratios, growth ratios, and structural ratios.



[Fig. 2. Misclassification error when the decision tree is pruned to different degrees: resubstitution and cross-validation cost (misclassification error) plotted against the number of terminal nodes, with the best choice marked.]

Companies that are specially treated (ST)1 by the China Securities Supervision and Management Committee (CSSMC) are considered companies in financial distress, and those never specially treated are regarded as healthy ones. According to data between 2000 and 2005, 135 pairs of companies listed on the Shenzhen Stock Exchange and the Shanghai Stock Exchange were selected as initial sample companies. In order to eliminate outliers, companies with financial ratios deviating from the mean value by more than three standard deviations were excluded, yielding a final sample of 198 companies, among which 92 are ST companies and 106 are normal (NM) ones. Then 70 ST companies and 80 NM companies (150 companies in total) were randomly chosen as training samples. The other 22 ST companies and 26 NM companies (48 companies in total) were used as validation samples.
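A minimal NumPy sketch of this three-sigma outlier rule, assuming X is a (companies × 35) array of financial ratios and y the corresponding ST/NM labels (the array names are ours):

import numpy as np

def drop_outliers(X, y):
    # Keep only companies whose every ratio lies within three standard
    # deviations of that ratio's sample mean.
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    keep = (z <= 3).all(axis=1)
    return X[keep], y[keep]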

3.2. Construction of decision tree model


The modeling process was realized through the MATLAB 6.5 toolbox and its programming language. After applying the data mining method proposed in Section 2, the decision tree model for listed companies' financial distress prediction was formed, and it was then pruned until the cross-validation error reached its minimum value, as shown in Fig. 2. When the number of terminal nodes equals eight, the decision tree has the lowest cross-validation misclassification error. The decision tree model at this point is shown in Fig. 3.
It is easy to transform decision tree knowledge into rule knowledge: for example, if x30 < −1.525755 then financially distressed.
1 The most common reason that Chinese listed companies are specially treated by the CSSMC is that they have had negative net profit for two consecutive years. Of course, they will also be specially treated if they purposely publish financial statements containing serious falsehoods and misstatements, but the ST samples chosen in this study are all companies that were specially treated because of negative net profit in two consecutive years.

[Fig. 3. The decision tree model for listed companies' financial distress prediction, with non-leaf-node splits x30 < −1.525755, x14 < 1.50866, x1 < 0.865457, x32 < 0.05295, x15 < 611.156, x4 < 0.497159, and x17 < 0.05929, and leaves labelled ST or NM.]

Table 1
Meaning of non-leaf nodes

Non-leaf node   Meaning
x30             Net profit growth rate
x14             Ratio of liabilities to tangible net assets
x1              Account receivable turnover
x32             Ratio of liabilities to cash flow
x15             Ratio of liabilities to equity market value
x4              Total asset turnover
x17             Gross profit rate of sales

Table 2
Result of accuracy evaluation

Method                   Sample size   Error (%)   Accuracy (%)
Resubstitution           150           4.67        95.33
Cross-validation         150           15.33       84.67
Independent validation   48            18.75       81.25

The meaning of the non-leaf nodes in Fig. 3 is listed in Table 1.

3.3. Accuracy evaluation of the decision tree model


Resubstitution, 10-fold cross-validation, and validation with independent samples were carried out, respectively, to evaluate the decision tree model. As Table 2 shows, the classification accuracy obtained by resubstitution, 10-fold cross-validation, and independent validation is 95.33%, 84.67%, and 81.25%, respectively, indicating that the decision tree model constructed by the data mining method in Section 2 has relatively satisfactory prediction accuracy not only on the training samples but also on the validation samples. So this data mining method is suitable for constructing decision tree models for listed companies' financial distress prediction.
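For readers who want to reproduce this style of evaluation, here is a hedged sketch using scikit-learn's decision tree as a stand-in for the authors' MATLAB model, with random placeholder data in place of the CSMAR ratios (the data and model choice are our assumptions, not the paper's code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 35))        # 150 training companies, 35 ratios
y_train = rng.integers(0, 2, 150)           # 0 = NM, 1 = ST (placeholder labels)
X_valid = rng.normal(size=(48, 35))         # 48 independent validation companies
y_valid = rng.integers(0, 2, 48)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
resub = clf.score(X_train, y_train)         # resubstitution accuracy
cv = cross_val_score(DecisionTreeClassifier(random_state=0),
                     X_train, y_train, cv=10).mean()   # 10-fold cross-validation
indep = clf.score(X_valid, y_valid)         # independent validation accuracy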
4. Conclusion
Existing financial distress prediction methods have problems such as a lack of dynamic learning ability and difficulty of interpretation. A data mining method combining AOI, IG, and decision tree can overcome those problems and effectively predict listed companies' financial distress. By adopting an entropy-based discretization method for continuous-valued attributes, a data mining model for listed companies' financial distress prediction can be designed to dynamically and inductively learn from a periodically changing database, producing an easily understandable decision tree classification model. The empirical experiment involving 35 financial ratios and 135 pairs of listed companies yielded a satisfactory result, which means that applying the proposed data mining method to listed companies' financial distress prediction is not only theoretically feasible but also practically effective.
Acknowledgements
This research is partially supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. Y607011), the National Natural Science Foundation of China (Nos. 70573030 and 70571019), and the National Center of Technology, Policy and Management at Harbin Institute of Technology. The authors gratefully thank the anonymous referees for their useful comments and the editors for their work.
References
[1] E. Altman, G. Marco, Corporate distress diagnosis: comparisons using linear discriminant analysis and neural networks, Journal of Banking and Finance 18 (1994) 505–529.
[2] C.P. Parag, A threshold-varying artificial neural network approach for classification and its application to the bankruptcy prediction problem, Computers and Operations Research 32 (10) (2005) 2561–2582.
[3] P. Adriaans, D. Zantinge, Data Mining, Addison-Wesley, England, 1996.
[4] Y.-L. Chen, C.-C. Shen, Mining generalized knowledge from ordered data through attribute-oriented induction techniques, European Journal of Operational Research 166 (2005) 221–245.
[5] J.-W. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Mateo, 2001.
[6] S.-C. Chou, C.-L. Hsu, MMDT: a multi-valued and multi-labeled decision tree classifier for data mining, Expert Systems with Applications 28 (4) (2005) 799–812.
[7] D. Janssens, T. Brijs, K. Vanhoof, et al., Evaluating the performance of cost-based discretization versus entropy- and error-based discretization, Computers and Operations Research 33 (11) (2005) 1–17.
[8] X.-F. Li, J.-P. Xu, The establishment of a rough-ANN model for pre-warning of enterprise financial crisis and its application, Systems Engineering Theory and Practice 10 (2004) 8–13 (in Chinese).
