Short communication
School of Business Administration, Zhejiang Normal University, Jinhua 321004, Zhejiang Province, PR China
Received 21 August 2006; received in revised form 1 November 2006; accepted 16 November 2006
Available online 8 December 2006
Abstract
Data mining techniques can extract valuable knowledge from large and changing databases. This paper puts forward a data mining method that combines attribute-oriented induction, information gain, and decision trees, and that is suitable for preprocessing financial data and constructing a decision tree model for financial distress prediction. On the basis of financial ratio attributes and one class attribute, and adopting an entropy-based discretization method, a data mining model for listed companies' financial distress prediction is designed. An empirical experiment with 35 financial ratios and 135 pairs of listed companies as initial samples gave satisfactory results, which supports the feasibility and validity of the proposed data mining method for listed companies' financial distress prediction.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Financial distress prediction; Data mining; Decision tree; Attribute-oriented induction
1. Introduction
Financial distress prediction for listed companies is important to both the companies themselves and investors. However, owing to the uncertainty of the business environment and strong competition, even companies with sound operating mechanisms may face business failure and bankruptcy. Whether listed companies' financial distress can be predicted effectively and in time therefore bears on the companies' development, the interests of numerous investors, and the order of the capital market.
Early studies of financial distress prediction used statistical techniques such as univariate analysis, multiple discriminant analysis, Logit and so on [1]. Although these methods use historical samples to create a diagnostic model, they cannot learn inductively and dynamically from new data, which greatly affects prediction accuracy. More recently, many studies have demonstrated that artificial intelligence techniques such as neural networks can be an alternative method for financial distress prediction [2]. But neural net-
Corresponding author. Tel.: +86 130 9144 2884; fax: +86 24 8350 0616.
E-mail addresses: sjhit@sina.com (J. Sun), lihuihit@sohu.com (H. Li).
0950-7051/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2006.11.003
Data mining is the process of mining hidden and valuable knowledge from databases, data warehouses and other information storage media. It has several functions, such as association analysis, classification and prediction, clustering analysis, outlier analysis and so on, each of which may have several alternative data mining algorithms [3]. Data mining aimed at listed companies' financial distress prediction is a classification and prediction problem, for which typical data mining methods include the decision tree classifier, the Bayesian classifier, and the neural network classifier. The Bayesian classifier is based on the hypothesis of class conditional independence, which is hard to meet in reality, and neural networks have the deficiency mentioned above. We therefore choose the decision tree method to form the classifier for listed companies' financial distress prediction, not only because it is not subject to the class conditional independence hypothesis and is fast and accurate, but also because the knowledge it produces is easy to understand and use.
In addition, attribute-oriented induction (AOI) and attribute relativity analysis based on information gain (IG) are combined to raise the conceptual level of the attributes and to filter weakly related attributes out of the attribute set. This kind of data preprocessing not only improves the mining efficiency of the decision tree algorithm, but also makes the classification knowledge obtained by data mining more meaningful and valuable.
2.1.1. Data preprocessing algorithm combining AOI and IG
AOI can be used to generalize data. It first collects the data relevant to the mining task through a database query, and then generalizes the data by counting the number of distinct values of each attribute. Generally, this process is realized through two operations, attribute reduction and attribute generalization, and the degree to which an attribute's concept level is raised is controlled by the attribute generalization threshold [4]. The IG method is based on entropy theory. It eliminates attributes that are irrelevant or weakly related to the mining task by calculating each attribute's IG and comparing it with a predefined attribute relativity threshold [5]. The detailed preprocessing algorithm is as follows.
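$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (1)$$

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s}\, I(s_{1j}, s_{2j}, \ldots, s_{mj}) \qquad (2)$$

$$I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} \qquad (3)$$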
Input. (1) relational database DB; (2) data mining command DMQuery; (3) attribute set a_list; (4) concept level tree or generalization operation Gen(ai) of attribute ai; (5) generalization threshold gen_thresh(ai) of attribute ai; (6) attribute relativity threshold rela_thresh.
Output. The relation after generalization and attribute relativity analysis, Gen_Rela_relation.
Algorithm.
(1) //Obtain data related to mining task
get_relevant_data(DMQuery, DB, Work_relation);
(2) //Get the number of distinct values of each attribute
scan Work_relation to count tot_valu(ai);
(3) //Attribute reduction
for each ai in a_list where tot_valu(ai) > gen_thresh(ai)
    if (Gen(ai) does not exist) or (the higher concept level of ai is expressed by another attribute)
        remove_attribute(ai, a_list);
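Steps (2) and (3) of the pseudocode can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: rows are assumed to be dictionaries keyed by attribute name, `gen` maps an attribute to a generalization function (or is missing when none exists), and `expressed_elsewhere` is a hypothetical set naming attributes whose higher concept level is already carried by another attribute.

```python
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, object]  # assumed record layout: attribute name -> value


def count_distinct_values(rows: List[Row], a_list: List[str]) -> Dict[str, int]:
    """Step (2): count the distinct values of each attribute (tot_valu)."""
    return {a: len({row[a] for row in rows}) for a in a_list}


def reduce_attributes(
    rows: List[Row],
    a_list: List[str],
    gen_thresh: Dict[str, int],
    gen: Dict[str, Optional[Callable[[object], object]]],
    expressed_elsewhere: Optional[Set[str]] = None,
) -> List[str]:
    """Step (3): drop attributes that exceed their generalization threshold
    and either have no generalization operation or are expressed elsewhere."""
    expressed_elsewhere = expressed_elsewhere or set()
    tot_valu = count_distinct_values(rows, a_list)
    kept = []
    for a in a_list:
        if tot_valu[a] > gen_thresh[a]:
            no_generalization = gen.get(a) is None
            if no_generalization or a in expressed_elsewhere:
                continue  # corresponds to remove_attribute(ai, a_list)
        kept.append(a)
    return kept
```

Attributes that exceed the threshold but do have a usable generalization operation are kept here, on the assumption that a later attribute generalization step (not shown in the pseudocode above) would handle them.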
$$p_{ij} = \frac{s_{ij}}{s_{1j} + s_{2j} + \cdots + s_{mj}}$$
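The entropy-based quantities used for attribute relativity analysis can be illustrated with a short Python sketch. It assumes the standard definition of information gain (class entropy minus the expected information after partitioning on an attribute) and assumes that attributes whose gain falls below `rela_thresh` are the ones filtered out; the function names are hypothetical and the attribute values are taken to be already discretized.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def info(labels: Sequence[Hashable]) -> float:
    """Entropy of a class distribution: -sum(p_i * log2(p_i))."""
    s = len(labels)
    return -sum((c / s) * math.log2(c / s) for c in Counter(labels).values())


def expected_info(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Weighted entropy of the subsets obtained by partitioning on an attribute."""
    s = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / s * info(g) for g in groups.values())


def gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of an attribute with respect to the class attribute."""
    return info(labels) - expected_info(values, labels)


def filter_by_gain(
    table: Dict[str, Sequence[Hashable]],
    labels: Sequence[Hashable],
    rela_thresh: float,
) -> List[str]:
    """Keep attributes whose gain reaches the threshold (assumed comparison: >=)."""
    return [a for a, vals in table.items() if gain(vals, labels) >= rela_thresh]
```

For example, with two ST and two NM samples, an attribute that separates the classes perfectly has a gain of 1 bit, while an attribute with a single constant value has a gain of 0 and would be dropped for any positive threshold.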
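[Figure residue: flowchart labels "Data preprocessing" and "Accuracy evaluation"; error-rate plot with curves labelled "Cross-validation", "Resubstitution" and "Best choice" (vertical axis ticks 0 to 0.4); decision tree with split conditions x30 < -1.525755, x14 < 1.50866, x1 < 0.865457, x32 < 0.05295, x15 < 611.156, x4 < 0.497159 and x17 < 0.05929, with leaves labelled ST.]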
that are specially treated (ST)1 by the China Securities Supervision and Management Committee (CSSMC) are considered to be companies in financial distress, and those never specially treated are regarded as healthy ones. Based on data from 2000 to 2005, 135 pairs of companies listed on the Shenzhen Stock Exchange and the Shanghai Stock Exchange are selected as the initial sample. To eliminate outliers, companies with financial ratios deviating from the mean by as much as three standard deviations are excluded, leaving a final sample of 198 companies, of which 92 are ST companies and 106 are normal (NM) ones. Then 70 ST companies and 80 NM companies (150 companies in total) are randomly chosen as training samples. The other 22 ST companies and 26 NM companies (48 companies in total) are used as validation samples.
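The sample construction described above (three-standard-deviation outlier removal followed by a random split within each class) might look like the following Python sketch. The record layout, the function names, the use of the sample standard deviation, and the single-pass filtering are all assumptions made for illustration; the source does not specify these details.

```python
import random
import statistics
from typing import Dict, List, Tuple

# Hypothetical record: {"label": "ST" or "NM", "ratios": {ratio_name: value}}
Company = Dict[str, object]


def three_sigma_filter(companies: List[Company], ratio_names: List[str]) -> List[Company]:
    """Drop companies with any ratio at least 3 standard deviations from its mean."""
    stats = {}
    for name in ratio_names:
        values = [c["ratios"][name] for c in companies]
        stats[name] = (statistics.mean(values), statistics.stdev(values))
    kept = []
    for c in companies:
        is_outlier = any(
            sd > 0 and abs(c["ratios"][name] - mean) >= 3 * sd
            for name, (mean, sd) in stats.items()
        )
        if not is_outlier:
            kept.append(c)
    return kept


def stratified_split(
    companies: List[Company], n_train: Dict[str, int], seed: int = 0
) -> Tuple[List[Company], List[Company]]:
    """Randomly pick n_train[label] companies per class for training;
    the remaining companies of those classes form the validation set."""
    rng = random.Random(seed)
    train, validation = [], []
    for label, k in n_train.items():
        group = [c for c in companies if c["label"] == label]
        chosen = set(rng.sample(range(len(group)), k))
        for idx, c in enumerate(group):
            (train if idx in chosen else validation).append(c)
    return train, validation
```

Calling `stratified_split(sample, {"ST": 70, "NM": 80})` on 92 ST and 106 NM records reproduces the 150/48 division reported in the text, with 22 ST and 26 NM companies left for validation.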
Fig. 3. The decision tree model for listed companies' financial distress prediction.
Table 1
Meaning of non-leaf nodes
Non-leaf nodes
Meaning
x30
x14
x1
x32
x15
x4
x17
Table 2
Result of accuracy evaluation
Method                    Sample size    Error (%)    Accuracy (%)
Resubstitution            150            4.67         95.33
Cross-validation          150            15.33        84.67
Independent validation    48             18.75        81.25
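The percentages in Table 2 can be cross-checked with a few lines of Python. The misclassification counts used here (7, 23 and 9) are not reported in the source; they are back-calculated from the error rates and sample sizes and should be read as an inference only.

```python
from typing import Tuple


def rates(misclassified: int, sample_size: int) -> Tuple[float, float]:
    """Return (error %, accuracy %) rounded to two decimals."""
    error = round(100 * misclassified / sample_size, 2)
    accuracy = round(100 * (sample_size - misclassified) / sample_size, 2)
    return error, accuracy


# Inferred counts (assumption): chosen so the rates match Table 2.
inferred_counts = {
    "Resubstitution": (7, 150),
    "Cross-validation": (23, 150),
    "Independent validation": (9, 48),
}

for method, (wrong, n) in inferred_counts.items():
    error, accuracy = rates(wrong, n)
    print(f"{method}: error {error}%, accuracy {accuracy}%")
```

Under these inferred counts, each accuracy value is simply 100 minus the corresponding error rate, which agrees with the three rows of the table.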
construct the decision tree model for listed companies' financial distress prediction.
4. Conclusion
Existing financial distress prediction methods have problems such as a lack of dynamic learning ability and results that are difficult to understand. A data mining method combining AOI, IG and decision trees can overcome these problems and effectively predict listed companies' financial distress. By adopting an entropy-based discretization method to discretize continuous-valued attributes, a data mining model for listed companies' financial distress prediction can be designed to learn dynamically and inductively from a periodically changing database, producing an easily understandable decision tree classification model. The empirical experiment involving 35 financial ratios and 135 pairs of listed companies gave a satisfactory result, which indicates that applying the proposed data mining method to listed companies' financial distress prediction is not only theoretically feasible but also practically effective.
Acknowledgements
This research is partially supported by Zhejiang Provincial Natural Science Foundation of China (Grant No.