Article info
Keywords:
Financial distress
Artificial intelligence
Decision tree classification
Logistic regression
Abstract
Stock and derivative securities markets around the world continue to evolve rapidly. As markets
develop, the operating status of an enterprise is disclosed periodically in its financial statements.
Unfortunately, if the executives of a firm intentionally dress up the financial statements, the
possibility of financial distress cannot be observed in either the short or the long run. Recently,
many financial crises have occurred in international markets, such as the Enron, Kmart, Global
Crossing, WorldCom and Lehman Brothers events. How these financial events affect world business,
especially the financial service industry and investors, has become a public concern. To improve the
accuracy of the financial distress prediction model, this paper referred to the operating rules of the
Taiwan Stock Exchange Corporation (TSEC) and collected 100 listed companies as the initial samples.
The empirical experiment used a total of 37 ratios, comprising financial and non-financial ratios,
and applied principal component analysis (PCA) to extract suitable variables. Decision tree (DT)
classification methods (C5.0, CART, and CHAID) and logistic regression (LR) techniques were used to
implement the financial distress prediction model. Finally, the experiments produced satisfactory
results, which testify to the feasibility and validity of the proposed methods for predicting the
financial distress of listed companies.
This paper makes four critical contributions: (1) the more PCA was applied, the lower the accuracy
obtained by the DT classification approach, whereas PCA had no significant impact on the LR approach;
(2) the closer we get to the actual occurrence of financial distress, the higher the accuracy of the
DT classification approach, with 97.01% correct classification for 2 seasons prior to the occurrence
of financial distress; (3) our empirical results show that PCA increases the error of classifying
companies that are in a financial crisis as normal companies; and (4) the DT classification approach
obtains better prediction accuracy than the LR approach in the short run (less than one year),
whereas the LR approach obtains better prediction accuracy in the long run (above one and a half
years). Therefore, this paper proposes that the artificial intelligence (AI) approach could be a more
suitable methodology than traditional statistics for predicting the potential financial distress of a
company in the short run.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
Recently, some of the most prominent business news has concerned a series of financial crises at
public companies. Some of these companies were famous and originally had high stock prices (e.g.
Enron Corp., Kmart Corp., WorldCom Corp., Lehman Brothers Bank, etc.). When a financial crisis
occurs, it is usually too late for creditors to withdraw their loans and for investors to sell their
stocks, futures, or options. Corporate bankruptcy is therefore a very important economic phenomenon
that affects the economy of every country. In Taiwan, domestic and foreign capital markets have
developed rapidly in recent years, gradually encouraging people to make financial investments.
Nevertheless, Procomp Corp. and Cdbank Corp.
E-mail address: mychen@ntit.edu.tw
0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.02.173
2. Classification techniques
2.1. Decision trees algorithm
Data mining (DM), also known as knowledge discovery in databases (KDD), is the process of
discovering meaningful patterns in huge databases (Han & Kamber, 2001). It is also an application
that can provide significant competitive advantages for making the right decision (Huang, Chen, &
Lee, 2007). The more common model functions in current data mining include classification,
regression, clustering, association rules, summarization, dependency modeling and sequence analysis
(Mitra, Pal, & Mitra, 2002). The decision tree is one of the common DM methodologies and provides
both classification and prediction functions. From the data provided, it produces a tree-shaped
model using inductive reasoning (Chang & Chen, 2009). Many scholars also use artificial neural
network (ANN) techniques to solve classification and prediction problems. However, neural networks
have three main weaknesses. First, they are not guaranteed to converge to a globally optimal
solution. Second, they suffer from the well-known over-training problem. Last, they are black boxes
that lack the ability to explain their behavior (Roiger & Geatz, 2003). The first two problems can
be mitigated by tuning the number of hidden nodes and the learning parameters. However, it remains
difficult to explain how neural networks act and how they make decisions through their layers.
Decision trees are a well-known technique and have had many successful applications to real-world
problems (Kumar & Ravi, 2007). In addition, decision trees can build models from datasets that
include both numerical and categorical data (Witten & Frank, 2005).
Major decision tree algorithms include ID3 (Iterative Dichotomiser 3), C5.0, classification and
regression trees (CART), and the chi-squared automatic interaction detector (CHAID). In the late
1970s, Ross (1993) proposed an algorithm named ID3 to generate decision trees. Based on the theory
of information gain, the ID3 algorithm chooses the attribute with the highest information gain as
the first attribute for branching and thus constructs a simple tree structure. However, ID3 has a
shortcoming: using information gain as the rule for selecting split attributes biases the selection
toward attributes with many distinct values. In the extreme case, when only one record remains in
each sub-tree after the data set is segmented, the information gain is at its highest, even though
such a segmentation is not meaningful (Ross, 1993).
The C4.5 algorithm improves on ID3 with regard to the splitting rule and the calculation method
(Quinlan, 1993). It uses the gain ratio instead of information gain as the measure for selecting
split attributes, which reduces ID3's tendency to prefer splits that produce too many sub-trees.
The C5.0 algorithm is a commercial successor to C4.5, available in products such as Clementine and
from RuleQuest (Quinlan, 1997), and it improves the rule generation of C4.5. It is particularly well
suited to processing enormous datasets. In addition, C5.0 is faster and more memory efficient than
C4.5, and it supports boosting.
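The difference between the ID3 and C4.5 splitting rules can be illustrated with a minimal sketch. The toy data, the attribute names (`firm_id`, `leverage`) and the helper functions are hypothetical and are not taken from the study; the code only implements the textbook definitions of information gain and gain ratio.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def partition(values, labels):
    """Group the class labels by the value of the splitting attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 8 firms, label 1 = distressed, 0 = normal.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
firm_id = list(range(8))  # one distinct value per firm (a many-valued attribute)
leverage = ["high", "high", "high", "high", "high", "low", "low", "low"]

for name, attr in [("firm_id", firm_id), ("leverage", leverage)]:
    print(name, round(information_gain(attr, labels), 3), round(gain_ratio(attr, labels), 3))
```

On this toy data, information gain ranks the meaningless `firm_id` attribute first, while the gain ratio ranks `leverage` first, which is the bias correction described above.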
In CART (Breiman, Friedman, Olshen, & Stone, 1984), the tree classifier is also built by
recursively splitting the instance space into smaller subparts. The CART algorithm generates a
binary decision tree, in which every split creates exactly two children, unlike ID3, which can
create multi-way splits. Both CART and CHAID provide a set of rules that can be applied to an
unclassified dataset to predict which records will have a given result. CART segments a dataset by
creating two-way splits, whereas CHAID segments a dataset using the chi-square test to create
multi-way splits. Generally, CART needs less data preparation than CHAID.
The CHAID algorithm is based on the chi-square test and is constructed by repeatedly splitting
subsets of the space into two or more child nodes, beginning with the entire dataset (Michael &
Gordon, 1997). To decide the best split at any node, pairs of categories of the predictor variable
that do not differ significantly with respect to the target variable are merged, until every
remaining pair is significantly different. The CHAID algorithm naturally handles interactions
between the independent variables, which can be read directly from an examination of the tree.
Although there is no optimal method for obtaining the best segment size, CHAID can help researchers
trade off variance against segment size to discover the most adequate one. The CHAID algorithm also
shows clearly which segmentation variable should come first for large datasets.
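The merging step of CHAID can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: the target is binary, so each pair of categories forms a 2 x 2 table with one degree of freedom, and the category names and counts are hypothetical.

```python
import math
from itertools import combinations

def chi_square_2x2(row_a, row_b):
    """Pearson chi-square statistic and p-value (1 d.f.) for two predictor
    categories against a binary target; each row is (n_normal, n_distressed)."""
    table = [row_a, row_b]
    row_tot = [sum(r) for r in table]
    col_tot = [row_a[0] + row_b[0], row_a[1] + row_b[1]]
    n = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def least_different_pair(counts):
    """Return the pair of categories with the largest p-value (the merge candidate)."""
    return max(combinations(counts, 2),
               key=lambda pair: chi_square_2x2(counts[pair[0]], counts[pair[1]])[1])

# Hypothetical counts of (normal, distressed) firms in three debt-ratio bands.
counts = {"low": (30, 5), "medium": (28, 6), "high": (8, 23)}
pair = least_different_pair(counts)
stat, p = chi_square_2x2(counts[pair[0]], counts[pair[1]])
print(pair, round(stat, 3), round(p, 3))
```

In a full CHAID implementation the selected pair would be merged only if its p-value exceeds a chosen merge threshold, and the search would repeat on the merged categories.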
2.2. Statistical algorithms
The traditional statistical algorithms most widely used for prediction and diagnosis in many
disciplines are discriminant analysis, logistic regression, the Bayesian approach, and multiple
regression. These models have proven very effective, but mainly for solving relatively less complex
problems. LR is a regression method for predicting a dichotomous dependent variable. In producing
the LR equation, the maximum-likelihood ratio is used to determine the statistical significance of
the variables (Hosmer & Lemeshow, 2000). In logistic regression models, the dependent variable is
always categorical and has two or more levels. Independent variables may be numerical or categorical
(Camdeviren, Yazici, Akkus, Bugdayci, & Sungur, 2007). We consider the situation where we observe a
binary outcome variable y and a vector x = (1, x1, x2, ..., xk) of covariates for each of N
\[
\operatorname{Logit}(p) = \ln\frac{p}{1-p} = f(x;\beta) = \beta^{T}x \qquad (1)
\]
where β = (β0, β1, β2, ..., βk) is the vector of the coefficients of the model and β^T is its
transpose. We refer to p/(1 - p) as the odds and to expression (1) as the log-odds or logit
transformation.
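Solving expression (1) for p shows how the logit links to the likelihood of a single observation. The short derivation below is a standard identity added for illustration; it assumes that p denotes P(y = 1 | x), which is not stated explicitly in the surviving text.

```latex
\[
p = \frac{e^{\beta^{T}x}}{1+e^{\beta^{T}x}}, \qquad
1-p = \frac{1}{1+e^{\beta^{T}x}}, \qquad
y\ln p + (1-y)\ln(1-p) = y\,\beta^{T}x - \ln\left(1+e^{\beta^{T}x}\right).
\]
```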
Let D = {(x_l, y_l): l = 1, 2, ..., n_T} be the training data set, where n_T is the number of
samples. Here, we assume that the training sample is a realization of a set of independent and
identically distributed random variables. The unknown regression coefficients β_i, which have to be
estimated from the data, are directly interpretable as log-odds ratios or, in terms of exp(β_i), as
odds ratios. The log-likelihood for the n_T observations is
\[
l(\beta) = \sum_{l=1}^{n_T} \left\{ y_l\,\beta^{T}x_l - \log\left(1 + e^{\beta^{T}x_l}\right) \right\}
\]
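A minimal sketch of how this log-likelihood can be evaluated and maximised is given below. Plain gradient ascent is used purely for illustration (statistical packages typically use Newton-type methods), and the data, the learning rate and the function names are hypothetical.

```python
import math

def log_likelihood(beta, X, y):
    """l(beta) = sum over l of [ y_l * beta^T x_l - log(1 + exp(beta^T x_l)) ]."""
    total = 0.0
    for x_l, y_l in zip(X, y):
        z = sum(b * v for b, v in zip(beta, x_l))
        total += y_l * z - math.log1p(math.exp(z))
    return total

def fit(X, y, lr=0.1, steps=2000):
    """Maximise l(beta) by gradient ascent (an illustrative optimiser only)."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x_l, y_l in zip(X, y):
            z = sum(b * v for b, v in zip(beta, x_l))
            p = 1.0 / (1.0 + math.exp(-z))  # inverse of the logit transformation
            for j, v in enumerate(x_l):
                grad[j] += (y_l - p) * v    # derivative of l(beta) w.r.t. beta_j
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: x = (1, debt ratio), y = 1 for a distressed firm.
X = [(1.0, 0.2), (1.0, 0.3), (1.0, 0.4), (1.0, 0.5),
     (1.0, 0.6), (1.0, 0.7), (1.0, 0.8), (1.0, 0.9)]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit(X, y)
print(beta, log_likelihood(beta, X, y))
```

The leading 1 in each `x` carries the intercept β0, matching the covariate vector x = (1, x1, ..., xk) defined above.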
crisis automatically (Chen & Du, 2009). Li and Sun (2009) presented a multiple case-based reasoning
system by majority voting (Multi-CBRMV) for financial distress prediction. With data collected from
the Shanghai and Shenzhen Stock Exchanges, their experiment showed that the Multi-CBRMV system
outperformed the original CBR model (Li & Sun, 2009). Xidonas et al. (2009) presented an expert
system methodology for supporting decisions on the selection of equities on the basis of financial
analysis. The validity of the proposed methodology was tested through a large-scale application on
the Athens Stock Exchange (Xidonas et al., 2009). Atsalakis and Valavanis (2009) proposed a
neuro-fuzzy adaptive control system to forecast the next day's stock price trends. The proposed
system performed very well in trading simulations, returning results superior to the B&H (buy and
hold) strategy (Atsalakis & Valavanis, 2009). Yip (2004) used case-based reasoning (CBR) with K-NN
to predict business failure among Australian firms. She used statistical evaluations to assign the
relevance of attributes in the retrieval phase of the algorithm. The results showed that CBR with
weighted K-NN was better than discriminant approaches (Yip, 2004). However, few of these studies
compared DT and LR, and even fewer made empirical investigations of topics related to financial
distress prediction. Therefore, we use DT and LR to compare the accuracy of predicting bankruptcy
in a capital market.
3. Research methodology
In this research, we compare the financial distress prediction (FDP) performance of the DT and LR
techniques. The research methodology is shown in Fig. 1. In the FDP choosing phase, we take the
original large datasets from the TSEC and pass them through data pre-processing, which includes
cleaning, normalization, transformation, feature extraction and selection. The product of data
pre-processing is the final training and testing set. The goal of this phase is to choose the
appropriate variables, including financial and non-financial ratios, by means of PCA. The next
phase uses these variables to extract prediction rule sets that are ready to be used in the DT and
LR techniques.
Non-financial factors: including the Dividend Payout ratio, the Price-book ratio, the proportion of
collateralized shares held by the board of directors, and the Insider Holding ratio.
4.3. Factor analysis
This paper collected samples of 50 pairs of financially distressed and non-bankrupt firms listed on
the TSEC between 2000 and 2007. The main variables are 37 ratios that serve as factors of the
financial distress prediction model. This research used the SPSS statistical software to carry out
factor analysis and PCA with varimax rotation (VARIMAX), with the purpose of making the factor
structures simpler and easier to interpret. The selection of factors is based on Kaiser's
criterion, under which a factor with an eigenvalue greater than 1 is a common factor, and a
variable is retained only if its communality is greater than 0.8.
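As a rough check of this selection rule, the sketch below applies Kaiser's criterion to the eigenvalues reported in Table 1 and recomputes each factor's share of the variance. It assumes the 37 ratios were standardised, so that the total variance equals the number of variables; that assumption and the variable names are illustrative and not stated in the paper.

```python
# Eigenvalues of the 11 factors reported for the first factor analysis (Table 1).
eigenvalues = [8.721, 3.752, 2.845, 2.573, 2.113, 1.814,
               1.458, 1.293, 1.237, 1.068, 1.050]
n_variables = 37  # number of input ratios (assumed standardised)

# Kaiser's criterion: keep a factor only if its eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

# With standardised variables each ratio contributes one unit of variance,
# so a factor's share of the total variance is eigenvalue / n_variables.
explained = [100.0 * ev / n_variables for ev in retained]
total_explained = sum(explained)

for k, (ev, pct) in enumerate(zip(retained, explained), start=1):
    print(f"factor {k}: eigenvalue {ev:.3f}, explained variance {pct:.3f}%")
print(f"total explained variance: {total_explained:.2f}%")
```

Because the table lists only factors that already passed the criterion, the filter keeps all 11 of them; the recomputed shares agree with the reported percentages up to rounding of the eigenvalues.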
In total, we assembled 33 financial ratios and 4 non-financial ratios. To reduce dimensionality, we
ran a factor analysis to test whether the differences between these 37 variables were significant
for each variable. A variable was considered non-informative if the difference was not significant
(low communality value). Table 1 shows the factor loadings, communalities, eigenvalues and
explained variance for each variable. The total explained variance was 75.475%. Consequently, 16
variables presented high communality values and were retained in the input vector, while the
remaining 21 variables were discarded because of low communality values. We then ran the factor
analysis a second time. Table 2 shows that four variables were discarded and that the total
explained variance was 84.048%, higher than in the first run. Because the result was still
improving, we could not assume that this solution was optimal, so we ran the factor analysis a
third time. Table 3 shows that no further variables needed to be discarded and that the total
explained variance was 94.009%. Therefore, we can be sure that the third factor analysis was the
optimal one, with the highest total explained variance of 94.009%.
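The three rounds of variable screening can be retraced with a short sketch. It reads the second value column of Tables 1 to 3 as the communalities (an assumption about the table layout) and counts how many variables exceed the 0.8 threshold in each round; the dictionary and variable names are illustrative.

```python
THRESHOLD = 0.8  # communality cut-off used for retaining a variable

# Communality values per round, listed factor by factor as in Tables 1 to 3.
rounds = {
    "1st": [0.867, 0.851, 0.825, 0.811, 0.746, 0.746, 0.539,
            0.860, 0.814, 0.810, 0.792, 0.776, 0.635,
            0.821, 0.781, 0.713,
            0.970, 0.967, 0.960, 0.923, 0.894, 0.869, 0.815,
            0.785, 0.634, 0.534,
            0.768, 0.730,
            0.758, 0.714, 0.451,
            0.890, 0.797,
            0.644, 0.566,
            0.503,
            0.367],
    "2nd": [0.925, 0.893, 0.889, 0.776,
            0.923, 0.885, 0.840, 0.809,
            0.887, 0.884,
            0.974, 0.973, 0.965,
            0.696, 0.495,
            0.776],
    "3rd": [0.932, 0.931, 0.930,
            0.952, 0.950,
            0.975, 0.944,
            0.880, 0.826,
            0.994, 0.992, 0.975],
}

# Number of variables whose communality exceeds the threshold in each round.
kept = {name: sum(c > THRESHOLD for c in values) for name, values in rounds.items()}

for name, values in rounds.items():
    print(f"{name} factor analysis: {len(values)} variables in, {kept[name]} retained")
```

Under this reading, the counts follow the sequence described in the text: 37 variables reduce to 16, then to 12, and the third round discards none.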
After the three rounds of factor analysis, 12 variables presented high communality values. These
variables were used in the input vector, while the remaining 25 variables were discarded. The
selected variables were Return on Equity (ROE), Earnings per Share (EPS), Return on Asset (ROA),
Cash Flow Ratio, Cash Flow to Total Debt Ratio, Current Ratio, Acid-Test Ratio, Inventory to Total
Assets Ratio, Current Assets to Total Assets, Gearing Ratio, Debt to Equity Ratio, and Debt/Equity
(DE).
5. The FDP modeling phase
5.1. DT experiments and results
This process uses the financial and non-financial ratios and constructs a financial distress
prediction model after carrying out the second factor analysis. The variables are then loaded as DT
and LR input nodes. In addition, we apply these experiment parameters to the past 2, 4, 6, and 8
seasons before the financial distress occurred, in order to compare prediction accuracy. In this
experiment, we use C5.0, CART, and CHAID as the DT algorithms. The training and testing samples are
split in a 70:30 ratio.
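A 70:30 partition of this kind can be sketched as a simple random split. The paper does not state how the split was drawn, so the record layout (one record per firm and season), the seed and the function name below are assumptions made only for illustration.

```python
import random

def split_70_30(records, train_share=0.7, seed=42):
    """Shuffle the records reproducibly and cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: 100 firms observed over the past 2 seasons.
records = [(firm, season) for firm in range(100) for season in (1, 2)]
train, test = split_70_30(records)
print(len(train), len(test))
```

Fixing the seed keeps the partition identical across the DT and LR runs, so that differences in accuracy come from the models and not from the sampling.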
In bankruptcy prediction, the accuracy of a prediction is routinely measured by three quantities:
the Type I error rate, the Type II error rate, and the total error rate. Type I Error
Rate means that the error rate for the risk cannot categorize
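Table 1
1st Factor analysis results.
Factor | Variables | Factor loadings | Communality | Eigenvalue | Explained variance (%)
1 | (not available) | 0.394, 0.722, 0.426, 0.721, 0.805, 0.436, 0.403 | 0.867, 0.851, 0.825, 0.811, 0.746, 0.746, 0.539 | 8.721 | 23.570
2 | (not available) | 0.770, 0.535, 0.304, 0.654, 0.509, 0.484 | 0.860, 0.814, 0.810, 0.792, 0.776, 0.635 | 3.752 | 10.142
3 | (not available) | 0.444, 0.116, 0.451 | 0.821, 0.781, 0.713 | 2.845 | 7.690
4 | Gearing Ratio; Debt to Equity Ratio; Debt Equity Ratio; Earnings Per Share Ratio; Return on Equity Ratio; Return on Asset Ratio; Margin Before Interest and Tax Ratio | 0.401, 0.397, 0.402, 0.412, 0.358, 0.434, 0.248 | 0.970, 0.967, 0.960, 0.923, 0.894, 0.869, 0.815 | 2.573 | 6.954
5 | (not available) | 0.426, 0.165, 0.350 | 0.785, 0.634, 0.534 | 2.113 | 5.712
6 | (not available) | 0.390, 0.329 | 0.768, 0.730 | 1.814 | 4.903
7 | (not available) | 0.330, 0.281, 0.295 | 0.758, 0.714, 0.451 | 1.458 | 3.941
8 | (not available) | 0.245, 0.361 | 0.890, 0.797 | 1.293 | 3.493
9 | (not available) | 0.274, 0.382 | 0.644, 0.566 | 1.237 | 3.343
10 | (not available) | 0.270 | 0.503 | 1.068 | 2.888
11 | (not available) | 0.301 | 0.367 | 1.050 | 2.839
Total explained variance (%): 75.475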
Table 2
2nd Factor analysis results.
Factor | Variables | Factor loadings | Communality | Eigenvalue | Explained variance (%)
1 | (not available) | 0.766, 0.768, 0.748, 0.720 | 0.925, 0.893, 0.889, 0.776 | 6.326 | 35.146
2 | (not available) | 0.551, 0.515, 0.464, 0.416 | 0.923, 0.885, 0.840, 0.809 | 2.597 | 14.430
3 | (not available) | 0.637, 0.807 | 0.887, 0.884 | 2.193 | 12.185
4 | Gearing Ratio; Debt to Equity Ratio; Debt Equity Ratio | 0.521, 0.516, 0.502 | 0.974, 0.973, 0.965 | 1.739 | 9.661
5 | (not available) | 0.304, 0.303 | 0.696, 0.495 | 1.269 | 7.051
6 | Debt Ratio | 0.228 | 0.776 | 1.004 | 5.575
Total explained variance (%): 84.048
Table 3
3rd Factor analysis results.
Factor | Variables | Factor loadings | Communality | Eigenvalue | Explained variance (%)
1 | (not available) | 0.755, 0.748, 0.746 | 0.932, 0.931, 0.930 | 4.803 | 40.028
2 | (not available) | 0.802, 0.785 | 0.952, 0.950 | 2.107 | 17.558
3 | Current Ratio; Acid-Test Ratio | 0.675, 0.661 | 0.975, 0.944 | 1.870 | 15.585
4 | (not available) | 0.509, 0.656 | 0.880, 0.826 | 1.495 | 12.458
5 | Gearing Ratio; Debt to Equity Ratio; Debt Equity Ratio | 0.475, 0.481, 0.446 | 0.994, 0.992, 0.975 | 1.006 | 8.380
Total explained variance (%): 94.009
Table 4
The relationship with Type I, II, and Total Error Rates.
Actually \ Prediction | Normal | Bankruptcy | Sum
Normal | Y1 | Y2 | Y3
Bankruptcy | Y4 | Y5 | Y6
Sum | Y7 | Y8 | Y9
Type I error rate = Y2/Y3; Type II error rate = Y4/Y6; Total error rate = (Y2 + Y4)/Y9.
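The three error rates can be computed directly from the cells of Table 4, read row by row as Y1 to Y9 with actual classes in rows and predicted classes in columns. The sketch below is illustrative: the function name is hypothetical, and the counts are hypothetical values chosen only because they are consistent with the 2-season testing percentages later reported for C5.0 without factor analysis; the paper does not report raw counts.

```python
def error_rates(y1, y2, y4, y5):
    """Error rates from a 2 x 2 confusion matrix.

    y1: normal firms predicted normal      y2: normal firms predicted bankrupt
    y4: bankrupt firms predicted normal    y5: bankrupt firms predicted bankrupt
    """
    y3 = y1 + y2  # all actually normal firms
    y6 = y4 + y5  # all actually bankrupt firms
    y9 = y3 + y6  # all firms
    return {
        "type_I": y2 / y3,         # normal firm flagged as bankrupt
        "type_II": y4 / y6,        # bankrupt firm classified as normal
        "total": (y2 + y4) / y9,   # overall misclassification rate
    }

# Hypothetical counts: 34 of 36 normal and 31 of 31 bankrupt firms classified correctly.
rates = error_rates(y1=34, y2=2, y4=0, y5=31)
accuracy = 1.0 - rates["total"]
print({k: round(100 * v, 2) for k, v in rates.items()}, round(100 * accuracy, 2))
```

The accuracy rate quoted in the experiments is the complement of the total error rate, so an accuracy of 97.01% corresponds to a total error rate of 2.99%.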
Therefore, we can observe that the closer the financial crisis, the higher the accuracy of the C5.0
analysis.
This experiment uses the 16 original ratio variables retained after the 1st factor analysis. As
shown in Table 6, the testing data has an estimated accuracy rate as high as 92.53%, with an error
rate of 7.47%, for the past 2 seasons. However, the accuracy rate falls to 82.07%, and the error
rate clearly rises to 17.93%, when measured over the past 8 seasons. As in the experiment above,
the closer the financial crisis, the higher the accuracy.
This experiment uses the 12 original ratio variables retained after the 2nd factor analysis. As
shown in Table 7, the testing data has an estimated accuracy rate as high as 92.54%, with an error
rate of 7.46%, for the past 2 seasons. However, the accuracy rate falls to 81.58%, and the error
rate clearly rises to 18.42%, when measured over the past 8 seasons. As in the experiment above,
the closer the financial crisis, the higher the accuracy.
5.1.2. The experiment with CART analysis
The CART analysis uses the same training and testing datasets as the experiments above. As shown in
Table 8, the testing data has an estimated accuracy rate as high as 95.83%, with an error rate of
4.17%, for the past 2 seasons. However, the accuracy rate falls to 85.89%, and the error rate
clearly rises to 14.11%, when measured over the past 8 seasons. Therefore, we can observe that the
closer the financial crisis, the higher the accuracy of the CART analysis.
Next, we use the 16 ratio variables retained after the 1st factor analysis. As shown in Table 9,
the testing data has an estimated accuracy rate as high as 94.03%, with an error rate of 5.97%, for
the past 2 seasons. However, the accuracy rate falls to 84.65%, and the error rate visibly rises to
15.35%, when measured
Table 5
The accuracy for the C5.0 model with non-factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 100% | 99.25% | 98.55% | 94.44% | 97.01% | 100%
Past 4 seasons | 98.54% | 98.52% | 98.49% | 90.47% | 91.50% | 92.53%
Past 6 seasons | 99.06% | 98.09% | 97.08% | 91.95% | 88.95% | 86.17%
Past 8 seasons | 97.92% | 98.21% | 98.51% | 85.58% | 88.80% | 91.53%
Table 6
The accuracy for the C5.0 model with 1st factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 99.30% | 99.11% | 98.88% | 95.45% | 92.53% | 90.00%
Past 4 seasons | 97.81% | 98.52% | 99.24% | 90.47% | 92.53% | 91.04%
Past 6 seasons | 95.77% | 94.03% | 92.23% | 88.50% | 84.53% | 80.85%
Past 8 seasons | 99.30% | 99.11% | 98.88% | 81.08% | 82.07% | 83.07%
Table 7
The accuracy for the C5.0 model with 2nd factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 98.43% | 98.50% | 98.55% | 88.88% | 92.54% | 96.77%
Past 4 seasons | 96.35% | 97.41% | 98.49% | 88.65% | 90.59% | 92.53%
Past 6 seasons | 97.65% | 96.43% | 95.63% | 89.65% | 84.53% | 79.78%
Past 8 seasons | 96.88% | 96.78% | 96.66% | 81.08% | 81.58% | 82.08%
Table 8
The accuracy for the CART model with non-factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 93.75% | 96.24% | 98.55% | 91.66% | 95.83% | 100%
Past 4 seasons | 90.51% | 94.07% | 97.74% | 90.47% | 92.24% | 94.02%
Past 6 seasons | 94.36% | 93.32% | 92.23% | 79.31% | 78.45% | 77.65%
Past 8 seasons | 94.11% | 94.45% | 94.81% | 81.08% | 85.89% | 90.00%
Table 9
The accuracy for the CART model with 1st factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 98.43% | 98.5% | 98.55% | 96.77% | 94.03% | 91.66%
Past 4 seasons | 91.24% | 94.07% | 96.99% | 88.88% | 91.04% | 91.04%
Past 6 seasons | 88.73% | 89.98% | 91.26% | 78.16% | 75.14% | 72.34%
Past 8 seasons | 91.34% | 91.77% | 92.22% | 84.68% | 84.65% | 84.61%
over the past 8 seasons. As in the experiment above, the closer the financial crisis, the higher
the accuracy.
Finally, we use the 12 ratio variables retained after the 2nd factor analysis. As shown in Table
10, the testing data has an estimated accuracy rate as high as 94.03%, with an error rate of 5.97%,
for the past 2 seasons. However, the accuracy rate falls to 85.89%, and the error rate rises to
14.11%, when measured over
Table 10
The accuracy for the CART model with 2nd factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 98.43% | 98.5% | 98.55% | 91.66% | 94.03% | 96.77%
Past 4 seasons | 94.16% | 96.67% | 99.24% | 92.06% | 90.77% | 89.55%
Past 6 seasons | 89.20% | 89.26% | 89.32% | 78.16% | 74.59% | 71.27%
Past 8 seasons | 91.00% | 90.70% | 90.37% | 89.18% | 85.89% | 83.07%
Table 11
The accuracy for the CHAID model with non-factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 98.43% | 97.74% | 97.10% | 92.06% | 92.29% | 92.53%
Past 4 seasons | 97.81% | 98.52% | 99.24% | 87.30% | 89.91% | 92.53%
Past 6 seasons | 94.83% | 97.14% | 99.51% | 73.56% | 80.11% | 86.17%
Past 8 seasons | 93.77% | 93.92% | 94.07% | 89.18% | 88.38% | 87.69%
shown in Table 11, the testing data has an estimated accuracy rate as high as 92.29%, with an error
rate of 7.71%, for the past 2 seasons. However, the accuracy rate falls to 88.38%, and the error
rate rises to 11.62%, when measured over the past 8 seasons. Therefore, we can observe that the
closer the financial crisis, the higher the accuracy of the CHAID analysis.
Then, we use the 16 original ratio variables retained after the 1st factor analysis. As shown in
Table 12, the testing data has an estimated accuracy rate as high as 92.54%, with an error rate of
7.46%, for the past 2 seasons. However, the accuracy rate falls to 84.65%, and the error rate rises
to 15.35%, when measured over the past 8 seasons.
Finally, we use the final 12 original ratio variables retained after the 2nd factor analysis. As
shown in Table 13, the testing data has an estimated accuracy rate as high as 92.09%, with an error
rate of 7.91%, for the past 2 seasons. However, the accuracy rate falls to 84.65%, and the error
rate rises to 15.35%, when measured over the past 8 seasons. As with the non-factor analysis and
the 1st factor analysis, the closer the financial crisis, the higher the accuracy of the CHAID
algorithm.
Table 12
The accuracy for the CHAID model with 1st factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 100% | 98.50% | 97.10% | 91.66% | 92.54% | 93.54%
Past 4 seasons | 97.81% | 98.15% | 98.49% | 87.30% | 88.42% | 89.55%
Past 6 seasons | 91.07% | 93.56% | 96.11% | 73.56% | 76.24% | 78.72%
Past 8 seasons | 91.00% | 93.56% | 96.29% | 82.88% | 84.65% | 86.15%
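Table 13
The accuracy for the CHAID model with 2nd factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 98.43% | 98.50% | 98.55% | 91.66% | 92.09% | 92.53%
Past 4 seasons | 96.35% | 97.78% | 99.24% | 87.30% | 89.23% | 91.04%
Past 6 seasons | 78.91% | 86.87% | 94.17% | 85.05% | 85.64% | 86.17%
Past 8 seasons | 89.61% | 93.56% | 97.77% | 81.08% | 84.65% | 87.69%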
Table 14
The accuracy for the logistic regression model with non-factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 100% | 100% | 100% | 86.11% | 85.07% | 83.87%
Past 4 seasons | 84.67% | 87.59% | 90.51% | 88.88% | 89.21% | 89.55%
Past 6 seasons | 93.89% | 94.03% | 94.17% | 85.05% | 90.06% | 94.68%
Past 8 seasons | 94.11% | 93.38% | 92.59% | 91.89% | 91.70% | 91.53%
Table 15
The accuracy for the logistic regression model with 1st factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 100% | 100% | 100% | 86.11% | 86.60% | 87.09%
Past 4 seasons | 90.51% | 91.85% | 93.23% | 88.88% | 89.21% | 89.55%
Past 6 seasons | 89.67% | 87.59% | 85.43% | 88.50% | 85.08% | 81.91%
Past 8 seasons | 91.00% | 89.98% | 88.88% | 86.48% | 85.06% | 83.84%
Table 16
The accuracy for the logistic regression model with 2nd factor analysis.
Period | Training normal | Training average | Training bankruptcy | Testing normal | Testing average | Testing bankruptcy
Past 2 seasons | 95.31% | 93.98% | 92.75% | 75.00% | 80.60% | 87.09%
Past 4 seasons | 87.59% | 86.30% | 84.96% | 82.53% | 82.31% | 82.08%
Past 6 seasons | 88.26% | 84.49% | 80.58% | 87.35% | 82.87% | 78.72%
Past 8 seasons | 86.15% | 83.18% | 80.00% | 85.58% | 81.74% | 77.69%
[Figures: bar charts of the accuracy rate and the Type II error rate of the C5.0, CART, CHAID and LR
models for the past 2, 4, 6 and 8 seasons, with no factor analysis and after the 1st and 2nd factor
analysis.]
Fig. 10. The accuracy rate with non-factor analysis for the C5.0, CART, CHAID and
logistic regression.
Therefore, the C5.0 and CHAID models have lower Type II error rates, whereas the CART model has a
higher Type II error rate in financial bankruptcy prediction.
As shown in Fig. 9, the Type II error rate of the LR model is presented for the non-factor
analysis, after the first factor analysis, and after the second factor analysis. It indicates that
the Type II error rate has approximately the same increasing trend as in the DT models, while the
accuracy rate decreases as in the DT models. The only exception is the Type II error rate over the
past 2 seasons, which is better in the 2nd factor analysis than in the non-factor analysis.
Nevertheless, in summary, the closer the crisis point, the lower the Type II error rate in the LR
model. The only exception is the non-factor analysis, in which the Type II error rate is better
over the past 8 seasons than over the past 2 seasons.
7. Conclusions
This research focused on the financial and non-financial ratios in financial statements and used
the DT and LR models to compare the performance of financial distress predictions, in order to find
a better early-warning method. We took 50 companies that were facing a financial crisis and matched
them with 50 normal companies from similar industries. We obtained the necessary data from the TSEC
database and sampled it for the past 2, 4, 6, and 8 seasons prior to the occurrence of the
financial crisis. These data were then used to carry out a statistical factor analysis, and the
resulting ratio variables were fed into the DT and LR methods for comparison.
After the experiments, we summarized four critical contributions. First, the more times we applied
PCA, the less accurate the results of the DT and LR approaches. In our experiments, we found that
applying all 37 variables without factor analysis to the DT and LR models gave better prediction
performance, with the sole exception of the past 6 seasons in the CHAID model.
Second, the closer we get to the time of the actual financial distress, the more accurate the
prediction of the DT models. For example, the accuracy rate of C5.0 with non-factor analysis is
97.01% for 2 seasons before the financial distress occurs, but only 88.80% for 8 seasons. However,
the results differ for the LR model, where the accuracy rates with non-factor analysis for 2 and 8
seasons before the occurrence of financial distress are 85.07% and 91.70%, respectively.
Third, most investors are concerned with the Type II error rate, because they want to avoid
investing in companies that are heading for distress. Our empirical results show that factor
analysis increases the error of classifying companies with a potential financial crisis as normal
companies. Moreover, we found that the average Type II error rate of the LR model is higher than
that of the DT models. Therefore, the prediction performance of the LR approach is more strongly
influenced than that of the DT models.
Finally, the DT approach obtains better prediction accuracy than the LR approach in developing a
financial distress prediction model, with the exception that the accuracy rate (non-factor and 1st
factor analysis) for the past 6 and 8 seasons is higher with the LR model. Therefore, the DT
approach is suitable for financial distress prediction in the short run, whereas the LR approach is
appropriate for long-run prediction of financial distress.
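The short-run versus long-run pattern can be read off the reported testing averages. The sketch below uses the non-factor-analysis averages from Tables 5, 8, 11 and 14 and picks the most accurate model at each horizon; the data structure and variable names are illustrative.

```python
horizons = [2, 4, 6, 8]  # seasons before the occurrence of financial distress

# Average testing accuracy (%) with non-factor analysis, one value per horizon.
accuracy = {
    "C5.0":  [97.01, 91.50, 88.95, 88.80],
    "CART":  [95.83, 92.24, 78.45, 85.89],
    "CHAID": [92.29, 89.91, 80.11, 88.38],
    "LR":    [85.07, 89.21, 90.06, 91.70],
}

# Most accurate model at each horizon.
best = {h: max(accuracy, key=lambda m: accuracy[m][i]) for i, h in enumerate(horizons)}

for h in horizons:
    print(f"past {h} seasons: best model = {best[h]}")
```

With these figures a DT model is the most accurate at 2 and 4 seasons, and LR is the most accurate at 6 and 8 seasons, which is the pattern summarised in the final contribution.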
In future research, additional macroeconomic indices and technical indicators could be considered
as input variables to expand the explanatory capability of the proposed models. Moreover,
additional artificial intelligence techniques, such as neural network models, genetic algorithms,
and others, could also be applied. Researchers could also extend the system to deal with more
financial datasets.
Acknowledgements
We thank the National Scientific Council (NSC) of the
Republic of China (ROC) for supporting this work under Grant No. NSC 96-2416-H-018-011. We also gratefully acknowledge the Editor and
anonymous reviewers for their valuable comments and constructive suggestions.
References
Altman, E. L. (1968). Financial ratios, discriminant analysis and the prediction of
corporate bankruptcy. The Journal of Finance, 23(3), 589609.
Altman, E. L., Edward, I., Haldeman, R., & Narayanan, P. (1977). A new model to
identify bankruptcy risk of corporations. Journal of Banking and Finance, 1,
2954.
Atsalakis, G. S., & Valavanis, K. P. (2009). Forecasting stock market short-term trends
using a neuro-fuzzy based methodology. Expert Systems with Applications, 36,
1069610707.
Beaver, W. (1966). Financial ratios as predictors of failure, empirical research in
accounting: Selected studied. Journal of Accounting Research, 71, 111.
Blum, M. (1974). Failing company discriminant analysis. Journal of Accounting
Research, 1, 25.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classication and
regression trees. California: Wadsworth.
Camdeviren, H. A., Yazici, A. C., Akkus, Z., Bugdayci, R., & Sungur, M. A. (2007).
Comparison of logistic regression model and classication tree: An application
to postpartum depression data. Expert Systems with Applications, 32, 987994.
Chang, C. L., & Chen, C. H. (2009). Applying decision tree and neural network to
increase quality of dermatologic diagnosis. Expert Systems with Applications, 36,
40354041.
Chen, M. Y., & Du, Y. K. (2009). Using neural networks and data mining techniques
for the nancial distress prediction model. Expert Systems with Applications,
36(2), 40754086.
Dimitras, A. I., Zanakis, S. H., & Zopounidis, C. (1996). A survey of business failure
with an emphasis on prediction methods and industrial applications. European
Journal of Operational Research, 90(3), 487513.
Fanning, K., & Cogger, K. (1998). Neural network detection of management fraud
using published nancial data. International Journal of Intelligent Systems in
Accounting, Finance and Management, 7(1), 2124.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco,
CA, USA: Morgan Kaufmann.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.
Hua, Z., Wang, Y., Xu, X., Zhang, B., & Liang, L. (2007). Predicting corporate nancial
distress based on integration of support vector machine and logistic regression.
Expert Systems with Applications, 33(2), 434440.
Huang, M. J., Chen, M. Y., & Lee, S. C. (2007). Integrating data mining with casebased reasoning for chronic diseases prognosis and diagnosis. Expert Systems
with Applications, 32(3), 856867.
Kinney, W., & McDaniel, L. (1989). Characteristics of rms correcting previously
reported quarterly earnings. Journal of Accounting and Economics, 11(1), 7193.
Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the
detection of fraudulent nancial statements. Expert Systems with Applications,
32(4), 9951003.
Kumar, P. R., & Ravi, V. (2007). Bankruptcy prediction in banks and rms via
statistical and intelligent techniques A review. European Journal of Operational
Research, 180, 128.
Laitinen, E. K., & Laitinen, T. (2000). Bankruptcy prediction application of the
Taylors expansion in logistic regression. International Review of Financial
Analysis, 9, 327349.
Lensberg, T., Eilifsen, A., & McKee, T. E. (2006). Bankruptcy theory development and
classication via genetic programming. European Journal of Operational
Research, 169, 677697.
Li, H., & Sun, J. (2009). Majority voting combination of multiple case-based
reasoning for nancial distress prediction. Expert Systems with Applications, 36,
43634373.
Michael, J. A., & Gordon, S. L. (1997). Data mining technique for marketing, sales and
customer support. New York: Wiley.
Mitra, S., Pal, S. K., & Mitra, P. (2002). Data mining in soft computing framework: A
survey. IEEE Transactions Neural Networks, 13(1), 314.
McKee, T. E., & Lensberg, T. (2002). Genetic programming and rough sets: A hybrid
approach to bankruptcy classication. European Journal of Operational Research,
138, 436451.
Meyer, P. A., & Pifer, H. (1970). Prediction of bank failures. The Journal of Finance, 25,
853868.
Persons, O. (1995). Using nancial statement data to identify factors associated with
fraudulent nancial reporting. Journal of Applied Business Research, 11(3), 3846.
Quinlan, J. R. (1993). Programs for machine learning. San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1997). C5. 0 and see 5: Illustrative examples. RuleQuest Research.
<http://www.rulequest.com>.
Roiger, R. J., & Geatz, M. W. (2003). Data mining: A tutorial-based primer. Boston, MA:
Addison Wesley.
Ross, Q. J. (1993). C45: Programs for machine learning. Morgan Kaufmann Publishers.
Spathis, C., Doumpos, M., & Zopounidis, C. (2002). Detecting falsied nancial
statements: A comparative study using multicriteria analysis and multivariate
statistical techniques. The European Accounting Review, 11(3), 509535.
Stice, J. (1991). Using nancial and market information to identify pre-engagement
market factors associated with lawsuits against auditors. The Accounting Review,
66(3), 516533.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and
techniques. California: Morgan Kaufmann.
Xidonas, P., Ergazakis, E., Ergazakis, K., Metaxiotis, K., Askounis, D., Mavrotas, G.,
et al. (2009). On the selection of equity securities: An expert systems
methodology and an application on the Athens Stock Exchange. Expert
Systems with Applications, 36(2009), 1196611980.
Yip, A. Y. N. (2004). Predicting business failure with a case-based reasoning
approach, lecture notes in computer science. In M. G. Negoita, R. J. Howlett, & L.
C. Jain (Eds.), Knowledge-based intelligent information and engineering systems:
8th international conference, KES 2004, Wellington, New Zealand, September 3215/
2004, proceedings, Part III (pp. 2025).
Zopounidis, C., & Dimitras, A. (1998). Multicriteria decision aid methods for the
prediction of business failure. Dordrecht: Kluwer Academic Publishers.