Chen 2011

Expert Systems with Applications 38 (2011) 11261-11272
Predicting corporate financial distress based on integration of decision tree classification and logistic regression
Mu-Yen Chen
Department of Information Management, National Taichung Institute of Technology, Taichung 404, Taiwan, ROC
Article info
Keywords:
Financial distress
Artificial intelligence
Decision tree classification
Logistic regression
Abstract
Stock and derivative securities markets around the world continue to evolve rapidly. As markets develop, the operating status of enterprises is disclosed periodically in financial statements. Unfortunately, if executives intentionally dress up the financial statements, no possibility of financial distress can be observed in either the short or the long run. Many financial crises have occurred recently in international markets, such as the Enron, Kmart, Global Crossing, WorldCom and Lehman Brothers events. How these events affect world business, especially the financial services industry and investors, has become a public concern. To improve the accuracy of the financial distress prediction model, this paper referred to the operating rules of the Taiwan Stock Exchange Corporation (TSEC) and collected 100 listed companies as the initial samples. The empirical experiment used a total of 37 ratios, composed of financial and non-financial ratios, and applied principal component analysis (PCA) to extract suitable variables. Decision tree (DT) classification methods (C5.0, CART, and CHAID) and logistic regression (LR) techniques were used to implement the financial distress prediction model. The experiments produced satisfactory results, which testify to the feasibility and validity of the proposed methods for predicting the financial distress of listed companies.
This paper makes four critical contributions: (1) the more PCA was applied, the lower the accuracy obtained by the DT classification approach, whereas PCA had no significant impact on the LR approach; (2) the closer to the actual occurrence of financial distress, the higher the accuracy of the DT classification approach, with 97.01% correct for 2 seasons prior to the occurrence of financial distress; (3) the empirical results show that PCA increases the error of classifying companies in financial crisis as normal companies; and (4) the DT classification approach obtains better prediction accuracy than the LR approach in the short run (less than one year), whereas the LR approach obtains better prediction accuracy in the long run (more than one and a half years). Therefore, this paper proposes that the artificial intelligence (AI) approach could be a more suitable methodology than traditional statistics for predicting the potential financial distress of a company in the short run.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
Recently, some of the most attention-grabbing business news has been a series of financial crises at public companies. Some of these companies were famous and had traded at high stock prices (e.g. Enron Corp., Kmart Corp., WorldCom Corp., Lehman Brothers Bank, etc.). When a financial crisis strikes, it is usually too late for creditors to withdraw their loans and for investors to sell their stocks, futures, or options. Corporate bankruptcy is therefore a very important economic phenomenon that affects the economy of every country. In Taiwan, domestic and foreign capital markets have developed rapidly in recent years, gradually giving people the idea of making financial investments. Nevertheless, the bankruptcies of Procomp Corp. and Cdbank Corp. caused tremendous disorder in the financial market, and related industries in Taiwan were also affected by these economic shocks. The number of bankrupt firms is important for the economy of a country and can be viewed as an indicator of the development and robustness of that economy (Zopounidis & Dimitras, 1998). The high individual, economic, and social costs of corporate failures and bankruptcies have spurred the search for better understanding and prediction capability (McKee & Lensberg, 2002). Forecasting corporate financial distress therefore plays an increasingly important role in today's society, since it has a significant impact on lending decisions and on the profitability of financial institutions.
E-mail address: mychen@ntit.edu.tw
doi:10.1016/j.eswa.2011.02.173
A common approach to bankruptcy prediction is to survey the literature for a large set of potentially predictive financial and/or non-financial variables and then, through traditional mathematical analysis, reduce it to a smaller set of significant variables that will predict bankruptcy (Lensberg, Eilifsen, & McKee, 2006). Many traditional classification techniques have been presented to predict financial distress using ratios, e.g., univariate approaches (Beaver, 1966), multivariate approaches, linear multiple discriminant approaches (MDA) (Altman, 1968; Altman, Edward, Haldeman, & Narayanan, 1977), multiple regression (Meyer & Pifer, 1970), logistic regression (Dimitras, Zanakis, & Zopounidis, 1996), factor analysis (Blum, 1974), and stepwise methods (Laitinen & Laitinen, 2000). However, the strict assumptions of traditional statistics, such as linearity, normality, independence among predictor variables, and a pre-existing functional form relating the criterion variable to the predictor variables, limit their application in the real world (Hua, Wang, Xu, Zhang, & Liang, 2007).
Therefore, this paper proposes a financial distress prediction model that compares decision tree (DT) classification and logistic regression (LR) techniques. The main objectives of this paper are to (1) adopt DT and LR techniques to construct a financial distress prediction model, (2) use financial and non-financial ratios to enhance the accuracy of the model, (3) employ a traditional statistical method (principal component analysis, PCA) to compare its degree of accuracy with that of the artificial intelligence (AI) approach, and (4) expand this model so that it can work within a financial distress prediction system that provides information to investors and investment monitoring organizations. The data for our experiment were collected from the Taiwan Stock Exchange Corporation (TSEC) database.
The rest of this paper is organized as follows. Section 2 reviews the related techniques. Section 3 describes the proposed approach and the role of each of its steps. Section 4 presents the process of choosing appropriate variables by PCA. Section 5 reports several experiments and analyzes the prediction performance of our approach. Section 6 compares the results of the DT and LR approaches. Finally, Section 7 draws conclusions and discusses future research.
2. Classification techniques
2.1. Decision tree algorithms
Data mining (DM), also known as knowledge discovery in databases (KDD), is the process of discovering meaningful patterns in huge databases (Han & Kamber, 2001). It is also an application that can provide significant competitive advantages for making the right decision (Huang, Chen, & Lee, 2007). The most common model functions in the current data mining process include classification, regression, clustering, association rules, summarization, dependency modeling and sequence analysis (Mitra, Pal, & Mitra, 2002). The decision tree is one of the common DM methodologies that provide both classification and prediction functions. From the data provided, it produces a tree-shaped model using inductive reasoning (Chang & Chen, 2009). Many scholars also use artificial neural network (ANN) techniques to solve classification and prediction problems. However, neural networks have three main weaknesses. First, they are not guaranteed to converge to a globally optimal solution. Second, they suffer from the well-known over-training problem. Last, they are black boxes that lack the ability to explain their behavior (Roiger & Geatz, 2003). The first two problems have been addressed by setting the number of hidden nodes and the learning parameters. However, it remains difficult to explain how neural networks act and how they make decisions through their layers. Decision trees are a well-known technique and have had many successful applications to real-world problems (Kumar & Ravi, 2007). In addition, decision trees can build models from datasets that include both numerical and categorical data (Witten & Frank, 2005).
Major decision tree algorithms include ID3 (Iterative Dichotomiser 3), C5.0, classification and regression trees (CART), and the chi-squared automatic interaction detector (CHAID). In the late 1970s, Ross (1993) proposed an algorithm named ID3 to generate decision trees. Based on the theory of information gain, the ID3 algorithm chooses the attribute with the highest information gain as the first attribute for branching and thus constructs a simple tree structure. However, ID3 has a shortcoming: using information gain as the rule for selecting splitting attributes biases the selection toward attributes with many distinct values. In the extreme case where only one record remains in each sub-tree after segmentation, the information gain is the highest, even though such a segmentation has little meaning (Ross, 1993).
The C4.5 algorithm improves ID3 with regard to the splitting rule and the calculation method (Quinlan, 1993). It uses the gain ratio instead of information gain to select splitting attributes, and thus reduces ID3's tendency to prefer splits that produce too many sub-trees. The C5.0 algorithm is a commercial version of C4.5, available in packages such as Clementine and RuleQuest (Quinlan, 1997), and it improves the rule generation of C4.5. It is particularly well suited to processing enormous datasets. C5.0 is also faster and more memory-efficient than C4.5, and it adopts the boosting method.
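The difference between the two splitting criteria can be sketched in a few lines of Python. This is a minimal illustration, not the C4.5 or C5.0 implementation; the function names and the toy records (`firm_id`, `leverage`) are hypothetical and chosen only to show why an attribute with one record per branch maximizes information gain but is penalized by the gain ratio.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def split_by(values, labels):
    """Group class labels by the attribute value of each record."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g)
                    for g in split_by(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the branch sizes
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: firm_id puts every record in its own branch.
labels = ["distress", "distress", "normal", "normal"]
firm_id = ["a", "b", "c", "d"]
leverage = ["high", "high", "low", "low"]

print(information_gain(firm_id, labels), information_gain(leverage, labels))
print(gain_ratio(firm_id, labels), gain_ratio(leverage, labels))
```

Both attributes reach the same information gain on this toy data, but the gain ratio of `firm_id` is halved because its split information is larger, which is the correction described above.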
In CART (Breiman, Friedman, Olshen, & Stone, 1984), the tree classifier is likewise built by recursively splitting the instance space into smaller subparts. The CART algorithm generates a binary decision tree, creating only two children at each node, unlike ID3, which can create more than two. Both CART and CHAID provide a set of rules that can be applied to an unclassified dataset to predict which records will have a given result. CART segments a dataset by creating two-way splits, whereas CHAID uses the chi-square test to create multi-way splits. Generally, CART needs less data preparation than CHAID.
The CHAID algorithm is based on the chi-square test and is constructed by repeatedly splitting subsets of the space into two or more child nodes, beginning with the entire dataset (Michael & Gordon, 1997). To decide the best split at any node, allowable pairs of categories of a predictor variable are merged as long as there is no statistically significant difference within the pair with respect to the target variable. The CHAID algorithm naturally handles interactions between the independent variables, which are directly available from an examination of the tree. Although there is no optimal method for obtaining the best segment size, CHAID can help researchers trade off variance against segment size to discover the most adequate one. For large datasets, the CHAID algorithm also makes clear which segmentation variable should come first.
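The category-merging step can be illustrated with the Pearson chi-square statistic. The sketch below is a simplified illustration rather than a full CHAID implementation: it only finds the pair of categories whose class distributions differ least, which is the pair CHAID would consider merging first. The category names and counts are hypothetical, and the significance test against a chi-square threshold is omitted.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

def least_different_pair(counts):
    """Return the pair of categories with the smallest chi-square statistic."""
    names = list(counts)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: chi_square([counts[p[0]], counts[p[1]]]))

# Hypothetical counts of (distressed, normal) firms per predictor category.
counts = {"low": [2, 18], "medium": [3, 17], "high": [15, 5]}
print(least_different_pair(counts))
```

Here the "low" and "medium" categories have nearly identical class distributions, so they are the natural candidates for merging before the node is split.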
2.2. Statistical algorithms
The traditional statistical algorithms most widely applied for prediction and diagnosis in many disciplines are discriminant analysis, logistic regression, the Bayesian approach, and multiple regression. These models have proven very effective, but mainly for relatively less complex problems. LR is a regression method for predicting a dichotomous dependent variable. In producing the LR equation, the maximum-likelihood ratio is used to determine the statistical significance of the variables (Hosmer & Lemeshow, 2000). In logistic regression models, the dependent variable is always categorical and has two or more levels. Independent variables may be numerical or categorical (Camdeviren, Yazici, Akkus, Bugdayci, & Sungur, 2007). We consider the situation where we observe a binary outcome variable y and a vector x = (1, x1, x2, ..., xk) of covariates for each of N individuals. We code the two classes via a 0/1 response yi, where yi = 1 for the first class and yi = 0 for the second. Let p be the conditional probability associated with the first class. LR is a widely used statistical modeling technique in which the probability p of the dichotomous outcome event is related to a set of explanatory variables x in the form (Hosmer & Lemeshow, 2000):
\[
\operatorname{Logit}(p) = \ln\left(\frac{p}{1-p}\right) = f(x;\beta) = \beta^{T}x \tag{1}
\]
where β = (β0, β1, β2, ..., βk) is the vector of the coefficients of the model and βT is its transpose. We refer to p/(1 − p) as the odds and to expression (1) as the log-odds or logit transformation.
Let D = {(xl, yl): l = 1, 2, ..., nT} be the training data set, where nT is the number of samples. Here, we assume that the training sample is a realization of a set of independent and identically distributed random variables. The unknown regression coefficients βi, which have to be estimated from the data, are directly interpretable as log-odds ratios or, in terms of exp(βi), as odds ratios. The log-likelihood for nT observations is
\[
l(\beta) = \sum_{l=1}^{n_T} \left\{ y_l\,\beta^{T}x_l - \log\left(1 + e^{\beta^{T}x_l}\right) \right\} \tag{2}
\]
The log-likelihood function is used to estimate the regression coefficients (βi) of the model. The coefficients are obtained by iterative methods. The exponential of a regression coefficient, exp(βi), gives the odds ratio, which reflects the effect of the risk factor on the disease. In addition, once the model has been fitted, a classification table is obtained, as in other classification methods.
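The logit model and its log-likelihood can be sketched directly from the definitions in this section. The code below is a minimal illustration under stated assumptions: plain gradient ascent stands in for the unspecified "iterative methods", and the data (a single hypothetical debt-ratio covariate with a leading 1 for the intercept) are invented for the example and are not taken from the TSEC sample.

```python
import math

def predict_proba(beta, x):
    """p = 1 / (1 + exp(-beta^T x)); x includes the leading 1."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, data):
    """l(beta) = sum over samples of y * beta^T x - log(1 + exp(beta^T x))."""
    total = 0.0
    for x, y in data:
        z = sum(b * xi for b, xi in zip(beta, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

def fit(data, steps=2000, rate=0.1):
    """Gradient ascent on the log-likelihood (an illustrative choice of solver)."""
    beta = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in data:
            err = y - predict_proba(beta, x)
            for j, xi in enumerate(x):
                grad[j] += err * xi
        beta = [b + rate * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: x = (1, debt ratio), y = 1 for distress, 0 for normal.
data = [((1.0, 0.2), 0), ((1.0, 0.4), 0), ((1.0, 0.5), 1),
        ((1.0, 0.6), 0), ((1.0, 0.7), 1), ((1.0, 0.9), 1)]
beta = fit(data)
odds_ratio = math.exp(beta[1])  # odds ratio per unit increase of the covariate
print(beta, odds_ratio)
```

The classes in the toy data overlap on purpose, so that a finite maximum-likelihood estimate exists; with perfectly separable data the coefficients would grow without bound.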
2.3. Related literature
DT classification and LR play an important role in many data analyses, providing prediction and classification rules. Both are simple and comprehensible. In the recent past, many research contributions have applied AI techniques to financial applications. Recent examples are as follows. Chen and Du (2009) adopted data mining and ANN approaches to detect financial crises automatically (Chen & Du, 2009). Li and Sun (2009) presented a multiple case-based reasoning system by majority voting (Multi-CBRMV) for financial distress prediction. With data collected from the Shanghai and Shenzhen Stock Exchanges, their experiment showed that the Multi-CBRMV system outperformed the original CBR model (Li & Sun, 2009). Xidonas et al. (2009) presented an expert system methodology for supporting decisions on the selection of equities on the basis of financial analysis. The validity of the proposed methodology was tested through a large-scale application on the Athens Stock Exchange (Xidonas et al., 2009). Atsalakis and Valavanis (2009) proposed a neuro-fuzzy adaptive control system to forecast the next day's stock price trends. The proposed system performed very well in trading simulations, returning results superior to the B&H (buy and hold) strategy (Atsalakis & Valavanis, 2009). Yip (2004) used case-based reasoning (CBR) with K-NN to predict business failure of Australian firms. She used statistical evaluations to assign the relevancy of attributes in the retrieval phase of the algorithm. The results showed that CBR with weighted K-NN was better than discriminant approaches (Yip, 2004). However, few of these studies focused on the comparison of DT and LR, and even fewer empirical investigations addressed topics related to financial distress prediction. Therefore, we use DT and LR to compare the accuracy of predicting bankruptcy in a capital market.
3. Research methodology
In this research, we compare the financial distress prediction (FDP) performance of DT and LR techniques. The research methodology is shown in Fig. 1. In the FDP Choosing phase, we take the original large datasets from the TSEC and subject them to data pre-processing, which includes cleaning, normalization, transformation, feature extraction and selection. The product of data pre-processing is the final training and testing set. The goal of this phase is to choose the appropriate variables, including financial and non-financial ratios, by means of PCA. The next phase then uses these variables to extract prediction rule sets with the DT and LR techniques.
Fig. 1. Research methodology.
In the FDP Modeling phase, we gather the financial statement datasets for DT and LR processing. In the DT approach, we use the C5.0, CART, and CHAID algorithms to find the rule patterns and forecast financial distress. In the LR approach, we use the binary regression technique to classify and forecast financial distress. The DT and LR algorithms are applied separately to discover the financial distress prediction patterns or rules.
In the FDP Comparison phase, we compare the accuracy rate and the Type II error rate of DT and LR across successive rounds of factor analysis (non-factor analysis, 1st factor analysis, and 2nd factor analysis). The intelligent bankruptcy prediction model is then constructed and validated on new financial statement datasets from the TSEC.
4. The FDP choosing phase
4.1. Data
Our sample contained raw data from 100 Taiwanese firms listed on the TSEC. The sampling period ran from January 2000 to May 2007, amounting to 7 years and 5 months. The 50 firms in financial distress were matched with 50 non-bankrupt firms. Firms were classified as non-bankrupt based on the absence of any indication or proof of financial distress in the auditors' reports. All the variables used in the sample were extracted from formal financial statements, such as balance sheets, cash flow statements, and income statements, in the financial databases of the TSEC. This implies that the usefulness of this research is not restricted by the fact that only data from Taiwanese companies were used.
4.2. Variables
The selection of candidate variables for the input vector was based on prior research on financial distress prediction. The related studies by Kirkos, Spathis, and Manolopoulos (2007), Spathis, Doumpos, and Zopounidis (2002), Fanning and Cogger (1998), Persons (1995), Stice (1991) and Kinney and McDaniel (1989) suggested indicators for financial distress prediction. This paper therefore adopted variables based on prior research, the Taiwan Economic Journal (TEJ), and the Taiwanese economic database. It selected 37 variables and categorized them into six major types: earning ability, financial structure ability, management efficiency ability, management performance, debt-repaying ability, and non-financial factors. The indicators belonging to each type are listed as follows:
Earning ability: including Pretax Margin, Return on Total Assets, Return on Equity, Earnings per Share, and Gross Margin ratios.
Financial structure ability: including Debt to Assets, Times Interest Earned, Book Value Per Share, Financial Leverage ratio, Debt to Equity, Short Term & Long Term Debt to Book Value ratio, Fixed Assets to Total Assets ratio, Gross Margin to Total Assets ratio, Inventory to Total Assets ratio, Inventory to Sales ratio, Investment ratio, and Current Assets to Total Assets ratio.
Management efficiency ability: including Turnover rate of Inventory, Turnover rate of Account Receivable, Turnover rate of Fixed Assets, Turnover rate of Total Assets, Turnover rate of Equity, and Turnover rate of Working Capital ratios.
Management performance: including Pretax Margin Growth ratio, Gross Margin Growth ratio, and Sales Growth ratio.
Debt-repaying ability: including Current ratio, Acid test ratio, Cash ratio, Cash Flow ratio, Cash Flow to Long Term Debt, Cash Flow to Total Debt, and Cash Flow to Short Term and Long Term Debt ratio.
Non-financial factors: including Dividend Payout ratio, Price-book ratio, the proportion of collateralized shares by the board of directors, and the Insider Holding ratio.
4.3. Factor analysis
This paper collected a sample of 50 pairs of financially distressed and non-bankrupt firms listed on the TSEC between 2000 and 2007. The main variables are 37 ratios serving as factors of the financial distress prediction model. This research used the SPSS statistical software to perform factor analysis and PCA with varimax rotation (VARIMAX), with the purpose of making the factor structures easier and simpler to explain. The selection of factors is based on Kaiser's criterion, meaning that a factor with an eigenvalue greater than 1 is a common factor, and a communality greater than 0.8 is required in order to obtain suitable factors.
In total, we assembled 33 financial ratios and 4 non-financial ratios. To reduce dimensionality, we ran a factor analysis to test whether the differences between these 37 variables were significant for each variable. A variable was considered non-informative if the difference was not significant (low communality value). Table 1 shows the factor loadings, communalities, eigenvalues and explained variance for each variable. The total explained variance was 75.475%. Consequently, the 16 variables with high communality values were retained in the input vector, while the remaining 21 variables were discarded because of low communality values. We then ran the factor analysis a second time. Table 2 shows that four variables were discarded and that the total explained variance was 84.048%, which is higher than in the first round. Because the explained variance was still improving, we could not assume that the factor analysis had reached its optimal solution, so we ran it a third time. Table 3 shows that no variables needed to be discarded and that the total explained variance was 94.009%. We can therefore be sure that the third factor analysis was the optimal one, with the highest total explained variance of 94.009%.
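Kaiser's criterion and the explained-variance column can be checked with a short calculation. The sketch below assumes that the analysis is run on standardized variables, so that each factor's share of the variance is its eigenvalue divided by the number of variables; the eigenvalues are those reported in Table 1, while the extra value 0.9 is a hypothetical discarded factor added only to show the filter at work.

```python
def kaiser_select(eigenvalues, n_variables):
    """Keep factors with eigenvalue > 1 and report their explained variance (%)."""
    kept = [ev for ev in eigenvalues if ev > 1.0]
    explained = [100.0 * ev / n_variables for ev in kept]
    return kept, explained

# Eigenvalues of the 11 factors reported in Table 1 (37 input variables).
table1 = [8.721, 3.752, 2.845, 2.573, 2.113, 1.814,
          1.458, 1.293, 1.237, 1.068, 1.050]

# 0.9 is a hypothetical factor that fails Kaiser's criterion.
kept, explained = kaiser_select(table1 + [0.9], n_variables=37)
total = sum(explained)
print(len(kept), round(explained[0], 3), round(total, 3))
```

Under this assumption the first factor explains about 23.57% of the variance and the 11 retained factors together explain about 75.47%, which agrees with Table 1 up to rounding of the printed eigenvalues.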
After the three rounds of factor analysis, 12 variables presented high communality values. These variables were chosen for the input vector, while the remaining 25 variables were discarded. The selected variables were Return on Equity (ROE), Earnings per Share (EPS), Return on Asset (ROA), Cash Flow Ratio, Cash Flow to Total Debt Ratio, Current Ratio, Acid-Test Ratio, Inventory to Total Assets Ratio, Current Assets to Total Assets, Gearing Ratio, Debt to Equity Ratio, and Debt/Equity (DE).
5. The FDP modeling phase
5.1. DT experiments and results
This process uses the financial and non-financial ratios and constructs a financial distress prediction model after carrying out the second factor analysis. The variables are then loaded as DT and LR input nodes. To examine prediction accuracy, we also apply these experiment parameters to the past 2 seasons, the past 4 seasons, the past 6 seasons, and the past 8 seasons before the financial distress occurred. In this experiment, we use C5.0, CART, and CHAID as the DT algorithms. The training and testing samples are split in a 70:30 ratio.
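A 70:30 split of the kind described above can be sketched as follows. The paper does not state how the records were assigned to the two sets, so the random shuffle, the fixed seed, and the placeholder firm names are assumptions made only for illustration.

```python
import random

def split_70_30(records, train_fraction=0.7, seed=0):
    """Shuffle records reproducibly and split them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for the 100 sampled firms.
firms = [f"firm_{i:03d}" for i in range(100)]
train, test = split_70_30(firms)
print(len(train), len(test))
```

Fixing the seed keeps the partition reproducible, so that the DT and LR models can be compared on the same training and testing records.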
In bankruptcy prediction, the accuracy of a prediction is routinely measured by three quantities: the Type I error rate, the Type II error rate, and the total error rate. The Type I error rate is the rate at which the model fails to categorize
Table 1
1st Factor analysis results.
Factor  Variable                                        Loading   Communality  Eigenvalue  Explained variance (%)
1       Cash Flow Ratio                                 0.394     0.867        8.721       23.570
        Current Ratio                                   0.722     0.851
        Turnover Rate of Working Capital Ratio          0.426     0.825
        Acid-Test Ratio                                 0.721     0.811
        Investment Ratio                                0.805     0.746
        Equity Per Share Ratio                          0.436     0.746
        Cash Ratio                                      0.403     0.539
2       Current Assets to Total Assets Ratio            0.770     0.860        3.752       10.142
        Turnover Rate of Total Assets Ratio             0.535     0.814
        Debt Ratio                                      0.304     0.810
        Turnover Rate of Fixed Assets Ratio             0.654     0.792
        Turnover Rate of Equity Ratio                   0.509     0.776
        Sales Revenue Growth Ratio                      0.484     0.635
3       Cash Flow to Total Debt Ratio                   0.444     0.821        2.845       7.690
        Cash Flow to Short Term & Long Term Debt Ratio  0.116     0.781
        Cash Flow to Long Term Debt Ratio               0.451     0.713
4       Gearing Ratio                                   0.401     0.970        2.573       6.954
        Debt to Equity Ratio                            0.397     0.967
        Debt Equity Ratio                               0.402     0.960
        Earnings Per Share Ratio                        0.412     0.923
        Return on Equity Ratio                          0.358     0.894
        Return on Asset Ratio                           0.434     0.869
        Margin Before Interest and Tax Ratio            0.248     0.815
5       Turnover Rate of Inventory Ratio                0.426     0.785        2.113       5.712
        Gross Operating Spread Ratio                    0.165     0.634
        Gross Operating Spread Growth Ratio             0.350     0.534
6       Fixed Assets to Total Assets Ratio              0.390     0.768        1.814       4.903
        Gross Margin to Total Assets Ratio              0.329     0.730
7       Times Interest Earned Ratio                     0.330     0.758        1.458       3.941
        Turnover Rate of Account Receivable Ratio       0.281     0.714
        Price Book Ratio                                0.295     0.451
8       Inventory to Total Assets Ratio                 0.245     0.890        1.293       3.493
        Inventory to Sales Ratio                        0.361     0.797
9       Insider Holding Ratio                           0.274     0.644        1.237       3.343
        Dividend Payout Ratio                           0.382     0.566
10      Pretax Margin Growth Ratio                      0.270     0.503        1.068       2.888
11      Proportion of Collateralized shares Ratio       0.301     0.367        1.050       2.839
Total explained variance (%): 75.475
Table 2
2nd Factor analysis results (values listed per factor in the order extracted; some variable names are missing from the source).
Factor  Variables                                                                    Factor loadings             Communality                 Eigenvalue  Explained variance (%)
1       Margin Before Interest and Tax Ratio                                         0.766, 0.768, 0.748, 0.720  0.925, 0.893, 0.889, 0.776  6.326       35.146
2       Cash Flow Ratio; Current Ratio; Acid-Test Ratio                              0.551, 0.515, 0.464, 0.416  0.923, 0.885, 0.840, 0.809  2.597       14.430
3       -                                                                            0.637, 0.807                0.887, 0.884                2.193       12.185
4       Gearing Ratio; Debt Equity Ratio                                             0.521, 0.516, 0.502         0.974, 0.973, 0.965         1.739       9.661
5       Turnover Rate of Total Assets Ratio; Turnover Rate of Working Capital Ratio  0.304, 0.303                0.696, 0.495                1.269       7.051
6       Debt Ratio                                                                   0.228                       0.776                       1.004       5.575
Total explained variance (%): 84.048
Table 3
3rd Factor analysis results (values listed per factor in the order extracted; some variable names are missing from the source).
Factor  Variables                         Factor loadings      Communality          Eigenvalue  Explained variance (%)
1       -                                 0.755, 0.748, 0.746  0.932, 0.931, 0.930  4.803       40.028
2       Cash Flow Ratio                   0.802, 0.785         0.952, 0.950         2.107       17.558
3       Current Ratio; Acid-Test Ratio    0.675, 0.661         0.975, 0.944         1.870       15.585
4       -                                 0.509, 0.656         0.880, 0.826         1.495       12.458
5       Gearing Ratio; Debt Equity Ratio  0.475, 0.481, 0.446  0.994, 0.992, 0.975  1.006       8.380
Total explained variance (%): 94.009
Table 4
The relationship with Type I, II, and Total Error Rates.
              Prediction
Actually      Normal    Bankruptcy    Sum
Normal        Y1        Y2            Y3
Bankruptcy    Y4        Y5            Y6
Sum           Y7        Y8            Y9
a normal company as a normal company, the Type II error rate is the rate at which the model fails to categorize a bankrupt company as a bankrupt company, and the total error rate combines the Type I and Type II error rates. Table 4 shows the relationship among these three error rates. The formula for each error rate is as follows:
\[
\text{Type I Error Rate} = \frac{Y_2}{Y_3}, \qquad
\text{Type II Error Rate} = \frac{Y_4}{Y_6}, \qquad
\text{Total Error Rate} = \frac{Y_2 + Y_4}{Y_9}
\]
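The three error rates follow directly from the cells of Table 4. The function below is a small sketch of that calculation, taking actual classes as rows and predicted classes as columns; the function name and the example counts are hypothetical and are not results from the paper's experiments.

```python
def error_rates(y1, y2, y4, y5):
    """Type I, Type II and total error rates from the four inner cells of Table 4.

    y1: normal firms predicted normal      y2: normal firms predicted bankrupt
    y4: bankrupt firms predicted normal    y5: bankrupt firms predicted bankrupt
    """
    y3 = y1 + y2  # all firms that are actually normal
    y6 = y4 + y5  # all firms that are actually bankrupt
    y9 = y3 + y6  # all firms
    return {"type_1": y2 / y3, "type_2": y4 / y6, "total": (y2 + y4) / y9}

# Hypothetical counts for 50 normal and 50 bankrupt firms.
rates = error_rates(y1=45, y2=5, y4=2, y5=48)
print(rates)
```

With these counts the Type I error rate is 10%, the Type II error rate is 4%, and the total error rate is 7%; the accuracy rate reported in the experiments is one minus the total error rate.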
5.1.1. The experiment with C5.0 analysis

This experiment uses the 37 original ratio variables, before any factor analysis. As shown in Table 5, the testing data has an estimated accuracy rate as high as 97.01%, with an error rate of 2.99%, for the past 2 seasons. However, the accuracy rate falls to 88.80%, and the error rate rises distinctly to 11.20%, when measured over the past 8 seasons. We can therefore observe that, with C5.0 analysis, the closer to the financial crisis, the higher the accuracy.
The next experiment uses the 16 ratio variables of this research that have undergone 1st factor analysis. As shown in Table 6, the testing data has an estimated accuracy rate of 92.53% for the past 2 seasons, and the error rate clearly rises to 17.93% when measured over the past 8 seasons. Similar to the above experiment, the closer to the financial crisis, the higher the accuracy.
The last experiment uses the 12 ratio variables of this research that have undergone 2nd factor analysis. As shown in Table 7, the testing data has an estimated accuracy rate of 92.54% for the past 2 seasons, and the error rate distinctly rises to 18.42% when measured over the past 8 seasons. Similar to the above experiment, the closer to the financial crisis, the higher the accuracy.
5.1.2. The experiment with CART analysis
As in the above experiments, the CART analysis uses the same training and testing datasets. As shown in Table 8, the testing data has an estimated accuracy rate as high as 95.83%, with an error rate of 4.17%, for the past 2 seasons. However, the accuracy rate falls to 85.89%, and the error rate clearly rises to 14.11%, when measured over the past 8 seasons. We can therefore observe that, with CART analysis, the closer to the financial crisis, the higher the accuracy.
Next, we use the remaining 16 ratio variables that have undergone 1st factor analysis. As shown in Table 9, the testing data has an estimated accuracy rate as high as 94.03%, with an error rate of 5.97%, for the past 2 seasons. However, the accuracy rate falls to 84.65%, and the error rate visibly rises to 15.35%, when measured
Table 5
The accuracy for the C5.0 model with non-factor analysis (accuracy rate).
                 Training data                      Testing data
Seasons prior    Normal    Bankruptcy  Average      Normal    Bankruptcy  Average
2                100%      98.55%      99.25%       94.44%    100%        97.01%
4                98.54%    98.49%      98.52%       90.47%    92.53%      91.50%
6                99.06%    97.08%      98.09%       91.95%    86.17%      88.95%
8                97.92%    98.51%      98.21%       85.58%    91.53%      88.80%

Table 6
The accuracy for the C5.0 model with 1st factor analysis (accuracy rate).
                 Training data                      Testing data
Seasons prior    Normal    Bankruptcy  Average      Normal    Bankruptcy  Average
2                99.30%    98.88%      99.11%       95.45%    90.00%      92.53%
4                97.81%    99.24%      98.52%       90.47%    91.04%      92.53%
6                95.77%    92.23%      94.03%       88.50%    80.85%      84.53%
8                99.30%    98.88%      99.11%       81.08%    83.07%      82.07%
Table 7
The accuracy for the C5.0 model with 2nd factor analysis (accuracy rate).
                 Training data                      Testing data
Seasons prior    Normal    Bankruptcy  Average      Normal    Bankruptcy  Average
2                98.43%    98.55%      98.50%       88.88%    96.77%      92.54%
4                96.35%    98.49%      97.41%       88.65%    92.53%      90.59%
6                97.65%    95.63%      96.43%       89.65%    79.78%      84.53%
8                96.88%    96.66%      96.78%       81.08%    82.08%      81.58%
Table 8
The accuracy for the CART model with non-factor analysis (accuracy rate).
                 Training data                      Testing data
Seasons prior    Normal    Bankruptcy  Average      Normal    Bankruptcy  Average
2                93.75%    98.55%      96.24%       91.66%    100%        95.83%
4                90.51%    97.74%      94.07%       90.47%    94.02%      92.24%
6                94.36%    92.23%      93.32%       79.31%    77.65%      78.45%
8                94.11%    94.81%      94.45%       81.08%    90.00%      85.89%
Table 9
The accuracy for the CART model with 1st factor analysis (accuracy rate).
                 Training data                      Testing data
Seasons prior    Normal    Bankruptcy  Average      Normal    Bankruptcy  Average
2                98.43%    98.55%      98.5%        96.77%    91.66%      94.03%
4                91.24%    96.99%      94.07%       88.88%    91.04%      91.04%
6                88.73%    91.26%      89.98%       78.16%    72.34%      75.14%
8                91.34%    92.22%      91.77%       84.68%    84.61%      84.65%
over the past 8 seasons. Similar to the above experiment, the closer to the financial crisis, the higher the accuracy.
Finally, we apply the remaining 12 ratio variables that have undergone 2nd factor analysis. As shown in Table 10, the testing data has an estimated accuracy rate as high as 94.03%, with an error rate of 5.97%, for the past 2 seasons. However, the accuracy rate falls to 85.89%, and the error rate rises to 14.11%, when measured over the past 8 seasons. Similar to the above non-factor analysis and 1st factor analysis, the closer to the financial crisis, the higher the accuracy.
5.1.3. The experiment with CHAID analysis
This experiment uses the 37 original ratio variables, before any factor analysis. As
Table 10
The accuracy for the CART model with 2nd factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    98.43%    98.50%    98.55%         91.66%    94.03%    96.77%
Past 4 seasons    94.16%    96.67%    99.24%         92.06%    90.77%    89.55%
Past 6 seasons    89.20%    89.26%    89.32%         78.16%    74.59%    71.27%
Past 8 seasons    91.00%    90.70%    90.37%         89.18%    85.89%    83.07%
Table 11
The accuracy for the CHAID model with non-factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    98.43%    97.74%    97.10%         92.06%    92.29%    92.53%
Past 4 seasons    97.81%    98.52%    99.24%         87.30%    89.91%    92.53%
Past 6 seasons    94.83%    97.14%    99.51%         73.56%    80.11%    86.17%
Past 8 seasons    93.77%    93.92%    94.07%         89.18%    88.38%    87.69%
shown in Table 11, the testing data has an estimated accuracy rate
as high as 92.29%, with an error rate of 7.71% for the past 2
seasons. However, the accuracy rate reduces to 88.38%, and the error rate
rises to 11.62% when measured over the past 8 seasons. Therefore,
we can observe that the closer the financial crisis, the higher the accuracy will be with CHAID analysis.
Then, we take the remaining 16 original ratio variables to carry
out the 1st factor analysis. As shown in Table 12, the testing data has
an estimated accuracy rate as high as 92.54%, with an error rate of
7.46% for the past 2 seasons. However, the accuracy rate reduces
to 84.65%, and the error rate rises to 15.35% when measured over
the past 8 seasons.
Eventually, we take the final 12 original ratio variables to perform the 2nd factor analysis. As shown in Table 13, the testing data has
an estimated accuracy rate as high as 92.09%, with an error rate of
7.91% for the past 2 seasons. However, the accuracy rate reduces
to 84.65%, and the error rate rises to 15.35% when measured over
the past 8 seasons. Similar to the above non-factor analysis and
1st factor analysis, the closer the financial crisis, the higher the
accuracy will be when using the CHAID algorithm.
5.2. LR experiments and results

As with the DT experiment, this process also uses financial and non-financial ratios, and constructs the financial distress
prediction model after a second factor analysis. We investigate the past 2 seasons, the past 4 seasons, the past 6 seasons,
and the past 8 seasons before the occurrence of financial distress
to ensure prediction accuracy. The results are described in the following sections.
5.2.1. The experiment with non-factor analysis
This experiment uses the 37 original ratio
variables of this research that have not yet undergone a factor
analysis. As shown in Table 14, the data has an estimated accuracy
rate as high as 85.07%, with an error rate of 14.93% for the past 2
seasons. However, the accuracy rate clearly rises to 91.70%, and
the error rate reduces to 8.30% when measured over the past 8 seasons. This result differs from the conclusion of the DT approach:
the further from the financial crisis, the higher the accuracy will be.
Table 12
The accuracy for the CHAID model with 1st factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    100%      98.50%    97.10%         91.66%    92.54%    93.54%
Past 4 seasons    97.81%    98.15%    98.49%         87.30%    88.42%    89.55%
Past 6 seasons    91.07%    93.56%    96.11%         73.56%    76.24%    78.72%
Past 8 seasons    91.00%    93.56%    96.29%         82.88%    84.65%    86.15%
Table 13
The accuracy for the CHAID model with 2nd factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    98.43%    98.50%    98.55%         91.66%    92.09%    92.53%
Past 4 seasons    96.35%    97.78%    99.24%         87.30%    89.23%    91.04%
Past 6 seasons    78.91%    86.87%    94.17%         85.05%    85.64%    86.17%
Past 8 seasons    89.61%    93.56%    97.77%         81.08%    84.65%    87.69%
Table 14
The accuracy for the logistic regression model with non-factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    100%      100%      100%           86.11%    85.07%    83.87%
Past 4 seasons    84.67%    87.59%    90.51%         88.88%    89.21%    89.55%
Past 6 seasons    93.89%    94.03%    94.17%         85.05%    90.06%    94.68%
Past 8 seasons    94.11%    93.38%    92.59%         91.89%    91.70%    91.53%
Table 15
The accuracy for the logistic regression model with 1st factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    100%      100%      100%           86.11%    86.60%    87.09%
Past 4 seasons    90.51%    91.85%    93.23%         88.88%    89.21%    89.55%
Past 6 seasons    89.67%    87.59%    85.43%         88.50%    85.08%    81.91%
Past 8 seasons    91.00%    89.98%    88.88%         86.48%    85.06%    83.84%
5.2.2. The experiment with 1st factor analysis
This experiment uses the 16 original ratio variables that have undergone
a 1st factor analysis. As shown in Table 15, the data has an accuracy
rate as high as 86.60%, with an error rate of 13.40% for the past 2
seasons. The accuracy rate then reduces slightly to 85.06%, and
the error rate rises to 14.94% when measured over the past 8 seasons. The result shows that the accuracy rate is not significantly affected by factor analysis.
5.2.3. The experiment with 2nd factor analysis
This experiment uses the last 12 original ratio variables of this
research that have undergone the 2nd factor analysis. As shown
in Table 16, the testing data has an estimated accuracy rate as
high as 80.06%, with an error rate of 19.94% for the past 2 seasons. However, the accuracy rate increases slightly to 81.74% and
the error rate reduces to 18.26% when measured over
the past 8 seasons. Similar to the above experiment, the result
Table 16
The accuracy for the logistic regression model with 2nd factor analysis.
Accuracy rate     Training data                      Testing data
                  Normal    Average   Bankruptcy     Normal    Average   Bankruptcy
Past 2 seasons    95.31%    93.98%    92.75%         75.00%    80.60%    87.09%
Past 4 seasons    87.59%    86.30%    84.96%         82.53%    82.31%    82.08%
Past 6 seasons    88.26%    84.49%    80.58%         87.35%    82.87%    78.72%
Past 8 seasons    86.15%    83.18%    80.00%         85.58%    81.74%    77.69%
shows that the accuracy rate is not significantly affected by factor
analysis.
6. The FDP comparing phase
After the implementation of the FDP modeling phase, we
compare the DT and LR approaches in terms of the accuracy rate, the Type
II error rate, and factor analysis. The details are discussed in the following sections.
6.1. The accuracy rate for DT and LR
As is evident from the above-mentioned results in Fig. 2, the C5.0
model presents the prediction performance by non-factor analysis,
after the first factor analysis, and after the second factor
analysis. The result shows that the accuracy rate worsens
from the past 2 seasons to the past 8 seasons prior to the
occurrence of the financial crisis. Moreover, Figs. 3 and 4 show that
the prediction performance of the CART and CHAID models also worsens
from the past 2 seasons to the past 8 seasons.
Generally, these three DT models show that the closer the crisis,
the higher the accuracy rate becomes.
The LR model shows the prediction performance by non-factor
analysis, after the first factor analysis, and after the second
factor analysis in Fig. 5. As a result, the accuracy rate also shows
a worsening trend, as in the DT models. However, the LR model
keeps a steady accuracy rate from the past 2 seasons to the past
8 seasons. In addition, the highest accuracy rate, 91.70%, is obtained
by non-factor analysis with the past 8 seasons. This shows that the LR model has
better prediction performance in the long run.
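The contrast between the DT and LR trends can be checked directly from the non-factor-analysis values reported with Figs. 2 to 5. The sketch below is a minimal illustration in Python; the dictionary `accuracy` and the helper `change_2_to_8` are hypothetical names introduced here, and the numbers are the testing-data averages from the figures.

```python
# Testing-data average accuracy (%) with non-factor analysis, as reported
# with Figs. 2-5. Inner keys are seasons before the financial distress.
accuracy = {
    "C5.0":  {2: 97.01, 4: 91.50, 6: 88.95, 8: 88.80},
    "CART":  {2: 95.83, 4: 92.24, 6: 78.45, 8: 85.89},
    "CHAID": {2: 92.29, 4: 89.91, 6: 80.11, 8: 88.38},
    "LR":    {2: 85.07, 4: 89.21, 6: 90.06, 8: 91.70},
}


def change_2_to_8(series: dict) -> float:
    """Accuracy change (percentage points) from the past 2 to the past 8 seasons."""
    return round(series[8] - series[2], 2)


# Negative values mean accuracy falls as the horizon gets longer.
changes = {model: change_2_to_8(series) for model, series in accuracy.items()}

for model, delta in changes.items():
    trend = "falls" if delta < 0 else "rises"
    print(f"{model}: accuracy {trend} by {abs(delta):.2f} points from 2 to 8 seasons")
```

Under these values the three DT models lose accuracy over the longer horizon while the LR model gains, which is the pattern described above.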
The Accuracy Rate for the C5.0
                  None      1st       2nd
Past 2 seasons    97.01%    92.53%    92.54%
Past 4 seasons    91.50%    90.93%    90.59%
Past 6 seasons    88.95%    84.53%    84.53%
Past 8 seasons    88.80%    82.07%    81.58%
Fig. 2. The accuracy rate for the C5.0.

The Accuracy Rate for CART
                  None      1st       2nd
Past 2 seasons    95.83%    94.03%    94.03%
Past 4 seasons    92.24%    91.04%    90.77%
Past 6 seasons    78.45%    75.14%    74.59%
Past 8 seasons    85.89%    84.65%    85.89%
Fig. 3. The accuracy rate for CART.

The Accuracy Rate for the CHAID
                  None      1st       2nd
Past 2 seasons    92.29%    92.54%    92.09%
Past 4 seasons    89.91%    88.42%    89.23%
Past 6 seasons    80.11%    76.24%    85.64%
Past 8 seasons    88.38%    84.65%    84.65%
Fig. 4. The accuracy rate for CHAID.

The Accuracy Rate for the Logistic Regression
                  None      1st       2nd
Past 2 seasons    85.07%    86.60%    80.60%
Past 4 seasons    89.21%    89.21%    82.31%
Past 6 seasons    90.06%    85.08%    82.87%
Past 8 seasons    91.70%    85.06%    81.74%
Fig. 5. The accuracy rate for logistic regression.

6.2. The Type II error rate for DT and LR
As shown in Fig. 6, the C5.0 model presents the Type II error rate
by non-factor analysis, after the first factor analysis, and after the
second factor analysis. The Type II error rate increases with almost
every round of factor analysis, while the accuracy rate decreases from the
past 2 seasons to the past 8 seasons prior to the
financial crisis. In addition, the closer the crisis, the more accurate
the C5.0 model becomes and the lower its Type II error rate.
Moreover, Fig. 7 shows that the Type II error rate of the CART model also
increases with almost every round of factor analysis, while the accuracy rate
decreases from the past 2 seasons to the past 8 seasons prior to the
financial crisis. However, the CHAID model shows
no significant increase in the Type II error rate in Fig. 8. Therefore,
the C5.0 and CHAID models have lower Type II error rates, but
the CART model has a higher Type II error rate in financial bankruptcy
prediction.
As shown in Fig. 9, the LR model presents the Type II error rate
by non-factor analysis, after the first factor analysis, and after the
second factor analysis. It indicates that the Type II error rate
has approximately the same increasing trend as in the DT models,
while the accuracy rate decreases as in the DT models. The only
exception is the Type II error rate over the past 2 seasons, which is
better in the 2nd factor analysis than in the non-factor analysis.
Nevertheless, in summary, the closer the crisis point,
the lower the Type II error rate in the LR model. The only exception
is the non-factor analysis, where the Type II error rate is better over
the past 8 seasons than over the past 2 seasons.

The Type 2 Error Rate for the C5.0
                  None      1st       2nd
Past 2 seasons    0.00%     10.00%    3.23%
Past 4 seasons    7.47%     8.96%     7.47%
Past 6 seasons    13.83%    19.15%    20.22%
Past 8 seasons    8.47%     16.93%    17.92%
Fig. 6. The Type 2 Error Rate for the C5.0.

The Type 2 Error Rate for CART
                  None      1st       2nd
Past 2 seasons    0.00%     8.34%     3.23%
Past 4 seasons    5.98%     8.96%     10.45%
Past 6 seasons    22.35%    24.86%    28.73%
Past 8 seasons    10.00%    15.39%    16.93%
Fig. 7. The Type 2 Error Rate for CART.

The Type 2 Error Rate for the CHAID
                  None      1st       2nd
Past 2 seasons    7.47%     6.46%     7.47%
Past 4 seasons    7.47%     10.45%    8.96%
Past 6 seasons    13.83%    21.28%    13.83%
Past 8 seasons    12.31%    13.85%    12.31%
Fig. 8. The Type 2 Error Rate for CHAID.

The Type 2 Error Rate for the Logistic Regression
                  None      1st       2nd
Past 2 seasons    16.13%    12.91%    12.91%
Past 4 seasons    10.45%    10.45%    17.92%
Past 6 seasons    5.32%     18.09%    21.28%
Past 8 seasons    8.30%     16.16%    22.31%
Fig. 9. The Type 2 Error Rate for logistic regression.

6.3. The factor analysis for DT and LR
In this comparison, we average the accuracy rate of the DT and the
LR models for each factor analysis and over 2, 4, 6, and 8 seasons. In
Fig. 10, we can see that the accuracy rate (non-factor analysis) with
the DT models is better than with the LR model for the past 2 and
4 seasons. However, the LR model has a better accuracy rate than the DT
models for the past 6 and 8 seasons. In Fig. 11, we can see that the
accuracy rates (1st factor analysis) with the DT models are all better
than with the LR model in the short run. However, the LR model still
performs better than the DT models in the long run, as in
Fig. 10. In Fig. 12, we can see that the accuracy rate (2nd factor
analysis) with the DT models is better than with the LR model, with
the exception of the past 6 seasons for the CART algorithm. Therefore,
the DT models have excellent prediction performance
for the past 2 and 4 seasons. Moreover, the accuracy rate
of the C5.0 algorithm is better than that of CART and CHAID. However,
the LR model has a prediction advantage for the past 6 and 8 seasons,
except when the 2nd factor analysis is used.

Fig. 10. The accuracy rate with non-factor analysis for the C5.0, CART, CHAID and
logistic regression. (Line chart; x-axis: 2, 4, 6, and 8 seasons before the occurrence of financial distress; y-axis: accuracy rate.)
Fig. 11. The accuracy rate with 1st factor analysis for the C5.0, CART, CHAID and
logistic regression. (Line chart; x-axis: 2, 4, 6, and 8 seasons; y-axis: accuracy rate.)
Fig. 12. The accuracy rate with 2nd factor analysis for the C5.0, CART, CHAID and
logistic regression. (Line chart; x-axis: 2, 4, 6, and 8 seasons; y-axis: accuracy rate.)
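One possible reading of the comparison in this section is to average the three DT models at each horizon and set that mean against the LR accuracy. The Python sketch below does this for the 2nd factor analysis, using the testing-data averages reported with Figs. 2 to 5. Treating the comparison as an unweighted mean of C5.0, CART, and CHAID is an assumption made here for illustration, and the names `dt_2nd`, `lr_2nd`, and `winners` are hypothetical.

```python
from statistics import mean

# Testing-data average accuracy (%) after the 2nd factor analysis,
# as reported with Figs. 2-5. Keys are seasons before the distress.
dt_2nd = {
    "C5.0":  {2: 92.54, 4: 90.59, 6: 84.53, 8: 81.58},
    "CART":  {2: 94.03, 4: 90.77, 6: 74.59, 8: 85.89},
    "CHAID": {2: 92.09, 4: 89.23, 6: 85.64, 8: 84.65},
}
lr_2nd = {2: 80.60, 4: 82.31, 6: 82.87, 8: 81.74}

# Assumption: the DT "average" is the unweighted mean of the three tree models.
dt_mean = {h: mean(model[h] for model in dt_2nd.values()) for h in lr_2nd}

# Pick the approach with the higher accuracy at each horizon.
winners = {h: ("DT" if dt_mean[h] > lr_2nd[h] else "LR") for h in lr_2nd}

for h in sorted(winners):
    print(f"past {h} seasons: DT mean {dt_mean[h]:.2f}% vs LR {lr_2nd[h]:.2f}% -> {winners[h]}")
```

With these values the DT mean falls below LR only for the past 6 seasons, where the low CART accuracy pulls the average down.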
7. Conclusions
This research focused on the financial and non-financial ratios
in the financial statement, and used the DT and LR models to
compare the performance of financial distress predictions, in
order to find a better early-warning method. This research took
50 companies that were facing a financial crisis, and matched them
with 50 normal companies in similar industries. In addition, we
adopted the necessary dataset from the TSEC database and sampled it over the past 2, 4, 6, and 8 seasons prior to the occurrence
of the financial crisis. This data was then used to carry out a statistical factor
analysis, with each resulting ratio variable fed into the DT
and LR methods in order to make a comparison.
After the experiments, we summarized four critical contributions. First, the more times we used PCA, the less accurate the results for the DT and LR approaches. In our experiments, we found
that when we applied all of the 37 variables with non-factor analysis to the DT and LR models, we could obtain a better prediction
performance, except for the past 6 seasons in the CHAID
model.
Second, the closer we get to the time of the actual financial distress, the more accurate the prediction will be in the DT models. For
example, the accuracy rate with non-factor analysis for 2 seasons before the financial distress occurs is 97.01% in C5.0, while it
is only 88.80% for 8 seasons. However, the results are different
for the LR model, where the accuracy rates with non-factor analysis
for 2 and 8 seasons before the occurrence of financial distress are
85.07% and 91.70%, respectively.
Third, most investors are concerned with the Type II error rate
and avoid investing in these companies. Our empirical results
show that factor analysis increases the error of classifying
companies with a potential financial crisis as normal companies.
Moreover, we also found that the average Type II error rate
in the LR model is higher than in the DT model. Therefore, the prediction performance of the LR approach is more strongly influenced than that of the DT model.
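As a reminder of what the Type II error rate measures here, the Python sketch below computes it from classification counts: the share of distressed companies that the model labels as normal. The counts are hypothetical and are not the sample sizes of this study, and taking the Type I error as the complementary case (normal companies labelled as distressed) is an assumption.

```python
def error_rates(normal_correct: int, normal_total: int,
                distress_correct: int, distress_total: int):
    """Return (type_i, type_ii, total) error rates in percent.

    type_ii: distressed companies classified as normal.
    type_i:  normal companies classified as distressed (assumed definition).
    """
    normal_missed = normal_total - normal_correct
    distress_missed = distress_total - distress_correct
    type_i = 100.0 * normal_missed / normal_total
    type_ii = 100.0 * distress_missed / distress_total
    total = 100.0 * (normal_missed + distress_missed) / (normal_total + distress_total)
    return type_i, type_ii, total


# Hypothetical testing split: 36 normal and 31 distressed company records.
type_i, type_ii, total = error_rates(
    normal_correct=32, normal_total=36,
    distress_correct=30, distress_total=31,
)
print(f"Type I: {type_i:.2f}%  Type II: {type_ii:.2f}%  Total: {total:.2f}%")
```

A model can keep a low total error while still missing distressed companies, which is why the Type II rate is reported separately.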
Finally, the DT approach obtains better prediction accuracy
than the LR approach in developing a financial distress prediction
model, with the exception that the accuracy rate (non-factor and
1st factor analysis) for the past 6 and 8 seasons is higher with
the LR model. Therefore, the DT approach is suitable for financial
distress prediction in the short run, whereas the LR approach is
appropriate for long-run prediction of financial distress.
In future research, additional macroeconomic indices and technical indicators could be considered as input variables to expand
the explanatory capability of the proposed models. Moreover,
additional artificial intelligence techniques, such as neural network
models, genetic algorithms, and others, could also be applied.
Researchers could also expand the system so as to deal with
more financial datasets.
Acknowledgements
We thank the National Scientific Council (NSC) of the
Republic of China (ROC) for supporting this work under Grant No. NSC 96-2416-H-018-011. We also gratefully acknowledge the Editor and
anonymous reviewers for their valuable comments and constructive suggestions.
References
Altman, E. L. (1968). Financial ratios, discriminant analysis and the prediction of
corporate bankruptcy. The Journal of Finance, 23(3), 589–609.
Altman, E. L., Edward, I., Haldeman, R., & Narayanan, P. (1977). A new model to
identify bankruptcy risk of corporations. Journal of Banking and Finance, 1,
29–54.
Atsalakis, G. S., & Valavanis, K. P. (2009). Forecasting stock market short-term trends
using a neuro-fuzzy based methodology. Expert Systems with Applications, 36,
10696–10707.
Beaver, W. (1966). Financial ratios as predictors of failure, empirical research in
accounting: Selected studies. Journal of Accounting Research, 71, 111.
Blum, M. (1974). Failing company discriminant analysis. Journal of Accounting
Research, 1, 25.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and
regression trees. California: Wadsworth.
Camdeviren, H. A., Yazici, A. C., Akkus, Z., Bugdayci, R., & Sungur, M. A. (2007).
Comparison of logistic regression model and classification tree: An application
to postpartum depression data. Expert Systems with Applications, 32, 987–994.
Chang, C. L., & Chen, C. H. (2009). Applying decision tree and neural network to
increase quality of dermatologic diagnosis. Expert Systems with Applications, 36,
4035–4041.
Chen, M. Y., & Du, Y. K. (2009). Using neural networks and data mining techniques
for the financial distress prediction model. Expert Systems with Applications,
36(2), 4075–4086.
Dimitras, A. I., Zanakis, S. H., & Zopounidis, C. (1996). A survey of business failure
with an emphasis on prediction methods and industrial applications. European
Journal of Operational Research, 90(3), 487–513.
Fanning, K., & Cogger, K. (1998). Neural network detection of management fraud
using published financial data. International Journal of Intelligent Systems in
Accounting, Finance and Management, 7(1), 21–24.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco,
CA, USA: Morgan Kaufmann.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.
Hua, Z., Wang, Y., Xu, X., Zhang, B., & Liang, L. (2007). Predicting corporate financial
distress based on integration of support vector machine and logistic regression.
Expert Systems with Applications, 33(2), 434–440.
Huang, M. J., Chen, M. Y., & Lee, S. C. (2007). Integrating data mining with case-based
reasoning for chronic diseases prognosis and diagnosis. Expert Systems
with Applications, 32(3), 856–867.
Kinney, W., & McDaniel, L. (1989). Characteristics of firms correcting previously
reported quarterly earnings. Journal of Accounting and Economics, 11(1), 71–93.
Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the
detection of fraudulent financial statements. Expert Systems with Applications,
32(4), 995–1003.
Kumar, P. R., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via
statistical and intelligent techniques: A review. European Journal of Operational
Research, 180, 1–28.
Laitinen, E. K., & Laitinen, T. (2000). Bankruptcy prediction: Application of the
Taylor's expansion in logistic regression. International Review of Financial
Analysis, 9, 327–349.
Lensberg, T., Eilifsen, A., & McKee, T. E. (2006). Bankruptcy theory development and
classification via genetic programming. European Journal of Operational
Research, 169, 677–697.
Li, H., & Sun, J. (2009). Majority voting combination of multiple case-based
reasoning for financial distress prediction. Expert Systems with Applications, 36,
4363–4373.
Michael, J. A., & Gordon, S. L. (1997). Data mining technique for marketing, sales and
customer support. New York: Wiley.
Mitra, S., Pal, S. K., & Mitra, P. (2002). Data mining in soft computing framework: A
survey. IEEE Transactions Neural Networks, 13(1), 3–14.
McKee, T. E., & Lensberg, T. (2002). Genetic programming and rough sets: A hybrid
approach to bankruptcy classification. European Journal of Operational Research,
138, 436–451.
Meyer, P. A., & Pifer, H. (1970). Prediction of bank failures. The Journal of Finance, 25,
853–868.
Persons, O. (1995). Using financial statement data to identify factors associated with
fraudulent financial reporting. Journal of Applied Business Research, 11(3), 38–46.
Quinlan, J. R. (1993). Programs for machine learning. San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1997). C5.0 and See5: Illustrative examples. RuleQuest Research.
<http://www.rulequest.com>.
Roiger, R. J., & Geatz, M. W. (2003). Data mining: A tutorial-based primer. Boston, MA:
Addison Wesley.
Ross, Q. J. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers.
Spathis, C., Doumpos, M., & Zopounidis, C. (2002). Detecting falsified financial
statements: A comparative study using multicriteria analysis and multivariate
statistical techniques. The European Accounting Review, 11(3), 509–535.
Stice, J. (1991). Using financial and market information to identify pre-engagement
market factors associated with lawsuits against auditors. The Accounting Review,
66(3), 516–533.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and
techniques. California: Morgan Kaufmann.
Xidonas, P., Ergazakis, E., Ergazakis, K., Metaxiotis, K., Askounis, D., Mavrotas, G.,
et al. (2009). On the selection of equity securities: An expert systems
methodology and an application on the Athens Stock Exchange. Expert
Systems with Applications, 36(2009), 11966–11980.
Yip, A. Y. N. (2004). Predicting business failure with a case-based reasoning
approach, lecture notes in computer science. In M. G. Negoita, R. J. Howlett, & L.
C. Jain (Eds.), Knowledge-based intelligent information and engineering systems:
8th international conference, KES 2004, Wellington, New Zealand, September 3215/
2004, proceedings, Part III (pp. 2025).
Zopounidis, C., & Dimitras, A. (1998). Multicriteria decision aid methods for the
prediction of business failure. Dordrecht: Kluwer Academic Publishers.
