ABSTRACT
A potential and valuable customer can be identified only through a complete 360-degree analysis. The identification
process uses various business models in CRM, and a number of researchers have tried to use such process models to
guide the mining of large amounts of data. This paper focuses on a comparative analysis of the most popular data
mining process models, viz. the Knowledge Discovery in Databases (KDD) process model, CRISP-DM and SEMMA, and on an
enhancement of the modeling technique in CRISP-DM. The comparative study shows that KDD and SEMMA are almost
similar and that CRISP-DM is best suited to business analysis related to the identification of potential customers
in CRM, which is the major objective of this paper. The investigation also revealed that including a scoring model in
the modeling phase of CRISP-DM provides the optimum result in identifying potential customers through the process
models.
Received: Sep 13, 2016; Accepted: Oct 07, 2016; Published: Oct 13, 2016; Paper Id.: IJCSEITROCT20169
1. INTRODUCTION
Data mining is an innovative process that needs various skills and knowledge. Different data mining projects are
carried out with the available standard models, and the success of a project is understood to depend on the process
model used. These models are used to translate business challenges into data mining tasks, recommend appropriate
data transformations and data mining techniques, give methods for assessing the effectiveness of the results, and
document the lessons learned. Acceptance of a common process model in the market brings further benefits: the model
serves as a general reference point for discussion and thus increases the understanding of vital data mining
challenges in identifying potential customers. The familiar models are KDD, SEMMA and CRISP-DM.
Data mining is one of the phases of the KDD process (Fayyad et al., 1996; Brachman & Anand, 1996). The phrase
knowledge discovery in databases, or KDD, was coined in 1989 to refer to the broad process of identifying knowledge
in data and to emphasize the high-level application of specific data mining methods (Fayyad et al., 1996). SEMMA was
developed by the SAS Institute. The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess, and refers to
the process of conducting a data mining project. SEMMA is simple to understand and allows structured and adequate
development and maintenance of data mining projects. It thus provides a structure for their conception, creation and
evolution, and helps to present solutions to business problems as well as to identify CRM goals (Santos & Azevedo,
2005).
www.tjprc.org
editor@tjprc.org
The CRISP-DM process was developed by a consortium composed of DaimlerChrysler, SPSS and NCR. CRISP-DM stands for
CRoss-Industry Standard Process for Data Mining (Chapman et al., 2000). Analysing the documentation of these process
models makes their similarities and differences clear.
This paper also deals with the enhancement of CRISP-DM by including a scoring model in the modelling phase. A
scoring model is a predictive system used for assessing creditworthiness, for optimizing direct marketing, and in CRM
models that predict the future behaviour of customers. The scoring model is delivered as a score table containing the
scores of customers with respect to various parameters.
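The idea of a score table can be illustrated with a minimal sketch. The parameter names, point values and customer records below are hypothetical and chosen only for illustration; they are not taken from this paper. The sketch assumes a simple additive scheme in which each parameter value contributes a fixed number of points.

```python
# Hypothetical score table: points awarded per parameter value.
# All names and numbers are illustrative assumptions, not values from the paper.
SCORE_TABLE = {
    "age_band": {"18-25": 10, "26-40": 25, "41+": 30},
    "employment": {"unemployed": 0, "employed": 20, "self-employed": 15},
    "has_existing_account": {True: 15, False: 5},
}


def score_customer(customer):
    """Sum the points for each parameter; missing or unknown values add 0."""
    return sum(
        SCORE_TABLE[param].get(customer.get(param), 0)
        for param in SCORE_TABLE
    )


# Hypothetical customer records.
customers = {
    "C001": {"age_band": "26-40", "employment": "employed", "has_existing_account": True},
    "C002": {"age_band": "18-25", "employment": "unemployed", "has_existing_account": False},
    "C003": {"age_band": "41+", "employment": "self-employed", "has_existing_account": False},
}

# Rank customers from highest to lowest score.
ranked = sorted(
    ((cid, score_customer(rec)) for cid, rec in customers.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for cid, score in ranked:
    print(cid, score)
```

Under this assumed scheme, customers at the top of the ranking would be the candidates for "potential customer" status; a real scoring model would derive the points from data rather than fix them by hand.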
The remainder of this paper is organized as follows: Section 2, comparative study of existing process models;
Section 3, enhancement of CRISP-DM with a scoring model; Section 4, results and discussion; Section 5, conclusion
and future work.
5. Assess: This stage evaluates the results by assessing the usefulness and reliability of the findings from the data
mining process and its performance.
Even though the SEMMA process does not depend on the selected tool, it is associated with the SAS Enterprise Miner
software and acts as a guide for users implementing DM applications. SEMMA offers a simple-to-understand process
that allows structured and adequate development and maintenance of data mining projects.
2.3 The CRISP-DM Process
The CRISP-DM process was designed by a group that included DaimlerChrysler, SPSS and NCR. CRISP-DM stands for
CRoss-Industry Standard Process for Data Mining. It consists of a cycle that comprises six stages:
1. Business Understanding: This first stage focuses on understanding the objectives and requirements of the project
from a business perspective, then converting this knowledge into a data mining problem and developing an initial
plan to achieve the objectives.
2. Data Understanding: This phase starts with an initial data set and proceeds with activities that make the data
familiar, identify data quality issues, discover first insights into the data, or find the subsets needed to form
hypotheses about hidden information.
3. Data Preparation: This covers all activities needed to build the final data set from the initial raw data.
4. Modeling: In this stage, various modelling techniques are selected and applied, and their parameters are
calibrated to optimal values.
5. Evaluation: This stage evaluates the model as well as the steps carried out to construct it, to make sure that it
achieves the business objectives.
6. Deployment: Creating a model is not the end of the project. The knowledge gained must be organized and presented
in a user-friendly manner (Chapman et al., 2000).
2.4 Comparison
Comparing the KDD and SEMMA stages confirms the equivalence between them:
Sample is similar to Selection;
Explore is similar to Preprocessing;
Modify is similar to Transformation;
Model is similar to DM;
Assess is similar to Interpretation/Evaluation.
On closer investigation, the five stages of the SEMMA process can be seen as a practical implementation of the five
phases of the KDD process. Compared with the KDD stages, the CRISP-DM stages do not map as straightforwardly as those
of SEMMA. Nevertheless, the CRISP-DM methodology includes the steps given above together with steps that precede or
follow the KDD process. The Business Understanding phase deals with developing an understanding of the application
domain, the relevant prior knowledge and the goals of the end user. The Deployment phase incorporates this knowledge
into the working system. As for the other stages, the Data Understanding phase is a blend of Selection and
Preprocessing; the Data Preparation phase is related to Transformation; the Modeling phase corresponds to DM; and
the Evaluation phase corresponds to Interpretation/Evaluation.
Table 1 presents a summary of the correspondences:
Table 1: Summary of the Correspondences between KDD, SEMMA and CRISP-DM
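The correspondences described in this section can also be written down as a small lookup structure. The following is a minimal sketch built only from the mappings stated in the text; the labels "Pre KDD" and "Post KDD" are assumed names for the steps that precede and follow the KDD process, and the helper function name is hypothetical.

```python
# Each row: (KDD stage, SEMMA stage, CRISP-DM stage).
# "Pre KDD" and "Post KDD" are assumed labels for steps outside the five KDD stages;
# None marks that SEMMA has no corresponding stage.
CORRESPONDENCE = [
    ("Pre KDD", None, "Business Understanding"),
    ("Selection", "Sample", "Data Understanding"),
    ("Preprocessing", "Explore", "Data Understanding"),
    ("Transformation", "Modify", "Data Preparation"),
    ("DM", "Model", "Modeling"),
    ("Interpretation/Evaluation", "Assess", "Evaluation"),
    ("Post KDD", None, "Deployment"),
]


def stages_for_kdd(kdd_stage):
    """Return the (SEMMA, CRISP-DM) stages that correspond to a KDD stage."""
    for kdd, semma, crisp in CORRESPONDENCE:
        if kdd == kdd_stage:
            return semma, crisp
    raise KeyError(kdd_stage)


# Print the correspondences as aligned plain-text columns.
print(f"{'KDD':<28}{'SEMMA':<10}{'CRISP-DM'}")
for kdd, semma, crisp in CORRESPONDENCE:
    print(f"{kdd:<28}{semma or '-':<10}{crisp}")
```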
Previous research indicates that data mining experts follow the KDD process model because of its completeness and
accuracy. In contrast, CRISP-DM and SEMMA are strongly company oriented. Specifically, SEMMA is used by SAS
Enterprise Miner and is integrated with that software. However, studies indicate that CRISP-DM is more complete than
SEMMA. These process models help users and experts understand the application of data mining in practical
environments. CRISP-DM was developed as an industry-oriented and tool-neutral process. Starting from the embryonic
knowledge discovery processes used in early data mining projects, which responded directly to user requirements, the
model can be applied across various industry sectors. It makes large data mining projects faster, cheaper, more
reliable and more manageable. Even small-scale data mining investigations benefit from using CRISP-DM.
Will a customer extend a product requested earlier (e.g. decide to take a higher credit limit)? (up-selling)
Will a customer use a product less? (attrition scoring)
Will a customer stop using a product while starting to use another? This problem often occurs in telecoms. (churn)
It is vital in situations where only a small data set is available. This happens, for instance, when constructing a
model to assess the creditworthiness of customers who apply for a mortgage loan, where the sample is smaller than for
cash or retail loans. With less data, the choice of method for building the model becomes more significant. Where the
data are extensive, the optimal choice of method and expertise in analysing the data play a key role and are a major
factor in success. A well-suited method allows the uncertainty to be evaluated, which reduces risk. Implementing the
best model directly increases profit and competitiveness. This is especially important during an economic recession.
Pseudo code for the credit scoring calculation used to find potential customers
Data Set Used: The German Credit data set contains observations on 30 variables for 1000 past applicants for
credit. Each applicant was rated as good credit (700 cases) or bad credit (300 cases). New applicants for credit can also
be evaluated on these 30 "predictor" variables.
Table 1.2 below shows the values of these variables for the first several records in the data set.
Table 1.2: The Data (First Several Rows)
The consequences of misclassification have been assessed as follows: the cost of a false positive (incorrectly saying
an applicant is a good credit risk) outweighs the cost of a false negative (incorrectly saying an applicant is a bad
credit risk) by a factor of five. This can be summarized in the following table.
Table 1.3: Opportunity Cost Table (in Deutsche Marks)
The Opportunity Cost table was derived from the average net profit per loan as shown below:
Table 1.4: Average Net Profit
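The effect of the five-to-one cost ratio on the accept/reject decision can be sketched as follows. The sketch works in relative cost units (1 and 5) rather than Deutsche Marks, because only the ratio is stated in the text; the function names and the assumption that the model outputs a probability of good credit are illustrative and not part of the paper.

```python
# Relative misclassification costs taken from the stated factor of five.
# Absolute amounts in Deutsche Marks are deliberately not used here.
COST_FALSE_POSITIVE = 5.0  # applicant accepted as good credit but actually bad
COST_FALSE_NEGATIVE = 1.0  # applicant rejected as bad credit but actually good


def total_misclassification_cost(false_positives, false_negatives):
    """Total cost (relative units) of a set of classification errors."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


def accept(p_good):
    """Accept when the expected cost of accepting is below that of rejecting.

    p_good is an assumed model output: the estimated probability that the
    applicant is a good credit risk.
    """
    expected_cost_accept = (1.0 - p_good) * COST_FALSE_POSITIVE
    expected_cost_reject = p_good * COST_FALSE_NEGATIVE
    return expected_cost_accept < expected_cost_reject


# Probability of good credit at which both decisions cost the same: 5 / (5 + 1).
BREAK_EVEN = COST_FALSE_POSITIVE / (COST_FALSE_POSITIVE + COST_FALSE_NEGATIVE)

print(f"break-even probability of good credit: {BREAK_EVEN:.3f}")
print("accept applicant with p_good = 0.90:", accept(0.90))
print("accept applicant with p_good = 0.80:", accept(0.80))
```

Under these assumptions an applicant is accepted only when the estimated probability of good credit exceeds 5/6, which is much stricter than the 0.5 cut-off that would apply if both errors cost the same.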
Useful graphs for assessing the performance of the scoring model include the lift chart and the Kolmogorov-Smirnov
chart. For example, the following graph shows the Kolmogorov-Smirnov (KS) graph for a credit scoring model.
Figure 2
In this graph, the X axis shows the credit score values (sums), and the Y axis shows the cumulative proportions of
observations in each outcome class (Good Credit vs. Bad Credit) in the hold-out sample. The further apart the two
lines are, the greater the degree of differentiation between the Good Credit and Bad Credit cases in the hold-out
sample, and thus the better (more accurate) the model.
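The largest vertical gap between the two cumulative curves is the KS statistic, and it can be computed directly from hold-out scores. The following is a minimal sketch; the scores and labels are made-up illustrative values, not results from the German Credit data, and the function name is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(scores, is_good):
    """Maximum gap between the cumulative score distributions of bad and good cases."""
    goods = sorted(s for s, g in zip(scores, is_good) if g)
    bads = sorted(s for s, g in zip(scores, is_good) if not g)
    if not goods or not bads:
        raise ValueError("both outcome classes must be present")

    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        # Cumulative proportion of each class with score <= threshold.
        cum_good = bisect_right(goods, threshold) / len(goods)
        cum_bad = bisect_right(bads, threshold) / len(bads)
        largest_gap = max(largest_gap, abs(cum_bad - cum_good))
    return largest_gap


# Illustrative hold-out sample (made-up values): higher score = better credit.
scores = [300, 350, 400, 450, 500, 550, 600, 650]
is_good = [False, False, True, False, True, True, True, True]

print(f"KS statistic: {ks_statistic(scores, is_good):.2f}")
```

A value close to 1 means the two curves are far apart, so the model separates Good Credit from Bad Credit cases well; a value close to 0 means the score carries little information about the outcome.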
REFERENCES
1. Fayyad, U. M. et al., 1996. From data mining to knowledge discovery: an overview. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
2. Benoît, G., 2002. Data Mining. Annual Review of Information Science and Technology, Vol. 36, No. 1, pp 265-310.
3. Brachman, R. J. & Anand, T., 1996. The process of knowledge discovery in databases. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
4. Chen, M. et al., 1996. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and
Data Engineering, Vol. 8, No. 6, pp 866-883.
5. Simoudis, E., 1996. Reality check for data mining. IEEE Expert, Vol. 11, No. 5, pp 26-33.
6. Fayyad, U. M., 1996. Data mining and knowledge discovery: making sense out of data. IEEE Expert, Vol. 11, No. 5,
pp 20-25.
7. Dzeroski, S., 2006. Towards a General Framework for Data Mining. In Dzeroski, S. and Struyf, J. (Eds.), Knowledge
Discovery in Inductive Databases. LNCS 4747. Springer-Verlag.
8. Meo, R. et al., 1998. An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery, Vol. 2,
pp 195-224. Kluwer Academic Publishers.
9. Imielinski, T. & Virmani, A., 1999. MSQL: A Query Language for Database Mining. Data Mining and Knowledge
Discovery, Vol. 3, pp 373-408. Kluwer Academic Publishers.
10. Sarawagi, S. et al., 2000. Integrating Association Rule Mining with Relational Database Systems: Alternatives and
Implications. Data Mining and Knowledge Discovery, Vol. 4, pp 89-125.
11. Botta, M. et al., 2004. Query Languages Supporting Descriptive Rule Mining: A Comparative Study. Database Support
for Data Mining Applications. LNAI 2682, pp 24-51.
12. SAS Enterprise Miner SEMMA. SAS Institute. Accessed from
http://www.sas.com/technologies/analytics/datamining/miner/semma.html on May 2008.
13. Santos, M. & Azevedo, C., 2005. Data Mining: Descoberta de Conhecimento em Bases de Dados. FCA Publisher.
14. Chapman, P. et al., 2000. CRISP-DM 1.0 - Step-by-step data mining guide. Accessed from
http://www.crisp-dm.org/CRISPWP-0800.pdf on May 2008.