Candace L. Gunnarsson
is owner-operator of S2 Statistical Solutions, Inc., founded in 1992, and provides statistics consulting and training seminars.
Dr Gunnarsson has taught statistics courses for more than ten years in the fields of education, psychology, and business
at Xavier University, The Union Institute, and University of Cincinnati.
Mary M. Walker
is Professor of Marketing at Xavier University in Cincinnati, Ohio, and in this role was the recipient of a grant sponsored by the
General Electric Company to develop a course on data mining. She has served as Xavier's Acting Vice President for Information
Resources, overseeing all information technology during 2004–2005.
Vern Walatka
is a chemical engineer and chemist with experience as a SAS programmer and data analyst. He currently consults on topics of
survey analysis, GIS, and mapping software.
Kenneth Swann
is Senior Data Analytical Manager with a background in software management and market research. He has organised three
successful annual SAS one-day conferences for statisticians and researchers.
Abstract Many organisations across a variety of industries are engaging in the process
of data mining as part of an overall strategy for business intelligence, customer
relationship management (CRM), including churn prevention. This paper provides an
overview of the data mining process and illustrates a case study in which data mining is
utilised as a churn prevention tool for a major Midwest USA newspaper. For this case
study, a decision tree, a common modelling technique, was the analytical tool of choice.
Lessons learned throughout the data mining process are provided to offer insight and to
promote the sharing of information. Strategies for getting started in the data mining
process are presented to encourage organisations to embrace a data-driven strategy for
business intelligence, CRM and churn prevention.
Journal of Database Marketing & Customer Strategy Management (2007) 14, 271–280.
doi:10.1057/palgrave.dbm.3250058
© 2007 Palgrave Macmillan Ltd 1741-2439 $30.00 Vol. 14, 4, 271–280 Database Marketing & Customer Strategy Management
www.palgrave-journals.com/dbm
Gunnarsson et al.
A case study using data mining in the newspaper industry
coordinates the inputs that establish business objectives. It is critical that all domain experts be part of the data mining activity so that the data mining effort put forth is better able to deliver the defined objectives.
The business objectives best suited to the data mining process are those involving prediction or seeking an explanation of behaviour. For example, CRM objectives include determining best customer profiles and predicting customers who are most at risk of churning. Once initial objectives have been defined and a decision to move forward with the data mining process is made, the knowledge acquired from the process will lead to more business objectives. Companies should not embark on data mining "projects". Data mining is an iterative process that is employed as part of an overall strategy for business intelligence and problem solving by applying statistical modelling to large amounts of transformed, transactional and historical data.
Once the business objectives have been delineated, the next step is to determine what data are needed to address the objectives and how the data will be obtained.
The data warehouse and data preparation
Steps two and three in the data mining process typically go hand in hand, and are crucial elements of the process. While newer innovations create opportunities to make this process more efficient, data warehousing and data preparation have historically represented 85 per cent of the data mining effort,3 and continue to be a mainstay of successful data mining initiatives.
All data warehouse sources, both internal and external, need to be considered for data mining. Most companies have many different and incompatible internal data sources available, making warehousing inefficient. Businesses collect enormous amounts of transactional and historical data. The task of deciding the appropriate data fields that need to be extracted from these various data sources, as well as how the data should be organised for modelling, can be arduous.
The data preparation step is generally a large and challenging undertaking. This is because the data utilised in the data mining process are typically massive, being transactional in nature. They are typically not collected with analysis in mind. Instead, the data are collected to monitor process control as the result of transactions that took place within the organisation or between an organisation and a constituent.
Modelling
Once the data are prepared, three modelling techniques are typically used: decision trees, regression analysis and neural networks. These techniques are neither new modelling procedures nor exclusive to the data mining process.
A distinct difference exists between data mining and statistical modelling. Data mining focusses on computer-generated models, whereas traditional statistical modelling stresses theory-driven hypothesis testing:4 hypotheses need to be generated a priori and assumptions such as linearity need to be tested. By comparison, data mining does not impose the assumptions and limitations of traditional theoretical modelling.
Typically the data mining process uses statistical modelling and methods, and therefore requires target data. In other words, the modelling process attempts to explain or predict a target variable. A target variable is a value that is either known or created from other values in some currently available data, but will be unknown in some future or fresh data. It is the dependent variable when applying the three statistical modelling techniques. A target variable can be dichotomous, nominal or interval. In order to predict behaviour one must be able to assign either a probability of response or a range of response based on a list of attributes associated with the target variable.
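As a minimal illustration of assigning a probability of response from an attribute, the sketch below computes the observed response rate of a dichotomous target within each value of one attribute. The records and the field names `auto_pay` and `responded` are hypothetical and are not taken from the case study data; a real model would combine many attributes rather than one.

```python
from collections import defaultdict


def response_rate_by_attribute(records, attribute, target):
    """Return {attribute value: share of records whose target equals 1}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [responders, total]
    for row in records:
        tally = counts[row[attribute]]
        tally[0] += row[target]  # target is coded 1 (response) or 0 (no response)
        tally[1] += 1
    return {value: hits / total for value, (hits, total) in counts.items()}


# Hypothetical toy records, for illustration only.
households = [
    {"auto_pay": 1, "responded": 1},
    {"auto_pay": 1, "responded": 1},
    {"auto_pay": 0, "responded": 0},
    {"auto_pay": 0, "responded": 1},
    {"auto_pay": 0, "responded": 0},
    {"auto_pay": 0, "responded": 0},
]

rates = response_rate_by_attribute(households, "auto_pay", "responded")
```

Here `rates` maps each attribute value to an estimated probability of response, which is the simplest form of the attribute-based prediction described above.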
d. For both former subscribers and those who have never subscribed: Are there risk/reward timings between possible sub-segments of potential households, based on demographics or former subscription history?
All of the above questions were tied to the company's business objectives and were well-suited questions for the data mining process. After careful examination of the data warehouse, however, the data available did not support all objectives and research questions at hand.
Lesson 1: Executives have a broad array of business objectives and research questions; but without the appropriate data, the data mining process cannot support knowledge discovery, and a substantial disconnect will exist between upper management and the database marketers they employ.
Once these disconnects are revealed, management must be willing to invest in the data gathering and warehousing resources necessary to meet its business objectives. Companies need to view their data as a corporate asset. Just as data mining is an iterative process, so too is data warehousing. If the data capture is not sufficient for the data mining process, organisations must commit to putting the scaffolding in place that will support the necessary data infrastructure. Constructing the data infrastructure is typically time consuming and costly, but if done correctly, is well worth the investment.
Lesson 2: Identifying and aligning business objectives with the data available to answer these objectives is a crucial step in the data mining process. Domain experts must be willing to commit the necessary time and resources to this task.
Based on this experience, we recognised the scope of this task and developed a three-step process (see Figure 2: Domain Expert Questionnaire) to be utilised with all domain experts to aid in defining goals and business objectives. Domain experts are individuals who will be participating in the data mining initiative and who have at least some degree of decision-making authority.
Step Two: The data warehouse
At the newspaper, both internal and external data were stored in different functional databases specific to particular business processes. To maintain customer privacy, we were given a random sample of 10 per cent of the households in the master
1. Which person(s) at your company are responsible for initiating this data mining effort?
2. List the names, positions and domain expertise of all the people in your organization who are
responsible for initiating this effort.
3. List the names, positions and domain expertise of all the people in your organization who will be
working on and ultimately responsible for this effort. Note: Keep a tally of the people who
overlap, those that are both responsible for initiation, data preparation and analysis, and
accountable for the outcomes.
Step 2: Give the following questionnaire to all domain experts involved in both the initiation and organization
of the data mining process.
1. Do you have a particular question you are looking to answer? If so, write it down.
2. Do you have a particular problem you are trying to solve? If so, write it down.
3. Do you have a business goal? If so, write it down.
4. Is this effort part of a bigger organizational policy or practice?
5. Are you trying to predict something?
6. Are you trying to understand or explain some behavior or phenomenon?
Step 3: Analyze the results from the domain experts. Are expectations the same across the board? Write the
business objectives that encompass the statements of the domain experts you surveyed. If the goals and
objectives are different, work these out ahead of time. All experts involved in the process should agree on the
business objectives.
combining it into a single record. During this process, important data were retained and transformed in the form of newly defined variables.
The three files now contained only one record per household and were merged into one large file by matching records with the same Address_Sequence_Number. For the final flat data file, only those records that appeared in all three files were retained, guaranteeing that all households in the final data file contained valid transactions and received at least one promotion during the 30-month period of the data.
Lesson 5: During data preparation, many decisions will be made concerning which data to include. It is important to remember that data mining is a process. Therefore, some of these initial decisions regarding data aggregation, inclusion and exclusion have the potential to change based on the learning from the actual models.
Step Four: Modelling
After data preparation, the next step was to recode and derive new variables for modelling. New variables were computed, some of which included tenure and recency of last transaction. For this study, our target or dependent variable was a dichotomous variable, which was named Flag_active. This target variable, Flag_active, had a value of either 1 or 0, indicating, respectively, whether a given household was an active subscriber to the newspaper at the end of the 30-month period from which these data were captured, or a non-subscriber (a household that has churned).
This variable was created because one of the business objectives was to identify those customers most likely to churn. By creating a dependent dichotomous variable comparing active customers to their inactive counterparts, we were able to look for explanatory variables that would be the most significant indicators and predictors of churn. Due to the disconnect between the business objectives and the available data, the churn model was one of the few objectives we were able to examine.
Lesson 6: It is critical that you identify or create the appropriate target variable to meet your business objectives. If this is not done you may have statistically significant results, but you will fall short of having results that are practical and aligned with your business objectives.
For this case study, a decision tree was developed using SAS Enterprise Miner 5.2. The decision tree was the modelling technique of choice over a more classical regression approach because the decision tree did not require a priori distribution assumptions and we did not need to impute missing values. Missing values for critical variables of interest were extensive, making imputation inadvisable. A decision tree was preferred over the neural network approach because the results were primarily going to be used for explanation, and a decision tree is easy for the client to understand and interpret. Neural networks can sometimes be viewed as a "black box", making it difficult to explain the method of modelling.
Before running the decision tree, a Data Partition node was used in Enterprise Miner to randomly split the input SAS dataset into three datasets: Training (40 per cent), Validation (30 per cent) and Test (30 per cent). The training data are used initially to fit the model and the validation data are then used to fine-tune the model. The test data are used to estimate the error of the final model. The chi-square statistic was used to evaluate candidate splits, with a maximum acceptable p value for each split of 0.2. Missing values were used as a value during the split search.
Lesson 7: Choose the correct statistical tool that fits your business objective and your communication needs.
Lesson seven is important because if the goal is an algorithm for prediction purposes to be utilised in a mailing database, a logistic regression model would be a good
(which was the goal in this case study), a decision tree is a better choice due to its ease of use and understanding over the logistic regression coefficients.
Variable                         n         Per cent
Target variable: Active
  Yes                            10,141    58
  No                             7,411     42
Children in the household
  Yes                            8,521     49
  No                             9,031     51
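The merge on Address_Sequence_Number (keeping only households present in all three files) and the 40/30/30 random partition described in the case study can be sketched in a few lines. This is an illustrative sketch only, not the SAS code or the Enterprise Miner Data Partition node used by the authors; the file contents, the field names `home_value` and `promotions_received`, the function names and the fixed random seed are all assumptions made for the example.

```python
import random


def inner_join(files, key="Address_Sequence_Number"):
    """Merge per-household files, keeping only keys present in every file."""
    shared = set(files[0])
    for table in files[1:]:
        shared &= set(table)
    merged = []
    for k in sorted(shared):
        record = {key: k}
        for table in files:
            record.update(table[k])  # one record per household per file
        merged.append(record)
    return merged


def partition(records, fractions=(0.4, 0.3, 0.3), seed=0):
    """Randomly split records into training, validation and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical toy files keyed by Address_Sequence_Number.
demographics = {i: {"home_value": 100_000 + 1_000 * i} for i in range(1, 13)}
transactions = {i: {"Flag_active": i % 2} for i in range(1, 12)}
promotions = {i: {"promotions_received": 1} for i in range(2, 13)}

flat_file = inner_join([demographics, transactions, promotions])
train_set, validation_set, test_set = partition(flat_file)
```

Only households 2 to 11 appear in all three toy files, so the flat file holds ten records, which the partition divides in the stated proportions. Fixing the seed makes the split reproducible, which helps when models are refitted during later iterations.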
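[Decision tree figure, lower levels only. In each node, 1 = active and 0 = churned; n is the number of households. Indented nodes are the children of the node above them.]
Recency >= 1.6: 1 = 66%, 0 = 34%, n = 1,666
    Tenure < 35: 1 = 41%, 0 = 59%, n = 423
    Tenure >= 35: 1 = 75%, 0 = 25%, n = 1,243
Recency < 1.6: 1 = 88%, 0 = 12%, n = 1,376
Tenure < 75: 1 = 29%, 0 = 71%, n = 873
    AutoPay = 1: 1 = 100%, 0 = 0%, n = 53
    AutoPay = 0: 1 = 24%, 0 = 76%, n = 820
Tenure >= 75: 1 = 61%, 0 = 39%, n = 938
    Home Value < 108,900: 1 = 47%, 0 = 53%, n = 323
    Home Value >= 108,900: 1 = 69%, 0 = 31%, n = 615
Tenure < 115: 1 = 6%, 0 = 94%, n = 929
    Auto Pay = 1: 1 = 100%, 0 = 0%, n = 9
    Auto Pay = 0: 1 = 5%, 0 = 95%, n = 920
Tenure >= 115: 1 = 27%, 0 = 73%, n = 511
    Delivery = 1: 1 = 41%, 0 = 59%, n = 316
    Delivery = 0: 1 = 5%, 0 = 95%, n = 195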
active customers at low recency (frequent contact) is, however, likely due to contacts initiated by the newspaper to increase customer satisfaction and profitability. A more focussed definition of Recency should provide guidance to the type and frequency of customer contact by the paper.
The variable, Tenure, refers to the number of months that a household has subscribed to the paper and has breakpoints at 35, 75 and 115 months. The percentage of active households increases with tenure, confirming the loyalty of long-term customers. The split based on Tenure at 35 months shows 75 per cent active customers at the longer tenure versus only 41 per cent active at the shorter tenure. This is an example of how increased knowledge could lead to the development of an action plan to support the business objective of churn prevention. This split suggests that a promotional offer for customers reaching tenure of around 30 months could be tried to increase retention. Once implemented, the impact of this promotion could be measured. After evaluating the effectiveness of this action, the iterative process would continue.
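To show how a chi-square criterion with a maximum p value of 0.2 would treat the Tenure split at 35 months, the sketch below computes a plain Pearson chi-square statistic for the 2 x 2 table of tenure group by active status. It is a simplified illustration and not the exact split-search procedure of SAS Enterprise Miner, which may apply further adjustments. The cell counts are assumptions reconstructed from the rounded percentages in the tree (41 per cent of 423 and 75 per cent of 1,243), so they are approximate.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Approximate counts (assumed, reconstructed from rounded percentages):
# Tenure < 35:  n = 423,   about 173 active and 250 churned
# Tenure >= 35: n = 1,243, about 932 active and 311 churned
chi2 = chi_square_2x2(173, 250, 932, 311)
p = p_value_1df(chi2)

MAX_P_VALUE = 0.2  # threshold stated in the case study
accept_split = p <= MAX_P_VALUE
```

With these approximate counts the statistic is far above what a p value of 0.2 requires, so the split would be accepted under this simplified criterion.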
CONCLUSIONS
Data mining is an iterative process of knowledge discovery that promotes data-driven business intelligence and decision making. Data storage is growing at unprecedented rates, which drives higher demand for tools that can reduce data into information.7 Setting appropriate business objectives with all domain experts involved in the data mining process is a critical first step. Data capture and retention systems have to be designed to provide the data that assist in the evaluation of these objectives. Data that may come from varied sources must be stored in a way that allows for analyses to be performed without regard to the source of the original domain area and in accordance with industry-wide SOPs. Determining the relevant variables to analyse for relationships and improved business
Appendix A
Variable List
1 Estimated home value
2 Age
3 Length of residence
4 Prizm code
5 Presence of children in the household
6 Credit card user
7 Credit account
8 Delivery
9 Delivery value
10 Type of dwelling
11 Auto pay customer
12 Gender
13 Home owner
14 Income
15 Mail preference
16 Mail responder
17 Marital status
18 Recency
19 State
20 Tenure
21 Tresp/dedupe
22 Working professional