
DATA MINING-AN OVERVIEW

Vishnuprasad Nagadevara
Indian Institute of Management Bangalore

Data mining is probably the most talked about topic today. Whether it is a corporate
executive or a senior bureaucrat or a financial analyst or a software vendor, data mining
seems to offer the ultimate solution to their problems. Interestingly, data mining means
different things to different people. For marketing executives, data mining seems to be
synonymous with customer relationship management (CRM), even though in reality, data
mining is just a tool employed in CRM. For a software vendor or a software
professional, data mining means a variety of data processing tools. On the other hand,
some people believe that data mining is little more than applying statistical and
mathematical tools to discover knowledge buried in large databases. It is true that data
mining requires large amounts of data, usually stored in large data warehouses. In a
sense, many organizations are trying to use data mining as a tool to leverage the huge
investments made in data capturing and storage.

Most of the techniques that are used in data mining have been around for a long time.
Techniques such as clustering, chi-square, and factor analysis are a few decades old, and so
are the concepts of neural networks, which are borrowed by data mining experts from the
research on artificial intelligence. So is the case with using Bayesian probabilities to find
out associations between different products. The main factors that lead to the popularity
of data mining today are the ease with which data is captured with every transaction
through on-line transaction processing (OLTP) systems, capacity to store large (measured
in terabytes) amounts of data and finally, the continuously falling cost of computing
power. Many of the data mining techniques deal with large amounts of data and it has
always been a problem to collect and store data. With the advent of OLTP systems, data
collection or data capture, to be exact, has become a part of the transaction itself. Just
imagine the amount of data collected by grocery store chains, by telephone companies that
record information on each and every call, or through ATMs, telephone banking and
credit card transactions. The cost
of storage systems has been decreasing exponentially, and the huge volumes of data that
have been collected through OLTP systems can now be stored at a fraction of the cost. In
addition, the networking of computers has made it easier to create geographically
dispersed data warehouses. Finally, the ever-decreasing cost of CPUs has made available
significant amounts of computing power in the hands of the users at a very low cost.
These factors, coupled with the need to be highly competitive in today's marketplace,
have made data mining popular. Its popularity continues to grow as organizations seek to
understand their customers better, leading to cross-selling or up-selling at each and
every touch-point with the organization.

One of the main attractions of data mining is that it can be used across different industries
or sectors to meet a variety of objectives. The Internal Revenue Service as well as the
State of Washington have used it effectively to improve tax revenues. Many companies
have used the technique for customer acquisition as well as cross-selling and up-selling.
Many cellular telephone companies have used it to reduce customer churn. Insurance
companies have employed these techniques to detect claims that are likely to be fraudulent,
thereby increasing the level of scrutiny of those claims.

There are many definitions of data mining. For many, it is another form of data analysis
through which the knowledge buried in databases can be unearthed. It is viewed as a set
of tools, which will help in identifying the patterns or relationships between variables. At
another level, data mining refers to situations where proper predictions can be made
about the likely behavior of the customer. Some examples are predicting a customer
who is most likely to churn or to buy a new product, or a taxpayer who is likely to
under-report his income. One of the most appropriate definitions of data mining is by the
Gartner Group, which defines data mining as the
"process of discovering meaningful new correlations, patterns and trends
through large amounts of data using pattern recognition technologies as
well as statistical and mathematical techniques."
The important part of the definition is the new relationships or knowledge. It is of no
importance if data mining unearths information that is already common knowledge. For
example, if the exercise results in the relationship that people buying bread also buy
butter or jam, it is of no importance because the information is already known and
probably is being acted upon. On the other hand, if the analysis brings out a relationship
that a large proportion of young males who buy diapers on a weekend also buy beer, it
unearths a relationship, which is not common knowledge. It may or may not be possible
to rationalize the relationship, but if such discovery leads to a strategy or can be acted
upon, then it is useful or meaningful. These relationships are discovered by analyzing
the behavior captured in large amounts of data.
Data mining uses statistical or mathematical techniques (referred to as pattern recognition
technologies) to bring out the relationships hidden in the data. Considering the vast
amounts of data that need to be processed, it is obviously completely dependent on
computing power. These techniques generally lead to predictive or classification models.
These techniques are capable of identifying the patterns or relationships, which are
dominant in nature while ignoring the trivial ones. The techniques are used to build
mathematical models, which are used to understand complex relationships as well as to
develop predictions. A typical data mining process is depicted in Figure 1.
Figure 1. Typical data mining process
[Diagram: several inputs (Input 1, Input 2, Input 3) feed a data mining model, whose
output supports classification and analysis.]

There are many data mining tools and techniques that are employed under different
scenarios. The appropriateness of the technique or the tool used depends on the business
problem and also the ultimate objective. There are many situations where the
different techniques can be applied to a given business problem, each resulting in
different levels of accuracy in prediction or different groups of classification. The
challenge lies in identifying the right technique or the tool, which will provide the best
possible classification or prediction.

There are many techniques that are used for classification and prediction. Data mining
techniques can be either directed or undirected. In general, the directed data mining
technique results in a model whose output is specified before building the model. Thus,
it involves defining a target (such as response, credit rating, tax compliance etc.) variable
and then using other variables to explain or predict the target variable. The model is built
using a portion of the data that is available. Building this model is usually referred to as
training the model. The data used for training the model is called the training set. The
training model is then tested on another set of data (which is very similar to the one used
for training) to fine-tune the model. Finally, the effectiveness of this model is evaluated
using an evaluation set of data. The logic for using different sets of data at different
stages is that the information contained in the data used for training has already been
absorbed by the model and using the same data for evaluation is likely to over-estimate
the results in terms of its effectiveness. Similarly, the test data set is used to fine-tune
the model so that it does not become specific to certain aspects that are present in the
training set, and so the model can be generalized.
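The three-way split described above can be sketched in a few lines. This is only an illustration: the record values and the 60/20/20 proportions are arbitrary placeholders, not prescribed by any particular tool.

```python
import random

random.seed(42)                       # reproducible shuffle
records = list(range(1000))           # stand-ins for customer records
random.shuffle(records)

# Train the model on one portion, fine-tune it on the test set, and
# measure effectiveness only on data the model has never influenced.
train      = records[:600]
test       = records[600:800]
evaluation = records[800:]

# The three sets share no records, so the evaluation step cannot
# over-estimate the model's effectiveness.
assert not (set(train) & set(test))
assert not (set(train) & set(evaluation))
assert not (set(test) & set(evaluation))
```

Keeping the evaluation set completely untouched during training and fine-tuning is what makes its error estimate trustworthy.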

There is no target variable identified in the case of undirected data mining. The
undirected data mining techniques identify patterns that are significant. Usually,
undirected data mining leads to further investigation using directed data mining
techniques.

Some of the data mining techniques that are used commonly are:
Clustering
Associations
Classification trees
Neural networks
Discriminant analysis
Genetic algorithms etc.

Clustering:
Clustering or cluster detection is a typical undirected data mining technique. The
technique involves putting similar items (or records) together. The similarities or
dissimilarities are based on distances between items as measured by the similarities or
dissimilarities between various characteristics of the items. Imagine a small dataset with
7 customers who have purchased groceries and toiletries, as shown in Table 1.
Table 1. The purchases of groceries and toiletries
Customer Groceries (Rs) Toiletries (Rs)
1 1200 300
2 1300 380
3 500 1800
4 450 1900
5 1350 1560
6 1400 1620
7 1550 1450

Each of these customers could be treated as a cluster i.e., there are 7 clusters. In the next
step, the two customers who are nearest to each other could be combined into one cluster.
The distance between the customers 5 and 6 is the smallest and hence these two are
combined into one cluster. At this stage, there are 6 clusters namely, one cluster with
customers 5 and 6 and the remaining 5 treated as five individual clusters. In the next
stage, customers 3 and 4 are brought into one cluster followed by customers 1 and 2
forming the next cluster. Then customer 7 is added to the already existing cluster
containing customers 5 and 6. Thus, there are three clusters at this stage, the first one
containing customers 5,6 and 7; the second one containing customers 3 and 4 and the last
one containing customers 1 and 2. The next step involves combining clusters 1 and 3.
Finally all the clusters are combined together to form one single cluster. At each stage,
the process of combining individuals into clusters or clusters with individuals or other
clusters is based on the distance between them. The three cluster stage is represented in
Figure 2.
Figure 2. Clusters formed by combining the customers
[Scatter plot of toiletries spend (Rs) against groceries spend (Rs): Cluster 1 contains
customers 5, 6 and 7; Cluster 2 contains customers 3 and 4; Cluster 3 contains
customers 1 and 2.]

In this process, we have started with 7 individuals who are gradually formed into clusters
by combining individuals or clusters at each step and these clusters in turn combined into
a single cluster. The two extremes are seven individuals (clusters) and one single cluster.
The important issue in cluster analysis is to decide how many clusters to form. This is
usually based on the unique characteristics of each of the clusters formed. In this
example, we may decide to have three clusters: cluster 1 has high spending on both
groceries and toiletries; cluster 2 is high on toiletries but low on groceries; and cluster 3
is high on groceries but low on toiletries. Cluster analysis is an effective technique in
market segmentation.
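The step-by-step merging described above is agglomerative hierarchical clustering. A minimal sketch, assuming Euclidean distance and single linkage (the chapter does not name a specific linkage rule), reproduces the three-cluster stage from Table 1:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Customer spends from Table 1: (groceries, toiletries) in Rs
customers = {1: (1200, 300), 2: (1300, 380), 3: (500, 1800),
             4: (450, 1900), 5: (1350, 1560), 6: (1400, 1620),
             7: (1550, 1450)}

def cluster(points, k):
    """Agglomerative clustering: start with each point as its own
    cluster and repeatedly merge the two closest clusters (single
    linkage) until only k clusters remain."""
    clusters = [{c} for c in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest inter-point distance
        best = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda ab: min(dist(points[p], points[q])
                                      for p in ab[0] for q in ab[1]))
        clusters.remove(best[0]); clusters.remove(best[1])
        clusters.append(best[0] | best[1])
    return clusters

three = cluster(customers, 3)
print(sorted(sorted(c) for c in three))
# → [[1, 2], [3, 4], [5, 6, 7]], the three clusters shown in Figure 2
```

The first merges follow the order in the text: customers 5 and 6 first, then 3 and 4, then 1 and 2, and finally customer 7 joins the cluster containing 5 and 6.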

Associations:
Market Basket analysis or Associations is often used to understand the product
associations and customer purchase patterns. The results of the Market Basket analysis
are used for cross selling, store layout, catalogue design and product pricing and
promotions. The input for market basket analysis is the transactional data at the point of
sale and the output is the set of rules that provides information on the product
associations and the customer behavior. It also helps in understanding customer
preferences from their purchase patterns. Although market basket analysis is primarily
used in analyzing point of sale transactions, it can also be used effectively for other
situations where the customer purchases multiple products or does multiple things in
close proximity. Examples of such situations are items purchased on a credit card,
banking services used by different customers, optional services purchased by telecom
customers, etc.

The Market Basket analysis using associations is generally carried out under the
support-confidence framework. Support is generally defined as the percentage of
transactions containing the items that are purchased together. A low support indicates the
infrequent nature of the transactions and low level of dependability or significance for the
associations. Usually, a minimum support level is set and only those associations which
meet the minimum level are considered for further analysis. Confidence is a measure of
association between items in the basket, say, items A and B. It is defined as the
transactions that contain item B calculated as a percentage of all transactions that
contained A. A higher confidence level indicates a higher association between the items.
The support-confidence framework, which is generally used for items that are
purchased together, has its own advantages as well as weaknesses.
The ultimate objective of the market basket analysis is to develop rules, which can be
acted upon. The usefulness of the rules is measured by the support and confidence. The
measure support is the percentage of the transactions that include, say two items A and B.
If there are 10,000 transactions containing both A and B out of a total of 100,000
transactions, then the support is 10% (10,000/100,000). In other words, support is the
same as the joint probability of both A and B occurring in the database, i.e., P(A∩B). In
the same example, if there are 20,000 transactions containing A, then the confidence for
A → B is calculated as 50% (10,000/20,000). This is the same as the conditional
probability of B occurring, given A, i.e., P(B|A). For the rule to be useful, the support as well as the
confidence levels should be high enough. If the level of support is high, it indicates that
there are enough transactions for the association rule to be applicable. It implies that
these are not rare occurrences. Similarly, if the level of confidence is high, it implies that
the association between the items is stronger.

In general, the support-confidence framework provides a very reliable basis for
formulating actionable rules. This type of analysis is mostly used in situations where
items are purchased together, and hence it is referred to as market basket analysis.
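The two measures follow directly from the definitions above. A toy sketch, assuming a made-up five-transaction log (the items and counts are illustrative, not data from the text):

```python
# Hypothetical transaction log: each transaction is the set of items bought.
transactions = [{"bread", "butter"}, {"bread", "jam"},
                {"bread", "butter", "jam"}, {"butter"}, {"jam"}]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(itemset <= t for t in transactions)   # subset test per transaction
    return hits / len(transactions)

def confidence(a, b, transactions):
    """Confidence of the rule a -> b: P(B|A), i.e. support of A and B
    together divided by the support of A alone."""
    return support(a | b, transactions) / support(a, transactions)

print(support({"bread", "butter"}, transactions))      # → 0.4 (2 of 5)
print(confidence({"bread"}, {"butter"}, transactions)) # ≈ 0.67 (2 of 3)
```

With the 100,000-transaction example in the text, the same functions would return a support of 0.10 and a confidence of 0.50 for A → B.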

Classification trees
Classification trees are one of the commonly used techniques in data mining. Like any
other tree, these classification trees start at the root and reach the leaf nodes through a
series of branches. While the root contains all the items or records in the data set, each
leaf node consists of homogenous group of records. The leaf nodes do not necessarily
contain completely homogenous records, but each node could be classified as belonging
to a particular category of records. Each record traverses through the branches of the tree
and finally reaches a particular leaf node, which determines the specific classification for
the record. The path traversed by the record is determined by a series of questions at
each node.
There are many algorithms available for constructing a classification tree. Notable
among these are CART, ID3, C4.5, CHAID, TREEDISC, etc. These algorithms follow
similar processes, but employ different criteria to determine the best variable for
splitting the groups at each node. The general process of constructing a classification
tree is as follows:
At the first step, a target variable is chosen. The target variable can be a binary variable
or a categorical variable, depending on the number of categories defined. The target
variable is also referred to as the dependent variable. A typical example would be a
customer who purchases a particular product or does not. Another example would be a
patient having high, low or normal blood pressure. The root node will contain all the
records in the dataset, irrespective of the category of the target variable they belong to.

In the second step, each variable in the dataset affecting the target variable is examined to
determine the homogeneity of the resulting groups of the target variable. The specific
examination would ask questions like, if the records are grouped by gender, what would
be the homogeneity of the records in terms of the categories of the target variable, say the
purchase decision of the customer. How would the homogeneity differ if age is
considered instead of gender? In other words, the objective is to divide the records into
as homogenous groups as possible, based on the values of other variables. At each node,
only one of the variables is selected for performing the split. Once a particular variable
is used at a particular node, the same variable is not considered down the path. Different
algorithms use different criteria for measuring the resultant homogeneity at a particular
node so as to select the specific variable which will maximize the homogeneity. For
example, the CHAID algorithm uses the chi-square test, while CART uses the Gini rule.

At step 3, select the variable, which results in the maximum homogeneity of the target
variable and split the node based on the selected variable. If the resultant groups contain
records of the same category of the target variable, the resultant node is treated as a leaf
and no further split is necessary. On the other hand, if the resultant group has certain
amount of heterogeneity, continue with step two to create further splits. The process is
continued till all the nodes become leaf nodes or all the variables are exhausted.

Let us consider the credit ratings of customers of a bank to demonstrate the classification
tree. There are a total of 3230 customers of whom, 1680 had bad credit rating and the
remaining 1550 had good credit rating. At this stage, there are almost equal chances of a
customer having a good or a bad rating. This is shown in the top node, or the root node, in
Figure 3 below.

Figure 3. Classification Tree
[Tree diagram: the root node of 3230 customers is split by pay period, and each branch
is split again by age.]


The target variable in this example is the credit rating. The two characteristics available
for splitting the group of customers are age and pay period. If these customers are
divided into two groups based on the pay period (either paid monthly or other than
monthly), the homogeneity of each of the two groups increases. The first group (with
monthly pay period) consists of a total of 1580 customers, of whom only 250 have a bad
rating. On the other hand, the other group has 1650 customers, of whom only 220 have a
good rating. These two groups are shown as branches emanating from the root node in
Figure 3. These groups are further divided based on age, and the resultant nodes are more
homogenous, as shown in Figure 3. Tracing the path on the tree, it can be interpreted that
if a customer's pay period is monthly and the age is at least 25 years, there is a very high
chance of the customer being worthy of a good rating. On the other hand, if the customer's
pay period is other than monthly and he is younger than 35 years of age, the chances of his
being a bad risk are very high. Thus, the classification tree attempts to classify customers
with respect to the target variable, using other variables for branching at each stage.
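The idea of choosing the split that maximizes homogeneity can be illustrated with the Gini rule that CART uses, applied to the pay-period split from Figure 3 (the per-node counts below are taken from the example in the text):

```python
def gini(counts):
    """Gini impurity of a node given its class counts (0 = perfectly pure,
    values near 0.5 mean a roughly even two-class mix)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Root node from Figure 3: 1680 bad and 1550 good credit ratings
root = gini([1680, 1550])

# Split on pay period: (bad, good) counts in each child node
monthly, other = [250, 1330], [1430, 220]
n = 1580 + 1650
# Impurity after the split: weighted average of the children's impurities
after = (1580 / n) * gini(monthly) + (1650 / n) * gini(other)

print(round(root, 3), round(after, 3))
# → 0.499 0.248 — impurity roughly halves, so the pay-period split
# makes the child nodes far more homogeneous than the root
assert after < root
```

A tree-building algorithm would compute this impurity reduction for every candidate variable at a node and split on the one giving the largest drop.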

Neural Networks
Neural networks are one of the most commonly used techniques in data mining. The
technique is borrowed from research on artificial intelligence and machine learning. These
networks are used to fit a simple or complex model to the historical data and make
predictions or classifications based on the fitted model. The network consists of an input
layer, one or more hidden layers and an output layer. Figure 4 shows a neural network
with one hidden layer.

Figure 4. Neural Network
[Diagram: an input layer of farmer characteristics feeds one hidden layer, which feeds a
three-class output layer.]

Figure 4 is an example of classifying a farmer as an adopter, one who is going to
wait and watch, or one who has no interest. These three classes constitute the output
layer. This classification is based on a set of characteristics such as the farmer's age,
education, past experience, holding size and income. These various characteristics,
which become the inputs to the neural network, constitute the input layer. The hidden
layer consists of a number of mathematical equations. The data from the input layer
(each of the inputs) is weighted using a specific function at each node of the hidden layer.
The output of each of the nodes of the hidden layer is weighted again to form an
activation function of the output layer. Based on the value of this function, the class or
the prediction with respect to the output is made.

Most neural networks follow the back-propagation method to train the network. In
this method, initial weights are assigned for each of the functions (or equations).
Then, the input data of an instance (say a farmer) is fed into the network and the network
calculates the output. Based on the calculated output and actual (real) output, the error in
prediction is calculated. The error is used to adjust the values of the weights of the
functions (or equations). The entire process is repeated until the error (the difference in
the actual value of the output and the calculated value) is within tolerable limits.
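This error-driven weight-adjustment cycle can be sketched minimally with a single sigmoid unit rather than a full multi-layer network — the training data (a logical AND pattern), the learning rate and the iteration count are all illustrative assumptions:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Toy training data: output is 1 only when both inputs are 1
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [random.uniform(-1, 1) for _ in range(3)]  # two input weights + bias
rate = 0.5                                     # learning rate

def total_error():
    return sum((t - sigmoid(w[0] * x1 + w[1] * x2 + w[2])) ** 2
               for (x1, x2), t in data)

before = total_error()
for _ in range(5000):                  # repeat until error is tolerable
    for (x1, x2), target in data:
        out = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
        err = target - out             # prediction error for this instance
        grad = err * out * (1 - out)   # scaled by the sigmoid's slope
        w[0] += rate * grad * x1       # adjust each weight in proportion
        w[1] += rate * grad * x2       # to its contribution to the error
        w[2] += rate * grad
after = total_error()
assert after < before                  # training reduced the error
```

A real back-propagation network repeats the same feed-forward / error / adjust cycle, but also propagates each output error backwards through the hidden-layer weights.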

Discriminant analysis
When there are two or more groups or classes, discriminant analysis attempts to
maximally separate these groups using a linear classification function. This function
involves a set of variables, and the discriminant function calculates the coefficients or
weights for each of these variables in such a way that the linear classification function is
capable of separating the different groups.

For the purpose of illustration of the technique, let us consider 16 different companies of
which 8 have negative profits and the remaining have positive profits, in the current
financial year, say 2002-03. Interestingly, all these 16 companies had positive profits
three years earlier, i.e., 1999-2000. Is it possible to predict which of these companies will
become loss making, using the data of the year 1999-2000? Discriminant analysis
attempts to do exactly this. The data with respect to the percentage of employee cost and
interest & finance cost as a percentage of the total cost is presented in Table 2. This data
is for the year 1999-2000. The table separates the data based on their performance in the
year 2002-2003.

A quick look at the data may not reveal any significant pattern. On the other hand,
Figure 5 plots the same data with the diamonds representing those companies with
positive profits in 2002-2003 (good companies) and the squares representing those
companies with losses in the year 2002-2003 (bad companies).

Table 2. Data for the year 1999-2000 for the two sets of companies.

Positive Profits Negative Profits


Employee Int & Fin Employee Int & Fin
S. No. Cost % Cost % S.No. Cost % Cost %
1 8.61 5.3 11 14.3 4.33
2 3.01 0.9 12 6.18 19.85
3 10.51 1.1 13 33.2 3.64
4 5.65 2.25 14 9.67 4.46
5 8.92 4.15 15 3.17 5.07
6 4.63 5.56 16 3.64 9.57
7 7.92 3.23 17 7.52 12.02
8 4.91 2.73 18 13.79 9.48

Figure 5. Chart showing the scatter plot and the discriminant line
[Scatter plot of interest & finance cost % (y-axis, 0-25) against employee cost %
(x-axis, 0-40); diamonds mark the good companies, squares mark the bad companies,
and the line AB separates the two groups.]
This type of plot is usually referred to as a scatter plot. The line AB on the graph
separates the good and bad companies to the maximum possible extent. While all the
points above the line represent bad companies, all but two of the points below the line
represent good companies. Using this discriminant line, a prediction can be made in
terms of which of the companies could become bad three years from now.
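A simplified sketch of the idea projects each company from Table 2 onto the line joining the two group means and cuts halfway between them. This is a rougher discriminant than the fitted line AB in Figure 5 (which misclassifies only two companies), so it separates the groups a little less cleanly:

```python
# Table 2 data: (employee cost %, interest & finance cost %) for 1999-2000
good = [(8.61, 5.30), (3.01, 0.90), (10.51, 1.10), (5.65, 2.25),
        (8.92, 4.15), (4.63, 5.56), (7.92, 3.23), (4.91, 2.73)]
bad  = [(14.30, 4.33), (6.18, 19.85), (33.20, 3.64), (9.67, 4.46),
        (3.17, 5.07), (3.64, 9.57), (7.52, 12.02), (13.79, 9.48)]

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

mg, mb = mean(good), mean(bad)
# Discriminant direction: the difference between the two class means
w = (mb[0] - mg[0], mb[1] - mg[1])

def score(p):
    """Linear classification function: a weighted sum of the variables."""
    return w[0] * p[0] + w[1] * p[1]

# Cut-off halfway between the projected class means
cutoff = (score(mg) + score(mb)) / 2

correct = sum(score(p) <= cutoff for p in good) + \
          sum(score(p) > cutoff for p in bad)
print(correct, "of", len(good) + len(bad), "companies classified correctly")
# → 13 of 16 with this simplified direction
```

Proper discriminant analysis would also weight the direction by the within-group scatter, which is how it recovers a better separating line such as AB.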

Genetic algorithms
Genetic algorithms are generally more suitable for solving optimization problems. These
algorithms are based on nature's "survival of the fittest" concept as proposed by
Charles Darwin. A genetic algorithm works with a set of individuals forming a
population. Each individual is represented by a string of numbers, usually a set of binary
digits. These sets of numbers are called the genetic material or genome. A fitness
function is defined and the fitness of each of the individuals is calculated. The fitness
function indicates the extent of optimization. Using the fitness function, the individuals
who are likely to survive and reproduce are identified and these individuals generate
offspring, thereby transmitting their genetic material. The genetic material of the
offspring is determined through a process of selection and crossover. The fitness of each
of the new individuals in the population is examined and the cycle is repeated. In
addition to selection and crossover, another operation called mutation is also applied
to selected individuals of the population. Generally, the mutation operation is carried out
by flipping a randomly selected bit of the genome string. Most often, the result of a
mutation is harmful, and so its frequency is kept very low in the algorithm.
Nevertheless, the mutation could result in a very significant increase in the fitness of the
individual. There is randomness associated with each of the operations, namely
selection, crossover and mutation. In general, the overall fitness of the population
increases after a few cycles. Still, genetic algorithms do not always produce a globally
optimal solution, but they often produce a near-optimal solution very quickly.
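The selection-crossover-mutation cycle can be sketched on a toy "count the ones" fitness function. The genome length, population size, mutation rate and generation count below are arbitrary choices for illustration:

```python
import random

random.seed(1)
GENOME_LEN, POP, GENERATIONS = 20, 30, 40

def fitness(genome):
    """Toy fitness: the number of 1-bits (the optimum is all ones)."""
    return sum(genome)

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)      # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, p=0.01):
    # rare random bit flips; xor with a bool flips the bit when True
    return [bit ^ (random.random() < p) for bit in genome]

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]
start = max(map(fitness, pop))
for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:POP // 2]                 # fittest half reproduces
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POP - len(survivors))]
    pop = survivors + children
best = max(map(fitness, pop))
print(start, "->", best)   # overall fitness rises over the generations
assert best >= start
```

Because the fittest survivors are carried over unchanged each generation, the best fitness in the population can never fall, even when individual mutations are harmful.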

There are many instances where data mining can be applied very effectively. It can be
used to identify the customers who are most likely to purchase a particular product, or to
identify who can be targeted for cross-selling or up-selling. Market segmentation is an
effective application of data mining. It can help in minimizing churn, and it has been
used in quality control and for minimizing wastage in production.
