
Introduction

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.

27 November 2008 GKGupta 2
Objectives
What is data mining?
Why data mining?
What applications?
What techniques?
What process?
What software?

Definition
Data mining may be defined as follows:
Data mining is a collection of techniques for the efficient,
automated discovery of previously unknown, valid,
novel, useful and understandable patterns in large
databases. The patterns must be actionable so that they
may be used in an enterprise's decision making.
What is Data Mining?
Efficient automated discovery of previously unknown
patterns in large volumes of data.
Patterns must be valid, novel, useful and
understandable.
Businesses are mostly interested in discovering past
patterns to predict future behaviour.
A data warehouse, to be discussed later, can be an
enterprise's memory. Data mining can provide
intelligence using that memory.

Examples
Amazon.com uses associations. Recommendations to
customers are based on past purchases and on what other
customers are purchasing.
Just for Feet, a retail chain in the USA, has about 200 stores,
each carrying up to 6000 shoe styles, each style in
several sizes. Data mining is used to find the right shoes
to stock in the right store.
More examples appear in the case studies discussed later.

Data Mining
We assume we are dealing with large volumes of data, perhaps
gigabytes, perhaps terabytes.
Although data mining is possible with smaller amounts
of data, the bigger the data, the higher the confidence in any
previously unknown pattern that is discovered.
There is considerable hype about data mining at
present, and Gartner Group has listed data mining
as one of the top ten technologies to watch.


Question: How many books could one store in one Terabyte of memory?
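
One rough way to approach this question is sketched below. The average book size is an assumption made only for illustration (about 1 MB of plain text per book); it is not a figure from these slides, and a different assumption changes the answer proportionally.

```python
# Rough estimate only; the average book size is an assumption, not source data.
TERABYTE_BYTES = 10**12       # decimal terabyte
ASSUMED_BOOK_BYTES = 10**6    # assumed: about 1 MB of plain text per book

def books_per_terabyte(book_bytes=ASSUMED_BOOK_BYTES):
    """Return how many books of the given size fit in one terabyte."""
    return TERABYTE_BYTES // book_bytes

print(books_per_terabyte())   # 1000000
```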

Why Data Mining Now?
Growth in generation and storage of corporate data:
the information explosion.
Need for sophisticated decision making: current
database systems are Online Transaction Processing
(OLTP) systems. OLTP data is difficult to use for
such applications. Why?
Evolution of technology: much cheaper storage,
easier data collection and better database management,
leading to data analysis and understanding.

Information explosion
Database systems have been in use since the 1960s in
Western countries (perhaps since the 1980s in India).
These systems have generated mountains of data.
Point-of-sale terminals and bar codes on many
products, railway bookings, educational institutions,
huge numbers of mobile phones and electronic commerce
all generate data.
Government is now collecting a lot of information.

Information explosion
Internet banking via networked computers and
ATMs.
Credit and debit cards.
Medical data from doctors and hospitals.
Transportation: Indian railways, automatic toll
collection on toll roads, growing air travel.
Passports, NRI visas, other visas, NRI money
transfers.

Question: Can you think of other examples of data collection?

Information explosion
Many adults in India generate:
Mobile phone transactions. There are more than 300 million
phones in India, reportedly growing at the rate of
10,000 new ones every hour! Mobile companies must
save information about calls.
Credit and debit card transactions. The growing middle class
held about 25m credit cards and 70m debit cards in 2007,
with annual growth rates of about
30% and 40% respectively. There could be 55m credit cards
and 200m debit cards in 2010, resulting in perhaps
500m transactions annually.
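
The card projections for 2010 follow from compounding the 2007 figures at the stated growth rates for three years. A minimal sketch of that arithmetic is given below; the function name is hypothetical and the inputs are the figures quoted in the slide.

```python
def project(count_millions, annual_growth, years):
    """Compound a starting count forward at a fixed annual growth rate."""
    return count_millions * (1 + annual_growth) ** years

# 2007 figures from the slide, projected three years forward to 2010.
credit_2010 = project(25, 0.30, 3)
debit_2010 = project(70, 0.40, 3)

print(round(credit_2010, 1))  # 54.9 (million), close to the quoted 55m
print(round(debit_2010, 1))   # 192.1 (million), close to the quoted 200m
```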

Information explosion
India has some huge enterprises, for example Indian
Railways, perhaps the busiest network in the world,
with 2.5m employees, 10,000 locomotives, 10,000
passenger trains daily, 10,000 freight trains daily and
20m passengers daily.
Growing airline traffic with more than ten airlines,
perhaps 30m passengers annually.
Growing number of motor vehicles: registration,
insurance, driver licences.
Internet surfing records.

OLTP
As noted earlier, most enterprise database systems were
designed in the 1970s or 1980s, mainly
to automate office procedures, e.g.
order entry, student enrolment, patient registration and
airline reservations. These are well-structured, repetitive
operations that are easily automated.

Decision Making
Need for business memory and intelligence.
Need to serve customers better by learning from past
interactions.
OLTP data is not a good basis for maintaining an
enterprise memory.
The intelligence hidden in data could be the secret
weapon in a competitive business world, but given the
information explosion not even a small fraction of it could
be examined by the human eye.

Question: Why is OLTP not good for maintaining an enterprise memory?
OLTP vs Decision Making
Clerical view of data focuses on details required for
day-to-day running of an enterprise.

Management view of data focuses on summary data to
identify trends, challenges and opportunities.

The detailed data view is the operational view, while
the management view is the decision-support view.
Comparison of the two views:

Operational vs Management View
Operational               Decision-Support
Users: admin staff        Users: management
Day-to-day work           Decision support
Application oriented      Subject oriented
Current data              Historical data
Detailed                  Overall view, summaries
Simple queries            Complex queries
Predetermined queries     Ad hoc queries
Update/Select             Select only
Real-time                 Not real-time
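
The contrast between the two views can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, columns and values are hypothetical; the point is only that the operational query fetches one detailed record, while the decision-support query summarises historical data.

```python
import sqlite3

# Hypothetical sales table, held in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "North", 2006, 100.0), (2, "North", 2007, 150.0),
     (3, "South", 2006, 80.0), (4, "South", 2007, 60.0)],
)

# Operational view: a simple, predetermined lookup of one detailed record.
order = conn.execute(
    "SELECT amount FROM sales WHERE order_id = ?", (2,)
).fetchone()

# Decision-support view: an ad hoc summary over all historical records.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(order)    # (150.0,)
print(summary)  # [('North', 250.0), ('South', 140.0)]
```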

Evolution of Technology
Corporate data growth has been accompanied by a decline in the
cost of storage and processing.
PC motherboard performance, measured in MHz/$, is
currently doubling every 27 ± 2 months.
The next slide, which uses a logarithmic scale, shows that disk storage is
now about 10GB per US dollar, and the following slide
shows that sales of disk storage are growing
exponentially.
Look at computing trends at
http://www.zoology.ubc.ca/~rikblok/ComputingTrends/

Question: What does a 100GB disk cost? What is the cost of a PC and what is
its CPU performance?

[Figure: Decline in Hard Drive Cost]
[Figure: Growth in Worldwide Disk Capacity. Vertical axis: storage in petabytes, 0 to 18,000. Horizontal axis: year, 1996 to 2003.]

Evolution of Technology
Question: What do the graphs in the last two slides tell us? What scales are used in
them? What is the pink line in the first graph?

Evolution of Technology
Database technology has improved over the years.
Data collection is often much better and cheaper now.
The need for analyzing and synthesizing information
is growing in today's fiercely competitive business
environment.

New applications
Sophisticated applications of modern enterprises include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling

OLTP is not designed for such applications. Also, large
enterprises operate a number of database systems, so
it is necessary to integrate information for decision
making applications.

Question: Why can OLTP not be used for sales forecasting and analysis?
Why Data Mining Now?
As noted earlier, the reasons may be summarized as:
Accumulation of large amounts of data
Increased affordable computing power enabling data
mining processing
Statistical and learning algorithms
Availability of software
Strong business competition

Large amount of data
As already discussed, many enterprises have large
amounts of data accumulated over 30+ years.

As noted earlier, some enterprises collect information
for analysis. For example, supermarkets in the USA offer
loyalty cards in exchange for shopper information.
Loyalty cards in Australia also collect information
using a reward system.

Growth of cards
A recent survey in the USA found that the percentages of
US adults using the following types of cards were:

Credit cards - 88%
ATM cards - 60%
Membership cards - 58%
Debit cards - 35%
Prepaid cards - 35%
Loyalty cards - 29%

Question: What kind of data do these cards generate?
Affordable computing power
Data mining is usually computationally intensive.
Dramatic reduction in the price of computer systems,
as noted earlier, is making it possible to carry out
data mining without investing huge amounts of
resources in hardware and software.

In spite of affordable computing power, data
mining can still be resource intensive.

Algorithms
A variety of statistical and learning algorithms
from fields like statistics and artificial
intelligence have been adapted for data mining.

With the new focus on data mining, new algorithms are
being developed.

Availability of Software
A large variety of DM software is now available. Some
of the more widely used packages are:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER
Strong Business Competition
Growth in service economies. Almost every business
is a service business. Service economies are
information rich and very competitive.

Consider the telecommunications environment in
Australia. About 20 years ago, Telstra was a
monopoly. The field is now very competitive. The mobile
phone market in India is also very competitive.
Applications
In finance, telecom, insurance and retail:
Loan/credit card approval
market segmentation
fraud detection
better marketing
trend analysis
market basket analysis
customer churn
Web site design and promotion

Loan/Credit card approvals
In a modern society, a bank does not know its
customers personally. The only knowledge a bank has is the
information stored in its computers.

Credit agencies and banks collect a lot of customer
behavioural data from many sources. This
information is used to predict the chances of a
customer paying back a loan.

Market Segmentation
Large amounts of data about customers contain
valuable information.
The market may be segmented into many subgroups
according to variables that are good discriminators.
It is not always easy to find variables that will help in
market segmentation.
Fraud Detection
Very challenging since it is difficult to define
characteristics of fraud. Often based on detecting
changes from the norm.
In statistics, it is common to throw out the outliers
but in data mining it may be useful to identify them
since they could either be due to errors or perhaps
fraud.
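
A minimal sketch of flagging changes from the norm is shown below. It uses a simple z-score rule (distance from the mean measured in standard deviations), which is one common choice and is an assumption here, not a method named in the slides. The threshold and the transaction amounts are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical, so nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the usual range.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 400]
print(find_outliers(amounts))  # [400]
```

A flagged value is only a candidate: as the slide notes, it may be an error rather than fraud, so it still needs to be examined.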

Better Marketing
When customers buy new products, other products may
be suggested to them when they are ready.
As noted earlier, in mail order marketing for example,
one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer return the purchase?
- will the customer pay for the purchase?
Better Marketing
It has been reported that more than 1000 variable
values on each customer are held by some mail order
marketing companies.

The aim is to lift the response rate.
Trend analysis
In a large company, not all trends are always visible to
the management. It is then useful to use data mining
software that will identify trends.

Trends may be long term trends, cyclic trends or
seasonal trends.

Market Basket Analysis
Aims to find what the customers buy and what they
buy together
This may be useful in designing store layouts or in
deciding which items to put on sale
Basket analysis can also be used for applications
other than just analysing what items customers buy
together
Customer Churn
In businesses like telecommunications, companies are
trying very hard to keep their good customers and to
perhaps persuade good customers of their competitors
to switch to them.
In such an environment, businesses want to find
which customers are good, why customers switch and
what makes customers loyal.
It is cheaper to develop a retention plan and retain an old
customer than to bring in a new customer.

Customer Churn
The aim is to get to know the customers better so you
will be able to keep them longer.
Given the competitive nature of businesses,
customers will move if not looked after.
Also, some businesses may wish to get rid of
customers that cost more than they are worth, e.g.
credit card holders that don't use the card, or bank
customers with very small amounts of money in their
accounts.

Web site design
A Web site is effective only if the visitors easily find
what they are looking for.
Data mining can help discover affinity of visitors to
pages and the site layout may be modified based on
this information.

Data Mining Process
Successful data mining involves carefully determining
the aims and selecting appropriate data.
The following steps should normally be followed:
1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation
Requirements Analysis
The enterprise decision makers need to formulate
goals that the data mining process is expected to
achieve. The business problem must be clearly
defined. One cannot use data mining without a good
idea of what kind of outcomes the enterprise is looking
for.
If objectives have been clearly defined, it is easier to
evaluate the results of the project.
Data Selection and Collection
Find the best source databases for the data that is
required. If the enterprise has implemented a data
warehouse, then most of the data could be available
there. Otherwise source OLTP systems need to be
identified and required information extracted and
stored in some temporary system.
In some cases, only a sample of the data available
may be required.
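
Where only a sample is needed, a simple random sample is one option. The sketch below assumes the extracted records fit in a Python list; the function name, the 5% fraction and the fixed seed are illustrative choices, not part of the original material.

```python
import random

def sample_records(records, fraction, seed=0):
    """Return a simple random sample holding about `fraction` of the records."""
    rng = random.Random(seed)   # fixed seed so the same sample can be reproduced
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)

records = list(range(1000))     # stand-in for records extracted from source systems
subset = sample_records(records, 0.05)
print(len(subset))              # 50
```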
Cleaning and Preparing Data
This may not be an onerous task if a data warehouse
containing the required data exists, since most of this
must have already been done when data was loaded in
the warehouse.
Otherwise this task can be very resource intensive;
perhaps more than 50% of the effort in a data mining
project is spent on this step. Essentially a data store
that integrates data from a number of databases may
need to be created. When integrating data, one often
encounters problems like identifying data, dealing
with missing data, data conflicts and ambiguity. An
ETL (extraction, transformation and loading) tool may
be used to overcome these problems.

Exploration and Validation
Assuming that the user has access to one or more
data mining tools, a data mining model may be
constructed based on the enterprise's needs. It may
be possible to take a sample of data and apply a
number of relevant techniques. For each technique
the results should be evaluated and their significance
interpreted.
This is likely to be an iterative process which should
lead to selection of one or more techniques that are
suitable for further exploration, testing and
validation.

Implementing, Evaluating and Monitoring
Once a model has been selected and validated, the
model can be implemented for use by the decision
makers. This may involve software development for
generating reports or for results visualisation and
explanation for managers.
If more than one technique is available for the given
data mining task, it is necessary to evaluate the
results and choose the best. This may involve checking
the accuracy and effectiveness of each technique.

Implementing, Evaluating and Monitoring
Regular monitoring of the performance of the
techniques that have been implemented is required.
Every enterprise evolves with time and so must the
data mining system. Monitoring may from time to
time lead to the refinement of the tools and techniques
that have been implemented.
Results Visualisation
Explaining the results of data mining to the decision
makers is an important step. Most DM software
includes data visualisation modules which should be
used in communicating data mining results to the
managers.
Clever data visualisation tools are being developed to
display results that deal with more than two
dimensions. The visualisation tools available should
be tried and used if found effective for the given
problem.

Data Mining Process: Another Approach
The last few slides presented one approach.
Another approach that also includes six steps has
been proposed by CRISP-DM (Cross-Industry
Standard Process for Data Mining), developed by an
industry consortium.

CRISP-DM Steps
The six CRISP-DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment

CRISP-DM Steps
The six steps proposed in CRISP-DM are similar to
the six steps proposed earlier.
The CRISP-DM steps are shown in the following
figure.

Question: Compare the two sets of steps, one given in the previous few slides and the
CRISP-DM approach. Which approach is better?

[Figure: CRISP Data Mining Model]

Data Mining Techniques
Although data mining is a new field, it uses many
techniques developed years ago in other fields such as
machine learning, statistics and artificial intelligence.
These techniques are in some cases modified to deal
with large amounts of data.
Data Mining Techniques
Data mining includes a large number of techniques
including concept/class description, association analysis,
classification and prediction, cluster analysis, outlier
analysis etc.

Expression and visualization of data mining results is a
challenging task.

Privacy issues also need to be considered.
Data Mining Tasks
Association analysis
Classification and prediction
Cluster analysis
Web data mining
Search Engines
Data warehouse and OLAP
Others, for example sequential patterns and
time-series analysis, are not covered in this book.

Association Analysis
Association analysis involves the discovery of
relationships or correlations among a set of items.
An example is discovering that personal loans are repaid with 80%
confidence when the borrower owns a home.
The classical example is the one where a store
discovered that people buying nappies also tend to
buy beer.

Associations
Association rules are often written as X → Y,
meaning that whenever X appears Y also tends to
appear. X and Y may be collections of attributes.

A supermarket like Woolworths may have several
thousand items and many millions of transactions a
week (i.e. gigabytes of data each week). Note that the
quantity of each item bought is ignored.
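
The strength of a rule is usually measured by its support (how often the items appear together) and its confidence (how often Y appears when X does). A minimal sketch of both measures follows; the baskets are made up for illustration, and each basket is a set, so quantities are ignored as noted above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, x, y):
    """Fraction of transactions containing X that also contain Y.

    Assumes X appears in at least one transaction.
    """
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical baskets; each is a set of items bought together.
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"milk", "bread"},
    {"beer"},
]

print(support(baskets, {"nappies", "beer"}))       # 0.4
print(confidence(baskets, {"nappies"}, {"beer"}))  # about 0.67
```

Here nappies and beer occur together in 2 of 5 baskets, and 2 of the 3 baskets with nappies also contain beer.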

Classification and Prediction
A set of training objects, each with a number of
attribute values, is given to the classifier. The
classifier formulates rules for each class in the training
set so that the rules may be used to classify new
objects. Some techniques do not require training data.

Classification may be used for predicting the class label
of data objects. A number of techniques are available, including
decision trees and neural networks.
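
As a very small illustration of learning a rule from training objects, the sketch below builds a one-level decision tree (a "decision stump") for two classes. This is a deliberately simplified stand-in for the decision tree methods mentioned above, and the attributes (age, income in thousands) and labels are hypothetical.

```python
def train_stump(samples):
    """Learn a one-level decision tree for a two-class problem.

    Tries every attribute and every observed value as a threshold, and keeps
    the split that misclassifies the fewest training samples.
    `samples` is a list of (attribute_values, label) pairs.
    """
    labels = sorted({label for _, label in samples})
    best = None
    for attr in range(len(samples[0][0])):
        for threshold in sorted({x[attr] for x, _ in samples}):
            # Try both ways of assigning the two labels to the two branches.
            for low, high in (labels, labels[::-1]):
                errors = sum(
                    1 for x, label in samples
                    if (low if x[attr] <= threshold else high) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, attr, threshold, low, high)
    return best[1:]

def classify(stump, x):
    """Apply the learned rule to a new object."""
    attr, threshold, low, high = stump
    return low if x[attr] <= threshold else high

# Hypothetical training objects: (age, income in thousands) -> loan decision.
training = [
    ((25, 20), "reject"), ((30, 25), "reject"), ((45, 60), "approve"),
    ((50, 80), "approve"), ((35, 30), "reject"), ((40, 70), "approve"),
]
stump = train_stump(training)
print(stump)                        # (0, 35, 'reject', 'approve')
print(classify(stump, (48, 75)))    # approve
```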

Cluster Analysis
Similar to classification, except that the aim is to build
clusters such that the objects within each cluster are similar to one another
but dissimilar to objects in other clusters. Clustering does not rely on
class-labeled data objects.

Based on the principle of maximizing the intra-cluster
similarity and minimizing the inter-cluster similarity.
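
One widely used method that follows this principle is k-means, sketched below for two-dimensional points. The slides do not name a specific algorithm at this point, so k-means is an illustrative choice; the data and the simple initialisation (first k points as centres) are assumptions.

```python
def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters.

    Repeatedly assigns each point to its nearest centre, then moves each
    centre to the mean of the points assigned to it.
    """
    centres = list(points[:k])   # simple deterministic initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centres = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

# Hypothetical unlabeled points forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centres, clusters = kmeans(points, k=2)
print(clusters)  # [[(1, 1), (1, 2), (2, 1)], [(9, 9), (8, 9), (9, 8)]]
```

Note that no class labels are supplied; the groups emerge from the distances alone.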
Web data mining
The Web revolution has had a profound impact on the
way we search and find information at home and at
work. From its beginning in the early 1990s, the web
has grown to more than ten billion pages in 2008
(estimates vary), perhaps even more by the time you
are looking at this slide. Web usage, Web content and
Web structure are discussed in Chapter 5.
Search engines
Normally the search engine databases of Web pages
are built and updated automatically by Web crawlers.
When one searches the Web using one of the search
engines, one is not searching the entire Web. Instead
one is only searching the database that has been
compiled by the search engine. There are a number of
challenging problems related to search engines that
are discussed in Chapter 6 including how to assign a
ranking to each Web page that is retrieved in response
to a user query.
Data Warehousing and OLAP
Data warehousing is a process by which an enterprise
collects data from the whole enterprise to build a
single version of the truth. This information is useful
for decision makers and may also be used for data
mining. A data warehouse can be of real help in data
mining since data cleaning and other problems of
collecting data would have already been overcome.

OLAP (Online Analytical Processing) tools are
decision support tools that are often built on top of a
data warehouse or another database. OLAP goes
further than traditional query and report tools in that
a decision maker already has a hypothesis which
he/she is trying to test.


Data Warehousing and OLAP
Data mining is somewhat different from OLAP since
in data mining a hypothesis is not being tested.
Instead data mining is used to uncover novel patterns
in the data.
Before Data Mining
To define a data mining task, one needs to answer the
following:
What data set do I want to mine?
What kind of knowledge do I want to mine?
What background knowledge could be useful?
How do I measure if the results are interesting?
How do I display what I have discovered?
Task-relevant Data
The whole database may not be required since it may
be that we only want to study something specific, e.g.
trends in postgraduate students:
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
May need to build a database subset before data
mining can be done.
Task-relevant Data
Data collection is non-trivial.

OLTP data is not useful since it is changing all the
time. In some cases, data from more than one database
may be needed.
Preprocessing
A data mining process would normally involve
preprocessing
Often data mining applications use data warehousing
One approach is to pre-mine the data, warehouse it,
then carry out data mining
The process is usually iterative and can take years of
effort for a large project
Data Preprocessing
Preprocessing is very important although often
considered too mundane to be taken seriously
Preprocessing may also be needed after the data
warehouse phase
Data reduction may be needed to transform very
high dimensional data to lower dimensional data
Data Preprocessing
Feature Selection
Use sampling?
Normalization
Smoothing
Dealing with duplicates, missing data
Dealing with time-dependent data
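
Three of the steps listed above (duplicates, missing data and normalization) can be sketched in a few lines. The choices made here, filling a missing value with the column mean and using min-max normalization, are common options assumed for illustration; the rows are hypothetical (age, income) pairs.

```python
def preprocess(rows):
    """Drop duplicate rows, fill missing values, and normalise each column.

    Missing values (None) are replaced by the column mean, and each column is
    then min-max normalised to the range [0, 1]. Assumes every column has at
    least one known value.
    """
    unique = list(dict.fromkeys(rows))    # removes exact duplicates, keeps order
    cleaned = []
    for col in zip(*unique):
        known = [v for v in col if v is not None]
        fill = sum(known) / len(known)
        filled = [fill if v is None else v for v in col]
        lo, hi = min(filled), max(filled)
        span = (hi - lo) or 1             # avoid dividing by zero for constant columns
        cleaned.append([(v - lo) / span for v in filled])
    return [tuple(r) for r in zip(*cleaned)]

# Hypothetical (age, income) rows with one duplicate and one missing income.
rows = [(20, 1000), (30, None), (20, 1000), (40, 3000)]
print(preprocess(rows))  # [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```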
Background knowledge
Background information may be useful in the discovery
process.

For example, concept hierarchies or relationships
between data may be useful in data mining. For
postgraduate degrees, we may wish to look at all
Master's degrees and all doctorate degrees separately.
Measuring interest
A data mining process may generate many patterns. We
cannot look at all of them and so need some way to
separate uninteresting results from the interesting
ones.

This may be based on simplicity of pattern, rule length,
or level of confidence.
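
A minimal sketch of such a filter is shown below, using two of the criteria mentioned: rule length and level of confidence. The thresholds, the dictionary layout and the example rules are hypothetical.

```python
def interesting(rules, min_confidence=0.7, max_length=3):
    """Keep rules that are confident enough and short enough, strongest first."""
    kept = [
        r for r in rules
        if r["confidence"] >= min_confidence
        and len(r["if"]) + len(r["then"]) <= max_length
    ]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)

# Hypothetical rules produced by a mining run.
rules = [
    {"if": ["nappies"], "then": ["beer"], "confidence": 0.75},
    {"if": ["bread", "milk", "eggs"], "then": ["butter"], "confidence": 0.90},
    {"if": ["milk"], "then": ["bread"], "confidence": 0.40},
    {"if": ["owns home"], "then": ["repays loan"], "confidence": 0.80},
]
for rule in interesting(rules):
    print(rule["if"], "->", rule["then"], rule["confidence"])
```

The four-item rule is dropped for length despite its high confidence, and the milk rule is dropped for low confidence.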
Visualization
We must be able to display results so that they are easy
to understand.

Display may be a graph, pie chart, tables etc. Some
displays are better than others for a given kind of
knowledge.

Guidelines for Successful Data Mining
The data must be available
The data must be relevant, adequate and clean
There must be a well-defined problem
The problem should not be solvable by means of
ordinary query or OLAP tools
The results must be actionable

Guidelines for Successful Data Mining
1. Use a small team with strong internal integration
and a loose management style.
2. Carry out a small pilot project before a major data
mining project.
3. Identify a clear problem owner responsible for the
project. This could be someone in sales or marketing.
This will benefit the external integration.

Question: Why is each of the above guidelines important for success?

Guidelines for Successful Data Mining
4. Try to realise a positive return on investment within
6 to 12 months.
5. The whole data mining project should have the
support of the top management of the company.

Question: Why is each of the above guidelines important for success?

Data Mining Software
As noted earlier, a large variety of DM software is now
available. Some of the more widely used packages are:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER
Choosing Data Mining Software
Many factors need to be considered if purchasing significant software:
Product and vendor information
Total cost of ownership
Performance
Functionality and modularity
Training and support
Reporting facilities and visualization
Usability




Question: Which one of the above is the most important? Why?
