
Data Mining

Introduction
Why Mine Data? Commercial
Viewpoint
 Lots of data is being collected
and warehoused
◦ Web data, e-commerce
◦ purchases at department/
grocery stores
◦ Bank/Credit Card
transactions
 Computers have become cheaper and more powerful
 Competitive Pressure is Strong
◦ Provide better, customized services for an edge
(e.g. in Customer Relationship Management)

Mining Large Data Sets - Motivation

 There is often information “hidden” in the data that is not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
[Figure: “The Data Gap” — total new disk storage (TB) since 1995 grows far faster than the number of analysts, 1995–1999]
What is Data Mining?
 Many Definitions
◦ Non-trivial extraction of implicit,
previously unknown and potentially
useful information from data
◦ Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
Data Mining is Not ...

 What it can’t do: automatically find relationships without user intervention when no relationships exist
 What data mining is not:
◦ Data warehousing
◦ SQL Queries / Reporting
◦ Software Agents
◦ Online Analytical Processing (OLAP)
◦ Data Visualization

Origins of Data Mining
 Draws ideas from machine learning/AI,
pattern recognition, statistics, and
database systems
 Traditional techniques may be unsuitable due to
◦ Enormity of data
◦ High dimensionality of data
◦ Heterogeneous, distributed nature of data

[Figure: Venn diagram — Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]
A Problem...
 You are a marketing manager for a
brokerage company
◦ Problem: Churn is too high
 Turnover (after the six-month introductory period
ends) is 40%
◦ Customers receive incentives (average cost:
$160)
when account is opened
◦ Giving new incentives to everyone who might
leave is very expensive (as well as wasteful)
◦ Bringing back a customer after they leave is
both difficult and costly

… A Solution
 One month before the introductory period ends,
predict which customers will leave
◦ If you want to keep a customer that is predicted
to churn, offer them something based on their
predicted value
◦ Customers not predicted to churn need no attention
◦ If you don’t want to keep the customer, do
nothing

 How can you predict future behavior?
◦ Solution: data mining
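
As an illustration, a minimal sketch of the prediction step (the feature names, data values, and the decision-tree choice are all hypothetical, not from the slides):

# Hypothetical sketch: score customers approaching the end of the
# introductory period. Features and data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Toy history: [trades_per_month, account_balance, complaints]
X_train = [[12, 50_000, 0], [1, 2_000, 3], [8, 30_000, 1], [0, 500, 2]]
y_train = [0, 1, 0, 1]                    # 1 = churned after month six

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Customers entering month five: offer incentives only to likely churners.
X_new = [[2, 1_500, 2], [10, 40_000, 0]]
for features, p in zip(X_new, model.predict_proba(X_new)[:, 1]):
    if p > 0.5:
        print(features, "-> predicted churner, consider an incentive")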

Data Mining : Why now ?
 Changes in the Business Environment
◦ Customers becoming more demanding
◦ Markets are saturated
◦ Replace statisticians → better models, less grunt work
◦ Many different data mining algorithms / tools available
◦ Statistical expertise required to compare different
techniques
◦ Build intelligence into the software

 Drivers
◦ Focus on the customer, competition, and data assets
 Enablers
◦ Increased data understanding
◦ Cheaper and faster hardware


Growing Base of data

• Data doubling every 20 months in the world
• Businesses feel there is value in historical data
• Automated knowledge discovery is the only way to explore this data

[Figure: timeline, 1970–2000]
Convergence of Three Key
Technologies

[Figure: DM at the convergence of Increased Computing Power, Statistical and Learning Algorithms, and Improved Data Collection and Management]
Applications

 Fraud Detection
 Loan and Credit Approval
 Market Basket Analysis
 Customer Segmentation
 Financial Applications
 E-Commerce & Decision Support
 Web and text mining
Market Analysis and
Management (1)
 Where are the data sources for analysis?
◦ Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
 Target marketing
◦ Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
 Determine customer purchasing patterns over
time
◦ Conversion of single to a joint bank account: marriage,
etc.
 Cross-market analysis
Market Analysis and
Management (2)

 Customer profiling
◦ data mining can tell you what types of customers buy
what products (using techniques such as clustering or
classification)
 Identifying customer requirements
◦ identifying the best products for different customers
◦ use prediction to find what factors will attract new
customers
 Provides summary information
◦ various multidimensional summary reports
◦ statistical summary information
Fraud Detection and
Management (1)
 Applications
◦ widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
 Approach
◦ use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
 Examples
◦ auto insurance: detect a group of people who stage
accidents to collect on insurance
◦ money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
◦ medical insurance: detect professional patients and
rings of doctors and references
Fraud Detection and
Management (2)
 Detecting telephone fraud
◦ Telephone call model: destination of the call,
duration, time of day or week. Analyze
patterns that deviate from an expected norm.
◦ British Telecom identified discrete groups of
callers with frequent intra-group calls,
especially mobile phones, and broke a
multimillion-dollar fraud (source: Gartner, 2006)
 Retail
◦ Analysts estimate that 38% of retail shrinkage is
due to dishonest employees (Business Today, 2006)
Other Applications

 Sports
◦ IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for New York Knicks and
Miami Heat
 Astronomy
◦ JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
 Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover
customer preferences and behavior, analyze the
effectiveness of Web marketing, improve Web site
organization, etc.
Data Mining: On What Kind
of Data?

 Relational databases
 Data warehouses
 Transactional databases
 Object-oriented databases
 Time-series data
 Text databases and multimedia databases
 Heterogeneous databases
Data Mining Techniques
 Association Rules & Sequential Patterns
 Classification
 Clustering
 Similar Images
 Text/Web Mining
 Outlier analysis (using statistics)

Data mining - Analysis

Examples of Data Mining

 Conditional logic: If profession = athlete, then age < 30 in 90% of cases
 Associations: When paint is sold, paint brushes are also sold 85% of the time
 Trends & variations: Golf ball sales are seasonal, with a summer peak and winter low
Examples of Data Mining (contd.)

✠ Outcome prediction: How many people can be expected to respond to a mailer campaign?
✠ Forecasting: What will be the total sales of this product range in the next quarter, taking into account seasonal and long-term trends?
✠ Deviation detection: Is this insurance claim likely to be a fraud?
✠ Link analysis: When a person is fired, he is likely to default on credit card payments.
Query example

 suppose that you want to see all the data in a table called SALES.
 There are three columns (fields) in the table: NAME, DEPT, and
SALES. The user wants to see all the data, so he enters this
statement:
 SELECT * FROM SALES (the * indicates all columns)

 The user wants to know the total sales for each department. The
statement now becomes a bit more complex.

 SELECT DEPT, SUM(SALES)
 FROM SALES
 GROUP BY DEPT (you have aggregated using SUM, so a GROUP BY is
required)
 ORDER BY 2 DESC (sorting the data based on the second column
in descending order)

Consider this complex tricky
query
 A sales executive wishes to see all the sales for the past three
years where profitability has been greater than xx percent.
He wishes to see it by month. And where the percentages
have been greater than yy percent, he wants to see whether
the sales team has been in place during this period or
whether there has been personnel turnover. He is looking
for territorial versus personnel factors in sales success. He
also wishes to see trends in profitability, so, where all sales
by year have steadily increased by zz percent for at least two
years in a row, he wishes to see the top five products ranked
by profitability.

This query requires
 Sums
 Percentages
 Grouping
 Trends
 Time-based analysis
 Comparisons

Data mining - Users
 Executives - need top-level insights and
spend far less time with computers than
the other groups.

 Analysts may be financial analysts,
statisticians, consultants, or database
designers.

 End users are sales people, scientists,
market researchers, engineers, physicians,
etc.

Mining market
 Around 20 to 30 mining tool vendors
 Major tool players:
◦ Clementine,
◦ IBM’s Intelligent Miner,
◦ SGI’s MineSet,
◦ SAS’s Enterprise Miner.
 All pretty much the same set of tools
 Many embedded products:
◦ fraud detection:
◦ electronic commerce applications,
◦ health care,
◦ customer relationship management: Epiphany
◦ Vertical integration
Mining on the web
 Web log analysis for site design:
◦ what are popular pages,
◦ what links are hard to find.
 Electronic stores sales enhancements:
◦ recommendations, advertisements
◦ collaborative filtering: Net Perceptions, WiseWire
◦ inventory control: what was a shopper looking
for and could not find?
Data Mining
 Through a variety of techniques, data mining identifies nuggets of
information in bodies of data.

 Data mining extracts information in such a way that it can be used in
areas such as decision support, prediction, forecasts, and
estimation. Data is often voluminous but of low value and with little
direct usefulness in its raw form. It is the hidden information in the
data that has value.

 In data mining, success comes from combining your (or your expert’s)
knowledge of the data with advanced, active analysis techniques in
which the computer identifies the underlying relationships and
features in the data.

 The process of data mining generates models from historical data that
are later used for predictions, pattern detection, and more. The
technique for building these models is called machine learning, or
modeling.

Data Mining Tasks
 Prediction Methods
◦ Use some variables to predict unknown or future
values of other variables.

 Description Methods
◦ Find human-interpretable patterns that describe
the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]
Modeling Techniques


 Predictive modeling methods include decision trees, neural
networks, and statistical models.

 Clustering models focus on identifying groups of similar records
and labeling the records according to the group to which they
belong (see the sketch after this list). Clustering methods
include Kohonen, k-means, and TwoStep.

 Association rules associate a particular conclusion (such as the
purchase of a particular product) with a set of conditions (the
purchase of several other products).

 Screening models can be used to screen data to locate fields
and records that are most likely to be of interest in modeling
and identify outliers that may not fit known patterns.
Available methods include feature selection and anomaly
detection.
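
To make the clustering family above concrete, here is a minimal sketch using k-means (one of the methods named); the records and feature meanings are invented for illustration:

# Group customer records with k-means and label each record with the
# cluster it belongs to. Features here are [age, annual_spend] (invented).
from sklearn.cluster import KMeans

records = [[25, 1200], [31, 1500], [62, 300], [58, 250], [29, 1400]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(records)
for rec, lab in zip(records, labels):
    print(rec, "-> cluster", lab)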

Typical Applications

Typical applications of data mining techniques include the following:

 Direct mail. Determine which demographic groups have
the highest response rate. Use this information to
maximize the response to future mailings.

 Credit scoring. Use an individual’s credit history to make
credit decisions.

 Human resources. Understand past hiring practices and
create decision rules to streamline the hiring process.

 Medical research. Create decision rules that suggest
appropriate procedures based on medical evidence.


Typical Applications

 Market analysis. Determine which variables, such as
geography, price, and customer characteristics, are
associated with sales.

 Quality control. Analyze data from product manufacturing
and identify variables determining product defects.

 Policy studies. Use survey data to formulate policy by
applying decision rules to select the most important
variables.

 Health care. User surveys and clinical data can be
combined to discover variables that contribute to
health.
A Strategy for Data Mining

 As with most business endeavors, data mining is much more
effective if done in a planned, systematic way. Even with
cutting-edge data mining tools, such as Clementine or SAS,
the majority of the work in data mining requires a
knowledgeable business analyst to keep the process on
track.

A Strategy for Data Mining
 To guide your planning, answer the following
questions:

 What substantive problem do you want to solve?


 What data sources are available, and what parts of
the data are relevant to the current problem?
 What kind of preprocessing and data cleaning do you
need to do before you start mining the data?
 What data mining technique(s) will you use?
 How will you evaluate the results of the data mining
analysis?
 How will you get the most out of the information you
obtained from data mining?

The CRISP-DM Process Model

 Cross Industry Standard Process Model for Data Mining

 The general CRISP-DM process model
includes six phases that address the main
issues in data mining. The six phases fit
together in a cyclical process designed to
incorporate data mining into your larger
business practices.

The six phases include:
 Business understanding. This is perhaps the most
important phase of data mining.
 Business understanding includes determining
business objectives, assessing the situation, determining
data mining goals, and producing a project plan.

 Data understanding. Data provides the “raw
materials” of data mining.
 This phase addresses the need to understand what
your data resources are and the characteristics of those
resources. It includes collecting initial data, describing
data, exploring data, and verifying data quality.


 Data preparation. After cataloging your data
resources, you will need to prepare your data for
mining.
 Preparations include selecting, cleaning,
constructing, integrating, and formatting data.

 Modeling.
 Sophisticated analysis methods are used to
extract information from the data. This phase
involves selecting modeling techniques,
generating test designs, and building and
assessing models.

 Evaluation.
 Once you have chosen your models, you are
ready to evaluate how the data mining results can
help you to achieve your business objectives.
 Elements of this phase include evaluating results,
reviewing the data mining process, and determining
the next steps.
 Deployment.
 This phase focuses on integrating your new
knowledge into your everyday business processes to
solve your original business problem.
 It includes plan deployment, monitoring and
maintenance, producing a final report, and
reviewing the project.

Data Mining
Association Analysis: Basic
Concepts
and Algorithms
Association Rule Mining
 Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction

Market-basket transactions:

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example of association rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
 Itemset
◦ A collection of one or more items
 Example: {Milk, Bread, Diaper}
◦ k-itemset
 An itemset that contains k items
 Support count (σ)
◦ Frequency of occurrence of an itemset
◦ E.g. σ({Milk, Bread, Diaper}) = 2
 Support
◦ Fraction of transactions that contain an itemset
◦ E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
◦ An itemset whose support is greater than or equal to a
minsup threshold
Definition: Association Rule
●Association Rule
–An implication expression of the form X → Y,
where X and Y are itemsets
–Example: {Milk, Diaper} → {Beer}

●Rule Evaluation Metrics
–Support (s)
 Fraction of transactions that contain both X and Y
–Confidence (c)
 Measures how often items in Y appear in
transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
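
A minimal sketch of these two computations on the five market-basket transactions above:

# Compute support and confidence for {Milk, Diaper} -> {Beer} over the
# five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)            # 2/3 ~= 0.67
print(f"support={s}, confidence={c:.2f}")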
Association Rule Mining
Task
 Given a set of transactions T, the goal of
association rule mining is to find all rules
having
◦ support ≥ minsup threshold
◦ confidence ≥ minconf threshold

 Brute-force approach:
◦ List all possible association rules
◦ Compute the support and confidence for each
rule
◦ Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Example of rules (from the market-basket transactions above):

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
•All the above rules are binary partitions of the same
itemset: {Milk, Diaper, Beer}
•Rules originating from the same itemset have identical
support but can have different confidence
•Thus, we may decouple the support and confidence
requirements
Mining Association Rules
 Two-step approach:
1.Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2.Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

 Frequent itemset generation is still
computationally expensive

Frequent Itemset
Generation
Brute-force approach:
◦ Each itemset in the lattice is a candidate frequent itemset
◦ Count the support of each candidate by scanning the
database of N transactions:

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

◦ Match each transaction against every candidate
◦ Complexity ~ O(NMw): N transactions, M candidates, w max
transaction width. Expensive, since M = 2^d for d items!
A Naïve Algorithm
Transaction ID Items
100 Bread, Cheese
200 Bread, Cheese, Juice
300 Bread, Milk
400 Cheese, Juice, Milk

Find out all possible combinations:

Itemsets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Bread, Juice) 1
(Bread, Milk) 1
(Cheese, Juice) 2
(Cheese, Milk) 1
(Juice, Milk) 1
(Bread, Cheese, Juice) 1
(Bread, Cheese, Milk) 0
(Bread, Juice, Milk) 0
(Cheese, Juice, Milk) 1
(Bread, Cheese, Juice, Milk) 0

 Minimum support – 50%
 Minimum confidence – 75%

Itemsets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Cheese, Juice) 2

 Bread → Cheese with confidence of 2/3 = 67%
 Cheese → Bread with confidence of 2/3 = 67%
 Cheese → Juice with confidence of 2/3 = 67%
 Juice → Cheese with confidence of 2/2 = 100%
 Rules that have more than the user-specified
minimum confidence are called confident
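
A brute-force sketch of this naïve procedure on the four transactions above (minsup 50%, minconf 75%); it ends up printing only Juice → Cheese:

# Enumerate every itemset, count frequencies, then keep the rules that
# meet the minimum support and confidence from the example above.
from itertools import combinations

transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
items = sorted(set().union(*transactions))
n = len(transactions)

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

# All 2^4 - 1 non-empty itemsets and their frequencies.
freq = {c: count(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)}

for itemset, f in freq.items():
    if f / n >= 0.5 and len(itemset) > 1:            # frequent, size >= 2
        for i in range(1, len(itemset)):
            for lhs in combinations(itemset, i):
                if f / freq[lhs] >= 0.75:            # confident rule
                    rhs = tuple(x for x in itemset if x not in lhs)
                    print(lhs, "->", rhs, f"conf={f / freq[lhs]:.2f}")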

Improved Naïve Algorithm
 Find out all possible combinations, but only those that
actually appear in some transaction:

Transaction ID Items Combinations
100 Bread, Cheese {Bread, Cheese}
200 Bread, Cheese, Juice {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300 Bread, Milk {Bread, Milk}
400 Cheese, Juice, Milk {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}

Find out all possible combinations with non-zero frequency:

Itemsets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Bread, Juice) 1
(Bread, Milk) 1
(Cheese, Juice) 2
(Cheese, Milk) 1
(Juice, Milk) 1
(Bread, Cheese, Juice) 1
(Cheese, Juice, Milk) 1
Apriori Algorithm
 Method:

◦ Let k=1
◦ Generate frequent itemsets of length 1
◦ Repeat until no new frequent itemsets are identified:
 Generate length (k+1) candidate itemsets from
length k frequent itemsets
 Prune candidate itemsets containing subsets of
length k that are infrequent
 Count the support of each candidate by
scanning the DB
 Eliminate candidates that are infrequent,
leaving only those that are frequent
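
A compact sketch of this loop (candidate generation by pairwise union; pruning uses the Apriori property that every k-subset of a frequent (k+1)-itemset must itself be frequent). On the five-transaction example a few slides below with minsup = 0.5, it returns L1 = {Bread}, {Cheese}, {Juice}, {Milk} and L2 = {Bread, Juice}, {Cheese, Juice}:

# Apriori: level-wise generation of frequent itemsets.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    def support(c):
        return sum(c <= t for t in transactions) / n

    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate length (k+1) candidates from length k frequent itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates
        Lk = {c for c in candidates if support(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent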
Reducing Number of
Comparisons
 Candidate counting:
◦ Scan the database of transactions to determine
the support of each candidate itemset
◦ To reduce the number of comparisons, store the
candidates in a hash structure
 Instead of matching each transaction against every
candidate, match it against candidates contained in
the hashed buckets

[Figure: N transactions (TID 1–5, the market-basket data above) streamed against a hash structure — each transaction's k-itemsets are matched only against the candidates in the corresponding buckets]
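
A simplified sketch of this idea (assumption: real implementations use a hash tree; a plain dict of buckets shows the same mechanics):

# Count candidate k-itemsets by hashing them into buckets, then probing
# only one bucket per k-subset of each transaction.
from itertools import combinations
from collections import defaultdict

def count_with_buckets(transactions, candidates, k, n_buckets=16):
    buckets = defaultdict(list)
    for c in candidates:                         # hash candidates into buckets
        buckets[hash(c) % n_buckets].append(c)
    counts = defaultdict(int)
    for t in transactions:
        for sub in combinations(sorted(t), k):   # k-subsets of the transaction
            key = frozenset(sub)
            # probe only the bucket this subset hashes to
            if key in buckets[hash(key) % n_buckets]:
                counts[key] += 1
    return counts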
APRIORI
Transaction ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk
50% support
Frequent items L1

Item Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Candidate item pairs C2

Itemsets Frequency
(Bread, Cheese) 2
(Bread, Juice) 3
(Bread, Milk) 2
(Cheese, Juice) 3
(Cheese, Milk) 1
(Juice, Milk) 2
Frequent items L2

Item Frequency
Bread, Juice 3
Cheese, Juice 3
The APRIORI Algorithm
 Bread → Juice with confidence of 3/4
= 75%
 Juice → Bread with confidence of 3/4
= 75%
 Cheese → Juice with confidence of 3/3
= 100%
 Juice → Cheese with confidence of 3/4
= 75%
A larger APRIORI Example
Item Number Item Name
1 Biscuits
2 Bread
3 Cereal
4 Cheese
5 Chocolate
6 Coffee
7 Donuts
8 Eggs
9 Juice
10 Milk
11 Newspaper
12 Pastry
13 Rolls
14 Sugar
15 Tea
16 Yogurt
TID Items
1 Biscuits, Bread, Cheese, Coffee, Yogurt
2 Bread, Cereal, Cheese, Coffee
3 Cheese, Chocolate, Donuts, Juice, Milk
4 Bread, Cheese, Coffee, Cereal, Juice
5 Bread, Cereal, Chocolate, Donuts, Juice
6 Milk, Tea
7 Biscuits, Bread, Cheese, Coffee, Milk
8 Eggs, Milk, Tea
9 Bread, Cereal, Cheese, Chocolate, Coffee
10 Bread, Cereal, Chocolate, Donuts, Juice
11 Bread, Cheese, Juice
12 Bread, Cheese, Coffee, Donuts, Juice
13 Biscuits, Bread, Cereal
14 Cereal, Cheese, Chocolate, Donuts, Juice
15 Chocolate, Coffee
16 Donuts
17 Donuts, Eggs, Juice
18 Biscuits, Bread, Cheese, Coffee
19 Bread, Cereal, Chocolate, Donuts, Juice
20 Cheese, Chocolate, Donuts, Juice
21 Milk, Tea, Yogurt
22 Bread, Cereal, Cheese, Coffee
23 Chocolate, Donuts, Juice, Milk, Newspaper
24 Newspaper, Pastry, Rolls
25 Rolls, Sugar, Tea
25% support
Frequency count for all items

Item Number  Item Name  Frequency
1 Biscuits 4
2 Bread 13
3 Cereal 10
4 Cheese 11
5 Chocolate 9
6 Coffee 9
7 Donuts 10
8 Eggs 2
9 Juice 11
10 Milk 6
11 Newspaper 2
12 Pastry 1
13 Rolls 2
14 Sugar 1
15 Tea 4
16 Yogurt 2
The frequent 1-itemset or
L1
Item Frequency
Bread 13
Cereal 10
Cheese 11
Chocolate 9
Coffee 9
Donuts 10
Juice 11
The 21 candidate 2-itemsets or C2
{Bread, Cereal}
{Bread, Cheese}
{Bread, Chocolate}
{Bread, Coffee}
{Bread, Donuts}
{Bread, Juice}
{Cereal, Cheese}
{Cereal, Chocolate}
{Cereal, Coffee}
{Cereal, Donuts}
{Cereal, Juice}
{Cheese, Chocolate}
{Cheese, Coffee}
{Cheese, Donuts}
{Cheese, Juice}
{Chocolate, Coffee}
{Chocolate, Donuts}
{Chocolate, Juice}
{Coffee, Donuts}
{Coffee, Juice}
{Donuts, Juice}
Frequency count of candidate
2-itemsets
{Bread, Cereal} 9
{Bread, Cheese} 8
{Bread, Chocolate} 4
{Bread, Coffee} 8
{Bread, Donuts} 4
{Bread, Juice} 6
{Cereal, Cheese} 5
{Cereal, Chocolate} 4
{Cereal, Coffee} 5
{Cereal, Donuts} 4
{Cereal, Juice} 6
{Cheese, Chocolate} 4
Candidate 2-itemsets or C2 (contd.)

{Cheese, Coffee} 9
{Cheese, Donuts} 3
{Cheese, Juice} 4
{Chocolate, Coffee} 1
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Coffee, Donuts} 1
{Coffee, Juice} 2
{Donuts, Juice} 9
The frequent 2-itemset or
L2
{Bread, Cereal} 9
{Bread, Cheese} 8
{Bread, Coffee} 8
{Cheese, Coffee} 9
{Chocolate, Donuts} 7
{Chocolate, Juice} 7
{Donuts, Juice} 9
Candidate 3-itemsets or C3

{Bread, Cereal, Cheese} 4
{Bread, Cereal, Coffee} 4
{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7

Frequent 3-itemsets or L3

{Bread, Cheese, Coffee} 8
{Chocolate, Donuts, Juice} 7
Confidence of association rules from
{Chocolate, Donuts, Juice}
Rule  Support of rule  Frequency of LHS  Confidence
Chocolate → Donuts, Juice  7  9  0.78
Donuts → Chocolate, Juice  7  10  0.70
Juice → Chocolate, Donuts  7  11  0.64
Donuts, Juice → Chocolate  7  9  0.78
Chocolate, Juice → Donuts  7  7  1.0
Chocolate, Donuts → Juice  7  7  1.0
Confidence of association rules
from {Bread, Cheese, Coffee}
Rule  Support of rule  Frequency of LHS  Confidence
Bread → Cheese, Coffee  8  13  0.61
Cheese → Bread, Coffee  8  11  0.72
Coffee → Bread, Cheese  8  9  0.89
Cheese, Coffee → Bread  8  9  0.89
Bread, Coffee → Cheese  8  8  1.0
Bread, Cheese → Coffee  8  8  1.0
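
These confidences follow mechanically from the support counts; a sketch of the computation for {Bread, Cheese, Coffee}:

# Confidence of each rule from a frequent itemset:
# support(whole itemset) / support(left-hand side).
from itertools import combinations

support = {                                    # counts from the tables above
    frozenset({"Bread"}): 13,
    frozenset({"Cheese"}): 11,
    frozenset({"Coffee"}): 9,
    frozenset({"Bread", "Cheese"}): 8,
    frozenset({"Bread", "Coffee"}): 8,
    frozenset({"Cheese", "Coffee"}): 9,
    frozenset({"Bread", "Cheese", "Coffee"}): 8,
}
itemset = frozenset({"Bread", "Cheese", "Coffee"})
for i in range(1, len(itemset)):
    for lhs in map(frozenset, combinations(itemset, i)):
        conf = support[itemset] / support[lhs]
        print(sorted(lhs), "->", sorted(itemset - lhs), f"conf={conf:.2f}")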
All association rules
Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread
Closed and Maximal
Itemsets
 Closed itemset – a frequent itemset X such
that there exists no superset of X with the
same support count as X.
 Maximal itemset – a frequent itemset Y is
maximal if it is not a proper subset of any
other frequent itemset.
 A maximal itemset is a closed itemset, but a
closed itemset is not necessarily a
maximal itemset.
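
A small sketch that applies these two definitions directly to a table of frequent itemsets and their supports:

# Classify frequent itemsets as closed and/or maximal.
# `frequent` maps frozenset -> support count.
def classify(frequent):
    result = {}
    for X, sup in frequent.items():
        supersets = [Y for Y in frequent if X < Y]  # proper frequent supersets
        result[X] = {
            "closed": all(frequent[Y] != sup for Y in supersets),
            "maximal": not supersets,
        }
    return result

# E.g. {Juice}: 4, {Bread, Juice}: 3, {Bread, Cheese, Juice}: 2 (supports
# from the table two slides below): the two smaller itemsets come out
# closed but not maximal; {Bread, Cheese, Juice} is both.
print(classify({
    frozenset({"Juice"}): 4,
    frozenset({"Bread", "Juice"}): 3,
    frozenset({"Bread", "Cheese", "Juice"}): 2,
}))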
Maximal vs Closed Itemsets

[Figure: nested sets — Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets]
A transaction database to
illustrate closed and maximal
itemsets

Transaction ID Items
100 Bread, Cheese, Juice
200 Bread, Cheese, Juice, Milk
300 Cheese, Juice, Egg
400 Bread, Juice, Milk, Egg
500 Milk, Egg
Frequent itemsets for the database
Itemset Support Closed? Maximal? Both?
{Bread} 3 No No No
{Cheese} 3 No No No
{Juice} 4 Yes No No
{Milk} 3 Yes No No
{Egg} 3 Yes No No
{Bread, Cheese} 2 No No No
{Bread, Juice} 3 Yes No No
{Bread, Milk} 2 No No No
{Cheese, Juice} 3 Yes No No
{Juice, Milk} 2 No No No
{Juice, Egg} 2 Yes Yes Yes
{Milk, Egg} 2 Yes Yes Yes
{Bread, Cheese, Juice} 2 Yes Yes Yes
{Bread, Juice, Milk} 2 Yes Yes Yes
Association models
 associate a particular conclusion
(such as a decision to buy something)
with a set of conditions.

 The Generalized Rule Induction (GRI)
node discovers association rules in
the data. For example, customers
who purchase razors and aftershave
lotion are also likely to purchase
shaving cream. GRI extracts rules
with the highest information content
based on an index that takes both
the generality (support) and
accuracy (confidence) of rules into
account. GRI can handle numeric
and categorical inputs, but the target
must be categorical.


Association models
 The Apriori node extracts a set of
rules from the data, pulling out the
rules with the highest information
content.
 Apriori offers five different methods of
selecting rules and uses a
sophisticated indexing scheme to
process large datasets efficiently.
 For large problems, Apriori is
generally faster to train than GRI; it
has no arbitrary limit on the
number of rules that can be
retained, and it can handle rules
with up to 32 preconditions.
 Apriori requires that input and output
fields all be categorical but delivers
better performance because it is
optimized for this type of data.

 At the end of the processing, a table of the best rules is
presented. Unlike a decision tree, this set of association rules
cannot be used directly to make predictions in the way that a
standard model (such as a decision tree or a neural network) can,
because of the many different possible conclusions for the rules.
Another level of transformation is required to turn the
association rules into a classification ruleset. Hence, the rules
produced by association algorithms are known as unrefined models.
Although the user can browse these unrefined models, they cannot
be used explicitly as classification models unless the user tells
the system to generate a classification model from the unrefined
model. This is done from the browser through a Generate menu
option.

Association Rule Mining
 X & Y appear in only 10% of the transactions,
but whenever X appears there is an 80%
chance that Y also appears.
 The 10% presence is called support (or
prevalence).
 The 80% chance is called confidence (or
predictability).
 High level of support – the rule is frequent
enough for the business to be interested in
it.
 High level of confidence – the rule is true
often enough to justify a decision based on
it.
Association Rule Mining
 Total number of transactions = N
 Support(X) = (Number of times X appears) / N
 = P(X)
 Support(XY) = (Number of times X and Y
appear together) / N
 = P(X∩Y)
 Confidence of (X → Y)
= Support(XY) / Support(X)
 = P(X∩Y) / P(X)
 = P(Y∣X)
 P(Y∣X) is the probability of Y once X has taken
place, also called the conditional probability of
Y

Association Rule Mining
 Lift – used to measure the power of
association between items that are
purchased together.
 How much more likely is an item Y to be
purchased if the customer has bought an item X
identified as having an association with Y,
compared to the likelihood of Y being purchased
without X being purchased?
 Lift = P(Y∣X) / P(Y)
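
A worked sketch using the earlier market-basket numbers (X = {Milk, Diaper}, Y = {Beer}: counts 3, 3, and 2 out of 5 transactions):

# Lift = P(Y|X) / P(Y): how much more likely Y becomes when X is present.
def lift(n, n_x, n_y, n_xy):
    """n transactions; n_x, n_y, n_xy = counts of X, Y, and X with Y."""
    confidence = n_xy / n_x          # P(Y|X)
    return confidence / (n_y / n)    # divide by P(Y)

print(lift(n=5, n_x=3, n_y=3, n_xy=2))   # (2/3) / (3/5) ~= 1.11 > 1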
