Documente Academic
Documente Profesional
Documente Cultură
— Chapter 6 —
Basic Concepts
Evaluation Methods
Summary
4
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
5
Why Is Freq. Pattern Mining Important?
Broad applications
6
Basic Concepts: Frequent Patterns
7
Basic Concepts: Association Rules
Tid Items bought Find all the rules X Y with
10 Beer, Nuts, Diaper
minimum support and confidence
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs support, s, probability that a
40 Nuts, Eggs, Milk transaction contains X Y
50 Nuts, Coffee, Diaper, Eggs, Milk
confidence, c, conditional
probability that a transaction
Customer Customer
buys both
having X also contains Y
buys
diaper
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
Customer {Beer, Diaper}:3
buys beer Association rules: (many more!)
Beer Diaper (60%, 100%)
Diaper Beer (60%, 75%)
8
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-
patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … +
(110000) = 2100 – 1 = 1.27*1030 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no
super-pattern Y כX, with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y כX (proposed by
Bayardo @ SIGMOD’98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
9
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
10
Scalable Frequent Itemset Mining Methods
Approach
Data Format
11
The Downward Closure Property and Scalable
Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
diaper}
i.e., every transaction having {beer, diaper, nuts} also
@SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
12
Apriori: A Candidate Generation & Test Approach
13
The Apriori Algorithm—An Example
Supmin = 2 Itemset sup
Itemset sup
Database TDB {A} 2
Tid Items
L1 {A} 2
C1 {B} 3
{B} 3
10 A, C, D {C} 3
1st scan {C} 3
20 B, C, E {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup C2 Itemset
{A, B} 1
L2 Itemset sup 2nd scan {A, B}
{A, C} 2
{A, C} 2 {A, C}
{A, E} 1
{B, C} 2
{B, C} 2 {A, E}
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
15
Implementation of Apriori
20
Further Improvement of the Apriori Method
• Partition size is set so that each partion can fit into main memory and
read only once in each phase.
Read the book!!!
Sampling for Frequent Patterns
27
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc abcd is a frequent pattern
28
December 30, 2017 Data Mining: Concepts and Techniques 29
December 30, 2017 Data Mining: Concepts and Techniques 31
December 30, 2017 Data Mining: Concepts and Techniques 32
December 30, 2017 Data Mining: Concepts and Techniques 33
December 30, 2017 Data Mining: Concepts and Techniques 34
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never be larger than the original database (not count
node-links and the count field)
40
Further Improvements of Mining Methods
47
Extension of Pattern Growth Mining Methodology
Pattern-growth-based Clustering
MaPle (Pei, et al., ICDM’03)
Pattern-Growth-Based Classification
Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
48
Computational Complexity of Frequent Itemset
Mining
How many itemsets are potentially to be generated in the worst case?
The number of frequent itemsets to be generated is senstive to the
minsup threshold
When minsup is low, there exist potentially an exponential number of
frequent itemsets
The worst case: MN where M: # distinct items, and N: max length of
transactions
The worst case complexty vs. the expected probability
Ex. Suppose Walmart has 104 kinds of products
The chance to pick up one product 10-4
The chance to pick up a particular set of 10 products: ~10-40
What is the chance this particular set of 10 products to be frequent
103 times in 109 transactions?
59
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
60
Interestingness Measure: Correlations (Lift)
play basketball eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
Measure of dependent/correlated events: lift
62
Which Null-Invariant Measure Is Better?
IR (Imbalance Ratio): measure the imbalance of two
itemsets A and B in rule implications
Basic Concepts
Evaluation Methods
Summary 67
Pattern Mining Problems
• Again, there are many kinds of pattern mining problems,
according to the classes of patterns and databases
Questions:
Isomorphism check is easy?
Canonical form exists?
Frequency
- inclusion check is easy?
Frequent Graph Mining
• Labeled graph is a graph is labels on either vertices or
edges
- chemical compounds
- networks of maps
- graphs of organization, relationship
- XML
December 30, 2017 Fischer, I., Meinl, T. (2004), “Graph Based Molecular Data Mining – An Overview” 72
Lattice Structure
of the lattice, i.e. structures that have the same number of edges,
are searched before the next level is started.
The main disadvantage of breadth first search is the larger memory
consumption because in the middle of the lattice: a large amount of
subgraphs has to he stored. With depth first search only structures
whose amount is proportional to the size of the biggest graph in the
database have to be recorded during the search.
Second method: At each level of the lattice structure, sub graphs are
constructed from the graphs at one level high by adding an edge.
The lattice structure extends too much!!! (Unnecessary candidate
Basic Concepts
Evaluation Methods
Summary
80
Summary
81
December 30, 2017 Data Mining: Concepts and Techniques 82
Ref: Basic Concepts of Frequent Pattern Mining
83
Ref: Apriori and Its Improvements
86
Ref: Mining Correlations and Interesting Rules
88