— Chapter 5 —
Mining Frequent Patterns, Association and Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
Scalable methods for mining frequent patterns:
Apriori (Agrawal & Srikant @VLDB'94)
FPgrowth (Han, Pei & Yin @SIGMOD'00)
Algorithms using vertical format (ECLAT)
Apriori property: any subset of a frequent itemset must also be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested!
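To make the pruning principle concrete, here is a minimal Apriori sketch in Python (illustrative only, not the book's pseudocode; all names are my own): size-k candidates are joined from frequent (k-1)-itemsets, and any candidate with an infrequent (k-1)-subset is discarded before support counting.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: count 1-itemsets, then repeatedly join and prune."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): a candidate with any infrequent
        # (k-1)-subset cannot be frequent, so skip counting it entirely.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in frequent.items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result
```

For example, apriori([{'beer', 'diaper', 'nuts'}, {'beer', 'diaper'}], 2) returns {beer}, {diaper} and {beer, diaper} with support 2.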
[Figure: hash tree for counting candidate 3-itemset supports; interior nodes route items with the hash function 1,4,7 / 2,5,8 / 3,6,9, leaves hold candidates such as {1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4}, {5,6,7}, {3,4,5}, {3,5,6}, {3,5,7}, {3,6,7}, {3,6,8}, {6,8,9}; the transaction {2, 3, 5, 6, 7} is matched against the tree to find which candidates it contains]
Bottlenecks of Apriori
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Reduce number of transactions
Shrink number of candidates
Facilitate support counting of candidates
Partitioning: Reduce Number of Scans
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB: scan 1 mines local frequent patterns per partition, scan 2 verifies the global candidates (A. Savasere, E. Omiecinski, and S. Navathe @VLDB'95)
DHP (direct hashing and pruning): hash k-itemsets into buckets; a k-itemset whose bucket count is below the threshold cannot be frequent
Especially useful for 2-itemsets: generate a hash table of 2-itemsets during the scan that counts 1-itemsets
If the minimum support count is 3, the itemsets in buckets 0, 1, 3, 4 should not be included among candidate 2-itemsets
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
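A small sketch of the DHP idea in Python (the bucket count and the hash over sorted pairs are illustrative, not the paper's): while scanning transactions for the 1-itemset counts, every 2-itemset is also hashed and tallied, and buckets below min_sup later veto candidate 2-itemsets.

```python
from itertools import combinations

def dhp_bucket_counts(transactions, num_buckets=7):
    """While scanning transactions (e.g., during the 1-itemset count),
    hash every 2-itemset into a bucket and tally bucket hits."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_sup):
    """A 2-itemset whose bucket count is below min_sup cannot be frequent,
    so it is never even generated as a candidate."""
    key = tuple(sorted(pair))
    return buckets[hash(key) % len(buckets)] >= min_sup
```

A bucket count over-counts each 2-itemset (several itemsets share a bucket), so the test can only safely say "no": exactly the pruning direction DHP needs.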
Sampling for Frequent Patterns
Select a sample of the original database, mine frequent patterns within the sample using a lower support threshold, then scan the rest of the database to verify and to find patterns missed by the sample (H. Toivonen @VLDB'96)
Frequent pattern growth: FPgrowth (Han, Pei & Yin @SIGMOD'00)
Steps to construct the FP-tree:
1. Scan the DB once and find the frequent 1-itemsets
2. Sort frequent items in descending frequency order: F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree
[Figure: FP-tree under construction; header-table entries such as p: 3 and node counts such as m:2, b:1, p:2, m:1]
Prefix Tree (Trie)
Prefix tree:
Keys are usually strings
All descendants of one node have a common prefix
Advantages:
Fast lookup: O(m) for a key of length m
Less space with a large number of short strings
Helps with longest-prefix matching
Applications:
Storing a dictionary
Approximate matching algorithms, including spell checking
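A minimal trie sketch in Python illustrating the points above (class and method names are my own):

```python
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}   # char -> TrieNode; shared prefixes share nodes
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):                 # O(len(word))
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):                 # O(len(word)) lookup
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def longest_prefix(self, query):
        """Longest stored word that is a prefix of `query`."""
        node, best = self.root, ""
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_word:
                best = query[:i + 1]
        return best
```

Lookup cost depends only on key length, not on the number of stored keys, which is why tries suit dictionaries of many short strings.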
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Idea (frequent pattern growth): recursively grow frequent patterns by pattern and database partition
Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
[Figure: FP-tree for the example DB: root {} with branches f:4 → c:3 → a:3 → (m:2 → p:2; b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1, plus a header table listing f: 4, c: 4, a: 3, b: 3, m: 3, p: 3 with node-links into the tree]
Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
[Figure: recursion on the m-conditional FP-tree (f:3, c:3, a:3): e.g., the conditional pattern base of "cm" is (f:3), yielding am-, cm-, and cam-conditional FP-trees]
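The method above can be sketched compactly in Python. This is a minimal illustrative implementation (my own names and simplifications, not the authors' code): rather than projecting the tree in place, it rebuilds a small FP-tree from each conditional pattern base, which keeps the sketch short at some cost in efficiency.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """One pass to count items, a second to insert ordered transactions."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # F-list: frequent items in descending frequency order
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1])
             if c >= min_sup]
    rank = {item: r for r, item in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node-link for the header table
            child.count += 1
            node = child
    return header, flist

def fp_growth(transactions, min_sup, suffix=()):
    """Yield (pattern, support) pairs by recursive conditional projection."""
    header, flist = build_fp_tree(transactions, min_sup)
    for item in reversed(flist):              # start from the least frequent
        pattern = (item,) + suffix
        yield pattern, sum(n.count for n in header[item])
        # Conditional pattern base: the prefix path of every `item` node,
        # replicated by that node's count.
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        yield from fp_growth(cond_db, min_sup, pattern)
```

Run on the classic five-transaction example with min_sup = 3, it derives the conditional pattern bases shown above (e.g., fca:2, fcab:1 for m) and emits every frequent pattern with its support.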
[Figure: run time vs. support threshold (%) over the range 0 to 3; FP-growth scales much better than Apriori as the support threshold decreases]
Divide-and-conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far, leading to focused search of smaller databases
FPgrowth (Han, Pei & Yin @SIGMOD'00)
Algorithms using vertical data format (ECLAT)
FIMI Workshop
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
Min_sup = 2
Max-Patterns Illustration
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the border between frequent and infrequent itemsets; maximal frequent itemsets lie just inside the border]
Closed Patterns
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID   Items
1     {A, B}
2     {B, C, D}
3     {A, B, C, D}
4     {A, B, D}
5     {A, B, C, D}
Itemset     Support
{A}         4
{B}         5
{C}         3
{D}         4
{A,B}       4
{A,C}       2
{A,D}       3
{B,C}       3
{B,D}       4
{C,D}       3
{A,B,C}     2
{A,B,D}     3
{A,C,D}     2
{B,C,D}     3
{A,B,C,D}   2
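A brute-force check of these definitions in Python (illustrative only; the enumeration is exponential, which is fine for a toy table like the one above):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Enumerate every itemset and keep those meeting min_sup."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= set(t))
            if sup >= min_sup:
                freq[frozenset(cand)] = sup
    return freq

def closed_and_maximal(freq):
    closed, maximal = set(), set()
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] < sup for Y in supersets):
            closed.add(X)      # no frequent superset has the same support
        if not supersets:
            maximal.add(X)     # no frequent superset at all
    return closed, maximal
```

On the five transactions above with min_sup = 2, {A,B,C,D} (support 2) comes out both closed and maximal, while {A,B} (support 4) is closed but not maximal.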
Maximal vs Closed Itemsets
Exercise: Closed Patterns and Max-Patterns
[Figure: set-enumeration tree over {A, B, C, D}, from the root Φ (ABCD) down to the leaf ABCD ()]
Algorithm MaxMiner
Initially, generate one node N = Φ (ABCD), where h(N) = Φ is the head and t(N) = {A, B, C, D} is the tail of candidate extensions
Recursively expand N by moving one tail item at a time into the head
Local pruning:
If h(N) ∪ t(N) is frequent, every itemset in N's sub-tree is frequent, so output h(N) ∪ t(N) and stop expanding N
If h(N) ∪ {i} is infrequent for some i ∈ t(N), remove i from t(N) before expanding N
Global pruning technique (across sub-trees): whenever a frequent itemset is found, any node N elsewhere in the tree with h(N) ∪ t(N) contained in that itemset can be pruned, since it can contribute no new max pattern
Example (Min_sup = 2)
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
[Figure: set-enumeration tree rooted at Φ (ABCDEF); after local pruning, node A keeps only the children AC (D) and AD ()]
Node A: extension counts: ABCDE 1, AB 1, AC 2, AD 2, AE 1
h ∪ t = ABCDE is infrequent, so A must be expanded; AB and AE are infrequent, so B and E are dropped from A's tail
Max patterns so far: (none)
Example (continued; same DB, Min_sup = 2)
Node B: extension counts: BCDE 2, BC 2, BD 2, BE 2
h ∪ t = BCDE is frequent, so BCDE is output and B's entire sub-tree is pruned
Max patterns so far: BCDE
Example (continued; same DB, Min_sup = 2)
Node AC: extension counts: ACD 2
h ∪ t = ACD is frequent, so ACD is output and the sub-tree is pruned; node AD () has an empty tail and adds nothing new (AD ⊂ ACD)
Max patterns: BCDE, ACD
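The walk-through can be reproduced with a compact MaxMiner-style sketch (illustrative Python of my own, not Bayardo's implementation; a node is carried as a (head, tail) pair):

```python
def max_miner(transactions, min_sup):
    """Set-enumeration search with MaxMiner-style lookahead pruning."""
    def sup(itemset):
        return sum(1 for t in transactions if set(itemset) <= set(t))
    found = []
    def expand(head, tail):
        # Local pruning: if head + tail is frequent, the whole sub-tree
        # is frequent; record head + tail and do not expand further.
        if tail and sup(head + tail) >= min_sup:
            found.append(head + tail)
            return
        # Drop tail items whose 1-extension of head is infrequent.
        tail = [i for i in tail if sup(head + [i]) >= min_sup]
        if not tail:
            if head:
                found.append(head)
            return
        for idx, i in enumerate(tail):
            expand(head + [i], tail[idx + 1:])
    expand([], sorted({i for t in transactions for i in t}))
    # Keep only maximal patterns (no strict superset among those found).
    return [m for m in found if not any(set(m) < set(o) for o in found)]
```

On the three-transaction DB above with min_sup = 2, max_miner([list('ABCDE'), list('BCDE'), list('ACDF')], 2) returns exactly BCDE and ACD.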
Mining Frequent Closed Patterns: CLOSET (J. Pei, J. Han & R. Mao @DMKD'00)
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X, “coke”)
Hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Frequent itemset → frequent predicate set
Treating quantitative attributes: discretization, e.g., static discretization based on a data cube over (age, income, buys)
ARCS: Association Rule Clustering System
Example
age(X,”34-35”) ∧ income(X,”30-50K”)
⇒ buys(X,”high resolution TV”)
Interestingness measures beyond support and confidence:
Chi-square (χ²)
Pearson correlation
lift = P(A∪B) / (P(A) P(B))
all_conf(X) = sup(X) / max_item_sup(X) = P(A∪B) / max(P(A), P(B))
coh(X) = sup(X) / |universe(X)| = P(A∪B) / (P(A) + P(B) − P(A∪B))
(Here P(A∪B) follows the deck's notation for the probability that a transaction contains all the items of A and B together, i.e., their joint occurrence.)
Are Lift and Chi-Square Good Measures?
Tan, Kumar & Srivastava @KDD'02; Omiecinski @TKDE'03
             Milk     No Milk   Sum (row)
Coffee       m, c     ~m, c     c
No Coffee    m, ~c    ~m, ~c    ~c
Sum (col.)   m        ~m        Σ
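A quick way to compare the measures on such a 2×2 table (an illustrative helper of my own; argument names follow the table's cells):

```python
def measures(mc, m_nc, nm_c, nm_nc):
    """Lift, all_confidence and coherence from the four cell counts
    (m,c), (m,~c), (~m,c), (~m,~c) of a 2x2 contingency table."""
    n = mc + m_nc + nm_c + nm_nc
    p_a = (mc + m_nc) / n        # P(milk)
    p_b = (mc + nm_c) / n        # P(coffee)
    p_ab = mc / n                # joint occurrence of milk and coffee
    lift = p_ab / (p_a * p_b)
    all_conf = p_ab / max(p_a, p_b)
    coh = p_ab / (p_a + p_b - p_ab)
    return lift, all_conf, coh
```

Note that lift changes when the null-transaction count (~m, ~c) grows, while all_conf and coh do not depend on it; this null-invariance is the crux of the critique in the papers cited above.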
Correlation Measure: Pearson’s Coefficient
r_{A,B} = Σ(A − Ā)(B − B̄) / ((n − 1) σ_A σ_B) = (Σ(AB) − n Ā B̄) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-products.
rA,B > 0: positively correlated
rA,B = 0: independent;
rA,B < 0: negatively correlated
Commonly used in recommender systems
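A direct transcription of the formula into Python (a minimal sketch; the (n − 1) factors in the numerator and the sample standard deviations cancel, so raw sums of squares suffice):

```python
import math

def pearson(A, B):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(A)
    mean_a, mean_b = sum(A) / n, sum(B) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B))
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in A))
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in B))
    return cov / (sd_a * sd_b)
```

The sign of the result follows the interpretation above: positive for positively correlated attributes, near zero for independent ones, negative for negatively correlated ones.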
Chapter 5: Mining Frequent Patterns,
Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
Constraint pushing
Shares a similar philosophy with pushing selections deep into query processing
What kind of constraints can be pushed?
Constraints
Anti-monotonic
Monotonic
Succinct
Convertible
Constraint                     Antimonotone   Monotone   Succinct
sum(S) ≤ v (a ∈ S, a ≥ 0)      yes            no         no
sum(S) ≥ v (a ∈ S, a ≥ 0)      no             yes        no
range(S) ≤ v                   yes            no         no
range(S) ≥ v                   no             yes        no
support(S) ≤ ξ                 no             yes        no
[Figure: a classification of constraints: the antimonotone, monotone, and succinct classes, with convertible anti-monotone, convertible monotone, and strongly convertible constraints between them, and inconvertible constraints outside]
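To illustrate constraint pushing, here is a sketch (my own illustrative Python, with a hypothetical price table) that pushes the anti-monotone constraint sum(S) ≤ v into level-wise mining: once an itemset violates the constraint, no superset can satisfy it (prices are non-negative), so it is pruned exactly as an infrequent itemset would be.

```python
def frequent_with_sum_constraint(transactions, min_sup, v, price):
    """Level-wise mining with the anti-monotone constraint sum(S) <= v
    pushed inside; `price` maps each item to a non-negative value."""
    transactions = [frozenset(t) for t in transactions]
    def ok(itemset):
        # Anti-monotone test: once violated, all supersets violate it too.
        return sum(price[i] for i in itemset) <= v
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items
             if ok([i]) and sum(1 for t in transactions if i in t) >= min_sup]
    result = list(level)
    while level:
        size = len(level[0]) + 1
        cands = {a | b for a in level for b in level if len(a | b) == size}
        # Constraint and support pruning applied together, level by level.
        level = [c for c in cands if ok(c)
                 and sum(1 for t in transactions if c <= t) >= min_sup]
        result.extend(level)
    return result

# e.g., frequent_with_sum_constraint(db, 2, v=100, price={'a': 40, 'b': 70})
```

Because the constraint is tested before support counting, itemsets that are too expensive never reach the costly counting step, which is the whole point of pushing the constraint into the algorithm.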