Hazem Hajj
Source: © 2006 Jiawei Han and Micheline Kamber. All rights reserved.
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data.
What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?
Applications
Basket data analysis, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Discloses an intrinsic and important property of data sets. Forms the foundation for many essential data mining tasks:
- Association, correlation, and causality analysis
- Sequential and structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Itemset X = {x1, ..., xk}. For an association rule X ⇒ Y:
- support, s: the probability that a transaction contains X ∪ Y
- confidence, c: the conditional probability that a transaction having X also contains Y
Let min_sup = 50%, min_conf = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 3/5 = 60%, confidence 3/3 = 100%)
D ⇒ A (support 3/5 = 60%, confidence 3/4 = 75%)
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Idea: mining closed patterns and max-patterns can be sufficient to recover all frequent patterns and derive the important rules.
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ICDT'99). (Intuitively, a closed pattern is the biggest pattern before the support count starts going down.)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @SIGMOD'98). (Intuitively, a max-pattern is the biggest pattern that remains frequent.)
Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 1.
What is the set of closed itemsets? <a1, ..., a100>: 1 and <a1, ..., a50>: 2
What is the set of max-patterns? <a1, ..., a100>: 1
From the maximal frequent itemsets, we can only assert that their subsets are frequent, but we cannot assert the subsets' actual support counts (e.g., consider {a3, a9}). From the set of closed frequent itemsets, we have complete information about the counts of the subsets.
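To make the two definitions concrete, here is a minimal brute-force sketch (an illustration, not part of the original slides; the helper name closed_and_max is ours) that separates closed and max patterns given a table of frequent itemsets and their supports, using the patterns {A:3, D:4, AD:3} from the earlier example:

    def closed_and_max(freq):
        """freq maps frozenset itemsets (all frequent) to their support counts."""
        closed, maximal = {}, {}
        for X, sup in freq.items():
            supersets = [Y for Y in freq if Y > X]      # proper frequent supersets of X
            if all(freq[Y] < sup for Y in supersets):   # no superset keeps X's support
                closed[X] = sup
            if not supersets:                           # no frequent superset at all
                maximal[X] = sup
        return closed, maximal

    freq = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
    closed, maximal = closed_and_max(freq)
    print(closed)   # {D}:4 and {A,D}:3 are closed; {A}:3 is not ({A,D} has the same support)
    print(maximal)  # only {A,D} is maximal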
Roadmap
Frequent pattern mining can be classified in various ways, which leads to different methods/research, based on:
- Completeness of patterns to be mined (e.g., all frequent patterns, closed, max, top-k frequent)
- Levels of abstraction (e.g., yield rule based on a litho tool versus the litho area): single-level versus multi-level association rules
- Number of data dimensions in the rule (e.g., buying a computer based on age versus based on age, income, student)
- Types of values in the rule: Boolean versus quantitative (e.g., buying yes/no versus age range)
- Kinds of rules: association versus correlation
- Kinds of patterns to be mined: itemsets versus sequences versus structures (graphs, trees, ...)
The last one will be considered in the advanced data mining class.
- Efficient and scalable frequent itemset mining methods: the simple case (single-level, single-dimension, Boolean) rules from transactional DBs
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}. (The contrapositive is the iceberg condition used for pruning: if {beer, diaper} is not frequent, then {beer, diaper, nuts} cannot be frequent.)
Scalable mining methods: three major approaches
- Apriori (Agrawal & Srikant @VLDB'94) (5.2.1)
- Frequent pattern growth (FP-growth: Han, Pei & Yin @SIGMOD'00) (5.2.4)
- Vertical data format approach (CHARM: Zaki & Hsiao @SDM'02) (5.2.5)
Apriori
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
- Initially, scan the DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets (this step includes the a priori pruning)
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated
Example with min_sup = 2. Database DB:
TID  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  →  L1: {A}:2, {B}:3, {C}:3, {E}:3
Join, C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
Prune, L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
Join & prune, C3: {B,C,E}
3rd scan, L3: {B,C,E}:2
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin        // keep going until no more candidates
    Ck+1 = candidates generated from Lk;  // join & prune
    for each transaction t in database do // DB scan
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;  // iceberg condition
end
return ∪k Lk;
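The pseudo-code translates directly into a runnable sketch. The following Python version (a minimal illustration, not the book's reference implementation) runs on the example DB from the earlier slide and reproduces L3 = {B,C,E}:2:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {frozenset: support_count} for all frequent itemsets."""
        transactions = [frozenset(t) for t in transactions]
        counts = {}
        for t in transactions:                  # 1st scan: count single items
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}   # L1
        frequent, k = dict(Lk), 1
        while Lk:
            items = sorted({i for s in Lk for i in s})
            candidates = set()
            for s in Lk:                        # join: extend each frequent k-itemset
                for i in items:
                    if i not in s:
                        c = s | {i}
                        # prune: every k-subset of the candidate must itself be frequent
                        if all(frozenset(sub) in Lk for sub in combinations(c, k)):
                            candidates.add(frozenset(c))
            # scan the DB to count the surviving candidates
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {s: n for s, n in counts.items() if n >= min_sup}  # iceberg condition
            frequent.update(Lk)
            k += 1
        return frequent

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(db, min_sup=2))   # includes frozenset({'B','C','E'}): 2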
Example of candidate generation, with L3 = {abc, abd, acd, ace, bcd}:
Self-join: abcd from abc and abd; acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
How to generate candidates? Join & prune (algorithm illustrated on the previous slide).
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Assume we got the frequent set {I1, I2, I5}: 2. Generate its subsets and their supports: {I1, I2}: 4, {I1, I5}: 2, {I2, I5}: 2, {I1}: 6, ... Decide on the strong association rules. What is the confidence of:
I1 ∧ I2 ⇒ I5? (2/4 = 50%)
I1 ⇒ I2 ∧ I5? (2/6 ≈ 33%)
I2 ∧ I5 ⇒ I1? (2/2 = 100%)
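A small sketch of this rule-generation step (illustrative; the single-item counts {I2}: 7 and {I5}: 2 are taken from the textbook's version of this example and are an assumption here, since the slide lists only some of the subset counts):

    from itertools import combinations

    def rules_from_itemset(itemset, support, counts, min_conf):
        """Emit rules A => (itemset - A) whose confidence support/count(A) meets min_conf."""
        for r in range(1, len(itemset)):
            for a in combinations(sorted(itemset), r):
                conf = support / counts[frozenset(a)]
                if conf >= min_conf:
                    yield set(a), set(itemset) - set(a), conf

    counts = {frozenset(s): c for s, c in [
        ({"I1", "I2"}, 4), ({"I1", "I5"}, 2), ({"I2", "I5"}, 2),
        ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2)]}   # {I2}:7, {I5}:2 assumed from the book
    for a, b, conf in rules_from_itemset({"I1", "I2", "I5"}, 2, counts, min_conf=0.5):
        print(a, "=>", b, f"({conf:.0%})")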
Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improvement methods: partitioning, sampling, hashing with pruning, transaction reduction (if a transaction contains no frequent items, remove it), inverted index.
- Partitioning: mine local frequent itemsets per partition, then verify the global candidates among them
- Hash-based technique: as the 1-itemsets are being counted, generate 2-itemset counts by hashing each 2-itemset into a bucket and incrementing the bucket count on every 2-itemset insertion; buckets below min-support are then removed
- Sampling: trade accuracy against efficiency
- Dynamic itemset counting: divide the DB into blocks; add new candidates at the start of every block instead of waiting for a full DB scan
- Transaction reduction: if a transaction does not contain any frequent k-itemset, it cannot contain a frequent (k+1)-itemset
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
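A sketch of phase 1 of the partition algorithm (illustrative; the brute-force local miner stands in for a real Apriori pass and only suits tiny partitions):

    from itertools import chain, combinations

    def local_frequent(transactions, min_sup):
        """Brute-force local miner: enumerate all itemsets over the partition's items."""
        items = sorted({i for t in transactions for i in t})
        subsets = chain.from_iterable(combinations(items, k) for k in range(1, len(items) + 1))
        return {frozenset(s) for s in subsets
                if sum(1 for t in transactions if set(s) <= t) >= min_sup}

    def partition_candidates(transactions, min_sup_ratio, n_parts=2):
        """Phase 1: any globally frequent itemset is locally frequent in some partition,
        so the union of the local frequent itemsets is a complete global candidate set.
        Phase 2 (not shown) counts these candidates in one full scan."""
        size = (len(transactions) + n_parts - 1) // n_parts
        cands = set()
        for p in range(n_parts):
            part = transactions[p * size:(p + 1) * size]
            cands |= local_frequent(part, max(1, int(min_sup_ratio * len(part))))
        return cands

    db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
    print(partition_candidates(db, min_sup_ratio=0.5))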
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
Candidates: a, b, c, d, e. Hash entries, e.g.: {ab, ad, ae}, {bd, be, de}, ... Frequent 1-itemsets: a, b, d, e. Then ab is not a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold.
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD 95
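A sketch of this hash-based pruning idea (illustrative; the bucket count of 7 and the tiny DB are arbitrary choices, not from the paper):

    from collections import defaultdict
    from itertools import combinations

    def dhp_candidate_pairs(transactions, min_sup, n_buckets=7):
        """While counting 1-itemsets, hash every 2-itemset into a bucket; a pair can
        only become a candidate if its whole bucket reaches the support threshold."""
        item_count = defaultdict(int)
        bucket = [0] * n_buckets
        for t in transactions:
            for i in t:
                item_count[i] += 1
            for pair in combinations(sorted(t), 2):
                bucket[hash(pair) % n_buckets] += 1
        L1 = sorted(i for i, c in item_count.items() if c >= min_sup)
        return {pair for pair in combinations(L1, 2)
                if bucket[hash(pair) % n_buckets] >= min_sup}

    db = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "b", "e"}, {"b", "e"}]
    print(dhp_candidate_pairs(db, min_sup=2))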
Select a sample of the original database and mine frequent patterns within the sample using Apriori. Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked. Scan the database again to find missed frequent patterns.
H. Toivonen. Sampling large databases for association rules. VLDB'96.
Skip
Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
(Itemset lattice figure.)
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97. (DIC)
Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates.
Grow long patterns from short ones using local frequent items
"abc" is a frequent pattern; get all transactions having abc: DB|abc. "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern.
The idea:
1. For each item (starting with the least frequent), call it z, identify all frequent itemsets that can combine with it (pruning out those that are not frequent).
2. Combine these itemsets with the item z to form the combined frequent itemsets having that item (z) in them.
3. Move to the next least frequent item, call it y, and repeat the same, but exclude all itemsets that had z in them, since those are already captured.
Approach
Build a tree to represent the DB. Mine the tree (instead of the DB) more efficiently. To facilitate scanning the tree, use an (index) header table.
min_support = 3
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan the DB again and construct the FP-tree
F-list = f-c-a-b-m-p
(FP-tree figure: a root path f:4, c:3, a:3, m:2, p:2 with side branches b:1 and m:1.)
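Steps 1-3 in code (a minimal sketch, using the running example's transactions; step 3's tree insertion is shown in full in the FP-growth sketch later in this section):

    from collections import Counter

    def flist_and_projected(transactions, min_sup):
        """Step 1: count items; step 2: keep frequent ones in descending-frequency
        order (the f-list); each transaction is then projected onto that order,
        ready to be inserted into the FP-tree (step 3)."""
        counts = Counter(i for t in transactions for i in set(t))
        flist = [i for i, c in counts.most_common() if c >= min_sup]
        rank = {i: r for r, i in enumerate(flist)}
        projected = [sorted((i for i in set(t) if i in rank), key=rank.get)
                     for t in transactions]
        return flist, projected

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
    flist, projected = flist_and_projected(db, min_sup=3)
    print(flist)         # f-c-a-b-m-p (ties between equal counts may reorder)
    print(projected[0])  # ['f', 'c', 'a', 'm', 'p']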
Step 4: starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the link of each frequent item p, and accumulate all the transformed prefix paths of item p to form p's conditional pattern base.

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

We then re-organize the potential combinations to reflect commonality by creating an FP-tree for each conditional pattern base, pruning out nodes that do not satisfy min_sup.
For each pattern base: accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base.
For example, consider item m. Its conditional pattern base is fca:2, fcab:1. The m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3 (b is dropped, since its count 1 < min_sup). All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam — all with count 3.
Recursing on the m-conditional FP-tree:
am-conditional FP-tree: {} → f:3 → c:3
cm-conditional FP-tree: {} → f:3
cam-conditional FP-tree: {} → f:3
Recursion details
Associations
Now we can compute strong associations from the FP-tree-mined frequent patterns, similar to the rule generation step after candidate generation.
Idea: frequent pattern growth — recursively grow frequent patterns by pattern and database partition.
Method:
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
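The whole method fits in a compact sketch. This Python version (a minimal, unoptimized illustration; it rebuilds a small FP-tree for each conditional pattern base rather than using the single-path shortcut) reproduces the running example's patterns:

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.children = item, parent, 0, {}

    def build_tree(transactions, min_sup):
        """One pass to find the f-list, a second to insert reordered transactions."""
        counts = defaultdict(int)
        for t in transactions:
            for item in set(t):
                counts[item] += 1
        flist = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= min_sup]
        rank = {item: r for r, item in enumerate(flist)}
        root, header = Node(None, None), defaultdict(list)  # header table: item -> nodes
        for t in transactions:
            node = root
            for item in sorted((i for i in set(t) if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return header, flist

    def fp_growth(transactions, min_sup, suffix=()):
        header, flist = build_tree(transactions, min_sup)
        for item in reversed(flist):               # start from the least frequent item
            support = sum(n.count for n in header[item])
            yield (item,) + suffix, support
            cond_base = []                         # conditional pattern base of `item`
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:          # collect the prefix path of this node
                    path.append(p.item)
                    p = p.parent
                cond_base.extend([path] * node.count)   # weight the path by its count
            yield from fp_growth(cond_base, min_sup, (item,) + suffix)

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
    for pattern, sup in fp_growth(db, min_sup=3):
        print(set(pattern), sup)    # e.g. {'f','c','a','m'}: 3, as on the slides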
Multi-level Association
I. Apply concept hierarchies to find the counts of patterns at different levels; the goal is to generate associations at different levels.
II. Use the Apriori (or another) optimization for mining at each level (just like before) and generate associations at each level.
III. Remove redundant associations.
Data items at low primitive levels may be sparse (e.g., buying a particular type and model of laptop) versus data items at higher levels of abstraction (e.g., buying computers). As a result, it may not be possible to extract strong associations for low primitives due to sparsity. Use concept hierarchies to identify interesting (frequent) patterns. Note: categorical attributes may have a natural concept hierarchy (e.g., city → state → country); numerical attributes can be discretized.
Top-down approach for mining multi-level associations using:
1) Uniform support for all levels, e.g., min_sup = 5% at both Level 1 and Level 2
2) Flexible (reduced) support settings, since items at the lower level are expected to have lower support, e.g., min_sup = 5% at Level 1 and min_sup = 3% at Level 2
Depending on the approach, different rules may be pruned.
Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95).
Some rules may be redundant due to ancestor relationships between items. Example:
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the expected value based on the rule's ancestor. Remove the less general rule: in the above example, remove the rule 2% milk ⇒ wheat bread.
Assumption: the count of 2% milk is about 1/4 of the milk transactions. In that case, the expected support is 1/4 of the milk rule's support (i.e., 2%), and the confidence is the same (since 2% milk is also milk).
Multi-Dimensional Association
Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
Categorical attributes: finite number of possible values, no ordering among values → data cube approach. Quantitative attributes: numeric, implicit ordering among values → discretization and clustering approaches.
Find frequent patterns over the multi-dimensional values, similar to frequent itemsets; then generate the associations.
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
Discretized prior to mining using concept hierarchies: numeric values are replaced by ranges. The data cube is well suited for mining, as it may already contain the dimensions of interest, e.g., the cuboids (age), (income), (buys). The cells of an n-dimensional cuboid correspond to the predicate sets, so mining from data cubes can be much faster. A set of values that occurs frequently in an n-dimensional cell can be mined for associations: the counts of the sub-items can be found in the cube, and a rule-generation algorithm then derives the associations.
Proposed by Lent, Swami and Widom @ICDE'97. Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized. 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat. (Dynamically) cluster adjacent association rules to form general rules using a 2-D grid. Example:
age(X, 34) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
age(X, 35) ∧ income(X, "30-40K") ⇒ buys(X, "high resolution TV")
These (together with the two analogous rules for income "40-50K") — four rules in all — can be clustered to get:
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
Misleading Example
Consider 10,000 sales transactions: 6,000 included computer games, 7,500 included videos, and 4,000 included both. We can conclude: buys(X, "computer games") ⇒ buys(X, "videos") [40%, 66.7%], but this is misleading, since 75% of all customers buy videos. In this case we actually have a negative correlation, so we may want to add a correlation measure.
Possible correlation measures: lift, chi-square (χ2), all-confidence, cosine, coherence. The idea is to use a correlation measure in conjunction with support/confidence to decide whether to keep or drop an association.
lift = P(A ∪ B) / (P(A) P(B))
If lift > 1, A and B are positively correlated; if lift < 1, negatively correlated; if lift = 1, independent.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate (because the two are negatively correlated), although it has lower support and confidence. Measure of dependent/correlated events: lift = P(A ∪ B) / (P(A) P(B)).

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

lift(B, C) = (2000/5000) / ((3000/5000)(3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000)(1250/5000)) ≈ 1.33
all_conf(X) = sup(X) / max{sup(x) : x ∈ X}
cosine(A, B) = P(A ∪ B) / √(P(A) P(B))
coherence(X) = sup(X) / |universe(X)|
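A quick sketch computing these measures from contingency counts (an illustrative helper; the numbers reproduce the basketball/cereal table above):

    from math import sqrt

    def measures(n_ab, n_a, n_b, n):
        """n_ab transactions contain both A and B; n_a contain A; n_b contain B."""
        p_ab, p_a, p_b = n_ab / n, n_a / n, n_b / n
        return {
            "lift": p_ab / (p_a * p_b),          # 1 means independence
            "all_conf": n_ab / max(n_a, n_b),    # sup(X) / max item support
            "cosine": p_ab / sqrt(p_a * p_b),
        }

    # basketball (A) and cereal (B): 2000 of 5000 students do both
    print(measures(n_ab=2000, n_a=3000, n_b=3750, n=5000))  # lift ~0.89: negative correlation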
Null-Invariant measures
A null transaction is a transaction that does not contain any of the itemsets being examined. A measure is null-invariant if its value is free from the influence of null transactions. Note that lift and chi-square can be influenced by null transactions, while all-confidence, coherence, and cosine are null-invariant.
Support and confidence are not good at representing correlations, hence the many interestingness measures (Tan, Kumar, Srivastava @KDD'02): lift = P(A ∪ B) / (P(A) P(B)), all_conf, coherence, ...

2x2 contingency table for milk (m) and coffee (c):
            Milk   No milk  Sum (row)
Coffee      m, c   ~m, c    c
No coffee   m, ~c  ~m, ~c   ~c
Sum (col.)  m      ~m       Σ

Four example databases:
DB   m, c   ~m, c   m, ~c   ~m, ~c
A1   1000   100     100     10,000
A2   100    1000    1000    100,000
A3   1000   100     10,000  100,000
A4   1000   1000    1000    1000

In A1, m and c are positively correlated; in A2 and A3 they are negatively correlated; in A4 they are independent. Note that lift and chi-square are not good measures of the negative correlations here; all the measures provide good indicators of independence.
Proposed Approach
Use all-confidence and coherence to start with, then zoom in with lift or chi-square when the numbers are close to independence (i.e., ambiguous).
lift and χ2 are not good measures for correlations in large transactional DBs; all-confidence or coherence could be good measures (Omiecinski @TKDE'03). Both all-confidence and coherence have the downward closure property, so efficient algorithms can be derived for mining them (Lee et al. @ICDM'03).
The patterns could be too many but not focused!
- The user directs what is to be mined using a data mining query language (or a graphical user interface)
- User flexibility: provides constraints on what is to be mined
- System optimization: explores such constraints for efficient mining — constraint-based mining
Constraint-based mining
- Knowledge type constraint: classification, association, etc.
- Data constraint — using SQL-like queries: find product pairs sold together in stores in Chicago in Dec.'02
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
Constrained mining vs. constraint-based search/reasoning: both aim at reducing the search space, but it is finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI; constraint-pushing vs. heuristic search. How to integrate them is an interesting research problem. Constrained mining vs. query processing in DBMS: database query processing requires finding all answers; constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing.
Anti-monotonicity
When an itemset S violates the constraint, so does any of its supersets. sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone.
Monotonicity
When an itemset S satisfies the constraint, so does any of its supersets. sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone. Example: itemset ab satisfies C, and so does every superset of ab.
Example. C: range(S.profit) ≥ 15
Succinctness
Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1. Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items alone, without looking at the transaction database. min(S.price) ≤ v is succinct; sum(S.price) ≥ v is not succinct.
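A sketch of pushing an anti-monotone constraint into candidate generation (illustrative price table and budget; in Apriori this filter is applied before the counting scan, since no superset of a violator can satisfy sum(S.price) ≤ budget):

    def prune_by_sum_constraint(candidates, price, budget):
        """Anti-monotone pruning: once the price sum exceeds the budget, the
        candidate and every superset of it can be discarded without counting."""
        return {c for c in candidates if sum(price[i] for i in c) <= budget}

    price = {"a": 10, "b": 20, "c": 150, "d": 50}             # hypothetical price table
    cands = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "d"})]
    print(prune_by_sum_constraint(cands, price, budget=100))  # {a,c} dropped: 160 > 100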
Example:
L1 (itemset: sup): {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (after join): {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5} — scan D
C3: {2,3,5} — scan D
L3 (itemset: sup): {2,3,5}: 2
Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items. Examine C: avg(S.profit) ≥ 25.
(Transaction table TID 10-40 and a profit-value table for items a-h.)
avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value descending order R: <a, f, g, d, b, h, c, e>; if an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd. avg(X) ≥ 25 is convertible monotone w.r.t. the item-value ascending order R⁻¹: <e, c, h, b, d, g, f, a>; if an itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix. Thus, avg(X) ≥ 25 is strongly convertible.
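A sketch of how the descending order R makes avg(X) ≥ 25 prefix-prunable (the profit values are reconstructed to match the slide's order R and should be treated as illustrative):

    profit = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}
    R = sorted(profit, key=profit.get, reverse=True)   # <a, f, g, d, b, h, c, e>

    def extensions(prefix, threshold=25):
        """Grow itemsets only with items later in R. If avg(prefix) < threshold,
        stop: every later item is no larger than the prefix's minimum, so the
        average can only drop (convertible anti-monotone)."""
        if sum(profit[i] for i in prefix) / len(prefix) < threshold:
            return
        yield prefix
        for i in R[R.index(prefix[-1]) + 1:]:
            yield from extensions(prefix + (i,), threshold)

    for item in R:
        for s in extensions((item,)):
            print(s)   # e.g. ('a',), ('a','f'), ('a','f','g'), ('a','f','g','d'), ...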
A convertible constraint that is neither monotone, nor anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm: within the level-wise framework, no direct pruning based on the constraint can be made. Itemset df violates constraint C: avg(X) ≥ 25, but since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned. The constraint can, however, be pushed into the frequent-pattern growth framework!
C: avg(X) ≥ 25, min_sup = 2. List the items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>.

Projection-based mining. TDB (min_sup = 2):
TID  Transaction (in order R)
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e

Item h is dropped; itemsets a and f are good. By imposing an appropriate order on the item projection, many tough constraints can be converted into (anti-)monotone ones.
Different constraints may require different or even conflicting item orderings. If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints. If there is a conflict on the order of items: try to satisfy one constraint first, then use the order for the other constraint to mine frequent itemsets in the corresponding projected database.
Constraint                                         Convertible anti-monotone  Convertible monotone  Strongly convertible
avg(S) ≤ v, ≥ v                                    Yes                        Yes                   Yes
median(S) ≤ v, ≥ v                                 Yes                        Yes                   Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)    Yes                        No                    No
sum(S) ≤ v (items could be of any value, v ≤ 0)    No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≥ 0)    No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≤ 0)    Yes                        No                    No
A Classification of Constraints
(Venn diagram relating the constraint classes: anti-monotone, monotone, succinct, convertible anti-monotone, convertible monotone, and strongly convertible.)
Frequent pattern mining is an important task in data mining. Scalable frequent pattern mining methods:
- Apriori (candidate generation & test)
- Projection-based (FP-growth, CLOSET+, ...)
- Vertical format approach (CHARM, ...)
Also covered: mining a variety of rules and interesting patterns; constraint-based mining; mining sequential and structured patterns; extensions and applications.
Patterns that allow limited faults (insertion, deletion, mutation); surprising, novel, concise patterns, e.g., DNA sequence analysis and bio-pattern classification; invisible data mining.
Application exploration
(Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
(Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
(Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
(Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
H. Toivonen. Sampling large databases for association rules. VLDB'96.
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
G. Liu, H. Lu, W. Lou, and J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI, 1997.
Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM'02.
C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD'02.
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.
R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. KDD'99.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE, 2003.
Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI, 1997.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.
K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS, 2000.
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE, 2003.
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
J. Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI, 1997.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.
J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI, 2002.
L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemsets. SDM'03.
X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.
A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining, 2003.
A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE'98.
H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.
Skip
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

1st scan: find the frequent items: A, B, C, D, E
2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE (potential max-patterns)
Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.
R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
Min_sup = 2
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

F-list: d-a-f-e-c. Divide the search space: patterns having d; patterns having d but no a; etc. Every transaction having d also has c, f, a, so cfad is a frequent closed pattern.
J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
Skip
Sub-itemset pruning: if Y ⊇ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned. Hybrid tree projection.
Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at the higher levels. Efficient subset checking.
tid-list: the list of transaction ids containing an itemset. t(X) = t(Y): X and Y always happen together; t(X) ⊂ t(Y): a transaction having X always has Y. Diffsets keep track only of the differences of tids: if t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then Diffset(XY, X) = {T2}.
Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02).
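A tiny sketch of the vertical layout and diffsets (illustrative three-transaction DB):

    def tidlists(transactions):
        """Vertical data format: map each item to the set of tids containing it."""
        t = {}
        for tid, items in transactions.items():
            for i in items:
                t.setdefault(i, set()).add(tid)
        return t

    db = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}
    t = tidlists(db)
    t_ab = t["a"] & t["b"]     # support of {a,b} = |t(a) ∩ t(b)| = 2
    diffset = t["a"] - t_ab    # Diffset(ab, a) = {2}: often far smaller than t_ab
    print(len(t_ab), diffset)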
AFOPT (Liu et al. @KDD'03): a push-right method for mining condensed frequent pattern (CFP) trees. Carpenter (Pan et al. @KDD'03): mines data sets with a small number of rows but numerous columns; constructs a row-enumeration tree for efficient mining.
Skip
Some items (e.g., diamonds) may occur rarely but are valuable, so customized supmin specification and application are needed. It is hard to specify supmin, but top-k with lengthmin is more desirable. Dynamically raise supmin during FP-tree construction and mining, and select the most promising paths to mine.