Documente Academic
Documente Profesional
Documente Cultură
Dr.M. Venkatesan
Assistant Professor
Department of Computer Science and
Engineering
NITK
Subset function
Transaction: 1 2 3 5 6
3,6,9
1,4,7
2,5,8
1+2356
13+56 234
567
145 345 356 367
136 368
357
12+356
689
124
457 125 159
458
• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
Market-Basket transactions
Example of Association Rules
TID Items
{Diaper} {Beer},
1 Bread, Milk {Milk, Bread} {Eggs,Coke},
2 Bread, Diaper, Beer, Eggs {Beer, Bread} {Milk},
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer Implication means co-occurrence,
5 Bread, Milk, Diaper, Coke not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset TID Items
• An itemset that contains k items 1 Bread, Milk
• Support count () 2 Bread, Diaper, Beer, Eggs
– Frequency of occurrence of an itemset 3 Milk, Diaper, Beer, Coke
– E.g. ({Milk, Bread,Diaper}) = 2 4 Bread, Milk, Diaper, Beer
• Support 5 Bread, Milk, Diaper, Coke
– Fraction of transactions that contain an
itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater
than or equal to a minsup threshold
Definition: Association Rule
Association Rule
TID Items
– An implication expression of the form
1 Bread, Milk
X Y, where X and Y are itemsets
2 Bread, Diaper, Beer, Eggs
– Example:
3 Milk, Diaper, Beer, Coke
{Milk, Diaper} {Beer}
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain Example:
both X and Y {Milk, Diaper } Beer
– Confidence (c)
Measures how often items in Y (Milk , Diaper, Beer) 2
s 0.4
appear in transactions that |T| 5
contain X
(Milk, Diaper, Beer) 2
c 0.67
(Milk, Diaper) 3
Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
frequent
infrequent
AB AC AD AE BC BD BE CD CE DE
infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
24
ABCDE
Illustrating Apriori Principle
Found to be
Infrequent
Pruned
supersets
Closed Itemset: support of all parents are not equal to the support of the
itemset.
Maximal Itemset: all parents of that itemset must be infrequent.
Itemset {c} is closed as support of parents (supersets) {A C}:2, {B C}:2, {C D}:1, {C E}:2
not equal support of {c}:3. And the same for {A C}, {B E} & {B C E}.
Itemset {A C} is maximal as all parents (supersets) {A B C}, {A C D}, {A C E} are
infrequent. And the same for {B C E}.
Mining Frequent Patterns Without Candidate
Generation
• Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more
frequently occurring, the more likely to be shared
– Never be larger than the original database (not count
node-links and the count field)
– For Connect-4 DB, compression ratio could be over 100
{}
Header Table
f:4 c:1 Conditional pattern bases
Item frequency head
f 4 item cond. pattern base
c 4 c:3 b:1 b:1 c f:3
a 3
b 3 a:3 p:1 a fc:3
m 3 b fca:1, f:1, c:1
p 3 m:2 b:1 m fca:2, fcab:1
• Single-dimensional rules:
buys(X, “milk”) buys(X, “bread”)
• Multi-dimensional rules: 2 dimensions or predicates
– Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”) occupation(X,“student”) buys(X, “coke”)
– hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”) buys(X, “popcorn”) buys(X, “coke”)
• Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
• Quantitative Attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
September 13, 2019 Data Mining: Concepts and Techniques 36
Mining Other Interesting Patterns
value is greater than one, and the observed value of the slot
(game, video) = 4,000, which is less than the expected value
4,500, buying game and buying video are
negatively correlated.
September 13, 2019 Data Mining: Concepts and Techniques 45
All_conf and Cosine measures
P ( A B )
lift
P ( A) P ( B ) Milk No Milk Sum (row)
Coffee m, c ~m, c c
sup(X ) No Coffee m, ~c ~m, ~c ~c
all _ conf
max_ item _ sup(X ) Sum(col.) m ~m