Applications
Rule form:
  Prediction (Boolean variables) => Prediction (Boolean variables) [support, confidence]
Examples:
  computer => antivirus_software [support = 2%, confidence = 60%]
  buys(x, "computer") => buys(x, "antivirus_software") [support = 0.5%, confidence = 60%]
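The support and confidence attached to the rule above can be computed directly. A minimal sketch, using an invented five-transaction database (the item names are assumptions):

```python
# Computing support and confidence of the rule
# buys(x, "computer") => buys(x, "antivirus_software")
# over a made-up transaction database.
transactions = [
    {"computer", "antivirus_software"},
    {"computer"},
    {"computer", "antivirus_software", "printer"},
    {"printer"},
    {"computer", "mouse"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A and C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

s = support({"computer", "antivirus_software"}, transactions)  # 2 of 5
c = confidence({"computer"}, {"antivirus_software"}, transactions)
```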
CSE@HCST 9/5/2019
Association Rule: Basic Concepts
Association rule performance measures
  Confidence
  Support
  Minimum support threshold
  Minimum confidence threshold
Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy napkins, and the overlap of customers who buy both]
Find all the rules X & Y => Z with minimum confidence and support:
  support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z
Shopping baskets
Each item has a Boolean variable representing the
presence or absence of that item.
Each basket can be represented by a Boolean vector
of values assigned to these variables.
Patterns can be identified from these Boolean vectors.
Patterns can be represented by association rules.
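The Boolean-vector representation described above can be sketched as follows (the item universe and baskets are invented):

```python
# Each basket becomes a Boolean vector over a fixed item universe:
# True if the item is present, False if absent.
items = ["beer", "napkin", "milk", "bread"]

baskets = [
    {"beer", "napkin"},
    {"milk", "bread"},
    {"beer", "milk", "bread"},
]

# One Boolean vector per basket, in the order given by `items`.
vectors = [[item in b for item in items] for b in baskets]
```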
Association Rule Mining: A Road Map
Apriori Algorithm
Mining Association Rules—An Example
Mining Frequent Itemsets: the Key Step
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
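The pseudo-code above translates almost line-for-line into Python. A sketch, assuming transactions are sets of items and min_support is an absolute count (the join here uses set union rather than the classic prefix-matching join, so the prune step does the filtering):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori following the pseudo-code above."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Join step: build (k+1)-candidates from pairs in Lk ...
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1:
                    # ... prune step: every k-subset must be frequent.
                    if all(frozenset(s) in Lk for s in combinations(u, k)):
                        candidates.add(u)
        # Count all surviving candidates in one scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

# Usage on the example database from the next slide:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_support=2)
```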
CSE@HCST 9/5/2019
The Apriori Algorithm — Example

Database D:
  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D to count C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (min support = 2): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 from L1: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D to count C2: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 from L2: {2 3 5}
Scan D to count C3, giving L3: {2 3 5}: 2
How to Generate Candidates?
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
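The join and prune steps can be sketched together as one function. This is a simplification that joins by set union rather than by matching (k-2)-prefixes, so the prune loop does all the filtering; the example L3 used below is an assumption chosen to match the candidate-generation example that follows:

```python
from itertools import combinations

def generate_candidates(Lk_minus_1, k):
    """Join Lk-1 with itself, then prune candidates that have
    an infrequent (k-1)-subset (the step shown above)."""
    # Join step (by union; classic Apriori joins on shared prefixes)
    candidates = set()
    for a in Lk_minus_1:
        for b in Lk_minus_1:
            u = a | b
            if len(u) == k:
                candidates.add(u)
    # Prune step: keep c only if every (k-1)-subset is in Lk-1
    pruned = set()
    for c in candidates:
        if all(frozenset(s) in Lk_minus_1 for s in combinations(c, k - 1)):
            pruned.add(c)
    return pruned

# Assumed L3 matching the textbook example; pruning removes acde
# (because ade is not in L3), leaving C4 = {abcd}.
L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = generate_candidates(L3, 4)
```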
How to Count Supports of Candidates?
Method: candidate itemsets are stored in a hash-tree.
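A hash-tree makes candidate lookup sublinear; as a simplified stand-in, this sketch counts candidate supports with a plain dictionary in one scan of the example database D from the earlier example slide:

```python
# Database D and candidate set C2 from the worked Apriori example.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(s) for s in ({1, 2}, {1, 3}, {1, 5},
                             {2, 3}, {2, 5}, {3, 5})]

# One scan of D: for each transaction, bump every contained candidate.
support = {c: 0 for c in C2}
for t in D:
    for c in C2:
        if c <= t:  # candidate contained in transaction
            support[c] += 1
```

A real hash-tree would avoid testing every candidate against every transaction; this dictionary version has the same counts but linear lookup cost per transaction.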
Example of Generating Candidates
Join step: self-joining L3 yields the candidates abcd and acde.
Pruning:
  acde is removed because ade is not in L3
C4 = {abcd}
Methods to Improve Apriori’s Efficiency
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
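Transaction reduction can be sketched as a filter over the database; the data and L2 below are illustrative:

```python
from itertools import combinations

# Illustrative database and frequent 2-itemsets.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {4, 6}]
L2 = {frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5})}

def useful(t, Lk, k=2):
    """Keep t only if it contains at least one frequent k-itemset;
    otherwise it cannot support any (k+1)-candidate."""
    return any(frozenset(s) in Lk for s in combinations(t, k))

# {4, 6} contains no frequent 2-itemset, so it is dropped
# from all subsequent scans.
reduced = [t for t in D if useful(t, L2)]
```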
Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness of the result.
Mining Frequent Patterns Without Candidate Generation [not in the syllabus]
Mining various kinds of association rules
Multiple-Level Association Rules
[Concept hierarchy: Food -> milk, bread; milk -> skim, 2%; bread -> wheat, white; brands such as Fraser and Sunset at the lowest level]
  Items often form a hierarchy.
  Items at the lower levels are expected to have lower support.
  Rules regarding itemsets at appropriate levels could be quite useful.
  The transaction database can be encoded based on dimensions and levels.
  We can explore shared multi-level mining.
Multi-level Association
Uniform Support
[Figure: level 1, min_sup = 5%: Milk, support = 10%]
Reduced Support
[Figure: level 1, min_sup = 5%: Milk, support = 10%]
Multi-level Association: Redundancy Filtering
Multi-Dimensional Association
Single-dimensional rules (intra-dimension association rules):
  buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules:
  Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
  Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
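One way to mine such inter-dimension rules is to treat each (predicate, value) pair as a single "item", after which ordinary frequent-itemset mining applies. A sketch with invented records:

```python
# Invented multi-dimensional records; each column is a predicate.
records = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "teacher", "buys": "tea"},
]

# Convert each record into a set of (predicate, value) items, so
# standard frequent-itemset algorithms (e.g. Apriori) can be reused.
transactions = [set(r.items()) for r in records]

# Support count of the itemset behind the rule
# age(X,"19-25") ∧ occupation(X,"student") => buys(X,"coke")
target = {("age", "19-25"), ("occupation", "student"), ("buys", "coke")}
count = sum(1 for t in transactions if target <= t)
```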
Multi-Dimensional Association
Categorical attributes: a finite number of possible values, with no ordering among the values.
Quantitative attributes: numeric, with an implicit ordering among the values.
Techniques for Mining MD Associations
Static Discretization of Quantitative Attributes
Quantitative Association Rules
Interestingness Measurements
Objective measures: two popular measures are
  support
  confidence
Subjective measures: a rule (pattern) is interesting if
  it is unexpected (surprising to the user), and/or
  actionable (the user can do something with it)
Criticism of Support and Confidence

Example: 8 transactions over items X, Y, Z (1 = present, 0 = absent):
  X  1 1 1 1 0 0 0 0
  Y  1 1 0 0 0 0 0 0
  Z  0 1 1 1 1 1 1 1

  Rule      Support   Confidence
  X => Y    25%       50%
  X => Z    37.5%     75%

X and Y are positively correlated, while X and Z are negatively correlated, yet the support and confidence of X => Z dominate. We need a measure of dependent or correlated events:

  corr(A, B) = P(A ∪ B) / (P(A) P(B))

P(B|A) / P(B) is also called the lift of the rule A => B.
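The numbers in this example can be reproduced directly:

```python
# The 8 transactions from the example, one row per item.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]
n = len(X)

def p(*rows):
    """Probability that all given items occur in the same transaction."""
    return sum(all(vals) for vals in zip(*rows)) / n

# Support and confidence of X => Y and X => Z
sup_xy, conf_xy = p(X, Y), p(X, Y) / p(X)   # 25%, 50%
sup_xz, conf_xz = p(X, Z), p(X, Z) / p(X)   # 37.5%, 75%

# corr > 1: positively correlated; corr < 1: negatively correlated
corr_xy = p(X, Y) / (p(X) * p(Y))   # 2.0
corr_xz = p(X, Z) / (p(X) * p(Z))   # about 0.857
```

Despite the higher support and confidence of X => Z, its correlation (lift) is below 1, confirming that X and Z are negatively correlated.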
Other Interestingness Measures: Interest