
FREQUENT PATTERNS & ASSOCIATION RULES

Basic Concepts
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
Motivation: finding inherent regularities in data, for example:
What products were often purchased together? What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?

Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Basic Concepts of Association Rule Mining


Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples (rule form: Body => Head [support, confidence]):
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, A) [1%, 75%]

Association Model: Problem Statement


I = {i1, i2, ..., in}: a set of items.
J = P(I): the set of all subsets of I; the elements of J are called itemsets.
Transaction T: a subset of I.
Database: a set of transactions.
An association rule is an implication of the form X => Y, where X and Y are disjoint subsets of I (elements of J).
Problem: find all rules whose support and confidence are greater than a user-specified minimum support and minimum confidence.
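As a concrete reading of this model, here is a minimal sketch in Python (the item names are illustrative, not from the slides):

I = {"beer", "diapers", "milk", "bread"}          # the set of items I
T = {"beer", "diapers"}                           # a transaction T, a subset of I
DB = [T, {"milk", "bread"}, {"beer", "milk"}]     # a database: a set of transactions
X, Y = {"diapers"}, {"beer"}                      # a rule X => Y: X, Y disjoint subsets of I
assert X <= I and Y <= I and not (X & Y)          # well-formedness of the rule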

Rule Measures: Support & Confidence

Simple Formulas:
Confidence(A => B) = #tuples containing both A and B / #tuples containing A = P(B|A) = P(A ∪ B) / P(A)
Support(A => B) = #tuples containing both A and B / total number of tuples = P(A ∪ B)

What do they actually mean?


Find all rules X ^ Y => Z with minimum confidence and support, where:
support, s: the probability that a transaction contains {X, Y, Z}
confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z

Support & Confidence: An Example


TransactionID   Items bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Let the minimum support be 50% and the minimum confidence be 50%. Then we obtain:
A => C (support 50%, confidence 66.6%)
C => A (support 50%, confidence 100%)
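These numbers can be checked by computing support and confidence directly from the transactions. A minimal sketch in Python (the helper names are ours, not from the slides):

DB = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in DB) / len(DB)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))        # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}))   # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}))   # 1.0   -> C => A has 100% confidence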

Two Parameters
Confidence (how true is the rule)
The rule X ^ Y => Z has 90% confidence: 90% of customers who bought X and Y also bought Z.

Support (how useful is the rule)


Useful rules should have some minimum transaction support.

The Apriori Algorithm: Basics


The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. Key concepts:
Frequent itemsets: the sets of items that have minimum support (the set of frequent k-itemsets is denoted Lk).
Apriori property: any subset of a frequent itemset must itself be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

The Apriori Algorithm


Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset

i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

Use the frequent itemsets to generate association rules.



The Apriori Algorithm: Pseudo-code


Join step: Ck is generated by joining Lk-1 with itself. Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
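The pseudo-code above translates directly into a short, runnable Python sketch (function and variable names are ours; the join and prune steps are combined in the candidate-generation loop):

from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # L1: count individual items and keep those meeting minimum support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: merge pairs of frequent (k-1)-itemsets whose union has size k
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must itself be frequent
                    if all(frozenset(s) in Lk for s in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan the database, counting candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= set(t):
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(Lk)
        k += 1
    return frequent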

The Apriori Algorithm: Example 1


TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Consider a database D consisting of the 9 transactions above. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; association rules are then generated from them using the minimum support and minimum confidence.
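Running the apriori sketch given with the pseudo-code on this database reproduces the L1, L2 and L3 derived step by step below:

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_support_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # e.g. ['I1'] 6 ... ['I1', 'I2', 'I5'] 2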

Step 1: Generating 1-itemset Frequent Pattern


In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm scans D to obtain the support count of each candidate:

C1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Comparing each candidate support count with the minimum support count, the set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support (here, all of them):

L1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2


Step 2: Generating 2-itemset Frequent Pattern


Generate C2 candidates from L1 (L1 Join L1):

C2:
Itemset
{I1, I2}
{I1, I3}
{I1, I4}
{I1, I5}
{I2, I3}
{I2, I4}
{I2, I5}
{I3, I4}
{I3, I5}
{I4, I5}

Scan D for the count of each candidate:

C2:
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

Compare candidate support counts with the minimum support count:

L2:
Itemset     Sup. count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2

Step 2: Generating 2-itemset Frequent Pattern [Cont.]


To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. Note: we haven't used the Apriori property for pruning yet.


Step 3: Generating 3-itemset Frequent Pattern


C3 (generated from L2 by the join and prune steps described below):
Itemset
{I1, I2, I3}
{I1, I2, I5}

Scan D for the count of each candidate:

C3:
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2

Compare candidate support counts with the minimum support count:

L3:
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property. To find C3, we first compute L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. The Join step is now complete, and the Prune step is used to reduce the size of C3; pruning avoids the heavy computation a large Ck would cause.

Step 3: Generating 3-itemset Frequent Pattern [Cont.]


Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. How? For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3. Now take {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property; thus we remove {I2, I3, I5} from C3. After checking every member of the join result this way, C3 = {{I1, I2, I3}, {I1, I2, I5}} (see the sketch below). Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
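The prune check just performed is compact in code; a sketch using the L2 from Step 2 (variable names are ours):

from itertools import combinations

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
join_result = [{"I1", "I2", "I3"}, {"I1", "I2", "I5"}, {"I1", "I3", "I5"},
               {"I2", "I3", "I4"}, {"I2", "I3", "I5"}, {"I2", "I4", "I5"}]
# Keep a candidate only if every 2-item subset is frequent (Apriori property)
C3 = [c for c in join_result
      if all(frozenset(s) in L2 for s in combinations(c, 2))]
print(C3)   # [{'I1', 'I2', 'I3'}, {'I1', 'I2', 'I5'}] (element order may vary)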


Step 4: Generating 4-itemset Frequent Pattern


The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join yields {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm. What's next? These frequent itemsets will be used to generate strong association rules (rules satisfying both minimum support and minimum confidence).


Step 5: Generating Association Rules from Frequent Itemsets


Procedure:
For each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back To Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let's take l = {I1, I2, I5}. Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
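The procedure above translates to a few lines of Python; a minimal sketch that consumes the {itemset: support_count} map returned by the apriori sketch (names are ours):

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (lhs, rhs, confidence) for every strong rule lhs => rhs."""
    for l, sc_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                 # all nonempty proper subsets s of l
            for s in map(frozenset, combinations(l, r)):
                conf = sc_l / frequent[s]          # support_count(l) / support_count(s)
                if conf >= min_conf:
                    yield s, l - s, conf

For l = {I1, I2, I5} and min_conf = 0.7, this reproduces the selections worked out on the next two slides: R2, R3 and R6 are kept, while R1, R4 and R5 are rejected.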


Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1: I1 ^ I2 => I5, confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ^ I5 => I2, confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ^ I5 => I1, confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]


R4: I1 => I2 ^ I5, confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 => I1 ^ I5, confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 => I1 ^ I2, confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.

The Apriori Algorithm: Example 2


Database TDB (minimum support = 2 transactions, i.e. 50%):

Tid   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

1st scan. C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1):
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan. C2 with counts:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3:
Itemset
{B, C, E}

3rd scan. L3:
Itemset      sup
{B, C, E}    2

Generating rules: an example


Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively. With minconf = 60%, these generate the following association rules:

2,3 => 4, confidence = 100%
2,4 => 3, confidence = 100%
3,4 => 2, confidence = 67%
2 => 3,4, confidence = 67%
3 => 2,4, confidence = 67%
4 => 2,3, confidence = 67%
All rules are strong rules.
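These confidences can be reproduced with the generate_rules sketch from Step 5, feeding it the stated supports (fractions work in place of raw counts, since confidence is a ratio of supports):

frequent = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.60):
    if lhs | rhs == frozenset({2, 3, 4}):      # keep only rules derived from {2,3,4}
        print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0%}")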

Generating rules: summary


To recap: in order to evaluate a rule A => B, we only need support(A ∪ B) and support(A). All the information required to compute confidence was already recorded during itemset generation, so there is no need to scan the data T again. This step is therefore not as time-consuming as frequent itemset generation.

