All homework will be team homework but all team members must know the material.
Submit all homework to both instructor and graduate assistant Md Ali (ma03901n@pace.edu).
---------------------------**************--------------------------
Complete Association Rules exercise 4 (end of chapter, page 159 in textbook) manually.
A local retailer has a database that stores 10,000 transactions of last summer. After
analyzing the data, a data science team has identified the following statistics:
{battery} appears in 6,000 transactions.
{sunscreen} appears in 5,000 transactions.
{sandals} appears in 4,000 transactions.
{bowls} appears in 2,000 transactions.
{battery,sunscreen} appears in 1,500 transactions.
{battery,sandals} appears in 1,000 transactions.
{battery,bowls} appears in 250 transactions.
{battery,sunscreen,sandals} appears in 600 transactions.
Answer the following questions:
1. What are the support values of the preceding itemsets?
2. Assuming the minimum support is 0.05, which item sets are considered frequent?
An itemset is frequent when its support is greater than or equal to the minimum support. With a
minimum support of 0.05, the itemsets {battery}, {sunscreen}, {sandals}, {bowls}, {battery,sunscreen},
{battery,sandals}, and {battery,sunscreen,sandals} are frequent. Only {battery,bowls}, with support
250/10,000 = 0.025, is not a frequent itemset.
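The support values and the frequent-itemset check can be reproduced with a few lines of base R. This is a minimal sketch: the counts are copied from the exercise statement, and the variable names (counts, support, frequent) are chosen here for illustration.

```r
# Transaction counts copied from the exercise statement.
n_transactions <- 10000
counts <- c(
  "{battery}"                   = 6000,
  "{sunscreen}"                 = 5000,
  "{sandals}"                   = 4000,
  "{bowls}"                     = 2000,
  "{battery,sunscreen}"         = 1500,
  "{battery,sandals}"           = 1000,
  "{battery,bowls}"             = 250,
  "{battery,sunscreen,sandals}" = 600
)

# Support is the fraction of all transactions that contain the itemset.
support <- counts / n_transactions

# An itemset is frequent when its support reaches the minimum support.
min_support <- 0.05
frequent <- names(support)[support >= min_support]

print(support)
print(frequent)
```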
4. List all the candidate rules that can be formed from the statistics. Which rules are considered
interesting at the minimum confidence 0.25? Out of these interesting rules, which rule is
considered the most useful (that is, least coincidental)?
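Support:
Support(x) = (number of transactions containing x)/(total number of transactions)
Support ({battery}) = 6,000/10,000 = 0.6
Support ({sunscreen}) = 5,000/10,000 = 0.5
Support ({sandals}) = 4,000/10,000 = 0.4
Support ({bowls}) = 2,000/10,000 = 0.2
Support ({battery,sunscreen}) = 1,500/10,000 = 0.15
Support ({battery,sandals}) = 1,000/10,000 = 0.1
Support ({battery,bowls}) = 250/10,000 = 0.025
Support ({battery,sunscreen,sandals}) = 600/10,000 = 0.06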
Confidence:
Confidence(x→y)=support(x,y)/support(x)
Confidence (battery → sunscreen) = support (battery, sunscreen)/ support (battery) = 0.15/0.6= 0.25
Confidence (sunscreen → battery) = support (battery, sunscreen)/ support (sunscreen) = 0.15/0.5=
0.3
Confidence (battery → sandals) = support (battery, sandals) / support (battery) = 0.1/0.6= 0.17
Confidence (sandals → battery) = support (battery, sandals) / support (sandals) = 0.1/0.4=0.25
Confidence (battery → bowls) = support (battery, bowls)/ support (battery) = 0.025/0.6= 0.042
Confidence (bowls → battery) = support (battery, bowls)/ support (bowls) = 0.025/0.2= 0.125
Confidence ({battery}→{sunscreen, sandals }) = support (battery, sunscreen, sandals )/ support
(battery) = 0.06/0.6=0.1
Confidence ({sunscreen}→{battery, sandals}) = support (battery, sunscreen, sandals )/ support
(sunscreen) = 0.06/0.5=0.12
Confidence ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals )/ support
(battery, sandals) = 0.06/0.1=0.6
Confidence ({sandals}→{battery, sunscreen }) = support (battery, sunscreen, sandals )/ support
(sandals) = 0.06/ 0.4=0.15
Confidence ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals )/ support
(battery, sunscreen) = 0.06/0.15= 0.4
So, with a minimum confidence of 0.25, the interesting rules (confidence greater than or equal to
0.25) are
Confidence (battery → sunscreen) = support (battery, sunscreen)/ support (battery) =
0.15/0.6= 0.25 [There is a 25% chance that a customer who buys battery also buys
sunscreen.]
Confidence (sunscreen → battery) = support (battery, sunscreen)/ support (sunscreen) =
0.15/0.5= 0.3 [There is a 30% chance that a customer who buys sunscreen also buys
battery.]
Confidence (sandals → battery) = support (battery, sandals) / support (sandals) =
0.1/0.4=0.25 [There is a 25% chance that a customer who buys sandals also buys
battery.]
Confidence ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals )/
support (battery, sandals) = 0.06/0.1=0.6 [There is a 60% chance that a customer who buys
battery and sandals together also buys sunscreen.]
Confidence ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals )/
support (battery, sunscreen) = 0.06/0.15= 0.4 [There is a 40% chance that a customer who buys
battery and sunscreen together also buys sandals.]
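The confidence values above can be checked in base R. The sketch below works from the raw counts rather than the supports (the ratio is the same, and integer counts avoid rounding surprises at the 0.25 threshold). The data frame layout and column names are illustrative choices, not part of the exercise.

```r
# Transaction counts from the exercise statement, keyed by itemset.
counts <- c(
  "battery" = 6000, "sunscreen" = 5000, "sandals" = 4000, "bowls" = 2000,
  "battery,sunscreen" = 1500, "battery,sandals" = 1000,
  "battery,bowls" = 250, "battery,sunscreen,sandals" = 600
)

# Candidate rules: antecedent (lhs), consequent (rhs), and the itemset lhs + rhs.
rules <- data.frame(
  lhs  = c("battery", "sunscreen", "battery", "sandals", "battery", "bowls",
           "battery", "sunscreen", "sandals",
           "battery,sandals", "battery,sunscreen"),
  rhs  = c("sunscreen", "battery", "sandals", "battery", "bowls", "battery",
           "sunscreen,sandals", "battery,sandals", "battery,sunscreen",
           "sunscreen", "sandals"),
  both = c(rep("battery,sunscreen", 2), rep("battery,sandals", 2),
           rep("battery,bowls", 2), rep("battery,sunscreen,sandals", 5)),
  stringsAsFactors = FALSE
)

# confidence(lhs -> rhs) = count(lhs + rhs) / count(lhs)
rules$confidence <- unname(counts[rules$both] / counts[rules$lhs])

# Keep the rules that reach the minimum confidence.
min_confidence <- 0.25
interesting <- rules[rules$confidence >= min_confidence, ]
print(interesting[, c("lhs", "rhs", "confidence")])
```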
Lift
Lift(x → y) = support(x, y)/{support(x)*support(y)}
Lift (battery → sunscreen) = support (battery, sunscreen)/ {support (battery)* support (sunscreen)}
= 0.15/(0.6*0.5)= 0.5
Lift (sandals → battery) = support (battery, sandals) / {support (battery)* support (sandals)} =
0.1/(0.6*0.4)= 0.42
Lift ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals )/ {support (battery,
sunscreen)* support (sandals)} = 0.06/(0.15*0.4)= 1
Lift ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals )/ {support (battery,
sandals)* support (sunscreen)} = 0.06/(0.1*0.5)= 1.2
Therefore, ({battery, sandals}→{sunscreen}), the only rule above with a lift greater than 1, has a
stronger association than the others.
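The lift values can be verified with a short base R sketch. The helper name lift() and the rule labels are illustrative. The supports are the ones given by the exercise counts divided by 10,000.

```r
# Supports of the itemsets used in the lift calculations.
support <- c(
  "battery" = 0.6, "sunscreen" = 0.5, "sandals" = 0.4,
  "battery,sunscreen" = 0.15, "battery,sandals" = 0.1,
  "battery,sunscreen,sandals" = 0.06
)

# lift(x -> y) = support(x, y) / (support(x) * support(y))
lift <- function(lhs, rhs, both) {
  unname(support[both] / (support[lhs] * support[rhs]))
}

lifts <- c(
  "battery -> sunscreen" =
    lift("battery", "sunscreen", "battery,sunscreen"),
  "sandals -> battery" =
    lift("sandals", "battery", "battery,sandals"),
  "battery,sunscreen -> sandals" =
    lift("battery,sunscreen", "sandals", "battery,sunscreen,sandals"),
  "battery,sandals -> sunscreen" =
    lift("battery,sandals", "sunscreen", "battery,sunscreen,sandals")
)

print(round(lifts, 2))

# The rule with the largest lift has the strongest association.
print(names(which.max(lifts)))
```

A lift below 1 means the two sides occur together less often than independence would predict, a lift of 1 means they look independent, and a lift above 1 indicates a positive association.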
Leverage
Leverage(x → y)=support(x, y)-{support(x)*support(y)}
Leverage (battery → sunscreen) = support (battery, sunscreen) – {support (battery)* support (
sunscreen)} = 0.15 - (0.6*0.5)= -0.15
Leverage (sandals → battery) = support (battery, sandals) – {support (battery)* support (sandals)} =
0.1 - (0.6*0.4)= - 0.14
Leverage (battery → bowls) = support (battery, bowls) – {support (battery)* support (bowls)} = 0.025
- (0.6*0.2)= - 0.095
Leverage ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals ) – {support
(battery, sunscreen)* support (sandals) }= 0.06 - (0.15*0.4)= 0
Leverage ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals ) – {support
(battery, sandals)* support (sunscreen) }= 0.06 - (0.1*0.5)= 0.01
This again confirms that ({battery, sandals}→{sunscreen}), the only rule with a positive leverage,
has a stronger association than the others.
So, based on the lift and leverage of the candidate rules, we can conclude that the rule
({battery, sandals}→{sunscreen}) is the most useful.
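The leverage values can be checked the same way. This is a minimal base R sketch. The helper name leverage() and the rule labels are illustrative, and results are rounded only for display.

```r
# Supports of the itemsets used in the leverage calculations.
support <- c(
  "battery" = 0.6, "sunscreen" = 0.5, "sandals" = 0.4, "bowls" = 0.2,
  "battery,sunscreen" = 0.15, "battery,sandals" = 0.1,
  "battery,bowls" = 0.025, "battery,sunscreen,sandals" = 0.06
)

# leverage(x -> y) = support(x, y) - support(x) * support(y)
leverage <- function(lhs, rhs, both) {
  unname(support[both] - support[lhs] * support[rhs])
}

leverages <- c(
  "battery -> sunscreen" =
    leverage("battery", "sunscreen", "battery,sunscreen"),
  "sandals -> battery" =
    leverage("sandals", "battery", "battery,sandals"),
  "battery -> bowls" =
    leverage("battery", "bowls", "battery,bowls"),
  "battery,sunscreen -> sandals" =
    leverage("battery,sunscreen", "sandals", "battery,sunscreen,sandals"),
  "battery,sandals -> sunscreen" =
    leverage("battery,sandals", "sunscreen", "battery,sunscreen,sandals")
)

print(round(leverages, 3))

# The rule with the largest leverage has the strongest association.
print(names(which.max(leverages)))
```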
Important Notes: Confidence is able to identify trustworthy rules, but it cannot tell whether a rule
is coincidental. A high-confidence rule can sometimes be misleading because confidence does not
consider support of the itemset in the rule consequent. Measures such as lift and leverage not only
ensure interesting rules are identified but also filter out the coincidental rules.
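The note can be made concrete with the rule battery → sunscreen from the exercise. The sketch below (variable names are illustrative) compares the rule's confidence with the baseline share of transactions that contain sunscreen. The ratio of the two is the lift.

```r
# Counts from the exercise statement.
n_transactions <- 10000
n_battery <- 6000
n_sunscreen <- 5000
n_battery_sunscreen <- 1500

# Confidence of battery -> sunscreen passes the 0.25 threshold ...
confidence <- n_battery_sunscreen / n_battery

# ... but sunscreen appears in half of all transactions anyway.
baseline <- n_sunscreen / n_transactions

# Lift is confidence divided by the support of the consequent.
lift <- confidence / baseline

cat("confidence:", confidence, "\n")
cat("baseline support of sunscreen:", baseline, "\n")
cat("lift:", lift, "\n")
```

Because the confidence (0.25) is lower than the baseline (0.5), a transaction that contains battery is less likely than average to contain sunscreen, even though the rule meets the confidence threshold.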
-----------------------*************------------------------------
Given the following 10 grocery store transactions, use appropriate association rule thresholds to
find a few interesting rules both by hand and by using R.
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
4. beer, bread, cheese, ham
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
8. soda, cheese, bread, ham
9. coffee, hamburger meat
10. eggs, diapers, beer
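For the by-hand part, the counts can be tallied in base R without any package. The sketch below is an illustration, not the original solution: it keeps "potato chips" and "hamburger meat" as single items (an assumption about how the list should be read), and the helper names itemset_support() and rule_stats() are made up for this example.

```r
# The 10 transactions, with multi-word items kept whole.
transactions <- list(
  c("beer", "diapers"),
  c("soda", "potato chips", "hamburger meat", "milk", "eggs"),
  c("coffee", "eggs"),
  c("beer", "bread", "cheese", "ham"),
  c("diapers", "beer", "potato chips"),
  c("cheese", "ham", "beer"),
  c("ham", "cheese", "bread", "coffee", "milk"),
  c("soda", "cheese", "bread", "ham"),
  c("coffee", "hamburger meat"),
  c("eggs", "diapers", "beer")
)

# Support of an itemset: share of transactions that contain every item in it.
itemset_support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Support of each single item, most frequent first.
item_support <- sort(table(unlist(transactions)) / length(transactions),
                     decreasing = TRUE)
print(item_support)

# Support, confidence, and lift of the rule lhs -> rhs.
rule_stats <- function(lhs, rhs) {
  s_both <- itemset_support(c(lhs, rhs))
  conf <- s_both / itemset_support(lhs)
  c(support = s_both, confidence = conf, lift = conf / itemset_support(rhs))
}

print(rule_stats("diapers", "beer"))
print(rule_stats("cheese", "ham"))
print(rule_stats(c("bread", "cheese"), "ham"))
```

With a minimum support of 0.3 and a minimum confidence of 0.8, all three rules printed here qualify: every transaction with diapers also has beer, and cheese and ham always appear together.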
R Code:
library('arules')
library('arulesViz')
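# One string per transaction, items separated by commas.
# Note: "potato chips" and "hamburger meat" are entered as two items each
# (potato, chips, hamburger, meat), which is why the output reports 13 items.
purchases <- c("beer,diapers",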
"soda,potato,chips,hamburger,meat,milk,eggs",
"coffee,eggs",
"beer,bread,cheese,ham",
"diapers,beer,potato,chips",
"cheese,ham,beer",
"ham,cheese,bread,coffee,milk",
"soda,cheese,bread,ham",
"coffee,hamburger,meat",
"eggs,diapers,beer")
Console Output
#################
# Extra Exercise
#################
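library('arules')
# Assumed step (not shown in the original listing): split each string on commas
# and coerce the resulting list to an arules transactions object.
trans <- as(strsplit(purchases, ","), "transactions")
inspect(trans)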
## items
## 1 {beer,diapers}
## 2 {chips,eggs,hamburger,meat,milk,potato,soda}
## 3 {coffee,eggs}
## 4 {beer,bread,cheese,ham}
## 5 {beer,chips,diapers,potato}
## 6 {beer,cheese,ham}
## 7 {bread,cheese,coffee,ham,milk}
## 8 {bread,cheese,ham,soda}
## 9 {coffee,hamburger,meat}
## 10 {beer,diapers,eggs}
summary(trans)
# rules of length 2 (lhs + rhs) with support >= 0.3 and the default confidence 0.8
items2 <- apriori(trans, parameter=list(minlen=2, maxlen=2, support=0.3))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.3 2 2
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(items2)
## set of 5 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift
## Min. :0.30 Min. :1 Min. :2.0
## 1st Qu.:0.30 1st Qu.:1 1st Qu.:2.5
## Median :0.30 Median :1 Median :2.5
## Mean :0.34 Mean :1 Mean :2.4
## 3rd Qu.:0.40 3rd Qu.:1 3rd Qu.:2.5
## Max. :0.40 Max. :1 Max. :2.5
##
## mining info:
## data ntransactions support confidence
## trans 10 0.3 0.8
inspect(sort(items2, by ="support"))
# rules of length 3 (lhs + rhs) with support >= 0.3 and the default confidence 0.8
items3 <- apriori(trans, parameter=list(minlen=3, maxlen=3, support=0.3))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.3 3 3
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(items3)
## set of 2 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence lift
## Min. :0.3 Min. :1 Min. :2.5
## 1st Qu.:0.3 1st Qu.:1 1st Qu.:2.5
## Median :0.3 Median :1 Median :2.5
## Mean :0.3 Mean :1 Mean :2.5
## 3rd Qu.:0.3 3rd Qu.:1 3rd Qu.:2.5
## Max. :0.3 Max. :1 Max. :2.5
##
## mining info:
## data ntransactions support confidence
## trans 10 0.3 0.8
inspect(sort(items3, by ="support"))
# rules of length 4 (lhs + rhs) with support >= 0.3 and the default confidence 0.8
items4 <- apriori(trans, parameter=list(minlen=4, maxlen=4, support=0.3))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.3 4 4
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(items4)
## set of 0 rules
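The absence of any length-4 result can be confirmed by hand: no set of four items reaches a support of 0.3. The base R sketch below enumerates the candidates with combn(). It is an illustration only: items are kept as whole names ("potato chips", "hamburger meat"), which does not affect the result because those items fall below the support threshold either way, and a small tolerance guards the floating-point comparison.

```r
transactions <- list(
  c("beer", "diapers"),
  c("soda", "potato chips", "hamburger meat", "milk", "eggs"),
  c("coffee", "eggs"),
  c("beer", "bread", "cheese", "ham"),
  c("diapers", "beer", "potato chips"),
  c("cheese", "ham", "beer"),
  c("ham", "cheese", "bread", "coffee", "milk"),
  c("soda", "cheese", "bread", "ham"),
  c("coffee", "hamburger meat"),
  c("eggs", "diapers", "beer")
)
min_support <- 0.3
tol <- 1e-9  # tolerance for comparing floating-point supports

# Share of transactions that contain every item of the itemset.
itemset_support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Only items that are frequent on their own can be part of a frequent itemset.
items <- sort(unique(unlist(transactions)))
frequent_items <- items[vapply(items, itemset_support, numeric(1)) >= min_support - tol]
print(frequent_items)

# All frequent itemsets of size k, built from the frequent single items.
frequent_k <- function(k) {
  candidates <- combn(frequent_items, k, simplify = FALSE)
  Filter(function(s) itemset_support(s) >= min_support - tol, candidates)
}

for (k in 2:4) {
  cat(k, "-itemsets with support >= 0.3: ", length(frequent_k(k)), "\n", sep = "")
}
```

Seven single items are frequent, which agrees with the "[7 item(s)]" reported after sorting and recoding, and the only frequent 3-itemset is {bread, cheese, ham}.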
##############################
# Generate and Visualize Rules
##############################
# Call inferred from the parameter specification printed below.
rules <- apriori(trans, parameter=list(minlen=2, maxlen=10, support=0.3, confidence=0.8))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.3 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.286 2.500 3.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.3000 Min. :1 Min. :2.000
## 1st Qu.:0.3000 1st Qu.:1 1st Qu.:2.500
## Median :0.3000 Median :1 Median :2.500
## Mean :0.3286 Mean :1 Mean :2.429
## 3rd Qu.:0.3500 3rd Qu.:1 3rd Qu.:2.500
## Max. :0.4000 Max. :1 Max. :2.500
##
## mining info:
## data ntransactions support confidence
## trans 10 0.3 0.8
inspect(rules)
# Call inferred from the parameter specification printed below.
rules <- apriori(trans, parameter=list(minlen=2, maxlen=10, support=0.3, confidence=0.3))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.3 0.1 1 none FALSE TRUE 0.3 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 11 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 8 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.273 2.500 3.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.3000 Min. :0.6000 Min. :2.000
## 1st Qu.:0.3000 1st Qu.:0.7500 1st Qu.:2.500
## Median :0.3000 Median :1.0000 Median :2.500
## Mean :0.3182 Mean :0.8955 Mean :2.409
## 3rd Qu.:0.3000 3rd Qu.:1.0000 3rd Qu.:2.500
## Max. :0.4000 Max. :1.0000 Max. :2.500
##
## mining info:
## data ntransactions support confidence
## trans 10 0.3 0.3
inspect(rules)
# keep the rules whose confidence exceeds 0.3 (all 11 rules here, since the minimum is 0.6)
confidentRules <- rules[quality(rules)$confidence > 0.3]
inspect(confidentRules)
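The rule counts reported above (7 rules at confidence 0.8, 11 rules at confidence 0.3) can be cross-checked by hand in base R. The sketch below assumes rules with a single item on the right-hand side, which is the form of rule that apriori() in arules mines. The helper names are illustrative, and items are kept as whole names, which does not change the result at a support of 0.3.

```r
transactions <- list(
  c("beer", "diapers"),
  c("soda", "potato chips", "hamburger meat", "milk", "eggs"),
  c("coffee", "eggs"),
  c("beer", "bread", "cheese", "ham"),
  c("diapers", "beer", "potato chips"),
  c("cheese", "ham", "beer"),
  c("ham", "cheese", "bread", "coffee", "milk"),
  c("soda", "cheese", "bread", "ham"),
  c("coffee", "hamburger meat"),
  c("eggs", "diapers", "beer")
)
min_support <- 0.3
tol <- 1e-9  # tolerance for comparing floating-point values

itemset_support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

items <- sort(unique(unlist(transactions)))
frequent_items <- items[vapply(items, itemset_support, numeric(1)) >= min_support - tol]

# Count rules (lhs -> single-item rhs) that meet both thresholds.
count_rules <- function(min_confidence) {
  n_rules <- 0
  for (k in 2:length(frequent_items)) {
    for (s in combn(frequent_items, k, simplify = FALSE)) {
      s_support <- itemset_support(s)
      if (s_support < min_support - tol) next
      for (rhs in s) {
        # confidence = support(lhs + rhs) / support(lhs)
        conf <- s_support / itemset_support(setdiff(s, rhs))
        if (conf >= min_confidence - tol) n_rules <- n_rules + 1
      }
    }
  }
  n_rules
}

cat("rules at confidence 0.8:", count_rules(0.8), "\n")
cat("rules at confidence 0.3:", count_rules(0.3), "\n")
```

Lowering the confidence threshold from 0.8 to 0.3 admits four more rules, for example beer → diapers with confidence 0.6 and cheese → bread with confidence 0.75.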
# references
# http://www.rdatamining.com/examples/association-rules
# http://statistical-research.com/data-frames-and-transactions/