Big Data

Homework Due Lesson 6 – Association Rules
All homework will be team homework but all team members must know the material.
Submit all homework to both instructor and graduate assistant Md Ali (ma03901n@pace.edu).
---------------------------**************--------------------------
Complete Association Rules exercise 4 (end of chapter, page 159 in textbook) manually.
A local retailer has a database that stores 10,000 transactions of last summer. After
analyzing the data, a data science team has identified the following statistics:
{battery} appears in 6,000 transactions.
{sunscreen} appears in 5,000 transactions.
{sandals} appears in 4,000 transactions.
{bowls} appears in 2,000 transactions.
{battery,sunscreen} appears in 1,500 transactions.
{battery,sandals} appears in 1,000 transactions.
{battery,bowls} appears in 250 transactions.
{battery,sunscreen,sandals} appears in 600 transactions.
Answer the following questions:
1. What are the support values of the preceding itemsets?
{battery} appears in 6,000 transactions. So, support (battery)= 6000/10000 = 0.6
{sunscreen} appears in 5,000 transactions. So, support (sunscreen)= 5000/10000 = 0.5
{sandals} appears in 4,000 transactions. So, support (sandals)= 4000/10000 = 0.4
{bowls} appears in 2,000 transactions. So, support (bowls)= 2000/10000 = 0.2
{battery,sunscreen} appears in 1,500 transactions. So, support (battery,sunscreen)= 1500/10000 =
0.15
{battery,sandals} appears in 1,000 transactions. So, support (battery,sandals)= 1000/10000 = 0.1
{battery,bowls} appears in 250 transactions. So, support (battery,bowls)= 250/10000 = 0.025
{battery,sunscreen,sandals} appears in 600 transactions. So, support (battery,sunscreen,sandals)=
600/10000 = 0.06
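The same arithmetic can be checked in R. This is a minimal sketch using only base R; the names `n_transactions`, `counts` and `support` are illustrative and not part of the exercise.

```r
# Transaction counts taken from the exercise statement.
n_transactions <- 10000
counts <- c(
  "{battery}"                   = 6000,
  "{sunscreen}"                 = 5000,
  "{sandals}"                   = 4000,
  "{bowls}"                     = 2000,
  "{battery,sunscreen}"         = 1500,
  "{battery,sandals}"           = 1000,
  "{battery,bowls}"             = 250,
  "{battery,sunscreen,sandals}" = 600
)

# support(X) = (transactions containing X) / (all transactions)
support <- counts / n_transactions
print(support)
```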
2. Assuming the minimum support is 0.05, which item sets are considered frequent?
An itemset is frequent when its support is greater than or equal to the minimum support. With a
minimum support of 0.05, the itemsets {battery}, {sunscreen}, {sandals}, {bowls}, {battery,sunscreen},
{battery,sandals}, and {battery,sunscreen,sandals} are frequent. Only {battery,bowls}, with a
support of 0.025, is not a frequent itemset.
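The threshold check above can be sketched in base R as follows. The support values are the ones computed in question 1; the names `support`, `min_support`, `frequent` and `infrequent` are illustrative.

```r
# Support values from question 1.
support <- c(
  "{battery}"                   = 0.6,
  "{sunscreen}"                 = 0.5,
  "{sandals}"                   = 0.4,
  "{bowls}"                     = 0.2,
  "{battery,sunscreen}"         = 0.15,
  "{battery,sandals}"           = 0.1,
  "{battery,bowls}"             = 0.025,
  "{battery,sunscreen,sandals}" = 0.06
)
min_support <- 0.05

# An itemset is frequent when support >= minimum support.
frequent   <- names(support)[support >= min_support]
infrequent <- names(support)[support <  min_support]
print(frequent)
print(infrequent)
```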
3. What are the confidence values of {battery}→{sunscreen} and {battery, sunscreen}→{sandals}?
Which of the two rules is more interesting?
Confidence (battery→sunscreen) = support (battery, sunscreen)/ support (battery) = 0.15/0.6= 0.25.
This means that 25% of the transactions that contain battery also contain sunscreen.
Confidence ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals)/ support
(battery, sunscreen) = 0.06/0.15= 0.4. This means that 40% of the transactions that contain battery
and sunscreen also contain sandals.
The second rule ({battery, sunscreen}→{sandals}) is more interesting because its confidence (0.4) is
higher than that of the first rule (0.25).
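A short base-R sketch of the two confidence calculations. It works from the raw transaction counts rather than the rounded supports so that the divisions are exact; the helper `confidence` and the variable names are illustrative.

```r
# confidence(X -> Y) = support(X, Y) / support(X); the common factor
# 1/10000 cancels, so the raw counts can be divided directly.
confidence <- function(count_xy, count_x) count_xy / count_x

conf_rule1 <- confidence(1500, 6000)  # {battery} -> {sunscreen}
conf_rule2 <- confidence(600, 1500)   # {battery, sunscreen} -> {sandals}

print(c(rule1 = conf_rule1, rule2 = conf_rule2))
```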
4. List all the candidate rules that can be formed from the statistics. Which rules are considered
interesting at the minimum confidence 0.25? Out of these interesting rules, which rule is
considered the most useful (that is, least coincidental)?
Support:
{battery} appears in 6,000 transactions. So, support (battery)= 6000/10000 = 0.6
{sunscreen} appears in 5,000 transactions. So, support (sunscreen)= 5000/10000 = 0.5
{sandals} appears in 4,000 transactions. So, support (sandals)= 4000/10000 = 0.4
{bowls} appears in 2,000 transactions. So, support (bowls)= 2000/10000 = 0.2
{battery,sunscreen} appears in 1,500 transactions. So, support (battery,sunscreen)= 1500/10000 =
0.15
{battery,sandals} appears in 1,000 transactions. So, support (battery,sandals)= 1000/10000 = 0.1
{battery,bowls} appears in 250 transactions. So, support (battery,bowls)= 250/10000 = 0.025
{battery,sunscreen,sandals} appears in 600 transactions. So, support (battery,sunscreen,sandals)=
600/10000 = 0.06
Confidence:
Confidence(x→y)=support(x,y)/support(x)
Confidence (battery → sunscreen) = support (battery, sunscreen)/ support (battery) = 0.15/0.6= 0.25
Confidence (sunscreen → battery) = support (battery, sunscreen)/ support (sunscreen) = 0.15/0.5=
0.3
Confidence (battery → sandals) = support (battery, sandals) / support (battery) = 0.1/0.6= 0.17
Confidence (sandals → battery) = support (battery, sandals) / support (sandals) = 0.1/0.4=0.25
Confidence (battery → bowls) = support (battery, bowls)/ support (battery) = 0.025/0.6= 0.042
Confidence (bowls → battery) = support (battery, bowls)/ support (bowls) = 0.025/0.2= 0.125
Confidence ({battery}→{sunscreen, sandals }) = support (battery, sunscreen, sandals )/ support
(battery) = 0.06/0.6=0.1
Confidence ({sunscreen}→{battery, sandals}) = support (battery, sunscreen, sandals )/ support
(sunscreen) = 0.06/0.5=0.12
Confidence ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals )/ support
(battery, sandals) = 0.06/0.1=0.6
Confidence ({sandals}→{battery, sunscreen }) = support (battery, sunscreen, sandals )/ support
(sandals) = 0.06/ 0.4=0.15
Confidence ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals )/ support
(battery, sunscreen) = 0.06/0.15= 0.4
So, at the minimum confidence of 0.25, the interesting rules are:
• Confidence (battery → sunscreen) = support (battery, sunscreen)/ support (battery) =
0.15/0.6= 0.25 [There is a 25% chance that a customer who buys battery also buys
sunscreen.]
• Confidence (sunscreen → battery) = support (battery, sunscreen)/ support (sunscreen) =
0.15/0.5= 0.3 [There is a 30% chance that a customer who buys sunscreen also buys
battery.]
• Confidence (sandals → battery) = support (battery, sandals) / support (sandals) =
0.1/0.4=0.25 [There is a 25% chance that a customer who buys sandals also buys
battery.]
• Confidence ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals)/
support (battery, sandals) = 0.06/0.1=0.6 [There is a 60% chance that a customer who buys
battery and sandals together also buys sunscreen.]
• Confidence ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals)/
support (battery, sunscreen) = 0.06/0.15= 0.4 [There is a 40% chance that a customer who
buys battery and sunscreen together also buys sandals.]
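The candidate rules and the 0.25 confidence filter can be reproduced with a small base-R table. This is a sketch only: the data frame layout and column names (`count_xy`, `count_x`) are assumptions made for illustration, and the counts come from the exercise statement. Raw counts are used so that rules sitting exactly on the 0.25 threshold are not lost to floating-point rounding.

```r
# One row per candidate rule X -> Y:
#   count_xy = transactions containing X and Y, count_x = transactions containing X.
rules <- data.frame(
  rule = c("{battery} -> {sunscreen}",
           "{sunscreen} -> {battery}",
           "{battery} -> {sandals}",
           "{sandals} -> {battery}",
           "{battery} -> {bowls}",
           "{bowls} -> {battery}",
           "{battery} -> {sunscreen,sandals}",
           "{sunscreen} -> {battery,sandals}",
           "{battery,sandals} -> {sunscreen}",
           "{sandals} -> {battery,sunscreen}",
           "{battery,sunscreen} -> {sandals}"),
  count_xy = c(1500, 1500, 1000, 1000, 250, 250, 600, 600, 600, 600, 600),
  count_x  = c(6000, 5000, 6000, 4000, 6000, 2000, 6000, 5000, 1000, 4000, 1500),
  stringsAsFactors = FALSE
)

rules$confidence <- rules$count_xy / rules$count_x

# Keep the rules that meet the minimum confidence.
min_confidence <- 0.25
interesting <- rules[rules$confidence >= min_confidence, c("rule", "confidence")]
print(interesting)
```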
Lift
Lift(x → y)=support(x, y)/{support(x)*support(y)}
Lift (battery → sunscreen) = support (battery, sunscreen)/ {support (battery)* support (sunscreen)}
= 0.15/(0.6*0.5)= 0.5
Lift (sunscreen → battery) has the same value, 0.5, because the formula is symmetric in x and y.
Lift (sandals → battery) = support (battery, sandals) / {support (battery)* support (sandals)} =
0.1/(0.6*0.4)= 0.42
Lift ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals)/ {support (battery,
sunscreen)* support (sandals)} = 0.06/(0.15*0.4)= 1
Lift ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals)/ {support (battery,
sandals)* support (sunscreen)} = 0.06/(0.1*0.5)= 1.2
A lift greater than 1 means that x and y occur together more often than they would if they were
independent. Only ({battery, sandals}→{sunscreen}) has a lift above 1, so it has a stronger
association than the others.
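As a cross-check, the lift of the five rules that passed the confidence filter can be computed in base R. The table layout and the column names are illustrative assumptions; the counts are those given in the exercise.

```r
n <- 10000  # total number of transactions

# count_xy: transactions with X and Y; count_x, count_y: transactions with X, with Y.
rules <- data.frame(
  rule = c("{battery} -> {sunscreen}",
           "{sunscreen} -> {battery}",
           "{sandals} -> {battery}",
           "{battery,sunscreen} -> {sandals}",
           "{battery,sandals} -> {sunscreen}"),
  count_xy = c(1500, 1500, 1000, 600, 600),
  count_x  = c(6000, 5000, 4000, 1500, 1000),
  count_y  = c(5000, 6000, 6000, 4000, 5000),
  stringsAsFactors = FALSE
)

# lift = support(X, Y) / (support(X) * support(Y)), written in terms of counts.
rules$lift <- (rules$count_xy * n) / (rules$count_x * rules$count_y)

print(rules[order(-rules$lift), c("rule", "lift")])
```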
Leverage
Leverage(x → y)=support(x, y)-{support(x)*support(y)}
Leverage (battery → sunscreen) = support (battery, sunscreen) – {support (battery)* support (
sunscreen)} = 0.15 - (0.6*0.5)= -0.15
Leverage (sandals → battery) = support (battery, sandals) – {support (battery)* support (sandals)} =
0.1 - (0.6*0.4)= - 0.14
Leverage (battery → bowls) = support (battery, bowls) – {support (battery)* support (bowls)} = 0.025
- (0.6*0.2)= -0.095
Leverage ({battery, sunscreen}→{sandals}) = support (battery, sunscreen, sandals ) – {support
(battery, sunscreen)* support (sandals) }= 0.06 - (0.15*0.4)= 0
Leverage ({battery, sandals}→{sunscreen}) = support (battery, sunscreen, sandals ) – {support
(battery, sandals)* support (sunscreen) }= 0.06 - (0.1*0.5)= 0.01
This again confirms that ({battery, sandals}→{sunscreen}) has a stronger association than the
others: it is the only rule with a positive leverage.
So, based on the lift and leverage of the candidate rules, we can conclude that the rule
({battery, sandals}→{sunscreen}) is the most useful.
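The leverage values can be recomputed the same way. This base-R sketch uses the counts from the exercise; the data frame and its column names are illustrative.

```r
n <- 10000  # total number of transactions

rules <- data.frame(
  rule = c("{battery} -> {sunscreen}",
           "{sandals} -> {battery}",
           "{battery} -> {bowls}",
           "{battery,sunscreen} -> {sandals}",
           "{battery,sandals} -> {sunscreen}"),
  count_xy = c(1500, 1000, 250, 600, 600),
  count_x  = c(6000, 4000, 6000, 1500, 1000),
  count_y  = c(5000, 6000, 2000, 4000, 5000),
  stringsAsFactors = FALSE
)

# leverage = support(X, Y) - support(X) * support(Y)
rules$leverage <- rules$count_xy / n - (rules$count_x / n) * (rules$count_y / n)

print(rules[order(-rules$leverage), c("rule", "leverage")])
```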
Important Notes: Confidence is able to identify trustworthy rules, but it cannot tell whether a rule
is coincidental. A high-confidence rule can sometimes be misleading because confidence does not
consider support of the itemset in the rule consequent. Measures such as lift and leverage not only
ensure interesting rules are identified but also filter out the coincidental rules.
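To see why lift and leverage correct for the support of the consequent, both can be rewritten in terms of confidence. The identities below follow directly from the definitions used in this homework:

```latex
\[
\mathrm{Lift}(X \to Y)
  = \frac{\mathrm{support}(X, Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}
  = \frac{\mathrm{Confidence}(X \to Y)}{\mathrm{support}(Y)}
\]
\[
\mathrm{Leverage}(X \to Y)
  = \mathrm{support}(X, Y) - \mathrm{support}(X)\,\mathrm{support}(Y)
  = \mathrm{support}(X)\,\bigl(\mathrm{Confidence}(X \to Y) - \mathrm{support}(Y)\bigr)
\]
```

For example, {battery}→{sunscreen} has a confidence of 0.25, but sunscreen already appears in 50% of all transactions, so the lift is 0.25/0.5 = 0.5. The rule {battery, sandals}→{sunscreen} has a confidence of 0.6 against the same 0.5, which gives a lift of 1.2.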
-----------------------*************------------------------------
Given the following 10 grocery store transactions, use appropriate association rule thresholds to
find a few interesting rules both by hand and by using R.
1. beer, diapers
2. soda, potato chips, hamburger meat, milk, eggs
3. coffee, eggs
4. beer, bread, cheese, ham
5. diapers, beer, potato chips
6. cheese, ham, beer
7. ham, cheese, bread, coffee, milk
8. soda, cheese, bread, ham
9. coffee, hamburger meat
10. eggs, diapers, beer
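For the by-hand part, the counting can be organized as in the following base-R sketch. It assumes that "potato chips" and "hamburger meat" are each a single item and uses a minimum support of 0.3; the names `transactions`, `itemset_support` and `frequent_pairs` are illustrative.

```r
# The 10 transactions, with multi-word items kept as single items.
transactions <- list(
  c("beer", "diapers"),
  c("soda", "potato chips", "hamburger meat", "milk", "eggs"),
  c("coffee", "eggs"),
  c("beer", "bread", "cheese", "ham"),
  c("diapers", "beer", "potato chips"),
  c("cheese", "ham", "beer"),
  c("ham", "cheese", "bread", "coffee", "milk"),
  c("soda", "cheese", "bread", "ham"),
  c("coffee", "hamburger meat"),
  c("eggs", "diapers", "beer")
)
n <- length(transactions)

# support of an itemset = share of transactions that contain every item in it
itemset_support <- function(items) {
  sum(vapply(transactions, function(tx) all(items %in% tx), logical(1))) / n
}

# Support of every single item.
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, itemset_support, numeric(1))
print(sort(item_support, decreasing = TRUE))

# Support of every pair of items, keeping those at or above the threshold.
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- vapply(pairs, itemset_support, numeric(1))
names(pair_support) <- vapply(pairs, paste, character(1), collapse = ",")
frequent_pairs <- pair_support[pair_support >= 0.3]
print(frequent_pairs)
```

Only the frequent pairs need to be considered when forming rules by hand, since any rule built from an infrequent itemset falls below the support threshold.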
R Code:
library('arules')
library('arulesViz')
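# Note: the file is split on every comma, so "potato chips" and "hamburger meat"
# are each read as two separate items here (13 distinct items instead of 11).
purchases <- c("beer,diapers",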
"soda,potato,chips,hamburger,meat,milk,eggs",
"coffee,eggs",
"beer,bread,cheese,ham",
"diapers,beer,potato,chips",
"cheese,ham,beer",
"ham,cheese,bread,coffee,milk",
"soda,cheese,bread,ham",
"coffee,hamburger,meat",
"eggs,diapers,beer")
# write to a basket file

data <- paste(purchases, sep="\n")
write(data, file = "purchases")
# read transactions from the purchases "basket" file
trans <- read.transactions("purchases", format = "basket", sep=",")
inspect(trans)
summary(trans)
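# rules over 2 items
items2 <- apriori(trans, parameter=list(minlen=2, maxlen=2, support=0.3))
summary(items2)
inspect(sort(items2, by ="support"))
# rules over 3 items (items3 was used without being defined)
items3 <- apriori(trans, parameter=list(minlen=3, maxlen=3, support=0.3))
summary(items3)
inspect(sort(items3, by ="support"))
# rules over 4 items (items4 was used without being defined)
items4 <- apriori(trans, parameter=list(minlen=4, maxlen=4, support=0.3))
summary(items4)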
rules <- apriori(trans, parameter=list(minlen=2, support=0.3))
summary(rules)
inspect(rules)
rules <- apriori(trans, parameter=list(minlen=2, support=0.3, confidence=0.3,
target = "rules"))
summary(rules)
inspect(rules)
plot(rules)
plot(rules@quality)
confidentRules <- rules[quality(rules)$confidence > 0.3]

inspect(confidentRules)
plot(confidentRules, method="matrix", control=list(reorder=TRUE))
inspect(head(sort(rules, by="lift"), 10))
highConfidenceRules <- head(sort(rules, by="confidence"), 5)
plot(highConfidenceRules, method="graph", control=list(type="items"))
highLiftRules <- head(sort(rules, by="lift"), 5)
plot(highLiftRules, method="graph", control=list(type="items"))
# plot parallel coordinates of the candidate rules
plot(rules, method="paracoord", control=list(reorder=TRUE))
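The support, confidence and lift reported by apriori can be cross-checked without the arules package. This base-R sketch parses the same comma-separated strings as the script above; `baskets`, `supp` and `rule_metrics` are hypothetical helper names introduced for illustration.

```r
purchases <- c("beer,diapers",
               "soda,potato,chips,hamburger,meat,milk,eggs",
               "coffee,eggs",
               "beer,bread,cheese,ham",
               "diapers,beer,potato,chips",
               "cheese,ham,beer",
               "ham,cheese,bread,coffee,milk",
               "soda,cheese,bread,ham",
               "coffee,hamburger,meat",
               "eggs,diapers,beer")

# Split each basket on commas, as read.transactions does with sep=",".
baskets <- strsplit(purchases, ",", fixed = TRUE)
n <- length(baskets)

# Share of baskets that contain every item in `items`.
supp <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1))) / n
}

# Support, confidence and lift of the rule lhs -> rhs.
rule_metrics <- function(lhs, rhs) {
  s    <- supp(c(lhs, rhs))
  conf <- s / supp(lhs)
  c(support = s, confidence = conf, lift = conf / supp(rhs))
}

print(rule_metrics("diapers", "beer"))
print(rule_metrics("cheese", "ham"))
print(rule_metrics(c("bread", "cheese"), "ham"))
```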
Console Output
# HW6: Extra Exercise

# CS816 Big Data Analytics
#################
# Extra Exercise
#################
library('arules')
## Loading required package: Matrix

##
## Attaching package: 'arules'
##
## The following objects are masked from 'package:base':
##
## %in%, abbreviate, write
library('arulesViz')
## Loading required package: grid

##
## Attaching package: 'arulesViz'
##
## The following object is masked from 'package:arules':
##
## abbreviate
##
## The following object is masked from 'package:base':
##
## abbreviate
## create the dataset file using basket format

purchases <- c("beer,diapers",
"soda,potato,chips,hamburger,meat,milk,eggs",
"coffee,eggs",
"beer,bread,cheese,ham",
"diapers,beer,potato,chips",
"cheese,ham,beer",
"ham,cheese,bread,coffee,milk",
"soda,cheese,bread,ham",
"coffee,hamburger,meat",
"eggs,diapers,beer")
# write to a basket file

data <- paste(purchases, sep="\n")
write(data, file = "purchases")
# read transactions from the purchases "basket" file

trans <- read.transactions("purchases", format = "basket", sep=",")
inspect(trans)
## items
## 1 {beer,diapers}
## 2 {chips,eggs,hamburger,meat,milk,potato,soda}
## 3 {coffee,eggs}
## 4 {beer,bread,cheese,ham}
## 5 {beer,chips,diapers,potato}
## 6 {beer,cheese,ham}
## 7 {bread,cheese,coffee,ham,milk}
## 8 {bread,cheese,ham,soda}
## 9 {coffee,hamburger,meat}
## 10 {beer,diapers,eggs}
summary(trans)
## transactions as itemMatrix in sparse format with

## 10 rows (elements/itemsets/transactions) and
## 13 columns (items) and a density of 0.2846154
##
## most frequent items:
## beer cheese ham bread coffee (Other)
## 5 4 4 3 3 18
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 7
## 2 3 3 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 3.0 3.5 3.7 4.0 7.0
##
## includes extended item information - examples:
## labels
## 1 beer
## 2 bread
## 3 cheese
# apply apriori on the itemsets in the transactions
# frequent 2-itemsets
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.3 2 2
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 10 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(items2)
## set of 5 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 5
##
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift
## Min. :0.30 Min. :1 Min. :2.0
## 1st Qu.:0.30 1st Qu.:1 1st Qu.:2.5
## Median :0.30 Median :1 Median :2.5
## Mean :0.34 Mean :1 Mean :2.4
## 3rd Qu.:0.40 3rd Qu.:1 3rd Qu.:2.5
## Max. :0.40 Max. :1 Max. :2.5
##
## mining info:
## data ntransactions support confidence
## trans 10 0.3 0.8
## lhs rhs support confidence lift

## 4 {cheese} => {ham} 0.4 1 2.5
## 5 {ham} => {cheese} 0.4 1 2.5
## 1 {diapers} => {beer} 0.3 1 2.0
## 2 {bread} => {cheese} 0.3 1 2.5
## 3 {bread} => {ham} 0.3 1 2.5
##
## 0.8 0.1 1 none FALSE TRUE 0.3 3 3
## target ext
## rules FALSE
##
##
## checking subsets of size 1 2 3 done [0.00s].
summary(items3)
## set of 2 rules
##
## 3
## 2
##
## 3 3 3 3 3 3
##
## Min. :0.3 Min. :1 Min. :2.5
## 1st Qu.:0.3 1st Qu.:1 1st Qu.:2.5
## Mean :0.3 Mean :1 Mean :2.5
## 3rd Qu.:0.3 3rd Qu.:1 3rd Qu.:2.5
## Max. :0.3 Max. :1 Max. :2.5
##
## mining info:
## trans 10 0.3 0.8

## 1 {bread,cheese} => {ham} 0.3 1 2.5
## 2 {bread,ham} => {cheese} 0.3 1 2.5
##
## 0.8 0.1 1 none FALSE TRUE 0.3 4 4
## target ext
## rules FALSE
##
##
summary(items4)
## set of 0 rules
##############################
# Generate and Visualize Rules
##############################
# run Apriori without max (7 rules 100% confidence)

rules <- apriori(trans, parameter=list(minlen=2, support=0.3))
##
## 0.8 0.1 1 none FALSE TRUE 0.3 2 10
## target ext
## rules FALSE
##
##
summary(rules)
## set of 7 rules
##
## 2 3
## 5 2
##
## 2.000 2.000 2.000 2.286 2.500 3.000
##
## Min. :0.3000 Min. :1 Min. :2.000
## 1st Qu.:0.3000 1st Qu.:1 1st Qu.:2.500
## Mean :0.3286 Mean :1 Mean :2.429
## 3rd Qu.:0.3500 3rd Qu.:1 3rd Qu.:2.500
## Max. :0.4000 Max. :1 Max. :2.500
##
## mining info:
## trans 10 0.3 0.8
inspect(rules)

## 1 {diapers} => {beer} 0.3 1 2.0
## 2 {bread} => {cheese} 0.3 1 2.5
## 3 {bread} => {ham} 0.3 1 2.5
## 4 {cheese} => {ham} 0.4 1 2.5
## 5 {ham} => {cheese} 0.4 1 2.5
## 6 {bread,cheese} => {ham} 0.3 1 2.5
## 7 {bread,ham} => {cheese} 0.3 1 2.5
# (11 rules with 30% confidence)

rules <- apriori(trans, parameter=list(minlen=2, support=0.3, confidence=0.3,
target = "rules"))
##
## 0.3 0.1 1 none FALSE TRUE 0.3 2 10
## target ext
## rules FALSE
##
##
summary(rules)
## set of 11 rules
##
## 2 3
## 8 3
##
## 2.000 2.000 2.000 2.273 2.500 3.000
##
## Min. :0.3000 Min. :0.6000 Min. :2.000
## 1st Qu.:0.3000 1st Qu.:0.7500 1st Qu.:2.500
## Median :0.3000 Median :1.0000 Median :2.500
## Mean :0.3182 Mean :0.8955 Mean :2.409
## 3rd Qu.:0.3000 3rd Qu.:1.0000 3rd Qu.:2.500
## Max. :0.4000 Max. :1.0000 Max. :2.500
##
## mining info:
## trans 10 0.3 0.3
inspect(rules)

## 1 {diapers} => {beer} 0.3 1.00 2.0
## 2 {beer} => {diapers} 0.3 0.60 2.0
## 3 {bread} => {cheese} 0.3 1.00 2.5
## 4 {cheese} => {bread} 0.3 0.75 2.5
## 5 {bread} => {ham} 0.3 1.00 2.5
## 6 {ham} => {bread} 0.3 0.75 2.5
## 7 {cheese} => {ham} 0.4 1.00 2.5
## 8 {ham} => {cheese} 0.4 1.00 2.5
## 9 {bread,cheese} => {ham} 0.3 1.00 2.5
## 10 {bread,ham} => {cheese} 0.3 1.00 2.5
## 11 {cheese,ham} => {bread} 0.3 0.75 2.5
# visualization of the selected rules

plot(rules)
plot(rules@quality)
# 11 rules matrix
confidentRules <- rules[quality(rules)$confidence > 0.3]
inspect(confidentRules)

## 1 {diapers} => {beer} 0.3 1.00 2.0
## 2 {beer} => {diapers} 0.3 0.60 2.0
## 3 {bread} => {cheese} 0.3 1.00 2.5
## 4 {cheese} => {bread} 0.3 0.75 2.5
## 5 {bread} => {ham} 0.3 1.00 2.5
## 6 {ham} => {bread} 0.3 0.75 2.5
## 7 {cheese} => {ham} 0.4 1.00 2.5
## 8 {ham} => {cheese} 0.4 1.00 2.5
## 9 {bread,cheese} => {ham} 0.3 1.00 2.5
## 10 {bread,ham} => {cheese} 0.3 1.00 2.5
## 11 {cheese,ham} => {bread} 0.3 0.75 2.5
plot(confidentRules, method="matrix", control=list(reorder=TRUE))
## Itemsets in Antecedent (LHS)

## [1] "{cheese}" "{cheese,ham}" "{ham}" "{bread,ham}"
## [5] "{bread}" "{bread,cheese}" "{diapers}" "{beer}"
## Itemsets in Consequent (RHS)
## [1] "{cheese}" "{beer}" "{diapers}" "{bread}" "{ham}"
# displays rules with top lift scores
inspect(head(sort(rules, by="lift"), 10))

## 3 {bread} => {cheese} 0.3 1.00 2.5
## 4 {cheese} => {bread} 0.3 0.75 2.5
## 5 {bread} => {ham} 0.3 1.00 2.5
## 6 {ham} => {bread} 0.3 0.75 2.5
## 7 {cheese} => {ham} 0.4 1.00 2.5
## 8 {ham} => {cheese} 0.4 1.00 2.5
## 9 {bread,cheese} => {ham} 0.3 1.00 2.5
## 10 {bread,ham} => {cheese} 0.3 1.00 2.5
## 11 {cheese,ham} => {bread} 0.3 0.75 2.5
## 1 {diapers} => {beer} 0.3 1.00 2.0
# graph the 5 rules with the highest CONFIDENCE

highConfidenceRules <- head(sort(rules, by="confidence"), 5)
plot(highConfidenceRules, method="graph", control=list(type="items"))
# graph the 5 rules with the highest LIFT
highLiftRules <- head(sort(rules, by="lift"), 5)
plot(highLiftRules, method="graph", control=list(type="items"))
# plot parallel coordinates of the candidate rules

plot(rules, method="paracoord", control=list(reorder=TRUE))
# references
# http://www.rdatamining.com/examples/association-rules
# http://statistical-research.com/data-frames-and-transactions/
