
Association Rule Mining or Market Basket Analysis

Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method for exploring associations between items. In the simplest situation, the data consists of two variables: a transaction and an item. For each transaction, there is a list of items. Typically, a transaction is a single customer purchase, and the items are the things that were bought.

It is important to do some initial examination of the data before attempting to do association analysis. Quantifying affinities among related items would be pointless if very few transactions involved multiple items.

An association rule is a statement of the form (item set A) => (item set B). The goal of the analysis is to determine the strength of all the association rules among a set of items. The value of the generated rules is gauged by confidence, support, and lift.

A bank's marketing department is interested in examining associations between various retail banking services used by customers. They would like to determine both typical and atypical service combinations as well as the order in which the services were first used. These requirements suggest both a market basket analysis and a sequence analysis.

The BANK data set contains service information for nearly 8,000 customers. There are three variables in the data set, as shown in the table below.

Name      Model Role   Measurement Level   Description
ACCOUNT   ID           Nominal             Account Number
SERVICE   Target       Nominal             Type of Service
VISIT     Sequence     Ordinal             Order of Product Purchase

The BANK data set has over 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, each row representing one of the products he or she owns. The median number of products per customer is three. The 13 products are represented in the data set using the following abbreviations:

ATM      automated teller machine debit card
AUTO     automobile installment loan
CCRD     credit card
CD       certificate of deposit
CKCRD    check/debit card
CKING    checking account
HMEQLC   home equity line of credit
IRA      individual retirement account
MMDA     money market deposit account
MTG      mortgage
PLOAN    personal/consumer installment loan
SVG      saving account
TRUST    personal trust account
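The initial examination recommended earlier (checking that enough customers hold more than one product) can be sketched in Python. This is a minimal illustration, not part of the Enterprise Miner workflow: the account numbers and rows below are made up, and `items_per_account` is a hypothetical helper name. Only the (ACCOUNT, SERVICE) row layout is taken from the description of the BANK data set.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows in the (ACCOUNT, SERVICE) layout described above.
# The account numbers and service combinations are invented for illustration.
rows = [
    (500026, "CKING"), (500026, "SVG"), (500026, "ATM"),
    (500075, "CKING"), (500075, "MMDA"),
    (500129, "CKING"),
    (500256, "SVG"), (500256, "CD"), (500256, "IRA"), (500256, "CKING"),
]


def items_per_account(rows):
    """Group services by account and return {account: set of services}."""
    baskets = defaultdict(set)
    for account, service in rows:
        baskets[account].add(service)
    return dict(baskets)


baskets = items_per_account(rows)
sizes = [len(items) for items in baskets.values()]

# Share of accounts that hold more than one service. If this share is
# very small, association rules have little to work with.
multi_item_share = sum(1 for n in sizes if n > 1) / len(sizes)

print("accounts:", len(baskets))
print("median services per account:", median(sizes))
print("share of accounts with more than one service:", multi_item_share)
```

On the real BANK data, the same grouping should reproduce the median of three products per customer quoted above.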

Your first task is to create a new analysis diagram and data source for the BANK data set.

1. Create a new diagram named Associations Analysis to contain this analysis.
2. Select Create Data Source from the Data Sources project property.
3. Proceed to Step 2 of the Data Source Wizard.
4. Select the BANK table in the AAEM library.


5. Proceed to Step 5 of the Data Source Wizard.
6. In Step 5, assign metadata to the table variables as shown below.

An association analysis requires exactly one target variable and at least one ID variable. Both should have a nominal measurement level. A sequence analysis also requires a sequence variable, which usually has an ordinal measurement scale.

7. Proceed to Step 7 of the Data Source Wizard. For an association analysis, the data source should have a role of Transaction.
8. Set Role to Transaction.

9. Select Finish to close the Data Source Wizard.
10. Drag a BANK data source into the diagram workspace.
11. Select the Explore tab and drag an Association tool into the diagram workspace.
12. Connect the BANK node to the Association node.


13. Select the Association node and examine its Properties panel.

14. The Export Rule by ID property determines whether the Rule-by-ID data is exported from the node and whether the Rule Description table is available for display in the Results window. Set Export Rule by ID to Yes.

Other options in the Properties panel include the following:

Minimum Confidence Level specifies the minimum confidence level required to generate a rule. The default level is 10%.
Support Type specifies whether the analysis should use the Support Count or the Support Percentage property. The default setting is Percent.
Support Count specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default count is 2.
Support Percentage specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default frequency is 5%.
Maximum Items determines the maximum size of the item set to be considered. For example, the default of four items indicates that a maximum of four items will be included in a single association rule.

If you are interested in associations that involve fairly rare products, you should consider reducing the support count or percentage when you run the Association node. If you obtain too many rules to be practically useful, you should consider raising the minimum support count or percentage as one possible solution.
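To see how the support percentage and maximum items settings shape the candidate item sets, here is a small Python sketch. It assumes a brute-force enumeration, which is not necessarily how the Association node works internally. The function name `frequent_itemsets`, its parameter names, and the toy baskets are all hypothetical; only the defaults (5% and four items) mirror the property defaults listed above.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support_pct=5.0, max_items=4):
    """Return {item set (sorted tuple): count} for item sets whose
    support percentage is at least min_support_pct.

    Brute-force sketch: every subset of up to max_items items in every
    basket is counted, so it is only suitable for small examples.
    """
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, min(max_items, len(items)) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {
        combo: count
        for combo, count in counts.items()
        if 100.0 * count / n >= min_support_pct
    }


# Made-up baskets: one set of services per customer.
baskets = [
    {"CKING", "SVG", "ATM"},
    {"CKING", "SVG"},
    {"CKING", "ATM"},
    {"CKING", "AUTO"},
    {"SVG"},
]

strict = frequent_itemsets(baskets, min_support_pct=40.0)
loose = frequent_itemsets(baskets, min_support_pct=20.0)
pairs_only = frequent_itemsets(baskets, min_support_pct=20.0, max_items=2)

print(len(strict), len(loose), len(pairs_only))
```

Lowering the threshold from 40% to 20% brings in the item sets that contain the rare AUTO product, which matches the advice above about rare products. A minimum support count would work the same way, comparing `count` directly instead of the percentage.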

Because you first want to perform a market basket analysis, you do not need the sequence variable.

15. Run the diagram from the Association node and view the results.

The Results - Association window opens with the Statistics Plot, Statistics Line Plot, Rule Matrix, and Output windows visible.


16. Maximize the Statistics Line Plot window.

The statistics line plot graphs the lift, expected confidence, confidence, and support for each of the rules by rule index number. Consider the rule A => B. Recall the following:

The support of A => B is the probability that a customer has both A and B.
The confidence of A => B is the probability that a customer has B given that the customer has A.
The expected confidence of A => B is the probability that a customer has B. This is the confidence you would expect if the two item sets were independent.
The lift of A => B is a measure of the strength of the association. It is the confidence of the rule divided by the expected confidence. If Lift=2 for the rule A => B, then a customer having A is twice as likely to have B as a customer chosen at random.
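The four measures can be written out as a short Python sketch. The function name `rule_metrics` and the toy baskets are hypothetical. The baskets are chosen so that the rule A => B has a lift of exactly 2.

```python
def rule_metrics(baskets, lhs, rhs):
    """Compute support, confidence, expected confidence, and lift
    for the rule lhs => rhs over a list of baskets (sets of items)."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs <= b)           # customers with A
    n_rhs = sum(1 for b in baskets if rhs <= b)           # customers with B
    n_both = sum(1 for b in baskets if (lhs | rhs) <= b)  # customers with A and B
    if n == 0 or n_lhs == 0 or n_rhs == 0:
        raise ValueError("rule is undefined: an item set never occurs")

    support = n_both / n
    confidence = n_both / n_lhs
    expected_confidence = n_rhs / n
    lift = confidence / expected_confidence
    return {
        "support": support,
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": lift,
    }


# Made-up data: 8 customers, 4 have A, 2 have B, and both B holders also have A.
baskets = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"A"},
    {"C"}, {"C"}, {"C"}, {"C"},
]

forward = rule_metrics(baskets, {"A"}, {"B"})
reverse = rule_metrics(baskets, {"B"}, {"A"})

print(forward)  # support 0.25, confidence 0.5, expected confidence 0.25, lift 2.0
print(reverse)  # confidence 1.0, expected confidence 0.5, lift 2.0
```

The reversed rule has a different confidence but the same lift, because lift reduces to P(A and B) / (P(A) P(B)), which does not depend on the direction of the rule.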

The lift can be interpreted as a general measure of association between the two item sets. Lift values greater than 1 indicate positive correlation; values equal to 1 indicate zero correlation; and values less than 1 indicate negative correlation. Lift is symmetric, so the lift of the rule A => B is the same as the lift of the rule B => A. Notice that the rules are ordered in descending order of lift.

Interpreting the implication (=>) in association rules can be difficult. High confidence and support do not imply cause and effect. The rule is not necessarily interesting, and the two items might not even be correlated. The term confidence is not related to the statistical usage; therefore, there is no repeated sampling interpretation.

Consider the association rule (saving account) => (checking account). At a bank, the following was determined about customers having savings and checking accounts:

This rule has 50% support (5,000/10,000) and 83% confidence (5,000/6,000). Based on these two measures, this might be considered a strong rule. On the contrary, those without a savings account are even more likely to have a checking account (87.5%). Saving and checking are in fact negatively correlated.

If the two accounts were independent, then knowing that a customer has a savings account would not help in knowing whether that customer has a checking account. The expected confidence if the two accounts were independent is 85% (8,500/10,000). This is higher than the confidence of SVG => CK (saving account => checking account).
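The arithmetic in this example can be checked with a few lines of Python. The four counts are the ones quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the savings/checking example.
total = 10_000     # all customers
savings = 6_000    # customers with a savings account
checking = 8_500   # customers with a checking account
both = 5_000       # customers with both accounts

support = both / total                   # 0.50
confidence = both / savings              # about 0.833
expected_confidence = checking / total   # 0.85
lift = confidence / expected_confidence  # about 0.98, below 1

# Customers without a savings account who still have a checking account.
checking_without_savings = (checking - both) / (total - savings)  # 0.875

print(f"support={support:.3f} confidence={confidence:.3f}")
print(f"expected confidence={expected_confidence:.3f} lift={lift:.3f}")
print(f"P(checking | no savings)={checking_without_savings:.3f}")
```

A lift just below 1 is what the text means by negative correlation: the rule looks strong on support and confidence alone, yet knowing that a customer has a savings account slightly lowers the chance of a checking account.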


17. To view the descriptions of the rules, select View, then Rules, then Rule description.

The highest lift rule is that checking and credit card together imply check card. This is not surprising given that many check cards include credit card logos. Notice the symmetry in rules 1 and 2. This is not accidental because, as noted earlier, lift is symmetric.


18. Examine the rule matrix.

The rule matrix plots the rules based on the items on the left side of the rule and the items on the right side of the rule. The points are colored based on the confidence of the rules. For example, the rules with the highest confidence are in the column indicated by the cursor in the picture above. Using the ActiveX feature of the graph, you discover that these rules all have checking on the right side of the rule.


19. Lastly, explore the associations by viewing the link graph. To view the link graph, select View, then Rules, then Link Graph.

The link graph displays association results by using nodes and links. The size and color of a node indicate the transaction counts in the Rules data set. Larger nodes have greater counts than smaller nodes. The color and thickness of a link indicate the confidence level of a rule. The thicker the link, the higher the confidence of the rule.

Suppose you are particularly interested in those associations that involve automobile loans. One way to accomplish that visually in the link graph is to select those nodes whose label contains AUTO and then show only those links involving the selected nodes.

20. Right-click in the Link Graph window and select Graph Properties. Under the Links option, deselect Show All Links. Click Apply and OK. Selecting any node will now display only the links associated with that node.


Another way to explore the rules found in the analysis is by plotting the rules table.

21. Select View, then Rules, then Rules Table. The Rules Table window opens.
22. Select the Plot Wizard icon.

23. Choose a three-dimensional scatter plot for the type of chart, and select Next >.
24. Select Role X, Y, and Z for the variables SUPPORT, LIFT, and CONF, respectively.

25. Select Finish to generate the plot.


26. Rearrange the windows to view the data and the plot simultaneously.

Expanding the Rule column in the data table and selecting points in the three-dimensional plot enable you to quickly uncover high lift rules from the market basket analysis while judging their confidence and support. You can use WHERE clauses in the Data Options dialog box to subset cases in which you are interested.

Sequence Analysis
In addition to the products owned by its customers, the bank is interested in examining the order in which the products are purchased. Because you are interested in the order, you conduct a sequence analysis. The sequence variable (VISIT) in the BANK data set enables you to conduct this analysis. In the last association analysis, you omitted this variable from the analysis. In this association analysis, you use this variable. Add another Association node to run a sequence analysis.

The results of the sequence analysis differ somewhat from the association analysis. The statistics line plot graphs the confidence and support for each of the rules by rule index number. The percent support is the transaction count divided by the total number of customers, which would be the maximum transaction count. The percent confidence is the transaction count divided by the transaction count for the left side of the sequence.

Select View, then Rules, then Rule description to view the descriptions of the rules. The lift for many of the rules changes after the order of service acquisition is considered.
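The support and confidence definitions for a two-step sequence can be sketched in Python. This is a simplified reading of those definitions, not the Association node's actual algorithm. It assumes that only the first visit at which a customer acquires each service matters and that "A then B" means A was acquired at a strictly earlier visit than B. The function name `sequence_metrics` and the rows are hypothetical.

```python
from collections import defaultdict


def sequence_metrics(rows, first, then):
    """Support and confidence for the sequence `first` => `then`.

    rows: iterable of (account, service, visit) tuples.
    Assumption: a customer matches the sequence when `first` was
    acquired at a strictly earlier visit than `then`.
    """
    # Earliest visit at which each customer acquired each service.
    history = defaultdict(dict)
    for account, service, visit in rows:
        earliest = history[account].get(service)
        if earliest is None or visit < earliest:
            history[account][service] = visit

    n_customers = len(history)
    n_first = sum(1 for h in history.values() if first in h)
    n_sequence = sum(
        1 for h in history.values()
        if first in h and then in h and h[first] < h[then]
    )
    if n_customers == 0 or n_first == 0:
        raise ValueError("sequence is undefined: left side never occurs")

    return {
        "support": n_sequence / n_customers,  # count / total customers
        "confidence": n_sequence / n_first,   # count / left-side count
    }


# Made-up (ACCOUNT, SERVICE, VISIT) rows for four customers.
rows = [
    (1, "CKING", 1), (1, "SVG", 2),
    (2, "SVG", 1), (2, "CKING", 2),
    (3, "CKING", 1), (3, "ATM", 2), (3, "SVG", 3),
    (4, "CKING", 1),
]

cking_then_svg = sequence_metrics(rows, "CKING", "SVG")
svg_then_cking = sequence_metrics(rows, "SVG", "CKING")

print(cking_then_svg)  # support 0.5, confidence 0.5
print(svg_then_cking)  # support 0.25, confidence about 0.333
```

Unlike in the market basket analysis, reversing the rule changes the support as well as the confidence, because the two directions count different customers.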
