
AIML 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt

Fast Algorithm for Mining Association Rules

M.H.Margahny and A.A.Mitwaly


Dept. of Computer Science, Faculty of Computers and Information,
Assuit University, Egypt,
Email: marghny@acc.aun.edu.eg.

Abstract

One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction consists of a set of items. The most time-consuming operation in this discovery process is computing the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions.
Can one develop a method that avoids or reduces candidate generation and testing, and uses novel data structures to reduce the cost of frequent pattern mining? This is the motivation of our study.
A fast algorithm is proposed for solving this problem. Our algorithm uses the "TreeMap", which is a data structure in the Java language. We also present an "ArrayList" technique that greatly reduces the need to traverse the database. Moreover, we present experimental results which show that our structure outperforms the existing algorithms on common data mining problems.

Keywords: data mining, association rules, TreeMap, ArrayList.

1. Introduction
The rapid development of computer technology, especially increased capacities and decreased costs of storage media, has led businesses to store huge amounts of external and internal information in large databases at low cost. Mining useful information and helpful knowledge from these large databases has thus evolved into an important research area [20,16,4]. Among data mining subjects, association rule mining has been one of the most popular; it can be simply defined as finding interesting rules in large collections of data. Association rule mining has a wide range of applications, such as market basket analysis, medical diagnosis and research, website navigation analysis, homeland security, and so on.
Association rules are used to identify relationships among a set of items in a database. These relationships are not based on inherent properties of the data themselves (as with functional dependencies), but rather on co-occurrence of the data items. Association rules were first introduced in [21]. The subsequent paper [18] is considered one of the most important contributions to the subject. Its main algorithm, Apriori, not only influenced the association rule mining community, but affected other data mining fields as well.
Association rule and frequent itemset mining became a widely researched area, and hence faster and faster algorithms have been presented. Many of them are Apriori-based algorithms or Apriori modifications. Those who adopted Apriori as a basic search strategy tended to adopt its whole set of procedures and data structures as well [13, 25, 11, 1]. The scheme of this important algorithm was used not only in basic association rule mining, but also in other data mining fields: hierarchical association rules [1,27,28], association rule maintenance [5,6,26,17], sequential pattern mining [19,24], episode mining [10] and functional dependency discovery [29,30].
Frequent pattern mining techniques can also be extended to solve many other problems, such as iceberg-cube computation [14] and classification [3]. Thus effective and efficient frequent pattern mining is an important and interesting research problem.
The rest of this paper is organized as follows. In section (2) we give a formal definition of association rules and the problem definition. Section (3) introduces our development. Experimental results are shown in section (4). Section (5) contains conclusions.

2. Association Rule Problem
Frequent itemset mining came from efforts to discover useful patterns in customers' transaction databases. A customers' transaction database is a sequence of transactions T = {t1, t2, ..., tn}, where each transaction ti is an itemset. An itemset with k elements is called a k-itemset. The support of an itemset X in T, denoted support(X), is the number of transactions that contain X, i.e. support(X) = |{ti : X ⊆ ti}|. An itemset is frequent if its support is greater than a support threshold, denoted min_supp.
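To make the support definition concrete, here is a minimal Java sketch of counting support(X) over a list of transactions; the transactions, the itemset X and the min_supp value below are hypothetical, chosen only for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportCount {
    // support(X) = number of transactions t with X ⊆ t
    static int support(Set<String> x, List<Set<String>> transactions) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(x)) count++;  // X ⊆ t
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "b", "c")),
                new HashSet<>(Arrays.asList("a", "c")),
                new HashSet<>(Arrays.asList("b", "d")));
        Set<String> x = new HashSet<>(Arrays.asList("a", "c"));
        int minSupp = 2;
        System.out.println(support(x, db));             // 2
        System.out.println(support(x, db) >= minSupp);  // true: {a,c} is frequent
    }
}
```

Scanning every transaction for every candidate in this naive way is exactly the cost that the TreeMap/ArrayList approach of Section 3 is designed to avoid.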
The frequent itemset mining problem is to find all frequent itemsets in a given transaction database.
The algorithms were judged on three main tasks: mining all frequent itemsets, mining closed frequent itemsets, and mining maximal frequent itemsets.
A frequent itemset is called closed if there is no superset that has the same support (i.e., that is contained in the same number of transactions). Closed itemsets capture all information about the frequent itemsets, because from them the support of any frequent itemset can be determined.
A frequent itemset is called maximal if there is no superset that is frequent. Maximal itemsets define the boundary between frequent and infrequent sets in the subset lattice. An arbitrary frequent itemset is often called a free itemset to distinguish it from closed and maximal ones.
Obviously, the collection of maximal frequent itemsets is a subset of the collection of closed frequent itemsets, which in turn is a subset of the collection of all frequent itemsets. From the maximal itemsets alone, the supports of all their subsets are not available, while this might be necessary for some applications such as association rules. On the other hand, the closed frequent itemsets form a lossless representation of all frequent itemsets, since the support of any itemset that is not closed is uniquely determined by the closed frequent itemsets [2].
Through our study of the pattern-finding problem, we can divide the algorithms into two types: algorithms with and without candidate generation. Any Apriori-like algorithm belongs to the first type. Eclat [28] may also be considered an instance of this type. The FP-growth algorithm [12] is the best-known instance of the second type. Comparing the two types, the first needs several database scans; obviously the second type performs better than the first.
Table 1 summarizes and briefly compares the three algorithms. We include in this table the maximum number of scans and the data structures proposed.

Table 1. Comparison of algorithms

Algorithm    Scans   Data structure
Apriori      M+1     HashTable & Tree
Eclat        M+1     HashTable & Tree
FP-growth    2       Prefix-tree

3. Motivation
The efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used, and the implementation details [8]. In this section we show these points in our algorithm.
Data Structure Description: The central data structure of our algorithm is TreeMap, which is a collection in the Java language. TreeMap is a sorted collection that stores key/value pairs. You can find a value if you provide the key. Keys must be unique: you cannot store two values with the same key. You can insert elements into the collection in any order; when you iterate through the collection, the elements are automatically presented in sorted order. Every time an element is added to the tree, it is placed into its proper sorted position, so the iterator always visits the elements in sorted order.
If the tree contains n elements, then an average of log2 n comparisons are required to find the correct position for a new element. For example, if the tree already contains 1,000 elements, adding a new element requires about 10 comparisons. If you invoke the remove method of the iterator, you remove the key and its associated value from the map.
A large TreeMap can be constructed by scanning the database, where each distinct itemset is mapped to a different location in the TreeMap; the entries of the TreeMap then give the actual count of each itemset in the database. In that case, we do not need any extra pass to count the occurrences of each itemset.
Using this collection, our algorithm runs as follows. During the first pass, each item in the database is mapped to a different location in the TreeMap. The "put" method of the TreeMap adds a new entry. If an entry for the item does not exist in the TreeMap, we construct a new ArrayList and add the current transaction number to it; otherwise we add the current transaction number to the existing ArrayList.
After the first pass, the TreeMap contains all elements of the database as its keys and their transaction numbers as its values, which give the exact number of occurrences of each item in the database. By making only one pass over the TreeMap and checking its values, we can generate the frequent 1-itemsets (L1). Hence the TreeMap can be used instead of the original database file.
Our algorithm uses the property that a transaction that does not contain any frequent k-itemset is useless in subsequent scans. Therefore, in subsequent passes the algorithm prunes the database, now represented by the TreeMap, by discarding the transactions that have no items from the frequent itemsets, and it also trims the items that are not frequent from the transactions.

How to generate candidate 2-itemsets and their frequencies?
After getting L1 from the first pass, the TreeMap includes all the information about the database: the keys represent the items and the values are the lists of transaction numbers. The candidate 2-itemsets are then generated by comparing the ArrayLists of two items and choosing the elements common to both. This yields a new ArrayList for each candidate 2-itemset (C2). In addition, L2 is generated using the same procedure as for L1. The process continues until no new Lk is found.

The algorithm has the following advantages:

1- Candidate generation becomes easy and fast.
2- Association rules are produced much faster, since retrieving the support of an itemset is quicker.
3- Just one data structure has to be implemented, hence the code is simpler and easier to maintain.
4- Using a single TreeMap to store Lk and Ck reduces memory usage.
5- Dealing with the TreeMap is faster than dealing with the database file.
6- The original file is not affected by the pruning process; its role ends as soon as the TreeMap is constructed.
7- The volume of the resulting database decreases each time we produce Lk, because items that are not frequent are pruned.
8- The constructed TreeMap contains all the frequency information of the database; hence mining the database becomes mining the TreeMap.
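The first pass and the ArrayList-intersection step described above can be sketched in Java as follows. The toy database, the item names and the min_supp value are illustrative only; the paper does not give the implementation details, so this is a reconstruction under those assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapPass {
    // First pass: map each item to the ArrayList of transaction numbers
    // containing it. TreeMap keeps the keys in sorted order, so frequent
    // itemsets are later reported alphabetically.
    static TreeMap<String, ArrayList<Integer>> firstPass(String[][] db) {
        TreeMap<String, ArrayList<Integer>> tidLists = new TreeMap<>();
        for (int tid = 0; tid < db.length; tid++) {
            for (String item : db[tid]) {
                tidLists.computeIfAbsent(item, k -> new ArrayList<>()).add(tid);
            }
        }
        return tidLists;
    }

    // The common transaction numbers of two ArrayLists give the support of
    // the corresponding 2-itemset directly, with no database rescan.
    static ArrayList<Integer> intersect(List<Integer> a, List<Integer> b) {
        ArrayList<Integer> common = new ArrayList<>(a);
        common.retainAll(b);
        return common;
    }

    public static void main(String[] args) {
        // Hypothetical toy database: row index = transaction number.
        String[][] db = { {"a","b","c"}, {"a","c"}, {"b","c"}, {"a","b"} };
        int minSupp = 2;

        TreeMap<String, ArrayList<Integer>> tidLists = firstPass(db);

        // L1: items whose ArrayList size (= support) meets min_supp.
        TreeMap<String, ArrayList<Integer>> l1 = new TreeMap<>();
        for (Map.Entry<String, ArrayList<Integer>> e : tidLists.entrySet()) {
            if (e.getValue().size() >= minSupp) l1.put(e.getKey(), e.getValue());
        }

        // C2: intersect the ArrayLists of each pair of frequent items.
        List<String> items = new ArrayList<>(l1.keySet());
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                ArrayList<Integer> common =
                        intersect(l1.get(items.get(i)), l1.get(items.get(j)));
                if (common.size() >= minSupp) {
                    System.out.println("{" + items.get(i) + "," + items.get(j)
                            + "} support=" + common.size());
                }
            }
        }
    }
}
```

Here `retainAll` stands in for the intersection step; since the transaction numbers are appended in increasing order, a merge-style intersection of the two sorted lists could replace it for large databases.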
4. Experimental Evaluation
In this section, we present a performance comparison of our development with FP-growth and the classical frequent pattern mining algorithm Apriori. All the experiments were performed on a 1.7 GHz Pentium PC with 128 MB of main memory, running the Windows XP operating system. All programs were developed under the Java compiler, version 1.3. To verify the usability of our algorithm, we used three of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI'04) [9].
We report experimental results on one synthetic dataset and two real datasets. The synthetic dataset is T10I4D100K with 1K items. In this dataset, the average transaction length and the average maximal potentially frequent itemset size are set to 10 and 4, respectively, while the number of transactions is set to 100K. It is a sparse dataset; the frequent itemsets are short and not numerous.
We use Chess and Mushroom as real datasets. The test datasets and some of their properties are described in Table 2.

Table 2. Test dataset description

Dataset       T        I     ATL
T10I4D100K    100 000  1000  10
Chess         3196     75    37
Mushroom      8124     119   23

T = number of transactions
I = number of items
ATL = average transaction length

The performance of Apriori, FP-growth and our algorithm was measured, and the results are shown in the figures (1-Chess, 2-Mushroom, 3-T10I4D100K). The X-axis in these graphs represents the support threshold values, while the Y-axis represents the response times of the algorithms being evaluated.
In these graphs, we see that the response times of all algorithms increase exponentially as the support threshold is reduced. This is only to be expected, since the number of itemsets in the output, the frequent itemsets, increases exponentially with decrease in the support threshold.
We also see that there is a considerable gap between the performance of Apriori and FP-growth and that of our algorithm.

Figure 1. Chess database: response time (s) vs. support (%) for Apriori, FP-growth and our development.
Figure 2. Mushroom database: response time (s) vs. support (%) for Apriori, FP-growth and our development.
Figure 3. T10I4D100K database: response time (s) vs. support (%) for Apriori, FP-growth and our development.

5. Conclusion and future work
Determining frequent objects is one of the most important fields in data mining. It is well known that the way candidates are defined has a great effect on running time and memory needs, and this is the reason for the large number of algorithms. It is also clear that the applied data structure influences these efficiency parameters as well. In this paper we presented an implementation that solves the frequent itemset mining problem.
In our approach, the TreeMap stores not only candidates, but frequent itemsets as well. It provides features we have not seen in any other implementation of Apriori-like algorithms. The most important one is that the generated frequent itemsets are printed in alphabetical order. This makes it easier for the user to find the rules on a specific product.
With the success of our algorithm, it is interesting to re-examine and explore many related problems, extensions and applications, such as iceberg cube computation, classification and clustering.
Our conclusion is that our development is a simple, practical, straightforward and fast algorithm for finding all frequent itemsets.

References
[1] A. Sarasere, E. Omiecinsky and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland. Also Technical Report No. GIT-CC-95-04, 1995.
[2] B. Goethals. Efficient Frequent Pattern Mining. PhD thesis, Transnational University of Limburg, Belgium, 2002.
[3] B. Liu, W. Hsu and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 80-86, New York, NY, Aug. 1998.
[4] C.-Y. Wang, T.-P. Hong and S.-S. Tseng. Maintenance of discovered sequential patterns for record deletion. Intell. Data Anal., pp. 399-410, February 2002.
[5] D. W-L. Cheung, J. Han, V. Ng and C. Y. Wong. Maintenance of discovered association rules in large databases: an incremental updating technique. In ICDE, pages 106-114, 1996.
[6] D. W-L. Cheung, S. D. Lee and B. Kao. A general incremental technique for maintaining discovered association rules. In Database Systems for Advanced Applications, pages 185-194, 1997.
[7] E.-H. Han, G. Karypis and V. Kumar. Scalable parallel data mining for association rules. July 15, 1997.
[8] F. Bodon. A fast Apriori implementation. In Proc. 1st IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2003, Melbourne, FL). CEUR Workshop Proceedings 90, Aachen, Germany, 2003. http://www.ceur-ws.org/vol-90/
[9] Frequent Itemset Mining Implementations (FIMI'03) Workshop website, http://fimi.cs.helsinki.fi, 2004.
[10] H. Mannila, H. Toivonen and A. I. Verkamo. Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 210-215. AAAI Press, 1995.
[11] H. Toivonen. Sampling large databases for association rules. In The VLDB Journal, pages 134-145, 1996.
[12] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. In Proc. ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04), pp. 53-87, 2004.
[13] J. S. Park, M.-S. Chen and P. S. Yu. An effective hash-based algorithm for mining association rules. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 175-186, San Jose, California, 22-25, 1995.
[14] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359-370, Philadelphia, PA, June 1999.
[15] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12:372-390, 2000.
[16] M. S. Chen, J. Han and P. S. Yu. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 1996.
[17] N. F. Ayan, A. U. Tansel and M. E. Arkun. An efficient algorithm to update large itemsets with early pruning. In Knowledge Discovery and Data Mining, pages 287-291, 1999.
[18] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1994.
[19] R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. L. P. Chen, editors, Proc. 11th Int. Conf. Data Engineering (ICDE), pages 3-14. IEEE, 6-10, 1995.
[20] R. Agrawal, T. Imielinski and A. Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, 1993.
[21] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, 1993.
[22] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI/MIT Press, 1996.
[23] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1995.
[24] R. Srikant and R. Agrawal. Mining sequential patterns: generalizations and performance improvements. Technical report, IBM Almaden Research Center, San Jose, California, 1995.
[25] S. Brin, R. Motwani, J. D. Ullman and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD Record (ACM Special Interest Group on Management of Data), 26(2):255, 1997.
[26] S. Thomas, S. Bodagala, K. Alsabti and S. Ranka. An efficient algorithm for incremental updation of association rules in large databases. In Proc. KDD'97, pages 263-266, 1997.
[27] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of
the 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995.
[28] Y. Fu. Discovery of multiple-level rules from large databases, 1996.
[29] Y. Huhtala, J. Karkkainen, P. Porkka and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100-111, 1999.
[30] Y. Huhtala, J. Karkkainen, P. Porkka and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE, pages 392-401, 1998.
