Mining Frequent Itemsets With Association Rule

International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 9, Sep 2013
ISSN: 2231-2803, http://www.ijcttjournal.org, Page 3190

Mining Frequent Itemsets with Association Rule
S. Uma (1), Dr. A. Malathi (2)

(1) M.Phil. Research Scholar, PG and Research Department of Computer Science,
Government Arts College (Autonomous), Coimbatore - 18
(2) Assistant Professor, PG and Research Department of Computer Science,
Government Arts College (Autonomous), Coimbatore - 18

Abstract: We are given a large database of customer transactions, where each transaction consists of a cus-id, a
transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such
databases. In this paper, we propose three algorithms, LCM, LCMfreq, and LCMmax, for mining all frequent closed itemsets,
frequent sets, and maximal frequent sets, respectively, from databases of transactions. The main theoretical contribution is the
construction of tree-shaped traversal routes composed only of frequent closed itemsets, induced by a parent-child relationship
defined on frequent closed itemsets. By traversing the routes in a depth-first manner, LCM finds all frequent closed itemsets in
polynomial time, without storing the previously obtained closed itemsets in memory. We also introduce several algorithmic
techniques that exploit the sparse and dense structures of the input data. Algorithms for enumerating all frequent itemsets and
maximal frequent itemsets are obtained from LCM as its variants.

Index Terms: Itemsets, Association rule mining, LCM, Frequent items.
I. INTRODUCTION
Database mining is motivated by the decision support
problems faced by most large retail organizations. Progress in
barcode technology has made it possible for retail
organizations to collect and store massive amounts of sales data,
referred to as basket data. A record in such data typically
consists of the transaction date and the items bought in the
transaction. Very often, data records also contain a cus-id,
particularly when the purchase has been made using a credit card
or a frequent-buyer card. Catalog companies also collect such
data using the orders they receive.

We introduce the problem of mining sequential patterns over
this data. An example of such a pattern is that customers typically rent
"Star Wars", then "Empire Strikes Back", and then "Return of the
Jedi". Note that these rentals need not be consecutive. Customers
who rent some other videos in between also support this sequential
pattern. Elements of a sequential pattern need not be simple items.
"Fitted sheet and flat sheet and pillow cases", followed by
"comforter", followed by "drapes and ruffles" is an example of a
sequential pattern in which the elements are sets of items.

In this paper, we propose an efficient algorithm, LCM, for
enumerating all frequent closed itemsets. LCM is an abbreviation
of Linear time Closed itemset Miner. Existing algorithms for this
task basically enumerate frequent itemsets while cutting off
unnecessary frequent itemsets by pruning. However, the pruning
is not complete, hence these algorithms still operate on unnecessary
frequent itemsets and perform extra work. In LCM, we define a
parent-child relationship between frequent closed itemsets. The
relationship induces tree-shaped traversal routes composed
only of the frequent closed itemsets. Our algorithm traverses
these routes, hence it takes time linear in the number of frequent
closed itemsets. The algorithm is obtained from algorithms
for enumerating maximal bipartite cliques, which are designed
based on the reverse search technique. In addition to the search tree
technique for closed itemsets, we use several techniques to speed up
the update of the occurrences of itemsets. One technique is
occurrence deliver, which simultaneously computes the
occurrence sets of all the successors of the current itemset during
a single scan of the current occurrence set. The other is diffsets,
proposed in earlier work. Since there is a trade-off between these two
methods, in that the former is fast for sparse data while the latter is
fast for dense data, we developed a hybrid algorithm combining them. In
some iterations, the algorithm makes a decision based on an estimate of
their computation time, hence it can use the appropriate one
for the dense parts and the sparse parts of the input.
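
The two occurrence-update techniques described above can be illustrated with a minimal Python sketch. It assumes transactions are stored as sets of integer items; the function names `occurrence_deliver` and `diffset` and the toy database are illustrative and not taken from the original implementation.

```python
from collections import defaultdict

def occurrence_deliver(transactions, occ, core):
    """Scan the occurrence set `occ` once and build, for every item j > core,
    the list of transactions in `occ` that contain j."""
    buckets = defaultdict(list)
    for t in occ:                      # single scan of the current occurrence set
        for j in transactions[t]:
            if j > core:
                buckets[j].append(t)   # "deliver" transaction t to bucket j
    return dict(buckets)

def diffset(occ, occ_child):
    """Diffset of a successor: the occurrences lost when one item is added."""
    return sorted(set(occ) - set(occ_child))

# Toy database (an assumption for illustration only).
transactions = [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]

# The itemset {1} occurs in transactions 0, 1 and 3.
buckets = occurrence_deliver(transactions, occ=[0, 1, 3], core=1)
lost_by_adding_2 = diffset([0, 1, 3], buckets[2])
```

On sparse data the buckets are short, so one scan is cheap; on dense data the diffsets are short, because most occurrences survive when an item is added.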

We also consider the problems of enumerating all
frequent sets and maximal frequent sets, and derive two
algorithms, LCMfreq and LCMmax, from LCM. LCMmax is
obtained from LCM by adding an explicit check of maximality.
LCMfreq is not merely LCM without the check of closedness;
it also achieves a substantial speed-up using closed itemset
discovery techniques, because it enumerates only the
representatives of groups of frequent itemsets and generates
the other frequent itemsets from the representatives.

II. RELATED WORKS

The problem of discovering "what items are bought
together in a transaction" over basket data was introduced in earlier
work. While related, the problem of finding what items are bought together is
concerned with finding intra-transaction patterns, whereas the
problem of finding sequential patterns is concerned with
inter-transaction patterns. A pattern in the first problem consists of an
unordered set of items, whereas a pattern in the latter case is an
ordered list of sets of items. Discovering patterns in sequences of
events has been an area of active research in AI (see, for
example, [6]). However, the focus in this body of work is on
discovering the rule underlying the generation of a given
sequence in order to be able to predict a plausible sequence
continuation (e.g., the rule to predict what number will come next,
given a sequence of numbers). In contrast, we are interested in
finding all common patterns embedded in a database of
sequences of sets of events (items).
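
To make the notion of an inter-transaction pattern concrete, the following Python sketch checks whether a customer sequence supports a sequential pattern. It assumes both are lists of item sets in time order; the function name `supports` and the extra title "Alien" are illustrative only.

```python
def supports(customer_sequence, pattern):
    """Return True if every element of `pattern` is a subset of some
    transaction in `customer_sequence`, in the same order. The matched
    transactions need not be consecutive."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first remaining transaction containing `element`.
        while pos < len(customer_sequence) and not element <= customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True

rentals = [{"Star Wars"}, {"Alien"}, {"Empire Strikes Back"}, {"Return of the Jedi"}]
trilogy = [{"Star Wars"}, {"Empire Strikes Back"}, {"Return of the Jedi"}]

in_order = supports(rentals, trilogy)                       # unrelated rental in between is allowed
out_of_order = supports(rentals, list(reversed(trilogy)))   # the order matters
```

Greedy earliest matching is sufficient here: matching an element as early as possible never rules out a later match.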

This problem is related to the problem of finding text
subsequences that match a given regular expression (cf. the
UNIX grep utility). There has also been work on finding text
subsequences that approximately match a given string. These
techniques are oriented toward finding matches for one pattern.
In our problem, the difficulty is in figuring out what patterns to
try and then efficiently finding out which ones are contained in a
customer sequence. Techniques based on multiple alignments
[11] have been proposed to find entire text sequences that are
similar. There also has been work to find locally similar
subsequences. However, as pointed out in [10], these techniques
apply when the discovered patterns consist of consecutive
characters or multiple lists of consecutive characters separated by
a fixed length of noise characters.

a) ENUMERATING FREQUENT CLOSED ITEMSETS

In this section, we introduce a parent-child relationship
between the frequent closed itemsets in C, and describe our
algorithm LCM for enumerating them. Recent efficient
algorithms for frequent itemsets use a tree-shaped search
structure for the set F of frequent itemsets, called the set
enumeration tree, defined as follows.

Let $X = \{x_1, \ldots, x_n\}$ be an itemset as an ordered
sequence such that $x_1 < \cdots < x_n$, where the tail of $X$ is
$\mathrm{tail}(X) = x_n \in E$. Let $X, Y$ be itemsets. For an index $i$,
$X(i) = X \cap \{1, \ldots, i\}$. $X$ is a prefix of $Y$ if $X = Y(i)$ holds for
$i = \mathrm{tail}(X)$. Then, the parent-child relation $P$ for the set
enumeration tree for $F$ is defined as $X = P(Y)$ if $Y = X \cup \{i\}$ for some
$i > \mathrm{tail}(X)$, or equivalently, $X = Y \setminus \{\mathrm{tail}(Y)\}$.
Then, the whole search space for $F$ forms a prefix tree (or trie) with this
edge relation $P$.
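
The set enumeration tree can be sketched in a few lines of Python. This is an illustration under the assumption that items are the integers 1..n; the helper names (`tail`, `prefix`, `parent`, `children`) simply mirror the notation above.

```python
def tail(itemset):
    """tail(X): the largest item of a non-empty itemset."""
    return max(itemset)

def prefix(itemset, i):
    """X(i): the items of X that are at most i."""
    return frozenset(x for x in itemset if x <= i)

def parent(itemset):
    """P(Y): Y with its tail removed."""
    return itemset - {tail(itemset)}

def children(itemset, n_items):
    """All Y with P(Y) = X: add one item larger than tail(X)."""
    start = tail(itemset) + 1 if itemset else 1
    return [itemset | {i} for i in range(start, n_items + 1)]

x = frozenset({1, 3})
kids = children(x, n_items=5)   # itemsets {1, 3, 4} and {1, 3, 5}
```

Every itemset has exactly one parent, so a depth-first traversal from the empty set visits each itemset once without any duplicate check.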

We now define the parent-child relation $P$ for the closed itemsets in $C$
as follows. For $X \in C$, we define the parent of $X$ by
$P(X) = I(T(X(i(X) - 1)))$, where $i(X)$ is the minimum item $i$ such that
$T(X) = T(X(i))$ but $T(X) \neq T(X(i-1))$. If $Y$ is the parent of $X$, we
say $X$ is a child of $Y$. Let $\bot = I(T(\emptyset))$ be the smallest itemset in
$C$, called the root. For any $X \in C \setminus \{\bot\}$, its parent $P(X)$ is
always defined and belongs to $C$. In other words, the parent of $X$ is obtained
by deleting from $X$ the items not smaller than $i(X)$ and taking the closure of
the result.
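
As an illustration of these definitions, the sketch below computes $T$, $I$, $i(X)$ and $P(X)$ on a toy database. It assumes the database is a list of frozensets of integer items; the database itself and the helper name `core_index` are assumptions made for this example.

```python
def T(db, X):
    """Occurrence set: indices of the transactions that contain X."""
    return frozenset(k for k, t in enumerate(db) if X <= t)

def I(db, occ):
    """Items common to every transaction in the non-empty set `occ`."""
    return frozenset.intersection(*(db[k] for k in occ))

def prefix(X, i):
    """X(i): the items of X that are at most i."""
    return frozenset(x for x in X if x <= i)

def core_index(db, X):
    """i(X): the smallest i with T(X(i)) == T(X). Only meaningful for
    closed itemsets other than the root."""
    occ = T(db, X)
    for i in sorted(X):
        if T(db, prefix(X, i)) == occ:
            return i
    raise ValueError("core index is undefined for the empty itemset")

def parent(db, X):
    """P(X) = I(T(X(i(X) - 1)))."""
    return I(db, T(db, prefix(X, core_index(db, X) - 1)))

db = [frozenset(t) for t in ({1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4})]
root = I(db, T(db, frozenset()))   # the closure of the empty set

# Follow parents from a closed itemset up to the root.
chain = [frozenset({1, 2, 3, 4})]
while chain[-1] != root:
    chain.append(parent(db, chain[-1]))
```

In this toy database the chain climbs from {1, 2, 3, 4} through {1, 2, 3} and {1, 3} to the root {3}, which shows that parents form a tree over the closed itemsets.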

Existing enumeration algorithms for frequent closed
itemsets are based on backtracking: they traverse a tree
composed of all frequent itemsets in F and skip some itemsets by
pruning the tree. Since the pruning is not complete, however,
these algorithms generate unnecessary frequent itemsets. On the
other hand, another existing algorithm directly generates only closed
itemsets with the closure operation $I(T(\cdot))$, as ours does, but that
method may generate duplicate closed itemsets and needs an
expensive duplicate check.

In contrast, our algorithm traverses a tree
composed only of frequent closed itemsets, and each iteration is
not as heavy as in the previous algorithms. Hence, our algorithm
runs fast in practice. If we consider our algorithm as a
modification of the usual backtracking algorithm, each iteration of
our algorithm re-orders the items larger than $i(X)$ such that the
items not included in $X$ follow the items included in $X$. Note that
the parent $X$ is not a prefix of $X[i]$ in a recursive call. The check
of (cond2), the closedness test on $X[i]$, can be considered as a pruning
of non-closed itemsets.

Algorithm LCM(X : frequent closed itemset)
1. Output X
2. For each i > i(X) do
3.    If X[i] is frequent and X[i] = I(T(X[i])) then
         Call LCM(X[i])
4. End for
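
The recursion above can be written as a small runnable Python sketch. It rests on assumptions that this text does not spell out: $X[i]$ is taken to be $X$ together with the items $j \geq i$ of $I(T(X \cup \{i\}))$, items already in $X$ are skipped, and the root is given core index 0. The closure is recomputed naively, so the sketch shows the search tree only, not the speed-up techniques.

```python
def lcm(db, min_support):
    """Enumerate frequent closed itemsets by depth-first search over the
    parent-child tree. Assumes min_support >= 1 and integer items."""
    db = [frozenset(t) for t in db]
    items = sorted(set().union(*db))
    result = []

    def closure(occ):
        # I(occ): items shared by all transactions in the occurrence set.
        return frozenset.intersection(*(db[k] for k in occ))

    def recurse(X, occ, core):
        result.append(X)                               # 1. output X
        for i in items:                                # 2. for each i > i(X)
            if i <= core or i in X:
                continue
            occ_i = [k for k in occ if i in db[k]]     # T(X + {i})
            if len(occ_i) < min_support:               # X[i] must be frequent
                continue
            closed = closure(occ_i)
            # X[i] is closed exactly when the closure adds no item below i.
            if all(j in X for j in closed if j < i):
                recurse(closed, occ_i, i)              # 3. call LCM(X[i])

    if len(db) >= min_support:
        all_occ = list(range(len(db)))
        recurse(closure(all_occ), all_occ, 0)          # start from the root
    return result

db = [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]
closed_sets = lcm(db, min_support=2)
```

No itemset is stored for duplicate detection: the condition on items below `i` guarantees that each closed itemset is reached only from its parent.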

Detailed algorithm. We present below the description of the
algorithm LCM, which recursively computes $(X, T(X), i(X))$
simultaneously.

Theorem 1. Algorithm LCM enumerates all frequent closed itemsets in
$O\big(\sum_{j > i(X)} |T(X[j])| + \sum_{j > i(X),\, X[j] \in F} \sum_{j' \in T(X)} m(X[j], j')\big)$
time, or
$O\big(\sum_{i > i(X),\, X[i] \in F} \big((|T(X)| - |T(X[i])|) + \sum_{j \in NC(X),\, j < i} |T(X) \setminus T(X \cup \{j\})|\big)\big)$
time, for each frequent closed itemset $X$, with memory linear in the input size.

b) ENUMERATING MAXIMAL FREQUENT SETS

In this section, we explain an enumeration algorithm for maximal
frequent sets that uses frequent closed itemset
enumeration. The main idea is very simple. Since any maximal
frequent itemset is a frequent closed itemset, we enumerate frequent
closed itemsets and output only those that are maximal frequent sets. A frequent
closed itemset $X$ is a maximal frequent set if and only if
$X \cup \{i\}$ is infrequent for every $i \notin X$. By adding this check to


global: J, DJ   /* Global sets of lists */

Algorithm LCM()
1. X := I(T(∅))   /* The root */
2. For i := 1 to |E|
3.    If X[i] satisfies (cond2) and (cond3) then
         Call LCM_Iter(X[i], T(X[i]), i) or
         Call LCM_Iter2(X[i], T(X[i]), i, DJ)
         based on the decision criteria
4. End for

LCM_Iter(X, T(X), i(X))   /* occurrence deliver */
1. Output X
2. For each t ∈ T(X)
3.    For each j ∈ t, j > i(X), insert t into J[j]
4. For each j with non-empty J[j], in decreasing order
5.    If J[j] meets the minimum support and (cond2) holds then
         LCM_Iter(T(J[j]), J[j], j)
6.    Delete J[j]
7. End for

LCM_Iter2(X, T(X), i(X), DJ)   /* diffset */
1. Output X
2. For each i such that X[i] is frequent
3.    If X[i] satisfies (cond2) then
4.       For each j such that X[i] ∪ {j} is frequent,
            DJ[j] := DJ[j] \ DJ[i]
5.       LCM_Iter2(T(J[j]), J[j], j, DJ)
6.    End if
7. End for
Figure 2: Hypercube decomposition. LCMfreq decomposes a closed itemset class
into several sublattices (gray rectangles).

LCM, we obtain LCMmax. This modification does not
increase the memory complexity, but it increases the computation
time. In the case of occurrence deliver, we generate $T(X \cup \{j\})$
for all $j$ in the same way as occurrence deliver, and check the
maximality. This takes $O(\sum_{j < i(X)} |T(X \cup \{j\})|)$ time. In the case of
diffsets, we do not discard the diffsets that are unnecessary for
closed itemset enumeration. We keep the diffsets DJ for all $j$ such
that $X \cup \{j\}$ is frequent. To update and maintain them, we spend
$O(\sum_{j,\, X \cup \{j\} \in F} |T(X) \setminus T(X \cup \{j\})|)$ time. Note that we
do not need to check the maximality if $X$ has a child.
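
The maximality check can be sketched as a filter over already enumerated frequent closed itemsets. This Python example is an illustration only: it recounts supports naively instead of reusing occurrence deliver or diffsets, and the names `support` and `maximal_only` as well as the toy database are assumptions.

```python
def support(db, X):
    """Number of transactions that contain X."""
    return sum(1 for t in db if X <= t)

def maximal_only(db, closed_sets, min_support):
    """Keep the closed itemsets X for which X plus any further item
    is infrequent."""
    items = set().union(*db)
    return [X for X in closed_sets
            if all(support(db, X | {i}) < min_support for i in items - X)]

db = [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]
# Frequent closed itemsets of this database for minimum support 2.
closed_sets = [frozenset(s) for s in ({3}, {1, 3}, {1, 2, 3}, {2, 3}, {2, 3, 4})]
maximal = maximal_only(db, closed_sets, min_support=2)
```

Here only {1, 2, 3} and {2, 3, 4} survive, since every other closed itemset can still be extended by one item without dropping below the support threshold.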

III. ENUMERATING FREQUENT SETS

In this section, we describe an enumeration algorithm for
frequent itemsets. The key idea of our algorithm is that we
classify the frequent itemsets into groups and enumerate the
representative of each group. Each group is composed of frequent
itemsets included in the class of a closed itemset. This idea is
based on the following lemma.

Lemma 1. Suppose that frequent itemsets $X$ and $S \supseteq X$
satisfy $T(X) = T(S)$. Then, for any itemset $X'$ including $X$,
$T(X') = T(X' \cup S)$. In particular, $T(X') = T(R)$ holds for any
$X' \subseteq R \subseteq X' \cup S$, hence all such $R$ are included in the same
class of a closed itemset. Hence, any frequent itemset $X'$ is generated
from $X' \setminus (S \setminus X)$. We call $X' \setminus (S \setminus X)$ the
representative.
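
The lemma follows from two standard facts about occurrence sets, namely $T(A \cup B) = T(A) \cap T(B)$ and $A \subseteq B \Rightarrow T(B) \subseteq T(A)$. A short derivation, given here as a sketch and not quoted from the original proof:

```latex
\begin{align*}
T(X' \cup S) &= T(X') \cap T(S)  && \text{occurrences of a union} \\
             &= T(X') \cap T(X)  && \text{since } T(S) = T(X) \\
             &= T(X')            && \text{since } X \subseteq X' \text{ gives } T(X') \subseteq T(X).
\end{align*}
% For any R with X' \subseteq R \subseteq X' \cup S, monotonicity gives
\[
T(X' \cup S) \subseteq T(R) \subseteq T(X'),
\]
% and the two ends are equal, so T(R) = T(X').
```

Consequently every $R$ in that range has the same support as $X'$, which is why it can be output without any further frequency count.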

Let us consider a backtracking algorithm for finding
frequent itemsets that adds items one by one in lexicographical
order. Suppose that we currently have a frequent itemset $X$, and
find another frequent itemset $X \cup \{i\}$. Let $S = X[i]$. Then,
according to the above lemma, we observe that for any frequent
itemset $X'$ including $X$ and not intersecting $S \setminus X$, any itemset
including $X'$ and included in $X' \cup S$ is also frequent. Conversely,
any frequent itemset including $X$ is generated from some $X'$ not
intersecting $S \setminus X$. Hence, we enumerate only the representatives
including $X$ and not intersecting $S \setminus X$, and generate the other
frequent itemsets by adding each subset of $S \setminus X$. This method can be
regarded as decomposing the classes of closed itemsets into several
sublattices (hypercubes), each of whose maximal and minimal
elements are $S$ and $X'$, respectively. We call this technique
hypercube decomposition. Suppose that we are currently operating on a
representative $X'$ including $X$, and are going to generate a recursive
call with respect to $X' \cup \{j\}$. Then, if
$(X'[i] \setminus X') \setminus S \neq \emptyset$, $X'$ and
$S \cup (X'[i] \setminus X')$ satisfy the condition of the lemma above. Hence, we
add $X'[i] \setminus X'$ to $S$. LCMfreq is described as follows.

Algorithm LCMfreq(X : representative, S : itemset, i : item)
1. Output all itemsets R with X ⊆ R ⊆ X ∪ S
2. For each j > i, j ∉ X ∪ S
3.    If X ∪ {j} is frequent then
         Call LCMfreq(X ∪ {j}, S ∪ (X[j] \ (X ∪ {j})), j)
4. End for
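
A runnable Python sketch of this recursion is given below. It assumes that $X[j]$ denotes $X \cup \{j\}$ together with the items $k \geq j$ shared by all transactions containing $X \cup \{j\}$, that the initial call uses the empty representative with $S = I(T(\emptyset))$, and that the empty itemset counts as frequent. Frequency is checked by a naive scan; the function name `lcm_freq` and the toy database are illustrative.

```python
from itertools import combinations

def lcm_freq(db, min_support):
    """Enumerate all frequent itemsets via hypercube decomposition.
    Assumes min_support >= 1 and integer items."""
    db = [frozenset(t) for t in db]
    items = sorted(set().union(*db))
    result = []

    def recurse(X, S, i):
        # 1. Output every R with X <= R <= X | S (one whole hypercube).
        free = sorted(S - X)
        for r in range(len(free) + 1):
            for extra in combinations(free, r):
                result.append(X | frozenset(extra))
        # 2. Extend the representative with items outside X and S.
        for j in items:
            if j <= i or j in X or j in S:
                continue
            Xj = X | {j}
            occ = [t for t in db if Xj <= t]
            if len(occ) < min_support:
                continue
            closure = frozenset.intersection(*occ)
            gained = frozenset(k for k in closure if k >= j) - Xj
            recurse(Xj, S | gained, j)   # S grows by X[j] \ (X + {j})

    if len(db) >= min_support:
        recurse(frozenset(), frozenset.intersection(*db), 0)
    return result

db = [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]
frequent = lcm_freq(db, min_support=2)
```

Items collected in `S` are never tested for frequency again: by the lemma, adding any subset of them to a frequent representative leaves the occurrence set unchanged.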

For some synthetic instances in which frequent closed
itemsets are fewer than frequent itemsets, the average size of $S$ is
up to 5. In these cases, the algorithm finds $2^{|S|} = 32$ frequent
itemsets at once, hence the computation time is much reduced by
this improvement. To check the frequency of all $X \cup \{j\}$, we can
use the occurrence deliver and diffsets techniques used for LCM. LCMfreq does
not require the check of (cond2), hence the computation time of
each iteration is $O(\sum_{j > i(X)} |T(X[j])|)$ for occurrence
deliver, and $O(\sum_{j > i(X),\, X[j] \in F} |T(X) \setminus T(X[j])|)$ for diffsets.
Since the computation time changes, we use another estimator for the
hybrid. In almost all cases, once
$\sum_{j > i(X),\, X[j] \in F} |T(X) \setminus T(X[j])|$ becomes smaller than
$\sum_{j > i(X)} |T(X[j])|$, the condition
holds in every iteration generated by a recursive call. Hence, the
algorithm first starts with occurrence deliver, and compares the two
estimators in each iteration. If
$\sum_{j > i(X),\, X[j] \in F} |T(X) \setminus T(X[j])|$ becomes
smaller, then we switch to diffsets. Note that these estimators can
be computed in a short time by using the result of occurrence deliver.
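
The comparison of the two estimators can be sketched as follows. This Python example assumes that $|T(X[j])|$ equals the number of current occurrences containing item $j$, which the counts from an occurrence-deliver scan provide directly; the function name `choose_method` and both toy databases are hypothetical.

```python
def choose_method(db, X, occ, core, min_support):
    """Compare the occurrence-deliver and diffset cost estimates for the
    current itemset X with occurrence set `occ` and core index `core`."""
    counts = {}
    for t in occ:                          # one scan, as in occurrence deliver
        for j in db[t]:
            if j > core and j not in X:
                counts[j] = counts.get(j, 0) + 1
    deliver_cost = sum(counts.values())    # sum of |T(X[j])| over j > i(X)
    diffset_cost = sum(len(occ) - c        # sum of |T(X) \ T(X[j])| over frequent X[j]
                       for c in counts.values() if c >= min_support)
    method = "diffsets" if diffset_cost < deliver_cost else "occurrence deliver"
    return method, deliver_cost, diffset_cost

dense_db = [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]
dense_choice = choose_method(dense_db, X={3}, occ=[0, 1, 2, 3], core=0, min_support=2)

sparse_db = [{1}, {2}, {3}, {4}, {1, 2}]
sparse_choice = choose_method(sparse_db, X=set(), occ=[0, 1, 2, 3, 4], core=0, min_support=1)
```

Both estimates come from the same item counts, so the decision adds no extra pass over the data.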

IV. CONCLUSION

In this paper, we presented an efficient algorithm, LCM, for
mining frequent closed itemsets based on a parent-child
relationship defined on frequent closed itemsets. This technique
is taken from the algorithms for enumerating maximal bipartite
cliques [14, 15], which are based on reverse search [3]. In theory, we
demonstrated that LCM enumerates exactly the set of frequent
closed itemsets within time per closed itemset that is polynomial in the
total input size. In practice, we showed by experiments that our
algorithms run fast on several real-world datasets such as
BMS-WebView-1.

V. FUTURE WORK

We plan to extend this work along the following lines. First,
we plan to extend the algorithms to discover sequential patterns across
item categories. An example of such a categorization is that a
dishwasher is a kitchen appliance, which in turn is a heavy electric
appliance, etc. Second, we plan to transpose constraints into the
discovery algorithms. There could be item constraints (e.g., sequential
patterns involving home appliances) or time constraints (e.g., the
elements of the patterns should come from transactions that are at least
d1 and at most d2 days apart).

REFERENCES

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association
Rules in Large Databases, In Proc. VLDB '94, pp. 487-499, 1994.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo,
Fast Discovery of Association Rules, In Advances in Knowledge
Discovery and Data Mining, MIT Press, pp. 307-328, 1996.

[3] D. Avis and K. Fukuda, Reverse Search for Enumeration, Discrete
Applied Mathematics, Vol. 65, pp. 21-46, 1996.

[4] R. J. Bayardo Jr., Efficiently Mining Long Patterns from Databases, In
Proc. SIGMOD '98, pp. 85-93, 1998.

[5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, On the
Complexity of Generating Maximal Frequent and Minimal Infrequent
Sets, In Proc. STACS 2002, pp. 133-141, 2002.

[6] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association
Rules, Proc. 20th Int'l Conf. Very Large Data Bases (VLDB),
pp. 487-499, 1994.

[7] R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. 11th
Int'l Conf. Data Eng., pp. 3-14, Mar. 1995.

[8] C.F. Ahmed, S.K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, Efficient Tree
Structures for High Utility Pattern Mining in Incremental Databases,
IEEE Trans. Knowledge and Data Eng., vol. 21, no. 12, pp. 1708-1721,
Dec. 2009.

[9] C.H. Cai, A.W.C. Fu, C.H. Cheng, and W.W. Kwong, Mining
Association Rules with Weighted Items, Proc. Int'l Database Eng.
and Applications Symp. (IDEAS '98), pp. 68-77, 1998.

[10] R. Chan, Q. Yang, and Y. Shen, Mining High Utility
Itemsets, Proc. IEEE Third Int'l Conf. Data Mining, pp. 19-26, Nov.
2003.

[11] Frequent Itemset Mining Implementations Repository,
http://fimi.cs.helsinki.fi/, 2012.

[12] Efficient algorithm for mining high utility itemsets from
transactional databases, 2013.

[13] LCM: An efficient algorithm for enumerating frequent closed item
sets, 2003.

[14] D. Hepsibha Pearl and S. Selvan, Mining of sequential patterns with
a progressive database efficiently, 5th May 2010.
