
ONLINE DATA STREAM MINING OF RECENT FREQUENT ITEMSETS BASED ON SLIDING WINDOW MODEL IEEE 2008 Conference

Presented by: Baha Nawafleh

20093173016

Table of Contents
• Introduction.
• Literature Review.
• Problem definition.
• MRFI-SW algorithm (window initialization phase, window sliding phase, mining frequent itemsets phase).
• Experiment
• Conclusion
• Questions??

Introduction
• A data stream is a massive sequence of data elements continuously generated at a rapid rate. Unlike traditional static datasets, data streams are continuous and unbounded, and their data distribution changes with time.
• Many applications generate large volumes of streaming data in real time, such as sensor data from sensor networks, online transaction flows in retail chains, and Web records and clickstreams in Web applications.
• Data streams can be classified into offline data streams [1] and online data streams [2].

Cont..
• [1] The target application domains of offline data streams involve bulk additions of new transactions, as in a data warehouse system.
• [2] Online data streams are characterized by data updated in real time. The elements of an online data stream arrive one by one, such as the continuously generated transactions in a network monitoring system.

Literature Review
• Researchers have proposed many algorithms for mining frequent itemsets in data streams.
• Research on mining frequent itemsets in data streams can be divided into three categories:
  • the landmark window model.
  • the time-fading model.
  • the sliding window model.
• Manku and Motwani developed two single-pass algorithms, Sticky Sampling and Lossy Counting. These algorithms mine frequent items over offline data streams under the landmark window model.

Cont..
• SWFI-stream is an algorithm for mining frequent itemsets in online data streams under the transaction-sensitive sliding window model.
• An incremental mining algorithm has also been proposed to mine frequent itemsets in offline data streams with a time-sensitive sliding window.

The purpose of this paper:


• MRFI-SW: Mining Recent Frequent Itemsets over an online data stream with a Sliding Window.

Problem definition

• Let {i1, i2, ..., im} be a set of literals, called items.
• A transaction T = {id, x1 x2 ... xn}.
• A transaction data stream DS = {T1, T2, ..., TN} is a continuous sequence of transactions.
• A data stream can also be denoted as DS = {W1, W2, ..., Wm}, where each basic window is a transaction-sensitive sliding window (SW).
• w is the size of the transaction-sensitive sliding window.
• s is a user-defined minimum support threshold in the range [0, 1].
• The support of an itemset X over SW is the number of transactions in SW that contain X as a subset.
• If the support of X is at least s*w, X is called a frequent itemset (FI).
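The support definition can be illustrated with a short Python sketch. The function names `support` and `is_frequent` and the three-transaction window are hypothetical, chosen only for illustration. The comparison is written as `>=`, which is an assumption about the intended threshold test.

```python
from typing import Iterable, List


def support(itemset: Iterable[str], window: List[List[str]]) -> int:
    """Count the transactions in the window that contain every item of itemset."""
    target = frozenset(itemset)
    return sum(1 for transaction in window if target <= frozenset(transaction))


def is_frequent(itemset: Iterable[str], window: List[List[str]], s: float) -> bool:
    """An itemset is frequent when its support reaches s * w (w = window size).

    Assumption: the threshold test is 'greater than or equal to'.
    """
    return support(itemset, window) >= s * len(window)


# Hypothetical sliding window holding w = 3 transactions.
window = [["b", "c", "e"], ["a", "b", "c", "e"], ["b", "e"]]

print(support(["b", "e"], window))      # number of transactions containing both b and e
print(is_frequent(["a"], window, 0.6))  # a appears once, threshold is 0.6 * 3 = 1.8
```

This version rescans the window for every query; the bit-order representation introduced on the following slides avoids that rescan.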

MRFI-SW algorithm
• The proposed MRFI-SW algorithm consists of three phases:
  • the window initialization phase,
  • the window sliding phase, and
  • the mining frequent itemsets phase.

window initialization phase.


• The window initialization phase is activated when the first transaction arrives. The phase lasts until the transaction-sensitive sliding window is full.
• When the sliding window is full, the items in its w transactions are transformed into bit-order representations.
• Each entry has the form (bit, order); the sequence of entries for item X is denoted R(X).
• If item X is in the i-th transaction of the current sliding window, the bit of the i-th entry of R(X) is set to 1 and its order records the position of X within that transaction. Otherwise the i-th entry is set to 0 (R(X)_bit = R(X)_order = 0).
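A minimal Python sketch of the bit-order transformation follows. The name `build_bit_order` and the `(0, 0)` encoding of an absent item are assumptions made for illustration. The sample transactions are inferred from the bit-order sequences that the slides list for window SW2; they are not quoted from Table 1.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (bit, order); (0, 0) means the item is absent


def build_bit_order(window: List[List[str]]) -> Dict[str, List[Entry]]:
    """Map each item X to R(X), one (bit, order) entry per transaction in the window."""
    w = len(window)
    rep: Dict[str, List[Entry]] = {}
    for i, transaction in enumerate(window):
        # order is the 1-based position of the item inside its transaction
        for position, item in enumerate(transaction, start=1):
            rep.setdefault(item, [(0, 0)] * w)[i] = (1, position)
    return rep


# Transactions inferred from the R(X) values quoted for SW2 (assumption).
rep = build_bit_order([["b", "c", "e"], ["a", "b", "c", "e"], ["b", "e"]])
for item in sorted(rep):
    print(item, rep[item])
```

Storing one entry per window slot keeps every R(X) the same length, so the later shift and AND operations can work position by position.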

Cont..

• For example, there are three transactions in SW1: T1, T2, and T3. The bit-order representations of the items in SW1 are shown in Table 1.

Cont..


Table 1. Bit-order of items in window initialization phase

window sliding phase


• The window sliding phase is activated when the sliding window becomes full. In this phase, a newly arriving transaction is inserted into the sliding window, and the oldest transaction in the current sliding window is removed.
• Because the bit-order representation is a sequence, we use a left-shift operation on it.
• To reduce memory usage, a pruning operation is executed after the window slides. It prunes the entry of an item when its bit-order sequence is all 0: if item X does not appear in any transaction of the current sliding window, i.e. sup(X)SW = 0, the entry R(X) is pruned.

Cont..
• For instance, in Table 1, when the fourth transaction T4 arrives, the first transaction T1 must be removed from the current SW. The bit-order sequences of the items in SW1 are left-shifted.
• R(a) is modified from <(1, 1), 0, (1, 1)> to <0, (1, 1), 0>.
• Similarly:
  • R(c) = <(1, 2), (1, 3), 0>
  • R(d) = <0, 0, 0>
  • R(b) = <(1, 1), (1, 2), (1, 1)>
  • R(e) = <(1, 3), (1, 4), (1, 2)>
• Note that item d is dropped, because R(d) = <0, 0, 0>, i.e. sup(d)SW2 = 0.

Algorithm 1:
Output: updated bit-order sequence

Initialize sliding window and bit-order sequence;
While a new transaction Ti arrives in SW do
    If (SW is full)
        Transform all items in SW to bit-order sequences;
    Else
        Do left_shift operation on the bit-order sequences of all items
        For each item X arriving in SW
            Transform X to its bit sequence representation
        End for
    End if
    For each R(X) in SW
        If SUM(R(X).bit) = 0
            Drop X from SW
        End if
    End for
End while
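The window sliding step (left shift, insert the new transaction, prune empty entries) can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `slide` is hypothetical, absent entries are encoded as `(0, 0)`, and the T1 column of `sw1` for items c and d is assumed, since the slides only give R(a) for SW1.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (bit, order); (0, 0) means the item is absent


def slide(rep: Dict[str, List[Entry]], new_transaction: List[str],
          w: int) -> Dict[str, List[Entry]]:
    """Left-shift every R(X), append the entry for the new transaction, prune empties."""
    positions = {item: pos for pos, item in enumerate(new_transaction, start=1)}
    updated: Dict[str, List[Entry]] = {}
    for item in set(rep) | set(positions):
        old = rep.get(item, [(0, 0)] * w)
        newest = (1, positions[item]) if item in positions else (0, 0)
        shifted = old[1:] + [newest]           # drop the oldest entry, add the newest
        if any(bit for bit, _ in shifted):     # pruning: keep X only if sup(X) > 0
            updated[item] = shifted
    return updated


# SW1 = T1, T2, T3. Only R(a) is quoted on the slides; the T1 entries of
# c and d are assumptions made so that d disappears once T1 is removed.
sw1 = {
    "a": [(1, 1), (0, 0), (1, 1)],
    "c": [(1, 2), (1, 2), (1, 3)],
    "d": [(1, 3), (0, 0), (0, 0)],
    "b": [(0, 0), (1, 1), (1, 2)],
    "e": [(0, 0), (1, 3), (1, 4)],
}
sw2 = slide(sw1, ["b", "e"], w=3)  # T4 is inferred to contain b then e
for item in sorted(sw2):
    print(item, sw2[item])
```

Under these assumptions the result matches the sequences listed for SW2, and item d no longer has an entry.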


Mining frequent itemsets phase


• The mining frequent itemsets phase is activated when the bit-order sequences have been updated and the frequent itemsets are requested.
• We proposed a method to generate frequent k-itemsets (itemsets with k items) from the known frequent (k-1)-itemsets. The method is based on the Apriori property (if a pattern is frequent, all of its sub-patterns are also frequent).
• We use a SUM operation on the bits of each entry to compute the support of items and find the frequent 1-itemsets in the current SW.
• The algorithm then uses an AND operation on the bits of each entry to find 2-itemsets. The support of each 2-itemset is computed, and the itemsets whose support is below the user-defined threshold are pruned. The process terminates when no new (k+1)-itemsets are generated.

Cont..
• For instance, consider the DS in Table 1. Let the minimum support threshold s be 0.6.
• Hence, an itemset X is frequent if sup(X) ≥ 0.6*3 = 1.8.
• We discuss the steps of mining frequent itemsets in SW2. First, the MRFI-SW algorithm finds the frequent 1-itemsets by computing the support of each item:
  • R(a) = <0, (1, 1), 0>, i.e., sup(a) = 1
  • R(c) = <(1, 2), (1, 3), 0>, i.e., sup(c) = 2
  • R(b) = <(1, 1), (1, 2), (1, 1)>, i.e., sup(b) = 3
  • R(e) = <(1, 3), (1, 4), (1, 2)>, i.e., sup(e) = 3
• So item a is not frequent, because its support is 1, which is below 1.8.


Algorithm 2:
Output: a set of frequent itemsets.

Find frequent 1-itemsets FI1
For (k = 2; FIk-1 ≠ null; k++)
    Do AND operation on R(FIk-1).bit to find Candidate FIk
    For each Candidate FIk do
        Do bitwise SUM operation on R(Candidate FIk)
        If SUM(R(Candidate FIk).bit) ≥ s*w
            If k = 2
                Scan R(Candidate FIk).order
                Output FIk
            End if
        End if
    End for
End for
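The level-wise mining loop can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the function name `mine` is hypothetical, only the bit field of each entry is used (the slides do not spell out how the order field is consumed, so it is ignored here), and `sw2` repeats the bit-order sequences quoted for SW2 with `(0, 0)` standing for an absent item.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Entry = Tuple[int, int]  # (bit, order); (0, 0) means the item is absent


def mine(rep: Dict[str, List[Entry]], s: float, w: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset with its support, using SUM and AND on bit vectors."""
    min_count = s * w
    # Level 1: SUM of the bits of R(X) is the support of item X.
    level = {
        frozenset([item]): [bit for bit, _ in seq]
        for item, seq in rep.items()
        if sum(bit for bit, _ in seq) >= min_count
    }
    frequent: Dict[FrozenSet[str], int] = {}
    while level:
        frequent.update({itemset: sum(bits) for itemset, bits in level.items()})
        next_level: Dict[FrozenSet[str], List[int]] = {}
        for (x, bx), (y, by) in combinations(level.items(), 2):
            candidate = x | y
            if len(candidate) != len(x) + 1 or candidate in next_level:
                continue  # not a (k+1)-itemset, or already evaluated
            # AND of the two bit vectors marks transactions containing both parts.
            anded = [p & q for p, q in zip(bx, by)]
            if sum(anded) >= min_count:
                next_level[candidate] = anded
        level = next_level
    return frequent


sw2 = {
    "a": [(0, 0), (1, 1), (0, 0)],
    "c": [(1, 2), (1, 3), (0, 0)],
    "b": [(1, 1), (1, 2), (1, 1)],
    "e": [(1, 3), (1, 4), (1, 2)],
}
result = mine(sw2, s=0.6, w=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because each candidate's support is computed exactly from the ANDed bits, a candidate with an infrequent subset is rejected by the threshold test without a separate subset check.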


Experiment
• Our algorithm was written in C and compiled using Microsoft Visual C++ 6.0. We generated online data streams using the IBM synthetic data generator.

Figure 1. Memory usages in window initialization

Figure 2. Memory usages in window sliding

Figure 3. Memory usages in mining frequent item sets

Figure 4. The processing time of algorithm

Conclusion
• Mining online data streams is an interesting and challenging research field.
• The characteristics of data streams make many traditional mining algorithms inapplicable.
• This paper proposed an efficient three-phase algorithm for mining recent frequent itemsets over an online data stream with a transaction-sensitive sliding window.
• Experiments show that the proposed algorithm not only attains highly accurate mining results, but also runs significantly faster and consumes less memory than the SWFI-stream algorithm for mining recent frequent itemsets over online data streams.

Questions??