(12)
(10) Patent No.: US 8,250,105 B2 (45) Date of Patent: Aug. 21, 2012
2003/0028509 A1 *
2004/0024790 A1 *
(54)
MINING
2006/0106797 A1 *
5/2006
(75) Inventors: Toni Bollinger, Weil der Stadt (DE); Ansgar Dorneich, Holzgerlingen (DE); Christoph Lingenfelder, Herrenberg (DE)
(73) Assignee: International Business Machines
OTHER PUBLICATIONS
Agrawal, R. and R. Srikant. Fast Algorithms for Mining Association Rules. Proceedings of the 20th VLDB Conference, pp. 487-499. Santiago, Chile, 1994.
Ayres, Jay et al. Sequential Pattern Mining Using a Bitmap Representation. Proceedings of the 8th ACM SIGKDD Intl. Conference, pp. 429-435. ACM Press, New York, NY, 2002.
Yin-Fu Huang et al.: Mining generalized association rules using pruning techniques. Data Mining, 2002. Proceedings. 2002 IEEE International Conference on Maebashi City, Japan, Dec. 9-12, 2002, Los Alamitos, CA, USA, IEEE Comput. Soc., US, Dec. 9, 2002, pp. 227-234, XP010805120, ISBN: 0-7695-1754-4.
(Continued)
Primary Examiner: Debbie Le
Assistant Examiner: Anh Tai Tran
Feb. 6, 2007
Prior Publication Data
US 2007/0220030 A1
(30)
(2006.01)
parent. Sets of transactions are formed from the several transactions. The sets of transactions are stored using a computer data structure including: a list of identifiers of different items
in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating
presence of the different items in the set of transactions, said
(56)
6,618,725 B1*
7,548,928 B1* 7,630,996 B1 *
bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the
set of transactions. A data structure for compressing data included in a set of transactions is also provided.
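The data structure named in the abstract can be illustrated with a minimal sketch. It assumes integer item identifiers and a Python integer as the bit field; the names `TransactionSet` and `compress` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class TransactionSet:
    """Compressed form of one set of transactions (hypothetical layout)."""
    item_ids: List[int]    # sorted identifiers of the different items
    number_of_items: int   # number of identifiers in the list
    bit_fields: List[int]  # bit t of bit_fields[j] is set if transaction t contains item_ids[j]


def compress(transactions: Sequence[Set[int]]) -> TransactionSet:
    """Build the item list and one bit field per item, in list order."""
    item_ids = sorted(set().union(*transactions))
    bit_fields = []
    for item in item_ids:
        bits = 0
        for t, transaction in enumerate(transactions):
            if item in transaction:
                bits |= 1 << t  # mark transaction t as containing this item
        bit_fields.append(bits)
    return TransactionSet(item_ids, len(item_ids), bit_fields)


group = compress([{1, 2}, {2, 3}, {2}])
print(group.item_ids)                      # [1, 2, 3]
print([bin(b) for b in group.bit_fields])  # ['0b1', '0b111', '0b10']
```

Because the bit fields follow the order of the identifier list, a pattern can be checked against all transactions of the set at once with bitwise AND operations.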
501
taxonomy parent
804
determining the number of different items and, if defined, taxonomy parents in the set of transactions
807
809
storing a data structure containing the information specified in steps 806-808
OTHER PUBLICATIONS
Dallas, Texas, United States.
Srikant R et al.: Mining Generalized Association Rules. Proceedings of the International Conference on Very Large Data Bases, Sep. 11, 1995, pp. 407-419, XP000671664.
Mohammed J. Zaki et al.: Fast vertical mining using diffsets. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 326-335, XP002431508, Washington, DC, US.
PCT International Searching Authority: Invitation to pay additional fees communication in counterpart PCT application PCT/EP2007/051025, May 25, 2007.
PCT International Search Report and The Written Opinion of the International Search Authority. PCT application: PCT/EP2007/051025. Mailed Nov. 14, 2007.
* cited by examiner
U.S. Patent    Sheet 1 of 22    US 8,250,105 B2
Block diagram labels: STORE (26), input data, MINING PROGRAM (16), compression, compressed input data, SERVER COMPUTER
Further reference numeral: 22
FIG. 1
200
201  Providing filter conditions
     Determining three disjoint sets of filter conditions
203  Determining and evaluating a first set of candidate patterns as {item1} -> {item2} for all possible item pairs
204  [...] candidate pattern and maintaining evaluation information about the parent candidate pattern
     Further parent candidate patterns? (yes)
Further reference numerals: 205, 206, 207, 208
300a
Parent pattern: {items_1}, {items_2}, ..., {items_n}
Child pattern: {items_1}, {items_2}, ..., {items_n, new_item}
FIG. 3A
300a
Parent pattern: {items_1}, {items_2}, ..., {items_n}
311
Child pattern: {items_1}, {items_2}, ..., {items_n}, {new_item}
FIG. 3B
Parent pattern: {items_1}, {item_2}
Child pattern: [...]
FIG. 3C
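The two ways of deriving a child pattern from a parent pattern shown above (adding a new item to the last item set, or appending a new item set) can be sketched as follows. This is only an illustration: representing a pattern as a tuple of sorted item tuples is an assumption, and the function names are hypothetical.

```python
from typing import Tuple

# A sequence pattern: an ordered tuple of item sets, each a sorted tuple of items.
Pattern = Tuple[Tuple[str, ...], ...]


def extend_last_set(parent: Pattern, new_item: str) -> Pattern:
    """Child pattern in which new_item joins the last item set of the parent."""
    *head, last = parent
    return (*head, tuple(sorted(set(last) | {new_item})))


def append_new_set(parent: Pattern, new_item: str) -> Pattern:
    """Child pattern in which {new_item} is appended as a new last item set."""
    return (*parent, (new_item,))


parent = (("a",), ("b", "c"))
print(extend_last_set(parent, "d"))  # (('a',), ('b', 'c', 'd'))
print(append_new_set(parent, "d"))   # (('a',), ('b', 'c'), ('d',))
```

In both cases the child keeps every item set of the parent, which is what allows evaluation information gathered for the parent to be reused for its children.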
Decision: last set singleton AND length of sequence = 2 AND last set extensible? (yes)
Decision: last set extensible? (yes): create up to N sequences by adding 1 item to last set
Decision: sequence extensible?: create up to N [...]
return no candidates
return generated sequences (max. N)
Reference numerals: 401, 402, 404, 405
FIG. 4
500
read data and create binary form
compute initial (simplest) pattern candidates
evaluate candidates
new extensible candidates
evaluate N candidates
remove from stack
extend selected candidate N-fold
Reference numerals: 502, 503, 504, 506, 507, 510
FIG. 5
Column positions: 7 8 9 10 11 12 13 14 15
Bit fields: 0110001101..., 0001101001..., 0110001101...
N patterns
Reference numerals: 611, 612, 613
FIG. 6A
Values and labels: 250, 493, 246 TAs/TAGs, 142 TAs/TAGs
Reference numerals: 617, 621
FIG. 6B
Column positions: 6 7 8 9 10 11 12 13 14 15
FIG. 6C
631
800
     [...] if defined, taxonomy parents
802
     [...] taxonomy parent
804
806  determining the number of different items and, if defined, taxonomy parents in the set of transactions
807  determining identifiers of the different items and possible taxonomy parents
808  determining presence of the different items and possible taxonomy parents in the transactions and presenting this information as bit field information
809  storing a data structure containing the information specified in steps 806-808
FIG. 8
900
determining statistical measures about the items
determining N and M
assigning a unique identifier to each different [...] taxonomy parent
discarding transactions having less items than [...]
ordering the remaining transactions based on their similarity
804a  forming sets of N transactions
step 805
Further reference numerals: 802, 803, 903
FIG. 9A
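The discarding, ordering, and grouping steps listed above can be sketched as follows. The figure text recovered here does not state the discard threshold or the similarity measure, so both are assumptions in this sketch: `min_items` is a hypothetical threshold, and similarity is approximated by sorting transactions lexicographically on their sorted item identifiers so that transactions sharing leading items end up adjacent.

```python
from typing import List, Sequence, Set


def form_transaction_sets(
    transactions: Sequence[Set[int]], n: int, min_items: int
) -> List[List[Set[int]]]:
    """Drop short transactions, order the rest, and cut them into sets of n."""
    # Assumed rule: discard transactions with fewer than min_items items.
    kept = [t for t in transactions if len(t) >= min_items]
    # Assumed similarity proxy: lexicographic order of the sorted item lists.
    kept.sort(key=lambda t: sorted(t))
    # Consecutive (hence similar) transactions form one set of at most n.
    return [kept[i:i + n] for i in range(0, len(kept), n)]


sets_of_two = form_transaction_sets(
    [{3, 4}, {1, 2}, {1}, {1, 2, 5}, {3, 4, 6}], n=2, min_items=2
)
print(sets_of_two)  # [[{1, 2}, {1, 2, 5}], [{3, 4}, {3, 4, 6}]]
```

Grouping similar transactions keeps the number of different items per set small, which in turn keeps each set's identifier list and bit fields short.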
910
determining T and M
[...] taxonomy parent
forming sets of transactions
discarding sets of transactions having less [...]
step 805
Reference numerals: 802, 803, 911
FIG. 9B
1000
1006  activeTAs := findActiveTAs( g, k items from r )
fetch next TA set g
fetch successful? (yes / no)
finished
Further reference numerals: 1001, 1002, 1003, 1004, 1007, 1008, 1009, 1010
FIG. 10A
ITEM itemPosi_g
ITEM itemPosi
= 0;
(136) (137)
if
{ }
(138)
}
else
{
    activeTAs := activeTAs & g.bitField[itemPosi_g];
    itemPosi_g := itemPosi_g + 1;
    itemPosi := itemPosi + 1;
    if (itemPosi = numberOfItems)
        return activeTAs;
return 0;
FIG. 10B
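The surviving lines of this listing (AND-ing bit fields while two positions advance, returning the result once all items are matched, returning 0 otherwise) suggest a merge over two sorted identifier lists. The sketch below is one way to complete that idea. The comparison conditions are elided in the listing, so the ones used here are assumptions, as are the function signature and the `all_bits` argument.

```python
from typing import Sequence


def find_active_tas(
    item_ids: Sequence[int],    # sorted identifiers stored for the TA set
    bit_fields: Sequence[int],  # bit_fields[j] belongs to item_ids[j]
    rule_items: Sequence[int],  # sorted identifiers of the items to match
    all_bits: int,              # mask with one bit set per transaction in the set
) -> int:
    """Return a bit mask of the transactions that contain every rule item."""
    if not rule_items:
        return all_bits
    active = all_bits
    pos_g = 0  # position in the TA set's item list
    pos = 0    # position in the rule's item list
    while pos_g < len(item_ids):
        if item_ids[pos_g] < rule_items[pos]:
            pos_g += 1  # assumed condition: skip items the rule does not use
        elif item_ids[pos_g] > rule_items[pos]:
            return 0    # assumed condition: a rule item is missing from the set
        else:
            active &= bit_fields[pos_g]
            pos_g += 1
            pos += 1
            if pos == len(rule_items):
                return active
    return 0


ids, fields = [1, 2, 3], [0b001, 0b111, 0b010]
print(find_active_tas(ids, fields, [1, 2], 0b111))  # 1
print(find_active_tas(ids, fields, [1, 3], 0b111))  # 0
```

Each set bit in the result marks one transaction of the set that contains all of the requested items.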
Listing lines (150), (152), (153), (154), (156), (160)
return nbItems;
FIG. 10C
evaluation
all common items found in TA set? (yes)
determine TAs containing all common items
current TA contains all items?
last TA in TA set? (yes)
Reference numerals: 1101, 1102, 1103, 1105, 1107, 1108, 1109, 1110
FIG. 11
1200
1201  start with candidate list l, containing N rule candidates with k items, k-1 of them identical for all candidates
1003  l.hasParent [...]
1004  fetch first TA set g
      g.nbDiffItems < k? (no / yes)
      activeTAs [...]? (no / yes)
1205  i := 0; itemPosi := 0
1206  activeTAs_i := activeTAs & findActiveTAs( g, addedItem[i], itemPosi )
1207  absSupport[ i ] := absSupport[ i ] + count1Bits( activeTAs_i )
1208  i := i + 1
1010  fetch next TA set g
      fetch successful? (yes): finished
Further reference numerals: 1002a, 1005, 1008a, 1009, 1011, 1202, 1203, 1204
FIG. 12A
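The loop in this figure evaluates N rule candidates that share k-1 items: the shared items are matched once per TA set, and each candidate's added item is then AND-ed in and its 1 bits counted. A minimal sketch under stated assumptions follows. Each TA set is represented here as a hypothetical `(item_ids, bit_fields, n_tas)` triple, a dictionary lookup stands in for `findActiveTAs`, and the early exit when no transaction is active is an assumption, since that decision box is not legible.

```python
from typing import List, Sequence, Tuple

TaSet = Tuple[List[int], List[int], int]  # (item_ids, bit_fields, number of TAs)


def count_supports(
    ta_sets: Sequence[TaSet], common_items: Sequence[int], added_items: Sequence[int]
) -> List[int]:
    """Absolute support of each candidate common_items + [added_items[i]]."""
    k = len(common_items) + 1
    abs_support = [0] * len(added_items)
    for item_ids, bit_fields, n_tas in ta_sets:
        if len(item_ids) < k:
            continue  # the set holds fewer than k different items
        lookup = dict(zip(item_ids, bit_fields))
        active = (1 << n_tas) - 1
        for item in common_items:
            active &= lookup.get(item, 0)  # match the shared items once
        if active == 0:
            continue  # assumed early exit: no transaction has all shared items
        for i, item in enumerate(added_items):
            active_i = active & lookup.get(item, 0)
            abs_support[i] += bin(active_i).count("1")  # count the 1 bits
    return abs_support


ta_sets = [([1, 2, 3], [0b001, 0b111, 0b010], 3), ([2, 3], [0b11, 0b01], 2)]
print(count_supports(ta_sets, common_items=[2], added_items=[1, 3]))  # [1, 2]
```

Matching the shared items only once per TA set is what makes evaluating the N candidates together cheaper than evaluating each one separately.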
while {
    if
    else if
    {
        return 0;
    }
    else
    {
        activeTAs := activeTAs & g.bitField[itemPosi_g];
        itemPosi_g := itemPosi_g + 1;
        itemPosi := itemPosi + 1;
        if (itemPosi = numberOfItems)
            return activeTAs;
    }
}
return 0;
FIG. 12B
ITEMID itemID,
int&
{
    while (itemPosi < g.numberOfDifferentItems())
    {
        if (g.itemID[itemPosi] < itemID)
        {
            itemPosi := itemPosi + 1;
        }
        else if (g.itemID[itemPosi] > itemID)
        {
            itemPosi := g.numberOfDifferentItems();
        }
        else { return g.bitField[itemPosi]; }
    }
    return 0;
}
FIG. 12C
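The single-item lookup in this listing scans a sorted identifier list from a position that the caller keeps between calls. The sketch below illustrates that idea with a hypothetical `Cursor` object in place of the reference parameter. It deliberately differs from the listing in one respect: on a miss the listing moves the position to the end of the list, whereas this sketch leaves it in place so that a later lookup for a larger identifier can continue from there. It assumes that lookups are made in ascending identifier order.

```python
from typing import Sequence


class Cursor:
    """Mutable scan position shared across successive lookups (hypothetical helper)."""

    def __init__(self) -> None:
        self.pos = 0


def find_active_tas_single(
    item_ids: Sequence[int], bit_fields: Sequence[int], item_id: int, cursor: Cursor
) -> int:
    """Return the bit field of item_id, or 0 if the TA set does not contain it."""
    while cursor.pos < len(item_ids):
        current = item_ids[cursor.pos]
        if current < item_id:
            cursor.pos += 1  # keep scanning forward
        elif current > item_id:
            return 0         # item absent; position is kept for later lookups
        else:
            return bit_fields[cursor.pos]
    return 0


cursor = Cursor()
ids, fields = [2, 5, 9], [0b001, 0b010, 0b100]
results = [find_active_tas_single(ids, fields, item, cursor) for item in (2, 4, 5, 9, 10)]
print(results)  # [1, 0, 2, 4, 0]
```

With ascending lookups, the identifier list of a TA set is traversed at most once for all N candidates.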