
US008250105B2

(12) United States Patent
     Bollinger et al.

(10) Patent No.: US 8,250,105 B2
(45) Date of Patent: Aug. 21, 2012

(54) INPUT DATA STRUCTURE FOR DATA MINING

(75) Inventors: Toni Bollinger, Weil der Stadt (DE); Ansgar Dorneich, Holzgerlingen (DE); Christoph Lingenfelder, Herrenberg (DE)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 835 days.

(21) Appl. No.: 11/671,623

(22) Filed: Feb. 6, 2007

(65) Prior Publication Data
US 2007/0220030 A1    Sep. 20, 2007

(30) Foreign Application Priority Data
Mar. 14, 2006 (EP) ........ 06111140
Oct. 4, 2006 (EP) ........ 06121742

(51) Int. Cl.
G06F 7/00 (2006.01)

(52) U.S. Cl. ........ 707/793; 707/796; 707/803; 707/811

(58) Field of Classification Search ........ None
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
6,618,725 B1 *      9/2003   Fukuda et al. ........ 707/6
6,804,664 B1 *      10/2004  Hartman et al. ........ 1/1
7,548,928 B1 *      6/2009   Dean et al. ........ 1/1
7,630,996 B1 *      12/2009  Hershkovich et al. ........ 1/1
2003/0028509 A1 *   2/2003   sail et al. ........ 707/1
2004/0024790 A1 *   2/2004   Everett ........ 707/200
2006/0106797 A1 *   5/2006   Srinivasa et al. ........ 707/6

OTHER PUBLICATIONS
Agrawal, R. and R. Srikant. Fast Algorithms for Mining Association Rules. Proceedings of the 20th VLDB Conference, pp. 487-499. Santiago, Chile, 1994.
Ayres, Jay et al. Sequential Pattern Mining Using a Bitmap Representation. Proceedings of the 8th ACM SIGKDD Intl. Conference, pp. 429-435. ACM Press, New York, NY, 2002.
Yin-Fu Huang et al.: Mining generalized association rules using pruning techniques. Data Mining, 2002. Proceedings. 2002 IEEE International Conference on Maebashi City, Japan, Dec. 9-12, 2002, Los Alamitos, CA, USA, IEEE Comput. Soc. US, Dec. 9, 2002, pp. 227-234, XP010805120, ISBN: 0-7695-1754-4.

(Continued)

Primary Examiner: Debbie Le
Assistant Examiner: Anh Tai Tran
(74) Attorney, Agent, or Firm: Mollborn Patents, Inc.; Fredrik Mollborn

(57) ABSTRACT
Methods and apparatus, including computer program products, implementing and using techniques for compressing data included in several transactions. Each transaction has at least one item. A unique identifier is assigned to each different item and, if taxonomy is defined, to each different taxonomy parent. Sets of transactions are formed from the several transactions. The sets of transactions are stored using a computer data structure including: a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions. A data structure for compressing data included in a set of transactions is also provided.

32 Claims, 22 Drawing Sheets
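The three parts of the data structure named in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the name `TransactionSet`, the 32-bit identifiers, and the choice of one 64-bit field per item (so at most 64 transactions per set). The abstract itself fixes none of these.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names and widths; only the three parts come from the abstract.
struct TransactionSet {
    // Part 1: identifiers of the different items in this set, ascending.
    std::vector<std::uint32_t> itemIDs;
    // Part 3: one 64-bit field per listed item, in the same order as itemIDs;
    // bit t is set when transaction t of this set contains that item.
    std::vector<std::uint64_t> bitFields;

    // Part 2: number of identifiers in the list.
    std::size_t numberOfDifferentItems() const { return itemIDs.size(); }

    // True when transaction `ta` (0..63) contains the item at list position `pos`.
    bool contains(std::size_t pos, unsigned ta) const {
        assert(pos < itemIDs.size() && ta < 64);
        return ((bitFields[pos] >> ta) & 1u) != 0;
    }
};
```

Because the bit fields follow the order of the identifier list, a pattern can be checked against all transactions of a set with a few bitwise AND operations instead of one lookup per transaction.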


[Representative drawing: flowchart of method 800]

801 determining statistical measures about items and, if defined, taxonomy parents
802 discarding non-frequent items that have no frequent taxonomy parents
803 assigning a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent
804 forming sets of transactions
806 determining the number of different items and, if defined, taxonomy parents in the set of transactions
807 determining identifiers of the different items and possible taxonomy parents
808 determining presence of the different items and possible taxonomy parents in the transactions and presenting this information as bit field information
809 storing the data structure containing the information specified in steps 806-808

US 8,250,105 B2
Page 2

OTHER PUBLICATIONS
Pradeep Shenoy et al.: Turbo-charging vertical mining of large databases. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 22-33, XP002431507, Dallas, Texas, United States.
Srikant R. et al.: Mining Generalized Association Rules. Proceedings of the International Conference on Very Large Data Bases, Sep. 11, 1995, pp. 407-419, XP000671664.
Mohammed J. Zaki et al.: Fast vertical mining using diffsets. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 326-335, XP002431508, Washington, DC, US.
PCT International Searching Authority: Invitation to pay additional fees communication in counterpart PCT application PCT/EP2007/051025, May 25, 2007.
PCT International Search Report and The Written Opinion of the International Search Authority. PCT application: PCT/EP2007/051025. Mailed Nov. 14, 2007.

* cited by examiner

U.S. Patent    Aug. 21, 2012    Sheet 1 of 22    US 8,250,105 B2

[Block diagram. Labeled elements: SERVER COMPUTER, STORE, MINING PROGRAM INTERFACE, MINING PROGRAM (16), input data, compression, compressed input data. Reference numerals 22 and 26 also appear in the drawing.]

FIG. 1

U.S. Patent    Aug. 21, 2012    Sheet 2 of 22    US 8,250,105 B2

[Flowchart of method 200]

201 Providing filter conditions
202 Determining three disjoint sets of filter conditions
203 Determining and evaluating a first set of candidate patterns as {item1} -> {item2} for all possible item pairs
204 Selecting an evaluated candidate pattern as a parent candidate pattern and maintaining evaluation information about the parent candidate pattern
205 Generating child candidate patterns by extending the parent candidate pattern and taking into account the first set of filter conditions
206 Evaluating child candidate patterns in sets of similar candidate patterns taking into account the second set of filter conditions. Add positively evaluated child patterns to set of candidates and/or result patterns
207 Maintain evaluation information about the positively evaluated child candidate patterns
208 Further parent candidate patterns? (decision; a "yes" branch is shown)

FIG. 2

U.S. Patent    Aug. 21, 2012    Sheet 3 of 22    US 8,250,105 B2

Parent pattern (300a): {items_1}, {items_2}, ..., {items_n}
Child pattern: {items_1}, {items_2}, ..., {items_n, new_item}

FIG. 3A

Parent pattern (300a): {items_1}, {items_2}, ..., {items_n}
Child pattern (311): {items_1}, {items_2}, ..., {items_n}, {new_item}

FIG. 3B

Parent pattern: {items_1}, {item_2}
Child pattern: {items_1, new_item}, {item_2}

FIG. 3C
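The three parent-to-child extensions drawn in FIGS. 3A to 3C can be written down as a minimal sketch. The representation of a pattern as a vector of item sets, the integer item type, and the three function names are assumptions introduced for illustration; the drawings only show the patterns before and after each extension.

```cpp
#include <cassert>
#include <vector>

using ItemSet = std::vector<int>;      // one {items_i} of the sequence (hypothetical type)
using Pattern = std::vector<ItemSet>;  // {items_1}, ..., {items_n}

// FIG. 3A style: add the new item to the last item set.
Pattern extendLastSet(Pattern p, int newItem) {
    assert(!p.empty());
    p.back().push_back(newItem);
    return p;
}

// FIG. 3B style: append a new singleton set {new_item} to the sequence.
Pattern appendSingleton(Pattern p, int newItem) {
    p.push_back(ItemSet{newItem});
    return p;
}

// FIG. 3C style: add the new item to the first item set.
Pattern extendFirstSet(Pattern p, int newItem) {
    assert(!p.empty());
    p.front().push_back(newItem);
    return p;
}
```

Each function takes the parent by value and returns the child, so the parent stays available for generating further children.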

U.S. Patent    Aug. 21, 2012    Sheet 4 of 22    US 8,250,105 B2

[Flowchart. Reference numerals 401, 402, 404, 405 and 408 are legible in the drawing.]

Decision: last set singleton AND length of sequence = 2 AND 1st set extensible? If yes: create up to N sequences by adding 1 item to 1st set
Decision: last set extensible? If yes: create up to N sequences by adding 1 item to last set
Decision: sequence extensible? If yes: create up to N sequences by appending a singleton set to sequence
Outcomes: return no candidates; return generated sequences (max. N)

FIG. 4

U.S. Patent    Aug. 21, 2012    Sheet 5 of 22    US 8,250,105 B2

[Flowchart of method 500. Reference numeral 510 and a box labeled "Stack" also appear in the drawing.]

read data and create binary form
502 compute initial statistics (e.g. pair support)
503 compute initial (simplest) pattern candidates
504 evaluate candidates
506 push candidates and their histories on stack
remove from stack
507 extend selected candidate N-fold
evaluate N candidates
new extensible candidates

FIG. 5

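U.S. Patent    Aug. 21, 2012    Sheet 6 of 22    US 8,250,105 B2

[Bit field diagram for N_TOT TAs / TAGs. Bit positions 0 to 15 are numbered along the top. Rows for N patterns, with reference numerals 611, 612 and 613:]

0110001101...
0001101001...
0110001101...

FIG. 6A

[Diagram with the labels 250, 493 and 617, and three groups labeled "246 TAs/TAGs", "142 TAs/TAGs" and "123 TAs/TAGs" with reference numerals 621, 622 and 623.]

FIG. 6B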

U.S. Patent    Aug. 21, 2012    Sheet 7 of 22    US 8,250,105 B2

[Bit field diagram for N_TOT TAs / TAGs. Bit positions 0 to 15 are numbered along the top. Reference numeral 631 appears in the drawing.]

FIG. 6C

U.S. Patent    Aug. 21, 2012    Sheet 9 of 22    US 8,250,105 B2

[Flowchart of method 800]

801 determining statistical measures about items and, if defined, taxonomy parents
802 discarding non-frequent items that have no frequent taxonomy parents
803 assigning a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent
804 forming sets of transactions
806 determining the number of different items and, if defined, taxonomy parents in the set of transactions
807 determining identifiers of the different items and possible taxonomy parents
808 determining presence of the different items and possible taxonomy parents in the transactions and presenting this information as bit field information
809 storing the data structure containing the information specified in steps 806-808

FIG. 8
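Steps 806 to 808 can be illustrated for a single set of transactions with the following minimal sketch. It assumes that identifiers have already been assigned (step 803), that a set holds at most 64 transactions so each bit field fits in one 64-bit word, and that identifiers are stored in ascending order. The names `TransactionSet` and `buildTransactionSet` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct TransactionSet {                    // hypothetical layout
    std::vector<std::uint32_t> itemIDs;    // step 807: identifiers, ascending
    std::vector<std::uint64_t> bitFields;  // step 808: one bit field per identifier
};

// Builds one set from at most 64 transactions, each given as a list of
// item identifiers. Bit t of a bit field stands for transaction t.
TransactionSet buildTransactionSet(
        const std::vector<std::vector<std::uint32_t>>& transactions) {
    assert(transactions.size() <= 64);
    // std::map keeps its keys ordered, so the identifier list comes out ascending.
    std::map<std::uint32_t, std::uint64_t> presence;
    for (std::size_t t = 0; t < transactions.size(); ++t) {
        for (std::uint32_t id : transactions[t]) {
            presence[id] |= (std::uint64_t{1} << t);  // mark transaction t
        }
    }
    TransactionSet set;
    for (const auto& entry : presence) {
        set.itemIDs.push_back(entry.first);
        set.bitFields.push_back(entry.second);
    }
    // Step 806: the number of different items is set.itemIDs.size().
    return set;
}
```

For the transactions {5, 2}, {2} and {9, 5}, the sketch yields the identifier list 2, 5, 9 with the bit fields 011, 101 and 100 (bit 0 on the right).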

U.S. Patent    Aug. 21, 2012    Sheet 10 of 22    US 8,250,105 B2

[Flowchart of method 900]

determining statistical measures about the items and possible taxonomy parents
802 discarding non-frequent items that have no frequent taxonomy parents
determining N and M
803 assigning a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent
discarding transactions having less items than a user-specified minimum rule length
903 ordering the remaining transactions based on their similarity
804a forming sets of N transactions
step 805

FIG. 9A
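The last three boxes of FIG. 9A (discarding short transactions, ordering by similarity, forming sets of N transactions) can be sketched as follows. The drawing does not say how similarity is measured, so the sketch substitutes a simple stand-in: each transaction's items are sorted and the transactions are then sorted lexicographically, which places transactions with the same leading items next to each other. That ordering, the type alias `Transaction` and the function name `formSets` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Transaction = std::vector<std::uint32_t>;  // item identifiers of one transaction

// Drops transactions shorter than minRuleLength, orders the rest, and cuts
// them into groups of at most n transactions (the last group may be smaller).
std::vector<std::vector<Transaction>> formSets(std::vector<Transaction> tas,
                                               std::size_t minRuleLength,
                                               std::size_t n) {
    assert(n > 0);
    tas.erase(std::remove_if(tas.begin(), tas.end(),
                             [minRuleLength](const Transaction& t) {
                                 return t.size() < minRuleLength;
                             }),
              tas.end());
    // Stand-in similarity ordering (an assumption, not taken from the drawing).
    for (Transaction& t : tas) std::sort(t.begin(), t.end());
    std::sort(tas.begin(), tas.end());
    std::vector<std::vector<Transaction>> sets;
    for (std::size_t i = 0; i < tas.size(); i += n) {
        std::size_t end = std::min(tas.size(), i + n);
        sets.emplace_back(tas.begin() + i, tas.begin() + end);
    }
    return sets;
}
```

Grouping similar transactions keeps the number of different items per set small, which in turn keeps the identifier list and the number of bit fields per set short.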

U.S. Patent    Aug. 21, 2012    Sheet 11 of 22    US 8,250,105 B2

[Flowchart of method 910]

determining statistical measures about the items and possible taxonomy parents
802 discarding non-frequent items that have no frequent taxonomy parents
911 determining T and M
803 assigning a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent
forming sets of transactions
discarding sets of transactions having less remaining items than a predefined number of items or less transactions than a predefined number of transactions
step 805

FIG. 9B

U.S. Patent    Aug. 21, 2012    Sheet 12 of 22    US 8,250,105 B2

[Flowchart of method 1000. Reference numerals 1001 to 1011 appear in the drawing. Legible step text, in flow order:]

start with candidate rule r containing k different items
r.hasParent? no: fetch first TA set g; yes: fetch parent's first active TA set g
1006 activeTAs := findActiveTAs( g, k items from r )
absSupport[ r ] := absSupport[ r ] + count1Bits( activeTAs )
r.hasParent? no: fetch next TA set g; yes: fetch parent's next active TA set g
fetch successful? yes: continue with the fetched TA set; no: finished

FIG. 10A

U.S. Patent    Aug. 21, 2012    Sheet 13 of 22    US 8,250,105 B2

BITFIELD_N findActiveTAs( TAGROUP g, ITEM[] itemIDs, ITEM numberOfItems )
{
    BITFIELD_N activeTAs := all N bits set to 1;
    ITEM itemPosi_g := 0;
    ITEM itemPosi := 0;

    while (itemPosi_g < g.numberOfDifferentItems())
    {
        if (g.itemID[itemPosi_g] < itemIDs[itemPosi])
        {
            itemPosi_g := itemPosi_g + 1;
        }
        else if (g.itemID[itemPosi_g] > itemIDs[itemPosi])
        {
            return 0;
        }
        else
        {
            activeTAs := activeTAs & g.bitField[itemPosi_g];
            itemPosi_g := itemPosi_g + 1;
            itemPosi := itemPosi + 1;
            if (itemPosi = numberOfItems)
                return activeTAs;
        }
    }
    return 0;
}

FIG. 10B
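The pseudocode of FIG. 10B walks two ascending identifier lists in parallel and ANDs the bit fields of the matching items. A compilable sketch of the same idea is given below. It fixes N at 64, replaces TAGROUP by a hypothetical struct `TaGroup`, passes the wanted identifiers as a vector instead of an array plus count, and returns "all bits set" for an empty identifier list, a case the figure does not cover.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TaGroup {                          // hypothetical stand-in for TAGROUP
    std::vector<std::uint32_t> itemID;    // ascending identifiers
    std::vector<std::uint64_t> bitField;  // bitField[p] belongs to itemID[p]
};

// Returns a mask of the transactions in g that contain every identifier in
// itemIDs (which must be ascending), or 0 as soon as one is missing from g.
std::uint64_t findActiveTAs(const TaGroup& g,
                            const std::vector<std::uint32_t>& itemIDs) {
    assert(g.itemID.size() == g.bitField.size());
    std::uint64_t activeTAs = ~std::uint64_t{0};  // all 64 bits set
    if (itemIDs.empty()) return activeTAs;        // assumption: empty list matches all
    std::size_t posG = 0, pos = 0;
    while (posG < g.itemID.size()) {
        if (g.itemID[posG] < itemIDs[pos]) {
            ++posG;                        // item of g not wanted, skip it
        } else if (g.itemID[posG] > itemIDs[pos]) {
            return 0;                      // wanted item is absent from g
        } else {
            activeTAs &= g.bitField[posG]; // keep TAs that also hold this item
            ++posG;
            ++pos;
            if (pos == itemIDs.size()) return activeTAs;
        }
    }
    return 0;                              // ran out of items in g
}
```

Since both lists are sorted, one pass over the identifier list of the set is enough, and the early `return 0` skips the set as soon as a required item is known to be missing.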

U.S. Patent    Aug. 21, 2012    Sheet 14 of 22    US 8,250,105 B2

/* nb1Bits[b] holds the number of 1 bits in the byte value b */
const unsigned char nb1Bits[256] = { ... };

inline int count1Bits( unsigned long long bitField )
{
    /* view the 64-bit field as 8 bytes and sum the per-byte bit counts */
    unsigned char* p = (unsigned char*) (&bitField);
    int nbItems = nb1Bits[ p[0] ];
    nbItems += nb1Bits[ p[1] ];
    nbItems += nb1Bits[ p[2] ];
    nbItems += nb1Bits[ p[3] ];
    nbItems += nb1Bits[ p[4] ];
    nbItems += nb1Bits[ p[5] ];
    nbItems += nb1Bits[ p[6] ];
    nbItems += nb1Bits[ p[7] ];
    return nbItems;
}

FIG. 10C
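The byte-wise table lookup of FIG. 10C can be made self-contained as in the following sketch. The figure declares a 256-entry table of constants; here the table is instead filled once at start-up from the recurrence count(b) = count(b / 2) + (b mod 2), which is a substitution made for brevity. The names `Nb1BitsTable` and `kNb1Bits` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// kNb1Bits.count[b] is the number of 1 bits in the byte value b.
struct Nb1BitsTable {
    unsigned char count[256];
    Nb1BitsTable() {
        count[0] = 0;
        for (int b = 1; b < 256; ++b) {
            // Bits of b = bits of b without its lowest bit, plus that lowest bit.
            count[b] = static_cast<unsigned char>(count[b >> 1] + (b & 1));
        }
    }
};
static const Nb1BitsTable kNb1Bits;

// Counts the 1 bits of a 64-bit field with eight table lookups, one per byte.
inline int count1Bits(std::uint64_t bitField) {
    unsigned char bytes[sizeof bitField];
    // Copy out the bytes; their order does not matter because they are summed.
    std::memcpy(bytes, &bitField, sizeof bitField);
    int nbItems = 0;
    for (unsigned char byte : bytes) nbItems += kNb1Bits.count[byte];
    return nbItems;
}
```

Applied to the result of an AND over bit fields, this count is the number of transactions in the set that support the pattern.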

U.S. Patent    Aug. 21, 2012    Sheet 15 of 22    US 8,250,105 B2

[Flowchart. Reference numerals 1101 to 1109 are legible in the drawing; only the first five can be matched to their boxes with confidence.]

1101 evaluation information of parent pattern tells TA set inactive?
1102 any TA in TA set has sufficient number of items?
1103 determine positions of common items of all N candidate patterns
1104 all common items found in TA set?
1105 determine TAs containing all common items
current TA contains all items?
determine positions and support in TA for non-common items
check whether items defined by positive item constraints are present in TA
last TA in TA set?
N candidate patterns evaluated with respect to TA set

FIG. 11

U.S. Patent    Aug. 21, 2012    Sheet 16 of 22    US 8,250,105 B2

[Flowchart of method 1200. Reference numerals 1002a, 1003, 1004, 1005, 1008a, 1009, 1010, 1011 and 1201 to 1208 appear in the drawing. Legible step text, in flow order:]

1201 start with candidate list l, containing N rule candidates with k items, k-1 of them identical for all candidates
l.hasParent? no: fetch first TA set g; yes: fetch parent's first active TA set g
g.nbDiffItems < k?
activeTAs := findActiveTAs( g, k-1 identical items )
activeTAs = 0?
i := 0; itemPosi := 0
1206 activeTAs_i := activeTAs & findActiveTAs( g, addedItem[i], itemPosi )
1207 absSupport[ i ] := absSupport[ i ] + count1Bits( activeTAs_i )
1208 i := i + 1
i < N?
l.hasParent? no: fetch next TA set g; yes: fetch parent's next active TA set g
fetch successful? yes: continue with the fetched TA set; no: finished

FIG. 12A
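The inner loop of FIG. 12A (steps 1206 to 1208) evaluates N candidates that share k-1 items against one TA set: the shared items are resolved once into a mask, and each candidate then costs one lookup of its added item, one AND and one bit count. The sketch below illustrates this under several assumptions: N is at most 64, the added items are given in ascending order, and all names (`TaGroup`, `findItem`, `countBits`, `evaluateCandidates`) are hypothetical. Unlike the single-item lookup drawn in FIG. 12C, `findItem` leaves the cursor on the first identifier that is not smaller than the one looked up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TaGroup {                          // hypothetical stand-in for a TA set
    std::vector<std::uint32_t> itemID;    // ascending identifiers
    std::vector<std::uint64_t> bitField;  // one bit field per identifier
};

// Bit field of `id` in g, scanning forward from `pos`; 0 when g lacks `id`.
// `pos` ends on the first identifier >= id, so later, larger ids continue there.
std::uint64_t findItem(const TaGroup& g, std::uint32_t id, std::size_t& pos) {
    while (pos < g.itemID.size() && g.itemID[pos] < id) ++pos;
    return (pos < g.itemID.size() && g.itemID[pos] == id) ? g.bitField[pos] : 0;
}

// Number of 1 bits, clearing the lowest set bit on each pass.
int countBits(std::uint64_t v) {
    int n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

// Adds to absSupport[i] the number of transactions in g that match the
// shared items (given as the mask activeTAs) and also contain addedItem[i].
void evaluateCandidates(const TaGroup& g, std::uint64_t activeTAs,
                        const std::vector<std::uint32_t>& addedItem,
                        std::vector<int>& absSupport) {
    assert(addedItem.size() == absSupport.size());
    if (activeTAs == 0) return;  // no transaction matches the shared items
    std::size_t pos = 0;         // single cursor shared by all candidates
    for (std::size_t i = 0; i < addedItem.size(); ++i) {
        absSupport[i] += countBits(activeTAs & findItem(g, addedItem[i], pos));
    }
}
```

Because the added items are visited in ascending order, the cursor only moves forward, so the identifier list of the set is scanned at most once for all N candidates.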

U.S. Patent    Aug. 21, 2012    Sheet 17 of 22    US 8,250,105 B2

BITFIELD_N findActiveTAs( TAGROUP g, ITEM[] itemIDs, ITEM numberOfItems )
{
    BITFIELD_N activeTAs := all N bits set to 1;
    ITEM itemPosi_g := 0;
    ITEM itemPosi := 0;

    while (itemPosi_g < g.numberOfDifferentItems())
    {
        if (g.itemID[itemPosi_g] < itemIDs[itemPosi])
        {
            itemPosi_g := itemPosi_g + 1;
        }
        else if (g.itemID[itemPosi_g] > itemIDs[itemPosi])
        {
            return 0;
        }
        else
        {
            activeTAs := activeTAs & g.bitField[itemPosi_g];
            itemPosi_g := itemPosi_g + 1;
            itemPosi := itemPosi + 1;
            if (itemPosi = numberOfItems)
                return activeTAs;
        }
    }
    return 0;
}

FIG. 12B

U.S. Patent    Aug. 21, 2012    Sheet 18 of 22    US 8,250,105 B2

BITFIELD_N findActiveTAs( TAGROUP g, ITEMID itemID, int& itemPosi )
{
    while (itemPosi < g.numberOfDifferentItems())
    {
        if (g.itemID[itemPosi] < itemID)
        {
            itemPosi := itemPosi + 1;
        }
        else if (g.itemID[itemPosi] > itemID)
        {
            itemPosi := g.numberOfDifferentItems();
        }
        else
        {
            return g.bitField[itemPosi];
        }
    }
    return 0;
}

FIG. 12C
