An Efficient Algorithm for Mining Association Rules: ICI


Jen-Peng Huang, I-Pei Chien, Sheng-Hong Wu
Institute of Information Management,
Southern Taiwan University of Technology

Abstract
Thanks to advances in information technology and the popularization of computers, collecting information has become easier, faster and more convenient than before, and databases accumulate huge amounts of hidden information as time goes by. How to uncover this hidden information correctly and mine it efficiently has therefore become a very important issue, and data mining is one of the answers. Among data mining techniques, association rule mining is one of the most popular: it extracts the frequent itemsets from a large database and then derives the knowledge implicit behind them. The Apriori algorithm is the most frequently used algorithm for this task. Although Apriori successfully derives association rules from a database, it has two major defects: first, it produces large numbers of candidate itemsets while extracting the frequent itemsets; second, it scans the whole database repeatedly, which leads to poor performance. Many studies have tried to improve Apriori, but they do not escape its framework and achieve only modest gains. In this paper we propose ICI (Incremental Combination Itemsets), an algorithm that escapes the Apriori framework and scans the whole database only once while extracting the frequent itemsets. ICI therefore greatly reduces the I/O time, extracts the frequent itemsets from a large database rapidly, and makes data mining more efficient than before.
Keywords: Data Mining, Association Rule, Apriori, Frequent Itemsets


1. Introduction

As information technology advances, enterprises record ever more transactions in their databases (Database), and the items (Item) stored in those transactions hide a large amount of valuable knowledge. Data mining (Data Mining) is the technology for uncovering this knowledge, and since Agrawal et al. [1] formalized the problem, data mining techniques have commonly been divided into five categories:

(1) Association rules (Association Rules) [1,2,8]: find itemsets that frequently appear together in the same transactions and derive rules between them.
(2) Time sequence analysis (Time Sequence) [4,10]: analyze how data change and recur over time.
(3) Classification rules (Classification Rules): assign data to predefined classes according to their attributes.
(4) Clustering rules (Clustering Rules) [6,9]: group similar data together without predefined classes.
(5) Sequence pattern analysis (Sequence Pattern Analysis) [3]: find patterns that frequently occur in a particular order.

Association rule mining is the most widely used of these techniques. Its best-known algorithm is Apriori [1,2], and many refinements of it have been proposed, such as DHP [5], AprioriTid [3], AprioriHybrid [3] and QDT [11,12]. Most of them, however, stay within the Apriori framework and therefore inherit its weaknesses. In this paper we propose ICI (Incremental Combination Itemsets), an algorithm that escapes the Apriori framework and scans the whole database only once while extracting the frequent itemsets.

2. Association Rule Mining

2.1 Definition of association rules

Let I = {i1, i2, i3, ..., im} be the set of all items and let D be the database, a set of transactions. Each transaction T is an itemset (Itemset) with T ⊆ I, identified by a unique TID. An association rule is an implication X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅; a typical example reads "80% of the customers who buy X also buy Y".

Two measures are attached to a rule: its support (Support) S, the fraction of transactions in D that contain X ∪ Y, and its confidence (Confidence) C, the fraction of the transactions containing X that also contain Y. A rule is reported only when S reaches the user-given threshold (Threshold) of minimum support (minsup) and C reaches the minimum confidence (minconf).

An itemset whose support is at least minsup is called a frequent itemset (Frequent itemset). An itemset with k items is a k-itemset (k-itemsets), so a frequent k-itemset is a k-itemset whose support reaches minsup. For example, {B, C, E} is a 3-itemset built from the items B, C, E ∈ I; if its support reaches minsup, {B, C, E} is a frequent 3-itemset.
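In symbols, the two measures can be written as follows (a standard formulation consistent with [1,2]):

$$ \mathrm{support}(X \Rightarrow Y) = \frac{|\{\,T \in D : X \cup Y \subseteq T\,\}|}{|D|}, \qquad \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} $$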
2.2 The Apriori algorithm

Apriori extracts the frequent itemsets level by level. In every pass k (k > 1) it performs three steps:

(1) Join the frequent (k-1)-itemsets (Lk-1) with one another to generate the candidate k-itemsets (Ck); two (k-1)-itemsets are joined when their first k-2 items are identical (a sketch of this join follows the list).
(2) Scan the whole database D once to count the support of every candidate in Ck; the candidates whose support reaches minsup become the frequent k-itemsets (Lk).
(3) Repeat steps (1) and (2) with k increased by one until no new frequent k-itemsets are produced.
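The join of step (1) can be written as a short Java method. This is a hedged sketch of the textbook Apriori join, not the paper's code; it assumes every itemset is kept as a sorted list, and it omits the usual pruning of candidates that contain an infrequent (k-1)-subset:

import java.util.*;

public class AprioriJoin {
    // Join two frequent (k-1)-itemsets into a candidate k-itemset whenever
    // their first k-2 items agree, as described in step (1) above.
    static List<List<String>> joinStep(List<List<String>> lkMinus1) {
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < lkMinus1.size(); i++) {
            for (int j = i + 1; j < lkMinus1.size(); j++) {
                List<String> a = lkMinus1.get(i), b = lkMinus1.get(j);
                int len = a.size();                       // len = k - 1
                if (a.subList(0, len - 1).equals(b.subList(0, len - 1))) {
                    List<String> c = new ArrayList<>(a);  // shared prefix plus
                    c.add(b.get(len - 1));                // both last items
                    candidates.add(c);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // L2 = {BC, BE, CE} joins into the single candidate 3-itemset BCE.
        List<List<String>> l2 = List.of(
                List.of("B", "C"), List.of("B", "E"), List.of("C", "E"));
        System.out.println(joinStep(l2));  // prints [[B, C, E]]
    }
}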
2.3 Shortcomings of the Apriori algorithm

(1) It produces enormous numbers of candidate itemsets (Itemset). Because every pass joins the frequent itemsets pairwise, k frequent 1-itemsets already generate k(k-1)/2 candidate 2-itemsets, and the counts only grow in later passes; computing the support of all these candidates dominates the runtime (the calculation follows below).
(2) It scans the whole database repeatedly. Every pass rescans D once to count the candidates, so mining up to n-itemsets costs n full scans, which makes the I/O overhead of Apriori very heavy.
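Written out, joining k frequent 1-itemsets pairwise yields

$$(k-1) + (k-2) + \cdots + 1 = \frac{k(k-1)}{2}$$

candidate 2-itemsets: k = 10 gives 45 candidates, while k = 1000 already gives 1000 × 999 / 2 = 499,500.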


3. The ICI Algorithm

3.1 Flow of the ICI algorithm

ICI abandons the generate-and-test framework of Apriori. It scans the database exactly once: for each transaction it generates all combinations (itemsets) of the transaction's items and accumulates their occurrence counts in a map structure, the MAP. When the scan ends, the frequent itemsets are read directly out of the MAP without any further pass over the database. Figure 1 shows the flow as nine steps (Step-1 to Step-9) from Start to End, including the test for whether unread transactions remain; Section 3.2 summarizes the steps and Section 3.3 traces them on an example.

Figure 1. Flowchart of the ICI algorithm

3.2 Steps of the algorithm

The flow of Figure 1 can be summarized as follows. After the MAP is initialized (Step-1), the algorithm reads one transaction at a time (Step-2). For a transaction with X items it generates all of the transaction's combinations, i.e. its non-empty sub-itemsets (Step-3), and writes every combination into the MAP, adding a new entry with count 1 or incrementing the count of an entry that is already present (Step-4). If unread transactions remain, it returns to Step-2; otherwise the single scan of the database is complete. The closing steps then take every MAP entry whose count reaches minsup as a frequent itemset and derive from these the association rules whose confidence reaches minconf (Steps 5-9).
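The single-scan counting loop can be sketched in Java, the paper's implementation language. This is a minimal illustration, not the authors' code: the MAP is assumed to be a hash table keyed by the comma-joined, sorted itemset, the subsets are enumerated with a plain bitmask instead of the incremental method of Section 3.4, and the data are the four transactions of Table 1 in Section 3.3 with minsup = 50%:

import java.util.*;

public class IciScan {
    // One pass over the database: tally every non-empty item combination
    // of every transaction in the MAP (itemset key -> occurrence count).
    static Map<String, Integer> scan(List<List<String>> database) {
        Map<String, Integer> map = new HashMap<>();
        for (List<String> t : database) {
            int n = t.size();
            for (int mask = 1; mask < (1 << n); mask++) {   // every non-empty subset
                List<String> subset = new ArrayList<>();
                for (int i = 0; i < n; i++)
                    if ((mask & (1 << i)) != 0) subset.add(t.get(i));
                map.merge(String.join(",", subset), 1, Integer::sum);
            }
        }
        return map;                                          // database is never read again
    }

    public static void main(String[] args) {
        List<List<String>> db = List.of(
                List.of("P001", "P003", "P004"),             // T001
                List.of("P002", "P003", "P005"),             // T002
                List.of("P001", "P002", "P003", "P005"),     // T003
                List.of("P002", "P005"));                    // T004
        int minCount = 2;                                    // minsup 50% of 4 transactions
        // Frequent itemsets are simply the MAP entries whose count reaches minsup.
        scan(db).forEach((itemset, count) -> {
            if (count >= minCount) System.out.println(itemset + " : " + count);
        });
    }
}

Run on Table 1, this prints exactly the entries of Table 8 below, e.g. P002,P003,P005 : 2.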


3.3 An example

Table 1 shows an example database DB of four transactions, each carrying a TID and its Itemset. Table 2 lists the combinations that an itemset of X items produces.

Table 1. The example transaction database
TID     Itemset
T001    P001, P003, P004
T002    P002, P003, P005
T003    P001, P002, P003, P005
T004    P002, P005

Table 2. Combinations generated from an itemset of size X
X=1  A     A
X=2  AB    AB, A, B
X=3  ABC   ABC, AB, AC, BC, A, B, C
X=4  ABCD  ABCD, ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C, D

Reading T001 = (P001, P003, P004) gives X = 3, so the seven itemsets (P001,P003,P004), (P001,P003), (P001,P004), (P003,P004), (P001), (P003), (P004) are generated and inserted into the MAP with count 1 (Table 3).

Table 3. The MAP after adding T001
X=3  (P001,P003,P004):1
X=2  (P001,P003):1  (P001,P004):1  (P003,P004):1
X=1  (P001):1  (P003):1  (P004):1

Reading T002 = (P002, P003, P005) produces seven further itemsets; (P003) is already present, so its count rises to 2, and the rest enter with count 1 (Table 4).

Table 4. The MAP after adding T002
X=3  (P001,P003,P004):1  (P002,P003,P005):1
X=2  (P001,P003):1  (P001,P004):1  (P002,P003):1  (P002,P005):1  (P003,P004):1  (P003,P005):1
X=1  (P001):1  (P002):1  (P003):2  (P004):1  (P005):1

Reading T003 = (P001, P002, P003, P005) gives X = 4 and the fifteen itemsets (P001,P002,P003,P005), (P001,P002,P003), (P001,P002,P005), (P001,P003,P005), (P002,P003,P005), (P001,P002), (P001,P003), (P001,P005), (P002,P003), (P002,P005), (P003,P005), (P001), (P002), (P003), (P005); Table 6 shows the MAP after they are counted.

Table 6. The MAP after adding T003
X=4  (P001,P002,P003,P005):1
X=3  (P001,P002,P003):1  (P001,P002,P005):1  (P001,P003,P004):1  (P001,P003,P005):1  (P002,P003,P005):2
X=2  (P001,P002):1  (P001,P003):2  (P001,P004):1  (P001,P005):1  (P002,P003):2  (P002,P005):2  (P003,P004):1  (P003,P005):2
X=1  (P001):2  (P002):2  (P003):3  (P004):1  (P005):2

Finally, T004 = (P002, P005) contributes only the three itemsets (P002,P005), (P002), (P005), all already present (Table 7).

Table 7. The MAP after adding T004
X=4  (P001,P002,P003,P005):1
X=3  (P001,P002,P003):1  (P001,P002,P005):1  (P001,P003,P004):1  (P001,P003,P005):1  (P002,P003,P005):2
X=2  (P001,P002):1  (P001,P003):2  (P001,P004):1  (P001,P005):1  (P002,P003):2  (P002,P005):3  (P003,P004):1  (P003,P005):2
X=1  (P001):2  (P002):3  (P003):3  (P004):1  (P005):3

With minsup = 50%, an itemset is frequent when it occurs in at least 50% × 4 = 2 transactions, so the frequent itemsets can be read directly from the MAP (Table 8):

Table 8. The frequent itemsets (minsup = 50%)
(L3)  (P002,P003,P005):2
(L2)  (P001,P003):2  (P002,P003):2  (P002,P005):3  (P003,P005):2
(L1)  (P001):2  (P002):3  (P003):3  (P005):3

With minconf = 80%, the rules derived from these frequent itemsets are:

P002,P003 ⇒ P005 (100% > 80%)
P003,P005 ⇒ P002 (100% > 80%)
P001 ⇒ P003 (100% > 80%)
P002 ⇒ P005 (100% > 80%)
P005 ⇒ P002 (100% > 80%)

For example, the rule P002,P003 ⇒ P005 holds because

p(P005 | P002,P003) = p(P002,P003,P005) / p(P002,P003) = 2/2 = 100% ≥ 80%,

whereas P002 ⇒ P003 is rejected because

p(P003 | P002) = p(P002,P003) / p(P002) = 2/3 ≈ 67% < 80%.
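The confidence test used above maps directly onto the MAP counts. A hedged sketch in Java (the method name and key format are inventions of this illustration, matching the IciScan sketch in Section 3.2):

// Accept the rule X => Y when count(X ∪ Y) / count(X) reaches minconf,
// mirroring p(P005 | P002,P003) = p(P002,P003,P005) / p(P002,P003) above.
static boolean ruleHolds(Map<String, Integer> map,
                         String xKey, String xUnionYKey, double minconf) {
    Integer xy = map.get(xUnionYKey), x = map.get(xKey);
    if (xy == null || x == null) return false;
    return (double) xy / x >= minconf;
}

For the example MAP, ruleHolds(map, "P002,P003", "P002,P003,P005", 0.8) yields true (2/2 = 100%), while ruleHolds(map, "P002", "P002,P003", 0.8) yields false (2/3 ≈ 67%).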


3.4 Incremental combination generation and the MAP

3.4.1 Generating the combinations incrementally

ICI builds the combinations of a transaction incrementally, one item at a time, instead of recomputing them from scratch:

(X = 1) A transaction consisting of a single item A has A itself as its only combination, and A is stored at level X = 1 of the MAP (Figure 2).

(X = 2) When a second item B arrives, the combinations of A are already available. B alone is added at level X = 1, and B appended to the existing combination A gives A+B = AB at level X = 2; the combinations of AB are therefore A, B and AB (Figure 3).

(X = 3) When a third item C arrives, the combinations of AB, namely A, B and AB, are kept; C alone is added; and C appended to each existing combination gives AC, BC and ABC. The seven combinations of ABC are thus A, B, AB, C, AC, BC and ABC, stored at levels X = 1, X = 2 and X = 3 of the MAP (Figures 4 and 5).

Figure 2. Storing A in the MAP (A)
Figure 3. Extending the combinations from A to AB (A → AB)
Figure 4. Extending the combinations from AB to ABC (AB → ABC)
Figure 5. The MAP after the combinations are stored

3.4.2 An example

Suppose the transaction (P120, P220, P500) is read, so X = 3. Writing A for P120, B for P220 and C for P500, the incremental method generates the seven combinations (P120,P220,P500), (P120,P220), (P120,P500), (P220,P500), (P120), (P220) and (P500), which are stored at levels X = 3, X = 2 and X = 1 of the MAP (Figures 6 to 8).

Figure 6. Storing (P120,P220,P500), i.e. ABC, at level X = 3 of the MAP
Figure 7. Storing (P120,P220), (P120,P500), (P220,P500), i.e. AB, AC, BC, at level X = 2 of the MAP
Figure 8. Storing (P120), (P220), (P500), i.e. A, B, C, at level X = 1 of the MAP
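The incremental expansion of Section 3.4.1 translates directly into code. A minimal Java sketch (class and method names are illustrative, not from the paper): for each new item, every combination built so far is copied once with the new item appended, and the new item itself is added.

import java.util.*;

public class Combinations {
    // Incremental generation as in 3.4.1: the combinations of {A,B,C} are the
    // combinations of {A,B}, plus C alone, plus C appended to each earlier
    // combination, i.e. the seven itemsets A, B, AB, C, AC, BC, ABC.
    static List<List<String>> allNonEmptySubsets(List<String> items) {
        List<List<String>> combos = new ArrayList<>();
        for (String item : items) {
            int previous = combos.size();          // combinations built so far
            for (int i = 0; i < previous; i++) {
                List<String> extended = new ArrayList<>(combos.get(i));
                extended.add(item);                // append the new item to each
                combos.add(extended);
            }
            combos.add(List.of(item));             // the new item by itself
        }
        return combos;
    }

    public static void main(String[] args) {
        // The 3.4.2 example: {P120, P220, P500} yields 2^3 - 1 = 7 combinations.
        System.out.println(allNonEmptySubsets(List.of("P120", "P220", "P500")));
    }
}

In general a transaction with X items yields 2^X - 1 combinations, each produced exactly once by this expansion.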

3.5 Experimental results

(1) Experimental environment

CPU: Pentium 1.7 GHz
Memory: 512 MBytes
OS: Windows 2000 Server
Database: Access
Programming language: Java

The test data were generated with the IBM synthetic data generator; Table 9 explains its parameters and Table 10 describes the dataset used.

Table 9. Parameters of the IBM data generator
D  number of transactions
L  number of maximal potentially frequent itemsets
T  average number of items per transaction
I  average size of the maximal potentially frequent itemsets
N  number of items (ITEMs)

Table 10. The test dataset
Name             [L]  [T]  [N]  [I]  [D]
L10T7N500I4D10K  10   7    500  4    10K

(2) Experimental results

Table 11 and Figure 9 compare the execution times of ICI, QDT and Apriori on L10T7N500I4D10K as the minimum support falls from 1% to 0.1%.

Table 11. Execution-time comparison of ICI, QDT and Apriori
Minsup       1%       0.75%    0.5%     0.25%    0.1%
Apriori      305 sec  534 sec  940 sec  1882 sec 2630 sec
QDT          22 sec   22 sec   22 sec   22 sec   22 sec
ICI          9 sec    9 sec    9 sec    9 sec    9 sec
Apriori/ICI  33.89    59.33    104.45   209.12   292.23
QDT/ICI      2.45     2.45     2.45     2.45     2.45

The execution time of Apriori climbs steeply as minsup decreases, because a lower threshold produces far more candidate itemsets and more counting work on every scan. ICI and QDT, which both read the database only once, are unaffected by the threshold: ICI finishes in 9 seconds in every case, about 2.5 times faster than QDT and between 33.89 and 292.23 times (on average about 140 times) faster than Apriori.

Figure 9. Execution times of ICI, QDT and Apriori (seconds, for minsup 1%, 0.75%, 0.50%, 0.25%, 0.10%)

4. Conclusion

The ICI algorithm escapes the framework of Apriori: it extracts all frequent itemsets with a single scan of the database, avoiding both the huge candidate sets and the repeated scans that burden Apriori, and thereby sharply reduces the I/O time. The experiments confirm that ICI outperforms both Apriori and QDT, making the mining of association rules considerably more efficient than before.

References

[1] Agrawal, R., Imielinski, T. and Swami, A. (1993), Mining Association Rules Between Sets of Items in Large Databases, Proc. of the ACM SIGMOD Conference on Management of Data, pp. 207-216.
[2] Agrawal, R. and Srikant, R. (1994), Fast Algorithms for Mining Association Rules, Proc. of the 20th VLDB Conference, Santiago, Chile.
[3] Agrawal, R. and Srikant, R. (1995), Mining Sequential Patterns, Proc. of the Int'l Conference on Data Engineering (ICDE).
[4] Chen, M.S., Han, J. and Yu, P.S. (1996), Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883.
[5] Park, J.S., Chen, M.-S. and Yu, P.S. (1997), Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Mining Association Rules, IEEE Trans. on Knowledge and Data Engineering.
[6] Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons.
[7] Lin, D. and Kedem, Z.M. (1998), Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set, Sixth Int'l Conf. on Extending Database Technology.
[8] Chen, M.-S., Han, J. and Yu, P.S. (1996), Data Mining: An Overview from a Database Perspective, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6.
[9] Ng, R. and Han, J. (1994), Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. Int'l Conf. on Very Large Data Bases, pp. 144-155.
[10] Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, Vol. 1, pp. 81-106.
[11] (2002), paper on the QDT algorithm (in Chinese), p. 36.
[12] (2002), paper on the QDT algorithm (in Chinese), p. 55.
