
Applying Correlation Threshold on Apriori Algorithm

Anand H.S. & Vinod Chandra S.S.


Abstract
The ever-growing size of information and databases has always demanded very efficient rule mining algorithms from the scientific world. This paper provides an extension to the Apriori algorithm, a classical rule mining algorithm. Apriori finds its application in areas of data mining, in finding associations between attributes, and in prediction systems. Even though Apriori suits various applications, it possesses several disadvantages. To make the algorithm efficient, a method for incorporating a new correlation factor (threshold) is introduced in this paper. The first part of the paper provides a quick summary of the basic Apriori algorithm and the second half details the implementation of the correlation threshold. The performance of the redesigned algorithm is evaluated and compared with the traditional Apriori algorithm. The evaluation shows a peak improvement in the mining result. At the application level, a qualitative content analysis of water was also conducted to affirm the results.
Keywords: Apriori algorithm, Data mining, Correlation threshold, Association rule mining, Machine learning algorithm.
1. INTRODUCTION
Data growth on the internet is increasing day by day. It is estimated that the Indexed Web contains about 9.07 billion pages and is still growing (the measure covers only the indexed web content). This is mentioned just to provide an idea of the amount of data available on the web. The same is the case with the databases used for various applications. So, efficient methods are needed to mine the data from large databases. This is one of the major difficulties faced by researchers working in the field of data mining. Every database has numerous attributes. A change in any attribute will affect other objects that are closely associated with it. So it has always been an area of interest to discover such interesting relationships between the various attributes within a database. Association rule mining is a process which provides numerous ways to find associations between variables.
Consider a large database, say D, having N attributes. Let A and B be any two variables which are closely associated. Any variation in the value of A can cause a positive or negative effect on the associated variable B. These associations can even be used to create or predict rules or decisions. Hence it is a serious area of study to know the extent of association between such attributes. Various algorithms fall under association rule learning. Major association rule mining algorithms include the Apriori algorithm, the Tertius algorithm, the FP-growth algorithm and the Eclat algorithm. All these algorithms provide ways to create rules on associated attributes.
This paper discusses the classical rule mining algorithm, Apriori [ ]. This algorithm suggests solutions to market basket analysis for finding related products in a store. Such relationships can increase business and be used to find more innovative methods for advertising related products. For example, consider the activities of the Amazon online shopping store: when we browse a book under a particular stream, related products also populate the side menu. This works merely by the concept of association mining. In every area, such association rules and data mining influence business either directly or indirectly. Here we discuss methods for finding sharp associations with improved accuracy by incorporating a correlation threshold into the existing algorithm.
2. CONCEPTS AND TERMINOLOGY
Two major concepts used while working with the Apriori algorithm are Support and Confidence. Let's define what exactly these terms are:
Support: It defines the transactions where the items go hand in hand. If a, b are two itemsets, then the support can be defined over the transactions T which show a → b (a implies b).
Confidence: It is defined as the percentage of transactions where the itemsets are most probable to occur. If a, b are two itemsets, then the probability that (a ∪ b) is an element of a transaction T is termed the confidence.
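The two measures can be sketched in Python; the transaction database below is a hypothetical toy example (the item names are illustrative, not from the paper):

```python
# Hypothetical toy transaction database (not from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """Support(a ∪ b) / Support(a): how often b occurs given that a occurs."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

print(support({"bread", "milk"}, transactions))       # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.667
```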
In addition to the usual concepts, we introduce a new term, the correlation threshold, which is implemented in the modified Apriori algorithm. The correlation threshold is a factor which transfers the probability from the single itemset to the n-itemset. This transition probability is required in order to confirm that the probability calculated is propagating in the calculation of each itemset. The general equation for finding the correlation threshold c is given by,
c_i = α(P_min) + β(P_min · P_max) ---- (1)
where P_min and P_max are the minimum and maximum probabilities of the itemsets, calculated from the probabilistic array. In the formula we have two constants, α and β; these are the correlation constants, whose values depend on the median of the probabilities. We need to make sure that α + β = 1 in all situations.
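Equation (1) can be sketched directly. The constants α = 5/9 and β = 4/9 are illustrative assumptions chosen so that α + β = 1 (Section 5 of the paper initializes a correlation constant to 5/9; the split between α and β is our assumption):

```python
def correlation_threshold(probabilities, alpha=5/9, beta=4/9):
    """Equation (1): c_i = alpha * P_min + beta * (P_min * P_max).

    `probabilities` is the probabilistic array for the current itemset level.
    alpha and beta are the correlation constants and must sum to 1; the
    default values here are illustrative assumptions.
    """
    assert abs(alpha + beta - 1) < 1e-9, "alpha + beta must equal 1"
    p_min, p_max = min(probabilities), max(probabilities)
    return alpha * p_min + beta * (p_min * p_max)

print(correlation_threshold([0.2, 0.5, 0.8]))  # threshold between 0 and 1
```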
Apriori property: Every subset of a frequent itemset must also be frequent (equivalently, no superset of an infrequent itemset can be frequent). This is the major property used while calculating the frequent data.
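The property can be checked on a small example; the transactions below are hypothetical, not from the paper:

```python
from itertools import combinations

# Hypothetical transactions (not from the paper).
db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in db if set(itemset) <= t) / len(db)

# If {a, b} is frequent at threshold 0.6, every subset must be too:
assert support({"a", "b"}) >= 0.6
for sub in combinations({"a", "b"}, 1):
    # A subset is contained in at least as many transactions as its superset.
    assert support(sub) >= support({"a", "b"})
```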
The above-mentioned properties make Apriori unique and classical among association rule learning algorithms. Now let's explain the algorithm in brief.

3. THE APRIORI ALGORITHM
The basic Apriori algorithm is viewed as a two-stage process: first the candidate itemset generation, then the rule creation. Before starting the procedure, the threshold E is defined. The algorithm starts by scanning the database, say D. After scanning, all the frequent items from D are generated. The first scan considers only single itemsets; it is later repeated considering 2-itemsets, and a new list of frequent items is created. The process continues till all the frequent itemsets are mined from D. Only those frequent itemsets whose support is greater than E are taken for rule creation. The pseudo-code [ ] of the traditional Apriori is given below.
Apriori (database D, threshold E)
Pseudo-code:
for (k = 1; L_k != null; k++)
do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction T in database D
    do
        increment the count of all candidates in C_{k+1} that are contained in T
    L_{k+1} = candidates in C_{k+1} with min_threshold E
end
return ∪_k L_k

C_k : candidate itemset of size k
L_k : frequent itemset of size k
Join step: C_k is generated by joining L_{k-1} with itself
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset (Apriori property)
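The level-wise procedure above can be sketched as a minimal runnable implementation; the threshold E corresponds to `min_support` here, and the toy database is illustrative:

```python
from itertools import combinations

def apriori(database, min_support):
    """Classic Apriori: level-wise generation of frequent itemsets.

    `database` is a list of transactions (sets of items); itemsets with
    support >= min_support (a fraction of transactions) are kept.
    """
    n = len(database)

    def support(itemset):
        return sum(1 for t in database if itemset <= t) / n

    # L1: frequent 1-itemsets.
    items = {item for t in database for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # Join step: combine (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

# Illustrative toy database (not from the paper).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori(db, min_support=0.5))  # all 1- and 2-itemsets; {a,b,c} is pruned
```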

3.1. Limitation of the Apriori Algorithm
Even though Apriori is a traditional algorithm, it has various drawbacks in application solving. For a database D, Apriori needs to scan the database n times if the length of the frequent itemsets is n. This extensive scanning makes the system consume more time for rule generation. The time complexity for such a process is defined to be O(n²) [ ]. As mentioned earlier, the support threshold needs to be initialized before the algorithm scans the itemsets. Due to a high threshold benchmark, frequent datasets may be discarded at a very early step. This prunes significant predictions and rules. In cases where infrequent items are considered for rule generation, a weak prediction system is created. To overcome these shortcomings, we need a modified Apriori algorithm. So we introduce a correlation threshold.
4. CORRELATION THRESHOLD
The correlation threshold finds its application in candidate itemset generation. In the modified Apriori, we incorporate the correlation threshold for finding strong association rules between the itemsets. The correlation threshold is a value between 0 and 1. If the value is 1, then the attributes are highly related to each other, while a value close to zero shows the datasets are independent. This correlation confirms that all relationships appearing in the traditional Apriori are present in the proposed algorithm. Let there be n elements in a database D, and a probabilistic array of size n, PA[n]. The algorithm begins by scanning D. After the first scan, the probability of each 1-itemset appearing in the transactions is entered into PA[n]. From the probabilistic array, the correlation threshold is found using formula (1). This acts as the minimum support threshold. Those itemsets whose probability is below the correlation value are eliminated. This step repeats iteratively, and from PA[n] the 2-itemsets are generated by calculating a new correlation threshold. Repeated database scans are avoided, as the candidate itemset generation is done directly from the probabilistic array and not by continuous scanning of D. The time complexity of the algorithm was thereby reduced from O(n²) to O(n). The process is continued till all attributes in the database are scanned. The pseudo-code for the modified Apriori is given below,
Modified-Apriori (database D, correlation c)
Input: Database D with n elements
Data structures used: Probabilistic array PA_0[n]
Output: Frequent itemset I and rule S

Pseudo-code:
till (n != null)
    scan the database D and input the probability of all n elements into PA_0[n]
for (i = 0; i < n; i++)
{
    calculate the correlation threshold for step i, C_i, from PA_0[n]
    for (k = 0; k < n; k++)
        if (C_i > PA_i[k])
        then update PA_{i+1}[k]
}
for each frequent itemset I of the non-empty array PA_i[n]
    if (Support(I) / Support(PA_i[n]) > C_i)
        generate rule S: (I → PA_i[n])
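Under the description above, the modified algorithm can be sketched as follows. This is only a sketch under stated assumptions: details the pseudo-code leaves open (how (k+1)-itemsets are formed from the survivors of step k) are filled in as assumptions, and the constants α = 5/9, β = 4/9 are illustrative:

```python
def correlation_threshold(probs, alpha=5/9, beta=4/9):
    # Equation (1); alpha + beta = 1 (illustrative constants, our assumption).
    p_min, p_max = min(probs), max(probs)
    return alpha * p_min + beta * (p_min * p_max)

def modified_apriori(database):
    """Sketch of the modified Apriori: at each step, itemsets are kept only
    if their probability is at least the step's correlation threshold,
    computed from the current probabilistic array."""
    n = len(database)
    items = sorted({item for t in database for item in t})
    # PA_0[n]: probability of each 1-itemset from a single scan of D.
    pa = {frozenset([i]): sum(1 for t in database if i in t) / n for i in items}
    frequent = {}
    while pa:
        c_i = correlation_threshold(list(pa.values()))
        # Prune entries whose probability falls below the threshold.
        survivors = {s: p for s, p in pa.items() if p >= c_i}
        frequent.update(survivors)
        # Assumed join: extend surviving itemsets by one item each.
        nxt = {}
        for s in survivors:
            for i in items:
                cand = s | {i}
                if len(cand) == len(s) + 1 and cand not in nxt:
                    p = sum(1 for t in database if cand <= t) / n
                    if p > 0:
                        nxt[cand] = p
        pa = nxt
    return frequent

# Illustrative toy database (not the water-content data from the paper).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(modified_apriori(db))
```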

5. EXPERIMENTAL RESULTS AND
DISCUSSIONS
Tables 1 and 2 denote the databases on which the modified Apriori was tested. Table 1 is a real-time database recording the contents of drinking water. The algorithm was used to predict the common minerals and other ingredients found in water.

Table 2 is the 4-transaction database used to affirm the comparison results.

Considering the database of water content, the step-wise process is explained here. The attributes under consideration include dissolved solids, carbonates, chlorides, nitrates and sulphates. Each transaction is scanned one after another and the probabilistic array PA_0[n] is populated. Now C_0 is calculated by equation (1). C_0 is the initial correlation threshold and is initialized to 2/9; the correlation constant is also initialized [ ] to 5/9. The data pruning is carried out by checking PA_0[0] against C_0. Again the frequent itemsets are listed, and the probabilistic array is modified by calculating a new correlation threshold. In the second step the 2-itemset threshold needs to be found. From equation (1), the new C_1 is found, and the value was nearly 0.20027. There are itemsets whose probability is below C_1, and thus they get pruned away. The process repeats until all the frequent items are visited. The correlation threshold in the next step was obtained to be 0.18049. Table 3 shows the various itemsets obtained in each iteration. The table clearly draws a distinction between Apriori and the proposed algorithm.

Observation of the 4-Transaction database

The results affirm that the proposed algorithm generates more candidate keys than the traditional Apriori. The number of rules generated by an n-transaction (nT) database is 2n [ ]. Thus the number of rules generated in each step during itemset generation increases by a factor 2n · a, where a is the difference in the number of frequent items between the traditional and the proposed algorithm. The comparison of the number of rules generated is shown in graph 1. The point that needs to be noted is that throughout the problem the confidence rate was fixed at 70%. If we need more accurate results, we can fix the confidence rate at a higher level. The higher the confidence rate, the greater the performance of the algorithm.

Time complexity of algorithm

6. CONCLUSIONS
In this paper, a correlation threshold was proposed to modify the traditional Apriori algorithm. By pruning the infrequent itemsets and retaining the frequent ones, strong rules are created. The database scan, which fully depended on the length of the frequent itemsets, was supplanted by the introduction of the probabilistic array. This helped to attain a better time complexity. The results affirm the fact that, with extended inter-transactional association, comprehensive and more interesting relations could be mined from the database.

[Graph 1: Frequent itemsets per iteration for Proposed Apriori (9T), Existing Apriori (9T), Proposed Apriori (4T) and Existing Apriori (4T)]
REFERENCES
[1] Sanjeev Tao, Priyanka Gupta, "Implementing Improved Algorithm over Apriori Data Mining Association Rule Algorithm", IJCST, Volume 3, Issue 1, Jan-March 2012, pp. 489-493.
[2] Huan Wu, Zhigang Lu, Lin Pan, Rongsheng Xu and Wenbao Jiang, "An Improved Apriori-based Algorithm for Association Rules Mining", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009, pp. 51-55.
[3] Colin Cooper and Michele Zito, "Realistic Synthetic Data for Testing Association Rule Mining Algorithms for Market Basket Databases", Knowledge Discovery in Databases: PKDD 2007, Volume 4702/2007, pp. 398-405.
[4] Lei Ji, Baowen Zhang, Jianhua Li, "A New Improvement on Apriori Algorithm", IEEE 1-4244-0605-6/06, 2006, pp. 840-844.
[5] David L. Olson and Desheng Wu, "Decision Making with Uncertainty and Data Mining", in X. Li, S. Wang and Z.Y. Dong (Eds.), Lecture Notes in Artificial Intelligence (pp. 1-9), Berlin: Springer, 2005.
[6] Aparna S. Varde, Makiko Takahashi, Elke A. Rundensteiner, Matthew O. Ward, Mohammed Maniruzzaman and Richard D. Sisson Jr., "Apriori Algorithm and Game of Life for Predictive Analysis in Materials Science", International Journal of Knowledge-based and Intelligent Engineering Systems 8, 2004, pp. 1-16.
[7] Nandagopal, S., "Mining of Meteorological Data Using Modified Apriori Algorithm", European Journal of Scientific Research, Volume 47, No. 2, 2010, pp. 295-308.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", in Proc. VLDB 1994, pp. 487-499.
