
Overview of this Project

Association rule mining, which mines frequent patterns, is a fundamentally important task in the process of knowledge discovery in large databases.

In this project report, the main focus lies on the generation of frequent patterns, which is the most important task in association rule mining.

This is done by analyzing the implementations of the well-known association rule mining algorithms: Apriori, the Dynamic Itemset Counting (DIC) algorithm, and the FP-Growth algorithm.

The experimental system is developed using Java under the Windows XP operating system. The run-time behaviors of these algorithms are analyzed and compared using the Mushroom dataset.
Outline

• Introduction
• Association Rule Mining to Frequent Patterns
• Implementation
• Conclusions
• Future Enhancements
• Bibliography
Introduction to Frequent Patterns

Frequent pattern generation is the most important task in association rule mining techniques.

The well-known association rule based algorithms to mine frequent patterns are:

 Apriori

 Dynamic Itemset Counting (DIC)

 FP-Growth
Association Rule Mining

 Association rule mining is one of the fundamental data mining tasks.

 An association rule expresses certain association relationships among sets of objects, such as objects that occur together or where one implies the other.

 The goal of association rule mining is to find interesting association relationships among large sets of data items.

 Each rule is assigned two factors: Support and Confidence


 Generally, association rule mining is performed in two steps (a sketch of the support and confidence computation follows this list):

• Find all frequent itemsets.

The basic foundation of association rule algorithms is the fact that any subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Frequent itemsets are found iteratively, with cardinality from 1 to k (k-itemsets).

• Use the frequent itemsets to generate strong rules having minimum confidence.
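Below is a minimal Java sketch of how the support and confidence of a candidate rule can be computed over a list of transactions. The class and method names are illustrative assumptions and are not taken from the FPMiner tool described later in this report.

```java
import java.util.*;

public class RuleMeasures {

    // Count how many transactions contain every item of the given itemset.
    static int countOccurrences(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    // support(X) = |{t : X is a subset of t}| / |transactions|
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        return (double) countOccurrences(transactions, itemset) / transactions.size();
    }

    // confidence(X => Y) = support(X union Y) / support(X)
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return support(transactions, union) / support(transactions, x);
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("A", "B", "C")),
                new HashSet<>(Arrays.asList("A", "B")),
                new HashSet<>(Arrays.asList("A", "C")),
                new HashSet<>(Arrays.asList("B", "C")));

        System.out.println("support({A, B})    = "
                + support(transactions, new HashSet<>(Arrays.asList("A", "B"))));
        System.out.println("confidence(A => B) = "
                + confidence(transactions, Collections.singleton("A"), Collections.singleton("B")));
    }
}
```

A rule X => Y is considered strong when its support and confidence both reach the user-specified minimum thresholds.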
FP Array
• The FP-Array technique greatly reduces the need to traverse FP-trees.

• The FP-Array technique obtains significantly improved performance over FP-tree based algorithms.

• The FP-Array technique is used in new algorithms for finding maximal and closed frequent itemsets.
FP Array Applications
• It generates the frequent patterns from the existing datasets.

• It applies the given minimum support to the input data.

• It reduces the time complexity of searching for the frequent itemsets.

• It displays the number of records, row- and column-wise, from the datasets.
Rule to Mine Frequent Items

The frequent itemset mining algorithms can be classified according to the following aspects:

• The type of the discovered frequent itemsets
• Whether candidate generation is used
• The representation of the transactions
• The itemset representation used in the algorithm
• The number of disk accesses
• The length of the maximal frequent pattern
The implemented algorithms differ as follows:

                                Apriori    DIC    FP-Growth
With candidate generation                      
Without candidate generation                          
BFS                                            
DFS                                                   
FP-Tree                                               
Stages in Knowledge Discovery in Frequent Databases

 Selection - selecting and segmenting the data that are relevant to the given criteria.

 Preprocessing - the data cleaning stage, where unnecessary information is removed.

 Transformation - the data is made usable and navigable.

 Data Mining - extraction of patterns from the data.

 Interpretation and Evaluation - the patterns obtained in the data mining stage are converted into knowledge to support decision-making.

 Data Visualization - examining the large volumes of data and detecting patterns visually.
Discoveries in Frequent Databases

Apriori Algorithm

The Apriori algorithm is the most popular association rule algorithm. Apriori uses a bottom-up search.

 The Apriori algorithm works as follows (a minimal sketch in Java follows this description):

• In the first step, the Apriori algorithm generates the candidate 1-itemsets. Then, the itemset counts are compared with the minimum support value to find the set L1 of frequent itemsets.

• In the second step, the algorithm uses L1 to construct the set C2 of candidate 2-itemsets, and so on. The process finishes when no more candidates can be generated.

 In each phase, all the transactions in the data set are scanned.

 Finally, all frequent itemsets are returned.

 Disadvantage: multiple database scans.
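A minimal sketch of the level-wise Apriori loop in Java is shown below. It assumes the transactions are already loaded as sets of strings and uses an absolute minimum-support count; the candidate join is deliberately simplified (no subset pruning), and all names are illustrative rather than taken from the FPMiner implementation.

```java
import java.util.*;

public class AprioriSketch {

    // Returns every frequent itemset together with its support count.
    static Map<Set<String>, Integer> apriori(List<Set<String>> transactions, int minSupportCount) {
        Map<Set<String>, Integer> allFrequent = new LinkedHashMap<>();

        // C1: candidate 1-itemsets are the individual items that appear in the database.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(Collections.singleton(item));
            }
        }

        while (!candidates.isEmpty()) {
            // One scan of the database per level to count the current candidates.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions) {
                for (Set<String> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }

            // Lk: keep only the candidates that reach the minimum support.
            Set<Set<String>> frequent = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minSupportCount) {
                    frequent.add(e.getKey());
                    allFrequent.put(e.getKey(), e.getValue());
                }
            }

            // C(k+1): join the frequent k-itemsets; the downward-closure property
            // guarantees every frequent (k+1)-itemset arises from such a join.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : frequent) {
                for (Set<String> b : frequent) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1) {
                        next.add(union);
                    }
                }
            }
            candidates = next;
        }
        return allFrequent;
    }
}
```

The one-scan-per-level structure of this loop is exactly why Apriori needs multiple database scans, which is the disadvantage noted above.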
DIC Algorithm

The DIC (Dynamic Itemset Counting) algorithm, which uses fewer database scans, presents a new approach for finding large itemsets.

 The aim of the DIC algorithm is to improve performance by eliminating repeated database scans.

 The DIC algorithm divides the database into partitions (intervals of M transactions) and uses a dynamic counting strategy. It determines stop points for itemset counting: at appropriate points during the database scan, it stops counting some itemsets and starts counting other itemsets.

 Four symbols indicate the different states of itemsets: solid box, solid circle, dashed box, and dashed circle.
 The algorithm is described as follows (a sketch of the state bookkeeping follows these steps):

Step 1: The empty itemset is marked with a solid box and all 1-itemsets are placed in the dashed circle state.

Step 2: After reading one interval of M transactions from the database, do the following:

• Check each itemset in the dashed circle state. If it exceeds the support threshold, change it from dashed circle to dashed box.

• Check each superset of the dashed-circle itemsets. If all of its subsets are in the solid box or dashed box state, add it to the dashed circle state.

• Check each set in the dashed circle and dashed box states. If it has been counted over all the transactions, change it to solid circle if it is in a circle, or to solid box if it is in a box.

Step 3: When the end of the transactions is reached, go back to the beginning and repeat Step 2 until no itemset remains in the dashed circle or dashed box state.
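The following is a highly simplified Java sketch of the DIC state bookkeeping described above. It models only the four itemset states, the per-interval counting, and the Step 2 transitions for itemsets that are already being tracked; the generation of new superset candidates is omitted, an absolute support count is assumed, and all names are illustrative.

```java
import java.util.*;

public class DicStates {

    // The four itemset states used by DIC.
    enum State { DASHED_CIRCLE, DASHED_BOX, SOLID_CIRCLE, SOLID_BOX }

    // Bookkeeping attached to each itemset that DIC is currently counting.
    static class Tracked {
        final Set<String> itemset;
        State state = State.DASHED_CIRCLE;  // newly added itemsets start as dashed circles
        int count = 0;                      // occurrences seen so far
        int transactionsSeen = 0;           // transactions this itemset has been counted over

        Tracked(Set<String> itemset) { this.itemset = itemset; }
    }

    // Count one interval of M transactions starting at offset, then apply
    // the Step 2 transitions from the description above.
    static void processInterval(List<Set<String>> transactions, int offset, int m,
                                Collection<Tracked> tracked, int minSupportCount) {
        int end = Math.min(offset + m, transactions.size());
        for (int i = offset; i < end; i++) {
            Set<String> t = transactions.get(i);
            for (Tracked tr : tracked) {
                // Dashed itemsets are the ones still being counted.
                if (tr.state == State.DASHED_CIRCLE || tr.state == State.DASHED_BOX) {
                    if (t.containsAll(tr.itemset)) {
                        tr.count++;
                    }
                    tr.transactionsSeen++;
                }
            }
        }
        for (Tracked tr : tracked) {
            // Dashed circle -> dashed box once the support threshold is reached.
            if (tr.state == State.DASHED_CIRCLE && tr.count >= minSupportCount) {
                tr.state = State.DASHED_BOX;
            }
            // Dashed -> solid once the itemset has been counted over all transactions.
            if (tr.transactionsSeen >= transactions.size()) {
                if (tr.state == State.DASHED_CIRCLE) tr.state = State.SOLID_CIRCLE;
                else if (tr.state == State.DASHED_BOX) tr.state = State.SOLID_BOX;
            }
        }
        // The full algorithm would also add supersets of newly boxed itemsets
        // as new dashed circles here; that part is left out of this sketch.
    }
}
```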
FP-Growth

 FP-Growth is an algorithm for generating frequent itemsets for association rules. This algorithm compresses a large database into a compact frequent-pattern tree (FP-tree) structure.

 The FP-tree structure stores all necessary information about the frequent itemsets in a database.

 A frequent pattern tree (FP-tree for short) is defined as follows (a sketch of the node structure follows this list):

1. The root is labeled “null” and has the item nodes as its children.

2. Each node consists of three fields: item-name (the frequent item it represents), count (the number of transactions that share that node), and node-link (the next node in the FP-tree carrying the same item-name).

3. The frequent-item header table contains two fields, item-name and head of node-link (which points to the first node in the FP-tree holding that item).
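A minimal Java sketch of these node and header-table fields, together with the insertion of one frequency-sorted transaction, might look like the following; the class names are illustrative assumptions and do not come from the FPMiner source.

```java
import java.util.*;

public class FpTreeSketch {

    // One node of the FP-tree: item-name, count, and node-link,
    // plus parent/children references for navigating the tree.
    static class FpNode {
        final String itemName;  // the frequent item this node represents ("null" for the root)
        int count;              // number of transactions sharing this node
        FpNode nodeLink;        // next node in the tree with the same item-name
        final FpNode parent;
        final Map<String, FpNode> children = new LinkedHashMap<>();

        FpNode(String itemName, FpNode parent) {
            this.itemName = itemName;
            this.parent = parent;
        }
    }

    // One row of the frequent-item header table: item-name and
    // the head of the node-link chain for that item.
    static class HeaderEntry {
        final String itemName;
        FpNode headOfNodeLink;

        HeaderEntry(String itemName) { this.itemName = itemName; }
    }

    final FpNode root = new FpNode("null", null);
    final Map<String, HeaderEntry> headerTable = new LinkedHashMap<>();

    // Insert one transaction whose items are already sorted by descending frequency.
    void insert(List<String> sortedItems) {
        FpNode current = root;
        for (String item : sortedItems) {
            FpNode child = current.children.get(item);
            if (child == null) {
                child = new FpNode(item, current);
                current.children.put(item, child);
                // Thread the new node into the header table's node-link chain.
                HeaderEntry entry = headerTable.computeIfAbsent(item, HeaderEntry::new);
                child.nodeLink = entry.headOfNodeLink;
                entry.headOfNodeLink = child;
            }
            child.count++;
            current = child;
        }
    }
}
```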
Use case Diagram for the proposed system

[Use case diagram: the User actor loads a Data Set File and invokes the Apriori, Dynamic Itemset Counting, and FP-Growth use cases.]
Identifying Classes from the above Use Cases
 Architectural design

Architectural design is the division of the software into subsystems and components, together with the process of deciding how these will be connected and how they will interact, including determining their interfaces.

[Architecture diagram: a GUI for selecting the file, support, and algorithm, which feeds the Apriori, Dynamic Itemset Counting, FP-Growth, and Matrix Based Association modules.]
 User interface design

The design of the user interface is to display and obtain the needed information in an accessible, efficient manner. The user interface can employ one or more windows, and each window should serve a clear, specific purpose. The main flow is as follows (a Swing sketch follows these steps):

Step 1: Selection of the filename
Step 2: Display the contents of the file in the text area
Step 3: Enter a valid support
Step 4: Select the algorithm
Step 5: Display the frequent patterns for Apriori
Step 6: If the selected algorithm is DIC, then enter the step length
Step 7: Display the frequent patterns for DIC
Step 8: Display the frequent patterns for FP-Growth
Step 9: Display the frequent patterns for MBA
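A minimal Swing sketch of this window flow is shown below. The widget layout, labels, and class name are illustrative assumptions and are not taken from the FPMiner code; the mining step itself is only indicated by a placeholder dialog.

```java
import javax.swing.*;
import java.awt.*;
import java.io.IOException;
import java.nio.file.Files;

public class FpMinerUiSketch {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Frequent Pattern Miner (sketch)");
            JTextArea fileContents = new JTextArea(20, 60);     // Step 2: show the data set
            JTextField supportField = new JTextField("50", 5);  // Step 3: minimum support
            JComboBox<String> algorithm = new JComboBox<>(      // Step 4: choose the algorithm
                    new String[]{"Apriori", "DIC", "FP-Growth", "MBA"});
            JButton openButton = new JButton("Open data set");  // Step 1: select the file
            JButton runButton = new JButton("Mine frequent patterns");

            openButton.addActionListener(e -> {
                JFileChooser chooser = new JFileChooser();
                if (chooser.showOpenDialog(frame) == JFileChooser.APPROVE_OPTION) {
                    try {
                        fileContents.setText(new String(
                                Files.readAllBytes(chooser.getSelectedFile().toPath())));
                    } catch (IOException ex) {
                        JOptionPane.showMessageDialog(frame, "Could not read file: " + ex.getMessage());
                    }
                }
            });

            runButton.addActionListener(e -> {
                // Steps 5-9: the selected algorithm would be run here and the
                // resulting frequent patterns shown in another window.
                JOptionPane.showMessageDialog(frame,
                        "Would run " + algorithm.getSelectedItem()
                                + " with support " + supportField.getText());
            });

            JPanel controls = new JPanel();
            controls.add(openButton);
            controls.add(new JLabel("Support:"));
            controls.add(supportField);
            controls.add(algorithm);
            controls.add(runButton);

            frame.add(new JScrollPane(fileContents), BorderLayout.CENTER);
            frame.add(controls, BorderLayout.SOUTH);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}
```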
RESULTS

The FPMiner tool is implemented in Java, and all the experiments are performed on a 1.7 GHz PC with 256 MB of memory running the Windows XP operating system.

Experiment 1:

Execution times at different support values for the different algorithms are tabulated as follows:
Support    Execution time (Apriori)    Execution time (DIC)    Execution time (FP-Growth)
50         187 ms                      226754 ms               94 ms
60         110 ms                      184297 ms               74 ms
70         78 ms                       161265 ms               46 ms
80         47 ms                       106953 ms               32 ms
90         32 ms                       74984 ms                31 ms
Experiment 2:

The number of frequent itemsets generated using the different algorithms:

Algorithm    Support    Frequent itemsets generated
Apriori      50         153
Apriori      60         51
Apriori      70         31
Apriori      80         23
Apriori      90         9
MBA          50         153
MBA          60         51
MBA          70         31
MBA          80         23
MBA          90         9
CONCLUSION

 Frequent pattern mining is used for finding frequent itemsets among the items in a given data set.

 The results show that:

• Apriori does not run as effectively as the FP-tree based algorithm.

• Apriori runs slowly because the transactions in the dataset are dense.

• DIC (Dynamic Itemset Counting) is much slower than every other algorithm on the real dataset.

• MBA is better than DIC, but not much better than the other two algorithms in the case of the MUSHROOM dataset.
FUTURE ENHANCEMENT

 There are still many interesting research issues related to the extensions of frequent pattern mining, such as mining structured patterns by further development of these approaches, mining approximate or fault-tolerant patterns in noisy environments, frequent-pattern-based clustering and classification, and so on.
FP Array Techniques

 The FP-Array technique greatly reduces the need to traverse FP-trees.

 The FP-Array technique obtains significantly improved performance over FP-tree based algorithms.

 The FP-Array technique is used in new algorithms for finding all maximal and closed frequent itemsets.
 The FP-tree uses a compact data structure based on the following properties (a sketch of the first scan follows this list):

- Frequent pattern mining performs one scan of the database to determine the set of frequent items.

- The method stores each item in a compact structure, so more than two database scans are unnecessary.

- Each frequent item is located in the FP-tree, and each node holds an item together with the count of that frequent item.

- The items in each transaction have to be sorted in descending order of their frequency.
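A minimal Java sketch of that first database scan and of reordering a transaction in descending frequency order (the order in which items would then be inserted into the FP-tree) follows; the names are illustrative.

```java
import java.util.*;
import java.util.stream.Collectors;

public class FirstScanSketch {

    // First scan: count how often each item occurs in the database.
    static Map<String, Integer> countItems(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : transactions) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only the frequent items and sort each transaction by descending frequency.
    static List<String> reorder(List<String> transaction,
                                Map<String, Integer> counts, int minSupportCount) {
        return transaction.stream()
                .distinct()
                .filter(item -> counts.getOrDefault(item, 0) >= minSupportCount)
                .sorted(Comparator.comparingInt((String item) -> counts.get(item)).reversed())
                .collect(Collectors.toList());
    }
}
```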
