
CS 536/CS 432 – Data Mining

Assignment 2
Due: March 04 (Monday) at 12 midnight

Instructions: Submit a soft-copy report to the submission folder on LMS. Include
the report and the code needed to reproduce your results.

1. Apriori and FP-Growth Algorithms (25 points)


Consider the following transactional database:
TID Items
1 BD
2 ABD
3 AC
4 EF
5 CDEF
6 BE
7 AE
8 AEF
9 ADE
10 AE
11 BDF
12 DE
13 DFF
14 CDE
a. Find all frequent itemsets using the Apriori algorithm. Assume minimum
support count is 2.
b. Find all frequent itemsets using the FP-growth algorithm. Assume minimum
support count is 2.
c. Identify all closed and maximal frequent itemsets.
d. Generate all strong association rules from the longest closed pattern(s) found
in the database. Assume minimum confidence is 70%.
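
As a minimal sketch of the definitions behind parts (a), (c) and (d), the
following Python snippet runs a simplified level-wise (Apriori-style) search on
a small hypothetical database, not the one given above. It then marks closed and
maximal itemsets and computes rule confidence. The function names
(frequent_itemsets, closed_and_maximal, confidence) are illustrative
assumptions, and the candidate generation is deliberately naive rather than
tuned.

```python
from itertools import combinations

# Hypothetical toy database, for illustration only (not the assignment data).
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"B", "C"},
    {"B"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, min_count):
    """Level-wise search: size-(k+1) candidates come from frequent k-itemsets."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        counts = {c: support_count(c, db) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        # Join step: union pairs of frequent k-itemsets that differ in one item.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1}
        # Prune step: keep candidates whose every k-subset is frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent
                        for s in combinations(c, len(c) - 1))]
    return result

def closed_and_maximal(freq):
    """Closed: no superset with equal support. Maximal: no frequent superset."""
    closed, maximal = [], []
    for s, n in freq.items():
        supersets = [t for t in freq if s < t]
        if all(freq[t] < n for t in supersets):
            closed.append(s)
        if not supersets:
            maximal.append(s)
    return closed, maximal

def confidence(antecedent, consequent, db):
    """conf(X -> Y) = support(X u Y) / support(X)."""
    return support_count(antecedent | consequent, db) / support_count(antecedent, db)

freq = frequent_itemsets(transactions, min_count=2)
closed, maximal = closed_and_maximal(freq)
rule_conf = confidence(frozenset("AC"), frozenset("B"), transactions)
```

A rule X -> Y is strong when its confidence meets the minimum confidence and the
itemset X u Y meets the minimum support; in the sketch, rule_conf would be
compared against the chosen threshold.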

2. Frequent Itemset Mining Using RapidMiner (45 points)


Experiment with RapidMiner’s implementation of Apriori and FP-growth algorithms.
Apply these algorithms to the Adult data set (available from LMS).

a. For Apriori, generate rules and itemsets for (i) default parameter values,
(ii) rules = 50, (iii) confidence = 0.7 and rules = 50, and (iv) minimum
support = 0.1. Summarize the results and interpret them with respect to the
income of individuals and their other attributes.
b. For FP-growth, generate itemsets for (i) default parameter values, (ii)
minimum support = 0.1, (iii) "find min number of itemsets" unchecked, and
(iv) "find min number of itemsets" unchecked and minimum support = 0.1.
Summarize and interpret the interesting results.
c. From the results in (a), separate out all strong classification rules, i.e.,
rules that contain the class attribute (income) on the right-hand side.
d. Provide a summary of the results.

Note: You can find dataset description details at the link below.

https://hpi.de/naumann/projects/repeatability/datasets/dblp-dataset.html

CS 432/536 (Sp 17-18) – Dr. Mian Muhammad Awais

3. Download the census-income dataset from LMS. (30 points)

a. Divide the dataset into 4 equal bins and find the correlated attributes from
each bin. Compare the results from each bin.
b. Apply dimensionality reduction to reduce computation. Report the results
from each part separately after dimensionality reduction. You can use
any techniques of your choice for data preprocessing and
dimensionality reduction. Describe your technique in the report; you
will be evaluated on the findings you report.
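
One possible reading of part (a) is sketched below in Python, assuming that
"4 equal bins" means four equal-sized partitions of the rows and that
"correlated" means a Pearson correlation whose absolute value exceeds a chosen
threshold. The data, the column names (x, y, z), the threshold of 0.8 and the
helper names are all hypothetical; categorical attributes would need encoding
or a different measure.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation is undefined, reported as 0 here
    return cov / (sx * sy)

def split_into_bins(rows, k=4):
    """Split rows into k equal-sized consecutive bins (leftover rows are dropped)."""
    size = len(rows) // k
    return [rows[i * size:(i + 1) * size] for i in range(k)]

def correlated_pairs(rows, columns, threshold=0.8):
    """Attribute pairs whose |r| within these rows is at least `threshold`."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson([row[a] for row in rows], [row[b] for row in rows])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical numeric data: y is a linear function of x, z is unrelated noise.
rows = [{"x": i, "y": 2 * i + 1, "z": (i * 7) % 5} for i in range(12)]
bins = split_into_bins(rows, k=4)
per_bin = [correlated_pairs(b, ["x", "y", "z"]) for b in bins]
```

Comparing the entries of per_bin shows which pairs are correlated in every bin
and which only in some. One simple reduction, offered here only as an example,
is to drop one attribute from each pair that is strongly correlated across all
bins; other techniques such as PCA fit the wording of part (b) equally well.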

Bonus Question: (20)


Note: You can submit it within 4 days after the assignment deadline, with a
separate report containing your results and findings. You can do it in Python or R.
2. Find the DBLP dataset on LMS. Preprocess the data if needed. You can use
Python, R or MATLAB to answer the questions below. Submit your code file,
instructions to run the code file, and your report. (40)

a. Find the top 10 pairs of authors who work together most often.


b. Find the top 5 authors with the maximum number of publications and citations.
Compare your results using support, confidence, lift, imbalance ratio, and
chi-square separately. Which one gives the best result and why? Select the
attributes of your choice.
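
The measures named in part (b) can be sketched in Python for a single pair of
authors, treating each paper as a transaction and each author as an item. The
author lists below are hypothetical and not taken from the DBLP dataset, the
helper names are illustrative, and the imbalance ratio uses the common textbook
definition |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A u B)), which is an
assumption about what the assignment intends.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper (not taken from the DBLP dataset).
papers = [
    ["Ann", "Bob"],
    ["Ann", "Bob", "Cy"],
    ["Ann", "Bob"],
    ["Cy", "Dee"],
    ["Ann"],
    ["Bob", "Cy"],
]

def pair_counts(papers):
    """Count how many papers each unordered pair of authors shares."""
    counts = Counter()
    for authors in papers:
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts

def measures(a, b, papers):
    """Interestingness measures for the pair (a, b) from co-occurrence counts."""
    n = len(papers)
    n_a = sum(1 for p in papers if a in p)
    n_b = sum(1 for p in papers if b in p)
    n_ab = sum(1 for p in papers if a in p and b in p)
    # 2x2 contingency table: rows = a present/absent, columns = b present/absent.
    observed = [[n_ab, n_a - n_ab], [n_b - n_ab, n - n_a - n_b + n_ab]]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return {
        "support": n_ab / n,
        "confidence": n_ab / n_a,            # confidence of the rule a -> b
        "lift": (n_ab * n) / (n_a * n_b),
        "imbalance_ratio": abs(n_a - n_b) / (n_a + n_b - n_ab),
        "chi2": chi2,
    }

top_pairs = pair_counts(papers).most_common(10)
m = measures("Ann", "Bob", papers)
```

Here top_pairs addresses part (a) directly, while m gathers the measures that
part (b) asks to compare. Lift above 1 suggests positive association, and
chi-square indicates how far the counts are from independence but not the
direction of the association.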

