
The University of the South Pacific

School of Computing, Information and Mathematical Sciences

IS328: Data Mining


Assignment I Semester 2, 2020
Total Marks: 20%

Due Date: as shown on Moodle

This assignment covers both theoretical and practical aspects of this course. The marking rubric is heavily based on Data & Information
Management, in line with the course outline and BSE program map. Rubrics have been taken from ACS-SCIMS rubrics V1.0. This
assessment covers the following course learning outcomes:
CLO 2: Perform pre-processing tasks to refine data sets
CLO 3: Apply various data mining methods for interpreting results

Overview
The goals of Assignment I are:
• As a class, to become familiar with the WEKA tool and understand some of the data preprocessing methods, attribute selection
methods, classification and clustering algorithms.
• As a team of 2 members, to discuss your findings on the chosen questions and consolidate your learning in a report (answers
to each question: algorithm, working screenshots and results, comparison and analysis).
• To make a consolidated report about your findings.
• This assignment is an important part of the course and counts for 20% of your final grade. Grades will be based on the completeness
of your findings, analysis and the quality of the report.

IS328: Assignment I Page 1 of 10 Dr.Vani Vasudevan


Grading
• This assignment is worth 20 marks.
• Late delivery without prior notification and permission from the instructor will result in a loss of 10% of the marks per day.
• Plagiarism or cheating in any form is strictly prohibited. If found, the complete Assignment 1 will be nullified.
Plagiarism
For all the Assignment/Project works it’s essential that you avoid plagiarism. Not only do you expose yourself to possibly serious
disciplinary consequences, but you’ll also cheat yourself of a proper understanding of the concepts emphasized in the assignment.
• It’s not plagiarism to discuss the assignment with your friends and consider solutions to the problems together. However, it is
plagiarism if you copy all or part of each other’s solutions.

Question 1: Data Pre-processing [5 Marks]

Use bank-data.arff to perform a series of pre-processing operations using filters in WEKA.

1. Selecting or Filtering Attributes

In the bank-data.arff data set, each record is uniquely identified by a customer id (the "id" attribute). Remove this attribute before the data mining
step by using the attribute filters in WEKA. In the Filter panel, click on the "Choose" button. This will show a popup window with a list
of available filters. Scroll down the list and select the weka.filters.unsupervised.attribute.Remove filter.

Next, click on the text field immediately to the right of the "Choose" button. In the resulting dialog box, enter the index of the attribute to be
filtered out (this can be a range, or a list separated by commas). In this case, enter 1, which is the index of the "id" attribute (see the left
panel). Make sure that the invertSelection option is set to false (otherwise everything except attribute 1 will be filtered). Then click "OK".
Now, in the filter box you will see Remove -R 1.

Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and create a new working relation (whose name
now includes the details of the filter that was applied). The result is depicted. Display the result.
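
The effect of Remove -R 1 can be pictured outside WEKA with a minimal Python sketch. This is not WEKA's implementation: the function name remove_attribute and the three-attribute sample relation are hypothetical, and the parsing assumes one @attribute per line and plain comma-separated data rows.

```python
def remove_attribute(arff_text, index):
    """Drop the attribute at 1-based `index` from a simple ARFF text.

    Assumption: one @attribute per line and plain comma-separated data rows
    (no quoted commas, no sparse format). This is a simplification, not
    WEKA's own ARFF parser.
    """
    out, attrs_seen, in_data = [], 0, False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if in_data and stripped and not stripped.startswith("%"):
            # Data row: drop the value in the removed column.
            values = stripped.split(",")
            del values[index - 1]
            out.append(",".join(values))
        elif lower.startswith("@attribute"):
            # Header row: skip the declaration of the removed attribute.
            attrs_seen += 1
            if attrs_seen != index:
                out.append(line)
        else:
            if lower.startswith("@data"):
                in_data = True
            out.append(line)
    return "\n".join(out)


# Hypothetical miniature relation, not the real bank-data.arff.
sample = """@relation bank-sample
@attribute id string
@attribute age numeric
@attribute sex {FEMALE,MALE}
@data
ID001,48,FEMALE
ID002,40,MALE"""

filtered = remove_attribute(sample, 1)
print(filtered)
```

Passing index 1 mirrors the -R 1 option: the "id" declaration and the first value of every data row disappear, while the remaining attributes keep their order.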



It is now possible to apply additional filters to the new working relation. Save the intermediate results as separate data files and treat each
step as a separate WEKA session. To save the new working relation as an ARFF file, click on the "Save" button in the top panel. Save the new
relation in the file bank-data-R1.arff.

2. Discretization

Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on
numeric or continuous attributes. There are 3 such attributes in this data set: "age", "income", and "children". In the case of the "children"
attribute, the only possible values are 0, 1, 2, and 3, so keep all of these values in the data. This
means you can discretize simply by removing the keyword "numeric" as the type for the "children" attribute in the ARFF file and replacing
it with the set of discrete values. Do this directly in a text editor and save the resulting relation in a separate file, bank-data2.arff.
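
The manual header edit described above could also be scripted. The following is a minimal Python sketch under stated assumptions: make_nominal is a hypothetical helper, the three header lines are an illustrative fragment rather than the full bank-data header, and the attribute is assumed to be declared on a single line ending in the keyword numeric.

```python
import re


def make_nominal(arff_text, name, values):
    """Rewrite '@attribute <name> numeric' as a nominal declaration.

    Assumption: the declaration sits on one line and ends with 'numeric'.
    """
    pattern = re.compile(
        r"^(@attribute[ \t]+" + re.escape(name) + r"[ \t]+)numeric[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    # \g<1> keeps '@attribute children ' and the braces list the values.
    replacement = r"\g<1>{" + ",".join(values) + "}"
    return pattern.sub(replacement, arff_text)


# Illustrative header fragment (hypothetical, not the complete file).
header = (
    "@attribute income numeric\n"
    "@attribute children numeric\n"
    "@attribute car {NO,YES}"
)
edited = make_nominal(header, "children", ["0", "1", "2", "3"])
print(edited)
```

Only the "children" declaration changes; "income" stays numeric so that WEKA's Discretize filter can still bin it later.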

Rely on WEKA to perform discretization on the "age" and "income" attributes. Divide each of these into 3 bins (intervals). The
WEKA discretization filter can divide the ranges blindly, or use various statistical techniques to automatically determine the best way of
partitioning the data. In this case, perform simple binning.

First, load the filtered data set into WEKA by opening the file "bank-data2.arff". Select the "children" attribute in this new data set and note that it
is now a categorical attribute with four possible discrete values. Now, once again activate the Filter dialog box, but this time select
weka.filters.unsupervised.attribute.Discretize.

Next, to change the defaults for this filter, click on the box immediately to the right of the "Choose" button. This will open the Discretize
filter dialog box. Enter the index of the attribute to be discretized. In this case, enter 1, corresponding to the attribute "age". Also, enter 3
as the number of bins (note that it is possible to discretize more than one attribute at the same time by using a list of attribute indices). Since
this is simple binning, all of the other available options are set to "false".

Click "Apply" in the Filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins. To examine
the results, save the new working relation in the file bank-data3.arff.



Now, examine the new data set using a text editor (in this case, TextPad/WordPad). You can observe that WEKA has assigned its own labels
to each of the value ranges for the discretized attribute. For example, the lower range in the "age" attribute is labeled "(-inf-34.333333]"
(enclosed in single quotes and escape characters), while the middle range is labeled "(34.333333-50.666667]", and so on. These labels now
also appear in the data records where the original age value was in the corresponding range.
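
The cut points quoted above are what equal-width ("simple") binning produces when the attribute runs from 18 to 67. A minimal Python sketch follows. It is not WEKA's code: equal_width_bins is a hypothetical helper, the age values are made up, and the minimum of 18 and maximum of 67 are assumptions chosen because they reproduce the labels in the text.

```python
def equal_width_bins(values, bins):
    """Return (labels, bin index per value) for equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # Interior cut points; the outer edges are left open as -inf and inf.
    cuts = [lo + width * i for i in range(1, bins)]
    edges = ["-inf"] + [f"{c:.6f}" for c in cuts] + ["inf"]
    labels = []
    for i in range(bins):
        closing = ")" if i == bins - 1 else "]"
        labels.append(f"({edges[i]}-{edges[i + 1]}{closing}")
    # A value lands in the first bin whose upper cut point is >= the value.
    assigned = [sum(v > c for c in cuts) for v in values]
    return labels, assigned


# Hypothetical ages; only the minimum (18) and maximum (67) set the cut points.
ages = [18, 34, 35, 50, 51, 67]
labels, assigned = equal_width_bins(ages, 3)
print(labels)
print(assigned)
```

Each bin spans (67 - 18) / 3 ≈ 16.33 years, so the interior cut points fall at about 34.33 and 50.67, matching the first two labels shown in the text.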

Next, apply the same process to discretize the "income" attribute into 3 bins. Again, WEKA automatically performs the binning and replaces
the values in the "income" column with the appropriate automatically generated labels. Save the new relation in bank-data3.arff, replacing
the older version.

Clearly, the WEKA labels, while readable, leave much to be desired as far as naming conventions go. Therefore, use the global search/replace
functions in TextPad/WordPad to replace these labels with more succinct and readable ones. For example, replace all occurrences of the age label "(-inf-34.333333]"
with the label "0_34".

Note that the new label now appears in place of the old one both in the attribute section of the ARFF file as well as in the relevant data
records. Repeat this manual re-labeling process with all of the WEKA-assigned labels for the "age" and the "income" attributes. Also,
change the relation name in the ARFF file to bank-data-final and save the file as bank-data-final.arff.
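
The global search/replace step can be sketched in Python as plain string substitution. Everything here is illustrative: relabel is a hypothetical helper, the quoting of the labels imitates the "single quotes and escape characters" mentioned above and may differ in your saved file, and the replacement names "35_51" and "52_max" are invented examples (only "0_34" comes from the text).

```python
def relabel(arff_text, mapping):
    """Replace every occurrence of each old label with its new label."""
    for old, new in mapping.items():
        arff_text = arff_text.replace(old, new)
    return arff_text


# Hypothetical fragment; check the exact quoting in your own bank-data3.arff.
sample = (
    "@attribute age {'\\'(-inf-34.333333]\\'',"
    "'\\'(34.333333-50.666667]\\'','\\'(50.666667-inf)\\''}\n"
    "@data\n"
    "'\\'(34.333333-50.666667]\\'',FEMALE\n"
)

mapping = {
    "'\\'(-inf-34.333333]\\''": "0_34",         # label given in the text
    "'\\'(34.333333-50.666667]\\''": "35_51",   # hypothetical replacement name
    "'\\'(50.666667-inf)\\''": "52_max",        # hypothetical replacement name
}

cleaned = relabel(sample, mapping)
print(cleaned)
```

Because the same substitution runs over the whole file, the attribute declaration and the data records stay consistent with each other.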

3. Missing Values

1. Open file bank‐data.arff

2. Check if there are any missing values in any attribute.
3. Edit the data to create some missing values.
4. Delete some data in the “region” (nominal) and “children” (numeric) attributes. Click on the “OK” button when finished.
5. Make note of Label that has Max Count in “region” and Mean of “children” attributes.
6. Choose ReplaceMissingValues filter (weka.filters.unsupervised.attribute.ReplaceMissingValues). Then, click on Apply button.
7. Look into the data. How did those missing values get replaced?
8. Edit bank‐data.arff with text editor. Make some data missing by replacing them with ‘?’. (Try with nominal and numeric attributes).
Save to bank‐data‐missing.arff.
9. Load bank‐data‐missing.arff into WEKA, observe the data and attribute information.
10. Replace missing values by the same procedure you had done before.
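
As a point of comparison for steps 5 to 7, the sketch below fills gaps with the most frequent label for a nominal column and with the mean for a numeric column, which is the behaviour WEKA documents for ReplaceMissingValues. It is a simplified illustration, not WEKA's code: replace_missing is a hypothetical helper and the two columns are made-up values.

```python
from collections import Counter

MISSING = "?"  # ARFF marker for a missing value


def replace_missing(column, numeric):
    """Fill '?' with the mean (numeric) or the most frequent label (nominal)."""
    present = [v for v in column if v != MISSING]
    if numeric:
        fill = sum(float(v) for v in present) / len(present)
        return [fill if v == MISSING else float(v) for v in column]
    # On a tie, Counter returns the label seen first; WEKA may break ties differently.
    fill = Counter(present).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in column]


# Hypothetical columns, not taken from bank-data.arff.
region = ["TOWN", "INNER_CITY", "?", "INNER_CITY", "RURAL"]
children = ["1", "3", "?", "0"]

filled_region = replace_missing(region, numeric=False)
filled_children = replace_missing(children, numeric=True)
print(filled_region)
print(filled_children)
```

Comparing such hand-computed values with the label and mean noted in step 5 is one way to check what the filter did to your own edited data.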



Write a report outlining the main items completed under the following operations. Provide evidence using screenshots and other
descriptions.

1. Selecting or Filtering Attributes
2. Discretization
3. Missing Values

Include your own reflection on the capabilities of WEKA to perform data preprocessing.

CBOK: Data and Information Management
Marks % Allocated: 5
Marks Attained:

Unsatisfactory (0%-49%)
I. Do not identify accurately any of the data quality problems
II. Do not perform all required tasks correctly and consistently
III. Provide inaccurate and/or incomplete reports

Satisfactory (50%-75%)
I. Identify accurately some of the data quality problems
II. Perform most of the required tasks correctly and consistently
III. Provide relatively accurate and complete reports

Good (76%-100%)
I. Identify accurately most of the data quality problems
II. Perform all the required tasks correctly and consistently
III. Provide accurate and complete reports

Sub Total & comments:



Question 2: Apply J48, PART and SimpleCart Classification Algorithms [25 Marks]

A.
1. You suspect marked differences in promotional purchasing trends between female and male Acme credit card customers. You wish
to confirm or refute your suspicion. Perform a supervised data mining session using the CreditCardPromotion database (ccpromo.arff)
in conjunction with PART. Use sex as the output attribute. Designate all other attributes as input attributes and use all 15 instances
for training. Write a summary confirming or refuting your hypothesis. Base the analysis on the rules created for each class.

2. Repeat the exercise using J48 rather than PART but base the analysis on the created decision tree.

B.
1. For this question, use WEKA’s J48 decision tree algorithm to perform a data mining session with the cardiology patient data. Open
the WEKA Explorer and load the cardiology-weka.arff file. This is the mixed form of the dataset, containing both categorical and
numeric data. The data contains 303 instances representing patients who have a heart condition (sick) as well as those who do not.

Answer the following Preprocess Mode Questions:


a. How many of the instances are classified as Healthy?
b. What percent of the data is female?
c. What is the most commonly occurring domain value for the attribute slope?
d. What is the mean age within the dataset?
e. How many instances have the value 2 for # of Colored Vessels?

Answer the Classification Questions using J48:


Note: Perform a supervised mining session using 10-fold cross validation with J48 and class as the output attribute.
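
To make the idea of 10-fold cross validation concrete, here is a minimal Python sketch that splits 303 instance indices into 10 folds and uses each fold once for testing. It is an illustration under assumptions: k_fold_indices is a hypothetical helper, the seed is arbitrary, and the split is a plain shuffle, whereas WEKA's Explorer stratifies the folds by class.

```python
import random


def k_fold_indices(n, k, seed=1):
    """Yield (train, test) index lists for plain k-fold cross validation.

    Assumption: a simple shuffle with no stratification by class.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


splits = list(k_fold_indices(303, 10))
sizes = [len(test) for _, test in splits]
print(sizes)
```

With 303 instances, three folds hold 31 instances and seven hold 30, and every instance is tested exactly once, so the reported accuracy is based on all 303 predictions.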

a. What attribute did J48 choose as the top-level decision tree node?
b. Draw a diagram showing the attributes and values for the first two levels of the J48 created decision tree.
c. What percent of the instances were correctly classified?
d. How many healthy class instances were correctly classified?
e. How many sick class instances were falsely classified as healthy individuals?
f. Determine how True Positive Rate (TP Rate) and False Positive Rate (FP Rate) are computed.
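
For item f, the usual definitions are TP Rate = TP / (TP + FN) and FP Rate = FP / (FP + TN), computed per class from the confusion matrix. The Python sketch below applies them. The function name rates is hypothetical and the counts are invented for illustration; they are not results from the cardiology data.

```python
def rates(confusion, positive):
    """Return (TP rate, FP rate) for class `positive`.

    `confusion[actual][predicted]` holds the instance count.
    """
    tp = confusion[positive][positive]
    fn = sum(c for pred, c in confusion[positive].items() if pred != positive)
    fp = sum(row[positive] for actual, row in confusion.items() if actual != positive)
    tn = sum(
        c
        for actual, row in confusion.items() if actual != positive
        for pred, c in row.items() if pred != positive
    )
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical counts, NOT output from WEKA.
confusion = {
    "Healthy": {"Healthy": 40, "Sick": 10},
    "Sick": {"Healthy": 5, "Sick": 45},
}
tpr, fpr = rates(confusion, "Healthy")
print(tpr, fpr)
```

In this made-up matrix, 40 of the 50 actual Healthy instances are recognised (TP rate 0.8) and 5 of the 50 actual Sick instances are wrongly labelled Healthy (FP rate 0.1).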



Answer the Classification Questions using PART:
a. List one rule for the healthy class that covers at least 50 instances.
b. List one rule for the sick class that covers at least 50 instances.
c. List one rule that is likely to show an inaccuracy rate of at least 0.05.
d. What percent of the instances were correctly classified?
e. How many healthy class instances were correctly classified?
f. How many sick class instances were falsely classified as healthy individuals?
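
When reading PART (or J48) output for items a to c, the pair of numbers in parentheses after a rule gives the instances covered and, after the slash, how many of those are misclassified. The Python sketch below turns that suffix into a coverage count and an inaccuracy rate. The helper rule_stats and both rule lines are hypothetical; the attribute names are placeholders, not rules from the cardiology data.

```python
import re

# Matches a trailing "(covered)" or "(covered/errors)" count.
COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def rule_stats(rule_line):
    """Return (covered, inaccuracy rate) from a rule's trailing counts."""
    match = COUNTS.search(rule_line)
    if match is None:
        raise ValueError("no (covered/errors) suffix found")
    covered = float(match.group(1))
    errors = float(match.group(2) or 0.0)  # no slash means no errors
    return covered, errors / covered


# Hypothetical rule lines with placeholder attribute names.
covered_a, rate_a = rule_stats("attr_x = a AND attr_y = b: Healthy (50.0/4.0)")
covered_b, rate_b = rule_stats("attr_z = c: Sick (20.0)")
print(covered_a, rate_a)
print(covered_b, rate_b)
```

The first made-up rule covers 50 instances with 4 errors, an inaccuracy rate of 0.08, so it would satisfy both the "at least 50 instances" and the "at least 0.05" conditions.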
C.

Load the CreditScreening dataset into the WEKA Explorer. Make sure that class is designated as the output attribute.
a. Use J48 together with 10-fold cross validation to mine the data. Record your results including the attributes used to create the root
node and first level of the decision tree.
b. Use Info Gain attribute evaluation to determine the most predictive categorical attribute for each of the two classes. Return to WEKA’s
Preprocess mode. Eliminate all but the two most predictive input attributes from the attribute list. Be sure to keep the output
attribute class. Use J48 with 10-fold cross validation to mine the data. Record your results. Compare the results to those seen in part
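
Information gain, the quantity behind the Info Gain attribute evaluation used above, is the class entropy minus the class entropy that remains after splitting on an attribute. A minimal Python sketch for categorical attributes follows. The helper names and the four-instance toy data are hypothetical, and numeric attributes, which would need discretizing first, are not handled.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Entropy of the class minus its entropy after splitting on `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data, not the CreditScreening dataset.
labels = ["+", "+", "-", "-"]
perfect = ["a", "a", "b", "b"]   # separates the classes completely
useless = ["a", "b", "a", "b"]   # carries no information about the class

gain_perfect = info_gain(perfect, labels)
gain_useless = info_gain(useless, labels)
print(gain_perfect, gain_useless)
```

An attribute that separates the classes completely scores the full class entropy (1 bit here), while one that is unrelated to the class scores 0, which is why the ranking identifies the most predictive attributes.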

D.

Load the CreditScreening dataset into the WEKA Explorer. Make sure that class is designated as the output attribute.

a. Use SimpleCart together with 10-fold cross validation to mine the data. How many nodes are seen in the decision tree? What is the
classification accuracy?
b. Compare your results with those seen in question C part b.

E.

Use WordPad or MS Word to open the soybean dataset located in the folder c:\program files\weka-3-6\data (or wherever your WEKA data sets are stored). This dataset
represents one of the more famous data mining successes. Classification accuracy of unseen instances is likely to be above 90% with most
classifiers.
a. Scroll through the file to get a better understanding of the dataset. Open WEKA’s Explorer and load this dataset. Classify the data
by applying J48 with a 10-fold cross validation. Report your results.



b. Repeat your experiment using SimpleCart rather than J48. Detail the differences between this result and your result in a. Specify
differences between the decision trees and their resultant classification accuracies.
c. Return to preprocess mode. Apply Weka’s supervised attribute selection filter to the dataset. How many attributes are eliminated
from the dataset? Apply J48 to the modified data. Do your results differ from those seen in part a?

CBOK: Data and Information Management
Marks % Allocated: 25
Marks Attained:

Unsatisfactory (0%-49%)
IV Do not use correct algorithm(s) for the problem in hand
II Do not perform all required tasks correctly and consistently
III Provide inaccurate and/or incomplete reports

Satisfactory (50%-75%)
IV Identify and use correct algorithm for the problem in hand
II Perform most of the required tasks correctly and consistently
III Provide relatively accurate and complete reports

Good (76%-100%)
IV Identify and use correct algorithm for the problem in hand
II Perform all required tasks correctly and consistently
III Provide accurate and complete reports

Sub Total & comments:

Submission Instructions:

1. Completely fill in the Mark Allocation Sheet and submit it with your assignment. Failing to do so may result in a deduction of 50% of the marks.
2. This assignment can be submitted in groups of 2 members. Assign a group leader and submit the assignment through the group
leader’s Moodle account. You have to submit just one zip/rar file of your project. The submission filename should read
A1_Sxxx_Syyy.zip or A1_Sxxx_Syyy.rar, where Sxxx and Syyy are the student ids of the group members. For example,
A1_S11003232_S01004488.zip or A1_S11003232_S01004488.rar. Incorrect submission will result in a high penalty.
3. 25 marks are allocated for applying appropriate DM techniques and deriving the correct results in Question 2 (A to E: 5 marks each) and
for the drafting and consolidation of the report.
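
Before uploading, the filename convention in item 2 can be checked with a short Python sketch. The helper is_valid_submission_name is hypothetical, and the pattern assumes that each student id is the letter S followed only by digits, with no fixed length.

```python
import re

# Assumption: ids are 'S' followed by digits; the extension is zip or rar.
NAME = re.compile(r"A1_S\d+_S\d+\.(zip|rar)")


def is_valid_submission_name(filename):
    """Check a filename against the A1_Sxxx_Syyy.zip / .rar convention."""
    return NAME.fullmatch(filename) is not None


ok_zip = is_valid_submission_name("A1_S11003232_S01004488.zip")
ok_rar = is_valid_submission_name("A1_S11003232_S01004488.rar")
missing_member = is_valid_submission_name("A1_S11003232.zip")
print(ok_zip, ok_rar, missing_member)
```

The two example names from the instructions pass, while a name listing only one student id does not.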



Mark Allocation Sheet

After having discussed as a group, we recommend the following mark allocation to each group member based on contribution, or lack of it,
throughout the assignment.

Group Name ________________________

Project manager ________________________

Member ID Percentage contribution of allocated task

Certification

ID Member Name Signature



Assessments mapping with CBOK

Course: IS328
Assessment items: Presentation, Assign 1, Assign 2, Test1, MST

Core Body of Knowledge

Complex Computing
ICT Professional Knowledge
  Ethics  M  ✓ ✓
  Professional expectations  M
  Teamwork concepts/issues  M  ✓ ✓
  Communication  M  ✓ ✓
  Societal Issues/Legal issues/Privacy  M
  Understanding the ICT profession
ICT Problem Solving
  Abstraction
  Design
Technology Resources
  Hardware and Software Fundamentals
  Data and Information Management  M  ✓ ✓
  Networking
Technology Building
  Human Factors
  Programming
  Systems Development / Acquisition
ICT Management
  IT Governance and organisational issues
  IT Project management
  Service management
  Security management

