
The University of Fiji

School of Science and Technology


Department of Information Technology and Computer Science

ITC355 Business Intelligence Semester I, 2019

ASSIGNMENT 1

Due: 10-May-2019

Weight: 20%

Instructions: Please read the questions carefully and answer all of them.

Referencing: Please reference any work cited, using the Harvard referencing style. Remember that plagiarism
will lead to disciplinary action under the University of Fiji Regulation “Plagiarism and Dishonest
Practice”. Turnitin will be used to check your assignments for plagiarism, and no student should have
more than 30% plagiarism in an assignment. Failure to comply will result in a deduction of marks or a
fail grade.

Marking Criteria: Marks will be allocated for correct execution of steps and correct justification of
answers.

Submission: You need to compile all required solutions in a Word file and upload it to Moodle. Provide
the necessary snapshots with your answer justifications. Make sure that the answers are numbered properly
and do not rewrite the questions.

Question 1 (10%)

An online shopping site has the following primary pages or sections: Home, Products, Search, Prod_A,
Prod_B, Prod_C, Cart, Purchase. A user may browse from "Home" to "Products" and then to one of the
individual products. The user may also search for a specific product by using the "Search" function. A
visit to "Cart" implies that the user has placed an item in the shopping cart, and "Purchase" indicates
that the user has completed the purchase of items in the shopping cart. The site has collected some
hypothetical session data for 100 sessions. This data is available in the Q_sessions file on Moodle.
Use WEKA's K-means clustering algorithm to cluster these user sessions into segments. Try different
clustering runs with various numbers of clusters (e.g., between 4 and 8), and select the result set(s) that
seem to best answer the following questions.

• If a new user is observed to access the following pages: Home => Search => Prod_B, according
to your clusters, what other product should be recommended to this user? Explain your answer
based on your clustering results. What if the new user has accessed the following sequence
instead: Products => Prod_C?
• Can clustering help us identify casual browsers ("window shoppers"), focused browsers (those
who seem to know what products they are looking for), and searchers (those using the search
function to find items they want)? If so, do any of these groups show a higher or lower
propensity to make a purchase?
• Do any of the segments show particular interest in one or more products, and if so, can we
identify any special characteristics about their navigational behavior or their purchase
propensity?
• If we know that, during the time of data collection, independent banner ads had been placed on
some popular sites pointing to products A and B, can we identify segments corresponding to
visitors who responded to the ads? (Note that such users are likely to enter the site by going
directly to product pages rather than navigating from the Home page.) If so, can we determine
whether either of these promotional campaigns is having any success?
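
As a rough illustration of the first question, the sketch below matches a partial session to a cluster and picks an unvisited product from that cluster's centroid. The two centroids and their names are invented for illustration only (they are not results from the Q_sessions data), and the matching rule (highest average centroid weight on the visited pages) is just one simple heuristic, not a prescribed method.

```python
PRODUCTS = ["Prod_A", "Prod_B", "Prod_C"]

# HYPOTHETICAL centroids: each value is the fraction of sessions in the
# cluster that visited the page. Replace with the centroids WEKA reports.
CENTROIDS = {
    "searchers": {"Home": 0.9, "Products": 0.1, "Search": 1.0, "Prod_A": 0.2,
                  "Prod_B": 0.8, "Prod_C": 0.7, "Cart": 0.5, "Purchase": 0.3},
    "browsers": {"Home": 0.8, "Products": 1.0, "Search": 0.0, "Prod_A": 0.6,
                 "Prod_B": 0.1, "Prod_C": 0.9, "Cart": 0.2, "Purchase": 0.1},
}


def closest_cluster(visited, centroids):
    """Return the cluster whose centroid has the highest mean weight
    on the pages seen so far in the partial session."""
    def score(name):
        return sum(centroids[name][page] for page in visited) / len(visited)
    return max(centroids, key=score)


def recommend(visited, centroids):
    """Return (cluster name, unvisited product with the highest centroid weight)."""
    name = closest_cluster(visited, centroids)
    candidates = [p for p in PRODUCTS if p not in visited]
    best = max(candidates, key=lambda p: centroids[name][p])
    return name, best


print(recommend(["Home", "Search", "Prod_B"], CENTROIDS))
print(recommend(["Products", "Prod_C"], CENTROIDS))
```

With real centroids, the justification asked for in the question would cite the matched cluster's values for the visited pages and for the recommended product.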

For this problem, you should submit your clustering result summary (including the cluster centroids), the
final data set which shows the final assignment of these sessions to clusters, and your answers to the
above questions along with your justification based on the clustering results.
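
To make the requested outputs concrete, here is a minimal k-means sketch in plain Python. It is not WEKA's implementation, and the six toy sessions are made up (they are not the Q_sessions data); the column order and the `kmeans` helper are assumptions for illustration. For 0/1 page-visit data, each centroid coordinate can be read as the fraction of sessions in that cluster that visited the page.

```python
import random

PAGES = ["Home", "Products", "Search", "Prod_A", "Prod_B", "Prod_C", "Cart", "Purchase"]

# Made-up toy sessions (1 = page visited), in the column order of PAGES.
sessions = [
    [1, 0, 1, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1],
]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100, seed=1):
    """Basic k-means: returns (centroids, cluster index for each row)."""
    rng = random.Random(seed)
    centroids = [[float(v) for v in row] for row in rng.sample(rows, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows
        ]
        if new_assignment == assignment:
            break  # converged: no row changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


centroids, assignment = kmeans(sessions, k=2)
for c, centroid in enumerate(centroids):
    summary = ", ".join(f"{p}={v:.2f}" for p, v in zip(PAGES, centroid))
    print(f"cluster {c} (n={assignment.count(c)}): {summary}")
```

Results depend on the random seed and on k, which is why the question suggests comparing several runs.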

Other Notes: You may also want to use WEKA's cluster visualization capabilities to identify interesting
distributions of various page visits among and within clusters. Examples of using WEKA for clustering can
be found in K-Means Clustering in WEKA.

Question 2 (10%)

For this problem you will use some preprocessed and aggregated clickstream data from a real e-
commerce site, and use association rule mining to perform market basket analysis on the visitor session
data.

[Note: Please watch the class video Association Rule Mining with WEKA (23 min) demonstrating the
use of the Apriori algorithm in WEKA for market basket analysis.]

There are two primary types of products sold through the above site: leg care products and legwear
products. Each category includes various subcategories and individual products from multiple vendors.
There is also a separate categorization of products by specialized "Collections" and by "Assortments."
The data collection mechanism, in addition to capturing clickstream page-level data, also captures the
information on categories, subcategories, assortments, and collections of products accessed in a given
session.
For simplicity, the provided data combines and aggregates visited pages from the log files, category and
subcategory names, and product related content pages/categories. The aggregate data contains a total
of 182 attributes corresponding to pages or categories. These attributes are listed in the file
Leg-Pages.txt. The session data is provided in ARFF format in the file legs.arff. All datasets are provided on
Moodle. This data contains a total of 7296 sessions (each row in the data). For the purpose of market
basket analysis in WEKA, the session data is represented in relational format with unary categorical
attributes (a value of "Y" indicates that the corresponding page/category was visited in the session,
while a value of "?" indicates that the page/category is missing from the session). Thus, a typical
association rule might look similar to the following:

/Products/Legwear=Y /Products/Legwear/Berkshire=Y ==> Collection: Better Than Bare - Queen=Y

or

Category: Health Supplements=Y ==> Subcategory: Bones & Joints=Y
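
The support, confidence, and lift of a rule like the ones above can be checked by hand. The sketch below assumes each session is represented as the set of attributes whose value is "Y" (attributes with "?" are simply absent); the four sessions are made up for illustration and are not taken from legs.arff.

```python
def support(sessions, items):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    return sum(1 for s in sessions if items <= s) / len(sessions)


def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent ==> consequent."""
    sup_a = support(sessions, antecedent)
    sup_c = support(sessions, consequent)
    sup_ac = support(sessions, set(antecedent) | set(consequent))
    confidence = sup_ac / sup_a   # P(consequent | antecedent)
    lift = confidence / sup_c     # > 1 means a positive association
    return sup_ac, confidence, lift


A = "Category: Health Supplements"
C = "Subcategory: Bones & Joints"
L = "/Products/Legwear"

# Made-up sessions: each set holds the attributes with value "Y".
toy_sessions = [{A, C}, {A}, {L}, {L}]

sup, conf, lift = rule_metrics(toy_sessions, [A], [C])
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 says the consequent appears more often in sessions containing the antecedent than in sessions overall, which is why lift is useful for filtering out rules that only reflect generally popular pages.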

Your task in this problem is as follows:

1. Load the data into WEKA and review the distributions in the data (go down the list of attributes
and make a note of which pages or products are most frequent and which are least frequent;
you can list the top 3 and the bottom 3 in your submission).
2. Perform association rule mining using the Apriori algorithm with a "lowerBoundMinSupport" of 0.05
and using Lift of 2.5 as the minMetric for filtering the rules. Also, set "outputItemsets" to "True"
so that you can also view the frequent itemsets of different sizes in addition to the rules. Write
a short summary of your observations, including any significant or interesting (e.g., unobvious or
unexpected) associations you observe in the data based on the results. Save your result set and
submit it along with your assignment submission.
3. Next, run the Apriori algorithm with a lower "lowerBoundMinSupport" so that you can identify
associations at a more granular level (e.g., the level of individual product brands rather than
higher level categories). You might want to start with 0.025 and go lower if necessary.
Experiment with this threshold, as well as the Lift or Confidence metrics in different runs and
pick the result set that seems to provide the most useful information (e.g., not too many
obvious or noisy rules and not too few general rules). Again, provide a short summary of your
observations, including some examples of rules or associations that you find interesting or
useful.
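
To show how the minimum-support and lift thresholds interact, here is a compact level-wise Apriori sketch in plain Python. It is not WEKA's Apriori, the function names are illustrative, and the eight toy sessions (with items "a" to "d") are made up; on such a small example the thresholds have to be far higher than the 0.05 and 0.025 suggested above.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(sessions)
    level = {frozenset([i]) for s in sessions for i in s}  # size-1 candidates
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(1 for s in sessions if c <= s) / n for c in level}
        current = {c: v for c, v in counts.items() if v >= min_support}
        frequent.update(current)
        k += 1
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
    return frequent


def rules_by_lift(frequent, min_lift):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                c = itemset - a
                conf = sup / frequent[a]
                lift = conf / frequent[c]
                if lift >= min_lift:
                    found.append((a, c, sup, conf, lift))
    return found


# Made-up sessions: each set holds the pages/categories with value "Y".
toy = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"},
       {"c"}, {"c", "d"}, {"d"}, {"d"}]

freq = frequent_itemsets(toy, min_support=0.25)
rules = rules_by_lift(freq, min_lift=2.5)
for a, c, sup, conf, lift in rules:
    print(sorted(a), "==>", sorted(c), f"sup={sup:.3f} conf={conf:.2f} lift={lift:.2f}")
```

Lowering `min_support` admits rarer itemsets (and therefore more, finer-grained rules), while raising `min_lift` keeps only the rules whose items co-occur much more often than chance would suggest.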
