
USC Melb/Syd, Semester 3, 2018 ICT706 Data Analytics

Family Name _____________________

First Name _____________________

Student Number |__|__|__|__|__|__|__|__|

This exam paper must not be removed from the venue

Venue ____________________

Seat Number ________

ICT706 Data Analytics


USC Melb/Syd, Semester 3, 2018
Examination

School of Business

Examination Duration: 120 minutes
Reading Time: 10 minutes

For Examiner Use Only

Question    Mark

Exam Conditions:

Reading Time: Read only. Students are not permitted to write on the
examination paper

This is a closed book exam.

No notes are permitted

Materials Permitted In The Exam Venue:

(No electronic aids are permitted e.g. laptops, phones)

English dictionary

English thesaurus

Materials To Be Supplied To Students:

1 x 8 page booklet (green)

Instructions To Students:

Attempt all questions.

Total ________


Question 1: Practical Python [10 marks]


For each of the following tasks, what Python library would you recommend?

a) Doing statistical analysis on a table of sales data? pandas
b) Drawing custom graphs to visualise customer satisfaction rates? matplotlib
c) Using a clustering algorithm to group customers into similar groups? scikit-learn
d) Reading an Excel spreadsheet to extract and clean a dataset? pandas
e) Building a machine learning model to predict expected travel costs? scikit-learn

Given the Python list abc=[2, 3, 5, 7, 11, 13, 1], write the answer that will result from
evaluating each of the following expressions:

f) abc[1] 3
g) abc[-1] 1
h) abc[2:4] [5, 7]
i) len(abc) 7
j) sum(abc) 42
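
The answers to (f) to (j) can be checked with a few lines of Python. This is only a verification sketch; the variable names other than abc are chosen here for illustration.

```python
abc = [2, 3, 5, 7, 11, 13, 1]

second = abc[1]     # indexing starts at 0, so index 1 is the second element
last = abc[-1]      # negative indices count back from the end of the list
middle = abc[2:4]   # slice from index 2 up to, but not including, index 4
length = len(abc)   # number of elements
total = sum(abc)    # sum of all elements

print(second, last, middle, length, total)
```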

_______________________________________________________________________

Question 2: Pandas Proficiency [10 marks]


We are using Pandas to analyse a set of student tests and grades. The CSV file
containing all the data has been loaded into a Pandas DataFrame called 'data'. This has
data for over 100 students, and columns called "Task1", "Task2", "Task3", etc.

For example, evaluating the expression data["Task2"].count() will tell us how many
students submitted Task 2.

Write a Pandas expression that will: 2 marks each

a) Calculate the average mark for Task 1.


data["Task1"].mean() (or data.Task1.mean() )
b) Calculate the lowest and highest marks for Task 1. (Hint: you can return them as
a pair of numbers).
( data["Task1"].min(), data["Task1"].max() )
c) Calculate standard deviation of Task 3.
data["Task3"].std()
d) Calculate the number of students who passed Task 3 (Task 3 is out of 100, so we
want the number of students who got 50 marks or more).
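(data["Task3"] >= 50).sum() (or data[ data["Task3"] >= 50 ]["Task3"].count() )
Note: data[ data["Task3"] >= 50 ].count() returns a count for every column, not a single number.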
e) Add a new column called "Percent1" to the data table, which is the percentage
(0..100) for Task 1. That is, the Percent1 mark for each student should be their
Task 1 mark divided by 30.0, then times 100.0 (to turn it into a percentage).
data["Percent1"] = data["Task1"] / 30.0 * 100.0
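
As a rough check, the expressions above can be tried on a small table. The four rows of marks below are invented for illustration (the real 'data' comes from the CSV file), and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical marks for four students; None (NaN) means "not submitted".
data = pd.DataFrame({
    "Task1": [15.0, 24.0, 30.0, 21.0],
    "Task2": [8.0, None, 10.0, 7.0],
    "Task3": [45.0, 80.0, 50.0, 65.0],
})

average1 = data["Task1"].mean()                          # a) average mark
low_high1 = (data["Task1"].min(), data["Task1"].max())   # b) lowest and highest
spread3 = data["Task3"].std()                            # c) sample standard deviation
passed3 = (data["Task3"] >= 50).sum()                    # d) each True counts as 1
data["Percent1"] = data["Task1"] / 30.0 * 100.0          # e) new percentage column

print(average1, low_high1, spread3, passed3)
print(data)
```

Summing the boolean Series in (d) gives a single number, whereas filtering the whole table and calling count() gives one count per column.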


_______________________________________________________________________

Question 3: Pandas Explanations [10 marks]


Write a sentence to explain each of the following Pandas DataFrame functions. They are
shown applied to a dataframe called 'df'.

a) df.columns
Returns the names of all the columns (as a Pandas Index
object, which can be used much like a list).
b) df.shape
Returns a pair of numbers, (r,c) where r is the number
of rows and c is the number of columns.
c) df.describe()
Returns summary statistics (count, mean, std, min, max,
and the 25%, 50% and 75% quartiles) for all numeric
columns in the table.
d) df.head()
Gets the first few rows of the table (5 by default).
e) df.count()
Gets the number of non-NaN values in each column of
the table.
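
The five functions can be tried on a tiny table. The three-row dataframe below is made up for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical table: three students, one missing Task2 mark.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Task1": [20.0, 25.0, 30.0],
    "Task2": [7.0, None, 9.0],
})

column_names = list(df.columns)   # df.columns is an Index; list() converts it
n_rows, n_cols = df.shape         # (rows, columns) pair
summary = df.describe()           # statistics for the numeric columns only
first_two = df.head(2)            # head(n) overrides the default of 5 rows
non_missing = df.count()          # per-column count of non-NaN values

print(column_names, n_rows, n_cols)
print(summary)
print(non_missing)
```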
_______________________________________________________________________

Question 4: Machine Learning Process [20 marks]


You are the data analyst for a chain of shoe stores, spread across Australia.

a) Explain the typical process you would follow to use machine learning to predict
the level of monthly sales for all the stores. Give a number and a title for each of
the steps that you would take, and briefly explain each step. [10 marks]

Sample answer (from Week 7 lectures):

Typical Steps:
1. import sklearn functions, e.g. DecisionTreeClassifier
2. load the dataset
3. split the dataset 80/20 into training and testing data
4. create a classifier object (choose your parameters, e.g. max_depth)
5. use the classifier to 'fit' the training data (= learn the model)
6. use the classifier to 'predict' outcomes for new data
7. measure the accuracy of the classifier (on the test data)

b) Sketch out some example Python code that you would use to implement the
above steps using a Decision Tree model. Use Python comments to identify each
step. [10 marks]
Sample Answer: (from Week 7 lecture slides).
# 1. import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# 2. load a dataset
iris = load_iris()

# 3. split dataset 80/20
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.20,
    stratify=iris.target,
    random_state=1)

# 4. create classifier
dt = DecisionTreeClassifier(max_depth=3)

# 5. fit the training data
dt.fit(X_train, y_train)

# 6. predict outcomes
y_predicted = dt.predict(X_test)

# 7. measure accuracy
print(accuracy_score(y_test, y_predicted))

Hints: You can make assumptions like:

• all your input sales data is available in a file called "sales.csv";
• that each row is one example month for one store;
• that the last column ("TotalSales") is what you are trying to predict;
• that the other columns are numeric input features;
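
One possible way to adapt the seven steps to the hinted sales data is sketched below. It is not the lecture's sample answer: because "TotalSales" is numeric, the sketch swaps in DecisionTreeRegressor and mean absolute error in place of the classifier and accuracy. The feature columns (Month, FloorArea, Staff) and the synthetic rows are invented so the code runs without the file; with real data, step 2 would read "sales.csv" instead.

```python
# 1. import libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 2. load the dataset. With the real file: sales = pd.read_csv("sales.csv")
# A synthetic stand-in with hypothetical columns is built here instead.
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "Month": rng.integers(1, 13, size=200),
    "FloorArea": rng.integers(50, 300, size=200),
    "Staff": rng.integers(2, 12, size=200),
})
sales["TotalSales"] = 100 * sales["FloorArea"] + 2000 * sales["Staff"]

# The last column is the target; every other column is an input feature.
X = sales.iloc[:, :-1]
y = sales.iloc[:, -1]

# 3. split dataset 80/20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

# 4. create the model (a regressor, since the target is a number)
model = DecisionTreeRegressor(max_depth=5, random_state=1)

# 5. fit the training data
model.fit(X_train, y_train)

# 6. predict outcomes for the held-out months
y_predicted = model.predict(X_test)

# 7. measure the error on the test data
mae = mean_absolute_error(y_test, y_predicted)
print(f"Mean absolute error: {mae:.0f}")
```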
_______________________________________________________________________

Question 5: Machine Learning Algorithms [15 marks]


There are many different types of machine learning algorithms. Explain the following
groups of algorithms, and name one example algorithm in each group.

a) clustering algorithms
Unsupervised algorithm that groups instances into similar
groups. E.g. K-Means algorithm.
b) regression algorithms
Supervised algorithm that predicts numeric answers. E.g.
Linear Regression algorithm.
c) classification algorithms
Supervised algorithm that predicts a discrete 'class' for each
instance (such as Yes/No result). E.g. Decision Tree algorithm.
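
The three groups can be contrasted on a toy dataset using the example algorithms named above. The data values and variable names below are made up for illustration; this is a minimal sketch, not a realistic model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: one feature, two clearly separated groups.
X = np.array([[1.0], [2.0], [10.0], [11.0]])

# a) Clustering (unsupervised): only the inputs are given, no answers.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# b) Regression (supervised): learn a numeric target, here y = 2 * x.
y_numeric = np.array([2.0, 4.0, 20.0, 22.0])
reg = LinearRegression().fit(X, y_numeric)
predicted_number = reg.predict(np.array([[5.0]]))[0]

# c) Classification (supervised): learn a discrete class label.
y_class = np.array(["No", "No", "Yes", "Yes"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
predicted_class = clf.predict(np.array([[9.0]]))[0]

print(clusters, predicted_number, predicted_class)
```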


_______________________________________________________________________

Question 6: Supervised vs Unsupervised [15 marks]


a) Explain the difference between supervised and unsupervised machine
learning. [5 marks]
b) Give a business example where supervised learning would be appropriate.
[5 marks]
c) Give a business example where unsupervised learning would be appropriate.
[5 marks]

Answer: Definition of difference [5 marks]

• Supervised is when the expected outcome is known (for the training data).
• Unsupervised is when we have input data only, with no known answers.

Business Example of Supervised Learning [5 marks]:

• must be appropriate for supervised learning;
• must clearly state what is the classification/regression outcome that is learnt;
• should explain what is the business benefit;

Business Example of Unsupervised Learning [5 marks]:

• must be appropriate for UNsupervised learning;
• must clearly state what is the output of the learning algorithm;
• should explain what is the business benefit;

_______________________________________________________________________

Question 7: Evaluation of Models [20 marks]


ReadItNow is an online company that loans e-books via subscription ($10/month, or
$120/year). They want to analyse their customer base, and their 'churn rates'
(customers who decide to stop subscribing to their service).

The following confusion matrix shows the results of applying a Decision Tree machine
learning algorithm to 1000 historical examples of customer churn from the previous
year. The columns show whether the customer did really leave ('Churn') or stay ('Not
Churn'). The rows show the prediction output from the learned Decision Tree model.

                       Churn    Not Churn

Model predicted Yes     400        200

Model predicted No      100        300


Calculate the values of the following evaluation metrics for this model (since you do
not have a calculator, you can write them as a fraction) [2 marks each]:

a) number of true positives? = 400 [2 marks for each answer]
b) number of false positives? = 200
c) accuracy? = (400+300)/1000 = 700/1000 = 70%
d) precision? = 400/(400+200) = 400/600 = 2/3 (about 67%)
e) recall? = 400/(400+100) = 400/500 = 80%
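
The metrics follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions: precision = TP/(TP+FP), and recall = TP/(TP+FN), where the false negatives are the 100 churners that the model predicted No. The variable names are chosen here for illustration.

```python
# Counts read from the confusion matrix (rows = prediction, columns = actual).
tp = 400  # predicted Yes, customer did churn
fp = 200  # predicted Yes, customer did not churn
fn = 100  # predicted No, customer did churn
tn = 300  # predicted No, customer did not churn

total = tp + fp + fn + tn
accuracy = (tp + tn) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # share of predicted churners who really churned
recall = tp / (tp + fn)         # share of real churners that the model caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```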

ReadItNow is considering using this model as the basis of a new marketing campaign to
better retain their existing customers. They will send special offers to all the people
that the model predicts Yes (that is, the customers that are in danger of 'churning'
away from ReadItNow). The cost of these discounts will average $10 per customer,
but it is expected that it will halve the churn rate, which will save on average half of
the annual subscription of each customer who is persuaded not to churn. The
following cost-benefit matrix summarises the annual costs and expected benefits of
this campaign for each group of customers.

                       Churn                    Not Churn

Model predicted Yes    ¾ * $120 - $10 = $80     $120 - $10 = $110

Model predicted No     $120 / 2 = $60           $120

f) Calculate the expected income after this marketing campaign? Show your
working. [4 marks]

Answer [4 marks]:

• Expected income = $80 * 400 + $110 * 200 + $60 * 100 + $120 * 300
                  = 32000 + 22000 + 6000 + 36000
                  = $96,000

The cost-benefit matrix for the next year without the marketing campaign is:

                       Churn             Not Churn

Model predicted Yes    $120 / 2 = $60    $120

Model predicted No     $120 / 2 = $60    $120

g) Calculate the expected annual income WITHOUT the marketing campaign.
[4 marks]

Answer [4 marks]:

• Income = $60 * (400+100) + $120 * (200+300) = 30,000 + 60,000 = $90,000

h) Would you recommend that ReadItNow goes ahead with the marketing
campaign? Explain your reason. [2 marks]

Recommendation [2 marks]:

• YES. I would recommend that they go ahead with the marketing campaign
because it will increase expected annual income from $90,000 to $96,000.
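
The recommendation rests on comparing the two expected incomes. The sketch below recomputes both from the matrices given in the question; the variable names are chosen here for illustration.

```python
# Customers by actual outcome, from the confusion matrix.
churners = 400 + 100   # all customers who really churned
stayers = 200 + 300    # all customers who really stayed

# Without the campaign: churners pay $60 on average, stayers pay $120.
income_without = 60 * churners + 120 * stayers

# With the campaign: per-cell values from the first cost-benefit matrix.
income_with = 80 * 400 + 110 * 200 + 60 * 100 + 120 * 300

gain = income_with - income_without
print(income_without, income_with, gain)
```

A positive gain supports going ahead with the campaign.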

END OF EXAMINATION
