
Data Science for Online Customer Analytics

UIC Spring 2019


Assignment
Due: 6pm, Friday, March 1
Submit on paper before class
This is an individual assignment

Name: ______________________

UIC Email: ______________________

UNI: ______________________

Multiple Choice / Matching (20 points)

Coding (20 points)


Sample Size Estimation (10 points)
AUC Brain Teaser (10 points)

Short Answer (10 points)


Plumbing Inc (5 points)
Netflix (5 points)

Problem Analysis (35 points)


Mail Marketing (10 points)
GloboBank (10 points)
ASOS Customer Lifetime Value Prediction (15 points)

Multiple Choice / Matching (20 points)

Select one answer for each of the following questions:

1. The points on a model’s ROC curve (2 points)


a. represent the performance of different thresholds
b. represent different rankings of examples
c. represent the cost of different classifications
Answer: _____

2. You have built several predictive models to rank credit applicants by their estimated
likelihood of default. Which technique would be least helpful in assessing the quality of a
ranking model mined from data? (2 points)
a. holdout testing
b. calculate area under the ROC curve
c. calculate percent of instances correctly classified
d. cross-validation
e. domain knowledge validation
Answer: _____

3. Which of these organizations would have the most challenge in applying supervised
predictive modeling? (2 points)
a. A grocery store that is trying to identify which of its loyalty-card-carrying
customers will spend more than $100 next month
b. A business school that wants to start a new Master’s degree program in
Business Analytics and would like to estimate the likely number of applicants
c. A city government that is trying to predict which neighborhoods will see the most
new businesses open up next quarter
d. An online marketing company that wants to estimate the number of clicks that the
ads it serves will receive when shown to a particular population
Answer: _____

4. As a data scientist who values understanding the problem and data over applying
specific algorithms, what would you do first when you start a project? (2 points)
a. Do a lot of data exploration on the dataset provided by your manager.
b. Ask questions like "Why do we want to do this?" and "How is the data
generated/sampled?"
c. Explore various models, and combinations of different models to achieve the
highest score on the evaluation metric.
d. Create a detailed modeling plan before iterating.
Answer: _____

5. Which of the following may be a good reason to use a Multi Armed Bandit (MAB)
algorithm instead of simple A/B testing? (2 points)
a. We need statistically significant results.

b. Experiments are very costly.
c. There are only a few options to test.
d. Due to technology constraints, it's impossible to assign the experiment subjects
into treatment groups dynamically in real time.
Answer: _____

6. In class we discussed the Upper Confidence Bound (UCB) algorithm. Now consider an
alternative Lower Confidence Bound (LCB) algorithm, which always selects the arm with
the highest lower confidence bound. Would LCB be a good algorithm for the Multi Armed
Bandit (MAB) problem? (2 points)
a. No because the algorithm won’t be exploring.
b. No because the algorithm won’t be exploiting.
c. Yes, LCB would work for MAB
Answer: _____
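
For reference, here is a minimal sketch of the two selection rules being compared, written with a UCB1-style confidence term (this particular bound is an assumption; the course may have presented a different variant):

import numpy as np

def ucb_select(mean_reward, pull_count, t):
    # UCB1: pick the arm with the highest upper confidence bound, i.e. its mean
    # reward plus an exploration bonus that shrinks as the arm is pulled more often.
    return np.argmax(mean_reward + np.sqrt(2 * np.log(t) / pull_count))

def lcb_select(mean_reward, pull_count, t):
    # The LCB alternative from the question: highest *lower* confidence bound.
    return np.argmax(mean_reward - np.sqrt(2 * np.log(t) / pull_count))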

7. You are a famous researcher in the City of Peacetopia. The people of Peacetopia have
a common characteristic: they are afraid of birds. To save them, you have to build an
algorithm that will detect any bird flying over Peacetopia and alert the population. The
city has the following criteria:
- "We need an algorithm that can let us know a bird is flying over Peacetopia as
accurately as possible."
- "We want the trained model to take no more than 10sec to classify a new image.”
- “We want the model to fit in 10MB of memory.”

If you had the three following models, which one would you choose? (2 points)

a. Test Accuracy: 97%, Runtime 1 sec, Memory size 3MB
b. Test Accuracy: 99%, Runtime 13 sec, Memory size 9MB
c. Test Accuracy: 97%, Runtime 3 sec, Memory size 2MB
d. Test Accuracy: 98%, Runtime 9 sec, Memory size 9MB
Answer: _____

Match tasks with the appropriate task type (3 points):

Task Type                              Example Task
___________ Classification task        a) Are there any interesting natural groupings of my customers?
___________ Scoring/ranking task       b) Which 500 customers should I target with a special offer?
___________ Regression task            c) Which customers will leave within 90 days of their current contract expiration?
___________ Unsupervised learning      d) How many cell phone minutes will each customer use next month?

Match metrics with what they measure (3 points):

Metric                                 What does it measure?
___________ F1 score                   a) classification performance
___________ Precision and Accuracy     b) ranking performance
___________ AUC                        c) how good are probability predictions
___________ Logistic Loss

Coding (20 points)

Sample Size Estimation (10 points)


Complete the following function that estimates the sample size (per variation) needed for an A/B
test, given the base_rate and the minimum detectable effect:
def estimate_sample_size(base_rate, min_detectable_effect):
    sz = (2.8*2.8)*___________________________________________
    return round(sz)

Now call your function and record the results below:


print(estimate_sample_size(0.01, 0.001))
_______________

print(estimate_sample_size(0.1, 0.01))
_______________

Compare your results with those from these online sample size calculators:
● https://www.optimizely.com/sample-size-calculator/
● http://www.evanmiller.org/ab-testing/sample-size.html
Are their results the same as yours? Can you explain what's going on?
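
If you want to sanity-check your function, here is a minimal sketch of one common rule of thumb for two-proportion tests, assuming 80% power and a 5% two-sided significance level (which is where 2.8 ≈ 1.96 + 0.84 comes from). This is not necessarily the exact formula the blank expects, and the online calculators may use more exact formulas or different defaults, so small discrepancies are normal:

def estimate_sample_size_sketch(base_rate, min_detectable_effect):
    # Rule-of-thumb per-variation sample size for detecting an absolute
    # difference of min_detectable_effect on a baseline rate of base_rate:
    #   n ~ (z_alpha/2 + z_beta)^2 * 2 * p * (1 - p) / delta^2,
    # with 1.96 + 0.84 = 2.8 for 5% significance and 80% power.
    p, delta = base_rate, min_detectable_effect
    n = (2.8 * 2.8) * 2 * p * (1 - p) / (delta ** 2)
    return round(n)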

AUC Brain Teaser (10 points)


Say you have a large set of examples in a two-class problem. Forget about features here; they
don't matter. Each example is assigned a score: a random number between 0 and 1. With that
same probability, the example is assigned to the positive class. In other words, if it is assigned a
score of 0.23, it has a 23% chance of being labeled positive and a 77% chance of being labeled
negative.

So you have a set of examples with scores and labels. Your goal is to calculate the expected
AUC of this example set.

Complete the following code that solves this problem:

import numpy as np
np.random.seed(17111)

num_samples = 10000
prob = np.random.rand(num_samples)

# In the next line we will sample the class labels.
# Note that you need to use the np.random.rand() function.
label = __________________________________________________

from sklearn.metrics import auc
from sklearn.metrics import roc_curve
# In the next two lines we will calculate the AUC based on
# "prob" and "label". Search sklearn's online documentation
# to learn how to use the "roc_curve" and "auc" functions.
_____________________ = roc_curve(y_true=_____, y_score=_____)
print(auc(_____________________))

What is the output of your program?
_______________

Your probability "estimates" seem to be perfectly right (after all, the labels are generated using
them). But why is the AUC still less than 1?
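
For reference, here is a minimal sketch of one way the blanks above could be filled in, assuming the labels are sampled by comparing a fresh uniform draw against each score (variable names mirror the skeleton above):

import numpy as np
from sklearn.metrics import auc, roc_curve

np.random.seed(17111)

num_samples = 10000
prob = np.random.rand(num_samples)

# Sample label[i] from a Bernoulli(prob[i]) distribution: a fresh uniform
# draw falls below prob[i] with probability prob[i].
label = (np.random.rand(num_samples) < prob).astype(int)

# roc_curve returns the false positive rates, true positive rates, and the
# thresholds at which they were computed; auc integrates the resulting curve.
fpr, tpr, thresholds = roc_curve(y_true=label, y_score=prob)
print(auc(fpr, tpr))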

Short Answer (10 points)

Plumbing Inc (5 points)

Plumbing Inc. has been selling plumbing supplies for the last 20 years. The owner, Joe, decides
that next year it is time to diversify by adding gardening tools to his product line. Having had
success using customer data to build predictive models to guide direct mail campaigns for
special plumbing offers, he believes that data mining could help him identify a subset of
customers who should be good prospects for his new set of products. Is Joe ready to solve this
as a supervised learning problem? (Write a few sentences to explain your answer.)

Netflix (5 points)

Assume that you work on the data science team at Netflix. Assume that Netflix pays royalty
fees every time a user requests to watch a movie/series. Even if a user has watched only the
first five minutes of a movie, Netflix has to pay the full fee. In the past two years there has been
a worrying increase in the number of customers who play a movie/series on Netflix without
actually watching it (e.g., they fall asleep). You want to predict which users indulge in this costly
(for your company) habit. In one sentence, formulate a useful target variable. In another
sentence, describe precisely how you would formulate the feature vector. Finally, briefly
describe how Netflix can use your model.

Problem Analysis (35 points)

Mail Marketing (10 points)

Last month your boss sent a mailing to 20,000 of your existing customers with a special offer on
a Hoosfoos Credeen (some cool product). The response was exciting: 1% of them responded,
which brought in $200,000 in revenue. She has now delegated to you the task of continuing the
program and has given you a budget of $10,000, which will allow you to target another 20,000
customers (out of your customer base of 100,000). You don’t want to just target them randomly,
as your boss did, so you decide to build predictive models for targeting. Describe how you
would evaluate them by answering the following:
1. What is the cost and benefit of sending a mail?

2. What does your model predict (i.e., what's the target variable)?

3. How would you use the data for training and validation?

4. Describe the evaluation function you will use to compare your models.

GloboBank (10 points)

You’re working for one of the world's largest financial institutions (GloboBank). They’re building
a system to monitor salespeople’s electronic communications with the company's customers.
The goal is to help reduce bad behavior among the company's salespeople, such as
overpromising, understating risks, and so on. The company is unhappy with its current
surveillance system because it creates tons of false alarms, which wastes the analysts' time,
and because it also misses a lot of important cases. Below is a proposal they have received
from a vendor for a better system that will address these issues. Specifically, the system will
monitor each outgoing email from a salesperson and flag those that are suspicious. The
suspicious emails would be examined by an analyst, who would decide which ones ought to be
escalated for further investigation.

Assess the proposal and provide constructive criticism: identify what you assess to be the three
most important potential flaws and suggest ways to fix each of them.

We will use machine learning techniques to build a model to classify emails as
"suspicious" or not. Those classified as suspicious will be "flagged"; our compliance
analysis system will rank them and provide the most suspicious ones to the analysts.
The system will maximize the lift at the top of the ranking, and minimize the number of
missed cases (false negatives).

The flagging model will take as input a feature-vector representation of the email, where
each word is a feature and the feature value represents whether the word is present in
the email; more sophisticated representations will be added later. We will leverage the
existing system to provide training labels. Specifically, if the existing system flags the
emails as being suspicious, we will give them a label of yes. Otherwise we will give a
label of no. As we archive all salesperson emails specifically for compliance purposes,
this will allow us almost unlimited training data.

We will evaluate the system based on its generalization accuracy and the area under the
ROC curve (AUC) on holdout data. The system should be able to achieve accuracies
greater than 90% as well as high AUCs. We also will show the flagged emails to
compliance experts for domain knowledge validation.
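
To make the proposal's feature representation concrete, here is a minimal sketch of a binary bag-of-words encoding like the one it describes, using scikit-learn's CountVectorizer with binary=True (the choice of library and the toy emails are illustrative assumptions, not part of the proposal):

from sklearn.feature_extraction.text import CountVectorizer

# Toy emails standing in for archived salesperson messages.
emails = [
    "this product is guaranteed to double your money",
    "please find the risk disclosure attached",
]

# binary=True sets each feature to 1 if the word appears in the email and 0
# otherwise, matching the "whether the word is present" representation.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)

print(vectorizer.get_feature_names_out())
print(X.toarray())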

Flaw 1:

Flaw 2:

Flaw 3:

More flaws (if any):

ASOS Customer Lifetime Value Prediction (15 points)
One distinct feature of the data science profession is that many firms publish their best work in
academic journals. Many data scientists read those publications to look for ideas and to keep
themselves aware of the industry's latest practices. In this question, you will be asked to read a
journal article and answer a few questions.

The article is Customer Lifetime Value Prediction Using Embeddings, written by several
authors at ASOS, a British online fashion and cosmetics retailer. The article can be found here:
https://arxiv.org/pdf/1703.02596.pdf. For the purpose of this assignment, focus on Sections 1, 2,
and 3 only.

Read through the article and write brief answers to the following questions. Feel free to search
online and discuss with your friends and classmates while working on these questions, but you
should write up your own answers.

1. What is Customer Lifetime Value (CLV, sometimes also abbreviated as CLTV)? And
why is it relevant to ASOS? (Hint: focus on Section 1 of the paper)

2. What is the prediction target? What is a main challenge associated with using this
prediction target? And how did the authors deal with it? (Hint: focus on Section 3 of the
paper)

3. What features did the authors use? Which are the most useful features? (Hint: focus on
Section 3 of the paper)

4. What is model calibration? And why did the authors do it? (Hint: focus on Section 3 of
the paper)

5. What is an alternative approach to building CLV models? How does it compare to the
authors' approach(es)? (Hint: focus on Section 2 of the paper)

6. Assume that you are a data scientist working for Macy's. Furthermore, you are now in
charge of rebuilding their CLV model. Can you briefly summarize what you have learned
from the ASOS paper?

