
BUSINESS DATA MINING (IDS 572)

HOMEWORK 3
DUE DATE: WEDNESDAY, SEPTEMBER 23 AT 6:00 PM

Please provide succinct answers to the questions below.


Your entire write-up must be at most six pages long, including any figures and/or SPSS printouts.
Submit an electronic PDF or Word file on Blackboard.
Please include the names of all team members in your write-up and in the file name.

Problem 1. Consider the 100 data points in the file hw3.xls. Each data point is either POSITIVE or
NEGATIVE. Based on data not shown, two models have been trained to predict the values of these
data points. For each of the two models, the table gives the predicted probability of POSITIVE for each
data point; the last column shows the actual value.
(a) Determine the proportion of records p that are POSITIVE.
(b) Assume the models classify a record as POSITIVE if the probability of POSITIVE is larger
than 0.5 and otherwise they classify it as NEGATIVE. Determine the misclassification rate for
each of the two models.
(c) Again, assume the models classify a record as POSITIVE if the probability of POSITIVE is
larger than 0.5. Give the coincidence matrices for the two models.
(d) Plot the cumulative response charts for the two models (you can plot the results for the two
models in the same chart). Assume a hit is a POSITIVE record.
(e) Plot the gain charts for the two models (you can plot the results for the two models in the same
chart). Again, assume a hit is a POSITIVE record.
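The quantities in parts (b) through (e) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the five-record toy data are made up for the example and are not taken from hw3.xls. It also assumes that "cumulative response" means hits among the records selected so far divided by the number of records selected, and that "gain" means hits among the records selected so far divided by all hits.

```python
def classify(probs, threshold=0.5):
    """Label a record POSITIVE when its probability is larger than the threshold."""
    return ["POSITIVE" if p > threshold else "NEGATIVE" for p in probs]


def coincidence_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    labels = ["POSITIVE", "NEGATIVE"]
    counts = {(a, p): 0 for a in labels for p in labels}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def misclassification_rate(actual, predicted):
    """Fraction of records whose predicted label differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def cumulative_curves(actual, probs):
    """Rank records by decreasing probability and return one tuple per cutoff:
    (fraction of records selected, cumulative response, gain)."""
    ranked = sorted(zip(probs, actual), key=lambda t: t[0], reverse=True)
    total_hits = sum(1 for _, a in ranked if a == "POSITIVE")
    rows, hits = [], 0
    for k, (_, a) in enumerate(ranked, start=1):
        hits += a == "POSITIVE"
        rows.append((k / len(ranked), hits / k, hits / total_hits))
    return rows


# Hypothetical toy data (NOT the contents of hw3.xls).
toy_actual = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
toy_probs = [0.9, 0.6, 0.4, 0.3, 0.1]

toy_pred = classify(toy_probs)
toy_matrix = coincidence_matrix(toy_actual, toy_pred)
toy_error = misclassification_rate(toy_actual, toy_pred)
toy_curves = cumulative_curves(toy_actual, toy_probs)
```

Plotting the second and third entries of each tuple against the first gives the cumulative response chart and the gain chart, respectively.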
Problem 2. Download the file bank-data.csv. The key field in this data set is PEP (Personal Equity
Plan, a savings product our bank offers). Our goal is to predict whether or not a customer will purchase
a PEP. We have data from 600 customers as to their purchasing patterns. The fields are
id            a unique identification number
age           age of customer in years
sex           MALE / FEMALE
region        inner city/rural/suburban/town
income        income of customer
married       is the customer married (YES/NO)
children      number of children
car           does the customer own a car (YES/NO)
save_acct     does the customer have a saving account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP after the last mailing (YES/NO)
Use SPSS Modeler to answer the following questions.
(a) Use 67% of the data set for training and 33% for testing. Create the default C&RT and C5.0
decision tree models. Use the Analysis node to determine the misclassification rates and
coincidence matrices for each of the two models on the testing data.
(b) Use the Evaluation node (and relevant charts) to answer the following questions for the C&RT
decision tree model:
i. What fraction of those who would buy the PEP product do we reach if we mail to only
half of our customer base?
ii. If we mail to half of our customer base, what fraction of those mailed would we expect to purchase
a PEP (assuming, of course, that we are mailing to a previously unmailed group)?
iii. What lift would we get if we mailed to only the most likely 10% of the population?
(c) Repeat part (b) for the C5.0 model.
To learn how to draw different graphs using the Evaluation node in SPSS Modeler, please follow
the instructions given in the document Evaluation Node on Blackboard. This document can
be found under the SPSS Modeler Documents area.
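Outside SPSS Modeler, the three quantities asked about in part (b) can be illustrated with a short Python sketch. The function name mailing_summary and the ten scored customers are hypothetical and are not drawn from bank-data.csv. The sketch assumes each customer has a model score for buying a PEP, that we mail to the highest-scoring fraction, and that lift is the response rate among mailed customers divided by the overall response rate.

```python
def mailing_summary(scores, bought, fraction):
    """Summarise a mailing to the top `fraction` of customers ranked by score.

    scores: model scores, higher means more likely to buy.
    bought: 1 if the customer actually bought, 0 otherwise.
    """
    ranked = sorted(zip(scores, bought), key=lambda t: t[0], reverse=True)
    n_mailed = max(1, round(fraction * len(ranked)))
    buyers_mailed = sum(b for _, b in ranked[:n_mailed])
    total_buyers = sum(bought)
    response = buyers_mailed / n_mailed      # fraction of mailed customers who buy
    baseline = total_buyers / len(ranked)    # fraction of all customers who buy
    return {
        "gain": buyers_mailed / total_buyers,  # fraction of all buyers reached
        "response": response,
        "lift": response / baseline,
    }


# Hypothetical scored customers (NOT from bank-data.csv).
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
toy_bought = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

half = mailing_summary(toy_scores, toy_bought, 0.5)
top_decile = mailing_summary(toy_scores, toy_bought, 0.1)
```

In this sketch, "gain" corresponds to question i, "response" to question ii, and "lift" at a fraction of 0.1 to question iii.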
Problem 3. A data mining routine has been applied to a transaction dataset and has classified 88
records as fraudulent (30 correctly so) and 952 as nonfraudulent (920 correctly so).
(a) Construct the confusion matrix and calculate the error rate, accuracy rate, recall, precision,
specificity, and false alarm rate. Please include the formulas.
(b) Consider the decile lift chart below (a decile lift chart is a lift chart portrayed as a
decile chart) for the transaction data model applied to new data.

Interpret the meaning of the first and second bars from the left.
(c) Another analyst comments that you could improve the accuracy of the model by classifying
everything as nonfraudulent. If you do that, what is the error rate?
(d) Comment on the usefulness, in this situation, of these two metrics of the model performance
(error rate and lift).
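As a rough aid for checking the arithmetic in parts (a) and (c), the counts given in the problem can be turned into the four cells of the confusion matrix and the usual ratios. The sketch below makes two assumptions that the problem does not state: "fraudulent" is treated as the positive class, and "false alarm rate" is taken to mean the false positive rate FP / (FP + TN). Other textbooks define the false alarm rate differently, so check the course definition.

```python
# Counts stated in Problem 3.
flagged_fraud, correct_fraud = 88, 30   # classified fraudulent, of which correct
flagged_ok, correct_ok = 952, 920       # classified nonfraudulent, of which correct

# Confusion matrix cells, with "fraudulent" as the positive class (assumption).
tp = correct_fraud
fp = flagged_fraud - correct_fraud
tn = correct_ok
fn = flagged_ok - correct_ok
n = tp + fp + tn + fn

metrics = {
    "error_rate": (fp + fn) / n,
    "accuracy": (tp + tn) / n,
    "recall": tp / (tp + fn),
    "precision": tp / (tp + fp),
    "specificity": tn / (tn + fp),
    "false_alarm_rate": fp / (fp + tn),  # assumed to mean the false positive rate
}

# Part (c): if every record is classified as nonfraudulent, the only errors
# are the truly fraudulent records.
all_nonfraud_error = (tp + fn) / n

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"error rate when everything is nonfraudulent: {all_nonfraud_error:.4f}")
```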

Problem 4. Suppose we have developed a classifier that will be used in an alarm system. Usually we
are especially interested in the proportion of alarms caused by positive events (those that should really
trigger an alarm) and the proportion of alarms caused by negative events. The ratio between positive and
negative events can vary over time, so we want to measure the quality of our alarm system independently
of this ratio.
The table below shows the results of a probabilistic classifier on a given test set. Draw the ROC curve
for this classifier (you can use Excel to draw this chart).
Inst#  Class  Score      Inst#  Class  Score
1      p      0.9        11     p      0.4
2      p      0.8        12     n      0.39
3      n      0.7        13     p      0.38
4      p      0.6        14     p      0.37
5      p      0.55       15     n      0.36
6      p      0.54       16     n      0.35
7      n      0.53       17     p      0.34
8      n      0.52       18     n      0.33
9      p      0.51       19     p      0.30
10     n      0.505      20     n      0.1
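For those who prefer a script to Excel, one common way to obtain the ROC points is to sort the instances by decreasing score and lower the threshold one instance at a time, recording the false positive rate and true positive rate after each step. The sketch below follows that procedure using the classes and scores from the table above. The function name roc_points is only illustrative, and the approach assumes all scores are distinct, as they are here.

```python
# Classes and scores copied from the table, instances 1 to 20.
classes = list("ppnpppnnpn" "pnppnnpnpn")
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
          0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.30, 0.1]


def roc_points(classes, scores):
    """Return (false positive rate, true positive rate) pairs, starting at (0, 0).

    Each step lowers the threshold just enough to classify one more
    instance as positive. Assumes no tied scores.
    """
    ranked = sorted(zip(scores, classes), key=lambda t: t[0], reverse=True)
    n_pos = sum(1 for c in classes if c == "p")
    n_neg = len(classes) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, c in ranked:
        if c == "p":
            tp += 1   # a true positive moves the curve up
        else:
            fp += 1   # a false positive moves the curve right
        points.append((fp / n_neg, tp / n_pos))
    return points


points = roc_points(classes, scores)
for fpr, tpr in points:
    print(f"{fpr:.3f}\t{tpr:.3f}")
```

The printed columns can be pasted into Excel and drawn as a scatter plot with straight connecting lines, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis.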
