
Submitted to:

Prof. Vinay Singh Chawan


Table of Contents

Background
Data
Techniques Used
    Logistic Regression
    Multiple Linear Regression
Data Preparation
Applying Regression Models
    Logistic Regression Model
    Multiple Linear Regression
Result

Background

Tayko Software is a software catalog firm that sells games and educational software. It began as
a software manufacturer and has since added many third-party titles to its offerings. The company
now plans to mail a revised collection of items in a new catalog to its clients. The company
regards its customers as a key asset. To expand its customer base, it has recently joined a
consortium of catalog firms that specialize in computer hardware and software products. The
consortium gives members the opportunity to mail catalogs to names drawn from a pooled list of
customers. Members supply their own customer lists to the pool and can "withdraw" an
equivalent number of names each quarter. Members are allowed to do predictive modeling on the
records in the pool so that they can select names from it more effectively. The case requires us
to develop a model to predict whether a customer will purchase a product and, if so, how much he
or she will spend.

Tayko has supplied its customer list of 200,000 names to the pool, which totals over 5,000,000
names, so it is now entitled to draw 200,000 names for a mailing. Tayko would like to select the
names that have the best chance of performing well, so it conducts a test: it draws 2000 names
from the pool and does a test mailing of the new catalog to them. This mailing yielded 996
purchasers.

Data
There are two response variables in this case: "purch" indicates whether or not a prospect
responded to the test mailing and purchased something, while "spend" indicates, for those who
made a purchase, how much they spent. The overall procedure in this case will be to develop two
models. One will be used to classify records as "purchase" or "no purchase." The other will be
used for those cases that are classified as "purchase," and will predict the amount they will
spend.

The following table lists the predictor and dependent variables provided.

Var. #   Variable Name          Description                                            Type      Code Description
1        US                     Is it a US address?                                    binary    1: yes 0: no
2 - 16   Source_*               Source catalog for the record (15 possible sources)    binary    1: yes 0: no
17       Freq.                  Number of transactions in last year at source catalog  numeric
18       last_update_days_ago   How many days ago was last update to cust. record      numeric
19       1st_update_days_ago    How many days ago was 1st update to cust. record       numeric
20       Web_order              Customer placed at least 1 order via web               binary    1: yes 0: no
21       Gender=mal             Customer is male                                       binary    1: yes 0: no
22       Address_is_res         Address is a residence                                 binary    1: yes 0: no
23       Purchase               Person made purchase in test mailing                   binary    1: yes 0: no
24       Spending               Amount spent by customer in test mailing ($)           numeric
25       Partition              Partition the record will be assigned to               alpha     t: training v: validation

There are 2000 records, of which 1300 are to be used as the training set and the remaining 700
as the validation set.

Techniques Used

Logistic Regression

Logistic regression is a special type of regression in which a binary response variable is
related to a set of explanatory variables, which can be discrete or continuous. It extends the
idea of linear regression to a situation where the dependent variable, Y, is categorical. It is
used to classify a new observation whose class is unknown into one of the classes, based on the
values of its predictor variables.
In this case, we will use logistic regression to identify prospective buyers among the 2000
customers who received the test mailing.
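
As a minimal sketch of the idea, the logistic function maps a linear combination of predictors to a probability between 0 and 1. The coefficient values and the choice of two predictors below are hypothetical and serve only as an illustration; they are not the fitted values from this case.

```python
import math

def purchase_probability(intercept, coefficients, predictors):
    """Logistic function: P = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ...)))."""
    linear_score = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical coefficients for two predictors (say, Freq. and Web_order).
b0 = -1.0
b = [0.8, 0.5]

# A prospect with Freq. = 2 and Web_order = 1: score = -1.0 + 1.6 + 0.5 = 1.1
p = purchase_probability(b0, b, [2, 1])
print(round(p, 3))
```

A linear score of zero corresponds to a probability of exactly 0.5, which is why 0.5 is a natural default cutoff between the two classes.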

Multiple Linear Regression

The multiple linear regression model is used to fit a linear relationship between a quantitative
dependent (response) variable and a set of predictors (independent variables). One of its two
objectives is to fit the best model to the data in order to learn about the underlying
relationship in the population. The other, which is the major objective in data mining, is to
predict the response for new observations.

In this case, we will use a multiple linear regression model to predict the spending of each buyer.
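
The fitting step can be sketched as ordinary least squares solved through the normal equations. This is an illustrative sketch only: the report itself used SPSS, and the function names and the tiny data set below (generated from spend = 10 + 5*freq + 20*web_order) are made up for the example.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coeffs, row):
    """Predicted response: b0 + b1*x1 + b2*x2 + ..."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))

# Made-up data: (freq, web_order) -> spend, generated without noise.
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1)]
spend = [15, 40, 25, 50, 55]

coeffs = fit_linear_regression(rows, spend)
print([round(c, 2) for c in coeffs])
print(round(predict(coeffs, (6, 0)), 2))
```

Because the made-up data contain no noise, the fit recovers the generating coefficients; with real data the coefficients would only approximate the underlying relationship.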

Data Preparation
The data given to us consist of details of 2000 customers. Before we can apply the regression
techniques to the data, we have to clean it and partition it into two sets, viz. a training set
of 1300 records and a validation set of the remaining 700 records.
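
A minimal sketch of the partitioning step, assuming each record is held as a dictionary keyed by the variable names in the table above and that the Partition field carries the codes "t" and "v". The four sample records are invented for illustration.

```python
def split_by_partition(records):
    """Split records into training ('t') and validation ('v') sets."""
    training = [r for r in records if r["Partition"] == "t"]
    validation = [r for r in records if r["Partition"] == "v"]
    return training, validation

# Invented sample records; the real data set has 2000 rows.
records = [
    {"Freq.": 2, "Web_order": 1, "Purchase": 1, "Spending": 128.0, "Partition": "t"},
    {"Freq.": 0, "Web_order": 0, "Purchase": 0, "Spending": 0.0, "Partition": "v"},
    {"Freq.": 1, "Web_order": 0, "Purchase": 1, "Spending": 64.0, "Partition": "t"},
    {"Freq.": 3, "Web_order": 1, "Purchase": 0, "Spending": 0.0, "Partition": "t"},
]

training, validation = split_by_partition(records)
print(len(training), len(validation))
```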

Applying Regression Models

Logistic Regression Model:

(Please refer to the worksheets Logistic_Training & Logistic_Validation in the attached Excel
workbook, Group5_TaykoCase.xls.)

We use the training data to build the regression model and the validation data to validate it.
To apply the logistic regression model to find the prospective buyers, we followed the steps
below.

SPSS Results Screenshot

Step 1: Combined the training and test datasets into the training data.

Step 2: Calculated the probability of each prospect being a purchaser or non-purchaser using a
non-linear function of the form:

$$P = \frac{1}{1 + \exp\left(-(b_0 + b_1 x_1 + b_2 x_2 + \cdots)\right)}$$

Step 3: Calculated the significant coefficients b0, b1, ... using Binary Logistic in SPSS with
Purchase as the dependent variable.

Step 4: Assumed a cutoff value of 0.5. All prospects with a probability greater than 0.5 were
classified as buyers, and the rest as non-buyers.

Step 5: Created the confusion matrix. The success rate for the model is 82.85%.

Step 6: Sorted the probabilities in the descending order and plotted lift curve

[Figure: Lift curve for Training data. Series S with its linear trendline, Linear (S).]

Step 7: Validated the model on validation dataset. Success rate for the model on validation
dataset is found to be 78.14%. Lift curve for the same has been depicted below:

[Figure: Lift curve for Validation data. Series S with its linear trendline, Linear (S).]
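
The lift curves described in the steps above can be sketched as follows: sort the prospects by predicted probability in descending order and accumulate the actual purchases, then compare against a straight reference line that assumes purchasers are spread evenly. The function names and the five sample prospects are hypothetical, not taken from the workbook.

```python
def cumulative_lift(probabilities, actual):
    """Cumulative count of actual purchasers, ranked by predicted probability."""
    ranked = sorted(zip(probabilities, actual), key=lambda pair: pair[0], reverse=True)
    curve, running = [], 0
    for _, outcome in ranked:
        running += outcome
        curve.append(running)
    return curve

def baseline(actual):
    """Reference line: purchasers expected if prospects are picked at random."""
    rate = sum(actual) / len(actual)
    return [rate * (i + 1) for i in range(len(actual))]

# Invented predicted probabilities and actual outcomes (1 = purchased).
probs = [0.9, 0.2, 0.7, 0.4, 0.6]
actual = [1, 0, 1, 0, 0]

model_curve = cumulative_lift(probs, actual)
baseline_curve = baseline(actual)
print(model_curve)
print([round(v, 2) for v in baseline_curve])
```

The further the model curve rises above the reference line in the first few hundred prospects, the more purchasers are captured by mailing only the top-ranked names.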

Multiple Linear Regression:

(Please refer to the worksheets Linear Training & Linear Validation in the attached Excel
workbook.)

Step 1: Kept the same training data as for the logistic regression, along with the predicted probabilities.
Step 2: Ran a multiple linear regression in SPSS to find the significant coefficients, with
Spending as the dependent variable. This gave us the predicted spending of each individual buyer.

Step 3: Calculated the expected spending of each prospect by multiplying the predicted spending
with predicted probability of being a buyer

Step 4: Sorted the expected spending in descending order and plotted a lift curve. Compared it
with the no-model case.

SPSS Results Screenshot

[Figure: Expected Spending For Training Data. y-axis: Spending; x-axis: No. of Cust.]

[Figure: Lift (No Model Case).]

Step 5: Applied this regression model to the validation dataset and calculated the expected
spending by multiplying the predicted spending by the respective probabilities.

Step 6: The lift curve for the validation data has been depicted below:

[Figure: Expected Spending for Validation Data.]
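
The expected-spending calculation in the steps above can be sketched as follows: multiply each prospect's predicted spending by the predicted probability of purchase, then rank prospects by the product. The prospect identifiers and all numbers below are invented for illustration.

```python
def expected_spending(probabilities, predicted_spend):
    """Expected spending = P(purchase) * predicted spending, per prospect."""
    return [p * s for p, s in zip(probabilities, predicted_spend)]

def rank_prospects(ids, expected):
    """Prospect ids sorted by expected spending, highest first."""
    order = sorted(zip(ids, expected), key=lambda pair: pair[1], reverse=True)
    return [prospect for prospect, _ in order]

# Invented outputs of the two models for three prospects.
ids = ["A", "B", "C"]
probs = [0.9, 0.5, 0.2]          # from the logistic regression
spend = [100.0, 300.0, 400.0]    # from the multiple linear regression

expected = expected_spending(probs, spend)
ranking = rank_prospects(ids, expected)
print(ranking)
```

Note that prospect C has the highest predicted spending but ranks last, because the low purchase probability drags the expected value down. This is the reason for combining the two models rather than ranking on either output alone.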

Result
Logistic Regression
Confusion Matrix - Training Data

              Predicted 0   Predicted 1   Total
Actual 0          521           156         677
Actual 1           67           556         623
Total             588           712        1300

Correctly predicted: 1077 (82.85%)

Confusion Matrix - Validation Data

              Predicted 0   Predicted 1   Total
Actual 0          224            99         323
Actual 1           54           323         377
Total             278           422         700

Correctly predicted: 547 (78.14%)

The model is 82.85% accurate on the training data and 78.14% accurate on the validation data.
Without the model, the accuracy was only 50%. Compared with that baseline, the validation result
is an improvement of about 28 percentage points, which indicates that the model adds predictive
value.
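
The accuracy figures can be reproduced from the confusion-matrix cell counts reported above. The sketch below is only a check of that arithmetic; the function and variable names are our own.

```python
def accuracy(true_neg, false_pos, false_neg, true_pos):
    """Share of records on the diagonal of a 2x2 confusion matrix."""
    total = true_neg + false_pos + false_neg + true_pos
    return (true_neg + true_pos) / total

# Cell counts taken from the two confusion matrices in this section.
training_acc = accuracy(521, 156, 67, 556)     # 1077 correct out of 1300
validation_acc = accuracy(224, 99, 54, 323)    # 547 correct out of 700

print(f"{training_acc:.2%}")     # 82.85%
print(f"{validation_acc:.2%}")   # 78.14%

# Gain over the 50% no-model baseline, in percentage points.
gain_points = (validation_acc - 0.50) * 100
print(round(gain_points, 2))
```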

Linear Regression:
The result of the linear regression can be seen from the graphs of expected spending. Comparing
the expected-spending graphs with and without the model, we conclude that prediction efficiency
improves by 1.1 percentage points (15.4% versus 14.3%).
Without model: the initial 200 customers out of 1300 account for the maximum spending, i.e. 15.4%.
With model: the initial 100 customers out of 700 account for the maximum spending, i.e. 14.3%.
