Jelena Lazarevic
IN2 d.o.o
Vladimira Popovica 40
11070 Belgrade, Serbia
jelena.nadj@skolkovotech.ru
jelena.lazarevic@in2.rs
ABSTRACT
Keywords
In this report we describe our approach and results for the Data Mining course project - Influence of
coupons on order patterns. The project is taken from a competition organized by Prudsys AG.
1.
INTRODUCTION
ing data was high, we decided not to use it. Not all the variables should necessarily be used: highly correlated
variables should be detected, and some of the variables
should be transformed to make them suitable as
predictors. Also, not all the variables have a logical
connection with all target attributes. This is discussed
in more detail later in this report.
2.
BACKGROUND
3.
PROBLEM STATEMENT
\[
E = \sum_{i=1}^{n} \frac{\left| \mathit{coupon1Used}_i - \mathit{coupon1P}_i \right|}{\frac{1}{n} \sum_{j=1}^{n} \mathit{coupon1Used}_j} \qquad (1)
\]
4.
RESOURCES
Both training and test data are provided by the organizers. To build the model, we use functions
implemented in R packages for cross-validation, data preprocessing, model training, etc. The training data has 32
variables and 6052 observations. The test data has 28 variables
(the same ones, without the target variables) and 6690 observations.
The data codebook, which describes all the variables, is submitted
as a separate file.
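
To illustrate how cross-validation and the error measure of Equation (1) fit together, here is a minimal sketch. It is written in Python with the standard library only, not in the R packages we actually used. It assumes Equation (1) reads as the sum of absolute errors divided by the mean of the observed values. The function names, the fold scheme and the constant baseline predictor are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def coupon_error(actual, predicted):
    """Assumed reading of Equation (1): summed absolute error
    divided by the mean of the actual values."""
    denom = mean(actual)
    if denom == 0:
        raise ValueError("mean of actual values is zero; the measure is undefined")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / denom


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate_mean_baseline(y, k=5, seed=0):
    """Score a constant predictor (the training-fold mean) on each held-out fold."""
    scores = []
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        test = [y[i] for i in fold]
        prediction = mean(train)  # baseline: always predict the training mean
        scores.append(coupon_error(test, [prediction] * len(test)))
    return scores
```

A constant baseline like this gives a reference score that any trained model should beat. Note that the measure is undefined for a fold in which no coupon was redeemed, which the sketch reports as an error instead of hiding.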
5.
DATA ANALYSIS
5.1
Data Preprocessing
In the exploratory data analysis we examined the characteristics of each variable separately, and also as functions of other variables, to see whether any
pattern could be noticed. First, we checked how often users come back,
and whether anything can be inferred for a single user.
We extracted the data for the user who visited the shop the highest number of times,
userID 2bab1752b217fdd3704199dead8fa372, and did not notice
a pattern in his behavior. He received several groups of
coupons, several times, but his coupon redemption behavior
varied, as did his total basket value. Within the same group
of coupons, he sometimes redeemed them and sometimes did not, and
not always the same coupon, while the basket value varied
from 211.8 to 755.8, with a mean of 506.9. We thus found
8 observations with identical predictor values, representing the
purchases of one user, but with different values for coupon redemption. As most of the time the user redeemed none of
the coupons, or just one of them, we can tell that the coupons
were not the reason for his purchases, but there was still no
trivial way to predict whether he would use them. Most of the
other users had only 1 or 2 purchases in the shop, so learning
habits from their orders was not feasible.
We decided not to use the userID variable as a predictor since,
in our opinion, it could only mislead the algorithm.
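
The check of how often users come back can be sketched as follows. This is a minimal Python illustration, not our original R code. The function names and the toy userID values are invented for the example.

```python
from collections import Counter


def orders_per_user(user_ids):
    """Count how many orders each userID has."""
    return Counter(user_ids)


def visit_summary(user_ids):
    """Return the most frequent user, that user's order count,
    and the share of users with at most 2 orders."""
    counts = orders_per_user(user_ids)
    top_user, top_count = counts.most_common(1)[0]
    few = sum(1 for c in counts.values() if c <= 2)
    return top_user, top_count, few / len(counts)


# Toy userIDs, invented for illustration only
toy_ids = ["u1", "u2", "u1", "u3", "u1", "u4", "u2"]
print(visit_summary(toy_ids))
```

When the share of users with at most 2 orders is high, as it was in our data, per-user features carry little information for most rows.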
In the next step we discarded one observation in which the total
basket value was a strong outlier. The other observations did not
have values that extreme, so we decided not to discard anything
else.
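
One way to flag such an extreme basket value is sketched below in Python. The rule used here, distance from the median measured in units of the median absolute deviation (MAD), and the threshold of 10 are assumptions made for illustration. They are not necessarily the criterion we applied.

```python
from statistics import median


def mad_outliers(values, threshold=10.0):
    """Return indices of values whose distance from the median
    exceeds threshold * MAD (median absolute deviation)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations collapse to zero; the rule cannot rank anything.
        return []
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]
```

A median-based rule is used in the sketch because a single very large value inflates the mean and the standard deviation, which can mask the outlier itself.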
To see whether the variables are correlated, we
plotted pairs of variables. Figure 1 shows, for all orders, the price of the
first item for which a coupon was given, with color indicating whether
that coupon was redeemed.
No connection was spotted here, as coupons were both redeemed and
not redeemed across the whole spectrum of prices. Similar plots were
obtained for all three coupons.
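
The visual check can be complemented by a numeric one. The sketch below computes the Pearson correlation between a numeric variable such as price1 and a 0/1 indicator such as coupon1Used. It is a Python illustration under our own naming, not part of the original analysis, and it only captures linear association.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)
```

A value close to 0 would agree with the impression from the plot that redemption occurs across the whole price range.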
Figure 2: Coupon redemption as a function of the
time difference between getting the coupon and
making the purchase. (Plot not reproduced; x-axis: dat_train.orderID, y-axis: Time Difference[s], color: Coupon1Used, 0 or 1.)
Figure 3 compares the box plots of the total basket value, depending on the number of coupons redeemed (0 - 3). The
range of values varies significantly between groups, but
the mean basketValue, and even the 25th and 75th percentiles,
are really close, so we did not conclude that there is a connection
here. Figure 4 shows a similar plot just for the redemption of coupon1, where similar observations were made
and no obvious connection was inferred.
Mean values for the target attributes are coupon1Used = 0.24,
coupon2Used = 0.19, coupon3Used = 0.17 and basketValue =
293.07.
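
The numbers behind such box plots can be computed directly. The following Python sketch groups basket values by the number of redeemed coupons and reports the mean, median and quartiles per group. The row layout and the function name are assumptions made for the example, and quartile conventions differ slightly between tools.

```python
from collections import defaultdict
from statistics import mean, median, quantiles


def basket_summary_by_redeemed(rows):
    """rows: iterable of (coupon1Used, coupon2Used, coupon3Used, basketValue).
    Returns per-group statistics keyed by the number of coupons redeemed."""
    groups = defaultdict(list)
    for c1, c2, c3, basket in rows:
        groups[c1 + c2 + c3].append(basket)

    summary = {}
    for redeemed, values in sorted(groups.items()):
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = q3 = values[0]  # a single value is its own quartile
        summary[redeemed] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "q1": q1,
            "q3": q3,
        }
    return summary
```

Comparing the quartiles across groups gives the same information as the boxes in the plot, without depending on the axis scale.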
Figure 1: Price of the first item the coupon was
given for and the coupon redemption. (Plot not reproduced; x-axis: orderID, y-axis: price1, color: Coupon1Used, 0 or 1.)
To train models on the data we used different algorithms. First, we tried a linear model, although we did
not expect it to perform well, since our exploration of the data
showed no strong linear connection between any of the
variables.
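
As an illustration of the simplest such model, the Python sketch below fits a least-squares line with one predictor and reports the coefficient of determination. It is a simplified stand-in with hypothetical function names, not the R model we trained, which could use several predictors.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor is constant; the slope is undefined")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if ss_tot == 0:
        raise ValueError("target is constant; R^2 is undefined")
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - ss_res / ss_tot
```

An R^2 near 0 on the training data would confirm numerically that a linear model has little to work with.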
Figure 4: Boxplots of total basket value depending
on the first coupon redemption. (Plot not reproduced; x-axis: Coupon1 redeemed, y-axis: basketValue.)
5.2
Models
5.3
Results
6.
predictors were the same, we do not expect that a general pattern exists. It is more plausible that each person has their
own shopping habits and routine than that there is a general
pattern for all people. Apart from patterns on the customer
level, we also expected patterns on the level of the type of
coupon, that is, that a given coupon would usually be redeemed or usually not, but this
connection has not been detected.
On the other hand, we do believe there are ways in which
customers can be influenced to use coupons, but the
data representing those connections has not been included
here. For example, it depends on how the marketing is done
and how the products are promoted together (what is likely
to be bought together), etc. Also, the fact that we did not
use three of the given variables could have an effect on the
outcome, but we do not expect the effect to be strong, since in
the extracted data where the values were present we did not
notice a strong correlation with the values of the target
attributes.
The fact that the data was of poor quality and hard to work with
made this task more interesting and educational. We went
through different approaches commonly used in prediction
tasks, especially a lot of preprocessing methods.
Unfortunately, problems like this are always open-ended,
and it is never known in advance whether any pattern really exists.
While showing that a pattern exists when it really does is
fairly easy, showing that there is no pattern is a complicated task, because there is always something that has not
been tried yet.
7.
REFERENCES