
Data Mining Techniques - Assignment 2

Authors: Hidde Hovenkamp (2541936) and Dennis Ramondt (2540351)


Vrije Universiteit, Amsterdam

Introduction

This paper presents the approach, results and learning process of group 6's participation in the Data Mining
Techniques class competition. The challenge is based on a now-closed Kaggle competition on learning to rank
hotels so as to maximise bookings for hotel queries on Expedia.com. The dataset consists of search and hotel ID
pairs, populated with hotel characteristics such as displayed booking price, location attractiveness and star rating.
A search-hotel pair is assigned a relevance score of 1 if it has only been clicked by the user, 5 if it has also been
booked, and 0 otherwise. This score is used to calculate the Normalized Discounted Cumulative Gain (NDCG), on which the hotel ranking and
final evaluation of participating teams are based. The defining characteristic of our approach is that we decided to use
an existing ranking-algorithm package, RankLib, which allowed us to focus our attention on feature creation and
selection.
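To make the evaluation metric concrete, the sketch below computes NDCG@38 for a single query from its relevance labels (0 = not clicked, 1 = clicked, 5 = booked). It is a minimal Python illustration, not the competition's official scorer, and it assumes the common 2^rel - 1 gain with a log2 discount; the exact convention used by Kaggle's scorer may differ slightly.

```python
import numpy as np

def dcg_at_k(relevances, k=38):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    return np.sum((2 ** rel - 1) / discounts)

def ndcg_at_k(relevances, k=38):
    """NDCG@k: DCG of the proposed ranking divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: a query where the booked hotel (label 5) is ranked third.
print(ndcg_at_k([0, 1, 5, 0, 0]))  # ~0.51
```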
This paper is structured as follows. Section 2 covers data exploration and preprocessing, during which we
discuss important properties of the dataset and select and create relevant hotel features. Section 3 explains the
feature selection. Section 4 explains the modelling procedure and which ranking algorithms were chosen. Sections
4 and 5 also give special attention to the approach taken by other teams in the Kaggle competition. Section 5 presents
our results and Section 6 draws conclusions and evaluates the modelling process. The paper closes with the process report,
describing how we worked together, how we divided tasks and what could be improved.

Fig. 1. Share of missing values per feature, shown only for features that contained missing values

2 Data Preparation

2.1 Exploration

The training set consists of 4,958,347 search-hotel pairs with 199,549 unique searches and 51 features; the test set
contains 4,959,183 search-hotel pairs with 199,795 unique searches and 47 features. An initial inspection of the data
reveals some interesting properties. First of all, as visible in Figure 1, several features largely consist of
missing values. Furthermore, the dataset contains only 4.47% positive outcomes (a relevance score of 1 or 5),
which could cause certain ranking algorithms to train mostly on negative outcomes.

Fig. 2. Click and book percentages when ranked randomly or by Expedia's own ranking algorithm.

Figure 2 shows the percentage of entries that were clicked or booked as a function of the ranking position, which
was computed either randomly or by Expedia's own algorithm. It clearly shows that the likelihood of a hotel being
booked after it has been clicked is a lot higher when Expedia provides the ranking. Finally, histogram plots of three
numerical features in Figure 3 show that they have highly skewed distributions with extreme outliers.

2.2 Feature Creation

The challenge of creating a good feature set is to select and create features that are expected to be highly correlated
with the relevance scores in both training and test sets. Overall, publications, discussions on the Kaggle forum and
team presentations suggest that the price, second location score and destination ID are the strongest predictors [3].
Table 1 shows an overview of the transformed and composite features we created, the rationale behind which is
discussed below. We used a logistic regression on the outcome variable (relevance score) in order to assess the
relevance of each feature.
Missing value imputation and outliers As described above, many features contain a significant number of missing
values. In many cases, a missing value is informative in itself; when a hotel has no previous reviews, we interpret this
missing information as something negative and impute a value of zero. Missing review scores, second location
scores and search query log scores were therefore set to zero; the remaining missing values were imputed with the median. Furthermore, as Figure 3 shows,
the numerical features contain extreme outliers beyond the 0.999 quantile, which have been deleted.
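A minimal pandas sketch of this imputation and outlier step is given below. The Expedia column names come from the competition data, but the file name, the exact list of zero-imputed columns and the three clipped features are placeholders based on the description above; the authors' actual preprocessing was done in MATLAB.

```python
import pandas as pd

df = pd.read_csv("training_set_VU_DM_2014.csv")  # placeholder file name

# Missing values that carry information (e.g. no review history) become 0.
zero_cols = ["prop_review_score", "prop_location_score2", "srch_query_affinity_score"]
df[zero_cols] = df[zero_cols].fillna(0)

# Remaining numeric columns: impute with the median.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Delete rows above the 0.999 quantile of the heavily skewed numeric features.
for col in ["price_usd", "visitor_hist_adr_usd", "srch_booking_window"]:
    df = df[df[col] <= df[col].quantile(0.999)]
```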

Fig. 3. Histogram plots of three relevant numerical features, with 0.999 quantiles indicated, above which datapoints were
deleted as outliers.

Monotonic utility Various numerical features have preference profiles shaped like a peaked distribution,
implying some optimal value. However, the team of Jun and Wang (second place in the competition)
rightfully proposed to construct features with monotonically increasing utility with respect to the target variable, i.e.
where a higher feature value implies a higher chance of being booked. Such a transformation can be achieved by
taking the absolute difference between a feature value and the feature's mean over booked entries. Figure
4 shows histograms of booking frequencies, from which it can be seen that certain features are monotonically
increasing and others less so.
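A small sketch of this transform, following the Table 1 formulas (the mean is taken over booked rows only, on the training set); column names follow the Expedia data and the data frame is assumed to come from the preprocessing step above.

```python
def monotonic(df, col):
    """Absolute distance to the feature's mean over booked entries (Table 1)."""
    booked_mean = df.loc[df["booking_bool"] == 1, col].mean()
    return (df[col] - booked_mean).abs()

df["star_monotonic"] = monotonic(df, "prop_starrating")
df["review_monotonic"] = monotonic(df, "prop_review_score")
```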

Fig. 4. Histogram plots of booking frequencies for several features.

Normalisation Furthermore, Owen (first place in the competition) explains that some numerical features need to
be normalised by subtracting a subgroup average. For example, certain searches may contain proportionally better
or worse hotels, which puts those hotels at a disadvantage or advantage with respect to hotels in other searches; a sketch is given below.
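A sketch of this per-group demeaning (an assumed implementation; srch_id and srch_destination_id are the Expedia grouping columns, and the list of demeaned features is illustrative):

```python
# Subtract the per-search and per-destination averages from selected features.
for col in ["prop_starrating", "prop_review_score", "prop_location_score2", "price_usd"]:
    df[f"{col}_norm_srch"] = df[col] - df.groupby("srch_id")[col].transform("mean")
    df[f"{col}_norm_dest"] = df[col] - df.groupby("srch_destination_id")[col].transform("mean")
```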

Composite Features Finally, relevant new features can be created by combining existing ones. For example,
the proportion of times each hotel has been clicked and booked after appearing in search results is a good indicator
of hotel quality (to be calculated only on the training set). Hotel rankings within search or destination IDs based
on feature values are also expected to be relevant. We implemented our own algorithm to create such rankings
for several numerical features. Finally, two features were added indicating whether a competitor offered a
cheaper booking than Expedia and what the price difference was. Other Kaggle teams reported that the cheaper-competitor
indicator was useful, but that the price difference proved less significant. Table 1 shows the formulas used for creating
our composite features. Some of these features are also shown in Figure 4, where the distribution of bookings within
these features is depicted. It is interesting to see that for the difference variables, values close to zero indeed correspond to
many more bookings.
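The sketch below illustrates the within-search value rankings and the hotel-quality and cheaper-competitor features. It is a Python illustration of the ideas in Table 1 rather than the authors' MATLAB code: the hotel-quality definition here (click/book share per property) follows the prose rather than the table's formula, and the competitor indicator uses the public data dictionary's convention that comp*_rate = -1 means a competitor offered a lower price, which differs slightly from the table.

```python
# Within-search rank of selected numerical features (rank 1 = best value in the search).
for col in ["price_usd", "prop_starrating", "prop_location_score2"]:
    df[f"{col}_rank_srch"] = df.groupby("srch_id")[col].rank(ascending=(col == "price_usd"))

# Hotel quality: share of a property's appearances that led to a click or a booking
# (training set only, since the test set has no click/booking labels).
df["hotel_quality"] = (df.groupby("prop_id")["booking_bool"].transform("mean")
                       + df.groupby("prop_id")["click_bool"].transform("mean"))

# Cheaper-competitor indicator over the eight competitor columns comp1_rate ... comp8_rate.
comp_cols = [f"comp{i}_rate" for i in range(1, 9)]
df["comp_cheap"] = (df[comp_cols] == -1).any(axis=1).astype(int)
```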
Feature              Formula                                               Indexing
Hotel quality        relevance_score - mean(relevance_score)_i            i = Search ID
Star diff            abs(visitor_hist_starrating - prop_starrating)
Price diff           abs(visitor_hist_adr_usd - price_usd)
Price hist diff      abs(prop_log_historical_price - log(price_usd))
Comp cheap           1 if (comp_rate_i < price_usd AND comp_inf_i = 1)    Competitor i = 1:8
Comp cheap diff      max(comp_rate_i) if comp_cheap = 1                   Competitor i = 1:8
Star monotonic       abs(prop_star - mean(prop_star[booking_bool]))
Review monotonic     abs(prop_review - mean(prop_review[booking_bool]))
Feature ranked       rank(feature)_i                                       i = Search ID
Feature mean         mean(feature)_i                                       i = Search/Destination ID
Feature normalized   feature - mean(feature)_i                             i = Search/Destination ID
Table 1. The various transformed and composite features and their formulas. Where applicable, formulas were applied over specific feature categories through indexing. Ranking and normalization were implemented for several numerical features.

3 Feature Selection

To get a first indication of the relative importance of our features we use a logistic regression of the features on
whether a property was booked. We start with a set of independent variables that includes all features. From there,
we delete insignificant features one by one and re-run the logistic regression (at the 5% significance level). We keep
doing this until we are left with a set of features that all have a significant effect on booking. Table 2 shows the
parameter estimates for the final set of features left in the logistic regression. The results provide an indication of
which features seem to have large predictive power for booking. Interestingly, many of the features normalised over
search id (ID) and destination id (DEST) are suggested to be important. Additionally, the mean star rating and
mean review score per search id are also good predictors, and the numerical features ranked within a search id appear to be
very useful as well. While star difference and price difference are included in the final set of features, comp.
cheap difference and price historical difference seem to be less important, as they were not significant. Above all,
our hotel quality feature is the strongest predictor of booking, which is what we had expected. In general we find
the signs of the parameters to point in the direction we would have expected. A few have the opposite sign;
however, these features enter the model in several transformed forms, which probably means they are interacting with each
other.
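A minimal sketch of this backward elimination using statsmodels (p-value threshold 0.05) is shown below; the feature list is a placeholder and the authors' actual regression was run in MATLAB.

```python
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Drop the least significant feature until all p-values are below alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return model, cols
        cols.remove(worst)
    return None, []

# feature_cols is a placeholder list of candidate feature names.
model, selected = backward_eliminate(df[feature_cols], df["booking_bool"])
```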

Feature                             Parameter Estimate   Feature                     Parameter Estimate
intercept                           -4.9168              rank - star rating            0.0180
meanID - star rating                -0.0684              rank - comp. cheap diff      -0.0158
meanID - review                      0.4096              rank - location score 1      -0.0199
normalisedID - star rating           0.2851              rank - location score 2       0.0263
normalisedID - review                0.4200              rank - price                 -0.0735
normalisedID - location score 2      1.9844              star diff.                    0.0751
normalisedDEST - star rating         0.0222              price diff.                  -0.0020
normalisedDEST - review             -0.2323              location score 2             -0.9035
normalisedDEST - location score 2    1.2601              hotel quality                12.7953
location score 1                     0.0883
Table 2. Final result from the logistic regression on booking. The parameter estimates of the regression give an indication of
the relative importance of the features.

4 Modelling Approach

We implemented a step-wise modelling approach to the Expedia hotel ranking problem, on which we now
elaborate. First, we used a logistic regression to determine which features seem most relevant. Second, we evaluated
various ranking algorithms on 5% of our training set. Third, we looked at some built-in normalisation procedures.
Finally, we trained our chosen model on 10% of the training data set in order to make our prediction.

4.1 Ranking models

For ranking problems, we found that most competitors in the Kaggle competition used a very efficient package in
Java: RankLib, which includes several algorithms made specifically for learning to rank. RankLib consists of the
following algorithms: RankNet, RankBoost, AdaRank, Coordinate Ascent, Random Forest, ListNet, MART and
LambdaMART. In what follows, we explain how these algorithms work, which we expect to perform best and why.
RankNet is a pair-wise ranking algorithm based on neural networks. For each pair of documents with a known relative order,
each document is propagated through the network separately [2]. The difference between their scores is then mapped
through a logistic function to obtain a probability, which is compared with the true label for that pair. Finally, all weights in the network are
updated using error back-propagation and a gradient descent method.
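In the standard RankNet formulation described by Burges [1], the modelled probability that document $i$ should be ranked above document $j$ is a logistic function of the score difference:

\[
P_{ij} = \frac{1}{1 + e^{-\sigma (s_i - s_j)}},
\]

where $s_i$ and $s_j$ are the network outputs for the two documents and $\sigma$ is a scaling parameter; the cross-entropy between $P_{ij}$ and the true pairwise label drives the weight updates.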
RankBoost also uses a pair-wise boosting technique, where training proceeds in rounds [2]. All documents start
with equal weights, and in each round the learner selects the weak ranker with the smallest pair-wise loss on the
training data. Pairs that are incorrectly ranked obtain more weight, such that the algorithm focuses on these pairs
in the next round. The final model then consists of a linear combination of these weak rankers.
AdaRank works in essentially the same way as RankBoost, except that it is list-wise rather than pair-wise. The
advantage is that it directly maximizes any information retrieval metric, such as NDCG in our case. This could
prove to be an advantage over RankBoost for our purposes.
While coordinate ascent is often used for unconstrained optimization, Metzler and Croft have proposed a different
version of the algorithm used for information retrieval [4]. It cycles through each parameter and optimizes over it
while keeping all other parameters fixed. When implemented in a list-wise linear model this technique can be used
for ranking.

A Random Forest is an ensemble of decision trees. Since single decision trees are likely to overfit when grown
too large, but underfit when kept too small, averaging over a set of decision trees can balance out these effects. This
method is very efficient since there are very few parameters to tune.
ListNet is a learning method that optimizes a list-wise loss function with a neural network as model and gradient descent as
optimization algorithm, which makes it very similar to RankNet [1]. Instead of using document pairs as instances, it uses entire document lists.
MART (multiple additive regression trees) is a gradient boosted tree model [1]. In a sense, it is more a class
of models than a single algorithm. The underlying model in our case is the least-squares regression tree, and
gradient descent is used as the optimization algorithm.
Last, the best model found in the literature for ranking is often claimed to be LambdaMART. It is a combination
of LambdaRank (an improved version of RankNet) and MART: it uses LambdaRank to model the gradients and
MART to work on these gradients [1]. Combined, we obtain a MART model that uses the Newton-Raphson method for
approximation. The decisions for splits at a certain node are computed using all the data that falls into that node.
This makes it able to choose splits and leaf values that may decrease local utility but increase overall utility [1].
For an initial assessment of which ranking model performs best on the Expedia data set, we split our data
into three different sets: the training set, the validation set and our own test set. Since the total training data set
contains almost 5 million rows, we sampled a subset from it to use for model building. First, we randomly
sample 10,000 search ids from the entire training dataset, which amounts to approximately 200,000 rows. Of this sample,
75% is used as the actual training data and 25% is used by the ranking algorithms for validation. Second, we
sample another 10,000 search ids from the training set, which we keep entirely separate so that they can be used as our own
test set.
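A sketch of this search-id based sampling and splitting (a Python illustration under the assumptions above; the authors performed the sampling in MATLAB, and the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
all_ids = df["srch_id"].unique()
sampled = rng.choice(all_ids, size=20_000, replace=False)

model_ids, holdout_ids = sampled[:10_000], sampled[10_000:]   # model building vs. own test set
train_ids = model_ids[: int(0.75 * len(model_ids))]           # 75% training
vali_ids = model_ids[int(0.75 * len(model_ids)):]             # 25% validation

train = df[df["srch_id"].isin(train_ids)]
vali = df[df["srch_id"].isin(vali_ids)]
test = df[df["srch_id"].isin(holdout_ids)]
```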

4.2 Normalization procedure

Having determined the model, we also test whether several normalization procedures improve the results. We
try the following methods: normalization by sum (1), normalization by z-score (2) and linear normalization (3):

\[
x_{\text{sum}} = \frac{x}{\sum x}, \qquad
x_{\text{z}} = \frac{x - \bar{x}}{\sigma_x}, \qquad
x_{\text{linear}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]
4.3 Final prediction model

Once we have chosen the best ranking model and normalization procedure, we train our final model on a training
set of 20,000 search ids, which amounts to roughly 500,000 rows. We also tune the parameters to find the optimal
parameter settings for our ranking problem. We use this model to create our final prediction on the test set as
provided.
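For reference, RankLib expects its input in the LETOR/SVMlight-style text format (one row per search-hotel pair, grouped by query id) and is driven from the command line. The snippet below is a hedged sketch only: the file names are placeholders, and the flag values (ranker 6 for LambdaMART, NDCG@38 as the training metric) should be checked against the RankLib version actually used.

```
# One line per search-hotel pair: <relevance> qid:<srch_id> <feature_index>:<value> ...
5 qid:1013 1:0.42 2:3.5 3:0.081 ...   # booked hotel
0 qid:1013 1:0.17 2:4.0 3:0.032 ...   # not clicked

# Training and evaluation (placeholder file names):
java -jar RankLib.jar -train train.txt -validate vali.txt -test test.txt \
     -ranker 6 -metric2t NDCG@38 -save lambdamart.model
```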

5 Results

First we train the model using 9 different algorithms from the Java implementation RankLib. We train the model
on roughly 5% of the data and test the results on an equal amount. We start with the default settings in RankLib
for all the models to get a feeling for which class of models performs best for our ranking problem. Table 3 shows
the results for the training set, validation set and test set. From the table we can see that LambdaMART is the best
performing model, followed by MART and Random Forest.¹ The two neural network models RankNet and ListNet
perform very poorly. For almost all models, the results on the training and validation sets are higher
than on the test set, which means we are slightly overfitting.
Model               Training Set   Validation Set   Test Set
MART                0.5547         0.5355           0.4868
RankBoost           0.4877         0.4727           0.4531
AdaRank             0.5094         0.5066           0.4612
RankNet             0.3497         0.3424           0.3495
Coordinate Ascent   0.5127         0.5105           0.4641
LambdaRank          0.3498         0.3389           0.3502
LambdaMART          0.5627         0.5409           0.4920
ListNet             0.3498         0.3389           0.3502
Random Forest       0.5595         0.5242           0.4813
Table 3. Results for 9 ranking models on the training, validation and test data, measured in NDCG@38. The training data
consists of 7,500 random search ids, the validation set of 2,500 search ids and the test set of 10,000 search ids.

In line with reports from previous winners of the Kaggle competition, we also find that LambdaMART performs
best for this ranking problem. Next we evaluate whether normalizing the entire feature set, using different
procedures, further improves the model. Table 4 shows the NDCG scores for the model without normalization and for the sum, z-score and linear normalizations (as described in the previous section). Interestingly, we
find that the model with no normalization performs best, although the linear normalization comes very close and
outperforms the rest on the training and validation sets.
Normalization           Training Set   Validation Set   Test Set
LambdaMART - normal     0.5627         0.5409           0.4920
LambdaMART - sum        0.5659         0.5396           0.4892
LambdaMART - zscore     0.5564         0.5399           0.4890
LambdaMART - linear     0.5675         0.5478           0.4915
Table 4. Results for different normalization procedures on the training, validation and test data, measured in NDCG@38.
The training data consists of 7,500 random search ids, the validation set of 2,500 search ids and the test set of 10,000 search ids.

Finally, we investigated whether we could further fine-tune our LambdaMART model with no normalization to
improve our score. We checked whether increasing the number of trees from 1000 to 2000 or 3000 would improve
the model, but the scores were exactly the same. Using the optimal model specification found so far, we ran a final
LambdaMART model on a training set of 40,000 search ids, and found 0.5659 on the training set, 0.5505 on the
validation set and 0.4977 on the test set.² We also performed a five-fold cross-validation on this final model, for
which the results can be found in Table 5. Fold 1 seems to perform best, based on its test set. Making a prediction
with this model is expected to lead to a slightly higher NDCG score.
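A sketch of how such a five-fold split over search ids can be constructed is shown below, using scikit-learn's GroupKFold as an illustration; the report does not state how the folds were generated, so grouping by srch_id here is an assumption consistent with the earlier splits.

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(df, groups=df["srch_id"]), start=1):
    fold_train, fold_test = df.iloc[train_idx], df.iloc[test_idx]
    # Export fold_train / fold_test to the RankLib text format and train LambdaMART per fold.
    print(fold, len(fold_train), len(fold_test))
```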
¹ Although LambdaMART has the highest NDCG scores in the table, our prediction was made with a MART model. This is
because there was an error in our LambdaMART implementation, which we only found after the deadline for handing in
our prediction had passed.
² 40,000 search ids was the maximum number possible, given our computing power constraints.

Cross-validation fold   Training Set   Validation Set   Fold Test Set
LambdaMART - fold 1     0.5704         0.5354           0.5460
LambdaMART - fold 2     0.5644         0.5340           0.5457
LambdaMART - fold 3     0.5752         0.5368           0.5383
LambdaMART - fold 4     0.5693         0.5339           0.5451
LambdaMART - fold 5     0.5770         0.5551           0.5331
Average                 0.5713         0.5382           0.5416
Table 5. Results for the five-fold cross-validation on the training and validation data, measured in NDCG@38. The training
data consists of 7,500 random search ids and the validation set of 2,500 search ids. The fold test error is the error on a separate
test set held apart in the cross-validation, calculated after completing each fold.

6 Conclusion

6.1 Summary of main findings

The main conclusion that can be drawn from our results is that LambdaMART is the best model for learning to
rank hotels such that bookings are maximised. This conclusion is in line with what we find in the literature and
with the top performing teams in the Kaggle competition. Of the original features, the winning team stated that
the second location score, the price and the ranking position were the most important features. Our combination of
logistic regression and incremental model adjustments mostly agreed with this, although the rank position did not
work out as well, and instead pointed to the review score as relevant. Of our composite features, the hotel quality,
difference features, normalised and mean features and value rankings proved significant predictors. Overall, what
the analysis shows is that by far the most value lies in the transformed and composite features. This suggests that
we were right to focus on the feature creation and selection process, and could even have experimented with many
more new features.

6.2 Suggestions for further improvement

Although we obtained a relatively powerful model for predicting the rank of hotels on Expedia, there are several
suggestions for further improvement on which we would like to briefly elaborate. First, although we tried to create
a balanced dataset with roughly equal amounts of negative and positive outcomes, we did not succeed in doing this
properly. We think doing this properly could really improve the model, as most of the winners emphasize its importance.
Second, all our feature engineering was performed only on the training set. It would
have been even better to combine the training and test sets and compute the ranking features, de-meaned features
and means per property id and destination on the entire dataset. Third, we tried to create an extra feature that
measures the average position of a hotel over all search queries. We think this could be a very important feature;
however, it did not significantly improve the model. Since we still believe in its potential, we suspect something may
have gone wrong in computing it, and we would suggest investigating this variable further. Last, on a more practical note, we had some difficulties with the enormous
dataset for this assignment. Due to computational constraints on our MacBooks (running out of memory) we were
only able to train our models on at most 10% of the dataset. Our scores would probably have improved somewhat if
we could have trained on the entire training set, for example by making use of external computing power.

6.3 Process Evaluation

Looking back at the process, there are a couple of things we would do differently next time. First, it would have been
better if we had spent more time at the beginning exploring which model would be best to use for this problem and
what would be the best software to implement it in. This would have prevented us from working on programming
models that we were not able to use in the end. Second, we should have started with a much simpler model,
including just a few features, and made the prediction work with this model first. Because we lost time
figuring out how the Java package worked, we had already created a rich set of features by the time we put them into our
model. However, this made it very difficult to find the small errors and mistakes we had made, and caused our predictions
to be very bad for a long time. If we had started with a simple model first to make sure everything worked properly,
we probably would have had more time at the end to further improve the model. This is also why
we made our prediction with a MART model, while LambdaMART would have improved on this: we
only found the mistake after the deadline for handing in the prediction file.

References
1. Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23581, 2010.
2. V. Dang and W. Bruce Croft. Feature selection for document ranking using best first search and coordinate ascent. In
SIGIR Workshop on Feature Generation and Selection for Information Retrieval, 2010.
3. Xudong Liu, Bing Xu, Yuyu Zhang, Qiang Yan, Liang Pang, Qiang Li, Hanxiao Sun, and Bin Wang. Combination of
diverse ranking models for personalized Expedia hotel searches. arXiv preprint arXiv:1311.7679, 2013.
4. Donald Metzler and W. Bruce Croft. Linear feature-based models for information retrieval. Information Retrieval,
10(3):257-274, 2007.


Process Report

Our group consists of Dennis and Hidde, who are both studying econometrics; Dennis also followed the course
on Neural Networks. We both have by far the most programming experience in MATLAB. We worked on this
assignment for a total of five weeks and generally worked on it together at the same time. First we
will go through what we did week by week and then reflect on the process and what we might do differently
next time.
In week 1 we received the assignment and started with some initial exploration. The task of
ranking hotels for Expedia seemed like a very interesting and practical application of data mining to us. Hidde read
through the reports written by the winners of the Kaggle competition to get a feeling for well-working algorithms
and to better understand the task at hand. Hidde also went through the slides of the previous lectures to see which
techniques would be useful for this assignment. At the same time, Dennis started examining the data to see how we
could import it into MATLAB. He also looked into the possibility of using his previous experience with neural
network models for this assignment.
In week 2 we started thinking about the first models we could use for this problem and made a first attempt
at programming these in MATLAB. While Hidde worked on an implementation of Random Forests, Dennis tried
to train a multilayer perceptron (MLP) on a small part of the dataset. We started with the modelling part before
feature preparation and selection to get a feeling for how complicated it would be to use these models for
prediction and what the computing time would be. We soon discovered that building these models ourselves took a lot
of time and that computation in MATLAB was very slow. Therefore, we decided to change our approach. Instead of
trying to build a model from the knowledge we had, we looked for implementations of some of the ranking
models used by the winners of the Kaggle competition, such as (Lambda)MART. We found a very good package
written in Java, RankLib, that most of the well-performing teams in the Kaggle competition had used. Although we
had no experience programming in Java, we decided to go with this package so that, once we understood
it and were able to use it, we could focus on feature building and selection. This seemed like the most
important driver for getting a good score.
In week 3 we focused first on understanding the RankLib package, so we would be sure we could
hand in a prediction on May 17. While Dennis familiarised himself with Java and figured out the technical side
of getting RankLib to work, Hidde investigated the various algorithms in the package to understand how they
worked and which might work best. Hidde also continued with the data preparation, which we decided to keep
doing in MATLAB. We would then export the dataset to Java to train our models and import the results back into MATLAB
to determine the final ranking and create the prediction file. While we worked on being able to train our first
model in Java, we also started working intensively on missing data, removing outliers, creating new features
and transforming existing features.
By the time we reached week 4 we finally had our Java program working and could train our first models.
We compared the various ranking models in RankLib on a small training set and quickly found that MART was
rather fast and also performed well. Dennis worked on creating ranked numerical features within search queries and
Hidde combined features such as competitor prices to make more powerful features. Dennis also wrote
the MATLAB code for drawing a random sample of roughly 5% of the training data on which we trained our
models. Hidde trained a variety of models in Java and tested them on our own test set, which also consisted of 5% of the
training set. However, we experienced a lot of trouble training a proper model because we spent a lot of time
debugging our Java implementation. We decided to verify whether our model could make a proper prediction
by predicting the Kaggle test set and uploading our prediction to check the score. In the end, we managed to get a
score on Kaggle of 0.48457, which would have placed us around 100th. By this point, however, we had very little time
left to fine-tune our model or train it on a larger dataset, because we had to hand in our prediction.
Week 5 was the week of the final lecture, in which we spent time preparing the presentation and mostly
worked on the final report. Dennis wrote the parts on data exploration and feature building, while Hidde elaborated
on the modelling approach. Together we wrote the introduction and conclusion and finished the report.
Reflecting on our cooperation as a team, the collaboration between Dennis and Hidde was very good. Since we
know each other very well, it was easy to work together and use each other's strengths. While Hidde has a somewhat
stronger theoretical background in econometrics at the VU, Dennis was able to use his programming experience
from the neural networks course. Although we both had very busy schedules, we managed to work on the assignment
on a regular basis, helped by the fact that we live in the same house. We deliberately chose to work in
a team of two rather than three, because we are both so busy that it would have been hard to find proper meeting
times with a third person. The downside was that we had to do more of the work ourselves instead of being
able to divide it among three people. Nevertheless, we think the time we saved by working in a team that knows each
other very well outweighs this extra work. A possible pitfall of knowing each other well is that you might overlook
mistakes or opportunities because your ways of thinking are too similar.
Overall, we can look back on an interesting and very practical course on data mining, in which we learned a lot
about different methods but also about the data mining process itself. The very practical application of the Kaggle
assignments definitely made data mining come to life for us and contributed strongly to our enthusiasm. Although
at times frustrated during the process, we both finish the course with a lot of new knowledge, experience and
satisfaction.

