
# EXPEDIA HOTEL RECOMMENDATION
Bing Zhang, Qingxuan Li
Department of Electrical and Computer Engineering
University of California, Davis

PROJECT STATEMENT
Use different methods to predict a user's hotel selection from their existing data features.
The Expedia dataset provides around 20 feature components for each prediction.
First, apply feature selection to reduce the input features to a smaller size:
- Backward elimination
- PCA dimension reduction
Then implement different learning methods:
- Softmax
- KNN
- K-means
- Classification Tree (with k-fold cross validation)
Finally, compare the models and find the best one for Expedia to predict its users' future hotel assignments.

FEATURE SELECTION
Backward elimination
This algorithm is part of stepwise regression: start with all candidate variables,
then delete one element at a time, keeping the deletion that improves the model the
most. Here w is the weight vector, with one weight per element. To decide w, we use
the Moore-Penrose pseudoinverse (assuming linearly independent columns):
w = X⁺y = (XᵀX)⁻¹Xᵀy
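The project's code is in Matlab; as an illustrative sketch, the pseudoinverse fit and one round of backward elimination can be written in NumPy like this (the synthetic data, where features 1 and 3 carry zero weight, is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples, 5 candidate features
# true weights: features 1 and 3 are useless (weight 0)
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 3.0]) + 0.01 * rng.normal(size=200)

def fit_error(X, y):
    """Least-squares fit via the Moore-Penrose pseudoinverse, w = X+ y."""
    w = np.linalg.pinv(X) @ y
    return np.mean((X @ w - y) ** 2)

# One round of backward elimination: drop the feature whose removal
# increases the training error the least.
errors = [fit_error(np.delete(X, j, axis=1), y) for j in range(X.shape[1])]
worst = int(np.argmin(errors))           # least useful feature: 1 or 3
print("drop feature", worst)
```

Repeating this loop until the error degrades noticeably yields the reduced feature set.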

## PCA dimension reduction

A statistical procedure that uses an orthogonal transformation T = XW to convert a
set of observations of possibly correlated variables into a set of values of
linearly uncorrelated variables (the principal components).
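A minimal NumPy sketch of this transformation (via SVD of the centred data, rather than any particular Matlab routine) shows that the resulting components are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + X[:, 1]              # a correlated (redundant) column

Xc = X - X.mean(axis=0)                  # centre the data first
# Orthogonal transformation T = X W, with W the right singular vectors of Xc
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt.T
T = Xc @ W                               # principal-component scores

# The components are uncorrelated: the covariance of T is diagonal
cov = T.T @ T / (len(T) - 1)
off_diag = cov - np.diag(np.diag(cov))
print(np.max(np.abs(off_diag)))          # ~ 0
```

Keeping only the columns of T with the largest singular values gives the reduced representation.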

IMPLEMENTATION METHODS
K-nearest neighbor
In the classification phase, k is a user-defined constant, and an unlabeled vector
(a query or test point) is classified by assigning the label that is most frequent
among the k training samples nearest to that query point.
To compute the distance metric, Matlab's pdist2 function calculates the Euclidean
distances. We then sort the result and, based on it, assign each query vector the
most common label among its nearest training vectors.
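As a sketch of the same procedure in NumPy (computing the pairwise Euclidean distances by broadcasting, in place of Matlab's pdist2; the toy data is my own):

```python
import numpy as np
from collections import Counter

def knn_predict(Xtrain, ytrain, Xtest, k):
    # pairwise Euclidean distances between every test and training point
    d = np.sqrt(((Xtest[:, None, :] - Xtrain[None, :, :]) ** 2).sum(-1))
    idx = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbours
    # majority vote among the k nearest labels
    return np.array([Counter(ytrain[row]).most_common(1)[0][0] for row in idx])

Xtrain = np.array([[0.0, 0], [0, 1], [5, 5], [6, 5]])
ytrain = np.array([0, 0, 1, 1])
pred = knn_predict(Xtrain, ytrain, np.array([[0.2, 0.4], [5.5, 5.0]]), k=3)
print(pred)   # [0 1]
```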

K-means cluster
Using the K-means cluster method in this project is not ideal, but it can still be
used. During training, we separate the labels from the dataset and use the K-means
algorithm package in Matlab to obtain the clusters. We then assign each testing
vector to the cluster whose centroid is nearest,
c(x) = argmin_k ‖x − μ_k‖²,
and label it with that cluster's label.
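The assignment step above (nearest centroid by squared Euclidean distance) can be sketched in NumPy as follows; the centroids here are a made-up example rather than output of the project's Matlab clustering:

```python
import numpy as np

def assign_to_clusters(X, centroids):
    """Assign each vector to the nearest centroid: c(x) = argmin_k ||x - mu_k||^2."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return np.argmin(d, axis=1)

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # hypothetical cluster centres
labels = assign_to_clusters(np.array([[1.0, 1.0], [9.0, 11.0]]), centroids)
print(labels)   # [0 1]
```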

Classification Tree
- Visually represents decision-making based on all of the input features.
- Divides each feature into many different sections.
- Easy to handle Expedia's input feature data (features differ by orders of magnitude).

## Classification Tree with k-fold cross validation

- Generate a tree for each of the k folds and identify the most accurate decision tree.
- Find the result for each test input across all available trees.
- Take the mode of the decisions; if no mode exists, use the most accurate decision tree's result.
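The mode-voting step described above can be sketched as follows (the tie-breaking details and the toy predictions are my own assumptions; any per-fold classifier can supply the inputs):

```python
import numpy as np
from collections import Counter

def mode_decision(fold_predictions, accuracies):
    """Combine per-fold tree predictions: take the mode across folds;
    where no unique mode exists, fall back to the most accurate fold's tree."""
    fold_predictions = np.asarray(fold_predictions)   # shape (k, n_test)
    best = int(np.argmax(accuracies))                 # most accurate tree
    out = []
    for col in fold_predictions.T:                    # one test point at a time
        counts = Counter(col)
        label, count = counts.most_common(1)[0]
        tied = [l for l, c in counts.items() if c == count]
        out.append(label if len(tied) == 1 else col[best])
    return np.array(out)

preds = [[0, 1, 2], [0, 1, 0], [1, 1, 2]]   # 3 folds, 3 test points
acc   = [0.50, 0.70, 0.60]                  # per-fold validation accuracy
print(mode_decision(preds, acc))            # [0 1 2]
```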

Accuracy by dataset size (data#) and number of folds (#fold):

| #fold | 20,000 (18,000 train / 2,000 test) | 100,000 (90,000 train / 10,000 test) | 200,000 (180,000 train / 20,000 test) |
|---|---|---|---|
|  | 10.55% | 16.77% | 21.04% |
|  | 10.85% | 18.4% | 21.82% |
|  | 10.75% | 18.54% | 22.31% |
| 18 | 10.95% | 19.15% | 22.56% |

RESULTS

Table 1: Without feature selection

| Method | Dataset | Accuracy |
|---|---|---|
| KNN, K = 100 | 100 classes; 20,000 | 4% |
| KNN with confusion matrix, K = 1 | 100 classes; 20,000 | 24.49% |
| K-means cluster | 100 classes; 200,000 | 16.59% |
| Classification Tree | 100 classes; 20,000 | 10.55% |
| Classification Tree | 100 classes; 200,000 | 21.04% |
| Classification Tree with k-fold cross validation (k = 18) | 100 classes; 200,000 | 22.38% |
Table 2: With feature selection (backward elimination and PCA)

| Method | Dataset | Accuracy (backward elimination) | Accuracy (PCA selection) |
|---|---|---|---|
| KNN, K = 100 | 100 classes; 20,000 | 9.25% | 13.5% |
| KNN with confusion matrix, K = 1 | 100 classes; 20,000 | 24.69% | 40.95% |
| Softmax regression | 10 classes; 20,000 |  | 14.1% |
| K-means cluster | 100 classes; 200,000 | 21.22% | 31.05% |
| Classification Tree | 100 classes; 200,000 | 20.845% | 21.98% |
| Classification Tree with k-fold cross validation (k = 18) | 100 classes; 200,000 | 22.25% | 23.97% |