
B.E.

PROJECT
ON

Recommendation System for Web Scale Graphs


Submitted by
Nikhil Menon (266/CO/09)
Sapan Garg (295/CO/09)
Sarthak Kukreti (296/CO/09)
In partial fulfillment of B.E. (Computer Engineering) degree
of University of Delhi

Under the Guidance of


Dr. Satish Chand

DEPARTMENT OF COMPUTER ENGINEERING

NETAJI SUBHAS INSTITUTE OF TECHNOLOGY

UNIVERSITY OF DELHI, DELHI

ACKNOWLEDGEMENTS

We take this opportunity to express our heartfelt thanks and gratitude to our respected guide, Prof. (Dr.) Satish Chand, Department of Computer Engineering, NSIT, who kindly consented to be our guide for this project.
We thank him for the precious time he devoted to us, for his expert guidance from the very commencement, for his kind attitude and for the resources he arranged for us. It is only because of him that we have been able to successfully complete this project.
We also owe our thanks to all the faculty members for their constant support and
encouragement.

DECLARATION
This is to certify that the project entitled "Recommendation System for Web Scale Graphs" by Nikhil Menon, Sapan Garg and Sarthak Kukreti is a record of bona fide work carried out by us in the Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Engineering, University of Delhi, in the academic year 2012-2013.

The results presented in this thesis have not been submitted to any other university in any
form for the award of any other degree.

Nikhil Menon
Roll No. 266/CO/09

Sapan Garg
Roll No. 295/CO/09

Sarthak Kukreti
Roll No. 296/CO/09

Department of Computer Engineering


Netaji Subhas Institute of Technology (NSIT)
Azad Hind Fauj Marg
Sector-3, Dwarka, New Delhi
PIN - 110078

CERTIFICATE
This is to certify that the project entitled "Recommendation System for Web Scale Graphs" by Nikhil Menon, Sapan Garg and Sarthak Kukreti is a record of bona fide work carried out by them in the Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, under our supervision and guidance, in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Engineering, University of Delhi, in the academic year 2012-2013.

The results presented in this thesis have not been submitted to any other university in any
form for the award of any other degree.

Dr. Satish Chand


Professor
Department of Computer Engineering
Netaji Subhas Institute of Technology (NSIT)
Azad Hind Fauj Marg
Sector-3, Dwarka, New Delhi
PIN - 110078

ABSTRACT
E-commerce websites like Amazon and Netflix have huge amounts of data pertaining to user behavior and item similarity, and they have consistently used this data to improve the user experience. Among the various improvements these websites have undergone, the recommendation system has been the game changer. It uses past user behavior to recommend items for the future, based on the user's prior purchases and the similarity between items (e.g. movies). Since their first introduction, considerable research has gone into improving these systems by making recommendations more precise and faster. Another area of improvement is the scalability of the algorithms.

In this work we use (1) neighbourhood models, (2) k-means clustering and (3) matrix factorization for rating prediction. In the first, similarity amongst users or items is measured in terms of correlation, using measures like Pearson's r, the Jaccard coefficient and cosine distance. In the second, a predefined number of clusters is fixed at the outset and users are assigned to clusters; a rating is then predicted from the ratings of the users in that cluster. In the third, we decompose the rating matrix into two factor matrices: one holding each user's latent factors and the other each item's latent factors. The number of latent factors is set manually and their values are learnt using gradient descent. We propose two new methods: a Time Based Neighbourhood Model and a Supervised Random Walk approach to Matrix Factorization. The error measure used is RMSE, and we obtained RMSE values in the range of 0.9541 to 1.3335 for the above-mentioned methods.


LIST OF TABLES

Table   Caption                                       Page
1.1     An example of a recommendation problem
3.1     Matrix Factorization Example                   29
3.2     Latent Factors for Example                     30
3.3     Predicted Ratings for Example                  30
4.1     Results for Neighbourhood Models               45
4.2     Results for Clustering Approach                46
4.3     Results for Latent Factor Models               46

NOTATION

Symbol                     Definition
a, p, q                    Column vectors
A, P, Q                    Matrices
P_u,k                      (u,k)-th element of matrix P
N                          Number of users
M                          Number of items
u, v ∈ {1, 2, …, N}        User indices
i, j ∈ {1, 2, …, M}        Item indices
r_u,i                      Rating given by user u to item i
r*_u,i                     Predicted rating given by user u to item i
e_u,i = r_u,i − r*_u,i     Rating error
I_u,i                      Indicator variable
λ                          Regularization parameter
η                          Learning rate for gradient methods

INDEX OF EQUATIONS

Equation   Caption                                                              Page
2.1        Predicted rating for neighbourhood models
2.2        Baseline matrix factorization equation                                11
2.3        Root mean square error                                                12
3.1        Cosine similarity                                                     14
3.2        User-user correlation                                                 15
3.3        Pearson's r: for a population                                         16
3.4        Pearson's r: for a sample                                             17
3.5        Jaccard's index                                                       18
3.6        Time-based neighbourhood model                                        22
3.7        Euclidean distance                                                    27
3.8        Gradient descent update rule                                          32
3.9        Predicted rating                                                      32
3.10       Baseline rating error                                                 32
3.11       Baseline learning error                                               33
3.12       Rating error with regularization                                      33
3.13       Learning error with regularization                                    33
3.14       Gradient of rating error w.r.t. latent factors                        33
3.15       Update rules for P and Q                                              33
3.16       Predicted rating with global biases                                   36
3.17       Update rule for global biases                                         36
3.18       SVD++ model                                                           38
3.19       Update rules for SVD++                                                38
3.20       Transition probabilities for users and items on graph                 42
3.21       Transition probability for items using random walks with restarts     42
3.22       Personalized PageRank eigenvector equation                            42
3.23       Gradient of PageRank w.r.t. latent factors                            42
3.24       Gradient of transition probabilities                                  43
3.25       Learning error                                                        43

TABLE OF CONTENTS
ACKNOWLEDGEMENTS ................................................................................................. i
DECLARATION ................................................................................................................ ii
CERTIFICATE .................................................................................................................. iii
ABSTRACT....................................................................................................... iv
LIST OF TABLES ............................................................................................. v
NOTATION ....................................................................................................................... vi
INDEX OF EQUATIONS ................................................................................................ vii
TABLE OF CONTENTS ................................................................................................... ix
CHAPTER ONE: INTRODUCTION ..................................................................................1
1.1 Background ................................................................................................................2
1.2 Problem Definition ....................................................................................................3
CHAPTER TWO: LITERATURE REVIEW ......................................................................6
2.1 Content Based Filtering .............................................................................................7
2.2 Collaborative Filtering ...............................................................................................8
2.2.1 Neighbourhood based methods .........................................................................8
2.2.2 Model based methods ......................................................................................10
2.3 Evaluation Measures ................................................................................................11
CHAPTER THREE: PROPOSED WORK .......................................................................13
3.1 Dataset .....................................................................................................................13
3.2 Correlation Measures ...............................................................................................13
3.2.1 Cosine Similarity .............................................................................................14
3.2.2 Pearsons R ......................................................................................................16
3.2.3 Jaccard Index ...................................................................................................18
3.3 Time Based Approach to Neighbourhood Models ..................................................22
3.4 A clustering approach to collaborative filtering ......................................................25
3.5 Latent Factor Models ...............................................................................................29
3.5.1 Non Negative Matrix Factorization .................................................................31
3.5.2 Singular Value Decomposition........................................................................35
3.5.3 SVD++.............................................................................................................38
3.6 A Supervised Random Walk Approach to Matrix Factorization .............................41
CHAPTER FOUR: RESULTS ..........................................................................................45
CHAPTER FIVE: CONCLUSION AND FUTURE WORK ............................................47
REFERENCES ..................................................................................................................49


Chapter One: INTRODUCTION


A person buying from the internet faces a problem of plenty nowadays. E-commerce websites today offer a wide choice of products, online newspapers publish thousands of articles every day, and similarly large numbers of videos are published or uploaded by users every second. When a user is deciding what to buy, read or watch, it becomes cumbersome to decide what the best choice is, since going through each and every product would take a lot of time. Search engines don't help much, because queries like "find something I would like" or "items similar to" are too ambiguous and vague. That's where recommender systems come in. They help users deal with the information overload, and help retailers offer appropriate products to each user, resulting in increased sales and customer satisfaction. Increased customer satisfaction, in turn, drives increased customer loyalty as well.

The key feature of recommender systems is personalization. As opposed to search engines, recommenders take into account the personality and past preferences of each user. They also take into account the similarity among users, and thus use others' preferences when recommending products. A typical recommender won't present the same set of items to two different users.

Recommender systems are programs operating on the large amount of data available from websites where the links between users and items are stored. These systems are helpful for users who are choosing between a large number of items and aren't willing to browse information about all available items.

1.1 Background

Recommender systems attempt to screen user preferences over items and build a relational model between users and items. A recommender system recommends items that fit
the user's tastes, in order to help the user purchase or view relevant items from an overwhelming set of choices. These systems have great importance in applications such as e-commerce, subscription based services, information filtering, news rooms etc. Recommender systems providing personalized suggestions greatly increase the likelihood of a customer making a purchase compared to generalized recommendations. An example is Netflix: within about a month of the Netflix recommendation competition, there was an almost 17% increase in the accuracy of its recommendations, aiding revenue growth through personalization.

Personalized recommendations are especially important in markets where the number of choices is enormous, the taste of the customer is of prime importance and, last but not least, the price of the items is modest. Typical areas of such services are mostly related to the sale of books, music, fashion and gaming, or the recommendation of news articles, humor etc.

With the exponential growth of web based businesses, an increasing number of web based merchant or rental services use recommender systems. Some of the major participants of e-commerce on the web, like Amazon and Netflix, have successfully applied recommender systems to deliver automatically generated personalized recommendations to their customers.
There are two basic strategies that can be applied when generating recommendations:

Content-based approaches: These characterize users and items by identifying their features, such as demographic data for user characterization and product descriptions for item characterization. The features are used by algorithms to connect user interests and item descriptions when generating recommendations. Since it is usually time-consuming to collect the necessary information about items, and often difficult to motivate users to reveal the personal information required to build the basis for characterization, these methods are seldom used on their own.

Collaborative Filtering (CF): This makes use of only past user activities (for example, transaction history or user satisfaction expressed in ratings) and is usually more feasible. CF approaches can be applied to recommender systems regardless of the type of data available. A CF algorithm identifies relationships between users and items, and uses this information to predict user preferences.

1.2 Problem Definition

In a typical recommendation problem, let U be a set of n users and V be a set of m items.

Also, let u be the identifier for users and i the identifier for items. Each user u has a list of items V_u about which the user has expressed an opinion, explicitly (in the form of ratings) or implicitly (through mining purchase records, web logs etc.). The opinions of the users are stored in an n × m matrix R, known as the ratings matrix. Each cell R_u,i of R represents the rating that the u-th user has given to the i-th item.

Each rating is on a numerical scale and can also be 0, representing that a user has not rated an item yet. The task of the recommendation algorithm is therefore to model the preferences of a particular user using data from the ratings matrix, including for items the user has not rated yet. In practice, we cannot measure the true error because the distribution of (U, V, R) is unknown, but we can estimate the error on a validation set, for example by randomly partitioning the available sample into a smaller training set and a validation set. The performance of the model is measured as the deviation of predicted ratings from actual ratings.
3

Formally, let I ∈ {0,1}^(n×m) be an indicator variable, i.e. I_u,i = 1 iff user u has rated item i. Let U, V be the sets of users and items respectively. Also, let R ∈ ℝ^(n×m) be the ratings matrix, i.e. R_u,i = x. Let R_train and R_test be the training and test datasets, where R_u,i : {(u, i, x) | u ∈ U, i ∈ V, I_u,i = 1, R_u,i = x}. The goal of the recommendation problem is to create a model, utilizing information from the training dataset, explicit as well as implicit, which minimizes h(R*, R), where h measures the deviation of predicted ratings from actual values on a testing subset of the actual dataset. We use different error functions [Chapter 2] to see the relevance of different error functions in different situations.
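As a concrete illustration of this setup, the sketch below builds a small ratings matrix, collects the observed entries (the cells where I_u,i = 1) and randomly partitions them into training and validation sets. The matrix values and the 80/20 split are illustrative choices of ours, not the project's dataset.

```python
import random

# Toy ratings matrix R (n users x m items); 0 marks "not rated".
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
]

# Collect the observed entries (u, i, rating), i.e. cells with I[u][i] = 1.
observed = [(u, i, R[u][i])
            for u in range(len(R))
            for i in range(len(R[0]))
            if R[u][i] != 0]

# Randomly partition the observed ratings into training and validation sets.
random.seed(0)
random.shuffle(observed)
split = int(0.8 * len(observed))
train, validation = observed[:split], observed[split:]

print(len(observed), len(train), len(validation))  # prints: 11 8 3
```

A model is then fitted on `train` and its error h(R*, R) estimated on `validation` only.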

User/Show    Suits    DN    GOT    PB
A
B                     3            4
C                     3            5
D

Table 1.1 An example of a recommendation system problem

For example, let us consider the following case, where U = {A, B, C, D} and V = {Suits, Death Note, A Game of Thrones, Prison Break}, popular TV shows. All ratings are in the range 1-5. The missing values represent items that have not been rated yet. The recommendation problem can be thought of as a matrix completion problem, i.e. given the choice of Suits and A Game of Thrones, which TV show would B prefer to watch? It is interesting to see that the matrix not only contains explicit information about B's choices through his/her ratings of DN and PB, it also contains implicit information in the form of:

- Similarity between TV series: people who like DN also tend to like PB.
- Similarity between users: B and C show a correlated pattern in rating TV shows.

The actual dataset may or may not have such straightforward implicit/explicit information. It is therefore the job of the recommendation algorithm to utilize this information to make choices for any user.
The layout of the remaining chapters is as follows. Chapter 2 gives a literature survey and an overview of the approaches used by us. Chapter 3 gives a complete overview of the theory: first we describe the dataset used, then we describe in detail the approaches used by us (Neighbourhood Models, K-Means Clustering and Matrix Factorization), and finally we describe our proposed approach, a Supervised Random Walk approach to Matrix Factorization. In Chapter 4, we show the results obtained using these algorithms. In Chapter 5, we give the conclusion and future work.

***

Chapter Two: LITERATURE REVIEW


The earliest implementation of collaborative filtering dates back to the early 1990s. Goldberg et al. (1992) [1] presented the Tapestry system, which allowed a flow of electronic documents to be recommended. Resnick et al. (1994) [2] built a system called GroupLens, which allowed users to rate articles on a scale of 1 to 5 after having read them; GroupLens was the first article introducing the term collaborative filtering (CF). Breese et al. [3] in 1998 divided the CF approach into two groups:

1) Memory-based approaches, which operate on the entire database of ratings to give a recommendation to a user.

2) Model-based approaches, where a model is first built from the database and in turn used for recommendations.

The last decade has seen many CF algorithms proposed that approach the problem using different techniques, including similarity/neighborhood based approaches, Bayesian networks, various matrix factorization techniques and restricted Boltzmann machines. Side by side, the number of conferences and workshops focusing only on recommender systems has increased. Some of the more popular ones are:

Netflix conducted two workshops in 2007 and 2008, and the KDD conference held in 2009 was dedicated to the Netflix Prize and recommender systems.

Three international conferences on recommender systems, known as the RecSys conferences, have been held by ACM in 2007, 2008 and 2009.

Nowadays, many companies offer recommender systems on their platforms. Amazon is a typical example of successful recommendation engine technology. Its main function is based on collaborative item-to-item contextual recommendations, introduced very early on the site (in the late '90s). It is based on logs of purchases and corresponds to the calculation of a similarity matrix of items. Amazon popularized the famous feature "people who bought this item also bought these items".

2.1 Content Based Filtering


Content-based filtering methods are among the oldest and most popular methods for recommending. The objective of these methods is to recommend objects that are similar to objects the user liked in the past. The similarity among objects is determined from the values of their characteristics.

Items' metadata can be, for example, the genre of a movie or the location of a restaurant, according to the type of items to be recommended. The items most similar to the preferences of the user will be recommended. The user profile can be constructed implicitly from ratings or from other previous actions on items (e.g. searching, stopping watching, buying, ...) or explicitly based on questionnaires, for example by rating the general characteristics the user likes/dislikes.

These systems are used in a variety of domains, ranging from recommending web pages, television programs and news articles to items and restaurants. Although the implementations of these systems differ, what they share in common is a means of describing the items that may be recommended, a means of creating a user profile that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend. The profile is created and updated automatically based on feedback from the user.

The following figure shows an example of a content based recommender. The known
relationships are marked as full arrows, the calculated or inserted object similarity by the dotted
arrow and the predicted relationship by the dashed arrow.

Fig 2.1 Content Based Filtering Chain of Inference[5]

2.2 Collaborative Filtering


As mentioned at the beginning of this section, Breese et al. [3] divide CF methods into two groups. Memory-based algorithms operate over the entire user database to make predictions. In contrast, model-based collaborative filtering uses the user database to estimate or learn a model, which is then used for predictions. However, the boundary between these two groups is not clear-cut, as many methods build some model from the data and then use this model to assign importance (weights) to parts of the database.

2.2.1 Neighbourhood based methods


In neighbourhood models, items are recommended on the basis of either other users' preferences or items the user has already rated. When we recommend on the basis of users, the users whose preferences are taken into account are chosen according to the similarity between the given user and other users. This is known as the user-user neighbourhood model. For example, Flipkart has a feature where, whenever you view a product, a list appears at the bottom of the page saying "users who viewed this also viewed these". In this case, a user's similarity with other users is calculated on the basis of the number of items rated by both users and how close their ratings are.

Fig 2.2 Flipkart's recommended products [flipkart.com]

In the case of items, items which are similar to the items that the user has already rated are chosen as recommendations. This is known as the item-item neighbourhood model. Amazon has a feature where, as soon as you buy a product, a number of items are shown at the side of the page saying that users who bought that particular item also bought the given items. In this case, once an item is purchased, its similarity with other items is calculated on the basis of the number of users who rated both products and the proximity of their ratings.
The equation [8][9] being used in this project for neighbourhood models is given below.
r*_u,i = μ + b_u + b_i + Σ_k w_ku (r_k,i − r̄_k) / Σ_k |w_ku|        (2.1)

In the equation above, r*_u,i denotes the rating to be predicted, and μ denotes the overall average of all ratings given across all users and all movies. b_u is the deviation of the average rating given by user u from the overall mean, and b_i is the deviation of the average rating given to movie i from the overall mean. r_k,i is the rating given by user k to item i, r̄_k is the average rating given by user k, and w_ku is the similarity between users k and u; the sum runs over the users k who have rated item i.
Here, w_ku is calculated using correlation. Correlations indicate a predictive relationship; for example, in the stock market, the correlation between two stocks helps determine the trend a particular stock will follow. If the correlation between two stocks is high, then both of them will follow the same pattern, i.e. if one stock's price increases considerably, we can safely assume that the price of the second stock will also increase in almost the same manner.

Correlations can be calculated using various measures. In this project:

1. Cosine similarity
2. Pearson's r
3. Pearson's r with the Jaccard index

are used to calculate the correlation between users and between items.

2.2.2 Model based methods


Model-based methods utilize feature learning to learn the implicit structure of the dataset in terms of latent factors. Matrix factorization techniques [8] and their variants [7] are among the most used collaborative filtering techniques. The basic approach of matrix factorization is to approximately factorize the ratings matrix in terms of independent factors for users and items.

Formally, given R ∈ ℝ^(n×m), the goal of matrix factorization is to learn P ∈ ℝ^(k×n) and Q ∈ ℝ^(k×m), sets of k latent factors for each user and item respectively, such that:

R ≈ P^T Q        (2.2)

The features learnt in matrix factorization are implicit in the sense that they do not represent any observable attribute; they are representative of the underlying structure of the ratings matrix. These methods use optimization techniques like stochastic gradient descent to learn the values of the model parameters P and Q, which can then be used to predict the rating user u will give item i.
In this work we have used:

- Non Negative Matrix Factorization
- Singular Value Decomposition (SVD)
- SVD++

We have also proposed a new supervised random walk based approach which is described in
Chapter 3.
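As a concrete sketch of this procedure, the following toy implementation learns P and Q by stochastic gradient descent with L2 regularization. The function name, hyperparameters and toy ratings are our own illustrative assumptions, not the project's code.

```python
import random

def factorize(ratings, n, m, k=2, steps=2000, eta=0.02, lam=0.02):
    """Learn R ~ P^T Q by SGD. `ratings` is a list of (u, i, r) triples
    with 0-based indices; P is k x n, Q is k x m."""
    random.seed(1)
    P = [[random.uniform(0.01, 0.1) for _ in range(n)] for _ in range(k)]
    Q = [[random.uniform(0.01, 0.1) for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        for (u, i, r) in ratings:
            # Error of the current prediction p_u . q_i for this rating.
            e = r - sum(P[f][u] * Q[f][i] for f in range(k))
            for f in range(k):
                p, q = P[f][u], Q[f][i]
                # Regularized gradient step on both factor matrices.
                P[f][u] += eta * (e * q - lam * p)
                Q[f][i] += eta * (e * p - lam * q)
    return P, Q

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2)]
P, Q = factorize(ratings, n=2, m=2)
```

After training, the predicted rating for user u and item i is simply the inner product of the corresponding columns of P and Q.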

2.3 Evaluation Measures


This section covers the evaluation measures used to quantify how accurate the model is, in terms of the error in the predicted values on the test dataset.

Root Mean Square Error

The Root Mean Square Error (RMSE), also called the root mean square deviation (RMSD), is a frequently used measure of the difference between the values predicted by a model and the values actually observed. These individual differences are called residuals, and the RMSE aggregates them into a single measure of predictive power. RMSE is defined as the square root of the mean squared error:

RMSE(R*) = √( (1/|R_test|) Σ_{(u,i) ∈ R_test} (r_u,i − r*_u,i)² )        (2.3)
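A direct transcription of this definition into code; the function name and sample numbers are illustrative.

```python
import math

def rmse(predicted, actual):
    """Root mean square error over parallel lists of ratings (Eq. 2.3)."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

# Illustrative numbers, not results from the project:
error = rmse([3.5, 4.0, 2.0], [4, 4, 1])
```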

In the next chapter we have discussed the above mentioned algorithms and also our proposed
algorithms.

***


Chapter Three: PROPOSED WORK


3.1 Dataset
We have used the Movielens dataset for the prediction of movie ratings. This dataset
contains 1,00,000 ratings on the scale of 1-5 from 943 users on 1682 movies. Moreover, each
user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data
is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time
stamps are unix seconds since 1/1/1970 UTC. The dataset also contains 80%/20% splits of the
whole data into training and test data sets.
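A sketch of how such a file can be parsed (in the MovieLens 100K distribution the ratings file is named u.data); the function name is our own.

```python
def load_ratings(path):
    """Parse a MovieLens-style file of tab-separated
    user id | item id | rating | timestamp lines."""
    ratings = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            user, item, rating, timestamp = line.split("\t")
            ratings.append((int(user), int(item), int(rating), int(timestamp)))
    return ratings
```

Applied to the training split and test split separately, this yields the (u, i, r) triples consumed by the algorithms below.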

3.2 Correlation Measures


While calculating correlation, two methods are possible:

a) User-User Neighbourhood Model: the correlation between two users is calculated from the number of common items that both users have rated and the respective ratings given by them.

b) Item-Item Neighbourhood Model: the correlation between two items is calculated from the number of common users that have rated both items and the respective ratings received by them.

Three measures are used to compute the correlation:

1. Cosine similarity
2. Pearson's r
3. Jaccard index

3.2.1 Cosine Similarity


Cosine similarity measures the similarity between two vectors in terms of the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitudes. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].

The cosine of the angle between two vectors can be derived using the Euclidean dot product formula. Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using a dot product and magnitudes as:

cos(θ) = (A · B) / (‖A‖ ‖B‖)        (3.1)

The formula used in the project is given below:

w_ij = Σ_k X_ik Y_jk / ( √(Σ_k X_ik²) √(Σ_k Y_jk²) )        (3.2)

Here, i and j denote either users or items i and j respectively. X_ik denotes the rating given by user i to item k (or received by item i from user k), and Y_jk denotes the rating given by user j to item k (or received by item j from user k).
The code written for user-user correlation using cosine distance is given below; lis[i] holds the list of movie ids rated by user i, ratings[i] the corresponding ratings, and users are numbered 1 to 943.

import math

cor = []                     # matrix to store the correlation between users
cor.append([])
i = 0
while i <= 943:
    cor.append([])
    j = 0
    cor[i].append(None)      # index 0 is unused (users are numbered from 1)
    while j < i:
        cor[i].append(cor[j][i])   # correlation is symmetric, reuse cor[j][i]
        j += 1
    cor[i].append(None)      # a user's correlation with himself
    j = i + 1
    while j <= 943:
        m = 1
        sum = 0.0            # sum of products of ratings on common movies
        x = 0.0              # sum of squares of user i's ratings on common movies
        y = 0.0              # sum of squares of user j's ratings on common movies
        while m < len(lis[i]):
            n = lis[j].index(lis[i][m]) if lis[i][m] in lis[j] else None
            if n is not None:
                sum += ratings[i][m] * ratings[j][n]
                x += ratings[i][m] * ratings[i][m]
                y += ratings[j][n] * ratings[j][n]
            m += 1
        if x and y:
            cor[i].append(sum / math.sqrt(x * y))
        else:
            cor[i].append(None)
        j += 1
    i += 1

3.2.2 Pearson's r

The Pearson product-moment correlation coefficient (sometimes referred to as the PPMCC or PCC, or Pearson's r) is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier "product-moment" in the name.

For a population

Pearson's correlation coefficient, when applied to a population, is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for ρ is:

ρ_X,Y = cov(X, Y) / (σ_X σ_Y)        (3.3)

For a sample

Pearson's correlation coefficient, when applied to a sample, is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariances and variances based on a sample into the formula above. In this project, the formula used for implementing the Pearson's r correlation is:

r_ij = Σ_k (X_ik − X̄_i)(Y_jk − Ȳ_j) / ( √(Σ_k (X_ik − X̄_i)²) √(Σ_k (Y_jk − Ȳ_j)²) )        (3.4)

Here, i and j denote either users or items i and j respectively. X_ik denotes the rating given by user i to item k (or received by item i from user k), and Y_jk the rating given by user j to item k (or received by item j from user k). X̄_i and Ȳ_j denote the means of all the ratings given/received by i and j respectively.

The python code written for the same is given below for item-item correlation is given below.
cor=[]
#matrix to store the correlation between users
cor.append([])
i=0
while i<=943:
cor.append([])
j=0
cor[i].append(None)
while j<i:
cor[i].append(cor[j][i])
j+=1
cor[i].append(None)
j=i+1
while j<=943:
m=1
sum=0.0
#store the sum of all common movie rating between users
x=0.0
#stores the square of all the ratings given by user i
y=0.0
#stores the square of all the ratings given by user j


while m<len(lis[i]):
n=lis[j].index(lis[i][m]) if lis[i][m] in lis[j] else None
if n!=None:
sum+=((ratings[i][m]-ratings[i][0])*(ratings[j][n]-ratings[j][0]))
x+=((ratings[i][m]-ratings[i][0])*(ratings[i][m]-ratings[i][0]))
y+=((ratings[j][n]-ratings[j][0])*(ratings[j][n]-ratings[j][0]))
m+=1
if x and y:
cor[i].append(sum/math.sqrt(x*y))
else:
cor[i].append(None)
j+=1
i+=1

3.2.3 Jaccard Index


The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for
comparing the similarity and diversity of sample sets. The Jaccard coefficient measures
similarity between sample sets, and is defined as the size of the intersection divided by the size
of the union of the sample sets.
In this project the Jaccard index has been used in conjunction with the Pearson's r correlation
measure. The following formula is used to calculate the correlation between two
users/items:

... (3.5)
Here i, j denote user/item i and j respectively, and J is the Jaccard coefficient for the two
users/items. Xik denotes the rating given/received by user i/item i to/from item k/user k, and Yjk
denotes the rating given/received by user j/item j to/from item k/user k. The means X̄i and Ȳj are
taken over all the ratings of user/item i and j respectively.
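In isolation, the Jaccard coefficient is just a set computation over the two users'/items' rated-item lists. A minimal sketch with hypothetical item ids (the loop code below instead counts common ratings incrementally while computing the correlation):

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Items rated by two hypothetical users: 2 in common out of 5 distinct
print(jaccard([1, 2, 3, 4], [3, 4, 5]))  # 0.4
```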

The Python code written for the same (user-user correlation) is given below.
cor=[]
#matrix to store the correlations
cor.append([])
i=0
while i<=943:
cor.append([])
j=0
cor[i].append(None)
while j<i:
cor[i].append(cor[j][i])
j+=1
cor[i].append(None) #used to indicate users correlation with himself
j=i+1
while j<=943:
m=1
sum=0.0
#store the sum of all common movie rating between users
x=0.0
#stores the square of all the ratings given by user i
y=0.0
#stores the square of all the ratings given by user j
inter=0.0
#denotes no of common movies amongst two users
while m<len(lis[i]):
n=lis[j].index(lis[i][m]) if lis[i][m] in lis[j] else None
if n!=None:
sum+=((ratings[i][m]-ratings[i][0])*(ratings[j][n]-ratings[j][0]))
x+=((ratings[i][m]-ratings[i][0])*(ratings[i][m]-ratings[i][0]))
y+=((ratings[j][n]-ratings[j][0])*(ratings[j][n]-ratings[j][0]))
inter+=1
m+=1
jcoef=inter/(len(lis[i])+len(lis[j])-inter-2)
if x and y:
cor[i].append(jcoef*sum/math.sqrt(x*y))
else:
cor[i].append(None)
j+=1
i+=1

While calculating the rating for a particular user, two approaches have been taken:
i) K-Nearest Neighbour Approach
ii) All Neighbour Approach


In the first case 10 nearest neighbours have been taken and their correlation with the given user
is used to calculate the final rating. Snippets of the code for the approach are given below.

rmse=0.0
sqsum=0.0
nooflines=0
fil=open("F:\\btp\\dataset\\movielens\\ml-100k\\ua.test","r")
for line in fil.readlines():
st=line.split("\t")
maxcor=[0,0,0,0,0,0,0,0,0,0] #list to store the 10 highest correlations
corra=[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
i=int(st[0])
j=int(st[1])
l=1
while l<=943:
if l!=i:
if j in lis[l]:
if cor[i][l]>maxcor[0]:
maxcor[0]=cor[i][l]
n=lis[l].index(j)
corra[0]=ratings[l][n]-ratings[l][0]
k=0
while k<9 and maxcor[k]>maxcor[k+1]:
t=maxcor[k]
maxcor[k]=maxcor[k+1]
maxcor[k+1]=t
t=corra[k]
corra[k]=corra[k+1]
corra[k+1]=t
k+=1
l+=1
sum=0.0
l=0
while l<10:
sum+=maxcor[l]
l+=1
l=0
while l<10 and sum:
#calculating weight of each neighbor
maxcor[l]/=sum
l+=1
l=0
sum=0.0
denom=0.0
while l<10:
sum+=maxcor[l]*corra[l]


denom+=maxcor[l]
l+=1
if denom:
prer=ratings[i][0]+movrats[j][0]-u+sum/denom

In the all neighbour approach, all the neighbours whose correlation with the given user/item is
present are taken into account. The code snippets for the same are given below.
rmse=0.0
sqsum=0.0
nooflines=0
fil=open("F:\\btp\\dataset\\movielens\\ml-100k\\ua.test","r")
for line in fil.readlines():
maxcor=[0,0,0,0,0,0,0,0,0,0]
corra=[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
st=line.split("\t")
i=int(st[0])
j=int(st[1])
l=1
while l<=943:
if l!=j:
if i in mov[l]:
maxcor.append(cor[j][l])
n=mov[l].index(i)
corra.append(movrats[l][n]-movrats[l][0])
l+=1
sum=0.0
l=0
while l<len(maxcor):
if maxcor[l]!=None:
sum+=maxcor[l]
l+=1
l=0
while l<len(maxcor):
if maxcor[l]!=None:
maxcor[l]/=sum
#calculating weight of each item
l+=1
l=0
sum=0.0
denom=0.0
while l<len(maxcor):
if maxcor[l]!=None:
sum+=maxcor[l]*corra[l]
denom+=maxcor[l]
l+=1
prer=ratings[i][0]+movrats[j][0]-u+sum/denom


The RMSEs obtained from this method ranged from 1.2268 to 1.3365.

3.3 Time Based Approach to Neighbourhood Models


The difference between the neighbourhood models discussed above and this model is that the
time at which a rating is given by the user to a particular movie is also being taken into account.
The above approach is first implemented by sorting both the training and test data according to
the timestamp given. The range of the timestamps is found and the length of each time interval is
taken as one hundredth of the range. For each time interval the ratings are added to the matrix
(similar to the addition of nodes to a graph), and for each time interval entries are taken from the
test data corresponding to that interval, with the rating calculated from the existing matrix;
i.e. unlike the previous neighbourhood model, the correlation calculated between the various
users/items depends only on the ratings given by them up to that time interval.
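The interval construction described above can be sketched as follows: split the observed timestamp range into 100 equal buckets and assign each rating a bucket index. The helper name and timestamps here are illustrative, not the Movielens values:

```python
def bucket_indices(timestamps, nbuckets=100):
    """Map each timestamp to an interval index in [0, nbuckets-1],
    where each interval is one nbuckets-th of the observed range."""
    lo, hi = min(timestamps), max(timestamps)
    width = (hi - lo) / nbuckets or 1.0  # guard against a zero-length range
    return [min(int((t - lo) / width), nbuckets - 1) for t in timestamps]

ts = [100, 150, 500, 999, 1000]  # illustrative timestamps
print(bucket_indices(ts))  # earliest falls in bucket 0, latest in bucket 99
```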
The formula used for calculating the rating is given below:

(3.6)
In the equation given above, r̂ui denotes the rating to be calculated, μ^ta denotes the overall
average of all the ratings given across all users and all movies, bu^ta is the deviation of the average
of the ratings given by user u from the overall mean, bi^ta is the deviation of the rating given to the
movie i from the overall mean, rki is the rating given by user k to item i, r̄k is the average rating
given by user k, and wku^ta is the similarity between users k and u. The superscript ta is used to
denote that the value is calculated at a particular time ta.


Here cosine similarity measure has been used to calculate the correlation between different users
at each time interval. The code for Time based neighbourhood model using 10 nearest
neighbours is given below.
1. Sorting of the training data
fil=open("F:\\btp\\dataset\\movielens\\ml-100k\\ua.base","r")
for line in fil.readlines():
st=line.split("\t")
lis[int(st[0])].append(int(st[1]))
ratings[int(st[0])].append(int(st[2]))
tim[int(st[0])].append(int(st[3]))
mov[int(st[1])].append(int(st[0]))
movrats[int(st[1])].append(int(st[2]))
i=1
while i<=943:
j=1
while j<len(lis[i]):
k=1
while k<len(lis[i])-j:
if tim[i][k]>tim[i][k+1]:
t=lis[i][k]
lis[i][k]=lis[i][k+1]
lis[i][k+1]=t
t=ratings[i][k]
ratings[i][k]=ratings[i][k+1]
ratings[i][k+1]=t
t=tim[i][k]
tim[i][k]=tim[i][k+1]
tim[i][k+1]=t
k+=1
j+=1
i+=1

2. Code for finding the correlation and the corresponding rating at each time interval
t=1
#denotes time interval
rmse=[]
u=0
d=0
while t<=100 and q<=ma:
#algo for calculating ratings given at each time interval
i=1
u=0
d=0
while i<1682:
movrats[i][0]=0
#the 1st position will store the average rating given to the movie
movrats[i][len(movrats[i])-1]=0
i+=1
i=1
while i<=943:
#algo for finding correlation
j=0
cor[i][j]=None
j+=1
while j<i:
cor[i][j]=cor[j][i]
j+=1
cor[i][j]=None
j+=1
while j<=943:
m=1
sum=0.0
x=0.0
y=0.0
s=0
while m<len(lis[i]) and tim[i][m]<q:
n=lis[j].index(lis[i][m]) if lis[i][m] in lis[j] else None
if n!=None and tim[j][n]<q:
sum+=ratings[i][m]*ratings[j][n]
x+=ratings[i][m]*ratings[i][m]
y+=ratings[j][n]*ratings[j][n]
s+=ratings[i][m]
u+=ratings[i][m]
movrats[lis[i][m]][0]+=ratings[i][m]
movrats[lis[i][m]][len(movrats[lis[i][m]])-1]+=1
m+=1
d+=1
ratings[i][0]=s/m
if x and y:
cor[i][j]=sum/math.sqrt(x*y)
else:
cor[i][j]=None
j+=1
i+=1
sqsum=0.0
i=1
c=0
dif=0.0
while i<=943:
#algo for calculating the nearest neighbors
k=1
while k<len(testlis[i]) and testtim[i][k]>=p and testtim[i][k]<q:
maxcor=[0,0,0,0,0,0,0,0,0,0]
corra=[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
j=1
while j<=943:


if testlis[i][k] in lis[j]:
n=lis[j].index(testlis[i][k])
if tim[j][n]<q:
if cor[i][j]>maxcor[0]:
maxcor[0]=cor[i][j]
corra[0]=ratings[j][n]-ratings[j][0]
w=0
while w<9 and maxcor[w]>maxcor[w+1]:
t=maxcor[w]
maxcor[w]=maxcor[w+1]
maxcor[w+1]=t
t=corra[w]
corra[w]=corra[w+1]
corra[w+1]=t
w+=1
j+=1
sum=0.0
o=0
while o<10:
sum+=maxcor[o]
o+=1
o=0
while o<10 and sum:
maxcor[o]/=sum
#assigning weights to the 10 nearest neighbors
o+=1
o=0
sum=0.0
denom=0.0
while o<10:
sum+=maxcor[o]*corra[o]
denom+=maxcor[o]
o+=1
if denom and d and movrats[testlis[i][k]][len(movrats[testlis[i][k]])-1]:
prer=ratings[i][0]+movrats[testlis[i][k]][0]/movrats[testlis[i][k]][len(movrats[testlis[i][k]])-1]-u/d+sum/denom
dif+=(prer-testrats[i][k])*(prer-testrats[i][k])
The RMSE obtained from this method is 1.3103.

3.4 A clustering approach to collaborative filtering


The problem of clustering denotes the task of arranging a set of vectors (measurements)
into a number of groups known as clusters. It is an important area of application for a variety of
fields including data mining, statistical data analysis and vector quantization. The main approach
is to group together data items that are similar to each other. Several algorithms have been
devised to solve this problem because of its widespread use. Notable among these is the k-means
approach which we have used here.
Formally, the problem of clustering can be defined as: given a dataset of N records, each member
of the dataset having a dimensionality d, we have to partition the data into disjoint subsets such
that some specific criterion is achieved. Each record is assigned to a single cluster and the
optimization measure is the Euclidean distance between a record and the corresponding cluster
center.
The k-means algorithm is based on the optimal placement of a cluster center at the centroid of
the associated cluster. Thus, given any set of k cluster centers C, for each center c ∈ C, let V(c)
denote the region of space that is closest to the center c. In every stage, the algorithm replaces the
centers in C with the centroid of the points in V(c) and then updates V(c) by recomputing the
distance from each point to a center c. These steps are repeated until some convergence condition
is met. For points in general positions (i.e. if no point is equidistant from the two centers), the
algorithm will converge to a point where further stages of the algorithm will not change the
position of any center. This would be an ideal convergence condition. In practice, the further stages
of the algorithm are stopped when the change in distortion is less than a given threshold. This saves
a lot of time, as in the last stages the centers move very little per stage. The results obtained
depend greatly on the initial set of centers chosen. The algorithm is deterministic after the
initial centers are determined. The main advantage of the k-means approach is its simplicity and
flexibility. In spite of other algorithms being available, k-means continues to be an attractive
method because of its convergence properties.
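The loop described above — assign each point to its nearest centre, recompute centroids, and stop once the drop in distortion falls below a threshold — can be sketched compactly. This is a generic k-means on a tiny illustrative 2-D dataset, not the Movielens rating matrix handled in the steps below:

```python
import random

def kmeans(points, k, tol=1e-4, max_steps=100):
    """Plain k-means with a distortion-change stopping rule."""
    centres = list(random.sample(points, k))
    prev = float('inf')
    for _ in range(max_steps):
        clusters = [[] for _ in range(k)]
        distortion = 0.0
        for p in points:
            # squared Euclidean distance to each centre
            d, idx = min((sum((a - b) ** 2 for a, b in zip(p, c)), i)
                         for i, c in enumerate(centres))
            clusters[idx].append(p)
            distortion += d
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centre if a cluster ends up empty
                centres[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
        if prev - distortion < tol:  # centres have (almost) stopped moving
            break
        prev = distortion
    return centres

random.seed(0)
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
print(sorted(kmeans(pts, 2)))  # one centre near (0.05, 0), the other near (10.05, 0)
```

With the two groups this well separated, any random initialization converges to the two group centroids.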


So, we have used the above mentioned k-means model to predict the ratings for the users. The
Movielens dataset can be viewed as a record of 943 users each having a dimensionality of 1682.
Now, we will explain the approach [6] used for the prediction of the ratings.

1. Randomly initialise k cluster centers from the record of users. For this the function
random.sample(population,k) is used which returns a k length list of unique elements chosen
from the population sequence.
The Python code for the same is shown below.
#initialization of cluster centers
length=10
a=random.sample(range(1,944),length)
centre=[0]
i=0
while i<length:
j=1
temp=[0]
while j<=1682:
temp.append(R[a[i]][j])
j=j+1
centre.append(temp)
i=i+1

2. Associate each user to only 1 cluster based on the Euclidean distance from the cluster
center. The formula for Euclidean distance between two points p and q having dimensionality d
is shown below.

(3.7)
The Python code snippet is shown below.

i=1
while i<=943:
j=1
sum=0.0
count=0
while j<=1682:
sum = sum + (R[i][j]-centre[1][j]) * (R[i][j]-centre[1][j])
j=j+1
sum=math.sqrt(sum)
minimum=sum
position=1
k=2
while k<=length:
j=1
sum=0.0
count=0
while j<=1682:
sum = sum + (R[i][j]-centre[k][j]) * (R[i][j]-centre[k][j])
j=j+1
sum=math.sqrt(sum)
if sum <minimum:
minimum=sum
position=k
k=k+1
group[position].append(i)
i=i+1

3.Calculate the new cluster centers of the clusters found in step(2) above. The new cluster
center is the centroid of all the points in that particular cluster.
The Python code snippet is shown below.
i=1
while i<=length:
j=1
while j<=1682:
k=1
sum=0.0
count=0
while k<len(group[i]):
if R[group[i][k]][j]>0:

sum=sum+R[group[i][k]][j]
count=count+1
k=k+1
if count!=0:
centre[i][j]=sum/count
else:
centre[i][j]=0
j=j+1
i=i+1

4. Repeat steps(2) and (3) for a fixed number of steps.


5. Next, the missing ratings for a given user are calculated on the basis of the ratings given
by the users who are from the same cluster.
6. Now the accuracy of the predicted ratings is calculated using RMSE.
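The accuracy measure used throughout this chapter is the root mean squared error over the test pairs; a minimal helper on illustrative values:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equally long rating lists."""
    sq = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sq / len(actual))

print(rmse([3.5, 2.0, 4.0], [4, 2, 5]))  # sqrt((0.25 + 0 + 1) / 3) ≈ 0.6455
```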
The RMSE obtained for the above method varied from 1.0437 to 1.0504

3.5 Latent Factor Models


Formally, given R ∈ R^(n x m), the goal of matrix factorization is to learn P ∈ R^(k x n) and
Q ∈ R^(k x m), a set of k latent factors for each user and item respectively, such that:

R ≈ P^T Q

User/Show   Suits   DN   GOT   PB
A           3       5    1     4
B           -       4    -     3
C           3       4    4     3
D           2       -    -     5

Table 3.1 Example for Matrix Factorization

For example, consider the matrix presented in Table 3.1. On running basic matrix
factorization on it with k=2, the baseline matrix factorization produces the following latent
vectors.

User id   Factor 1   Factor 2          Item id   Factor 1   Factor 2
A         1.253      1.527             Suits     1.319      0.8086
B         1.443      0.875             DN        1.510      2.0165
C         1.998      0.502             GOT       2.293      -1.213
D         -0.213     2.885             PB        1.012      1.816

Table 3.2 Latent Factors for Users and Items

The predicted values of the ratings, when compared with the actual values (actual, predicted), are:

User/Show   Suits       DN          GOT         PB
A           3, 2.887    5, 4.9707   1, 1.021    4, 4.0412
B           2.6115      4, 3.9439   2.2484      3, 3.0499
C           3, 3.0421   4, 4.0309   4, 3.9717   3, 2.9353
D           2, 2.0278   5.4358      -3.9515     5, 4.9702

Table 3.3 Predicted ratings for the recommendation system example
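Each entry of Table 3.3 is simply the inner product of one user's and one item's latent vectors. As a check, pairing two of the factor rows from Table 3.2 (the user/item pairing is inferred here so that the product reproduces the first predicted entry) gives:

```python
def predict(user_factors, item_factors):
    """Predicted rating = inner product of the latent vectors (R ≈ P^T Q)."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

user_a = (1.253, 1.527)   # a user factor pair from Table 3.2
suits = (1.319, 0.8086)   # an item factor pair from Table 3.2
print(round(predict(user_a, suits), 3))  # 2.887, matching Table 3.3
```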


Some inferences can be made from the results:

The factors learnt by matrix factorization are latent factors i.e. they do not represent any
real world explicit characteristic of item/user like age, sex, location( for user) or release
date, number of seasons( for TV shows). Latent factors are representative of the implicit
structure present in the ratings. The algorithm is an example of feature learning where the
algorithm itself transforms the raw ratings matrix into an implicit factor, that can be
exploited in supervised learning.

Some ratings may appear to be nonsensical. For example, D's predicted rating for GOT is -3.95 (a
negative value) whereas D's predicted rating for DN is 5.43 (above the rating ceiling of 5). This
happens because the objective function of matrix factorization is aimed at minimizing the total
error over the whole dataset; it may converge to a local minimum and give such out-of-range
values.

One of the key factors in this algorithm is deciding the number of latent factors (k). The
higher the number of latent factors, the higher the complexity of the model, which may lead
to better results. However, if the number of latent factors is too high, there is a high
chance of overfitting.

We implemented the following latent factor models:

Non Negative Matrix Factorization


SVD
SVD++

3.5.1 Non Negative Matrix Factorization


Non-Negative Matrix Factorization, first described by Lee et al. [4], is one of the most basic
matrix factorization models; it uses stochastic gradient descent to obtain optimal solutions.
Stochastic Gradient Descent
Stochastic Gradient Descent is an optimization method for minimizing a differentiable function.

It uses the loss at a single training example to approximate the total loss at that point of time and
minimizes it by updating the weights in the direction opposite to that of the gradient.
Given Qi(w), the loss at the i-th example, the update rule for the weight vector w is given by:
w := w - η (∂Qi(w)/∂w)

... (3.8)

The pseudo code for stochastic gradient descent is:

Choose an initial vector of parameters w and a learning rate η.
Repeat until an approximate minimum is obtained:
    Randomly shuffle the examples in the training set.
    For i = 1, 2, 3, ..., n, do:
        w := w - η (∂Qi(w)/∂w)
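The pseudo code can be made concrete on a toy problem: fitting a single weight w so that y ≈ w·x under squared loss. The data, learning rate and epoch count here are illustrative choices, not values used elsewhere in this project:

```python
import random

def sgd_fit(samples, eta=0.05, epochs=200, seed=0):
    """Stochastic gradient descent on Qi(w) = (w*x_i - y_i)^2."""
    random.seed(seed)
    w = 0.0
    for _ in range(epochs):
        random.shuffle(samples)         # randomly shuffle the training set
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # dQi/dw at the current example
            w -= eta * grad             # w := w - eta * grad
    return w

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]  # roughly y = 2x
print(sgd_fit(data))  # settles near w = 2
```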

Algorithm
The base algorithm (2.2) tries to approximate the matrix of ratings R by the product of user and
item latent factors P (K X n) and Q (K X m) respectively, where K is the number of latent
factors.

The predicted rating is :

... (3.9)
The error in rating is given by:

... (3.10)
where ru,i is the actual rating given by user u to item i.

The algorithm uses stochastic gradient descent to minimize the learning error E given by:

... (3.11)
Regularization can be used to penalize high values for weights.

... (3.12)

... (3.13)
The gradients of e' with respect to the factor matrices are given by:

... (3.14)
We can use these gradients to update the latent factors:

... (3.15)

The implementation of the algorithm in Python is given below:

import math
from numpy import random, dot

class recsys1:
U=[]
V=[]
m=0
n=0
k=0
steps=30
alpha=0.02
beta=0.02
iteration_data=[]

def __init__(self, n,m, k):


self.n=n
self.m=m
self.k=k
self.U=random.rand(n+1,k)
self.V=random.rand(m+1,k)
def factor(self,R):
self.V=self.V.T;
numrows = len(R)
temp= range(numrows)
epoch = 0
for step in range(self.steps):
random.shuffle(temp)
e=0
for x in temp:
i=R[x][0]
j=R[x][1]
R_ij=R[x][2]
error_ij = R_ij - dot(self.U[i,:],self.V[:,j])
t=self.U[i,:]+self.alpha*(2*error_ij*self.V[:,j] - self.beta*self.U[i,:])
self.V[:,j]=self.V[:,j]+self.alpha*(2*error_ij*self.U[i,:] - self.beta*self.V[:,j])
self.U[i,:]=t
i=R[x][0]
j=R[x][1]
R_ij=R[x][2]
e= e+ pow(R_ij - dot(self.U[i,:],self.V[:,j]), 2)
print epoch, pow(R_ij -dot(self.U[i,:],self.V[:,j]), 2)
epoch+=1
if pow(R_ij - dot(self.U[i,:],self.V[:,j]), 2) < 0.000000001:
self.V=self.V.T
return self.U, self.V
e=e/len(R)
e=math.sqrt(e)
#e = e + (self.beta/2) * dot(self.U[i,:],self.U[i,:].T) + dot(self.V[:,j],self.V[:,j].T)
if e<0.01:
break
self.iteration_data.append( [e, step])
print e,step
self.V=self.V.T

def final_res(self):
return self.U, self.V
def error(self,R):
e=0
for x in R:
i=x[0]
j=x[1]
R_ij=x[2]
R_hat=dot(self.U[i,:],self.V.T[:,j])
e= e+ pow(R_ij- R_hat, 2)
e=e/len(R)
e=math.sqrt(e)
return e
R,n,m=read_ratings("ua.base")
R2,n1,m1=read_ratings("ua.test")
instance= recsys1(n,m,k=5)
instance.factor(R)
Udash, Vdash=instance.final_res()
The RMSE obtained for the above method was 0.9692

3.5.2 Singular Value Decomposition


Sarwar et al. [5] were the first to adapt SVD to a matrix recommendation scenario. SVD
originally dealt with complete matrices and had to be modified to be able to process low rank,
sparse matrices. One of the more recent works in this field has been by Y. Koren et al.[7][8]
The SVD model introduces global biases for users and items:
- Global ratings' mean: μ
- Item's popularity: bi
- User's behavior: bu
Therefore, the predicted rating value becomes:

... (3.16)
The same gradient descent algorithm can be applied as in 3.5.1; however, we have to update the
global biases at each step.

... (3.17)
The implementation in python is given below:
def factor(self,R):
self.V=self.V.T;
numrows = len(R)
temp= range(numrows)
for step in range(self.steps):
random.shuffle(temp)
e=0
for x in temp:
i=R[x][0]
j=R[x][1]
R_ij=R[x][2]
error_ij = R_ij - numpy.dot(self.U[i,:],self.V[:,j]) - self.mu - self.b_u[i] - self.b_i[j]
#Update Rules for U,V
t=self.U[i,:]+self.alpha*(error_ij*self.V[:,j] - self.beta*self.U[i,:])
self.V[:,j]=self.V[:,j]+self.alpha*(error_ij*self.U[i,:] - self.beta*self.V[:,j])
self.U[i,:]=t
#Update Rules for b_u, b_i
self.b_u[i] = self.b_u[i] - self.alpha*(self.beta*self.b_u[i] - error_ij)
self.b_i[j] = self.b_i[j] - self.alpha*(self.beta*self.b_i[j] - error_ij)


e = e + pow(R_ij - numpy.dot(self.U[i,:],self.V[:,j]) - self.mu - self.b_u[i] - self.b_i[j], 2)


if e<0.01:
break
print e,step
self.V=self.V.T
def final_res(self):
return self.U, self.V
def error(self,R):
e=0
for x in R:
i=x[0]
j=x[1]
R_ij=x[2]
R_hat=numpy.dot(self.U[i,:],self.V.T[:,j]) + self.mu + self.b_u[i] + self.b_i[j]
e= e+ pow(R_ij- R_hat, 2)
e=e/len(R)
e=math.sqrt(e)
return e
R,n,m=read_ratings("ua.base")
R2,n1,m1=read_ratings("ua.test")
instance= recsys1(n,m,5,R)
instance.factor(R)
Udash, Vdash=instance.final_res()
fd=open("ua.result","w")
for x in R2:
fd.write(str(x[0])+", "+str(x[1])+", "+ str(x[2])+", "+
str(numpy.dot(Udash[x[0],:],Vdash.T[:,x[1]]))+"\n")
fd.write("RMSE ="+str(instance.error(R2)))
The RMSE obtained for the above method was 0.9632


3.5.3 SVD++
SVD++ uses implicit feedback to increase prediction accuracy. It adds a second set of item
factors yi ∈ R^k. These factors account for the implicit feedback present in the recommendation
data, i.e. the items user u has rated, irrespective of the rating he/she gives. The predicted rating
under this model is given by:

... (3.18)
The update rules of (3.15) and (3.17) change to:

... (3.19)

The implementation in Python is given below:

import math
import random
import numpy


class recsys1:
U=[]
V=[]
b_i= []
b_u= []
Sigma_y=[]
R_u=[]
Z=[]
Y=[]
mu=0
m=0
n=0
k=0
R=[]
I=[]
steps=100

alpha=0.02
beta=0.02
def __init__(self, n,m, k,R):
self.n=n
self.m=m
self.k=k
self.U=numpy.random.rand(n+1,k)
self.V=numpy.random.rand(m+1,k)
self.Y=numpy.random.rand(m+1,k)
self.b_u=numpy.zeros(n+1)
self.b_i=numpy.zeros(m+1)
self.R=R
self.mu = sum(x[2] for x in self.R) / float(len(self.R))
self.I=numpy.zeros(shape=(n+1,m+1))
self.Sigma_y=numpy.zeros(shape=(n+1,k))
self.R_u=numpy.ones(n+1)
self.Z=[]
for x in R:
self.I[x[0]][x[1]]=1
for x in range(n+1):
#self.Sigma_y[x,:]=numpy.zeros(self.k)
self.Z.append(numpy.nonzero(self.I[x,:])[0])
for x1 in numpy.nonzero(self.I[x,:])[0]:
self.Sigma_y[x,:]+=self.Y[x1,:]
self.R_u[x]+=1
#Update Rules for U,V
self.R_u[x]= 1/math.sqrt(self.R_u[x])

def factor(self,R):
self.V=self.V.T
#self.Y=self.Y.T
numrows = len(R)
temp= range(numrows)
for step in range(self.steps):
random.shuffle(temp)
e=0
for x in temp:
i=R[x][0]
j=R[x][1]
R_ij=R[x][2]

error_ij = R_ij - numpy.dot(self.U[i,:] + self.R_u[i]*self.Sigma_y[i],self.V[:,j]) - self.mu - self.b_u[i] - self.b_i[j]
#Update Rules for U,V (using R(u) and sigma(Y))
t=self.U[i,:]+self.alpha*(error_ij*self.V[:,j] - self.beta*self.U[i,:])
self.V[:,j]=self.V[:,j]+self.alpha*(error_ij*(self.U[i,:] + self.R_u[i]*self.Sigma_y[i,:]) - self.beta*self.V[:,j])
self.U[i,:]=t
#Update Rules for b_u, b_i
self.b_u[i] = self.b_u[i] - self.alpha*(self.beta*self.b_u[i] - error_ij)
self.b_i[j] = self.b_i[j] - self.alpha*(self.beta*self.b_i[j] - error_ij)
#Update Rule for Y
self.Y[self.Z[i],:] = self.Y[self.Z[i],:] + self.alpha*(error_ij*self.R_u[i]*self.V[:,j] - self.beta*self.Y[self.Z[i],:])

e = e + pow(R_ij - numpy.dot(self.U[i,:] + self.R_u[i]*self.Sigma_y[i],self.V[:,j]) - self.mu - self.b_u[i] - self.b_i[j], 2)
if e<0.01:
self.V=self.V.T
break
e=e/len(R)
e=math.sqrt(e)
if e<0.01:
break
print e,step
self.V=self.V.T
def final_res(self):
return self.U, self.V
def error(self,R):

e=0
for x in R:
i=x[0]
j=x[1]
R_ij=x[2]
R_hat=numpy.dot(self.U[i,:],self.V.T[:,j]) + self.mu + self.b_u[i] + self.b_i[j]
e= e+ pow(R_ij- R_hat, 2)
e=e/len(R)
e=math.sqrt(e)
return e
R,n,m=read_ratings("ua.base")
R2,n1,m1=read_ratings("ua.test")
instance= recsys1(n,m,2,R)
instance.factor(R)
Udash, Vdash=instance.final_res()
The RMSE obtained for the above method was 0.9541

3.6 A Supervised Random Walk Approach to Matrix Factorization


We propose a new graph based model which interprets the recommendation matrix as a bipartite
graph and uses supervised random walks to train its parameters U, V. It is inspired by Backstrom
and Leskovec [10], whose work focused on link prediction using supervised random walks.
Let the given recommendation matrix be equivalent to a bipartite graph with users u ∈ {1,2,...,N}
and items i ∈ {1,2,...,M}. Let each user u have a latent feature vector Uu, and let item i have a
latent feature vector Vi. Therefore, the predicted strengths of the edges in the bipartite
graph are:

Here, u is a user and i is an item.


The transition probability from user u to item i is given by:


(3.20a)

The transition probability from item i to user u is given by:


(3.20b)

We use a conditional transition probability which restarts the random walk with probability α:
(3.21)

Let the stationary distributions (personalized PageRank scores) of users and items be pu and pi
respectively. The eigenvector equations for the stationary probabilities are:

(3.22)
Thus, the differential of the personalized PageRank can be calculated by differentiating the
eigenvector equations

(3.23)
This, in turn utilizes the partial differential of the transition probabilities of users and items:

42

)
(

)
(

(3.24)
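The stationary distribution in (3.22) can be computed by power iteration with restarts once the transition probabilities are fixed. A self-contained sketch on a toy bipartite graph with hypothetical, already-normalised edge strengths and restart node 0:

```python
def pagerank_with_restart(P, alpha=0.15, start=0, iters=100):
    """Power iteration: p <- (1-alpha) * p P + alpha * e_start,
    where P is a row-stochastic transition matrix."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [alpha if j == start else 0.0 for j in range(n)]
        for i in range(n):
            for j in range(n):
                new[j] += (1 - alpha) * p[i] * P[i][j]
        p = new
    return p

# Toy bipartite graph: nodes 0, 1 are users; nodes 2, 3 are items.
P = [[0.0, 0.0, 0.7, 0.3],   # user 0 mostly walks to item 2
     [0.0, 0.0, 0.4, 0.6],   # user 1 mostly walks to item 3
     [0.5, 0.5, 0.0, 0.0],   # item 2 walks back to the users
     [0.5, 0.5, 0.0, 0.0]]   # item 3 walks back to the users
p = pagerank_with_restart(P, start=0)
print(p)  # restarting at user 0 ranks item 2 above item 3
```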
Therefore, given a recommendation matrix, for each user s we can select the top-x items which
he/she has rated. These we put into a set d (destination nodes), and the rest we put into a set l.
The learning error can be formulated as:

(3.25)

Where h(x) is the squared error. If any node in l gets a higher stationary probability than any
node in d, the value of the error function increases.
Thus the update rules for U and V can be given as:

Therefore, we can use gradient descent to optimize U and V.


The RMSE obtained for the above method was 0.9796
The RMSE values obtained for all the above methods are given in the next chapter.
***


Chapter Four: RESULTS


Neighbourhood Models
We implemented neighbourhood models for each of the three similarity measures and for both
user-user and item-item correlation. For each of these scripts we have implemented two models:
one in which only the 10 nearest neighbours are used, and a second in which all users/items
are taken into account. A time based model has also been implemented. The RMSE values
obtained for the same are given in the table below.

Type of Model   No of neighbours        Cosine       Pearson's r   Pearson's r with
                                        Similarity                 Jaccard Coefficient

User-User       10 nearest neighbours   1.3335       1.3238        1.3238
                All users               1.2873       1.3061        1.3061

Item-Item       10 nearest neighbours   1.2471       1.2590        1.2590
                All items               1.2268       1.2535        1.2535

The RMSE obtained for time based neighbourhood model in which cosine similarity was taken
along with 10 nearest neighbours and user-user correlation is 1.3103.

Clustering model
In this model, a predefined number of clusters was set; we varied the number of clusters from
5 to 10. The results are shown below.


Number of Clusters   RMSE

C=5                  1.0443
C=6                  1.0441
C=7                  1.0492
C=8                  1.0437
C=9                  1.0504
C=10                 1.0443

Latent Factor Models


We implemented SVD, SVD++, non-negative matrix factorization and the proposed supervised
random walk based approach. Parameter learning was done via 5-fold cross-validation. The
results are shown below.
ALGORITHM                              RMSE

SVD                                    0.9632
SVD++                                  0.9541
Non-negative Matrix Factorization      0.9692
Supervised Random Walk Approach        0.9796

***

Chapter Five: CONCLUSION AND FUTURE SCOPE


The best result is achieved from the Latent Factor models. In this, the RMSE for Non-Negative
Matrix Factorization came out to be 0.9692 which is later improved in the SVD model whose
RMSE is 0.9632. The most accurate result is that of the SVD++ model, whose RMSE is 0.9541.
Our proposed model achieves an RMSE of 0.9796.
The k-means clustering approach gave RMSE values between 1.0437 and 1.0504, which is the
second best result obtained when comparing amongst the models.
The neighbourhood models, as is seen from the results, are the least accurate; the best
RMSE of 1.2268 is obtained when item-item correlation is considered using the entire list of
neighbours possible for each user. For the user-user correlation methods where the 10 nearest
neighbours are taken, the time based model has given the best RMSE of 1.3103.
Thus it is inferred that Matrix factorization models give the best recommendation results
amongst all the models tested in this project.

Future Scope:
One of the major drawbacks of the supervised random walk based approach is that it converges
very slowly, partly due to the slow convergence of (3.23). However, by treating the
training dataset as a whole instead of each sample by itself, the algorithm introduces various
parallelizable computations such as (3.23) and (3.24). Parallelizing these can lead to a 1/k
reduction in running time, where k is the number of latent factors.


One of the emerging uses of recommendation systems has been in the context of social graphs.
Integrating recommendation systems with the social graph remains one of the major challenges
right now due to the unavailability of coherent social as well as recommendation data. Social
graph data holds the key to solving the cold start problem, as well as to reinforcing a user's
opinions through those of his/her connections.

***


REFERENCES
1. David Goldberg, David Nichols, Brian M. Oki and Douglas Terry, "Using collaborative
filtering to weave an information tapestry", Communications of the ACM, pages 61-70, 1992.

2. Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom and John Riedl,
"GroupLens: An Open Architecture for Collaborative Filtering of Netnews", ACM Conference
on Computer Supported Cooperative Work, pages 175-186, 1994.

3. John S. Breese, David Heckerman and Carl Kadie, "Empirical Analysis of Predictive
Algorithms for Collaborative Filtering", Microsoft Research, pages 43-52, May 1998.

4. Daniel D. Lee and H. Sebastian Seung, "Algorithms for Non-negative Matrix Factorization",
NIPS, pages 556-562, 2000.

5. Badrul Sarwar, George Karypis, Joseph Konstan and John Riedl, "Analysis of
Recommendation Algorithms for E-Commerce", ACM Conference on Electronic Commerce,
pages 158-167, 2000.

6. Badrul M. Sarwar, George Karypis, Joseph Konstan and John Riedl, "Recommender Systems
for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering", pages
907-911, 2002.

7. Yehuda Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering
model", KDD '08, pages 426-434, 2008.

8. Yehuda Koren, "Collaborative Filtering with Temporal Dynamics", KDD '09, pages 447-456,
2009.

9. Christian Desrosiers and George Karypis, "A comprehensive survey of neighborhood-based
recommendation methods", Recommender Systems Handbook, pages 107-144, 2011.

10. Lars Backstrom and Jure Leskovec, "Supervised Random Walks: Predicting and
Recommending Links in Social Networks", WSDM, pages 635-644, 2011.
***
