Predicting what a user wants is a common use of big data. For example, Google shows you relevant
ads, Amazon recommends relevant products, and Netflix recommends movies that you might like.
Recommendation involves predicting which new items a user would like or dislike, based
on the user's preferences for, or associations with, previous items.
For example, a user, Kevin Benedict, likes the following books (items), which are mostly
classics:
A Tale of Two Cities
The Great Gatsby
For Whom the Bell Tolls
A recommender will predict which new books (items) Kevin Benedict will like:
Jane Eyre
The Adventures of Tom Sawyer
In this project, we will use Mahout. Mahout is a machine learning application programming
interface (API) built on Hadoop.
Goals
The goal of this project is to show movie recommendations for each user.
Tools
Data set
The dataset we use is the GroupLens MovieLens data set, which provides user ratings of
movies. It contains 943 users, 1,682 movies and 100,000 ratings.
This archive contains:
Methods
S × U = R
Therefore, to simplify the steps, we use a dataset whose format supports this matrix
multiplication directly. The MovieLens dataset comes in this format: each line contains
a userId, an itemId and a preference value, separated by tabs. The userId and itemId
are integers and the preference value is a double.
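To make the matrix multiplication concrete, here is a minimal Python sketch of item co-occurrence recommendation over lines in the format just described. It is a simplified illustration, not Mahout's implementation: the function names (`parse_ratings`, `cooccurrence`, `recommend`) and the tiny sample data are hypothetical, and the score is a plain sum of co-occurrence count times preference with no normalization.

```python
from collections import defaultdict
from itertools import permutations


def parse_ratings(lines):
    """Parse 'userId<TAB>itemId<TAB>preference' lines into {user: {item: pref}}.

    Any extra trailing fields on a line are ignored.
    """
    prefs = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        user_id, item_id, value = int(fields[0]), int(fields[1]), float(fields[2])
        prefs[user_id][item_id] = value
    return prefs


def cooccurrence(prefs):
    """Count, for each ordered item pair, how many users rated both items."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(prefs, counts, user_id, top_n=3):
    """Multiply the co-occurrence matrix by one user's preference vector.

    Items the user has already rated are left out of the result.
    """
    rated = prefs[user_id]
    scores = defaultdict(float)
    for item, value in rated.items():
        for other, count in counts[item].items():
            if other not in rated:
                scores[other] += count * value
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up sample: three users, three items.
sample = [
    "1\t10\t5\n", "1\t20\t3\n",
    "2\t10\t4\n", "2\t20\t2\n", "2\t30\t5\n",
    "3\t20\t4\n", "3\t30\t1\n",
]
prefs = parse_ratings(sample)
counts = cooccurrence(prefs)
print(recommend(prefs, counts, 1))  # → [(30, 11.0)]
```

In the sample, item 30 co-occurs once with item 10 and twice with item 20, so user 1 (who rated them 5 and 3) gets a score of 1×5 + 2×3 = 11 for item 30.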
Then, once Hadoop and Mahout are successfully installed, we simply run the Mahout
recommender job:
hadoop jar <MAHOUT DIRECTORY>/mahout-core-0.7-job.jar \
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
    -s SIMILARITY_COOCCURRENCE --input u.data --output output
The raw recommendation output is not easy to read, so we use a small Python program that
shows, for a given user, the movies he or she has rated and the movies we recommend. The
program reads u.data for the list of rated movies, u.item for the movie titles and
output.txt for the list of movies recommended to the user.
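A minimal sketch of such a display program might look like the following. It is not the original script. It rests on several assumptions: u.item is pipe-separated with the movie id and title in the first two fields and is read as ISO-8859-1, and each line of output.txt has the form `userId<TAB>[itemId:score,itemId:score,...]`. All function names are hypothetical.

```python
def load_titles(lines):
    """Map movie id -> title from 'id|title|...' lines (assumed u.item layout)."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2 and fields[0].isdigit():
            titles[int(fields[0])] = fields[1]
    return titles


def load_rated(lines, user_id):
    """Return [(item_id, rating)] for one user from tab-separated rating lines."""
    rated = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) >= 3 and int(fields[0]) == user_id:
            rated.append((int(fields[1]), float(fields[2])))
    return rated


def load_recommended(lines, user_id):
    """Return [(item_id, score)] for one user.

    Assumes each line looks like 'userId<TAB>[itemId:score,itemId:score,...]'.
    """
    for line in lines:
        user, _, payload = line.strip().partition("\t")
        if not user.isdigit() or int(user) != user_id:
            continue
        pairs = payload.strip("[]").split(",")
        return [(int(i), float(s)) for i, s in (p.split(":") for p in pairs if p)]
    return []


def report(user_id, titles, rated, recommended):
    """Build a readable text report for one user."""
    out = [f"User {user_id}", "Rated movies:"]
    out += [f"  {titles.get(i, f'item {i}')} (rating {r:g})" for i, r in rated]
    out.append("Recommended movies:")
    out += [f"  {titles.get(i, f'item {i}')} (score {s:.2f})" for i, s in recommended]
    return "\n".join(out)


def show_user(user_id, data="u.data", items="u.item", output="output.txt"):
    """Read the three files named in the text and return the report string."""
    with open(items, encoding="ISO-8859-1") as f:  # encoding is an assumption
        titles = load_titles(f)
    with open(data) as f:
        rated = load_rated(f, user_id)
    with open(output) as f:
        recommended = load_recommended(f, user_id)
    return report(user_id, titles, rated, recommended)
```

With the three files in the working directory, `print(show_user(1))` would list the titles user 1 has rated, followed by the recommended titles with their scores.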