net/publication/220349625
5 authors, including:
Nelson Baloian
University of Chile
All content following this page was uploaded by Nelson Baloian on 02 June 2014.
Roberto Konow
Informatics and Telecommunications Engineering School
Universidad Diego Portales, Santiago de Chile
roberto.konow@mail.udp.cl
Javier Pereira
Informatics and Telecommunications Engineering School
Universidad Diego Portales, Santiago de Chile
javier.pereira@udp.cl
Nelson Baloian
Department of Computer Science
Universidad de Chile, Santiago de Chile
nbaloian@dcc.uchile.cl
Abstract: New clustering and prediction techniques are continuously being
developed to help customers select products that meet their preferences and/or
needs from an overwhelming number of available choices. Because the amount of
available data may be huge, existing Recommender Systems that show good results
may be difficult to implement and may require substantial computational
resources in this scenario. In this paper, we present a recommender system that
is simpler than the traditional ones, easy to implement, and requires a
reasonable amount of resources. This system clusters users according to the
frequency with which an item has been visited by users belonging to the same
cluster, thereby performing a collaborative filtering scheme. Experiments were
conducted to evaluate the accuracy of this method using the Movielens dataset.
The results, as measured by the F-measure value, are comparable to those of
other approaches found in the literature which are far more complex to
implement. We then explain the application of this system to an e-content site
scenario for advertising. In this context, we present a filtering tool that has
been developed to filter and contextualize recommended items.
Key Words: Recommender System, Collaborative Filtering, Clustering, TF-IDF, F-
Measure, Advertising, e-content
Category: H.3.1, H.1.m, H.4.m, J.0.m
1 Introduction
2 Research Background
Given a user-item matrix V and a set of items I that have been rated (or
viewed) by a user, identify an ordered set of items X such that |X| ≤ N
and X ∩ I = ∅.
– Memory-based:
In memory-based algorithms, recommendations are computed from previously
rated items. The user-based algorithm class is frequently implemented; it
unfolds in three main steps. In the first step, the users most similar to
the active one are identified. Standard techniques may be used to compute
the similarity between pairs of rating vectors in V [Choi et al. 2010]:
Pearson correlation, Jaccard Pearson or cosine similarity, among others.
In the second step, the active user's neighborhood is determined from the
similarity measure. Classical methods for doing this are center-based
neighborhood, K-Nearest Neighbor and clustering. In the third step, a list
of recommendations, ordered by predicted value, is presented. The value
v(u, i) may be calculated as the simple average or the weighted sum of the
nearest neighbors' ratings for items not rated by the active user. Although
the user-based approach is very popular, it has two documented drawbacks:
first, low performance when the number of items/users is high and the
matrix V is sparse; second, the "cold-start" problem, which arises when no
ratings are available for a user interacting with a recommender system for
the first time [Schein et al. 2002].
– Model-based:
In model-based approaches, a model derived from the analysis of available
data is used to predict the v(u, i) values [Sarwar et al. 2002]. This is an
"off-line" process in which the model is updated whenever enough changes
to V have occurred. One implementation of this approach clusters users into
classes so that an item rating is predicted from the ratings in a class.
Several techniques have been implemented for clustering purposes
[Sandvig et al. 2008]: K-Nearest Neighbor, k-Means clustering, probabilistic
Latent Semantic Analysis or Principal Component Analysis, among others
[Adomavicious and Tuzhilin 2005]. In some cases, the item-based technique
is implemented, where predicted ratings are based on item correlations
instead of user similarities. It has been argued that if the item-based
method is less dynamic than the user-based method, then a model may be
constructed [Deshpande and Karypis 2004]. However, in this approach model
obsolescence should be considered, since changes in the underlying data may
affect the accuracy of the recommendations.
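The three steps of the user-based scheme described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function names `cosine` and `recommend`, the toy matrix `V`, the use of 0 for "not rated" and the choice of a similarity-weighted average for v(u, i) are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(V, active, k=2, n=3):
    """Return up to n item indices not rated by `active`, best prediction first.

    V maps a user id to a list of ratings; 0 is assumed to mean "not rated".
    """
    # Step 1: similarity of every other user to the active one.
    sims = {u: cosine(V[active], r) for u, r in V.items() if u != active}
    # Step 2: the neighborhood is taken to be the k most similar users.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    # Step 3: predict v(u, i) as a similarity-weighted average of neighbor ratings.
    predictions = {}
    for i, own in enumerate(V[active]):
        if own != 0:
            continue  # already rated, so it cannot be recommended
        rated = [(sims[u], V[u][i]) for u in neighbors if V[u][i] != 0]
        weight = sum(s for s, _ in rated)
        if weight > 0:
            predictions[i] = sum(s * r for s, r in rated) / weight
    return sorted(predictions, key=predictions.get, reverse=True)[:n]

# Toy data (hypothetical): three users, four items.
V = {
    "u1": [5, 3, 0, 0],
    "u2": [4, 3, 4, 1],
    "u3": [1, 1, 5, 5],
}
top = recommend(V, "u1", k=1, n=2)
```

With k = 1 the only neighbor of u1 is u2, so the two unrated items are ranked by the ratings u2 gave them.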
Step 1 usually assumes a data structure to represent both a user and his/her
preferences. In memory-based approaches, the data is compiled into the V
matrix. In contrast, on the model-based side of our system, personal data
such as gender, age and the user's preference for movie categories are used as
the information base for clustering purposes. Once the cluster to which the
active user belongs is known, Step 2 consists of systematically identifying
similar users inside it, using the K-Nearest Neighbor algorithm. In order to
select the distance measure that performs best in our model, this algorithm has
been tested with different metrics: Pearson, Jaccard Pearson, cosine and
Euclidean (see Section 3.3). In Step 3, the most suitable items are searched
for among the items rated by similar users, and the aggregated rating is
computed for each of them. Step 4 corresponds to the recommendation stage,
where the Top-N items are listed. Let us consider the scenario in which a user
watching a movie may simultaneously receive recommendations for other movies
from the service provider. The main hypothesis behind this work is that people
with similar preferences for a certain movie genre and similar profile
characteristics may have similar preferences (negative or positive) for movies.
There are reasonable arguments supporting this hypothesis. For example, we
might expect that people who frequently watch musical movies will be attracted
to follow links to a concert or to pop music movies. Moreover, if two people
watching the same movie are also in the same age range, there is a high
probability that they like the same music and hence would follow the same
items. In the same way, people who frequently watch cooking programs on TV may
also be interested in programs about restaurants, or programs where special
cooking recipes are shown.
\[ P_{u,i} = \frac{M_{u,i}}{M_{u,\bullet}} \times \log\left(\frac{M}{M_{\bullet,i}}\right). \tag{1} \]
The score effectively reflects the genre preference of a user. For example, a
user who has watched several movies, each one from a different genre, will have
a low score for all genres, meaning that the user does not have a special
preference for any of them. On the other hand, a user who has watched a small
number of movies, most of them from the same genre, will have a high score for
that genre, indicating a strong preference towards it.
The following expression normalizes the previous value over the full set of
genres:
\[ \tilde{P}_{u,i} = \frac{P_{u,i}}{\sqrt{\sum_{j} P_{u,j}^{2}}}. \tag{2} \]
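As a rough illustration of expressions (1) and (2), the sketch below computes the TF-IDF-style genre score and its normalized form from a table of view counts. The reading of the symbols is an assumption made here for the example only: M_{u,i} is taken as the number of movies of genre i watched by user u, M_{u,•} and M_{•,i} as the corresponding row and column totals, and M as the grand total. The function names and the sample data are hypothetical.

```python
import math

def genre_scores(views):
    """views maps user -> {genre: number of movies watched}. Returns P[u][genre]."""
    total = sum(sum(row.values()) for row in views.values())    # M (assumed grand total)
    per_genre = {}
    for row in views.values():
        for genre, count in row.items():
            per_genre[genre] = per_genre.get(genre, 0) + count  # M_{.,i}
    scores = {}
    for user, row in views.items():
        per_user = sum(row.values())                            # M_{u,.}
        if per_user == 0:
            scores[user] = {}
            continue
        scores[user] = {
            genre: (count / per_user) * math.log(total / per_genre[genre])
            for genre, count in row.items() if count > 0
        }
    return scores

def normalize(scores):
    """Divide each user's scores by their Euclidean norm, as in expression (2)."""
    result = {}
    for user, row in scores.items():
        norm = math.sqrt(sum(p * p for p in row.values()))
        result[user] = {g: (p / norm if norm > 0 else 0.0) for g, p in row.items()}
    return result

# Hypothetical view counts for two users.
views = {
    "ana": {"musical": 4, "drama": 1},
    "ben": {"drama": 3, "action": 2},
}
P = genre_scores(views)
P_norm = normalize(P)
```

After normalization each user's score vector has unit length, so users with very different viewing volumes become comparable before clustering.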
Based on this measure a clustering process may be developed. The X-means
clustering algorithm [Pelleg and Moore 2000] has been implemented in this case.
In Figure 2 five clusters are shown, formed when age and preferred movie genre
are used as parameters in (2). The Y axis shows the genres of movies, the X axis
shows the age. Bullets inside a cluster indicate a similar score. These results are
used by the recommender engine to differentiate one type of user from another
based on the genre of movies they watch, thus implementing a preliminary user
behavior categorization. Utilizing a collaborative filtering approach, these results
can be used to establish an indirect relationship among users within the same
cluster.
In order to evaluate the quality [Hernandez and Gaudioso 2008] of the pro-
posed recommendation system, we conducted an experiment using the Movie-
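Cluster 1 (rows: items I1 to I6; columns: users U1 to U7)

      U1  U2  U3  U4  U5  U6  U7
I1     3   0   5   3   0   1   0
I2     2   4   0   2   4   2   0
I3     1   0   1   0   0   1   5
I4     0   0   0   4   1   5   3
I5     3   3   2   3   0   1   3
I6     0   2   0   1   1   5   3

Figure annotations: the entry in row I2, column U1 (value 2) is the rating of
U1 for I2; the entry in row I2, column U7 (value 0) indicates that item I2 has
not been viewed by U7.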
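To make the matrix above concrete, the following sketch ranks the items that U7 has not yet viewed in two ways: by how many other users of the cluster have viewed them, and by the simple average of their ratings. It is only an illustration under assumptions: a value of 0 is read as "not viewed", the simple average stands in for the aggregated rating, and the function names are hypothetical.

```python
USERS = ["U1", "U2", "U3", "U4", "U5", "U6", "U7"]
CLUSTER = {
    "I1": [3, 0, 5, 3, 0, 1, 0],
    "I2": [2, 4, 0, 2, 4, 2, 0],
    "I3": [1, 0, 1, 0, 0, 1, 5],
    "I4": [0, 0, 0, 4, 1, 5, 3],
    "I5": [3, 3, 2, 3, 0, 1, 3],
    "I6": [0, 2, 0, 1, 1, 5, 3],
}

def candidates(active):
    """Items the active user has not viewed (0 is assumed to mean 'not viewed')."""
    col = USERS.index(active)
    return [item for item, row in CLUSTER.items() if row[col] == 0]

def other_ratings(item, active):
    """Non-zero ratings given to `item` by the other users in the cluster."""
    col = USERS.index(active)
    return [r for j, r in enumerate(CLUSTER[item]) if j != col and r != 0]

def most_viewed(active):
    """Rank candidates by the number of other users who viewed them."""
    return sorted(candidates(active),
                  key=lambda item: len(other_ratings(item, active)),
                  reverse=True)

def by_mean_rating(active):
    """Rank candidates by the simple average of the other users' ratings."""
    def mean(item):
        seen = other_ratings(item, active)
        return sum(seen) / len(seen) if seen else 0.0
    return sorted(candidates(active), key=mean, reverse=True)

views_order = most_viewed("U7")
rating_order = by_mean_rating("U7")
```

For U7 the two rankings disagree: I2 has more viewers (five against four), while I1 has the higher average rating (3.0 against 2.8).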
\[ F_{\mathrm{measure}} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \tag{3} \]
The accuracy of the recommender system was evaluated by its capacity to
retrieve relevant items among the first 20 recommended movies. Given a specific
distance measure and K, up to 33 simulations were run, each one selecting a
random testing set (20% of ratings). Then, the average F-measure value was
computed. Two independent criteria were selected for computing the Top-N items:
(i) Predicted-Rating, and (ii) Most-Viewed items. In Figure 4, the F-measure
metric is compared for the different distance measures, for items recommended
by predicted value or by most viewed. Only the results for K ∈ {150, 500} are
depicted. Qualitative analysis shows that, given a distance measure, the
Most-Viewed criterion outperforms Predicted-Rating. Furthermore, the Jaccard
Pearson measure is clearly the best choice in this model. Notice that, given a
criterion, the Euclidean-based recommendations are outperformed in all cases.
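The F-measure of expression (3) for a single Top-N list can be computed as in the sketch below. It is a minimal example, not the evaluation code used in the experiments: the function name `f_measure`, the sample item identifiers and the assumption that the recommended list contains no duplicates are all introduced here for illustration.

```python
def f_measure(recommended, relevant):
    """F-measure of a recommendation list against a set of held-out relevant items."""
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    if hits == 0:
        return 0.0  # precision and recall are both zero
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def average_f(runs):
    """Average F-measure over several (recommended, relevant) simulation runs."""
    values = [f_measure(rec, rel) for rec, rel in runs]
    return sum(values) / len(values) if values else 0.0

# Hypothetical run: 4 recommended items, 3 relevant items in the test set.
score = f_measure(["a", "b", "c", "d"], {"b", "d", "e"})
```

In this run two of the four recommended items are relevant, giving a precision of 1/2, a recall of 2/3 and an F-measure of 4/7.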
[Figure 4 (chart): F-Measure on the vertical axis, from 0.000 to 0.200; groups: Prediction k150, Prediction k500, Most Viewed k150, Most Viewed k500]
4.1 Introduction
3. what is the popularity of the scene being played for both users in general
and users with a similar preference profile;
4. what is the business model being applied (e.g. maximization of revenues for
sponsors);
The more information we have about users' backgrounds, preferences and product
characteristics, the better a recommender system may perform [Kiewra 2005].
However, in real systems the availability and accuracy of this information
depend on many uncontrollable factors: users might not want to provide or
update their private information, and detailed information about the
advertisements themselves is difficult to obtain from the advertising
providers, since the number of advertisement items is huge. Consequently, a
content-based strategy cannot be applied, because little information is
available about the content of the items. The most suitable type of recommender
system in this scenario is a collaborative filtering one.
However, it has been shown that the computation of the recommended item
vs. user matrix takes O(U × A) [Linden et al. 2003], where U is the number of
users and A the number of advertising items. This means that a filtering
process becomes necessary. Hence, let us consider that during the sign-up
process users provide personal information such as age, gender and geographical
location. In addition, the movie-related information stored by the system
includes title, genre (refined into sub-genres), duration, actors, director,
etc. In order to embed a first filtering stage in the system, we may ask the ad
owners to include target metadata about the type of people the advertisement is
aimed at, matching exactly the information we have about the users: age range,
gender(s) and location (city, prefecture, whole country). Additionally, the ad
owner can choose a number of film genres in which the advertisement should
never be included. This may be particularly useful when the advertiser wants to
avoid having its product associated with a certain type of film or film
content. In any case, we expect most ad owners to avoid a relatively small
number of genres compared to the number of genres in which they would like to
show their ads.
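The first filtering stage described above could look roughly like the sketch below, which keeps only the ads whose target metadata matches the user and whose excluded genres do not include the genre of the movie being played. The class and field names (`User`, `Ad`, `excluded_genres`, and so on), the sample ads and the convention that an empty location set means "whole country" are assumptions made for this example and do not describe the deployed system.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    age: int
    gender: str
    location: str

@dataclass
class Ad:
    name: str
    min_age: int
    max_age: int
    genders: set
    locations: set  # assumption: an empty set means "whole country"
    excluded_genres: set = field(default_factory=set)

def eligible_ads(ads, user, movie_genres):
    """Return the ads whose target metadata matches the user and the movie."""
    result = []
    for ad in ads:
        if not (ad.min_age <= user.age <= ad.max_age):
            continue  # outside the targeted age range
        if user.gender not in ad.genders:
            continue  # gender not targeted
        if ad.locations and user.location not in ad.locations:
            continue  # location not targeted
        if ad.excluded_genres & set(movie_genres):
            continue  # the ad owner never wants the ad shown with this genre
        result.append(ad)
    return result

# Hypothetical ads and user.
ads = [
    Ad("sports drink", 18, 40, {"male", "female"}, {"Tokyo", "Osaka"}, {"horror"}),
    Ad("toy store", 25, 50, {"female"}, set()),
]
user = User(age=30, gender="female", location="Tokyo")
for_comedy = [ad.name for ad in eligible_ads(ads, user, ["comedy"])]
for_horror = [ad.name for ad in eligible_ads(ads, user, ["horror", "thriller"])]
```

Only the ads that survive this filter would need to enter the collaborative filtering stage, which reduces the A factor in the O(U × A) cost mentioned above.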
The digital company has implemented a system to acquire information directly
from the users and the ad owners, but it also automatically builds up
behavioral information for each user based on his or her history, for instance
by recording what kind of movies a certain user prefers or at what time he or
she watches them. The system uses both the explicitly provided data and the
generated behavioral data to establish similarity relations among users through
their individual preferences (for instance, see Figure 2).
Data has been obtained from three sources. The first is the user's profile,
which the user may or may not complete accurately during the registration
process. This data set includes gender, age, and address, among others. The
second source is the information about the movies. This data set includes,
among others, the title, cast, duration, genre and sub-genre. The third source
is the media access log files, which are automatically generated on the media
delivery servers. These log files include the time of play, stop and pause
events, the user ID, the purchased movie ID and the IP address of the user's
machine (although this might be that of a firewall or proxy), among other
parameters. In particular, the IP address makes it possible to know the
location from which each user is connecting, using a Geo-IP database service
like http://www.hostip.info/ or http://www.ipinfodb.com/. In this particular
case, we have used the services of Maxmind GeoIP because its coverage of
Japanese locations is quite complete. This information is especially convenient
when the user's address is not provided at registration. The log files also
include data about how long the user watched a particular movie, including the
starting point and the end point. This data corresponds in many cases to users
seeking a particular scene within the movie, and could be very useful for
displaying advertising before or after the most popular scenes of a movie.
The media servers’ log files are parsed using several scripts written in Python.
Using the data mentioned in the previous section we implemented a complete
log analyzer that is used to merge the information of the access logs and the
information available in the database. The log analyzer works as the preliminary
process in order to obtain relevant information from users’ behavior.
4.3 Where to include recommendations
The display time during the film (timing factor), the layout of the
recommendations and their number per unit of time play an important role in the
user's decision on whether or not to click on them. However, we are not going
to tackle those aspects in this paper, since these are issues for marketing
experts, human-computer interface experts and graphic designers, and can be
approached independently of (but complementarily to) the recommender system
itself. We hereby assume that the problem the recommender system has to solve
is that of choosing a certain number of relevant movies for each user.
Data available in log files is of paramount importance for targeting
advertising, since it allows questions like the following to be answered:
– What are the most popular movies for female/male customers between X
and Y years old living in the regions A, B and C?
– Which are the hottest scenes inside a certain movie for a given group of
users?
This information certainly helps advertisers define the genres of the films
where the ad should not be displayed, the age of the target audience and the
geographical location where the ad should appear (see section 3.1). It also
helps the business planning division of the company decide which type of
advertising is more likely to be successful and when to show it during the
screening of a certain movie. Since the amount of data contained in the log
files is so large, it is very important to display it in an aggregated and
compact way. To do this, we developed a log analyzer tool, which displays this
information graphically and allows a technical user to set the relevant
parameters to filter it.
The main function of this tool is provided through an interface where a
specialized user enters the age, gender and location parameters for filtering
purposes. As an example, in Figure 5, the user has chosen to filter the data
for male and female customers between 18 and 40 years old living in Tokyo,
Kyoto, Osaka, Kanagawa, Aichi, Chiba, Saitama and Hokkaido. After this, various
charts are displayed.
Figure 5: Filtered data from Japanese database
Following this, two pie charts showing the age distribution and gender dis-
tribution of the selected customers group are displayed (see Figure 6).
Among other functionalities, the log analyzer presents statistics for a certain
movie genre or a single movie. For example, Figure 9 shows a chart of the
number of people who have seen each part of a movie. The movie is divided into
one-minute-long pieces and the log analyzer counts how many users have seen
each piece. By looking at this report we can easily find the scenes in the
movie that attract a large number of people. This information can be used to
decide when (or when not) to display advertisements.
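The per-minute count described above can be sketched as follows. The shape of a log record, a tuple of user ID, start second and end second for one movie, is an assumption made for illustration; the real media access logs contain more fields, and the function name `viewers_per_minute` is hypothetical.

```python
from collections import Counter

def viewers_per_minute(sessions):
    """Count distinct users per one-minute piece of a single movie.

    sessions: iterable of (user_id, start_second, end_second) tuples,
    with the interval taken as [start_second, end_second).
    """
    seen = set()
    for user, start, end in sessions:
        if end <= start:
            continue  # ignore empty or malformed intervals
        first = int(start // 60)
        last = int((end - 1) // 60)
        for minute in range(first, last + 1):
            seen.add((user, minute))  # the set keeps each user once per piece
    return Counter(minute for _, minute in seen)

# Hypothetical play intervals for one movie.
sessions = [
    ("a", 0, 130),    # watches minutes 0, 1 and 2
    ("b", 60, 120),   # watches minute 1 only
    ("a", 100, 110),  # rewatches part of minute 1, counted once
    ("c", 65, 70),    # seeks into minute 1
]
counts = viewers_per_minute(sessions)
hottest = max(counts, key=counts.get)
```

Here minute 1 is the most watched piece, seen by three distinct users, which is the kind of peak the report is meant to expose.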
5 Conclusions
References
[Adomavicious and Tuzhilin 2005] Adomavicius, G. and Tuzhilin, A. “Toward the next
generation of recommender systems: A survey of the state-of-the-art and possible
extensions”. IEEE Transactions on Knowledge and Data Engineering, 17, 6 (2005),
734-749.
[Broadband 2010] Broadband Forum: Strong DSL and IPTV growth,
press article released on April 2010, Last visited on September 2010
http://mybroadband.co.za/news/broadband/11908-Strong-DSL-and-IPTV-
growth.html.
[Buriano et al. 2006] Buriano, L., Marchetti, M., Carmagnola, F., Cena, F., Gena, C.
and Torre, I. “The role of ontologies in context-aware recommender systems”.
MDM’06 Proceedings of the 7th International Conference on Mobile Data
Management (2006), 80.
[Candillier et al. 2008] Candillier, L., Jack, K, Fressant, F. and Meyer, F. “State-of-
the-art Recommender Systems”. In Collaborative and Social Information Retrieval
and Access: Techniques for Improved User Modeling. Eds. Chevalier, M., Christine,
J., Soulé-Dupuy, C., Information Science Reference Publisher (2008), 1-22.
[Choi et al. 2010] Choi, S. S., Cha, S. H. and Tappert, C. “A Survey of Binary Simi-
larity and Distance Measures”. Journal on Systemics, Cybernetics and Informatics,
8, 1(2010), 43-48.
[Cotriss 2009] Cotriss, D.: IPTV Advertising Comes To Life,
http://www.dailyiptv.com/news/iptv-advertising-rises-031207/, October 2010.
[Cremonesi et al. 2011] Cremonesi, P., Garzotto, F., Negro, S., Papadopoulos, A. and
Turrin, R. “Comparative evaluation of recommender system quality”. Proceedings
of the 2011 annual conference extended abstracts on Human factors in computing
systems, Vancouver, BC, Canada, May 07-12 (2011), 1927-1932.
[Deshpande and Karypis 2004] Deshpande, M. and Karypis, G. “Item-based top-N
recommendation algorithms”. ACM Transactions on Information Systems (TOIS),
22, 1(2004), 143-177.
[Hernandez and Gaudioso 2008] Hernández del Olmo, F. and Gaudioso, E. “Evalua-
tion of recommender Systems: a new approach”. Expert Systems with applications:
An International Journal, 35 (2008), 790-804.
[Kiewra 2005] Kiewra, M. “RankFeed - Recommendation as Searching without
Queries: New Hybrid Method of Recommendation”. Journal of Universal Computer
Science, 11, 2(2005), 229-249.
[Klein et al. 2006] Klein, G., Moon, B. and Hoffman, R. “Making Sense of Sensemaking
2: A Macrocognitive Model”. Intelligent Systems, IEEE, 21, 5 (2006), 88-92.
[Konow et al. 2010] Konow, R., Wayman, T., Loyola, L., Pereira, J. and Baloian, N.
“Recommender System for contextual advertising in IPTV scenarios”, Computer
Supported Collaborative Work in Design CSCWD (2010), 617-622.
[Arunachalam and Thambidurai 2010] Arunachalam, K. and Thambidurai, P. “Col-
laborative Web Recommendation Systems - A Survey Approach”. Global Journal
of Computer Science and Technology, 9, 5 (2010), 30-35.
[Linden et al. 2003] Linden, G., Smith, B. and York, J. “Amazon.com recommendations:
item-to-item collaborative filtering”, IEEE Internet Computing, 7, 1 (2003), 76-80.
[Maes 1994] Maes, P. “Agents that reduce work and information overload”, Commu-
nications of the ACM, 37, 7 (1994), 30-40.
[Manouselis and Costopoulou 2007] Manouselis, N. and Costopoulou, C. “Analysis
and classification of multi-criteria recommender systems”. World Wide Web:
Internet and Web Information Systems, Special issue on Multi-channel Adaptive
Information Systems on the WWW, 10, 4 (2007), 415-441.
[Memmel et al. 2009] Memmel, M., Kockler, M. and Schirru, R. “Providing Multi
Source Tag Recommendations in a Social Resource Sharing Platform”. Journal of
Universal Computer Science, 15, 3 (2009), 678-69.
[Pazzani and Billsus 2007] Pazzani, M. and Billsus, D. “Content-Based
Recommendation Systems”, The Adaptive Web, LNCS 4321 (2007), 325-341.
[Pelleg and Moore 2000] Pelleg, D. and Moore, A. “X-means: Extending K-means with
efficient estimation of the number of clusters”, Proc. 17th Int. Conf. Machine
Learning (ICML’00) (2000), 727-734.
[Piper 2010] Piper, B., “Report on IPTV Forecast and Outlook: 13.7 Billion by 2012”,
Strategy and Analytics, Feb. 27th (2008).
[Sandvig et al. 2008] Sandvig, J.J., Mobasher, B. and Burke, R. “A Survey of Collabo-
rative Recommendation and the Robustness of Model-Based Algorithms”. Bulletin
of the Technical Committee on Data Engineering, 31, 2 (2008), 3-13.
[Sarwar et al. 2002] Sarwar, B.M., Karypis, G., Konstan, J. and Riedl, J. “Scalable
neighborhood formation using clustering”. Proceedings of the Fifth International
Conference on Computer and Information Technology (2002).
[Schein et al. 2002] Schein, A., Popescul, A., Ungar, L. and Pennock, D. “Methods
and Metrics for Cold-Start Recommendations”. Proceedings of the 25th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR 2002). New York City, New York: ACM., 253-260.
[Shani and Gunawardana 2011] Shani, G. and Gunawardana, A. “Evaluating Recom-
mendation Systems”. In Recommender Systems Handbook (2011), 257-297.
[Spark 1973] Spark Jones, K. “A statistical interpretation of term specificity and its
application in retrieval”. Journal of Documentation 28, 1(1973), 11-21.
[Schirru et al. 2010] Schirru, R., Baumann, S., Memmel, M. and Dengel, A. “Extrac-
tion of Contextualized User Interest Profiles in Social Sharing Platforms” Journal
of Universal Computer Science, 16, 16 (2010), 2196-2213.
[Unni and Harmond 2007] Unni, R. and Harmond, R., “Perceived Effectiveness of
Push vs. Pull Mobile Location-based Advertising”. Journal of Interactive adver-
tising, 7, 2(2007), 28-40.
[Van den Dam 2007] Van den Dam, R. “IPTV ADVERTISING: a gold mine for tel-
cos?” last visited October 2010 http://broadcastengineering.com/RF/ broadcast-
ing iptv advertising gold/index.html.