
The Internet Can Read Your Mind: A Study of Media Recommendation Algorithms
by
ELLIE JANZEN
FACULTY ADVISOR, DR. RYAN LABRIE
SECOND READER, PROF. ELAINE WELTZ
A project submitted in partial fulfillment
of the requirements of the University Scholars Program
Seattle Pacific University
2014
Approved ______________________________
Date __________________________________
Abstract
Recommender systems are emerging as a key way to manage data on the Internet. In this paper, an overview of
different recommender systems is presented, including collaborative, content-based, knowledge-based, and hybrid
algorithms. Each of these methods is examined for strengths, weaknesses, and preferred content. Based on this
research, the design and implementation of a real-world recommender is explained as a proof of concept. The
Webcomic Companion, an online webcomic recommendation system, is outlined, including design, implementation,
and testing results. The main components of the system are a PHP website, a MySQL database, and a
recommendation algorithm. The key algorithm in the website is an item-based collaborative system augmented with
content-based features. On a small test dataset, the algorithm was found to have an average 33% success rate. Also
discussed are the ethical consequences of recommendation systems, including privacy and data diversity.
The Internet Can Read Your Mind:
A Study of Media Recommendation Algorithms
1 INTRODUCTION
Getting Information off the Internet is like taking a drink from a fire hydrant.
- Mitchell Kapor, founder of Lotus Development
1.1 Background
With the proliferation of the Internet, a person with a computer now has access to almost any piece of
information they could need. Unfortunately, they will also have access to all the rest of it. Take a simple Google
Search experiment. Even a search for something relatively specific, like 'organic alfalfa farming in Tennessee,'
returns nearly four million results. How is a user supposed to know which of these four million references contain
the data they are looking for? According to WorldWideWebSize.com, a website that estimates the size of the Internet
based on indexed pages in popular search engines, the Internet includes at least 1.16 billion pages as of January 2014
(Kunder 2014). Searching online is the classic needle in a haystack problem, except you are searching through
orders of magnitude more haystacks than anything outside the digital world could require.
Recommender systems have been developed to solve this problem. The most familiar recommender
systems are the suggested purchases on Amazon.com, the recommended movies on Netflix, and suggested songs on
iTunes. These are systems that evaluate a user's tastes, goals, and priorities and automatically sort through the
'haystack' to find items the user might be interested in.
Automated recommendation falls under the heading of 'big data,' an emerging field in computer science.
Big data involves mining massive quantities of data, but what makes the data distinctive is not its size but its
relationality to other data (Boyd 2011). Big data can be analyzed to find relevant, large-scale patterns about the
underlying objects. Some fields where big data is being explored are genetics, insurance, financial markets, and
targeted advertising. Though it is a powerful technology, it is a new field without well-defined ethical limitations
(Boyd 2011).
1.2 The Webcomic Companion
In order to study recommender systems, this project will focus on an application that is familiar to many
people: recommenders that suggest new media items. This is similar to the example of Netflix ranking a user's film
preferences. For this project, rather than films or music, webcomics are the consumer media being evaluated.
Webcomics are comics published online, usually for free and produced by one or two people. Since the barrier for
entry into the industry is so low, there are tens of thousands of webcomics available online, many of poor quality.
Fans of these comics usually find new ones to read by following the links a particular webcomic artist posts on their
site, but it is a tedious process. This seemed a ripe area for a recommendation service.
The centerpiece of this project is a website called the Webcomic Companion, which will allow users to
create a preference list of webcomics and receive recommendations for new comics to read. This paper serves as a
review of the research conducted to create this website, an explanation of the system design, and documentation for
the implementation process.
1.3 Paper Outline
The Webcomic Companion project consists of four main parts: a literature review, the algorithm for
determining recommendations, the website and database that make up the system, and an examination of some
current ethical issues surrounding recommendation.
The first section of this paper contains an overview of current literature on the topic of recommendation
systems. The main issues addressed are different types of recommendation algorithms, their strengths and
weaknesses, and some aspects of data design. Section two of this paper is an overview of the system design.
Following this is an explanation of the implementation of the algorithm, website, and database. This section contains
code samples and details on algorithm performance and data collection. The final section of the paper discusses
several ethical issues surrounding the topic of recommender systems, and how these concerns apply to the project as
a whole.
2 LITERATURE REVIEW
Research on recommender systems began in the early nineties with what became known as
collaborative filtering methods. The most cited of these early papers is likely the research on Tapestry, a simple mail
filter that allowed users to tag documents with annotations such as 'interesting' or 'uninteresting' (Jannach et al.
2011, Goldberg et al. 1992).
Research has expanded widely since then, especially because of the applications for recommenders in
online commercial sites (Azaria et al. 2013; Zhang and Pennacchiotti 2013). This section will outline the main
bodies of this research, focusing particularly on different kinds of algorithms, their strengths and weaknesses, and
how they are integrated into whole recommender systems.
2.1 Ratings
In order to make successful recommendations, a recommender system needs to have some way to learn the
tastes of a user. One of the most common ways to do this is through ratings or feedback (Montaner et al. 2003).
Ratings are a component of the user profile that represents how the user feels towards a certain set of items in the
system. Without ratings, a recommender system does not know what kind of items to recommend.
There are several different kinds of ratings that recommender systems employ. The broadest classification
is explicit versus implicit feedback. Explicit feedback is provided directly by the user, most commonly as a numeric
ranking, such as 1-5 stars on a purchased product. Implicit feedback is gathered by passively monitoring the user's
activity, such as how much time they spend on a page or which links they visit.
Types of feedback can be further broken down underneath these categories. Scalar ratings provide a
position within a numerical range, such as 4 out of 5 stars or strongly disagree on a scale of strongly disagree to
strongly agree. Binary ratings provide only positive or negative feedback, such as like/dislike or yes/no. Unary
ratings provide only one dimension of feedback, either present or not present. For example, a purchase shows that a
customer has bought an item, but the absence of a purchase does not mean the user does not like it (Schafer et al. 2006;
Aharon et al. 2013).
2.1.1 Explicit Feedback
The explicit and implicit feedback systems have different strengths and weaknesses. Explicit feedback is
the most accurate and simple way to capture a user's preferences, and is widely used in many systems. However,
explicit ratings suffer from several serious drawbacks. The most significant of these is the difficulty in obtaining them.
Explicit feedback requires effort from the user, and users are usually very reluctant to make an effort unless they see
immediate gain. One study found that only 15% of users supplied feedback even when encouraged to do so (Schafer
et al. 2006, Montaner et al. 2003). This results in extremely sparse rating data for the system, and is one of the major
difficulties in building a successful recommender system.
There are other problems with explicit ratings as well. Ratings of different items are assumed to be
independent, when in fact they are not (Schafer et al. 2006). There are also some cases where numeric scales do not
represent a user's feeling about an item very well (Montaner et al. 2003). There has been some research into the
ideal rating scale for explicit ratings but no consensus so far, and the exact influence of different rating systems
remains open to debate (Jannach et al. 2011).
There are ways around these problems, however. A common observation is that a recommender system
does not need many ratings from every user in order to work. A few "early adopters" who rate a large number
of items are enough to kick-start the system to provide ratings for other users (Schafer et al. 2006).
2.1.2 Implicit Feedback
The other primary method for gathering feedback information is implicit feedback. When gathering implicit
feedback, a system monitors the user's behavior and uses it to deduce their interests. For example, if a user clicks on
a link it can be inferred that they have some interest in the contents of the link. A major strength of this method
is that it requires no effort on the part of the user, and users are often highly reluctant to provide information to an
online system (Montaner et al. 2003).
However, it can be difficult to correctly interpret implicit results. For example, an implicit system can look
at item purchases as a way to gather information about the user's taste, but the user may have bought some of the
items as gifts or may not have liked all of the items they bought (Jannach et al. 2011). While there is more
uncertainty in implicit ratings, if you have enough of them the errors will balance out on average.
Many ratings are required to make implicit feedback accurate, so explicit rating is necessary to augment a
good system unless the system can capture a large number of implicit ratings. This reflects a common theme in
recommendations, that two recommenders working together usually do a better job than one. A hybrid rating system
uses implicit feedback to decrease user effort and explicit feedback to increase the accuracy of the recommender
system (Montaner et al. 2003).
2.2 Collaborative Algorithms
Once items in a system have been rated, how are these ratings used? Recommendation algorithms use this
data to automatically sort through a large number of items. Probably the most widely studied and implemented
recommender systems are those with collaborative algorithms at their core. Collaborative algorithms use
information about the likes and dislikes of a user community to make predictions about what the active user of a
system will be interested in (Sarwar et al. 2000).
For example, imagine a group of Netflix clients who have all watched the same movies. Two of these users,
Abel and Bina, generally give films similar ratings. A collaborative algorithm would see this rating pattern, and the
next time Abel rates a movie highly, the system would recommend that film to Bina as well. Typically, rather than
just two users, the algorithm would take into account the ratings of many similar users to recommend the average
best item. This group of similar users is frequently referred to as a neighborhood.
2.2.1 Collaboration and Content
Since pure collaborative algorithms use only information about user preferences, they are agnostic about
the content of the items being recommended. This can be a major benefit for a recommender system, especially
when the items in question are evaluated by subjective features like 'ease of use,' 'creativity,' or 'humor.' Features
like this are difficult to efficiently quantify in a computer system, and collaborative algorithms allow designers to
avoid this problem. For this reason, collaborative algorithms are often used to recommend media items like films
and books.
However, being completely content-ignorant has its drawbacks, as it sometimes misses the opportunity to
recommend obviously similar items (Jannach et al. 2011), such as the newest Harry Potter book to a lover of fantasy
novels. It also means that the system relies heavily on the presence of user ratings, which creates problems when
ratings are scarce.
2.2.2 Criteria
Collaborative systems must meet certain criteria before an algorithm can be successful. To perform well,
there must be many users of the system, the system must have a way to easily represent the interests of these users,
and the system must be able to match similar users or items together (Montaner et al. 2003, Terveen 2001). If there
are more items than users, a collaborative system will only be able to make confident recommendations for a small
subset of these items, and users must rate multiple items to get good recommendations.
For example, a recommender on a system with a thousand users may be able to provide recommendations
for a hundred items. Depending on the distribution of the data, only ten or fewer of these recommendations might be
reliable (Schafer et al. 2006). One common test data source for collaborative algorithms is the MovieLens database,
which includes 100,000 ratings on 1,700 movies from 1,000 users (GroupLens 2014). Clearly, a large number of
user-provided ratings are necessary to make the system effective.
2.2.3 User-based Collaboration
Collaborative recommendation strategies have two main classifications: user-based collaboration and
item-based collaboration. User-based collaboration was one of the first recommendation methods to be developed, and it
is probably the most intuitive of the various recommendation strategies. Given a set of items that the current active
user has rated, the system finds a pool of other users who have given similar ratings to those same items. The system
then evaluates the combined ratings of these similar users and finds the highest-rated item that the active user has
not seen and recommends it to them (Sarwar et al. 2001).
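The user-based procedure can be illustrated with a minimal Python sketch. It is not taken from the cited work: the choice of Pearson correlation as the similarity measure, the neighborhood size k, and every function and variable name are assumptions made for illustration only.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def recommend_user_based(ratings, active, k=2):
    """Return the unseen item with the best similarity-weighted rating
    among the k most similar users (the neighborhood), or None."""
    sims = sorted(((pearson(ratings[active], r), u)
                   for u, r in ratings.items() if u != active), reverse=True)
    neighbors = [(s, u) for s, u in sims[:k] if s > 0]  # assumption: drop dissimilar users
    scores, weights = {}, {}
    for s, u in neighbors:
        for item, rating in ratings[u].items():
            if item not in ratings[active]:
                scores[item] = scores.get(item, 0.0) + s * rating
                weights[item] = weights.get(item, 0.0) + s
    if not scores:
        return None
    return max(scores, key=lambda i: scores[i] / weights[i])

# Hypothetical ratings: Bina rates like Abel, Cato rates the opposite way.
ratings = {
    "abel": {"A": 5, "B": 1, "C": 4},
    "bina": {"A": 5, "B": 2, "C": 4, "D": 5},
    "cato": {"A": 1, "B": 5, "C": 2, "E": 5},
}
print(recommend_user_based(ratings, "abel"))
```

In this toy data only Bina ends up in Abel's neighborhood, so the item she rated highly and Abel has not seen is the one suggested. Note that every request compares the active user against all other users, which is the scaling cost discussed next.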
While this method is straightforward, it does not always perform well in large data sets, since computation
effort scales with the number of users and items in the system (Sarwar et al. 2000). Recommendations in real time
become more time consuming as the number of users and the catalogue of items grow. Therefore, exclusively
user-based collaborative recommendation is rarely feasible in large-scale data sets.
2.2.4 Item-based Collaboration
Item-based collaboration uses the same principle as user-based collaboration, but it uses similarities found
between items rather than similarities found between users as its primary metric (Sarwar et al. 2001). An item-based
collaborative system looks at the ratings given to a set of items and determines which items tend to be given
comparable ratings. The system then recommends items akin to those that the user has rated highly. This is done by
referencing a matrix of item similarity values.
There is a key computational difference between user-based and item-based collaboration. Since an item
never logs into the system, the time-intensive computations required to populate the item similarity matrix can be
performed offline. Once constructed, this table can be accessed in real time for the given active user. Furthermore,
the time to reference this table scales according to the number of items a user has rated, which is usually much
smaller than the number of users in the system. This makes it a much more viable option for large databases.
Amazon.com uses a variant of an item-based collaborative system to make its recommendations (Linden et al.
2003).
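The split between offline and online work can be sketched in Python as follows. This is an illustrative outline only, not the method used by Amazon.com or by the cited papers: cosine similarity, the scoring rule, and all names are assumptions.

```python
from math import sqrt
from itertools import combinations

def build_item_similarity(ratings):
    """Offline step: cosine similarity between every pair of co-rated items."""
    by_item = {}
    for user, user_ratings in ratings.items():
        for item, r in user_ratings.items():
            by_item.setdefault(item, {})[user] = r
    sim = {item: {} for item in by_item}
    for a, b in combinations(by_item, 2):
        common = set(by_item[a]) & set(by_item[b])
        if not common:
            continue  # no user rated both items, so no similarity is stored
        dot = sum(by_item[a][u] * by_item[b][u] for u in common)
        norm_a = sqrt(sum(v * v for v in by_item[a].values()))
        norm_b = sqrt(sum(v * v for v in by_item[b].values()))
        sim[a][b] = sim[b][a] = dot / (norm_a * norm_b)
    return sim

def recommend_item_based(sim, user_ratings, top_n=3):
    """Online step: only the rows for items this user has rated are read."""
    scores = {}
    for item, r in user_ratings.items():
        for other, s in sim.get(item, {}).items():
            if other not in user_ratings:
                scores[other] = scores.get(other, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical data: X and Y are rated alike; Z and W are rated alike.
ratings = {
    "u1": {"X": 5, "Y": 5},
    "u2": {"X": 4, "Y": 5, "Z": 1},
    "u3": {"Z": 5, "W": 5},
}
sim = build_item_similarity(ratings)          # computed ahead of time
print(recommend_item_based(sim, {"X": 5}))    # computed per request
```

The expensive pairwise loop lives entirely in `build_item_similarity`, while the per-request function touches only as many rows of the table as the active user has ratings.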
2.3 Content-based algorithms
The second main class of recommendation algorithms comprises those that take advantage of
content-based filtering. Content-based recommendation is complementary to collaborative filtering, and they often have
opposite strengths and weaknesses (Schafer et al. 2006). Rather than data on community preferences, content-based
systems use factual, measurable information about items to make recommendations. This information is often
available in the form of technical characteristics, such as movie genre, writer, director, or popular actors.
2.3.1 Utilized Information
The most common types of information utilized by content-based systems are item category, genre, and
keywords that can be automatically extracted from text. The success of a content-based recommender depends on a
strong association between these characteristics and user preferences. In practice, a simple approach such as
classification by genre can sometimes work. For example, if a user tends to like fantasy books, a recommendation
of the newest fantasy novel might be delightful to the reader (Basu et al. 1998).
While the content-based approach may seem straightforward, obtaining the characteristics of items can be a
significant problem. Even though technical information may be commonly available in electronic form, the more
helpful subjective information like an item's design quality or ease of use is often difficult to gather automatically
and expensive to add manually. Due to this limitation, content-based filtering is often implemented on items that
consist largely of text, such as news articles. Keywords and content can be extracted from the text with known
automated algorithms. The extracted keywords frequently serve as successful characteristics for the content-based
filtering (Kim et al. 2005; Jannach et al. 2011).
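A keyword-driven content-based recommender can be sketched in a few lines of Python. The sketch assumes the keywords have already been extracted, and it uses Jaccard overlap between keyword sets as the similarity measure; that measure, the sample catalogue, and all names are illustrative assumptions rather than details from the cited systems.

```python
def jaccard(a, b):
    """Overlap between two keyword sets, from 0 (disjoint) to 1 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_by_content(catalogue, liked, top_n=2):
    """Rank unseen items by their best keyword overlap with any liked item."""
    if not liked:
        return []  # nothing to compare against yet
    scores = {
        item: max(jaccard(keywords, catalogue[l]) for l in liked)
        for item, keywords in catalogue.items()
        if item not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical catalogue of items and their extracted keywords.
catalogue = {
    "dragon_quest": {"fantasy", "dragons", "adventure"},
    "wizard_school": {"fantasy", "magic", "school"},
    "space_war": {"scifi", "war", "space"},
    "elf_saga": {"fantasy", "adventure", "elves"},
}
print(recommend_by_content(catalogue, ["dragon_quest"]))
```

No ratings from other users are consulted at any point, which is why this kind of recommender behaves the same regardless of how many people use the system.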
2.3.2 Advantages
Content-based filtering has several advantages over collaborative methods. First of all, recommendation
accuracy does not depend on the number of system users or submitted ratings. In a collaborative system, recommendation
works best if there are more users in the system than items, or if each user has rated many items (Schafer et al. 2006).
Content-based recommenders do not have this requirement, and the system will operate just as accurately with one
or 100,000 users (Basu et al. 1998; Jannach et al. 2011). Furthermore, new items can be recommended as soon as
they are entered into the system, rather than waiting for users to supply ratings before the item can be recommended
to others.
2.3.3 Drawbacks
Content-based filtering also has several drawbacks. One common problem is the issue of recommending
unusual or novel content (Kim et al. 2005). Since recommendation is based only on the content of items the user
already likes, content-based recommenders are not able to recommend serendipitous items. Rather, content-based
recommenders tend to overspecialize and return items very similar to ones the user has already seen.
Another significant problem is the question of whether textual content and keywords are enough to judge the
usefulness or interestingness of an item. A poorly written article and a well-written article on the same subject will
likely use exactly the same keywords. Increasingly, information on web pages is also contained in
multimedia elements like video that cannot be automatically extracted. While research into extracting information
from these sources is ongoing, it is still a young field (Jannach et al. 2011). Researchers agree that in most fields,
manual insertion of this kind of data is impractical.
There is some possibility of this changing, however. A phenomenon colloquially known as 'Web 2.0' has
emerged largely due to the prominent role social networks play on the Internet today. Web 2.0 is characterized by the
large amount of user-contributed content, often attached to an object like an article or video (Belém et al. 2013). A
common form of this user-generated content is tags. Users voluntarily tag and group content on many
sites for the purposes of social sharing, and next-generation recommenders may be able to utilize these tags as data
in content-based recommenders (O'Reilly 2007; Belém et al. 2013).
2.4 Common Issues
There are two major problems that show up frequently in the recommender system field. These are called
the cold start problem and the data sparsity problem, and both are related to the distribution of ratings within a
system.
2.4.1 The Cold Start Problem
The cold start problem is created when a new user or item is added to a recommender system (Schafer et al.
2006). Both collaborative and content-based algorithms have this problem. In a collaborative system, a new item that
has not been rated yet cannot be recommended, and if it has only been rated a small number of times the
recommendation may not be accurate (Aharon et al. 2013; Lousame et al. 2009). Similarly, a new user who has not
rated any items yet will not be able to receive any recommendations at all. The problem also applies to users with
unusual tastes, for whom it may be difficult for the system to compute recommendations reliably (Lousame et al. 2009).
There are several popular solutions to the cold start problem. Some algorithms augment a sparse user
profile by merging it with 'trusted' profiles. Trusted profiles are ones that the user has decided provide valuable
ratings. This allows more robust recommendations to be made for a larger number of users without significant loss
in accuracy (Guo 2013).
There are non-algorithmic solutions as well. Some systems, like the joke recommender Jester, require users
to rate a small initial set of items to kick-start the service (Herlocker et al. 2000). Others display general
recommendations based on overall popularity until the user makes their own ratings. Still other systems ask for
supplementary content information, such as what genre of movie a user is interested in, or place the user in a
demographic category and use that information as a general recommendation starting place (Schafer et al. 2006).
All of these options allow the system to make some kind of recommendation immediately to new users.
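The popularity-based fallback mentioned above might look like the following Python sketch. The threshold of three ratings, the use of rating counts as a popularity measure, and all names are assumptions chosen for illustration; `personalised` stands in for whatever recommender the system normally uses.

```python
from collections import Counter

def popular_items(ratings, top_n=3):
    """Items ranked by how many users have rated them (a simple popularity proxy)."""
    counts = Counter(item for user_ratings in ratings.values() for item in user_ratings)
    return [item for item, _ in counts.most_common(top_n)]

def recommend_with_fallback(ratings, user, personalised, min_ratings=3):
    """Use the personalised recommender only once the user has enough ratings."""
    if len(ratings.get(user, {})) < min_ratings:
        return popular_items(ratings)
    return personalised(ratings, user)

# Hypothetical data: "new" has just signed up and has rated nothing.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1},
    "new": {},
}
placeholder = lambda r, u: ["personalised result"]  # stand-in recommender
print(recommend_with_fallback(ratings, "new", placeholder))
print(recommend_with_fallback(ratings, "u1", placeholder))
```

A new user therefore always receives something, and the system switches to personalised results automatically once the profile is large enough.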
2.4.2 Data Sparsity
The other major problem in recommender systems, data sparsity, is one of the main bottlenecks for
collaborative filtering. Commercial sites with recommender agents often have catalogues of tens of thousands of
items. Even very active users will be able to rate only a fraction, sometimes less than 1%, of the total catalogue, and
even very popular items will not be rated very many times (Lousame et al. 2009; Sarwar et al. 2001). This results in
systems working off of a nearly empty rating matrix and providing poor recommendations or no recommendations
at all. This is particularly problematic if the main draw of the overall system is personalized recommendations,
because users will not stay long enough to add more ratings (Schafer et al. 2006).
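To make "nearly empty" concrete, the short Python calculation below applies the idea to the MovieLens figures quoted in section 2.2.2 (100,000 ratings, 1,000 users, 1,700 movies). The function name is hypothetical, and the figures are the rounded ones given in the text.

```python
def matrix_density(num_ratings, num_users, num_items):
    """Fraction of the user-item rating matrix that is actually filled in."""
    return num_ratings / (num_users * num_items)

density = matrix_density(100_000, 1_000, 1_700)
print(f"{density:.1%} filled, {1 - density:.1%} empty")
# prints: 5.9% filled, 94.1% empty
```

Even this well-populated research dataset leaves roughly nineteen of every twenty user-item cells blank, and a commercial catalogue with tens of thousands of items would be far emptier.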
One of the simplest solutions to the data sparsity problem is to take advantage of additional information
besides ratings in making recommendations (Ronen et al. 2013). By adding demographic information or some
knowledge of the content of items, the effect of sparse ratings can be lessened. This is especially helpful for the
ramp-up phase of a new system. To extend the idea further, a collaborative algorithm may be combined with another
category of recommender system, content-based algorithms. This is known as a hybrid system, and will be discussed
in section 2.6.
2.5 Knowledge-based recommendation
Knowledge-based recommendation systems are a category of recommender often utilized where a
high-risk, highly personalized recommendation is necessary, such as buying a new car. The system will prompt the user
with a series of questions about what item(s) they are looking for, such as the maximum price they are willing to pay
or any required features. The system then interactively determines and suggests the item that best fits their
requirements (Jannach et al. 2011).
Unlike content and collaborative methods, knowledge-based systems have no ramp-up problem (Burke
1999). They require no user history, only that the user provides the necessary information to the recommender as
they step through it. Recommendations are calculated by finding similarities between customer requirements
and items in the catalogue, or on the basis of explicit requirement rules (Burke 1999). For example, one way to build
a knowledge-based recommender is to use the customer-specified information to construct a query against the item
catalogue. The most relevant results of the query are then presented to the user.
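The query-style approach can be sketched in Python as a filter over an in-memory catalogue. The car records, the field names, and the decision to treat the cheapest matching item as the most relevant are all assumptions made for this illustration; a real system would define relevance according to its own domain rules.

```python
def knowledge_based_recommend(catalogue, max_price, required_features, top_n=3):
    """Keep items that satisfy every stated requirement, cheapest first."""
    matches = [
        item for item in catalogue
        if item["price"] <= max_price
        and required_features <= item["features"]  # subset test on sets
    ]
    ranked = sorted(matches, key=lambda item: item["price"])
    return [item["name"] for item in ranked[:top_n]]

# Hypothetical catalogue for the car-buying example.
cars = [
    {"name": "hatchback", "price": 18000, "features": {"automatic", "bluetooth"}},
    {"name": "sedan", "price": 24000, "features": {"automatic", "bluetooth", "sunroof"}},
    {"name": "coupe", "price": 31000, "features": {"manual", "sunroof"}},
]
print(knowledge_based_recommend(cars, 25000, {"automatic"}))
print(knowledge_based_recommend(cars, 25000, {"sunroof"}))
```

Nothing in the function depends on past users or ratings, which is why a recommender built this way works from the first request.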
2.6 Hybrid algorithms
Hybrid algorithms combine two or more of the methods outlined above. Different algorithms have different
strengths and weaknesses, and by combining two or more methods, the strengths of one can compensate for the
weaknesses of the other. This makes hybrid algorithms more usable and secure as a rule than a pure recommendation
method (Jannach et al. 2011). Most commonly, a collaborative algorithm is combined with a content or
knowledge-based system to alleviate the cold-start problem (Burke 2002; Basu et al. 1998; Kim et al. 2005). For example, in
2006 Netflix announced its famous Netflix Prize, a competition to improve the company's recommendation
accuracy by 10%. After failing to solve the problem individually, three teams combined their algorithms and finally
won the contest, with the strengths of some systems accommodating weaknesses in others (Siegel
2013).
The effectiveness of a hybrid algorithm depends on the method and types of algorithms used. For example,
content/collaborative algorithms will always suffer from the cold-start problem, since this is an issue that both
content-based and collaborative algorithms have. On the other hand, combining one of these methods with a
knowledge-based algorithm could be effective, since knowledge-based methods do not have the problem of ramp-up
(Burke 2002).
There are many different ways to combine two algorithms, and the following section will detail two major
methods: mixed and cascade. This list is by no means exhaustive or mutually exclusive, but rather serves to illustrate
different ways that recommendation methods can be combined.
2.6.1 Mixed Hybridization
Mixed hybridization is probably the least algorithmically complex of the different hybridization methods.
A mixed hybridization recommender runs two algorithms simultaneously and presents the results of
both algorithms to the user (Burke 2002). This works especially well when it is practical to make a large number of
recommendations to the user. One such example is recommending digital TV channels, as the PTV system did
(Smyth and Cotter 2001). The PTV system takes the content-based information about the textual description of TV
shows and the collaborative opinions of other users, combines the results, and presents them to the user.
To refine this method, some mixed hybridization methods add different weights to the results of the
component algorithms, and these can be adjusted as the system changes (Kim et al. 2005). The online newspaper
recommender developed by Claypool et al. (1999) adjusts its algorithm towards collaborative results as more users
join the system and collaborative recommendation becomes more accurate.
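A weighted combination of this kind might be sketched in Python as shown below. This is not the scheme used by Claypool et al.: the linear blend, the rule that ties the collaborative weight to the number of users, the saturation value of 1,000, and all names are hypothetical.

```python
def weighted_hybrid(collab_scores, content_scores, collab_weight):
    """Blend two score dictionaries; collab_weight lies between 0 and 1."""
    items = set(collab_scores) | set(content_scores)
    blended = {
        item: collab_weight * collab_scores.get(item, 0.0)
        + (1 - collab_weight) * content_scores.get(item, 0.0)
        for item in items
    }
    # Highest blended score first; ties broken alphabetically for stable output.
    return sorted(blended, key=lambda item: (-blended[item], item))

def collab_weight_for(num_users, saturation=1000):
    """Shift trust toward collaborative results as the user base grows."""
    return min(1.0, num_users / saturation)

# Hypothetical scores produced by the two component algorithms.
collab = {"A": 0.9, "B": 0.2}
content = {"B": 0.8, "C": 0.6}
for users in (0, 500, 5000):
    print(users, weighted_hybrid(collab, content, collab_weight_for(users)))
```

With no users the ranking follows the content scores alone, and as the user count grows the same call gradually hands control to the collaborative scores.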
2.6.2 Cascade Hybridization
The second method of algorithm hybridization discussed is cascade hybridization, in which the results of one
algorithm are fed into and refined by another. For this method, the ordering of the algorithms is critical, as any
potential recommendations left out by the first method will not be seen by the second (Burke 2002; Kim et al. 2005).
Cascade hybridization is typically more efficient than a mixed hybridization, since it will only run the second
algorithm on a subset of the available items (Burke 2002).
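The effect of ordering can be seen in a small Python sketch. The two scoring tables stand in for real collaborative and content-based algorithms, and the shortlist size and all names are assumptions for illustration.

```python
def cascade(candidates, first_stage, second_stage, keep=2, top_n=1):
    """Score everything with first_stage, then re-rank only its survivors."""
    shortlist = sorted(candidates, key=first_stage, reverse=True)[:keep]
    return sorted(shortlist, key=second_stage, reverse=True)[:top_n]

# Hypothetical scores from two component recommenders.
collab = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.2}
content = {"A": 0.2, "B": 0.3, "C": 0.99, "D": 0.9}
items = list(collab)

print(cascade(items, collab.get, content.get))   # collaborative first
print(cascade(items, content.get, collab.get))   # content-based first
```

Running the collaborative scores first discards C and D before the content stage ever sees them, so the two orderings return different items from identical inputs. The second stage also evaluates only `keep` items rather than the whole catalogue, which is the efficiency gain described above.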
One example of this kind of system is a content-based filter that views collaborative information as another
kind of item data. A collaborative algorithm is run to determine valid user clusters, and then traditional
content-based methods are performed on this data, along with other features of the items being recommended. The movie
recommender system developed by Basu et al. (1998) is one such example, and it was able to significantly improve
on the accuracy of a purely collaborative approach.
2.7 Conclusion
Recommender systems are a growing field, and research will continue to expand as its commercial
applications are further developed. This section gave an overview of the basic recommendation strategies,
collaborative and content-based, as well as a description of some common problems and parameters that go into
these systems. The next section will discuss the recommender system developed from this research for the
Webcomic Companion.
3 DESIGN
3.1 Webcomics: A Case Study
As a way to study recommendation, a core piece of this project is a proof-of-concept for a recommender
system developed for webcomics. The items being recommended are a critical component of the design of a system,
and they will have great effect on which recommendation strategies are viable and the ultimate success of the
website.
So why webcomics? First of all, it is an area in which the author has experience. Knowledge about the
recommended items and why users prefer one item over another is critical for designing a good system. Secondly,
webcomics are well suited in many ways to automatic filtering. They are digital media largely freely available, so
they are easy to link and manage in a database. People who follow one comic usually follow several others, so there
is potential to gather large amounts of collaborative data from each user. Finally, the webcomic market stands to
benefit greatly from a recommendation service. The barrier for entry to the webcomics market is extremely low,
resulting in a large number of abandoned or low-quality comics. This makes it difficult for readers to find quality
material and for artists to gain exposure for their work.
3.1.1 Influential Sources
Throughout the process of research for this project, several other resources that offer functionality similar
to the Webcomic Companion were discovered. These sites have greatly influenced the system's design, and they are
described briefly here to show what aspects have been mimicked or adapted.
Personal Blog
Long before the beginning of this project, the author created a small webpage on a social media account
that collected all of the author's favorite comics in one place. This site served as the prototype for the Webcomic
Companion, and many design decisions were based on aspects of this page that did or did not work well. The page
included the title and author for each webcomic, a brief description, a small sample of art, and a link to the comic's
homepage. Each link opened in a new tab, so it was easy to scroll down the list and click on whichever comics the
user wanted to check. The page was visually pleasing and allowed other users who visited the page to see which
comics the author enjoyed. However, it was time-consuming to add new comics to the page, so after the initial
creation of the page, new comics were often navigated to directly rather than added to the list. As a result, with time
the page has become more and more outdated.
This site spawned the initial idea to make a website that allowed easy creation of these comic lists and
utilized that information to make recommendations. As other sites were encountered during research, they were
always compared against the basic standard of a simple comic listing page.
Comic Rocket
During the course of research for the Webcomic Companion, the author found a website called Comic
Rocket that implemented many of the features intended for the Webcomic Companion (Comic Rocket 2014). Comic
Rocket allows users to sign up for an account and add comics that they follow to this account. Comic Rocket tracks
the updates of these comics and also offers personalized recommendations to the user.
,hile it 6as at first disa--ointing to find a 6ebsite in e8istence that had im-lemented much more than
could be done 6ithin the sco-e of an undergraduate thesis -ro.ect( the site has -ro0ed 0aluable for the design of the
,ebcomic Com-anion) In using Comic Roc5et( the author 6as able to determine 6hich features 6or5ed 6ell 6ithin
a fully7fledged site and 6hich features could benefit from alteration)
The major strength of Comic Rocket is that it automatically tracks the user's comic updates. This is especially convenient for comics that a user follows casually and has not memorized the update schedule for. However, an update notification on Comic Rocket often lags several hours or days behind the actual comic update, so the author often used the original personal comic links to check for updates to favorite comics. Adding comics to a Comic Rocket list is also relatively easy, though it would be more convenient if there were some way to add multiple comics at once. Adding each comic individually while populating an initial list can be tedious.
The site has an integrated recommendation feature, and the author has found several new comics through it. However, it is not a transparent system, and users cannot adjust parameters and search again if none of the recommendations catch their eye. There is also no way to indicate that a user likes one comic more than another, which is something that was integrated into the Webcomic Companion.
3.2 Selected Content
The Webcomic Companion algorithm takes advantage of both collaborative and content-based data to inform its recommendations. While collaborative data is independent of the items being recommended, content is not. Content data is time consuming and difficult to collect, so only those features that most influence user preference should be included. In order to illustrate how each of these aspects will affect the design of the algorithm, each is outlined briefly here.
3.2.1 Content Descriptions
Update schedule: This is the rate at which the comic author or authors typically post new content. Depending on the type of comic, anywhere from one to three times per week is typical. Particularly with art-intensive comics, longer stretches between updates are not unusual, but some readers find it frustrating to read comics that go for weeks without updates. Additionally, some comics have the special flag of "on hiatus," which means that the comic is not currently being worked on and no new updates may ever be forthcoming. Similarly, a completed comic has no new updates because the author has finished the story.
Form: This is the length of a typical story arc in the comic. The types included in the Webcomic Companion are gag-a-day, short story, and graphic novels. Gag-a-day comics, such as xkcd, have no continuous story and may be read in any order. Short to medium form comics consist of a series of small arcs that are each completed in usually a dozen or so pages, such as typical newspaper comics like Calvin and Hobbes. Graphic novels usually have one large storyline that may continue for hundreds of pages. Readers tend to favor either gag-a-day or graphic novel story forms, with short story falling in the middle.
Genre: This is the genre or group of genres that categorize the comic. This aspect is especially relevant for long-form story comics. For this project, genre is considered in terms of large-scale comic setting and story content. In order to determine the genres included in the database, the online comic database Comixpedia, the genre listing on Webcomic Rocket, and the content tags on inkOutbreak were consulted [Comixpedia 2014, Webcomic Rocket 2009, King 2014]. After some tweaking, the final list of genres is:
Action/Adventure, Autobiography, Crime, Drama, Fantasy/Mythology, Gaming/Nerd Culture, Historical, Horror, Humor, LGBTQIA, Mystery, Parody/Satire, Political, Romance, Sci-Fi, Slice of Life, and Superhero.
3.2.2 Excluded Elements
There are several aspects of webcomics that are important in determining personal preference that have been intentionally excluded. Artistic style is a huge influence for many people, but it is so difficult to quantify meaningfully that it is not included in the scope of the project. Similarly, such elements as themes, humor, or writing sophistication are not represented in the content side of this project's algorithm. Instead, the collaborative components of the algorithm should take some of these factors inherently into account.
Similarly, the Webcomic Companion algorithm does not have knowledge of "negative" ratings, or comics that a user does not like. It has been shown that taking into account information about which items a user does not want to see can significantly improve a system, but it also adds complexity to the algorithm [Montaner et al. 2003]. Moreover, reasons for not liking a particular comic can be complex, and the author did not wish to unintentionally damage recommendation quality through poor processing of negative data.
3.3 Website Design
The Webcomic Companion is an online recommender, which means that it is accessed through a browser on a user's local computer. The website is the go-between for all user interaction with the database, including browsing, maintaining a list of comics, and receiving recommendations. The website is designed to be simple, powerful, and easy to use.
3.3.1 Use Case Diagram
To illustrate the functionality of the Webcomic Companion website, a Use Case diagram is presented in Figure 1. A Use Case diagram displays the functions that a system needs to be able to perform and how these functions are related to each other. As the diagram shows, the Webcomic Companion website is designed to be used in two main capacities: browsing, which can be done by anyone who accesses the site, and maintaining a personal comic list, which is available to users that have an account. Each of the use cases is explained below, along with a symbol key.
Actor symbol: This symbol illustrates a human actor that will interact with the system.
Use Case symbol: This symbol represents a Use Case. Use Cases are particular behaviors or capabilities of the system.
Uses arrow: This symbol connects a base Use Case with another Use Case that must be implemented during the execution of the base case. The arrow always points away from the base case.
Extends arrow: This symbol connects a base Use Case with another Use Case that may be optionally implemented during the execution of the base case. The arrow always points to the base case.
Fig. 1. The Webcomic Companion website Use Case Diagram
Register: This is the ability for a user to create a new account in the Webcomic Companion. Users provide a username and password, and this allows the website to keep track of their personal list of comics.
Log In: This allows users to tell the website who they are. When a user logs in, they enter a session that lets the website keep track of information particular to that person. Some website functionality such as personalized recommendations is only available to logged in users.
Browse: This allows a user to search the database. Casual browsing is done by webcomic title, and if the webcomic is present in the database all information about it is returned to the user.
Advanced Search: This allows a user to search for comics filtered by content as well as using a seed comic in the similarity matrix.
Add New Comic: This is the way that new comics are added to the master webcomic list in the database. Users fill out a form that asks for the webcomic title, authors, homepage, update schedule, form, genres, and whether the comic contains any adult content.
Edit Comic List: This is the way that users add, rank and remove comics from their personal list. For any webcomic that has been added to the master table, users enter the title and the webcomic will appear on their list. Webcomics can be removed from a user list in a similar fashion.
Get Recommendation: This feature allows a logged in user with at least one comic on their personal list to receive a list of the top 10 comics recommended to them by the system. Clicking on a link in their personal list will bring the user to a page that presents links to all 10 comics.
3.3.2 User Motivation
The most important component of the Webcomic Companion website is the ability to provide recommendations to users. In order to do this, it must be able to easily gain and store information about individual users. However, as discussed in the earlier section on Explicit Feedback, users are usually highly reluctant to provide information for a system.
With this in mind, the site is designed to allow as much functionality with as little user investment as possible. For example, the database may be searched without logging in, and an advanced search is available on the homepage of the site that utilizes the similarity table in part. This way, even users who are not interested in creating an account can take advantage of some form of recommendation.
Users should find the browsing function useful enough that they take the steps to receive personal recommendations. The motivation for a user to add comics to their personal page is twofold. First of all, it gathers useful information about each comic and hyperlinks to their homepages in one place. This is a convenient way to check for updates, since users can simply scroll through the list and click on a comic link to navigate to the home page. In this aspect, it is very similar to the personal comic page that predates the Webcomic Companion. Secondly, by giving the system knowledge of their personal taste, a user is able to receive recommendations for new comics they would enjoy.
3.4 Algorithm Structure
For this project, two different algorithms were implemented in the website. The first and primary algorithm is an item-based collaborative hybrid algorithm. This algorithm operates primarily on users' comic lists. The second algorithm is a knowledge-based algorithm that users can utilize without signing in. The knowledge-based algorithm is presented on the front page and will help draw users into the site and find initial comics.
3.4.1 Primary Algorithm Selection Criteria
For this project, the main algorithm is an item-based collaborative hybrid algorithm. This is the algorithm that makes personal recommendations to users based on their comic lists. The algorithm is based on collaborative data between different webcomics in the database, but it also takes advantage of some content information such as update schedule and form.
A collaborative algorithm was chosen over a content-based algorithm due to the project content. Webcomics are image rather than text based, and it is difficult to extract information about them automatically. Domains such as movies and music almost all use some form of collaborative algorithm when they make recommendations, and webcomics are more closely related to these items than to articles and websites that often use content-based algorithms.
Within the sphere of collaborative algorithms, the system could be implemented with user or item based collaboration. User-based algorithms initially seemed like the best choice, since these were some of the first algorithms implemented and are straightforward to understand. However, with more research, an item-based algorithm turned out to be more appropriate for a project like the Webcomic Companion. First of all, user-based collaboration is computationally expensive, with the computational complexity increasing with the number of members and items in the system (on the order of MN in the worst case, for M users and N items) [Linden et al. 2003]. The website Comic Rocket, as of December 2013, lists the number of webcomics in their database as 36,335, so hypothetically the Webcomic Companion catalogue could become quite large. Methods that limit computational expense, such as random sampling of customers, clustering, or discarding some items, tend to decrease the accuracy of the algorithm [Linden et al. 2003].
Item-based collaboration, on the other hand, scales independently of the number of users in a system. It computes offline the similarity of different items in an extremely time-intensive process (on the order of N^2 M at worst). However, once the similarity table is built, recommendations can be made in real time by queries to the table. Another advantage to item-based collaboration is that reasonable recommendations can be made based on only a few items, which is essential in an application where users may only read a handful of comics [Linden et al. 2003].
The primary algorithm integrates components of content-based methods as well. While pure content-based filtering would not work well for webcomics, adding some content-aware functionality to a recommendation algorithm can help improve it considerably [Jannach et al. 2011]. Hybrid algorithms also tend to be more accurate and more robust against attacks than either approach alone. Several features that would be easy to determine, such as update schedule and genre, were added to each webcomic entry to allow this kind of content awareness in the algorithm.
3.4.2 Primary Algorithm Overview
The central algorithm of the website is implemented as a hybrid item-based collaborative and content-based algorithm. The primary data source for the algorithm is the collection of all users' comic lists. Comic lists are created when a user adds a comic that they read to their personal comic page. In the Webcomic Companion, the system utilizes a user's comic list as representative of their personal taste.
3.4.2.1 Adjusted Cosine Similarity Value
In order to make item-based recommendations, the Webcomic Companion uses adjusted cosine similarity to measure the similarity between a pair of comics. This is the standard measure for item-based collaboration and has been shown to be the most accurate [Schafer et al. 2006; Jannach et al. 2011]. The cosine similarity between two items represented as vectors a and b is defined as follows:
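In standard notation (a reconstruction from the verbal definition in the text; the symbol choices are an assumption about the original typesetting), the measure reads:

```latex
\[
\mathrm{sim}(\vec{a}, \vec{b}) = \cos\theta
  = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}
\]
```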
Equation 1. Cosine similarity between two vectors
This equation shows the similarity between a and b as the dot product between the two vectors divided by the product of their magnitudes. The final result is a value between 0 and 1, with 1 representing two comics with identical ratings and 0 representing two comics with no ratings in common. Equation 2 shows how this would be calculated for two comics a and b that had common ratings 1 through n.
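Expanded over the n common ratings, a sketch of that calculation (with a_i and b_i denoting the i-th common rating of each comic, a labeling assumed here) is:

```latex
\[
\mathrm{sim}(\vec{a}, \vec{b}) =
  \frac{\sum_{i=1}^{n} a_i b_i}
       {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```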
Equation 2. Sample cosine similarity calculation
The adjusted cosine similarity value is calculated in the same way as the cosine similarity value, except each rating is slightly adjusted. Before the calculation is performed, each user's average rating is subtracted from all of their ratings in the original ratings table. This accounts for differences in general rating behavior between users. For the adjusted cosine similarity value, the similarities range from 1 to -1, with 1 representing two identically rated items and -1 representing comics with opposite ratings.
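As a minimal sketch of the calculation just described, the following Python function (Python is used here for illustration; the site itself is written in PHP) computes the adjusted cosine similarity between two comics. The function name, the nested-dictionary layout, and the sample ratings are hypothetical and are not taken from the project's code.

```python
from math import sqrt

def adjusted_cosine(ratings, comic_a, comic_b):
    """ratings maps user_id -> {comic_id: rating}.

    Returns a value in [-1, 1], or None when the similarity is undefined
    (no user rated both comics, or one side has zero variance).
    """
    num = sum_a = sum_b = 0.0
    for user_ratings in ratings.values():
        if comic_a in user_ratings and comic_b in user_ratings:
            # Subtract this user's average rating before comparing.
            mean = sum(user_ratings.values()) / len(user_ratings)
            da = user_ratings[comic_a] - mean
            db = user_ratings[comic_b] - mean
            num += da * db
            sum_a += da * da
            sum_b += db * db
    if sum_a == 0 or sum_b == 0:
        return None
    return num / (sqrt(sum_a) * sqrt(sum_b))

# Hypothetical sample: three users, ratings on a 1 (best) to 3 scale.
sample = {
    1: {66: 3, 78: 3, 91: 2},
    4: {66: 3, 78: 2, 82: 1, 91: 1},
    6: {66: 1, 82: 2},
}
same = adjusted_cosine(sample, 66, 66)
opposed = adjusted_cosine(sample, 66, 82)
```

In this sample, comics 66 and 82 are rated in opposite directions by the two users who rated both, so the result is negative, while a comic compared with itself scores 1.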
3.4.2.2 Recommendation Design
Once the similarity values for all pairs of comics have been stored inside the database, they can be accessed in real time by the website to make recommendations to users. Webcomics are selected based on their similarity to webcomics present in the user's comic list. This is determined by both their adjusted cosine similarity value and their content. At most 10 comics are returned for each recommendation, though the actual number may be less for users with very sparse lists.
3.4.2.2.1 Collaborative
The bulk of the primary recommendation algorithm's results are returned through collaborative methods. To do this, the recommender takes comics present on the active user's comic list and looks them up in the similarity table that holds the adjusted cosine similarity values. It finds those webcomics with the highest adjusted cosine similarity to comics present on the user's list but not already in it and returns them as recommended items.
To further improve the quality of the recommendations, not all items on the user's list are weighted equally in the recommendation process. Users may rank each of their comics with a score of 1 to 3, with 1 being the most favorable. The recommendation algorithm considers those comics similar to rank 1 comics first, rank 2 comics second, and rank 3 comics last. Since a maximum of 10 comics are recommended, the results are weighted towards those comics that the user has rated most highly.
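The rank-ordered lookup described above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the project's PHP: the function name, the dictionary shapes, and the tie-breaking rule (keep the best similarity seen for a candidate within a rank tier) are all hypothetical.

```python
def recommend(user_list, similarity, limit=10):
    """user_list maps comic_id -> rank (1 is most favorable, 3 least).
    similarity maps comic_id -> {other_comic_id: similarity value}.
    Returns up to `limit` comic ids, filling from rank 1 neighbors first.
    """
    picks = []
    for rank in (1, 2, 3):
        candidates = {}
        for comic, comic_rank in user_list.items():
            if comic_rank != rank:
                continue
            for other, sim in similarity.get(comic, {}).items():
                # Skip comics the user already lists or that are already picked.
                if other in user_list or other in picks:
                    continue
                candidates[other] = max(sim, candidates.get(other, sim))
        # Most similar candidates of this rank tier go first.
        for other in sorted(candidates, key=candidates.get, reverse=True):
            if len(picks) == limit:
                return picks
            picks.append(other)
    return picks

# Hypothetical data: comic 1 is a top favorite, comic 2 is rank 3.
my_list = {1: 1, 2: 3}
sims = {1: {3: 0.9, 4: 0.2, 2: 0.5}, 2: {5: 0.99, 3: 0.1}}
result = recommend(my_list, sims)
```

Note that comic 5 lands last despite its high similarity value, because it is only related to a rank 3 comic; this is the weighting toward favorites that the text describes.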
3.4.2.2.2 Content-based
The Webcomic Companion recommendation algorithm also contains a content-based component, though this is secondary to the collaborative piece. Content-based recommendations only come into play if the collaborative algorithm is unable to recommend 10 items.
Because of the difficulty in collecting data, very little content data about each comic is stored in the database. This means that content-based methods are likely to be less accurate than collaborative ones, so the content-based piece of the recommendation algorithm exists only to fill the cracks between collaborative ratings.
However, the content-based side of the algorithm does have a valuable contribution to make. Since it does not rely on the collaborative similarity values, content-based methods can make recommendations to users with small lists or obscure tastes more easily than collaborative methods can. Also, since content-based recommendation does not require any users to have previously rated an item in order to recommend it, this component of the algorithm allows more obscure comics to be recommended to users, which will help counter the more popular comics that are likely to appear on most users' lists. Figure 2 (shown below) gives a conceptual outline of how the collaborative and content-based components of the algorithm will work together.
Fig. 2. Conceptual diagram of the primary item-based collaborative recommender
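One way the content-based fill-in step could work is sketched below in Python. The text does not specify how content matches are scored, so the scoring rule here (one point per shared genre plus one for a matching form) is purely an assumption for illustration, as are the function and field names.

```python
def fill_with_content(picks, user_list, comics, limit=10):
    """Top up collaborative `picks` with content matches.

    user_list is a collection of comic ids the user already reads.
    comics maps comic_id -> {"form": str, "genres": set of str}.
    """
    liked_forms = {comics[c]["form"] for c in user_list}
    liked_genres = set().union(*(comics[c]["genres"] for c in user_list))
    scores = {}
    for comic_id, info in comics.items():
        if comic_id in user_list or comic_id in picks:
            continue
        # Assumed scoring: shared genres plus a bonus for a familiar form.
        score = len(info["genres"] & liked_genres) + (info["form"] in liked_forms)
        if score > 0:
            scores[comic_id] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return picks + ranked[: max(0, limit - len(picks))]

# Hypothetical catalogue of five comics.
catalogue = {
    1: {"form": "gag-a-day", "genres": {"Humor"}},
    2: {"form": "graphic novel", "genres": {"Fantasy/Mythology"}},
    3: {"form": "gag-a-day", "genres": {"Humor", "Sci-Fi"}},
    4: {"form": "graphic novel", "genres": {"Horror"}},
    5: {"form": "short story", "genres": {"Humor"}},
}
filled = fill_with_content([2], {1}, catalogue)
```

Collaborative picks always keep their place at the front of the list, which matches the secondary role the text assigns to content data.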
3.4.3 Secondary Algorithm
The second algorithm, the knowledge-based algorithm, takes advantage of the item-matrix created by the main collaborative algorithm as well as MySQL queries into the database. The purpose of the knowledge-based algorithm is to provide initial recommendations to users with no information in the system. One frustrating aspect about the Comic Rocket recommendation site was how long it took to start receiving good recommendations. Comic Rocket required manually searching for every comic a user read. This is a classic example of the cold-start problem.
In order to avoid this problem on the Webcomic Companion, a knowledge-based recommender is available on the front page. The recommender allows users to enter criteria for any of the content components that are tracked, as well as a "seed comic." This will allow users to take advantage of the recommender in a low-effort, low-commitment way, which will ideally make the site more useful and help draw in more users. This algorithm is also much less computationally complex than the primary algorithm, and really just sets parameters for a simple MySQL select statement. Because of this, it will not be discussed in the implementation section.
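To make the idea of "setting parameters for a select statement" concrete, the sketch below builds such a query from optional criteria. It uses Python with an in-memory SQLite database as a stand-in for PHP and MySQL; the column names (Title, Form, UpdateSchedule, Similarity) and the function name are assumptions, since the text does not list the actual columns.

```python
import sqlite3

def build_query(form=None, update_schedule=None, seed_comic_id=None, limit=10):
    """Return (sql, params) for a parameterized search over hypothetical tables."""
    sql = "SELECT c.ComicID, c.Title FROM comics c"
    where, params = [], []
    if seed_comic_id is not None:
        # A seed comic restricts results to its neighbors in the similarity table.
        sql += " JOIN similaritymatrix s ON s.ComicID2 = c.ComicID"
        where.append("s.ComicID1 = ?")
        params.append(seed_comic_id)
    if form is not None:
        where.append("c.Form = ?")
        params.append(form)
    if update_schedule is not None:
        where.append("c.UpdateSchedule = ?")
        params.append(update_schedule)
    if where:
        sql += " WHERE " + " AND ".join(where)
    if seed_comic_id is not None:
        sql += " ORDER BY s.Similarity DESC"
    sql += " LIMIT ?"
    params.append(limit)
    return sql, params

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE comics (ComicID INTEGER PRIMARY KEY, Title TEXT, Form TEXT, UpdateSchedule TEXT);
CREATE TABLE similaritymatrix (ComicID1 INTEGER, ComicID2 INTEGER, Similarity REAL);
INSERT INTO comics VALUES (1, 'Seed', 'gag-a-day', 'weekly');
INSERT INTO comics VALUES (2, 'Alpha', 'gag-a-day', 'weekly');
INSERT INTO comics VALUES (3, 'Beta', 'graphic novel', 'weekly');
INSERT INTO similaritymatrix VALUES (1, 2, 0.4);
INSERT INTO similaritymatrix VALUES (1, 3, 0.9);
""")
sql, params = build_query(form="gag-a-day", seed_comic_id=1)
rows = db.execute(sql, params).fetchall()
```

Every user-supplied value travels as a bound parameter rather than being pasted into the SQL string, in keeping with the prepared-statement practice the paper describes for the rest of the site.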
3.5 Conclusion
Careful consideration is needed in order to design a strong recommendation system. This section outlined the design specification for the Webcomic Companion content, website, database and algorithm. Section 4 shows how this design was translated into a working proof-of-concept, including implementation details and code samples.
4 IMPLEMENTATION
This section details the technical implementation of the project. The Webcomic Companion is a public web site anyone in the world can access through a HyperText Markup Language (HTML) browser connected to the Internet. This section covers the implementation of the database, website, and recommendation algorithm used by the web site. The HTML design is not addressed.
4.1 Website
The first part of implementation for this project was creating the skeleton of the website. HTML, Cascading Style Sheets (CSS) and PHP files were written in a standard editor. No additional tools were used to form a template or basic HTML framework. The fundamental site pages are the home page, the user's My Comics page, and the Browse page.
4.1.1 Site Use Case
As an introduction to the implementation of the Webcomic Companion, a script is provided of how a user would log into the website and receive a recommendation.
1. Create an Account
From the homepage at:
http://janzene.cs.spu.edu/thewebcomiccompanion/
Click on the Register link in the upper right hand corner of the page. This will bring you to a new screen, the Member Register Page. In the text entry boxes provided, type in a username and password, and confirm that the password is correct. Click Submit. If the username is already taken or the two passwords do not match, the webpage will display an error. If so, enter new values into the text fields and try again. If the username is free and the passwords match, your account data will be entered into the database. The website will log you in and navigate you to Step 3.
2. Log In
This step assumes you have already created a user account by performing Step 1 at an earlier time. From the home page, click the Sign In link, located next to the Register link in the upper right hand corner. In the Member Log In screen, enter your username and password. If the username does not exist or the password is incorrect, you will be asked to try again. Otherwise, you will be logged in and taken to the member home page.
3. View List
After logging in, the website will navigate you to your personal webcomic list. This page is only accessible to logged in users. This page allows a user to maintain a list of webcomics that they read. For all comics that the user has added to their list, the page will display the title, authors, comic information, and a hyperlink to the comic's home page. Thumbnails of comic art are not displayed due to the complexity of storing images in a database.
4. Add Comics
To add a webcomic to your list, simply enter the comic's title as it appears on the webcomic's home page in the Title To Add text box. Select a rank for the comic. Rank allows the user to group their comics into three preference tiers, Top Favorite, Excellent, and Fun Stuff, with Top Favorite being the highest rank and Fun Stuff being the lowest. There is no limit to how many comics may be in each rank. Click Submit. If the comic does not exist in the database, you will receive an error. If it does exist, you will be redirected back to the My Comics page, and the new comic will be present in your list. Some sample comics to add to a starting page are xkcd, Bad Machinery, Hark! A Vagrant, Johnny Wander, and Monsterkind.
5. Get Recommendation
Once you have added at least one comic to your list, the website can return recommendations to you using the primary recommender. The more comics you have on your list, the more refined the recommendations will be. To receive a recommendation, click on the Get Recommendation link located on the My Comics page. This will take you to a new page that lists up to 10 webcomics with information and hyperlinks. These comics are ordered roughly by preference, so start at the top and work your way down. Adding or removing comics from a list will alter the list of recommended comics, and it will change over time as more comics and lists are added to the database by other users.
4.1.2 Website Navigation
In the interest of keeping the site easy to use, the number of pages on the Webcomic Companion is small. The core functionality of browsing, logging in, and receiving recommendations is achieved with only a few pages. Figure 3 provides a page-level map of the website.
Fig. 3. Page navigation diagram for The Webcomic Companion
In addition to the pages discussed in the site use script, the website includes a Links page and an About page. The Links page lists other sites similar to the Webcomic Companion that users may be interested in exploring. The About page gives a summary of the project as a whole and provides more information about where this paper and other resources about recommendation can be accessed.
4.1.3 Website Login Implementation
Since an explanation of the code that manages the entire site would be redundant and impractical, the steps required for a user to log in to the website will be explained as an example of how the PHP for the website works as a whole. To see the code that performs the following steps, see Appendix A Login PHP Code.
Certain pages in the Webcomic Companion are only accessible to logged in users, such as the My Comics page. To implement this security feature, login information is handled through the $_SESSION PHP global variable. At the beginning of any members-only page, the $_SESSION variable is checked for a username and user ID. If these are not present, the page redirects to the login screen.
4.1.3.1 Log In Screen and Processlogin
To log in, a user must enter their username and password in the log in screen. These values are passed through a form post from the login page to the processlogin PHP script. The PHP script first checks that both fields are present. If this is successful, it calls the function LogIn from the databaseUtil class and passes it the username and password.
4.1.3.2 DatabaseUtil
DatabaseUtil is a PHP class that handles a large number of the database transactions in the website. Within LogIn, databaseUtil opens a connection object to the Webcomic Companion database. It then uses a prepared statement to issue a query that selects all entries from the user table that have the given username and password. All communications with the database that involve user input happen through prepared statements to protect against SQL injection attacks. If this query is successful, it returns the UserID. If not, it passes back nothing and the LogIn function sets the UserID equal to the failure flag of -1.
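The lookup performed by LogIn can be approximated as below. This is a Python and SQLite stand-in for the PHP and MySQL code in Appendix A, written only from the description above; the table and column names, the constant name, and the sample account are hypothetical. It deliberately mirrors the plain-text password comparison that the paper itself later lists as a shortcoming.

```python
import sqlite3

LOGIN_FAILED = -1  # failure flag described in the text

def log_in(db, username, password):
    """Return the matching UserID, or LOGIN_FAILED when no row matches."""
    row = db.execute(
        # Placeholders keep user input out of the SQL text itself.
        "SELECT UserID FROM users WHERE Username = ? AND Password = ?",
        (username, password),
    ).fetchone()
    return row[0] if row else LOGIN_FAILED

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (UserID INTEGER PRIMARY KEY, Username TEXT, Password TEXT)")
db.execute("INSERT INTO users VALUES (7, 'reader', 'secret')")
```

Because the username and password are bound as parameters, an input such as `reader' --` is compared as a literal string and simply fails to match.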
4.1.3.3 Finalization
After receiving the result from LogIn, the processlogin script checks to see if the operation was successful. If so, it sets the $_SESSION username and userid variables to the appropriate strings and redirects the user to the My Comics page. If the operation was not successful, it displays an error message and returns the user to the login page.
4.2 Database
The website was created on a shared CS web server at Seattle Pacific University running MySQL v5.5.13 and PHP v5.3.6. The MySQL relational database resides on the SPU CS web server. As a Computer Science student, the author is granted access to a subsection of the CS server publicly available on the Internet. Therefore, creating the database was as simple as opening a connection to the CS server through MySQL Workbench and running create table commands.
4.2.1 Population
Users populate the database through the website. The website allows new users to register and manually add comic information to the database. The database was initially populated by hand with limited sample data. After launching the website, it ran smoothly enough that it was easier and safer to populate the database exclusively through the website. As of the time of this writing, no comic editing mechanism has been implemented in the website, so any changes to existing data must be made by the administrator in the database directly.
4.2.2 Schema
The Webcomic Companion database is a relational database consisting of 11 tables. See Figure 4 for a schema diagram. Within the database, several relations are of special note.
Fig. 4. Database schema for the Webcomic Companion
As the diagram shows, the comics table is central to the database design. This table stores all content information about each comic present in the database. The authors and genres of a comic are represented through many-to-many relationships with authors and genrename. This allows a comic to have multiple authors and an author to write multiple comics. Likewise, a comic can have multiple genres and a genre can represent multiple comics.
Comiclists, another central table, is the table that stores associations between users and comics. This information makes up each user's comic list, which is the starting place for the primary recommendation algorithm. Comiclists is part of a ternary relationship between comics, users and rank. This ternary relationship is really a many-to-many relationship between comics and users, but with the added information of what rank each user has associated with each comic.
Similaritymatrix is the final table that will be discussed here. Similaritymatrix is maintained through the primary recommendation PHP script. It is this table that is referenced in real time when a user on the site asks for a recommendation. It is indexed twice on the same foreign key ComicID. Each reference allows the table to store a similarity value between two comics. Each pair of ComicIDs is stored twice, once with a given comic as ComicID1 and once with it as ComicID2. While inefficient for storage space, this allows specific ComicIDs to be looked up much more easily.
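The trade-off of storing each pair in both directions can be seen in a short sketch. The Python and SQLite code below is an illustration only: the Similarity column name, the composite primary key, and both helper functions are assumptions rather than details taken from the project's schema.

```python
import sqlite3

def store_similarity(db, comic_a, comic_b, value):
    """Write the pair in both directions so either comic can be the lookup key."""
    db.executemany(
        "INSERT OR REPLACE INTO similaritymatrix VALUES (?, ?, ?)",
        [(comic_a, comic_b, value), (comic_b, comic_a, value)],
    )

def most_similar(db, comic_id, limit=10):
    """A single-column filter suffices because every pair appears under ComicID1."""
    return db.execute(
        "SELECT ComicID2, Similarity FROM similaritymatrix "
        "WHERE ComicID1 = ? ORDER BY Similarity DESC LIMIT ?",
        (comic_id, limit),
    ).fetchall()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE similaritymatrix ("
    "ComicID1 INTEGER, ComicID2 INTEGER, Similarity REAL, "
    "PRIMARY KEY (ComicID1, ComicID2))"
)
store_similarity(db, 66, 78, 0.5)
store_similarity(db, 66, 91, -0.2)
```

Without the mirrored rows, every lookup would need to check both columns (ComicID1 = ? OR ComicID2 = ?) and then work out which side held the neighbor; doubling the rows trades storage for that simpler query.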
4.2.3 Backup
There is not presently any additional automatic backup or disaster recovery for the Webcomic Companion database beyond those routines that exist for the entire SPU CSC server. Nor is the password field for each user encrypted. These are both changes that would be made to any future iteration of the database.
4.3 Algorithm Implementation
The implementation of the primary recommender in The Webcomic Companion involves two database tables (comiclists and similaritymatrix), two temporary two-dimensional arrays created in a PHP script, and a stored procedure in the database.
4.3.1 Gathering Data
The first part of the recommendation process involves fetching the starting data. Every night, the PHP script draws data from the comiclists table, which contains all ratings from the comic lists of all users on the site. This data is represented as a unique association between a UserID, ComicID, and RankID. Each row in the comiclists table corresponds to one entry in one user's comic list. Table 1 shows how this might be represented for 3 users rating 4 different webcomics.
In the PHP script, this data is fetched using a prepared statement and returned as a PHP data structure. Currently, the entire database table is loaded into the script at once and stored in local memory. In the future, the size of the table may make this impractical. As the database grows, a better strategy would be to load and process the data one piece at a time. For example, all the similarity values could be calculated for one comic, placed in the database, and then the next comic would be loaded, as opposed to loading and calculating the similarity between all comics at once. However, for a proof of concept, the small size of the database allowed the entire table to be loaded into memory at once without the added complexity of multi-stage processing.
Table 1: Sample data in the database table comiclists
UserID    ComicID    RankID
1         66         3
1         78         3
1         91         2
4         66         3
4         78         2
4         82         1
4         91         1
6         66         1
6         82         2
4.3.2 Generating the Rating Matrices
Once the PHP script has a copy of the data in the comiclists table, it must transfer it to a useable data structure. This takes the form of an N x M two dimensional array rating_array, where N is the number of comics with at least one rating and M is the number of users who have rated at least one comic. Each individual rating is stored in the appropriate rating_array[n][m] subscript. The array is indexed by ComicID and UserID, so the ratings for the webcomic with the ComicID of 88 are stored in rating_array[88][1..M], where 1..M represents all of the UserIDs for users that have rated comic 88. Conversely, all of the ratings for the user with the UserID of 7 are stored in rating_array[1..N][7], where 1..N represents all of the ComicIDs user 7 has rated.
Table 2: Sample rating table generated in PHP script from the sample database table
ComicID \ UserID    1      4      6
66                  3      3      1
78                  3      2      n/a
82                  n/a    1      2
91                  2      1      n/a
While this rating_array data structure is easy to create and access, several adjustments are necessary for the script to run smoothly. First of all, ComicIDs start at 66 rather than 0, so to save space 66 is subtracted from all ComicIDs before they are indexed into rating_array.
Secondly, because not all users have rated comics, not all comics have been rated, and all users have not rated the same set of comics, the array subscripts are not continuous. This means that there are some rows, columns and cells in rating_array that are not allocated. While this problem is avoided with a simple check to see if an area has valid data while iterating through rows, columns or cells, it means that rating_array is allocated much more space than it actually uses.
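The structure described above can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP, and the names rows and build_rating_matrix are assumptions made for the example, not identifiers from the original script. It uses a nested dictionary keyed by ComicID and then UserID, so only cells that actually hold a rating are stored.

```python
from collections import defaultdict

# Sample rows from Table 1: (UserID, ComicID, RankID)
rows = [
    (1, 66, 3), (1, 78, 3), (1, 91, 2),
    (4, 66, 3), (4, 78, 2), (4, 82, 1), (4, 91, 1),
    (6, 66, 1), (6, 82, 2),
]

def build_rating_matrix(rows):
    """Group ratings as ratings[comic_id][user_id] = rank."""
    ratings = defaultdict(dict)
    for user_id, comic_id, rank in rows:
        ratings[comic_id][user_id] = rank
    return dict(ratings)

rating_matrix = build_rating_matrix(rows)

# All ratings for comic 66, keyed by user (one row of Table 2)
print(rating_matrix[66])

# All ratings given by user 4, keyed by comic (one column of Table 2)
print({comic: users[4] for comic, users in rating_matrix.items() if 4 in users})
```

Because missing cells are simply absent from the dictionaries, this variant would not need the ComicID offset or the validity check, at the cost of slower lookups than a flat array.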
For a regular cosine similarity algorithm, the data shown in Table 2 would be sufficient. However, an additional step of processing takes place when calculating the adjusted cosine similarity. In Table 2, the ratings are represented exactly as they exist in the database. In Table 3, the corresponding user's average rating is subtracted from each rating before it is stored in the rating_array. This compensates for differences in average user behavior. For example, Table 2 shows that User 1 tends to give comics lower ratings while User 4 tends to give comics higher ratings. Table 3 has adjusted for this difference, which gives more accurate results when the similarity value is calculated.
Table 3: Sample rating table that takes the user average adjustment. Notice mix of positive and negative ratings.
ComicID / UserID (subtracted user average)   1 (2.67)   4 (1.75)   6 (1.5)
66                                           0.33       1.25       -0.5
78                                           0.33       0.25       -
82                                           -          -0.75      0.5
91                                           -0.67      -0.75      -
Notice that each user's highest ratings are now listed as negative because the higher ratings are stored as the lower numbers. At first, it might seem unintuitive that the most favorably rated comics have the lowest scores. However, it is important to remember that recommendations are not made based on average comic ratings, but on similar rating patterns between comics. Whether negative ratings represent a favorable or unfavorable rating is not important, as long as it is consistent between ratings.
The PHP script stores the average adjusted ratings in rating_array. Each user's average rating is calculated and stored in an array beforehand, and is subtracted from each rating as it is added into rating_array.
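The user average adjustment can be illustrated with a short Python sketch. It is not the project's PHP code; the function names user_averages and adjusted_ratings are assumptions chosen for the example. Each user's mean rank is computed first and then subtracted from every rating as it is placed into the matrix.

```python
from collections import defaultdict

# Sample rows from Table 1: (UserID, ComicID, RankID)
rows = [
    (1, 66, 3), (1, 78, 3), (1, 91, 2),
    (4, 66, 3), (4, 78, 2), (4, 82, 1), (4, 91, 1),
    (6, 66, 1), (6, 82, 2),
]

def user_averages(rows):
    """Return {user_id: mean rank over all comics that user has rated}."""
    totals = defaultdict(lambda: [0, 0])  # user_id -> [sum of ranks, count]
    for user_id, _comic_id, rank in rows:
        totals[user_id][0] += rank
        totals[user_id][1] += 1
    return {user: total / count for user, (total, count) in totals.items()}

def adjusted_ratings(rows):
    """Return adjusted[comic_id][user_id] = rank minus that user's average."""
    averages = user_averages(rows)
    adjusted = defaultdict(dict)
    for user_id, comic_id, rank in rows:
        adjusted[comic_id][user_id] = rank - averages[user_id]
    return dict(adjusted)

averages = user_averages(rows)
adjusted = adjusted_ratings(rows)

# Rounded to two places, these match the header and first row of Table 3
print({user: round(avg, 2) for user, avg in sorted(averages.items())})
print({user: round(val, 2) for user, val in sorted(adjusted[66].items())})
```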
4.3.3 Calculating the Adjusted Cosine Similarity
Once all of the ratings have been stored in the proper form in rating_array, the ratings are used to calculate the adjusted cosine similarity between each pair of comics in the table. To see how the adjusted cosine similarity matrix is mathematically calculated, refer to section 3.4.1. To see the code that executes this in the PHP script, refer to Appendix B Similarity Matrix Generation.
4.3.3.1 Table Interpretation
Since the adjusted cosine similarity is calculated for every pair of comics, the results are stored in an N x N table, where N is the number of comics that have at least one rating. The similarity values range from 1 to -1, where 1 represents comics that have identical ratings and -1 represents comics that have opposite ratings. Table 4 shows the similarity values generated from the sample adjusted rating matrix in Table 3.
As one would expect, the row of ones down the diagonal shows where each comic was compared with itself. Naturally, these comparisons result in the highest similarity value possible. The table is also mirrored, with the values in the lower left side of the table reflected in the upper right. While this is an inefficient way to handle the data, it is a simpler algorithm to implement and allows for easy storage in the database.
Table 4: Comic to comic similarity table
ComicID   66      78      82      91
66        1       0.79    -0.98   -0.89
78        0.79    1       -1      -1
82        -0.98   -1      1       1*
91        -0.89   -1      1*      1
As mentioned in section 4.3.2 Generating Rating Matrices, the scalar values (or magnitude) of the adjusted ratings do not directly correspond to the similarity values. For example, the ratings shared between Comics 91 and 82 in Table 3 were all negative, and the ratings shared between Comics 78 and 66 were all positive. But since both of these comic pairs had ratings similar to their partners, both similarity values are positive in Table 4.
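To make the calculation concrete, the following Python sketch computes a similarity for every comic pair from the adjusted ratings in Table 3. It assumes the common formulation of adjusted cosine similarity, in which the sums run only over users who rated both comics; it is an illustration, not the PHP code from the appendix, and the function names are assumptions.

```python
import math

# Mean-centered ratings from Table 3: adjusted[comic_id][user_id]
adjusted = {
    66: {1: 1 / 3, 4: 1.25, 6: -0.5},
    78: {1: 1 / 3, 4: 0.25},
    82: {4: -0.75, 6: 0.5},
    91: {1: -2 / 3, 4: -0.75},
}

def adjusted_cosine(a, b):
    """Cosine of two comics' adjusted ratings over their shared users.

    Returns None when the comics share no users or a norm is zero.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in shared))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)

def similarity_matrix(adjusted):
    """Return {(comic_i, comic_j): similarity} for every computable pair."""
    sims = {}
    for i in adjusted:
        for j in adjusted:
            value = adjusted_cosine(adjusted[i], adjusted[j])
            if value is not None:
                sims[(i, j)] = value
    return sims

sims = similarity_matrix(adjusted)
print(round(sims[(66, 78)], 2), round(sims[(66, 82)], 2), round(sims[(66, 91)], 2))
```

Like the table it illustrates, this sketch stores each pair twice, once in each order, trading memory for a simpler loop.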
4.3.3.2 Low Rating Adjustment
Ignoring the diagonal row of ones, the comic pairs with the highest similarity values are 66 / 78 and 82 / 91. Between these, 82 / 91 appears to be a perfect match. However, Table 4 shows that the similarity value between comics 82 / 91 (marked by *) was based on only one shared rating, the rating from User 4. In fact, in instances where there is only one shared rating between two comics, the similarity value is guaranteed to have a magnitude of one, whereas if more ratings were present the similarity value would likely be much lower. In a large database where ratings tend to be sparse, comics with only one shared rating are extremely common. This results in a large number of single-rating similarity values taking highest preference in recommendations.
To solve this problem, any similarity value with fewer than a certain number of ratings is multiplied by a linear fraction related to some constant c [Jannach et al. 2011]. In the PHP script, any similarity value calculated from fewer than 3 ratings is multiplied by r/3 before being stored in the table, where r is the number of ratings. The constant c will increase from its current value of 3 after more data is present in the database. As a future enhancement, the value of c may be dynamically determined. In the example shown in Table 5, the similarity values marked with a star were multiplied by ½, since only two ratings are available. This allows the similarity of the two comics to still be reflected while lowering their importance relative to similarity values with more data behind them.
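The damping rule can be sketched in a few lines of Python. This is an illustration that takes the r/c form literally, with c as a parameter; the function name damp_similarity and the example values are assumptions, not taken from the PHP script.

```python
def damp_similarity(similarity, shared_ratings, c=3):
    """Scale a similarity linearly when it rests on fewer than c ratings."""
    if shared_ratings < c:
        return similarity * shared_ratings / c
    return similarity

# A value backed by enough ratings passes through unchanged
print(damp_similarity(0.79, shared_ratings=5))

# A value backed by a single rating keeps its sign but loses weight
print(damp_similarity(-0.9, shared_ratings=1))
```

Keeping c as a parameter reflects the stated plan to raise it, or derive it dynamically, once the database holds more ratings.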
One common aspect of the implemented similarity matrix that Table 5 does not reflect is the presence of two comics that have been rated but have no users in common. Obviously, no similarity value can be calculated for such pairs. This is reflected by an unallocated space in the matrix, and no similarity value will exist in the database for that pair.
It is worth noting that any real data set is much less dense than the sample one presented here. In this data set, each user has rated on average 75% of all rated comics. By contrast, in the Webcomic Companion database each user has rated an average of 28% of all rated comics, and even this is an artificially inflated value due to the small number of comics entered in the database.
4.3.4 Storing the Results
Once the similarity value for all comic pairs has been calculated, the results are stored back in the database in the table similaritymatrix. The similaritymatrix table is indexed by a compound index of ComicID1 and ComicID2. At this step, all ComicIDs have 66 added to them to return them to their original form after being referenced as array subscripts. For each comic pair, the corresponding adjusted cosine similarity value based on user-supplied ratings is stored in the Similarity column. Like the similarity table in the PHP script (Table 4), the similarity value between each comic pair (A, B) is stored twice in the database, once for ComicID1, ComicID2 (A, B) and once for ComicID1, ComicID2 (B, A).
Table 5: Sample database table similaritymatrix
ComicID1   ComicID2   Similarity
66         78         0.79
66         82         -0.98
66         91         -0.89
78         66         0.79
78         82         -1
78         91         -1
82         66         -0.98
82         78         -1
82         91         0.5
91         66         -0.89
91         78         -1
91         82         0.5
As in the PHP algorithm, the tradeoff is between memory space and algorithm complexity. While this version takes double the space, it makes the table much easier to reference when searching for comics that correlate to each other. With all comics present in each column, when searching for comics similar to a specific ComicID, ComicID1 can always be the reference column and ComicID2 can be the result column. If each pair were only represented once, both columns would have to be searched in turn to determine if a given ComicID was present in the table. Depending on which order the comic pair was stored in, ComicID1 or ComicID2 could contain the related comic the user is searching for. This would require a much more complex query and a longer response time for the recommender stored procedure.
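The benefit of storing each pair twice can be seen in a small sketch. It uses Python with an in-memory SQLite database as a stand-in for the project's MySQL database, which is an assumption made so the example is self-contained; the column names follow Table 5, while the helper name similar_to is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE similaritymatrix ("
    " ComicID1 INTEGER, ComicID2 INTEGER, Similarity REAL,"
    " PRIMARY KEY (ComicID1, ComicID2))"
)

# One value per unordered comic pair, taken from Table 5
pairs = {(66, 78): 0.79, (66, 82): -0.98, (66, 91): -0.89,
         (78, 82): -1.0, (78, 91): -1.0, (82, 91): 0.5}

# Store every pair in both orders, (A, B) and (B, A)
for (a, b), sim in pairs.items():
    conn.executemany(
        "INSERT INTO similaritymatrix VALUES (?, ?, ?)",
        [(a, b, sim), (b, a, sim)],
    )

def similar_to(conn, comic_id):
    """Comics most similar to comic_id, best first; only ComicID1 is searched."""
    cursor = conn.execute(
        "SELECT ComicID2, Similarity FROM similaritymatrix"
        " WHERE ComicID1 = ? ORDER BY Similarity DESC",
        (comic_id,),
    )
    return cursor.fetchall()

print(similar_to(conn, 91))
```

With single storage, the same lookup would need a condition on both columns plus logic to pick whichever column holds the other comic.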
4.3.5 Fetching Recommendations
The database table similaritymatrix is the table that is referenced when actual recommendations are made. The PHP script that creates this table is computationally expensive, so it runs only once a day to update similaritymatrix with any changes that result from updated user comic lists. When a user asks for a recommendation, the website calls the stored procedure adjusted_item_cosine(UID) and sends it the requesting user's UserID as a parameter. This allows recommendation to happen in real time that would not otherwise be possible with a large database. The MySQL stored procedure itself can be found in Appendix C Stored Procedure.
The stored procedure takes the UserID and finds all comics present on the user's list from comiclists. It then looks up these comics in the ComicID1 column in the similaritymatrix table, sorts by Similarity value and returns the ComicID2 associated to the highest one. This select technique is performed at most three times, with each select looking at comics with progressively lower ranks. If the higher ranks do not produce enough results, the lower ranks are utilized. The fourth selection is done based on the dominant content elements in each user's list. The results of these four select statements are unioned together and the top 10 are returned and displayed to the user. Since this is done dynamically each time the user requests a recommendation, changes to the user's comic list are reflected immediately as different recommendations will be returned.
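The collaborative part of this lookup can be approximated in Python as follows. This is a simplified sketch rather than the stored procedure itself: it makes a single pass instead of three rank-based selects, omits the content-based select entirely, and the function name recommend is an assumption.

```python
# One similarity per unordered comic pair, taken from Table 5
pairs = {(66, 78): 0.79, (66, 82): -0.98, (66, 91): -0.89,
         (78, 82): -1.0, (78, 91): -1.0, (82, 91): 0.5}

# Mirror each pair so every comic appears in the first position
sims = {}
for (a, b), value in pairs.items():
    sims[(a, b)] = value
    sims[(b, a)] = value

def recommend(user_comics, sims, limit=10):
    """Rank unlisted comics by their best similarity to any listed comic."""
    best = {}
    for (comic_a, comic_b), value in sims.items():
        if comic_a in user_comics and comic_b not in user_comics:
            if comic_b not in best or value > best[comic_b]:
                best[comic_b] = value
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [comic for comic, _ in ranked[:limit]]

# A user whose list holds only comic 66
print(recommend({66}, sims))

# A user whose list holds comics 82 and 91
print(recommend({82, 91}, sims))
```

Because the ranking is recomputed on every call, adding or removing a comic from the list changes the output immediately, which mirrors the behavior described above.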
4.4 Test Results
To measure the accuracy of the Webcomic Companion's primary algorithm, several small tests were run on the user lists. Since it would be impractical to ask a user to evaluate each item on a recommendation list to see if they liked it, an easier strategy to test the accuracy of a recommender is to remove a portion of items from a user's list, re-calibrate the similarity values, and see how many of the original items from the list are recommended back to the user. Since the lists were generated by people who read webcomics, their tastes have been established based on personal experience and not an algorithm that randomly produced sample data.
4.4.1 Results of Accuracy Test
Table 6 displays the results for five different users that were tested. For each user, 7-8 comics or 50% of their list was removed, whichever value was smaller. Accuracy is based on how many of these removed comics showed up in the list of 10 recommended comics.
Table 6: Results of accuracy test
UserID   Number of comics in list   Percent of comics removed (number)   Percent of removed comics returned (number)
1        12                         50% (6)                              50% (3)
3        32                         25% (8)                              50% (4)
4        31                         23% (7)                              29% (2)
7        10                         50% (5)                              0% (0)
8        4                          50% (2)                              0% (0)
Average return rate: 33%
As Table 6 suggests, users who have rated more comics tend to receive more accurate recommendations. However, this is not a strict pattern, and the small size of the database introduces a high amount of variability into the results. A much larger data set would be needed to find a more reliable recommendation rate. However, even on the test data set, the average return rate of 33% shows that, while not incredibly accurate, the algorithm does work better than simply random recommendations (which, given the very small data set, would return a success on average 19% of the time).
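The hold-out procedure used for this test can be sketched in Python. The ComicIDs below are hypothetical and the recommender output is simulated, so the printed value says nothing about the real system; the function names split_list and return_rate are likewise assumptions.

```python
import random

def split_list(comics, n_removed, seed=0):
    """Hold out n_removed comics from a user's list; returns (kept, removed)."""
    rng = random.Random(seed)
    removed = set(rng.sample(sorted(comics), n_removed))
    return set(comics) - removed, removed

def return_rate(removed, recommended):
    """Fraction of held-out comics that were recommended back."""
    if not removed:
        return 0.0
    return len(set(removed) & set(recommended)) / len(removed)

# Hypothetical ComicIDs, for illustration only
user_list = {101, 102, 103, 104, 105, 106}
kept, removed = split_list(user_list, n_removed=3)

# Simulated recommender output: two held-out comics plus two unrelated ones
recommended = sorted(removed)[:2] + [201, 202]
print(return_rate(removed, recommended))
```

In a real run, the similarity values would be regenerated from the reduced lists before recommendations are requested, as described in the test procedure.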
4.4.2 Collaborative vs. Content-based Results
Table 7 looks at only those recommendations that returned successful results. Of these, the successes are divided on whether they were generated by the collaborative or content-based portion of the algorithm.
Table 7: Collaborative vs. content-based successes
Number of successful returns via collaborative component   Number of successful returns via content-based component
7 (78%)                                                    2 (22%)
Further analysis into the accurate comics returned shows that nearly all of them were generated by the collaborative component of the algorithm. On one hand, this is likely due to the fact that the collaborative components are considered first and the content-based component only makes recommendations if there is not enough collaborative data to fill the first 10 slots. On the other hand, some users, such as user 8, had very small comic lists that resulted in almost all of their recommendations being content-based. This suggests that the content-based component of the algorithm is not as accurate as the collaborative component.
4.5 Conclusion
Overall, the Webcomic Companion design outlined in section 3 was implemented successfully. All parts of the system are functional and return few errors, and the algorithm is able to make successful recommendations. Though a 33% accuracy rate is well below industry standards and needs improvement, it does perform better than random recommendations. Most importantly, the entire system works together as a unit to provide a resource to the user.
5 ETHICS
5.1 Introduction
This section gives a brief overview of some of the ethical questions that relate to recommendation systems. At first, it may not seem like recommending movies and websites has much ethical relevance. However, recommender systems are fundamentally information filtering services, and the treatment of information is an ethically rich topic. For example, is it ethical for a news website to filter out all stories about ongoing wars for a user because that user never clicks on them?
This section will focus on two main ethical topics, data set privacy and the "filter bubble" effect that can be created by recommendation systems [Pariser 2013]. Ways that these issues apply to the Webcomic Companion will also be discussed.
5.2 Data Set Privacy
Data privacy is a huge issue in the digital age, one that far exceeds the scope of this paper. What privacy means, who owns data, the context in which data is used, and the aggregation of large amounts of data are all questions that people are trying to address [Tavani 2011; Spinello 2011; Tavani and Moor 2001]. This section will concern itself with a small subset of these questions, and that is the public release of anonymized data sets for academic purposes.
5.2.1 Definitions of Privacy
Some definition of privacy is necessary to begin. Many lawyers and philosophers have written on the subject of privacy, and some of these definitions apply particularly well to privacy online. Ruth Gavison's seclusion theory, for example, defines privacy as "the limitation of others' access to an individual" [Gavison 1984; Spinello 2011]. She looks at these limitations particularly in the context of secrecy, anonymity, and solitude. According to Gavison, privacy is a condition of restricted access to yourself and your information. The focus is on who is allowed to view your data in what context.
With the rise of Internet technology, privacy theories have shifted towards informational privacy. The control theory and the restricted access theory are two generic theories that stand out in modern literature [Spinello 2011]. Gavison's theory is an example of restricted access theory. Control theory, by contrast, suggests that privacy means having sole management over one's information. In the digital world, people often fear losing control over information that can be accessed by organizations. Jim Moor and Herman Tavani have combined these two threads into a synthesized theory described as restricted access/limited control [Tavani and Moor 2003, Spinello 2011]. The theory recognizes that personal information must sometimes be shared with others, and the proper use of privacy falls somewhere between total information isolation and total disclosure [Spinello 2011]. Questions about privacy online address topics like who has access to one's personal information and how individuals can control the way this information is viewed, stored, combined, sold and transformed [Tavani 2011]. This boils down to two main concerns: who can access your data, and what they are allowed to do with it.
A distinction should be raised here between security concerns and privacy concerns. In a privacy concern, a person may fear that they will lose control of personal information that they give to a company for legitimate purposes [Spinello 2011]. To breach the context in which personal data is allowed to be used breaks the privacy of the information itself [Tavani 2011]. For example, a person will give a company their address to sign up for a newsletter, and the company then sells the address to a spam company, or uses it to infer their socioeconomic status and discontinue their store credit card.
In a security concern, on the other hand, the worry is that unauthorized individuals will find a way to access personal information [Spinello 2011]. In the previous example, a hacker might break into the store database and take all of the addresses, which the company had not published. This section will concern itself mainly with the control of legitimately accessed information, rather than security.
5.2.2 Data Set Deanonymization
Under what circumstances can the contents of a database be publicly released? In order to support data mining research, such databases are more and more frequently being stripped of identifying information and released to the public. If the data set is anonymous, some argue, the privacy of the participants has not been violated. However, the problem with many databases containing micro-data, or information on individuals, is that individuals can be identified by cross-referencing the anonymous data with background information or public data sources [Narayanan 2008].
For example, in 2006 a Harvard research project collected personal profiles of 1,700 college-based Facebook users to investigate how their friends and interests changed over time [Lewis et al. 2008]. The data set used in their study was released to the public, and other researchers quickly discovered that the data set could be easily de-anonymized. The privacy of nearly 2,000 students was compromised, and the students were not even aware their data was being collected [Zimmer 2008, boyd 2011].
Another similar example came as a result of the famous Netflix Prize in 2006. Netflix offered a reward of one million dollars to any team that could improve on their movie recommendation algorithm by 10%. As part of this contest, a section of the Netflix movie ranking database was released to the public. The database consisted of the anonymous viewing history of many Netflix users, as well as the ratings they gave different movies.
Fig. 1. Probability of unique identification when adversary knows exact user ratings and approximate dates [Narayanan 2008]
A research team in Texas, however, was able to break the anonymity of the Netflix data set [Narayanan 2008]. Their paper shows that hardly any background information was necessary to uniquely identify a viewer. If an adversary is able to acquire the ratings on 6-8 movies that a viewer watched and a 14-day time period of when they watched them, there is a 98% chance that their records could be uniquely identified within the 'anonymous' data set. Knowing only the ratings on two movies and a 14-day time period still leaves a 40% chance of identification.
Since so little information is necessary to identify a record and it need not be exact, a determined adversary would likely be able to acquire the data necessary to target a specific individual. Narayanan used the IMDB database as the source of their auxiliary information. Using the information posted publicly on IMDB, the researchers were able to acquire the entire private Netflix viewing history of certain individuals.
While not an ideal state of affairs, this breach of privacy may seem relatively benign. After all, which movies a person has watched is not usually considered highly sensitive information. However, as Narayanan points out, the question is not, 'do most people care about the privacy of their viewing history,' but 'are there any people who care about the privacy of their viewing history?' In some cases, the ideological topics of films may be quite sensitive, and users may not want the public to know what movies they enjoy in private.
Relevance of the data aside, it should have been the users' decision to release their histories, not Netflix's. Legally, it is difficult to determine who really owns data generated online about individuals. Current laws are ambiguous, and many companies assume ownership rights by default [Spinello 2011]. Nevertheless, it is difficult to ethically defend releasing private information about individuals without their consent, particularly information that they may not have even been aware a company was collecting.
5.2.3 The Webcomic Companion Application
While the Webcomic Companion is not as far reaching as Google or Netflix, it is still a worthwhile exercise to consider how these ethical issues affect the implementation of the website. The website gathers and stores usernames, passwords, and lists of comics read. While none of these things are highly confidential, the Netflix case illustrates that even seemingly inconsequential and anonymous data needs to be protected. Ethically, measures should be taken to ensure as much privacy for the Webcomic Companion users as possible.
In accordance with this, usual security measures have been taken to protect the database from outside attack. Unfortunately, the scope of this project does not include a highly robust security system. As a safeguard against unintentional breaches of privacy, the project collects as little personal information as possible. Since the project is largely proof-of-concept and the cost of re-creating an account is small, no personal emails will be collected. Users will also be encouraged to use a unique password for the site to minimize the damage if database privacy should somehow be compromised.
In terms of the use of anonymous data, webcomic readership data will not be published. Data will be used internally for recommendation purposes and summary statistics may be presented on the website, but no individual user lists will be made available. This is to ensure that no user can be identified by their information in the same way that Narayanan et al. [2008] were able to do for the Netflix Prize.
5.3 The Filter Bubble
This section addresses some of the ethical ramifications of recommendation algorithms themselves, or 'personalization' as it is often referred to commercially. At first, this might not seem like a pressing ethical issue. As this research has already demonstrated, filtering is necessary to make the Internet usable, and each of us engages in massive filtering every day to make the world manageable. As Eli Pariser points out in his book The Filter Bubble, "As a customer, it's hard to argue with blotting out the irrelevant and unlikable" [Pariser 2011, 18].
But there have been several authors who have pointed out the dangers of over-filtering, of living exclusively inside a tailored bubble of information, or a filter bubble, as Pariser refers to it in his same-titled book on the subject [Pariser 2011; Sunstein 2001]. As filtering and recommendation become more pervasive, and the top five sites on the Internet (Yahoo, Google, Facebook, YouTube, and MSN) already use them extensively, the issues surrounding them will only become more relevant [Pariser 2013].
5.3.1 Personalization and Its Effects
Search engines and news sites are two filtering domains especially relevant to this topic. Google first introduced personalized search results as a test feature for its users in 2004, and in 2005 it became available to all users while signed into their Google accounts. Google Personalized Search returns results based not only on their relevancy to a user's search keywords, but based on their previous browsing history. In 2009, this feature was integrated into the main site, even for users who are not logged in [Horling and Kulick 2009]. In a small experiment to demonstrate this, two people with different browsing histories searched for 'Egypt' in 2011. One received news results about the ongoing protests in Egypt, the big story at the time, and the other received nothing about them [Pariser TED]. Personalization can have a huge effect on the information a person receives.
Returning helpful search results without distorting a person's flow of information is a difficult balance. Krishna Bharat, the creator of the personalized news site Google News, discussed this during a panel. He said that while Google News should promote the stories a user enjoys reading, to ignore important news for the sake of over-personalization would be disastrous [Bharat 2010]. The trouble with personalization is that it builds an information environment of "the adjacent unknown," information that is novel but very similar to things we already know and like [Pariser 2011]. This provides less room for the truly original or unconsidered information that makes up the foundation of learning and creativity. Left alone in a filter bubble, a person can get stuck in a narrow recycling of what they already know and like, an endless "you-loop" [Pariser 2011]. Perhaps the biggest problem with this filtered "you-loop" is that it does not accurately reflect reality. The search result you like the most may not be the most relevant or accurate one, and this distortion for the sake of user-friendliness can have unintended side effects.
In politics, for example, it is well-established that people are more likely to consume political media that confirms their existing beliefs [Hart et al. 2009]. Since educated people are more likely to follow political news, they will be consuming a disproportionate amount of reinforcing media. Therefore, more highly educated people can actually become mis-educated by the sources they choose to follow. This happened recently with the popular rumor that President Obama was a Muslim [Chait 2010]. Polls found that the number of people who believed President Obama was a Muslim was increasing, and the group with the largest percentage increase was Republicans with some college education or a college degree. Despite their higher amount of education, they were more likely to believe the rumor than Republicans with only a high school diploma [Chait 2010].
Selective filtering of this kind is not new. People have been choosing to consume biased media as long as media has existed, and it is common sense that consuming biased media will only make a person more biased. What selective filtering online does is make the mechanism automatic and far more pervasive [Pariser 2011]. The ease with which Internet filtering technology allows you to absorb information from only like-minded people makes it a potential breeding ground for polarization. And while many people may not use filtering to isolate themselves from other points of view, there are those who will. For a democratic nation, this could become a significant social problem. How will difficult compromises or issues be discussed if a nation's population is increasingly listening only to "louder echoes of their own voices" [Sunstein 2001]?
5.3.3 The Public Forum
Another way to look at this issue is through the lens of the public forum. The public forum is a right established by the First Amendment to peaceably assemble and petition the government. Streets and parks are considered public areas where speech is protected [Sunstein 2001]. Streets and parks are places where a wide diversity of people gather, and by protecting people's rights to protest in these areas, the government is protecting the speakers' right to access both places and people, even if the other people in the street do not want to hear it.
The importance of the public forum to a democracy is easy to understand. A level of shared experience, of making someone take notice of an opposing view even if they do not like it, is essential for a government where the population makes its own decisions [Sunstein 2001]. In many ways, general interest information sources act like a public forum of information, where people are likely to encounter information counter to their beliefs. And when all of a person's information is tailored specifically to their interests, this is difficult to come by.
5.3.4 Ethical Algorithms
There are authors who disagree with these ideas, and claim that even if filtering increases social polarization, the Internet is too new and is evolving too rapidly to make such claims about its long-term effects [Tavani 2011]. Nevertheless, it's clear that filtering algorithms can have a powerful effect on the information that people receive, and therefore their lives.
Pariser stresses the importance of programmers being aware of these issues when they design the systems. Algorithms can be built to take factors into account other than personal taste. The best algorithms, Pariser says, balance comfortable information, or information desserts, with challenging or novel information, information vegetables [Pariser 2013]. No one is sure yet what exactly a 'balanced' recommender system would look like, but it is an issue that should be considered while designing an algorithm.
5.3.5 The Webcomic Companion Application
The issue of the filter bubble relates directly to the design of the recommender algorithm within the Webcomic Companion. Should the central recommendation algorithm be keyed only to user preferences, or should it include thought-provoking or well-reviewed comics as well? A central goal of the Webcomic Companion is making it easier for users to access high-quality comics, and a key way to promote these would be through the recommender. However, there are those who simply prefer lower-end fare, and the website needs to be able to function for those users as well.
Likely, the website will end up mixing these strategies. The central recommender will be keyed largely to preference, while especially high-quality comics will be promoted in other areas of the site.
5.4 Conclusion
As the Internet and digital recommendation are still a young field, many of these topics are the subject of lively debate. In some cases there are not clear laws defining how information can be used. Regardless, it is important that the computer scientists and programmers building these systems are aware of the ethical implications of their work, so that they can take steps to make the systems that will filter our information not only accurate but socially responsible.
6 CONCLUSION
6.1 General Results
Overall, the Webcomic Companion has shown one way that it is possible to build a recommender for webcomics. The system was designed according to the items it recommends, choosing a collaborative method and integrating content-based components to alleviate known issues like the cold start problem.
The website and database communicate smoothly, allowing new data to be entered easily and securely into the database. Users can maintain a list of personal webcomics that they read, which is both a convenient way for users to check for comic updates and a way for the system to understand the user's taste. The site's algorithm is able to make informed recommendations to the user based on this list.
The small amount of test data present in the database shows that the algorithm has an average accuracy of 33%, which varies widely depending on the number of comics in a user's list. This number will likely continue to grow as more comics are rated and more data is added to the database.
6.2 Limitations of Study
6.2.1 Data Set
The main limitation for this project was the size of the data set. Even though only a small amount of information about each comic is collected, it is time consuming to enter each one by hand. Because of this, only the minimum amount of data required for the site to generate meaningful recommendations was entered. While it allowed successful construction of the recommendation algorithm, the accuracy of the algorithm has suffered.
Aside from its siBe( the distribution of data among the comic lists 6as also an issue) Since there are fe6
com-rehensi0e listings of high7/uality 6ebcomics a0ailable( the 6ebcomics entered into the database 6ere largely
those 5no6n about -re0iously by the author and the author@s -eers) This means that the data set is s5e6ed to6ards
the author@s -ersonal taste( 6hich may e8-lain some of the dis-arity bet6een the algorithms accuracy on different
user lists)
6.2.2 Content Aspects
Though the Webcomic Companion's primary recommendation algorithm includes a content-based piece, it is
rudimentary compared to the collaborative component. This is due partially to collaborative recommendation
lending itself better to media item recommendations, and partially to the small amount of data available for each
comic. In a more full-fledged recommender, the content-based piece would be more sophisticated and more fully
integrated with the primary recommender. Similarly, more complex content information such as artistic style could
be included in the database.
6.3 Further Research
If development on the Webcomic Companion continues in the future, there are several areas that would
benefit from improvement.
First of all, the recommendation algorithm would benefit greatly from a larger data set. This would include
more comics in the database and more user ratings. With more data, recommendations would be more accurate and
testing would be more reliable.
Related to this, further analysis of the operation of the primary algorithm would allow the implementation
to be improved. With a larger data set, much more thorough tests could be run on the algorithm to determine its
speed and what data it performs best on. This would allow focused improvements on the parts of the algorithm that
give the lowest-quality recommendations.
At present, the primary recommendation algorithm is a proof of concept rather than an industry-standard
recommender. The methods outlined in this project have barely scratched the surface of the techniques available to
more sophisticated recommenders. Neighborhood selection, probability-based recommendation, and model-based
recommending are all strategies that could significantly improve performance and have not been factored at all into
the current implementation [Lousame et al. 2009].
Finally, with more development time, the Webcomic Companion could benefit from integrating some of the
ideas outlined in the ethical section of this project. A more secure database and a smarter, more "ethical"
recommendation process would help improve the site.
6.4 Summary and Closing Remarks
In conclusion, this paper shows how a media recommender can be created from initial concept through
design and implementation. Though this study used webcomics as its operative item, the process would remain
similar for other kinds of media recommendation. As more information continues to be managed online,
recommendation will become a more and more powerful tool for navigating data. By learning how information is
tailored to them, users gain a broader understanding of the data landscape and how to better use the wealth of
information that defines the digital age.
REFERENCES
Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art
and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (June 2005), 734-749. DOI:
http://pages.stern.nyu.edu/~atuzhili/pdf/TKDE-Paper-as-Printed.pdf
Michal Aharon, Natalie Aizenberg, Edward Bortnikov, Ronny Lempel, Roi Adadi, Tomer Benyamini, Liron Levin, Ran Roth, and Ohad Serfaty.
2013. OFF-Set: One-pass Factorization of Feature Sets for Online Recommendation in Persistent Cold Start Settings. In Proceedings of the
7th ACM Conference on Recommender Systems. ACM Press, New York, NY, 375-378.
Amos Azaria, Avinatan Hassidim, Sarit Kraus, Adi Eshkol, Ofer Weintraub, and Irit Netanely. 2013. Movie Recommender System for Profit
Maximization. In Proceedings of the 7th ACM Conference on Recommender Systems. ACM Press, New York, NY, 121-128.
Chumki Basu, Haym Hirsh, and William Cohen. 1998. Recommendation as Classification: Using Social and Content-Based Information in
Recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 714-720.
Fabiano Belém, Rodrygo Santos, Jussara Almeida, and Marcos Gonçalves. 2013. Topic Diversity in Tag Recommendation. In Proceedings of the
7th ACM Conference on Recommender Systems. ACM Press, New York, NY, 141-148.
Krishna Bharat. 2010. Krishna Bharat discusses the past and future of Google News. Video. In Google News Blog (June 2010). Retrieved March
14, 2014 from http://googlenewsblog.blogspot.com/2010/06/krishna-bharat-discusses-past-and.html.
danah boyd and Kate Crawford. 2011. Six Provocations for Big Data (September 21, 2011). In Proceedings of A Decade in Internet Time:
Symposium on the Dynamics of the Internet and Society. Oxford Internet Institute. DOI:
http://softwarestudies.com/cultural_analytics/Six_Provocations_for_Big_Data.pdf
Robin Burke. 2002. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12, 4 (Nov. 2002),
331-370. DOI: http://josquin.cs.depaul.edu/~rburke/pubs/burke-umuai02.pdf
Robin Burke. 1999. Integrating Knowledge-based and Collaborative-filtering Recommender Systems. In AAAI Workshop on AI in Electronic
Commerce. AAAI Press, Menlo Park, CA, 69-72.
Jonathan Chait. 2010. How Republicans Learn That Obama Is Muslim. In New Republic (Aug. 2010). Retrieved March 14, 2014 from
www.tnr.com/blog/jonathan-chait/77260/how-republicans-learn-obama-muslim.
Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes, and Matthew Sartin. 1999. Combining Content-Based and
Collaborative Filters in an Online Newspaper. In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and
Evaluation. ACM Press, New York, NY.
Comic Rocket. 2009. Retrieved Dec. 17, 2013 from https://www.comic-rocket.com/.
Comixology. 2014. Iconology Inc. Retrieved Dec. 17, 2013 from https://www.comixology.com/browse-genre.
Tom Forester and Perry Morrison. 1994. Computer Ethics (2nd ed.). The MIT Press, Cambridge, MA.
Ruth Gavison. 1980. Privacy and the Limits of the Law. The Yale Law Journal 89, 3 (Jan. 1980), 421-471.
DOI: http://courses.ischool.berkeley.edu/i205/s10/readings/week11/gavison-privacy.pdf
David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. 1992. Using collaborative filtering to weave an information tapestry.
Commun. ACM 35 (Dec. 1992), 12, 61-70.
Guibing Guo. 2013. Sparsity and Cold Start for Recommender Systems. In Proceedings of the 7th ACM Conference on Recommender Systems.
ACM Press, New York, NY, 451-454.
William Hart, Dolores Albarracín, Alice H. Eagly, Inge Brechan, Matthew J. Lindberg, and Lisa Merrill. 2009. Feeling validated versus being
correct: A meta-analysis of selective exposure to information. Psychological Bulletin 135, 4 (July 2009), 555-588.
Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000
Conference on Computer Supported Cooperative Work. ACM Press, New York, NY, 241-250.
Bryan Horling and Matthew Kulick. 2009. Personalized search for everyone. In Google Official Blog (Dec. 2009). Retrieved March 11, 2014
from http://googleblog.blogspot.com/2009/12/personalized-search-for-everyone.html.
Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. 2011. Recommender Systems: An Introduction (1st ed.). Cambridge
University Press, Cambridge, NY.
Byeong Man Kim, Qing Li, Chang Seok Park, Si Gwan Kim, and Ju Yeon Kim. 2005. A new approach for combining content-based and
collaborative filters. Journal of Intelligent Information Systems 27, 1 (June 2006), 179-191.
Brian King. 2014. InkOutbreak. (March 2014). Retrieved March 15, 2014 from http://inkoutbreak.com/
Maurice de Kunder. 2014. World Wide Web Size. (Jan. 2014). Retrieved January 16, 2014 from http://www.worldwidewebsize.com/
Kevin Lewis, Jason Kaufman, Marco Gonzalez, Andreas Wimmer, and Nicholas Christakis. 2008. Tastes, ties, and time: A new social network
dataset using Facebook.com. Social Networks 30, 4 (Oct. 2008), 330-342.
Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing
7, 1 (Jan. 2003), 76-80. DOI: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Fabian P. Lousame and Eduardo Sanchez. 2009. A Taxonomy of Collaborative-Based Recommender Systems. In Web Personalization in
Intelligent Environments (1st ed.). Giovanna Castellano, Lakhmi C. Jain, and Anna Maria Fanelli (Eds.). Studies in Computational
Intelligence, Vol. 229. Springer Berlin Heidelberg, Berlin.
Miquel Montaner, Beatriz López, and Josep Lluís De La Rosa. 2003. A Taxonomy of Recommender Agents on the Internet. Artificial Intelligence
Review 19, 4 (June 2003), 285-330.
MovieLens. 2014. GroupLens (2014). Retrieved March 9, 2014 from http://grouplens.org/datasets/movielens.
Arvind Narayanan and Vitaly Shmatikov. 2008. Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize
Dataset). In SP '08 Proceedings of the 2008 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 111-125.
DOI: http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
Tim O'Reilly. 2007. What is Web 2.0: Design patterns and business models for the next generation of software. International Journal of Digital
Economics 65, 1 (March 2007), 17-37.
Eli Pariser. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. The Penguin Press, New
York, NY.
Eli Pariser. 2013. Eli Pariser: Beware Online "Filter Bubbles". Video. TED Talks. TED Conferences, New York, NY.
Royi Ronen, Noam Koenigstein, Elad Ziklik, and Nir Nice. 2013. Selecting Content-based Features for Collaborative Filtering Recommenders.
In Proceedings of the 7th ACM Conference on Recommender Systems. ACM Press, New York, NY, 407-410.
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In
Proceedings of the 10th International Conference on World Wide Web. ACM Press, New York, NY, 285-295.
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Analysis of Recommendation Algorithms for E-commerce. In
Proceedings of the 2nd ACM Conference on Electronic Commerce. ACM Press, New York, NY, 158-167.
J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2006. Collaborative filtering recommender systems. In The Adaptive Web:
Methods and Strategies of Web Personalization. Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl (Eds.). Lecture Notes in Computer
Science, Vol. 4321. Springer, Berlin Heidelberg, 291-324.
Eric Siegel. 2013. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (1st ed.). Wiley, Hoboken, NJ.
Barry Smyth and Paul Cotter. 2001. Personalized Electronic Program Guides for Digital TV. AI Magazine 22, 2 (Summer 2001), 89-98.
Richard A. Spinello. 2011. Cyberethics: Morality and Law in Cyberspace (1st ed.). Jones & Bartlett, Sudbury, MA.
Cass Sunstein. 2001. Republic.com (1st ed.). Princeton University Press, Princeton, NJ.
Herman T. Tavani. 2011. Ethics and Technology: Controversies, Questions, and Strategies for Ethical Computing (1st ed.). Wiley, Hoboken, NJ.
Herman T. Tavani and Jim Moor. 2001. Privacy Protection, Control of Information, and Privacy-Enhancing Technologies. Computers and Society
31, 1 (March 2001), 6-11.
Loren Terveen and Will Hill. 2001. Beyond Recommender Systems: Helping People Help Each Other. In Human-Computer Interaction in the
New Millennium (1st ed.). John M. Carroll (Ed.). Addison Wesley, Boston, MA.
Yongzheng Zhang and Marco Pennacchiotti. 2013. Recommending Branded Products from Social Media. In Proceedings of the 7th ACM
Conference on Recommender Systems. ACM Press, New York, NY, 77-84.
Michael Zimmer. 2008. More on the "Anonymity" of the Facebook Dataset - It's Harvard College. In MichaelZimmer.org Blog (Oct. 2008).
Retrieved March 9, 2014 from
http://www.michaelzimmer.org/2008/01/03/more-on-the-anonymityof-the-facebook-dataset-its-harvard-college.
APPENDIX A
1 Login PHP Code
This code is taken from three different PHP files on the Webcomic Companion website that handle the
function of logging a user into the website.
CODE SAMPLE 1
login.php
<form name="login" method="post" action="checklogin.php">
<strong>Member Login </strong>
<br>Username: <input name="myusername" type="text" id="myusername" />
Password: <input name="mypassword" type="password" id="mypassword" />
<input type="submit" name="Submit" value="Login"/></form>
check_login.php
session_start();
@$username = $_POST["myusername"]; //suppress error messages
@$password = $_POST["mypassword"]; //suppress error messages
$objDBUtil = new dbUtility;
if(empty($username) OR empty($password))
{
    print "Please enter your username and password.";
}
else
{
    //call database
    $userid = $objDBUtil->LogIn($objDBUtil, $username, $password);
    if($userid != -1)
    {
        $_SESSION['username'] = $username;
        $_SESSION['userid'] = $userid;
        header("location:mycomics.php"); //redirect
        exit; //stop the script once the redirect header has been sent
    }
    else
    {
        //the submitted credentials are not echoed back to the page
        print "<br>Invalid username or password";
    }
}
databaseUtil.php
function LogIn($objDBUtil, $username, $password)
{
    $db = $objDBUtil->Open(); //call to internal open function. includes mysqli
    //look up the stored password and user id for this username
    if(!$istmt = $db->prepare(
        "SELECT Password, UserID from users WHERE users.Username = ? "))
    {
        echo "Prepare didn't work";
        return -1;
    }
    if(!$istmt->bind_param('s', $username))
    {
        echo "Bind didn't work";
        return -1;
    }
    if(!$istmt->execute())
    {
        echo "Execute didn't work";
        return -1;
    }
    if(!$istmt->bind_result($passresult, $userid))
    {
        echo "Bind Result didn't work";
        return -1;
    }
    //fetch() returns null when no row matches, i.e. the username is unknown
    $found = $istmt->fetch();
    $istmt->close();
    if($found && $password === $passresult)
    {
        return $userid;
    }
    else
    {
        return -1; //the caller treats -1 as a failed login
    }
}
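The login routine above compares the submitted password directly against the value stored in the users table. Section 6.3 names a more secure database as future work. One common step in that direction, sketched here as an assumption rather than as part of the Webcomic Companion, is to store only a salted hash produced by PHP's built-in password_hash() and to check it with password_verify(). The helper names hashNewPassword and passwordMatches are hypothetical, and the in-memory $storedHash stands in for the Password column.

```php
<?php
// Hypothetical helpers; not part of the original databaseUtil.php.

// Called once at registration: returns the string to store in users.Password.
function hashNewPassword(string $plainPassword): string
{
    // PASSWORD_DEFAULT selects PHP's current default algorithm and adds a salt.
    return password_hash($plainPassword, PASSWORD_DEFAULT);
}

// Called at login in place of the plain equality test in LogIn().
function passwordMatches(string $submittedPassword, string $storedHash): bool
{
    return password_verify($submittedPassword, $storedHash);
}

// Usage with an in-memory value standing in for the database column.
$storedHash = hashNewPassword("correct horse");
var_dump(passwordMatches("correct horse", $storedHash)); // bool(true)
var_dump(passwordMatches("wrong guess", $storedHash));   // bool(false)
```

With this approach the database never holds the readable password, so a leaked users table would not directly expose login credentials.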
APPENDIX B
1 Similarity Matrix Generation
This is a piece of the PHP script adjusted_cosine.php that uses the comiclists table to populate the
similaritymatrix table. Ratings are fetched from the rating_array two-dimensional array, the adjusted cosine similarity
value is calculated between each pair of comics, and the value is stored in similarity_matrix.
CODE SAMPLE 2
$similarity_matrix = array(array());
for($c1 = 0; $c1 < $numcomics; $c1++)
{
    if(!isset($rating_array[$c1])) //skip over comic ids that were never allocated
    {
        continue;
    }
    //second comic
    for($c2 = 0; $c2 < $numcomics; $c2++)
    {
        if(!isset($rating_array[$c2]))
        {
            continue;
        }
        $dot = 0; //dot product
        $a = 0; //sum of squared ratings for the first comic
        $b = 0; //sum of squared ratings for the second comic
        $nr = 0; //number of ratings in common
        //iterate through ratings
        for ($r = 0; $r < $numusers; $r++)
        {
            //if a rating exists for that user for both comics
            if(isset($rating_array[$c1][$r]) && isset($rating_array[$c2][$r]))
            {
                //multiply this user's ratings of the first and second comic
                //add to running total
                $dot = $dot + ($rating_array[$c1][$r] * $rating_array[$c2][$r]);
                $a = $a + pow($rating_array[$c1][$r], 2);
                $b = $b + pow($rating_array[$c2][$r], 2);
                $nr++;
            }
        }
        $a = sqrt($a); //take square root of first comic squares
        $b = sqrt($b); //take square root of second comic squares
        if ($a * $b != 0)
        {
            $sim = $dot / ($a*$b); //find cosine similarity
        }
        else
        {
            $sim = 0;
        }
        if($nr < 3)
        {
            //adjust for small number of ratings in common
            $sim = $sim * (($nr)/3);
        }
        $similarity_matrix[$c1][$c2] = $sim;
    }
}
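The loop above computes a cosine over the ratings two comics have in common and then damps the result when fewer than three users rated both. In the adjusted cosine measure described by Sarwar et al. [2001], each rating is first offset by the mean rating of the user who gave it. Whether $rating_array was already mean-centered earlier in adjusted_cosine.php is not shown in this excerpt, so the following is only a sketch under that uncertainty. The function name adjustedCosine, the $ratings[comic][user] layout, and the sample values are assumptions made for illustration.

```php
<?php
// Sketch only: adjustedCosine() and the data layout are illustrative assumptions.
// $ratings[$comic][$user] holds a rating; a missing entry means "not rated".
function adjustedCosine(array $ratings, int $c1, int $c2, int $minCommon = 3): float
{
    // Mean rating per user across every comic that user rated.
    $sum = [];
    $count = [];
    foreach ($ratings as $byUser) {
        foreach ($byUser as $user => $rating) {
            $sum[$user] = ($sum[$user] ?? 0) + $rating;
            $count[$user] = ($count[$user] ?? 0) + 1;
        }
    }

    $dot = 0.0;
    $a = 0.0;
    $b = 0.0;
    $nr = 0;
    foreach ($ratings[$c1] ?? [] as $user => $r1) {
        if (!isset($ratings[$c2][$user])) {
            continue; // only users who rated both comics contribute
        }
        $mean = $sum[$user] / $count[$user];
        $d1 = $r1 - $mean;
        $d2 = $ratings[$c2][$user] - $mean;
        $dot += $d1 * $d2;
        $a += $d1 * $d1;
        $b += $d2 * $d2;
        $nr++;
    }

    $norm = sqrt($a) * sqrt($b);
    $sim = ($norm != 0) ? $dot / $norm : 0.0;

    // Same damping as the listing: scale down when few ratings are shared.
    if ($nr < $minCommon) {
        $sim *= $nr / $minCommon;
    }
    return $sim;
}

// Toy data: three comics (0, 1, 2) rated by three users (0, 1, 2).
$ratings = [
    0 => [0 => 3, 1 => 1, 2 => 2],
    1 => [0 => 3, 1 => 1, 2 => 3],
    2 => [0 => 1, 1 => 3, 2 => 1],
];
printf("%.3f\n", adjustedCosine($ratings, 0, 1)); // comics rated alike: positive
printf("%.3f\n", adjustedCosine($ratings, 0, 2)); // comics rated oppositely: negative
```

Subtracting each user's mean keeps a habitually generous or harsh rater from inflating every similarity score, which is the motivation Sarwar et al. give for the adjusted form.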
APPENDIX C
1 Recommendation Stored Procedure
This is the stored procedure that uses the values stored in similaritymatrix to make recommendations to
individual users. It is called from the website with the active user's UserID as its parameter.
CODE SAMPLE 3
BEGIN
SELECT distinct ComicID2 from (
    (SELECT distinct sm.ComicID2, 1 as "priority", sm.Similarity from janzene_twc.similaritymatrix as sm
    JOIN janzene_twc.comiclists as cl
    on (cl.ComicID = sm.ComicID1)
    WHERE (cl.UserID = UID) and cl.RankID = 1 and sm.Similarity > .334 and sm.ComicID2 Not In
    (SELECT ComicID from janzene_twc.comiclists WHERE UserID = UID))
    UNION
    (SELECT distinct sm.ComicID2, 2 as "priority", sm.Similarity from janzene_twc.similaritymatrix as sm
    JOIN janzene_twc.comiclists as cl
    on (cl.ComicID = sm.ComicID1)
    WHERE (cl.UserID = UID) and cl.RankID = 2 and sm.Similarity > .334 and sm.ComicID2 Not In
    (SELECT ComicID from janzene_twc.comiclists WHERE UserID = UID))
    UNION
    (SELECT distinct sm.ComicID2, 3 as "priority", sm.Similarity from janzene_twc.similaritymatrix as sm
    JOIN janzene_twc.comiclists as cl
    on (cl.ComicID = sm.ComicID1)
    WHERE (cl.UserID = UID) and cl.RankID = 3 and sm.Similarity > .334 and sm.ComicID2 Not In
    (SELECT ComicID from janzene_twc.comiclists WHERE UserID = UID))
    UNION
    (SELECT c.ComicID as ComicID2, 4 as "priority", 1 as "Similarity" from janzene_twc.comics as c
    WHERE c.Form = (SELECT form as num from janzene_twc.comics as c
        JOIN janzene_twc.comiclists as cl on cl.ComicID = c.ComicID
        WHERE cl.UserID = UID
        GROUP BY form
        ORDER BY count(form) desc limit 1)
    and
    schedule < (SELECT avg(schedule) from janzene_twc.comics as c
        JOIN janzene_twc.comiclists as cl on cl.ComicID = c.ComicID
        WHERE cl.UserID = UID) + 1 and schedule > (SELECT avg(schedule) from janzene_twc.comics as c
        JOIN janzene_twc.comiclists as cl on cl.ComicID = c.ComicID
        WHERE cl.UserID = UID) - 1
    and c.ComicID Not In
    (SELECT ComicID from janzene_twc.comiclists WHERE UserID = UID))
    ORDER BY priority, Similarity desc limit 10
) t;
END
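The stored procedure merges four candidate sets: comics similar to the user's rank 1, rank 2, and rank 3 entries, followed by a content-based fallback on form and schedule. It then orders the candidates by priority and similarity and keeps ten. The ordering step can be mimicked in PHP as below. This is a hedged illustration of the ranking logic only, not code from the site: the function name rankCandidates, the row layout, and the sample rows are assumptions, and unlike the SQL it drops repeated comic IDs before applying the limit.

```php
<?php
// Illustrative only: rankCandidates() and the row shape are assumptions.
// Each row: ['comic' => int, 'priority' => int, 'similarity' => float].
function rankCandidates(array $rows, int $limit = 10): array
{
    // Lower priority number first; within a tier, higher similarity first.
    usort($rows, function (array $x, array $y): int {
        return [$x['priority'], $y['similarity']] <=> [$y['priority'], $x['similarity']];
    });

    $seen = [];
    $result = [];
    foreach ($rows as $row) {
        if (isset($seen[$row['comic']])) {
            continue; // keep only the best-placed entry for each comic
        }
        $seen[$row['comic']] = true;
        $result[] = $row['comic'];
        if (count($result) === $limit) {
            break;
        }
    }
    return $result;
}

$rows = [
    ['comic' => 7, 'priority' => 2, 'similarity' => 0.90],
    ['comic' => 4, 'priority' => 1, 'similarity' => 0.50],
    ['comic' => 9, 'priority' => 1, 'similarity' => 0.80],
    ['comic' => 4, 'priority' => 3, 'similarity' => 0.95],
    ['comic' => 2, 'priority' => 4, 'similarity' => 1.00],
];
print_r(rankCandidates($rows)); // comics in the order 9, 4, 7, 2
```

Because priority is compared before similarity, a weakly similar neighbor of a top-ranked comic still outranks a strongly similar neighbor of a lower-ranked one, which mirrors the ORDER BY clause in the procedure.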
APPENDIX D
Faith and Learning
At first, it seems like webcomic recommendation algorithms have little to say about the relationship of faith
and learning. The bulk of the research for this project involved reading technical papers, and the implementation
involved hours of fiddling with code and reading online PHP guides. None of it seemed especially relevant to the
human condition, aside from offering an interesting, convenient way to find new webcomics.
However, when I began investigating the ethics component of this project, I was surprised by how much
information I found. A subject that appeared amoral on the surface turned out to have complex ethical questions
behind it once it was placed in a real-world setting. Furthermore, the ramifications of how people treat these ethical
questions as the field develops could have very real consequences for the way our society views information.
As a University Scholar and college student, I have started to learn how to hold, not always successfully,
the known in one hand and the unknown in the other. While this project started rooted in such known, technical
questions as "what is the difference between a collaborative and content-based algorithm," it grew offshoots into the
territory of "how do search engines affect the political landscape of the United States?" I am learning that this
happens more often than not when investigating any subject intently, regardless of how quantifiable its solutions
may seem at first.
My experiences with faith have not been an exception to this. Faith is sometimes a battle to orient myself in
the continuum of the known and unknown, where I want to live in the safe answers but find myself pulled towards
the open-ended ones. To survive, faith needs to be able to learn. Just as this project about webcomic recommendation
expanded into the ethical responsibilities of programmers, the safe answers in faith must expand into questions that
have no answers. As tempting as it is, especially for technically-minded people like myself, the computational world of
hard answers needs to be placed in the larger world of unknowns, or your faith, or recommendation system, will not
understand the full scope of its role.
