
review articles

DOI:10.1145/1924421.1924442

The practice of crowdsourcing is transforming the Web and giving rise to a new field.

BY ANHAI DOAN, RAGHU RAMAKRISHNAN, AND ALON Y. HALEVY

Crowdsourcing Systems on the World-Wide Web
COMMUNICATIONS OF THE ACM | APRIL 2011 | VOL. 54 | NO. 4

key insights

- Crowdsourcing systems face four key challenges: how to recruit contributors, what they can do, how to combine their contributions, and how to manage abuse.

- Many systems "in the wild" must also carefully balance openness with quality.

- The race is on to build general crowdsourcing platforms that can be used to quickly build crowdsourcing applications in many domains. Using these, we can already build, at lightning speed, databases that were previously unimaginable.

CROWDSOURCING SYSTEMS enlist a multitude of humans to help solve a wide variety of problems. Over the past decade, numerous such systems have appeared on the World-Wide Web. Prime examples include Wikipedia, Linux, Yahoo! Answers, Mechanical Turk-based systems, and much effort is being directed toward developing many more. As is typical for an emerging area, this effort has appeared under many names, including peer production, user-powered systems, user-generated content, collaborative systems, community systems, social systems, social search, social media, collective intelligence, wikinomics, crowd wisdom, smart mobs, mass collaboration, and human computation. The topic has been discussed extensively in books, popular press, and academia.1,5,15,23,29,35 But this body of work has considered mostly efforts in the physical world.23,29,30 Some do consider crowdsourcing systems on the Web, but only certain system types28,33 or challenges (for example, how to evaluate users12).

This survey attempts to provide a global picture of crowdsourcing systems on the Web. We define and classify such systems, then describe a broad sample of systems. The sample


ranges from relatively simple well-established systems such as reviewing books, to complex emerging systems that build structured knowledge bases, to systems that "piggyback" onto other popular systems. We discuss fundamental challenges such as how to recruit and evaluate users, and how to merge their contributions. Given the space limitation, we do not attempt to be exhaustive. Rather, we sketch only the most important aspects of the global picture, using real-world examples. The goal is to further our collective understanding, both conceptual and practical, of this important emerging topic.

It is also important to note that many crowdsourcing platforms have been built. Examples include Mechanical Turk, TurKit, Mob4hire, uTest, Freelancer, eLance, oDesk, Guru, Topcoder, Trada, 99designs, Innocentive, CloudCrowd, and CrowdFlower. Using these platforms, we can quickly build crowdsourcing systems in many domains. In this survey, we consider these systems (that is, applications), not the crowdsourcing platforms themselves.

[Figure: Ten Thousand Cents is a digital artwork by Aaron Koblin that creates a representation of a $100 bill. Using a custom drawing tool, thousands of individuals, working in isolation from one another, painted a tiny part of the bill without knowledge of the overall task. Artwork by Aaron Koblin and Takashi Kawashima.]

Crowdsourcing Systems
Defining crowdsourcing (CS) systems turns out to be surprisingly tricky. Since many view Wikipedia and Linux as well-known CS examples, as a natural starting point we can say that a CS system enlists a crowd of users to explicitly collaborate to build a long-lasting artifact that is beneficial to the whole community.

This definition, however, appears too restricted. It excludes, for example, the ESP game,32 where users implicitly collaborate to label images as a side effect while playing the game. ESP clearly benefits from a crowd of users. More importantly, it faces the same human-centric challenges as Wikipedia and Linux, such as how to recruit and evaluate users, and how to combine their contributions. Given this, it seems unsatisfactory to consider only explicit collaborations; we ought to allow implicit ones as well.

The definition also excludes, for example, an Amazon Mechanical Turk-based system that enlists users to find a missing boat in thousands of satellite images.18 Here, users do not build any artifact, arguably nothing is long lasting, and no community exists either (just users coming together for this particular task). And yet, like ESP, this system clearly benefits from users, and faces similar human-centric challenges. Given this, it ought to be considered a CS system, and the goal of building artifacts ought to be relaxed into the more general goal of solving problems. Indeed, it appears that in principle any non-trivial problem can benefit from crowdsourcing: we can describe the problem on the Web, solicit user inputs, and examine the inputs to develop a solution. This system may not be practical (and better systems may exist), but it can arguably be considered a primitive CS system.

Consequently, we restrict neither the type of collaboration nor the target problem. Rather, we view CS as a general-purpose problem-solving method. We say that a system is a CS system if it enlists a crowd of humans to help solve a problem defined by the system owners, and if in doing so, it addresses the following four fundamental challenges:

A sample of basic CS system types on the World-Wide Web.

Explicit collaboration, standalone architecture, must recruit users:

- Evaluating. Users review, vote, and tag. Examples: reviewing and voting at Amazon; tagging Web pages at del.icio.us and Google Co-op. Target problem: evaluating a collection of items (e.g., products, users). Comments: humans as perspective providers; no or loose combination of inputs.

- Sharing. Users share items, textual knowledge, or structured knowledge. Examples: Napster, YouTube, Flickr, CPAN, programmableweb.com (items); mailing lists, Yahoo! Answers, QUIQ, ehow.com, Quora (textual knowledge); Swivel, Many Eyes, Google Fusion Tables, Google Base, bmrb.wisc.edu, galaxyzoo, Piazza, Orchestra (structured knowledge). Target problem: building a (distributed or central) collection of items that can be shared among users. Comments: humans as content providers; no or loose combination of inputs.

- Networking. Examples: LinkedIn, MySpace, Facebook. Target problem: building social networks. Comments: humans as component providers; loose combination of inputs.

- Building artifacts. Users build software (Linux, Apache, Hadoop), textual knowledge bases (Wikipedia, openmind, Intellipedia, ecolicommunity), structured knowledge bases (Wikipedia infoboxes/DBpedia, IWP, Google Fusion Tables, YAGO-NAGA, Cimple/DBLife), systems (Wikia Search, mahalo, Freebase, Eurekster), and others (newspaper at Digg.com, Second Life). Target problem: building artifacts. Comments: humans can play all roles; typically tight combination of inputs; some systems ask both humans and machines to contribute.

- Task execution. Examples: finding extraterrestrials, elections, finding people, content creation (e.g., Demand Media, Associated Content). Target problem: possibly any problem.

Implicit collaboration, standalone architecture, must recruit users:

- Users play games with a purpose (ESP: labeling images), bet on prediction markets (intrade.com, Iowa Electronic Markets: predicting events), use private accounts (IMDB private accounts: rating movies), solve captchas (recaptcha.net: digitizing written text), or buy/sell/auction and play massive multiplayer games (eBay, World of Warcraft: building a user community, for purposes such as charging fees or advertising). Comments: humans can play all roles; input combination can be loose or tight.

Implicit collaboration, piggyback on another system, need not recruit users:

- Users do keyword search (Google, Microsoft, Yahoo: spelling correction, epidemic prediction), buy products (recommendation feature of Amazon: recommending products), or browse Web sites (adaptive Web sites, e.g., Yahoo! front page: reorganizing a Web site for better access). Comments: humans can play all roles; input combination can be loose or tight.
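One pattern in the table, piggyback systems that mine search-engine query logs for spelling correction, can be made concrete with a small sketch. This is not how any production engine works; it is a minimal illustration, over a hypothetical log, of the idea that frequent user rewrites of a query reveal likely corrections:

```python
from collections import Counter, defaultdict

# Hypothetical search log: consecutive query pairs issued by the same user.
# When many users quickly retype "recieve" as "receive", the rewrite is
# likely a spelling correction.
query_pairs = [
    ("recieve", "receive"), ("recieve", "receive"), ("recieve", "retrieve"),
    ("teh", "the"), ("teh", "the"),
]

# Count how often each query is rewritten into each follow-up query.
rewrites = defaultdict(Counter)
for before, after in query_pairs:
    rewrites[before][after] += 1

def did_you_mean(query, min_support=2):
    """Suggest the most common crowd rewrite of `query`, if well supported."""
    if query in rewrites:
        suggestion, count = rewrites[query].most_common(1)[0]
        if count >= min_support:
            return suggestion
    return None

print(did_you_mean("recieve"))  # receive
print(did_you_mean("teh"))      # the
```

The crowd never explicitly labels anything here; the "Did you mean" signal is a side effect of ordinary searching, which is exactly what makes the piggyback architecture cheap to run.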

How to recruit and retain users? What contributions can users make? How to combine user contributions to solve the target problem? How to evaluate users and their contributions?

Not all human-centric systems address these challenges. Consider a system that manages car traffic in Madison, WI. Its goal is to, say, coordinate the behaviors of a crowd of human drivers (that already exist within the system) in order to minimize traffic jams. Clearly, this system does not want to recruit more human drivers (in fact, it wants far fewer of them). We call such systems crowd management (CM) systems. CM techniques (a.k.a. "crowd coordination"31) can be relevant to CS contexts. But the two system classes are clearly distinct.

In this survey we focus on CS systems that leverage the Web to solve the four challenges mentioned here (or a significant subset of them). The Web is unique in that it can help recruit a large number of users, enable a high degree of automation, and provide a large set of social software (for example, email, wiki, discussion group, blogging, and tagging) that CS systems can use to manage their users. As such, compared to the physical world, the Web can dramatically improve existing CS systems and give birth to novel system types.

Classifying CS systems. CS systems can be classified along many dimensions. Here, we discuss nine dimensions we consider most important. The two that immediately come to mind are the nature of collaboration and the type of target problem. As discussed previously, collaboration can be explicit or implicit, and the target problem can be any problem defined by the system owners (for example, building temporary or permanent artifacts, executing tasks).

The next four dimensions refer respectively to how a CS system solves the four fundamental challenges described earlier: how to recruit and retain users; what can users do; how to combine


their inputs; and how to evaluate them. Later, we will discuss these challenges and the corresponding dimensions in detail. Here, we discuss the remaining three dimensions: degree of manual effort, role of human users, and standalone versus piggyback architectures.

Degree of manual effort. When building a CS system, we must decide how much manual effort is required to solve each of the four CS challenges. This can range from relatively little (for example, combining ratings) to substantial (for example, combining code), and clearly also depends on how much the system is automated. We must decide how to divide the manual effort between the users and the system owners. Some systems ask the users to do relatively little and the owners a great deal. For example, to detect malicious users, the users may simply click a button to report suspicious behaviors, whereas the owners must carefully examine all relevant evidence to determine if a user is indeed malicious. Some systems do the reverse. For example, most of the manual burden of merging Wikipedia edits falls on the users (who are currently editing), not the owners.

Role of human users. We consider four basic roles of humans in a CS system. Slaves: humans help solve the problem in a divide-and-conquer fashion, to minimize the resources (for example, time, effort) of the owners. Examples are ESP and finding a missing boat in satellite images using Mechanical Turk. Perspective providers: humans contribute different perspectives, which when combined often produce a better solution (than with a single human). Examples are reviewing books and aggregating user bets to make predictions.29 Content providers: humans contribute self-generated content (for example, videos on YouTube, images on Flickr). Component providers: humans function as components in the target artifact, such as a social network, or simply just a community of users (so that the owner can, say, sell ads). Humans often play multiple roles within a single CS system (for example, slaves, perspective providers, and content providers in Wikipedia). It is important to know these roles because that may determine how to recruit. For example, to use humans as perspective providers, it is important to recruit a diverse crowd where each human can make independent decisions, to avoid "group think."29

Standalone versus piggyback. When building a CS system, we may decide to piggyback on a well-established system, by exploiting traces that users leave in that system to solve our target problem. For example, Google's "Did you mean" and Yahoo's Search Assist utilize the search log and user clicks of a search engine to correct spelling mistakes. Another system may exploit user purchases in an online bookstore (Amazon) to recommend books. Unlike standalone systems, such piggyback systems do not have to solve the challenges of recruiting users and deciding what they can do. But they still have to decide how to evaluate users and their inputs (such as traces in this case), and how to combine such inputs to solve the target problem.

Sample CS Systems on the Web
Building on this discussion of CS dimensions, we now focus on CS systems on the Web, first describing a set of basic system types, and then showing how deployed CS systems often combine multiple such types.

The accompanying table shows a set of basic CS system types. The set is not meant to be exhaustive; it shows only those types that have received the most attention. From left to right, it is organized by collaboration, architecture, the need to recruit users, and then by the actions users can take. We now discuss the set, starting with explicit systems.

Explicit Systems: These standalone systems let users collaborate explicitly. In particular, users can evaluate, share, network, build artifacts, and execute tasks. We discuss these systems in turn.

Evaluating: These systems let users evaluate "items" (for example, books, movies, Web pages, other users) using textual comments, numeric scores, or tags.10

Sharing: These systems let users share "items" such as products, services, textual knowledge, and structured knowledge. Systems that share products and services include Napster, YouTube, CPAN, and the site programmableweb.com (for sharing files, videos, software, and mashups, respectively). Systems that share textual knowledge include mailing lists, Twitter, how-to


repositories (such as ehow.com, which lets users contribute and search how-to articles), Q&A Web sites (such as Yahoo! Answers2), and online customer support systems (such as QUIQ,22 which powered Ask Jeeves' AnswerPoint, a Yahoo! Answers-like site). Systems that share structured knowledge (for example, relational, XML, or RDF data) include Swivel, Many Eyes, Google Fusion Tables, Google Base, many e-science Web sites (such as bmrb.wisc.edu, galaxyzoo.org), and many peer-to-peer systems developed in the Semantic Web, database, AI, and IR communities (such as Orchestra8,27). Swivel, for example, bills itself as the "YouTube of structured data," which lets users share, query, and visualize census and voting data, among others. In general, sharing systems can be central (such as YouTube, ehow, Google Fusion Tables, Swivel) or distributed, in a peer-to-peer fashion (such as Napster, Orchestra).

Networking: These systems let users collaboratively construct a large social network graph, by adding nodes and edges over time (such as homepages, friendships). Then they exploit the graph to provide services (for example, friend updates, ads, and so on). To a lesser degree, blogging systems are also networking systems in that bloggers often link to other bloggers.

A key distinguishing aspect of systems that evaluate, share, or network is that they do not merge user inputs, or do so automatically in relatively simple fashions. For example, evaluation systems typically do not merge textual user reviews. They often merge user inputs such as movie ratings, but do so automatically using some formulas. Similarly, networking systems automatically merge user inputs by adding them as nodes and edges to a social network graph. As a result, users of such systems do not need (and, in fact, often are not allowed) to edit other users' input.
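The "simple formulas" used to merge ratings can be as basic as an average, optionally weighted by how much each rater is trusted. A minimal sketch (the data and the trust weights are hypothetical):

```python
# Each tuple: (user_id, rating on a 1-5 scale) for one movie.
ratings = [("u1", 5), ("u2", 4), ("u3", 1), ("u4", 4)]

# Hypothetical per-user trust weights (e.g., derived from past behavior);
# unknown users default to 1.0, so u3's outlier rating counts for less.
trust = {"u3": 0.5}

def combine(ratings, trust):
    """Loosely combine independent ratings into one public score."""
    total = sum(trust.get(u, 1.0) * r for u, r in ratings)
    weight = sum(trust.get(u, 1.0) for u, r in ratings)
    return total / weight

print(round(combine(ratings, trust), 2))  # 3.86
```

Note that the contributions never touch each other: each rating stands alone, and the system owner's formula does all the merging, which is what makes this the "loose" end of the input-combination spectrum.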


Building Artifacts: In contrast, systems that let users build artifacts such as Wikipedia often merge user inputs tightly, and require users to edit and merge one another's inputs. A well-known artifact is software (such as Apache, Linux, Hadoop). Another popular artifact is textual knowledge bases (KBs). To build such KBs (such as Wikipedia), users contribute data such as sentences, paragraphs, and Web pages, then edit and merge one another's contributions. The knowledge capture (k-cap.org) and AI communities have studied building such KBs for over a decade. A well-known early attempt is openmind,28 which enlists volunteers to build a KB of commonsense facts (for example, "the sky is blue"). Recently, the success of Wikipedia has inspired many "community wikipedias," such as Intellipedia (for the U.S. intelligence community) and EcoliHub (at ecolicommunity.org, to capture all information about the E. coli bacterium).

[Figure: The Sheep Market by Aaron Koblin is a collection of 10,000 sheep made by workers on Amazon's Mechanical Turk. Workers were paid $0.02 (USD) to "draw a sheep facing to the left." Animations of each sheep's creation may be viewed at TheSheepMarket.com.]

Yet another popular target artifact is structured KBs. For example, the set of all Wikipedia infoboxes (that is, attribute-value pairs such as city-name = Madison, state = WI) can be viewed as a structured KB collaboratively created by Wikipedia users. Indeed, this KB has recently been extracted as DBpedia and used in several applications (see dbpedia.org). Freebase.com builds an open structured database, where users can create and populate schemas to describe topics of interest, and build collections of interlinked topics using a flexible graph model of data. As yet another example, Google Fusion Tables (tables.googlelabs.com) lets users upload tabular data and collaborate on it by merging tables from different sources, commenting on data items, and sharing visualizations on the Web.
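An infobox-style structured KB is, at its core, a map from each article to attribute-value pairs. The sketch below extracts such pairs from a deliberately simplified, hypothetical wiki-style snippet; real Wikipedia templates are far messier than this:

```python
import re

# Hypothetical, heavily simplified infobox markup for one article.
infobox_text = """
| city-name  = Madison
| state      = WI
| population = 233209
"""

def parse_infobox(text):
    """Extract attribute-value pairs from simplified '| key = value' lines."""
    pairs = {}
    for line in text.splitlines():
        m = re.match(r"\|\s*([\w-]+)\s*=\s*(.+?)\s*$", line)
        if m:
            pairs[m.group(1)] = m.group(2)
    return pairs

# The structured KB: article title -> attribute-value pairs.
kb = {"Madison, Wisconsin": parse_infobox(infobox_text)}
print(kb["Madison, Wisconsin"]["state"])  # WI
```

Because thousands of contributors fill in such templates independently, the union of all parsed infoboxes becomes a collaboratively built structured KB, which is essentially what DBpedia harvests.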
Several recent academic projects have also studied building structured KBs in a CS fashion. The IWP project35 extracts structured data from the textual pages of Wikipedia, then asks users to verify the extraction accuracy. The Cimple/DBLife project4,5 lets users correct the extracted structured data, expose it in wiki pages, then add even more textual and structured data. Thus, it builds structured "community wikipedias," whose wiki pages mix textual data with structured data (that comes from an underlying structured KB). Other related works include YAGO-NAGA,11 BioPortal,17 and many recent projects in the Web, Semantic Web, and AI communities.1,16,36

In general, building a structured KB often requires selecting a set of data sources, extracting structured data from them, then integrating the data (for example, matching and merging "David Smith" and "D.M. Smith"). Users can help with these steps in two ways. First, they can improve the automatic algorithms of the steps (if any), by editing their code, creating more training data,17 answering their questions12,13 or providing feedback on their output.12,35 Second, users can manually participate in the steps. For example, they can manually add or remove data sources, extract or integrate structured data, or add even more structured data, data not available in the current sources but judged relevant.5 In addition, a CS system may perform inferences over its KB to infer more structured data. To help this step, users can contribute inference rules and domain knowledge.25 During all such activities, users can naturally cross-edit and merge one another's contributions, just like in those systems that build textual KBs.

Another interesting target problem is building and improving systems running on the Web. The project Wikia Search (search.wikia.com) lets users build an open source search engine, by contributing code, suggesting URLs to crawl, and editing search result pages (for example, promoting or demoting URLs). Wikia Search was recently disbanded, but similar features (such as editing search pages) appear in other search engines (such as Google, mahalo.com). Freebase lets users create custom browsing and search systems (deployed at Freebase), using the community-curated data and a suite of development tools (such as the Metaweb query language and a hosted development environment). Eurekster.com lets users collaboratively build vertical search engines called swickis, by customizing a generic search engine (for example, specifying all URLs the system should crawl). Finally, MOBS, an academic project,12,13 studies how to collaboratively build data integration systems, that is, systems that provide a uniform query interface to a set of data sources. MOBS enlists users to create a crucial system component, namely the semantic mappings (for example, "location" = "address") between the data sources.

In general, users can help build and improve a system running on the Web in several ways. First, they can edit the system's code. Second, the system typically contains a set of internal components (such as URLs to crawl, semantic mappings), and users can help improve these without even touching the system's code (such as adding new URLs, correcting mappings). Third, users can edit system inputs and outputs. In the case of a search engine, for instance, users can suggest that if someone queries for "home equity loan for seniors," the system should also suggest querying for "reverse mortgage." Users can also edit search result pages (such as promoting and demoting URLs, as mentioned earlier). Finally, users can monitor the running system and provide feedback.

We note that besides software, KBs, and systems, many other target artifacts have also been considered. Examples include community newspapers built by asking users to contribute and evaluate articles (such as Digg) and massive multi-player games that build virtual artifacts (such as Second Life, a 3D virtual world partly built and maintained by users).

Executing Tasks: The last type of explicit systems we consider is the kind that executes tasks. Examples include finding extraterrestrials, mining for gold, searching for missing people,23,29,30,31 and cooperative debugging (cs.wisc.edu/cbi; early work of this project received the ACM Doctoral Dissertation Award in 2005). The 2008 election is a well-known example, where the Obama team ran a large online CS operation asking numerous volunteers to help mobilize voters. To apply CS to a task, we must find task parts that can be "crowdsourced," such that each user can make a contribution and the contributions in turn can be combined to solve the parts. Finding such parts and combining user contributions are often task specific. Crowdsourcing the parts, however, can be fairly general, and platforms have been developed to assist that process. For example, Amazon's Mechanical Turk can help distribute pieces of a task to a crowd of users (and several interesting toolkits have recently been developed for using Mechanical Turk13,37). It was used recently to search for Jim Gray, a database researcher lost at sea, by asking volunteers to examine pieces of satellite images for any sign of Jim Gray's boat.18

Implicit Systems: As discussed earlier, such systems let users collaborate implicitly to solve a problem of the system owners. They fall into two groups: standalone and piggyback.

A standalone system provides a service such that when using it users implicitly collaborate (as a side effect) to solve a problem. Many such systems exist, and the table here lists a few representative examples. The ESP game32 lets users play a game of guessing common words that describe images (shown independently to each user), then uses those words to label images. Google Image Labeler builds on this game, and many other "games with a purpose" exist.33 Prediction markets23,29 let users bet on events (such as elections, sport events), then aggregate the bets to make predictions. The intuition is that the "collective wisdom" is often accurate (under certain conditions)31 and that this helps incorporate inside information available from users. The Internet Movie Database (IMDB) lets users import movies into private accounts (hosted by IMDB). It designed the accounts such that users are strongly motivated to rate the imported movies, as doing so brings many private benefits (for example, they can query to find all imported action movies rated at least 7/10, or the system can recommend action movies highly rated by people with similar taste). IMDB then aggregates all private ratings to obtain a public rating for each movie, for the benefit of the public.
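The ESP game's output-agreement rule, under which a word is trusted as a label only when both independent players produce it, can be sketched as follows (the image IDs and guesses are hypothetical):

```python
# Hypothetical guesses from two players shown the same image independently.
guesses = {
    "img_001": (["dog", "grass", "ball"], ["ball", "puppy", "dog"]),
    "img_002": (["car", "red"], ["sunset", "beach"]),
}

def agreed_labels(guesses):
    """Keep only words that both players produced, in player-1 order."""
    labels = {}
    for image, (p1, p2) in guesses.items():
        labels[image] = [w for w in p1 if w in set(p2)]
    return labels

print(agreed_labels(guesses))
# {'img_001': ['dog', 'ball'], 'img_002': []}
```

Because the players cannot communicate, agreement is strong evidence that a word genuinely describes the image; images with no agreement (like img_002 here) simply get re-queued for other player pairs.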


reCAPTCHA asks users to solve captchas to prove they are humans (to gain access to a site), then leverages the results for digitizing written text.34 Finally, it can be argued that the target problem of many systems (that provide user services) is simply to grow a large community of users, for various reasons (such as personal satisfaction, charging subscription fees, selling ads, or selling the systems to other companies). Buy/sell/auction websites (such as eBay) and massive multiplayer games (such as World of Warcraft), for instance, fit this description. Here, by simply joining the system, users can be viewed as implicitly collaborating to solve the target problem (of growing user communities).

The second kind of implicit system we consider is a piggyback system that exploits the user traces of yet another system (thus making the users of this latter system implicitly collaborate) to solve a problem. For example, over time many piggyback CS systems have been built on top of major search engines, such as Google, Yahoo!, and Microsoft. These systems exploit the traces of search engine users (such as search logs and user clicks) for a wide range of tasks (such as spelling correction, finding synonyms, flu epidemic prediction, and keyword generation for ads6). Other examples include exploiting user purchases to recommend products,26 and exploiting click logs to improve the presentation of a Web site.19

CS Systems on the Web
We now build on basic system types to discuss deployed CS systems on the Web. Founded on static HTML pages, the Web soon offered many interactive services. Some services serve machines (such as DNS servers and the Google Map API server), but most serve humans. Many such services do not need to recruit users (in the sense that the more the better). Examples include pay-parking-ticket services (for city residents) and room-reservation services. (As noted, we call these crowd management systems.) Many services, however, face CS challenges, including the need to grow large user bases. For example, online stores such as Amazon want a growing user base for their services, to maximize profits, and startups such as epinions.com grow their user bases for advertising. They started out as primitive CS systems, but quickly improved over time with additional CS features (such as reviewing, rating, networking). Then around 2003, aided by the proliferation of social software (for example, discussion groups, wiki, blogs), many full-fledged CS systems (such as Wikipedia, Flickr, YouTube, Facebook, MySpace) appeared, marking the arrival of Web 2.0. This Web is growing rapidly, with many new CS systems being developed and non-CS systems adding CS features.

These CS systems often combine multiple basic CS features. For example, Wikipedia primarily builds a textual KB. But it also builds a structured KB (via infoboxes) and hosts many knowledge sharing forums (for example, discussion groups). YouTube lets users both share and evaluate videos. Community portals often combine all CS features discussed so far. Finally, we note that the Semantic Web, an ambitious attempt to add structure to the Web, can be viewed as a CS attempt to share structured data, and to integrate such data to build a Web-scale structured KB. The World-Wide Web itself is perhaps the largest CS system of all, encompassing everything we have discussed.

[Pull quote: "The user interface should make it easy for users to contribute. This is highly non-trivial."]

Challenges and Solutions
Here, we discuss the key challenges of CS systems:

How to recruit and retain users? Recruiting users is one of the most important CS challenges, for which five major solutions exist. First, we can require users to make contributions if we have the authority to do so (for example, a manager may require 100 employees to help build a company-wide system). Second, we can pay users. Mechanical Turk, for example, provides a way to pay users on the Web to help with a task. Third, we can ask for volunteers. This solution is free and easy to execute, and hence is most popular. Most current CS systems on the Web (such as Wikipedia and YouTube) use this solution. The downside of volunteering is that it is hard to predict how many users we can recruit for a particular application.

The fourth solution is to make users pay for service. The basic idea is to require the users of a system A to "pay" for using A, by contributing to a CS system B. Consider for example a blog website (that is, system A), where a user U can leave a comment only after solving a puzzle (called a captcha) to prove that U is a human. As a part of the puzzle, we
review articles

can ask U to retype a word that an OCR program has failed to recognize (the “payment”), thereby contributing to a CS effort on digitizing written text (that is, system B). This is the key idea behind the reCAPTCHA project [34]. The MOBS project [12, 13] employs the same solution. In particular, it ran experiments where a user U can access a Web site (such as a class homepage) only after answering a relatively simple question (such as: is the string “1960” in “born in 1960” a birth date?). MOBS leverages the answers to help build a data integration system. This solution works best when the “payment” is unintrusive or cognitively simple, to avoid deterring users from using system A.

The fifth solution is to piggyback on the user traces of a well-established system (such as building a spelling correction system by exploiting user traces of a search engine, as discussed previously). This gives us a steady stream of users. But we must still solve the difficult challenge of determining how the traces can be exploited for our purpose.

Once we have selected a recruitment strategy, we should consider how to further encourage and retain users. Many encouragement and retention (E&R) schemes exist. We briefly discuss the most popular ones. First, we can provide instant gratification, by immediately showing a user how his or her contribution makes a difference [16]. Second, we can provide an enjoyable experience or a necessary service, such as game playing (while making a contribution) [32]. Third, we can provide ways to establish, measure, and show fame/trust/reputation [7, 13, 24, 25]. Fourth, we can set up competitions, such as showing top-rated users. Finally, we can provide ownership situations, where a user may feel he or she “owns” a part of the system, and thus is compelled to “cultivate” that part. For example, zillow.com displays houses and estimates their market prices. It provides a way for a house owner to claim his or her house and provide the correct data (such as number of bedrooms), which in turn helps improve the price estimation.

These E&R schemes apply naturally to volunteering, but can also work well for other recruitment solutions. For example, after requiring a set of users to contribute, we can still provide instant gratification, enjoyable experience, fame management, and so on, to maximize user participation. Finally, we note that deployed CS systems often employ a mixture of recruitment methods (such as bootstrapping with “requirement” or “paying,” then switching to “volunteering” once the system is sufficiently “mature”).

What contributions can users make? In many CS systems the kinds of contributions users can make are somewhat limited. For example, to evaluate, users review, rate, or tag; to share, users add items to a central Web site; to network, users link to other users; to find a missing boat in satellite images, users examine those images. In more complex CS systems, however, users often can make a far wider range of contributions, from simple low-hanging fruit to cognitively complex ones. For example, when building a structured KB, users can add a URL, flag incorrect data, and supply attribute-value pairs (as low-hanging fruit) [3, 5]. But they can also supply inference rules, resolve controversial issues, and merge conflicting inputs (as cognitively complex contributions) [25]. The challenge is to define this range of possible contributions (and design the system such that it can gather a critical crowd of such contributions).

Toward this goal, we should consider four important factors. First, how cognitively demanding are the contributions? A CS system often has a way to classify users into groups, such as guests, regulars, editors, admins, and “dictators.” We should take care to design cognitively appropriate contribution types for different user groups. Low-ranking users (such as guests and regulars) often want to make only “easy” contributions (such as answering a simple question, editing one to two sentences, or flagging an incorrect data piece). If the cognitive load is high, they may be reluctant to participate. High-ranking users (such as editors and admins) are more willing to make “hard” contributions (such as resolving controversial issues).

Second, what should be the impact of a contribution? We can measure the potential impact by considering how the contribution potentially affects the CS system. For example, editing a sentence in a Wikipedia page largely affects only that page, whereas revising an edit policy may potentially affect millions of pages. As another example, when building a structured KB, flagging an incorrect data piece typically has less potential impact than supplying an inference rule, which may be used in many parts of the CS system. Quantifying the potential impact of a contribution type in a complex CS system may be difficult [12, 13]. But it is important to do so, because we typically have far fewer high-ranking users (such as editors and admins) than regulars, say. To maximize the total contribution of these few users, we should ask them to make potentially high-impact contributions whenever possible.

Third, what about machine contributions? If a CS system employs an algorithm for a task, then we want human users to make contributions that are easy for humans, but difficult for machines. For example, examining textual and image descriptions to decide if two products match is relatively easy for humans but very difficult for machines. In short, the CS work should be distributed between human users and machines according to what each of them is best at, in a complementary and synergistic fashion.

Finally, the user interface should make it easy for users to contribute. This is highly non-trivial. For example, how can users easily enter domain knowledge such as “no current living person was born before 1850” (which can be used in a KB to detect, say, incorrect birth dates)? A natural language format (such as in openmind.org) is easy for users, but difficult for machines to understand and use, and a formal language format has the reverse problem. As another example, when building a structured KB, contributing attribute-value pairs is relatively easy (as Wikipedia infoboxes and Freebase demonstrate). But contributing more complex structured data pieces can be quite difficult for naive users, as this often requires them to learn the KB schema, among other things [5].

How to combine user contributions? Many CS systems do not combine contributions, or do so in a loose fashion. For example, current evaluation systems do not combine reviews, and combine numeric ratings using relatively simple formulas. Networking systems simply link contributions (homepages and friendships) to form a social network graph. More complex CS systems, however, such as those that build software, KBs, systems, and games, combine contributions more tightly. Exactly how this happens is application dependent. Wikipedia, for example, lets users manually merge edits, while ESP does so automatically, by waiting until two users agree on a common word.

No matter how contributions are combined, a key problem is to decide what to do if users differ, such as when three users assert “A” and two users “not A.” Both automatic and manual solutions have been developed for this problem. Current automatic solutions typically combine contributions weighted by some user scores. The work in [12, 13], for example, lets users vote on the correctness of system components (the semantic mappings of a data integration system in this case [20]), then combines the votes weighted by the trustworthiness of each user. The work in [25] lets users contribute structured KB fragments, then combines them into a coherent probabilistic KB by computing the probability that each user is correct, then weighting contributed fragments by these probabilities. Manual dispute management solutions typically let users fight and settle among themselves. Unresolved issues then percolate up the user hierarchy. Systems such as Wikipedia and Linux employ such methods. Automatic solutions are more efficient. But they work only for relatively simple forms of contributions (such as voting), or forms that are complex but amenable to algorithmic manipulation (such as structured KB fragments). Manual solutions are still the currently preferred way to combine “messy” conflicting contributions.

To further complicate the matter, sometimes not just human users but machines also make contributions. Combining such contributions is difficult. To see why, suppose we employ a machine M to help create Wikipedia infoboxes [35]. Suppose on Day 1 M asserts population = 5500 in a city infobox. On Day 2, a user U may correct this into population = 7500, based on his or her knowledge. On Day 3, however, M may have managed to process more Web data, and obtained higher confidence that population = 5500 is indeed correct. Should M override U’s assertion? And if so, how can M explain its reasoning to U? The main problem here is that it is difficult for a machine to enter into a manual dispute with a human user. The currently preferred method is for M to alert U, and then leave it up to U to decide what to do. But this method clearly will not scale with the number of conflicting contributions.

How to evaluate users and contributions? CS systems often must manage malicious users. To do so, we can use a combination of techniques that block, detect, and deter. First, we can block many malicious users by limiting who can make what kinds of contributions. Many e-science CS systems, for example, allow anyone to submit data, but only certain domain scientists to clean and merge this data into the central database.

Second, we can detect malicious users and contributions using a variety of techniques. Manual techniques include monitoring the system by the owners, distributing the monitoring workload among a set of trusted users, and enlisting ordinary users (such as flagging bad contributions on message boards). Automatic methods typically involve some tests. For example, a system can ask users questions for which it already knows the answers, then use the answers of the users to compute their reliability scores [13, 34]. Many other schemes to compute users’ reliability/trust/fame/reputation have been proposed [9, 26].

Finally, we can deter malicious users with threats of “punishment.” A common punishment is banning. A newer, more controversial form of punishment is “public shaming,” where a user U judged malicious is publicly branded as a malicious or “crazy” user for the rest of the community (possibly without U’s knowledge). For example, a chat room may allow users to rate other users. If the (hidden) score of a user U goes below a threshold, other users will only see a mechanically garbled version of U’s comments, whereas U continues to see his or her comments exactly as written.
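The two detection ideas just described, scoring users with known-answer (“gold”) questions and then weighting their votes by the resulting scores, can be sketched concretely. The following Python fragment is only an illustration of the general idea, not the algorithm of MOBS, reCAPTCHA, or any other system surveyed here; the function names, the neutral 0.5 prior for unscored users, and the toy data are our own assumptions.

```python
# Illustrative sketch: estimate user reliability from "gold" questions
# (questions whose answers the system already knows), then resolve
# conflicting contributions with a reliability-weighted vote.
# All names, priors, and data below are assumptions for illustration only.

from collections import defaultdict

def reliability(user_answers, gold):
    """Fraction of gold questions the user answered correctly.
    Falls back to a neutral 0.5 prior if the user saw no gold questions."""
    graded = [(q, a) for q, a in user_answers.items() if q in gold]
    if not graded:
        return 0.5
    correct = sum(1 for q, a in graded if gold[q] == a)
    return correct / len(graded)

def weighted_vote(contributions, scores):
    """Resolve conflicting assertions: each (user, value) pair adds the
    user's reliability score to that value's tally; the highest tally wins."""
    tally = defaultdict(float)
    for user, value in contributions:
        tally[value] += scores[user]
    return max(tally, key=tally.get)

# Toy example: three users assert "A" and two assert "not A", but the two
# dissenters answered all their gold questions correctly.
gold = {"q1": "x", "q2": "y"}
answers = {
    "u1": {"q1": "x", "q2": "z"},   # 1 of 2 gold correct -> 0.5
    "u2": {"q1": "w"},              # 0 of 1 correct      -> 0.0
    "u3": {},                       # no gold seen        -> prior 0.5
    "u4": {"q1": "x", "q2": "y"},   # 2 of 2 correct      -> 1.0
    "u5": {"q1": "x", "q2": "y"},   # 2 of 2 correct      -> 1.0
}
scores = {u: reliability(a, gold) for u, a in answers.items()}
winner = weighted_vote(
    [("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "not A"), ("u5", "not A")],
    scores,
)
# "A" tallies 0.5 + 0.0 + 0.5 = 1.0; "not A" tallies 1.0 + 1.0 = 2.0
```

In the toy run, the weighted vote sides with the reliable minority even though the raw majority says otherwise. A deployed system would smooth these estimates, update them over time, and combine them with the blocking and deterrence mechanisms discussed above.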

No matter how well we manage malicious users, malicious contributions often still seep into the system. If so, the CS system must find a way to undo those. If the system does not combine contributions (such as reviews) or does so only in a loose fashion (such as ratings), undoing is relatively easy. If the system combines contributions tightly, but keeps them localized, then we can still undo with relatively simple logging. For example, user edits in Wikipedia can be combined extensively within a single page, but kept localized to that page (not propagated to other pages). Consequently, we can undo with page-level logging, as Wikipedia does. However, if the contributions are pushed deep into the system, then undoing can be very difficult. For example, suppose an inference rule R is contributed to a KB on Day 1. We then use R to infer many facts, apply other rules to these facts and other facts in the KB to infer more facts, let users edit the facts extensively, and so on. Then on Day 3, should R be found incorrect, it would be very difficult to remove R without reverting the KB to its state on Day 1, thereby losing all good contributions made between Day 1 and Day 3.

At the other end of the user spectrum, many CS systems also identify and leverage influential users, using both manual and automatic techniques. For example, productive users in Wikipedia can be recommended by other users, promoted, and given more responsibilities. As another example, certain users of social networks highly influence the buy/sell decisions of other users. Consequently, some work has examined how to automatically identify these users, and leverage them in viral marketing within a user community [24].

Conclusion
We have discussed CS systems on the World-Wide Web. Our discussion shows that crowdsourcing can be applied to a wide variety of problems, and that it raises numerous interesting technical and social challenges. Given the success of current CS systems, we expect that this emerging field will grow rapidly. In the near future, we foresee three major directions: more generic platforms, more applications and structure, and more users and complex contributions.

First, the various systems built in the past decade have clearly demonstrated the value of crowdsourcing. The race is now on to move beyond building individual systems, toward building general CS platforms that can be used to develop such systems quickly.

Second, we expect that crowdsourcing will be applied to ever more classes of applications. Many of these applications will be formal and structured in some sense, making it easier to employ automatic techniques and to coordinate them with human users [37–40]. In particular, a large chunk of the Web is about data and services. Consequently, we expect that crowdsourcing to build structured databases and structured services (Web services with formalized input and output) will receive increasing attention.

Finally, we expect many techniques will be developed to engage an ever broader range of users in crowdsourcing, and to enable them, especially naïve users, to make increasingly complex contributions, such as creating software programs and building mashups (without writing any code), and specifying complex structured data pieces (without knowing any structured query languages).

References
1. AAAI-08 Workshop. Wikipedia and artificial intelligence: An evolving synergy, 2008.
2. Adamic, L.A., Zhang, J., Bakshy, E. and Ackerman, M.S. Knowledge sharing and Yahoo Answers: Everyone knows something. In Proceedings of WWW, 2008.
3. Chai, X., Vuong, B., Doan, A. and Naughton, J.F. Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of SIGMOD, 2009.
4. The Cimple/DBLife project; http://pages.cs.wisc.edu/~anhai/projects/cimple.
5. DeRose, P., Chai, X., Gao, B.J., Shen, W., Doan, A., Bohannon, P. and Zhu, X. Building community Wikipedias: A machine-human partnership approach. In Proceedings of ICDE, 2008.
6. Fuxman, A., Tsaparas, P., Achan, K. and Agrawal, R. Using the wisdom of the crowds for keyword generation. In Proceedings of WWW, 2008.
7. Golbeck, J. Computing and Applying Trust in Web-based Social Networks. Ph.D. Dissertation, University of Maryland, 2005.
8. Ives, Z.G., Khandelwal, N., Kapur, A. and Cakir, M. Orchestra: Rapid, collaborative sharing of dynamic data. In Proceedings of CIDR, 2005.
9. Kasneci, G., Ramanath, M., Suchanek, F.M. and Weikum, G. The YAGO-NAGA approach to knowledge discovery. SIGMOD Record 37, 4 (2008), 41–47.
10. Koutrika, G., Bercovitz, B., Kaliszan, F., Liou, H. and Garcia-Molina, H. CourseRank: A closed-community social system through the magnifying glass. In Proceedings of the 3rd Int’l AAAI Conference on Weblogs and Social Media (ICWSM), 2009.
11. Little, G., Chilton, L.B., Miller, R.C. and Goldman, M. TurKit: Tools for iterative tasks on Mechanical Turk. Technical Report, 2009. Available from glittle.org.
12. McCann, R., Doan, A., Varadarajan, V. and Kramnik, A. Building data integration systems: A mass collaboration approach. In WebDB, 2003.
13. McCann, R., Shen, W. and Doan, A. Matching schemas in online communities: A Web 2.0 approach. In Proceedings of ICDE, 2008.
14. McDowell, L., Etzioni, O., Gribble, S.D., Halevy, A.Y., Levy, H.M., Pentney, W., Verma, D. and Vlasseva, S. Mangrove: Enticing ordinary people onto the semantic web via instant gratification. In Proceedings of ISWC, 2003.
15. Mihalcea, R. and Chklovski, T. Building sense tagged corpora with volunteer contributions over the Web. In Proceedings of RANLP, 2003.
16. Noy, N.F., Chugh, A. and Alani, H. The CKC challenge: Exploring tools for collaborative knowledge construction. IEEE Intelligent Systems 23, 1 (2008), 64–68.
17. Noy, N.F., Griffith, N. and Musen, M.A. Collecting community-based mappings in an ontology repository. In Proceedings of ISWC, 2008.
18. Olson, M. The amateur search. SIGMOD Record 37, 2 (2008), 21–24.
19. Perkowitz, M. and Etzioni, O. Adaptive web sites. Comm. ACM 43, 8 (Aug. 2000).
20. Rahm, E. and Bernstein, P.A. A survey of approaches to automatic schema matching. VLDB J. 10, 4 (2001), 334–350.
21. Ramakrishnan, R. Collaboration and data mining. Keynote talk, KDD, 2001.
22. Ramakrishnan, R., Baptist, A., Ercegovac, A., Hanselman, M., Kabra, N., Marathe, A. and Shaft, U. Mass collaboration: A case study. In Proceedings of IDEAS, 2004.
23. Rheingold, H. Smart Mobs. Perseus Publishing, 2003.
24. Richardson, M. and Domingos, P. Mining knowledge-sharing sites for viral marketing. In Proceedings of KDD, 2002.
25. Richardson, M. and Domingos, P. Building large knowledge bases by mass collaboration. In Proceedings of K-CAP, 2003.
26. Sarwar, B.M., Karypis, G., Konstan, J.A. and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of WWW, 2001.
27. Steinmetz, R. and Wehrle, K., eds. Peer-to-Peer Systems and Applications. Lecture Notes in Computer Science 3485, Springer, 2005.
28. Stork, D.G. Using open data collection for intelligent software. IEEE Computer 33, 10 (2000), 104–106.
29. Surowiecki, J. The Wisdom of Crowds. Anchor Books, 2005.
30. Tapscott, D. and Williams, A.D. Wikinomics. Portfolio, 2006.
31. Time. Special Issue: Person of the Year: You, 2006; http://www.time.com/time/magazine/article/0,9171,1569514,00.html.
32. von Ahn, L. and Dabbish, L. Labeling images with a computer game. In Proceedings of CHI, 2004.
33. von Ahn, L. and Dabbish, L. Designing games with a purpose. Comm. ACM 51, 8 (Aug. 2008), 58–67.
34. von Ahn, L., Maurer, B., McMillen, C., Abraham, D. and Blum, M. reCAPTCHA: Human-based character recognition via Web security measures. Science 321, 5895 (2008), 1465–1468.
35. Weld, D.S., Wu, F., Adar, E., Amershi, S., Fogarty, J., Hoffmann, R., Patel, K. and Skinner, M. Intelligence in Wikipedia. In Proceedings of AAAI, 2008.
36. Workshop on Collaborative Construction, Management and Linking of Structured Knowledge (CK 2009), 2009; http://users.ecs.soton.ac.uk/gc3/iswc-workshop.
37. Franklin, M., Kossmann, D., Kraska, T., Ramesh, S. and Xin, R. CrowdDB: Answering queries with crowdsourcing. In Proceedings of SIGMOD, 2011.
38. Marcus, A., Wu, E. and Madden, S. Crowdsourced databases: Query processing with people. In Proceedings of CIDR, 2011.
39. Parameswaran, A., Sarma, A., Garcia-Molina, H., Polyzotis, N. and Widom, J. Human-assisted graph search: It’s okay to ask questions. In Proceedings of VLDB, 2011.
40. Parameswaran, A. and Polyzotis, N. Answering queries using humans, algorithms, and databases. In Proceedings of CIDR, 2011.

AnHai Doan (anhai@cs.wisc.edu) is an associate professor of computer science at the University of Wisconsin-Madison and Chief Scientist at Kosmix Corp.

Raghu Ramakrishnan (ramakris@yahoo-inc.com) is Chief Scientist for Search & Cloud Computing, and a Fellow at Yahoo! Research, Silicon Valley, CA, where he heads the Community Systems group.

Alon Y. Halevy (halevy@google.com) heads the Structured Data Group at Google Research, Mountain View, CA.

© 2011 ACM 0001-0782/11/04 $10.00
