
Frontiers of Computational Journalism
Columbia Journalism School
Week 3: Algorithmic Filtering
September 18, 2013

Lecture 3: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

How Journalism Works (a model)

[Figure labels: stories (x), stories not covered, filtering, User]

Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than produced by TV networks during the entire 20th century

Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
estimated 130,000,000 books ever published

10,000 legally-required reports filed by U.S. public companies every day

All New York Times articles ever = 0.06 terabytes


(13 million stories, 5k per story)
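The figure above is simple arithmetic. A quick check, assuming "5k" means roughly 5 kilobytes of text per story and decimal units (1 TB = 10^12 bytes):

```python
stories = 13_000_000        # from the slide
bytes_per_story = 5_000     # assumption: "5k" = about 5 kilobytes of text
total_bytes = stories * bytes_per_story
total_terabytes = total_bytes / 1e12   # assumption: decimal terabytes
print(f"{total_terabytes:.3f} TB")
```

Under those assumptions the result is in line with the slide's 0.06 terabytes.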

It's not information overload, it's filter failure


- Clay Shirky

Lecture 3: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

System Description

Scrape

Cluster events

Cluster topics

Summarize

Scrape
Handcrafted list of source URLs (news front pages) and links followed to depth 4.
Then extract the text of each article.

Text extraction from HTML

Ideal world: HTML5 article tags
"The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication."
- W3C Specification

Text extraction from HTML

Slightly closer to reality? hNews markup
Used by 577 news orgs (October 2010); unclear adoption today

Text extraction from HTML

Newsblaster paper:
"For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted."
(At least it's simple. This was 2002. How often does this work now?)
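The heuristic quoted above can be sketched with Python's standard `html.parser`. This is a rough illustration, not the Newsblaster code: it assumes "cell" means an HTML table cell (`<td>`), and the names `CellTextCollector` and `extract_article` are made up for this example.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # the constant quoted from the Newsblaster paper


class CellTextCollector(HTMLParser):
    """Collects visible text per <td> cell, skipping text inside links."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of every finished cell
        self._stack = []       # text buffers for currently open (nested) cells
        self._link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._stack.append([])
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._stack:
            # Join the pieces and normalize runs of whitespace.
            self.cells.append(" ".join("".join(self._stack.pop()).split()))
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        # Text belongs to the innermost open cell; link text is dropped.
        if self._stack and not self._link_depth:
            self._stack[-1].append(data)


def extract_article(html):
    """Return the largest cell's text if it exceeds the threshold, else None."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > MIN_ARTICLE_CHARS else None
```

Pages laid out without tables yield no cells at all, so this sketch returns None for them, which is one reason to doubt how well the 2002 rule works on current pages.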

Text extraction from HTML

Now multiple services/APIs do this, e.g. readability.com

Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
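The first two ingredients can be sketched in a few lines. This is a minimal illustration, assuming whitespace-tokenized text and a plain TF-IDF weighting (term frequency times log inverse document frequency); the slides do not say which TF-IDF variant Newsblaster uses, and the function names are chosen for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (word -> weight)."""
    n = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({w: (c / len(doc)) * math.log(n / doc_freq[w])
                        for w, c in counts.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


docs = [
    "senate passes budget bill".split(),
    "senate debates budget bill".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
```

With these toy documents, the two budget stories end up closer to each other than either is to the sports story, which shares no words with them.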

Different clustering algorithms

Partitioning
keep adjusting clusters until convergence, e.g. K-means

Agglomerative hierarchical
start with leaves, repeatedly merge clusters, e.g. MIN and MAX approaches

Divisive hierarchical
start with root, repeatedly split clusters, e.g. binary split

Agglomerative: combining clusters

put each item into a leaf node
while num clusters > 1
    find two closest clusters
    merge them
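The agglomerative loop above can be written out directly. A minimal sketch, assuming a caller-supplied `dist` function on items; the `linkage` argument selects between the single link (min), complete link (max), and average rules, and the quadratic search for the closest pair is kept naive for readability.

```python
def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pair_dists = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":      # MIN: closest pair of members
        return min(pair_dists)
    if linkage == "complete":    # MAX: farthest pair of members
        return max(pair_dists)
    return sum(pair_dists) / len(pair_dists)   # average over all pairs


def agglomerate(items, dist, linkage="single"):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[item] for item in items]    # each item starts as a leaf
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], dist, linkage),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], dist=lambda a, b: abs(a - b))
```

The returned merge history is the information a dendrogram draws: each entry records which two clusters were joined at that step.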

Divisive: splitting clusters

put all items into one cluster
while num clusters < num items
    find largest cluster
    split so pieces are as far apart as possible
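A sketch of the divisive loop above. The slide does not say how to split "so pieces are as far apart as possible"; this example uses one simple heuristic (seed the two halves with the farthest-apart pair, then assign every item to the nearer seed), and it assumes all items are distinct so that every split produces two non-empty pieces.

```python
def split_cluster(cluster, dist):
    """Split around the two items that are farthest apart (one possible heuristic)."""
    seed_a, seed_b = max(
        ((a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]),
        key=lambda pair: dist(*pair),
    )
    left, right = [], []
    for item in cluster:
        # Ties go to the left piece; seed_b always lands on the right
        # as long as the two seeds are a positive distance apart.
        (left if dist(item, seed_a) <= dist(item, seed_b) else right).append(item)
    return left, right


def divide(items, dist):
    """Repeatedly split the largest cluster until every item stands alone."""
    clusters = [list(items)]
    splits = []
    while len(clusters) < len(items):
        largest = max(clusters, key=len)
        clusters.remove(largest)
        left, right = split_cluster(largest, dist)
        splits.append((left, right))
        clusters += [left, right]
    return splits
```

Reading the split history from first to last walks the tree from the root down, the reverse of the agglomerative order.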

single link or min

complete link or max

average

Trees and Dendrograms

But news is an on-line problem...


Articles arrive one at a time, and must be clustered immediately.
Can't look forward in time, can't go back and reassign.
Greedy algorithm.

Single pass clustering



put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
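The single pass loop above translates almost line for line. Two details are assumptions of this sketch, since the slide leaves them open: the distance from a story to a cluster is taken as the distance to its nearest member, and when several clusters are within T the closest one wins.

```python
def single_pass_cluster(stories, dist, threshold):
    """Greedy online clustering: each story is assigned once, on arrival."""
    clusters = []
    for story in stories:
        best, best_dist = None, threshold
        for cluster in clusters:
            # Assumption: distance to a cluster = distance to its nearest member.
            d = min(dist(story, member) for member in cluster)
            if d < best_dist:
                best, best_dist = cluster, d
        if best is not None:
            best.append(story)          # found a cluster with distance < T
        else:
            clusters.append([story])    # otherwise start a new cluster
    return clusters


clusters = single_pass_cluster(
    [1.0, 1.1, 5.0, 1.3, 5.2], dist=lambda a, b: abs(a - b), threshold=0.5
)
```

Because assignments are never revisited, the result depends on arrival order and on the threshold T.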

Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (topic detection and tracking)
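One simple way to score a clustering against human-generated clusters is to count the item pairs on which the two disagree. This is only an illustrative metric chosen for this sketch; it is not necessarily the error measure used in the Newsblaster paper or in the TDT evaluations.

```python
from itertools import combinations


def pairwise_error(predicted, reference):
    """Fraction of item pairs on which two clusterings disagree.

    Both arguments map item id -> cluster label; the label names need not match.
    """
    pairs = list(combinations(sorted(reference), 2))
    disagreements = sum(
        (predicted[a] == predicted[b]) != (reference[a] == reference[b])
        for a, b in pairs
    )
    return disagreements / len(pairs)


# Hypothetical labels for four stories.
human = {"a": 1, "b": 1, "c": 2, "d": 2}
system = {"a": "x", "b": "x", "c": "x", "d": "y"}
error = pairwise_error(system, human)
```

A lower value means the system's clusters agree with the human ones on more pairs, which gives a single number for comparing clusterings or tuning the threshold T.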

Error with respect to hand-generated clusters from training data

Features: all words + entities

Entity extraction

Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.

All words does best!

But maybe possible to combine features to do better?

How to combine different features?

In the Newsblaster case, for every pair of documents (d_i, d_j) we have three similarity values:
dist1(i,j) = TF-IDF on all words
dist2(i,j) = nouns extracted by LinkIt algorithm
dist3(i,j) = entities extracted by Nominator algorithm

But we need a single distance function dist(d_i, d_j) that takes into account all information.

Answer: weighted sum of distance functions

\[ \mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j) \]

w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominator entities
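As a small worked instance of the weighted sum, with made-up numbers: the three distances and the weights below are purely illustrative and are not the values fitted for Newsblaster.

```python
def combined_distance(distances, weights):
    """Weighted sum of per-feature distances for one document pair."""
    if len(distances) != len(weights):
        raise ValueError("need one weight per distance")
    return sum(w * d for w, d in zip(weights, distances))


# Hypothetical values for one pair (i, j): TF-IDF, LinkIt nouns, Nominator entities.
pair_distances = [0.40, 0.70, 0.10]
weights = [0.5, 0.3, 0.2]   # illustrative only, not fitted weights
d = combined_distance(pair_distances, weights)
# 0.5*0.40 + 0.3*0.70 + 0.2*0.10 = 0.43
```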

How to find optimal w_i?


Well, we have training data available. So

Regression fit
From training data, define perfect distance function:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find w_i that minimize
\[ \sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2 \]

*actually, we use logistic regression, because it's a better fit to binary variables like R(i,j)
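A toy version of the logistic regression fit, written with plain gradient descent so it needs no libraries. The training pairs are invented, the learning rate and epoch count are arbitrary, and using the fitted probability of "different cluster" as the combined distance is an assumption of this sketch rather than something the slides spell out.

```python
import math


def predict(weights, bias, x):
    """Estimated probability that a pair is in different clusters (R = 1)."""
    z = bias + sum(w * xk for w, xk in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for k, xk in enumerate(x):
                grad_w[k] += error * xk
        bias -= lr * grad_b / len(features)
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up training pairs: (dist1, dist2, dist3) and the target R(i, j).
X = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.8, 0.9, 0.7), (0.9, 0.7, 0.8)]
R = [0, 0, 1, 1]
weights, bias = fit_logistic(X, R)
```

After fitting, `predict(weights, bias, x)` is low for pairs that look like the same-cluster examples and high for the others.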

Combination slightly better

Also, error less sensitive to clustering threshold T

Now sort events into categories


Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has pre-assigned TF-IDF coordinate. Story category = closest point.
[Figure labels: world category, finance category, latest story]
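The "closest point" rule can be sketched as a nearest-neighbor lookup over one reference vector per category. The category vectors and the story vector below are hand-made for illustration, and cosine distance is an assumption here, chosen to match the distance used for clustering.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def categorize(story_vec, category_vecs):
    """Return the category whose reference vector is closest to the story."""
    return min(category_vecs,
               key=lambda name: cosine_distance(story_vec, category_vecs[name]))


# Hypothetical TF-IDF weights; a real system would derive them from data.
categories = {
    "World": {"minister": 0.9, "treaty": 0.8, "border": 0.6},
    "Finance": {"stocks": 0.9, "bank": 0.8, "earnings": 0.7},
    "Sports": {"game": 0.9, "season": 0.7, "coach": 0.6},
}
story = {"bank": 0.7, "earnings": 0.5, "minister": 0.1}
category = categorize(story, categories)
```

Here the story shares most of its weight with the Finance vector, so that is the category returned.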

Cluster summarization
Problem: given a set of documents, write a sentence summarizing them.
Difficult problem. See references in Newsblaster paper, and more recent techniques.

Is Newsblaster really a filter?

After all, it shows all the news...
Differences with Google News?

Personalization! Not every person needs to see the same news.

Filter design problem


Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define r(S,U,{P},{B}) in [0...1]: relevance of story S to user U

Lecture 3: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

What makes a filtering algorithm "good"?

Editors
Editors are filters. They decide what stories to run, and how prominent to make them. How do they choose?

The Echo Chamber


"[Echo chambers are] those Internet spaces where like-minded people listen only to those people who already agree with them. ... While most of us had assumed that the Internet would increase the diversity of opinion, the echo chamber meme says the Net encourages groups to form that increase the homogeneity of belief. This isn't simply a factual argument about the topography carved by traffic and links. A 'tut, tut' has been appended: 'See, you Web idealists have been shown up - humankind's social nature sucks, just as we always told you!'"
- David Weinberger, Is there an echo in here?

Graph of political book sales during 2008 U.S. election, by orgnet.org. From Amazon "users who bought X also bought Y" data.

Retweet network of political tweets. From Conover et al., Political Polarization on Twitter

The Filter Bubble


"What people care about politically, and what they're motivated to do something about, is a function of what they know about and what they see in their media. ... People see something about the deficit on the news, and they say, 'Oh, the deficit is the big problem.' If they see something about the environment, they say the environment is a big problem.

This creates this kind of a feedback loop in which your media influences your preferences and your choices; your choices influence your media; and you really can go down a long and narrow path, rather than actually seeing the whole set of issues in front of us."

- Eli Pariser, How do we recreate a front-page ethos for a digital world?

The (Algorithmic) Filter Bubble


If we try to present stories that the user will want to click on... do we end up only telling people what they want to hear? If an algorithm only shows us things our friends like, will we ever see anything that challenges us?

Filter design problem, restated


When should a user see a story? Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely

Information diet
"The holy grail in this model, as far as I'm concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn't encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn't necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes."
- Ethan Zuckerman, Playing the Internet with PMOG
