
Frontiers of Computational Journalism
Columbia Journalism School Week 4: Algorithmic Filtering October 1, 2012

Week 4: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

How Journalism Works (a model)

[Figure: diagram of the model, showing users, filtering, and stories not covered (marked x)]

Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than was produced by TV networks during the entire 20th century

Google now indexes more web pages than there are people
400,000,000 tweets per day
Estimated 130,000,000 books ever published

10,000 legally required reports filed by U.S. public companies every day

All New York Times articles ever = 0.06 terabytes


(13 million stories, 5k per story)

It's not information overload, it's filter failure


- Clay Shirky

Week 4: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

System Description

Scrape

Cluster events

Cluster topics

Summarize

Scrape
Handcrafted list of source URLs (news front pages), with links followed to depth 4.
Then extract the text of each article.

Text extraction from HTML

Ideal world: HTML5 article tags
"The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication."
- W3C Specification

Text extraction from HTML

Slightly closer to reality? hNews markup
Used by 577 news orgs (October 2010); adoption today is unclear

Text extraction from HTML

Newsblaster paper: "For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted."
(At least it's simple. This was 2002. How often does this work now?)
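The heuristic quoted from the Newsblaster paper could be sketched roughly as follows. This is not the Newsblaster code: it assumes "cell" means an HTML table cell (td/th), assigns text to the innermost open cell, and uses Python's standard html.parser. The names CellTextParser and extract_article are made up for illustration.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # threshold quoted from the Newsblaster paper


class CellTextParser(HTMLParser):
    """Collects the text of every table cell, skipping link text and scripts."""

    SKIPPED = {"a", "script", "style"}

    def __init__(self):
        super().__init__()
        self.open_cells = []  # stack of text buffers for cells still open
        self.cells = []       # whitespace-normalized text of each closed cell
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.open_cells.append([])
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.open_cells:
            text = "".join(self.open_cells.pop())
            self.cells.append(" ".join(text.split()))
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text is credited to the innermost open cell only (an assumption).
        if self.open_cells and self.skip_depth == 0:
            self.open_cells[-1].append(data)


def extract_article(html, min_chars=MIN_ARTICLE_CHARS):
    """Return the text of the largest cell if it exceeds min_chars, else None."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

Pages that do not use table layout have no cells at all, so this sketch returns None for them, which is one reason the heuristic may work less often on modern pages.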

Text extraction from HTML

Now multiple services/APIs to do this, e.g. readability.com

Cluster Events

Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
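A minimal sketch of the first two ingredients, assuming documents are already tokenized into word lists and using one common TF-IDF weighting (raw term count times log inverse document frequency). The exact weighting Newsblaster used is not stated here, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

With this weighting, a term that appears in every document gets weight zero, so it does not contribute to the distance.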

Choose clustering algorithm...


Baseline: simple agglomerative clustering

put each item into a leaf node
while num clusters > 1
    find two clusters with distance < T
    merge them

single link or min

complete link or max

average
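The baseline and the three linkage rules can be sketched as below. This is a naive version for illustration, not an efficient implementation: it recomputes every pairwise cluster distance on each pass, always merges the closest pair, and stops when no pair is closer than T. The function and parameter names are assumptions.

```python
from itertools import combinations

LINKAGE = {
    "single": min,                             # closest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),   # mean over all member pairs
}


def agglomerative(items, dist, threshold, linkage="single"):
    """Merge clusters bottom-up until no two are closer than threshold."""
    combine = LINKAGE[linkage]
    clusters = [[item] for item in items]  # each item starts as its own leaf

    def cluster_dist(a, b):
        return combine([dist(x, y) for x in a for y in b])

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_dist(clusters[i], clusters[j]) >= threshold:
            break  # nothing left that is closer than T
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters
```

For example, with numbers as items and absolute difference as the distance, single link chains 1.0, 1.2, 1.4 into one cluster at T = 0.5, while complete link at T = 0.3 refuses the last merge because 1.0 and 1.4 are too far apart.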

But news is an on-line problem...


Articles arrive one at a time, and must be clustered immediately. Can't look forward in time, can't go back and reassign. Greedy algorithm.

Single pass clustering



put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
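A runnable sketch of the single pass loop. Two choices here are assumptions the pseudocode leaves open: the distance from a story to a cluster is taken as the distance to its nearest member, and when several clusters are within T the nearest one wins.

```python
def single_pass(stories, dist, threshold):
    """Greedy online clustering: each story is assigned once, on arrival."""
    clusters = []
    for story in stories:
        best, best_d = None, threshold
        for cluster in clusters:
            # Assumed story-to-cluster distance: nearest member (single link).
            d = min(dist(story, member) for member in cluster)
            if d < best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append([story])  # nothing within T: start a new cluster
        else:
            best.append(story)
    return clusters
```

Because stories are never reassigned, the result depends on arrival order.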

Evaluating clusterings
When is one clustering better than another? Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (topic detection and tracking).

Error with respect to hand-generated clusters from training data
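One simple way to score a clustering against human-generated clusters is to count the document pairs on which the two disagree. This pair-counting error (one minus the Rand index) is chosen here for illustration only; it is not necessarily the metric used in the Newsblaster evaluation or by TDT.

```python
from itertools import combinations


def pairwise_error(predicted, reference):
    """Fraction of document pairs where the two clusterings disagree.

    predicted and reference are lists of cluster labels, one per document.
    """
    pairs = list(combinations(range(len(predicted)), 2))
    if not pairs:
        return 0.0
    wrong = sum(
        (predicted[i] == predicted[j]) != (reference[i] == reference[j])
        for i, j in pairs
    )
    return wrong / len(pairs)
```

Only the grouping matters, not the label values, so labels [0, 0, 1, 1] and [5, 5, 7, 7] score an error of zero against each other.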

Features: all words + entities

Entity extraction

Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.

All words does best!

But maybe possible to combine features to do better?

How to combine different features?


In the Newsblaster case, for every pair of documents (di, dj) we have three similarity values:
dist1(i,j) = TF-IDF on all words
dist2(i,j) = nouns extracted by LinkIt algorithm
dist3(i,j) = entities extracted by Nominator algorithm

But we need a single distance function dist(di, dj) that takes into account all information.

Answer: weighted sum of distance fns

\[ \mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j) \]

w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominator entities

How to find optimal w_i?


Well, we have training data available. So

Regression fit
From training data, define perfect distance fn:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find w_i that minimize
\[ \sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2 \]

*actually, we use logistic regression, because it's a better fit to binary variables like R(i,j)
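A minimal sketch of the least-squares version of this fit, using plain gradient descent on the squared error between the weighted sum and R(i,j). The step count, learning rate, and function names are illustrative assumptions, and this is the simple squared-error form, not the logistic regression the footnote says was actually used.

```python
def combined_dist(w, x):
    """Weighted sum of the per-feature distances x = (dist1, dist2, dist3)."""
    return sum(wk * xk for wk, xk in zip(w, x))


def fit_weights(samples, targets, steps=2000, lr=0.1):
    """Find weights minimizing the mean of (combined_dist - R)^2.

    samples: one (dist1, dist2, dist3) tuple per document pair
    targets: R(i,j) per pair, 0 if same human cluster, 1 otherwise
    """
    k = len(samples[0])
    n = len(samples)
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for x, r in zip(samples, targets):
            err = combined_dist(w, x) - r
            for idx in range(k):
                grad[idx] += 2.0 * err * x[idx] / n  # d/dw_k of squared error
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w
```

A feature whose distance tracks R closely ends up with a large weight, while an uninformative feature ends up with a weight near zero.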

Combination slightly better

Also, error less sensitive to clustering threshold T

Now sort events into categories


Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has pre-assigned TF-IDF coordinate. Story category = closest point.
[Figure: the latest story plotted alongside the world category and finance category points]
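The "closest point" rule can be sketched as a nearest-category lookup. The category vectors below are hypothetical placeholders, and cosine distance is assumed as the distance measure because it is the one used for clustering.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def categorize(story_vec, category_vecs):
    """Return the name of the category whose vector is closest to the story."""
    return min(
        category_vecs,
        key=lambda name: cosine_distance(story_vec, category_vecs[name]),
    )


# Hypothetical pre-assigned category coordinates, for illustration only.
CATEGORIES = {
    "World": {"war": 1.0, "treaty": 1.0},
    "Finance": {"stocks": 1.0, "bank": 1.0},
}
```

For instance, a story vector weighted toward "bank" and "stocks" lands in Finance even if it also mentions "treaty".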

Cluster summarization
Problem: given a set of documents, write a sentence summarizing them. Difficult problem. See references in Newsblaster paper, and more recent techniques.

Is Newsblaster really a filter?

After all, it shows all the news...
Differences with Google News?

Week 4: Algorithmic Filtering


The Filtering Problem
Columbia Newsblaster system design
What should a filter do?

Filter design problem


Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)

Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
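To make the shape of r concrete, here is a deliberately toy sketch of the interface. The scoring rule (the fraction of story terms that match the user's stated interests) is an invented placeholder, not a proposed filter, and this version accepts {P} and {B} but does not use them.

```python
def relevance(story_terms, user_interests, previous_results=(), background=None):
    """Toy r(S, U, {P}, {B}) returning a score in [0, 1].

    story_terms      -- S: terms of the current story
    user_interests   -- U: terms the user has said they care about
    previous_results -- {P}: accepted but unused in this sketch
    background       -- {B}: accepted but unused in this sketch
    """
    story = set(story_terms)
    if not story:
        return 0.0
    overlap = len(story & set(user_interests)) / len(story)
    return max(0.0, min(1.0, overlap))
```

A real filter would presumably also draw on {P} and {B}, for example to avoid repeating stories or to use other users' behavior.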

Filter design problem, restated


When should a user see a story? Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely
