Frontiers of Computational Journalism - Columbia Journalism School Fall 2012 - Week 4: Information Overload and Algorithmic Filtering

Frontiers of Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
October 1, 2012
Week 4: Algorithmic Filtering

The Filtering Problem
Columbia Newsblaster system design
What should a filter do?
How Journalism Works (a model)
[Diagram: stories pass through filtering to reach each user; some stories are marked "not covered"]
Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than produced by TV networks during the entire 20th century

Google now indexes more web pages than there are people
400,000,000 tweets per day
Estimated 130,000,000 books ever published

10,000 legally-required reports filed by U.S. public companies every day
All New York Times articles ever = 0.06 terabytes
(13 million stories, 5k per story)
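
The figure above is simple arithmetic, and a quick check may help. This sketch assumes "5k" means 5,000 bytes per story and uses decimal terabytes; under those assumptions the result is about 0.065 TB, in line with the rounded number on the slide.

```python
# Back-of-the-envelope check of the archive size quoted on the slide.
STORIES = 13_000_000        # number of stories, from the slide
BYTES_PER_STORY = 5_000     # assumption: "5k per story" read as 5,000 bytes

total_bytes = STORIES * BYTES_PER_STORY
total_terabytes = total_bytes / 1e12   # decimal terabytes (10^12 bytes)

print(f"{total_terabytes:.3f} TB")
```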
"It's not information overload, it's filter failure"

- Clay Shirky

The Filtering Problem
Columbia Newsblaster system design
What should a filter do?
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages), with links followed to depth 4.
Then extract the text of each article.
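
A depth-limited crawl of this kind can be sketched as a breadth-first traversal. This is only an illustration, not the Newsblaster crawler: `crawl`, `get_links`, and the toy link graph are hypothetical names, and a real `get_links` would fetch a page and parse its anchors.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl from seed_urls, following links up to max_depth hops."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy link graph standing in for real pages (hypothetical data)
TOY_WEB = {"front": ["a"], "a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
pages = crawl(["front"], lambda url: TOY_WEB.get(url, []), max_depth=4)
```

In the toy graph, page "e" is five hops from the front page, so it is never reached.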
Text extraction from HTML

Ideal world: HTML5 article tags
"The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication."
- W3C Specification

Slightly closer to reality? hNews markup
Used by 577 news orgs (October 2010)
Unclear adoption today

Newsblaster paper: "For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted."
(At least it's simple. This was 2002. How often does this work now?)
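
The heuristic quoted above can be approximated with the standard library's HTML parser. This is a rough sketch under stated assumptions, not the Newsblaster code: "cell" is taken to mean a `<td>` element, text inside `<a>` elements is dropped, and text in nested tables is credited to the innermost cell only. `CellTextParser` and `extract_article` are hypothetical names.

```python
from html.parser import HTMLParser

class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, skipping text inside <a> links."""

    def __init__(self):
        super().__init__()
        self.open_cells = []   # stack of text fragments for currently open cells
        self.cells = []        # finished cell texts
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            fragments = self.open_cells.pop()
            # Join fragments and normalize whitespace
            self.cells.append(" ".join("".join(fragments).split()))
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Assumption: text belongs to the innermost open cell only
        if self.open_cells and self.link_depth == 0:
            self.open_cells[-1].append(data)


def extract_article(html, min_chars=512):
    """Return the largest cell's text if it is longer than min_chars, else None."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

Pages that are not laid out with tables yield no cells at all here, which is one reason to doubt how often the rule still works.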

Now multiple services/APIs do this, e.g. readability.com
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
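
The first two ingredients can be sketched in a few lines. The slide does not say which TF-IDF weighting is used, so the variant below (term frequency normalized by document length, times log of inverse document frequency) is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts of term -> weight)."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 0 means same direction."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Toy documents (hypothetical data)
docs = [
    "senate passes budget bill".split(),
    "senate debates budget bill".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
```

The two budget stories share three terms, so their distance is below 1, while the sports story shares none with either and sits at distance 1.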
Choose clustering algorithm...

Baseline: simple agglomerative clustering
put each item into a leaf node
while num clusters > 1
    find two clusters with distance < T
    if no such pair exists, stop
    merge them
single link or min
complete link or max
average
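
The three linkage options differ only in how the distance between two clusters is computed from the distances between their members. A minimal sketch, assuming we always merge the closest pair first and stop once no pair is closer than T (the slide's pseudocode merges any pair under T); the names `agglomerative` and `LINKAGES` are illustrative.

```python
from itertools import combinations

LINKAGES = {
    "single": min,                             # closest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),   # mean over all member pairs
}

def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    return LINKAGES[linkage]([dist(x, y) for x in a for y in b])

def agglomerative(items, dist, threshold, linkage="single"):
    """Merge the closest pair of clusters until no pair is closer than threshold."""
    clusters = [[item] for item in items]   # each item starts as its own leaf
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], dist, linkage))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best >= threshold:
            break   # no remaining pair is close enough to merge
        clusters[i].extend(clusters.pop(j))   # j > i, so cluster i is unaffected by the pop
    return clusters
```

On the points 0, 1, 2, 10, 11 with T = 1.5, single link chains 0, 1, and 2 into one cluster, while complete link leaves 2 on its own because it is 2 away from 0.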
But news is an on-line problem...

Articles arrive one at a time, and must be clustered immediately. Can't look forward in time, can't go back and reassign. Greedy algorithm.
Single pass clustering

put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
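
The pseudocode above translates almost directly into Python. One detail is left open by the slide: how to measure the distance from a story to a cluster. This sketch assumes the story is compared with the cluster's first (seed) story; a centroid or a linkage rule would work as well.

```python
def single_pass_cluster(stories, dist, threshold):
    """Greedy online clustering: each story is assigned once, in arrival order."""
    clusters = []
    for story in stories:
        for cluster in clusters:
            # Assumption: compare against the cluster's first (seed) story
            if dist(story, cluster[0]) < threshold:
                cluster.append(story)
                break
        else:
            clusters.append([story])   # nothing close enough: start a new cluster
    return clusters
```

For example, `single_pass_cluster([0, 1, 10, 2, 11], lambda a, b: abs(a - b), 3)` groups the numbers into `[[0, 1, 2], [10, 11]]`. No earlier assignment is ever revisited, which is what makes the algorithm greedy.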
Evaluating clusterings
When is one clustering better than another? Ideally, we'd like a quantitative metric. This is possible if we have training data = human-generated clusters. Available from the TDT2 corpus (Topic Detection and Tracking).
Error with respect to hand-generated clusters from training data
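
The slide does not say which error measure is plotted. One simple possibility, shown here purely as an assumption, is the fraction of story pairs on which the system and the human clustering disagree about "same cluster or not". `pairwise_error` is a hypothetical name.

```python
from itertools import combinations

def pairwise_error(predicted, reference):
    """Fraction of item pairs on which two clusterings disagree.

    predicted and reference each map an item to its cluster label.
    """
    pairs = list(combinations(sorted(reference), 2))
    if not pairs:
        return 0.0   # fewer than two items: nothing to disagree about
    disagreements = sum(
        (predicted[a] == predicted[b]) != (reference[a] == reference[b])
        for a, b in pairs
    )
    return disagreements / len(pairs)

# Toy labels (hypothetical data): humans see two events, the system splits differently
reference = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
predicted = {"s1": 0, "s2": 0, "s3": 0, "s4": 1}
error = pairwise_error(predicted, reference)
```

Here three of the six pairs are treated differently by the two clusterings, so the error is 0.5.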
Features: all words + entities
Entity extraction
Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.
All words does best!
But maybe possible to combine features to do better?
How to combine different features?

In the Newsblaster case, for every pair of documents (di,dj) we have three similarity values
dist1(i,j) = TF-IDF on all words
dist2(i,j) = nouns extracted by LinkIt algorithm
dist3(i,j) = entities extracted by Nominator algorithm
But we need a single distance function dist(di,dj) that takes into account all information.
Answer: weighted sum of distance functions

$$\mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j)$$
w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominator entities
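
In code, the weighted sum is a one-liner. The distances and weights below are made-up illustrative values, not the fitted Newsblaster weights, and `combined_distance` is a hypothetical name.

```python
def combined_distance(distances, weights):
    """Weighted sum of per-feature distances for one document pair."""
    if len(distances) != len(weights):
        raise ValueError("need exactly one weight per distance")
    return sum(w * d for w, d in zip(weights, distances))

# Hypothetical values for one pair (i, j): TF-IDF, LinkIt nouns, Nominator entities
pair_distances = (0.5, 0.25, 1.0)
weights = (0.5, 0.25, 0.25)   # illustrative only
d = combined_distance(pair_distances, weights)
```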
How to find optimal wi?

Well, we have training data available. So
Regression fit
From training data, define perfect distance fn:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find wi that minimize the squared error
$$\sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2$$
*actually, we use logistic regression, because it's a better fit to binary variables like R(i,j)
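
A minimal sketch of the logistic variant follows, assuming plain batch gradient descent with a bias term and a tiny invented training set. The learning rate, step count, and data are illustrative choices, and nothing here is taken from the Newsblaster implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, learning_rate=0.5, steps=2000):
    """Fit P(R=1) = sigmoid(b + sum_k w_k * dist_k) by batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for k, xi in enumerate(x):
                grad_w[k] += error * xi
        # Step against the mean gradient of the log loss
        bias -= learning_rate * grad_b / len(features)
        weights = [w - learning_rate * g / len(features)
                   for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(weights, bias, x):
    """Estimated probability that the pair belongs to different clusters."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy pairs: (dist1, dist2, dist3) with R = 0 (same cluster) or 1 (different)
X = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2),
     (0.8, 0.9, 0.7), (0.9, 0.7, 0.8), (0.7, 0.8, 0.9)]
R = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(X, R)
```

After fitting, `predict` returns a value in (0, 1) that can be used as the combined distance: pairs with small feature distances score below 0.5, pairs with large ones above.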
Combination slightly better
Also, error less sensitive to clustering threshold T
Now sort events into categories

Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.
Primitive operation: what topic is this story in?
TF-IDF, again
Each category has pre-assigned TF-IDF coordinate. Story category = closest point.
[Diagram: points labeled "world category" and "finance category", with the latest story plotted between them]
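
Nearest-point classification can be sketched as follows. The slide does not say how "closest" is measured, so cosine distance is assumed here, and the category vectors and story weights are invented for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def classify(story_vector, category_vectors):
    """Return the name of the category whose vector is closest to the story."""
    return min(
        category_vectors,
        key=lambda name: cosine_distance(story_vector, category_vectors[name]),
    )

# Hypothetical pre-assigned category coordinates
CATEGORIES = {
    "World": {"election": 1.0, "minister": 1.0, "treaty": 1.0},
    "Finance": {"stocks": 1.0, "bank": 1.0, "earnings": 1.0},
}
story = {"bank": 0.8, "earnings": 0.5, "minister": 0.1}
category = classify(story, CATEGORIES)
```

The toy story mentions a minister but is dominated by finance terms, so it lands in "Finance".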
Cluster summarization
Problem: given a set of documents, write a sentence summarizing them. Difficult problem. See references in Newsblaster paper, and more recent techniques.
Is Newsblaster really a filter?

After all, it shows all the news... Differences with Google News?

The Filtering Problem
Columbia Newsblaster system design
What should a filter do?
Filter design problem

Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
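
To make the signature concrete, here is a deliberately naive placeholder, not a proposed filter. It assumes a story is a list of terms and a user is a set of interest terms, and it scores relevance as the share of story terms the user cares about. All names are hypothetical, and the {P} and {B} arguments are accepted but ignored.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    """Stands in for U; here just a set of interest terms (assumption)."""
    interests: set = field(default_factory=set)

def relevance(story_terms, user, previous_results=(), background=None):
    """Toy r(S, U, {P}, {B}) in [0, 1]: share of story terms matching user interests.

    previous_results ({P}) and background ({B}) mirror the slide's signature
    but are ignored by this placeholder.
    """
    terms = set(story_terms)
    if not terms:
        return 0.0
    return len(terms & user.interests) / len(terms)

reader = User(interests={"senate", "budget"})
score = relevance(["budget", "senate", "vote", "bill"], reader)
```

A real filter would have to use history, other users, and world knowledge; the point of the sketch is only the shape of the function and its [0, 1] range.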
Filter design problem, restated

When should a user see a story? Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely
