Computational Journalism at Columbia, Fall 2013, Lecture 3: Algorithmic Filtering

Frontiers of Computational Journalism
Columbia Journalism School Week 3: Algorithmic Filtering September 18, 2013
Lecture 3: Algorithmic Filtering

The Filtering Problem
Columbia Newsblaster system design
What should a filter do?
How Journalism Works (a model)
[Figure: stories (marked x) pass through filtering to reach each user; some stories are not covered.]
Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive
More video on YouTube than produced by TV networks during the entire 20th century
Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
Estimated 130,000,000 books ever published
10,000 legally-required reports filed by U.S. public companies every day
All New York Times articles ever = 0.06 terabytes

(13 million stories, 5k per story)
"It's not information overload, it's filter failure"

- Clay Shirky

System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages), with links followed to depth 4. Then extract the text of each article.
Text extraction from HTML

Ideal world: HTML5 article tags
"The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication."
- W3C Specification

Slightly closer to reality? hNews markup
Used by 577 news orgs (October 2010); unclear adoption today

Newsblaster paper: "For each page examined, if the amount of text in the largest cell of the page (after stripping tags and links) is greater than some particular constant (currently 512 characters), it is assumed to be a news article, and this text is extracted."
(At least it's simple. This was 2002. How often does this work now?)
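The heuristic quoted above can be sketched with Python's standard-library HTML parser. This is only an illustration under stated assumptions: "cell" is read as a `<td>` element, text inside `<a>` links is dropped, nested cells are counted separately, and the names `CellTextParser` and `extract_article` are made up for this example rather than taken from Newsblaster.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # the constant quoted from the Newsblaster paper


class CellTextParser(HTMLParser):
    """Collects the text of every <td> cell, skipping text inside links."""

    def __init__(self):
        super().__init__()
        self.open_cells = []  # stack of text buffers, one per open <td>
        self.cells = []       # text of each finished cell
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            raw = "".join(self.open_cells.pop())
            self.cells.append(" ".join(raw.split()))  # normalize whitespace
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Assumption: text belongs to the innermost open cell only.
        if self.open_cells and self.link_depth == 0:
            self.open_cells[-1].append(data)


def extract_article(html, min_chars=MIN_ARTICLE_CHARS):
    """Return the largest cell's text if it looks like an article, else None."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

A page laid out without tables yields no cells at all, so this sketch returns None for it, which is one reason the heuristic may work less often on current pages.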

Now multiple services/APIs to do this, e.g. readability.com
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
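The first two steps might look like the sketch below. The exact weighting is an assumption: term frequency is normalized by document length and the inverse document frequency uses a plain logarithm, which is one common TF-IDF variant and not necessarily the one Newsblaster used. The function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents (lists of words) into sparse TF-IDF dicts."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for same direction, 1 for no shared terms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Two stories that share distinctive words end up closer than two stories with no vocabulary in common, which is the property the clustering step relies on.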
Different clustering algorithms

Partitioning
keep adjusting clusters until convergence, e.g. K-means
Agglomerative hierarchical
start with leaves, repeatedly merge clusters, e.g. MIN and MAX approaches
Divisive hierarchical
start with root, repeatedly split clusters, e.g. binary split
Agglomerative: combining clusters

put each item into a leaf node
while num clusters > 1
    find two closest clusters
    merge them
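A direct, unoptimized rendering of this loop in Python is sketched below. The `linkage` parameter is an assumption added for illustration: passing `min` gives the MIN approach and passing `max` gives the MAX approach. The function returns the merge history, which is the information a dendrogram is drawn from.

```python
def agglomerative(items, dist, linkage=min):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[x] for x in items]  # every item starts as its own leaf
    merges = []
    while len(clusters) > 1:
        best = None
        # Brute-force search for the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

For example, `agglomerative([0, 1, 10, 11, 50], lambda a, b: abs(a - b))` first merges 0 with 1, then 10 with 11, and attaches 50 last.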
Divisive: splitting clusters

put all items into one cluster
while num clusters < num items
    find largest cluster
    split so pieces are as far apart as possible
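The slide does not say how to find the split whose pieces are as far apart as possible, so the sketch below substitutes one simple heuristic: seed the two halves with the most distant pair of items and assign every other item to the nearer seed. The function names are illustrative.

```python
def split_cluster(cluster, dist):
    """Binary split seeded by the two most distant items (a heuristic)."""
    pairs = [
        (dist(a, b), i, j)
        for i, a in enumerate(cluster)
        for j, b in enumerate(cluster)
        if i < j
    ]
    _, i, j = max(pairs)
    left, right = [cluster[i]], [cluster[j]]
    for k, x in enumerate(cluster):
        if k in (i, j):
            continue
        # Each remaining item joins whichever seed it is closer to.
        if dist(x, cluster[i]) <= dist(x, cluster[j]):
            left.append(x)
        else:
            right.append(x)
    return left, right


def divisive(items, dist):
    """Split the largest cluster until every item is alone; return the splits."""
    clusters = [list(items)]
    splits = []
    while len(clusters) < len(items):
        largest = max(clusters, key=len)
        clusters.remove(largest)
        left, right = split_cluster(largest, dist)
        splits.append((left, right))
        clusters += [left, right]
    return splits
```

The first split of `[0, 1, 10, 11]` under absolute difference separates `{0, 1}` from `{10, 11}`, and later splits break those pairs apart.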
single link or min
complete link or max
average
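These three names are different answers to the question "how far apart are two clusters?". A minimal sketch, assuming `dist` is any function that returns the distance between two items:

```python
def single_link(c1, c2, dist):
    """MIN: distance between the closest pair of items across the clusters."""
    return min(dist(a, b) for a in c1 for b in c2)


def complete_link(c1, c2, dist):
    """MAX: distance between the farthest pair of items across the clusters."""
    return max(dist(a, b) for a in c1 for b in c2)


def average_link(c1, c2, dist):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

For the clusters `[0, 1]` and `[4, 6]` under absolute difference, the cross-cluster distances are 4, 6, 3 and 5, so the three functions return 3, 6 and 4.5 respectively.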
Trees and Dendrograms
But news is an on-line problem...

Articles arrive one at a time, and must be clustered immediately. Can't look forward in time, can't go back and reassign. Greedy algorithm.
Single pass clustering

put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
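In Python the single pass might be sketched as follows. Two details are assumptions, since the pseudocode leaves them open: the distance from a story to a cluster is taken as the distance to its nearest member, and when several clusters are within T the closest one wins.

```python
def single_pass(stories, dist, threshold):
    """Greedy on-line clustering: each story is placed once, never moved."""
    clusters = []
    for story in stories:
        best, best_d = None, threshold
        for cluster in clusters:
            # Assumption: story-to-cluster distance = nearest member.
            d = min(dist(story, member) for member in cluster)
            if d < best_d:
                best, best_d = cluster, d
        if best is not None:
            best.append(story)
        else:
            clusters.append([story])  # nothing close enough: start a new cluster
    return clusters
```

With absolute difference as the distance and T = 3, the stream `[0, 1, 10, 2, 11]` yields the clusters `[0, 1, 2]` and `[10, 11]`.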
Evaluating clusterings
When is one clustering better than another? Ideally, we'd like a quantitative metric. This is possible if we have training data = human-generated clusters. Available from the TDT2 corpus (topic detection and tracking).
Error with respect to hand-generated clusters from training data
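The slide does not state which error measure is plotted. As one possible quantitative metric, the sketch below counts the fraction of document pairs on which a clustering disagrees with the hand-generated one about "same cluster or not". The function name and the dict-based input format are assumptions for this example.

```python
from itertools import combinations


def pairwise_error(predicted, reference):
    """Fraction of document pairs where the two clusterings disagree.

    Both arguments map a document id to a cluster label.
    """
    pairs = list(combinations(sorted(reference), 2))
    if not pairs:
        return 0.0
    wrong = sum(
        (predicted[a] == predicted[b]) != (reference[a] == reference[b])
        for a, b in pairs
    )
    return wrong / len(pairs)
```

A value of 0 means the clustering reproduces the human clusters exactly, whatever the cluster labels are called.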
Features: all words + entities
Entity extraction
Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.
All words does best!
But maybe possible to combine features to do better?
How to combine different features?

In the Newsblaster case, for every pair of documents (di,dj) we have three similarity values
dist1(i,j) = TF-IDF on all words
dist2(i,j) = nouns extracted by LinkIt algorithm
dist3(i,j) = entities extracted by Nominator algorithm
But we need a single distance function dist(di,dj) that takes into account all information.
Answer: weighted sum of distance fns
$$\mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j)$$
w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominator entities
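As a small worked example of the weighted sum, with made-up numbers (neither the distances nor the weights come from the Newsblaster paper):

```python
def combined_distance(component_distances, weights):
    """Weighted sum of the per-feature distances for one document pair."""
    return sum(w * d for w, d in zip(weights, component_distances))


# Hypothetical values for one pair (i, j): TF-IDF, LinkIt nouns, Nominator entities.
dists = [0.2, 0.5, 0.9]
weights = [0.5, 0.3, 0.2]
total = combined_distance(dists, weights)  # 0.5*0.2 + 0.3*0.5 + 0.2*0.9
```

Here `total` is 0.43, up to floating-point rounding.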
How to find optimal wi?

Well, we have training data available. So
Regression fit
From training data, define perfect distance fn:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find wi that minimize
$$\sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2$$
*actually, we use logistic regression, because it's a better fit to binary variables like R(i,j)
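A logistic-regression fit of the weights could be sketched as below, using plain batch gradient descent so that no external library is needed. Everything about the setup is an assumption: the intercept term `b`, the learning rate, the number of steps, and the function names are illustrative, and the slides do not describe how Newsblaster actually performed the fit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weights(pairs, labels, steps=2000, lr=0.5):
    """Fit weights by gradient descent on the logistic loss.

    pairs  -- one [dist1, dist2, dist3] list per document pair
    labels -- R(i,j): 0 if the pair is in the same cluster, 1 otherwise
    """
    k, n = len(pairs[0]), len(pairs)
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, r in zip(pairs, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - r
            grad_b += err
            for idx in range(k):
                grad_w[idx] += err * x[idx]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def predicted_distance(x, w, b):
    """Combined distance in (0, 1): estimated probability of 'different cluster'."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
```

The output of `predicted_distance` can then stand in for dist(i,j) in the clustering step, with the threshold T applied to a value that always lies between 0 and 1.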
Combination slightly better
Also, error less sensitive to clustering threshold T
Now sort events into categories

Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.
Primitive operation: what topic is this story in?
TF-IDF, again
Each category has pre-assigned TF-IDF coordinate. Story category = closest point.
[Figure: the latest story plotted in TF-IDF space alongside the "world category" and "finance category" points.]
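Picking the closest category point might look like the sketch below. The category vectors and the story vector are invented for illustration, and cosine distance is assumed as the measure of "closest" because it is the distance used earlier in the lecture.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def categorize(story_vec, category_vecs):
    """Return the name of the category whose point is closest to the story."""
    return min(
        category_vecs,
        key=lambda name: cosine_distance(story_vec, category_vecs[name]),
    )


# Hypothetical pre-assigned category coordinates and one incoming story.
categories = {
    "World": {"election": 1.0, "treaty": 0.8},
    "Finance": {"stocks": 1.0, "bank": 0.9},
}
story = {"bank": 0.7, "stocks": 0.2, "treaty": 0.1}
label = categorize(story, categories)
```

The story shares far more weight with the Finance terms than with the World terms, so `label` is `"Finance"`.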
Cluster summarization
Problem: given a set of documents, write a sentence summarizing them. Difficult problem. See references in Newsblaster paper, and more recent techniques.
Is Newsblaster really a filter?

After all, it shows all the news... Differences with Google News?
Personalization! Not every person needs to see the same news.
Filter design problem

Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
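To make the shape of r concrete, here is a deliberately naive toy with the same four inputs. It is not a proposed answer to the design problem: it scores a story by the Jaccard overlap between its terms and the user's stated interests, and it accepts but ignores {P} and {B}. All parameter names are hypothetical.

```python
def relevance(story_terms, user_interests, previous_results=(), background=None):
    """Toy r(S, U, {P}, {B}) returning a value in [0, 1].

    previous_results and background are placeholders for {P} and {B};
    this sketch does not use them.
    """
    story, interests = set(story_terms), set(user_interests)
    if not story or not interests:
        return 0.0
    return len(story & interests) / len(story | interests)
```

For a story about `{"election", "senate"}` and a user interested in `{"senate", "budget"}`, one of three distinct terms is shared, so the score is 1/3.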

What makes a filtering algorithm "good"?
Editors
Editors are filters. They decide what stories to run, and how prominent to make them. How do they choose?
The Echo Chamber

"[Echo chambers are] those Internet spaces where like-minded people listen only to those people who already agree with them. ... While most of us had assumed that the Internet would increase the diversity of opinion, the echo chamber meme says the Net encourages groups to form that increase the homogeneity of belief. This isn't simply a factual argument about the topography carved by traffic and links. A "tut, tut" has been appended: See, you Web idealists have been shown up - humankind's social nature sucks, just as we always told you!"
- David Weinberger, Is there an echo in here?
Graph of political book sales during 2008 U.S. election, by orgnet.org. From Amazon "users who bought X also bought Y" data.
Retweet network of political tweets. From Conover, et al., Political Polarization on Twitter
The Filter Bubble

"What people care about politically, and what they're motivated to do something about, is a function of what they know about and what they see in their media. ... People see something about the deficit on the news, and they say, 'Oh, the deficit is the big problem.' If they see something about the environment, they say the environment is a big problem.

This creates this kind of a feedback loop in which your media influences your preferences and your choices; your choices influence your media; and you really can go down a long and narrow path, rather than actually seeing the whole set of issues in front of us."

- Eli Pariser, How do we recreate a front-page ethos for a digital world?
The (Algorithmic) Filter Bubble

If we try to present stories that the user will want to click on... do we end up only telling people what they want to hear? If an algorithm only shows us things our friends like, will we ever see anything that challenges us?
Filter design problem, restated

When should a user see a story? Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely
Information diet
"The holy grail in this model, as far as I'm concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn't encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn't necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes."
- Ethan Zuckerman, Playing the Internet with PMOG
