of Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
October 1, 2012
[Figure: filtering diagram; surviving labels: "User", "filtering"]
Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than was produced by TV networks during the entire 20th century
Google now indexes more web pages than there are people
400,000,000 tweets per day
Estimated 130,000,000 books ever published
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages) and links followed to depth 4.
Then extract the text of each article.
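The "links followed to depth 4" step can be sketched as a depth-limited breadth-first crawl. This is only an illustration: the `LINKS` dictionary is a made-up in-memory stand-in for real pages, `crawl` and `get_links` are hypothetical names, and counting the seed pages as depth 0 is an assumption.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages:
# each URL maps to the list of URLs it links to.
LINKS = {
    "front": ["a", "b"],
    "a": ["c"],
    "b": ["c", "front"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
    "f": [],
}

def crawl(seeds, get_links, max_depth=4):
    """Breadth-first crawl from the seed URLs, following links up to max_depth hops."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl(["front"], lambda url: LINKS.get(url, []), max_depth=4)
```

In a real scraper, `get_links` would fetch the page and parse its anchors, and a separate step would extract the article text from each visited page.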
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm

while num clusters > 1:
    find two clusters with distance < T
    merge them
average
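A minimal sketch of the three ingredients on this slide (feature vectors, cosine distance, threshold-based merging) might look like the following. Several choices are assumptions rather than the system's actual settings: whitespace tokenization, the plain TF-IDF weighting, average pairwise distance as the cluster-to-cluster distance, merging the closest pair first, and the threshold value used in the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode each document (a string) as a sparse TF-IDF dict: term -> weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: count * math.log(n / df[t]) for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def cluster(vectors, threshold):
    """Merge the closest pair of clusters until no pair is closer than threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):
        # Average pairwise distance between members (an assumed linkage rule).
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (dist(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d >= threshold:
            break  # no remaining pair is closer than T
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

docs = [
    "earthquake hits city rescue teams arrive",
    "rescue teams search city after earthquake",
    "election results announced candidate wins vote",
    "candidate wins election after close vote",
]
events = cluster(tfidf_vectors(docs), threshold=0.8)
```

With these four toy headlines, the two earthquake stories and the two election stories end up in separate clusters. The explicit `break` is what lets the loop stop before everything collapses into one cluster.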
Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (Topic Detection and Tracking).
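One way to turn "better than another" into a number is to compare the pairs of documents each clustering puts together against the pairs the human clusters put together. The slides do not say which metric was used, so the pairwise F1 score below is an assumed, commonly used choice, and the document IDs are made up.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """Set of unordered document pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs

def pairwise_f1(predicted, reference):
    """F1 score over same-cluster pairs, treating the reference as ground truth."""
    pred = same_cluster_pairs(predicted)
    ref = same_cluster_pairs(reference)
    if not pred or not ref:
        return 0.0
    true_pos = len(pred & ref)
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: human-generated clusters and two candidate clusterings.
human = [["a", "b", "c"], ["d", "e"]]
system_1 = [["a", "b"], ["c"], ["d", "e"]]   # splits one story off
system_2 = [["a", "b", "c", "d", "e"]]       # lumps everything together

score_1 = pairwise_f1(system_1, human)
score_2 = pairwise_f1(system_2, human)
```

Here `system_1` scores higher than `system_2`: it misses some true pairs but introduces no false ones, while lumping everything together pairs many unrelated stories.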
Entity extraction
Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.
But we need a single distance function dist(d_i, d_j) that takes into account all information:

\[ \mathrm{dist}(i,j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i,j) \]

w_1 = weight of TF-IDF distance
w_2 = weight of distance on LinkIt nouns
w_3 = weight of distance on Nominate entities
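As a small worked example of the weighted sum, the sketch below combines three per-feature distances for a single document pair. The distance values and weights are invented for illustration, and `combined_distance` is a hypothetical helper name.

```python
def combined_distance(distances, weights):
    """Weighted sum of per-feature distances: sum over k of w_k * dist_k(i, j)."""
    if len(distances) != len(weights):
        raise ValueError("need exactly one weight per distance")
    return sum(w * d for w, d in zip(weights, distances))

# Hypothetical values for one document pair (i, j), in the order
# [TF-IDF distance, LinkIt noun distance, Nominate entity distance].
dists_ij = [0.5, 0.25, 1.0]
weights = [0.5, 0.25, 0.25]  # made-up weights; the slides fit them by regression

d_ij = combined_distance(dists_ij, weights)
```

The result is 0.5 * 0.5 + 0.25 * 0.25 + 0.25 * 1.0 = 0.5625.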
Regression fit
From training data, define perfect distance fn:
R(i,j) = 0 if i,j in same cluster
       = 1 otherwise
Find w_i that minimize

\[ \sum_{i,j} \left( \mathrm{dist}(i,j) - R(i,j) \right)^2 \]

*actually, we use logistic regression, because it's a better fit to binary variables like R(i,j)
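The logistic-regression variant mentioned in the footnote could be sketched as below: each training pair contributes its three feature distances and a label R(i,j), and gradient descent finds the weights. The training pairs are invented, and the bias term, learning rate, and number of epochs are assumptions not given in the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Estimated probability that the pair with distances x is in different clusters."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the logistic loss."""
    k = len(features[0])
    n = len(features)
    w = [0.0] * k
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for x, r in zip(features, labels):
            err = predict(w, b, x) - r  # gradient of the loss w.r.t. the logit
            for idx in range(k):
                grad_w[idx] += err * x[idx]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up training pairs: [TF-IDF, LinkIt, Nominate] distances and R(i, j).
features = [
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.3, 0.3, 0.2],   # same cluster
    [0.8, 0.7, 0.9], [0.9, 0.8, 0.7], [0.7, 0.9, 0.8],   # different clusters
]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(features, labels)
```

After fitting, `predict(w, b, x)` plays the role of the learned distance: near 0 for pairs that look like the same event and near 1 for pairs that do not.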
TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Figure labels: world category, latest story, finance category]
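"Story category = closest point" amounts to a nearest-neighbour lookup. In the sketch below, the category coordinates and the story vector are made-up sparse TF-IDF-style vectors, and cosine distance is assumed as the notion of "closest" (it is the distance named earlier for clustering; the slide does not restate it here).

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def categorize(story_vec, category_vecs):
    """Return the name of the category whose coordinate is closest to the story."""
    return min(
        category_vecs,
        key=lambda name: cosine_distance(story_vec, category_vecs[name]),
    )

# Made-up pre-assigned category coordinates.
categories = {
    "world": {"government": 1.0, "treaty": 0.8, "border": 0.6},
    "finance": {"market": 1.0, "bank": 0.8, "stocks": 0.6},
}
latest_story = {"bank": 0.9, "market": 0.5, "government": 0.2}

category = categorize(latest_story, categories)
```

The story shares most of its weight with the finance coordinate, so it is assigned to "finance".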
Cluster summarization
Problem: given a set of documents, write a sentence summarizing them.
Difficult problem. See references in Newsblaster paper, and more recent techniques.
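To make the problem concrete, here is a deliberately naive baseline, not the Newsblaster method: instead of writing a new sentence, it picks the existing sentence whose words are most frequent across the cluster. The sentence splitter, the scoring rule, and the example documents are all assumptions chosen for brevity.

```python
import re
from collections import Counter

def tokens(sentence):
    """Lowercased word tokens of a sentence."""
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(documents):
    """Return the sentence with the highest average cluster-wide word frequency."""
    sentences = [
        s.strip()
        for doc in documents
        for s in re.split(r"(?<=[.!?])\s+", doc)
        if s.strip()
    ]
    freq = Counter(word for s in sentences for word in tokens(s))

    def score(sentence):
        words = tokens(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return max(sentences, key=score)

# Made-up cluster of three short stories about the same event.
cluster_docs = [
    "The storm flooded the city. Officials closed schools.",
    "Heavy rain from the storm flooded the city centre. Trains were delayed.",
    "The storm flooded roads. Power was cut in places.",
]
summary = summarize(cluster_docs)
```

On this toy cluster the baseline returns "The storm flooded the city." It counts stop words like "the" as evidence, and it can only copy a sentence, never compose one, which hints at why the slide calls this a difficult problem.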