Computational Journalism
Columbia Journalism School
Week 3: Algorithmic Filtering
September 18, 2013
[Figure: filtering diagram; surviving labels: User, Filtering]
Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive
More video on YouTube than was produced by TV networks during the entire 20th century
Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
Estimated 130,000,000 books ever published
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages), and links followed to depth 4
Then extract the text of each article
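The depth-limited link following described above can be sketched as a breadth-first traversal. This is a minimal sketch, not the system's actual crawler: `get_links` is a hypothetical callable standing in for fetching a page and extracting its links, and `TOY_WEB` is a made-up in-memory link graph used in place of real front pages.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl from seed_urls, following links up to max_depth hops.

    get_links is a hypothetical callable: url -> iterable of linked urls.
    Returns a dict mapping each visited url to the depth at which it was found.
    """
    depth_of = {url: 0 for url in seed_urls}
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(url):
            if link not in depth_of:  # skip pages already scheduled
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of

# Toy in-memory "web" standing in for real HTTP fetches (assumption).
TOY_WEB = {
    "front": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
}
found = crawl(["front"], lambda url: TOY_WEB.get(url, []))
```

Here page "f" is five hops from the front page, so it is never visited. Text extraction from each visited page would be a separate step.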
Cluster Events
Surprise!
Encode articles into feature vectors
Cosine distance function
Hierarchical clustering algorithm
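The first two ingredients can be illustrated with a small standard-library sketch. It assumes a plain TF-IDF weighting (term frequency times log inverse document frequency) and whitespace tokenization; the slides do not specify the exact feature encoding, and the three sample headlines are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Invented example headlines, tokenized on whitespace.
docs = [
    "senate votes on budget bill".split(),
    "senate passes budget bill".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
```

The two budget stories share terms and so end up closer to each other than either is to the sports story, which shares no terms with them.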
Agglomerative hierarchical
Start with leaves, repeatedly merge clusters
e.g. MIN and MAX approaches
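A minimal sketch of the agglomerative procedure, assuming MIN means single-link (distance between the closest pair of members) and MAX means complete-link (distance between the farthest pair). The function name, the stopping rule (`num_clusters`, which should be at least 1), and the one-dimensional sample points are illustrative choices, not taken from the slides.

```python
def agglomerative(points, dist, num_clusters, linkage=min):
    """Merge the two closest clusters until num_clusters remain.

    linkage=min gives single-link (MIN); linkage=max gives complete-link (MAX).
    Clusters are lists of indices into points.
    """
    clusters = [[i] for i in range(len(points))]  # start with leaves
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage.
                d = linkage(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

points = [0.0, 1.0, 1.5, 10.0, 11.0]
single = agglomerative(points, lambda x, y: abs(x - y), 2, linkage=min)
complete = agglomerative(points, lambda x, y: abs(x - y), 2, linkage=max)
```

This brute-force version recomputes every pairwise cluster distance on each merge, which is fine for a handful of points but far too slow for thousands of articles.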
Divisive hierarchical
Start with root, repeatedly split clusters
e.g. binary split
average
Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated clusters.
Available from the TDT2 corpus (topic detection and tracking)
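One way to score a clustering against human-generated clusters is pairwise precision, recall and F1: count the story pairs that both clusterings put together. The slides do not name a specific metric, so this is just one common choice, and the labels below are invented.

```python
from itertools import combinations

def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }

def pairwise_f1(predicted, reference):
    """Pairwise precision, recall and F1 of predicted vs. reference labels."""
    pred_pairs = same_cluster_pairs(predicted)
    ref_pairs = same_cluster_pairs(reference)
    agreed = len(pred_pairs & ref_pairs)
    precision = agreed / len(pred_pairs) if pred_pairs else 0.0
    recall = agreed / len(ref_pairs) if ref_pairs else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

human = ["a", "a", "a", "b", "b"]  # hypothetical human-generated clusters
good = [0, 0, 0, 1, 1]             # same grouping, different label names
poor = [0, 0, 1, 1, 1]             # one story placed in the wrong cluster
```

Because only co-membership is compared, the metric does not care what the clusters are called, only which stories end up together.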
Entity extraction
Names, dates, places. Services like OpenCalais. Best algorithms use dictionaries + probabilistic parsers.
But we need a single distance function dist(di, dj) that takes into account all information.
$$\mathrm{dist}(i, j) = \sum_{k=1}^{3} w_k \, \mathrm{dist}_k(i, j)$$
w1 = weight of TF-IDF distance
w2 = weight of distance on LinkIt nouns
w3 = weight of distance on Nominate entities
Regression fit
From training data, define a perfect distance function:
R(i,j) = 0 if i, j are in the same cluster
R(i,j) = 1 otherwise
Find the weights w_k that minimize the squared error
$$\sum_{i,j} \left( \mathrm{dist}(i, j) - R(i, j) \right)^2$$
*Actually, we use logistic regression, because it's a better fit to binary variables like R(i,j).
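A minimal sketch of fitting the distance weights with logistic regression, using plain batch gradient descent. The training pairs, the learning rate, the step count and the added bias term are all assumptions made for illustration; the slides only say that logistic regression is used to fit R(i,j).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, targets, steps=2000, rate=0.5):
    """Fit weights and bias by batch gradient descent on log loss."""
    num_features = len(features[0])
    weights = [0.0] * num_features
    bias = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [0.0] * num_features
        grad_b = 0.0
        for x, r in zip(features, targets):
            # Prediction error for this pair: p(different cluster) - R(i,j).
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - r
            for k in range(num_features):
                grad_w[k] += error * x[k]
            grad_b += error
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Estimated probability that a pair belongs to different clusters."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up training pairs: (TF-IDF distance, noun distance, entity distance).
pair_distances = [
    (0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2),  # same cluster, R = 0
    (0.8, 0.9, 0.7), (0.9, 0.7, 0.9), (0.7, 0.8, 0.8),  # different, R = 1
]
R = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(pair_distances, R)
```

The fitted model outputs a value between 0 and 1 for any pair of stories, which can then serve as the single combined distance fed to the clustering step.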
TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Figure: the latest story plotted among category points, including the world category and the finance category]
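Assigning a story to the closest category point can be sketched as a nearest-neighbor lookup. The slide does not say which distance defines "closest", so cosine distance is assumed here, and the category coordinates and story vector are invented values.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def categorize(story_vec, category_vecs):
    """Return the name of the category whose coordinate is closest."""
    return min(
        category_vecs,
        key=lambda name: cosine_distance(story_vec, category_vecs[name]),
    )

# Hypothetical pre-assigned TF-IDF coordinates for two categories.
CATEGORIES = {
    "world": {"treaty": 0.8, "embassy": 0.6, "minister": 0.4},
    "finance": {"stocks": 0.9, "bank": 0.5, "minister": 0.2},
}
story = {"bank": 0.7, "stocks": 0.6, "rates": 0.3}
category = categorize(story, CATEGORIES)
```

The sample story shares terms only with the finance coordinate, so that is the category it receives.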
Cluster summarization
Problem: given a set of documents, write a sentence summarizing them.
Difficult problem. See references in Newsblaster paper, and more recent techniques.
Editors
Editors are filters.
They decide what stories to run, and how prominent to make them.
How do they choose?
Graph of political book sales during the 2008 U.S. election, by orgnet.org. From Amazon "users who bought X also bought Y" data.
Retweet network of political tweets. From Conover et al., Political Polarization on Twitter
This creates this kind of a feedback loop in which your media influences your preferences and your choices; your choices influence your media; and you really can go down a long and narrow path, rather than actually seeing the whole set of issues in front of us.
Information diet
"The holy grail in this model, as far as I'm concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn't encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn't necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes."
- Ethan Zuckerman, "Playing the Internet with PMOG"