
Natural Language Processing

for Investigative Journalism


Jonathan Stray
NYC ML Meetup, 2014/6/19
Links!



http://bit.ly/JournalismNLP
Proof of concept algorithm

for every document x
    convert x to a TF-IDF vector
    label x with its three highest-TF-IDF words
    color x by incident type (from the original data)

for every pair of documents (x, y)
    if cosine_distance(x, y) < threshold
        add_edge(x, y)

plot all documents and edges with a force-directed layout
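The proof-of-concept pipeline above can be sketched in plain Python. This is a minimal illustration, not Overview's actual code: the function names are mine, and the threshold value is arbitrary.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Convert tokenized documents to sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (count / len(doc)) * math.log(n / df[t])
               for t, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def build_edges(vectors, threshold=0.9):
    """The O(N^2) pass: connect every pair closer than the threshold."""
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                edges.append((i, j))
    return edges
```

The nested pair loop is exactly the O(N²) cost that the next slide says is too expensive for large document sets.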

Document sets too big for O(N²)
Prototype scatterplot

for each document x
    let edges[x] = { N random documents }

for each time step
    for each document x
        for each y in edges[x]
            d1 = cosine_distance(x, y)
            d2 = layout_distance(x, y)
            apply_spring_force(d1, d2)
        for the N/2 edges in edges[x] with largest d1
            replace that edge with a new random document
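The spring-force step at the core of that layout can be sketched as follows. This is a minimal illustration with hypothetical names (`spring_step`, `pos`, `target`), not the prototype's actual code; it shows only the relaxation step, without the edge resampling.

```python
import math

def spring_step(pos, edges, target, step=0.1):
    """One relaxation step of a force-directed layout.

    pos:    dict doc -> (x, y) screen position
    edges:  list of (a, b) document pairs
    target: dict (a, b) -> desired distance (e.g. cosine distance d1)

    Each edge pulls its endpoints together when the layout distance
    exceeds the target, and pushes them apart when it is too small.
    """
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        d2 = math.hypot(dx, dy) or 1e-9   # avoid division by zero
        f = step * (d2 - target[(a, b)]) / d2
        pos[a] = (pos[a][0] + f * dx, pos[a][1] + f * dy)
        pos[b] = (pos[b][0] - f * dx, pos[b][1] - f * dy)
    return pos
```

Iterating this step shrinks the gap between layout distance and target distance geometrically, so a few hundred steps suffice for a small graph.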

Prototype tree

let c = { one component with all documents }

for threshold in [0.1, 0.2, ..., 1.0]
    c_new = {}
    for x in c
        pieces = connected_components(x, threshold)
        x.children = pieces
        c_new += pieces
    c = c_new
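A minimal sketch of that prototype tree, under one assumption: for the splits to get finer as the threshold rises, the threshold must be a *similarity* cutoff (connect pairs with similarity at or above it), so that is how this version reads it. Names (`components`, `build_tree`) are mine.

```python
def components(docs, sim, threshold):
    """Connected components via union-find: link any pair whose
    similarity is at or above the threshold."""
    parent = {d: d for d in docs}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    items = list(docs)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if sim(a, b) >= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return list(groups.values())

def build_tree(all_docs, sim, thresholds=(0.25, 0.5, 0.75)):
    """One tree level per threshold: each node splits into the
    connected components of its documents at that cutoff."""
    root = {"docs": list(all_docs), "children": []}
    frontier = [root]
    for t in thresholds:
        next_frontier = []
        for node in frontier:
            for piece in components(node["docs"], sim, t):
                child = {"docs": piece, "children": []}
                node["children"].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root
```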
Lots of emails would be meaningless: spam, or pictures of cats, so Overview can
be used to dismiss the majority quickly. Given a set of emails from a keyword
search, the problem is more difficult because most of the emails will be at least
somewhat relevant.

In this case, Overview was most useful as an organizational tool. I could look at
an email, make a note, and easily have it grouped with other similar emails
through tagging.

I started with a branch of Overview's document tree and started clicking,
glancing, noting and tagging. Right off the bat, I found that Overview had
grouped together all of the similarly formatted service desk requests. There
were hundreds if not thousands of those, so I was able to tag them by the
dozens without a second thought while focusing on the more meaty emails.

And then no one really used it....
Sources of user feedback
Log data (select node, apply tag, view document, ...)
Emails and other personal contact
After-use semi-structured interviews
Think-aloud usability tests with naive users
Usability lesson #1


If the workflow doesn't work,
the algorithm doesn't matter
Workflow improvements
Potential users were not able to download and install a
command line system, they couldn't get their documents
into it, and they didn't understand how to use it.

Rewritten as a web application
Import from something other than a CSV
Split long documents into pages
UI overhaul: has to be obvious without reading the manual!

Is the tree any good?
Evaluation Methods for Topic Models
Wallach et al., 2009
Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis
Chuang et al., SIGCHI 2012
The curious case of Petroleum Engineering

The top visualization shows a
2D projection of pairwise topical
distances between academic
departments. In 2005,
Petroleum Engineering appears
similar to Neurobiology,
Medicine, and Biology. Was
there a collaboration among
those departments?

The bottom visualization shows
the undistorted distances from
Petroleum Engineering to other
departments by radial distance.
The connection to biology
disappears: it was an artifact of
dimensionality reduction.

The visual encoding of spatial
distance in the first view is
interpretable, but on its own is
not trustworthy.
Usability lesson #2


Your users define the tasks
and therefore the measure of quality
Current tree

c = { one component with all documents }
max_kids = 5

while c is not empty
    c_new = {}
    for x in c where size(x) > 1
        children = adaptive_kmeans(x, max_kids)
        x.children = children
        c_new += children
    c = c_new
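A toy version of that recursive splitting, to make the control flow concrete. This is not Overview's adaptive k-means: as a stand-in, it runs plain Lloyd's k-means with k = min(max_kids, cluster size), seeded deterministically with the first k points, and stops when a cluster cannot be separated.

```python
import math

def mean(points):
    """Component-wise mean of a list of same-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, iters=20):
    """Plain Lloyd's iterations; returns the non-empty clusters."""
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [mean(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [cl for cl in clusters if cl]

def cluster_tree(points, max_kids=5):
    """Split recursively until leaves hold a single point, with at
    most max_kids children per node, as in the slide's loop."""
    node = {"points": list(points), "children": []}
    if len(points) <= 1:
        return node
    pieces = kmeans(points, min(max_kids, len(points)))
    if len(pieces) <= 1:          # could not separate further; stop
        return node
    node["children"] = [cluster_tree(p, max_kids) for p in pieces]
    return node
```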

Adaptive k-means
Folder labeling

for each folder x
    let d = { docs in x }
    let v = sum of the TF-IDF vectors of d
    let t = { 10 terms in v with highest weight }

    prefix each term in t with
        "ALL"  if it appears in all of d
        "MOST" if it appears in >= 70% of d
        "SOME" if it appears in < 70% of d
Types of document-driven stories
Smoking gun
basically a search problem
often hard to formulate a query, so visual exploration can help

Categorize and count
"trend story" about quantitative patterns
Find/invent useful categories, then tag and count documents

Exhaustive reading
still desirable or necessary for some stories!
for example, prove that something does not exist
To our surprise, wide scope for computer-assisted speedup
Added in response to user feedback
Limit to five children per folder
ALL, MOST, SOME folder labeling
Search
Show untagged documents
Multiple language support
Many import and export options
...
Simplify, simplify, simplify!
K-means vs. LDA on xkcd
Why not "real" topic models?
How to display topic model output?
many systems just use output for distance metric
we've already got a tree, we've already rejected MDS
popular topics-over-time view not applicable for most users
multiple topics per document even more confusing
LDA interpretability not obviously better
K-means, LDA, NMF are mathematically related anyway
Need hierarchical, O(N) algorithm

But ultimately...
So far, usability problems have dominated data modeling problems
Just haven't gotten around to trying

What we're building now
Coming soon: named entity recognition
NER accuracy is really low!
Test of OpenCalais against 5 random articles from various sources
versus hand-tagged entities

Overall PRECISION = 77%
Overall RECALL = 30%

...and journalism inputs can be from any domain!
Initially populate using NER, but let user edit entities,
aliases, and entity tags on each document.
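For reference, the precision and recall figures above are computed in the standard way over sets of extracted vs. hand-tagged entities. A minimal sketch (function name and sample entities are mine, not from the OpenCalais test):

```python
def precision_recall(extracted, gold):
    """extracted: entities the NER system found; gold: hand-tagged truth.
    Precision = fraction of extracted entities that are correct;
    recall = fraction of true entities that were found."""
    tp = len(extracted & gold)        # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

A recall of 30% means the system missed seven out of every ten hand-tagged entities, which is why users must be able to edit the results.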
Usability lesson #3

The user doesn't care about
having an accurate algorithm.

They care about getting clean data out.
Plugin API: Custom visualizations
Plugin calls to Overview:
- get document text
- write/read persistent objects
- read/write document metadata

Overview calls to Plugin:
- display visualization (render HTML/JS for iframe)
- selection changed

In development now, coming this summer!
Your Visualization Here
I mean, I really feel the reason that what
occurred in Homicide occurred was
because of the incident in, I believe, the
Brentwood area with that stand-by pay.
And, you know, what I'm beginning to
learn in this business is that payback is
not nice sometimes.
An interactive NLP testbed
Conjecture: across many domains, it's much faster for a
human to correct an algorithm than to do the whole task by
hand.

Plugins can read document text, write document metadata,
and interact with the user.

Perfect for hybrid human-computer tasks.

Thank you!
For links to everything referenced in this talk please go to:

http://bit.ly/JournalismNLP

Find us at:

overviewproject.org

github.com/overview

jonathanstray@gmail.com
