
THE COOPER UNION FOR THE ADVANCEMENT OF SCIENCE AND

ART

ALBERT NERKEN SCHOOL OF ENGINEERING

NEVA: An Automatic Summarizer


for Narrative Texts

by

Joshua Blachman

A thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Engineering

April 15, 2011

Advisor

Dr. Carl Sable


THE COOPER UNION FOR THE ADVANCEMENT OF SCIENCE AND
ART

ALBERT NERKEN SCHOOL OF ENGINEERING

This thesis was prepared under the direction of the Candidate’s Thesis Ad-
visor and has received approval. It was submitted to the Dean of the School
of Engineering and the full Faculty, and was approved as partial fulfillment of
the requirements for the degree of Master of Engineering.

Dr. Simon Ben-Avi


Acting Dean, School of Engineering

Dr. Carl Sable


Candidate’s Thesis Advisor
Abstract

Automatic summarization research to date has mostly been concerned with
summarizing technical documents and news articles, and the domain of narratives
has been neglected. A new rule-based approach has been created to
summarize narrative texts, specifically isolating plot lines. A system implementing
this approach, called NEVA, has been applied to three different books
and has been evaluated using both human volunteers and the ROUGE metric.
Results show that NEVA successfully creates plot summaries containing up
to 85.2% of the same content as some human-written summaries of the same
narrative.
Acknowledgments

This thesis would not have been possible without the support I’ve been given

in the past few years of my education by my family, friends and professors.

First and foremost, I’d like to thank my advisor, Professor Carl Sable for his

tireless efforts as a guide and friend as I took the necessary steps in conceiving,

researching and writing this thesis. It seemed like at every turn, even before I

encountered problems, Professor Sable was already there, helping me through

them, and always with a smile. Thank you for always having your door open

and even allowing me entrance into your coveted Bennett apartment.

I’d also like to thank the professors of Cooper Union who have taught me

so much about engineering, science, and life in general, including but certainly

not limited to: Chris Lent, Stuart Kirtman, Kausik Chatterjee, Yash Risbud,

Fred Fontaine, James Abbott, Stanley Shinners, Alan Wolf, Robert Uglesich,

Alan Berenbaum, Toby Cumberbatch, and Benjamin Davis. Special thanks to

Glenn Gross and Dino Melendez for just being awesome people and giving all

their efforts to help around the lab whenever necessary.

I’d like to thank my friends who are always there for me, giving me support

and helping me complete the results section of my thesis (after what was for

some, consistent nagging on my part). These friends are Yonah Kupferstein,

Josh Nissel, Sippy Laster, Yael Sacks, Aviva Bukiet, Ezra Obstfeld, Elissa

Gelnick, Naomi Levin, Batya Septimus, Evan Hertan, Michael Sterman, Aliza
Ben-Arie, Eliana Grosser, Michali Steinig, Michael Feder, Hanna Clevenson

and Daniel Rich. I’d like to thank my parents for supporting me both emo-

tionally and financially through my years of education; I am who I am today

only because of you. Also, thanks for helping with my results section as well;

I know how hard it was for you to read those ten pages.

Acharon Acharon Chaviv, I’d like to thank Hakadosh Baruch Hu for all

of the Brachot He has given me throughout my years. I understand that my

whole being is tied to His Ratzon, and I try every day only to fulfill the Tachlis

He has laid out for me and to incorporate my Yiddishkeit in everything I do,

creating as much Kiddush Hashem as possible. Thank you.


Contents

Table of Contents vi

List of Figures viii

1 Introduction 1

2 Background 4
2.1 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 General NLP Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 POS taggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Chart Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Named Entity Recognition . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Pronoun Resolution . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Automatic Summarization . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Structural Approach . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Multiple Document Summarization . . . . . . . . . . . . . . . 24
2.4 Evaluating Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Related Work 28
3.1 Plot Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Project Description 32
4.1 Problems With Previous Work . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Creating a Narrative Summarizer Algorithm . . . . . . . . . . . . . . 33
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Choice of Resources . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Point System . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Analysis of NEVA 39
5.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Human Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Rating System . . . . . . . . . . . . . . . . . . . . . . . . . . 41


6 Results and Discussion 43

7 Conclusion and Future Work 52

A Hobbs Algorithm 54

B Human Analysis Ratings 56

C Annotated Summaries 58
C.1 NEVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.2 MEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.3 Gold Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Bibliography 65
List of Figures

2.1 The parse tree of “The dog ate the food”. . . . . . . . . . . . . . . . 12


2.2 A valid parse tree of “Colorless green ideas sleep furiously”. . . . . . 14

6.1 The human evaluations for Dr. Jekyll and Mr. Hyde. . . . . . . . . . 43
6.2 The human evaluations for The Awakening. . . . . . . . . . . . . . . 44
6.3 The human evaluations for The Ambassadors. . . . . . . . . . . . . . 45
6.4 The ROUGE scores for Dr. Jekyll and Mr. Hyde. . . . . . . . . . . . 46
6.5 The ROUGE scores for The Awakening. . . . . . . . . . . . . . . . . 47
6.6 The ROUGE scores for The Ambassadors. . . . . . . . . . . . . . . . 48
6.7 The ratios of ROUGE scores for all summaries to ROUGE scores for
human-written summaries. . . . . . . . . . . . . . . . . . . . . . . . . 49

B.1 The original scores for the human evaluations of Dr. Jekyll and Mr.
Hyde. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
B.2 The original scores for the human evaluations of The Awakening. . . . 56
B.3 The original scores for the human evaluations of The Ambassadors. . 57

Chapter 1

Introduction

Airing on February 14 through 16, 2011, for the first time in its 47-year run,

Jeopardy! pitted two human players against a computer. Millions of viewers

watched as the IBM computer program called Watson defeated Jeopardy! champions

in a landslide victory that had Ken Jennings saying at its end “I for one welcome our

new computer overlords” [24]. But Jennings may have spoken too soon. Although it

may have seemed that Watson was very smart as it answered many questions correctly

and quickly, Watson was actually just providing the illusion of intelligence by using

Natural Language Processing (NLP). NLP is a branch of artificial intelligence that

focuses on creating programs that can interpret natural languages (e.g., English)

usefully. NLP allows computer programs to perform difficult human tasks such as

playing Jeopardy! without actually understanding anything at all.

One of the most researched areas of NLP is the field of automatic summarization.

The topic, as the name suggests, deals with creating programs to automatically sum-

marize a text into a shorter text which has a length that is some percentage of the

original text. Most of the work in automatic summarization has been focused on sum-

marizing online news articles [5,10,27]. The usefulness of this is readily apparent; the


Internet is continuously becoming a primary source for news [20], but the sheer volume

of articles posted every day makes it an insurmountable task to read or even organize

everything. A human trying to sift through this information would need to read all

or parts of the available articles to determine what information they contained and

whether it was relevant to him; such a task would most likely take weeks or months for

just a day’s worth of online material. Automatic summarizers can be used to create

headlines or short summaries of each article and then the task of sifting through the

mound of articles can be done in a fraction of the time.

With such potential, it is a wonder that automatic summarizers have not been

applied to all different types of texts. One domain that has been overlooked is narra-

tive texts. In this thesis, a novel approach for automatic summarization of narrative

texts has been created. Narratives are fundamentally different from news articles in

that not everything is relevant. A news article tries to make a point, providing as

much information as possible about the event on which it is reporting. A narrative, on the

other hand, has background information and character development that might have

nothing to do with the actual plot of the text.

For this purpose, the program described in this thesis tries to extract only plot

information from narratives. Called the Named Entity Verb-focused Automatic sum-

marizer (NEVA), the program looks for all of the actions that are performed by the

main characters and extracts the sentences from the narrative that describe those ac-

tions. These sentences are then ranked based on a series of rules, and only the highest

ranked sentences are chosen to be part of the final text. The resulting sentences are

effectively a list of all of the important events that take place in the narrative and

therefore provide an accurate summary of the narrative’s plot.

NEVA has been applied to three books; namely, Dr. Jekyll and Mr. Hyde by

Robert Louis Stevenson, The Awakening by Kate Chopin, and The Ambassadors

by Henry James. The resulting summaries have been compared to human-written

summaries, summaries produced by a statistical automatic summarization system

called MEAD, and a baseline summary. The summaries were evaluated using both the

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) summarization metric

and human volunteers in a blind experiment. The ROUGE metric is an accepted

automatic summarization metric used in the National Institute of Standards and

Technology (NIST) annual Document Understanding Conferences [22]. This thesis

shows that NEVA performs exceptionally well with regards to including relevant plot

content, containing up to 85.2% of the same content as a human-written summary

used as an upper-bound.
Chapter 2

Background

2.1 Artificial Intelligence

As soon as people realized that computers could replace humans as a means for calcu-

lations, they started searching for ways to make computers as intelligent as people.

Artificial intelligence became the holy grail of scientific progress, with every fictional

depiction of a future world featuring intelligent humanoid robots performing everyday

human tasks. Even a “simple” human task such as walking, though, would require a

humanoid robot to first master several other tasks including visually understanding

a room, moving several parts of its body at once without falling, and avoiding obsta-

cles, to name a few; each of these tasks by itself is extremely complicated and difficult

to solve. Due to the explosion of tasks that researchers wanted computers to perform,

measures of intelligence became important. This led researchers to develop rigorous

methods of testing how intelligent a computer has actually become.

In 1950, Alan Turing developed a computer intelligence test that is still in use

today, called the Turing test [40]. To conduct the Turing test, a master of human

psychology would ask questions using a terminal to elicit answers from either a human


on another terminal or a computer program. If the psychologist could not figure out

whether he was talking to a human or a computer program, the computer program

would be deemed intelligent. Clearly, there would be no way to tell whether a com-

puter passing the Turing test would be capable of walking (even with the required

mechanisms), but theoretically, the computer would be considered as “aware” as any

competent human.

And so the race to make computers understand language began. For if a computer

can understand, interpret, and generate language that is understandable to a native

speaker, then according to the Turing test that computer would be as intelligent as any

person. This new field developed into what is today called natural language processing

(NLP) [18].

Understanding language is clearly no simple task. It involves so many things that

people take for granted but are actually quite difficult. For example, it is no simple

task for a computer to separate a sentence into its constituent words. The reason

is clear for spoken language: a computer receives speech as one long

sound wave with indeterminate word beginnings and endings throughout. However, even

with written language, the task is far from solved. For example, it may seem as if a

period is always at the end of a sentence, with the sentence’s final word immediately

preceding that period; however, this is not true for abbreviations, in which periods are

very much a part of the word. While the difference between the two would be obvious

to any human reader, this task is not so simple for the programmed computer.
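To make this concrete, here is a minimal sketch (not taken from this thesis's own implementation) of how a naive period-based splitter mishandles abbreviations, and how a trained sentence tokenizer handles them; it assumes Python with the NLTK toolkit and its "punkt" models installed.

import re
import nltk  # assumes the NLTK "punkt" sentence tokenizer data has been downloaded

text = "Dr. Jekyll went home. Mr. Hyde did not."

# Naive approach: split on periods; abbreviations such as "Dr." break this.
naive = [s.strip() for s in re.split(r"\.", text) if s.strip()]
print(naive)  # ['Dr', 'Jekyll went home', 'Mr', 'Hyde did not']

# A trained tokenizer recognizes common abbreviations and keeps sentences intact.
print(nltk.sent_tokenize(text))  # typically ['Dr. Jekyll went home.', 'Mr. Hyde did not.']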

Natural language processing has to be capable of much more than just word tok-

enization (separating sentences into words) if it is to be successful. Working strictly

with just the written side of language processing, after tokenizing a sentence, a com-

puter must do a plethora of other tasks before coming close to meeting Turing’s

standard of intelligence. Once tokenized, words are still just a conglomerate of let-

ters to a computer. Those words must still have meaning assigned to them. Other

required tasks would include, among others: disambiguating each word based on its

meaning in that sentence; logically constructing a connection between the words in

the sentence; understanding the inferences that are implied by the words, but are

not definitively stated; keeping a dynamic database of all new and old information,

and knowing how each piece of information relates to the others; and then, regen-

erating this entire puzzle to create a unique response that a human speaker would

understand.

For many of these tasks there have been decades of research all over the world in-

vestigating novel approaches for tackling these problems, ranging from mathematical

models to logical step-by-step approaches. One of the many things the different ap-

proaches have in common, though, is that they all involve multiple complicated steps,

with each step involving its own natural language processing problem. None of these

problems has a perfect solution, and as such, a solution that is a summation

of smaller solutions, each with its own errors, compounds those

errors. It is because of this that most researchers have devoted their time

to solving just one small subproblem of natural language processing, instead of going

for the gold by passing the Turing test.

There are many different ways to go about programming a computer to “think”

like a human, but surprisingly, there are only a few general methods which stand

out as working exceptionally well. These methods have become the cornerstone of

NLP research and they are used very heavily to solve most NLP problems with high

success. As NLP grew, there were two general ways of approaching problems, which

still divide the algorithms used in NLP today. These approaches became known as

the symbolic and statistical paradigms [18].

The symbolic paradigm was inspired by what humans are perceived to do when

they interpret language, and its goal is to get computers to ultimately understand

language. The idea is that there must be some underlying structure to texts that

allows humans to understand it, and if a program can translate the text into this

logical structure, it should be able to interpret it. This theory led to programs that

produce such structures as first-order logic representations of a text [1]. In first-order

logic, a computer would label each word in a text with a type and create simple

relationships between those words. With these relationships in place, a computer

would “understand” the text, and even be able to answer questions about details

from the text.
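As a brief illustration (the predicates here are invented for this example and are not taken from [1]), the sentence “The dog ate the food” might be translated into a first-order-logic-style representation such as:

dog(d) ∧ food(f) ∧ ate(d, f)

A question such as “What did the dog eat?” could then be answered by matching the query ate(d, x) against this representation and returning the binding x = f.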

Instead of trying to get computers to achieve an understanding of a text, the

statistical paradigm’s goal is to achieve good results, no matter what the method.

This led to developing algorithms that produced results based on probabilities. The

inspiration is that there are overall patterns that language follows and if the computer

can learn those patterns, it can predict how unseen texts follow those patterns.

One of the earliest and best-known examples of this type of approach

occurred in 1963, when Mosteller and Wallace determined the authorship of The Feder-

alist Papers, thus ending a long historical debate [29]. They statistically analyzed

known works of the proposed authors, James Madison and Alexander Hamilton, and

compared frequencies of certain common words such as “of” or “about”. They found

that the frequencies of use of these words were consistent for each suspected author

but different between the authors. The frequencies of these words in the disputed

papers matched those of Madison.

Mosteller and Wallace gathered their information using a technique called data

mining. In most NLP applications, there exists an input text and an unknown state

of this text; in Mosteller and Wallace’s case the input text consisted of certain articles

comprising The Federalist Papers and the unknown state was the author. For data

mining, a large set of texts, called a corpus, is needed from which data is to be

extracted. In the case of a supervised data mining algorithm, the corpus would

have manually tagged data, providing examples of which types of data

lead to which states. The assumption of a supervised data mining algorithm is that

the patterns that exist in the corpus are the same as the patterns that exist in the

unknown data set. The goal, then, is to “mine” data from the corpus to create a

statistical representation of it that can be used to extract unknown states from other

texts.

Another mainstream way of gathering statistical data is called machine learning.

Instead of mining explicit patterns, machine learning uses algorithms to train a function using the known cor-

pus. For many machine learning techniques (e.g., neural networks or support vector

machines), the function is a black-box, but for others (e.g., decision trees), the inner

workings of the function are apparent [34]. In either case, the trained system can then

be applied to an unknown text in order to classify or make some statistical prediction

about it. Some people classify machine learning as a subset of data mining due to

their similarities [13].

Data mining and machine learning can also both be used for unsupervised learning

algorithms which do not involve a known training corpus (although they may train on

an unlabeled corpus of unknown states). The unsupervised technique relies on finding

patterns that are inherent in the data instead of using patterns from known states

of the data. A common example of an unsupervised technique is called clustering,

which tries to group data together based on some measurement of distance that exists

between any two data points [32]. The appeal of such methods is that there is no

need for manually labeled data, which may be expensive or even impossible to obtain

at times.

2.2 General NLP Tasks

One subset of Natural Language Processing that has received considerable attention

is the task of summarizing a body of text [26]. While this might seem unrelated to

the overall task of human-computer conversations, solving the summarization prob-

lem completely would be a pivotal turn towards computer intelligence. One reason

for this is simply that if a computer can filter out all of the unnecessary language

that accompanies a normal conversation and get to the point that the communica-

tor is trying to make, it would be much easier to understand the conversation as a

whole. Even without the eventual goal of passing the Turing test, though, the task

of automatic summarization would be useful in its own right. Artificial intelligence

is, after all, the process of a computer performing a task that is normally done by

a person, and this is no exception. It would be extremely useful if a person could

input a body of text to a computer and receive an accurate summary of the text.

This becomes especially useful as the input text gets larger and unmanageable for

a person to read, understand and summarize himself. A summary could help the

person decide whether or not they want to read the entire text; or it could save a

tremendous amount of time if the summary itself tells them everything they want to

know.

In addition to summarizing large texts, it is also useful to summarize many small

texts like those commonly found on the Internet. Now that the Internet is a primary

source for news [20], the conglomerate of daily news articles has become unman-

ageable and represents the perfect domain for which to develop automatic computer

summarizers. Before reaching the point where researchers could begin working on au-

tomatically summarizing news articles, there were years spent researching the more

basic parts of natural language processing. This research led to resources such as

part-of-speech taggers, sentence chart parsers, named entity recognizers, and pro-

noun resolvers.

2.2.1 POS taggers

The task of a part-of-speech (POS) tagger is to label each word in a sentence with

its correct POS [4]. The difficulty in this arises because many words contain several

meanings and different meanings often have different parts of speech. For example,

most gerunds, in addition to being nouns, can be used as adjectives as well; “running”

can be used both in “Running is healthy” and in “the running man”. Humans in-

stinctively pick out the difference between the two cases because they can understand

that “running” is used in two very different contexts. Therefore, many standard POS

taggers try to mimic humans in this way by “learning” which contexts produce which

POS tags for a given word. A common POS tagger implementation is called the Brill

tagger, first used by Eric Brill in 1993. This algorithm uses a supervised learning

method to produce a list of rules that can be applied to unknown words. The Brill

tagger has an accuracy of approximately 95% [4].
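As a minimal sketch of how such a tagger is typically invoked (this example assumes Python with the NLTK toolkit and its pre-trained tagger models downloaded; it is illustrative only and is not necessarily the tagger used elsewhere in this thesis):

import nltk  # assumes the "punkt" and "averaged_perceptron_tagger" data are downloaded

for sentence in ["Running is healthy", "the running man"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))  # each word is paired with a part-of-speech tag

# The goal is for "running" to receive different tags in the two contexts, since the
# surrounding words indicate a noun-like use in the first sentence and a
# modifier-like use in the second.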

2.2.2 Chart Parsers

A chart parser is a program that takes a sentence as input and outputs a structured

representation of the sentence based on the tags of the words [18]. For this reason,

many chart parsers require a tagged sentence as an input, formatted in a way that

is recognizable by that particular chart parser. The output of a chart parser is in

the form of a tree structure, with the leaves of the tree representing the individual

word/tag pairs of the sentence. The different places where the tree branches off are

determined by the context-free grammar (CFG) that the chart parser is using.

A CFG is a grammar that describes how words and phrases are related to one

another, and therefore sets rules for the syntax of a sentence in a language [7]. A

CFG consists of rules that equate one type of phrase with one or more other types

of phrases in that syntactically (disregarding the actual meaning of the words) they

can be swapped for each other without worrying about their contexts; it is for this

reason that a context-free grammar is context-free. A simple example of this is the

rule that a sentence can be comprised of a subject (a noun phrase) and a predicate

(verb phrase). A CFG would therefore include the rule:

S → NP VP

where → means “can have the form of” and S, NP and VP refer to sentence, noun

phrase and verb phrase respectively. However, for the English language, this rule

does not describe the only way that a sentence can be formed; a valid exclamatory

sentence, for example, might consist of only one word (e.g., “Wow!”). Therefore, an

English CFG might also include the rule:

S → WORD !

Note that the word “wow” is not so easily tagged, and also it is questionable whether

the punctuation “!” even deserves a tag. This all adds to the complexity of creating

a complete tag set as well as integrating that tag set with a CFG and ultimately a

chart parser.

The functionality of a chart parser can be demonstrated with a simple sentence

such as “The dog ate the food”. A correct chart parser output would have a “sentence”

node at the root of the tree, which branches off into a noun phrase node, consisting

of “The dog”, and a verb phrase node consisting of “ate the food”. Then the noun

phrase node would be divided into the determinant “The” and the noun “dog”, while

the verb phrase would be divided into the verb “ate”, and the noun phrase “the food”,

which would in turn be divided into the determinant “the” and the noun “food”. The

parse tree representation of the sentence is shown in Figure 2.1.

Figure 2.1 The parse tree of “The dog ate the food”.

Many chart parsers use a simplified, parenthesis-delimited representation of trees, which for the example

sentence would lead to:

(S (NP (DET The) (N dog)) (VP (V ate) (NP (DET the) (N food))))

There might be other correct outputs depending on various factors such as the set of

tags and the CFG that the chart parser is using.
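The following minimal sketch shows this behavior with a toy grammar; the grammar, the tag names, and the use of the NLTK toolkit are assumptions made for illustration, not the parser or tag set used elsewhere in this thesis.

import nltk

# A toy context-free grammar that covers only the example sentence;
# real parsers use far larger grammars and tag sets.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N
VP -> V NP
DET -> 'The' | 'the'
N -> 'dog' | 'food'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The dog ate the food".split()):
    print(tree)  # prints a bracketed tree: (S (NP (DET The) (N dog)) (VP ...))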

Most of the difficulty with a chart parser producing the correct output comes

from the uncertainty as to which rule of the CFG to use for a given phrase. Each new

parsed sentence will start at the top of the parse tree with a “sentence” node, but

as shown previously, this “sentence” node can be split into a noun phrase and verb

phrase, or a single exclamatory phrase, or one of many other valid sentence structures

in the English language. As a parser gets further down the tree, the complexity only

increases, as almost every type of phrase has multiple CFG rules that split it into

other types of phrases.

Parsing a sentence into a full valid parse tree is not as simple as taking any phrase

and using any CFG rule, for doing so will most likely not result in a valid sentence,

due to words that are “left over” after part of the parsing is completed. This can be

illustrated more clearly by looking at the example sentence from before. Instead of

parsing the sentence the way it was parsed above, an equally valid parse (following

the CFG rules) for the beginning of the sentence would be:

(S (NP (DET The) (N dog)) (VP (V ate)))

This leaves the words “the food” unable to validly fit into any CFG rule as there is

no structure in English that allows for such a noun phrase to sit alone at the end of

a sentence. This problem manifests itself in a plethora of different ways for almost

any English sentence, leaving the issue of finding an algorithm that can efficiently

produce grammatically correct parses of a sentence.

There are two main approaches for such parsing algorithms: top-down and bottom-

up [18]. Top-down parsing, as the name suggests, starts with the top “sentence” node

and splits it into all possible parses based on the CFG rules. The parser continues to

split the phrases until it reaches the part-of-speech branches. Valid parses are simply

all parse trees whose bottom part-of-speech branches fit with the part-of-speech tags

of the words of the sentence. Bottom-up approaches work the opposite way. They

start with the part-of-speech tags of a sentence and apply the CFG rules to them in

a backwards fashion to produce higher level nodes in the parsed sentence tree. Mul-

tiple trees are continuously combined either until no valid CFG rules exist that can

be applied to the remaining nodes (in which case the parse tree is discarded) or the

branches unify into a single sentence node (i.e., a valid parse tree). Both approaches

have their advantages and disadvantages, in that neither approach takes into account

both the overall sentence structure (as the top-down approach does) and the actual input sen-

tence (as the bottom-up approach does). Many current parsing algorithms use hybrid

approaches that are more efficient than either approach individually with respect to

both speed and memory.

Regardless of which parsing method is used, a single English sentence can have

upwards of 300 different valid parses; while all might be technically grammatically

correct, only a few make much sense to an English-speaking person. (A famous example

that shows how a valid parse can lead to nonsense is the sentence “Colorless green

ideas sleep furiously”, first proposed by Noam Chomsky in 1957 [7]. The sentence

has a correct grammatical structure shown in Figure 2.2, but it is readily apparent

that none of the words in the sentence have any meaning relative to each other.)

Figure 2.2 A valid parse tree of “Colorless green ideas sleep furiously”.

Since a chart parser would be relatively useless if it produced 300 parses for any

sentence, NLP researchers have developed methods of choosing a single best parse.

Most notably, statistical parsing methods have proven to be very successful in this

regard [9].

Like other statistical NLP algorithms, statistical parsing uses probabilities that

must be learned before they can be applied to unknown inputs. Although what is

learned changes slightly from algorithm to algorithm, the basics remain the same;

give patterns that appear frequently in the learning corpus a higher probability of

occurring in the input sentence. This usually manifests itself with regards to CFG

rules as well as tag and word pairs that are grouped together. For example, “S →

WORD !” and “green ideas” will have extremely low probabilities when compared to

“S → NP VP” and “green leaves” respectively. After combining together all of the

different probabilities in a given parse tree, the parse tree with the highest overall

probability is chosen as the “correct” parse.

2.2.3 Named Entity Recognition

In addition to labeling words with part-of-speech, words can be classified as different

types of entities, most notably as a named entity. Named entities are phrases that

contain the names of persons, organizations, locations, or times [39]. Example named

entities of those types would be “John”, “Microsoft”, “London” and “July 4th, 1776”,

respectively. What makes named entities so useful, though, is not always that they are

named entities, but rather that they are some specific type of named entity. Labeling

words in a text as a “Person”, “Organization”, “Location”, or “Time” proves to be

invaluable for some NLP applications.

There are many different ways to go about identifying named entities, the easiest of

which is to just look up every word in lists of named entities to see if they match. One

problem with this method is that there are many words that are not named entities but

that have named entity counterparts. This leads to errors such as the verb “reading”

being labeled as a city in England. To counteract this effect, named entity recognizers

apply chunkers to the text before searching for named entities [3]. A chunker, as the

name suggests, breaks down sentences into chunks of smaller phrases, each with its

own tag. The result is somewhat of a cross between a POS tagger and a chart

parser in that chunkers give sentences some structure, and as such, most chunkers

use similar methods to chart parsers to obtain their results. Popular among these

methods are pattern matching and supervised learning techniques. Once sentences

are chunked, there is much less of an issue with mislabeling named entities, because

chunks will define when a named entity is expected, to the exclusion of verbs like

“reading”. Named entity recognizers are reported to achieve 79% accuracy for the

English language [14].
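A minimal sketch of this pipeline (tokenize, tag, chunk, then label named entities) follows; it assumes Python with the NLTK toolkit and its named entity chunker models, which is one possible implementation rather than the specific recognizer evaluated in [14].

import nltk  # assumes the NLTK tagger and named entity chunker data are downloaded

sentence = "John moved to London in July to work for Microsoft."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # part-of-speech tags guide the chunker
tree = nltk.ne_chunk(tagged)   # chunks the sentence and labels named entities
print(tree)  # entities appear as subtrees labeled PERSON, GPE, ORGANIZATION, etc.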

2.2.4 Pronoun Resolution

One of the ubiquitous techniques of writers is to use pronouns often instead of re-

peating a noun, called the pronoun’s antecedent, several times. The assumption that

these writers make is that the pronoun has an obvious antecedent, and that there

can be no other ambiguous antecedent to which the stated pronoun corresponds. In

fact, it can be considered bad English grammar when a pronoun is stated and there

are two or more possible antecedents to which the pronoun refers. Despite this fact,

many writers still use ambiguous pronouns in the hope that most readers would be

able to figure out the correct antecedent with limited intellectual effort.

Unfortunately, computers do not even have a limited intellect which they can

draw upon to resolve ambiguous pronoun references. Incidentally, even unambiguous

pronouns can pose an issue to a computer; this is largely because there are very few

cases where a pronoun is truly unambiguous. Although a human reader might be

able to distinguish immediately between the type of noun that a pronoun refers to

and therefore immediately eliminate most of the other nouns in a sentence as possible

antecedents, computers can make no such distinction and are left with the group of

all previous (and possibly future) nouns in the sentence from which to choose the

antecedent. Fortunately, there are current algorithms that account for many of the

issues that a computer must handle when trying to resolve a pronoun.

Among others, there is a pronoun resolution algorithm that is still popular in

NLP today, called the “Hobbs algorithm” after its creator, Jerry Hobbs [16]. The

Hobbs algorithm uses the chart parsed structure of a sentence to locate the most

probable noun as an antecedent for a given pronoun. The algorithm uses a list

of steps that, if followed, produce the best matched antecedent. The algorithm is

presented in Appendix A. Hobbs evaluated this algorithm on hundreds of examples

from three different texts and reported an accuracy of 88.3% [16]. While this accuracy

is not indicative of all corpus genres, Jeol Tetreault evaluated many different pronoun

resolution algorithms on different genres and found the Hobbs algorithm to be have

an accuracy of 80.1% for fictional works, one of the highest for that genre [37].

2.3 Automatic Summarization

The field of automatic summarization is an NLP topic that has been given significant

attention over the past few decades. The topic, as the name suggests, deals with

finding algorithms to automatically summarize a text into a significantly shorter text.

As with any summary, the goal is to remove as much unnecessary information from

the input text as possible while not removing any necessary information so that the

shorter output text is still a good representation of the input text.

While there are a plethora of types of text that can be summarized, most of

the work in automatic summarization has been focused on summarizing online news

articles [5,10,27]. There are several reasons for this, including the availability of huge

databases of texts to work with. There are many websites that provide daily news

(and many new websites are created each year) so training corpora for this domain

are readily available. This is significant, because when starting a new area of study

in NLP (or any research for that matter), getting a working product of any kind is

of prime importance, and therefore, a good topic of study is one that provides many

existing examples, thus leading to a high probability for success in that topic.

In addition, news articles provide a good source for automatic summarization

because they are usually highly distinctive pieces of writing. One can be certain that an

article on a terrorism plot would sound nothing like an article on the latest basketball

game. Not only would most of the words be different between the two articles, but

entire phrases and paragraphs would probably be structured differently as well. While

this contrast would not completely hold between two articles of the same genre, in

many respects, even two articles on a terrorist plot would be vastly different. The

location of the plot, the names of the people involved, the techniques used, and many

other details would all be different between the articles. All these differences are very

significant when it comes to summarizing a text because a summary can be treated

as a list of things that are unique to a text. While that may not be obvious for any

text, it is certainly apparent for a news article which people read so that they know

what is new and different in the world. Disregarding the lack of appeal due to style,

would anyone really object to a news article that succinctly stated the changes in the

world (i.e., what is unique as compared to yesterday) as a list of attributes?

Another interesting aspect of news articles that appealed to NLP researchers was

the number of different articles about the same topic. Since all news websites (as-

suming they are the same type of news websites) write about the same incidents that

happened in the world, for every event, there is usually an overabundance of arti-

cles that look different, but talk about the same thing with only minor changes. This

poses a unique opportunity for automatic summarization, because it means that there

is an availability of many different input texts that can all technically produce the

same output text. Therefore, instead of just summarizing a single text, researchers

started working on multiple document summaries, which combined the input of mul-

tiple texts to produce a single output summary. Multiple documents mean more

information which hopefully leads to higher success rates for summarization.



When humans write a summary they basically use the same simple method. First

they read the target text, then ruminate on it for a while and try to understand

what the crux of the text is about. Finally, they reformulate their ideas in new

words that might be completely different from the words used in the target text,

but still retain the same information. Based on this chain of events, there are some

researchers who write programs that summarize texts using sentence abstraction.

When abstracting sentences, a computer first reads in the input text, then it processes

it by “understanding” what the text is about using different artificial intelligence

methods. Finally, the computer generates completely different grammatically correct

sentences that consist of the main ideas of the input text [31]. This method mimics

humans in the closest way possible, and should therefore theoretically produce the

most “human” summary, but it uses an overly complex method to achieve that goal

which can lead to errors. The task of getting a computer to understand a text is

a task that is many years ahead of what the artificial intelligence field is currently

capable of doing. In fact, if a computer would be able to understand a text in a

way that would allow it to produce an accurate summary of that text, then that

computer would be able to do many other NLP tasks such as question answering and

machine translation. Fortunately, it turns out that for most of the uses of automatic

summarization, perfect human-like summaries are not always necessary.

Because of the issues involved with generating sentence abstraction summaries,

most researchers rely on creating summaries using sentence extraction [31]. A sentence

extraction summary is produced by literally extracting the most relevant sentences

(and/or phrases) from the input text. While this method clearly is not focused on

producing a better resulting summary than a sentence abstraction summarizer would,

it often achieves summaries that are still useful for the task at hand. There are usually

two phases to creating extracts; the first involves determining which sentences are

relevant and the second involves post-processing of these sentences. Post-processing

is necessary because a sentence extraction is inevitably missing linking information

from the original text. For example, if a text is 30 lines long, a sentence extraction

summary of that text might contain only lines 1, 5, 15 and 21, encompassing the key

points of the text. However, the other lines of the original text, while not containing

the main ideas of the document, may contribute to the overall progression of ideas.

Line 21 might start with the words “He then said”, but without line 20 to specify

who “He” is, line 21 becomes almost meaningless. Therefore, the extracted sentences

are post-processed to make the summary flow well enough so that it makes sense as

a stand-alone text.

As for determining the sentences’ relevancies, current researchers are creating

new ways to do this, but there are a few “classic” algorithms that have become the

bread and butter of the automatic summarization field. They each use vastly different

methods, yet achieve very similar outcomes with varying success rates that are among

the highest in the field.

2.3.1 TF-IDF

Since the target documents for most automatic summarization systems were known to

be news articles, a good algorithm would exploit that knowledge as much as possible.

As stated earlier, one of the highlights of news articles is each article’s uniqueness of

words, so an algorithm that can harness that aspect by determining the rare words

in a document should be successful. To do this requires a system to tabulate, in

some way, the sentences that contain the most unusual words, and to consider these

sentences to be very relevant for a summary. However, the quality of being unusual is

not sufficient alone. In order to be a relevant unusual word, the word has to have some

significance within the document itself. Hence, a relevant sentence should be one that

contains unusual and distinct words that play an important role in the document as

a whole.

To accomplish this, two weights are calculated for each word in a document; the

word’s term-frequency (TF) and its inverse-document-frequency (IDF) [6]. The TF is

simply the number of times that any given word appears in the document as a whole.

All words (terms) in the document automatically receive a TF of at least 1, and they

get higher numbers as they appear more frequently. This weight is to account for the

significance that a word plays in the document as a whole, for if a word appears many

times in a document, it is probably a word that the document is trying to focus on.

If the document is trying to focus on that word, then a sentence containing the

word is most probably a sentence to focus on as well for a summary of the document.

However, there are two problems with just using the TF as a weight for deter-

mining sentence relevancy in a document. The first is that it does not exploit the

attribute of news articles that made them so desirable. Since each news article is so

unique when compared to the next, if a summary is supposed to say what makes this

article different enough to be worth reading, it should include things that are unique

to only this article. Without a weight contributing to that aspect, a summarizer may

end up generating very generic summaries of basketball articles, all talking a lot about dribbling

and shooting, but lacking many significant details. Secondly, the words that would

undoubtedly score the highest in the TF weights would be the words that are the

most common in the English language and not the ones that are the biggest focus in

the article. The word “the” is by far the most used word in any article, but it clearly

does not give any measure of the relevance of a sentence in a summary. Another

weight is needed to counteract the bias for common words in order to give the TF

weight more meaning.

The inverse-document-frequency weight both accounts for the rareness of words



and disregards words that are too common. The IDF is calculated by looking through

a corpus of articles and tabulating, for every word in a language, the number of

documents that a word appears in (or alternatively, the number of times that a

word appears throughout the entire corpus) and dividing by the total number of

documents. This weight is called the word’s document-frequency, and it is a measure

of how common a word is in the language. The inverse-document-frequency of a given

word, which serves as a measure of the rareness of the word, is then calculated by the

following rule:
\mathrm{IDF}(w) = -\log \frac{\mathrm{count}(w)}{N}
where N is the total number of documents in a corpus and count(w) is the number of

documents containing a given word, w. A word’s TF score and its IDF score are then

multiplied together to produce a total TF-IDF score that accounts for the two desired

relevancy aspects. Since a common word’s IDF would be extremely small (relative

to other words) it effectively negates the TF of those words when they appear in the

target document. On the other hand, rare words would have a relatively high IDF

leading to good relevancy scores. Even if a word appears only once in the target

document, if it is a rare enough word, it has a much higher combined TF-IDF score

than common words in the document.
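A minimal sketch of how these two weights might be combined to score sentences is given below; the function and variable names are invented for illustration, the corpus statistics are assumed to be precomputed, and this is not the exact scoring used by any particular system discussed in this thesis.

import math
from collections import Counter

def tfidf_sentence_scores(sentences, corpus_doc_counts, num_docs):
    # Term frequency is computed over the whole input document.
    words = [w.lower() for s in sentences for w in s.split()]
    tf = Counter(words)

    def idf(w):
        # corpus_doc_counts maps a word to the number of corpus documents that
        # contain it; unseen words are treated as appearing in one document.
        count = corpus_doc_counts.get(w, 1)
        return -math.log(count / num_docs)

    # A sentence's relevancy score is the sum of the TF-IDF weights of its words.
    return [sum(tf[w.lower()] * idf(w.lower()) for w in s.split())
            for s in sentences]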

The resultant algorithm which extracts sentences containing words with high TF-

IDF values has been tested extensively and produces accurate summaries of news

articles [26]. In terms of accuracy, it is quite difficult to produce a single score

that encompasses the “goodness” of a summary. Current metrics for automatic sum-

marization algorithms will be discussed in Section 2.4.



2.3.2 Structural Approach

Most summarization algorithms use TF-IDF weights as a starting point but also use

additional information to achieve better results. While the plain TF-IDF method

looks at sentences using a “bag of words” approach, meaning it looks at the sentences

as a combination of a set of words without any particular order or relation between

them, it is prudent to be able to use the inherent structure of a language for clues to

which sentences may be relevant. In addition, most articles have artificial structure

given to them by their author which gives additional hints for relevancy. Therefore,

some researchers have included such components as cues, titles, and locations in their

summarizers [12].

English has several words whose function is not to add much content but rather

to change the impression of other words. Such words as “significant,” “impossible,”

and “hardly” are dubbed “cues” and the presence of them in a sentence can add

relevance to or subtract relevance from the sentence. In order to effectively use this

method, a dictionary of “cues” must be manually compiled and each “cue” must

be labeled as positively, negatively or neutrally relevant. The weights from “cues”

are then integrated into the TF-IDF weights to produce a final sentence relevancy

weight [12].

Most news articles have author given titles for the entire article and also for

individual sections. Since titles are usually one line summaries of the piece they are

titles for, it is clear that the words in a title would definitely be relevant for an entire-

document summary. Therefore, by giving words that appear in titles higher TF-IDF

scores than words that do not, more relevant sentences can be chosen [12].

The locations of sentences in their paragraphs also give a clue as to the relevancy

of sentences. Firstly, this means that sentences that occur immediately after titles

are most likely important. Secondly, it has been shown that topic sentences are more

likely to occur either at the very beginning or at the very end of an article. Again,

using these clues, TF-IDF weights of certain sentences can be modified to produce

more relevant sentences for a document summary [12].
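A rough sketch of how such structural signals might be folded into a base TF-IDF sentence score is shown below; the cue dictionary, the bonus values, and the function name are invented for illustration and are not taken from [12].

# Hypothetical cue dictionary: positive cues add relevance, negative cues subtract it.
CUE_WEIGHTS = {"significant": 1.0, "impossible": 1.0, "hardly": -1.0}

def adjust_score(base_score, sentence, title_words, position, num_sentences):
    words = set(sentence.lower().split())
    score = base_score
    score += sum(CUE_WEIGHTS.get(w, 0.0) for w in words)  # cue words
    score += 2.0 * len(words & title_words)               # overlap with title words
    if position == 0 or position == num_sentences - 1:    # topic sentences tend to appear
        score += 1.0                                      # at the beginning or the end
    return score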

2.3.3 Multiple Document Summarization

As stated earlier, one of the benefits of summarizing news articles is that there are

usually many articles online that are written about the same incident. This leads

to the possibility of multiple document summarization (MDS), which merges many

documents into a single summary [15]. While the positive side to this is that multiple

documents allow for a more robust summary with more information, using multiple

documents also gives rise to complications that are not present when summarizing

single documents.

The first problem with MDS is that with multiple documents, in addition to sen-

tences being relevant, entire documents can actually have more relevant information

than others. Because of this, high single document TF-IDF scores may not be as

important if they belong to a fairly irrelevant article. One method used to account

for this is to compute the TF scores for the words in a document based on the frequency

with which they appear across all of the documents in the set being summarized. This

way, a word will only have a high TF score if it is common to many of the documents

used for the MDS. Also, instead of summing up the TF-IDF scores of all of the words

in a sentence, only the scores of a limited number of words, called the centroid of that

group of documents, are considered for determining relevant sentences. The words

that go into the centroid are words that have a TF-IDF score above some minimum

threshold [30].

Further complications include the redundancy of sentences between documents

and the ordering of sentences once all the relevant sentences have been chosen [2].

Redundancy is usually taken care of by using a cosine similarity metric between

the chosen sentences. If the two sentences produce a number that is higher than a

predetermined threshold, that means that the two sentences contain too much of the

same information, and the one with the lower TF-IDF score is usually discarded. A

cosine similarity metric is given by the following equation:

\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}

A and B are two n-dimensional vectors of attributes representing the two sentences

to be compared. The attributes vary and can be values such as binary values for the

existence of words or TF-IDF values of the words. θ is the angle between these vectors

in n-dimensional space.
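A minimal sketch of this computation, and of how it might be used for redundancy filtering, follows; the example vectors and the 0.7 threshold are illustrative assumptions rather than values taken from the systems cited above.

import math

def cosine_similarity(a, b):
    # a and b are equal-length attribute vectors representing two sentences
    # (e.g., binary word-occurrence values or TF-IDF weights).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sim = cosine_similarity([1, 0, 1, 1], [1, 1, 1, 0])
print(sim)  # if this exceeds a chosen threshold (e.g., 0.7), the sentence with the
            # lower TF-IDF score would be discarded as redundant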

Sentence ordering is much harder to deal with than redundancy in general. One

study showed that when humans were asked to order sentences they each gave com-

pletely different orderings; despite this, most of the human orderings seemed to be

valid and logical orderings [2]. Therefore, the authors of the study concluded that

an exact ordering is not necessary but rather an ordering just has to be acceptable

to people. They proposed grouping sentences into general topics and then trying to

order the topics using timestamps from the original documents.

2.4 Evaluating Summaries

The evaluation of computer-generated summaries is a problem that has plagued the

field of automatic summarization, and to which there is currently no great solution.

The root of this problem is that unlike most NLP applications, there is no absolute

gold standard for an automatic summarizer. A POS tagger, for example, has clearly

correct tags that are defined for each word in any given sentence (although even this is

debatable, but for the purposes of this thesis it is a reasonable assumption). Therefore,

in order to evaluate the accuracy of a POS tagger, all one has to do is compare the

output of the tagger to the known-to-be-correct output. However, for summaries in

general, even human-written summaries, there is no single correct answer. Two people

can summarize the same text and wind up with completely different summaries in

terms of words and style, yet both summaries would be “correct” summaries in the

sense that they accurately depict a shorter version of the target text. Hence, there can

never be an absolute summary to which to compare a computer-generated summary.

Because of this shortcoming, there is no simple evaluation technique that can be used

for automatic summarizers, and non-optimal techniques must be used instead.

One of the evaluation techniques that has become accepted by the National In-

stitute of Standards and Technology (NIST) annual Document Understanding Con-

ferences (DUC) is the Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

metric [22]. The ROUGE metric is based on a program that calculates and compares

combinations of words, called n-grams. N-grams are commonly used in NLP to refer

to groups of words, with size “n”, as a single unit. Tri-grams, or 3-grams, group

together 3 words as one, whereas bi-grams, or 2-grams, group together 2 words, and

unigrams denote single words. ROUGE compares two texts together and calculates a

score based on the number of n-grams that the two texts have in common. The theory

is that if two texts are valid summaries of the same document, then they will both

contain many of the same key phrases. There are several different ROUGE metrics, each calcu-

lating different types of n-grams and permutations of n-grams. The specific metrics

used for the NIST DUC are ROUGE-1, ROUGE-2 and ROUGE-SU4. ROUGE-1 cal-

culates unigram matches between documents, ROUGE-2 includes bi-gram matches,

and ROUGE-SU4 allows for bi-gram matches that are distanced by up to 4 words.

These metrics specifically are used because they have been shown to have the highest

correlation to what humans evaluate as good summaries [23].
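As an illustration of the underlying idea, the following sketch computes a simplified ROUGE-1 recall score from unigram overlap; the official ROUGE toolkit adds options such as stemming and stopword removal, so this is only an approximation for explanatory purposes.

from collections import Counter

def rouge_1_recall(candidate, reference):
    # Count unigrams in each text.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a reference word is credited at most as many times
    # as it also appears in the candidate.
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge_1_recall("the dog ate the food", "the dog ate its dinner"))  # 0.6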


Chapter 3

Related Work

Automatic summarization techniques have been given significant attention in the

NLP field, and most of that attention has gone to summarizing news articles. There

are many techniques to choose from, but the problem is that most of these techniques

are domain specific in that they work best with news articles, and not as well with

other genres. Other domains have been considered, including online discussions on

blogs [42], movie reviews [43], and even email threads [41]. However, because of the

disparity of the functions of these texts, the algorithms developed for news articles

have not produced good results when applied to them. In addition, almost all of

the texts studied for automatic summarization have been relatively short texts that

contain just a few key points with little extraneous material; techniques have not been

developed to sift out small amounts of relevant information from very large texts.

There are a few exceptions to all of this, one being the work that was done by

Mihalcea and Ceylan on automatic summarizations of entire narrative books [28].

Mihalcea and Ceylan understood the differences that longer texts would require and

acted accordingly. They started with an existing algorithm, called MEAD [30], which

is a centroid based approach, mostly used for multiple documents, but applied in


this case to single books. They made several modifications to the MEAD algorithm,

achieving increasingly better results for their system. They scored their system using

the ROUGE metric, comparing their generated summaries to human-written sum-

maries from online websites (specifically, gradesaver.com and cliffsnotes.com).

Their first modification was actually the removal of a feature used by existing automatic

summarization systems. Studies have shown that for news articles the lead

sentences of paragraphs are extremely important for the overall summaries [12, 25].

In short documents this would make sense, since each paragraph is usually making

a new point that is pertinent to the focus of the document. However, as Mihalcea

and Ceylan showed, if you do not consider the weight of lead sentences and strictly

consider TF-IDF scores, the overall summary achieves better results for full books.

The reason for these results is that different authors might have different ways

of stressing their points that do not include relying heavily on lead sentences. A

second reason is that longer documents do not necessarily have focal points in each

paragraph due to style and topic changes among chapters.

This second reason gives rise to the idea of segmenting a larger book into shorter

individual “documents” that each have their own TF-IDF weights. They therefore

divided each book into 15 different segments using a graph-based segmentation al-

gorithm, and then applied their algorithm to each segment individually. The final

summary was chosen by taking the highest ranked sentence from each segment, start-

ing with the first, then the second and so on until a preset word limit was reached.

In addition to running separate summaries on each segment, they also calculated

separate and augmented TF-IDF scores for each segment. They added two more

factors into the weight: STF, the segmentation term frequency; and ISF, the inverse

segmentation frequency. The final TF-STF-IDF-ISF scores combined with the other

two modifications led to a 30% error reduction rate compared to using just the MEAD

algorithm for full length books.
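
The exact weighting formula is not reproduced here; the following Python sketch simply multiplies the four factors, with add-one smoothing inside the logarithms, so the combination rule, the smoothing and the function name are illustrative assumptions rather than the method of [28].

    import math
    from collections import Counter

    def tf_stf_idf_isf(term, segment, book_segments, corpus_docs):
        # Hypothetical combined weight in which the four factors are multiplied;
        # the actual combination used by Mihalcea and Ceylan may differ.
        #   segment       -- list of tokens for the current segment
        #   book_segments -- list of token lists, one per segment of the book
        #   corpus_docs   -- list of token lists for a background corpus
        tf = Counter(w for seg in book_segments for w in seg)[term]   # frequency in the whole book
        stf = Counter(segment)[term]                                  # frequency in this segment
        df = sum(1 for doc in corpus_docs if term in doc)
        idf = math.log((len(corpus_docs) + 1) / (df + 1))             # smoothed inverse document frequency
        sf = sum(1 for seg in book_segments if term in seg)
        isf = math.log((len(book_segments) + 1) / (sf + 1))           # smoothed inverse segment frequency
        return tf * stf * idf * isf

    # Toy example with a two-segment "book" and a tiny background corpus:
    book = [["utterson", "visited", "jekyll"], ["jekyll", "hid", "from", "utterson"]]
    corpus = [["news", "about", "jekyll"], ["weather", "report"]]
    print(tf_stf_idf_isf("visited", book[0], book, corpus))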

Additional work has been performed on automatic narrative summarization by

Kazantseva and Szpakowicz [19]. Instead of trying to summarize an entire book’s

plot, they just looked at short stories, and their goal was to provide the background

of a story, while trying to not give away any plot details. Instead of using tradi-

tional automatic summarization methods, they looked for patterns in the text and

extracted relevant sentences that fit into their “background information” prototype.

By comparing their system to traditional techniques and achieving better results, they

showed that pattern matching may prove to be more suited to the non-linear nature

of narratives. Their work, however, is clearly unsuited for producing a generalized

summary of a long narrative.

3.1 Plot Units

In 1982, Wendy Lehnert devised a theory on narrative summarization that was used

by Kazantseva and Szpakowicz and many others as the definitive source for plot

summarization [21]. Previous work consisted of “story grammars” which tried to

generalize events from stories into different categories [33, 36, 38]. A story grammar

would include categories for elements such as characters, setting, conflict and resolu-

tion. A computer summarizer would try to extract all of these elements from a story,

and if done successfully, would produce a logical structure from which the computer

could “understand” the story. The failure of story grammars was that they could

not possibly incorporate every structure created by the human mind. The human

mind is able to understand any story, and abstract it in such a way that no matter

how unintuitive and irregular the story, the abstract summary makes sense. This

technique of the mind inevitably creates different structures depending on the type

of story and possibly even the way that the mind understood it. Given that there can be limitless structures from which the mind can mold a summary, there is no way that a top-down approach such as one using story grammars could work.

Lehnert sought to create a bottom-up approach for realizing story structures [21].

She postulates the existence of mental states associated with events and characters, the value

of these states being positive, negative or neutral. Next she describes each relation

between these states as either being a motivation, actualization, termination or equiv-

alence (meaning that nothing changes). Using only these states and links she shows

that any story structure can be broken down into these basic parts that she calls plot

units. It is no longer a problem to try to fit an abstract human summary to her story

structure because her basic building blocks of positive, negative or neutral events can

fit with any summary as it is a truism that any event or mental state can be defined

as positive, negative or neutral.
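
One way to picture these building blocks is as a small data structure; the sketch below, in Python, uses illustrative names, and the particular link chosen in the example is only one plausible reading of Lehnert's formalism, not a claim about her exact notation.

    from dataclasses import dataclass
    from enum import Enum

    class Affect(Enum):
        # The three possible values of a state.
        POSITIVE = "positive"
        NEGATIVE = "negative"
        NEUTRAL = "neutral"

    class Link(Enum):
        # The four relations that may connect two states.
        MOTIVATION = "motivation"
        ACTUALIZATION = "actualization"
        TERMINATION = "termination"
        EQUIVALENCE = "equivalence"

    @dataclass
    class State:
        character: str
        description: str
        affect: Affect

    @dataclass
    class PlotUnit:
        source: State
        target: State
        link: Link

    # "Your proposal of marriage is declined" as a two-state unit; reading the
    # link as an intention actualized into a negative outcome is illustrative.
    intention = State("you", "want to marry her", Affect.NEUTRAL)
    outcome = State("you", "proposal declined", Affect.NEGATIVE)
    declined_proposal = PlotUnit(intention, outcome, Link.ACTUALIZATION)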

Examples of primitive plot units that Lehnert gives are: “You need a car so you

steal one”, “Your proposal of marriage is declined”, or “The woman you love leaves

you.” Because of the nature of the character-event relationship in plot units, most

events are either preexisting or are caused by one of the characters. The events occur-

ring in these example plot units are “you steal a car”, “girlfriend declines marriage”,

and “lover leaves you” respectively. The mental states of the characters are equally important to the meaning of the narrative, but do not in themselves move the narrative forward. It is the stringing together of these event actions that composes the plot, and these actions are really what is needed to create a complete summary. Currently, however, there is

no complete automatic summarization system that is able to incorporate plot units

into its algorithm to create a perfect summarizer.


Chapter 4

Project Description

4.1 Problems With Previous Work

The best method so far for summarizing long narratives is that of Mihalcea and

Ceylan using modified TF-IDF scores [28]. However, the method has a few pitfalls

that cause it to fall short of producing summaries that come close to human-written

summaries. The first and foremost is that it is heavily based on TF-IDF which was

not created for specific use with narratives. TF-IDF relies heavily on the notion

that important sentences and phrases will contain words that are rare yet repeatedly

brought up in the text. This idea makes sense for texts such as news articles which

have a focused goal in mind and generally try to minimize wasted words. However,

both of these premises are false when it comes to narratives.

Narratives, even short sections of narratives, don't generally focus on a single topic.

There is a principle in writing called “Show, don’t tell” that is found throughout

common literature [35]. “Show, don’t tell” states that a writer should show the reader

what is happening through character action, words, etc., instead of bluntly telling the

reader through description. Therefore, narratives may never use “important” topic


words at all as they dance around topics with unnecessary prose and literary tools,

wanting the reader to gain the information only through inference.

With the extra prose come extra words. It is not uncommon for a narrative to

have several paragraphs whose sole purpose is to describe scenery or settings. These

descriptions often use flowery language, including words that are very rarely seen, and

therefore have a high IDF weight. In fact, the rarer a word is the better, as uniqueness

catches the reader’s attention. With all of an author’s techniques and tools that he

uses to make a book interesting, the TF-IDF algorithm becomes useless.

Work has been done that shows that pattern matching may be more suited for the

summarization of narratives than statistical analysis, due to the non-linear nature of

narratives [19]. It is suitable, then, to look for an algorithm that “understands” the

nature of narratives and tries to logically deduce the sentences that are important for

a summary.

4.2 Creating a Narrative Summarizer Algorithm

The author of this thesis has created an algorithm dubbed the Named Entity Verb-

focused Automatic summarizer, or NEVA, to work well specifically for summarizing

narratives. The goal was to create a good algorithm for summarizing narratives using

an accepted theory of how narratives are formulated; the concept of Lehnert's plot units was the perfect basis around which to formulate the algorithm. At the core of each of Lehnert's plot units is the idea that character-event relations are what make

a narrative progress. Therefore, a good narrative summary is one that incorporates

all of the character-event units somehow. One way of doing this is to extract all

of the actions that each character performs throughout the narrative. A list of the

character actions, at least an abstract one, would be a valid representation of the plot

units. In NLP terms (and because abstract lists are extremely hard to generate), this

translates to listing character named entities, along with their associated verbs, and

the object that each verb is acting upon. Also, while it may be true that a given

character-verb pair is implicitly or even passively stated, skipping these verbs does

not pose a major problem since this is not a common occurrence and the surrounding

active verbs usually provide adequate information to fill in the plot. Moreover, even

when passive verbs do occur, the related active verbs are generally more important.

For example, if in a narrative an apple goes missing, it is possible for a sentence to

be “The apple was gone,” a sentence that would not be picked out by this type of

summarizer. However, it is almost inevitable that there would be active verbs leading

up to the state of the apple being gone such as “John ate the apple,” or “Jane looked

everywhere for the apple.”

This would be fine if the goal was to create a comprehensive listing of everything

that happens in the book. The effect of this would be to take out most of the literary

sugar, that which is unnecessary to the plot and provides descriptions or feelings;

while these may be important in literary terms to some books, they almost never

contribute to the plot of the book. This would not be a summary, though, as a

comprehensive listing can be almost as long as the original book. What is needed

is a system to limit the character-verb interactions that are chosen for a summary.

Like TF-IDF summaries, the best way would be to take the highest scoring sentences

ranked in some way that is pertinent to the type of summary that is desired.

Although narratives do not usually follow a cookie cutter template, there are a

few general rules that can be used to indicate that a sentence is more important than

others with regard to advancing the plot, and therefore worthy of additional points

(more details of NEVA’s point system are described in Section 4.3.2). It should be

noted that in addition to these rules being based on logic, the rules all came about

empirically using data from human-written summaries. The first of these rules is that

sentences that emphasize an important or plot-advancing action will often start with

the character who is performing that action. The first reason this would be true is

that if the character is the first word of a sentence, then that character is almost

always part of the subject of the sentence, and likely the topic of the sentence; this

type of sentence becomes more valuable for the plot compared to a sentence in which

a character and his actions are part of a predicate. There may be times, however, when a sentence starts with a preposition or even adjectives that do not preclude

the character from being the sentence’s subject even though that character is not the

first word of the sentence. This is not so much of an issue because those sentences as

a whole are generally less focused on the character-action, and, either way, they do

not take away from the definitive import granted when a character is the first word.

The second rule is that a sentence is more important when it deals with more than

one character, relating the two in some way. This rule is inspired by Lehnert’s plot

units. As Lehnert proposed, plot units deal with individual character-action relation-

ships [21], but there are also plot units that deal with character-character relationships

as these are necessary for most problems and resolutions in plots. Therefore, if a sen-

tence contains characters in addition to the character who is performing the action, it

is given additional points.

The last rule is that sentences that include dialogue get additional points. While

containing dialogue may not inherently contribute to a sentence’s importance for a

summary, empirical evidence suggested that many conversations in narratives contain

valuable plot information even if they do not fall under the first two rules. The reason

for this is simply that dialogue is a natural way to progress a narrative, and while

it may not advance the plot with regards to physical actions, dialogue does advance

the plot with regards to inter-character relations. Through dialogue, for example,

one character may make another character angry, which provides the much needed

motivation for possible later actions. It should be noted that dialogue sentences still

have to contain a character-verb pair in order to be considered for the summary

(even if the pair is only “he said”, it usually points to the start or end of a character’s

speech, from which one can often infer the overall idea of the speech).
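
As a rough illustration, the three rules can be expressed as simple predicates over a sentence. The helper names and the string-matching heuristics below are hypothetical simplifications written in Python; NEVA itself operates on parsed, pronoun-resolved text rather than raw strings.

    def starts_with_character(sentence, characters):
        # Rule 1: the first word of the sentence is a known character name.
        words = sentence.split()
        return bool(words) and words[0].strip('\'"“”') in characters

    def mentions_multiple_characters(sentence, characters):
        # Rule 2: more than one distinct character is mentioned in the sentence.
        return sum(1 for name in characters if name in sentence) > 1

    def contains_dialogue(sentence, characters=None):
        # Rule 3: the sentence contains quoted speech (straight or curly quotes).
        # The unused second argument keeps the signature uniform with the others.
        return any(q in sentence for q in ('"', '“', '”'))

Keeping the rules as independent predicates makes it straightforward to award a point for each one separately, as described in Section 4.3.2.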

4.3 Implementation

Before NEVA can be applied to a document, the text has to be passed through

several other programs to format it and label it with the necessary data. These steps

are as follows:

1. The original text is processed by a named entity recognizer, written in Python,

using Python’s Natural Language Tool Kit (NLTK) [3].

2. The Python output is processed by a self-written Perl script to extract all the

named entities in the text that fall into the category of “Person” or “Organiza-

tion”.

3. The original text is processed by a self-written Perl script to preprocess all texts

and convert them to a standard format.

4. The preprocessed text is processed by the GPoSTTL v0.9.3 POS tagger [17].

5. The output of the POS tagger is processed by a self-written Perl script to

reformat it for the next step.

6. The formatted text is processed by a Collins chart parser [9].

7. The output of the Collins parser is processed by a provided clean-up script

written in Perl.

8. The cleaned-up output from the Collins parser is processed by a self-written

C++ implementation of the Hobbs pronoun resolution algorithm [16].

The output of the pronoun resolution is the original text with all of the pronouns

replaced with their antecedents. After these steps, the list of named entities, the

cleaned-up Collins parser text, and the pronoun resolved original text are processed

by a Perl script that executes the NEVA algorithm.
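
As an illustration of steps 1 and 2, the following Python sketch uses NLTK's standard tokenizer, part-of-speech tagger and named entity chunker to collect PERSON and ORGANIZATION entities. It is not the actual scripts used in the pipeline, only an approximation of their combined effect.

    import nltk
    # Requires the NLTK data packages punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker and words (installed via nltk.download).

    def character_entities(text):
        # Collect named entities labeled PERSON or ORGANIZATION, approximating
        # the effect of steps 1 and 2 of the pipeline.
        entities = set()
        for sent in nltk.sent_tokenize(text):
            tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
            for subtree in tree.subtrees():
                if subtree.label() in ("PERSON", "ORGANIZATION"):
                    entities.add(" ".join(word for word, tag in subtree.leaves()))
        return entities

    print(character_entities("Mr. Utterson found his way to Dr. Jekyll's door, "
                             "where he was at once admitted by Poole."))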

4.3.1 Choice of Resources

The Collins parser was chosen as the chart parser because it is known to have a

high accuracy and it is available freely online [8, 9]. The GPoSTTL v0.9.3 POS

tagger was chosen because the Collins parser requires input from a Brill tagger, and the

GPoSTTL is a Brill tagger with high accuracy that is available freely online [17]. The

Hobbs algorithm was used because of its high accuracy specifically when dealing with

fictional works [37].

The pronoun resolution algorithm has been modified to be better suited for use specifically with an automatic summarizer. The first modification is that instead

of looking for antecedents for every pronoun, only pronouns that might be referring

to people or organizations (characters that can perform actions) are resolved. The

classifications of people and organizations were labeled using the named entity rec-

ognizer. The reason for this modification is that no algorithm is 100% accurate, so

an incorrect resolution that replaces the pronoun “it” with “John” can hurt the final

summarization system. Following the same logic, to ensure that a pronoun such as

“he” is never resolved to an inanimate object, the program has been modified so that

only named entities that fell into the category of “person” or “organization” are used

as possible antecedents. With both of these modifications, the pronoun resolution



program has a perceived increase in accuracy; no rigorous tests have been conducted

to prove this, but the initial data seemed to favor this hypothesis heavily.
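
The effect of the two modifications can be pictured as a pair of filters applied around the Hobbs search. The sketch below is a hypothetical simplification in Python, not the actual modified C++ implementation; the function names are invented for illustration.

    PERSONAL_PRONOUNS = {"he", "him", "his", "she", "her", "hers",
                         "they", "them", "their", "theirs"}

    def should_resolve(pronoun):
        # Modification 1: only attempt resolution for pronouns that can refer to
        # people or organizations; "it", for example, is left untouched.
        return pronoun.lower() in PERSONAL_PRONOUNS

    def candidate_antecedents(proposed_noun_phrases, characters):
        # Modification 2: restrict the antecedents proposed by the Hobbs search to
        # named entities that the recognizer labeled PERSON or ORGANIZATION.
        return [np for np in proposed_noun_phrases if np in characters]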

4.3.2 Point System

Once the initial steps have been completed, the NEVA algorithm, implemented as

a rule based point system, is applied to every sentence. Any sentence that has at

least one character-verb pair is automatically given one point. For each additional

rule that the sentence follows, it is given an additional point. After all of the points

are distributed, the sentences are chosen to be part of the final summary in order

from the highest ranked sentences to the lowest, arranging selected sentences in the

same order that they appeared in the original text. If, after choosing any sentence,

a predetermined character limit is reached, no more sentences are chosen and the

summary is complete.
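
Putting the pieces together, the scoring and selection stage might look like the following sketch, written in Python rather than the Perl actually used. The character-verb check is a crude stand-in for the parse-based test, and the rules are passed in as functions, for instance the three predicates sketched in Section 4.2.

    def has_character_verb_pair(sentence, characters):
        # Crude stand-in: the real system checks the parse for a character entity
        # paired with a verb; here we only require that a character name appears.
        return any(name in sentence for name in characters)

    def score_sentence(sentence, characters, rules):
        # One point for containing a character-verb pair at all, plus one
        # additional point for every rule that the sentence satisfies.
        if not has_character_verb_pair(sentence, characters):
            return 0
        return 1 + sum(1 for rule in rules if rule(sentence, characters))

    def build_summary(sentences, characters, rules, char_limit):
        # Rank sentences by score, highest first; keep taking sentences until the
        # character limit is reached; then restore the original text order.
        scored = [(score_sentence(s, characters, rules), i, s)
                  for i, s in enumerate(sentences)]
        ranked = sorted((t for t in scored if t[0] > 0),
                        key=lambda t: t[0], reverse=True)
        chosen, length = [], 0
        for score, index, sentence in ranked:
            chosen.append((index, sentence))
            length += len(sentence)
            if length >= char_limit:
                break
        return " ".join(s for _, s in sorted(chosen))

Because the rules are independent of the selection loop, adding, removing or reweighting a rule does not change how the final summary is assembled.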
Chapter 5

Analysis of NEVA

5.1 ROUGE

Analyzing automatically generated summaries becomes additionally difficult when

dealing with plot summaries. While the ROUGE metric has been shown to work well

for news article summaries, there is no indication that there is a similar correlation

for plot summaries. The reason to differentiate the evaluation method for the two

types of summaries is the same reason why different algorithms are needed for the two

domains. News articles give major significance to rare words, and therefore TF-IDF

values and n-grams can play a role in those articles’ classification. This is not as true

in plot summaries, for reasons explained in previous sections.

Despite this, when Mihalcea and Ceylan did their work on automatic plot sum-

maries, they used the ROUGE metric to rate their summaries [28]. As with the rest of their work, they did not recognize all of the reasons to differentiate between news articles and novels; rather, they just dealt with the issue of novels being exceedingly long by comparison. Be that as it may, they were in some way justified in using ROUGE, as the ROUGE metric is the only current metric that has both been shown


to be accurate and is accepted by the NLP community for automatic summaries in

general. Therefore, to evaluate NEVA, the ROUGE metric as well as volunteer-based human analysis has been used. The ROUGE metrics used to evaluate the summaries were ROUGE-1, ROUGE-2, and ROUGE-SU4, chosen because these are the metrics

most accepted and used in the NIST Document Understanding Conferences [22].

Following Mihalcea and Ceylan’s lead, the automatically generated summaries

were tested against human-written summaries found online; these were taken from

sparknotes.com and gradesaver.com, chosen because they had similar length sum-

maries for the desired books. However, because of the constraint of using human

analysis as well, it was impractical to use the large corpus of books that Mihalcea

and Ceylan used. Instead, only select chapters from three different books were chosen

for analysis. The books used were Dr. Jekyll and Mr. Hyde by Robert Louis Steven-

son, The Awakening by Kate Chopin, and The Ambassadors by Henry James. For

each of these books, a random section of the narrative was chosen to be summarized,

and then additional consecutive sections were added until the corresponding online

summary reached approximately one page in length (this was approximately 4000

characters).

5.2 Human Evaluations

For the human evaluations, each of the three books’ sections has been analyzed by

seven different volunteers. Since human grading can be capricious and biased, each

person was given multiple summaries to grade in a blind experiment; that is, the

sources of the summaries were not revealed to the volunteers.

For each of the three books, five summaries were compiled. Two of these sum-

maries came from human-written online sources, one came from NEVA, one was a

baseline summary, and one was generated from a generic automatic summarizer called

MEAD. The baseline summary was a sentence extracted summary of evenly spaced

sentences from the target text; while this may not be the worst summary possible

(and hence, not a true baseline), it is a good reference of what any reasonable sum-

marizer should be able to beat. MEAD is an automatic summarizer available freely

online that uses a centroid based TF-IDF approach [30]. MEAD was the base sum-

marizer that Mihalcea and Ceylan used for their system, and therefore it is a suitable

“opponent” for NEVA [28].

Each volunteer was first presented with one of the online summaries and was

told to treat that summary as the “Gold Standard”, an ideal summary. The online

summary that was chosen to be the gold standard was the one that was further away

from 10% of the original text (10% being a standard value for summary to target

text ratio [11]); for example, if one summary contained 13% of the original text and

the other contained 15%, the 15% summary was chosen as the gold standard. The other

online summary was given to each volunteer, along with the three computer-generated

summaries for that text; this online summary was used as an upper bound on the

score that a computer-generated summary can hope to achieve. The volunteers were

not told about the sources of these four summaries.

5.2.1 Rating System

The volunteers were asked to rate each summary in three different categories using

a scale of one to ten, ten being the best. The three categories were content, flow,

and overall style. Although the categories overlap slightly (especially flow and overall

style), not squeezing the ratings into a single score gave the volunteers the necessary

freedom to truly rate different systems that might be better at different things. The

assumption was that some of the summaries might score higher in one category but

lower in the others. The volunteers were given definitions of the three categories, but

were told to interpret them in any way that they wanted. The reason this does not

matter too much is that as long as each volunteer used the same metric to rate every

summary he was given, the scores can still be scaled and considered meaningful. The

definitions given were:

1. Content: How much of the narrative that has been described in the Gold Stan-

dard is included in the summary?

2. Flow: Do the sentences in the summary flow well from one to the next? Is it

easy to follow the summary’s progression?

3. Overall Style: How much does the text sound like it is a well written summary

of the given chapters? Does it sound good? Does reading this summary feel as

comfortable as reading the Gold Standard?


Chapter 6

Results and Discussion

The scores for each of the three categories were calculated as the average score given

by the seven volunteers for each book; the original scores are presented in Appendix B

for reference. The results for Dr. Jekyll and Mr. Hyde, The Awakening, and The Ambassadors are shown in Figure 6.1, Figure 6.2, and Figure 6.3 respectively. As the summaries were evaluated by humans on a very subjective and arbitrary basis, exact scores do not have much meaning; therefore, only overall trends will be considered.

Figure 6.1 The human evaluations for Dr. Jekyll and Mr. Hyde.

Figure 6.2 The human evaluations for The Awakening.

Based on the graphs, it can be seen that the human-written summaries were

evaluated the highest in all of the categories for all of the books as expected. In

second place in every score is the MEAD summary, but across the board its score

is only about half that of the human-written summary. Third and fourth place are

much closer, and trends are divided by book. For Dr. Jekyll and Mr. Hyde, NEVA

and the baseline scored almost equal to MEAD on content; however, for both flow

and style, NEVA scored much closer to MEAD trailing by only about a point, while

the baseline trailed by about three points. For The Awakening, the positions are

almost reversed as the baseline beat NEVA in flow and style, and while they scored

similarly in content, both trailed MEAD by about one point. For The Ambassadors,

both NEVA and the baseline scored similarly to each other with NEVA doing slightly

better in all categories, while both trailed MEAD by about one point in the content

category, and fractions of a point in the other categories.

Figure 6.3 The human evaluations for The Ambassadors.

The ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics were applied to each of the summaries. The resulting scores for Dr. Jekyll and Mr. Hyde, The Awakening, and The Ambassadors are shown in Figure 6.4, Figure 6.5, and Figure 6.6 respectively. As

seen from the figures, the human-written summary has all of the highest scores for all

books according to all three metrics, which is to be expected since the human-written

summary is being used as an upper bound for the scores of the computer-generated

summaries. Looking at all of the other scores, with one exception, the second best

score is NEVA followed by the baseline followed by MEAD (the one exception is a tie

between MEAD and the baseline for last place).

Figure 6.4 The ROUGE scores for Dr. Jekyll and Mr. Hyde.

As opposed to the human evaluations, which are subjective, the ROUGE scores are calculated, and therefore precise relative scores become meaningful. Since the human-written summary is being used as an upper bound and ROUGE scores have much more significance when viewed relative to other ROUGE scores, it is useful to consider how well each summary did compared to the human-written summary. For each ROUGE metric applied, the ratios of the automatic summary ROUGE scores to the human-written summary ROUGE scores have been calculated and are presented as percentages in Figure 6.7. From Figure 6.7 it can be seen that not only did NEVA

beat the other two computer-generated summaries, but it performed exceedingly well,

achieving more than 45% of the score of the human-written summary in six out of the

nine cases. Additionally, when the ROUGE-2 metric was applied to The Awakening,

NEVA scored an amazing 85.2% of what the human-written summary scored. Note

that MEAD and the baseline respectively scored 28.5% and 52.6% of the human-

written summary according to the same metric; if NEVA’s 85.2% score was only so

high because the human-written summary did so poorly on that metric, then it would

be expected that MEAD and the baseline would have scored closer to NEVA than

they did.

MEAD scored low with every metric, performing 5.0% to 11.8% worse than the

baseline in most of the cases (when both were compared to the human-written sum-

mary). This shows that not only is MEAD a poor summarizer for this domain, but it is worse than a simple baseline of evenly spaced sentences according to the ROUGE metric.

Figure 6.5 The ROUGE scores for The Awakening.

These results are interesting because, for the human evaluations, MEAD performed consistently better than both NEVA and the baseline. Also, NEVA seemed to

perform comparably to the baseline according to human evaluators, contrary to the relatively high scores it achieved with ROUGE. This information convinced the au-

thor of this thesis to perform further analysis of the computer-generated summaries

to determine why there is such a discrepancy between evaluations. Therefore, the

author performed a manual analysis to determine how much content was similar be-

tween the human-written and computer-generated summaries. It is important to note

that this analysis only took into account content, and not flow or style.

To demonstrate the outcome of this analysis, the texts of the gold standard,

NEVA and MEAD’s summaries for the analyzed chapters from Dr. Jekyll and Mr.

Hyde have been provided in Appendix C. The author has annotated the texts to

highlight certain points about the content contained in the summaries. A red phrase

in NEVA’s summary denotes that there is a parallel phrase containing similar content

in the gold standard; those parallel phrases are marked with the same number as a

superscript at the end of the phrase.

Figure 6.6 The ROUGE scores for The Ambassadors.

Similarly, a green phrase in MEAD's summary denotes that there is a parallel phrase in the gold standard. The gold standard has

red phrases and green phrases that denote that those phrases are the parallel phrases

that occurred in either NEVA’s or MEAD’s summary respectively. Blue phrases in the

gold standard denote that there were parallel phrases to this phrase in both NEVA’s

and MEAD’s summary. Only in blue phrases does a superscript number appear in

all three summaries. A regular black phrase in NEVA’s or MEAD’s summary means

that there is no clear parallel in the gold standard.

Looking at the colors of NEVA’s summary, it can be seen that most of the text is

“checkerboarded” black and red more or less. This means that about half of what was

produced by NEVA was relevant to the summary. What this does not tell us is if the

lack of content in NEVA’s summary was due to the length of the summary or not.

As seen from counting the superscripts, NEVA’s summary has thirteen individual

phrases that are relevant, in other words, being parallel to content from the gold

standard.

Figure 6.7 The ratios of ROUGE scores for all summaries to ROUGE scores for human-written summaries.

MEAD’s summary does not look as colorful, containing only one line of green at the beginning and a chunk of green in the middle. Even if that chunk were to be

dispersed over the entire text, the summary would still be mostly black, i.e., mostly

irrelevant information. Also, compared to NEVA’s thirteen relevant phrases, MEAD

only produced four relevant phrases. What this tells us is that MEAD does not isolate

content well. The fact that MEAD generated relevant content for only a small chunk

of the text makes it seem that that “good” content may have been a statistical fluke;

MEAD might not actually be good at summarizing narratives, but it is bound to get

lucky sometimes.

The last text to look at is the gold standard summary. Almost unsurprisingly

at this point, most of the colored text is red, with two phrases colored blue and

two more colored green. The green/blue text (which denotes having a parallel in

the MEAD summary), while slightly spread out, is all in the first half of the gold

standard summary. As opposed to that, the red/blue text (which denotes having

a parallel in the NEVA summary) is nicely spread throughout the entire summary.

It is clear, though, that a majority of the text is not red/blue, meaning that the

NEVA summary was unable to account for even half of all relevant content. What

this means with regard to the half red NEVA summary is that NEVA is able to

consistently extract relevant sentences, but not shorten them in the same way that a

human writer might.

Although this analysis was only performed with one of the three summarized

books, the results are clear. The human evaluators did not pay close enough attention

to the content scores, and for some reason MEAD was considered better than NEVA

despite the large amount of additional relevant content that NEVA produced over

MEAD. A more accurate representation of a score based on content is found with

the ROUGE metric, although it is unclear how well the ROUGE scores correlate with the amount of content contained in a summary. For example, looking at the ROUGE scores for Dr. Jekyll and Mr. Hyde in Figure 6.7, it can be seen that the

NEVA summary has less than twice the relevant content that the MEAD summary

contains; however, it has just been demonstrated that NEVA has about three times

as much relevant content as MEAD contains in that summary (thirteen phrases to

four phrases). Although the author compared the computer-generated summaries to

the gold standard while the ROUGE percentages compare them to the other human-

written summary, it is a fair assumption that the human-written summaries contained

similar content.

The likely reason why human evaluators gave MEAD a higher score than NEVA is that humans inherently desire there to be more than just content in

a summary. This was the basis for why the human evaluators were asked to rate

the summaries in the two additional categories of flow and style. However, it is

likely that humans really only have one impression of a summary after they read

it, and even if asked later to differentiate between different categories their original

notions of a “good” or “bad” summary will take precedence. Analysis of the original

scores (see Appendix B) supports this theory. If human evaluators were objective in

the scoring between categories, then the scores would likely differ to a large extent,

especially if content and flow qualitatively differed so greatly. However, the average

difference in scores by a single evaluator on a single summary is only 0.905 out of 10.

(Every evaluator gave three scores per summary, so three pairwise differences can be calculated per evaluator per book per summary; with seven evaluators, three books, and four summaries each, this totals 7 × 3 × 4 × 3 = 252 values. The average was determined by taking the mean of the absolute values of all of these differences.)

By reading the NEVA and MEAD summaries of Dr. Jekyll and Mr. Hyde in

Appendix C, one sees what may have been the determining factor human evaluators used to rate the “goodness” of each summary. The MEAD summary reads well, and

the NEVA summary does not. For example, the first few sentences of the MEAD

summary form a logical progression that any human can follow: the narrator says how

Utterson goes to Jekyll’s house; then he describes the house, explaining who Jekyll

bought it from; then he talks about Utterson’s initial impressions of the house; then

he describes a room that Utterson walked in to. Indeed, these sentences should follow

a logical progression because these are sentences 1, 2, 3 and 5 from the original text

of the book! On the other hand, the NEVA summary does not flow well because the

sentences that are extracted from the original text are not necessarily consecutive and

therefore the intermediate progression is lost. A human reading these two summaries

would surely like the way MEAD’s summary sounds better than NEVA’s summary

and might miss the fact that NEVA’s summary contains more relevant content. It

seems as if MEAD looks to extract sentences that are in proximity to each other in

the original text. If a chunk extracted by MEAD was largely descriptive, there would

be almost no plot content extracted for many sentences as is the case in the first few

lines of MEAD’s summary of Dr. Jekyll and Mr. Hyde.


Chapter 7

Conclusion and Future Work

A system to automatically summarize narrative texts has been designed. Previous

systems have used statistical methods for summarization and these systems have been

unsuccessful at producing content rich summaries for narratives. In this thesis, a new

rule-based algorithm called NEVA has been presented that extracts sentences from

a narrative and successfully produces summaries with relevant content. The results

of NEVA have been compared against a statistical automatic summarization system

called MEAD using both the ROUGE summarization metric and human volunteers

in a blind experiment.

The human and ROUGE analyses were found to contradict each other.

As seen from Figures 6.1, 6.2 and 6.3, the human evaluators gave MEAD scores

that were consistently about one point higher than NEVA’s scores. However, Fig-

ures 6.4, 6.5, 6.6, and 6.7 show that ROUGE consistently gave NEVA better scores than it gave MEAD. Additionally, the ROUGE scores show that NEVA performs

very well in terms of including relevant content, containing up to 85.2% of the same

content as a human-written summary used as an upper-bound. A third analysis was

performed by the author that showed that NEVA produced more than three times


as much relevant content as MEAD did for one case. This shows that the ROUGE scores give a better indication of content than the human analyses. A hypothesis for this is that humans tend to rate summaries largely based on how well they read in general, regardless of the actual content. Therefore, human evaluators can be used

to judge the coherence of a summary, but not its content.

Since neither ROUGE nor human volunteers can fully analyze a narrative summary, the problem exists as to how to accurately rate the summaries. A summary might be unintelligible if it is not coherent enough; however, it is useless if it does not contain the right content. A combination of both analyses, where ROUGE rates

content and humans rate coherence, would be suitable for a full analysis of a narrative

summary. However, further research is necessary to determine exactly how accurately

humans can rate a summary’s cohesiveness.

The conclusions of the analyses show that NEVA produces content well, but fails

to string that content together so that humans can comfortably read its summaries.

This leaves a lot of work still to be done with NEVA. Firstly, the rule-based algorithm

seems to work much better than statistical methods for this domain; however, it can

definitely be improved upon. A possible approach for this would be to implement a

machine learning algorithm that creates an optimal list of rules for summarization.

Further work can be done to manually label large corpora of data for supervised

learning methods of this sort. Secondly, NEVA still needs a lot of post-processing

for its summaries so that the sentences flow and do not lack background information.

The current algorithm produces content that is somewhat close to that of human-written summaries, and if NEVA’s summaries could be as coherent as the human-written summaries, NEVA would be primed to replace humans for writing narrative summaries. This would be a large step in the right direction for automatic summa-

rizers and artificial intelligence in general.


Appendix A

Hobbs Algorithm

1. Begin at the noun phrase (NP) node immediately dominating the pronoun.

2. Go up the tree to the first NP or sentence (S) node encountered. Call this node

X, and call the path used to reach it p.

3. Traverse all branches below node X to the left of path p in a left-to-right,

breadth-first fashion. Propose as the antecedent any NP node that is encoun-

tered which has an NP or S node between it and X.

4. If node X is the highest S node in the sentence, traverse the surface parse trees

of previous sentences in the text in order of recency, the most recent first; each

tree is traversed in a left-to-right, breadth-first manner, and when an NP node

is encountered, it is proposed as antecedent. If X is not the highest S node in

the sentence, continue to step 5.

5. From node X, go up the tree to the first NP or S node encountered. Call this

new node X, and call the path traversed to reach it p.

6. If X is an NP node and if the path p to X did not pass through the Nominal

node that X immediately dominates, propose X as the antecedent.



7. Traverse all branches below node X to the left of path p in a left-to-right,

breadth-first manner. Propose any NP node encountered as the antecedent.

8. If X is an S node, traverse all branches of node X to the right of path p in

a left-to-right, breadth-first manner, but do not go below any NP or S node

encountered. Propose any NP node encountered as the antecedent.

9. Go to Step 4.
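
As a small illustration of the traversal primitive used in steps 3 and 7, the Python sketch below performs the left-of-path, breadth-first search over an nltk.Tree. It is only a fragment of the full algorithm: the additional step 3 condition (an intervening NP or S node) is omitted for brevity, and the toy parse and indices are purely illustrative.

    from collections import deque
    from nltk import Tree

    def left_branches_bfs(x, path_child_index):
        # Breadth-first, left-to-right traversal of every branch of node x lying
        # to the left of the path p, which descends through child number
        # path_child_index. Yields each NP node encountered.
        queue = deque(child for child in x[:path_child_index]
                      if isinstance(child, Tree))
        while queue:
            node = queue.popleft()
            if node.label() == "NP":
                yield node
            queue.extend(child for child in node if isinstance(child, Tree))

    # A toy parse of "John saw him". Suppose step 2 has climbed to the S node via
    # its second child (the VP); the only branch to its left is the subject NP.
    sentence = Tree("S", [Tree("NP", [Tree("NNP", ["John"])]),
                          Tree("VP", [Tree("VBD", ["saw"]),
                                      Tree("NP", [Tree("PRP", ["him"])])])])
    for np in left_branches_bfs(sentence, path_child_index=1):
        print(np)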
Appendix B

Human Analysis Ratings

Figure B.1 The original scores for the human evaluations of Dr. Jekyll and
Mr. Hyde.

Figure B.2 The original scores for the human evaluations of The Awakening.


Figure B.3 The original scores for the human evaluations of The Ambassadors.
Appendix C

Annotated Summaries

C.1 NEVA

It was late in the afternoon, when Mr. Utterson found his way to Dr. Jekyll’s door1 ,

where he was at once admitted by Poole, and carried down by the kitchen offices and

across a yard which had once been a garden, to the building which was indifferently

known as the laboratory or dissecting rooms. And indeed he does not want my help2 ;

you do not know him as I do; he is safe, he is quite safe; mark my words, he will

never more be heard of.”2 Utterson ruminated awhile; he was surprised at his friend’s

selfishness, and yet relieved by it. he asked. The doctor seemed seized with a qualm

of faintness; he shut his mouth tight and nodded. ”He meant to murder you5 . And

he covered his face for a moment with his hands. The newsboys, as he went, were

crying themselves hoarse along the footways: ”Special edition. Presently after, he sat

on one side of his own hearth, with Mr. Guest6 , his head clerk, upon the other, and

midway between, at a nicely calculated distance from the fire, a bottle of a particular

old wine6 that had long dwelt unsunned in the foundations of his house. There was

no man from whom he kept fewer secrets than Mr. Guest; and he was not always sure


that he kept as many as he meant. ”This is a sad business about Sir Danvers,” he said.

”Henry Jekyll forge for a murderer7 !” Sir came out of his seclusion, renewed relations

with his friends, became once more their familiar guest and entertainer; and whilst he

had always been known for charities8 , he was now no less distinguished for religion.

”The doctor was confined to the house,” Poole said, ”and saw no one.” On the 15th,

he tried again, and was again refused9 ; and having now been used for the last two

months to see his friend almost daily, he found this return of solitude to weigh upon

his spirits. ”Yes,” he thought; ”he is a doctor, he must know his own state and that

his days are counted; and the knowledge is more than he can bear.” And yet when

Utterson remarked on his ill-looks, it was with an air of great firmness that Lanyon

declared himself a doomed man. ”I have had a shock,” he said, ”and I shall never

recover10 . But Lanyon’s face changed, and he held up a trembling hand. ”I wish to

see or hear no more of Dr. Jekyll11 ,” he said in a loud, unsteady voice. he inquired.

”He will not see me,” said the lawyer. As soon as he got home, Utterson sat down and

wrote to Jekyll, complaining of his exclusion from the house, and asking the cause of

this unhappy break with Lanyon; and the next day brought him a long answer, often

very pathetically worded, and sometimes darkly mysterious in drift. The quarrel with

Lanyon was incurable. ”I do not blame our old friend,” Jekyll wrote, ”but I share his

view that we must never meet12 . Utterson went to call indeed; but he was perhaps

relieved to be denied admittance; perhaps, in his heart, he preferred to speak with

Poole upon the doorstep13 and surrounded by the air and sounds of the open city,

rather than to be admitted into that house of voluntary bondage, and to sit and speak

with its inscrutable recluse. Utterson became so used to the unvarying character of

these reports, that he fell off little by little in the frequency of his visits. The middle

one of the three windows was half-way open; and sitting close beside it, taking the

air with an infinite sadness of mien, like some disconsolate prisoner, Utterson saw Dr.

Jekyll14 . ”God forgive us, God forgive us,” said Mr. Utterson. But Mr. Enfield only

nodded his head very seriously, and walked on once more in silence15 .

C.2 MEAD

It was late in the afternoon, when Mr. Utterson found his way to Dr. Jekyll’s

door1 , where he was at once admitted by Poole, and carried down by the kitchen

offices and across a yard which had once been a garden, to the building which was

indifferently known as the laboratory or dissecting rooms. The doctor had bought

the house from the heirs of a celebrated surgeon; and his own tastes being rather

chemical than anatomical, had changed the destination of the block at the bottom of

the garden. It was the first time that the lawyer had been received in that part of

his friend’s quarters; and he eyed the dingy, windowless structure with curiosity, and

gazed round with a distasteful sense of strangeness as he crossed the theatre, once

crowded with eager students and now lying gaunt and silent, the tables laden with

chemical apparatus, the floor strewn with crates and littered with packing straw, and

the light falling dimly through the foggy cupola. It was a large room fitted round

with glass presses, furnished, among other things, with a cheval-glass and a business

table, and looking out upon the court by three dusty windows barred with iron. And

indeed he does not want my help2 ; you do not know him as I do; he is safe, he is

quite safe; mark my words, he will never more be heard of2 .” I should like to leave it

in your hands, Utterson; you would judge wisely, I am sure; I have so great a trust in

you3 .” The letter was written in an odd, upright hand and signed ”Edward Hyde”:

and it signified, briefly enough, that the writer’s benefactor, Dr. Jekyll, whom he had

long so unworthily repaid for a thousand generosities, need labour under no alarm

for his safety, as he had means of escape on which he placed a sure dependence4 .

Presently after, he sat on one side of his own hearth, with Mr. Guest6 , his head clerk,

upon the other, and midway between, at a nicely calculated distance from the fire,

a bottle of a particular old wine6 that had long dwelt unsunned in the foundations

of his house. Guest had often been on business to the doctor’s; he knew Poole; he

could scarce have failed to hear of Mr. Hyde’s familiarity about the house; he might

draw conclusions: was it not as well, then, that he should see a letter which put that

mystery to right? Much of his past was unearthed, indeed, and all disreputable: tales

came out of the man’s cruelty, at once so callous and violent; of his vile life, of his

strange associates, of the hatred that seemed to have surrounded his career; but of his

present whereabouts, not a whisper. He was busy, he was much in the open air, he

did good; his face seemed to open and brighten, as if with an inward consciousness of

service; and for more than two months, the doctor was at peace. The rosy man had

grown pale; his flesh had fallen away; he was visibly balder and older; and yet it was

not so much these tokens of a swift physical decay that arrested the lawyer’s notice,

as a look in the eye and quality of manner that seemed to testify to some deep-seated

terror of the mind. ”Yes,” he thought; ”he is a doctor, he must know his own state

and that his days are counted; and the knowledge is more than he can bear.” Utterson

was amazed; the dark influence of Hyde had been withdrawn, the doctor had returned

to his old tasks and amities; a week ago, the prospect had smiled with every promise

of a cheerful and an honoured age; and now in a moment, friendship, and peace of

mind, and the whole tenor of his life were wrecked. The doctor, it appeared, now

more than ever confined himself to the cabinet over the laboratory, where he would

sometimes even sleep; he was out of spirits, he had grown very silent, he did not read;

it seemed as if he had something on his mind.



C.3 Gold Standard

Utterson calls on Jekyll1 , whom he finds in his laboratory looking deathly ill. Jekyll

feverishly claims that Hyde has left and that their relationship has ended2 . He also

assures Utterson that the police shall never find the man. Jekyll then shows Utterson

a letter and asks him what he should do with it3 , since he fears it could damage

his reputation if he turns it over to the police. The letter is from Hyde, assuring

Jekyll that he has means of escape, that Jekyll should not worry about him, and

that he deems himself unworthy of Jekyll’s great generosity4 . Utterson asks if Hyde dictated the terms of Jekyll’s will, especially its insistence that Hyde inherit in the event of Jekyll’s disappearance. Jekyll replies in the affirmative, and Utterson tells

his friend that Hyde probably meant to murder him5 and that he has had a near

escape. He takes the letter and departs. On his way out, Utterson runs into Poole,

the butler, and asks him to describe the man who delivered the letter; Poole, taken

aback, claims to have no knowledge of any letters being delivered other than the

usual mail. That night, over drinks, Utterson consults his trusted clerk, Mr. Guest6 ,

who is an expert on handwriting. Guest compares Hyde’s letter with some of Jekyll’s own writing and suggests that the same hand inscribed both; Hyde’s script merely

leans in the opposite direction, as if for the purpose of concealment. Utterson reacts

with alarm at the thought that Jekyll would forge a letter for a murderer7 . As time

passes, with no sign of Hyde’s reappearance, Jekyll becomes healthier-looking and more sociable, devoting himself to charity8 . To Utterson, it appears that the removal of Hyde’s evil influence has had a tremendously positive effect on Jekyll. After two

months of this placid lifestyle, Jekyll holds a dinner party, which both Utterson and

Lanyon attend, and the three talk together as old friends. But a few days later, when

Utterson calls on Jekyll, Poole reports that his master is receiving no visitors. This

scenario repeats itself for a week9 , so Utterson goes to visit Lanyon, hoping to learn

why Jekyll has refused any company. He finds Lanyon in very poor health, pale and

sickly, with a frightened look in his eyes. Lanyon explains that he has had a great

shock and expects to die in a few weeks10 . “[L]ife has been pleasant,” he says. “I liked it; yes, sir, I used to like it.” Then he adds, “I sometimes think if we knew all, we should be more glad to get away.” When Utterson mentions that Jekyll also seems ill, Lanyon

violently demands that they talk of anything but Jekyll11 . He promises that after

his death, Utterson may learn the truth about everything, but for now he will not

discuss it. Afterward, at home, Utterson writes to Jekyll, talking about being turned

away from Jekyll’s house and inquiring as to what caused the break between him and Lanyon. Soon Jekyll’s written reply arrives, explaining that while he still cares for

Lanyon, he understands why the doctor says they must not meet12 . As for Jekyll

himself, he pledges his continued affection for Utterson but adds that from now on

he will be maintaining a strict seclusion, seeing no one. He says that he is suffering

a punishment that he cannot name. Lanyon dies a few weeks later, fulfilling his

prophecy. After the funeral, Utterson takes from his safe a letter that Lanyon meant

for him to read after he died. Inside, Utterson finds only another envelope, marked

to remain sealed until Jekyll also has died. Out of professional principle, Utterson

overcomes his curiosity and puts the envelope away for safekeeping. As weeks pass,

he calls on Jekyll less and less frequently, and the butler continues to refuse him

entry13 . The following Sunday, Utterson and Enfield are taking their regular stroll.

Passing the door where Enfield once saw Hyde enter to retrieve Jekyll’s check, Enfield

remarks on the murder case. He notes that the story that began with the trampling

has reached an end, as London will never again see Mr. Hyde. Enfield mentions that

in the intervening weeks he has learned that the run-down laboratory they pass is

physically connected to Jekyll’s house, and they both stop to peer into the house’s windows, with Utterson noting his concern for Jekyll’s health. To their surprise, the

two men find Jekyll at the window, enjoying the fresh air14 . Jekyll complains that

he feels very low, and Utterson suggests that he join them for a walk, to help his

circulation. Jekyll refuses, saying that he cannot go out. Then, just as they resume

polite conversation, a look of terror seizes his face, and he quickly shuts the window

and vanishes. Utterson and Enfield depart in shocked silence15 .


Bibliography

[1] Jon Barwise. An introduction to first-order logic. In Handbook of mathematical


logic, chapter A.1, pages 6–47. North–Holland, 1977.

[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. Inferring strate-
gies for sentence ordering in multidocument news summarization. Journal of
Artificial Intelligence Research, 17:2002, 2002.

[3] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with
Python. O’Reilly Media, Inc, 2009.

[4] Eric Brill. A simple rule-based part of speech tagger. In Proceedings of the
Third Conference on Applied Natural Language Processing (ANLP), Trento, Italy,
June-July 1992.

[5] Hsin-Hsi Chen and Chuan-Jie Lin. A multilingual news summarizer. In Proceed-
ings of the 18th conference on Computational linguistics - Volume 1, COLING
’00, pages 159–165, Stroudsburg, PA, USA, 2000. Association for Computational
Linguistics.

[6] Hsin-Hsi Chen and Chuan-Jie Lin. Sentence extraction by tf/idf and position
weighting from newspaper articles. In Proceedings of the Third NTCIR Work-
shop, 2002.

[7] Noam Chomsky. Syntactic Structures. Mouton, The Hague, 1957.

[8] Michael Collins. http://www.cs.columbia.edu/~mcollins/code.html. Accessed:


3/11/2011.

[9] Michael Collins. A new statistical parser based on bigram lexical dependencies.
In Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz, California,
USA, June 1996.

[10] Ani Nenkova. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference, 2005.


[11] Naomi Daniel, Dragomir Radev, and Timothy Allison. Sub-event based multi-
document summarization. In Proceedings of the HLT-NAACL 03 on Text sum-
marization workshop - Volume 5, HLT-NAACL-DUC ’03, pages 9–16, Strouds-
burg, PA, USA, 2003. Association for Computational Linguistics.

[12] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–


285, April 1969.

[13] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data
mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.

[14] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named entity
recognition through classifier combination. In Proceedings of the seventh con-
ference on Natural language learning at HLT-NAACL 2003 - Volume 4, pages
168–171, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[15] Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. Multi-
document summarization by sentence extraction. In Proceedings of the 2000
NAACL-ANLPWorkshop on Automatic summarization - Volume 4, NAACL-
ANLP-AutoSum ’00, pages 40–48, Stroudsburg, PA, USA, 2000. Association for
Computational Linguistics.

[16] Jerry R. Hobbs. Resolving pronoun references. Lingua, 44(4):311–338, 1978.

[17] Golam Mortuza Hossain. http://gposttl.sourceforge.net/. Accessed: 3/11/2011.

[18] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice
Hall, 1999.

[19] Anna Kazantseva and Stan Szpakowicz. Summarizing short stories. Comput.
Linguist., 36:71–109, March 2010.

[20] Christopher M. Kelley and Gillian DeMoulin. The web cannibalizes media, May
2002.

[21] W. G. Lehnert. Plot units: a narrative summarization strategy. In W. G. Lehnert


and M. H. Ringle, editors, Strategies for natural language processing. Hillsdale,
NJ: Lawrence Erlbaum, 1982.

[22] Chin-Yew Lin. Rouge: a package for automatic evaluation of summaries. In


Proceedings of the Workshop on Text Summarization Branches Out, Barcelona,
Spain, July 2004.

[23] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-
gram co-occurrence statistics. In Proceedings of Human Language Technology
Conference, Edmonton, Canada, May 2003.

[24] Melissa Maerz. Watson wins ’jeopardy!’ finale; ken jennings welcomes ’our new
computer overlords’. http://latimesblogs.latimes.com/showtracker/2011/02/
watson-jeopardy-finale-man-vs-machine-showdown.html. Accessed: 3/11/2011.
[25] Inderjeet Mani. Automatic Summarization. John Benjamins, 2001.
[26] Inderjeet Mani and Mark T. Maybury. Advances in Automatic Text Summariza-
tion. MIT Press, Cambridge, MA, USA, 1999.
[27] Kathleen McKeown and Dragomir R. Radev. Generating summaries of multiple
news articles. In Proceedings of the 18th annual international ACM SIGIR con-
ference on Research and development in information retrieval, SIGIR ’95, pages
74–82, New York, NY, USA, 1995. ACM.
[28] Rada Mihalcea and Hakan Ceylan. Explorations in automatic book summa-
rization. In Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning,
pages 380–389, Prague, June 2007.
[29] Frederick Mosteller and David L. Wallace. Inference in an authorship problem. Journal of the American Statistical Association, 58(302):275–309, 1963.
[30] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam. Centroid-
based summarization of multiple documents. Information Processing and Man-
agement, 40(6):919 – 938, 2004.
[31] Dragomir R. Radev and Kathleen R. McKeown. Generating natural lan-
guage summaries from multiple on-line sources. Comput. Linguist., 24:470–500,
September 1998.
[32] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[33] D. E. Rumelhart. Understanding and summarizing brief stories. In D. Laberge
and S. Samuels, editors, Basic processing in reading, perception, and comprehen-
sion. Hillsdale, NJ: Lawrence Erlbaum, 1977.
[34] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 2002.
[35] Peter Selgin. By Cunning and Craft: Sound Advice and Practical Wisdom for
Fiction Writers. Writer’s Digest Books, 2007.
[36] R. F. Simmons and A. Correira. Rule forms for verse, sentences and story trees.
In N. Findler, editor, Associative Networks: Representation and Use of Knowledge by Computers. New York: Academic Press, 1979.
[37] Joel R. Tetreault. A corpus-based evaluation of centering and pronoun resolution.
Computational Linguistics, 27(4):507–520, 2001.

[38] P. W. Thorndyke. Cognitive structures in comprehension and memory of narra-


tive discourse. Cognitive Psychology, 9:77–110, 1977.

[39] Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In Proceedings of the 6th Conference on
Natural language learning - Volume 20, COLING-02, pages 1–4, Stroudsburg,
PA, USA, 2002. Association for Computational Linguistics.

[40] Alan M. Turing. Computing Machinery and Intelligence. Mind, LIX:433–460,


1950.

[41] Stephen Wan and Kathy McKeown. Generating overview summaries of ongoing
email thread discussions. In Proceedings of the 20th international conference on
Computational Linguistics, COLING ’04, Stroudsburg, PA, USA, 2004. Associ-
ation for Computational Linguistics.

[42] Liang Zhou and Eduard Hovy. On the summarization of dynamically introduced
information: Online discussions and blogs. In AAAI Symposium on Computa-
tional Approaches to Analysing Weblogs (AAAI-CAAW), pages 237–242, 2006.

[43] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. Movie review mining and summariza-
tion. In Proceedings of the 15th ACM international conference on Information
and knowledge management, CIKM ’06, pages 43–50, New York, NY, USA, 2006.
ACM.
