
THE COOPER UNION FOR THE ADVANCEMENT OF SCIENCE AND

ART

ALBERT NERKEN SCHOOL OF ENGINEERING

NEVA: An Automatic Summarizer


for Narrative Texts

by

Joshua Blachman

A thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Engineering

April 15, 2011

Advisor

Dr. Carl Sable


THE COOPER UNION FOR THE ADVANCEMENT OF SCIENCE AND
ART

ALBERT NERKEN SCHOOL OF ENGINEERING

This thesis was prepared under the direction of the Candidate’s Thesis Ad-
visor and has received approval. It was submitted to the Dean of the School
of Engineering and the full Faculty, and was approved as partial fulfillment of
the requirements for the degree of Master of Engineering.

Dr. Simon Ben-Avi


Acting Dean, School of Engineering

Dr. Carl Sable


Candidate’s Thesis Advisor
Abstract

Automatic summarization research to date has mostly been concerned with
summarizing technical documents and news articles, and the domain of narratives
has been neglected. A new rule-based approach has been created to
summarize narrative texts, specifically isolating plot lines. A system implementing
this approach, called NEVA, has been applied to three different books
and has been evaluated using both human volunteers and the ROUGE metric.
Results show that NEVA successfully creates plot summaries containing up
to 85.2% of the same content as some human-written summaries of the same
narrative.
Acknowledgments

This thesis would not have been possible without the support I’ve been given

in the past few years of my education by my family, friends and professors.

First and foremost, I’d like to thank my advisor, Professor Carl Sable for his

tireless efforts as a guide and friend as I took the necessary steps in conceiving,

researching and writing this thesis. It seemed like at every turn, even before I

encountered problems, Professor Sable was already there, helping me through

them, and always with a smile. Thank you for always having your door open

and even allowing me entrance into your coveted Bennett apartment.

I’d also like to thank the professors of Cooper Union who have taught me

so much about engineering, science, and life in general, including but certainly

not limited to: Chris Lent, Stuart Kirtman, Kausik Chatterjee, Yash Risbud,

Fred Fontaine, James Abbott, Stanley Shinners, Alan Wolf, Robert Uglesich,

Alan Berenbaum, Toby Cumberbatch, and Benjamin Davis. Special thanks to

Glenn Gross and Dino Melendez for just being awesome people and giving all

their efforts to help around the lab whenever necessary.

I’d like to thank my friends who are always there for me, giving me support

and helping me complete the results section of my thesis (after what was for

some, consistent nagging on my part). These friends are Yonah Kupferstein,

Josh Nissel, Sippy Laster, Yael Sacks, Aviva Bukiet, Ezra Obstfeld, Elissa

Gelnick, Naomi Levin, Batya Septimus, Evan Hertan, Michael Sterman, Aliza
Ben-Arie, Eliana Grosser, Michali Steinig, Michael Feder, Hanna Clevenson

and Daniel Rich. I’d like to thank my parents for supporting me both emo-

tionally and financially through my years of education; I am who I am today

only because of you. Also, thanks for helping with my results section as well;

I know how hard it was for you to read those ten pages.

Acharon Acharon Chaviv, I’d like to thank Hakadosh Baruch Hu for all

of the Brachot He has given me throughout my years. I understand that my

whole being is tied to His Ratzon, and I try every day only to fulfill the Tachlis

He has laid out for me and to incorporate my Yiddishkeit in everything I do,

creating as much Kiddush Hashem as possible. Thank you.


Contents

Table of Contents vi

List of Figures viii

1 Introduction 1

2 Background 4
2.1 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 General NLP Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 POS taggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Chart Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Named Entity Recognition . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Pronoun Resolution . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Automatic Summarization . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Structural Approach . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Multiple Document Summarization . . . . . . . . . . . . . . . 24
2.4 Evaluating Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Related Work 28
3.1 Plot Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Project Description 32
4.1 Problems With Previous Work . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Creating a Narrative Summarizer Algorithm . . . . . . . . . . . . . . 33
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.1 Choice of Resources . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Point System . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Analysis of NEVA 39
5.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Human Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Rating System . . . . . . . . . . . . . . . . . . . . . . . . . . 41


6 Results and Discussion 43

7 Conclusion and Future Work 52

A Hobbs Algorithm 54

B Human Analysis Ratings 56

C Annotated Summaries 58
C.1 NEVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.2 MEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.3 Gold Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Bibliography 65
List of Figures

2.1 The parse tree of “The dog ate the food”. . . . . . . . . . . . . . . . 12


2.2 A valid parse tree of “Colorless green ideas sleep furiously”. . . . . . 14

6.1 The human evaluations for Dr. Jekyll and Mr. Hyde. . . . . . . . . . 43
6.2 The human evaluations for The Awakening. . . . . . . . . . . . . . . 44
6.3 The human evaluations for The Ambassadors. . . . . . . . . . . . . . 45
6.4 The ROUGE scores for Dr. Jekyll and Mr. Hyde. . . . . . . . . . . . 46
6.5 The ROUGE scores for The Awakening. . . . . . . . . . . . . . . . . 47
6.6 The ROUGE scores for The Ambassadors. . . . . . . . . . . . . . . . 48
6.7 The ratios of ROUGE scores for all summaries to ROUGE scores for
human-written summaries. . . . . . . . . . . . . . . . . . . . . . . . . 49

B.1 The original scores for the human evaluations of Dr. Jekyll and Mr.
Hyde. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
B.2 The original scores for the human evaluations of The Awakening. . . . 56
B.3 The original scores for the human evaluations of The Ambassadors. . 57

Chapter 1

Introduction

Airing on February 14 through 16, 2011, for the first time in its 47-year run,

Jeopardy! pitted two human players against a computer. Millions of viewers

watched as the IBM computer program called Watson defeated Jeopardy! champions

in a landslide victory that had Ken Jennings saying at its end “I for one welcome our

new computer overlords” [24]. But Jennings may have spoken too soon. Although it

may have seemed that Watson was very smart as it answered many questions correctly

and quickly, Watson was actually just providing the illusion of intelligence by using

Natural Language Processing (NLP). NLP is a branch of artificial intelligence that

focuses on creating programs that can interpret natural languages (e.g., English)

usefully. NLP allows computer programs to perform difficult human tasks such as

playing Jeopardy! without actually understanding anything at all.

One of the most researched areas of NLP is the field of automatic summarization.

The topic, as the name suggests, deals with creating programs to automatically sum-

marize a text into a shorter text which has a length that is some percentage of the

original text. Most of the work in automatic summarization has been focused on sum-

marizing online news articles [5,10,27]. The usefulness of this is readily apparent; the


Internet is continuously becoming a primary source for news [20], but the sheer volume

of articles posted every day makes it an insurmountable task to read or even organize

everything. A human trying to sift through this information would need to read all

or parts of the available articles to determine what information they contained and

whether it was relevant to him; such a task would most likely take weeks or months for

just a day’s worth of online material. Automatic summarizers can be used to create

headlines or short summaries of each article and then the task of sifting through the

mound of articles can be done in a fraction of the time.

With such potential, it is a wonder that automatic summarizers have not been

applied to all different types of texts. One domain that has been overlooked is narra-

tive texts. In this thesis, a novel approach for automatic summarization of narrative

texts has been created. Narratives are fundamentally different from news articles in

that not everything is relevant. A news article tries to make a point, providing as

much information as possible about the event on which it is reporting. A narrative, on the

other hand, has background information and character development that might have

nothing to do with the actual plot of the text.

For this purpose, the program described in this thesis tries to extract only plot

information from narratives. Called the Named Entity Verb-focused Automatic sum-

marizer (NEVA), the program looks for all of the actions that are performed by the

main characters and extracts the sentences from the narrative that describe those ac-

tions. These sentences are then ranked based on a series of rules, and only the highest

ranked sentences are chosen to be part of the final text. The resulting sentences are

effectively a list of all of the important events that take place in the narrative and

therefore provide an accurate summary of the narrative’s plot.

NEVA has been applied to three books; namely, Dr. Jekyll and Mr. Hyde by

Robert Louis Stevenson, The Awakening by Kate Chopin, and The Ambassadors

by Henry James. The resulting summaries have been compared to human-written

summaries, summaries produced by a statistical automatic summarization system

called MEAD, and a baseline summary. The summaries were evaluated using both the

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) summarization metric

and human volunteers in a blind experiment. The ROUGE metric is an accepted

automatic summarization metric used in the National Institute of Standards and

Technology (NIST) annual Document Understanding Conferences [22]. This thesis

shows that NEVA performs exceptionally well with regards to including relevant plot

content, containing up to 85.2% of the same content as a human-written summary

used as an upper-bound.
Chapter 2

Background

2.1 Artificial Intelligence

As soon as people realized that computers could replace humans as a means for calcu-

lations, they started searching for ways to make computers as intelligent as people.

Artificial intelligence became the holy grail of scientific progress, with every fictional

depiction of a future world featuring intelligent humanoid robots performing everyday

human tasks. Even a “simple” human task such as walking, though, would require a

humanoid robot to first master several other tasks including visually understanding

a room, moving several parts of its body at once without falling, and avoiding obsta-

cles, to name a few; each of these tasks by itself is extremely complicated and difficult

to solve. Due to the explosion of tasks that researchers wanted computers to perform,

measures of intelligence became important. This led researchers to develop rigorous

methods of testing how intelligent a computer has actually become.

In 1950, Alan Turing developed a computer intelligence test that is still in use

today, called the Turing test [40]. To conduct the Turing test, a master of human

psychology would ask questions using a terminal to elicit answers from either a human


on another terminal or a computer program. If the psychologist could not figure out

whether he was talking to a human or a computer program, the computer program

would be deemed intelligent. Clearly, there would be no way to tell whether a com-

puter passing the Turing test would be capable of walking (even with the required

mechanisms), but theoretically, the computer would be considered as “aware” as any

competent human.

And so the race to make computers understand language began. For if a computer

can understand, interpret, and generate language that is understandable to a native

speaker, then according to the Turing test that computer would be as intelligent as any

person. This new field developed into what is today called natural language processing

(NLP) [18].

Understanding language is clearly no simple task. It involves so many things that

people take for granted but are actually quite difficult. For example, it is no simple

task for a computer to separate a sentence into its constituent words. The reason

is clear for spoken language: a computer receives speech as one long

sound wave with indeterminate word beginnings and endings throughout. However, even

with written language, the task is far from solved. For example, it may seem as if a

period is always at the end of a sentence, with the sentence’s final word immediately

preceding that period; however, this is not true for abbreviations, in which periods are

very much a part of the word. While the difference between the two would be obvious

to any human reader, this task is not so simple for the programmed computer.
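To make this concrete, here is a minimal sketch (not taken from this thesis's own implementation) of how a naive period-based splitter mishandles abbreviations, and how a trained sentence tokenizer handles them; it assumes Python with the NLTK toolkit and its "punkt" models installed.

import re
import nltk  # assumes the NLTK "punkt" sentence tokenizer data has been downloaded

text = "Dr. Jekyll went home. Mr. Hyde did not."

# Naive approach: split on periods; abbreviations such as "Dr." break this.
naive = [s.strip() for s in re.split(r"\.", text) if s.strip()]
print(naive)  # ['Dr', 'Jekyll went home', 'Mr', 'Hyde did not']

# A trained tokenizer recognizes common abbreviations and keeps sentences intact.
print(nltk.sent_tokenize(text))  # typically ['Dr. Jekyll went home.', 'Mr. Hyde did not.']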

Natural language processing has to be capable of much more than just word tok-

enization (separating sentences into words) if it is to be successful. Working strictly

with just the written side of language processing, after tokenizing a sentence, a com-

puter must do a plethora of other tasks before coming close to meeting Turing’s

standard of intelligence. Once tokenized, words are still just a conglomerate of let-

ters to a computer. Those words must still have meaning assigned to them. Other

required tasks would include, among others: disambiguating each word based on its

meaning in that sentence; logically constructing a connection between the words in

the sentence; understanding the inferences that are implied by the words, but are

not definitively stated; keeping a dynamic database of all new and old information,

and knowing how each piece of information relates to the others; and then, regen-

erating this entire puzzle to create a unique response that a human speaker would

understand.

For many of these tasks there have been decades of research all over the world in-

vestigating novel approaches for tackling these problems, ranging from mathematical

models to logical step-by-step approaches. One of the many things the different ap-

proaches have in common, though, is that they all involve multiple complicated steps,

with each step involving its own natural language processing problem. None of these

problems has a perfect solution, and as such, a solution that is a summation

of smaller solutions, each with its own errors, compounds those

errors. It is because of this that most researchers have devoted their time

to solving just one small subproblem of natural language processing, instead of going

for the gold by passing the Turing test.

There are many different ways to go about programming a computer to “think”

like a human, but surprisingly, there are only a few general methods which stand

out as working exceptionally well. These methods have become the cornerstone of

NLP research and they are used very heavily to solve most NLP problems with high

success. As NLP grew, there were two general ways of approaching problems, which

still divide the algorithms used in NLP today. These approaches became known as

the symbolic and statistical paradigms [18].

The symbolic paradigm was inspired by what humans are perceived to do when

they interpret language, and its goal is to get computers to ultimately understand

language. The idea is that there must be some underlying structure to texts that

allows humans to understand it, and if a program can translate the text into this

logical structure, it should be able to interpret it. This theory led to programs that

produce such structures as first-order logic representations of a text [1]. In first-order

logic, a computer would label each word in a text with a type and create simple

relationships between those words. With these relationships in place, a computer

would “understand” the text, and even be able to answer questions about details

from the text.
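As a brief illustration (the predicates here are invented for this example and are not taken from [1]), the sentence “The dog ate the food” might be translated into a first-order-logic-style representation such as:

dog(d) ∧ food(f) ∧ ate(d, f)

A question such as “What did the dog eat?” could then be answered by matching the query ate(d, x) against this representation and returning the binding x = f.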

Instead of trying to get computers to achieve an understanding of a text, the

statistical paradigm’s goal is to achieve good results, no matter what the method.

This led to developing algorithms that produced results based on probabilities. The

inspiration is that there are overall patterns that language follows and if the computer

can learn those patterns, it can predict how unseen texts follow those patterns.

One of the earliest and best-known examples of this type of approach

occurred in 1963, when Mosteller and Wallace determined the authorship of The Feder-

alist Papers, thus ending a long historical debate [29]. They statistically analyzed

known works of the proposed authors, James Madison and Alexander Hamilton, and

compared frequencies of certain common words such as “of” or “about”. They found

that the frequencies of use of these words were consistent for each suspected author

but different between the authors. The frequencies of these words in the disputed

papers matched those of Madison.

Mosteller and Wallace gathered their information using a technique called data

mining. In most NLP applications, there exists an input text and an unknown state

of this text; in Mosteller and Wallace’s case the input text consisted of certain articles

comprising The Federalist Papers and the unknown state was the author. For data

mining, a large set of texts, called a corpus, is needed from which data is to be

extracted. In the case of a supervised data mining algorithm, the corpus would

have manually tagged data, providing examples of which types of data

lead to which states. The assumption of a supervised data mining algorithm is that

the patterns that exist in the corpus are the same as the patterns that exist in the

unknown data set. The goal, then, is to “mine” data from the corpus to create a

statistical representation of it that can be used to extract unknown states from other

texts.

Another mainstream way of gathering statistical data is called machine learning.

Instead of mining explicit patterns, machine learning uses algorithms to train a function using the known cor-

pus. For many machine learning techniques (e.g., neural networks or support vector

machines), the function is a black-box, but for others (e.g., decision trees), the inner

workings of the function are apparent [34]. In either case, the trained system can then

be applied to an unknown text in order to classify or make some statistical prediction

about it. Some people classify machine learning as a subset of data mining due to

their similarities [13].

Data mining and machine learning can also both be used for unsupervised learning

algorithms which do not involve a known training corpus (although they may train on

an unlabeled corpus of unknown states). The unsupervised technique relies on finding

patterns that are inherent in the data instead of using patterns from known states

of the data. A common example of an unsupervised technique is called clustering,

which tries to group data together based on some measurement of distance that exists

between any two data points [32]. The appeal of such methods is that there is no

need for manually labeled data, which may be expensive or even impossible to obtain

at times.

2.2 General NLP Tasks

One subset of Natural Language Processing that has received considerable attention

is the task of summarizing a body of text [26]. While this might seem unrelated to

the overall task of human-computer conversations, solving the summarization prob-

lem completely would be a pivotal turn towards computer intelligence. One reason

for this is simply that if a computer can filter out all of the unnecessary language

that accompanies a normal conversation and get to the point that the communica-

tor is trying to make, it would be much easier to understand the conversation as a

whole. Even without the eventual goal of passing the Turing test, though, the task

of automatic summarization would be useful in its own right. Artificial intelligence

is, after all, the process of a computer performing a task that is normally done by

a person, and this is no exception. It would be extremely useful if a person could

input a body of text to a computer and receive an accurate summary of the text.

This becomes especially useful as the input text gets larger and unmanageable for

a person to read, understand and summarize himself. A summary could help the

person decide whether or not they want to read the entire text; or it could save a

tremendous amount of time if the summary itself tells them everything they want to

know.

In addition to summarizing large texts, it is also useful to summarize many small

texts like those commonly found on the Internet. Now that the Internet is a primary

source for news [20], the conglomerate of daily news articles has become unman-

ageable and represents the perfect domain for which to develop automatic computer

summarizers. Before reaching the point where researchers could begin working on au-

tomatically summarizing news articles, there were years spent researching the more

basic parts of natural language processing. This research led to resources such as

part-of-speech taggers, sentence chart parsers, named entity recognizers, and pro-

noun resolvers.

2.2.1 POS taggers

The task of a part-of-speech (POS) tagger is to label each word in a sentence with

its correct POS [4]. The difficulty in this arises because many words contain several

meanings and different meanings often have different parts of speech. For example,

most gerunds, in addition to being nouns, can be used as adjectives as well; “running”

can be used both in “Running is healthy” and in “the running man”. Humans in-

stinctively pick out the difference between the two cases because they can understand

that “running” is used in two very different contexts. Therefore, many standard POS

taggers try to mimic humans in this way by “learning” which contexts produce which

POS tags for a given word. A common POS tagger implementation is called the Brill

tagger, first used by Eric Brill in 1993. This algorithm uses a supervised learning

method to produce a list of rules that can be applied to unknown words. The Brill

tagger has an accuracy of approximately 95% [4].
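As a minimal sketch of how such a tagger is typically invoked (this example assumes Python with the NLTK toolkit and its pre-trained tagger models downloaded; it is illustrative only and is not necessarily the tagger used elsewhere in this thesis):

import nltk  # assumes the "punkt" and "averaged_perceptron_tagger" data are downloaded

for sentence in ["Running is healthy", "the running man"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))  # each word is paired with a part-of-speech tag

# The goal is for "running" to receive different tags in the two contexts, since the
# surrounding words indicate a noun-like use in the first sentence and a
# modifier-like use in the second.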

2.2.2 Chart Parsers

A chart parser is a program that takes a sentence as input and outputs a structured

representation of the sentence based on the tags of the words [18]. For this reason,

many chart parsers require a tagged sentence as an input, formatted in a way that

is recognizable by that particular chart parser. The output of a chart parser is in

the form of a tree structure, with the leaves of the tree representing the individual

word/tag pairs of the sentence. The different places where the tree branches off are

determined by the context-free grammar (CFG) that the chart parser is using.

A CFG is a grammar that describes how words and phrases are related to one

another, and therefore sets rules for the syntax of a sentence in a language [7]. A

CFG consists of rules that equate one type of phrase with one or more other types

of phrases in that syntactically (disregarding the actual meaning of the words) they

can be swapped for each other without worrying about their contexts; it is for this

reason that a context-free grammar is context-free. A simple example of this is the

rule that a sentence can be comprised of a subject (a noun phrase) and a predicate

(verb phrase). A CFG would therefore include the rule:

S → NP VP

where → means “can have the form of” and S, NP and VP refer to sentence, noun

phrase and verb phrase respectively. However, for the English language, this rule

does not describe the only way that a sentence can be formed; a valid exclamatory

sentence, for example, might consist of only one word (e.g., “Wow!”). Therefore, an

English CFG might also include the rule:

S → WORD !

Note that the word “wow” is not so easily tagged, and also it is questionable whether

the punctuation “!” even deserves a tag. This all adds to the complexity of creating

a complete tag set as well as integrating that tag set with a CFG and ultimately a

chart parser.

The functionality of a chart parser can be demonstrated with a simple sentence

such as “The dog ate the food”. A correct chart parser output would have a “sentence”

node at the root of the tree, which branches off into a noun phrase node, consisting

of “The dog”, and a verb phrase node consisting of “ate the food”. Then the noun

phrase node would be divided into the determinant “The” and the noun “dog”, while

the verb phrase would be divided into the verb “ate”, and the noun phrase “the food”,

which would in turn be divided into the determinant “the” and the noun “food”. The

parse tree representation of the sentence is shown in Figure 2.1.

Figure 2.1 The parse tree of “The dog ate the food”.

Many chart parsers use a simplified, parenthesis-delimited representation of trees, which for the example

sentence would lead to:

(S (NP (DET The) (N dog)) (VP (V ate) (NP (DET the) (N food))))

There might be other correct outputs depending on various factors such as the set of

tags and the CFG that the chart parser is using.
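The following minimal sketch shows this behavior with a toy grammar; the grammar, the tag names, and the use of the NLTK toolkit are assumptions made for illustration, not the parser or tag set used elsewhere in this thesis.

import nltk

# A toy context-free grammar that covers only the example sentence;
# real parsers use far larger grammars and tag sets.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N
VP -> V NP
DET -> 'The' | 'the'
N -> 'dog' | 'food'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The dog ate the food".split()):
    print(tree)  # prints a bracketed tree: (S (NP (DET The) (N dog)) (VP ...))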

Most of the difficulty with a chart parser producing the correct output comes

from the uncertainty as to which rule of the CFG to use for a given phrase. Each new

parsed sentence will start at the top of the parse tree with a “sentence” node, but

as shown previously, this “sentence” node can be split into a noun phrase and verb

phrase, or a single exclamatory phrase, or one of many other valid sentence structures

in the English language. As a parser gets further down the tree, the complexity only

increases, as almost every type of phrase has multiple CFG rules that split it into

other types of phrases.

Parsing a sentence into a full valid parse tree is not as simple as taking any phrase

and using any CFG rule, for doing so will most likely not result in a valid sentence,

due to words that are “left over” after part of the parsing is completed. This can be

illustrated more clearly by looking at the example sentence from before. Instead of

parsing the sentence the way it was parsed above, an equally valid parse (following

the CFG rules) for the beginning of the sentence would be:

(S (NP (DET The) (N dog)) (VP (V ate)))

This leaves the words “the food” unable to validly fit into any CFG rule as there is

no structure in English that allows for such a noun phrase to sit alone at the end of

a sentence. This problem manifests itself in a plethora of different ways for almost

any English sentence, leaving the issue of finding an algorithm that can efficiently

produce grammatically correct parses of a sentence.

There are two main approaches for such parsing algorithms: top-down and bottom-

up [18]. Top-down parsing, as the name suggests, starts with the top “sentence” node

and splits it into all possible parses based on the CFG rules. The parser continues to

split the phrases until it reaches the part-of-speech branches. Valid parses are simply

all parse trees whose bottom part-of-speech branches fit with the part-of-speech tags

of the words of the sentence. Bottom-up approaches work the opposite way. They

start with the part-of-speech tags of a sentence and apply the CFG rules to them in

a backwards fashion to produce higher level nodes in the parsed sentence tree. Mul-

tiple trees are continuously combined either until no valid CFG rules exist that can

be applied to the remaining nodes (in which case the parse tree is discarded) or the

branches unify into a single sentence node (i.e., a valid parse tree). Both approaches

have their advantages and disadvantages, in that neither approach takes into account

both the overall sentence structure (as the top-down approach does) and the actual input sen-

tence (as the bottom-up approach does). Many current parsing algorithms use hybrid

approaches that are more efficient than either approach individually with respect to

both speed and memory.

Regardless of which parsing method is used, a single English sentence can have

upwards of 300 different valid parses; while all might be technically grammatically

correct, only a few make much sense to an English-speaking person. (A famous example

that shows how a valid parse can lead to nonsense is the sentence “Colorless green

ideas sleep furiously”, first proposed by Noam Chomsky in 1957 [7]. The sentence

has a correct grammatical structure shown in Figure 2.2, but it is readily apparent

that none of the words in the sentence have any meaning relative to each other.)

Figure 2.2 A valid parse tree of “Colorless green ideas sleep furiously”.

Since a chart parser would be relatively useless if it produced 300 parses for any

sentence, NLP researchers have developed methods of choosing a single best parse.

Most notably, statistical parsing methods have proven to be very successful in this

regard [9].

Like other statistical NLP algorithms, statistical parsing uses probabilities that

must be learned before they can be applied to unknown inputs. Although what is

learned changes slightly from algorithm to algorithm, the basics remain the same;

give patterns that appear frequently in the learning corpus a higher probability of

occurring in the input sentence. This usually manifests itself with regards to CFG

rules as well as tag and word pairs that are grouped together. For example, “S →

WORD !” and “green ideas” will have extremely low probabilities when compared to

“S → NP VP” and “green leaves” respectively. After combining together all of the

different probabilities in a given parse tree, the parse tree with the highest overall

probability is chosen as the “correct” parse.

2.2.3 Named Entity Recognition

In addition to labeling words with part-of-speech, words can be classified as different

types of entities, most notably as a named entity. Named entities are phrases that

contain the names of persons, organizations, locations, or times [39]. Example named

entities of those types would be “John”, “Microsoft”, “London” and “July 4th, 1776”,

respectively. What makes named entities so useful, though, is not always that they are

named entities, but rather that they are some specific type of named entity. Labeling

words in a text as a “Person”, “Organization”, “Location”, or “Time” proves to be

invaluable for some NLP applications.

There are many different ways to go about identifying named entities, the easiest of

which is to just look up every word in lists of named entities to see if they match. One

problem with this method is that there are many words that are not named entities but

that have named entity counterparts. This leads to errors such as the verb “reading”

being labeled as a city in England. To counteract this effect, named entity recognizers

apply chunkers to the text before searching for named entities [3]. A chunker, as the

name suggests, breaks down sentences into chunks of smaller phrases, each with its

own tag. The result is somewhat of a cross between a POS tagger and a chart

parser in that chunkers give sentences some structure, and as such, most chunkers

use similar methods to chart parsers to obtain their results. Popular among these

methods are pattern matching and supervised learning techniques. Once sentences

are chunked, there is much less of an issue with mislabeling named entities, because

chunks will define when a named entity is expected, to the exclusion of verbs like

“reading”. Named entity recognizers are reported to achieve 79% accuracy for the

English language [14].
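A minimal sketch of this pipeline (tokenize, tag, chunk, then label named entities) follows; it assumes Python with the NLTK toolkit and its named entity chunker models, which is one possible implementation rather than the specific recognizer evaluated in [14].

import nltk  # assumes the NLTK tagger and named entity chunker data are downloaded

sentence = "John moved to London in July to work for Microsoft."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # part-of-speech tags guide the chunker
tree = nltk.ne_chunk(tagged)   # chunks the sentence and labels named entities
print(tree)  # entities appear as subtrees labeled PERSON, GPE, ORGANIZATION, etc.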

2.2.4 Pronoun Resolution

One of the ubiquitous techniques of writers is to use pronouns often instead of re-

peating a noun, called the pronoun’s antecedent, several times. The assumption that

these writers make is that the pronoun has an obvious antecedent, and that there

can be no other ambiguous antecedent to which the stated pronoun corresponds. In

fact, it can be considered bad English grammar when a pronoun is stated and there

are two or more possible antecedents to which the pronoun refers. Despite this fact,

many writers still use ambiguous pronouns in the hope that most readers would be

able to figure out the correct antecedent with limited intellectual effort.

Unfortunately, computers do not even have a limited intellect which they can

draw upon to resolve ambiguous pronoun references. Incidentally, even unambiguous

pronouns can pose an issue to a computer; this is largely because there are very few

cases where a pronoun is truly unambiguous. Although a human reader might be

able to distinguish immediately between the type of noun that a pronoun refers to

and therefore immediately eliminate most of the other nouns in a sentence as possible

antecedents, computers can make no such distinction and are left with the group of

all previous (and possibly future) nouns in the sentence from which to choose the

antecedent. Fortunately, there are current algorithms that account for many of the

issues that a computer must handle when trying to resolve a pronoun.

Among others, there is a pronoun resolution algorithm that is still popular in

NLP today, called the “Hobbs algorithm” after its creator, Jerry Hobbs [16]. The

Hobbs algorithm uses the chart parsed structure of a sentence to locate the most

probable noun as an antecedent for a given pronoun. The algorithm uses a list

of steps that, if followed, produce the best matched antecedent. The algorithm is

presented in Appendix A. Hobbs evaluated this algorithm on hundreds of examples

from three different texts and reported an accuracy of 88.3% [16]. While this accuracy

is not indicative of all corpus genres, Jeol Tetreault evaluated many different pronoun

resolution algorithms on different genres and found the Hobbs algorithm to be have

an accuracy of 80.1% for fictional works, one of the highest for that genre [37].

2.3 Automatic Summarization

The field of automatic summarization is an NLP topic that has been given significant

attention over the past few decades. The topic, as the name suggests, deals with

finding algorithms to automatically summarize a text into a significantly shorter text.

As with any summary, the goal is to remove as much unnecessary information from

the input text as possible while not removing any necessary information so that the

shorter output text is still a good representation of the input text.

While there are a plethora of types of text that can be summarized, most of

the work in automatic summarization has been focused on summarizing online news

articles [5,10,27]. There are several reasons for this, including the availability of huge

databases of texts to work with. There are many websites that provide daily news

(and many new websites are created each year) so training corpora for this domain

are readily available. This is significant, because when starting a new area of study

in NLP (or any research for that matter), getting a working product of any kind is

of prime importance, and therefore, a good topic of study is one that provides many

existing examples, thus leading to a high probability for success in that topic.

In addition, news articles provide a good source for automatic summarization

because they are usually highly distinctive pieces of writing. One can be certain that an

article on a terrorism plot would sound nothing like an article on the latest basketball

game. Not only would most of the words be different between the two articles, but

entire phrases and paragraphs would probably be structured differently as well. While

this contrast would not completely hold between two articles of the same genre, in

many respects, even two articles on a terrorist plot would be vastly different. The

location of the plot, the names of the people involved, the techniques used, and many

other details would all be different between the articles. All these differences are very

significant when it comes to summarizing a text because a summary can be treated

as a list of things that are unique to a text. While that may not be obvious for any

text, it is certainly apparent for a news article which people read so that they know

what is new and different in the world. Disregarding the lack of appeal due to style,

would anyone really object to a news article that succinctly stated the changes in the

world (i.e., what is unique as compared to yesterday) as a list of attributes?

Another interesting aspect of news articles that appealed to NLP researchers was

the number of different articles about the same topic. Since all news websites (as-

suming they are the same type of news websites) write about the same incidents that

happened in the world, for every event, there is usually an overabundance of arti-

cles that look different, but talk about the same thing with only minor changes. This

poses a unique opportunity for automatic summarization, because it means that there

is an availability of many different input texts that can all technically produce the

same output text. Therefore, instead of just summarizing a single text, researchers

started working on multiple document summaries, which combined the input of mul-

tiple texts to produce a single output summary. Multiple documents mean more

information which hopefully leads to higher success rates for summarization.



When humans write a summary they basically use the same simple method. First

they read the target text, then ruminate on it for a while and try to understand

what the crux of the text is about. Finally, they reformulate their ideas in new

words that might be completely different from the words used in the target text,

but still retain the same information. Based on this chain of events, there are some

researchers who write programs that summarize texts using sentence abstraction.

When abstracting sentences, a computer first reads in the input text, then it processes

it by “understanding” what the text is about using different artificial intelligence

methods. Finally, the computer generates completely different grammatically correct

sentences that consist of the main ideas of the input text [31]. This method mimics

humans in the closest way possible, and should therefore theoretically produce the

most “human” summary, but it uses an overly complex method to achieve that goal

which can lead to errors. The task of getting a computer to understand a text is

a task that is many years ahead of what the artificial intelligence field is currently

capable of doing. In fact, if a computer would be able to understand a text in a

way that would allow it to produce an accurate summary of that text, then that

computer would be able to do many other NLP tasks such as question answering and

machine translation. Fortunately, it turns out that for most of the uses of automatic

summarization, perfect human-like summaries are not always necessary.

Because of the issues involved with generating sentence abstraction summaries,

most researchers rely on creating summaries using sentence extraction [31]. A sentence

extraction summary is produced by literally extracting the most relevant sentences

(and/or phrases) from the input text. While this method clearly is not focused on

producing a better resulting summary than a sentence abstraction summarizer would,

it often achieves summaries that are still useful for the task at hand. There are usually

two phases to creating extracts; the first involves determining which sentences are

relevant and the second involves post-processing of these sentences. Post-processing

is necessary because a sentence extraction is inevitably missing linking information

from the original text. For example, if a text is 30 lines long, a sentence extraction

summary of that text might contain only lines 1, 5, 15 and 21, encompassing the key

points of the text. However, the other lines of the original text, while not containing

the main ideas of the document, may contribute to the overall progression of ideas.

Line 21 might start with the words “He then said”, but without line 20 to specify

who “He” is, line 21 becomes almost meaningless. Therefore, the extracted sentences

are post-processed to make the summary flow well enough so that it makes sense as

a stand-alone text.

As for determining the sentences’ relevancies, current researchers are creating

new ways to do this, but there are a few “classic” algorithms that have become the

bread and butter of the automatic summarization field. They each use vastly different

methods, yet achieve very similar outcomes with varying success rates that are among

the highest in the field.

2.3.1 TF-IDF

Since the target documents for most automatic summarization systems were known to

be news articles, a good algorithm would exploit that knowledge as much as possible.

As stated earlier, one of the highlights of news articles is each article’s uniqueness of

words, so an algorithm that can harness that aspect by determining the rare words

in a document should be successful. To do this requires a system to tabulate, in

some way, the sentences that contain the most unusual words, and to consider these

sentences to be very relevant for a summary. However, the quality of being unusual is

not sufficient alone. In order to be a relevant unusual word, the word has to have some

significance within the document itself. Hence, a relevant sentence should be one that

contains unusual and distinct words that play an important role in the document as

a whole.

To accomplish this, two weights are calculated for each word in a document; the

word’s term-frequency (TF) and its inverse-document-frequency (IDF) [6]. The TF is

simply the number of times that any given word appears in the document as a whole.

All words (terms) in the document automatically receive a TF of at least 1, and they

get higher numbers as they appear more frequently. This weight is to account for the

significance that a word plays in the document as a whole, for if a word appears many

times in a document, it is probably a word that the document is trying to focus on.

If the document is trying to focus on that word, then a sentence containing the

word is most probably a sentence to focus on as well for a summary of the document.

However, there are two problems with just using the TF as a weight for deter-

mining sentence relevancy in a document. The first is that it does not exploit the

attribute of news articles that made them so desirable. Since each news article is so

unique when compared to the next, if a summary is supposed to say what makes this

article different enough to be worth reading, it should include things that are unique

to only this article. Without a weight contributing to that aspect, a summarizer may

end up generating very generic summaries of basketball articles, all talking a lot about dribbling

and shooting, but lacking many significant details. Secondly, the words that would

undoubtedly score the highest in the TF weights would be the words that are the

most common in the English language and not the ones that are the biggest focus in

the article. The word “the” is by far the most used word in any article, but it clearly

does not give any measure of the relevance of a sentence in a summary. Another

weight is needed to counteract the bias for common words in order to give the TF

weight more meaning.

The inverse-document-frequency weight both accounts for the rareness of words



and disregards words that are too common. The IDF is calculated by looking through

a corpus of articles and tabulating, for every word in a language, the number of

documents that a word appears in (or alternatively, the number of times that a

word appears throughout the entire corpus) and dividing by the total number of

documents. This weight is called the word’s document-frequency, and it is a measure

of how common a word is in the language. The inverse-document-frequency of a given

word, which serves as a measure of the rareness of the word, is then calculated by the

following rule:
\mathrm{IDF}(w) = -\log \frac{\mathrm{count}(w)}{N}
where N is the total number of documents in a corpus and count(w) is the number of

documents containing a given word, w. A word’s TF score and its IDF score are then

multiplied together to produce a total TF-IDF score that accounts for the two desired

relevancy aspects. Since a common word’s IDF would be extremely small (relative

to other words) it effectively negates the TF of those words when they appear in the

target document. On the other hand, rare words would have a relatively high IDF

leading to good relevancy scores. Even if a word appears only once in the target

document, if it is a rare enough word, it has a much higher combined TF-IDF score

than common words in the document.
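A minimal sketch of how these two weights might be combined to score sentences is given below; the function and variable names are invented for illustration, the corpus statistics are assumed to be precomputed, and this is not the exact scoring used by any particular system discussed in this thesis.

import math
from collections import Counter

def tfidf_sentence_scores(sentences, corpus_doc_counts, num_docs):
    # Term frequency is computed over the whole input document.
    words = [w.lower() for s in sentences for w in s.split()]
    tf = Counter(words)

    def idf(w):
        # corpus_doc_counts maps a word to the number of corpus documents that
        # contain it; unseen words are treated as appearing in one document.
        count = corpus_doc_counts.get(w, 1)
        return -math.log(count / num_docs)

    # A sentence's relevancy score is the sum of the TF-IDF weights of its words.
    return [sum(tf[w.lower()] * idf(w.lower()) for w in s.split())
            for s in sentences]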

The resultant algorithm which extracts sentences containing words with high TF-

IDF values has been tested extensively and produces accurate summaries of news

articles [26]. In terms of accuracy, it is quite difficult to produce a single score

that encompasses the “goodness” of a summary. Current metrics for automatic sum-

marization algorithms will be discussed in Section 2.4.



2.3.2 Structural Approach

Most summarization algorithms use TF-IDF weights as a starting point but also use

additional information to achieve better results. While the plain TF-IDF method

looks at sentences using a “bag of words” approach, meaning it looks at the sentences

as a combination of a set of words without any particular order or relation between

them, it is prudent to be able to use the inherent structure of a language for clues to

which sentences may be relevant. In addition, most articles have artificial structure

given to them by their author which gives additional hints for relevancy. Therefore,

some researchers have included such components as cues, titles, and locations in their

summarizers [12].

English has several words whose function is not to add much content but rather

to change the impression of other words. Such words as “significant,” “impossible,”

and “hardly” are dubbed “cues” and the presence of them in a sentence can add

relevance to or subtract relevance from the sentence. In order to effectively use this

method, a dictionary of “cues” must be manually compiled and each “cue” must

be labeled as positively, negatively or neutrally relevant. The weights from “cues”

are then integrated into the TF-IDF weights to produce a final sentence relevancy

weight [12].

Most news articles have author given titles for the entire article and also for

individual sections. Since titles are usually one line summaries of the piece they are

titles for, it is clear that the words in a title would definitely be relevant for an entire-

document summary. Therefore, by giving words that appear in titles higher TF-IDF

scores than words that do not, more relevant sentences can be chosen [12].

The locations of sentences in their paragraphs also give a clue as to the relevancy

of sentences. Firstly, this means that sentences that occur immediately after titles

are most likely important. Secondly, it has been shown that topic sentences are more

likely to occur either at the very beginning or at the very end of an article. Again,

using these clues, TF-IDF weights of certain sentences can be modified to produce

more relevant sentences for a document summary [12].
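A rough sketch of how such structural signals might be folded into a base TF-IDF sentence score is shown below; the cue dictionary, the bonus values, and the function name are invented for illustration and are not taken from [12].

# Hypothetical cue dictionary: positive cues add relevance, negative cues subtract it.
CUE_WEIGHTS = {"significant": 1.0, "impossible": 1.0, "hardly": -1.0}

def adjust_score(base_score, sentence, title_words, position, num_sentences):
    words = set(sentence.lower().split())
    score = base_score
    score += sum(CUE_WEIGHTS.get(w, 0.0) for w in words)  # cue words
    score += 2.0 * len(words & title_words)               # overlap with title words
    if position == 0 or position == num_sentences - 1:    # topic sentences tend to appear
        score += 1.0                                      # at the beginning or the end
    return score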

2.3.3 Multiple Document Summarization

As stated earlier, one of the benefits of summarizing news articles is that there are

usually many articles online that are written about the same incident. This leads

to the possibility of multiple document summarization (MDS), which merges many

documents into a single summary [15]. While the positive side to this is that multiple

documents allow for a more robust summary with more information, using multiple

documents also gives rise to complications that are not present when summarizing

single documents.

The first problem with MDS is that with multiple documents, in addition to sen-

tences being relevant, entire documents can actually have more relevant information

than others. Because of this, high single document TF-IDF scores may not be as

important if they belong to a fairly irrelevant article. One method used to account

for this is to compute the TF scores for the words in a document based on the frequency

with which they appear across all of the documents in the set being summarized. This

way, a word will only have a high TF score if it is common to many of the documents

used for the MDS. Also, instead of summing up the TF-IDF scores of all of the words

in a sentence, only the scores of a limited number of words, called the centroid of that

group of documents, are considered for determining relevant sentences. The words

that go into the centroid are words that have a TF-IDF score above some minimum

threshold [30].

Further complications include the redundancy of sentences between documents

and the ordering of sentences once all the relevant sentences have been chosen [2].

Redundancy is usually taken care of by using a cosine similarity metric between

the chosen sentences. If the two sentences produce a number that is higher than a

predetermined threshold, that means that the two sentences contain too much of the

same information, and the one with the lower TF-IDF score is usually discarded. A

cosine similarity metric is given by the following equation:

\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}

A and B are two n-dimensional vectors of attributes representing the two sentences

to be compared. The attributes vary and can be values such as binary values for the

existence of words or TF-IDF values of the words. θ is the angle between these vectors

in n-dimensional space.
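A minimal sketch of this computation, and of how it might be used for redundancy filtering, follows; the example vectors and the 0.7 threshold are illustrative assumptions rather than values taken from the systems cited above.

import math

def cosine_similarity(a, b):
    # a and b are equal-length attribute vectors representing two sentences
    # (e.g., binary word-occurrence values or TF-IDF weights).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sim = cosine_similarity([1, 0, 1, 1], [1, 1, 1, 0])
print(sim)  # if this exceeds a chosen threshold (e.g., 0.7), the sentence with the
            # lower TF-IDF score would be discarded as redundant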

Sentence ordering is much harder to deal with than redundancy in general. One

study showed that when humans were asked to order sentences they each gave com-

pletely different orderings; despite this, most of the human orderings seemed to be

valid and logical orderings [2]. Therefore, the authors of the study concluded that

an exact ordering is not necessary but rather an ordering just has to be acceptable

to people. They proposed grouping sentences into general topics and then trying to

order the topics using timestamps from the original documents.

2.4 Evaluating Summaries

The evaluation of computer-generated summaries is a problem that has plagued the

field of automatic summarization, and to which there is currently no great solution.

The root of this problem is that unlike most NLP applications, there is no absolute

gold standard for an automatic summarizer. A POS tagger, for example, has clearly

correct tags that are defined for each word in any given sentence (although even this is

debatable, but for the purposes of this thesis it is a reasonable assumption). Therefore,

in order to evaluate the accuracy of a POS tagger, all one has to do is compare the

output of the tagger to the known-to-be-correct output. However, for summaries in

general, even human-written summaries, there is no single correct answer. Two people

can summarize the same text and wind up with completely different summaries in

terms of words and style, yet both summaries would be “correct” summaries in the

sense that they accurately depict a shorter version of the target text. Hence, there can

never be an absolute summary to which to compare a computer-generated summary.

Because of this shortcoming, there is no simple evaluation technique that can be used

for automatic summarizers, and non-optimal techniques must be used instead.

One of the evaluation techniques that has become accepted by the National In-

stitute of Standards and Technology (NIST) annual Document Understanding Con-

ferences (DUC) is the Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

metric [22]. The ROUGE metric is based on a program that calculates and compares

combinations of words, called n-grams. N-grams are commonly used in NLP to refer

to groups of words, with size “n”, as a single unit. Tri-grams, or 3-grams, group

together 3 words as one, whereas bi-grams, or 2-grams, group together 2 words, and

unigrams denote single words. ROUGE compares two texts together and calculates a

score based on the number of n-grams that the two texts have in common. The theory

is that if two texts are valid summaries of the same document, then they will both

contain many of the same key phrases. There are several different ROUGE metrics, each calcu-

lating different types of n-grams and permutations of n-grams. The specific metrics

used for the NIST DUC are ROUGE-1, ROUGE-2 and ROUGE-SU4. ROUGE-1 cal-

culates unigram matches between documents, ROUGE-2 includes bi-gram matches,

and ROUGE-SU4 allows for bi-gram matches that are distanced by up to 4 words.

These metrics specifically are used because they have been shown to have the highest

correlation to what humans evaluate as good summaries [23].
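As an illustration of the underlying idea, the following sketch computes a simplified ROUGE-1 recall score from unigram overlap; the official ROUGE toolkit adds options such as stemming and stopword removal, so this is only an approximation for explanatory purposes.

from collections import Counter

def rouge_1_recall(candidate, reference):
    # Count unigrams in each text.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a reference word is credited at most as many times
    # as it also appears in the candidate.
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge_1_recall("the dog ate the food", "the dog ate its dinner"))  # 0.6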


Chapter 3

Related Work

Automatic summarization techniques have been given significant attention in the

NLP field, and most of that attention has gone to summarizing news articles. There

are many techniques to choose from, but the problem is that most of these techniques

are domain specific in that they work best with news articles, and not as well with

other genres. Other domains have been considered, including online discussions on

blogs [42], movie reviews [43], and even email threads [41]. However, because of the

disparity of the functions of these texts, the algorithms developed for news articles

have not produced good results when applied to them. In addition, almost all of

the texts studied for automatic summarization have been relatively short texts that

contain just a few key points with little extraneous material; techniques have not been

developed to sift out small amounts of relevant information from very large texts.

There are a few exceptions to all of this, one being the work that was done by

Mihalcea and Ceylan on automatic summarizations of entire narrative books [28].

Mihalcea and Ceylan understood the differences that longer texts would require and

acted accordingly. They started with an existing algorithm, called MEAD [30], which

is a centroid based approach, mostly used for multiple documents, but applied in


this case to single books. They made several modifications to the MEAD algorithm,

achieving increasingly better results for their system. They scored their system using

the ROUGE metric, comparing their generated summaries to human-written sum-

maries from online websites (specifically, gradesaver.com and cliffsnotes.com).

Their first modification was actually the removal of a feature used by existing automatic

summarization systems. Studies have shown that for news articles the lead

sentences of paragraphs are extremely important for the overall summaries [12, 25].

In short documents this would make sense, since each paragraph is usually making

a new point that is pertinent to the focus of the document. However, as Mihalcea

and Ceylan showed, if you do not consider the weight of lead sentences and strictly

consider TF-IDF scores, the overall summary achieves better results for full books.

The reason for these results is that different authors might have different ways

of stressing their points that do not include relying heavily on lead sentences. A

second reason is that longer documents do not necessarily have focal points in each

paragraph due to style and topic changes among chapters.

This second reason gives rise to the idea of segmenting a larger book into shorter

individual “documents” that each have their own TF-IDF weights. They therefore

divided each book into 15 different segments using a graph-based segmentation al-

gorithm, and then applied their algorithm to each segment individually. The final

summary was chosen by taking the highest ranked sentence from each segment, start-

ing with the first, then the second and so on until a preset word limit was reached.

In addition to running separate summaries on each segment, they also calculated

separate and augmented TF-IDF scores for each segment. They added two more

factors into the weight: STF, the segmentation term frequency; and ISF, the inverse

segmentation frequency. The final TF-STF-IDF-ISF scores combined with the other

two modifications led to a 30% error reduction rate compared to using just the MEAD

algorithm for full length books.
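
The exact weighting formula is not reproduced here; the following Python sketch simply multiplies the four factors, with add-one smoothing inside the logarithms, so the combination rule, the smoothing and the function name are illustrative assumptions rather than the method of [28].

    import math
    from collections import Counter

    def tf_stf_idf_isf(term, segment, book_segments, corpus_docs):
        # Hypothetical combined weight in which the four factors are multiplied;
        # the actual combination used by Mihalcea and Ceylan may differ.
        #   segment       -- list of tokens for the current segment
        #   book_segments -- list of token lists, one per segment of the book
        #   corpus_docs   -- list of token lists for a background corpus
        tf = Counter(w for seg in book_segments for w in seg)[term]   # frequency in the whole book
        stf = Counter(segment)[term]                                  # frequency in this segment
        df = sum(1 for doc in corpus_docs if term in doc)
        idf = math.log((len(corpus_docs) + 1) / (df + 1))             # smoothed inverse document frequency
        sf = sum(1 for seg in book_segments if term in seg)
        isf = math.log((len(book_segments) + 1) / (sf + 1))           # smoothed inverse segment frequency
        return tf * stf * idf * isf

    # Toy example with a two-segment "book" and a tiny background corpus:
    book = [["utterson", "visited", "jekyll"], ["jekyll", "hid", "from", "utterson"]]
    corpus = [["news", "about", "jekyll"], ["weather", "report"]]
    print(tf_stf_idf_isf("visited", book[0], book, corpus))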

Additional work has been performed on automatic narrative summarization by

Kazantseva and Szpakowicz [19]. Instead of trying to summarize an entire book’s

plot, they just looked at short stories, and their goal was to provide the background

of a story, while trying to not give away any plot details. Instead of using tradi-

tional automatic summarization methods, they looked for patterns in the text and

extracted relevant sentences that fit into their “background information” prototype.

By comparing their system to traditional techniques and achieving better results, they

showed that pattern matching may prove to be more suited to the non-linear nature

of narratives. Their work, however, is clearly unsuited for producing a generalized

summary of a long narrative.

3.1 Plot Units

In 1982, Wendy Lehnert devised a theory on narrative summarization that was used

by Kazantseva and Szpakowicz and many others as the definitive source for plot

summarization [21]. Previous work consisted of “story grammars” which tried to

generalize events from stories into different categories [33, 36, 38]. A story grammar

would include categories for elements such as characters, setting, conflict and resolu-

tion. A computer summarizer would try to extract all of these elements from a story,

and if done successfully, would produce a logical structure from which the computer

could “understand” the story. The failure of story grammars was that they could

not possibly incorporate every structure created by the human mind. The human

mind is able to understand any story, and abstract it in such a way that no matter

how unintuitive and irregular the story, the abstract summary makes sense. This

technique of the mind inevitably creates different structures depending on the type

of story and possibly even the way that the mind understood it. Given that there can be limitless structures from which the mind can mold a summary, there is no way that a top-down approach such as one using story grammars could work.

Lehnert sought to create a bottom-up approach for realizing story structures [21].

She postulates the existence of mental states associated with events and characters, the value

of these states being positive, negative or neutral. Next she describes each relation

between these states as either being a motivation, actualization, termination or equiv-

alence (meaning that nothing changes). Using only these states and links she shows

that any story structure can be broken down into these basic parts that she calls plot

units. It is no longer a problem to try to fit an abstract human summary to her story

structure because her basic building blocks of positive, negative or neutral events can

fit with any summary as it is a truism that any event or mental state can be defined

as positive, negative or neutral.
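
One way to picture these building blocks is as a small data structure; the sketch below, in Python, uses illustrative names, and the particular link chosen in the example is only one plausible reading of Lehnert's formalism, not a claim about her exact notation.

    from dataclasses import dataclass
    from enum import Enum

    class Affect(Enum):
        # The three possible values of a state.
        POSITIVE = "positive"
        NEGATIVE = "negative"
        NEUTRAL = "neutral"

    class Link(Enum):
        # The four relations that may connect two states.
        MOTIVATION = "motivation"
        ACTUALIZATION = "actualization"
        TERMINATION = "termination"
        EQUIVALENCE = "equivalence"

    @dataclass
    class State:
        character: str
        description: str
        affect: Affect

    @dataclass
    class PlotUnit:
        source: State
        target: State
        link: Link

    # "Your proposal of marriage is declined" as a two-state unit; reading the
    # link as an intention actualized into a negative outcome is illustrative.
    intention = State("you", "want to marry her", Affect.NEUTRAL)
    outcome = State("you", "proposal declined", Affect.NEGATIVE)
    declined_proposal = PlotUnit(intention, outcome, Link.ACTUALIZATION)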

Examples of primitive plot units that Lehnert gives are: “You need a car so you

steal one”, “Your proposal of marriage is declined”, or “The woman you love leaves

you.” Because of the nature of the character-event relationship in plot units, most

events are either preexisting or are caused by one of the characters. The events occur-

ring in these example plot units are “you steal a car”, “girlfriend declines marriage”,

and “lover leaves you” respectively. The mental states of the characters are equally important to the meaning of the narrative, but do not in themselves move the narrative forward. It is the stringing together of these event actions that composes the plot, and these actions are really what is needed to create a complete summary. Currently, however, there is

no complete automatic summarization system that is able to incorporate plot units

into its algorithm to create a perfect summarizer.


Chapter 4

Project Description

4.1 Problems With Previous Work

The best method so far for summarizing long narratives is that of Mihalcea and

Ceylan using modified TF-IDF scores [28]. However, the method has a few pitfalls

that cause it to fall short of producing summaries that come close to human-written

summaries. The first and foremost is that it is heavily based on TF-IDF which was

not created for specific use with narratives. TF-IDF relies heavily on the notion

that important sentences and phrases will contain words that are rare yet repeatedly

brought up in the text. This idea makes sense for texts such as news articles which

have a focused goal in mind and generally try to minimize wasted words. However,

both of these premises are false when it comes to narratives.

Narratives, even short sections of narratives, don't generally focus on a single topic.

There is a principle in writing called “Show, don’t tell” that is found throughout

common literature [35]. “Show, don’t tell” states that a writer should show the reader

what is happening through character action, words, etc., instead of bluntly telling the

reader through description. Therefore, narratives may never use “important” topic


words at all as they dance around topics with unnecessary prose and literary tools,

wanting the reader to gain the information only through inference.

With the extra prose come extra words. It is not uncommon for a narrative to

have several paragraphs whose sole purpose is to describe scenery or settings. These

descriptions often use flowery language, including words that are very rarely seen, and

therefore have a high IDF weight. In fact, the rarer a word is the better, as uniqueness

catches the reader’s attention. With all of an author’s techniques and tools that he

uses to make a book interesting, the TF-IDF algorithm becomes useless.

Work has been done that shows that pattern matching may be more suited for the

summarization of narratives than statistical analysis, due to the non-linear nature of

narratives [19]. It is suitable, then, to look for an algorithm that “understands” the

nature of narratives and tries to logically deduce the sentences that are important for

a summary.

4.2 Creating a Narrative Summarizer Algorithm

The author of this thesis has created an algorithm dubbed the Named Entity Verb-

focused Automatic summarizer, or NEVA, to work well specifically for summarizing

narratives. The goal was to create a good algorithm for summarizing narratives using

an accepted theory of how narratives are formulated; the concept of Lehnert's plot units was the perfect basis around which to formulate the algorithm. At the core of each of Lehnert's plot units is the idea that character-event relations are what make

a narrative progress. Therefore, a good narrative summary is one that incorporates

all of the character-event units somehow. One way of doing this is to extract all

of the actions that each character performs throughout the narrative. A list of the

character actions, at least an abstract one, would be a valid representation of the plot

units. In NLP terms (and because abstract lists are extremely hard to generate), this

translates to listing character named entities, along with their associated verbs, and

the object that each verb is acting upon. Also, while it may be true that a given

character-verb pair is implicitly or even passively stated, skipping these verbs does

not pose a major problem since this is not a common occurrence and the surrounding

active verbs usually provide adequate information to fill in the plot. Moreover, even

when passive verbs do occur, the related active verbs are generally more important.

For example, if in a narrative an apple goes missing, it is possible for a sentence to

be “The apple was gone,” a sentence that would not be picked out by this type of

summarizer. However, it is almost inevitable that there would be active verbs leading

up to the state of the apple being gone such as “John ate the apple,” or “Jane looked

everywhere for the apple.”

This would be fine if the goal was to create a comprehensive listing of everything

that happens in the book. The effect of this would be to take out most of the literary

sugar, that which is unnecessary to the plot and provides descriptions or feelings;

while these may be important in literary terms to some books, they almost never

contribute to the plot of the book. This would not be a summary, though, as a

comprehensive listing can be almost as long as the original book. What is needed

is a system to limit the character-verb interactions that are chosen for a summary.

Like TF-IDF summaries, the best way would be to take the highest scoring sentences

ranked in some way that is pertinent to the type of summary that is desired.

Although narratives do not usually follow a cookie cutter template, there are a

few general rules that can be used to indicate that a sentence is more important than

others with regard to advancing the plot, and therefore worthy of additional points

(more details of NEVA’s point system are described in Section 4.3.2). It should be

noted that in addition to these rules being based on logic, the rules all came about

empirically using data from human-written summaries. The first of these rules is that

sentences that emphasize an important or plot-advancing action will often start with

the character who is performing that action. The first reason this would be true is

that if the character is the first word of a sentence, then that character is almost

always part of the subject of the sentence, and likely the topic of the sentence; this

type of sentence becomes more valuable for the plot compared to a sentence in which

a character and his actions are part of a predicate. There may be times, however, when a sentence starts with a preposition or even adjectives that do not preclude

the character from being the sentence’s subject even though that character is not the

first word of the sentence. This is not so much of an issue because those sentences as

a whole are generally less focused on the character-action, and, either way, they do

not take away from the definitive import granted when a character is the first word.

The second rule is that a sentence is more important when it deals with more than

one character, relating the two in some way. This rule is inspired by Lehnert’s plot

units. As Lehnert proposed, plot units deal with individual character-action relation-

ships [21], but there are also plot units that deal with character-character relationships

as these are necessary for most problems and resolutions in plots. Therefore, if a sen-

tence contains characters in addition to the character who is performing the action, it

is given additional points.

The last rule is that sentences that include dialogue get additional points. While

containing dialogue may not inherently contribute to a sentence’s importance for a

summary, empirical evidence suggested that many conversations in narratives contain

valuable plot information even if they do not fall under the first two rules. The reason

for this is simply that dialogue is a natural way to progress a narrative, and while

it may not advance the plot with regards to physical actions, dialogue does advance

the plot with regards to inter-character relations. Through dialogue, for example,

one character may make another character angry, which provides the much needed

motivation for possible later actions. It should be noted that dialogue sentences still

have to contain a character-verb pair in order to be considered for the summary

(even if the pair is only “he said”, it usually points to the start or end of a character’s

speech, from which one can often infer the overall idea of the speech).
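
As a rough illustration, the three rules can be expressed as simple predicates over a sentence. The helper names and the string-matching heuristics below are hypothetical simplifications written in Python; NEVA itself operates on parsed, pronoun-resolved text rather than raw strings.

    def starts_with_character(sentence, characters):
        # Rule 1: the first word of the sentence is a known character name.
        words = sentence.split()
        return bool(words) and words[0].strip('\'"“”') in characters

    def mentions_multiple_characters(sentence, characters):
        # Rule 2: more than one distinct character is mentioned in the sentence.
        return sum(1 for name in characters if name in sentence) > 1

    def contains_dialogue(sentence, characters=None):
        # Rule 3: the sentence contains quoted speech (straight or curly quotes).
        # The unused second argument keeps the signature uniform with the others.
        return any(q in sentence for q in ('"', '“', '”'))

Keeping the rules as independent predicates makes it straightforward to award a point for each one separately, as described in Section 4.3.2.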

4.3 Implementation

Before NEVA can be applied to a document, the text has to be passed through

several other programs to format it and label it with the necessary data. These steps

are as follows:

1. The original text is processed by a named entity recognizer, written in Python,

using Python’s Natural Language Tool Kit (NLTK) [3].

2. The Python output is processed by a self-written Perl script to extract all the

named entities in the text that fall into the category of “Person” or “Organiza-

tion”.

3. The original text is processed by a self-written Perl script to preprocess all texts

and convert them to a standard format.

4. The preprocessed text is processed by the GPoSTTL v0.9.3 POS tagger [17].

5. The output of the POS tagger is processed by a self-written Perl script to

reformat it for the next step.

6. The formatted text is processed by a Collins chart parser [9].

7. The output of the Collins parser is processed by a provided clean-up script

written in Perl.

8. The cleaned-up output from the Collins parser is processed by a self-written

C++ implementation of the Hobbs pronoun resolution algorithm [16].

The output of the pronoun resolution is the original text with all of the pronouns

replaced with their antecedents. After these steps, the list of named entities, the

cleaned-up Collins parser text, and the pronoun resolved original text are processed

by a Perl script that executes the NEVA algorithm.
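
As an illustration of steps 1 and 2, the following Python sketch uses NLTK's standard tokenizer, part-of-speech tagger and named entity chunker to collect PERSON and ORGANIZATION entities. It is not the actual scripts used in the pipeline, only an approximation of their combined effect.

    import nltk
    # Requires the NLTK data packages punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker and words (installed via nltk.download).

    def character_entities(text):
        # Collect named entities labeled PERSON or ORGANIZATION, approximating
        # the effect of steps 1 and 2 of the pipeline.
        entities = set()
        for sent in nltk.sent_tokenize(text):
            tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
            for subtree in tree.subtrees():
                if subtree.label() in ("PERSON", "ORGANIZATION"):
                    entities.add(" ".join(word for word, tag in subtree.leaves()))
        return entities

    print(character_entities("Mr. Utterson found his way to Dr. Jekyll's door, "
                             "where he was at once admitted by Poole."))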

4.3.1 Choice of Resources

The Collins parser was chosen as the chart parser because it is known to have a

high accuracy and it is available freely online [8, 9]. The GPoSTTL v0.9.3 POS

tagger was chosen because the Collins parser requires input from a Brill tagger, and the

GPoSTTL is a Brill tagger with high accuracy that is available freely online [17]. The

Hobbs algorithm was used because of its high accuracy specifically when dealing with

fictional works [37].

The pronoun resolution algorithm has been modified to be better suited for use specifically with an automatic summarizer. The first modification is that instead

of looking for antecedents for every pronoun, only pronouns that might be referring

to people or organizations (characters that can perform actions) are resolved. The

classifications of people and organizations were labeled using the named entity rec-

ognizer. The reason for this modification is that no algorithm is 100% accurate, so

an incorrect resolution that replaces the pronoun “it” with “John” can hurt the final

summarization system. Following the same logic, to ensure that a pronoun such as

“he” is never resolved to an inanimate object, the program has been modified so that

only named entities that fell into the category of “person” or “organization” are used

as possible antecedents. With both of these modifications, the pronoun resolution



program has a perceived increase in accuracy; no rigorous tests have been conducted

to prove this, but the initial data seemed to favor this hypothesis heavily.
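
The effect of the two modifications can be pictured as a pair of filters applied around the Hobbs search. The sketch below is a hypothetical simplification in Python, not the actual modified C++ implementation; the function names are invented for illustration.

    PERSONAL_PRONOUNS = {"he", "him", "his", "she", "her", "hers",
                         "they", "them", "their", "theirs"}

    def should_resolve(pronoun):
        # Modification 1: only attempt resolution for pronouns that can refer to
        # people or organizations; "it", for example, is left untouched.
        return pronoun.lower() in PERSONAL_PRONOUNS

    def candidate_antecedents(proposed_noun_phrases, characters):
        # Modification 2: restrict the antecedents proposed by the Hobbs search to
        # named entities that the recognizer labeled PERSON or ORGANIZATION.
        return [np for np in proposed_noun_phrases if np in characters]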

4.3.2 Point System

Once the initial steps have been completed, the NEVA algorithm, implemented as

a rule based point system, is applied to every sentence. Any sentence that has at

least one character-verb pair is automatically given one point. For each additional

rule that the sentence follows, it is given an additional point. After all of the points

are distributed, the sentences are chosen to be part of the final summary in order

from the highest ranked sentences to the lowest, arranging selected sentences in the

same order that they appeared in the original text. If, after choosing any sentence,

a predetermined character limit is reached, no more sentences are chosen and the

summary is complete.
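
Putting the pieces together, the scoring and selection stage might look like the following sketch, written in Python rather than the Perl actually used. The character-verb check is a crude stand-in for the parse-based test, and the rules are passed in as functions, for instance the three predicates sketched in Section 4.2.

    def has_character_verb_pair(sentence, characters):
        # Crude stand-in: the real system checks the parse for a character entity
        # paired with a verb; here we only require that a character name appears.
        return any(name in sentence for name in characters)

    def score_sentence(sentence, characters, rules):
        # One point for containing a character-verb pair at all, plus one
        # additional point for every rule that the sentence satisfies.
        if not has_character_verb_pair(sentence, characters):
            return 0
        return 1 + sum(1 for rule in rules if rule(sentence, characters))

    def build_summary(sentences, characters, rules, char_limit):
        # Rank sentences by score, highest first; keep taking sentences until the
        # character limit is reached; then restore the original text order.
        scored = [(score_sentence(s, characters, rules), i, s)
                  for i, s in enumerate(sentences)]
        ranked = sorted((t for t in scored if t[0] > 0),
                        key=lambda t: t[0], reverse=True)
        chosen, length = [], 0
        for score, index, sentence in ranked:
            chosen.append((index, sentence))
            length += len(sentence)
            if length >= char_limit:
                break
        return " ".join(s for _, s in sorted(chosen))

Because the rules are independent of the selection loop, adding, removing or reweighting a rule does not change how the final summary is assembled.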
Chapter 5

Analysis of NEVA

5.1 ROUGE

Analyzing automatically generated summaries becomes additionally difficult when

dealing with plot summaries. While the ROUGE metric has been shown to work well

for news article summaries, there is no indication that there is a similar correlation

for plot summaries. The reason to differentiate the evaluation method for the two

types of summaries is the same reason why different algorithms are needed for the two

domains. News articles give major significance to rare words, and therefore TF-IDF

values and n-grams can play a role in those articles’ classification. This is not as true

in plot summaries, for reasons explained in previous sections.

Despite this, when Mihalcea and Ceylan did their work on automatic plot sum-

maries, they used the ROUGE metric to rate their summaries [28]. As with the rest of their work, they did not recognize all of the reasons to differentiate between news articles and novels; rather, they just dealt with the issue of novels being exceedingly long by comparison. Be that as it may, they were in some way justified in using ROUGE, as the ROUGE metric is the only current metric that has both been shown


to be accurate and is accepted by the NLP community for automatic summaries in

general. Therefore, to evaluate NEVA, the ROUGE metric as well as volunteer-based human analysis has been used. The ROUGE metrics used to evaluate the summaries were ROUGE-1, ROUGE-2, and ROUGE-SU4, chosen because these are the metrics

most accepted and used in the NIST Document Understanding Conferences [22].

Following Mihalcea and Ceylan’s lead, the automatically generated summaries

were tested against human-written summaries found online; these were taken from

sparknotes.com and gradesaver.com, chosen because they had similar length sum-

maries for the desired books. However, because of the constraint of using human

analysis as well, it was impractical to use the large corpus of books that Mihalcea

and Ceylan used. Instead, only select chapters from three different books were chosen

for analysis. The books used were Dr. Jekyll and Mr. Hyde by Robert Louis Steven-

son, The Awakening by Kate Chopin, and The Ambassadors by Henry James. For

each of these books, a random section of the narrative was chosen to be summarized,

and then additional consecutive sections were added until the corresponding online

summary reached approximately one page in length (this was approximately 4000

characters).

5.2 Human Evaluations

For the human evaluations, each of the three books’ sections has been analyzed by

seven different volunteers. Since human grading can be capricious and biased, each

person was given multiple summaries to grade in a blind experiment; that is, the

sources of the summaries were not revealed to the volunteers.

For each of the three books, five summaries were compiled. Two of these sum-

maries came from human-written online sources, one came from NEVA, one was a

baseline summary, and one was generated from a generic automatic summarizer called

MEAD. The baseline summary was a sentence extracted summary of evenly spaced

sentences from the target text; while this may not be the worst summary possible

(and hence, not a true baseline), it is a good reference of what any reasonable sum-

marizer should be able to beat. MEAD is an automatic summarizer available freely

online that uses a centroid based TF-IDF approach [30]. MEAD was the base sum-

marizer that Mihalcea and Ceylan used for their system, and therefore it is a suitable

“opponent” for NEVA [28].

Each volunteer was first presented with one of the online summaries and was

told to treat that summary as the “Gold Standard”, an ideal summary. The online

summary that was chosen to be the gold standard was the one that was further away

from 10% of the original text (10% being a standard value for summary to target

text ratio [11]); for example, if one summary contained 13% of the original text and

the other contained 15%, the 15% summary was chosen as the gold standard. The other

online summary was given to each volunteer, along with the three computer-generated

summaries for that text; this online summary was used as an upper bound on the

score that a computer-generated summary can hope to achieve. The volunteers were

not told about the sources of these four summaries.

5.2.1 Rating System

The volunteers were asked to rate each summary in three different categories using

a scale of one to ten, ten being the best. The three categories were content, flow,

and overall style. Although the categories overlap slightly (especially flow and overall

style), not squeezing the ratings into a single score gave the volunteers the necessary

freedom to truly rate different systems that might be better at different things. The

assumption was that some of the summaries might score higher in one category but

lower in the others. The volunteers were given definitions of the three categories, but

were told to interpret them in any way that they wanted. The reason this does not

matter too much is that as long as each volunteer used the same metric to rate every

summary he was given, the scores can still be scaled and considered meaningful. The

definitions given were:

1. Content: How much of the narrative that has been described in the Gold Stan-

dard is included in the summary?

2. Flow: Do the sentences in the summary flow well from one to the next? Is it

easy to follow the summary’s progression?

3. Overall Style: How much does the text sound like it is a well written summary

of the given chapters? Does it sound good? Does reading this summary feel as

comfortable as reading the Gold Standard?


Chapter 6

Results and Discussion

The scores for each of the three categories were calculated as the average score given

by the seven volunteers for each book; the original scores are presented in Appendix B

for reference. The results for Dr. Jekyll and Mr. Hyde, The Awakening, and The Ambassadors are shown in Figure 6.1, Figure 6.2, and Figure 6.3 respectively. As the summaries were evaluated by humans on a very subjective and arbitrary basis, exact scores do not have much meaning; therefore, only overall trends will be considered.

Figure 6.1 The human evaluations for Dr. Jekyll and Mr. Hyde.

Figure 6.2 The human evaluations for The Awakening.

Based on the graphs, it can be seen that the human-written summaries were

evaluated the highest in all of the categories for all of the books as expected. In

second place in every score is the MEAD summary, but across the board its score

is only about half that of the human-written summary. Third and fourth place are

much closer, and trends are divided by book. For Dr. Jekyll and Mr. Hyde, NEVA

and the baseline scored almost equal to MEAD on content; however, for both flow

and style, NEVA scored much closer to MEAD trailing by only about a point, while

the baseline trailed by about three points. For The Awakening, the positions are

almost reversed as the baseline beat NEVA in flow and style, and while they scored

similarly in content, both trailed MEAD by about one point. For The Ambassadors,

both NEVA and the baseline scored similarly to each other with NEVA doing slightly

better in all categories, while both trailed MEAD by about one point in the content

category, and fractions of a point in the other categories.

Figure 6.3 The human evaluations for The Ambassadors.

The ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics were applied to each of the summaries. The resulting scores for Dr. Jekyll and Mr. Hyde, The Awakening, and The Ambassadors are shown in Figure 6.4, Figure 6.5, and Figure 6.6 respectively. As

seen from the figures, the human-written summary has all of the highest scores for all

books according to all three metrics, which is to be expected since the human-written

summary is being used as an upper bound for the scores of the computer-generated

summaries. Looking at all of the other scores, with one exception, the second best

score is NEVA followed by the baseline followed by MEAD (the one exception is a tie

between MEAD and the baseline for last place).

Figure 6.4 The ROUGE scores for Dr. Jekyll and Mr. Hyde.

As opposed to the human evaluations, which are subjective, the ROUGE scores are calculated, and therefore precise relative scores become meaningful. Since the human-written summary is being used as an upper bound and ROUGE scores have much more significance when viewed relative to other ROUGE scores, it is useful to consider how well each summary did compared to the human-written summary. For each ROUGE metric applied, the ratios of the automatic summary ROUGE scores to the human-written summary ROUGE scores have been calculated and are presented as percentages in Figure 6.7. From Figure 6.7 it can be seen that not only did NEVA

beat the other two computer-generated summaries, but it performed exceedingly well,

achieving more than 45% of the score of the human-written summary in six out of the

nine cases. Additionally, when the ROUGE-2 metric was applied to The Awakening,

NEVA scored an amazing 85.2% of what the human-written summary scored. Note

that MEAD and the baseline respectively scored 28.5% and 52.6% of the human-

written summary according to the same metric; if NEVA’s 85.2% score was only so

high because the human-written summary did so poorly on that metric, then it would

be expected that MEAD and the baseline would have scored closer to NEVA than

they did.

MEAD scored low with every metric, performing 5.0% to 11.8% worse than the

baseline in most of the cases (when both were compared to the human-written sum-

mary). This shows that not only is MEAD a poor summarizer for this domain, but it is worse than a simple baseline of evenly spaced sentences according to the ROUGE metric.

Figure 6.5 The ROUGE scores for The Awakening.

These results are interesting because, for the human evaluations, MEAD performed consistently better than both NEVA and the baseline. Also, NEVA seemed to

perform comparably to the baseline according to human evaluators, contrary to the relatively high scores it achieved with ROUGE. This information convinced the au-

thor of this thesis to perform further analysis of the computer-generated summaries

to determine why there is such a discrepancy between evaluations. Therefore, the

author performed a manual analysis to determine how much content was similar be-

tween the human-written and computer-generated summaries. It is important to note

that this analysis only took into account content, and not flow or style.

To demonstrate the outcome of this analysis, the texts of the gold standard,

NEVA and MEAD’s summaries for the analyzed chapters from Dr. Jekyll and Mr.

Hyde have been provided in Appendix C. The author has annotated the texts to

highlight certain points about the content contained in the summaries. A red phrase

in NEVA’s summary denotes that there is a parallel phrase containing similar content

in the gold standard; those parallel phrases are marked with the same number as a

superscript at the end of the phrase.

Figure 6.6 The ROUGE scores for The Ambassadors.

Similarly, a green phrase in MEAD's summary denotes that there is a parallel phrase in the gold standard. The gold standard has

red phrases and green phrases that denote that those phrases are the parallel phrases

that occurred in either NEVA’s or MEAD’s summary respectively. Blue phrases in the

gold standard denote that there were parallel phrases to this phrase in both NEVA’s

and MEAD’s summary. Only in blue phrases does a superscript number appear in

all three summaries. A regular black phrase in NEVA’s or MEAD’s summary means

that there is no clear parallel in the gold standard.

Looking at the colors of NEVA’s summary, it can be seen that most of the text is

“checkerboarded” black and red more or less. This means that about half of what was

produced by NEVA was relevant to the summary. What this does not tell us is if the

lack of content in NEVA’s summary was due to the length of the summary or not.

As seen from counting the superscripts, NEVA’s summary has thirteen individual

phrases that are relevant, in other words, being parallel to content from the gold

standard.

Figure 6.7 The ratios of ROUGE scores for all summaries to ROUGE scores for human-written summaries.

MEAD’s summary does not look as colorful, containing only one line of green at the beginning and a chunk of green in the middle. Even if that chunk were to be

dispersed over the entire text, the summary would still be mostly black, i.e., mostly

irrelevant information. Also, compared to NEVA’s thirteen relevant phrases, MEAD

only produced four relevant phrases. What this tells us is that MEAD does not isolate

content well. The fact that MEAD generated relevant content for only a small chunk

of the text makes it seem that that “good” content may have been a statistical fluke;

MEAD might not actually be good at summarizing narratives, but it is bound to get

lucky sometimes.

The last text to look at is the gold standard summary. Almost unsurprisingly

at this point, most of the colored text is red, with two phrases colored blue and

two more colored green. The green/blue text (which denotes having a parallel in

the MEAD summary), while slightly spread out, is all in the first half of the gold

standard summary. As opposed to that, the red/blue text (which denotes having

a parallel in the NEVA summary) is nicely spread throughout the entire summary.

It is clear, though, that a majority of the text is not red/blue, meaning that the

NEVA summary was unable to account for even half of all relevant content. What

this means with regard to the half red NEVA summary is that NEVA is able to

consistently extract relevant sentences, but not shorten them in the same way that a

human writer might.

Although this analysis was only performed with one of the three summarized

books, the results are clear. The human evaluators did not pay close enough attention

to the content scores, and for some reason MEAD was considered better than NEVA

despite the large amount of additional relevant content that NEVA produced over

MEAD. A more accurate representation of a score based on content is found with

the ROUGE metric, although it is unclear how well the ROUGE scores correlate with the amount of content contained in a summary. For example, looking at the ROUGE scores for Dr. Jekyll and Mr. Hyde in Figure 6.7, it can be seen that the

NEVA summary has less than twice the relevant content that the MEAD summary

contains; however, it has just been demonstrated that NEVA has about three times

as much relevant content as MEAD contains in that summary (thirteen phrases to

four phrases). Although the author compared the computer-generated summaries to

the gold standard while the ROUGE percentages compare them to the other human-

written summary, it is a fair assumption that the human-written summaries contained

similar content.

The likely reason why human evaluators gave MEAD a higher score than NEVA is that humans inherently desire there to be more than just content in

a summary. This was the basis for why the human evaluators were asked to rate

the summaries in the two additional categories of flow and style. However, it is

likely that humans really only have one impression of a summary after they read

it, and even if asked later to differentiate between different categories their original

notions of a “good” or “bad” summary will take precedence. Analysis of the original

scores (see Appendix B) supports this theory. If human evaluators were objective in

the scoring between categories, then the scores would likely differ to a large extent,

especially if content and flow qualitatively differed so greatly. However, the average

difference in scores by a single evaluator on a single summary is only 0.905 out of 10.

(Every evaluator gave three scores per summary, so three pairwise differences can be calculated per evaluator per book per summary; with seven evaluators, three books, and four summaries each, this totals 7 × 3 × 4 × 3 = 252 values. The average was determined by taking the mean of the absolute values of all of these differences.)

By reading the NEVA and MEAD summaries of Dr. Jekyll and Mr. Hyde in

Appendix C, one sees what may have been the determining factor human evaluators used to rate the “goodness” of each summary. The MEAD summary reads well, and

the NEVA summary does not. For example, the first few sentences of the MEAD

summary form a logical progression that any human can follow: the narrator says how

Utterson goes to Jekyll’s house; then he describes the house, explaining who Jekyll

bought it from; then he talks about Utterson’s initial impressions of the house; then

he describes a room that Utterson walked in to. Indeed, these sentences should follow

a logical progression because these are sentences 1, 2, 3 and 5 from the original text

of the book! On the other hand, the NEVA summary does not flow well because the

sentences that are extracted from the original text are not necessarily consecutive and

therefore the intermediate progression is lost. A human reading these two summaries

would surely like the way MEAD’s summary sounds better than NEVA’s summary

and might miss the fact that NEVA’s summary contains more relevant content. It

seems as if MEAD looks to extract sentences that are in proximity to each other in

the original text. If a chunk extracted by MEAD was largely descriptive, there would

be almost no plot content extracted for many sentences as is the case in the first few

lines of MEAD’s summary of Dr. Jekyll and Mr. Hyde.


Chapter 7

Conclusion and Future Work

A system to automatically summarize narrative texts has been designed. Previous

systems have used statistical methods for summarization and these systems have been

unsuccessful at producing content rich summaries for narratives. In this thesis, a new

rule-based algorithm called NEVA has been presented that extracts sentences from

a narrative and successfully produces summaries with relevant content. The results

of NEVA have been compared against a statistical automatic summarization system

called MEAD using both the ROUGE summarization metric and human volunteers

in a blind experiment.

The human and ROUGE analyses were found to contradict each other.

As seen from Figures 6.1, 6.2 and 6.3, the human evaluators gave MEAD scores

that were consistently about one point higher than NEVA’s scores. However, Fig-

ures 6.4, 6.5, 6.6, and 6.7 show that ROUGE consistently gave NEVA better scores than it gave MEAD. Additionally, the ROUGE scores show that NEVA performs

very well in terms of including relevant content, containing up to 85.2% of the same

content as a human-written summary used as an upper-bound. A third analysis was

performed by the author that showed that NEVA produced more than three times


as much relevant content as MEAD did for one case. This shows that the ROUGE scores give a better indication of content than the human analyses. A hypothesis for this is that humans tend to rate summaries largely based on how well they read in general, regardless of the actual content. Therefore, human evaluators can be used

to judge the coherence of a summary, but not its content.

Since neither ROUGE nor human volunteers can fully analyze a narrative summary, the problem exists as to how to accurately rate the summaries. A summary might be unintelligible if it is not coherent enough; however, it is useless if it does not contain the right content. A combination of both analyses, where ROUGE rates

content and humans rate coherence, would be suitable for a full analysis of a narrative

summary. However, further research is necessary to determine exactly how accurately

humans can rate a summary’s cohesiveness.

The conclusions of the analyses show that NEVA produces content well, but fails

to string that content together so that humans can comfortably read its summaries.

This leaves a lot of work still to be done with NEVA. Firstly, the rule-based algorithm

seems to work much better than statistical methods for this domain; however, it can

definitely be improved upon. A possible approach for this would be to implement a

machine learning algorithm that creates an optimal list of rules for summarization.

Further work can be done to manually label large corpora of data for supervised

learning methods of this sort. Secondly, NEVA still needs a lot of post-processing

for its summaries so that the sentences flow and do not lack background information.

The current algorithm produces content that is somewhat close to that of human-written summaries, and if NEVA’s summaries could be as coherent as the human-written summaries, NEVA would be primed to replace humans for writing narrative summaries. This would be a large step in the right direction for automatic summa-

rizers and artificial intelligence in general.


Appendix A

Hobbs Algorithm

1. Begin at the noun phrase (NP) node immediately dominating the pronoun.

2. Go up the tree to the first NP or sentence (S) node encountered. Call this node

X, and call the path used to reach it p.

3. Traverse all branches below node X to the left of path p in a left-to-right,

breadth-first fashion. Propose as the antecedent any NP node that is encoun-

tered which has an NP or S node between it and X.

4. If node X is the highest S node in the sentence, traverse the surface parse trees

of previous sentences in the text in order of recency, the most recent first; each

tree is traversed in a left-to-right, breadth-first manner, and when an NP node

is encountered, it is proposed as antecedent. If X is not the highest S node in

the sentence, continue to step 5.

5. From node X, go up the tree to the first NP or S node encountered. Call this

new node X, and call the path traversed to reach it p.

6. If X is an NP node and if the path p to X did not pass through the Nominal

node that X immediately dominates, propose X as the antecedent.



7. Traverse all branches below node X to the left of path p in a left-to-right,

breadth-first manner. Propose any NP node encountered as the antecedent.

8. If X is an S node, traverse all branches of node X to the right of path p in

a left-to-right, breadth-first manner, but do not go below any NP or S node

encountered. Propose any NP node encountered as the antecedent.

9. Go to Step 4.
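
As a small illustration of the traversal primitive used in steps 3 and 7, the Python sketch below performs the left-of-path, breadth-first search over an nltk.Tree. It is only a fragment of the full algorithm: the additional step 3 condition (an intervening NP or S node) is omitted for brevity, and the toy parse and indices are purely illustrative.

    from collections import deque
    from nltk import Tree

    def left_branches_bfs(x, path_child_index):
        # Breadth-first, left-to-right traversal of every branch of node x lying
        # to the left of the path p, which descends through child number
        # path_child_index. Yields each NP node encountered.
        queue = deque(child for child in x[:path_child_index]
                      if isinstance(child, Tree))
        while queue:
            node = queue.popleft()
            if node.label() == "NP":
                yield node
            queue.extend(child for child in node if isinstance(child, Tree))

    # A toy parse of "John saw him". Suppose step 2 has climbed to the S node via
    # its second child (the VP); the only branch to its left is the subject NP.
    sentence = Tree("S", [Tree("NP", [Tree("NNP", ["John"])]),
                          Tree("VP", [Tree("VBD", ["saw"]),
                                      Tree("NP", [Tree("PRP", ["him"])])])])
    for np in left_branches_bfs(sentence, path_child_index=1):
        print(np)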
Appendix B

Human Analysis Ratings

Figure B.1 The original scores for the human evaluations of Dr. Jekyll and
Mr. Hyde.

Figure B.2 The original scores for the human evaluations of The Awakening.


Figure B.3 The original scores for the human evaluations of The Ambassadors.
Appendix C

Annotated Summaries

C.1 NEVA

It was late in the afternoon, when Mr. Utterson found his way to Dr. Jekyll’s door1 ,

where he was at once admitted by Poole, and carried down by the kitchen offices and

across a yard which had once been a garden, to the building which was indifferently

known as the laboratory or dissecting rooms. And indeed he does not want my help2 ;

you do not know him as I do; he is safe, he is quite safe; mark my words, he will

never more be heard of.”2 Utterson ruminated awhile; he was surprised at his friend’s

selfishness, and yet relieved by it. he asked. The doctor seemed seized with a qualm

of faintness; he shut his mouth tight and nodded. ”He meant to murder you5 . And

he covered his face for a moment with his hands. The newsboys, as he went, were

crying themselves hoarse along the footways: ”Special edition. Presently after, he sat

on one side of his own hearth, with Mr. Guest6 , his head clerk, upon the other, and

midway between, at a nicely calculated distance from the fire, a bottle of a particular

old wine6 that had long dwelt unsunned in the foundations of his house. There was

no man from whom he kept fewer secrets than Mr. Guest; and he was not always sure


that he kept as many as he meant. ”This is a sad business about Sir Danvers,” he said.

”Henry Jekyll forge for a murderer7 !” Sir came out of his seclusion, renewed relations

with his friends, became once more their familiar guest and entertainer; and whilst he

had always been known for charities8 , he was now no less distinguished for religion.

”The doctor was confined to the house,” Poole said, ”and saw no one.” On the 15th,

he tried again, and was again refused9 ; and having now been used for the last two

months to see his friend almost daily, he found this return of solitude to weigh upon

his spirits. ”Yes,” he thought; ”he is a doctor, he must know his own state and that

his days are counted; and the knowledge is more than he can bear.” And yet when

Utterson remarked on his ill-looks, it was with an air of great firmness that Lanyon

declared himself a doomed man. ”I have had a shock,” he said, ”and I shall never

recover10 . But Lanyon’s face changed, and he held up a trembling hand. ”I wish to

see or hear no more of Dr. Jekyll11 ,” he said in a loud, unsteady voice. he inquired.

”He will not see me,” said the lawyer. As soon as he got home, Utterson sat down and

wrote to Jekyll, complaining of his exclusion from the house, and asking the cause of

this unhappy break with Lanyon; and the next day brought him a long answer, often

very pathetically worded, and sometimes darkly mysterious in drift. The quarrel with

Lanyon was incurable. ”I do not blame our old friend,” Jekyll wrote, ”but I share his

view that we must never meet12 . Utterson went to call indeed; but he was perhaps

relieved to be denied admittance; perhaps, in his heart, he preferred to speak with

Poole upon the doorstep13 and surrounded by the air and sounds of the open city,

rather than to be admitted into that house of voluntary bondage, and to sit and speak

with its inscrutable recluse. Utterson became so used to the unvarying character of

these reports, that he fell off little by little in the frequency of his visits. The middle

one of the three windows was half-way open; and sitting close beside it, taking the

air with an infinite sadness of mien, like some disconsolate prisoner, Utterson saw Dr.

Jekyll14 . ”God forgive us, God forgive us,” said Mr. Utterson. But Mr. Enfield only

nodded his head very seriously, and walked on once more in silence15 .

C.2 MEAD

It was late in the afternoon, when Mr. Utterson found his way to Dr. Jekyll’s

door1 , where he was at once admitted by Poole, and carried down by the kitchen

offices and across a yard which had once been a garden, to the building which was

indifferently known as the laboratory or dissecting rooms. The doctor had bought

the house from the heirs of a celebrated surgeon; and his own tastes being rather

chemical than anatomical, had changed the destination of the block at the bottom of

the garden. It was the first time that the lawyer had been received in that part of

his friend’s quarters; and he eyed the dingy, windowless structure with curiosity, and

gazed round with a distasteful sense of strangeness as he crossed the theatre, once

crowded with eager students and now lying gaunt and silent, the tables laden with

chemical apparatus, the floor strewn with crates and littered with packing straw, and

the light falling dimly through the foggy cupola. It was a large room fitted round

with glass presses, furnished, among other things, with a cheval-glass and a business

table, and looking out upon the court by three dusty windows barred with iron. And

indeed he does not want my help2 ; you do not know him as I do; he is safe, he is

quite safe; mark my words, he will never more be heard of2 .” I should like to leave it

in your hands, Utterson; you would judge wisely, I am sure; I have so great a trust in

you3 .” The letter was written in an odd, upright hand and signed ”Edward Hyde”:

and it signified, briefly enough, that the writer’s benefactor, Dr. Jekyll, whom he had

long so unworthily repaid for a thousand generosities, need labour under no alarm

for his safety, as he had means of escape on which he placed a sure dependence4 .

Presently after, he sat on one side of his own hearth, with Mr. Guest6 , his head clerk,

upon the other, and midway between, at a nicely calculated distance from the fire,

a bottle of a particular old wine6 that had long dwelt unsunned in the foundations

of his house. Guest had often been on business to the doctor’s; he knew Poole; he

could scarce have failed to hear of Mr. Hyde’s familiarity about the house; he might

draw conclusions: was it not as well, then, that he should see a letter which put that

mystery to right? Much of his past was unearthed, indeed, and all disreputable: tales

came out of the man’s cruelty, at once so callous and violent; of his vile life, of his

strange associates, of the hatred that seemed to have surrounded his career; but of his

present whereabouts, not a whisper. He was busy, he was much in the open air, he

did good; his face seemed to open and brighten, as if with an inward consciousness of

service; and for more than two months, the doctor was at peace. The rosy man had

grown pale; his flesh had fallen away; he was visibly balder and older; and yet it was

not so much these tokens of a swift physical decay that arrested the lawyer’s notice,

as a look in the eye and quality of manner that seemed to testify to some deep-seated

terror of the mind. ”Yes,” he thought; ”he is a doctor, he must know his own state

and that his days are counted; and the knowledge is more than he can bear.” Utterson

was amazed; the dark influence of Hyde had been withdrawn, the doctor had returned

to his old tasks and amities; a week ago, the prospect had smiled with every promise

of a cheerful and an honoured age; and now in a moment, friendship, and peace of

mind, and the whole tenor of his life were wrecked. The doctor, it appeared, now

more than ever confined himself to the cabinet over the laboratory, where he would

sometimes even sleep; he was out of spirits, he had grown very silent, he did not read;

it seemed as if he had something on his mind.



C.3 Gold Standard

Utterson calls on Jekyll1 , whom he finds in his laboratory looking deathly ill. Jekyll

feverishly claims that Hyde has left and that their relationship has ended2 . He also

assures Utterson that the police shall never find the man. Jekyll then shows Utterson

a letter and asks him what he should do with it3 , since he fears it could damage

his reputation if he turns it over to the police. The letter is from Hyde, assuring

Jekyll that he has means of escape, that Jekyll should not worry about him, and

that he deems himself unworthy of Jekyll’s great generosity4 . Utterson asks if Hyde dictated the terms of Jekyll’s will, especially its insistence that Hyde inherit in the event of Jekyll’s disappearance. Jekyll replies in the affirmative, and Utterson tells

his friend that Hyde probably meant to murder him5 and that he has had a near

escape. He takes the letter and departs. On his way out, Utterson runs into Poole,

the butler, and asks him to describe the man who delivered the letter; Poole, taken

aback, claims to have no knowledge of any letters being delivered other than the

usual mail. That night, over drinks, Utterson consults his trusted clerk, Mr. Guest6 ,

who is an expert on handwriting. Guest compares Hyde’s letter with some of Jekyll’s own writing and suggests that the same hand inscribed both; Hyde’s script merely

leans in the opposite direction, as if for the purpose of concealment. Utterson reacts

with alarm at the thought that Jekyll would forge a letter for a murderer7 . As time

passes, with no sign of Hyde’s reappearance, Jekyll becomes healthier-looking and more sociable, devoting himself to charity8 . To Utterson, it appears that the removal of Hyde’s evil influence has had a tremendously positive effect on Jekyll. After two

months of this placid lifestyle, Jekyll holds a dinner party, which both Utterson and

Lanyon attend, and the three talk together as old friends. But a few days later, when

Utterson calls on Jekyll, Poole reports that his master is receiving no visitors. This

scenario repeats itself for a week9 , so Utterson goes to visit Lanyon, hoping to learn

why Jekyll has refused any company. He finds Lanyon in very poor health, pale and

sickly, with a frightened look in his eyes. Lanyon explains that he has had a great

shock and expects to die in a few weeks10 . “[L]ife has been pleasant,” he says. “I liked it; yes, sir, I used to like it.” Then he adds, “I sometimes think if we knew all, we should be more glad to get away.” When Utterson mentions that Jekyll also seems ill, Lanyon

violently demands that they talk of anything but Jekyll11 . He promises that after

his death, Utterson may learn the truth about everything, but for now he will not

discuss it. Afterward, at home, Utterson writes to Jekyll, talking about being turned

away from Jekyll’s house and inquiring as to what caused the break between him and Lanyon. Soon Jekyll’s written reply arrives, explaining that while he still cares for

Lanyon, he understands why the doctor says they must not meet12 . As for Jekyll

himself, he pledges his continued affection for Utterson but adds that from now on

he will be maintaining a strict seclusion, seeing no one. He says that he is suffering

a punishment that he cannot name. Lanyon dies a few weeks later, fulfilling his

prophecy. After the funeral, Utterson takes from his safe a letter that Lanyon meant

for him to read after he died. Inside, Utterson finds only another envelope, marked

to remain sealed until Jekyll also has died. Out of professional principle, Utterson

overcomes his curiosity and puts the envelope away for safekeeping. As weeks pass,

he calls on Jekyll less and less frequently, and the butler continues to refuse him

entry13 . The following Sunday, Utterson and Enfield are taking their regular stroll.

Passing the door where Enfield once saw Hyde enter to retrieve Jekyll’s check, Enfield

remarks on the murder case. He notes that the story that began with the trampling

has reached an end, as London will never again see Mr. Hyde. Enfield mentions that

in the intervening weeks he has learned that the run-down laboratory they pass is

physically connected to Jekyll’s house, and they both stop to peer into the house’s windows, with Utterson noting his concern for Jekyll’s health. To their surprise, the

two men find Jekyll at the window, enjoying the fresh air14 . Jekyll complains that

he feels very low, and Utterson suggests that he join them for a walk, to help his

circulation. Jekyll refuses, saying that he cannot go out. Then, just as they resume

polite conversation, a look of terror seizes his face, and he quickly shuts the window

and vanishes. Utterson and Enfield depart in shocked silence15 .


Bibliography

[1] Jon Barwise. An introduction to first-order logic. In Handbook of mathematical


logic, chapter A.1, pages 6–47. North–Holland, 1977.

[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. Inferring strate-
gies for sentence ordering in multidocument news summarization. Journal of
Artificial Intelligence Research, 17:2002, 2002.

[3] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with
Python. O’Reilly Media, Inc, 2009.

[4] Eric Brill. A simple rule-based part of speech tagger. In Proceedings of the
Third Conference on Applied Natural Language Processing (ANLP), Trento, Italy,
June-July 1992.

[5] Hsin-Hsi Chen and Chuan-Jie Lin. A multilingual news summarizer. In Proceed-
ings of the 18th conference on Computational linguistics - Volume 1, COLING
’00, pages 159–165, Stroudsburg, PA, USA, 2000. Association for Computational
Linguistics.

[6] Hsin-Hsi Chen and Chuan-Jie Lin. Sentence extraction by tf/idf and position
weighting from newspaper articles. In Proceedings of the Third NTCIR Work-
shop, 2002.

[7] Noam Chomsky. Syntactic Structures. Mouton, The Hague, 1957.

[8] Michael Collins. http://www.cs.columbia.edu/~mcollins/code.html. Accessed:


3/11/2011.

[9] Michael Collins. A new statistical parser based on bigram lexical dependencies.
In Proceedings of the 34th Annual Meeting of the ACL, Santa Cruz, California,
USA, June 1996.

[10] Ani Nenkova. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference, 2005.


[11] Naomi Daniel, Dragomir Radev, and Timothy Allison. Sub-event based multi-
document summarization. In Proceedings of the HLT-NAACL 03 on Text sum-
marization workshop - Volume 5, HLT-NAACL-DUC ’03, pages 9–16, Strouds-
burg, PA, USA, 2003. Association for Computational Linguistics.

[12] H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–


285, April 1969.

[13] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data
mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.

[14] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named entity
recognition through classifier combination. In Proceedings of the seventh con-
ference on Natural language learning at HLT-NAACL 2003 - Volume 4, pages
168–171, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[15] Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. Multi-
document summarization by sentence extraction. In Proceedings of the 2000
NAACL-ANLPWorkshop on Automatic summarization - Volume 4, NAACL-
ANLP-AutoSum ’00, pages 40–48, Stroudsburg, PA, USA, 2000. Association for
Computational Linguistics.

[16] Jerry R. Hobbs. Resolving pronoun references. Lingua, 44(4):311–338, 1978.

[17] Golam Mortuza Hossain. http://gposttl.sourceforge.net/. Accessed: 3/11/2011.

[18] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice
Hall, 1999.

[19] Anna Kazantseva and Stan Szpakowicz. Summarizing short stories. Comput.
Linguist., 36:71–109, March 2010.

[20] Christopher M. Kelley and Gillian DeMoulin. The web cannibalizes media, May
2002.

[21] W. G. Lehnert. Plot units: a narrative summarization strategy. In W. G. Lehnert


and M. H. Ringle, editors, Strategies for natural language processing. Hillsdale,
NJ: Lawrence Erlbaum, 1982.

[22] Chin-Yew Lin. Rouge: a package for automatic evaluation of summaries. In


Proceedings of the Workshop on Text Summarization Branches Out, Barcelona,
Spain, July 2004.

[23] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-
gram co-occurrence statistics. In Proceedings of Human Language Technology
Conference, Edmonton, Canada, May 2003.

[24] Melissa Maerz. Watson wins ’jeopardy!’ finale; ken jennings welcomes ’our new
computer overlords’. http://latimesblogs.latimes.com/showtracker/2011/02/
watson-jeopardy-finale-man-vs-machine-showdown.html. Accessed: 3/11/2011.
[25] Inderjeet Mani. Automatic Summarization. John Benjamins, 2001.
[26] Inderjeet Mani and Mark T. Maybury. Advances in Automatic Text Summariza-
tion. MIT Press, Cambridge, MA, USA, 1999.
[27] Kathleen McKeown and Dragomir R. Radev. Generating summaries of multiple
news articles. In Proceedings of the 18th annual international ACM SIGIR con-
ference on Research and development in information retrieval, SIGIR ’95, pages
74–82, New York, NY, USA, 1995. ACM.
[28] Rada Mihalcea and Hakan Ceylan. Explorations in automatic book summa-
rization. In Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning,
pages 380–389, Prague, June 2007.
[29] Frederick Mosteller and David L. Wallace. Inference in an authorship problem. Journal of the American Statistical Association, 58(302):275–309, 1963.
[30] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam. Centroid-
based summarization of multiple documents. Information Processing and Man-
agement, 40(6):919 – 938, 2004.
[31] Dragomir R. Radev and Kathleen R. McKeown. Generating natural lan-
guage summaries from multiple on-line sources. Comput. Linguist., 24:470–500,
September 1998.
[32] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
[33] D. E. Rumelhart. Understanding and summarizing brief stories. In D. Laberge
and S. Samuels, editors, Basic processing in reading, perception, and comprehen-
sion. Hillsdale, NJ: Lawrence Erlbaum, 1977.
[34] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 2002.
[35] Peter Selgin. By Cunning and Craft: Sound Advice and Practical Wisdom for
Fiction Writers. Writer’s Digest Books, 2007.
[36] R. F. Simmons and A. Correira. Rule forms for verse, sentences and story trees.
In N. Findler, editor, Associative Networks: Representation and Use of Knowledge by Computers. New York: Academic Press, 1979.
[37] Joel R. Tetreault. A corpus-based evaluation of centering and pronoun resolution.
Computational Linguistics, 27(4):507–520, 2001.

[38] P. W. Thorndyke. Cognitive structures in comprehension and memory of narra-


tive discourse. Cognitive Psychology, 9:77–110, 1977.

[39] Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: language-independent named entity recognition. In Proceedings of the 6th Conference on
Natural language learning - Volume 20, COLING-02, pages 1–4, Stroudsburg,
PA, USA, 2002. Association for Computational Linguistics.

[40] Alan M. Turing. Computing Machinery and Intelligence. Mind, LIX:433–460,


1950.

[41] Stephen Wan and Kathy McKeown. Generating overview summaries of ongoing
email thread discussions. In Proceedings of the 20th international conference on
Computational Linguistics, COLING ’04, Stroudsburg, PA, USA, 2004. Associ-
ation for Computational Linguistics.

[42] Liang Zhou and Eduard Hovy. On the summarization of dynamically introduced
information: Online discussions and blogs. In AAAI Symposium on Computa-
tional Approaches to Analysing Weblogs (AAAI-CAAW), pages 237–242, 2006.

[43] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. Movie review mining and summariza-
tion. In Proceedings of the 15th ACM international conference on Information
and knowledge management, CIKM ’06, pages 43–50, New York, NY, USA, 2006.
ACM.
