
Automatic News Summarization Using a Dependency Structure
Aaron Fetterman (Class of 2008, Software Engineering, Honors, Oral Presentation)
Advised by Janice T. Searleman (Department of Computer Science, Faculty Advisor)
With increasing amounts of digitized news available on the internet, a summary that is indicative of the article it describes helps readers choose between articles from different publications, or between a set of articles written by the same publication over a period of time. Using aggregators like Google News or Yahoo! News, people browse dozens of articles from many newspapers at once. Publishers with long histories also have a large number of historic articles online; the New York Times alone has an archive of over 28 million articles [1, 2]. A search can bring readers to articles that might be helpful, but their own judgments, based on summaries, guide them to the articles they actually read.
This research aims to create a program that will automatically generate a coherent, indicative, high-density summary of a given news article. Ideally, this program would be able to create useful summaries for archives as large as the New York Times', or for the front-page news articles of smaller publications.
Currently, there are four main approaches for summarizing news articles: headlines, handwritten
summaries, lead extract summaries, and general automatic summaries. While each of these has notable
advantages that serve as good guidelines, each also has inherent drawbacks. Headlines communicate a great deal about an article in very little space, but they are often edited for space or to draw readers into the article rather than to summarize it; adding even a few more words can improve a headline substantially as a summary.
Handwritten summaries produce output that an automatic system would aspire to, but at the cost of
paying a skilled professional to create summaries for each desired length. The Wall Street Journal and the
New York Times use handwritten summaries [3, 4]. A lead extract summary, which uses the first few lines of the article, is founded on the idea that news articles should put the most general, important information first. It can be generated easily by a computer program and gives a good introduction when the author of the article has written with that convention in mind; for other types of articles, it is not optimal. It is not as high-density as a handwritten summary, and it is repetitive for readers who go on to read the full article. The Hartford Courant uses lead extract summaries, which are sometimes edited slightly to be more coherent on their own [5].
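
As an illustration, the sketch below shows how little machinery a lead extract summarizer needs. The regex sentence splitter, the 50-word budget, and the function name are assumptions chosen for this example, not taken from any publisher's system.

import re

def lead_extract(article: str, max_words: int = 50) -> str:
    # Naive sentence split on terminal punctuation, for illustration only.
    sentences = re.split(r'(?<=[.!?])\s+', article.strip())
    summary_words = []
    for sentence in sentences:
        words = sentence.split()
        # Always keep the first sentence; stop once the budget would be exceeded.
        if summary_words and len(summary_words) + len(words) > max_words:
            break
        summary_words.extend(words)
    return ' '.join(summary_words)
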
General automatic summaries improve upon lead extract summaries by extracting sentences from
throughout the article based on a myriad of scoring techniques, like cue words, title words, and key
words. This can create a good summary, given enough space, and is generated automatically. But because
the fundamental unit for extract summaries is a sentence, the summaries depend on having good overview
sentences in the article itself. Overview sentences are more common in academic papers than in news
articles. Many summary systems are designed with abstracting academic journal articles in mind, where cue phrases (like "This research aims") are prevalent, title words are very useful, and abstracts run 20% to 30% of the original article's length, typically 100 to 200 words. For news articles, though, summaries are usually only 20 to 50 words.
The majority of current summarization systems use extraction to generate a summary [6]. Instead
of full summarization, which involves understanding, abstracting, and generating text, extraction leaves
an easier problem: “ranking sentences from the original document” [7]. This can be solved using
information retrieval, statistical, and lexical tools to rank the sentences in the article. The summary is then assembled from the highest-ranked sentences. A different method is to use information extraction and
discourse analysis to generate sentences from scratch [8]. These abstracts combine world knowledge with
the information in the article.
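
A hedged sketch of this sentence-ranking style of extraction follows. The stopword list, the title-word bonus weight, and the regex tokenization are illustrative assumptions, not the scoring used in any particular published system.

import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'on', 'for', 'that'}

def content_words(text: str) -> list:
    return [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOPWORDS]

def extract_summary(article: str, title: str, n_sentences: int = 2) -> str:
    sentences = re.split(r'(?<=[.!?])\s+', article.strip())
    freq = Counter(content_words(article))   # key-word statistics
    title_words = set(content_words(title))  # title-word bonus

    def score(sentence: str) -> float:
        words = content_words(sentence)
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)
        return base + 2.0 * sum(1 for w in words if w in title_words)

    # Rank sentences by score, then re-emit the winners in article order.
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:n_sentences]
    return ' '.join(sentences[i] for i in sorted(top))
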
To generate the high-density summaries that are useful to newspapers, the summarizer needs to
use a more fine-grained approach than sentence-level extraction. To do this, the summarizer ranks
individual words. The ranking is based on dependency structures, which describe the relationships between words: for instance, adjectives depend on the nouns they describe, and nouns depend on the verbs they serve as arguments of. The words that the most other words depend on are the most important, and are included in the summary. This way, many of the important ideas in the article can be put together in a
relatively short summary because they can be connected through common objects using the dependency
structures, even if the author does not state their relation in a concise sentence.
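
The sketch below illustrates this word-level ranking using spaCy's dependency parser as a stand-in; the parser used in this project is not specified here, and the en_core_web_sm model is assumed to be installed. Scoring each word by how many words depend on it, directly or transitively, is one simple reading of the ranking described above.

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed stand-in; any dependency parser would do

def rank_words(text: str) -> Counter:
    scores = Counter()
    for token in nlp(text):
        # Credit every ancestor of this token, so a word that many other
        # words depend on (directly or transitively) accumulates a high score.
        for ancestor in token.ancestors:
            scores[ancestor.text.lower()] += 1
    return scores

print(rank_words("The mayor announced a new budget that cuts school funding.").most_common(3))

On a sentence like this, the main verb and the head nouns of its arguments score highest, which matches the intuition that they anchor the article's important ideas.
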
The goal of the project is to create summaries that are indicative of the content that they describe,
and also coherent English, by using dependency structures to create a new representation of the article.
The generated summaries will be evaluated individually for indicativeness and coherence, both by surveys of readers and by statistical measures. This evaluation, as well as refinements to the algorithm based on its results, will be
completed next semester.

References

1. New York Times Archives, 1851–Present. <http://www.nytimes.com/ref/membercenter/nytarchive.html>

2. The New York Times Archives. <http://pqasb.pqarchiver.com/nytimes/advancedsearch.html>

3. The Wall Street Journal Online. <http://online.wsj.com/public/us>

4. The New York Times. <http://www.nytimes.com/>

5. The Hartford Courant Online. <http://www.courant.com/>
6. Varadarajan, R. and Hristidis, V. 2006. A system for query-specific document summarization. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (Arlington, Virginia, USA, November 06-11, 2006). CIKM '06. ACM Press, New York, NY, 622-631.

7. Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. 1999. Summarizing text documents: sentence selection and evaluation metrics. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Berkeley, California, United States, August 15-19, 1999). SIGIR '99. ACM Press, New York, NY, 121-128.

8. Radev, D. R., Hovy, E., and McKeown, K. 2002. Introduction to the special issue on summarization. Computational Linguistics 28, 4 (Dec. 2002), 399-408.

