
Guest Editors' Introduction

Recent Advances in
Natural Language
Processing
Fabio Ciravegna, University of Sheffield

Sanda Harabagiu, University of Texas, Dallas

Language is the most natural way of communication for humans. The vast majority of information is stored or passed in natural language (for example, in textual format or in dialogues). Natural language processing aims at defining methodologies, models, and systems able to cope with NL, to either understand or generate it.

Opportunities for NLP applications range from querying archives (for example, the historical activity on querying databases), to accessing collections of texts and extracting information, to report generation, to machine translation.

Ongoing progress

The past few years have witnessed several major NLP advances, allowing text and speech processing to make vast inroads in universal access to information.

It is a common experience nowadays to contact call centers that use speech-understanding systems (for example, when accessing travel information). We will see even more of this in the future. One main field of application is the ever-growing World Wide Web. Owing to recent progress in open-domain question answering, multidocument summarization, and information extraction, in the near future we'll be able to explore information posted on the Web, such as large online document collections, quickly and efficiently. Instead of users having to scan long lists of documents deemed relevant to a Boolean query (as when using search engines), they will receive a concise, exact answer to their questions expressed in natural language. Similarly, multidocument-summarization techniques will help solve the current information overload by producing short abstracts that coherently combine essential information scattered among multiple documents. Moreover, research efforts are enabling question answering, summarization, and information extraction in a wide range of languages, for example English, Spanish, and Italian, and even Arabic, Farsi, Urdu, and Somali.

NLP is a multidisciplinary field at the intersection of linguistics (classic and computational), psycholinguistics, computer science and engineering, machine learning, and statistics. Recent advances have only been possible thanks to developments in each of these fields, from linguistic-resources definition (for example, large corpora), to computational models, to evaluation exercises. Large strategic funds have been devoted to NLP in the last few years, from the DARPA TIPSTER program in the 90s and the current Advanced Research and Development Activity's ACQUAINT (www.ic-arda.org/InfoExploit/aquaint) and DARPA TIDES (www.darpa.mil/iao/TIDES.htm) programs in the US, to the European funds for linguistic engineering in the 90s and the current Information Society Technology (www.cordis.lu/ist) program.

As usual, advances in research work in concatenation; small advances allow big technological jumps, which in turn allow other advances. The availability of large resources such as the Penn Treebank1 (www.cis.upenn.edu/~treebank) has made possible the development of large-coverage, accurate syntactic parsers.2,3 Semantic knowledge bases such as WordNet4 and, more recently, FrameNet5 have moved semantic processing from the back burner, enabling exciting applications such as question answering on open-domain collections. Semantic parsers6 are under development, allowing deeper processing of language. In parallel, the emerging field of text mining allows computational linguists

12 1094-7167/03/$17.00 © 2003 IEEE IEEE INTELLIGENT SYSTEMS


Published by the IEEE Computer Society
to customize text or speech processing on the basis of the most salient relations.

Additionally, numerous speech and text databases available from the Linguistic Data Consortium (www.ldc.upenn.edu) have let more researchers than ever share data and compare techniques. Corpus-based NLP has let us discover in the past decade how to exploit the statistical properties of text and speech databases. With considerable interest focused on the new de facto corpus for modern NLP procedures and the availability of the Semantic Web standards, the field is set for an exciting new research and development phase.

In this issue

This special issue's theme focuses on recent NLP advances. The seven articles specifically identify the state of the art of NLP in terms of research and development activities, current issues, and future directions.

A special issue of a magazine can, of course, allow only a partial representation of the current development in a field. However, we believe that the articles in this issue are largely representative of at least some major NLP trends. There are articles describing NLP-based applications and articles analyzing language methodologies, providing, we believe, the right mix of technological insight into the state of the art.

"Finding the WRITE Stuff: Automatic Identification of Discourse Structure in Student Essays," by Jill Burstein, Daniel Marcu, and Kevin Knight, describes a decision-based discourse analyzer using decision-tree machine learning algorithms combined with boosting techniques for obtaining an optimal discourse model. This model is based on a rich feature set, comprising rhetorical-structure-theory relations, discourse markers, and lexico-syntactic information. This discourse analysis software, embedded in the Criterion writing-analysis tools, helps students develop sophisticated essays by accessing knowledge of discourse structure.

"Speaking the Users' Languages," by Amy Isard, Jon Oberlander, Ion Androutsopoulos, and Colin Matheson, presents a system that generates descriptions of unseen objects in text of various degrees of complexity. It tailors these descriptions to the user's sophistication, for example, for adults, children, or experts. Moreover, the system can generate descriptions in English, Greek, or Italian.

"Ontology Learning and Its Application to Automated Terminology Translation," by Roberto Navigli, Paola Velardi, and Aldo Gangemi, depicts a system that automatically learns ontologies from texts in a given domain. The domain-specific ontologies that result from these knowledge acquisition methods can be incorporated into WordNet, a lexico-semantic database that numerous NLP systems employ.

"Personalizing Web Publishing via Information Extraction," by Roberto Basili, Alessandro Moschitti, Maria Teresa Pazienza, and Fabio Massimo Zanzotto, portrays a knowledge-based information extraction approach that enables multilingual text classification as well as hyperlinking. This approach generates hyperlinks and their motivations on the basis of similarity between events extracted from document pairs.

"Machine and Human Performance for Single and Multidocument Summarization," by Judith Schlesinger, John Conroy, Mary Ellen Okurowski, and Dianne O'Leary, surveys modern summarization techniques and their evaluations at the 2002 Document Understanding Conference.

"Intelligent Indexing of Crime Scene Photographs," by Katerina Pastra, Horacio Saggion, and Yorick Wilks, introduces a novel indexing approach that exploits predicate-argument relations and textual fragments. Additionally, their approach extracts possible relations for better indexing and uses a crime domain ontology.

"Automatic Ontology-Based Knowledge Extraction from Web Documents," by Harith Alani, Sanghee Kim, David Millard, Mark Weal, Wendy Hall, Paul Lewis, and Nigel Shadbolt, presents Artequakt, a system that automatically extracts knowledge about artists from the Web to populate a knowledge base and to generate narrative biographies.

Acknowledgments

Organizing a special issue is both difficult and tiring. We thank all those who submitted papers for this special issue. Space limitations and the necessity of representing a wide range of technologies caused the exclusion of some other excellent papers. They will be published in future issues of this magazine. Finally, we thank Pauline Hosillos for taking care of the everyday work for this special issue.

References

1. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz, "Building a Large Annotated Corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, 1993, pp. 313–330. Reprinted in Using Large Corpora, Susan Armstrong, ed., MIT Press, Cambridge, Mass., 1994, pp. 273–290.

2. M. Collins, "A New Statistical Parser Based on Bigram Lexical Dependencies," Proc. 34th Ann. Meeting Assoc. for Computational Linguistics (ACL 96), 1996, pp. 184–191.

3. E. Charniak, "A Maximum-Entropy-Inspired Parser," Proc. 6th Applied Natural Lang. Conf. and 1st Meeting North Am. Chapter of the Assoc. for Computational Linguistics (ANLP-NAACL 2000), Morgan Kaufmann, San Francisco, 2000, Section 2, pp. 132–139.

4. C. Fellbaum, ed., WordNet: An Electronic Lexical Database, MIT Press, Cambridge, Mass., 1998.

5. C.R. Johnson and C.J. Fillmore, "The FrameNet Tagset for Frame-Semantic and Syntactic Coding of Predicate Argument Structure," Proc. 6th Applied Natural Lang. Conf. and 1st Meeting North Am. Chapter of the Assoc. for Computational Linguistics (ANLP-NAACL 2000), Morgan Kaufmann, San Francisco, 2000, Section 2, pp. 56–62.

6. D. Gildea and D. Jurafsky, "Automatic Labeling of Semantic Roles," Computational Linguistics, vol. 28, no. 3, Sept. 2002, pp. 245–288.

The Authors

Fabio Ciravegna is a senior research scientist at the University of Sheffield's Department of Computer Science. His research interests include adaptive information extraction from text. He is Sheffield's technical manager for Advanced Knowledge Technologies, a project dealing with knowledge-based technologies for advanced knowledge management. He is also the coordinator of the dot.kom consortium, a project dealing with adaptive information extraction for knowledge management and the Semantic Web. Finally, he is a co-investigator in the MIAKT (Medical Informatics AKT) project. Before joining Sheffield, he was principal investigator and coordinator for information extraction at the Fiat Research Center and at ITC-Irst. Contact him at the Natural Language Processing Group, Dept. of Computer Science, Univ. of Sheffield, Regent Court, 211 Portobello St., Sheffield, S1 4DP, UK; f.ciravegna@dcs.shef.ac.uk; www.dcs.shef.ac.uk/~fabio.

Sanda Harabagiu is an associate professor and the Erik Jonsson School Research Initiation Chair at the University of Texas, Dallas. She also directs the university's Human Language Technology Research Institute. Her research interests include natural language processing, knowledge processing, and AI, and particularly textual question answering, reference resolution, and textual cohesion and coherence. She received her PhD in computer engineering from the University of Southern California and her PhD in computer science from the University of Rome Tor Vergata. She has received the NSF Career Award. She is a member of the IEEE Computer Society, AAAI, and the Association for Computational Linguistics. Contact her at the Dept. of Computer Science, Univ. of Texas at Dallas, Richardson, TX 75083-0688; sanda@cs.utdallas.edu; www.utdallas.edu/~sanda.

JANUARY/FEBRUARY 2003 computer.org/intelligent 13
