
Tutorial AM4

Identification of Biomedical
Events and Relations in the
Literature with Ontological
Support

Sophia Ananiadou, Jung-Jae Kim,
Dietrich Rebholz-Schuhmann, Jun’ichi Tsujii

www.iscb.org/ismb2008
Rebholz-Schuhmann/Kim/Ananiadou/Tsujii
Identification of Biomedical Events and Relations in the literature with ontological support

ISMB2008 Tutorial Proposal

Title:
Identification of Biomedical Events and
Relations in the literature with ontological support

Presenters (in alphabetical order):

- Sophia Ananiadou, Ph.D. (SA)
Reader in Text Mining, National Centre for Text Mining, School of Computer
Science, University of Manchester
- Jung-Jae Kim, Ph.D. (JJK)
- Dietrich Rebholz-Schuhmann, M.D., Ph.D. (DRS)
Rebholz Group, European Bioinformatics Institute, Cambridge, U.K.
- Prof. Jun’ichi Tsujii (JT)
Professor in Text Mining, Department of Computer Science, University of Tokyo and
National Centre for Text Mining, School of Computer Science, University of
Manchester

1 Abstract
2 Introduction
2.1 Information retrieval and information extraction: relevant terminology
2.2 Selected types of data that text mining can deliver
2.3 Basic software components for text mining
2.4 Evaluation of text mining systems
3 Building terminological, ontological resources and annotation projects
3.1 Building terminological resources
3.2 Building an ontological resource: the example GRO
3.3 Biological Annotation and Linguistic Annotation
4 Information Extraction of named entities and ontological concepts (IE-1)
4.1 Identification of GO concepts in the literature
5 Information Extraction of relations and events (IE-2)
5.1 Protein-protein interactions
5.2 Event/Relation Recognition
5.3 Identification of gene regulatory events
6 Conclusion, Outlook, Discussion
7 Annex A: Terminological resources
7.1 Building terminological resources
7.2 Information Extraction of named entities and ontological concepts (IE-1)

1 Abstract
Text mining solutions increasingly integrate ontological resources to extract normalized
information from the literature. The tutorial will demonstrate through working examples how
biomedical events can be identified from literature based on state-of-the-art technology
integrating ontological resources. Successful use cases will be demonstrated and the
infrastructure of existing solutions will be explained.

2 Introduction
In the past two decades biologists have developed new technologies that support the analysis and
understanding of processes in model organisms and that can be performed in high throughput
to speed up knowledge discovery. As a consequence, the work of a biologist has changed
from traditional laboratory work into a combination of laboratory and bioinformatics-oriented
work, in which they have to deal with special software components and learn how to access and
understand electronic data. Over recent years, literature retrieval and information extraction
services have been integrated into the IT infrastructure that biologists
use on a daily basis (Jensen et al., 2006; Zweigenbaum et al., 2007; Ananiadou et al.,
2006; Rebholz-Schuhmann et al., 2005; Shatkay et al., 2005).
2.1 Information retrieval and information extraction: relevant terminology
The term “text mining” refers to a number of techniques that are applied to process text. The
purpose of text processing can be quite diverse. In some cases a set of documents that are
relevant to a query given by a user is retrieved and ranked according to relevance to the
query. In other cases individual facts are extracted from the text and entered into a database
system. Text mining has two major fields of research and development: information retrieval
(IR) and information extraction (IE).
Information retrieval is concerned with the selection of a set of documents or a set of facts
out of a larger set of documents and forms the core of search engines (the Google engine
for Web pages being the best-known example).
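The retrieval-and-ranking idea described above can be illustrated with a minimal sketch. It assumes a simple TF-IDF overlap score, which is only one of many possible relevance measures and is not claimed to be what any of the systems in this tutorial use; the function names and the three toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return indices of matching documents, best TF-IDF overlap first."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    def score(counts):
        total = 0.0
        for word in set(tokenize(query)):
            if word in counts:
                # Smoothed inverse document frequency (an assumed variant).
                idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
                total += counts[word] * idf
        return total

    scored = [(score(c), i) for i, c in enumerate(doc_counts)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Toy collection (made-up sentences, for illustration only).
docs = [
    "p53 activates the AMID gene promoter.",
    "Glucose metabolism in yeast.",
    "The p53 protein and p53 target genes.",
]
ranking = rank("p53 promoter", docs)
```

Documents that share no word with the query are dropped, and the remaining ones are ordered by score; a production search engine would add stemming, field weighting and an inverted index.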
Systems based on IR techniques include:
- Textpresso [1] uses a custom ontology to query a collection of documents for
information on specific classes of biological concepts (e.g. gene, allele or cell) and
their relations (e.g. association and/or regulation);
- Query Chem [2] combines chemical structure with text-based IR, using chemical
databases and a Web API (Google) to retrieve information and relations between
compound structures and their properties;
- iHOP [3] visualizes the interactions between genes;
- EBIMed [4] retrieves sentences based on detecting co-occurrences between
biological entities;
- KLEIO [5] is an advanced information retrieval (IR) system which offers textual and metadata
searches across MEDLINE and provides enhanced searching functionality by
leveraging terminology management technologies such as named entity recognition
and acronym disambiguation;
- FACTA [6] finds associated named entities across MEDLINE;
- GoPubMed [7] is a thesaurus-driven, GO-based abstract-classification system;
- PubMatrix [8] searches PubMed, comparing lists of terms (for example, genes or
proteins) given by the user with a set of functionalities, and outputs several papers
associated with the list of terms and the functionalities;
- PubFinder [9] leverages a small set of seed abstracts provided by the user that are
relevant to a specific scientific topic. The result is a ‘hit-list’ of references ranked
according to likelihood.

In information extraction the text is processed to identify predefined facts. In
the biomedical domain, for example, these are facts about genes, proteins, diseases, and chemical entities, and
relations among proteins or between proteins and other types of entities (e.g. diseases).

[1] http://www.textpresso.org/
[2] http://www.QueryChem.com
[3] http://www.ihop-net.org/UniPub/iHOP/
[4] http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
[5] http://www.nactem.ac.uk/software/kleio/
[6] http://text0.mib.man.ac.uk/software/facta/
[7] http://www.gopubmed.org/
[8] http://pubmatrix.grc.nia.nih.gov/
[9] http://www.glycosciences.de/tools/PubFinder

Information extraction can be used to improve information retrieval. The boundary between
the two types of text mining techniques is not fixed.
Named entity recognition (NER) is a special research field in text mining. It is concerned
with the extraction of textual fragments in text that represent objects of predefined semantic
types (e.g. gene names, disease mentions). Ontologists call such objects the instances of the
semantic classes or types.
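As a rough illustration of named entity recognition, the following sketch matches text against a small dictionary of names. The lexicon, the semantic type labels and the function name are hypothetical, and dictionary look-up is only one (simple) NER strategy; statistical taggers are discussed later in the tutorial.

```python
import re

# Hypothetical mini-lexicon: surface form -> semantic type.
LEXICON = {
    "p53": "Protein",
    "AMID": "Gene",
    "breast cancer": "Disease",
}


def find_entities(text, lexicon=LEXICON):
    """Return (start, end, surface form, semantic type) for each lexicon match."""
    # Try longer names first so multi-word names win over embedded shorter ones.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0)])
        for m in pattern.finditer(text)
    ]


sentence = "p53 activates the AMID promoter in breast cancer cells."
entities = find_entities(sentence)
```

Each returned tuple is an instance of a semantic type in the sense used above; the character offsets allow the annotation to be stored separately from the text.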

Information extraction and information retrieval are solutions that offer access to parts of an
abstract and parts of a sentence. To build efficient information extraction solutions the
following questions have to be answered: (1) what kinds of information can be extracted in a
systematic way, (2) how do we represent the information, (3) how do we solve problems that
come from the variability of the language, or the other way round, (4) how do we normalize
the extracted results to obtain a 'purified' view of the textual information, which is a lot more
condensed than the textual representation, and last but not least (5) how do we deal with shifts
in the textual representation of information, which are due to errors in the perception, naming
and classification of terms that have taken place in the past.
One core problem concerning information extraction in biology is the definition,
standardization and classification of biological names and terms. It is by their names that
genes, proteins and any other biological objects are identified in natural language text.
Sequence databases, e.g. UniProtKB [Appweiler, Lee], usually also provide unique
names for their database entries. In this way, links between different data sources can be
established, and information about the same biological object can thus be integrated from
different data sources. The integration enables classification and categorization of the
biological objects according to their similarities such as shared functions and similar
structures.
This need for standard names has led to the need for a nomenclature of standard terms that
mirror the characteristics of biological objects, e.g. function, structure, and localization. Any
of these nomenclatures is the starting point for an ontology, with which individuals and biological
groups or networks want to standardize information representation to enhance the exchange of
biological information. This standardized information can again be used to improve the
quality of the information extraction tools.
2.2 Selected types of data that text mining can deliver
Text mining in the biomedical domain has mainly focused on the identification of named
entities such as names of proteins, genes, and diseases. Increasingly, sophisticated text mining
techniques (e.g. full parsing) are applied to increase the precision of biomedical text
mining systems.
We can distinguish the following types of facts relevant to the biomedical domain. Note that
the types are not disjoint. For example, the representation of a point mutation might also
represent a gene.
- Identification of named entities, e.g., genes, proteins, diseases, cell types, species, and
drug brand names (Adamic et al., 2002; Rindflesch et al., 2000).
- Identification of linguistic expressions of entity types that are not named entities, e.g., point
mutations, karyotypes, chemicals (Rebholz-Schuhmann et al., 2004).
- Identification of ontological terms (Couto et al., 2006; Gaudan et al., 2008).
- Identification of associations, i.e. pairs of named entities that appear together
(Jenssen et al., 2001).
- Identification of semantic relations between named entities (Blaschke et al., 1999;
Marcotte et al., 2001).
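The notion of an association as a pair of named entities that appear together can be sketched as follows. The sketch assumes that entity recognition has already been done and that "together" means "in the same sentence"; the input data and the function name are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences):
    """Count unordered pairs of distinct entities mentioned in the same sentence."""
    counts = Counter()
    for entities in sentences:
        # Sorting gives every pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


# Each inner list holds the entity names recognized in one sentence (toy data).
tagged = [["p53", "MDM2"], ["MDM2", "p53", "AMID"], ["AMID"]]
pairs = cooccurrences(tagged)
```

A co-occurrence count says nothing about the kind of relation between the two entities; identifying the semantic relation is the separate, harder task listed in the last bullet.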

2.3 Basic software components for text mining


A number of well-defined software components are used for the processing of natural
language text. The goal is to separate the natural language text into components that can be
annotated either with syntactic information or with semantic information. The syntactic
information reports on the grammatical roles of the components in the text, and the semantic
information on their meaning.
Tokenization is one of the most basic steps in text processing. In this step the text is
separated into single tokens, e.g. words, so that a sentence is not treated as a mere sequence of characters.
Tokenization of biomedical text requires special solutions, since punctuation, e.g. commas
and different types of hyphens [10], has special meanings.
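A minimal tokenizer sketch shows why hyphens need special care. The rule used here (keep hyphens, slashes and apostrophes that sit between alphanumeric characters inside the token) is an assumption chosen for illustration, not the rule of any particular biomedical tokenizer.

```python
import re

# A token is either a run of letters/digits, optionally joined by internal
# hyphens, slashes or apostrophes (e.g. "NF-kB", "IL-2"), or a single
# non-space character such as a bracket or a comma.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/'][A-Za-z0-9]+)*|\S")


def tokenize(text):
    """Split text into tokens, keeping hyphenated names in one piece."""
    return TOKEN.findall(text)


tokens = tokenize("NF-kB binds the IL-2 promoter (in T-cells), not IL-4.")
```

A naive split on every punctuation character would break "IL-2" into three tokens and lose the protein name; the pattern above keeps it whole while still separating brackets, commas and the final full stop.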
Morphological analysis is used to identify morphological features of the tokens such as
lemma and plurality.
Part-of-speech tagging (PoS tagging) refers to the processing step where a token is labelled
with its syntactic category in the sentence (e.g. distinction between a noun and a verb). PoS
tagging takes contextual information into consideration (machine learning or rule-based
approaches). Taggers for biomedical text mining have to be trained on biomedical texts (e.g.,
GENIA tagger [11], Tsuruoka et al., 2005; Smith et al., 2004).
Chunking is used to separate the text into syntactically complete pieces (called “chunks”)
that consist of one or several tokens. The GENIA tagger also functions as a shallow parser.
Parsing is a processing step that delivers the syntactic and/or semantic structure of a
sentence. The structure explains which components of the sentence group together and how
they act on other components of the sentence syntactically and/or semantically.
Deep or full parsing has the advantage that it allows us to make easy generalizations for more
than one type of biological interaction. It finds deep syntactic relations from the whole of a
sentence, for example, a relation between a passive verb and its semantic object. In the sentence ‘We
demonstrated that the human AMID gene promoter was activated by p53’, p53 is the
semantic subject of activated, and human AMID gene promoter is its semantic object, although
it is the grammatical subject of the passive clause. To achieve this
generalisation, we use predicate argument structures (PAS), which are canonical representations of
sentence meanings that represent relations in an abstract manner. PAS are normalized forms
representing syntactic relations, as in the example ‘ENTITY1/NN activate/VB
ENTITY2/NN’. In this structure, activate is the predicate, which contains the main meaning
of the predicate argument structure, and ENTITY1 and ENTITY2 are its arguments, carrying
information about the participants described by the predicate. Enju [12] is a wide-coverage full
parser.
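The normalization that a predicate argument structure provides can be sketched with two hand-written surface patterns. This is only an illustration of the target representation: a full parser such as Enju derives these structures from a grammar, not from regular expressions, and the pattern and function names here are hypothetical.

```python
import re

# Two illustrative surface patterns for the verb "activate".
PASSIVE = re.compile(r"(?P<obj>.+?) (?:was|were|is|are) activated by (?P<subj>.+)")
ACTIVE = re.compile(r"(?P<subj>.+?) activates? (?P<obj>.+)")


def to_pas(clause):
    """Map an active or passive 'activate' clause to (predicate, arg1, arg2)."""
    clause = clause.rstrip(".")
    # The passive pattern is tried first; both yield the same argument order.
    for pattern in (PASSIVE, ACTIVE):
        m = pattern.fullmatch(clause)
        if m:
            return ("activate", m.group("subj"), m.group("obj"))
    return None


passive = to_pas("the human AMID gene promoter was activated by p53.")
active = to_pas("p53 activates the human AMID gene promoter.")
```

Both clauses map to the same triple, with p53 as the first argument and the promoter as the second, which is the generalization over surface variation that the text attributes to deep parsing.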
2.4 Evaluation of text mining systems
Information extraction systems are evaluated with the two measures called recall and
precision.
Precision is the percentage of findings that are correctly identified by a method amongst all
findings extracted by the method from the text (= 100 * true positives / (true positives + false
positives)).
Recall is the percentage of facts correctly identified by a method amongst all facts mentioned
in the text (= 100 * true positives / (true positives + false negatives)).
F-measure is the harmonic mean of precision and recall (= 2 · (precision · recall) /
(precision + recall)); in this balanced form both are weighted equally.
Recall is generally difficult to measure, since a set of documents has to be examined to
determine all facts contained in the documents.
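The three measures translate directly into code. The sketch below follows the formulas given above; the function name and the example counts are made up for illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Return precision and recall (in percent) and their balanced F-measure."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Example: 30 correct findings, 10 spurious findings, 20 missed facts.
p, r, f = precision_recall_f(tp=30, fp=10, fn=20)
```

With these counts precision is 75%, recall is 60%, and the F-measure lies between the two, closer to the lower value, as is characteristic of a harmonic mean.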
Recent evaluation challenges in BioText Mining include:
1. KDD Challenge Task 1 (2002) IE from Biomedical Articles [13]
2. TREC Genomics Track (2003, 2004) [14]
3. BioCreAtIvE (2004, 2007) [15]
4. BioNLP, JNLPBA [16]

[10] http://www.cs.rochester.edu/meetings/hlt-naacl07/tutorials/bionlp.html
[11] http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
[12] http://www-tsujii.is.s.u-tokyo.ac.jp/enju
[13] http://www.biostat.wisc.edu/craven/kddcup/tasks.html

3 Building terminological, ontological resources and annotation projects
3.1 Building terminological resources
Automatic population of lexical resources such as the BioLexicon, term clustering, etc. will be
inserted here. Terminological resources for bio-text mining are important links between text
and ontologies. The BioLexicon is such a resource, containing rich linguistic information and
links to ontologies. We will explain how we populate a biologically motivated
terminological resource useful for a multitude of text mining applications.

3.1.1 Term extraction, term variation and normalization for building terminological
and ontological resources: BioLexicon (SA)
Terms are the backbone of specialized knowledge because they denote the biological entities
of the documents. Unfortunately, the naming of biological entities is often inconsistent and
imprecise. Metabolites, proteins and genes often have a variety of names (terms) denoting
the same concept. For example, the metabolite glucose-6-phosphate is referred to by variants
and permutations of α or β, d- or l-glucose (or hexose)-6-(mono)-phosphate. Furthermore,
within the same text a term can be given in an extended compounded form and then later
expressed through various mechanisms, including orthographic variation (usage of hyphens
and slashes, e.g. amino acid and amino-acid), lower and upper case (NF-KB and NF-kb),
spelling variations (tumour and tumor), various Latin and/or Greek transliterations (oestrogen
and estrogen), and abbreviations (RAR and retinoic acid receptor). Therefore, a term is
increasingly viewed as an equivalence class of term forms, the rich variety of which has to be
recognized, indexed, linked and mapped to the abundant biological databases and ontologies.
Ontologies are crucial for text mining because they provide semantic interpretation to text and
also constrain the possible interpretations of biological entities (terms). When we provide
semantic interpretation to text, we link terms to concepts in ontologies, and textual
evidence is in turn used to update and to maintain existing ontologies.
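The view of a term as an equivalence class of term forms can be sketched by mapping every form to a normalized key. The normalization rules and the two-entry spelling table below are assumptions for illustration; they cover only the orthographic, case and spelling variation mentioned above, not abbreviations such as RAR.

```python
import re
from collections import defaultdict

# Illustrative spelling map; a real resource would be far larger.
SPELLING = {"tumour": "tumor", "oestrogen": "estrogen"}


def normalize(term):
    """Reduce a term form to a key shared by its orthographic variants."""
    key = term.lower()
    # Treat hyphens, slashes and runs of whitespace as a single space.
    key = re.sub(r"[-/\s]+", " ", key).strip()
    return " ".join(SPELLING.get(word, word) for word in key.split())


def group_variants(terms):
    """Group term forms into equivalence classes by their normalized key."""
    classes = defaultdict(list)
    for term in terms:
        classes[normalize(term)].append(term)
    return dict(classes)


variants = group_variants(
    ["amino acid", "amino-acid", "NF-KB", "NF-kb", "tumour", "tumor"]
)
```

Looking up the normalized key in a dictionary is also much cheaper than comparing every pair of term forms, which matters when the resource holds millions of names.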
We will describe the process of populating the BioLexicon with new terms (named entities)
which are automatically extracted from text. The BioLexicon has been developed within the
framework of the BOOTStrep project (National Centre for Text Mining, ILC, EBI). We will
cover:
a) A method for named entity recognition (NER) used to populate the BioLexicon with
term variants extracted from biomedical literature.
b) A method developed for term mapping to UniProt accession numbers through term
normalization. Since the BioLexicon contains two million gene/protein names,
straightforward similarity calculation over term pairs is not practical. Linking
relies on smart dictionary look-up in large-scale gene/protein name dictionaries, including
GENA, ProMiner and BioThesaurus. These dictionaries contain variants of names as
well as canonical names.
c) How to link the BioLexicon with a bio-ontology (GRO).

(see section iii of bibliography)

[14] http://trec.nist.gov/pubs/trec12/t12_proceedings.html
[15] http://biocreative.sourceforge.net/
[16] http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html

3.2 Building an ontological resource: the example GRO


The basic structure of the Gene Regulation Ontology (GRO) was developed manually based
on textbook knowledge of biology. The construction took into consideration terminology
already available from the UMLS [17] data resources. Additional work was required to populate
the GRO and interlink it with existing ontological resources. In particular, the OBO
ontologies were screened for concepts that are suitable to describe genetic regulation.
The following ontologies contributed the majority of concepts: the Gene Ontology (GO) [18], the
Sequence Ontology (SO) [19], the Molecule Role Ontology (IMR) [20], the ontology for Chemical
Entities of Biological Interest (ChEBI) [21], and the Event Ontology (IEV). All relevant classes
were extracted and integrated into the GRO. For all reused classes, the references to the
source terminologies were kept. In addition, relevant data was taken from the NCBI
taxonomy [22] and the transcription factor database TransFac [23]. Table 1 lists all selected sources
and the semantic types that were deduced from the sources.

Resource                   URL                              Relevant information

Gene Ontology (GO)         http://geneontology.org/         molecular functions, biological processes, cellular components
Sequence Ontology (SO)     http://sequenceontology.org/     sequence regions and attributes of sequence regions
ChEBI                      http://www.ebi.ac.uk/chebi/      chemical entities
INOH Molecule Role (IMR)   http://www.inoh.org/             transcription factors and their functional domains
INOH Event Ontology (IEV)  http://www.inoh.org/             biological processes
NCBI taxonomy              http://130.14.29.110/Taxonomy/   species (eukaryotes, prokaryotes)
TransFac                   http://www.gene-regulation.com/  classification of transcription factors, domains of transcription factors

Table 1: Conceptual resources for the population of the GRO

[17] http://www.nlm.nih.gov/pubs/factsheets/umls.html
[18] http://geneontology.org/
[19] http://sequenceontology.org/
[20] http://www.inoh.org
[21] http://www.ebi.ac.uk/chebi/
[22] http://130.14.29.110/Taxonomy/
[23] http://www.gene-regulation.com/

The GRO covers classes for the core events relevant to the domain. This comprises
fundamental gene regulatory events such as regulation of transcription and also the
participants involved in the events such as transcription factor and regulatory region. A large
portion of the classes in GRO were supplied with textual definitions, to a large extent
taken from external resources.
GRO distinguishes itself from most other biomedical ontologies that are currently
available by making use of a larger number of relations in the ontological
structure. As a result, classes are highly interlinked by various
manually encoded relations and their reciprocal counterparts, which are mentioned in
parentheses in this text. The relation partOf (hasPart) is used to relate spatial, temporal, or
procedural parts to the whole. For example, the class ProteinDomain is partOf the class
Protein and the class TranscriptionInitiation is partOf the class Transcription. Wholes and
their parts must belong to the same upper ontological category, i.e., a continuant can only have
continuants as parts and an occurrent can only have occurrents as parts.
The relation fromSpecies relates species-specific classes to the species they belong to, such as
Bacterial RNA Polymerase is fromSpecies Bacterium. Also, continuants and occurrents are
related to the processes in which they are involved by the relation participatesIn
(hasParticipant) or by its two subrelations agentOf (hasAgent) or patientOf (hasPatient). For
example, consider the event Regulation of Transcription. In the GRO it is defined in the
following way:

Regulation of Transcription ...
- isA RegulatoryProcess
- hasAgent some TranscriptionRegulator
- hasPatient some Transcription
- hasPatient only Transcription
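
The meaning of the "some" and "only" restrictions in the definition above can be made concrete with a small checker. The dictionary encoding and the function name are hypothetical, and the sketch compares class names literally, so it ignores the subclass reasoning that an ontology reasoner would perform.

```python
# Hypothetical in-memory encoding of the class definition quoted above.
REGULATION_OF_TRANSCRIPTION = {
    "isA": "RegulatoryProcess",
    "some": {"hasAgent": "TranscriptionRegulator", "hasPatient": "Transcription"},
    "only": {"hasPatient": "Transcription"},
}


def satisfies(event, definition):
    """Check an event (relation -> list of filler class names) against a definition."""
    # 'some': at least one filler of the relation belongs to the required class.
    for relation, cls in definition["some"].items():
        if cls not in event.get(relation, []):
            return False
    # 'only': every filler of the relation belongs to the required class.
    for relation, cls in definition["only"].items():
        if any(filler != cls for filler in event.get(relation, [])):
            return False
    return True


good = {"hasAgent": ["TranscriptionRegulator"], "hasPatient": ["Transcription"]}
bad = {"hasAgent": ["TranscriptionRegulator"], "hasPatient": ["Transcription", "Protein"]}
no_agent = {"hasPatient": ["Transcription"]}
```

The event `bad` fails because the "only" restriction forbids any patient other than a Transcription, while `no_agent` fails because the "some" restriction demands at least one TranscriptionRegulator agent.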

3.3 Biological Annotation and Linguistic Annotation


In the bio-medical domain, not only do special terms appear, but common words
are also often used with different meanings. Because of this, we need to re-train or adapt NLP/TM
programs for the domain. For example, since the statistical distributions of sequences of
part-of-speech (POS) tags and of syntactic trees in biomedical text are different from those in the general
domain, POS taggers and syntactic parsers have to be adapted for the domain. This adaptation
requires corpora annotated in terms of POS and syntactic trees. This type of annotation
focuses on linguistic aspects of text and is called linguistic annotation.
On the other hand, text annotation can be based on biological knowledge and is then called biological
annotation. Term annotation is a typical biological annotation. Biologically relevant entities
such as proteins, genes, cell components, and diseases are expressed in text as terms, and
these terms can be classified into different semantic classes. In term annotation, each
occurrence of a term in text is marked with its semantic class. Event or relation annotation is
another example of biological annotation, where occurrences in text of biologically interesting
events are annotated as such; an event usually comprises an event predicate and a few
arguments such as AGENT, THEME, LOCATION, CAUSE, etc.
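An event annotation of this kind can be represented as a predicate span plus role-labelled argument spans. The class names, the semantic labels and the example sentence below are hypothetical and are not the data model of any corpus mentioned in this tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A stretch of text given by character offsets, with a semantic label."""
    start: int
    end: int
    label: str


@dataclass
class Event:
    """An event predicate with its arguments, keyed by role name."""
    predicate: Span
    arguments: dict = field(default_factory=dict)  # role name -> Span


text = "p53 activates the AMID promoter."
event = Event(
    predicate=Span(4, 13, "Activation"),
    arguments={
        "AGENT": Span(0, 3, "Protein"),
        "THEME": Span(14, 31, "Promoter"),
    },
)


def surface(span, source=text):
    """Return the text covered by a span."""
    return source[span.start:span.end]
```

Because spans refer to the text by offsets only, term annotation and event annotation can be layered over the same sentence without modifying it.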
Annotated corpora are central to the production of semantic frame repositories, and much effort has
recently been devoted to the development of domain-specific corpora within the biomedical
field (see among others Kim and Tsujii, 2006). There now exist several biomedical corpora
with event-level annotations and domain-specific semantic frame repositories, e.g. Chou et al.
(2006), Dolbey et al. (2006), Kim et al. (2008), Kulick et al. (2004), Pyysalo et al. (2007) and
Wattarujeekrit et al. (2004).
We will make a brief survey of publicly available annotated corpora such as GENIA,
PennBio, GENETAG, AIMED, BioCreative Corpora, etc.
3.3.1 Combining Linguistic with Biological Annotation
We will present how we built a bio-event annotated corpus of biomedical literature
(BOOTStrep project, FP6 - 028099). The focus is on gene regulation events in a corpus of
MEDLINE abstracts on the subject of E. coli. Events described both by verbs and by
nominalised verbs, such as transcription or expression, are annotated. Annotation consists of
identifying semantic arguments of the event within the same sentence, and labelling them
with event-independent semantic roles and named entity (NE) types. Annotation is carried out
using a version of the WordFreak annotation tool (Morton and LaCivita, 2003) that has been
customised to the task. The resulting annotated corpus contains 677 abstracts.
The annotated corpus is designed to facilitate acquisition of semantic frames, via the
application of a machine-learning algorithm, for inclusion within the BioLexicon. This is one
of the resources being produced as part of the BOOTStrep project (together with the
bio-ontology described above) for use by the biological text mining community.
We will present the design of a scheme for annotating events in biomedical texts and how we
have automatically extracted bio-event frames from the E. coli corpus. Results from this
provide further evidence regarding the effectiveness and adequacy of the annotation scheme
for its main intended purpose (Thompson et al., 2008).

3.3.2 Biological Annotation and Ontological Resources


A growing number of structured knowledge resources or ontologies in the biology domain are
providing guidance to researchers who perform large-scale data annotations for text mining
and other language technologies. Some well-known examples include MeSH, SNOMED CT
and the ontologies of the Open Biomedical Ontologies Foundry (OBO), among many others.
These loosely called "ontologies" are not proper ontologies in the philosophical sense. In fact,
their structural differences reflect differences in their functional goals and in their
appropriateness for certain tasks.
We first establish two major criteria for selecting ontologies for biological text annotation and
then show how these criteria work for term and event annotation. The two criteria are:
- Granularity of ontologies for biological text annotation: A structured knowledge
resource for term-level annotation should represent terms and relations that are
commonly expressed in contiguous spans of text in the target domain.
- Interoperability of ontology: A structured knowledge resource for term-level
annotation should represent consistently-defined relations among terms that bear
meaningful correspondences with other authoritative knowledge resources in the
target domain.
3.3.3 Term/Event Annotation
We take the following biological annotation efforts as examples and summarize the
difficulties and their solutions.
Term Annotation in GENIA: The GENIA corpus is a collection of 2,000 abstracts chosen
from those retrieved from MEDLINE by the search query "Humans" [MeSH] AND "Blood
Cells" [MeSH] AND "Transcription Factors" [MeSH]. 93,293 bio-medical terms were
annotated and classified into 35 classes of the GENIA term ontology.
Event Annotation in GENIA: The event annotation in GENIA was made on half of the
GENIA corpus, consisting of 1,000 MEDLINE abstracts. It contains 9,372 sentences in which
36,114 events are identified. The major challenges during event annotation were (1) to select
an appropriate subset of the ontological classes in GO which meets the specific requirements of
text annotation, (2) to achieve biology-oriented annotation which reflects biologists'
interpretation of text, and (3) to ensure the homogeneity of annotation quality across
annotators. To meet these challenges, we introduced new concepts such as Single-facet
Annotation and Semantic Typing.
Gene Regulation Annotation in BOOTStrep: Annotation was made on 677 abstracts in
MEDLINE on the subject of E. coli. All annotated events are gene
regulation events, and no further subclassification of events was made. Annotation was made
using rich semantic frames which consist of a rich set of semantic roles such as MANNER,
INSTRUMENT, LOCATION, SOURCE, TEMPORAL, RATE, etc. 1,158 events agreed on by at
least two annotators were annotated in a subset of 80 abstracts, while
3,612 events were annotated in 597 abstracts by single annotators.
Transcription Regulation Annotation in BOOTStrep: 309 abstracts in MEDLINE on the
subjects of E. coli and human transcription regulation have been annotated with three types of
events: regulation of gene expression (ROGE), regulation of transcription (ROT), and binding
of transcription factor to gene regulatory region (TFBRR). The goal of the annotation is to
populate databases of transcription regulation (e.g. RegulonDB, TransFac) with transcription
regulation events found in the literature and to provide a corpus for evaluating
systems that have the same goal of database population. Two curators have so far
annotated 207 ROGE, 298 ROT, and 74 TFBRR events on the corpus. A cross-evaluation of
the two curators' annotations shows error rates of 11-32%, indicating that
interactive training is required to obtain high-quality annotation results.
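How such a cross-evaluation might be computed is sketched below. The source does not state the exact procedure, so the definition used here (the share of one curator's events that the other curator did not annotate) is an assumption, and the event tuples are placeholders rather than real annotations.

```python
def disagreement_rate(own, other):
    """Percentage of 'own' events that the other curator did not annotate."""
    own, other = set(own), set(other)
    if not own:
        return 0.0
    return 100.0 * len(own - other) / len(own)


# Placeholder events: (event type, transcription factor, target).
curator_a = {
    ("ROT", "TF1", "geneA"),
    ("ROGE", "TF2", "geneB"),
    ("TFBRR", "TF3", "regionC"),
    ("ROT", "TF4", "geneD"),
}
curator_b = {
    ("ROT", "TF1", "geneA"),
    ("ROGE", "TF2", "geneB"),
    ("TFBRR", "TF3", "regionC"),
}

a_vs_b = disagreement_rate(curator_a, curator_b)
b_vs_a = disagreement_rate(curator_b, curator_a)
```

The measure is asymmetric, which is one possible reason for a range of rates rather than a single figure when two curators are compared in both directions and per event type.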

3.3.4 Annotation Tools


For efficiency of text annotation, good annotation tools are crucial. There are many tools
available, with different functionalities. We illustrate the functions which these annotation
tools provide by taking the XConc (XML-based Concordancer) Suite as an example.
XConc is an integrated annotation environment providing an XML editor, a concordancer and
an ontology browser which all interact with each other. For example, the users can retrieve
existing annotations and view the concordance in KWIC (keyword in context) format. It
supports multi-layered annotations such as annotations of POS, syntactic trees, terms, events,
co-references, etc.
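The KWIC (keyword in context) view mentioned above can be sketched in a few lines. This is a generic illustration of the display format with hypothetical names, not the implementation used in XConc.

```python
def kwic(tokens, keyword, width=2):
    """Return one (left context, keyword, right context) triple per occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "p53 activates the AMID promoter and p53 represses other genes".split()
hits = kwic(tokens, "p53")
```

Aligning all occurrences of a keyword with a fixed window of context lets an annotator check quickly whether the same term has been annotated consistently.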
The XConc Suite is implemented on top of Eclipse, a widely used, general-purpose software
development platform. This provides the XConc Suite with general functionality for software
development, including project and file management and version control. A proper version
control system, like CVS supported by Eclipse, is particularly crucial for long-term software
development.
4 Information Extraction of named entities and ontological concepts (IE-1)
This part of the tutorial explains the components of the linguistic analysis of biomedical text
(e.g., POS tagging, shallow parsing, syntactic parsing, deep syntactic parsing) and discusses the
types of IE approaches used in the biomedical domain (e.g., pattern matching of terminology, full
syntactic parsing, and machine learning based approaches, for instance for disambiguation).
It also focuses on the extraction of named entities and ontological concepts from the literature
based on statistical methods. Furthermore, it presents solutions for mining ontological concepts
from the scientific literature (feeding ontologies from the literature) and for annotating
genes/proteins with Gene Ontology (GO) terms.
4.1 Identification of GO concepts in the literature
The identification of Gene Ontology (GO) terms in the literature delivered the features used to
build the concept profiles. We relied on an approach that assigns weights to the terms in the GO
resource according to their specificity and seeks evidence in the text that supports each GO
term (Gaudan et al., submitted).
The GO lexicon is processed to extract properties for all lexical items, such as the frequency
of individual words in the terminological set. From these, the specificity of a term is
calculated based on the information content of its tokens. When processing the text, two further
parameters are measured on a given zone in the text (a zone is any stretch of text such as a
noun phrase, a sentence or an abstract). One is the evidence for a term, scored by the number of
its tokens that have been matched in the text of the zone. The other is the proximity between
all the words matched in the zone. The three parameters (specificity, evidence and proximity) are
combined in a scoring function (the product of the individual measures) that allows
prioritization of the GO terms extracted from a document. The approach has been developed and
optimized for the identification of GO terms. In direct comparison to published techniques (FiGO,
GoCAT), our approach showed improved performance (Couto et al., 2005; Ruch, 2006; Gaudan et al.,
submitted). Precision and recall were both 34% for terms ranked first in the assessment
against the BioCreAtIvE I corpus.
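The scoring idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the published method: the concrete formulas (mean information content for specificity, fraction of matched tokens for evidence, matched tokens divided by the span they cover for proximity), the function names and the four-term toy lexicon are all chosen for this example. Only the overall structure, a product of the three measures, is taken from the text.

```python
import math
from collections import Counter

def token_information_content(lexicon):
    """Information content -log2(p) of each word, estimated over all lexicon terms."""
    counts = Counter(tok for term in lexicon for tok in term.lower().split())
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}

def score_term(term, zone_tokens, ic):
    """Score = specificity * evidence * proximity (illustrative formulas only)."""
    term_tokens = set(term.lower().split())
    zone = [tok.lower() for tok in zone_tokens]
    # First position in the zone of every term token that occurs there.
    matched = sorted({zone.index(tok) for tok in term_tokens if tok in zone})
    if not matched:
        return 0.0
    specificity = sum(ic.get(tok, 0.0) for tok in term_tokens) / len(term_tokens)
    evidence = len(matched) / len(term_tokens)
    proximity = len(matched) / (matched[-1] - matched[0] + 1)
    return specificity * evidence * proximity

# Toy lexicon and zone, for illustration only.
lexicon = ["regulation of transcription", "regulation of translation",
           "transcription factor binding", "protein binding"]
ic = token_information_content(lexicon)
zone = "the factor controls regulation of transcription in yeast".split()
ranked = sorted(lexicon, key=lambda term: score_term(term, zone, ic), reverse=True)
print(ranked[0])
```

In this toy zone the fully and contiguously matched term ranks first, while a term with scattered or partial matches is penalised by the proximity and evidence factors.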
In our analysis, we focused on the selection of the most specific terms from the GO taxonomy,
assuming that the GO terms closest to the leaves of the taxonomy are the most specific ones for
annotating the cellular mechanisms behind a disease or a gene. Only the leaf nodes and their
direct neighbours, i.e. nodes one level up, have been used to build the concept profiles. This
selection reduced the number of eligible GO terms from 24,903 to 16,220.
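The selection of leaf nodes and their direct parents can be sketched as below. The representation of the taxonomy as a dictionary from each term to its set of parents, the function name and the five-term toy hierarchy are assumptions for this example; they do not reproduce the actual GO structure or the figures quoted above.

```python
def leaves_and_parents(parents_of):
    """Select leaf terms plus the terms one level above them.

    `parents_of` maps each term to the set of its direct parents (is_a edges).
    """
    has_child = {parent for parents in parents_of.values() for parent in parents}
    all_terms = set(parents_of) | has_child
    leaves = all_terms - has_child
    selected = set(leaves)
    for leaf in leaves:
        selected |= parents_of.get(leaf, set())
    return selected

# Hypothetical miniature taxonomy, for illustration only.
toy_taxonomy = {
    "biological process": set(),
    "metabolic process": {"biological process"},
    "transcription": {"metabolic process"},
    "regulation of transcription": {"transcription"},
    "translation": {"metabolic process"},
}
selected_terms = leaves_and_parents(toy_taxonomy)
print(sorted(selected_terms))
```

Here the root term is dropped because it is neither a leaf nor a direct parent of one, which mirrors how the selection removes the most general terms.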

5 Information Extraction of relations and events (IE-2)


Information extraction (IE) is defined as the automatic extraction of structured
information, i.e. categorized and contextually and semantically well-defined data from a
certain domain, from unstructured machine-readable documents.
The goal is to extract information from text without requiring the end user of the information
to read the text. By contrast, after using a search engine, the user must still read each
document to learn the facts reported in it. IE can be used to support a fact-retrieval service or
as a step towards text mining based on conceptually annotated text. We use a text mining pipeline
that leads from unstructured to annotated data through part-of-speech tagging, named entity
recognition and syntactic analysis (parsing), drawing on external resources (i.e. dictionaries
and ontologies). Each module enhances the text representation with a layer of annotation, which
represents explicit linguistic and/or semantic information attached to the text in machine-usable
form. A human reader infers such information using (i) linguistic and general knowledge, and
(ii) domain-specific expertise. However, for the text to be analysed automatically at a higher
semantic level, such knowledge has to be explicitly represented in a machine-readable form.
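The idea that each module adds a layer of annotation to the text can be sketched with stand-off annotations, i.e. character offsets stored separately from the unchanged text. The class names, the naive tokenizer and the two-entry dictionary below are assumptions for this example and stand in for the real taggers and resources used in such a pipeline.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str

@dataclass
class Document:
    text: str
    layers: dict = field(default_factory=dict)  # layer name -> list of Annotation

def tokenize(doc):
    """Add a 'token' layer (naive: runs of word characters)."""
    doc.layers["token"] = [Annotation(m.start(), m.end(), "TOKEN")
                           for m in re.finditer(r"\w+", doc.text)]
    return doc

def tag_entities(doc, dictionary):
    """Add an 'entity' layer by looking each token up in a name dictionary."""
    doc.layers["entity"] = [
        Annotation(tok.start, tok.end, dictionary[doc.text[tok.start:tok.end]])
        for tok in doc.layers["token"]
        if doc.text[tok.start:tok.end] in dictionary
    ]
    return doc

# Hypothetical dictionary entries, for illustration only.
TOY_DICTIONARY = {"p53": "protein", "Peg3": "gene"}

doc = tag_entities(tokenize(Document("p53 induces Peg3 mRNA expression.")),
                   TOY_DICTIONARY)
for ann in doc.layers["entity"]:
    print(doc.text[ann.start:ann.end], ann.label)
```

Keeping the layers separate from the text means that later modules, such as a parser or an event extractor, can add their own layers without disturbing earlier ones.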
To detect and extract the types of evidence needed for hypothesis generation, we need a
semantic interpretation of the text, upon which we base relation extraction. Relation
extraction extracts pairs or triples of biological entities, for example, p53 induces Peg3 or
Pw1 mRNA expression. The first step of IE is named entity recognition (NER). Event and
relation extraction is the second step. Most event/relation extraction systems in the
biomedical domain have been concerned with extraction from single sentences, partly
because they deal with abstracts in MEDLINE.
We give a brief overview of event/relation extraction in the general domain. Some
extraction systems treat co-referential expressions, which IE in the biomedical domain will
also encounter when it processes full papers. Some IE systems in biology use pattern matching
approaches, which sometimes generalise poorly. Moreover, the closer the analysis is
to the text, the more patterns are needed to account for the large amount of surface
grammatical variation in texts. The main limitation of these approaches is that they require
some measure of semantic processing beyond pattern matching, one that is superior to both text
strings and annotations connected with surface analyses. Other approaches use a combination of
syntactic and semantic parsing.
5.1 Protein-protein interactions
Automated protein-protein interaction (PPI) extraction is a task of significant interest in the
biology domain. The most commonly addressed problem has been the extraction of binary
interactions, where the system identifies which protein pairs in a sentence have a biologically
relevant relationship between them. In the simplest framework, systems extract protein pairs
without classification of interaction relationships.
PPI extraction has been studied most intensively, and proposed solutions include hand-crafted
rule-based systems, machine learning approaches, shallow/deep parsing, etc. A wide range of
results has been reported for these systems. However, differences in evaluation resources,
metrics and strategies make direct comparison of these numbers problematic.
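As a reminder of how the figures quoted for such systems are typically computed, the following sketch evaluates extracted binary interactions against a gold standard. It assumes that interactions are undirected pairs and that a prediction counts as correct only on an exact match of both names; these are choices made for this example, and different choices are one reason why reported numbers are hard to compare. The protein names are placeholders.

```python
def evaluate_pairs(predicted, gold):
    """Precision, recall and F1 over undirected protein pairs."""
    predicted = {frozenset(pair) for pair in predicted}
    gold = {frozenset(pair) for pair in gold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder pairs, for illustration only.
gold_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
predicted_pairs = [("B", "A"), ("A", "C")]
precision, recall, f1 = evaluate_pairs(predicted_pairs, gold_pairs)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Treating pairs as sets makes ("B", "A") match ("A", "B"); an evaluation that also scores direction or interaction type would report lower figures for the same output.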
We first discuss problems related to evaluation and then survey some representative
systems and their performance. Particular emphasis is placed on
(1) machine learning approaches using annotated corpora
(2) approaches based on parsing
Several researchers have reported on their information extraction solutions, some of which
make reference to the verb forms that are integrated into their approach. The earliest work is
from (Sekimizu et al., 1998) who identify frequently used verbs in Medline to parse relations
amongst genes from the literature. (Blaschke et al., 1999) applied a text mining solution to
Medline abstracts that identifies keywords in conjunction with a selection of verbs. A similar
solution has been proposed by (Ono et al., 2001); it focuses on relations defined by
“interact”, “bind”, “associate” and “complex”, which are extracted with regular
expressions for syntactic patterns. Precision is around 94% and recall is around 85% (82.5-
86.8%).
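A keyword-and-pattern approach of this kind can be sketched with a single regular expression. The pattern below is a deliberately simple assumption made for this example and is not one of the rules used in the cited systems; it expects the protein names to be known in advance, and the sentence is a toy input.

```python
import re

def build_pattern(protein_names):
    """Compile 'PROTEIN [adverb] interacts/binds/associates with|to PROTEIN'."""
    # Longest names first so that the alternation prefers the longest match.
    names = "|".join(re.escape(name)
                     for name in sorted(protein_names, key=len, reverse=True))
    verbs = r"interacts?|binds?|associates?"
    return re.compile(
        rf"\b(?P<a>{names})\b\s+(?:\w+\s+)?(?P<verb>{verbs})\s+(?:with|to)\s+(?P<b>{names})\b"
    )

def extract_interactions(sentence, protein_names):
    """Return (protein, verb, protein) triples found by the pattern."""
    pattern = build_pattern(protein_names)
    return [(m.group("a"), m.group("verb"), m.group("b"))
            for m in pattern.finditer(sentence)]

# Toy input, for illustration only.
known_proteins = {"Rad51", "Brca2", "Sir2", "Sir4"}
sentence = ("Our data show that Rad51 directly interacts with Brca2, "
            "whereas Sir2 binds to Sir4.")
interactions = extract_interactions(sentence, known_proteins)
print(interactions)
```

A pattern like this misses passives, nominalizations ("the interaction of X with Y") and negation, which shows why the number of patterns grows quickly when the analysis stays close to the surface text.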
Several solutions have been proposed that match language patterns in the form of Finite State
Automata (FSA) against the scientific literature. (Pustejovsky et al., 2001) analyzed syntactic
language patterns for inhibitory events. Their system performed at a recall of 57% and a
precision of 90%. They did not offer any solution for PGN normalization. Part of their
solution is the processing of subordinate clauses, sentential coordination and anaphora
resolution. (Leroy et al., 2003) also applied cascaded FSAs to extract protein-protein
interactions from Medline abstracts. They reported 90% precision, but again did not apply
any PGN normalization. (Saric et al., 2006) extracted regulatory gene/protein networks from
Medline with cascaded FSAs. For their evaluation they opted for semantic correctness rather
than grammatical correctness and claim to have achieved 83 to 90% accuracy for
expression relations and 86 to 95% accuracy for phosphorylation relations.
(Park et al., 2001) applied Combinatory Categorial Grammar in conjunction with seven verbs
(including their inflections and noun phrases) denoting a positive regulatory effect (e.g.,
activate, stimulate) and five verbs denoting a negative regulatory effect (e.g., inhibit,
down-regulate). They address coordination, appositions and anaphoric expressions. They
claim 48% recall and 80% precision measured on a selection of 492 sentences.
(Friedman et al., 2001) use a system that parses text based on grammar rules (semantic
patterns, MedLEE). The grammar makes use of 22 terms denoting verbs and also nouns (e.g.,
apoptosis, myogenesis) that are categorized into 14 classes representing actions and
processes. All inflectional forms and nominalizations have been considered for all verbal
forms. NER for PGNs is based on BLAST. (Temkin et al., 2003) applied a context-free
grammar to the processing of Medline abstracts. They integrated 49 verb forms, their
inflectional forms and nominalizations, and achieved 63.9% recall at 70.2% precision. Temkin
thus reports the largest coverage of verbs among the proposed solutions. Finally, (Daraselia
et al., 2004) use a context-free grammar to identify PPIs and report 91% precision on Medline
abstracts (recall 21%).
5.2 Event/Relation Recognition
Systems more refined than those for binary PPI extraction are now emerging. Some systems
distinguish different types of interaction, while others extract relationships other than PPI
and extend relationships between pairs of proteins to those among more than two entities of
different classes.
We will demonstrate two systems that extract relations and events from MEDLINE based on full
parsing: MEDIE [24] (Miyao et al., 2006) and InfoPubMed [25]. Both are provided as services to
the Life Sciences community by the UK National Centre for Text Mining [26].
The advantage of full parsing is that we can easily make generalizations for more than one
type of biological interaction. To achieve this generalisation, we use predicate argument
structures, which are canonical representations of sentence meanings that represent relations
in an abstract manner.
As an example of such systems, we take up MEDIE, an event extraction system based on deep
parsing and machine learning, which has been developed by the University of Tokyo and is used
as a service system at the National Centre for Text Mining, UK. The system uses a subset of the
ontological classes of GO to define event classes and uses the results of a deep parser as
features for machine learners.
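The generalisation offered by predicate argument structures can be illustrated with a toy normaliser that maps an active and a passive sentence onto the same structure. This is not how a deep parser works: the two regular expressions and the small verb lexicon are hand-written stand-ins, assumed for this example, for the output that a full parser would provide.

```python
import re
from typing import NamedTuple, Optional

class PAS(NamedTuple):
    predicate: str
    arg1: str  # the entity that acts
    arg2: str  # the entity that is acted upon

# Hypothetical verb lexicon mapping surface forms to a base form.
LEMMA = {"activates": "activate", "activated": "activate",
         "inhibits": "inhibit", "inhibited": "inhibit"}

PASSIVE = re.compile(r"^(\w+) is (\w+) by (\w+)\.?$")
ACTIVE = re.compile(r"^(\w+) (\w+) (\w+)\.?$")

def to_pas(sentence: str) -> Optional[PAS]:
    """Map 'X verbs Y' and 'Y is verbed by X' to the same structure."""
    match = PASSIVE.match(sentence)
    if match and match.group(2) in LEMMA:
        return PAS(LEMMA[match.group(2)], arg1=match.group(3), arg2=match.group(1))
    match = ACTIVE.match(sentence)
    if match and match.group(2) in LEMMA:
        return PAS(LEMMA[match.group(2)], arg1=match.group(1), arg2=match.group(3))
    return None

active_pas = to_pas("p53 activates Peg3.")
passive_pas = to_pas("Peg3 is activated by p53.")
print(active_pas, passive_pas)
```

Once both surface forms are reduced to one structure, a single extraction rule or feature written over that structure covers both, which is the generalisation the text refers to.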

[24] http://www-tsujii.is.s.u-tokyo.ac.jp/medie/
[25] https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/
[26] www.nactem.ac.uk


5.2.1 Characteristics of Bio-Event Extraction and Domain Adaptation


Compared with event extraction in the general domain, the extraction of bio-events has a few
characteristics that require special techniques. Since texts in the biology domain differ
significantly from those in the general domain, systems for the general domain have
to be adapted. NEs in the biology domain are mostly common nouns (not proper nouns) and
tend to be long concatenated noun phrases, which require specialized NER systems. Sentences in
abstracts contain many more coordination constructions than general texts, which require
special treatment. We discuss how these characteristics affect event extraction programs and
how domain adaptation can be done.
• Use of full parsing for the identification of related named entities (Enju, MEDIE)
• Event extraction, with examples going from literature to pathway (PathText [27])
• Use of information extraction to identify semantic roles, and use of the extracted results
for semantic support of information retrieval (KLEIO [28]; FACTA [29])
• Automatic extraction of pathway information: objectives, obstacles

5.3 Identification of gene regulatory events


The reasoning facilities that come with the OWL language family are needed for processing
tasks such as the analysis of complex representations of events, which requires the
identification of the participants involved and their relations. We are currently investigating
a system that extracts mentions of events involving gene regulation from the scientific
literature by using the GRO. The system first recognizes entities such as gene, protein, and
transcription factor names and labels them with classes in the continuant branch of the GRO,
particularly with the two classes “gene” and “transcription regulator”. The entity recognition
utilizes UniProtKB [30] and RegulonDB [31] as sources for names of the two classes. The system
then identifies instances of the classes in the occurrent branch by matching linguistic patterns
of keywords of the classes (e.g. ‘regulate’ for “regulatory process”). It finally deduces complex
information by employing rules written in SWRL [32], thus capturing sophisticated forms of
biological knowledge. For example, the system may identify two instances of the classes
“regulation of gene expression” and “binding of transcription factor to gene regulatory
region”, which share instances of “transcription regulator” and “gene” as their agent and
patient, respectively. It will then make the rule-based deduction that the instance of
“regulation of gene expression” also belongs to the class “regulation of transcription”, which
is a subclass of “regulation of gene expression”. We are developing such rules, which should be
OWL-DL safe (Motik et al., 2004), based on the classes and relations provided by the ontology.
The GRO is necessary to implement these rules, since such inferences cannot be achieved without
a logically sound ontology.
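The example deduction described above can be mimicked in a few lines of ordinary code. The sketch below is written in Python rather than SWRL and does not use an OWL reasoner; the `Event` class, the function name and the placeholder participants (`TF_A`, `TF_B`, `gene_X`) are assumptions for this example, while the three class names are those given in the text.

```python
from dataclasses import dataclass

REG_GENE_EXPRESSION = "regulation of gene expression"
TF_BINDING = "binding of transcription factor to gene regulatory region"
REG_TRANSCRIPTION = "regulation of transcription"

@dataclass
class Event:
    classes: set   # ontology classes the event instance belongs to
    agent: str     # instance of 'transcription regulator'
    patient: str   # instance of 'gene'

def infer_transcription_regulation(events):
    """If a regulation event and a binding event share agent and patient,
    also classify the regulation event as 'regulation of transcription'."""
    for regulation in events:
        if REG_GENE_EXPRESSION not in regulation.classes:
            continue
        for binding in events:
            if (TF_BINDING in binding.classes
                    and binding.agent == regulation.agent
                    and binding.patient == regulation.patient):
                regulation.classes.add(REG_TRANSCRIPTION)
    return events

# Placeholder participants, for illustration only.
events = [
    Event({REG_GENE_EXPRESSION}, agent="TF_A", patient="gene_X"),
    Event({TF_BINDING}, agent="TF_A", patient="gene_X"),
    Event({REG_GENE_EXPRESSION}, agent="TF_B", patient="gene_X"),
]
infer_transcription_regulation(events)
for event in events:
    print(event.agent, event.patient, sorted(event.classes))
```

Only the first event is reclassified, because it is the only regulation event whose agent and patient both coincide with those of a binding event. Expressing the same rule over a logically sound ontology lets a reasoner apply it consistently together with the subclass relations.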
6 Conclusion, Outlook, Discussion
The outlook will assess the latest achievements of the TM community and state future
requirements expressed by curators and users. Other topics are the open access initiative
and Grid computing.
Open discussion

Forums for biomedical text mining on the Web

[27] http://www.nactem.ac.uk/pathtext/
[28] http://www.nactem.ac.uk/software/kleio/
[29] http://text0.mib.man.ac.uk/software/facta/
[30] http://www.ebi.ac.uk/uniprot/
[31] http://regulondb.ccg.unam.mx/
[32] http://www.w3.org/Submission/SWRL/


BLIMP [33]: “BLIMP covers all publications related to the fast-growing field of biomedical
literature and text mining. It is a one-stop resource, letting researchers find out
who-does-what in the area and where it is published, bridging across the many discipline-specific
venues in which biomedical text-mining papers are published.”

BIONLP [34]: Bob Futrelle’s NLP for Biotext Mining

A comprehensive collection of text mining resources (http://www.text-mining.org/), including
links to publications, commercial suppliers, news items, research groups, events, etc.

[33] http://blimp.cs.queensu.ca/
[34] http://www.ccs.neu.edu/home/futrelle/bionlp/


References

Adamic, L.A., Wilkinson, D., Huberman, B.A., and Adar, E. (2002) A Literature Based Method for
Identifying Gene-Disease Connections. IEEE Computer Society Bioinformatics Conference.
Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. (1999) "Automatic Extraction of
Biological Information from Scientific Text: Protein-Protein Interactions". ISMB99, 60-67.
Daraselia,N. et al. (2007) Automatic extraction of gene ontology annotation and its correlation with
clusters in protein networks. BMC Bioinformatics 10(8):243.
Friedman,C., et al. (2001) GENIES: a natural-language processing system for the extraction of
molecular pathways from journal articles. Bioinformatics, 17(Suppl 1), S74–82.
Gaudan,G., et al. (2008) Identifying GO terms in text. EURASIP JBSB, Hindawi Publishing Group.
(accepted)
Hakenberg,J., et al. (2005). Systematic feature evaluation for gene name recognition. BMC
Bioinformatics; 6 Suppl 1:S9.
Hirschman,L., et al. (2005) Overview of BioCreAtIvE task 1B: normalized gene lists. BMC
Bioinformatics, 6(Suppl 1):S11.
Hoffmann,R. and Valencia,A. (2005) Implementing the iHOP concept for navigation of biomedical
literature. Bioinformatics 21 (Suppl 2):ii252-8.
Huang,M., et al. (2004) Discovering patterns to extract protein–protein interactions from full texts.
Bioinformatics 20(18):3604-3612
Jelier,R., et al. (2005) Co-occurrence based meta-analysis of scientific texts retrieving biological
relationships between genes. Bioinformatics, 21(9):2049-58.
Kirsch,H., et al. (2006) Distributed modules for text annotation and IE applied to the biomedical
domain. Int. J. Med. Inform. 75(6):496-500.
Krallinger,M., Leitner,F., and Valencia,A. (2007) Assessment of the Second BioCreative PPI task:
Automatic Extraction of Protein-Protein Interactions. Proc Second BioCreative Challenge
Evaluation Workshop.
Leroy, G., and Chen,H. (2002) Filling Preposition-based Templates to Capture Information from
Medical Abstracts. Proc Pacific Symp. on Biocomputing 7 (PSB), 362--373.
Liu H., Hu Z., Zhang J. and Wu C., (2006). BioThesaurus: a web-based thesaurus of protein and gene
names. Bioinformatics 22(1):103-105.
Marcotte, E. M., Xenarios, I., and Eisenberg, D. (2001) Mining literature for protein-protein
interactions. Bioinformatics 17(4):359-63.
Morgan A., Hirschman L. (2007) Overview of BioCreative II Gene normalization. Proc Second
BioCreative Challenge Evaluation Workshop.
Ono,T., et al. (2001) Automated extraction of information on protein-protein interactions from the
biological literature. Bioinformatics 17(2):155-161.
Park,J.C. et al. (2001) Bidirectional incremental parsing for automatic pathway identification with
combinatory categorial grammar. Pac Symp Biocomput :396-407.
Pustejovsky,J. et al. (2001) Robust relational parsing over biomedical literature: extracting inhibit
relations. Pac Symp Biocomput :362-73.
Rebholz-Schuhmann,D., et al. (2006a) Annotation and Disambiguation of Semantic Types in
Biomedical Text: a Cascaded Approach to Named Entity Recognition. Workshop on
"Multi-Dimensional Markup in NLP", EACL 2006, Trento, Italy.
Rebholz-Schuhmann,D., et al. (2007a) EBIMed: text crunching to gather facts for proteins from
Medline. Bioinformatics 23(2):e237-e244.
Rebholz-Schuhmann,D., et al. (2007b) Text processing through Web services: Calling Whatizit.
Bioinformatics 2007 Nov 21
Rindflesch TC, Hunter L and Aronson AR (1999). "Mining Molecular Binding Terminology from
Biomedical Text". Proceedings of the AMIA Annual Symposium, 127-131.
Rzhetsky,A., et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating
molecular pathway data. J Biomed Inform 37(1):43-53.
Saric,J., et al. (2006) Extraction of regulatory gene/protein networks from Medline. Bioinformatics
22(6):645-650.
Sekimizu,T., et al. (1998) Identifying the interaction between genes and gene products based on
frequently seen verbs in Medline abstracts. Genome Informatics, 62-71.
Temkin,J.M. and Gilder,M.R. (2003) Extraction of protein interaction information from unstructured
text using a context-free grammar. Bioinformatics. 19(16):2046-53.
Tsuruoka,Y., et al. (2007). Learning string similarity measures for gene/protein name dictionary look-
up using logistic regression. Bioinformatics. 23(20):2768-74.


The following comprehensive references have been contributed by Dr. Sophia Ananiadou (NaCTeM,
University of Manchester)
6.1.1 General overviews and position papers on Text Mining for Biomedicine
Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, Kevin B. Cohen: Frontiers of biomedical text
mining: current progress. Briefings in Bioinformatics 8(5): 358-375 (2007)
Ananiadou, S. & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House.
Ananiadou, S., Kell, D.B. and Tsujii, J. (2006) Text Mining and its Potential Applications in Systems
Biology, in Trends in Biotechnology (TIBTECH) 24(12):571-579
Lars Juhl Jensen, J. Saric and P. Bork (2006) "Literature mining for the biologist: from information
retrieval to biological discovery", Nature Reviews Genetics, Vol. 7, Feb. 2006, pp 119-129
Cohen, A. M., and W. R. Hersh, “A Survey of Current Work in Biomedical Text Mining,” Briefings in
Bioinformatics, Vol. 6, 2005, pp. 57--71.
MacMullen, W.J, and S.O. Denn, “Information Problems in Molecular Biology and Bioinformatics,”
Journal of the American Society for Information Science and Technology, Vol. 56, No. 5, 2005, pp.
447--456.
Rebholz-Schuhmann, D., H. Kirsch, and F. Couto, “Facts from Text—Is Text Mining Ready to
Deliver? ” PLoS Biology, Vol. 3, No. 2, 2005, pp. 0188--0191, http://www.plosbiology.org, June
2005.
Shatkay H. (2005) Hairpins in bookstacks: information retrieval from biomedical text. Brief Bioinform
2005;6(3):222–38.
Tsujii, Jun-ichi and Sophia Ananiadou (2005) Thesaurus or logical ontology, which do we need for
mining text? Language Resources and Evaluation. 39(1). pp. 77-90, Springer SBM, September
2005.
Cohen KB and Hunter L. (2004) Natural language processing and systems biology. In: Dubitzky W,
Azuaje F, (eds). Artificial Intelligence Methods and Tools for Systems Biology. Heidelberg:
Springer, 2004;147–74.
Nédellec, C., “Machine Learning for Information Extraction in Genomics---State of the Art and
Perspectives.” Text Mining and its Applications, pp. 99--118, S. Sirmakessis (ed.), Berlin: Springer-
Verlag, Studies in Fuzziness and Soft Computing 138, 2004.
Shatkay, H., and R. Feldman, “Mining the Biomedical Literature in the Genomic Era: An Overview,”
Journal of Computational Biology, Vol. 10, No. 6, 2004, pp. 821--855.
Blaschke, C., L. Hirschman, and A. Valencia, “Information Extraction in Molecular Biology,”
Briefings in Bioinformatics, Vol. 3, No. 2, 2002, pp. 1--12.
Hirschman, L., et al., “Accomplishments and Challenges in Literature Data Mining for Biology,”
Bioinformatics, Vol. 18, No. 12, 2002, pp. 1553--1561.
Mack, R., and M. Hehenburger, “Text-based Knowledge Discovery: Search and Mining of Life-
sciences Documents,” Drug Discovery Today, Vol. 7, No. 11, 2002, pp. S89--S98.
Yandell, M. D., and W. H. Majoros, “Genomics and Natural Language Processing,” Nature
Reviews/Genetics, Vol. 3, 2002, pp. 601--610.
6.1.2 Named Entity Recognition and Terminology Management
Ananiadou, S. & Nenadic, G. (2006) Automatic Terminology Management in Biomedicine, in Text
Mining for Biology and Biomedicine, Artech house, pp.67-98
Aronson, A.R, “Effective Mapping of Biomedical Text to the UMLS Metathesaurus: the MetaMap
Program,” Proceedings of AMIA, 2001, pp.17--21.
Chang, J.T. and Schütze, H. (2006) Abbreviations in biomedical text. In Ananiadou, S. and McNaught,
J. (Eds.). Text Mining for Biology and Biomedicine, London Artech House Inc, pp. 99–119.
Fukuda, K., et al., “Towards Information Extraction: Identifying Protein Names from Biological
Papers”, PSB, Hawaii, USA, 1998, pp.707--718.
Hanisch D, Fundel K, Mevissen HT, et al. Prominer: rule based protein and gene entity recognition.
BMC Bioinformatics 2005;6(Suppl 1):(S14).
Hanisch, D., et al., “Playing Biology’s Name Game: Identifying Protein Names in Scientific Text,”
Proc. Pacific Symp. on Biocomputing, 2003, pp. 403--414.
Hatzivassiloglou, V., P.A. Duboue, and A. Rzhetsky, “Disambiguating Proteins, Genes, and RNA in
Text: A Machine Learning Approach,” Bioinformatics, 17, Suppl 1, 2001, pp.97--106.
Koike, A., Y. Niwa, and T. Takagi, “Automatic Extraction of Gene/Protein Biological Functions from
Biomedical Text”, Bioinformatics, Vol. 21, No.7, 2005, pp.1227--1236.
Krauthammer, M., A. Rzhetsky, P. Morozov, and C. Friedman, “Using BLAST for identifying gene
and protein names in journal articles,” Gene, Vol. 259, No. 1–2, 2001, pp.245--252.


Krauthammer, M., and G. Nenadic, “Term Identification in the Biomedical Literature”, Journal of
Biomedical Informatics, Special Issue on Named Entity Recognition in Biomedicine, Vol.37, No. 6,
2004, pp.512--526.
Mika, S., and B. Rost, “Protein Names Precisely Peeled off Free Text,” Bioinformatics, Vol. 20, Suppl.
1, 2004, pp. i241--i247.
Morgan, A., et al., “Gene Name Extraction Using FlyBase Resources”, Proceedings of ACL Workshop,
NLP in Biomedicine, Sapporo, Japan, 2003, pp.1--8.
Morgan, A., et al., “Gene Name Identification and Normalization using a Model Organism Database,”
Journal of Biomedical Informatics, Vol. 37, 2004, pp. 396--410.
Nenadic, G., I. Spasic, and S. Ananiadou, “Terminology-Driven Mining of Biomedical Literature”,
Bioinformatics, Vol. 19, No.8, 2003, pp.938--943.
Nobata, C., N. Collier, and J. Tsujii, “Automatic Term Identification and Classification in Biological
Texts,” Proceedings of Natural Language Pacific Rim Symposium, 1999, pp.369--374.
Okazaki, N. and Ananiadou, S. (2006) Building an Abbreviation Dictionary using a Term Recognition
Approach, in Bioinformatics, 22(24):3089-3095
Park, J.C. & Kim, J.J. (2006) Named Entity Recognition, in Text Mining for Biology and Biomedicine,
Artech house, pp. 121- 142
Proux, D., et al., “Detecting Gene Symbols and Names in Biomedical Texts: A First Step toward
Pertinent Information,” Proc. 9th Workshop on Genome Informatics, 1998, pp. 72--80.
Pustejovsky, J., et al., “Automatic Extraction of Acronym--Meaning Pairs from Medline Databases,”
Medinfo, Vol. 10, 2001, pp. 371--375.
Pustejovsky, J., et al., “Medstract: Creating Large-Scale Information Servers for Biomedical Libraries,”
ACL Workshop on Natural Language Processing in the Biomedical Domain, 2002, pp. 85--92.
Raychaudhuri, S., J.T. Chang, P.D. Sutphin, and R.B. Altman, “Associating Genes with Gene Ontology
Codes Using a Maximum Entropy Analysis of Biomedical Literature,” Genome Res, Vol.12, No.1,
2002, pp.203--214.
Schwartz, A.S. and M.A. Hearst, “A Simple Algorithm for Identifying Abbreviation Definitions in
Biomedical Text,” Pacific Symposium on Biocomputing, 2003, pp. 451--462.
Tanabe, L., and W.J. Wilbur, “Tagging Gene and Protein Names in Biomedical Text”, Bioinformatics,
18(8), 2002, pp.1124--1132.
Wren, J.D., et al. (2005) Biomedical term mapping databases. Nucleic Acids Res, 33.
Yoshida, M., K. Fukuda, and T. Takagi, “Pnad-css: a Workbench for Constructing a Protein Name
Abbreviation Dictionary,” Bioinformatics, Vol. 16, 2000, pp. 169--75.
Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou (2008) Normalizing biomedical terms
by minimizing ambiguity and variability. BMC Bioinformatics (in press).
Yu, H., and E. Agichtein, “Extracting Synonymous Gene and Protein Terms from Biological
Literature,” Bioinformatics, 19, Suppl 1, 2003, pp.I340--349.
Zhou, G., et al., “Recognizing Names in Biomedical Texts: A Machine Learning Approach,”
Bioinformatics, Vol. 20, No. 4, 2004, pp. 1178--1190.
6.1.3 Relation mining, event extraction
Ahlers CB, Fiszman M, Demner-Fushman D, et al. Extracting semantic predications from MEDLINE
citations for pharmacogenomics. In: Pac Symp Biocomput 12. Maui, Hawaii, 2007:209–20.
Chun, H., et al., “Extraction of Gene-Disease Relations from MedLine using Domain Dictionaries and
Machine Learning,” Proc. Pacific Symp. on Biocomputing (PSB), 2006, pp. 4--15.
Chun, Hong-woo, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki,
Jun'ichi Tsujii (2006) Automatic Recognition of Topic-Classified Relations between Prostate Cancer
and Genes using MEDLINE Abstracts. BMC-Bioinformatics. 7(Suppl 3). pp. S4, November 2006.
Daraselia, N., et al., “Extracting Human Protein Interactions from MEDLINE using a Full-sentence
Parser,” Bioinformatics, Vol. 20, No. 5, 2004, pp. 604--611.
De Bruijn, B., and J. Martin, “Getting to the (C)ore of Knowledge: Mining Biomedical Literature,”
International Journal of Medical Informatics, Vol. 67, 2002, pp. 7--18.
Friedman, C., et al., “GENIES: A Natural-language Processing System for the Extraction of Molecular
Pathways from Journal Articles,” Bioinformatics, Vol. 17, 2001, pp. S74--S82.
Hu, Z., et al., “Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-
based System,” Bioinformatics, Vol. 21, 2005, 2759--2765.
Jelier,R., et al. (2005) Co-occurrence based meta-analysis of scientific texts retrieving biological
relationships between genes. Bioinformatics, 21(9):2049-58.
Kim JJ, Zhang Z, Park JC, et al. BioContrasts: extracting and exploiting protein-protein contrastive
relations from biomedical literature. Bioinformatics 2006;22:597-605.


Kim, J., and J. C. Park, “BioIE: Retargetable Information Extraction and Ontological Annotation of
Biological Interactions from the Literature,” Journal of Bioinformatics and Computational Biology,
Vol. 3, No. 3, 2004, pp. 551--568.
Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii. Corpus annotation for mining biomedical events
from literature. BMC Bioinformatics. 9(1). pp. 10, BioMed Central, 2008.
Leroy, G., and H. Chen, “Genescene: An Ontology-enhanced Integration of Linguistic and Co-
occurrence Based Relations in Biomedical Texts,” Journal of the American Society for Information
Science and Technology, Vol. 56, No. 5, 2005, pp. 457--468.
Leroy, G., H. Chen, and J.D. Martinez, “A Shallow Parser based on Closed-Class Words to Capture
Relations in Biomedical Text,” Journal of Biomedical Informatics, Vol. 36, No. 3, 2003, pp.145--
158.
McDonald, D.M., et al., “Extracting Gene Pathway Relations Using a Hybrid Grammar: The Arizona
Relation Parser,” Bioinformatics, Vol. 20, No. 18, 2004, pp. 3370--3378.
Novichkova, S., S. Egorov, and N. Daraselia, “MedScan, a Natural Language Processing Engine for
MEDLINE Abstracts,” Bioinformatics, Vol. 19, 2003, pp. 1699--1706.
Oda, Kanae, Jin-Dong Kim, Tomoko Ohta, Daisuke Okanohara, Takuya Matsuzaki, Yuka Tateisi and
Jun'ichi Tsujii. (2008) New challenges for text mining: Mapping between text and manually curated
pathways. In Christopher JO Baker and Su Jian (Eds.), BMC Bioinformatics. 2008. To appear.
Park, J.C., “Using Combinatory Categorial Grammar to Extract Biomedical Information,” IEEE
Intelligent Systems, Vol. 16, No. 1, 2001, pp.62--67.
Pyysalo, S., et al., “Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-
Protein Interactions,” Proc. Int. Joint Workshop on Natural Language Processing in Biomedicine
and its Applications (JNLPBA), 2004, pp. 15--21.
Rebholz-Schuhmann D, Kirsch H, Arregui M, et al. EBIMed-text crunching to gather facts for proteins
from Medline. Bioinformatics 2007;23:e237–44
Rinaldi F, Schneider G, Kaljurand K, et al. Mining of relations between proteins over biomedical
scientific literature using a deep-linguistic approach. Artif Intell Med 2007;39(2):127–36.
Rinaldi, F., et al., “Mining Relations in the GENIA Corpus,” Proc. Second European Workshop on
Data Mining and Text Mining for Bioinformatics, 2004, pp. 61--68.
Sætre, Rune, Kazuhiro Yoshida, Akane Yakushiji, Yusuke Miyao, Yuichiro Matsubayashi and Tomoko
Ohta. (2007) AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-
IPS subtask. Proceedings of the Second BioCreative Challenge Evaluation Workshop.
Šarić, J., L. J. Jensen, and I. Rojas, “Large-scale Extraction of Gene Regulation for Model Organisms
in an Ontological Context,” In Silico Biology, Vol. 5, No. 0004, 2004,
http://www.bioinfo.de/isb/2004/05/0004/, June 2005
Wattarujeekrit T, Shah PK, Collier N. PASBio: predicate argument structures for event extraction in
molecular biology. BMC Bioinform 2004;5:155.
Yakushiji, A., et al., “Biomedical Information Extraction with Predicate-Argument Structure Patterns,”
Proc. Int. Symp. on Semantic Mining in Biomedicine, 2005, pp. 60--69.
Yakushiji, A., et al., “Event Extraction from Biomedical Papers Using a Full Parser,” Proc. Pacific
Symp. on Biocomputing (PSB 2001), Kauai, Hawaii, Jan. 3--7, 2001, pp. 408--419.
6.1.4 Annotation of Biomedical Corpora
Chou WC, Tsai RTH, Su YS, et al. A semi-automatic method for annotating a biomedical proposition
bank. Proceedings of workshop on frontiers in linguistically annotated corpora 2006. Association
for Computational Linguistics. Sydney, Australia, 2006:5–12.
Cohen, K., et al., “Corpus Design for Biomedical Natural Language Processing,” Proc. ACL Workshop
on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005, pp.
38--45.
Erjavec, T., et al., “Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus,” Proc.
ACL Workshop on Natural Language Processing in Biomedicine, 2003, pp. 97--104.
Kim, J., et al., “GENIA Corpus---a Semantically Annotated Corpus for Bio-Textmining,”
Bioinformatics, Vol. 19, Suppl. 1, 2003, pp. i180--i182.
Kim, J.D. & Tsujii, J. (2006) Corpora and their Annotation, in Text Mining for Biology and
Biomedicine, Ananiadou, S. & McNaught, J. (eds), Artech House, pp. 179-211.
Pakhomov, S., A. Coden, and C. Chute, “Creating a Test Corpus of Clinical Notes Manually Tagged
for Part-of-Speech Information,” Proc. COLING Joint Workshop on Natural Language Processing in
Biomedicine and its Applications, 2004, pp. 62--65.


Smith, L.H., et al., “MedTag: a Collection of Biomedical Annotations,” Proc. ACL Workshop on
Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005, pp.
32--37.
Tateisi, Y., and J. Tsujii, “Part-of-Speech Annotation of Biology Research Abstracts,” Proc. Int. Conf.
on Language Resource and Evaluation (LREC), 2004, pp. 1267--1270.
Tateisi, Y., et al., “Syntax Annotation for the GENIA corpus,” Proc. IJCNLP Companion volume,
2005, pp. 222--227.
Thompson, Paul, Giulia Venturi, John McNaught, Simonetta Montemagni and Sophia Ananiadou
(2008). Categorising Modality in Biomedical Texts. LREC 2008 workshop "Building and
Evaluating resources for biomedical text mining" Marrakech, Morocco.
Thompson, Paul, Philip Cotter, John McNaught, Sophia Ananiadou, Simonetta Montemagni, Andrea
Trabucco and Giulia Venturi (2008). Building a bio-event annotated corpus for the acquisition of
semantic frames from biomedical corpora. Sixth International Conference on Language Resources
and Evaluation (LREC 2008), Marrakech, Morocco
6.1.5 Tagging and Parsing
Brants, T., “TnT - A Statistical Part-of-Speech Tagger,” Proc. Applied Natural Language Processing
Conf., 2000.
Charniak, E., “A Maximum-Entropy-Inspired Parser,” Proc. NAACL, 2000.
Collins, M., “Head-Driven Statistical Models for Natural Language Parsing,” Ph.D. thesis, University
of Pennsylvania, 1999.
Giménez, J., and L. Màrquez, “Fast and accurate part-of-speech tagging: The SVM approach
revisited,” Proc. RANLP, 2003, pp. 153-163.
Hara, T., Y. Miyao and J. Tsujii, “Adapting a probabilistic disambiguation model of an HPSG parser to
a new domain,” Proc. Int. Joint Conf. on Natural Language Processing, 2005.
Miyao, Y., and J. Tsujii, “Probabilistic disambiguation models for wide-coverage HPSG parsing,”
Proc. ACL, 2005, pp. 83--90.
Miyao, Yusuke and Jun'ichi Tsujii. Feature Forest Models for Probabilistic HPSG Parsing.
Computational Linguistics. 34(1). pp. 35–80, MIT Press, 2008.
Ninomiya T, Tsuruoka Y, Miyao Y, et al. Fast and scalable HPSG parsing. TAL 2005 2007;46(2).
Reynar, J., and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence Boundaries,”
Proc. ANLP, 1997, pp. 16-19.
Smith, L., et al., “MedPost: a part-of-speech tagger for bioMedical text,” Bioinformatics, Vol. 20, No.
14, 2004, pp. 2320-2321.
Toutanova K., et al., “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network,” Proc.
HLT-NAACL, 2003.
Tsuruoka, Y., and J. Tsujii, “Bidirectional Inference with the Easiest-First Strategy for Tagging
Sequence Data,” Proc. HLT/EMNLP, 2005. pp. 467--474.
Tsuruoka, Y., et al., “Developing a Robust Part-of-Speech Tagger for Biomedical Text,” Proc. 10th
Panhellenic Conference on Informatics, 2005, pp. 382-392.
Yoshida, Kazuhiro, Yoshimasa Tsuruoka, Yusuke Miyao and Jun'ichi Tsujii (2007) Ambiguous Part-
of-Speech Tagging for Improving Accuracy and Domain Portability of Syntactic Parsers. Twentieth
International Joint Conference on Artificial Intelligence (IJCAI)


7 Annex A: Terminological resources


7.1 Building terminological resources
Automatic population of lexical resources such as BioLexicon, term clustering, etc. will be
inserted here. Terminological resources for bio-text mining are important links between text
and ontologies. The BioLexicon is one such resource: it contains rich linguistic information
and links to ontologies. We will explain how we populate a biologically motivated
terminological resource that is useful for a multitude of text mining applications.
7.2 Information Extraction of named entities and ontological concepts (IE-1)
Different components of the linguistic analysis of biomedical text will be explained (e.g., POS
tagging, shallow parsing, syntactic parsing, deep syntactic parsing). Different types of IE
approaches used in the biomedical domain will be discussed (e.g., pattern matching of
terminology, full syntactic parsing, and machine-learning-based approaches, for instance for
disambiguation).
This part of the tutorial will also focus on the extraction of named entities and ontological
concepts from the literature based on statistical methods. Furthermore, solutions will be
presented for mining ontological concepts from the scientific literature (feeding
ontologies from the literature) and for annotating genes/proteins with Gene Ontology
terms.
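As a rough illustration of the simplest approach named above, pattern matching of terminology, the following Python sketch tags terms by greedy longest-match lookup in a lexicon. It is only a sketch under stated assumptions: the lexicon entries, the identifiers (`PROT:0001` etc.) and the function names `tokenize` and `tag_terms` are hypothetical and do not come from any of the resources or systems discussed in this tutorial.

```python
import re

# Hypothetical mini-lexicon: tuple of lowercased tokens -> concept identifier.
# The identifiers are illustrative placeholders, not real database accessions.
LEXICON = {
    ("tumor", "necrosis", "factor"): "PROT:0001",
    ("necrosis",): "PROC:0002",
    ("nf-kappab",): "PROT:0003",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tokenize(text):
    """Split text into (token, start, end) triples; hyphenated names stay whole."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)]


def tag_terms(text):
    """Greedy longest-match lookup; returns (surface, start, end, identifier)."""
    tokens = tokenize(text)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok, _, _ in tokens[i:i + n])
            if key in LEXICON:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((text[start:end], start, end, LEXICON[key]))
                i += n
                break
        else:
            # No lexicon entry starts at this token; move on.
            i += 1
    return hits
```

Preferring the longest match means that "necrosis" inside "tumor necrosis factor" is not reported as a separate term. Such a matcher cannot resolve ambiguous names on its own, which is where the statistical and machine-learning methods mentioned above come in.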
The following terminological resources are available to the public:
- Unified Medical Language System (UMLS)35: This is one of the most commonly
used terminological resources. It is provided by the National Library of Medicine
and was developed to support retrieval of Medline abstracts. UMLS contains terms
for diseases and syndromes, species terms, gene ontology terms and other terms. The
content of UMLS is highly heterogeneous, which causes problems
whenever a homogeneous subset of terms is required.
- UMLS Metathesaurus36: This lexical resource is a large-scale multilingual
vocabulary resource provided by the NLM. It can be used as a controlled
vocabulary or as a taxonomy of medical terms. Its benefit is that it
delivers meanings with the terms and can thus support disambiguation.
- SNOMED CT is a comprehensive taxonomy of medical terms. It is used in
commercial IT solutions to achieve consistent representation of medical concepts and
is gaining importance in public IT solutions. It is contained in UMLS, but
every user has to check whether the correct license agreements are in place.
- The NCBI taxonomy37 “database is a curated set of names and classifications for all
of the organisms that are represented in GenBank” (from the Web page). It is used to
offer cross-linking between the different data resources at the NCBI.
- The Gene Ontology (GO) is a taxonomy of concepts for molecular function, cellular
component and biological process. It is used as a controlled vocabulary for the
annotation of database entries, for the categorization of genes in microarray
experiments and for the extraction of terms representing the concepts from the
scientific literature (Couto et al). GO is now part of UMLS.
- A large number of ontologies are contained in the collection of the OBO ontologies.
All ontologies are available under open access distribution agreements. A large share
of the ontologies is still under development.
- Medline Plus38 provides information on diseases and drugs that can be used as
controlled vocabularies.

35 http://www.nlm.nih.gov/pubs/factsheets/umls.html
36 http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
37 http://130.14.29.110/Taxonomy/
38 http://medlineplus.gov/


- BioThesaurus is a collection of protein and gene names extracted from a number of
different data resources, e.g. UniProtKB, EntrezGene, HUGO, InterPro, Pfam and
others (Liu et al., 2006). The database contains several million entries and, because of
its size, cannot be completely integrated into an information extraction solution.
- GPSDB39 is a thesaurus of gene and protein names that have been collected from 14
different resources (Pillet et al., 2005). It serves as a controlled vocabulary for named
entities referring to genes and proteins and is thus similar to the BioThesaurus.
- DrugBank40 is a database that collects information for bioinformatics and
cheminformatics use. The data resource holds information on the drug brand names,
the chemical IUPAC names, the chemical formula, the generic name, the InChI
identifier, the chemical structure representation and relevant links to other databases, e.g.
ChEBI, PharmGKB and others.
- ChEBI41 is a taxonomy for chemical entities developed at the EBI. By the definition of
this resource, the entities it contains are relevant to the biological domain.
ChEBI terms are for example identified by OSCAR3, a software package
for the identification of chemical entities.
- The PubChem42 data resource provides information on small chemical entities at a
large scale (currently 17 million chemical entities described). This public resource is
meant to keep track of information relating small chemical entities to biological
experiments. Similar to DrugBank it allows access to the names, structures and
kinetic parameters of the small chemical entities.
- Other biomedical data resources are available that can be exploited for terminology.
For example EntrezGene43, the EMBL Nucleotide Sequence Database44 and
GenBank Database45 are large databases that contain sequence information for
proteins and genes and annotations linked to the genes and proteins. Apart from the
field containing the name they also provide definitions and other annotations.

39 http://biomint.pharmadm.com/protop/bin/bmstaticpage.pl?userType=guest&p=gpsdb
40 http://redpoll.pharmacy.ualberta.ca/drugbank/
41 http://www.ebi.ac.uk/chebi/
42 http://pubchem.ncbi.nlm.nih.gov/
43 www.ncbi.nlm.nih.gov/projects/LocusLink/
44 www.ebi.ac.uk/embl/
45 www.ncbi.nlm.nih.gov/Genbank/
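Thesauri such as BioThesaurus and GPSDB are described above as collections of gene and protein names gathered from several data resources. The following Python sketch shows, under simplifying assumptions, how such a merged name index might be built and how names that point to more than one identifier can be flagged for disambiguation. The records, source labels, identifiers and the normalization rule are all hypothetical; this is not the procedure used by either resource.

```python
from collections import defaultdict

# Hypothetical records as (source, identifier, name); all values are made up
# for illustration and do not come from the resources named above.
RECORDS = [
    ("sourceA", "G1", "TNF"),
    ("sourceB", "G1", "tumor necrosis factor"),
    ("sourceB", "G1", "TNF-alpha"),
    ("sourceA", "G2", "CAT"),
    ("sourceC", "G3", "cat"),
]


def normalize(name):
    """Lowercase and keep only letters and digits so spelling variants collapse."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_thesaurus(records):
    """Map each normalized name to the set of identifiers it may refer to."""
    index = defaultdict(set)
    for _source, identifier, name in records:
        index[normalize(name)].add(identifier)
    return dict(index)


def ambiguous_names(index):
    """Names that point to more than one identifier need disambiguation."""
    return {name: ids for name, ids in index.items() if len(ids) > 1}
```

Aggressive normalization increases recall (for example, "TNF-alpha" and "TNF alpha" collapse to one key) but also creates ambiguity, as the "CAT"/"cat" pair shows; a real system would have to weigh this trade-off.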
