Identification of Biomedical
Events and Relations in the
Literature with Ontological
Support
www.iscb.org/ismb2008
Rebholz-Schuhmann/Kim/Ananiadou/Tsujii
1 Abstract
2 Introduction
2.1 Information retrieval and information extraction: relevant terminology
2.2 Selected types of data that text mining can deliver
2.3 Basic software components for text mining
2.4 Evaluation of text mining systems
3 Building terminological, ontological resources and annotation projects
3.1 Building terminological resources
3.2 Building an ontological resource: the example GRO
3.3 Biological Annotation and Linguistic Annotation
4 Information Extraction of named entities and ontological concepts (IE-1)
4.1 Identification of GO concepts in the literature
5 Information Extraction of relations and events (IE-2)
5.1 Protein-protein interactions
5.2 Event/Relation Recognition
5.3 Identification of gene regulatory events
6 Conclusion, Outlook, Discussion
7 Annex A: Terminological resources
7.1 Building terminological resources
7.2 Information Extraction of named entities and ontological concepts (IE-1)
1 Abstract
Text mining solutions increasingly integrate ontological resources to extract normalized
information from the literature. The tutorial will demonstrate through working examples how
biomedical events can be identified from the literature with state-of-the-art technology
that integrates ontological resources. Successful use cases will be demonstrated and the
infrastructure of existing solutions will be explained.
2 Introduction
In the past two decades biologists developed new technologies that support analysis and
understanding of processes in model organisms and that can be performed in high-throughput
In information extraction the text is processed to identify predefined facts, for example, in
the biomedical domain, facts about genes, proteins, diseases, and chemical entities, and
relations among proteins or between proteins and other types of entities (e.g. diseases).
[1] http://www.textpresso.org/
[2] http://www.QueryChem.com
[3] http://www.ihop-net.org/UniPub/iHOP/
[4] http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
[5] http://www.nactem.ac.uk/software/kleio/
[6] http://text0.mib.man.ac.uk/software/facta/
[7] http://www.gopubmed.org/
[8] http://pubmatrix.grc.nia.nih.gov/
[9] http://www.glycosciences.de/tools/PubFinder
Information extraction can be used to improve information retrieval. The separation between
both types of text mining techniques is not fixed.
Named entity recognition (NER) is a special research field in text mining. It is concerned
with the extraction of textual fragments in text that represent objects of predefined semantic
types (e.g. gene names, disease mentions). Ontologists call such objects the instances of the
semantic classes or types.
Information extraction and information retrieval are solutions that offer access to parts of an
abstract and parts of a sentence. To build efficient information extraction solutions the
following questions have to be answered: (1) what kinds of information can be extracted in a
systematic way, (2) how do we represent the information, (3) how do we solve problems that
come from the variability of the language, or, the other way round, (4) how do we normalize
the extracted results to obtain a 'purified' view of the textual information that is much more
condensed than the textual representation, and last but not least (5) how do we deal with shifts
in the textual representation of information that are due to past errors in the perception,
naming and classification of terms.
One core problem concerning information extraction in biology is the definition,
standardization and classification of biological names and terms. Names are, inarguably, the
means by which genes, proteins and any other biological objects are identified in natural
language text. Sequence databases, e.g. UniProtKB [Appweiler, Lee], also usually provide
unique names for their database entries. In this way, links between different data sources can
be established, and information about the same biological object can thus be integrated from
different data sources. The integration enables classification and categorization of the
biological objects according to their similarities such as shared functions and similar
structures.
This need for standard names has led to the need for a nomenclature of standard terms that
mirror the characteristics of biological objects, e.g. function, structure, and localization. Any
of these nomenclatures can be the starting point for an ontology, where individuals and
biological groups or networks want to standardize information representation to enhance the
exchange of biological information. This standardized information can in turn be used to
improve the quality of the information extraction tools.
2.2 Selected types of data that text mining can deliver
Text mining in the biomedical domain has focused mainly on the identification of named
entities such as names of proteins, genes, and diseases. Increasingly, sophisticated text mining
techniques (e.g. full parsing) are applied to increase the precision of biomedical text mining
systems.
We can distinguish the following types of facts relevant to the biomedical domain. Note that
the types are not disjoint. For example, the representation of a point mutation might also
represent a gene.
- Identification of named entities, e.g., genes, proteins, diseases, cell types, species, and
drug brand names (Adamic et al., 2002; Rindflesch et al., 2000).
- Identification of linguistic expressions of not-named entity types, e.g., point
mutations, karyotypes, chemicals (Rebholz-Schuhmann et al., 2004)
- Identification of ontological terms (Couto et al., 2006; Gaudan et al., 2008)
- Identification of associations, i.e. pairs of two named entities that appear together
(Jenssen et al., 2001).
- Identification of semantic relations between named entities (Blaschke et al., 1999;
Marcotte et al., 2001)
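The simplest of these fact types, associations, can be illustrated with a short sketch. It counts pairs of entities that co-occur in the same sentence. The lexicon, the example sentences and the function names are assumptions made for illustration only; they are not taken from any of the cited systems, and a real system would use proper named entity recognition instead of substring matching.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon (surface form -> semantic type); illustration only.
LEXICON = {"brca1": "gene", "tp53": "gene", "breast cancer": "disease"}

def find_entities(sentence):
    """Return the lexicon entries mentioned in a sentence (naive substring match)."""
    text = sentence.lower()
    return {term for term in LEXICON if term in text}

def cooccurrences(sentences):
    """Count unordered entity pairs that appear together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        entities = sorted(find_entities(sentence))
        for pair in combinations(entities, 2):
            counts[pair] += 1
    return counts

sentences = [
    "BRCA1 mutations are linked to breast cancer.",
    "TP53 and BRCA1 are both altered in breast cancer.",
    "TP53 encodes a tumour suppressor.",
]
pair_counts = cooccurrences(sentences)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

A co-occurrence count says only that two entities are mentioned together; it does not say how they are related, which is what the semantic relations in the last item of the list add.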
3.1.1 Term extraction, term variation and normalization for building terminological
and ontological resources: BioLexicon (SA)
Terms are the backbone of specialized knowledge because they denote the biological entities
of the documents. Unfortunately, the naming of biological entities is often inconsistent and
imprecise. Metabolites, proteins and genes often have a variety of names (terms) for denoting
the same concept. For example, the metabolite glucose-6-phosphate is referred to by variants
and permutations of α or β, d- or l-glucose (or hexose)-6-(mono)-phosphate. Furthermore,
within the same text a term can be given in an extended compounded form and then later
expressed through various mechanisms, including orthographic variation (usage of hyphens
and slashes, e.g. amino acid and amino-acid), lower and upper case (NF-KB and NF-kb),
spelling variations (tumour and tumor), various Latin and/or Greek transliterations (oestrogen
and estrogen), and abbreviations (RAR and retinoic acid receptor). Therefore, a term is
increasingly viewed as an equivalence class of term forms, the rich variety of which has to be
recognized, indexed, linked and mapped to the abundant biological databases and ontologies.
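The view of a term as an equivalence class of term forms can be sketched as follows. The normalization rules (lower-casing, unifying hyphens and slashes, a small spelling table) are assumptions chosen to cover the examples above; they are not the rules used in the BioLexicon, and abbreviations such as RAR would need a separate abbreviation dictionary.

```python
import re
from collections import defaultdict

# Hypothetical spelling table (British/Latinate -> American); illustration only.
SPELLING = {"tumour": "tumor", "oestrogen": "estrogen"}

def normalize(term):
    """Map a term form to a canonical key."""
    key = term.lower()                          # NF-KB / NF-kb
    key = re.sub(r"[-/\s]+", " ", key).strip()  # amino-acid / amino acid
    return " ".join(SPELLING.get(token, token) for token in key.split())

def equivalence_classes(term_forms):
    """Group term forms that share the same canonical key."""
    classes = defaultdict(set)
    for form in term_forms:
        classes[normalize(form)].add(form)
    return dict(classes)

forms = ["amino acid", "amino-acid", "NF-KB", "NF-kb",
         "tumour", "tumor", "oestrogen", "estrogen"]
classes = equivalence_classes(forms)
for key, members in classes.items():
    print(key, sorted(members))
```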
Ontologies are crucial for text mining because they provide semantic interpretation to text and
also constrain the possible interpretations of biological entities (terms). When we provide
semantic interpretation to text, we link terms to concepts in ontologies; in turn, textual
evidence is used to update and maintain existing ontologies.
We will describe the process of populating the BioLexicon with new terms (named entities)
which are automatically extracted from text. The BioLexicon has been developed within the
framework of the BOOTStrep project (National Centre for Text Mining, ILC, EBI). We will
cover:
a) A method for named entity recognition (NER) used to populate the BioLexicon with
term variants extracted from biomedical literature.
b) A method developed for term mapping to UniProt Accession Numbers through term
normalization. Because the BioLexicon contains two million gene/protein names, a
straightforward similarity calculation over term pairs is not practical. Linking instead
relies on smart dictionary look-up in large-scale gene/protein name dictionaries,
including GENA, ProMiner and BioThesaurus. These dictionaries contain variants of
names as well as canonical names.
c) How to link the BioLexicon with a bio-ontology (GRO).
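To illustrate why dictionary look-up scales better than comparing all term pairs, here is a minimal sketch that indexes names by character trigrams and scores only the candidates that share at least one trigram with the query. Trigram Jaccard similarity is swapped in here as a simple stand-in and is not the similarity measure used for the BioLexicon; the class name, the threshold and the accession strings are hypothetical placeholders.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lower-case and replace runs of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def trigrams(name):
    """Character trigrams of the normalized name, padded with '#' at both ends."""
    padded = f"#{normalize(name)}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

class NameIndex:
    """Inverted index from trigram to dictionary names (hypothetical helper)."""

    def __init__(self, entries):
        self.accession = {}
        self.index = defaultdict(set)
        for name, accession in entries:
            self.accession[name] = accession
            for gram in trigrams(name):
                self.index[gram].add(name)

    def lookup(self, query, threshold=0.5):
        """Return (score, name, accession) tuples, best match first."""
        query_grams = trigrams(query)
        candidates = set()
        for gram in query_grams:
            candidates |= self.index.get(gram, set())
        hits = []
        for name in candidates:  # only names sharing a trigram are scored
            name_grams = trigrams(name)
            score = len(query_grams & name_grams) / len(query_grams | name_grams)
            if score >= threshold:
                hits.append((score, name, self.accession[name]))
        return sorted(hits, reverse=True)

# Placeholder accessions, not real UniProt accession numbers.
name_index = NameIndex([
    ("interleukin-2", "ACC-A"),
    ("interleukin-6", "ACC-B"),
    ("retinoic acid receptor alpha", "ACC-C"),
])
hits = name_index.lookup("Interleukin 2")
print(hits)
```

Normalizing before indexing matters in this sketch: without it, "Interleukin 2" would be equally similar to "interleukin-2" and "interleukin-6".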
[14] http://trec.nist.gov/pubs/trec12/t12_proceedings.html
[15] http://biocreative.sourceforge.net/
[16] http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html
The GRO covers classes for the core events relevant to the domain. These comprise
fundamental gene regulatory events such as regulation of transcription and also the
participants involved in the events such as transcription factor and regulatory region. A large
portion of the classes in GRO have been supplied with textual definitions, drawn to a large
extent from external resources.
GRO distinguishes itself from most other biomedical ontologies currently available by
making use of a larger number of relations in its ontological structure. As a result, classes
are highly interlinked by various manually encoded relations and their reciprocal
counterparts, which are given in parentheses in this text. The relation partOf (hasPart) is
used to relate spatial, temporal, or procedural parts to the whole. For example, the class
ProteinDomain is partOf the class Protein and the class TranscriptionInitiation is partOf the
class Transcription. Wholes and their parts must belong to the same super ontological
category, i.e., a continuant can only have continuants as parts and an occurrent can only have
occurrents as parts.
The relation fromSpecies relates species-specific classes to the species they belong to, such as
Bacterial RNA Polymerase is fromSpecies Bacterium. Also, continuants and occurrents are
related to processes, in which they are involved, by the relation participatesIn
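The constraint on partOf and the automatic addition of the reciprocal relation can be sketched in a few lines. This is a toy data structure written for illustration, not the GRO implementation; the assignment of the four example classes to continuant or occurrent is an assumption based on the examples in the text.

```python
from dataclasses import dataclass, field

CONTINUANT, OCCURRENT = "continuant", "occurrent"

# Reciprocal relation names as given in parentheses in the text.
INVERSE = {"partOf": "hasPart"}

@dataclass
class Ontology:
    category: dict = field(default_factory=dict)  # class name -> top-level category
    triples: set = field(default_factory=set)     # (subject, relation, object)

    def add_class(self, name, category):
        self.category[name] = category

    def relate(self, subject, relation, obj):
        """Add a relation and its reciprocal; enforce the partOf category rule."""
        if relation == "partOf" and self.category[subject] != self.category[obj]:
            raise ValueError(f"{subject} and {obj} belong to different categories")
        self.triples.add((subject, relation, obj))
        if relation in INVERSE:
            self.triples.add((obj, INVERSE[relation], subject))

onto = Ontology()
onto.add_class("Protein", CONTINUANT)               # assumed category
onto.add_class("ProteinDomain", CONTINUANT)         # assumed category
onto.add_class("Transcription", OCCURRENT)          # assumed category
onto.add_class("TranscriptionInitiation", OCCURRENT)  # assumed category

onto.relate("ProteinDomain", "partOf", "Protein")
onto.relate("TranscriptionInitiation", "partOf", "Transcription")
print(sorted(onto.triples))
```

Relating ProteinDomain to Transcription with partOf would raise a ValueError in this sketch, because a continuant cannot be part of an occurrent.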
[17] http://www.nlm.nih.gov/pubs/factsheets/umls.html
[18] http://geneontology.org/
[19] http://sequenceontology.org/
[20] http://www.inoh.org
[21] http://www.ebi.ac.uk/chebi/
[22] http://130.14.29.110/Taxonomy/
[23] http://www.gene-regulation.com/
Medline abstracts that identifies keywords in conjunction with a selection of verbs. A similar
solution has been proposed by (Ono et al., 2001); it focuses on relations defined by
“interact”, “bind”, “associate” and “complex”, which are extracted with regular
expressions for syntactical patterns. Precision is around 94% and recall is around 85%
(82.5-86.8%).
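A keyword-and-pattern approach of this kind can be sketched with regular expressions. The two patterns and the protein list below are simplified assumptions for illustration; they are not the patterns published by Ono et al., and a real system would take protein names from an NER step instead of a fixed list.

```python
import re

# Toy protein name list (assumption); stands in for NER output.
PROTEINS = ["RAD51", "BRCA2", "MDM2", "p53"]
NAME = "|".join(re.escape(p) for p in PROTEINS)

# Patterns built around the keywords interact, bind, associate and complex.
PATTERNS = [
    re.compile(
        rf"\b(?P<a>{NAME})\b\s+(?:interacts|binds|associates)\s+"
        rf"(?:with\s+|to\s+)?(?P<b>{NAME})\b"
    ),
    re.compile(
        rf"\b(?P<a>{NAME})\b\s+forms\s+a\s+complex\s+with\s+(?P<b>{NAME})\b"
    ),
]

def extract_interactions(sentence):
    """Return (protein, protein) pairs matched by any of the patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("a"), match.group("b")))
    return pairs

print(extract_interactions("BRCA2 interacts with RAD51 in vivo."))
print(extract_interactions("p53 forms a complex with MDM2."))
```

Surface patterns like these tend to be precise but miss passives, coordination and long-distance dependencies, which is part of the motivation for the grammar-based systems described next.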
Several solutions have been proposed that match language patterns in the form of Finite State
Automata (FSA) to the scientific literature. (Pustejovsky et al., 2001) analyzed syntactical
language patterns for inhibitory events. Their system performed at a recall of 57% and a
precision of 90%. They did not offer any solution for PGN normalization. Part of their
solution is the processing of subordinate clauses, sentential coordination and anaphora
resolution. (Leroy et al., 2003) also applied cascaded FSAs to extract protein-protein
interactions from Medline abstracts. They reported 90% precision, but again did not apply
any PGN normalization. (Saric et al., 2006) extracted regulatory gene/protein networks from
Medline with cascaded FSAs. For their evaluation they opted for semantic correctness in
contrast to grammatical correctness and claim to have achieved 83 to 90% accuracy for
expression relations and 86 to 95% accuracy for phosphorylation relations.
(Park et al., 2001) applied Combinatory Categorial Grammar in conjunction with seven verbs
(including their inflections and noun phrases) denoting a positive regulatory effect (e.g.,
activate, stimulate) and five verbs denoting a negative regulatory effect (e.g., inhibit,
down-regulate). They address coordination, appositions and anaphoric expressions. They
claim 48% recall and 80% precision measured on a selection of 492 sentences.
(Friedman et al., 2001) use a system that parses text based on grammar rules (semantic
patterns, MedLEE). The grammar makes use of 22 terms denoting verbs and also nouns (e.g.,
apoptosis, myogenesis) that are categorized into 14 classes representing actions and
processes. All inflectional forms and the nominalizations have been considered for all verbal
forms. NER for PGNs is based on BLAST. (Temkin et al., 2003) applied a context-free
grammar to the processing of Medline abstracts. They integrated 49 verb forms, their
inflectional forms and nominalizations and achieved 63.9% recall at 70.2% precision. Temkin
and Gilder thus report the broadest coverage of verbs. Finally, (Daraselia et al., 2004) use a
context-free grammar to identify PPIs and report 91% precision based on Medline abstracts
(recall rate 21%).
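Because the systems above trade precision against recall in different ways, a single summary figure can help when reading the numbers. The sketch below computes the balanced F-measure (F1, the harmonic mean of precision and recall) from the figures quoted in the text. The F1 values are derived here and are not reported by the cited papers; since the systems were evaluated on different data sets and tasks, the values should not be read as a ranking.

```python
def f1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as quoted in the text above.
reported = {
    "Ono et al., 2001": (0.94, 0.85),
    "Pustejovsky et al., 2001": (0.90, 0.57),
    "Park et al., 2001": (0.80, 0.48),
    "Temkin et al., 2003": (0.702, 0.639),
    "Daraselia et al., 2004": (0.91, 0.21),
}

scores = {system: f1(p, r) for system, (p, r) in reported.items()}
for system, score in scores.items():
    print(f"{system}: F1 = {score:.3f}")
```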
5.2 Event/Relation Recognition
More refined systems than PPI extraction are now emerging. Some systems distinguish
different types of interaction, while others extract relationships other than PPI and extend
relationships between pairs of proteins to those among more than two entities of different classes.
We will demo two systems extracting relations and events from MEDLINE based on full
parsing: MEDIE [24] (Miyao et al., 2006) and InfoPubMed [25]. Both are provided as services
to the Life Sciences community by the UK National Centre for Text Mining [26].
The advantage of full parsing is that we can easily make generalizations for more than one
type of biological interaction. To achieve this generalisation, we use predicate argument
structures, which are canonical representations of sentence meanings that represent relations
in an abstract manner.
As an example of such systems, we take up an event extraction system based on deep parsing
and machine learning, which has been developed by the University of Tokyo and is used as a
service system (MEDIE) at the National Centre for Text Mining, UK. The system uses a subset
of the ontological classes of GO to define event classes and uses the results of a deep parser
as features for machine learners.
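The idea that predicate argument structures are canonical representations can be shown with a toy example: an active and a passive sentence map to the same structure. The two hand-written token patterns and the lemma table below are assumptions for illustration only; a deep parser derives such structures from a full syntactic analysis, which this sketch does not attempt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredicateArgument:
    predicate: str  # lemma of the verb
    arg1: str       # logical subject (agent)
    arg2: str       # logical object (theme)

# Hypothetical lemma table covering the example verbs only.
LEMMAS = {
    "activates": "activate", "activated": "activate",
    "inhibits": "inhibit", "inhibited": "inhibit",
}

def to_pas(sentence):
    """Map 'X verbs Y' and 'Y is verbed by X' to the same structure."""
    tokens = sentence.rstrip(".").split()
    if len(tokens) == 3 and tokens[1] in LEMMAS:
        return PredicateArgument(LEMMAS[tokens[1]], tokens[0], tokens[2])
    if (len(tokens) == 5 and tokens[1] == "is" and tokens[3] == "by"
            and tokens[2] in LEMMAS):
        return PredicateArgument(LEMMAS[tokens[2]], tokens[4], tokens[0])
    return None  # sentence shape not covered by this toy

active = to_pas("ProteinA activates ProteinB.")
passive = to_pas("ProteinB is activated by ProteinA.")
print(active)
print(active == passive)
```

Once both surface forms yield the same structure, one extraction rule or one machine learning feature covers both, which is the generalization the text attributes to full parsing.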
[24] http://www-tsujii.is.s.u-tokyo.ac.jp/medie/
[25] https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/
[26] www.nactem.ac.uk
[27] http://www.nactem.ac.uk/pathtext/
[28] http://www.nactem.ac.uk/software/kleio/
[29] http://text0.mib.man.ac.uk/software/facta/
[30] http://www.ebi.ac.uk/uniprot/
[31] http://regulondb.ccg.unam.mx/
[32] http://www.w3.org/Submission/SWRL/
BLIMP [33]: “BLIMP covers all publications related to the fast-growing field of biomedical
literature and text mining. It is a one-stop resource, letting researchers find out who-does-
what in the area and where it is published, bridging across the many discipline-specific
venues in which biomedical text-mining papers are published.”
[33] http://blimp.cs.queensu.ca/
[34] http://www.ccs.neu.edu/home/futrelle/bionlp/
References
Adamic, L.A., Wilkinson, D., Huberman, B.A., and Adar, E.. A Literature Based Method for
Identifying Gene-Disease Connections. IEEE Computer Society Bioinformatics Conference, 2002.
Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. (1999) "Automatic Extraction of
Biological Information from Scientific Text: Protein-Protein Interactions". ISMB99, 60-67.
Daraselia,N. et al. (2007) Automatic extraction of gene ontology annotation and its correlation with
clusters in protein networks. BMC Bioinformatics 10(8):243.
Friedman,C., et al. (2001) GENIES: a natural-language processing system for the extraction of
molecular pathways from journal articles. Bioinformatics, 17(Suppl 1), S74–82.
Gaudan,G., et al. (2008) Identifying GO terms in text. EURASIP JBSB, Hindawi Publishing Group.
(accepted)
Hakenberg,J., et al. (2005). Systematic feature evaluation for gene name recognition. BMC
Bioinformatics; 6 Suppl 1:S9.
Hirschman,L., et al. (2005) Overview of BioCreAtIvE task 1B: normalized gene lists. BMC
Bioinformatics, 6(Suppl 1):S11.
Hoffmann,R. and Valencia,A. (2005) Implementing the iHOP concept for navigation of biomedical
literature. Bioinformatics 21 (Suppl 2):ii252-8.
Huang,M., et al. (2004) Discovering patterns to extract protein–protein interactions from full texts.
Bioinformatics 20(18):3604-3612
Jelier,R., et al. (2005) Co-occurrence based meta-analysis of scientific texts retrieving biological
relationships between genes. Bioinformatics, 21(9):2049-58.
Kirsch,H., et al. (2006) Distributed modules for text annotation and IE applied to the biomedical
domain. Int. J. Med. Inform. 75(6):496-500.
Krallinger,M., Leitner,F., and Valencia,A. (2007) Assessment of the Second BioCreative PPI task:
Automatic Extraction of Protein-Protein Interactions. Proc Second BioCreative Challenge
Evaluation Workshop.
Leroy, G., and Chen,H.. (2002) Filling Preposition-based Templates to Capture Information from
Medical Abstracts. Proc Pacific Symp. on Biocomputing 7 (PSB), 362--373.
Liu H., Hu Z., Zhang J. and Wu C., (2006). BioThesaurus: a web-based thesaurus of protein and gene
names. Bioinformatics 22(1):103-105.
Marcotte, E. M., Xenarios, I., and Eisenberg, D.. Mining literature for protein-protein interactions,
Bioinformatics. 2001 Apr;17(4):359-63.
Morgan A., Hirschman L. (2007) Overview of BioCreative II Gene normalization. Proc Second
BioCreative Challenge Evaluation Workshop.
Ono,T., et al. (2001) Automated extraction of information on protein-protein interactions from the
biological literature. Bioinformatics 17(2):155-161.
Park,J.C. et al. (2001) Bidirectional incremental parsing for automatic pathway identification with
combinatory categorial grammar. Pac Symp Biocomput :396-407.
Pustejovsky,J. et al. (2001) Robust relational parsing over biomedical literature: extracting inhibit
relations. Pac Symp Biocomput :362-73.
Rebholz-Schuhmann,D., et al. (2006a) Annotation and Disambiguation of Semantic Types in
Biomedical Text: a Cascaded Approach to Named Entity Recognition. Workshop on
"Multi-Dimensional Markup in NLP", EACL 2006, Trento, Italy.
Rebholz-Schuhmann,D., et al. (2007a) EBIMed: text crunching to gather facts for proteins from
Medline. Bioinformatics 23(2):e237-e244.
Rebholz-Schuhmann,D., et al. (2007b) Text processing through Web services: Calling Whatizit.
Bioinformatics 2007 Nov 21
Rindflesch TC, Hunter L and Aronson AR (1999). "Mining Molecular Binding Terminology from
Biomedical Text". Proceedings of the AMIA Annual Symposium, 127-131.
Rzhetsky,A., et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating
molecular pathway data. J Biomed Inform 37(1):43-53.
Saric,J., et al. (2006) Extraction of regulatory gene/protein networks from Medline. Bioinformatics
22(6):645-650.
Sekimizu,T., et al. (1998) Identifying the interaction between genes and gene products based on
frequently seen verbs in Medline abstracts. Genome Informatics, 62-71.
Temkin,J.M. and Gilder,M.R. (2003) Extraction of protein interaction information from unstructured
text using a context-free grammar. Bioinformatics. 19(16):2046-53.
Tsuruoka,Y., et al. (2007). Learning string similarity measures for gene/protein name dictionary look-
up using logistic regression. Bioinformatics. 23(20):2768-74.
Krauthammer, M., and G. Nenadic, “Term Identification in the Biomedical Literature”, Journal of
Biomedical Informatics, Special Issue on Named Entity Recognition in Biomedicine, Vol.37, No. 6,
2004, pp.512--526.
Mika, S., and B. Rost, “Protein Names Precisely Peeled off Free Text,” Bioinformatics, Vol. 20, Suppl.
1, 2004, pp. i241--i247.
Morgan, A., et al., “Gene Name Extraction Using FlyBase Resources”, Proceedings of ACL Workshop,
NLP in Biomedicine, Sapporo, Japan, 2003, pp.1--8.
Morgan, A., et al., “Gene Name Identification and Normalization using a Model Organism Database,”
Journal of Biomedical Informatics, Vol. 37, 2004, pp. 396--410.
Nenadic, G., I. Spasic, and S. Ananiadou, “Terminology-Driven Mining of Biomedical Literature”,
Bioinformatics, Vol. 19, No.8, 2003, pp.938--943.
Nobata, C., N. Collier, and J. Tsujii, “Automatic Term Identification and Classification in Biological
Texts,” Proceedings of Natural Language Pacific Rim Symposium, 1999, pp.369--374.
Okazaki, N. and Ananiadou, S. (2006) Building an Abbreviation Dictionary using a Term Recognition
Approach, in Bioinformatics, 22(24):3089-3095
Park, J.C. & Kim, J.J. (2006) Named Entity Recognition, in Text Mining for Biology and Biomedicine,
Artech house, pp. 121- 142
Proux, D., et al., “Detecting Gene Symbols and Names in Biomedical Texts: A First Step toward
Pertinent Information,” Proc. 9th Workshop on Genome Informatics, 1998, pp. 72--80.
Pustejovsky, J., et al., “Automatic Extraction of Acronym--Meaning Pairs from Medline Databases,”
Medinfo, Vol. 10, 2001, pp. 371--375.
Pustejovsky, J., et al., “Medstract: Creating Large-Scale Information Servers for Biomedical Libraries,”
ACL Workshop on Natural Language Processing in the Biomedical Domain, 2002, pp. 85--92.
Raychaudhuri, S., J.T. Chang, P.D. Sutphin, and R.B. Altman, “Associating Genes with Gene Ontology
Codes Using a Maximum Entropy Analysis of Biomedical Literature,” Genome Res, Vol.12, No.1,
2002, pp.203--214.
Schwartz, A.S. and M.A. Hearst, “A Simple Algorithm for Identifying Abbreviation Definitions in
Biomedical Text,” Pacific Symposium on Biocomputing, 2003, pp. 451--462.
Tanabe, L., and W.J. Wilbur, “Tagging Gene and Protein Names in Biomedical Text”, Bioinformatics,
18(8), 2002, pp.1124--1132.
Wren, J.D., et al. (2005) Biomedical term mapping databases. Nucleic Acids Res, 33.
Yoshida, M., K. Fukuda, and T. Takagi, “Pnad-css: a Workbench for Constructing a Protein Name
Abbreviation Dictionary,” Bioinformatics, Vol. 16, 2000, pp. 169--75.
Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou (2008) Normalizing biomedical terms
by minimizing ambiguity and variability. BMC Bioinformatics (in press).
Yu, H., and E. Agichtein, “Extracting Synonymous Gene and Protein Terms from Biological
Literature,” Bioinformatics, 19, Suppl 1, 2003, pp.I340--349.
Zhou, G., et al., “Recognizing Names in Biomedical Texts: A Machine Learning Approach,”
Bioinformatics, Vol. 20, No. 4, 2004, pp. 1178--1190.
6.1.3 Relation mining, event extraction
Ahlers CB, Fiszman M, Demner-Fushman D, et al. Extracting semantic predications from MEDLINE
citations for pharmacogenomics. In: Pac Symp Biocomput 12. Maui, Hawaii, 2007:209–20.
Chun, H., et al., “Extraction of Gene-Disease Relations from MedLine using Domain Dictionaries and
Machine Learning,” Proc. Pacific Symp. on Biocomputing (PSB), 2006, pp. 4--15.
Chun, Hong-woo, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki,
Jun'ichi Tsujii (2006) Automatic Recognition of Topic-Classified Relations between Prostate Cancer
and Genes using MEDLINE Abstracts. BMC-Bioinformatics. 7(Suppl 3). pp. S4, November 2006.
Daraselia, N., et al., “Extracting Human Protein Interactions from MEDLINE using a Full-sentence
Parser,” Bioinformatics, Vol. 20, No. 5, 2004, pp. 604--611.
De Bruijn, B., and J. Martin, “Getting to the (C)ore of Knowledge: Mining Biomedical Literature,”
International Journal of Medical Informatics, Vol. 67, 2002, pp. 7--18.
Friedman, C., et al., “GENIES: A Natural-language Processing System for the Extraction of Molecular
Pathways from Journal Articles,” Bioinformatics, Vol. 17, 2001, pp. S74--S82.
Hu, Z., et al., “Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-
based System,” Bioinformatics, Vol. 21, 2005, 2759--2765.
Jelier,R., et al. (2005) Co-occurrence based meta-analysis of scientific texts retrieving biological
relationships between genes. Bioinformatics, 21(9):2049-58.
Kim JJ, Zhang Z, Park JC, et al. BioContrasts: extracting and exploiting protein-protein contrastive
relations from biomedical literature. Bioinformatics 2006;22:597-605.
Kim, J., and J. C. Park, “BioIE: Retargetable Information Extraction and Ontological Annotation of
Biological Interactions from the Literature,” Journal of Bioinformatics and Computational Biology,
Vol. 3, No. 3, 2004, pp. 551--568.
Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii. Corpus annotation for mining biomedical events
from literature. BMC Bioinformatics. 9(1). pp. 10, BioMed Central, 2008.
Leroy, G., and H. Chen, “Genescene: An Ontology-enhanced Integration of Linguistic and Co-
occurrence Based Relations in Biomedical Texts,” Journal of the American Society for Information
Science and Technology, Vol. 56, No. 5, 2005, pp. 457--468.
Leroy, G., H. Chen, and J.D. Martinez, “A Shallow Parser based on Closed-Class Words to Capture
Relations in Biomedical Text,” Journal of Biomedical Informatics, Vol. 36, No. 3, 2003, pp.145--
158.
McDonald, D.M., et al., “Extracting Gene Pathway Relations Using a Hybrid Grammar: The Arizona
Relation Parser,” Bioinformatics, Vol. 20, No. 18, 2004, pp. 3370--3378.
Novichkova, S., S. Egorov, and N. Daraselia, “MedScan, a Natural Language Processing Engine for
MEDLINE Abstracts,” Bioinformatics, Vol. 19, 2003, pp. 1699--1706.
Oda, Kanae, Jin-Dong Kim, Tomoko Ohta, Daisuke Okanohara, Takuya Matsuzaki, Yuka Tateisi and
Jun'ichi Tsujii. (2008) New challenges for text mining: Mapping between text and manually curated
pathways. In Christopher JO Baker and Su Jian (Eds.), BMC Bioinformatics.2008. To appear.
Park, J.C., “Using Combinatory Categorial Grammar to Extract Biomedical Information,” IEEE
Intelligent Systems, Vol. 16, No. 1, 2001, pp.62--67.
Pyysalo, S., et al., “Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-
Protein Interactions,” Proc. Int. Joint Workshop on Natural Language Processing in Biomedicine
and its Applications (JNLPBA), 2004, pp. 15--21.
Rebholz-Schuhmann D, Kirsch H, Arregui M, et al. EBIMed-text crunching to gather facts for proteins
from Medline. Bioinformatics 2007;23:e237–44
Rinaldi F, Schneider G, Kaljurand K, et al. Mining of relations between proteins over biomedical
scientific literature using a deep-linguistic approach. Artif Intell Med 2007;39(2):127–36.
Rinaldi, F., et al., “Mining Relations in the GENIA Corpus,” Proc. Second European Workshop on
Data Mining and Text Mining for Bioinformatics, 2004, pp. 61--68.
Sætre, Rune, Kazuhiro Yoshida, Akane Yakushiji, Yusuke Miyao, Yuichiro Matsubayashi and Tomoko
Ohta. (2007) AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-
IPS subtask. Proceedings of the Second BioCreative Challenge Evaluation Workshop.
Šarić, J., L. J. Jensen, and I. Rojas, “Large-scale Extraction of Gene Regulation for Model Organisms
in an Ontological Context,” In Silico Biology, Vol. 5, No. 0004, 2004,
http://www.bioinfo.de.isb/2004/05/0004/, June 2005
Wattarujeekrit T, Shah PK, Collier N. PASBio: predicate argument structures for event extraction in
molecular biology. BMC Bioinform 2004;5:155.
Yakushiji, A., et al., “Biomedical Information Extraction with Predicate-Argument Structure Patterns,”
Proc. Int. Symp. on Semantic Mining in Biomedicine, 2005, pp. 60--69.
Yakushiji, A., et al., “Event Extraction from Biomedical Papers Using a Full Parser,” Proc. Pacific
Symp. on Biocomputing (PSB 2001), Kauai, Hawaii, Jan. 3--7, 2001, pp. 408--419.
6.1.4 Annotation of Biomedical Corpora
Chou WC, Tsai RTH, Su YS, et al. A semi-automatic method for annotating a biomedical proposition
bank. Proceedings of workshop on frontiers in linguistically annotated corpora 2006. Association
for Computational Linguistics. Sydney, Australia, 2006:5–12.
Cohen, K., et al., “Corpus Design for Biomedical Natural Language Processing,” Proc. ACL Workshop
on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005, pp.
38--45.
Erjavec, T., et al., “Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus,” Proc.
ACL Workshop on Natural Language Processing in Biomedicine, 2003, pp. 97--104.
Kim, J., et al., “GENIA Corpus---a Semantically Annotated Corpus for Bio-Textmining,”
Bioinformatics, Vol. 19, Suppl. 1, 2003, pp. i180--i182.
Kim, J.D. & Tsujii, J. (2006) Corpora and their Annotation, in Text Mining for Biology and
Biomedicine, Ananiadou, S. & McNaught, J. (eds), Artech House, pp. 179-211.
Pakhomov, S., A. Coden, and C. Chute, “Creating a Test Corpus of Clinical Notes Manually Tagged
for Part-of-Speech Information,” Proc. COLING Joint Workshop on Natural Language Processing in
Biomedicine and its Applications, 2004, pp. 62--65.
Smith, L.H., et al., “MedTag: a Collection of Biomedical Annotations,” Proc. ACL Workshop on
Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005, pp.
32--37.
Tateisi, Y., and J. Tsujii, “Part-of-Speech Annotation of Biology Research Abstracts,” Proc. Int. Conf.
on Language Resource and Evaluation (LREC), 2004, pp. 1267--1270.
Tateisi, Y., et al., “Syntax Annotation for the GENIA corpus,” Proc. IJCNLP Companion volume,
2005, pp. 222--227.
Thompson, Paul, Giulia Venturi, John McNaught, Simonetta Montemagni and Sophia Ananiadou
(2008). Categorising Modality in Biomedical Texts. LREC 2008 workshop "Building and
Evaluating resources for biomedical text mining", Marrakech, Morocco.
Thompson, Paul, Philip Cotter, John McNaught, Sophia Ananiadou, Simonetta Montemagni, Andrea
Trabucco and Giulia Venturi (2008). Building a bio-event annotated corpus for the acquisition of
semantic frames from biomedical corpora. Sixth International Conference on Language Resources
and Evaluation (LREC 2008), Marrakech, Morocco
6.1.5 Tagging and Parsing
Brants, T., “TnT - A Statistical Part-of-Speech Tagger,” Proc. Applied Natural Language Processing
Conf., 2000.
Charniak, E., “A Maximum-Entropy-Inspired Parser,” Proc. NAACL, 2000.
Collins, M., “Head-Driven Statistical Models for Natural Language Parsing,” Ph.D. thesis, University
of Pennsylvania, 1999.
Giménez, J., and L. Màrquez, “Fast and accurate part-of-speech tagging: The SVM approach
revisited,” Proc. RANLP, 2003, pp. 153-163.
Hara, T., Y. Miyao and J. Tsujii, “Adapting a probabilistic disambiguation model of an HPSG parser to
a new domain,” Proc. Int. Joint Conf. on Natural Language Processing, 2005.
Miyao, Y., and J. Tsujii, “Probabilistic disambiguation models for wide-coverage HPSG parsing,”
Proc. ACL, 2005, pp. 83--90.
Miyao, Yusuke and Jun'ichi Tsujii. Feature Forest Models for Probabilistic HPSG Parsing.
Computational Linguistics. 34(1). pp. 35–80, MIT Press, 2008.
Ninomiya T, Tsuruoka Y, Miyao Y, et al. Fast and scalable HPSG parsing. TAL 2005 2007;46(2).
Reynar, J., and A. Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence Boundaries,”
Proc. ANLP, 1997, pp. 16-19.
Smith, L., et al., “MedPost: a part-of-speech tagger for bioMedical text,” Bioinformatics, Vol. 20, No.
14, 2004, pp. 2320-2321.
Toutanova K., et al., “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network,” Proc.
HLT-NAACL, 2003.
Tsuruoka, Y., and J. Tsujii, “Bidirectional Inference with the Easiest-First Strategy for Tagging
Sequence Data,” Proc. HLT/EMNLP, 2005. pp. 467--474.
Tsuruoka, Y., et al., “Developing a Robust Part-of-Speech Tagger for Biomedical Text,” Proc. 10th
Panhellenic Conference on Informatics, 2005, pp. 382-392.
Yoshida, Kazuhiro, Yoshimasa Tsuruoka, Yusuke Miyao and Jun'ichi Tsujii (2007) Ambiguous Part-
of-Speech Tagging for Improving Accuracy and Domain Portability of Syntactic Parsers. Twentieth
International Joint Conference on Artificial Intelligence (IJCAI)
[35] http://www.nlm.nih.gov/pubs/factsheets/umls.html
[36] http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
[37] http://130.14.29.110/Taxonomy/
[38] http://medlineplus.gov/
[39] http://biomint.pharmadm.com/protop/bin/bmstaticpage.pl?userType=guest&p=gpsdb
[40] http://redpoll.pharmacy.ualberta.ca/drugbank/
[41] http://www.ebi.ac.uk/chebi/
[42] http://pubchem.ncbi.nlm.nih.gov/
[43] www.ncbi.nlm.nih.gov/projects/LocusLink/
[44] www.ebi.ac.uk/embl/
[45] www.ncbi.nlm.nih.gov/Genbank/