
Information extraction from clinical trials

Dennis Jasch (DJ)

Centre for Health Informatics (CHI)


Australian Institute of Health Innovation (AIHI)
Macquarie University
Sydney Australia

Master of Research (MRES)

Supervisor: Dr Guy Tsafnat (GT)

Australian Institute of Health Innovation (AIHI)


Level 6, 75 Talavera Road
Macquarie University
NSW 2109, Australia

Dennis Jasch: dennis.jasch@students.mq.edu.au


Guy Tsafnat: guy.tsafnat@mq.edu.au
Contents
Summary
1. Introduction
1.1. Thesis Structure
1.2. Evidence based medicine
1.3. Systematic Reviews
1.4. Clinical Trials
2. Literature review
2.1. Introduction
2.2. Methods
2.2.1. Search
2.2.2. Selection criteria
2.2.3. Citation tracking
2.2.4. Extraction
2.3. Results
2.3.1. Tasks supported by automatic extraction
2.3.2. Extraction granularity, source, approach, and targeted elements
2.3.3. Machine learning algorithms and features
2.3.4. Evaluation, corpora and analyses of findings
2.4. Discussion of the literature review
2.4.1. Unclear extraction tasks
2.4.2. Machine Learning
2.4.2.1. Machine learning algorithms and their features
2.4.3. Heuristic Methods
2.4.4. Available corpora
2.4.5. Comparability of performance is limited
2.4.6. Limitations
2.5. Conclusions of the literature review
3. Selecting systems for analysis and experiment replication
3.1. ExaCT
3.2. ACRES
3.3. Independent result replication
3.3.1. Execution environment setup
3.3.2. System architecture
3.3.2.1. Pre-process PubMed XML to ACRES required format
3.3.2.2. Extract key elements
3.3.2.3. Association of extracted mentions
4. Creation of a new testing corpus
4.1. Origin of clinical trial documents for annotation
4.2. Annotation process
4.3. Details of newly created corpus
5. Executing the experiment on the new corpus
5.1. Refining the corpus
6. Methods for further development
6.1. Ripple Down Rules
6.2. Markov Logic Networks
7. Discussion
8. Conclusion
Acknowledgement
List of tables
List of figures
Abbreviations
Appendix
References
Summary
Evidence-based medicine (EBM) stipulates that decisions regarding the effectiveness of potential
interventions should be based on objective evidence such as randomized clinical trials and systematic
reviews. However, research output rate is accelerating faster than it is feasible to manually synthesize
into summaries, clinical practice guidelines and other decision aids. The most likely solution lies in the
computerized summarization of this research.

In this thesis we perform an extensive systematic literature review regarding methods of extracting
information targets from clinical trials. Based on this review, we select a method to be analysed and
the experiment to be reproduced and critically appraised. Upon successful recreation of the
experiment, we annotate a new corpus agreed upon by two annotators. We run the experiment
on the new corpus and compare the extraction results. Finally, we present two methods to possibly
advance the difficult task of extracting key information elements from clinical trials.
1. Introduction
This thesis represents my investigation into the state of the art of information extraction tools for the
automation and acceleration of the conduct of systematic reviews in evidence based medicine. The
thesis includes a systematic review of the literature on existing systems as well as an independent
evaluation of those systems I was able to obtain for independent assessment.

1.1. Thesis Structure


In this thesis we examine information extraction methods that focus on clinical trial
characteristics.

Section 1 provides an introduction to the background of evidence-based medicine.

Section 2 presents an overview of related work on the automatic identification of clinical trial
characteristics from the literature, aiming to establish the current status of the available
methods. The main emphasis is on the recognition of targeted trial features and the data used.

Section 3 presents a group of information extraction systems that were chosen for validation based
on the literature review conducted in section 2. Following this, and in order to further validate the
selected system, section 4 presents the steps used to create a newly annotated clinical trial corpus,
on which a previously published experiment is executed.

Section 5 describes the application of the chosen information extraction system to the newly
annotated trial corpus and presents a detailed comparison of the generated results with the previous
ones. Section 6 identifies potential methods that could be applied to the system and further optimise
its performance.

Section 7 contains a thorough discussion regarding our methods and results along with a detailed
error analysis.

Finally, section 8 concludes the thesis with a brief summary of the achievements made during this
course of research and presents further questions and challenges that were raised and that can be
further explored.
1.2. Evidence based medicine
Evidence-based medicine is the process of systematically reviewing, appraising and using available
clinical research findings to optimise the provision of health care to patients [1].

Health research promises societal benefit by improving the provided health care. However, there is a
gap between research findings (what is known) and health care practice recommendations [2]. As
highlighted in a recent Lancet series, this gap leads to waste in the investment made by society in
health research [3], since the currently available evidence cannot answer specific clinical questions.
At the same time, clinicians are seeking guidance on questions where the evidence is out of date,
missing or unverified.

Figure 1 Evidence-Based Medicine Pyramid [4]

Figure 1 shows the evidence-based medicine pyramid [4], distinguishing the filtered (e.g. systematic
reviews) from unfiltered (e.g. randomized controlled trials) information. In recent decades, there has
been a constant rise in the generation of clinically related data. Much of it is unfiltered information,
which makes the task of manipulating, inspecting and identifying important concepts of interest
difficult and time consuming. Filtering this data to separate information relevant to the research
question at hand from the irrelevant remainder, and aggregating it into useful guidelines, is
increasingly tedious for the same reasons. Such procedures require patience and time and demand the
supervision of more than one domain expert to ensure the authenticity and integrity of the reviewed
information. In addition, the number of performed
clinical trials is growing at a fast pace (see Figure 2), causing related systematic reviews to become
outdated quickly. Shojania 2007 found that of 100 publications reviewed, 4% required updating within
a year, and 11% after 2 years whereas 7% of the systematic reviews needed updating at the time of
publication [5].

To close the gap between unfiltered and filtered information and to enable clinical professionals to
identify important pieces of knowledge in the vast amounts of text, the application of computational
methods is inevitable and this is where information extraction methods can contribute. Information
extraction procedures have been applied widely in the clinical field to either identify key information
such as medication prescription or clinical concepts from a variety of health related data (e.g.,
electronic health records, epidemiological studies, clinical notes) [6].
Figure 2 Number of registered studies over time 1

1.3. Systematic Reviews


There are various methods for conducting literature-based research, and systematic reviews are
widely considered the most authoritative kind. In particular, Rosner 2012 notes that secondary
types of research, such as systematic reviews and meta-analyses, as opposed to primary research types
like observational studies and clinical trials:

“not only attempt to screen the largest sample of reports of potential interest, but also vet
them for their rigor and select only the articles presumed to be the most valid pertaining to a
focused research question. Particularly with meta-analyses, such reviews attempt to reconcile
contradictory reports by combining specific outcome measures statistically in the attempt to
reach a final level of significance of treatment effects.” [4]

The median time to conduct a Cochrane systematic review (“the leading resource for systematic
reviews in health care” 2) is 11 months, with other systematic reviews requiring about half that time
[7]. The process of conducting a review consists of a number of tasks, such as selecting the
related studies of interest and applying the exclusion criteria [8]. Figure 3 and Figure 4 depict these and the
expected length (in terms of time to completion) of each task. Steps include critical appraisal,
data extraction, and meta-analysis. This clearly reveals the immense amount of time and personnel
resources required to conduct such a systematic review.

1 http://clinicaltrials.gov
2 http://www.cochranelibrary.com/cochrane-database-of-systematic-reviews/
Figure 3 Timeline for a Cochrane review [8]
Figure 4 Process of systematic review creation [9]

1.4. Clinical Trials


Systematic reviews report and validate clinical knowledge recorded in epidemiological studies known
as clinical trials. Clinical trials are a branch of experimental research and, as opposed to
observational studies where no intervention or administered treatment is involved, they are the front
line of modern medicine. To be more specific, Last 2001 defines a clinical trial as:

“A research activity that involves the administration of a test regimen to humans to evaluate
its efficacy or its effectiveness and safety. The term is broadly polysemic: meanings include
from the first test of a drug in humans without any control treatment to a rigorously designed
randomized controlled trial.”

A specific form of clinical trial is the Randomized Controlled (or Control) Trial (RCT). The main purpose
of RCTs is to create a scientific environment for evaluating average effects, determining intervention
effects upon groups of subjects rather than individuals [4]. A typical RCT involves the random
assignment of participants to a study group, also called an intervention arm, or to a control group.
Random assignment is important to mitigate potential risks of bias that may otherwise exist.
The aim of RCTs is to ascribe specific effects to the administered intervention arms for a given medical
condition, free of confounding factors such as patient and staff preconceptions about the intervention
and/or control.

For over half a century, the blinding of patient and practitioner has been central to the design of the
traditional RCT. Blinding refers to the process in which the assignment and dispensation of treatment
regimens remains masked, so that neither the patient nor the researcher knows what type of intervention the
patient is receiving (i.e. placebo or the experimental treatment). In blinded studies, the participating
individuals are assigned to the corresponding arm without knowing which arm they are assigned to,
through a variety of techniques such as placebo controls and sham surgery. In double-blind studies, the staff
providing the intervention do not themselves know which intervention is provided (e.g. a nurse
administers a pill that looks identical whether it is the tested pill or a placebo). This minimises the
potential confounding elements that may arise in the study.

A question to be answered by a clinical trial is commonly based on the PICO structure. PICO stands
for:

• patient/problem/population (P)
• intervention (I)
• comparison/control (C)
• outcomes (O)

These are the central elements of a clinical trial. Such elements contain useful information regarding
the design and implementation of a clinical trial and are frequently targeted in automated extraction.
Since an increasing number of clinical trials are conducted worldwide, investigators must spend vast
amounts of time inspecting, validating and incorporating related information of interest when
generating a systematic review. Therefore, there is a need for systems that can recognise such clinical
trial features in order to assist researchers in detecting and summarizing information of interest from
the relevant literature [11].
2. Literature review
In this section we present our review of the literature on information extraction
from clinical trials. This section is currently in the process of being published in the journal Systematic
Reviews 3 (submission number: 8107973231745334) and has completed the second peer review. The
second peer review involves minor changes only, and we expect acceptance within the following 2
weeks.

2.1. Introduction
Evidence based medicine (EBM) is explicitly based on empirically proven research. Evidence that
considers or mitigates the risk of bias associated with findings is considered to be of higher quality and
more authoritative [12]. Systematic reviews (SR) are authoritative summaries of the primary
evidence that involve a rigorous and quality-centred review process. Randomized controlled trials
(RCT) are a fundamental source of high quality clinical evidence [12]. They are required by several
regulatory bodies and many systematic reviews rely on them exclusively.

The high resource demands of systematic reviews and their short life expectancy [5] mean that many
topics are covered only by obsolete systematic reviews, while others have not been addressed at all.
Indeed, a study that compared the rate at which RCTs and SRs are produced found that current
evidence syntheses are not able to keep up with the current rate of evidence production [13] and
automation seems to be the only feasible way to address this problem [14].

A recent review found that research on the automation of evidence synthesis has clearly favoured
some aspects of the process over others [15]. Here we systematically review the state of the art of
algorithms for extraction of information from RCTs and other types of clinical trials.

2.2. Methods

2.2.1. Search
We have searched common electronic databases (Table 1) for studies on algorithms and tools for data
extraction from text using the following search strategy:

1. text mining "clinical trial"


2. natural language extraction "clinical trial"
3. bionlp trial
4. extractive summarization "clinical trial"
Table 1 Scientific databases used in literature search

PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
1. ("data mining"[MeSH Terms] OR ("data"[All Fields] AND "mining"[All Fields]) OR "data mining"[All Fields] OR ("text"[All Fields] AND "mining"[All Fields]) OR "text mining"[All Fields]) AND "clinical trial"[All Fields]
2. (natural[All Fields] AND ("programming languages"[MeSH Terms] OR ("programming"[All Fields] AND "languages"[All Fields]) OR "programming languages"[All Fields] OR "language"[All Fields] OR "language"[MeSH Terms]) AND extraction[All Fields]) AND "clinical trial"[All Fields]
3. bionlp[All Fields] AND ("clinical trials as topic"[MeSH Terms] OR ("clinical"[All Fields] AND "trials"[All Fields] AND "topic"[All Fields]) OR "clinical trials as topic"[All Fields] OR "trial"[All Fields])
4. (extractive[All Fields] AND summarization[All Fields]) AND "clinical trial"[All Fields]

Scopus (http://www.scopus.com/)
1. TITLE-ABS-KEY ( text mining "clinical trial" )
2. TITLE-ABS-KEY ( natural language extraction "clinical trial" )
3. TITLE-ABS-KEY ( bionlp trial )
4. TITLE-ABS-KEY ( extractive summarization "clinical trial" )

Embase, via Ovid (http://ovidsp.tx.ovid.com/)
1. "text mining clinical trial".mp. [mp=title, abstract, heading word, drug trade name, original title, device manufacturer, drug manufacturer, device trade name, keyword]
2. "natural language extraction clinical trial".mp. [mp=title, abstract, heading word, drug trade name, original title, device manufacturer, drug manufacturer, device trade name, keyword]
3. bionlp trial.mp. [mp=title, abstract, heading word, drug trade name, original title, device manufacturer, drug manufacturer, device trade name, keyword]
4. "extractive summarization clinical trial".mp. [mp=title, abstract, heading word, drug trade name, original title, device manufacturer, drug manufacturer, device trade name, keyword]

3 http://www.systematicreviewsjournal.com/

All searches were performed between the 1st and 31st of May 2015. Searches were performed with
the default parameters of the respective search engine (with the exception noted in Table 1). No
restrictions on time of publication were set. The search strategy was developed by DJ and reviewed
by GT and the search was performed by DJ.

In addition to the common databases, we included the first 200 results retrieved via Google
Scholar. While Google Scholar has many useful features and a large database of scientific papers, it
often suffers from poor precision and inconsistent results across multiple users. Google Scholar was
therefore used to verify the completeness of the search, but to ensure reproducibility and limit the
number of false-positive documents retrieved, no more substantive use was made of it.

2.2.2. Selection criteria


The selection criteria were defined as follows:

• Include studies that evaluate the accuracy, precision, recall, sensitivity, specificity and/or
F-measure pertaining to methods, algorithms or tools that extract or label meta-information (e.g.
part of speech and structural headings such as Methods and Results) of text elements that may
help in the extraction of information from these elements
• Exclude studies on manual extraction methods
• Exclude studies on methods that extract information exclusively from clinical trial protocols
The selection criteria were applied by one reviewer (DJ), who consulted with the other (GT) on borderline
cases.
2.2.3. Citation tracking
The following steps were applied recursively to find more publications matching the search criteria:

• Follow (first-level) citations included in the study to previous or related work


• Search in all specified databases for publications by the same (first) authors

2.2.4. Extraction
The following characteristics of each study were extracted:

1. The task for which extraction for RCT was applied


2. The class of algorithm used (e.g. machine learning, heuristic rules) along with the granularity of
extraction (e.g. sentence, noun phrases), the source of extraction (e.g. abstract, full-text), and the
targeted extraction elements of the clinical trial (e.g. PICO elements)
3. If machine learning was used, the core machine learning algorithm(s) and the selection of features
to be used as input for the algorithm. We consider features to be heuristic whenever they are
based on domain knowledge e.g. cue-word lists, and specific number ranges.
4. The corpus characteristics and methods used for evaluation
Information was extracted from each included study by one reviewer (DJ) who consulted with the
other (GT) for borderline cases.

2.3. Results
After screening, citation tracking and citation de-duplication, 30 articles were selected for inclusion
(Figure 5, Table 2).
Figure 5 PRISMA diagram
Table 2 Publications included in this review

Name Title
McKnight 2003 [16] Categorization of sentence types in medical abstracts
Hara 2005 [17] Information extraction and sentence classification applied to clinical trial MEDLINE abstracts
Demner-Fushman 2005 [18] Knowledge Extraction for Clinical Question Answering: Preliminary Results
Rosemblat 2006 [19] A pragmatic approach to summary extraction in clinical trials
Demner-Fushman 2006 [20] Automatically Identifying Health Outcome Information in MEDLINE Records
Xu 2006 [21] Combining text classification and hidden Markov modelling techniques for structuring randomized clinical trial abstracts
Chung 2007 [22] A Study of Structured Clinical Abstracts and the Semantic Classification of Sentences
Hara 2007 [23] Extracting clinical trial design information from MEDLINE abstracts
Xu 2007 [24] Extracting Subject Demographic Information from Abstracts of Randomized Clinical Trial Reports
Rosemblat 2007 [25] Extractive Summarization in Clinical Trials Protocol Summaries: A Case Study
Hansen 2008 [26] A method of extracting the number of trial participants from abstracts describing randomized controlled trials
De Bruijn 2008 [27] Automated Information Extraction of Key Trial Design Elements from Clinical Trial Publications
Chung 2009 [28] Towards identifying intervention arms in randomized controlled trials: Extracting coordinating constructions
Summerscales 2009 [29] Identifying treatments, groups, and outcomes in medical abstracts
Chung 2009 [30] Sentence retrieval for abstracts of randomized controlled trials
Boudin 2010 [31] Combining classifiers for robust PICO element detection
Kiritchenko 2010 [32] ExaCT: automatic extraction of clinical trial characteristics from journal publications
Lin 2010 [33] Extracting formulaic and free text clinical research articles metadata using conditional random fields
Boudin 2010 [34] Improving medical information retrieval with PICO element detection
Zhao 2010 [35] Improving Search for Evidence-based Practice using Information Extraction
Kim 2011 [36] Automatic classification of sentences to support Evidence Based Medicine
Summerscales 2011 [37] Automatic summarization of results from clinical trials
Huang 2011 [38] Classification of PICO elements by text features systematically extracted from PubMed abstracts
Verbeke 2012 [39] A statistical relational learning approach to identifying evidence based medicine categories
Hsu 2012 [40] Automated extraction of reported statistical analyses: towards a logical representation of clinical trial literature
Zhao 2012 [41] Exploiting Classification Correlations for the Extraction of Evidence-based Practice Information
Summerscales 2013 [42] Automatic Summarization of Clinical Abstracts for Evidence-based Medicine
Huang 2013 [43] PICO element detection in medical text without metadata: Are first sentences enough?
Sarker 2013 [44] An Approach for Automatic Multi-label Classification of Medical Sentences
Hassanzadeh 2014 [45] Identifying scientific artefacts in biomedical literature: The Evidence Based Medicine use case
2.3.1. Tasks supported by automatic extraction
Only 2 closely related studies (published by the same first author and using the same approach)
explicitly described a specific goal to be supported by the extraction algorithm [37, 42]. Both aim to
statistically summarise the results of multiple trials using relative risk (RR) and number needed to treat
(NNT). Another algorithm [32] was developed for the curation of a structured RCT archive [46].

Other studies only implied the type of task for which the algorithms might be used. This is, to some
extent, subject to interpretation, as it is not necessarily stated explicitly. Further, an algorithm may not
exclusively support one aim, but could potentially be used for multiple tasks. For example, a
method to identify PICO elements may support searching and/or screening tasks. On the other hand,
a method to extract participant numbers would be more suited to synthesis. Finally, methods
extracting whole sentences are mainly limited to search and screening tasks, as a sentence can hardly
serve as direct input for any synthesis task. Several authors state future plans to further develop their
approach for more tasks; however, we based our decision on the approach described in the given
publication. The assigned aims (Table 3) are based on the 15 tasks in creating a systematic review
described by Tsafnat 2014 [15].
Table 3 Extraction aims

Aim | Publications
“search” | Hara 2005, Demner-Fushman 2005, Rosemblat 2006, Demner-Fushman 2006, Xu 2007, Hara 2007, Rosemblat 2007, De Bruijn 2008, Chung 2009, Summerscales 2009, Boudin 2010, Zhao 2010, Kiritchenko 2010, Boudin 2010, Kim 2011, Summerscales 2011, Huang 2011, Verbeke 2012, Zhao 2012, Sarker 2013, Summerscales 2013, Huang 2013, Hassanzadeh 2014
“screen” | McKnight 2003, Hara 2005, Demner-Fushman 2005, Xu 2006, Rosemblat 2006, Demner-Fushman 2006, Hara 2007, Xu 2007, Rosemblat 2007, Chung 2007, De Bruijn 2008, Chung 2009, Summerscales 2009, Boudin 2010, Boudin 2010, Zhao 2010, Kiritchenko 2010, Huang 2011, Kim 2011, Summerscales 2011, Verbeke 2012, Zhao 2012, Huang 2013, Sarker 2013, Summerscales 2013, Hassanzadeh 2014
“extract” | Xu 2007, Hansen 2008, De Bruijn 2008, Kiritchenko 2010, Summerscales 2011, Summerscales 2013
“meta analyse” | Summerscales 2013, Summerscales 2011
2.3.2. Extraction granularity, source, approach, and targeted


elements
A fifth (n=6) of the studies extract from more than just the abstract (Figure 6, Table 4). Fourteen (47%)
studies limited the granularity of the extraction to sentence-level (Figure 7, Table 4). Nineteen (63%)
studies use machine learning only, while 9 (30%) combine machine learning with heuristics (Figure 8,
Table 4). The elements most commonly extracted are Patients/Population/Participants,
Intervention/Treatment, and Outcome (Figure 9, Table 4). Another commonly extracted set of
elements are structural headings, such as Introduction, Methods, Results, Conclusion.

Most studies identify sentences as candidates for subsequent extraction of more fine-grained
information via machine learning. Notable exceptions are Hara 2007, who chunk text into base noun
phrases and classify those using machine learning [23], and Hansen 2008, who classify numbers using
machine learning [26].
Table 4 Extraction level and approach

Publication | Level of extraction | Extraction source | Extraction approach | Extracted elements
McKnight 2003 [16] | sentence | abstract | Machine learning | Introduction, methods, results, conclusion
Hara 2005 [17] | sentence | abstract | Machine learning | Compared treatment, endpoint, patient population
Demner-Fushman 2005 [18] | sentence/mixed | abstract | Machine learning; Heuristic (manually crafted rules, disorder concept recognition, intervention extractor using UMLS semantic network) | Population/problem, intervention/comparison, outcome
Rosemblat 2006 [19] | text fragment | unclear | Heuristic (regular expressions, sentence boundary detection, decision-based rules) | Purpose
Demner-Fushman 2006 [20] | sentence | abstract | Machine learning; Heuristic (rule-based classifier) | Outcome
Xu 2006 [21] | sentence | abstract | Machine learning | Background, objective, methods, results, conclusions
Chung 2007 [22] | sentence | abstract | Machine learning | Aim, method, participants, results, conclusions
Hara 2007 [23] | base noun phrases | abstract | Machine learning; Heuristic (regular expressions) | Compared treatment, endpoint, patient population
Xu 2007 [24] | word/phrase/number | abstract | Machine learning; Heuristic (anchor words, grammatical rules, UMLS semantic type extraction) | Subject descriptors, number of trial participants, diseases/symptoms, descriptors of diseases/symptoms
Rosemblat 2007 [25] | text fragment | full text | Heuristic (sentence boundary detection, regular expressions, semantic/syntactic checks) | Purpose
Hansen 2008 [26] | integers | abstract | Machine learning (heuristic features) | Number of trial participants
De Bruijn 2008 [27] | word/phrase/number | full text | Machine learning; Heuristic (manually designed weak extraction rules) | 23 information elements with differing characteristics (e.g. name of control/experimental treatment, sample size, primary/secondary outcome name/time point, etc.)
Chung 2009 [28] | phrase | abstract | Machine learning; Heuristic (extracting coordinating constructions from parse trees) | Coordinating constructions
Summerscales 2009 [29] | phrase | abstract | Machine learning | Treatments, groups, outcomes
Chung 2009 [30] | sentence | abstract | Machine learning | Intervention, participants, outcome
Boudin 2010 [31] | sentence | abstract | Machine learning (heuristic features) | Population/problem, intervention/comparison, outcome
Kiritchenko 2010 [32] | word/phrase/number | full text | Machine learning; Heuristic (set of regular expression ‘weak’ rules) | 21 information elements (e.g. name of control/experimental treatment, sample size, primary/secondary outcome name/time point, etc.)
Lin 2010 [33] | word/phrase/number | full text | Machine learning (heuristic features) | Author name, author email, institution, age group, data analysis name, data collection method, database name, data type (cohort, retrospectively), geographical area, intervention, longitudinal variables, number of observations, time period
Boudin 2010 [34] | sentence | abstract | Machine learning (heuristic features) | Population/problem, intervention/comparison, outcome
Zhao 2010 [35] | word/phrase/number | full text | Machine learning (heuristic features) | Sentences: patient, result, intervention, study design, research goal; Words: sex, age, race, condition, intervention, study design
Kim 2011 [36] | sentence | abstract | Machine learning | Background, intervention, outcome, population, study design, other
Summerscales 2011 [37] | word/phrase/number | abstract | Machine learning (heuristic features) | Treatment groups, outcome, associated quantities
Huang 2011 [38] | sentence | abstract | Machine learning | Population/problem, intervention/comparison, outcome
Verbeke 2012 [39] | sentence | abstract | Machine learning | Background, intervention, outcome, population, study design, other
Hsu 2012 [40] | word/phrase/number | full text | unclear | Hypothesis, statistical method, outcomes and estimation, generalizability
Zhao 2012 [41] | word/phrase/number | full text | Machine learning (heuristic features) | Sentences: patient, result, intervention, study design, research goal; Words: sex, age, race, condition, intervention, study design
Summerscales 2013 [42] | word/phrase/number | abstract | Machine learning (heuristic features); Heuristic (rule-based methods) | Age values, conditions, treatment groups, group sizes, outcome, outcome numbers, outcome event rates
Huang 2013 [43] | sentence | abstract | Machine learning | Patient, intervention, outcome
Sarker 2013 [44] | sentence | abstract | Machine learning (heuristic features) | Background, intervention, outcome, population, study design, other
Hassanzadeh 2014 [45] | sentence | abstract | Machine learning (heuristic features) | Background, intervention, outcome, population, study design, other
Figure 6 Included studies by part of text from which they extract information
Figure 7 Included studies by granularity of extraction

Figure 8 Included studies by class of algorithm
Figure 9 Frequency of targeted information fragments (>2)

2.3.3. Machine learning algorithms and features


A wide range of machine learning algorithms (Table 5) have been tested to classify sentences, phrases,
and numbers. Common algorithms include Conditional Random Fields (CRF), Support Vector Machines
(SVMs), and ensembles (combinations) of several different algorithms. CRF is a probabilistic modelling
method to label sequential data (such as natural language text), based on the probabilities of tokens
(words) in, e.g., a sentence. The tokens do not necessarily need to be in consecutive order. SVM
models the given input sentences as data points in a high-dimensional vector space and determines a
hyper-plane, such that data points belonging to different classes can be distinguished by a maximum
possible distance from that hyper-plane. Different so called kernel functions can be used to map the
data points to different hyperspaces to achieve non-linear classification. Hidden Markov Models
(HMM) can be viewed as a generative variant of CRFs, which are the discriminative analogue. First-
order HMMs take into account only previous state at a time, but one may use higher-order HMMs.
Less frequently used machine learning algorithms applied to this specific task include Naïve Bayes
(NB), Decision Trees (DT), Random Forests (RF), or Multi-Layer Perceptron (MLP). Verbeke 2012 was
the only study to use kLog, a statistical relational learning language [39].
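To make this classification setting concrete, the minimal sketch below (in Python, using scikit-learn) trains a linear SVM over simple bag-of-words features to label abstract sentences with PICO-style categories. It is an illustration of the general approach only, not a re-implementation of any of the reviewed systems; the example sentences and labels are invented.

```python
# Illustrative sketch only: linear SVM sentence classifier over TF-IDF
# bag-of-words features (not a re-implementation of any reviewed system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data; real systems train on hundreds of annotated abstracts.
sentences = [
    "Patients aged 18-65 with type 2 diabetes were enrolled.",
    "Participants received metformin or placebo for 12 weeks.",
    "The primary outcome was change in HbA1c at 12 weeks.",
]
labels = ["population", "intervention", "outcome"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(sentences, labels)

print(model.predict(["Subjects were randomly assigned to aspirin or placebo."]))
```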

A wide range of features has been tried as input for machine learning algorithms as listed in Table 5.
They may be classified into token-based, domain knowledge based, positional, heuristic/statistical,
and contextual/sequential features.

Common token-based features used in machine learning were individual words (also known as
unigrams), bi- or n-grams (sequences of 2 or more tokens forming one feature), tags marking the part
of speech (POS; i.e. verb, noun, adjective, etc.) of individual words, and stems or lemmas of words. Bag
of Words (BOW) is a representation scheme that is not limited to unigrams; higher-order n-grams may
also be encoded in a BOW representation.
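As a hedged illustration of these token-based features, the sketch below builds unigram/bigram bag-of-words counts with scikit-learn and part-of-speech tags with NLTK (assuming the NLTK tokenizer and tagger models have been downloaded); it is not taken from any of the reviewed systems.

```python
# Illustrative only: token-based features (unigram/bigram counts and POS tags).
from sklearn.feature_extraction.text import CountVectorizer
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

sentence = "Participants were randomly assigned to receive the study drug."

# Unigram and bigram bag-of-words counts
vectorizer = CountVectorizer(ngram_range=(1, 2))
bow = vectorizer.fit_transform([sentence])
print(dict(zip(vectorizer.get_feature_names_out(), bow.toarray()[0])))

# Part-of-speech tags as additional token-level features
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
```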

Common domain-knowledge features include the mapping of words to a controlled vocabulary such
as Unified Medical Language System (UMLS) concepts or Medical Subject Heading (MeSH) terms, and
adding those concepts/terms as input features. One aim of using a controlled vocabulary is to aid in
unifying different representations of the same concept in different locations (e.g. disorder and disease
get mapped to the same UMLS concept).
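The toy sketch below illustrates the idea behind such domain-knowledge features: different surface forms are mapped to a single concept identifier. The lookup table and identifiers are invented placeholders rather than real UMLS or MeSH content; real systems use dedicated tools for this mapping.

```python
# Toy illustration of concept normalisation; identifiers are placeholders,
# not real UMLS/MeSH codes.
CONCEPT_LOOKUP = {
    "disorder": "CONCEPT_DISEASE",
    "disease": "CONCEPT_DISEASE",
    "illness": "CONCEPT_DISEASE",
    "drug": "CONCEPT_SUBSTANCE",
    "medication": "CONCEPT_SUBSTANCE",
}

def concept_features(tokens):
    """Return the set of concept identifiers mentioned in a token list."""
    return {CONCEPT_LOOKUP[t.lower()] for t in tokens if t.lower() in CONCEPT_LOOKUP}

print(concept_features("The disease was treated with the study medication".split()))
```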

Positional features incorporate information about the position of a token or a sentence in relation to
its containing sentence or text. This can relate to a sentence being included in a particular structural
section, or to the relative position of a sentence/word within one section or text. Secondly, as texts
normally follow a logical flow, the relative position of a sentence within an abstract provides a clue to
the type of information it contains. The sentence location within the abstract and the title of the
containing section were among the most commonly used features in this category.
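A minimal sketch of such positional features, assuming sentences have already been split and (for structured abstracts) assigned to a section:

```python
# Sketch of positional features for one sentence within an abstract.
def positional_features(sentence_index, total_sentences, section_title=None):
    return {
        "abs_position": sentence_index,                      # 0-based index
        "rel_position": sentence_index / max(total_sentences - 1, 1),
        "is_first": sentence_index == 0,
        "is_last": sentence_index == total_sentences - 1,
        "section": section_title or "NONE",                  # e.g. "METHODS"
    }

print(positional_features(2, 10, section_title="METHODS"))
```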

Heuristic and statistical features measure the frequency or occurrence of certain tokens, such as pre-
defined cue-word lists to mark the presence of desired information (e.g. “measured”, “participants”).
They can also measure the number of, for example, verbs, adjectives, nouns, use of active/passive
voice, and the tenses used in one sentence. Document and sentence length were also used in this
feature category.
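A small illustrative sketch of heuristic/statistical features follows; the cue-word list is invented for the example and is not taken from any reviewed study.

```python
# Sketch of heuristic/statistical sentence features: cue-word and number
# counts plus sentence length. The cue-word list is a made-up example.
import re

CUE_WORDS = {"randomized", "randomised", "participants", "measured", "outcome"}

def heuristic_features(sentence):
    tokens = [t.strip(".,;()") for t in sentence.lower().split()]
    return {
        "n_tokens": len(tokens),
        "n_cue_words": sum(t in CUE_WORDS for t in tokens),
        "n_numbers": len(re.findall(r"\d+", sentence)),
    }

print(heuristic_features("A total of 120 participants were randomized to two arms."))
```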

Contextual or sequential features integrate the feature information of surrounding tokens and/or
sentences into the classification of the current token/sentence. This is intended to leverage the fact
that natural language text normally follows a coherent flow.
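One simple way to realise this, sketched below, is to copy selected features of the neighbouring sentences into the feature dictionary of the current sentence (sequence models such as CRFs and HMMs capture such dependencies directly):

```python
# Sketch: add features of the previous/next sentence to each sentence's
# feature dictionary ("contextual" features for a per-sentence classifier).
def add_context(feature_dicts):
    contextualised = []
    for i, feats in enumerate(feature_dicts):
        row = dict(feats)
        if i > 0:
            row.update({f"prev_{k}": v for k, v in feature_dicts[i - 1].items()})
        if i + 1 < len(feature_dicts):
            row.update({f"next_{k}": v for k, v in feature_dicts[i + 1].items()})
        contextualised.append(row)
    return contextualised

print(add_context([{"section": "METHODS"}, {"section": "METHODS"}, {"section": "RESULTS"}]))
```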
Table 5 Machine learning algorithms and the features used as input to them. A semicolon separating the algorithms indicates both algorithms have been tried separately, a comma indicates an
ensemble of the mentioned algorithms has been used (for abbreviations, please refer to the Abbreviations table)

Publication | Machine learning algorithm | Input features
McKnight 2003 [16] | SVM | BOW, sentence location
Hara 2005 [17] | BACT [44] | BOW; n-gram; dependency grammar
Demner-Fushman 2005 [18] | Naive Bayes, Maximum likelihood | BOW, n-gram, sentence location, document length, ad-hoc boosted UMLS score
Rosemblat 2006 [19] | Naive Bayes, Maximum likelihood | BOW, n-gram, sentence location, document length, ad-hoc boosted UMLS score
Demner-Fushman 2006 [20] | HMM | BOW without pre-processing
Xu 2006 [21] | CRF | BOW
Chung 2007 [22] | SVM; CRF | BOW, POS, categories of noun phrase chunking (if determined)
Hara 2007 [23] | HMM | BOW, bi-gram
Xu 2007 [24] | SVM | 56 + 38 features, based on 6 feature templates (unclear description)
Rosemblat 2007 [25] | SVM | BOW, n-gram
Hansen 2008 [26] | CRF | BOW, POS
De Bruijn 2008 [27] | CRF | BOW, POS, MeSH, semantic tag of MeSH term, section title, set of 4 context words to the left and right (their POS and semantic tags)
Chung 2009 [28] | SVM; CRF | BOW, POS, sentence location, features from previous and following sentence, section title
Summerscales 2009 [29] | J48 (WEKA decision tree C4.5); Naive Bayes; Random Forest; SVM; Multi-Layer Perceptron | Sentence location, sentence length, # of punctuation marks, # of numeric numbers, word overlap with title, # of cue-words, # of cue-verbs, MeSH semantic types, # of (n=[0-9]+)
Chung 2009 [30] | SVM | BOW, n-gram
Boudin 2010 [31] | CRF | Stemmed BOW, occurrence of cue words, token position within first 15 lines of text, occurrence of email, is numeric, orthographic features (has capital letters)
Kiritchenko 2010 [32] | J48 (WEKA decision tree C4.5); Naive Bayes; Random Forest; SVM; Multi-Layer Perceptron | Sentence location (absolute, relative), sentence length, # of punctuation marks; # of numbers (≤10, >10), word overlap with title, # of cue-words, number of cue-words, MeSH semantic types
Lin 2010 [33] | Maximum entropy [47] | Sentence classification: n-gram (n=1…3), sentence length, named entity (based on OpenNLP extraction), MeSH terms, cue-words; Word classification: BOW, stemmed BOW, POS, position of token in sentence, head noun of noun phrase, named entity (based on OpenNLP extraction), MeSH terms, cue-words
Boudin 2010 [34] | CRF | BOW, POS, bi-gram, UMLS concepts/synonyms, sentence location, section title, features/labels of different number of previous sentences
Zhao 2010 [35] | CRF | BOW, POS, token in parenthesis, phrase type (noun phrase, verb phrase, …), UMLS semantic type, is first/last token in phrase, 4 surrounding tokens (token, POS, semantic tags, is same phrase), section title; for quantities: <5, 4 surrounding tokens (token, POS, semantic tags, mention label), specific pattern match, syntactic/semantic context features, section title
Kim 2011 [36] | Multi-layer Perceptron | Stemmed BOW
Summerscales 2011 [37] | kLog | Structured abstracts: lemma of sentence root word, contains header word, section title; Unstructured abstracts: lemma of sentence root word, POS, lemma of root word of previous sentence
Huang 2011 [38] | UMIA [48] | unclear
Verbeke 2012 [39] | CRF | Sentence classification: n-gram (n=1…3), sentence length, named entity (based on OpenNLP extraction), MeSH terms, cue-words; Word classification: BOW, stemmed BOW, POS, position of token in sentence, head noun of noun phrase, named entity (based on OpenNLP extraction), MeSH terms, cue-words
Hsu 2012 [40] | CRF | Lemma, POS, special annotation, is acronym, is number %/int/float, is number negative, is number < 10, UMLS, semantic tag based on cue-words, integer patterns (e.g. n=xx), expansion of acronyms, is inside parentheses, closest parent verb in parse tree, dependency features, token and semantic features of 3-4 surrounding tokens, section title
Zhao 2012 [41] | Naive Bayes | BOW
Summerscales 2013 [42] | SVM | n-gram (n=1…3), POS, sentence location, section title, UMLS, n-grams (n=1…3) of previous (1, 2) sentence, set of cue-phrases
Huang 2013 [43] | CRF; SVM; Naive Bayes; Multinomial Logistic Regression | POS, orthographic cases, lemmas, # of nouns, # of adjectives, # of tokens, # of different tenses, # of passive/active verbs, # of negative verbs, sentence location, section title, labelling equality of 1/2/3 preceding sentences
2.3.4. Evaluation, corpora and analyses of findings
The corpora used (Table 7) are almost unique to each study. Only 2 corpora were used by more than
one group (Table 6). The selection criteria and sources for the corpus citations to extract from are
largely arbitrary. The main source is a random selection of publications from MEDLINE/PubMed and
medical journals such as PLoS Clinical Trials, the New England Journal of Medicine, The Lancet, and the
Journal of the American Medical Association. Some authors search for a list of diseases to retrieve trial
publications, some apply filters for MeSH terms indicating RCTs, and many limit the publication date
to a recent timespan.
Table 6 Corpora re-used in multiple publications for comparison

Publication | Re-used in | Size | Annotation level | Annotations
Demner-Fushman 2006 | Kim 2011 | 633 abstracts | sentence | Background, Population, Intervention, Outcome, Supposition, Other
Kim 2011 | Verbeke 2012, Sarker 2013, Hassanzadeh 2014 | 1000 abstracts | sentence | Background, Population, Intervention, Outcome, Study Design, Other

Very little information is given on how validation methods were decided on. The most commonly
(n=12) used validation technique is 10-fold cross-validation, which randomly splits the original data
set into 10 equally sized sub-sets. Of the 10 sub-sets, one is retained for testing the classifier, and the
remaining 9 are used as training data. This process is then repeated 10 times, with each of the 10 sub-
sets used once as the testing data. The 10 results are then averaged (or otherwise combined) to
produce a single result. Apart from 10-fold cross-validation, 3-, 5-, and 15-fold cross-validation were also
used. Several publications (n=7) withhold an unseen set for the purpose of testing their classifier.
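As an illustration of the scheme described above (not tied to any particular reviewed study), the following sketch runs 10-fold cross-validation with scikit-learn on synthetic data and averages the per-fold F1 scores:

```python
# Illustrative 10-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="f1")
print(scores.mean(), scores.std())  # averaged result over the 10 folds
```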

The most popular performance measures used are precision (P), which is the likelihood that an extracted
term is correct, and recall (R), which is the likelihood that an information element is correctly extracted
by the algorithm. Precision and recall are combined in the F1-score (F1) (Equation 1),
which is their harmonic mean. It can take values from 0 (worst) to 1 (best). Another frequently
reported measure is accuracy, which is the number of correctly extracted information
elements divided by the total number of information elements to be extracted. Several publications
(n=4) only report percentages of fragments classified entirely correctly, partially correctly, or incorrectly,
which does not allow a clear distinction between positive and negative classifications. Individual
publications report the area under the receiver operating characteristic curve (ROC or AUC), which is
not explained further here.
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
Equation 1 F-Score
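As a worked example of these measures, the sketch below computes precision, recall and F1 from raw counts of true positives (TP), false positives (FP) and false negatives (FN); the counts are invented for illustration.

```python
# Worked example: precision, recall and F1 from TP/FP/FN counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 correctly extracted elements, 2 spurious, 4 missed
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.667, 0.727)
```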
Table 7 Details of evaluation corpora and validation methods (for abbreviations, please refer to the Abbreviations table)

Publication | Corpus description | Corpus size | Validation methods | Reported measures
McKnight 2003 [16] | RCTs on medical therapy from MEDLINE | 7253 structured abstracts (90655 sentences); 204 unstructured abstracts (1629 sentences) manually labelled | 10-fold cross-validation on the structured abstract set; testing of the classifier trained on structured abstracts on the 204 manually labelled unstructured abstracts | ROC, P, R, F1, Accuracy
Hara 2005 [17] | Recent abstracts of hepatitis RCTs from MEDLINE | 50 abstracts (562 sentences) manually labelled | 5-fold cross-validation | P, R
Demner-Fushman 2005 [18] | MEDLINE abstracts regarding childhood immunization and different diseases (i.e. diabetes) | 633 abstracts manually labelled | Multiple tests on hold-out sets (100 abstracts) | % of correct/unknown/wrong
Rosemblat 2006 [19] | Randomly chosen clinical trials | 300 | Test over entire corpus | % of perfect/appropriate/wrong
Demner-Fushman 2006 [20] | 5 sets of MEDLINE abstracts of citations relating to various diseases | 633 abstracts manually labelled; 1312 MEDLINE citations | 10 iterations of randomly selecting from the 633 citations, setting aside 60 as the test set and another 60 citations as the verification set and using the rest for training; “extrinsic evaluation” on 1312 MEDLINE citations | % of correctly identified
Xu 2006 [21] | Randomly selected structured RCT abstracts published from 2004-2005 | 3896 structured abstracts (46370 sentences) | 50% used for training, 50% for testing | P, R, F1
Chung 2007 [22] | RCTs on asthma, diabetes, breast cancer, prostate cancer, erectile dysfunction, heart failure, cardiovascular disorders and angina from MEDLINE | 3657 structured abstracts (45000 sentences) | Trained on 3439 abstracts, tested on a test set of 62 unseen abstracts | P, R, F1
Hara 2007 [23] | Abstracts labelled as both “Neoplasms [MeSH Terms]” and “Clinical Trial, Phase III [Publication Type]” | 200 abstracts (2390 sentences) manually annotated | Trained on 199 abstracts, tested on 1 abstract | P, R
Xu 2007 [24] | RCT abstracts that contained Methods sections, published in 2005 | 3896 structured abstracts (46370 sentences); 250/50 (for participant numbers) unstructured abstracts manually labelled | Trained on 3896 abstracts, tested on 250 unstructured/50 unstructured abstracts | Accuracy, F1 (partially)
Rosemblat 2007 [25] | ClinicalTrials.gov | 27489 XML documents, as of February 2006 | Training: 13800 Purpose Descriptions from the ClinicalTrials.gov XML documents, as of May 2005; testing: full set of 27489 XML documents, as of February 2006; manual evaluation on a sample of 300 documents (of 27489) | % meeting human judge criteria; % of perfect/appropriate/wrong extraction
Hansen 2008 [26] | Subset of corpus from [22] | 223 abstracts (142 structured) manually labelled | Trained on 148 abstracts, tested on 75 abstracts | P, R, F1, Accuracy
De Bruijn 2008 [27] | Full-text RCT articles from 5 top-tier medical journals (PLoS Clinical Trials, NEJM, Lancet, JAMA, and Annals of Internal Medicine); randomly selected (mostly from 2006) | 88 full-text articles manually labelled | Trained on 78 articles, tested on 10 unseen articles | P, R
Chung 2009 [28] | MEDLINE RCTs published between 1998 and 2006, specifying RCT in the publication type field, keywords: asthma, diabetes, breast cancer, prostate cancer, erectile dysfunction, heart failure, cardiovascular, angina | 13605 structured abstracts (156622 sentences); 203 abstracts manually labelled; 124 abstracts partially manually labelled as an independent non-overlapping test set, reviewed after classification | 15-fold cross-validation; 10-fold cross-validation | P, R, F1, Accuracy
Summerscales 2009 [29] | Abstracts of the first 100 randomized controlled trials published online in 2005 and 2006 (publication dates range from 2005 to 2007) | 100 abstracts (1344 sentences) manually labelled | 10-fold cross-validation | P, R, F1
Chung 2009 [30] | MEDLINE RCTs published between 1998 and 2006, specifying RCT in the publication type field, keywords: asthma, diabetes, breast cancer, prostate cancer, erectile dysfunction, heart failure, cardiovascular, angina | 318 abstracts (211 structured) manually labelled; 13605 structured abstracts (156622 sentences) | 15-fold cross-validation | P, R, F1
Boudin 2010 [31] | Structured abstracts from PubMed, criteria: publication date 1999-2009, Humans, Clinical Trial, Randomized Controlled Trial, English | 14279 abstracts (191608 sentences) containing Population/Problem; 9095 abstracts (125399 sentences) containing Intervention/Comparison; 2394 abstracts (32908 sentences) containing Outcome (not mutually exclusive) | 10-fold cross-validation | P, R, F1
Kiritchenko 2010 [32] | Randomly chosen full-text RCTs from 5 core clinical journals (Annals of Internal Medicine, the New England Journal of Medicine, PLoS Clinical Trials, JAMA, and The Lancet) for training and rule-devising; RCTs on drug treatment in English that were published in the core clinical journals (as defined by PubMed) in 2009, had abstracts and full texts available in HTML format, and reported on RCTs on human subjects | 78 full-text articles for training; 50 full-text articles for testing | Trained on the training set, tested on the test set; leave-one-out cross-validation mentioned, but not reported | % of fully/partially/incorrect solutions
Lin 2010 [33] | Open-access full-text literature documenting oncological and cardio-vascular studies in the region, over a 3 year period from 2005 to 2008 | 185 articles manually labelled | 3-fold cross-validation on 93 articles | P, R, F1
Boudin 2010 [34] | Structured PubMed abstracts, keyword “diabetes”, limitations: Humans and English language | 14279 abstracts containing Population/Problem; 9095 abstracts containing Intervention/Comparison; 2394 abstracts containing Outcome (not mutually exclusive) | 10-fold cross-validation | P, R, F1, MAP, P@10
Zhao 2010 [35] | Abstracts and full text articles from 17 journal websites which contain quality research materials as recommended by the nurses from the Evidence-Based Nursing Unit in National University Hospital | 19893 abstracts/full-text articles; 2000 sentences (of all abstracts) manually labelled for sentence classification; 6754 words (667 sentences) manually labelled for word classification | 5-fold cross-validation | P, R, F1
Kim 2011 [36] | Random abstracts from the Global Evidence Mapping Initiative and The Agency for Healthcare Research and Quality | 1000 abstracts (376 structured) manually labelled | 10-fold cross-validation | P, R, F1
Summerscales 2011 [37] | British Medical Journal (BMJ) RCT abstracts from PubMed between 2005 and 2009, evaluating treatments | 263 abstracts manually labelled | 10-fold cross-validation | P, R, F1
Huang 2011 [38] | Semi-structured abstracts from PubMed | 8448 sentences containing Participants; 5615 sentences containing Intervention; 9409 sentences containing Outcome | 10-fold cross-validation | P, R, F1
Verbeke 2012 [39] | Same as [36] | 1000 abstracts (376 structured) manually labelled | 10-fold cross-validation | F1
Hsu 2012 [40] | PubMed abstracts, keywords “non-small cell lung cancer OR NSCLC”, “chemotherapy”, “randomized controlled trial”, and “NOT review” | 42 papers manually labelled | Trained on 35 papers, tested on 7 papers | P, R, F1
Zhao 2012 [41] | Abstracts and full text articles from 17 journal websites which contain quality research materials as recommended by the nurses from the Evidence-Based Nursing Unit in National University Hospital | 19893 abstracts/full-text articles; 2000 sentences (of all abstracts) manually labelled for sentence classification; 6754 words (667 sentences) manually labelled for word classification | 5-fold cross-validation | P, R, F1
Summerscales 2013 [42] | British Medical Journal (BMJ) RCT abstracts from PubMed between 2005 and 2006, evaluating treatments; Cardio corpus, a set of 42 abstracts from different journals obtained using the PubMed query “cardio-vascular disease”; Ischemia corpus, a collection of 117 abstracts from various journals obtained using the query “myocardial ischemia” | 188 abstracts manually labelled; 42 abstracts; 117 abstracts | Trained/developed on the BMJ and Cardio corpora, tested on the Ischemia corpus | P, R, F1
Huang 2013 [43] | MEDLINE articles which are RCTs according to MeSH labels | 15986 abstracts containing Population/Problem; 13029 abstracts containing Intervention/Comparison; 10778 abstracts containing Outcome | 10-fold cross-validation | P, R, F1
Sarker 2013 [44] | Same as [36] | 1000 abstracts (376 structured) manually labelled | 10-fold cross-validation on 800 abstracts; 200 abstracts as unseen test set | F1, AUC (partially)
Hassanzadeh 2014 [45] | Same as [36] | 1000 abstracts (376 structured) manually labelled | 10-fold cross-validation | P (partially), R (partially), F1
2.4. Discussion of the literature review

2.4.1. Unclear extraction tasks


According to Tsafnat 2014, the process of creating a systematic review can be divided into 15 tasks that require different approaches to automation [15]. There is a clear need for different classes of extraction tools. Ideally, searching for trials for inclusion in a systematic review results in perfect recall (R = 1.0), i.e. all relevant trials are retrieved. In the best possible scenario, appraisal of trials for inclusion in a systematic review should have perfect precision (P = 1.0), i.e. all included trials are relevant. While we do not expect tools to reach these ideal goals, we note that it is important to assess a tool's contribution with these targets in mind. Knowing the task the tool aims to achieve thus helps interpret the results provided. For example, the extraction of salient words to increase recall requires different methods than extracting risk-of-bias information from selected trials, and in each example the sought information appears in different parts of the study. Systematic search for evidence aims to identify all relevant citations even if many irrelevant ones are included [14]; therefore, approaches that trade off precision in favour of recall would be suitable for improving search. In contrast, screening studies for inclusion requires that key points of the trial be extracted with high precision.
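Throughout this discussion, precision, recall, and the F1-score are used in their standard definitions (TP, FP, and FN denoting true positive, false positive, and false negative extractions):

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}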

We could identify only two studies [37, 42] describing a specific clinical task, and one [32] that mentions a specific task but does not show how the algorithm can be applied to that goal. None of the other included publications described the clinical task they aim to support, which makes it difficult to categorize them for a specific use. Tailoring an extraction tool to a specific task would increase its potential for re-use and incremental growth.

Extraction of information into tabulated form may be seen as a general task for making clinical trials machine-readable [46]. Chalmers 1994 demonstrated that using a database of extracted trials could greatly speed up review times [49]. Tabulation was mentioned by only one of the analysed studies [32].

Evidence synthesis requires a finer granularity (i.e. phrases, numerical data, etc.). Methods that extract whole sentences, and algorithms that extract meta-information about sentences (e.g. structural headings), are assumed to be limited to supporting manual tasks or to serving as a step in a larger automatic extraction system. Fourteen of the included studies extracted sentences only, and two extracted meta-information only.

2.4.2. Machine Learning


The most popular choice of extraction approach is clearly to use machine learning to classify sentences, phrases, or even numbers. Only 3 studies [19, 25, 40] (10%) do not make use of machine learning at all. A wide range of machine learning algorithms has been tried, with CRFs, ensembles of multiple classifiers, and SVMs being the ones most often utilised.

Machine learning primarily addresses the task of identifying candidate sentences for further processing/extraction of more fine-grained information. The techniques predominantly used for this further processing are regular expressions, “hand-crafted” rules (mainly if-then-else rules of varying complexity), and lists of cue-words used to locate the desired information.

2.4.2.1. Machine learning algorithms and their features
There are only a few publications that directly compare the performance of different machine learning algorithms. In many cases, it is merely stated that one algorithm outperformed another, without reporting data on the differences in performance. We were able to identify 6 studies comparing the performance of SVMs and CRFs that report data on their performance (see Table 8). As seen from the data, CRF appears to perform slightly better; however, the differences are not statistically significant (unpaired t-test: two-tailed, p-value = 0.089; 95% confidence interval: -0.016 to 0.197). We are unable to draw specific conclusions regarding which machine learning algorithm may be best suited for the task at hand, even though CRFs seem to be utilised slightly more frequently than others. Performance is influenced by a multitude of factors, such as the selected features, the targets of extraction, and the test corpus.
Table 8 CRF-SVM comparison (some publications with multiple experiments)

Publication CRF SVM
Chung 2007 (average of 5 extracted elements) 0.834 0.796
Hara 2007 - 1 0.911 0.913
Hara 2007 - 2 0.901 0.869
Chung 2009 - 1 0.9475 0.83
Chung 2009 - 2 0.8567 0.8133
Hassanzadeh 2014 0.9076 0.5956
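The reported significance test can be checked directly against the figures in Table 8. The following sketch (our own, assuming SciPy is available and equal variances, as implied by the reported confidence interval) yields a two-tailed p-value of approximately 0.09:

import scipy.stats as stats

# F1 figures from Table 8
crf = [0.834, 0.911, 0.901, 0.9475, 0.8567, 0.9076]
svm = [0.796, 0.913, 0.869, 0.8300, 0.8133, 0.5956]

# Two-tailed unpaired (independent samples) t-test, equal variances assumed
t_stat, p_value = stats.ttest_ind(crf, svm)
print(t_stat, p_value)  # p is roughly 0.089, i.e. not significant at the 0.05 level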

The experimental data of Hassanzadeh 2014 indicates that positional features have a large impact on extraction performance [45]: omitting positional features resulted in the greatest performance decrease. The results of Zhao 2010 suggest that token-based features lead to the most significant drop in performance [35]. The conclusion of Summerscales 2009 is that no single feature or feature class has an overall effect on performance, but “that the usefulness of each feature depends on the type of entity that the system is trying to recognize and the type of matching criteria used for analysis” [29]. Summerscales reaches a similar conclusion in his later work, stating that semantic, syntactic and sentence features do improve performance, but also that the benefit of a feature depends on the extraction target [42]. Considering these diverse conclusions, we are unable to make recommendations about feature selection.

While one may assume that a higher number of features or more domain-specific features leads to better classification results, this is not always the case. Summerscales 2009 reports that extraction performance improved when domain-specific features (here: UMLS annotations) were omitted [29]. Similarly, Hassanzadeh 2014 attributes their performance gain to the lack of “external resources” (e.g. domain-specific concepts) [45]. Further research is needed to establish a plausible relation between selected features and extraction performance. This also raises the question of whether different medical fields respond better to different features being considered in the training process.

The integration of heuristic features, for example lists of cue-words, may be viewed critically, as it may introduce a bias. Such cue-word lists are usually created manually and are likely based on the corpus used for testing; the extraction method is thus specifically adapted to the underlying corpus.

More research is needed into the correlation between selected features, text characteristics (such as
medical field), corpus characteristics, and the performance of the extraction.

2.4.3. Heuristic Methods
Machine learning is an integral part of most publications for identifying sentences that potentially contain targeted information. To extract information at the phrase/word level from these previously identified sentences, different heuristic methods have been applied. Some publications try to identify information directly from full text, omitting sentence identification via machine learning [23], and some use additional methods to improve their extraction. This section presents an overview of several notable approaches.

The most common technique is regular expressions, which were used by 11 of the 30 included studies (37%) to identify fragments of sentences.

Another common practice is some form of prior sentence normalisation: converting written-out numbers to numerical form [26, 28], and normalising statistical and mathematical notations, dates, percentage notations, time/date notations, etc.
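As an illustration of the kind of rules involved (none of the reviewed systems publish their exact rules, so the patterns below are our own simplified examples), a regular expression can pull a group size out of a candidate sentence, and a small lookup table can normalise written-out numbers beforehand:

import re

# Hypothetical, simplified patterns for illustration only
WORD_TO_NUM = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
               "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10"}
GROUP_SIZE = re.compile(r"(\d+)\s+(?:patients|participants|subjects)\s+were\s+randomi[sz]ed", re.I)

def normalise(sentence):
    # Convert simple written-out numbers to digits before pattern matching
    pattern = r"\b(" + "|".join(WORD_TO_NUM) + r")\b"
    return re.sub(pattern, lambda m: WORD_TO_NUM[m.group(1).lower()], sentence, flags=re.I)

sentence = "Ten patients were randomised to receive the intervention or placebo."
print(GROUP_SIZE.search(normalise(sentence)).group(1))  # -> 10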

Kiritchenko 2010, De Bruijn 2008, and Summerscales 2013 used “weak extraction rules” to extract information fragments from identified sentences. Additionally, Kiritchenko 2010 checked multiple candidate sentences for redundant information fragments in order to identify potentially valuable information. This relies on the common observation that key information is repeated verbatim throughout the sentences.

Demner-Fushman 2005 extract Population, Problem, and Intervention/Comparison without the use of machine learning [18]. To identify Population and Problem, they make use of semantic groups [50] that appear in close proximity to the information to be extracted. To extract Intervention/Comparison, they generate an unordered list of interventions in combination with semantic information from section headings and cue phrases (e.g. “This * study examines”). Cue words have also been used as features for machine learning [31, 33-35, 38].

Hara 2007 performed base noun phrase chunking and applied machine learning to the phrases instead of to whole sentences. Furthermore, they applied prior sentence filtering (using machine learning) to remove sentences that did not contain any targets of information extraction.

To extract subject descriptors, Xu 2007 compiled a set of “anchor words” that are mainly found before/after these descriptors. To extract participant numbers, they developed grammatical rules addressing the different structures in which participant numbers are reported. Finally, to extract diseases/symptoms, they extract UMLS concepts of a specific semantic type from a sentence.

Rosemblat 2007 employed semantic and syntactic checks to improve extraction performance. These included checks for length, ensuring that numbered or ordered lists were extracted in their entirety, and marking “leading sentences” (starting with “To accomplish this”, “Therefore”, etc.) as signals of elevated relevance.

Chung 2009 used a head-driven phrase structure grammar (HPSG) parser to extract coordinating constructions from parse trees.

Heuristic approaches, such as regular expressions and certain syntactic and semantic rules, are often used as a second step to extract more fine-grained information fragments after candidate sentences have been identified via machine learning. Few approaches build solely on these techniques. No publication has published its utilised rules precisely; the specifications are generally limited to textual descriptions. Further, there seems to be very little convergence in the development of these approaches, and we could not identify any publication that re-used previously developed heuristic methods. A structuring and formalisation of heuristic methods could help focus and improve the performance in this field. A formalisation could allow these heuristic methods to be encoded in a formal language, and thus be re-used and shared among researchers.

2.4.4. Available corpora


Many (n=9) of the corpora annotations are generated automatically from structural information immanent within the literature (i.e. headings of structured abstracts, such as Introduction and Methods in some cases, or Population, Intervention and Outcome, depending on the journal). For the extraction of more fine-grained elements, manual annotation is required. In addition to the waste associated with duplicating the annotation effort for a new corpus in each study, this also limits the comparability among extraction studies. To focus the effort of creating fine-grained annotations for citations, we suggest publishing the annotated corpus with each paper and reusing published corpora whenever possible.

2.4.5. Comparability of performance is limited


One initial intention of this review was to present the state of the art in extraction methods for clinical trial reports. However, due to differing extraction targets and performance measures, we cannot reach a well-summarised conclusion. In particular, differences were noted in the reported statistics, the extraction targets and granularity (sentence vs. word/phrase), and the characteristics of the corpora used.

Firstly, the targeted information varies greatly. A large number of studies (14, 47%) measured sentence extraction, which should be seen as an intermediate goal. The remaining 16 studies that performed more specific extraction vary greatly in the target elements they attempt to extract. The matching success criteria also differ among studies (some consider a partial match successful, some require an exact match), limiting the potential for direct comparison even further.

A wide range of validation techniques was used within the set of considered publications. While 10-fold cross-validation on the entire corpus is used most often, the choice of method is rarely justified or discussed. Other methods, such as n-fold cross-validation with other numbers of folds or an a priori division of the corpus into training and test sets at different ratios, have rarely been considered. Estimates of error such as p-values or standard deviations were never reported, which limits the ability to judge the statistical significance of the results.

The reported performance measures vary significantly between studies. Many studies report accuracy but not precision, recall, or F-scores, or vice versa. Some studies distinguish between correctly, partially correctly, and incorrectly labelled extraction targets; the definitions of correct/partially correct/incorrect also differ between studies.

Commonly, the number of extraction targets in the corpus is not mentioned. The distribution of extraction targets (i.e. how accurately the data could be manually extracted from the corpus) was not mentioned by any study. Reporting an expected value and a standard deviation could give valuable insights into extraction performance.

Corpora are mainly unique to each publication and are commonly sourced from MEDLINE by searching for different diseases. Only a very small fraction of publications test their method on a previously published corpus in order to compare their performance directly. As few publications describe their corpus in specific detail, direct comparisons of performance should be made with caution, if at all. The only notable exception is the “NICTA PIBOSO” corpus first published in Kim 2011. Four studies by different authors (Figure 10, Table 9) have benchmarked their methods against this corpus extracting the same targets, allowing a valid comparison (it has to be noted that the number of folds was not available for all experiments, which may affect comparability). Of these studies, Hassanzadeh 2014 clearly outperforms all others in terms of F-score.

[Bar chart: extraction performance (F1-score) for 6-way classification on unstructured abstracts, comparing Hassanzadeh et al. (2014), Sarker et al. (2013), Verbeke et al. (2012) and Kim et al. (2011) across Background, Intervention, Outcome, Population, Study Design, Other, and their mean.]

Figure 10 Comparison of extraction from one corpus by 4 different approaches

Table 9 ML algorithms and features used

Hassanzadeh 2014 (CRF): POS; section heading; sentence position; orthographic case; lemma; number of verbs in past/present tense; number of verbs in active/passive voice; verb negation; number of nouns in sentence; number of adjectives in sentence; context support
Sarker 2013 (SVM): POS; section heading; sentence position; UMLS; n-grams; sentence length; n-grams of 2 previous sentences; cue phrases
Kim 2011 (CRF): POS; section heading; sentence position; UMLS; BOW; bi-grams; label of previous sentence; features of previous sentence
Verbeke 2012 (kLog): POS; section heading; lemma of root word; has header word; root word lemma of previous sentence

We assembled the available performance figures (F1-scores) for phrase-level extraction of outcome information (Figure 11). We were unable to draw particular conclusions from the data; no clear trend can be recognised. The lack of a trend leads us to assume that the underlying methods used to obtain the respective performance figures are incoherent and do not actually allow for a valid comparison.

[Bar chart: F1-scores for phrase-level outcome extraction reported by De Bruijn et al. (2008), Summerscales et al. (2009), Kiritchenko et al. (2010), Summerscales et al. (2011), Hsu et al. (2012) and Summerscales (2013).]

Figure 11 Phrase-level outcome extraction

Finally, several studies present statements such as “our method performs according to the state of the art” without actually defining what the state of the art might be. We could not identify a solid specification of a “state of the art”, and can only assume that authors reference their own notions rather than a specified method.

2.4.6. Limitations
Screening for included studies was conducted by only one author, with borderline cases reviewed by another. This introduces a small risk of bias.

This study does not include a quantitative comparison of the included studies. This is primarily due to insufficient information provided by each study and our lack of resources for contacting study authors to request more data.

2.5. Conclusions of the literature review


This section has presented a systematic review of the history and current state of the art of
information extraction from clinical trials. Despite not restricting our search and inclusion criteria by
date of publication, the included studies were all published in the last 12 years.

The predominant method is to use machine learning to identify sentences containing extraction targets, and heuristic methods (in particular regular expressions) to extract more fine-grained information. While machine learning is central to information extraction from clinical trials, the choice of algorithm does not have a major impact on performance.

The majority of algorithms are not developed for a particular extraction task (e.g. search, screening, synthesis), hindering their integration into a superordinate process such as the creation of systematic reviews. Orientation towards a clear task would help enable the direct use of an algorithm in a subsequent task.

Comparability of extraction performance between studies is limited by heterogeneity in the extraction targets, the reported performance measures, and the underlying corpora. Consequently, we were unable to define a specific “state of the art” for extracting information from clinical trials. There is a clear need for a well-annotated corpus to standardise benchmarking.

3. Selecting systems for analysis and experiment replication
As the next step in our work, we aim to select the most suitable system for analysis and to replicate its related experiment on the existing corpus and on a newly annotated one. We take this approach to a) support our conclusion from the literature review that different approaches are incomparable and may perform entirely differently on another corpus, and b) familiarize ourselves with current methods of machine learning and information extraction. Based on our detailed literature review, we identified the most sophisticated systems in terms of methods and performance, as well as the systems with the highest potential for further development. We consider an advanced or sophisticated system to be an application that fulfils a specific task, i.e. the extraction of phrase-level information or the summarisation of targeted information into informative summaries, rather than merely labelling entire sentences or recognising only numeric values. The specific requirements were:

• Extraction of phrase level information
• Identification of all (or most) of the PICO elements
• Usage of extracted data for a subsequent task (as this may indicate an extraction performance suitable for actual use)

Considering the availability of source code, we aimed to acquire the code of one system for our intended analysis, including the validation of its experiment and the evaluation of any methods that could be implemented to further advance the performance of the system at hand.

After a careful review and inspection of five systems, we initially identified two systems [32, 42] as the most suitable for our requirements. The overall suitability of the systems was decided in terms of availability, purpose, and how specific each application was regarding the targeted information. These two applications were also the most promising ones for further development because they had undergone multiple iterations (in terms of the number of publications describing the system). We observed that the majority of the other applications:

a) did not identify any information of interest on the phrase level
b) did not include any PICO elements
c) did not perform any subsequent task

The chosen systems were:

1. ExaCT, a clinical trial element extraction system that was created by Kiritchenko 2010
2. ACRES (Automatic Clinical Result Extraction and Summarization) that was designed and
implemented by Summerscales 2013

We also tried to acquire the method of Hassanzadeh 2014. Even though the method extracts only at the sentence level (and not at the phrase level), it performed especially well on the (already mentioned, sentence-level annotated) NICTA PIBOSO corpus [36], outperforming all other previously published methods working with this particular corpus. To our regret, the author did not share his source code with us.

3.1. ExaCT
The first application considered suitable for our experiment is named ExaCT [32] and was created by Svetlana Kiritchenko, Berry de Bruijn, Simona Carini, Joel Martin, and Ida Sim at the Institute for Information Technology, Ontario, Canada. A two-step approach was applied: selection of the sentences that contain the desired candidate trial characteristics, followed by information extraction for a number of trial elements. In particular, ExaCT focuses on the identification of 21 diverse information elements from full-text journal publications of human RCTs. More specifically, ExaCT aims to recognise key trial features using a machine learning approach. The method includes an SVM text classifier trained to recognise the most promising sentences in which an information element is present, with each sentence represented as a bag of terms. For each information element present, extraction patterns relying on “weak” extraction rules (based on regular expressions) were applied. The “weak” rules were manually crafted based on the idea that a simple extraction pattern can be accurate when it is applied in the right context. ExaCT is also used to populate Trial Bank [46], a system for the structured registration of clinical trials. Unfortunately, the ExaCT system was not publicly available, which led us not to use it further for our experiments.

3.2. ACRES
The second system (and eventual choice) is called Automatic Clinical Result Extraction and
Summarization (ACRES), and was conceived by Rodney Summerscales in 2013 [42]. The ACRES system
uses machine learning and natural language processing techniques to extract the key clinical
information describing the trial and its results. It extracts the names and sizes of treatment groups,
population demographics, outcomes measured in the trial and outcome results for each treatment
group. Using the extracted outcome measurements, the system calculates key summary measures
used by physicians when evaluating the effectiveness of treatments. It computes absolute risk
reduction (ARR; the reduction of the risk of incidence of a specific event; e.g. death, a change from a
2% to a 1.6% chance of death is equivalent to a 0.4pp ARR) and number needed to treat (NNT; number
of patients needed to treat to achieve the desired outcome in one (1) patient; e.g. given an ARR of
30%, 3.33 people need to be treated to statistically achieve the desired outcome in at least one
patient) values complete with confidence intervals. The extracted information and computed statistics
are automatically compiled into XML and HTML summaries that describe the details and results of the
trial.
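To make these two summary measures concrete, the sketch below shows the standard ARR/NNT calculation with a normal-approximation (Wald) confidence interval; it is our own illustration of the formulas, not ACRES' code:

import math

def arr_nnt(events_control, n_control, events_treat, n_treat, z=1.96):
    # Absolute risk reduction, 95% confidence interval for the ARR, and NNT
    p_c = events_control / float(n_control)   # control event rate
    p_t = events_treat / float(n_treat)       # treatment event rate
    arr = p_c - p_t
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    ci = (arr - z * se, arr + z * se)
    nnt = 1.0 / arr if arr != 0 else float("inf")
    return arr, ci, nnt

# Example: 2% mortality in the control group vs 1.6% in the treatment group
print(arr_nnt(20, 1000, 16, 1000))  # ARR ≈ 0.004 (0.4 percentage points), NNT ≈ 250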

We contacted Rodney Summerscales, who ensured full cooperation regarding the sharing of the ACRES source code, including the used corpora. This resulted in the selection of ACRES for our experiment and further exploration. In the following we describe our attempt to reproduce the results generated by this system. We wish to point out that any problems encountered should not be understood as criticism, but rather as descriptive reporting of encountered hurdles.

3.3. Independent result replication


After retrieving a compressed package containing the entire ACRES system along with all used corpora, re-running the experiments was not as easy a task as initially expected. Easy re-use is not one of the priorities of a system developed to facilitate research experiments. It took a significant amount of time to re-create the execution environment and to enable us to exactly reproduce the published experimental results. Even with the generous support of Rodney Summerscales, we needed to invest almost four weeks until we were able to generate exactly the same execution output (the system produces an execution log including all steps and performance measures) as the original author.

3.3.1. Execution environment setup


Initially, it was necessary to re-create the execution environment in order to successfully run ACRES. Judging from a first brief analysis (based on file encodings and system-specific files found in the package), the ACRES system had originally been executed within a 32-bit Mac environment running Java, Python 2.7 and several other binary libraries (e.g. BANNER [51], MALLET 4, MetaMap 5, OpenNLP 6, Stanford parser 7). An initial attempt to re-run the ACRES system on a Windows-based platform failed due to a large number of different path specifications (“/” on Linux vs “\” on Windows), different file encodings, and libraries being unavailable for Windows.

Consequently, we attempted to re-run the ACRES system on a (64-bit) Linux-based system. After installing Python 2.7 along with the required libraries (nltk 8, numpy 9), we were able to execute the main ACRES system pipeline. During the execution stage, an external call to the MegaM 10 library failed because the supplied binary was not compatible with a 64-bit system. This went unnoticed at first, since the system seemed to complete the execution successfully and did not generate any obvious errors; the only unexpected deviation was significantly different performance results. A closer analysis of the external library calls revealed the incompatibility. After acquiring the sources and compiling them for our 64-bit system, the entire pipeline could be executed successfully.

However, we were still unable to reproduce exactly the results published by Rodney Summerscales [42]; we still observed slightly different extraction performances. A detailed, time-consuming analysis of approximately two weeks revealed, essentially by chance, that the Python function used to list the files within a given folder (here: the training corpus) returns the files in alphabetical order on a Mac OS based system, whereas Linux returns them in arbitrary order. This slight difference caused the machine learning models to be trained differently and consequently led to small deviations in labelling performance. As there was no program output that would have hinted at this difference, the search for this easily fixable phenomenon proved lengthy and painstaking. An added function to sort the list of files finally allowed us to reproduce exactly the same results.
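The fix amounts to a single sorting step when the corpus directory is read; a minimal sketch (the function and variable names are ours, not those used in ACRES):

import os

def list_corpus_files(corpus_dir):
    # os.listdir() makes no guarantee about ordering, so sort explicitly to obtain
    # deterministic training behaviour across operating systems
    return sorted(os.listdir(corpus_dir))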

3.3.2. System architecture


The entire system is described very well in the PhD thesis of Rodney Summerscales [42]. However, the description is at an abstract level and does not reference the actual code, making it difficult to locate particular functional entities of the system within the code. Since it is an experimental system, many code fragments are obsolete or of unknown purpose, making it even harder to link the described functionality to actual code fragments. In the following paragraphs we aim to close the gap between the abstract description and the code entities.

Figure 12 [42] provides a high-level overview of the main processing stages of ACRES. An abstract is represented as a PubMed citation in XML format. This XML is transformed (“preprocessed”) into an ACRES-compatible “raw” structure; the transformation also performs UMLS parsing and part-of-speech parsing. Subsequently, key elements are extracted using machine learning and rule-based approaches. Finally, extracted mentions are clustered and associated to eventually generate the summary data.

4 http://mallet.cs.umass.edu/
5 http://metamap.nlm.nih.gov/
6 https://opennlp.apache.org/
7 http://nlp.stanford.edu/software/lex-parser.shtml
8 http://www.nltk.org/
9 http://www.numpy.org/
10 http://www.umiacs.umd.edu/~hal/megam/

Figure 12 Main processing stages in ACRES [42]

Figure 13 shows the top-level folder structure of the system.

Figure 13 Top-level folder layout

Table 10 provides a brief initial description of the folders shown in Figure 13.


Table 10 ACRES folder structure description

Folder Contents
acres 96 python files containing the ACRES functionality as laid out in Figure 12 (Extract key
elements and Associate)
bin 10 shell scripts to execute different parts of the ACRES pipeline
corpora 4 different corpora in “annotated” and “raw” format
lib 7 libraries used within the ACRES system
models Model files of 7 different parsers
output Output folder for created XML summaries
src Java project to transform PubMed XML files into “raw” format ingested by ACRES
(Preprocess in Figure 12)

Figure 16 shows a broad, high-level sequence diagram of the key steps within the training and testing pipeline. It covers the stages “Extract key elements” and “Associate” as depicted in Figure 12; it does not, however, cover any of the “Preprocess” steps. The initial entry point resides within the file pipeline.py, which is passed the training- and test-corpus specifications and is executed as follows:
python pipeline.py bmjcardio ischemia

Within this file, an object of the class RunConfiguration is instantiated. This object holds several configuration options regarding the processing steps. It then instantiates a list of abstracts (AbstractList), which are read from the raw XML files. For each information element to be extracted, an object of the class FinderTask (not shown in Figure 16 to maintain the high-level overview) is created with different parameters to identify the corresponding information element. These FinderTasks are passed a token classifier, in this case the MalletTokenClassifier, which uses the MALLET SimpleTagger, a Conditional Random Field (CRF) implementation, to label the tokens (words) of the given abstracts. The external call to the MALLET tagger is made as follows:

For training:

For testing:
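The exact commands depend on the local MALLET installation; as an illustration only (jar locations and file names are assumptions, not the command used by ACRES), a SimpleTagger training and testing call can be issued from Python as follows:

import subprocess

mallet_cp = "lib/mallet.jar:lib/mallet-deps.jar"  # hypothetical classpath

# Training: learn a CRF model from the token feature file
subprocess.call(["java", "-cp", mallet_cp, "cc.mallet.fst.SimpleTagger",
                 "--train", "true",
                 "--model-file", "outcomefinder.model",
                 "features.outcome.train.txt"])

# Testing: label unseen tokens, keeping several of the best labellings per sentence
subprocess.call(["java", "-cp", mallet_cp, "cc.mallet.fst.SimpleTagger",
                 "--model-file", "outcomefinder.model",
                 "--n-best", "15",
                 "features.outcome.test.txt"])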

The file features.outcome.train.txt contains the (machine learning) features read from the “raw” XML files. It holds the features of one token on one single line, separated by spaces; the last word of each line denotes the classification (here: “outcome” or “other”). Figure 14 shows a sample line from this file, showing the features of a token annotated as part of an outcome.

Figure 14 Sample feature file line

The feature file is read during the training process by the MALLET algorithm and converted into a model file (outcomefinder.model), which is then used by the testing process to label the tokens of the testing corpus. The output of the testing process is the file tokens.[dd].outcome.txt (Figure 15), which contains, per token on one line, the top 15 classifications (along with probability scores) found by the MALLET algorithm.

Figure 15 Outcome classification output, top 15

The clustering and association is (counter-intuitively) performed by an object of the class FinderTask
as well, just provided with different parameters on instantiation.

The generation of summaries and output is handled by an object of the class SummaryList.

Figure 16 High level sequence diagram of ACRES

3.3.2.1. Pre-process PubMed XML to ACRES required format


A step which is only implicitly referred to (within the publication of the original author) is the transformation of PubMed XML files into the ACRES-compatible XML “raw” format. Citations in XML format are manually downloaded from PubMed, annotated using the GATE Developer 11 program, exported as “GATE inline XML”, and subsequently transformed into the ACRES-compatible XML (“raw”) format, see Figure 17. A Java program (folder: src) has been written to perform this transformation along with the pre-processing tasks, such as UMLS parsing via MetaMap 12, part-of-speech parsing via the Stanford 13 parser, and tokenisation of the text. This “raw” XML data forms the input for the next step of extracting key elements.

11 https://gate.ac.uk/gate/doc/
12 http://metamap.nlm.nih.gov/
13 http://nlp.stanford.edu/software/lex-parser.shtml

Figure 17 XML "raw" format expected by ACRES

3.3.2.2. Extract key elements


After the ACRES-compatible and enriched XML files are available, key elements are extracted using rule-based (see Figure 18) and machine learning techniques (Figure 19).

Figure 18 Rule-based extraction [42]

The following code entities deal with rule-based extraction:

• Special numeric values:
o This functionality could not be located within the code.
• Time phrases:
o timefinder.py [simple finder for “baseline” and “follow-up” tokens]
o timetemplate.py [finds time phrases like [number] [time unit], e.g. “35 days”]
• Primary outcome:
o primaryoutcomefinder.py [finds patterns like “primary (composite)? (outcome outcomes endpoint endpoints)”]
• Age phrases:
o agefinder.py [finds individual age phrases]
o agetemplate.py [finds age phrases describing a population, e.g. “greater than 10 years”]

Figure 19 Classifier-based extraction [42]

The following code entities deal with classifier-based extraction:

• Labelling group, outcome, condition, group size (GS), outcome number (ON), and event rate
o mallet.py [executes an external call to the MALLET SimpleTagger, setting options pointing to the corresponding model and feature files; for training and testing]

The following code entities implement re-ranking the labelling outcomes:

• labelingreranker.py

The labelling re-ranker was not trivial to understand; we explain it in more detail in the following:

1. Get the top 15 best labellings for a sentence from MALLET.
2. Each token thus has 15 candidate labels. Determine the most popular label for each token from its list of 15 labels.
3. Consider only the top 3 best sequence labellings for the sentence. Of these top 3, find the sequence that has the most tokens whose label in the sequence matches the popular label (computed from the top 15) for that token.
4. Use this sequence labelling for the tokens in the sentence.
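A minimal sketch of this re-ranking logic (our own reconstruction from the description above, with simplified data structures, not the code of labelingreranker.py):

from collections import Counter

def rerank(labellings, top_k=3):
    # labellings: candidate label sequences for one sentence, best first (up to 15)
    # Most popular label per token position across all candidate labellings
    popular = [Counter(seq[i] for seq in labellings).most_common(1)[0][0]
               for i in range(len(labellings[0]))]
    # Among the top_k sequences, pick the one agreeing most with the popular labels
    def agreement(seq):
        return sum(1 for label, pop in zip(seq, popular) if label == pop)
    return max(labellings[:top_k], key=agreement)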

3.3.2.3. Association of extracted mentions

Figure 20 Association of extraction elements [42]

The following code entities deal with clustering mentions:

• baseclusterer.py
• baselinementionclusterer.py
• basementiontemplate.py
• clustermentions.py
• truementionclusterer.py

The following code entities deal with associating mentions with numbers:

• baseassociator.py
• baselinementionquantityassociator.py
• mentionquantityassociator.py
• outcomemeasurementassociator.py
• truementionquantityassociator.py

4. Creation of a new testing corpus
In order to re-run the experiment as independently as possible and to assess how transferrable the published performance figures may be, we selected a new corpus of clinical trials and annotated it for the characteristics of interest. We then compared the results obtained on our corpus with the original ones, based on the previously used performance measures, specifically precision, recall, and F-score.

4.1. Origin of clinical trial documents for annotation


To create synergies with another project (which investigates correlations between systematic reviews and their updates), we sourced clinical trials from systematic reviews and their updates. In a first step we needed to assess whether the citations retrieved during the search indeed formed an update of a systematic review. Two people (the author of this thesis and his supervisor at the institute, Guy Tsafnat (GT)) assessed this and sought consensus on differing assessments (Cohen's kappa = 0.875, confidence interval: 0.638 to 1.0 [52], which is "very good" agreement according to Altman 1991). After identifying actual updates of systematic reviews, we identified the associated original systematic review. In the case of multiple previous reviews, we selected the most recent one, in order to correctly identify the changes made between two versions. Subsequently, we extracted all associated clinical trials referenced within the systematic review. We selected a maximum of 10 clinical trials from each systematic review and an additional 10 trials from the associated update that were included exclusively in the update. To avoid potentially having to aggregate multiple publications into one entity, we selected only trials that were published within a single publication and did not consist of multiple publications (as is often the case in Cochrane systematic reviews).

4.2. Annotation process


We noticed that the corpus created by Summerscales was not annotated by multiple consenting annotators, which may introduce a potential bias. In order to obtain a corpus with the highest possible objectivity of annotation, we chose to have two persons (the author of this thesis and a fellow MRES student within the same institute) annotate the new corpus. To measure agreement, we calculated Cohen's kappa [52]. The annotation process was to follow the one used originally as closely as possible.

To create a corpus annotated as similarly as possible to the one created by Summerscales, we followed the supplied annotation guideline (see Appendix 1). Due to resource limitations, we also limited the annotation process to the abstract of each clinical trial.

Initially we intended to use GATE Teamware 14 to annotate the given abstracts collaboratively. For several reasons, such as the complexity of setting up the application and delays in cooperation with University IT to provide a server instance and the associated networking settings, after approximately three weeks we fell back to using the GATE Developer software, installed locally, to annotate the new abstracts.

For our annotation process we calculated a Cohen's kappa of 0.782 (confidence interval: 0.771 to 0.793) on a token-label basis (i.e. we compared the label assigned to each token by the two annotators and calculated Cohen's kappa). According to Altman 1991, this can be seen as "good" agreement.
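Token-level agreement can be computed directly from the two annotators' label sequences; a minimal sketch (assuming scikit-learn is available; the labels shown are invented):

from sklearn.metrics import cohen_kappa_score

# One label per token from each annotator, aligned over the whole corpus
labels_annotator_1 = ["outcome", "other", "group", "other", "outcome"]
labels_annotator_2 = ["outcome", "other", "other", "other", "outcome"]

print(cohen_kappa_score(labels_annotator_1, labels_annotator_2))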

14 https://gate.ac.uk/teamware/

4.3. Details of newly created corpus
From an initial list of 68 systematic review updates, we selected 19. The remaining forty-nine either did not describe an update of an initial systematic review, the publication was not accessible, or the previous review was not referenced or not available.

From these 19 reviews, we were able to extract 385 clinical trials associated with the reviews and/or their updates. Of these 385 trials, 106 matched the selection criteria (clinical trial [according to its MeSH tags]; single publication; included in the previous review and not in the update, OR included in the update and not in the previous review).

From these 106 trials we randomly selected 50 for annotation. During annotation, 6 publications
turned out not to be actual clinical trials and were consequently discarded. Thus, the final size of our
corpus amounted to 44 clinical trials.

Table 11 shows a comparison of the corpus characteristics of the ischemia corpus (the corpus originally used by Rodney Summerscales for testing his algorithm) and our "review" corpus. The second and third columns show the total occurrences across all abstracts, with the average number of occurrences per abstract in brackets. The fourth column shows the ratio of the total occurrences in percent (review / ischemia). The fifth column shows the difference in percentage points between this ratio and the ratio of the number of abstracts (37.61%, review / ischemia); for example, for conditions 119 / 271 = 43.91%, and 43.91% - 37.61% = 6.30 percentage points.
Table 11 Corpora characteristics comparison (average annotations per abstract in brackets)

Metric Ischemia (avg) Review (avg) Difference Difference in % points to number of abstracts
Number of abstracts 117 44 37.61%
Structured abstracts 94 (80%) 27 (61%) 28.72%
Abstracts with key values 117 (100%) 22 (50%) 18.80%
Average number of sentences 11.5 10.8 93.91%
Average length of sentences 29.9 26.6 88.96%
Average acronym occurrences 11.0 6.5 59.09% 21.4pp
Annotations: conditions 271 (2.3) 119 (2.7) 43.91% 6.30pp
Annotations: groups 1256 (10.7) 375 (8.5) 29.86% -7.75pp
Annotations: outcomes 915 (7.8) 332 (7.5) 36.28% -1.32pp
Annotations: age values 26 (0.2) 8 (0.2) 30.77% -6.84pp
Annotations: group sizes 152 (1.3) 37 (0.8) 24.34% -13.26pp
Annotations: outcome numbers 122 (1.0) 34 (0.8) 27.87% -9.74pp
Annotations: event rates 648 (5.5) 26 (0.6) 4.01% -33.59pp

Since the focus of ACRES was to extract and calculate statistics, it was a selection criterion for the ischemia corpus that all abstracts must contain key values; hence that rate is 100%. We did not enforce such a constraint, as our main focus lies on the extraction of elements.

Major differences can be seen in the occurrence of event rates. Further, our corpus is less complete in terms of containing all figures needed to calculate an ARR. The general density of annotations per abstract is lower overall; the only exception is the average number of annotated conditions per abstract, which is slightly higher. Taking the different corpus sizes into account, the remaining occurrence metrics range from a decrease of 13.26 percentage points to an increase of up to 21.48 percentage points.

5. Executing the experiment on the new corpus
Upon creation of the new corpus, we executed the main experiment performed in the evaluation of the selected method.

The main experiment of Rodney Summerscales used 230 abstracts (“bmjcardio” corpus) for training,
and tested the trained model on 117 abstracts (“ischemia” corpus).

We used the same corpus for the training step, and tested the trained model on our corpus of 44 trials
(“review” corpus).
Table 12 Extraction performance comparison on new corpus (ML: machine learning)

Extracted element Ischemia corpus Review corpus
Recall Precision F-Score Recall Precision F-Score
Conditions (ML-based) 0.41 0.60 0.49 0.05 1.00 0.10
Groups (ML-based) 0.80 0.80 0.80 0.47 0.83 0.60
Outcomes (ML-based) 0.61 0.51 0.55 0.24 0.45 0.31
Age phrases (rule-based) 0.82 0.58 0.68 1.00 0.14 0.25
ARR (partial match) 0.30 0.61 0.40 0.04 0.25 0.07

[Bar chart: F-scores per information element (Conditions, Groups, Outcomes, Age phrases, ARR) on the ischemia and review corpora; values as in Table 12.]

Figure 21 ACRES extraction - ischemia vs review corpus

Table 12 and Figure 21 show comparisons of selected reported results. F-scores are considerably lower for all measures on the review corpus.

Figure 22 shows the distribution of F-scores when outcome extraction is evaluated on each abstract individually. We obtained a mean F-score of 0.31 and a standard deviation of 0.25.

Figure 22 F-Score histogram for outcome extraction per abstract ("review" corpus)

5.1. Refining the corpus


While experimenting with the old corpora (the "bmjcardio" training corpus and the "ischemia" testing corpus), we noticed that some abstracts seemed to be missing annotations we would have expected to see. For example, several title sentences lacked relatively obvious annotations:

“Effects of acupuncture and stabilising exercises as adjunct to standard treatment in pregnant women with pelvic girdle pain: randomised single blind controlled trial.”

This title sentence had not been annotated at all. We would have expected to see annotations similar
to the following:

“Effects of <group role=”experiment”>acupuncture</group> and <group role=”experiment”>stabilising exercises</group> as adjunct to <group role=”control”>standard treatment</group> in <population>pregnant women</population> with <condition>pelvic girdle pain</condition>: randomised single blind controlled trial.”

To assess whether this lack of annotations may have had an effect on extraction performance, we excluded from the training and test corpora all abstracts that did not have any title annotations.

The refined corpora consisted of 88 (of initially 230) abstracts for training and 57 (of initially 117) abstracts for testing. Table 13 and Table 14 show the corpus characteristics.
Table 13 Comparison of bmjcardio (training) corpus with its refined version (average annotations per abstract in brackets)

Metric Bmjcardio (avg) Bmjcardio, clean title (avg) Difference
Number of abstracts 230 88 38.26%
Structured abstracts 227 (99%) 86 (98%) 37.89%
Abstracts with key values 185 (80.0%) 80 (91%) 43.24%
Average number of sentences 13.4 13.7 102.24%
Average length of sentences 23.8 24.8 104.20%
Average acronym occurrences 3.2 4.7 146.88%
Annotations: conditions 417 (1.8) 226 (2.6) 54.20%
Annotations: groups 2293 (10.0) 936 (10.6) 40.82%
Annotations: outcomes 1981 (8.6) 765 (8.7) 38.62%
Annotations: age values 154 (0.7) 60 (0.7) 38.96%
Annotations: group sizes 386 (1.7) 181 (2.1) 46.89%
Annotations: outcome numbers 457 (2.0) 275 (3.1) 60.18%
Annotations: event rates 575 (2.5) 298 (3.4) 51.83%

Table 14 Comparison of ischemia (testing) corpus with its refined version (average annotations per abstract in brackets)

Metric Ischemia (avg) Ischemia, clean title (avg) Difference
Number of abstracts 117 57 48.72%
Structured abstracts 94 (80%) 46 (81%) 48.94%
Abstracts with key values 117 (100%) 57 (100%) 48.72%
Average number of sentences 11.5 11.4 99.13%
Average length of sentences 29.9 29.7 99.33%
Average acronym occurrences 11.0 11.8 107.27%
Annotations: conditions 271 (2.3) 183 (3.2) 67.53%
Annotations: groups 1256 (10.7) 732 (12.8) 58.28%
Annotations: outcomes 914 (7.8) 502 (8.8) 54.92%
Annotations: age values 27 (0.2) 13 (0.2) 48.15%
Annotations: group sizes 152 (1.3) 80 (1.4) 52.63%
Annotations: outcome numbers 122 (1.0) 50 (0.9) 40.98%
Annotations: event rates 648 (5.5) 331 (5.8) 51.08%

Both refined corpora have a clearly higher annotation density (in terms of average annotations per abstract) than the unrefined corpora. As we only sampled the quality of the unrefined corpora, it remains difficult to say how many annotations are actually missing. Due to time constraints, we were unfortunately unable to revise all existing corpora.

As seen in Table 15 and Figure 23, almost all performance measures improve on the refined corpora, even though the refined training corpus is only 38.2% of the size of the original corpus. This is likely due to a reduction in false positives owing to the smaller corpus size.
Table 15 Extraction performance on refined corpora (trained on refined corpus, tested on refined corpus)

Extracted element Train: bmjcardio (UNrefined) Train: bmjcardio (refined)
Test: Ischemia corpus (UNrefined) Test: Ischemia corpus (refined)
Recall Precision F-Score Recall Precision F-Score
Conditions (ML-based) 0.41 0.60 0.49 0.42 0.90 0.57
Groups (ML-based) 0.80 0.80 0.80 0.76 0.86 0.81
Outcomes (ML-based) 0.61 0.51 0.55 0.57 0.64 0.60
Age phrases (rule-based) 0.82 0.58 0.68 0.67 0.50 0.57
ARR (partial match) 0.30 0.61 0.40 0.31 0.65 0.42

[Bar chart: F-scores per information element (Conditions, Groups, Outcomes, Age phrases, ARR) on the unrefined and refined bmjcardio/ischemia corpora; values as in Table 15.]

Figure 23 ACRES extraction - original ischemia vs refined ischemia corpus

We also performed an experiment training on the refined corpus and testing on the unrefined corpus, see Table 16.
Table 16 Extraction performance on refined corpora (trained on refined corpus, tested on unrefined corpus)

Extracted element Train: bmjcardio (refined)
Test: Ischemia corpus (UNrefined)
Recall Precision F-Score
Conditions (ML-based) 0.41 0.60 0.49
Groups (ML-based) 0.76 0.82 0.79
Outcomes (ML-based) 0.58 0.56 0.57
Age phrases (rule-based) 0.82 0.58 0.68
ARR (partial match) 0.29 0.66 0.41

Interestingly, reducing the training corpus to 38.26% of its original size (the refined bmjcardio corpus) has almost no impact on extraction performance in terms of F-score, and can even improve it: precision increases slightly while recall decreases slightly, leaving the F-score almost unchanged.

By refining the testing corpus, we may have inadvertently selected abstracts that the system can process successfully and omitted the ones it cannot process as successfully. The original performance values were measured across the entire corpus, ignoring the association of a token/sentence with an abstract. To get a more detailed picture of performance, we ran the experiment on each test abstract individually and thereby obtained a distribution, i.e. we obtained a value for precision, recall, and F-score for each abstract, representing a sample from a distribution. The mean and deviation figures should provide a more robust performance figure. Figure 24 shows the histogram for the extraction of the outcome target (trained on the refined corpus, tested on all ischemia abstracts individually).
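Computing these per-abstract statistics is straightforward once true/false positive and false negative counts are available per abstract; a minimal sketch with invented counts (not ACRES' evaluation code):

import numpy as np

def f1(tp, fp, fn):
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One (tp, fp, fn) triple per abstract, e.g. for the outcome element
per_abstract_counts = [(5, 2, 3), (1, 4, 2), (7, 1, 1)]
scores = np.array([f1(*counts) for counts in per_abstract_counts])

print(scores.mean(), scores.std(ddof=1))  # mean F-score and sample standard deviation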

Figure 24 F-Score histogram for outcome extraction per abstract

We calculated a mean F-score of 0.55 and a sample standard deviation of 0.22. As can be seen, the values are spread relatively evenly over the entire range, with one peak at 0.6 that is not well distinguished. A statement that this method extracts with an F-score of 0.55 would therefore give only a limited impression; testing only on the abstracts falling into the 0.7-0.85 bin could lead to a very different impression.

We calculated the same distribution for the refined corpus, see Figure 25 (trained on the refined corpus, tested on the refined ischemia abstracts individually). We obtained a mean F-score of 0.58 and a standard deviation of 0.22. We are uncertain as to why this figure deviates (delta = 0.02) from the F-score across the entire corpus (0.60); we assume the deviation is due to rounding errors.

Taking these two figures into account (mean F-score and standard deviation) makes two tests more comparable. A narrower distribution (low standard deviation) combined with a higher mean F-score would indicate better overall extraction performance, whereas a high F-score combined with a high standard deviation limits the transferability of the algorithm.

Figure 25 F-Score histogram for outcome extraction per abstract (refined corpus)

6. Methods for further development
After replicating the ACRES experiment, we investigated various other methods that could potentially improve the system's performance. Enhancements could be made in several areas, such as the use of additional input features for a machine learning algorithm or improved pre-/post-processing of the input texts. Alternatively, the introduction of completely new approaches may be considered to further advance the precise extraction of information elements from clinical trials.

Previous work on the extraction of clinical trial characteristics has already tried to integrate simple or weak rules [32]. Summerscales 2013 uses specific syntactic token patterns to identify time phrases (e.g. "[number] [time unit]": "3 weeks"), primary outcome phrases (e.g. "the primary outcome is [noun phrase]"), or age phrases (e.g. "mean age [number]"). However, the implementation of such an approach has some limitations: the matching approaches are entirely unstructured and do not follow a convention or a re-usable and improvable structure. They are implemented in arbitrary programming languages, following no convention; e.g. they are not constructed as pluggable modules with defined interfaces or characteristics, making them difficult to distribute or improve. None of the publications we reviewed that employ pattern matching/heuristic approaches went beyond textual descriptions of their pattern matching methods. The creation of a formalism, by creating or using an existing framework (which e.g. provides specific conventions and design/architectural guidelines, or even a common/extensible library of standard rule-based extractors), would greatly improve and accelerate the development of such heuristic methods, as it would focus and centralise research and development efforts. An improvement in portability and re-usability would be especially beneficial.

A more complex area of potential improvement lies in reasoning on top of the initially extracted elements. If the rule-based extraction could be elevated to a process closer to human-like conscious reasoning and understanding, taking complex relations, contexts, and background knowledge into account, extraction performance might increase significantly. Reasoning may range from simple rules (e.g. pattern matching, regular expressions) to more complex inferences, where one type of extracted element may affect the extraction of another. For example, participant numbers may be extracted relatively easily and precisely using patterns such as those mentioned above. From the count and combination of these numbers, the number of participant groups may be inferred, and this inference may in turn influence the extraction of intervention groups. For instance, the classification as an intervention group may be governed by a threshold, where a token is classified as an intervention group if the threshold (e.g. calculated from the token features) is exceeded. Should it be determined that, according to the number of extracted participant group sizes, more intervention groups should exist than were actually classified, the threshold could be lowered to potentially identify the correct number of intervention groups; a sketch of this idea is given below.
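The following sketch illustrates this threshold adjustment; all names, scores, and thresholds are hypothetical:

def classify_groups(token_scores, threshold):
    # Keep tokens whose classifier score reaches the current threshold
    return [token for token, score in token_scores if score >= threshold]

def infer_groups(token_scores, group_sizes, threshold=0.5, step=0.05, floor=0.2):
    groups = classify_groups(token_scores, threshold)
    # The number of reported group sizes implies a lower bound on the number of groups;
    # relax the threshold until enough intervention groups are found (or a floor is hit)
    while len(groups) < len(group_sizes) and threshold - step >= floor:
        threshold -= step
        groups = classify_groups(token_scores, threshold)
    return groups

# Two group sizes were reported, but only one candidate token scored above 0.5
token_scores = [("aspirin", 0.72), ("placebo", 0.41), ("baseline", 0.10)]
print(infer_groups(token_scores, group_sizes=[64, 66]))  # -> ['aspirin', 'placebo']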

Another example where machine learning based approaches are limited is the appearance of several reported numeric values in close proximity. In the following sentence, for example, three percentages appear in sequence (10%, 43% and 32%):

"the rate of high on-treatment platelet reactivity was significantly lower in group 3 than in groups 1 and 2 (10% versus 43% versus 32%, respectively; P less than 0.05)."

A human can easily establish the connection between the group numbers and the percentage figures, whereas a machine learning based approach may struggle without any context establishing how to differentiate them from each other. This may be solved through the implementation of reasoning that allows this link to be uncovered and established, as sketched below.
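As an illustration of the kind of link such reasoning needs to establish, a naive order-based pairing happens to work for the sentence above (our own simplified patterns; a robust solution would have to handle many more reporting styles):

import re

sentence = ("the rate of high on-treatment platelet reactivity was significantly lower "
            "in group 3 than in groups 1 and 2 (10% versus 43% versus 32%, respectively; "
            "P less than 0.05).")

# Collect group numbers in order of appearance, including "groups 1 and 2"
group_matches = re.findall(r"groups?\s+(\d+)(?:\s+and\s+(\d+))?", sentence)
groups = [g for match in group_matches for g in match if g]   # ['3', '1', '2']

percentages = re.findall(r"(\d+(?:\.\d+)?)%", sentence)       # ['10', '43', '32']

# "respectively" suggests pairing by order of appearance
print(dict(zip(groups, percentages)))  # {'3': '10', '1': '43', '2': '32'}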

To formulate this reasoning within a formal structure, we looked at Markov Logic Networks (MLN) and
Ripple Down Rules (RDR).

6.1. Ripple Down Rules


Ripple Down Rules (RDR) were originally conceived by Paul Justin Compton in the late 1990s [54]. They form a construct for incrementally building knowledge-based systems and have been used for a broad variety of applications, such as the interpretation of pathology results [55], anomaly detection (e.g. a pregnant male subject) [56], or e-mail management [57].

The basic principle is the refinement of rules directly in the event of a misclassification.

Figure 26 Sample RDR tree [57]

Figure 26 shows a sample tree of rules that clarifies how the knowledge base develops. For example, Rule 1 had led to a misclassification of the conditions (a, b) as CSE. Manual analysis revealed that under the conditions (a, b) and (e, f) the correct classification is actually CRC, upon which Rule 6 was added to correct the classification for (a, b) and (e, f) to CRC.

The main advantage of this approach is that new rules can be added with minimal modification to the entire system and with minimal side effects on other rules, as rules are modified only within their scope and directly after a detected misclassification. Further, new rules can be added about 5 times faster than in conventional industry rule-based systems [58].

For extracting information from clinical trials, RDRs could be of high value. A large variety of conditions and properties may be used to formulate rules that label text phrases as specific information elements. All common features used in machine learning could form rule input, for example part of speech (verb, noun, adjective, etc.), sentence location within the text, and the existence of cue-words; a small sketch of such a rule is given below.
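A minimal sketch of an RDR node over token features (our own illustration; the condition functions and labels are hypothetical):

class RDRNode:
    # A single ripple-down rule: if the condition fires, use this conclusion unless a
    # more specific exception rule fires; otherwise fall through to the 'else' branch.
    def __init__(self, condition, conclusion, except_node=None, else_node=None):
        self.condition = condition        # function: token/sentence features -> bool
        self.conclusion = conclusion      # label to assign, e.g. "outcome"
        self.except_node = except_node    # refinement added after a misclassification
        self.else_node = else_node        # tried when the condition does not fire

    def classify(self, features, default="other"):
        if self.condition(features):
            if self.except_node:
                refined = self.except_node.classify(features, default=None)
                if refined is not None:
                    return refined
            return self.conclusion
        if self.else_node:
            return self.else_node.classify(features, default)
        return default

# Hypothetical rule: Results sentences mentioning "rate" describe outcomes,
# unless the token itself is a number (then it is an outcome number)
rule = RDRNode(lambda f: f["section"] == "Results" and "rate" in f["cue_words"], "outcome",
               except_node=RDRNode(lambda f: f["is_number"], "outcome number"))

print(rule.classify({"section": "Results", "cue_words": ["rate"], "is_number": False}))  # outcome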

To achieve an even better understanding of the text, such rules could be elevated to a higher level to reproduce more closely how humans actually understand text: based on the current context, the topic, experience with similar texts, and background knowledge.

Most importantly, RDRs would define a structure for describing rules, and thus ease portability and re-use.

6.2. Markov Logic Networks
Markov Logic Networks (MLN) combine first-order logic and probabilistic models [59]. First-order logic
allows for the representation of a wide variety of knowledge (e.g. Andrzejewski et al. [60]). Probabilistic
models allow for the inclusion of uncertainty.

Figure 27 Example of a MLN (Fr is short for friends, Sm for smokes, Ca for Cancer) [59]

Figure 27 depicts an example of a simple MLN that can make probability statements about how
likely friends are to get cancer, depending on whether they smoke. When using first-order logic alone,
limitations are encountered very quickly, such as the rigidity of the knowledge base. Referring back to
Figure 27, a rule such as "friends of smokers are themselves smokers" is violated as soon as one person
has smoking friends but does not smoke; in pure first-order logic a rule is either true or false, with no
room for differentiation, and a single counterexample renders the knowledge base inconsistent.
Real-world domains are rarely that clear-cut, and it is extremely difficult to create a knowledge base
that holds for every possible case. Combining first-order logic with probabilistic weights lifts this
true-or-false constraint and introduces probabilities, in which a violated rule makes a world less
probable but not impossible. In this example, having more friends who smoke would make it more
likely that one smokes oneself, without being limited to an either-or scenario.
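
In the notation of Richardson and Domingos [59], an MLN assigns each possible world x a probability
determined by the weights w_i of its first-order formulas and the number n_i(x) of true groundings of
formula i in x:

P(X = x) = \frac{1}{Z} \exp\!\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\!\left( \sum_{i} w_i \, n_i(x') \right)

The larger the weight w_i, the less probable a world that violates formula i becomes; an infinite
weight corresponds to a hard constraint and recovers ordinary first-order logic.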

Verbeke et al. (2012) [39] already employed kLog, a relational learning language [61], to formulate
(machine learning) features in the form of relations. For example, Figure 28 shows a relation that aims
to identify whether a sentence is a section header by selecting words of four characters or more that
are entirely uppercase.

Figure 28 Example of kLog relation [39]
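
As a plain-Python illustration of the logic that such a relation encodes (this is not kLog syntax; the
function name and examples are ours):

def looks_like_section_header(sentence: str) -> bool:
    """Treat a sentence as a section header if it contains a word of four or
    more characters written entirely in uppercase, e.g. 'METHODS'."""
    return any(len(word) >= 4 and word.isupper() for word in sentence.split())

print(looks_like_section_header("METHODS: We randomised 120 patients."))  # True
print(looks_like_section_header("We randomised 120 patients."))           # False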

The features derived from those relations were then fed into a machine learning technique
(SVM-HMM [62]) to classify the sentences of abstracts in the NICTA-PIBOSO corpus [36]. The difference
when using Markov Logic Networks is that MLNs do not necessarily require a training corpus from which
a machine learning model is learned, a process that essentially ties the trained model to that corpus.
Instead, they can encode rules that make direct statements about a token.

7. Discussion
We found that re-using or integrating published methods can be a complex task for various reasons.
Once a method has been identified for re-use, several hurdles need to be overcome. First, the author
needs to be willing to share their source code; in 66% of cases, for various reasons, the author was not
prepared to do so. Even after a method had been acquired successfully, reproducing previously
published experiments and re-using them was not as easy as initially assumed. Small deviations (such
as the order in which files are read) can affect the overall replication process and may be
extraordinarily hard to locate. Re-creating the exact environment was possible only with direct
support from the original creator.

Because such programs are written in an experimental research setting, neither the structure of the
code nor its documentation is of industry standard, which makes it particularly hard to further develop
the related systems. More specifically, the architectural decision to implement one program in Java
that transforms XML files and performs the UMLS and part-of-speech parsing (both via Java libraries),
while the main program (which also relies on several Java libraries) is implemented in Python, seems
not ideal and leads to unnecessary code redundancy. In retrospect, an implementation entirely in Java
may have led to a more cohesive, consistent, and manageable system.

Several previous publications that created their own corpus have pointed out the rather large amount
of time required to annotate any form of medical text. As far as we know, we are the first to create a
corpus of clinical trials annotated and agreed upon by two annotators. We can confirm the significant
time requirements for creating annotations (in our case, having two annotators annotate 44 trials,
including the subsequent consensus process, took approximately three weeks). Additional time was
needed to reach consensus on annotations, as disagreement is possible on almost any token (e.g. given
the phrase “coma patients”, it is debatable whether “coma patients” is the population, or whether
“coma” is the condition and “patients” the population). This further underlines the need to centralise
annotation efforts in order to save costly annotation time, and to create a central, public repository of
annotations to benchmark against.

One conclusion we derive from our experiments is that extraction performance is directly related
to the corpus used for training and testing, and is rarely universally applicable or transferable.

During the experiment with the refined corpora, we found that the quality of the annotations and the
size of the testing corpus both contribute to the extraction performance. Even with an unspecific
refinement measure (simply omitting citations that did not have any title annotations), we were able
to shift the observed extraction performance by between 1 and 8% for machine learning based
extraction. One may argue that removing those citations introduces another bias. It is also possible
that missing annotations within the removed abstracts led to correctly extracted items being counted
as false positives, thereby falsely reducing overall performance. Another interesting finding is that
reducing the size of the training corpus by over 60% (by removing abstracts with no title annotations)
had almost no impact on extraction performance. We hesitate to draw concrete conclusions from this
finding. An argument could be made that the machine learning algorithm becomes saturated at some
point, or that the abstracts of the training corpus are so similar that training on a larger number of
documents does not add any additional value.

This leads us to believe that extraction performance is linked to the quality of the annotations, the
corpus size, and the content of the corpus. Published performance figures always need to be viewed
together with the training and testing corpus as a single entity and are rarely transferable. Further,
extraction performance figures should always be published in the form of a sampled distribution,
including mean and standard deviation, to provide more insight into the actual extraction performance.
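
A minimal sketch of such reporting, assuming per-abstract counts of true positives, false positives and
false negatives are already available (the counts below are purely illustrative):

import statistics

def f1(tp, fp, fn):
    """Per-abstract F1-score; defined as 0 when nothing was correctly extracted."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# illustrative per-abstract counts: (true positives, false positives, false negatives)
per_abstract_counts = [(8, 2, 1), (5, 0, 3), (0, 1, 4), (7, 3, 2)]
scores = [f1(*counts) for counts in per_abstract_counts]

print(f"mean F1 = {statistics.mean(scores):.3f}, "
      f"standard deviation = {statistics.stdev(scores):.3f}")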

It has to be noted that the extraction performance on the newly created (“review”) corpus is
considerably lower than on the original test (“ischemia”) corpus. This may be due to several
circumstances. A major difference lies in the selection criteria: the ischemia citations had to contain
all key values in order to be included, whereas we did not impose such criteria on our corpus. A citation
containing all key values is likely to be well structured and well written, potentially making it easier to
extract more data. Our corpus also contains fewer structured abstracts (i.e. abstracts divided into
sections such as participants, outcomes, etc.). Since we sourced the new corpus from systematic
reviews and their updates, many of the clinical trials are relatively old, and structured abstracts are a
comparatively recent trend; the review corpus contains 20% fewer abstracts structured into sections.
Another challenge lies in the diversity of topics within the review corpus. It is based on 19 different
systematic reviews, each dealing with a different topic, which could affect the performance of
information extraction techniques to a high degree. The use of different annotators could also
contribute to the differences in extraction performance. Summerscales annotated both the training
and the test corpus, whereas we merely annotated another test corpus. While we have made every
effort to follow the original annotation style as closely as possible, a further structural difference in
annotation style may have been introduced.

This supports our conclusion from the literature review that methods are not directly comparable and
may perform very differently on another corpus. As we already pointed out, it is highly complex to
perform an adequate comparison of the extraction performance of different methods. Even worse, the
same approach may perform well on one corpus while performing badly on another. As already
mentioned, this may be due to multiple factors, such as differences in the structure of the corpus
documents, the topics within the corpus, or the age of the corpus documents. Even when looking at the
distribution of extraction performance for individual abstracts, comparing performance may be
difficult because of the differing corpora (compare the ischemia and review corpus distributions,
Figure 22 and Figure 24). We hypothesise that a method showing a low standard deviation in
extraction performance is likely to perform similarly across different corpora.

We noticed that none of the publications reused previously published heuristic methods, and that
their definitions were limited to textual descriptions. The implementations were spread across a large
variety of tools, or were not described at all. We had difficulty linking the description of the rules used
in Rodney Summerscales’ work to actual code entities. Earlier we presented two formalisms for further
development, which could be used to structure such heuristic methods and potentially enable better
exchange and facilitate reuse.

8. Conclusion
In this thesis we have systematically reviewed state-of-the-art methods for information extraction
from clinical trial text. The literature review revealed that, while comparison between tools
implemented for this purpose is difficult due to a lack of standardised reporting, there is no compelling
evidence of any major improvement over the last 15 years.

We conclude that replicating a published experiment can be a cumbersome and difficult experience.
Not only does this hinder any validation of published results, but it also makes the re-use of such
methods and algorithms difficult due to the time required to do so. Especially in the field of computer
science, where modularity and reusability are of high importance, easier methods of exchange are
required.

Independent evaluation of existing algorithms reveals that over-fitting of machine learning classifiers
might be an issue, tying an algorithm to the corpus it is tested on and limiting the transferability of
extraction performance to other corpora. This, combined with the high cost of creating individual
corpora, indicates a clear need for a public repository of well-described annotated documents against
which researchers can benchmark their methods. We further suggest more detailed performance
reporting (as a distribution across documents, rather than a single figure across the entire corpus) to
provide better insight into the actual performance.

Another area that we believe needs better exchange and formalisms is the definition of heuristic
methods (e.g. extraction by pattern matching). None of the publications reviewed re-used any
previously published heuristics. Using a framework or specification method would focus efforts in this
area. We presented two formalisms that could potentially be used for this purpose. For example, RDR
trees are easy to publish and distribute and, due to their easy extensibility, would make a good starting
point for further development.

Acknowledgement
I would like to thank my supervisor Guy Tsafnat for his persistent support and understanding. I thank
Rodney Summerscales for his patient cooperation in re-running the experiment. Further, I thank
Georgina Kennedy for annotating the new corpus, a highly time-consuming and tedious task.

I would also like to thank the examiners, who voluntarily agreed to assess my work. I am aware of your
certainly busy schedules, and appreciate the time you kindly agreed to commit.

This work has been supported by a Macquarie University Research Training Pathway (RTP) stipend.

List of tables
Table 1 Scientific databases used in literature search .......................................................................... 10
Table 2 Publications included in this review ......................................................................................... 14
Table 3 Extraction aims ......................................................................................................................... 16
Table 4 Extraction level and approach.................................................................................................. 18
Table 5 Machine learning algorithms and the features used as input to them. A semicolon separating
the algorithms indicates both algorithms have been tried separately, a comma indicates an ensemble
of the mentioned algorithms has been used (for abbreviations, please refer to the Abbreviations table)
.............................................................................................................................................................. 23
Table 6 Corpora re-used in multiple publications for comparison ....................................................... 25
Table 7 Details of evaluation corpora and validation methods (for abbreviations, please refer to the
Abbreviations table).............................................................................................................................. 26
Table 8 CRF-SVM comparison (some publications with multiple experiments)................................... 32
Table 9 ML algorithms and features used............................................................................................. 35
Table 10 ACRES folder structure description ........................................................................................ 40
Table 11 Corpora characteristics comparison (average annotations per abstract in brackets) ........... 47
Table 12 Extraction performance comparison on new corpus (ML: machine learning) ...................... 48
Table 13 Comparison of bmjcardio (training) corpus with its refined version (average annotations per
abstract in brackets) ............................................................................................................................. 49
Table 14 Comparison of ischemia (testing) corpus with its refined version (average annotations per
abstract in brackets) ............................................................................................................................. 50
Table 15 Extraction performance on refined corpora (trained on refined corpus, tested on refined
corpus) .................................................................................................................................................. 50
Table 16 Extraction performance on refined corpora (trained on refined corpus, tested on unrefined
corpus) .................................................................................................................................................. 51

List of figures
Figure 1 Evidence-Based Medicine Pyramid [4] ..................................................................................... 6
Figure 2 Number of registered studies over time .................................................................................. 7
Figure 3 Timeline for a Cochrane review [8] ........................................................................................... 8
Figure 4 Process of systematic review creation [9] ................................................................................ 8
Figure 5 PRISMA diagram...................................................................................................................... 13
Figure 6 Included studies by part of text from which they extract information .................................. 21
Figure 7 Included studies by granularity of extraction ......................................................................... 21
Figure 8 Included studies by class of algorithm .................................................................................... 21
Figure 9 Frequency of targeted information fragments (>2)................................................................ 21
Figure 10 Comparison of extraction from one corpus by 4 different approaches ............................... 35
Figure 11 Phrase-level outcome extraction .......................................................................................... 36
Figure 12 Main processing stages in ACRES [42] .................................................................................. 40
Figure 13 Top-level folder layout .......................................................................................................... 40
Figure 14 Sample feature file line ......................................................................................................... 41
Figure 15 Outcome classification output, top 15.................................................................................. 41
Figure 16 High level sequence diagram of ACRES................................................................................. 42
Figure 17 "XML "raw" format expected by ACRES................................................................................ 43
Figure 18 Rule-based extraction [42] .................................................................................................... 43
Figure 19 Classifier-based extraction [42] ............................................................................................ 44
Figure 20 Association of extraction elements [42] ............................................................................... 45

Figure 21 ACRES extraction - ischemia vs review corpus...................................................................... 48
Figure 22 F-Score histogram for outcome extraction per abstract ("review" corpus) ......................... 49
Figure 23 ACRES extraction - original ischemia vs refined ischemia corpus......................................... 51
Figure 24 F-Score histogram for outcome extraction per abstract ...................................................... 52
Figure 25 F-Score histogram for outcome extraction per abstract (refined corpus)............................ 53
Figure 26 Sample RDR tree [57] ............................................................................................................ 55
Figure 27 Example of a MLN (Fr is short for friends, Sm for smokes, Ca for Cancer) [59] ................... 56
Figure 28 Example of kLog relation [39] ............................................................................................... 56

Abbreviations
Abbreviation Meaning
AUC Area under the curve
BOW Bag of words
CRF Conditional Random Fields
F1 F1-score
HMM Hidden Markov Model
MeSH Medical Subject Headings
ML Machine Learning
MLP Multi-Layer Perceptron
NB Naive Bayes
NLP Natural Language Processing
P Precision
PICO Patient/Problem, Intervention, Comparison, Outcome
POS Part Of Speech
R Recall
RCT Randomized Clinical Trial
RF Random Forest
ROC Receiver Operating Characteristic
SVM Support Vector Machine
UMLS Unified Medical Language System

Appendix
1. Annotation guideline, Rodney Summerscales

References

[1]. Rosenberg, W. and A. Donald, Evidence based medicine: an approach to clinical problem-solving .
BMJ : British Medical Journal, 1995. 310(6987): p. 1122-1126.
[2]. Organization, W.H., Bridging the “Know–Do” Gap Meeting on Knowledge Translation in Global
Health. Retrieved September, 2005. 25: p. 2006.
[3]. Macleod, M.R., et al., Biomedical research: increasing value, reducing waste. Lancet, 2014.
383(9912): p. 101-4.
[4]. Rosner, A.L., Evidence-based medicine: revisiting the pyramid of priorities. Journal of Bodywork
and Movement Therapies, 2012. 16(1): p. 42-49.
[5]. Shojania, K.G., et al., How quickly do systematic reviews go out of date? A survival analysis. Annals
of internal medicine, 2007. 147(4): p. 224-233.
[6]. Coden, A., et al., Automatically extracting cancer disease characteristics from pathology reports
into a Disease Knowledge Representation Model. Journal of biomedical informatics, 2009. 42(5):
p. 937-949.
[7]. Bero, L., et al., Measuring the performance of the Cochrane library. The Cochrane database of
systematic reviews, 2011. 12: p. ED000048-ED000048.
[8]. Higgins, J., Green S. Cochrane handbook for systematic reviews of interventions version 5.1. 0.
The Cochrane Collaboration, 2011. 5(0).
[9]. Participation, C.f.H.C.a. Exploring Systematic Reviews. 03/2012 [cited 2015 28/09/2015];
Available from:
http://navigatingeffectivetreatments.org.au/exploring_systematic_reviews.html.
[10]. Last, J.M. and I.E. Association, A dictionary of epidemiology. Vol. 141. 2001: Oxford Univ Press.
[11]. Karystianis, G., I. Buchan, and G. Nenadic, Mining characteristics of epidemiological studies from
Medline: a case study in obesity. J Biomed Semantics, 2014. 5(22.10): p. 1186.
[12]. Sackett, D.L., et al., How to practice and teach EBM. Edinburgh: Churchill Livingstone, 2000.
[13]. Guyatt, G.H., et al., What is "quality of evidence" and why is it important to clinicians? Bmj, 2008.
336(7651): p. 995-8.
[14]. Tsafnat, G., et al., The automation of systematic reviews. BMJ, 2013. 346: p. f139.
[15]. Tsafnat, G., et al., Systematic review automation technologies. Syst Rev, 2014. 3(1): p. 74.
[16]. McKnight, L. and P. Srinivasan, Categorization of Sentence Types in Medical Abstracts. AMIA
Annual Symposium Proceedings, 2003. 2003: p. 440-444.
[17]. Hara, K. and Y. Matsumoto. Information extraction and sentence classification applied to clinical
trial MEDLINE abstracts. in Proceedings of the 2005 International Joint Conference of InCoB,
AASBi and KSB. 2005. Citeseer.
[18]. Demner-Fushman, D. and J. Lin. Knowledge extraction for clinical question answering:
Preliminary results. in Proceedings of the AAAI-05 Workshop on Question Answering in
Restricted Domains. 2005.
[19]. Rosemblat, G. and L. Graham. A pragmatic approach to summary extraction in clinical trials. in
Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology.
2006. Association for Computational Linguistics.
[20]. Demner-Fushman, D., et al., Automatically identifying health outcome information in MEDLINE
records. Journal of the American Medical Informatics Association, 2006. 13(1): p. 52-60.
[21]. Xu, R., et al. Combining text classification and hidden Markov modeling techniques for structuring
randomized clinical trial abstracts. in AMIA Annual Symposium Proceedings. 2006. American
Medical Informatics Association.
[22]. Chung, G.Y. and E. Coiera. A study of structured clinical abstracts and the semantic classification
of sentences. in Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and
Clinical Language Processing. 2007. Association for Computational Linguistics.

[23]. Hara, K. and Y. Matsumoto, Extracting clinical trial design information from MEDLINE abstracts.
New Generation Computing, 2007. 25(3): p. 263-275.
[24]. Xu, R., et al., Extracting subject demographic information from abstracts of randomized clinical
trial reports. 2007.
[25]. Rosemblat, G., L. Graham, and T. Tse. Extractive Summarization in Clinical Trials Protocol
Summaries: A Case Study. in IICAI. 2007.
[26]. Hansen, M.J., N.Ø. Rasmussen, and G. Chung, A method of extracting the number of trial
participants from abstracts describing randomized controlled trials. Journal of Telemedicine and
Telecare, 2008. 14(7): p. 354-358.
[27]. De Bruijn, B., et al. Automated information extraction of key trial design elements from clinical
trial publications. in AMIA Annual Symposium Proceedings. 2008. American Medical Informatics
Association.
[28]. Chung, G.Y.-C., Towards identifying intervention arms in randomized controlled trials: Extracting
coordinating constructions. Journal of biomedical informatics, 2009. 42(5): p. 790-800.
[29]. Summerscales, R., et al. Identifying treatments, groups, and outcomes in medical abstracts. in
The Sixth Midwest Computational Linguistics Colloquium (MCLC 2009). 2009.
[30]. Chung, G.Y., Sentence retrieval for abstracts of randomized controlled trials. BMC medical
informatics and decision making, 2009. 9(1): p. 10.
[31]. Boudin, F., et al., Combining classifiers for robust PICO element detection. BMC medical
informatics and decision making, 2010. 10(1): p. 29.
[32]. Kiritchenko, S., et al., ExaCT: automatic extraction of clinical trial characteristics from journal
publications. BMC Med Inform Decis Mak, 2010. 10: p. 56.
[33]. Lin, S., et al. Extracting formulaic and free text clinical research articles metadata using
conditional random fields. in Proceedings of the NAACL HLT 2010 Second Louhi Workshop on
Text and Data Mining of Health Documents. 2010. Association for Computational Linguistics.
[34]. Boudin, F., L. Shi, and J.-Y. Nie, Improving medical information retrieval with pico element
detection, in Advances in Information Retrieval. 2010, Springer. p. 50-61.
[35]. Zhao, J., et al. Improving search for evidence-based practice using information extraction. in AMIA
Annual Symposium Proceedings. 2010. American Medical Informatics Association.
[36]. Kim, S.N., et al., Automatic classification of sentences to support Evidence Based Medicine. BMC
bioinformatics, 2011. 12(Suppl 2): p. S5.
[37]. Summerscales, R.L., et al. Automatic Summarization of Results from Clinical Trials. in
Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. 2011.
[38]. Huang, K.-C., et al. Classification of PICO elements by text features systematically extracted from
PubMed abstracts. in Granular Computing (GrC), 2011 IEEE International Conference on. 2011.
IEEE.
[39]. Verbeke, M., et al. A statistical relational learning approach to identifying evidence based
medicine categories. in Proceedings of the 2012 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning. 2012.
Association for Computational Linguistics.
[40]. Hsu, W., W. Speier, and R.K. Taira. Automated extraction of reported statistical analyses: towards
a logical representation of clinical trial literature. in AMIA Annual Symposium Proceedings. 2012.
American Medical Informatics Association.
[41]. Zhao, J., P. Bysani, and M.-Y. Kan. Exploiting classification correlations for the extraction of
evidence-based practice information. in AMIA Annual Symposium Proceedings. 2012. American
Medical Informatics Association.
[42]. Summerscales, R.L., Automatic Summarization of Clinical Abstracts for Evidence-based Medicine.
2013, Illinois Institute of Technology.
[43]. Huang, K.-C., et al., PICO element detection in medical text without metadata: Are first sentences
enough? Journal of biomedical informatics, 2013. 46(5): p. 940-946.

[44]. Sarker, A., D. Mollá-Aliod, and C. Paris, An Approach for Automatic Multi-label Classification of
Medical Sentences. 2013.
[45]. Hassanzadeh, H., T. Groza, and J. Hunter, Identifying scientific artefacts in biomedical literature:
The Evidence Based Medicine use case. Journal of biomedical informatics, 2014. 49: p. 159-170.
[46]. Sim, I., B. Olasov, and S. Carini, The Trial Bank system: capturing randomized trials for evidence-
based medicine. AMIA Annu Symp Proc, 2003: p. 1076.
[47]. Kottmann, J. Apache OpenNLP. 2013 [cited 2015 05/2015]; Available from:
http://opennlp.apache.org/.
[48]. Foundation, T.A.S. Apache UIMA. 2015 [cited 2015 05/2015]; Available from:
https://uima.apache.org/.
[49]. Chalmers, I. and B. Haynes, Systematic Reviews: Reporting, updating, and correcting systematic
reviews of the effects of health care. Bmj, 1994. 309(6958): p. 862-865.
[50]. McCray, A.T., A. Burgun, and O. Bodenreider, Aggregating UMLS semantic types for reducing
conceptual complexity. Studies in health technology and informatics, 2001(1): p. 216-220.
[51]. Leaman, R. and G. Gonzalez, BANNER: an executable survey of advances in biomedical named
entity recognition. Pac Symp Biocomput, 2008: p. 652-63.
[52]. Cohen, J., A coefficient of agreement for nominal scales. Educational and psychological
measurement, 1960. 20(1): p. 37-46.
[53]. Altman, D., Mathematics for kappa. Practical statistics for medical research, 1st edn. London:
Chapman & Hall, 1991. 406407.
[54]. Compton, P. and R. Jansen, Knowledge in context: A strategy for expert system maintenance.
1990: Springer.
[55]. Compton, P., et al., Experience with ripple-down rules. Knowledge-Based Systems, 2006. 19(5):
p. 356-362.
[56]. Prayote, A. and P. Compton, Detecting anomalies and intruders, in AI 2006: Advances in Artificial
Intelligence. 2006, Springer. p. 1084-1088.
[57]. Ho, V., W. Wobcke, and P. Compton. EMMA: an e-mail management assistant. in Intelligent
Agent Technology, 2003. IAT 2003. IEEE/WIC International Conference on. 2003. IEEE.
[58]. Compton, P. and R. Jansen, A philosophical basis for knowledge acquisition. Knowledge
acquisition, 1990. 2(3): p. 241-258.
[59]. Richardson, M. and P. Domingos, Markov logic networks. Machine learning, 2006. 62(1-2): p.
107-136.
[60]. Andrzejewski, D., et al. A framework for incorporating general domain knowledge into latent
Dirichlet allocation using first-order logic. in IJCAI Proceedings-International Joint Conference on
Artificial Intelligence. 2011.
[61]. Frasconi, P., et al., klog: A language for logical and relational learning with kernels. Artificial
Intelligence, 2014. 217: p. 117-143.
[62]. Tsochantaridis, I., et al. Support vector machine learning for interdependent and structured
output spaces. in Proceedings of the twenty-first international conference on Machine learning.
2004. ACM.
