
2016 IEEE/WIC/ACM International Conference on Web Intelligence

Domain-Specific Term Extraction for Concept Identification in Ontology Construction
Kiruparan Balachandran, Surangika Ranathunga
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
kiruparan@gmail.com, surangika@cse.mrt.ac.lk

Abstract - An ontology is a formal and explicit specification of a shared conceptualization. Manual construction of domain ontologies does not adequately satisfy the requirements of new applications, because these applications need a more dynamic ontology and the ability to manage a quantity of concepts that humans cannot handle alone. Researchers have discussed ontology learning as a solution to the issues related to the manual construction of ontologies. Ontology learning is either an automatic or a semi-automatic process that applies methods for building an ontology from scratch, or for enriching or adapting an existing ontology. This research focuses on improving the term extraction process used to identify concepts in ontology learning. Available approaches to term extraction are limited in various ways. These limitations include: (1) obtaining domain-specific terms from a domain expert as seed words instead of discovering them automatically from the corpus, and (2) unsuitable usage of corpora when discovering domain-specific terms for multiple domains. Our study uses linguistic analysis and statistical calculations to extract domain-specific simple and complex terms, which overcomes the first limitation. To eliminate the second limitation, we use multiple contrastive corpora, which reduces the bias introduced by a single contrastive corpus. Evaluations show that our system extracts terms better than previous research that used the same corpora.

Keywords - term extraction; ontology learning; non-taxonomic relations; taxonomic relations; concept

I. INTRODUCTION

An ontology is a formal and explicit specification of a shared conceptualization. An ontology should be readable and understandable by a software agent, and the constructed ontology should be verified and accepted by the relevant domain experts and community. An objective of an ontology is to eliminate conceptual and terminological confusion within a specific community. An ontology consists of a set of concepts, a set of relations, a set of rules, and instances of concepts (also referred to as terms). Terms can be simple or complex. For example, consider the two terms "randomized algorithm" and "program" in the Computer Science domain. Here, "randomized algorithm" is referred to as a complex term, since it contains more than one word, whereas "program" can be considered a simple term.

The output of a manually constructed ontology depends wholly on the domain expert's viewpoints, assumptions, and needs regarding that domain. Since these three factors differ from one domain expert to another, manual construction leads to inconsistent ontologies. Researchers have discussed ontology learning as a solution to overcome the issues related to the manual construction of ontologies. Ontology learning is either an automatic or a semi-automatic process that applies methods for building an ontology from scratch, or for enriching or adapting an existing ontology.

According to Buitelaar et al. [1], the ontology learning process can be organized into a layer cake with the following modules: (1) extracting domain-specific terms, (2) finding synonyms for the identified terms, (3) discovering concepts, (4) extracting taxonomic relationships, (5) extracting non-taxonomic relationships, and (6) extracting rules from text to validate the discovered ontology. Discovering a valid ontology from a corpus for a given domain requires improving each step of this process.

In this study, we focus on improving the domain-specific term extraction process. Available approaches to term extraction are limited in various ways. Most existing approaches assume that a domain expert feeds domain-specific terms to the ontology learning process [2]. Further, the automated approaches that use an available corpus to extract domain-specific terms are not efficient. Previous research has defined two types of corpora [3], [4], [5]:

Target domain corpus: the corpus from which the domain-specific terms are extracted. The corpus is dedicated to one domain.

Contrastive corpus: a single corpus created by combining the corpora of the other domains, excluding the target domain.


In existing approaches, a term is considered a domain-specific term if it has more influence on the target domain than on the contrastive corpus. However, to extract domain-specific terms, we need to identify terms that are unique to each domain. To do so, the influence of each term on the different domains should be compared by considering the domain corpora separately. In other words, to extract domain-specific terms for multiple domains, we need to consider the distribution of each term in each individual corpus. In addition, corpus selection based on word count alone is not adequate to establish the lexical richness of corpora.
Existing studies consider only nouns as simple terms, and only headwords (Adjective + Noun) as complex terms, in their domain-specific term extraction process [4]. However, other patterns, such as 3-gram terms (e.g. Adjective + Noun + Noun or Adjective + Adjective + Noun), can also form complex terms.
This paper presents an approach that eliminates the bias caused by merging the corpora of the other domains (except the target domain) into a single contrastive corpus. It also presents a more accurate calculation of statistical distribution to extract domain-specific simple and complex terms. Performance evaluation of the implemented term extraction process shows that our implementation provides better accuracy in both the Computer Science domain (55% precision for simple terms and 52.5% precision for complex terms) and the Bio Medical domain (80% precision for simple terms and 62% precision for complex terms) when compared with existing work.
The rest of the paper is organized as follows. In Section 2, we elaborate on the existing techniques for identifying domain-specific terms. In Section 3, we present our implemented solution to overcome the issues discussed in Section 2, and in Section 4, we present our evaluation results for this solution. Finally, Section 5 concludes the paper with an outlook on future work.
II. RELATED WORK

In Section A, we discuss ontology learning in general, based on Buitelaar et al.'s [1] layer-cake. Section B discusses previous work on the domain-specific term extraction process and its drawbacks, and Section C discusses existing corpus selection approaches.

A. Ontology Learning
According to Buitelaar et al. [1], the output of ontology learning can be organized into a layer-cake, as depicted in Fig. 1. The right side of the figure lists an example from the Computer Science domain for each layer.

Figure 1. Ontology Learning Layer-Cake [1] (layers, from top to bottom: Rules, Relations, Concept Hierarchies, Concepts, Synonyms, Terms)

In ontology learning, this process is automated as a sequence of steps, where the output of each stage is obtained using the outputs of the previous layers. These steps are: Extracting Terms, Discovering Concepts, Identifying Concept Pairs, Taxonomic Labeling, Non-Taxonomic Labeling, and Extracting Rules (axioms).

B. Term Extraction
Domain-specific terms are linguistic realizations of domain-specific concepts and are involved in all other tasks in ontology learning [1]. Thus, the identification of such domain-specific terms is paramount in building domain ontologies. We need a set of domain-specific terms in order to build up the other ontology components, such as concepts, relations, and axioms.

In ontology learning approaches, the term extraction process involves identifying these terms from text. In conventional approaches, a set of terms is input into a term extraction algorithm by domain experts [2]. Existing automated term extraction algorithms (discussed below) have used either purely statistical approaches [3], [6], or a combination of statistical approaches and linguistic information [4].

1) Term extraction using inverse document frequency (IDF)
A well-known approach to identifying domain-specific terms is inverse document frequency, which indicates whether a term is common or rare across all documents [6], [7]. This approach is inadequate for identifying the cross-domain distribution of a term (how much the term relates to other domains), because it only checks how the term is distributed within a domain, not how it is distributed across different domains [3].

2) Term extraction using domain relevance
The domain relevance (DR) approach has been discussed [3], [5], [6] to overcome the problem identified in the IDF-based approach. The domain relevance of a term is high if the term is frequent in the domain of interest and less frequent in other domains. Domain relevance fails if a term is used with a high frequency in a few documents of the domain, but not evenly across the domain. As a result, it identifies some non-domain-specific terms that appear with a high frequency in a few documents as domain-specific terms.

3) Term extraction using domain consensus
To resolve the issue of extracting non-domain-specific words, the domain consensus (DC) approach was presented [3], [5], [6]. Domain consensus measures how a term is distributed within the domain. However, domain consensus alone is not suitable for extracting domain-specific terms, since it only considers the term distribution within a domain, not the term distribution across domains. Consequently, it identifies some common terms that are evenly distributed within the domain as domain-specific terms.


4) Term extraction using a combination of domain relevance and domain consensus
As discussed by Velardi et al. [6], domain relevance and domain consensus each identify some non-domain-specific terms and common terms as domain-specific terms. To avoid this, both measurements can be combined, and terms are selected where domain relevance (DR) > α and domain consensus (DC) < β (α and β are threshold values for DR and DC, respectively). This combination worked well and was able to select domain-specific words while filtering out terms that are not distributed within the documents. The problem with this approach is that, when a large number of corpora are combined, each term accumulates a significant count from the individual corpora, and this count misleads the calculation of the statistical distribution. Moreover, to extract domain-specific terms, we need to identify terms that are unique to each domain; to do so, each term should be compared individually across multiple domain corpora. Hence, this approach works well for extracting domain-specific terms for a single domain, but it is not suitable for extracting domain-specific terms for multiple domains.
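To make the DR/DC combination concrete, the following is a minimal sketch (not taken from the paper) of how the two measures might be computed from raw frequency counts. The corpus representation, the particular DR formulation, the entropy-style DC formula, and the threshold values are all assumptions for illustration.

import math

def domain_relevance(term, target_docs, contrastive_corpora):
    """DR: relative frequency of the term in the target corpus divided by its
    highest relative frequency over all corpora (one common formulation)."""
    def rel_freq(docs):
        total = sum(len(d) for d in docs)
        return sum(d.count(term) for d in docs) / total if total else 0.0
    freqs = [rel_freq(target_docs)] + [rel_freq(c) for c in contrastive_corpora]
    return freqs[0] / max(freqs) if max(freqs) else 0.0

def domain_consensus(term, target_docs):
    """DC: entropy of the term's distribution over the documents of the target
    corpus; higher values mean the term is spread evenly across documents."""
    counts = [d.count(term) for d in target_docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, 2) for p in probs)

# Toy usage: each document is a list of tokens, each corpus a list of documents.
target = [["algorithm", "runs", "fast"], ["randomized", "algorithm", "analysis"]]
others = [[["cricket", "match", "report"]], [["market", "report", "today"]]]
alpha, beta = 1.2, 0.5   # assumed threshold values
dr = domain_relevance("algorithm", target, others)
dc = domain_consensus("algorithm", target)
# The term is then kept or discarded by comparing dr and dc against the
# thresholds alpha and beta, following the DR/DC rule quoted above.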
5) Complex Term Extraction
The approaches discussed above [3], [6], [7] cannot categorize a term as simple or complex. Moreover, they do not provide a method to extract domain-specific complex terms.
Wong et al. [4] discussed a probabilistic framework that identifies complex terms using Part of Speech (POS) tags. In this approach, words with a headword are considered complex terms; for example, consider "randomized algorithm" from the Computer Science domain corpus. Here, JJ (Adjective) + NN (Noun) is the POS tag combination that corresponds to a complex term.
This differs slightly from approaches (1)-(4) above, in that the authors consider the frequency of the individual words in the headword in addition to the headword frequency.
This study focuses only on headwords as complex terms. However, there are other possible POS tag combinations, such as adjective NPs and prepositional NPs, that can be considered domain-specific complex terms. For example, "linear complementarity problem" is an adjective NP and "theory of programming language" is a prepositional NP.
Additionally, this study merges corpora from different domains to create the contrastive corpus, which creates an imbalance between the target and contrastive corpora. It also does not address the need to extract domain-specific terms for multiple domains.
6) Term extraction using linguistic filters
As discussed above, words with a headword are considered complex terms in the term extraction process. Frantzi et al. [8] presented an approach called C-value/NC-value that uses multiple POS tag combinations to extract complex terms. They present three linguistic filters, as follows:
Noun+ Noun
(Adj|Noun)+ Noun
((Adj|Noun)+ | ((Adj|Noun)*(NounPrep)?)(Adj|Noun)*) Noun
These filters do not consider all possible POS tags. For example, the POS tag combination non-taxonomic/JJ relations/NNS is not covered by the above approach, since it considers only NN, but not NNS. Another example is the same term tagged in two different ways, as computer/NN science/NN and Computer/NNP Science/NNP. Here, only computer/NN science/NN is identified as a complex term; Computer/NNP Science/NNP is not, since the NNP POS tag is not considered in the filters. Another problem with this approach is that it does not limit the size of complex terms (it considers 1-gram, 2-gram, ..., n-gram terms).
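As an illustration of how such linguistic filters can be applied, the following sketch matches the second Frantzi-style pattern, (Adj|Noun)+ Noun, against a POS-tagged sentence by mapping tags to single-character symbols and running a regular expression over them. The tag-to-symbol mapping and the helper names are our own and are only for illustration.

import re

# Map Penn Treebank tags to symbols: A = adjective, N = noun, x = anything else.
TAG_SYMBOL = {"JJ": "A", "JJR": "A", "JJS": "A",
              "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}

# The (Adj|Noun)+ Noun filter expressed over the symbol string.
FILTER = re.compile(r"[AN]+N")

def match_filter(tagged_sentence):
    """Return word sequences whose tag pattern matches (Adj|Noun)+ Noun."""
    symbols = "".join(TAG_SYMBOL.get(tag, "x") for _, tag in tagged_sentence)
    candidates = []
    for m in FILTER.finditer(symbols):
        words = [w for w, _ in tagged_sentence[m.start():m.end()]]
        candidates.append(" ".join(words))
    return candidates

# "non-taxonomic relations" is matched once NNS is included in the mapping.
sentence = [("non-taxonomic", "JJ"), ("relations", "NNS"), ("are", "VBP"),
            ("useful", "JJ")]
print(match_filter(sentence))   # ['non-taxonomic relations']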
7) Heuristic-based Term Extraction
Apart from the above approaches, some heuristic-based approaches to identifying domain-specific terms have also been discussed [5], [9].
(1) Lexical cohesion: cohesion is high if the words composing a (compound) term are found more frequently within the (compound) term than alone in the text. In practice, however, this may not work, since the individual word frequencies are often higher than the compound term frequency.
(2) Structural relevance: if a term is highlighted in a document, its domain relevance is increased by a factor k, where k is a constant of the weighting scheme. This can be achieved only through a manual process, since readers have to scrutinize each document and find the highlighted words. In addition, most of the available corpora are in plain text format, making it difficult to find highlighted words.
(3) Miscellaneous: a set of heuristics is used to remove generic modifiers. For example, in "large knowledge database", where "large" is the generic modifier, removing the word "large" helps to reduce noise, since a generic modifier is not considered in the weight calculation. Identifying the set of generic modifiers is a manual process, and such a set is not readily available.
In summary, existing approaches to finding relevant terms in text have focused on extracting terms for a single domain only. Further, most studies treat only noun phrases (NPs) with headwords as complex terms, although terms can be formed in many other ways. Since the existing studies consider only words with headwords as complex terms, the available weighting schemes are suitable only for 2-gram terms, and they support extracting terms for a single domain only.


C. Corpus Selection
Existing approaches in ontology learning select corpora based on word count [3], [4]. Nevertheless, Sinclair [10] has stated that the frequency of terms in a corpus follows Zipf's law, which means that approximately half of the word forms occur only once, a quarter only twice, and so on. For example, the Brown corpus (the first million-word corpus of general written American English) has a vocabulary of 69,002 different word forms, of which 35,065 occur only once. At the other end of the frequency scale, the commonest word has a frequency of 69,970, which is almost twice that of the next most frequent word (36,410). This implies that word count alone is not enough to select a good corpus.
III. EXTRACTING DOMAIN-SPECIFIC TERMS

Based on the discussion in the related work, it can be concluded that some automated approaches are in fact semi-automated, since they take seed terms as input from domain experts. The available fully automated approaches introduce bias, since a single contrastive corpus is created by combining the corpora of the other domains, except the target domain. The available term extraction approaches consider only simple terms and headwords as candidate terms.
To address the problems discussed above, this paper presents an automated methodology to identify domain-specific terms from multiple domain corpora for multiple domains. Fig. 2 provides an overview of how the different modules are combined to identify domain-specific terms.
Organizing corpora: As discussed by Wynne [10], not all corpora have high lexical richness. To identify good corpora, shallow processing is required to select corpora with high lexical richness. To eliminate bias in organizing the corpora, we consider multiple contrastive corpora, where each contrastive corpus is dedicated to a single domain.
Corpus annotation: The selected corpora are annotated with POS tags.
Extracting domain-specific terms: In our study, we develop a method to extract domain-specific terms for multiple domains. This module extracts both simple and complex terms. It combines lexical analysis and statistical distribution calculations. Lexical analysis is used to find candidate simple and complex terms; this helps to extract all combinations of complex terms, rather than only noun phrases. The candidate simple and complex terms are then filtered into domain-specific terms based on the distribution calculations.

Figure 2. Solution Architecture (annotating the target domain and multiple contrastive domain corpora; organizing corpora; applying linguistic rules to obtain candidate terms; calculating term distribution)

A. Corpus Selection and Building
This section discusses corpus selection for our problem, selecting suitable RSS feeds to build domain corpora, and deciding how to use the corpora in an effective manner. We ensure that the selected corpora are from highly disconnected domains, to make sure that they do not influence the domain-specific term extraction process.
As discussed in Section 2, existing approaches used one target domain corpus and one contrastive corpus (a corpus created from the other domains, except the target domain). This is suitable for extracting domain-specific terms for a single domain. However, to extract domain-specific terms for multiple domains, the frequency of each term in the other domains should be considered individually. For example, to calculate the weight of the term "algorithm" for the Computer Science domain, we should consider the frequency of "algorithm" in the Computer Science (target) domain as well as in each contrastive domain individually (e.g. the Bio Medical domain and the Cricket domain).
Since we need multiple domain corpora, in this work we propose an approach to select corpora from those available. Existing studies have used different corpora; however, only a few corpora are freely available to researchers. We select four such existing corpora. In addition, we build two corpora using RSS feeds (a Business domain corpus using Reuters and ABC RSS feeds, and a Cricket domain corpus using Cricinfo's feeds and Cricinfo magazine). Corpora were built from RSS feeds in previous studies as well [3], [4], [5]; in those studies, contrastive corpora were built using feeds such as Reuters.com, CNet.com, and ABC.com.
In total, we have four existing corpora and two newly built corpora:
Existing corpora
Computer Science
o Mikalai Krapivin Corpus (MK) [11]
o NUS Keyphrase Corpus (NUS) [12]
Bio Medical
o GENIA corpus (G) [13]
Agriculture
o FAO-Food-And-Agriculture (FAO) [11]
Built corpora
Cricket domain
o based on Cricinfo RSS feeds (C)
Business domain
o based on Reuters and ABC RSS feeds (R)
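As an illustration of how such an RSS-based domain corpus might be assembled, the following is a minimal sketch using only the Python standard library; the feed URL, the output layout, and the function name are placeholders rather than the feeds and tooling actually used in this work.

import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

def collect_rss_corpus(feed_url, out_dir):
    """Download an RSS feed and store each item's title and description
    as one plain-text document of a domain corpus."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(feed_url) as response:
        root = ET.fromstring(response.read())
    # RSS 2.0 places items under channel/item with <title> and <description>.
    for i, item in enumerate(root.iter("item")):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        (out / f"doc_{i:04d}.txt").write_text(f"{title}\n{description}\n")

# Example usage (placeholder URL):
# collect_rss_corpus("https://example.com/cricket/rss.xml", "corpora/cricket")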

The next task is to select the most suitable corpora from the above list. As discussed by Wynne [10], considering the word count alone is not adequate for selecting a corpus. In our corpus selection process, we eliminate corpora that contain a high proportion of terms occurring only a few times, treating them as having poor lexical richness [10]. Table I shows the calculated lexical richness of each corpus.
According to the results, the GENIA corpus has less than 3.63% of infrequent terms (terms occurring fewer than 5 times in the corpus, normalized by the total length of the corpus). We selected the GENIA corpus since its lexical richness is high enough to represent the domain even for shallow processing. The Reuters-ABC-Business-RSS corpus has 33.5% infrequent terms, and thus this corpus is rejected. In addition, we selected the corpus built from the Cricinfo RSS feeds as another domain corpus, since it has a low percentage of infrequent items (except at frequency 3) and contains good terms to represent the domain (based on manual inspection; the feed authors are highly competent in that domain).
As a third domain, we selected the Mikalai Krapivin corpus (Computer Science), since it has a lower percentage of infrequent items than the NUS Keyphrase corpus. We rejected FAO since it contains 3.63% infrequent terms, which is higher than all other corpora except the already rejected Reuters-ABC-Business-RSS corpus.
As the outcome of this module, we selected two existing corpora (Mikalai Krapivin for the Computer Science domain and GENIA for the Bio Medical domain) and one newly built corpus (the Cricinfo RSS corpus for the Cricket domain). These act as the target domain corpus and the multiple contrastive corpora in an iterative manner. For instance, while we extract Computer Science related terms, the Mikalai Krapivin corpus acts as the target domain corpus, while the GENIA and Cricinfo corpora act as two different contrastive domain corpora.

TABLE I: LEXICAL RICHNESS OF CORPORA
For each corpus and for word frequencies 1 to 5, the table reports the percentage of word occurrences (normalized by corpus length, stop words removed), together with the total percentage of infrequent terms per corpus: MK 2.11%, NUS 3.31%, FAO 3.63%, Cricinfo RSS 3.66%, GENIA 2.48%, and Reuters-ABC-Business-RSS 33.50%.
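A minimal sketch of the lexical-richness check described above: the 5-occurrence cut-off and the length normalization follow the description in the text, while the whitespace tokenizer, the stop-word handling, and the function name are our own simplifications.

from collections import Counter

def infrequent_term_percentage(documents, stop_words=frozenset(), cutoff=5):
    """Percentage of token occurrences contributed by words that appear
    fewer than `cutoff` times in the corpus (normalized by corpus length)."""
    tokens = [t.lower() for doc in documents for t in doc.split()
              if t.lower() not in stop_words]
    counts = Counter(tokens)
    infrequent = sum(c for c in counts.values() if c < cutoff)
    return 100.0 * infrequent / len(tokens) if tokens else 0.0

# Example usage: corpora with a high infrequent-term percentage are rejected.
corpora = {"toy_domain_a": ["algorithm runs fast", "algorithm analysis"],
           "toy_domain_b": ["one off words everywhere never repeated"]}
for name, docs in corpora.items():
    print(name, round(infrequent_term_percentage(docs), 2))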

B. Extracting Domain-Specific Terms
As mentioned above, we consider multiple domain corpora separately to extract domain-specific terms. Here, we propose a mechanism that extends the approach discussed by Wong et al. [4] to weigh each term according to how strongly it relates to the given domain. This approach extracts domain-specific terms for multiple domains (e.g. "algorithm" and "randomized algorithm" relate to Computer Science, whereas "peripheral blood lymphocyte" and "human androgen receptor" relate to the Bio Medical domain). In addition, based on continuous evaluation, we obtained a set of possible POS tag combinations that represent complex terms. The process of extracting domain-specific terms is explained in three steps, using the following example paragraph from the Computer Science domain:
"Distributed memory multiprocessors are increasingly being used for providing high levels of performance for scientific applications. The distributed memory machines offer significant advantages over their shared memory counterparts in terms of cost and scalability, but it is a widely accepted fact that they are much more difficult to program than shared memory machines. One major reason for this difficulty is the absence of a single global address space."
Step 1: POS tag each sentence
The selected corpus is tokenized and the tokens are annotated with POS tags.
Step 2: Extract candidate terms using linguistic rules
Candidate terms (simple and complex) are extracted based on a set of linguistic rules. To find the complex and simple terms, we consider three words (W2 W1 W0) at a time (we do not address 4-gram terms or above, since our manual evaluation, shown in Table II, provided evidence that only six 4-gram domain-specific terms exist in 300 documents of the Mikalai Krapivin corpus).
In each iteration:
W2 is referred to as the Word Before The Previous Word, and its POS tag is referred to as P2;
W1 is referred to as the Previous Word, and its POS tag is referred to as P1;
W0 is referred to as the Current Word, and its POS tag is referred to as P0.
After each iteration, the pointer moves to the next word: W1 becomes W2, W0 becomes W1, and the next word in the sentence becomes W0.
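The sliding three-word window can be sketched as follows; the generator function and the pre-tagged input are illustrative assumptions, since the paper does not name the tagger used. The rules in the next subsection may advance the pointer further when a term is matched.

def windows(tagged_sentence):
    """Yield ((W2, P2), (W1, P1), (W0, P0)) triples over a POS-tagged sentence,
    advancing by one token per iteration."""
    for i in range(2, len(tagged_sentence)):
        yield tagged_sentence[i - 2], tagged_sentence[i - 1], tagged_sentence[i]

# A sentence tagged elsewhere, e.g. with any Penn Treebank POS tagger.
sentence = [("Distributed", "VBN"), ("memory", "NN"),
            ("multiprocessors", "NNS"), ("are", "VBP"),
            ("increasingly", "RB"), ("being", "VBG")]
for (w2, p2), (w1, p1), (w0, p0) in windows(sentence):
    print(w2, w1, w0)   # first window: Distributed memory multiprocessors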
TABLE II: FREQUENCY OF EACH N-GRAM IDENTIFIED BY A MANUAL ANNOTATOR FOR 300 DOCUMENTS FROM THE MIKALAI KRAPIVIN CORPUS
1-gram: 88, 2-gram: 122, 3-gram: 35, 4-gram: 6

Linguistic Rules
The following domain-independent rules are man-made: a manually annotated list of terms was taken from 200 documents of the Mikalai Krapivin corpus, and based on this list we identified the possible combinations of POS tags. These rules worked well for the other two domains as well. We identified five rules: four to extract complex terms and one to extract simple terms (rule number five). A similar approach was discussed by Frantzi et al. [8], whose linguistic filters cover several combinations of POS tags, but that study does not consider all POS tags (it considers only NN, JJ, and noun prepositions) and does not limit the number of words in a term. In our linguistic rules, we use the linguistic filters of Frantzi et al. [8] with all relevant POS tags (NN, NNS, NNP, NNPS, JJ, JJR, JJS, and possessive ending tags). For example, global/JJ address/NN space/NN is matched by three words with the POS tags JJ/NN/NN. In this way, we identified all the possible POS tag combinations; the identified combinations are used as rules to extract simple and complex terms from the corpus. We consider a maximum of 3-grams.
Rule 1: (P0 == NN, NNS, NNP or NNPS) and (P1 == NN, NNS, NNP or NNPS) and (P2 == JJ, JJR or JJS) -> lemmatize and combine W2+W1+W0, consider it a complex term of three words, and set the W2 pointer = current W2 pointer + 3.
Rule 2: (P0 == NN, NNS, NNP or NNPS) and (P1 == any other tag) and (P2 == NN, NNS, NNP or NNPS) -> lemmatize and combine W2+W1+W0, consider it a complex term of three words, and set the W2 pointer = current W2 pointer + 3.
Rule 3: (P1 == NN, NNS, NNP or NNPS) and (P0 == NN, NNS, NNP or NNPS) -> lemmatize and combine W1+W0, consider it a complex term of two words, and set the W2 pointer = current W2 pointer + 3.
Rule 4: (P1 == JJ, JJR or JJS) and (P0 == NN, NNS, NNP or NNPS) -> lemmatize and combine W1+W0, consider it a complex term of two words, and set the W2 pointer = current W2 pointer + 3.
Rule 5: (P2 == NN, NNS, NNP or NNPS) -> lemmatize W2, consider it a simple term of one word, and set the W2 pointer = current W2 pointer + 1.
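A minimal sketch of how Rules 1-5 might be applied over one POS-tagged sentence, including the pointer jumps and the one-entry-per-sentence deduplication described below; the lemmatizer is stubbed out and the function names are ours, so this illustrates the rule logic rather than the authors' actual implementation.

NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}

def lemma(word):
    # Placeholder for a real lemmatizer (the rules lemmatize matched words);
    # lowercasing is only a stand-in.
    return word.lower()

def extract_candidates(tagged_sentence):
    """Apply Rules 1-5 to one POS-tagged sentence and return candidate terms.
    A set is used so that each term is counted once per sentence."""
    terms = set()
    i = 0
    while i + 2 < len(tagged_sentence):
        (w2, p2), (w1, p1), (w0, p0) = tagged_sentence[i:i + 3]
        if p2 in ADJ and p1 in NOUN and p0 in NOUN:          # Rule 1
            terms.add(" ".join(lemma(w) for w in (w2, w1, w0)))
            i += 3
        elif p2 in NOUN and p1 not in NOUN and p0 in NOUN:   # Rule 2
            terms.add(" ".join(lemma(w) for w in (w2, w1, w0)))
            i += 3
        elif p1 in NOUN and p0 in NOUN:                      # Rule 3
            terms.add(lemma(w1) + " " + lemma(w0))
            i += 3
        elif p1 in ADJ and p0 in NOUN:                       # Rule 4
            terms.add(lemma(w1) + " " + lemma(w0))
            i += 3
        elif p2 in NOUN:                                     # Rule 5 (simple term)
            terms.add(lemma(w2))
            i += 1
        else:
            i += 1
    return terms

sentence = [("Distributed", "VBN"), ("memory", "NN"),
            ("multiprocessors", "NNS"), ("are", "VBP"),
            ("increasingly", "RB"), ("being", "VBG")]
print(extract_candidates(sentence))   # {'memory multiprocessors'}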
The output from Step 1 is used to explain the application of the rules in Step 2. In the first iteration, W2 W1 W0 is "Distributed memory multiprocessors". This does not satisfy the first two rules, since "Distributed" has a VBN POS tag, but the 3rd rule is satisfied for W1 W0 (memory/NN multiprocessors/NNS), where "memory" is the Previous Word (W1) and "multiprocessors" is the Current Word (W0).
In the second iteration, W2 W1 W0 is are/VBP increasingly/RB being/VBG, which does not satisfy any of the rules. In the third iteration, W2 W1 W0 is increasingly/RB being/VBG used/VBN, which again does not satisfy any of the rules.
To explain how the other rules are satisfied, we take the iteration where W2 W1 W0 is providing/VBG high/JJ levels/NNS, which satisfies the 4th rule, where the Previous Word tag (P1) is JJ and the Current Word tag (P0) is NNS; it yields "high level" as a complex term. Under the same condition, the following iteration finds "scientific applications" as a complex term for the same 4th rule.
We take another iteration to describe how a 3-gram term is handled. In this iteration, W2 W1 W0 is "global address space", with the POS tags global/JJ address/NN space/NN. This satisfies the 1st rule, where the Word Before The Previous Word tag (P2) is JJ, the Previous Word tag (P1) is NN, and the Current Word tag (P0) is NN; thus "global address space" is considered a complex term.
As per the above explanation, the process iterates through each sentence of the corpus and finds all possible complex terms. While iterating, we noticed that the same term can be matched by more than one rule (e.g. "memory multiprocessors"). To avoid redundancy, we only count a single entry of a term per sentence, even if it is matched by two rules.
At the end of this step, we have a list of simple terms and complex terms. A simple or complex term is eligible for the following steps if it occurs in more than one document within the domain. The following step describes the method for deciding whether a term is domain-specific.
Step 3: Finding domain-specific terms based on the term distribution calculation
The approach presented by Wong et al. [4] measures the relevance of a simple term or a complex term to a domain. However, as discussed, that approach focuses only on headwords and does not cover all possible complex terms. In addition, it uses a single contrastive corpus, which is not an efficient approach, and it cannot extract terms for multiple domains. In our approach, we extend it to support multiple contrastive corpora and to extract simple terms and complex terms (up to 3-grams).
Our calculation of statistical distribution supports: (1) multiple contrastive corpora, which avoids the problems of a single contrastive corpus, and (2) weighing simple terms and all possible complex terms (up to 3-grams).
(1) Our calculation of statistical distribution considers multiple contrastive corpora, which helps to identify terms that are unique to a domain. We consider the probability of each term in the target domain and its highest probability among the multiple contrastive corpora.


For example, consider the simple term "receptor" and the calculation of its weight. In this case, the numerator in Equation (1) is the probability of the term "receptor" in the target domain (the Bio Medical domain). The denominator in Equation (1) is the highest probability of "receptor" among the other domain corpora (in our study, either the Computer Science domain or the Cricket domain), excluding the Bio Medical domain.

w(t) = \frac{P(t \mid \text{target domain})}{\max_{m} P(t \mid d_m)}    (1)

This approach is capable of extracting terms for multiple domains. Since we consider multiple contrastive corpora, we can iteratively change the target domain by replacing the earlier target domain with one of the contrastive corpora; the earlier target domain is then treated as one of the multiple contrastive corpora.
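A minimal sketch of the simple-term weighting in Equation (1), using raw relative frequencies as probability estimates; the corpus representation and function names are illustrative assumptions rather than the authors' implementation.

def probability(term, corpus_tokens):
    """Relative frequency of a (lemmatized) simple term in one domain corpus."""
    return corpus_tokens.count(term) / len(corpus_tokens) if corpus_tokens else 0.0

def simple_term_weight(term, target_tokens, contrastive_corpora):
    """Equation (1): probability in the target domain divided by the highest
    probability among the contrastive domain corpora."""
    numerator = probability(term, target_tokens)
    denominator = max(probability(term, tokens)
                      for tokens in contrastive_corpora.values())
    # An infinite weight simply means the term never occurs in any
    # contrastive corpus.
    return numerator / denominator if denominator else float("inf")

# Toy usage: weigh "receptor" with the Bio Medical corpus as the target domain.
bio = ["receptor", "cell", "receptor", "protein"]
contrastive = {"computer_science": ["algorithm", "program"],
               "cricket": ["wicket", "receptor"]}
print(simple_term_weight("receptor", bio, contrastive))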

(2) Our calculation of statistical distribution supports extracting all possible simple and complex terms. For a complex term (assuming it is a 3-gram W2 W1 W0), we consider the probability of the whole term multiplied by the individual probabilities of W2, W1, and W0, both in the target domain and in the contrastive corpus where the probability of W2 W1 W0 is highest.
For example, consider the term "human androgen receptor" in the Bio Medical domain. In our approach, we find the probability of the term "human androgen receptor" and the individual probabilities of the words "human", "androgen", and "receptor"; the product of these probabilities is the numerator in Equation (2). The next step takes the term "human androgen receptor" and finds its probability in the remaining corpora; if it exists in more than one corpus, we select the domain corpus with the highest probability. In the denominator of Equation (2), the probability of the term in that corpus is multiplied by the probability of each individual word of the term in the same corpus.

w(t) = \frac{P(t \mid \text{target domain}) \cdot \prod_{i=1}^{n} P(a_i \mid \text{target domain})}{\max_{m} \left( P(t \mid d_m) \cdot \prod_{i=1}^{n} P(a_i \mid d_m) \right)}    (2)
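A minimal sketch of the complex-term weighting in Equation (2), analogous to the previous sketch but self-contained; the phrase-counting scheme, the data layout, and the names are illustrative assumptions, not the authors' implementation.

def phrase_probability(words, corpus_tokens):
    """Relative frequency of a word sequence (n-gram) in one domain corpus."""
    n = len(words)
    if len(corpus_tokens) < n:
        return 0.0
    hits = sum(corpus_tokens[i:i + n] == list(words)
               for i in range(len(corpus_tokens) - n + 1))
    return hits / len(corpus_tokens)

def complex_term_weight(words, target_tokens, contrastive_corpora):
    """Equation (2): the phrase probability times the product of its word
    probabilities in the target domain, divided by the same quantity in the
    contrastive corpus where it is highest."""
    def score(tokens):
        p = phrase_probability(words, tokens)
        for w in words:
            p *= phrase_probability([w], tokens)
        return p
    numerator = score(target_tokens)
    denominator = max(score(tokens) for tokens in contrastive_corpora.values())
    # As before, an infinite weight means the phrase never occurs in any
    # contrastive corpus.
    return numerator / denominator if denominator else float("inf")

bio = ["human", "androgen", "receptor", "binds", "the", "androgen", "receptor"]
contrastive = {"computer_science": ["randomized", "algorithm", "analysis"],
               "cricket": ["human", "error", "in", "umpiring"]}
print(complex_term_weight(["human", "androgen", "receptor"], bio, contrastive))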

Here, t = a1 a2 a3 or t = a1 a2 is a complex term, where a1, a2, a3 are the words that make up the complex term, n is the number of words in the term, and d_m is the m-th contrastive corpus.
As mentioned above, for each domain we calculate weights for all candidate terms. After applying Equations (1) and (2), we have a weight for each simple term and each complex term.

IV. EVALUATION

We compared our results with those of existing approaches using precision as the metric, since the existing approaches used only precision in their evaluations. We do not report recall, since the number of domain-specific terms in each corpus is unknown. In our evaluation, precision is calculated as the percentage of valid domain-specific terms (validated by domain experts) among the extracted domain-specific terms.
We evaluated the best 700 simple and complex terms extracted by our system for the Computer Science domain, and the best 300 simple and complex terms for the Bio Medical domain. (This work is part of a process for discovering non-taxonomic relations from domain-specific corpora. Our study revealed that the number of valid non-taxonomic relations increases with the number of terms up to the top 700 terms for the Computer Science domain; for the Bio Medical domain this cut-off is 300. The validity of the non-taxonomic relations was evaluated by domain experts. These details are outside the scope of this paper and are not discussed further.)
Our results were evaluated against existing approaches on the same corpora: the domain coherence measure (DC) [11] and the C-value/NC-value method [8]. We used the manually annotated domain-specific terms provided with each corpus.
As shown in Table III, our results are better than those of the domain coherence measure (DC) [11] and the C-value/NC-value method [8] in both the Computer Science domain and the Bio Medical domain.
Thus we have evaluated our approach on more than one domain, which indicates that our solution works accurately across different domains. Further, our process can be applied to any number of domain corpora.

TABLE III: EVALUATION OF EXTRACTED TERMS BASED ON THE AVAILABLE DATA SETS FOR THE COMPUTER SCIENCE AND BIO MEDICAL DOMAINS
Precision for: Computer Science domain | Bio Medical domain
Our approach, complex terms: 52.5% | 62%
Our approach, simple terms: 55% | 80%
Existing approaches, DC simple terms [11]: 47% | 55%
Existing approaches, C-value/NC-value simple and complex terms [8]: 28% | 32%

V. CONCLUSION

Automatic ontology creation involves the steps of term extraction, synonym identification, concept extraction, taxonomic relation discovery, non-taxonomic relation discovery, and axiom extraction. This study focused on improving the term extraction process to support concept extraction from multiple domain corpora.



We have implemented a solution that uses multiple contrastive corpora, instead of the single contrastive corpus used in previous research. The existing statistical calculations were inadequate for multiple contrastive corpora; to overcome this, we implemented a calculation of statistical distribution that can extract simple and complex terms for multiple domains using multiple contrastive corpora. Our approach was evaluated against existing approaches that used the same corpora. Our results are better than those of the domain coherence measure (DC) [11] and the C-value/NC-value method [8] on the Mikalai Krapivin corpus and the GENIA corpus, respectively.
In our current approach, we only consider terms that are present in the corpus. However, there may be synonyms of these terms that can also represent the respective domain. For future work, we intend to consider synonyms of each domain-specific term, which will help to identify more terms specific to a domain. Additionally, we will compare the remaining term extraction techniques with our approach by incorporating new corpora into our term extraction process. A limitation of the proposed approach is that it uses manually created linguistic rules; this can be alleviated by investigating techniques for learning linguistic rules from corpora.
REFERENCES
[1] P. Buitelaar, P. Cimiano, and B. Magnini, Ontology Learning from Text: Methods, Evaluation and Applications, vol. 123. IOS Press, 2005.
[2] H. Hu and D.-Y. Liu, "Learning OWL ontologies from free texts," in Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 2004, pp. 1233-1237.
[3] W. Wong, W. Liu, and M. Bennamoun, "Determining termhood for learning domain ontologies using domain prevalence and tendency," in Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, vol. 70, 2007, pp. 47-54.
[4] W. Wong, W. Liu, and M. Bennamoun, "Determining termhood for learning domain ontologies in a probabilistic framework," in Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, vol. 70, 2007, pp. 55-63.
[5] E. Dietz, D. Vandic, and F. Frasincar, "TaxoLearn: A semantic approach to domain taxonomy learning," in 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012, pp. 58-65.
[6] P. Velardi, M. Missikoff, and R. Basili, "Identification of relevant terms to support the construction of domain ontologies," in Proceedings of the Workshop on Human Language Technology and Knowledge Management, 2001, p. 5.
[7] R. Basili, A. Moschitti, M. T. Pazienza, and F. M. Zanzotto, "A contrastive approach to term extraction," in Terminologie et Intelligence Artificielle. Rencontres, 2001, pp. 119-128.
[8] K. T. Frantzi, S. Ananiadou, and J. Tsujii, "The C-value/NC-value method of automatic recognition for multi-word terms," in Research and Advanced Technology for Digital Libraries. Springer, 1998, pp. 585-604.
[9] B. Fortuna, N. Lavrac, and P. Velardi, "Advancing topic ontology learning through term extraction," in PRICAI 2008: Trends in Artificial Intelligence. Springer Science + Business Media, pp. 626-635.
[10] M. Wynne, "Developing Linguistic Corpora: a Guide to Good Practice," 2016. [Online]. Available: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm. [Accessed: 20-Feb-2016].
[11] P. Buitelaar, G. Bordea, and T. Polajnar, "Domain-independent term extraction through domain modelling," in Proceedings of the 10th International Conference on Terminology and Artificial Intelligence (TIA 2013), Paris, France, 2013, pp. 61-68.
[12] Wing.comp.nus.edu.sg, "Keyphrase Corpus," 2016. [Online]. Available: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus/. [Accessed: 20-Feb-2016].
[13] J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii, "GENIA corpus - a semantically annotated corpus for bio-textmining," Bioinformatics, vol. 19, pp. i180-i182, 2003.

