
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)

A New Sentence Similarity Method Based on a Three-Layer Sentence Representation
Rafael Ferreira, Rafael Dueire Lins, Fred Freitas, Bruno Avila
Informatics Center
Federal University of Pernambuco
Recife, Pernambuco, Brazil
{rm,rdl,fred,bta}@cin.ufpe.br

Steven J. Simske, Marcelo Riss
Hewlett-Packard Labs
Fort Collins, CO 80528, USA
{steven.simske,marcelo.riss}@hp.com

Abstract—Sentence similarity methods are used to assess the degree of likelihood between phrases. Many natural language applications, such as text summarization, information retrieval, text categorization, and machine translation, employ measures of sentence similarity. The existing approaches to this problem represent sentences as bag-of-word vectors or as the syntactic information of the words in the phrase, and compute the likelihood between phrases by composing the similarity between the words in the sentences. Such schemes, however, do not address two important concerns in the area: the semantic problem and the word order problem. This paper proposes a new sentence similarity measure that attempts to address both problems by taking into account the lexical, syntactic, and semantic analysis of sentences. The new similarity measure outperforms state-of-the-art systems by around 6% when tested on a standard, publicly available dataset.

Keywords—Sentence Similarity; Sentence Representation; Natural Language Processing.

I. INTRODUCTION

Sentence similarity, or short-text similarity, measures the degree of similarity between phrases. It is a challenging problem because similarity methods should also handle sentences with partial information, such as when one sentence is split into two or more short texts, or when a phrase contains two or more sentences. Applications such as text summarization [1], information retrieval [2], image retrieval [3], text categorization [4], and machine translation [5] rely on, or may benefit from, a sentence similarity method.

Many works in the literature address this problem [6], [7], [8], [9] by representing sentences as bag-of-word vectors or through the syntactic information among words. After building the sentence representation, such methods compute different measures of the degree of similarity between words; the overall sentence similarity is a composition of those measures. Nevertheless, those approaches do not address the following problems: (i) the meaning problem [10]: sentences with the same meaning but different words. For example, the sentences "John is an intelligent boy" and "John is a brilliant lad" mean the same thing if the contexts surrounding them do not differ too much; (ii) the word order problem [11]: the order of the words in a text influences its meaning. For instance, the sentences "A loves B" and "B loves A" use the same words, but the difference in word order changes their meaning completely.
Taking such problems into account, this paper presents (i) a new sentence representation based on lexical (shallow layer), syntactic (syntactic layer), and semantic analysis (frame layer), and (ii) four new sentence similarity measures based on this representation.

The shallow layer mainly consists of the product of lexical analysis. The similarity based on this layer uses bag-of-word vectors, as in previous work [6], [7], [8], [9]. In addition to lexical analysis, this layer applies a Named Entity Recognizer (NER) to group related words, which helps improve the accuracy of the similarity measure. The syntactic layer employs syntactic relations and coreference resolution to deal with the meaning and word order problems. The frame layer uses Semantic Role Annotation (SRA) [12] to address both problems as well: the SRA analysis returns the meaning of the actions, the actor who performs each action, and the object/actor affected by the action, among other features. To the best of the authors' knowledge, this is the first time that SRA is used to measure the semantic similarity between sentences; traditional methods employ WordNet [13] or a corpus-based measure [8] in a classic bag-of-word vector similarity.

The syntactic and frame layers both deal with the meaning and word order problems, but they cover different aspects of them: the syntactic (intermediate) layer typically achieves better results for sentences with low similarity, whereas the frame layer is better for sentences with a higher degree of similarity.

In order to evaluate the proposed measures, a series of experiments was performed using the benchmark dataset proposed by Li et al. [6], which may be considered the standard dataset for this problem. The proposed measures were assessed with two traditional metrics: Pearson's correlation coefficient (PCC) and Spearman's rank correlation coefficient (SRCC). A combination of the proposed measures achieves 0.92 for the PCC and 0.91 for the SRCC, an improvement of 7% and 6%, respectively, over the state of the art.
The rest of this paper is organized as follows. Section II describes the main differences between the proposed method and related work. Section III introduces the sentence representation used in the proposed approach. Section IV details the proposed similarity measures, which build on the three-layer representation. Section V evaluates the proposed measures in terms of Pearson's correlation coefficient, Spearman's rank correlation coefficient, and the average rank difference. Finally, Section VI presents the conclusions and lines for future work.









II. RELATED WORK

This section presents the related works that have achieved the best results so far on the sentence similarity problem [6], [7], [8], [9].
Li et al. [6] propose a similarity measure that translates each sentence into a semantic vector by using a lexical database, together with a word order vector. A ratio then weights the significance of the semantic versus the syntactic information. A new word vector is created for each sentence using information from the lexical database, and the significance weight of a word is calculated using information content obtained from a corpus-based method for measuring similarity between words [14]. By combining the semantic vector with the information content from the corpus-based method, a new semantic vector is built for each of the two sentences, and semantic similarity is measured from these vectors. Finally, sentence similarity is computed by combining semantic similarity and order similarity.
Islam et al. [7] measure the similarity of two texts by relying on semantic and syntactic information. First, the method calculates the string similarity, which is obtained by applying the longest common subsequence measure, and the semantic word similarity, which is measured by a corpus-based metric. These measures support the creation of an optional common-word order similarity function that incorporates syntactic information. Finally, the text similarity is calculated by combining the string similarity, the semantic similarity, and the common-word order similarity.
Mihalcea et al. [8] propose a similarity metric that works as follows: for each word in the first sentence (the main sentence), it tries to identify the word in the second sentence with the highest semantic similarity according to one of the word-to-word similarity measures. Then, the process is repeated using the second sentence as the main sentence. Finally, the resulting similarity scores are combined using the arithmetic average.
The SyMSS method [9] assesses the influence of the syntactic structure of the two compared sentences on the calculation of similarity. It relies on the idea that a sentence is made up of the meanings of its individual words and the syntactic connections among them. Using WordNet, the semantic information is obtained through a deep parsing process that finds the main phrases composing the sentence.
Although the method in this paper uses ideas similar to those of Mihalcea et al. [8] to compare sentences, the proposed approach uses lexical, syntactic, and semantic analysis to further improve the results. In fact, the similarity measures are based on a three-layer sentence representation, which encapsulates different levels of sentence information. Furthermore, SRA is used to extract the semantics of the text. Previous works that claim to use semantic information do not actually evaluate sentence semantics; instead, they use WordNet to evaluate the semantics of the words, which can lead to poor results.

III. THREE-LAYER SENTENCE REPRESENTATION

This section introduces the proposed sentence representation used for calculating the similarity measures. The representation consists of three layers: shallow, syntactic, and frame, which use lexical, syntactic, and semantic analysis, respectively, to model the sentences. It is important to notice, however, that these layers do not reflect exactly the corresponding linguistic analyses; they incorporate other processing as well, such as NER and coreference resolution. The input used to build this representation is a single sentence, and the output is a text file and two RDF files containing the shallow, syntactic, and frame layers, respectively. Each layer is detailed as follows.

A. Shallow Layer

In the shallow layer, each sentence is represented by a sequence of groups of tokens. The input consists of a sentence and the output is a text file containing the list of the sentence tokens. The steps to represent a sentence in this layer are as follows:
1) Lexical analysis: This step splits the sentence into a list of tokens. Every token, including punctuation, takes part in the list.
2) Stop word removal: This step rules out punctuation and words with little representative value for the document, e.g. articles and pronouns.
3) NER: This step applies the NER, which groups compound tokens. In other words, it identifies related tokens that appear in sequence in the list, for instance a compound proper name like "United States of America" or a generic term like "motor vehicle".
Figure 1 shows the entire process used to build this layer for the sentence "The president criticized insurance companies and banks", along with the output of each step. The output of this layer is a text file containing the NER list.

This layer improves the performance of simple text processing tasks. The novelty in the layer is the process of grouping compound tokens. Although the layer does not convey much information about the sentence, it is widely employed in various traditional text mining tasks, such as information retrieval and summarization.

[Figure 1. Shallow layer for the sentence "The president criticized insurance companies and banks": lexical analysis yields (the, president, criticized, insurance, companies, and, banks); stop word removal yields (president, criticized, insurance, companies, banks); NER yields (president, criticized, insurance_companies, banks).]
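For illustration, the following is a minimal sketch of the shallow layer, assuming spaCy's tokenizer, stop word list, and NER as stand-ins for the tools the paper does not name; the function name and the exact output are illustrative.

```python
# Hedged sketch of the shallow layer: lexical analysis, stop word
# removal, and NER-based grouping of compound tokens. spaCy is an
# assumption; the paper does not specify its NLP toolkit.
import spacy

nlp = spacy.load("en_core_web_sm")

def shallow_layer(sentence: str) -> list[str]:
    doc = nlp(sentence)
    # Step 3 (NER grouping): map each entity's start index to its span,
    # so multi-token entities become one underscore-joined token.
    ent_by_start = {ent.start: ent for ent in doc.ents}
    tokens, i = [], 0
    while i < len(doc):
        if i in ent_by_start:
            ent = ent_by_start[i]
            tokens.append("_".join(t.text.lower() for t in ent))
            i = ent.end
        else:
            tok = doc[i]
            # Steps 1-2: keep lexical tokens, drop punctuation and stop words.
            if not tok.is_punct and not tok.is_stop:
                tokens.append(tok.text.lower())
            i += 1
    return tokens

print(shallow_layer("The president criticized insurance companies and banks."))
# Target output: ['president', 'criticized', 'insurance_companies', 'banks'];
# grouping generic terms like "insurance companies" may require noun-chunk
# merging in addition to the NER model.
```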
B. Syntactic Layer

In this layer, the sequence of groups of tokens generated in the shallow layer is converted into a graph represented by RDF triples [15]. The steps to convert a sequence of groups of tokens into an RDF graph are described as follows:
1) Syntactic analysis: The first step performs a syntactic analysis of the sentence. Relations such as subject,



direct object, and adverbial modifier, among others, are represented here as usual. In addition, prepositions and conjunct relations are also extracted from the syntactic analysis.
2) NER: Differently from the shallow layer, NER is not used to group words. Instead, it qualifies the entities; for instance, "president" is the title of a person.
3) Coreference relations: This step identifies anaphora relations. For example, in the sentence "I am going to visit my parents", the possession relation that connects the pronoun "I" to the object "my parents" is highly informative. This information, combined with the entity qualification provided by NER, complements the syntactic analysis.
4) Graph creation: An unweighted, directed graph stores the entities and their relations. The vertices are the elements obtained from the shallow layer, while the edges denote the relations described in the previous steps.
The syntactic analysis step (the first step) is important because it represents, at the syntactic level, an order relation among the tokens of the sentence. Moreover, this step describes the possible or acceptable syntactic structures of the language and decomposes the text into syntactic units, capturing the way in which the syntactic elements are arranged in a sentence. This kind of relation can be used in applications such as summarization, sentence similarity, and information retrieval, among others.

To deal with the meaning problem, this layer uses NER to identify and qualify the main entities of the sentence. In addition, coreference plays a relevant role in this layer given that, even when analyzing a single sentence, it helps describe anaphora relations.

The RDF format was chosen to store the graph for the following reasons: (i) RDF is a standard model for data interchange on the web; (ii) it provides a simple and clean format; (iii) inferences are easily performed over RDF triples; and (iv) many tools to handle RDF are freely available.

Figure 2 shows the syntactic layer for the sentence "The president criticized insurance companies and banks". Some edges have one direction, since syntactic relations in general are directed. However, this is not a strict rule: the model also accommodates bi-directed edges, usually for coreference and conjunction relations. Notice that all vertices in the example are listed in the output of the previous layer.

[Figure 2. Syntactic (intermediary) layer for "The president criticized insurance companies and banks": the vertices president, criticized, insurance_companies, and banks are connected by Nominal Subject, Direct Object, and And edges, with NER edges qualifying entities, e.g. president-NER-Person.]
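As a sketch of this layer, the fragment below derives triple candidates from a dependency parse, assuming spaCy's parser and entity labels as stand-ins for the unnamed parser and NER; coreference is omitted, and serialization to actual RDF (e.g. with rdflib) is left out for brevity.

```python
# Hedged sketch of the syntactic layer: dependency arcs and entity
# qualifications become (vertex, edge, vertex)-style triples that could
# later be serialized as RDF. spaCy is an assumption, not the authors' tool.
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_triples(sentence: str) -> list[tuple[str, str, str]]:
    doc = nlp(sentence)
    triples = []
    for tok in doc:
        # Dependency arcs become directed edges: head --dep--> child.
        if tok.dep_ in ("nsubj", "dobj", "pobj", "advmod", "conj", "prep"):
            triples.append((tok.head.lemma_, tok.dep_, tok.lemma_))
    # Entity qualification (step 2): entity --NER--> label.
    for ent in doc.ents:
        triples.append((ent.text.lower(), "NER", ent.label_))
    return triples

for t in syntactic_triples("The president criticized insurance companies and banks."):
    print(t)
# e.g. ('criticize', 'nsubj', 'president'), ('criticize', 'dobj', 'company'), ...
```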

C. Frame Layer

In this layer, the RDF graph is augmented with entity roles and sense identification. As in the syntactic layer, the input is the sequence of groups of tokens extracted in the shallow layer. This layer applies SRA to define the entity roles and to identify the sense of each entity in the sentence.

The frame layer uses SRA to perform two different operations:
1) Sense identification: Sense identification is paramount for this type of representation, since different words can denote the same meaning, particularly verbs. For instance, "accuse", "blame", "criticize", and "deprecate" are words that can be mapped to the sense of judgment.
2) Role annotation: Differently from the syntactic layer, role annotation identifies the semantic function of each entity. For instance, in the same example sentence, "the president" is the communicator of the action "criticized". Thus, the interpretation of the action is identified, not only its syntactic relation.

This layer deals with the meaning problem using the output of sense identification: the general meaning of the main entities of the sentence, not just the written words, is identified in this step. The role annotation, in turn, extracts discourse information, as it exposes the order of actions, actors, etc., thereby dealing with the word order problem. Moreover, such information is relevant for tasks such as information extraction and summarization.

Figure 3 presents a frame layer example. Two different relation types are identified in the figure: sense relations, e.g. the triple banks-sense-business, and role annotation relations, e.g. criticized-evaluee-banks. In the frame layer, as in the syntactic layer, the representation is an RDF graph.

[Figure 3. Frame layer for the phrase "The president criticized insurance companies and banks": criticized carries the sense Judgment, with president as Communicator and insurance companies and banks as Evaluee; insurance companies and banks carry the sense Business.]
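To make the two SRA operations concrete, the toy sketch below augments syntactic triples with sense and role edges. The lexicons are invented stand-ins for a real frame-semantic parser such as the one cited in [12]; every name in them is illustrative only.

```python
# Hedged sketch of the frame layer: a toy sense lexicon and role lexicon
# stand in for real Semantic Role Annotation. All entries are invented
# for illustration; a real system would use FrameNet-style resources.
SENSE_LEXICON = {
    "criticize": "Judgment", "accuse": "Judgment", "blame": "Judgment",
    "bank": "Business", "company": "Business",
}
ROLE_LEXICON = {("Judgment", "nsubj"): "Communicator",
                ("Judgment", "dobj"): "Evaluee"}

def frame_triples(syntactic: list[tuple[str, str, str]]):
    out = []
    for head, dep, child in syntactic:
        sense = SENSE_LEXICON.get(head)
        if sense:
            if (head, "sense", sense) not in out:       # sense identification
                out.append((head, "sense", sense))
            role = ROLE_LEXICON.get((sense, dep))
            if role:                                     # role annotation
                out.append((head, role.lower(), child))
        if child in SENSE_LEXICON:
            out.append((child, "sense", SENSE_LEXICON[child]))
    return out

triples = [("criticize", "nsubj", "president"),
           ("criticize", "dobj", "company"),
           ("criticize", "dobj", "bank")]
for t in frame_triples(triples):
    print(t)  # e.g. ('criticize', 'sense', 'Judgment'), ('criticize', 'evaluee', 'bank')
```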







IV. SENTENCE SIMILARITY: MEASURES AND METHODS

This section introduces four sentence similarity measures based on the three-layer representation of Section III. The proposed measures use word-to-word similarity to perform the sentence comparison. Accordingly, the first part of this section describes six measures of similarity between words, and the second part presents the details of the proposed approach.
A. Similarity Between Words

This subsection presents the six measures used to calculate the similarity between words. They cover the top five dictionary-based measures according to the results reported in [9] and [16]. In addition, the Levenshtein measure [17] is used in order to provide a string-based evaluation, since this metric is in general faster than dictionary-based methods. The similarity measures are:


Path metric: It measures the length of the path between two concepts to score the similarity between them.
Resnik metric (Res): It measures how much information content two concepts share, calculating the information content of the lowest common subsumer (LCS) of the two concepts.
Lin metric: It is based on the Resnik measure. It measures the ratio between the information content of the LCS and the information content of each of the concepts.
Wu & Palmer metric (WP): This similarity metric compares the global depth values of two concepts, using the WordNet taxonomy.
Leacock & Chodorow metric (LC): This similarity metric uses the length of the shortest path between two concepts and the maximum depth of the taxonomy to measure their similarity.
Levenshtein metric (Lev): It is a string metric that calculates the minimum number of edit operations needed to transform one string into another, the allowed edit operations being insertion, deletion, or substitution of a single character.
From this point on, "similarity" between concepts, tokens, and words refers to one of the methods presented in this section.

B. Similarity Between Sentences

The three-layer representation is the basis for creating four different sentence similarity measures. Different measures are proposed because their outputs are complementary, as presented in Section V. In the sequel, a detailed explanation of each metric is presented.

1) Shallow metric: This metric is based on the shallow layer. The resemblance between sentences is composed from the similarity between sentence tokens. The main novelty here is that the proposed representation groups compound tokens (see Section III). The steps to calculate this metric are as follows.

Let A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_m} be two sentences, such that a_i is a token of sentence A, b_j is a token of sentence B, n is the number of tokens of sentence A, and m is the number of tokens of sentence B. In the first step, each token of sentence A (the main sentence) is compared to all tokens of sentence B (the secondary sentence), and the similarity to the most highly correlated token is returned. For example, in Figure 4, the term a_1 is compared with all terms in B and the highest similarity, 0.66, obtained for the pair (a_1, b_4), is returned.

[Figure 4. Example of similarity between tokens: a_1 is compared against b_1, ..., b_5, producing similarities 0.2, 0.5, 0.3, 0.66, and 0.4; the maximum, 0.66 for (a_1, b_4), is kept.]

The next step is to sum the similarities of each word from A:

PartialSim(A) = \sum_{i=1}^{n} TokenSim(a_i, B),   (1)

where TokenSim(a_i, B) returns the highest similarity between the term a_i and the terms of sentence B. Then, the process is repeated using B as the main sentence. The final similarity is given by:

SentenceSim(A, B) = \frac{PartialSim(A) + PartialSim(B)}{2}.   (2)
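A direct transcription of Equations 1 and 2 is sketched below; the word-to-word similarity function is a trivial placeholder, to be swapped for one of the measures of Section IV-A.

```python
# Minimal sketch of the shallow metric (Equations 1 and 2).
def word_sim(a: str, b: str) -> float:
    # Placeholder: exact match only; substitute Lin, Resnik, Path, etc.
    return 1.0 if a == b else 0.0

def token_sim(token: str, sentence: list[str]) -> float:
    # Highest similarity between `token` and any token of `sentence`.
    return max(word_sim(token, b) for b in sentence)

def partial_sim(a: list[str], b: list[str]) -> float:
    # Equation (1): sum of best-match similarities for each token of A.
    return sum(token_sim(t, b) for t in a)

def sentence_sim(a: list[str], b: list[str]) -> float:
    # Equation (2): average of both directions.
    return (partial_sim(a, b) + partial_sim(b, a)) / 2

print(sentence_sim(["president", "criticized", "banks"],
                   ["president", "blamed", "banks"]))
```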

2) Syntactic metrics: Two measures are derived from the syntactic layer. The first, called Synt1, is calculated by matching complete RDF triples. The process works as presented in Figure 4; however, instead of comparing words, this metric compares RDF triples. The comparison checks the triples in two steps, as presented in Figure 5. In each step, the similarity between the vertices and the edge is measured using one of the word-to-word similarity metrics, and the total is their simple arithmetic mean. The final similarity of the triples is the arithmetic mean of the two steps; in the example, the final result is 0.365.

The rest of the process works as presented in Equations 1 and 2; the only difference lies in the parameters. Here, n is the number of triples in the graph that represents the sentence, and TokenSim(t, S) is the function that returns the highest similarity for triple t with respect to the triples of graph S. This method, however, proved unsatisfactory: the results were not as expected, as presented in Section V.

Therefore, a second metric, called Synt2, was created. It extends the first one by comparing only the vertices of the RDF triples: the process described in Figure 5 is replaced by the one in Figure 6, while the rest of the process remains the same. This simple change in the triple comparison improved the results.

[Figure 5. Example of similarity between triples (Synt1): the triples (a_1, u, a_2) and (b_1, v, b_2) are compared in two steps; in each step the two vertex similarities and the edge similarity are averaged (e.g. 0.3, 0.5, 0.2 give 0.33, and 0.4, 0.5, 0.3 give 0.4), and the final triple similarity is the mean of the two steps.]

[Figure 6. Example of extended similarity between triples (Synt2): the same two steps as Synt1, but only the vertex similarities are averaged (e.g. (0.3 + 0.2)/2 = 0.25 and (0.4 + 0.3)/2 = 0.35).]




3) Frame metric: The last metric is based on the frame layer. The process follows the idea presented for the syntactic measures; however, besides using a different layer representation, it changes the RDF triple comparison: the frame metric compares pairs (vertex, edge) as the basic similarity unit. By analyzing the graphs generated by the frame layer, it became apparent that the pair (vertex, edge) is meaningful. For instance, the sense edges introduced in Section III connect a token of the sentence with its sense; therefore, it is important to measure whether two sentences contain related tokens and senses. Figure 7 shows the calculation of the similarity.

[Figure 7. Example of similarity between pairs (vertex, edge): the pairs (a_1, u) and (b_1, v) are compared in two steps, averaging the vertex and edge similarities (e.g. (0.3 + 0.5)/2 = 0.4 and (0.5 + 0.2)/2 = 0.35).]
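The contrast between the three graph-based comparisons can be sketched as below, reducing the two-step alignment of Figures 5-7 to a single aligned comparison for brevity; word_sim is the same placeholder used in the shallow metric sketch.

```python
# Hedged sketch contrasting Synt1, Synt2, and the frame metric on one
# aligned comparison; the paper averages two such steps per match.
Triple = tuple[str, str, str]  # (vertex, edge, vertex)

def word_sim(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0  # placeholder word-to-word similarity

def synt1_sim(t: Triple, s: Triple) -> float:
    # Synt1: average over both vertices and the edge label.
    return (word_sim(t[0], s[0]) + word_sim(t[1], s[1]) + word_sim(t[2], s[2])) / 3

def synt2_sim(t: Triple, s: Triple) -> float:
    # Synt2: vertices only; differing edge labels no longer hurt the score.
    return (word_sim(t[0], s[0]) + word_sim(t[2], s[2])) / 2

def frame_sim(p: tuple[str, str], q: tuple[str, str]) -> float:
    # Frame metric: (vertex, edge) pairs such as ('banks', 'sense').
    return (word_sim(p[0], q[0]) + word_sim(p[1], q[1])) / 2

print(synt2_sim(("president", "nsubj", "criticized"),
                ("president", "dobj", "criticized")))  # 1.0; Synt1 gives 0.67
```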


V. EXPERIMENTAL RESULTS


This section presents a series of experiments performed to assess the proposed approach. The dataset used was the benchmark dataset proposed by Li et al. [6]. It consists of 65 pairs of sentences built from word definitions in the Collins Cobuild dictionary, together with the average similarity score given to each pair by 32 human judges. Nevertheless, the authors of [6] use only 30 of these sentence pairs, since the similarities of the remaining pairs were considered irrelevant. Hence, the dataset used here contains the 30 pairs of sentences with relevant similarity values.



Table I
Pearson's and Spearman's coefficients of the sentence similarities given by the proposed measures.

Metric        PCC   SRCC
Path-Shallow  0.73  0.74
Res-Shallow   0.75  0.73
Lin-Shallow   0.76  0.73
WP-Shallow    0.64  0.53
LC-Shallow    0.69  0.65
Lev-Shallow   0.72  0.70
Path-Synt1    0.70  0.59
Res-Synt1     0.74  0.65
Lin-Synt1     0.74  0.65
WP-Synt1      0.60  0.45
LC-Synt1      0.65  0.53
Lev-Synt1     0.68  0.57
Path-Synt2    0.73  0.66
Res-Synt2     0.77  0.69
Lin-Synt2     0.76  0.70
WP-Synt2      0.56  0.44
LC-Synt2      0.65  0.54
Lev-Synt2     0.74  0.67
Path-Frame    0.79  0.82
Res-Frame     0.82  0.83
Lin-Frame     0.83  0.85
WP-Frame      0.76  0.72
LC-Frame      0.70  0.77
Lev-Frame     0.81  0.84

The Pearson's correlation coefficient (PCC) and the Spearman's rank correlation coefficient (SRCC) are used to evaluate the proposed measures. The PCC measures the strength and direction of the linear relationship between two variables; here, it relates the human similarity judgments to the similarity obtained with the proposed measures. The SRCC calculates the correlation between the ranks of two variables; in this experiment, the sentences are ranked from the highest similarity to the lowest.

Table I presents the results of the proposed measures in terms of Pearson's and Spearman's coefficients. Each measure is described as a pair (similarity between words, similarity between sentences), as defined in Sections IV-A and IV-B, respectively; Synt1 compares entire triples (vertex, edge, vertex), while Synt2 compares only the vertices, as described in Section IV-B.

The first important outcome is that the measures based on the frame layer achieve the best results (see the Frame rows of Table I). This empirically demonstrates the importance of the third layer, which uses SRA to represent the semantic information of the sentence. The best result is achieved by the combination of the Lin and frame metrics (Lin-Frame), which reaches a PCC of 0.83 and an SRCC of 0.85.

Another conclusion concerns the similarity between words measures (Section IV-A): the Lin and Resnik measures achieve better results than the other measures (see Table I). This differs from previous work, where the best results came from the Leacock & Chodorow metric [16] and the Path metric [9]. It happens because the three-layer representation uses more general terms, mainly in the frame layer, differently from previous works.
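Both evaluation coefficients are standard and available in scipy, as the minimal sketch below shows; the score lists here are illustrative values, not the paper's data.

```python
# Minimal sketch of the evaluation, assuming `human` holds averaged human
# judgments and `system` the scores of one proposed measure for the same
# sentence pairs. The numbers below are illustrative only.
from scipy.stats import pearsonr, spearmanr

human = [0.01, 0.13, 0.28, 0.35, 0.96]
system = [0.05, 0.20, 0.30, 0.50, 0.90]

pcc, _ = pearsonr(human, system)
srcc, _ = spearmanr(human, system)
print(f"PCC = {pcc:.2f}, SRCC = {srcc:.2f}")
```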
As mentioned in Section IV-B, the results of Synt2 are better than those of Synt1. This is because Synt2 does not measure the similarity between edges. In the syntactic layer, the edges are syntactic or coreference relations, and in many cases two related concepts are connected by different syntactic relations; thus, using the edges decreases the correlation.


In addition to these results, the outcomes of the Synt2 and Frame metrics can be largely improved by combining them. Table II shows which sentence similarity metric achieves the better result for each pair of sentences using the Lin word metric: Synt2 attains the best result in 15 cases, Frame in 12 cases, and Shallow in 3 cases. The Synt2 metric achieves better results on sentences with low similarity, and Frame on sentences with high similarity. Based on this result, the Lin-Synt2 (Synt2) and Lin-Frame (Frame) measures were combined as follows:


CombSim = \begin{cases} Synt2, & \text{if } (Synt2 + Frame)/2 < 0.5, \\ Frame, & \text{otherwise,} \end{cases}   (3)

where CombSim is the combined similarity of the Synt2 and Frame measures. The combined metric achieves a PCC of 0.92 and an SRCC of 0.91.


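Equation 3 translates to a one-line rule, sketched below; synt2 and frame are assumed to be the Lin-Synt2 and Lin-Frame scores for a single sentence pair.

```python
# Minimal sketch of the combined measure (Equation 3).
def comb_sim(synt2: float, frame: float) -> float:
    # Synt2 wins on low-similarity pairs, Frame on high-similarity ones.
    return synt2 if (synt2 + frame) / 2 < 0.5 else frame
```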
Table II
The best sentence similarity metric for each pair of sentences, using the Lin word metric.

Sentence Pair        Metric  | Sentence Pair          Metric
cord:smile           Synt2   | glass:tumbler          Synt2
autograph:shore      Synt2   | grin:smile             Frame
asylum:fruit         Synt2   | serf:slave             Shallow
boy:rooster          Synt2   | journey:voyage         Frame
coast:forest         Synt2   | autograph:signature    Frame
boy:sage             Synt2   | coast:shore            Frame
forest:graveyard     Synt2   | forest:woodland        Frame
bird:woodland        Synt2   | implement:tool         Frame
hill:woodland        Synt2   | cock:rooster           Frame
magician:oracle      Synt2   | boy:lad                Synt2
oracle:sage          Shallow | cushion:pillow         Frame
furnace:stove        Synt2   | cemetery:graveyard     Frame
magician:wizard      Shallow | automobile:car         Synt2
hill:mound           Synt2   | midday:noon            Frame
cord:string          Frame   | gem:jewel              Frame

To conclude the experiments, Table III presents a comparison between the proposed measures and the best related measures [6], [7], [9]. The combined version of the proposed metric (Equation 3) attains results 7% better with regard to the PCC and 6% better for the SRCC. In addition, Lin-Frame achieves results comparable to the state of the art, mainly with respect to the SRCC. For some applications, such as information retrieval, ranking the most relevant sentences is more important than the similarity values themselves: an information retrieval system does not need to produce accurate similarity scores, it only needs to retrieve the most relevant sentences (based on their ranks) for a specific query. It is therefore important to highlight that the SRCC matters more than the PCC for such applications.

Table III
Comparison of the proposed measures against the best related work.

Metric                 PCC    SRCC
Proposed combination   0.92   0.91
Islam-Inkpen [7]       0.85   0.83
Lin-Frame              0.83   0.85
Li-McLean [6]          0.81   0.81
SyMSS [9]              0.79   0.85

VI. CONCLUSIONS AND LINES FOR FURTHER WORK

This paper presented a three-layer sentence representation and four different measures to compute the similarity between two sentences. The three layers are: (i) the shallow layer, which encapsulates lexical analysis, stop word removal, and named entity recognition; (ii) the syntactic layer, which encapsulates syntactic, NER, and coreference relations; and (iii) the frame layer, which mainly describes the semantic role annotation. These layers deal with two major problems in sentence similarity measurement: the meaning and word order problems.

Regarding the measures for computing sentence similarity, the main difference from related works is the integration of lexical, syntactic, and semantic analysis to further improve the results. In fact, the similarity measures are based on the three-layer sentence representation, which encapsulates different levels of sentence information. Furthermore, SRA is used to extract the semantics of the text. Previous works that claim to use semantic information do not actually evaluate sentence semantics; instead, they use WordNet to evaluate the semantics of the words, which can give poor results.

A series of experiments was presented using the dataset proposed by Li et al. [6] and the two traditional measures, Pearson's correlation coefficient (PCC) and Spearman's rank correlation coefficient (SRCC). The best single metric achieves a PCC of 0.83 and an SRCC of 0.85, and the combination of the proposed measures achieves 0.92 and 0.91, improving on the state of the art by 7% for the PCC and 6% for the SRCC.

Future work already in progress includes: (i) evaluating the proposed measures in paraphrase detection; (ii) proposing different combinations of the proposed measures; and (iii) applying the sentence similarity measures to improve text summarization features.

VII. ACKNOWLEDGEMENTS

The research results reported in this paper have been partly funded by an R&D project between Hewlett-Packard do Brasil and UFPE originated from tax exemption (IPI, Law no. 8.248 of 1991 and later updates).

REFERENCES
[1] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. de Franca Silva, F. Freitas, G. D. C. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, "Assessing sentence scoring techniques for extractive text summarization," Expert Systems with Applications, vol. 40, no. 14, pp. 5755-5764, 2013.
[2] L.-C. Yu, C.-H. Wu, and F.-L. Jang, "Psychiatric document retrieval using a discourse-aware model," Artificial Intelligence, vol. 173, no. 7-8, pp. 817-829, May 2009.


[3] T. A. S. Coelho, P. Calado, L. V. Souza, B. A. Ribeiro-Neto, and R. R. Muntz, "Image retrieval using multiple evidence ranking," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 408-417, 2004.
[4] T. Liu and J. Guo, "Text similarity computing based on standard deviation," in Proceedings of the 2005 International Conference on Advances in Intelligent Computing - Volume Part I, ser. ICIC'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 456-464.


[5] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ser. ACL '02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 311-318.
[6] Y. Li, D. McLean, Z. Bandar, J. O'Shea, and K. A. Crockett, "Sentence similarity based on semantic nets and corpus statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138-1150, 2006.


[7] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 10:1-10:25, Jul. 2008.
[8] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," in Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, ser. AAAI'06. AAAI Press, 2006, pp. 775-780.
[9] J. Oliva, J. I. Serrano, M. D. del Castillo, and A. Iglesias, "SyMSS: A syntax-based measure for short-text semantic similarity," Data & Knowledge Engineering, vol. 70, no. 4, pp. 390-405, Apr. 2011. [Online]. Available: http://dx.doi.org/10.1016/j.datak.2011.01.002
[10] B. Choudhary and P. Bhattacharyya, "Text clustering using semantics," in Proceedings of the World Wide Web Conference 2002, ser. WWW '02, 2002.
[11] F. Zhou, F. Zhang, and B. Yang, "Graph-based text representation model and its realization," in Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, 2010, pp. 1-8.
[12] D. Das, N. Schneider, D. Chen, and N. A. Smith, "Probabilistic frame-semantic parsing," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 948-956. [Online]. Available: http://dl.acm.org/citation.cfm?id=1857999.1858136
[13] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. MIT Press, 1998.
[14] Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 871-882, Jul. 2003. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2003.1209005
[15] W3C, "Resource Description Framework," http://www.w3.org/RDF/, 2004, last accessed March 2014.
[16] A. Budanitsky and G. Hirst, "Evaluating WordNet-based measures of lexical semantic relatedness," Computational Linguistics, vol. 32, no. 1, pp. 13-47, Mar. 2006. [Online]. Available: http://dx.doi.org/10.1162/coli.2006.32.1.13
[17] F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau-Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press, 2009.

