I. INTRODUCTION
Sentence similarity, or short-text similarity, measures
the degree of similarity between phrases. It is a challenging
problem because similarity methods must also handle
sentences with partial information, such as a sentence
split into two or more short texts, or a phrase that
contains two or more sentences. Applications such as text
summarization [1], information retrieval [2], image
retrieval [3], text categorization [4], and machine
translation [5] rely on, or may benefit from, a sentence
similarity method.
Many works in the literature address this problem
[6], [7], [8], [9] by representing sentences as bag-of-words
vectors or through the syntactic information among words.
After building the sentence representation, such methods
compute different measures to evaluate the degree of
similarity between words, and the overall sentence similarity
is a compound of those measures. Nevertheless, those
approaches do not address the following problems: (i) the
meaning problem [10]: sentences with the same meaning
but different words. For example, the sentences John
is an intelligent boy and John is a brilliant lad mean
the same thing, provided the contexts surrounding them do
not differ too much; (ii) word order [11]: the order of
the words in a text influences its meaning. For
instance, the sentences A loves B and B loves A use the
same words, but the difference in word order changes the
meaning.
978-1-4799-4143-8/14 $31.00 © 2014 IEEE
DOI 10.1109/WI-IAT.2014.23
A. Shallow Layer
In the shallow layer, each sentence is represented by
a sequence of groups of tokens. The input is a sentence
and the output is a text file containing a list of the
sentence tokens. The steps to represent a sentence in this
layer are as follows:
1) Lexical analysis: This step splits the sentence into
a list of tokens. Every token, including punctuation,
is part of the list.
2) Stop-words removal: This step rules out words with
little representative value for the document, e.g.,
articles and pronouns, as well as the punctuation.
3) NER: This step applies named-entity recognition
(NER), which groups compound tokens. In other words,
it identifies related tokens that appear in sequence
in the list, for instance a compound proper name like
United States of America or a generic term like
motor vehicle.
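The three steps above can be sketched as a small pipeline. This is an illustrative sketch only: the stop-word list and the compound dictionary below are hypothetical stand-ins for the actual resources used in the paper.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}               # assumed subset
COMPOUNDS = {("insurance", "companies"): "insurance_companies"}  # assumed lexicon

def lexical_analysis(sentence):
    # Split the sentence into tokens, keeping punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def remove_stop_words(tokens):
    # Drop stop words and punctuation-only tokens.
    return [t for t in tokens if t not in STOP_WORDS and t.isalnum()]

def ner_grouping(tokens):
    # Join adjacent tokens that form a known compound term.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COMPOUNDS:
            out.append(COMPOUNDS[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def shallow_layer(sentence):
    # Chain the three steps: lexical analysis -> stop-words removal -> NER.
    return ner_grouping(remove_stop_words(lexical_analysis(sentence)))
```

For the running example, `shallow_layer("The president criticized insurance companies and banks.")` yields `["president", "criticized", "insurance_companies", "banks"]`, matching the output shown in Figure 1.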
Figure 1 shows the entire process used to build this
layer for the sentence The president criticized insurance
companies and banks, displaying the output of each step.
The output of this layer is a text file containing the
NER list.
This layer is important for improving the performance of
simple text-processing tasks. The novelty in the layer is
the process of grouping compound tokens. Although this
layer does not convey much information about the sentence,
it is widely employed in traditional text mining tasks
such as information retrieval and summarization.
B. Syntactic Layer
In this layer, the sequence of groups of tokens generated
in the shallow layer is converted into a graph represented
using RDF triples [15]. The steps to convert a sequence of
groups of tokens into an RDF graph are as follows:
1) Syntactic analysis: The first step is to perform a
syntactic analysis of the sentence, extracting relations
such as subject and direct object.
Figure 1. Shallow layer for the sentence The president criticized insurance companies and banks. (The figure shows the input tokens; the lexical-analysis output; the stop-words-removal output; and the NER output: president, criticized, insurance_companies, banks.)
[Figure residue: syntactic-layer graph with nodes President, Criticized, Insurance companies, and Banks, linked by Nominal Subject, Direct Object, and NER edges.]
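As a minimal sketch of this layer, the dependency relations produced by a parser (hand-written below for the running example; the paper uses a real syntactic analyzer) can be turned into (subject, predicate, object) RDF-style triples. The relation names follow the labels in the figure and are assumptions.

```python
RELATIONS_OF_INTEREST = {"nominal_subject", "direct_object"}  # assumed subset

def to_rdf_triples(dependencies):
    # Keep only the syntactic relations of interest, emitting one triple
    # (head, relation, dependent) per kept dependency.
    return [(h, r, d) for (h, r, d) in dependencies if r in RELATIONS_OF_INTEREST]

# Hand-built parse of "The president criticized insurance companies and banks".
PARSE = [
    ("criticized", "nominal_subject", "president"),
    ("criticized", "direct_object", "insurance_companies"),
    ("criticized", "direct_object", "banks"),
    ("banks", "conjunction", "insurance_companies"),
]
```

Here `to_rdf_triples(PARSE)` keeps the three subject/object triples and drops the conjunction, yielding the graph of the example.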
C. Frame Layer
In this layer, the RDF graph is augmented with entity
roles and sense identification. As in the syntactic layer,
the input is the sequence of groups of tokens extracted in
the shallow layer. This layer applies SRA to define the
entities' roles and to identify the sense of each entity
in the sentence.
The frame layer uses SRA to perform two different
operations:
1) Sense identification: Sense identification is
paramount to this type of representation, since
different words can denote the same meaning,
particularly for verbs. For instance, accuse,
blame, criticize, and deprecate are words that
can all be associated with the sense judgment.
2) Role annotation: Unlike the syntactic layer, role
annotation identifies the semantic function of each
entity. For instance, in the sentence of the previous
example, the president is the communicator of the
action criticized. Thus, the interpretation of the
action is identified, not only its syntactic relation.
This layer deals with the meaning problem through the
output of sense identification: the general meaning of the
main entities of the sentence, not just the written words,
is identified in this step. The role annotation, in turn,
extracts discourse information, as it reveals the order of
actions, the actors, etc., dealing with the word-order
problem. Moreover, such information is relevant for tasks
such as extraction and summarization.
Figure 3 presents a frame-layer example. Two different
relation types are identified in the figure: sense
relations, e.g., the triple banks-sense-business, and
role-annotation relations, e.g., criticized-evaluee-banks.
In the frame layer, the representation is an RDF graph,
just as in the syntactic layer.
Figure 3. Frame layer for the phrase The president criticized insurance companies and banks. (The figure links criticized to the sense judgment and to the roles communicator (president) and evaluee (insurance companies, banks); insurance companies and banks are linked to the sense business.)
Figure 4. Example of token matching between the sentences A = {a1, a2, a3, a4, a5} and B = {b1, b2, b3, b4, b5}, with token similarities 0.2, 0.5, 0.3, 0.66, and 0.4.

PartialSim(A) = Σ_{i=1}^{n} TokenSim(a_i, B),   (1)

Sim(A, B) = (PartialSim(A) + PartialSim(B)) / 2.   (2)
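Equations (1) and (2) can be sketched directly. This assumes TokenSim(a, B) returns the best similarity between token a and any token of B; the paper's exact TokenSim depends on the chosen word-similarity metric, and the original may normalize the sum by the number of tokens, which is omitted here to match the printed formula.

```python
def partial_sim(A, B, word_sim):
    # Eq. (1): accumulate the best match of each token of A against B.
    return sum(max(word_sim(a, b) for b in B) for a in A)

def sim(A, B, word_sim):
    # Eq. (2): average the two directed partial similarities.
    return (partial_sim(A, B, word_sim) + partial_sim(B, A, word_sim)) / 2
```

Any word-level similarity function (e.g., one of the WordNet-based metrics used later) can be plugged in as `word_sim`.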
Figure 5. Example of Similarity Between Triples (Synt1), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges, and a1, a2, b1, and b2 are the tokens associated with the nodes of the graph. (First step: Sim values 0.3, 0.5, and 0.2 give TotalSimilarity = 0.33; second step: Sim values 0.4, 0.5, and 0.3 give TotalSimilarity = 0.4.)
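In Figures 5-7, TotalSimilarity appears to be the arithmetic mean of the individual token and edge similarities of one comparison; under that assumption, the printed values can be reproduced with a one-line sketch.

```python
def total_similarity(sims):
    # TotalSimilarity of one comparison: the mean of the node and edge
    # similarity scores involved in it (assumed from the figures' values).
    return sum(sims) / len(sims)
```

For instance, the first step of Figure 5 averages 0.3, 0.5, and 0.2 to 0.33, and the first step of Figure 6 averages 0.3 and 0.2 to 0.25.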
Figure 6. Example of Extended Similarity Between Triples (Synt2), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges, and a1, a2, b1, and b2 are the tokens associated with the nodes of the graph. (First step: Sim values 0.3 and 0.2 give TotalSimilarity = 0.25; second step: Sim values 0.4 and 0.3 give TotalSimilarity = 0.35.)
V. EXPERIMENTAL RESULTS
Figure 7. Example of Similarity Between Pairs (vertex, edge), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges, and a1 and b1 are the tokens associated with the nodes of the graph. (First comparison: Sim values 0.3 and 0.5 give TotalSimilarity = 0.4; second comparison: Sim values 0.5 and 0.2 give TotalSimilarity = 0.35.)
Table I
Pearson's and Spearman's coefficients of the sentence similarities given by the proposed measures.

Metric          PCC    SRCC
Path-Shallow    0.73   0.74
Res-Shallow     0.75   0.73
Lin-Shallow     0.76   0.73
WP-Shallow      0.64   0.53
LC-Shallow      0.69   0.65
Lev-Shallow     0.72   0.70
Path-Synt1      0.70   0.59
Res-Synt1       0.74   0.65
Lin-Synt1       0.74   0.65
WP-Synt1        0.60   0.45
LC-Synt1        0.65   0.53
Lev-Synt1       0.68   0.57
Path-Synt2      0.73   0.66
Res-Synt2       0.77   0.69
Lin-Synt2       0.76   0.70
WP-Synt2        0.56   0.44
LC-Synt2        0.65   0.54
Lev-Synt2       0.74   0.67
Path-Frame      0.79   0.82
Res-Frame       0.82   0.83
Lin-Frame       0.83   0.85
WP-Frame        0.76   0.72
LC-Frame        0.70   0.77
Lev-Frame       0.81   0.84
CombSim = Synt2, if (Synt2 + Frame) / 2 < 0.5;
          Frame, otherwise.   (3)
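The combined measure, reconstructed from the garbled cases as CombSim = Synt2 when (Synt2 + Frame)/2 < 0.5 and Frame otherwise, is a one-line selection between the two layer scores:

```python
def comb_sim(synt2, frame):
    # Keep the Synt2 score when the average of the two scores is low,
    # and the Frame score otherwise (Eq. (3)).
    return synt2 if (synt2 + frame) / 2 < 0.5 else frame
```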
Regarding the measurement of sentence similarity, the
main difference from related works is the integration of
lexical, syntactic, and semantic analysis to further
improve the results. In fact, the similarity measures are
based on a three-layer sentence representation, which
encapsulates different levels of sentence information.
Furthermore, SRA is used to extract the semantics of the
text. Previous works that claim to use semantic
information do not actually evaluate the semantics of the
sentence; instead, they use WordNet to evaluate the
semantics of the words, which can give poor results.
A series of experiments is presented using the dataset
proposed by Li et al. [6] and the two traditional measures
Pearson's correlation coefficient (PCC) and Spearman's
rank correlation coefficient (SRCC). The best metric
achieves a PCC of 0.83 and an SRCC of 0.85, and the
combination of the proposed measures achieves 0.92 and
0.91, respectively, an improvement of 7% in PCC and 6% in
SRCC over the state of the art.
There is already some future work in progress, which
includes: (i) evaluating the proposed measures in
paraphrase detection; (ii) proposing different
combinations of the proposed measures; and (iii) applying
the sentence similarity measures to improve text
summarization features.
Metric    Sentence Pair         Metric
Synt2     glass:tumbler         Synt2
Synt2     grin:smile            Frame
Synt2     serf:slave            Shallow
Synt2     journey:voyage        Frame
Synt2     autograph:signature   Frame
Synt2     coast:shore           Frame
Synt2     forest:woodland       Frame
Synt2     implement:tool        Frame
Synt2     cock:rooster          Frame
Synt2     boy:lad               Synt2
Shallow   cushion:pillow        Frame
Synt2     cemetery:graveyard    Frame
Shallow   automobile:car        Synt2
Synt2     midday:noon           Frame
Frame     gem:jewel             Frame
VII. ACKNOWLEDGEMENTS
The research results reported in this paper were partly
funded by an R&D project between Hewlett-Packard do Brazil
and UFPE originated from tax exemption (IPI, Law no. 8.248
of 1991 and later updates).
REFERENCES
[1] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. de Franca Silva, F. Freitas, G. D. C. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro, "Assessing sentence scoring techniques for extractive text summarization," Expert Systems with Applications, vol. 40, no. 14, pp. 5755-5764, 2013.
[2] L.-C. Yu, C.-H. Wu, and F.-L. Jang, "Psychiatric document retrieval using a discourse-aware model," Artificial Intelligence, vol. 173, no. 7-8, pp. 817-829, May 2009.
Table III
Comparing the proposed measure against the best related work.

Metric                 PCC    SRCC
Proposed combination   0.92   0.91
Islam-Inkpen [7]       0.85   0.83
Lin-Frame              0.83   0.85
Li-McLean [6]          0.81   0.81
SyMSS [9]              0.79   0.85
[3] T. A. S. Coelho, P. Calado, L. V. Souza, B. A. Ribeiro-Neto, and R. R. Muntz, "Image retrieval using multiple evidence ranking," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 408-417, 2004.
[4] T. Liu and J. Guo, "Text similarity computing based on standard deviation," in Proceedings of the 2005 International Conference on Advances in Intelligent Computing - Volume Part I, ser. ICIC'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 456-464.
[15] Resource description framework, 2004, last access March.
[16] A. Budanitsky and G. Hirst, "Evaluating WordNet-based measures of lexical semantic relatedness," Computational Linguistics, vol. 32, no. 1, pp. 13-47, Mar. 2006. [Online]. Available: http://dx.doi.org/10.1162/coli.2006.32.1.13
[17] F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau-Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press, 2009.