Local inference collected over parse trees We use tree models to help collect local inference information over linguistic phrases and clauses in this layer. The tree structures of the premise and hypothesis are produced by a constituency parser. Once the hidden states of a tree are all computed with Equation (3), we treat all tree nodes equally, as we do not have further heuristics to discriminate them, but leave the attention weights to figure out their relationship. So, we use Equation (11) to compute the attention weights for all node pairs between a premise and hypothesis. This connects all words, constituent phrases, and clauses between the premise and hypothesis. We then collect the information between all the pairs with Equations (12) and (13) and feed it into the next layer.

Enhancement of local inference information In our models, we further enhance the local inference information collected. We compute the difference and the element-wise product for the tuple <ā, ã> as well as for <b̄, b̃>. We expect that such operations could help sharpen the local inference information between elements in the tuples and capture inference relationships such as contradiction. The difference and element-wise product are then concatenated with the original vectors, ā and ã, or b̄ and b̃, respectively (Mou et al., 2016; Zhang et al., 2017). The enhancement is performed for both the sequential and the tree models.

m_a = [ā; ã; ā − ã; ā ⊙ ã],   (14)
m_b = [b̄; b̃; b̄ − b̃; b̄ ⊙ b̃].   (15)

This process could be regarded as a special case of modeling some high-order interaction between the tuple elements. Along this direction, we have also modeled the interaction further by feeding the tuples into feedforward neural networks and adding the top-layer hidden states to the above concatenation; we found that this does not further improve the inference accuracy on the held-out set.
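The enhancement in Equations (14) and (15) is straightforward to express directly. Below is a minimal NumPy sketch, assuming a_bar holds the encoded premise vectors and a_tilde their attention-weighted counterparts collected above; the names and shapes are illustrative and not taken from the released implementation.

```python
import numpy as np

def enhance(x_bar, x_tilde):
    # Concatenate the original vectors with their difference and
    # element-wise product, as in Equations (14) and (15).
    return np.concatenate(
        [x_bar, x_tilde, x_bar - x_tilde, x_bar * x_tilde], axis=-1)

# Illustrative shapes: 5 words (or tree nodes), 300-dimensional hidden states.
a_bar = np.random.randn(5, 300)
a_tilde = np.random.randn(5, 300)
m_a = enhance(a_bar, a_tilde)  # shape (5, 1200)
```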
The composition layer In our sequential inference model, we keep using BiLSTM to compose local inference information sequentially. The formulas for BiLSTM are similar in form to those in Equations (1) and (2), so we skip the details, but the aim is very different here: they are used to capture the local inference information m_a and m_b and their context for inference composition.

In the tree composition, the high-level formula for how a tree node is updated to compose local inference is as follows:

v_{a,t} = TrLSTM(F(m_{a,t}), h^L_{t−1}, h^R_{t−1}),   (16)
v_{b,t} = TrLSTM(F(m_{b,t}), h^L_{t−1}, h^R_{t−1}).   (17)

We propose to control model complexity in this layer, since the concatenation described above to compute m_a and m_b can significantly increase the overall parameter size and potentially overfit the models. We propose to use a mapping F, as in Equations (16) and (17). More specifically, we use a 1-layer feedforward neural network with the ReLU activation. This function is also applied to the BiLSTM in our sequential inference composition.
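As a sketch of this complexity control, F can be written as a single feedforward layer with ReLU that projects the widened enhanced vector back to the hidden size before it enters the composition BiLSTM or tree-LSTM. The weights, sizes, and names below are illustrative placeholders, not the trained parameters.

```python
import numpy as np

hidden = 300
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.01, size=(4 * hidden, hidden))  # projection weights
b_f = np.zeros(hidden)

def F(m):
    # 1-layer feedforward network with ReLU activation (Equations 16-17).
    return np.maximum(m @ W_f + b_f, 0.0)

m_a = rng.normal(size=(5, 4 * hidden))  # enhanced local inference vectors
x_a = F(m_a)                            # shape (5, 300), input to the composition layer
```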
Pooling Our inference model converts the resulting vectors obtained above to a fixed-length vector with pooling and feeds it to the final classifier to determine the overall inference relationship. We consider that summation (Parikh et al., 2016) could be sensitive to the sequence length and hence less robust. We instead suggest the following strategy: compute both average and max pooling, and concatenate all these vectors to form the final fixed-length vector v. Our experiments show that this leads to significantly better results than summation. The final fixed-length vector v is calculated as follows:

v_{a,ave} = Σ_{i=1}^{ℓ_a} v_{a,i} / ℓ_a ,   v_{a,max} = max_{i=1}^{ℓ_a} v_{a,i} ,   (18)
v_{b,ave} = Σ_{j=1}^{ℓ_b} v_{b,j} / ℓ_b ,   v_{b,max} = max_{j=1}^{ℓ_b} v_{b,j} ,   (19)
v = [v_{a,ave}; v_{a,max}; v_{b,ave}; v_{b,max}].   (20)
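Equations (18)-(20) amount to concatenating average and max pooling over the composed vectors of the two sentences. A small NumPy sketch with illustrative shapes:

```python
import numpy as np

def avg_max_pool(v):
    # Average and max pooling over the time (or node) dimension, Eqs. (18)-(19).
    return np.concatenate([v.mean(axis=0), v.max(axis=0)])

v_a = np.random.randn(7, 300)  # composed premise vectors
v_b = np.random.randn(5, 300)  # composed hypothesis vectors
v = np.concatenate([avg_max_pool(v_a), avg_max_pool(v_b)])  # Eq. (20), length 1200
```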
Note that for tree composition, Equation (20) is slightly different from that in sequential composition. Our tree composition also concatenates the hidden states computed for the roots with Equations (16) and (17), which are not shown here.

We then put v into a final multilayer perceptron (MLP) classifier. The MLP has a hidden layer with tanh activation and a softmax output layer in our experiments. The entire model (all three components described above) is trained end-to-end. For training, we use the multi-class cross-entropy loss.
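A minimal sketch of such a classifier, with one tanh hidden layer, a softmax over the three relation labels, and the cross-entropy loss for a single example; parameter names and sizes are illustrative only.

```python
import numpy as np

def mlp_classifier(v, W1, b1, W2, b2):
    # Hidden layer with tanh activation, then a softmax output over 3 classes.
    h = np.tanh(v @ W1 + b1)
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def cross_entropy(probs, gold_label):
    # Multi-class cross-entropy loss for one premise-hypothesis pair.
    return -np.log(probs[gold_label])

rng = np.random.default_rng(0)
v = rng.normal(size=1200)
W1, b1 = rng.normal(size=(1200, 300)) * 0.01, np.zeros(300)
W2, b2 = rng.normal(size=(300, 3)) * 0.01, np.zeros(3)
probs = mlp_classifier(v, W1, b1, W2, b2)
loss = cross_entropy(probs, gold_label=1)
```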
Overall inference models Our model can be based only on the sequential networks by removing all tree components, and we call this the Enhanced Sequential Inference Model (ESIM) (see the left part of Figure 1). We will show that ESIM outperforms all previous results. We will also encode parse information with tree-LSTMs in multiple layers as described (see the right side of Figure 1). We train this model and incorporate it into ESIM by averaging the predicted probabilities to get the final label for a premise-hypothesis pair. We will show that parsing information complements ESIM very well and further improves the performance, and we call the final model the Hybrid Inference Model (HIM).
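The combination itself is a simple average of the two models' predicted class distributions; a one-line sketch, assuming each model outputs probabilities over {entailment, contradiction, neutral} (the values below are illustrative, not real predictions):

```python
import numpy as np

p_esim = np.array([0.80, 0.05, 0.15])  # ESIM probabilities (illustrative values)
p_tree = np.array([0.70, 0.10, 0.20])  # tree-LSTM model probabilities (illustrative)

p_him = (p_esim + p_tree) / 2.0         # averaged prediction
final_label = int(np.argmax(p_him))     # label for the premise-hypothesis pair
```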
4 Experimental Setup

Data The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) focuses on three basic relationships between a premise and a potential hypothesis: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they are not related (neutral). The original SNLI corpus also contains an "other" category, which includes the sentence pairs lacking consensus among multiple human annotators. As in the related work, we remove this category. We used the same split as in Bowman et al. (2015) and other previous work.

The parse trees used in this paper are produced by the Stanford PCFG Parser 3.5.3 (Klein and Manning, 2003) and are delivered as part of the SNLI corpus. We use classification accuracy as the evaluation metric, as in related work.

Training We use the development set to select models for testing. To help replicate our results, we publish our code at https://github.com/lukecq1231/nli. Below, we list our training details. We use the Adam method (Kingma and Ba, 2014) for optimization. The first momentum is set to 0.9 and the second to 0.999. The initial learning rate is 0.0004 and the batch size is 32. All hidden states of LSTMs, tree-LSTMs, and word embeddings have 300 dimensions.

We use dropout with a rate of 0.5, which is applied to all feedforward connections. We use pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize our word embeddings. Out-of-vocabulary (OOV) words are initialized randomly with Gaussian samples. All vectors, including word embeddings, are updated during training.
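For reference, the reported training details can be collected in one place; the snippet below simply restates these hyperparameters and is not a configuration file from the released code.

```python
train_config = {
    "optimizer": "Adam",      # Kingma and Ba (2014)
    "beta1": 0.9,             # first momentum
    "beta2": 0.999,           # second momentum
    "learning_rate": 4e-4,
    "batch_size": 32,
    "hidden_size": 300,       # LSTMs, tree-LSTMs, and word embeddings
    "dropout": 0.5,           # on all feedforward connections
    "embeddings": "GloVe 840B 300-D (Pennington et al., 2014)",
    "oov_init": "random Gaussian samples",
}
```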
5 Results

Overall performance Table 1 shows the results of different models. The first row is a baseline classifier presented by Bowman et al. (2015) that considers handcrafted features such as the BLEU score of the hypothesis with respect to the premise, the overlapped words, and the length difference between them, etc.

The next group of models (2)-(7) are based on sentence encoding. The model of Bowman et al. (2016) encodes the premise and hypothesis with two different LSTMs. The model in Vendrov et al. (2015) uses unsupervised "skip-thoughts" pre-training in GRU encoders. The approach proposed by Mou et al. (2016) considers tree-based CNN to capture sentence-level semantics, while the model of Bowman et al. (2016) introduces a stack-augmented parser-interpreter neural network (SPINN) which combines parsing and interpretation within a single tree-sequence hybrid model. The work by Liu et al. (2016) uses BiLSTM to generate sentence representations and then replaces average pooling with intra-attention. The approach proposed by Munkhdalai and Yu (2016a) presents a memory-augmented neural network, neural semantic encoders (NSE), to encode sentences.
Model #Para. Train Test
(1) Handcrafted features (Bowman et al., 2015) - 99.7 78.2
(2) 300D LSTM encoders (Bowman et al., 2016) 3.0M 83.9 80.6
(3) 1024D pretrained GRU encoders (Vendrov et al., 2015) 15M 98.8 81.4
(4) 300D tree-based CNN encoders (Mou et al., 2016) 3.5M 83.3 82.1
(5) 300D SPINN-PI encoders (Bowman et al., 2016) 3.7M 89.2 83.2
(6) 600D BiLSTM intra-attention encoders (Liu et al., 2016) 2.8M 84.5 84.2
(7) 300D NSE encoders (Munkhdalai and Yu, 2016a) 3.0M 86.2 84.6
(8) 100D LSTM with attention (Rocktäschel et al., 2015) 250K 85.3 83.5
(9) 300D mLSTM (Wang and Jiang, 2016) 1.9M 92.0 86.1
(10) 450D LSTMN with deep attention fusion (Cheng et al., 2016) 3.4M 88.5 86.3
(11) 200D decomposable attention model (Parikh et al., 2016) 380K 89.5 86.3
(12) Intra-sentence attention + (11) (Parikh et al., 2016) 580K 90.5 86.8
(13) 300D NTI-SLSTM-LSTM (Munkhdalai and Yu, 2016b) 3.2M 88.5 87.3
(14) 300D re-read LSTM (Sha et al., 2016) 2.0M 90.7 87.5
(15) 300D btree-LSTM encoders (Paria et al., 2016) 2.0M 88.6 87.6
(16) 600D ESIM 4.3M 92.6 88.0
(17) HIM (600D ESIM + 300D Syntactic tree-LSTM) 7.7M 93.5 88.6
Table 1: Accuracies of the models on SNLI. Our final model achieves an accuracy of 88.6%, the best result observed on SNLI, while our enhanced sequential encoding model attains an accuracy of 88.0%, which also outperforms all previous models.
The next group of methods in the table, models (8)-(15), are inter-sentence attention-based models. The model of Rocktäschel et al. (2015) is an LSTM enforcing the so-called word-by-word attention. The model of Wang and Jiang (2016) extends this idea to explicitly enforce word-by-word matching between the hypothesis and the premise. Long short-term memory-networks (LSTMN) with deep attention fusion (Cheng et al., 2016) link the current word to previous words stored in memory. Parikh et al. (2016) proposed a decomposable attention model without relying on any word-order information. In general, adding intra-sentence attention yields further improvement, which is not very surprising, as it could help align the relevant text spans between premise and hypothesis. The model of Munkhdalai and Yu (2016b) extends the framework of Wang and Jiang (2016) to a full n-ary tree model and achieves further improvement. Sha et al. (2016) propose a special LSTM variant which considers the attention vector of another sentence as an inner state of the LSTM. Paria et al. (2016) use a neural architecture with complete binary tree-LSTM encoders without syntactic information.

The table shows that our ESIM model achieves an accuracy of 88.0%, which has already outperformed all the previous models, including those using much more complicated network architectures (Munkhdalai and Yu, 2016b).

We ensemble our ESIM model with syntactic tree-LSTMs (Zhu et al., 2015) based on syntactic parse trees and achieve significant improvement over our best sequential encoding model ESIM, attaining an accuracy of 88.6%. This shows that syntactic tree-LSTMs complement ESIM well.

Model                              Train  Test
(17) HIM (ESIM + syn.tree)          93.5  88.6
(18) ESIM + tree                    91.9  88.2
(16) ESIM                           92.6  88.0
(19) ESIM - ave./max                92.9  87.1
(20) ESIM - diff./prod.             91.5  87.0
(21) ESIM - inference BiLSTM        91.3  87.3
(22) ESIM - encoding BiLSTM         88.7  86.3
(23) ESIM - P-based attention       91.6  87.2
(24) ESIM - H-based attention       91.4  86.5
(25) syn.tree                       92.9  87.8

Table 2: Ablation performance of the models.

Ablation analysis We further analyze the major components that are important for achieving good performance. From the best model, we first replace the syntactic tree-LSTM with a full tree-LSTM that does not encode syntactic parse information. More specifically, two adjacent words in a sentence are merged to form a parent node, and this process continues until a binary tree is built over the whole sentence.
[Figure 3 about here; premise: "A man wearing a white shirt and a blue jeans reading a newspaper while standing"; hypothesis: "A man is sitting down reading a newspaper." Panel labels recovered from the figure: (c) normalized attention weights of tree-LSTM, (d) input gate of tree-LSTM in inference composition (l2-norm), (f) normalized attention weights of BiLSTM.]
Figure 3: An example for analysis. Subfigures (a) and (b) are the constituency parse trees of the premise and hypothesis, respectively. "-" means a non-leaf or a null node. Subfigures (c) and (f) are attention visualizations of the tree model and ESIM, respectively. The darker the color, the greater the value. The premise is on the x-axis and the hypothesis is on the y-axis. Subfigures (d) and (e) are the input gates' l2-norm of the tree-LSTM and BiLSTM in inference composition, respectively.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors, Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, pages 10–19. https://doi.org/10.18653/v1/S15-1002.

Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR abs/1605.09090. http://arxiv.org/abs/1605.09090.

Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, COLING '08, pages 521–528. http://dl.acm.org/citation.cfm?id=1599081.1599147.

Yashar Mehdad, Alessandro Moschitti, and Massimo Fabio Zanzotto. 2010. Syntactic/semantic structures for textual entailment recognition. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1020–1028. http://aclweb.org/anthology/N10-1146.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 130–136. https://doi.org/10.18653/v1/P16-2022.

Tsendsuren Munkhdalai and Hong Yu. 2016a. Neural semantic encoders. CoRR abs/1607.04315. http://arxiv.org/abs/1607.04315.

Tsendsuren Munkhdalai and Hong Yu. 2016b. Neural tree indexers for text understanding. CoRR abs/1607.04492. http://arxiv.org/abs/1607.04492.

Biswajit Paria, K. M. Annervaz, Ambedkar Dukkipati, Ankush Chatterjee, and Sanjay Podder. 2016. A neural architecture mimicking humans end-to-end for natural language inference. CoRR abs/1611.04741. http://arxiv.org/abs/1611.04741.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2249–2255. http://aclweb.org/anthology/D16-1244.

Barbara Partee. 1995. Lexical semantics and compositionality. Invitation to Cognitive Science 1:311–360.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1532–1543. https://doi.org/10.3115/v1/D14-1162.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. CoRR abs/1509.06664. http://arxiv.org/abs/1509.06664.

Alexander Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 379–389. https://doi.org/10.18653/v1/D15-1044.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, pages 2870–2879. http://aclweb.org/anthology/C16-1270.

Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. Omnipress, pages 129–136.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, D. Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1631–1642. http://aclweb.org/anthology/D13-1170.

Sheng Kai Tai, Richard Socher, and D. Christopher Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1556–1566. https://doi.org/10.3115/v1/P15-1150.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR abs/1511.06361. http://arxiv.org/abs/1511.06361.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 1442–1451. https://doi.org/10.18653/v1/N16-1170.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. pages 2048–2057. http://jmlr.org/proceedings/papers/v37/xuc15.html.

Junbei Zhang, Xiaodan Zhu, Qian Chen, Lirong Dai, Si Wei, and Hui Jiang. 2017. Exploring question understanding and adaptation in neural-network-based question answering. CoRR abs/1703.04617v2. https://arxiv.org/abs/1703.04617.