Documente Academic
Documente Profesional
Documente Cultură
Figure 1: The sentences generated in our synthetic languages based on an original English sentence. All verbs
in the experiments reported in the paper carried subject and object agreement suffixes as in the polypersonal
agreement experiment; we omitted these suffixes from the word order variation examples in the table for ease of
reading.
We then train a model to predict the agreement tems, agreement patterns, and order of core ele-
features of the verb; in the present paper, we focus ments. For each parametric version of English, we
on predicting the plurality of the subject and the recorded the verb-argument relations within each
object (that is, whether they are singular or plural). sentence, and created a labeled dataset. We ex-
The subject plurality prediction problem for (1-b), posed our models to sentences from which one of
for example, can be formulated as follows: the verbs was omitted, and trained them to predict
the plurality of the arguments of the unseen verb.
(2) The man the apples hsingular/plural subject?i. The following paragraph describes the process of
We illustrate the potential of this approach in a collecting verb-argument relations; a detailed dis-
series of case studies. We first experiment with cussion of the parametric generation process for
polypersonal agreement, in which the verb agrees agreement marking, word order and case marking
with both the subject and the object (§3). We then is given in the corresponding sections. We have
manipulate the order of the subject, the object and made our synthetic language generation code pub-
the verb (§4), and experiment with overt morpho- licly available.1
logical case (§5). For a preview of our synthetic
Argument Collection We created a labeled
languages, see Figure 1.
agreement prediction dataset by first collecting
2 Setup verb-arguments relations from the parsed corpus.
We collected nouns, proper nouns, pronouns, ad-
Synthetic Language Generation We used an jectives, cardinal numbers and relative pronouns
expert-annotated corpus, to avoid potential con- connected to a verb (identified by its part-of-
founds between the typological parameters we speech tag) with an nsubj, nsubjpass or dobj de-
manipulated and possible parse errors in an au- pendency edge, and record the plurality of those
tomatically parsed corpus. As our starting point, arguments. Verbs that were the head of a clausal
we took the English Penn Treebank (Marcus et al., complement without a subject (xcomp dependen-
1993), converted to the Universal Dependencies cies) were excluded. We recorded the plurality of
scheme (Nivre et al. 2016) using the Stanford con- the dependents of the verb regardless of whether
verter (Schuster and Manning, 2016). We then the tense and person of the verb condition agree-
manipulated the tree representations of the sen- ment in English (that is, not only in third-person
tences in the corpus to generate parametrically
1
modified English corpora, varying in case sys- https://github.com/Shaul1321/rnn typology
3533
Prediction Subject Object Object words were represented as the sum of the word
task accuracy accuracy recall embedding and embeddings of the character n-
Subject 94.7 ± 0.3 - - grams that made up the word.2
Object - 88.9 ± 0.26 81.8 ± 1.4 The model (including the embedding layer)
Joint 95.7 ± 0.23 90.0 ± 0.1 85.4 ± 2.3 was trained end-to-end using the Adam optimizer
(Kingma and Ba, 2014). For each of the experi-
Table 1: Results of the polypersonal agreement exper- ments described in the paper, we trained four mod-
iments. “Joint” refers to multitask prediction of subject
els with different random initializations; we report
and object plurality.
averaged results alongside standard deviations.
Table 3: Subject and object plurality prediction for different word orders (recall for the subject is 100% and is not
indicated). The % attractors columns indicate the percentage of sentences containing verb-argument attractors.
The number are averaged over four runs and the error interval represents the standard deviation.
caused by the object. We conjectured that this tervening between the subject and the verb
is due to the fact that many verbs are intransi- (non-object attractor); e.g., The gap between
tive, that is, have a subject but not an object. The winners and losers will grow is intransi-
clauses in which those verbs appear provide am- tive, but the plural words winners and losers,
biguous evidence: they are equally compatible which are a part of a modifier of the subject,
with a generalization in which the subject is the may serve as attractors for the singular sub-
most recent core element before the verb, and with ject gap.
a generalization in which the subject is the first
core constituent of the clause. Attraction effects The results are shown in Table 4. Withholding di-
suggest that the inductive bias of the RNN leads rect objects during training dramatically degraded
it to adopt the incorrect recency-based generaliza- the performance of the model on sentences with
tion. To test this hypothesis in a controlled way, an object attractor: the accuracy decreased from
we adopt the “poverty of the stimulus” paradigm 90.6% for the model trained on the full SOV cor-
(Wilson, 2006; Culbertson and Adger, 2014; Mc- pus (Table 3) to 60.0% for the model trained only
Coy et al., 2018): we withhold all evidence that on intransitive sentences from the same corpus.
disambiguates these two hypotheses (namely, all There was an analogous drop in performance in
transitive sentences), and test how the RNN gen- the case of VOS (89.5% compared to 48.3%). By
eralizes to the withheld sentence type. contrast, attractors that were not core arguments,
We used the SOV and VOS corpora described or objects that were not attractors, did not hurt
before; in both of these languages, the object in- performance in a comparable way. This suggests
tervened between the subject and the verb, poten- that in our poverty of the stimulus experiments
tially causing agreement attraction. Crucially, we RNNs were able to distinguish between core and
train only on sentences without a direct object, and non-core elements, but struggled on instances in
test on the following three types of sentences: which where the object directly preceded the verb
(the instances that were withheld in training). This
1. Sentences with an object of the opposite plu- constitutes strong evidence for the RNN’s recency
rality from the subject (object attractor). bias: our models extracted the generalization that
2. Sentences with an object of the same plural- subjects directly precede the verb, even though the
ity as the subject (non-attractor object).5 data were equally compatible with the generaliza-
tion that the subject is the first core argument in
3. Sentences without an object, but with one the clause.
or more nouns of the opposite plurality in- These findings align with the results of Khan-
5
When the object is a noun-noun compound, it is consid- delwal et al. (2018), who demonstrated that RNN
ered a non-attractor if its head is not of the opposite plurality language models are more sensitive to perturba-
of the subject, regardless of the plurality of other elements. tions in recent input words compared with pertur-
This can only make the task harder compared with the al-
ternative of considering compound objects such as “screen bations to more distant parts of the input. While
displays” as attractors for plural subjects. in their case the model’s recency preference can
3537
Object Object Non-object deed, to compensate for this issue, SOV languages
(attractor) (non attractor) attractor more frequently employ case marking (Matthew
SOV 60.3 ± 3.7 92.8 ± 0.3 79.2 ± 3 Dryer, quoted in Gibson et al. 2013).
VOS 48.3 ± 2.3 94.0 ± 2.3 83.1 ± 1.1 There was not a clear relationship between the
prevalence of a particular word order in the lan-
Table 4: Subject prediction accuracy in the “poverty guages of the world and the difficulty that our
of the stimulus” paradigm of Section 4.3, where transi- models experienced with that order. The model
tive sentences were withheld during training. Numbers performed best on the OVS word order, which
are averaged over four runs and the error interval rep- is present in a very small number of languages
resents the standard deviation. (∼1%). SOV languages were more difficult for
our RNNs to learn than SVO languages, even
be a learned property (since recent information is though SOV languages are somewhat more com-
more relevant for the task of next-word predic- mon (Dryer, 2013). These results weakly support
tion), our experiment focuses on the inherent in- functional explanations of these typological ten-
ductive biases of the model, as the cues that are dencies; such explanations appeal to communica-
necessary for differentiating between the two gen- tive efficiency considerations rather than learning
eralizations were absent in training. biases (Maurits et al., 2010). Of course, since the
inductive biases of humans and RNNs are likely
4.4 Discussion to be different in many respects, our results do
not rule out the possibility that the distribution of
Our reordering manipulation was limited to core
word orders is driven by a human learning bias af-
element (subjects, objects and verbs). Languages
ter all.
also differ in word order inside other types of
phrases, including noun phrases (e.g., does an 5 Overt Morphological Case Systems
adjective precede or follow the noun?), adposi-
tional phrases (does the language use prepositions The vast majority of noun phrases in English
or postpositions?), and so on. Greenberg (1963) are not overtly marked for grammatical function
pointed out correlations between head-modifier (case), with the exception of pronouns; e.g., the
orders across phrase categories; while a signifi- first-person singular pronoun is I when it is a sub-
cant number of exceptions exist, these correlations ject and me when it is an object. Other languages
have motivated proposals for a language-wide set- mark case on most nouns. Consider, for example,
ting of a Head Directionality Parameter (Stowell, the following example from Russian:6
1981; Baker, 2001). In future work, we would like (7) a. ya kupil knig-u.
to explore whether consistent reordering across I bought book-OBJECT
categories improves the model’s performance. ‘I bought the book.’
In practice, even languages with a relatively b. knig-a ischezla.
rigid word order almost never enforce this order book-SUBJECT disappeared
in every clause. The order of elements in English, ‘The book disappeared.’
for example, is predominately SVO, but construc-
tions in which the verb precedes the subject do ex- Overt case marking reduces ambiguity and facil-
ist, e.g., Outside were three police officers. Other itates parsing languages with flexible word or-
languages are considerably more flexible than En- der. To investigate the influence of case on agree-
glish (Dryer, 2013). Given that word order flexi- ment prediction—and on the ability to infer sen-
bility makes the task more difficult, our setting is tence structure—we experimented with different
arguably simpler than the task the model would case systems. In all settings, we used “fused” suf-
face when learning a natural language. fixes, which encode both plurality and grammat-
ical function. We considered three case systems
The fact that the agreement dependency be-
(see Figure 1):
tween the subject and the verb was more challeng-
ing to establish in the SOV order compared to the 1. An unambiguous case system, with a unique
SVO order is consistent with the hypothesis that 6
The standard grammatical term for these cases are nom-
SVO languages make it easier to distinguish the inative (for subject) and accusative (for object); we use SUB -
subject from the object (Gibson et al., 2013); in- JECT and OBJECT for clarity.
3538
Flexible word order VOS OVS
Case system
Subject A Object A/R Subject A Object A/R Subject A Object A/R
Unambiguous 99.2 ± 0.5 98.7 ± 0.2 98.9 ± 0.2 99.5 ± 0.1 99.5 ± 0.2 98.6 ± 0.3
/98.0 ± 0.5 /99.1 ± 0.1 /98.4 ± 0.6
Syncretic 99.3 ± 0.2 93.6 ± 0.4 99.1 ± 0.2 97.1 ± 0.2 99.4 ± 0.1 97.8 ± 0.2
/88.9 ± 1.7 /95.0 ± 1.1 /97.4 ± 1.2
Argument marking 96.0 ± 0.3 86.1 ± 0.9 96.9 ± 0.1 93.6 ± 0.1 99.6 ± 0.1 96.8 ± 0.1
/79.7 ± 4.9 /89.8 ± 2.4 /95.5 ± 0.5
Table 5: Accuracy (A) and recall (R) in predicting subject and object agreement with different case systems.
suffix for each combination of number and Results and Analysis The results are summa-
grammatical function. rized in Table 5. Unambiguous case marking
dramatically improved subject and object plural-
2. A partially syncretic (ambiguous) case sys- ity prediction compared with the previous experi-
tem, in which the same suffix was attached ments; accuracy was above 98% for all three word
to both singular subjects and plural objects orders. Partial syncretism hurt performance some-
(modeled after Basque). what relative to the unambiguous setting (except
with flexible word order), especially for object
3. A fully syncretic case system (argument
prediction. The fully syncretic case system, which
marking only): the suffix indicated only the
marked only the plurality of the head of each argu-
plurality of the argument, regardless of its
ment, further decreased performance. At the same
grammatical function (cf. subject/object syn-
time, even this limited marking scheme was help-
cretism in Russian neuter nouns).
ful: accuracy in the most challenging setting, flex-
In the typological survey reported in Baerman and ible word order (subject: 96.0%; object: 86.1%),
Brown (2013), 62% of the languages had no or was not very different from the results on unmod-
minimal case marking, 20% had syncretic case ified English (95.7% and 90.0%). This contrasts
systems, and 18% had case systems with no syn- with the poor results on the flexible setting with-
cretism. out cases (subject: 88.6%; object: 60.2%). On the
rigid orders, a fully syncretic system still signifi-
Corpus Creation The suffixes we used are cantly improved agreement prediction. The mod-
listed in Table 2. We only attached the suffix to erate effect of case syncretism on performance
the head of the relevant argument; adjectives and suggests that most of the benefits of case mark-
other modifiers did not carry case suffixes. The ing stems from the overt marking of the heads of
same suffix was used to mark plurality/case on all arguments.
noun and the agreement features on the verb; e.g., Overall, these results are consistent with the ob-
if the verb eat had a singular subject and plural servation that languages with explicit case mark-
object, it appeared as eatkarker (the singular sub- ing tend to allow a more flexible word orders com-
ject suffix was kar and the plural object suffix was pared with languages such as English that make
ker). We stripped off plurality and case mark- use of word order to express grammatical function
ers from the original English noun phrases before of words.
adding these suffixes.
6 Related Work
Setup We evaluated the interaction between dif-
ferent case marking schemes and three word or- Our approach of constructing synthetic languages
ders: flexible word order and the two orders on by parametrically modifying parsed corpora for
which the model achieved the best (OVS) and natural languages is closely inspired by Wang and
worst (VOS) subject prediction accuracy. We train Eisner (2016) (see also Wang and Eisner 2017).
one model for each combination of case system While they trained a model to mimic the POS tags
and word order. We jointly predicted the plurality order-statistics of the target language, we manu-
of subject and the object. ally modified the parsed corpora; this allows us to
3539
control for selected parameters, at the expense of even when the case system was highly syncretic.
reducing generality. Agreement feature prediction in some of our
Simpler synthetic languages (not based on nat- synthetic languages is likely to be difficult not
ural corpora) have been used in a number of recent only for RNNs but for many other classes of learn-
studies to examine the inductive biases of different ers, including humans. For example, agreement
neural architectures (Bowman et al., 2015; Lake in a language with very flexible word order and
and Baroni, 2018; McCoy et al., 2018). In an- without case marking is impossible to predict in
other recent study, Cotterell et al. (2018) measured many cases (see §4.2), and indeed such languages
the ability of RNN and n-gram models to perform are very rare. In future work, a human experiment
character-level language modeling in a sample of based on the agreement prediction task can help
languages, using a parallel corpus; the main ty- determine whether the difficulty of our languages
pological property of interest in that study was is consistent across humans and RNNs.
morphological complexity. Finally, a large num-
ber of studies, some mentioned in the introduc- Acknowledgements
tion, have used syntactic prediction tasks to exam- This work is supported by the Israeli Science
ine the generalizations acquired by neural models Foundation (grant number 1555/15) and by Theo
(see also Bernardy and Lappin 2017; Futrell et al. Hoffenberg, the founder & CEO of Reverso.
2018; Lau et al. 2017; Conneau et al. 2018; Et-
tinger et al. 2018; Jumelet and Hupkes 2018).
References
7 Conclusions Matthew Baerman and Dunstan Brown. 2013. Case
syncretism. In Matthew S. Dryer and Martin
We have proposed a methodology for generating Haspelmath, editors, The World Atlas of Language
parametric variations of existing languages and Structures Online. Max Planck Institute for Evolu-
evaluating the performance of RNNs in syntactic tionary Anthropology, Leipzig.
feature prediction in the resulting languages. We Mark C. Baker. 2001. The atoms of language: The
used this methodology to study the grammatical mind’s hidden rules of grammar. Basic Books, New
inductive biases of RNNs, assessed whether cer- York.
tain grammatical phenomena are more challeng- Jean-Philippe Bernardy and Shalom Lappin. 2017. Us-
ing for RNNs to learn than others, and began to ing deep neural networks to learn syntactic agree-
compare these patterns with the linguistic typol- ment. LiLT (Linguistic Issues in Language Technol-
ogy literature. ogy), 15.
In our experiments, multitask training on Samuel R. Bowman, Christopher D. Manning, and
polypersonal agreement prediction improved per- Christopher Potts. 2015. Tree-structured composi-
formance, suggesting that the models acquired tion in neural networks without tree-structured ar-
chitectures. In Proceedings of the NIPS Workshop
syntactic representations that generalize across on Cognitive Computation: Integrating Neural and
argument types (subjects and objects). Perfor- Symbolic Approaches.
mance varied significantly across word orders.
Noam Chomsky. 1981. Lectures on Government and
This variation was not correlated with the fre-
Binding. Foris, Dordrecht.
quency of the word orders in the languages of the
world. Instead, it was inversely correlated with the Shammur Absar Chowdhury and Roberto Zamparelli.
frequency of attractors, demonstrating a recency 2018. RNN simulations of grammaticality judg-
ments on long-distance dependencies. In Proceed-
bias. Further supporting this bias, in a poverty- ings of the 27th International Conference on Com-
of -the-stimulus paradigm, where the data were putational Linguistics, pages 133–144. Association
equally consistent with two generalizations—first, for Computational Linguistics.
the generalization that the subject is the first ar- Alexis Conneau, Germán Kruszewski, Guillaume
gument in the clause, and second, the generaliza- Lample, Loı̈c Barrault, and Marco Baroni. 2018.
tion that the subject is the most recent argument What you can cram into a single vector: Probing
preceding the verb—RNNs adopted the recency- sentence embeddings for linguistic properties. In
Proceedings of the 56th Annual Meeting of the As-
based generalization. Finally, we found that overt sociation for Computational Linguistics (Volume 1:
case marking on the heads of arguments dramat- Long Papers), pages 2126–2136. Association for
ically improved plurality prediction performance, Computational Linguistics.
3540
Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and green recurrent networks dream hierarchically. In
Brian Roark. 2018. Are all languages equally hard Proceedings of the 2018 Conference of the North
to language-model? In Proceedings of the 2018 American Chapter of the Association for Computa-
Conference of the North American Chapter of the tional Linguistics: Human Language Technologies,
Association for Computational Linguistics: Human NAACL-HLT 2018, New Orleans, Louisiana, USA,
Language Technologies, Volume 2 (Short Papers) , June 1-6, 2018, Volume 1 (Long Papers), pages
pages 536–541. Association for Computational Lin- 1195–1205.
guistics.
Jaap Jumelet and Dieuwke Hupkes. 2018. Do lan-
Jennifer Culbertson and David Adger. 2014. Language guage models understand anything? on the ability
learners privilege structured meaning over surface of lstms to understand negative polarity items. In
frequency. Proceedings of the National Academy of Proceedings of the 2018 EMNLP Workshop Black-
Sciences, page 201320525. boxNLP: Analyzing and Interpreting Neural Net-
works for NLP, pages 222–231. Association for
Myles Dillon and Donncha Ó Cróinin. 1961. Teach Computational Linguistics.
Yourself Irish. The English Universities Press Ltd.,
London. Urvashi Khandelwal, He He, Peng Qi, and Dan Juraf-
sky. 2018. Sharp nearby, fuzzy far away: How neu-
Matthew S. Dryer. 2013. Order of subject, object and ral language models use context. In Proceedings
verb. In Matthew S. Dryer and Martin Haspelmath, of the 56th Annual Meeting of the Association for
editors, The World Atlas of Language Structures On- Computational Linguistics, ACL 2018, Melbourne,
line. Max Planck Institute for Evolutionary Anthro- Australia, July 15-20, 2018, Volume 1: Long Pa-
pology, Leipzig. pers, pages 284–294.
Émile Enguehard, Yoav Goldberg, and Tal Linzen. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A
2017. Exploring the syntactic abilities of RNNs method for stochastic optimization. arXiv preprint
with multi-task learning. In Proceedings of the 21st 1412.6980.
Conference on Computational Natural Language
Learning (CoNLL 2017), pages 3–14. Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yo-
gatama, Stephen Clark, and Phil Blunsom. 2018.
Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and LSTMs can learn syntax-sensitive dependencies
Philip Resnik. 2018. Assessing composition in well, but modeling structure makes them better. In
sentence vector representations. In Proceedings Proceedings of the 56th Annual Meeting of the As-
of the 27th International Conference on Compu- sociation for Computational Linguistics (Volume 1:
tational Linguistics, pages 1790–1801. Association Long Papers), pages 1426–1436. Association for
for Computational Linguistics. Computational Linguistics.
Richard Futrell, Ethan Wilcox, Takashi Morita, and Brenden M. Lake and Marco Baroni. 2018. Gener-
Roger Levy. 2018. RNNs as psycholinguistic sub- alization without systematicity: On the composi-
jects: Syntactic state and grammatical dependency. tional skills of sequence-to-sequence recurrent net-
arXiv preprint arXiv:1809.01329. works. In Proceedings of the 35th International
Conference on Machine Learning, ICML 2018,
Edward Gibson, Steven T. Piantadosi, Kimberly Brink, Stockholmsmässan, Stockholm, Sweden, July 10-15,
Leon Bergen, Eunice Lim, and Rebecca Saxe. 2018, pages 2879–2888.
2013. A noisy-channel account of crosslinguis-
tic word-order variation. Psychological Science, Jey Han Lau, Alexander Clark, and Shalom Lappin.
24(7):1079–1088. 2017. Grammaticality, acceptability, and probabil-
ity: A probabilistic view of linguistic knowledge.
Mario Giulianelli, Jack Harding, Florian Mohnert, Cognitive Science, 41(5):1202–1247.
Dieuwke Hupkes, and Willem Zuidema. 2018. Un-
der the hood: Using diagnostic classifiers to in- Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg.
vestigate and improve how language models track 2016. Assessing the ability of LSTMs to learn
agreement information. In Proceedings of the 2018 syntax-sensitive dependencies. Transactions of the
EMNLP Workshop BlackboxNLP: Analyzing and Association for Computational Linguistics, 4:521–
Interpreting Neural Networks for NLP, pages 240– 535.
248. Association for Computational Linguistics.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Joseph H. Greenberg. 1963. Some universals of gram- Marcinkiewicz. 1993. Building a large annotated
mar with particular reference to the order of mean- corpus of English: The Penn Treebank. Computa-
ingful elements. In Joseph H. Greenberg, editor, tional Linguistics, 19(2):313–330.
Universals of language, pages 73–113. MIT Press,
Cambridge, MA. Rebecca Marvin and Tal Linzen. 2018. Targeted syn-
tactic evaluation of language models. In Proceed-
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, ings of the 2018 Conference on Empirical Methods
Tal Linzen, and Marco Baroni. 2018. Colorless in Natural Language Processing, pages 1192–1202.
3541
Luke Maurits, Danielle J. Navarro, and Amy Perfors. Ethan Wilcox, Roger Levy, Takashi Morita, and
2010. Why are some word orders more common Richard Futrell. 2018. What do RNN language
than others? A uniform information density ac- models learn about filler–gap dependencies? In
count. In Advances in Neural Information Process- Proceedings of the 2018 EMNLP Workshop Black-
ing Systems 23: 24th Annual Conference on Neural boxNLP: Analyzing and Interpreting Neural Net-
Information Processing Systems 2010. Proceedings works for NLP, pages 211–221. Association for
of a meeting held 6-9 December 2010, Vancouver, Computational Linguistics.
British Columbia, Canada., pages 1585–1593. Cur-
ran Associates, Inc. Colin Wilson. 2006. Learning phonology with sub-
stantive bias: An experimental and computational
R. Thomas McCoy, Robert Frank, and Tal Linzen. study of velar palatalization. Cognitive Science,
2018. Revisiting the poverty of the stimulus: Hi- 30:945–982.
erarchical generalization without a hierarchical bias
in recurrent neural networks. In Proceedings of the
40th Annual Conference of the Cognitive Science
Society, pages 2093—2098.