Less is Less in Language Acquisition

Douglas L. T. Rohde and David C. Plaut


Carnegie Mellon University and the Center for the Neural Basis of Cognition
February 2002
To appear in Quinlan, P. (Ed.) (in press) Connectionist modelling of cognitive development. Hove, UK: Psychology Press.

1 Introduction

A principal observation in the study of language acquisition is that people exposed to a language as children are more likely to achieve fluency in that language than those first exposed to it as adults, giving rise to the popular notion of a critical period for language learning (Lenneberg, 1967; Long, 1990). This is perhaps surprising since children have been found to be inferior to adults in most tests of other cognitive abilities.

A variety of explanations have been put forth to account for the benefit of early language learning. Possibly the most prevalent view is that children possess a specific "language acquisition device" that is programmatically deactivated prior to or during adolescence (Chomsky, 1965; McNeill, 1970). Important to this view is that knowledge or processes necessary for effective language learning are only available for a limited period of time. But this theory has trouble accounting for continued effects of age-of-acquisition after adolescence (Bialystok & Hakuta, 1999) and evidence that some adult second language learners are still able to reach fluency (see Birdsong, 1999).

An alternative account is provided by Newport's (1990) "less-is-more" hypothesis. Rather than attributing the early language advantage to a specific language learning device, this theory postulates that children's language acquisition may be aided rather than hindered by their limited cognitive resources. According to this view, the ability to learn a language declines over time as a result of an increase in cognitive abilities. The reasoning behind this suggestion is that a child's limited perception and memory may force the child to focus on smaller linguistic units which form the fundamental components of language, as opposed to memorizing larger units which are less amenable to recombination. While this is an attractive explanation, for such a theory to be plausible, the potential benefit of limited resources must be demonstrated both computationally and empirically.

The strongest evidence for Newport's theory comes from computational simulations and empirical findings of Elman (1991, 1993), Goldowsky and Newport (1993), Kareev, Lieberman, and Lev (1997), Cochran, McDonald, and Parault (1999), and Kersten and Earles (2001). In the current chapter, we consider these studies in detail and, in each case, find serious cause to doubt their intended support for the less-is-more hypothesis.

Elman (1991, 1993) found that simple recurrent connectionist networks could learn the structure of an English-like artificial grammar only when "starting small"—when either the training corpus or the network's memory was limited initially and only gradually made more sophisticated. We show, to the contrary, that language learning by recurrent networks does not depend on starting small; in fact, such restrictions hinder acquisition as the languages are made more realistic by introducing graded semantic constraints (Rohde & Plaut, 1999).

We discuss the simple learning task introduced by Goldowsky and Newport (1993) as a clear demonstration of the advantage of memory limitations. But we show that their filtering mechanism actually constitutes a severe impairment to learning in both a simple statistical model and a neural network model.

Kareev, Lieberman, and Lev (1997) argued that small sample sizes, possibly resulting from weak short-term memory, have the effect of enhancing correlations between two observable variables. But we demonstrate that the chance that a learner is able to detect a correlation actually improves with sample size and that a simple prediction model indeed performs better when it relies on larger samples.

Cochran, McDonald, and Parault (1999) taught participants ASL verbs with and without additional cognitive loads and found apparently better generalization performance for participants in the load condition. But we argue that the learning task actually provided no support for the expected generalization and that the no-load participants simply learned the more reasonable generalization much better.

Finally, we consider the Kersten and Earles (2001) findings to provide little support for the less-is-more
hypothesis because the task learned by participants in their experiment is unlike natural language learning in some important and relevant aspects and the critical manipulation in their experiment involved staged input, rather than cognitive limitations.

In the final section, we consider some general principles of learning language-like tasks in recurrent neural networks and what the implications for human learning might be. We then briefly discuss an alternative account for the language-learning superiority of children.

2 Elman (1991, 1993)

Elman (1990, 1991) set out to provide an explicit formulation of how a general connectionist system might learn the grammatical structure of a language. Rather than comprehension or overt parsing, Elman chose to train the networks to perform word prediction. Although word prediction is a far cry from language comprehension, it can be viewed as a useful component of language processing, given that the network can make accurate predictions only by learning the structure of the grammar. Elman trained a simple recurrent network—sometimes termed an "Elman" network—to predict the next word in sentences generated by an artificial grammar exhibiting number agreement, variable verb argument structure, and embedded clauses. He found that the network was unable to learn the prediction task—and, hence, the underlying grammar—when presented from the outset with sentences generated by the full grammar. The network was, however, able to learn if it was trained first on only simple sentences (i.e., those without embeddings) and only later exposed to an increasing proportion of complex sentences.

It thus seems reasonable to conclude that staged input enabled the network to focus early on simple and important features, such as the relationship between nouns and verbs. By "starting small," the network had a better foundation for learning the more difficult grammatical relationships which span potentially long and uninformative embeddings. Recognizing the parallel between this finding and the less-is-more hypothesis, Elman (1993) decided to investigate a more direct test of Newport's (1990) theory. Rather than staging the input presentation, Elman initially interfered with the network's memory span and then allowed it to gradually improve. Again, he found successful learning in this memory-limited condition, providing much stronger support for the hypothesis.

2.1 Rohde and Plaut (1999) Simulation 1: Progressive Input

Rohde and Plaut (1999) investigated how the need for starting small in learning a pseudo-natural language would be affected if the language incorporated more of the constraints of natural languages. A salient feature of the grammar used by Elman is that it is purely syntactic, in the sense that all words of a particular class, such as the singular nouns, were identical in usage. A consequence of this is that embedded material modifying a head noun provides relatively little information about the subsequent corresponding verb. Earlier work by Cleeremans, Servan-Schreiber, and McClelland (1989), however, had demonstrated that simple recurrent networks were better able to learn long-distance dependencies in finite-state grammars when intervening sequences were partially informative of (i.e., correlated with) the distant prediction. The intuition behind this finding is that the network's ability to represent and maintain information about an important word, such as the head noun, is reinforced by the advantage this information provides in predicting words within embedded phrases. As a result, the noun can more effectively aid in the prediction of the corresponding verb following the intervening material.

One source of such correlations in natural language is distributional biases, due to semantic factors, on which nouns typically co-occur with which verbs. For example, suppose dogs often chase cats. Over the course of training, the network has encountered chased more often after processing sentences beginning The dog who... than after sentences beginning with other noun phrases. The network can, therefore, reduce prediction error within the embedded clause by retaining specific information about dog (beyond it being a singular noun). As a result, information on dog becomes available to support further predictions in the sentence as it continues (e.g., The dog who chased the cat barked). These considerations led us to believe that languages similar to Elman's but involving weak semantic constraints might result in less of an advantage for starting small in child language acquisition. We began by examining the effects of an incremental training corpus, without manipulating the network's memory. The methods we used were very similar, but not identical, to those used by Elman (1991, 1993).

2.1.1 Grammar

Our pseudo-natural language was based on the grammar shown in Table 1, which generates simple noun-verb and noun-verb-noun sentences with the possibility of relative clause modification of most nouns. Relative clauses could be either subject-extracted or object-extracted. Although this language is quite simple, in comparison to natural language, it is nonetheless of interest because, in order to make accurate predictions, a network must learn to form representations of potentially complex syntactic structures and remember information, such as whether the subject was singular or plural, over lengthy embeddings.
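A grammar of this kind can be sampled directly as a stochastic context-free grammar. The following minimal sketch generates sentences of the general form shown in Table 1; the branch probabilities are illustrative assumptions (the actual transition probabilities, number agreement, and the semantic constraints are specified on top of the basic framework), and all function and variable names here are ours.

```python
import random

# Minimal sampler for a stochastic CFG of the kind shown in Table 1.
# The branch probabilities below are illustrative assumptions; the actual
# transition probabilities (and the agreement and semantic constraints)
# are applied on top of this framework.
RULES = {
    "S":  [(0.5, ["NP", "VI", "."]), (0.5, ["NP", "VT", "NP", "."])],
    "NP": [(0.8, ["N"]), (0.2, ["N", "RC"])],
    "RC": [(0.34, ["who", "VI"]), (0.33, ["who", "VT", "NP"]),
           (0.33, ["who", "NP", "VT"])],
}
WORDS = {
    "N":  "boy girl cat dog Mary John boys girls cats dogs".split(),
    "VI": "barks sings walks bites eats bark sing walk bite eat".split(),
    "VT": "chases feeds walks bites eats chase feed walk bite eat".split(),
}

def expand(symbol):
    """Expand one grammar symbol into a list of terminal words."""
    if symbol in WORDS:                       # word class: sample a word
        return [random.choice(WORDS[symbol])]
    if symbol in RULES:                       # nonterminal: sample a production
        r, acc = random.random(), 0.0
        for prob, rhs in RULES[symbol]:
            acc += prob
            if r < acc:
                return [w for s in rhs for w in expand(s)]
        rhs = RULES[symbol][-1][1]            # guard against float rounding
        return [w for s in rhs for w in expand(s)]
    return [symbol]                           # literal terminal: "who", "."

def sentence():
    return " ".join(expand("S"))
```

Agreement violations (e.g., boys walks) and the semantic restrictions on verb usage described in the text would then be enforced as filters on top of the raw samples.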

The grammar used by Elman was nearly identical, except that it had one fewer mixed transitivity verb in singular and plural form, and the two proper nouns, Mary and John, could not be modified.

In our simulation, several additional constraints were applied on top of the grammar in Table 1. Primary among these was that individual nouns could engage only in certain actions, and that transitive verbs could act only on certain objects (see Table 2). Another restriction in the language was that proper nouns could not act on themselves. Finally, constructions which repeat an intransitive verb, such as Boys who walk walk, were disallowed because of redundancy. These so-called semantic constraints always applied within the main clause of the sentence as well as within any subclauses. Although number agreement affected all nouns and verbs, the degree to which the semantic constraints applied between a noun and its modifying phrase was controlled by specifying the probability that the relevant constraints would be enforced for a given phrase. In this way, effects of the correlation between a noun and its modifying phrase, or of the level of information the phrase contained about the identity of the noun, could be investigated.

Table 1: The Grammar Used in Simulation 1

S  -> NP VI . | NP VT NP .
NP -> N | N RC
RC -> who VI | who VT NP | who NP VT
N  -> boy | girl | cat | dog | Mary | John | boys | girls | cats | dogs
VI -> barks | sings | walks | bites | eats | bark | sing | walk | bite | eat
VT -> chases | feeds | walks | bites | eats | chase | feed | walk | bite | eat

Note: Transition probabilities are specified and additional constraints are applied on top of this framework.

[Figure 1: network diagram with layers labeled 26 INPUT, 70 HIDDEN (copied to CONTEXT), and 26 OUTPUT, with intermediate layers of 10 units]
Figure 1: The architecture of the network used in the simulations. Each solid arrow represents full connectivity between layers, with numbers of units next to each layer. Hidden unit states are copied to corresponding context units (dashed arrow) after each word is processed.

Table 2: Semantic Constraints on Verb Usage

Verb    Intransitive Subjects    Transitive Subjects    Objects if Transitive
chase   –                        any                    any
feed    –                        human                  animal
bite    animal                   animal                 any
walk    any                      human                  only dog
eat     any                      animal                 human
bark    only dog                 –                      –
sing    human or cat             –                      –

Note: Columns indicate legal subject nouns when verbs are used intransitively or transitively and legal object nouns when transitive.

2.1.2 Network Architecture

The simple recurrent network used in both Elman's simulations and in the current work is shown in Figure 1. Inputs were represented as localist patterns or basis vectors: Each word was represented by a single unit with activity 1.0, all other units having activity 0.0. This representation was chosen to deprive the network of any similarity structure among the words that might provide indirect clues to their grammatical properties. The same 1-of-n representation was also used for outputs, which has the convenient property that the relative activations of multiple words can be represented independently.

On each time step, a new word was presented by fixing the activations of the input layer. The activity in the main hidden layer from the previous time step was copied to the context layer. Activation then propagated through the network, as in a feed-forward model, such that each unit's activation was a smooth, nonlinear (logistic, or sigmoid) function of its summed weighted input from other units. The resulting activations over the output units were then compared with their target activations, generating an error signal. In a simple recurrent network, errors are not back-propagated through time (cf. Rumelhart, Hinton, & Williams, 1986) but only through the current time step, although this includes the connections from the context units to the hidden units. These connections allow information about past inputs—as encoded in the prior hidden representation copied onto the context units—to influence current performance.

Although the target output used during training was the encoding for the actual next word, a number of words were typically possible at any given point in the sentence. Therefore, to perform optimally the network must generate, or predict, a probability distribution over the word units indicating the likelihood that each word would occur next. Averaged across the entire corpus, this distribution will generally result in the lowest performance error.
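To make the processing loop concrete, here is a minimal sketch of one forward step of such a simple recurrent network. The layer sizes loosely follow Figure 1 (the intermediate 10-unit layers are omitted for clarity), the weight ranges are illustrative, and all names are ours rather than those of the original implementation.

```python
import numpy as np

# One forward step of a simple recurrent (Elman-style) network.
# Sizes loosely follow Figure 1; weight ranges are illustrative only.
rng = np.random.default_rng(0)
VOCAB, HIDDEN = 26, 70

W_ih = rng.uniform(-1.0, 1.0, (HIDDEN, VOCAB))    # input   -> hidden
W_ch = rng.uniform(-1.0, 1.0, (HIDDEN, HIDDEN))   # context -> hidden
W_ho = rng.uniform(-1.0, 1.0, (VOCAB, HIDDEN))    # hidden  -> output

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(word_index, context):
    """Process one word; return next-word distribution and new context."""
    x = np.zeros(VOCAB)
    x[word_index] = 1.0                   # 1-of-n (localist) input pattern
    hidden = logistic(W_ih @ x + W_ch @ context)
    out = logistic(W_ho @ hidden)
    # Luce-ratio (softmax-style) normalization turns the output activations
    # into a predicted probability distribution over the next word.
    p = out / out.sum()
    return p, hidden                      # hidden is copied to the context

context = np.zeros(HIDDEN)                # empty context at sentence start
for word_index in [3, 17, 5]:             # arbitrary word indices
    p, context = srn_step(word_index, context)
```

Per-word performance can then be scored as the divergence between p and the true next-word distribution under the grammar, the evaluation used later in the chapter.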

2.1.3 Corpora

Elman's complex training regimen involved training a network on a corpus of 10,000 sentences, 75% of which were "complex" in that they contained at least one relative clause. In his simple regimen, the network was first trained exclusively on simple sentences and then on an increasing proportion of complex sentences. Inputs were arranged in four corpora, each consisting of 10,000 sentences. The first corpus was entirely simple, the second 25% complex, the third 50% complex, and the final corpus was 75% complex—identical to the initial corpus that the network had failed to learn when it alone was presented during training. An additional 75% complex corpus, generated in the same way as the last training corpus, was used for testing the network.

In order to study the effect of varying levels of information in embedded clauses, we constructed five grammar classes. In class A, semantic constraints did not apply between a clause and its subclause, only between nouns and verbs explicitly present in each individual clause. In class B, 25% of the subclauses respected the semantic constraints of their parent clause. In such cases, the modified noun must be a semantically valid subject of the verb for a subject-relative or object of the verb for an object-relative. In class C, 50% of the subclauses respected this constraint, 75% in class D, and 100% in class E. Therefore, in class A, which was most like Elman's grammar, the contents of a relative clause provided no information about the noun being modified other than whether it was singular or plural, whereas class E produced sentences which were the most English-like. We should emphasize that, in this simulation, semantic constraints always applied within a clause, including the main clause. This is because we were interested primarily in the ability of the network to perform the difficult main verb prediction, which relied not only on the number of the subject, but on its semantic properties as well. In a second simulation, we investigate a case in which all the semantic constraints were eliminated to produce a grammar essentially identical to Elman's.

As in Elman's work, four versions of each class were created to produce languages of increasing complexity. Grammars A0, A25, A50, and A75, for example, produce 0%, 25%, 50%, and 75% complex sentences, respectively. In addition, for each level of complexity, the probability of relative clause modification was adjusted to match the average sentence length in Elman's corpora, with the exception that the 25% and 50% complex corpora involved slightly longer sentences to provide a more even progression, reducing the large difference between the 50% and 75% complex conditions apparent in Elman's corpora. Specifically, grammars with complexity 0%, 25%, 50%, and 75% respectively had 0%, 10%, 20%, and 30% modification probability for each noun.

For each of the 20 grammars (five levels of semantic constraints crossed with four percentages of complex sentences), two corpora of 10,000 sentences were generated, one for training and the other for testing. Corpora of this size are quite representative of the statistics of the full language for all but the longest sentences, which are relatively infrequent. Sentences longer than 16 words were discarded in generating the corpora, but these were so rare (<0.2%) that their loss should have had negligible effects. In order to perform well, a network of this size couldn't possibly "memorize" the training corpus but must learn the structure of the language.

2.1.4 Training and Testing Procedures

In the condition Elman referred to as "starting small," he trained his network for 5 epochs (complete presentations) of each of the four corpora, in increasing order of complexity. During training, weights were adjusted to minimize the summed squared error between the network's prediction and the actual next word, using the back-propagation learning procedure (Rumelhart et al., 1986) with a learning rate of 0.1, reduced gradually to 0.06. No momentum was used and weights were updated after each word presentation. Weights were initialized to random values sampled uniformly between ±0.001.

For each of the five language classes, we trained the network shown in Figure 1 using both incremental and non-incremental training schemes. In the complex regimen, the network was trained on the most complex corpus (75% complex) for 25 epochs with a fixed learning rate. The learning rate was then reduced for a final pass through the corpus. In the simple regimen, the network was trained for five epochs on each of the first three corpora in increasing order of complexity. It was then trained on the fourth corpus for 10 epochs, followed by a final epoch at the reduced learning rate. The six extra epochs of training on the fourth corpus—not included in Elman's design—were intended to allow performance with the simple regimen to approach asymptote.

Because we were interested primarily in the performance level possible under optimal conditions, we searched a wide range of training parameters to determine a set which consistently achieved the best performance overall.[1]

[1] The effects of changes to some of these parameter values—in particular, the magnitude of initial random weights—are evaluated in a second simulation.

We trained our network with back-propagation using momentum of 0.9, a learning rate of 0.004 reduced to 0.0003 for the final epoch, a batch size of 100 words per weight update, and initial weights sampled uniformly between ±1.0 (cf. ±0.001 for Elman's network). Network performance for both training and testing was measured

in terms of divergence and network outputs were normalized using Luce ratios (Luce, 1986), also known as softmax constraints (see Rohde & Plaut, 1999).

Because our grammars were in standard stochastic, context-free form, it was possible to evaluate the network by comparing its predictions to the theoretically correct next-word distributions given the sentence context (Rohde, 1999). By contrast, it was not possible to generate such optimal predictions based on Elman's grammar. In order to form an approximation to optimal predictions, Elman trained an empirical language model on sentences generated in the same way as the testing corpus. Predictions by this model were based on the observed next-word statistics given every sentence context to which it was exposed.

2.1.5 Results and Discussion

Elman did not provide numerical results for the complex condition, but he did report that his network was unable to learn the task when trained on the most complex corpus from the start. However, learning was effective in the simple regimen, in which the network was exposed to increasingly complex input. In this condition, Elman found that the mean cosine[2] of the angle between the network's prediction vectors and those of the empirical model was 0.852 (SD = 0.259), where 1.0 is optimal.

[2] The cosine of the angle between two vectors of equal dimensionality can be computed as the dot product (or sum of the pairwise products of the vector elements) divided by the product of the lengths of the two vectors.

Figure 2 shows, for each training condition, the mean divergence error per word on the testing corpora of our network when evaluated against the theoretically optimal predictions given the grammar. To reduce the effect of outliers, and because we were interested in the best possible performance, results were averaged over only the best 16 of 20 trials. Somewhat surprisingly, rather than an advantage for starting small, the data reveals a significant advantage for the complex training regimen (F(1,150) = 53.8, p < .001). Under no condition did the simple training regimen outperform the complex training. Moreover, the advantage in starting complex increased with the proportion of fully constrained relative clauses. Thus, when the 16 simple and 16 complex training regimen networks for each grammar were paired with one another in order of increasing overall performance, there was a strong positive correlation (r = .76, p < .001) between the order of the grammars from A–E and the difference in error between the simple versus complex training regimes.[3] This is consistent with the idea that starting small is most effective when important dependencies span uninformative clauses. Nevertheless, against expectations, starting small failed to improve performance even for class A, in which relative clauses did not conform to semantic constraints imposed by the preceding noun.

[3] The correlation with grammar class is also significant (r = .65, p < .001) when using the ratio of the simple to complex regimen error rates for each pair of networks, rather than their difference.

[Figure 2: line plot, y-axis "Mean Divergence Per Prediction" (0.00–0.14), x-axis "Grammar/Training Corpus" (A–E), two curves: Simple Regimen and Complex Regimen]
Figure 2: Mean divergence per word prediction over the 75% complex testing corpora generated from grammar classes A through E (increasing in the extent of semantic constraints) for the simple and complex training regimes. Note that lower values correspond to better performance. Means and standard errors were computed over the best 16 of 20 trials in each condition.

In summary, starting with simple inputs proved to be of no benefit and was actually a significant hindrance when semantic constraints applied across clauses. The networks were able to learn the grammars quite well even in the complex training regimen, as evidenced by additional analyses reported in Rohde and Plaut (1999). Moreover, the advantage for training on the fully complex corpus increased as the language was made more English-like by enforcing greater degrees of semantic constraints. While it has been shown previously that beginning with a reduced training set can be detrimental in classification tasks such as exclusive-OR (Elman, 1993), it appears that beginning with a simplified grammar can also produce significant interference on a more language-like prediction task. At the very least, starting small does not appear to be of general benefit in all language learning environments.

2.2 Rohde and Plaut (1999) Simulation 2: Replication of Elman (1993)

Our failure to find an advantage for starting small in our initial work led us to ask what differences between that study and Elman's were responsible for the discrepant results. All of the grammars in the first set of simulations

differed from Elman's grammar in that the language retained full semantic constraints within the main clause. It is possible that within-clause dependencies were in some way responsible for aiding learning in the complex training regimen. Therefore, we produced a language, labeled R for replication, which was identical to Elman's in all known respects, thus ruling out all but the most subtle differences in language as the potential source of our disparate results.

2.2.1 Methods

Like Elman's grammar, grammar R uses just 12 verbs: 2 pairs each of transitive, intransitive, and mixed transitivity. In addition, as in Elman's grammar, the proper nouns Mary and John could not be modified by a relative clause and the only additional constraints involved number agreement. We should note that, although our grammar and Elman's produce the same set of strings to the best of our knowledge, the probability distributions over the strings in the languages may differ somewhat. As before, corpora with four levels of complexity were produced. In this case they very closely matched Elman's corpora in terms of average sentence length.

Networks were trained on this language both with our own methods and parameters and with those as close as possible to the ones Elman used. In the former case, we used normalized output units with a divergence error measure, momentum of 0.9, eleven epochs of training on the final corpus, a batch size of 10 words, a learning rate of 0.004 reduced to 0.0003 for the last epoch, and initial weights between ±1. In the latter case, we used logistic output units, squared error, no momentum, five epochs of training on the fourth corpus, online weight updating (after every word), a learning rate of 0.1 reduced to 0.06 in equal steps with each corpus change, and initial weights between ±0.001.

2.2.2 Results and Discussion

Even when training on sentences from a grammar with no semantic constraints, our learning parameters resulted in an advantage for the complex regimen. Over the best 12 of 15 trials, the network achieved an average divergence of 0.025 under the complex condition compared with 0.036 for the simple condition (F(1,22) = 34.8, p < .001). Aside from the learning parameters, one important difference between our training method and Elman's was that we added 6 extra epochs of training on the final corpus to both conditions. This extended training did not, however, disproportionately benefit the complex condition. Between epoch 20 and 25, the average divergence error under the simple regimen dropped from 0.085 to 0.061, or 28%. During the same period, the error under the complex regimen only fell 8%, from 0.051 to 0.047.[4]

[4] The further drop of these error values, 0.047 and 0.061, to the reported final values of 0.025 and 0.036 resulted from the use of a reduced learning rate for epoch 26. Ending with a bit of training with a very low learning rate is particularly useful when doing online, or small batch size, learning.

When the network was trained using parameters similar to those chosen by Elman, it failed to learn adequately, settling into bad local minima. The network consistently reached a divergence error of 1.03 under the simple training regimen and 1.20 under the complex regimen. In terms of city-block distance, these minima fall at 1.13 and 1.32 respectively—much worse than the results reported by Elman. We did, however, obtain successful learning by using the same parameters but simply increasing the weight initialization range from ±0.001 to ±1.0, although performance under these conditions was not quite as good as with all of our parameters and methods. Even so, we again found a significant advantage for the complex regimen over the simple regimen in terms of mean divergence error (means of 0.122 vs. 0.298, respectively; F(1,22) = 121.8, p < .001).

Given that the strength of initial weights appears to be a key factor in successful learning, we conducted a few additional runs of the network to examine the role of this factor in more detail. The networks were trained on 25 epochs of exposure to corpus R75 under the complex regimen using parameters similar to Elman's, although with a fixed learning rate of 1.0 (i.e., without annealing). Figure 3 shows the sum squared error on the testing corpus over the course of training, as a function of the range of the initial random weights. It is apparent that larger initial weights help the network break through the plateau which lies at an error value of 0.221.

The dependence of learning on the magnitudes of initial weights can be understood in light of properties of the logistic activation function, the back-propagation learning procedure, and the operation of simple recurrent networks. It is generally thought that small random weights aid error-correcting learning in connectionist networks because they place unit activations within the linear range of the logistic function where error derivatives, and hence weight changes, will be largest. However, the error derivatives that are back-propagated to hidden units are scaled by their outgoing weights; feedback to the rest of the network is effectively eliminated if these weights are too small. Moreover, with very small initial weights, the summed inputs of units in the network are all almost zero, yielding activations very close to 0.5 regardless of the input presented to the network. This is particularly problematic in a simple recurrent network because it leads to context representations (copied from previous hidden activations) that contain little if any usable information about previous inputs. Consequently, considerably extended

training may be required to accumulate sufficient weight changes to begin to differentiate even the simplest differences in context (see Figure 3). By contrast, starting with relatively large initial weights not only preserves the back-propagated error derivatives but also allows each input to have a distinct and immediate impact on hidden representations and, hence, on context representations. Although the resulting patterns may not be particularly good representations for solving the task (because the weights are random), they at least provide an effective starting point for beginning to learn temporal dependencies.

Figure 3: Sum squared error produced by the network on the testing set at each epoch of training on corpus R75 under the complex regimen, as a function of the range of initial random weights (±0.07, ±0.1, ±0.2, ±0.3, or ±1.0).

In summary, on a grammar essentially identical to that used by Elman (1991, 1993), we found a robust advantage for training with the full complexity of the language from the outset. Although we cannot directly compare the performance of our network to that of Elman's network, it appears likely that the current network learned the task considerably better than the empirical model that we used for evaluation. By contrast, the network was unable to learn the language in either the simple or the complex condition when we used parameters similar to those employed by Elman. However, increasing the range of the initial connection weights allowed the network to learn quite well, although in this case we again found a strong advantage for starting with the full grammar. It was possible to eliminate this advantage by removing all dependencies between main clauses and their subclauses, and even to reverse it by, in addition, training exclusively on complex sentences. But these training corpora bear far less resemblance to the actual structure of natural language than do those which produce a clear advantage for training on the full complexity of the language from the beginning.

2.3 Rohde and Plaut (1999) Simulation 3: Progressive Memory

Elman (1993) argued that his finding that initially simplified inputs were necessary for effective language learning was not directly relevant to child language acquisition because, in his view, there was little evidence that adults modify the grammatical structure of their speech when interacting with children (although we would disagree; see, e.g., Gallaway & Richards, 1994; Snow, 1995; Sokolov, 1993). As an alternative, Elman suggested that the same constraint could be satisfied if the network itself, rather than the training corpus, was initially limited in its complexity. Following Newport's less-is-more hypothesis (Newport, 1990; Goldowsky & Newport, 1993), Elman proposed that the gradual maturation of children's memory and attentional abilities could actually aid language learning.

To test this proposal, Elman (1993) conducted additional simulations in which the memory of a simple recurrent network (i.e., the process of copying hidden activations onto the context units) was initially hindered and then allowed to gradually improve over the course of training. When trained on the full complexity of the grammar from the outset, but with progressively improving memory, the network was again successful at learning the structure of the language which it had failed to learn when using fully mature memory throughout training. In this way, Elman's computational findings dovetailed perfectly with Newport's empirical findings to provide what seemed like compelling evidence for the importance of maturational constraints on language acquisition (see, e.g., Elman et al., 1996, for further discussion).

Given that the primary computational support for the less-is-more hypothesis comes from Elman's simulations with limited memory rather than those with incremental training corpora, it is important to verify that our contradictory findings of an advantage for the complex regimen in Simulations 1 and 2 also hold by comparison with training under progressively improving memory. Accordingly, we conducted simulations similar to those of Elman, in which a network with gradually improving memory was trained on the full semantically constrained grammar, E, as well as on the replication grammar, R, using both Elman's and our own training parameters.

2.3.1 Methods

In his limited-memory simulation, Elman (1993) trained a network exclusively on the complex corpus [5], which he had previously found to be unlearnable. As a model of

5. It is unclear from the text whether Elman (1993) used the corpus with 75% or 100% complex sentences in the progressive memory experiments.
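The memory manipulation being tested can be sketched as follows. Only the context-copying and reset mechanics are illustrated: the layer sizes and weights below are placeholder assumptions, and the context wipe happens after a fixed number of words rather than Elman's randomly chosen 3–4 word window.

```python
import math
import random

HIDDEN = 10  # hidden/context layer size (arbitrary for this sketch)

def step(weights, context, in_vec):
    """One step of a simple recurrent network: new hidden activations are
    computed from the current input together with the context units."""
    hidden = []
    for h in range(HIDDEN):
        net = sum(w * x for w, x in zip(weights[h], in_vec + context))
        hidden.append(1.0 / (1.0 + math.exp(-net)))  # logistic unit
    return hidden

def process(word_vecs, window, weights):
    """Run a word sequence through the network, copying hidden activations
    onto the context units after each word and wiping the context back to
    0.5 after every `window` words, as in the limited-memory regimen."""
    context = [0.5] * HIDDEN
    since_reset = 0
    for in_vec in word_vecs:
        context = step(weights, context, in_vec)  # hidden -> context copy
        since_reset += 1
        if since_reset == window:
            context = [0.5] * HIDDEN              # memory eliminated
            since_reset = 0
    return context

random.seed(0)
IN = 5  # input vector size (arbitrary)
weights = [[random.uniform(-1, 1) for _ in range(IN + HIDDEN)]
           for _ in range(HIDDEN)]
sentence = [[random.random() for _ in range(IN)] for _ in range(6)]

wiped = process(sentence, window=3, weights=weights)  # reset after words 3 and 6
kept = process(sentence, window=7, weights=weights)   # window never reached
```

Raising `window` across training stages (3–4 words, then 4–5, and so on, then no resets at all) gives the progressively improving memory schedule.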
limited memory span, the recurrent feedback provided by the context layer was eliminated periodically during processing by setting the activations at this layer to 0.5. For the first 12 epochs of training, this was done randomly after 3–4 words had been processed, without regard to sentence boundaries. For the next 5 epochs the memory window was increased to 4–5 words, then to 5–6, 6–7, and finally, in the last stage of training, the memory was not interfered with at all.

In the current simulation, the training corpus consisted of 75% complex sentences, although Elman's may have extended to 100% complexity. Like Elman, we extended the first period of training, which used a memory window of 3–4 words, from 5 epochs to 12 epochs. We then trained for 5 epochs each with windows of 4–5 and 5–7 words. The length of the final period of unrestricted memory depended on the training methods. When using our own methods (see Simulation 2), as when training on the final corpus in the simple regimen, this period consisted of 10 epochs followed by one more with the reduced learning rate. When training with our approximation of Elman's methods on grammar R, this final period was simply five epochs long. Therefore, under both conditions, the memory-limited network was allowed to train for a total of 7 epochs more than the corresponding full-memory network in Simulations 1 and 2. When using our methods, learning rate was held fixed until the last epoch, as in Simulation 1. With Elman's method, we reduced the learning rate with each change in memory limit.

2.3.2 Results and Discussion

Although he did not provide numerical results, Elman (1993) reported that the final performance was as good as in the prior simulation involving progressive inputs. Again, this was deemed a success relative to the complex, full-memory condition which was reportedly unable to learn the task.

Using our training methods on language R, the limited-memory condition resulted in performance equivalent to that of the full-memory condition, in terms of divergence error (means of 0.027 vs. 0.025, respectively; F(1,22) = 2.12, p > .15). Limited memory did, however, provide a significant advantage over the corresponding progressive-inputs condition from Simulation 2 (mean 0.036; F(1,22) = 24.4, p < .001). Similarly, for language E, the limited-memory condition was equivalent to the full-memory condition (mean of 0.093 for both; F < 1) but better than the progressive-inputs condition from Simulation 2 (mean of 0.115; F(1,22) = 31.5, p < .001).

With Elman's training methods on grammar R, the network with limited memory consistently settled into the same local minimum, with a divergence of 1.20, as did the network with full memory (see Simulation 2). Using the same parameters but with initial connection weights in the range ±1.0, the limited-memory network again performed almost equivalently to the network with full memory (means of 0.130 vs. 0.122, respectively; F(1,22) = 2.39, p > .10), and significantly better than the full-memory network trained with progressive inputs (mean of 0.298; F(1,22) = 109.1, p < .001).

To summarize, in contrast with Elman's findings, when training on the fully complex grammar from the outset, initially limiting the memory of a simple recurrent network provided no advantage over training with full memory, despite the fact that the limited-memory regimen involved 7 more epochs of exposure to the training corpus. On the other hand, in all of the successful conditions, limited memory did provide a significant advantage over gradually increasing the complexity of the training corpus.

2.4 Summary

Contrary to the results of Elman (1991, 1993), Rohde and Plaut (1999) found that it is possible for a standard simple recurrent network to gain reasonable proficiency in a language roughly similar to that designed by Elman without staged inputs or memory. In fact, there was a significant advantage for starting with the full language, and this advantage increased as languages were made more natural by increasing the proportion of clauses which obeyed semantic constraints. There may, of course, be other training methods which would yield even better performance. However, at the very least, it appears that the advantage of staged input is not a robust phenomenon in simple recurrent networks.

In order to identify the factors that led to the disadvantage for starting small, we returned to a more direct replication of Elman's work in Simulation 2. Using Elman's parameters, we did find what seemed to be an advantage for starting small, but the network failed to sufficiently master the task in this condition. We do not yet understand what led Elman to succeed in this condition where we failed. One observation made in the course of these simulations was that larger initial random connection weights in the network were crucial for learning. We therefore reapplied Elman's training methods but increased the range of the initial weights from ±0.001 to ±1.0. Both this condition and our own training parameters revealed a strong advantage for starting with the full language.

Finally, in Simulation 3 we examined the effect of progressive memory manipulations similar to those performed by Elman (1993). It was found that, despite increased training time, limited memory failed to provide an advantage over full memory in any condition. Interestingly, training with initially limited memory was generally less of a hindrance to learning than training with initially simplified input. In all cases, though, successful learning again required the use of sufficiently large initial weights.

Certainly there are situations in which starting with simplified inputs is necessary for effective learning of a prediction task by a recurrent network. For example, Bengio, Simard, and Frasconi (1994) (see also Lin, Horne, & Giles, 1996) report such results for tasks requiring a network to learn contingencies which span 10–60 entirely unrelated inputs. However, such tasks are quite unlike the learning of natural language. It may also be possible that starting with a high proportion of simple sentences is of significant benefit in learning other language processing tasks, such as comprehension. A child's discovery of the mapping between form and meaning will likely be facilitated if he or she experiences propositionally simple utterances whose meaning is apparent or is clarified by the accompanying actions of the parent. However, the real question in addressing the less-is-more hypothesis is whether limited cognitive capacity will substantially aid this process.

Having failed to replicate Elman's results, it seems appropriate to turn a critical eye on the other major sources of evidence for the less-is-more hypothesis. Aside from Elman's findings, four main studies have been characterized as providing support for the advantage of learning with limited resources. Goldowsky and Newport (1993) presented evidence of the noise-reducing power of random filtering in a statistical learning model of a simple morphological system. Kareev, Lieberman, and Lev (1997) offered a statistical argument in favor of the correlation-enhancing power of small samples and performed two empirical studies purported to confirm this. The other two studies are more purely empirical. Cochran, McDonald, and Parault (1999) taught participants ASL verbs with and without the presence of a simultaneous cognitive load and with practice on the full signs or on individual morphemes. Finally, Kersten and Earles (2001) taught participants a simple novel language with and without sequential input. We discuss each of the four papers here in some detail.

3 Goldowsky and Newport (1993)

Goldowsky and Newport (1993) proposed a simple learning task, and one form of learning model that might be used to solve the task. Training examples consisted of pairings of forms and meanings. A form had three parts, A, B, and C. For each part there were three possible values: A1, A2, A3, B1, B2, etc. Meanings were also composed of three parts, M, N, and O, each with three values. There was a very simple mapping from forms to meanings: A1, A2, and A3 corresponded to M1, M2, and M3, respectively, B1, B2, and B3 corresponded to N1, N2, and N3, and so forth [6]. Thus, the form A2B1C3 had the meaning M2N1O3. The task was, apparently, to learn this simple underlying mapping.

Goldowsky and Newport suggested that one way to solve the task might be to gather a table with counts of all form and meaning correspondences across some observed data. If the form A2B1C3 and the meaning M2N1O3 were observed, the model would increment values of cells in the table corresponding to the pairing of each of the eight subsets of the form symbols with each subset of the three meaning symbols. If trained on all 27 possible examples, the model would have a value of 9 for each of the cells correctly pairing individual elements of the form to individual elements of the meaning (e.g., A1 to M1 and B3 to N3). The next largest, incorrectly paired, cells would have a value of 3 and the rest of the cells would have a value of 1.

Goldowsky and Newport suggested that there is too much noise in such a table because of the many values representing incorrect or overly complex pairings. They then introduced a filtering scheme meant to simulate the effect of poor working memory on a child's experiences. Before a form/meaning pair is entered into the table, some of its information is lost at random. Half of the time one of the three elements of the form is retained and half of the time two elements are retained. Likewise for the meaning. The authors argued that this improves learning because it produces a table with a higher signal-to-noise ratio. Therefore, they concluded, having limited memory can be helpful because it can help the learner focus on the simple, often important, details of a mapping.

But we should examine this learning situation a bit more carefully. First of all, in what sense is the signal-to-noise ratio improving as a result of filtering? The ratio between the correct, largest values in the table in the adult (unfiltered) case and the next largest competitors was 3:1. In the child (filtered) case, the expected ratio remains 3:1. Although some of the competitors will become proportionately less likely, others will not. What is eliminated by the filtering is the large number of very unlikely mappings. So the signal-to-noise ratio is improving if it is taken to be the ratio of the correct value to the sum of all other values. If taken to be the ratio of the correct value to the nearest incorrect value, there is no improvement. Furthermore, the child learner must experience many more form/meaning pairings than the adult learner before it can adequately fill its co-occurrence table.

To see the implications of these points, we need to make

6. The mapping used in the Goldowsky and Newport (1993) paper actually included one exception: the form A4B4C4 has meaning M3N3O3. Because the introduction of this did not seem to strengthen their case for starting small, it is eliminated here for simplicity.
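The counting model just described is easy to make concrete. The sketch below builds the co-occurrence table for the exception-free mapping and recovers the 9/3/1 cell values; only non-empty subsets are paired here (the text's count of eight subsets includes the empty set, which carries no information).

```python
from itertools import combinations, product

def subsets(items):
    """All non-empty subsets of a form or meaning, as frozensets."""
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def build_table(examples):
    """Increment one cell for every pairing of a form subset with a
    meaning subset, across all observed form/meaning pairs."""
    table = {}
    for form, meaning in examples:
        for s in subsets(form):
            for t in subsets(meaning):
                table[(s, t)] = table.get((s, t), 0) + 1
    return table

# All 27 examples of the exception-free mapping: Ai Bj Ck -> Mi Nj Ok.
examples = [(('A%d' % i, 'B%d' % j, 'C%d' % k),
             ('M%d' % i, 'N%d' % j, 'O%d' % k))
            for i, j, k in product((1, 2, 3), repeat=3)]

table = build_table(examples)
correct = table[(frozenset({'A1'}), frozenset({'M1'}))]     # correct pairing
competitor = table[(frozenset({'A1'}), frozenset({'N2'}))]  # nearest wrong pairing
print(correct, competitor)  # prints: 9 3
```

The 3:1 ratio between these two cells is the signal-to-noise figure discussed in the text.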
the task somewhat more explicit. Goldowsky and Newport (1993) presented a model that counts statistics, but not one that actually solves the form/meaning mapping. To complete the story, we will need to generate a model that is capable of taking a form and producing its best guess for the appropriate meaning. Two potential solutions to this problem immediately come to mind. In the first, arguably simpler, method, the model looks down the column of values under the given form and chooses the meaning corresponding to the largest value. If two meanings have the same strength, the model is counted wrong. This will be referred to as the Plurality method.

In the second method, the model draws at random from the distribution of values, such that the probability of selecting a meaning is proportional to the value associated with that meaning. This Sampling method seems to be more in line with what Goldowsky and Newport implied might be going on, judging from their use of the term signal-to-noise ratio. The Plurality method only fails if the nearest competitor is as strong as the correct answer. In contrast, the Sampling method is wrong in proportion to the total strength of competitors. Both of these methods were implemented and tested experimentally with and without random filtering. The models were judged by their ability to provide the correct meaning for each of the nine forms involving a single element. The results, averaged over 100 trials in each condition, are shown in Figure 4.

As Goldowsky and Newport (1993) suggested, their filtering mechanism is indeed beneficial when used with the Sampling method, achieving a score of about 25.2% versus 14.3% without filtering. However, Sampling overall performs quite poorly. The Plurality method is much more effective. But in that case, filtering is harmful, and slows learning down considerably. Even after 200 trials, the filtered model is able to completely solve the task only about 80% of the time.

Now one might reasonably make the argument that this isn't a fair comparison. Perhaps the Plurality method is much more susceptible to noise and the benefit of the filter isn't apparent in such perfect conditions. After all, it is probably unreasonable to expect that a human learner is able to perfectly notice and store all available information. To test this possibility, a source of noise was added to the simulations. 50% of the time, the operation of incrementing a value in the table failed. Thus, half of the data was lost at random. As shown in Figure 5, this manipulation had almost no effect on the Sampling method, but did have some effect on the Plurality method. However, the Plurality method remained significantly better without the filter.

A final consideration is that the bubble diagrams used to represent the form/meaning co-occurrence table in the Goldowsky and Newport (1993) paper did not directly reflect raw co-occurrence counts. The radius of the bubbles was proportional to the ratio of the co-occurrence count to the square root of the product of the overall number of occurrences of the form and the overall number of occurrences of the meaning. This was termed the consistency of co-occurrence. So one might ask, how well do the two proposed models perform if they work with co-occurrence consistency values rather than raw counts? As shown in Figure 6, performance declines slightly for the Sampling method and improves slightly for the Plurality method with filtering. But overall the results are qualitatively similar.

Figure 4: Learning the Goldowsky & Newport (1993) task using raw counts in a noise-free environment. (Percent correct mappings over 0–200 training items, for the Plurality and Sampling methods with and without the filter.)

Figure 5: Learning the Goldowsky & Newport (1993) task using raw counts with random loss of 50% of the data. (Same axes and conditions as Figure 4.)
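The two decision procedures can be sketched directly over such a table. For brevity, this version reads only the single-element columns (the counts of each meaning element under a given form element), built from the 27 unfiltered examples; parts and values are coded as indices 0–2.

```python
import random
from itertools import product

FORMS = list(product(range(3), repeat=3))  # all 27 forms

def column(part, value):
    """Counts of every meaning element across the examples whose form
    contains the given form element (the meaning mirrors the form)."""
    counts = {}
    for form in FORMS:
        if form[part] != value:
            continue
        for m_part, m_value in enumerate(form):
            key = (m_part, m_value)
            counts[key] = counts.get(key, 0) + 1
    return counts

def plurality(counts):
    """Pick the single largest count; a tie counts as a wrong answer."""
    best = max(counts.values())
    winners = [m for m, c in counts.items() if c == best]
    return winners[0] if len(winners) == 1 else None

def sampling(counts):
    """Draw a meaning element with probability proportional to its count."""
    r = random.uniform(0, sum(counts.values()))
    for m, c in counts.items():
        r -= c
        if r <= 0:
            return m
```

For the column under A1, Plurality always returns the correct element (a count of 9 against competitors of 3), while Sampling is correct only with probability 9/27; that difference is the gap between the two methods in Figure 4.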
Thus, with the much more effective Plurality method of determining form/meaning pairs from co-occurrence data, the filtering mechanism was a serious hindrance. But it seems that building a large table may not be at all similar to the way the human brain might solve this mapping task. Perhaps a better model is that of a connectionist network. Could such a model learn the underlying regularity and would it benefit from the same filtering method proposed by Goldowsky and Newport? To answer this question, we performed some simulation experiments.

First a simple one-layer network was constructed, with a 9-unit input layer fully connected to a 9-unit output layer. The nine input units corresponded to the nine possible elements of the form. One of the first three units was turned on to represent the A element, one of the second set of three units was turned on to represent the B element, and so forth. Similarly, the nine units in the output representation corresponded to the nine possible elements of the meaning, with three of the nine units normally having targets of 1, and the rest having targets of 0. If an element of the form was eliminated by the filtering mechanism, the corresponding three units of the input were all turned off. If an element of the meaning was eliminated, the corresponding three units of the output had no target values. The network was tested by presenting it with a single element of the form as an input. Although the network may never have been trained to perform this particular mapping, the desired response is that it will output just the corresponding element of the meaning. A response was considered correct if the activations of all nine output units were on the correct side of 0.5.

In order to argue that filtering is or is not beneficial, one cannot simply rely on performance under a single set of training parameters. It is possible that the benefit of filtering could be masked by a poor choice of parameters. Therefore, we trained networks using 32 parameter sets. Four learning rates (0.05, 0.1, 0.2, 0.4) were crossed with two momentum values (0.0, 0.9), two initial weight ranges (±0.1, ±1.0), and two weight decay values (0.0, 0.0001). Networks were trained on 1000 randomly selected examples using online learning, meaning that weight updates were performed after each example.

Performance was measured by testing the model's ability to produce the correct meaning for each of the nine isolated forms. The final performance in each condition, averaged over 50 trials, is shown in Table 3. Without filtering, the network learns best with small initial weights, some weight decay, momentum, and a large learning rate. With filtering, the network learns best with a small learning rate and no momentum. But under no conditions did filtering improve learning. Figure 7 shows the averaged learning profiles with and without filtering using training parameters with which the filtered networks performed quite well: no weight decay or momentum, initial weights ±0.1, and learning rate 0.05. Although they reach similar final performance, the networks learned much more quickly and smoothly without filtering.

One might argue that we have cheated by applying a single layer network to the task because such a network cannot learn very complex mappings, so it doesn't need filtering to learn this simple one. Admittedly, if the task were not so simple, we would have used a larger network. To test the possibility that a larger network will fail to learn the simple rule without filtering, we trained a two layer, 9-9-9, feed-forward network using the same task and parameters.

Figure 6: Learning the Goldowsky & Newport (1993) task using correlation values with no noise. (Percent correct mappings over 0–200 training items.)

Figure 7: Learning the Goldowsky & Newport (1993) task using a single layer neural network. (Percent correct mappings over 0–1000 training items, with and without the filter.)
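The one-layer network just described can be sketched in a few lines. This version trains without filtering; the error signal and the learning rate of 0.5 are assumptions made for the sketch (the simulations reported here also crossed momentum, weight decay, and initial weight ranges).

```python
import math
import random

random.seed(0)
W = [[0.0] * 9 for _ in range(9)]   # weights[output][input], 9x9
B = [0.0] * 9                       # output biases
LR = 0.5                            # illustrative learning rate

def encode(triple):
    """Three one-hot banks of three: value v of part p lights unit 3*p + v."""
    x = [0.0] * 9
    for part, value in enumerate(triple):
        x[3 * part + value] = 1.0
    return x

def forward(x):
    return [1.0 / (1.0 + math.exp(-(B[o] + sum(w * xi for w, xi in zip(W[o], x)))))
            for o in range(9)]

# Online training on unfiltered examples; the meaning mirrors the form,
# so input and target vectors are identical.
for _ in range(3000):
    x = encode([random.randrange(3) for _ in range(3)])
    y = forward(x)
    for o in range(9):
        delta = LR * (x[o] - y[o])     # (target - output) error signal
        B[o] += delta
        for i in range(9):
            W[o][i] += delta * x[i]

# Test on the nine single-element forms: a response is correct when all
# nine outputs land on the right side of 0.5.
correct = 0
for unit in range(9):
    x = [1.0 if i == unit else 0.0 for i in range(9)]
    y = forward(x)
    if all((y[i] > 0.5) == (i == unit) for i in range(9)):
        correct += 1
print(correct, 'of 9 single-element forms mapped correctly')
```

Even though the network never sees single-element inputs during training, the weights it learns for full forms carry over to them, which is the generalization test used in the text.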
Table 3: Final performance levels with a 9-9 network under various conditions. The left value in each pair is the performance without filtering and the right value is the performance with filtering.

Weight Decay | Momentum | Initial Weights | LR 0.05      | LR 0.1       | LR 0.2       | LR 0.4
0            | 0        | ±0.1            | 100.0 / 98.9 | 100.0 / 98.4 | 100.0 / 76.7 | 100.0 / 44.9
0            | 0        | ±1.0            | 85.6 / 77.3  | 96.9 / 88.7  | 98.7 / 75.6  | 100.0 / 45.6
0            | 0.9      | ±0.1            | 100.0 / 33.3 | 100.0 / 16.7 | 100.0 / 4.4  | 100.0 / 3.3
0            | 0.9      | ±1.0            | 100.0 / 32.2 | 100.0 / 15.8 | 100.0 / 4.4  | 100.0 / 3.3
0.0001       | 0        | ±0.1            | 100.0 / 99.6 | 100.0 / 97.6 | 100.0 / 78.0 | 100.0 / 44.4
0.0001       | 0        | ±1.0            | 88.9 / 79.6  | 97.1 / 89.3  | 100.0 / 76.0 | 100.0 / 46.4
0.0001       | 0.9      | ±0.1            | 100.0 / 42.2 | 100.0 / 22.2 | 100.0 / 5.6  | 100.0 / 3.3
0.0001       | 0.9      | ±1.0            | 100.0 / 42.2 | 100.0 / 22.0 | 100.0 / 5.6  | 100.0 / 3.1

Table 4: Final performance levels with a 9-9-9 network under various conditions. The left value in each pair is the performance without filtering and the right value is the performance with filtering.

Weight Decay | Momentum | Initial Weights | LR 0.05     | LR 0.1      | LR 0.2      | LR 0.4
0            | 0        | ±0.1            | 0.0 / 1.1   | 42.0 / 2.2  | 92.9 / 8.9  | 99.1 / 26.9
0            | 0        | ±1.0            | 60.2 / 14.2 | 72.2 / 41.6 | 88.4 / 40.7 | 88.4 / 33.3
0            | 0.9      | ±0.1            | 98.7 / 24.9 | 93.8 / 14.4 | 81.1 / 6.4  | 19.6 / 2.4
0            | 0.9      | ±1.0            | 81.8 / 23.8 | 79.1 / 14.4 | 76.2 / 5.8  | 41.1 / 2.4
0.0001       | 0        | ±0.1            | 0.0 / 1.1   | 35.6 / 2.2  | 94.0 / 7.6  | 99.6 / 26.9
0.0001       | 0        | ±1.0            | 66.0 / 10.0 | 79.1 / 37.1 | 93.1 / 47.1 | 88.4 / 34.7
0.0001       | 0.9      | ±0.1            | 99.3 / 24.7 | 99.3 / 16.2 | 99.6 / 6.9  | 94.0 / 2.9
0.0001       | 0.9      | ±1.0            | 99.3 / 25.6 | 99.3 / 15.6 | 99.1 / 5.6  | 99.1 / 3.6

As shown in Table 4, the two layer network doesn't solve the task as easily as the one layer network. But under several different choices of parameters, the network is able to master the task nearly all of the time without filtering. The best performance achieved with filtering, on the other hand, was just 47.1% correct. In only two cases (with a small learning rate, small initial weights, and no momentum) did the filtered networks perform better than the unfiltered ones. But in those cases the filtered networks only reached an average performance of 1.1%.

In summary, the filtering mechanism proposed by Goldowsky and Newport (1993) for this task did not improve the performance of either an effective tabulation strategy or two neural network models. Although the random filtering mechanism sometimes isolates correct one-to-one form/meaning pairs, it more frequently destroys those pairs and isolates incorrect ones. This introduces noise that outweighs the occasional benefit and that can be detrimental to learning.

4 Kareev, Lieberman, and Lev (1997)

Kareev, Lieberman, and Lev (1997) began by reiterating a theoretical point about sampled distributions which was first raised in Kareev (1995). If a distribution over two correlated real-valued variables is sampled repeatedly, the expected median of the observed correlations in the samples increases as the size of the sample decreases. On the basis of this fact, Kareev et al. suggested that humans estimating correlations in observed events will be better at detecting those correlations if they have limited working memory, and thus presumably rely on smaller remembered samples in formulating their judgments.

In the first experiment, participants were given 128 envelopes, each containing a coin. Envelopes were either red or green and the coin inside was either marked with an X or an O. Participants opened envelopes one-by-one in random order and each time tried to predict the type of coin based on the envelope's color. The envelopes' contents were manipulated to produce true color/mark correlations ranging from -0.6 to 0.6. The eight participants in each condition were grouped based on the results of a single-trial digit-span test of working memory. Response correlation was computed for each participant using the
100
matrix of envelope colors and mark predictions. Kareev
et al. found that the low-span participants tended to have C = 0.8

larger response correlations and to have more accurate

% Chance that Observed Correlation >= 0.1


90
overall predictions.
C = 0.6
This is certainly an interesting result, but the theoreti-
cal explanation ought to be reconsidered. To begin with,
80
the authors stressed the fact that median observed corre-
lation increases as sample size decreases. That is, with
a smaller sample, observers have a higher probability of 70
C = 0.4

encountering a correlation that is larger than the true cor-


relation. This is mainly an artifact of the increased noise
resulting from small samples. On the basis of increasing 60
median, Kareev et al. concluded that, “The limited ca- C = 0.2
pacity of working memory increases the chances for early
detection of a correlation.. . . [A] relationship, if it exists, 50
4 5 6 7 8 9 10
is more likely to be detected, the smaller the sample” (p. Sample Size
279). Thus, the authors seem to be equating median esti-
mation with the ability to detect any correlation whatso-
ever. However, they do not offer an explicit account of Figure 8: The probability that the observed correlation
how participants might be solving the correlation detec- value is greater than 0.1 (and thus presumably detectable)
tion or coin prediction task. as a function of sample size and true correlation (C).
The median correlation happens to be one measure
computable over a series of samples. 7 But there are other iment involved pairs of real-valued random variables. A
measures that may be more directly applicable to the prob- desired correlation, C, was achieved by generating the val-
lem of detecting a correlation, such as the mean, and not ues as follows:
all measures increase in magnitude with smaller samples.
The mean correlation diminishes with decreasing sample a
b
rand()
Ca
 
1 C2 rand()
size. But an individual participant is not encountering a
series of samples, but just one sample, so the median or where rand() produces a random value uniformly dis-
mean computed over multiple samples is not necessarily tributed in the range [-1,1]. 1 million trials were con-
relevant. ducted for each pairing of sample size and correlation.
So what is an appropriate model of how participants Clearly, for the range of parameters covered, the chance
are solving the task, and how is this model affected by that the observed correlation is greater than threshold in-
sample size? Signal detection theory typically assumes that human observers have a threshold above which a signal is detected. In this case, we might presume that the signal is the perceived correlation between envelope color and coin type, and that the correlation, whether positive or negative, is detectable if its magnitude is above a participant's threshold. If participants are basing their responses in the coin prediction task on a signal detection procedure involving a fixed threshold, we must ask what is the probability that a sample of size N from a distribution with true correlation C has an observed correlation greater than a given threshold?

It seems reasonable to suppose that the typical human threshold for detecting correlations in small samples probably falls between 0.05 and 0.2, although it presumably varies based on task demands. Figure 8 shows the probability that a small sample has an observed correlation above 0.1 as a function of the size of the sample and the strength of the true correlation. The data in this exper-

creases monotonically with sample size. Larger samples lead to a greater chance of detecting a correlation. One may disagree with the arbitrary choice of 0.1 for the detection threshold, but the same penalty for small samples is seen with a value of 0.2, provided the true correlation is greater than 0.2, and the effect becomes even stronger with thresholds below 0.1. Thus, the fact that the median observed correlation increases with small sample sizes does not bear on what is arguably a reasonable model of human correlation detection.

Another important issue is that the sampling distribution measures discussed by Kareev et al. were for pairs of real-valued variables, but the experiments they conducted involved binary variables. Do the same principles apply to small samples of binary data? Figure 9 shows the median observed correlation in small samples of binary data, as a function of the sample size and the true correlation. Although median correlation decreases as a function of sample size for real-valued data, median correlation doesn't seem to vary in any systematic way as a function of sample size for binary data. There is simply more variabil-

7 The term sample is used here to refer to a set of observations, or examples, not just a single observation.
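The statistical point at issue, that the probability of detecting a true correlation rises rather than falls with sample size, can be checked with a small Monte Carlo sketch. This is an illustration with parameters of our own choosing (the function names and the specific values of C, N, and the threshold are ours), not the code behind Figure 8:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def detection_prob(true_c, n, threshold=0.1, trials=20000, seed=0):
    """Estimate P(observed r > threshold) for samples of size n drawn
    from a bivariate normal population with correlation true_c."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            # y shares correlation true_c with x
            y = true_c * x + math.sqrt(1.0 - true_c ** 2) * rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)
        if pearson_r(xs, ys) > threshold:
            hits += 1
    return hits / trials

# With a fixed detection threshold, the larger sample detects the
# true correlation more often, not less often
print(detection_prob(0.4, 5), detection_prob(0.4, 10))
```

In runs of this sketch the larger sample wins, which is the penalty for small samples described above.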
Rohde and Plaut Less is Less in Language Acquisition
Figure 9: The median observed correlation in small samples of binary data, as a function of sample size and true correlation (C). [Plot: median observed correlation versus sample size (4 to 10), with one curve per true correlation from C = 0.2 to C = 0.8.]

Figure 10: The correlation between envelope color and the models' predictions of coin marking as a function of the actual correlation and the model's memory window size. [Plot: response correlation versus actual correlation (−0.6 to 0.6) for window sizes 5, 9, and 13, with the identity line as reference.]
ity in the small samples. But again, median correlation value is not necessarily indicative of the ease of detection. As with real-valued data, the probability that an observed correlation is greater than some small threshold tends to increase with larger samples of binary data.

But it may be possible that these statistical measures don't accurately reflect the power of small samples in a practical context. Therefore, we designed a simple model to perform the envelope/coin task using varying levels of working memory. The model was intended to reflect the manner in which Kareev et al. seem to imply humans might be solving this task. The model simply remembers the contents of the last N cards of each color and chooses the coin that was more frequent in that sample. If the coins were equally frequent in the sample, the choice is random. The model was run with three sample sizes, 5, 9, and 13, meant to reflect small, medium, and large working memory capacity and was run 1000 times on each of the 14 distributional conditions used by Kareev, Lieberman, and Lev (1997). Seven of these conditions were symmetric in that they used an equal number of X's and O's and seven did not satisfy this constraint and were termed asymmetric. Each symmetric condition had a corresponding asymmetric one with approximately the same envelope/coin correlation. The correlation between the models' predictions and the envelope color was computed in the same way as for the experimental participants.

Figure 10 shows the prediction correlation values as a function of actual correlation for the three working memory levels, with results in the corresponding symmetric and asymmetric conditions averaged. The identity baseline is provided as a reference, but note that optimal performance in this task has nothing to do with matching the actual correlation values. An optimal predictor will always predict the more likely coin, whether the actual correlation is 0.1 or 0.9. Contrary to Kareev et al.'s prediction, the larger sample size results in larger response correlations, not smaller ones. Figure 11 gives the prediction accuracy as a function of correlation and window size. Although the difference is fairly small, larger window sizes consistently outperformed the smaller ones.

Therefore, although the results of the first experiment in Kareev, Lieberman, and Lev (1997) are rather interesting and deserve replication and explanation, these results cannot be attributed to the effects of small samples on perceived correlation. The probability of observing a correlation stronger than a relatively sensitive detection threshold is lower with small sample sizes and the median observed correlation value with binary data does not change systematically with sample size. A simple prediction model that relies on samples of varying size performs better with larger samples. While it is true that this model does not appear to fully capture human performance in this task, the relevant point is that the effects of small sample sizes on perceived correlation do not adequately explain the empirical findings.

The second experiment reported by Kareev, Lieberman, and Lev (1997) also does not seem to fully support their theory. In this case, participants were not blocked by digit span but were given samples of varying size upon which to base a prediction. The samples were either fully visible throughout the process or were presented sequentially
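The window-memory model described above can be sketched as follows. This is a reconstruction from the description in the text, using a single illustrative envelope/coin contingency (the function name and the value of p_match are ours) rather than the 14 distributional conditions actually used:

```python
import random

def run_model(window, p_match=0.7, trials=20000, seed=1):
    """Remember the last `window` coins seen for each envelope color and
    predict the coin that was more frequent in that remembered sample;
    ties are guessed at random. Returns overall prediction accuracy.
    p_match is the (illustrative) probability that a red envelope holds
    coin X and a green envelope holds coin O."""
    rng = random.Random(seed)
    memory = {"red": [], "green": []}
    correct = 0
    for _ in range(trials):
        color = rng.choice(["red", "green"])
        # actual coin, correlated with envelope color
        if color == "red":
            coin = "X" if rng.random() < p_match else "O"
        else:
            coin = "O" if rng.random() < p_match else "X"
        sample = memory[color]
        x_count = sample.count("X")
        o_count = len(sample) - x_count
        if x_count != o_count:
            guess = "X" if x_count > o_count else "O"
        else:
            guess = rng.choice(["X", "O"])
        if guess == coin:
            correct += 1
        sample.append(coin)
        if len(sample) > window:
            sample.pop(0)  # forget the oldest observation
    return correct / trials

# A larger memory window yields more accurate predictions
print(run_model(5), run_model(13))
```

If small samples magnified correlations in a way that helped prediction, the window of 5 should win; instead the larger window is consistently more accurate, in line with the simulation results reported above.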
Figure 11: The prediction accuracy as a function of the actual correlation and the model's memory window size. [Plot: percent correct predictions (50 to 80) versus actual correlation (−0.6 to 0.6) for window sizes 5, 9, and 13.]

and were unavailable in formulating the prediction. In this case, the variables were real-valued, rather than binary. The results indicated that when samples were absent, there was better performance with the small samples than with the medium or large ones. But when the samples were present, performance increased with sample size. This latter result is inconsistent with the prediction that small samples should statistically magnify correlations. If that were true, larger samples would lead to worse performance, especially if the samples are present. The fact that participants viewing sequential samples performed better with smaller ones is indeed interesting, but cannot be explained by a statistical property of sample size itself.

5 Cochran, McDonald, and Parault (1999)

Much of the empirical support for the less-is-more hypothesis derives from the study of American Sign Language (ASL). Newport (1990) observed that late learners of ASL tend to make more morphological errors in the production of verbs than do early learners. While interesting, it is not clear to what this finding should be attributed. The problems incurred by late learners could be due to deactivation of a language acquisition device, greater cognitive capacity, different types or degrees of exposure, or a variety of other factors. Cochran, McDonald, and Parault (1999) sought to provide empirical evidence supporting the idea that cognitive limitations can actually lead to better learning of ASL verbs. They conducted three experiments in which participants unfamiliar with ASL were taught some sentences and then tested in their ability to produce either the same or novel ASL sentences.

In the first two experiments, participants were taught 16 verbs. Each verb was encountered in the context of a single sentence, in which either the subject was “I” and the object was “you”, or vice-versa. Six of the verbs used congruent agreement, in which the direction of the sign was from the verb’s subject (either the signer or the addressee) to the verb’s object. Two of the verbs used incongruent agreement, in which the direction of the sign was from object to subject. Four nonagreement verbs required a static direction of motion, which was either always away from or always toward the signer. The last four verbs had a direction of motion aligned vertically, either up or down. Participants were exposed to each verb in a single context, with half of the verbs in each condition using the subject “I” and half using the subject “you”. The 16 study sentences were observed three times in the first experiment and eight times in the second experiment. In order to place a load on working memory, half of the participants performed a tone-counting task during training. This was known as the load condition. Participants were then tested on the 16 familiar sentences as well as the 16 novel sentences created by reversing the subject and object.

Cochran, McDonald, and Parault (1999) found that participants in the no-load condition produced the familiar sentences better overall and performed better on familiar and novel non-agreement verbs. However, participants in the no-load condition did not perform as well on the agreement verbs in novel sentences. They were much more likely to produce the sign in the same direction that they learned it, rather than reversing the direction in the new context. This was taken as evidence that “adults learning under normal conditions were failing to learn the internal structure of the language and were therefore limited in their ability to generalize to new contexts” (p. 30).

However, an alternative reading of the data is that participants in the load condition were simply not learning as well and performed more randomly during test. Not only did load participants have more movements in the correct direction, they produced more verbs with no movement or, in the first experiment, with movement outside the axis between the signer and addressee. The fact that load condition participants happened to use the correct movement more often in novel conditions can be attributed to their generally more noisy behavior, rather than their having learned to generalize to novel conditions.

The main problem with these experiments is that participants are expected to learn that the movement of certain verbs should agree with sentence context when there was no basis for such a generalization in the examples to which the participants had been exposed. Each verb was seen in just one context, with just one direction of motion, and only six of the 16 verbs underwent congruent agree-
ment. The evidence to which the participants were exposed fully supports the simpler hypothesis: that direction of motion is an intrinsic, non-inflected part of the sign for a verb. In fact, this is the correct rule for half of the verbs used in the experiment. Given the lack of any evidence to the contrary, it seems much more reasonable for participants to surmise that ASL permits no agreement, than to surmise that some verbs have agreement, some have incongruent agreement, and some have no agreement. The results in these experiments are consistent with the hypothesis that participants in the no-load condition learned this very reasonable rule much better than did participants in the load condition.

A true test of generalization ability must provide the learner with some support for the validity of the expected generalization. Had participants experienced some agreement verbs used with different motions in different circumstances, they would have some basis for expecting that agreement plays a role in ASL. A second factor biasing the participants against formulating the desired generalization was that, unlike in ASL, pronouns were explicitly produced in all training sentences. Languages with strong verb inflection, such as Spanish, often drop first- and second-person pronouns, because they convey redundant information. Because such pronoun drop was not a feature of the training sentences, learners are more likely to assume that pronominal information is not redundantly conveyed in the verb form. In summary, the first two experiments of this study essentially found that participants trained to perform one reasonable generalization did poorly when tested on a different, more complex, generalization.

The third experiment conducted by Cochran, McDonald, and Parault (1999) tested the learning of ASL motion verbs, comparing participants who were taught to mimic whole signs to those who were taught to mimic just one part of each sign, either the form or the motion, at a time. During training, signs for a certain type of actor moving in a certain way were paired with a hand movement indicating the path of motion. For some verbs, the motion sign is produced at the same time as the verb, but for other verbs they are produced in sequence. During testing, all verbs were paired with all path signs.

Overall there was no difference in performance on the studied or the novel signs between the “whole” and “part” learners. There was an unexplained tradeoff, in that whole learners did better if the parts of the new sign were to be performed sequentially and worse if they were to be performed simultaneously. The only other difference was the marginally significant tendency for whole-practice participants to produce more frozen signs,8 which could be a cause or effect of the other difference. If anything, this study seems to provide strong evidence that learning individual parts of signs is not, overall, of significant benefit. Although whole-sign learners produced more frozen signs, they performed better in other respects, balancing the overall performance. Somewhat disturbingly, however, more participants were thrown out for inadequate performance or unscorable data from the part-learning group. One person in the whole-sign condition was thrown out for unscorable data and nine people in the part-sign condition were replaced, three for bad performance and two for unscorable data. Across the three experiments, three participants were discarded from the no-load and whole-sign conditions for performance or scorability reasons, compared with 12 participants in the load and part-sign conditions. In experiments of this sort involving a direct comparison between training methods, eliminating participants for performance reasons during training has the clear potential to bias the average testing performance. If participants must be removed from one condition for performance reasons, an equal number of the worst performers in the other conditions should be removed as well, although this still may not fully eliminate the bias.

6 Kersten and Earles (2001)

Kersten and Earles (2001) conducted three language learning experiments which compared learning in a staged input condition to learning in a full-sentence condition. In each experiment, participants viewed events in which one bug-like object moved towards or away from another, stationary, bug-like object. In the full-sentence condition, each event was paired with the auditory presentation of a three-word sentence. The first word corresponded to the appearance of the moving bug and ended in “–ju”. The second word described the manner of motion—either walking with legs together or alternating—and ended in “–gop”.9 The third word described the direction of walking—towards or away from the stationary bug—and ended in “–tig”.

In the first two experiments, half of the participants heard complete sentences for the whole training period. The other participants initially heard just the first (object) word for a third of the trials, then the first two words, and finally all three words. In the testing period, participants were shown two events that varied on a single attribute and heard either an isolated word (corresponding to the manipulated attribute) or a sentence. They were to identify the event that correctly matched the word or sentence.

The most important finding in these experiments was significantly better performance, overall, for participants

8 A frozen sign was a new sign that contained an unnecessary part of a previously studied sign.
9 In the first experiment, some participants heard object-manner-path word order and others heard object-path-manner.
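The concern raised above about discarding participants for poor performance in only one condition is a standard selection-bias effect, which a toy simulation makes concrete (the group sizes, means, and cutoffs here are purely illustrative):

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(42)
# Two groups drawn from the SAME ability distribution
group_a = [rng.gauss(50.0, 10.0) for _ in range(1000)]
group_b = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Exclude the 100 worst performers from group B only, as if they had
# been replaced for inadequate performance during training
group_b_kept = sorted(group_b)[100:]

# Group B now looks better even though the populations are identical
print(mean(group_a), mean(group_b_kept))
```

Even though both groups come from the same distribution, the censored group's average is reliably higher, which is why an equal number of worst performers would need to be dropped from the comparison conditions.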
in the staged input condition. Kersten and Earles interpreted this as evidence in favor of the less-is-more hypothesis. However, one should exercise some caution in drawing conclusions from these experiments. Although there was an overall advantage for starting small, if one tests performance on object words, manner words, and path words independently, the effect is only significant for object words. Thus, the results are consistent with the hypothesis that starting small was only beneficial in learning the meanings of the object words, i.e., those words trained in isolation for the first third of the trials.

Kersten and Earles sought to rule out a slightly different, but equally viable, hypothesis—that the effect relies on the fact that the object words, as opposed to manner or path, were learned first. Therefore, in the third experiment, participants in the staged condition first heard the last (path) word, then the last two words (manner-path), and finally all three words. Again there was a significant overall advantage for the staged input condition. In this case, path words were learned better than object and manner words in both conditions. Although the overall advantage for the starting small condition reached significance, none of the tests isolating the three word types were significant. These results therefore do not rule out the hypothesis that participants in the staged input condition were only better on the words trained in isolation. Nevertheless, it is possible that these effects would reach significance with more participants.

The third experiment also added a test of the participants’ sensitivity to morphology. Novel words were created by pairing an unfamiliar stem with one of the three familiar word endings (–ju, –gop, or –tig). Each word was first paired with an event that was novel in all three important dimensions. Participants were then shown a second event that differed from the first in a single dimension and were instructed to respond “Yes” if the second event was also an example of the new word. In other words, participants responded “Yes” if the two events didn’t differ on the feature associated with the word ending. Kersten and Earles again found a significant advantage for the starting small condition.

However, there is some reason to question the results of this experiment. With the path-word ending, there was clearly no difference between the two conditions. In three of the four other conditions, participants performed below chance levels, significantly so in one of them. The finding of significantly below chance performance leads one to suspect that participants may have been confused by the task and that some participants may have incorrectly been responding “Yes” if the events did differ on the feature associated with the word ending.

Even if we accept that there was an across-the-board advantage for the staged input condition in these experiments, we should be cautious in generalizing to natural language learning. The language used in this study was missing a number of important features of natural language. Word order and morphology were entirely redundant and, more importantly, conveyed no meaning. Words always appeared in the same position in every sentence and were always paired with the same ending. In this simple language, there wasn’t a productive syntax or morphology, just a conventional word order. Participants were thus free to use strategies such as ignoring word order and morphological information, much as they learned to ignore meaningless details of the events.

Participants in the full sentence condition were therefore at a potential disadvantage. Any effective, general learning mechanism in a similar situation would devote time and resources to testing the information carried in all aspects of the events and sentences, including morphology and word order. In this case, those features happened to convey no additional information beyond that provided by the word stems themselves, placing participants who paid attention to word order and morphology at a disadvantage. However, these factors play critical roles in shaping the meaning of natural language sentences, and devoting time and resources to learning them is useful, and even necessary. The staged input learner, on the other hand, will have traded off exposure to syntax for more exposure to individual words and their meanings, which is not clearly advantageous. A stronger test of the importance of staged input would be to measure comprehension or production of whole, novel sentences in a language with some aspects of meaning carried exclusively by syntax and morphology.

Perhaps tellingly, some studies cited by Kersten and Earles comparing children learning French in immersive programs with and without prior exposure to more traditional, elementary French-as-a-second-language courses found either no difference or an advantage for children in the purely immersive programs (Shapson & Day, 1982; Day & Shapson, 1988; Genesee, 1981). Although these studies may not have adequately controlled for age of exposure, intelligence, or motivational factors, it certainly is suggestive that staged input may be less effective than immersion in learning natural languages.

A final point of criticism of the Kersten and Earles (2001) paper is their desire to equate the effects of staged input with those of internal memory limitations. There is little reason to believe that these two factors will have similar effects. Teaching the meanings of isolated words is bound to be helpful, provided that it is only a supplement to exposure to complete language, is relatively noise free, and makes up a relatively small percentage of linguistic experience. However, memory limitations do not result in the same simple pairing of words and their meanings. At best, memory limitations have the effect of pairing isolated words or phrases to noisy, randomly sampled
portions of a complex meaning. The actual part of the complex meaning contributed by the isolated word may be partially or completely lost and some extraneous information may be retained. Learning the correct pairings of words to meanings is no easier in this case than when faced with the full, complex meaning.

A more appropriate, though still not entirely sufficient, test of the benefit of memory limitations in the context of Kersten and Earles’s design would be to test randomly selected words in the isolated word condition, rather than always the first or last word of the sentence. These should be paired with scenes with randomly selected details, such as the identity of the moving object or the location of the stationary object, obscured. Furthermore, tests should not be performed on familiar sentences but on novel ones, as the potential problem in starting with complete sentences is that adults will memorize them as wholes and will not generalize well to novel ones. It would be quite interesting if initial training of this form, which is more like the presumed effect of poor attention or working memory, was beneficial in the comprehension or production of novel sentences.

The actual claim of Newport’s less-is-more hypothesis does not concern staged input. It is that memory or other internal limitations are the key factor in enabling children to learn language more effectively. Evidence for or against the benefit of staged input should be clearly distinguished from evidence concerning the effect of internal cognitive impairments.

7 General Discussion

We believe that studying the way in which connectionist networks learn languages is particularly helpful in building an understanding of human language acquisition. The intuition behind the importance of starting with properly chosen simplified inputs is that it helps the network to focus immediately on the more basic, local properties of the language, such as lexical syntactic categories and simple noun-verb dependencies. Once these are learned, the network can more easily progress to harder sentences and further discoveries can be based on these earlier representations.

Our simulation results indicate, however, that such external manipulation of the training corpus is unnecessary for effective language learning, given appropriate training parameters. The reason, we believe, is that recurrent connectionist networks already have an inherent tendency to extract simple regularities first. A network does not begin with fully formed representations and memory; it must learn to represent and remember useful information under the pressure of performing particular tasks, such as word prediction. As a simple recurrent network learns to represent information about an input using its hidden units, that information then becomes available as context when processing the next input. If this context provides important constraints on the prediction generated by the second input, the context-to-hidden connections involved in retaining that information will be reinforced, leading the information to be available as context for the third input, and so on.

In this way, the network first learns short-range dependencies, starting with simple word transition probabilities for which no deeper context is needed. At this stage, the long-range constraints effectively amount to noise which is averaged out across a large number of sentences. As the short-range dependencies are learned, the relevant information becomes available for learning longer-distance dependencies. Very long-distance dependencies, such as grammatical constraints across multiple embedded clauses, still present a problem for this type of network in any training regimen. Information must be maintained across the intervening sequence to allow the network to pick up on such a dependency. However, there must be pressure to maintain that information or the hidden representations will encode more locally relevant information. Long-distance dependencies are difficult because the network will tend to discard information about the initial cue before it becomes useful. Adding semantic dependencies to embedded clauses aids learning because the network then has an incentive to continue to represent the main noun, not just for the prediction of the main verb, but for the prediction of some of the intervening material as well (see also Cleeremans et al., 1989).10

It might be thought that starting with simplified inputs would facilitate the acquisition of the local dependencies so that learning could progress more rapidly and effectively to handling the longer-range dependencies. There is, however, a cost to altering the network’s training environment in this way. If the network is exposed only to simplified input, it may develop representations which are overly specialized for capturing only local dependencies. It then becomes difficult for the network to restructure these representations when confronted with harder problems whose dependencies are not restricted to those in the simplified input. In essence, the network is learning in an environment with a nonstationary probability distribution over inputs. In extreme form, such nonstationarity can lead to so-called catastrophic interference, in which training exclusively on a new task can dramatically impair

10 It should be pointed out that the bias towards learning short- before long-range dependencies is not specific to simple recurrent networks; backpropagation-through-time and fully recurrent networks also exhibit this bias. In the latter case, learning long-range dependencies is functionally equivalent to learning an input-output relationship across a larger number of intermediate processing layers (Rumelhart et al., 1986), which is more difficult than learning across fewer layers when the mapping is simple (see Bengio et al., 1994; Lin et al., 1996).
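The bias towards learning strong local regularities before weak long-distance ones can be illustrated with a toy gradient learner: a single logistic unit given a strongly predictive "local" cue and a weakly predictive "distant" cue. This is a hand-rolled sketch with illustrative reliabilities (0.9 and 0.6), not one of the recurrent network simulations discussed here:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
w_local, w_distant, bias = 0.0, 0.0, 0.0
lr = 0.1

# Each trial: predict a binary target from two cues. The local cue
# agrees with the target 90% of the time; the distant cue only 60%,
# so early in learning it is mostly noise.
for _ in range(2000):
    target = rng.random() < 0.5
    local = target if rng.random() < 0.9 else not target
    distant = target if rng.random() < 0.6 else not target
    x1, x2, y = float(local), float(distant), float(target)
    p = sigmoid(w_local * x1 + w_distant * x2 + bias)
    err = y - p  # gradient of the log-likelihood for a logistic unit
    w_local += lr * err * x1
    w_distant += lr * err * x2
    bias += lr * err

# The reliable local cue dominates the learned weights
print(w_local, w_distant)
```

After training, the weight on the reliable local cue ends up much larger than the weight on the noisy distant cue; in a recurrent network, the analogous pressure favors short-range dependencies early in learning.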
performance on a previously learned task that is similar to but inconsistent with the new task (see, e.g., McClelland, McNaughton, & O’Reilly, 1995; McCloskey & Cohen, 1989).

A closely related phenomenon has been proposed by Marchman (1993) to account for critical period effects in the impact of early brain damage on the acquisition of English inflectional morphology. Marchman found that the longer a connectionist system was trained on the task of generating the past tense of verbs, the poorer it was at recovering from damage. This effect was explained in terms of the degree of entrenchment of learned representations: As representations become more committed to a particular solution within the premorbid system, they become less able to adapt to relearning a new solution after damage. More recently, McClelland (2001) and Thomas and McClelland (1997) have used entrenchment-like effects within a Kohonen network (Kohonen, 1984) to account for the apparent inability of non-native speakers of a language to acquire native-level performance in phonological skills, and why only a particular type of retraining regimen may prove effective (see also Merzenich et al., 1996; Tallal et al., 1996). Thus, there are a number of demonstrations that connectionist networks may not learn as effectively when their training environment is altered significantly, as is the case in the incremental training procedure employed by Elman (1991).

There has been much debate on the extent to which children experience syntactically simplified language (see, e.g., Richards, 1994; Snow, 1994, 1995, for discussion). While child-directed speech is undoubtedly marked by characteristic prosodic patterns, there is also evidence that it tends to consist of relatively short, well-formed utterances and to have fewer complex sentences and subordinate clauses (Newport, Gleitman, & Gleitman, 1977; Pine, 1994). The study by Newport and colleagues is instructive here, as it is often interpreted as providing evidence that child-directed speech is not syntactically simplified. Indeed, these researchers found no indication that mothers carefully tune their syntax to the current level of the child or that aspects of mothers’ speech styles have a discernible effect on the child’s learning. Nonetheless, it was clear that child-directed utterances, averaging 4.2 words, were quite unlike adult-directed utterances, averaging 11.9 words. Although child-directed speech included frequent deletions and other forms that are not handled easily by traditional transformational grammars, whether or not these serve as complexities to the child is debatable.

If children do, in fact, experience simplified syntax, it might seem as if our findings suggest that such simplifications actually impede children’s language acquisition. We do not, however, believe this to be the case. The simple recurrent network simulations have focused on the acquisition of syntactic structure (with some semantic constraints), which is just a small part of the overall language learning process. Among other things, the child must also learn the meanings of words, phrases, and longer utterances in the language. This process is certainly facilitated by exposing the child to simple utterances with simple, well-defined meanings. We support Newport and colleagues’ conclusion that the form of child-directed speech is governed by a desire to communicate with the child and not to teach syntax. However, we would predict that language acquisition would ultimately be hindered if particular syntactic or morphological constructions were avoided for extended periods in the input to either a child or adult learner.

But the main implication of the less-is-more hypothesis is not that staged input is necessary, but that the child’s superior language learning ability is a consequence of the child’s limitations. This might be interpreted in a variety of ways. Goldowsky and Newport (1993), Elman (1993), Kareev, Lieberman, and Lev (1997), and Cochran, McDonald, and Parault (1999) suggest that the power of reduced memory is that it leads to information loss which can be beneficial in highlighting simple contingencies in the environment. This, it is suggested, encourages analytical processing over rote memorization. We have argued, to the contrary, that in a range of learning procedures, from simple decision making models to recurrent connectionist networks, such random information loss is of no benefit and may be harmful. Although it sometimes has the effect of isolating meaningful analytical units, it more often destroys those units or creates false contingencies.

Another take on the less-is-more hypothesis is that a learning system can benefit by being differentially sensitive to local information or simple input/output relationships. This we do not deny. In fact, it seems difficult to conceive of an effective learning procedure that is not better able to learn simple relationships. A related argument is that when the mapping to be learned is componential, a learning procedure specialized for learning such mappings, as opposed to one specialized for rote memorization, is to be preferred. This, too, we support. However, we suggest that neural networks—and, by possible implication, the human brain—are naturally better at learning simple or local contingencies and regular, rather than arbitrary, mappings. But this is true of learning in experienced networks or adults, just as it is true of learning in randomized networks or children. The general architecture of the system is the key factor that enables learning of componentiality, not the child’s limited working memory.

Simulating poor working memory by periodically disrupting a network’s feedback during the early stages of learning has relatively little effect because, at that point, the network has not yet learned to use its memory effec-
19
Rohde and Plaut Less is Less in Language Acquisition

tively. As long as memory is interfered with less as the guage because his or her resources are initially uncom-
network develops, there will continue to be little impact mitted, allowing neurons to be more easily recruited and
on learning. In a sense, early interference with the net- the response characteristics of already participating neu-
work’s memory is superfluous because the untrained net- rons to be altered. Additionally, the child is less hindered
work is naturally memory limited. One might say that is by interference from prior learned representations. This
the very point of the less-is-more argument, but it is miss- idea, which accords with Quartz and Sejnowski’s (1997)
ing a vital component. While we accept that children have theory of neural constructivism, is certainly not a new
limited cognitive abilities, we don’t see these limitations one, but is one that seems to remain largely ignored (al-
as a source of substantial learning advantage to the child. though see Marchman, 1993; McClelland, 2001). On this
Both are symptoms of the fact that the child’s brain is in view, it seems unlikely that limitations in a child’s cog-
an early stage in development at which its resources are nitive abilities are of significant benefit in language ac-
largely uncommitted, giving it great flexibility in adapt- quisition. While adults’ greater memory and analytical
ing to the particular tasks to which it is applied. abilities lead to faster initial learning, these properties are
not themselves responsible for the lower asymptotic level
of performance achieved, relative to children.
7.1 Late Exposure and Second Languages Along similar lines, the detrimental impact of de-
Elman’s (1991, 1993) computational findings of the im- layed acquisition of a first language may not implicate a
portance of starting small in language acquisition, as well language-specific system that has shut down. Rather, it
as the other studies reviewed here, have been influen- may be that, in the absence of linguistic input, those areas
tial in part because they seemed to corroborate empiri- of the brain which normally become involved in language
cal observations that language acquisition is ultimately may have been recruited to perform other functions (see,
more successful the earlier in life it is begun (see Long, e.g., Merzenich & Jenkins, 1995, for relevant evidence
1990). While older learners of either a first or a sec- and discussion). While it is still sensible to refer to a crit-
ond language show initially faster acquisition, they tend ical or sensitive period for the acquisition of language, in
to plateau at lower overall levels of achievement than do the sense that it is important to start learning early, the
younger learners. The importance of early language ex- existence of a critical period need not connote language-
posure has been cited as an argument in favor of either acquisition devices or genetically prescribed maturational
an innate language acquisition device which operates se- schedules.
lectively during childhood or, at least, genetically pro- Indeed, similar critical periods exist for learning to play
grammed maturation of the brain which facilitates lan- tennis or a musical instrument. Rarely if ever does an indi-
guage learning in childhood (Johnson & Newport, 1989; vidual attain masterful abilities at either of these pursuits
Newport, 1990; Goldowsky & Newport, 1993). It has unless he or she begins at an early age. And certainly in
been argued that the fact that late first- or second-language the case of learning the piano or violin, remarkable abil-
learners do not reach full fluency is strong evidence ities can be achieved by late childhood and are thus not
for “maturationally scheduled language-specific learning simply the result of the many years of practice afforded
abilities” (Long, 1990, p. 259, emphasis in the original). to those who start early. One might add that no species
We would argue, however, that the data regarding late other than humans is capable of learning tennis or the vi-
language exposure can be explained by principles of olin. Nevertheless, we would not suppose that these abili-
learning in connectionist networks without recourse to ties rely upon domain-specific innate mechanisms or con-
maturational changes or innate devices. Specifically, adult straints.
learners may not normally achieve fluency in a second While general connectionist principles may explain the
language because their internal representations have been overall pattern of results in late language learning, con-
largely committed to solving other problems—including, siderable work is still needed to demonstrate that this ap-
in particular, comprehension and production of their na- proach is sufficient to explain the range of relevant de-
tive language (see Flege, 1992; Flege, Munro, & MacKay, tailed findings. For example, it appears that vocabulary is
1995). The aspects of an adult’s second language that are more easily acquired than morphology or syntax, and that
most difficult may be those that directly conflict with the second language learners have variable success in master-
learned properties of the native language. For example, ing different syntactic rules (Johnson & Newport, 1989).
learning the inflectional morphology of English may be In future work, we intend to develop simulations that in-
particularly difficult for adult speakers of an isolating lan- clude comprehension and production of more naturalistic
guage, such as Chinese, which does not inflect number or languages, in order to extend our approach to address the
tense. empirical issues in late second-language learning and to
By contrast to the adult, the child ultimately achieves allow us to model a wider range of aspects of language
a higher level of performance on a first or second lan- acquisition more directly.
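The kind of memory-interference manipulation discussed above can be sketched concretely. The following is only an illustrative reconstruction, not the architecture or training procedure used in the actual simulations: it implements a minimal Elman-style simple recurrent network whose feedback (context) layer is periodically reset during the early phase of training. The layer sizes, the reset schedule, and the delta-rule update restricted to the output weights are all simplifying assumptions made for the sake of the sketch.

```python
import numpy as np

class SimpleRecurrentNetwork:
    """Minimal Elman-style SRN: the hidden state feeds back as context
    on the next time step. Periodically zeroing the context simulates
    a limited working memory."""

    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hid, n_in))
        self.W_ctx = rng.normal(0.0, 0.1, (n_hid, n_hid))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hid))
        self.context = np.zeros(n_hid)

    def reset_context(self):
        """Disrupt the network's feedback by wiping its memory."""
        self.context = np.zeros_like(self.context)

    def step(self, x):
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h  # hidden state becomes the next context
        return 1.0 / (1.0 + np.exp(-(self.W_out @ h)))  # sigmoid output

def train_with_early_interference(net, sequence, epochs=20,
                                  reset_period=3, lr=0.5):
    """Interfere with memory (reset the context every `reset_period`
    steps) only during the first half of training, then lift the
    restriction, mimicking an incremental-memory regime. The update is
    a crude delta rule on the output weights only."""
    for epoch in range(epochs):
        net.reset_context()
        for t, (x, target) in enumerate(sequence):
            if epoch < epochs // 2 and t > 0 and t % reset_period == 0:
                net.reset_context()  # early, periodic memory disruption
            y = net.step(x)
            delta = (target - y) * y * (1.0 - y)
            net.W_out += lr * np.outer(delta, net.context)
    return net

# Toy task requiring one step of memory: output the previous input bit.
bits = [0, 1, 1, 0, 1, 0, 0, 1] * 4
sequence = [(np.array([float(b)]), np.array([float(prev)]))
            for prev, b in zip([0] + bits[:-1], bits)]
net = train_with_early_interference(SimpleRecurrentNetwork(1, 8, 1), sequence)
```

Because the untrained network's context carries no useful information, the early resets change little; removing them later, once the hidden units have begun to encode the previous input, is what allows the recurrent weights to be exploited at all, which is the point the text makes about untrained networks being naturally memory limited.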
7.2 Conclusion

We seem to be in agreement with most proponents of the less-is-more hypothesis in our belief that the proper account of human language learning need not invoke the existence of innate language-specific learning devices. However, we depart from them in our skepticism that limited cognitive resources are themselves of critical importance in the ultimate attainment of linguistic fluency. The simulations reported here, principally those inspired by Elman's language-learning work, call into question the proposal that staged input or limited cognitive resources are necessary, or even beneficial, for learning. We believe that the cognitive limitations of children are only advantageous for language acquisition to the extent that they are symptomatic of a system that is unorganized and inexperienced but possesses great flexibility and potential for future adaptation, growth and specialization.

Acknowledgements

This research was supported by NIMH Program Project Grant MH47566 (J. McClelland, PI), and by an NSF Graduate Fellowship to the first author. Correspondence regarding this article may be sent either to Douglas Rohde (dr@cs.cmu.edu), School of Computer Science, Carnegie Mellon University, 5000 Fifth Avenue, Pittsburgh, PA 15213–3890, USA, or to David Plaut (plaut@cmu.edu), Mellon Institute 115–CNBC, 4400 Fifth Avenue, Pittsburgh, PA 15213–2683, USA.

References

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.

Bialystok, E., & Hakuta, K. (1999). Confounded age: Linguistic and cognitive factors in age differences for second language acquisition. In D. P. Birdsong (Ed.), Second language acquisition and the critical period hypothesis (pp. 161–181). Mahwah, NJ: Erlbaum.

Birdsong, D. (1999). Introduction: Whys and why nots of the critical period hypothesis for second language acquisition. In D. P. Birdsong (Ed.), Second language acquisition and the critical period hypothesis (pp. 1–22). Mahwah, NJ: Erlbaum.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372–381.

Cochran, B. P., McDonald, J. L., & Parault, S. J. (1999). Too smart for their own good: The disadvantage of a superior processing capacity for adult language learners. Journal of Memory and Language, 41, 30–58.

Day, E. M., & Shapson, S. (1988). A comparison study of early and late French immersion programs in British Columbia. Canadian Journal of Education, 13, 290–305.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–225.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.

Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.

Flege, J. E. (1992). Speech learning in a second language. In C. A. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, implications (pp. 565–604). Timonium, MD: York Press.

Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995). Factors affecting strength of perceived foreign accent in a second language. Journal of the Acoustical Society of America, 97, 3125–3134.

Gallaway, C., & Richards, B. J. (Eds.). (1994). Input and interaction in language acquisition. London: Cambridge University Press.

Genesee, F. (1981). A comparison study of early and late second language learning. Canadian Journal of Behavioral Sciences, 13, 115–128.

Goldowsky, B. N., & Newport, E. L. (1993). Modeling the effects of processing limitations on the acquisition of morphology: The less is more hypothesis. In E. Clark (Ed.), The proceedings of the 24th annual Child Language Research Forum (pp. 124–138). Stanford, CA: Center for the Study of Language and Information.

Johnson, J. S., & Newport, E. L. (1989). Critical period effects in second language learning: The influence of maturational state on the acquisition of English as a second language. Cognitive Psychology, 21, 60–99.

Kareev, Y. (1995). Through a narrow window: Working memory capacity and the detection of covariation. Cognition, 56, 263–269.

Kareev, Y., Lieberman, I., & Lev, M. (1997). Through a narrow window: Sample size and the perception of correlation. Journal of Experimental Psychology: General, 126(3), 278–287.

Kersten, A. W., & Earles, J. L. (2001). Less really is more for adults learning a miniature artificial language. Journal of Memory and Language, 44, 250–273.

Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.

Lenneberg, E. H. (1967). Biological foundations of language. New York: Wiley.
Lin, T., Horne, B. G., & Giles, C. L. (1996). How embedded memory in recurrent neural network architectures helps learning long-term temporal dependencies (Tech. Rep. Nos. CS-TR-3626, UMIACS-TR-96-28). College Park, MD: University of Maryland.

Long, M. (1990). Maturational constraints on language development. Studies in Second Language Acquisition, 12, 251–285.

Luce, R. D. (1986). Response times. New York: Oxford University Press.

Marchman, V. A. (1993). Constraints on plasticity in a connectionist model of the English past tense. Journal of Cognitive Neuroscience, 5, 215–234.

McClelland, J. L. (2001). Failures to learn and their remediation: A competitive, Hebbian approach. In J. L. McClelland & R. S. Siegler (Eds.), Mechanisms of cognitive development: Behavioral and neural perspectives. Mahwah, NJ: Erlbaum.

McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.

McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In G. H. Bower (Ed.), The psychology of learning and motivation (pp. 109–165). New York: Academic Press.

McNeill, D. (1970). The acquisition of language: The study of developmental psycholinguistics. New York: Harper & Row.

Merzenich, M. M., & Jenkins, W. M. (1995). Cortical plasticity, learning and learning dysfunction. In B. Julesz & I. Kovacs (Eds.), Maturational windows and adult cortical plasticity (pp. 247–272). Reading, MA: Addison-Wesley.

Merzenich, M. M., Jenkins, W. M., Johnson, P., Schreiner, C., Miller, S. L., & Tallal, P. (1996). Temporal processing deficits of language-learning impaired children ameliorated by training. Science, 271, 77–81.

Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11–28.

Newport, E. L., Gleitman, H., & Gleitman, L. R. (1977). Mother, I'd rather do it myself: Some effects and non-effects of maternal speech style. In C. E. Snow & C. A. Ferguson (Eds.), Talking to children: Language input and acquisition (pp. 109–149). Cambridge, England: Cambridge University Press.

Pine, J. M. (1994). The language of primary caregivers. In C. Gallaway & B. J. Richards (Eds.), Input and interaction in language acquisition (pp. 38–55). London: Cambridge University Press.

Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences, 20, 537–596.

Richards, B. J. (1994). Child-directed speech and influences on language acquisition: Methodology and interpretation. In C. Gallaway & B. J. Richards (Eds.), Input and interaction in language acquisition (pp. 74–106). London: Cambridge University Press.

Rohde, D. L. T. (1999). The Simple Language Generator: Encoding complex languages with simple grammars (Tech. Rep. No. CMU-CS-99-123). Pittsburgh, PA: Carnegie Mellon University, Department of Computer Science.

Rohde, D. L. T., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72(1), 67–109.

Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. In Y. Chauvin & D. Rumelhart (Eds.), Back-propagation: Theory, architectures, and applications (pp. 1–34). Hillsdale, NJ: Erlbaum.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.

Shapson, S. M., & Day, E. M. (1982). A comparison of three late immersion programs. Alberta Journal of Educational Research, 28, 135–148.

Snow, C. E. (1994). Beginning from baby talk: Twenty years of research on input and interaction. In C. Gallaway & B. J. Richards (Eds.), Input and interaction in language acquisition (pp. 3–12). London: Cambridge University Press.

Snow, C. E. (1995). Issues in the study of input: Finetuning, universality, individual and developmental differences, and necessary causes. In P. Fletcher & B. MacWhinney (Eds.), The handbook of child language (pp. 180–193). Oxford: Blackwell.

Sokolov, J. L. (1993). A local contingency analysis of the fine-tuning hypothesis. Developmental Psychology, 29, 1008–1023.

Tallal, P., Miller, S. L., Bedi, G., Byma, G., Wang, X., Nagarajan, S. S., Schreiner, C., Jenkins, W. M., & Merzenich, M. M. (1996). Language comprehension in language-learning impaired children improved with acoustically modified speech. Science, 271, 81–84.

Thomas, A., & McClelland, J. L. (1997). How plasticity can prevent adaptation: Induction and remediation of perceptual consequences of early experience (abstract 97.2). Society for Neuroscience Abstracts, 23, 234.