Corpora in grammatical studies

Corpus Linguistics
Richard Xiao
Aims of this session
• Lecture
– Corpus-based grammar: Scope and principles
– The state of the art of using corpora in grammatical studies
– Using corpora to improve grammatical descriptions:
Infinitival complementation of help
• Lab session
– Position of if-clauses in ICE-GB
Corpus revolution
• Like lexicographic and lexical studies, grammar is another
area which has frequently exploited corpus data
– A balanced representative corpus provides a reliable basis for
quantifying grammatical categories and syntactic features
– It is also useful in testing hypotheses derived from
grammatical theory
• There has been increasing consensus that non-corpus-
based grammars can contain biases while corpora can
help to improve grammatical descriptions (McEnery &
Xiao 2005)
• Corpora have had a strong influence on recently
published reference grammar books (at least for English)
– ‘even people who have never heard of a corpus are using the
product of corpus-based investigation’ (Hunston 2002: 96)
Principles of corpus grammar (Leech 2000)
• Data-oriented grammar
– allowing the combination of a quantitative and a qualitative
description of the data
– a grammar accountable to observed data of attested language use
• Functional Grammar
– establishing a relation between phenomena that are external to the
language system and system-internal phenomena (form vs. meaning)
– their explanation of grammar in terms of the wider context of human
psychology and behaviour
• Variety Grammar
– allowing the description of the full range of varieties (e.g.
conversation, fiction writing, news writing, academic writing)
• Integrative Grammar
– allowing an integrated description of syntactic, lexical, and discourse
– close to communicative grammar, as opposed to the ‘autonomous syntax’
view of grammar
A new milestone in English grammar
• Longman Grammar of Spoken and Written
English (i.e. LGSWE, Biber et al 1999)
– A new milestone following Quirk et al (1985)
Comprehensive Grammar
– Based entirely on the 40-million-word Longman
Spoken and Written English Corpus
– Giving “a thorough description of English
grammar, which is illustrated throughout with real
corpus examples, and which gives equal attention
to the ways speakers and writers actually use
these linguistic resources” (Biber et al 1999: 45)
Features of corpus-based grammars
• Paying attention to the differences in speech
and writing
• Taking account of register/genre variations
• Providing frequency information
• Treating lexis as an integral part of
grammatical description
• Giving authentic examples
Some examples of corpus grammars
• Corpus-based English grammars focusing on spoken English
– Carter, R. and McCarthy, M. (1997) Exploring Spoken English.
Cambridge: Cambridge University Press.
– McCarthy, M. (1998) Spoken Language and Applied Linguistics.
Cambridge: Cambridge University Press.
Some examples of corpus grammars
• Corpus-based grammars with a
focus on lexis
– Francis, G., Hunston, S. and
Manning, E. (1996) Collins
COBUILD Grammar Patterns 1:
Verbs. London: HarperCollins.
– Francis, G., Hunston, S. and
Manning, E. (1998) Collins
COBUILD Grammar Patterns 2:
Nouns and Adjectives. London: HarperCollins.
– Hunston, S. and Francis, G. 2002.
Pattern Grammar. Amsterdam:
John Benjamins.
Some examples of corpus grammars
• Corpus-based grammar
taking account
of register variation
– Biber, D., Johansson S.,
Leech G., Conrad S. and
Finegan, E. (1999)
Longman Grammar of
Spoken and Written
English. London: Longman.
A case study
• Using corpora to improve grammatical descriptions
– Infinitival complementation of HELP
A commonly used word
• In the 100-million-word BNC
– 245th most frequent word
• 529 instances per million words
– 72nd most frequent verb as a lemma
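The per-million figure above is a normalised frequency. A minimal Python sketch of the conversion follows. The function names are illustrative, and the raw count is back-calculated on the assumption that the corpus holds exactly 100 million words, so it is only an approximation.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


def approx_raw_count(rate_per_million: float, corpus_size: int) -> int:
    """Back-calculate an approximate raw count from a per-million rate."""
    return round(rate_per_million * corpus_size / 1_000_000)


# Assumption: a corpus of exactly 100 million words (the BNC is roughly this size).
BNC_SIZE = 100_000_000
estimated = approx_raw_count(529, BNC_SIZE)
print(estimated, per_million(estimated, BNC_SIZE))
```

Normalising in this way is what makes frequencies comparable across corpora of different sizes.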
A verb with a distinctive syntax
• English has two main-clause verbs that can control either a full or a bare
infinitive: dare and help (Biber et al 1999: 735)
– The choice between a full and bare infinitive is only available when dare is
used as a lexical verb (as a modal verb, always followed by a bare infinitive)
• HELP is the only English verb that can control either a full or bare infinitive
AND occur either with or without an intervening NP
– HELP to V
• Perhaps the book helped to prevent things from getting even worse.
– HELP NP to V
• I thought I could help him to forget.
– HELP V
• Savings can help finance other Community projects.
– HELP NP V
• We helped him get to his feet and into the chair.
• Dare can occur with or without an intervening NP, but it cannot control a
bare infinitive when such an intervening NP is present
– Ernest <…> dared Archie to punch him in the stomach.
A unique verb of great interest
• A verb that has often been given prominence in
textbooks, grammars and dictionaries
– E.g. Chalker (1984); Murphy (1985); Quirk et al (1972,
1985); Eastwood (1992); Biber et al (1999)
• A verb that has aroused much interest and debate
– Language variety
– Language change
– Register variation
– Semantic distinction
– Syntactic conditions
The corpora
Language variety: AmE vs. BrE
• Bare infinitives are much more
common in AmE (cf. Biber et al 1999)
– 80% (AmE) vs. 52% (BrE)
– LL=23 (1 df), p<0.001
• British preference for full infinitives
– You’re going to help me make
to make a birthday cake for
Jim remember. (BNC)
• A construction of American
provenance, which has
penetrated rapidly into BrE
– Zandvoort (1966): ‘except in
American English, however, to
help usually takes an infinitive
with to’
• No longer valid
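The slide reports an LL score with 1 df but does not show how it was obtained. One common choice in corpus linguistics is the log-likelihood ratio (G2) over a 2x2 table of bare vs. full infinitives in two corpora. The sketch below assumes that test; the function name and the counts are hypothetical and are not the data behind the slide.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 for the 2x2 table [[a, b], [c, d]].

    Rows are corpora; columns are bare vs. full infinitives.
    """
    total = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


# Hypothetical counts for illustration only; NOT the data behind the slide.
ll = log_likelihood(150, 50, 100, 100)
print(f"LL = {ll:.2f}; significant at p < 0.05: {ll > 3.84}")
```

With 1 degree of freedom, an LL score above 3.84 is significant at the 0.05 level.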
Language change:
• Changing labels for bare infinitives
– (OED,1933) “vulgar” -> (Vallins
1951) “not seriously questioned
now…” -> (Mair 1995) “lost the
informal ring”
• An increase in the proportions of
bare infinitives over the three
decades in both AmE and BrE
– AmE: 68% -> 82% (+14%)
• LL=10.6 (1 df), p=0.001
– BrE: 22% -> 60% (+38%)
• LL=47.5 (1 df), p<0.001
• A greater shift towards the use of
bare infinitives in BrE because AmE
was already more “tolerant” of bare
infinitives in the 1960s
Spoken vs. written
• Bare infinitives are slightly more
frequent in speech than in
writing, in both AmE and BrE
• The differences are not
statistically significant
– AmE: LL=2.71 (1 df), p=0.10
– BrE: LL=2.16 (1 df), p=0.142
• No predictable distribution
pattern for bare infinitives in 15
written genres
– Common in some formal genres
(e.g. official documents) but
infrequent in other formal
genres (e.g. academic writing)
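The p-values quoted with the LL scores above can be checked from the chi-square distribution with 1 degree of freedom, whose upper tail has a closed form in terms of the complementary error function. A minimal sketch, assuming the LL scores are referred to that distribution (the function name is illustrative):

```python
import math


def p_from_ll(ll: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))


# LL scores as reported on the slide for speech vs. writing.
for label, ll in (("AmE", 2.71), ("BrE", 2.16)):
    print(f"{label}: LL={ll}, p={p_from_ll(ll):.3f}")
```

Both values stay above 0.05, which is why the spoken/written difference is reported as not significant.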
Semantic distinction
• The debate has a long history
• Some “pre-corpus” arguments
– Wood (1962: 107-8): to ‘can be omitted only when the
helper does some of the work, or shares in the activity
jointly with the person that is helped’ – Wood’s
“unacceptable” examples
• These tablets will help you sleep.
– But tablets do not sleep
• Writing out a poem will help you learn it.
– But writing does no learning
– According to Quirk et al (1972: 841), the choice ‘is
conditioned by the subject’s involvement’
• With a bare infinitive, ‘external help is called in’
• With a full infinitive, ‘assistance is outside the action proper’
Semantic distinction
– Dixon (1991)
• John helped Mary eat the pudding
– John ate part of the pudding as Mary did
• John helped Mary to eat the pudding
– John fed the pudding to Mary
– Duffley (1992)
• A bare infinitive evokes helping as ‘direct or active involvement’
• … help to V evokes help as a condition which enables the person
being helped to realize the event
– Lu (1996: 813)
• When the subject of ‘help’ does not take part in the helping
activity, the infinitive must take to
– The book helped me to see the truth.
– What do your intuitions tell you?
Semantic distinction
• Not reported in more recent corpus-based works (e.g.
Longman 1993/1996; Collins 1995; Biber et al 1999)
– Quirk et al (1985) dropped the argument for semantic distinction
– Collins CoBuild Dictionary
• “If you help someone, you make it easier for them to do something, for
example by doing part of the work for them or by giving them advice or
• It is not always easy or even possible to make a distinction
between whether or not the helper actually takes part in the
helping activity
• Counter examples are abundant in corpora
– I help people stop smoking. (FLOB)
– oh it says if you have a dose last thing at night it helps you sleep. (BNC)
Syntactic condition:
Intervening NP

• The previous claim (Lind 1983; Kjellmer 1985; Biber et al
1999) that an intervening NP increases the proportion of bare
infinitives is only partly supported by our corpora
– Only valid in AmE, both written and spoken
– Unpredictable results, no statistical significance in BrE
Syntactic condition:
Intervening adverbial
• Lind (1983) claims that ‘an intervening adverbial will
preclude omission of to’
– The whisky helped me not to stagger under this blow.
• This claim is ungrounded, esp. in AmE (CPSA)
• Some counter examples
– So, to help people not jump all over it as soon as they see it
<…> (CPSA)
– <…> that would even help perhaps focus some of those
responses. (CPSA)
– Mr. Clinton <…> also helped, to a much lesser degree, organize
a huge march in Washington <…> (Frown)
– ...helping dramatically reduce poverty.
(Time Magazine 2005/12/05)
– Now my helping digitally restore the Disney films her grandfather
worked on. (Time Magazine 2006/04/10)
[Figure: bar chart of bare-inf vs. full-inf by corpus]
Syntactic condition:
to preceding help
• To preceding help is a decisive syntactic condition that
encourages the omission of to (cf. Lind 1983; Kjellmer 1985;
Biber et al 1999)
– HELP (lemma): 60%
– help (finite verb): 65%
– to help (infinitive): 88% (+23%)
• Consecutive repetition of to tends to be avoided on the grounds
of euphony (cf. Lind 1983)
– They took on an estate manager and wine-maker to help run the
business. (FLOB)
• In the BNC, to help V (2,161) is 17 times as frequent as
to help to V (127)
[Figure: bar chart of bare-inf vs. full-inf for HELP, help and to help]
• A statistical norm, not
categorical distinction
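The "17 times" figure follows directly from the two BNC counts on this slide. A minimal sketch of the arithmetic; the variable names are illustrative, and the share of bare infinitives is a derived figure that the slide does not state (it covers the BNC only, not the corpora behind the 88%).

```python
# Counts as reported on the slide for the BNC.
to_help_bare = 2161  # "to help V"
to_help_full = 127   # "to help to V"

ratio = to_help_bare / to_help_full
bare_share = to_help_bare / (to_help_bare + to_help_full)

print(f"bare/full ratio: {ratio:.1f}")
print(f"share of bare infinitives after 'to help': {bare_share:.1%}")
```

The 127 counterexamples are what make this a statistical norm rather than a categorical rule.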
Syntactic condition:
Passive voice
• Palmer (1965: 169) observes that ‘passive occurs <…> only
with to: They were helped to do it.’
• All of the 9 instances of passivized HELP in our corpora take a
full infinitive with no exception
• No instance of BE helped V is found in the whole BNC or the
100-million-word Time corpus of AmE
• Explanation (?): An analogy can be drawn between HELP and
verbs such as MAKE, LET, SEE and HEAR: oC = bare infinitive
– The infinitive shifts from oC to sC in passive transformation
• So they should be made to bring their prices down. (BNC)
– So the authorities should make them (*to) bring their prices down.
• Pupils should be helped to investigate topics on their own. (BNC)
– Teachers should help pupils (to) investigate topics on their own.
Case study: A summary
• The choice of a full or bare infinitive following
HELP is conditioned by a wide range of factors
including, for example, language variety,
language change, as well as various syntactic
conditions
• Non-corpus-based grammars are likely to
contain biased descriptions that do not accord
with attested language use
Adverbial clauses:
Position vs. semantic types

Greenbaum and Nelson (1995)

Exploring if-clauses in ICE-GB
– One million words
– 500 samples (300 spoken + 200 written)
– Parsed corpus
• Position of if-clauses
– Clause initial position
• If it’s a really nice day we could walk.
– Clause-final position
• We could walk if it’s a really nice day.
• Reference
– Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural
Language: Working with the British Component of ICE.
Amsterdam: John Benjamins
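For plain text without a parse, the initial/final distinction can be roughly approximated from the surface position of "if". The sketch below is a crude heuristic, not the parsed-corpus query used in the lab session: the function name is hypothetical, any non-initial "if" is labelled "final" (so medial clauses are misclassified), and other uses of "if" such as "as if" or "if" meaning "whether" are not filtered out.

```python
import re


def if_clause_position(sentence: str) -> str:
    """Crude surface heuristic: returns 'initial', 'final', or 'none'."""
    text = sentence.strip()
    # Sentence opens with "if": treat the if-clause as clause-initial.
    if re.match(r"if\b", text, flags=re.IGNORECASE):
        return "initial"
    # "if" appears later in the sentence: treat it as clause-final.
    if re.search(r"\bif\b", text, flags=re.IGNORECASE):
        return "final"
    return "none"


print(if_clause_position("If it's a really nice day we could walk."))
print(if_clause_position("We could walk if it's a really nice day."))
```

A parsed corpus avoids these false hits because the search targets the adverbial clause node itself rather than the word form.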

+ Expand to see
text categories
Fuzzy Tree Fragment (FTF)

Press "Insert after" twice

“Edit Node” menu
Editing 1st node
Editing 2nd node
Editing 3rd node
Specifying word
Complete nodes with specified word

clause (main)
Adverbial clause introduced by the subordinator “if”
Specifying position (initial)

Finally press "Start"

Click on "First: Yes" for initial position;
white linking line disappears
Results for initial position
Example of parse tree

Parsing unit
Specifying position (final)
Results for final position
Example of parse tree
Frequencies of initial / final positions
• Initial position appears to be the “unmarked”
position for if-clauses
– Initial position (886, 61.4%)
– Final position (556, 38.6%)
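The percentages above can be reproduced from the two raw counts. A minimal sketch, using only the figures on this slide (the variable names are illustrative):

```python
counts = {"initial": 886, "final": 556}  # figures from the slide
total = sum(counts.values())
shares = {pos: n / total for pos, n in counts.items()}

for pos, share in shares.items():
    print(f"{pos}: {counts[pos]} ({share:.1%})")
```

The same calculation can be repeated per register to compare written and spoken data.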
Written registers

Greenbaum and Nelson's (1995) observation on conditional
clauses (64.8% initial and 35.32% final) only applies to
written registers
Spoken registers

In the spoken data as a whole, the final position is preferred,
though there is considerable internal variation.
The more "formal" spoken registers (parliamentary debates, legal
presentations and non-broadcast (scripted) speeches) show a
marked preference for the initial position.
ICE-GB: Ditransitive verbs