English for Specific Purposes 28 (2009) 157–169
www.elsevier.com/locate/esp
Investigating the viability of a collocation list for students of English for academic purposes
Philip Durrant *
School of English Studies, The University of Nottingham, University Park, Nottingham, NG7 2RD, United Kingdom
Abstract
A number of researchers are currently attempting to create listings of important collocations for students of EAP. However, so far these attempts have (1) failed to include positionally-variable collocations, and (2) not taken sufficient account of variation across disciplines. The present paper describes the creation of one listing of positionally-variable academic collocations and evaluates the extent to which it is likely to be useful to students from across a wide range of disciplines. A number of key findings emerge. First, cross-disciplinary collocations differ in type from the collocations on which most researchers have traditionally focused in that they tend not to be combinations of two lexical words, but rather pairings of one lexical and one grammatical word. Second, most of the words which are found in academic collocations are not found on Coxhead's influential Academic Word List. This, it is argued, reflects a serious methodological weakness in Coxhead's listing. Third, the vocabulary needs of students in the arts and humanities are characteristically different from those of students in other disciplines. Researchers and teachers therefore need to deal with these learners separately. The paper finishes by making a number of recommendations for future developments in this area.
© 2009 The American University. Published by Elsevier Ltd. All rights reserved.
1. Introduction
Language teachers have long used word lists as a means of focusing students' vocabulary learning. It is known that the vast majority of written and spoken language is composed of a relatively small number of high-frequency words (Nation, 2001), so focusing on these very common items seems likely to pay substantial dividends for novice learners. Within English for Academic Purposes (EAP), there has been much interest in constructing lists of generic academic vocabulary: 'sub-technical' (Yang, 1986) words which are common across academic disciplines, but which may cause problems for learners because they are neither sufficiently frequent in the language as a whole to be learnt implicitly nor part of the technical lexicon which is likely to be explicitly taught as part of subject courses (Nation, 2001, pp. 189–191). The most commonly-used listing of academic vocabulary today is Coxhead's Academic Word List (AWL), a collection of 570 word families which are reported to account for approximately 10% of the words found in academic texts (Coxhead, 2000).
* Present address: Bilkent University, Graduate School of Education, G Binas, 06800 Bilkent, Ankara, Turkey. Tel.: +90 312 290 1561.
E-mail address: durrant.phil@googlemail.com
0889-4906/$36.00 © 2009 The American University. Published by Elsevier Ltd. All rights reserved.
doi:10.1016/j.esp.2009.02.002
A limitation of such lists in their present form is that they do not take account of the multi-word collocational patterns which are known to be prevalent in language (e.g., Nattinger & DeCarrico, 1992; Pawley & Syder, 1983). As defined within the neo-Firthian tradition represented by corpus linguists such as Sinclair (1991), Stubbs (1995), and Hoey (2005), collocations are sets of two or more words which appear together more frequently than their individual frequencies would lead us to expect (Hoey, 1991; Jones & Sinclair, 1974). Though Firth (1968) originally conceived of collocation as a purely textual phenomenon, researchers in this tradition have since given the notion a psychological interpretation, seeing the frequent co-occurrence of words as evidencing the existence of semi-preconstructed phrases that constitute single choices for the speaker (Sinclair, 1987, p. 320), linguistic chunking (Ellis, 2003), or a psychological association between words (Hoey, 2005, p. 5).
Viewed in this way, high-frequency collocations are part of the lexicon which learners need to acquire. This implies, as Coxhead (2008) has recently discussed, that it will be necessary to extend existing wordlists to take account of such items, and some attempts have now been made to generate pedagogically-oriented listings of high-frequency collocations, both in general (Shin & Nation, 2008) and in academic English (Biber, Conrad, & Cortes, 2004; Ellis, Simpson-Vlach, & Maynard, 2008).
While these efforts represent progress, a number of fundamental issues still remain to be addressed. One problem is that the EAP studies cited above have focused exclusively on one particular type of collocation: what Biber et al. call 'lexical bundles', that is, frequently recurring fixed sequences of words. In Biber et al.'s case, these are defined, specifically, as four-word sequences which occur at least 40 times per million words. For Ellis et al., they are three-, four- and five-word sequences that are significantly more frequent in academic than in non-academic text. While lexical bundles of this sort are important, and have the great benefit of being easily identifiable with simple corpus search methods, they are also likely to leave out much that is of collocational interest. Collocation often involves relationships between words which may be separated by other, non-fixed, or semi-fixed words, and which may differ in their position relative to one another. Compare:
he made a powerful argument;
he made a powerful, but ultimately unconvincing, argument;
his argument was a powerful one.
Such collocations are of great interest, but will be missed by an exclusive focus on lexical bundles. Any comprehensive attempt to identify important academic collocations must take such items into account. We therefore require methodologies which will provide such items (see, e.g., Cheng, Greaves, & Warren, 2006 for a discussion of one such method).
A second key issue concerns the degree to which collocations are common across academic disciplines. Research has often emphasised the topic- and genre-specificity of collocation. This raises the possibility that collocations may be too domain-specific to permit a listing of collocations that would be genuinely useful to students from across the disciplines. Indeed, Hyland and Tse (2007) see divergent collocation use between disciplines as a key factor undermining the notion that there is a shared academic vocabulary. If this view is right, then it is clearly misguided to seek any generic listing of academic collocations.
We do not yet have sufficient indication of the variation between disciplines to decide this issue either way. Most of the existing work on collocations in EAP has concerned itself only with the technical collocations found in specific academic disciplines (e.g., Cortes, 2004; Gledhill, 2000; Marco, 2000; Ward, 2007; Williams, 1998; Yang, 1986), and so has made no attempt to deal with differences between disciplines. Biber and his colleagues (Biber, 2006; Biber, Conrad, & Cortes, 2004) do aim to identify generically academic items. However, the corpus on which their research is based, which includes only 760,000 words of academic writing in total, is, as Biber himself notes (2006, pp. 163–164), far too small to support robust claims about the usages of particular disciplines, and so no real attempt is made to determine how evenly distributed the collocations identified are across subject areas. Ellis, Simpson-Vlach, and Maynard (2007) do appear to have made some attempt to indicate how well items are spread across disciplines. They report dividing their 2.1 million word corpus of academic writing and 2.1 million word corpus of academic speech into five spoken and four written genres and grading bundles according to how well they are spread between genres. However, the results of this procedure are not reported in their later analysis (which focuses instead on the overall frequencies of items), and the inevitably small size of each genre (an average of 4.2 m/9 ≈ 466,667 words per genre) is again rather limited for work of this kind.
While a strong case has yet to be made for the cross-disciplinary value of academic collocations, the opposing position also stands in need of rigorous support. Hyland and Tse's (2007) argument for disciplinary divergence in the use of collocations is based only on a very few examples: specifically, that 'value' is often found in computer science within the collocations 'value stream' and 'value attribute mapping'; and that 'strategy' is often found in 'marketing strategy' in business texts, 'learning strategy' in applied linguistics texts and 'coping strategy' in sociology. While this rather anecdotal evidence indicates that discipline-specific collocations do exist, it does not indicate that there are not also sufficient cross-disciplinary regularities for an EAP collocation list to be of use.
To develop further our pedagogical descriptions of academic collocations, it will be important both to incorporate positionally-variable expressions and to provide a clear account of how well distributed items are across academic disciplines. The research reported here represents a step towards this ambitious goal. Specifically, it asks whether two-word cross-disciplinary academic collocations (including both fixed and variable items) exist to a sufficiently great extent for a listing of such items to be pedagogically valuable.
Collocations will be defined as pairs of words which frequently appear within a certain span of each other in text. This represents progress toward a more comprehensive account of high-frequency phraseology in that it allows collocations which are positionally variable. However, it remains limited in that it does not identify collocations which involve more than two words. Thus, while the variable collocation 'significant-differences' (as seen in, for example, 'there were no significant differences'; 'there were significant physiological differences'; 'differences were not significant') would be identified, longer collocations (such as 'statistically-significant-differences' and 'significant-differences-between') would not. However, it is hoped that insights gained at this two-word level will serve as a guide to the future development of research into such longer collocations.
2. Methods
2.1. Corpus design and compilation
The first part of this paper describes one method for identifying positionally-variable collocations which are used across the disciplines in academic writing. More specifically, it will identify two-word collocations which are common in research articles published in peer-reviewed journals. This text type provides a suitable basis for identifying pedagogically-useful academic collocations, firstly, because of its centrality within academic writing and, secondly, because, as Hyland notes, research articles are often the target of good writing which students are encouraged to emulate (Hyland, 2008, p. 47).
Since most collocations are relatively rare, in comparison to individual words, a large corpus is required in order to find such items. Moreover, in order to ensure that the collocations identified are common across academic disciplines, rather than being tied to particular subject areas, it is important that the corpus be structured so as to reflect the likely diversity of collocation use between disciplines. As an initial approximation to such a structure, a disciplinary map was first constructed on language-external grounds, based on the departments of the researcher's own university (the University of Nottingham). The university is divided into five faculties: Arts and Humanities; Engineering; Medicine and Health Sciences; Science; Social Sciences, Law and Education. Each faculty is in turn divided into a number of different schools, which are themselves often divided into separate divisions, reflecting the different research interests within the school. Thus, for example, the Faculty of Arts and Humanities hosts the School of English Studies, within which are the four Divisions of Medieval Studies, Modern English Language, Modern English Literature, and Drama. The first draft of the corpus was created by collecting five million words for each of the five university faculties. These five million words were divided as evenly as possible between however many schools were in each faculty, and the words allocated to each school in turn divided as evenly as possible between the divisions within the school.
While it is plausible that a corpus structured on this basis will to some extent reflect differences of collocation use across disciplines (schools within the Engineering faculty seem likely to share collocations which are distinct from those shared by disciplines within Arts and Humanities, for instance), the university structure is obviously not based on linguistic decisions, and so may not match the distribution of collocations as accurately as we would like. For example, the Faculty of Social Sciences, Law and Education includes the School of Built Environment. However, in terms of vocabulary use, initial analysis revealed this subject area to have more in common with subjects found in the Faculty of Engineering than it does with other schools within its own faculty (see below). By including this school in its faculty grouping, we are therefore likely to create the false impression that certain items which are common in engineering are also common in the social sciences.
To overcome this problem, it was assumed that variation in collocation use would be closely related to variation in the use of single-word vocabulary. An initial analysis of such single-word use across the corpus was therefore used in order to restructure the corpus. For each of the 31 schools represented (which were assumed to represent basic, approximately homogenous, disciplinary units), a listing of 'keywords' (Scott, 1999) was generated. These were words which (1) contained four or more alphabetical characters; (2) appeared in the collection of writing for that school with a mean frequency of at least 20 per million words; (3) appeared in at least 20% of texts in the school; (4) appeared in the school significantly more frequently than in the British National Corpus (BNC), with the threshold for significance set at p < 0.1 × 10^-7. A school's keywords were taken to represent items which are of particular importance for its students.
To quantify the commonalities in vocabulary use between two schools, the percentage overlap in keyword use (that is, the percentage of the total unique words found in two lists which were common to both) was then calculated. This analysis was repeated across all schools, generating a matrix which showed the degree to which every school resembled every other in its vocabulary use. This matrix was then used as the basis of a hierarchical cluster analysis (McEnery & Wilson, 2001, pp. 92–95) to determine the most natural groupings of schools in terms of their vocabulary use. This analysis produced the dendrogram shown in Fig. 1.
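The overlap measure and the clustering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the school names and keyword lists are invented toy data, the function names are hypothetical, and the clustering is a plain average-linkage procedure written out by hand, which may differ in detail from the statistical package used to produce Fig. 1.

```python
from itertools import combinations

def keyword_overlap(list_a, list_b):
    """Percentage of the unique words across both lists that occur in both."""
    a, b = set(list_a), set(list_b)
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0

def mean_overlap(cluster_a, cluster_b, keywords):
    """Average pairwise overlap between the schools of two clusters."""
    pairs = [(s, t) for s in cluster_a for t in cluster_b]
    return sum(keyword_overlap(keywords[s], keywords[t]) for s, t in pairs) / len(pairs)

def average_linkage(keywords):
    """Repeatedly merge the two clusters with the highest mean overlap.

    Returns the merges in order as (cluster_a, cluster_b, mean_overlap).
    """
    clusters = [(school,) for school in sorted(keywords)]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: mean_overlap(pair[0], pair[1], keywords))
        merges.append((a, b, mean_overlap(a, b, keywords)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented toy keyword lists, for illustration only.
toy = {
    "Medicine": ["patients", "clinical", "treatment", "data"],
    "Nursing": ["patients", "clinical", "care", "data"],
    "History": ["century", "political", "empire", "data"],
}
merges = average_linkage(toy)
```

In this toy case the two health-related schools merge first, and the third school joins them only at a much lower level of overlap, which is the kind of pattern a dendrogram displays.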
The dendrogram reads from left to right and represents progressively larger and less homogenous clusters of schools. For example, the very similar schools of Human Development and Medical and Surgical Sciences (rows 1 and 2) combine at the first level of analysis. These are then joined by Community Health Sciences. At the next level, this group of three combines with a larger group (which contains the schools of Molecular and Medical Science, Veterinary Medicine and Science, Biomedical Sciences, Biology, Pharmacy, and Biosciences) to form a group of nine. At the next level, this group in turn combines with another group of nine (containing the Engineering schools plus Chemistry, Physics and Astronomy, Computer Science, Mathematical Sciences and the School of Built Environment). This group of 18 schools then combines with another group of eight, comprising various social sciences, plus Nursing and Psychology. Finally, this large group combines with a group of five schools corresponding to the Arts and Humanities faculty.
Five major clusters seem to emerge from this analysis:
1. Human Development; Medical and Surgical Sciences; Community Health Sciences; Molecular and Medical Science; Veterinary Medicine and Science; Biomedical Sciences; Biology; Pharmacy; Biosciences.
2. Mechanical, Materials and Manufacturing Engineering; Chemistry; Chemical, Environmental and Mining Engineering; Electrical and Electronic Engineering; Civil Engineering; Physics and Astronomy; Computer Science and Information Technology; Built Environment; Mathematical Sciences.
3. Education; Sociology and Social Policy; Nursing; Psychology.
4. Business; Economics; Politics and International Relations; Law.
5. English Studies; Modern Languages and Cultures; Humanities; American and Canadian Studies; History.
These groupings appear to form intuitively satisfying sets, which will be referred to as: (1) Life sciences; (2) Science and Engineering; (3) Social-Psychological; (4) Social-Administrative and (5) Arts and Humanities. Restructured according to these groups, the contents of the final corpus can be summarised as in Table 1.
Before proceeding to the next section, it is worth noting that this analysis of single-word vocabulary revealed
a large gap between disciplines in the arts and humanities and those in other groups. As Fig. 1 illustrates, in the
cluster analysis, schools within this area formed a distinct, internally-homogenous set which had relatively little
in common with the rest of the corpus. To quantify this relationship, we can note that the mean percentage
overlap of keywords between all 26 schools outside of the arts and humanities disciplines was 27%, whereas
the mean overlap between all 31 schools once arts and humanities are included drops sharply to 20%. The arts
and humanities group appears therefore to be something of an outlier in terms of its vocabulary use.
Fig. 1. Dendrogram using Average Linkage (between groups).
Table 1
Summary of the academic corpus.
Group                      Total words    Total articles
Arts and Humanities          5,049,627               436
Life sciences                6,156,089               991
Science and Engineering      8,242,417              1270
Social-Administrative        2,850,789               215
Social-Psychological         2,778,426               339
Total                       25,077,348              3251
2.2. Identifying academic collocations
Academic collocations will be defined here, in parallel to existing definitions of single-word academic vocabulary, as word pairs which co-occur with at least moderate frequency across a wide range of academic disciplines, but which are not often found in non-academic language. This definition has a number of parts, which need to be unpacked. First, 'words' are taken to mean word forms, rather than any more abstract category such as lemmas since, as many corpus linguists have argued (e.g., Clear, 1993; Hoey, 2005; Sinclair, 1991; Stubbs, 1996), lemmatisation has the potential to disguise important differences in collocational preferences between different forms of a lemma. Second, I follow Jones and Sinclair's (1974) widely used precedent of limiting 'co-occurrence' to occurrences within a four-word span. Third, we need to state more explicitly what is meant by the condition that collocations have moderate frequency in academic writing, but not in other forms of language. To operationalise this, I adapted Scott's log-likelihood-based 'keyword' technique (1999): academic collocations are those pairs which appear significantly more frequently in academic than in non-academic texts. This was calculated by comparing the total frequency of collocations in the academic corpus with their frequency in an 85 million word subsection of the BNC, comprising only non-academic texts (identified using the class codes devised by Lee (2001), which are incorporated into the Text converter utility of WordSmith Tools). Rather than setting any (necessarily arbitrary) level of significance as a criterion for inclusion, I used log-likelihood to produce a ranked list of the collocations which are most 'key' to academic writing. This list was then used to select the 1000 most key collocates. To avoid wrongly attributing keyness to unimportant collocations simply because they fail to appear in the BNC, only collocations appearing at least once per million words in the academic corpus (that is, at least 25 times in 25 million words) were included in the analysis. To eliminate frequently co-occurring words which do not stand in any interesting collocational relationship, pairs with mutual information scores of less than four were also excluded (a cut-off point which extensive piloting showed to make intuitively appropriate distinctions). Finally, to ensure that these collocations are used across a range of academic disciplines, word pairs needed to meet these last two frequency criteria (minimum frequency of once/million words and minimum mutual information of four) in all five of the subject groupings described above.
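The frequency, mutual information and range criteria can be illustrated with a short Python sketch. It rests on several assumptions: the function names are hypothetical, the mutual information formula is a common simplified form (log2 of observed over expected co-occurrence, with no adjustment for span size), and the counting is done directly on token lists rather than with WordSmith Tools, so scores need not match those used in the study.

```python
import math
from collections import Counter

def pair_counts(tokens, span=4):
    """Count (node, collocate) pairs where the collocate occurs
    within `span` words to the right of the node."""
    counts = Counter()
    for i, node in enumerate(tokens):
        for collocate in tokens[i + 1 : i + 1 + span]:
            counts[(node, collocate)] += 1
    return counts

def mutual_information(pair_freq, freq_x, freq_y, corpus_size):
    """Simplified MI: log2 of observed over chance-expected co-occurrence."""
    return math.log2(pair_freq * corpus_size / (freq_x * freq_y))

def qualifying_pairs(tokens, min_per_million=1.0, min_mi=4.0, span=4):
    """Pairs meeting both the frequency and the MI threshold in one sub-corpus."""
    n = len(tokens)
    word_freq = Counter(tokens)
    selected = set()
    for (x, y), freq in pair_counts(tokens, span).items():
        frequent = freq * 1_000_000 / n >= min_per_million
        associated = mutual_information(freq, word_freq[x], word_freq[y], n) >= min_mi
        if frequent and associated:
            selected.add((x, y))
    return selected

def shared_across_groups(tokens_by_group, **thresholds):
    """Keep only pairs that qualify in every subject grouping."""
    sets = [qualifying_pairs(t, **thresholds) for t in tokens_by_group.values()]
    return set.intersection(*sets) if sets else set()

# Toy sentence, for illustration only.
sentence = "there were no significant differences between groups".split()
counts = pair_counts(sentence)
```

Requiring a pair to pass the thresholds separately in each group, rather than in the pooled corpus, is what stops a collocation that is very frequent in one discipline from entering the list on that basis alone.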
The first step in putting this definition into practice was to generate, for each of the five academic sub-corpora, a listing of all word pairs which co-occurred within a four-word span of each other more than once per million words and with a mutual information score of at least four. This was achieved using the Word list function in WordSmith Tools (Scott, 1996). An important methodological point which is not acknowledged in the WordSmith documentation is that Word list only provides information about collocates occurring to the right-hand side of node words. Thus, for example, in a listing generated for the arts and humanities sub-corpus, the node word 'as' lists among its collocates the word 'well', with 2250 occurrences in 5 million words; the word 'well', meanwhile, lists among its collocates the word 'as' with 1782 occurrences. Consulting a concordance list for 'well' shows that these two figures correspond to appearances of 'as' to its left- and right-hand sides, respectively. The WordSmith lists were therefore of node words plus collocates which appear within a span of four words to their right a given number of times. This means that word pairs which are frequent in both directions appear on the list twice. This arrangement is, of course, beneficial for many types of corpus analysis. However, for the present research, it must be borne in mind that additional steps are required to eliminate duplication (see below).
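To make the duplication issue concrete, the following Python sketch collapses directional entries into unordered pairs. The function name is hypothetical and the 'control'/'group' count is invented; the 'as'/'well' figures are the ones quoted above. It is only one possible way of handling the duplicates, not a description of the procedure actually followed.

```python
from collections import defaultdict

def merge_directions(directional_counts):
    """Collapse (node, right-hand collocate) entries into unordered pairs.

    Returns {pair: (total across both directions, larger single direction)}.
    """
    grouped = defaultdict(list)
    for (node, collocate), freq in directional_counts.items():
        grouped[frozenset((node, collocate))].append(freq)
    return {pair: (sum(freqs), max(freqs)) for pair, freqs in grouped.items()}

counts = {
    ("as", "well"): 2250,       # 'well' to the right of 'as' (from the text)
    ("well", "as"): 1782,       # 'as' to the right of 'well' (from the text)
    ("control", "group"): 40,   # invented figure, one direction only
}
merged = merge_directions(counts)
```

Keeping both the summed and the larger directional figure leaves open whether a later step ranks pairs by one direction or reports combined totals.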
Once separate listings of collocations had been generated for each academic group, collocations which were common to all groups were identified using the functionality of Microsoft Excel. For these shared collocations, an overall frequency figure for the academic corpus as a whole was then calculated by summing the frequencies in each sub-corpus. To determine which collocations are distinctively academic, frequency counts for these collocations were then generated (again using WordSmith Tools) for the non-academic sub-sections of the BNC. Microsoft Excel was then used to calculate log-likelihood ratios comparing the frequency of each collocation in the two corpora and to rank collocations according to the size of this ratio, so yielding an ordered listing of the most distinctively academic collocations. Finally, a number of collocations were manually removed from the listing. Collocations were removed if: (1) they included an acronym or abbreviation, a proper name, an article, or a number or ordinal other than 'one' and 'first'; (2) the collocation corresponded to a single Latin word (e.g. 'ad hoc', 'per cent'); (3) the majority of their occurrences appeared to be in writing outside the main text of the articles, for example, in bibliographies, copyright information, or acknowledgements; (4) they appeared on the listing twice (because they are frequent in both directions; see above. The more highly-ranked of the two appearances was kept in each case).
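The ranking step can be sketched as follows. This is an assumption-laden illustration: it uses a two-term log-likelihood formula that is widely used for comparing an item's frequency in two corpora, which may not be the exact calculation carried out in Excel for the study; the function names and the toy frequencies are invented.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item (two-term form)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def rank_by_keyness(academic, reference, academic_size, reference_size, top_n=1000):
    """Rank pairs that are relatively more frequent in the academic corpus."""
    scored = []
    for pair, freq in academic.items():
        ref_freq = reference.get(pair, 0)
        if freq / academic_size > ref_freq / reference_size:
            score = log_likelihood(freq, academic_size, ref_freq, reference_size)
            scored.append((score, pair))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:top_n]]

# Invented frequencies; corpus sizes follow the 25 and 85 million words in the text.
academic = {("significant", "differences"): 500, ("of", "the"): 1000}
reference = {("significant", "differences"): 10, ("of", "the"): 4000}
ranked = rank_by_keyness(academic, reference, 25_000_000, 85_000_000)
```

Because log-likelihood is symmetric, the explicit relative-frequency check is needed so that pairs which are distinctively non-academic do not rise to the top of the list.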
Once this process had been completed, a final list of the most distinctively academic collocations was created comprising the 1000 most key items. All of these collocations have log-likelihood ratios of greater than 82, indicating that they are far more frequent in academic writing than in everyday English. These 1000 items are not meant, of course, to represent an exhaustive inventory of academic phraseology. They are intended, rather, as a pedagogically-manageable body of learning targets to which learners should pay special attention and, more immediately, as a sample from which we can evaluate the success of this search strategy. The 100 most key collocations from the list are shown in Appendix A.¹
3. Results and discussion
The methodology described above identified 1000 two-word collocations which are frequent across all five of our academic subject areas. The remainder of this paper aims to evaluate whether these collocations are worth learning in general, and whether they are of equal value to students from all disciplines.
Looking through the listing of academic collocations, one point that becomes immediately clear is that the great majority of word pairs are 'grammatical' collocations; that is, they contain at least one non-lexical word (I take non-lexical words to comprise prepositions, determiners, primary and modal verbs, conjunctions, subordinating adverbs, pronouns and numerals and ordinals other than 'one' and 'first').² Indeed, of the 1000 collocations listed, 763 are grammatical in this sense. This may be a disappointment to some. Gledhill (2000, pp. 73–79) has noted that many researchers systematically eliminate grammatical collocations from their analyses, considering collocation between lexical items to be the only sort worthy of examination; I also suspect that such items are not what many teachers have in mind when they think of collocation (see, for example, the definitions of collocation given by contributors to Lewis's edited collection of papers (Lewis, 2000)).
If lexical collocations are indeed the more interesting type, then the listing is disappointing. However (as Gledhill (2000) also argues), an exclusive focus on lexical collocations may be misguided. Linguistic frameworks which engage seriously with collocation have denied any absolute distinction between lexis and grammar, seeing the terms as referring to end-points on a spectrum, rather than clear and mutually-exclusive categories (e.g., Langacker, 1987, p. 3; Sinclair, 1991, p. 108). Pattern grammar, for example, asserts that supposedly abstract grammatical patterns are often strongly associated with specific lexical instantiations (Hunston & Francis, 2000, p. 96), while Sinclair (2004, pp. 30–35) and Hoey (2005, p. 40) have shown that lexical items often favour particular grammatical forms. One benefit to learners of a listing of high-frequency grammatical collocations is that the most typical versions of the patterns they need, and the most typical patterns of the words they need, can be brought to their attention.
To take a concrete example, the key collocations list includes 36 pairs instantiating the often-taught reporting pattern 'verb + that'. Of these 36, many contain alternate forms of the same verb (e.g. 'assume', 'assumed', 'assumes', 'assuming' all get separate entries). Collapsing these together leaves 16 distinct verbs: argue, assume, conclude, confirm, demonstrate, emphasize, hypothesize, imply, indicate, note, predict, reveal, show, speculate, suggest, suppose. Both lemmatised and non-lemmatised versions of this listing would, I suspect, be of great value to learners trying to get to grips with this pattern. Instead of simply learning the abstract form ('verb + that'), learners could be introduced to the patterns through these instantiations. Learning the collocations as pairs may both provide learners with a good basis for getting to grips with the meaning and use of a pattern and bias them towards using it in the most lexically appropriate ways.
Another benefit of listing grammatical collocations is that they may draw attention to productive patterns which are tied to specific lexis in a way that can lead them to be overlooked by traditional grammars. One example is the collocation 'and-respectively', as in:
The survival rates after 12 and 24 months were 88 per cent and 83 per cent, respectively, for the dogs with single tumours.
Two and one asterisks denote, respectively, that the estimates are statistically significant at the 5% and 10% levels. . .
¹ Note that frequencies given in the appendix are the total frequencies of both left- and right-hand occurrences of the collocations.
² This usage differs slightly from that of Benson, Benson, and Ilson (1997, p. xv), who use the term 'grammatical collocation' to indicate the collocation of a noun, adjective, or verb with a preposition or grammatical structure. By including reference to grammatical structures, Benson et al.'s definition includes verb patterns such as verbs allowing dative movement transformation (e.g., 'he sent her a book'), and verbs followed by an infinitive without 'to' (e.g., 'we must work'). My focus on co-occurring words does not allow such items to be included. Benson et al.'s term 'lexical collocation' (1997, p. xxx), on the other hand, does seem to correspond to my use.
This form is, I would suggest, likely to be of great use to students of academic English, playing as it does a vital text-organising function which would be difficult to manage by other means. A strong indicator of this usefulness is its high frequency of occurrence in the corpus: on average, 250 appearances per million words. It is also, I suspect, likely to be neglected by many EAP courses. It seems likely that one reason that researchers have not been interested in grammatical collocations is that such pairs lack the striking salience of collocations like 'significant difference' or 'control group'. However, I would argue that this lack of salience makes it all the more important that researchers bring such items to teachers' and learners' attention, since they are otherwise likely to be passed over unnoticed.
Related to the prevalence of grammatical items on our collocation listing is the fact that many do not overlap with academic vocabulary as it has been traditionally defined. Of the 1000 collocations on the key collocations list, only 425 include an item from the AWL. I would argue that this lack of overlap indicates a shortcoming of traditional approaches to identifying academic vocabulary, rather than a weakness of the present list. The majority of the 509 individual words on the key collocations list which are not in Coxhead's list appear to have been excluded from the latter because they, or one of their inflectionally or derivationally related forms, are found in West's General Service List (such items are excluded from the AWL on the grounds that students of EAP are likely already to have mastered these items). At least 456 forms (90%) appear to have been excluded for this reason. Examples include:
address (found in the collocation address-issue)
control (found in the collocation control-group)
means (found in the collocation by-means)
Such items highlight a serious disadvantage of Coxhead's strategy of eliminating from her word list any
items which are related to words found on the GSL. While intermediate learners coming to EAP for the first
time are likely to have met some form of the word families to which these items belong, usages such as those
seen here are, I would suggest, likely not to be known. These examples are not untypical of items which appear
in the current listing but are not on the AWL, and I would argue that they probably require specific pedagogical
attention. Indeed, the fact that the words are superficially familiar to learners may make them all the more
problematic, since learners may not even notice when they have not understood them properly. The strategy of
eliminating all high-frequency words from academic word lists therefore seems a somewhat suspect one: many
items which are excluded by this strategy may be of considerable importance for learners of EAP.
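The overlap check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the word sets below are tiny invented stand-ins, not the actual AWL or GSL, and the sketch matches exact word forms, whereas the comparison reported here also takes inflectionally and derivationally related forms into account.

```python
# Hypothetical miniature data: stand-ins for the real lists, which are not reproduced here.
awl_forms = {"issue", "significant", "data"}                # assumed AWL word forms
gsl_forms = {"address", "control", "means", "group", "by"}  # assumed GSL word forms

collocations = [
    ("address", "issue"),
    ("control", "group"),
    ("by", "means"),
    ("statistically", "significant"),
]

def contains_listed_item(pair, word_list):
    """Return True if either word of the collocation appears on the word list."""
    return any(word in word_list for word in pair)

# Collocations that include at least one AWL item.
with_awl = [pair for pair in collocations if contains_listed_item(pair, awl_forms)]
overlap_share = len(with_awl) / len(collocations)

# Individual words that are not on the AWL, and those among them found on the GSL.
all_words = {word for pair in collocations for word in pair}
non_awl_words = all_words - awl_forms
excluded_via_gsl = non_awl_words & gsl_forms

print(f"{len(with_awl)} of {len(collocations)} collocations include an AWL item")
print(f"{len(excluded_via_gsl)} of {len(non_awl_words)} non-AWL words are GSL forms")
```

Applied to the figures reported above, the same ratios are 425/1000 (42.5%) for collocations containing an AWL item and 456/509 (about 90%) for non-AWL words traceable to the GSL.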
I have noted that an important issue not adequately addressed by previous research is the degree to which
supposedly generic academic collocations are indeed shared across disciplines. In the present paper, steps were
taken to ensure that the collocations identified were common across a principled range of different subject
areas. However, though all collocations on our lists meet certain minimal frequency requirements in all subject
areas, it is not yet clear whether the lists will be equally useful across all disciplines. To evaluate this, the
summed total frequencies of all 1000 collocations on our listing were calculated for each subject group,
and the results normalised to mean frequencies per million words. The results of this analysis are shown in
Table 2.

Table 2
Total frequencies of key collocations in each faculty.

Faculty                    Total occurrences per million words
Life sciences              31,632
Science and Engineering    29,175
Social-Psychological       35,306
Social-Administrative      30,529
Arts and Humanities        17,677

there  was   (no)  statistically significant difference(s)  between  the
       were  (a)                                            in
Fig. 2. Longer academic collocations.
The most striking point to note from these data concerns the figures for the arts and humanities group.
While all of the other subject groupings fall within a relatively narrow band of around 30–35,000 occurrences
per million words, for disciplines in the arts and humanities, the rate is far lower, at 17,677 occurrences per
million. This suggests that the academic collocations identified in Section 2.2 will be of far less use to students
in these disciplines than to students in other areas.
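The normalisation and comparison behind these figures can be sketched as follows. The raw count and subcorpus size used in the first call are invented purely for illustration, since the subcorpus sizes are not restated in this section; the per-million figures are those reported in Table 2.

```python
def per_million(total_occurrences, corpus_size_words):
    """Normalise a raw frequency to occurrences per million words."""
    return total_occurrences * 1_000_000 / corpus_size_words

# Hypothetical example: 88,385 occurrences in a 5-million-word subcorpus.
example_rate = per_million(88_385, 5_000_000)

# Normalised totals as reported in Table 2.
table_2 = {
    "Life sciences": 31_632,
    "Science and Engineering": 29_175,
    "Social-Psychological": 35_306,
    "Social-Administrative": 30_529,
    "Arts and Humanities": 17_677,
}

# Compare the arts and humanities rate with the mean of the other four groups.
other_rates = [rate for name, rate in table_2.items() if name != "Arts and Humanities"]
mean_other = sum(other_rates) / len(other_rates)
ratio = table_2["Arts and Humanities"] / mean_other

print(f"Mean of other groups: {mean_other:.1f} per million words")
print(f"Arts and Humanities as a share of that mean: {ratio:.2f}")
```

On the Table 2 figures, the arts and humanities rate comes to a little over half of the mean rate for the other four groups.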
We have already seen (Section 2.1) that the key single-word vocabulary of arts and humanities subjects is
rather different from that used in the rest of academia. The present results underline this finding. Together,
these data suggest that the vocabulary needs of students in these areas should be treated separately from those
of other EAP learners. This would both enable more relevant listings to be created for arts and humanities
students and allow us to include in our inventory of generic academic vocabulary many items which are currently
excluded because they are not common in arts and humanities writing, in spite of their high frequencies
in all other areas.
4. Conclusions
This paper has argued that existing pedagogical listings of academic collocations are insufficient because they
fail to include positionally-variable items and because they fail to demonstrate their value to students from
across the range of academic disciplines. It has attempted to take some steps towards a more adequate listing
by exploring the extent to which there is a shared vocabulary of positionally-variable two-word collocations
across academic disciplines. We have seen that such a shared vocabulary can be identified. However, we have
also seen that a number of caveats need to be made about the list. First, it is composed largely of grammatical
collocations. While this may be a disappointment to some, I have argued that grammatical collocations are
legitimate learning targets, and so the prominence of such items should not put teachers off using an academic
collocation list. Second, the collocation list does not overlap strongly with Coxhead's AWL. Again, this may
disappoint some readers. However, I have claimed that this lack of overlap indicates not a problem with the
current listing, but rather an important methodological flaw in the design of the AWL, of which teachers ought
to be aware (regardless of their opinion of the present collocation list). Third, academic collocations are far
less important for students in the arts and humanities than they are for students in other academic areas. I have
suggested that future research should therefore treat the needs of these students separately. Unless such a
separation is observed, there is a danger that vocabulary lists will not meet anyone's needs as well as they should.
Finally, two limitations of the present research need to be noted. First, it has restricted itself to two-word
collocations. I have noted that collocational patterns can be found beyond the two-word level. A prominent
example in academic writing would be the variable phrase shown in Fig. 2.
The identification of such patterns remains methodologically problematic, though programs such as Concgram
(Cheng et al., 2006) seem to offer an interesting way forward here. It should be borne in mind, however,
that as collocations become longer their frequency will in general decrease, and their range of applications is
likely to narrow (that is, they become more situationally-specific). The existence of a useful cross-disciplinary
set of two-word items is therefore a necessary, but not a sufficient, condition for the existence of a similarly
useful cross-disciplinary set of longer collocations. Future research will need to address the issue of how
far we can lengthen collocations while retaining cross-disciplinary usefulness.
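The point that frequency falls as sequences lengthen can be illustrated with a minimal sketch. It counts contiguous word sequences only, which is a simplification and not the positionally-variable method used in this paper, and the sample sentence is invented for the purpose.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample text, lower-cased and split on whitespace.
tokens = (
    "there was no significant difference between groups "
    "but there was a significant difference between sites"
).split()

two_word = ngram_counts(tokens, 2)
three_word = ngram_counts(tokens, 3)

# A longer sequence can never be more frequent than a shorter sequence it contains.
print(two_word[("there", "was")])          # the two-word item
print(three_word[("there", "was", "no")])  # one of its three-word extensions
```

In this sample, "there was" occurs twice, while the extension "there was no" occurs only once, because the second occurrence continues differently. A frequent two-word item therefore does not guarantee a frequent longer item built on it.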
Second, this analysis has looked only at the forms, and not at the functions, of collocations. Therefore,
while disciplines outside of the arts and humanities appear to share many collocations, it is not obvious that
they all use them in the same way. Future research will also need to address this issue. Analysis of the use of
collocations would need to be undertaken manually, and so may prove a prohibitively labour-intensive task.
However, it is worth noting that as collocations become longer, their potential for ambiguity also decreases.
That is, whereas one- or two-word items may be polysemous and so used differently in different areas, longer
items (e.g. there were not statistically-significant-differences-between) will be increasingly less polysemous. As
we come to deal with larger collocations, therefore, the need for manual semantic analysis is likely to decrease.
Acknowledgement
This research was supported by a studentship from the Economic and Social Research Council.
Appendix A. Top 100 key academic collocations
Word 1 Word 2 Mean frequency/million words

THIS STUDY 296.96
ASSOCIATED WITH 315.52
THIS PAPER 163.68
BASED ON 404.64
AND RESPECTIVELY 249.68
DUE TO 374.12
CONSISTENT WITH 121.88
BETWEEN AND 935.56
WAS PERFORMED 84.8
RELATED TO 190.72
COMPARED TO 152.04
WAS USED 184.64
PRESENT STUDY 76.28
NUMBER OF 634.6
AS SHOWN 126.16
THESE RESULTS 91.88
RESPECT TO 107.16
TO DETERMINE 121.08
NOTE THAT 134.56
OUR STUDY 68.32
OBTAINED FROM 109.08
WERE PERFORMED 63.12
WITH RESPECT 101.48
STATISTICALLY SIGNIFICANT 57.84
WERE USED 132.04
PREVALENCE OF 67.32
OUR RESULTS 72.6
ARE SHOWN 100.36
IN CONTRAST 126.76
DEFINED AS 100.76
FOLLOW UP 90.12
COMPARED WITH 165.88
THIS ARTICLE 69.64
NO SIGNIFICANT 61.84
SUGGESTS THAT 123.88
CAN BE 857.4
WAS MEASURED 53.32
IN ADDITION 204.72
DECREASE IN 64.64
PERCENTAGE OF 88.08
RELATIONSHIP BETWEEN 115.32
DETERMINED BY 84.96
UNDER CONDITIONS 83.92
CORRELATED WITH 45.88
EFFECT ON 184.36
EFFECTS ON 112.92
PREVIOUS STUDIES 41.04
CONTROL GROUP 44.2
SIGNIFICANT DIFFERENCES 46.04
DIFFERENCES BETWEEN 84.8
FIGURE SHOWS 50.64
SHOW THAT 127.84
SUGGEST THAT 129.32
CORRELATION BETWEEN 48.48
WERE COLLECTED 45.64
WERE SIGNIFICANTLY 40.8
INDICATE THAT 71.8
CORRESPONDS TO 45.36
ARE PRESENTED 54.32
MEASURED BY 53.92
PARTICIPANTS WERE 46.24
TO EVALUATE 56.4
SIGNIFICANT BETWEEN 46.56
DEFINED BY 54.96
IS CONSISTENT 50.08
SHOWN FIGURE 42.32
INDICATES THAT 57.8
CHARACTERIZED BY 41.8
THESE FINDINGS 38.32
OCCURRENCE OF 40.56
OVER TIME 73.6
SUBSET OF 33.28
BEEN SHOWN 51.56
ASSOCIATION BETWEEN 36.32
CURRENT STUDY 27.8
WAS CALCULATED 36.36
SEE ALSO 75.36
DATA SET 35.36
RESULTS OBTAINED 37.68
SHOWS THAT 86.6
CHANGES IN 156.96
SIGNIFICANT DIFFERENCE 37.96
SHOWN TABLE 39.56
THERE SIGNIFICANT 52.96
INCREASE IN 167.68
SIGNIFICANTLY HIGHER 31.16
HIGHER THAN 98.56
IMPLIES THAT 46.36
BETWEEN GROUPS 45.84
ACCORDING TO 267.36
SIGNIFICANTLY DIFFERENT 29.4
INTERACTION BETWEEN 37.2
WE ASSUME 44.52
CORRESPOND TO 36.4
THESE SUGGEST 33.24
EXPOSURE TO 56.28
SAMPLE SIZE 27.76
HAVE SHOWN 66.8
SIGNIFICANTLY THAN 37.92
WERE DETERMINED 37.36
References
Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations. Amsterdam: John Benjamins.
Biber, D. (2006). University language. Amsterdam: John Benjamins.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at...: Lexical bundles in university teaching and textbooks. Applied Linguistics,
25(3), 371–405.
Cheng, W., Greaves, C., & Warren, M. (2006). From n-gram to skipgram to concgram. International Journal of Corpus Linguistics, 11(4),
411–433.
Clear, J. (1993). From Firth principles: Computational tools for the study of collocations. In M. Baker, G. Francis, & E. Tognini-Bonelli
(Eds.), Text and technology: In honour of John Sinclair (pp. 271–292). Amsterdam: John Benjamins.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific
Purposes, 23, 397–423.
Coxhead, A. (2000). A new academic wordlist. TESOL Quarterly, 34(2), 213–238.
Coxhead, A. (2008). Phraseology and English for academic purposes. In F. Meunier & S. Granger (Eds.), Phraseology in language learning
and teaching (pp. 149–161). Amsterdam: John Benjamins.
Ellis, N. C. (2003). Constructions, chunking, and connectionism: The emergence of second language structure. In C. J. Doughty & M. H.
Long (Eds.), The handbook of second language acquisition (pp. 63–103). Oxford: Blackwell.
Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2007). The processing of formulas in native and second-language speakers: Psycholinguistic
and corpus determinants. Paper presented at the 25th UWM Linguistics symposium, University of Wisconsin-Milwaukee.
Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and second-language speakers: Psycholinguistics,
corpus linguistics, and TESOL. TESOL Quarterly, 41(3), 375–396.
Firth, J. R. (1968). A synopsis of linguistic theory, 1930–55. In F. R. Palmer (Ed.), Selected papers of J.R. Firth 1952–1959 (pp. 168–205).
Harlow: Longman.
Gledhill, C. (2000). Collocations in science writing. Tübingen: Gunter Narr Verlag.
Hoey, M. (1991). Patterns of lexis in text. Oxford: Oxford University Press.
Hoey, M. (2005). Lexical priming: A new theory of words and language. London: Routledge.
Hunston, S., & Francis, G. (2000). Pattern grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John
Benjamins.
Hyland, K. (2008). Academic clusters: Text patterning in published and postgraduate writing. International Journal of Applied Linguistics,
18(1).
Hyland, K., & Tse, P. (2007). Is there an academic vocabulary? TESOL Quarterly, 41(2), 235–253.
Jones, S., & Sinclair, J. M. (1974). English lexical collocations. A study in computational linguistics. Cahiers de Lexicologie, 24, 15–61.
Langacker, R. W. (1987). Foundations of cognitive grammar. Theoretical prerequisites (Vol. 1). Stanford: Stanford University Press.
Lee, D. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle.
Language Learning & Technology, 5(3), 37–72.
Lewis, M. (Ed.). (2000). Teaching collocations: Further developments in the lexical approach. Boston: Thomson.
Marco, M. J. L. (2000). Collocational frameworks in medical research papers: A genre-based study. English for Specific Purposes, 19,
63–86.
McEnery, T., & Wilson, A. (2001). Corpus linguistics: An introduction (2nd ed.). Edinburgh: Edinburgh University Press.
Nation, P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W.
Schmidt (Eds.), Language and communication (pp. 191–226). New York: Longman.
Scott, M. (1996). WordSmith Tools. Oxford: Oxford University Press.
Scott, M. (1999). WordSmith tools users help file. Oxford: Oxford University Press.
Shin, D., & Nation, P. (2008). Beyond single words: The most frequent collocations in spoken English. ELT Journal, 62(4), 339–348.
Sinclair, J. M. (1987). Collocation: A progress report. In R. Steele & T. Threadgold (Eds.), Language topics: Essays in honour of Michael
Halliday (Vol. 2, pp. 319–331). Amsterdam: John Benjamins.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. M. (2004). Trust the text: Language, corpus and discourse. London: Routledge.
Stubbs, M. (1995). Collocations and semantic profiles: On the cause of the trouble with quantitative methods. Functions of Language, 2(1),
1–33.
Stubbs, M. (1996). Corpus evidence for norms of lexical collocation. In G. Cook & B. Seidlhofer (Eds.), Principle and practice in applied
linguistics: Studies in honour of H.G. Widdowson (pp. 245–256). Oxford: Oxford University Press.
Ward, J. (2007). Collocation and technicality in EAP engineering. Journal of English for Academic Purposes, 6, 18–35.
Williams, G. C. (1998). Collocational networks: Interlocking patterns of lexis in a corpus of plant biology research articles. International
Journal of Corpus Linguistics, 3(1), 151–171.
Yang, H. Z. (1986). A new technique for identifying scientific/technical terms and describing science texts. Literary and Linguistic
Computing, 1(2), 93–103.
Philip Durrant has recently completed a PhD on the subject of collocations in second language learning at the School of English Studies,
University of Nottingham, UK. He is currently Visiting Assistant Professor in the Faculty of Education at Bilkent University, Turkey.
