
Types of corpora

There are many different kinds of corpora.

They can contain written or spoken (transcribed) language, modern or old texts, and texts from one language or several. The texts can be whole books, newspapers, journals, speeches etc., or extracts of varying length. The kinds of text included, and the way they are combined, vary between corpora and corpus types.

'General corpora' consist of general texts, that is, texts that do not belong to a single text type, subject field, or register. An example of a general corpus is the British National Corpus. Some corpora contain texts that are sampled (chosen) from a particular variety of a language, for example from a particular dialect or a particular subject area. These corpora are sometimes called 'Sublanguage Corpora'.

Corpora can consist of texts in one language (or language variety) only, or of texts in more than one language. If the texts are the same in all languages, i.e. translations, the corpus is called a Parallel Corpus. A Comparable Corpus is a collection of "similar" texts.

Corpora serve as the basis for a number of research tasks within the field of Corpus Linguistics.

There are many types of corpora, which can be used for different kinds of analyses (cf. Kennedy 1998). Some (not necessarily mutually exclusive) examples of corpus types are (for a description of the individual corpora see below):

- general/reference corpora (vs. specialized corpora) (e.g. BNC = British National Corpus, or Bank of English): aim at representing a language or variety as a whole (contain both spoken and written language, different text types etc.)

- historical corpora (vs. corpora of present-day language) (e.g. Helsinki Corpus, ARCHER): aim at representing an earlier stage or earlier stages of a language

- regional corpora (vs. corpora containing more than one variety) (e.g. WCNZE = Wellington Corpus of Written New Zealand English): aim at representing one regional variety of a language

- learner corpora (vs. native speaker corpora) (e.g. ICLE = International Corpus of Learner English): aim at representing the language as produced by learners of this language

- multilingual corpora (vs. one-language corpora): aim at representing several, at least two, different languages, often with the same text types (for contrastive analyses)

- spoken (vs. written vs. mixed corpora) (e.g. LLC = London-Lund Corpus of Spoken English): aim at representing spoken language

A further distinction of corpus types refers not to the texts that have been included in the corpus, but to the way in which these texts have been treated:

- annotated corpora (vs. orthographic corpora): in annotated corpora, some kind of linguistic analysis has already been performed on the texts, such as sentence analysis or, more commonly, word class classification

[...] corpus has resisted and the choice has proved to be remarkably appropriate, in terms both of etymology and of the semantic ramifications of related or derived lexical items. We might examine some of these, exploiting the historical information provided by the Oxford English Dictionary.

The first derivate, or cognate in this case, we can consider is 'corpse'. It was once perfectly normal to use this term to signify a living body, but this sense died out, as it were, in the 17th century, leaving us with the dead body of a man (or formerly any animal). And this is precisely what critics have accused corpus linguistics of perpetrating: it assassinates living, communicative language and renders it a lifeless corpse, or corpus, for cold linguistic dissection. The schools of Corpus-Assisted Discourse Studies (CADS; Partington 2004, 2008; Baker 2006) are a lively reply to this charge and attempt to demonstrate how corpus techniques can be used to shed light on the contexts of production of both written texts and interactive discourse. Typically, for instance, the CADS researcher will work with concordances showing a far greater quantity of co-text than was traditional in other types of corpus linguistics (where 80 to 100 characters was often deemed sufficient to capture a lexical-semantic pattern), in order to glean clues about the context of situation. For the same reason it is common for the analyst to read, watch or listen to at least part of the corpus material, a process which can help provide a feel for how things are done linguistically in the discourse type being studied.

Another musing might be: can a corpus die out, become a corpus corpse? Would an updated version of the British National Corpus (BNC) render the old one obsolete? The answer is decidedly not. FLOB and FROWN do not kill off LOB and BROWN, nor does SiBol 05 do away with SiBol 93.
And not only because some researchers will still be interested in earlier stages of the language. The nascent discipline of MD-CADS, that is, Modern Diachronic Corpus-Assisted Discourse Studies, is predicated on the ability to track recent developments in language, that is, changes over recent time periods. Pioneering work in this field was conducted by Mair and his team using LOB and FLOB. With these comparatively small corpora, they were able to conduct studies of changes in the behaviour of very frequent words or constructions; their studies are therefore largely on grammar. But today, with much larger corpora at our disposal, studies of less frequent items (the so-called lexical, as opposed to grammatical, words) also become feasible. This opens up entirely new avenues of research in modern diachronic linguistics; we can study meaning change, especially of sets of lexical items, in relation both to internal linguistic factors and to external social, political and cultural influences. Because we can also study lexical patterns and how they differ between the two corpora, we are able to study changes in discourse processes as well (Partington ed. 2010).

Perhaps the only sort of corpus that can really die is one that was so badly designed it had little chance of surviving the travails of its birth.

Another set of derivates has to do with corpus, the body, as bulk. The OED gives us 'corpulent': large or bulky of body, fleshy, fat, as well as 'corpulence', and the magnificent US slang 'corporosity': bulkiness of body. So: a big fat corpus. The two most celebrated standard corpora, the BNC and the Bank of English, were certainly undertaken partly with the philosophy of 'big is beautiful'. How tightly they were designed is another matter: to a degree they were fattened up in a compromise between what was desirable and what was available.

From the very large to the extremely small, the OED also has 'corpuscule': a minute particle or body of matter, along with 'corpuscular' and 'corpusculum'. This term is generally used in physics and botany to indicate a tiny body of which a larger body consists. In this sense we might reasonably refer to LOB, FLOB, FROWN, BROWN and the recent BE06 (Baker 2009) as corpuscular corpora, since each is composed of tiny text fragments, each 2,000 words in length.

There are, of course, one or two common and rather well-known collocations which include our term. By far the most frequent in SiBol 05 is 'habeas corpus' (literally, 'you have a/the body'), most aptly, since all corpus linguists presumably have to have at least one corpus.
In common parlance too is 'corpus delicti' ('the body of the crime', sometimes the murder victim); we can all recall some criminally wretched corpus linguistic papers, including, often enough, one's own. Especially interesting is the expression 'corpus vile' (lowly, worthless body), defined as 'a living or dead body which is of so little value it can be used for experiment without regard for the outcome' or 'experimental material [...] which has no value except as an object of experimentation'. Much to our collective human shame, the classic corpora vilia are the animals used in scientific experiments. In some sense, all corpora are vile in having no value except as objects of experimentation, though we are usually careful not to destroy them in the course of enquiry. Perhaps the true corpus vile, however, would be one compiled for the purpose of a single study and never of use again.
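
The two techniques touched on above, counting the words that keep company with 'corpus' and reading concordance lines with an adjustable amount of co-text, can be sketched in a few lines of Python. This is only an illustrative sketch: the sample sentences are invented (SiBol itself is not used here), the function names `tokenize`, `collocates` and `kwic` are hypothetical, and the tokenizer is deliberately crude, so counts may run across sentence boundaries.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (crude on purpose)."""
    return re.findall(r"[a-z]+", text.lower())

def collocates(tokens, node, offset):
    """Count the words found `offset` positions away from each occurrence of `node`.

    offset=-1 looks at the word immediately before the node, offset=1 at the word after.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        j = i + offset
        if token == node and 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts

def kwic(text, node, width=40):
    """Return concordance lines with up to `width` characters of co-text on each side."""
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left co-text so that the node words line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, standing in for a real corpus.
sample = (
    "The court granted a writ of habeas corpus. "
    "Without a corpus delicti there is no case. "
    "A second habeas corpus petition followed."
)

tokens = tokenize(sample)
print(collocates(tokens, "corpus", -1).most_common(1))  # [('habeas', 2)]
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Raising `width` well beyond the traditional 80 to 100 characters gives the wider co-text that a CADS-style reading relies on.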

The most commonly employed plural form of corpus is corpora, although it suffered some competition from corpuses for a period (and perhaps still does: at the time of writing, May 2011, Google dredged up some 2,300 instances of 'text corpuses'). Though perhaps not normally renowned as great enthusiasts of etymology, corpus linguists have elected for once to be true to the Latin; corpus is a neuter noun, third declension, and thus takes corpora as its plural (as tempus, tempora); had it been masculine, second declension, we would indeed have been earnestly discussing corpi.

We might, then, look at some related items with a corpor* stem.[2] Such items tend to fall into two sets: firstly, words like corporate and incorporate, whose meanings are variations on 'united in one body' (a sense most certainly relevant to our professional notion of corpus), and secondly, those relating to corpor* as 'body or matter as opposed to spirit'. We find 'corporify': to cause to assume a body or material form; to solidify (but also, again, with a second sense of to incorporate, unite into one body), and 'corporize': to interpret or explain literally and materially; the opposite of spiritualize. Are corpus linguists a literal-minded crew? Very possibly. The most significant such use is perhaps 'corporeal', meaning of the nature of the animal body, or material and tangible.

This leads us to a very serious point. Corpus research is certainly, in a wider philosophical sense, materialistic. The corpus itself is a tangible artefact; the results of corpus research are tangible, evidence-based and open to replication or, at least, para-replication (Stubbs 2001: 124; Partington 2009: 293-294). But corpus linguistics is also, and in equal measure, mentalistic in its reliance on the intuition, selection, introspection and judgement of the analyst. Corpus linguistics is no respecter of the Cartesian mind-body dualism.
The mind and the machine interact inseparably; suffice it to contemplate serendipitous discovery and how the data can lead the analyst/mind into entirely unenvisaged and uncharted territory. In this precise regard, the most famous proverb to use corpor* is 'mens sana in corpore sano', that is, 'a healthy mind in a healthy corpus'. An excellent slogan for a salutary, as well as sane, corpus linguistics.
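
The corpor* notation used above is the kind of wildcard query that concordancers commonly accept. As a rough sketch only, and not the behaviour of any particular tool, such a query can be turned into a regular expression as follows. The helper names `wildcard_to_regex` and `find_matches` are hypothetical, the sample sentence is invented, and '*' is assumed here to stand for any run of letters.

```python
import re

def wildcard_to_regex(pattern):
    """Compile a wildcard query into a whole-word regex; '*' matches any run of letters."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\b" + r"[a-z]*".join(parts) + r"\b", flags=re.IGNORECASE)

def find_matches(text, pattern):
    """Return the distinct word forms in `text` that match the query, lowercased and sorted."""
    regex = wildcard_to_regex(pattern)
    return sorted({match.group(0).lower() for match in regex.finditer(text)})

# Invented sample sentence, standing in for a real corpus.
sample = "Corporate bodies incorporate; corporeal matter, a corpus, and corpulent figures."

print(find_matches(sample, "corpor*"))   # ['corporate', 'corporeal']
print(find_matches(sample, "*corpor*"))  # ['corporate', 'corporeal', 'incorporate']
```

Note that a strict corpor* query misses incorporate, which only appears once a leading wildcard is added.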
