
Corpus Linguistics

Developing a PolyU Language Bank
Sherman Lee
egslee@inet.polyu.edu.hk
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan

Outline

Background

Goals of corpus linguistics
Types of corpora
Applications of corpus analysis

As an illustration

Exploring units of meaning
Case study

Developing a PolyU Language Bank

Aims and objectives of project
Similar existing projects
Procedures

The PolyU Language Bank

Current status
Sample corpora
Sample search

Goals of corpus linguistics

Chomskyan linguistics vs Corpus linguistics

Langue (competence) vs Parole (performance)
Ideal speaker/hearer vs Complexity/variation
Language = innate mental faculty vs Language = social phenomenon
Intuitive evidence vs Empirical evidence
Universals vs Differences
Grammar vs Meaning

Basic tools

Corpus: a systematic collection of speech or writing that is built according to explicit design criteria for a specific purpose

cf. EAGLES broad definition: A corpus can potentially contain any text type, incl. word lists, dictionaries, etc.

Concordancer: search engine (e.g. WordSmith; SARA)

Concordance: occurrences of search item, displayed in list with immediate context shown
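
The concordance idea can be sketched in a few lines of Python. This is only a minimal illustration, not how WordSmith or SARA work internally: the function name `concordance`, the `width` parameter and the bracketed display are assumptions made for this example, and the sample sentence is the "dog" sentence quoted later in these slides.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines for every match of `keyword`.

    `width` is the number of characters of context kept on each side
    (an illustrative default, not a standard value).
    """
    lines = []
    # \b restricts matches to whole words; re.escape allows multi-word items.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search item lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = (
    "If you dog a dog during the dog days of summer, "
    "you'll be a dog tired dog catcher."
)
for line in concordance(sample, "dog", width=20):
    print(line)
```

Each printed line shows one occurrence of the search item with its immediate left and right context, which is the list-style display described above.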

Types of corpora
Written vs Spoken
General vs Specialised

e.g. ESP, Learner corpora

Monolingual vs Multilingual

e.g. Parallel, Comparable

Synchronic vs Diachronic; Monitor


Annotated vs Unannotated

Written corpora

Specialised corpora

Other examples of available corpora

Some applications of corpus analysis

Language teaching & learning

Empirical teaching data: authentic examples of language use
Reference source: answering learners' questions or explaining learner errors:
What's the difference between "at last" and "in the end"?
How is "hardly" used?
Preparation of teaching materials, e.g. vocabulary lists, CLOZE tests
CALL; concordancing and data-driven learning

Translation

Using parallel texts to find suitable translation equivalents
Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
Exploring units of meaning in texts

Linguistics and language research

Lexicography & lexical studies, e.g. relative word frequency
Language variation, e.g. linguistic features across registers
Grammar: corpora used as data to test hypotheses, syntactic theory
Pragmatics & discourse, e.g. CA of discourse features in spoken (conversational) data
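
Relative word frequency, mentioned above under lexical studies, is usually reported as occurrences per fixed number of tokens so that corpora of different sizes can be compared. The sketch below assumes a very simple tokenizer (lowercased alphabetic strings with an optional apostrophe part); the function names `tokenize` and `relative_frequency` are made up for this illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def relative_frequency(tokens, word, per=1_000_000):
    """Occurrences of `word` per `per` tokens (per million by default)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * per

# One of the example sentences from these slides, used as a tiny "corpus".
tokens = tokenize("Can I sit down? My dogs are barking")
print(len(tokens), relative_frequency(tokens, "dogs", per=100))
```

With a real corpus the default of one million is the more usual base; `per=100` is used here only because the sample has eight tokens.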

Exploring meaning, units of meaning

Focus on meaning because:

People interested in the meanings of texts, in how language is actually used in discourse
Meaning is a key problem for translation, language learning, information management

What are basic units of meaning?

Language teaching (TEFL): vocabulary often introduced in the form of new single words
Words considered to be basic units of meaning

Is the word an ideal unit of meaning?

"If you dog a dog during the dog days of summer, you'll be a dog tired dog catcher"
"Can I sit down? My dogs are barking"
Most lexical errors made by language learners result from failure to deal with ambiguities of single words

Unambiguous Units of Meaning

Notion of an Unambiguous Unit of Meaning (UUoM) necessary for understanding meaning
UUoM = keyword and all words in the context that contribute to making the word unambiguous
Compounds, idioms, multi-word units, collocations, set phrases
Often determined by a syntactic pattern

Adj + N: friendly fire, closing remarks
V + N: invite proposals, draw conclusions
Adv + Adj: politically correct, environmentally friendly
N + of + N: cause of death, proof of identity, code of practice, duty of care
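
As a rough illustration of pattern-based searching, the sketch below counts three-word sequences of the form "word of word". It is a deliberate simplification: without part-of-speech tagging it cannot confirm that the outer words are nouns, so it only approximates the N + of + N pattern. The function name `of_phrases` and the sample sentences are invented for this example.

```python
import re
from collections import Counter

def of_phrases(text):
    """Count three-word sequences of the form '<word> of <word>'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    # Slide a three-token window over the text and keep those centred on "of".
    for first, middle, last in zip(tokens, tokens[1:], tokens[2:]):
        if middle == "of":
            counts[f"{first} of {last}"] += 1
    return counts

# Invented sample text built around phrases from the list above.
sample = (
    "The cause of death was recorded. Proof of identity is required, "
    "and the cause of death must be certified."
)
for phrase, n in of_phrases(sample).most_common():
    print(n, phrase)
```

Recurrent candidates from such a count would still need to be checked in a concordance before being treated as units of meaning.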



Case study

Search for units of meaning in online dictionaries and corpora

friendly fire
environmentally friendly

Corpora from 1990s

British National Corpus (BNC)

100,000,000+ words
Written (90%): extracts from regional/national newspapers, specialist periodicals, academic books, popular fiction, un/published letters, memos, school/university essays
Spoken (10%): informal conversation, formal meetings (business, government), radio shows, phone-ins

The Times (1995, Jan-March)

10,220,367 words
Written: business, home news, readers' letters, reviews

Corpora from 1960s-1970s

Brown corpus / LOB corpus

Each 1 million words
Written, balanced corpora of 15 genres of text

Search results

What the results show

friendly fire, environmentally friendly

Represent fairly new concepts
Occur in the newer corpora (1990s) as units of meaning
Occur as entries in some of the online dictionaries only (not bilingual dictionaries)

New terminology and terms of common usage not always recorded in dictionaries and termbanks

One way of using corpora for learning and translation:

Use corpus evidence to help students recognise units of meaning; introduce notion of units of meaning into language learning

Aims of PULB project

To design and build an archive of language corpora = language bank

To be used by staff and students in the department
For teaching, language learning and research purposes

To provide a user-friendly platform

A WWW interface via which users can freely access the language bank
With browse, search and concordance facilities

Ingredients of PULB

Sources: standard corpora, departmental collections
Medium: written texts, transcribed spoken data
Language types: native speaker, learner corpora
Languages: English, Chinese, Japanese, French, German
Genres: business, law, academia, media, social, literature
Target size: 30 million words (European) / characters (Asian)
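
The target size is stated in words for the European languages and in characters for the Asian ones. A minimal sketch of that distinction follows; the function name `corpus_size` and the choice to count every non-whitespace character (punctuation included) are assumptions for illustration, not the project's actual counting rules.

```python
def corpus_size(text, language):
    """Size in words for space-delimited languages, in characters otherwise."""
    if language in {"Chinese", "Japanese"}:
        # Count every non-whitespace character (punctuation is included here).
        return sum(1 for ch in text if not ch.isspace())
    # For English, French, German etc., count whitespace-separated words.
    return len(text.split())

print(corpus_size("proof of identity", "English"))
print(corpus_size("语言 资料库", "Chinese"))
```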


Why a language bank? - What's in it for us

Free and simple shared access to a collection of language corpora

That you can utilise for your teaching
Authentic examples of language use at your fingertips
Empirical teaching data covering different specialisms (ESP, EAP)
That you can utilise for your research
A ready-made collection of data waiting for you to work on
Saving on time and resources

Way of incorporating new methods and information technology into the department's teaching and research activities

Increase students' awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
Way of integrating theory with technology in the classroom
Train students to be more computer-literate
All of the above can
Motivate students to become active learners
Help students to learn the target language more effectively (cf. goals of DDL)

Similar existing projects

W3 Corpora Project (Essex)

http://clwww.essex.ac.uk/w3c/
Access to corpora (Gutenberg texts, LOB, LOB-tagged)
Web interface for performing searches
Online tutorial and info on corpus linguistics

Web Concordancer (VLC, PolyU)

http://vlc.polyu.edu.hk/concordance/
Access to a variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
Web interface for performing searches

Directions for PULB

Build a language bank with features that parallel those of similar sites

~ VLC
Bring together corpora and texts of various types and genres, of different languages
~ Essex
Make available different facilities for different categories of users (cf. legal considerations)
Provide on-site tutorial, corpora-based info

Include extra features

Allow searches in multiple texts / corpora simultaneously
Some form of parallel concordancing
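
Searching several corpora at once could look roughly like the sketch below. It is not the PULB implementation: the function name `search_corpora`, the dictionary layout and the two toy "corpora" (invented sentences standing in for real collections) are all assumptions for illustration.

```python
import re

def search_corpora(corpora, phrase):
    """Return {corpus name: number of hits} for `phrase` in every corpus.

    `corpora` maps a corpus name to its full text.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in corpora.items()}

# Toy stand-ins for real corpora; the texts are invented for illustration.
corpora = {
    "news_1990s": "Two soldiers were killed by friendly fire. Friendly fire is rare.",
    "fiction_1960s": "He gave a friendly wave and lit the fire.",
}
print(search_corpora(corpora, "friendly fire"))
```

Returning one count per corpus keeps the results comparable side by side, which is the point of a simultaneous search; raw counts would still need normalising by corpus size before drawing conclusions.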


Target composition of PULB


[Diagram: the PolyU Language Bank at the centre, surrounded by its target corpora]
Groups: Specialised corpora, Spoken corpora, Learner corpora, General corpora
Corpora shown: Business Chinese; Business Japanese; Business English (PUBC); Legal Chinese; Legal English; Academic English; Workplace English; Japanese Literature; English Literature; HK spoken corpus; Conference speeches; Academic presentations; Social interactions; Student work; Business writing; Teaching reflections; BROWN; ICE; BNC
Other labels: French, German, Chinese, Japanese

Procedures (i)

Collate, sort, categorise data from various sources

Commercially available data
Departmental collections, incl.
PolyU Business Corpus (Li and Bilbow)
Bilingual corpora (Xu)
ESP / EAP corpora (Forey)
Learner corpora (Sengupta)

Procedures (ii)

For the departmental collections:

Decide how to present each collection

E.g. Sub-categories, macro categories

Clean up texts

E.g. Duplications of text samples
E.g. Structural features (headings, typographic features)
E.g. Personal information found in data, to protect anonymity or privacy of authors and speakers

Annotate texts

Provide descriptive information about each corpus: compiler, time of compilation, type of collection
Provide descriptive information about the texts: number, size, genre of subtexts; bibliographic info (written text); ethnographic info (spoken data)
Provide structural information for texts if necessary: mark texts for paragraph boundaries etc
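
The clean-up and annotation steps above can be sketched as small text-processing functions. This is a hedged illustration only: the function names, the `[EMAIL]` placeholder and the `<p>` paragraph markers are assumptions, and masking e-mail addresses covers just one kind of personal information (names and other identifiers would need separate handling).

```python
import hashlib
import re

# Simple e-mail pattern; an assumption for illustration, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def remove_duplicates(samples):
    """Drop text samples whose whitespace-normalised content was already seen."""
    seen, unique = set(), []
    for sample in samples:
        normalised = " ".join(sample.split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

def mask_emails(text, placeholder="[EMAIL]"):
    """Replace e-mail addresses with a placeholder to protect privacy."""
    return EMAIL.sub(placeholder, text)

def mark_paragraphs(text):
    """Wrap each blank-line-separated paragraph in <p>...</p> markers."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)

print(remove_duplicates(["a  b", "a b", "c"]))
print(mask_emails("Contact someone@example.org for details"))
print(mark_paragraphs("First paragraph.\n\nSecond paragraph."))
```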


Procedures (iii)

Put corpora together on platform; set up search and support facilities:

PULB map
Browse facility
Search and concordance facilities
Tutorial / general information

Transplant PULB onto dept website for use by staff and students

Promote PULB among corpora community

Data provider to data archives / distribution sites, e.g. OLAC; ICAME

The PolyU Language Bank

Current status
Range of corpora totalling 12M+ words
Individual corpus descriptions
Index of corpora
Simple-to-use built-in concordancer
Available at http://langbank.engl.polyu.edu.hk/

The PolyU Language Bank

Some of the currently available corpora

PolyU Business Corpus (Eng, Chi, Jap)
BNC Sampler Corpus (Spoken, Written)
Corpus of Multilingual Texts
Corpus of Nursing and Health Science Texts
Learner Corpus of Essays and Reports
HK Bilingual Corpus of Legal and Documentary Texts
...

How you can contribute

Talk to us about your ideas

What would you like to see being incorporated into PULB?
In terms of corpora
In terms of search facilities and supplementary information
Can you think of other ways in which PULB can be organised and structured?
How likely are you to make use of PULB in your teaching and research?
Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
Do you know of similar projects being undertaken elsewhere that we can learn from?

Talk to us about your collections / corpora

Do you have collections of language data from past research projects that are (could be) presented as a corpus (corpora)?
Can we help you put your collections to good use?
Can we work together to incorporate your collections into PULB?

Concluding remarks

Corpora represent a valuable but under-exploited resource for teaching and research
PULB aims to bring together various corpora under a single departmental archive, accessible via WWW
You can help us by contributing your ideas and/or your language collections
Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form
Thank you very much

Social grooming

CLOZE

PolyU Business Corpus

Compiled in 1999-2000 (Li & Bilbow)

Multilingual - comparable corpora:

English (c. 1.3 M words)
Chinese (c. 1.2 M words)
Japanese (c. 1.1 M words)

Business texts from: newspapers, government reports, company reports and brochures
Has been used for creating a bilingual English-Chinese business lexicon

PolyU Business Lexicon

Duplication
