
Lecture 2

• Corpus parameters
• Types of corpora
Assignment (10%)
• Choose one corpus and investigate how it was built.
• Present your findings in class (ppt)
• Written assignment: 1,000 words
Parameters

• Language:
– Monolingual
– Multilingual (comparable corpora)
– Parallel
• Source type:
– Written
– Spoken
– Mixed
Parameters (II)
• Corpus size: not all-important; it depends very much on the type of
texts used
• Annotated / unannotated (type of encoding used: plain text,
SGML/XML encoded)
• Static / dynamic (monitor corpus)
• Corpus / sub-corpus
• Number of words (tokens) / number of distinct words (types)
Type/token ratio
121m tokens (general corpus) - 475,633 types - 213,684 occur only once
211m tokens (general corpus) - 638,901 types
323m tokens (general corpus) - 812,467 types
418m tokens (general corpus) - 938,914 types - 438,647 occur only once
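Counts like the ones above can be obtained from any tokenised text. The following is a minimal sketch, assuming a very simple tokeniser (lower-cased runs of letters and apostrophes); the function name `type_token_stats` and the sample sentence are illustrative and do not come from the corpora listed.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Return token count, type count, once-only count and type/token ratio."""
    # Assumption: lower-cased runs of letters/apostrophes count as word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    # Types that occur exactly once in the text.
    once_only = sum(1 for n in freq.values() if n == 1)
    return {
        "tokens": len(tokens),
        "types": len(freq),
        "once_only": once_only,
        "ttr": len(freq) / len(tokens) if tokens else 0.0,
    }

sample = "the cat sat on the mat and the dog sat on the rug"
stats = type_token_stats(sample)
print(stats)
```

A real corpus would need a tokeniser suited to its language and encoding; the counting step stays the same.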
Corpus construction
• Acquisition methods:
– Directly from electronic formats
– Optical scanning
– Keyboarding
– Speech transcription > ...
Corpus construction (II)
• Corpus design criteria:
– Size (small corpora suit genre-specific studies, whereas
large corpora support robust, general statements about a
language)
– Genre (domain, distribution, age, …)
• Corpus structure:
– A priori (Brown, LOB, …): determined in advance
– A posteriori: eclectic, opportunistic
– Old material is replaced with new material
Corpus construction (III)
• Selection, permission (copyright),
acquisition
• Data preparation: optical scanning,
keyboarding, speech transcription
• Cleaning, spelling, encoding (annotation)
• Documentation (manual)
• Evaluation
• Distribution / dissemination
Criteria ...
• Sampling: a sampling frame designed to
allow the exploitation of certain linguistic
properties
• Balance and representativeness
• Information lost through cleaning
• Duplication
• Spoken corpora ... When working with speech,
information can be lost through transcription
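One way to deal with the duplication problem listed above is to drop documents that are identical after light normalisation. This is only a sketch under stated assumptions: `normalise` and `drop_exact_duplicates` are hypothetical helper names, and the normalisation (lower-casing, collapsing whitespace) catches exact copies only, not near-duplicates.

```python
import hashlib

def normalise(text):
    # Collapse whitespace and lower-case so trivially different copies match.
    return " ".join(text.lower().split())

def drop_exact_duplicates(documents):
    """Keep the first copy of each document, comparing normalised text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat sat.", "A different text."]
unique_docs = drop_exact_duplicates(docs)
print(unique_docs)
```

Note that the normalisation itself discards information (case, spacing), which is the same trade-off the cleaning bullet warns about; here it is used only for comparison and the original text is kept.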
The Web as a corpus
• The Web can be a very useful source of texts
• The Web is very helpful for languages other
than English
• Quite often there is no control over the
language being investigated, so
filtering (if possible) is necessary
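The filtering step mentioned above could, in its crudest form, be a function-word check. The sketch below assumes two tiny hand-picked stopword sets (English and Malay) and a hypothetical `guess_language` helper; it illustrates the idea only and is not a reliable language identifier.

```python
# Assumption: very small, hand-picked function-word lists for illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "ms": {"yang", "dan", "di", "ini", "itu", "dengan", "untuk", "tidak"},
}

def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

pages = [
    "The corpus is a collection of texts that is stored in electronic form",
    "Korpus ini dibina dengan teks yang dikumpul untuk kajian bahasa",
]
# Keep only the pages that look like the language under investigation.
malay_pages = [p for p in pages if guess_language(p) == "ms"]
print(malay_pages)
```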
Corpus annotation
• Enrichment of a corpus with various types of
information
• It can be done at every level:
– Word: part of speech, sense
– Sentence: sentence boundaries, syntactic tree
– Discourse: coreferential chains, discourse
segments
– Certain expressions: named entities
Annotation schemes
• A standard used to annotate certain
characteristics
• Gives meaning to a tag
• Nowadays usually expressed in XML
• In addition to the annotation scheme, a
set of guidelines is usually produced to assist the
annotators
Examples (II)

• <P><S><W POS="PRON" NUM="PL" LEMMA="we">We</W>
<W POS="V" LEMMA="have">have</W>
<W POS="EN" LEMMA="develop">developed</W>
<NP><W POS="DET" LEMMA="a">a</W>
<W POS="A" LEMMA="computational">computational</W>
<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>
<W POS="PUNCT">,</W>
...</NP> ... </S></P>
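Once text is annotated this way, the tags can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the fragment is adapted from the example above with the elided parts ("...") dropped so that it is well-formed, which is an assumption about how the full markup closes.

```python
import xml.etree.ElementTree as ET

# Adapted from the slide example: elisions removed, nothing else added.
fragment = (
    '<P><S>'
    '<W POS="PRON" NUM="PL" LEMMA="we">We</W>'
    '<W POS="V" LEMMA="have">have</W>'
    '<W POS="EN" LEMMA="develop">developed</W>'
    '<NP>'
    '<W POS="DET" LEMMA="a">a</W>'
    '<W POS="A" LEMMA="computational">computational</W>'
    '<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>'
    '<W POS="PUNCT">,</W>'
    '</NP>'
    '</S></P>'
)

root = ET.fromstring(fragment)

# (word form, part of speech, lemma) for every <W> element, in document order.
words = [(w.text, w.get("POS"), w.get("LEMMA")) for w in root.iter("W")]

# Example query: all word forms tagged as nouns.
nouns = [w.text for w in root.iter("W") if w.get("POS") == "N"]

for form, pos, lemma in words:
    print(form, pos, lemma)
```

Elements without a `LEMMA` attribute (here the punctuation mark) simply yield `None` for that field.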
Advantages of annotation
• Ease of exploitation
• Reusability
• Multi-functionality
• Explicit analyses
• Once a corpus is annotated it can be used in
further research
Annotation methods
• Can be done automatically, semi-automatically
or manually
• Sometimes the method is automatic and
the results are then post-processed
• Usually special tools are used to minimise
human error
Criticisms of annotation
• Corpus annotation produces impure corpora
– Sometimes annotation can hide certain features
• Consistency versus accuracy
– Measures to compute the reliability of an
annotation
• Sometimes the annotation scheme can cover
a phenomenon only partially.
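The slide does not name a particular reliability measure. One commonly used option is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below is an illustration under that assumption; the function name and the two label sequences are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical part-of-speech tags from two annotators for six tokens.
annotator_1 = ["N", "V", "N", "DET", "N", "V"]
annotator_2 = ["N", "V", "A", "DET", "N", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A high kappa shows that annotators are consistent with each other, not that they are right, which is the consistency-versus-accuracy point above.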
Existing corpora
• Brown Corpus/LOB corpus
• Bank of English
• Wall Street Journal, Penn Treebank, BNC, ANC, ICE, WBE,
Reuters Corpus
• Canadian Hansard: parallel corpus English-French
• York-Helsinki Parsed Corpus of Old English Poetry
• Tiger corpus – German
• CORIS/CODIS - contemporary written Italian
• MULTEXT: 1984 and The Republic in many languages
Distributors of corpora
• LDC (Linguistic Data Consortium)
• ELRA (European Language Resources
Association)
• TRACTOR (TELRI Research Archive of
Computational Tools and Resources)
• ICAME (International Computer Archive of
Modern and Medieval English)
References
• Karin Aijmer and Bengt Altenberg (1991) English
corpus linguistics, Longman
• Douglas Biber, Susan Conrad and Randi Reppen (1998)
Corpus linguistics, Cambridge University Press
• Graeme D. Kennedy (1998) An introduction to corpus
linguistics, Longman
• Tony McEnery and Andrew Wilson (1996) Corpus
linguistics, Edinburgh University Press
References (II)
• Geoff Barnbrook (1996) Language and
Computers, Edinburgh University Press
• Tony McEnery (2003) Corpus linguistics. In
Ruslan Mitkov (ed.) The Oxford Handbook of
Computational Linguistics, Oxford University
Press
Type/token ratio
From Brown corpus: 1m tokens (written only) - 50,406 types
From 1980s Birmingham/Cobuild corpora: 1m tokens (spoken
only) - 36,807 types - 17,459 occur only once
[NB: fewer types than Brown (written only), i.e. spoken language
is more repetitive and uses a smaller vocabulary]
4m tokens (Times newspapers only) - 122,773 types - 54,144
occur only once
18m tokens (general corpus) - 228,323 types - 131,299 occur
only once
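The figures above can be turned into ratios directly. The sketch below is worked arithmetic only; it assumes that "1m", "4m" and "18m" stand for exactly 1,000,000, 4,000,000 and 18,000,000 tokens, although the slide figures are presumably rounded.

```python
# (label, tokens, types, types occurring once; None where not reported)
reported = [
    ("Brown, written", 1_000_000, 50_406, None),
    ("Birmingham/Cobuild, spoken", 1_000_000, 36_807, 17_459),
    ("Times newspapers", 4_000_000, 122_773, 54_144),
    ("General corpus", 18_000_000, 228_323, 131_299),
]

ratios = {}
for label, tokens, types, once in reported:
    ratios[label] = types / tokens  # type/token ratio
    once_share = f"{once / types:.1%}" if once is not None else "n/a"
    print(f"{label:<28} TTR = {ratios[label]:.4f}   once-only share = {once_share}")
```

Under that assumption the ratio falls as the corpus grows, which is why type/token ratios are only comparable between corpora of similar size.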
Ways to exploit a corpus
• Word (token) / type frequency lists
• N-grams
• Concordances
• Collocations/colligations
• Specially designed programs (especially when
the corpus is annotated)
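Of the methods listed above, n-grams are the simplest to sketch: every run of n consecutive tokens is extracted and counted. The function name `ngrams` and the sample sentence are illustrative assumptions, and tokenisation is reduced to splitting on whitespace.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```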
Frequency lists

• list the words that appear in a corpus together
with their frequencies
• they provide a survey of the corpus
• a frequency list becomes more meaningful
when compared with other lists
• they remove a word from its context
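Comparing lists from corpora of different sizes requires normalised frequencies rather than raw counts. The sketch below assumes two tiny made-up samples and hypothetical helper names; it normalises to occurrences per thousand tokens, where real corpora would more often use per million.

```python
from collections import Counter

def frequency_list(tokens):
    """Words with their raw counts, most frequent first."""
    return Counter(tokens).most_common()

def per_thousand(tokens):
    """Relative frequency of each word per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 1000 * count / total for word, count in counts.items()}

# Made-up samples standing in for a spoken and a written corpus.
spoken = "well you know I think you know it is fine".split()
written = "the analysis shows that the method is robust".split()

print(frequency_list(spoken)[:3])

spoken_rel = per_thousand(spoken)
written_rel = per_thousand(written)
# Compare one word across the two lists; 0.0 if it is absent from a list.
for word in ("is", "the", "you"):
    print(word, spoken_rel.get(word, 0.0), written_rel.get(word, 0.0))
```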
Concordances
• show words in the contexts in which they appear
• usually obtained using special
programs that allow the concordance lists
to be sorted and manipulated
• KWIC (Key Word In Context) is the most
common format
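A KWIC display can be sketched in a few lines: for each hit, a fixed number of tokens to the left and right is kept and the keyword is aligned in the middle. The function name `kwic`, the `width` parameter and the sample sentence are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "machine learning is a field in which learning from data is central".split()
hits = kwic(tokens, "learning", width=2)

# Right-align the left context so the keyword forms a column.
for left, key, right in hits:
    print(f"{left:>20}  {key}  {right}")
```

Concordance programs typically also let the user sort the lines by the words to the left or right of the keyword, which this sketch leaves out.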
Collocations
• collocation = the occurrence of two or more
words within a short space of each other in
text
• the collocates are extracted using a window
to the left and right of a specified word
• can be used to further analyse the context of
a word
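The window-based extraction described above can be sketched as follows. The function name `collocates`, the window size and the sample sentence are illustrative assumptions; the sketch returns raw co-occurrence counts only, whereas collocation studies usually go on to apply a statistical association measure.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words within `window` tokens to the left and right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

tokens = "strong tea and strong coffee but powerful computers and strong tea".split()
tea_collocates = collocates(tokens, "tea", window=2)
print(tea_collocates.most_common())
```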
The word learning
