• Corpus Parameters
• Types of Corpora
Assignment 10%
• Choose one corpus and investigate how
that corpus was built.
• Present your findings in class (ppt)
• Written assignment: 1000 words.
Parameters
• Language:
– Monolingual
– Multilingual (comparable corpora)
– Parallel
• Source type:
– Written
– Spoken
– Mixed
Parameters (II)
• Corpus size: not all-important; it
depends very much on the type of texts used
• Annotated / unannotated (type of encoding
used: plain text, SGML/XML encoded)
• Static / dynamic (static corpus vs. monitor corpus)
• Corpus / sub-corpus
• Number of words (tokens) / number of distinct words (types)
Type/token ratio
121m tokens (general corpus) - 475,633 types - 213,684 occur only once
211m tokens (general corpus) - 638,901 types
323m tokens (general corpus) - 812,467 types
418m tokens (general corpus) - 938,914 types - 438,647 occur only once
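The counts above (tokens, types, and words that occur only once) can be computed for any text. The following is a minimal sketch, assuming a naive tokenizer that lowercases the text and keeps runs of letters and apostrophes; the function name `corpus_stats` and the sample sentence are illustrative only, and real corpus tools tokenize more carefully.

```python
import re
from collections import Counter

def corpus_stats(text):
    """Return token count, type count, hapax count and type/token ratio."""
    # Assumption: a token is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Hapaxes are the types that occur exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapaxes": hapaxes,
        "ttr": len(counts) / len(tokens) if tokens else 0.0,
    }

sample = "the cat sat on the mat and the dog sat on the rug"
stats = corpus_stats(sample)
print(stats)
```

Applied to the figures above, the ratio falls as the corpus grows: roughly 0.0039 at 121m tokens and roughly 0.0022 at 418m tokens. This is why type/token ratios are only comparable between samples of similar size.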
Corpus construction
• Acquisition methods:
– Directly from electronic format
– Optical scanning
– Keyboarding
– Speech transcription > ...
Corpus construction (II)
• Criteria in corpus design:
– Size (small corpora suit genre-specific studies, whereas
large corpora support robust, general statements about a
language)
– Genre (domain, distribution, age, …)
• Corpus structure:
– A priori (Brown, LOB, …) – determined in advance
– A posteriori – eclectic, opportunistic
– Old material is replaced with new material
Corpus construction (III)
• Selection, permission (copyright),
acquisition
• Data preparation: optical scanning,
keyboarding, speech transcription
• Cleaning, spelling, encoding (annotation)
• Documentation / manual
• Evaluation
• Distribution / dissemination
Criteria ...
• Sampling: a sampling frame designed to
allow the exploitation of certain linguistic
properties
• Balance and representativeness
• Information lost through cleaning
• Duplication
• Spoken corpora ... When working with speech,
information can be lost through transcription
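Duplication is one of the criteria that is easy to check automatically. The sketch below is one simple approach, not a standard method: it treats two documents as duplicates when they are identical after collapsing whitespace and lowercasing. The names `fingerprint` and `drop_duplicates` are hypothetical, and near-duplicates (for example, the same article with a different footer) would need a fuzzier comparison.

```python
import hashlib
import re

def fingerprint(text):
    """Hash a document after normalizing whitespace and case."""
    # Assumption: texts differing only in spacing or case are duplicates.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(documents):
    """Keep the first occurrence of each document, in original order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "Corpus  linguistics is fun.",
    "corpus linguistics is fun.",
    "A different text.",
]
unique_docs = drop_duplicates(docs)
print(len(unique_docs))
```

Note that the normalization itself discards information (spacing, capitalization), so it is applied only to the comparison key and the stored text is left untouched.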
Web as a corpus
• The Web can be a very useful source of texts
• The Web is very helpful for languages other
than English
• Quite often there is no control over the
language being investigated, so
filtering (if possible) is necessary
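The filtering step mentioned above can be sketched with a simple heuristic: count how many very common function words of each language appear in a page and keep the pages where the wanted language scores highest. This is an illustration only. The tiny word lists, the language codes and the function names below are assumptions made for this example, and a real project would more likely use a dedicated language-identification tool.

```python
import re

# Assumption: a handful of frequent function words per language.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "ms": {"yang", "dan", "di", "ini", "itu", "dengan", "untuk", "dalam",
           "tidak", "adalah"},
}

def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No function word found at all: refuse to guess.
    return best if scores[best] > 0 else None

def filter_by_language(pages, wanted):
    """Keep only the pages guessed to be in the wanted language."""
    return [p for p in pages if guess_language(p) == wanted]

pages = [
    "Korpus ini dibina dengan teks yang dikumpul dalam talian.",
    "The corpus is built with texts that were collected online.",
]
malay_pages = filter_by_language(pages, "ms")
print(malay_pages)
```

Short pages and pages that mix languages are where a heuristic like this is most likely to fail, which is one reason the slide adds "if possible".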
Corpus Annotation
• Enrichment of a corpus with various types of
information
• It can be done at every level:
– Word: part of speech, sense
– Sentence: sentence boundaries, syntactic tree
– Discourse: coreferential chains, discourse
segments
– Certain expressions: named entities
Annotation Scheme
• A standard used to annotate certain
characteristics
• Gives meaning to a tag
• Nowadays it is usually expressed in XML
• Usually, in addition to an annotation scheme, a
set of guidelines is produced to assist the
annotators
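To make the idea of XML annotation concrete, here is a minimal sketch of word-level part-of-speech annotation using Python's standard `xml.etree.ElementTree` module. The element names `s` and `w`, the `pos` attribute, the tag labels and the helper `annotate_sentence` are illustrative assumptions, not a particular published scheme; an actual scheme and its guidelines would define these.

```python
import xml.etree.ElementTree as ET

# Assumption: tokens arrive already tagged; the tag labels are illustrative.
tagged = [("Corpora", "NNS"), ("are", "VBP"), ("useful", "JJ")]

def annotate_sentence(tagged_tokens, sent_id):
    """Wrap each token in a <w> element carrying its part of speech."""
    s = ET.Element("s", {"id": sent_id})
    for word, pos in tagged_tokens:
        w = ET.SubElement(s, "w", {"pos": pos})
        w.text = word
    return s

sentence = annotate_sentence(tagged, "s1")
xml_string = ET.tostring(sentence, encoding="unicode")
print(xml_string)
# <s id="s1"><w pos="NNS">Corpora</w><w pos="VBP">are</w><w pos="JJ">useful</w></s>
```

The scheme is what tells a reader that `pos="NNS"` means a plural noun; the XML only records the tag.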
Examples (II)