
T. 1 The Field of Study of Computational Linguistics


Computational Linguistics (CL), also known as Natural Language Processing (NLP), is an area that belongs as much to computer science as to linguistics, and it is still terra incognita for many linguists. It is an interdisciplinary field concerned with the processing of language by computers. Since machine translation (MT) began to emerge some sixty years ago, CL has grown and developed considerably. It has expanded theoretically through the development of computational and formal models of language, and in the process it has vastly increased the range and usefulness of its applications. Now, in the era of information technology, CL is experiencing continuous and vigorous growth.

CL crept into existence tentatively. When shall we say it all began? Perhaps in 1949, when Warren Weaver wrote his famous memorandum claiming that translation by machines might be possible. The first conference on machine translation took place at MIT in 1952, and the first journal, called Machine Translation, came out in 1954. The Association for Machine Translation and Computational Linguistics was formed in 1962, but the term computational linguistics itself, coined by David Hays, appeared only in the mid-1960s. The rapid growth in the field has taken place mostly since the late 1970s.

Computational Linguistics is the study of computer systems for understanding and generating natural language. The founders of CL were mostly linguists, and they saw the potential of the computer for carrying out the rules of language exactly and with great speed. Chomsky's Syntactic Structures, which came out in 1957, served to solidify the notion of grammar as a deductive system suitable for computer applications. At that time, CL was seen as a field dealing with the implementation of formal grammatical theories. The findings of CL also fed back into theoretical linguistics: a number of theories within the formal tradition came into existence, such as Generalized Phrase Structure Grammar, Lexical Functional Grammar and Head-Driven Phrase Structure Grammar. One natural function of CL is to test the grammatical frameworks proposed by theoretical linguistics. Another aim is to facilitate practical tasks such as information extraction, summarization, machine translation and computer-aided translation. In none of these areas is success achievable by linguistic methods alone.

CL is closely related to statistical natural language processing. Roughly speaking, statistical NLP associates probabilities with the alternatives encountered in the course of analyzing an utterance or a text and accepts the most probable outcome as the correct one. Consider the sentence The boy saw the girl with the telescope. Here the phrase with the telescope is more likely to modify saw than the girl. This is obvious to humans but not to machines, because computers process language from left to right. As a whole, text processing depends on knowledge of the world as well as knowledge of language. A probabilistic sketch of the telescope example is given below.
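The following minimal sketch (not part of the original text) shows how a statistical system might resolve such a prepositional-phrase attachment ambiguity by comparing probabilities; the probability values and the lookup table are invented for illustration, since a real system would estimate them from a syntactically annotated corpus.

```python
# Minimal sketch of statistical disambiguation of a PP-attachment
# ambiguity, as in "The boy saw the girl with the telescope".
# The probabilities are invented; a real system would estimate them
# from a corpus of parsed sentences.

attachment_probs = {
    # (verb, noun, preposition): (P(attach to verb), P(attach to noun))
    ("saw", "girl", "with"): (0.8, 0.2),
}

def choose_attachment(verb, noun, prep):
    """Return the more probable attachment site for the PP."""
    p_verb, p_noun = attachment_probs.get((verb, noun, prep), (0.5, 0.5))
    return "verb" if p_verb >= p_noun else "noun"

print(choose_attachment("saw", "girl", "with"))  # -> "verb"
```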

As an interdisciplinary field of study, CL is construed very broadly. At one end are speech recognition and text-to-speech synthesis, which are often studied in departments of electrical engineering rather than linguistics. At the other end is the compiling of linguistic corpora in electronic format: large collections of sentences, texts or recorded speech used as sources for linguistic research. Other areas of study include text segmentation, part-of-speech tagging, parsing, word-sense disambiguation, anaphora resolution, natural language generation, machine translation, computer-aided translation, information retrieval (IR), automatic generation of multiple-choice tests, and computer-assisted language learning (CALL).

It is also important to outline the relationship between theoretical linguistics and computational linguistics. Although both are ultimately concerned with understanding linguistic processes, they take rather different approaches to language. Computational linguists have been concerned with developing procedures for handling a useful range of natural-language input; the requirement of constructing complete systems has led them to seek an understanding of the entire process of natural-language comprehension and generation. In contrast, theoretical linguists have focused primarily on one aspect of language performance, namely grammatical competence: how people come to accept some sentences as grammatical and reject others as ungrammatical. Traditional (theoretical) linguistics has also been concerned with language universals, hoping to gain insight into the innate language mechanisms that enable people to learn and use languages. Thus theoretical linguistics provides valuable input to CL.

[Diagram: Computational Linguistics divides into Natural Language Understanding and Natural Language Generation; Natural Language Understanding comprises Sentence Analysis and Discourse Analysis.]

T. 2 Approaches to Phonology in Computational Linguistics

Phonology is the systematic study of the sounds used in language and of their composition into syllables, words and phrases. Computational phonology is the application of formal and computational techniques to the representation and processing of phonological information.

1. The phoneme and its distinctive features

There is no limit to the number of distinct sounds that can be produced by the human vocal apparatus. However, this infinite variety is harnessed by human languages into sound systems consisting of a few dozen language-specific categories known as phonemes. English has a variety of t-like sounds,

such as the aspirated t of ten, the unreleased t of net, and the flapped t of water. In English these varieties of t do not differentiate words. Since these phones are similar and appear in complementary distribution, they are said to be allophones of the English phoneme t.

However, setting up a few allophonic variants for each of a finite set of phonemes still does not account for the infinite variety of sounds. If one records multiple instances of the same utterance by a single speaker, many small variations can be observed in pitch, frequency, intonation and so on. These variations arise because speech is a motor activity, and perfect repetition of the same utterance is practically impossible. Similar variations occur between different speakers, since one person's vocal apparatus differs from the next person's; this is how we distinguish people's voices. So, if ten people say the word ten ten times, they will produce a hundred distinct acoustic records of the t sound.

Similarity of phonemes is determined mostly on the basis of their place and manner of articulation. For example, t has several allophones in English, and the choice of allophone depends on the phonological context: t is flapped in water, but two other allophones are found in atlas and cactus. Since English syllables cannot begin with tl, atlas cannot be segmented into a + tlas.

2. Computational phonology

Phonology is the study of sound alternations in language; computational phonology deals with computational models of those alternations. One of the major problems has to do with the relation between spelling and pronunciation, i.e. between orthography and phonology. Some languages, such as Spanish, Serbo-Croatian and Finnish, exhibit a close relation between spelling and pronunciation. In other languages, such as English and French, the spelling differs considerably from the pronunciation. Phonological alternations can be obscured by spelling; at the same time, the spelling may introduce alternations which have no counterpart in phonology, e.g. innovate -> innovation, picnic -> picnicking.

Such cases usually pose problems for the computational analysis of phonemes. The basic approaches used in computational phonology are autosegmental rules and constraint-based approaches. The major idea is to segment speech into different allophones and then, given a particular phonological environment, to use filters, or constraints, to narrow down the possible variants and choose the correct ones in order to generate speech. A toy illustration follows below.
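As an illustration of this idea, the sketch below (not from the original text; the environments and symbols are simplified for illustration) selects an allophone of the phoneme /t/ from its immediate phonological context: a flap between vowels, as in water, and an aspirated t word-initially, as in ten.

```python
# Simplified sketch of context-based allophone selection for the
# English phoneme /t/. The rules are reduced for illustration; a real
# system would operate on full phonological feature representations.

VOWELS = set("aeiou")

def allophone_of_t(prev, next_):
    """Choose an allophone of /t/ from its immediate context."""
    if prev in VOWELS and next_ in VOWELS:
        return "flap"         # e.g. the t of "water"
    if prev == "#":           # "#" marks a word boundary
        return "aspirated t"  # e.g. the t of "ten"
    return "plain t"          # elsewhere, e.g. the t of "cactus"

print(allophone_of_t("a", "e"))  # flap, as in "water"
print(allophone_of_t("#", "e"))  # aspirated t, as in "ten"
```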

T. 3 Approaches to Morphology in Computational Linguistics

Computational morphology deals with the processing of words in both their graphemic and phonemic form. It has a wide range of practical applications, such as spelling correction and tagging. These tasks may seem simple to humans, but they can pose problems for a computer program.

1. Linguistic background

Natural languages have intricate systems for creating words and word-forms from smaller units in a systematic way; the branch of linguistics concerned with these phenomena is morphology. In CL, words are just arbitrary strings of symbols. In any human language, the infinity of words is produced from a finite collection of smaller units (morphemes), and the task of computational morphology is to find and describe the mechanisms behind this process. The basic building blocks of human language are morphemes, defined as the smallest units of language to which meaning can be assigned, i.e. the minimal units of grammatical analysis.

[Diagram: a sentence divides into phrases, its immediate constituents; phrases divide into lexemes, and lexemes into morphemes, the ultimate constituents.]

The realizations of morphemes as parts of words are called morphs. Take and took are considered allomorphs of the lexeme take. Exceptional forms like these are usually difficult to analyze computationally, e.g. ox -> oxen, man -> men, analysis -> analyses, woman -> women.

Linguistically, morphemes are divided into free morphemes and bound morphemes; all affixes are bound morphemes. Languages contain approximately 10,000 morphs, and strict rules govern how these morphs combine to form words. This way of structuring the lexicon greatly lightens the cognitive load of remembering so many words. The basic notion in text processing is that of the word: in CL, a word is considered to be a string of symbols between spaces, and a sentence a string of symbols between two full stops (a naive scheme sketched below). How much and what sort of information is expressed morphologically differs widely from one language to another. As a whole, the simpler and more implicit the grammatical system of a language, the more difficult it is to process computationally.
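Here is a minimal sketch (not from the original text) of those working definitions: splitting a text into sentences at full stops and into words at spaces. Real systems must also handle abbreviations, decimal numbers and other punctuation, which this deliberately ignores.

```python
# Naive tokenizer implementing the working definitions above:
# a sentence is a string of symbols between full stops,
# a word is a string of symbols between spaces.
# Abbreviations ("Dr.") and decimals ("3.14") would break this.

def sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def words(sentence):
    return sentence.split()

text = "The boy saw the girl. She had a telescope."
for s in sentences(text):
    print(words(s))
# ['The', 'boy', 'saw', 'the', 'girl']
# ['She', 'had', 'a', 'telescope']
```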

Traditionally, linguists differentiate between the following types of languages:

Isolating (e.g. Mandarin Chinese): there are no bound forms and no affixes; the only morphological operation is composition.

Agglutinating/agglutinative (e.g. Finno-Ugric and Turkic languages): all bound forms are affixes, added to a root like beads on a string. Every affix represents a distinct morphological feature, and every feature is expressed by exactly one affix.

Inflectional (e.g. Indo-European languages).

Polysynthetic (e.g. Eskimo languages): these languages express more structural information morphologically than other languages do.

2. Applications of computational morphology

The branch of morphology which deals with the way morphs are combined to form lexemes is called morphotactics. The constraints involved in this process are of interest to computational linguists because they determine the correctness of the lexemes which computers generate, e.g. hospital -> hospitalize -> pseudohospitalize -> pseudohospitalization. Despite attempts to exclude semantics from CL, semantic constraints are often employed in order to generate lexemes correctly: the prefix un- yields untidy, but not *unsad or *undirty. That is how the importance of semantics comes to the foreground. Semantics has always posed considerable problems for NLP, since computers treat sentences as strings of symbols and cannot differentiate meanings. A sketch of such a morphotactic check follows below.
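This is a minimal sketch, not from the original text; the lexicon entries and the "negative" semantic feature are invented for illustration. It encodes the constraint that un- attaches only to adjectives, and only to those whose meaning is not already negative.

```python
# Toy morphotactic check for the prefix "un-": it may attach only to
# adjectives, and (a crude semantic constraint) only to adjectives
# that are not already negative in meaning. Entries are invented.

LEXICON = {
    "tidy":  {"pos": "adj", "negative": False},
    "sad":   {"pos": "adj", "negative": True},   # blocks *unsad
    "dirty": {"pos": "adj", "negative": True},   # blocks *undirty
}

def can_prefix_un(stem):
    entry = LEXICON.get(stem)
    return entry is not None and entry["pos"] == "adj" and not entry["negative"]

for stem in ["tidy", "sad", "dirty"]:
    word = "un" + stem
    print(word if can_prefix_un(stem) else "*" + word)
# untidy, *unsad, *undirty
```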

The basic applications of computational morphology are the following:

Hyphenation: segmenting words correctly into their morphs is a prerequisite here. The main problem is so-called false segmentation, e.g. unity is not un + ity, whereas unlikely is un + likely.

Grapheme-to-phoneme conversion: this has to do with the conversion of text to speech, and it has to resolve ambiguities in the conversion of characters into phonemes, e.g. the th of hothouse is pronounced t + h (hot + house), not as a single sound.

Spelling correction: most current systems use a lexicon and a set of affixes to check spelling properly.

Parsing: in computer science, parsing is the process of analyzing a sequence of symbols in order to determine its grammatical structure (morphological and syntactic) with respect to a given formal grammar. A parser is a program that carries out this task; the name is analogous to its usage in grammar and linguistics. In a parser, morphological analysis of words is an important prerequisite for syntactic analysis.

Lemmatization: this is the process of determining the lemma of a given word. The lemma is the base form of a lexeme, e.g. walk is the lemma in the grammatical paradigm of walking, walker, walks, and better has good as its lemma; a minimal sketch is given below.
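The sketch below is not from the original text: irregular forms such as better -> good and oxen -> ox are looked up in an exception dictionary, with a crude suffix-stripping fallback for regular forms.

```python
# Minimal lemmatization sketch: irregular forms come from an exception
# dictionary; regular forms fall back on naive suffix stripping. A real
# lemmatizer would consult a full lexicon and the word's part of speech.

EXCEPTIONS = {
    "better": "good",
    "oxen": "ox",
    "men": "man",
    "women": "woman",
    "analyses": "analysis",
    "took": "take",
}

SUFFIXES = ["ing", "er", "s"]  # tried in this order

def lemma(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walker", "walks", "better", "oxen"]:
    print(w, "->", lemma(w))
# walking -> walk, walker -> walk, walks -> walk,
# better -> good, oxen -> ox
```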

Recommended bibliography:

1. Mitkov, Ruslan. Anaphora Resolution.
2. Mitkov, Ruslan (ed.). The Oxford Handbook of Computational Linguistics.
3. Quah, Chiew Kin. Translation and Technology.
