

that have taken place from the phonetic shape of the


same word spoken in isolation. For this reason, the
changes are often called processes. This method of
accounting for the patterns is questioned later in this
article.

Problems Analyzing Phonetic Processes in Discourse
Before looking in more detail at some of the phonetic
patterns that occur in discourse, we should first
examine some of the specific problems that arise in
analyzing spoken discourse.
Difficulty of Observation and Comparability

The analysis of phonetic patterns in spoken discourse is still in its infancy. Phoneticians have begun
to seriously analyze conversation only in the last
2 decades, despite the programmatic remarks by one
of the most prominent figures in British phonetics and
linguistics in the first half of the last century, J. R.
Firth (1935: 71): 'Neither linguists nor psychologists
have begun the study of conversation, but it is here we
shall find the key to a better understanding of what
language really is and how it works.' There are two
main reasons for this lack of progress: difficulty of
observation and lack of comparability.
Speakers generally do not like to be observed
when they are involved in a conversation, and it is
difficult to know the extent to which the form and
content of conversation is affected by being observed.
Nevertheless, large corpora of spoken discourse have
been collected using various methods of elicitation.
For example, in the SWITCHBOARD corpus of
American English (Godfrey et al., 1992), 500 speakers
of several American English dialects were recorded
having telephone conversations about a chosen topic.
Lack of comparability is a more serious problem. It
is relatively simple to investigate the ways in which
the phonetic shape of a word changes in different
utterance contexts in read speech. An experimenter
constructs sentences that ensure that the word or
words of interest are placed in different contexts, for
example, initial versus final or stressed versus unstressed. Several subjects can then be asked to produce multiple repetitions of the sentences. If the
experimenter is interested in the effect of speech
rate, the subjects can be asked to speak more slowly
or more quickly for some of the renditions. Several
repetitions of the same sentences by a large number of
subjects allow the investigator to collect different
phonetic shapes.
In the recording of natural spoken discourse, the
analyst has no control over the form or content:
repetitions of the same material will be chance

occurrences. One of the ways that researchers


have sought to overcome this is to impose certain
controls on elicitation. The SWITCHBOARD corpus
represents one method of control by requiring speakers to talk about particular topics. The most extreme
method of control is to require subjects to cooperatively complete a simple task. The map task
(Anderson et al., 1991) is one such scenario. Each
subject receives a simple map containing fictitious
landmarks. One of the maps has a path drawn on it,
and it is the task of the subject with this map to
instruct the other where the path goes. This form of
elicitation goes some way toward solving the comparability problem because speakers produce the same
lexical items in similar syntactic and interactive structures.
Analysts have also proposed less satisfactory
solutions. The first has been to simply consider the
phenomena of read speech to be an accurate representation of the patterns to be found in discourse.
The second method has been to rely on intuition,
experience, and other patterns to make some of the
data up. Both of these solutions are themselves problematic. Discourse is the most complex, the most
intricate, and the most frequent form of language
use. At best, the patterns we produce in read speech
will be a subset of the patterns we use when we talk
in natural situations. And, although there is a well-established tradition of using intuition in syntactic
theory, using it to predict the patterns that occur in
discourse is implausible.
Accounting for the Patterns

Up until now the phonetic patterns found in discourse


have been described with reference to the phonetic
shape of the relevant words spoken in isolation. However, in describing and explaining the patterns found
in discourse, analysts have often used the phonetic
shape of a word in isolation as the starting point from
which to derive the other forms. The rationale behind
this is that the shape of a word spoken in isolation is
phonetically the most complete form because it has
not been subjected to any of the factors giving rise to
reduction.
Despite its appeal, this approach is problematic for
several reasons. It is often tacitly assumed that the
phonetics of the word spoken in isolation (also
known as citation form) are readily accessible and
stand in a simple relationship to the phonetics of
nonisolation forms. However, this is not the case. In
order to elicit the isolated phonetics of a word, a speaker must read words from a list. This means that the
isolated and discourse phonetics are taken from
completely different linguistic and interactive activities. Second, reading aloud is an activity we learn to


Figure 1 Sonagram and phonetic transcription of I used to want my dad to stop smoking, spoken by a female. Courtesy of the IViE Corpus.

do, usually at school, and is affected by sociolinguistic factors such as spelling pronunciation and being
told to speak clearly. Perhaps the most far-reaching
sociolinguistic effect is that a speaker who uses a
nonstandard variety in discourse may produce phonetics that are closer to the standard variety when
reading words from a list. A more interesting problem
is that the phonetics of the word in isolation might
simply be the phonetic shape of a word spoken at
one particular place in the rhythmic and interactional
structure of an utterance in discourse. For example,
in Tyneside English, one of the phonetic characteristics of turn-final plosives is that they have an aspirated release (Local et al., 1986; Local, 2003). And it
is precisely the same pattern that speakers produce
when reading word lists (Docherty and Foulkes,
1999).
The final problem may be clarified best with an
analogy. The behavior of an animal in the wild is
not described in terms of behavioral patterns observed in captivity, but this is exactly what many
phoneticians and phonologists have done when trying
to account for the phonetic patterns they observe in
discourse.

Patterns in Discourse
The patterns of reduction we have described can be
found in many types of speech, but it is in spoken
language in its most common and most natural
setting that the most elaborate and most systematic
patterns are to be found: spoken discourse.
Figures 1 and 2 contain sonagrams (see Phonetics,
Acoustic) of extracts, taken from the IViE corpus,
from a short conversation between two female students talking about the effects of tobacco advertising.
Both excerpts illustrate well the types of patterns that
are typically found in discourse. The transcriptions

above the sonagrams are designed to provide a rough


indication for the reader of what is being said.
The pronunciation of nearly every word in
the two excerpts differs from what we might expect
the speakers to have said had they been reading the
words individually from a list. Spoken in isolation,
the infinitive particle to is pronounced [tʰuː].
In the three occurrences of this word in Figure 1 (at
A and C) and in Figure 2, we find different phonetic
shapes:
• In the first example in Figure 1 (at A) and in Figure 2, no closure for a plosive is made. Instead, a fricative stricture remains. However, the friction is weaker than it is for the preceding alveolar fricative, indicating that the speaker is making a tighter stricture.
• Despite the absence of complete closure, all three examples show an abrupt increase in energy, which we expect of a plosive release.
• The central vowel following the release in each case is voiceless.
In an isolated pronunciation of making we might
expect a voiceless prevelar plosive between the two
vowels. However, in Figure 1 (at F) the speaker does
not make the closure for a plosive. Instead, she produces a voiceless palatal fricative that can be seen
in the spectrogram at F as energy above 2000 Hz.
Likewise, in Figure 2 there is no bilabial plosive at
the beginning of be but, instead, a voiceless bilabial
approximant.
Spoken in isolation, the word want has a voiced
alveolar nasal followed by a voiceless alveolar plosive
finally: [wɒnt]. In Figure 1 (at B) the expression want
my is pronounced [woDmaI]. The nasal at the end of
want has assimilated its place of articulation to the
nasal at the beginning of my. Instead of an alveolar
plosive, there is a short period of creaky voice that can


Figure 2 Sonagram and phonetic transcription of I used to be, spoken by a female. Courtesy of the IViE Corpus.

be seen as a brief discontinuity in regular voicing at


around 33 000 ms.

Glottal Disharmony in Suffolk English

Table 1 Glottal disharmony in Suffolk English

Glottalization of first word    No glottalization of first word
Example                         Example
a. don't you                    h. don't it
b. about her                    i. about it
c. keep on                      j. keep it
d. get a                        k. get it
e. want a                       l. want it
f. make him                     m. make it
g. look a                       n. look at
                                o. look at it

A phenomenon of consonant disharmony in some


varieties of English illustrates well a typical discourse process, one that was discovered, and has only
been observed systematically, in natural conversation, and that can only be adequately described
without making reference to read speech or speech
in isolation.
English is not traditionally regarded as a language
that has systems of consonant or vowel harmony,
in which a particular feature or features are spread
over a structurally defined piece of utterance, such as a
word (see Phonetics of Harmony Systems). Disharmony refers to the opposite process, whereby consonants
or vowels in neighboring syllables become less alike.
A form of consonant disharmony involving restrictions on the occurrence of glottalization in adjacent
syllables has been sporadically reported in discourse
material from a few varieties of English (Trudgill,
1974; Lodge, 1984). It has been described most fully
for Suffolk English in Simpson (1992, 2001).
A well-known phenomenon in many varieties of
English is the realization of voiceless plosives in the
syllable coda as glottal stops, for example, [bʌʔ] but
and [lʊʔ] look. Although the glottal stop can be all
that represents the plosive, a glottal stop may accompany the oral closure of the plosive (sometimes called

reinforcement), for example, [lʊʔk] look and [kiːʔp] keep.
In Suffolk English, the situation is more complex.
Tokens of the same word can sometimes have a final
glottal stop and sometimes not. Transcriptions of
word tokens extracted from a conversation that have
this alternating pronunciation are shown in Table 1.
In Table 1 examples (a–g), the vowel of the first word
is glottalized, and in tokens (b, c, e, g) there is a word-final glottal stop. In some cases, the glottalization


continues into the following syllable. In Table 1


examples (h–o), no glottalization is present over the
vowel of the first word and the final consonant of the
word is either voiced or voiceless. Vowel glottalization, followed by a glottal stop, is, however, found
over the syllable it or at. There is one other important
difference between the phonetic shape of the words in
(a–g) and those in (h–o). In the words with final
glottalization, (a–g), there is only rarely any accompanying oral closure, as in (c). By contrast, in examples (h–o), although there is no glottalization over the
first words, there is always a stricture of close approximation (the fricatives in j, m, n) or of complete
closure in the tap and oral and nasal stops of the
remaining examples. What all the tokens in Table 1
have in common is that there is one instance of glottalization that defines a stretch of utterance that is
one (a–g), two (h–n), or three (o) syllables in length.
The glottalization patterns in Suffolk English are
of particular interest for a number of reasons. First,
they have been reported for data only from spoken
discourse. Indeed, in an attempt to examine the phenomenon under controlled laboratory conditions, the
speaker who had produced the tokens in Table 1 was
asked to read short sentences of the type we can look
it over first in the studio. However, the recording
situation together with reading aloud led the speaker
to move her pronunciation closer to the standard, and
differences between the two varieties had a severe
effect on her reading fluency. During the majority of
the instances of a tri- or quadrisyllabic stretch, for
example, we can stop it at work, she produced different types of disfluency, such as pauses and repairs
somewhere within the relevant stretch.
Second, the patterns constitute a long-domain phenomenon; that is, they must be described in terms of a
stretch of utterance comprising one or more syllables.
As such, they are comparable to phenomena such as
vowel harmony. In general, analysts have treated patterns of glottalization or consonantal alternation as
being restricted to the segment. For example, it is
possible to treat the intervocalic dorsal fricatives in
make and look (in Table 1, m, n) or the bilabial
fricative in keep (in Table 1, j) as being the products
of lenition, that is, the weakening of a plosive closure
(see, e.g., Lodge, 1984, for this type of approach).
However, the patterns in the data in Table 1 show
that this account misses the bigger picture.

Assimilation and Nonassimilation in German Discourse
Assimilation is a phenomenon whereby a consonant
or a vowel becomes more like an adjacent consonant
or vowel (see Assimilation). Assimilation, like other

Table 2 Progressive assimilation in German

Example    Meaning    Transcription
Happen     bit        [pm]
haben      have       [bm]
nehmen     take       [mm]
Nacken     nape       [kŋ]
wagen      dare       [gŋ]
fangen     catch      [ŋŋ]

Table 3 Regressive assimilation in German

Example      Meaning      Transcription
anmachen     put up       [mm]
hat mehr     has more     [pm]
hat Peter    has Peter    [pb]
ankleben     stick on     [ŋk]
angeblich    so-called    [ŋg]
hat keine    has no       [kk]

phenomena in connected speech, is often seen as a


consequence of the speaker striving to optimize and,
where possible, reduce articulatory movements. One
of the most common forms of assimilation is in place
of articulation, for example, [bæg gɜːl] bad girl.
Phoneticians and phonologists have been concerned
with two main aspects of place assimilation. First,
there has been considerable debate as to whether the
assimilation to place is as simple as the transcription
of our example suggests. Research has shown that,
in examples such as [bæb bɔɪ] bad boy, although there
is bilabial closure at the end of bad arising from
the assimilation to the beginning of boy, an apical
gesture of the final [d] in bad may still be present.
The second important issue that phonologists have
been concerned with is providing a formal description
of assimilation.
An aspect of assimilation that has received relatively
little attention is what factors govern its occurrence.
In common with other patterns of articulatory reduction, it is assumed that assimilation is optional but that
the likelihood of occurrence increases as the conditions for reduction become more and more favorable
(e.g. increased tempo, informal context, unstressed
position, and function vs. content words). German,
like English, is a language that has traditionally been
described as having progressive and regressive assimilation in place of articulation. Examples of progressive
assimilation are given in Table 2 and examples of
regressive assimilation are given in Table 3.
However, research on place assimilation from
discourse material (Simpson, 2001) suggests that
assimilation in German, and perhaps in other languages, is a more complex phenomenon than the


Figure 3 Sonagram and phonetic transcription of two tokens of the German expression kein Problem 'no problem', spoken by two
females. In (A), the final nasal of kein is unassimilated; in (B), the final nasal shares the bilabial place of the initial plosive of Problem.

Figure 4 Sonagram and phonetic transcription of tokens of the German expressions (A) in Kiel 'in Kiel' and (B) gut gehen 'be okay',
spoken by two female speakers. In both cases, the word-final alveolars retain their place of articulation. Arrows indicate nonpulmonic
stop releases of the nasal (A) and the plosive (B).

examples in Tables 2 and 3 at first suggest. Places of


potential assimilation in approximately 4 hours of
spontaneous dialogs from the Kiel corpus of spontaneous speech (IPDS, 1995–1997) were examined.
Two tokens of the expression kein Problem 'no problem' are shown in the sonagram in Figure 3. In the
second example, the nasal at the end of kein is bilabial, assimilating to the place of articulation of
the plosive at the beginning of Problem. In the first
example, however, no such assimilation is present;
the nasal is apical. Of the 14 tokens present in the
4.5 hours of dialog analyzed, over one-half were considered to have an assimilated nasal. From these
examples alone, we might conclude that assimilation
is indeed an optional phenomenon.
But in a whole range of other intra- and interword
structures, assimilation is not as common. Two typical examples are shown in Figure 4. The example in
Figure 4A is a token of the prepositional phrase in
Kiel; the one in Figure 4B is a token of the expression
gut gehen 'be okay'. In both of these examples, despite the presence of the word-initial dorsal plosives

in Kiel and gehen, the final stops of in and gut both


maintain their apico-alveolar place of articulation.
Indeed, both stop releases are visible (indicated by
arrows). Interestingly, these releases are produced
with an oral air stream mechanism caused by the
dorsal closure being made before the word-final
stop closure is released. Tongue movement after
both closures have been made causes a slight change
of pressure sufficient to cause weak plosion on release
of the apical closure. From the expectations about
assimilation, the examples in Figure 4 appear to be
good candidates for assimilation. However, out of 34
cases of the prepositional phrase in + town name
(with an initial nonapical nasal or oral stop) in the
corpus, only three were considered to have assimilated; in all others, the pattern shown in Figure 4 was
present. In over 70 cases of the words gut 'good' and
geht 'goes' followed by a word-initial nonapical stop,
only three cases were thought to constitute assimilations. In common intraword contexts, such as unbedingt
'really' or ungefähr 'approximately', on the other hand, assimilation
occurred in nearly all tokens. By contrast, in many


noun or verb compounds also common in the corpus,


such as Terminkalender 'diary' and anbieten 'offer',
tokens with assimilation were completely absent.
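The asymmetry between these environments can be made explicit with a little arithmetic over the counts just quoted. The sketch below (Python, purely illustrative) simply restates those figures as proportions; 70 is used as a floor for the 'over 70' count, so that rate is an upper bound, and no new data are introduced.

```python
# Assimilation rates implied by the corpus counts reported above.
contexts = {
    "in + town name":             (3, 34),  # 3 of 34 tokens assimilated
    "gut/geht + non-apical stop": (3, 70),  # 3 of "over 70" tokens: an upper bound
}

for label, (assimilated, total) in contexts.items():
    print(f"{label}: {assimilated}/{total} ≈ {assimilated / total:.1%}")

# 'kein Problem': over half of the 14 tokens assimilated, i.e. more than 50%,
# against roughly 9% and no more than about 4% in the two contexts above.
```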
Although it is not yet clear in the German corpus
under which circumstances a morpheme-final oral
or nasal stop may be assimilated to the beginning of
the following syllable, it is clear that it is not a case of
optionality under appropriate conditions. In many
of the interword examples, all the conditions thought
to be favorable for assimilation (lack of stress, tempo,
word class, and style of speech) are met, yet the final
stops in in, geht, and gut fail to assimilate. In other
cases, assimilation is the norm. Assimilation looked
at from a discourse perspective (i.e., what speakers
produce in the most natural environment of conversation) is a complex phenomenon, not one that can be accounted for merely in terms of the reduction of
articulatory effort.

Conclusion
It is clear from this article that research into the
phonetic patterns found in discourse is still in its
infancy. However, a slowly growing body of analysis
is showing that the phonetic patterns of discourse are
detailed and systematic. Most important, perhaps,
this work shows that the phonetic patterns of discourse cannot be extrapolated directly from those
found in read speech. And it is likely that many of
the hypotheses and theories that have mainly been
constructed on the basis of patterns in read speech
will require major revisions.
See also: Assimilation; Coarticulation; Conversation Anal-

ysis; Generative Phonology; Natural Phonology; Phonetics of Harmony Systems; Phonetics, Articulatory;
Phonetics, Acoustic; Phonetics: Overview; Phonology in
the Production of Words.

Bibliography
Abercrombie D (1965). Conversation and spoken prose.
In Abercrombie D (ed.) Studies in phonetics and linguistics. London: Oxford University Press. 1–9.
Anderson A, Bader M, Bard E, Boyle E, Doherty G, Garrod
S, Isard S, Kowtko J, McAllister J, Miller J, Sotillo C,
Thompson H & Weinert R (1991). The HCRC map task
corpus. Language and Speech 34, 351–366.
Brown G (1981). Listening to spoken English. London:
Longman.

Docherty G & Foulkes P (1999). Sociophonetic variation


in glottals in Newcastle English. In Proceedings of
XIVth International Congress of Phonetic Sciences. San
Francisco. 1037–1040.
Docherty G J, Milroy J, Milroy L & Walshaw D (1997).
Descriptive adequacy in phonology: a variationist perspective. Journal of Linguistics 33, 275–310.
Firth J R (1935). The technique of semantics. Transactions
of the Philological Society 36–72.
Godfrey J J, Holliman E C & McDaniel J (1992).
SWITCHBOARD: telephone speech corpus for research
and development. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing
1992, vol. 1. San Francisco. 517–520.
IPDS (1995–1997). The Kiel corpus of spontaneous speech
(3 vols). CD-ROM #2–4. Kiel: Institut für Phonetik
und digitale Sprachverarbeitung. Available at: http://
www.ipds.uni-kiel.de.
Kelly J & Local J K (1989). Doing phonology. Manchester,
UK: Manchester University Press.
Lass R (1984). Phonology: an introduction to basic concepts. Cambridge, UK: Cambridge University Press.
Lindblom B (1983). Economy of speech gestures. In
MacNeilage P F (ed.) Speech production. New York:
Springer. 217–246.
Lindblom B (1990). Explaining phonetic variation: a sketch
of the H and H theory. In Hardcastle W J & Marchal A
(eds.) Speech production and speech modeling.
Dordrecht: Kluwer Academic Publishers. 403–439.
Local J K (2003). Variable domains and variable relevance:
interpreting phonetic exponents. Journal of Phonetics
31, 321–339.
Local J K, Kelly J & Wells W H G (1986). Some phonetic
aspects of turn-delimitation in the speech of urban
Tynesiders. Journal of Linguistics 22, 411–437.
Lodge K R (1984). Studies in the phonology of colloquial
English. London: Croom Helm.
Shockey L (2003). Sound patterns of spoken English.
Oxford: Blackwell.
Simpson A P (1992). Casual speech rules and what the
phonology of connected speech might really be like.
Linguistics 30, 535–548.
Simpson A P (2001). Does articulatory reduction miss
more patterns than it accounts for? Journal of the International Phonetic Association 31, 29–40.
Trudgill P (1974). The social differentiation of English
in Norwich. Cambridge, UK: Cambridge University
Press.

Relevant Website
IViE corpus. English intonation in the British Isles. http://
www.phon.ox.ac.uk.


Phonetic Transcription and Analysis


J C Wells, University College London, London, UK
2006 Elsevier Ltd. All rights reserved.

Introduction
Phonetic transcription is the use of phonetic symbols
to represent speech sounds. Ideally, each sound in a
spoken utterance is represented by a written phonetic
symbol, so as to furnish a record sufficient to render
possible the accurate reconstruction of the utterance.
The transcription system will in general reflect the
phonetic analysis imposed by the transcriber on the
material. In particular, the choice of symbol set will
tend to reflect decisions about (1) segmentation of the
language data and (2) its phonemicization or phonological treatment. In practice, the same data set may
be transcribed in more than one way. Different transcription systems may be appropriate for different
purposes. Such purposes might include descriptive
phonetics, theoretical phonology, language pedagogy,
lexicography, speech and language therapy, computerized speech recognition, and text-to-speech synthesis.
Each of these has specific requirements.

Phonetic Symbols
For most phoneticians, the symbol set of choice is the
International Phonetic Alphabet (IPA), the alphabet
devised by the International Phonetic Association.
This is a set of about 100 alphabetic symbols (e.g.,
ŋ, ɔ) together with a handful of non-alphabetic symbols
(e.g., the length mark ː) and about 30 diacritics (e.g.,
those exemplified in t̪ and ɑ̃). All of the symbols are
summarized in the IPA chart (Figure 1); this chart
and guidelines for symbol use appear in the IPA
Handbook (Nolan and Esling, 1999), which replaced
the earlier Principles booklet (Jones, 1949).
The IPA is not the only phonetic alphabet in use.
Some scholarly traditions deviate in trivial particulars
(e.g., by the use of š in place of IPA ʃ, or y for IPA j);
others deviate in a substantial number of the symbols
used (e.g., the Danish dialect alphabet; see Figure 2)
(Jespersen, 1890). Where the local language, or the
language being taught, is written in a non-Latin
script, phonetic symbols for pedagogical or lexicographic purposes may be based on the local script,
e.g., Cyrillic (Figure 3) or kana (Figure 4). Even where
the local language is written in the Latin alphabet,
IPA symbols might be judged unfamiliar and userunfriendly. Thus, in English-speaking countries, zh is
often used as an informal symbol corresponding to
IPA ʒ, whereas in Turkey, ş might be used rather than
IPA ʃ. Some dictionaries aimed at native speakers of

English attempt to show the pronunciation of a word


entirely by following the conventions of the English
orthography (a practice perhaps better termed respelling rather than transcription). In practice, these
conventions are insufficient: for example, English
spelling does not indicate word stress, and there is no
unambiguous way of indicating certain vowel qualities.
Thus, ordinary spelling conventions may be supplemented by the use of diacritics in symbols, such as
ī (= IPA aɪ), or, indeed, by a sprinkling of IPA symbols,
such as ə. Other dictionaries use entire transcription
systems based on ad hoc diacritics, using symbols such
as ā (= IPA eɪ). Between 1970 and 2000 in Britain,
though not in the United States, these non-standard systems were largely supplanted in general lexicography
by the use of the IPA.
Until the recent development of computers able to
handle large character sets, authors who wanted to
use the IPA in print publications often faced typographical difficulties, because non-specialist printers
would often be unable to provide the special symbols.
Since the 1990s, this problem has disappeared; customized single-byte phonetic and multi-byte Unicode
fonts have become widely available, together with the
applications that can use them (Wells, 2003). Nevertheless, there are many circumstances, such as in
email communication, under which the robust transmission of special symbols cannot be relied upon.
There is still a place, therefore, for ways of representing phonetic symbols using nothing but the American
Standard Code for Information Interchange (ASCII)
character set. The Speech Assessment Methods
Phonetic Alphabet (SAMPA) is one widely used
ASCIIization of the IPA (Figure 5) (Wells, 1997).
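By way of illustration, the sketch below shows what such an ASCIIization amounts to in practice: a table of SAMPA-to-IPA correspondences and a character-by-character converter. Only a small, single-character subset of the standard SAMPA table is included, multi-character symbols are deliberately ignored, and the function name and example string are our own.

```python
# A minimal sketch: converting a SAMPA (ASCII) string to IPA.
# Only a small, single-character subset of the SAMPA table is covered.
SAMPA_TO_IPA = {
    "A": "ɑ", "{": "æ", "V": "ʌ", "Q": "ɒ", "O": "ɔ", "U": "ʊ",
    "I": "ɪ", "E": "ɛ", "@": "ə", "S": "ʃ", "Z": "ʒ", "T": "θ",
    "D": "ð", "N": "ŋ", ":": "ː", '"': "ˈ",
}

def sampa_to_ipa(sampa: str) -> str:
    """Replace each SAMPA character by its IPA equivalent.

    Characters not in the table (p, t, k, s, m, n, spaces, ...) are
    passed through unchanged, which is what SAMPA intends for them.
    """
    return "".join(SAMPA_TO_IPA.get(ch, ch) for ch in sampa)

print(sampa_to_ipa('D@ k{t s{t Qn D@ m{t'))  # -> ðə kæt sæt ɒn ðə mæt
```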
For the remainder of this article, it is assumed
that transcription will be based on the IPA. As will
become clear, however, there is no unique IPA transcription for a language: rather, there may be several
systems, all using the IPA alphabet and all equally
scientific.

Impressionistic vs. Systematic


On first exposure to an unknown language, or to an
unknown dialect of a familiar language, the fieldworker does not know what sort of phonetic material
is going to be encountered. Under these circumstances, a phonetically untrained observer will be
likely to refer the incoming data to the known phonetic categories of his/her own first language, or to
those of some other language with which he/she
is familiar. This is an impressionistic transcription
(Abercrombie, 1964: 35). The trained observer, on


Figure 1 The International Phonetic Alphabet chart. Reprinted with permission from the International Phonetic Association (Department of Theoretical and Applied Linguistics, School of English, Aristotle University of Thessaloniki, Thessaloniki, Greece).


Figure 2 The Dania phonetic alphabet, as used in Den store Danske udtaleordbog. Reprinted from Brink et al. (1991), with permission.

Figure 3 Transcription of Russian citation forms in Cyrillic


respelling, as used in Essentials of Russian grammar. Reprinted
from Smirnitsky (1975), with permission.

the other hand, can ideally refer instead to general


phonetic categories. (The purpose of phonetic ear-training is precisely to establish such language-independent, general-phonetic categories in the phonetician's mind.) As a sound system is investigated, any
impressionistic transcription is subject to revision.
Characteristics that were initially ignored or overlooked may prove to be phonologically relevant; conversely, some characteristics that were noticed and
notated may prove to be phonologically irrelevant.
Thus, for example, a European producing an impressionistic transcription of an Australian language
might at first overlook a distinction such as alveolar
vs. retroflex place (which then turns out to be relevant) while distinguishing voiced vs. voiceless phonation (which then turns out to be irrelevant). As a sound
system becomes clear, the analyst is in a position to
replace the ad hoc impressionistic transcription by a
systematic transcription that reflects the structure of
the language under description.

Figure 4 IPA symbols for English consonants and their kana


transcription equivalents, as used in Nihongo kara super-native no
Eigo e. Note the use of diacritics with the kana equivalents of l
and r. Reprinted from Shimaoka (2004), with permission.

A maximally narrow transcription explicitly indicates all of the phonetic detail that is available.
A broad transcription implicitly states much of this


Figure 5 Vowel symbol chart, showing the Speech Assessment Methods Phonetic Alphabet (SAMPA) and International Phonetic
Alphabet (IPA). Reprinted with permission from University College London, Department of Phonetics and Linguistics.

detail in the conventions for interpreting the symbols,


while keeping the transcriptions of actual language
material (the text) less complicated. There are two
main factors to be considered: the choice of characters
(simple vs. comparative) and the number of characters (phonemic vs. allophonic). We consider these in
turn.

Simple vs. Comparative


For practical purposes, it is important that a transcription system for a language be kept simple. The
symbol t, part of the basic lowercase Roman alphabet, is simpler than symbols such as t̪, tʰ, or t̪ʰ. Therefore, when a language has only one voiceless plosive
in the alveolar area, it is appropriate to symbolize it t,
rather than to complicate the transcription by deploying the more complex symbol. Thus it is appropriate
to use the same symbol /t/ for Swedish (for which the
sound so denoted is typically dental and aspirated),
English (alveolar, aspirated), French (dental, unaspirated), and Dutch (alveolar, unaspirated), even though
the first three could more precisely be written t̪ʰ, tʰ,
and t̪, respectively. It is more efficient to state these
interpretative conventions once only, rather than to
repeat the information every time that the symbol is
used. If, however, the Swedish sound is written t̪ʰ, etc.,
the transcription system is in this respect comparative.
Similarly, the five vowels of Greek are represented
as i e a o u, even though phonetically the second
and fourth vowels may well be associated with the

general-phonetic symbols ɛ and ɔ, respectively. The


five vowels of Japanese may also be written simply as
i e a o u, even though the last vowel is typically
unrounded and resembles cardinal [ɯ] (Okada,
1999).
Letters of the basic Latin alphabet are simpler than
other letters: d is simpler than ð or ɖ, and i is simpler
than ɪ. More subtly, ʃ is considered simpler than ʂ or ɕ,
and e is simpler than ɛ, æ, ə, ø, and œ. Nevertheless, in
languages in which ʃ and ɕ are distinct, e.g., Polish,
both symbols are required; similarly, both e and æ are
required in Danish. Letters without diacritics are simpler than letters with diacritics (as with Swedish t̪ʰ).
Consonantal diacritics are usually unnecessary in the
broad transcription of a language. The Arabic ayn
can reasonably be written ʕ, even by those who
believe it to be a pharyngealized glottal stop [ʔˤ]
(Thelwall and Saadeddin, 1999). The rather wide
inventory of vowel symbols furnished by the IPA
means that diacritics for raising, lowering, centralizing, and so on can normally be dispensed with in
broad transcription. Again, diacritics may be necessary in languages in which they symbolize a phonemic
contrast, as in the case of French nasalized vowels
(ɑ̃, etc.). It is typographically simple to transcribe
the English consonant in red as r, even though
phonetically it is an approximant rather than a trill.
It would be comparative to write it ɹ. Equally, it is
comparative to write the French consonant in rue as
ʀ or ʁ; it is simple to write it r, as in the transcription
shown in Figure 6 (Passy, 1958), but in an account of


than with explicit click symbols ǀ, ǃ, ǁ (or the former IPA symbols ʇ, ʗ, ʖ).

Phonemic vs. Allophonic

Figure 6 A page from Passy's 1958 text, Conversations françaises en transcription phonétique, as an example of a pedagogical
Reprinted from Passy (1958), with permission.

the contrastive phonetics of English and French, the


comparative symbols might be appropriate.
IPA symbols for voiceless plosives, such as p t k,
may be regarded as unspecified with respect to possible aspiration. Diacritics are available to show them
as aspirated (pʰ tʰ kʰ) or as unaspirated (p˭ t˭ k˭); the
latter diacritic is not part of the official IPA repertoire
but rather is part of the extended IPA (ExtIPA) supplement designed for clinical phonetics (Nolan and
Esling, 1999: 187, 193). In Chinese, in which there is
a phonemic contrast between aspirated and unaspirated plosives, but there are no essentially voiced
plosives, to write them as /p t k b d g/ (as in the Pinyin
romanization) is simpler than to write them as /pʰ tʰ
kʰ p t k/ or /p t k p˭ t˭ k˭/ or /pʰ tʰ kʰ p˭ t˭ k˭/. Some
possible simplifications are generally rejected as inappropriate, except, perhaps, for special purposes, such
as email. Following the orthography, the Polish high
central vowel could be written y rather than ɨ (Jones,
1956: 335). The Zulu simple clicks could be written
with c, q, x, again following the orthography, rather

For many purposes, it is more appropriate to use


symbols that refer to phonemes rather than to allophones. In a transcription, the rules for the distribution of allophones can be stated once and for all in
the accompanying conventions, which will allow the
remaining transcription to be uncluttered and less
complicated. Pedagogical transcriptions in dictionaries and language-learning textbooks are usually
phonemic. One of the sources of tension in the establishment of agreed transcription systems for automatic speech recognition, as successive languages were
brought within the SAMPA system (Wells, 1997), was
between the generalist's assumption, that there should
be a distinct phonetic symbol for each sound, and the
linguist's preference for a more economic notation,
with a distinct symbol only for each phoneme.
The Spanish voiced obstruents have both plosive
and fricative/approximant allophones. Allophonically, they include both b d g and β ð ɣ. Following
the principle of simplicity, the phonemes are usually
written b, d, and g. For some pedagogical purposes,
it may be relevant to insist on the difference between
plosives and fricatives, but for lexicographic or
speech-recognition purposes, this difference is irrelevant (particularly since word-initial /b d g/ are pronounced as plosives in some contexts but as fricatives
in others: e.g., un [d]edo 'a finger' but mi [ð]edo 'my
finger').
The English lateral consonant varies between clear
(plain) l and dark (velarized) ɫ, the conditioning factor
in Received Pronunciation (RP) being the following
segment: the lateral is clear before a vowel, dark
elsewhere. In a phonemic transcription, both are written l (the simpler of the two symbols in question).
Phonemic notation is undoubtedly advantageous in a
pronunciation dictionary, since the final consonant in
a word such as sell is sometimes clear (sell it), sometimes dark (sell them), so it is better to leave the
choice of clear or dark to be decided by reference to
the general rule rather than spelled out at each entry.
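To make concrete how such a rule can be stated once in the conventions rather than at each entry, here is a minimal sketch of the RP distribution just described: clear before a vowel, dark elsewhere. The one-character-per-segment strings, the vowel set, and the function name are simplifying assumptions of ours, not part of any published analysis.

```python
# Toy allophone rule for RP /l/: clear [l] before a vowel, dark [ɫ] elsewhere.
# Input is a string of segment symbols, one character per segment.
VOWELS = set("aeiouæɑɒɔʊʌɜəɪ")

def realize_l(segments: str) -> str:
    out = []
    for i, seg in enumerate(segments):
        if seg == "l":
            following = segments[i + 1] if i + 1 < len(segments) else ""
            out.append("l" if following in VOWELS else "ɫ")
        else:
            out.append(seg)
    return "".join(out)

print(realize_l("selɪt"))   # 'sell it'   -> selɪt  (clear: a vowel follows)
print(realize_l("selðəm"))  # 'sell them' -> seɫðəm (dark: a consonant follows)
```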
However, there are circumstances in which a strictly
phonemic notation may be considered inappropriate.
In German, the fricative corresponding to orthographic
ch is velar [x] when following an open or back vowel,
but palatal [ç] elsewhere (including following a consonant and in initial position). The two sounds may
be considered to be co-allophones of the same
phoneme /x/, and (provided certain morphological
boundaries are explicitly indicated) unambiguously
so written. Nevertheless, this allophonic distinction


Figure 7 Entries on a page in Mangold's standard German


pronunciation dictionary, Duden Aussprachewörterbuch. Note the
switch from x in the singular of the word Buch to ç in the plural
Bücher. In the singular this consonant is preceded by the back
vowel uː, but in the plural the consonant is preceded by the front
vowel yː. Reprinted from Mangold (1990), with permission.

has a high perceptual salience for speakers. The two


sounds have familiar non-technical names, Ach-laut
and Ich-laut, respectively. The standard German
pronunciation dictionary symbolizes them distinctly
(Figure 7) (Mangold, 1990).
There are many other cases when users of phonetic
transcription may feel more comfortable with a selectively narrowed (partially allophonic) transcription.
An example is the symbol ŋ for the velar nasal in
languages (e.g., Italian) in which it is not distinctive,
but arises only by assimilation before another velar.
Another is Russian, wherein, for pedagogical purposes, it may be useful to narrow the transcription so
as to indicate vowel allophones explicitly (Figure 8).
Conversely, there are cases when users want a transcription that reflects phonemic neutralizations.
In many languages, the vowel system in unstressed
syllables is less complex than it is in stressed syllables.
In a polysystemic transcription, different vowel systems are identified as operating in different structural
positions or in different phonetic environments.
In French, for example, the oppositions e–ɛ, a–ɑ,
ø–œ, o–ɔ, and ɛ̃–œ̃ are typically neutralized in non-final
syllables, and it is possible to use special (non-IPA)
cover symbols to reflect these neutralizations: E, A,
Œ, O, Ẽ.
In contemporary English, speakers tend to be
aware of the glottal stop ʔ as something distinct
from t, and students learning phonetic transcription
see it as natural to distinguish the two. In terms of
classical phonemics, the status of ʔ is odd in that in
some positions it behaves as an allophone of /t/ (e.g.,
a[ʔ]mosphere), but in other positions (notably word-initially) it realizes no underlying phoneme but, if
used, merely signals emphasis (optional hard attack,
as in an [ʔ]egg).

Analysis and Transcription: the English Vowels
There are often several possible phonological treatments of the same phonetic data. Naturally, different

competing phonemicizations may be reflected in different phonemic notations; however, the two do not
necessarily go hand in hand, and it is possible for
analysts who disagree on the phonological treatment
to use the same transcription system, or conversely,
for analysts who agree on the phonology to use different notations. Furthermore, the shortcomings of
classical phonemic theory, now generally acknowledged by phonologists, mean that many are unhappy
with the notion of a phonemic transcription, despite
its convenience in practice.
The notation of English vowels (in RP and similar
varieties) has been a particularly difficult area. One
view is that pairs such as sleep–slip contain the same
vowel phoneme, but under different conditions of
length (length being treated as a separate, suprasegmental, feature). This view is reflected in the notation
sliːp–slip, widely used in English as a foreign language
(EFL) work in the first three-quarters of the 20th
century. Thus, in the first 12 editions of Daniel Jones's
(1917) English pronouncing dictionary (EPD), the
English monophthongs were written as follows, in
what was then widely known as EPD transcription
(the monophthongs are exemplified, respectively,
in the keywords fleece, kit, dress, trap, start, lot,
thought, foot, goose, strut, nurse, comma, face, goat,
price, mouth, choice, near, square, force, and cure):
iː i e æ ɑː ɔ ɔː u uː ʌ əː ə ei ou ai au ɔi iə ɛə ɔə uə

Since this set of symbols is not maximally simple (see


preceding section, Simple vs. Comparative), a simplified transcription also came into use and was popular
in EFL work in the middle of the 20th century:
iː i e a aː o oː u uː ʌ əː ə ei ou ai au oi iə eə oə uə

Unconfirmed hearsay has it that Jones would have


liked to switch to this simplified transcription for
EPD, but that the book's publishers refused to allow
it. In these quantitative transcription systems, the
length mark is crucial, since it alone represents
the distinction between several pairs of phonemically
distinct vowels: not only sleep–slip but also, in
the simplified transcription, cat–cart, spot–sport,
put–boot, and insert (n.)–concert.
Another possible analysis of the English vowels is
that the long and diphthongal ones consist of a short
vowel followed by a glide identified with one of the
semivowels, j (non-IPA notational variant: y), or w,
or, in the case of a non-high final tendency, h (Trager
and Smith, 1951). This type of analysis has enjoyed
considerable support among adherents of structural
linguistics. Applied to RP (which it rarely was), it
would look like this:
iy i e ah o oh u uw e eh i$ ey ew ay aw oy ih eh uh


Figure 8 A page from Ward's Russian pronunciation illustrated, showing a narrow transcription of Russian. The symbol ö represents a
centralized allophone of /o/. In 1989, the IPA withdrew recognition from the palatalization diacritic seen here, replacing it with a raised j,
thus dʲ in place of ᶁ, and from the symbol ɩ, an alternative to ɪ. Reprinted from Ward (1966), with permission.

The notation used by Chomsky and Halle in The


sound pattern of English (1968) builds on this by
retaining the off-glide analysis while adding a macron
to symbolize tenseness in the vowels previously analyzed as long, yielding a system of the following type:
y i e a h O o h u u w e @Eh i$ e y o w a y a w Zy

However, a long-established rival view saw vowel


quality, rather than quantity (length) or off-glides, as

the feature distinguishing slip–sleep and similar


vowel pairs. Differences in quantity (and perhaps of
off-glides) could be treated as predictable once the
quality was known. From about 1920, phoneticians
working on English also made use of a qualitative
transcription system, in which length marks were
not used:
i ɪ ɛ æ ɑ ɒ ɔ ʊ u ʌ ɜ ə eɪ oʊ aɪ aʊ ɔɪ ɪə ɛə ɔə ʊə


A variant of this was also used, in which the symbol


shapes ɪ and ʊ were replaced by ɩ and ɷ, respectively.
In the United States, a system of this type was used
by Kenyon and Knott (1944). They analyzed the
vowels of face and goat as essentially monophthongal,
and wrote them e and o, respectively:
iIEAQOouV

e 6 e o aI ao OI

The hooked symbols are for the rhotacized vowels


in nurse and letter, respectively. American English
does not have phonemically distinct centering
diphthongs. This notation gained wide popularity
in some American circles, so much so that the expression IPA is often understood as meaning
this transcription of English. It is often used in
American-oriented EFL work.
The view that eventually prevailed in Britain was
that the vowels of sleep and slip are phonemically
distinct, based on a complex distinction of length,
quality, and tensity. The rivalry between quantitative
and qualitative transcription systems was resolved
by A. C. Gimson (q.v.), whose notation system
(Gimson, 1962) symbolized both quantity and quality
differences, redundantly but conveniently:
iː ɪ e æ ɑː ɒ ɔː ʊ uː ʌ ɜː ə eɪ əʊ aɪ aʊ ɔɪ ɪə eə ʊə

By this time, the quality of the diphthong in goat had


changed and the diphthong ɔə had merged with ɔː.
Minor modifications were subsequently introduced,
leading to the system now used by nearly all British
phoneticians (Wells, 1990):
iː ɪ e æ ɑː ɒ ɔː ʊ uː ʌ ɜː ə eɪ əʊ aɪ aʊ ɔɪ ɪə eə ʊə i u

The important change here is the addition of two


symbols for weak vowels, i (as in happy) and u (as
in situation). Although sometimes viewed as an
abbreviatory convention, meaning either iː or ɪ and
either uː or ʊ, these additional symbols really reflect
a dissatisfaction with classical phonemics. In weak-vowelled syllables, English has a neutralization of the
phonemic contrasts iː–ɪ and uː–ʊ, and the symbols i
and u stand for what some would call archiphonemes
and others would call underspecified vowels. Speakers may be inconsistent in whether the vowel of happy
or glorious is more like the iː of sleep or the ɪ of slip,
and it may often be impossible for the listener to
categorize it with certainty as one or the other.
There are no pairs of words distinguished by this
distinction in this position.

Segments and Digraphs


English long vowels are not the only area in which it
is possible for different views to exist concerning the
number of successive segments in which the phonetic

material should be analyzed. This question tends to


arise whenever diphthongs or affricates are to be
transcribed. There are also certain types of single
sounds that are conveniently written as a digraph,
i.e., as two successive letters. The IPA does this
for voiced and nasalized clicks, e.g., ɡǁ and ŋǃ, and
for consonants with double articulation, e.g., kp.
(Exceptions are the approximants w and ɥ and the
Swedish velar–palatoalveolar fricative ɧ, for which
single symbols are available.) If necessary, the fact
that these digraphs stand for single segments can be
made explicit by the use of a tie bar: ɡ͡ǁ, ŋ͡ǃ, k͡p.
In general, diminuendo (falling) diphthongs may
be regarded either as unitary phonemes or as sequences of vowel plus semivowel. The corresponding
decision has to be made in transcription. Thus, in
some languages (Polish, for example), a diphthong
of the type [ej] is best analyzed as the vowel e followed
by the consonant j. In others (English, perhaps), it
may be regarded as an unanalyzable whole. English
orthography follows the latter approach, given such
spellings as basic ˈbeɪsɪk, and the unitary analysis is
reflected in the respelling notation ā. But IPA users,
even those who consider the diphthong phonologically unitary, mostly write it with two letters, eɪ. The
use of this digraph does not carry any necessary implication that the diphthong consists of e (as in dress)
and ɪ (as in kit). In principle, the IPA writes affricates,
too, as digraphs, as in the examples pf, dz, tʃ, kx. To
emphasize their unitary status, the tie bar can be used:
p͡f, d͡z, t͡ʃ, k͡x. (In 1976, the IPA withdrew recognition
from a number of affricate symbols that had featured
in earlier versions of the phonetic alphabet but had
never been widely used, e.g., ʣ for dz (Wells, 1976).)
However, the symbols c and ɟ are sometimes pressed
into service to represent what might otherwise be
written tʃ and dʒ. (Alternatively, the non-IPA č and ǰ
may be used.) This is particularly convenient when
the affricates occur contrastively aspirated and unaspirated, as in Hindi. Contrastive aspiration raises the
question of whether it should be symbolized by a
diacritic (pʰ vs. p) or by using digraphs (ph vs. p). In
the case of an aspirated affricate, as in Hindi Jhelum,
there is a transcriptional choice between diacritics
alone (ɟʱ), a digraph (dʒʱ or ɟɦ), or a trigraph (dʒɦ).
(More simply, the diacritic ʰ or the letter h could be
used instead of ʱ or ɦ.) Ohala (1999) chose a digraph
with a diacritic, dʒʱ (Figure 9).
In a transcription system that includes digraphs, it
is important to maintain parsability, avoiding possible confusion between the single sound symbolized by
the digraph and the sequence of sounds symbolized
by the two symbols separately. In a language in which
affricates are in contrast with clusters (or sequences)
of the corresponding plosive plus fricative, either the


Figure 9 A page from the chapter Hindi in the Handbook of the International Phonetic Association: a guide to the use of the International
Phonetic Alphabet, showing the transcription of Hindi. Reprinted from Ohala (1999), with permission.

affricates must be written with a tie bar or the cluster


(sequence) must be written with a separator symbol.
Thus Polish czy and trzy must be written either as t͡ʃɨ
czy and tʃɨ trzy or (more conveniently) as tʃɨ czy and
t-ʃɨ trzy. Some fonts provide ligatured symbols such as
ʧ (= t͡ʃ), so that one can write ʧɨ and tʃɨ, respectively.
In non-IPA notation, čɨ and tšɨ can be used. (Another
view of the Polish cluster represented by orthographic
trz takes it as tʃʃ, i.e., affricate plus fricative. With this
analysis, the problem of parsability does not arise.)
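The parsability requirement can be made concrete with a toy tokenizer: if the affricate is written as a bare digraph, greedy matching will read every t + ʃ sequence as the affricate, so the cluster has to carry a separator (or, equivalently, the affricate a tie bar). The symbol inventory and hyphen separator below follow the Polish example above; the function itself is only an illustrative sketch.

```python
# Greedy, longest-match-first tokenizer for a transcription in which the
# digraph "tʃ" stands for the affricate. The separator "-" keeps the
# plosive-plus-fricative cluster of 'trzy' distinct from the affricate of 'czy'.
UNITS = ["tʃ", "t", "ʃ", "ɨ", "-"]

def tokenize(transcription: str) -> list:
    tokens, i = [], 0
    while i < len(transcription):
        for unit in UNITS:
            if transcription.startswith(unit, i):
                if unit != "-":            # the separator is not itself a segment
                    tokens.append(unit)
                i += len(unit)
                break
        else:
            raise ValueError(f"unparsable symbol at position {i}")
    return tokens

print(tokenize("tʃɨ"))    # czy  -> ['tʃ', 'ɨ']      read as the affricate
print(tokenize("t-ʃɨ"))   # trzy -> ['t', 'ʃ', 'ɨ']  cluster kept distinct
```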
Problems of segmentation also arise in the annotation of spectrograms or other physical records of the
speech signal. The latter tends to be continuously
variable, rather than reflecting the neatly discrete
segments implied by a phonetic transcription, which
means that the stretch of speech corresponding to a
given transcription symbol is not easily delimited. For
example, the moment of silence in the middle of apa
corresponds to the voiceless plosive identity of p, but
its labiality can be inferred only from the formant
transitions in the adjacent portions of the vowels
and from the characteristics of the plosive burst at
the release.

Dictionary Entries
The pronunciation entry in a dictionary will usually
relate to the citation form of the word in question.
This may differ in various respects from the forms to
be expected in connected speech, sometimes referred
to as phonotypical forms. The notion of a phonotypical transcription arises from the work of speech
technologists working on French, a language in which
many final consonants that may appear in running

speech are absent in the citation form: the well-known phenomenon of liaison. Thus the citation
form of the pronoun vous 'you' is vu, but the liaison
form, used before a word beginning with a vowel, is
vuz. The phonotypical transcription of the phrase
vous avez 'you have' is vuzave. Pronunciation dictionaries of French must include these liaison forms,
because the identity of the liaison consonant, if any,
cannot be predicted from the citation form. Certain
vowel-initial words block the activation of the liaison
consonant (those spelled with h aspiré and certain
others, e.g., onze 'eleven'): this, too, must be shown in
the pronunciation dictionary (usually by an asterisk
or some other arbitrary symbol). In English, on the
other hand, forms with final liaison r (linking r, intrusive r) may not need to be listed in the dictionary,
since this possibility applies to every word for which
the citation form ends in a non-high vowel. As with
the simple/comparative and phonemic/allophonic distinctions, it is more efficient to state a rule once rather
than to repeat the statement of its effects at each
relevant dictionary entry.
Many English function words have distinct strong
and weak forms, e.g., at, strong form æt, weak form
ət. The strong form is used when the word is
accented and in certain syntactic positions (what are
you looking at?). A few words have more than one
weak form, depending on context, as in the case of
the: e.g., ði eɡ the egg, prevocalic, but ðə mæn the
man, preconsonantal. A phonotypical transcription
of connected speech would select the appropriate
form for the context. Aside perhaps from such
special-context forms, for pronunciation in a general-purpose dictionary it may be sufficient to state only


the citation form of a word. Some dictionaries,


though, and particularly specialist pronunciation
dictionaries, will go further, and this may impact on
the form of transcription chosen, e.g., in the use of
abbreviatory conventions.
First, the word may have several competing citation forms, used by different speakers of the standard
form of the language and differing unpredictably
from one another. Thus, in British English, again
may rhyme with pen or with pain; controversy may
be stressed on the initial syllable or on the antepenultimate; schedule may begin with ʃ- or sk- (for statistics
on speaker preferences in these words, see Wells
(2000)). In English, there is great intraspeaker and
interspeaker variability in the choice between ɪ and
ə in weak syllables (reduce, aspiration, horses). (The
2003 Longman dictionary of contemporary English
uses a special symbol to show this.)
Second, the dictionary may wish to cover more
than one form of the language, e.g., American English
(AmE) and British English (BrE), or American and
European Spanish. Rather than transcribe each relevant word separately for each variety of the language,
dictionaries may use abbreviatory conventions to
show such variability. For example, the English
word start might be transcribed stɑː(r)t or stɑːrt,
with the convention that the r is to be pronounced
in AmE but not in BrE.
Third, there may be predictable (rule-governed)
variability. For example, in certain positions in a
word, where some English speakers place the cluster
ns, others pronounce nts, e.g., prɪns or prɪnts prince.
This may be shown by an abbreviatory convention
such as prɪnᵗs or prɪn(t)s. (The rule of plosive epenthesis is more general than this: it also affects other
clusters of nasal plus voiceless fricative. It also applies
in German, thus han(t)s Hans, but this is ignored
in Mangold (1990).) As a second example, English
words with more than one lexical stress are pronounced in isolation with an accent on the last such
stress (sɪksˈtiːn sixteen), but in connected speech,
under certain surrounding accent conditions, are pronounced with the accent on the first such stress (ˈsɪkstiːn ˈpiːpl̩ sixteen people). Particularly in dictionaries
aimed at speakers of EFL, this too may be explicitly
indicated. As a third example, in both English
and German, the syllabic consonants l̩ and n̩ alternate with the sequences əl and ən; words may
be pronounced with either the first or the second
sequence, depending on a combination of phonetic-environment, stylistic, and speech-rate factors: thus
German Gürtel ˈɡʏrtl̩ or ˈɡʏrtəl, English hidden ˈhɪdn̩
or ˈhɪdən. Although the Duden dictionary mentions
the alternation only in the foreword (Mangold, 1990:
32), English dictionaries often make it explicit at each

relevant entry, using abbreviatory devices such as
ˈhɪdən or ˈhɪd(ə)n.

Pedagogical Transcription, Dictation, and Reading Exercises
In phonetic training of the kind associated with the
Daniel Jones and Kenneth Pike traditions, those
studying the phonetics of a particular language (including their own) practice the skills of transcribing
phonetically from orthography, transcribing from
dictation, and reading aloud from a phonetically
transcribed text. In these exercises, words are transcribed not in their citation form, but phonotypically.
In particular, great attention is paid to the possibility
of connected-speech processes such as assimilation,
elision, liaison, and weakening (including vowel reduction). In the case of English, instead of the lexical
stress-marking of words, the student may be required
to produce a full markup of accentuation (sentence
stress) and intonation. For example, in the English phrase bread and butter, the word and would most probably be pronounced not ænd, but rather ən or əm or n̩ or m̩. The transcriber from orthography
should be able to predict this, the transcriber from
dictation should be able to hear which pronunciation
was used, and the student reading from transcription
ought to be able to reproduce whichever form is in the
written text.
The transcription used for these exercises is often referred to as 'phonemic'. However, if we follow current ideas in regarding phonemes as mental entities, part of the speaker's competence, then this term is not really accurate. The word bad presumably always has the mental representation bæd, even though under assimilation in a phrase such as bad man it may be pronounced with a final bilabial, nasally released, thus bæb mæn. This form of transcription is better referred to as a 'reading transcription'.

See also: International Phonetic Association; Phonemics, Taxonomic; Phonetic Classification; Phonetic Pedagogy; Phonetic Transcription: History; Phonetics, Articulatory; Phonetics, Acoustic; Phonetics: Overview; Phonology-Phonetics Interface; Second Language Acquisition: Phonology, Morphology, Syntax; Second Language Speaking.

Bibliography
Abercrombie D (1953). Phonetic transcriptions. Le Maître Phonétique 100, 32-34.
Abercrombie D (1964). English phonetic texts. London:
Faber and Faber.

Brink L, Lund J, Heger S & Jørgensen J (1991). Den store
Danske udtaleordbog. Copenhagen: Munksgaard.
Gimson A C (1962). An introduction to the pronunciation
of English. London: Arnold.
Jespersen O (1890). Danias lydskrift. In Dania I, 33-79.
Jones D (1917). English pronouncing dictionary. London:
Dent.
Jones D (1949). The principles of the International Phonetic Association: being a description of the International
Phonetic Alphabet and the manner of using it, illustrated
by texts in 51 languages. London: International Phonetic
Association.
Jones D (1956). An outline of English phonetics (8th edn.).
Cambridge: Heffer. [Appendix A, Types of phonetic transcription.].
Kenyon J & Knott T (1944). A pronouncing dictionary of
American English. Springfield, MA: Merriam.
Mangold M (1990). Duden Aussprachewörterbuch. Wörterbuch der deutschen Standardaussprache (3rd edn.).
Mannheim: Dudenverlag.
Nolan F, Esling J et al. (eds.) (1999). Handbook of the
International Phonetic Association: a guide to the use
of the International Phonetic Alphabet. Cambridge:
Cambridge University Press.
Ohala M (1999). Hindi. In Nolan, Esling et al. (eds.).
Okada H (1999). Japanese. In Nolan, Esling et al. (eds.).
Passy P (1958). Conversations françaises en transcription phonétique (2nd edn.). Coustenoble H (ed.). London:
University of London Press.
Shimaoka T (2004). Nihongo kara super-native no Eigo e.
Tokyo: Kaitakusha.
Smirnitsky A (1975). Essentials of Russian grammar.
Moscow: Vysshaya Shkola.

Thelwall R & Saadeddin M (1999). Arabic. In Nolan,


Esling et al. (eds.).
Trager G & Smith H (1951). An outline of English structure.
Norman, OK: Battenburg. [2nd edn. (1957). Washington,
D.C.: American Council of Learned Societies.]
Ward D (1966). Russian pronunciation illustrated.
Cambridge: Cambridge University Press.
Wells J (1976). The Association's alphabet. Journal of the
International Phonetic Association 6(1), 23.
Wells J (1990). Longman pronunciation dictionary.
Harlow: Longman. [2nd edn. (2000). Harlow: Pearson
Education.]
Wells J (1997). SAMPA computer readable phonetic alphabet. In Gibbon D, Moore R & Winski R (eds.) Handbook of standards and resources for spoken language
systems. Berlin and New York: Mouton de Gruyter.
(Part IV, section B).
Wells J (2003). Phonetic symbols in word processing and
on the web. In Proceedings of the 15th International
Congress of Phonetic Sciences, Barcelona, 3-9 August,
2003.

Relevant Websites
http://www.phon.ucl.ac.uk University College London,
Department of Phonetics and Linguistics website.
Resources include information on the Speech Assessment
Methods Phonetic Alphabet (SAMPA), a machine-readable phonetic alphabet.
http://www.arts.gla.ac.uk University of Glasgow, Faculty
of Arts website; links to the International Phonetic Association's phonetic alphabet chart.

Phonetic Transcription: History

A Kemp, University of Edinburgh, Edinburgh, UK
© 2006 Elsevier Ltd. All rights reserved.

Transcription, in its linguistic sense, has been defined
as the process of recording the phonological and/or
morphological elements of a language in terms of a
specific writing system, as distinct from transliteration, which is the process of recording the graphic
symbols of one writing system in terms of the corresponding graphic symbols of a second writing system.
Transcription, in other words, is writing down a language in a way that does not depend on the prior
existence of a writing system, whereas transliteration
does.
Systems of transcription have existed from the
earliest times. Traditional writing systems of most
languages may originally have been transcriptions
of speech, but in the course of time have lost much

of this connection (see ). Spelling reformers have often


sought to restore the connection. Journalists, missionaries, colonial administrators, teachers, traders, travelers, and scholars have all at one time or another
required a precise way of writing down previously
unwritten languages for various purposes: to improve
communication, to make available translations of the
Bible and of noteworthy literary works, to provide
education, to record folk literature, and so on. For
phoneticians above all, it is essential to have a notation system that allows sounds to be referred to unambiguously.

The Segmentation of Speech


Speech in its physical form is a continuum, but transcription requires it to be split up into segments,
on the basis of some kind of linguistic analysis.
Writing systems of the world fall into different groups

according to what types of segments they are based on: words, syllables, or consonants and vowels. Certain features of speech are associated with longer segments than others; for example, stress and intonation, which in many writing systems are not marked, are associated with stretches of speech such as the syllable, word, or sentence.

Types of Transcription
A transcription can never capture all the nuances
of speech. The amount of detail it attempts to
include in its text will vary according to its purpose.
A system intended for the specialist linguist investigating a language never previously studied would
often need to allow the recording of as many as
possible of the various nuances of sounds, pitch variations, voice quality changes, and so on. Such a
transcription may be called impressionistic, and
is unlikely to be helpful to anyone other than a
specialist.
Proceeding from this initial transcription, the linguist can deduce the way in which the sound system
of the language is structured, and can replace the
impressionistic transcription with a systematic one,
which records in its text only the elements that are
crucial for conveying the meanings of the language.
This type of transcription may well form the basis
for a regular writing system for that language, and is
called a phonemic, or broad, transcription. For use
in teaching the spoken language, however, it may be
helpful to transcribe some of the subphonemic sound
differences likely to present problems to the learner.
This kind of transcription may be called allophonic,
or narrow. If detailed comparisons are to be made
between this language or dialect and another one,
showing the more subtle sound distinctions, the transcription may begin to resemble the impressionistic
one, but as it is the result of a prior analysis, it will
still be systematic. Conventions may be supplied to
show the way in which the broad transcription is
realized phonetically in certain environments. For
special purposes, such as recording the speech of
the deaf, very complex transcription systems may be
necessary, to cope with sound variations that rarely
occur in the speech of those without such a disability
(see later, discussion of the International Phonetic
Association).

Notation
Transcription systems need to employ a notation that
allows them to refer to a sound unambiguously. The
following are some of the principles followed in effective systems of notation:

1. To avoid ambiguity, each symbol used in the notation, in its particular environment, should be restricted to one particular sound or sound class (or, in some cases, a group of sounds, such as the syllable), and each sound, etc., should be represented by only one symbol. So, for instance, the symbol <j>, which has different values in German and English orthography, would need to be confined to only one of those values. Conversely, the sound [s], which in English may be conveyed either by <s> as in supersede or <c> as in cede, must be limited to only one symbol (a mechanical check of this principle is sketched after this list).
2. The symbols used should ideally be simple, but
distinctive in shape, easily legible, easy to write
or print, aesthetically pleasing, and familiar to
the intended users. If printing types are not
readily available, the system will be limited in its
accessibility and expensive to reproduce.
3. If the transcription is to be pronounceable (not all
kinds are required to be), the sound values of the
symbols must be made clear, through a description
of the ways in which the sounds are formed,
or through recorded examples, or by key words
taken from a language, provided that the accent
referred to is specified. Some transcription systems
include pieces of continuous text to illustrate the
application to particular languages (e.g., those
of Carl Lepsius and the International Phonetic
Association (IPA); see later).
4. The symbol system should be expandable,
particularly if it is intended to be used to cover
all languages. As new languages are encountered,
new varieties of sounds will have to be defined.
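The one-symbol-one-sound principle in point 1 lends itself to a simple mechanical audit of a notation table. The sketch below is only an illustration; the table contents and the helper name check_one_to_one are hypothetical, not drawn from any published alphabet.

    from collections import Counter

    def check_one_to_one(table):
        """Report violations of 'one symbol per sound, one sound per symbol'
        in a notation table given as a list of (symbol, sound) pairs."""
        problems = []
        by_symbol = Counter(symbol for symbol, _ in table)
        by_sound = Counter(sound for _, sound in table)
        for symbol, n in by_symbol.items():
            if n > 1:
                values = [s for sym, s in table if sym == symbol]
                problems.append(f"symbol <{symbol}> has several values: {values}")
        for sound, n in by_sound.items():
            if n > 1:
                symbols = [sym for sym, s in table if s == sound]
                problems.append(f"sound [{sound}] is written in several ways: {symbols}")
        return problems

    # English-orthography examples from the text: <s> and <c> can both
    # spell [s]; <j> stands for different sounds in English and German use.
    table = [('s', 's'), ('c', 's'), ('j', 'dʒ'), ('j', 'j')]
    for problem in check_one_to_one(table):
        print(problem)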
Alphabetic Notations: Roman and Non-Roman

Alphabetic notations (e.g., the Roman alphabet) are


based on the principle of having one simple symbol
to represent each segment. However, many transcription systems are not based on the Roman alphabet,
because of the ambiguous values of some of its symbols, or because it has been found preferable to use
iconic symbols, intended to convey by their shapes
the phonetic nature of the sound concerned, and/or
to link related groups of sounds. One variety of iconic
notation has been called organic, because the shapes
of its symbols are meant to suggest the organs of
speech used to produce them. Shorthand systems
characteristically are non-Roman and iconic, but
not necessarily organic. Iconic systems have a number
of drawbacks. Apart from the difficulties of reading
and printing them, they cannot be easily expanded
to incorporate sounds newly encountered. It is also
less easy to adapt them as and when phonetic theory
undergoes changes.

Analphabetic Systems

Analphabetic systems are not based on alphabetic-type segments; instead, the symbols are composed
of several elements, some of which resemble chemical
formulas, each element representing one ingredient
of the sound concerned (see later).
Supplementing the Roman Alphabet

The number of Roman symbols is far too limited


to convey the sound distinctions needed. There are
various ways of supplementing the basic alphabet (see
Abercrombie, 1981):
1. Using compound letters such as <q> and
<x> (equivalent to [k(w)] and [ks], respectively,
in English orthography) to stand for other sounds.
Thus <x> may be used to represent the Scottish
sound represented by the <ch> in loch.
2. Inverting and/or reversing the letters and giving them other values, e.g., [ɔ ə ɟ ɯ ɹ ʌ] are all phonetic symbols formed by inverting <c e f m r v>, respectively.
3. Adding diacritical marks to basic symbols, as in <ã, ç>. These diacritics may be attached to the letter or placed somewhere adjacent to it. They are an economical way of enlarging the repertoire, because one mark can be used to change the value of a number of symbols; for example, [˜] represents nasality and may be added above any vowel symbol. Conversely, being small, diacritics may be inadvertently omitted or obscured, they tend to reduce the legibility of texts, and they can be expensive to reproduce (though less so since the advent of computers).
4. Adapting symbols borrowed from other alphabets; examples of symbols taken from Greek by the International Phonetic Alphabet, with modification to blend with the roman font, are [θ ɸ β χ] (see International Phonetic Association). Symbols may also be borrowed from another use, such as @, $, %, etc.
5. Using digraphs to represent simple sounds, as English orthography does in thing, wherein <th> and <ng> may represent the simple sounds [θ] and [ŋ], respectively. The symbols are easily accessible, but problems arise when these sequences are needed to convey actual sound sequences, such as the aspirated stop [tʰ] (the parsing problem is illustrated in the sketch after this list).
6. Using different typefaces, such as UPPERCASE, italic, or bold. However, these are less satisfactory for use in handwriting, they may cause confusion with other conventional uses, such as emphasis, and they do not blend well with other fonts, unless specially adapted.

7. Less commonly, using spatial relationships. With


respect to a median line, symbols may be placed on
the line, above it, or below it. Braille makes use of
this in certain of its symbols. For normal printing
and writing this is not very satisfactory, as it interferes with legibility and can easily lead to errors.
8. Inventing entirely new symbols. Unless new symbols are relatively straightforward adaptations of
Roman letters, e.g., inversions or addition of diacritics, they rarely survive. One that has survived is
the IPA symbol for the velar nasal, [ŋ], probably first used in 1619 by Alexander Gill.
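As a small illustration of the digraph problem noted in point 5, the sketch below reads a digraph-based notation greedily; the symbol table and function name are hypothetical. The intended reading of thing comes out correctly, but a genuine [t] + [h] sequence, as in anthill, is swallowed by the <th> digraph.

    DIGRAPHS = {'th': 'θ', 'ng': 'ŋ', 'sh': 'ʃ'}

    def tokenize(text):
        """Read a digraph-based notation left to right, always preferring
        a digraph over a single letter ('maximal munch')."""
        out, i = [], 0
        while i < len(text):
            if text[i:i + 2] in DIGRAPHS:
                out.append(DIGRAPHS[text[i:i + 2]])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return out

    print(tokenize('thing'))    # ['θ', 'i', 'ŋ'] - the intended reading
    print(tokenize('anthill'))  # ['a', 'n', 'θ', 'i', 'l', 'l'] - but the
                                # word really contains [t] + [h]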
Nonsegmental Aspects of Speech

In any language, English, for example, the position of the stress in the word may need to be indicated. This can be done by placing a raised mark before the stressed syllable, for instance, <beˈcome, ˈbeckon>, or by highlighting the syllable in some way. Extra length can be shown by doubling the segmental symbol, as in Italian freddo 'cold', or by adding a diacritical mark, as in [aː]. Pitch is an essential feature of the words of tone languages, such as Chinese, Thai, and Yoruba, and may be marked by accents over the vowel (e.g., [á] high, [à] low), or by numerals (e.g., the Mandarin Chinese segmental structure [ma] can have four different tones, distinguishing different words: [ma1], [ma2], [ma3], [ma4]). The intonation pattern of English may be indicated by marking the pitch of certain prominent syllables. The first person to try to provide a detailed system of notation for these features in English was Joshua Steele (see Steele, Joshua (1700-1791)). Other features are also sometimes treated as nonsegmental, notably nasality and secondary articulations such as palatalization, velarization, and labialization, because they frequently overlap segmental boundaries. Traditional transcriptions have allocated these features to segments, adding diacritical marks to the vowel or consonant symbols, but some phonological analyses of speech have associated them with longer stretches by setting up an extra tier for them. The prosodic analysis of J. R. Firth allocated prosodies such as nasality and velarity to longer domains, including syllable part, syllable, word, and sentence (see Firthian Phonology). A similar idea lies behind the autosegmental model of phonology.
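The accent-mark and numeral notations for tone mentioned above are mechanically interconvertible. The following is a minimal sketch; the tone-to-diacritic table and the function name are illustrative assumptions, and it handles only single-vowel syllables such as the Mandarin ma set.

    import unicodedata

    # Combining accents conventionally used for the four Mandarin tones
    # (1 high level, 2 rising, 3 low dipping, 4 falling).
    TONE_MARKS = {'1': '\u0304', '2': '\u0301', '3': '\u030C', '4': '\u0300'}

    def numeral_to_diacritic(syllable):
        """Convert e.g. 'ma3' to 'mǎ' by moving the final tone numeral
        onto the vowel as a combining diacritic."""
        base, tone = syllable[:-1], syllable[-1]
        vowel = max(i for i, ch in enumerate(base) if ch in 'aeiouü')
        marked = base[:vowel + 1] + TONE_MARKS[tone] + base[vowel + 1:]
        return unicodedata.normalize('NFC', marked)

    print([numeral_to_diacritic(s) for s in ['ma1', 'ma2', 'ma3', 'ma4']])
    # ['mā', 'má', 'mǎ', 'mà'] - the same four words in accent notation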

Historical Survey of
Transcription Systems
Examples of some of the different types of transcription systems can be found in the alphabets from
early times.

Roman-Based Alphabetic Systems
(pre-19th Century)

The reform of traditional orthographies (notably those of French and English, which presented particular problems for learners) led to innovations in notation. In France, Loys Meigret's La Trette de la grammere françoeze (1550; see Meigret, Louis (?1500-1558)) included a phonetically based alphabet for French. In England, Sir Thomas Smith, in his De recta et emendata linguae anglicanae scriptura (On the proper and corrected writing of the English language) (Paris, 1568), employed several of the devices mentioned previously. For example, diacritics were used to distinguish the long vowels in cheap and hate, symbolized by <e> and <a> with an added mark, respectively; non-Roman alphabets (Greek <υ>) were used for French and Scottish u (i.e., [y]); and the Irish form of capital <G> was used to represent the first consonant in judge. A reversed <z> replaced the <sh> of ship, and for the dental fricatives, Smith used Old English thorn and <ð> (see Figure 1).
John Hart (see Hart, John (?1501-1574)) was familiar with Meigret's system. His book, An orthographie (London, 1569), included special symbols for the consonant sounds [θ ʃ tʃ dʒ], beautifully integrated into the text. He was a keen observer of speech and recorded the occurrence of syllabic consonants such as the final <l> in fable, which he transcribed as <l̩>. Subsequently, William Bullokar (see Bullokar, William (c. 1531-1609)), in his Book at large for the amendment of orthographie for English speech (London, 1580), provided 40 symbols for transcribing
English. By using a black-letter font he opened up extra possibilities of rarely used printing sorts. He illustrated his phonetic alphabet by using it in a number of literary texts, including Aesop's fables.
In Logonomia Anglica (first edition published in London in 1619), Alexander Gill (1564-1635) introduced a number of extra letters. Like Smith, he used <ð> for the first sound in this, and in the second edition (1621) of his work he introduced <ŋ> for the velar nasal, thus maintaining the connection with nasal <n>. He transcribed the word high as <hjħ>, using <j> for the vowel and <ħ> for the final consonant. This illustrates the use of a diacritic incorporated in a letter.

Figure 1 Symbols devised in 1568 by Sir Thomas Smith, to represent the final sound made when pronouncing the words pith, bathe, and dish.

Charles Butler, in his English grammar (Oxford, 1633), also used a horizontal stroke through certain letters to avoid digraphs, replacing the letter <h>, so that instead of <sh, ph, ch, wh, gh> he had <s, p, c, w, h>, respectively, with a stroke through each letter. He also introduced inverted <t> to represent the <th> in thin.
John Wilkins (see Wilkins, John (1614-1672)), in his Essay towards a real character (1668), devised three separate systems of notation (see Figure 2). One of these is Roman based and uses digraphs to supply extra consonant symbols. The letter <h> has a dual role: it is used both to indicate fricatives, as in <ch, gh, th, dh, sh, zh> (for [x ɣ θ ð ʃ ʒ]), and to indicate voiceless forms of nasals and liquids, so that voiced <ng, n, m, l, r> are paralleled by voiceless <ngh, nh, mh, lh, rh>. For some of the vowels,
Wilkins employed rather poorly designed Greek symbols, though this contradicted one of his stated principles for choosing symbols:
1. They should be the most simple and facil, and yet
elegant and comely as to the shape of them.
2. They must be sufficiently distinguished from one
another.
3. There should be some kind of suitableness, or correspondency of the figure to the nature and kind of the
letters which they express.

Wilkins's third condition refers to a nonarbitrary, or


iconic, type of notation (for his other notations, see
later, Iconic Alphabets, pre-19th Century).

Figure 2 Three non-Roman transcriptions. (A) Iconic representations of the sounds [l] and [m] (John Wilkins, 1668). (B) Syllabic transcription of 'Give us this day our daily bread' (John Wilkins, 1668). (C) Organic alphabet. Transcription of 'I remember, I remember, the roses red and white' (Henry Sweet, 1906).


Figure 3 Excerpt illustrating the extended alphabet devised by Benjamin Franklin in 1768 (published in Franklin's collected works in 1779).

In the 18th century, social reformers aiming to reduce class barriers tried to establish a standard form of pronunciation; to facilitate the spread of literacy, reformed spelling systems were suggested. Thomas Sheridan (see Sheridan, Thomas (1719-1788)) was one of the first to publish a pronouncing dictionary of English (1780), which gave a respelling to every word, and a similar dictionary was published in 1791 by John Walker (see Walker, John (1732-1807)). In America, spelling reform led the famous American statesman, scientist, and philosopher, Benjamin Franklin, to put forward a new alphabet in 1768. It was limited to 26 symbols, of which six were newly invented to take the place of the ambiguous letters <c j q w x y>. Some of these new symbols were rather too similar to each other in form to be satisfactory, but the printed font was attractively designed; it was published as part of Franklin's collected works (London, 1779) (see Figure 3).
William Thornton (1759-1828), a Scottish American who traveled and lived in many places but who spent most of his life in the American capital, Washington, also attempted to reform English spelling, and in the longer term to make possible the transcription of unwritten languages. His treatise, entitled Cadmus, or a treatise on the elements of written language (1793), won the Magellanic gold medal of the American Philosophical Society. The notation he used was Roman based and introduced some well-designed additional letters, including <M> to replace <w>, <&> to represent the first consonant in ship, and a circle with a dot in the center (a Gothic symbol) to represent <wh> in when. He aimed to economize by using inverted basic symbols where possible, e.g., <m, M>, <n, u>, <J, &>. Some years later, Thornton used his alphabet to transcribe 288 words in the Miami Indian language. Among admirers of his system were Thomas Jefferson, Alexander von Humboldt, and Count Volney (see later, Volney and the Volney Prize).
The increasing involvement of Europeans with the
languages of Asia, Africa, and America, whether as
traders, missionaries, travelers, or colonial administrators, emphasized the need for a standard, universal
alphabet. One of the first to try to provide a transliteration for Asian languages was the brilliant English
oriental scholar and linguist Sir William Jones

(see Jones, William, Sir (1746-1794)). He was a highly skilled phonetician, and during his time as a high court judge in India (1783-1794) saw the need for a
consistent way of transcribing languages. His system
was presented in Dissertation on the orthography of
Asiatick words in Roman letters (1788). He thought
it unnecessary to provide any detailed account of the
speech organs, but gave a short description of the
articulations. An ideal solution, he believed, would
be to have a natural character for all articulate
sounds . . . by delineating the several organs of speech
in the act of articulation (i.e., an organic alphabet),
but for oriental languages he preferred a transliteration. This was partly because of the difficulty of
conveying the precise nature of sounds to the nonspecialist, but also because he wished to preserve the
orthographical structure, so that the grammatical
analogy would not be lost, and there would be no
danger of representing a provincial and inelegant
pronunciation. The system was not intended as a
universal alphabet; his notation was confined to the
letters of the Roman alphabet, supplemented by
digraphs and a few diacritics. He chose the vowel
symbols on the basis of the values they have in the
Italian language, rather than those of English, unlike
some other schemes used in India at the time. His
alphabet had an influence on nearly all subsequent
ideas for the transliteration of oriental languages, at
least for the following century. The romanization of
these languages became a major concern of missionaries, administrators, educationists, and travelers,
though some scholars, and literate members of the
communities concerned, were less enthusiastic, believing that something culturally vital would be lost
if the native scripts were changed into a different form.
Iconic Alphabets (pre-19th Century)

Two of John Wilkins's systems of notation were iconic. The more elaborate one, which was not intended
to be used to transcribe connected speech, consisted
of small diagrams of the head and neck, cut away to
show the articulatory formation of each sound. Next
to each diagram was a simplified symbol relating to
the way the sound was formed (see Figure 2A). The
second notation assigned each consonant a symbol,
which took various forms: straight line, T shape,
L shape, or various curve shapes. To this basic shape

Phonetic Transcription: History 401

Wilkins added a small circle or hook, at the top,


middle, or base, to represent one of six vowels. Thus
the composite symbol represented a syllable, either vowel + consonant or consonant + vowel. Each category of sound, such as oral stop consonant, fricative,
or nasal, had a characteristic shape, as did voiceless
and voiced sounds at the same place of articulation.
The system was ingenious, but the symbols could
easily be confused, and it is unlikely that anyone
other than Wilkins actually used it (see Figure 2B).
Wilkins's contemporary, Francis Lodwick (see Lodwick, Francis (1619-1694)), published a similar
system in 1686 under the title An essay towards an
universal alphabet. He stated in his text the important
principle that no one character should have more
than one sound, nor any one sound be expressed by
more than one character. The notation was a syllabary, using shapes designed to show similarities between the sounds symbolized, which are set out in a
table with six places of articulation: bilabial, dental,
palatal, velar, labiodental, and alveolar. The top symbol in each column is the voiced stop, and the lower
ones are formed by progressive modifications of it. As
with Wilkins's system, the vowels are added to the
consonant symbols as diacritics.
Another iconic alphabet is to be found in chapter 5
of the Traité de la formation méchanique des langues, published in 1765 by the French scholar and magistrate, Charles de Brosses (see Brosses, Charles de (1709-1777)). The work was intended for scholars
researching into languages, rather than for everyday
use. Brosses called it organic and universal. It is
based on a somewhat idiosyncratic analysis of speech
production, which, among other things, assumed that
the vowels were sounded at different points on a
corde, or string, equivalent to the vocal tract tube.
Brosses's understanding of speech production is suspect in a number of ways; for example, he classes
<s> as a nasal consonant. His first attempt at notation was complex, using symbols that pictured the
outline of the different vocal organs (lips, teeth, palate, nose, etc.), but he simplified this subsequently,
using symbols made up of curves and straight lines
at different angles. The vowel symbols were attached
to the consonants to give a syllabic sign, and the
notation included composite symbols to represent
consonant clusters.

Nineteenth-Century Transcription
Systems
Volney and the Volney Prize

The French orientalist, statesman, and reformer, Count Constantin François Volney (see Volney, Constantin-François Chasseboeuf, Comte de (1757-1820)), had been concerned for many years
about the difficulties experienced by Europeans in
learning oriental languages, and the poor standard
of the teaching of these languages. His book, Simplification des langues orientales (Paris, 1795), put forward a system for transliterating Arabic, Persian, and
Turkish into Roman script, supplemented by a few
Greek letters and some newly invented symbols. During a visit to America from 1795 to 1798, he stayed
with William Thornton, and while there became
acquainted with Sir William Jones's alphabet. He
conceived the idea of a universal alphabet, not for
scholarly purposes, but to act as a practical tool for
travelers, traders, etc. His 1795 system was used with
modifications for geographical names on the map of
Egypt compiled in 1803 by the French government,
but his later L'Alphabet européen appliqué aux langues asiatiques (Paris, 1819) provided a fuller system
of 60 symbols, mostly Roman, replacing some of his
previous, newly invented, symbols with more familiar
letters modified by diacritics. However, Volney realized that further research was needed, and his final
gesture was to leave 24 000 francs in his will, for a
prize to be awarded by the Institut de France to
anyone who could devise a suitable harmonic alphabet (to bring harmony out of the existing confusion of
practices) in Roman script (see Kemp, 1999).
The Volney Prize for the first year (1822) was to be
for an essay setting out the necessary conditions for
such an alphabet. The prizewinners were both German librarians: Josef Scherer (d. 1829) argued that
what was needed was a transcription reflecting pronunciation, rather than a transliteration, whereas
A. A. E. Schleiermacher (1787-1858), the co-winner,
favored a transliteration, for very much the same
reasons as those given by Sir William Jones. Scherer
and Schleiermacher submitted detailed transcription
systems for the 1823 prize, which was won by
Scherer; Schleiermacher's essay was submitted later
in a revised form and was published in 1835. He
continued to work on his system, and his completed
scheme, Das harmonische oder allgemeine Alphabet
zur Transcription fremder Schriftsysteme in lateinische Schrift (The harmonic or general alphabet for
the transcription of foreign writing systems into Latin
script), was eventually published in Darmstadt, after
his death (1864). Together with his new alphabet, this
work contained examples of the non-Roman scripts
of 10 languages. In all, 275 new characters had to be
cast. His notation excluded digraphs and letters from
other alphabets, which he felt would be typographically unsuitable, so his main resource was diacritics,
both above and below the basic symbols. Some
of these were used systematically (e.g., to indicate nasality, aspiration, or palatalization), but he admitted that problems of legibility and combinability had often made total consistency impossible. The alphabet was never adopted for wider use (see Figure 4).

Figure 4 Consonant symbols devised by A. A. E. Schleiermacher, as part of his transcription system, originally submitted in 1823 for the Volney Prize; the revised form was published in 1835.
Further essays on transcription were submitted for
the Volney Prize over the next 20 years, but the commission set up to administer the prize deemed none of
the essays to have the final answer to the problem.
Shorthand and Spelling Reform

In the 19th century, the most prominent spelling reformer was Sir Isaac Pitman (see Pitman, Isaac, Sir
(1813-1897)). Pitman was of comparatively humble
origin, and determined from his early years to further
social reform and improve the educational system by
developing new alphabets to make spelling easier. His
first contribution was to develop a system of shorthand (now world famous), which he published in
1837 as Phonography; this work explored the ways
in which notation systems can be made to act efficiently in conveying language. Unlike earlier systems,
it was based not on English spelling but on the English
sound system.
By 1842, Pitman had devised several possible phonetic alphabets, but they still contained elements of
his shorthand. In the following year, he came down
firmly in favor of using the letters of the Roman
alphabet as a basis, and the same year saw the beginning of his connection and cooperation with Alexander
J. Ellis (see Ellis, Alexander John (né Sharpe) (1814-1890)). Ellis and Pitman were from very different backgrounds; Ellis had a first in mathematics from Cambridge and a private fortune. He had developed an
interest in phonetic notation partly through his
attempts to write down dialects he encountered in
his travels abroad, but it was only after exposure to

Pitmans work that Ellis began to study phonetics


seriously. Over the next few years, Pitman's untiring enthusiasm in publicizing the new ideas, and Ellis's
knowledge of languages and assiduous research into
the background of phonetics and notation systems,
resulted in the English phonotypic alphabet of 1847
(see Kelly, 1981). Many of the symbols used were later
to form the basis of the alphabet of the International
Phonetic Association. The proposed reform of English
spelling never materialized, but Ellis's subsequent
work on phonetic notation undoubtedly owed much
to these early years of collaboration with Pitman.
In America also, proposals for spelling reform
continued to appear. In 1858, the president of the
Phonetic Society of Great Britain, Sir Walter Trevelyan,
proposed a prize for an essay on a reform in the
spelling of the English language, which should contain
an analysis of sounds and an alphabetic notation containing as few new types as possible. The prizewinning
essay, entitled Analytic orthography (Philadelphia,
1860), was by Samuel Haldeman (1812-1880),
professor of Zoology and Natural History at the University of Pennsylvania, and later professor of Comparative Philology at the same university. Haldeman
had a strong linguistic background, notably in Native
American languages, and was a good phonetic observer, fully familiar with the work of Sir William Jones
and with his contemporaries Lepsius (see Lepsius, Carl Richard (1810-1884)), Ellis, Pitman, Melville Bell (see Bell, Alexander Melville (1819-1905)), and Max Müller (see Müller, Friedrich Max (1823-1900)). Haldeman's notation was based on the Roman
alphabet, and the letters used were restricted to the
values they had in Latin. He used some diacritics and a
few new letters, including some 'broken' letters, that is, Roman forms with part of their strokes missing, not


a satisfactory device. He also attempted to symbolize


durational differences by introducing characters of
different widths. However, his system had no better
success than did others of the time.
Languages of America and Africa

In the early 19th century, most of the languages


of Africa and the indigenous languages of America
had no writing systems. Pierre Duponceau (see Duponceau, Pierre Étienne (1760-1844)), a French émigré to the United States, won the Volney Prize in 1838 with his Mémoire sur le système grammatical des langues de quelques nations indiennes de l'Amérique du Nord, and had shown in an earlier article
(English phonology, 1817) a thorough understanding of the principles of a universal alphabet, though
he never produced one himself. Under his influence,
John Pickering (1777-1846), like Duponceau a lawyer by training, was led to publish his Essay on a
uniform orthography for the Indian languages of
North America (Cambridge, Massachusetts, 1818).
Pickering, like Sir William Jones, whose work he
admired, used a Roman alphabet supplemented by
digraphs and some diacritics, preferring small superscript letters or subscript numerals to dots or hooks,
which he felt might accidentally be omitted. This
system was designed specifically for Native American
languages, not as a universal alphabet.
Many of the missionary societies were concerned
at this time to establish a standard transcription system. The Church Missionary Society produced a
pamphlet in 1848 entitled Rules for reducing unwritten languages to alphabetical writing in Roman characters: with reference especially to the languages
spoken in Africa. The rules allowed some flexibility
in deciding how detailed the transcription should be,
according to its intended use. The notation suggested
was Roman based, with a few diacritics and some
digraphs. Lewis Grout produced a Roman-based system for Zulu in 1853, on behalf of the American
Mission in Port Natal, using the symbols <c, q, x>
for the clicks.

Carl Richard Lepsius (1810-1884)

In 1852, the Church Missionary Society (CMS) invited the distinguished German Egyptologist Lepsius to adapt an alphabet that he had devised earlier, to suit the needs of the Society. Lepsius had been interested in writing systems for many years. In 1853, he won the agreement of the Royal Academy of Berlin to fund the cutting and casting of type letters for a new alphabet, to be used as a basis for recording languages with no writing system. In the following year, an Alphabetical Conference was convened in London, on the initiative of the Prussian ambassador in London, Carl Bunsen, who, as a scholar with an interest in philology, wished to explore the possibility of an agreed system for representing all languages in writing. The conference was attended by representatives from the CMS, the Baptist Missionary Society, the Wesleyan Missionary Society, and the Asiatic and Ethnological Societies, and a number of distinguished scholars, including Lepsius and Friedrich Max Müller. In spite of their well-known involvement in the transcription problem, neither Isaac Pitman nor A. J. Ellis was among those included. Four resolutions were passed: (1) the new alphabet must have a physiological basis; (2) it must be limited to the typical sounds employed in human speech; (3) the notation must be rational and consistent, suited to reading and printing, and Roman based, supplemented by various additions; and (4) the resulting alphabet must form a standard to which any other alphabet is to be referred and from which the distance of each is to be measured.
Lepsius and Max Müller both submitted alphabets for consideration; Müller's Missionary alphabet, which used italic type mixed with roman type, was not favored, and Lepsius's extensive use of diacritics had obvious disadvantages for legibility and the availability of types. The conference put off a decision, but later in 1854, the CMS gave its full support to Lepsius's alphabet. A German version of the alphabet appeared in 1854 (Das allgemeine linguistische Alphabet), followed in 1855 by the first English edition, entitled Standard alphabet for reducing unwritten languages and foreign graphic systems to a uniform orthography in European letters. The Lepsius alphabet had some success in the first few years, but Lepsius was pressed by the CMS to produce a new enlarged edition, which appeared in English in 1863 (printed in Germany, like the first edition, because the types were only available there; see Figure 5). The most obvious difference from the first edition was that the collection of alphabets, illustrating the application of Lepsius's standard alphabet to different languages, had been expanded from 19 pages and 54 languages to 222 pages and 117 languages (see Lepsius, 1863).

Figure 5 Symbols of the standard alphabet devised by Lepsius in 1863.
There was as yet no phoneme theory, but Lepsius
was well aware that no alphabet could or should try
to convey all of the subtle nuances of speech. His
practical aim was to make an intelligible and usable
system available to nonspecialists, hence the expansion of the collection of alphabets and the avoidance
of a technical description of the physiology of speech.
However, Lepsius relied almost entirely on the use of
diacritics to supplement the basic Roman symbols.
They were used in a consistent way for the most
part, and Lepsius foresaw situations in which it
would not be necessary to use all of the distinctions
provided for, namely, when, in modern terms, a
broad transcription would be sufficient. Nevertheless, the profusion of diacritics meant that the printing types were less accessible, and the symbols were
less legible and more subject to errors in reproduction. Ellis calculated that on the basis of 31 diacritical
marks (17 superscripts and 14 subscripts), Lepsius's
alphabet had at least 286 characters, of which at least
200 would have to be cut for every font used.
The alphabet attained a limited success in Africa;
the distinguished Africanist Carl Meinhof (see
Meinhof, Carl Friedrich Michael (1857-1944)) and his missionary friend Karl Endemann (1836-1919)
gave it their support by using it, with some modifications, in their works on African languages (see Heepe,
1983), and the missionary P. Wilhelm Schmidt adopted
it as a basis for his Anthropos alphabet (see later).
However, in spite of Lepsius's high international reputation, the support of the Berlin Academy, and the
resources of the CMS, the alphabet failed to find an
established place.

A. J. Ellis's Later Alphabets

Ellis's Universal writing and printing with ordinary letters (Edinburgh, 1856) contained his Digraphic and Latinic alphabets, with examples, hints on practical use, and comparisons with the systems of Lepsius and Max Müller, together with suggestions for a future Panethnic alphabet. The Digraphic alphabet, as its name suggests, supplemented the Roman alphabet with digraphs or trigraphs, such as <kh> for [x], <ng> for [ŋ], and <ngj> for [ɲ], with the idea of making its notation accessible to as wide a group of people as possible. It was intended for use in any language, and Ellis supplied an abbreviated form of it for use when detailed precision of description was not necessary. The Latinic alphabet was intended for those wishing to avoid the cumbersomeness of the Digraphic alphabet, which it did by employing small capitals and turned letters; for example, [x] is rendered by <K> and [ŋ] is rendered by <N>.
Probably Ellis's most well-known alphabet is Palaeotype (so called because it used the old types, without diacritics or non-Roman symbols, and with relatively few turned letters). Palaeotype was for scientific, not popular, use, and Ellis employed it in his monumental work On early English pronunciation (EEP) (1869-1889). There were about 250 separate symbols, the greatest number being digraphs or trigraphs, but with some italics, small capitals, and (very few) turned letters. Certain letters were used as diacritical signs; for instance, <j> is used to indicate palatality, as in <lj> for [ʎ]. There were also nonalphabetic signs for features such as ingressive airstreams, tones, and stress. Ellis provided a reduced form of his alphabet, known as Approximative Palaeotype, which contained about 46 separate symbols. The full alphabet was, he believed, 'in all probability the most complete scheme which has yet been published', but he foresaw the need to supplement it to accommodate sounds from languages not yet phonetically studied.
In the third volume of EEP (1871), Ellis published the Glossic alphabet, a simplified form of transcription for English, based on symbols used in normal English orthography. He intended it as a new system of spelling, to be used concurrently with the existing English orthography. In answer to objections that it would be too sweeping a change, Ellis produced a revised form in 1880 called Dimidiun, but this received even less support. None of these alphabets was destined to have any lasting success, but in the process of devising them Ellis laid a foundation for the development of phonetics as a discipline in Britain, and more particularly for the study and recording of English dialects.

Germany: Physiology and New Alphabets

In Germany, several schemes for a new alphabet were proposed by scholars who approached phonetics from a physiological rather than a linguistic angle. Karl Rapp (1803-1883), in Versuch einer Physiologie der Sprache (1836), used a Roman alphabet, supplemented by some letters taken from the Old English and Greek alphabets and a few diacritics. Ellis, writing in the 1840s, frequently paid tribute to Rapp's work. Ernst Brücke (see Brücke, Ernst (1819-1891)), in his Grundzüge der Physiologie (1856), also put forward a Roman-based alphabet, supplemented by some Greek letters and by superscript numerals used as diacritics. For example, <t1> represented alveolar [t], <t2> represented retroflex [ʈ], and so on. Brücke later (Über eine neue Methode der phonetischen Transscription, 1863) developed an iconic non-Roman alphabet, in which the consonant symbols were based on articulations but the vowel symbols were based on acoustic resonances. The consonant symbols occupy one or more of three vertically aligned areas and are made up of two basic parts, one showing place of articulation and the other manner of articulation. A further part is particularly interesting in its attempt to indicate states of the glottis other than vibrating: open, narrowed, closed, creaky, and what Brücke called 'hard resonance' and 'soft resonance'. The vowels occupy only the middle of the three areas, so that they stand out clearly. Diacritics are provided to indicate accent, duration, and types of juncture. Carl Merkel (1812-1876) also devised a non-Roman alphabet (Anatomie und Physiologie des menschlichen Stimm- und Sprachorgans, 1857; revised edition, 1866), but, like Brücke's alphabet, it is extremely difficult to read, and neither system attained any success. Both Brücke and Merkel were familiar with only a small range of languages and were concerned primarily to show the total capacities of the human vocal organs in the production of sounds.
Moritz Thausing (1838-1884), in Das natürliche Lautsystem der menschlichen Sprache (1863), based his system on a Naturlaut ('natural sound'), represented by the vowel <a>, and 21 other sounds, which diverged from <a> in three different directions (seven on each path), like a three-sided pyramid. His notation used a musical staff of four lines and three spaces, thus accommodating the seven grades of sounds as notes on the staff. Each of the three sets of seven had a special note shape to distinguish it. Intermediate sounds were shown by modifiers. Thausing believed this was preferable to Brücke's scheme in that the symbols were not iconic, and so could be used for sounds whose formation was not fully understood.
Felix du Bois-Reymond (1782-1865) was stimulated by the schemes put forward by Brücke and Lepsius to complete a scheme of his own (Kadmus, 1862), which he had sketched out earlier. It was Roman based but, unusually, attempted to combine this with an iconic approach. So, for instance, all the voiceless consonants had symbols that extend below the middle area (mittlere Bahn), unlike the voiced ones: <p> and <b> already conform to this principle, and to continue it, Reymond proposed, for example, that the symbol <q> should replace <t> as the voiceless equivalent of <d>. Like Brücke, he confined the vowel symbols to the middle area. In spite of a good phonetic basis outlined in his book, the scheme failed to become established.
Bell's Visible Speech

Alexander Melville Bell was the son of an elocution teacher, and in due course became his father's principal assistant. Between the years 1843 and 1870, he lectured in the universities of Edinburgh and London, after which he emigrated to Canada and continued his teaching there. In 1864, he gave public demonstrations of his new scheme for recording speech in writing, and in 1867 the system was published under the title Visible speech, the science of universal alphabetics. It was not (at least avowedly) intended to be a new spelling system, but rather to assist children in learning to read, and to provide a sound bridge from language to language. Bell was unsuccessful in attempts to persuade the British government to give him funds to support the system, but continued to use it for his own purpose in teaching, and claimed that it was perfect for all its purposes. The symbols he used were iconic, intended to signify the vocal organs involved in the production of the sound concerned. For instance, the open vocal cords are shown by <O>, which represents [h]. The consonant symbols are based on a sagittal diagram of the head facing right. One curve-shaped symbol represents a continuant (shown by the fact that there is a gap to the right) with a constriction at the back (i.e., [x]), whereas the same curve reversed represents a constriction at the front of the vocal tract. The same symbol with the gap facing upward represents dental [θ], and with the gap facing downward, palatal [ç]. The remaining consonant symbols have similar iconic relationships, with modifiers to show complete closure (a bar across the gap), nasality (a different bar), voicing, etc. The vowel symbols are based on a vertical line, with hooks at the top (close vowels), bottom (open vowels), or both (mid vowels), facing left for back vowels, right for front vowels, and in both directions for the so-called mixed or central vowels. Rounding is shown by a horizontal bar through the middle.
Ellis, writing just before the full publication of Visible speech, admitted that his Palaeotype was far less complete than Bell's scheme. However, that alphabet, he said, 'requires new types, which is always an inconvenience, though I believe that an entirely new system of letters, such as that of Mr. Bell, is indispensable for a complete solution of the problem'. He pointed out also that many potential users are ill-qualified, without special training, to use a very refined instrument. Iconic notations are certainly subject to the criticism that they may not be able to accommodate new sounds or new descriptive frameworks, and can never convey the exact nature of the sound symbolized. Bell's symbols were much better in design than most alphabets of this kind, but he faced the immense task of persuading people to adopt a system that looked very different from what they were used to seeing. The alphabet failed to find supporters outside the circle of his pupils.

Sweet's Romic and Organic Alphabets

Henry Sweet (see Sweet, Henry (1845-1912)), perhaps the greatest of 19th-century phoneticians, studied under Bell, and his Handbook of phonetics (1877) was intended to be an exposition and development of Bell's work, but in this book he used a Roman-based notation (influenced by Ellis's Palaeotype), which he called romic, distinguishing two varieties of it. Broad romic, his practical notation, intended to record only fundamental distinctions, corresponding to distinctions of meaning (i.e., phonemes, in modern terms), was confined to symbols with their original Roman values supplemented by digraphs and turned letters. Narrow romic was to be a scientific notation and provided extra symbols, notably for the vowels, for which Sweet used italics, diacritic <h>, and, further, turned letters. However, in 1880, he took over Bell's notation, which he regarded as an improvement on any possible modification of the Roman alphabet for scientific purposes. He modified it and added some symbols, to form an organic alphabet, which he used in his Primer of phonetics (1890) and in some other works (see Figure 2). At this stage, Sweet felt that, even for more practical purposes, the necessity to supplement the Roman alphabet with other devices (in particular, diacritics and new letters, which he strongly opposed) made it cumbersome and inefficient. Toward the end of his life, however, he emphasized that uniformity in notation was not necessarily a desirable thing while the foundations of phonetics are still under discussion, and accepted that the unfamiliarity of organic types might be too formidable an obstacle to overcome. He continued to use his romic alphabet as an alternative to the organic one, and broad romic formed the basis for the new alphabet of the International Phonetic Association (see later). Sweet's organic alphabet did not enjoy a long life, nor indeed did the idea of iconic alphabets, even though Daniel Jones and Paul Passy (see Passy, Paul Édouard (1859-1940)) thought it worthwhile to propose another such scheme in Le Maître Phonétique (1907).
Analphabetic Schemes

Analphabetic notations use symbols that represent the subcomponents of a segment, rather like chemical formulas. One early example of such a scheme was proposed by the Dutch writer Lambert ten Kate (see Kate Hermansz, Lambert ten (1674–1731)) in 1723, and another by Charles Darwin's grandfather, Erasmus Darwin, in 1803. Thomas Wright Hill (1763–1851), a schoolmaster in Birmingham who had a keen ear, though his knowledge of sounds was self-taught, devised a notation in which each place of articulation was allotted a number. The interaction of active and passive articulators was expressed in terms of numerator (passive) and denominator (active); for example, a bilabial articulation is 1 (upper lip) over 1 (lower lip), a labiodental is 2 (upper teeth) over 1, a dental is 2 over 3 (tongue tip), and so on. The degree of stricture and state of the glottis were shown by the shape of the line between the numbers, and vowels were indicated by the use of double lines instead of a single one. It is easier, however, to typeset symbols that are in horizontal sequence. Otto Jespersen (see Jespersen, Otto (1860–1943)) included his analphabetic (later called antalphabetic) alphabet in The articulations of speech sounds (1889). It used a combination of Roman letters, Greek letters, numerals, italics, heavy type, and subscript and superscript letters. The Greek letters represented the active articulators involved: lower lip (α), tongue tip (β), tongue body (γ), velum and uvula (δ), vocal cords (ε), and respiratory organs (ζ). The numerals following the Greek letter showed the relative stricture taken up by the articulators, and the Roman letters referred to the passive articulators. For example, the combination β1feδ0ε3 would represent one kind of [s] (β tongue tip, 1 close stricture, fe in the area of the alveolar ridge/hard palate, δ0 velic closure, ε3 open vocal cords). It was not intended for use in a continuous transcription (though Jespersen showed how this is possible in a matrix form), but served as a descriptive label for the segment concerned (cf. modern feature notations).
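The comparison with modern feature notations can be made concrete with a minimal sketch: the worked [s] example above is simply restated as a list of articulator/value pairs. The field glosses are illustrative and are not Jespersen's own terminology.

```python
# A minimal sketch (not Jespersen's formalism): the antalphabetic label for
# one kind of [s], beta-1 f-e delta-0 epsilon-3, unpacked into
# articulator/value pairs, much as a modern feature matrix would be.

jespersen_s = [
    ("β", "tongue tip", "1", "close stricture"),
    ("fe", "passive region", None, "alveolar ridge / hard palate"),
    ("δ", "velum and uvula", "0", "velic closure"),
    ("ε", "vocal cords", "3", "open (voiceless)"),
]

for symbol, organ, degree, gloss in jespersen_s:
    print(f"{symbol:>2} ({organ}): {degree or '-'} {gloss}")
```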
Friedrich Techmer had proposed an analphabetic scheme in his Phonetik (1880). It employed five horizontal lines that, together with the spaces in between them, showed the major places of articulation. Musical-type notes were then inserted to show the manners of articulation. It was designed essentially as a scientific notation for Techmer's own use and never achieved widespread adoption. His Roman-based alternative, published in the Internationale Zeitschrift für allgemeine Sprachwissenschaft (in 1884 and in 1888), was a highly detailed and systematic scheme, making use of a basic italic typeface, both uppercase and lowercase, with various diacritics either directly beneath or to the right of the main symbol. Johan Storm (see Storm, Johan (1836–1920)) judged it to be the best of the German systems of notation, and it was the basis for Setälä's 1901 transcription for Finno-Ugric languages (see Laziczius, 1966).
Kenneth Pike (see Pike, Kenneth Lee (1912–2000)), in his classic book Phonetics (1943), outlined an even more detailed analphabetic notation, called functional analphabetic symbolism. It was composed of roman and italic letters in uppercase and lowercase, and was intended to illustrate the complexity of sound formation and to expose the many assumptions that lie behind the customary short labels used to refer to sounds. The segment [t] is represented by the notation MaIlDeCVveIcAPpaat dtltnransfsSiFSs; the italic symbols give the broad headings of the mechanisms involved and the roman letters give subclassifications. Even this degree of complexity, Pike says, is 'suggestive but by no means exhaustive'.
Dialect Alphabets

Various special alphabets have been created for the transcription of particular dialects (see further in Heepe (1983)). J. A. Lundell was commissioned while a student at Uppsala to produce an alphabet for Swedish. The resulting Swedish dialect alphabet was first published in 1879 and has been widely used since then, not only for Swedish. The basic font is italic, which Lundell thought most suitable for cursive writing as well as for printing, and he supplemented the letters of the Roman alphabet almost entirely by employing new letter shapes, mostly formed by adding loops or hooks to the basic symbols. They retain some iconic character through a consistent use of a particular hook, etc., for one place of articulation. Lundell rejected the use of different fonts and of capital letters to make distinctions, and also avoided unattached diacritics, except those for suprasegmental features. Sweet was critical of Lundell's scheme, mainly on the grounds of the complex letter shapes and the consequent expense of casting the many new types required, but Storm, in correspondence with Sweet, expressed a much higher opinion of it, particularly the systematic character of the letter shapes used for consonants. Some of the vowel symbols are easily confused and would require extreme care in handwritten texts.
Jespersen produced a dialect alphabet for Danish, first published in the periodical Dania in 1890, and later used in the Danish pronouncing dictionary. It follows the broad principle, employing phonetic symbols without diacritics to represent the Danish sounds. Its reversal of the values assigned to <a> and <ɑ> in the IPA alphabet is a source of possible confusion.
The Alphabet of the International Phonetic
Association

L'Association Phonétique Internationale, founded in 1897, grew out of two previous organizations: The Phonetic Teachers' Association, founded in Paris in 1886, and L'Association Phonétique des Professeurs de Langues Vivantes, which replaced it in 1889. The first version of the IPA alphabet, based on Pitman's alphabet of 1847 (as revised in 1876) and on Sweet's broad romic, appeared in 1888. From the beginning, the emphasis was on practical use for language teaching. Consequently, the symbols were chosen with a view to clarity, familiarity, and economy. The published IPA principles stipulated that there should be a separate sign for each distinctive sound, that is, 'for each sound which, being used instead of another in the language, can change the meaning of a word' (i.e., for each phoneme). The symbols were to be letters of the Roman alphabet as far as possible, with values determined by international usage, and when very similar shades of sound were to be found in several languages, the same sign was to be used. The use of diacritics was to be avoided, except for representing shades of varieties of the typical sounds, because of the problem they presented for reading and writing. It was also stipulated that, when possible, the shape of new symbols should be suggestive of the sound they represent, by their resemblance to the old ones. For example, the basic <n> shape was retained for the palatal, retroflex, velar, and uvular nasals [ɲ, ɳ, ŋ, ɴ].
Over the years, the alphabet has been modified for
use as a general phonetic resource, to make detailed
phonetic transcriptions and comparisons of language
sounds. Diacritics have been accepted as admissible
for certain limited purposes. A thorough reappraisal
of it has been made (much of it reflected in articles in
the Journal of the International Phonetic Association
(1986–1989)). Since the Kiel Convention in 1989,
additions and amendments have been made to the
symbols and to their presentation in chart form (see
Figure 6).
The IPA principles have been amended in certain
respects, notably to make it clear that IPA symbols
should not be seen simply as representations of
sounds, but as shorthand ways of designating certain
intersections of . . . a set of phonetic categories which
describe how each sound is made. However, it is still
stated that the sounds that are represented by the
symbols are primarily those that serve to distinguish
one word from another in a language. Two important developments, as a result of working groups set
up following the Kiel Convention, are a complete
computer coding of IPA symbols and an extension
of the IPA alphabet to include disordered speech
and voice quality (see MacMahon, 1986; Albright,
1958; International Phonetic Association, 1999:
Appendices 2 and 3).

Twentieth-Century and Later Developments

The Anthropos Alphabet of P. W. Schmidt

The alphabet of P. W. Schmidt was first published in the periodical Anthropos in 1907, and was revised in 1924. Schmidt kept most of Lepsius's symbols, adding some diacritics to distinguish sounds left undifferentiated by Lepsius, but introduced the IPA symbols <>, <> (and turned versions of them) and <M>, mostly to replace symbols with diacritics. For the consonants, to give some examples, < > and <z> are used for the dental fricatives instead of Lepsius's <y> and <d>, and in Schmidt's revised edition the inverted forms of <t>, <c>, and <k> replaced Lepsius's click symbols </>, <//>, and <!>. Interestingly, the 1989 revision of the IPA alphabet adopted Lepsius's click symbols (slightly modified).
Native American Languages

In 1916, the Smithsonian Institution published a pamphlet entitled Phonetic transcription of Indian languages, embodying the report of the committee of the American Anthropological Association, consisting of Franz Boas (see Boas, Franz (1858–1942)), P. E. Goddard, Edward Sapir (see Sapir, Edward (1884–1939)), and A. L. Kroeber (see Kroeber, Alfred Louis (1876–1960)). The report took as a basis the alphabet used by J. W. Powell in Contributions to North American ethnology (vol. 3, 1877). In 15 pages, the pamphlet sets out general principles of transcription and rules for both a simpler and a more complete system. The principles closely resemble those of the IPA, concerning the use of the same symbol when the same sound occurs, the restrictions on the use of diacritics, the harmonizing of fonts, and the use of symbols for sound values like those that they customarily stand for. The simpler system is suggested for ordinary purposes of recording and printing texts, and the complete system is for the recording and discussing of complex and varied phonetic phenomena by specialists in phonetics. The full system of vowels is based on Sweet's 36-vowel system (excluding the shifted vowels of his final system). Sweet's wide vowels are normally shown by Greek letters, and narrow ones by Roman letters. The consonant symbols and prosodic marks are not very different from those of the IPA; some exceptions are that small capitals are used for voiceless liquids and nasals, and also for stops and fricatives said to be intermediate between surd (voiceless) and sonant (voiced). These include unaspirated voiceless stops. The total system is a sophisticated one, providing both a high degree of precision in transcribing the detailed features of Native American languages and a satisfactory, simple form for nonspecialists.

Jørgen Forchhammer's Weltlautschrift (World Sound Notation)

Forchhammer's world sound notation was published in Die Grundlage der Phonetik (Heidelberg, 1924). It comprises a basic set of 44 Lautgruppen (sound groups, made up of 13 vowels and 31 consonants), each comprising a set of sounds that can be represented by the same letter. The nuances within each group can be shown by the wide range of diacritics, which include subscript numerals to indicate successive points of tongue contact along the palate. Of the 44 basic symbols, 36 are identical with IPA symbols, but the diacritics are mostly different (see also Heepe, 1983).
The Copenhagen Conference

In April 1925, a conference was held in Copenhagen, convened by Otto Jespersen and attended by an international group of 12 specialists in different language groups, to try to establish a norm for a universal phonetic script. Their proposals, published in 1926 in Phonetic transcription and transliteration, were reprinted in 1983 (Heepe, 1983). The Copenhagen group firmly rejected the possibility of further iconic alphabets and approved the notion of broad transcriptions based on the phoneme. Their detailed proposals for symbols were given a somewhat cool reception by the Council of the IPA (as reported in Le Maître Phonétique in 1927), but the following protocols were accepted:
1. [ɸ β] for the bilabial fricatives (instead of [F V])
2. [o] for labialization
3. [ʈ ɖ ɳ ʂ ʐ] for the retroflex series (following Lundell)
4. a raised period [·] to show length
5. vertical stress marks [ˈ] and [ˌ] instead of the oblique mark.
Other proposals (rejected or previously adopted by the IPA) included a reversed comma below the letter for nasalization; a new diacritic for palatals; [ð] for the voiced dental fricative; [x X] for the velar fricatives; [K G N R L] for the uvular stops, nasal, trill, and lateral, and [X G] for the uvular fricatives; [O] for the glottal stop; and [D] as a diacritic for clicks. Among the suggestions for the vowels were abandoning the use of [a] and [ɑ] to signify different vowels, the use of a superscript [.] for central vowels, and umlaut for front rounded vowels, e.g., [ü ö].

Figure 6 The most current symbol chart of the International Phonetic Association. Reprinted from the International Phonetic
Association (1999) (the Department of Theoretical and Applied Linguistics, School of English, Aristotle University of Thessaloniki,
Thessaloniki, Greece), with permission.



Machine-Readable Transcriptions

One major consideration in recent years has been the choice of symbols to represent speech in computer systems. The need to have symbols that are available on normal keyboards has led to various systems of machine-readable phonetic alphabets. The most easily available method of supplementing a lowercase Roman alphabet is the use of uppercase, and there is a fair amount of agreement among different schemes in allocating uppercase symbols to IPA symbols. Another area of common ground is the use of <@> for a central schwa-type vowel, and of numerals that resemble phonetic symbols, such as <3> standing for [ɜ]. The machine-readable Speech Assessment Methods Phonetic Alphabet (SAMPA), developed in 1987 and 1989 by an international group of phoneticians, is capable of dealing with the transcription of a wide range of languages (see further in Wells (1997); see also Phonetic Transcription: Analysis).
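The keyboard-friendly strategy can be made concrete with a small sketch. The mappings below are a handful of standard SAMPA conventions for English (see Wells, 1997, for the full scheme); the conversion function is illustrative only and handles only single-character symbols.

```python
# A few SAMPA-to-IPA correspondences, illustrating the strategy described
# above: uppercase letters and other ASCII characters stand in for
# non-Roman IPA symbols. The selection is illustrative, not exhaustive.

SAMPA_TO_IPA = {
    "@": "ə",   # schwa, the most widely shared convention
    "3": "ɜ",   # a numeral chosen for its resemblance to the IPA symbol
    "N": "ŋ",   # uppercase supplementing lowercase <n>
    "S": "ʃ",
    "T": "θ",
    "D": "ð",
    "V": "ʌ",
    "I": "ɪ",
}

def sampa_to_ipa(transcription: str) -> str:
    """Convert a SAMPA string symbol by symbol, leaving plain letters alone."""
    return "".join(SAMPA_TO_IPA.get(ch, ch) for ch in transcription)

print(sampa_to_ipa("f@"))    # fə
print(sampa_to_ipa("TINk"))  # θɪŋk
```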
See also: International Phonetic Association; Bell, Alexander Melville (1819–1905); Boas, Franz (1858–1942); Brosses, Charles de (1709–1777); Brücke, Ernst (1819–1891); Bullokar, William (c. 1531–1609); Duponceau, Pierre Etienne (1760–1844); Ellis, Alexander John (né Sharpe) (1814–1890); Firthian Phonology; Hart, John (?1501–1574); Jespersen, Otto (1860–1943); Jones, William, Sir (1746–1794); Kate Hermansz, Lambert ten (1674–1731); Kroeber, Alfred Louis (1876–1960); Lepsius, Carl Richard (1810–1884); Lodwick, Francis (1619–1694); Meigret, Louis (?1500–1558); Meinhof, Carl Friedrich Michael (1857–1944); Müller, Friedrich Max (1823–1900); Passy, Paul Edouard (1859–1940); Phonetic Transcription: Analysis; Pike, Kenneth Lee (1912–2000); Pitman, Isaac, Sir (1813–1897); Sanctius, Franciscus (1523–1600); Sapir, Edward (1884–1939); Steele, Joshua (1700–1791); Storm, Johan (1836–1920); Sweet, Henry (1845–1912); Volney, Constantin-François Chasseboeuf, Comte de (1757–1820); Walker, John (1732–1807); Wilkins, John (1614–1672).

Bibliography
Abercrombie D (1967). Elements of general phonetics.
Edinburgh: Edinburgh University Press.
Abercrombie D (1981). Extending the Roman alphabet:
some orthographic experiments of the past four centuries.

In Asher R E & Henderson E J A (eds.) Towards a


history of phonetics. Edinburgh: Edinburgh University
Press.
Albright R W (1958). The International Phonetic Alphabet:
its background and development. International Journal
of American Linguistics 24(1B), part III. [Publication
seven of the Indiana Research Center in Anthropology,
Folklore, and Linguistics.]
Copenhagen Conference (1925). Phonetic transcription
and transliteration: proposals of the Copenhagen Conference April 1925. Oxford: Clarendon Press.
Heepe M (ed.) (1983). Lautzeichen und ihre Anwendung
in verschiedenen Sprachen. Hamburg: Helmut Buske
Verlag.
International Phonetic Association (IPA) (1999). Handbook of the IPA: a guide to the use of the International
Phonetic Alphabet. Cambridge: Cambridge University
Press.
Kelly J (1981). The 1847 alphabet: an episode of phonotypy. In Asher R E & Henderson E J A (eds.) Towards a
history of phonetics. Edinburgh: Edinburgh University
Press.
Kemp J A (1999). Transcription, transliteration and
the idea of a universal alphabet. In Leopold J (ed.)
Prix Volney essay series, vol. I:2. Dordrecht: Kluwer.
476–571.
Laziczius G (1966). Schrift und Lautbezeichnung. In
Sebeok T A (ed.) Selected writings of Gyula Laziczius.
The Hague: Mouton.
Lepsius C R (1863). Standard alphabet for reducing unwritten languages and foreign graphic systems to a uniform
orthography in European letters. [2nd rev. edn. (1981),
with an introduction by Kemp J A. Amsterdam: John
Benjamins].
MacMahon M K C (1986). The International Phonetic
Association: the first 100 years. Journal of the International Phonetic Association 16, 33–38.
Pullum G K & Ladusaw W A (1996). Phonetic symbol
guide (2nd edn.). Chicago: University of Chicago Press.
Sweet H (1880). Sound notation. Transactions of the Philological Society, 177–235.
Wellisch H H (1978). The conversion of scripts: its nature, history and utilization. New York: Wiley.
Wells J C (1997). SAMPA computer readable phonetic
alphabet. In Gibbon D, Moore R & Winski R (eds.)
Handbook of standards and resources for spoken language systems. Berlin & New York: Mouton de Gruyter.
Part IV, sect. B.


Phonetically Motivated Word Formation


F Katamba, Lancaster University, Lancaster, UK
© 2006 Elsevier Ltd. All rights reserved.

Tacit Knowledge about Phonetics in Morphology
Speakers of a language have a wealth of tacit knowledge about phonetics that underlies some of the morphological processes that they use. For instance, in
English, monosyllabic words are subject to a constraint that if they contain a long vowel or diphthong
that is followed by two coda consonants or, alternatively, if they contain a short vowel followed by
three coda consonants, the final consonant must be
coronal (e.g., [t, d, s]) (Table 1). Coronals are special. Due to their special phonological status, coronals (/s, z/ and /t, d/) are the only consonants used in regular English inflectional suffixes (e.g., plural, past tense).
Phonological conditioning of allomorphs is a well-known phenomenon that highlights the interplay between morphology and sound structure. In regular English inflectional morphology, for example, the suffix must not only be coronal, but must also agree in voicing with the final sound of the stem. The choice of the suffix (/s, z/ or /t, d/) depends on whether the last sound of the base is voiceless, as in bucks [bʌk-s] and clocked [klɒk-t], or voiced, as in bugs [bʌg-z] and clogged [klɒg-d].
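The voicing-agreement pattern can be sketched as a small rule. The snippet below is a deliberately simplified illustration: stems are rough phonemic strings, the voiceless inventory is truncated, and the sibilant-final case (buses, judges) is ignored.

```python
# A minimal sketch of the voicing-agreement condition described above.

VOICELESS = set("ptkfθs")  # simplified inventory, for illustration only

def inflect(stem: str, suffix_pair: tuple[str, str]) -> str:
    """Attach the voiceless or voiced member of a coronal suffix pair,
    matching the voicing of the stem-final sound."""
    voiceless, voiced = suffix_pair
    return stem + (voiceless if stem[-1] in VOICELESS else voiced)

print(inflect("bʌk", ("s", "z")), inflect("bʌg", ("s", "z")))    # bʌks bʌgz
print(inflect("klɒk", ("t", "d")), inflect("klɒg", ("t", "d")))  # klɒkt klɒgd
```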

Syllables and Allomorphy


It is not just the attributes of individual sounds that are relevant. In many languages, the choice of allomorph may be conditioned by the number of syllables in the base. For instance, in English the comparative and superlative degree suffixes -er and -est are conditioned by the number of syllables in the base. Either suffix is automatically attached to monosyllabic words (cf. taller, saddest) and may attach to disyllabic words if the second syllable ends in a weak sound like [ə], [ɪ], [l̩], or [əʊ] (e.g., gentler, sunniest, safer, and narrowest). Acceptability of disyllabic adjectives with a stronger second syllable is variable: whereas commoner may be acceptable, *certainer is not. However, if the base exceeds two syllables, -er and -est are ruled out and more and most are obligatory (e.g., more intelligent and most beautiful, not *intelligenter and *beautifulest).
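A rough sketch of the syllable-count condition follows. Syllable counts are supplied by hand, and the disyllabic cases are merely flagged as variable rather than resolved, since, as noted above, a count alone does not decide them.

```python
# A rough sketch of comparative formation conditioned by syllable count.

def comparative(adj: str, syllables: int) -> str:
    if syllables == 1:
        return adj + "-er"
    if syllables == 2:
        return f"{adj}-er or 'more {adj}' (acceptability variable)"
    return "more " + adj

for adj, n in [("tall", 1), ("gentle", 2), ("common", 2), ("intelligent", 4)]:
    print(adj, "->", comparative(adj, n))
```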
The structure of syllables can also condition allomorphy. Consider the two allomorphs of the genitive in the Australian language Djabugay (Patz, 1991):

(1a) /-n/ after a base ending in a vowel, e.g., guludu 'dove', genitive guludu-n
(1b) /-ŋun/ after a base ending in a consonant, e.g., gaɲal 'goanna', genitive gaɲal-ŋun

Kager (1996: 155) showed that the choice of the allomorph is partially predictable from the syllable structure of the base and from the way it is syllabified once the suffix is appended. Because Djabugay disallows syllables that end in a consonant cluster, allomorphy is not allowed to deliver impermissible syllables that have a consonant cluster in the coda. Hence, the single-consonant allomorph /-n/ never attaches to a base ending in a consonant:

(2)          /-n/          /-ŋun/
    guludu   gu.lu.dun     ?gu.lu.du.ŋun
    gaɲal    *ga.ɲal.n     ga.ɲal.ŋun
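The allomorph choice in (1)–(2) amounts to a simple selection rule, sketched below under the simplifying assumption of a three-vowel inventory; the forms follow the repaired transcription used above.

```python
# A minimal sketch of the Djabugay genitive selection: /-n/ attaches only
# to vowel-final bases, so no coda cluster is ever created; consonant-final
# bases take /-ŋun/ instead.

VOWELS = set("aiu")  # simplified inventory, for illustration

def genitive(base: str) -> str:
    return base + ("-n" if base[-1] in VOWELS else "-ŋun")

print(genitive("guludu"))  # guludu-n    'dove (genitive)'
print(genitive("gaɲal"))   # gaɲal-ŋun   'goanna (genitive)'
```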

Phonetic Similarity

Phonetic attributes of not just individual syllables but also entire words may be important. That is the case when a language has vowel harmony, a process whereby, within the word, vowels are required to share certain phonetic traits. The vowels of the language are divided into two sets and, within the relevant domain, all vowels must be front or back, round or unround, high or non-high, etc.
A good example of a language with vowel harmony is Turkish, which has eight vowels divided into two sets, as shown in Table 2.
Vowel harmony goes from left to right, and it requires all vowels to agree with the first stem vowel in backness; in addition, suffix vowels that are high must also agree with the preceding vowel in roundness.

Table 1 Privileged nature of coronal consonants

Long vowel or diphthong + CC[coronal]: blind, dined; fiend, cleaned; Gould, cooled (vs. *blink [blaɪŋk], *fiemp, *Goulg, *coolep, etc.)
Short vowel + CCC[coronal]: text, vexed; blocked, clocked; waxed, axed (vs. *texk [teksk], *blockek, *waxep, etc.)

Table 2 Turkish vowels

             Front               Back
             Unround   Round     Unround   Round
High         i         ü         ı         u
Non-high     e         ö         a         o



Table 3 Turkish vowel harmony

Noun stem       Genitive singular   Plural      Genitive plural
adam 'man'      adam-ın             adam-lar    adam-lar-ın
ev 'house'      ev-in               ev-ler      ev-ler-in
kol 'arm'       kol-un              kol-lar     kol-lar-ın
göz 'eye'       göz-ün              göz-ler     göz-ler-in
iş 'work'       iş-in               iş-ler      iş-ler-in
kız 'girl'      kız-ın              kız-lar     kız-lar-ın
pul 'stamp'     pul-un              pul-lar     pul-lar-ın

As seen in Table 3, if several suffixes follow a stem, all the suffix vowels harmonize with the root vowel.
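The pattern in Tables 2 and 3 can be sketched as a small rule: a suffix vowel copies backness from the nearest preceding vowel, and a high suffix vowel also copies rounding. The sketch below uses Turkish orthography, simplifies the genitive (omitting the buffer consonant found after vowel-final stems), and reproduces the forms in Table 3.

```python
# A sketch of Turkish suffix harmony: backness is copied from the nearest
# preceding vowel; a high suffix vowel also copies rounding.

FRONT = set("ieöü")
ROUND = set("öüou")

def _last_vowel(word: str) -> str:
    return [v for v in word if v in "aeıioöuü"][-1]

def harmonize(base: str, high_suffix: bool) -> str:
    v = _last_vowel(base)
    front, rounded = v in FRONT, v in ROUND
    if high_suffix:                      # four-way alternation: ı / i / u / ü
        return {(False, False): "ı", (True, False): "i",
                (False, True): "u", (True, True): "ü"}[(front, rounded)]
    return "e" if front else "a"         # two-way alternation: a / e

def plural(base):   return base + "-l" + harmonize(base, high_suffix=False) + "r"
def genitive(base): return base + "-" + harmonize(base, high_suffix=True) + "n"

for noun in ["adam", "ev", "kol", "göz", "kız", "pul"]:
    print(noun, plural(noun), genitive(noun), genitive(plural(noun)))
```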
However, things do not always work so smoothly. There is disharmony on occasion. Consider the following examples from Rocca and Johnson (1995: 155):

(3a) iki 'two'        (3b) ikigen 'two-dimensional'
     altı 'six'            altıgen 'hexagonal'
     yedi 'seven'          yedigen 'heptagonal'
     sekiz 'eight'         sekizgen 'octagonal'

The expected backness harmony does not materialize: regardless of the nature of the root vowel, the suffix -gen contains the front vowel /e/.

Avoidance of Phonetic Similarity


Morphology may also be driven by the opposite concern: avoidance of similarity. The null hypothesis that consonants freely co-occur in roots is not always borne out. A classic case of this is Arabic, which normally bars verbal roots with homorganic consonants (Greenberg, 1950). In current theories of phonology, this is accounted for by invoking an OCP-Place constraint (i.e., the Obligatory Contour Principle regulating Place), which prohibits consonants with the same place of articulation from occurring in the same verb root (McCarthy, 1979).
The canonical verb root in Arabic is CCC. Vowels are added and consonants may be geminated in various parts of the paradigm:

(4) kataba    'he wrote'
    kattaba   'he caused to write'
    kaataba   'he corresponded'
    kutiba    'it was written'

There also exist roots with just two distinct consonants (e.g., zr 'pull'). According to McCarthy (1979), Arabic enforces the OCP:

(5) OCP: Identical adjacent consonants are prohibited.

Enforcement of the OCP means that words such as *qaqata and *zazara, whose first two consonants are identical and hence violate the OCP, are not allowed. If a root has only two consonants to associate with a template that has three positions, it is always the last two consonants that are identical. This must be due to the spreading of the second consonant to the third C position. Such forms (e.g., zarara) do not violate the OCP, since there is just one consonant at the root tier.
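The effect of (5) on triliteral roots can be sketched as a simple check. The sketch below looks only at full identity of adjacent radicals, treating a C1-C2-C2 root as spreading rather than as two radicals; the gradient, place-based similarity effects discussed next are not modeled.

```python
# A small sketch of the OCP in (5), restricted to full identity of radicals.

def violates_ocp(root: tuple[str, str, str]) -> bool:
    c1, c2, c3 = root
    if c1 == c2:     # adjacent identical radicals at the left edge: ruled out
        return True
    if c2 == c3:     # analyzed as spreading of C2, not as two radicals
        return False
    return False

for root in [("q", "q", "t"), ("z", "z", "r"), ("z", "r", "r"), ("k", "t", "b")]:
    print("-".join(root), "violates OCP" if violates_ocp(root) else "OK")
```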
Some treatments of these data in the OT framework
have regarded forms such as zarara as violations of the
OCP that are tolerated by the language because they
are incurred in the context of endeavoring to satisfy a
higher ranked constraint (Rose, 2000). A proposal by
Frisch et al. (2004) dealt with the problem using the
notion of similarity avoidance. The strength of the OCP-Place constraint is determined by comparing the actual number of roots subject to the OCP effect with the number expected if the distribution were random. It is shown that
the potency of the OCP effect is a function of the
phonetic similarity between consonant pairs. For instance, Arabic triconsonantal roots of the /d t C/ type (where C is any consonant) are not attested, although numerous roots start with /d/ and /t/ is a very common second consonant in triconsonantal roots. This is attributed to pressure to conform to the OCP by keeping nearly identical consonants, which here differ only in one parameter, namely voicing, from occurring next to each other. In the dictionary that
the study was based on there were a mere two roots
with /d s C/. The near absence of such roots is again
attributed to their similarity. In contrast, /d g C/ roots,
which show greater phonetic distance, are more
common. This suggests that the OCP is sensitive
to phonetic similarity. The greater the degree of
similarity, the more zealously it is observed.

Blocking
Even in a language such as English that has no general
co-occurrence restrictions on consonants, there is not
always total freedom. For instance, blocking of an
otherwise quite general rule can be observed in the
behavior of the suffix -en, which forms inchoative
verbs from adjectives. Halle (1973) pointed out that
this suffix can only be added if the base is monosyllabic and ends in an obstruent. Both conditions are


met by the forms in (7a) but not by those in (7b), which are hence ill-formed:

(7a) blacken, soften, toughen, harden
(7b) *bluen, *greenen, *comforten, *flexiblen

Licensing
Words display phonotactic patterns that are a consequence of constraints on syllabification. According to
Ito (1988), syllabification can be viewed as a case of
template matching. Segments that are not matched
with a slot in the syllable template are unlicensed
and hence fail to appear in the surface representation.
This may result in allomorphy. Consider the example
from the Australian language Lardil in (8), analyzed
in Kenstowicz (1994):
(8a) pir.ŋen 'woman'     wa.ŋal 'boomerang'
     rel.ka 'head'       wu.lun 'fruit species'
     kar.mu 'bone'       ma.yar 'rainbow'
     kan.tu 'blood'      yaR.put 'snake, bird'
     kuŋ.ka 'groin'      ŋam.pit 'humpy'
(8b) Only a coronal consonant, or a consonant homorganic with a following consonant, is licensed in coda position.

As seen, Lardil has a CVC syllable template and there is a requirement that only coronal consonants may
occur in coda position. Labial and velar consonants
are not licensed to appear in coda position unless they
are followed by a homorganic consonant. To conform
to the requirement in (8b), the allomorph ending in
CV appears in the absolute form, whereas the one
with a final /k/ is found in the inflected form, where
it does not violate (8b) since the velar is reanalyzed as
a syllable onset:
(9) absolute    inflected       
    ŋalu        ŋaluk-in        'story'
    thurara     thuraraŋ-in     'shark'

The privileged status of coronals observed here that licenses them to appear in coda position is reminiscent of their special status in English, which was
observed in Table 1.
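The Lardil alternation in (8)–(9) can be sketched as a licensing check. The segment classes below are drastic simplifications, and only the coda condition is modeled: a stem-final non-coronal consonant surfaces only when a vowel-initial suffix turns it into an onset.

```python
# A minimal sketch of Lardil coda licensing and the allomorphy in (9).

CORONALS = set("tnrl")
VOWELS = set("aiu")

def surface(stem: str, suffixed: bool) -> str:
    if suffixed:
        return stem + "-in"              # final C resyllabified as an onset
    final = stem[-1]
    if final in VOWELS or final in CORONALS:
        return stem                      # licensed coda (or no coda at all)
    return stem[:-1]                     # unlicensed non-coronal coda drops

print(surface("ŋaluk", suffixed=False))  # ŋalu      (absolute 'story')
print(surface("ŋaluk", suffixed=True))   # ŋaluk-in  (inflected 'story')
```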

Prosodic Morphology
More interesting still are phenomena such as reduplication and root-pattern morphology, where morphology is subject to prosodic circumscription (McCarthy and Prince, 1990, 1995). These phenomena have been investigated using prosodic morphology, a theory that relies on the prosodic hierarchy proposed in McCarthy and Prince (1995: 284). A light syllable such as V or CV has one mora, and a heavy syllable such as CVV or CVC has two moras (cf. Hayes, 1989; Hyman, 1985).
Syllable weight is the determinant of metrical structure. Metrical feet are defined in terms of the moraic
structure of their syllables. A metrical foot in which a
light syllable precedes a heavy one is an iambic foot; a
foot in which a heavy syllable precedes a light one is a
trochaic foot. McCarthy and Prince go on to posit
foot binarity:
(12) Foot binarity
Metrical feet contain two moras or two
syllables.

Typically, the minimal word is bimoraic, and this tends to have a key role in morphology (Ito, 1986; Ito and Mester, 2003; McCarthy and Prince, 1995). For instance, Luganda (Uganda) has a constraint requiring a lexical stem to be a minimal word, which means that it must be at least bimoraic. To satisfy this constraint, a monosyllabic CV stem undergoes final vowel lengthening:

(13) /mu-ti/ 'tree'      [muti:]
     /ki-be/ 'jackal'    [kibe:]
     /ka-so/ 'knife'     [kaso:]
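The Luganda repair in (13) can be sketched as a bimoraic-minimum check; short vowels count as one mora each, length is written with ː, and the noun-class prefix is assumed not to contribute to stem weight.

```python
# A minimal sketch of the bimoraic minimal-word requirement in (13).

VOWELS = set("aeiou")

def moras(stem: str) -> int:
    return sum(1 for ch in stem if ch in VOWELS)

def minimal_word(stem: str) -> str:
    # lengthen the final vowel if the stem is lighter than two moras
    return stem if moras(stem) >= 2 else stem + "ː"

for prefix, stem in [("mu", "ti"), ("ki", "be"), ("ka", "so")]:
    print(f"/{prefix}-{stem}/ -> [{prefix}{minimal_word(stem)}]")
```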

Reduplication also shows the central role played by the prosodic hierarchy in morphology. Many reduplicative phenomena call for the reduplication of either a light syllable (σμ) or a heavy syllable (σμμ). In Tagalog (Philippines), the future tense can be formed by reduplicating the first light syllable of the stem:

(14) sulat 'write'    su-sulat 'will write'
     ibig 'love'      i-ibig 'will love'
     aral 'teach'     a-aral 'will teach'

Table 4 Arabic root-pattern morphology

Singular             Plural
nafs 'soul'          nufuus
rajul 'man'          rijaal
asad 'lion'          ʔusuud
jundub 'locust'      janaadib
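The Tagalog future in (14) can be sketched as copying of the initial light syllable. The sketch below uses the simplified spellings given above (initial glottal stops are not represented) and assumes the second segment of a consonant-initial stem is a vowel.

```python
# A minimal sketch of light-syllable reduplication in the Tagalog future.

VOWELS = set("aeiou")

def future(stem: str) -> str:
    copy = stem[0] if stem[0] in VOWELS else stem[:2]   # V or CV
    return copy + "-" + stem

for stem in ["sulat", "ibig", "aral"]:
    print(stem, "->", future(stem))   # su-sulat, i-ibig, a-aral
```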


Ilokano (Ethnologue name, Ilocano), which is also spoken in the Philippines, forms plurals by reduplication of a heavy syllable (McCarthy and Prince, 1995: 285):

(15) kaldiŋ 'goat'     kal-kaldiŋ 'goats'
     pusa 'cat'        pus-pusa 'cats'
     roʔot 'litter'    ro:-roʔot 'litter (pl.)'
     trak 'truck'      tra:-trak 'trucks'

Root-pattern morphology is also highly sensitive to prosodic domains. This can be seen in the behavior of the productive plural morphology of Arabic. The plural and diminutive have a fixed canonical form whose template is an iambic foot, σμσμμ. The actual segments associated with the foot vary, but the prosodic template remains fixed (Table 4).

Sound-Motivated Compounding in English

Sound may also play a more direct role in selecting bases for compounding (Thun, 1963). It is possible to distinguish between two types of reduplication in English, namely rhyme-motivated compounds and ablaut-motivated compounds. In the former, the final syllables of the two parts of the compound rhyme:

(16) rap-tap, claptrap, nitwit, ragtag, hocus-pocus, hum-drum

In contrast, in ablaut-motivated compounds, the consonants remain constant while the vowels change:

(17) flip-flop, tick-tock, snip-snap, ping-pong, teeny-weeny, riff-raff

Many reduplicatives in English are informal or familiar and occur especially in child–parent talk (e.g., din-din 'dinner') (Quirk et al., 1985: 1579).

Conclusion
Morphology is strongly intertwined with phonetics and phonology. Allomorphy may be conditioned by bases ending in a particular sound or in a particular syllable structure. For instance, -en suffixation in English is not allowed after sonorants (cf. fatten, *greenen). Licensing of sounds with particular phonetic characteristics may result in stem allomorphy, as demonstrated for Lardil. The selection of bases for compounding may be influenced by their sounds. Finally, the prosodic hierarchy and minimality have vital roles in reduplicative and root-pattern morphology.

See also: Morphophonemics; Prosodic Morphology; Reduplication.
Bibliography
Frisch S, Pierrehumbert J B & Broe M B (2004). Similarity avoidance and the OCP. Natural Language and Linguistic Theory 22, 179–228.
Greenberg J (1950). The patterning of root morphemes in Semitic. Word 6, 162–181.
Halle M (1973). Prolegomena to a theory of word-formation. Linguistic Inquiry 4, 3–16.
Hayes B (1989). Compensatory lengthening in moraic phonology. Linguistic Inquiry 20, 253–306.
Hyman L (1985). A theory of phonological weight. Dordrecht, The Netherlands: Foris.
Ito J (1988). Syllable theory in prosodic phonology. New York: Garland.
Ito J & Mester A (2003). Japanese morphophonemics: markedness and word structure. Cambridge: MIT Press.
Kager R (1996). On affix allomorphy and syllable counting. In Kleinhenz U (ed.) Interfaces in phonology, Studia Grammatica 41. Berlin: Akademie Verlag. 155–171.
Kenstowicz M (1994). Phonology in generative grammar. Oxford: Blackwell.
McCarthy J (1979). A prosodic theory of nonconcatenative morphology. Linguistic Inquiry 12, 373–418.
McCarthy J & Prince A (1990). Prosodic morphology and templatic morphology. In Eid M & McCarthy J (eds.) Perspectives on Arabic linguistics: papers from the second symposium. Amsterdam: Benjamins. 1–54.
McCarthy J & Prince A (1995). Prosodic morphology. In Goldsmith J A (ed.) The handbook of phonology. Oxford: Blackwell. 318–366.
Patz E (1991). Djabugay. In Dixon R M W & Blake B J (eds.) The handbook of Australian languages (vol. 4). Melbourne: Oxford University Press. 245–347.
Quirk R, Svartvik J, Leech G & Greenbaum S (1985). A comprehensive grammar of the English language. London: Longman.
Rocca I & Johnson W (1999). A course in phonology. Oxford: Blackwell.
Rose S (2000). Rethinking geminates, long-distance geminates and the OCP. Linguistic Inquiry 31, 85–112.
Thun N (1963). Reduplicative words in English. Uppsala, Sweden: Carl Bloms.


Phonetics and Pragmatics


W Koyama, Rikkyo University, Tokyo, Japan
© 2006 Elsevier Ltd. All rights reserved.

Linguistic and Semiotic Matrices of Phonetics and Pragmatics
If we define pragmatics as what we do with words,
that is, performing acts of referential and nonreferential (social-indexical) signification, involving both
(signifying) signs and objects (i.e., what is signified),
then phonetics is part of pragmatics, since verbal
articulations constitute a kind of action. On the
other hand, if we narrowly define pragmatics as
the referential or nonreferential speech acts that result
from the use of sounds, mediated by a denotational
code called linguistic structure, pragmatics appears
located at the signified pole of signification and thus
diametrically opposed to phonetics at the signifying
pole (see Pragmatics: Overview; Semiotics: History).
Although the latter definition is usually adopted in
linguistics, the former definition is preferred in philosophy, semiotics, and communication studies,
which see phonetics and pragmatics as fields dealing
with actually occurring indexical acts, events, or their
regularities in the extensional universe, as opposed to
the intensional, abstract codes of symbolic signs such
as make up linguistic structure (see Deixis and
Anaphora: Pragmatic Approaches). In this model,
pragmatics includes phonetics (i.e., phonetically articulated signs), which, along with graphic, visual,
olfactory, and other kinds of signs, may be used to
signify some objects, that is, make meaningful significations.
As we shall see later, the linguistic model, which is
centered around linguistic structure (i.e., a denotational
code presupposingly used in the referential acts of communication), is, as Saussure and Peirce have noted,
properly and systematically included in the semiotic
matrix, which is concerned with communication as
such, including both the referential and social-indexical
aspects and both codes and processes (see Saussure,
Ferdinand (-Mongin) de (18571913); Peirce, Charles
Sanders (18391914)).

The Linguistic Matrix


In the narrower matrix of linguistics, pragmatics and
phonetics are polar opposites in terms of methodology and disciplinary organization, partially because
linguistics sees language primarily as structurally
mediated denotational signification, starting with
phonetic sounds and ending with pragmatic referents
(see PhonologyPhonetics Interface). Although this is

a partial (and, in fact, a somewhat limited) view of


communication, inasmuch as it abstracts away the
latters dialectic (interactively processual) and socialindexical aspects, the model is compatible with
the total matrix of communication and captures the
asymmetric directionality of signification (see following discussion for details).
Let us briefly observe the linguistic matrix, focusing on the methodological aspect. Here, one may
find a scale consistent with the flow of denotational signification and extending from phonetics
to phonology, morphophonology, morphology, syntax, semantics, and pragmatics. On this scale, the
positivistic evaluation of facts over interpretations
tends to become gradually more dominant as we
approach the leftmost pole (i.e., phonetics) and inversely for the hermeneutic evaluation of contextual
interpretations regarding bare facts, as it is most
prominently seen in pragmatics. Apparently, this
configuration suggests that the process of signification moving from sounds to meanings is seen as
the transition from physical nature to hermeneutic
culture (see Phonetics: Overview; Phonology: Overview).
Thus, the distinct modi operandi of phonetics
and pragmatics appear to fit neo-Kantian epistemology, which classifies all kinds of sciences into two
methodological ideal types, namely (1) nomothetic
natural sciences of Erklären, explanation, based on
laws and other regularities that can be abstracted
from contextualized actions and events; and (2) idiographic, cultural-historic sciences of Verstehen, understanding, holistically and hermeneutically dealing
with uniquely contextualized eras, cultures, individuals, or events (cf. Mey, 2001) (see Kant, Immanuel
(17241804)). One may count classical physics and
chemistry among the prototypical nomothetic
sciences, in contrast to historiography and ethnography as prototypical idiographic sciences; between
these two extremes, the soft sciences, including linguistics, are pulled in two directions and thus contain two opposing fields within themselves, to wit:
physicalistic phonetics and interpretive pragmatics,
with (nomothetic but not physicalistic) structural
linguistics in the middle (cf. Koyama, 1999, 2000).
Such is the de facto condition of phonetics and
pragmatics in our times. Yet, once we take a critical
stance to the actual condition, it becomes clear that
pragmatics, too, can be construed positivistically,
as was done by people like Bloomfield and (later)
Carnap, who understood referents as physical things
out there, existing independently of any signifying
communication; inversely, phonetics may similarly

416 Phonetics and Pragmatics

be construed interpretively and idiographically, inasmuch as phones (unlike the abstract regularities
called phonemes) are singular happenings in context.
Phonetics and pragmatics both concern actual acts
(i.e., unique happenings), which may show some
regularities; hence, they can be studied idiographically or nomothetically, whereas linguistic structure is an abstract code of denotational regularities,
which can be studied only nomothetically. Thus, a
critical understanding points to the semiotic matrix
of phonetics and pragmatics, to be explicated later
(see Bloomfield, Leonard (18871949); Carnap,
Rudolf (18911970); Phoneme).

The Semiotic Matrix of Communication


Pragmatics, Phonetics, and Indexical Semiosis

Communication, as a pragmatic (including phonetic)


act or event, is a process of referential or social
indexical signification (i.e., an actual happening that
occurs in the extensional universe that may presupposingly index contextual variables) (see Context,
Communicative). These variables include contextually presupposable intensional codes such as are embodied in the linguistic structure and create certain
effects in the extensional universe, such as referential
texts, dealing with what is said, and social-indexical
texts, dealing with what is done. In this model, as we
will see, phonetics becomes a thoroughly pragmatic
phenomenon (see Pragmatic Presupposition).
The signifying event, whether phonetically executed or not, is a singular happening in the extensional universe and functions indexically, as it points to
the context of its occurrence at the hic et nunc (origo)
of the signifying process. Also, the signifying event
may present itself as a replica of some objects that
appear similar to the event and thus iconically signify
these objects, as in quotative repetition and mirroring
reflection. In these ways, the signifying event may
signal a number of presupposable objects, which
may be types (regularities) or individual objects, and
such objects may become signs signifying other
objects (see Semiosis; Peirce, Charles Sanders (1839
1914)). Thus, the signifying event iconically signals or
presupposingly indexes regularities and individuals in
the context of its occurrence (i.e., it contextualizes
them) and creates (i.e., entextualizes) some effects:
the aforementioned referential and social-indexical
texts (what is said and what is done), the latter primarily concerning the group identities and power
relations of discourse participants and other social
beings (see Identity and Language; Power and Pragmatics).
The preceding is a general picture of signification, as it obtains across the two dimensions of

(a) reference and predication, and (b) social indexicality. Note that the objects that are contextualized
(i.e., presupposingly indexed) in signifying events
may be of various types: viz. (1) individual particulars
found in the microsocial surrounds of the signifying
event, including cooccurring signs (sounds, gestures,
etc.), the discourse participants, and what has been
already said and done (i.e., referential and social
indexical texts that have been entextualized and become presupposable at the time of the signifying
event); (2) microsocial regularities of referential
indexicality (e.g., the usage of deictic expressions)
and social indexicality (e.g., addressee honorifics,
turn taking, adjacency pairs, activity types, frames,
scripts, pragmatic principles, maxims, norms) (see
Honorifics; Conversation Analysis; and (3) macrosocial regularities of referential indexicality (e.g., the
causal chain of reference) (Putnam, 1975: 215271).
The latter are involved in the use of proper names
(viz., as macrosocially mediated references to individuals that are not found in the microsocial context)
and in usage-related social indexicality (e.g., speech
genres and socio- and dialectal varieties, characterized by such macrosociological variables as regionality, ethnicity, class, status, gender, occupation, or age)
(see Genre and Genre Analysis; Maxims and Flouting; Politeness; Pragmatics: Overview). Importantly,
these three kinds of presupposingly indexed objects
are often phonetically signaled; also, they are all
pragmatic in character, as they belong to the extensional universe of actions (vs. the intensional universe
of concepts). Indeed, any actions, including body
moves and phonetic gestures such as involving (nonphonological) intonation, pitch, stress, tempo, vowel
lengthening, breathing, nasalization, laughter, belching, clearing one's throat, snoring, sneezing, going
tut-tut, stuttering, response crying, or even pauses
and silence, may be contextualized in the speech
event so as to create some particular social-indexical
(interactional) effects or to become presupposable
regularities that may be expected to occur in certain contexts of particular speech communities (cf.
Goffman, 1981; Gumperz, 1982; Tannen & SavilleTroike, 1985; Duranti & Goodwin, 1992; Mey, 2001
for details) (see Gestures: Pragmatic Aspects; Phonetics: Overview; Silence).
Linguistic Structure and Other Symbols in
Indexical Semiosis

A fourth kind of object may be contextually presupposed, namely, the macrosocial regularities constituting symbolic codes. Recall that icons and indexicals
signify objects on the empirically motivated basis of
contextual similarity and contiguity, respectively.
There are, however, numerous attested instances of

Phonetics and Pragmatics 417

signification that cannot be totally accounted for by


these empirical principles. In such cases, discourse
participants appear to indexically presuppose the
intensional kind of signs that, without any observable
empirical motivation, regularly signify extensional
objects. Such signs, called symbols, constitute the
system of concepts (cultural stereotypes) and
the denotational code called linguistic structure. The
linguistic system is thus made up of the intensional
signs that are presupposingly indexed in the speech
event and symbolically denote extensional entities.
Therefore, these intensional systems of symbols are
indexically anchored on the extensional universe.
This linguistic structure, at the center of which we
find the maximally symbolic lexicon, that is, arbitrary
(in the Saussurean sense) combinations of morphophonemes and morphosyntactic forms, is organized
by the systematic interlocking of symbolic arbitrariness (language-particular structural constraints) and
indexical motivation (extensionally based constraints). Of these two, the latter gradually increases
as we move from (more abstract) morphophonemes
to (more concrete) phonemes and to allophones and
other surface phonetic phenomena such as phonotactic filters (see Pragmatics: Optimality Theory), and as
we move from (formal) morphosyntax to (denotational) semantics and to (referential) pragmatics
(see Pragmatics and Semantics). Further, just as the
markedness hierarchy of distinctive features such as
[syllabic] (i.e., [vocalic]), [sonorant], [voiced], which
is anchored on and characterized by phonetic extensions (degrees of sonority), can be formulated
to describe the differences among (morpho)phonemes
and their correspondences, the markedness hierarchy
of grammatical categories such as [pronoun], [proper
noun], [concrete noun], which is anchored on and
characterized by pragmatic extensions (degrees of
the contextual presupposability of referents), can be
formulated to describe the differences among
morphosyntactic and semantic categories (see the
Jakobsonian literature: e.g., Lee, 1997; Koyama,
1999, 2000, 2001 for details) (see Jakobson, Roman
(18961982); Markedness; Distinctive Features).
Dialectics of Signs: Interactions of Structure
and Discourse

Similarly, at the interface of the intensional and extensional universes, just as semantic categories (e.g.,
[animate]) may have contextually variable extensions
such as [animal] (inclusive use), [nonhuman animal]
(exclusive use), as well as particular contextual referents, phonemes may have various allophones, which
are contextualized happenings distinct from one another. These phonetic variants and other varying surface expressions, such as allomorphs (e.g., matrixes

vs. matrices) and syntactic variants (e.g., It's me vs. It's I) appear denotationally identical; in addition, they may clearly show statistically different patterns of cooccurrence with some social categories
(e.g., of class, gender, ethnicity), as the variable of
denotation is naturally controlled in this environment (see Class Language; Gender and Language)
As a consequence, the variations in surface forms
get distinctly associated with the variations in social
categories (cf. Lucy, 1992; also see the journal Language Variation and Change). Under such circumstances, language users may essentialize such merely
statistical (probabilistic) correspondence patterns by
perceiving them as categorical and thus ascribing particular sociological categories to particular linguistic
forms. The latter thereby become sociolinguistic
stereotypes (Labov, 1972) or registers (made up of
lexicalized stereotypes), that is, symbols having the
illocutionary forces of social-indexical character
(e.g., masculinity, honorificity) in themselves,
independent of the actual contexts of their use (see
Honorifics; Register: Overview). The decontextualizing process also underlies the formation of diminutives, augmentatives, and performative formulae
(e.g., I baptize thee), which may be used as formulaic one-liners to create rather strong effects in discourse (cf. Lucy, 1993; Hinton et al., 1994; Schieffelin
et al., 1998; Koyama, 2001) (see Performative
Clauses; Speech Acts; Pragmatic Acts). This illustrates the general process in which the quotative use
of a symbol achieves the effect of iconically presenting itself as a replica (token) of the symbolic pattern
(type), imposing the presupposable pattern on discursive interaction, and thus creating a text that
is more or less bracketed from its contextual surrounds and possesses the socialindexical meanings commonly ascribed to the symbol. More
generally, the enunciative, phonetico-pragmatic act
of repetition (cf. the Jakobsonian poetic function)
saliently serves to create textuality, as witnessed by
the poetic use of rhymes; the religious, political,
commercial use of chants, slogans, and divinations;
the quotidian use of turns, adjacency pairs, and
other everyday conversational routines; and
any other metapragmatic framings of discourse (cf.
Tannen, 1989, 1993; Koyama, 1997; Silverstein,
1998) (see Discourse Markers; Metapragmatics).

Conclusion
Being abundantly demonstrated in the literature,
the facts referred to in the preceding discussion unmistakably show that phonetics is a semiotically
integrated part of pragmatics; it is what we do in the
social context in which we live by creating referential

418 Phonetics and Pragmatics

and social-indexical texts through the iconic or presupposing indexing of contextual particulars and
regularities (types), of which the latter are systematically anchored on phonetic and other pragmatic (i.e.,
contextual) extensions, thus forming the basis of the
relationship between phonetics and pragmatics.
See also: Bloomfield, Leonard (1887–1949); Carnap, Rudolf (1891–1970); Class Language; Context, Communicative; Conversation Analysis; Deixis and Anaphora: Pragmatic Approaches; Discourse Markers; Distinctive Features; Gender and Language; Genre and Genre Analysis; Gestures: Pragmatic Aspects; Honorifics; Identity and Language; Jakobson, Roman (1896–1982); Kant, Immanuel (1724–1804); Markedness; Maxims and Flouting; Metapragmatics; Peirce, Charles Sanders (1839–1914); Performative Clauses; Phoneme; Phonetics: Overview; Phonology: Overview; Phonology–Phonetics Interface; Politeness; Power and Pragmatics; Pragmatic Acts; Pragmatic Presupposition; Pragmatics and Semantics; Pragmatics: Optimality Theory; Pragmatics: Overview; Register: Overview; Saussure, Ferdinand (-Mongin) de (1857–1913); Semiosis; Semiotics: History; Silence; Speech Acts.

Bibliography
Duranti A & Goodwin C (eds.) (1992). Rethinking context.
Cambridge: Cambridge University Press.
Goffman E (1981). Forms of talk. Philadelphia: University
of Pennsylvania Press.
Gumperz J J (1982). Discourse strategies. Cambridge:
Cambridge University Press.
Hinton L, Nichols J & Ohala J J (eds.) (1994). Sound
symbolism. Cambridge: Cambridge University Press.

Koyama W (1997). Desemanticizing pragmatics. Journal


of Pragmatics 28, 1–28.
Koyama W (1999). Critique of linguistic reason I. RASK,
International Journal of Language and Communication
11, 45–83.
Koyama W (2000). Critique of linguistic reason II. RASK,
International Journal of Language and Communication
12, 21–63.
Koyama W (2001). Dialectics of dialect and dialectology.
Journal of Pragmatics 33, 1571–1600.
Labov W (1972). Sociolinguistic patterns. Philadelphia:
University of Pennsylvania Press.
Lee B (1997). Talking heads. Durham, NC: Duke University
Press.
Lucy J A (1992). Language diversity and thought.
Cambridge: Cambridge University Press.
Lucy J A (ed.) (1993). Reflexive language. Cambridge:
Cambridge University Press.
Mey J L (2001). Pragmatics (2nd edn.). Oxford: Blackwell.
Putnam H (1975). Philosophical papers, vol. 2: Mind,
language, and reality. London: Cambridge University
Press.
Schieffelin B B, Woolard K A & Kroskrity P V (eds.) (1998).
Language ideologies. Oxford: Oxford University Press.
Silverstein M (1998). The improvisational performance of
culture in realtime discursive practice. In Sawyer K
(ed.) Creativity in performance. Greenwich, CT: Ablex.
265–312.
Tannen D (1989). Talking voices. Cambridge: Cambridge
University Press.
Tannen D (ed.) (1993). Framing in discourse. Oxford:
Oxford University Press.
Tannen D & Saville-Troike M (eds.) (1985). Perspectives on
silence. Norwood, NJ: Ablex.

Phonetics of Harmony Systems


M Gordon, University of California, Santa Barbara, CA, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction
Harmony involves a non-local spreading of some
feature or combination of features over some domain
larger than a single segment. The following example
from Finnish illustrates back/front vowel harmony.
The inessive suffix has two realizations. The variant containing a front vowel (-ssä) occurs after roots consisting of front vowels, e.g., kylässä 'in the village', whereas the allomorph containing a back vowel (-ssa) appears after roots with back vowels, e.g., talossa 'in the house'.
Harmony processes abound cross-linguistically and may be classified according to the types of features being propagated and whether vowels or consonants are targeted. The Chumash language provides an example of consonant harmony (Beeler, 1970). Chumash has two coronal fricatives, an apical /s/ and a laminal /ʃ/, which may not occur in the same word. This restriction triggers a right-to-left harmony process when a suffix containing a coronal fricative is added to a word containing a different coronal fricative, e.g., saxtun 'to pay' vs. ʃaxtun-ʃ 'to be paid', uʃla 'with the hand' vs. usla-siq 'to press firmly by hand'.
phonological in nature, relying on impressionistic
observations rather than instrumental investigation.
While this approach yielded many insights into the

418 Phonetics and Pragmatics

and socialindexical texts through the iconic or presupposing indexing of contextual particulars and
regularities (types), of which the latter are systematically anchored on phonetic and other pragmatic (i.e.,
contextual) extensions, thus forming the basis of the
relationship between phonetics and pragmatics.
See also: Bloomfield, Leonard (18871949); Carnap, Rudolf
(18911970); Class Language; Context, Communicative;
Conversation Analysis; Deixis and Anaphora: Pragmatic
Approaches; Discourse Markers; Distinctive Features;
Gender and Language; Genre and Genre Analysis;
Gestures: Pragmatic Aspects; Honorifics; Identity and
Language; Jakobson, Roman (18961982); Kant, Immanuel (17241804); Markedness; Maxims and Flouting; Metapragmatics; Peirce, Charles Sanders (18391914);
Performative Clauses; Phoneme; Phonetics: Overview;
Phonology: Overview; PhonologyPhonetics Interface;
Politeness; Power and Pragmatics; Pragmatic Acts; Pragmatic Presupposition; Pragmatics and Semantics; Pragmatics: Optimality Theory; Pragmatics: Overview;
Register: Overview; Saussure, Ferdinand (-Mongin) de
(18571913); Semiosis; Semiotics: History; Silence;
Silence; Speech Acts.


Phonetics of Harmony Systems


M Gordon, University of California, Santa Barbara,
CA, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction
Harmony involves a non-local spreading of some
feature or combination of features over some domain
larger than a single segment. The following example
from Finnish illustrates back/front vowel harmony.
The inessive suffix has two realizations. The variant containing a front vowel (-ssä) occurs after
roots consisting of front vowels, e.g., kylässä 'in the village', whereas the allomorph containing a
back vowel (-ssa) appears after roots with back vowels, e.g., talossa 'in the house'.
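As a purely illustrative sketch (not part of the original description, and ignoring for the moment the
neutral vowels discussed below), the choice between the two inessive allomorphs can be expressed as a
rule keyed to the harmonic vowels of the root:

# Minimal sketch of Finnish-style back/front suffix selection (illustrative only;
# real Finnish morphophonology is richer than this).
FRONT_VOWELS = set("äöy")
BACK_VOWELS = set("aou")

def inessive(root: str) -> str:
    """Attach -ssa after back-vowel roots and -ssä after front-vowel roots."""
    for ch in reversed(root.lower()):
        if ch in BACK_VOWELS:
            return root + "ssa"
        if ch in FRONT_VOWELS:
            return root + "ssä"
    return root + "ssä"      # roots with only neutral vowels take the front variant

print(inessive("talo"))      # talossa 'in the house'
print(inessive("kylä"))      # kylässä 'in the village'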

Harmony processes abound cross-linguistically and may be classified according to the types of features
being propagated and whether vowels or consonants are targeted. The Chumash language provides an
example of consonant harmony (Beeler, 1970). Chumash has two coronal fricatives, an apical /s/ and a
laminal /ʃ/, which may not occur in the same word. This restriction triggers a right-to-left harmony
process when a suffix containing a coronal fricative is added to a word containing a different coronal
fricative, e.g., saxtun 'to pay' vs. ʃaxtun-ʃ 'to be paid', uʃla 'with the hand' vs. usla-siq 'to press
firmly by hand'.
Up until recently, studies of harmony were strictly
phonological in nature, relying on impressionistic
observations rather than instrumental investigation.
While this approach yielded many insights into the
nature of harmony processes, it also left many questions that proved unanswerable without phonetic
data: What are the precise physical and acoustic properties that spread in harmony processes? To what
extent is harmony motivated by phonetic considerations such as the desire to minimize articulatory difficulty and enhance perceptual salience? Are segments
that appear to be transparent to harmony truly phonetically unaffected by the spreading feature? Do
phonetic differences underlie the dual behavior of
apparent harmonically neutral segments?
Recent advancements in instrumentation techniques and increased accessibility of speech analysis
software have made possible the phonetic research
necessary to tackle some of these unresolved issues.
This article will discuss some of the phonetic studies
that have enhanced our understanding of many
aspects of harmony systems. For purposes of the present work, the research on the phonetics of harmony
will be divided into two broad categories according to
the types of segments affected by harmony. The first
section considers vowel harmony, focusing on four
types of vowel harmony that have been subject to
phonetic research: front/back harmony, rounding
harmony, ATR harmony, and height harmony. In
the second section we discuss phonetic aspects of
harmony processes affecting consonants, including
nasal harmony and various types of long-distance
consonant harmony.

Vowel Harmony
Different phonetically based explanations for vowel
harmony have been proposed in the literature. Suomi
(1983) offers a perceptual account of vowel harmony
focusing on front/back harmony of the type found
in Finnish. He suggests that harmony reflects an
attempt to minimize the need to perceive differences
in the frequency of the second formant, the primary
acoustic correlate of backness, in syllables after the
first. Drawing on results from perceptual experiments
suggesting greater perceptibility of the first formant
(Flanagan, 1955), the acoustic correlate of height,
relative to the second formant, Suomi argues that
vowel harmony reduces the burden of perceiving the
perceptually less salient contrasts in backness.
Ohala (1994) proposes a slightly different explanation for the development of vowel harmony, suggesting that it is a fossilized remnant of an earlier
phonetic process involving vowel-to-vowel assimilation (p. 491). Coarticulation effects between noncontiguous vowels are well documented in the
phonetic literature (Öhman, 1966). Ohala suggests
that vowel harmony systems arise when these coarticulation effects, which normally are factored out

of the signal by the listener, are misparsed by the


listener as being independent of the vowel triggering
the coarticulation. This misapprehension leads the
listener to infer that the speaker was producing a
different target vowel than the speaker actually
intended to utter. The listener then introduces into
her own speech this new vowel, setting off a sound
change to be adopted by other speakers.
Front/back Vowel Harmony

An important question raised by vowel harmony is
whether the coarticulatory effects driving harmony
actually pass over segments intervening between the
target and trigger without affecting them. The apparent transparency of certain segments to harmony can
be clearly seen in the neutral vowels described in
many vowel harmony systems. Thus, for example,
although most Finnish suffixes containing a vowel
have two variants, one with a front vowel and the
other with a back vowel, there are two vowels /i, e/
which are not paired with corresponding back
vowels. These neutral vowels can occur in the same
word with either front or back vowels, as words containing the translative suffix -ksi show, e.g.,
katoksi 'roof (translative)' vs. kylvyksi 'bath (translative)'. Furthermore, the neutral vowels can
occur in roots containing either front or back vowels, e.g., hiha 'sleeve' vs. ikä 'age', pesä 'nest'
vs. pensas 'bush'.
The dual behavior of neutral vowels raises the
question of whether a neutral vowel is pronounced
the same in all contexts. Specifically, one might ask
whether the neutral vowels, which are widely regarded as front vowels, also have two variants, one
back and the other front, parallel to other vowels.
Investigation of the articulatory properties of neutral
vowels potentially has important implications for the
treatment of assimilatory processes in phonological
theory. If the neutral vowels were phonetically front
vowels even when the surrounding vowels are back,
this would prove that harmony is truly a long distance phenomenon and can thus only be handled by a
theory allowing for non-local spreading of a feature.
Phonetic data, both acoustic and articulatory, have
recently been used to investigate the possibility that
neutral vowels in front/back vowel harmony systems
have both front and back variants. A key advantage
of phonetic research on vowel harmony over impressionistic study is its potential ability to differentiate
between categorical phonological effects of harmony
and the normal coarticulation effects found in languages lacking true phonological harmony (Öhman,
1966). One recent study focuses on Finnish neutral
vowels using acoustic data while another body of
research investigates articulatory aspects of the
Hungarian neutral vowels, which are also /i, e/.
Gordon (1999) compares the realization of the Finnish neutral vowels in front and back vowel words by
inferring tongue position from the location of the first two formants. He finds an asymmetric effect
that mirrors the left-to-right vowel harmony found in Finnish: the place of the neutral vowel is
influenced by the frontness/backness of a preceding vowel but not a following vowel. Following a back
vowel, the neutral vowels are noticeably backer, as reflected in a lowering of the second formant (F2).
The backing of the neutral vowels is observed between back vowels, e.g., ukithan 'grandfathers
(emphatic)', and when preceded by a back vowel, e.g., tapit 'plugs', but not when only followed by a
back vowel, e.g., iho 'skin'. However, the effect of vowels in adjacent syllables on the neutral vowels
is relatively small, with an approximately 100 Hz difference in F2 as a function of surrounding vowel
context averaged over two speakers. Gordon concludes that while vowels in neighboring syllables exert
an effect on the relative backness of the neutral vowels, this effect is more consistent with phonetic
coarticulation than with a categorical difference in backness.
Gafos and Benus (2003) and Benus et al. (forthcoming) investigate neutral vowels in Hungarian using
ultrasound and electromagnetic midsagittal articulometry (EMMA), two techniques that provide a direct
measure of tongue position. They find that the neutral vowels are articulated with a more posterior
tongue dorsum position (by an average of 0.67 millimeters in their EMMA data) in back vowel contexts
relative to front vowel environments. However, although these differences are statistically reliable,
their relatively small magnitude leaves open the possibility that the differences are due to normal
coarticulation between vowels of the type found in languages lacking vowel harmony. In order to tease
apart coarticulation from true harmony effects, Gafos and Benus compare neutral vowels differing in
their subcategorization for front and back vowel suffixes. For example, the word vi:z 'water' takes the
front vowel variant of the dative suffix -nɔk/-nek, i.e., vi:znek, while the word hi:d 'bridge' takes
the back vowel allomorph, i.e., hi:dnɔk. Interestingly, they find that in unsuffixed roots containing
neutral vowels but taking back vowel suffixes (e.g., hi:d), there is a tendency for the tongue dorsum
to be slightly retracted during the neutral vowel relative to the vowel in unsuffixed roots containing
neutral vowels taking front vowel suffixes (e.g., vi:z). They regard their results, however, as
suggestive but nevertheless tentative pending the collection of more data.
Gafos and Benus's and Benus et al.'s work builds on earlier work by Gafos (1999) exploring the
articulatory basis of harmony systems. Drawing on both
articulatory phonetic data and typological observations about harmony, Gafos hypothesizes that
spreading in harmony systems is local rather than long-distance (see also Ní Chiosáin and Padgett,
1997 for a similar view). Under his view, segments that superficially appear to be transparent in the
harmony system are actually articulated differently depending on the harmonic environment. Thus, the
neutral vowels /i, e/ are claimed to be backer in back vowel environments than in front vowel
environments, but crucially the effect of this articulatory backing is not substantial enough to be
perceptible. He finds support for this view in Boyce's (1988) phonetic study of coarticulation and
rounding harmony in Turkish, whereby high vowels agree in rounding with the preceding vowel in a word
(see Rounding Harmony, below). In an electromyographic study of muscle activity, Boyce finds that the
orbicularis oris muscle, which is responsible for lip rounding, remains contracted during a non-labial
consonant intervening between two rounded vowels for Turkish speakers. This is consistent with Gafos's
view that harmony is a local spreading process in which no segments transparently allow a propagating
feature to spread through them while being unaffected themselves.
ATR Harmony

Hess (1992) is another study that uses phonetic data to test claims about the phonological properties
of harmony. Her study focuses on ATR vowel harmony in Akan, a Kwa language of Ghana. In Akan, vowels
other than the low vowel /a/ come in pairs differing in tongue root position. The advanced tongue root
(+ATR) vowels /i, e, u, o/ are associated with an expanded pharyngeal cavity relative to their
retracted tongue root (−ATR) counterparts /ɪ, ɛ, ʊ, ɔ/. This expansion of the pharyngeal space
associated with the +ATR vowels is achieved primarily by advancing the posterior portion of the tongue
(and also by adopting a lowered larynx position relative to that associated with −ATR vowels). Hess
uses phonetic data to test two competing analyses of vowel harmony. According to one analysis, that
adopted by Dolphyne (1988), vowel harmony is a categorical process whereby a −ATR vowel becomes +ATR
when the vowel in the immediately following syllable is +ATR. For example, the −ATR vowel in the second
syllable of the isolation form frɛ 'call' turns into a +ATR vowel when followed by a +ATR vowel in the
sentence frɛ Kofi. The competing analysis (Clements, 1981) treats vowel harmony in Akan as a gradient
assimilation in vowel height, whereby vowels are raised when followed by a +ATR vowel. In this account,
raising gradiently propagates over all vowels preceding the trigger vowel, such that raising is
greatest in the vowel immediately preceding the trigger and gradually decreases in magnitude the farther
the target vowel is from the trigger.
As a starting point in her study, Hess identifies the
most reliable correlates of the feature ATR. She
explores several potential indicators of ATR, including formant frequency, formant bandwidth, vowel
duration, and the relative amplitude of the fundamental and the second harmonic. Hess finds the bandwidth of the first formant to be the most robust
correlate of the ATR feature: +ATR vowels have narrower bandwidth values than their −ATR counterparts.
(She also finds that +ATR vowels have lower first formant frequency values, but this difference is
consistent not only with a difference in tongue root advancement but also with a difference in height
of the tongue body.) Applying first formant bandwidth as a diagnostic of ATR, Hess then examines vowels
preceding a +ATR trigger of vowel harmony in
order to test whether harmony involves spreading of
height or ATR features and whether it propagates
leftward across multiple vowels or is limited to the
vowel immediately preceding the trigger vowel. As
predicted by Dolphyne, Hess finds that only the immediately preceding vowel is affected by harmony.
Furthermore, although the lowering of the first formant frequency in the target vowel is consistent
with the height-based analysis of Akan harmony, the
decrease in the bandwidth of the first formant is more
consistent with ATR harmony than height harmony.
Most of the existing phonetic data on harmony
comes from languages where harmony is a firmly
entrenched phonological process. However, Przezdziecki's (2000) phonetic study of ATR harmony in
Yoruba provides some insight into the development of harmony systems. In this study, he tests Ohala's
hypothesis that harmony arises from simple coarticulation effects against data from three dialects of
Yoruba differing in the productivity of their ATR harmony systems. In the Akure dialect, ATR vowel
harmony is a productive process that creates alternations in the third person singular pronominal
prefixes. Before a +ATR vowel, i.e., one of /i, e, u, o/, the prefix is realized as a +ATR mid back
rounded vowel, e.g., o kú 's/he died', o rúlé 's/he saw the house', whereas the prefix surfaces as a
−ATR vowel before a −ATR vowel /ɪ, ɛ, ʊ, ɔ/, e.g., ɔ lɔ 's/he went', ɔ rúgbá 's/he saw the calabash'.
The Moba dialect
also has prefixal ATR vowel harmony, but the high
vowels do not participate in the alternations either as
triggers or as targets of harmony. Finally, Standard
Yoruba lacks prefixal alternations entirely, though it
has static co-occurrence restrictions on ATR within
words. Przezdziecki explores the hypothesis that the
fully productive alternations affecting the high

vowels in the Akure dialect will also be observed as smaller coarticulation effects for the high vowels
in the other two dialects. Taking the first formant frequency as the primary correlate of ATR harmony
in Yoruba, he measures the first formant for the high, mid, and low vowels in two contexts, before a
+ATR mid vowel and before a −ATR mid vowel, in the three dialects. As expected, for the Akure speakers,
both the high and mid vowels differ markedly in their first formant values between the +ATR and −ATR
contexts: vowels in the −ATR context have much higher F1 values than vowels in the +ATR context. Low
vowels do not reliably differ in F1 between the two environments. In the other two dialects, where
harmony does not target high vowels, the high vowels show F1 differences going in the same direction
as the Akure data (higher F1 values before −ATR
vowels), but the magnitude of these differences is
much smaller than in Akure. These results are consistent with the view that phonological harmony
arises as a phonetic coarticulation effect that becomes sufficiently large to develop into a categorical
alternation.
Height Harmony

Phonetic data have also been used to test claims about the existence of vowel harmony in a particular
language. Kockaert's (1997) acoustic study attempts to experimentally verify the system of height
harmony reported for siSwati (Swati). According to Kockaert, siSwati mid vowels are reported by several
researchers to have two realizations in the penultimate syllable: a relatively high allophone /e, o/
when the final vowel is one of the high vowels /i, u/, and a lowered allophone /ɛ, ɔ/ when the
following vowel is non-high. Contra these reports of harmony, Kockaert
finds that first formant values, the formant correlated
with vowel height, fail to support the hypothesized
variation in vowel height in the penultimate syllable.
Rounding Harmony

Kaun (1995) pursues a perceptually driven account of vowel harmony involving rounding. As a starting
point in her investigation, she observes a number of recurring cross-linguistic patterns found in
rounding harmony systems. First, she finds that rounding harmony is most favored among high vowel
targets. We thus find languages like Turkish, in which only high vowels alternate in rounding as a
function of the rounding in the preceding vowel. For example, the 1st person possessive suffix has four
variants, -im/-üm/-um/-ım, where the choice of allomorph is conditioned by the frontness/backness and
rounding of the root vowels: ipim 'my rope', kızım 'my girl', sütüm 'my milk', buzum 'my ice'. The
non-high vowel
dative suffix, on the other hand, has only two allomorphs, -e/-a, which differ in frontness and not
rounding: ipe 'rope (dative)', süte 'milk (dative)', kıza 'girl (dative)', buza 'ice (dative)'.
Conversely, the typology
indicates that rounding harmony is favored when the
triggering vowel is non-high. Thus, there are languages in which rounding harmony is unrestricted
(e.g., many varieties of Kirgiz [Kirghiz]): high vowels
and non-high vowels trigger rounding harmony in
both high and non-high vowels. There are also languages in which rounding harmony is triggered in
high vowels by both high and non-high vowels (e.g.,
Turkish), and languages in which rounding harmony
only occurs if the trigger and target are both
non-high (e.g., Tungusic languages). We do not find
any languages, however, in which only high vowels
but not non-high vowels trigger harmony in both high
and non-high vowels. Kaun also finds that rounding
harmony is more likely when the trigger and target
vowels agree in height, i.e., either both are high
vowels or both are non-high vowels. Thus, in Kachin
Khakass, both the trigger and target must be high
vowels. Finally, rounding harmony is more prevalently triggered by front vowels. Thus, in Kazakh, rounding harmony in high suffixal vowels is triggered by
both front and back vowels, e.g., köl-dü 'lake (accusative)', kol-do 'servant (accusative)'. For
non-high suffixal vowels, however, rounding harmony is only triggered by front vowels, e.g., köl-dö
'lake (locative)', kol-da 'servant (locative)'.
Kaun attempts to explain these typological asymmetries in perceptual terms. Following Suomi's account of front/back vowel harmony, Kaun suggests
that rounding harmony reflects an attempt to reduce
the burden of perceiving subtle contrasts in rounding.
By extending a feature over several vowels, in this
case rounding, the listener will be better able to perceive that feature and also will not have to attend
to the rounding feature once it is correctly identified
the first time. Rounding, like frontness/backness,
primarily affects the second formant, which as we
saw earlier, is perceptually less salient than the first
formant.
Kaun draws on Linker's (1982) articulatory study of lip rounding and Terbeek's (1977) perceptual study
of rounded vowels to explain the typological asymmetries in rounding harmony based on backness
and vowel height. Linkers work shows that rounded
vowels can be differentiated in their lip positions
(expressed in terms of lip opening and protrusion)
and their concomitant degree of rounding. Among
the set of rounded vowels, she finds that high rounded
vowels are characteristically more rounded than nonhigh rounded vowels and that back rounded vowels
are more rounded than their front counterparts.

Terbeek's study indicates a perceptual correlate of
these articulatory differences in rounding: high
rounded vowels are perceived as more rounded than
non-high rounded vowels and back rounded vowels
are perceived as more rounded than front rounded
vowels. Kaun suggests that the lesser perceptibility of
non-high and front rounded vowels makes them more
likely to spread their rounding features to other
vowels in order to enhance identification of rounding. Kaun attributes the bias for rounding in high
vowels to the synergistic relationship between lip
rounding and the higher jaw position associated
with high vowels. Rounded vowels are associated
with increased lip protrusion which is achieved in
large part by decreasing the vertical opening between the lips; the decreased vertical opening is aided
by a higher jaw position. The final cross-linguistic
tendency in rounding harmony, the requirement in
many languages that trigger and target agree in
height, is attributed by Kaun to a preference for
uniform articulatory gestures associated with a given
phonological feature. High and non-high vowels
achieve their rounding through different articulatory
strategies: non-high vowels rely more on lip protrusion than high vowels, for which rounding is associated with both an approximation of the lips and
protrusion.

Consonant Harmony
Nasal Harmony

Researchers have also offered phonetic accounts of
harmony systems involving the spreading of nasality
to both consonants and vowels. Boersma (2003) suggests a dichotomy in nasal harmony systems. One
type of nasal harmony, he suggests, has an articulatory basis, while the other type of nasal harmony is
perceptually driven. The articulatory nasal harmony
entails spreading of nasality from a nasal consonant
rightward until spreading is blocked by a segment
that is incompatible with nasality. For example, in
Malay, nasality spreads rightward through the glide in mãjãŋ 'stalk' but is blocked by the oral plosive
in mãkan 'eat'. Crucially, because the spreading nasality
is attributed to a single velum opening gesture, nasality fails to skip over segments whose identity would
be altered too much by nasality. This explains the
asymmetric behavior of oral plosives, which block
nasal spreading, and glottal stop, which does not.
An oral plosive would become a nasal if the velum
were lowered during its production. Glottal stop, on
the other hand, can be produced with an open velum,
since the closure for the glottal stop is lower in the
oral tract than the velum and thus does not allow for
nasal airflow. Consonants for which the acoustic effect of nasality is intermediate in strength, e.g.,
liquids, may or may not block nasal harmony depending on the language. In fact, Cohn's (1993) study of
airflow in Sundanese suggests a distinction between
sounds that completely inhibit nasal spreading, such
as stops, and those that are partially nasalized due to
interpolation in nasal air flow between a preceding
phonologically nasalized sound and a following phonologically oral sound. Cohn argues that these transitional segments, which include glides and laterals in
Sundanese, are phonologically unspecified for the
nasal feature, unlike true blockers of nasal spreading,
which are phonologically marked as [-nasal].
In contrast to languages in which nasal harmony is
sensitive to articulatory compatibility, in languages
possessing auditory nasal harmony, there is no strict
requirement that nasality be produced by a single
velum opening gesture. For this reason, nasal harmony of the auditory type is not blocked by oral plosives.
Although the oral plosive cannot be articulated with
an open velum, it still can allow spreading of nasality
through it to an adjacent segment compatible with
the nasal feature. Auditory nasal harmony thus
reflects an attempt to expand the perceptual scope
of nasality, even if this entails producing multiple
velum opening gestures.
Palatal Harmony

Recent work by Nevins and Vaux (2004) has investigated the phonetic properties of transparent segments
in the consonant harmony system of the Turkic language Karaim. In Karaim, the feature of backness/
frontness spreads within phonological words, as in
Finnish and Hungarian, but unlike Finnish and Hungarian, it is consonants rather than vowels that agree
in backness. Most consonants in Karaim occur in
pairs characterized by the same primary constriction
but differing in whether they are associated with secondary palatalization. If the first consonant of the
root has a palatalized secondary articulation, palatalization spreads rightward to other consonants in the
word, including consonants in the root and suffixal
consonants. If the first consonant of the root lacks
secondary palatalization, other consonants in the
word also are non-palatalized. Palatal harmony leads
to suffixal alternations. For example, the ablative
suffix has two variants, -dan and -dʲänʲ, the first of which occurs after roots containing
non-palatalized consonants, e.g., suvdan 'water (ablative)', the second of which is used with roots
containing palatalized consonants, e.g., kʲünʲdʲänʲ 'day (ablative)'. Crucially, descriptions of
palatal harmony in primary sources
suggest that it is a property only of consonants and
not of vowels, meaning that back vowels remain back

even if surrounded by palatalized consonants. In order
to test this prediction, Nevins and Vaux conduct a
spectrographic comparison between back vowels surrounded by palatalized consonants and back vowels
surrounded by non-palatalized consonants. Taking
the second formant as the primary correlate of backness in vowels, they find no consistent difference in
backness between back vowels in the two contexts,
suggesting that front/back consonant harmony mirrors front/back vowel harmony in being a non-local
process. They do find, however, that vowels occurring in contexts associated with phonetic shortening
are potentially fronter when adjacent to palatalized
consonants. They attribute this effect to coarticulation rather than participation of vowels in the harmony system: phonetically shorter vowels have less
time to reach their canonical back target positions.
Other Long-Distance Consonant Harmony Effects

Consonant harmony encompasses many other assorted types of long distance assimilation processes,
whose phonetic underpinnings may not be uniform.
Drawing a parallel to his account of vowel harmony,
Gafos argues that consonant harmony systems also
involve local assimilatory spreading propagating over
relatively large domains. His cross-linguistic typology of consonant harmony indicates that many cases
of consonant harmony entail spreading of coronal
gestures involving the tongue tip and/or blade, e.g.,
the Chumash case discussed in the Introduction. Because the part of the tongue involved in coronal harmony can be manipulated largely independently of
the tongue body, which is the relevant articulator for
vowels, coronal gestures associated with consonants
may persist through an intervening vowel without
noticeably affecting the vowel.
Not all functional explanations for consonant harmony are purely phonetic in nature, however, although they all rely on a basic notion of phonetic
similarity mediated by phonological features. Hansson (2001a, 2001b) and Walker (2003) discuss consonant harmony systems of various types (e.g.,
nasality, voicing, stricture, dorsal features, secondary
articulations) that may not be best explained in terms
of local spreading of a feature. Hansson and Walker
argue that speech planning factors might account for
certain consonant harmony effects which are truly
long distance.
Building on work by Bakovic (2000) on vowel
harmony, Hansson (2001a) observes a strong tendency for consonant harmony either to involve assimilation of an affix to a stem or to involve anticipatory
assimilation of a stem to a suffix. Crucially, consonant harmony systems in which a stem assimilates to
a prefix appear to be absent.
Hansson finds parallels to this asymmetry in both
child language acquisition and also speech error data.
Hansson reports results from Vihman (1978) showing
a strong bias toward anticipatory consonant harmony in child language. Furthermore, speech errors also
display a strong anticipatory effect (Schwartz et al.,
1994, Dell et al., 1997), suggesting that errors result
from the articulatory influence of a planned consonant on the production of an earlier consonant
(Dell et al., 1997). Hansson suggests that consonant harmony stems from the same speech planning
mechanisms underlying the anticipatory bias in child
language and adult speech errors.
Hansson also offers a speech planning explanation
for another interesting typological observation he
makes about consonant harmony. He finds that coronal harmony systems of the Chumash type involving
an alternation between anterior and posterior fricatives follow two patterns. In some languages,
bidirectional alternations are observed; thus /ʃ/ can become /s/ and /s/ can become /ʃ/ under
appropriate triggering contexts. In other languages, coronal harmony is asymmetric, involving a change
from /s/ to /ʃ/. Almost completely absent are languages that asymmetrically change /ʃ/ to /s/ but not
vice versa. Hansson points
out that this asymmetry has an analog in speech error
data from Shattuck-Hufnagel and Klatt (1979)
showing that alveolars such as /s, t/ tend to be
replaced by palatals such as /ʃ, tʃ/, respectively,
much more often than vice versa. This parallel in the
directionality of harmony in speech error data offers
support for the view that at least certain types of
consonant harmony systems are driven by speech
planning considerations.
Walker (2003) offers direct experimental evidence
that consonant harmony is motivated by speech
planning mechanisms. A survey of long distance nasal
harmony over intervening vowels (Rose and Walker,
2003, Walker, 2003) indicates that harmony in many
languages is subject to a requirement that the target
and trigger are homorganic. For example, in Ganda
(Katamba and Hyman, 1991), roots of the shape CV(V)C may not contain a nasal stop and a homorganic
oral voiced stop or approximant. For example, roots like nonà 'fetch, go for' and gùga 'curry favor
with' occur, but roots like *gùŋa or *nodà do not. Roots may, however, contain heterorganic consonants
differing in nasality, e.g., bonèka 'become visible'. Walker also
observes that in certain languages, e.g., Kikongo
[Kituba] (Ao, 1991), voiceless stops are transparent to
harmony, neither undergoing it nor blocking it.
Walker sets out to explore the potential psycholinguistic basis for the sensitivity of nasal harmony to
homorganicity and voicing using a speech error
inducing technique in which listeners are asked to read pairs of monosyllabic words differing in the
initial consonant (e.g., pat, mass) after being primed
with other pairs of words with reversed initial consonants (e.g., mad, pack). In keeping with the typological observations about nasal harmony, Walker
finds that more errors (e.g., mat, pass; mat, mass;
pat, pass) occur when the two consonants are homorganic and when they disagree in voicing. On the basis
of these results, Walker concludes that consonant
harmony has a functional basis in terms of on-line
speech production considerations.

Conclusions
In summary, phonetic research has shed light on a
number of issues relevant to the study of harmony
systems. Evidence suggests that many types of harmony systems have a phonetic basis as natural coarticulation effects that eventually develop into categorical
phonological alternations and static constraints on
word and/or morpheme structure. The desire to increase the perceptual salience of certain features may
also play a role in harmony systems. Harmony processes that may not be driven strictly by phonetic
factors may be attributed to on-line speech production mechanisms that also underlie speech errors
found in natural and experimental settings. Phonetic
data also provide insights into the proper phonological treatment of harmony by exploring issues such as
the phonetic realization of neutral segments, the
acoustic correlates of harmony, and the local versus
non-local nature of assimilation.
See also: Harmony.

Bibliography
Ao B (1991). Kikongo nasal harmony and context-sensitive underspecification. Linguistic Inquiry 22,
193196.
Bakovic E (2000). Harmony, dominance and control.
Ph.D. diss., Rutgers University.
Beeler M (1970). Sibilant harmony in Chumash. International Journal of American Linguistics 36, 1417.
Benus S, Gafos A & Goldstein L (2003). Phonetics
and phonology of transparent vowels in Hungarian.
Berkeley Linguistics Society 29, 485497.
Boersma P (2003). Nasal harmony in functional phonology. In Van de Weijer J, van Heuven V & van der Hulst H
(eds.) The phonological spectrum, vol. 1: segmental
structure. Philadelphia: John Benjamins. 336.
Boyce S (1988). The influence of phonological structure on
articulatory organization in Turkish and in English:
vowel harmony and coarticulation. Ph.D. diss., Yale
University.
Clements G N (1981). Akan vowel harmony: a nonlinear
analysis. Harvard Journal of Phonology 2, 108177.

Cohn A (1993). Nasalization in English: phonology or
phonetics. Phonology 10, 4381.
Dell G, Burger L & Svec W (1997). Language production
and serial order: a functional analysis and a model.
Psychological Review 104, 123147.
Dolphyne F (1988). The Akan (Twi-Fante) language:
its sound systems and tonal structure. Accra: Ghana
University Press.
Flanagan J (1955). A difference limen for vowel formant
frequency. Journal of the Acoustical Society of America
27, 613617.
Gafos A (1999). The articulatory basis of locality in
phonology. New York: Garland.
Gafos A & Benus S (2003). On neutral vowels in Hungarian. In Proceedings of the 15th International Congress of
Phonetic Sciences. 7780.
Gordon M (1999). The neutral vowels of Finnish: how
neutral are they? Linguistica Uralica 35, 1721.
Hansson G (2001a). The phonologization of production
constraints: evidence from consonant harmony. In
Chicago Linguistics Society 37: The Main Session.
187200.
Hansson G (2001b). Theoretical and typological issues
on consonant harmony. Ph.D. diss., UC Berkeley.
Hess S (1992). Assimilatory effects in a vowel harmony
system: an acoustic analysis of advanced tongue root
in Akan. Journal of Phonetics 20, 475492.
Katamba F & Hyman L (1991). Nasality and morpheme
structure constraints in Luganda. In Katamba F (ed.)
Lacustrine Bantu phonology [Afrikanistische Arbeitspapiere 25]. Köln: Institut für Afrikanistik,
Universität zu Köln. 175–211.
Kaun A (1995). The typology of rounding harmony:
an optimality theoretic approach. [UCLA Dissertations
in Linguistics 8.] Los Angeles: UCLA Department of
Linguistics.
Kockaert H (1997). Vowel harmony in siSwati: an experimental study of raised and non-raised vowels. Journal
of African Languages and Linguistics 18, 139156.
Linker W (1982). Articulatory and acoustic correlates of
labial activity in vowels: a cross-linguistic study. Ph.D.
diss., UCLA. [UCLA Working Papers in Phonetics 56.].

Nevins A & Vaux B (2004). Consonant harmony in Karaim. In Proceedings of the Workshop on Altaic in Formal
Linguistics [MIT Working Papers in Linguistics 46].
Ní Chiosáin M & Padgett J (1997). Markedness, segment realization, and locality in spreading. [Report
no. LRC-97-01.] Santa Cruz, CA: Linguistics Research Center,
University of California, Santa Cruz.
Ohala J (1994). Towards a universal, phonetically-based,
theory of vowel harmony. In 1994 Proceedings of the
International Congress on Spoken Language Processing.
491494.
Öhman S (1966). Coarticulation in VCV utterances: spectrographic measurements. Journal of the
Acoustical Society of America 39, 151–168.
Przezdziecki M (2000). Vowel-to-vowel coarticulation in
Yorùbá: the seeds of ATR vowel harmony. West Coast
Conference on Formal Linguistics 19, 385398.
Rose S & Walker R (2003). A typology of consonant agreement at a distance. Manuscript. University of Southern
California and University of California, San Diego.
Schwartz M, Saffran E, Bloch D E & Dell G (1994). Disordered speech production in aphasic and normal speakers. Brain and Language 47, 5288.
Shattuck-Hufnagel S & Klatt D (1979). The limited use of
distinctive features and markedness in speech production: evidence from speech error data. Journal of Verbal
Learning and Verbal Behaviour 18, 4155.
Suomi K (1983). Palatal vowel harmony: a perceptually motivated phenomenon? Nordic Journal of
Linguistics 6, 1–35.
Terbeek D (1977). A cross-language multidimensional scaling study of vowel perception. Ph.D. diss., UCLA. [UCLA
Working Papers in Phonetics 37.]
Vihman M (1978). Consonant harmony: its scope and
function in child language. In Greenberg J, Ferguson C
& Moravcsik E (eds.) Universals of human language,
vol. 2: phonology. Palo Alto: Stanford University Press.
281334.
Walker R (2003). Nasal and oral consonantal similarity in
speech errors: exploring parallels with long-distance
nasal agreement. Manuscript. University of Southern
California.

Phonetics, Articulatory
J C Catford, University of Michigan, Ann Arbor,
MI, USA
J H Esling, University of Victoria, Victoria, British
Columbia, Canada
© 2006 Elsevier Ltd. All rights reserved.

Articulatory phonetics is the name commonly applied to traditional phonetic theory and taxonomy, as
opposed to acoustic phonetics, aerodynamic phonetics, instrumental phonetics, and so on. Strictly

speaking, articulation is only one (though a very important one) of several components of the production of speech. In phonetic theory, speech sounds,
which are identified auditorily, are mapped against
articulations of the speech mechanism.
In what follows, a model of the speech mechanism
that underlies articulatory phonetic taxonomy is first
outlined, followed by a description of the actual classification of sounds and some concluding remarks.
The phonetic symbols used throughout are those
of the International Phonetic Association (IPA) as

& Trim J L M (eds.) In honour of Daniel Jones. London:
Longmans. 2637.
Catford J C (1977). Fundamental problems in phonetics.
Edinburgh: Edinburgh University Press/Bloomington, IN:
Indiana University Press.
Catford J C (1981). Observations on the recent history of
vowel classification. In Asher R E & Henderson E J A
(eds.) Towards a history of phonetics. Edinburgh:
Edinburgh University Press. 1932.
Catford J C (1988). Notes on the phonetics of Nias. Studies
in Austronesian linguistics. SE Asia Series 76. Athens,
OH: Ohio University Center for SE Asia Studies.
Catford J C (2001). A practical introduction to phonetics
(2nd edn.). Oxford: Oxford University Press.
Esling J H (1996). Pharyngeal consonants and the aryepiglottic sphincter. Journal of the International Phonetic
Association 26, 6588.
Esling J H (1999). The IPA categories pharyngeal and
epiglottal: laryngoscopic observations of pharyngeal
articulations and larynx height. Language & Speech
42, 349372.
Esling J H & Edmondson J A (2002). The laryngeal sphincter as an articulator: tenseness, tongue root and phonation in Yi and Bai. In Braun A & Masthoff H R (eds.)
Phonetics and its applications: Festschrift for Jens-Peter
Köster on the occasion of his 60th birthday. Stuttgart: Franz Steiner Verlag. 38–51.
Esling J H & Harris J G (2005). States of the glottis: an
articulatory phonetic model based on laryngoscopic
observations. In Hardcastle W J & Beck J (eds.) A figure
of speech: a Festschrift for John Laver. Mahwah, NJ:
Lawrence Erlbaum Associates. 347383.

IPA (1999). Handbook of the International Phonetic


Association: a guide to the use of the International
Phonetic Alphabet. Cambridge: Cambridge University
Press.
Jones D (1922). An outline of English phonetics (2nd edn.).
Cambridge: W. Heffer & Sons.
Ladefoged P (1975). A course in phonetics. New York:
Harcourt Brace Jovanovich.
Ladefoged P & Maddieson I (1996). The sounds of the
world's languages. Oxford: Blackwell.
Ladefoged P, Cochran A & Disner S F (1977). Laterals and
trills. JIPA 7(2), 4654.
Laver J (1980). The phonetic description of voice quality.
Cambridge: Cambridge University Press.
Maddieson I (1984). Patterns of sounds. Cambridge:
Cambridge University Press.
Ohala J J (1990). Respiratory activity in speech. In
Hardcastle W J & Marchal A (eds.) Speech production
and speech modelling. Dordrecht: Kluwer.
Peterson G E & Shoup J E (1966). A physiological theory
of phonetics. Journal of Speech and Hearing Research 9,
567.
Pike K L (1943). Phonetics: A critical analysis of phonetic
theory and a technic for the practical description of
sounds. Ann Arbor: University of Michigan Press.
Shadle C H (1990). Articulatory-acoustic relationships in
fricative consonants. In Hardcastle W J & Marchal A
(eds.) Speech production and speech modelling. Dordrecht:
Kluwer.
Sweet H (1877). A handbook of phonetics. Oxford:
Clarendon Press.

Phonetics, Acoustic
C H Shadle, Haskins Laboratories,
New Haven, CT, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction
Phonetics is the study of characteristics of human
sound-making, especially speech sounds, and includes
methods for description, classification, and transcription of those sounds. Acoustic phonetics is focused
on the physical properties of speech sounds, as transmitted between mouth and ear (Crystal, 1991); this
definition relegates transmission of speech sounds
from microphone to computer to the domain of instrumental phonetics, and yet, in studying acoustic
phonetics, one needs to ensure that the speech itself,
and not artifacts of recording or processing, is being
studied. Thus, in this chapter we consider some of the
issues involved in recording, and especially in the
analysis of speech, as well as descriptions of speech sounds.
The speech signal itself has properties that make
such analysis difficult. It is nonstationary; analysis
generally proceeds by using short sections that are
assumed to be quasistationary, yet in some cases this
assumption is clearly violated, with transitions occurring within an analysis window of the desired length.
Speech can be quasiperiodic, or stochastic (noisy), or
a mixture of the two; it can contain transients. Each
of these signal descriptions requires a different type of
analysis. The dynamic range is large; for one speaker,
speaking at a particular level (e.g., raised voice), the
range may be 10 to 50 dB SPL (decibels Sound
Pressure Level) over the entire frequency range
(Beranek, 1954), but spontaneous speech may potentially range over 120 dB and still be comprehensible
by a human listener. Finally, the frequency range is
large, from 50 to 20 000 Hz. Though it is well known
that most of the information in speech occurs in the range of 300–3500 Hz (telephone bandwidth), if one
is trying to describe and classify speech sounds a
bigger range is needed.
Aspects of the recording method and the recording
environment can also introduce artifacts. Breath
noise can occur if the microphone is directly in front
of the speaker's lips; moving the microphone further
from the speaker can reduce breath noise, but then
the speech signal will have a lower amplitude at the
microphone, requiring a quiet recording room and
possibly a more sensitive microphone. The microphone can also be moved to one side, but then the
directional characteristics of speech must be considered. Higher frequencies are progressively more
directional, meaning that they are highest amplitude
on-axis (directly in front of the speaker's mouth) and decreasing in amplitude with angle off-axis.
For instance, in the band 5–10 kHz, at 60 degrees off-axis
the amplitude is 5 dB lower than at 0 degrees (on axis)
(Beranek, 1954). This difference may be important
for comparisons, across subjects and recording sessions, of parameters such as spectral tilt or formant
amplitude. Differences in microphone placement can
be corrected for as long as the location relative to the
speakers mouth (distance and angle) is known and
the microphone is in the acoustic far-field; if in the
near-field, more parameters such as the exact shape
and size of the lip opening are needed.
The acoustic far field is the region where the sound
pressure decreases linearly with distance from the
source. The distance r from the source at which
the far field begins depends on the source extent and
the frequencies of interest. For instance, for frequencies greater than or equal to 350 Hz, far field begins at
r ≈ 1 m, and the source could be as much as 16 cm
across (which is much larger than a typical lip opening, or about the size of a medium loudspeaker)
(Beranek, 1954). A far-field pressure can be used to
compute the equivalent source strength of the radiating surface (the air between the lips) and is, thus, important for studies in which source strength is derived
from the radiated acoustic signal, or when absolute
sound pressures measured at different locations need
to be compared.
Background noise is often a limiting factor in microphone placement. If it is 3 dB or more below the
signal, it can be corrected for (or, if 10 dB or more
below, ignored), but this must be true at all frequencies of interest. There can be a big amplitude difference between the peak of the first formant of a vowel
and the amplitude of a weak fricative at frequencies
above 10 kHz. Solutions are to reduce the background
noise by making recordings in sound-proofed, even
anechoic chambers, or to use directional microphones
that are more sensitive to sounds coming from their
front than their back. Directional microphones
work well at reducing background noise, but their
frequency characteristics tend to be much less flat
across all frequencies than those of omnidirectional
microphones. Another solution is to measure the
ambient noise, compare it to the signal plus noise,
and filter out the frequency bands where the noise
dominates the signal. This is commonly done for very
low frequencies (e.g., less than 20 Hz, or often to
eliminate mains hum at 50 or 60 Hz).
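As an illustration of this kind of cleanup (a sketch only; the sampling rate, filter orders, and
cutoffs below are assumptions, not values from the text), low-frequency rumble and mains hum can be
removed with standard filters:

import numpy as np
from scipy import signal

fs = 44100                          # assumed sampling rate (Hz)
t = np.arange(fs) / fs
speech = np.random.randn(fs)        # stand-in for one second of recorded speech
noisy = speech + 0.5 * np.sin(2 * np.pi * 60 * t)   # add simulated 60 Hz mains hum

# High-pass filter removing energy below 20 Hz (rumble, very low-frequency noise).
b_hp, a_hp = signal.butter(4, 20, btype='highpass', fs=fs)
# Narrow notch centred on the 60 Hz hum; Q controls how narrow the notch is.
b_n, a_n = signal.iirnotch(60, Q=30, fs=fs)

cleaned = signal.filtfilt(b_n, a_n, signal.filtfilt(b_hp, a_hp, noisy))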
If it is important to know the absolute sound level
of a speech signal, and keep that information intact
for every kind of analysis, a calibration signal needs
to be recorded as part of the original recording session and put through the same stages (amplification,
filtering, sampling, analysis) as the speech. Whatever
factor is needed to return the calibration signal to its
known level can then be applied to the speech signal.
If this is desirable, the microphone and amplifier
should be of instrumentation quality, and there must
not be any automatic gain control applied. This is
important if one needs to compare sound levels across
speakers and recording sessions.
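A minimal sketch of that scale-factor logic, under the assumption (hypothetical, not stated in the
text) that the calibration signal was a tone recorded at a known level of 94 dB SPL, i.e., 1 Pa RMS:

import numpy as np

P_REF = 20e-6               # reference pressure for dB SPL, in pascals
CAL_RMS_PA = 1.0            # assumed known level of the calibration tone (94 dB SPL)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def to_pascals(speech, cal_tone):
    """Rescale recorded samples to pascals, given a calibration tone that went
    through the same amplification, filtering, and sampling chain as the speech."""
    scale = CAL_RMS_PA / rms(cal_tone)   # factor that restores the known level
    return speech * scale

def level_db_spl(x_pa):
    return 20 * np.log10(rms(x_pa) / P_REF)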

Signal Preprocessing
While preprocessing is a relative term, it tends
to be used for processes that are applied to every
signal in a given system before the elective processes.
Thus, amplification (which may have more than one
stage), filtering to remove low-frequency noise, antialiasing filtering, sampling, and preemphasis tend to be
common preprocessing stages. They are best understood as changes to the spectrum of the signal. Some
of the changes are reversible, such as amplification and
preemphasis; some are not, because a part of the original signal is permanently lost, as in high-pass (e.g., to
remove low-frequency noise) or low-pass (e.g., antialiasing) filtering. Sampling is reversible, provided a
suitable antialiasing filter has been used first. Theoretically, the filter should remove all frequencies greater
than half the sampling rate, that is, the cut-off frequency of the filter fco fs =2. In practice, no real filter
can cut off abruptly, so the cut-off frequency should be
set somewhat lower than fs/2; how much lower will
depend on the characteristics of the filter.
If the signal being sampled includes frequencies
that are greater than fs/2, whether because antialiasing was not done or the cutoff was too high,
they will be aliased to lower frequencies. Thus, a
6 kHz component in a signal sampled at 10 kHz
will appear as energy at 4 kHz, adding to whatever
energy originally occurred at 4 kHz. In general,
an aliased signal cannot be unscrambled. The
anti-aliasing needs to be done for every sampling
stage, whether the original sampling to convert a
continuous-time signal to a discrete-time (sampled)
signal, or a later downsampling to lower the sampling
rate of a discrete-time signal (McClellan et al., 1998).
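The aliasing arithmetic can be checked directly; the sketch below (illustrative, not from the text)
samples a 6 kHz sinusoid at 10 kHz without an antialiasing filter and shows that the spectral peak
lands near 4 kHz, i.e., at fs − f:

import numpy as np

fs = 10_000                          # sampling rate (Hz)
f = 6_000                            # component above fs/2
n = np.arange(1024)
x = np.sin(2 * np.pi * f * n / fs)   # sampled without antialiasing

spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(freqs[np.argmax(spectrum)])    # approximately 4000 Hz: the alias of 6 kHz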
In general, one should use the highest sampling rate
likely ever to be needed for that signal and apply
antialiasing for that fs; this will, of course, generate
the largest number of samples and, therefore, largest
file sizes, so for particular parts of the analysis where
such high time resolution is not needed, the signal can
be refiltered and downsampled. Systems that sample,
such as DAT recorders and sound cards, now often
have antialiasing filters built in; analysis software,
such as MATLAB, will not necessarily perform this
step automatically.
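For example (a sketch; the rates are assumptions), a signal recorded at 44.1 kHz can be refiltered and
downsampled by a factor of four, with the antialiasing filter applied as part of the resampling
routine:

import numpy as np
from scipy import signal

fs_orig = 44100
x = np.random.randn(fs_orig)        # stand-in for one second of speech at 44.1 kHz

# decimate() applies an antialiasing low-pass filter before keeping every 4th
# sample, giving an effective sampling rate of 11 025 Hz.
y = signal.decimate(x, 4, ftype='fir', zero_phase=True)
fs_new = fs_orig // 4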
Preemphasis was originally devised to make optimal use of the small dynamic range of analog tape.
A speech spectrum tends to fall off with frequency;
that is, amplitudes are lower at higher frequencies.
The pre-emphasis filter tilts upward smoothly and
thus flattens out the speech spectrum while leaving
its important peaks (such as formants and harmonics)
intact relative to each other. This is still useful before
computing a spectrogram, since the upper frequencies
will show up better if they have been boosted in
amplitude. Since it is a reversible operation and simple to describe, there is no reason not to do it, but it is
important to remember when it has and has not been
applied to aid comparisons.
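A common realization of pre-emphasis is a first-order difference filter; the coefficient used below
(0.97) is a conventional choice, not a value given in the text:

import numpy as np

def preemphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: tilts the spectrum upward at high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def deemphasize(y, alpha=0.97):
    """Inverse filter, showing that the operation is reversible."""
    x = np.zeros_like(y, dtype=float)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + alpha * x[n - 1]
    return x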

Signal Analysis
The techniques used to analyze speech should be
appropriate to the local signal properties as well as
consistent with the aims of the analysis. The information that is desired is typically related to the type of
speech sound whether it is voiced or not, continuant or not, the place of constriction, and so on. We
will consider speech production models later; let
us first consider analysis methods in relation to the
properties of the signal.
Analysis of Periodic Signals

A perfectly periodic signal repeats exactly at some time interval T0 and so has a fundamental frequency
F0 = 1/T0. It may have harmonics, which occur at integral multiples of F0, i.e., 2F0, 3F0, and so
repeat exactly at T0/2, T0/3, . . ., respectively. There is no
noise; the signal is entirely deterministic.
In the real world, there is no such thing as a perfectly periodic signal. The closest equivalent in speech
is quasiperiodic, meaning that F0 changes over time
and has a small amount of noise. A typical example is
a vowel, with the fundamental and many harmonics.

We can look at and measure the time waveform, but if
we want to know the distribution of energy at the
frequency of each harmonic, we need to compute
some type of spectrum. The classic first step for such
a signal is the Discrete Fourier Transform (DFT). The
signal is multiplied by a window, and the DFT is
computed of the windowed signal. If we had a perfectly periodic signal, there would be no difference in
the result if we included exactly one period, or exactly
two, so we could think of the window as selecting
exactly one period to minimize the amount of computation. With a quasiperiodic signal, the window
can exclude parts of the signal in which F0 is very
different. The signal within the window is approximately stationary, and so taking a single DFT is
appropriate.
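The basic operation described here takes only a few lines; the sampling rate and 60-ms frame below are
assumptions, chosen to match the later discussion of Figure 1:

import numpy as np

fs = 10_000                              # assumed sampling rate (Hz)
frame = np.random.randn(600)             # stand-in for a 60-ms stretch of voiced speech

window = np.hanning(len(frame))          # taper the frame before transforming
spectrum = np.fft.rfft(frame * window)   # DFT of the windowed signal
freqs = np.fft.rfftfreq(len(frame), d=1 / fs)
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-12)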
The window length and shape are important. The
longer the window is, the finer the frequency resolution will be; the shorter it is, the coarser. In other
words, the resolution in time is inversely proportional
to the resolution in frequency. There is one wrinkle in
this simple statement, however; the frequency resolution depends not only on the window length, but
also on the number of points used to compute the
DFT. If we want to be able to see every harmonic
defined, we need fine resolution, perhaps 50 Hz
between points on the DFT. But then the time window
may be long enough for the signal to change properties somewhat; if so, the harmonics that are computed
will be an average of the different sets of true values
that occurred during the windowed signal.
How does the number of samples used to compute
the DFT, which we call NDFT, interact with the window length, and why would we ever want a NDFT to
be longer than the window, since all values outside
the window are zero by definition? The short answer
is that the number of points used to compute the DFT
actually controls the frequency resolution. The Fourier transform of a discrete-time signal is a continuous
function of frequency; the DFT samples that transform in the frequency domain, spreading NDFT points
evenly between −fs/2 and fs/2. This means that the bigger NDFT is (that is, the more samples used to
compute the DFT), the more tightly packed the samples are in the frequency domain and, thus, the finer
the frequency resolution. The technique is called zero-padding,
because the windowed signal is padded with zeros to
match the length of the DFT.
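In practice, zero-padding is simply a matter of requesting a longer transform than the window; a sketch
with arbitrary numbers:

import numpy as np

frame = np.random.randn(100) * np.hanning(100)   # a windowed 100-sample frame

coarse = np.fft.rfft(frame)             # NDFT = 100: frequency samples every fs/100
fine = np.fft.rfft(frame, n=1024)       # NDFT = 1024: same information, sampled on a
                                        # much denser frequency grid (zero-padded)
print(len(coarse), len(fine))           # 51 versus 513 frequency samples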
If the signal thus treated is perfectly periodic, it has
energy only at the harmonics of its fundamental frequency. Increasing the frequency resolution beyond
the harmonic spacing will not reveal anything else
since there is not any other energy to see. There is a
hazard, however; increasing resolution slightly beyond the harmonic spacing can mean that some of
the harmonics are missed. Increasing the resolution
well beyond the harmonic spacing is less problematic;
zeros between the harmonics will be revealed. If
the windowed signal is not perfectly periodic, energy
will exist between harmonics, and a longer NDFT will
define the shape of the transform of the samples
occurring within the window more accurately. Zeropadding to use a longer NDFT does not provide any
more information about the properties of the signal,
but does allow what is there to be seen better.
Finally, using a longer NDFT does incur a computation cost, since increasing NDFT increases the
number of operations required to compute the DFT.
The Fast Fourier Transform is an algorithm developed to compute the DFT efficiently; if NDFT is a
power of two (e.g., 64, 128) the computation will be
faster. However, a 1024-point DFT will still take longer to compute than 128 or 512 points, and so NDFT
should always be justified in terms of the signal properties and the information sought by the analysis. We
will return to this subject in the next section.
There are many window shapes, starting with the
rectangular window, which weights every sample
equally and cuts off to zero abruptly, and progressing
to the gradually tapered windows typically used in
speech analysis, the Hanning and Hamming. Since
they are tapered at each end, there is no abrupt
change in amplitude, which could create an artifact
of seeming noise in the signal. They also have better
properties in the frequency domain than does the
rectangular window, minimizing the amount of leakage of one spectral component into neighboring components. Figure 1 contrasts two Hanning window
lengths used to analyze the same signal to produce
DFTs and (as discussed below) LPC spectral envelopes. Figure 1A uses a window of 60 ms; every
harmonic is clearly shown. Figure 1B uses a window
of 10 ms. The major peaks are still visible, at approximately 250, 2600, and 3800 Hz, but the rest of the
spectrum has been flattened. The peak at 250 Hz is
wider, because it includes the energy for two or three
harmonics, as we know from examining Figure 1A.
The DFT is plotted as amplitude, or log amplitude,
vs. frequency. The speech spectrogram is made up of a
sequence of DFTs, each computed for the same length
of windowed signal and plotted as frequency vs. time,
representing the spectral amplitude in greyscale.
A fine-grain effect is achieved by having a skip factor
that is much shorter than the window length; Olive
et al. (1993) specified that they used a 30-ms window
for their wideband spectrograms, skipping that window along 1 ms at a time. The narrowband spectrogram uses a bandwidth of 25–50 Hz and resolves every harmonic; the wideband spectrogram uses a bandwidth of 200–300 Hz and blurs the harmonics together, which shows the more widely separated formants better.
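A spectrogram of this kind can be sketched with SciPy as below; the window length (which sets the analysis bandwidth, shorter windows giving wider bandwidths) and the skip factor are free parameters, and the values chosen here are only illustrative.

    import numpy as np
    from scipy.signal import spectrogram

    fs = 16000
    x = np.random.randn(fs)                  # one second of noise as a placeholder for speech

    win_ms, skip_ms = 5, 1                   # short window -> wide analysis bandwidth; lengthen for narrowband
    nperseg = int(fs * win_ms / 1000)
    hop = int(fs * skip_ms / 1000)           # the skip factor, in samples

    freqs, times, sxx = spectrogram(x, fs=fs, window='hann',
                                    nperseg=nperseg, noverlap=nperseg - hop)
    sxx_db = 10 * np.log10(sxx + 1e-12)      # log amplitude vs. time and frequency, for greyscale plotting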
Linear predictive coding (LPC) is often used for a different type of spectral analysis of quasiperiodic sounds. LPC analysis consists of finding the best set of coefficients to predict each sample in a frame from a small number of preceding samples. The user chooses how many samples will be used and, thus, how many coefficients will be computed; this specifies the order of a polynomial. In the frequency domain, the order specifies how many poles there will be in what is known as the LPC spectral envelope. Two poles are needed for each peak in the envelope, plus another two for overall spectral tilt. Thus, if fs = 10 kHz and the order is 12, then the spectral envelope will be a smooth curve that captures the main five peaks of the DFT
spectrum. The peaks will often, but not always, line
up with the formants; two formants near each other
in frequency may be represented by one peak. Increasing the order to, say, 40 will allow 20 peaks to be
found in the same spectrum; depending on the actual
F0, these peaks may coincide with harmonics.
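As a minimal sketch of autocorrelation-method LPC (one common formulation; the article does not prescribe a particular algorithm, and the function and parameter names below are ours):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_envelope(frame, order, ndft=1024):
        """LPC spectral envelope from the autocorrelation (Yule-Walker) normal equations."""
        frame = frame * np.hanning(len(frame))
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # r[0], r[1], ...
        a = solve_toeplitz(r[:order], r[1:order + 1])                  # predictor coefficients
        A = np.concatenate(([1.0], -a))                                # A(z) = 1 - sum_k a_k z^(-k)
        return 1.0 / np.abs(np.fft.rfft(A, n=ndft))                    # envelope is |1/A(e^jw)|

    # rule of thumb from the text: about fs/1000 + 2 poles, e.g., order 12 at fs = 10 kHz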
Referring to Figure 1 again, we note that the
two LPC spectral envelopes are similar though not
identical. A minor peak around 3 kHz is more noticeable when the wider time window is used; in the range 5–8 kHz, three peaks are visible for the short
window, only two for the long window, and there
are other differences at higher frequencies. There is
little evidence that the signal has changed properties
substantially within the 60-ms window, or the harmonic peaks would be wider; therefore its spectral
representations are likely to be more accurate in this
case.
The cepstrum offers another way to compute a
spectral envelope. If you took a DFT and then immediately an inverse DFT (the IDFT), you would recover
the original time waveform. With the cepstrum, the
DFT is computed; then the log is taken, and then the
IDFT. The result is called the cepstrum, in a domain
that is not quite time, not quite frequency, and is
known as quefrency. It is best understood by thinking
of the log spectrum as if it were a time waveform. The
closely packed patterns, the harmonics, would represent evidence of high-frequency components if the
log spectrum were a time waveform; the wider-spaced patterns, the formants, would represent lower-frequency
components. These elements end up separated in the
cepstrum into what is referred to as high-time and
low-time components, respectively. If only the low-time components are selected and then the DFT is
taken, the result is essentially the spectral envelope
without the harmonic spikes. Unlike the LPC spectral
envelope, the cepstrum will fit the troughs as well as
the peaks of the spectrum. The only caveat is that the high- and low-time components must not overlap, which means that the process works well for low-F0 voices, and less well for higher F0 (Gold and Morgan, 2000).

Figure 1 Same speech signal is analyzed with two lengths of Hanning window and LPC. (A) Waveform of Don't feed that vicious hawk is shown on top, with cursors marking the 60-ms window in [i] of feed. DFT is lower right; LPC spectral envelope is lower left. (B) Same waveform is shown, with cursors marking a 10-ms window with the same starting point as in (A). DFT is lower right; LPC is lower left.
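The cepstral recipe just described can be sketched as follows; the 2-ms liftering cutoff is an arbitrary choice that separates the envelope from the harmonics only when F0 is low enough, which is exactly the caveat noted above.

    import numpy as np

    def cepstral_envelope(frame, fs, cutoff_ms=2.0, ndft=1024):
        """Spectral envelope via the real cepstrum: DFT -> log -> IDFT -> lifter -> DFT."""
        spec = np.fft.rfft(frame * np.hanning(len(frame)), n=ndft)
        log_mag = np.log(np.abs(spec) + 1e-12)
        cep = np.fft.irfft(log_mag, n=ndft)                  # the cepstrum, indexed by quefrency
        cutoff = int(fs * cutoff_ms / 1000)                  # keep only the low-time components
        lifter = np.zeros(ndft)
        lifter[:cutoff] = 1.0
        lifter[-cutoff + 1:] = 1.0                           # the real cepstrum is symmetric
        return np.real(np.fft.rfft(cep * lifter, n=ndft))    # smoothed log-magnitude spectrum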
With periodic signals, it is often desirable to find
out what the period (or, equivalently, fundamental
frequency) is; a secondary question is to determine
the entire source spectrum. Many F0 trackers exist
and can be roughly grouped into time-domain and
frequency-domain algorithms. If a person measured
F0 from a time waveform, they would look for a
repeating pattern using any number of cues such as
the highest-amplitude peaks or longest up- or downslope; they would check earlier and later to make sure
that, even though the pattern is slowly changing, the
interval of quasirepetition seems consistent; and finally, they would measure the time interval between
repetitions and invert that value to obtain a local
estimate of F0. Such manual tracking, when done by
people with some training, is extremely consistent
across trackers and has been used as the gold standard
by which to evaluate computer algorithms. It should
not be surprising, then, that some of the most successful algorithms use similar simple parameters defined
on the time waveform (Rabiner and Schafer, 1978;
Gold and Morgan, 2000).
Another time-domain algorithm takes a different
approach, beginning by computing the autocorrelation of a windowed part of a signal with itself. The
signal and its copy are aligned, the product is computed of each sample with its aligned counterpart,
and the products are summed. The resulting value is that for the lag t = 0. Then the copy is shifted by one sample, and the process is repeated, with products being formed of each sample with its one-sample-earlier counterpart. The new sum is computed for t = 1. As the signal and its copy get more and more
out of alignment, the sum of products decreases
until they are misaligned by one pitch period, and
then the sum will have a high value again. When the total lag equals two or three pitch periods the sum will peak again, but because the two signals overlap less and less, successive peaks will be smaller. The algorithm computes the autocorrelation and then finds its peaks. The lag t of the first peak is taken as T0, that of the second peak as 2T0, and so
on until the peaks are too low in amplitude to be
reliable indicators. Autocorrelation-based F0 trackers
work better on high F0 voices, because the pitch
periods are shorter so more of them fit within the
same size window (Rabiner and Schafer, 1978).
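A bare-bones autocorrelation F0 estimator, without any of the heuristics that real trackers add, might look like the following sketch (the search range and the test signal are assumptions made for the example):

    import numpy as np

    def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
        """Pick the autocorrelation peak in a plausible lag range and invert it to get F0."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]   # lags 0, 1, 2, ...
        lag_min = int(fs / fmax)                 # shortest period considered
        lag_max = int(fs / fmin)                 # longest period considered
        peak_lag = lag_min + np.argmax(ac[lag_min:lag_max])
        return fs / peak_lag                     # period in samples -> F0 in Hz

    fs = 16000
    t = np.arange(0, 0.04, 1 / fs)
    crude_voicing = np.sign(np.sin(2 * np.pi * 120 * t))   # 120-Hz square wave as a test signal
    print(estimate_f0(crude_voicing, fs))                   # approximately 120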
Frequency-domain F0 trackers use some form of
a spectrum in which the harmonics are visible. The peaks are found, and their greatest common divisor is determined. This method can work even if the fundamental and some of the harmonics are missing (as in telephone speech). Preprocessing is used, most often low-pass filtering, though it is sometimes more elaborate. In one algorithm, LPC analysis is
used to find the formants; an inverse filter is then
devised and multiplied by the original signal to remove the formants, leaving the harmonics of now
nearly uniform amplitude. Then LPC analysis with a
higher order is used, and the peak frequencies, and their greatest common divisor, are found (Gold and
Morgan, 2000).
F0 trackers have been compared extensively.
Some work better with speech recorded in noisy
environments; some work better with high, or low,
voices. Generally, voices become very difficult to
track when they verge into vocal fry, diplophonia, or
other kinds of vocal instability. A manual tracker may
be able to discern periodicity where an automatic
tracker has declared a signal unvoiced. Most trackers
include heuristic thresholds that, for instance, do not
allow octave skips in the output F0 values. This is
unfortunate when the speaker has actually produced
a sudden octave change by going into falsetto or
yodeling.
Analysis of Stationary Noise

In completely random noise, adjacent samples are uncorrelated, and the noise must be described statistically. The time waveform can be described by the
probability distribution of amplitudes, and that distribution can be described by its mean, variance, and
higher moments. The noise can also be described by
its power spectrum, and can be classified in general
terms as wideband or narrowband noise. White noise
is flat across all frequencies and therefore is wideband. One can think of the bandwidth of noise in
terms of the rapidity of the variation possible in the
time domain.
For all such descriptions of noise, stationarity
means that the properties of the noise do not change
with time. If this is true, we can collect a very long
example of the signal to analyze; equally, we could
collect sections of it today, tomorrow, and next year
and assume that the mean, variance, and higher
moments are the same in all of our samples.
In the real world, signals carrying information are
not perfectly stationary. As with periodicity, though,
we can declare something to be quasistationary if its
properties do not change very fast compared to the
intervals we are interested in; alternatively, we can
assume that a signal is stationary and, as part of the
analysis, try to determine if that assumption is valid.
In speech, the central portions of unvoiced fricatives are often treated as if stationary; sometimes the
entire fricative is treated this way, even though the transitions are clearly regions of rapid change. If nonstationary noise is treated as stationary, the result
is likely to be a sort of muddling together of the
changing values describing the noise. However, nonstationary noise is sometimes analyzed as if it were
a deterministic signal, and this is likely to lead to
erroneous conclusions.
From comments in the previous section on how the
frequency resolution depends on the length of the
DFT (NDFT), it might seem that the best way to
analyze a noisy signal would be to use a relatively
short window so that the noise within it is close to
stationary, and then use a big NDFT so that the resulting transform is sampled with a fine frequency resolution. But it is possible to prove that taking a single
DFT of noise results in a spectrum with an error of the
same magnitude as the true value. Some form of
averaging must be done in order to describe noisy
signals. Taking a single DFT of a longer window (and a correspondingly longer NDFT), which intuitively seems to be a good idea because more samples are included, does not help; the frequency resolution becomes finer, but the values still have a large error. If, on the other
hand, the samples in the long window are subdivided
into many short windows, the DFT is computed for
each short window separately, and the results averaged at each frequency, the resulting averaged power
spectrum converges to the true value. If each window
contains independent, identically distributed samples
of the same underlying process, the variance of the
estimate decreases as the number of such windows
increases (Bendat and Piersol, 2000). This is shown
graphically for white noise being time-averaged with
an increasing number of averages in Figure 2.
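SciPy's welch function implements exactly this kind of chop-and-average estimate; in the sketch below (white noise, 20-ms Hanning windows, 50% overlap, all chosen only for illustration) the relative scatter of the averaged estimate is visibly smaller than that of a single DFT of the whole interval.

    import numpy as np
    from scipy.signal import welch

    fs = 16000
    noise = np.random.randn(int(0.1 * fs))       # 100 ms of white noise

    # a single DFT of the whole interval: fine resolution, large error
    f1, p_single = welch(noise, fs=fs, window='hann', nperseg=len(noise), noverlap=0)

    # time averaging: 20-ms windows, 50% overlap, DFTs averaged at each frequency
    nper = int(0.020 * fs)
    f2, p_avg = welch(noise, fs=fs, window='hann', nperseg=nper, noverlap=nper // 2)

    print(np.std(p_single) / np.mean(p_single))   # roughly 1 for unaveraged white noise
    print(np.std(p_avg) / np.mean(p_avg))         # noticeably smaller after averaging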
There are three ways in which averaging can be
done, each of which will reduce the error of the
spectral estimate, but each also with its own pros
and cons. The method just described, of chopping a
long interval into short windows, is called time averaging (see Figure 3A). If Hanning or Hamming windows are being used, the samples at the tapered edges
can be reused by overlapping windows to some
degree: rules of thumb range from 30 to 50% overlap.
In this way, 100 ms of signal could be chopped into
nine overlapping 20-ms windows, which could significantly improve the variance of the estimate
provided that the signal is more or less stationary
during the 100 ms. The practice used in some speech
studies of overlapping the windows much more than
this (e.g., using a 20-ms window and a skip factor of
1 ms, so that 40 ms of signal is used to generate
21 windows and, thus, 21 DFTs) has two disadvantages: the variance of the estimate is not reduced
proportionate to the number of averages, and the
result is weighted toward the characteristics of the middle 20 ms of the 40 ms, since that is the most heavily overlapped portion.

Figure 2 White noise, analyzed with time-averaging. The number of DFTs computed and averaged at each frequency is shown as an n value with each curve. (From Shadle (1985) The acoustics of fricative consonants. Ph.D. thesis, MIT, Cambridge, MA. RLE Tech. Report 504, with permission.)
A second way is to compute the ensemble average
(see Figure 3B). An ensemble of signals means essentially that different signals have been produced under
identical conditions, leading to the same properties
for the noise in each signal. The noise properties can
vary in time, but the time variations must be the same
for each member of the ensemble. For instance, if our
signal is the sound of raindrops falling on the roof,
and they fall louder and faster as the wind blows
harder, then an ensemble could consist of raindrops
falling in ten different storms, in all of which the wind
increased at the same rate. We place our windows at
the same time in each signal (relative to the wind
speed, or other controlling parameter), compute the
DFTs, and average as for the time average. The obvious problem here is in knowing that every member of
the ensemble had the same controlling parameters at
the same times. However, each individual signal does
not need to be stationary for more than the length of
the short window.
A third way is to compute the frequency average, by
computing a single DFT and then averaging in the
frequency domain (see Figure 3C). Ten adjacent frequency components can be averaged to produce a
single component. This coarsens the frequency resolution but reduces the error. However, this works well
only if the spectrum is fairly flat. If the spectrum has
significant peaks or troughs, the frequency averaging
will flatten them and so introduce bias to the estimate,
meaning that it will converge to the wrong value.
For speech, all of these methods have been
used, but none is ideal.

Figure 3 Diagrams indicating which parts of the signal(s) are used to generate averaged power spectra. Each rectangular box represents a part of a speech time waveform; shaded regions indicate the part being analyzed. Brackets indicate the length of the window for which the DFT is computed. The Average boxes compute an average of the DFT amplitude values at each frequency. (A) Time averaging. (B) Ensemble averaging. (C) Frequency averaging.

Another method exists and is beginning to be used in speech research: multitaper analysis. With this method, a single short signal segment is used, but it is multiplied by many different
windows called tapers before computing and averaging their DFTs. The particular shape of the tapers
satisfies the requirement for statistical independence
of the signals being averaged. Figure 4 compares a
multitaper estimate and a DFT spectral estimate of
the same central portion of an [s]. The jaggedness of
the DFT curve can provide a rough visual indication
of its greater error compared to the multitaper curve.
Spectrograms can be constructed of a sequence of
multitaper estimates and plotted similarly. There are
important choices to be made about the number of
tapers to use and other parameters, but the method
offers advantages in speech analysis over the three
averaging techniques described above (Blacklock,
2004).
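In outline, a basic multitaper estimate averages periodograms computed with discrete prolate spheroidal (DPSS) tapers; the sketch below, with conventional but arbitrary choices of time-bandwidth product and taper count and none of the refinements used by Blacklock, shows the idea.

    import numpy as np
    from scipy.signal.windows import dpss

    def multitaper_psd(frame, fs, nw=4, ndft=2048):
        """Average the periodograms obtained with K = 2*NW - 1 orthogonal DPSS tapers."""
        k = 2 * nw - 1
        tapers = dpss(len(frame), nw, k)                    # shape (k, len(frame))
        specs = np.abs(np.fft.rfft(tapers * frame, n=ndft)) ** 2
        freqs = np.fft.rfftfreq(ndft, 1 / fs)
        return freqs, specs.mean(axis=0)                    # lower-variance estimate than a single DFT

Absolute scaling to power spectral density units is omitted here; the point is only the reduction in variance relative to a single DFT of the same segment.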
Note that spectrograms, although they do not include spectral averaging explicitly, are not as misleading as using single DFTs for noisy sounds. Essentially,
the eye averages the noise, aided by the use of a small
skip factor in the computation. The same is not true
of spectral slices derived from a spectrogram; since
these are constructed from a single DFT, there is
nothing shown for the eye to average. This problem
was recognized in an early article about the use of the
spectrogram (Fant, 1962: Figure 6, p. 20).
Analysis of Mixed Noise and Periodic Signals

Mixed-source signals would seem to call for two different analysis techniques. Examples in speech include voiced fricatives and affricates, and also breathy or hoarse productions of vowels, liquids, and nasals.

Figure 4 Multitaper spectrum of [s] in bassoon in blue (smooth curve) overlaid on DFT of the same signal in red (jagged curve). British male speaker. (After Blacklock, 2004.)

In all of these cases the signal analysis is
complicated by the fact that the noise and voicing
source are not independent; in voiced fricatives the
noise can be modulated by the voicing source, and
breathy or hoarse sounds are likely to change as the
vocal folds vibrate, even if the noise is not specifically
modulated by the acoustic signal.
Mixed-source signals should be analyzed with time
averaging, ensemble averaging, or multitaper. If the periodic component is stationary, the spectral averaging will not affect it, but will reduce the error in the
estimate of the noisy components. If F0 of the periodic component changes noticeably during the interval or across the ensemble being averaged, the harmonics will
be smeared out, which may be obvious in the averaged power spectrum, or may become clear when that
is compared to a spectrogram. In that case, time
averaging should be avoided in order to decrease the
averaging interval length.
Mixed-source signals can also be decomposed into
two parts, harmonic and anharmonic. A wide variety
of algorithms exist that accomplish this. After decomposition, each component can be analyzed in the way
appropriate to a harmonic signal and a noisy signal,
respectively. Jackson and Shadle (2001) reviewed
such algorithms and presented their own, which was
used to investigate voiced fricatives. Multitaper analysis can also be formulated to identify harmonics
mixed with colored noise; a detailed comparison of
the two techniques has not yet been made.
Analysis of Noisy Transients

A transient includes nondeterministic noise, is highly nonstationary, and is generally very short. An example from speech is the stop release. Because it is noisy,
it requires averaging, but with such a short signal that
is difficult to do. Ensemble averaging is possible, but
an independent means of aligning the signals in the
ensemble would need to be established. Multitaper is
also a possibility.

Production Models
We turn now from consideration of analysis techniques appropriate to the type of signal to models of
speech production that indicate the parameters we
seek from analysis in order to describe and classify
sounds. The vast majority of speech production models that are useful for this purpose are source-filter
models, with independent source and filter, and linear
time-invariant filter. The assumption of independence is flawed (interactions of all sorts have been shown to exist), but it serves well for a first approximation,
in part because the models become simple conceptually. The source characteristics can be predicted,
and the source spectrum multiplied by the transfer
function from that source to an output variable such
as the volume velocity at the lips. (If both characteristics are in log form, it is even simpler; they can just
be added at each frequency.) While it took years to
develop the theory underlying the source characteristics and the tract transfer functions, it is now straightforward to vary a parameter such as F0, a formant
frequency, or pharynx cross-sectional area in such a

model and see its acoustic effect. It is not so straightforward to analyze the far-field pressure into true
source and filter components.
Sources

There are two basic types of sources: the voicing source, generated by vocal fold vibration and nominally located at the glottis, and noise sources, which
can be located anywhere in the vocal tract, including
at the glottis. In both cases the location of the source
is where some of the energy in the airflow is converted
into sound. Determining the exact location for noise
sources is still a subject of research, and slight differences can affect the predicted radiated pressure
significantly.
A number of factors affect the voicing source: subglottal pressure, degree of adduction of the vocal
folds, tension of the folds, and supraglottal impedance. They determine, first, whether the vocal folds
vibrate and, if so, the frequency at which they vibrate
and the mode or register of vibration. The frequency
of vibration affects F0 and all its harmonics; the
mode of vibration affects the amplitudes of all harmonics and also whether noise will be produced (as in
breathy or whispered speech). These differences can
be characterized in the time waveform of the glottal
volume velocity, Ug(t), or in its spectrum, Ug(f). As
a general rule, abrupt corners or changes of slope
in the time waveform, which occur for the more
adducted registers like modal register or pressed
voice, mean there will be more high-frequency energy,
i.e., the harmonics will have higher amplitudes compared to falsetto or breathy voice. Figure 5 shows a
typical glottal waveform, with a clear closed phase,
and a range of possible spectra; the steeper the slope (e.g., −18 dB/oct), the smoother the time waveform, with a sound quality as in falsetto; the shallower slopes (−12, −6 dB/oct) correspond to a richer, brassier sound. The sound quality is related to the spectral
tilt; the spacing of the harmonics that define the
spectrum is related to T0, the spacing between glottal
pulses (Sundberg, 1987; Titze, 2000).
Noise sources occur when the air becomes turbulent and the turbulence produces turbulence noise. Whether turbulence occurs is determined by the Reynolds number, Re = VD/ν = UD/(Aν), where V = a characteristic velocity, D = a characteristic dimension, typically the cross-dimension where V is measured, U = volume velocity, A = cross-sectional area where D is measured, and ν = kinematic viscosity of the fluid, 0.15 cm²/s for air. If V increases while D remains the same, or if U stays the same but A decreases, Re will increase. Thus, although the volume velocity must be the same all along the tract
(since there is nowhere else for the air to go), the Reynolds number will be highest at the points of greatest constriction. When Re is greater than a certain critical Reynolds number, Recrit, the jet emerging from the constriction will become turbulent, but where the most noise will be generated depends on the geometry downstream of the constriction.

Figure 5 Plots of typical glottal volume velocity, as (A) time waveform Ug(t), and (B) spectrum, Ug(f). (Adapted from Titze (2000) Principles of voice production (2nd printing). Iowa City, IA: National Center for Voice and Speech, with permission.)

Figure 6 Noise source: shape downstream of constriction and spectrum of noise resulting experimentally, for free jet (top) and jet impinging on an obstacle (bottom).
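A quick numerical check of the Reynolds number is sometimes useful; the constriction values below are invented for illustration, and the critical value is only indicative (values around 1700–1800 are commonly cited for duct flow).

    NU_AIR = 0.15     # kinematic viscosity of air, cm^2/s (value given in the text)

    def reynolds(U, A, D):
        """Re = U*D/(A*nu), with U in cm^3/s, A in cm^2, D in cm."""
        return U * D / (A * NU_AIR)

    # e.g., 200 cm^3/s through a 0.1 cm^2 constriction about 0.3 cm across (illustrative numbers)
    print(reynolds(200.0, 0.1, 0.3))   # = 4000, well above a critical value of roughly 1800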
The simplest model for such turbulence noise is to
treat it as completely localized at one place, and place
a series pressure source at the equivalent place in the
model. The strength of the source, ps, is related to the
parameters that affect the amount of turbulence generated: the pressure drop across the constriction, and
the volume velocity and area of the constriction. The
spectral characteristic should be broadband noise;
sometimes, for convenience, it has been defined as
high-pass-filtered white noise (Flanagan and Cherry,
1969). Stevens specified a broad peak characteristic
of free jet noise (though free jet noise generation is
distributed along the length of the jet) (Stevens,
1971), but some experiments indicate it should
have a characteristic with its amplitude highest at
low frequencies (Shadle, 1990) (see Figure 6). The
location of the source has been experimented with;

Flanagan and Cherry (1969) placed it 0.5 cm downstream of the constriction exit; Fant (1970) sought the
location generating the best spectral match for each
fricative; Stevens (1998) has demonstrated the difference made by placing it at any of three locations
downstream. It seems clear that, for some fricatives, a localized source with a 'spoiler in duct' characteristic is adequate, while for others, a distributed source with
the broad peak characteristic of a free jet is needed
(Shadle, 1990).
Because ps is related to the pressure drop across the
constriction, the amount of noise will change as
the constriction area changes (as is needed during a
stop release, or in the transitions into and out of a
fricative) and as the pressure just upstream of the
constriction changes (as when the pressure drop
across the glottis changes). Modulation of ps by the
glottal volume velocity is possible in such a model
(Flanagan and Cherry, 1969), though the actual mechanism affecting the source in voiced fricatives appears
to be somewhat more complex than can be modeled
by their synthesizer (Jackson and Shadle, 2000).
Filters

The filtering properties of the vocal tract depend on its shape and size and, to a small extent, on the
mechanical properties of its walls. Wherever sound
is generated in the tract, sound waves emanate outward from that point. At any acoustic discontinuity
(such as a change in cross-sectional area, or encountering a solid boundary), some of the wave may travel
onward, and some may reflect. Reflected waves can
interfere with sounds emitted from the source later,
combining constructively and destructively. At frequencies in the sound where the interferences recur at the same spatial positions, standing wave patterns
will be set up.
Many explanations of standing wave patterns
exist (see, for example, Stevens, 1998; Johnson,
2003). It is simple to compute the frequencies at
which such patterns will occur for lossless uniform-diameter tubes, where only two cases matter: a tube closed at one end and open at the other, so that the boundary conditions differ, and a tube open at both ends or closed at both ends, so that the boundary conditions are the same. The first sustains quarter-wavelength resonances; that is, the tube length equals odd multiples of λ/4, so the resonance frequencies are fn = (2n + 1)c/(4L), where n = 0, 1, 2, 3, . . . , fn = nth resonance, c = speed of sound, and L = length of the tube. The second sustains half-wavelength resonances; the tube length equals integral multiples of λ/2, so the resonances are fn = cn/(2L), where n = 1, 2, 3, . . . .
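These two formulas are easy to evaluate; the sketch below uses the body-temperature speed of sound quoted later in this article and, as an example, the 17.5-cm closed-open tube discussed under Vowels.

    C = 35900.0   # speed of sound, cm/s, at body temperature (value used later in the text)

    def closed_open_resonances(L, n_res=4):
        """Quarter-wavelength resonances, f_n = (2n + 1) c / (4L), n = 0, 1, 2, ..."""
        return [(2 * n + 1) * C / (4 * L) for n in range(n_res)]

    def open_open_resonances(L, n_res=4):
        """Half-wavelength resonances, f_n = n c / (2L), n = 1, 2, 3, ..."""
        return [n * C / (2 * L) for n in range(1, n_res + 1)]

    print(closed_open_resonances(17.5))   # approx. 513, 1539, 2564, 3590 Hz for a 17.5-cm tube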
More complex tract shapes can be approximated by
concatenating two or more tubes, each of uniform
cross-sectional area. If the number of tubes is low, it
is still relatively easy to predict the resonances of the
combined system and is, thus, useful conceptually.
Analytic solutions can be found for the resonances
of the system by solving for the frequencies at which
the sum of the admittances at any junction is zero.
This was first shown by Flanagan (1972) for a set of
two-tube systems approximating vowels. For more
than two tubes, it is still possible, but the calculations
become so complex that it is preferable to use many
more tubes, simulate them as a digital filter, and calculate the resonances by computer. However, if the
area changes by a factor of six or more between sections, for instance, with a constricted region between two larger-area sections, one can assume that the cavities are decoupled and compute the resonances for each tube (in this case, three) separately. In this
situation, each resonance of the system will have a
strong cavity affiliation, with its frequency inversely
proportional to the length of that cavity. In other
cases, where the area does not change so significantly
between sections, the resonances are coupled. An extreme case of a coupled resonance is the Helmholtz
resonance, which depends on the interaction of a
small-area neck and a large-volume cavity.
All of these resonances result from plane-wave
modes of propagation, meaning that the acoustic
wavefronts are planar, perpendicular to the duct's longitudinal axis. A point source in the duct will radiate sound in all directions, but below a certain cut-on frequency any sound traveling in directions other than along the duct's axis will die out; these
waves are evanescent. The cut-on frequency depends on the cross-dimensions of the duct and its cross-sectional shape. It is easiest to understand for a duct with a rectangular cross-section, say, Lx by Ly; the cut-on frequency occurs where a half-wavelength fits the larger of Lx and Ly, which we shall call Lmax. In other words, fco = c/(2Lmax). For a duct of circular cross-section, with radius a, fco = 1.841c/(2πa).
Above the cut-on frequency, cross-modes will propagate. These modes are also dispersive, meaning that higher frequencies travel faster (Pierce, 1981). Above that frequency, many of the assumptions underlying the basic model used in speech become progressively less true.
For vocal-tract-sized cross-dimensions, what are the cut-on frequencies? If the duct is rectangular, with Lmax = 2.5 cm, fco = 7.2 kHz; Lmax = 4.0 cm gives fco = 4.41 kHz. If the duct is circular, a diameter 2a = 2.5 cm gives fco = 8.42 kHz; 2a = 4.0 cm gives fco = 5.26 kHz. The maximum cross-sectional areas in these cases are, respectively, 6.2 and 16 cm² for the rectangular duct, and 4.9 and 12.6 cm² for the circular duct. (We use c = 35,900 cm/s as the speed of sound at body temperature, 37 °C, in completely saturated air.)
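The numbers above follow directly from the two cut-on formulas; a sketch of the arithmetic:

    import math

    C = 35900.0    # cm/s, speed of sound at 37 °C in saturated air, as in the text

    def cuton_rectangular(l_max):
        """f_co = c / (2 * Lmax), with Lmax in cm."""
        return C / (2 * l_max)

    def cuton_circular(a):
        """f_co = 1.841 * c / (2 * pi * a), with radius a in cm."""
        return 1.841 * C / (2 * math.pi * a)

    print(cuton_rectangular(2.5), cuton_rectangular(4.0))   # ~7.2 kHz and ~4.5 kHz
    print(cuton_circular(1.25), cuton_circular(2.0))        # ~8.4 kHz and ~5.3 kHz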
Obviously the vocal tract is never precisely rectangular or circular in cross-section. But comparing to Fant's data (1970), for instance, we can estimate that the cut-on frequencies for the six vowels of his subject ranged from 4.6 to 9.0 kHz (assuming a rectangular cross-section) or 4.8 to 9.3 kHz (assuming a circular cross-section). For a smaller subject, and where cross-dimensions are given (Beautemps et al., 1995), the largest cross-dimension in the front cavity is 1.79 cm (for /a/), giving fco = 10.0 to 11.8 kHz; the largest back-cavity cross-dimension is 2.4 cm (for /i/), giving fco = 7.5 to 8.8 kHz. For formant estimation for
vowels, then, the lumped-parameter models considering only plane-wave propagation are based on reasonable assumptions. For fricatives, there may well be
significant energy above the cut-off frequency, where
these models become increasingly inaccurate, but in the
absence of articulatory data good enough to support
more complex high-frequency models, plane-wave
propagation models are often pressed into service.
There are several sources of loss in the vocal tract
that have the effect of altering resonance frequencies
and bandwidths. The most significant is radiation
loss, especially occurring at the lip opening, but also
present to a lesser extent wherever a section with
small cross-sectional area exits into a region of much
larger area. The main effect is to tilt the spectrum up
at high frequencies. If resonances have been computed assuming no loss, their predicted frequencies
will be higher than actually occur, and the difference
is bigger at higher frequencies. The larger the area of the mouth opening relative to the front-cavity volume, the greater the radiation loss. If there is a small
constriction such that front and back cavities are
decoupled, back-cavity resonances will have little radiation loss and so will have sharper peaks (lower
bandwidths) than the front-cavity resonances.
Viscosity describes the loss that occurs because of
the friction of the air along the walls of the tract; heat
conduction describes the thermal loss into the walls.
Both increase when the surface area of the tract is
higher relative to the cross-sectional area and increase
with frequency. Though not as big sources of loss as
radiation, they contribute to the increased bandwidths of higher resonances. Finally, the walls of the
tract are not rigid; when modeled as yielding,
the bandwidths of low-frequency resonances are
predicted to increase (Rabiner and Schafer, 1978).
Any sound source excites the resonances of the
vocal tract, and those resonances can be calculated,
approximately or more precisely, by the methods outlined above. There may also be antiresonances, when
the tract is branched and/or when the source is intermediate in the tract. The antiresonances vary according to the position and type of source; for each source
possibility, a different transfer function can be computed. The transfer function is a function of frequency
and is the ratio of output to input. Thus, multiplying
the transfer function for a particular source by the
sources spectral characteristic yields the predicted
output spectrum. At frequencies where the transfer
function equals zero, the output will be zero no matter
what input is applied; at frequencies where the transfer function has a high amplitude, any energy in the
input at that frequency will appear in the output,
scaled by the amplitude of the transfer function.
It is worth remembering that the resonances and
antiresonances are properties of the actual air in the
tract, duct, or tube system. Poles and zeros are attributes
of the transfer function, where the analytical expression goes to infinity (at a root of the denominator) or
to zero (at a root of the numerator). A spectrum of
actual speech is best described as having peaks and
troughs; according to the particular set of approximations used, these may be modeled as corresponding to
poles and zeros. A given spectral peak may be produced by more than one resonance, modeled by more
than one set of poles; a pole-zero pair near each
other in frequency may effectively cancel, producing
neither peak nor trough.

Methods of Classification
Vowels

Peterson and Barney (1952), in their classic study, determined the range of variation in the first two formants for 10 vowels, thus demonstrating not only the usefulness of those two parameters but also
their average values for men, women, and children.
Although they measured formants from spectral
slices, having determined the best place to compute
the slice from a spectrogram, that is only one of
several techniques available now. One can locate the
vowel using only the time waveform and compute the
LPC spectral envelope and determine the frequencies
of the peaks in that envelope. One can run an LPC-based formant tracker on the entire utterance, which
computes the peak frequencies directly. Since LPC
can occasionally fail to identify closely spaced formants separately, as a safeguard one can compute
either a single DFT or the entire spectrogram, respectively, and superimpose the LPC spectral envelope or
formant tracks on top for a quick visual check of the
LPC performance. The window used for either DFT
or LPC analysis should be at least as long as a single
pitch period; the LPC order should be chosen to allow
for the expected number of formants within the frequency range, or adjusted and recomputed after an
initial analysis.
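One widely used recipe (a textbook method, not necessarily the one behind any particular tracker) converts the roots of the LPC polynomial directly into candidate formant frequencies and bandwidths; the thresholds below are conventional but arbitrary.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_formants(frame, fs, order=None):
        """Candidate formants from the complex roots of an autocorrelation-LPC polynomial."""
        if order is None:
            order = int(fs / 1000) + 2                    # rule of thumb: 2 poles per expected formant + 2
        frame = frame * np.hanning(len(frame))
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        roots = np.roots(np.concatenate(([1.0], -a)))
        roots = roots[np.imag(roots) > 0]                 # keep one root of each conjugate pair
        freqs = np.angle(roots) * fs / (2 * np.pi)
        bws = -fs / np.pi * np.log(np.abs(roots))         # 3-dB bandwidths from pole radii
        keep = (freqs > 90) & (bws < 400)                 # discard very low or very broad poles
        return np.sort(freqs[keep])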
To understand formant patterns, it is useful to consider vocal tract shapes as departures from a uniform
tube that is closed at one end, the glottis, and open at
the lips. For a length of 17.5 cm, assuming no losses,
resonances are predicted at 500, 1500, 2500, . . . Hz.
Shortening the uniform tube raises all frequencies.
Decreasing the area at the lip end only, akin to rounding, lowers all frequencies and reduces the bandwidths. To consider vowels other than schwa, we
need at least a two-tube model. If the tongue is high,
the pharyngeal area becomes large, and the oral cavity area becomes small. The lowest formant is best
modeled as a Helmholtz resonance and moves down
from 500 Hz; upper formants shift, depending mainly
on the lengths of the two cavities, and partly depending on the area ratio. The rule of thumb, often quoted,
is that increasing tongue height brings down F1, and
increasing tongue frontness brings up F2. This rule
works roughly, even though /u/ cannot really be modeled by a two-tube combination. The extreme, cardinal vowels /i/ and /a/ do fit. For /i/, the tongue is high
and front, F1 is low, and F2 is high. For /a/, the tongue
is low and back, F1 is high, and F2 is low. For these
vowels with large area differences from pharynx to
oral cavity, the tubes can be treated as decoupled, leading to the observation that the cavity affiliations of the formants occur in a different order for the two vowels. For /i/, F1 is a Helmholtz resonance, and F2, F3, and F4 are, respectively, back-, front-, and back-cavity resonances.
For /a/, F1 to F4 are, respectively, back, front, back,
front-cavity resonances. This means that, in a transition from one to the other, as occurs in the diphthong /aI/, the formants do not smoothly change frequencies from one vowel to the other.
These models help us to understand, but real
speech is seldom so clean. Figure 7 shows waveforms
and spectrograms of two sentences, one spoken by an adult female (Figure 7A), Don't feed that vicious hawk, and one by an adult male (Figure 7B), You should be there on time, both British speakers. We
will be referring to these spectrograms throughout
this section. Note that the vowel in You at the start of Figure 7B has a low F1 but a high F2, inconsistent with /u/; /ju/ has apparently been realized as [jI]. The
vowel in should is very short, but still has three
steady formants visible. The vowel in be, after the
initial formant transitions, has a classic pattern for
/i/; note the differences between this [i] and that in
Figure 7A, feed. The words there on show a fairly
gradual lowering of F3 for [r], followed by a more
sudden lowering of F2 for [a]. The formants in time do change as expected, with F1 and F2 moving from being near each other to a wider separation, but F2 does not rise very far.
The simple models also allow one to understand
how vowels vary with gender and age. As children
grow, their pharynxes lengthen more than do their
oral cavities; the vocal tract length differences between adult men and women are due more to differences in pharyngeal than in oral cavity length. Thus,
the formant space does not scale uniformly by vocal
tract length. The higher F2 in the female subject's [i] agrees with this explanation.
To a first approximation, the voicing source and
the vocal tract filter are independent. We can therefore think of the transfer function from glottis to lips
as a spectral envelope that is sampled by the fundamental and its harmonics. If the vocal tract remains
the same shape, leaving the formants at the same
frequencies, the harmonics sample it more coarsely
at higher values of F0. On average, speakers with
smaller larynges also have shorter vocal tracts, so
that, as the range of F0 values possible moves up,
the range of formant frequencies increases too. However, as Peterson and Barney's data (1952) show, the two do not increase at the same rate; women's F0 is 1.7 times higher than men's, while their formant frequencies are only 1.15 times higher, on average. This
means that F0 is much more likely to approach F1
in women than in men, and formants may be difficult
to resolve. An example of this occurs in Figure 7A in
feed, where F0 is 273 Hz.
Finally, sometimes the properties of the voicing
source are of more interest than the filter properties
of vowels. It is possible to inverse filter the speech
signal and arrive at an estimate of the glottal volume
velocity. In order to inverse filter, one must estimate

what the filter was, invert that, and multiply it by the


speech spectrum. Clearly, the estimate of the glottal
volume velocity is only as good as the estimate of the
tract filter function, but the technique has led to
detailed explanations of voice quality differences, including source differences between men and women.
One of the better-known techniques uses the Rothenberg mask to measure volume velocity at the lips and
inverse filter that signal rather than the far-field pressure. This provides information about the mean flow
of air through the vocal tract, including the degree of
breathiness in the glottal volume velocity.
Nasals

In order to produce a nasal, the velum is lowered, and complete closure is effected in the oral cavity. The
oral cavity becomes a side branch that contributes
antiresonances inversely related to the length from
pharynx to place of closure; the resonances arise
from the pharynx and nasal cavities. The nasal cavities are convoluted in shape, uniquely so for each
individual; the length of the effective tract is thus
longer than that of pharynx plus oral cavity, with a
correspondingly lower first formant. Bandwidths of
all resonances are also larger because there are more
surfaces to absorb sound.
As with fricatives, the radiated spectrum is a mixture of peaks and troughs that are not always easy to
map to particular cavities. The nasal formants are packed more closely in frequency than nonnasalized vowel formants, but they may not all be apparent
because of the antiresonances. Some of the series of
antiresonances may appear as deep spectral troughs,
but where they coincide with nasal formants, they
will cancel, or nearly so, and neither will be apparent.
Because of the cancellation and the wide bandwidths, the spectrum of a nasal will overall have
lower amplitude than an adjacent vowel. While the
antiresonances that could provide a place cue may not
be strikingly apparent, particularly if there is background noise, the formant transitions in adjacent
vowels will also provide place cues; briefly, all formants will decrease before a bilabial nasal, F1 and F3
will decrease and F2 will increase before a velar nasal,
and F1 and F3 will decrease and F2 will decrease or
increase depending on the vowel before an alveolar
nasal (Kent and Read, 1992; Johnson, 2003).
The clearest example of such transitions occurs in
Figure 7A in don't, where F2 clearly rises during the
nasalized portion. In Figure 7B, no transitions are
obvious in the vowel of on, though F2 and the
amplitude both decrease abruptly at the start of the
nasal. In time, a slight F1 transition is observed,
and F3 appears to drop abruptly, though it is not
well-defined in the spectrogram.

Figure 7 Waveform and spectrograms of two sentences. (A) Don't feed that vicious hawk, female British speaker, as in Figure 1; (B) You should be there on time, male British speaker. Note that the spectrograms extend up to 12 kHz.

In nasalized vowels the velum is down, but the oral cavity is not closed. The presence of two distinct paths
still allows for interference effects, but the antiresonances will be at different frequencies than for nasals,
and these frequencies will depend on the area of the
velo-pharyngeal port. The resonances will correspond
to those of the vowel alone (pharynx plus oral cavity)
and the nasal formants (pharynx plus velo-pharyngeal
port plus nasal cavities); the antiresonances may cancel some of these, or may show up as spectral troughs,
but it is likely that the lowest nasal formant will be
the highest-amplitude peak.
Fricatives

Many different sets of parameters for fricatives have been explored, but none are yet sufficient to classify
them. Theoretically it seems straightforward; when
the constriction is small, as during the steady-state
portion of a fricative, the back-cavity resonances
are essentially cancelled. The noise source excites
the front-cavity resonances, and antiresonances (zeros) appear at low frequencies and at higher frequencies inversely related to the distance between the source location and the constriction exit. If the source
is not well localized, these higher-frequency antiresonances may smear out and not be readily apparent. The frequency at which the energy appears in the
spectrum thus should differentiate fricatives by place,
with longer front cavities for palatals and velars
corresponding to energy at lower frequencies. However, the frequency ranges used for different fricatives
overlap extensively across subjects. Further, interdentals seem to be highly variable even within a subject, with sometimes barely discernible noise. For
instance, in Figure 7A, [f] has significant energy from
1200 Hz to 11 kHz (and possibly higher; the antialiasing filter begins to act there), though clearly not
as high in amplitude as the [s] or [ʃ] in the same sentence, and lasting for 150 ms. The [v], however, appears to consist of a voicebar lasting 100 ms and weak noise, albeit in roughly the same frequency range, for only 10–20 ms. Note also that [ʃ] differs
slightly in Figure 7A and 7B, with the frequency of
the lower edge of the high-amplitude region occurring
at approximately 2.0 kHz for the female, 1.5 kHz for
the male subject. This may be due to a difference in
length of the front cavity, or, more likely, to the influence of the vowel context, with the higher cut-on
frequency corresponding to the high unrounded
vowel.
It was thought at one point that identification of interdentals depends on transitions, while that of /s, ʃ/ depends only on steady-state characteristics. An obvious difference in articulation tends to support this theory; /θ/ requires the tongue tip to be in contact

with the teeth, unlike in /f/. However, careful manipulation of speech signals shows that transitions as well
as steady-state characteristics are important for /s, ʃ/
(Whalen, 1991).
In the transition from a vowel to a fricative several
things happen, and not always in the same order.
Formants shift as the constriction becomes smaller,
noise begins to be produced, and the formants as well
as antiresonances begin to be excited. Back-cavity
resonances can be prominent for a time until the
constriction area decreases sufficiently for them to
be cancelled. As the noise increases, the rate at
which it increases depends on the fricative; stridents appear to have the most efficient noise sources, in that the noise produced increases at a greater rate with the flow velocity through the constriction.
Both spectral tilt and overall spectral amplitude are
affected. Within a given place and for a given subject,
the spectral tilt can be thought of as occurring in a
family of curves; if the same fricative is produced with
greater effort, the spectrum tends to have higher
amplitudes overall and a less negative slope. Voiced
fricatives with the same place will have a set of curves
with a similar relationship of spectral tilt to effort
that is less than, but overlapping with, the range for
their voiceless versions. However, these differences,
while predictable from an understanding of flow
noise sources, do not sufficiently distinguish fricatives
(Jesus and Shadle, 2002). Finally, voicing changes
during the transition for both voiced and voiceless
fricatives, presumably to allow sufficient pressure
drop across the constriction to support frication.
Many researchers have pursued methods of characterizing fricative spectra by statistical moments,
as if they were probability distributions. Recently
Forrest et al. (1988) described their calculation of spectral moments, indicating that these were sufficient to distinguish stops; applied to fricatives, the moments distinguished /s, ʃ/ from each other and from /f, θ/, but did not distinguish /f/ and /θ/ from each other at all. More recent studies have used methods
of computing the moments that showed that certain
moments of the English voiceless fricatives were statistically significantly different, but the differences
were not enough to allow for categorization.
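For reference, the first four spectral moments are computed by treating the normalized power spectrum as a probability distribution over frequency, roughly as sketched below; the frequency range, amplitude threshold, and spectral estimate fed into this calculation are exactly the choices whose influence is discussed in the next paragraph.

    import numpy as np

    def spectral_moments(power_spectrum, freqs):
        """Centroid, variance, skewness, and excess kurtosis of a power spectrum."""
        p = power_spectrum / np.sum(power_spectrum)       # normalize to unit area
        m1 = np.sum(freqs * p)                            # centroid, Hz
        m2 = np.sum((freqs - m1) ** 2 * p)                # variance, Hz^2
        m3 = np.sum((freqs - m1) ** 3 * p) / m2 ** 1.5    # skewness (dimensionless)
        m4 = np.sum((freqs - m1) ** 4 * p) / m2 ** 2 - 3  # excess kurtosis
        return m1, m2, m3, m4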
Spectral moments capture the gross distribution of
energy over the chosen frequency range, but ignore
particular features that we can attribute to particular
production methods, such as back-cavity formants
appearing in the transition regions, or the salience
and frequency of spectral troughs. In addition,
the gross parameters captured depend greatly on the
particular spectral representation from which the
moments were calculated. Ideally, a low-variance
spectral estimate would be used, but this has not typically been done. Computing the moments does some spectral smoothing as with frequency interpolation, but with more bias, and amplitude thresholding
and frequency range can affect the results dramatically. One question was whether using a better spectral
estimate before computing moments might improve
results. It appears that starting with a good spectral
estimate helps, but only marginally; new parameters
are needed for significant gains (Blacklock, 2004).
The best parameters appear to be based on multitaper spectra, with frequency range to 20 kHz, amplitude threshold carefully controlled, and identical
recording conditions across subjects, as shown in
Figures 4 and 8. Figure 8 shows examples of multitaper spectrograms of the voiceless fricatives uttered
in the same vowel context, by the same subject. While
differences between the fricatives are apparent in
these examples, the problem is to find characteristics
that hold up across tokens, contexts, and subjects.
Men and women require different parameters; an
examination of the variance of the mean power spectral density in 12 subjects indicated that /s, ʃ/ can best be distinguished from each other at 2.5 kHz
for men, 3.0 kHz for women. The main spectral peak
in /f/ occurred at 2, 4, or 7 kHz for men, but most
often at 2 kHz; for women the peak occurred at 2, 4,
or 8 kHz, and was more evenly distributed among
these frequencies. Spectral variation within particular
tokens was also examined, with somewhat inconclusive results. Clearly multitaper analysis is a powerful
tool that bears further investigation (Blacklock,
2004).
Finally, voiced fricatives often devoice, with the
amount somewhat dependent on language (studies
on English, French, and Portuguese are cited in Jesus
and Shadle, 2002) as well as with fricative place
(posterior fricatives devoice more often) and position
within the phrase (end of sentence devoices more
often). Devoicing allows more air pressure to be
dropped across the supraglottal constriction, thus
strengthening the noise source. However, it appears
that in some cases the fricative 'denoises' instead, with additional pressure drop being used across the glottis, strengthening the voicing source. Voiced fricatives are shorter in duration than their voiceless equivalents in all languages studied.

Figure 8 Multitaper spectrograms of [f] from buffoon, [θ] from Methuselah, [s] from bassoon, and [ʃ] from cashew, same British male speaker as in Figure 4. (After Blacklock, 2004.)

The modulation
of the noise source by the voicing source indicates
that the phase of the modulation changes rapidly in
the transition into and out of the fricative (Jackson
and Shadle, 2000). This may be a feature that humans
notice and use in identification; further studies await.
Stops and Affricates

Stops are a relatively well-understood class. The manner in which they are articulated is related to the
temporal events that are observable in the time waveform; the place at which they are articulated is related
mainly to spectral cues. Before the stop begins, articulators are moving toward closure; if the stop occurs
postvocalically, formant transitions will occur that
offer place cues. For the stop itself, first is the period
of closure, during which no air exits the vocal tract;
voicing may continue briefly but no other sounds are
produced. When closure is released, there may be the
release burst, followed by brief frication as the articulators move apart, followed by aspiration and, finally,
by voice onset. After voice onset the formants are
more strongly excited, and transitions characteristic of the stop's place will again be observable.
Not all of these stages occur with every stop. If the
stop is preceded by /s/, it has a closure period but no
burst release. Syllable-final stops are often not released. The frication period is not always present
and distinguishable from aspiration. Both frication
and aspiration may be missing in voiced stops; they
tend to be present in voiceless stops, but formant
transitions are less obvious in the vowel occurring
after the stop.
These latter two points are related to one of the
stronger cues to voicing of a stop, the voice onset time
(VOT). The VOT is the time between stop release and
voice onset. In voiced stops, although voicing may
well cease during closure as the pressure builds up in
the vocal tract, the vocal folds remain adducted;
when the supraglottal pressure suddenly drops following release, phonation begins again quickly, leading to a short VOT. In voiceless stops, the vocal folds
are abducted and take time to be adducted for the
following voiced segment, leading to a long VOT.
Aspiration noise is produced near the glottis because
the glottis, while narrowing, provides a constriction
small enough to generate turbulence noise.
Experiments in which the VOT has been varied
in synthetic stimuli have shown that VOT alone
produces a categorical discrimination between voiced
and voiceless stops, with a threshold value of 20
30 ms. However, VOT varies to a smaller extent by

place, with velar stops having longer VOT than bilabial stops; this difference is as much as 20 ms. Finally,
VOT varies with speech rate, with values shortening
at higher rates.
The main spectral cues in stops are the burst spectral shape and the formant transitions in adjacent
vowels. Additional cues lie in the spectral shape of
the frication interval, but this is so brief, relatively
weak, and time-varying that it is much less easy to
analyze. The spectral shape of all three is related to
the movement of the articulators toward closure for
the stop. It can be shown that any narrowing in the
anterior half of the vocal tract will cause the first
formant to drop in frequency. The direction of frequency change in F2 and F3 depends on the place
of the target constriction (of the stop) and the position of the tongue before the movement began (the
vowel front- or backness). As demonstrated initially
by Delattre et al. (1955) and cited in numerous references since, for bilabial stops all formants decrease in
frequency when moving toward the stop (i.e., whether observing formant transitions pre- or poststop);
a clear example of this is seen for be in Figure 7B.
For velar stops, F1 and F3 decrease; F2 increases
when moving toward the stop. For alveolar stops,
F1 and F3 decrease; F2 increases for back vowels
and decreases for front vowels. But note that in
Figure 7A, the vowel formants in hawk do not
change noticeably near the closure.
The burst spectra follow related patterns, since
they are produced by an impulse excitation of the
vocal tract just after closure is released. For bilabials,
the spectrum has its highest amplitude at low frequencies and falls off with frequency. Alveolars have high amplitude at 3–5 kHz, and velar bursts have their highest amplitude at 1–3 kHz. Though these are referred to, respectively, as having shapes that are falling, rising, and indeterminate (or compact, or midfrequency), these terms are relative to a frequency range of 0 to, at
most, 5 kHz. The [t] in time in Figure 7B shows a
striking burst, frication, aspiration sequence, which
extends up to 12 kHz. The theoretical burst spectral shapes are roughly similar to those of fricatives
at each place, as we would expect, since all backcavity resonances should be cancelled immediately
postrelease, and the front-cavity resonances are
excited.
Affricates can be thought of as a combination of a
stop and a fricative, but with some important differences in timing and place from either. The closure and
release of a stop are evident, but the frication period is
long for a stop and short for a fricative. Aerodynamic
data indicate that the constriction opens more slowly
for /tʃ/ than for /t/, directly supporting the longer frication duration for the affricate compared to the stop (Mair, 1994). The rise time for the frication noise for /tʃ/ is significantly shorter than for /ʃ/ (Howell and
Rosen, 1983).

Conclusion
We have surveyed some aspects of acoustics, recording equipment, and techniques, so that appropriate
choices can be made. It is possible to compare speech
analysis results using recordings that were not made
in the same way, provided that information such as the type of microphone and its position relative to the speaker has been noted, ambient noise has been
recorded, and so on.
By the same token, signal processing principles and
techniques have been reviewed so that the techniques
can be chosen appropriately for both the signal type
(whether periodic, noisy, or a combination) and the
information sought (absolute level, formant frequencies, properties of the voice source, etc.). Some parameters must be estimated and the analysis done twice or more, iterating. Other steps must be done correctly the first time, such as antialiasing before sampling a signal. Each of the different methods of spectral analysis
has its place; the choice of which is best depends not
only on the type of speech sound being studied, but
also on the speaker.
Finally, the basic manner classes of speech have
been reviewed and parameters that can be used for
classification discussed.
See also: Phonetics, Articulatory; Voice Quality.

Bibliography
Beautemps D, Badin P & Laboissière R (1995). Deriving vocal-tract area functions from midsagittal profiles and formant frequencies: a new model for vowels and fricative consonants based on experimental data. Speech Communication 16, 27-47.
Bendat J S & Piersol A G (2000). Random data: analysis and measurement procedures (3rd edn.). New York: John Wiley and Sons, Inc.
Beranek L (1954). Acoustics. New York: McGraw-Hill Book Co. Reprinted (1986). New York: Acoustical Society of America/American Institute of Physics.
Blacklock O (2004). Characteristics of variation in production of normal and disordered fricatives, using reduced-variance spectral methods. Ph.D. thesis, School of Electronics and Computer Science. UK: University of Southampton.
Catford J C (1977). Fundamental problems in phonetics. Bloomington, IN: Indiana University Press.
Crystal D (1991). A dictionary of linguistics and phonetics (3rd edn.). Oxford: Blackwell Publishers Inc.
Delattre P C, Liberman A M & Cooper F S (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27, 769-773.
Fant C G M (1962). Sound spectrography. Proceedings of the 4th International Congress of Phonetic Sciences. The Hague: Mouton. 14-33. Reprinted in Baken R J & Daniloff R G (eds.) Readings in clinical spectrography of speech. San Diego, CA: Singular Publishing Group and Pine Brook, NJ: Kay Elemetrics Corp.
Fant G (1970). Acoustic theory of speech production. The Hague: Mouton.
Flanagan J L (1972). Speech analysis, synthesis and perception (2nd edn.). New York: Springer-Verlag.
Flanagan J L & Cherry L (1969). Excitation of vocal tract synthesizers. Journal of the Acoustical Society of America 45, 764-769.
Forrest K, Weismer G, Milenkovic P & Dougall R N (1988). Statistical analysis of word-initial voiceless obstruents: preliminary data. Journal of the Acoustical Society of America 84(1), 115-123.
Gold B & Morgan N (2000). Speech and audio signal processing. New York: John Wiley & Sons, Inc.
Howell P & Rosen S (1983). Production and perception of rise time in the voiceless affricate/fricative distinction. Journal of the Acoustical Society of America 73, 976-984.
Jackson P J B & Shadle C H (2000). Frication noise modulated by voicing, as revealed by pitch-scaled decomposition. Journal of the Acoustical Society of America 108(4), 1421-1434.
Jackson P J B & Shadle C H (2001). Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech. IEEE Transactions on Speech and Audio Processing 9(7), 713-726.
Jesus L M T & Shadle C H (2002). A parametric study of the spectral characteristics of European Portuguese fricatives. Journal of Phonetics 30, 437-464.
Johnson K (2003). Acoustic and auditory phonetics (2nd edn.). Oxford: Blackwell Publishers.
Kent R D & Read C (1992). The acoustic analysis of speech. San Diego: Singular Publishing Group.
Ladefoged P (2001). Vowels and consonants. Oxford: Blackwell Publishing.
Mair S (1994). Analysis and modelling of English /t/ and /tsh/ in VCV sequences. Ph.D. thesis, Dept. of Linguistics and Phonetics. UK: University of Leeds.
McClellan J H, Schafer R W & Yoder M A (1998). DSP first: A multimedia approach. Upper Saddle River, NJ: Prentice Hall.
Olive J P, Greenwood A & Coleman J (1993). Acoustics of American English speech: a dynamic approach. New York: Springer-Verlag.
Peterson G E & Barney H L (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America 24, 175-184.
Pierce A D (1981). Acoustics. New York: McGraw-Hill Book Co.
Rabiner L R & Schafer R W (1978). Digital processing of speech signals. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Shadle C H (1985). The acoustics of fricative consonants. Ph.D. thesis, Dept. of ECS, MIT, Cambridge, MA. RLE Tech. Report 504.
Shadle C H (1990). Articulatory-acoustic relationships in fricative consonants. In Hardcastle W J & Marchal A (eds.) Speech production and speech modelling. Dordrecht: Kluwer Academic Publishers.
Stevens K N (1971). Airflow and turbulence noise for fricative and stop consonants: static considerations. Journal of the Acoustical Society of America 50, 1180-1192.
Stevens K N (1998). Acoustic phonetics. Cambridge, MA: The MIT Press.
Sundberg J (1987). The science of the singing voice. DeKalb, IL: Northern Illinois University Press.
Titze I (2000). Principles of voice production (2nd printing). Iowa City, IA: National Center for Voice and Speech.
Whalen D H (1991). Perception of the English /s/-/ʃ/ distinction relies on fricative noises and transitions, not on brief spectral slices. Journal of the Acoustical Society of America 90(4:1), 1776-1785.

Phonetics, Forensic
A P A Broeders, University of Leiden, Leiden and
Netherlands Forensic Institute, The Hague, The
Netherlands
© 2006 Elsevier Ltd. All rights reserved.
This article is reproduced from the previous edition, volume 6,
pp. 3099-3101.

The term forensic phonetics refers to the application of phonetic expertise to forensic questions. Forensic
phonetics was a relatively new area at the beginning
of the 1990s. The kind of activity that its practitioners
are probably most frequently involved in is forensic
speaker identification. Forensic phoneticians may act
as expert witnesses in a court of law and testify as to
whether or not a speech sample produced by an unknown speaker involved in the commission of a crime
originates from the same speaker as a reference sample that is produced by a known speaker, the accused.
Other activities in which forensic phoneticians may
be engaged are speaker characterization or profiling,
intelligibility enhancement of tape-recorded speech,
the examination of the authenticity and integrity of
audiotape recordings, the analysis and interpretation
of disputed utterances, as well as the analysis and
identification of non-speech sounds or background
noise in evidential recordings. In addition, forensic
phoneticians may collaborate with forensic psychologists to assess the reliability of speaker recognition
by earwitnesses.

Speaker Recognition: Identification versus Verification
Speaker identification, whether in a forensic context
or otherwise, can be regarded as one form of speaker
recognition. The other form is usually called speaker
verification. There are a number of important differences in the sphere of application and in the methodology employed in these two forms of speaker recognition. Speaker identification is concerned with establishing the identity of a speaker who is a member
of a potentially unlimited population, verification
with establishing whether a given speaker is in fact
the one member of a closed set of speakers which he
claims to be.
There are basically two different approaches to
speaker recognition. One is linguistically oriented,
the other is essentially an engineering approach. In
recent years, considerable progress has been made
by those taking the latter approach, culminating
in the development of fully operational automatic
speaker verification systems. In a typical verification
application, a person seeking access to a building or
to certain kinds of information will be asked to pronounce certain utterances, which are subsequently
compared with identical reference utterances produced by the speaker he claims to be. If the match
is close enough, the speaker is admitted; if not, he is
denied access.
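The accept/reject step of such a verification system can be sketched as a simple similarity-plus-threshold decision. The example below is a toy: the fixed-length 'voice feature' vectors, the cosine-similarity measure, and the 0.85 threshold are illustrative assumptions, not the matching procedure of any particular operational system.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length voice feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(claimed_reference: np.ndarray, test_utterance: np.ndarray,
           threshold: float = 0.85) -> bool:
    """Accept the identity claim if the test features are close enough to the
    enrolled reference for the claimed speaker (closed-set decision)."""
    return cosine_similarity(claimed_reference, test_utterance) >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=32)                      # enrolled voice features (illustrative)
    same_speaker = reference + 0.05 * rng.normal(size=32)
    impostor = rng.normal(size=32)
    print(verify(reference, same_speaker))  # expected: True
    print(verify(reference, impostor))      # expected: False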
Automatic Speaker Recognition

The success of automatic speaker verification systems has stimulated research into the application of similar
techniques to the field of automatic speaker identification for forensic purposes. Although attempts in
this field have been numerous, they have not so far
been successful. The problem is that speaker identification, especially in the forensic context, is a much
more complex affair than speaker verification. While
the unknown speaker in a verification context is a
member of a closed set of speakers, the suspect in a
forensic identification context is a member of a much
larger group of speakers whose membership is not
really known and for whom no reference samples
are available. Recording conditions may vary quite
considerably in the identification context and speakers cannot be relied upon to be cooperative. They may
attempt to disguise their voices, they may whisper,
adopt a foreign accent, or speak a foreign language.


Phonetics, Articulatory
J C Catford, University of Michigan, Ann Arbor,
MI, USA
J H Esling, University of Victoria, Victoria, British
Columbia, Canada
© 2006 Elsevier Ltd. All rights reserved.

Articulatory phonetics is the name commonly applied to traditional phonetic theory and taxonomy, as
opposed to acoustic phonetics, aerodynamic phonetics, instrumental phonetics, and so on. Strictly speaking, articulation is only one (though a very important one) of several components of the production of speech. In phonetic theory, speech sounds,
which are identified auditorily, are mapped against
articulations of the speech mechanism.
In what follows, a model of the speech mechanism
that underlies articulatory phonetic taxonomy is first
outlined, followed by a description of the actual classification of sounds and some concluding remarks.
The phonetic symbols used throughout are those
of the International Phonetic Association (IPA) as revised in 1993 and updated in 1996. (A chart of the International Phonetic Alphabet is given in the entries: International Phonetic Association and Phonetic Transcription: History.)

The Phases of Speech


When someone speaks, what happens is somewhat
as follows. In response to a need to communicate
about some state of affairs or some event, the speaker
conceptualizes the event in a particular way and
encodes that conceptualization in accordance with
the grammatical rules of his/her language. The linguistically encoded message is then externalized and
apprehended by the hearer through the agency of a
sequence of events that are called the phases of
speech.
These begin in the speaker and, assuming the hearer knows the speaker's language, culminate in the
hearer decoding and understanding the utterance,
that is, reaching a conceptualization that closely
matches that of the speaker, which was the start of
the process.
The purely phonetic part of this process may be
said to begin with the execution of a short-term neural program in the central nervous system, which is
triggered by the lexicogrammatical structure of the
utterance and determines the nature and the sequencing of what follows. This can be called the central
programming phase of the speech process. Thereafter, in a sequence and a rhythm presumably determined in the central programming phase, motor
commands are transmitted through motor nerves
to muscles in the chest, throat, mouth, etc., which
contract in whole or in part, successively or simultaneously, more or less strongly. This constitutes the
neuromuscular phase of the process.
As a result of the muscular contractions, the organs
to which the muscles are attached adopt particular
postures and move about in particular ways. These
postures and movements of whole organs (the rib cage, the vocal folds, the tongue, the lips, and so on) constitute the organic phase of speech. The successive and overlapping postures and movements of the
organs act upon the air within the vocal tract, compressing or dilating it, setting it moving in rapid puffs,
in sudden bursts, in a smooth flow, in a rough, eddying turbulent stream, and so on. This is the aerodynamic phase of speech.
The things that happen to the air as it flows
through the vocal tract during the aerodynamic
phase generate sound waves, and this is the acoustic
phase. In the acoustic phase, an airborne sound wave
radiates from the speaker's mouth to impinge upon
the eardrum of a hearer, setting it vibrating in step

with the waveform. These vibrations are transmitted through the middle ear to the inner ear, where
they stimulate sensory nerve endings of the auditory
nerve, sending neural impulses into the brain, where
they give rise to sensations of sound. This process of
peripheral stimulation and afferent neural transmission may be called the neuroreceptive phase. Finally,
the incoming neuroreceptive signals are identified as particular vocal sounds or sound sequences: neurolinguistic identification. In the actual exchange of conversation, identification may normally
be below the threshold of consciousness, since attention is directed more to the meaning of what is
said than to the sounds by which that meaning is
manifested.
These phases can be summarized as follows:
(a) Central programming: determining what follows.
(b) Neuromuscular: motor commands and muscle
contractions.
(c) Organic: postures and movements of organs.
(d) Aerodynamic: pressure changes and airflow
through the vocal tract.
(e) Acoustic: propagation of sound wave from the
speaker's mouth.
(f) Neuroreceptive: peripheral auditory stimulation
and transmission of inbound neural impulses.
(g) Neurolinguistic identification: potential or actual
identification of incoming signals as specific
speech sounds.
To these phases may be added two kinds of feedback: kinesthetic feedback, that is, proprioceptive
information about muscle contractions and the movements and contacts of organs, fed back into the central nervous system, and auditory feedback, that is,
stimulation of the speaker's own peripheral hearing
organs by the sound wave issuing from the mouth and
reaching the ears by both air conduction and bone
conduction.
Of the seven phases of speech described above,
only three lend themselves conveniently to categorization for general phonetic purposes: the organic
phase, the aerodynamic phase, and the acoustic
phase. All three of these phases can only be fully investigated instrumentally: the organic phase by means of radiography and fiberoptic laryngoscopy,
the aerodynamic phase by air pressure and airflow
measurements, and the acoustic phase by various
types of electronic acoustic analysis. However, a
good deal can be learned about the organic phase
by direct external observation and by introspective
analysis of the proprioceptive and tactile sensations
derived from kinesthetic feedback.
It is not surprising, therefore, that articulatory phonetic taxonomy has always been primarily based on the organic phase: the observation and categorization of the organic activities that give rise to speech.
This was the basis of the remarkably sophisticated
description of the sounds of Sanskrit by the earliest
phoneticians known to modern linguists, the Indian
grammarians of 2500 years ago (see Phonetic Transcription: History). The organic phase was also the
basis for the phonetic observations of the Greek and
Roman grammarians, the Medieval Arab grammarians, and the English phoneticians from Elizabethan
times onward.
Modern articulatory phonetics, deriving largely from the work of 19th-century European phoneticians, such as Jespersen (see Jespersen, Otto (1860-1943)), Passy (see Passy, Paul Edouard (1859-1940)), Sievers (see Sievers, Eduard (1850-1932)), Viëtor (see Viëtor, Wilhelm (1850-1918)), and especially the British phoneticians Melville Bell (see Bell, Alexander Melville (1819-1905)), Alexander Ellis (see Ellis, Alexander John (né Sharpe) (1814-1890)), and Henry Sweet (see Sweet, Henry (1845-1912)), is still largely based upon the organic phase, with some contributions from 20th-century instrumental studies of the aerodynamic and acoustic phases.

Components of Speech Production


In the production of speech, organic postures and
movements initiate, control, and modulate the flow
of air through the vocal tract in ways that generate
sound. In other words, the sounds of speech result
from the conversion of muscular energy into acoustic
energy through the mediation of the aerodynamic
phase. In this sound-productive process, there are
two essential, basic, components: airstream mechanism, which sets the air in motion, and articulation,
which controls the air flow (arrests it, accelerates it,
interrupts it, etc.) in ways that generate specific types
of sound. There is a third component, present in most
sounds, namely those that involve a flow of air
through the larynx. This is phonation, an activity in
the larynx that modulates the stream of air utilized
in the articulation.
Airstream Mechanism (Initiation)

The airstream mechanism is that component of the sound-producing process which compresses or rarefies air in the vocal tract and, thus, initiates a flow
of air through the tract. Because of its initiatory function, the airstream mechanism is also known as initiation, and both terms are used interchangeably here.
The organ, or group of organs, involved in the process
constitutes an initiator. Initiation types are classified
according to the location of the initiator within the
vocal tract (pulmonic, glottalic, or velaric) and the direction of the initiatory movement (that is, whether it generates positive pressure in the adjacent
part of the vocal tract, setting up an outward, or
egressive, air flow, or negative pressure, setting up
an ingressive flow). Thus, excluding esophagic initiation, used only by laryngectomized persons (see
Disorders of Fluency and Voice) and a few other
very minor types, there are six basic types of initiation: pulmonic egressive/ingressive, glottalic egressive/ingressive, and velaric egressive/ingressive.
Pulmonic Initiation In pulmonic egressive initiation, the initiator is the lungs, which, by decreasing
in volume, generate positive pressure in the adjacent,
subglottal, part of the vocal tract and thus tend to
initiate an egressive flow of air, upward and outward
through the trachea (windpipe), larynx, pharynx,
mouth, and/or nose. This is by far the commonest
airstream mechanism, used all of the time in a majority of languages, and most of the time in the approximately 30% of all languages that also utilize other
airstream mechanisms.
In pulmonic ingressive initiation, the lungs increase in volume, generating negative pressure and
thus initiating an ingressive flow, inward and downward through the vocal tract. Pulmonic ingressive
initiation (talking on inhalation) is not regularly used
in any ordinary language, although a pulmonic ingressive type of [l#] occurs in Damin, the ritual language of
the Lardil people of Mornington Island, Australia. In
other languages, for example English, occasional
words may be pronounced with this airstream mechanism, and longer utterances may be spoken ingressively, to disguise the voice, or simply for fun.
Glottalic Initiation In glottalic egressive initiation,
the glottis (the space between the vocal folds) is
closed; at the same time, the soft palate is raised and
there is a stricture somewhere in the mouth, most
commonly a complete closure, as for a stop consonant such as [p] or [k]. Consequently, a small quantity
of air is trapped between the glottal closure in the
larynx and the oral closure. The larynx is then suddenly raised, compressing the trapped air. When the
oral closure is released, there is a sudden outflow of
air, producing a sharp or hollow popping sound
noticeably distinct from the less sharp noise-burst on
the release of a pulmonic stop.
The oral stricture need not be a complete closure
but can be a narrow, rather tight channel, as for a
fricative, such as [f] or [s]. In this case, the upward
thrust of the larynx drives a high-velocity turbulent
stream of air through the articulatory channel, producing a sharp hiss noise, of fairly short duration,
because of the small quantity of air available.

Glottalic egressive sounds are often known as ejectives or, somewhat misleadingly, as glottalized sounds; misleadingly, because the use of the -ized form suggests that the glottal component is a secondary articulation (see Modified Articulations section, this article), whereas in fact it is neither secondary nor
articulatory, being a feature of the basic airstream
mechanism of the sound. Ejective stops are regularly
used in about 20% of the world's languages; the
corresponding fricatives only in about 4%. Ejectives
occur in all 37 Caucasian languages (see Caucasian
Languages), in a number of Afro-Asiatic (see Afroasiatic Languages) and American Indian languages, particularly in the Na-Dene (see Na-Dene Languages),
Salish, and Penutian groups, and sporadically elsewhere. In the alphabet of the International Phonetic
Association, glottalic egressive (ejective) sounds are
indicated by means of an apostrophe placed after the
appropriate symbol, thus [t'], [k'], [f'], [s'], etc.
In glottalic ingressive initiation, as for ejectives,
the glottis is closed, and there is an oral stricture, but
this time the larynx is jerked downward, rarefying the
air trapped between the glottal and oral strictures.
Sounds with this type of initiation might be called
inverse ejectives, but are generally known as implosives. Simple implosives, such as have just been
described, are extremely rare. Although voiceless
implosives are attested, for example in the Quichean
languages of Guatemala, most commonly, in the 10%
or so of the world's languages in which they occur,
glottalic ingressive sounds (normally stops) are voiced
(see Principal Types of Phonation section, this article), and symbolized [ɓ], [ɗ], etc. In these voiced
implosives, the glottis is not tightly closed, but disposed for the production of voice. Thus, when the
larynx is jerked downward, creating a region of negative pressure above it, a small quantity of air is
drawn upward through the glottis, setting the vocal
folds in vibration. The amount of air drawn up into
the supraglottal cavities during the oral closure is
usually insufficient to raise the pressure there to the
atmospheric level. Consequently, when the oral closure is released, there is normally a momentary influx
of air into the mouth.
It has sometimes been claimed that voiced implosives utilize a pulmonic egressive airstream, but
though there may occasionally be a brief period during an early part of the closed phase of a voiced
implosive when the air is indeed being driven through
the glottis by pulmonic pressure, at the moment of the
actual implosion the air is sucked up through the
glottis by the vacuum created by the sudden lowering
of the larynx and expansion of the pharynx. The
initiation of the implosion is thus purely glottalic
ingressive.

Velaric Initiation Velaric ingressive initiation is taken first because it is more familiar than the egressive type. Velaric initiation is performed entirely within the mouth. The dorsal surface of the tongue forms
a closure against the roof of the mouth (largely, but
not exclusively, against the soft palate, or velum,
hence the name velaric). A very small quantity of
air is trapped between this dorsal, initiatory, closure
and a second, articulatory, closure, at the lips, at the
teeth, or behind the teeth. The initiatory rarefaction
of the trapped air is effected by a downward movement of the centre of the tongue, or, in the case of
articulatory closure at the lips, a downward movement of the jaw. Velaric ingressive sounds are also
known as clicks, and the most familiar one is the
dentalveolar click, represented in writing as tut tut
or (more explicitly) as tsk tsk (IPA [ǀ]), used by English speakers and other western Europeans as an exclamation of mild regret or annoyance. In the eastern
Mediterranean and the Middle East, a single tsk,
usually accompanied by a backward toss of the head,
is part of a common gesture of negation or rejection.
The reader can get an impression of the velaric ingressive mechanism by repeatedly saying this click,
slowly and introspectively, noting the feeling of tension and suction in the tongue just before the tongue
tip breaks away.
Other velaric ingressives include the alveolar lateral click [ǁ], in which the articulatory gesture is made by the side(s) of the tongue breaking away from the (front) molar teeth (a sound used to urge on a horse), and the bilabial ('kiss') click [ʘ]. Such sounds,
though common as paralinguistic, interjectional, or
gestural sounds, occur as regular linguistic sounds
only in the Khoisan (see Khoesaan Languages) and
Southern Bantu languages of Africa and in Damin
(see Pulmonic Initiation section earlier in this
article).
Velaric egressive initiation involves much the
same organic configuration as velaric ingressive, except that the trapped air is compressed by an upward
squeezing, or forward movement, of the tongue, so
that there is a brief efflux of air on the release of the
articulatory closure. Such sounds, like clicks, are
sometimes used, though more rarely, as interjectional
sounds. In particular, a velaric egressive bilabial [8"]
may be combined with a shoulder-shrugging gesture
of dismissal or exculpation.
Airstream Mechanisms Summarized Table 1 shows
the six basic types of initiation.
The initiation types named in bold type are those
regularly utilized in normal languages. The short
names in parentheses are commonly used for them,
especially with reference to stops.

Table 1 Types of initiation

Location (initiator)           Direction, i.e., movement generating
                               Positive pressure                  Negative pressure
Lungs                          pulmonic egressive (plosive)       pulmonic ingressive
Larynx                         glottalic egressive (ejective)     glottalic ingressive (implosive)
Tongue (with velar closure)    velaric egressive                  velaric ingressive (click)

In traditional articulatory phonetics, pulmonic egressive, being the normal, or most frequent, type, is not usually named
in the description of sounds; only the nonpulmonic
types are explicitly named.
Other Initiatory Phenomena During speech, the
pulmonic egressive initiator drives air ahead of it
against the resistance imposed by the air pressure
against which it is moving. Varying phonatory and
articulatory strictures impose varying degrees of impedance on the flow of air through the vocal tract. The
backpressures caused by these impediments to flow
react on the initiator, which either is slowed down by
them, or must work harder to overcome the resistance. In other words, the initiator is constantly exerting a varying initiator power during speech.
Initiator power is essentially what has traditionally been called stress in articulatory phonetics,
commonly defined as force (e.g., Sweet, 1877; Jones,
1922) but also as initiator pressure (Pike, 1943), reinforced chest-pulse (Abercrombie, 1967), or increase
in respiratory activity (Ladefoged, 1975). Some of
these definitions of stress are controversial; thus
Ohala (1990), on the basis of considerable instrumental evidence, questions the claim that stress necessarily involves independent action of the respiratory
system.
Stress, no matter how it is defined, is infinitely
variable. However, it is customary in traditional articulatory phonetics to speak as if there were two
distinct degrees of stress, called stressed and unstressed. This reflects the fact that in the phonology
of many languages two degrees of stress are utilized
contrastively. If three degrees of stress are recognized,
they are usually called primary stress, secondary
stress, and unstressed. Primary and secondary stress
are symbolized by [ˈ] and [ˌ] respectively, placed
before the stressed syllable, unstressed being left
unmarked.
Most of the scholars mentioned above distinguish
between stress (as some kind of force or dynamic
effect, however produced) and prominence. Prominence is the degree to which a sound or syllable stands
out from those surrounding it. It is generally agreed
that stress, duration, pitch, and inherent sonority are
all factors that may contribute to prominence.

Another phonetic phenomenon which may be partly related to initiatory activity is the syllable. There
is no universally accepted definition of the syllable.
Nevertheless, it is convenient to be able to mark
intuitively determined syllable boundaries in speech,
and this can be done with the IPA symbol [.]. Normally, each vowel constitutes a separate syllable
peak (but see the section in this article regarding
diphthongs), and flanking consonants constitute syllable margins. There are cases, however, where intuitively determined syllable boundaries occur between
vowels, with no intervening consonant, and these
may be indicated by the IPA symbol [.], for example [ri.ækt]. When a consonant is syllabic, it is marked by the IPA syllabicity diacritic, for example, middle [mɪdl̩] or lightening [laɪtn̩ɪŋ] (the gerundive form of the verb lighten, as opposed to the noun lightning).
In many languages, including English, in addition
to syllables, initiatory activity appears to be parceled
out into chunks, each containing one or several syllables, and all (at least within any one short stretch,
such as a single intonation group) of very roughly the
same duration. Each of these relatively equal chunks
of initiator activity is called a stress-group or rhythmic group or foot. Within each foot, stress appears
to peak near the beginning, then decreases, to peak
again near the beginning of the next foot, and so
on. Consequently, the first (or only) syllable within a
foot is more strongly stressed than the remaining
syllable(s) of the foot.
The following example illustrates syllables, marked
off by [.] between them, and feet, marked off by single
vertical lines. In addition, the double lines at each end
show the boundaries of an intonation group, while
bold type indicates the tonic syllable, that is, the one
that carries the major pitch movement, in this case a
falling, mid to low tone, within the intonation group.
Notice how the difference in foot division differentiates between the adjective + noun sequence black bird in (1) and the compound noun blackbird in (2).
Stresses are also (redundantly) marked, as a reminder
that in each foot the initial syllable has a stress imposed upon it by its location under the stress peak at
the start of the foot.
(1) || ˈJohn.saw.a | ˈblack | ˈbird.here | ˈyes.ter.day ||

(2) || ˈJohn.saw.a | ˈblack.bird.here | ˈyes.ter.day ||

Stress, syllables, and feet, as well as tone and intonation (see Unphonated Sounds section, this article) and
the duration of sounds, are commonly treated under
the heading of suprasegmentals or prosodic features
(see Prosodic Aspects of Speech and Language).
Phonation

Phonation refers to various modulations imposed upon the airstream as it passes through the larynx.
Phonation may therefore be defined as a laryngeal
component of speech production which is neither
initiatory nor articulatory in function.
Principal Types of Phonation
(a) Breath, producing voiceless sounds: the glottis is
open (vocal folds abducted), so that the airstream
passes through with minimal obstruction. This is
the phonation type of voiceless fricatives such as
[f], [s], [x], [h], etc.
(b) Whisper, producing whispered sounds: the
epilaryngeal tube or aryepiglottic sphincter
(Esling, 1996, 1999) is constricted so that the
airstream passing through the glottis becomes
markedly turbulent, generating a strong hushing
sound.
(c) Voice, producing voiced sounds: the vocal folds
are approximated (adducted) and set in vibration
as the airstream passes through between them.
(d) Creak: the glottis is shortened by constriction of
the epilaryngeal tube, producing glottal vibration
at very low frequency with a crackling sound.
A number of combinations of these phonation
types are also possible. These include:
(e) Breathy voice: simultaneous breath and voice,
that is, high-velocity airflow through a relatively
open glottis, so that the vocal folds flap in the
breeze: the phonation type of talking when very
much out of breath.
(f) Whispery voice, or murmur: simultaneous whisper and voice.
(g) Creaky voice: simultaneous creak and voice.
In traditional articulatory phonetics, all of these
phonation types are taken account of, though only
the most widely used ones, voiceless and voiced, are
regularly included in tables, such as the table of symbols for consonants on the IPA chart; but note that
diacritics are provided for two other phonation types,
for breathy voice (which could also be used to refer to
whispery voice) and for creaky voice (which could
refer to harsh voice or to other laryngealized effects)

(see States of the Glottis for a full inventory of phonation types).


Unphonated Sounds As was seen above, phonation
necessitates the passage of a stream of air through the
larynx and the glottis (the space between the vocal
folds). Consequently, those types of sound that do not
entail the passage of an airstream through the glottis
are, strictly speaking, unphonated. These unphonated sounds include ejectives, since for them the
glottis is tightly closed and has an initiatory rather
than a phonatory function, and clicks, since the air
involved in their production is entirely contained
within the mouth. In click-using languages, the velaric closure may in fact be of the [k] type or, on
the other hand, it may be a voiced [g], [[] or [N]. In
such cases, the term voiced click is often used,
although this is not strictly accurate. The click,
initiated entirely by rarefaction of air contained within the mouth, is itself unphonated, since the air used
in its initiation and articulation does not pass through
the larynx. The unphonated click is merely being
performed against the background of a voiced
sound. Voiced implosives, on the other hand, are
indeed phonated, since, as observed in the section
on glottalic initiation, they involve an upward (egressive) movement of air through the glottis, which sets
the vocal folds in vibration. For these sounds, the
glottis functions simultaneously as initiator and
phonator.
Glottal stop [ʔ] is another type of sound which is strictly speaking unphonated (though often described as voiceless), the closed glottis functioning as articulator. On the other hand, for the voiceless and voiced glottal fricatives [h] and [ɦ], the glottis functions
simultaneously as phonator and articulator.
Other Phonatory Phenomena
Aspiration When a voiceless sound, particularly a
voiceless plosive, is followed by a vowel, for example,
in the syllable [pa], the voicing for the vowel may
start almost simultaneously with the opening of the
lips, or, on the other hand, there may be a delay before
the voicing starts, so that a short h-like puff of breath
is heard between the release of the stop and the onset
of voicing, [pʰa]. A short voiceless delay of this type is
known as aspiration, and the consonant is said to be
aspirated. If the voicing follows the release of the
articulatory closure with virtually no delay, so that
there is no audible aspiration, it is said to be unaspirated. During the closed phase of the articulation of
an unaspirated stop, the vocal folds are in a position
of prephonation (Esling and Harris, 2005), ready to
spring into vibration the moment the oral closure is released and air begins to flow upward through the glottis. During, or at the end of, the closed phase of an
aspirated stop, however, the glottis is opened, so that
there is a delay before the vocal folds come close
enough together to be set in vibration by the airstream.
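The aspirated/unaspirated distinction described here is commonly quantified as voice onset time (VOT), the interval between the release of the closure and the onset of voicing. The toy function below labels a stop from such measurements; the 25 ms boundary is an illustrative assumption, and real category boundaries vary with language and place of articulation.

def classify_aspiration(release_time_s: float, voicing_onset_s: float,
                        long_lag_threshold_s: float = 0.025) -> str:
    """Label a stop from its voice onset time (voicing onset minus release).

    The 25 ms boundary is illustrative only; actual boundaries differ by
    language and by place of articulation.
    """
    vot = voicing_onset_s - release_time_s
    if vot < 0:
        return f"voiced (prevoiced), VOT = {vot * 1000:.0f} ms"
    if vot <= long_lag_threshold_s:
        return f"voiceless unaspirated, VOT = {vot * 1000:.0f} ms"
    return f"voiceless aspirated, VOT = {vot * 1000:.0f} ms"

if __name__ == "__main__":
    print(classify_aspiration(0.100, 0.108))  # short lag: unaspirated stop
    print(classify_aspiration(0.100, 0.165))  # long lag: aspirated stop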
Though aspirated plosives and affricates are by far
the commonest type of aspirated sounds, aspirated
fricatives, and occasionally other sounds, do also
occur in languages. What are called voiced aspirated
plosives also occur (although much less commonly
than voiceless aspirated plosives), and are often transcribed as [bɦ], [dɦ], etc. For these sounds, the glottis
is apparently configured for murmur or breathy voice
during the stop, and the beginning (sometimes the
whole) of the following vowel also has this type of
phonation. There is, thus, a tendency to transcribe
such sounds with the IPA diacritic for breathy voice,
as [b̤], [d̤], etc.

Pitch Phenomena In the production of voice, the vocal folds vibrate at a frequency determined chiefly
by the tension of the vocal folds and/or the subglottal
air pressure. It is possible to vary the frequency of
vocal-fold vibration over a wide range, and so to
produce the auditory effect of a wide range of
pitches. Languages utilize pitch and pitch changes
in one (or both) of two distinct ways, known as tone
and intonation.
Tone is a distinctive pitch level or pitch movement
associated with a short stretch of speech, often of
syllable length, and a short grammatical unit, such
as a word or morpheme. Languages utilizing pitch in
this way are known as tone languages. Typically, in
a tone language, one word, or grammatical category,
may be distinguished from another purely by tone:
examples of such languages are Chinese, Thai,
Vietnamese, Igbo, Yoruba, etc. (see Tone in Connected
Discourse).
Intonation, on the other hand, is a distinctive patterning of pitches which can be associated with much
longer stretches of speech, many syllables in length,
and with potentially long grammatical units, such as
clauses or sentences. These pitch patterns form intonation contours, also known as intonation groups,
which commonly spread over a number of syllables.
They signal sentence functions (e.g., statement versus
question), major information points, and so on (see
Phonetics and Pragmatics). (For a short example illustrating two English intonation groups, see the end of
Other Initiatory Phenomena section, this article.)
Register and Voice Quality Register is a phonatory
modification associated with a short stretch of speech, often of syllabic length. Like tone, register can be utilized to distinguish one word or grammatical category from another. The phonation types typically used in register distinctions are tentatively
classified in Catford (1964). Although register is
often referred to colloquially as voice quality, voice
quality is technically the third strand of accent defined by Abercrombie (1967) and Laver (1980) as the
long-term quality of a voice, largely extralinguistic.
Laryngeal adjustments for phonation type play a
large role in long-term voice quality, but general modifications in the quality of speech due to articulatory
adjustments in the supralaryngeal vocal tract also
contribute substantially to voice quality (see Voice
Quality).
Articulation

As already noted, in the production of a sound, air in the vocal tract is set in motion by the initiation, or
airstream mechanism; the moving column of air, if it
flows through the larynx, is subjected to phonation,
that is, it undergoes a set of complex modulations.
Finally, the phonated airstream is subjected to a kind
of final shaping, giving rise to a sound of a quite
specific type; this is articulation.
In the classification of different types of articulation, a primary distinction is commonly made between vowels and consonants. This traditional
distinction, which goes back at least to the Greeks
and the Romans, is based on the syllabic function
of the two classes of sound: vowels being those
sounds which form syllables on their own, whereas
consonants have to be combined with vowels to be
pronounceable. However, syllabic function is not
a totally satisfactory criterion for distinguishing between vowels and consonants, and attempts have
been made to base the vowel/consonant dichotomy
on features of articulation; but this, too, is unsatisfactory, since no absolute articulatory boundary can be
drawn between the two classes of sound. The problem of finding a differential definition for vowel and
consonant is discussed at length in Pike (1943: Chap.
5) and more briefly in Catford (1977: 165-167).
In practice, however, the two classes can be kept
reasonably distinct, and a quite different terminology
is used for the description of members of the two
classes. The reasons for this are discussed in the
Vowels section, later in this article. Meanwhile, the
classification of consonants is considered.
Consonants are traditionally classified in terms of
the location of the articulatory stricture within the
vocal tract, that is, place of articulation, and in terms
of other features of articulation which are commonly
all treated under the general heading of manner of articulation. In addition, in traditional articulatory phonetics, initiation type is often included under the
heading of manner.

Manners of Articulation
The articulatory features that constitute manner of
articulation are:
(a) whether the airstream passes solely through
the mouth (oral), the nose (nasal), or both (nasalized);
(b) stricture type, that is, (1) the degree of constriction of the articulatory channel (completely
closed, as for stop articulation, to completely
open, as for open vowels), and (2) whether the
articulation is of a maintainable type (stop, fricative, etc.), or is of a momentary (tap) or gliding
(approximant) type;
(c) whether the airstream passes along the central
(median) line of the mouth, or is forced, by a
median obstruction, to flow along one or both
sides of the mouth (lateral).
Although it is customary, in describing consonants,
to name the place first, it is more convenient
for expository purposes to start with the manner of
articulation.
Principal Manners Described

The listing and description of manners of articulation follows (with some modifications) the order in which
they are presented in the left-hand column of the
chart of the International Phonetic Alphabet. Although not so labeled, this is, in part, a traditional
list of manners.
Stop In a stop, the articulators come together (approach) to form, and hold for an appreciable time, a
complete closure (hold) with buildup of positive or
negative pressure behind it. On the release of the
hold, there is a sudden explosive efflux or influx of air
resulting in a brief burst of noise. Stop is the general
term for this manner of articulation, but, as noted
in the Airstream Mechanisms Summarized section,
there are special terms commonly used for stops produced with different types of initiation: pulmonic
stop is plosive; glottalic egressive stop is ejective
stop (since there can also be ejective affricates and
fricatives); glottalic ingressive stop is implosive; and
velaric ingressive stop or affricate is click.
In the articulation of a stop, either the approach or
the release may be absent, or virtually absent. For
example, if one starts from a position of rest, with
the lips closed, to say such a word as [pa], there is no
observable approach of the articulators, though some preparatory events are probably taking place, such as the raising of the soft palate to close the entrance
to the nasal cavity and some tensing of the muscles of
the lips. At the end of an utterance there may be no
observable or audible release of the closure. Commonly in American English, less commonly in British
English, the final [p] in a word like stop or final [t] in
cat may have no audible release. Thus, the only one of
the three phases of a stop that is absolutely essential
is, in fact, the period of closure, or hold, and this must
always have a perceptible duration. This is why stop
must be classified as a maintainable articulation
type, even though the actual noise produced, if the
closure is released, may be momentary. It is primarily
this feature, the maintained closure, that distinguishes
a stop from a tap or flap.
The closed phase, or hold, of a voiceless plosive can
obviously be maintained, in principle, for as long as
one can hold one's breath. The duration of the closed
phase of a voiced plosive is, however, severely limited
by an aerodynamic constraint. Voicing can be maintained only so long as a stream of air is flowing
upward through the glottis, keeping the vocal folds
in vibration. The airflow for voicing can continue
only so long as there is a pressure difference across
the glottis of about 2 cm of water, or more; but as the
air flows upward into the restricted space behind
the articulatory stricture, the supraglottal pressure
rapidly rises, eventually nullifying the necessary pressure difference. At this point, the glottal vibrations
will cease and the sound will no longer be a voiced
plosive.
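This aerodynamic constraint can be made concrete with a toy calculation: air flowing through the glottis into a sealed supraglottal cavity raises the pressure there until the transglottal difference falls below the roughly 2 cm H2O needed for voicing. All parameter values in the sketch below (cavity volume, subglottal pressure, orifice-flow constant) are illustrative assumptions, and the rigid-wall simplification means it underestimates how long voicing actually persists; as the next paragraph notes, enlarging the cavity (here, increasing V_SUPRA) extends voicing.

# A toy Euler simulation of the aerodynamic limit on voicing during a stop
# closure. Rigid supraglottal walls are assumed, so the predicted time is
# shorter than in real speech, where cavity expansion prolongs voicing.

import math

P_ATM = 1000.0      # atmospheric pressure, roughly, in cm H2O
P_SUB = 8.0         # subglottal pressure above atmospheric (cm H2O), assumed
V_SUPRA = 100.0     # supraglottal cavity volume in cm^3, assumed rigid
FLOW_CONST = 35.0   # orifice constant: flow = FLOW_CONST * sqrt(delta_P) cm^3/s
VOICING_MIN = 2.0   # minimum transglottal difference for voicing (cm H2O)

def voicing_duration(dt: float = 1e-4) -> float:
    """Time (s) until the transglottal pressure difference falls below VOICING_MIN."""
    p_supra = 0.0   # supraglottal pressure above atmospheric
    t = 0.0
    while (P_SUB - p_supra) > VOICING_MIN:
        delta_p = P_SUB - p_supra
        flow = FLOW_CONST * math.sqrt(delta_p)          # cm^3/s through the glottis
        p_supra += (P_ATM / V_SUPRA) * flow * dt        # isothermal pressure rise
        t += dt
    return t

if __name__ == "__main__":
    print(f"voicing ceases after about {voicing_duration() * 1000:.1f} ms (rigid-wall toy model)")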
The duration of voicing for a voiced plosive can be
extended by enlarging the supraglottal cavity, chiefly
by lowering the larynx, thus delaying the moment
when the rising supraglottal pressure arrests the
transglottal airflow. If the larynx-lowering is carried
out very abruptly (and accompanied by little or no
pulmonic pressure), the voiced plosive will turn into a
voiced implosive: that is, the acceleration of the larynx movement, combined with reduction or abolition
of pulmonic activity, converts the function of the
glottis from that of mere phonator to simultaneous
phonator and initiator.
Affricate An affricate is a stop released into a
homorganic fricative, within one and the same syllable, represented in IPA by the symbol for the stop
followed by that for the fricative release, joined by a
tie bar if necessary, for example, [ts], [dz], [tʃ], [dʒ], [kx], [gɣ]. If the stop and the fricative, in a close-knit
sequence of this type, belong to different morphemes,
they are not usually regarded as forming an affricate.
Thus, in German Blitz, the sequence [ts] is generally
regarded as an affricate, thus [blɪt͡s], but in English bits, where [t] is the final consonant of bit and [s] is the exponent of the plural morpheme, the sequence
[t] [s] is not regarded as an affricate. If the stop is
released into a homorganic lateral fricative, this gives,
of course, a lateral affricate, for example, [tɬ].
Nasal Most of the time in speech, the soft palate is
raised, closing the entrance into the nasal cavity, so
that the air coming up from the larynx flows only out
of the mouth. This is the state of affairs in the articulation of the majority of consonants and vowels,
which are thus purely oral sounds; but, since this is
the normal, or default, state, it is not usually explicitly stated in the description of sounds. If, however,
there is a complete closure at some point in the
mouth, but the soft palate is lowered, the entire airstream is diverted through the nose. This gives what
are called nasal sounds, or simply nasals, for example, [m], [n]. Another possibility is to have the soft
palate lowered but the passage through the mouth
also unobstructed. In this case, air flows out of both
the mouth and the nose. Sounds produced with this
type of bifurcated airstream are called nasalized, for
example, the French nasalized vowels [ɑ̃], [ɛ̃], as
in vent, vin, etc. These sounds are also quite often
referred to, less accurately, simply as nasal vowels.
Trill In a trill, one articulator taps repeatedly
against another, usually at a frequency of between
25 and 35 Hz. A typical trill is the commonest type
of Italian [r], which is similar to the r popularly
believed to be used by Scots (though, in fact, most
speakers of Scottish English and Scots dialects use a
tapped [ɾ], fricative [ɹ̝], or approximant [ɹ] most of
the time). Note that a trill requires a maintained
posture of the articulators, the actual periodic
tapping being produced aerodynamically: the airstream setting the articulator flapping in the breeze.
Tap/Flap In tap or flap articulations, one articulator makes momentary contact with another. The
contact may result from a flicking movement, as in
the apico-alveolar tap [ɾ], which, particularly in
American English, commonly represents intervocalic
[t] or [d] in such words as latter or ladder, and in
British English may occur as an intervocalic variant
of r in very, etc. Alternatively, an active articulator
may momentarily strike against a static articulator in
passing, as for the retroflex flap [ɽ] in the Hindi word [gɦoɽa] 'horse'. Both tap and flap involve contact between articulators, but they differ from stop articulation in terms of the duration, and lightness, of the
contact. The contact for these sounds is essentially
brief, usually below 35 ms, whereas the contact for a
stop usually lasts considerably longer than this, and

can, of course, be maintained for a very long time.


This is true of voiced stops as well as voiceless, although in this case, as noted above, the duration is
aerodynamically restricted.
Fricative In the articulation of a fricative, a narrow
channel is formed at some point in the vocal tract and
the airstream is forced through it, being accelerated
in the process and becoming turbulent. This turbulence gives rise to the hissing noise characteristic of
fricatives. The aerodynamics of fricatives have been
studied in detail by Shadle (1990). The hiss noise is
usually more apparent in voiceless fricatives than in
voiced ones, since in the latter it is partly masked by
the periodic sound of the voice. The hiss is still present, however, and the combination of this with voice
produces the auditory effect of a buzz. Typical fricatives are [f], [v]; [θ], [ð]; [s], [z]; [x], [ɣ]; etc.
A distinction is sometimes made between flat and
grooved fricatives. One can see, or rather feel, this
difference, for example, by comparing English [θ] as
in thin with [s] as in sin. From the articulatory (aerodynamic) point of view, it is the cross-sectional area
of the articulatory channel, rather than its shape, that
is important. The wide channel of [θ] does not accelerate the airstream to the extent that the narrow
channel of [s] does. Sibilants are discussed further in
the section on Cover Terms for Manners of Articulation.
Lateral fricatives are produced with a narrow
articulatory channel similar to that of a fricative,
but on one side (or, perhaps less commonly, on both
sides) of the tongue. The commonest type are the
dentalveolar voiceless and voiced lateral fricatives,
transcribed [ɬ] and [ɮ], although even these are somewhat uncommon, [ɬ] occurring in fewer than 10% of the world's languages, and [ɮ] being even less frequent. The best-known example of [ɬ] is the ll of Welsh, as in such place names as Llangollen [ɬangɔɬen].
Approximant Approximants are produced with an
articulatory channel very slightly wider than that of
fricatives. If one pronounces a prolonged and energetic labiodental fricative [v], for example, and then,
while keeping the sound going, slowly and carefully
slides the lower lip downward (taking care to keep the
inner part of the lower lip in contact with the upper
teeth), a point will very soon be reached, that is, after
sliding downward no more than a millimeter or two,
where the turbulent fricative buzz of [v] is replaced
by the smoother, nonturbulent, sound of voice. This
is the labiodental approximant [ʋ]. If, now, having reached the approximant [ʋ], one retains the articulation but devoices it, a fricative-like hiss will be
heard again, only noticeably less loud than that of the voiceless fricative [f]. This experiment demonstrates the typical difference between a fricative and
the corresponding approximant. A fricative has turbulent airflow, and hiss noise, both when voiceless
and when voiced. An approximant has mildly turbulent airflow and hiss noise when voiceless, but no
turbulence and no hiss when voiced.
An ultra-short approximant, consisting chiefly of
a glide to or away from the approximant position, is
often called a semivowel. Examples are the palatal
approximant [j] and the labial-velar approximant
[w]. These have, or may have, exactly the same articulation as the vowels [i] and [u] respectively, which
(as a moment's experiment shows) exhibit the criteria
for approximants, namely nonturbulent flow when
voiced, but turbulence when voiceless. The difference
between [i] and [j] and between [u] and [w] is simply
that whereas the approximant vowels can be indefinitely prolonged, the semivowels consist merely of a
glide to and/or away from the vowel position.
Lateral approximants are ordinary [l]-type
sounds, with a slightly wider articulatory channel
than that of the lateral fricatives, and hence no turbulence when voiced, which they usually are, but some
turbulence when voiceless. The regular English [l] is a
voiced lateral approximant. A voiceless or partially
voiceless variant of it can be heard in English in the
consonant clusters [pl] and [kl] in such words as

please and clean.


The remaining manners listed on the IPA chart are
ejective stop and implosive, both of which have
been dealt with above.
Some Cover Terms for Manners of Articulation

The traditional manners described above refer to rather specific and narrowly defined types of articulation. For some purposes, however, it is useful to
have more general terms, each covering a number of
more specific terms, thus creating a small hierarchy
of terms at higher and lower ranks. One such higher-ranking division is into obstruents and nonobstruents (or sometimes sonorants).
Obstruent Obstruents can be further subdivided
into those that involve complete closure, occlusives,
and those that do not, nonocclusives. There are
problems with the higher rank assignment of several
classes of sound. Nasals, for example, are articulated with a complete closure in the mouth and, thus,
might be called occlusive obstruents. However, the
fact that they involve relatively unobstructed nonfricative airflow through the nose puts them into the
nonobstruent class. The position of trills, flaps, and
taps is ambiguous. All three classes involve articulatory contacts, but these contacts are so loose and momentary that they might qualify as nonobstruents.
Moreover, these and other types of r-sound (e.g.,
untrilled approximant or fricative types of [ɹ]), together with nasals and lateral approximants, often
pattern in the phonological structure of languages in
ways that set them apart from the more obvious
obstruents. From this phonological rather than phonetic point of view, nasals, lateral approximants, and
r sounds of all types are often treated as forming a
class of sonorants or nonobstruents. This is not unrelated to the tradition going back to the Roman grammarians (who inherited it, with a slight change of
meaning, from the Greeks) of grouping nonobstruent
l's and r's together as liquids.
Sibilant One further cover term is sibilant. This
term refers to fricatives of the [s] and [ʃ] types, the
characteristic feature of which is that they are produced with a narrow dental, alveolar, or postalveolar
articulatory channel, which accelerates airflow into a
turbulent jet that strikes against the teeth, creating a
turbulent wake downstream from the teeth. Note the
difference, already mentioned under Fricative, between nonsibilant [θ] and sibilant [s]. The wide, flat channel of [θ] does not accelerate the airstream to
generate the high-velocity jet required for a sibilant.
Hierarchy of Manners The hierarchy of manner
classes may be summarized as follows:
Obstruents:
. occlusives: stops (plosives, ejective stops, implosives) and affricates (including lateral affricates);
. nonocclusives: sibilants, and all other fricatives
(including lateral fricatives);
Nonobstruents:
. nasals, liquids (i.e., all r-type sounds and approximant laterals), and all other approximants.
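For readers who work with such taxonomies computationally, the two-rank hierarchy just summarized can be captured in a small data structure. The following Python sketch is illustrative only: the grouping follows the summary above, but the labels and the helper function are assumptions of this sketch, not part of the IPA or of any standard library.

MANNER_HIERARCHY = {
    "obstruent": {
        "occlusive": ["plosive", "ejective stop", "implosive",
                      "affricate", "lateral affricate"],
        "nonocclusive": ["sibilant fricative", "nonsibilant fricative",
                         "lateral fricative"],
    },
    "nonobstruent": {
        "nasal": ["nasal"],
        "liquid": ["trill", "tap", "flap", "r-type approximant",
                   "lateral approximant"],
        "other approximant": ["approximant", "semivowel"],
    },
}

def cover_terms(manner):
    """Return the higher-rank cover terms for a specific manner, e.g. 'plosive'."""
    for higher, subclasses in MANNER_HIERARCHY.items():
        for lower, manners in subclasses.items():
            if manner in manners:
                return [higher, lower]
    return []

print(cover_terms("plosive"))            # ['obstruent', 'occlusive']
print(cover_terms("lateral fricative"))  # ['obstruent', 'nonocclusive']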

Places of Articulation
As seen in the preceding Manners of Articulation section, a lowered soft palate, which directs the airstream through the nose in the articulation of
nasal and nasalized sounds, is traditionally treated
as a manner of articulation. This leaves only the
mouth, and the throat (the pharynx and larynx),
as places of articulation. Oral places of articulation are described in the next section, followed by
pharyngo-laryngeal places.
Oral Articulatory Locations

Articulations in the mouth are effected by the juxtaposition of lower and upper articulators, sometimes known as active and passive articulators, a terminology which has the disadvantage that
in a few cases (e.g., when the lower and upper lips are
juxtaposed) it is difficult to state with certainty which
is the more active articulator. The lower articulators
are those attached to the lower jaw: the lower lip,
lower teeth, and tongue. The upper articulators are
the upper lip, the upper teeth, the whole of the roof of
the mouth, and, in the case of the laryngeal articulator, the epiglottis. Each of these is a continuum, or
near-continuum, of possible articulatory locations. In
other words, articulations can occur, in principle, at
virtually any point along each of them. For the purposes of the phonetic description of sounds, linguists
identify a number of places, or zones, along each
articulatory continuum, and it is usual to describe
articulations in terms of these zones. At the same
time, it is sometimes convenient to have more inclusive terms referring to more extensive divisions of the
oral articulatory area, that is, classes at a higher rank
in the locational hierarchy.
Upper Articulators The first, and most obvious,
natural division of the whole upper articulatory area
is that between the upper lip, on the one hand, and the teeth plus the remainder of the roof of the mouth, on the other. One can thus
make a first division of the upper articulatory area
into a labial division (subdivided, when necessary,
into an outer, exolabial, and an inner, endolabial,
zone) and a tectal (i.e., roof of mouth) division.
The tectal division can be subdivided into a front
(dentalveolar) part, and a rear (domal) part. The
dentalveolar subdivision consists of the upper teeth
(dental zone) and the ridge behind the teeth (the
alveolar ridge), which can be subdivided into a
front, relatively flat half (the alveolar zone) and a
maximally convex rear half (the postalveolar zone).
If one feels the alveolar ridge with the tip of the
tongue, the two zones are usually apparent though
there is a good deal of individual variation in the
shapes of alveolar ridges, some exhibiting much
more postalveolar convexity than others. The division between the alveolar ridge and the remainder of
the roof of the mouth, that is, the division between
the dentalveolar and domal subdivisions, can be
roughly defined as occurring at the point where the
convexity of the alveolar ridge gives way to the concavity of the palate.
The rest of the domed roof of the mouth divides
naturally into a front (palatal) zone, consisting of
the hard palate, and a rear (velar) part, consisting
of the soft palate, terminating in the uvula. Each of
these zones is subdivided into a front half and a rear
half, the palatal zone into prepalatal and palatal
proper, and the velar zone into velar and uvular.

Table 2  Hierarchy of upper articulatory divisions and zones

Division   Subdivision    Zone                            Subzone
Labial     labial         labial (exolabial/endolabial)
Tectal     dentalveolar   dental                          dental
                          alveolar                        alveolar, postalveolar
           domal          palatal                         prepalatal, palatal
                          velar                           velar, uvular

The hierarchy of upper articulatory divisions and zones is summarized in Table 2.
Lower Articulators The lower articulators are
normally named by prefixes (labio-, dorso-, etc.) attached to the names of the upper zones against which
they articulate. The lower articulators, then, are the
lower lip (labio-, subdivided if necessary into exolabio- and endolabio-) and the lower teeth (denti-)
and the tongue. The tongue has no clear-cut natural
divisions, but for phonetic purposes it is divided into
the tip or apex (apico-) and behind that the blade
(lamino-), which is usually taken to consist of that
part of the upper surface of the tongue that lies immediately below the alveolar ridge, extending back
about 1 to 1.5 cm from the tip. This definition of
blade, which goes back at least to Sweet (1877), is
traditionally used in articulatory phonetics, though
some writers have treated what is normally called
the blade as part of the apex, using the term blade
to refer to the entire front half of the dorsal surface
of the tongue behind the apex (e.g., Peterson and
Shoup, 1966). The traditional usage of the term
blade, however, is much more convenient for phonetic taxonomy. The underside of the blade (sublamino-) can be used in the articulation of retroflex
sounds. For these sounds, the apex of the tongue is
raised and somewhat turned back, so that, in the
extreme case, the underblade articulates against the
prepalatal arch, behind the alveolar ridge, giving a
sublaminoprepalatal articulation.
The remaining dorsal surface of the tongue (dorso-)
can be subdivided into front (anterodorso-) and rear
(posterodorso-) halves. However, since it is normal
for dorsodomal articulations to be made by the juxtaposition of the appropriate part of the tongue with the
upper articulatory zone that lies opposite or nearest
to it, these terms are not often used. Finally, the cover
term linguo- can be used, when required, to refer in
the most general way to articulation by the tongue.
Oral Places and Manners of Articulation Oral
places of articulation can thus be fully described by terms compounded of a prefixed designation of a lower articulator, plus the designation of an upper
articulatory zone, for example, labiolabial (normally
replaced by bilabial), labiodental, apicodental, laminoalveolar, dorsovelar, etc. In practice, however, it
is customary to name only the upper articulatory
zone, unless additional explicitness or accuracy is
required for some specific purpose, such as the
necessary distinction between bilabial [ɸ] and labiodental [f], and this practice is exemplified on the IPA
chart.
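The naming convention just described lends itself to a simple mechanical sketch. The Python fragment below is purely illustrative: the prefix and zone lists echo the terms introduced above, and the function is an assumption of this sketch, not an established tool.

LOWER_PREFIXES = ["labio", "exolabio", "endolabio", "denti", "apico", "lamino",
                  "sublamino", "anterodorso", "posterodorso", "dorso", "linguo"]
UPPER_ZONES = ["labial", "dental", "alveolar", "postalveolar",
               "prepalatal", "palatal", "velar", "uvular"]

def place_label(lower, upper):
    """Compound place term: lower-articulator prefix + upper articulatory zone."""
    assert lower in LOWER_PREFIXES and upper in UPPER_ZONES
    if lower == "labio" and upper == "labial":
        return "bilabial"  # 'labiolabial' is normally replaced by 'bilabial'
    return lower + upper

print(place_label("apico", "dental"))          # apicodental
print(place_label("labio", "labial"))          # bilabial
print(place_label("sublamino", "prepalatal"))  # sublaminoprepalatal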
The following is an enumeration of the principal
oral articulations, basically following the place order
shown on the IPA chart, but including some additional
locations.
The first is bilabial; in principle, all manners of
articulation can occur here, and all are attested as
regularly occurring in languages, except laterals.
Bilabial trill, though an easy sound to produce (for
voiceless bilabial trill [ʙ̥], simply place the lips rather loosely together and blow, then add voice for [ʙ]), is rare, though known to occur in a few Austronesian
languages, such as Titan and Kele in Papua New
Guinea (Ladefoged et al., 1977: 50) and Nias in
Sumatra (Catford, 1988), and phonetically in Yi and
Bai of SW China (Esling and Edmondson, 2002).
Linguolabial (apicolabial) articulations are
formed by the juxtaposition of tongue tip and upper
lip; plosive, nasal, and fricative articulations
are reported in some languages of Malekula, New
Hebrides, and in Umotina, a Bororo language of
South America (Ladefoged and Maddieson, 1986: 7).
Labiodental nasals, fricatives, and an approximant occur. The fricatives, [f] and [v], are usually
distinctly endolabiodental; that is, the inner surface
of the lower lip covers the greater part of the upper
teeth, whereas the articulation of the approximant [ʋ]
may be exolabiodental.
A bidental (dentidental) fricative occurs in
a dialect of Adyghe (Circassian) in the northwest
Caucasus. In addition, many speakers of American
English articulate the [θ] of thin with the tongue slightly protruded between the upper and lower teeth, an articulation usually called interdental, but which might be described as apico- or laminobidental. In British English, [θ] is perhaps more commonly articulated with the tongue tip just behind the
teeth (Ladefoged and Maddieson, 1986).
The dentalveolar (dental, alveolar, and postalveolar) subdivision contains a great range of possible
apical and laminal articulations. On the IPA chart, a
full set of manner types is exemplified only for
alveolars. This is because the IPA supplies diacritics
for dental, for example, [t̪] for dental [t], and for retracted, for example, [t̠] for retracted [t], which can be used to symbolize postalveolar [t̠] where necessary. In fact, a full set of dental and postalveolar articulations (plosive, nasal, trill, etc.) can occur.
Rather commonly in the languages of the world,
stops (plosives, ejectives, and implosives), nasals,
and laterals at dentalveolar locations are articulated
with the apex of the tongue, hence they are apicodental, apicoalveolar, and apicopostalveolar, but
they can also be articulated with the blade, and so
laminodental, laminoalveolar, and laminopostalveolar are quite possible. Where it is necessary to
distinguish between these types of articulation, the
IPA again supplies diacritics.
With respect to fricatives, the alveolar sibilants [s]
and [z] are very commonly laminoalveolar, though
apicals are quite possible. Note that the tongue shape for apicodental [θ] and [ð] is rather flat, creating a wider channel than that for sibilant [s] and [z]. Dental sibilant fricatives are also possible, represented when necessary as [s̪] and [z̪]. The postalveolar fricatives, [ʃ] and [ʒ], can be either apical or laminal.
Fully retroflex sounds are articulated by the
underblade of the tongue in juxtaposition with the
prepalatal arch, behind the alveolar ridge. Probably
the retroflex flap, [ɽ], is always fully retroflex, the
tongue starting curled up and then shooting forward
and downward, the underside of the apex and blade
momentarily striking the palate on its way down.
Retroflex consonants are particularly common in
the languages of India. In general, the retroflex
sounds of Dravidian languages such as Tamil and
Telugu (see Dravidian Languages) tend to be fully
retroflex (sublaminoprepalatal), while those of the
Indic languages of Northern India, such as Hindi,
are often little more than apicopostalveolar.
Palatal articulations are, in principle, sounds articulated by juxtaposition of the dorsal (especially
anterodorsal) part of the tongue with the highest
part of the hard palate. A full range of stops, nasals,
fricatives, approximants, and laterals is possible here.
However, probably because of the anatomy of the
organs concerned (the convex tongue fitting into the concavity of the palate), pure dorsopalatal articulations (except for [j]) seem to be rare. In languages
like Hungarian, Italian, Castilian Spanish, and
French, which are all supposed to have palatal consonants ([c], [ɟ], and [ɲ] in Hungarian, [ɲ] and [ʎ] in Italian and Spanish, and [ɲ] in French), the articulation may often be prepalatal or even alveolar with a
palatal modification (palatalized alveolar).
Note that on the IPA chart the places for trill, tap or
flap, and lateral fricative are blank, meaning that
there is no special IPA symbol for these sound-types,
not shaded to indicate an articulation judged impossible. In fact, a dorsopalatal trill (or tap/flap) seems highly improbable, though claims have been made that such a sound can be produced; but a palatal
lateral fricative does occur, for example, in the south
Arabian language Jibbali.
Mention should be made here of the alveolopalatal fricatives, the symbols for which, [ɕ] and [ʑ], are listed under Other symbols on the IPA chart. These, exemplified by Polish ś and ź, and by Chinese (Pinyin spelling) x [ɕ] and j [tɕ], are articulated by the front
part of the dorsal surface of the tongue against the
prepalatal arch (with the blade or apex of the tongue
close to the postalveolar zone): they might therefore
be termed anterodorsoprepalatal. The traditional
term, alveolopalatal, it may be noted, violates the
principle enunciated above, namely that oral articulatory locations are named by prefixing a term designating the lower articulator to the name of the upper
articulatory zone.
Velar sounds are articulated with the posterodorsal surface of the tongue making contact with the
soft palate. Trills and taps or flaps are probably impossible here, but all other manners of articulation
occur. Velar laterals are rare, but the velar lateral
approximant [ʟ] is reported in a number of languages
of New Guinea and the Chadic language Kotoko
(Maddieson, 1984: 77; see Chadic Languages), and
velar lateral affricates and fricatives occur in the
north Caucasian (Dagestanian) language Artchi
(Archi).
Uvular sounds are articulated with the rearmost
part of the posterodorsum against the posterior part
of the soft palate, including the uvula. A full range of
manners of articulation is possible here, though the
laterals (both fricative and approximant), are not
known to occur. Of the plosives, the voiceless one
[q] is about six times as common as the voiced one
[ɢ]. This is no doubt because of the very small volume
of air that can be contained between the uvular
closure and the glottis, which renders voicing particularly difficult for uvular plosives for the reason
explained earlier in the Stop section.
Pharyngo-Laryngeal Articulations

These are articulations that take place in the pharynx, the part of the throat immediately behind the mouth,
which includes the upper part of the larynx itself.
Pharyngo-laryngeal articulations are typically a function of the laryngeal constrictor mechanism, with
four manners of articulation (stop, trill, fricative, approximant) modified by the parameter of larynx
height. Phoneticians have called these sounds either
pharyngeal or epiglottal.
Pharyngeal and epiglottal articulations most
commonly involve a sphincteric contraction of the
upper part of the larynx, in which the aryepiglottic folds pull the back part of the larynx upward and forward to the base of the tongue against the
epiglottis (Esling, 1996). As this happens, the tongue
retracts and descends partially into the pharynx.
This compresses the size of the pharynx in a maneuver whose physiological function is to protect
the airway and efficiently seal air in the lungs. The
tongue itself rarely retracts enough to close off the
pharynx and cannot in any case retract independently
of the laryngeal sphincter. The common variants of the Arabic throat consonants ḥā [ħ] and ʿain [ʕ] are of this type. These sounds are usually described as fricatives, but the voiced member of the pair is more
correctly an approximant, because it has no fricative-type hiss, though its voiceless partner is quite
noisy.
The difference between what have been labeled
pharyngeal and epiglottal articulations lies either
in the height of the larynx itself or in the degree of
noise and/or vibration generated. The designated epiglottal fricatives [ʜ] and [ʢ] will have either a more elevated larynx (and therefore a smaller pharyngeal cavity and higher-pitched resonances) than [ħ] and [ʕ], or more noise and/or trilling (Esling, 1999).
When the laryngeal constrictor makes complete
sphincteric closure, usually with the tongue retracted
and the larynx elevated, an epiglottal stop [ʡ] results.
Epiglottal articulations, with trilling, fricative noise,
and stop modes that sometimes distinguish them from
contrasting pharyngeal sounds, are particularly
common in Caucasian languages, such as dialects of
Agul (Aghul), in Dagestan.
Laryngeal articulation takes place in the glottis
and is thus generally termed glottal. Glottal stop
[ʔ], voiceless fricative [h], and voiced fricative [ɦ] occur. In the voiced glottal fricative [ɦ] the arytenoid
cartilages at the rear of the vocal folds are separated,
allowing passage of part of the pulmonic egressive
airstream, but the forward, ligamental, part of the
vocal folds is in vibration, producing voice. Both [h] and [ɦ] could thus be described as the phonation
types breath (voiceless) and breathy voice (voiced).
However, when they function in languages as consonants, that is, as marginal elements in the structure of syllables, they are usually described as glottal
fricatives.

Modified and Double Articulations


Modified Articulations

Modified articulations are those involving the formation of a primary stricture at some location, accompanied by a secondary, more open, articulation,
usually at some other location. Modified articulations are symbolized, for the most part, by a small superscript symbol for the appropriate approximant (or
fricative, if there is no appropriate approximant symbol). An example would be labialization (or rounding), that is, an approximation or rounding of the lips
co-occurring with some other, closer, articulation,
formed elsewhere, for example, labialized velar plosive or fricative [kʷ], [xʷ]. In the world's languages, labialization occurs most frequently with velars and uvulars,
but also, rather surprisingly, with labials in a few
languages. Apart from labialized, the four principal
modified articulations are as follows.
In palatalized sounds, the anterodorsum is raised
toward the hard palate simultaneously with a primary articulation elsewhere. Palatalization is most common with labials, thus [pj], [fj], etc. With lingual
articulations, palatalization, since it is effected by
the same organ as the primary articulation, tends to
shift the primary articulation. Thus, palatalized
velars have the dorsovelar contact shifted forward
somewhat, so that in extreme cases the articulation
becomes palatal, or nearly so. In Russian, which contrasts plain versus palatalized labials and dentalveolars, unmodified [t] and [d] are apicodental,
but their palatalized counterparts, [tʲ] and [dʲ], have the tongue tip retracted and possibly slightly lowered, so that they become laminoalveolar or even laminopostalveolar, often slightly affricated, thus [tsʲ], [dzʲ].

Velarized sounds have the posterodorsum raised toward the soft palate simultaneously with a primary articulation elsewhere, thus [tˠ] [dˠ]. Typically, in
most types of English, /l/ at the end of a syllable
is somewhat velarized (dark l); velarization and uvularization are also forms of the modification of emphatic [tʶ] [dʶ] [sʶ] [zʶ] in Arabic.
In pharyngealized sounds, the primary articulation is modified typically by stricture of the laryngeal
constrictor, another form of Arabic emphasis, for example, [tˤ], [sˤ], etc. This modification can also be applied to vowels. Velarized, uvularized, and pharyngealized can also be symbolized by a tilde running through the symbol for the primary articulation, thus [ɫ] = velarized or pharyngealized /l/.
For nasalized sounds, the soft palate is lowered,
and part of the airstream is diverted through the nose.
Nasalized sounds are most commonly a modification
of vowels, thus [ɛ̃], [ɑ̃], but applicable to consonants where appropriate (e.g., the Japanese syllabic n in, for example, [hoɣ̃] book, where the final consonant
is, or can be, a nasalized velar fricative or approximant). Note that, in this case, even if the oral articulation channel is of a narrow, fricative, type, so much
of the airstream is diverted through the nose that the
oral airflow is not turbulent.
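Since all five modifications discussed in this section are written with standard IPA marks, they can be attached to base symbols mechanically. The short Python sketch below is illustrative only: it simply pairs each modification with the Unicode character conventionally used for it, and the helper function is an assumption of the sketch.

MODIFICATION_MARKS = {
    "labialized":     "\u02b7",  # superscript w, as in a labialized velar plosive
    "palatalized":    "\u02b2",  # superscript j, as in palatalized labials
    "velarized":      "\u02e0",  # superscript gamma
    "pharyngealized": "\u02e4",  # superscript reversed glottal stop
    "nasalized":      "\u0303",  # combining tilde above, as in nasalized vowels
}

def modify(base, modification):
    """Attach the conventional mark for a secondary articulation to a base symbol."""
    return base + MODIFICATION_MARKS[modification]

print(modify("k", "labialized"))   # labialized velar plosive
print(modify("t", "velarized"))    # velarized [t]
print(modify("e", "nasalized"))    # nasalized vowel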

Double Articulation

Double articulation means co-occurrence of two articulations of the same degree of stricture (e.g.,
two stops, two approximants) at different locations.
Double articulations are named by hyphenated location names, thus labial-velar means simultaneously
articulated at the lips and at the soft palate. The IPA
supplies symbols for four double articulations: [w] voiced labial-velar approximant, [ʍ] voiceless labial-velar fricative or approximant (the sound of wh in one pronunciation of what), [ɥ] voiced labial-palatal approximant, that is, the initial sound in French huit [ɥit], and, finally, [ɧ] voiceless postalveolar-velar fricative (simultaneous [ʃ] and [x]), the southern Swedish fricative that represents the ti in such a word
as station.
Apart from these special cases, the commonest
double articulations consist of the simultaneous articulation of stops at two locations, most frequently
labial-velar [kp] [gb], written [k͡p] [g͡b] when the coarticulation has to be made explicit in transcription.
This particular type of double articulation is often
called labiovelar, a term which must be avoided
in a strictly systematic phonetic taxonomy in which
the first half of such a compound term refers to the
lower articulator. Double articulations of this type are
uncommon, occurring in only about 6% of the world's languages, mostly in Africa, particularly in Niger-Kordofanian (see Niger-Congo Languages) and Nilo-Saharan languages (see Nilo-Saharan Languages).
In Europe, double articulations occur only in a few
north Caucasian languages, namely Abkhaz and the
now virtually extinct Ubykh, which both have labial-dental [p͡t], [p͡tʼ], and [b͡d], and Lak, in Dagestan, with
coarticulated [pk], [pts], and double fricatives such
as [ ], [ ], etc.

Vowels
The Articulation section drew attention to the difficulty of justifying, on articulatory grounds, the distinction that is traditionally made between vowels
and consonants. It is clear that some vowels can be
described in purely consonantal terms, that is, in
terms of consonantal place and manner of articulation. A moment's experimentation demonstrates that
the vowel [i] (approximately as in English see) is a
palatal approximant, that is, articulated with the
front of the tongue raised up close to the hard palate
(hence palatal), and that the airflow is nonturbulent when the sound is voiced, but becomes turbulent
when it is devoiced (precisely the criterion for an
approximant). In a similar way, it can easily
be seen that [u] (approximately as in English who) is a labial-velar approximant. Other vowels, for which the tongue is drawn well back into the pharynx and
engaging the laryngeal constrictor, such as extremely
retracted types of [ɔ] and [ɑ], may be described as
pharyngeal or epiglottal approximants, and so on.
The great 19th-century phoneticians Alexander
Melville Bell and Henry Sweet, from whom the vowel
classification used in articulatory phonetics is chiefly
derived, were well aware of the relationship between
vowels and consonants (Bell, 1867: 75; Sweet, 1877:
51). Nevertheless, the taxonomy of vowels that they
were largely responsible for developing uses a different
approach. The problem is that, although some vowels,
such as [i], [u], [ɑ], and other peripheral vowels, can easily be classified in the consonantal way, it becomes more difficult to apply the place and manner classification to certain others, for example, the [æ] vowel of
English cat. For vowels like this, the surface of the
tongue is remote from both the roof of the mouth and
the back of the oral vocal tract, so that the precise
specification of a place of articulation, and of a manner in terms of stricture type, becomes problematic.
Classification of Vowels

Consequently, vowels are classified in terms of the general configuration of the tongue and the lips. It is
this configuration, the shape and location of the body
of the tongue, together with lip position, that determines the shape and size of the oral and pharyngeal
cavities, and hence their resonant frequencies, and
this, in turn, determines the quality of the vowel (see
Phonetics, Acoustic).
Lip position is the most obvious feature of vowel
articulation. The lips may be in a more or less spread
or neutral position, that is, unrounded, as in the
production of the English vowels [i] in beet, [*] in
bat, [V] as in but, for instance; or they may be rounded to a greater or lesser extent, as for [u] in boot or
[O] in bought. As Bell said (1867: 16), it was the
discovery that lip configuration is an independent
parameter, the different lip positions being combinable with any tongue position, that was the major
breakthrough leading to the establishment of the
model of vowel production used throughout the
20th century.
For tongue position, the general shape and location of the main mass of the tongue is approximately
defined in terms of the height to which the convex body of the tongue is raised and the horizontal,
back-front, location of the tongue mass.
For tongue height, by silently and introspectively
saying the vowels [i], [e], [æ], approximately as in English beet, bait, bat, and [u], [o], [ɔ], approximately as in boot, boat, bought, one can obtain an impression of what is meant by tongue height. The tongue is clearly raised up close to the roof of the mouth in [i] and [u], and progressively lowered as one goes down to [æ] and [ɔ]. The lowering of the tongue can be
more clearly perceived if one fixates the jaw, for
example, by holding the end of a pencil between the
teeth, and notes the sensation of lowering and flattening of the tongue, and also the fact that in saying [i]
(silently) one can feel contact between the sides of the
tongue and the molar and canine teeth, but that this
contact is lost as the tongue moves down.
For horizontal tongue position, the body of the
tongue can be pushed forward and bunched up in the
front of the mouth, as for [i] or [e], or it can be drawn
back in the mouth, as for [u] or [ɔ]. The difference between front and back tongue positions can be clearly felt, by silently and introspectively saying [i] or [e] and moving quickly to [u] or [ɔ], and then silently
sliding the tongue back and forth between the front
and back positions. It will be noticed that the perceived change of lip position from unrounded
[i] to rounded [u] tends to mask the sensation of
tongue movement. To obtain a clearer perception
of the front-back tongue movement, it is useful to
start from [i], slowly and carefully round the lips to
approximately their position for [u], then, while
retaining that lip position, slide back to [u]. When
the change of lip position is thus abolished, the backward movement of the tongue can be more easily
perceived.
In describing the tongue positions of vowels, reference is often made to the highest point of the convex
surface of the tongue. This is merely a convenient
reference point, useful in comparing tongue positions,
but of no functional significance in the production of
vowel sounds. Another location on the tongue surface
is more important in determining vowel qualities.
This is the location of the narrowest linguo-domal
or linguo-pharyngeal stricture; and this, in its turn,
is determined by the general location of the tongue
mass as a whole, which is, in fact, what one is chiefly
aware of in the silent, introspective experiments
described in this section.
Cardinal Vowels

In the model of vowel production that traditional articulatory phonetics inherited from Bell and Sweet,
vowels are classified in terms of the degree of tongue
height, horizontal tongue position, and lip position. These pioneers established three cardinal (a term first used by Bell, 1867: 15) degrees of height, namely, high, mid, and low (corresponding
to the close, mid, and open of the IPA chart (Figure 1),
with mid subdivided into close-mid/open-mid), and
of horizontal position, front, back, and central
(originally called mixed). These, together with an array of modifications, brought the total number of vowel qualities recognized by their model to 81.

Figure 1  Cardinal vowel chart.
Effective though the Bell-Sweet system of vowel
classification was, in one respect it lacked precision:
it furnished coordinates defining relative tongue positions, but provided no absolute fixed points of origin
from which these could be measured. Daniel Jones
(see Jones, 1922) improved the system by basing a set
of eight cardinal vowels (cvs) upon two more or less
fixed reference points: these were 1 [i], for which the
tongue is as far forward and as high as possible, and 5
[ɑ], for which the tongue is as low and as far back
(i.e., retracted) as possible.
These are two limiting points. If, while the airstream remains constant, the tongue is brought still
closer to the hard palate and prepalatal arch for [i],
the flow will become turbulent and the sound will
thus become a palatal or prepalatal fricative. Similarly, if the tongue is pulled back further into the pharynx for [ɑ], the sound will become a pharyngeal or
epiglottal fricative. Adopting these extreme vowels as
key points, the set of cvs is completed by a series of
(roughly) equally spaced fully front vowels, 2, 3, 4
[e, ɛ, a], and three equally spaced fully back vowels, 6, 7, 8 [ɔ, o, u], the equal spacing, according to Jones, being both articulatory and auditory. The cvs 1 to 5 [i e ɛ a ɑ] are unrounded, 6 to 8 [ɔ o u] being rounded. This distribution of lip rounding corresponds to what is normal in the world's languages,
where, among mid and close back vowels, rounded
vowels are overwhelmingly commoner than
unrounded ones. The set of cvs is completed by a set
of secondary cvs, having the opposite lip positions,
rounded [y ø œ ɶ ɒ], and unrounded [ʌ ɤ ɯ].
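The three classificatory parameters and the standard values of the eight primary cvs can also be tabulated compactly. The Python dictionary below is an illustrative restatement of the description just given, not an addition to it; the data-structure layout is an assumption of this sketch.

PRIMARY_CARDINAL_VOWELS = {
    1: ("i",      "close",     "front", "unrounded"),
    2: ("e",      "close-mid", "front", "unrounded"),
    3: ("\u025b", "open-mid",  "front", "unrounded"),   # [ɛ]
    4: ("a",      "open",      "front", "unrounded"),
    5: ("\u0251", "open",      "back",  "unrounded"),   # [ɑ]
    6: ("\u0254", "open-mid",  "back",  "rounded"),     # [ɔ]
    7: ("o",      "close-mid", "back",  "rounded"),
    8: ("u",      "close",     "back",  "rounded"),
}

for number, (symbol, height, backness, lips) in PRIMARY_CARDINAL_VOWELS.items():
    print(f"cv {number}: [{symbol}] {height} {backness} {lips}")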
The vowel diagram on the IPA chart displays the
cvs, at the points marked by dots on its periphery,
plus a number of other types of vowel whose place
of articulation and articulatory relationship to the
cvs is shown by their location on the diagram. This
commonly used vowel diagram is an easily drawn
simplification of an earlier form which purported to

represent with some accuracy the relative tongue positions of the cvs, derived from X-ray photographs of Daniel Jones's pronunciation of them. The IPA
diagram is something of a hybrid. Its general shape
roughly indicates the articulatory relations between
the vowels, but the fact that the horizontal line between the close front and close back positions [i] to
[u] is twice as long as the line between open front and
open back [a] to [ɑ] can only be justified on an auditory/acoustic basis. It is clear from X-ray data that the articulatory distance between [i] and [u] is about the same as that between [a] and [ɑ], but the acoustic distance between [i] and [u], in terms of the frequency of the second formants of these vowels (see Phonetics, Acoustic), is about twice that between [a] and [ɑ].
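To make the roughly 2:1 claim concrete, one can plug in rough second-formant values. The figures in the Python sketch below are illustrative round-number assumptions of the general order reported for such vowels, not measurements taken from this article:

# Illustrative second-formant (F2) values in Hz, assumed for the arithmetic only.
F2 = {"i": 2300.0, "u": 800.0, "a": 1750.0, "\u0251": 1100.0}

close_span = F2["i"] - F2["u"]       # acoustic distance between the close corner vowels
open_span = F2["a"] - F2["\u0251"]   # acoustic distance between the open corner vowels
print(close_span, open_span, round(close_span / open_span, 1))  # roughly a 2:1 ratio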
A phonetician trained in the auditory/proprioceptive
identification of vowel qualities, and in the precise
values of the cvs, can place a dot on the cv diagram
showing the location of any particular vowel in relation to the cardinal vowels. This provides, for other
phoneticians trained in the cvs, a fairly precise indication of the quality of the vowel in question. Even those
who are not explicitly trained to use the cvs regularly
make use of the general principles of vowel classification outlined here, sometimes with minor terminological variations, for example, using high and low
(in reference to tongue position) in place of close
and open (in reference to jaw position). See for example the illustrations of individual languages in the
Handbook of the IPA (IPA, 1999) or in the Journal of
the IPA.
Diphthongs

Vowels uttered as part of a single syllable may be either monophthongs (so-called pure vowels) or
diphthongs. Monophthongs involve no appreciable
change in vowel position throughout their duration,
whereas diphthongs involve a noticeable change.
A diphthong is recognized as having two elements,
the vowel nucleus, which is clearer and more prominent, and the glide, or nonsyllabic element, which is
less prominent and shows a continuous change of
vowel position. In IPA notation, the nonsyllabic element can be transcribed with the diacritic [ ], for
example English high [ha ], how [ha ]. Given that
the nonsyllabic part of a diphthong involves continuous change of vowel position, it is only an approximate indication of its quality; it should be taken as the
direction in which the articulators move rather than
as an absolute target achieved.
Modifications of Vowels

Although the three parameters of lip position and vertical and horizontal tongue position enable the specification of most vowels with some accuracy, one must take account of some other articulatory features, which can be regarded as modifications of the
basic articulation of vowels.
Among such features are two already mentioned in
the Modified Articulations section, namely nasalization and pharyngealization. In addition, there are
three other modifications, for which the IPA supplies
diacritics: advanced tongue root, retracted tongue
root (indicating essentially expansion of the pharynx by lowering the larynx and contraction of the
pharynx by engaging the laryngeal constrictor), and
rhoticity.
Rhoticity, or rhotacization, also known as
r-coloring, is a cover term used for certain modifications of tongue shape that are associated (in English,
at least) with orthographic sequences of vowel plus r,
most typically in words such as bird, burn, Bert. In
American English, the vowel in such words is often
said to be retroflexed. Retroflexion (see Oral Places
and Manners of Articulation in this article) is, indeed,
one possible form of rhoticity, but nearly the same
auditory quality may be achieved not by turning
up the apex of the tongue but by simultaneously
modifying the shape of the tongue in two other
ways: retraction of the tongue root into the pharynx
while increasing laryngeal constriction (i.e., mild
pharyngealization), and some degree of sulcalization of the back of the tongue, that is, the formation
of a hollow or furrow in the dorsal surface of the
tongue roughly opposite the uvula.

Conclusion
There have been numerous critics of the traditional
classification of vowels who claim that the tongue
positions posited by the model are not borne out by
X-ray data. On the whole, such criticism is exaggerated. Although numerous apparent anomalies have
been pointed out, the vast majority of the hundreds
of published X-ray photographs and tracings from
X-rays demonstrate the validity of the model. Although acoustic definitions of vowels, in terms of the
frequency of their first, second, and third formants,
are obviously of great value and are much used, along
with articulatory descriptions, for most of the purposes
of descriptive, comparative, and pedagogical linguistics, there is no useful substitute for the traditional
articulatory model (see Catford, 1981).
In general, articulatory phonetics provides a model
of speech production that is inclusive enough and
flexible enough to allow for the description and classification of virtually any sounds that can be produced by human vocal organs and are thus
potentially utilizable in speech. It does this primarily by specifying parameters (ranges of possibility) rather than narrowly defined features or classes of
sounds. Thus, although linguists specify a finite set
of articulatory zones along the roof of the mouth,
this is merely a matter of convenience, since it is clear
that articulation between the dorsal surface of the
tongue and the roof of the mouth can occur at any
point, or more precisely (since tongue-domal contacts
naturally involve an area, not a point, of each articulator) in any area, and articulatory phonetics allows
for precise definition of such contact areas. One is very
rarely forced to make do with ad hoc categories
when new sounds or sound combinations are found
in languages. There are sufficient categories, and principles for their application, to deal efficiently with
most new sound types. Thus, the principle of defining
airstream mechanisms in terms of the location and
direction of initiation not only permits the description
of all types of initiation known to be used, but is also
extensible to other minor types (Pike, 1943: 99–103).
Unusual articulations, like the bidental fricatives, or
linguolabials, and velar laterals (mentioned in the
section on Oral Places and Manners of Articulation) present no difficulty, because the categories are there in the
model, and even where a very specific category has
not previously been called for, it is generally easy to
subdivide existing ones. Thus, when distinctions between inner and outer labial articulations were found
to be necessary, the subdivision of the labial zone
presented no problem (Catford, 1977: 146–147).
See also: Afroasiatic Languages; Bell, Alexander Melville (1819–1905); Caucasian Languages; Chadic Languages; Disorders of Fluency and Voice; Dravidian Languages; Ellis, Alexander John (né Sharpe) (1814–1890); Imaging and Measurement of the Vocal Tract; International Phonetic Association; Jespersen, Otto (1860–1943); Khoesaan Languages; Na-Dene Languages; Niger-Congo Languages; Nilo-Saharan Languages; Passy, Paul Edouard (1859–1940); Phonetic Transcription: History; Phonetics and Pragmatics; Phonetics, Acoustic; Prosodic Aspects of Speech and Language; Sievers, Eduard (1850–1932); Speech Aerodynamics; Speech Perception; Speech Production; States of the Glottis; Sweet, Henry (1845–1912); Tone in Connected Discourse; Viëtor, Wilhelm (1850–1918); Voice Quality.

Bibliography
Abercrombie D (1967). Elements of general phonetics.
Edinburgh: Edinburgh University Press.
Bell A M (1867). Visible speech: the science of universal
alphabetics. London: Simpkin, Marshall.
Catford J C (1964). Phonation types: the classification of
some laryngeal components of speech production. In
Abercrombie D, Fry D B, McCarthy P A D, Scott N C & Trim J L M (eds.) In honour of Daniel Jones. London: Longmans. 26–37.
Catford J C (1977). Fundamental problems in phonetics.
Edinburgh: Edinburgh University Press/Bloomington, IN:
Indiana University Press.
Catford J C (1981). Observations on the recent history of
vowel classification. In Asher R E & Henderson E J A
(eds.) Towards a history of phonetics. Edinburgh:
Edinburgh University Press. 19–32.
Catford J C (1988). Notes on the phonetics of Nias. Studies
in Austronesian linguistics. SE Asia Series 76. Athens,
OH: Ohio University Center for SE Asia Studies.
Catford J C (2001). A practical introduction to phonetics
(2nd edn.). Oxford: Oxford University Press.
Esling J H (1996). Pharyngeal consonants and the aryepiglottic sphincter. Journal of the International Phonetic
Association 26, 65–88.
Esling J H (1999). The IPA categories pharyngeal and
epiglottal: laryngoscopic observations of pharyngeal
articulations and larynx height. Language & Speech
42, 349–372.
Esling J H & Edmondson J A (2002). The laryngeal sphincter as an articulator: tenseness, tongue root and phonation in Yi and Bai. In Braun A & Masthoff H R (eds.)
Phonetics and its applications: Festschrift for Jens-Peter
Köster on the occasion of his 60th birthday. Stuttgart: Franz Steiner Verlag. 38–51.
Esling J H & Harris J G (2005). States of the glottis: an
articulatory phonetic model based on laryngoscopic
observations. In Hardcastle W J & Beck J (eds.) A figure
of speech: a Festschrift for John Laver. Mahwah, NJ:
Lawrence Erlbaum Associates. 347–383.

IPA (1999). Handbook of the International Phonetic


Association: a guide to the use of the International
Phonetic Alphabet. Cambridge: Cambridge University
Press.
Jones D (1922). An outline of English phonetics (2nd edn.).
Cambridge: W. Heffer & Sons.
Ladefoged P (1975). A course in phonetics. New York:
Harcourt Brace Jovanovich.
Ladefoged P & Maddieson I (1996). The sounds of the
world's languages. Oxford: Blackwell.
Ladefoged P, Cochran A & Disner S F (1977). Laterals and
trills. JIPA 7(2), 46–54.
Laver J (1980). The phonetic description of voice quality.
Cambridge: Cambridge University Press.
Maddieson I (1984). Patterns of sounds. Cambridge:
Cambridge University Press.
Ohala J J (1990). Respiratory activity in speech. In
Hardcastle W J & Marchal A (eds.) Speech production
and speech modelling. Dordrecht: Kluwer.
Peterson G E & Shoup J E (1966). A physiological theory
of phonetics. Journal of Speech and Hearing Research 9,
5–67.
Pike K L (1943). Phonetics: A critical analysis of phonetic
theory and a technic for the practical description of
sounds. Ann Arbor: University of Michigan Press.
Shadle C H (1990). Articulatory-acoustic relationships in
fricative consonants. In Hardcastle W J & Marchal A
(eds.) Speech production and speech modelling. Dordrecht:
Kluwer.
Sweet H (1877). A handbook of phonetics. Oxford:
Clarendon Press.

Phonetics, Acoustic
C H Shadle, Haskins Laboratories,
New Haven, CT, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction
Phonetics is the study of characteristics of human
sound-making, especially speech sounds, and includes
methods for description, classification, and transcription of those sounds. Acoustic phonetics is focused
on the physical properties of speech sounds, as transmitted between mouth and ear (Crystal, 1991); this
definition relegates transmission of speech sounds
from microphone to computer to the domain of instrumental phonetics, and yet, in studying acoustic
phonetics, one needs to ensure that the speech itself,
and not artifacts of recording or processing, is being
studied. Thus, in this chapter we consider some of the
issues involved in recording, and especially in the

analysis of speech, as well as descriptions of speech sounds.
The speech signal itself has properties that make
such analysis difficult. It is nonstationary; analysis
generally proceeds by using short sections that are
assumed to be quasistationary, yet in some cases this
assumption is clearly violated, with transitions occurring within an analysis window of the desired length.
Speech can be quasiperiodic, or stochastic (noisy), or
a mixture of the two; it can contain transients. Each
of these signal descriptions requires a different type of
analysis. The dynamic range is large; for one speaker,
speaking at a particular level (e.g., raised voice), the
range may be 10 to 50 dB SPL (decibels Sound
Pressure Level) over the entire frequency range
(Beranek, 1954), but spontaneous speech may potentially range over 120 dB and still be comprehensible
by a human listener. Finally, the frequency range is
large, from 50 to 20 000 Hz. Though it is well known

Shadle C H (1985). The acoustics of fricative consonants.
Ph.D. thesis, Dept. of ECS, MIT, Cambridge, MA. RLE
Tech. Report 504.
Shadle C H (1990). Articulatory-acoustic relationships in
fricative consonants. In Hardcastle W J & Marchal A
(eds.) Speech production and Speech Modelling.
Dordrecht: Kluwer Academic Publishers.
Stevens K N (1971). Airflow and turbulence noise for
fricative and stop consonants: static considerations.
Journal of the Acoustical Society of America 50,
1180–1192.

Stevens K N (1998). Acoustic phonetics. Cambridge, MA:


The MIT Press.
Sundberg J (1987). The science of the singing voice.
DeKalb, Illinois: University of Northern Illinois Press.
Titze I (2000). Principles of voice production (2nd printing). Iowa City, Iowa: National Center for Voice and
Speech.
Whalen D H (1991). Perception of the English /s/-/S/ distinction relies on fricative noises and transitions, not on
brief spectral slices. Journal of the Acoustical Society of
America 90(4:1), 1776–1785.

Phonetics, Forensic
A P A Broeders, University of Leiden, Leiden and
Netherlands Forensic Institute, The Hague, The
Netherlands
© 2006 Elsevier Ltd. All rights reserved.
This article is reproduced from the previous edition, volume 6, pp. 3099–3101.

The term forensic phonetics refers to the application of phonetic expertise to forensic questions. Forensic
phonetics was a relatively new area at the beginning
of the 1990s. The kind of activity that its practitioners
are probably most frequently involved in is forensic
speaker identification. Forensic phoneticians may act
as expert witnesses in a court of law and testify as to
whether or not a speech sample produced by an unknown speaker involved in the commission of a crime
originates from the same speaker as a reference sample that is produced by a known speaker, the accused.
Other activities in which forensic phoneticians may
be engaged are speaker characterization or profiling,
intelligibility enhancement of tape-recorded speech,
the examination of the authenticity and integrity of
audiotape recordings, the analysis and interpretation
of disputed utterances, as well as the analysis and
identification of non-speech sounds or background
noise in evidential recordings. In addition, forensic
phoneticians may collaborate with forensic psychologists to assess the reliability of speaker recognition
by earwitnesses.

Speaker Recognition: Identification versus Verification
Speaker identification, whether in a forensic context
or otherwise, can be regarded as one form of speaker
recognition. The other form is usually called speaker
verification. There are a number of important differences in the sphere of application and in the methodology employed in these two forms of speaker

recognition. Speaker identification is concerned with establishing the identity of a speaker who is a member
of a potentially unlimited population, verification
with establishing whether a given speaker is in fact
the one member of a closed set of speakers which he
claims to be.
There are basically two different approaches to
speaker recognition. One is linguistically oriented,
the other is essentially an engineering approach. In
recent years, considerable progress has been made
by those taking the latter approach, culminating
in the development of fully operational automatic
speaker verification systems. In a typical verification
application, a person seeking access to a building or
to certain kinds of information will be asked to pronounce certain utterances, which are subsequently
compared with identical reference utterances produced by the speaker he claims to be. If the match
is close enough, the speaker is admitted; if not, he is
denied access.
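The accept/reject logic of such a system can be sketched in a few lines. The Python example below is a toy illustration, not an account of any operational system: the feature vectors, the cosine-similarity measure, and the threshold are all assumptions of the sketch, and real systems use far richer features and statistical models.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(reference, test, threshold=0.85):
    """Admit the speaker only if the new utterance matches the stored reference closely enough."""
    return cosine_similarity(reference, test) >= threshold

reference = np.array([1.0, 0.2, 0.7, 0.1])   # stored features for the claimed identity (illustrative)
attempt = np.array([0.9, 0.25, 0.65, 0.15])  # features from the utterance just produced (illustrative)
print("admitted" if verify(reference, attempt) else "denied")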
Automatic Speaker Recognition

The success of automatic speaker verification systems has stimulated research into the application of similar
techniques to the field of automatic speaker identification for forensic purposes. Although attempts in
this field have been numerous, they have not so far
been successful. The problem is that speaker identification, especially in the forensic context, is a much
more complex affair than speaker verification. While
the unknown speaker in a verification context is a
member of a closed set of speakers, the suspect in a
forensic identification context is a member of a much
larger group of speakers whose membership is not
really known and for whom no reference samples
are available. Recording conditions may vary quite
considerably in the identification context and speakers cannot be relied upon to be cooperative. They may
attempt to disguise their voices, they may whisper,
adopt a foreign accent, or speak a foreign language. These are some of the main factors that have so far stood in the way of the development of reliable
automatic speaker identification systems.
Forensic Speaker Identification

In the absence of quantitative, engineering-type solutions to the identification problem, forensic phoneticians largely rely on auditory, or aural-perceptual
methods, frequently but not always combined with
some form of acoustic analysis of features like fundamental frequency, or pitch, intonation, and vowel
quality. Their findings are based on a detailed analysis
of the accent and dialect variety used, of the voice
quality and, if sufficient material is available, of any
recurrent lexical, idiomatic, syntactic, or paralinguistic patterns, always allowing for the communicative
context in which the various speech samples are produced and for the physical and emotional state of the
speaker(s) involved. The phonetic analysis typically
includes a narrow transcription of (parts of) the
speech sample, based on the IPA symbols (see International Phonetic Association). What the forensic
phonetician will be looking for in particular are
speaker-specific features in areas like articulation,
voice quality, rhythm, or intonation. Of particular
interest here are features that deviate from the norm
for the accent or dialect in question as well as features
that are relatively permanent and not easily changed
either consciously or unconsciously by the speaker.
Of course, for such features to be amenable to forensic investigation, they not only need to be fairly
frequent but also reasonably robust, so that the limitations imposed by the forensic context, such as less
than perfect recording conditions and relatively short
speech samples, do not preclude their investigation.
An example of a norm-deviating feature would be
the regular use of preconsonantal /r/ in an otherwise
non-r-pronouncing accent, or the bilabial or labiodental articulation of prevocalic /r/, as in woy or
voy for Roy. Features that may be fairly permanent
and not easily changed are the duration of the aspiration of voiceless plosives, the frequency range at
which the voice descends into creaky voice or glottal
creak, assimilation of voice and, on the lexical level,
the use of certain types of fillers or stock phrases. In
addition, individual speakers may exhibit pathological features such as stammering, inadequate breath
control, lisps, or various types of defective vocalization, which may serve to distinguish them from other
speakers.
Obviously, a combination of such features potentially provides strong evidence for identification.
However, this approach presupposes an ability to
quantify features such as voice quality, which do not
easily lend themselves to quantification, as well as a knowledge of the distribution of such features in the relevant speaker population, which is not always
available (see Voice Quality). As a result, forensic
phoneticians have to weigh the significance of the
correspondences and differences found between the
speech samples under investigation, drawing on their
experience and expertise as phoneticians and their
familiarity with the language variety involved. So, in
the absence of rigorous quantitative criteria, the forensic phonetician will frequently have to base his
conclusions on an interpretation of largely qualitative
data.
This means that conclusions inevitably have to be
formulated in terms of probability rather than certainty, and are ideally prefaced by an acknowledgment of the nature of the testimony offered,
which is that of a considered opinion, not a piece of
incontrovertible evidence.
The Status of Forensic Speaker Identification

In view of the nature of the methods applied by forensic phoneticians, it is not surprising that their
judgments are not always unchallenged. The most
extreme position is held by those who believe that
phoneticians have no special skills to identify or discriminate between speakers and that no phonetician
should consequently claim to possess such skills by
giving evidence in a court of law. Somewhat less
skeptical are those who maintain that phoneticians
do have a role to play, but only in establishing nonidentity. They argue that, in the absence of experimental evidence that no two speakers speak exactly
alike, there is no real basis for positive identification. Finally, there are those who believe that both
identity and nonidentity can be established, but
among those who subscribe to this view there is considerable variation in the degree of probability or
certainty they would be prepared to attach to their
conclusions.
In some countries, forensic speaker identification is
somewhat controversial. This is no doubt partly due
to the exaggerated claims made by those who were
responsible for the introduction of the so-called voice
print technique, which in the 1960s and 1970s
enjoyed considerable prestige in parts of the United
States and in some other countries. A voice print is
essentially a spectrographic representation of a
speech signal. Speech sounds are complex vibrations.
A spectrogram shows the relative intensity of the
frequency components making up this complex vibration (see Phonetics, Acoustic). The voice print technique essentially consisted of a visual comparison of
spectrograms of identical utterances produced by a
known speaker and an unknown speaker to determine whether they originated from a single person.
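A spectrogram of this kind is straightforward to compute with standard signal-processing tools. The Python sketch below is illustrative only: a synthetic tone stands in for recorded speech, and the sampling rate and window length are arbitrary assumptions of the sketch.

import numpy as np
from scipy.signal import spectrogram

fs = 16000                              # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 220 * t)         # stand-in signal; real input would be recorded speech
freqs, times, Sxx = spectrogram(x, fs=fs, nperseg=512)
print(Sxx.shape)                        # relative intensity over (frequency bins, analysis frames)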

Spectrograms may provide forensically useful information about speech signals but most phoneticians
would now agree that there is little justification for
the implication of reliability carried by the term voice
print, which suggests that the status of voice print
evidence is comparable to that of fingerprint evidence. Testimony based on modified forms of the
voice print technique as practiced by certified members of the VIAAS (Voice Identification and Acoustic
Analysis Subcommittee) of the IAI (International
Association for Identification) continues to be accepted as evidence in some states in the United States.
There are two international organizations whose
members are in one way or another involved in forensic speaker identification. In addition to the
VIAAS, whose membership is largely American, there
is the IAFP (International Association for Forensic
Phonetics), which was founded in 1989 with the
aim of providing a forum for those working in the
field of forensic phonetics as well as ensuring professional standards and good practice in this area. Its
membership is in fact almost entirely European.
Speaker Profiling

If there is no suspect available, forensic phoneticians and dialectologists may also be asked to produce a
speaker profile on the basis of a recorded speech sample of an unknown speaker. This may include information about the criminal's sex, age group, regional
and/or social background, and educational standard.
The speaker profile may subsequently play a role in
directing the investigative efforts of the police.

Speaker Identification by Earwitnesses


Earwitnesses may be asked to go through what is
usually called a voice parade or voice line up if a
suspect is available but the offender's speech has not
been recorded. This essentially amounts to a procedure in which a speech sample of the suspect plus
samples of five or six similar sounding speakers are
recorded on audiotape and played to the earwitness,
who will be asked to indicate which if any of the
speakers is the offender. The proper administration
of a voice parade, like that of its visual counterpart (also known as the Oslo confrontation), demands
that strict requirements be met, to prevent procedural
errors that would raise serious doubts about any
identification made and thus reduce or destroy its
evidential value.
A distinction that needs to be made with respect to
identifications by nonphoneticians relates to the
familiarity of the earwitness with the offender's speech. If, for example, the earwitness claims the offender's voice belongs to a person the earwitness knows, the ability of the witness to recognize this voice from a number of similar voices may be tested directly, and much more rigorously, than is possible through a voice parade as described above. Some other factors that may be expected to affect recognition by earwitnesses are the amount of time elapsed between the exposure to the unknown speaker's voice and the line-up, the nature of the earwitness's interaction with the unknown speaker, the amount of speech heard, and the age of the earwitness.

Intelligibility Enhancement
Intelligibility enhancement of speech recordings is
undertaken to determine what is said in evidential
recordings rather than to establish the identity of
the speaker. Enhancement work typically involves
the use of digital filtering techniques, aimed at reducing the presence of unwanted noise components in the recorded signal. The degree of improvement achieved will generally depend on the nature and intensity of the nonspeech signal. Filtering techniques may also play a role in the interpretation of disputed utterances. An analysis of the speaker's speech patterns, combined with an analysis of the acoustic information contained in the signal under investigation,
may resolve the question one way or the other.
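As a minimal sketch of the kind of digital filtering involved, assuming the unwanted noise lies largely outside the main speech band, a recording might be band-pass filtered as follows; the cutoff frequencies and file names are illustrative assumptions, not a standard forensic protocol.

```python
# Sketch: simple band-pass filtering to attenuate noise outside the speech band.
# Cutoffs and file names are illustrative; real casework is far more involved.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, x = wavfile.read("evidential_recording.wav")
x = x.astype(np.float64)
if x.ndim > 1:
    x = x.mean(axis=1)                      # mix down to mono

# 4th-order Butterworth band-pass, roughly 300-3400 Hz (telephone speech band)
sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
y = sosfiltfilt(sos, x)                     # zero-phase filtering

# Rescale and save the enhanced version for auditory inspection
y = np.int16(y / np.max(np.abs(y)) * 32767)
wavfile.write("enhanced_recording.wav", fs, y)
```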

Audiotape Examination
Tape authentication and integrity examinations are
conducted to establish whether recordings submitted
by the police or private individuals can be accepted as
evidential recordings. Questions here relate to the
origin of the recording, e.g., whether the recording
was made at the time and in the manner it is alleged
to have been made, or to its integrity, i.e., whether the
recording constitutes a complete and unedited registration of the conversation as it took place. This type
of examination will usually include a visual inspection
of the tape for the presence of splices or any other
forms of interference; a detailed auditory analysis to
localize any record on/off events or any discontinuities
in the progress of the conversation or in the background noise; an electro-acoustic analysis to display
and compare any transients generated by record on/
off events with those produced in replication tests on
the recorder allegedly utilized to make the questioned
recording; and inspection of the magnetic patterns on
the tape surface through the use of ferrofluids, which
may provide information about record on/off events
or about the size and shape of the record and erase
heads. The increasing availability of computer-based
digital speech processing systems has led to a situation
where relatively large numbers of people have access to equipment that can be used to make edits that are not easily detectable from a technical point of view.
However, apart from technical indications, suspect
recordings may also contain linguistic indications of
fabrication, which may in fact escape detection by all
except the forensic phonetician.
See also: International Phonetic Association; Phonetic
Transcription: History; Phonetics, Acoustic; Phonetics:
Overview; Speaker Recognition and Verification, Automatic; Speech Production; Voice Quality.

Bibliography
Baldwin J & French P (1990). Forensic phonetics. London and New York: Pinter.
Bolt R H et al. (1979). On the theory and practice of voice identification. Washington, DC: National Academy of Sciences.
Hollien H (1990). The acoustics of crime: the new science of forensic phonetics. New York: Plenum Press.
Koenig B E (1990). Authentication of forensic audio recordings. Journal of the Audio Engineering Society 38(1/2), 3–33.
Künzel H J (1987). Sprechererkennung: Grundzüge forensischer Sprachverarbeitung. Heidelberg: Kriminalistik Verlag.
Nolan F (1983). The phonetic bases of speaker recognition. Cambridge: Cambridge University Press.
Nolan F (1991). Forensic phonetics. JL 27, 483–493.
Rose P (2002). Forensic speaker identification. London & New York: Taylor & Francis.
Tosi O (1979). Voice identification: theory and legal applications. Baltimore, MD: University Park Press.

Phonetics: Field Methods


S Bird, University of Victoria, BC, Canada
B Gick, University of British Columbia, BC, Canada,
and Haskins Laboratories, New Haven, CT, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction: What Is Phonetic Fieldwork and Why Do We Do It?
The primary goals of phonetic research are threefold:
to document the different sounds that occur in natural languages (e.g., Ladefoged and Maddieson, 1996),
to understand the acoustic and articulatory properties
of these sounds (e.g., Miller-Ockhuizen, 2003), and
to evaluate experimentally theories and models of
phonetic and phonological structure (e.g., Bird and
Caldecott, 2004). To achieve these goals, it is crucial
to consider all languages spoken across the world.
How is this done? In some cases, speakers can be
recorded in a laboratory setting; this is practical
with many languages spoken in urban areas. When
speakers cannot be brought to a laboratory, however,
it is necessary to conduct phonetic fieldwork, i.e., to
record speech outside of a laboratory setting. Until
fairly recently, conducting phonetic fieldwork has
been much more limited than laboratory-based phonetic work because of restrictions on the kinds of
instrumental tools that could be taken outside of a
laboratory setting. In particular, techniques for measuring articulation, such as electromagnetic
articulography (EMA), magnetic resonance imaging
(MRI), or laryngoscopy, cannot be used in fieldwork situations. However, technological advances have
made it much easier to collect data in the field: acoustic data can be collected using compact, unobtrusive
equipment such as a pocket-sized mini-disc recorder;
tongue movement data can be recorded using a portable ultrasound machine; air flow and pressure data
can be collected using portable equipment with a
laptop computer. These new technologies have
allowed phoneticians to collect data from speakers
who are either unable to travel to a research institution, or who are uncomfortable working within a
laboratory setting. The data collected in the field
from languages that would not otherwise be studied
are crucial for attaining the goals of phonetic research
and, more generally, for gaining a full understanding of the range of possible phonetic phenomena in
natural language.
This article describes methods used in the collection
of phonetic data in the field, focusing on two areas
of phonetics: acoustic phonetics and articulatory
phonetics. For each of these areas of study, the relevant research methods are described and the kinds of
questions addressed are discussed.

Field Methods in Acoustic Phonetics


The most common kinds of phonetic data recorded in
fieldwork settings are acoustic data, based on audio
recordings. Different recording modes are used,
including solid state, compact disc, mini-disc, digital
audiotape, and cassette. Ladefoged (2003) provided
a detailed description of many of the methods used in the collection of acoustic data, including each method's benefits and drawbacks. Figure 1 illustrates
an experimental setup used for recording acoustic
data.

Figure 1 Example of an acoustic data collection setup: a Sony MZ-B10 portable mini-disc recorder and a Sony ECM-T115 lapel microphone.

Audio recordings provide basic acoustic information about the properties of speech sounds. These
data can be analyzed using various tools, also described in Ladefoged (2003). The program Praat,
created by Paul Boersma and David Weenink, is
often used in phonetic data analysis. Among other
things, this program displays speech visually, making
it possible to extract specific kinds of acoustic
information, ranging from overall pitch, duration,
and amplitude measurements to acoustic properties
specific to individual consonants or vowels. Figure 2
shows a waveform and spectrogram associated with
the English sentence 'She likes singing jazz', as an
example of what acoustic properties of speech can
be depicted visually. The first display is the waveform.
In the second display, pitch and amplitude curves are
superimposed on the spectrogram.
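As an informal illustration of this kind of analysis (not taken from Ladefoged, 2003), the sketch below uses the praat-parselmouth Python interface to Praat to extract overall duration, a pitch track, and an intensity contour from a recording; the file name is a placeholder.

```python
# Sketch: pulling duration, pitch, and intensity measurements out of a recording
# with the praat-parselmouth interface to Praat. The file name is a placeholder.
import parselmouth

snd = parselmouth.Sound("she_likes_singing_jazz.wav")

n_channels, n_samples = snd.values.shape
print(f"Duration: {n_samples / snd.sampling_frequency:.3f} s")

pitch = snd.to_pitch()                      # default autocorrelation pitch analysis
f0 = pitch.selected_array["frequency"]      # Hz; 0 where a frame is unvoiced
print(f"Mean F0 over voiced frames: {f0[f0 > 0].mean():.1f} Hz")

intensity = snd.to_intensity()              # intensity contour in dB
print(f"Mean intensity: {intensity.values.mean():.1f} dB")
```

The same measurements can, of course, be made interactively in the Praat graphical interface itself.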
Several kinds of research questions can be
addressed based on acoustic data. One research area
involves the prosodic structure of language. For example, Hargus and Rice (2005) have written a series
of papers on prosodic structure in various Athabaskan
languages; all of their work is based on experimental
data collected in the field. Segmental properties
have also been the focus of much phonetic work.
Miller-Ockhuizen (2003) described guttural sounds
in Ju|hoansi, a Khoisan language spoken in Botswana
and Namibia. Subsegmental properties of speech can
also be studied acoustically, such as the timing
between different components of complex sounds.
Bird and Caldecott (2004) provided an acoustic study of glottalized resonants in St'át'imcets (Lillooet), an Interior Salish language spoken in British Columbia, Canada.

Figure 2 Waveform (top) and spectrogram (bottom) of the English sentence 'She likes singing jazz'. Pitch and amplitude curves are superimposed on the spectrogram.

Figure 3 Airflow and pressure data collection setup.

Figure 4 Airflow and pressure contours for the Navajo word [hes] ('to itch', future). From top to bottom, the contours represent the acoustic waveform, oral airflow, oral pressure, and nasal airflow.

In addition to making audio recordings, it is possible to collect data on air pressure and airflow from
the mouth and the nose. The system usually consists
of oral and nasal masks and pressure tubes held by
the speaker, along with a microphone to record
sound. Figure 3 illustrates the experimental setup.
The pressure tubes are connected to a system that
tracks pressure and airflow and displays the data on a computer screen (Ladefoged, 2003), as
shown in Figure 4. Nasal airflow data are useful for
answering questions on the use and timing of nasalization in different sounds. For example, based on
nasal airflow data, Gerfen (2001) showed that Mixtecan has a series of nasalized fricatives that are
extremely marked linguistically: they are difficult
to produce and are also extremely unusual crosslinguistically.
Figure 5 Ultrasound data collection setup using the Sonosite Titan portable ultrasound machine.

Another area of exploration using airflow and pressure data involves different phonation types. Breathy voicing, modal voicing, and creaky or laryngealized voicing differ in the amount of airflow through the glottis. The rate of airflow through the glottis is highest for breathy voicing and lowest for creaky or laryngealized voicing. Using airflow data, it is therefore possible to explore phonation types within a language and across languages. Ladefoged (2003) used this parameter to study Javanese, to distinguish breathy voicing from modal voicing. Phonation types can also be explored using electroglottography (EGG), a technique in which the degree of closure of the vocal folds is measured (for details, see Ladefoged, 2003).
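Purely as an illustrative sketch of how this parameter might be used, the following compares the mean oral airflow of a vowel token against rough phonation categories; the threshold values, sampling rate, and data are invented for the example and are not taken from Ladefoged (2003).

```python
# Sketch: classifying the phonation of a vowel token by its mean oral airflow,
# assuming the airflow trace has been exported as a NumPy array in ml/s and the
# vowel interval is known. Thresholds and data are invented for illustration.
import numpy as np

def mean_airflow(trace, fs, start, end):
    """Mean airflow (ml/s) between start and end times in seconds."""
    return float(trace[int(start * fs):int(end * fs)].mean())

def phonation_guess(flow, low=120.0, high=250.0):
    """Crude three-way split on mean airflow; thresholds are hypothetical."""
    if flow > high:
        return "breathy"
    if flow < low:
        return "creaky/laryngealized"
    return "modal"

# Fabricated example: 0.5 s of airflow data sampled at 1000 Hz
fs = 1000
trace = np.full(int(0.5 * fs), 180.0) + np.random.normal(0, 5, int(0.5 * fs))
flow = mean_airflow(trace, fs, 0.1, 0.4)
print(f"Mean airflow {flow:.1f} ml/s -> {phonation_guess(flow)}")
```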

Field Methods in Articulatory Phonetics


Until recently, the most common way of collecting
articulatory data in the field was using a technique
called static palatography (see Ladefoged, 2003).
This technique involves having a speaker pronounce
certain speech sounds after painting his/her tongue
with a mixture of oil and charcoal, and observing
where on the roof of the mouth the mixture has
been transferred. Static palatography is particularly
useful for exploring properties such as place of articulation in spoken segments that involve a complete
closure between the tongue and the roof of the
mouth. Anderson (2000), for example, related differences in articulation to various tongue positions in speakers of Western Arrernte, an aboriginal language of Central Australia.

Figure 6 Midsagittal ultrasound images of Kinande, showing (A) advanced-tongue root and (B) retracted-tongue root varieties of the vowel /e/. The tongue tip is at the right of the picture, and the root is at the left. The tongue surface can be seen as the lower edge of the white curved region.

One drawback of static
palatography is that it requires repainting the tongue
after every token uttered, which is relatively time
consuming. In addition, this method does not allow
for collection of data on the dynamic properties of
speech production. An alternative technique used in
laboratory settings is dynamic palatography, with
which it is possible to track articulations across
time. This technique is not practical in most fieldwork situations, however, because obtaining good
results requires making a special false palate for
each speaker (Ladefoged, 2003).

Ultrasound imaging is another technique for recording articulatory data in the field (Gick, 2002;
Gick et al., 2005a). Figure 5 illustrates the experimental setup used in ultrasound field research. Portable
ultrasound machines are ideal for collecting articulatory data: they are small enough to fit into a day pack,
language consultants enjoy working with them because they can see what they are producing, and
the data can be used to address questions involving
not only the pronunciation of individual segments,
but also the articulatory timing involved in producing
complex segments and sequences of segments, as well
as motor planning in the production of a whole
sentence. Ultrasound data are useful primarily for observing the articulations involving the tongue; an
example of the kind of data obtained using ultrasound imaging is shown in Figure 6. Although the
ultrasound machines are still relatively expensive
(with portable machines costing approximately
$20 000 U.S.), their price has been dropping and
more and more researchers are able to purchase
them for linguistic research. Nevertheless, fieldwork
on articulatory phonetics using ultrasound imaging is
relatively new, and little published work exists based
on fieldwork using the technique (McDowell, 2004;
Gick et al., 2005b; Miller-Ockhuizen et al., 2005).
Increasing availability of the equipment will no
doubt soon provide valuable new information on
articulatory phonetics.

Conclusion
A wide range of questions can be addressed using
phonetic techniques; advances in technology have
increasingly made it possible to collect phonetic data
in the field. Data collected in field contexts, in which
the researcher goes into a community and records
speakers, rather than transporting them back to a
laboratory setting, have opened up a whole new
range of phenomena to consider, both in documenting natural language sounds and in evaluating current
linguistic theory.
See also: Arrernte; Australia: Language Situation; Canada: Language Situation; Imaging and Measurement of the Vocal Tract; Laboratory Phonetics; Phonetics, Acoustic; Phonetics, Articulatory; Phonetic Classification; Phonetics: Overview; Salishan Languages; States of the Glottis.

Bibliography
Anderson V (2000). Giving weight to phonetic principles: the case of place of articulation in Western Arrernte. Ph.D. diss., University of California, Los Angeles.
Bird S (2004). Lheidli intervocalic consonants: phonetic and morphological effects. Journal of the International Phonetic Association 34(1), 69–91.
Bird S & Caldecott M (2004). Glottal timing in St'át'imcets glottalised resonants: linguistic or biomechanical? Proceedings of the Speech Technology Association (SST), 2004.
Gerfen C (2001). Nasalized fricatives in Coatzospan Mixtec. International Journal of American Linguistics 67(4), 449–466.
Gick B (2002). The use of ultrasound for linguistic phonetic fieldwork. Journal of the International Phonetic Association 32(2), 113–122.
Gick B, Bird S & Wilson I (2005a). Techniques for field application of lingual ultrasound imaging. Clinical Linguistics and Phonetics, in press.
Gick B, Campbell F, Oh S & Tamburri-Watt L (2005b). Toward universals in the gestural organization of syllables: a crosslinguistic study of liquids. Journal of Phonetics, in press.
Gordon M (2003). Collecting phonetic data on endangered languages. In Proceedings of the 15th International Congress of Phonetic Sciences. 207–210.
Gordon M & Maddieson I (1999). The phonetics of Ndumbea. Oceanic Linguistics 38, 66–90.
Gordon M, Potter B, Dawson J, de Reuse W & Ladefoged P (2001). Some phonetic structures of Western Apache. International Journal of American Linguistics 67, 415–448.
Hargus S & Rice K (eds.) (2005). Athabaskan prosody. Amsterdam: John Benjamins.
Ladefoged P (2003). Phonetic data analysis: an introduction to fieldwork and instrumental techniques. Oxford, UK: Blackwell.
Ladefoged P (2005). Vowels and consonants: an introduction to the sounds of languages (2nd edn.). Oxford, UK: Blackwell.
Ladefoged P & Maddieson I (1996). The sounds of the world's languages. Oxford, UK: Blackwell.
Maddieson I (2001). Phonetic fieldwork. In Newman P & Ratliff M (eds.) Linguistic fieldwork. Cambridge, UK: Cambridge University Press. 211–229.
Maddieson I (2003). The sounds of the Bantu languages. In Nurse D & Philippson G (eds.) The Bantu languages. London, UK: Routledge. 15–41.
McDonough J (2003). The Navajo sound system. Dordrecht/Boston: Kluwer Academic Publishers.
Miller-Ockhuizen A (2003). The phonetics and phonology of gutturals: a case study from Ju|hoansi. In Horn L (ed.) Outstanding dissertations in linguistics series. New York: Routledge.
Miller-Ockhuizen A, Namaseb L & Iskarous K (2005). Posterior tongue constriction location differences in click types. In Cole J & Hualde J (eds.) Papers in Laboratory Phonology 9.

Relevant Websites
http://www.praat.org Phonetic data analysis website.
http://www.linguistics.ucla.edu Department of Linguistics, University of California, Los Angeles.

Phonetics: Overview
J J Ohala, University of California at Berkeley,
Berkeley, CA, USA
© 2006 Elsevier Ltd. All rights reserved.

Phonetics is the study of pronunciation. Other designations for this field of inquiry include speech
science or the phonetic sciences (the plural is important) and phonology. Some prefer to reserve the term
phonology for the study of the more abstract, the
more functional, or the more psychological aspects
of the underpinnings of speech and apply phonetics
only to the physical, including physiological, aspects
of speech. In fact, the boundaries are blurred, and
some would insist that the assignment of labels to
different domains of study is less important than
seeking answers to questions.
Phonetics attempts to provide answers to such
questions as: What is the physical nature and structure of speech? How is speech produced and perceived? How can one best learn to pronounce the
sounds of another language? How do children first
learn the sounds of their mother tongue? How can
one find the cause and the therapy for defects of
speech and hearing? How and why do speech sounds
vary in different styles of speaking, in different phonetic contexts, over time, over geographical regions?
How can one design optimal mechanical systems to
code, transmit, synthesize, and recognize speech?
What is the character and the explanation for the
universal constraints on the structure of speech
sound inventories and speech sound sequences?
Answers to these and related questions may be sought
anywhere in the speech chain, i.e., the path between
the phonological encoding of the linguistic message
by the speaker and its decoding by the listener.
The speech chain is conceived as starting with the
phonological encoding of the targeted message,
conceivably into a string of units like the phoneme,
although there need be no firm commitment on the
nature of the units. These units are translated into an
orchestrated set of motor commands that control
the movements of the separate organs involved in
speech. Movements of the speech articulators produce
slow pressure changes inside the airways of the vocal
tract (lungs, pharynx, oral and nasal cavities) and,
when released, these pressure differentials create
audible sound. The sound resonates inside the continuously changing vocal tract and radiates to the outside
air through the mouth and nostrils. At the receiving
end of the speech chain, the acoustic speech signal
is detected by the ears of the listener and transformed and encoded into a sensory signal that can be

interpreted by the brain. Although often viewed as an encoding process that involves simple unidirectional
translation or transduction of speech from one form
into another (e.g., from movements of the vocal
organs into sound, from sound into an auditory representation), it is well established that feedback loops
exist at many stages. Thus what the speaker does may
be continuously modulated by feedback obtained
from tactile and kinesthetic sensation, as well as
from the acoustic signal via auditory decoding of his
speech.
In addition to the speech chain itself, which is the
domain where speech is implemented, some of the
above questions in phonetics require an examination
of the environment in which speech is produced, that
is, the social situation and the functional or task constraints, for example, that it may have evolved out of
other forms of behavior, that it must be capable of
conveying messages in the presence of noise, and
that its information is often integrated with signals
conveyed by other channels.
The end-points of the speech chain in the brains of
the speaker (transmitter) and the listener (receiver)
are effectively hidden, and very little is known about
what goes on there. For practical reasons, then, most
research is done on the more accessible links in the
chain: neuromuscular, aerodynamic, articulatory, and
acoustic. The articulatory phase of speech is perhaps
most immediately accessible to examination by direct
visual inspection and (to the speaker himself) via
tactile and kinesthetic sensation. Thus it is at this
level that speech was first studied, supplemented by
less precise auditory analysis, in several ancient scientific traditions. This history of phonetics, going back
some 2.5 millennia, makes it perhaps the oldest of the
behavioral sciences and, given the longevity and
applicability of some of the early findings from these
times, one of the most successful.
In the second half of the 19th century, the instrumental study of speech, both physiologically and
acoustically, was initiated, and this has developed continuously, until now some very advanced methods are
available, especially ones that involve on-line control
and rapid analysis of signals by computers. One of
the most useful tools in the development of phonetics
has been phonetic transcription, especially the nearuniversally used International Phonetic Alphabet
(IPA). Based on the principle of one sound, one symbol, it surpasses culturally maintained spelling systems
and permits work in comparative phonetics and in
phonetic universals (Maddieson, 1984).
In addition to classifying some of the subareas of
phonetics according to the point in the speech chain on which they focus, research is often divided up
according to the particular problem attacked or to a
functional division of aspects of the speech signal
itself.
One of the overriding problems in phonetics is
the extreme variability in the physical manifestation
of functionally identical units, whether these be
phonemes, syllables, or words. Theories of coarticulation, i.e., the overlap or partially simultaneous
production of two or more units, have been developed to account for some of this variation. Other
proposed solutions to this problem emphasize that,
if properly analyzed, there is less variation in speech
than appears at first: more global, higher-order
patterns in the acoustic speech signal may be less
variably associated with given speech units than
are the more detailed acoustic parameters. Other
approaches place emphasis on the cognitive capacity
of speakers and hearers to anticipate each others
abilities and limitations and to cooperate in the
conveyance and reception of pronunciation norms.
Another major problem is how the motor control
of speech is accomplished by the brain when there are
so many different structures and movements to be
carefully coordinated and orchestrated in biophysical terms, where there are an inordinate number of
degrees of freedom. A proposed solution is positing
coordinative structures that reduce the degrees of
freedom to a manageable few. Related to this issue
are the metaquestions: What is the immediate goal
of the speaker? What is the strategy of the listener? Is
the common currency of the speaker and hearer a
sequence of articulatory shapes, made audible so
they can be transmitted? Or are the articulatory
shapes secondary, the common coin being acoustic-auditory images? It is in this context that atypical
modes of speech have significance, i.e., substitute
articulations necessitated for purposes of amusement
as in ventriloquism or because of organic defects in
the normal articulatory apparatus.
One of the hallmarks of human speech, as opposed
to other species' vocalizations, is its articulated
character, i.e., that it is a linear concatenation of
units such as consonants and vowels or perhaps of
syllables. Most phonetic research has been done
on this, the segmental, aspect of speech. In parallel
with speech segments, however, are other phonetic
events loosely linked with them and that are
less easily segmented. These are the so-called suprasegmentals, including intonation, stress, (lexical) accent, tone, and voice quality. They are receiving
increased research attention because they are the
next frontier in phonetics (after segmental research)
and because of pressures from speech technology,
especially text-to-speech synthesis, to produce a theory that fully accounts for the prosodic structure of speech.
In spite of the breadth of its scope and the diversity
of its approaches, phonetics remains a remarkably
unified discipline.
In its development as a discipline, phonetics has
drawn from a variety of fields and pursuits: medicine
and physiology (including especially the area of communication disorders), physics, engineering, philology, anthropology, psychology, language teaching,
voice instruction (singing and oratory), stenography
and spelling reform, and translation, among others.
It remains a multifaceted discipline in the early 21st
century. As suggested above, phonetics and phonology are closely allied fields, whether one views them
as largely autonomous with small areas of overlap or
as one field with slightly different emphases. In the
present article, it is proposed that phonetics is a part
of phonology: phonology's goal is to try to understand all aspects of speech sound patterning, and
phonetics is one domain where it must seek its
answers; other domains include psychology, as well
as the study of society and culture. Phonetics is at the
same time a scientific discipline that maintains its ties
to physiology, psychology, physics, and anthropology
by trying to acquire new knowledge about the nature
and functioning of speech. It is also a discipline that
has made considerable progress in applying its existing knowledge in useful ways, e.g., in telephony, in
diagnostics and therapy in communication disorders,
in the development of writing systems, and in teaching second languages, as well as a host of areas
in speech technology and forensics. If product sales
are a measure of the accomplishment of a discipline,
phonetics must by this measure be one of the most
successful areas within linguistics.
But despite the many seemingly diverse paths taken
by phonetics, it has proven itself a remarkably unified
field. Reports on work in all of these areas are
welcome at such international meetings as the International Congress of Phonetic Sciences (a series
begun in 1928, the last one being the 15th, held at
Barcelona in 2003), Eurospeech/Interspeech (the most
recent being held in Lisbon in 2005), and the International Conference on Spoken Language Processing (a
series started in 1990, now integrated with Interspeech). Likewise, a quite interdisciplinary approach
to phonetics may be found in several journals: Journal
of the Acoustical Society of America, Journal of the
Acoustical Society of Japan, Phonetica, Journal of
Phonetics, Journal of the International Phonetic
Association, Language and Speech, and Speech
Communication.
What this author thinks keeps the field together is
this: on the one hand, we see speech as a powerful but
uniquely human instrument for conveying and propagating information; yet because of its immediacy
and ubiquity, it seems so simple and commonplace.
But on the other hand, we realize how little we know
about its structure and its workings. It is one of the
grand scientific and intellectual puzzles of all ages.
And we do not know where the answer is to be found.
Therefore we cannot afford to neglect clues from any
possibly relevant domain. This is the spirit behind
what may be called unifying theories in phonetics:
empirically based attempts to relate to and to link
concerns in several of its domains, from traditional
phonology to clinical practice, as well as in the other
applied areas. In an earlier era, Zipf's principle of 'least effort' exemplified such a unifying theory:
the claim that all human behavior, including that in
speech, attempts to achieve its purposes in a way that
minimizes the expenditure of energy. Zipf applied his
theory to language change, phonetic universals, and
syntax, as well as other domains of behavior. In the
late 20th century, there were unifying theories known
by the labels of motor theory of speech perception
(Liberman et al., 1967; Liberman and Mattingly,
1985), quantal theory, action theory, direct realist
theory of speech perception, and biological basis of
speech, among others. They address questions in
phonetic universals, motor control, perception, cognition, and language and speech evolution. Needless
to say, one of the principal values of a theory
including the ones just mentioned is not that they
be true (the history of science, if not our philosophy of science, tells us that what we regard as true at
the start of the 21st century may be replaced by other
theories in the future), but rather that they be interesting, ultimately useful, testable, and that they force
us to constantly enlarge the domain of inquiry;
in other words, that they present a challenge to
conventional wisdom.
See also: Experimental and Instrumental Phonetics: History; Phonetic Pedagogy; Phonetic Transcription: History;
Phonetics, Articulatory; Phonetics: Precursors to Modern
Approaches; Speech: Biological Basis; Speech Development; Speech Perception; Speech Production; States of
the Glottis; Voice Quality; Whistled Speech and Whistled
Languages.

Bibliography
Ladefoged P (1993). A course in phonetics (3rd edn.). Fort Worth, TX: Harcourt Brace Jovanovich.
Liberman A M, Cooper F S, Shankweiler D S & Studdert-Kennedy M (1967). Perception of the speech code. Psychological Review 74, 431–461.
Liberman A M & Mattingly I G (1985). The motor theory of speech perception revised. Cognition 21, 1–36.
Maddieson I (1984). Patterns of sounds. Cambridge: Cambridge University Press.
O'Shaughnessy D (1990). Speech communication, human and machine. Reading, MA: Addison-Wesley.
Pickett J M (1980). The sounds of speech communication: a primer of acoustic phonetics and speech perception. Baltimore, MD: University Park Press.

Phonetics: Precursors to Modern Approaches


A Kemp, University of Edinburgh, Edinburgh, UK
© 2006 Elsevier Ltd. All rights reserved.
This article is reproduced from the previous edition, volume 6, pp. 3102–3116, © 1994, Elsevier Ltd.

It has often been maintained that linguistics, and by implication phonetics, only began in the 19th century,
but this is far from the truth. This article will attempt
to survey the development of the study of the sounds
of speech and their formation, from the earliest times
until approximately the end of the 19th century, excluding the Arab/Persian and East Asian traditions,
which will be separately treated.

Ancient India
Certain Sanskrit treatises, written during the first
millennium B.C., give an astonishingly full and mostly accurate account of the mechanism of speech, and of the way speech sounds (in this case the sounds of
Sanskrit) could be classified. One reason for the existence of these remarkable treatises relates to the importance which was attached to the reading of the
religious books known as Vedas, using the precise
canonical pronunciation handed down from earlier
times. The attempt to convey this resulted in a classificatory system for speech sounds which bears a very
close resemblance to that of modern phonetics. The
ancient Indian interest in phonetics is not confined to
these treatises. The great Indian grammarian Pāṇini frequently dealt with phonetic matters in his Aṣṭādhyāyī, as did later commentators (see Bhartṛhari). This
is in stark contrast to the virtual neglect of phonetics
by the Greeks and Romans. The following account of
Indian phonetics is heavily indebted to Allen (1953),
which should be consulted for a more comprehensive
description.



The Organs of Speech and Articulatory Processes

The treatises contain a set of technical terms for the various parts of the vocal tract: articulators (root, middle, and tip of the tongue, and lower lip) and places of articulation ('foot of the jaw' (= velum), palate, teeth, teeth-roots, upper lip).
Two main types of processes are described: (a)
those occurring in the vocal tract between the larynx
and the lips (internal); (b) those occurring elsewhere
(external). The first of these relates closely to the
modern term stricture, specifying four degrees of
closure between the articulators and the place of articulation: (i) 'touching' (= closure), resulting in stops; (ii) 'opened', giving vowels; (iii) 'slight contact'; and (iv) 'slight openness'. The last two relate to
the semivowels [j, r, l, V], and to the fricatives [S, s. , s,
#, x, h, P]. Most sources do not draw any distinction
corresponding to close and open vowels, though
some refer to contact being made in [i] and [u]. The
distinction between vowels and consonants is based
not only on stricture, but also on the syllabic function
of vowels. The external processes are those relating to
the larynx, lungs, and nasal cavity. The mechanism of
voicing is for the most part poorly described in the
Western tradition, at least before the 18th century.
The Indian phoneticians account, while not wholly
accurate, is far superior. Breath and voice are distinguished according to whether the glottis is open or
contracted, and there is an intermediate category between voiced and breathed, when the glottis is neither
fully open nor contracted for voicing. This occurs in
the Sanskrit sounds commonly transcribed as ⟨bh, dh⟩, etc. (namely, what is now called 'breathy voice',
though the existence of this category was vociferously
denied by some Western scholars even as late as the
19th century).
Unaspirated and aspirated stops are distinguished
by the lesser or greater amount of breath involved.
Nasals are produced by opening the nasal cavity together with the appropriate articulations in the
mouth, to give the nasal stops [m, n, J, 0, N], and
also nasalized vowels and semivowels.
Points of Articulation

Six points of articulation are distinguished: (1) glottal, or pulmonic, producing [h and P], whose similarity to vowels is emphasized; (2) velar root of the
tongue against root of the upper jaw (i.e., the velum)
([k, kh, g, gh, x]; (3) palatal middle of the tongue
against the palate ([c, ch, J, Jh, J, j, S]); (4) retroflex
(mūrdhanya, literally 'of the head', hence the terms 'cerebral' and 'cacuminal', common in the 19th century), producing [<, <h, B, Bh, 0, ]; the curling back
of the tongue tip is described, and the use of the

underside of it. Some sources also include [r] here but elsewhere it is described as alveolar, or even as
velar; (5) dental tip of the tongue against the teeth
[t, th, d, dh, n, l, s]; (6) labial [p, ph, b, bh, m, V, #].
The vowel system consists of short [a, i, u] and the
consonantvowels [l, r], with the corresponding long
[a:, i:, u:] and [r:]. [a] and [a:] are classed as glottal.
Allen (1953: 59) points out that ⟨a⟩ seems to have been regarded as a neutral vowel with unimpeded breath stream through the vocal tract, like [h] – apparently equated in some sources with pure voice – so that the voicing in the voiced consonants can be described as consisting of ⟨a⟩. [i] and [i:] are classed as palatal,
and [u] and [u:] as labial, from the obvious rounding
or protrusion of the lips (their velar component is not
mentioned, but this is true of most early vowel descriptions, especially when the vowel system is small
enough not to require this distinction to be made). [r,
r:] and [l, l:] are described as part vowel and part
consonant, presumably from their syllabic function.
Finally, the treatises distinguish the diphthongs [ai, au]
(described as glottopalatal and glottolabial respectively) and the vowels [e] and [o] (originally also
diphthongs), which had a quality intermediate between [a:] and [i:]/[u:].
Prosodies

Here also the Indian phoneticians are far ahead of any Western counterpart prior to the eighteenth century
in paying attention to features which characterize
longer stretches of speech. It is significant that the
Sanskrit term sandhi has been adopted in modern
phonetics to refer to junction features (see Sandhi).
These features include modifications which take
place at various boundaries between words, morphemes, or sound-segments. At word and morpheme
boundaries the treatises describe types of assimilation
between final and initial segments, for example, of
place or manner of articulation or of voicing, and the
insertion of certain glides to avoid hiatus between
vowels. There is a description of the extent to which
retroflexion may spread through words from a retroflex segment, and of the transitions between different
types of segments, for example, the different types of
stop release.
The syllable is defined as composed of a vowel with
possible preceding consonant(s), and a possible following consonant if before a pause, and there are
rules for syllable division. Vowel length is specified
in units called mātrās (cf. moras). Three distinct tone classes are distinguished: 'raised', 'unraised' or low, and 'intoned', described by most sources as a combination of the first two, i.e., falling. The relative nature of the pitch difference indicated is clearly recognized. Allen (1953: 89–90) draws attention to the interesting connection made by some sources with
possible physiological differences: the raised tone is
said to be brought about by tenseness and constriction of the glottis, and the low tone by laxness and
widening of the glottis.

Greece and Rome


The Greek Philosophers

The Greco-Roman tradition in the description of speech is relatively sparse, and is based more on auditory characteristics than on articulations. The sound-elements (called grammata or stoikheia) were divided by Plato into three groups: (a) those with phōnē (translatable as 'sonority') – the vowels; (b) those with neither phōnē nor psophos ('noise') – the stop consonants; and (c) intermediate sounds, having psophos but not phōnē – [l m n r s] and the double sounds [dz] (or [zd]), [ks], and [ps]. Aristotle renamed the third group hēmiphōna 'half-sonorous'. He also attributed an articulatory feature, prosbolē ('approach', 'contact'), to groups (b) and (c).
The Stoic philosophers distinguished three aspects
of the stoikheion: (a) the sound; (b) the symbol used
to represent it; and (c) its name. These were translated
later into Latin as aspects of litera (letter): potestas,
figura, and nomen. Subsequent use of the word litera,
or its equivalent in other languages is often ambiguous; sometimes its sense is not far different from that
of the modern term phoneme (see Abercrombie,
1965: 76, 85).
The Tekhne Grammatike

The Tekhnē grammatikē, traditionally attributed to Dionysius Thrax (see Dionysius Thrax and Hellenistic Language Scholarship) in the first century B.C., retained Aristotle's terminology for the three groups, but linked groups (b) and (c) under the term sumphōna, literally 'sounding with' (the vowels), i.e., consonants. The stop consonants (aphōna) were divided into three subgroups: (a) psila 'smooth', unaspirated [p t k]; (b) dasea 'rough', aspirated [ph th kh]; and (c) mesa 'intermediate' [b d g], said, puzzlingly, to be midway between the other two groups on the scale of smooth–rough, and not linked with larynx activity. The Greeks had only a vague notion of the mechanism of voicing. Aristotle attributes it to the air striking the trachea, and Galen later talks of it as breath 'beaten by the cartilages of the larynx', using the term glottis to refer to the top of the trachea.
The Tekhnē does not include any articulatory description of consonants or vowels, though it distinguishes long and short vowels. The category hugra 'moist', used to refer to [l r m n], is later translated as 'liquid' (the origin of the term is not certain), and sullabē 'held together', strictly used of consonant + vowel combinations but more loosely of vowels on their own, is the origin of the term 'syllable'. There are rules for syllable length relating to the
length of the vowel and to the consonants that follow
it. Tonal differences are marked respectively by acute,
grave, and circumflex accents, which probably represented high, low, and falling pitches.
Rome

The Roman grammarians on the whole followed the Greek model: Latin semivocales (from Greek hēmiphōna) comprise ⟨l, m, n, r, f, s⟩ (not [w], [j], like the modern term 'semivowel'); the terms mutae, vocales,
and consonantes are used for stops, vowels, and consonants respectively. The three Greek stop classes
were preserved and translated as tenues, aspiratae,
and mediae, in spite of the fact that Latin had no
distinctively aspirated stops.

The Middle Ages: The First Grammatical Treatise
In general there was little interest in phonetics in the
medieval period. One major exception is the short
work now known as the First Grammatical Treatise,
written by an anonymous Icelandic scholar in the
12th century A.D. with the intention of reforming the
Icelandic spelling system (see First Grammatical Treatise). The title refers to its place in the manuscript
which contained it. The writer was clearly well versed
in the Latin tradition, and familiar with the adaptation of the Latin alphabet to English.
What is striking about the work is its accurate
perception of phonological relationships. In order to
determine the segments which must be differentiated
in the writing system, the writer contrasts words with
different meanings which differ only in respect of one
of their sounds. Nine distinctive vowel sounds are
identified, to which are allotted the letters ⟨a, e, i, o, u, ǫ, ę, ø, y⟩; eight of these are exemplified in the set:
sar, so r, ser, ser, sor, sr, sur, and syr, all of which have
different meanings. Each of these vowels could have a
long or a short value, and could be either oral or
nasal. To mark this an acute accent was introduced
for length, and a superscript dot for nasality. Thus,
the vowel in far ('vessel') is short and in fár ('danger') long; hár ('hair') has a long oral vowel, hȧ́r ('shark') a long nasalized vowel. This gave 36 possible vowel distinctions (9 vowel qualities × 2 lengths × 2 nasality values). Length in consonants could be shown by capital letters: ǫl ('beer'), ǫL ('all').

The treatise shows astonishing linguistic insight,
anticipating principles more closely associated with
the twentieth century. Unfortunately it was not published until 1818, and remained almost unknown
outside Scandinavia for many years.

Spelling Reform
During the Middle Ages Latin had dominated the
linguistic scene, but gradually the European vernaculars began to be thought worthy of attention. Dante,
in his short work De vulgari eloquentia ('On the eloquence of the vernaculars'), gave some impetus to this in Italy as early as the fourteenth century. The sounds of the vernaculars were inadequately conveyed by the Latin alphabet. Nebrija in Spain (1492), Trissino in Italy (1524), and Meigret (see Meigret, Louis (?1500–1558)) in France (1542) all suggested ways of improving the spelling systems of their languages. Most early
grammarians took the written language, not the spoken, as a basis for their description of the sounds.
Some of the earliest phonetic observations on
English are to be found in Sir Thomas Smith's De recta et emendata linguae Anglicanae scriptione dialogus ('Dialogue on the true and corrected writing of the English language'), 1568. He tried to introduce more rationality into English spelling by providing some new symbols to make up for the deficiency of the Latin alphabet. These are dealt with elsewhere (see Phonetic Transcription: History). Smith was one of the first to comment on the vowel-like nature (i.e., syllabicity) of the ⟨l⟩ in able, stable and the final ⟨n⟩ in ridden, London.
John Hart (d. 1574) (see Hart, John (?1501–1574)), in his works on the orthography of English (1551, 1569, 1570), aimed to find an improved method of spelling which would convey the pronunciation while retaining the values of the Latin letters. Five vowels are identified, distinguished by three decreasing degrees of mouth aperture (⟨a, e, i⟩) and two degrees of lip rounding (⟨o, u⟩). He believed that these five simple sounds, in long and short varieties, were 'as many as ever any man could sound, of what tongue or nation soever he were'. His analysis of the consonants groups 14 of them in pairs, each pair shaped in the mouth 'in one selfe manner and fashion', but differing in that the first has an 'inward sound' which the second lacks; elsewhere he describes them as 'softer' and 'harder', but there is no understanding yet of the nature of the voicing mechanism. Hart's observations
of features of connected speech are particularly noteworthy; for example, a weak pronunciation of the
pronouns me, he, she, we with a shorter vowel, and
the regressive devoicing of word-final voiced consonants when followed by initial voiceless consonants, as in his seeing, his shirt, have taken, find fault. Like Smith, he recognized the possibility of syllabic ⟨l⟩, ⟨n⟩, and ⟨r⟩ in English (otherwise hardly commented
on before the 19th century). His important contribution to phonetic transcription is dealt with elsewhere
(see Phonetic Transcription: History). As a phonetic
observer he was of a high rank.

The Beginnings of General Phonetics: Jacob Madsen
An important phonetic work appeared in 1586, written by the Dane Jacob Madsen of Aarhus (1538–86) (see Madsen Aarhus, Jacob (1538–1586)), and entitled De Literis ('On Letters'). Madsen was appointed
professor at Copenhagen in 1574, after some years
studying in Germany. He was familiar with a number
of modern European languages as well as Greek,
Latin, and Hebrew, and in spite of the title of his
work he was not simply a spelling reformer. The
term litera is used in a broad sense, including the
sound as well as the symbol. Moreover, he intended
his work to cover the sounds of all languages; in this
respect he can be described as the first to deal with
general phonetics and not just the sounds of a particular language. He placed considerable emphasis on direct observation, as opposed to evidence from earlier authorities, but was strongly influenced by Petrus Ramus (see Ramus, Petrus (1515–1572)), the French philosopher/grammarian, and borrows extensively from his Scholae grammaticae (1569). Aristotelian influence is apparent in his two 'causes' of sounds: the remote cause (guttur, 'throat') provides the matter of sounds (breath and voice), and this is converted into specific sounds by the proximate cause, the mouth and the nose. The mouth has a movable part (the 'active' organ) and a fixed part (the 'passive' or 'assisting' organ). The former comprises lower jaw, tongue, and lips, and the latter the upper jaw, palate, and teeth.
In common with earlier accounts Madsen divides
the vowels into lingual and labial, remarking that
their sound is determined by the varying dimensions
of the mouth. He identifies three lingual vowels, ⟨a, e, i⟩, differing in mouth aperture (large, medium, and small, respectively), and three labial, ⟨o, u, y⟩, having progressively smaller lip opening; ⟨u⟩ and ⟨y⟩ also have more protrusion than ⟨o⟩. Madsen does not
commit himself on the tongue position of the labials
(one would need a glass covering of the mouth, he
says, to observe it), and thinks it unnecessary to describe it because nature spontaneously adapts it to
the sound. Few early descriptions are able to improve on this. He distinguished two varieties of
hei and three of hoi (including Danish hi).
