Speech Science PDF

Lecture 1
Speech Science
Acronyms
VT = Vocal Tract
VF = Vocal Fold
IPA = International Phonetic Alphabet (1 symbol = 1 sound)
FF = Formant Frequency / Formant Frequencies
VC = Vowel-Consonant
CV = Consonant-Vowel
VOT = Voice Onset Time
Specific Vocabulary
English French
Airstream Courant d’air
Bounce Rebondir
Burst Éclater / éclatement
Glides Semi-voyelles
Horseshoe Fer à cheval
Impede Entraver, gêner, empêcher
Jaws Mâchoires
Outward airflow Flux d’air sortant
Pitch height Hauteur tonale (hauteur ou gravité)
Plosive Occlusive
Pressure drop Chute de pression
Schwa ‘eu’ (menu) [ə] voyelle neutre / centrale
Slack Mou, lâche
Swallowing Déglutition
Thick Épais
Tip Bout / pointe
To scold Réprimander, gronder
Utterance Énoncé / prononciation
Velum Palais mou
Vocal folds/cords Cordes vocales
Welsh Gallois
Yield Rendement / produire
Broad Large / vaste
Sloppy Négligent / bâclé
Onset Début / survenue
Vowel Diagram Triangle vocalique
Slope Pente
Tilting Basculement / inclinaison
Solely Uniquement
Cutback Diminution / réduction
Damped Amorti·e
Definitions
Phonetics: the science of speech sounds (language independent)
- Articulatory phonetics: how speech sounds are produced/articulated
- Acoustic phonetics: physical properties like voicing, aspiration, frication
- Auditory phonetics: perception
Phonology: the principles and patterns by which sounds are used in a language (language
dependent)
Phonetics and phonology don’t deal with speech understanding!
Phonemics, or Phonology, is the study of the distribution of sound systems in human
languages.
A Phoneme is a particular set of sounds produced in a particular language and
distinguishable by native speakers of that language from other (sets of) sounds in that
language.
Turbulent
Transient
Fundamental Frequency
Vocal Tract
Phoneme
Minimal pair
Phones
Allophones
Syllables
Morphemes
Articulatory phonetics
Place of articulation
Phonetic transcription
Broad transcription
Narrow transcription
Source filter theory
Formant frequency rate
Voice Onset Time
ACRONYMS
SPECIFIC VOCABULARY
DEFINITIONS
SPEECH PRODUCTION
The Larynx
Different positions of the glottis
Vocal fold vibration
Intensity and quality
Airstream mechanisms
ARTICULATORY PHONETICS
Manner of articulation
Manner of articulation continued ~ linguistic categorization
Place of articulation
Consonants
Vowels
Phonetic transcription continued
Overview
Consonants
Parametric analysis
THE VOCAL TRACT AS AN ACOUSTIC RESONATOR
Characteristics of speech waveform
Periodicity
Duration
The voicebox
Physiologic and acoustic aspects of speech sounds
Source filter theory
Vocal tract constrictions relative to neutral “schwa”
SEGMENTAL INFORMATION
Vowels
Vowel duration
Summary of acoustic features for vowels
Front
Central
Back
Diphthongs /ai/, /au/
Consonants
Voice Onset Time (VOT)
Formant transitions
Release bursts
Summary of acoustic features for consonants
Cues to manner and place of articulation
Voicing cues
Plosives
Fricatives
Affricates
Nasal consonants
Nasal vowels
Glides
Liquids
EXERCISES
Speech synthesis: takes text as input and produces understandable speech as output.
Text → Language identification → Linguistic processing (morphological decomposition; lexicon & rules, syntax analysis) → Prosodic description → Phonetic processing → Concatenation → Sound generation
Speech recognition: takes speech as input and produces text as output (or carries out commands, etc.).
Acoustic signal → Signal analysis and spectrum codebook → Sequence of spectra and energies → Language model → Word sequence
Speech production
Air is moved from the lungs, through the trachea and via the pharynx, to the oral and nasal cavities.
During speech, some parts of the vocal tract are constricted: outward airflow is impeded →
pressure rises. Variations in tracheal air pressure provide the basis of speech sounds.
Speaking is an alternation over time between a constricted articulation and an open
articulation: no gaps!
Average syllable rate: 2-5 syllables per second!
Vocal Tract = pharynx + oral + nasal cavities. The length, from glottis (VF) to lips, varies:
- Adult male → 17 cm
- Adult female → 14-15 cm
- Child → 8-9 cm
Stretching the VF is brought about by enlarging the distance between the thyroid cartilage and
the two arytenoid cartilages (done by the cricothyroid muscles).
The Larynx
The larynx sits on top of the trachea and controls the flow of air in and out of the lungs. Vocal
folds are inside the larynx. The glottis is the opening between the vocal folds. The latter are
elastic.
Thyroid, Cricoid and Arytenoid cartilages support muscles which bring about changes in
voicing.
During swallowing, the epiglottis covers the entrance to the larynx (food and liquids must
pass over the entrance to the lungs).
The tension and elasticity of the VF can be varied, and the folds can be made thicker or thinner, shorter or longer. The glottis is:
- Open for voiceless consonants
- Closed for voiced ones
Myoelastic aerodynamic theory of phonation: the VF are set in vibration by the airstream from
the lungs, rather than by nerve impulses.
Aerodynamic theory: the vocal folds are parted by subglottal air pressure pushed up from the
lungs. Hundreds of “pops” of air per second!
For voicing, the air pressure below the vocal folds must exceed the pressure above the folds.
The VF close due to a pressure drop. Bernoulli principle: an increase in velocity results in a
drop in pressure, the pressure drop being perpendicular to the direction of the flow. So,
increase in velocity within a narrow passage → decrease in pressure against the lateral wall
(cf. VF: a sudden drop in pressure against the inner sides of each fold → the VF are sucked
together again).
Voiceless speech sounds: /s, t/
Voiced sounds: /a, u, i/
Voiced consonants: /z/, /d/
Turbulent: breath stream passes through narrow constriction (in VT or pharynx, glottis is
open). Produces an aperiodic turbulent stream (/s, f/).
Transient: sudden increase of air flow due to release of air pressure built-up behind a
constriction (/p, t, k/)
Different positions of the glottis
a: wide apart;
b: narrow closure;
c: opening and closing;
d: tightly closed.
Vocal fold vibration

The fundamental frequency (F0) is the number of times the VF open and close per second.
Speech is a complex signal: consists of F0 and multiples of it (harmonics). F0 changes
constantly during speech (intonation)
F0 depends on the following properties of the vocal folds:
- elasticity (more elastic → higher frequencies because they bounce back faster),
- tension (tenser → mass decreases → higher F0),
- mass (longer and thicker → lower F0).
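To make the "F0 and multiples of it" statement concrete, here is a minimal sketch in Python. The function name `harmonic_series` and the example value of 100 Hz are illustrative assumptions, not part of the course material.

```python
def harmonic_series(f0_hz, count):
    """Return the first `count` harmonics of f0_hz.

    The first harmonic is F0 itself; the n-th harmonic is n * F0.
    """
    if f0_hz <= 0 or count < 0:
        raise ValueError("f0_hz must be positive and count non-negative")
    return [n * f0_hz for n in range(1, count + 1)]


# Illustrative value only: an F0 of 100 Hz (within the male range quoted in these notes).
print(harmonic_series(100, 5))
```

A higher F0 spaces the harmonics further apart, which is why the harmonic spacing differs between male, female and child voices.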
1: Vocal fry (creaky) voice
2: Modal (normal) voice
3: Falsetto voice
4: Breathy voice
F0 male: 80-110 Hz; VF length 17-24 mm
F0 female: 180-240 Hz; VF length 13-17 mm
F0 children: >300 Hz
Singers: range of 2 octaves; the VF can stretch by up to 4 mm.
Intensity and quality

Increased vocal intensity ← sound pressure of greater amplitude + addition of them ←
larger puff of air released (VF blown wider apart) ← greater resistance by the VF against
the increased airflow
Pitch: controlled by VF
Timbre/quality: controlled by tongue, jaws, lips
Monophthong: relatively stationary sound

Diphthong: movement of one vocal-tract position to the other
Airstream mechanisms
Production of any speech sound involves the movement of an airstream.
- Most sounds are pulmonic egressive: produced by pushing air out of the lungs, through the
mouth and sometimes also through the nose.
- Glottalic airstream mechanism: implosives and ejectives. However, with implosives,
the air is sucked in, while it is pushed out for ejectives. Instead of lung air, the air in
the mouth is moved.
- Velaric (lingual) airstream mechanism: clicks. Clicks are also ingressive.
6 possible airstream mechanisms:
- Pulmonic egressive (used in all languages)
- Pulmonic ingressive (NF)
- Velaric egressive (NF)
- Velaric ingressive (Zulu : https://www.youtube.com/watch?v=CcE-BdgCW2A )
(produces click consonants)
- Glottalic egressive (Navajo : https://www.youtube.com/watch?v=XFayFUiyv20 )
(produces ejective consonants) (p’,t’,q’…)
- Glottalic ingressive (Sindhi) (produces implosive consonants) (ɓ, ɗ, ɠ…)
Phonemes are individual sounds, the smallest contrastive units in the phonology of a
language: replacing one phoneme by another can change the meaning of a word. They can be
combined to produce distinct word forms. They are written between
slashes / /. They are not defined acoustically by their sound properties, but by their function
in a language system.
The contrast between “pot” and “tot” is phonemic (/p/ vs /t/), but a difference such as plain [p] vs aspirated [pʰ] in “pie” is not phonemic in English.
Each language has 20-40 phonemes (/heed/, /hid/, /ahead/, /hayed/, /had/, /hod/, /hawed/,
/hoed/, /hood/, /who’d/, /hide/, /hud/, /howed/, /heard/, /hoyed/).
If you have a phonological knowledge of a language, you can:
- Produce sounds which form meaningful utterances
- Recognize foreign words, foreign accents
- Make up new words
- Add the appropriate segments to form plurals and past tenses
- Know when to aspirate plosives and when not
- Know whether a sound belongs to your language or not
- Know that different phonetic utterances represent the same ‘meaningful unit’.
Minimal pair: when 2 different forms are identical in every way except for one sound
segment (phoneme) that occurs in the same place in the word. Both phonetic form and
meaning are changed (eg : Sink-Zink, Junk-Chunk, Boy-Buy, Teeth-Teethe…).
Counterexamples: “butter” pronounced with [t] or with a glottal stop (bu[ʔ]er) has the same meaning in English, so it is not a minimal pair; “seed” and “soup” differ in more than one segment.
When the phonetics is different but the phonemes are the same, we call these sounds “free
variations”. In English, the glottal stop is not a phoneme. The same holds for unreleased or released
plosives at the end of a word. Such a difference can be transcribed phonetically (using diacritics), but it is not
distinctive phonemically in English.
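The formal half of the minimal-pair definition (identical except for one segment in the same position) can be sketched as a small check. This is only an illustration: the function name `is_minimal_pair` and the phoneme sequences below are assumptions made for the example, and the code does not test the second condition, that the meaning changes too.

```python
def is_minimal_pair(phonemes_a, phonemes_b):
    """True if the two phoneme sequences have the same length
    and differ in exactly one position."""
    if len(phonemes_a) != len(phonemes_b):
        return False
    differences = sum(a != b for a, b in zip(phonemes_a, phonemes_b))
    return differences == 1


# Hypothetical broad transcriptions, written as lists of phoneme symbols.
sink = ["s", "ɪ", "ŋ", "k"]
zinc = ["z", "ɪ", "ŋ", "k"]
seed = ["s", "i", "d"]
soup = ["s", "u", "p"]

print(is_minimal_pair(sink, zinc))  # differ only in the first segment
print(is_minimal_pair(seed, soup))  # differ in two segments
```

Because the comparison works on phonemes, free variants such as a released or unreleased final plosive would be written with the same symbol and would not count as a difference.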
A word is a combination of both a permitted form and a meaning.

Nonsense words (or possible words) have permissible forms with no meaning: “What is a …?”
Non-permissible words: “What did you say?”
Those features can be phonemic: aspiration (English → no (predictable by a rule), Thai →
yes), length of vowels (Danish, Finnish, Arabic, Japanese, Korean → yes) and length of
consonants (Italian).
Phones are the physical sounds that are produced when a phoneme is articulated. As the
vocal tract doesn’t work discretely, each new production of the same phoneme by the same
speaker sounds different. They are written between brackets [ ].
Allophones describe a class of phones of one phoneme. [k] and [kh] (aspirated) are
noticeably different phones, yet both realize the phoneme /k/. The variation must be systematic (a predictable phonetic variant; it is
rule-governed). Stop consonants tend to have more allophones than other phonemes,
depending on context: aspirated, unaspirated, voiced, unvoiced, short, long… Being phones,
allophones are written between brackets [ ] too.
Lack of naturalness results from too few allophones like:
- kh (aspirated): initial in stressed syllables before non-front vowels (e.g. could,
because)
- k (unaspirated): after /s/ before non-front vowels; syllable initial in unstressed
syllables before non-front vowels; syllable final position (sometimes). E.g. skull,
scoot, teacup, peak
- k= (unreleased): syllable final position (sometimes). E.g. attic
- kyh (palatalized, unaspirated): initial in stressed syllables before front vowels (e.g.
keep)
- g : syllable initial before non-front vowels; sometimes syllable final position (e.g.
sheepdog, ago)
- g= : syllable final position, sometimes (e.g. rig)
- gy (palatalized): before front vowels (e.g. geese, regain)….
Auditory system is very sensitive to unnatural prosody.
Syllables are the next larger unit of speech after the phoneme. In English a syllable may
consist of a vowel alone, a vowel preceded by one, two, or three consonants, a vowel
followed by one, two, three or four consonants, or a combination of these. The following
words contain 1 syllable:
- Owe: a vowel alone
- Me: a vowel preceded by a single consonant
- Am: a vowel followed by a single consonant
- Strew: a vowel preceded by three consonants
- Inks: a vowel followed by three consonants
- Strengths: a vowel preceded by three and followed by four consonants
Morphemes are the smallest units of linguistic meaning (e.g. baseball = base + ball).
Articulatory phonetics
→ Relates linguistic features of sounds to positions and movements of the vocal tract
(articulators).
Vowel and consonant phonemes are classified in terms of:
- Manner of articulation (concerns how the vocal tract restricts airflow):
o Completely stopping the airflow with an occlusion creates a plosive (stop
consonant) → /d/, /b/, /t/, /p/, /k/, /g/
o Vocal tract constrictions of varying degree occur in liquids (/l/, /r/), fricatives,
glides (palatal glide = /j/ (y), labio-velar glide = /w/) and vowels
o Lowering the velum causes nasal sounds (versus oral sounds) (/n/, /m/)
- Place of articulation (location in the vocal tract)
- Voicing (presence/absence of vocal fold vibration)
https://calleteach.wordpress.com/2010/01/10/sounds-of-english-nasals-liquids-glides/
Manner of articulation
To split phonemes into the broad categories used by most languages.
- Vowels: air flows without constriction from lungs through pharyngeal and oral
cavities to the outside world.
- Glides are like vowels, but with narrow constrictions in the vocal tract.
- Stop consonants (plosives) involve the complete closure and subsequent release of a
vocal tract obstruction. Pressure build-up followed by burst. The closure in the oral
tract and the velum must be raised to prevent nasal airflow, except for glottal stop
(/heh/).
- Liquids are also like vowels, but tongue is used for some degree of obstruction. For
/l/, air escapes around the tip of tongue or dorsum. The /r/ has more variable
articulation. Generally voiced, but can be ‘devoiced’ in ‘please’ or ‘price’. Some
languages have voiceless ‘L’.
- Nasals involve a lowering of the velum. Air flows out of the nostrils. In English, only
nasalized consonants (oral tract completely closed). In French, also nasalized vowels
(air escapes through oral tract and nasal cavities). Vowels may be nasalized in
English, but the distinction is not phonemic (= vowel identity doesn’t change). In
French, there are pairs of vowels that differ only in the presence or absence of vowel
nasalization.
- Fricatives: narrow constriction in the oral tract (for some languages in the pharynx
and in the glottis). If the pressure behind the constriction is high enough and the
passage sufficiently narrow, airflow becomes fast enough to generate turbulence at
the end of the constriction.
o Labiodental fricatives: /f, v/ → friction created at the lips
o Alveolar fricatives: /s, z/ → friction created at the alveolar ridge
o Palatal / alveopalatal fricatives: measure → friction created just behind the alveolar ridge
o Dental fricatives: /ð/ (this), /θ/ (thin) → friction occurs between tongue and
teeth
o Velar fricatives: right, knight, enough, through, Bach
o Uvular fricatives: /r/ as in rose
o Voiceless glottal fricative: /h/
o Pharyngeal fricatives: tongue root is pulled towards pharynx (Arabic)
- Affricate (stop + fricative): gin, church
Manner of articulation continued ~ linguistic categorization

- Sonorants: continuous, intense, periodic speech sounds: diphthongs, glides, liquids,
nasals
- Obstruents: weak, aperiodic (although sometimes voiced): stop consonants and
fricatives.
o Unvoiced obstruents: tense
o Voiced obstruents: lax
For vowels, liquids, nasals, fricatives: relatively steady-state: the articulatory position can be
sustained (until speaker is out of breath).
Stops are transient consonants, involving a sequence of (rapid) articulatory events (closure
followed by release).
Glides are also considered transient phonemes, because they are usually released into an
ensuing vowel.
Place of articulation
This classification enables finer discrimination of phonemes. Languages differ considerably
with regard to place of articulation (within the various manner classes).
Place of articulation: point of narrowest vocal tract constriction.
Consonants :
- Labials:
o Bilabial: both lips constrict → /p/, /b/, /m/
o Labiodental: the lower lip contacts the upper teeth → /f/, /v/
- Dental, articulated with the tongue against the upper teeth → (/l/):
o Interdental → as in “the”
- Alveolar, tongue tip or blade against alveolar ridge → /t/, /d/, /n/, /s/, /z/
- Palatals, front part of the tongue is raised to hard palate → as in “measure”
- Velar, tongue is raised to soft palate or velum
- Uvular, the dorsum approaches the uvula
- Pharyngeal, constriction in the pharynx
- Glottis, vocal folds close or constrict
Vowels
https://en.wikipedia.org/wiki/IPA_vowel_chart_with_audio
Tongue position Part of the tongue Description
High Front Tongue constriction at hard palate
High Back Tongue constriction at soft palate
Low Back Constriction in the upper part of the pharynx
Low Front Constriction in the lower part of pharynx
The third parameter is the position of the lips (rounded or unrounded).
We can also split vowels between:

- Tense vowels: longer, extreme articulation (/i/, /ɵ/, /u/, /o/)
- Lax vowels: shorter, less extreme
https://msu.edu/course/asc/232/Charts/Tense-Lax_Vowels.html
Nasalization of vowels in English is predictable by a rule: nasalize a vowel or diphthong when
it occurs before a nasal consonant.
Phonetic transcription continued

(see Encyclopedia of Linguistics, Strazny)
Phonetic transcription: the use of sequences of phonetic symbols to represent speech.
- It is placed between square brackets [ ] and there is no capitalization or punctuation.
- The same word can be transcribed differently (dialect, fluency, speaker, language…)
- The IPA also provides symbols, like diacritics, for suprasegmental information (word
boundaries (rhythm), stress, intonation, tones (Chinese), pitch height, breathy or
creaky voices, nasalization) and for refining the pronunciation of an utterance
(aspiration, centralization)
1886: International Phonetic Association – Paris
The IPA designates a single symbol for the same sound, regardless of the spelling. At times
the IPA symbol and the spelled or orthographic symbol
coincide, sometimes they do not. The IPA is largely based on articulatory properties.
Broad transcription is a phonemic transcription that involves representing speech using just
a unique symbol for each phoneme of the language: abstract mental constructs.
Narrow transcription captures more phonetic details of the speech sounds.
[Figure: THE INTERNATIONAL PHONETIC ALPHABET (revised to 2015), © 2015 IPA. Sections: pulmonic consonants (manner of articulation by place of articulation), non-pulmonic consonants (clicks, voiced implosives, ejectives), vowels (close to open, front to back), other symbols, diacritics, suprasegmentals, tones and word accents. In the consonant table, symbols to the right in a cell are voiced, to the left are voiceless; shaded areas denote articulations judged impossible. Where vowel symbols appear in pairs, the one to the right represents a rounded vowel. The symbols themselves were lost in extraction.]
Overview
For English speakers
Consonants: manner of articulation by voicing
- Glides (semivowels): voiced /w/, /j/
- Stop consonants (plosives): voiced /b/, /d/, /g/; unvoiced /p/, /t/, /k/
- Liquids: voiced /l/, (/r/)
- Nasals: voiced /m/, /n/, /ŋ/
- Fricatives: voiced /v/, /ð/, /z/, /ʒ/; unvoiced /f/, /θ/, /s/, /ʃ/, /χ/, /h/
- Affricates: voiced /dʒ/; unvoiced /tʃ/
/j/ → lie ; /ŋ/ → ring ; /ð/ → this ; /θ/ → thin ; /χ/ → a rasped “h” ; /dʒ/ → judge ; /tʃ/ → chair
Parametric analysis
Lecture 2
The vocal tract as an acoustic resonator

Characteristics of speech waveform
[Figure: waveform of “The cat caught a mouse”; amplitude (−0.6658 to 0.6297) against time (0 to 1.292 s)]
Periodicity
Periodic: nasals, vowels and approximants
Aperiodic: fricatives
Quiescent: silence preceding the plosive, voice onset time
Transient: release
Duration
[Figure: 5 periods of the vowel /a/ from “the cat caught a mouse”; amplitude (−0.6467 to 0.4969) against time (0 to 0.03027 s)]
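[Figure A: waveform segmented as /s/ /o/ /l/ /d/ /ier/; amplitude (−0.8545 to 0.8741) against time (0 to 0.383628 s)]
[Figure B: zoomed waveform with marked points (a) and (b?); amplitude (−0.8545 to 0.8741) against time (0 to 0.140272 s)]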
The voicebox
The vocal tract (VT) can be thought of as a tube that is closed at one end (glottis) and open
at the other end (lips). This type of tube is known as a quarter-wave resonator. The lowest
frequency (formant, F1) at which a quarter-wave system resonates has a wavelength that is
4 times the length of the tube.
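For an adult male: $\lambda = 4 \times 17\ \mathrm{cm} = 68\ \mathrm{cm}$
$f = \frac{c}{\lambda}$, where $c = 340\ \mathrm{m \cdot s^{-1}}$, $[\lambda] = \mathrm{m}$ and $[f] = \mathrm{Hz} = \mathrm{s^{-1}}$
Hence $f_{1,\mathrm{male}} = 500\ \mathrm{Hz}$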
This type of tube will also resonate naturally at odd multiples of lowest frequency (500Hz,
1500Hz, 2500Hz…), odd because of the closure at one end.
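The quarter-wave calculation above can be reproduced with a short Python sketch. It assumes an idealised uniform tube (closed at the glottis, open at the lips) and the speed of sound of 340 m/s used in these notes; the function name `quarter_wave_formants` is illustrative.

```python
SPEED_OF_SOUND_CM_PER_S = 34000.0  # 340 m/s, the value used in these notes


def quarter_wave_formants(tube_length_cm, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    The n-th resonance is (2n - 1) * c / (4 * L): odd multiples of the
    lowest resonance, whose wavelength is 4 times the tube length.
    """
    if tube_length_cm <= 0:
        raise ValueError("tube_length_cm must be positive")
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * tube_length_cm)
        for n in range(1, count + 1)
    ]


# Adult male vocal tract of 17 cm, as in the worked example above.
print(quarter_wave_formants(17.0))
# Child vocal tract, taking 8.5 cm as the middle of the 8-9 cm range.
print(quarter_wave_formants(8.5))
```

These are the resonances of a neutral, unconstricted tube only; real vowel formants shift away from these values with the place and degree of constriction.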
Usually, only the first three formants of the VT are considered. Exact values depend on
length and shape of VT (place of constriction and degree of narrowness of constriction).
F1 = lowest resonant frequency, F2 = second formant, F3 = third formant etc.
F4, F5 and higher are relatively constant regardless of changes of the VT.
- F1 is related to volume of the pharyngeal cavity as well as how tightly the VT is
constricted
- F2 is related to the length of the oral cavity
Volumes are changed by the position of the tongue (in general: larger volumes will resonate
at lower frequencies, smaller volumes at higher frequencies).
- Raise tongue to palate for high front vowel /i/ → enlarge the pharyngeal cavity
behind the tongue constriction and decrease the volume of the oral cavity in front of
the tongue constriction. As a result, F1 will be lower (volume of the pharyngeal cavity
is large, will resonate more strongly to lower harmonics); F2 will be higher, due to the
shorter length in the oral cavity (amplification of higher harmonics).
- Retract and lower tongue for low back vowel /a/ → enlarge the oral cavity and
decrease the volume of the pharyngeal cavity. Oral cavity is even further lengthened
in case of lip rounding. As a result, F1 will be higher (volume of the pharyngeal cavity
is smaller, resonation of higher harmonics); F2 will be lower because oral cavity is
larger.
NB: formant frequencies change depending on length of vocal tract.
Physiologic and acoustic aspects of speech sounds

The hearing level is given in decibels (dB).
The frequency is given in hertz (Hz) (= s⁻¹).
Vocal tract and vocal folds: summary for speech sounds
         Vocal tract                                Vocal folds
Summary  Length (cm)  Wavelength (cm)  F1 (Hz)      Length (mm)  F0 (Hz)  Wavelength (m)
Child    8-9          34               1000         -            >300     <1.1
Female   14-15        58               ~600         13-17        200      1.72
Male     17           68               500          17-24        100      3.4
Pitch
Speech sounds: 0-5000 Hz (sometimes up to 8000 Hz)

In daily speech, the low frequency sounds travel further than the high frequency sounds
because the attenuation increases with the frequency (and the area density of the obstacle).
Source filter theory

→ How the glottal sound is filtered according to the frequency response of the vocal tract.
Same F0 and harmonics of glottal source are present, but the amplitudes of the harmonics
have been modified, resulting in a specific sound quality.
According to this theory,
- Fall of 12 dB/octave in the glottal source spectrum
- Increase of 6 dB/octave from radiation at the mouth
→ Total = -6 dB/octave
Effect of vocal effort:
- Forceful way of speaking: -9dB/oct
- Average slope: -12dB/oct
- More relaxed effort: -15dB/oct
NB: One octave is reached when a frequency is doubled.

$\text{Number of octaves} = \log_2\left(\frac{f_2}{f_1}\right)$
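As a worked illustration of the octave formula combined with a spectral slope in dB/octave, here is a minimal Python sketch. The function names and the frequencies 100 Hz and 800 Hz are assumptions chosen for the example; the default slope of -6 dB/octave is the total quoted in these notes.

```python
import math


def octaves_between(f1_hz, f2_hz):
    """Number of octaves from f1 to f2: log2(f2 / f1)."""
    if f1_hz <= 0 or f2_hz <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(f2_hz / f1_hz)


def level_change_db(f1_hz, f2_hz, slope_db_per_octave=-6.0):
    """Level change between f1 and f2 for a constant slope in dB/octave."""
    return slope_db_per_octave * octaves_between(f1_hz, f2_hz)


# 100 Hz -> 800 Hz is three doublings, hence three octaves.
print(octaves_between(100.0, 800.0))
# With the overall -6 dB/octave slope, the level falls by 18 dB over that span.
print(level_change_db(100.0, 800.0))
# With a -12 dB/octave slope, the fall over the same span is twice as large.
print(level_change_db(100.0, 800.0, slope_db_per_octave=-12.0))
```

This treats the slope as constant across the whole range, which is a simplification: formant peaks add local amplification on top of the general slope.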
Source spectrum is the same for the different vowels; it is changed by the filtering of the vocal
tract → Overall, independence of source and filter.
F0 varies according to the gender or age (spacing between harmonics) but the shape of
(vowel) spectrum is not affected by changes in F0.
NB: formant frequency estimation is more difficult at higher pitches (the attenuation is less
important).
General slope of spectrum depends on:

- Sound source,
- Degree of proximity of the formant frequencies (close formant frequencies (FF)
reinforce each other),
- Interactions between amplification and attenuation processes.
Some FF also have an effect on higher formant frequencies (F3).
Speech sounds are complex signals; they can be analysed with the Fourier transform.
Vocal tract constrictions relative to neutral “schwa”

Oral constriction: the frequency of F1 is lowered by any constriction in the front half of the
oral part of the vocal tract, and the greater the constriction, the more F1 is lowered (relative to
neutral ‘schwa’).
- Tongue rises to middle palate, e.g. /i/
- Constriction at the lips or teeth, e.g. /u/
Pharyngeal constriction: the frequency of F1 is raised by constriction of the pharynx, and the
greater the constriction, the more F1 is raised.
- No pharyngeal constriction for /o/, therefore low F1.
Back tongue constriction: the frequency of F2 tends to be lowered by a back-tongue
constriction, and the greater the constriction, the more F2 is lowered.
- Tongue is humped towards back of palate for constriction: for [u], more tongue
constriction and more lip rounding than for [o]
Front tongue constriction: the frequency of F2 is raised by a front tongue constriction and
the greater the constriction, the more F2 is raised.
Lip-rounding: the frequencies of all formants are lowered by lip rounding (length of VT is
increased). The more the rounding, the more the constriction, the more they are lowered.
Segmental information
Vowels
[Figure: vowel chart plotting second formant frequency (Hz, 0 to 3000) against first formant frequency (Hz, 200 to 1200) for the vowels i, I, e, y, a, u, o of four speakers: WD (female), MD (male), JW (male), AG (female)]

F1 dimension:
- Open/low /a/ VS close/high /i/
- Rounded /u/ VS unrounded /a/
F2 dimension:
- front /i/ VS back /u/
Vowel duration
No fixed value for vowel duration (inherent duration). Duration is influenced by:
- syllable stress
- speaking rate
- voicing of preceding or following consonant
- place of articulation of preceding and following cons.
- utterance position (syntactic feature)
- word familiarity
Summary of acoustic features for vowels

- formant pattern
- spectrum (slope, rate-of-frequency change in spectrum)
- duration
- fundamental frequency
- formant bandwidth
- formant amplitude
Vowels
Front vowels
- Phonemes: /i, e, ɛ, æ/
- Phonetic: front tongue position; F1 reflects tongue height! The higher the tongue position, the lower the F1 value. Some high vowels have a higher intrinsic F0 than others (inherent F0 is relative F0).
- Acoustic: large separation between F1 and F2, relatively close positioning between F2 and F3.
- Duration: /ɛ/ < /i, e, æ/
- Spectrum (values from the figure): [−0.1642 ; 0.259], ~5.89 ms
Central vowels
- Phonemes: /ʌ, ə/
- Phonetic: central tongue position.
- Acoustic: uniform formant pattern (formant frequencies tend to be equally spaced).
- Duration: /ə/ < /ʌ/
- Spectrum (values from the figure): [−0.2263 ; 0.268], ~7.79 ms
Back vowels
- Phonemes: /u, ʊ, o, ɔ, ɑ/
- Phonetic: back tongue position, various degrees of tongue height. The lower the tongue position, the higher the F1 frequency. These sounds vary primarily in the frequency of F1, which varies with vowel height.
- Acoustic: small separation between F1 and F2, large separation between F2 and F3.
- Duration: /ʊ/ < /u, o, ɔ, ɑ/
- Spectrum (values from the figure): [−0.2856 ; 0.3335], ~6.44 ms
Diphthongs /ai/, /au/

- Are produced with a relatively open vocal tract
- Have well-defined formant structure (first steady-state formant, then transition, then
again steady-state part)
- Have no single-formant pattern
- Have relatively SLOW dynamic changes
- Are specified by their on- and off-glides
- Display considerable variation in on- and off-glide values in different contexts or at
different rates
Formant frequency rate may be a characteristic feature of diphthong production. A
perceptual study by Gay (1968): rate-of-frequency change seems to be more ‘invariant’ than
onset or offset values.
Consonants
Voice Onset Time (VOT)
→ refers to the time (in ms) from the release of the burst to the beginning of vocal
fold vibration for the following vowel. It is an indication of the coordination between the
laryngeal and articulatory systems:
- VOT < 0 → vocal folds are vibrating before release. Also called prevoicing or voicing lead (voice
bar). Usually 10-20 ms, depending on the accent and on the language.
- VOT = 0 → voice onset and release occur at the same time.
- VOT > 0 → voicing lag (short lag, or longer with aspiration): onset of VF vibration follows the release
burst.
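The three VOT cases can be written as a small classifier. This is a sketch only: the function name `classify_vot` and the category labels are illustrative, and the exact-zero test mirrors the idealised "VOT = 0" case (a real measurement would need a tolerance, which is not specified in these notes).

```python
def classify_vot(vot_ms):
    """Map a voice onset time in milliseconds to one of three categories.

    Negative: voicing starts before the release.
    Zero:     voicing starts at the release.
    Positive: voicing starts after the release.
    """
    if vot_ms < 0:
        return "voicing lead (prevoicing)"
    if vot_ms == 0:
        return "simultaneous voice onset and release"
    return "voicing lag"


# Illustrative values, not measurements.
for value in (-15, 0, 60):
    print(value, "->", classify_vot(value))
```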
https://www.youtube.com/watch?v=KkiuV8GGKUw
VOT varies with place of articulation. In general it increases as place of articulation moves
backward in the oral cavity:
- Bilabials → shortest VOT (often prevoicing)
- Alveolars: intermediate VOTs
- Velar: longest VOT
VOT is not as important in signalling the voicing distinction in final position as it is in initial
position. In final position (of a word) the duration of the vowel preceding the stop is more
important for the voiced-voiceless contrast. Vowels are longer before voiced stops and
shorter before voiceless ones (attendez VS attente).
Languages differ in VOT. VOT values for /p, t, k/ in Spanish, Italian, Dutch and French are
similar to those of /b, d, g/ in English. For English-speaking listeners it seems as though
speakers of Italian, Spanish and French produce only voiced stops (none of them are
aspirated – VOT > 0).
Also, VOT has been used as a measure to document developmental changes. Young children
do not produce voiced and voiceless stops with clearly separated VOT values.
Lecture 3
Formant transitions
20 - 50 ms from stop to vowel, or vice versa (very rapid articulations)
Very important cues for stop identification and place of identification: release burst is often
not produced in daily speech, transitions are always produced but very fast and therefore
very difficult to measure.
F1 → manner of articulation (degree of constriction): the tighter the VT constriction, the
lower F1 (e.g. plosives). Changes from 0 to F1 of vowel.
F2 & F3 → place of articulation: related to the length of the oral cavity. Reflects the
movement of the tongue and lips in a backward-forward direction. Starting point (locus):
- /b/: 600 – 800 Hz
- /d/: 1800 Hz
- /g/: different places of articulation (2 loci: 1300 Hz, 3000 Hz)
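As a toy illustration of how the F2 loci listed above could be used, the sketch below returns the stop whose locus lies closest to a given F2 starting frequency. The table copies the locus values from these notes; the function names and the nearest-locus rule are assumptions made for illustration, not a method from the course, and real place identification also depends on the following vowel, F3 and the burst.

```python
# (stop, low_hz, high_hz): F2 locus values as listed in these notes.
# /g/ appears twice because two loci are given for it.
F2_LOCI_HZ = [
    ("/b/", 600.0, 800.0),
    ("/d/", 1800.0, 1800.0),
    ("/g/", 1300.0, 1300.0),
    ("/g/", 3000.0, 3000.0),
]


def distance_to_range(value, low, high):
    """Distance from value to the interval [low, high] (0 if inside it)."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def nearest_locus_stop(f2_onset_hz):
    """Return the stop whose F2 locus is closest to the given F2 onset."""
    best = min(
        F2_LOCI_HZ,
        key=lambda entry: distance_to_range(f2_onset_hz, entry[1], entry[2]),
    )
    return best[0]


# Illustrative onset values, not measurements.
for onset in (700.0, 1750.0, 2900.0):
    print(onset, "->", nearest_locus_stop(onset))
```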
Rate-of-frequency change
Slope (between beginning and end frequencies): linear? exponential?
Effect of longer transition duration
Release bursts
Several studies investigated whether a spectral template can be associated with each place
of articulation (Blumstein & Stevens, 1979 → 85% accuracy with 1800 tokens).
Important features of the release burst are:
- The spectrum at burst onset (falling, rising, steady)
- The spectrum at voice onset
- VOT
- Burst amplitude (in some languages)
- Temporal changes in spectral shape appear to be very important for classification:
dynamic cue
- Presence/Absence of mid-frequency peak
Sequence of acoustic cues for syllable-initial stops:
Silence → Burst → Aspiration (only voiceless sounds) → Transition
Sequence of acoustic cues for medial stops:
Transition → Silence → Burst → Aspiration (only voiceless sounds) → Transition
Sequence for final stops:
Transition → Silence → Burst (only sometimes) → Aspiration (only sometimes)
Summary of acoustic features for consonants

- Voice Onset Time (VOT)
- Formant transitions
- Type of burst (for the stops)
- To be continued
Cues to manner and place of articulation

Bilabial:
- Burst: most of the energy in the burst spectrum is low in frequency, between 500 Hz
and 1500 Hz. Also, the acoustic energy may be spread out over a wide range of
frequencies → flat or falling spectrum;
- F1 increases from nearly zero to the frequency of the vowel (the F1 for vowels is
always higher than the F1 of a plosive);
- F2 increases from a low frequency (800 Hz) to the F2 of the vowel;
- F3 increases from a relative low-frequency (2200 Hz) to the F3 of the vowel.
[Figure: onset of /aba/ and onset of /ada/ (vertical axis 0-4000, time axis 0.33-0.55 s)]
Alveolar:
- Burst: the small area in front of the constriction acts as a high-pass filter
(emphasizing the higher-frequency components in the noise source). High-intensity
and high-frequency energy between 2500 Hz and 4000 Hz → a rising spectrum;
- F1 same as for bilabial (tight constriction)

- F2 begins at locus of approx. 1800 Hz (direction depends on following vowel, for
context effects, see further). Usually, F2 has a rising transition for the high front
vowels and a falling or a flat transition for the other vowels.
Velar:
- Burst: relatively large area in front of constriction enhances energy in the midrange
of frequencies, between 1500 Hz and 4000 Hz. Depends partly on the tongue
position for the following vowel → a compact mid-frequency spectrum;
- F1 same as for bilabial and alveolar;

- F2 has two loci! One at about 1300 Hz, the other near 2300 Hz. At the beginning, F2
and F3 are relatively close, but they diverge during the transition. Also depends on
following vowel: when followed by a back vowel, a larger cavity is created (resonates
lower frequencies). When /k/ is followed by a front vowel, the smaller cavity
resonates the higher frequencies.
Voicing cues
- F0 tends to be higher for vowels following voiceless stops than those following
voiced ones.
- VOT
- F1 cutback = a delay in F1 relative to the higher formants.
Plosives
Phonemes: /p, t, k, b, d, g/
Phonetic features :
- Manner: stop (plosive)
- Place (bilabial, alveolar, velar)
Acoustic cues
- Silence (corresponds to the period of oral constriction = stop gap)
o Voiced stops: low energy, also called voice bar
- Burst: corresponds to the articulatory release of the oral constriction and to the
aerodynamic release (due to build-up of pressure). Bursts occur in initial and medial
position and are rarely found in final position. Place of articulation may be signaled by
the spectrum of the burst, but the transition is also very important.
- Transition: corresponds to the articulatory movement from the oral constriction for the
stop to the more open tract for a following sound (usually a vowel). Easier to identify
for voiced than for voiceless sounds.
Most important features:

- stop gap
- release burst
- presence/absence of voice onset time (VOT)
- transition
- voicing features
Duration stop gap: 50-100 ms
Duration burst: 5-40 ms (a ‘transient’ = disappears immediately, shortest event in speech!).
In English the bursts of voiceless stops are longer than those of voiced stops because of
aspiration (turbulent air moving through the glottis delays vocal fold closure).
CV (consonant-vowel) and VC (vowel-consonant) transitions: 10-40 ms. They reflect changes in
the vocal tract. Such a short event is very difficult to measure/analyze. However, perceptually
very important!
Fricatives
Phonemes:
- voiced: /v, ð, z, ʒ/
- voiceless: /f, θ, s, ʃ, h, X/
Phonetic features:
- Manner: frication (turbulence)
- Place: labiodental, linguadental, alveolar, palatal, glottal
Acoustic cues:
- Voicing
- Frication noise: noise generated as air is forced through a narrow constriction. Then
filtered by the vocal tract:
o /f, v, θ, ð/ are produced most anteriorly. Not much of a front resonating
cavity. Therefore, a very low-intensity spectrum spread over a broad range of
frequencies: 4500 Hz - 7000 Hz
o /s, z, ʃ, ʒ/: the front cavity is like a tube open at one end (lips) and closed
at the other (the constriction behind the front cavity is very small). Therefore, a
quarter-wave resonator. The front cavity for /s/ is approx. 2.5 cm. The lowest
resonance frequency is 3400 Hz (wavelength = 2.5 cm x 4 = 10 cm; 34,000 cm/s ÷ 10 cm = 3400 Hz).
There are also higher resonances at odd multiples of the lowest. The lowest is most
important for the perception of fricatives.
o /ʃ, ʒ/: longer resonating front cavity than alveolars. Major resonant region
therefore somewhat lower. Also often produced with lip rounding (lengthens
the VT). Most energy: 2000 Hz
- Transitions to and from the vowels due to changes in the vocal tract
- Sibilants/stridents (/s, z, ʃ, ʒ/) have intense noise energy
- Non-sibilants (/f, v, θ, ð/) have weak noise energy
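The quarter-wave arithmetic for the front cavity can be sketched in a few lines. The function name and the default of three resonances are assumptions for this example; the speed of sound (34,000 cm/s) and the 2.5 cm cavity length for /s/ come from the notes above.

```python
SPEED_OF_SOUND_CM_PER_S = 34000.0


def quarter_wave_resonances(length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances of a tube closed at one end and open at the other.

    The lowest resonance is c / (4 * L); higher ones are odd multiples of it.
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, count + 1)]


if __name__ == "__main__":
    # Front cavity of about 2.5 cm for /s/.
    print(quarter_wave_resonances(2.5))  # [3400.0, 10200.0, 17000.0]
```

A longer front cavity lowers every resonance, which is consistent with the lower main energy region described for the palatal fricatives.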
[Figure: spectra of /s/ from /asa/, /sh/ from /asha/ and /f/; frequency axis 0-10000 Hz]
[Figure: waveforms of /asa/, /aza/ and /aha/ against time (s). Caption: Voiceless and voiced fricatives]
[Figure: /acha/, /aga/ and /aha/ plotted against time (s)]
Affricates
Phonemes: /ʧ, ʤ/
Phonetic features
- manner: fricative and stop (two phase pattern of production)
- place: linguapalatal
Acoustic cues
- Stop gap (silence or low energy interval) due to articulatory closure, or some voice
bar. The stop gap is not always well defined if it is preceded by silence or a pause.
- Frication is similar to that of fricatives. However, affricates show a relatively rapid
increase in noise energy (short rise time). The noise interval is relatively
shorter than in fricatives. Some studies have shown that the difference between
fricative and affricate can be cued on the basis of duration of frication alone.
- Transitions to and from the preceding and following vowel
[Figure: /adza/ and /atza/ plotted against time (s)]
Nasal consonants
Phonemes: /m, n, ŋ /
Phonetic features:
- manner: nasal
- place: bilabial, alveolar, velar
Acoustic features:
Due to lowering of the velum the nasal cavities are coupled to the rest of the vocal tract.
This introduces anti-resonances (anti-formants, zeroes) and an extra formant (nasal murmur,
nasal formant)
- Antiformants are bands of frequencies in which the acoustic energy has been
damped. On spectrograms they look like extremely weak intensity formants (position
depends on how open the velo-pharyngeal passage is). They occur mainly because
the nasal cavities are very sound absorbent (due to soft moist lining, cilia, and
internal structure). Sound waves traveling through the nose are damped at
frequencies of the antiformants.
- Note: a VT formant amplifies frequencies in its bandwidth and attenuates those
outside the cutoff frequencies. An antiformant attenuates frequencies within its
bandwidth and amplifies those outside its bandwidth.
- Nasals have both formants (intense) and antiformants (weak)!
- Nasal murmur: caused by oral blockage and lowered velum (extra resonances
because of 2 branches leading to pharynx). Results in nasal radiation of acoustical
energy. The spectrum is dominated by high-intensity, low-frequency energy (< 500 Hz).
The murmurs of the three nasals are not exactly alike, but they are difficult to use as a
distinctive cue
- Transitions: preceding and following vowels will be nasalized. Cues to place of
articulation
- Voicing is always present (except during whispering)
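The contrast between a formant (resonance) and an antiformant (anti-resonance) noted above can be pictured with a simple digital filter model. This is only a sketch under stated assumptions: a second-order resonator of the kind commonly used in formant synthesis, with the anti-resonator taken as its exact inverse. The centre frequency, bandwidth and sampling rate are hypothetical, and the gains are not normalised.

```python
import cmath
import math


def resonator_gain_db(f_hz, center_hz, bandwidth_hz, fs_hz=16000.0):
    """Gain (dB) at f_hz of a two-pole resonator centred on center_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)       # pole radius
    theta = 2 * math.pi * center_hz / fs_hz             # pole angle
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)     # z^-1 on the unit circle
    denom = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2
    return -20 * math.log10(abs(denom))


def antiresonator_gain_db(f_hz, center_hz, bandwidth_hz, fs_hz=16000.0):
    """Gain (dB) of the inverse filter: zeros where the resonator has poles."""
    return -resonator_gain_db(f_hz, center_hz, bandwidth_hz, fs_hz)


if __name__ == "__main__":
    # Hypothetical formant / antiformant at 1000 Hz with a 100 Hz bandwidth.
    for f in (500, 1000, 2000):
        print(f,
              round(resonator_gain_db(f, 1000, 100), 1),
              round(antiresonator_gain_db(f, 1000, 100), 1))
```

The resonator shows a peak at its centre frequency, while the anti-resonator shows a dip at the same place, which corresponds to the weak bands seen on spectrograms of nasals.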
Nasal vowels
Nasalize vowels by allowing the velum to remain slightly open.
Nasal-non nasal phonemic distinctions occur in many languages (not in English).
An open velum affects the filter curve of the VT: F1 becomes broader and less peaked (lower
amplitude) because F1 is damped by the loss of energy through the nose. The effect of
nasalization on a vowel depends on the non-nasal formant frequencies determined by the oral
tract shape and on the frequencies of the zeroes and poles introduced by nasal tract coupling.
Example of French:
- Both vowels have well-defined peaks in amplitude
- F1 of nasalized vowel is broadened
- F2 appears as a hump from 1100-1750 Hz, while it is a relatively narrow peak for the
non-nasalized vowel
F1 & F2 of nasalized vowel are 5-8 dB weaker than of non-nasalized vowel
F3 weakened by about 11 dB
The F4-F5 region is lowered by about 20 dB, possibly due to a zero near 3125 Hz
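To get a feel for what these decibel differences mean, the sketch below converts a level change in dB into an amplitude ratio and a power ratio, using the standard definitions (20·log10 for amplitude, 10·log10 for power). The function names are chosen for this example.

```python
def db_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a level change in dB."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Power (intensity) ratio corresponding to a level change in dB."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    # Level drops quoted above for the nasalized vowel.
    for drop_db in (5, 8, 11, 20):
        amp = db_to_amplitude_ratio(-drop_db)
        power = db_to_power_ratio(-drop_db)
        print(f"-{drop_db} dB: amplitude x{amp:.2f}, power x{power:.3f}")
```

A 20 dB drop, for instance, corresponds to one tenth of the amplitude and one hundredth of the power.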
[Figure: /ama/, /ana/ and /anga/ plotted against time (s); vertical axis 0-6000]
We have a loss of energy with the nasals
Glides
Also called ‘approximants’ and semivowels:
- Gradual articulatory movement
- VT narrowed, not closed
Phonemes: /j/ & /w/
Phonetic features
- Manner: glide or semivowel
- Place: palatal or labiovelar
Acoustic cues :
- A relatively slow transition (75-150 ms)
- No steady-state portion as with diphthongs
- F1 of both sounds starts at very low value (a little higher than for stops)
- F2 of /w/: 800 Hz (compare with /b/!), F3 of /w/: 2200 Hz. Lower frequencies for all
formants due to lengthening of the vocal tract.
- F2 of /j/: 2200 Hz (compare with /d/!), F3 is 3000 Hz
Longer glides: vowel-vowel sequences:

- [bi] - [wi]- [ui] and
- [du] - [ju] - [iu]
Differences between glides and voiced stops /b-w/ and /d-j/

[Figure: /w/ from /awa/ and /j/ from /aja/ (vertical axis 0-5000, time in s), compared with the
onset of /aba/ and the onset of /ada/ (vertical axis 0-4000, time axis 0.33-0.55 s)]
‘awa’ looks like ‘aba’ and ‘aja’ looks like ‘ada’ : it’s the same place of articulation, not the
same manner.
[w] – [b] same place of articulation (bilabial)

[j] – [d] : (front) palatal and alveolar
Difference in manner of production. Constriction at lips for [b] or at alveolar ridge [d] blocks
air stream, release burst, rapid transition
Frequencies and direction of the formant transitions are relatively similar for [b] - [w] and
[d] - [j]
- Frequency: starting frequency of F1 of [b] can be slightly lower (0) due to complete
constriction. Same for [d]
- Direction same for stop and semivowel because VT is constricted at similar places
Differences are:
- Voice bar and release burst for stop
- For [w]: strong low-frequency information throughout, no burst
- Duration of stop transition = 40 ms, of glide = 75 ms (note: speaking rate can play an
important role: a transition that is heard as a stop at a slow rate can be heard as a
glide at a fast rate, Miller & Baer, 1983)
- Steeper onset and offset in intensity changes for stop than for glide
Liquids
Phonemes: /l/ & /r/
Phonetic features:
- Manner: lateral or rhotic
- Place: alveolar for /l/, palatal for /r/
Acoustic cues: rather complex:
- both relatively fast formant transitions
- similarity with glides: well-defined formant structure (less constriction than stops,
fricatives, and affricates)
- /l/: energy mainly in the low frequencies. Resonances and anti-resonances due to
divided vocal tract. Resembles /n/. F1: 360 Hz, F2: 1300 Hz, F3: 2700 Hz
- /r/: similar for F1
o F2 somewhat lower than for /l/
o F3 especially lower (1650 Hz). Durations of formant transitions somewhat
longer for /r/ than for /l/
- temporal cues:
o /r/: F1 has a short steady-state + relatively long transition
o /l/: F1 has a long steady-state + relatively short transition
Places of articulation: bilabial, labiodental, linguadental, alveolar, palatal, velar, glottal
Stop
- Bilabial [p] [b]: F1 increases; F2 increases; burst has flat or falling spectrum
- Alveolar [d] [t]: F1 increases; F2 decreases except for high-front vowels; burst has rising
spectrum
- Velar [g] [k]: F1 increases; F2 increases or decreases; F2-F3 important
- Glottal [ʔ]: little formant change
Nasal
- Bilabial [m]: F1 increases; F2 increases; nasal murmur
- Alveolar [n]: F1 increases; F2 decreases except for high-front vowels; nasal murmur
- Velar [ŋ]: F1 increases; F2 increases or decreases; nasal murmur
Fricative
- Labiodental [v] [f]: F1 increases; F2 increases except for some back vowels; burst has weak
and flat spectrum
- Linguadental [ð] [θ]: F1 increases; F2 increases except for some back vowels; burst has weak
and flat spectrum
- Alveolar [z] [s]: F1 increases; F2 decreases except for high-front vowels; noise segment has
intense, high-frequency spectrum (> 4 kHz)
- Palatal [ʒ] [ʃ]: F1 increases; F2 increases or decreases; noise segment has intense,
high-frequency spectrum (> 3 kHz)
- Glottal [h]: little formant change; burst has weak and flat spectrum
Glide
- [w]: F1 increases; F2 increases
- [j]: F1 increases; F2 increases
Exercises
Write the symbols for the vowels in the following words: bread, rough, foot, hymn, pull,
cough, mat, friend.
Transcribe the following words: bake, bought, bored, goat, tick, guard, doubt, bough, peak,
football, bank, threat(e)ning, gently, usual.
Transcribe the sentences: “Opening the bottle presented no difficulty.”, “All the time.”