
9/5/2013

Speech and Audio Signal Processing
ECE554

Overview
- Speech Production/Perception
- Speech: Voiced/Unvoiced
- Speech Production Basics
- Speech Signal, Waveform
- Phonemes/Phonetics/Alphabets
- Lossless Model

Nikesh Bajaj
nikesh.14730@lpu.co.in
Asst. Prof. DSP, SECE
Lovely Professional University

Speech Production/Speech Perception

Human Vocal Mechanism
Speech production mechanism:
- Air enters the lungs via normal breathing; (generally) no speech is produced on in-take.
- As air is expelled from the lungs via the trachea (windpipe), the tensed vocal cords within the larynx are caused to vibrate (Bernoulli oscillation) by the air flow.
- The air is chopped up into quasi-periodic pulses which are modulated in frequency (spectrally shaped) in passing through the pharynx (the throat cavity), the mouth cavity, and possibly the nasal cavity; the positions of the various articulators (jaw, tongue, velum, lips, mouth) determine the sound that is produced.


Human Vocal Mechanism (continued)
- All sounds in English are formed during expiration (egressive sounds).
- Ingressive sounds are caused by inward airflow due to an air-sucking action (e.g., a gasp of surprise, infant cries).
- Sound amplitude increases with airflow rate.

Lung
- Breathing at rest: inspiring and expiring a tidal volume of about 0.5 l every 3-5 sec.
- 60% of the breathing cycle is spent exhaling.
- Lung capacity is 4-5 l for females/males; a residual of 1-2 l cannot be expelled.
- During ordinary speech, 50% of the lung capacity is used; very loud speech may utilize up to 80% of the lung capacity.

*Tidal volume is the lung volume representing the normal volume of air displaced between normal inspiration and expiration when extra effort is not applied.

Sounds
- Voiced
- Unvoiced

Glottal Flow
Glottal volume velocity and the resulting sound pressure at the mouth for the first 30 msec of a voiced sound: there is a roughly 15 msec buildup to periodicity => pitch detection issues at the beginning and end of voicing, and voiced/unvoiced uncertainty over those 15 msec.
Speech Classification
- Voiced speech
- Unvoiced speech
- Transition regions
Typical sampling rate: 8 kHz.

Voiced Speech
- Occurs when air flows through the vocal cords into the vocal tract in discrete puffs.
- The vocal cords vibrate at a particular frequency, called the fundamental frequency of the sound:
  - 50-200 Hz for males
  - 150-300 Hz for females
  - 200-400 Hz for children
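The fundamental-frequency ranges above can be checked numerically. Below is a minimal sketch, not from the slides, of autocorrelation-based pitch estimation at the 8 kHz sampling rate quoted above; the function name and the synthetic glottal-like pulse train are illustrative assumptions.

```python
import numpy as np

def estimate_pitch_autocorr(x, fs, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency by picking the autocorrelation
    peak within the plausible pitch-lag range [fs/fmax, fs/fmin]."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)           # lag search range
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

# Synthetic "voiced" excitation: pulse train near 120 Hz, fs = 8 kHz
fs, f0 = 8000, 120
period = round(fs / f0)          # 67 samples -> true f0 = 8000/67 Hz
x = np.zeros(fs // 4)            # 250 ms of signal
x[::period] = 1.0

print(round(estimate_pitch_autocorr(x, fs), 1))  # 119.4
```

The estimate lands in the male range of the table above, as expected for a 120 Hz source.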


Unvoiced Speech
- The vocal cords are held apart and air flows continuously.
- A narrow opening produces turbulence.
- Examples: /f/ and /s/.
- Predominantly high-frequency energy.

Other Sounds
Nasal sounds:
- The vocal tract is coupled acoustically with the nasal cavity.
- Sound radiates from the nostrils as well as the lips.
- Examples: /m/, /n/, /ing/.
Plosive sounds:
- Complete closure/constriction towards the front of the vocal tract.
- Pressure builds up behind the closure, followed by a sudden release.
- Examples: /ph/, /t/, /kh/.
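The high-frequency character of unvoiced speech suggests a classic discrimination heuristic, added here as an illustration rather than taken from the slides: the zero-crossing rate, which is low for voiced frames and high for noise-like unvoiced frames. A minimal sketch:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample sign changes within the frame."""
    s = np.sign(frame)
    return np.mean(s[:-1] != s[1:])

fs = 8000
t = np.arange(int(0.03 * fs)) / fs      # 30 ms frame at 8 kHz
voiced = np.sin(2 * np.pi * 120 * t)    # low-frequency, periodic
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(len(t))  # noise-like, broadband

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))  # True
```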


Resonant Frequencies of the Vocal Tract
- The vocal tract is a non-uniform acoustic tube terminated at one end by the vocal cords and at the other end by the lips.
- The resonances depend on the articulators: the cross-sectional area of the vocal tract and the positions of the tongue, lips, jaw and velum.

Formants
- The spectrum of the vocal tract response consists of a number of resonant frequencies: the formants.
- Speech normally exhibits one formant frequency in every 1 kHz, so three to four formants are present below 4 kHz.
- Voiced: the magnitude of the lower formants is larger than that of the higher ones. Unvoiced: the other way round.


Schematic Production Mechanism: Basic Model for Speech Production
- The lungs and associated muscles act as the source of air for exciting the vocal mechanism. Muscle force pushes air out of the lungs (like a piston pushing air up within a cylinder) through the bronchi and trachea.
- If the vocal cords are tensed, the air flow causes them to vibrate, producing voiced or quasi-periodic speech sounds (musical notes).
- If the vocal cords are relaxed, the air flow either continues through the vocal tract until it hits a constriction, causing it to become turbulent and thereby producing unvoiced sounds (like /s/, /sh/), or it hits a point of total closure in the vocal tract, building up pressure until the closure is opened and the pressure is suddenly and abruptly released, causing a brief transient sound, like at the beginning of /p/, /t/, or /k/.


Assumptions
- Excitation and vocal tract are independent => separate models.
- The vocal tract is time-varying, but with fixed characteristics over an interval of about 10 ms; formants change roughly every 10 ms.

Basic Sounds
- Phonemes: the smallest segments of speech, e.g., /d/. A phoneme is a linguistic unit and may not be directly observed.
- Sounds depend on the vocal tract and are person-dependent.
- IPA: the International Phonetic Alphabet.

Basic Speech Processes
idea -> words -> sounds -> waveform
- Words: "Hi All, did you eat yet?"
- Sounds: /h/ /ay/-/ae/ /l/-/d/ /ih/ /d/-/y/ /u/-/iy/
- Coarticulated sounds: /h-ay-l/-/d-ih-j-uh/-/iy-tj--t/ (hial-dija-eajet)
- Remarkably, humans can decode these sounds and determine the meaning that was intended, at least at the idea/concept level (perhaps not completely at the word or sound level).

Basics
- Speech is composed of a sequence of sounds (phonemes).
- Sounds (and the transitions between them) serve as a symbolic representation of information to be shared between humans (or humans and machines).
- The arrangement of sounds is governed by the rules of language (constraints on sound sequences, word sequences, etc.).
- Linguistics is the study of the rules of language; phonetics is the study of the sounds of speech.

Waveform of the Utterance "It's time"
- 100 msec/line; 0.5 sec for the utterance.
- S: silence or background, no speech.
- U: unvoiced speech, no vocal cord vibration (aspiration, unvoiced sounds).
- V: voiced speech, quasi-periodic.
- Speech is a slowly time-varying signal over 5-100 msec intervals; over longer intervals (100 msec to 5 sec), the speech characteristics change as rapidly as 10-20 times/second.
- There are no well-defined or exact regions where individual sounds begin and end.

Parameterization of Spectra
- The human vocal tract is essentially a tube of varying cross-sectional area, or can be approximated as a concatenation of tubes of varying cross-sectional areas.
- Acoustic theory shows that the transfer function of energy from the excitation source to the output can be described in terms of the natural frequencies or resonances of the tube.
- These resonances are known as formants or formant frequencies for speech; they represent the frequencies that pass the most acoustic energy from the source to the output.
- Typically there are 3 significant formants below about 3500 Hz.
- Formants are a highly efficient, compact representation of speech.

Spectrogram Properties
- Speech spectrogram: sound intensity versus time and frequency.
- Wideband spectrogram: spectral analysis on 15 msec sections of the waveform using a broad (125 Hz) bandwidth analysis filter, with a new analysis every 1 msec; it resolves individual periods of the speech and shows vertical striations during voiced regions.
- Narrowband spectrogram: spectral analysis on 50 msec sections of the waveform using a narrow (40 Hz) bandwidth analysis filter, with a new analysis every 1 msec; it resolves individual pitch harmonics and shows horizontal striations during voiced regions.
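The wideband/narrowband distinction is simply a window-length trade-off. The numpy-only sketch below is illustrative (the helper name and test signal are my own): 15 ms and 50 ms Hamming windows give roughly the broad and narrow analysis bandwidths quoted above, so only the longer window can resolve harmonics spaced 125 Hz apart.

```python
import numpy as np

def stft_mag(x, fs, win_ms, hop_ms=1.0):
    """Magnitude STFT with a Hamming window of length win_ms (msec),
    advanced by hop_ms (msec) per frame."""
    n = int(fs * win_ms / 1000)
    hop = max(1, int(fs * hop_ms / 1000))
    w = np.hamming(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
t = np.arange(fs) / fs
# Crude voiced-like signal: 125 Hz fundamental plus two harmonics
x = sum(np.sin(2 * np.pi * k * 125 * t) for k in (1, 2, 3))

wide = stft_mag(x, fs, win_ms=15)    # coarse in frequency: harmonics merge
narrow = stft_mag(x, fs, win_ms=50)  # fine in frequency: harmonics resolve
```

With the 15 ms window the frequency bins are ~67 Hz apart and the window mainlobe spans several bins, smearing the 125 Hz harmonics together; the 50 ms window (20 Hz bins) separates them, which is exactly the horizontal-striation behavior described above.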


Wideband and Narrowband Spectrograms (figure)

Wideband Spectrogram and Formant Frequencies (figure)
Formants:
- Perceptually defined.
- Resonance frequencies of the vocal tract.

Hindi Phonemes (figure)

American Phonetic Symbols (ARPABET representation)
48 sounds:
- 18 vowels/diphthongs
- 4 vowel-like consonants
- 21 standard consonants
- 4 syllabic sounds


Reduced Set of English Sounds
39 sounds:
- 11 vowels (front, mid, back; classification based on tongue hump position)
- 4 diphthongs (vowel-like combinations)
- 4 semi-vowels (liquids and glides)
- 3 nasal consonants
- 6 voiced and unvoiced stop consonants
- 8 voiced and unvoiced fricative consonants
- 2 affricate consonants
- 1 whispered sound

Classification of American Phonemes
Look at each class of sounds to characterize their acoustic and spectral properties.


Vowels
- Produced using a fixed vocal tract shape; sustained sounds.
- Vocal cords are vibrating => voiced sounds.
- The cross-sectional area of the vocal tract determines the vowel resonance frequencies and the vowel sound quality.
- Tongue position (height, forward/back) is most important in determining the vowel sound.
- Usually relatively long in duration (can be held during singing) and spectrally well formed.

Acoustic Waveform of Vowels (figure)
Articulatory Configuration for Typical Vowel Sounds
- Tongue hump position: front, mid, back; tongue hump height: high, mid, low.
- /IY/, /IH/, /AE/, /EH/ => front => high resonances.
- /AA/, /AH/, /AO/ => mid => balanced energy.
- /UH/, /UW/, /OW/ => back => low-frequency resonances.

Spectrogram of a Vowel Sound (figure)


Formant Frequencies (F1 and F2) for Vowels; Vowel Triangle and Centroid Position for Common Vowels
- There is a clear pattern of variability in vowel pronunciation among men, women and children.
- There is strong overlap between different vowel sounds produced by different talkers => no unique identification of a vowel strictly from its resonances => context is needed to define the vowel sound.

Diphthongs (figure)

Semivowels (Liquids and Glides)
- Vowel-like in nature (called semivowels for this reason).
- Voiced sounds: /w/, /l/, /r/, /y/.
- The acoustic characteristics of these sounds are strongly influenced by context, unlike most vowel sounds, which are much less influenced by context.
Nasal Consonants
- The nasal consonants are /M/, /N/, and /NG/.
- Nasals are produced using glottal excitation => voiced sounds.
- The vocal tract is totally constricted at some point along the tract, and the velum is lowered so sound is radiated at the nostrils.
- The constricted oral cavity serves as a resonant cavity that traps acoustic energy at certain natural frequencies (antiresonances or zeros of transmission).
- /M/ is produced with a constriction at the lips => low-frequency zero.
- /N/ is produced with a constriction just behind the teeth => higher-frequency zero.
- /NG/ is produced with a constriction just forward of the velum => even higher-frequency zero.

Nasal Waveforms and Spectrograms (figure)


Unvoiced Fricatives
- Consonant sounds /F/, /TH/, /S/, /SH/.
- Produced by exciting the vocal tract with a steady air flow that becomes turbulent in the region of a constriction in the vocal tract:
  - /F/: constriction near the lips
  - /TH/: constriction near the teeth
  - /S/: constriction near the middle of the vocal tract
  - /SH/: constriction near the back of the vocal tract
- The noise source at the constriction separates the vocal tract into two cavities: sound is radiated from the lips (front cavity), while the back cavity traps energy and produces antiresonances (zeros of transmission).

Phoneme Classification Based on Place and Manner of Articulation (POA and MOA)
- Manner: voiced, unvoiced, aspirated, unaspirated.
- Place: velar, dental, alveolar, bilabial.

Hindi Consonants: Place and Manner of Articulation

Place of articulation | Unaspirated voiceless | Aspirated voiceless | Unaspirated voiced | Aspirated voiced | Nasal
Velar                 | /k/                   | /kh/                | /g/                | /gh/             |
Palatal               | /t/                   | /th/                | /d/                | /dh/             |
Retroflex             | //                    | /h/                 | //                 | /h/              | //
Apico-Dental          | /t/                   | /th/                | /d/                | /dh/             | /n/
Labial                | /p/                   | /ph/                | /b/                | /bh/             | /m/

Source-Filter View of Production (figure)
Air stream from the lungs -> source (vibrating vocal folds) -> filter (vocal tract) -> output sound.
Correspondingly, in the frequency domain: source spectrum x filter function = output spectrum.
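The source-filter picture can be sketched in code: an impulse-train source passed through a cascade of two-pole resonators, one per formant. This is an illustrative toy, not the slides' model; the function names are my own, and the formant frequencies are the uniform-tube /UH/-like values (500, 1500, 2500 Hz) used elsewhere in these slides, with assumed bandwidths.

```python
import numpy as np

def resonator_coeffs(f, bw, fs):
    """Denominator coefficients of a two-pole resonator (one formant)
    at center frequency f (Hz) with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    a1 = -2 * r * np.cos(2 * np.pi * f / fs)
    a2 = r * r
    return a1, a2

def filter_allpole(x, sections):
    """Run x through a cascade of two-pole sections y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = x.astype(float).copy()
    for a1, a2 in sections:
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n] \
                - a1 * (out[n - 1] if n >= 1 else 0.0) \
                - a2 * (out[n - 2] if n >= 2 else 0.0)
        y = out
    return y

fs, f0 = 8000, 100
src = np.zeros(fs // 4)
src[::fs // f0] = 1.0                            # source: 100 Hz impulse train
formants = [(500, 60), (1500, 90), (2500, 120)]  # (Hz, assumed bandwidth Hz)
speech = filter_allpole(src, [resonator_coeffs(f, bw, fs) for f, bw in formants])
# the output spectrum now peaks near the first formant (~500 Hz)
```

The flat-spectrum source acquires the filter's resonant envelope, which is the "source spectrum x filter function = output spectrum" relation in the diagram above.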


Artificial Larynx (figure)

Schematic Vocal Tract
- Simplified vocal tract: a non-uniform tube with time-varying cross section.
- Plane wave propagation along the axis of the tube (an assumption valid for frequencies below about 4000 Hz).
- No losses at the walls.

The Speech Signal
- Speech is a sequence of ever-changing sounds.
- The state of the vocal cords and the positions, shapes and sizes of the various articulators all change slowly over time, thereby producing the desired speech sounds.
- => We need to determine the physical properties of speech by observing the speech waveform.


Sound Wave Propagation
Using the laws of conservation of mass, momentum and energy, it can be shown that sound wave propagation in a lossless tube satisfies the equations:

  -∂p/∂x = ρ ∂(u/A)/∂t
  -∂u/∂x = (1/(ρc²)) ∂(pA)/∂t + ∂A/∂t

where
- p = p(x,t) = sound pressure in the tube at position x and time t,
- u = u(x,t) = volume velocity flow at position x and time t,
- ρ = the density of air in the tube,
- c = the velocity of sound,
- A = A(x,t) = the "area function" of the tube, i.e., the cross-sectional area normal to the axis of the tube, as a function of distance along the tube and of time.

Solutions to the Wave Equation
- No closed-form solutions exist for the propagation equations.
- Boundary conditions are needed, namely u(0,t) (the volume velocity flow at the glottis) and p(l,t) (the sound pressure at the lips), to solve the equations.
- A complete specification of A(x,t), the vocal tract area function, is also needed; for simplification we assume no time variability in A(x,t), so the term involving the partial time derivative of A becomes 0.
- Even with these simplifying assumptions, numerical solutions are very hard to compute.
- Consider simple cases and extrapolate the results to more complicated cases.


Uniform Lossless Tube
Assume a uniform lossless tube, i.e., A(x,t) = A (a shape consistent with the /UH/ vowel).

Acoustic-Electrical Analogs (figure)

Traveling Wave Solution
Assume a traveling-wave solution:

  u(x,t) = u+(t - x/c) - u-(t + x/c)
  p(x,t) = (ρc/A) [u+(t - x/c) + u-(t + x/c)]

where u+(t - x/c) is the wave traveling forward and u-(t + x/c) is the wave traveling backward.

Boundary conditions:

  u(0,t) = U_G(Ω) e^(jΩt)
  p(l,t) = 0

Assume a solution of the form:

  u+(t - x/c) = K+ e^(jΩ(t - x/c))
  u-(t + x/c) = K- e^(jΩ(t + x/c))


Traveling Wave Solution (continued; figures)

Frequency Domain Representation: Overall Transfer Function
- Consider the volume velocity at the lips (x = l) as a function of the source (at the glottis); the frequency response of the uniform tube in terms of volume velocities is

  U(l,Ω) / U_G(Ω) = 1 / cos(Ωl/c)

- Using typical values of l = 17.5 cm and c = 35000 cm/sec gives poles at the frequencies fn = (2n+1)c/(4l), or 500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, ...
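The pole frequencies quoted above follow directly from the quarter-wavelength resonance formula; a quick illustrative check:

```python
# Resonances of a uniform lossless tube, closed at the glottis and open
# at the lips: fn = (2n + 1) * c / (4 * l)
c = 35000.0  # speed of sound, cm/sec
l = 17.5     # vocal tract length, cm
fn = [(2 * n + 1) * c / (4 * l) for n in range(4)]
print(fn)  # [500.0, 1500.0, 2500.0, 3500.0]
```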

Summary of the Solution of the Sound Propagation Equations in the Vocal Tract: Effects of Losses
- Several types of losses must be considered: viscous friction at the walls of the tube, heat conduction through the walls of the tube, and vibration of the tube walls.
- Loss will change the frequency response of the tube.
- Consider first wall vibrations: assume the walls are elastic => the cross-sectional area of the tube will change with the pressure in the tube; assume the walls are locally reacting => A(x,t) ~ p(x,t); assume the pressure variations are very small.


Vocal Tract Transfer Functions: Effects of Losses
- The vocal tract tube can be characterized by a set of resonances (formants) that depend on the vocal tract area function, with shifts due to losses and radiation.
- The bandwidths of the two lowest resonances (F1 and F2) depend primarily on the vocal tract wall losses.
- The bandwidths of the higher resonances (F3, F4, ...) depend primarily on viscous friction, thermal losses, and radiation losses.

Nasal Coupling Effects
- At the branching point, the sound pressure is the same at the input of each tube, and the volume velocity is the sum of the volume velocities at the inputs to the nasal and oral cavities.
- The flow equations can be solved numerically; the results show resonances dependent on the shape and length of the 3 tubes.
- The closed oral cavity can trap energy at certain frequencies, preventing those frequencies from appearing in the nasal output => antiresonances or zeros of the transfer function.
- Nasal resonances have broader bandwidths than those of non-nasal voiced sounds, due to the greater viscous friction and thermal loss from the large surface area of the nasal cavity.

Speech Perception
- Understanding how we hear sounds and how we perceive speech leads to better design and implementation of robust and efficient systems for analyzing and representing speech.
- The better we understand signal processing in the human auditory system, the better we can (at least in theory) design practical speech processing systems.
- We try to understand speech perception by looking at physiological models of hearing.

Some Facts About Human Hearing
- The range of human hearing is incredible: the threshold of hearing corresponds to the thermal limit of the Brownian motion of air particles in the inner ear, while the threshold of pain lies at intensities 10^12 to 10^16 times greater than the threshold of hearing.
- Human hearing perceives both sound frequency and sound direction.
- We can detect weak spectral components in strong broadband noise.
- Masking is the phenomenon whereby one loud sound makes another, softer sound inaudible; masking is most effective for frequencies around the masker frequency. Masking is used to hide quantizer noise by methods of spectral shaping (grossly similar to Dolby noise reduction methods).

Speech Communication: Hearing and Perception
- Physiology
- Psychophysics
- Perception


The Human Ear; Black Box Model of Hearing/Perception (figures)
- Outer ear: pinna and external canal.
- Middle ear: tympanic membrane (eardrum).
- Inner ear: cochlea and neural connections.

Human Ear
- Outer ear: funnels sound into the ear canal.
- Middle ear: sound impinges on the tympanic membrane, causing motion. The middle ear is a mechanical transducer, consisting of the hammer, anvil and stirrup; it converts the acoustic sound wave into mechanical vibrations along the inner ear.
- Inner ear: the cochlea is a fluid-filled chamber partitioned by the basilar membrane. The auditory nerve is connected to the basilar membrane via the inner hair cells. Mechanical vibrations at the entrance to the cochlea create standing waves (of the fluid inside the cochlea), causing the basilar membrane to vibrate at frequencies commensurate with the input acoustic wave frequencies (formants) and at places along the basilar membrane associated with those frequencies.

Middle and Inner Ear (figure)

Schematic Representation of the Ear (figure)

How Does the Cochlea Encode Frequencies? (figure)


Basilar Membrane Mechanics
- Characterized by a set of frequency responses at different points along the membrane: a mechanical realization of a bank of filters.
- The filters are roughly constant-Q (center frequency/bandwidth) with logarithmically increasing bandwidth.
- Distributed along the basilar membrane (BM) is a set of sensors called inner hair cells (IHC), which act as mechanical-motion-to-neural-activity converters.
- Mechanical motion along the BM is sensed by the local IHCs, causing firing activity at the nerve fibers that innervate the bottom of each IHC.
- Each IHC is connected to about 10 nerve fibers, each of different diameter: thin fibers fire at high motion levels, thick fibers at lower motion levels.
- 30,000 nerve fibers link the IHCs to the auditory nerve.
- Electrical pulses run along the auditory nerve, ultimately reaching higher levels of auditory processing in the brain, where they are perceived as sound.

Basilar Membrane Motion
- The ear is excited by the input acoustic wave, which has the spectral properties of the speech being produced.
- Different regions of the BM respond maximally to different input frequencies => frequency tuning occurs along the BM.
- The BM acts like a bank of non-uniform cochlear filters, with a roughly logarithmic increase in filter bandwidth (below 800 Hz the filters have roughly equal bandwidth) => constant-Q filters with bandwidth decreasing as we move away from the cochlear opening.
- The frequency at which the maximum response occurs along the BM is called the characteristic frequency.

Stretched Cochlea and Basilar Membrane (figure)

Ear as Frequency Analyzer (figure)

Critical Bands
- Two equally loud tones close in frequency excite the same IHCs and together sound only slightly louder than one; two equally loud tones well separated in frequency excite different IHCs and sound twice as loud.
- Psychoacoustic experiments of this kind lead to an idealized basilar membrane filter bank: each bandpass filter has a center frequency fc and a bandwidth Δf.
- Real BM filters overlap significantly.
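The critical bandwidth Δf grows with the center frequency fc. One widely used analytic approximation, added here as an illustration and not taken from the slides, is Zwicker's formula Δf ≈ 25 + 75(1 + 1.4(fc/1000)²)^0.69 Hz, which stays near 100 Hz below 500 Hz and widens rapidly above:

```python
import math

def critical_bandwidth_hz(fc):
    """Zwicker's approximation to the critical bandwidth (Hz) of the
    auditory filter centered at fc (Hz)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69

for fc in (100, 500, 1000, 4000):
    print(fc, round(critical_bandwidth_hz(fc)))
```

This is consistent with the constant-Q, logarithmically widening filter bank described for the basilar membrane above.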


Speech Perception
- Speech perception studies try to answer a key question: what is the resolving power of the hearing mechanism? That is, how good an estimate of pitch, formants, amplitude, spectrum, voiced/unvoiced decisions, etc. do we need so that the perception mechanism cannot tell the difference?
- Speech is a multidimensional signal with a linguistic association => it is difficult to measure the precision needed for any specific parameter or set of parameters.
- Rather than talk about speech perception directly, we use auditory discrimination to eliminate linguistic or contextual issues.
- There are issues of absolute identification versus discrimination capability: we can detect a frequency difference of 0.1% between two tones, but can only absolutely judge the frequency of about five different tones => the auditory system is very sensitive to differences but cannot perceive and resolve them absolutely.

Acoustic Definitions
- The intensity of a sound is a physical quantity that can be measured and quantified.
- Acoustic intensity (I): the average flow of energy through a unit area, in watts/m^2.
- Audible intensity range: 10^-12 watts/m^2 to 10 watts/m^2.
- Intensity level: IL = 10 log10(I/I0), with I0 = 10^-12 watts/m^2.
- For a pure sinusoidal sound wave of amplitude P, the intensity is proportional to P^2, and the sound pressure level is defined as SPL = 20 log10(P/P0), with P0 = 0.00002 N/m^2.
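The two decibel definitions above are easy to sanity-check in code (the helper names are illustrative):

```python
import math

I0 = 1e-12  # reference intensity, watts/m^2
P0 = 2e-5   # reference pressure, N/m^2

def intensity_level_db(I):
    """IL = 10 log10(I / I0)."""
    return 10.0 * math.log10(I / I0)

def spl_db(P):
    """SPL = 20 log10(P / P0)."""
    return 20.0 * math.log10(P / P0)

# A sound of intensity 1e-6 W/m^2 sits 60 dB above the hearing threshold;
# a pressure 1000x the reference likewise gives 60 dB SPL.
print(round(intensity_level_db(1e-6), 1), round(spl_db(2e-2), 1))  # 60.0 60.0
```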

Hearing Thresholds
- The threshold of audibility is the acoustic intensity level of a pure tone that can barely be heard at a particular frequency.
- Threshold of audibility: 0 dB at 1000 Hz. Threshold of feeling: 120 dB. Threshold of pain: 140 dB. Immediate damage: 160 dB.
- Thresholds vary with frequency and from person to person.
- Maximum sensitivity is at about 3000 Hz.

Anechoic Chamber (no echoes) (figure)

Range of Human Hearing (figure)

Sound Pressure Levels (dB) (figure)