
9/5/2013

Speech and Audio Signal Processing
ECE554

Overview
- Speech Production/Perception
- Speech: Voiced/Unvoiced
- Speech Production Basics
- Speech Signal, Waveform
- Phonemes/Phonetics/Alphabets
- Lossless Model

Nikesh Bajaj
nikesh.14730@lpu.co.in
Asst. Prof. DSP, SECE
Lovely Professional University

Speech Production/Speech Perception

Human Vocal Mechanism
Speech production mechanism:
- Air enters the lungs via normal breathing; (generally) no speech is produced on in-take.
- As air is expelled from the lungs via the trachea (windpipe), the tensed vocal cords within the larynx are caused to vibrate (Bernoulli oscillation) by the air flow.
- The air is chopped up into quasi-periodic pulses which are modulated in frequency (spectrally shaped) in passing through the pharynx (the throat cavity), the mouth cavity, and possibly the nasal cavity; the positions of the various articulators (jaw, tongue, velum, lips, mouth) determine the sound that is produced.


Human Vocal Mechanism (continued)
- All sounds in English are formed during expiration (egressive sounds).
- Ingressive sounds are caused by inward airflow due to an air-sucking action (e.g., a gasp of surprise, infant cries).
- Sound amplitude increases with airflow rate.

Lung
- Breathing at rest: inspiring and expiring a tidal volume of about 0.5 l every 3-5 sec.
- 60% of the breathing cycle is spent exhaling.
- Lung capacity is 4-5 l for females/males; a residual of 1-2 l cannot be expelled.
- During ordinary speech, 50% of the lung capacity is used; very loud speech may utilize up to 80% of the lung capacity.

*Tidal volume is the lung volume representing the normal volume of air displaced between normal inspiration and expiration when extra effort is not applied.

Sounds
- Voiced
- Unvoiced

Glottal Flow
Glottal volume velocity and the resulting sound pressure at the mouth for the first 30 msec of a voiced sound: there is a roughly 15 msec buildup to periodicity => pitch detection issues at the beginning and end of voicing, and voiced/unvoiced uncertainty over those 15 msec.
Speech Classification
- Voiced speech
- Unvoiced speech
- Transition regions
Typical sampling rate: 8 kHz.

Voiced Speech
- Occurs when air flows through the vocal cords into the vocal tract in discrete puffs.
- The vocal cords vibrate at a particular frequency, called the fundamental frequency of the sound:
  - 50-200 Hz for males
  - 150-300 Hz for females
  - 200-400 Hz for children
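The fundamental-frequency ranges above can be checked numerically. Below is a minimal sketch, not from the slides, of autocorrelation-based pitch estimation at the 8 kHz sampling rate quoted above; the function name and the synthetic glottal-like pulse train are illustrative assumptions.

```python
import numpy as np

def estimate_pitch_autocorr(x, fs, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency by picking the autocorrelation
    peak within the plausible pitch-lag range [fs/fmax, fs/fmin]."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)           # lag search range
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return fs / lag

# Synthetic "voiced" excitation: pulse train near 120 Hz, fs = 8 kHz
fs, f0 = 8000, 120
period = round(fs / f0)          # 67 samples -> true f0 = 8000/67 Hz
x = np.zeros(fs // 4)            # 250 ms of signal
x[::period] = 1.0

print(round(estimate_pitch_autocorr(x, fs), 1))  # 119.4
```

The estimate lands in the male range of the table above, as expected for a 120 Hz source.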


Unvoiced Speech
- The vocal cords are held apart and air flows continuously.
- A narrow opening produces turbulence.
- Examples: /f/ and /s/.
- Predominantly high-frequency energy.

Other Sounds
Nasal sounds:
- The vocal tract is coupled acoustically with the nasal cavity.
- Sound radiates from the nostrils as well as the lips.
- Examples: /m/, /n/, /ing/.
Plosive sounds:
- Complete closure/constriction towards the front of the vocal tract.
- Pressure builds up behind the closure, followed by a sudden release.
- Examples: /ph/, /t/, /kh/.
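The high-frequency character of unvoiced speech suggests a classic discrimination heuristic, added here as an illustration rather than taken from the slides: the zero-crossing rate, which is low for voiced frames and high for noise-like unvoiced frames. A minimal sketch:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample sign changes within the frame."""
    s = np.sign(frame)
    return np.mean(s[:-1] != s[1:])

fs = 8000
t = np.arange(int(0.03 * fs)) / fs      # 30 ms frame at 8 kHz
voiced = np.sin(2 * np.pi * 120 * t)    # low-frequency, periodic
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(len(t))  # noise-like, broadband

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))  # True
```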


Resonant Frequencies of the Vocal Tract
- The vocal tract is a non-uniform acoustic tube terminated at one end by the vocal cords and at the other end by the lips.
- The resonances depend on the articulators: the cross-sectional area of the vocal tract and the positions of the tongue, lips, jaw and velum.

Formants
- The spectrum of the vocal tract response consists of a number of resonant frequencies: the formants.
- Speech normally exhibits one formant frequency in every 1 kHz, so three to four formants are present below 4 kHz.
- Voiced: the magnitude of the lower formants is larger than that of the higher ones. Unvoiced: the other way round.


Schematic Production Mechanism: Basic Model for Speech Production
- The lungs and associated muscles act as the source of air for exciting the vocal mechanism. Muscle force pushes air out of the lungs (like a piston pushing air up within a cylinder) through the bronchi and trachea.
- If the vocal cords are tensed, the air flow causes them to vibrate, producing voiced or quasi-periodic speech sounds (musical notes).
- If the vocal cords are relaxed, the air flow either continues through the vocal tract until it hits a constriction, causing it to become turbulent and thereby producing unvoiced sounds (like /s/, /sh/), or it hits a point of total closure in the vocal tract, building up pressure until the closure is opened and the pressure is suddenly and abruptly released, causing a brief transient sound, like at the beginning of /p/, /t/, or /k/.


Assumptions
- Excitation and vocal tract are independent => separate models.
- The vocal tract is time-varying, but with fixed characteristics over an interval of about 10 ms; formants change roughly every 10 ms.

Basic Sounds
- Phonemes: the smallest segments of speech, e.g., /d/. A phoneme is a linguistic unit and may not be directly observed.
- Sounds depend on the vocal tract and are person-dependent.
- IPA: the International Phonetic Alphabet.

Basic Speech Processes
idea -> words -> sounds -> waveform
- Words: "Hi All, did you eat yet?"
- Sounds: /h/ /ay/-/ae/ /l/-/d/ /ih/ /d/-/y/ /u/-/iy/
- Coarticulated sounds: /h-ay-l/-/d-ih-j-uh/-/iy-tj--t/ (hial-dija-eajet)
- Remarkably, humans can decode these sounds and determine the meaning that was intended, at least at the idea/concept level (perhaps not completely at the word or sound level).

Basics
- Speech is composed of a sequence of sounds (phonemes).
- Sounds (and the transitions between them) serve as a symbolic representation of information to be shared between humans (or humans and machines).
- The arrangement of sounds is governed by the rules of language (constraints on sound sequences, word sequences, etc.).
- Linguistics is the study of the rules of language; phonetics is the study of the sounds of speech.

Waveform of the Utterance "It's time"
- 100 msec/line; 0.5 sec for the utterance.
- S: silence or background, no speech.
- U: unvoiced speech, no vocal cord vibration (aspiration, unvoiced sounds).
- V: voiced speech, quasi-periodic.
- Speech is a slowly time-varying signal over 5-100 msec intervals; over longer intervals (100 msec to 5 sec), the speech characteristics change as rapidly as 10-20 times/second.
- There are no well-defined or exact regions where individual sounds begin and end.

Parameterization of Spectra
- The human vocal tract is essentially a tube of varying cross-sectional area, or can be approximated as a concatenation of tubes of varying cross-sectional areas.
- Acoustic theory shows that the transfer function of energy from the excitation source to the output can be described in terms of the natural frequencies or resonances of the tube.
- These resonances are known as formants or formant frequencies for speech; they represent the frequencies that pass the most acoustic energy from the source to the output.
- Typically there are 3 significant formants below about 3500 Hz.
- Formants are a highly efficient, compact representation of speech.

Spectrogram Properties
- Speech spectrogram: sound intensity versus time and frequency.
- Wideband spectrogram: spectral analysis on 15 msec sections of the waveform using a broad (125 Hz) bandwidth analysis filter, with a new analysis every 1 msec; it resolves individual periods of the speech and shows vertical striations during voiced regions.
- Narrowband spectrogram: spectral analysis on 50 msec sections of the waveform using a narrow (40 Hz) bandwidth analysis filter, with a new analysis every 1 msec; it resolves individual pitch harmonics and shows horizontal striations during voiced regions.
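The wideband/narrowband distinction is simply a window-length trade-off. The numpy-only sketch below is illustrative (the helper name and test signal are my own): 15 ms and 50 ms Hamming windows give roughly the broad and narrow analysis bandwidths quoted above, so only the longer window can resolve harmonics spaced 125 Hz apart.

```python
import numpy as np

def stft_mag(x, fs, win_ms, hop_ms=1.0):
    """Magnitude STFT with a Hamming window of length win_ms (msec),
    advanced by hop_ms (msec) per frame."""
    n = int(fs * win_ms / 1000)
    hop = max(1, int(fs * hop_ms / 1000))
    w = np.hamming(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
t = np.arange(fs) / fs
# Crude voiced-like signal: 125 Hz fundamental plus two harmonics
x = sum(np.sin(2 * np.pi * k * 125 * t) for k in (1, 2, 3))

wide = stft_mag(x, fs, win_ms=15)    # coarse in frequency: harmonics merge
narrow = stft_mag(x, fs, win_ms=50)  # fine in frequency: harmonics resolve
```

With the 15 ms window the frequency bins are ~67 Hz apart and the window mainlobe spans several bins, smearing the 125 Hz harmonics together; the 50 ms window (20 Hz bins) separates them, which is exactly the horizontal-striation behavior described above.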


Wideband and Narrowband Spectrograms (figure)

Wideband Spectrogram and Formant Frequencies (figure)
Formants:
- Perceptually defined.
- Resonance frequencies of the vocal tract.

Hindi Phonemes (figure)

American Phonetic Symbols (ARPABET representation)
48 sounds:
- 18 vowels/diphthongs
- 4 vowel-like consonants
- 21 standard consonants
- 4 syllabic sounds


Reduced Set of English Sounds
39 sounds:
- 11 vowels (front, mid, back; classification based on tongue hump position)
- 4 diphthongs (vowel-like combinations)
- 4 semi-vowels (liquids and glides)
- 3 nasal consonants
- 6 voiced and unvoiced stop consonants
- 8 voiced and unvoiced fricative consonants
- 2 affricate consonants
- 1 whispered sound

Classification of American Phonemes
Look at each class of sounds to characterize their acoustic and spectral properties.


Vowels
- Produced using a fixed vocal tract shape; sustained sounds.
- Vocal cords are vibrating => voiced sounds.
- The cross-sectional area of the vocal tract determines the vowel resonance frequencies and the vowel sound quality.
- Tongue position (height, forward/back) is most important in determining the vowel sound.
- Usually relatively long in duration (can be held during singing) and spectrally well formed.

Acoustic Waveform of Vowels (figure)
Articulatory Configuration for Typical Vowel Sounds
- Tongue hump position: front, mid, back; tongue hump height: high, mid, low.
- /IY/, /IH/, /AE/, /EH/ => front => high resonances.
- /AA/, /AH/, /AO/ => mid => balanced energy.
- /UH/, /UW/, /OW/ => back => low-frequency resonances.

Spectrogram of a Vowel Sound (figure)


Formant Frequencies (F1 and F2) for Vowels; Vowel Triangle and Centroid Position for Common Vowels
- There is a clear pattern of variability in vowel pronunciation among men, women and children.
- There is strong overlap between different vowel sounds produced by different talkers => no unique identification of a vowel strictly from its resonances => context is needed to define the vowel sound.

Diphthongs (figure)

Semivowels (Liquids and Glides)
- Vowel-like in nature (called semivowels for this reason).
- Voiced sounds: /w/, /l/, /r/, /y/.
- The acoustic characteristics of these sounds are strongly influenced by context, unlike most vowel sounds, which are much less influenced by context.
Nasal Consonants
- The nasal consonants are /M/, /N/, and /NG/.
- Nasals are produced using glottal excitation => voiced sounds.
- The vocal tract is totally constricted at some point along the tract, and the velum is lowered so sound is radiated at the nostrils.
- The constricted oral cavity serves as a resonant cavity that traps acoustic energy at certain natural frequencies (antiresonances or zeros of transmission).
- /M/ is produced with a constriction at the lips => low-frequency zero.
- /N/ is produced with a constriction just behind the teeth => higher-frequency zero.
- /NG/ is produced with a constriction just forward of the velum => even higher-frequency zero.

Nasal Waveforms and Spectrograms (figure)


Unvoiced Fricatives
- Consonant sounds /F/, /TH/, /S/, /SH/.
- Produced by exciting the vocal tract with a steady air flow that becomes turbulent in the region of a constriction in the vocal tract:
  - /F/: constriction near the lips
  - /TH/: constriction near the teeth
  - /S/: constriction near the middle of the vocal tract
  - /SH/: constriction near the back of the vocal tract
- The noise source at the constriction separates the vocal tract into two cavities: sound is radiated from the lips (front cavity), while the back cavity traps energy and produces antiresonances (zeros of transmission).

Phoneme Classification Based on Place and Manner of Articulation (POA and MOA)
- Manner: voiced, unvoiced, aspirated, unaspirated.
- Place: velar, dental, alveolar, bilabial.

Hindi Consonants: Place and Manner of Articulation

Place of articulation | Unaspirated voiceless | Aspirated voiceless | Unaspirated voiced | Aspirated voiced | Nasal
Velar                 | /k/                   | /kh/                | /g/                | /gh/             |
Palatal               | /t/                   | /th/                | /d/                | /dh/             |
Retroflex             | //                    | /h/                 | //                 | /h/              | //
Apico-Dental          | /t/                   | /th/                | /d/                | /dh/             | /n/
Labial                | /p/                   | /ph/                | /b/                | /bh/             | /m/

Source-Filter View of Production (figure)
Air stream from the lungs -> source (vibrating vocal folds) -> filter (vocal tract) -> output sound.
Correspondingly, in the frequency domain: source spectrum x filter function = output spectrum.
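The source-filter picture can be sketched in code: an impulse-train source passed through a cascade of two-pole resonators, one per formant. This is an illustrative toy, not the slides' model; the function names are my own, and the formant frequencies are the uniform-tube /UH/-like values (500, 1500, 2500 Hz) used elsewhere in these slides, with assumed bandwidths.

```python
import numpy as np

def resonator_coeffs(f, bw, fs):
    """Denominator coefficients of a two-pole resonator (one formant)
    at center frequency f (Hz) with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    a1 = -2 * r * np.cos(2 * np.pi * f / fs)
    a2 = r * r
    return a1, a2

def filter_allpole(x, sections):
    """Run x through a cascade of two-pole sections y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = x.astype(float).copy()
    for a1, a2 in sections:
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n] \
                - a1 * (out[n - 1] if n >= 1 else 0.0) \
                - a2 * (out[n - 2] if n >= 2 else 0.0)
        y = out
    return y

fs, f0 = 8000, 100
src = np.zeros(fs // 4)
src[::fs // f0] = 1.0                            # source: 100 Hz impulse train
formants = [(500, 60), (1500, 90), (2500, 120)]  # (Hz, assumed bandwidth Hz)
speech = filter_allpole(src, [resonator_coeffs(f, bw, fs) for f, bw in formants])
# the output spectrum now peaks near the first formant (~500 Hz)
```

The flat-spectrum source acquires the filter's resonant envelope, which is the "source spectrum x filter function = output spectrum" relation in the diagram above.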


Artificial Larynx (figure)

Schematic Vocal Tract
- Simplified vocal tract: a non-uniform tube with time-varying cross section.
- Plane wave propagation along the axis of the tube (an assumption valid for frequencies below about 4000 Hz).
- No losses at the walls.

The Speech Signal
- Speech is a sequence of ever-changing sounds.
- The state of the vocal cords and the positions, shapes and sizes of the various articulators all change slowly over time, thereby producing the desired speech sounds.
- => We need to determine the physical properties of speech by observing the speech waveform.


Sound Wave Propagation
Using the laws of conservation of mass, momentum and energy, it can be shown that sound wave propagation in a lossless tube satisfies the equations:

  -∂p/∂x = ρ ∂(u/A)/∂t
  -∂u/∂x = (1/(ρc²)) ∂(pA)/∂t + ∂A/∂t

where
- p = p(x,t) = sound pressure in the tube at position x and time t,
- u = u(x,t) = volume velocity flow at position x and time t,
- ρ = the density of air in the tube,
- c = the velocity of sound,
- A = A(x,t) = the "area function" of the tube, i.e., the cross-sectional area normal to the axis of the tube, as a function of distance along the tube and of time.

Solutions to the Wave Equation
- No closed-form solutions exist for the propagation equations.
- Boundary conditions are needed, namely u(0,t) (the volume velocity flow at the glottis) and p(l,t) (the sound pressure at the lips), to solve the equations.
- A complete specification of A(x,t), the vocal tract area function, is also needed; for simplification we assume no time variability in A(x,t), so the term involving the partial time derivative of A becomes 0.
- Even with these simplifying assumptions, numerical solutions are very hard to compute.
- Consider simple cases and extrapolate the results to more complicated cases.


Uniform Lossless Tube
Assume a uniform lossless tube, i.e., A(x,t) = A (a shape consistent with the /UH/ vowel).

Acoustic-Electrical Analogs (figure)

Traveling Wave Solution
Assume a traveling-wave solution:

  u(x,t) = u+(t - x/c) - u-(t + x/c)
  p(x,t) = (ρc/A) [u+(t - x/c) + u-(t + x/c)]

where u+(t - x/c) is the wave traveling forward and u-(t + x/c) is the wave traveling backward.

Boundary conditions:

  u(0,t) = U_G(Ω) e^(jΩt)
  p(l,t) = 0

Assume a solution of the form:

  u+(t - x/c) = K+ e^(jΩ(t - x/c))
  u-(t + x/c) = K- e^(jΩ(t + x/c))


Traveling Wave Solution (continued; figures)

Frequency Domain Representation: Overall Transfer Function
- Consider the volume velocity at the lips (x = l) as a function of the source (at the glottis); the frequency response of the uniform tube in terms of volume velocities is

  U(l,Ω) / U_G(Ω) = 1 / cos(Ωl/c)

- Using typical values of l = 17.5 cm and c = 35000 cm/sec gives poles at the frequencies fn = (2n+1)c/(4l), or 500 Hz, 1500 Hz, 2500 Hz, 3500 Hz, ...
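The pole frequencies quoted above follow directly from the quarter-wavelength resonance formula; a quick illustrative check:

```python
# Resonances of a uniform lossless tube, closed at the glottis and open
# at the lips: fn = (2n + 1) * c / (4 * l)
c = 35000.0  # speed of sound, cm/sec
l = 17.5     # vocal tract length, cm
fn = [(2 * n + 1) * c / (4 * l) for n in range(4)]
print(fn)  # [500.0, 1500.0, 2500.0, 3500.0]
```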

Summary of the Solution of the Sound Propagation Equations in the Vocal Tract: Effects of Losses
- Several types of losses must be considered: viscous friction at the walls of the tube, heat conduction through the walls of the tube, and vibration of the tube walls.
- Loss will change the frequency response of the tube.
- Consider first wall vibrations: assume the walls are elastic => the cross-sectional area of the tube will change with the pressure in the tube; assume the walls are locally reacting => A(x,t) ~ p(x,t); assume the pressure variations are very small.


Vocal Tract Transfer Functions: Effects of Losses
- The vocal tract tube can be characterized by a set of resonances (formants) that depend on the vocal tract area function, with shifts due to losses and radiation.
- The bandwidths of the two lowest resonances (F1 and F2) depend primarily on the vocal tract wall losses.
- The bandwidths of the higher resonances (F3, F4, ...) depend primarily on viscous friction, thermal losses, and radiation losses.

Nasal Coupling Effects
- At the branching point, the sound pressure is the same at the input of each tube, and the volume velocity is the sum of the volume velocities at the inputs to the nasal and oral cavities.
- The flow equations can be solved numerically; the results show resonances dependent on the shape and length of the 3 tubes.
- The closed oral cavity can trap energy at certain frequencies, preventing those frequencies from appearing in the nasal output => antiresonances or zeros of the transfer function.
- Nasal resonances have broader bandwidths than those of non-nasal voiced sounds, due to the greater viscous friction and thermal loss from the large surface area of the nasal cavity.

Speech Perception
- Understanding how we hear sounds and how we perceive speech leads to better design and implementation of robust and efficient systems for analyzing and representing speech.
- The better we understand signal processing in the human auditory system, the better we can (at least in theory) design practical speech processing systems.
- We try to understand speech perception by looking at physiological models of hearing.

Some Facts About Human Hearing
- The range of human hearing is incredible: the threshold of hearing corresponds to the thermal limit of the Brownian motion of air particles in the inner ear, while the threshold of pain lies at intensities 10^12 to 10^16 times greater than the threshold of hearing.
- Human hearing perceives both sound frequency and sound direction.
- We can detect weak spectral components in strong broadband noise.
- Masking is the phenomenon whereby one loud sound makes another, softer sound inaudible; masking is most effective for frequencies around the masker frequency. Masking is used to hide quantizer noise by methods of spectral shaping (grossly similar to Dolby noise reduction methods).

Speech Communication: Hearing and Perception
- Physiology
- Psychophysics
- Perception


The Human Ear; Black Box Model of Hearing/Perception (figures)
- Outer ear: pinna and external canal.
- Middle ear: tympanic membrane (eardrum).
- Inner ear: cochlea and neural connections.

Human Ear
- Outer ear: funnels sound into the ear canal.
- Middle ear: sound impinges on the tympanic membrane, causing motion. The middle ear is a mechanical transducer, consisting of the hammer, anvil and stirrup; it converts the acoustic sound wave into mechanical vibrations along the inner ear.
- Inner ear: the cochlea is a fluid-filled chamber partitioned by the basilar membrane. The auditory nerve is connected to the basilar membrane via the inner hair cells. Mechanical vibrations at the entrance to the cochlea create standing waves (of the fluid inside the cochlea), causing the basilar membrane to vibrate at frequencies commensurate with the input acoustic wave frequencies (formants) and at places along the basilar membrane associated with those frequencies.

Middle and Inner Ear (figure)

Schematic Representation of the Ear (figure)

How Does the Cochlea Encode Frequencies? (figure)


Basilar Membrane Mechanics
- Characterized by a set of frequency responses at different points along the membrane: a mechanical realization of a bank of filters.
- The filters are roughly constant-Q (center frequency/bandwidth) with logarithmically increasing bandwidth.
- Distributed along the basilar membrane (BM) is a set of sensors called inner hair cells (IHC), which act as mechanical-motion-to-neural-activity converters.
- Mechanical motion along the BM is sensed by the local IHCs, causing firing activity at the nerve fibers that innervate the bottom of each IHC.
- Each IHC is connected to about 10 nerve fibers, each of different diameter: thin fibers fire at high motion levels, thick fibers at lower motion levels.
- 30,000 nerve fibers link the IHCs to the auditory nerve.
- Electrical pulses run along the auditory nerve, ultimately reaching higher levels of auditory processing in the brain, where they are perceived as sound.

Basilar Membrane Motion
- The ear is excited by the input acoustic wave, which has the spectral properties of the speech being produced.
- Different regions of the BM respond maximally to different input frequencies => frequency tuning occurs along the BM.
- The BM acts like a bank of non-uniform cochlear filters, with a roughly logarithmic increase in filter bandwidth (below 800 Hz the filters have roughly equal bandwidth) => constant-Q filters with bandwidth decreasing as we move away from the cochlear opening.
- The frequency at which the maximum response occurs along the BM is called the characteristic frequency.

Stretched Cochlea and Basilar Membrane (figure)

Ear as Frequency Analyzer (figure)

Critical Bands
- Two equally loud tones close in frequency excite the same IHCs and together sound only slightly louder than one; two equally loud tones well separated in frequency excite different IHCs and sound twice as loud.
- Psychoacoustic experiments of this kind lead to an idealized basilar membrane filter bank: each bandpass filter has a center frequency fc and a bandwidth Δf.
- Real BM filters overlap significantly.
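The critical bandwidth Δf grows with the center frequency fc. One widely used analytic approximation, added here as an illustration and not taken from the slides, is Zwicker's formula Δf ≈ 25 + 75(1 + 1.4(fc/1000)²)^0.69 Hz, which stays near 100 Hz below 500 Hz and widens rapidly above:

```python
import math

def critical_bandwidth_hz(fc):
    """Zwicker's approximation to the critical bandwidth (Hz) of the
    auditory filter centered at fc (Hz)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69

for fc in (100, 500, 1000, 4000):
    print(fc, round(critical_bandwidth_hz(fc)))
```

This is consistent with the constant-Q, logarithmically widening filter bank described for the basilar membrane above.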


Speech Perception
- Speech perception studies try to answer a key question: what is the resolving power of the hearing mechanism? That is, how good an estimate of pitch, formants, amplitude, spectrum, voiced/unvoiced decisions, etc. do we need so that the perception mechanism cannot tell the difference?
- Speech is a multidimensional signal with a linguistic association => it is difficult to measure the precision needed for any specific parameter or set of parameters.
- Rather than talk about speech perception directly, we use auditory discrimination to eliminate linguistic or contextual issues.
- There are issues of absolute identification versus discrimination capability: we can detect a frequency difference of 0.1% between two tones, but can only absolutely judge the frequency of about five different tones => the auditory system is very sensitive to differences but cannot perceive and resolve them absolutely.

Acoustic Definitions
- The intensity of a sound is a physical quantity that can be measured and quantified.
- Acoustic intensity (I): the average flow of energy through a unit area, in watts/m^2.
- Audible intensity range: 10^-12 watts/m^2 to 10 watts/m^2.
- Intensity level: IL = 10 log10(I/I0), with I0 = 10^-12 watts/m^2.
- For a pure sinusoidal sound wave of amplitude P, the intensity is proportional to P^2, and the sound pressure level is defined as SPL = 20 log10(P/P0), with P0 = 0.00002 N/m^2.
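The two decibel definitions above are easy to sanity-check in code (the helper names are illustrative):

```python
import math

I0 = 1e-12  # reference intensity, watts/m^2
P0 = 2e-5   # reference pressure, N/m^2

def intensity_level_db(I):
    """IL = 10 log10(I / I0)."""
    return 10.0 * math.log10(I / I0)

def spl_db(P):
    """SPL = 20 log10(P / P0)."""
    return 20.0 * math.log10(P / P0)

# A sound of intensity 1e-6 W/m^2 sits 60 dB above the hearing threshold;
# a pressure 1000x the reference likewise gives 60 dB SPL.
print(round(intensity_level_db(1e-6), 1), round(spl_db(2e-2), 1))  # 60.0 60.0
```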

Hearing Thresholds
- The threshold of audibility is the acoustic intensity level of a pure tone that can barely be heard at a particular frequency.
- Threshold of audibility: 0 dB at 1000 Hz. Threshold of feeling: 120 dB. Threshold of pain: 140 dB. Immediate damage: 160 dB.
- Thresholds vary with frequency and from person to person.
- Maximum sensitivity is at about 3000 Hz.

Anechoic Chamber (no echoes) (figure)

Range of Human Hearing (figure)

Sound Pressure Levels (dB) (figure)