Overview
Speech Production/Perception
Speech: Voiced/Unvoiced
Speech Production
Basics
ECE554
Speech Signal, Waveform
nikesh.14730@lpu.co.in
Asst. Prof. DSP, SECE
Lovely Professional University
By: Nikesh Bajaj
9/5/2013
Lung capacity is about 4-5 L (female/male).
Very loud speech may use up to 80% of the lung capacity; a residual 1-2 L cannot be expelled.
Sound amplitude increases with airflow rate.
*Tidal volume is the volume of air displaced between normal inspiration and expiration when no extra effort is applied.
Sounds
Voiced
Unvoiced
Glottal Flow
Glottal volume velocity and the resulting sound pressure at the mouth for the first 30 ms of a voiced sound.
The buildup to periodicity takes about 15 ms => pitch detection issues at the beginning and end of voicing; also voiced-unvoiced uncertainty for about 15 ms.
Transition regions
Plosive Sounds
Complete closure/constriction towards the front of the vocal tract
Pressure builds up behind the closure, followed by a sudden release
/ph/ /t/ /kh/
Resonant Frequencies of Vocal Tract
Speech normally exhibits about one formant frequency per 1 kHz.
Sounds depend on the vocal tract and are person dependent.
Time-varying vocal tract, with fixed characteristics over an interval of about 10 ms.
Formants change about every 10 ms.
IPA
Sounds: /h/ /ay/-/ae/ /l/-/d/ /ih/ /d/-/y/ /u/-/iy/
Coarticulated sounds: /h-ay-l/-/d-ih-j-uh/-/iy-tj--t/ (hial-dija-eajet)
Basics
Speech is composed of a sequence of sounds (phonemes).
Sounds (and transitions between them) serve as a symbolic representation of information to be shared between humans (or humans and machines).
Arrangement of sounds is governed by rules of language (constraints on sound sequences, word sequences, etc.).
Remarkably, humans can decode these sounds and determine the meaning that was intended, at least at the idea/concept level (perhaps not completely at the word or sound level).
Linguistics is the study of the rules of language.
Phonetics is the study of the sounds of speech.
Waveform of the utterance "It's time" (100 ms per line; 0.5 s for the utterance)
S: silence (background, no speech)
Voiced segments are quasi-periodic
Parameterization of Spectra
Human vocal tract is essentially a tube of varying cross-sectional area, or can be approximated as a concatenation of tubes.
The resonances of the vocal tract are the formant frequencies for speech, and they represent the frequencies that pass the most acoustic energy from the source to the output.
Typically there are 3 significant formants below about 3500 Hz.
Formants are a highly efficient, compact representation of speech.
Spectrogram Properties
Speech spectrogram: sound intensity versus time and frequency.
Narrowband spectrogram: spectral analysis on 50 ms sections of the waveform using a narrow (40 Hz) bandwidth analysis filter, with a new analysis every 1 ms.
The narrowband spectrogram resolves individual pitch harmonics and shows horizontal striations during voiced regions.
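The narrowband analysis settings quoted in these slides (50 ms sections, 40 Hz analysis filter, a new analysis every 1 ms) can be turned into frame counts with a few lines of Python. This is only a sketch: the 10 kHz sampling rate, the 100 Hz pitch and the 300 Hz wideband bandwidth are illustrative assumptions, and `frame_layout` and `resolves_pitch_harmonics` are hypothetical helper names.

```python
def frame_layout(fs_hz, window_s, hop_s, duration_s):
    """Return (window_len, hop_len, n_frames) for a sliding spectral analysis."""
    window_len = round(fs_hz * window_s)   # samples per analysis section
    hop_len = round(fs_hz * hop_s)         # samples between successive analyses
    n_samples = round(fs_hz * duration_s)  # samples in the whole utterance
    if n_samples < window_len:
        return window_len, hop_len, 0
    # One frame at offset 0, plus one for every further hop that still fits.
    n_frames = 1 + (n_samples - window_len) // hop_len
    return window_len, hop_len, n_frames


def resolves_pitch_harmonics(filter_bandwidth_hz, pitch_hz):
    """Harmonics lie pitch_hz apart; a narrower analysis filter can separate them."""
    return filter_bandwidth_hz < pitch_hz


FS_HZ = 10_000    # assumed sampling rate (not given in the slides)
PITCH_HZ = 100.0  # assumed pitch of the talker

# Narrowband settings from the slide, applied to a 0.5 s utterance.
narrow = frame_layout(FS_HZ, window_s=0.050, hop_s=0.001, duration_s=0.5)
print("window, hop, frames:", narrow)
print("40 Hz filter resolves harmonics:", resolves_pitch_harmonics(40.0, PITCH_HZ))
# 300 Hz is an assumed, typical wideband value, used only for contrast.
print("300 Hz filter resolves harmonics:", resolves_pitch_harmonics(300.0, PITCH_HZ))
```

With a 40 Hz filter and harmonics about 100 Hz apart, each harmonic falls in its own band, which is why the narrowband display shows horizontal striations in voiced regions.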
Wideband spectrogram and formant frequencies
Formants
Perceptually defined
Resonance frequencies of the vocal tract
Hindi Phonemes
American Phonetic Symbols
ARPABET representation
48 sounds
18 vowels/diphthongs
4 vowel-like consonants
21 standard consonants
4 syllabic sounds
39 sounds
11 vowels (front, mid, back): classification based on tongue hump position
4 diphthongs (vowel-like combinations)
4 semi-vowels (liquids and glides)
3 nasal consonants
6 voiced and unvoiced stop consonants
8 voiced and unvoiced fricative consonants
2 affricate consonants
1 whispered sound
Look at each class of sounds to characterize their
acoustic and spectral properties.
Vowels
Produced using a fixed vocal tract shape.
Sustained sounds.
Vocal cords are vibrating => voiced sounds.
Cross-sectional area of the vocal tract determines the vowel.
Acoustic waveform of vowels
/IY/, /IH/, /AE/, /EH/ => front => high frequency resonances
/AA/, /AH/, /AO/ => mid => energy balance
/UH/, /UW/, /OW/ => back => low frequency resonances
Formant Frequencies (F1 and F2) for Vowels
Vowel Triangle and centroid position for common vowels
Clear pattern of variability of vowel pronunciation among men, women and children.
Strong overlap for different vowel sounds by different talkers => no unique identification of vowel strictly from resonances => need context to define vowel sound.
Vowel-like in nature (called semivowels for this reason)
/F/ constriction near the lips
/TH/ constriction near the teeth
/S/ constriction near the middle of the vocal tract
/SH/ constriction near the back of the vocal tract
noise source at constriction => vocal tract is separated into two cavities
sound radiated from lips: front cavity
back cavity traps energy and produces antiresonances (zeros of transmission).
Consonants
Chart labels: Unvoiced, Aspirated, Unaspirated; Dental, Alveolar, Bilabial, Nasal
Articulation: Velar
Unaspirated, Voiceless: /k/
Aspirated, Voiceless: /kh/
Unaspirated, Voiced: /g/
Aspirated, Voiced: /gh/
Source-filter model (figure): Source (source spectrum) => Filter (vocal tract; filter function) => Output sound
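The source-filter picture can be sketched numerically: the output spectrum at each frequency is the source spectrum multiplied by the filter function of the vocal tract. The sketch below is an illustration under stated assumptions, not the model used in the slides: the source is taken as harmonics whose amplitude falls as 1/k^2, the filter is a cascade of second-order resonances, and the formant centres and bandwidths in `FORMANTS` are invented round numbers.

```python
import math


def formant_gain(f_hz, center_hz, bandwidth_hz):
    """Magnitude of one second-order resonance, normalized to gain 1 at 0 Hz."""
    s = complex(0.0, 2.0 * math.pi * f_hz)
    pole = complex(-math.pi * bandwidth_hz, 2.0 * math.pi * center_hz)
    return abs(pole) ** 2 / abs((s - pole) * (s - pole.conjugate()))


def filter_function(f_hz, formants):
    """Vocal tract filter: product of the individual formant resonances."""
    gain = 1.0
    for center_hz, bandwidth_hz in formants:
        gain *= formant_gain(f_hz, center_hz, bandwidth_hz)
    return gain


def source_spectrum(pitch_hz, n_harmonics):
    """Assumed source: harmonics of the pitch with amplitude 1/k^2."""
    return [(k * pitch_hz, 1.0 / k ** 2) for k in range(1, n_harmonics + 1)]


def output_spectrum(pitch_hz, n_harmonics, formants):
    """Output sound spectrum = source spectrum x filter function."""
    return [(f, a * filter_function(f, formants))
            for f, a in source_spectrum(pitch_hz, n_harmonics)]


# Assumed (center, bandwidth) pairs in Hz, for illustration only.
FORMANTS = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]

for freq_hz, amplitude in output_spectrum(100.0, 30, FORMANTS)[:8]:
    print(f"{freq_hz:7.1f} Hz  {amplitude:.4f}")
```

Although the source amplitude falls steadily with harmonic number, the harmonic nearest a formant is boosted by the filter, so the output spectrum shows peaks at the resonances.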
Artificial Larynx
Schematic Vocal Tract
Simplified vocal tract area => non-uniform tube with time-varying cross section
Plane wave propagation along the axis of the tube (this assumption is valid for frequencies below about 4000 Hz)
No losses at walls
... of the various articulators all change slowly over time, thereby producing the desired speech sounds
=> Need to determine the physical properties of speech by observing the speech waveform
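For reference, the sound wave equations for a lossless tube with area function A(x,t), in the form usually given in the speech processing literature, are sketched below. This is assumed to be the form these slides refer to; p is the sound pressure, u the volume velocity, ρ the density of air, c the velocity of sound and A the area function.

```latex
% Wave equations for a lossless tube (standard textbook form, assumed here)
-\frac{\partial p}{\partial x} = \rho \, \frac{\partial (u/A)}{\partial t}
\qquad
-\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}} \, \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}
```

When A does not vary with time, the last term (the partial time derivative of A) is zero.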
Symbols used in the wave equations:
p = p(x,t) = sound pressure in the tube at position x and time t
u = u(x,t) = volume velocity flow at position x and time t
ρ = the density of air in the tube
c = the velocity of sound
A = A(x,t) = the 'area function' of the tube, i.e., the cross-sectional area normal to the axis of the tube, as a function of distance along the tube and as a function of time.
Boundary conditions are needed, namely u(0,t) (the volume velocity flow at the glottis) and p(l,t) (the sound pressure at the lips), to solve the equations.
Needs complete specification of A(x,t), the vocal tract area function; for simplification purposes we will assume that there is no time variability in A(x,t) => the term related to the partial time derivative of A becomes 0.
Even with these simplifying assumptions, numerical solutions are very hard to compute.
Consider simple cases and extrapolate results to more complicated cases.
Uniform Lossless Tube
Assume uniform lossless tube => A(x,t)=A (shape consistent with /UH/ vowel)
$$p(x,t) = \frac{\rho c}{A}\left[u^{+}(t - x/c) + u^{-}(t + x/c)\right]$$
$u^{+}(t - x/c)$: wave traveling forward
Boundary conditions
$$u(0,t) = U_G(\Omega)\,e^{j\Omega t}, \qquad p(l,t) = 0$$
Assume a solution of the form
$$u^{+}(t - x/c) = K^{+} e^{j\Omega (t - x/c)}$$
$$u^{-}(t + x/c) = K^{-} e^{j\Omega (t + x/c)}$$
Acoustic-Electrical Analogs
Frequency Domain Representation
Overall Transfer Function
Consider the volume velocity at the lips (x=l) as a function of the source (at the glottis).
Frequency response of the uniform tube in terms of volume velocities
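For the uniform lossless tube with an ideal volume-velocity source at the glottis and zero pressure at the lips, the standard textbook result is a volume-velocity transfer function of 1/cos(Ωl/c), with resonances at F_n = (2n-1)c/(4l). The sketch below evaluates this; the tube length of 17.5 cm and the sound velocity of 350 m/s are assumed typical values, not figures taken from these slides, and the function names are hypothetical.

```python
import math


def uniform_tube_formants(length_m, c_m_per_s=350.0, n_formants=3):
    """Resonances F_n = (2n - 1) * c / (4 * l) of a uniform lossless tube."""
    return [(2 * n - 1) * c_m_per_s / (4.0 * length_m)
            for n in range(1, n_formants + 1)]


def transfer_magnitude(f_hz, length_m, c_m_per_s=350.0):
    """|U(l, f) / U_G(f)| = 1 / |cos(2*pi*f*l/c)| for the lossless uniform tube."""
    return 1.0 / abs(math.cos(2.0 * math.pi * f_hz * length_m / c_m_per_s))


TUBE_LENGTH_M = 0.175  # assumed vocal tract length

for index, formant_hz in enumerate(uniform_tube_formants(TUBE_LENGTH_M), start=1):
    print(f"F{index} = {formant_hz:.0f} Hz")

# The response is flat (gain 1) at 0 Hz and grows sharply close to a resonance.
print(transfer_magnitude(0.0, TUBE_LENGTH_M))
print(transfer_magnitude(499.0, TUBE_LENGTH_M))
```

Under these assumptions the resonances fall near 500, 1500 and 2500 Hz, in line with the rule of thumb of about one formant per 1 kHz.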
... function with shifts due to losses and radiation.
The bandwidths of the two lowest resonances (F1 and F2) depend primarily on the vocal tract wall losses.
The bandwidths of the highest resonances (F3, F4, ...) depend primarily on viscous friction, thermal losses, and radiation losses.
... velocities at inputs to nasal and oral cavities
Can solve flow equations numerically.
Results show resonances dependent on shape and length of the 3 tubes.
Closed oral cavity can trap energy at certain frequencies, preventing those frequencies from appearing in the nasal output => antiresonances or zeros of the transfer function.
Nasal resonances have broader bandwidths than non-nasal voiced sounds => due to greater viscous friction and thermal loss due to the large surface area of the nasal cavity.
Speech Perception
Understanding how we hear sounds and how we perceive speech leads to better design and implementation of robust and efficient systems for analyzing and representing speech.
Some Facts About Human Hearing
Speech Communication
perception
Human Ear
Outer ear: pinna and external canal
Middle ear: tympanic membrane or eardrum
Inner ear: cochlea, neural connections
Outer ear: funnels sound into the ear canal.
Middle ear: sound impinges on the tympanic membrane; this causes motion.
The middle ear is a mechanical transducer, consisting of the hammer, anvil and stirrup; it converts the acoustical sound wave to mechanical vibrations.
Middle and Inner Ear
Mechanical realization of a bank of filters.
Filters are roughly constant Q (center frequency/bandwidth) with logarithmically increasing bandwidth.
Distributed along the Basilar Membrane (BM) is a set of sensors called Inner Hair Cells (IHC), which act as mechanical-motion-to-neural-activity converters.
Mechanical motion along the BM is sensed by local IHC, causing firing activity at the nerve fibers that innervate the bottom of each IHC.
Each IHC is connected to about 10 nerve fibers, each of different diameter => thin fibers fire at high motion levels, thick fibers fire at lower motion levels.
30,000 nerve fibers link IHC to the auditory nerve.
Electrical pulses run along the auditory nerve and ultimately reach higher levels of auditory processing in the brain, where they are perceived as sound.
Different regions of the BM respond maximally to different input frequencies => frequency tuning occurs along the BM.
The BM acts like a bank of non-uniform cochlear filters.
Roughly logarithmic increase in BW of filters (<800 Hz has equal BW) => constant Q filters with BW decreasing as we move away from the cochlear opening.
The peak frequency at which maximum response occurs along the BM is called the characteristic frequency.
Psychoacoustic experiments (figure labels): same IHCs => slightly louder; equally loud, separated in freq. => different IHCs => twice as loud
Idealized basilar membrane filter bank
Center Frequency of Each Bandpass Filter: fc
Bandwidth of Bandpass Filter: Δf
Real BM filters overlap significantly
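An idealized constant-Q filter bank of the kind described here can be sketched by spacing the center frequencies geometrically and setting each bandwidth to fc/Q. The frequency range, the number of filters and the Q value below are arbitrary illustrative assumptions; the sketch also ignores the roughly equal bandwidths below about 800 Hz and the overlap of real BM filters.

```python
def constant_q_bank(f_low_hz, f_high_hz, n_filters, q):
    """Return a list of (center_hz, bandwidth_hz) with bandwidth = center / q."""
    if n_filters < 2:
        raise ValueError("need at least two filters to span a range")
    # Geometric spacing: each center is a fixed ratio above the previous one.
    step = (f_high_hz / f_low_hz) ** (1.0 / (n_filters - 1))
    bank = []
    for i in range(n_filters):
        center_hz = f_low_hz * step ** i
        bank.append((center_hz, center_hz / q))
    return bank


# Assumed values for illustration: 7 filters from 100 Hz to 6400 Hz with Q = 4.
BANK = constant_q_bank(100.0, 6400.0, n_filters=7, q=4.0)

for center_hz, bandwidth_hz in BANK:
    print(f"fc = {center_hz:7.1f} Hz   bandwidth = {bandwidth_hz:7.1f} Hz")
```

Because the ratio fc/Δf is held fixed, the bandwidth grows in proportion to the center frequency, which is the logarithmically increasing bandwidth noted above.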
Speech Perception
Speech perception studies try to answer the key question of what the resolving power of the hearing mechanism is => how good an estimate of pitch, formant, amplitude, spectrum, V/UV, etc. do we need so that the perception mechanism can't tell the difference?
Speech is a multidimensional signal with a linguistic association => difficult
Acoustic Definitions
Threshold of audibility: 0 dB at 1000 Hz
Threshold of feeling: 120 dB
Threshold of pain: 140 dB
Immediate damage: 160 dB
Thresholds vary with frequency and from person to person.
Maximum sensitivity is at about 3000 Hz.
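To get a feel for the range these levels cover, they can be converted to sound pressure. The sketch below assumes the levels are dB SPL, i.e. referenced to the conventional 20 micropascal reference pressure; the slides do not state the reference explicitly, and the function names are hypothetical.

```python
import math

P_REF_PA = 20e-6  # conventional reference pressure for dB SPL (assumed here)


def spl_to_pressure(level_db):
    """Convert a level in dB SPL to RMS sound pressure in pascals."""
    return P_REF_PA * 10.0 ** (level_db / 20.0)


def pressure_to_spl(pressure_pa):
    """Convert RMS sound pressure in pascals to a level in dB SPL."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


THRESHOLDS_DB = {
    "audibility (1000 Hz)": 0.0,
    "feeling": 120.0,
    "pain": 140.0,
    "immediate damage": 160.0,
}

for name, level_db in THRESHOLDS_DB.items():
    print(f"{name:22s} {level_db:5.0f} dB  ->  {spl_to_pressure(level_db):.3g} Pa")
```

Every 20 dB step multiplies the pressure by 10, so the span from the threshold of audibility to the threshold of feeling corresponds to a pressure ratio of one million.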