Lec2 Audition

MIT OpenCourseWare
http://ocw.mit.edu
24.910 Topics in Linguistic Theory: Laboratory Phonology

Spring 2007
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
24.910
Laboratory Phonology
Basic Audition
No class next week (Tuesday is a Monday)
Readings for 2/27: Johnson chs 5 & 6
Assignments (due 2/27):
Basic acoustics.
VOT and laryngeal contrasts in Mandarin and
English.
Audition
Figure: anatomy of the ear. Labels: Outer Ear (Ear Flap, Ear Canal), Middle Ear (Eardrum, Hammer, Anvil, Stirrup, Eustachian Tube), Inner Ear (Cochlea, Auditory Nerve).
Figure by MIT OpenCourseWare.
Anatomy
Audition
Loudness
Pitch
Auditory spectrograms
Loudness
The perceived loudness of a sound depends on the
amplitude of the pressure fluctuations in the sound
wave.
Amplitude is usually measured in terms of root-mean-square (rms) amplitude:
The square root of the mean of the squared amplitude over some time window.
rms amplitude
Square each sample in the analysis window.
Calculate the mean value of the squared waveform:
Sum the values of the samples and divide by the number of
samples.
Take the square root of the mean.
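The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `rms_amplitude` and the sine-wave test window are assumptions chosen for the example.

```python
import math

def rms_amplitude(samples):
    """Root-mean-square amplitude of the samples in one analysis window."""
    squared = [s * s for s in samples]          # square each sample
    mean_square = sum(squared) / len(squared)   # sum and divide by the number of samples
    return math.sqrt(mean_square)               # take the square root of the mean

# Hypothetical test window: one full cycle of a sine wave with peak amplitude 1.
n = 1000
window = [math.sin(2 * math.pi * k / n) for k in range(n)]
print(round(rms_amplitude(window), 3))  # close to 1/sqrt(2), i.e. about 0.707
```

For a sine wave the rms amplitude is the peak amplitude divided by the square root of 2, which is what the example prints.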
Figure: pressure, pressure^2, and rms amplitude plotted against time in seconds.
Figure by MIT OpenCourseWare. Adapted from Johnson, Keith. Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Intensity
Perceived loudness is more closely related to intensity (power per unit area), which is proportional to the square of the amplitude.
relative intensity in Bels = log10(x^2/r^2)
relative intensity in dB = 10 log10(x^2/r^2)
= 20 log10(x/r)
(x is the measured amplitude, r is the comparison amplitude)
In absolute intensity measurements, the comparison amplitude is usually 20 µPa, the lowest audible pressure fluctuation of a 1000 Hz tone (dB SPL).
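As a rough sketch of the decibel formulas, the following Python computes relative intensity from two amplitudes and dB SPL from a pressure in pascals. The function names are hypothetical, and the reference is assumed to be the standard 20 micropascals (20e-6 Pa).

```python
import math

# Assumption: the dB SPL comparison amplitude is 20 micropascals.
REFERENCE_PRESSURE_PA = 20e-6

def relative_intensity_db(x, r):
    """Relative intensity in dB of amplitude x compared with amplitude r."""
    return 20 * math.log10(x / r)

def db_spl(pressure_pa):
    """Level in dB SPL of an rms pressure given in pascals."""
    return relative_intensity_db(pressure_pa, REFERENCE_PRESSURE_PA)

print(relative_intensity_db(10, 1))  # a tenfold increase in amplitude is 20 dB
print(round(db_spl(2.0)))            # a pressure of 2 Pa is about 100 dB SPL
```

Because intensity goes with the square of amplitude, 10 log10(x^2/r^2) and 20 log10(x/r) give the same value.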
logarithmic scales
log x^n = n log x
Figure: plot of a logarithmic function of x, for x from 0 to 50.
Loudness
The relationship between intensity and perceived loudness
is not exactly logarithmic.
Figure: loudness in sones (0 to 20) and level in dB SPL (0 to 100), each plotted against pressure (0 to 2,000,000 µPa).
Loudness
Loudness also depends on frequency.
equal loudness contours for pure tones:
Source: Wikimedia Commons.

Loudness
At short durations, loudness also depends on duration.
Temporal integration: loudness depends on energy in the
signal, integrated over a time window.
Duration of integration is often said to be about 200ms, i.e.
relevant to the perceived loudness of vowels.
Pitch
Perceived pitch is approximately linear with respect to
frequency from 100-1000 Hz; between 1000 and 10,000 Hz the
relationship is approximately logarithmic.
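One way to see this relationship numerically is to convert frequencies to the Bark scale. The slides do not say which conversion they use; the sketch below assumes the approximation published by Traunmüller (1990), and the function name `hz_to_bark` is hypothetical.

```python
def hz_to_bark(frequency_hz):
    """Approximate Bark value for a frequency in Hz.

    Assumption: Traunmueller's (1990) formula, not necessarily the
    conversion behind the figure in these slides.
    """
    return 26.81 * frequency_hz / (1960.0 + frequency_hz) - 0.53

# Equal 500 Hz steps cover fewer and fewer Bark as frequency rises.
low_step = hz_to_bark(1000) - hz_to_bark(500)
high_step = hz_to_bark(9500) - hz_to_bark(9000)
print(round(low_step, 2), round(high_step, 2))
```

The same 500 Hz step is worth several Bark at low frequencies and only a fraction of a Bark near 9 kHz, which reflects the compression of the scale at high frequencies.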
Figure: Frequency (Bark), 0 to 24, plotted against Frequency (kHz), 0 to 10.
Pitch
The non-linear frequency response of the auditory system is related to the
physical structure of the basilar membrane.
basilar membrane uncoiled (frequencies in Hz, from base to apex):
15,400 11,500 8,700 6,500 4,900 3,700 2,700 2,000 1,500 1,100 750 500 300 100
Masking - simultaneous
Energy at one frequency can reduce audibility of
simultaneous energy at another frequency (masking).
One sound can also mask a preceding or following sound.
Figure: Masking (threshold multiplier, 1 to 10^4, logarithmic axis) plotted against frequency of masked tone (400 to 4000 Hz).
Example of masking of a tone by a tone. The frequency of the masking tone is 1200 Hz. Each curve corresponds to a different masker level, and gives the amount by which the threshold intensity of the masked tone is multiplied in the presence of the masker, relative to its threshold in quiet. The dashed lines near 1200 Hz and its harmonics are estimates of the masking functions in the absence of the effect of beats.
Figure by MIT OpenCourseWare. Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN: 9780262194044.
Time course of auditory nerve response
Response to a noise burst:
Strong initial response
Rapid adaptation (~5 ms)
Slow adaptation (>100 ms)
After tone offset, firing rate only gradually returns to spontaneous level.
Figure: response over time (0 to 128 msec; vertical axis 0 to 256) in two panels labelled -60 and -40.
Figure by MIT OpenCourseWare. Adapted from Kiang et al. (1965)

Interactions between sequential sounds
A preceding sound can affect the auditory nerve response
to a following tone (Delgutte 1980).
Figure: Discharge rate (SP/S, 0 to 600) against time (0 to 400) in three panels: NO AT, AT = 13 dB SPL, and AT = 37 dB SPL, with TT = 27 dB SPL.
Figure by MIT OpenCourseWare. Adapted from Stevens, Kenneth N. Acoustic Phonetics. Cambridge, MA: MIT Press, 1999. ISBN: 9780262194044, after Delgutte, B. "Representation of Speech-like Sounds in the Discharge Patterns of Auditory-nerve Fibers." Journal of the Acoustical Society of America 68, no. 3 (1980): 843-857.
Auditory spectrograms
The auditory system performs a running frequency analysis of
acoustic signals - cf. spectrogram.
A regular spectrogram analyzes frequency bands of equal widths,
but the peripheral auditory system analyzes frequency bands
that are wider at higher frequencies.
Further disparities are introduced by the non-linearities of the
peripheral auditory system, e.g.
loudness is non-linearly related to intensity
masking (simultaneous and nonsimultaneous)
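To illustrate how auditory frequency bands widen with frequency, the sketch below uses the equivalent rectangular bandwidth (ERB) formulas of Glasberg and Moore (1990). This is an assumption for illustration; the slides do not state which formula underlies their ERB plots, and the function names are hypothetical.

```python
import math

def erb_bandwidth_hz(frequency_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter centred
    on frequency_hz. Assumption: Glasberg and Moore (1990) approximation."""
    return 24.7 * (4.37 * frequency_hz / 1000.0 + 1.0)

def erb_number(frequency_hz):
    """Position of frequency_hz on the ERB-number scale (same assumption)."""
    return 21.4 * math.log10(4.37 * frequency_hz / 1000.0 + 1.0)

for f in (250, 1000, 4000):
    print(f, "Hz: bandwidth", round(erb_bandwidth_hz(f)), "Hz, ERB number", round(erb_number(f), 1))
```

A fixed-width spectrogram filter would have the same bandwidth at all three frequencies, whereas the auditory bandwidth estimated here grows steadily with centre frequency.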
Figure: a comparison of acoustic and auditory spectra, plotted as Amplitude (dB) against Acoustic Frequency (kHz, 0 to 10) and Auditory Frequency (Bark, 0 to 24).
A comparison of acoustic (light line) and auditory (heavy line) spectra of a complex wave composed of sine waves at 500 and 1,500 Hz. Both spectra extend from 0 to 10 kHz, although on different frequency scales. The auditory spectrum was calculated from the acoustic spectrum using the model described in Johnson (1989).
Image by MIT OpenCourseWare. Adapted from Johnson, Keith. Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Figure: Vowel /I/. Top: Level, dB (40 to 85) against Frequency, Hz (0 to 4,000). Bottom: Excitation Level, dB (10 to 80) against Number of ERBs, E (2 to 26), with curves labelled 50 and 80.
The spectrum of a synthetic vowel /I/ (top) plotted on a linear frequency scale, and the excitation patterns for that vowel (bottom) for two overall levels, 50 and 80 dB. The excitation patterns are plotted on an ERB scale.
Image by MIT OpenCourseWare. Adapted from Moore, Brian. The Handbook of Phonetic Science. Edited by William J. Hardcastle and John Laver. Malden, MA: Blackwell, 1997. ISBN: 9780631188483.
Spectrogram images removed due to copyright restrictions.
Figure 3.8 in Johnson, Keith. "Comparison of Normal Acoustic Spectrogram and Auditory Spectrogram or Cochleagram."
Acoustic and Auditory Phonetics. Malden, MA: Blackwell Publishers, 1997. ISBN: 9780631188483.
Italian vowels
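Figure: Italian vowels i, e, a, o, u plotted by F2 (Hz, 2500 to 500) on the horizontal axis and F1 (Hz, 200 to 800) on the vertical axis.
ERB scales
Figure: vowel plot on ERB scales, with F2 (E, 25 to 15) on the horizontal axis and E(F1) (6 to 16) on the vertical axis.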
24.910
Linguistic Phonetics
Analog-to-digital conversion of
speech signals
Figure: a waveform (amplitude axis -1.2 to 2.0, time axis 0.00 to 0.05) and, below it, "The Results Of Sampling" on the same axes.
Figure by MIT OpenCourseWare.

Analog-to-digital conversion
Almost all acoustic analysis is now computer-based.
Sound waves are analog (or continuous) signals, but digital
computers require a digital representation - i.e. a series of
numbers, each with a finite number of digits.
There are two continuous scales that must be divided into
discrete steps in analog-to-digital conversion of speech: time
and pressure (or voltage).
Dividing time into discrete chunks is called sampling.
Dividing the amplitude scale into discrete steps is called
quantization.
Sampling
The amplitude of the analog signal is sampled at regular intervals.
The sampling rate is measured in Hz (samples per second).
The higher the sampling rate, the more accurate the digital representation will be.
Figure (time axis 0 to .03 sec):
(a) A wave with a fundamental frequency of 100 Hz and a major component at 300 Hz sampled at 1500 Hz.
(b) The same wave sampled at 600 Hz.
(c) The same wave sampled at 500 Hz.
Figure by MIT OpenCourseWare. Adapted from Ladefoged, Peter. L104/204 Phonetic Theory lecture notes, University of California, Los Angeles.
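Sampling at regular intervals can be sketched as evaluating the wave at times n / (sampling rate). The example below is only an illustration: the function name is hypothetical, and the relative amplitude of the 300 Hz component (0.5) is an assumption, since the slides do not give it.

```python
import math

def sample_wave(sampling_rate_hz, duration_s):
    """Sample a 100 Hz fundamental plus a 300 Hz component at regular intervals.

    Assumption: the 300 Hz component has half the amplitude of the fundamental.
    """
    n_samples = round(sampling_rate_hz * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / sampling_rate_hz  # time of the n-th sample in seconds
        value = math.sin(2 * math.pi * 100 * t) + 0.5 * math.sin(2 * math.pi * 300 * t)
        samples.append(value)
    return samples

for rate in (1500, 600):
    print(rate, "Hz:", len(sample_wave(rate, 0.03)), "samples in 30 ms")
```

The higher rate describes the same 30 ms of signal with more than twice as many numbers, which is why it follows the waveform more closely.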
Sampling
In order to represent a wave component of a given frequency, it is necessary to sample the signal at a rate of at least twice that frequency (the Nyquist Theorem).
The highest frequency that can be represented at a given sampling rate (half the sampling rate) is called the Nyquist frequency.
The wave in the figure has a significant harmonic at 300 Hz:
(a) sampling rate 1500 Hz
(b) sampling rate 600 Hz
(c) sampling rate 500 Hz
Figure (time axis 0 to .03 sec): (a) A wave with a fundamental frequency of 100 Hz and a major component at 300 Hz sampled at 1500 Hz. (b) The same wave sampled at 600 Hz. (c) The same wave sampled at 500 Hz.
Figure by MIT OpenCourseWare. Adapted from Ladefoged, Peter. L104/204 Phonetic Theory lecture notes, University of California, Los Angeles.
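The comparison of the three sampling rates can be checked with a short sketch. The function names are hypothetical; the test follows the "at least twice" wording of the slide, although a component at exactly the Nyquist frequency is a borderline case in practice.

```python
def nyquist_frequency(sampling_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sampling_rate_hz / 2

def can_represent(component_hz, sampling_rate_hz):
    """True if the sampling rate is at least twice the component frequency."""
    return sampling_rate_hz >= 2 * component_hz

# The 300 Hz harmonic at the three sampling rates (a), (b) and (c).
for rate in (1500, 600, 500):
    print(rate, "Hz: Nyquist", nyquist_frequency(rate), "Hz, 300 Hz ok:", can_represent(300, rate))
```

At 500 Hz the Nyquist frequency is 250 Hz, so the 300 Hz harmonic cannot be represented correctly.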
What sampling rate should you use?
The highest frequency that (young, undamaged) ears can
perceive is about 20 kHz, so to ensure that all audible
frequencies are represented we must sample at 2 × 20 kHz =
40 kHz.
The ear is relatively insensitive to frequencies above 10
kHz, and almost all of the information relevant to speech
sounds is below 10 kHz, so high quality sound is still
obtained at a sampling rate of 20 kHz.
There is a practical trade-off between fidelity of the signal
and memory, but memory is getting cheaper all the time.
What sampling rate should you use?
For some purposes (e.g. measuring vowel formants), a high
sampling rate can be a liability, but it is always possible to
downsample before performing an analysis.
Audio CD uses a sampling rate of 44.1 kHz.

Many A-to-D systems only operate at fractions of this rate
(22050 Hz, 11025 Hz).
Aliasing
Components of a signal which are above the Nyquist
frequency are misrepresented as lower frequency
components (aliasing).
To avoid aliasing, a signal must be filtered before it is
sampled to eliminate frequencies above the Nyquist frequency.
Since practical filters are not infinitely sharp, this will
also attenuate energy near the Nyquist frequency.
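A minimal sketch of where an unfiltered component ends up: the apparent (alias) frequency is the distance from the component to the nearest whole multiple of the sampling rate. The function name is hypothetical, and the example reuses the 300 Hz component and 500 Hz sampling rate from the sampling figure.

```python
import math

def alias_frequency(component_hz, sampling_rate_hz):
    """Frequency at which a sampled component appears after sampling."""
    nearest_multiple = round(component_hz / sampling_rate_hz) * sampling_rate_hz
    return abs(component_hz - nearest_multiple)

print(alias_frequency(300, 500))   # 200: above the 250 Hz Nyquist frequency, so aliased
print(alias_frequency(300, 1500))  # 300: below the Nyquist frequency, so unchanged

# Check: at 500 samples per second, a 300 Hz sine yields the same sample
# values as a 200 Hz sine with inverted sign.
rate = 500
for n in range(10):
    s300 = math.sin(2 * math.pi * 300 * n / rate)
    s200 = math.sin(2 * math.pi * 200 * n / rate)
    assert abs(s300 + s200) < 1e-9
```

Once sampled, the two sines are indistinguishable, which is why the filtering has to happen before analog-to-digital conversion.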
Figure: Amplitude plotted against Time (ms), 0 to 10.
Figure by MIT OpenCourseWare. Adapted from Johnson, Keith.

Quantization
The amplitude of the signal at each sampling point must be
specified digitally - quantization.
Divide the continuous amplitude scale into a finite number
of steps. The more levels we use, the more accurately we
approximate the analog signal.
Figure: a waveform quantized with 20 steps and with 200 steps; Amplitude plotted against Time (ms), 0 to 20.

Quantization
The number of levels is specified in terms of the number of
bits used to encode the amplitude at each sample.
Using n bits we can distinguish 2^n levels of amplitude.
e.g. 8 bits, 256 levels.
16 bits, 65536 levels.
Now that memory is cheap, speech is almost always
digitized at 16 bits (the CD standard).
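The relation between bits and levels, and the rounding of an amplitude to a level, can be sketched as follows. The function names and the signed-integer coding (for example -32768 to 32767 at 16 bits) are assumptions for illustration, with full scale taken to be an amplitude of 1.0.

```python
def n_levels(bits):
    """Number of amplitude levels that can be distinguished with this many bits."""
    return 2 ** bits

def quantize(amplitude, bits):
    """Map an amplitude (full scale = 1.0) to the nearest signed integer level.

    Assumption: signed integer codes, with out-of-range values clipped.
    """
    max_code = 2 ** (bits - 1) - 1   # e.g. 32767 for 16 bits
    min_code = -(2 ** (bits - 1))    # e.g. -32768 for 16 bits
    code = round(amplitude * max_code)
    return max(min_code, min(max_code, code))

print(n_levels(8), n_levels(16))  # 256 65536
print(quantize(1.0, 16))          # 32767
print(quantize(2.0, 8))           # 127: the value exceeds full scale and is clipped
```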
Quantization
Quantizing an analog signal necessarily introduces
quantization errors.
If the signal level is lower, the degradation in signal-to-
noise ratio introduced by quantization noise will be greater,
so digitize recordings at as high a level as possible without
exceeding the maximum amplitude that can be represented
(clipping).
On the other hand, it is essential to avoid clipping.
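The effect of recording level on quantization noise can be estimated with a small simulation. This is a sketch under stated assumptions: a sine-wave test signal, full scale equal to 1.0, rounding to signed integer levels, and a hypothetical function name.

```python
import math

def quantization_snr_db(peak, bits, n=1000):
    """Signal-to-noise ratio (dB) of a quantized sine wave with the given peak
    amplitude, where full scale is 1.0. Assumption: rounding to signed levels."""
    max_code = 2 ** (bits - 1) - 1
    signal_power = 0.0
    noise_power = 0.0
    for k in range(n):
        x = peak * math.sin(2 * math.pi * k / n)
        q = round(x * max_code) / max_code   # quantize, then convert back
        signal_power += x * x
        noise_power += (q - x) ** 2          # quantization error
    return 10 * math.log10(signal_power / noise_power)

full_scale = quantization_snr_db(1.0, 8)
low_level = quantization_snr_db(0.1, 8)
print(round(full_scale), round(low_level))
```

The signal recorded at one tenth of full scale comes out with a clearly lower signal-to-noise ratio, because the quantization error stays about the same size while the signal shrinks.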
Figure: Amplitude plotted against Time (ms), 0 to 25.

Voicing and aspiration
Many languages make a contrast between two sets of stops
with different laryngeal properties, loosely referred to as
voiced and voiceless.
The precise details of these laryngeal contrasts differ from
language to language.
Some broad distinctions:
voiced [b]: vocal fold vibration during closure
bal (hair)
voiceless unaspirated [p]: no vibration of the vocal folds, short VOT
pal (take care of)
voiceless aspirated [pʰ]: no vibration of the vocal folds, long VOT (high airflow after release)
pʰal (knife blade)
Listen to all three sound files here.
Voicing and aspiration
Voiced vs. voiceless [b vs. p]
Russian, French, Dutch
Unaspirated vs. aspirated [p vs. pʰ]
Mandarin, Cantonese
Voiced vs. voiceless unaspirated vs. aspirated [b vs. p vs. pʰ]
Hindi, Thai
English shows contextual variation between voicing and aspiration.
Voice Onset Time
English utterance-initial stops
Figure: waveform and spectrogram (0 to 5000 Hz) panels for two words.
die: voiceless unaspirated, VOT 22 ms
tie: voiceless aspirated, VOT 86 ms
VOT, closure voicing
English intervocalic stops can be fully voiced
VOT is 0 ms in 2nd and 3rd stops
Figure: waveform and spectrogram (0 to 5000 Hz) panels for brigadoo(n).
VOT, closure voicing
Hindi - three-way contrast
recordings from Ladefoged
http://www.phonetics.ucla.edu/vowels/chapter12/hindi.html
Figure: waveform and spectrogram (0 to 5000 Hz) panels for three words.
bal (hair)
pal (take care of)
pʰal (knife blade)
Listen to all three sound files here.
