Speech and Audio Coding

Signal Processing for Communications
An Introduction to Advanced Technology and Research for Undergraduates

Related Technologies and Applications:

Digital Cell Phones
Technologies for Cable Modems and Wi-Fi
Secure Military Communications
April 14, 2006, 9:45am-12pm, SCOB 101
Lectures and Modules for Undergraduates on:

Speech and Audio Coders, Andreas Spanias
Channel Coders, Tolga Duman
Time-Varying Signal Processing, Antonia Papandreou-Suppappola
Multicarrier and OFDM Systems, Cihan Tepedelenlioglu
Sponsored by the NSF Combined Research and Curriculum Development Grant 0417604
April 2006 Copyright (c) 2006 - Andreas Spanias II-1
Pedagogies for transition of research to UG curriculum
[Figure: flow of SP-COM research into the undergraduate curriculum, with a feedback/improvement loop]
• DEMO MODULES (DM): Summer Freshman and Sophomore Research Camps
• ASU J-DSP Technology for on-line Java Computer Labs
• SMALL 1-LECTURE/LAB MODULES (SM): 1 lecture / 1 exercise; 4 module summaries to inject in SS EEE 303, RDA EEE 350, DSP EEE 407, CS EEE 455
• LARGE 6-LECTURE 498 MODULES (LM), EEE 498 Intro to SP-COM Research:
  - Source Coding (6 lect / 1 lab)
  - Channel Coding (6 lect / 1 lab)
  - Multi-carrier (6 lect / 1 lab)
  - Time-varying signaling (6 lect / 1 lab)
• Module content is drawn from ASU SP-COM research activities and from published work from other universities

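Wireless Communications (cell phone appl.)
[Figure: block diagram of a digital wireless link]
Transmit: Input Speech -> Source Coder -> Channel Coding -> Modulator -> Channel
Receive: Channel -> Demodulator -> Channel Decoding -> Source Decoding -> Output Speech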
Speech and Audio Coding for Mobile and Multimedia Applications
CRCD Activity, April 14, 2006
by
Andreas Spanias, Professor
DSP and Speech Processing Labs.
Dept. of Electrical Engineering
Arizona State University
Tempe, AZ 85287-5706
email: spanias@asu.edu
http://www.eas.asu.edu/~spanias

Topics
1. The Speech Coding Problem
2. Speech Processing Analysis-Synthesis Algorithms
3. Historical Perspective on Algorithmic Research
4. The Standards on Speech Coding
5. Algorithm Examples
6. Research / Remarks
Digital Speech
$s(n) = s(nT) = s_a(t)\,|_{t = nT}$

Why Digital Speech?
- Can be manipulated with software
- Opportunities for encryption and enhanced privacy
- Stored with high fidelity
- Error control
- Mixing voice/data/video (multimedia)

Continuous vs Discrete-time (digital) Speech
[Figure: a continuous-time (analog) signal x(t) is sampled at t = 0, T, 2T, ... and passed through a quantizer Q to give the discrete-time (digital) signal x(n)]
A signal that is bandlimited to B must be sampled at a rate fs ≥ 2B.
Telephone speech is typically bandlimited to 3.2 kHz and sampled at 8 kHz.
Quantization Considerations
For uncompressed telephone speech: 8 bits per sample at 8000 samples per second,
for a total of 8000 x 8 = 64 kilobits per second (kbits/s).
64 kbits/s PCM is often used as a reference for comparison.
To transmit this signal using a basic binary signaling scheme we need at least 32 kHz of bandwidth.
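The arithmetic on this slide can be checked with a short sketch. It assumes that "basic binary signaling" means one bit per symbol with ideal Nyquist pulses, so the minimum bandwidth is half the bit rate; the function names are illustrative only.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample


def min_binary_bandwidth_hz(bit_rate_bps):
    """Minimum bandwidth for binary signaling, assuming ideal Nyquist
    pulses (an assumption): half the bit rate."""
    return bit_rate_bps / 2


rate = pcm_bit_rate(8000, 8)  # telephone speech: 8 kHz sampling, 8 bits/sample
print(rate / 1000, "kbits/s")
print(min_binary_bandwidth_hz(rate) / 1000, "kHz")
```

Any practical pulse shape needs more than this minimum, which is why the slide says "at least".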

Speech Coding
Speech coding or speech compression is the field concerned with obtaining compact digital representations of voice signals for the purpose of efficient transmission or storage.
Speech coding involves sampling and amplitude quantization.
The objective of speech coding is to represent speech with a minimum number of bits while maintaining its perceptual quality.
Medium, Low, and Very-low Rate Speech Coding
The speech coding methods discussed in this course are those intended for digital speech communications, where speech is generally bandlimited to 4 kHz (or 3.2 kHz) and sampled at 8 kHz.
medium-rate: the range of 8 - 16 kbits/s
low-rate: the range below 8 kbits/s and down to 2.4 kbits/s
very-low-rate: the range below 2.4 kbits/s
Remark: Cellular, Voice-over-IP, and speech streaming applications typically use low-rate coders.
Historical Perspective
The First Vocoder - Dudley’s Channel Vocoder
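[Figure: block diagram of Dudley's channel vocoder, divided into an Analysis side and a Synthesis side. Labels recovered from the figure: pitch channel (frequency discriminator, frequency meter, filter 0-25~), spectrum channels (equalizer, filter 0-300~, filter 0-25~), pitch oscillator, noise source, modulator, filter, equalizer.]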
A total of ten channels
H. Dudley, "Remaking Speech," J. Acoust. Soc. Am., Vol. 11, p. 169, 1939.
H. Dudley, "The Vocoder," Bell Labs Record, Vol. 17, p. 122, 1939.
Voiced and Unvoiced Speech
[Figure: two speech segments, each shown as a time-domain waveform (amplitude, -1.0 to 1.0, vs. time, 0-32 ms) next to its spectrum (magnitude in dB vs. frequency, 0-4 kHz). Top panels (TAPE TIME: 8014) are annotated with the fundamental frequency and the formant structure; bottom panels show a second segment (TAPE TIME: 3840).]

Fine (Pitch) and Formant Structure of the Short-time Speech Spectrum
Fine Harmonic Structure: reflects the quasi-periodicity of speech and is attributed to the vibrating vocal cords.
Note the narrow peaks.
Formant Structure (Spectral Envelope): is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity.
Note the envelope peaks.
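Simple Speech Synthesis Model (2)
[Figure: pitch information (τ) and a voicing decision (V/UV) control the excitation, which is scaled by a gain and passed through the vocal tract filter to produce synthetic speech. The model requires a “hard” (binary) voicing decision.]
Vocal tract filter:
$H(z) = \dfrac{b_0}{1 + \sum_{i=1}^{M} a_i z^{-i}}$
H(z) is typically estimated using short-term linear prediction.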
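A minimal sketch of this synthesis model follows. It assumes the usual two-state excitation (a unit impulse train at the pitch period for voiced frames, white Gaussian noise for unvoiced frames), which the surviving figure labels suggest but do not spell out; the function and parameter names are hypothetical.

```python
import random


def synthesize(a, b0, gain, voiced, pitch_period, n_samples, seed=0):
    """Drive the all-pole filter H(z) = b0 / (1 + sum_i a[i-1] z^-i)
    with a binary-voicing excitation (assumed model, see lead-in)."""
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        if voiced:
            # Voiced: impulse train with a period of pitch_period samples
            e = 1.0 if n % pitch_period == 0 else 0.0
        else:
            # Unvoiced: white Gaussian noise
            e = rng.gauss(0.0, 1.0)
        y = b0 * gain * e
        # Recursive part: y(n) = b0*g*e(n) - sum_i a_i y(n-i)
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                y -= ai * out[n - i]
        out.append(y)
    return out


# One-pole example: each pitch pulse starts a decaying response
print(synthesize([-0.9], 1.0, 1.0, True, 4, 6))
```

In a real coder the coefficients, gain, pitch, and voicing flag would be updated every frame; the sketch holds them fixed.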
The Levinson-Durbin Algorithm
The recursive coefficient update for the m-th order predictor (order index m = 1, 2, ..., p):
$\epsilon^{f}(0) = r_{ss}(0)$
$a_m(m) = \dfrac{r_{ss}(m) - \sum_{i=1}^{m-1} a_i(m-1)\, r_{ss}(m-i)}{\epsilon^{f}(m-1)}$
$a_i(m) = a_i(m-1) - a_m(m)\, a_{m-i}(m-1), \quad 1 \le i \le m-1$
$\epsilon^{f}(m) = \left(1 - (a_m(m))^2\right) \epsilon^{f}(m-1)$
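The recursion above translates almost line by line into code. The sketch below is one possible transcription; the list `r` is assumed to hold the autocorrelation values r_ss(0), ..., r_ss(p), and the function name is illustrative.

```python
def levinson_durbin(r, p):
    """Return (a, err) where a[i-1] = a_i(p) and err is the final
    forward prediction error, following the recursion on the slide."""
    a = []        # holds a_1(m-1) ... a_{m-1}(m-1)
    err = r[0]    # eps_f(0) = r_ss(0)
    for m in range(1, p + 1):
        # Numerator: r_ss(m) - sum_{i=1}^{m-1} a_i(m-1) r_ss(m-i)
        acc = r[m] - sum(a[i - 1] * r[m - i] for i in range(1, m))
        k = acc / err  # a_m(m)
        # a_i(m) = a_i(m-1) - a_m(m) a_{m-i}(m-1), 1 <= i <= m-1
        new_a = [a[i - 1] - k * a[m - i - 1] for i in range(1, m)]
        new_a.append(k)
        a = new_a
        # eps_f(m) = (1 - a_m(m)^2) eps_f(m-1)
        err = (1.0 - k * k) * err
    return a, err


# Autocorrelation of a first-order process: the second coefficient vanishes
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, err)
```

The recursion needs on the order of p^2 operations, compared with p^3 for solving the normal equations by general matrix inversion.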
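Speech Analysis-by-Synthesis (closed-loop)
[Figure: closed-loop coder. An excitation is selected or formed, scaled by a gain, and passed through the LTP synthesis filter (A_L(z)) and the LP synthesis filter (A(z)) to produce the synthetic speech ŝ(n). The difference between the input speech s(n) and ŝ(n) is weighted by W(z) and its MSE drives the excitation selection. An inset shows the frequency responses of the two synthesis filters.]
The synthesized speech is forced to match the input speech.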

Code Excited Linear Prediction (2)
The N×1 error vector:
$e_c(k) = s_w - \hat{s}_w^{0} - g_k\, \hat{s}_w(k)$
where $\hat{s}_w^{0}$ is the output due to the initial filter state. Writing $\bar{s}_w = s_w - \hat{s}_w^{0}$ and minimizing $\epsilon_c(k) = e_c^{T}(k)\, e_c(k)$ w.r.t. $g_k$ we get
$g_k = \dfrac{\bar{s}_w^{T}\, \hat{s}_w(k)}{\hat{s}_w^{T}(k)\, \hat{s}_w(k)}$
Code Excited Linear Prediction (3)
$\epsilon_c(k) = \bar{s}_w^{T} \bar{s}_w - \dfrac{\left(\bar{s}_w^{T}\, \hat{s}_w(k)\right)^2}{\hat{s}_w^{T}(k)\, \hat{s}_w(k)}$
The k-th excitation vector, $x_c(k)$, that minimizes $\epsilon_c(k)$ is selected.
Closed-loop analysis is used for the LTP parameters; the range of values for τ is within the integers 20 to 147.
M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at
Very Low Bit Rates," Proc. ICASSP-85, p. 937, Tampa, Apr. 1985.
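The codebook search described on the two CELP slides can be sketched as an exhaustive loop. The sketch assumes each codebook vector has already been passed through the weighted synthesis filter, and that `target` is the weighted speech with the initial-state (zero-input) response already subtracted; both names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def search_codebook(target, filtered_codebook):
    """Exhaustive CELP search: for each filtered code vector compute the
    optimal gain and the resulting error energy, and keep the best.
    Returns (index, gain, error)."""
    best = None
    energy_t = dot(target, target)
    for k, y in enumerate(filtered_codebook):
        energy_y = dot(y, y)
        if energy_y == 0.0:
            continue  # an all-zero vector cannot reduce the error
        corr = dot(target, y)
        gain = corr / energy_y                    # optimal gain g_k
        err = energy_t - corr * corr / energy_y   # error energy at that gain
        if best is None or err < best[2]:
            best = (k, gain, err)
    return best


# Toy 3-sample example; the second vector is a scaled copy of the target
print(search_codebook([1, 2, 3], [[1, 0, 0], [2, 4, 6], [0, 1, 0]]))
```

Minimizing the error is equivalent to maximizing the squared correlation divided by the code vector energy, which is what practical searches usually compute.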

LTP excited by a random signal creates pseudo-periodicity
$\dfrac{1}{1 - 0.95\, z^{-30}}$
[Figure: impulse response and frequency response (magnitude in dB vs. normalized frequency, Nyquist = 1) of this LTP synthesis filter]
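A small experiment illustrates the pseudo-periodicity. The sketch filters white noise through 1/(1 - 0.95 z^-30) and compares the normalized autocorrelation at the LTP lag with that at an unrelated lag; the noise length, seed, and helper names are arbitrary choices for illustration.

```python
import random


def ltp_synthesis(excitation, gain=0.95, lag=30):
    """y(n) = e(n) + gain * y(n - lag), i.e. 1 / (1 - gain z^-lag)."""
    out = []
    for n, e in enumerate(excitation):
        past = out[n - lag] if n >= lag else 0.0
        out.append(e + gain * past)
    return out


def normalized_autocorr(x, lag):
    """Autocorrelation at the given lag divided by the signal energy."""
    num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
    den = sum(v * v for v in x)
    return num / den


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(30000)]
y = ltp_synthesis(noise)
# Strong correlation is expected at lag 30, little at lag 15
print(normalized_autocorr(y, 30), normalized_autocorr(y, 15))
```

The output is still random, but samples 30 apart are strongly correlated, which is the pitch-like structure the slide refers to.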
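Perceptual Weighting Filter (2)
Short-term predictor (synthesis filter):
$H(z) = \dfrac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$
Perceptual filter, γ = 0.9:
$W(z) = \dfrac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} \gamma^{i} a_i z^{-i}}$
[Figure: magnitude responses (dB) of the short-term predictor and of the perceptual filter]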
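The weighting filter only needs the predictor coefficients and γ. The sketch below builds the scaled coefficients γ^i a_i and applies W(z) as a direct-form difference equation; the function names are illustrative, and the sign convention (1 - Σ a_i z^-i) follows this slide.

```python
def weighted_coeffs(a, gamma):
    """Scale the i-th predictor coefficient by gamma**i (i starts at 1)."""
    return [(gamma ** i) * ai for i, ai in enumerate(a, start=1)]


def perceptual_weighting(x, a, gamma=0.9):
    """Apply W(z) = (1 - sum a_i z^-i) / (1 - sum gamma^i a_i z^-i)."""
    aw = weighted_coeffs(a, gamma)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= a[i - 1] * x[n - i]   # numerator (analysis) part
                acc += aw[i - 1] * y[n - i]  # denominator (recursive) part
        y.append(acc)
    return y


# First-order example: impulse response of W(z) with a_1 = 0.5, gamma = 0.9
print(perceptual_weighting([1.0, 0.0, 0.0], [0.5], 0.9))
```

With γ = 1 the numerator and denominator cancel and W(z) = 1; smaller γ de-emphasizes the error near the formant peaks, where the speech itself masks the coding noise.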

Performance and Computational Complexity
A speech coding algorithm is designed and evaluated based on:
1. Bit rate
2. The quality of reconstructed (“coded”) speech
3. The complexity of the algorithm
4. The end-to-end delay
Subjective Speech Quality

Broadcast
Broadcast wideband speech refers to high quality
“commentary” speech at rates above 64 kbits/s.
Network or toll
Toll or Network quality refers to quality comparable
to the classical analog speech (200-3200 Hz)
Communications
Communications quality implies somewhat degraded
speech quality but adequate for cellular communications.
Synthetic
Synthetic speech is usually intelligible but can be
unnatural and associated with a loss of speaker recognizability.

The Mean Opinion Score
MOS Scale Speech Quality

1 Bad
2 Poor
3 Fair
4 Good
5 Excellent
The Mean Opinion Score (2)
The MOS range relates to speech quality as follows:
MOS 4.0 - 4.5 : network or toll quality
MOS 3.5 - 4.0 : communications quality
MOS 2.5 - 3.5 : synthetic quality
Remarks: MOS ratings may differ significantly from test to test, and hence they are not absolute measures for the comparison of different coders.
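As a small illustration, a MOS is the average of listener ratings on the 1-5 scale, and the ranges above map it to a quality class. The ratings in the example are made up, and assigning the shared boundary values (3.5 and 4.0) to the higher class is an assumption; the slide leaves them ambiguous.

```python
def mean_opinion_score(ratings):
    """Average of integer listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


def quality_class(mos):
    """Map a MOS to the quality classes listed on the slide.
    Boundary values go to the higher class (an assumption)."""
    if 4.0 <= mos <= 4.5:
        return "network or toll quality"
    if 3.5 <= mos < 4.0:
        return "communications quality"
    if 2.5 <= mos < 3.5:
        return "synthetic quality"
    return "outside the ranges listed"


scores = [4, 5, 4, 4, 4]  # hypothetical ratings from five listeners
mos = mean_opinion_score(scores)
print(mos, quality_class(mos))
```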

Wideband CDMA
Objective: meet IMT 2000 requirements (at least 144 kb/s in a vehicular environment, 384 kb/s in a pedestrian environment, and 2048 kb/s in an indoor office environment)
Supports next-generation data services envisioned up to 2 Mb/s (full coverage and mobility for 144 kb/s, preferably 384 kb/s; limited coverage and mobility for 2 Mb/s)
Enhanced voice services (audioconferencing and voice mail)
Concurrent high-quality video/audio
Backward compatible with IS-95B
High security and low power
Significantly enhanced version of EVRC for voice services
- http://www.comsoc.org/pubs/surveys/4q98issue/prasad.html
- D. Knisely et al., "Evolution of Wireless Data Services: IS-95 to CDMA 2000," IEEE Communications Magazine, pp. 140-149, October 1998
- V. K. Garg, IS-95 CDMA and cdma2000: Cellular/PCS Systems Implementation, 1/e, Prentice Hall PTR, December 1999
GSM Adaptive Multirate Coder
Adjusts its bit rate according to network load
Rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75 kb/s
Based on CELP with a 20 ms frame and 5 ms subframes
Multirate ACELP with 10th-order short-term LPC and perceptual weighting (uses Levinson-Durbin)
Encodes LSPs using split VQ
An open-loop LTP estimate is first obtained and then refined by closed-loop analysis
Highest bit rate provides toll quality; half rate provides communications quality
- ETSI TS 126 090 V.3.1.0 2000-01 - AMR SPEECH CODEC TRANSCODING FUNCTIONS 3G-TS 26.090 Technical Specification
- R. Ekudden, R. Hagen, I. Johansson, and J. Svedburg, "The Adaptive Multi-Rate speech coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, 1999
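The rates and frame timing listed above fix the bit budget per frame. The sketch below works it out from the slide's numbers only (bit rate times frame duration); how those bits are split among parameters is not covered here.

```python
AMR_RATES_KBPS = [12.2, 10.2, 7.95, 6.7, 5.9, 5.15, 4.75]
FRAME_MS = 20
SUBFRAME_MS = 5


def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """kbits/s times milliseconds gives bits; round away float noise."""
    return round(rate_kbps * frame_ms)


for rate in AMR_RATES_KBPS:
    print(rate, "kb/s ->", bits_per_frame(rate), "bits per 20 ms frame")
print(FRAME_MS // SUBFRAME_MS, "subframes per frame")
```

At 8 kHz sampling a 20 ms frame holds 160 samples, so even the highest mode spends well under two bits per sample.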

The Selectable Mode Vocoder
• Algorithm to provide higher quality, flexibility, and capacity over existing IS-96C, IS-127 EVRC, and IS-733 (which replaced IS-96C but works at a higher average rate)
• The Conexant SMV algorithm became the core technology for 3G CDMA (core SMV algorithm to be refined in the interim by participating companies according to the publication below)
• Based on 4 codecs: full rate at 8.5 kbps, half rate at 4 kbps, quarter rate at 2 kbps, and eighth rate at 800 bps
• Pre-processing includes noise suppression similar to IS-127 EVRC
• Full rate and half rate based on Conexant's eXtended CELP (eX-CELP), a core technology also used in the ITU G.4 Conexant submission to ITU-4
• Performed better than IS-733 and IS-127 in tests with and without background noise
• Scored as high as 4.1 MOS at full rate with clean speech. Performed very well with background noise
REFERENCES:
[1] “The SMV algorithm selected for TIA and 3GPP2 for CDMA applications,” conference paper by Conexant Systems, Y. Gao, E. Schlomot, A. Benyassine, J. Thyssen, H. Su, and C. Murgia (portions published at ICASSP-2001)
STANDARDS AT A GLANCE
• ITU Wideband Coding

– G.722 Coding of 7 kHz speech at 64, 56, 48 kbps - Sub-band ADPCM
– G.WB1 Coding of 7 kHz speech at 32/24 kbps - Combined Transform and CELP coding
– G.WB2 Coding of 7 kHz speech at 16 kbps or less (ongoing)
• ITU Telephony
– G.711 PCM (64 kbps) late 60’s
– G.726 ADPCM (32/40/24/16 kbps) 1988
– G.728 LD-CELP coding (16 kbps) 1992
– G.723.1 True Speech (5.3/6.3 kbps) 1995
– G.729 CS-ACELP (8/12.8/6.4 kbps) 1996 and Annex in 1998
– G.4kbps Toll quality at 4 kbps (ongoing)
• Non-ITU
– MPEG1/Audio (includes MP3), 1991
– MPEG2/Audio: 64 kbps (1992)
– MPEG4/Audio: audio/speech coding at bit rates between 64 and 2 kbps (1998)
– MPEG7/Audio: audio/speech/MIDI coding (ongoing)

STANDARDS AT A GLANCE (2)
• TIA
– CDMA
• IS96 8,4,2 kbps Q-CELP (Qualcomm CELP, 1992)
• IS127 8.55, 4, 0.8 kbps EVRC (Enhanced Variable Rate Coder, 1996)
• IS733 13.3, 6.2, 2.7, 1 kbps VRC (Variable Rate Coder, 1998)
• 3GPP2 0.8-8.55 kbps SMV (Selectable Mode Vocoder, 2001)
– TDMA
• IS54 7.95 kbps VSELP (Vector-Sum Excited Linear Prediction, 1989)
• IS641 7.4 kbps CELP (Similar to EFR but at lower rate, 1997)
– PCS1800 (GSM variant working at 1800 MHz)
• IS136-410 12.2 kbps US1 (1999)
• ETSI (GSM):
– 13 kbps RPE-LTP (Full rate GSM, 1988)
– 6.5 kbps VSELP (Half-rate GSM, 1993)
– 12.2 kbps EFR (Enhanced full-rate GSM, 1996)
– 12.2 - 4.75 kbps AMR (Adaptive Multi Rate, 1999)
• ARIB Japan
– Full-rate PDC (Personal Digital Communication) 6.7 kbps VSELP
– Half-rate PDC 3.45 kbps Multimode CELP
Vocoder/Waveform/Hybrid
[Figure: speech quality (MOS, 1-5) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for three coder classes: vocoders (LPC10e, MELP), hybrid coders (CELP, SMV), and waveform coders (ADPCM, PCM)]

PERFORMANCE OF SOME STANDARDIZED ALGORITHMS

Algorithm               Bit rate (kbits/s)  MOS           Complexity (MIPS)  Frame size (ms)
PCM G.711               64                  4.3           0.01               0
ADPCM G.726             32                  4.1           2                  0.125
SBC G.722               48/56/64            4.1           5                  0.125
LD-CELP G.728           16                  4             ~30                0.625
CS-ACELP G.729          8                   4             ~20                10
CS-ACELP-A G.729        8                   3.76          11                 10
MPC-MLQ G.723.1         6.3/5.3             3.98/3.7      ~16                30
GSM FR RPE-LTP          13                  3.7 (ave)     5                  20
GSM EFR                 13                  4             14                 20
GSM HR VSELP            6.3                 ~3.4          14                 20
IS-54 VSELP             8                   3.5           14                 20
IS-641 EFR              8                   3.8           14                 20
Conexant eX-CELP SMV    8.55/4/2/0.8        ~4.1 (8.55)   ~20                20
IS-96 QCELP             1.2/2.4/4.8/9.6     3.33 (9.6)    15                 20
IS-127 EVRC             1.2/4.8/9.6         ~3.8 (9.6)    20                 20
PDC VSELP               6.3                 3.5           14                 20
PDC PCI-CELP            3.45                ~3.4          ~48                40
FS 1015 – LPC 10e       2.4                 2.3           7                  22.5
FS 1016 – CELP 4.8      4.8                 3.2           16                 30
MELP                    2.4                 3.2           ~30                22.5
Inmarsat-B APC          9.6/12.8            ~3.1/3.4      10                 20
Inmarsat-M IMBE         6.3                 3.4           ~13                20
Research in Speech and Audio Coding at Arizona State

Speech Coding
S. Ahmadi and A. Spanias, “Algorithms for Low-bit rate sinusoidal coding,” Speech Communications, Vol. 34, pp. 369-390, June 2001. Research funded by Intel Corp.
Perceptual LPC, ICASSP 05, Atti Venkatraman, NSF
Audio Coding
Selection of sinusoids based on perceptual criteria: T. Painter and A. S. Spanias, “Sinusoidal Analysis-Synthesis of Audio using Perceptual Criteria,” Proc. IEEE International Symposium on Circuits and Systems (ISCAS-02), Phoenix, May 2002. Research funded by Intel Corporation.
Enhancing the Bandwidth of Speech Coders, ISCAS05, Visar Berisha, NSF

Award Winning Paper: 2002 Donald G. Fink Prize Paper Award by IEEE Board of Directors
T. Painter and A. S. Spanias, “Perceptual coding of digital audio,” Proc. of the IEEE, vol. 88, no. 4, pp. 451-513, Apr. 2000. It was recognized by the IEEE Board of Directors with the prestigious 2002 IEEE Donald G. Fink Prize Paper Award. (A. Spanias principal investigator and Ph.D. advisor of T. Painter)

