
SPEECH SIGNAL PROCESSING KERALA UNIVERSITY M-TECH 1ST SEMESTER

lizytvm@yahoo.com +919495123331 Lizy Abraham Assistant Professor Department of ECE LBS Institute of Technology for Women (A Govt. of Kerala Undertaking) Poojappura Trivandrum -695012 Kerala, India
1

SYLLABUS TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification).
Speech Analysis: Short-time speech analysis. Time-domain analysis (short-time energy, short-time zero-crossing rate, ACF). Frequency-domain analysis (filter banks, STFT, spectrogram, formant estimation and analysis). Cepstral analysis.
Parametric representation of speech: AR model, ARMA model. LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form). LSF, LAR, MFCC, sinusoidal model, GMM, HMM.
Speech coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector-quantization-based coders, CELP.
Speech processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.

REFERENCE
1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, IEEE Press, 2nd edition, 1999; ISBN 0780334493.
2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999; ISBN 0471351547.
3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 1st edition; ISBN 013242942X.
6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999; ISBN 0471349593.

For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, of which any five shall be answered; 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) or a term project.
3

Speech processing means the processing of discrete-time speech signals

Algorithms (Programming)

Psychoacoustics Room acoustics Speech production

Speech Processing Signal Processing

Acoustics

Information Theory

Phonetics

Fourier transforms Discrete time filters AR(MA) models

Statistical SP Stochastic models

Entropy Communication theory Rate-distortion theory

HOW IS SPEECH PRODUCED ?

Speech can be defined as an acoustic pressure signal that is articulated in the vocal tract

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.
8

This air flow is referred to as the excitation signal. The excitation causes the vocal cords to vibrate and propagates energy to the oral and nasal openings, which play a major role in shaping the sound produced. Vocal tract components: Oral tract (from the vocal cords to the lips); Nasal tract (from the velum to the nostrils).
9

10

11

Larynx: the source of speech. Vocal cords (folds): the two folds of tissue in the larynx; they can open and shut like a pair of fans. Glottis: the gap between the vocal cords. As air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).
12

SPEECH PRODUCTION MODEL

13

Places of articulation

labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal

14

Classes of speech sounds Voiced sound


The vocal cords vibrate open and closed, producing quasi-periodic pulses of air. The rate of opening and closing determines the pitch.

Unvoiced sounds
Forcing air at high velocities through a constriction produces noise-like turbulence. Such sounds show little long-term periodicity, though short-term correlations are still present. E.g. /s/, /f/

Plosive sounds
A complete closure is formed in the vocal tract; air pressure builds up behind it and is released suddenly. E.g. /b/, /p/
15

Speech Model

16

SPEECH SOUNDS
Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context-dependent realizations of a phoneme.
17

PHONEME HIERARCHY
Speech sounds are language dependent; there are about 50 in English.

Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Diphthongs: ay, ey, oy, aw
Consonants:
Glides: w, y
Plosives: p, b, t, d, k, g
Lateral liquid: l
Retroflex liquid: r
Nasals: m, n, ng
Fricatives: f, v, th, dh, s, z, sh, zh, h

18

19

20

Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.

21

22

Vowel chart: vowels arranged by tongue height (high, mid, low) and tongue position (front, center, back); e.g. /i/ is high front, /u/ is high back, /e/ is mid front.

24

SPEECH WAVEFORM CHARACTERISTICS

Loudness Voiced/Unvoiced. Pitch.


Fundamental frequency.

Spectral envelope.
Formants.

25

Acoustic Characteristics of speech Pitch:


The signal within each voiced interval is periodic; the period T is called the pitch period. The pitch depends on the vowel being spoken and changes over time (T is about 70 samples in this example). f0 = 1/T is the fundamental frequency (not to be confused with the formant frequencies).

26

FORMANTS
Formants can be recognized in the frequency content of the signal segment.
Formants are best described as high energy peaks in the frequency spectrum of speech sound.

27

The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.
28

Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form:

29

30

A detailed acoustic theory must consider the effects of the following:
Time variation of the vocal tract shape
Losses due to heat conduction and viscous friction at the vocal tract walls
Softness of the vocal tract walls
Radiation of sound at the lips
Nasal coupling
Excitation of sound in the vocal tract
Let us begin by considering the simple case of a lossless tube:

31

28 December 2012

MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT


We can represent the vocal tract as a concatenation of N lossless tubes with areas {A_k} and equal length Δx = l/N. The wave propagation time through each tube is τ = Δx/c = l/(Nc).

32

33

Consider an N-tube model as in the previous figure. Each tube has length l_k and cross-sectional area A_k. Assume:
No losses
Planar wave propagation

The wave equations for section k hold over 0 <= x <= l_k:

34

35


SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL


Boundary conditions follow from the physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system.

At the junction between the kth and (k+1)st tubes we have:

36


ANALOGY WITH ELECTRICAL CIRCUIT TRANSMISSION LINE

37


PROPAGATION OF SOUND IN A UNIFORM TUBE

The vocal tract transfer function of volume velocities is

38


PROPAGATION OF SOUND IN A UNIFORM TUBE


Using the boundary conditions U(0,s) = U_G(s) and P(-l,s) = 0

(derivation in the Quatieri text, pages 122-125), the poles of the transfer function T(jΩ) are where cos(Ωl/c) = 0.

Quatieri, pages 119-124: the derivation of Eqn. 4.18 is important.


39


PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT)


For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, ...

The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies acquire finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the wavelength of the ith natural frequency.

40


UNIFORM TUBE MODEL


Example
Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz, and compare them with those of a tube of length l = 17.5 cm.

f = k c / (4 l), k = 1, 3, 5, ...
f = k (350 / (4 x 0.35)) = 250 k
f = 250, 750, 1250, ... Hz
41


UNIFORM TUBE MODEL


For the 17.5 cm tube:

f = k c / (4 l) = k (350 / (4 x 0.175)) = 500 k
f = 500, 1500, 2500, ... Hz
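The two worked examples above can be checked with a few lines of NumPy (the function name and defaults below are illustrative, not from the slides):

```python
import numpy as np

def tube_resonances(length_m, c=350.0, n=3):
    """Resonances of a uniform lossless tube closed at one end and
    open at the other: f_k = k * c / (4 * l), k = 1, 3, 5, ..."""
    ks = np.arange(1, 2 * n, 2)          # odd multiples 1, 3, 5, ...
    return ks * c / (4.0 * length_m)

print(tube_resonances(0.35))    # 35 cm tube: 250, 750, 1250 Hz
print(tube_resonances(0.175))   # 17.5 cm tube: 500, 1500, 2500 Hz
```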

42

43

APPROXIMATING VOCAL TRACT SHAPES

44

45

VOWELS
Modeled as a tube closed at one end and open at the other:
the closure is a membrane with a slit in it
the tube has uniform cross-sectional area
the membrane represents the source of energy (the vocal folds)
the energy travels through the tube; the tube generates no energy on its own
the tube represents an important class of resonators, with the odd-quarter-wavelength relationship Fn = (2n - 1)c/4l

VOWELS
Filter characteristics for vowels:
the vocal tract is a dynamic, frequency-dependent filter
it has, theoretically, an infinite number of resonances
each resonance has a center frequency, an amplitude, and a bandwidth
for speech, these resonances are called formants
formants are numbered in succession from the lowest: F1, F2, F3, etc.

Fricatives are modeled as a tube with a very severe constriction. The air exiting the constriction is turbulent; because of the turbulence, there is no periodicity unless the fricative is accompanied by voicing.

When a fricative constriction is tapered, the back cavity is involved; this resembles a tube closed at both ends, with resonances Fn = nc/2l. Such a situation occurs primarily in articulation disorders.

Introduction to Digital Speech Processing (Rabiner & Schafer), pages 20-23.


51

52

Rabiner & Schafer, pages 98-105.

53

54


SOUND SOURCE: VOCAL FOLD VIBRATION


Modeled as a volume velocity source at glottis, UG(j )

55

56

SHORT-TIME SPEECH ANALYSIS


Segments (or frames, or vectors) are typically of length 20 ms. Over such a short interval, speech characteristics are approximately constant, which allows relatively simple modeling.

Often overlapping segments are extracted.

57

SHORT-TIME ANALYSIS OF SPEECH

58

the system is an all-pole system with system function of the form:

For all-pole linear systems, the input and output are related by a difference equation of the form:

59

60

The operator T{ } defines the nature of the short-time analysis function, and w[n - m] represents a time-shifted window sequence

61

62

SHORT-TIME ENERGY
The short-time energy is simple to compute, and useful for estimating properties of the excitation function in the model.

In this case the operator T{ } simply squares the windowed samples.
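A minimal NumPy sketch of this computation (the hop of one sample, the window choice, and the toy two-part signal are illustrative assumptions):

```python
import numpy as np

def short_time_energy(x, win):
    """E[n] = sum over m of (x[m] w[n-m])^2, evaluated at each
    window position (hop of one sample here for simplicity)."""
    L = len(win)
    return np.array([np.sum((x[n:n + L] * win) ** 2)
                     for n in range(len(x) - L + 1)])

# toy signal: low-amplitude "unvoiced" part followed by a louder "voiced" part
fs = 8000
t = np.arange(2048) / fs
x = np.concatenate([0.05 * np.random.randn(1024),
                    np.sin(2 * np.pi * 120 * t[:1024])])
E = short_time_energy(x, np.hamming(256))
print(E[:100].mean() < E[-100:].mean())  # energy rises in the "voiced" part
```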

63

SHORT-TIME ZERO-CROSSING RATE


A weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to:

64

Since |sgn{x[m]} - sgn{x[m - 1]}| is equal to 2 if x[m] and x[m - 1] have different algebraic signs and 0 if they have the same sign, half of this quantity registers one count per zero-crossing; the result is therefore a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n - m].
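The zero-crossing operator can be sketched as follows (the normalized rectangular window and the toy voiced/unvoiced signal are illustrative assumptions):

```python
import numpy as np

def short_time_zcr(x, win):
    """Weighted count of sign alternations within the window:
    (1/2) sum over m of |sgn(x[m]) - sgn(x[m-1])| w[n-m]."""
    s = np.sign(x)
    s[s == 0] = 1                     # treat exact zeros as positive
    alt = 0.5 * np.abs(np.diff(s))    # 1 at each zero-crossing, else 0
    L = len(win)
    return np.array([np.sum(alt[n:n + L] * win)
                     for n in range(len(alt) - L + 1)])

fs = 8000
n = np.arange(4096)
voiced = np.sin(2 * np.pi * 100 * n[:2048] / fs)   # 100 Hz tone
noise = np.random.randn(2048)                      # noise-like "unvoiced"
z = short_time_zcr(np.concatenate([voiced, noise]), np.ones(400) / 400)
print(z[:100].mean() < z[-100:].mean())  # ZCR is higher in the noisy part
```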

65

The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.

66

Short time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore, they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n in jumps of more than one sample

67

During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.

68

SHORT-TIME AUTOCORRELATION FUNCTION (STACF)


The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. The STACF is defined as the deterministic autocorrelation function of the sequence x_n[m] = x[m] w[n - m] that is selected by the window shifted to time n, i.e.,
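A sketch of the STACF of one windowed segment; the window alignment (starting at sample n) and the synthetic 100 Hz test tone are assumptions for illustration:

```python
import numpy as np

def stacf(x, n, win, max_lag):
    """Short-time autocorrelation of the windowed segment
    x_n[m] = x[m] w[n - m] (window aligned to start at sample n)."""
    seg = x[n:n + len(win)] * win
    return np.array([np.sum(seg[:len(seg) - k] * seg[k:])
                     for k in range(max_lag + 1)])

fs = 8000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 100 * t)            # 100 Hz -> period of 80 samples
r = stacf(x, 0, np.hamming(800), max_lag=200)
peak = np.argmax(r[40:]) + 40              # first strong peak past lag 0
print(peak)                                # near 80 samples, i.e. the period
```

The lag of the first strong peak away from lag 0 estimates the pitch period, which is exactly how the STACF is used for periodicity detection.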

69

70

e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that

the autocorrelation function of s[n] = e[n] * h[n] is the convolution of the autocorrelation functions of e[n] and h[n].

71

72

SHORT-TIME FOURIER TRANSFORM (STFT)


The expression for the discrete-time STFT at time n is given below, where w[n] is assumed to be non-zero only in the interval [0, N_w - 1] and is referred to as the analysis window, or sometimes the analysis filter
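A frame-based sketch of the discrete-time STFT (the hop size and FFT length are illustrative choices, not fixed by the definition):

```python
import numpy as np

def stft(x, win, hop, nfft):
    """Discrete-time STFT: DFT of successive windowed frames,
    frame r starting at sample r*hop."""
    frames = [x[r:r + len(win)] * win
              for r in range(0, len(x) - len(win) + 1, hop)]
    return np.fft.rfft(np.array(frames), n=nfft, axis=1)

fs = 8000
t = np.arange(8000) / fs
x = np.sin(2 * np.pi * 1000 * t)                  # 1 kHz tone
X = stft(x, np.hamming(256), hop=128, nfft=512)
k = np.argmax(np.abs(X[10]))                      # dominant bin of one frame
print(k * fs / 512)                               # close to 1000 Hz
```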

73

74

FILTERING VIEW

75

76

77

SHORT TIME SYNTHESIS


problem of obtaining a sequence back from its discrete-time STFT.

This equation represents a synthesis equation for the discrete-time STFT.

78

FILTER BANK SUMMATION (FBS) METHOD


The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y[n] satisfying the following equation:

79

80

81

82

83

OVERLAP-ADD METHOD
Just as the FBS method was motivated by the filtering view of the STFT, the OLA method is motivated by the Fourier transform view. In this method, for each fixed time we take the inverse DFT of the corresponding frequency function. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.
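The overlap-add procedure can be sketched as below. Normalizing by the summed shifted windows is one common way to compensate for the analysis window; this is a sketch, not the text's exact formulation:

```python
import numpy as np

def ola_synthesis(X, win, hop, nfft):
    """Overlap-add resynthesis from a decimated STFT: inverse-DFT each
    frame, add it in at its analysis position, then normalize by the
    sum of the shifted windows."""
    n_frames = X.shape[0]
    L = len(win)
    out = np.zeros((n_frames - 1) * hop + L)
    norm = np.zeros_like(out)
    for r in range(n_frames):
        frame = np.fft.irfft(X[r], n=nfft)[:L]
        out[r * hop:r * hop + L] += frame
        norm[r * hop:r * hop + L] += win
    return out / np.maximum(norm, 1e-12)

# round trip: analyze, then resynthesize
win, hop, nfft = np.hamming(256), 64, 512
x = np.random.randn(2048)
frames = [x[r:r + 256] * win for r in range(0, len(x) - 256 + 1, hop)]
X = np.fft.rfft(np.array(frames), n=nfft, axis=1)
y = ola_synthesis(X, win, hop, nfft)
err = np.max(np.abs(y[256:1500] - x[256:1500]))   # interior samples
print(err < 1e-8)                                 # reconstruction is exact
```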
84

Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by

85

86

Furthermore, if the discrete STFT had been decimated in time by a factor L, it can be similarly shown that if the analysis window satisfies

87

88

DESIGN OF DIGITAL FILTER BANKS


Rabiner & Schafer, pages 282-297.

89

90

91

92

USING IIR FILTER

93

94

USING FIR FILTER

95

96

97

98

99

100

FILTER BANK ANALYSIS AND SYNTHESIS

101

102

103

FBS synthesis results in multiple copies of the input:

104

PHASE VOCODER
The Fourier series is computed over a sliding window of a single pitch-period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.

105

106

107

This can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency. The STFT of a continuous-time signal is,

108

109

where is an initial condition. The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with generally a time-varying amplitude and frequency modulation. An alternative expression is,

110

which is the time-domain counterpart to the frequency-domain phase derivative.

111

we can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.

112

113

114

115

116

117

SPEECH MODIFICATION

118

119

120

121

122

HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS


Use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. The cepstrum of a discrete-time signal is defined as,

123

124

That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
125

The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.

126

THE SHORT-TIME CEPSTRUM


The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform.

127

128

RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are all inside the unit circle. An example is the impulse response of an all-pole vocal tract model with system function
129

In this case, all the poles ck must be inside the unit circle for stability of the system.

130

SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (Rabiner & Schafer, page 63)

131

The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high quefrency components would correspond to the more rapid fluctuations of the log spectrum.

132

The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic voiced speech. This periodic structure in the log spectrum manifests itself as a cepstrum peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as the cepstrum. The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure; as a result, there is no strong peak indicating periodicity as in the voiced case.
133

These slowly varying log spectra clearly retain the general spectral shape with peaks corresponding to the formant resonance structure for the segment of speech under analysis.

134

APPLICATION TO PITCH DETECTION


The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec).

135

For positions 1 through 5, the window includes only unvoiced speech; for positions 6 and 7, the signal within the window is partly voiced and partly unvoiced; for positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure, while the spectra for voiced segments have periodic ripples due to the harmonic structure of the quasi-periodic voiced speech.
136

137

The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.
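A sketch of cepstral pitch detection as described above: locate the cepstral peak in the expected pitch-period quefrency range. The frame length, search range, and synthetic harmonic test signal are assumptions for illustration:

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Pitch estimate from the peak of the real cepstrum within the
    expected pitch-period (quefrency) range."""
    w = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(w, n=2 * len(frame)))
    ceps = np.fft.irfft(np.log(spec + 1e-12))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = np.argmax(ceps[qmin:qmax]) + qmin     # quefrency of the peak
    return fs / q

fs = 8000
t = np.arange(400) / fs
# crude voiced frame: harmonic-rich 125 Hz signal (period of 64 samples)
x = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 8))
print(cepstral_pitch(x, fs))                  # close to 125 Hz
```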
138

MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)


The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT X_n[k] for analysis time n. The DFT values are then grouped together into critical bands and weighted by triangular weighting functions.
139

The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n is defined for r = 1, 2, ..., R as

140

141

is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfccn[m], i.e.,
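A compact sketch of the filterbank-plus-DCT computation described above. The mel warping formula, the triangular filter construction, and the coefficient count used here are common conventions, not necessarily those of the figure:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=22, n_ceps=13, nfft=512):
    """MFCC sketch: |DFT| -> triangular mel filterbank -> log -> DCT-II."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=nfft))
    # filter edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, len(mag)))
    for r in range(n_filters):
        lo, ctr, hi = bins[r], bins[r + 1], bins[r + 2]
        fb[r, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[r, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    logmel = np.log(fb @ mag + 1e-12)
    # DCT-II of the log filter energies gives the cepstral coefficients
    m = np.arange(n_ceps)[:, None]
    r = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (2 * r + 1) / (2 * n_filters))
    return dct @ logmel

fs = 8000
t = np.arange(400) / fs
c = mfcc(np.sin(2 * np.pi * 440 * t), fs)
print(c.shape)   # one 13-coefficient vector per frame
```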

142

143

The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra differ, but they have in common peaks at the formant resonances. At higher frequencies, the reconstructed mel spectrum shows more smoothing due to the structure of the filter bank.

144

THE SPEECH SPECTROGRAM


The spectrogram is simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of the STFT magnitude, with the plot axes labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), where T is the sampling period of the discrete-time signal x[n] = x_a(nT).
145

In order to make the display smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than L. Such a function of two variables can be plotted on a two-dimensional surface as either a grayscale or a color-mapped image. The bars on the right calibrate the color map (in dB).

146

147

If the analysis window is short, the spectrogram is called a wide-band spectrogram, characterized by good time resolution and poor frequency resolution. When the window is long, the spectrogram is a narrow-band spectrogram, characterized by good frequency resolution and poor time resolution.
148

THE SPECTROGRAM

A classic analysis tool.


Consists of DFTs of overlapping, windowed frames.

Displays the distribution of energy in time and frequency.


10 log10 |X_m(f)|^2 is typically displayed.

149

THE SPECTROGRAM CONT.

150

151

Note the three broad peaks in the spectrum slice at time tr = 430 ms, and observe that similar slices would be obtained at other times around tr = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.
152

The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations since several periods are included in the window.
153

SHORT TIME ACF


/m/ /ow/ /s/

ACF

154

CEPSTRUM
SPEECH WAVE (S) = EXCITATION (E) . FILTER (H), where H is the vocal tract filter and E is the glottal excitation from the vocal cords (glottis).

http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
155

CEPSTRAL ANALYSIS
Signal s = convolution (*) of glottal excitation e and vocal tract filter h:
s(n) = e(n) * h(n), where n is the time index.

After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}: convolution (*) becomes multiplication (.), and n (time) becomes w (frequency):

S(w) = E(w) . H(w)

Taking the magnitude of the spectrum, |S(w)| = |E(w)| . |H(w)|, so
log10 |S(w)| = log10 |E(w)| + log10 |H(w)|
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
156

CEPSTRUM
C(n) = IDFT[ log10 |S(w)| ] = IDFT[ log10 |E(w)| + log10 |H(w)| ]

Pipeline: s(n) -> windowing -> x(n) -> DFT -> X(w) -> log|X(w)| -> IDFT -> c(n)

n = time index, w = frequency, IDFT = inverse discrete Fourier transform

In c(n), the contributions of the excitation e and the vocal tract filter h appear at different quefrency positions. Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis.
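The pipeline above, plus low-quefrency liftering to keep only the vocal tract contribution, can be sketched as follows (the lifter cutoff is an illustrative assumption):

```python
import numpy as np

def real_cepstrum(x):
    """c(n) = IDFT( log |DFT(x)| ), the real-cepstrum pipeline above."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

def smooth_envelope(frame, n_lifter=30):
    """Keep only the low-quefrency part (vocal tract component) and
    return to the log-spectral domain: a homomorphically smoothed
    log-magnitude spectrum."""
    c = real_cepstrum(frame)
    lifter = np.zeros_like(c)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0          # keep the symmetric half too
    return np.fft.rfft(c * lifter).real   # smoothed log |X(w)|

fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(512)
env = smooth_envelope(frame)
print(env.shape)    # one smoothed log-spectrum value per DFT bin
```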

157

EXAMPLE OF CEPSTRUM
sampling frequency 22.05 kHz

158

SUB BAND CODING

159

The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
160

Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. The bands can be further subdivided from high to low by splitting the full band into two, then the resulting lower band into two, and so on.
161

This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect-reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
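A minimal two-band perfect-reconstruction example using the length-2 Haar filter pair, with one extra octave split of the low band. This is a toy sketch; practical QMF banks use longer filters:

```python
import numpy as np

def haar_analysis(x):
    """Two-band split with the length-2 Haar QMF pair, decimated by 2."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # lowpass (average) band
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # highpass (difference) band
    return lo, hi

def haar_synthesis(lo, hi):
    """Perfect-reconstruction synthesis for the Haar pair."""
    y = np.empty(2 * len(lo))
    y[0::2] = (lo + hi) / np.sqrt(2)
    y[1::2] = (lo - hi) / np.sqrt(2)
    return y

x = np.random.randn(1024)
lo, hi = haar_analysis(x)
lo2, hi2 = haar_analysis(lo)          # octave split: halve the low band again
y = haar_synthesis(haar_synthesis(lo2, hi2), hi)
print(np.max(np.abs(y - x)) < 1e-12)  # reconstruction is exact
```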

162

163

164

LINEAR PREDICTION (INTRODUCTION):


The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

y^(n) = sum_{j=0..q} b(j) x(n-j) - sum_{i=1..p} a(i) y(n-i)

The factors a(i) and b(j) are called predictor coefficients.

165

LINEAR PREDICTION (INTRODUCTION):


Many systems of interest to us are describable by a linear, constant-coefficient difference equation:

sum_{i=0..p} a(i) y(n-i) = sum_{j=0..q} b(j) x(n-j)

If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then

N(z) = sum_{j=0..q} b(j) z^(-j)   and   D(z) = sum_{i=0..p} a(i) z^(-i)

Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).

166

LINEAR PREDICTION (TYPES OF SYSTEM MODEL):


There are two important variants:

All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant.

All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity.

The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.

167

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


Given a zero-mean signal y(n), in the AR model:

y^(n) = - sum_{i=1..p} a(i) y(n-i)

The error is:

e(n) = y(n) - y^(n) = sum_{i=0..p} a(i) y(n-i),   with a(0) = 1

To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those that make the error orthogonal to the samples y(n-1), y(n-2), ..., y(n-p).

168

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


Thus we require that

< y(n-j) e(n) > = 0   for j = 1, 2, ..., p

or,

< y(n-j) sum_{i=0..p} a(i) y(n-i) > = 0

Interchanging the operations of averaging and summing, and representing < > by summing over n, we have

sum_{i=0..p} a(i) sum_n y(n-i) y(n-j) = 0,   j = 1, ..., p

The required predictor coefficients are found by solving these equations.

169

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


The orthogonality principle also states that the resulting minimum error is given by

E = < e^2(n) > = < y(n) e(n) >

or,

sum_{i=0..p} a(i) sum_n y(n-i) y(n) = E

Minimizing the error over all time gives:

sum_{i=0..p} a(i) r_|i-j| = 0,   j = 1, 2, ..., p

sum_{i=0..p} a(i) r_i = E

where

r_i = sum_{n=-inf..inf} y(n) y(n-i)
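The normal equations can be solved directly. This sketch uses the autocorrelation method with a plain linear solve rather than the Levinson-Durbin recursion, following the a(0) = 1 sign convention used above; the AR(2) test process is an illustrative assumption:

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """Autocorrelation method: with a(0) = 1, solve
    sum_i a(i) r_|i-j| = -r_j for a(1)..a(p), then compute the
    minimum error E = sum_i a(i) r_i."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])        # a(1)..a(p)
    E = r[0] + np.dot(a, r[1:])           # minimum prediction error
    return a, E

# AR(2) test process: y(n) = 0.9 y(n-1) - 0.5 y(n-2) + e(n)
rng = np.random.default_rng(0)
y = np.zeros(20000)
e = rng.standard_normal(20000)
for n in range(2, len(y)):
    y[n] = 0.9 * y[n - 1] - 0.5 * y[n - 2] + e[n]
a, E = lpc_autocorrelation(y, 2)
print(a)     # close to [-0.9, 0.5] in this sign convention
```

In practice the Toeplitz structure of R is exploited by the Levinson-Durbin recursion, which solves the same equations in O(p^2) operations.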

170

LINEAR PREDICTION (APPLICATIONS):


Autocorrelation matching: we have a signal y(n) with known autocorrelation r_yy(n). We model it with the AR system shown below, driven by e(n) through the prediction-error filter 1 - A(z):

H(z) = sigma / (1 - sum_{i=1..p} a_i z^(-i))

171

LINEAR PREDICTION (ORDER OF LINEAR PREDICTION):


The choice of predictor order depends on the analysis bandwidth. The rule of thumb is:

p = 2 BW / 1000 + c

where BW is the bandwidth in Hz and c is a small constant. For a normal vocal tract, there is an average of about one formant per kilohertz of bandwidth. One formant requires two complex-conjugate poles; hence every formant requires two predictor coefficients, or two coefficients per kilohertz of bandwidth.

172

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):


True model: a voiced/unvoiced (V/U) switch selects the excitation source. For voiced speech, an impulse generator driven by the pitch period feeds a glottal filter G(z); for unvoiced speech, an uncorrelated noise generator is used. The selected source, scaled by a gain, forms the volume velocity u(n), which passes through the vocal tract filter H(z) and the radiation filter R(z) to produce the speech signal s(n).
173

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):


Using LP analysis: the same voiced/unvoiced switch selects between a pitch-driven impulse generator and a white-noise generator; the selected excitation, scaled by a gain estimate, drives a single all-pole (AR) filter H(z) that produces the speech signal s(n).
