
SPEECH SIGNAL PROCESSING KERALA UNIVERSITY M-TECH 1ST SEMESTER

lizytvm@yahoo.com +919495123331 Lizy Abraham Assistant Professor Department of ECE LBS Institute of Technology for Women (A Govt. of Kerala Undertaking) Poojappura Trivandrum -695012 Kerala, India
1

SYLLABUS TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification).
Speech Analysis: Short-time speech analysis. Time-domain analysis (short-time energy, short-time zero-crossing rate, ACF). Frequency-domain analysis (filter banks, STFT, spectrogram, formant estimation and analysis). Cepstral analysis.
Parametric representation of speech: AR model, ARMA model. LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form). LSF, LAR, MFCC, sinusoidal model, GMM, HMM.
Speech coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector-quantization-based coders, CELP.
Speech processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.

REFERENCE
1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, IEEE Press, 2nd edition, 1999; ISBN 0780334493.
2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999; ISBN 0471351547.
3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 1st edition; ISBN 013242942X.
6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999; ISBN 0471349593.

For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, of which any five shall be answered; 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) or a term project.
3

Speech processing means the processing of discrete-time speech signals

Algorithms (Programming)

Psychoacoustics Room acoustics Speech production

Speech Processing Signal Processing

Acoustics

Information Theory

Phonetics

Fourier transforms Discrete time filters AR(MA) models

Statistical SP Stochastic models

Entropy Communication theory Rate-distortion theory

HOW IS SPEECH PRODUCED ?

Speech can be defined as an acoustic pressure signal that is articulated in the vocal tract

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.
8

This air flow is referred to as the excitation signal. The excitation causes the vocal cords to vibrate and propagates energy to the oral and nasal openings, which play a major role in shaping the sound produced. Vocal tract components: Oral tract (from the vocal cords to the lips); Nasal tract (from the velum to the nostrils).
9

10

11

Larynx: the source of speech. Vocal cords (folds): the two folds of tissue in the larynx; they can open and shut like a pair of fans. Glottis: the gap between the vocal cords. As air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).
12

SPEECH PRODUCTION MODEL

13

Places of articulation

labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal

14

Classes of speech sounds Voiced sound


The vocal cords vibrate open and closed, producing quasi-periodic pulses of air. The rate of opening and closing determines the pitch.

Unvoiced sounds
Forcing air at high velocities through a constriction produces noise-like turbulence. Such sounds show little long-term periodicity, though short-term correlations are still present. E.g. /s/, /f/

Plosive sounds
A complete closure is formed in the vocal tract; air pressure builds up behind it and is released suddenly. E.g. /b/, /p/
15

Speech Model

16

SPEECH SOUNDS
Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context-dependent realizations of a phoneme.
17

PHONEME HIERARCHY
Speech sounds are language dependent; there are about 50 in English.

Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Diphthongs: ay, ey, oy, aw
Consonants:
Glides: w, y
Plosives: p, b, t, d, k, g
Lateral liquid: l
Retroflex liquid: r
Nasals: m, n, ng
Fricatives: f, v, th, dh, s, z, sh, zh, h

18

19

20

Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.

21

22

Vowel chart: vowels arranged by tongue height (high, mid, low) and tongue position (front, center, back); e.g. /i/ is high front, /u/ is high back, /e/ is mid front.

24

SPEECH WAVEFORM CHARACTERISTICS

Loudness Voiced/Unvoiced. Pitch.


Fundamental frequency.

Spectral envelope.
Formants.

25

Acoustic Characteristics of speech Pitch:


The signal within each voiced interval is periodic; the period T is called the pitch period. The pitch depends on the vowel being spoken and changes over time (T is about 70 samples in this example). f0 = 1/T is the fundamental frequency (not to be confused with the formant frequencies).

26

FORMANTS
Formants can be recognized in the frequency content of the signal segment.
Formants are best described as high energy peaks in the frequency spectrum of speech sound.

27

The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.
28

Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form:

29

30

A detailed acoustic theory must consider the effects of the following:
Time variation of the vocal tract shape
Losses due to heat conduction and viscous friction at the vocal tract walls
Softness of the vocal tract walls
Radiation of sound at the lips
Nasal coupling
Excitation of sound in the vocal tract
Let us begin by considering the simple case of a lossless tube:

31

28 December 2012

MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT


We can represent the vocal tract as a concatenation of N lossless tubes with areas {A_k} and equal length Δx = l/N. The wave propagation time through each tube is τ = Δx/c = l/(Nc).

32

33

Consider an N-tube model as in the previous figure. Each tube has length l_k and cross-sectional area A_k. Assume:
No losses
Planar wave propagation

The wave equations for section k hold over 0 <= x <= l_k:

34

35


SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL


Boundary conditions follow from the physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system.

At the junction between the kth and (k+1)st tubes we have:

36


ANALOGY WITH ELECTRICAL CIRCUIT TRANSMISSION LINE

37


PROPAGATION OF SOUND IN A UNIFORM TUBE

The vocal tract transfer function of volume velocities is

38


PROPAGATION OF SOUND IN A UNIFORM TUBE


Using the boundary conditions U(0,s) = U_G(s) and P(-l,s) = 0

(derivation in the Quatieri text, pages 122-125), the poles of the transfer function T(jΩ) are where cos(Ωl/c) = 0.

Quatieri, pages 119-124: the derivation of Eqn. 4.18 is important.


39


PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT)


For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, ...

The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies acquire finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the wavelength of the ith natural frequency.

40


UNIFORM TUBE MODEL


Example
Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz, and compare them with those of a tube of length l = 17.5 cm.

f = k c / (4 l), k = 1, 3, 5, ...
f = k (350 / (4 x 0.35)) = 250 k
f = 250, 750, 1250, ... Hz
41


UNIFORM TUBE MODEL


For the 17.5 cm tube:

f = k c / (4 l) = k (350 / (4 x 0.175)) = 500 k
f = 500, 1500, 2500, ... Hz
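The two worked examples above can be checked with a few lines of NumPy (the function name and defaults below are illustrative, not from the slides):

```python
import numpy as np

def tube_resonances(length_m, c=350.0, n=3):
    """Resonances of a uniform lossless tube closed at one end and
    open at the other: f_k = k * c / (4 * l), k = 1, 3, 5, ..."""
    ks = np.arange(1, 2 * n, 2)          # odd multiples 1, 3, 5, ...
    return ks * c / (4.0 * length_m)

print(tube_resonances(0.35))    # 35 cm tube: 250, 750, 1250 Hz
print(tube_resonances(0.175))   # 17.5 cm tube: 500, 1500, 2500 Hz
```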

42

43

APPROXIMATING VOCAL TRACT SHAPES

44

45

VOWELS
Modeled as a tube closed at one end and open at the other:
the closure is a membrane with a slit in it
the tube has uniform cross-sectional area
the membrane represents the source of energy (the vocal folds)
the energy travels through the tube; the tube generates no energy on its own
the tube represents an important class of resonators, with the odd-quarter-wavelength relationship Fn = (2n - 1)c/4l

VOWELS
Filter characteristics for vowels:
the vocal tract is a dynamic, frequency-dependent filter
it has, theoretically, an infinite number of resonances
each resonance has a center frequency, an amplitude, and a bandwidth
for speech, these resonances are called formants
formants are numbered in succession from the lowest: F1, F2, F3, etc.

Fricatives are modeled as a tube with a very severe constriction. The air exiting the constriction is turbulent; because of the turbulence, there is no periodicity unless the fricative is accompanied by voicing.

When a fricative constriction is tapered, the back cavity is involved; this resembles a tube closed at both ends, with resonances Fn = nc/2l. Such a situation occurs primarily in articulation disorders.

Introduction to Digital Speech Processing (Rabiner & Schafer), pages 20-23.


51

52

Rabiner & Schafer, pages 98-105.

53

54


SOUND SOURCE: VOCAL FOLD VIBRATION


Modeled as a volume velocity source at glottis, UG(j )

55

56

SHORT-TIME SPEECH ANALYSIS


Segments (or frames, or vectors) are typically of length 20 ms. Over such a short interval, speech characteristics are approximately constant, which allows relatively simple modeling.

Often overlapping segments are extracted.

57

SHORT-TIME ANALYSIS OF SPEECH

58

the system is an all-pole system with system function of the form:

For all-pole linear systems, the input and output are related by a difference equation of the form:

59

60

The operator T{ } defines the nature of the short-time analysis function, and w[n - m] represents a time-shifted window sequence

61

62

SHORT-TIME ENERGY
The short-time energy is simple to compute, and useful for estimating properties of the excitation function in the model.

In this case the operator T{ } simply squares the windowed samples.
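A minimal NumPy sketch of this computation (the hop of one sample, the window choice, and the toy two-part signal are illustrative assumptions):

```python
import numpy as np

def short_time_energy(x, win):
    """E[n] = sum over m of (x[m] w[n-m])^2, evaluated at each
    window position (hop of one sample here for simplicity)."""
    L = len(win)
    return np.array([np.sum((x[n:n + L] * win) ** 2)
                     for n in range(len(x) - L + 1)])

# toy signal: low-amplitude "unvoiced" part followed by a louder "voiced" part
fs = 8000
t = np.arange(2048) / fs
x = np.concatenate([0.05 * np.random.randn(1024),
                    np.sin(2 * np.pi * 120 * t[:1024])])
E = short_time_energy(x, np.hamming(256))
print(E[:100].mean() < E[-100:].mean())  # energy rises in the "voiced" part
```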

63

SHORT-TIME ZERO-CROSSING RATE


A weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to:

64

Since |sgn{x[m]} - sgn{x[m - 1]}| is equal to 2 if x[m] and x[m - 1] have different algebraic signs and 0 if they have the same sign, half of this quantity registers one count per zero-crossing; the result is therefore a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n - m].
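The zero-crossing operator can be sketched as follows (the normalized rectangular window and the toy voiced/unvoiced signal are illustrative assumptions):

```python
import numpy as np

def short_time_zcr(x, win):
    """Weighted count of sign alternations within the window:
    (1/2) sum over m of |sgn(x[m]) - sgn(x[m-1])| w[n-m]."""
    s = np.sign(x)
    s[s == 0] = 1                     # treat exact zeros as positive
    alt = 0.5 * np.abs(np.diff(s))    # 1 at each zero-crossing, else 0
    L = len(win)
    return np.array([np.sum(alt[n:n + L] * win)
                     for n in range(len(alt) - L + 1)])

fs = 8000
n = np.arange(4096)
voiced = np.sin(2 * np.pi * 100 * n[:2048] / fs)   # 100 Hz tone
noise = np.random.randn(2048)                      # noise-like "unvoiced"
z = short_time_zcr(np.concatenate([voiced, noise]), np.ones(400) / 400)
print(z[:100].mean() < z[-100:].mean())  # ZCR is higher in the noisy part
```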

65

The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.

66

Short time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore, they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n in jumps of more than one sample

67

During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.

68

SHORT-TIME AUTOCORRELATION FUNCTION (STACF)


The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. The STACF is defined as the deterministic autocorrelation function of the sequence x_n[m] = x[m] w[n - m] that is selected by the window shifted to time n, i.e.,
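A sketch of the STACF of one windowed segment; the window alignment (starting at sample n) and the synthetic 100 Hz test tone are assumptions for illustration:

```python
import numpy as np

def stacf(x, n, win, max_lag):
    """Short-time autocorrelation of the windowed segment
    x_n[m] = x[m] w[n - m] (window aligned to start at sample n)."""
    seg = x[n:n + len(win)] * win
    return np.array([np.sum(seg[:len(seg) - k] * seg[k:])
                     for k in range(max_lag + 1)])

fs = 8000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 100 * t)            # 100 Hz -> period of 80 samples
r = stacf(x, 0, np.hamming(800), max_lag=200)
peak = np.argmax(r[40:]) + 40              # first strong peak past lag 0
print(peak)                                # near 80 samples, i.e. the period
```

The lag of the first strong peak away from lag 0 estimates the pitch period, which is exactly how the STACF is used for periodicity detection.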

69

70

e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that

the autocorrelation function of s[n] = e[n] * h[n] is the convolution of the autocorrelation functions of e[n] and h[n].

71

72

SHORT-TIME FOURIER TRANSFORM (STFT)


The expression for the discrete-time STFT at time n is given below, where w[n] is assumed to be non-zero only in the interval [0, N_w - 1] and is referred to as the analysis window, or sometimes the analysis filter
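A frame-based sketch of the discrete-time STFT (the hop size and FFT length are illustrative choices, not fixed by the definition):

```python
import numpy as np

def stft(x, win, hop, nfft):
    """Discrete-time STFT: DFT of successive windowed frames,
    frame r starting at sample r*hop."""
    frames = [x[r:r + len(win)] * win
              for r in range(0, len(x) - len(win) + 1, hop)]
    return np.fft.rfft(np.array(frames), n=nfft, axis=1)

fs = 8000
t = np.arange(8000) / fs
x = np.sin(2 * np.pi * 1000 * t)                  # 1 kHz tone
X = stft(x, np.hamming(256), hop=128, nfft=512)
k = np.argmax(np.abs(X[10]))                      # dominant bin of one frame
print(k * fs / 512)                               # close to 1000 Hz
```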

73

74

FILTERING VIEW

75

76

77

SHORT TIME SYNTHESIS


problem of obtaining a sequence back from its discrete-time STFT.

This equation represents a synthesis equation for the discrete-time STFT.

78

FILTER BANK SUMMATION (FBS) METHOD


The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y[n] satisfying the following equation:

79

80

81

82

83

OVERLAP-ADD METHOD
Just as the FBS method was motivated by the filtering view of the STFT, the OLA method is motivated by the Fourier transform view. In this method, for each fixed time we take the inverse DFT of the corresponding frequency function. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.
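The overlap-add procedure can be sketched as below. Normalizing by the summed shifted windows is one common way to compensate for the analysis window; this is a sketch, not the text's exact formulation:

```python
import numpy as np

def ola_synthesis(X, win, hop, nfft):
    """Overlap-add resynthesis from a decimated STFT: inverse-DFT each
    frame, add it in at its analysis position, then normalize by the
    sum of the shifted windows."""
    n_frames = X.shape[0]
    L = len(win)
    out = np.zeros((n_frames - 1) * hop + L)
    norm = np.zeros_like(out)
    for r in range(n_frames):
        frame = np.fft.irfft(X[r], n=nfft)[:L]
        out[r * hop:r * hop + L] += frame
        norm[r * hop:r * hop + L] += win
    return out / np.maximum(norm, 1e-12)

# round trip: analyze, then resynthesize
win, hop, nfft = np.hamming(256), 64, 512
x = np.random.randn(2048)
frames = [x[r:r + 256] * win for r in range(0, len(x) - 256 + 1, hop)]
X = np.fft.rfft(np.array(frames), n=nfft, axis=1)
y = ola_synthesis(X, win, hop, nfft)
err = np.max(np.abs(y[256:1500] - x[256:1500]))   # interior samples
print(err < 1e-8)                                 # reconstruction is exact
```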
84

Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by

85

86

Furthermore, if the discrete STFT had been decimated in time by a factor L, it can be similarly shown that if the analysis window satisfies

87

88

DESIGN OF DIGITAL FILTER BANKS


Rabiner & Schafer, pages 282-297.

89

90

91

92

USING IIR FILTER

93

94

USING FIR FILTER

95

96

97

98

99

100

FILTER BANK ANALYSIS AND SYNTHESIS

101

102

103

FBS synthesis results in multiple copies of the input:

104

PHASE VOCODER
The Fourier series is computed over a sliding window of a single pitch-period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.

105

106

107

This can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency. The STFT of a continuous-time signal is,

108

109

where is an initial condition. The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with generally a time-varying amplitude and frequency modulation. An alternative expression is,

110

which is the time-domain counterpart to the frequency-domain phase derivative.

111

we can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.

112

113

114

115

116

117

SPEECH MODIFICATION

118

119

120

121

122

HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS


Use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. The cepstrum of a discrete-time signal is defined as,

123

124

That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
125

The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.

126

THE SHORT-TIME CEPSTRUM


The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform.

127

128

RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are all inside the unit circle. An example is the impulse response of an all-pole vocal tract model with system function
129

In this case, all the poles ck must be inside the unit circle for stability of the system.

130

SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (Rabiner & Schafer, page 63)

131

The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high quefrency components would correspond to the more rapid fluctuations of the log spectrum.

132

The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic voiced speech. This periodic structure in the log spectrum manifests itself as a cepstrum peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as the cepstrum. The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure; as a result, there is no strong peak indicating periodicity as in the voiced case.
133

These slowly varying log spectra clearly retain the general spectral shape with peaks corresponding to the formant resonance structure for the segment of speech under analysis.

134

APPLICATION TO PITCH DETECTION


The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec).

135

For positions 1 through 5, the window includes only unvoiced speech; for positions 6 and 7, the signal within the window is partly voiced and partly unvoiced; for positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure, while the spectra for voiced segments have periodic ripples due to the harmonic structure of the quasi-periodic voiced speech.
136

137

The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.
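A sketch of cepstral pitch detection as described above: locate the cepstral peak in the expected pitch-period quefrency range. The frame length, search range, and synthetic harmonic test signal are assumptions for illustration:

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Pitch estimate from the peak of the real cepstrum within the
    expected pitch-period (quefrency) range."""
    w = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(w, n=2 * len(frame)))
    ceps = np.fft.irfft(np.log(spec + 1e-12))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = np.argmax(ceps[qmin:qmax]) + qmin     # quefrency of the peak
    return fs / q

fs = 8000
t = np.arange(400) / fs
# crude voiced frame: harmonic-rich 125 Hz signal (period of 64 samples)
x = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 8))
print(cepstral_pitch(x, fs))                  # close to 125 Hz
```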
138

MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)


The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT X_n[k] for analysis time n. The DFT values are then grouped together into critical bands and weighted by triangular weighting functions.
139

The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n is defined for r = 1, 2, ..., R as

140

141

is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfccn[m], i.e.,
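A compact sketch of the filterbank-plus-DCT computation described above. The mel warping formula, the triangular filter construction, and the coefficient count used here are common conventions, not necessarily those of the figure:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=22, n_ceps=13, nfft=512):
    """MFCC sketch: |DFT| -> triangular mel filterbank -> log -> DCT-II."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=nfft))
    # filter edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, len(mag)))
    for r in range(n_filters):
        lo, ctr, hi = bins[r], bins[r + 1], bins[r + 2]
        fb[r, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[r, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    logmel = np.log(fb @ mag + 1e-12)
    # DCT-II of the log filter energies gives the cepstral coefficients
    m = np.arange(n_ceps)[:, None]
    r = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (2 * r + 1) / (2 * n_filters))
    return dct @ logmel

fs = 8000
t = np.arange(400) / fs
c = mfcc(np.sin(2 * np.pi * 440 * t), fs)
print(c.shape)   # one 13-coefficient vector per frame
```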

142

143

The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra differ, but they have in common peaks at the formant resonances. At higher frequencies, the reconstructed mel spectrum shows more smoothing due to the structure of the filter bank.

144

THE SPEECH SPECTROGRAM


The spectrogram is simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of the STFT magnitude, with the plot axes labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), where T is the sampling period of the discrete-time signal x[n] = x_a(nT).
145

In order to make the display smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than L. Such a function of two variables can be plotted on a two-dimensional surface as either a grayscale or a color-mapped image. The bars on the right calibrate the color map (in dB).

146

147

If the analysis window is short, the spectrogram is called a wide-band spectrogram, characterized by good time resolution and poor frequency resolution. When the window is long, the spectrogram is a narrow-band spectrogram, characterized by good frequency resolution and poor time resolution.
148

THE SPECTROGRAM

A classic analysis tool.


Consists of DFTs of overlapping, windowed frames.

Displays the distribution of energy in time and frequency.


10 log10 |X_m(f)|^2 is typically displayed.

149

THE SPECTROGRAM CONT.

150

151

Note the three broad peaks in the spectrum slice at time tr = 430 ms, and observe that similar slices would be obtained at other times around tr = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.
152

The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations since several periods are included in the window.
153

SHORT TIME ACF


/m/ /ow/ /s/

ACF

154

CEPSTRUM
SPEECH WAVE (S) = EXCITATION (E) . FILTER (H), where H is the vocal tract filter and E is the glottal excitation from the vocal cords (glottis).

http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
155

CEPSTRAL ANALYSIS
Signal s = convolution (*) of glottal excitation e and vocal tract filter h:
s(n) = e(n) * h(n), where n is the time index.

After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}: convolution (*) becomes multiplication (.), and n (time) becomes w (frequency):

S(w) = E(w) . H(w)

Taking the magnitude of the spectrum, |S(w)| = |E(w)| . |H(w)|, so
log10 |S(w)| = log10 |E(w)| + log10 |H(w)|
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
156

CEPSTRUM
C(n) = IDFT[ log10 |S(w)| ] = IDFT[ log10 |E(w)| + log10 |H(w)| ]

Pipeline: s(n) -> windowing -> x(n) -> DFT -> X(w) -> log|X(w)| -> IDFT -> c(n)

n = time index, w = frequency, IDFT = inverse discrete Fourier transform

In c(n), the contributions of the excitation e and the vocal tract filter h appear at different quefrency positions. Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis.
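The pipeline above, plus low-quefrency liftering to keep only the vocal tract contribution, can be sketched as follows (the lifter cutoff is an illustrative assumption):

```python
import numpy as np

def real_cepstrum(x):
    """c(n) = IDFT( log |DFT(x)| ), the real-cepstrum pipeline above."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

def smooth_envelope(frame, n_lifter=30):
    """Keep only the low-quefrency part (vocal tract component) and
    return to the log-spectral domain: a homomorphically smoothed
    log-magnitude spectrum."""
    c = real_cepstrum(frame)
    lifter = np.zeros_like(c)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0          # keep the symmetric half too
    return np.fft.rfft(c * lifter).real   # smoothed log |X(w)|

fs = 8000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(512)
env = smooth_envelope(frame)
print(env.shape)    # one smoothed log-spectrum value per DFT bin
```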

157

EXAMPLE OF CEPSTRUM
sampling frequency 22.05 kHz

158

SUB BAND CODING

159

The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
160

Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. The bands can be further subdivided from high to low by splitting the full band into two, then the resulting lower band into two, and so on.
161

This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect-reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
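A minimal two-band perfect-reconstruction example using the length-2 Haar filter pair, with one extra octave split of the low band. This is a toy sketch; practical QMF banks use longer filters:

```python
import numpy as np

def haar_analysis(x):
    """Two-band split with the length-2 Haar QMF pair, decimated by 2."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # lowpass (average) band
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # highpass (difference) band
    return lo, hi

def haar_synthesis(lo, hi):
    """Perfect-reconstruction synthesis for the Haar pair."""
    y = np.empty(2 * len(lo))
    y[0::2] = (lo + hi) / np.sqrt(2)
    y[1::2] = (lo - hi) / np.sqrt(2)
    return y

x = np.random.randn(1024)
lo, hi = haar_analysis(x)
lo2, hi2 = haar_analysis(lo)          # octave split: halve the low band again
y = haar_synthesis(haar_synthesis(lo2, hi2), hi)
print(np.max(np.abs(y - x)) < 1e-12)  # reconstruction is exact
```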

162

163

164

LINEAR PREDICTION (INTRODUCTION):


The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

y^(n) = sum_{j=0..q} b(j) x(n-j) - sum_{i=1..p} a(i) y(n-i)

The factors a(i) and b(j) are called predictor coefficients.

165

LINEAR PREDICTION (INTRODUCTION):


Many systems of interest to us are describable by a linear, constant-coefficient difference equation:

sum_{i=0..p} a(i) y(n-i) = sum_{j=0..q} b(j) x(n-j)

If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then

N(z) = sum_{j=0..q} b(j) z^(-j)   and   D(z) = sum_{i=0..p} a(i) z^(-i)

Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).

166

LINEAR PREDICTION (TYPES OF SYSTEM MODEL):


There are two important variants:

All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant.

All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity.

The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.

167

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


Given a zero-mean signal y(n), in the AR model:

y^(n) = - sum_{i=1..p} a(i) y(n-i)

The error is:

e(n) = y(n) - y^(n) = sum_{i=0..p} a(i) y(n-i),   with a(0) = 1

To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those that make the error orthogonal to the samples y(n-1), y(n-2), ..., y(n-p).

168

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


Thus we require that

< y(n-j) e(n) > = 0   for j = 1, 2, ..., p

or,

< y(n-j) sum_{i=0..p} a(i) y(n-i) > = 0

Interchanging the operations of averaging and summing, and representing < > by summing over n, we have

sum_{i=0..p} a(i) sum_n y(n-i) y(n-j) = 0,   j = 1, ..., p

The required predictor coefficients are found by solving these equations.

169

LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):


The orthogonality principle also states that the resulting minimum error is given by

E = < e^2(n) > = < y(n) e(n) >

or,

sum_{i=0..p} a(i) sum_n y(n-i) y(n) = E

Minimizing the error over all time gives:

sum_{i=0..p} a(i) r_|i-j| = 0,   j = 1, 2, ..., p

sum_{i=0..p} a(i) r_i = E

where

r_i = sum_{n=-inf..inf} y(n) y(n-i)
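The normal equations can be solved directly. This sketch uses the autocorrelation method with a plain linear solve rather than the Levinson-Durbin recursion, following the a(0) = 1 sign convention used above; the AR(2) test process is an illustrative assumption:

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """Autocorrelation method: with a(0) = 1, solve
    sum_i a(i) r_|i-j| = -r_j for a(1)..a(p), then compute the
    minimum error E = sum_i a(i) r_i."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])        # a(1)..a(p)
    E = r[0] + np.dot(a, r[1:])           # minimum prediction error
    return a, E

# AR(2) test process: y(n) = 0.9 y(n-1) - 0.5 y(n-2) + e(n)
rng = np.random.default_rng(0)
y = np.zeros(20000)
e = rng.standard_normal(20000)
for n in range(2, len(y)):
    y[n] = 0.9 * y[n - 1] - 0.5 * y[n - 2] + e[n]
a, E = lpc_autocorrelation(y, 2)
print(a)     # close to [-0.9, 0.5] in this sign convention
```

In practice the Toeplitz structure of R is exploited by the Levinson-Durbin recursion, which solves the same equations in O(p^2) operations.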

170

LINEAR PREDICTION (APPLICATIONS):


Autocorrelation matching: we have a signal y(n) with known autocorrelation r_yy(n). We model it with the AR system shown below, driven by e(n) through the prediction-error filter 1 - A(z):

H(z) = sigma / (1 - sum_{i=1..p} a_i z^(-i))

171

LINEAR PREDICTION (ORDER OF LINEAR PREDICTION):


The choice of predictor order depends on the analysis bandwidth. The rule of thumb is:

p = 2 BW / 1000 + c

where BW is the bandwidth in Hz and c is a small constant. For a normal vocal tract, there is an average of about one formant per kilohertz of bandwidth. One formant requires two complex-conjugate poles; hence every formant requires two predictor coefficients, or two coefficients per kilohertz of bandwidth.

172

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):


True model: a voiced/unvoiced (V/U) switch selects the excitation source. For voiced speech, an impulse generator driven by the pitch period feeds a glottal filter G(z); for unvoiced speech, an uncorrelated noise generator is used. The selected source, scaled by a gain, forms the volume velocity u(n), which passes through the vocal tract filter H(z) and the radiation filter R(z) to produce the speech signal s(n).
173

LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):


Using LP analysis: the same voiced/unvoiced switch selects between a pitch-driven impulse generator and a white-noise generator; the selected excitation, scaled by a gain estimate, drives a single all-pole (AR) filter H(z) that produces the speech signal s(n).
