Professor:
Abeer Alwan
Authors:
Ozgu Ozun
Philipp Steurer
Daniel Thell
Abstract
Wideband speech signals of two male and two female speakers were coded using an
improved version of Linear Predictive Coding (LPC). The sampling frequency was
16 kHz and the bit rate was 15450 bits per second, compared with the original
128000 bits per second. The tradeoffs between bit rate, end-to-end delay,
speech quality, and computational complexity are discussed.
Table of Contents
ABSTRACT
1 INTRODUCTION
2 BACKGROUND
3 PROJECT DESCRIPTION
3.1 METHODOLOGY
3.2 PRE-EMPHASIS FILTER
3.3 QUANTIZATION OF LPC-COEFFICIENTS
4 VOICE-EXCITED LPC VOCODER
4.1 DCT OF RESIDUAL SIGNAL
4.2 PERFORMANCE ANALYSIS
4.2.1 Bit Rates
4.2.2 Overall Delay of the System
4.2.3 Computational Complexity
4.2.4 Objective Performance Evaluation
5 DISCUSSION OF RESULTS
5.1 QUALITY
5.1.1 Subjective quality
5.1.2 Segmental signal to noise ratio
5.2 QUALITY-PERFORMANCE TRADEOFFS
5.2.1 Bit rate performance
5.2.2 Delay and computational complexity
6 CONCLUSIONS
7 REFERENCES
8 APPENDIX
8.1 MAIN FILE
8.2 LPC OUTPUT INFORMATION GENERATION
8.3 PLAIN LPC VOCODER
8.3.1 Main file
8.3.2 Plain LPC decoder
8.4 VOICE-EXCITED LPC VOCODER
8.4.1 Main File
8.4.2 Voice-excited LPC decoder
WAVE FILES
1 Introduction
Speech coding has been and still is a major issue in the area of digital speech
processing. Speech coding is the act of transforming a speech signal into a
more compact form, which can then be transmitted over a considerably smaller
bandwidth or stored in considerably less memory. The motivation behind this is
that unlimited bandwidth is not available. Therefore, there is a need to code and compress speech
signals. Speech compression is required in long-distance communication, high-quality
speech storage, and message encryption. For example, in digital cellular technology
many users need to share the same frequency bandwidth. Utilizing speech
compression makes it possible for more users to share the available system. Another
example where speech compression is needed is in digital voice storage. For a fixed
amount of available memory, compression makes it possible to store longer messages
[1].
Speech coding is a lossy type of coding: the output signal does not sound
exactly like the input, and a listener may be able to tell the two apart.
Coding of audio, however, is a different kind of problem than speech coding.
Audio coding tries to code the audio in a perceptually lossless way. This
means that even though the input and output signals are not mathematically
equivalent, the output sounds the same as the input to a human listener. This
type of coding is used in applications for audio storage, broadcasting, and
Internet streaming [2].
Several techniques of speech coding such as Linear Predictive Coding (LPC),
Waveform Coding and Subband Coding exist. The problem at hand is to use LPC to
code 2 male and 2 female speech sentences. The speech signals that need to be coded
are wideband signals with frequencies ranging from 0 to 8 kHz. The sampling
frequency should be at 16 kHz with a maximum end-to-end delay of 100 ms.
Different applications have different delay constraints: for example, in
network telephony only a delay of 1 ms is acceptable, whereas a delay of 500 ms
is permissible in video telephony [3]. Another constraint is that the overall
bit rate must not exceed 16 kbps. Finally, the system must require fewer than
20 million operations per second (MOPS).
The speech coder that will be developed is going to be analyzed using both subjective
and objective analysis. Subjective analysis will consist of listening to the encoded
speech signal and making judgments on its quality. The quality of the played back
speech is based solely on the opinion of the listener. The listener may rate
the speech as impossible to understand, intelligible, or natural-sounding.
Even though this is a valid measure of quality, an objective analysis will be
introduced to technically assess the speech quality and to minimize human bias. The
objective analysis will be performed by computing Segmental Signal to Noise Ratio
(SEGSNR) between the original and the coded speech signal. Furthermore, an
analysis on the study of effects of bit rate, complexity and end-to-end delay on the
speech quality at the output will be made. The report will be concluded with the
summary of results and some ideas for future work.
2 Background
There are several different methods to successfully accomplish speech coding. Some
main categories of speech coders are LPC Vocoders, Waveform and Subband coders.
The speech coding in this project will be accomplished by using a modified version of
LPC-10 technique. Linear Predictive Coding is one possible technique of analyzing
and synthesizing human speech. The exact details of the analysis and synthesis of this
technique that was used to solve our problem will be discussed in the methodology
section. Only an overview will be included in this section, along with the previously
mentioned other types of coding techniques.
The LPC method has been in use for a long time. Texas Instruments developed a
monolithic PMOS speech synthesizer integrated circuit as early as 1978, marking
the first time the human vocal tract had been electronically duplicated on a
single chip of silicon [5]. This early speech synthesizer used LPC to accomplish
successful synthesis. LPC makes coding at low bit rates possible. For LPC-10, the bit
rate is about 2.4 kbps. Even though this method results in an artificial sounding
speech, it is intelligible. This method has found extensive use in military applications,
where a high quality speech is not as important as a low bit rate to allow for heavy
encryption of secret data. However, since a high quality sounding speech is required
in the commercial market, engineers are faced with using other techniques that
normally use higher bit rates and result in higher quality output. In LPC-10,
the vocal tract is represented as a time-varying filter and the speech is windowed about every 20 ms. For
each frame, the gain and only 10 of the coefficients of a linear prediction filter are
coded for analysis and decoded for synthesis. In 1996, LPC-10 was replaced by
mixed-excitation linear prediction (MELP) coder to be the United States Federal
Standard for coding at 2.4 kbps. This MELP coder is an improvement to the LPC
method, with some additional features that have mixed excitation, aperiodic pulses,
adaptive spectral enhancement and pulse dispersion filtering as mentioned in [4].
Waveform coders on the other hand, are concerned with the production of a
reconstructed signal whose waveform is as close as possible to the original signal,
without any information about how the signal to be coded was generated. Therefore,
in theory, this type of coders should be input signal independent and work for both
speech and non-speech input signals [4]. Waveform coders produce a good quality of
speech signals above bit rates of 16 kbps. However, if the bit rate is decreased below
16 kbps, the quality deteriorates quickly. One form of waveform coding is Pulse Code
Modulation (PCM). This type of waveform coding involves sampling and quantizing
the input signal. PCM is a memoryless coding algorithm, as mentioned in [4]. A
variant of PCM is Differential Pulse Code Modulation (DPCM). This method quantizes
the difference between the original and the predicted signals and involves
prediction of the next sample from the previous samples. This is possible since there
is a correlation in speech samples because of the effects of the vocal tract and the
vibrations in the vocal cords [6]. It is possible to improve the predictor as well as the
quantizer in DPCM if they are made adaptive, in order to match the characteristics of
the speech that is to be coded. Such coders are called Adaptive Differential
Pulse Code Modulation (ADPCM) coders.
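As an illustration of the DPCM principle described above, here is a minimal Python sketch (the project's own code is in Matlab; the first-order predictor and the quantizer step size are illustrative choices only):

```python
def dpcm_encode(samples, step=0.1):
    """Quantize the difference between each sample and its prediction.

    The predictor here is the simplest possible one: the previously
    reconstructed sample. Real DPCM coders use higher-order (and, in
    ADPCM, adaptive) predictors and quantizers.
    """
    codes = []
    prediction = 0.0
    for x in samples:
        diff = x - prediction
        code = round(diff / step)              # uniform quantization of the error
        codes.append(code)
        prediction = prediction + code * step  # track the decoder's state
    return codes

def dpcm_decode(codes, step=0.1):
    """Rebuild the signal by accumulating the quantized differences."""
    out = []
    prediction = 0.0
    for code in codes:
        prediction = prediction + code * step
        out.append(prediction)
    return out
```

As long as the quantizer is not overloaded, the per-sample reconstruction error stays within half a quantizer step.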
Another class of speech coders is the Subband coders. This type of coding uses
filter bank analysis to split the input signal into several frequency bands,
and bits are allocated to each band by a certain criterion [4].
Presently, however, Subband coders are not widely used for speech coding. It is very
difficult to create high quality speech by using low bit rates with this technique. As
suggested in [4], Subband coding is mostly utilized in the medium to high bit rate
applications of speech coding.
3 Project Description
3.1 Methodology
In this section, an explanation of the LPC speech coding technique is given,
together with the specific modifications and additions made to improve the
algorithm. Before jumping into the detailed methodology of our solution, it is
helpful to give a brief overview of speech production. When the velum is
lowered, the nasal tract becomes acoustically coupled with the vocal tract;
the nasal sounds of speech are produced this way [7]. Speech signals consist of
sequences of sounds, each of which can be thought of as carrying a piece of
information. There are
voiced and unvoiced types of speech sounds. The fundamental difference between
these two types of speech sounds comes from the way they are produced. The
vibrations of the vocal cords produce voiced sounds. The rate at which the vocal cords
vibrate dictates the pitch of the sound. On the other hand, unvoiced sounds do not rely
on the vibration of the vocal cords. The unvoiced sounds are created by the
constriction of the vocal tract. The vocal cords remain open and the constrictions of
the vocal tract force air out to produce the unvoiced sounds [7].
LPC technique will be utilized in order to analyze and synthesize speech signals. This
method is used to successfully estimate basic speech parameters like pitch, formants
and spectra. A block diagram of an LPC vocoder can be seen in Fig.3-1. The principle
behind the use of LPC is to minimize the sum of the squared differences between the
original speech signal and the estimated speech signal over a finite duration. This
could be used to give a unique set of predictor coefficients [7]. These predictor
coefficients are normally estimated every frame, which is normally 20 ms long. The
predictor coefficients are represented by ak. Another important parameter is the gain
(G). The transfer function of the time-varying digital filter is given by:

    H(z) = G / (1 - Σ a_k z^(-k)),  k = 1, ..., p

The summation is computed starting at k = 1 up to p, which will be 10 for the LPC-10
algorithm, and 18 for the improved algorithm that is utilized. This means that only the
first 18 coefficients are transmitted to the LPC synthesizer. The two most commonly
used methods to compute the coefficients are, but not limited to, the covariance
method and the auto-correlation formulation. For our implementation, we will be
using the auto-correlation formulation. This method is superior to the
covariance method in that the roots of the polynomial in the denominator of the
above equation are always guaranteed to lie inside the unit circle, hence
guaranteeing the stability of the system H(z) [7]. The Levinson-Durbin recursion will be
utilized to compute the required parameters for the auto-correlation method. The
block diagram of simplified model for speech production can be seen in Fig. 3-2 as
provided in [1].
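The Levinson-Durbin recursion mentioned above solves the autocorrelation normal equations in O(p^2) operations. A Python sketch of the recursion (the project itself uses Matlab's routines; variable names are ours):

```python
def levinson_durbin(r, order):
    """Solve for LPC coefficients from autocorrelation values r[0..order].

    Returns (a, e): a[k] are the predictor coefficients a_1..a_p of the
    polynomial 1 - sum a_k z^{-k}, and e is the final prediction error
    power, from which the gain G is derived.
    """
    a = [0.0] * (order + 1)
    e = r[0]
    for i in range(1, order + 1):
        # reflection (parcor) coefficient for this stage
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        # symmetric in-place coefficient update
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)
    return a[1:], e
```

For a first-order autoregressive source the recursion recovers the generating coefficient exactly, which makes a convenient sanity check.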
The LPC analysis of each frame also involves the decision-making process of
concluding if a sound is voiced or unvoiced. If a sound is decided to be voiced, an
impulse train is used to represent it, with nonzero taps occurring every pitch period. A
pitch-detecting algorithm is employed to determine the correct pitch period / frequency.
We used the autocorrelation function to estimate the pitch period as proposed in [7].
However, if the frame is unvoiced, then white noise is used to represent it and a pitch
period of T=0 is transmitted. Therefore, either white noise or impulse train becomes
the excitation of the LPC synthesis filter. It is important to re-emphasize that the pitch,
gain and coefficient parameters will be varying with time from one frame to another.
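The autocorrelation-based pitch estimate described above can be sketched as follows; the search range and the voiced/unvoiced threshold are illustrative values, not the exact ones used in the project:

```python
def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0, voiced_threshold=0.3):
    """Estimate the pitch period (in samples) from the autocorrelation peak.

    Returns 0 for an unvoiced frame (no strong periodicity), matching the
    report's convention of transmitting T = 0 for unvoiced frames.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0
    lag_min = int(fs / fmax)             # shortest plausible period
    lag_max = min(int(fs / fmin), n - 1) # longest plausible period
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i - lag] for i in range(lag, n))
        if r > best_val:
            best_val, best_lag = r, lag
    # declare the frame voiced only if the peak is a large
    # fraction of the frame energy
    return best_lag if best_val / energy > voiced_threshold else 0
```

A pure tone at 100 Hz sampled at 8 kHz, for instance, should yield a period of about 80 samples.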
3.2 Pre-emphasis filter
A pre-emphasis filter is applied to the speech before the LPC analysis, and its
inverse is applied during the synthesis / reconstruction of the speech signal.
The filter is the one-zero filter

    H_p(z) = 1 - a z^(-1)

with a = 0.9378 in our implementation. The frequency response of this
pre-emphasis filter and of its inverse filter is shown in Fig. 3-3.
Fig. 3-3: Frequency response of the pre-emphasis filter and its inverse filter
The main goal of the pre-emphasis filter is to boost the higher frequencies in order to
flatten the spectrum. To give an idea of the improvement made by this filter the reader
is referred to the plotted frequency spectrum in Fig. 3-4 of the vowel /i/ in the word
nine. It can be seen how the spectrum is flattened. This improvement leads to a better
result for the calculation of the coefficients using LPC. There are higher peaks visible
for higher frequencies in the LPC-spectrum as can be seen in Fig. 3-5. Clearly, the
coefficients corresponding to higher frequencies can be better estimated.
Fig. 3-4: Frequency spectrum of the vowel /i/ in the word nine.
Fig. 3-5: Spectrum of the LPC model for the vowel /i/ in the word nine.
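A sketch of the pre-emphasis step and its inverse in Python (the project code is in Matlab); the coefficient 0.9378 is the default used by the appendix code:

```python
def preemphasize(x, a=0.9378):
    """One-zero pre-emphasis filter y[n] = x[n] - a*x[n-1].

    Boosts the high frequencies to flatten the speech spectrum
    before the LPC analysis.
    """
    return [x[n] - a * x[n - 1] if n > 0 else x[0] for n in range(len(x))]

def deemphasize(y, a=0.9378):
    """Inverse (one-pole) filter x[n] = y[n] + a*x[n-1], applied after synthesis."""
    out = []
    prev = 0.0
    for v in y:
        prev = v + a * prev
        out.append(prev)
    return out
```

Applying the inverse filter to the pre-emphasized signal recovers the original samples exactly, since the two filters cancel.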
4 Voice-excited LPC vocoder
The main idea behind voice excitation is to avoid the imprecise detection of
the pitch and the use of an impulse train while synthesizing the speech. One
should rather try to come up with a better estimate of the excitation signal.
Thus, the input speech signal in each frame is filtered with the estimated
transfer function of the LPC analyzer. This filtered signal is called the
residual. If this signal is transmitted to the receiver, a very good quality
can be achieved. The price, however, is a higher bit rate, even though there is
no longer a need to transfer the pitch frequency and the voiced / unvoiced
information. We therefore looked for a solution that keeps the bit rate below
16 kbits/sec, which is described in the following section.
4.1 DCT of residual signal
First of all, for a good reconstruction of the excitation only the low frequencies of the
residual signal are needed. To achieve a high compression rate we employed the
discrete cosine transform (DCT) of the residual signal. It is known that the
DCT concentrates most of the energy of the signal in the first few
coefficients. Thus, one way to compress the signal is to transfer only the
coefficients that contain most of the energy. Our tests and simulations showed
that these coefficients can even be quantized using only 4 bits. The receiver
simply performs an inverse DCT and uses the resulting signal to excite the LPC
synthesis filter.
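The compression scheme just described (DCT, keep the leading coefficients, quantize to 4 bits, inverse DCT) can be sketched as follows. The naive O(N^2) DCT below stands in for Matlab's dct/idct, and the quantizer range is our choice:

```python
import math

def dct(x):
    """Naive orthonormal DCT-II (stands in for Matlab's dct)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append((math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)) * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above (stands in for Matlab's idct)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] / math.sqrt(n) + sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n))
        out.append(s)
    return out

def compress_residual(resid, keep=40, bits=4):
    """Keep the first `keep` DCT coefficients, quantize them uniformly with
    roughly 2**bits levels (the quantizer range here is illustrative), zero
    the remaining coefficients, and reconstruct the residual."""
    c = dct(resid)
    peak = max(abs(v) for v in c[:keep]) or 1.0
    step = 2.0 * peak / (2 ** bits - 1)
    q = [round(v / step) * step for v in c[:keep]] + [0.0] * (len(c) - keep)
    return idct(q)
```

For a smooth, low-frequency residual most of the energy indeed lands in the leading coefficients, so the truncated, coarsely quantized reconstruction stays close to the original.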
4.2 Performance Analysis
4.2.1 Bit Rates
In the following, the necessary bit rates of the two solutions are computed. The bit rate for
a plain LPC vocoder is shown in Table 4-1 and the bit rate for a voice-excited LPC
vocoder with DCT is printed in Table 4-2. The following parameters were fixed for
the calculation:
Speech signal bandwidth B = 8 kHz
Sampling rate Fs = 16000 Hz (or samples/sec.)
Window length (frame): 20 ms
which results in 320 samples per frame by the given sampling rate Fs
Overlapping: 10 ms (overlapping is needed for perfect reconstruction)
hence: the actual window length is 30ms or consists of 480 samples
There are 50 frames per second
Number of predictor coefficients of the LPC model = 18
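These frame parameters, and the per-frame bit budget implied by the overall rate quoted in the abstract, can be checked with a few lines of arithmetic (Python here; the project code itself is in Matlab):

```python
fs = 16000                                   # sampling rate, Hz
frame_ms = 20                                # frame advance, ms
window_ms = frame_ms + 10                    # 10 ms overlap -> 30 ms window

samples_per_frame = fs * frame_ms // 1000    # samples advanced per frame
samples_per_window = fs * window_ms // 1000  # samples actually analyzed
frames_per_second = 1000 // frame_ms

# the abstract reports an overall rate of 15450 bps for the voice-excited
# vocoder, which corresponds to this many bits per frame:
bits_per_frame = 15450 // frames_per_second
```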
Table 4-1: Bit rate for the plain LPC vocoder (per-frame bit allocation for the
predictor coefficients, gain, pitch period, and voiced/unvoiced switch, their
per-frame total, and the resulting overall bit rate)

Table 4-2: Bit rate for the voice-excited LPC vocoder with DCT (per-frame bit
allocation for the predictor coefficients, gain, and DCT coefficients, their
per-frame total, and the resulting overall bit rate)
4.2.3 Computational Complexity
Again, the same parameters as stated in section 4.2.1 are fixed for the calculations.
All numbers show the multiplications or additions required per frame. This number
needs to be multiplied by the number of frames of a given speech signal.
Calculation of the LPC coefficients: the Levinson-Durbin recursion requires on
the order of p² floating-point operations. In our case p = 18, hence this step
requires about 324 operations per frame.
The pre-emphasis filter needs 480 additions and 480 multiplications, which is equal
to 960 operations.
The cross-correlation consists of 480 additions and 480 multiplications, which is
equal to 960 operations.
The reconstruction of the LPC needs about 480*18 additions and 480*18
multiplications, which is equal to 17280 operations.
The inverse filter requires again 480 additions and 480 multiplications, which is
equal to 960 operations
Hence, the total number of operations for the plain LPC vocoder is 20484
operations per frame. The sentences in section 4.2.4 typically contain about 150 frames at 50
frames/second. Thus, the computational complexity for the plain LPC vocoder is
about 1 MFLOPS (Mega-Flops).
For the voice-excited vocoder the calculation of the cross-correlation is not needed but
the discrete cosine transform and its inverse is needed instead. However, in the
Matlab-code used, the cross-correlation is still computed as it was the case for the
plain LPC vocoder. Therefore, the complexity is slightly increased.
The DCT (if a fast algorithm is applied) requires about 480 multiplications,
equaling 480 operations.
The inverse-DCT requires the same number of operations, namely 480 operations
The total number of FLOPS for the voice-excited LPC vocoder is therefore 21444
operations per frame. If we consider the same parameters as before, the computational
complexity is roughly 1.07 MFLOPS. The improved sound quality makes up for the
higher number of FLOPS.
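The totals above simply sum the per-frame operation counts; a quick Python check of the arithmetic:

```python
# per-frame operation counts as listed in the text
levinson = 18 ** 2              # ~324 ops, Levinson-Durbin with p = 18
preemphasis = 480 + 480         # additions + multiplications
cross_corr = 480 + 480
lpc_reconstruction = 480 * 18 * 2
inverse_filter = 480 + 480

plain_total = (levinson + preemphasis + cross_corr
               + lpc_reconstruction + inverse_filter)
voice_excited_total = plain_total + 480 + 480  # add DCT and inverse DCT

frames_per_second = 50
plain_mflops = plain_total * frames_per_second / 1e6
voice_excited_mflops = voice_excited_total * frames_per_second / 1e6
```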
4.2.4 Objective Performance Evaluation
We measured the segmental signal to noise ratio (SEGSNR) of the original speech file
compared to the coded and reconstructed speech file using the provided Matlab function "segsnr". The obtained results are as follows:
1) A Male speaker saying: "Kick the ball straight and follow through."
2) A Female speaker saying: "It's easy to tell the depth of a well."
3) A Male speaker saying: "A pot of tea helps to pass the evening."
4) A Female speaker saying: "Glue the sheet to the dark blue background."
Vocoder type          SNR 1       SNR 2       SNR 3       SNR 4
Plain LPC            -24.92 dB   -24.85 dB   -24.87 dB   -23.94 dB
Voice-excited LPC     0.5426 dB   0.7553 dB   0.5934 dB   0.2319 dB
Note that the calculation of the SNR requires the signals to be normalized before the
ratio can be calculated.
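The provided Matlab routine was used for the measurement; the sketch below shows, in Python, what a segmental SNR computation of this kind typically looks like. The frame length and peak normalization are our assumptions, not necessarily those of the provided routine:

```python
import math

def segsnr(original, coded, frame_len=320):
    """Average per-frame SNR in dB between two equal-length signals.

    Both signals are peak-normalized first, since the coded signal's scale
    generally differs from the original's. Frames with zero error or zero
    signal energy are skipped.
    """
    def normalize(x):
        peak = max(abs(v) for v in x) or 1.0
        return [v / peak for v in x]

    x, y = normalize(original), normalize(coded)
    snrs = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        sig = sum(v * v for v in x[start:start + frame_len])
        err = sum((a - b) ** 2 for a, b in
                  zip(x[start:start + frame_len], y[start:start + frame_len]))
        if sig > 0 and err > 0:
            snrs.append(10.0 * math.log10(sig / err))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

Averaging the per-frame dB values, instead of forming one global ratio, prevents a few high-energy frames from dominating the measure.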
5 Discussion of Results
The original and coded wave files (using both methods) were provided for four
sentences: "A pot of tea helps to pass the evening", "Kick the ball straight
and follow through", "Glue the sheet to the dark blue background", and "It's
easy to tell the depth of a well". The first two sentences are spoken by a male
speaker, the last two by a female speaker.
5.1 Quality
5.1.1 Subjective quality
The original speech sentences were compared against the plain LPC reconstructed
speech and the voice-excited LPC reconstructed speech. In both cases, the reconstructed
speech has a lower quality than the input speech sentences. Both of the reconstructed
signals sound mechanized and noisy with the output of plain LPC vocoder being
nearly unintelligible. The LPC reconstructed speech sounds guttural with a lower
pitch than the original sound, and seems whispered; the noisy impression is
very strong. The voice-excited LPC reconstructed file sounds more spoken and
less whispered. The guttural quality is also weaker, and the words are much easier to
understand. Overall the speech that was reconstructed using voice-excited LPC
sounded better, but still sounded muffled.
The waveforms in Fig 5-1 give the same idea. The voice-excited waveform looks
closer to the original sound than the plain LPC reconstructed one.
5.1.2 Segmental signal to noise ratio
Looking at the segmental SNR computed in section 4.2.4, it is obvious that the
plain LPC output is very noisy, having a negative SNR: the noise in this file
is even stronger than the actual signal. The voice-excited LPC encoded sound
sounds far better, and its SNR, although barely, is positive. However, even the
speech coded with the improved voice-excited LPC does not sound exactly like
the original signal.
It is noticeable that neither the plain LPC nor the voice-excited vocoder is
sensitive to the input sentence; the result is the same for a sentence with
many voiced sounds and for a sentence with many fricatives or other unvoiced
sounds. The advantage is that any spoken sentence can be transmitted with the
same overall results. The disadvantage is that there is no single weak aspect
of the vocoder on which improvement efforts could be focused: to improve the
quality, the overall system has to be improved. We cannot just improve the
production of unvoiced sounds to make the vocoder sound perfect.
5.2 Quality-performance tradeoffs
The LPC method to transmit speech sounds has some very good aspects, as well as
some drawbacks. The huge advantage of vocoders is a very low bit rate compared to
what is achieved for sound transmission. On the other hand, the speech quality
achieved is quite poor.
Fig. 5-1: Waveform of the sentence "A pot of tea helps to pass the evening": a) original speech signal, b) LPC
reconstructed speech signal, c) voice-excited LPC reconstructed speech signal
Fig. 5-2: Speech quality vs. Bit rate trade-offs for different speech coding techniques
5.2.1 Bit rate performance
In the plain LPC vocoder, one tries to estimate the pitch and then excite the
synthesizer with the estimated parameters. This results in a poor, almost
unintelligible sentence. The lack of accuracy in determining the pitch and the
purely mathematical excitation result in a huge degradation of the quality.
Shifting to the voice-excited LPC technique, the pitch and the binary choice of
excitation method are dropped (saving 7 bits per frame), but the prediction
errors have to be sent instead.
The increase in the bit rate thus results from the change in the excitation
method. Whereas an impulse train was previously used as the source to the
transfer function, the actual prediction error made when computing the ak's is
now encoded and sent. Theoretically, a close-to-perfect reconstruction could be
achieved if these errors were sent as floating-point numbers. However, this
would require a very large bandwidth: using 8 bits per error over the whole
frame would give 3840 bits per frame and contribute 192 kbps to the overall bit
rate. The errors are therefore compressed using an algorithm similar to
the one used in JPEG or MPEG compression. Taking the discrete cosine transform
(DCT) and keeping only the first 40 coefficients (each quantized over 4 bits)
allows a fairly good reconstruction of all 480 errors in the frame. Most of the
energy is located in the first few coefficients, so the last 440 are assumed to
be zero.
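The arithmetic above can be verified directly (a small Python check; 480 samples per frame and 50 frames per second as in section 4.2.1):

```python
samples_per_frame = 480
frames_per_second = 50

# sending every residual sample as an 8-bit value
uncompressed_bits = samples_per_frame * 8                  # bits per frame
uncompressed_rate = uncompressed_bits * frames_per_second  # bits per second

# keeping only the first 40 DCT coefficients at 4 bits each
dct_bits = 40 * 4                                          # bits per frame
dct_rate = dct_bits * frames_per_second                    # bits per second
```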
However, we note that increasing the bit rate of a vocoder is not the best
idea, since the improvement in quality is not linear, as can be seen in
Fig. 5-2 (speech quality vs. bit rate trade-offs for different speech coding
techniques, as provided in [6]). While an initial increase in the required
bandwidth improves the quality significantly, each further improvement of the
same size requires a tremendously larger increase in bandwidth.
5.2.2 Delay and computational complexity
The overall delay of the system is hard to measure and depends on the machine
used. However, it can be estimated by looking at the time between the launch of
the program and the creation of the output file.
Both methods employed are of the same computational complexity. The voice-excited
LPC method uses the original sound samples to produce the output sound while the
plain LPC technique creates the output sound from more basic characteristics. The
stronger link to the original signal in the voice-excited LPC method allows a more
accurate reproduction of the sounds without increasing the complexity and the delay,
since both concepts are closely linked.
As mentioned before, an idea to improve the overall quality could be, besides
an increase in the required bandwidth, an increase in the vocoder complexity in
order to transmit more, or more pertinent, information within the same
bandwidth. The overall delay of the system would therefore increase, but our
vocoder's complexity is low enough to allow an increase and still meet the
project requirements regarding the FLOPS.
6 Conclusions
The results achieved from the voice-excited LPC are intelligible. The plain LPC
results, on the other hand, are much poorer and barely intelligible. This first
implementation gives an idea of how a vocoder works, but the result is far
below what can be achieved using other techniques. Nonetheless, the
voice-excited LPC gives understandable results even though it is not optimized.
The tradeoffs between quality on
one side and bandwidth and complexity on the other side clearly appear here. If we
want a better quality, the complexity of the system should be increased or a larger
bandwidth has to be used.
Since the voice-excited LPC gives fairly good results within all the required
limitations of this project, we could try to improve it. A major improvement
could come from the compression of the errors: if we could send them to the
synthesizer in a lossless manner, the reconstruction would be perfect. One idea
is the use of Huffman codes for the DCT coefficients; many simulations would
have to be run to obtain the right code book.
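To illustrate the Huffman-code idea, the sketch below builds a prefix code from symbol frequencies; the symbol statistics in the test are invented, since the real code book would come from the suggested simulations:

```python
import heapq

def huffman_code(symbol_counts):
    """Build a prefix code from a dict of symbol frequencies.

    Returns a dict mapping each symbol to its bit string. Frequent symbols
    (e.g. quantizer levels near zero for the high-order DCT coefficients)
    get shorter codes, lowering the average bits per coefficient below the
    fixed 4 bits.
    """
    # each heap entry: (total count, tie-breaker, {symbol: partial code})
    heap = [(count, idx, {sym: ""}) for idx, (sym, count) in
            enumerate(sorted(symbol_counts.items()))]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        c1, _, code1 = heapq.heappop(heap)
        c2, _, code2 = heapq.heappop(heap)
        # merge the two cheapest subtrees, prefixing their codes with 0/1
        merged = {s: "0" + b for s, b in code1.items()}
        merged.update({s: "1" + b for s, b in code2.items()})
        heapq.heappush(heap, (c1 + c2, idx, merged))
        idx += 1
    return heap[0][2]
```

With a skewed symbol distribution the average codeword length drops well below the 4 bits of the fixed quantizer, which is exactly the saving the text proposes to reinvest in quality.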
This would reduce the bit rate, so that the freed bandwidth could be used to
improve quality. At least two possibilities could be considered. The first
would be an increase in the number of bits used to quantize the DCT
coefficients: the first coefficients would be more accurate, resulting in
closer reconstructed errors after the inverse DCT. The second would be to
increase the number of quantized coefficients, with a similar result: a more
accurate reconstructed error array. The open question is up to what point an
improvement in one direction is better than in the other, since both would
have to be improved to get a perfect file. Other kinds of coding techniques
could also be considered; all these methods will increase the complexity, but
the vocoder is simple enough to cope with it.
Another way of improving the vocoder would be to take the plain LPC vocoder and
implement the covariance method. However, since no fast algorithm exists for
inverting the covariance matrix, the computational complexity, as well as the
delay, can increase tremendously.
Finally, the excitation parameters, the weakest part of this implementation,
could be revisited. Not all unvoiced sounds can result from the same white
Gaussian noise input; an analysis of this and the creation of a code book for
unvoiced sounds could give better results. Again, statistical data and numerous
simulations would be needed.
7 References
[1] http://www.data-compression.com/speech.html
[2] http://www.bell-labs.com
[3] http://cslu.cse.ogi.edu/HLTsurvey/ch10node4.html
[4] M. H. Johnson and A. Alwan, "Speech Coding: Fundamentals and Applications",
to appear as a chapter in the Encyclopedia of Telecommunications, Wiley, December
2002.
[5] http://www.ti.com/corp/docs/company/history/pmos.shtml
[6] http://www-mobile.ecs.soton.ac.uk
[7] L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals",
Prentice- Hall, Englewood Cliffs, NJ, 1978.
% where s(n) is the original data. a(i) and e(n) are the outputs of the LPC
% analysis with a(i) representing the LPC model. The e(n) term represents
% either the speech source's excitation, or the residual: the details of the
% signal that are not captured by the LPC coefficients. The G factor is a
% gain term.
%
% LPC analysis is performed on a monaural sound vector (data) which has been
% sampled at a sampling rate of "sr". The following optional parameters modify
% the behaviour of this algorithm.
% L - The order of the analysis. There are L+1 LPC coefficients in the output
% array aCoeff for each frame of data. L defaults to 13.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
%
% The output variables from this function are
% aCoeff - The LPC analysis results, a(i). One column of L numbers for each
% frame of data
% resid - The LPC residual, e(n). One column of sr*fs samples representing
% the excitation or residual of the LPC filter.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% G - The LPC gain for each frame.
% parcor - The parcor coefficients. The parcor coefficients give the ratio
% between adjacent sections in a tubular model of the speech
% articulators. There are L parcor coefficients for each frame of
% speech.
% stream - The LPC analysis' residual or excitation signal as one long vector.
%          Overlapping frames of the resid output combined into a new one-
%          dimensional signal and post-filtered.
%
% The synlpc routine inverts this transform and returns the original speech
% signal.
%
% This code was graciously provided by:
% Delores Etter (University of Colorado, Boulder) and
% Professor Geoffrey Orsak (Southern Methodist University)
% It was first published in
% Orsak, G.C. et al. "Collaborative SP education using the Internet and
% MATLAB" IEEE SIGNAL PROCESSING MAGAZINE Nov. 1995. vol.12, no.6, pp.
% 23-32.
% Modified and debugging plots added by Kate Nguyen and Malcolm Slaney
% A more complete set of routines for LPC analysis can be found at
% http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
% (c) 1998 Interval Research Corporation
if (nargin<3), L = 13; end
if (nargin<4), fr = 20; end
% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
%
% Returns:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% arguments check
% ---------------
if ( nargin ~= 1)
error('argument check failed');
end;
%
% system constants
% ----------------
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
%
% main
% ----
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% decode/synthesize speech using LPC and impulse-trains as excitation
outspeech = synlpc1(aCoeff, pitch, Fs, G);
% system constants
% ----------------
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
%
% main
% ----
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);
% only use the first 50 DCT coefficients; this can be done
% because most of the energy of the signal is conserved in these coeffs
resid = [ resid(1:50,:); zeros(430,b) ];
% quantize the data
resid = uencode(resid,4);
resid = udecode(resid,4);
% perform an inverse DCT
resid = idct(resid);
% add some noise to the signal to make it sound better
noise = [ zeros(50,b); 0.01*randn(430,b) ];
resid = resid + noise;
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = synlpc2(aCoeff, resid, Fs, G);
% where s(n) is the original data. a(i) and e(n) are the outputs of the LPC
% analysis with a(i) representing the LPC model. The e(n) term represents
% either the speech source's excitation, or the residual: the details of the
% signal that are not captured by the LPC coefficients. The G factor is a
% gain term.
%
% LPC synthesis produces a monaural sound vector (synWave) which is
% sampled at a sampling rate of "sr". The following parameters are mandatory
% aCoeff - The LPC analysis results, a(i). One column of L+1 numbers for each
%          frame of data. The number of rows of aCoeff determines L.
% source - The LPC residual, e(n). One column of sr*fs samples representing
%          the excitation or residual of the LPC filter.
% G - The LPC gain for each frame.
%
% The following parameters are optional and default to the indicated values.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
%      fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
%      data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
%          serves to preemphasize the speech signal and compensate for the 6dB
%          per octave rolloff in the radiation function. Defaults to .9378.
%
% Minor modifications: Philipp Steurer
%
% This code was graciously provided by:
% Delores Etter (University of Colorado, Boulder) and
% Professor Geoffrey Orsak (Southern Methodist University)
% It was first published in
% Orsak, G.C. et al. "Collaborative SP education using the Internet and
% MATLAB" IEEE SIGNAL PROCESSING MAGAZINE Nov. 1995. vol.12, no.6, pp.
% 23-32.
% Modified and debugging plots added by Kate Nguyen and Malcolm Slaney